Hello there from one of the authors of Implementation (4).
I know I only briefly went over it in the video (and didn't do your work justice), but it is a very clever and efficient solution!
A few other FPGA developers and I often refer back to it whenever we need to implement a leading-zero detector (for FP, or even odd cases like binary allocators).
Thanks :)
Well, optimizing down to the last gate is what I like to do, and this was the result of that.
For me it is no surprise that this was the best in your comparison, even though the comparison was done on an FPGA.
The VHDL source code that you show is apparently an unrolled version of code that could be written as a loop :-)
BTW: finding the first zero is something that is VERY common, for example when incrementing numbers or propagating a carry, and many FPGAs have a "fast carry" path, but I don't know how to infer it in VHDL.
It could have been left in a rolled form, but I think that might have been more confusing to people who don't know VHDL all that well.
As for finding the leading zero being common, I am aware of that. However, the problem is that most people do it the slow and inefficient way and leave it at that. You could see from the results that the naïve way it's commonly implemented performs quite poorly compared to an optimized solution, hence the title / goal of the video.
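For reference, the naïve form usually looks something like the following behavioural scan (a rough sketch, not the exact code compared in the video); it synthesizes to one long priority chain, which is why it falls behind at larger widths:

```vhdl
-- Rough sketch of the naive leading-zero counter: scan from the MSB down.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lzc_naive is
    generic (WIDTH : integer := 32);
    port (
        din   : in  std_logic_vector(WIDTH-1 downto 0);
        count : out unsigned(5 downto 0)   -- enough for WIDTH up to 63
    );
end entity;

architecture rtl of lzc_naive is
begin
    process(din)
        variable c     : integer range 0 to WIDTH;
        variable found : boolean;
    begin
        c     := 0;
        found := false;
        for i in WIDTH-1 downto 0 loop     -- scan from the MSB
            if not found then
                if din(i) = '1' then
                    found := true;
                else
                    c := c + 1;
                end if;
            end if;
        end loop;
        count <= to_unsigned(c, count'length);
    end process;
end architecture;
```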
The "fast carry" path is inferred by the tool, not by the HDL. Often the tool goes crazy with it actually, so I usually turn off that option (it prioritizes the fast carry path over logic, and ends up using 2x as many routing resources resulting in a 2x longer propagation delay just from net routing).
I need to make a leading-zero counter for FP addition with a 52-bit mantissa. How can I go about this? It can be pipelined into two or more stages if needed. 120 MHz clock.
Use one of the methods I described in the video? Other than that, research it - that's what I did.
The later methods can be pipelined. The clock speed depends entirely on your target.
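As a concrete starting point, here is one way such a two-stage version could be structured for a double-precision significand. This is a sketch under my own assumptions (53 bits including the hidden 1, nibble-wise counts in stage one, selection and combination in stage two); it is not the exact code from the video, and the all-zero input is assumed to be handled separately.

```vhdl
-- Hedged sketch of a two-stage leading-zero counter for a 53-bit
-- significand (52-bit fraction plus hidden 1).  Stage 1 computes a
-- zero flag and a 2-bit local count per 4-bit nibble; stage 2 selects
-- the first non-zero nibble from the MSB and combines the indices.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lzc53_pipe2 is
    port (
        clk   : in  std_logic;
        din   : in  std_logic_vector(52 downto 0);  -- MSB = hidden-1 position
        count : out unsigned(5 downto 0)
    );
end entity;

architecture rtl of lzc53_pipe2 is
    constant NIBBLES : integer := 14;               -- 56 bits after padding
    signal padded   : std_logic_vector(NIBBLES*4-1 downto 0);
    type nib_cnt_t is array (0 to NIBBLES-1) of unsigned(1 downto 0);
    signal nib_cnt  : nib_cnt_t;
    signal nib_zero : std_logic_vector(NIBBLES-1 downto 0);
begin
    padded <= din & "000";  -- pad on the right so the count is unchanged

    -- Stage 1: per-nibble leading-zero count and all-zero flag
    stage1 : process(clk)
        variable nib : std_logic_vector(3 downto 0);
    begin
        if rising_edge(clk) then
            for i in 0 to NIBBLES-1 loop
                nib := padded(4*i+3 downto 4*i);
                if nib = "0000" then
                    nib_zero(i) <= '1';
                    nib_cnt(i)  <= "00";            -- don't care, skipped in stage 2
                else
                    nib_zero(i) <= '0';
                    if    nib(3) = '1' then nib_cnt(i) <= "00";
                    elsif nib(2) = '1' then nib_cnt(i) <= "01";
                    elsif nib(1) = '1' then nib_cnt(i) <= "10";
                    else                    nib_cnt(i) <= "11";
                    end if;
                end if;
            end loop;
        end if;
    end process;

    -- Stage 2: first non-zero nibble from the MSB, combine the counts
    stage2 : process(clk)
        variable found : boolean;
        variable total : unsigned(5 downto 0);
    begin
        if rising_edge(clk) then
            found := false;
            total := to_unsigned(NIBBLES*4, 6);     -- all-zero input, handled elsewhere
            for i in NIBBLES-1 downto 0 loop
                if (not found) and nib_zero(i) = '0' then
                    total := to_unsigned(4*(NIBBLES-1 - i), 6) + nib_cnt(i);
                    found := true;
                end if;
            end loop;
            count <= total;
        end if;
    end process;
end architecture;
```

The result appears two clocks after the input, so the rest of the adder pipeline has to be aligned accordingly.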
It's going to take a while for my brain to absorb this. lol
I get the part about needing to shift the values before adding them, but I didn't quite get the leading zeroes thing on the first viewing.
I'd love to be able to write a full FPU core from scratch, as I could then think about attempting my own R4300 core.
I'm relatively familiar with the aoR3000 core now, and how the SysAD bus on the R4300 works, so I've wondered recently about how feasible it would be to turn the aoR3000 into a full R4300.
Thanks for the vid btw.
Information like this on the 90s-era consoles is quite rare on YouTube, and a massive help for people like me who are trying to understand the whole machine.
A very welcome improvement in fMax as well. I never really thought about synthesizing individual blocks like that so that the most efficient designs could be found.
I do tend to rely on the optimizations in Quartus too much tbh.
(re-watching)...
OK, I get it now.
The FP values are usually normalized (with a leading 1) before doing any maths on them, and the result needs to be normalised as well (by detecting the extra leading 1s to determine this shift amount). ;)
EDIT: "By detecting the position on the leading 1s, by counting the leading zeros", I meant.
I'll have to start reading up again, because I've forgotten how the base-10 version of an FP number relates to the base-2 version.
(I mean, obviously I can normally convert between decimal and binary, but it's not quite so intuitive with floating point yet.)
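A tiny worked example of that renormalization step (my own numbers, not from the video): subtracting two nearby values cancels the leading bits, and the leading-zero count of the raw result gives both the left-shift amount and the amount to subtract from the exponent.

```latex
% Hypothetical renormalization example (decimal check: 52 - 50 = 2)
\begin{align*}
1.1010_2 \times 2^{5} \;-\; 1.1001_2 \times 2^{5} &= 0.0001_2 \times 2^{5}\\
\text{leading zeros of } 00001\ldots &= 4\\
0.0001_2 \times 2^{5} &= 1.0_2 \times 2^{5-4} = 1.0_2 \times 2^{1} \;(= 2)
\end{align*}
```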
It took me a while to understand as well, but trying a few operations by hand really helped. There are plenty of floating-point calculators online; however, the easiest way to convert a decimal number is to take log_2 of it, then round the result down toward negative infinity (so -37.6 becomes -38) to get the exponent. The difference between the actual value and the rounded exponent (0.4 in the above case) then gives you the mantissa as a power of two (so 2^0.4). And of course you have to add a bias to the exponent, so 127 for single, and 1023 for double.
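A quick worked example of that recipe (my own numbers): converting decimal 0.1 to single precision.

```latex
% Hypothetical example: decimal 0.1 -> IEEE 754 single precision
\begin{align*}
\log_2(0.1) &\approx -3.32\\
e &= \lfloor -3.32 \rfloor = -4\\
m &= 2^{-3.32-(-4)} = 2^{0.68} \approx 1.6 \quad (\text{check: } 1.6 \times 2^{-4} = 0.1)\\
\text{stored exponent} &= e + 127 = 123
\end{align*}
```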
I don't think that you can make an R4300 from an R3000 core; however, you could probably make an R4000 with it, since the R4000 had a separate FPU. You would need to extend the datapath to 64 bits though, and add support for the additional instructions (I believe the R3000 was MIPS I and the R4000/R4300 are MIPS III). And code that was compiled to run on the R4300 should run on the R4000.
As for testing / synthesizing components by themselves, I always do that unless the component has more signals than I/O pins. For one, it can help you debug the components, to see what the RTL viewer is coming up with. It can also give you an idea of which component is going to slow your design down. And, both Altera and Xilinx offer the ability to partially compile designs, so you can synthesize / fit components one at a time, to ensure that components that need to be fast end up being fast.
Be very careful when you let Quartus optimize things, and always look at the RTL view and the technology mapping to make sure the results make sense. In the simple example that I showed, Quartus produced a technology mapping that was similar to the RTL view. Additionally, I had Quartus optimize a register file away into a block RAM with added bypass logic... Clever inference, but that sort of defeats the purpose of being able to quickly access registers.
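As a rough illustration (my own hypothetical code, not the actual design), this is the kind of register-file description where that sort of inference can happen:

```vhdl
-- Hypothetical register file with a synchronous write and an asynchronous
-- read.  Depending on the tool and its settings this may end up as
-- flip-flops, LUT RAM, or a block RAM with extra bypass logic, which is
-- why the technology mapping is worth inspecting.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity regfile32 is
    port (
        clk   : in  std_logic;
        we    : in  std_logic;
        waddr : in  unsigned(4 downto 0);
        wdata : in  std_logic_vector(31 downto 0);
        raddr : in  unsigned(4 downto 0);
        rdata : out std_logic_vector(31 downto 0)
    );
end entity;

architecture rtl of regfile32 is
    type ram_t is array (0 to 31) of std_logic_vector(31 downto 0);
    signal regs : ram_t := (others => (others => '0'));
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                regs(to_integer(waddr)) <= wdata;
            end if;
        end if;
    end process;

    rdata <= regs(to_integer(raddr));  -- combinational read
end architecture;
```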
RTL Engineering
Wow, thanks for the detailed answer. ;)
I'm glad I'm not alone in being confused by floating point. lol
Yeah, building on top of the aoR3000 core probably isn't ideal.
Might be best to start from scratch, but I've never implemented a full CPU core before, especially anything pipelined.
It's something I've wanted to attempt for years, but I'm still struggling with the PS1 GPU commands atm, and only just starting on the GTE stuff.
Yep, I do tend to look at the RTL viewer more often now; I just wish it was a lot easier to read, because it still tends to display the signals in a weird vertical fashion, so it's quite hard to navigate larger designs.
RTL Engineering
Do the N64 games really use the 64-bit instructions at all btw?
I can even understand them using double floats, but I don't think I've seen many games use the true 64-bit opcodes on the R4300?
(I realise it uses a 32-bit SysAD bus, but that's a separate issue, of course.)