Always love seeing Digilent come up. They're like my work neighbors - their campus is a couple of blocks down the street from where I work.
Good timing again.
I almost finished a new PCB design today - an 8MB SRAM board for that PCIe FPGA board I bought.
It will also have a socket for an N64 Mask ROM on it. ;)
I want to attempt to write a CPU core from scratch soon.
It'll have to be one that hasn't been done yet (or at least not released publicly). Maybe the TMS34010 for Mortal Kombat, but I'd rather work on something like the N64 RSP.
Writing a CPU core from scratch is a good exercise, but I'd highly recommend against making the RSP your first CPU. It's basically a full 32-bit MIPS III minus the system control instructions, multiplication, division, and the FPU. It's more complicated than the R3000 (remember that the RSP is a 5-stage, dual-issue superscalar processor - not a true superscalar, but it has the complexity of one). If the TMS34010 hasn't been done, then that could be a good option. I don't have a list of available CPU cores, so I can't point you to one that hasn't been implemented publicly yet.
@@RTLEngineering
True. The RSP probably isn't the best one to go for as a first project. lol
I got quite far before with SuperFX. It was running instructions, and I had started working on the muxes for the data/address paths. Then somebody released a full GSU core anyway, so I shelved it.
I'm tempted by the TMS, but the thing is that it would only run a handful of arcade games. I mean, it's great if we can keep adding a new arcade core every few weeks, and MK would be a big one, but I'm thinking it might be best to work on cores that can play many games on the same platform.
What I should really do is finish the CPS1 core, as that's at least running code just fine, albeit without the graphics being displayed yet. Audio is basically done, too.
Did you get that Terasic one with like 310k LU? Thumbs up on CPS :) warham
Hey, could you please make some more videos on FPU design? I liked them a lot.
Thanks! I plan to at some point, but these videos take quite a bit of time to make (about 1-2 weeks of work for me), and I have found that I would rather use that time actually doing architecture design and implementation. Once I figure out a way to make them quicker, then I will upload more.
I am willing to help - if I can do so, feel free to contact me :D
Thanks for the offer, however, it's not necessary.
I have been considering using one of the AI-based speech synthesis programs to help speed it up (since it would allow me to generate clean audio by typing in the script) - that would probably save 3-4 days of work for each video. But someone brought up the concern that it would remove the human component, making the videos less enjoyable.
That is definitely true
But 3-4 days of audio recording is quite intense. Isn't there a way to reduce that?
To clarify, the 3-4 days includes cleaning up the audio and editing it. Recording still usually takes almost a full day for a 20-minute video though. That's why I was thinking about computer-generated audio, which wouldn't require me to do any of those steps, though I would still have to proof it to ensure correct pronunciations (that would cut the 3-4 days down to a couple of hours, at the cost of removing the human element from the audio).
Hi. Noobie question: Is it possible to use Artix-7 DSP slices to help with this?
The DSP slice can only do fixed shifts of 17 bits. You could potentially use the multiplier to accomplish the same goal though, since a shift by 4 would be a multiplication by 16. However, to implement that, you would either need a shifter to convert the shift amount into a multiplication value, or a LUT ROM (for a 64-bit number, you would need the ROM to be 64 bits x 64 entries).
Furthermore, the DSP slices are only capable of doing 25x18-bit multiplication. So if you wanted to do a 32-bit shift this way, you would need 4 DSP slices, and I believe you would need 12 for a 64x64 multiply.
Basically, it would save a few LUTs at the cost of the DSPs - a tradeoff that probably wouldn't be a good one, especially on smaller FPGAs, which may only have a handful of DSP slices.
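For anyone curious, here's a minimal Python sketch of that multiply-as-shift trick (my own illustration of the idea, not vendor code or a DSP primitive): the 64-entry table plays the role of the LUT ROM that converts a shift amount into a multiplication value.
```python
# Multiplying by 2**shift produces the same result as shifting left by 'shift'
# (masked to the register width).

POW2_LUT = [1 << s for s in range(64)]   # the "64 bits x 64 entries" ROM

def shift_via_multiply(value, shift, width=64):
    mask = (1 << width) - 1
    return (value * POW2_LUT[shift]) & mask   # multiply by 2**shift == shift left by 'shift'

assert shift_via_multiply(0x3, 4) == 0x3 << 4
assert shift_via_multiply(0xFFFF_FFFF_FFFF_FFFF, 1) == 0xFFFF_FFFF_FFFF_FFFE
```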
One more independent question: What do you think about the RISC-V ISA?
Is there a specific point about it you want my opinion on?
Overall, I think it's a very clean ISA (the base ISA), with a lot of similarities to MIPS. However, they made many design decisions that I don't particularly like, even more so with the extensions. A few examples:

1) The multiply-divide extension produces its results in the GPRs, and the instructions come in complementary pairs. So you have a low-multiply and a high-multiply instruction (MUL and MULH), and similarly a divide and a remainder instruction (DIV and REM). The ISA suggests using them in a specific order if you want both results, but that makes them much harder to implement (there's a small sketch of this below). Instead, I wish they had done what MIPS did with the HiLo registers, which many systems took advantage of to do multiplication and division asynchronously from the pipeline - some very infamously (you can't do that with RISCV).

2) The FP extension is not coupled to the vector extension, which requires FP. So if you want the vector extension, you need to have a primary FPU and a vector FPU, even if you only ever use one at a time - otherwise you need to shuffle around the micro-architecture control to allow unit reuse. And if you want vector extensions without FP support (like for ML), then you're out of luck and need to roll your own extension.

3) The vector extensions are a complete mess. They're meant for vector processing and not SIMD, whereas SIMD is more accessible to modern software (since that's what ARM and x86 use).
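To make point 1 concrete, here is a tiny Python model (my own sketch, not real RISCV or MIPS semantics): the RISCV style needs two separate instructions, which the micro-architecture has to recognize as a pair to avoid doing the multiplication twice, while a HiLo-style unit produces both halves from a single operation.
```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def mul_lo(a, b):
    return (a * b) & MASK            # RISCV MUL: low half of the product only

def mul_hi(a, b):
    return (a * b) >> WIDTH          # RISCV MULH/MULHU style: high half only (a second instruction)

def mult_hilo(a, b):
    p = a * b                        # MIPS-style MULT: one operation fills both HI and LO
    return p >> WIDTH, p & MASK

a, b = 0xDEADBEEF, 0x12345678
assert (mul_hi(a, b), mul_lo(a, b)) == mult_hilo(a, b)
```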
I apologize for that little rant - part of my research (for my day job) revolves around RISCV micro-architecture design, so I have a lot of opinions there. As long as you don't stray from the base integer ISA, it's straightforward though.
Ok, that's exactly what I was hoping for, thanks. The points you brought up are very interesting. I also wish for a more differentiated V extension - like only implementing float vectors, or whatever your needs are. But a packed SIMD extension is still in the making. However, I don't see much difference between a vector extension and a further SIMD extension - maybe you can clarify the difference? And with the async mul/div, do you mean the result only has to actually be computed when the move from low/high is called?
On the surface there isn't much of a difference between SIMD and vector, since SIMD is a subset of vector. However, the vector extension requires full vector processing, which needs a lot more hardware and features. Additionally, vector requires some setup (element size, stride, etc.), whereas SIMD is fixed (there's a rough sketch of the difference below). There are a lot of benefits to having a vector extension like RISCV's, but the flexibility leads to other problems (compiler targets and patterns, overhead, etc.). It would have been better to have multiple tiers of extensions which were compatible - SIMD integer, SIMD FP, vector integer (a superset of SIMD integer), and vector FP (a superset of SIMD FP). That way you could easily target SIMD operations like with MMX/SSE/NEON, as well as the fancier vector operations like you see with Cray.

As I understand it, the way the extensions work (which is enforced at the compiler level) is that it's all or nothing. If you don't implement an extension completely, then the compiler won't know what is not implemented and may generate invalid instructions for a given target.
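Here's that loop-structure difference sketched in plain Python (nothing RISCV-specific - the lane count and hardware vector length are made-up numbers): SIMD bakes the width into the code, while the vector loop asks the hardware how many elements it can handle each pass, similar in spirit to RISCV's vsetvli setup.
```python
SIMD_WIDTH = 4      # fixed by the ISA, e.g. 128-bit registers with 32-bit lanes
HW_MAX_VL = 7       # hypothetical hardware vector length

def simd_add(a, b):
    out = [0] * len(a)
    for i in range(0, len(a), SIMD_WIDTH):      # lane count is baked into the code
        for lane in range(SIMD_WIDTH):
            if i + lane < len(a):               # tail handling is the programmer's problem
                out[i + lane] = a[i + lane] + b[i + lane]
    return out

def vector_add(a, b):
    out = [0] * len(a)
    i = 0
    while i < len(a):
        vl = min(HW_MAX_VL, len(a) - i)         # setup step: hardware picks the chunk size
        for lane in range(vl):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += vl                                  # loop is independent of the hardware width
    return out

data = list(range(10))
assert simd_add(data, data) == vector_add(data, data) == [2 * x for x in data]
```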
As for async, that's correct, although it's usually computed when the instruction is issued. Multiplication and division are not fast operations (in the worst case, a 32-bit division could take 35 cycles). So it makes more sense to let the CPU move ahead (and have the programmer be aware of that behavior because it's part of the ISA), which allows for more efficient instruction-level parallelism (instead of relying on the micro-architecture to extract that parallelism).

What MIPS did was lock the HiLo register with a hazard flag when a mul/div was executed; if that register was accessed while locked, the pipeline would interlock (stall). The R5900 in the PS2 actually relied heavily on that, to allow the mul/div instructions to effectively complete out of order (that was probably a nightmare to debug, because you couldn't put a breakpoint on one of those instructions, but it greatly simplified the micro-architecture while minimizing stalls).

Furthermore, the HiLo register was 2x the data width, so for the 64-bit R5900 it was a 128-bit register (Hi and Lo), which would give you the 128-bit multiplication result, or the remainder and quotient of a division. That means you don't need to worry about instruction ordering or dependency detection in the instruction stream (there was no Hi-multiply and Lo-multiply that you have to call in a specific order and have the micro-arch detect, to ensure the multiplication isn't done twice, like with RISCV). Also, MIPS could use the HiLo register as an extra GPR if need be (I doubt any compiler did that, but some assembly might have - you can always use more registers, especially if they're effectively free).
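And a toy model of that HiLo hazard interlock (my own sketch - the 12-cycle latency is made up, not actual R3000/R5900 timing): MULT locks HI/LO while the result is being produced, and an MFLO that arrives too early simply stalls for the remaining cycles, while independent instructions in between run for free.
```python
class HiLoUnit:
    """Toy model of a MIPS-style HiLo unit with a hazard interlock."""

    def __init__(self, mult_latency=12):          # made-up latency for illustration
        self.latency = mult_latency
        self.busy_until = 0                        # cycle at which HI/LO become valid
        self.hi = 0
        self.lo = 0

    def mult(self, cycle, a, b):
        product = a * b
        self.lo = product & 0xFFFFFFFF             # LO gets the low 32 bits
        self.hi = (product >> 32) & 0xFFFFFFFF     # HI gets the high 32 bits
        self.busy_until = cycle + self.latency     # set the hazard flag (lock HI/LO)

    def mflo(self, cycle):
        stall = max(0, self.busy_until - cycle)    # interlock: stall if HI/LO still locked
        return self.lo, stall

unit = HiLoUnit()
unit.mult(cycle=0, a=123456, b=654321)
value, stalls = unit.mflo(cycle=3)                 # reading early costs 9 stall cycles here
print(hex(value), stalls)
```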
Thanks for clarifying. I'm totally on your side. I actually thought about a more subdivided vector set while reading through the extension, but AFAIK it is not ratified yet, so let's hope they add that.
For the mul/div problem, I agree - especially since RISC-V is built in a way to simplify the micro-architecture enormously, with the strangely located immediates and a lot more. It's really strange to me why they did not use the same pattern as MIPS. But I think their reasoning was to have all registers in one place (given that there is also no status register).
I doubt the vector extension will be broken up. Although, if Intel gets involved, then maybe it will be (or they will propose their own which is more similar to MMX/SSE - one can only hope).
The imm fields are indeed strangely located in the instructions, unlike with MIPS. And that does make the decoder a little more complicated, though not substantially so.
Having all of the registers in one place isn't really the case here. The FP registers are separate from the GPRs. You could argue that this is because FP is an extension, but so is mul/div (i.e. it would be consistent with the FP extension to have added special registers for the mul/div instructions). A status register isn't really relevant for RISC though, since those flags aren't really necessary. The main purposes of a status register were 1) to extend 8-bit and 16-bit operands (add with carry and shift/rotate), which isn't really an issue with 32-bit registers, and 2) to allow for status-based branching, which is not relevant here either, since most RISC ISAs do in-branch comparisons of register values (1 instruction instead of 2).
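A trivial Python contrast of the two branching styles (my own sketch, not any particular ISA): the flag-based model needs a compare that latches status bits plus a separate branch that reads them, while the RISC style folds the comparison into the branch itself.
```python
def flag_style(a, b):
    flags = {"zero": (a - b) == 0}   # CMP-style step: compute and latch status flags
    if flags["zero"]:                # branch-on-flag step: a second instruction
        return "taken"
    return "not taken"

def risc_style(a, b):
    if a == b:                       # compare-and-branch in one instruction (e.g. beq rs1, rs2)
        return "taken"
    return "not taken"

assert flag_style(3, 3) == risc_style(3, 3) == "taken"
assert flag_style(3, 4) == risc_style(3, 4) == "not taken"
```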
There are, however, some RISCV implementations that place status registers in the CSR space to indicate carry, sign, zero, etc. (it's not required, but some may find it useful).
One other point I want to mention is the RISCV insistence on adding extra functionality to extensions. One prime example is the bit-manipulation extension. Everything there is pretty easy to implement in hardware with a low area overhead... except for the CRC instructions. I understand why CRC is needed / wanted, but I don't think it belongs in the same extension as things like "rotate left". Again, this is because of the requirement that an extension is all or nothing: if you don't want to spend the hardware on computing CRC, then you can't tell the compiler that the extension is supported, and you have to use manual intrinsics hacked in with macro'd assembly. (CRC should probably be in a security-related extension, not bit-manipulation.)
I love all your work! Are you still working on the N64 core?
Thanks! I am still working on it, however, that project is currently on hold - I am working on something else at the moment. I do plan to return to the N64 project when the current one is done.
@@RTLEngineering Thanks for the update! Looking forward to your future videos. Out of curiosity, is the other thing something in this FPGA gaming space, or just something entirely unrelated?
Thanks! I'm not sure when the next video will be though - this other project is taking up a lot of my time. It is something in the FPGA gaming space. I can't go into details, but I can say that it's something new.
@@RTLEngineering No worries, please take your time. Glad to hear your other current efforts are also in this space!