N64 Hardware Architecture

  • Published Nov 28, 2024

Comments • 53

  • @adbethsing6800
    @adbethsing6800 6 years ago +30

    dude this is so complicated. it sounds like Data talking about the warp engines spacetime manifold. srsly. Props. Digital engineering is awesome.

  • @sdfsdfsdfsdsdfsf8468
    @sdfsdfsdfsdsdfsf8468 6 years ago +5

    Love your work, please make more videos each week, and go into detail on the chips. Also, always give web links.

  • @derbezacesanchez3779
    @derbezacesanchez3779 6 years ago +8

    I get the feeling the Nintendo 64 was very complicated hardware then, and still is even now.

    • @RTLEngineering
      @RTLEngineering  6 years ago +11

      It was / is, when compared to modern hardware (even though modern architectures are orders of magnitude more complicated, all of that complexity is abstracted away by drivers and libraries). Just like with modern hardware, Nintendo did provide libraries in C which did all of the low-level stuff; the only issue was that developers still needed to have an idea of what the libraries and hardware were doing. The same can be said for the PS2, which was even more complicated than the N64. Though I think development for older systems (like the SNES) or the Sega systems would have been more difficult, considering it was mostly done in assembly as opposed to C.
      One nice thing about the N64, though, is that it has all of the basic architecture of modern APUs (like what's in the Xbox One and PS4), but in a much simpler form.

    • @Quaker763
      @Quaker763 1 year ago +1

      @@RTLEngineering I feel like after SONY got smoked by the 64 technologically, they overcompensated big time with complexity. The Emotion Engine, and the succeeding Cell Broadband architecture, were, in my opinion, terrible choices for hardware design, especially when you consider most games ended up using the same middleware because only a few dev houses were able to comprehend the complexity.
      Awesome channel by the way, I can't believe I haven't stumbled across your videos earlier!

  • @myownfriend23
    @myownfriend23 6 years ago +4

    This is exactly what I was looking for. I have a few questions about the RDP though. In your diagram, you state that the Color Combiner and Rasterizer submit a pixel's color, x, y, and z info to the Blender, which it then uses to write to the area in memory where the z-buffer and frame buffer are stored in RDRAM.
    However, online documentation that I've found says that the programmer is the one who defines the location of the z-buffer, and the number of frame buffers and their sizes. I was wondering if someone could instead theoretically implement a tile-based renderer with 32x16 pixel tiles, where the Blender could be told to loop the z-buffer and frame buffer data back into TMEM through the system bus. If so, the z-buffer wouldn't use up any RDRAM space or bandwidth, and the frame buffer could be written in bursts to possibly reduce the hit from RDRAM's latency. Is the requirement to render the frame buffer to DRAM enforced by the "OS" and microcode, or is it the case that it absolutely HAS to write to main memory since the Blender is connected to the Memory Interface?

    • @RTLEngineering
      @RTLEngineering  6 years ago +4

      So the programmer sets the location of the Z-buffer (called the z image) and the frame buffer (called the color image) via RDP commands. These commands are placed in a memory block and transferred into the RDP via the RDP DMA controller. When doing so, an image (buffer) width is specified for the color image, as well as an RDRAM address for both.
      I don't know for certain, but most likely, this command populates internal registers within the blender, setting the Z-buffer start address, the frame-buffer start address, and the image width. Then as pixels come in, there is another DMAC (not mentioned in the documentation) which retrieves the current image block and also writes it back to the correct memory location. You know exactly which memory address you are working with given x and y, since the address is going to be addr = (y*width + x)*bit_depth.
      So you can't loop it quite like what you were thinking, but you can do something similar, which is mentioned in some documentation. TMEM does not exist in the same address space as RDRAM, and therefore is not accessible by the RCP's internal bus.
      What you could do though, is render a small tile to RDRAM, then copy it into TMEM as a texture, then draw the texture on the screen to a new frame buffer, etc. There may have been some games that did this, but I don't know of any for certain. Your biggest issue would be memory bandwidth (which is what you said); it is doubtful that you would obtain a decent framerate doing so.
      So to more explicitly answer your question, having the framebuffer in DRAM is a hardware limitation.
      With that said, there is no reason that an emulator (software or hardware) couldn't provide a feedback path. Both commands to set the framebuffer and z-buffer use 26-bit addresses, with an extra 6 bits free. Those bits could be used to specify a memory location. For example, bit 26 = 0 could mean RDRAM, and bit 26 = 1 could mean TMEM (obviously the TMEM address would need to be much smaller, but it could still be done). My original thought for an FPGA implementation was to use a modified address (so Set Color Image would force bit 27 to 1, for example), placing it in another part of the system memory, and then make sure that the video interface realizes this change. The main issue would be reloading the framebuffer into a texture, so some address tracking might be needed. Also, if some games cleared these images from the RSP or CPU, then that wouldn't work. I may have to rethink that, however, since the memory bandwidth of DDR3 for the target dev board I was considering is listed as 800 Mbps, which is less than the original N64 peak bandwidth (4028 Mbps); it's possible that one of the sources made a mistake with Mb vs MB (bits vs bytes). If that's the case, a custom board would be needed, with higher memory bandwidth; a 1080p framebuffer wouldn't fit in the internal block RAMs either.
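      As a rough C sketch of the addressing and the tag idea (the variable names and the bit-26 tag here are hypothetical; the real Set Color Image / Set Z Image commands just carry a 26-bit RDRAM address):

          #include <stdint.h>

          /* State that the Set Color Image / Set Z Image commands would populate
             inside the blender. */
          static uint32_t color_image_base;  /* frame buffer start address */
          static uint32_t z_image_base;      /* z-buffer start address */
          static uint32_t image_width;       /* image width in pixels */

          /* Byte address of pixel (x, y): addr = (y*width + x)*bit_depth.
             e.g. the z address of a pixel would be pixel_addr(z_image_base, x, y, 2). */
          static uint32_t pixel_addr(uint32_t base, uint32_t x, uint32_t y,
                                     uint32_t bytes_per_pixel)
          {
              return base + (y * image_width + x) * bytes_per_pixel;
          }

          /* Hypothetical emulator extension: use one of the 6 free address bits
             to select the target memory. */
          #define MEM_SELECT (1u << 26) /* 0 = RDRAM, 1 = TMEM */

          static int targets_tmem(uint32_t addr) { return (addr & MEM_SELECT) != 0; }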
      Hopefully that was helpful, and not too long of a response / confusing.

    • @myownfriend23
      @myownfriend23 6 years ago +1

      @@RTLEngineering Ah thanks. I was just trying to see if any practices that began after the N64's lifespan could have been used to its benefit. Doesn't look like it's flexible enough though. Like I guess someone could get something like deferred shading but I can't imagine there would be much benefit lol

    • @RTLEngineering
      @RTLEngineering  6 years ago +1

      @@myownfriend23 It could have worked. There is a statement about rendering to indexed buffers in some of the documentation, which can then be re-used in the final render. It only has the limitation of going through RDRAM, which poses bandwidth issues. So you could implement deferred shading, but that's tricky since it basically uses a fixed-function pipeline, and not the programmable one that is needed for deferred shading. I think the indexed image idea (indexed because it stores a larger image in a smaller memory footprint) could be used for reflections though (water, mirrors, etc.). Also, some of the parallel programming engines used in the PS2 for particle effects could probably be implemented in the RSP, all of which are techniques that came after the N64. The N64 manual even talks about how to more easily parallelize a particle-like loop to run on the RSP.

    • @myownfriend23
      @myownfriend23 6 years ago

      @@RTLEngineering I'm back with more questions! lol
      In the VR4300, was the CPU the only one responsible for things like branching or could CP1 do that, too?
      Having looked at programming documentation, I saw that there was a microcode command to send six vertices to the RSP to create two triangles. This raises some questions. For one, the documentation states that the vertex coordinates are specified as 16-bit integers, not floats; is this true? And if so, why would the VU need to do transformations with 32-bit operations?
      The other is that it suggests there's no way for the RSP to take advantage of triangle strips or indexed triangle strips with Nintendo's own microcode. If that's true, wouldn't that be a simple way to at least double the number of lit polygons, since any triangle after the first would only need one vertex to be transformed and lit? I can also imagine the back-face culling process benefiting from this, as it knows that the triangles are connected. So as soon as it determines that a triangle is back-facing, all it has to do is determine if the next vertex is to the left of the last two in view space. If so, that triangle can be culled. If one of them goes to the right, then if the last triangle that passed the visibility test is in front of it, it too can be culled.
      Would something like fast inverse square root be implemented to the N64's advantage?
      I've also read about examples where the CPU worked as a kind of fragment shader on the frame buffer to do things like full-screen Gaussian/box blur and jigsaw puzzle effects, and it was able to do a full-screen lens distortion effect in Perfect Dark, which I believe ran at 640x480. If a 93.75 MHz single-core CPU can iterate through 76,800 to 307,200 pixels fast enough to do such complicated effects after the RSP and RDP have already done texturing and lighting, wouldn't a 62.5 MHz 8x16-bit SIMD unit like the one in the RSP be able to do the same operation up to 5x faster? At that point, couldn't microcode be devised to allow the RSP to do per-pixel lighting? That would essentially treat the RSP like the unified mixed-precision shader cores in mobile GPUs. If someone got as far as getting per-pixel lighting running, then the next thing to try would obviously be normal mapping... though I strongly doubt that the RSP has enough oomph to compute tangent and bitangent data, then transform the normal maps to world space AND do per-pixel lighting lol
      And the last question (for this comment at least lol). Factor 5 apparently stated that they created custom texture formats for Indiana Jones and the Infernal Machine. How exactly would they have gone about that if the RDP wasn't all that programmable? Were they simply creating texture formats to get the size down, which were then interpreted by the RSP and sent to the RDP as indexed textures? That's probably the one aspect of the N64 that I've seen documented the least, the RDP's instruction set. I did find out that the clipping and culling is all done on the RSP and not in the rasterizer, though.

    • @RTLEngineering
      @RTLEngineering  6 years ago +1

      @@myownfriend23 I'm happy to answer them to the best of my knowledge.
      There were CP1 (floating-point co-processor) branch instructions, which basically did a floating-point test that set a flag, and then branched on true or on false. The important point to understand there is that CP1 and the main CPU operate in lock-step, so you can't have CP1 run in parallel. In reality, the VR4300 has an integrated integer and floating-point pipeline, so the only distinction between CP1 and the main CPU was logical / instruction based, and not visible in the architecture.
      The RSP used fixed-point arithmetic, not floating-point. However, there are some cases where fixed-point is more accurate (since you have more digits). I'm not sure what you are asking about with regard to 32-bit operations... All of the instructions are 32 bits in length, and the RSP had 128-bit ports to the vector register file.
      It wouldn't quite double it (a strip of N triangles needs only N+2 vertices instead of 3N), but it could help with bandwidth, at the cost of requiring the RSP to do more work. The advantage of triangle strips and triangle fans is that they require fewer vertices to be specified, but all 3 vertices still need to be calculated independently within the final render hardware. The N64 development kit did allow developers to compile their own microcode for the RSP, so it's possible and likely that some of the studios did implement triangle strip / triangle fan rendering routines. In terms of back-face culling, this wouldn't be helpful, since there is no requirement that the strip or fan be planar. For example, you can create a sphere out of only triangle strips, where some of the faces would be culled but not all of them.
      There is a "fast" inverse square root implemented in the RSP via a look-up ROM. Are you thinking about the CPU? Or some method that does not require a ROM?
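      If you meant a ROM-free method: the usual trick is the bit hack plus a Newton-Raphson step (later made famous by Quake III). It operates on IEEE floats, so on the N64 it would have to run on the VR4300 rather than the fixed-point RSP. A minimal C sketch:

          #include <stdint.h>

          /* ROM-free fast inverse square root: a bit-pattern initial guess
             refined by one Newton-Raphson iteration. */
          float fast_rsqrt(float x)
          {
              union { float f; uint32_t i; } v = { x };
              v.i = 0x5f3759df - (v.i >> 1);       /* cheap initial guess */
              v.f *= 1.5f - 0.5f * x * v.f * v.f;  /* one refinement step */
              return v.f;
          }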
      I didn't know of games using the CPU to do rendering effects, but it makes sense. For most games, the VR4300 is sitting idle while it waits for the RCP to complete draw commands. In theory the RSP would be faster; however, it was almost always in use, and when it wasn't, it was waiting for data / instructions to be moved in from RDRAM. If you wanted to do per-pixel lighting, you would still have the bandwidth issue mentioned previously. You would basically need to operate on the image per-pixel and store it back into DMEM. Since the RSP had 4K of DMEM, everything would have to fit in there, and that's 912K per image for 640x480. So it might be possible to do it with 32x32 pixel blocks, of which you would need 300. And you can pipeline the DMA operations (swapping which part of DMEM you are using), so it might be possible to do it without pausing the RSP. I have never tried it though, so I'm not sure if it's feasible, but it is certainly possible (you might end up with 2 fps doing so). If you wanted to go that route, though, you would be better off using an implementation that replaces the RDP with 1 or 2 more RSPs (since the RDP was much larger).
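      To make the pipelined-DMA idea concrete, here is a rough C model of such a per-pixel pass (a sketch, not real RSP microcode: the DMA functions are hypothetical stand-ins, a 16-bit framebuffer is assumed, and the tile transfers are assumed to handle the row stride):

          #include <stdint.h>

          #define SCREEN_W 640
          #define SCREEN_H 480
          #define TILE_W   32
          #define TILE_H   32
          #define N_TILES  ((SCREEN_W / TILE_W) * (SCREEN_H / TILE_H)) /* 300 */

          /* Two 2KB tile buffers standing in for the two halves of the 4K DMEM,
             so one tile can be in flight while the other is processed. */
          static uint16_t dmem[2][TILE_W * TILE_H];

          /* Hypothetical stand-ins for the RSP DMA engine. */
          void dma_read(uint16_t *dmem_dst, const uint16_t *rdram_src);
          void dma_write(uint16_t *rdram_dst, const uint16_t *dmem_src);
          void dma_wait(void);

          /* RDRAM address of the top-left pixel of tile t (row-major tiles). */
          static uint16_t *tile_addr(uint16_t *fb, int t)
          {
              int tx = t % (SCREEN_W / TILE_W);
              int ty = t / (SCREEN_W / TILE_W);
              return fb + (ty * TILE_H) * SCREEN_W + (tx * TILE_W);
          }

          /* Placeholder per-pixel operation (e.g. a lighting term). */
          static uint16_t shade(uint16_t px) { return px; }

          void per_pixel_pass(uint16_t *fb)
          {
              int buf = 0;
              dma_read(dmem[buf], tile_addr(fb, 0));     /* prime the pipeline */
              for (int t = 0; t < N_TILES; t++) {
                  dma_wait();                            /* current tile is in DMEM */
                  if (t + 1 < N_TILES)                   /* overlap next fetch with compute */
                      dma_read(dmem[buf ^ 1], tile_addr(fb, t + 1));
                  for (int i = 0; i < TILE_W * TILE_H; i++)
                      dmem[buf][i] = shade(dmem[buf][i]);
                  dma_write(tile_addr(fb, t), dmem[buf]); /* store the tile back */
                  buf ^= 1;
              }
          }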
      For the custom texture format, I'm not actually sure how they did that. The RDP can't be programmed to use custom formats, so there are two possibilities that I can think of.
      1) They sent the compressed texture to the RSP, decompressed it there, did a DMA to RDRAM, and then another DMA from RDRAM to TMEM (which makes no sense, since it only helps with loading time from the cartridge at the expense of extra compute time).
      2) They implemented an indexed texture which had the color table changed by the RSP in real time.
      There is another possibility that occurred to me.
      3) They could have sent the compressed texture to the RSP, decompressed part of it into an indexed texture, performed a DMA to RDRAM and then on to TMEM, and also swapped the color tables for the indexed texture from the RSP. It should be possible to pipeline this process, so effectively you are only ever doing temporary block transfers from the RSP -> RDRAM -> TMEM, and altering the texture in TMEM on the fly.
      The third option seems like the most likely possibility, since it spreads the bandwidth out over time instead of having it all happen at once. It would be very tricky / finicky to implement, but it should be possible via custom microcode. The biggest problem is the lack of a connection from the DMEM of the RSP to the TMEM of the RDP other than via RDRAM (the DMA controllers can't transfer between DMEM and TMEM directly; one end of a transfer must be in main memory). The only connection between the RDP and RSP is the X-bus, which is between DMEM and the display command buffer.
      As for clipping / culling, you are correct, it's done on the RSP. The RDP is pretty dumb: for a textured triangle, you basically have to provide a start / stop coordinate, as well as a slope for each edge, and that's all you can do. The coordinates have to be within screen space as well, so the clipping / culling has to be done before the display command is generated. I believe the microcode used for triangle draws did all of that within the RSP though (culling, clipping, re-triangulation, transforms, texture coordinate mapping, slope calculation, and display command generation).

  • @FurEngel
    @FurEngel 5 years ago +3

    The ROM cart could only run as fast as the slowest mask ROM allowed for production. I doubt that is anywhere near 140 MHz. Also, correct me if I am wrong, but the RCP created images using glide3D. This would probably be the most difficult thing to implement, as most emulators use glide-to-D3D wrappers. Implementing that in VHDL would need a really high gate-count FPGA. So many LEs... so many.

    • @RTLEngineering
      @RTLEngineering  5 years ago +6

      The quoted peak bandwidth of the cartridge port is something like 264 MB/s, which when divided by 2 bytes per transfer (since the bus is 16 bits wide) is roughly 140 MHz. It wasn't actually clocked, but was instead latched. The latch timings were set in registers within the RCP, so it could allow for various-speed mask ROMs. The initial header read was always done at a minimum speed though, which provides the RCP with the exact timing parameters to use. It's possible that no mask ROM would allow a transfer at that speed; however, there was also the disk drive that could be attached through the same port (underside of the unit), which may have been able to operate at that bandwidth.
      As for the RCP, 3D was generated using Nintendo's own graphics library (it may have been called glide3D, I don't recall). If you wanted to intercept the glide3D calls and implement them, then yes, that would use a lot of LEs. The nice thing about FPGAs is that they can reproduce the actual hardware which the 3D library ran on top of. I know of someone who has been able to compile the original RCP Verilog for a Cyclone V (it doesn't work quite right, but everything is there), and it only uses around 60K LEs. For reference, the planned FPGA to use for this core is the Xilinx Artix-7 200T, which has ~212K LEs.

  • @elliottzuk3008
    @elliottzuk3008 7 months ago +1

    Please do a Dreamcast one!

  • @guillaumefigarella1704
    @guillaumefigarella1704 2 days ago

    What's your opinion on the current N64 FPGA offerings?
    I think the MiSTer N64 core is very impressive, but I want to see what Analogue did with the huge Cyclone 10 GX, even if I really don't like the lack of communication.

  • @ThootenTootinTabootin
    @ThootenTootinTabootin 5 months ago +1

    Doing the lord's work

  • @nathanlamaire
    @nathanlamaire 1 year ago

    Is it possible for RDRAM to stay fed with data transfers so that stalling doesn't happen?

    • @RTLEngineering
      @RTLEngineering  1 year ago

      Unfortunately no, stalling is part of the bus architecture (both RDRAM and the internal bus). It's needed for turn-around and synchronization.

  • @AngelGarcia-fg9es
    @AngelGarcia-fg9es 6 years ago +2

    Amazing!!!

  • @gsestream
    @gsestream 1 year ago

    How much memory is chip-internal / local in the RDP DMEM, to be used as a hardware z-buffer, or as a chip-local frame buffer extension?

    • @RTLEngineering
      @RTLEngineering  1 year ago

      None. There's a small cache that's controlled by the hardware (to cover bursting), but otherwise the z-buffer and frame buffer are stored in the shared system memory.
      The DMEM on the RCP can't be used for z-buffer or color directly. It can be used for it indirectly, but you're going to end up copying stuff in and out of main memory which will perform worse than not using it at all. Alternatively, it's possible to program a software renderer using SIMD on the RCP, but it would leave the RDP idle.

    • @gsestream
      @gsestream 1 year ago

      @@RTLEngineering You can do microcode changes directly, maybe a true hardware z-buffer, using the DMEM/IMEM 4KB caches.

    • @gsestream
      @gsestream 1 year ago

      @@RTLEngineering Maybe TMEM could be partially used as a local z-buffer cache, while the other part is used as normal texture memory.

    • @RTLEngineering
      @RTLEngineering  1 year ago

      That's what I meant by "software render using SIMD". There's no read/write path between the DMEM and IMEM, nor is there a read/write path between the DMEM and the fixed-function RDP path. All communication between them would need to be done using DMA over the main system bus.
      Regarding TMEM, it's the same. There's no direct write path, where you can only write to the TMEM using DMA. Worse yet, the DMACs in all cases required that one address be in main memory, so you couldn't DMA between the memories without first going through the main memory.

  • @Metroid24abd
    @Metroid24abd 2 years ago

    How is the video DAC 30 fps when F-Zero X, Smash Bros., and other games ran at 60 fps?? (NTSC)

    • @RTLEngineering
      @RTLEngineering  2 years ago +1

      NTSC is 60i, so each field was only half-height. In other words, it only output a full frame 30 times per second (30p). The PSX did the same thing, because that effectively allows you to draw more complex scenes in the same amount of "time". The actual render loop (with updates) ran at 60 fps, though some games updated at 30 fps instead to give more time to the game logic.

  • @Sauraen
    @Sauraen 3 years ago

    The info on your slide "Game Media" has some errors. You say most games have two ROM chips--I've never seen a single game that has two ROM chips. You show the EEPROM connected to the second ROM chip, whereas it's actually completely independent and connected to the PIF on controller ports 5 and 6 (usually just 5, I don't know if any retail game used port 6). Finally, you say most carts have SRAM and most carts have EEPROM--as far as I know no carts have both, they use one or the other, and some cheap games don't use either but instead controller mempaks.

    • @RTLEngineering
      @RTLEngineering  3 years ago

      Some of what you said is correct. The two chips on the game cartridges were the ROM and the security chip. I'm not sure what kind of ROM was used - though they were probably EPROMs? (I find it hard to believe they were still using mask ROMs).
      The diagram was drawn in a simplified way, and didn't represent the hardware exactly. The block diagram of the internal structure of the RCP is also incorrect, since it was far more complicated inside.
      I don't think the PIF had a controller port 5 or 6; it instead connected to the security chip using a special connection. There were special functions that the PIF could perform that acted on the security channel and not the controller channels.
      Some cartridges did have SRAM and some had EEPROMs, however, they may have all been internal to the same package (what looks like one chip on the board can have many internal parts).

    • @Sauraen
      @Sauraen 3 years ago

      ​@@RTLEngineering
      They are mask ROMs. EPROMs would be much, much more expensive and possibly degrade over time. Technically there is a small ROM within the security chip (about 32 bytes or something) but that doesn't count as a separate ROM as far as I'm concerned.
      The PIF definitely does have a controller port 5 and 6, you can see them on some early devkits and in the SGI headers. Port 5 is the serial interface to connect to the EEPROM in the cartridge, and port 6 is a second serial interface to the cartridge which may never have been used. This is completely separate from the PIF's connection to the security chip.
      The EEPROMs were 8-pin DIP packages, and the SRAMs were large, wide DIP packages almost the same size as the mask ROM chips, plus an auxiliary IC for battery back-up, as well as the battery itself. There's no chance these were put in the same package; they were both off-the-shelf parts.

    • @RTLEngineering
      @RTLEngineering  3 years ago

      I'm not sure any of this really matters all that much to the system architecture, since it was game specific.
      For EPROMs vs mask ROMs, I'm not convinced. I would have to see a die shot. At the densities of the ROM chips, I think mask ROMs would have been impractical, especially having different masks for each game. The boot ROM was definitely a mask ROM though, but that was only a few KB and in a large process technology, if I recall correctly. Note that when I say EPROM, I am referring to a one-time electrically programmable ROM. That would allow one chip to be mass produced, and then most likely programmed at an early stage in production using specialized equipment. The difference is that electrical programming doesn't require a custom lithographic mask for each title.
      For the PIF controller ports 5 and 6, I'm also not convinced. Just because the dev kit implemented it that way doesn't mean the final system did. I would bet that the dev kit also implemented the graphics in the RCP differently than the final product. All that matters is that the code is compatible.
      I'm not sure what you're talking about with EEPROMs, SRAMs, etc. in DIP packages. If you are referring to the game paks, those ROM chips were custom made for Nintendo, whether they were Mask ROM or EPROM.

    • @Sauraen
      @Sauraen 3 years ago

      @@RTLEngineering EPROM stands for "erasable programmable read-only memory" and means the package has the little window through which you can erase the memory by shining UV light on it. I think you're talking about PROMs, which could possibly have been used for some games; I can't rule that out. But the PCB labels the ROM chip as "MROM", so it's pretty clear that was at least what was intended when the PCB layout was done. If you can find any evidence they used anything but mask ROMs, I'd love to see it.
      I stand partially corrected on the EEPROM situation. The PIF does indeed have two "EEP DATA" pins, but they're tied together. All the controllers also have two pins, one for read and one for write, so it's possible this is what these are too. So there's only one EEPROM channel in the cart (controller 5), not two (5 and 6). However it is still true that this is completely separate from how the SRAM is connected, and that games don't have both EEPROM and SRAM.
      If you take apart a few N64 carts you'll be easily able to identify the EEPROM in an 8-pin DIP package, or the SRAM in a wide DIP package (I think 32 pins but I could be wrong about that number).

    • @RTLEngineering
      @RTLEngineering  3 years ago

      I was thinking EPROM = electrically programmable ROM; however, that would just be a PROM, and as you said, EPROMs are the ones with the UV windows.
      It looks like most of the N64 ROM chips were Macronix NVM, which could have been Mask ROMs... They also have a type of PROM though which is one-time programmable. One thing that surprised me was that most of the cartridges did use DIP packages as you said (for some reason I thought they were all SMT).
      Although, even if the PCB labels are "MROM", that could just be a convention where a PROM would appear as a MROM as far as the system is concerned.
      Also, there were PROMs at the sizes of the N64 ROM chips around that time (at least as far as I could tell).
      That's interesting, so the GamePaks were exclusively EEPROM or SRAM. I guess that makes sense if it was only used for game save files. Also, the schematic shows that there were 4 wires going to the cartridge: EEPROM_DATA, EEPROM/CIC_CLK, CIC_OUT, CIC_IN. Since the EEPROM_DATA is coupled with the CIC_CLK, that would imply that it's not using a controller port... perhaps it just appears that way internally to the PIF? To be honest, I haven't spent much time trying to figure out how exactly the PIF appears to / interacts with the RCP.
      If the SRAMs are in a 32-pin DIP package, then they can't be connected to the EEPROM lines (they would probably connect to the parallel port along with the ROM), unless they had a serial IO mode?

  • @LittleRainGames
    @LittleRainGames 5 years ago +1

    Did you make the Super NT?

    • @RTLEngineering
      @RTLEngineering  5 years ago +4

      No, that was someone else who goes by Kevtris.

  • @josnelihurt
    @josnelihurt 5 years ago

    Any additional links for this project?

    • @RTLEngineering
      @RTLEngineering  5 years ago

      Nope, there are no additional links for it, since all of the information that would be at a link is contained in my other videos. If you are interested in more details, there are additional videos regarding the VR4300 and the implementation / optimization of the VR4300 (see the "Troubles of the VR4300" playlist).
      In terms of the other components, I have yet to start working on the RCP, since many of the components within the VR4300 are identical or nearly identical to those in the RSP (it would make the most sense to finish the VR4300 first). Aside from that, there are a few vague ideas on how to architect the actual RCP and corresponding interfaces, but those are not written down anywhere.

  • @RyanPerfect
    @RyanPerfect 3 years ago

    translation video coming? *****le.

    • @RTLEngineering
      @RTLEngineering  3 years ago

      Translation to what? There are auto-generated subtitles which can be auto-translated.
      If you would like, I can go over those to make sure they are correct in English to improve the translation accuracy.