GPU Memory, and Clogged Pipes (Part 2 - 3dfx Voodoo)

  • Published on 24 Nov 2024

Comments • 26

  • @golarac6433
    @golarac6433 1 year ago +2

    I cannot overstate how much I like your videos. I think I've watched this series 3 times already. I hope you make more videos like this.

  • @Sluggernaut
    @Sluggernaut 2 years ago +2

    For more information at a very different but deeper level of GPU operation, check AMD's public Instruction Set Architecture documentation. There was an actual third-party student implementation of this architecture called MIAOW, IIRC. So they effectively designed their own chip that would be compatible with AMD's Sea Islands chips, again if memory serves.

    • @RTLEngineering
      @RTLEngineering 2 years ago +4

      I don't think I would necessarily call that deeper. The move to programmable shaders would likely imply a higher level of operation, since the fixed-function hardware is replaced with software. Though in exchange, the simpler hardware is now a more complicated vector processor. You can get a bunch of extra information about the surrounding hardware from the ISAs though, such as the fact that GCN at least uses barycentric parameter interpolation in the shaders themselves. And that the parameter interpolation is implemented by a 3-step set of instructions. When compared to Nvidia GPUs, it's unclear whether that's added to the shader program during compilation, or if it's prepended in the setup script (since they don't provide nearly as detailed of an ISA explanation).
      I am familiar with MIAOW, and recall watching their Hot Chips talk a few years ago. One of the most memorable questions was whether there were any legal implications for using the AMD ISA, which they didn't have an answer for (that's always a concern if it's not something custom).
      There's also FlexGrip, which I believe used PTX as the ISA base (with custom encodings), although I would have to go back and read the paper / dissertation to verify.
      Both FPGA GPU projects are focused only on GPGPU, which is at the shader level, and completely ignore the other parts that allow a GPU to do graphics. Also, neither of them implements the out-of-order operand collection logic that modern GPUs have, but that's also significantly more complicated to do. Either way, they're both very impressive works.
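To make the barycentric parameter interpolation mentioned above a little more concrete, here is a minimal sketch of the general technique in Python. It is not the actual GCN instruction sequence; the function name `interpolate_attribute` and the split into three steps are assumptions chosen for illustration, using the common form P0 + i*(P1 - P0) + j*(P2 - P0).

```python
def interpolate_attribute(p0, p1, p2, i, j):
    """Interpolate a per-vertex attribute at barycentric coordinates (i, j).

    p0, p1, p2 are the attribute values at the three triangle vertices.
    (i, j) are the barycentric weights of vertices 1 and 2; the weight of
    vertex 0 is implicitly 1 - i - j.
    """
    # Step 1 (hypothetical split): per-triangle deltas relative to vertex 0.
    p10 = p1 - p0
    p20 = p2 - p0
    # Step 2: add the contribution of the first barycentric coordinate.
    partial = p0 + i * p10
    # Step 3: add the contribution of the second barycentric coordinate.
    return partial + j * p20


if __name__ == "__main__":
    # At a vertex the result equals that vertex's attribute value.
    print(interpolate_attribute(1.0, 3.0, 5.0, 1.0, 0.0))
    # Somewhere inside the triangle the three values are blended.
    print(interpolate_attribute(1.0, 3.0, 5.0, 0.25, 0.25))
```

The deltas only depend on the triangle, so in hardware they could plausibly be computed once per primitive while the two multiply-add steps run per pixel.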

  • @AxiomofDiscord
    @AxiomofDiscord 2 years ago +2

    I like how well you take corrections, and you also make me so hungry for an FPGA solution for an N64.

    • @RTLEngineering
      @RTLEngineering 2 years ago +1

      Well, there are more knowledgeable people out there (e.g. engineers who worked on the chip design for these systems), and I'm more interested in the truth than being right. Although a correction needs to be plausible for me to accept it - I was hesitant to accept that the performance of the Voodoo 5-5500 in 3DMark 2001 was so low due to memory bandwidth (the detailed statistics seemed to indicate that the DX8 implementation disabled one pipeline per chip). It took some convincing, but it was indeed caused by the bandwidth (as mentioned in this video), which interestingly would produce the appearance of one pipeline being disabled.
      As for an FPGA N64, I completely agree. I was working on one, but lost motivation since someone else beat me to the punch. We'll see how well their core performs once it's released. Right now, I'm working on more challenging cores that require extensive architecture optimization to map to an FPGA, something that I think requires a more specialized skillset.

    • @AxiomofDiscord
      @AxiomofDiscord 2 years ago

      @RTLEngineering I admire your attempt. I have seen a few people making good headway on some sub-1K boards. I'm seeing a lot of Digilent boards in these experiments.
      I like to call them neighbors since they are just down the street from my place of employment. If they ever make a next-gen MiSTer board, I could just walk over and ask to buy one lol.

    • @RTLEngineering
      @RTLEngineering 2 years ago +2

      Those are certainly contenders. But I think that the next "MiSTer" would be based on the K26 SoM. It's even small enough that you could reasonably place it in an original GBA shell (that sort of form factor). It's also possible that you could fit an entire PS2 or Naomi in two K26 SoMs tied together via the GTH. I know of a few projects that are using that SoM as a platform as well, but FPGA-attached DRAM is going to be one of the challenging aspects (the DRAM through the ARM side has too high a latency to be useful).

  • @Saturn2888
    @Saturn2888 2 years ago +2

    All of this seems very relevant to why the RTX 4080 (12GB) model's 192-bit bus, while really fast, will probably cause performance issues in the future.

    • @RTLEngineering
      @RTLEngineering 2 years ago +4

      Quite possibly. Especially for raytracing, which is very memory constrained due to its random read access patterns.
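As a rough illustration of why bus width matters here, peak memory bandwidth is simply the bus width multiplied by the per-pin data rate. The sketch below assumes a per-pin rate of 20 Gbps purely as an example figure; it is not taken from this thread or from any particular card's specification, and `peak_bandwidth_gb_s` is a hypothetical helper name.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps_per_pin):
    """Theoretical peak bandwidth in GB/s for a memory bus.

    bus_width_bits: number of data pins (e.g. 192 or 256).
    data_rate_gbps_per_pin: effective transfer rate of each pin in Gbit/s.
    """
    # Total Gbit/s across all pins, divided by 8 to convert bits to bytes.
    return bus_width_bits * data_rate_gbps_per_pin / 8


if __name__ == "__main__":
    ASSUMED_RATE = 20  # Gbps per pin, an illustrative assumption only
    for width in (192, 256):
        print(width, "bit bus:", peak_bandwidth_gb_s(width, ASSUMED_RATE), "GB/s")
```

At the same per-pin rate, a 192-bit bus delivers three quarters of the peak bandwidth of a 256-bit bus, and random access patterns tend to achieve well below that peak in either case.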

  • @CompatibilityMadness
    @CompatibilityMadness 2 years ago +4

    Amazing.
    Great detail, high quality pictures/diagrams to show what is where, and commentary to top it all off.
    BIG kudos for researching all the information needed for this type of video.
    Do you plan to do T&L GPUs, or any other early weird 3D cards (like Verite, mPact2, Permedia, i740, etc.), after the PS2 and Dreamcast?

    • @RTLEngineering
      @RTLEngineering 2 years ago +3

      Thanks! It does take a lot of effort to cross-correlate the different sources, and to try to find the underlying meaning / implications of what appear to be trivial notes. Funny enough, the Dreamcast and PS2 have proved to be quite a bit more challenging, since there are hints at more complex functionality, but there are no sources that elaborate further.
      As for the other GPUs, probably not. The issue comes down to available information, such as the low level programming interface, die photos, and performance presentations (all three are necessary). To my knowledge, aside from the earlier game consoles, only the 3dfx GPUs have some of that information public. I had briefly looked into the TNT series, which appear to have some interesting hardware secrets, but I would need to probe a real one with microbenchmarks to verify, and that's not worth the effort involved.

  • @brucetungsten5714
    @brucetungsten5714 2 years ago +2

    Thanks for your high quality content!

  • @theangel540
    @theangel540 2 years ago +2

    Such a great video, really!
    Would love to see your opinion on the Rendition Verite architecture at some point!
    And arcade hardware that works with quads.
    And why not a "little" word about the GIANTs SGI and Evans & Sutherland!
    A lot of work 😅

    • @RTLEngineering
      @RTLEngineering 2 years ago

      Thanks! I don't really know as much about the Rendition Verite architecture, nor the specifics of SGI's architectures (other than the fact that they liked to use dedicated chips for each part of the pipeline). If I find the time, I can try to see what I can find and whether there's an interesting video to make out of those architectures.
      As for arcade hardware with quads, that's true, but then they gave up and moved to triangles. As I understand it, quad rendering was largely meant for distorting 2D sprites and not for polygons (mathematically you're very restricted by what you can draw with them). A major advantage though, is that rendering quads is much more efficient for drawing sprites and 2D graphics. So it makes sense that quad graphics would stick around in the arcade world for a bit longer.

    • @theangel540
      @theangel540 2 years ago

      @RTLEngineering
      Yes, you're right.
      Quads were a more natural bridge between the well-established 2D/sprite hardware and full 3D worlds. The Sega Saturn is the prime example of this approach.
      That system agrees with your explanation about memory bandwidth. The Saturn is bottlenecked by its unified CPU/VDP bus, I presume, and its master/slave CPU implementation.
      If we put aside the UFOs from SGI with their InfiniteReality engine, arcade hardware like the Model 2 was really top-notch for that era.
      If I remember correctly, there are 1/sqrt(x) tables stored in ROM rather than calculated in this hardware (OPR ROMs). A time when storing precomputed tables was always faster than the best DSPs/FPUs.
      A time when MFLOPs were common and a TFLOP was a sweet dream :-) .
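The idea of a precomputed 1/sqrt(x) table can be sketched as follows. This is only an illustration of the general lookup technique; the table size (256 entries), the covered range [1, 4), and the names `RSQRT_TABLE` and `rsqrt_lookup` are all assumptions and do not describe the actual ROM layout of any arcade board.

```python
import math

TABLE_BITS = 8                 # assumed table size: 2**8 = 256 entries
TABLE_SIZE = 1 << TABLE_BITS
RANGE_LO, RANGE_HI = 1.0, 4.0  # assumed input range covered by the table

# Built once, like burning a ROM: each entry holds 1/sqrt at the midpoint
# of its slice of the input range.
RSQRT_TABLE = [
    1.0 / math.sqrt(RANGE_LO + (RANGE_HI - RANGE_LO) * (k + 0.5) / TABLE_SIZE)
    for k in range(TABLE_SIZE)
]


def rsqrt_lookup(x):
    """Approximate 1/sqrt(x) for x in [1, 4) with a single table read."""
    if not RANGE_LO <= x < RANGE_HI:
        raise ValueError("x must be in [1, 4) for this table")
    # Map x to a table index; clamp to guard against rounding at the top edge.
    index = int((x - RANGE_LO) * TABLE_SIZE / (RANGE_HI - RANGE_LO))
    return RSQRT_TABLE[min(index, TABLE_SIZE - 1)]


if __name__ == "__main__":
    for x in (1.0, 2.0, 3.5):
        print(x, rsqrt_lookup(x), 1.0 / math.sqrt(x))
```

A real design would typically normalize the input first (shifting by an even power of two) so that a small table covering one range can serve all magnitudes.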

  • @yiannos3009
    @yiannos3009 2 years ago +1

    Thank you so much for doing the research and communicating this information. Your dedication to this material reminds me of Michael Abrash's Black Book of Graphics Programming. I'd love to have a book filled with detailed renditions of what you discovered, should you decide to write one. I'll buy it and keep it right next to my copy of the Black Book of Graphics Programming.

    • @yiannos3009
      @yiannos3009 2 years ago

      A request: Please do the NV20 (GeForce 3)

    • @RTLEngineering
      @RTLEngineering 1 year ago +2

      Thanks!
      I wasn't aware of the Black Book of Graphics Programming. It looks like it will come in handy for the PC project I am working on, thanks for sharing!
      Writing a book is a lot of work, and I barely have time as it is (hence why no videos have been uploaded in a while). Maybe someday I'll compile all of these topics into a form like that.
      As for the NV20, that's much more challenging (any Nvidia product is), since there is no documentation available. There are some basic things I can say about it, but it's largely speculation from die photos, specs, and driver code. We're lucky that 3dfx and the game consoles all had very detailed programming manuals.

    • @yiannos3009
      @yiannos3009 1 year ago +1

      @RTLEngineering Thank you so much for sharing your thoughts!

  • @Aexomer
    @Aexomer 1 year ago

    Interestingly, the 3Dfx documentation for Voodoo 2 mentions cards with three TMUs. A 16 MB card with a 4 MB frame buffer and 4 MB of texture memory per chip would have been pretty expensive, no doubt, but two of them in SLI would definitely have been the 4090-equivalent "must-have" hardware for the time. I have to wonder why a retail card was never offered in that configuration. What are your thoughts on such a beast?

    • @RTLEngineering
      @RTLEngineering 1 year ago +3

      You could have a card with three TMUs, but you would need software to explicitly support that. You would also need to figure out how to make use of them, since each TMU maps one texture at a time (i.e. they won't necessarily make texture mapping faster, only reduce over-draw from blended effects).
      You're right about the 4090 equivalent being two cards in SLI, but they would only have two TMUs each. Aside from the lack of supporting software, and that market position already being held by 2x cards in SLI, the other constraint would have been board area. Fitting 3x TMUs + FBI + DRAM for all the chips would have exceeded the PCI card spec. This is likely why that configuration only ever showed up on custom boards for simulators.
      Personally, I would have preferred the 2x cards in SLI rather than 3x TMUs, or 4x in SLI. One of the main limitations of the Voodoo architecture was the raster performance, but this scaled quite nicely when using SLI (almost linear). Not only could you render faster, but you could render at higher resolutions as well since each card only rendered a portion of the screen (in horizontal bars).
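The way SLI divides a frame into horizontal bars can be sketched as a simple mapping from scanline to card. This is an illustrative model only: the names `card_for_scanline` and `scanlines_for_card` are hypothetical, and the `band_height` parameter (how many consecutive lines each card draws before the next card takes over) is an assumption, with 1 meaning strictly alternating lines.

```python
def card_for_scanline(y, num_cards=2, band_height=1):
    """Return the index of the card that renders scanline y."""
    # Consecutive bands of band_height lines are handed out round-robin.
    return (y // band_height) % num_cards


def scanlines_for_card(card, screen_height, num_cards=2, band_height=1):
    """List every scanline that a given card is responsible for."""
    return [
        y for y in range(screen_height)
        if card_for_scanline(y, num_cards, band_height) == card
    ]


if __name__ == "__main__":
    # Two cards, alternating single lines on a tiny 8-line "screen".
    print(scanlines_for_card(0, 8))
    print(scanlines_for_card(1, 8))
    # Two cards, bands of two lines each.
    print(scanlines_for_card(0, 8, band_height=2))
```

Because each card fills only its own share of the lines, the per-card fill work drops roughly in proportion to the number of cards, which is consistent with the near-linear scaling described above.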

  • @阿綸的全勳學院
    @阿綸的全勳學院 1 year ago +1

    It is so great.

  • @TheCj71984
    @TheCj71984 2 years ago

    Great video, love videos like this.

  • @raysmith5124
    @raysmith5124 1 year ago

    Didn't the Banshee have 2 TMUs, with 1 disabled that could be enabled in the drivers...?

    • @RTLEngineering
      @RTLEngineering 1 year ago

      I believe you are thinking of the Velocity 100, which uses the Avenger GPU. The Avenger did have two TMUs on the die, which are clearly visible in die photos.

    • @raysmith5124
      @raysmith5124 1 year ago

      That must be the one I'm thinking about then.