Ring or Mesh, or other? AMD's Future on CPU Connectivity

  • Published 27 Sep 2024

Comments • 379

  • @TechTechPotato
    @TechTechPotato  3 years ago +274

    One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them. In the Land of Interposers, where the interconnects lie.
    Shoutout to Adored's video tackling the subject in mid-2018: th-cam.com/video/G3kGSbWFig4/w-d-xo.html

    • @Yusufyusuf-lh3dw
      @Yusufyusuf-lh3dw 3 years ago

      AMD's crossbar interconnect technology is difficult to scale beyond 8 CCXs; I think with Genoa, AMD will have more trouble addressing crossbar latency. AMD also needs something like Intel's EMIB, because sending traces down to the substrate through a SerDes adds extra latency and power consumption. With a silicon bridge like EMIB you don't need a SerDes, because your traces never leave the silicon die. But then connecting 16 + 1 chiplets through EMIBs could be challenging. AMD needs to develop a mesh interconnect to get meaningful scalability at 128 or more cores.

    • @ChrispyNut
      @ChrispyNut 3 years ago

      But you didn't learn, huh.
      They only wanted "a little bit of evil", just as you're only wanting "a little bit of 3D". Sorry, it's all or nothing ;P
      Not sure where in the video I'm referencing but it got a very large head-tilt and much chattering with myself XD

    • @SaturnusDK
      @SaturnusDK 3 years ago +6

      AMD might throw us a curveball here. They could be using the ring bus only to pass the L3 cache address of the information to the other core it's communicating with, while simultaneously putting the information into L3; the other core then pulls the information directly from that L3 cache address. That would explain why, in almost every single way, it performs as if the cores were all p2p connected. And since effectively less data moves on the bus, the bandwidth is never saturated and it uses less power.
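The address-passing scheme this comment speculates about can be sketched with a toy traffic model. To be clear, this is purely the commenter's hypothesis, not AMD's documented design, and the 64-byte line and 8-byte address-token sizes are illustrative assumptions:

```python
# Toy model: compare bytes the ring itself must carry when a core-to-core
# transfer ships the whole cache line vs. only an address token (with the
# payload parked in shared L3 for the receiver to pull directly).

CACHE_LINE_BYTES = 64   # payload size if the data itself rode the ring
ADDR_BYTES = 8          # a 64-bit physical address token

def ring_bytes_moved(n_messages: int, pass_data: bool) -> int:
    """Bytes the ring must carry for n core-to-core transfers."""
    per_msg = CACHE_LINE_BYTES if pass_data else ADDR_BYTES
    return n_messages * per_msg

data_scheme = ring_bytes_moved(1000, pass_data=True)    # 64_000 bytes
addr_scheme = ring_bytes_moved(1000, pass_data=False)   #  8_000 bytes
print(f"data on ring: {data_scheme} B, address-only: {addr_scheme} B "
      f"({data_scheme // addr_scheme}x less ring traffic)")
```

Under these assumed sizes, the ring carries 8x less traffic, which is the intuition behind "the bandwidth is never saturated".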

    • @thecolossus_5917
      @thecolossus_5917 3 years ago +2

      Ohhh nooo, it's melting, my precious power ring, gone forever.... REEEEEEEEEE

    • @5poolcatrush
      @5poolcatrush 3 years ago +10

      I clearly remember the exact same butter donuts from AdoredTV a few years ago. Quite ironic that not many believed him.
      Also, TechPowerUp recently published a few AMD slides that were worth including in this video, tbh.
      Sadly, all the actual innovation AMD keeps trying to bring may again be overshadowed by Intel's obscure garbage. Creating those "energy-efficient cores" (who needs more than, say, 2 or 4 of them for an idle PC state to perform their allegedly main function, yet Alder Lake has 8 of them and Raptor Lake will have 16?) is just the most efficient way for Intel to regain the performance crown in rendering and other apps that scale by default, while again keeping a merely comfortable (nowadays, thanks to AMD forcing Intel to ship at least 8) number of actual cores for ages. And by holding those crowns with this trick, Intel will basically hold down the industry and handicap global computing power and its scalability, because software development and optimization follow the leader, so we're stuck in that vicious cycle of single-core circlejerk.

  • @admiral_hoshi3298
    @admiral_hoshi3298 3 years ago +118

    I have honestly never considered the topology of CPU core interconnects, thanks for this fun mental exercise!

    • @BlissBatch
      @BlissBatch 3 years ago +8

      I like how you say "honestly," as if you're surprised that you've never done it, despite how commonplace it is among the general population to ponder the topology of CPU core interconnects.

    • @llothar68
      @llothar68 3 years ago +5

      It's one of the most important and still unsolved design issues in CPU engineering, and it's been a hot topic for two decades now. But yeah, if you don't have an interest in CPU design or high-performance computing, it's a topic you don't run into.

    • @FrankHarwald
      @FrankHarwald 2 years ago

      My favorite long-term candidate for CPU-interconnect & NoC topology would be a 2D/3D flattened butterfly topology, and depending on how the cores & components are laid out, maybe a twisted variant thereof. Butterfly networks in general have several advantages over rings and fully connected designs, and even over mesh designs, for a lot of parameters and also overall. They are also well understood & have efficient routing algorithms. Original 2007 paper: "Flattened butterfly: a cost-efficient topology for high-radix networks" by John Kim, William J. Dally, Dennis Abts, and there are several studies analysing this topology. Unfortunately it's unlikely that any processor or chip will use such a topology until 2030, because it still has a patent pending - unless any of the manufacturers is willing to pay the inventors.
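For readers curious why the flattened butterfly keeps coming up, here is a small sketch (not tied to any real chip) that builds the 2D flattened butterfly from the Kim/Dally/Abts paper, where every router links to every other router in its row and in its column, and compares its diameter with a plain 2D mesh:

```python
from collections import deque
from itertools import product

def bfs_dists(adj, src):
    """Hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def mesh(k):
    """k x k 2D mesh: links only between grid neighbours."""
    adj = {(x, y): [] for x, y in product(range(k), repeat=2)}
    for x, y in adj:
        if x + 1 < k:
            adj[(x, y)].append((x + 1, y)); adj[(x + 1, y)].append((x, y))
        if y + 1 < k:
            adj[(x, y)].append((x, y + 1)); adj[(x, y + 1)].append((x, y))
    return adj

def flattened_butterfly(k):
    """k x k flattened butterfly: each router links to *every* other
    router in its row and in its column."""
    return {(x, y): [(i, y) for i in range(k) if i != x] +
                    [(x, j) for j in range(k) if j != y]
            for x, y in product(range(k), repeat=2)}

def diameter(adj):
    """Longest shortest path between any pair of nodes."""
    return max(max(bfs_dists(adj, s).values()) for s in adj)

print("4x4 mesh diameter:", diameter(mesh(4)))                        # 6 hops
print("4x4 flattened butterfly diameter:",
      diameter(flattened_butterfly(4)))                               # 2 hops
```

The diameter stays at 2 hops (one row move, one column move) no matter how large k grows, at the cost of much higher router radix, which is the trade-off the paper analyses.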

  • @BlahBleeBlahBlah
    @BlahBleeBlahBlah 3 years ago +133

    Fascinating Ian, thank you as always - tech hasn’t been this exciting in quite a long time!

    • @Jaker788
      @Jaker788 3 years ago +2

      Tech is always interesting when you read research papers. That co-authored paper about networks is a few years old, I think 2017 or 2018. So this has been contemplated for a while.

    • @ChrispyNut
      @ChrispyNut 3 years ago +2

      @@Jaker788 2015 I believe.
      And LOL @ "interesting... reading research papers". Well, there's a clear division right there on subjective terms (for some, anything to do with "reading" automatically excludes it from possibly being "interesting", let alone "research papers") :D
      Please note, I'm also self-mocking as I don't like reading (I'm a slow reader so I don't read much so I fall out of habit ad-infinitum)

    • @Jaker788
      @Jaker788 3 years ago +2

      @@ChrispyNut Yeah I'll agree that research into concepts isn't exactly as exciting as something being presented as functional.
      It's the difference between "yeah we could potentially do this really awesome thing" and "look at this really awesome thing we actually have working and you can buy it soon!"

    • @ChrispyNut
      @ChrispyNut 3 years ago +2

      @@Jaker788 No, I think there's a misunderstanding.
      I'm WAY more into concepts and theories and ifs, butts and maybes than I am finished "stuff" (especially as the finished, commercialised stuff barely resembles the initial concepts (see Intel Light Peak -> Thunderbolt)).
      Just ... such people are in the minority (and I'm in the minority of the minority in not being a reader). :)
      I spend so much of my time with my head in the future that when the latest, greatest, taking-the-world-by-storm thing is all the rage, I'm seen as a misery guts cos it's so fuckin' lame next to the thing I'd been thinking about, which was itself interrupted by the thing I'd been thinking about decade(s) earlier. :'-(
      If you get my drift.

  • @Zorro33313
    @Zorro33313 3 years ago +94

    I remember a long-ass 2018 AMD research paper on a shitton of different connectivity approaches, from a simple mesh to some crazy shit like double-crossed-toroidal-butterfly-god-of-ancient-elves.

    • @TechTechPotato
      @TechTechPotato  3 years ago +41

      Link is already in the description!

    • @Zorro33313
      @Zorro33313 3 years ago +31

      @@TechTechPotato that's a true journalist - always a step ahead!

  • @jamegumb7298
    @jamegumb7298 3 years ago +30

    SGI workstations had a glorious design: at the centre there was a massive switch. Every component could talk to every other one at full speed. Complicated yet simple, a very fast and responsive system.
    There is a guy on YT who does teardowns of systems he has collected, and he opened one up. I wish all PCs followed this glorious architecture.

    • @dercooney
      @dercooney 3 years ago +7

      It's expensive. You spend your budget on chiplets and interconnects - a bigger interconnect means less chiplet, so you have to make a tradeoff.

    • @niks660097
      @niks660097 2 years ago +1

      Isn't Infinity Fabric already doing this? The entire I/O goes through Infinity Fabric, and as far as AMD's papers go, Infinity Fabric has built-in buffers of multiple KBs for all I/O operations.

    • @dercooney
      @dercooney 2 years ago +1

      @@niks660097 Yes, but SGI was in the 90s.
      Infinity Fabric does take power, so you have to budget your components against the total TDP budget of the package.

    • @mapesdhs597
      @mapesdhs597 2 years ago +1

      The beauty of SGI's crossbar was that it allowed systems to scale bandwidth with socket count. The initial Origin2000 series did impose a latency penalty, but this was resolved with the Origin3000 series, which increased the scaling from 128 to 2048 sockets with lower worst-case latency, increasing bisection bandwidth to more than 1TB/sec. Naturally these systems ran as a NUMA design with a single OS instance (or it could be partitioned), so a user doing, say, defense imaging or GIS could run a task and immediately have access to dozens or hundreds of CPUs and relevant connected I/O and gfx power (in fact I've still not seen any modern product that quotes faster image loading rates than the Group Station, though it's likely a thing but just not public; NVIDIA probably makes custom tech that isn't COTS, ditto AMD, and indeed SGI did this at times, e.g. for Lockheed).
      The 8-port crossbar was the most complex chip SGI ever designed; it required 6 months of Verilog testing. Each port had a 2MB cache buffer, so although installed CPUs might have 2MB L2 (such as in a max-spec Octane2), the crossbar had a lot more memory of a similar type, so not cheap. The crossbar had four independent connections, and these could change which ports were connected to which other ports on each clock tick, allowing for continuously variable I/O paths. At the same time, applications could not only lock in an I/O path to secure guaranteed bandwidth (with DMA), the REACT extensions to IRIX supported real-time response certainty as well, hence the broad use of SGIs in defense and other industrial applications. This meant, for example, that a digital video stream could be routed through to main RAM without involving the CPUs, and with a hardware guarantee that it would never drop a frame, while at the same time the same crossbar was routing other data as well.
      CPUs were not connected directly to the crossbar though; in Origin, CPUs and RAM were connected to a HUB chip. Each HUB had two ports: one goes to a crossbar, the other to the router fabric (similar tech, ie. NUMAlink). Thus, any CPU could connect to any other either directly via its local HUB, or via a crossbar link, or via a router link. See:
      www.sgidepot.co.uk/mod_block_diag_server.gif
      This did mean more hops though with Origin2000, but the arch changed with the 3000 series to solve this (along with a modular brick design instead of connected half-racks), resulting in much lower latency penalties for long routes (I think the worst case scenario in a 1024-CPU O3K is 50% latency penalty for the most distant nodes). The design also used an interesting caching mechanism to cope with the situation where data changed by one CPU could invalidate copies held by many others, but that's a whole other thing. There's a lot more nuance to all this of course (see below for refs, PDFs, etc.)
      Note Octane used a simplified chip called HEART, to which the CPUs and RAM are connected, but HEART has just a single link to the crossbar because there's no router fabric.
      For more, see my index pages:
      www.sgidepot.co.uk/origin/
      www.sgidepot.co.uk/octane/
      Note SGI had been planning to scale single-image support with Origin4000 to 37500 sockets (along with IR5 for gfx), but alas, with all the management screwups, loss of staff, etc., that never happened. The NUMAlink tech lives on though; I think HP is still using it as NUMAlink8 or something, giving 64GB/sec per port, though I doubt they'll carry the arch any further.
      A caveat to the awesomeness though: many XIO option cards (such as PCI, FC, etc.) used an XIO/PCI bridge chip, and the early versions of these chips were kinda naff, limiting PCI bandwidth to around 185MB/sec. The boards for O3K were better. Still, I was able to get 600MB/sec from an Octane, which for 1997 is kinda nuts. Not had a chance to try the same thing with my O3800 yet.

    • @mapesdhs597
      @mapesdhs597 2 years ago +1

      @@dercooney I can't speak for AMD or modern markets, but for SGI the cost aspect wrt their target markets was largely irrelevant. One oil company told me their $2M Onyx2/RealityCentre setup paid for itself in *six seconds* (brownie points if you can guess how). Note I was the head sysadmin of a RealityCentre for a few years; an early version, it was a 16-CPU 3-rack Onyx2 with five IR2E pipes.
      Wish I had the time to do vids on my SGIs, but alas YT came along a tad too late for that really. Maybe some day.

  • @abdulsadiq8873
    @abdulsadiq8873 3 years ago +18

    Oh no! The potatoes are multiplying!! 1:40

  • @jooch_exe
    @jooch_exe 3 years ago +15

    Now this is a quality vid, you deserve way more credit for stuff like this.

  • @christopherpetersen342
    @christopherpetersen342 3 years ago +10

    These ideas go back decades and will always hold true. I'm hoping AMD has split the core from the interconnect, so that the core can stay the same but they can change out the interconnect topology at will.

  • @chromos33
    @chromos33 3 years ago +6

    It's been a while since I last heard about the Butter Donut... I think it was a video from AdoredTV some years back.

    • @TechTechPotato
      @TechTechPotato  3 years ago +2

      Link to that video is already in the description

  • @mikemarkong
    @mikemarkong 3 years ago +6

    Amazingly, I understood a lot through your explanation. As a marketing & retail professional, I really don't need much of this. I'm just a really curious gamer. :P

  • @TheLaziestGuyEver
    @TheLaziestGuyEver 3 years ago +3

    Nice vid. I remember watching a video from Jim at Adored some time ago about interposer technology and all the untapped potential. Nice to know that they are finally looking into those options for higher bandwidth / lower latency.

  • @bakedpotato8602
    @bakedpotato8602 3 years ago +3

    Happy to have this video recommended in my feed! Learning a lot more from TH-cam than school :P

  • @johnknightiii1351
    @johnknightiii1351 3 years ago +1

    I do think you are correct that their next step is using an interposer. The step after will be an active interposer with some logic built in; that's when it becomes really exciting.

  • @jtd8719
    @jtd8719 3 years ago +36

    I'm not terribly worried about AMD and their ability to innovate with design with respect to CPUs. The constraint will likely be the capabilities of their foundry partner.

    • @ChrispyNut
      @ChrispyNut 3 years ago +12

      Physical/engineering limitations, always the party-pooper of cool concepts..... until they become the enabler and round we go in the endless loop until a blackhole comes along to keep infinity out of the equation :)

    • @Apocalymon
      @Apocalymon 3 years ago +1

      @@ChrispyNut computers can get only so dense until you start dealing with exotic matter, weird particle effects, & then black holes. I want to see how wild CPUs will get in the future

    • @ChrispyNut
      @ChrispyNut 3 years ago +1

      @@Apocalymon I don't mean we create the black hole, rather that everything eventually ends up in a black hole (figuratively, not literally everything).
      Maybe we see how wild CPUs get all the time, the organic brain.

  • @rougenaxela
    @rougenaxela 3 years ago +5

    20:38 What's not clear to me exactly is why you'd use an interposer for interconnect within a lone chiplet. You can do your butterfly/torus/etc. on regular metal layers without needing to go out to an interposer; it's plenty doable to have signals weave across different metal layers. Is there a shortage of metal layers in the readily available processes? I wouldn't imagine so. Even when you bring multiple chiplets into play, you can design things such that the intra-chiplet links of a big butterfly/torus are in metal layers, while only the inter-chiplet links go into the interposer.

    • @TechTechPotato
      @TechTechPotato  3 years ago +5

      The point of the video is that as you scale to 16 cores a ring doesn't work, so you might want to do an on-die mesh. But even then there are better meshes, so with a one-chiplet interposer the interconnect would be easier to work on independently, or to optimize when it comes to SerDes links.

    • @rougenaxela
      @rougenaxela 3 years ago

      @@TechTechPotato Ah, I see.

    • @callums____
      @callums____ 3 years ago

      @@TechTechPotato Great video! I bet a big contributing factor in what's designed and used in the future will be how well software and core scheduling evolve. From a workload perspective, the number of use cases where more than 8 cores are allocated to a single process/job and inter-core latency is highly important seems very low, even in the enterprise space. In the vast majority of cases I've seen where more than 8 cores are required, it has been a highly parallel, core/thread-independent workload. Therefore, if good NUMA-aware scheduling is being used, I highly doubt there'd be many use cases where the extra connections and overhead of a more complex and expensive architecture would be worth it short to medium term. 4 slower cores in Zen 1 was certainly not ideal for plenty of enterprise and hosting use cases, while the current 8 high-performance, low-latency interconnected cores seems by far the sweet spot.
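The ring-versus-mesh scaling point in this thread can be made concrete with a small sketch (illustrative only, assuming every link costs one uniform hop) comparing average hop counts for 16 cores on a bidirectional ring versus a 4x4 mesh:

```python
from collections import deque

def avg_hops(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                      # BFS from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1        # exclude the src->src pair
    return total / pairs

def ring(n):
    """Bidirectional ring of n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def mesh2d(w, h):
    """w x h 2D mesh with links between grid neighbours."""
    adj = {(x, y): [] for x in range(w) for y in range(h)}
    for x, y in adj:
        if x + 1 < w:
            adj[(x, y)].append((x + 1, y)); adj[(x + 1, y)].append((x, y))
        if y + 1 < h:
            adj[(x, y)].append((x, y + 1)); adj[(x, y + 1)].append((x, y))
    return adj

print(f"16-core ring: avg {avg_hops(ring(16)):.2f} hops")      # ~4.27
print(f"4x4 mesh:     avg {avg_hops(mesh2d(4, 4)):.2f} hops")  # ~2.67
```

At 8 cores the ring's average is still modest, but by 16 cores the mesh roughly halves the average hop count, which is the scaling argument made in the reply above.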

  • @tobiassteindl2308
    @tobiassteindl2308 3 years ago +25

    That talk about interconnection topologies and butterdonuts gave me AdoredTV flashbacks.

    • @ChrispyNut
      @ChrispyNut 3 years ago

      Which is the more appealing accent though?

    • @TechTechPotato
      @TechTechPotato  3 years ago +9

      Link to that video is already in the description

    • @ABaumstumpf
      @ABaumstumpf 3 years ago

      But thankfully, contrary to AdoredTV, this is a channel that does not make shit up all the time and talks about interesting things (that the host also understands).

    • @ChrispyNut
      @ChrispyNut 3 years ago +8

      @@ABaumstumpf Wow, you could be a professional political pundit on Fox News with how wrong you are.
      Jim doesn't "make shit up"; he interprets information to deduce expected outcomes. Sometimes he gets it wrong, sometimes right, but he's clear that what he does has a high margin of error.
      He also talks about interesting things (in this case 3 years before this channel).
      Ian also talks about things he doesn't really understand (in fact, I'm pretty sure I recall him stating as much in this very video about the very topics under discussion).
      Basically, you don't like Jim, you do like Ian. That's cool, but don't vomit garbage everywhere in the hope it blinds us to the truth.

    • @ABaumstumpf
      @ABaumstumpf 3 years ago

      @@ChrispyNut "he interprets information to deduce expected outcomes"
      Yeah no. He just regurgitates stories he got from anywhere without doing even the most basic of checks most of the time. And only very rarely does he do some simple interpolation (can't do much wrong there).
      "Sometimes gets it wrong sometimes right, but he's clear that what he does has a high margin of error."
      In a way - he has a very high error-margin as in - he is about as accurate as saying "at some time in the morning the sun will rise and in the evening it will set".
      "(in fact, I'm pretty sure I recall him stating such in this very video about the very topics under discussion)"
      Well, Ian understands what he understands, and knows what he does not fully understand (or where he is not an expert) - aka the opposite of AdoredTV.
      "Basically, you don't like Jim, you do like Ian. "
      Nope, I just can't stand that AdoredTV is spouting bullshit.

  • @Simengie
    @Simengie 3 years ago +2

    Outstanding video. Really makes you think about what is possible with the Zen roadmap. It seems the real limiting factor will be socket size. Even with the 3D concept you discussed, the fact remains that more cores equals more space. I am sure 3nm will help with the space problem some, but it will not be perfect, nor will it be the only method used. I think on the EPYC side the socket size was planned well out from the beginning and will probably support 128-core products with node shrinks. The new AM5 socket size for consumers is going to tell a lot about where core counts are going in that market.
    Something you did not mention, but which I could see benefitting from the interposer layer, would be a GPU chiplet with its own 3D-layered graphics memory/cache (64-128 MB). It would open AMD up to providing more powerful integrated GPUs. The ability to make an APU would just be a choice of including the chiplet on the CPU package. Until your interposer idea, I understood why this was not done, as the CPU/GPU traffic would use most of the bandwidth through the IO die. The interposer would remove a good portion of the IO die bandwidth being used, allowing a very powerful chiplet-based APU to be produced.
    My thinking here would be that the Ryzen chip would have space for up to three 8-core chiplets on an interposer. The chips with more than 16 cores would be CPU-only. The 16-and-lower core count chips would have space for a GPU chiplet, allowing AMD to produce up to 16-core APUs. On the flip side, they could produce 4/6/8-core APUs with two GPU chiplets for better graphics performance.
    All in all, your take on the third layer in the 3D stack being an interposer opens many doors for AMD. Their chiplet approach has proven to have significant advantages, and I am sure 3D cache is just the beginning of the "3D" nature of future chips.
    Again, great video and insight.

  • @descmba
    @descmba 3 years ago +1

    Now... imagine that the interposer is based on silicon photonics. You want one ring... trivial. Two rings, three rings, fifty rings... it's all a matter of adding another wavelength of light. And it doesn't generate heat, or a magnetic field, or draw more power. I never really understood the significance of this tech until you explained the interposer. I doubt it will make it down to the consumer, but for rack designs this will be huge.

  • @HighYield
    @HighYield 3 years ago +1

    How about vertically stacking not only cache but also cores? Thermal issues aside, a third dimension could open up new interesting solutions.
    Really interesting video, thank you!

    • @TechTechPotato
      @TechTechPotato  3 years ago +1

      Unfortunately it's the thermal issues that are the reasons why that doesn't happen. But 3D topology is basically graph theory, and we have centuries of research there.

  • @jonsquare1248
    @jonsquare1248 3 years ago +5

    Take your double bisected ring and cross the bisects and you now have something that looks like an "infinity" symbol. "Infinity Cache" coincidence?

    • @williambrasky3891
      @williambrasky3891 3 years ago +1

      Interesting insight. You may be onto something.

  • @Michael-OBrien
    @Michael-OBrien 3 years ago +1

    Dr. Cutress, love this type of content. Could you be bothered to discuss the various implementations of SMT, to help us learn why it's practical (aside from ~33% greater bandwidth) and also "pointless" at the same time?

  • @lordrozene
    @lordrozene 3 years ago +1

    When I studied Networking, I never thought I would see the same topology concepts applied to the hardware itself later on.

  • @m_sedziwoj
    @m_sedziwoj 3 years ago +2

    I remember this paper about connectivity; AdoredTV made a video about it. I hope we'll see it in reality, even if only in servers.

    • @TechTechPotato
      @TechTechPotato  3 years ago +2

      That video is in the description and pinned comment; the paper is in the description too.

  • @jouniosmala9921
      @jouniosmala9921 3 years ago +2

      Or they have a central bidirectional ring with minimal physical distance inside the ring, but the longest distance is from the core to the ring. So latency between ring hops could be marginal, but there would be a single significant latency, and that's from the core to the ring.

  • @Cerbereus
    @Cerbereus 3 years ago +2

    Thanks Ian, you took something like the topology of a CPU and made it look easier than it is (of course it was just a glimpse of it). Great work. By the way, the potato always reminds me of a Pringle or Lay's, hahaha, makes me want to buy some.

  • @davidjohnston4240
    @davidjohnston4240 3 years ago +6

    If you like your CPU array, put a ring on it.

  • @uncivil_engineer8013
    @uncivil_engineer8013 3 years ago +1

    Every time I think of CPU interconnects now, I immediately see the words "butter donut" appear in my mind and get hungry. Thanks AMD and Adored!

  • @TheBackyardChemist
    @TheBackyardChemist 3 years ago +6

    What if the "structural silicon" on top of the cores were replaced with another bisected ring connecting the cores on that side? It could allow for 16-core chiplets.

    • @ChrispyNut
      @ChrispyNut 3 years ago +5

      More power, more complexity, more cost.
      Can always do more to get more faster, it's whether it's worth the cost :|

    • @TheBackyardChemist
      @TheBackyardChemist 3 years ago +1

      @@ChrispyNut I would imagine a bottom interposer, making the stack 3 high, would also incur a rise in packaging cost.

    • @ChrispyNut
      @ChrispyNut 3 years ago

      @@TheBackyardChemist You literally made me facepalm and given how sweaty I am, that was really unpleasant!

    • @duckrutt
      @duckrutt 3 years ago

      I think part of the reason they put the extra cache on top of the existing cache and not the cores was heat, so I don't see them putting anything on top of the cores if they can help it.

    • @TheBackyardChemist
      @TheBackyardChemist 3 years ago

      @@duckrutt Well, they are already putting silicon over it, it's just blank. Interconnects use some power, but not huge amounts of it. I think a bigger issue is having to drill TSVs near the cores.

  • @WayneBorean
    @WayneBorean 3 years ago +3

    Nicely done again Ian. I love your deep dives.

  • @hammerheadcorvette4
    @hammerheadcorvette4 3 years ago +1

    "Buttered Donut" was covered by AdoredTV years ago. He was on to something then...

  • @triadwarfare
    @triadwarfare 3 years ago +8

    I think the very reason Intel struggled to get to 10nm in the first place is that the mesh may be too complex, whereas AMD (and by extension TSMC) can make simpler designs like bisected rings and achieve great yields on a smaller node. Hence they were able to release first (3rd gen), then refine it and get better results on the same process node (Ryzen 5000 series).

    • @williambrasky3891
      @williambrasky3891 3 years ago +1

      Solid point. Solid possibility.
      & not that it matters as it pertains to whether this scenario is any more or less likely, but compared w/ the popular narrative that intel's struggles have been a result of complacency it'd be interesting to find the reality was just the opposite; they floundered out of an unwillingness to compromise away from theoretical peak performance (the cutting edge).
      The long term rewards of success w/ such a complex topology would be quite the siren song. The performance possible w/ said approach could likely result in 5+ yrs of virtually assured industry dominance.
      That said, whether sticking w/ such an approach given its power efficiency disadvantages as market trends continued toward greater & greater emphasis on performance ÷ power was a wise decision is a whole other can of worms.

    • @Jaker788
      @Jaker788 3 years ago +9

      The reason Intel struggled with 10nm comes down to a few design decisions:
      1. They went for contact-over-active-gate (COAG), which was way too sensitive to disturbance at fab time, and yields were terrible as a result. COAG was dropped; everyone is now going for gate-all-around, or Intel's SuperFin.
      2. They used quad patterning for the fine-grained details before EUV was available. Way too complicated, with a wide error range; double patterning is the reasonable limit.
      3. They used cobalt electrical channels instead of copper, the reasoning being that at these small scales copper needed more insulation and cobalt would not. However, cobalt is hard and brittle compared to copper, and temperature swings might break or fracture these channels.
      The reasons are ranked by their contribution to the disaster.

    • @williambrasky3891
      @williambrasky3891 3 years ago +2

      @@Jaker788 Interesting. Thanks for sharing.

  • @DwAboutItManFr
    @DwAboutItManFr 3 years ago +9

    I wonder how much faster, subjectively, computers will be in 10 years.

    • @kamilazman2943
      @kamilazman2943 3 years ago +3

      At this point, I wonder when we will reach the limits of physics.

    • @jaredgarbo3679
      @jaredgarbo3679 3 years ago +2

      @@kamilazman2943 The limit of physics is centuries away.

    • @DigitalJedi
      @DigitalJedi 3 years ago +3

      We may eventually hit the limit of digital computing on silicon, but there are advances in quantum and optical technologies, which have a lot of potential.

    • @saricubra2867
      @saricubra2867 3 years ago +1

      It took almost 10 years to double the single-thread performance of a 2600K at around 3.4GHz, and modern CPUs have to clock 30% higher to do it. Kinda embarrassing.

    • @jonnyj.
      @jonnyj. 3 years ago +1

      @@saricubra2867 If you look beyond desktop CPUs, that outlook becomes so INSANELY fucking stupid. Apple specifically, but also all high-end Arm architectures, have insane per-watt single-thread performance. Desktops aren't nearly everything...

  • @mapesdhs597
    @mapesdhs597 2 years ago

    19:30 - Ian, could it be that AMD is using something which is bisecting the ring, but doing so in a dynamic manner? I.e. multiple bisections, but they can change their routing clock to clock, like a nonblocking crossbar?

  • @rcavicchijr
    @rcavicchijr 3 years ago +1

    I'm a big fan of the butter donut. Mostly because Jim from adoredtv did a video on that paper a while back, and I laughed when he kept saying "butter donut" with his accent.

  • @richardskinner6391
    @richardskinner6391 3 years ago +2

    How about a 3 layer stack (as they are doing with VNAND).
    Bottom layer just for inter-core connectivity (potentially on a 12nm process node), middle layer cores etc (possibly with some interconnects, allowing them to "cross" the other interconnects), top layer cache?
    Edit: I should wait until the end of the video before commenting, shouldn't I.

    • @TechTechPotato
      @TechTechPotato  3 years ago +2

      Consider it a ++, you guessed the end before the end! Realistically, most interposers to date are 65nm-ish. Super cheap, super easy to do.

    • @Bellissima2k
      @Bellissima2k 3 years ago

      @@TechTechPotato Does 65nm introduce latency for communication? How come 65nm is acceptable for a mesh interposer?

    • @richardskinner6391
      @richardskinner6391 3 years ago

      @@TechTechPotato So given that cutting-edge processes are the most expensive and have the worst yields: a 3-layer stack with a cheap interconnect die, tiny single-core dies on the latest process, and one big unified cache die covering all of them on a not-quite-cutting-edge process, giving the best yields?

  • @marcopolo8584
    @marcopolo8584 8 months ago

    I have come back to this paper repeatedly as a reference for multiplayer level design.

  • @theevilmuppet
    @theevilmuppet 3 years ago +1

    AMD have made a number of statements regarding the power requirements of the IO die in EPYC - it would be great to know the breakdown of DDR4, PCIe, all those other on-SoC buses, and SerDes in terms of power usage...

  • @bigmaddad7689
    @bigmaddad7689 3 ปีที่แล้ว

    I've been so curious about possible mesh architectures. The way you described AMD's potential 3D multi-chiplet with the (butterfly, torus) mesh interposer was so interesting. I also think there may be another step AMD may take, by arranging a stacked mesh. Having a stacked interposer mesh should allow the latency to be reduced dramatically. The human brain has a stacked mesh, I think. The way we can access simple memories or fine details and skills that we've learned is like stacks on stacks on stacks, etc. Like one core to another core to another core, the mesh stacks could give that similar connectivity. We may never see it in our Zen 6 or Zen(?) PCs, but that must be a path they are trying to achieve in the future, don't you think? I don't imagine the V-Cache will be individually allocated for each core in the future though, just massive cache for every core to access. They are probably heading to where a similar mesh could be used for cache as well. Sorry for rambling. I hope you have a great day and keep up the super content.

  • @cannesahs
    @cannesahs 2 ปีที่แล้ว

    By the way, I don't know if you spotted it, but some document about Zen 4 had weird (not making sense) stuff about IF/CCD/CCX, which indicated a difference from Zen 3.

  • @fowlmouth824
    @fowlmouth824 3 ปีที่แล้ว +4

    I'm merely an observer here, but can a twisted toroidal shape still be described as a 'ring' (for the sake of politricks), but actually present full connectivity with 8 rings twisted into one toroidal shape?

    • @kazedcat
      @kazedcat 3 ปีที่แล้ว +1

      Ring is 1 dimensional interconnect. A Mesh is 2 dimensional . A Torus would be 3 dimensional.

    • @SimonBuchanNz
      @SimonBuchanNz 3 ปีที่แล้ว

      @@kazedcat nope that would be a donut. A torus is just the 2d surface!
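For a rough sense of why dimensionality matters in this sub-thread, here is an illustrative Python sketch (mine, not from the video) comparing worst-case and average hop counts for 16 nodes arranged as a 1D ring, a 4x4 2D mesh, and a 4x4 2D torus:

```python
# Illustrative hop counts for 16 nodes in three topologies:
# a 1D bidirectional ring, a 4x4 2D mesh, and a 4x4 2D torus.

def ring_hops(n, a, b):
    # Bidirectional ring: go whichever way round is shorter.
    d = abs(a - b)
    return min(d, n - d)

def mesh_hops(cols, a, b):
    # 2D mesh: Manhattan distance between grid positions.
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)

def torus_hops(cols, rows, a, b):
    # 2D torus: each axis wraps around, so take the shorter direction per axis.
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    dx = min(abs(ax - bx), cols - abs(ax - bx))
    dy = min(abs(ay - by), rows - abs(ay - by))
    return dx + dy

def stats(dist, n):
    # (worst-case hops, average hops) over all ordered pairs of distinct nodes.
    hops = [dist(a, b) for a in range(n) for b in range(n) if a != b]
    return max(hops), sum(hops) / len(hops)

n = 16
print("ring :", stats(lambda a, b: ring_hops(n, a, b), n))
print("mesh :", stats(lambda a, b: mesh_hops(4, a, b), n))
print("torus:", stats(lambda a, b: torus_hops(4, 4, a, b), n))
```

For 16 nodes this gives a worst case of 8 hops on the ring, 6 on the mesh, and 4 on the torus, with the average falling from about 4.27 hops (ring) to about 2.13 (torus).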

  • @MarcoComercibjt2
    @MarcoComercibjt2 3 ปีที่แล้ว

    I was thinking about your last-minute talk on the 5950X... What about putting the chiplets directly on the IOD via TSVs? A good amount of IOD power is for the SerDes of the chiplets. Maybe the mesh can be put on the IOD, or an interposer can be put between the chiplet and the IOD (the V-Cache can still be present on top)...

  • @leax_Flame
    @leax_Flame 3 ปีที่แล้ว

    I’ve read about this before, but it was talked about as if it were the end of AMD.
    This sounds really interesting and quite hopeful for AMD's future.

  • @alexanderdiogenes8067
    @alexanderdiogenes8067 3 ปีที่แล้ว

    Dr. Cutress doing foul, eldritch sorcery @ 4:22 on the right lol

  • @Arowx
    @Arowx 3 ปีที่แล้ว

    At what level of cache size will it become easier and faster to move the program code and state to the data?

  • @johnpaulbacon8320
    @johnpaulbacon8320 3 ปีที่แล้ว

    Wonderful video. I like all of the technical discussion on how the new chip technology can work.

  • @AlexSeesing
    @AlexSeesing 3 ปีที่แล้ว

    I wonder why no one ever mentions a root-tree kind of topology. It's definitely 3D, and the stem is a multi-lane multiplexer where a 32-way connection can relatively easily be made on each lane. Scale that with multiple stop-gap stems and a CPU finds itself in GPU territory, but with a much more efficient topology. Just a thought.

  • @tringuyen7519
    @tringuyen7519 3 ปีที่แล้ว +3

    I really don’t think that AMD is concerned with interconnect power. TSMC is already working on micro heat spreader designs for your 3D chiplet stack. Basically a die which is all metal with an inner chamber of thermal fluid to dissipate heat.

    • @fataliity101
      @fataliity101 3 ปีที่แล้ว +1

      It's not just about power; it's also latency. What Ian is proposing would essentially make the 6 chiplets on each side into a mega-chiplet latency-wise: 48 cores all 1 hop away, instead of a hop to the IO die, across the die, and then a hop to the correct chiplet.

    • @Steamrick
      @Steamrick 3 ปีที่แล้ว

      I think it's less about heat dissipation and more about efficiency...

    • @مقاطعمترجمة-ش8ث
      @مقاطعمترجمة-ش8ث 3 ปีที่แล้ว

      I was thinking of exactly such a thing while watching this video.

  • @reinerfranke5436
    @reinerfranke5436 3 ปีที่แล้ว

    Active interposers are also lithography (reticle) field limited. Stitching, as done for CIS (2x for full-frame and 4x for medium-format sensors), is difficult. Array stitching with yield tolerance is the right choice for the active communication interposer. 32nm instead of 65nm is better for >10 GBaud serial links, because the Ft peak is there.

  •  3 ปีที่แล้ว

    Maybe Ryzen's hops are not stops. Maybe transmission is more like a broadcast on a shared bus: the whole bus becomes busy, and only the sending and receiving cores are interested in the data. When you put shortcuts in a ring bus, you can temporarily cut one ring bus into two sub-buses, allowing two simultaneous transmissions. Add bidirectionality and the ring bus gets even more throughput. I don't know, just guessing.

  • @katietree4949
    @katietree4949 3 ปีที่แล้ว

    Crossbar reminds me of cross-connect in physical telecommunications copper pairs.

  • @JakeDownsWuzHere
    @JakeDownsWuzHere 3 ปีที่แล้ว

    love your content, man. i feel like i'm taking an intro to electronics engineering course. Thanks for putting these videos together! :)

  • @kwamepalavin8405
    @kwamepalavin8405 3 ปีที่แล้ว

    With multiplexers, all you need is a one-time broadcast from all cores to every other core to find each one's location on the 'ring', then store that location as an address on the 'ring' so you can write to that address.

    • @Hugh_I
      @Hugh_I 3 ปีที่แล้ว

      That would essentially be a shared bus used by all cores. The problem with that, I suspect, is that you get contention when multiple cores need to talk to one another; you need to sync and incur stalls when the bus is in use, and the overall bandwidth is smaller than with lots of small interconnects. Also, you may get into trouble with your fabric clock speeds, since the signal needs to go all the way around in long traces, instead of a number of short, fast hops.

  • @andytroo
    @andytroo 3 ปีที่แล้ว +1

    A 4-core fully-connected topology has 3 connections at each core; a ring only needs 2 connections at each core. I'd guess that there are cross-connects on the ring using those extra interconnects.

  • @SaveTheRbtz
    @SaveTheRbtz 3 ปีที่แล้ว

    IMHO, as scale increases, CPU architects will take more and more pages from the network/datacenter engineering playbooks. I would imagine after some number of cores they'll switch to either a Clos topology (through multiple levels of switches), or maybe they'll go straight to a Dragonfly topology.

  • @saf-riz
    @saf-riz 3 ปีที่แล้ว +1

    brilliant. Linus should watch this video.

  • @Ianochez
    @Ianochez 2 ปีที่แล้ว

    Yes, the interposer can be an interconnectivity interface, although it might not be, because they showed a 3900XT3D at an event this last August, and therefore the added connectivity wasn't used in the 3000-series baseline. My question is: how did they engineer the unused connection points (I remember hearing that those connections were already there) to be compatible with both the Zen 2 and the Zen 3 architectures? Or the L3 cache installed on the 3900XT3D and that on the upcoming 5800X3D aren't the same. They shouldn't be.

  • @goodiezgrigis
    @goodiezgrigis 3 ปีที่แล้ว +1

    Excellent explanation and logical speculation. 👍
    I would just add that if there are smaller steps on the way to a full interposer layer, we will see all of them before we get the full mesh one. It is just business for AMD and Intel; baby steps make big $$$.

  • @YtterbiumUK
    @YtterbiumUK 3 ปีที่แล้ว

    How do Intel chips perform close to AMD with much less cache?

  • @Alex.The.Lionnnnn
    @Alex.The.Lionnnnn ปีที่แล้ว

    I know this video is old now, but is packaging the future of IC performance development?

  • @glenwaldrop8166
    @glenwaldrop8166 3 ปีที่แล้ว

    6:00
    It seems like a link between 1 and 16 and 4 and 12 would significantly improve latency while only marginally increasing power usage.
    Kinda surprised that isn't a factor in their mesh design.
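The exact stops this comment refers to depend on the numbering in the video's diagram, but the general effect of adding cross-links ("chords") to a ring is easy to sketch; the nodes 0-15 and the chords (0,8) and (4,12) below are purely illustrative:

```python
# Illustrative sketch: effect of adding two cross-links ("chords") to a
# 16-stop bidirectional ring. Node numbering and chord choice are hypothetical.
from collections import deque

def avg_and_max_hops(n, extra_edges=()):
    # Adjacency: ring neighbours plus any extra chord links.
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for a, b in extra_edges:
        adj[a].add(b)
        adj[b].add(a)
    total, worst = 0, 0
    for src in range(n):
        # BFS from src gives minimum hop counts to every other node.
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        worst = max(worst, max(dist.values()))
    pairs = n * (n - 1)
    return total / pairs, worst

print("plain ring :", avg_and_max_hops(16))                     # avg ~4.27, worst 8
print("with chords:", avg_and_max_hops(16, [(0, 8), (4, 12)]))  # both drop
```

With those two illustrative chords, the worst case drops from 8 hops to 5 at the cost of only two extra links, which is the intuition behind the comment.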

  • @dinokknd
    @dinokknd 3 ปีที่แล้ว

    This vid has made me subscribe. thanks mate, was really informative!

  • @BenjaminRonlund
    @BenjaminRonlund 3 ปีที่แล้ว

    Please use a mic filter or use a sound filter in your editing to help the audio sound cleaner.

    • @TechTechPotato
      @TechTechPotato  3 ปีที่แล้ว

      RTX Voice, then audacity background noise removal. I work and can only record in a noisy testing lab.

  • @erejnion
    @erejnion 3 ปีที่แล้ว

    I actually wonder if AMD will go above 8 cores per chiplet anytime soon. If they can make 25mm² chiplets, they will happily do it, I bet. And if I remember correctly, Adored was talking about a research paper on how 4- to 8-core chiplets are the golden mean.
    And then, of course, you get to have 24 really small chiplets interposed on a large I/O connectivity die.

  • @Hostilenemy
    @Hostilenemy 3 ปีที่แล้ว +1

    How do these ring configurations translate into FPS in Crysis?

    • @TravisFabel
      @TravisFabel 3 ปีที่แล้ว +1

      You just buy the cheapest processor you find on sale, and then slap in a good graphics card.

  • @MrMysticphantom
    @MrMysticphantom 3 ปีที่แล้ว

    DO A VIDEO/VLOG ABOUT YOUR SETUP

  • @denvera1g1
    @denvera1g1 3 ปีที่แล้ว

    Honestly, look at that demo TSV cache 5900x. That additional cache was sooo much larger than the existing cache, even though the die area was the same, could this be because the original cache was also the bus? Maybe it was 3D

    • @kazedcat
      @kazedcat 3 ปีที่แล้ว +1

      No, they are using a cache-optimized process; that is why it is denser. The thing with SRAM is that the cell size is relatively fixed, but the supporting logic (how often you need sense amplifiers and signal boosters) changes the density. For a CPU, the metal layers have to connect a lot of things, which means you are constrained in connecting up all the SRAM cells for your cache. But with a separate cache die, you have all the metal layers just for connecting the cells. This would also mean you can reduce the needed support logic by properly designing the wires to have less noise and reduced impedance.

  • @MikaelKKarlsson
    @MikaelKKarlsson 3 ปีที่แล้ว

    Taters sliced. Donuts buttered. Onion rings cooking. Eagerly waiting for the recipe to say when to add the salt.

  • @e2rqey
    @e2rqey 3 ปีที่แล้ว

    So I "attended" Hot Chips this year, but was disappointed that I didn't get any semiconductor company swag. Any advice on how to get T-shirts and other stuff from these kinds of companies?

    • @TechTechPotato
      @TechTechPotato  3 ปีที่แล้ว +1

      If you watched the synopsys talk, there was a link to a free t-shirt. Intel had a small contest that was easy, and a t-shirt there. Otherwise this year was devoid of swag compared to last year

  • @deilusi
    @deilusi 3 ปีที่แล้ว

    Well, a ring on the core chiplet, and then an interposer with extra spaghetti to fix those >4-jump averages, or ideally all of them.
    That middle layer sounds a lot to me like "Infinity Fabric".
    I guess having something like Intel had, those "doubled" connections that can be disabled for power savings, would work, but then this truly becomes a spaghetti tornado.
    vcache vcache vcache vcache vcache vcache vcache L3
    interposer (as well as passthrough to all cache for all chips)
    chiplet chiplet chiplet chiplet chiplet chiplet chiplet chiplet
    interposer
    ^ this is cake I want for my next build.

  • @Patrick73787
    @Patrick73787 3 ปีที่แล้ว +1

    Very informative. Thanks.

  • @falconeagle3655
    @falconeagle3655 3 ปีที่แล้ว +2

    Jim from Adored TV talked about this mesh paper and this butter donut 1 or 2 year ago. AMD filed a patent back then.

    • @TechTechPotato
      @TechTechPotato  3 ปีที่แล้ว +3

      Yup, I linked his 2018 video in the description. This goes a step beyond with the new info from AMD since then.

    • @falconeagle3655
      @falconeagle3655 3 ปีที่แล้ว +2

      @@TechTechPotato Your analysis is really thorough. Jim's was more about what the possibilities are. From your video here I am kind of convinced that those complicated pathways might be the way to go. No more easy connections like a ring or mesh.

  • @fVNzO
    @fVNzO 3 ปีที่แล้ว

    If you want more info on the paper in question, Adored has an old video about the "butter doughnuts" and other core topologies I can recommend watching.

    • @TechTechPotato
      @TechTechPotato  3 ปีที่แล้ว +1

      It's almost as if I didn't already link to the video in the description....

    • @fVNzO
      @fVNzO 3 ปีที่แล้ว

      @@TechTechPotato I don't read descriptions; few people do. Luckily I'm here to help everyone who doesn't :). You learn the lessons of YouTube the hard way. Keep it up.

  • @TheNubaHS
    @TheNubaHS 3 ปีที่แล้ว +1

    Just wanted to say this lecture also applies to networking :v

  • @kayaogz
    @kayaogz 3 ปีที่แล้ว

    Isn’t Intel doing a similar approach with the tile-based design with interposer on the bottom connecting tiles?

    • @Jaker788
      @Jaker788 3 ปีที่แล้ว +1

      Their approach is more of a simple mesh. Their bridge tech is just point-to-point with no interposer; each chiplet will be connected to another just like a mesh. It's the same tech they used to connect an AMD Vega GPU to their CPU for a NUC.

  • @genstian
    @genstian 3 ปีที่แล้ว +1

    Wouldn't it just be a lot smarter to REQUIRE less interconnectivity? For more cores you could just deal with groups: say you have a future 128-core CPU, it could be a 4x32-core or 8x16-core design with some inefficiency instead. Kinda like NUMA.

    • @Hugh_I
      @Hugh_I 3 ปีที่แล้ว

      Isn't Epyc Rome/Milan essentially that already? You have 4-core (Zen 2) or 8-core (Zen 3) CCXes that are internally interconnected, and each CCX (group) connects through the IO die to the other CCXes.
      The question, I guess, is: would it make sense for future AMD designs to somewhat directly connect the CCXes to each other as well, for lower inter-CCX latency? And in what topology? Thus making a trade-off between latency and the number of interconnects required.

  • @jwo7777777
    @jwo7777777 3 ปีที่แล้ว

    Reminds me of public access channel quality.

  • @AgentSmith911
    @AgentSmith911 3 ปีที่แล้ว +1

    When are we getting CPUs that are using photons to communicate instead of electrons? I've heard about it for decades lol

  • @ashutoshmishra372
    @ashutoshmishra372 3 ปีที่แล้ว

    How do you verify the latencies from one CPU node to another CPU node, and from one CPU node to the DDR memory? Do you use some tools to generate traffic which can stress the fabric (the ring or the mesh)? I am assuming you do all this testing at a post-silicon level.

    • @kazedcat
      @kazedcat 3 ปีที่แล้ว

      They have a program that tests thread-to-thread latency. The program ping-pongs data between threads, and they can measure the performance at varying ping-pong levels.
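A minimal sketch of that ping-pong method, using two Python threads bouncing a signal back and forth (real core-to-core tools pin threads to specific cores and bounce a cache line with atomics; Python's own overhead dominates here, so this only illustrates the measurement structure):

```python
# Minimal ping-pong latency sketch: two threads hand a signal back and forth,
# and the round-trip time is averaged over many iterations. Python overhead
# dominates, so this shows the method rather than real core-to-core latency.
import threading
import time

ITERS = 10_000
ping, pong = threading.Event(), threading.Event()

def responder():
    for _ in range(ITERS):
        ping.wait()      # wait for the other thread's signal...
        ping.clear()
        pong.set()       # ...and answer immediately

t = threading.Thread(target=responder)
t.start()

start = time.perf_counter()
for _ in range(ITERS):
    ping.set()           # send
    pong.wait()          # wait for the reply
    pong.clear()
elapsed = time.perf_counter() - start
t.join()

# One round trip = two one-way handoffs between the threads.
print(f"avg round trip: {elapsed / ITERS * 1e6:.1f} us")
```

Running the same loop with the two threads pinned to different core pairs is what produces the core-to-core latency heatmaps seen in reviews.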

  • @Teatime4Tom
    @Teatime4Tom 3 ปีที่แล้ว +1

    I'm positive this video will help me understand the Anandtech article I just read.
    Mmmmm... butterdonuts.

  • @stennan
    @stennan 3 ปีที่แล้ว +1

    Any thoughts on how this kind of 3D interposer could work for a big.LITTLE setup? I guess the issue with stacking stuff on top of the cores is that it would become an extra layer for the heat to travel through before reaching the heat sink.

    • @UNSCPILOT
      @UNSCPILOT 3 ปีที่แล้ว

      Makes you wonder if they can integrate some sort of cooling into the dies themselves to quickly reject heat through the layers; micro heat pipes maybe, or something more clever such as graphene or carbon nanotubes. Sooner or later those will probably be part of the semiconductors, but they also have good heat transfer, if I remember correctly. I guess the only trick is how to arrange them vertically through the horizontal layers of chiplets.

  • @CharcharoExplorer
    @CharcharoExplorer 3 ปีที่แล้ว +1

    Do it via the big D cache!

  • @Pharesm
    @Pharesm 3 ปีที่แล้ว +1

    Zen3 - Lord of the Rings... Tolkien already knew.
    The ring of rings to enslave them all...

  • @bfish9700
    @bfish9700 3 ปีที่แล้ว

    This went way above my head really fast.

    • @TechTechPotato
      @TechTechPotato  3 ปีที่แล้ว +1

      Start with section 1. The idea is to go slowly into the topic of simply 'how do we connect things' and built from there.

  • @theophilusthistler5885
    @theophilusthistler5885 3 ปีที่แล้ว

    What path will AMD take with its upcoming embedded "ultra efficient" CPU/SOC/...FPGA?

  • @davivify
    @davivify 3 ปีที่แล้ว +1

    hopcount is important. Especially if you're brewing beer.

    • @DerIchBinDa
      @DerIchBinDa 3 ปีที่แล้ว

      Here take your upvote!

  • @thelongslowgoodbye
    @thelongslowgoodbye 3 ปีที่แล้ว

    Zen 3 kind of resembles that IBM chip from a previous video.

  • @wordlv
    @wordlv 3 ปีที่แล้ว

    Boy! Ian is a Boss!

  • @GabrielFoote
    @GabrielFoote 3 ปีที่แล้ว

    Absolutely love this channel. Thank you much, sir

  • @blakeliu3713
    @blakeliu3713 3 ปีที่แล้ว

    Can I subscribe twice? Bc this is awesome!

  • @tomdchi12
    @tomdchi12 3 ปีที่แล้ว +4

    Miss Aligned ButterDonut(X) is my new stripper name.

  • @TechLevelUpOfficial
    @TechLevelUpOfficial 3 ปีที่แล้ว +2

    okay those are some cute slides i must say lol

  • @infango
    @infango 3 ปีที่แล้ว

    So how does CPU ring theory correlate with Star Wars ring theory?

  • @TheRealSwidi
    @TheRealSwidi 3 ปีที่แล้ว

    Mind the hops. Cheers.

  • @biosgl
    @biosgl 3 ปีที่แล้ว

    So AMD is to remove L3 from the chiplets altogether, and stack cache on top? Do it!

  • @juancarlospizarromendez3954
    @juancarlospizarromendez3954 3 ปีที่แล้ว

    Not mentioned here: the hypercube topology.

  • @eclipsegst9419
    @eclipsegst9419 3 ปีที่แล้ว

    MCM is basically a means to an end, no? But monolithic is the superior end goal. It's been that way since what, the Pentium D?

  • @Ichi.Capeta
    @Ichi.Capeta 3 ปีที่แล้ว

    time to buy more shares :O

  • @divexo
    @divexo 3 ปีที่แล้ว

    Love the content well done. Well researched

  • @scottr7683
    @scottr7683 3 ปีที่แล้ว +4

    Didn't watch the video yet but...
    misaligned butter donut please?

  • @gedw99
    @gedw99 3 ปีที่แล้ว

    My head is ringing after all that .
    It’s turtles all the way down - the internet is a set of tubes and so are chips apparently