It’s mostly cable manufacturers scamming the customers. The standards and tolerances needed for these PCIe SFF cables are widely known but in the past they mostly got away with manufacturing actually-lower-than-advertised quality cables and stuff in general still worked (up to PCIe Gen3 is relatively “easy”). But these tolerances become tighter and tighter with the faster generations which then leads to this house of cards collapsing. Even if you have nothing to do with PCIe cables (which is understandable since it is a niche volume-wise, especially outside of data centers), it’s a similar situation to USB 5, 10, 20 & 40 Gbit/s, HDMI 2.1 48 Gbit/s or DisplayPort 2.1 cables but with drastically higher prices per cable which in turn of course attracts scammers that smell quick profits.
@@dieselphiend Well, the adapter exists. More optimal would still be to use the daughterboards for 4 SXM modules that connect to the motherboard via Cool Edge connectors; the Dell and Supermicro ones would probably work with minimal challenges.
AMD was working with Vega to use video RAM more efficiently. They found out that games often took twice as much as they needed. What happened with that research, and why does it seem nobody uses it? Also: why do GPUs lack an M.2 slot? Drivers could use the flash drive to store data like textures for multiple games and programs.
I'm hoping/liking that server capabilities are trickling down into the consumer market. I'd be happy to have a big unified chip that contains both the GPU/CPU and a single BGA HBM/RAM card (they can figure out how to do a giant shared cache on an active interposer). Anything that can slim down and simplify technology in a user-facing manner is a win to me. Right now I'm digging the AM5 server motherboards and want to see 4-node 2U configs, which would make sense to me since that config limits PCIe I/O. All I want is good IPMI and ECC RAM. These changes in computing have been a long time coming and I'm excited to see the future of it.
Sounds good, as long as you never need to upgrade or replace a component. Gpu goes bad, replace the whole unit. Need to upgrade the cpu, replace the whole unit. And the cost of buying/replacing a whole unit would be on par with buying an entire computer system, as it basically will be.
@@chrisbaker8533 True. I've had GPUs burn out on me, but never CPUs. Then there is my favorite. Did my PSU burn out my GPU or was it over temp? People who constantly upgrade their computer would find the new scheme annoying. Think of it like a big APU and big hunk of unified Memory instead.
Not certain here, so don't quote me on this. I use Gentoo, and last time I was in the memory management section of my kernel config, I could have sworn that I saw drivers for RAM over CXL, as well as a hotplug driver for it. Right near 5-level page table support, I think.
I read about that a while ago, so no wonder it is already there. One more thing: in my opinion this man is doing pure hype and somehow keeping too much detail aside. PR. 😂
I have this notion that in the future we're going to move to a more distributed hardware architecture. As Wendell mentioned, the problem has always been software. We can build amazing computers, and we absolutely need to, because our software gets slower at about the same rate as computers get faster. But *if* we eventually get good at writing parallelizable code, the trick is probably to just have a bunch of tiny cores tightly coupled to very little memory, clustered together in a larger computer. Right now, the compute is attached to various kinds of memory via a single fast big memory bus, but I think "in the future" the computers will just be a bunch of tiny memory cells tightly coupled to a local CPU in a "networked" cluster, leaving a lot of the memory, cache optimizations etc. (the stuff we're tweaking right now in hardware) to software.
To move to the vision you outlined, the new product/architecture/vision needs to beat the legacy product by a significant degree, and it is that legacy product that killed all of the new visions (VLIW, ARM, RISC, etc.).
@@EdDale44135 OK, first of all, all the architectures you listed are in fact alive and well: VLIW/TTA architectures are the cool stuff for the LLVM-based OpenASIP, and ARM is more popular than ever, probably the most common CPU architecture outside of servers, and is also RISC. Not to mention RISC-V is already making some strides in the embedded world. So non-x86 CPU designs are alive and well (IMHO). Second, yes, you're going to need to be better at something with a new CPU design if you want to gain market adoption. But you don't have to be the best at everything right away. And I think coupling storage to compute is probably more efficient (faster and/or more energy efficient). The problem, as stated before, will be the toolchains etc. (software) needed to get useful (non-benchmark) work out of these chips.
Serious question: When do you think we will get a Grace/Hopper or M100 style 'all-in-one' home computer? APUs are already very well established, Apple M chips seemingly doing well. I want an Nvidia/Arm all in one PC to game on.
The thing is, those companies don't want you to buy one item. They want you to buy at least two. That's why the performant CPUs and GPUs are separated.
On a more serious note, could this architecture benefit non-AI computing? I am thinking of the huge boost AMD got from its 3D V-Cache, which helped gaming and even running virtual machines.
There are a number of workloads that X3D cache speeds up. Phoronix did a deep dive on it, so I'm guessing the same would apply here. Like Wendell said, the sticking point is when you need more memory than is on the SoC, which CXL could address. It will be interesting to see how this all shapes up over the next couple of years, no question there.
I’m excited to see more unified memory options on the market as it’s the future bottleneck for one of our fintech projects. We can scale the compute easily enough, but once the dataset passes a certain size we become VRAM constrained and all GPU performance advantages are lost. It’s one of the reasons we didn’t put the CUDA version into production and kept with the parallel scalar approach instead. It’s a bit slower, but much cheaper & more scalable to split the work around CPU cores where we can access the memory we need. I’m curious about how the GPUDirect Storage API can help. It allows a CUDA app to directly stream data from fast NVMe drives without having to loop it through system memory. For the right use case, VRAM caching NVMe can be faster than VRAM + system memory. We have a few years of scaling our products on CPU before we need to take the GPU route. I’m hoping there will be more options by the time we need them.
I shudder to think what situation would cause system DRAM, with 3 orders of magnitude better latency than NVMe and an order of magnitude more bandwidth, to perform worse than an NVMe drive at streaming data into a GPU. I suppose the direct storage API would reduce CPU overhead though.
@@paulblair898 aye. It’s the single threaded setup and deconstruction process which hurts the most. By separating the data path from the wrap calls a lot of map-reduce type tasks can run faster as it moves the bottleneck away from the CPU. For our application which streams data (a few billion objects per go) we’re compute rather than bandwidth limited. If the NVMe can feed the GPU fast enough during the parallel map phase we get a performance boost from the CPU being less loaded. We can also preserve the VRAM for the smaller results dataset which can fit in memory for a fast pass at the reduce phase of the process. Performance stuff is really weird sometimes. We’re not using this path in production yet, but it’s nice to know we can scale this way if we need to. It was a fun R&D project to pull together the solution using the NVIDIA RAPIDS framework.
How will the power consumption be affected by that? I guess storing in SLC NAND will probably require more power than in HBM/DDR memory, won’t it? Whether that will be a significant change or not I am not sure, but I am quite sure that power consumption will limit the compute power quite soon.
I would really love to see an in-depth analysis of Qualcomm's new X Elite chip, especially its GPU. There is a decent amount of information about its CPU out there but very little about its GPU, and its GPU is pretty decent for a first-gen integrated GPU.
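A rough sketch of the streaming map-reduce shape described above, in plain Python with no GPU or NVMe involved. Everything here is hypothetical for illustration (`chunks`, `streamed_map_reduce`, the chunk size); the point is only that the input is consumed in bounded batches while the small running result is the only thing kept resident. It assumes the reduce function is associative and that `initial` is its identity.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield lists of at most `size` items, never holding the whole input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def streamed_map_reduce(source, map_fn, reduce_fn, initial, chunk_size):
    """Map each item, reduce within a batch, then fold batch results together.

    Assumes reduce_fn is associative and `initial` is its identity element.
    """
    total = initial
    for batch in chunks(source, chunk_size):
        partial = initial
        for item in batch:
            partial = reduce_fn(partial, map_fn(item))
        total = reduce_fn(total, partial)  # only the small result stays resident
    return total

# Sum of squares of 1..10, streamed four items at a time.
result = streamed_map_reduce(range(1, 11), lambda x: x * x,
                             lambda a, b: a + b, 0, chunk_size=4)
print(result)
```

In a real pipeline the inner loop would be the parallel map phase on the accelerator and `source` would be the storage stream; this sketch only shows the data flow.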
I have it on good authority that the dual GPU in Apple's M Ultra has abysmal results in compute precisely because Apple presents a single unified GPU as opposed to two separate ones. The cost is in maintaining coherency between the two GPUs.
What keeps playing in my mind: "All non-trivial abstractions are, to some degree, leaky." Yes, the idea is to abstract away the inherent system architecture so that you don't need to focus on optimizing for it so much. But what that actually sounds like to me is that I'm going to have to go under the hood to do the optimizations. I'm going to have to work around the abstractions.
It's not just AI and scientific programmers who work on UMA platforms; desktop/end-user-facing programmers and game developers do too (Apple announced more native game ports at WWDC24). That makes it likely that UMA, with fast and slow memory, will be the future going forward.
Can totally see MI350X with 384G of HBM3E. ChatGPT 4o Sky needs very low latency but also higher sophistication. Developers will brute force it until they can’t anymore!
Interesting discussion. I think we really need an open source implementation of a GPU memory paging system, one with a flexible number of tiers, so we can pull weights from SSD, to CXL, to system memory, and onto the accelerator as they're needed. I believe that modern GPUs have MMUs, so if it's possible to register a page fault handler, then you could build a system like this. Such a system would enable LLM inference on much lower end hardware, and even with the paging overhead, it will most likely have drastically lower latency than connecting to a cloud service (the speed of light is indeed a bitch).
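To make the tiering idea concrete, here is a toy sketch in Python. It is not a GPU paging system: `TieredStore`, the tier capacities and the layer names are all made up for illustration, and a real implementation would hook an actual page-fault handler rather than a `get` call. It only shows the bookkeeping: tier 0 is the fastest, least-recently-used entries are demoted one tier down, and an access promotes the entry back to tier 0.

```python
from collections import OrderedDict

class TieredStore:
    """Toy multi-tier store: index 0 is the fastest tier, the last is the slowest."""

    def __init__(self, capacities):
        self.capacities = list(capacities)
        # One LRU-ordered dict per tier (oldest entry first).
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            if level + 1 == len(self.tiers):
                raise RuntimeError("slowest tier is full")
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_it_was_found_in) and promote the entry to tier 0."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return value, level
        raise KeyError(key)

# Hypothetical sizes: 2 slots on the accelerator, 2 in system/CXL memory, 100 on SSD.
store = TieredStore([2, 2, 100])
for i in range(6):
    store.put(f"layer{i}", i)

first = store.get("layer0")   # oldest entry, has been demoted to the slowest tier
second = store.get("layer0")  # now resident in the fastest tier
print(first, second)
```

The second access finding the entry in tier 0 is the whole point: the cost of the slow fetch is paid once, then amortized while the entry stays hot.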
I did not fully get the concept of the MI300A. I cannot see how a unified memory space could work effectively, if that's what it's trying to do. The x86 chip would slow down the GPU so much if dynamic workloads were run, and for non-dynamic algorithms, where is the benefit over the traditional system? There might be some tiny response-time gains, but that's up against 50 years of software development with the traditional memory architecture (software is 99% about memory).
I hope we see more UMA on the x86 side too, in consumer products. I don't know what type of work the viewers here are doing, but as an ordinary software engineer I can say that in a lot of companies the developer machine you get is (try to guess) a Windows laptop with Intel, or a Mac, and if you don't want to be in the Mac ecosystem you spend a lot of time building. Not to mention that if you want to experiment with LLMs at work and you are on Intel, you're probably better off not trying. With the new Intel SoCs it may be better, but it takes time for big companies to change developers' hardware.
Perhaps foresight or perhaps luck, but the unified memory architecture is pure win for Apple. Lower power, better performance, easier to put together a crazy combination of different SKUs, including what's in their data centre. Want sustained performance? Add a fan. Nice.
The DDR5-6-7+ progression is simply insufficient. With the rise of integrated GPUs, NPUs, and even new AI workloads on CPUs, it just does not make any sense. The only reason we ended up with such slow system memory in today's products is that it was simply good enough for yesterday's workloads, but things have changed. The companies that don't get this will be scrambling to catch up in the near future.
But it does matter!!!! If you actually build kernels, you need to be aware of UVM, as under the hood it is just hardware-based page faults, which are slow. Sure, we don't have to explicitly malloc, but we still need to be aware of what needs to be prefetched etc. Hell, sometimes we need a few SMs to do data transfers exclusively. Anything going over interconnects needs to be handled with even more care, like when using MPI or NVSHMEM. The architectural changes bring more complexity to the programming model, not less, by any measure.
Not sure I agree with "slow" and "fast" system memory. More like "GPU memory" and "CPU memory". In the classic CPU example the difference is the "interconnect" speed vs the "memory bus" speed. This kinda feels like NUMA all over again, except when using unified memory like the M-Series and now the MI300. CPUs solved the throughput problem with caches, so they're not geared for high-bandwidth memory. I'm not sure how much of a core redesign it would take to remove the caching and have a CPU attached to memory like a GPU and still have branch prediction etc. work. Maybe this is the holdup?
Pfff, the speed of light issues can be easily solved, just migrate to photon computing where the state of multiple photons is linked via quantum entanglement instantly no matter the distance… Come on, people! ;-)
On unified memory: from a CUDA programming perspective, tools built for this kind of practice have been around for a decade (although apps never use them, AFAIK). I would speculate they are just trying to deliver what customers want rather than going for a more aggressive, fully unified structure, until some really impactful real-world application demands it. In that case they would be able to deliver it as a "significant improvement", because they already have an arch close enough for applications to be deployed and compared against the traditional arch. As of today, that "impactful application" can't be AI. The Transformer arch, including its variants, can't really benefit from the unified arch, at least not in an obvious enough way, in high-density training environments.
As a gamer it's interesting that your video applies to gaming too 😂 With server farms moving to 3-year life cycles, could Nintendo buy Grace Hopper or newer servers as companies move to newer architectures, and build cloud services on the cheap, enabling streaming to Switch devices?
There is a lack of real investment in the market. We are seeing "gimmicks": companies soldering chips, changing memory formats, preventing upgrades, etc., all to avoid addressing the basics. They have been repeating the same recipe for decades and are hitting all kinds of limits. The future is total memory abstraction (IMO): unified memory and separation between slow and fast memories in CXL, all at the same time. IMO on the CPU, with some level of HBM3e for "server" products, you get total freedom of dissolving memories: since it "speaks" OMI, your chip, hardware etc. doesn't matter (cases like the HMC, Hybrid Memory Cube, memory would not have fallen by the wayside). Do you want to use HMC/HBM3e on a memory stick? Feel free. Mixing DDR4/5 and GDDR all at the same time? OK. 5 TBs of memory access? OK. A chiplet-level cache, and so on. There is a lack of courage. Who would have thought IBM's "dinos" were at the forefront of the "innovative" mainstream PC/server market?
Programming a GPU should be as simple as having the same code on the CPU. No hoops for programmers, devs, whatever. In GPU land, optimization at the code level does not matter; it's then only down to the algorithm you are running. Only. Yep, the GPU is designed to run everything massively parallel, so that's not the main concern of normal CPU programming. Yep, you have to transform the algorithm to work in a massively parallel format: not even CPU AVX vector code, GPU vector code. Well, if you make a smart compatible vector instruction compiler for CPU/GPU, then you don't have to recompile to get the same vector code to run on the GPU directly. Make universal vector computation code. Universal compute shader code. That you already have. Yep, both GPU and CPU can run it, in vector mode. Why bother with any other code format than compute shaders for vectorized compute units? NPU, TPU, GPU, CPU, all running the same code. All the janitor code around the core compute is secondary. Yep, also make the CPU act as a single GPU concerning vectorized compute code. Compute shaders. The more you spend in secondary janitor CPU glue logic, the less you are doing the main thing. Say, ray tracing and getting the data to be traced: which one do you think is the main thing? Of course, getting the massive compute running, running the main task. Don't make the janitor task the focus. With unified memory, you don't need to manage any of those memory things as the program dev. In other words, Python NumPy runs every vectorized load on anything, but instead of that, at the low level, make the processor cores run the same loads directly. In addition to the unified memory, it then really does not matter where the compute resources come from, i.e. where it's being run. Yep, really flexible compute execution. Why do you need the slow memory if you already have enough high-speed memory, i.e. GDDR6/7 instead of DDR5? No DDR5 at all, or minimal with the CPU boot package. All the OS nonsense etc.
buddy where else are you gonna get a gpu with 128gb of vram, oh wait that's the new macbook sorry (no seriously though that might be the best ai platform out there)
Have AMD or Nvidia produced a counter to Intel's Xeon CPU Max Series? "Maximize bandwidth with the Intel® Xeon® CPU Max Series, the only x86-based processor with high-bandwidth memory (HBM). Architected to supercharge the Intel® Xeon® platform with HBM, Intel® Max Series CPUs deliver up to 4.8x better performance compared to competition on real-world workloads, such as modeling, artificial intelligence, deep learning, high performance computing (HPC) and data analytics"
What makes Blackwell what it is is not NV's arch, it's TSMC. The same applies to AMD: 3D cache is TSMC tech, not AMD's. AMD could never make 3D chips; they don't have the money or the fab. Design companies are design companies; only real chip makers have foundries, and the foundries design their own 3D chips. TSMC is happy to rent their tech to design houses. The thing is, if AMD switches fabs they lose the 3D chip unless they go with Intel or Samsung 3D. I have heard nothing on Samsung 3D cache. At any rate, what people believe is AMD/NV tech is not their tech, it's the fabs', insofar as 3D chips go.
If AI is what you types claim, the software should not be a problem. When NV crashes, and it will, because it's going to come out that AI is for the most part overhyped, and when millions lose in the market, what do you people say? Oops? That will get you bullets. If Copilot is what it's all about, it's already a fail: just ask AI hard questions and the only answer you will get is pure BS, because its info relies on men inputting correct info and most people cannot do that, period. So it is useless, and if democrat programmers are involved, now it gets scary.
AMD really squandered a lot of goodwill when it comes to local AI. They’re more interested in Nvidia taking a larger share and keeping prices high than they are in increasing the overall market. I’m not even going to wait for Strix Halo. I’m going to get the M4 Max when the Studio version comes out. Yes, it’s really expensive. But there’s really no better alternative without risking your power circuitry. AMD just failed the market.
I love how he says "To wrap this up" and he has another 5 minutes left. Only 3/4 through the video
That's the beauty of Wendell.
#SysAdminThings 😂
Tiered memory would also be awesome on laptops, with the ability to expand capacity with CAMM2 and have faster memory on package like Apple silicon.
Do a video demoing CXL memory expanders.
Everybody will think that the new memory format is great because it's efficient and faster/closer, YEAH, until the software devs see this and make the software heavier, and then this new RAM behaves the same as the old one.
@@Jackson-bh1jw Devs making software heavier has nothing to do with increasing RAM. CXL isn't about improving the performance of unoptimized programs.
CXL memory doesn't perform better than DIMMs. CXL memory has higher latency and lower throughput than DIMMs. The performance gains of a system with CXL memory come from significantly reducing page faults.
CXL also allows you to upgrade your RAM independently from your CPU which significantly reduces the cost of upgrading systems.
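A back-of-the-envelope way to see the page-fault point above, as a small Python sketch. The latencies are assumed round numbers for illustration only, not measurements of any real DIMM, CXL device or SSD, and `effective_access_ns` is a made-up helper. It just computes a weighted average access time for a workload where most accesses hit DIMMs and the rest spill to a slower tier.

```python
def effective_access_ns(fast_fraction, fast_ns, slow_ns):
    """Average access time when `fast_fraction` of accesses hit the fast tier."""
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    return fast_fraction * fast_ns + (1.0 - fast_fraction) * slow_ns

# Assumed, illustrative latencies in nanoseconds (not measurements).
DIMM_NS = 100.0
CXL_NS = 250.0
SWAP_FAULT_NS = 100_000.0  # a page fault serviced from swap on an SSD

# Same workload: 99% of accesses fit in DIMMs, 1% spill somewhere slower.
spill_to_swap = effective_access_ns(0.99, DIMM_NS, SWAP_FAULT_NS)
spill_to_cxl = effective_access_ns(0.99, DIMM_NS, CXL_NS)

print(f"spill to swap: {spill_to_swap:.1f} ns average")
print(f"spill to CXL:  {spill_to_cxl:.1f} ns average")
```

Under these assumptions, even a 1% spill to swap dominates the average, while the same spill to a slower-than-DIMM CXL tier barely moves it. That is the sense in which CXL can help without being faster than DIMMs.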
@@mohammedgoder no, no... and no.
About heavy software, I don't care if it has nothing or everything to do with the RAM; that will happen, ergo the transition is useless. You can attach the new tech to the old format anyway.
And it looks like argument bait, coz you just said it has better upgradability, so yeah, win, win.... and win for me. Tnx
@@Jackson-bh1jw You literally lost the argument by saying "I dont care if it has nothing or all to do with the rams." Your whole argument was contingent upon a link between CXL and bad devs. You failed to substantiate your claim and then you claim to win an argument.
I, on the other hand, provided a detailed rundown of the importance of CXL memory. It seems as though you are cognitively challenged.
@@mohammedgoder cool story bro
There is a very good article from SemiAnalysis about CXL, AI and chip design. CXL uses PCIe, which uses more die area and chip shoreline per gigabit/second of transfer speed than NVLink. Unless AMD uses more silicon with chiplets to increase shoreline area (which it is positioned to do), it cannot reach Nvidia's speeds.
True, but with HBM and good handling of memory tiering you might not need much bandwidth. Phison's product working with only NAND, even without CXL, is perhaps some evidence of this.
Space for the PCIe lanes can come from reducing pinouts to DIMMs/traditional RAM interconnects. Also, CXL has switches that allow for expansion regardless of how many PCIe lanes the host CPU supports.
I would love to have a Threadripper with a bit of super fast HBM memory that is supplemented with CXL memory.
Not sure AMD will continue the APU route. That said, the MI300A (24 Genoa cores plus 228 CUs) is really quite a feat. I had the chance to play with it a bit and AFAIK it's the only *near production ready* Unified Shared Memory (USM/UVM) enabled system where it's worth using USM on large problems.
The benchmarks are looking good, although a bit lacking on HBM throughput (due to the big LLC? or an IO die bottleneck?). Although a bit low, we still get much more than the H100's 3 TB/s. Anyway, on paper and in benchmarks, it crushes the H100 for many HPC and AI tasks (while consuming less power and costing less than half to a third of an H100).
For HPC, nodes with 4 APUs are already being used (see the DOE's El Capitan machine). Stay tuned for SC in November: the TOP500 may see some changes in both raw Flop/s and energy efficiency (Flop/J).
The MI300A fills the AMD-to-Nvidia software gap in a way, because it lowers the cost of entry on AMD GPUs: you can have a half-ported HPC simulation code, with a bunch of stuff done on the CPU and the rest on the GPU, without paying for copies. OK, it's not optimal and does not help with porting a code, but it enables people with an incomplete port to use GPUs (though be careful not to become vendor-locked).
As an aside, the C++ standard only supports what they call std par (a way to extend the standard library's algorithms) on GPUs if USM is available. From experience, this C++ feature is crap unless you are so careful that you are not the audience for it (aka HPC engineer/benchmarker). The C++ guys are very clear that they are not going to adhere to the device queue concept of OpenCL, CUDA/HIP, SYCL etc. The C++ std sees memory as only one big array; nothing is disjoint. For C++, the only way is USM. On MI300A you have USM and it's performant. On GH200, you have USM and page migration (ouch).
The MI300A requires quite a bit of kernel hackery though, the kernels will run in HBM and this is weird.
Ryzen AI chips with the CPU, GPU, and NPU are the ones to watch. If they start adding more DDR5 memory channels (beyond the current two) then you start to move into the bandwidth that discrete GPU(s) enjoy and that Apple currently employs with the M1-M4 chips.
Yep, Strix Halo may be what we are all looking for, with its 256-bit total memory bus across CPU, GPU and NPU.
4 channels is as likely as 32 pcie lanes 😢
PS4 did this in 2013. The slow response time of GDDR/HBM can probably be mitigated by a bigger L3, which should make it more viable for CPUs.
exactly, and I don't know why people keep saying that main DDR memory is slow; it is not. GDDR5 is as fast as DDR4; the difference is the way the controller accesses memory. DDR4 is better at random access, while GDDR is better at sequential access because it has a wider bus: 2048 bits vs 512 bits for the CPU.
A larger L3 will always help, but the additional bandwidth of HBM will not. With regular DDR having lower latency, it will be the better option. This is because latency is more important than bandwidth for CPUs. The opposite is true for GPUs, hence two different memory standards. The question then becomes which to use when both a GPU and CPU use the same memory? I suppose it all depends on what limits these future systems.
@@williamdouglas8040 I think we might even see a comeback for L4 cache at that point.
CXL has similar needs to persistent memory management, and LLVM support via memory aliasing is key. Context save, load, replay, and power-loss recovery/restore for programs are necessary at the kernel level to ensure uniformity across the software stack. Compiler keywords such as volatile, static, persist would be the best hints in C/C++ to make existing software easily extendable. Furthermore, a compiler with integrated AI just-in-time compilation may be the best real-world accelerator that the tech giants are seeking.
I really like unified memory for its potential to increase memory bandwidth.
Apple’s M Ultra uses soldered LPDDR5, yet they are able to get 800 GB/s of bandwidth. That’s close to GPU speeds.
While Apple is not a good example when it comes to reducing costs, VRAM is more expensive than LPDDR memory. If we can use LPDDR to get similar speeds to VRAM then I’m all for it. Unified memory then makes a lot more sense. It’s less hassle for developers too.
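The bandwidth figures quoted in this part of the thread follow from simple arithmetic: peak bandwidth is the bus width in bytes multiplied by the transfer rate. The sketch below shows that calculation. The specific configurations (a 1024-bit LPDDR5-6400 bus for the roughly 800 GB/s figure, and a 128-bit dual-channel DDR5-5600 desktop setup) are assumptions added for illustration and do not come from the comments.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
    """Theoretical peak bandwidth in GB/s.

    bus_width_bits: total width of the memory bus in bits.
    transfer_rate_mt_s: transfers per second, in millions (e.g. 6400 for LPDDR5-6400).
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mt_s / 1000  # MB/s -> GB/s

if __name__ == "__main__":
    # Assumed configurations, for illustration only.
    print("1024-bit @ 6400 MT/s:", peak_bandwidth_gb_s(1024, 6400), "GB/s")
    print("128-bit  @ 5600 MT/s:", peak_bandwidth_gb_s(128, 5600), "GB/s")
```

These are theoretical peaks; sustained bandwidth in real workloads is lower.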
3:48 I’m a bit of a strange weirdo:
All the system developments described in this video with possibly huge performance gains are great and all; but what I’m really in awe of is the physical plumbing of that passive electrical interconnect bridge thingy Jensen is holding here.
A fantastillion TB/s to connect multiple physical systems in a rack with each other “just” with high-quality copper wiring since optical transceivers would need much more power at the same scale.
Meanwhile on the end customer market it is still pretty difficult to find and actually get high-quality PCIe SFF cables that can do PCIe Gen4 or Gen5 properly without introducing PCIe Bus Errors :(
Perhaps controlling both ends of the cable has something to do with it.
Maybe the current standard and tolerance levels are too high for something like PCIe to work safely within its theoretical maximums
It’s mostly cable manufacturers scamming the customers. The standards and tolerances needed for these PCIe SFF cables are widely known but in the past they mostly got away with manufacturing actually-lower-than-advertised quality cables and stuff in general still worked (up to PCIe Gen3 is relatively “easy”). But these tolerances become tighter and tighter with the faster generations which then leads to this house of cards collapsing.
Even if you have nothing to do with PCIe cables (which is understandable since it is a niche volume-wise, especially outside of data centers), it’s a similar situation to USB 5, 10, 20 & 40 Gbit/s, HDMI 2.1 48 Gbit/s or DisplayPort 2.1 cables but with drastically higher prices per cable which in turn of course attracts scammers that smell quick profits.
Optical has higher latency as well
Yay the music is back!
Am I just dreaming to imagine the SXM socket ever coming to PC desktops?
I share that dream. I'm sick of GPU cards.
Technically there are adapter boards to standard PCIe, but they are really rare, nvidia made some
@@Gastell0 Would defeat the purpose. SXM5 delivers up to 700w right through the socket. Much faster signalling, too.
@@dieselphiend Well the adapter exists,
more optimal would still be to use the daughterboards for 4 SXM modules that connect to the motherboard via cooledge; the Dell and Supermicro ones would probably work with minimal challenges
I'm sick of the ATX design of motherboard, we need something better
AMD was working with Vega to use videoram more efficiently. They found out that games often took twice as much as they needed. What happened with that research and why does it seem nobody uses it?
Also: why do GPUs lack an M.2 slot? Drivers could use the flash drive to store data like textures for multiple games and programs.
I'm hoping/liking that server capabilities are trickling down into the consumer market. I'd be happy to have a big unified chip that contains both the GPU/CPU and a single BGA HBM/RAM card (they can figure out how to do a giant shared cache on active interposer). Anything that can slim down and simplify technology in a user-facing manner is a win to me.
Right now I'm digging the AM5 server motherboards and want to see 4 node 2U configs since that would make sense to me since that config limits PCIe I/O. All I want is good IPMI and ECC RAM.
These changes in computing have been a long time coming and I'm excited to see the future of it.
Sounds good, as long as you never need to upgrade or replace a component.
Gpu goes bad, replace the whole unit.
Need to upgrade the cpu, replace the whole unit.
And the cost of buying/replacing a whole unit would be on par with buying an entire computer system, as it basically will be.
@@chrisbaker8533 True. I've had GPUs burn out on me, but never CPUs. Then there is my favorite. Did my PSU burn out my GPU or was it over temp? People who constantly upgrade their computer would find the new scheme annoying.
Think of it like a big APU and big hunk of unified Memory instead.
Love this channel!
22:56 I took a pill in Ibiza 🎶
22:44 - Is this Liqid?
With so many devs working on AI, I'm sure one poor schmuck is working on Radeon GPUs.
Only until they get the AI trained well enough to move that poor schmuck over to the AI section
Not certain here, so don't quote me on this.
I use Gentoo, and last time I was in the memory management section of my kernel config, I could have sworn that I saw drivers for RAM over CXL, as well as a hotplug driver for it. Right near level 5 page table support, I think.
I read about that a while ago, so no wonder it is already there. One more thing: in my opinion this man is doing pure hype and somehow leaving too much detail aside. PR. 😂
I have this notion that in the future we're going to move to a more distributed hardware architecture. As Wendell mentioned, the problem has always been software - We can build amazing computers, and we absolutely need to because our software gets slower at about the same rate as computers get faster. But *if* we eventually get good at writing parallelizable code, the trick is probably to just have a bunch of tiny cores tightly coupled to very little memory, clustered together in a larger computer. Right now, the compute is attached to various kinds of memory via a single fast big memory bus, but I think "in the future" the computers will just be a bunch of tiny memory cells tightly coupled to a local CPU in a "networked" cluster, leaving a lot of the memory, cache optimizations etc. (the stuff we're tweaking right now in hardware) to software.
It's hard to build amazing software for amazing hardware when you use programming languages made for the PDP-11
To move to the vision you outlined, the new product / architecture/ vision needs to beat the legacy product by a significant degree, and it is that legacy product that killed all of the new visions (VLIW, ARM, RISC, etc)
@@EdDale44135 OK, first of all, all the architectures you listed are in fact alive and well (VLIW/TTA architectures are the cool stuff in LLVM-based OpenASIP; ARM is more popular than ever, probably the most common CPU architecture outside of servers, and is also RISC. Not to mention RISC-V already making some strides in the embedded world). So non-x86 CPU designs are alive and well (IMHO). Second, yes, you're going to need to be better at something with a new CPU design if you want to gain market adoption. But you don't have to be the best at everything right away. And I think coupling storage to compute is probably more efficient (faster and/or more energy efficient). The problem, as stated before, will be the toolchains etc. - Software - to get useful (non-benchmark) work out of these chips.
I'm not sure who said developers more. You or Ballmer XD.
Please feel free to make more of these. This was interesting insight.
Serious question: When do you think we will get a Grace/Hopper or M100 style 'all-in-one' home computer? APUs are already very well established, Apple M chips seemingly doing well. I want an Nvidia/Arm all in one PC to game on.
The thing is, those companies don't want you to buy one item. They want you to buy at least two. That's why the performant CPUs and GPUs are separated.
@@r0galik Nvidia doesn't sell CPUs
@@--waffle- it does.
Also, AMD does (a lot).
I still believe Tulips are the future
Nvidia definitely needs to tempt fate and release a hardware or software named "Nvidia Tulips"!!!!!!!!!!!!!!!!!!!!!!
On a more serious note, could this architecture benefit non-AI computing? I am thinking of the huge boost AMD received with its 3D V-Cache, which helped gaming and even running virtual machines
There are a number of workloads that x3d cache speeds up, phoronix did a deep dive on it so I'm guessing the same would apply here. Like Wendell said the sticking point is when you need more memory than is on the SOC which CXL could address. It will be interesting to see how this all shapes up over the next couple of years, no question there.
great job
I’m excited to see more unified memory options on the market as it’s the future bottleneck for one of our fintech projects.
We can scale the compute easily enough, but once the dataset passes a certain size we become VRAM constrained and all GPU performance advantages are lost.
It’s one of the reasons we didn’t put the CUDA version into production and kept with the parallel scalar approach instead. It’s a bit slower, but much cheaper & more scalable to split the work around CPU cores where we can access the memory we need.
I’m curious about how the GPUDirect Storage API can help. It allows a CUDA app to directly stream data from fast NVMe drives without having to loop it through system memory. For the right use case, VRAM caching NVMe can be faster than VRAM + system memory.
We have a few years of scaling our products on CPU before we need to take the GPU route. I’m hoping there will be more options by the time we need them.
I shudder to think what situation would cause system DRAM, with 3 orders of magnitude better latency than NVMe and an order of magnitude more bandwidth, to perform worse than an NVMe drive at streaming data into a GPU. I suppose the direct storage API would reduce CPU overhead though.
@@paulblair898 aye. It’s the single threaded setup and deconstruction process which hurts the most. By separating the data path from the wrap calls a lot of map-reduce type tasks can run faster as it moves the bottleneck away from the CPU.
For our application which streams data (a few billion objects per go) we’re compute rather than bandwidth limited.
If the NVME can feed the GPU fast enough during the parallel map phase we get a performance boost from the CPU being less loaded.
We can also preserve the VRAM for the smaller results dataset which can fit in memory for a fast pass at the reduce phase of the process.
Performance stuff is really weird sometimes.
We’re not using this path in production yet, but it’s nice to know we can scale this way if we need to. It was a fun R&D project to pull together the solution using the NVIDIA RAPIDS framework.
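The streaming pattern described in this reply (map over a large stream in chunks, keep only a small running result resident for the reduce) can be sketched in plain Python. This is a CPU-only illustration of the data flow, not the commenter's RAPIDS or GPUDirect Storage code; `map_phase`, `chunks` and `streamed_map_reduce` are hypothetical names, and the sum-of-squares computation is a stand-in for the real per-object work.

```python
from functools import reduce
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items without loading the whole stream."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def map_phase(batch):
    # Hypothetical per-chunk computation; stands in for a GPU kernel over one batch.
    return sum(x * x for x in batch)

def streamed_map_reduce(source, chunk_size):
    """Map each chunk as it arrives and fold the small partial results together.

    Only one chunk plus the running total is held at a time, mirroring the idea
    of reserving fast memory for the small results dataset.
    """
    partials = (map_phase(batch) for batch in chunks(source, chunk_size))
    return reduce(lambda acc, part: acc + part, partials, 0)

if __name__ == "__main__":
    # A generator stands in for objects streamed from storage.
    stream = (i for i in range(1_000_000))
    print(streamed_map_reduce(stream, chunk_size=65_536))
```

Because the source is consumed lazily, the dataset size is bounded by the storage feeding the stream rather than by the memory holding it.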
How will the power consumption be affected by that? I guess storing in SLC NAND will probably require more power than in HBM/DDR memory, won't it? Whether that will be a significant change I am not sure, but I am quite sure that power consumption will limit the compute power quite soon
New hardware is almost always waiting for software to catch up, unless it's hardware meant to accelerate existing software.
Well done sir :)
I would really love to see an in-depth analysis of Qualcomm's new X Elite chip, especially its GPU. There is a decent amount of information about its CPU out there but very little about its GPU, and its GPU is pretty decent for a first-gen integrated GPU.
I have it on good authority that the dual GPU in Apple's M Ultra has abysmal results in compute precisely because Apple presents a single unified GPU as opposed to two separate ones. The cost is in maintaining coherency between the two GPUs.
What keeps playing in my mind
"All non-trivial abstractions are, to some degree, leaky"
Yes, the idea is to abstract away the inherent system architecture so that you don't need to focus on optimizing it so much
But what that actually sounds like to me, is that I'm going to have to go under the hood to do the optimizations. I'm going to have to work around the abstractions.
It's not just AI and scientific programmers who work on UMA platforms; desktop/end-user developers and game developers do too (Apple announced more native game ports at WWDC24). That makes it likely that UMA, with fast and slow memory, will be the future going forward.
Can totally see MI350X with 384G of HBM3E. ChatGPT 4o Sky needs very low latency but also higher sophistication. Developers will brute force it until they can’t anymore!
Makes me think of the Amiga's chip vs fast ram :D
Interesting discussion. I think we really need an open source implementation of a GPU memory paging system, one with a flexible number of tiers, so we can pull weights from SSD, to CXL, to system memory, and onto the accelerator as they're needed. I believe that modern GPUs have MMUs, so if it's possible to register a page fault handler, then you could build a system like this. Such a system would enable LLM inference on much lower end hardware, and even with the paging overhead, it will most likely have drastically lower latency than connecting to a cloud service (the speed of light is indeed a bitch).
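The multi-tier paging idea in the comment above can be sketched as a toy simulation. This is a minimal model under stated assumptions, not a GPU page-fault handler: `TieredPager` is a hypothetical class, tiers are exclusive LRU caches ordered fastest first, capacities are counted in pages, and a plain dict stands in for the SSD backing store.

```python
from collections import OrderedDict

class TieredPager:
    """Toy exclusive multi-tier LRU pager (hypothetical, for illustration).

    Tiers are ordered fastest first, e.g. accelerator -> DRAM -> CXL.
    A page lives in at most one tier; the backing store holds everything.
    """

    def __init__(self, tier_names, capacities, backing):
        self.names = list(tier_names)
        self.caps = list(capacities)
        self.tiers = [OrderedDict() for _ in self.names]
        self.backing = backing  # page_id -> data; stands in for the SSD
        self.hits = {name: 0 for name in self.names}
        self.hits["backing"] = 0

    def _insert(self, level, page_id, data):
        if level >= len(self.tiers):
            return  # dropped from all cache tiers; still available in backing
        tier = self.tiers[level]
        tier[page_id] = data
        tier.move_to_end(page_id)  # mark as most recently used
        if len(tier) > self.caps[level]:
            # Demote the least recently used page to the next slower tier.
            victim, victim_data = tier.popitem(last=False)
            self._insert(level + 1, victim, victim_data)

    def read(self, page_id):
        """Fetch a page, promoting it to the fastest tier."""
        for level, tier in enumerate(self.tiers):
            if page_id in tier:
                self.hits[self.names[level]] += 1
                data = tier.pop(page_id)
                break
        else:
            self.hits["backing"] += 1
            data = self.backing[page_id]
        self._insert(0, page_id, data)
        return data

if __name__ == "__main__":
    weights = {i: f"w{i}" for i in range(10)}  # placeholder "weight pages"
    pager = TieredPager(["accelerator", "dram", "cxl"], [2, 2, 2], weights)
    for page in (0, 1, 2, 0, 2):
        pager.read(page)
    print(pager.hits)
```

A real system would replace the `read` call with a fault handler and the dict copies with DMA transfers; the demotion chain is the part that carries over.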
I did not fully get the concept of the mi300a.
I cannot see how a unified memory space could work effectively, if that's what it's trying to do. The x86 chip would slow down the GPU so much if dynamic workloads were done, and for non-dynamic algorithms where is the benefit over the traditional system? There might be some tiny response time gains, but that against 50 years of software development with the traditional memory architecture (software is 99% about memory)?
But lvltech? Is the acid database web scale? 🤔
It's got the bigger geebees!
I hope we see more UMA on the x86 side too, in consumer products. I don't know what type of work the viewers here are doing, but as an ordinary software engineer I can say that in a lot of companies the developer machine you get is (try to guess) a Windows laptop with Intel, or a Mac, and if you don't want to be in the Mac ecosystem you spend a lot of time building. Not to mention that if you want to experiment with LLMs at work and you are on Intel, you're probably better off not trying. With the new Intel SoCs it may be better, but it takes time for big companies to change developers' hardware.
Oxide Computer 👀
Hype if true
Perhaps foresight or perhaps luck, but the unified memory architecture is pure win for Apple. Lower power, better performance, easier to put together a crazy combination of different SKUs, including what's in their data centre. Want sustained performance - add a fan. Nice.
Developers, developers, developers, developers, developers, developers. 😊
Thank you. Interesting stuff.
What do you think about Strix Halo which seems to be a unified memory system?
The DDR5-6-7+ progression is simply insufficient. With the rise of integrated GPUs, NPUs, and even new AI workloads on CPUs, it just does not make any sense. The only reason we ended up with such slow system memory in today's products is that it was simply good enough for yesterday's workloads, but things have changed. The companies that don't get this will be scrambling to catch up in the near future.
@19:25
++good
But it does matter!!!! If you actually build kernels, you need to be aware of UVM as under the hood it is just hardware based page faults which are slow.
Sure we don't have to explicitly malloc, but still need to be aware of what needs to be pre-fetched etc, hell sometimes we need a few SMs to do data transfers exclusively.
Stuff that goes over interconnects needs to be handled with even more care, like when using MPI or NVSHMEM. The architectural changes bring more complexity to the programming model, not less, by any measure.
Not sure I agree with "slow" and "fast" system memory,
More like "GPU memory" and "CPU memory" In the classic CPU example the difference is the "interconnect" speed vs the "memory bus" speed.
This kinda feels like NUMA all over again, except for using the unified memory like the M-Series and now the MI300.
CPUs solved the throughput problem with caches, so they're not geared for high-bandwidth memory. I'm not sure how much of a core redesign it would take to remove the caching and have a CPU attached to memory like a GPU and still have branch prediction etc. work.
Maybe this is the hold up?
i hear cache and wendell, where is the buzzword zfs? :D
Yeah.
That sounds like what AMD Strix Halo will be, based on the Moore's Law Is Dead rumor.
So when do we get consumer PCIE memory expansion boards? 🍑
Well, I'm sure the game Star Citizen, with the way they are doing server meshing, would love to have a new Nvidia server rack.
Tinybox specs say 6TB/s (6050 GB/s). Does that mean that Tinybox has faster memory than this beast?
tenstorrent? risc-v??!
Is this basically x86 SoC?
Developers, Developers, Developers.
Developers, Developers, Developers, Developers, Developers, Developers, Developers, Developers, Developers
anything's possible, keep it simple
I can't wait till mi400
But it won't be released until 2026
Pfff, the speed of light issues can be easily solved, just migrate to photon computing where the state of multiple photons is linked via quantum entanglement instantly no matter the distance…
Come on, people! ;-)
And we still can't get AMD SR-IOV
Unified Memory,
From a CUDA programming perspective, tools built for this kind of practice have existed for a decade (although apps never use them, AFAIK).
I would speculate they are just trying to deliver what customers want rather than going for a more aggressive, full-on unified structure. Until some really impactful real-world application demands it, they can hold back; at that point they would be able to deliver it as a "significant improvement", since they already have an architecture close enough for applications to be deployed on and compared against the traditional architecture.
As of today, that "impactful application" can't be AI. The Transformer architecture, including its variants, doesn't really benefit from the unified architecture, at least not in an obvious enough way, in high-density training environments.
As a gamer, it's interesting that your video applies to gaming too😂
With server farms moving to 3-year life cycles, could Nintendo buy Grace Hopper or newer servers as companies move to newer architectures, and build cloud services on the cheap, enabling streaming to Switch devices?
Can we finally see real results with an MI300A/MI300X?
Until that happens, Go AMD!
There is a lack of real investment in the market, we are seeing "gimmicks", companies soldering chips and changing memory formats, preventing upgrades, etc., all to avoid assuming the basics, they are repeating the same recipe for decades, and are hitting all kinds of limits.
The future is total memory abstraction (IMO), unified memory and separation between slow and fast memories in CXL, all at the same time.
IMO on the CPU, with some level of HBM3e for "server" products, you get total freedom in mixing memories: since your chip "speaks" OMI, the hardware behind it doesn't matter (cases like HMC, the Hybrid Memory Cube, would not have fallen by the wayside). Do you want to use HMC/HBM3e on a memory stick? Feel free. It's possible to mix DDR4/5 and GDDR all at the same time. 5 TBs of memory access? OK. A chiplet-level cache, and so on.
There is a lack of courage, who would have thought IBM's "dinos" were at the forefront of the "innovative" mainstream PC/server market.
Programming a GPU should be as simple as running the same code on a CPU. No hoops for programmers, devs, whatever. In GPU land, optimization at the code level does not matter; it's only down to the algorithm you are running. The GPU is designed to run everything massively parallel, so that's not the main concern the way it is in normal CPU programming. You have to transform the algorithm to work in a massively parallel format: not even CPU AVX vector code, but GPU vector code. If you make a smart, compatible vector-instruction compiler for CPU/GPU, then you don't have to recompile to get the same vector code to run on the GPU directly. Make universal vector computation code, universal compute shader code. That you already have, and both GPU and CPU can run it in vector mode. Why bother with any code format other than compute shaders for vectorized compute units? NPU, TPU, GPU, CPU: all running the same code. All the janitor code around the core compute is secondary. Make the CPU also act as a single GPU as far as vectorized compute code (compute shaders) is concerned. The more you spend on secondary janitor CPU glue logic, the less you are doing the main thing. Say, ray tracing versus getting the data to be traced: which one do you think is the main thing? Of course it's getting the massive compute running, running the main task. Don't make the janitor task the focus. With unified memory, you as the program dev don't need to manage any of those memory things. In other words, Python NumPy runs every vectorized load on anything, but instead of that, at a low level, make the processor cores run the same loads directly. With unified memory on top, it really does not matter where the compute resources come from, i.e. where the code is being run. Really flexible compute execution. Why do you need the slow memory if you already have enough high-speed memory, i.e. GDDR6/7 instead of DDR5? No DDR5 at all, or a minimal amount with the CPU boot package for all the OS nonsense etc.
wait you assume you need to compute on cpu for some reason
Your voice is very similar to the voice of Neil deGrasse Tyson...
Apple unified RAM isn't anything special, it's just LPDDR5
buddy where else are you gonna get a gpu with 128gb of vram, oh wait that's the new macbook sorry
(no seriously though that might be the best ai platform out there)
You want slow memory? WTF dude
He lost his train of thought.
He meant to say: Fast Memory + Low Capacity vs. Slow Memory + High Capacity.
It's a trade-off.
x86 always wins.
all of this....for mostly unending, crappy, buggy apps.
Have AMD or Nvidia produced a counter to Intel's Xeon CPU Max Series?
"Maximize bandwidth with the Intel® Xeon® CPU Max Series, the only x86-based processor with high-bandwidth memory (HBM). Architected to supercharge the Intel® Xeon® platform with HBM, Intel® Max Series CPUs deliver up to 4.8x better performance compared to competition on real-world workloads1, such as modeling, artificial intelligence, deep learning, high performance computing (HPC) and data analytics"
...the mi300a?
@@Level1Techs Who will be the first to bring on chip or on package HBM memory to the consumer space?
What makes Blackwell what it is is not Nvidia's architecture, it's TSMC. The same applies to AMD: 3D cache is TSMC tech, not AMD's. AMD could never make 3D chips; they don't have the money or the fab. Design companies are design companies; only real chip makers have foundries, and the foundries design their own 3D chips. TSMC is happy to rent their tech to design houses. The thing is, if AMD switches fabs they lose the 3D chip unless they go with Intel or Samsung 3D, and I have heard nothing on Samsung 3D cache. At any rate, what people believe is AMD/Nvidia tech is not their tech, it's the fabs', insofar as 3D chips go.
If AI is what you types claim, the software should not be a problem. When Nvidia crashes, and it will, because it's going to come out that AI is for the most part overhyped, and when millions lose in the market, what do you people say? Oops? That will get you bullets. If Copilot is what it's all about, it's already a fail: just ask AI hard questions and the only answer you will get is pure BS, because its info relies on men inputting correct info, and most people cannot do that, period. So it is useless, and if democrat programmers are involved, now it gets scary.
AMD really squandered a lot of good will when it comes to local ai. They’re more interested in nvidia taking a larger share and keeping prices high than they are in increasing the overall market.
I’m not going to even wait for strix halo. I’m going to get the m4 Max when the studio version comes out.
Yes it’s really expensive. But there’s really no better alternative without risking your power circuitry. AMD just failed the market.