In these supercomputing racks, I always wanted to know: does all the RAM actually get used? I know that some computations, or training AI, can take weeks or even longer with all these chips running flat out, non-stop. I guess the dataset gets loaded into memory. But DAMN, that is a lot of data if it's filling all the RAM on every node, and there are hundreds of full server racks stuffed top to bottom with H200s. What storage do these things have as well? It's never as impressive, when they're showing off cutting-edge tech, to talk about all the empty space there is! But the big datasets must be stored somewhere before being read into RAM, so I have to assume it's a boatload of SSDs with decent read/write speed. Question: if data is read from the SSD but is required to be in RAM on every rack, does it get copied RAM to RAM to the other racks, or does it have to be read off the SSD again and again until the RAM on every multi-GPU platform in every rack is filled?
My immediate impression is that the dual Intel CPUs and RAM slots are so ancient next to the Nvidia GPU cluster and InfiniBand cards. It should be powered by an M4 Ultra type chip with 256GB per processor. Intel needs to buck up.
Wonder if this could be programmed for a rendering box, though seems far more suited to AI / LLM. With all that RAM, guess you're configuring at 4+ TB? That's a LOT of sticks!
It's insane how much engineering goes into basically just trying to burn as much power as possible. Ahh yes, we shall use as much power as we can while still being able to keep it under control. One of these would make one heck of a space heater at home tho xD
Quite a lot if you play that AI-hallucinating Minecraft. But also, H200 actually is a GPU, this would make a nice cloud gaming server if nVidia would allow it.
@@ICanDoThatToo2 so, that’s an interesting concept: On the one hand, people have already benchmarked these GPUs in games (well, really TimeSpy) and the performance was really bad; as in, any GPU that’s in the Steam hardware survey would be faster. Also, one of the major benefits of this GPU is the ludicrous amount of VRAM bandwidth it has, which is more of an advantage at higher resolutions, which a cloud gaming setup wouldn’t use. On the other hand, this has so much VRAM and VRAM bandwidth that you could split it up 10-12 ways _per_ _GPU_ and still have a decent GPU, and this is one of the more expensive GPUs where NVidia officially supports and endorses splitting up GPUs. I’d love to see a cross between these reticle limit GPUs with all the HBM and a 4090 style GPU: all the gaming centric features this doesn’t have because they don’t make sense for supercomputers, and all the expensive HPC features that “are too expensive for gamers”.
@ServeTheHomeVideo haha 😂 These things i can afford but I guess I need to rent a little "nucular" reactor to power this thing. Electricity is not cheap here in Germany :D
The H100 challenge, or any accelerator, is that if you need to scale the number of XPUs because you run out of memory, the interconnect becomes a bigger challenge and cost.
16:21 that's the problem when using multiple cards for AI training/inferencing. The GPUs are simply working sequentially at a fraction of their potential, and customers are forced to dump loads of cash and buy all eight H200s to get a total of 1TB of GPU memory. A single H200 with a theoretical 1TB of GPU RAM would, in that scenario, deliver the same performance as all eight of those H200s.
Not necessarily. There are multiple types of parallelism, each with its own characteristics, and each scales with a larger number of GPUs in its own way.
Tensor parallelism takes advantage of a feature of matrix multiplication. The product of a row of the first matrix with the first column of the second isn’t related to the product with the second column in any useful way. In other words, you can throw half the columns on one GPU and half on the other. This scales really well and has low bandwidth requirements, but operations which require an all-reduce (e.g. attention’s softmax) reduce performance heavily. There’s a practical limit on tensor parallelism of around 4-8 GPUs in my experience (though I don’t use systems with proprietary interconnects between accelerators), but it does more or less perfectly pool together the memory capacity, bandwidth, and compute of the GPUs (or CPUs!) used.
Pipeline parallelism puts a number of layers on one device, a number of layers on the next, and the next, and so on. In my experience it generally uses more bandwidth than TP, but it’s a lot simpler to program and optimize. Performance depends on the topology and the use case. If you’re doing large inference batches (e.g. serving customers), or doing offline processing of large amounts of data (e.g. agents, document retrieval, etc.), you can actually hide the “bubble” where you don’t have many overlapped computations, leading to pretty close to “perfect” scaling, if bandwidth allows. This is more scalable than tensor parallelism, but it usually doesn’t improve response latency. It still gives you more tokens per second, but not necessarily a “faster” response in all situations. It does use “all the GPUs”, though, and gives you more memory capacity for the model.
Data parallelism is the final one, and at inference it sort of scales perfectly, insofar as anything does, but it does mean that you’re running a unique instance of an LLM / CNN on each node, so the responses of individual sessions aren’t accelerated. There’s also multidimensional parallelism which can combine properties of each of these. Regardless, you can actually scale the usage of GPUs networked together a lot better than you’d think, and “use all of them together”, because the operations behind AI models are very parallel by nature.
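To make the column-splitting idea above concrete, here is a minimal sketch, assuming plain Python lists stand in for matrices and list slices stand in for GPUs. The function names (`split_columns`, `tensor_parallel_matmul`) are made up for illustration and are not from any framework; real systems do the final concatenation with a collective such as an all-gather over the interconnect.

```python
# Toy column-wise tensor parallelism. "Devices" are just column blocks of b;
# no real GPUs or frameworks are involved (illustrative assumption).

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def split_columns(b, parts):
    """Split matrix b into `parts` contiguous column blocks of near-equal width."""
    n = len(b[0])
    bounds = [i * n // parts for i in range(parts + 1)]
    return [[row[lo:hi] for row in b] for lo, hi in zip(bounds, bounds[1:])]

def tensor_parallel_matmul(a, b, parts):
    """Compute a @ b by giving each hypothetical device one column block of b."""
    # Each device needs all of a but only its own slice of b.
    partials = [matmul(a, block) for block in split_columns(b, parts)]
    # Stitch the per-device outputs back together column-wise.
    return [sum((p[i] for p in partials), []) for i in range(len(a))]

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6, 7], [8, 9, 10]]
    print(tensor_parallel_matmul(a, b, parts=2))
    print(matmul(a, b))
```

Both calls print the same matrix, which is the point: the split changes where the work happens, not the result. Each device holds only its share of `b`, which is why the memory capacity pools.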
@@novantha1 Thanks, that was helpful. One note though, I'm far from being an expert in the area, but tensor parallelism you mentioned doesn't seem to dominate the computation in AI workloads as I have never observed the saturation of the GPU cores in multi GPU configurations. In fact, I'm always noticing a proportional reduction in GPU usage with the increase of GPU cards during the AI workloads.
@ Yup, it doesn’t matter which parallelism you use; there’s always overhead, it’s just you pick the overhead you can live with. My experience is generally on small scale installations using consumer hardware, so it’s a bit coloured by that (with 8 nodes, a 2 GPU TP, 2 TP pipeline, 2 data parallel structure feels pretty lossless), but as you get to larger models, larger workloads possibly spread across more datacenters the math gets very different. Your performance per GPU isn’t quite logarithmic, but it definitely feels like you lose around 0.5-1% per GPU you add to the system as long as you’re using the right parallelism. As for why Tensor parallel doesn’t dominate in the industry? The softmax operation, lol. It’s a real killer in larger deployments, even if it doesn’t matter in 2 GPU, or 4 GPU configs that much. I think there’s an argument for multi-dimensional parallelism at such scales, where you use multiple types (ie: you make pipelines of tensor parallel configs), but not all software is built to support that, and it’s not like everyone has experience with, say, PyTorch Titan’s asynchronous tensor parallel, for instance. I listed it first because it’s quite dear to my heart and is often more useful to end consumers who want to run AI applications, though (because you get less latency for a single response, in contrast with the other parallels). It’s also worth noting that for architectures without a softmax (linear attention, RNNs, etc), you can get going *pretty* fast with enough GPUs. Right tool for the job, and everything.
@@novantha1 oh, I wasn't talking about 0.5-1% per GPU. Take two GPUs (consumer or datacenter, doesn't matter) and do some AI inferencing with any model, and you'll see the GPU usage drop to just 50% per unit. Take four GPUs, and the usage will drop to approx. 25% each. And in this video I observe exactly the same behavior. I thought memory bandwidth was the cause, but no, the H200 is huge and behaves the same. My current understanding is that the AI software is accessing the GPUs sequentially; the algorithms are not optimized for concurrent GPU usage.
GPU memory is not always the limiting factor. For the workloads I run, I usually run out of compute on the GPU before I run out of memory (for inference). For training you can set a high batch size, cache data on the GPU, and use all the memory. For inference I’m limited by processing power, not memory. I run on an RTX 6000 Ada 48GB, not these monsters, but the same balance applies. You mostly need lots of memory for LLMs and such.
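The 50% and 25% figures are what an idealized pipeline predicts when only one request is in flight. A rough sketch, assuming the model's layers are split into equal-time stages (one per GPU) and using the standard idealized bubble formula; the function name is hypothetical and real schedulers will deviate from this:

```python
def pipeline_utilization(stages, microbatches):
    """Ideal per-device busy fraction for a simple pipeline.

    Assumes every stage takes the same time and ignores communication cost.
    With m micro-batches flowing through p stages, the whole run takes
    m + p - 1 time slots and each device is busy for m of them.
    """
    return microbatches / (microbatches + stages - 1)

if __name__ == "__main__":
    for gpus in (2, 4, 8):
        single = pipeline_utilization(gpus, 1)    # one request at a time
        batched = pipeline_utilization(gpus, 32)  # 32 requests in flight
        print(f"{gpus} GPUs: {single:.0%} with 1 request, {batched:.0%} with 32")
```

With one request, two stages give 50% and four give 25%, matching the observation above. Keeping many requests in flight fills the idle slots, which is the "hiding the bubble" case mentioned earlier in the thread.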
Will STH ever speak to the growing environmental and societal issues we will face as larger datacenters and AI servers proliferate? Drooling over tech is fun, but it's not so great when it leaves me worried for our future.
TBH I worry less about the environmental impact since that is more known and manageable. The bigger question is what happens 1-3 years from now as AI becomes a control layer, then in 3-5 years when it has the performance to start replacing huge categories of jobs. It used to be that folks could educate themselves and do higher value work. The new job threat is hard for humans to out-educate.
Don't know why I like watching videos on things I'll never afford, but I enjoy staying up to date with thorough videos.
You may not buy one of these, but you may use them in a cloud/ cluster. Sometimes just seeing huge systems is fun.
In 20 years you can buy this for 200 dollars )
can't buy one now, in 20 years when it's e-waste i still can't run it off anything like a normal house grid :)
$200 is ambitious, but stuff like Dell R930s go for around $700. The real issue with the stuff past 2U is just the shipping costs.
Don't say never. In ten years, one of these will be like $2k.
If you think you can't afford this just remember: The more you buy, the more you save!
Yes!
Nah, I'll just get it from eBay in 10 years. Should have my 100 amp PowerWall installed by then.
😂😂😂😂😂
Ammm, I am not sure that statement applies here outside of niche cases.
@@ICanDoThatToo2 It might not be worth it because maybe in 10 years we'll get a laptop with those specs. We might get a Quantum Computer Laptop.
Speaking of NVLink... man that 3dfx acquisition sure has paid off ;)
I remember putting one of those in a system for my girlfriend's little brother like 25 years ago... been a long time... voodoo2 or something like that...
Back in the 80's I wanted a Cray. Never did get one. Same goes for these.
What is most (un)surprising is that in about 10 years a machine like this will have the dimension of something like a mini-PC which fits in the palm of your hand.
Maybe. Much of the gains are happening through making larger packages instead of scaling in the space of the same package size
@@ServeTheHomeVideo That is very interesting indeed, so in effect a focus on horizontal scalability
Nvidia Orin mini cough cough
Just put 100 together
For anyone interested, that nVidia GPU tray is the exact one in the Dell hardware that is used at the xAI facility. Since it's nVidia, it makes sense that multiple vendors use the same SKUs or "flavor" of part style.
Ours is an actual tray that extends out of the case on rails for easy servicing. Plus, our PSUs sit outside of the chassis in their own powershelf.
Thanks!
Wow! Thanks!!!
that would make a killer plex server
Yea, for your entire family across the whole country! lol
Next LTT video. “Building a plex server for all of Vancouver”
I saw 8x GPU's so ...8 NVENC streams? /s lmao (When I actually know what I just saw was enough to run an entire Netflix Node hahah)
Emby
I recall that the power surge issue was so serious on Ampere GPUs that we ended up upgrading the PSUs on our servers. AI/DL loads typically take forever to preprocess the dataset (read, parse, and pack as batches) into GPU VRAM, and then the GPU crushes the computation and draws an insane amount of power, which triggers OCP on the PSUs.
That Xeon pillow looks comfortable. I'd bring it to a datacenter to rest 🛏 on too.
9:10 I'm hearing all these things and thinking: right.. something that AMD Epyc pretty much solves, has the same number of memory channels, has 32 more PCIe lanes and has more cores.
We did a video on both Intel GNR-AP and the new EPYC Turin. Intel's GNR-AP now can be ahead of EPYC. Still, these would need well over 300 PCIe Gen5 lanes in a server, and even that is not the right answer for P2P communication. You also get more channels with both, but due to their physical size, you lose DIMM slots making memory capacity a challenge.
What on earth could this server possibly be doing that could need more _cores?_ Like, I’d just assume that because the whole point of the server is all those GPUs, you’d get something CPU-wise that’s just got the minimum number of cores to get the memory channels and PCIe lanes you actually want.
But I’ve got less than no experience buying servers, closest thing is I’ve been eyeing an HP mini pc.
GNR beats Turin in a lot of these AI workloads. Even SPR and EMR can if you’re utilizing Intel's accelerators/software stack.
But will it run Crysis?
OK. Joking aside, I was going to ask how much more inference can be done on those GPUs vs training (it is my understanding that inference is a much lighter task), but then I noticed what you said about these power spikes and how utilization will go from 0 to 100 and to 0 again. Is this because of the training algorithm? Is it inefficient? Or maybe something else is bottlenecking the system? And since you mentioned that this happens over and over: are we talking several times a minute or hundreds of times a second? And if we graphed the GPU use over time, what would be the "average" GPU use? Because it seems a tad inefficient if it works just "from time to time".
Thx for the video. I wish I had slept the past 2 nights, maybe I would understand more ;)
Training to date has been mostly on NVIDIA HGX platforms like this. The quick version of what happens is the GPU gets sent a job, does a lot of math, then has to send the results out and wait for the aggregation to get its next job. You are right, that can be inefficient, which is why folks are using so many 400G NICs.
@@ServeTheHomeVideo Thank You. So at least partially "network" (well, more like ethernet interconnect) is the bottleneck in such cases. Thank You. I thought 800G NICs were already available.
Usually you train for many epochs. Each epoch runs many iterations on the same data and keeps the GPU 100% busy because no time is wasted moving data. Once an epoch is done it will fetch a new batch of data and you’ll see a small drop in utilisation to 0 and back up to 100. But this happens very fast, in a second or so. One epoch can take a couple of minutes, but it all depends. Inference is similar in intensity to training; you are only doing the forward pass without the backward pass. In fact, inference can be optimised to use more of the hardware and load it even more by removing the less parallel steps or bottlenecks.
@@AI-xi4jk Thank You. Very useful information.
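For anyone who wants to see the "do a lot of math, send the results out, wait for the aggregation" cycle from the reply above, here is a toy sketch of one synchronous data-parallel training step. It assumes a one-parameter model y = w * x and fakes the workers with plain lists; `all_reduce_mean` is a stand-in for the collective that real clusters run over NVLink or the NICs, and none of these names come from an actual library.

```python
def local_gradient(w, shard):
    """Mean squared-error gradient for y = w * x over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the network all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)

def train_step(w, shards, lr):
    """One synchronous step: compute locally, aggregate, then update."""
    grads = [local_gradient(w, shard) for shard in shards]  # GPUs busy (the "100%")
    avg_grad = all_reduce_mean(grads)                       # GPUs wait on the network
    return w - lr * avg_grad                                # next job can start

if __name__ == "__main__":
    # Two hypothetical workers, each holding a shard of data drawn from y = 2x.
    shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
    w = 0.0
    for _ in range(200):
        w = train_step(w, shards, lr=0.01)
    print(w)
```

No worker can begin the next step until the averaged gradient comes back, so a faster interconnect shortens the idle gap between bursts of compute.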
Not enough VRAM!! Need more!!!
Over 1TB!
@@ServeTheHomeVideo these are rookie numbers xd MI325X will soon be here
The only guy in here who can say this for real :D
Look! the ram rider are trespassing the wall
@@ServeTheHomeVideo I think that might not be enough for my wishes, please send it over so I can really test it ;-)
Great review! The Aivres server looks pretty neat and serviceable compared to other OEMs, and it also fits in 6U instead of the 8U that some other manufacturers seem to prefer.
I am just waiting for @LinusTech to get his hands on this, run some Crysis and then drop it
Heh. Back in the day (2016-2019) we would review the 8 GPU servers and even Intel Knight's Landing on the STH main site, then they would go to LTT after. You can see quite a few things that the serial numbers match since nobody else wanted to do the content.
I'm likely never going to afford this or anything like it, I'm just here to see how far technology has come
It all reminds me of the computer craze from the mid 90s and all those modular systems
First off, love the videos as always. You do a great job of communicating to both new and seasoned viewers. I would love to see a couple of videos about LLM hardware for the common man. As someone who loves to learn and also loves hardware, I have taken a fascination with LLMs. The problem is that there is so much data, and stale data, on what makes sense. For the most part LLMs crave VRAM above all else, but there are so many nuances there. I am currently looking at ditching my four 4070 Supers in favor of MI100 cards, as they can be had for just over a thousand dollars these days, but finding reliable data is so hard.
This is perfect for home web browsing !!
That's a total monster!!!!
Can’t Wait. 5 or so years and I’ll be digging this out of the recycle bin
Interesting that cables offer better signal integrity than PCB despite the additional connectors; Is that because they can be made twisted or thicker?
There's some insertion loss from the connectors, but the cables themselves can be made to have extremely low dielectric losses so you can get more distance out of them in total.
Wow, that really reminds me of a Grand Teton OCP chassis, I wonder why...
We have covered the Grand Teton on the STH main site, it is a bit different.
And at least the Grand Teton is an open design and contains some neat tricks (e.g. the use of load switch ICs for distributing current density on the power planes) that EEs can learn from, so even if it is designed for use in obnoxiously power hungry nonsense farms we can at least get some actual value from it.
Massively disappointed that I didn't get to see the boot up and actual operation of this beast :/
These servers take quite some time to boot, given how many devices and how much memory is involved. They are also pretty loud. Since we have industrial power in the studio, we can get power on set, but then you could not hear me speak. Always some practical challenges with that.
@@ServeTheHomeVideo This is true -- anywhere between 5 minutes up to 30 or longer.
With hardware this specific, smart I/O controllers running everywhere, and I/O being so important, it all brings back memories of mainframes. Add that we are paying for software as a service (again, people don't learn) and we have basically returned to the mainframe era, except today the names are AWS, Microsoft Azure and Google. If you are a creative, Adobe.
Yes. We pay Adobe a lot
It functions as a room/house warmer too.
I think at 10kw it would do multiple (well insulated) houses
@ronjatter mind you, one H100 server is nowhere near enough to deliver responsive generative AI.
@@runforestrunfpv4354 of course it is. Yes there would be a limit of how many users but still it could do many at a time. The real work is in training the models
I fully expect to be able to buy this for about $5k in 10 years.
Probably less than that
Something I was waiting to see explained, but it never came:
What's the interconnect between the GPU tray and the main chassis? (I'm wondering about both the physical/mechanical aspects, and the signalling/protocol.)
Is this good for running Plex with transcoding? Would be nice if I can squeeze in Home Assistant as well.
What's the lead time on a replacement GPU tray if you need one? Aren't they the most unobtainable part of the whole system anyway?
Nice room heater, I could use one for my shoes closet 🏎️😅
Only half way through the video, but I'm wondering if you've got a circuit you can power that without starting a fire.
If only we had these in the era of GPU mining.
"It costs as much as a base model Ferrari" ... 👀
With the VGA output, I could connect a 3dfx Voodoo card to really speed things up.
I do have a question! You mentioned the only way to get video out is via the vga port on the back, but since the NVidia Bluefield chips are designed with SMPTE 2110 in mind wouldn't you be able to use them to get a crisper image out? Would be a fun one to test, I think if you ran crysis on something like this it would be the highest viewed video on tech youtube for the month, although I imagine there might be other blocking issues with getting that running. Still interested in the video out over IP regardless tho.
"Serve the home" "It costs as much as a Ferrari" I think the channel needs a name update. I get it. It is cool. Completely unrelatable as a home lab guy.
It’s “home” as in “/home”, not necessarily as in “homelab”. They just do videos on cool stuff, and sometimes that means stuff most people can’t afford.
Not really much different than Jay Leno’s Garage or any other car channel. Especially that “One take” garage guy.
I mean STH started in 2009 and we were doing 8 GPU systems and blade servers in like 2016. Home is the /home/ directory in Linux for the users
I would like to see some more videos on buying used server hardware and drives (HDD/SSD). That’s more home lab oriented but STH videos get a lot of viewers posting high end server equipment videos.
I dunno, some of us like to watch the bleeding edge and dream...
Watch enez, some " homes" cost hundreds of millions 😅
what is the approximate retail price of a server like this? 200k? It would be nice to get an approximate price. A 10 kw power input is not cheap either.
200k seems cheap for this. I also like to know without giving all my background story to a reseller.
@@AI-xi4jk a quick search shows that they are more like 300k. And probably out of stock.
I expect massive depreciation of these AI units. Sooner or later someone will make chips that are custom made for executing the inferences and it might even be in an Apple chip in some future mac mini.
@@edwarddejong8025 I mean, Apple CPUs have had super basic support for what is now called “AI” stuff dating all the way back to the introduction of the “Neural Engine” in I wanna say the iPhone 6s. Like, ten iPhone generations ago. Obviously, that was limited purely to more ResNet style features rather than anything Stable Diffusion like.
For the OP: an 8x L40 server seems to be ~180k-200k, with ‘Hopper’ HBM equipped GPUs like this has being more than that, and the current rumors are a half million for 8x Blackwell HBM server, and either 1.5 million or 15 million for one of those 72x Grace-Blackwell “SuperChip” racks. And an entry level Ferrari starts at ~400k, assuming you mean a real *_Ferrari_* Ferrari, new from the Dealership.
$280k to $375k from SMC, Dell, Lenovo, etc.
So the major advantage of the H200 is more and better memory? The cores didn't improve much?
Yeah, I’m pretty sure that the cores are completely unchanged between the two, and the only difference is more, faster, and newer HBM memory. I think the H200s even have an extra HBM site active at 6/6 instead of 5/6 on the H100s.
And so every benchmark where you see performance uplifts between the two is purely showing off the advantage of more bandwidth. At least, that was “The Next Platform”’s conclusion.
They're bleeding edge and bound to have yield issues, like what @servethehomevideo mentioned in the video.
Key lessons learned: I don't earn enough to have one in my homelab 🙂
To be honest, I think my main interest has been just what operating system they run, how it's split up, and how much power each of these H200s has compared to something like a gaming RTX 4060. I know the biggest difference will really be in the CUDA cores and the VRAM, but I guess I want to figure out the difference at a fundamental level.
Could you make a video about the Tenstorrent ML servers? Tenstorrent's products are so much more cost-efficient than an Nvidia system, and a much better option for the SMB market.
It’s hard to find many benchmarks but from what I’ve seen price/power/performance ratio is not quite there yet to compete with GPUs. Programming them is also not that smooth yet. I’m keeping an eye on it but will skip this gen as I don’t have time to just play around with these even if I want to.
Hi @ServeTheHome, What are the best vendors for gpu server racks that ship to Africa?
Insane! 🎉
Definitely need to pick up a few for my home rack.
Pretty sure my apartment doesn't have enough power for even one, though -- maybe 1/3 of one. Yikes.
The quality of this video really surprised me, it truly exceeded my expectations!
You didn't do the noise test?
Ha! Loud
In these supercomputing racks, I always wanted to know if all the RAM does get used? I know that some computations, or training AI, can take weeks or even longer with all these chips running flat out, non stop. I guess the dataset gets loaded into memory. But DAMN that is a lot of data if it's filling all the RAM on every node - and there are hundreds of full server racks top to bottom stuffed with the H200s.
What storage do these things have as well? It's never as impressive when they're showing off cutting edge tech to talk about all the empty space there is !
But the big datasets must be stored somewhere before being read into RAM so I have to assume it's a boat load of SSDs with decent read/write speed.
Question... if data is read from the SSD but is required to be in RAM on every rack, does the RAM copy what it has read to the other racks RAM to RAM, or does it have to continually read off the SSD until it's done it enough times to fill all the racks and the RAM on each multi-GPU platform?
can't wait to see it on ebay in 20 years
The only thing stopping these from being 100% the whole time is the networking. You got to wonder if it’ll ever catch up
Nothing an Accton or Celestica switch with QSFP-XD can't handle..
@@iszotope An NVL72 rack has 14.4Tbps of NVLink but only within it. And even that can’t keep the compute from waiting on the network
2035 homelabs are gonna be wild 😂😂😂
my immediate impression is the dual Intel CPU and RAM slots are so ancient next to the Nvidia GPU cluster and Infiniband cards. It should be powered by an M4 Ultra type chip with 256GB per processor. Intel needs to buck up
This is 1TB per CPU and using low cost 64GB DIMMs. A M4 Ultra also would need a lot of PCIe I/O
Is there a golden ratio of CPU to GPU or does it vary depending on the workload or code used?
I wish I could afford that! lol very nice
Thanks! Have an awesome day
Damn, half a mill. Definitely not going into my homelab.
Usually these are quite a bit less than that.
Awesome. Thanks again.
Wonder if this could be programmed for a rendering box, though seems far more suited to AI / LLM. With all that RAM, guess you're configuring at 4+ TB? That's a LOT of sticks!
its insane how much engineering goes into basically just trying to burn as much power as possible
Ahh yes we shall use as much power as we can while still being able to keep it under control.
One of these would make one heck of a space heater at home tho xD
I would love to see how fast this thing can break password hashes.
This is ridiculous. 😍
You think I could run Home Assistant on this via Docker?
benchmarking time!
Does the Nvidia H200 have decent overclocking headroom?
How much power is needed at 100% load?
Around 10kW
@@ServeTheHomeVideo so old house with solar panels about 30kW with batteries and this == hosting business and sell free power+
I rented one of these for a while and the power is hard to articulate. I wish I could own one.
I think ServeTheHome needs to make a second channel called "ServeTheDataCenter". So out of scope for ANYTHING home related.
Apparently, the “home” in ServeTheHome refers to the “/home” directory in Linux/Unix/Unix-a-likes.
Did I hear 10 kilowatts? That is like €25 per day or €9,000 per year in electricity cost... crazy. Even if it shows up on eBay in a decade (or two).
Yea. In a decade, these will not be useful
I need to correct myself, that is €2.50 per hour or €60 per day or about €21,900 per annum on electricity... 😮
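For anyone who wants to check the running-cost math, here is a minimal sketch of the arithmetic. It assumes a constant 10 kW draw around the clock and a price of €0.25 per kWh (the rate implied by the €2.50 per hour figure). Both numbers are assumptions, and real tariffs and load profiles will differ.

```python
# Rough electricity cost for a server running at constant load.
# Assumptions (illustrative only): constant 10 kW draw, 0.25 EUR per kWh.

def electricity_cost(power_kw: float, price_per_kwh: float) -> dict:
    """Return the cost per hour, per day, and per 365-day year."""
    per_hour = power_kw * price_per_kwh   # kW * EUR/kWh = EUR per hour
    per_day = per_hour * 24
    per_year = per_day * 365
    return {"hour": per_hour, "day": per_day, "year": per_year}

costs = electricity_cost(power_kw=10.0, price_per_kwh=0.25)
print(costs)
```

Under those assumptions it comes to €2.50 per hour, €60 per day, and €21,900 per year, before cooling overhead.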
And he turned half a mill on its side 😂
Not the first time, nor will it be the last. :-)
so how much fps can i get on minecraft with this
VGA so maybe 30 or 60?
@ServeTheHomeVideo damn nvidia fell off hard it doesn't even have hdmi 💔 /s
jokes aside probably 30 to 60 fps on AI Minecraft
Quite a lot if you play that AI-hallucinating Minecraft. But also, H200 actually is a GPU, this would make a nice cloud gaming server if nVidia would allow it.
@@ICanDoThatToo2 so, that’s an interesting concept:
On the one hand, people have already benchmarked these GPUs in games (well really TimeSpy) and the performance was really bad, like if you have any GPU that’s in the Steam hardware survey it’d be faster bad. Also, one of the major benefits of this GPU is the ludicrous amounts of VRAM bandwidth it has, which is moreso an advantage at higher resolutions, which a cloud gaming setup wouldn’t use.
On the other hand, this has so much VRAM and VRAM bandwidth that you could split it up 10-12 ways _per_ _GPU_ and still have a decent GPU, and this is one of the more expensive GPUs where NVidia officially supports and endorses splitting up GPUs.
I’d love to see a cross between these reticle limit GPUs with all the HBM and a 4090 style gpu. All the gaming centric features this doesn’t have because it doesn’t make sense for super computers, and all the expensive HPC features that “are too expensive for gamers”.
How does it compare to the new Mac mini M4 basic model ?
The M4 Mac is a little bit faster and more powerful but not much 🤥
Think this will run pfSense?
Talk about a great opportunity for liquid cooling!
That's crazy 🐢
Getting the wife to co-sign for the new home server is not easy.
What service do you have to offer to get that money back in time before this gets old?
Gpu is the new Cpu
I’ll rewatch this in 10 years when this server is actually affordable for homelab 😂
you do, or did get a nvlink counterpart on AMD - it's just that it's not really useable for any effect unless you split up the work yourself. (sigh)
AMD is betting on UALink in the future. Right now AMD is point to point while NVIDIA is switched
Can’t wait for 2 years to pass, so I can afford this ❤
I think that's not a tiny mini micro server. But I'm unsure about that. Send it to me please I need to have a look 😅
I hope you have a loading dock and a forklift or pallet jack
@ServeTheHomeVideo haha 😂
These things I can afford but I guess I need to rent a little "nucular" reactor to power this thing. Electricity is not cheap here in Germany :D
@@BR0KK85 I thought they frowned on “Nuclear” reactors over there in Germany. (Just teasing)
OK, but like, can it run crysis tho?
0:19 Let's hawk tuit?
Kinda makes you realize the H100 is pretty good for the price.
The H100 challenge, or any accelerator, is that if you need to scale the number of XPUs because you run out of memory, the interconnect becomes a bigger challenge and cost.
Just remember to change the return address on that server to my home address.
I work with the 80GB/GPU variant. Believe me, the QC of these units coming out of Nvidia is downright awful.
The HGX baseboard?
@@ServeTheHomeVideo Yup.
It seems like Nvidia had big plans already in mind when they bought Mellanox.
16:21 that's the problem when using multiple cards for AI training/inferencing. The GPUs are simply working sequentially at a fraction of their potential, and customers are forced to dump loads of cash and buy all eight H200s to get a total of 1TB of GPU memory. A single H200 with a theoretical 1TB of GPU RAM would, in that scenario, deliver the same performance as all eight H200s.
Not necessarily.
There are multiple types of parallelism that each have their own characteristics and scale with a larger number of GPUs in their own unique ways.
Tensor parallelism takes advantage of a feature of matrix multiplication. If you look at the product of the first row or column of two matrices, it’s not related to the product of the second column in any useful way. In other words, you can throw half the columns on one GPU and half on the other. This scales really well and has low bandwidth, but operations which require an all-reduce (ie: Attention’s softmax) reduce performance heavily. There’s a practical limit on tensor parallelism to around 4-8 GPUs in my experience (though I don’t use systems with proprietary interconnects between accelerators), but it does more or less perfectly pool together the memory capacity, bandwidth, and compute of the GPUs (or CPUs!) used.
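To make the column-splitting idea concrete, here is a small sketch in plain Python. The "devices" are just list slices and every function name is made up for illustration. It is not how any real framework implements tensor parallelism, it only shows that each column block of the weight matrix can be multiplied independently and the results stitched back together.

```python
# Column-wise tensor parallelism on one matmul, simulated in pure Python.
# "Devices" are list slices; all names here are hypothetical.

def matmul(a, b):
    """Multiply a (n x k) by b (k x m); both are lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def split_columns(b, parts):
    """Split b into `parts` column blocks of nearly equal width."""
    m = len(b[0])
    bounds = [m * i // parts for i in range(parts + 1)]
    return [[row[lo:hi] for row in b] for lo, hi in zip(bounds, bounds[1:])]

def tensor_parallel_matmul(a, b, parts):
    """Each 'device' multiplies a by its own column block of b."""
    partials = [matmul(a, shard) for shard in split_columns(b, parts)]
    # Concatenating the partial rows stands in for the gather step.
    return [sum((p[i] for p in partials), []) for i in range(len(a))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert tensor_parallel_matmul(x, w, parts=2) == matmul(x, w)
```

The matmul itself needs no communication between the shards. The cost shows up in steps that need a whole row at once (a softmax, for example), because the shards then have to exchange data before they can continue.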
Pipeline parallelism puts a number of layers on one device, and a number of layers on the next, and the next, and so on. It generally uses more bandwidth from my experience than TP, but it’s a lot simpler to program and optimize. Performance depends on the topology, and the use case. If you’re doing large inference batches (ie: serving customers), or doing offline processing of large amounts of data (ie: agents, document retrieval, etc), you can actually hide the “bubble” where you don’t have many overlapped computations, leading to pretty close to “perfect” scaling, if bandwidth allows. This is more scalable than tensor parallelism, but it doesn’t improve response latency usually. It still gives you more tokens per second, but not necessarily a “faster” response in all situations. It does use “all the GPUs”, though, and gives you more memory capacity from the model.
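A rough way to see why large batches hide the "bubble": in an idealized pipeline where every stage takes the same time per micro-batch and communication is free (both are assumptions), pushing m micro-batches through p stages takes m + p - 1 time slots, of which each stage is busy for m. The function name below is made up for illustration.

```python
# Idealized pipeline utilization: equal stage times, no communication cost.
# This is a simplified model, not a measurement of any real system.

def pipeline_utilization(stages: int, microbatches: int) -> float:
    """Fraction of time each stage spends doing useful work."""
    total_slots = microbatches + stages - 1  # fill + steady state + drain
    return microbatches / total_slots

for m in (1, 8, 64):
    print(m, round(pipeline_utilization(stages=8, microbatches=m), 3))
```

With a single request in flight an 8-stage pipeline keeps each stage busy only 1/8 of the time, while with many micro-batches in flight the idle fraction shrinks toward zero.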
Data parallelism is the final one, and at inference it sort of scales perfectly, insofar as anything does, but it does mean that you’re running a unique instance of an LLM / CNN on each node, so the responses of individual sessions aren’t accelerated.
There’s also multidimensional parallelism which can combine properties of each of these.
Regardless, you can actually scale the usage of GPUs networked together a lot better than you’d think, and “use all of them together”, because the operations behind AI models are very parallel by nature.
@@novantha1 Thanks, that was helpful. One note though, I'm far from being an expert in the area, but tensor parallelism you mentioned doesn't seem to dominate the computation in AI workloads as I have never observed the saturation of the GPU cores in multi GPU configurations. In fact, I'm always noticing a proportional reduction in GPU usage with the increase of GPU cards during the AI workloads.
@ Yup, it doesn’t matter which parallelism you use; there’s always overhead, it’s just you pick the overhead you can live with. My experience is generally on small scale installations using consumer hardware, so it’s a bit coloured by that (with 8 nodes, a 2 GPU TP, 2 TP pipeline, 2 data parallel structure feels pretty lossless), but as you get to larger models, larger workloads possibly spread across more datacenters the math gets very different. Your performance per GPU isn’t quite logarithmic, but it definitely feels like you lose around 0.5-1% per GPU you add to the system as long as you’re using the right parallelism.
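As a sketch of what a 2 x 2 x 2 layout over 8 devices looks like, and what "lose around 0.5-1% per GPU" adds up to: the rank ordering and the compounding-loss model below are both assumptions chosen for illustration, not anything a specific framework prescribes.

```python
from itertools import product

# Hypothetical helpers: map 8 ranks onto a (data, pipeline, tensor) grid and
# estimate per-GPU efficiency under an assumed compounding-loss model.

def layout(dp: int, pp: int, tp: int) -> dict:
    """Assign each rank a (data, pipeline, tensor) coordinate; tensor index varies fastest."""
    coords = product(range(dp), range(pp), range(tp))
    return {rank: coord for rank, coord in enumerate(coords)}

def per_gpu_efficiency(gpus: int, loss_per_added_gpu: float) -> float:
    """Assumed model: each GPU beyond the first multiplies efficiency by (1 - loss)."""
    return (1.0 - loss_per_added_gpu) ** (gpus - 1)

grid = layout(dp=2, pp=2, tp=2)
for rank, coord in grid.items():
    print(rank, coord)

for loss in (0.005, 0.01):
    print(loss, round(per_gpu_efficiency(8, loss), 3))
```

Keeping the tensor index fastest-varying puts each tensor-parallel pair on adjacent ranks, which is usually where the fastest links are. Under the compounding assumption, 8 GPUs still retain well over 90% per-GPU efficiency at either loss rate.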
As for why Tensor parallel doesn’t dominate in the industry? The softmax operation, lol. It’s a real killer in larger deployments, even if it doesn’t matter in 2 GPU, or 4 GPU configs that much. I think there’s an argument for multi-dimensional parallelism at such scales, where you use multiple types (ie: you make pipelines of tensor parallel configs), but not all software is built to support that, and it’s not like everyone has experience with, say, PyTorch Titan’s asynchronous tensor parallel, for instance.
I listed it first because it’s quite dear to my heart and is often more useful to end consumers who want to run AI applications, though (because you get less latency for a single response, in contrast with the other parallels). It’s also worth noting that for architectures without a softmax (linear attention, RNNs, etc), you can get going *pretty* fast with enough GPUs. Right tool for the job, and everything.
@@novantha1 oh, I wasn't talking about 0.5-1% per GPU. Take two GPUs (regardless, consumer or for datacenters) and do some AI inferencing with any model, you'll see the GPU usage dropping to just 50% per unit. Take four GPUs, and the usage will drop to approx. 25% each. And in this video I observe exactly the same behavior. I thought memory bandwidth was the cause, but no, the H200 is huge and behaves the same. My current understanding is that the AI software accesses the GPUs sequentially, and the algorithms are not optimized for concurrent GPU usage.
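One possible explanation for the 50% / 25% pattern, sketched as a toy simulation: if a model's layers are simply split across N GPUs and only one request is in flight, then at any moment exactly one GPU is working and the rest are waiting for their turn. This is an assumption about the software setup, not a diagnosis of what was running in the video.

```python
# Toy model: one request at a time walks through the GPUs in order,
# so exactly one GPU is busy per time step. Names are hypothetical.

def sequential_utilization(num_gpus: int, steps: int = 1000) -> list:
    """Return the busy fraction of each GPU over `steps` time steps."""
    busy = [0] * num_gpus
    for t in range(steps):
        busy[t % num_gpus] += 1  # only the GPU holding the active layers works
    return [b / steps for b in busy]

for n in (1, 2, 4):
    print(n, sequential_utilization(n))
```

In this model each GPU sits at 1/N utilization, which matches the numbers above. Keeping several requests or micro-batches in flight, or using tensor parallelism, is what lets all the GPUs work at the same time.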
GPU memory is not always a limiting factor. For the workloads I run, I usually run out of compute on the GPU before I run out of memory (for inference). For training you can set a high batch size and also cache data on the GPU and use all the memory. For inference I’m limited by processing power, not memory. I run on an RTX 6000 Ada 48GB, not these monsters, but the same balance applies. You mostly need lots of memory for LLMs and such.
Can it run Doom though?
lfg rn
the goat
💚🖤📈🇺🇸
Sure it can not run #Crysis,
*but it can keep my house warm even when running idle* 😹😹😹😹
Are Mark and Elon going to upgrade their 100k GPU servers?
But can it run crysis?
simulate someone playing crysis on that thing
Can it run minesweeper?
the primary question is not answered in this video, so once again "But can it run Crysis???"
Not even this amount of GPU power can generate correct AI answers ;)
Will STH ever speak to the growing environmental and societal issues we will face as larger datacenters and AI servers proliferate? Drooling over tech is fun, but it's not so great when it leaves me worried for our future.
TBH I worry less about the environmental impact since that is more known and manageable. The bigger question is what happens in 1-3 years from now as AI becomes a control layer, then in 3-5 years when it has the performance to start replacing huge categories of jobs. It used to be that folks could educate themselves and do higher value work. The new job threat is hard for humans to out-educate.
Please run black myth wukong at highest setting with rt on
It might max out any thing with that
10kW??? Ouch
But can it play Crysis