It was great meeting you in St. Louis! Your build was amazing, and I'm totally jealous of that 1.4 _TERABITS_ of bandwidth! The 512 GB of RAM on the Threadripper is a little ostentatious, but it's really cool to highlight both ends of current ARM 'cluster-in-a-box' solutions.
Thanks Jeff! :-)
Aaaaaaaahhhh! Fan Girling in the corner here, the ultimate CS comic book crossover!
I'm all about minimal... love the ITX build on the Streacom ITX Bench chassis. I have one of these as well, and this is definitely giving me some ideas here. Would love a standalone system next to my workstation to run some Rancher K3s...
@@unknowntechio definitely love the packaging on it. Imagine having that on your desk and being able to tinker with it. So much power, such a small footprint. And the energy efficiency, what if you ran it on solar energy?!
@@ServeTheHomeVideo Turns out the best order to watch them both is interleaving them ;)
Did I actually say "Intel cores" instead of AMD or x86? That is why I normally do not edit my own videos! Oops
I must say, this is by far the coolest, most awesome and unpredicted collaboration of the year! Good job guys!
Thanks Bogdan. I was telling Jeff last night how much I liked his video (he shared it with me yesterday.)
Jeff Geerling lookin' bad ass. Wouldn't be surprised if he gets additional TSA screening.
Watched Jeff pretty much right away. What I loved is that he used his own software, which he wrote, to control it. He's the man!
I like the industrial look of the build. Nothing beats the beauty of rows and rows of RAM and PCI-E cards.
12:45 Use case: Super NAT box with network traffic analysis.
NAT VPS is becoming more common, and what better way to do the NAT than a big, powerful NAT gateway? Keep the rest of the infrastructure management exactly the same as non-NAT.
I am assuming you can push the network traffic via PCIe to the host node, so you could do traffic analysis for network optimization, monitoring, security, etc.
Further, you can then use the host node as the VPN gateway to access the NAT side, handle management, etc.
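For the traffic-analysis half of that idea, here is a minimal sketch of what inspecting mirrored traffic on the host node could look like, assuming the DPU forwards or mirrors packets to a host-visible interface. The interface name and the use of Scapy are assumptions for illustration, not anything shown in the video.

```python
# Hypothetical sketch: tally per-protocol traffic on an interface that the
# DPU exposes (or mirrors) to the host. The interface name is an assumption.
from collections import Counter

from scapy.all import IP, TCP, UDP, sniff  # pip install scapy

proto_bytes = Counter()

def tally(pkt):
    """Accumulate byte counts per transport protocol."""
    if IP in pkt:
        if TCP in pkt:
            proto_bytes["tcp"] += len(pkt)
        elif UDP in pkt:
            proto_bytes["udp"] += len(pkt)
        else:
            proto_bytes["other-ip"] += len(pkt)

# Capture 10,000 packets from the (assumed) host-facing interface.
sniff(iface="enp65s0f0", prn=tally, count=10_000, store=False)

for proto, nbytes in proto_bytes.most_common():
    print(f"{proto}: {nbytes / 1e6:.1f} MB")
```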
I'd really love a video with an actual installed software use-case scenario to help me understand what kinds of real work could be done on a DPU cluster. I haven't yet been able to wrap my head around what's possible from the general DPU conversation. What can I do today with that massive x86 and those 8 DPUs? Show us some live working use-cases! 😁😍
Nothing. Those are just some fancy kungfu for show.
X2 for your question.
I have a good one, and it was my original inspiration early last September: a container orchestrator with a custom runtime to offload functions or calculations to the ISA most suited for the workload. For instance, one would use Cavium MIPS for network and crypto, x86 for general use, GPU for the obvious, SPARC for heat/time machine/death by Oracle, and Arm for... um... 8-bit retro gaming, farty tasks, bragging rights... Idk what Arm excels at offhand besides liliputing and size vs. cost.
I have been part of projects that did simulated distributed environments in a box, for showing what's possible. This would be that, but with performance available.
Yes, you could do it with VMs, but if you had an edge-based solution that used ARM, this would be a pretty cool way to do it.
@@morosis82 Right but what's the need for the Threadripper? I'm trying to wrap my head around the software configuration of this setup. Wouldn't they each run an ARMv8 Linux as a cluster? Or like you said, you could run KVM/ESXi etc. on the ASUS mainboard, but how are those cards presented to the hypervisor? Do they use a special DPU driver and QEMU handles architecture virtualization? I simply don't understand lol.
Also I was looking at the DPU brochures and I don't fully understand what workloads these are meant for.
And I don't know if $15k is a budget build for this sort of compute or not. Once again the $7,000 AMD chip seems pointless or overkill at the very least.
I love this Patrick - Jeff meeting at the iconic Gateway Arch, St. Louis, MO, 2:05. Both passionate about their stuff.
Awesome! Two of my most favorite YT creators finally met, it's like a crossover in a superhero movie. XD
Always impressive how connected Jeff is. He's got more comments than Tom has Friends.
Yeah, he also has lots of comments over at Linus Tech Tips
I saw St Louis Arch and I immediately thought of Jeff Geerling. And then he appeared! This is awesome
Funny who you randomly bump into while on the road.
Hey Patrick! This pair of videos with Jeff made my day. I am jealous of both of you! One thing that would help us understand your project better would be some benchmarks, which you didn't do here or in the main article.
We used this platform for some of the Threadripper Pro coverage we did earlier this year so it seemed redundant benchmarking a CPU we have already done a lot with. On the DPU side, the UCSC / Sandia folks did a killer job also looking at the accelerator performance for crypto and such. That paper is so good that I am not sure what we would add other than "the A72 core is crazily slower than the Zen2 core." I mean, I can run Linpack on it, but we will have the Ampere Altra Max next week which is more interesting for performance Arm since those are N1 cores.
I am so impatient for your fiber videos; I am starting to get really interested in the topic thanks to your coverage. Also, good choice, this one with Jeff is a really good collab!
I think we are going to have more on the STH main site too. Maybe a piece on the Fluke meters we are using to test the fiber.
@@ServeTheHomeVideo great!
Patrick - I have 1.4 Tbps network bandwidth
Me - *crying in my 2.5gbps* 🤣
Without this channel I would not know what I want for a career. Thank you!
Jeff: man I want that node cluster
Patrick: man I want that node cluster
2 fellow nerds at different price points but still wanting what the other guy has
Patrick: "Let's do a l'll spinny", 4:51. I am so proud of that move of your cluster on the high stool. Table still yet to come though.
Yea. Table is an issue
You didn't have to do him like that. You really didn't have to bring a server grade build to a challenge with the raspberry pi guy.
It's a cluster in a box..... a cluster in a box baby
I was over at Jeff's and was just waiting for this video. And voilà... here it is.
We planned to have them come out at the same time today :-)
Cool collaboration with Jeff
That is an awesome build. The only thing I would worry about is heat build up around the DPU cards. I would want some fans taking in cool air and blowing it on the DPU cards.
That is a real worry. There are chassis fans blowing over the cards that are not shown well.
Fantastic collab video!!!! Glad to see Jeff!!!
At $15,000, it's still probably a better deal than an Ampere ARM server if you're only getting one.
Wow, what a crazy random happenstance to meet GeerlingGuy RIGHT THERE at The Arch. Amazing.
Also amazing that there was a Sony FX3 setup on a nearby garbage can recording and we both had mics on as well. Amazing!
I'm waiting for this. Anyone else been doing the same?
Arm64, M.2 array , ECC, 10Gbe, ZFS.
(Present to camera) "What a coincidence that I would meet you here!" Lol
Planned coincidences are the best kind!
:-)
I saw Jeff's videos and this one and I thought "oh, what a coincidence" xD
It's not often you get to see what high-performance ARM can get to. :)
What kind of power supply did that require? And have you measured peak draw on it?
The system right now is on an 850W PSU. I think these take 60-65W max or so across the card, including optics. The 2.5GHz cards can go to 75W or something like that (sorry, cannot remember off the top of my head).
I think Jeff covers that a bit in his video, mate. You can plug in a 24-pin adapter and run that through a round-tip adapter. Pretty cool.
That's amazing all those 100G ports on that little thing haha!
You pretty well have the same fiber as a data center build-out! :)
We have a bunch of 40G fiber going into our newest build-out along with some 100G.
Basically the same fibre cabling structure as in data centres or central offices. It surely is more convenient than placing switches everywhere.
And this year's Oscar for best actor goes to... :D
It was a good competition 🔥
Patrick, you seem even happier than normal today bro.
I love getting to do these projects.
You could buy several of his clusters for the cost of one of your nodes!
Totally could. The other way to look at it though is that you would need to buy ~20 of the Turing Pi 2 clusters to get similar capabilities, minus the networking. That was the point we were showing. A small version and a big version.
Amazing crossover 😊
Glad you enjoyed it
Your enthusiasm is legendary!
This is a cool build, and great to see the crossover with Jeff. Any chance that you could run some benchmarks on the Arm DPUs similar to what Jeff did on his?
Probably will do for the main site this week. The HPL is so bad on these lower-end Arm cores that the Threadripper does much better. In the meantime, the UCSC/ Sandia paper on BlueField-2 performance I think is done very well. arxiv.org/pdf/2105.06619.pdf
Thank you for the link to the paper!
From the paper: "Individual Results Analysis: As we expected, the
BlueField-2 card’s performance generally ranked lowest of all
the systems tested except the RPi4."
So, while not great, node for node, the STH Arm cluster should outperform Jeff's cluster in compute workloads.
You cheated, he said “none of this x86 stuff” :P but yeah, it is super cool. Clusters are fun!
Excuse me.
I don't know anything about clusters, or how to build one. Although it does look like an interesting subject to further investigate.
I wanted to ask one thing. DAMN, WHAT IS THAT ARCH IN THE INTRO MADE OF?
Oh my GOD, amazing.
Lovely build and home fiber project.
Patrick... how many coffees do you have before the catchphrase? Ehehehehe. Keep up the good work :)
0. If I drink coffee before I do these, people say I speak too quickly. I usually only record when I am tired.
@@ServeTheHomeVideo WOW :)
WOW!!! That thing is AMAZING!! (and Patrick's computer is nice too LOL) I only have ONE question: Can it run Crysis? :D
Thanks to both of you for some awesome content
Thanks for watching!
This is red shirt Patrick, I’m surprised you didn’t modify anything like red shirt Jeff does :)
Even red sweatshirt Patrick!
Wow, what a setup
Nice cluster box. As a fellow competitive person, I think that I could easily beat it: put an X12DPG-QT6 Supermicro motherboard in a 4U rack chassis, 2x 40-core Ice Lakes + 4TB RAM, 6 DPUs + 4-port 10G Ethernet. You could liquid cool almost everything, including at least 4 DPUs. Thanks for the video.
This video was fantastic!
Thanks George! I hope you have a great day.
That is freakin awesome!!!!!
Looking at this and wondering if that CPU could actually route 1.4Tbps.
Maybe an idea for next project :D
It can, if you leave the CPU out of it.
(RDMA is AWESOME!)
Even without the Threadripper Pro, on my old Intel Xeon E5-2690 (v1) cluster, in the IB bandwidth benchmark, I can top out at around 96-97 Gbps (out of a theoretical max of 100 Gbps), when running in RDMA mode.
In actual application usage, I'm closer to around maybe 80-82 Gbps, but that is also HIGHLY dependent on the application that you're using (and how well MPI has been implemented in the application itself).
When I benchmarked the network bandwidth using Fluent, I was able to get 10 GB/s max across four nodes (all of which are connected to a Mellanox MSB-7890 externally managed 36-port 100 Gbps EDR Infiniband switch).
(Sadly, alas, the X9 platform only has PCIe 3.0 x16 and the ConnectX-4 dual 100 Gbps VPI port cards are also only PCIe 3.0 x16, which means that the bus interface can't support two 100 Gbps connections at the full, line speed, simultaneously.)
But you can definitely get a lot closer to it than anything else.
Ethernet on the VPI ports has between a 1-3% penalty vs. Infiniband.
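To put a rough number on that PCIe 3.0 x16 limitation, here is a quick back-of-the-envelope sketch using nominal per-lane rates (protocol overhead beyond line encoding is ignored; these are not measurements from either system):

```python
# Why a PCIe 3.0 x16 slot cannot feed two 100 Gbps ports at line rate.
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
lanes = 16
per_lane_gbps = 8 * (128 / 130)          # ~7.88 Gbit/s usable per lane
slot_gbps = lanes * per_lane_gbps        # ~126 Gbit/s per direction

needed_gbps = 2 * 100                    # two 100 Gbps ports at line rate

print(f"PCIe 3.0 x16 usable: ~{slot_gbps:.0f} Gbps per direction")
print(f"Needed for 2x 100G:   {needed_gbps} Gbps -> bus-limited")
```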
Right. There is enough bandwidth assuming DMA is used, but it's somewhat close, so it might be a little limited depending on how much overhead there is. If you try to do actual CPU processing on the data, you will probably run into Infinity Fabric bottlenecks.
PCIe 4.0 x16 has 31.5 GB/s = 0.252 Tb/s of bandwidth, so with 7 nodes the available bandwidth between the nodes and CPU is sufficient: 1.764 Tb/s. (These are all full-duplex links.)
DDR4-3200 provides a data rate of 25.6 GB/s, which times 8 channels gives 1.6384 Tb/s total.
But I think the main intended use of the DPUs is to offload a lot of the processing to the ARM cores, so that the CPU doesn't have to actually process all the data.
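The arithmetic in that comment is easy to sanity-check. Here it is written out so the assumptions (seven Gen4 x16 links, eight channels of DDR4-3200, full-duplex links) are explicit; this is a sketch of the same estimate, not additional measured data:

```python
# Reproduce the back-of-the-envelope numbers from the comment above.
pcie4_x16_GBps = 31.5                    # usable GB/s per direction for a Gen4 x16 link
nodes = 7
pcie_total_Tbps = nodes * pcie4_x16_GBps * 8 / 1000
print(f"7x PCIe 4.0 x16: {pcie_total_Tbps:.3f} Tb/s")   # ~1.764 Tb/s per direction

ddr4_3200_GBps = 25.6                    # per memory channel
channels = 8
mem_total_Tbps = channels * ddr4_3200_GBps * 8 / 1000
print(f"8ch DDR4-3200:   {mem_total_Tbps:.4f} Tb/s")    # ~1.6384 Tb/s
```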
The CPU itself would bottleneck around 150 Gbps.
But the NICs have flow offload, and it is not too hard to use these features with standard tools.
@@movax20h What do the Bluefields actually do? They're high throughput, but what is their purpose? And even then I don't understand why they are paired with the Threadripper. Would this be a machine learning solution, and that's the need for the bandwidth? I'm reading the STH article and it mentions crypto offloading, but are we talking about mining?
Small note: You seem to be shooting 24fps footage and importing it into a 30fps video. This causes terrible judder! Remember to shoot in 30 or 60fps.
5:38 Chad with a Tesla Cybertruck vs. Jeff with a puny Toyota Prius meme right here.
Joking aside, thanks for this silly mis-comm; that was seriously funny.
Jeff does not drive a Prius!
Jeff stole Patrick's camera, we have the video evidence
"wifi6" XD oh yes. i think i need this for my next youtube machine.
not overkill at all.
The real question is was it red shirt, or blue shirt Jeff?
Red Shirt Jeff isn't allowed near the prototype boards :D
Don't forget the PSP cores - you have plenty of arm cores on that ROME x86 processor itself - one per CCD 😂
Im more than a little annoyed that there was both a supercomputer conference AND Patrick from sth in my city, and i didnt know about it until 2 weeks late
Great content! Any chance for a studio tour video in the future?
Possibly. The last one with the blue door studio did not do well though: th-cam.com/video/q1no7rXWALs/w-d-xo.html
@@ServeTheHomeVideo that one was nice. But I’d love to see and hear more about those fiber runs and equipment you mentioned in this video. Anyways, you are doing a great job and I really appreciate every video you put out.
I love the acting, it's a lot better than in a Marvel movie
Ha!
This is a little bit overkill for my dream of having a Plex DVR cluster: record on one node, and use all the other nodes to transcode the videos from the massive MPEG2 files over to x265 using Tdarr or HandBrake. But the Turing Pi, even with the 6-core Jetsons, might not have enough power to transcode. Currently my 3950X needs supplemental support from a 4650G to keep up with recordings. Might need two Turing Pi V2s, each with 24 cores.
@@GeorgeWashingtonLaserMusket For one, the Jetson Nanos I have do have a hardware transcoder. However,
the problem I have is that going from MPEG2 to H.265 creates larger files than the original.
I've tested:
NVIDIA NVENC on Maxwell, Pascal, and Turing (IIRC Maxwell couldn't transcode to H.265 at all, or couldn't do it from MPEG2)
AMD VCE on Vega, Vega II, Vega 2.5 (4000 and 5000 series Ryzen APUs), RDNA1
Intel QSV on HD4000, HD5000, HD530, HD630
Apple Video Toolbox on M1
ALL of them create larger files than the original, but only when going from MPEG2 to H.265, whereas the CPU normally halves the file size, or does even better on low-grain black and white.
My 3950X is about as fast as my GTX 1650, though it uses 3x more power at 90W vs. 30W. This is a sacrifice I'm willing to make to achieve my intended goal of not losing quality while significantly reducing file size.
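For anyone wanting to try the CPU-encode route described above on a cluster node, here is a minimal sketch of a batch MPEG2-to-H.265 software transcode using ffmpeg's libx265 encoder. The directory paths and the CRF/preset values are illustrative assumptions; this is not the exact Tdarr/HandBrake pipeline being discussed.

```python
# Hypothetical batch transcode: MPEG2 recordings -> H.265 (software x265 on the CPU).
# Paths and quality settings are placeholders; tune for your own size/quality target.
import subprocess
from pathlib import Path

SRC = Path("/recordings/mpeg2")
DST = Path("/recordings/x265")
DST.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC.glob("*.ts")):
    out = DST / (src.stem + ".mkv")
    if out.exists():
        continue  # skip files already transcoded
    subprocess.run(
        [
            "ffmpeg", "-hide_banner",
            "-i", str(src),
            "-c:v", "libx265", "-crf", "22", "-preset", "medium",
            "-c:a", "copy",          # keep the original audio untouched
            str(out),
        ],
        check=True,
    )
```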
...but can it run Crysis 2 in 4K without DLSS?
Just kidding, this is awesome. I want to see what you can do with an ultimate cluster like that.
Well... if you can get your hands on a dual tower, like the TT Core W200 with the P200 add-ons, you could probably run two of those rigs you have there, plus enough space to fit... 64 Arm boards? Depends on how nuts you go with it, but RPi boards fit in drive slots real nice like...
Wow, what a server in a box. Could you ask Jeff if he can help you see how this server cluster performs against the Top500 supercomputer ratings?
I am not sure what you mean? The Top500 results include full cluster-wide networking. You would normally only run Linpack on CPUs with vector acceleration such as AVX2 on the AMD chips (or AVX512 on Xeons). There the Threadripper is around 1.5-1.6 TFLOPS. So that is about 150-160 Raspberry Pi 4s just for the main CPU on this. On the DPU side, the A72 cores have more performance per core than the Pis, but are really there to accelerate crypto.
As you scale small nodes, the interconnect networking ends up eating a lot more power and performance. That is why you cannot just take a number from this machine (or a small cluster) and compare it to a Top500 result, because scaling interconnects is such a big deal in terms of power and performance.
We actually test a lot of the high-end 4-8 GPU supercomputer nodes, and high-density CPU nodes, so people would look at us funny in the industry if we made a claim that this cluster, or any small cluster, represented a part of a Top500 linpack run.
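As a rough illustration of where a number like 1.5-1.6 TFLOPS comes from, here is a back-of-the-envelope theoretical-peak calculation for a 64-core Zen 2 part. The assumed all-core clock and HPL efficiency are illustrative guesses, not STH measurements:

```python
# Back-of-the-envelope FP64 peak for a 64-core Zen 2 CPU (illustrative numbers).
cores = 64
sustained_ghz = 2.7          # assumed all-core clock under AVX2 load
flops_per_cycle = 16         # 2x 256-bit FMA units -> 16 double-precision FLOPs/cycle

peak_tflops = cores * sustained_ghz * flops_per_cycle / 1000
print(f"Theoretical FP64 peak: ~{peak_tflops:.2f} TFLOPS")   # ~2.76 TFLOPS

# HPL lands well below peak; an efficiency in this range would match
# the ~1.5-1.6 TFLOPS figure quoted above.
print(f"At 58% efficiency:     ~{peak_tflops * 0.58:.2f} TFLOPS")
```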
@@ServeTheHomeVideo It would have been fun to add what Jeff Geerling did in his video
@@favesongslist Yes, but it would have made his look a lot less efficient if it used 4x the power and cost 1.5-2x this entire cluster just to get the same performance as the x86 cores alone. That would 100% make sense to folks in the industry, as the RPi cores/boards are not made for Linpack-style workloads.
My next pfSense home router. ;-)
These two guys need an Emmy for acting!
insane rig
Those cards look like they'll catch on fire unless you add an industrial fan and a custom shroud to your build... what are the thermals like?
Not too bad, really. You are right: there is ducted airflow behind the cable cover, and they need airflow. On the other hand, the chips themselves are sub-55W, since there is power budget for the RAM, optics, and such on the card as well. The networking is using most of the power on them.
Awesome sauce
Nice video, thank you for sharing :)
Interesting video, even though I don't think I would choose this solution for my small project; I would reach for grown-up server hardware instead. But it's amazing to see where you can go with today's technology and to look at a slightly unusual solution. Both gentlemen have interesting content, and it would be interesting to see a joint project ;)
Can they see each other over PCIe without involving the CPU?
I knew it was going to be those BlueField cards! Just one simple question: is the host sharing the 14x 100G connections with the DPUs, or are those ports exclusively for the DPUs?
Shared
@@ServeTheHomeVideo Great to hear. The BlueField/ConnectX-6 cards really blur the line between being NICs and being their own systems.
This is like having a Ferrari inside your tiny living room
Great challenge and execution!!! How much energy does your server consume? How loud/quiet is it? Thank you, and keep making great videos.
Idle is ~200W with 14x CWDM optics and the 10Gbase-T ports lit up. The networking part uses a ton of power. Max not over 750W yet. In terms of noise, not silent, but nowhere near the screaming servers we normally test. There is a lot more that can be done like ducting a 120mm fan to the cards that you would want to do to keep noise down.
@@ServeTheHomeVideo wow, I would have guessed at least twice those figures...
That's an ARM and a LEG!
Great one.
Are you guys keeping it together in that config and using it for something??
All I can say is that I've been planning this build, albeit on a much smaller scale, since early September. I guess now we wait to see other ISAs get in the mix... or perhaps a DPU crypto offload challenge? IDK, but I'm a huge fan of efficiency and minimalist OSes. Let's have the parts of the application run on the silicon where they run best and not encumber the devs with sticking to one architecture. Now we need an ISA-aware swarm orchestrator...
Would be interesting to see it running.
Had a few screenshots in the video. For the actual box, the fans spin up, a few blinking lights. Not much to it really. The side needs to be on to ensure there is enough airflow, though. Hopefully when the new studio is finished we can do more with everything on. Good feedback.
I'm new to this whole Arm cluster server thing (thanks, YouTube algorithm). What is the use case for this type of system?
Proxmox Arm and RKE2, and Proxmox x86 + RKE2, and I'd be a happy nerd. Especially with an InfiniBand disk shelf/SAN.
So a bazooka to a knife fight it is... I just hope Jeff has a nice case to make up for the difference in firepower...
you RANDOMLY meet him at an ARCH?? As in ARCH LINUX??? Are you trying to tell us something Patrick????? .......
....
......
.......
.....
....
..
.
yea I didn't think so
Would there be a point in something like this but with 4 Ghost/Beast Canyon NUCs?
That would be super cool as well. Look at the Intel VCA's too if you want to go down that path.
Could you maybe do a segment on used 24-48 port 10 gig switches that can be found in an affordable price range?
I think we have a 25GbE 48 port switch somewhere in the queue. Let me look into doing more of those 10G units. I think I saw some on the schedule for next year that the team is working on.
@@ServeTheHomeVideo Speaking of 25GbE switches, any word on when we might see something like a fanless, homelab-friendly 25GbE switch similar to the MikroTik CRS305 or 309? I picked up some used 25GbE NICs on eBay, and unless I directly connect them I'm stuck running them at 10 gig.
By Arm, did you mean spend an ARM and a Leg?!!!
I'm still trying to think of what I would use the ARM cores/processors for.
The big one will eventually be running services like NVMe-oF and handling network offloads. For now, it is more fun to use them as cluster nodes.
@@ServeTheHomeVideo
As I recall from the video though, the cluster nodes don't have NVMe onboard or DAS (unless it's passed through from the AMD Threadripper Pro system).
And the AMD Threadripper Pro system itself doesn't have a 100 Gbps connection that's tied/directly connected to it.
So it would be interesting to see how you might deploy an NVMe-oF solution if the DPUs don't have any onboard or direct-attached NVMe drives and/or the NVMe storage has to be passed through to the DPUs.
Like, I would understand if the DPU had either onboard or directly attached NVMe, and then you could present that to the fabric.
But I've never seen how you would do the same if you don't have any onboard NVMe nor any NVMe that's directly attached to said DPU.
That's interesting.
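On the "no local NVMe" point: one common pattern is for the DPU's Arm Linux to act as an NVMe-oF initiator against a target elsewhere on the 100G fabric (for example, a storage node or the host), and then serve that namespace onward. Here is a minimal, hypothetical sketch of the initiator side using nvme-cli; the address, port, and NQN are placeholders, and this is not a description of how STH's box is actually configured.

```python
# Hypothetical: from the DPU's Arm Linux, discover and connect to a remote
# NVMe-oF (RDMA) target on the 100G fabric. Address/port/NQN are placeholders.
import subprocess

TARGET_ADDR = "192.0.2.10"                          # example storage-target address
TARGET_PORT = "4420"                                # conventional NVMe-oF port
TARGET_NQN = "nqn.2021-11.io.example:nvme-pool0"    # placeholder subsystem NQN

# List the subsystems the target is exporting.
subprocess.run(
    ["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)

# Attach the remote namespace; it then shows up locally as /dev/nvmeXnY.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN,
     "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)
```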
So... What can you do with your overkill system?
This is cool. Is there a way to buy this?
Best for running Qubes OS for the big machine
Why aren't the parts linked in the description?
No overall breakdown of the prices of the parts either
What's the point of a video like this if the sources aren't linked?
Hm, 3 months later and the Turing Pi is still TBA... ^^
Will it fold for Stanford's Folding at Home?
I'm really curious as to why you aren't using a switch for the fibre.
There is a switch, just in the rack rather than in the studio
@@ServeTheHomeVideo
So....does this mean that all of the systems have a fibre cable running back to the rack (the centralised switch)?
(Maybe I'm a little bit confused about the physical and network topology/layout of your office/rack/studio.)
Or are you going to be virtualising the "studio" systems, so you just have servers that host the virtualised workstations, and those servers connect to the 100 Gbps (or 100 GbE) switch via DACs?
Maybe I'm a little bit confused.
(I only have two fibre cables running to my Mellanox IB switch, but my micro cluster server sits in the same rack as said switch, so I just use DACs to connect the server to the switch.)
Thanks.
I would love to see an update to your cluster with:
1. Dual Ampere Altra 128C CPUs (with PCIe 5.0 upgrade)
2. Many more DPUs
3. NVIDIA GPUs
You can use 16 of the 32 PCIe 5.0 lanes on BlueField 3 to connect to an NVIDIA GPU with PCIe 5.0 or two NVIDIA GPUs with PCIe 5.0 to 2 x PCIe 4.0.
2 x Ampere Altra -> 12 BlueField 3 -> 12 NVIDIA GPUs (PCIe 5.0)
2 x Ampere Altra -> 12 BlueField 3 -> 24 NVIDIA GPUs (PCIe 4.0)
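A quick bandwidth sanity check on that lane split, using nominal per-lane rates. This is a sketch of the raw numbers only, not vendor-verified BlueField-3 throughput:

```python
# Nominal usable bandwidth per lane (GB/s, one direction) after 128b/130b encoding,
# ignoring higher-level protocol overhead.
GEN4_PER_LANE = 1.97   # 16 GT/s
GEN5_PER_LANE = 3.94   # 32 GT/s

# One Gen5 x16 link out of the BlueField-3's 32 lanes...
gen5_x16 = 16 * GEN5_PER_LANE            # ~63 GB/s
# ...carries roughly the same raw bandwidth as two Gen4 x16 GPU links.
two_gen4_x16 = 2 * 16 * GEN4_PER_LANE    # ~63 GB/s

print(f"1x Gen5 x16: ~{gen5_x16:.0f} GB/s")
print(f"2x Gen4 x16: ~{two_gen4_x16:.0f} GB/s")
```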
Arch btw
Show some demos and benchmarks!
5:53 Define 7 TG, I recognize that case from a mile away because I have the same one right under my desk :^)
Is there a budget-friendly Arm card? I have this weird dream where I can have Proxmox running a few VMs on x86, but I'd like to be able to spin up an Arm VM from the same machine. Does hardware exist to do this that isn't a £2000 DPU? Some kind of PCIe card that took compute modules would be ideal.
It might be cheaper to just run an RPi or a cloud instance for Arm. We are going to have more on the ASRock Rack mATX board for Ampere Altra in a few weeks. Also search for the ASRock Rack ALTRA-NAS, which I hope to do in Q1 2024.
@@ServeTheHomeVideo it would definitely be cheaper but I wouldn't be as cool.
I remember you could get an 086 co-pro for a BBC Micro, and with a few cunning commands your trusty BBC would suddenly be PC compatible... Now I know it's a completely different era of computing, but I think it would be a lot of fun if my x86 machine could BE an Arm machine too, rather than just emulating one.
How big were the burgers at that St. Louis McDonald's? Did they spend all their money on the first arch and have none left for the second?
I tried going to BBQ but they were closed
Woah! Is that a R6 case I see?