Learn more from Juniper Networks on how the network will save the AI data center: juniper.net/AIwars
Dive into the heart of the AI networking battleground where the demand for speed and efficiency tests the limits of technology. This video unravels the complexities of training AI models, highlighting the network bottlenecks that slow down progress and how innovations like RDMA, RoCE, ECN, and PFC are pivotal in the quest for seamless AI development. We explore the clash between Ethernet and Infiniband, dissecting their strengths and weaknesses in handling the massive data and low-latency requirements of AI training.
🔥🔥Join the NetworkChuck Academy!: ntck.co/NCAcademy
**Sponsored by Juniper
i will be the networkjustin of albania
Juniper is being acquired by HPE
I would be curious if they have *load-balancing* available as well, or if that is considered 'not available' because so many GPUs run in parallel and every line is already hyper-saturated. Though *RoCE/RDMA* [sounds like another DMA for the network], *Explicit Congestion Notification*, and *Priority Flow Control* are interesting.
Hope you have a great day & Safe Travels!
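To make the load-balancing question concrete: most Ethernet fabrics today still balance per flow, typically by hashing the 5-tuple (ECMP-style). Below is a rough Python sketch of that idea, with invented link names and addresses, showing why a handful of long-lived, line-rate GPU-to-GPU flows can still pile onto the same uplink even when other links sit idle.

```python
# Toy illustration only (not a switch implementation): ECMP-style per-flow
# load balancing hashes each flow's 5-tuple onto one uplink. With millions of
# small flows this averages out; with a few huge AI "elephant" flows, two of
# them can easily hash onto the same link and saturate it.
import hashlib

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto):
    """Pick an uplink by hashing the 5-tuple, roughly how many switches do it."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Eight hypothetical GPU-to-GPU flows; assume each one alone can fill a link.
# Destination port 4791 is the real RoCEv2 UDP port; the IPs are made up.
flows = [(f"10.1.0.{i}", f"10.2.0.{i}", 49152 + i, 4791, "udp") for i in range(8)]

placement = {}
for flow in flows:
    placement.setdefault(ecmp_pick(*flow), []).append(flow)

for uplink in UPLINKS:
    print(uplink, "carries", len(placement.get(uplink, [])), "elephant flow(s)")
```

Whether the distribution comes out even is luck of the hash, which is part of why the newer Ethernet answers lean on ECN/PFC plus finer-grained or adaptive load balancing rather than per-flow hashing alone.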
6:40 Roblox💥💯💯
I wonder what the security implications of RDMA is since it is bypassing the kernal
That was my question too. How can I ensure security to earn money when today's skills are obsolete?
Usually RDMA is used in an isolated environment
RDMA is secure in the sense that it's typically implemented on a closed off network that is completely separate from other portions of the network that a supercomputer might be using (I work with supercomputers so that is what I think of when it comes to RDMA). Even with no encryption across the wire, there are keys in place which allow certain NICs and devices to talk to each other so it's a zero-trust configuration. You almost have to treat the high-speed network fabric like it's the bus moving data from CPU0 to CPU1. We have implemented RDMA across two of the supercomputers I work on and the performance is insane, but there still is a potential security risk that needs to be accepted even though the RDMA network is very closed off.
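A toy sketch of the "keys" idea described above, in plain Python rather than a real RDMA library (the class and method names are invented for illustration): the target side registers a memory region and hands out a key, and the NIC only applies incoming remote writes that present the matching key for that region.

```python
# Conceptual model only: real RDMA uses verbs (protection domains, memory
# registration, rkeys enforced in NIC hardware); this just mimics the idea
# that a remote write without the right key never touches memory.
import secrets

class ToyRdmaNic:
    def __init__(self):
        self.regions = {}  # rkey -> registered buffer

    def register_memory(self, size):
        """Register a buffer and return the rkey a peer must present to access it."""
        rkey = secrets.token_hex(4)
        self.regions[rkey] = bytearray(size)
        return rkey

    def remote_write(self, rkey, offset, payload):
        """What the NIC does on an incoming RDMA WRITE: check the key, then copy."""
        region = self.regions.get(rkey)
        if region is None:
            raise PermissionError("unknown rkey, write dropped")
        region[offset:offset + len(payload)] = payload

target = ToyRdmaNic()
rkey = target.register_memory(4096)

target.remote_write(rkey, 0, b"gradient shard")      # accepted: key matches
try:
    target.remote_write("deadbeef", 0, b"malicious")  # rejected: wrong key
except PermissionError as err:
    print(err)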
@@lubu42 I'm studying computer science and I wish to work with supercomputers, or at least as a network engineer, but I'm not sure my university is teaching me enough about it. Can I get some advice on how to get a job as a network engineer or to work with supercomputers, anything that isn't web development?
kernel*
Anyone old enough to remember running Doom via IPX/SPX (Novell Netware) on a LAN? Remember when TCP/IP was introduced into the stack and slowing the post-Doom games down?
Pepperidge Farm remembers.
Having worked with some of the networking and hardware setups for AI servers... The networking and liquid cooling I got to see there were awesome. Just about anything you can get your hands on is outpaced 10 fold or more in an AI datacenter.
which site code?
Yep until the tree hugging greenies go on about power usage for cards and data centre cooling
7:39 Someone is behind you man! Just Run 😀
I saw that too!
I got scared lmfao
What or who was that? Pause at your own risk. That will give you nightmares. I hope you’re still alive!
Who is that behind you
we need an answer @NetworkChuck who was that weird looking person/creature behind you?
Sounds like this is going to introduce a severe set of attack vectors, but nonetheless it's pretty awesome.
You run the AI training on a dedicated private network. If people are breaking into your private, closed off network, then you have a bigger set of problems to worry about.
@@cheema33 It's a new network stack using RDMA; it probably won't even be able to communicate with IPv4/IPv6. It's going to have a bunch of vulnerabilities and be underdeveloped. Will it get better? Yes, but the main attack vector is that RDMA. With that said, I still think it's freaking cool.
This whole video must be the most sneaky advertisement I have ever seen in my life.
the jump cuts on the other hand...
It’s not
Not an advert, or not sneaky? @@SovietCustard
We've just done some testing at work with the guys from Juniper, those switches are beasts
Hey Chuck, fellow network engineer here (mostly cloud infrastructure though). First, I loved the video. This helped a ton to clear up my understanding at a high level of InfiniBand. Just some creative criticism, this definitely went much deeper technically than your typical content, so maybe it would be a good idea to state that up front or maybe even break content like this into a separate channel. I would love it if you made more deeply technical content, even beyond this level. You have a great gift for explaining these sorts of things in terms that people can relate to while still keeping it entertaining! Keep up the GREAT work!
I didn't even realize it was that deep technically. I guess it just makes sense.
A separate channel, yes; more advanced content would be the best solution.
I have no problems with A"i" (really just the same machine learning we've had forced upon us for 10 years) being slowed to a crawl.
Same. In fact, I am in *full support* of dismantling companies like Stability, Wombo, and Open. I have no problem if RatGPT is shut down, along with all of the other scummy A.I. applications.
Fun to see this video right now, as I am planning a big cluster at work. The main argument for me for Ethernet is cost.
What's important is working with what is adopted. As you mentioned, Ethernet is worldwide and the cheaper option to adopt, which means that is what techs are going to be hired to handle, since that is what companies will prefer. This is also where techs can be paid a lot more by applying for the positions that will soon open up around supporting InfiniBand. As IT specialists, we work with companies and their current infrastructure; if the advancement of networking technology is inevitable, we'll all be forced to move along with it!
Wow, this was cool 😀 I like both of them. A new network system will be needed for quantum computers soon as well: quantum teleportation, where the state cannot change. Interesting time to be alive. Thank you for this video 😊
This is quite an interesting comment. I was exaggerated
You're talking about InfiniBand and RDMA being new... they're ooold :D For instance, Isilon used InfiniBand for their backend network for over 10 years; they have even migrated away from it to Ethernet already.
I remember working with InfiniBand in 2006. It's cool to see that they're still going.
12:26 Chuck, this is precisely my beef with Ethernet. It did not "think" about the future until a contender arrived!
If Ethernet could have made improvements to our current networking tech, why didn't it? Precisely because it did not care... until it saw its survival on the line.
We could and should have a better network than what we currently use, not just for AI.
It is better to have both so they push each other to advance.
True, that's how Adobe has operated for decades.
That doesn't really make sense though? There's literally nothing to gain from "holding back progress". What is "it" in this context? Companies implementing network protocols? The people developing them? They all gain from having faster and more reliable products than the competition.
@@DJTimeLock That's true; however, it seems that the standard itself had nothing to compete against, hence the lack of new Ethernet protocols in the past decades.
@@pikpockett Necessity creates innovation. That's very true. But to me it seems like it wasn't a necessity until the AI boom happened. His analogy about a Ferrari on a blocked highway goes both ways: what's the point in creating a 12-lane highway if 3 lanes are sufficient?
2:50 Before we air cooled data centers, we liquid cooled them.
Keep going, Chuck!
Where?
Chuck, your tech videos are consistently awesome and helpful. Keep up the great work! Thanks for sharing your knowledge!
Hi!
Long time watcher of your channel!
I've been a sports instructor for 15 years and I've been offered a job as a Junior Network Engineer at a BIG company.
I'd like to think everything I learned here helped :)
I'm a long time (now retired CCIE #1937, R/S Emeritus) network engineer and I'm definitely on Team Ethernet.
Based on the models I train at home: For RTX 3090/4090 cards, you'll hit VRAM limits before bandwidth limits for the size models you're working on. 10Gbps is more than enough. You don't need anything exotic. Also it's fine to put your GPUs in multiple cheaper computers; you don't need a single super expensive motherboard/CPU.
AI training is usually done on A100 or H100 GPUs, and you sure will hit the bandwidth issue. A mixture-of-experts model with 8x220B parameters is not your everyday at-home workload. Training requires communicating the gradient and optimizer state, which is about 3 times the size of the model. That is several GBs at each iteration, while the GPUs have the model sharded and just need to execute one layer per iteration, which is way faster.
@@joseleperez8742 Not sure if I care how you think it's "usually" done. I see lots of papers presenting cutting-edge work trained on 3090/4090 cards, sometimes just one. For home labs it makes no sense to purchase an A100 or H100, for many reasons.
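As a rough sanity check on the bandwidth math in this thread, here is a back-of-envelope sketch in Python. Every number is an assumption picked for illustration (70B parameters, fp16 gradients, 8 data-parallel workers, a ring all-reduce each step), not a measurement from any real cluster.

```python
# Back-of-envelope only; all inputs are illustrative assumptions.
def seconds_per_allreduce(params, bytes_per_param, link_gbps, workers):
    """Time spent just moving gradients for one data-parallel step."""
    grad_bytes = params * bytes_per_param
    # A ring all-reduce sends and receives about 2 * (N - 1) / N of the
    # gradient volume per worker per step.
    traffic_bytes = 2 * (workers - 1) / workers * grad_bytes
    link_bytes_per_s = link_gbps / 8 * 1e9
    return traffic_bytes / link_bytes_per_s

PARAMS = 70e9            # assumed model size (parameters)
BYTES_PER_PARAM = 2      # fp16 gradients
WORKERS = 8

for gbps in (10, 100, 400):
    t = seconds_per_allreduce(PARAMS, BYTES_PER_PARAM, gbps, WORKERS)
    print(f"{gbps:>4} Gb/s per node: ~{t:.1f} s of pure gradient traffic per step")
```

Under those assumptions a 10 Gb/s link spends minutes per step just on gradient traffic, while 400 Gb/s brings it down to a few seconds, which is the sense in which small home-lab training and large multi-node training really are different workloads.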
Awesome video Chuck, you always do an amazing job breaking down explanations. Been subscribed for a while, and I have always noticed very good quality videos. Thank you!
I like that you take your time to make absolutely sure the video is on point before you publish.
This is the best video I've seen on your channel and I've been watching for years. Your knowledge shone through at an atmospheric level. Please do more like this every so often as the AI landscape broadens!!
Love it!!!!!!! RDMA since 2019!!!!!!!!!!!!!
Ethernet for sure. Can't imagine changing that on a global scale 🤯
1:53-1:54 there's a cool moment...
you know what I am talking about :0
That thumbnail gets the attention!!
Yes, more data centers networking and AI please
Aye Chuck team infi but without the @ and we saw the data center vid! 🎉🎉🎉 thanks for the support (open sourced) infiniband, just like containers/‘Docker; appreciate IT Chuck!
I work for MinIO and I want to refactor how the world sees and uses its data 🤩 🎉 Thanks for the inspiration :3
I'm with Ethernet.
Sticking with the traffic analogy: we're investing billions in trying to make self-driving cars that function within our existing infrastructure. It's taking a lot of effort, but it will ultimately require less effort, time, and money than to rip out all our roads and streets in favor of a self-driving-specific infrastructure.
Using what we have and building within those constraints will require less (if any) overhaul AND the improvements to the stacks will only make things better for everyone, since everyone is already using the base tech.
omg, why am I barely hearing about this?! OK so word to the wise... if you're worried about AI replacing your networking job, start learning Infiniband and all associated technologies. Thank you Network Chuck and Juniper for this video!!!!!!
1:23 Never thought I would hear Chuck say he wants to become a battery😂
I used to work for a company which built a 76TBit/s ethernet network
This is a great video. It's been a while since I have felt excited about networking and technology, but this video has me super interested in AI and networking. This is the first time I have seen network design specific to AI.
I'm not sure if I should be scared or excited
Thank you for helping us learn more details of this new networking technology.....!!! And by the way, I support both..!!
Awesome video chuck
At this point, Ethernet is a known quantity. I see no need to reinvent the wheel.
The "Animaniacs" part is a classic...😅😂 oh and yeah the AI slow down, good job Juniper..
Awesome video as always. I am team Ethernet, as InfiniBand is for a small niche.
You're forgetting one crucial thing: sustainable power consumption at scale.
This is all nice and everything, but until we can find better innovations in sustainable energy management systems to scale, it doesn't matter what type of software or hardware you have.
There are numerous initiatives that aim to reduce these issues. One of them is the CXL interconnect, which takes the RDMA concept a step further and pushes networking to the top of the rack only. Since CXL allows P2P rootless communication, you can eliminate the "server" part from the server and leave only the resources - accelerators, GPUs, storage, RAM (yes, as an actual device - there are already CXL-connected RAM carrier boards) - and, if you want, an actual CPU. This enables, for example, RAM sharing across the whole CXL domain, and any network capacity added to the system benefits all of the devices equally, with basically fixed latency. You could even export and virtualize whole parts of the CXL tree over a normal network to group systems with an identical dataset, or mix and match for power-envelope control.
Of course it's not a panacea. It's restricted to rack-scale deployments, so you still need the same infrastructure outside of it, just less of it; it's power hungry (current CXL switches eat up kilowatts of power); it's insanely expensive; and it requires rewriting and validating most current workload managers, which comes with its own challenges. It also kind of doesn't fit in normal systems, the hardware is quite fragile physically, and it's not something you can build up over time - you need to set it all up in one go, which raises the difficulty of adopting it and raises the entry bar for development, not to mention it requires expert knowledge from a much wider set of topics from the developers than otherwise.
We will most likely not see much of this in "normal" datacenter environments for many years, as this tech is heavily skewed toward supercomputer systems, but the innovation is always welcome.
Since we are exchanging data directly between the memories of the two servers without going through the network stack's layers, it also means that a lot of bandwidth will be required, since no data formatting techniques such as compression and encoding are applied along the way.
As someone who has extensive knowledge in installation/testing/troubleshooting network equipment/ethernet etc. I welcome the endless work muahahahaha
This dude makes me want to learn
Wow really amazing topic
Thanks so much Networkchuck
Love your channel man
Thanks for that absolutely cursed AI generated preview.
Awesome video Bro! Highly educational. Thank you for sharing!🎉
Great thumbnail
It has some "I have no mouth and I must scream" vibes :)
I would say AI slowing down is a good thing. Better yet would be some sort of impenetrable barrier that makes larger AI models (and smarter-than-human models) impractical or impossible.
this mutha trucka ... not ONLY is he using my name better than I could ever hope, but THEN you add on that MAGNIFICENT beard! ... THIS SHALL NOT STAND! (no really I'm a network engineer named Chuck) great information, my dude! For the record, Team Ethernet
Excellent content. What a great introduction to AI Networking. Great work! Thank you
7:39 - Person in background scared me. lol
I can't sleep anymore...
THANK YOU FOR POSTING CHUCK!!!!!!
They did a study in 2013. They asked 2500-ish AI researchers when AI would take over every single job. They estimated 2045.
Recently they did the same study. The results were pushed back to 2120ish. This is due to hardware limitations.
They estimate 25% of the entire power grid will be AI alone.
This is a good thing, guys; we need it to slow down, and perhaps hardware limitations will be our saving grace.
Well, AI is already being used in networking for Automated Switching | Intelligent Routing | Quality of Service (QoS) Optimization | Self-Healing Networks | Anomaly Detection, etc. ;) Good video.
I'm team InfiniBand for these reasons: Ethernet is getting hard to bring up to speed. I love Ethernet, but eventually we are going to need speed and reliability that Ethernet cannot keep up with without major complexity added. The improvements mentioned in the video are great, but isn't that a lot more complex than just figuring out InfiniBand and future-proofing it, rather than having to upgrade every few years? Anyways, I had another reason but I forgot it sooooooooo. Just to be clear, I love Ethernet and want it to work well, but it just isn't fast enough and is getting bottlenecked.
I like you beard. It looks cool man.
I think the native speed of Chuck's videos is -20% of "normal" speed, unless it's a matter of Chuck's coffee... Chuck, what's going on ;)?
Based on your description of how InfiniBand works, it seems like having an internet-connected device (which will likely be running a full operating system) controlling direct memory access between two computers could have some security implications. For instance, hypothetically, if a malicious actor were to gain access to a switch handling the connection between the two servers, it could be possible to inject malicious instructions directly into the receiving server's memory, allowing an attacker to potentially gain some level of control over the server having its RAM written to.
Hey, great video man. I'm into AI, and out of personal interest I've been learning about networking, so this video was awesome and well done.
Damn chick in black in back was creepin 07:38
Good stuff ❤
I really like the thumbnail
You've really got to break this problem down. It's specific parts of the networking stack that are standing in the way of better bandwidth. There are also things that matter more at times, so that hardware and software can complete their operations efficiently. Latency is one factor I have seen cause some software to behave differently than intended. Latency also drives up software costs, because measures to contend with the varying levels of latency need to be put in place.
I think we should be going hard into fully optical communications circuits. I mean that everything is done via photons until the very last second, decreasing latency and leaving bandwidth even less constrained by the fabric/medium it is transmitted through.
1:53 was crazy
Personally I am team competition. I don't care so much for either winning, but InfiniBand existing and putting pressure on ethernet to keep up is only going to give us improvements on both ends leading to an overall better end product.
Juniper has come a long way since they started. I'm for ETH; it's not like we've hit a limit with Ethernet that we can't overcome, like IPv4, so why reinvent the wheel from scratch?
Fantastic cutting edge content Chuck...
TCP/IP it is for me; as you said, it is well known and it's easy to find help. But it will need a separate network to secure it.
I find it hard to shed a single tear over the speed at which AI training steals copyrighted content. I think it all should be filtered through an ASR-33 teletype acoustic modem at 110 baud.
Awe
Hey Chuck, wouldn't RDMA require the data to be transmitted in an unencrypted form? Since the CPU is not processing it, I don't think any sort of encryption standards would be plausible for such networking, and hence it leaves the whole network more vulnerable than ever before.
#1 This is usually local traffic. In the example Chuck mentioned most scenarios only involved switching and not a lot of routing.
#2 You can very easily do line speed encryption/decryption in the NIC.
2:10 bro InfiniBand has been around longer than PCI-express
3:10 Again, this issue happened in 2003; hell, all of 2004 and 2005 was nearly fully dedicated to finding and powering datacenters
3:30 watercooling in a datacenter is also not new
5:55 RDMA has been around since 2013 ish and RoCE is like 2015
11:45 IPU and IDPF is gonna blow your mind
We've had packet balancing and load balancing since the late 90s and 2000s; I'll stick with that. If I need to mitigate stress, I'll put a load balancer in. People forget IP routing is just that: routing. You can redirect packets to a different machine. If you set up 2 servers with training data, you can put a load balancer in between that redirects the flow to a less congested server while the GPU server is none the wiser.
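For what it's worth, here is a minimal sketch of that idea in Python: a small TCP forwarder that sends each new connection from a GPU node to whichever data server currently has the fewest active flows. The addresses and port are invented placeholders, and this is a toy, not a production load balancer.

```python
# Toy least-connections TCP forwarder: clients connect to this proxy and get
# transparently patched through to whichever (hypothetical) data server is
# currently least busy. No health checks, no TLS, illustration only.
import socket
import threading

BACKENDS = [("10.0.0.11", 9000), ("10.0.0.12", 9000)]  # made-up data servers
active = {backend: 0 for backend in BACKENDS}
lock = threading.Lock()

def pick_backend():
    with lock:
        backend = min(BACKENDS, key=lambda b: active[b])
        active[backend] += 1
        return backend

def release_backend(backend):
    with lock:
        active[backend] -= 1

def pipe(src, dst):
    """Copy bytes one way until the connection closes."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        for s in (src, dst):
            try:
                s.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass

def handle(client):
    backend = pick_backend()
    try:
        upstream = socket.create_connection(backend)
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()
        pipe(client, upstream)
    finally:
        release_backend(backend)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 9000))
listener.listen()
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```

In a real AI fabric the "flows" are RDMA queue pairs rather than plain TCP connections, so this exact trick does not carry over directly, but the scheduling idea is the same.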
Love your chan. Your videos are fun to watch and you are a relatable host. Thanks for all the great content!
InfiniBand is not new. I've had InfiniBand in one of my production clusters for nigh on 20 years!
InfiniBand doesn't have to be expensive. I picked up 2x QLogic 18-port QDR switches for $15 on eBay (still don't know how that was possible).
I was able to pick up cards/cables for 4 servers in my home lab rack for under $200.
I run the subnet manager on a couple of hosts since these switches can't run it.
I'm on team ethernet on this one. 🌐 I think keeping with the standard that we've already pretty much decided on is best as long as it can keep up with other future hardware. 💻
The industry is already moving back to ethernet mainly due to the need of L3 routed networks in isolated environments.
In the future, when AI has taken away entry-level programming jobs, artists' jobs, writers' jobs, customer service jobs, etc., and the middle class vanishes, I plan to come back and watch this video/read the comments and remind myself that humanity did this to itself. Long live our corporate overlords.
AI is not going to be AI.
I'm team infiniband solely because the network team doesn't own it. I used to manage FC SANs and it was super convenient to make changes on it because I could do it myself. I didn't dislike the network team, it just took longer to get them to do something than it took me to do it myself.
Hey Chuck, been a sub for a while and I always really like what you're doing. You and many others are also inspiring people like me to maybe go down the same road as you guys: work for myself and share knowledge with people who can make use of my skills and passion, instead of always fighting with cooks at the normal workplace. Thanks again.
If you look at what we've done with ipv4 to avoid ipv6, I'd bet Ethernet wins.
Super cool. TCP or holographic UDP.
The big jump that is coming is photonic matrix processing chips.
And I agree, we probably do not need a new protocol. Though, I do not think there is a mathematical possibility of extracting certainty from our current network topology.
RDMA is also used in some clustering stuff for databases or storage systems.
I love your videos! Learning so much, and getting better and better with my home server!
Keep it up! And give me more information xD!
Love the new capabilities being added to the entire Ethernet stack, but for HPC, IB in my view is just that much more mature/proven. It's been the core of Oracle's Engineered Systems solutions since 2008, running on Mellanox switches. For general DC networks, sure, Ethernet; for something as specialised as HPC/AI stacks, I think IB is still the leader.
I'm not into networking, but it seems to me that because it's lossless, RDMA has a major advantage. You have to do all kinds of engineering to make Ethernet work at the same speed. It just seems to me that RDMA will free up engineers and hardware to do what needs to be done, as opposed to forcing something that doesn't work correctly in the first place to work correctly. It may take a while, but I think that eventually the RDMA-type protocol will win out.
Hmmm, interesting conversation! I love Ethernet but I am not sure if it will last. The one good thing about Ethernet is cost; it is fairly cheap. Great video, Chuck!
We're too far lost in the Ethernet sauce, to all switch now
We should all move to the next generation of networks to improve the way we communicate. InfiniBand is the next level of communication, and we should build our community around it.
Insightful video. Thanks!
Really good content again, Chuck. I might have an easier time picking between the two if I understood how firewalls and QoS work in the InfiniBand space. Or do they have an alternate technology?
I work with Juniper on the daily and think they are great, but we use fiber, not Ethernet, and can get 100 gigabit, and that is just the communication to each other. The company I work for is a fiber network ISP; we have no Ethernet in the network outside of some old parts we have not upgraded yet. But you are right about Juniper, since we switched to them from MikroTik.
I'd say there's a solid argument for a ground-up replacement for Ethernet. It's pretty awesome, I use it every day, but there's so much legacy overhead and so many bolted-on modifications that I feel it's holding the next generation of networking back. I'm not saying InfiniBand is the answer, but taking a clean sheet to networking is long overdue. Think of the security improvements that could be made with a ground-up redesign with the 21st century in mind. It would be painful at first, but I think we'd be glad once the adoption pains had passed.
Interesting video, but very high level. Regarding the high power consumption/heat dissipation point on GPUs, a topic I'd like to see a well-researched video about is the possible avenues for improving code efficiency, where at a given load an LLM could have more productivity. That would reduce the FLOPs/time needed and hence reduce power per output on the AI side.
Why don't they use fiber optics everywhere? I guess they can afford it? If they upgraded their IT systems, there would be no problem. One fiber optic cable not enough? Add 10 more. And this is how you build highways between servers. :-)
The larger communities will always win unless the new tech is so groundbreaking, it will replace old tech eventually.