You have helped me immensely in setting up my homelab and making my private network so damn effective and efficient. Thank you, and don't stop enjoying what you do.
Thanks for this. I played with SR-IOV in ESXi before I migrated to Proxmox but I hadn't got around to testing it again. Based on your testing, I don't need to play around with it again.
Regarding the ad at the beginning of the video:
20,000 mAh = twenty thousand thousandths of an Ah = 20 Ah
Actually, SR-IOV really shines at high PPS. With a huge number of packets per second, even at channel utilisation below 1 gigabit, you can end up with a single core fully consumed by virtual switching; this applies to both Linux/QEMU and VMware.
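For anyone who wants to see this effect for themselves, here is a rough sketch (not from the video) of one way to generate a high packet rate and watch the per-core switching cost; the target address is a placeholder:

```bash
# From another machine: lots of small UDP packets (64 bytes, unlimited rate)
iperf3 -c 192.0.2.10 -u -b 0 -l 64 -t 30

# On the hypervisor: watch softirq load per core. With a Linux bridge, one core
# often pegs in %soft/%sys long before the link itself is anywhere near full.
mpstat -P ALL 1
```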
What's the use case here? I can totally see how a ton of individual ops could be expensive, but is it really more expensive in the virtual switch? That's going to be a crap-ton of PPS, though, and maybe something else should be optimized first to avoid that?
Glad to see you got time to look into it. I bugged you about it some time ago and eventually got networking working for Linux, but Windows doesn't like the 10G card I got.
Thanks for covering this topic. I’ve wanted to look into it but it sounds like it’s not worth the effort at 10G.
Excellent once again
I like your wall sockets!
Another couple of downsides of SR-IOV for NICs: network redundancy (which would then have to be done in the VMs themselves with bonding), MAC address limitations (MACs are probably set randomly when setting up the virtual interfaces), and it might impact VM migrations between cluster nodes...
I think your findings mirror mine when I played with it. For my own hardware, which is a variety of 10GbE NICs, mostly Intel, I found there wasn't any performance gain from going to the hardware directly vs. the Linux bridges, at least for any kind of throughput I could generate. YMMV at higher speeds like 100GbE, but I can't even contemplate that right now. I find I swamp the CPU with other tasks long before the overhead of network IO would become noticeable, meaning never. I can completely saturate my hyperconverged network with Ceph pretty easily, maxing out around your 9.3 Gbit/s numbers, or about 950 MB/s.
Loving your videos. Good job sir.
What about configuring the virtio interfaces as macvtap? Then you cannot talk to the host over it (which I consider a feature, not a bug; you can always throw in a second virtio interface that only connects to the host, in case that's needed), but it also shouldn't go through a bridge.
In general, you don't mention anything about latency. I'd be much more curious about how each choice affects latency. Maybe virtio/macvtap would have better latency than the regular virtio/bridge? And shouldn't SR-IOV offer the best latency? In the latter case, is there a difference between passing through a VF as opposed to the PF?
Finally, I assume with SR-IOV you can use RDMA from within the guest, which is not the case with virtio AFAIK. It may have limited use cases, but why not use SMB Direct or NFS/RDMA? They should offer better latency and maybe even throughput than their regular brethren.
PS How do you persist the VF setup on the host across reboots?
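On the PS, a minimal sketch of one common approach on a Debian/Proxmox host, assuming ifupdown and a physical function named enp1s0f0 (the interface name and VF count are placeholders, not something from the video):

```bash
# Recreate the VFs every time the PF comes up, via /etc/network/interfaces:
#
#   iface enp1s0f0 inet manual
#       post-up echo 4 > /sys/class/net/enp1s0f0/device/sriov_numvfs
#
# The same thing done ad hoc, plus a check that the VFs now exist:
echo 4 > /sys/class/net/enp1s0f0/device/sriov_numvfs
cat /sys/class/net/enp1s0f0/device/sriov_numvfs
```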
well, finally an ad for something I need (my old powerbank died) and it says: "This item cannot be shipped to your selected delivery location. Please choose a different delivery location." so yeah... that was that. 😞
I was looking online; the model looks very similar to Anker's devices...?
I'm guessing the vmxnet3 implementation in QEMU is poorly optimised... I'm using 40Gb connectivity on ESXi, and saturating it has noticeably lower CPU overhead with vmxnet3 than with any other method.
Of course, 'native' paravirtualization for KVM uses virtio (with Red Hat's Windows drivers on Windows). vmxnet3 is really implemented for people migrating from VMware, and given Broadcom being Broadcom, I thought it might be a useful data point.
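For reference, this is how one might flip between the two models on a Proxmox VM for an A/B comparison; the VM ID and bridge name are placeholders:

```bash
# Switch the first NIC of VM 100 between models
qm set 100 --net0 virtio,bridge=vmbr0    # paravirtual virtio-net, the usual KVM choice
qm set 100 --net0 vmxnet3,bridge=vmbr0   # emulated vmxnet3, mainly for VMware migrants
```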
So again... if you go with virtualization you get a lot of new problems, while containers are less of an issue. Which is why I was running some form of containers in production almost 20 years ago, before even looking at virtualization.
That's what I've found as well. Sure, it uses a little more host CPU to keep them virtual, but it's not really meaningfully faster at 10 Gbit. It can make sense on machines with 25/40/100 Gbit for certain workloads.
These newfangled Intel GPUs have SR-IOV and can be split into multiple graphics devices, each of which can be passed into a VM. I got an Arc GPU to do this and got far enough to see the maximum number of sub-devices possible in the device's /sys entry, but the kernel module doesn't have support to split it up. Unfortunately, before I got a chance to install the Intel patches from their forums, I broke a surface-mounted component off and I haven't gotten back to it. I want to get another one someday and try this out.
Yeah, I do this with some Arcs now, and it's pretty sweet. Also note that if you're running a container, it can just have shared access to the Arc hardware directly if it's been virtualized. I do this with Plex and it's good; I even set it up HA with three nodes with the same Arc A310 hardware, and it's smooth. I also pass through an LSI HBA card to a VM for SAS access to a tape drive, and that works nicely too.
That's really cool. What does the VM see from its perspective? Is it just a 3D accelerator or can the virtual functions provide video out? I wish my Arc didn't get busted! Oh and... did you have to patch the i915 kernel module?
@@VexMage It just looks like the real hardware to the VM, nothing different. And yes, you can set it as a video device and output video that way if that's what you want. My machines aren't all in the same room, so I don't use video output. I do have one machine with 4 RTX graphics cards that I pass through to do AI work. Note that a lot of this is hardware-dependent, and I use server hardware that is known to work well with this kind of thing.
@@ChrisCebelenski Nice, I didn't know whether the Arc would expose them like Nvidia GRID or more like MIG. MIG is free, but Nvidia charges almost the cost of the video cards per year to use GRID for a VDI solution, even with our state university contracts and education discount! I wanted to look into the Arc as a possible free alternative to GRID. We use MIG for our HPC machine learning cluster. Nvidia chastised me on an engineering call because our professors keep buying RTX when ray tracing isn't a feature they want. What's totally annoying to my team is that the new RTX cards can't even fit in a 4U chassis! I get the appeal though for the CUDA cores per dollar!
Similar to you, most of my stuff is in LXCs, and considering it's backed by some very slow HDDs, I'm not even limited by gigabit networking, so I won't worry about SR-IOV support. It would be much more interesting on a consumer GPU, but so far it seems that support is reserved for the enterprise cards.
It’s on my radar but I’m not really into patching Nvidia vgpu and their licensing. As soon as Intel or AMD has support I’ll buy a card.
So from a driver point of view, the VM will see it as a real card and not a virtual NIC? If so, I see this as useful for VMs that need a real NIC and not virtio. I've had some VMs that didn't like paravirt. Also, how do you tell if the card has SR-IOV? Great video.
It sees it as a real card, but one lacking some of the features of the full-featured card, and some hardware flags are provided but 'faked' or read-only (for example, the 'link' status is not real, and the rate may also not be real).
Generally any Intel 500 series or Mellanox CX4 or higher will have SR-IOV. It's really quite common. The older cards support fewer VFs.
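To answer the "how do you tell" part concretely, a quick sketch; the PCI address and interface name are placeholders for your own hardware:

```bash
# Does the device advertise an SR-IOV capability?
lspci -s 01:00.0 -vvv | grep -i "sr-iov"

# How many VFs does the driver support, and how many are currently enabled?
cat /sys/class/net/enp1s0f0/device/sriov_totalvfs
cat /sys/class/net/enp1s0f0/device/sriov_numvfs
```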
"Solaris"
That's a name I've not heard in a long time...
A former company of mine was still running two Solaris workstations as of 2020, to support legacy projects from the late '90s / early 2000s that had to be built with the same dev environment from back then. I am amazed how well 10+ people running X over SSH works from a single SPARC machine.
How much would it help a "bare metal" workstation that needs to access both the "untagged" net as well as a VLAN or two? I read somewhere that if you do the VLAN splitting in software, the NIC has to be switched to promiscuous mode which has some performance impact. But apparently, it's possible to configure the SR-IOV VFs to have VLAN filters and do the VLAN splitting in hardware.
The NIC does need to be in promiscuous mode, which could result in receiving extra traffic that will ultimately be dropped. However, the upstream physical switches are also really good at MAC learning, so this should not be significant, at least on Linux. I can't say whether Windows's sometimes-weird VLAN behavior would cause problems for a Windows workstation.
There's also nothing stopping you (again, on Linux) from adding a bunch of VFs and using them as normal interfaces; this is basically what LXC does. You can set the hardware VLAN tag (either 802.1Q or 802.1ad, but not both) with an `ip` command on the physical function.
There are a ton of things which can put a NIC into promiscuous mode, though.
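For reference, a sketch of those `ip` commands; the interface name, VLAN ID, and MAC are placeholders:

```bash
# Pin VF 0 to VLAN 100 (tagging/stripping is done in the NIC hardware):
ip link set enp1s0f0 vf 0 vlan 100

# Optionally fix the VF's MAC and turn on spoof checking:
ip link set enp1s0f0 vf 0 mac 02:11:22:33:44:55 spoofchk on

# Inspect the per-VF configuration:
ip link show enp1s0f0
```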
By the way, about customer VMs: Proxmox doesn't support VM network isolation, so you have to assign a separate VLAN to each VM, right?
You can use the PVE firewall to filter whatever you want, but there would be no reason to filter two customer VMs attached directly to the internet from talking directly to each other.
@ Yeah, I know, but port isolation is a thing...
Hosts on the public internet expect to communicate directly with each other using their public IPs. Restricting multicast/broadcast, or more specific ARP/NDP spoofing filtering, would still be good if you are only doing layer 2, but doing layer 3 separately to each client is ideal (and then there is no ARP/NDP to worry about).
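One way the layer-3-per-client idea can look on a Proxmox host, as a rough untested sketch; the bridge name and addresses are documentation-range placeholders:

```bash
# /etc/network/interfaces: a routed bridge with no physical port; the host
# routes for the customer VMs instead of switching them onto the uplink.
#
#   auto vmbr1
#   iface vmbr1 inet static
#       address 203.0.113.1/24    # gateway the customer VMs point at
#       bridge-ports none
#       bridge-stp off
#       bridge-fd 0
#
# Plus IP forwarding on the host:
echo 1 > /proc/sys/net/ipv4/ip_forward
```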
I was passing through network cards and was surprised to learn Proxmox doesn't enable the IOMMU by default. Pain in the a** having to research that annoyance.
They recently changed the kernel defaults with PVE 8.2 / kernel 6.8. I'm not sure if they changed intel_iommu only or also amd_iommu. They aren't enabling the ACS patch by default, but it's still compiled in.
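For anyone on an older kernel, a sketch of the usual way to turn it on for Intel (standard Debian/Proxmox paths; adjust the parameter for AMD):

```bash
# /etc/default/grub
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

update-grub        # or: proxmox-boot-tool refresh  (systemd-boot / ZFS installs)
reboot

# Verify after reboot:
dmesg | grep -e DMAR -e IOMMU
```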
Sorry, but I didn't understand what Wendell was talking about.