Patrick, I love it when you do such awesome stories on niche products. The reviews of a $75 device one day and $150,000 device two days later really make me enjoy your website and YouTube channel. I wonder if this would be a dream job for me...
We will probably be adding an 18th team member in Q4. Shoot me a note if you are serious and what you are looking to do and if you are thinking part time or more full time. We usually have folks start part-time to see how they actually like reviewing hardware.
I always assumed something like this would be put on the nic instead of as a separate card. Neat video!
That is a wise observation.
You are correct, so-called smart nics will include different accelerators, but the best will be re-programmable ones such as FPGA based designs where you can load on different accelerators that you actually require, including bug fixes and upgrades.
Thanks for being honest about sponsorship. It increases your credibility.
Thank you so much for this video!! I started diving into QAT a few months ago and learned the hard way about the support for the different generations of cards lol!
There was even a difference between v1.5 and v1.6 that mattered when we did it in 2016-2017.
I love when Patrick welcomes me. It never fails to make me smile for some reason. His enthusiasm is infectious.
Loved this, fantastic presentation of the power of accelerators in hardware. Great work!
Thank you. Glad you liked it.
@@ServeTheHomeVideo Too much waffle at the beginning for me.
But can't you use a QuickAssist add-in card on an AMD system? How does that compare to the native QAT from Xeon CPUs?
AMD's solution is Xilinx/ Pensando.
You can. I just tested a QAT add-in card on an AMD system using SQL Server 2022 RC0, which can offload backup compression with QAT. Works like a charm. No idea how much faster or slower the CPU version of QAT is since I don't have access to that kind of hardware. I also have a lot of questions on how this is going to work with a hypervisor sitting in between. I don't want to pass the QAT device to just one VM; I want all VMs to be able to use QAT acceleration. Does the CPU version show up as a discrete PCIe device or is it more like an instruction set extension?
@@ServeTheHomeVideo the question was not about AMD hardware, but rather about using the Intel QAT hardware on AMD hardware, which is a supported modality. It is a shame that STH didn't think of that...
Let's see the AMD Xilinx/Pensando version of this next then!
Yes, Soni said we would do the Pensando one soon when I spoke with her last week. It has been hard to get cards but it is on the plan.
The intro always makes me think "QUICK, Somebody call a doctor. I think he's gonna have a stroke..."
Ha! I have to record before I have coffee in the morning or very late at night just to tone it down to where it is today.
The perfect tech channel I ever needed. Loved your way of presenting 😍
Thanks a ton. Have a great day.
People ask why Intel is so popular in the server space when AMD is "just better". Well, I think this is a good point.
Fantastic job! Top info - beats all the documentation and marketing blurb. Looking forward to the next video.
Very informative video! I had no idea that Intel made a QAT PCIe card. Thanks!!
Super fun is that these are PCHs on the PCIe card
Glad you enjoyed your time in Hillsboro. I live here with my wife and she works on the Jones Farm campus :)
Awesome! I was there during the heat wave at the end of June for this.
Can you imagine if Patrick drank coffee... 🤣
That actually got a laugh out of me.
what does this mean for latency, interrupt budget, DMA, etc? These are valid benchmarks, but what happens in total system testing, do you really free up resources which can be used without immediately hitting a next bottleneck, for example interrupts bogging down some subsystems of the platform?
well researched, nice video, thank you!
Thanks Jay
Great presentation, excellently informed
Glad it was helpful!
I've been wondering about the QAT feature in pfsense as my status screen is showing it as "NO". Thank you for explaining it in detail.
How does QAT help with my homelab that is behind a firewall? I don't actually need SSL.
Intel will be a step ahead in competition 😎
Hey Patrick, what about latency? How much does it add, since it has to travel multiple times over the PCIe bus?
Interestingly, Intel has a more "mainstream" acceleration called Quick Sync for video encoding and decoding. When it works (i.e. on supported video codec), it makes a huge difference. AMD seems to completely neglect this market for some reason.
It doesn’t neglect it. They have AES-NI, but he chose to compare Intel with hardware acceleration against AMD without AES-NI (hardware acceleration).
@@aliancemd Interesting. I assumed that "Quick Assist" was NOT part of the AES-NI instruction set. Worth researching.
@@jaffarbh It is not part of AES-NI but it is a competing technology in the workflow that was presented in the video. Also, the video uses CPU + Dedicated QAT PCI hardware to compare against AMD without using AES-NI.
@@aliancemd Fair point. Intel has already embedded QAT into the latest Xeons. The real question is whether equivalent QAT acceleration exists in AMD processors. In any case, this is a specialist market and not something everyone needs. Maybe AMD doesn't see the need to dedicate silicon for it.
Cool video! Any chance of the Intel QAT 8970 Card 3 working with pfsense?
Check out our Netgate 4100 review. pfSense Plus can use QAT. Opnsense also supports QAT.
I am very much not an expert in these things. With QAT's compression acceleration, could such hardware be used to accelerate disk access in a desktop environment? I'm not necessarily asking from a practical standpoint, merely wondering if we might see something like it in a future chipset/CPU so they can advertise that your SSD will be XX% larger or faster since it would be less data going across a bottleneck.
Probably less common in the very near term for the desktop, but this is what storage vendors use QAT for.
Really appreciate the effort, Sir! One question: could an 8960/8970 card be used for many VMs, or could it only be used for one instance?
Thanks for your great video! Just to make sure: this card is attached over PCIe, right? Can instructions trigger accelerator ops on this card? How do you use it: as an I/O call, or through instructions?
Can you please do an update for this? Not your full blown speed tests, but more of the proper mix and match what's out there to have it function and somewhat futureproof.
Watched video. Immediately checked for TrueNAS support. Looks like I won't have much use for this until IX Systems enables support for it.
It is more likely to happen with TrueNAS Scale since that is Linux based.
@@ServeTheHomeVideo I would be okay with that. (Especially if it could get Scale up to the same performance as Core)
is the acceleration used when using Java or C# or nginx with their main libraries out of the box, or did you implement specific intel dependencies/libraries to take advantage of quickassist?
but will it allow pfsense to allow multiple vpn connections without grinding to kb/s?
that should already be possible w/o QAT unless you're running on something really terrible
Cool. I had not heard of these but have been looking for a long time for a way to cheaply accelerate my older servers. My servers run a lot of modern hardware like high speed NVMe and they can't keep up. I think this is what I need. Will these basically work on any machine? I mean it's just a PCIe card, right?
5:15 - I remember having this Antec ITX case 👍
Interesting.
How much and what kind of work is needed to use it?
Does software need to be recompiled?
Does the 8970 support chacha20/poly?
After seeing that rat's nest I do not feel so bad about my cabling anymore. If it works it works!
cable management for a bench test that's just going to get immediately torn back down is just a waste of time
Rather important note that the QAT acceleration for NGINX only works with HTTP/1.1. So, enable HTTP/2 or HTTP/3 and AMD wins again.
It’s weird seeing QAT talked about in servers when I’ve only seen it used in graphics rendering using the intel igpu in desktop processors
That is Quick Sync. Intel went on a "Quick...." binge for a bit.
If Intel made NICs with QAT that can be used in embedded Epyc systems... 😁
That would be a DPU/ IPU at this point.
@@ServeTheHomeVideo Well, no not really, there'd still be no CPU core complex on the card so it wouldn't pass your own 8 point 'is this a DPU?' checklist.
Ah I meant more like Mt. Evans would be the closest product, or the FPGAs. But if you went FPGA + EPYC embedded, then you would get Xilinx most likely.
No way! I work in JF5, wish I could've seen you haha.
Bummer! I was in the cafeteria and people were saying hi quite a bit.
Calm down Patrick. You are drinking too much coffee.
17:35 That is so misleading. To remind everyone, AMD (and Intel) support AES-NI, which has significantly better support, including from hypervisors. There is really no reason to compare hardware acceleration vs no hardware acceleration at all, on hardware that has it.
What is the AMD equivalent to this?
Pensando and Xilinx will be
I like to watch many of my YouTube videos at 1.25x or 1.5x speed… Not Patrick’s videos!
You should try that with Overly Sarcastic Productions ;)
Loved the video, but IMHO one thing is missing: if a customer already has EPYC servers or is planning to buy the upcoming Genoa CPUs, is there something for them related to acceleration? I'm sure there's some serious NDA that has been signed by you and AMD but still, some hints ... ;)
Pensando and Xilinx.
@@ServeTheHomeVideo That's exactly what I thought while you were talking about the accelerator card. The functionality on there will probably be folded into the DPU which also means we will have more vendors putting out products which can do this.
As for ciphers, ChaCha20-Poly1305 is not going anywhere really, it is seriously overpowered (20 rounds of ChaCha is silly overkill, 12 is sensible overkill, 8 is probably fine).
Duuuuuuude, if you are going to do it, take half or less of the amount of Coke you took before this video :o
Perf per watt?
The QAT cards are basically server PCHs on a card so they are like 23W TDP parts. That is why I wanted to use EPYC CPUs with 15W each more (30W total) to at least bridge some of the gap.
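The power-budget reasoning in that reply can be written out as a tiny calculation. This is only a sketch of the arithmetic, using the approximate figures quoted in the comment (about 23W for the card, 15W extra per EPYC CPU). The two-socket count is inferred from "15W each more (30W total)", and the function names are made up for illustration.

```python
# Approximate figures quoted in the reply; not measured values.
QAT_CARD_TDP_W = 23        # "like 23W TDP parts"
EXTRA_TDP_PER_CPU_W = 15   # extra TDP given to each EPYC CPU
CPU_COUNT = 2              # assumption: inferred from "30W total"


def extra_cpu_budget_w(extra_per_cpu_w: int, cpu_count: int) -> int:
    """Total extra TDP handed to the CPU-only configuration."""
    return extra_per_cpu_w * cpu_count


def budget_gap_w(card_tdp_w: int, extra_per_cpu_w: int, cpu_count: int) -> int:
    """Extra CPU budget minus the card's TDP.

    A positive result means the CPU-only side received more power
    headroom than the accelerator card adds to the other side.
    """
    return extra_cpu_budget_w(extra_per_cpu_w, cpu_count) - card_tdp_w


if __name__ == "__main__":
    total = extra_cpu_budget_w(EXTRA_TDP_PER_CPU_W, CPU_COUNT)
    gap = budget_gap_w(QAT_CARD_TDP_W, EXTRA_TDP_PER_CPU_W, CPU_COUNT)
    print(f"Extra CPU budget: {total} W, card TDP: {QAT_CARD_TDP_W} W, gap: {gap} W")
```

TDP is a rating rather than measured draw, so this only shows that the two configurations were given roughly comparable nominal budgets.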
5:15 lmao, using windows to run a bunch of linux software and then... of course.. microsoft word. (edge counts as linux cuz it's chromium)
0:58 CLI ASMR when?
It's disingenuous to use performance per thread for accelerators when the accelerators don't scale linearly (or at all) with more threads.
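A toy model may help show what this comment is getting at. The sketch below assumes an accelerator with a fixed throughput ceiling that threads merely feed; every number and name in it is hypothetical and not taken from the video or from any QAT specification.

```python
def accelerated_throughput(threads: int,
                           per_thread_submit_gbps: float,
                           accelerator_cap_gbps: float) -> float:
    """Aggregate throughput when threads only submit work to a shared accelerator.

    Throughput grows with thread count until the accelerator's own
    ceiling is reached, then stays flat.
    """
    return min(threads * per_thread_submit_gbps, accelerator_cap_gbps)


def per_thread_throughput(threads: int,
                          per_thread_submit_gbps: float,
                          accelerator_cap_gbps: float) -> float:
    """The 'performance per thread' figure for the same configuration."""
    total = accelerated_throughput(threads, per_thread_submit_gbps, accelerator_cap_gbps)
    return total / threads


if __name__ == "__main__":
    # Hypothetical: each thread can submit 10 Gbps, the card tops out at 40 Gbps.
    for threads in (1, 2, 4, 8, 16):
        total = accelerated_throughput(threads, 10.0, 40.0)
        each = per_thread_throughput(threads, 10.0, 40.0)
        print(f"{threads:2d} threads: total {total:5.1f} Gbps, per thread {each:5.2f} Gbps")
```

Under these assumptions the per-thread figure looks best at low thread counts and drops once the ceiling is hit, which is why the choice of thread count matters when quoting such a metric.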
Kind of rubs me the wrong way that you didn't mention AMD's solution, Xilinx/ Pensando (is this available now or soon?) and the Intel QAT card can be used in an AMD system. Looking at the video, one could easily think Intel has a huge advantage over AMD. Hard to believe Intel didn't have a say in this or maybe you were influenced by their sponsorship. I'm not saying do something that sours the relationship, but just mentioning it would have been much better already.
Honestly I feel like the AMD results should have been removed as you are comparing apples to oranges, only thing it does is make AMD look bad. Hope you can keep that in mind in the future.
I asked AMD for its Pensando/ Xilinx solutions but they still have not sent cards. Only so long we can wait. We cover Pensando a lot on the main site
@@ServeTheHomeVideo so will Xilinx and BlueField solutions be able to do this same sort of thing? I'm curious what the implementations will be like on the software stack in order to utilize the offload, and if there is direct hardware support in those cards to accelerate these functions (specific ciphers, etc), and if so or if not, will that affect bandwidth, latency, max # of connections, and power efficiency.
I'd be VERY curious to have these same tests and cases with the same basic base hardware paired with nvidia and amd accel/dpu cards benchmarked the same way (if possible) so that these numbers could be put side by side with them, and show how intel compares to other vendor solutions (and how much work it would be on the software side to implement - like is there native support for each solution in popular products like pfsense, web server stacks, etc)
Hillsborough 😭 well an attempt was made
How much coffee have you had?
No coffee. I record these usually between 4:30AM and 7:30AM before any coffee so I do not get too excited
Sorry but after 4:24 minutes into the video you have not given a one sentence explanation of what this does.
The quickassist cards are $$$ on ebay. Wanted one for my opnsense box
The 2nd gen cards are not crazy expensive.
If you're having to spend the dev time to implement QAT within your application, why marry yourself to a hardware specific component when there are fast real-time algorithms like LZ4 and ZSTD when you can get 1 GB/s+ per core? I don't get the feel that the forward looking storage vendors are continuing down the hardware accelerated path here as they get locked into a specific technology and then they're unable to port elsewhere, i.e. cloud.
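The per-core software throughput claim is something anyone can sanity-check on their own machine. The sketch below times single-threaded compression using zlib from the Python standard library as a stand-in, because LZ4 and Zstandard bindings are third-party packages. zlib is a different and generally slower algorithm, so the number it prints says nothing about LZ4 or ZSTD themselves. The helper names and the synthetic sample data are assumptions for illustration.

```python
import os
import time
import zlib


def make_sample(size_bytes: int, seed_chunk: int = 4096) -> bytes:
    """Build compressible test data by repeating one random chunk.

    Real workloads compress very differently, so treat results as rough.
    """
    chunk = os.urandom(seed_chunk)
    repeats = size_bytes // seed_chunk + 1
    return (chunk * repeats)[:size_bytes]


def measure_compress_throughput(data: bytes, level: int = 1) -> tuple[float, float]:
    """Return (MB/s, compression ratio) for one single-threaded zlib pass."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

    # Verify the round trip so the timing reflects a correct compression.
    if zlib.decompress(compressed) != data:
        raise ValueError("round trip failed")

    mb_per_s = len(data) / elapsed / 1e6
    ratio = len(data) / len(compressed)
    return mb_per_s, ratio


if __name__ == "__main__":
    sample = make_sample(64 * 1024 * 1024)
    rate, ratio = measure_compress_throughput(sample, level=1)
    print(f"zlib level 1: {rate:.0f} MB/s on one core, ratio {ratio:.1f}x")
```

Swapping in an LZ4 or Zstandard binding would only require replacing the two zlib calls; comparing that result against an offload path is what the comment is really asking about.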
Wow, that is super weird! A HW accelerator that actually accelerates something!
But seriously guys, why the f.. are you trying to make this a comparison against AMD with no acceleration, this is just plain silly.
We used a faster AMD CPU so when we did things like acceleration via ISA-L AMD was faster due to the clock speed and extra TDP it had. The TDP difference between lower power Intel parts and the higher power AMD SKUs we used is about the same as QAT card TDP. AMD has promised the Pensando solution for example, but has yet to deliver cards and we cannot eBay them. When Pensando cards arrive, we will look at those.
@@ServeTheHomeVideo showing that a specialized HW unit + CPU can draw the same power as another CPU isn't any less silly, and excusing this with AMD not yet delivering similar HW isn't a very good excuse for making a silly and utterly useless "comparison".
What is the alternative though? No major server vendors support QAT cards in EPYC systems. You can put them in, but that is a one-off unsupported configuration that would be a lab project, not something that people would really deploy. That is why we need AMD's accelerators so we can have real solutions not lab experiments.
@@ServeTheHomeVideo the alternative is to not do silly stuff and only report on what the QAT can do. The fact that QAT doesn't work on EPYC is partially (if not fully) Intel's fault, so trying to put this on AMD is just even more silly.
BTW, an accelerator from AMD will perform very differently, and comparing the two would also make little sense, these are very specific SW accelerators, but you probably already know that?
@@brynyard it's not silly at all. If there is no industry supported alternative for amd systems, this could mean the difference between choosing an intel or an amd platform for a specific application based purely on the amount of resources we've just been shown get used. This may be a huge realization for a lot of people, and may affect purchasing decisions for variously sized projects. In larger data centers, optimizing for a specific use case can potentially mean the difference of a ton of power, latency, number of connections a server can make while still performing work with those connections, so users per server, so then number of total servers, so data center sizing, etc. This may have huge implications for our very Internet-oriented data centers, with all kinds of encryption and very little inter-data center machine-to-machine trust.
This is a 25 minute ad
Were you raised by a pastor? What critical thinking are you afraid of when you don't allow the audience any breaks for reflection?
😳wow
First one here 🙂
Wow! In seconds!