This is a great video Wendell, and I think it's going to be an important consideration going into the future, because AMD would not have been able to achieve high core counts without multiple dies and thus multiple NUMA nodes, and Intel isn't going to compete without following a similar path. I have been experimenting with my Skylake-X processor (7820X) and on some games I get a vastly improved framerate by setting core affinity manually. But the Skylake-X is still on one node, so I'm not sure what the exact cause would be. It's more apparent with older titles that do not multithread well. WoW is a good example: I get the best framerates there by choosing two logical cores which are not on the same physical core (i.e. cores 0 and 2, or 13 and 15, etc.). I will have to test on GTA V. I don't know if it has to do with the userland/system-level processes you mentioned. I was originally thinking it was due to Intel using a mesh instead of the ring bus. I'd like to hear your thoughts on that. I love this sort of content, keep it up!
Changing the CPU affinity sounds like a drug/chemical’s affinity for the receptor it’s seated on. Flushing the L1 cache sounds like taking a receptor antagonist while the receptor is occupied by an agonist. What process has the authority to “kick” you off your set affinity? Anything?
But is there really any benefit to running NUMA mode on Threadripper when you can just set affinity to cores within 1 die or 1 CCX, depending on how many threads the game benefits from? Even on a Ryzen 1600X, setting affinity to threads within 1 CCX can improve framerate in some games. Also, if you want some more advanced tweaking, Process Hacker is a program that can set affinity individually for every thread, but unfortunately you have to do that manually for every thread each time you open the program.
Intel server grade hardware also supports "UMA". I should say... you really can't make a NUMA topology become UMA by flipping a switch. The moment you have two CPUs with an IMC on each one you have NUMA nodes. What you can do, and both Intel and AMD support this, is setting node-interleaving on. This will present an UMA like topology to the OS (I'd rather call it Sufficiently Uniform Memory Architecture or SUMA, as some people do). This will only change the memory addressing to be interleaved, so when you access memory in sequence you will hit local memory, remote memory and so on. The latency will "look" uniform but it is not actually. I've read about workloads that prefer this predictable approach, but I've never seen a case where it is necessary in real life.
The big question: why doesn't MS allocate cores which aren't doing anything (more or less no load) to games? Why put part of the game's load on the same CPU core as Windows? That doesn't make any sense to me. I noticed that Wendell is running Ubuntu. Preparations for the PCIe passthrough on Ubuntu tutorial?
That can cause problems for Ryzen (CCX) if Windows/Linux lets threads wander from one core to another across different CCXs. However, the point is: if a game only uses up to 6 cores, the CPU has 8 cores and there are 4 cores per CCX, then why not reserve 4 cores on one CCX and 2 cores on the other for the game, and confine Windows to the two remaining cores? My point was that you want games to use as many cores on the same CCX as possible.
> However, the point is, if a game only uses up to 6 cores and the CPU has 8 cores A game (or any other application) does not "use 6 cores" if you want to be fully correct. It uses/has on *average* 6 *software* threads that are runnable (i.e. not blocked/waiting). Threads are dynamic, they come and go. And there are many more threads than hardware cores. There is no direct relation from a thread to a core. It is up to the scheduler (and depending on priorities, time a thread has already run etc.) to decide which thread runs on which core. If a higher prioritized thread becomes runnable the scheduler has to pick a core that is either idling or another thread has to be stopped. The scheduler however will try to keep a thread on one core as long as possible, to allow for the cache of that core to be kept hot. NUMA is making all of this much more complicated as the scheduler must be aware of it. With SMP things are easier since all cores are created equal (including SMT/Hyper Threading "cores").
Peter Jansen A particular logical core can't have 2 things running at the same time -- there's only ever 1 thread running per logical core. It's just that the cores are timesliced by the OS scheduler. In reality, a game can go for periods of time with all threads halted (not running on any cores), and then suddenly a bunch of threads spring to life. So, if Windows receives other work to do (from the OS or other background processes) while some game threads are sleeping, it sees unused cores and may put them to use. A good analogy is a restaurant. Each table will be used by dozens of parties during a day, but only 1 at a time. If the restaurant had your favorite table reserved 24/7, they'd miss out on business when you weren't there, if they were otherwise at max capacity.
Enjoy the content man, can I just check something quickly though? So if all the cores are on one node, is NUMA not possible? Does it depend on there being an interconnect such as QuickPath or more than one physical socket? Thanks man
So the OS scheduler is better for dealing with Spectre and Meltdown than any changes to the microcode? Obviously on lower core count processors this isn't the case, but for servers why is Intel responsible for fixing the issue? Please correct me if I'm misunderstanding this.
Hey I just got a workstation off ebay... the machine only POSTs when I disable NUMA... when I enable it, it's basically a dice roll if it POSTs and 100% doesn't get into the OS. What could be the issue?
hehehe. I clicked like because my 2009 Mac Pro died (2 x 3.46 GHz Xeons with 32 GB ECC memory). Now I'm back on my ~2008 Core 2 system, which I upgraded to a quad core from South Korea on eBay, and maxed the memory to 8 GB. The 'old' workhorses you are showing made the internet way better. Time will tell if the newbies can do something awesome ;)
I think it is more the latency in the RAM when the video card has the compute strength but lacks the physical RAM to handle the load, and it hands over to system memory, with the traversal causing an over-buffer because the RAM on the video card isn't enough to utilize a mix of 4K HDR, 6K and 8K content.. noting that native 4K HDR and the other niceties of other compute options take up more storage on the hard drive.. the average 4K HDR game consumes 250+ GB in storage assets, and the new servicing tweaks add another 150 GB more than likely... we have been past 24 GB in a GPU as standard since 4K HDR became a console standard.. why do you think it takes so much RAM to process 4K content in a video editor? Because even if you stripe 8 GPUs with 24 GB RAM each you still wouldn't have the RAM capacity at 192 GB to process what you are doing. I would not expect 1 TB of RAM, if applicable, to be functionally used; at some point you are going to get a bus bottleneck in frame rate, whether it is the SSD, the RAM, or the GPU and its RAM limitations.. what you are talking about is bus limitations between hard drive and GPU RAM, and the intersection where system RAM comes into play covering the shortfall.. why should we be talking about HDMI vs DVI vs VGA vs DisplayPort.. when 1080p in general has been a standard on PC for the last 6-10 years and anything higher falls to HDMI or DisplayPort standards.. whilst 2K and later 4K have been active within the AV market space for close to 25+ years, since before Blu-ray was a chosen format...
Why did AMD decide to configure the 2990WX as 4 NUMA nodes instead of just 2? For example: NUMA node 0 = Zeppelin 1 [master] + Zeppelin 2 [slave], NUMA node 1 = Zeppelin 4 [master] + Zeppelin 3 [slave].
Hi Wendell, can you please make a video about making a dual-boot PC? I have tried to set up a dual boot of Windows 10 and Linux Mint, but when I install Linux Mint from a bootable USB made with Rufus it just stays on the Linux Mint logo. I have done this before on my old PC, but I do not know why it is not working anymore. I have an X370 Prime Pro, 1700X, 16 GB RAM and a 1070 GPU. Please make a video of the step-by-step procedure and what programs you are using for this. Thanks.
"If anybody is gonna make Non-uniform Memory access exciting, I'm your guy" earned my like right then and there man...
+1
Content found nowhere else! Thank you! :-)
Ma-ia-hii
Ma-ia-huu
Ma-ia-hoo
Ma-ia-haa
You have restored my faith in humanity. I came to this video hoping for this comment.
Salut
I knew from the title lol....
I was looking for this comment!!!! Yes!!!!!!!
^ this
For the hiccup at 2:54 ----------- > Just change the quality of the stream from whatever you're running it at in the middle of the hitch. It will push through.
This is the first channel I consciously subscribed to. It looks to me like you know what you're talking about and you produce serious videos with real info. Well done.
*Wendell ECC-ing (Error Correcting) the crap out of himself in this video :D
Ah those first world problems, mixing up your cores for threads, and your 8 core CPU for your 16 core cpu. Haha :P
Anyone else getting stuck on 2:54 no matter what?
same here
Yep, exactly what's happening.
I got to 2:57, jumped forward a few seconds and it continued. weird.
Wendell has packet loss
Just change the quality of the stream from whatever you're running it at.. Mine went through just fine.
Probably the few but I love watching these informative videos.
Love it! This is high quality nerd content. And you can tell Wendell means business because of the awesome shirt.
These kind of deep insights are the reason I am a fan of this channel
Love the graphics on the TV. But really, thank you so much for the forum and these great videos.
The R710 seems to be THE thing for home labs and learning, I see it in almost every server-related video on YouTube. :)
I get it when Wendell explains NUMA latency using the old dual-socket Xeon server: 2 physical CPU sockets, 2 separate I/O risers and 2 RAM banks joined by Intel QuickPath, so cpu1 has to ask cpu2 whenever the I/O or memory it needs is on the other riser/RAM bank. But I want to know: would ANY latency be present on an X399 system if only ONE I/O device (1 GPU in a PCIe x16 slot) is connected directly to the TR CPU and only ONE of the RAM banks is populated? TR then has only one of each on the motherboard to use when running a game, for example. With 2 GPU slots and all RAM slots populated it is easy to understand why latency would be introduced. Great video Wendell! You made NUMA easy to understand. I wish you would soon make a video about how to properly assign AMD Ryzen/TR CPU cores to virtual machines in VMware Workstation and Hyper-V for the most optimized VM experience. EDIT: NVM, I watched the video till the end :P It seems it does not matter how many slots are populated if you do not limit the app or game to specific cores on a specific CCX.
Numa? Is it as bad as Ligma?
but not as bad as smegma. i'll get my coat
I think joe really likes ligma
What's Ligma?
@@edsknife joe mama likes to ligma balls
Ligmanuma is worse
Wendell, You make superb videos! Keep up the great work, Sir!
Using memory benchmarks I noticed a general improvement on the 1950X using UMA regarding peak bandwidth, I was unable to test latency
Thanks for this video. I learned something today.
I run a 5950X. I'm going to try Process Lasso and/or try modifying the core affinity for my games. I know that my system is not running NUMA, but as far as I understand there is still some latency when a task is spread across multiple CCDs. Thanks for this.
That's strange, the video won't play past a little after 2:45. I just skipped a little ahead and it's playing fine, which I hope still counts as a view in the system.
Same happened to me.
UMA mode doesn't just lie. It lies AND it also shuffles the memory so (for example), each block of 256 bytes alternates between NUMA nodes. This increases the chances that general memory allocations by the OS will cover both nodes and spread the memory load out. Some will have higher latencies, some will have lower latencies. On average it works out pretty well because latency is just one of several stall conditions that can slow a CPU down. The mechanism that makes UMA work reasonably well is this shuffling. Basically, one of the low address bits is exchanged with a high address bit in the physical address space.
This shuffling appears to require symmetric NUMA. I have never been able to enable any sort of UMA mode on my 2990WX, and I'm guessing the reason is that the 2990WX has asymmetric NUMA, where two of the four CPU dies have no memory connected to them at all. I guess the address bit shuffling hardware doesn't work for that particular situation. That is something AMD should fix. The 2990WX would benefit greatly from having a UMA mode.
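A rough sketch of the bit exchange described above, in case it helps picture it. The 256-byte block size comes from the comment; the choice of bit 34 as the node-select bit is purely an assumption (it would correspond to two nodes holding 16 GiB each), and the function names are made up for illustration. Real memory controllers may hash addresses differently.

```python
LOW_BIT = 8    # 2**8 = 256-byte blocks, as described in the comment
HIGH_BIT = 34  # hypothetical node-select bit: two nodes with 16 GiB each


def swap_bits(addr: int, low: int = LOW_BIT, high: int = HIGH_BIT) -> int:
    """Exchange bit `low` and bit `high` of a physical address."""
    if ((addr >> low) ^ (addr >> high)) & 1:
        # The two bits differ, so flipping both swaps them.
        addr ^= (1 << low) | (1 << high)
    return addr


def node_of(addr: int) -> int:
    """NUMA node that serves `addr` once the bits have been exchanged."""
    return (swap_bits(addr) >> HIGH_BIT) & 1


# Consecutive 256-byte blocks land on alternating nodes.
print([node_of(block * 256) for block in range(4)])  # [0, 1, 0, 1]
```

Because the swap is its own inverse, every address still maps to exactly one location; only the placement changes, which is why sequential allocations end up spread over both nodes.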
Your description of the Spectre mitigation is ... well, it's actually pretty wildly off. Sorry. Though you got fairly close. First, the system overhead is not really an issue of switching between unrelated processes. There is no need to segregate processes to avoid that. There is overhead there, but it is not synchronous... it's not in the 'hot path' that would interfere with a program on a continuous basis and affect its performance. Where the Spectre mitigation really hurts is when a user process does a system call. This is a synchronous operation, it always occurs on the same CPU as the process making the call. The syscall trampoline must issue the Spectre-related MSRs to firewall the cache effects, and these MSRs eat around 2 µs (2000 ns) of overhead when all is said and done. It's very nasty.
The same context switch issue gets even worse when we are talking about VMs. VMs have to execute the Spectre mitigation not only inside the VM, but the host also has to execute the mitigation when the VM punts out to the host (which it does for a lot of things, particularly if you can't map PCIe devices directly into the guest).
The Spectre MSRs don't just clear the L1 cache. In fact, I'm pretty sure they don't clear the L1 cache at all. They primarily have to clear the branch cache and mess around with the call/return hardware cache in the CPU core. And they also have to firewall speculative execution from crossing the boundary.
The Meltdown fix essentially requires isolating the MMU map between user and supervisor mode, which means that %cr3 has to be written to on the context switch (e.g. for a system call). Twice in fact, once to get into the kernel, and once to get out of the kernel. This overhead adds 100-300 ns or so depending on the CPU.
AMD is less vulnerable to Spectre and invulnerable to Meltdown. It is less vulnerable to Spectre because it uses full 64-bit address tags in its branch cache tags, while Intel uses fewer address bits and XORs the higher address bits into the lower address bits. This allows a user program (on Intel) to directly control the branch cache for memory locations in kernel memory space. And for Meltdown, Intel CPUs will speculatively read memory through the TLB while ignoring the U (supervisor/user) bit in the PTE, which allows a user program to mess with the L1 caches related to any kernel memory location. AMD's speculative reads honor the U bit and do not have this problem.
So, basically, Meltdown adds 100-300 ns of overhead, and Spectre adds 2000 ns of overhead to any system call. This is also true for interrupts. That is a lot of overhead considering that the nominal overhead without the mitigations is typically only 70 ns.
-Matt
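To put the figures from the comment above side by side, here is a small back-of-the-envelope calculation. The 70 ns, 100-300 ns and 2000 ns values are the ones quoted in the comment; the rate of 100,000 system calls per second is only a hypothetical workload picked for illustration.

```python
BASELINE_NS = 70          # nominal syscall overhead without mitigations
MELTDOWN_NS = (100, 300)  # extra cost of the two %cr3 writes (low, high)
SPECTRE_NS = 2000         # extra cost of the Spectre-related MSR writes


def syscall_cost_ns(meltdown_ns: int) -> int:
    """Total per-syscall overhead with both mitigations active."""
    return BASELINE_NS + meltdown_ns + SPECTRE_NS


def core_fraction(cost_ns: int, syscalls_per_second: int) -> float:
    """Share of one core's time spent on syscall overhead alone."""
    return cost_ns * 1e-9 * syscalls_per_second


HYPOTHETICAL_RATE = 100_000  # assumed syscalls per second, not a measurement

for extra in MELTDOWN_NS:
    total = syscall_cost_ns(extra)
    print(
        f"{total} ns per syscall, "
        f"{total / BASELINE_NS:.1f}x the baseline, "
        f"{core_fraction(total, HYPOTHETICAL_RATE):.1%} of a core"
    )
```

With those numbers the per-call overhead comes out at roughly 31 to 34 times the unmitigated baseline, which is why syscall-heavy and interrupt-heavy workloads feel the mitigations most.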
Good explanation. This matches what I know about these vulns too.
Super informative video! glad I took the time to watch
Off topic, but I just installed a new SSD into my laptop (it was supposed to go into my future build but was just collecting dust - and yes, after installing Chrome I watched a Level1Techs vid ^^) and I'm amazed at how many different aspects of this dying laptop an SSD has brought back to life... no more stuttery video playback is the biggest change.
How to make NUMA fun, step 1: be Wendell, step 2: you're already done
techporn
Having a NUMA system requires a bit of management to get the best performance, but once you get used to it, the pros outweigh the hassles.
I second that! I bought a couple of used Xeon E5-2687W in 2016 (16 physical cores total, 3.1 GHz base / turbo 3.8, for less than $600) and it's so versatile (64 GB RAM, 7 PCIe slots linked to either CPU 1 or 2 etc.)
I love that workstation / server. Performance is on par with 16-core Threadrippers. Thank you, eBay!
Just for a quick rundown, because I don't really know much about NUMA: if I have 2 processors (X5690s) and I have a game running on one processor's cores but the GPU is on the PCIe lanes of the other, I'm theoretically losing performance?
@@charlesturner897 yes. In particular the thread running the GPU should be in core 0 of the CPU controlling its PCIe lanes.
Since the host OS would typically use core 0 of CPU 1 (most devices being attached to it I suppose), running the GPU off of CPU 2 may yield a smoother experience in some cases.
You can check your topology to know which core numbers to target for the guest. If hyperthreaded, give it pairs accordingly (e.g. on a 16C/32T dual-socket system from Intel, CPU 2 physical cores are 8-15 and their hyperthreads are cores 24-31).
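The numbering in that example can be written down as a small helper. This is only a sketch of the layout described in the comment (physical cores numbered first, hyperthread siblings after them); the function name and defaults are made up, and real enumeration varies by platform and BIOS, so on Linux it is worth confirming against `lscpu -e` or `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`.

```python
def socket_logical_cpus(socket: int, cores_per_socket: int = 8, sockets: int = 2):
    """Return (physical, siblings) logical CPU numbers for one socket.

    Assumes physical cores are numbered 0..N-1 across all sockets and each
    core's hyperthread sibling is numbered core + N, as in the example above.
    `socket` is zero-based, so the second CPU is socket=1.
    """
    total_cores = cores_per_socket * sockets
    first = socket * cores_per_socket
    physical = list(range(first, first + cores_per_socket))
    siblings = [core + total_cores for core in physical]
    return physical, siblings


physical, siblings = socket_logical_cpus(1)
print(physical)  # [8, 9, 10, 11, 12, 13, 14, 15]
print(siblings)  # [24, 25, 26, 27, 28, 29, 30, 31]
```

Handing the guest matching pairs (8 with 24, 9 with 25, and so on) keeps each virtual core and its hyperthread on the same physical core.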
Awesome timing. I was re-watching your Threadripper livestream from January regarding Looking Glass / VFIO, as I'm in the process of setting up such a system on my dual Xeon E5-2687W (very decent workstation chips, 2012).
I'm hesitating between Arch and Fedora for the host OS... #nerdproblems
09:46 Something else for the Windows users to download separately and install ... and then figure out how to keep up-to-date.
at 2:54 the video is broken, thanks youtube
This is an excellent video. Thanks for taking the time to properly explain NUMA.
Very informative!
thanks. Had some giggles along the way.
Man, you’ve managed to clarify to the rest of us such a complex topic... you are a champion Wendell!
Love your channel buddy, keep posting Threadripper stuff (maybe a comparison between 2990WX and 2970WX once released in Oct) 👍
I remember having major headaches just trying to keep threads from stomping all over system processes back in the day, and the only fix was just as aggravating, as you could not set an affinity and have it stick. The instant the thread or process closes, that affinity is forgotten, so you basically had to set the affinity manually every single time you opened an app (on XP through Win7). I've yet to even bother looking at Win10 affinity, as it's a not-worth-the-effort kind of thing at this point, and it's easy to just assume setting affinity on Windows is more effort than it is worth. As for non-Windows, well, yeah, remembering affinity has been a thing going way back, and it has always been a set-it-and-forget-it kind of thing, lol.
I had thought about looking at the 'topo' of my FX8350 comp (Win7), however, I haven't had the time to do more than just think about it in passing. And I suspect I already know just how badly the M5A97 (rev original POJ) is laid out, lol. The top 16x slot will not POST with a GPU over 50 or so watts, so the GTX 1050 Ti had to go in the lower 4x slot (ouch), and I guess, like on the AM3-99 chipset (I forget the full name of it), that slot goes through the chipset. I don't game on the system, so it works well enough, lol. And I know the 4x PCIe 2.0 bus is NOT holding back the GPU, as the GPU has no problem going to 100% load with the bus never peaking over 50% usage in my testing some time ago.
Great vid Wendell and Crew. B)
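On the "affinity never sticks" complaint above: one common workaround is to start the program through a small launcher that applies the affinity every time. The sketch below is Linux-only, because it relies on `os.sched_setaffinity` from the Python standard library; `run_pinned` is a made-up helper name, and on Windows a different mechanism would be needed.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cpus):
    """Run `cmd` with its CPU affinity restricted to `cpus` (Linux only)."""

    def pin():
        # Executed in the child just before exec; pid 0 means "this process".
        os.sched_setaffinity(0, cpus)

    return subprocess.run(cmd, preexec_fn=pin, capture_output=True, text=True)


if __name__ == "__main__":
    # Demo: pin a child Python process to one CPU we are allowed to use
    # and let it report the affinity it actually received.
    first_cpu = min(os.sched_getaffinity(0))
    child = [sys.executable, "-c",
             "import os; print(sorted(os.sched_getaffinity(0)))"]
    print(run_pinned(child, {first_cpu}).stdout.strip())
```

Since the affinity is applied at launch, there is nothing to remember between runs; the launcher is the part that "sticks".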
Great timing! My next exam is about parallel computing and parallel programming, which includes NUMA / UMA.
This video was really great and helpful!
Great! So they will teach you the most important thing. It's Amdahl's Law.
en.wikipedia.org/wiki/Amdahl%27s_law
Then you know why Cinebench render bench is such a deceptive benchmark for general (desktop/gaming...) performance.
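For anyone who wants to see the law in numbers: the speedup on n cores for a program whose parallel fraction is p is 1 / ((1 - p) + p / n). The fractions used below are hypothetical examples, not measurements of Cinebench or of any game.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: best-case speedup on `cores` cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workloads on a 32-core chip.
for label, p in [("renderer-like, 99% parallel", 0.99),
                 ("game-like, 50% parallel", 0.50)]:
    print(f"{label}: {amdahl_speedup(p, 32):.1f}x on 32 cores")
```

A workload that is 99% parallel gets about 24x out of 32 cores, while one that is only 50% parallel cannot even reach 2x, no matter how many cores are added.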
Pretty much every benchmark is deceptive, unless it uses pretty much the same programs you would use normally.
Even Firestrike is deceptive for gaming performance, especially when considering SLI. And sometimes even the built-in benchmarks in games are far from representative of in-game performance.
You should use a benchmark according to your use case. If you really want to render stuff on your CPU I guess cinebench is a great benchmark. In every other use case: not really great. And from what I've heard Windows has problems dealing with the number of threads on a 2990WX so you also have to consider the OS used for benchmarking / your use case.
PCs are complex. There is no way there can be one easy answer (or performance number) that says everything you need to know....
> Pretty much every benchmark is deceptive.
Cinebench render bench is an _absolute_ best case scenario. One has to be aware of that fact.
Pure class, great work!
very interesting! thanks Wendell!!
This is a great Video Wendell. Could you take a deeper dive and explain why we would, or would not want to manage memory, cpu cores, for every situation to optimize performance?
Loved the video! You know what you're talking about
Fantastic content, ty.
Yes, teach us Sensei. Class is in session.
I couldn't help but notice DisplayCal on the monitor in the background. Doing some calibration? Nifty piece of software really.
haha I was just going through the best performance guide for VMware and never heard of NUMA even though it was mentioned to have it enabled and to use vNUMA
never heard of it beforehand - gonna score some pts with the boss
Good video, very informative.
NUMA NUMA node,
NUMA NUMA NUMA node
this channel is absolute gold
Very informative video, thank you
Here's one of those reports you talk about ;) I do that for an old game (Torchlight 2). It already had stuttering problems on a 9900K (8c), but on a 3960X (24c) it was way worse, so I did the set affinity thing and only selected threads 40-47 (4 cores + MT), and tada, frametimes and fps go from a rollercoaster to pretty much a horizontal line. Older Forza Horizon games on UWP/MS Store also had that big problem, as the game was decrypting the "secure" data of the game (aka the textures), and in an open world game that meant your core0 was at 100% use basically all the time. I don't know if they fixed that, but it was a complaint often found in the official Turn10 forums. Process Lasso is also something game modders/tweakers have used for years.
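A small sketch of how a selection like "threads 40-47" turns into an affinity bitmask, in Python. The function name is made up for illustration, and it assumes a machine with 64 or fewer logical processors (a single Windows processor group).

```python
def affinity_mask(logical_cpus):
    """Return an integer bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU indices must be non-negative")
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    # Threads 40-47, as in the comment above (4 cores plus their SMT siblings).
    mask = affinity_mask(range(40, 48))
    # Print as a hex string, the form `start /affinity` takes in cmd.exe.
    print(f"{mask:X}")
```

The hex value can then be passed to something like `start /affinity FF0000000000 game.exe` from a Windows command prompt (`game.exe` is a placeholder), which applies the mask at launch instead of setting it by hand in Task Manager every time.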
Very interesting video, thanks
This is a great video Wendell, and I think it's going to be an important consideration in the future, because AMD would not have been able to achieve high core counts without multiple dies and thus multiple NUMA nodes, and Intel isn't going to compete without following a similar path. I have been experimenting with my Skylake-X processor (7820X) and on some games, I get a vastly improved framerate by setting core affinity manually. But the Skylake-X is still on one node, so I'm not sure what the exact cause would be. It's more apparent with older titles that do not multithread well. WoW is a good example, and I get the best framerates in that by choosing two logical cores which are not on the same physical core (i.e. cores 0 and 2, or 13 and 15, etc.). I will have to test on GTA V. I don't know if it has to do with the userland/system level processes you mentioned. I was originally thinking it was due to Intel using a Mesh instead of the Ring Bus. I'd like to hear your thoughts on that.
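A quick Python sketch of the "two logical cores not on the same physical core" idea. It assumes SMT siblings are numbered next to each other (0 and 1 on the first physical core, 2 and 3 on the second, and so on), which matches the 0/2 and 13/15 examples above but is not guaranteed on every OS. Both function names are made up for illustration.

```python
def share_physical_core(cpu_a, cpu_b, threads_per_core=2):
    """True if two logical CPUs sit on the same physical core (assumed numbering)."""
    return cpu_a // threads_per_core == cpu_b // threads_per_core


def one_thread_per_core(physical_cores, threads_per_core=2):
    """First logical CPU of each physical core, assuming adjacent sibling numbering."""
    return [core * threads_per_core for core in range(physical_cores)]


if __name__ == "__main__":
    print(share_physical_core(0, 2))    # the 0 and 2 pairing from the comment
    print(share_physical_core(13, 15))  # the 13 and 15 pairing from the comment
    print(one_thread_per_core(8))       # an 8-core part such as the 7820X
```

If the real numbering differs (Linux often puts the sibling of CPU i at i + N), check the actual topology before trusting these helpers.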
I love this sort of content, keep it up!
Awesome, very good explanation.
Great job explaining Nonuniform Memory Access.
1:43 Topology ≠ topography!
verdict: you indeed made NUMA exciting (0:53) 👍
Thank you, Wendell
that was pretty cool............... thank you Wendell...........
You make me regret not going forward with computer science after hurricane Katrina :/
Changing the CPU affinity sounds like a drug/chemical’s affinity for the receptor it’s seated on. Flushing L1 cache sounds like taking a receptor antagonist while the receptor is occupied by an agonist. What process has the authority to “kick” you off your set affinity? Anything?
15:38 - don't worry wendell, we won't flush you :P
Sweet I'm saved my 4K TV gets confused when I try to run 1080p, so I'm not losing anything sticking with 4K.
How could he ever survive those long podcasts with the other guy. His depth of knowledge is great.
Great job!!! Please, next time try to address the issue at hand in regard to network card and network performance... :-). Thank you 🙏
Using ProcessLasso since 2015
But is there really any benefit to running numa mode on threadripper when you can just set affinity to cores within 1 die or 1 ccx, depending how many threads it benefits from?
Even on ryzen 1600x setting affinity to threads within 1 ccx can improve framerate on some games.
Also, if you want some more advanced tweaking, Process Hacker is a program that can set affinity individually for every thread, but unfortunately you will have to do that manually for every thread each time you open the program.
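To make the "threads within 1 CCX" idea concrete, a minimal Python sketch that lists the logical CPUs of one CCX. It assumes CCXs are numbered contiguously and SMT siblings are adjacent, which is an assumption about enumeration order and not something every OS guarantees. The function name is hypothetical.

```python
def ccx_logical_cpus(ccx_index, cores_per_ccx, threads_per_core=2):
    """Logical CPU indices belonging to one CCX under the assumed numbering."""
    if ccx_index < 0 or cores_per_ccx < 1 or threads_per_core < 1:
        raise ValueError("ccx_index must be >= 0 and the counts must be >= 1")
    per_ccx = cores_per_ccx * threads_per_core
    start = ccx_index * per_ccx
    return list(range(start, start + per_ccx))


if __name__ == "__main__":
    # A 1600X-style layout: 2 CCXs with 3 active cores each, SMT on.
    for ccx in range(2):
        print(ccx, ccx_logical_cpus(ccx, cores_per_ccx=3))
```

The resulting list is what you would tick in the affinity dialog, or feed to a tool like Process Lasso, so that the game stays on a single CCX.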
Intel server grade hardware also supports "UMA". I should say... you really can't make a NUMA topology become UMA by flipping a switch. The moment you have two CPUs with an IMC on each one, you have NUMA nodes. What you can do, and both Intel and AMD support this, is turn node interleaving on. This will present a UMA-like topology to the OS (I'd rather call it Sufficiently Uniform Memory Architecture or SUMA, as some people do). It only changes the memory addressing to be interleaved, so when you access memory in sequence you will hit local memory, then remote memory, and so on. The latency will "look" uniform, but it actually is not. I've read about workloads that prefer this predictable approach, but I've never seen a case where it is necessary in real life.
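A toy Python model of the node interleaving described above. The 4096-byte granule and the two-node layout are assumptions picked for illustration; real platforms choose their own interleave granularity, and both function names are hypothetical.

```python
def node_for_address(physical_address, nodes=2, granule_bytes=4096):
    """Memory node serving an address under simple round-robin interleaving."""
    if physical_address < 0 or nodes < 1 or granule_bytes < 1:
        raise ValueError("address must be >= 0; nodes and granule_bytes >= 1")
    return (physical_address // granule_bytes) % nodes


def local_fraction(start, length, step, local_node, nodes=2, granule_bytes=4096):
    """Fraction of accesses in a sequential walk that land on local_node."""
    addresses = range(start, start + length, step)
    hits = sum(
        1 for address in addresses
        if node_for_address(address, nodes, granule_bytes) == local_node
    )
    return hits / len(addresses)


if __name__ == "__main__":
    # Walk 8 granules in 64-byte steps and see how many accesses are local to node 0.
    print(local_fraction(0, 8 * 4096, 64, local_node=0))
```

In this model a sequential walk alternates between local and remote memory, so the average latency looks uniform even though each individual access is either fast or slow.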
Engagement!
Engagement X2
thanks Wendell
The big question: why doesn't MS allocate cores which aren't doing anything (sort of, no load) to games? Why put part of the game's load on the same CPU core as Windows? That doesn't make any sense to me.
I noticed that Wendell is running Ubuntu. Preparations for the PCIe passthrough on Ubuntu tutorial?
You (usually) do not pin a thread to a core. Threads wander from one core to another.
That can cause problems for Ryzen (CCX) if Windows/Linux lets threads wander from one core to another across different CCXs.
However, the point is, if a game only uses up to 6 cores and the CPU has 8 cores with 4 cores per CCX, then why not reserve 4 cores on one CCX and 2 cores on the other for the game and confine Windows to the two remaining cores? My point was that you want games to use as many cores on the same CCX as possible.
> However, the point is, if a game only uses up to 6 cores and the CPU has 8 cores
A game (or any other application) does not "use 6 cores" if you want to be fully correct. It uses/has on *average* 6 *software* threads that are runnable (i.e. not blocked/waiting). Threads are dynamic, they come and go. And there are many more threads than hardware cores.
There is no direct relation from a thread to a core. It is up to the scheduler (and depending on priorities, time a thread has already run etc.) to decide which thread runs on which core. If a higher prioritized thread becomes runnable the scheduler has to pick a core that is either idling or another thread has to be stopped.
The scheduler however will try to keep a thread on one core as long as possible, to allow for the cache of that core to be kept hot. NUMA is making all of this much more complicated as the scheduler must be aware of it. With SMP things are easier since all cores are created equal (including SMT/Hyper Threading "cores").
Peter Jansen A particular core can't have 2 things running at the same time -- there's only 1 thread ever running per core (per logical core, if SMT is on). It's just that the cores are timesliced by the OS scheduler. In reality, a game can go for periods of time with all threads halted (not running on any cores), and then suddenly a bunch of threads spring to life. So, if Windows receives other work to do (from the OS or other background processes) while some game threads are sleeping, it sees unused cores and may put them to use.
A good analogy is a restaurant. Each table will be used by dozens of parties during a day, but only 1 at a time. If the restaurant had your favorite table reserved 24/7, they'd miss out on business when you weren't there, if they were otherwise at max capacity.
what is that software running on the big monitor? I want that :-)
This dude does a good job of explaining things without boring the sh*t out of me.
Enjoy the content man, can I just check something quickly though? So if all the cores are on one node, is NUMA not possible? Does it depend on there being an interconnect such as QuickPath or more than one physical socket? Thanks man
Hey a dell R710 ( well google search app) runs my works 90TB raw raid just fine !
11:29 Topology, not topography!
18:38 “Topology’ -- *yes!*
So the OS scheduler is better at dealing with Spectre and Meltdown than any changes to the microcode? Obviously on lower core count processors this isn't the case, but for servers why is Intel responsible for fixing the issue?
Please correct me if I'm misunderstanding this.
They still have 16MB of cache.. that's like.. the entire main memory of my first PC. it ran windows 95 in 16 meg :O
Flush level1? All I can think about now is Wendell, Ryan, and Krista being flushed
isn't flushing L1 cache related to resolving the meltdown issue ?
What about nematodes?
The picture on the screen is all white during most of the video for me...
Can you add pictures/screenshots of those in the description?
Hey I just got a workstation off ebay... the machine only POSTs when I disable NUMA... when I enable it, it's basically a dice roll if it POSTs and 100% doesn't get into the OS. What could be the issue?
I’m so confused. So what’s better for latency: no NUMA node or multiple?
Only 5:00 in video already like clicking noise
hehehe. I clicked like because my 2009 Macpro Pro died (2 x 3.46GHz Xeons with 32 Gig ECC Memory). Now I'm back on my ~2008 core2 system, which I upgraded to a Quad core from South Korea on eBay, and maxxed the memory to 8gB. The 'old' workhorses you are showing made the internet way better. Time will tell if the newbies can do something awesome ;)
Clive Cussler National Underwater And Marine Agency Great adventure books!
The NUMA Graph that you showed kinda reminds me of the "It's a Unix System" on jurassic park, except this is a real Unix-ish system xD
thanks was helpful... not a gamer but good to know when it comes to Sql Server.
Who comes here after watching Linus Epyc video?
It was riveting!
I think it is more the latency in the RAM when the video card has the compute strength but lacks the physical RAM to handle the load and hands over to system memory, with the traversal causing the buffer to overflow because the RAM on the video card isn't enough for a mix of 4K HDR, 6K and 8K content..
noting that to do native 4k hdr and the other niceties of other compute options taking up more storage the hard drive..
average 4khdr game consuming 250+GB in storage assets add the new servicing tweaks another 150GB more than likely...
we have been past 24GB in a gpu as standard since the 4khdr became a console standard..
why do you think it takes so much RAM to process 4K content in a video editor? Because even if you stripe 8 GPUs with 24GB RAM on them, you still wouldn't have the RAM capacity @ 192GB to process what you are doing. I would not expect 1TB RAM, if applicable, to be functionally used
at some point you are going to get a bus bottleneck in frame rate whether it is ssd. ram or GPU and its ram limitations..
what you are talking about is bus limitations between hard drive, ram of gpu and intersection where system ram comes into play covering the shortfall..
why should we be talking a hdmi vs dvi vs vga vs display port..
when 1080 p in general on pc has been a standard over pc for the last 6-10 years and anything higher falls to hdmi or display port standards..
whilst 2K and later 4K have been active within the AV market space for close to 25+ years, since before Blu-ray was a chosen format...
This is very interesting and should make the DIDs simulator EF2000 Tact-com run much faster
Anyone knows why wendell mentioned flushing level 1 cache is bad? My guess is performance?
Does the HP Z600 have NUMA?
Let him talk about NUMA ...love your videos. Thanks for the schooling XD
I remember numa numa having a dance. Where is the dance?
Nice video
Why did AMD decide to configure the 2990WX as 4 NUMA nodes instead of just 2?
For example:
Numa Node 0 = Zeppelin 1 [master] + Zeppelin 2 [slave]
Numa Node 1 = Zeppelin 4 [master] + Zeppelin 3 [slave]
Hi Wendell, can you please make a video about making a dual boot PC? I have tried to set up a dual boot of Windows 10 and Linux Mint, but when I install Linux Mint using a bootable USB made with Rufus, it just stays on the Linux Mint logo. I have done this before on my old PC, but I do not know why it is not working anymore. I have an X370 Prime Pro, 1700X, 16 GB RAM and a 1070 GPU.
Please make a video of the step by step procedure and what programs are you using for this. Thanks.
I'm sure there is a song named numa, I'm humming it now
He’s saying it improves performance but then saying it increases latency
Ask Dirk Pitt...
Dirk Pitt would say, that Numa stands for, National underwater Marine agency. I love you Clive Cussler!💝
There is another program similar to Process Lasso, it's called... System Explorer
Is that not 8 cores and 16 threads?
I love your explanations...lol
Ryzen, Epyc... why isn't it ThreadRypper?