Please keep having Casey on even if its more "eat your vegetables" than "JavaScript junk food" content. I learn so much every time I listen to this guy talk
+
The right amount of brussel sprouts bowls to burgers is 5 to 1
@@monsieuralexandergulbu3678 So 5 bowls of brussel sprouts for every 1 burger, got it.
it's*
I agree
the CrowdStrike joke was lit
real
It was shit
@@saltstillwaters7506 crowdstrike shareholder spotted :p
simdeeznuts
SIMDeez nuts
give em the ol swizzle
@@DavidM_603 Oh my god. 😂
@@DavidM_603 the ol shuffle
Maximizing the throughput of deeznuts
Casey coming in strong with "I don't even know what all this tech slop is, what tf is a fireship and a lavarel?" That's my boy right there! 😂
Execution on the crowdstrike joke was really on point
You and Casey have such good chemistry, please consider turning these videos into a podcast series!
Someone doesn't know about the Jeff and Casey show.
@@braincruser This show was so good, Casey in his unleashed mode.
@@braincruser Yeah, this will also end up with the hosts to the punches.
Please consider sewerslide.
A 400-part series called "Handmade BFFs"
For people who are still confused about the L1/TLB address checking explanation: this is a cache lookup scheme (virtually indexed, physically tagged), not really invalidation. Instead of sending the virtual address to the TLB and then using the resulting physical address to look up the L1 cache, the TLB and the L1 are accessed concurrently: the TLB produces the physical address, the L1 produces the physical tag it has stored for that set, and the access is a hit iff the two match. Doing the two lookups in parallel matters for speed, because waiting for the TLB translation before even starting the L1 lookup would add its latency to every access. The only address bits the L1 can use to pick a set before translation finishes are the ones that are identical in the virtual and physical address: the offset within a 4 KiB page (everything above the page offset is exactly what the TLB changes), minus the offset within the cache line (because every byte of a cache line lands in the same set anyway). This presents a hard limit on how the L1 cache can be structured without breaking or rewriting operating systems; it can only have as many buckets (sets) as those bits can index.
Bringing it all together: cache lines are 64B == 2^6B, so 6 bits of the address refer to the cache line offset; pages are 4096B == 2^12B, so 12 bits refer to the page offset; 12 - 6 == 6, so 6 bits refer to the offset within a page but not within a cache line; 2^6 == 64, so there can be at most 64 sets (buckets) for the cache; the old L1 cache stored 8 items per bucket, so its total size is 64*8*64B == 32,768B == 32KiB; the new L1 cache stored 12 items per bucket, so its total size is 64*12*64 == 49,152B == 48KiB.
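If it helps to see the numbers fall out, here's a minimal C sketch of that arithmetic; the 64 B line size, 4 KiB page size, and 8-way vs. 12-way figures are the ones from the comment above, nothing here is specific to any particular CPU:

```c
// Minimal sketch of the L1 set-count and size arithmetic above.
#include <stdio.h>

int main(void) {
    unsigned line_size = 64;                    // bytes per cache line (2^6)
    unsigned page_size = 4096;                  // bytes per page       (2^12)

    // Only the page-offset bits above the line offset can index the cache
    // before translation finishes, so the set count is capped at:
    unsigned max_sets = page_size / line_size;  // 2^(12-6) = 64 sets

    unsigned old_ways = 8, new_ways = 12;
    printf("max sets: %u\n", max_sets);
    printf("old L1  : %u B (%u KiB)\n",
           max_sets * old_ways * line_size,
           max_sets * old_ways * line_size / 1024);   // 32768 B = 32 KiB
    printf("new L1  : %u B (%u KiB)\n",
           max_sets * new_ways * line_size,
           max_sets * new_ways * line_size / 1024);   // 49152 B = 48 KiB
    return 0;
}
```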
Loved the explanation of how L1 cache works.
Prime's perspective as someone who isn't knowledgeable about this topic helped me better understand Casey's explanation.
Would totally watch a regular show or podcast where Casey explains to Prime how things work down at the hardware level.
It was anything but boring!
Thanks to you two for doing this one
Another Casey video? Count me in! I don't care how many hours that guy talks, I'm always learning so much from him.
Love how Casey is 2x the size of Prime, just towering over him as a disembodied head. 😆
how else is he going to fit the knowledge in his head?
Simdeeznutz, time to learn bud.
Love Casey! He actually knows what he's talking about. Great resource
26:42 Hahaha the chat message “BEAM = berry easy artificial machine” was very under-appreciated
True
I just wanted to say that I really appreciate those in depth explanations
34:48 A handy conversion to remember is that light travels ~1 foot in 1 nanosecond (in a vacuum). Electricity in silicon is about 20% of that
Exactly, wild! So to get to L1 cache (and back) in 3 cycles at 5GHz, the cache could be at most 3/10 foot = 3.6 inches away from the core. And that’s assuming best case just for the speed of electricity itself. The cache has to actually do something, too. In practice, L1 cache is separate for instructions and data, and is physically located right next to the associated pieces of the CPU core.
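For anyone who wants to redo that back-of-the-envelope calculation, here's a tiny C sketch; the 5 GHz clock and 3-cycle latency are the assumed numbers from above, and the 1 ft/ns figure is the vacuum speed of light, i.e. the absolute best case:

```c
// Back-of-the-envelope bound on how far away L1 could physically be.
#include <stdio.h>

int main(void) {
    double clock_ghz  = 5.0;                        // assumed clock speed
    double cycles     = 3.0;                        // assumed L1 round-trip latency
    double ft_per_ns  = 1.0;                        // light: ~1 foot per nanosecond

    double budget_ns  = cycles / clock_ghz;         // 0.6 ns for the round trip
    double one_way_ft = budget_ns * ft_per_ns / 2;  // half the round trip each way
    printf("one-way bound: %.2f ft = %.1f in\n",
           one_way_ft, one_way_ft * 12.0);          // 0.30 ft = 3.6 in
    return 0;
}
```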
Please invite Casey again! And give him a whiteboard!
Very good video! Keep having Casey on stream, it's really interesting and entertaining.
I've been listening to Casey since his Handmade Hero series. It was such a formative experience and glad to see him on the channel. Thank you
Casey is great. Dude is so chill
I know lengthy, deep (that's what she said) explanations might be boring for a lot of people and not great to do on stream, but I want to say I really enjoy those. It takes the edge off of all the abstraction we're submerged in every day and it actually feels like computer science. I can't apply any of what Casey said, but I loved every second. I wouldn't want Prime or Casey to feel weird about doing these, even if they hurt the stream's numbers a bit.
Learning how SIMDeez nuts code is generated from regular C/C++ code by the compiler would be great. Nobody wants to rewrite everything to SIMD, but just having the compiler do that for them with maybe some minor tweaks and mental model shifts would be great
I wanna say mojo is working on something like this if I recall
The compiler is quite limited in what it can vectorize, no? You need to write your program in a vector friendly manner to even hope the compiler will auto vectorize it.
@@TapetBart depends on the semantics of the language, and tons of other things, but yes, the compiler can't do everything. Especially compilers that were not designed for vectorizing from scratch.
Efficient SIMD code is more about data layout and memory access patterns than particular instructions. The compiler typically can't do anything about your data layout so there are serious limits to what auto vectorization can achieve.
@@Bestmann3n it goes both ways: without SIMD instructions you can't really take full advantage of good memory layout, and without a good memory layout you can't get the best out of SIMD.
this was great Prime, I know this type of content is not the best for viewership... but it's deeply appreciated by some of us who want to learn from people like Casey. He is a national treasure.
2 hour long Casey discussion. Sick.
Always great to see Casey on the show -- love these interviews!
I care. It's important stuff. Casey Muratori is a fantastic brain that is so enthusiastic. Love that guy.
Love Casey and these deep dives. He's incredibly interesting to listen to!
SIMD nuts all the way. The opmask regs that came with AVX-512 are the true GOAT of that extension. New opmask instructions were added for operating on all the vector sizes: 128, 256, and 512-bit.
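For anyone curious what an opmask actually buys you, here's a minimal C sketch of a masked add using the AVX-512 intrinsics; it needs an AVX-512F-capable CPU and something like -mavx512f to build, and is only meant as an illustration, not tuned code:

```c
// Masked add: only the lanes whose mask bit is set get the sum,
// the rest keep the value from the "src" operand.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 100.0f; out[i] = -1.0f; }

    __m512    va = _mm512_loadu_ps(a);
    __m512    vb = _mm512_loadu_ps(b);
    __mmask16 k  = 0x00FF;                    // only the low 8 lanes are active

    __m512 vr = _mm512_mask_add_ps(va, k, va, vb);  // inactive lanes keep va
    _mm512_storeu_ps(out, vr);

    for (int i = 0; i < 16; i++) printf("%g ", out[i]);  // 100..107, then 8..15
    printf("\n");
    return 0;
}
```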
I really enjoy in-depth talks like these with Casey. Please keep it going, its really incredible.
I just want to say this was fascinating. we need more of this.
That point Casey made about it feeling positive was spot on. Always feel really excited about my job after listening to these. The point of building things for the joy of building things really hit home as well. I've been struggling to figure out why I don't enjoy programming any longer and it is literally because "get it out now!".
Please make this a monthly or biweekly podcast. Love y'all's interactions, you really bring the best out of eachother.
This is great, I just started the Coursera course "Nand to Tetris" so I can actually understand how a computer works. Then boom, the same week this gem shows up
That's a great course and very fun too
But can you do NaN to Tetris?
@@jwr6796 I can't even spell NaN....
His website is gold for this stuff. Almost too much information but its all good
@@jwr6796 If I remember right all float ops involving NaN spit out NaN, so I don't think it would work... Now if you could build a logic table where you can get more than one result... (well, there are 2 types of NaNs, signaling vs non-signaling, and there are probably some bits left...)
Best content in my feed for weeks. You're both great!
Amazing video. As someone that is studying compilers for hopefully a career switch one day, I would really love to watch that SIMD talk
To make this comment more productive I would like to add that the majority of cache misses in VMs, like the JVM, happen because of the additional tag field added to each object and not because objects are scattered all over memory.
The JVM for example uses a class of GC algorithms known as mark-compact collectors. During the compact phase the GC will place all the objects that reference each other as close together as possible. This is something that even a C++ programmer has to actively think about and doesn't get for "free".
Before the collection happens, objects are also allocated on something called a TLAB, Thread Local Allocation Buffer. These buffers are large memory spaces exclusive to one thread, so objects allocated by that thread can always be placed next to each other without any interference from the outside world (a rough sketch of the idea is below).
If anyone is more interested in this stuff I suggest a less known book:
The Garbage Collection Handbook: The Art of Automatic Memory Management
This book is basically the CLRS of memory management algorithms.
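If anyone wants a feel for the TLAB idea without reading VM sources, here's a hedged little C sketch of a per-thread bump allocator; the buffer size and alignment are made up for illustration, and real TLABs have plenty more machinery (refills, waste limits, etc.), so this is only the concept, not how HotSpot actually does it:

```c
// Each thread bump-allocates from its own private buffer, so consecutive
// allocations from one thread end up contiguous in memory.
#include <stddef.h>
#include <stdio.h>

#define TLAB_SIZE (64 * 1024)

typedef struct {
    unsigned char buf[TLAB_SIZE];
    size_t used;
} Tlab;

static void *tlab_alloc(Tlab *t, size_t size) {
    size = (size + 7) & ~(size_t)7;               // keep 8-byte alignment
    if (t->used + size > TLAB_SIZE) return NULL;  // a real VM would grab a new TLAB
    void *p = t->buf + t->used;
    t->used += size;
    return p;
}

int main(void) {
    Tlab t = { .used = 0 };                       // one of these per thread in practice
    void *a = tlab_alloc(&t, 24);
    void *b = tlab_alloc(&t, 40);
    // b lands right after a, so objects allocated together stay close in memory.
    printf("a=%p b=%p gap=%td bytes\n", a, b, (char *)b - (char *)a);
    return 0;
}
```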
I'd be interested to know how the tag hurts cache performance. Is this just because that extra memory dilutes the cache, or is there some level of indirection going on?
Wow that is really interesting, I didn't know that. Gonna check up on that book
Really good show, watched the whole thing and would love to see another one. Great vibes, learned a lot, what more can you ask for. Keep it up, guys!
Casey is by far my favorite guest! I learn a ton every time he’s speaking. Also he’s great at simplifying and explaining things!
This may actually be my favourite discussion so far, I thought I already understood a lot of this but I was missing some key concepts.
I think we should go even deeper with Casey in the future.
when I started programming I watched around 100 episodes of Handmade Hero.
I think a lot of people don't have that context.
I know enough of the basics of virtual memory and cache associativity to follow this, but I think a lot of people, even experienced ones, don't have this context
Man, we need more Casey on the channel. Love hearing his expertise. He is a great teacher. I found the CPU deep dive chat very fascinating. Would love to hear more things like it.
Thank you Prime! Casey is awesome! This is just such an interesting subject, now searching for the HW engineer's perspective as Casey mentioned @1:05 :D
- How do you know Casey is dropping tech bars?
- His mouth is open
btw, simdeeznuts
Casey is amazing. Please bring him back!
You should post this as a podcast, so I can listen to it while walking my dogs in the forest.
YES
please do
I would LOVE to listen to this while walking my dogs in the forest.
IF I HAD ANY
This is actually one of the primary reasons why I bought YouTube Premium. No ads and offline background videos. Most of the YouTube content that I consume is basically podcasts. I primarily listen to videos and having them downloaded is great when you need to drive/be somewhere without good internet access
what prevents you from...i don't know...playing the YouTube video and listening to it just like a podcast?
@@OBEYTHEPYRAMID I’m assuming there’s no internet service on his dog walk in the forest
This video is so good, I am listening to it twice! Casey is such a good communicator, he could have just told you, "Intel can't increase cache because 4096 is a small number" but instead he took us through a constructive and instructive journey of the entire system so we could make that conclusion with him. Before he mentioned the memory size limit, I had already intuitively known it, built on the scaffolding he had put up in my mind. Brav-f'n-O! This is my Brav-f'n-O face.
Casey my hero, so cool to watch u together, and u, Prime, ask him really interesting questions
Not so many popular channels go this deep, explained so well. Prime content right here.
SIMDeez NUTs
Not bored at all dude. Stayed till the end.... 👍
I love this kind of stuff. Now I can watch the same video all week! (which is exactly what I'm going to have to do If I want any chance at understanding what these guys are talking about)
Edit: I might be exposing myself as a noob but if you hear all this doesn't it make you respect the devices we use everyday that much more.
1:04:00 or so was a lightbulb for me and I suddenly understood it once he tied it to a cache miss. I can't believe an hour already passed watching this, it just flew right by
Him not knowing fireship is funny 🤣
he even said "idk what fireship is" instead of "who" lol
CASEY IS ON THE CASE!!!!
Every time I see a video that has Casey in it makes me smile.
Bro I love Casey sooo much!! Please bring him on more
Oh definitely keep these coming, these are a goldmine.
SIMDeezNuts
49:13 missed opportunity to make a cache hit joke right there
Casey seems very knowledgeable, love to hear his thoughts
SIMDeezNUTZ
These things just go above my head, there is so much more to learn
Do you know what Beam is? If so, please enlighten me.
Props to Mr Eagen for following through with esoteric questions that were apparently "spot on". That's not an easy feat, following Casey's beautiful in-depth explanation.
This was really interesting, bring Casey on more
In the discussion of modulo vs. masking for the hashing: masking with a power of two minus one is the same as modulo by that power of two, for non-negative integers anyway. i & 255 == i % 256 where i is an unsigned integer.
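Quick sanity check of that identity for anyone who wants to convince themselves, plain C and nothing clever:

```c
// Verify that masking with 2^8 - 1 matches modulo 2^8 for unsigned values.
#include <stdio.h>

int main(void) {
    for (unsigned i = 0; i < 100000; i++) {
        if ((i & 255u) != (i % 256u)) {   // mask = 2^8 - 1, modulus = 2^8
            printf("mismatch at %u\n", i);
            return 1;
        }
    }
    printf("i & 255 == i %% 256 held for all tested values\n");
    return 0;
}
```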
Amazing video, I heard people say AMD made improvements but I didn't understand the terms. Finally someone is talking about what the improvements mean, thank you
I was here for a little of this conversation and it was great.
the crowdstrike joke.... beautiful!
Please do more with this guy, it reminds me of the time when we had to know our hardware well if we had to write code for it
Casey is awesome, his course opened my mind to new things after 24 years of professional (yea right...) programming.
Best content on yt... and it's not even close.
SIMDEEZNUTS
more Casey please! Amazing knowledge and content.
Lots of interesting thoughts on the vertical potential of LLMs. IMO they are and continue to be used as blunt instruments: the techniques are brand new, and we're still learning incredible amounts about how to use and combine the components. I think regardless of the hypothetical vertical potential in the future, there are going to be huge amounts of lateral expansion as every industry and niche finds their own special use cases and refined designs.
Bring Casey on more. He is such a delight
I could listen to these deep-dives for ages
wow this is taking me back to the days of cpu designs :) physical & virtual addressing. page aligns, cache flushing. oh the memories.
You guys should do a semi-regular segment, call it "Prime Lesson Time w/ Uncle Casey"
I want to call him uncle because of a friend of my dad's who I called uncle who was like Casey; very smart tech-wise but had that strong Dad energy and the ability to explain things as simply as possible.
Alt: "Prime Lesson Time w/ Mr Muratori" if you want to be fancy.
Casey just single handedly elaborated the best JavaScript defense argument EVER
Simdeeznuts for more Casey interviews
57:21 the moment when you close your eyes so you can use your entire visual cortex processing power to actually see things in your mind !!!
This is such great entertainment. I already knew most of this but
1) I feel so smart
2) This is not efficiently put together but entertainingly put together
I have nothing but love for this. SIMD would be great, I would probably orgasm if you'd discuss long word instruction pipelining, so don't do that.
Simply put: This was awesome 🎉
One of the best episodes I’ve watched.
Incredible stuff here, will cross-reference in 5 years when I finally understand everything Casey said 😅 that translation buffer was kind of a crazy concept.
This is Prime content honestly
I loved this, the whole L1 cache thing was super interesting
Great vid, thanks Casey and Prime👍🏻
As a normie with no programming/coding anything, I actually understand this. Cheat Engine vaguely works based off of "bits that don't change" and "bits changing less often". Gaming experience ftw
simdeeznuts!
SIMDEEZ NUTS
Interesting interview. Great deep dive about 8 ways etc. Didn't know that. At all.
this video is pure gold
They can macro that extra page for the operations. That makes the cache that much more effective based on the extra cached instructions.
I have to rewatch the "32 KiB 8-way -> 48 KiB 12-way" explanation again, I need to take notes and draw some diagrams to understand this.
CPUs are so fascinating dude!
3:16 Oh my God this is so hilarious! 😂😂😂 I had to look at my screen. Why did it stop? Oh no! Oh no! Oh no LMAO
😂😂😂 all I needed was the crowd strike joke.
Could listen to Casey for days...😎
I can listen to Casey talking metal all day.
My background for the last 10 years is datacenter infrastructure at a movie studio, and CPUs are a very big topic in that space, and really in any space where you have scale-out compute (scientific computing, rendering, etc). Outside of those environments, the CPU really doesn't matter to its consumers unless it causes some sort of bug.
Asahi Lina gave a great explanation of much of the same info in her 3-hour deep dive into the macOS GPU exploit she uncovered while reverse engineering the GPU for her Linux driver, implemented in Rust. And she has a pink slide deck which helps visualise the material.
Casey is basically explaining the Hennessy Patterson book. Although he's good at doing so :)
As a last-year EEE student, this guy helps me understand many things