CPU Faster Than You Think | Prime Reacts

  • Published 27 Sep 2024
  • Recorded live on twitch, GET IN
    / theprimeagen
    Reviewed video: • Your CPU May be FASTER...
    Channel: DepthBuffer | / @depth-buffer
    MY MAIN YT CHANNEL: Has well edited engineering videos
    / theprimeagen
    Discord
    / discord
    Have something for me to read or react to?: / theprimeagenreact
    Hey, I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free, and if you want you can get a decent amount off (the free tier is the best - better than PlanetScale or any other)
    turso.tech/dee...

Comments • 226

  • @stokedfool
    @stokedfool 1 year ago +119

    I want to acknowledge at 18:44 how dope the animations are for the matrix multiply. Stepping through the algorithm side-by-side with the highlighted code is a subtle but awesome touch. The comment even gets updated to show how many times the memory was accessed.

  • @turun_ambartanen
    @turun_ambartanen 1 year ago +66

    For matrix multiplication you need one row from matrix A (contiguous in memory) and one column from matrix B (not contiguous). The CPU is very good at fetching contiguous blocks from memory (prefetching, cache, etc). So if you transpose matrix B beforehand the cache hit rate is much higher and therefore the computation is faster in total. The transpose is O(N^2), and the matrix multiplication is O(N^3), so it is worth it for large matrices.

    • @YumekuiNeru
      @YumekuiNeru 11 months ago +2

      or even moreso if you are reusing the transposed matrix several times

    • @noobgam6331
      @noobgam6331 9 months ago

      except for large matrices you'd never multiply them in O(N^3) on cpu, but yeah. strassen can be optimized as well with the transposition, but I don't know to what extent.
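
The transpose trick described in this thread can be sketched in C. This is a minimal illustration of mine (function and variable names are made up, not from the video): transpose B once in O(n²) so that the O(n³) inner loop reads both operands contiguously.

```c
#include <stddef.h>

/* Compute C = A * B for n x n row-major matrices, reading B through a
 * transposed copy BT so the inner loop walks both operands
 * contiguously (prefetcher- and cache-friendly). BT must hold n*n
 * floats of scratch space. */
void matmul_transposed(const float *A, const float *B, float *C,
                       float *BT, size_t n) {
    /* O(n^2) transpose: BT[j][k] = B[k][j] */
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
            BT[j * n + k] = B[k * n + j];

    /* O(n^3) multiply: both A's row and BT's row are contiguous */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * BT[j * n + k];
            C[i * n + j] = acc;
        }
}
```

As the reply above notes, the transpose pays off even more if BT is reused across several multiplications.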

  • @kollpotato
    @kollpotato 1 year ago +69

    I have Discord sound notifications disabled, you can't fool me

    • @landin1181
      @landin1181 1 year ago +2

      I haven’t had discord sound notifications enabled for years. It still got me

    • @darukutsu
      @darukutsu 1 year ago +4

      I don't use discord. Checkmate!

    • @nwstremo9443
      @nwstremo9443 1 year ago +2

      imagine using that pos (beyond 2017) 🤡

    • @blarghblargh
      @blarghblargh 1 year ago

      imagine not having a discord community for your projects (in 2023)

    • @darukutsu
      @darukutsu 1 year ago

      @@blarghblargh matrix, irc, git discussion exists

  • @chainingsolid
    @chainingsolid 1 year ago +37

    The video being reacted to, also has plenty of deeper CPU/GPU performance details and clarifications in the comments.

  • @nate7803
    @nate7803 1 year ago +9

    This man is the king of keyboard shortcuts yet he uses his mouse to pause the video and to go back 😂

    • @TangoFoxtrotWhiskey
      @TangoFoxtrotWhiskey 1 year ago +3

      YouTube shortcuts don't have a leader key so it's not worth it. 🤣

    • @cls880
      @cls880 7 months ago

      True lol. I wish the browser was more keyboard-friendly. The best we can do is the Surfingkeys extension, or a text browser.

  • @knofi7052
    @knofi7052 1 year ago +23

    Optimising computing is a complex job because CPUs and computer architectures are complex as well.😊

  • @CallousCoder
    @CallousCoder 1 year ago +7

    Operating system design and implementation was the funnest class after digital technology for me.
    The book by Prof. Tanenbaum that we used was really good. It was a pity, though, that we didn't have the time in college to design the whole Z80 microcontroller that we did and write a proper operating system on top of it. Not least because we didn't have a Z80 C compiler, and the book doesn't really go into assembly a lot; it is all C. But I managed to make a cooperative multitasking scheduler on my Z80 board, which made programming the washer-dryer combination that was the final test really easy for me.
    Also because I took the time to write a Z80 assembler and didn't do machine code like all the other students. I just couldn't deal with hand-assembling machine code instructions after the second lesson :D

    • @Alkis05
      @Alkis05 10 months ago +1

      Unfortunately our OS professor was a substitute and it wasn't that great. Luckily we also had an RTOS course, and we at least learned how to implement our own custom real-time scheduler policies using the FreeRTOS API on a Tiva LaunchPad. But I feel like I will have to do some self-teaching on OS if I really go the embedded systems route.

    • @CallousCoder
      @CallousCoder 10 months ago

      @@Alkis05 VxWorks and QNX are also Unix-like, fully POSIX-compliant, real-time operating systems - awesome for embedded. RTOSes are incredible; unfortunately that course didn't go into developing those. We also had to do a very rudimentary scheduler, just cooperative multitasking. But I guess in assembly that is all you can do in 6 weeks 🤣
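
A cooperative round-robin scheduler like the ones this thread describes can be sketched in a few lines of C. This is a toy illustration of mine (not the Z80 or FreeRTOS code being discussed): tasks are functions that do a slice of work and yield by returning.

```c
#include <stddef.h>

/* Toy cooperative scheduler: each task is a function that does a small
 * slice of work and returns nonzero while it still wants CPU time.
 * run_scheduler() round-robins over the tasks until every one of them
 * reports that it is finished. */
typedef int (*task_fn)(void *state);

void run_scheduler(task_fn *tasks, void **states, size_t ntasks) {
    int any_alive = 1;
    while (any_alive) {
        any_alive = 0;
        for (size_t i = 0; i < ntasks; i++)
            if (tasks[i] && tasks[i](states[i]))
                any_alive = 1;  /* task yielded but isn't finished */
    }
}

/* Example task: counts its integer state down to zero, one step per
 * time slice, then reports completion. */
int countdown(void *state) {
    int *n = state;
    if (*n == 0) return 0;
    (*n)--;
    return 1;
}
```

Because tasks yield voluntarily by returning, there is no preemption and no interrupt handling, which is what makes this style feasible on small parts like a Z80.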

  • @TanigaDanae
    @TanigaDanae 1 year ago +7

    The reason why transposing the second matrix is faster is because you can SIMD two 4-element contiguous arrays.
    In other words, the columns of the second matrix are no longer 4 floats (16 bytes) apart but contiguous.
    That is because in C/C++ (and some other languages) a two-dimensional matrix like float[4][4] is internally float[4*4].

    • @mennovanlavieren3885
      @mennovanlavieren3885 1 year ago +3

      Also, for large matrices that don't fit in the cache, a single row will still fit. So you won't have to load the second matrix from main memory over and over again.

    • @ErazerPT
      @ErazerPT 1 year ago

      Edit: sorry was wrong, C/C++ seem to be even more amusingly obtuse in this regard... [][] is usually jagged not multidim, but seems C/C++ decided it made more sense to have [][] = [,] = [m*n] and then "obviously" use pointers to construct jagged arrays. Sigh...

    • @isodoubIet
      @isodoubIet 1 year ago

      @@ErazerPT "C/C++ seem to be even more amusingly obtuse in this regard... [][] is usually jagged not multidim"
      It's not that C/C++ is being obtuse here, it's just reusing the same array-indexing syntax for multi-indices. In C++ in particular that syntax is "just" an operator, which means whether it's jagged or contiguous is up to the implementation. A vector of vectors will be jagged, but a custom matrix type can be implemented contiguously (most library authors will overload the function-call operator for this purpose though, so A(i, j) instead of A[i, j], at least until C++23, when the latter syntax becomes legal).

  • @theohallenius8882
    @theohallenius8882 1 year ago +2

    It's funny how I watched some of those videos before only to then later find out ThePrimeagen has a reaction video of it

  • @ilearncode7365
    @ilearncode7365 1 year ago +3

    6:00 "uhm ackkshually" it's not "equalize", just "mix". He didn't mix the sound levels correctly. That refers to simple volume-level control on the different audio tracks. "Equalizing" is a different thing done to audio, where you tweak specific frequencies on the tracks (whereas mixing lowers or raises ALL frequencies on a given track)

  • @Yupppi
    @Yupppi 9 months ago

    The epic CPU background music sounds like some chill version of Killer Instinct. Hope your wife doesn't mind your teraflops.
    The comment about the website was so on point too. Engineers and academics doing 90's websites when they have the most important and thorough information to communicate. There's this one dude in Finland who used to do some university job and just for fun has any and every single language formatting and writing rule, style and advice (verbal and relating to typing it on computer documents) on his website, goes through programming language and just general information about Linux and IT on rather deep level. He's pure math bachelor and worked in academic IT center for decades, used to sit in computer politics committee as well. And his website is just a sort of green tilted turquoise background with basic font, pages and pages of text. Well written and paragraphed, but absolutely nothing extra. And he still updates it, 2023 updates have been "mathematical notation 2nd edition" and "pronouncing portuguese". The dude apparently just loves communication related stuff and teaching it. Pages like that are golden and don't look like much.

  • @CottidaeSEA
    @CottidaeSEA 1 year ago +1

    This guy basically said "Python is really fucking slow" in a 12 minute video.

  • @nicholasbicholas
    @nicholasbicholas 1 year ago +3

    One of the new Fireship videos has the greatest visualization of matrix multiplication; I wish they had shown us that in college

  • @RucadiDev
    @RucadiDev 10 months ago +1

    One of the best courses at my university was the Operating Systems course - truly amazing professor. I still remember it to this day.

  • @nimmneun
    @nimmneun 1 year ago +9

    Wearing my flip-FLOPS for this for max immersion 😲

    • @jasondoe2596
      @jasondoe2596 1 year ago +1

      I hope your feet are large enough for an upgrade to GIGA-flip-FLOPS!

    • @VivekYadav-ds8oz
      @VivekYadav-ds8oz 1 year ago +1

      You could've capitalised FLIP as well since FLIP-FLOPS are also an electronic component (studied in last semester, already forgot about it lol)

    • @jasondoe2596
      @jasondoe2596 1 year ago

      @VivekYadav-ds8oz yes, I know :D

  • @CFalcon030
    @CFalcon030 1 year ago +3

    There's an awesome computerphile video comparing linked lists with arrays.

    • @blarghblargh
      @blarghblargh 1 year ago

      imagine you did a search in a web based product catalog, and every page only contained one result.
      that's a linked list (as often implemented, with a single value per node).

  • @nitrovent
    @nitrovent 1 year ago +1

    In university, a friend of mine worked on a project where they tried to optimize CPU die design/placement algorithms for GPUs. The result, afaik, was not that good because of cache. While the CPUs had large local caches all cores could easily access, the individual cores on the GPU had a hard time fetching the data, which for the given problem was much more dynamic than what e.g. games require. While that was some years ago, I can imagine this still being a problem for algorithms that need to access large chunks of data from less predictable locations.

  • @llothar68
    @llothar68 1 year ago +2

    Who else wishes that GPUs would finally start optimizing for integer and data operations as well? The 6000 CUDA cores that are each as complex as a Pentium processor could help in other data processing as well. But it took until the NVIDIA 1080 release to get even the basics for this on the GPU.

  • @KangoV
    @KangoV 10 months ago

    SIMD (AVX256 and AVX512) is being embedded into the JVM, so matrix computations can be performed with a simple API. No external C/C++ libraries needed. Java, Kotlin, Groovy and Scala will all benefit. Just imagine doing Spark DataSet processing with SIMD instructions. Nice.

  • @oefzdegoeggl
    @oefzdegoeggl 11 months ago

    yeah, it's all about getting the input values in contiguous memory, ideally paragraph-aligned so the instructions for aligned moves can be used. that's for assembly only though, i doubt rust can do that as it cannot know the alignment of your data. in fact, this contiguous memory layout is what the matrix transposition is good for. you can also get away with just duplicating the data (in both layouts) when loading the matrices. this really goes through the roof if you have too much data to fit into L3.
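
The alignment point above isn't limited to assembly: C11's aligned_alloc lets ordinary code guarantee an aligned, contiguous buffer. A minimal sketch of mine (the 64-byte figure is my assumption of a typical x86 cache-line size, not something from the comment):

```c
#include <stdlib.h>

/* Allocate n floats on a 64-byte boundary (one full cache line on
 * typical x86), so SIMD loads can use aligned-move instructions.
 * aligned_alloc requires the size to be a multiple of the alignment,
 * so round it up. Free with free(). */
float *alloc_aligned_floats(size_t n) {
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;
    return aligned_alloc(64, rounded);
}
```

With a buffer like this, the compiler's auto-vectorizer (or intrinsics) can be told the data is aligned instead of having to emit unaligned loads.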

  • @brainsniffer
    @brainsniffer 1 year ago +3

    It would have been nice if the video ended with something about how the comparison between CPU and GPU ends up, answering what the actual difference was from the initial question. But other than that it wasn’t bad. Sure a lot of it was way over my head, but so was the SIMD rust video prime mentioned. It’s a gateway drug tho.

  • @jayglookr
    @jayglookr 1 year ago +2

    I love old school, crappy looking geo-cities esque websites.

  • @robfielding8566
    @robfielding8566 1 year ago +1

    I experienced about a 10x speedup with real code doing reinforcement learning. It's the difference between being able to run a RL job and lose an hour; vs the whole day is lost if you started the job wrong. Of course, you must be doing SIMD code via the GPU to do it. I watched a bunch of interns spend all day doing RL jobs. I got my personal faptop accelerated, and just gave it to them for the summer. My CEO saw how huge the difference was, and got 3 monster boxes with a pair of GPUs in each one. It very much matters for machine learning work.
    But this is the performance of vectorized code; which has nothing to do with CPU-only code.

  • @NC-nc3gs
    @NC-nc3gs 1 year ago +13

    Transposing is simply swapping rows and columns - but it gets cached in L2 or something, I think. That is why you get cache gains. Anyway, can anyone explain?

    • @mattymerr701
      @mattymerr701 1 year ago +7

      From what I understood, it's more that when you transpose, your matrix multiplication gets data next to each other for both matrices instead of one having data next to each other and the other jumping along by N
      So, instead of adding up a row and a column, you are adding up 2 rows, where the data for a row is all close in cache meaning you get a lot more cache hits. Caches hate data that is spatially distant

    • @philipchoe
      @philipchoe 1 year ago +6

      Cache access always happens in line-sized blocks: even if you are reading a single float value, the CPU will load the entire block from memory into a cache line.
      Given that, let's say you are multiplying big enough matrices that a single matrix doesn't fit entirely in the cache. For a single value in the result matrix, you have to do A[i,j] * B[j,k] where j ranges from 0 to the number of rows in B. Without transposing B, the CPU will load the whole line containing B[j,k] every inner loop, which will fill up the cache and eventually evict previously loaded lines. So the next time you try to compute the value for C[i,k+1], it will encounter more cache misses. Transposing B before the actual multiplication reduces the chance of cache misses by making the B matrix more cache-friendly (essentially making matrix B's indexing use a B[k,j] scheme so that cache misses are reduced for adjacent j values).
      This is regardless of L1, L2 or L3 cache: the more evictions you have at your lower-level cache, the slower the operation.

    • @MrAssassin9891
      @MrAssassin9891 1 year ago +4

      Another way to think of it is just to flip the 2 inner loops. When you are doing {i, j, k} loop order and your inner loop looks like C_{i, j} += A_{i, k} * B_{k, j}, you are constantly changing A_{i, k} and you are always fetching another array when accessing B_{k, j}. But if you flip the loops to be {i, k, j}, B_{k, j} will just be consecutive values in an array, and you also get the added benefit that A_{i, k} is effectively a constant for the entirety of the j loop. This means that you are effectively only fetching values from B in the innermost loop. Another added benefit is that you can check if A_{i, k} is 0. If it is, you can just skip the entirety of the j loop, because you would just be accumulating 0 * something, which is 0.

    • @marcelocardoso1979
      @marcelocardoso1979 1 year ago

      This has to do with memory locality, which increases the number of cache hits by taking advantage of the processor's prefetcher. I had to do this in college on matrices up to 32k x 32k, but the math behind A x B^(-1) and why it works I can't recall.

    • @vitalyl1327
      @vitalyl1327 1 year ago

      Transposing is just a simplest thing you can do to improve cache access pattern. You can do a lot more complex scrambling-descrambling to optimise caches, or GPU local memory. E.g., for a lot of image processing algorithms, representing the image planes as Z-curves (or other space-filling curves) result in a much better locality.
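
The loop-order flip described in this thread can be written out directly. A minimal C sketch of mine: the same arithmetic as the naive {i, j, k} version, just reordered to {i, k, j} so the inner loop streams through B and C row-wise and can skip zero entries of A.

```c
#include <stddef.h>

/* C += A * B for n x n row-major matrices, using i-k-j loop order.
 * A[i*n+k] is a constant for the whole inner loop, and both B[k*n+j]
 * and C[i*n+j] are walked contiguously, so cache lines get reused.
 * The caller must zero C first. */
void matmul_ikj(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];
            if (a == 0.0f) continue;  /* skip a whole j-loop of zeros */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Unlike the transpose trick, this needs no scratch copy of B: reordering the loops alone makes every access contiguous.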

  • @PepsiMaxVanilla
    @PepsiMaxVanilla 1 year ago

    the worse the blog/forum post looks the more useful the contents are. i was researching something about some heat resistant tape for a project i was doing and the trash-looking forum i ended up on literally had people working on nuclear reactors talking on it

  • @mrrolandlawrence
    @mrrolandlawrence 1 year ago +1

    it's hard. one of the things i loved about coding ARM assembler was the simplicity of the instruction set. Optimising code for cycles was fairly easy. The bummer about x86 is the number of extensions (x87, IA-32, x86-64, x86-S, MMX, 3DNow!, SSE, MCA, ACPI, SSE2, NX bit, SMT, SSE3, SSSE3, SSE4, SSE4.2, AES-NI, CLMUL, RDRAND, SHA, MPX, SME, SGX, XOP, F16C, ADX, BMI, FMA, AVX, AVX2, AVX-VNNI, AVX512, AVX10, VT-x, VT-d, AMD-V, AMD-Vi, TSX, ASF, TXT, APX). There are over 2000 CPU instructions for x86-64.

    • @neodonkey
      @neodonkey 10 months ago

      That's why I love the RCA 1802 from the early 70s. Sure, it's the slowest processor you'll ever use, but its instruction set is like Zen perfection and you can learn to program the thing in an afternoon. Intel's 4004 is surprisingly beautiful too.

  • @GRHmedia
    @GRHmedia 10 months ago

    Check out the book, "Developing your own 32 bit operating system". It is a bit old but fast enough to go through and learn a lot. You probably can read through it in about a weekend.

  • @c0zn1c
    @c0zn1c 10 months ago

    Arrays got that Gigaflop energy. Just a tight, consolidated group of Chads. Linked lists got that Karenflop energy. Just holding everyone up.

  • @notuxnobux
    @notuxnobux 1 year ago +20

    Most of the time you just have to use a compiled language with no extra overhead (no GC, no unnecessary heap allocation because of objects, etc.), and do nothing obviously bad, and your program ends up being very fast without you even having to optimize it.

    • @SaHaRaSquad
      @SaHaRaSquad 1 year ago +8

      Indeed. Just days ago I switched from Python + numpy to a couple lines of Rust and the same task immediately ran 650x faster. Seeing a modern CPU do its thing without brakes on is a thing to behold.

    • @blarghblargh
      @blarghblargh 1 year ago +2

      SIMD can give a 4x speedup (or more). It's often worth exploring, for anything where 4x the bandwidth would actually make any difference to the users. Which can be more often than you'd expect, if they have to wait for the results or blocks the UI, if it's not just some nightly batch job.

    • @kuhluhOG
      @kuhluhOG 1 year ago +3

      depending on your application the latency jitter of a GC can make certain languages a no-go too

    • @zperk13
      @zperk13 1 year ago

      @@blarghblargh I was trying to make a program go faster the other day with SIMD. My program ended up telling me that 2-0=0. I need to do more debugging on the SIMD lol

    • @SaHaRaSquad
      @SaHaRaSquad 1 year ago

      @@blarghblargh In my program I tried a lot of tricks and none were as fast as just the first unoptimized version. Even SIMD with 32 values per calculation was only half as fast. I guess the cache works in mysterious ways.

  • @Roxor128
    @Roxor128 1 year ago +1

    What you were saying about pretty websites being useless is right. I think you could codify it as "The usefulness of a website is inversely proportional to its Javascript-to-HTML ratio."

    • @neodonkey
      @neodonkey 10 months ago

      It's the same as how the bigger someone's television set is, the poorer they are.

    • @Roxor128
      @Roxor128 10 months ago

      @@neodonkey The more suitable the TV size is to the room you want to put it in, the harder it is to find a good one. Most TVs are huge and most rooms are too small for it to look right. People who are poor on money are also poor on time and will give up on finding a good fit sooner.

  • @justdoityourself7134
    @justdoityourself7134 1 year ago

    9:00 I love me a nice piece of tight memory.

  • @edborowski5950
    @edborowski5950 1 year ago

    You totally reminded me....
    It was back in 2013 that now-Blizzard President J. Allen Brack issued his infamously condescending line of "You think you do, but you don't" in response to fan requests for a World of Warcraft vanilla server. (Feb 24, 2020)

  • @benarcher372
    @benarcher372 1 year ago +1

    Very educational! Hard stuff but good.

  • @Acetyl53
    @Acetyl53 1 year ago

    With modern advancements in software design principles, even greater achievements begin to come into view. A new milestone, a CPU slower than you can think. It's almost poetic. It's all come full circle. The circle of life.

  • @MrAzteca1998
    @MrAzteca1998 1 year ago +1

    I'd like to add some details that I've noticed are either misleading or missing, since I optimized the GEMM algorithm as a university project and I'm familiar with it. That project probably ended up being the one I enjoyed the most in my entire degree. The details will need to be replies to this comment because of the length, though.
    It's great to see that there are videos explaining this stuff to the general public (well, maybe not that general in this case, but you know what I mean) and the reactions derived from them. Keep up the good work. Also, if you're short on content sometime, I'd be willing to go more in depth with this algorithm.

    • @MrAzteca1998
      @MrAzteca1998 1 year ago

      Let's start with the hardware comparisons. While it's true that a Xeon E5-2630 v3 peaks at 8 FLOP/cycle using AVX2, that's not necessarily true of other processors. Some processors may execute several operations per cycle, depending on the hardware and the data dependencies between the instructions. For example, the Zen 4 architecture is theoretically capable of running 2x AVX-512 float additions in parallel with 2x AVX-512 float multiplications, or in other words: multiplying 32 floats while adding another 32 floats at the same time, reaching a mind-numbing 144 GFLOP/s per core for a Ryzen 9 7950X (2304 GFLOP/s across all its cores, which is actually not that far from the GPU in the video). Also, especially on desktop processors, the performance of single-threaded algorithms may be way higher than what the base clock of the processor would suggest, because of their insane boost speeds on short-running single-core loads. For example, the previous processor has a base clock of 4.5GHz but a max boost clock of up to 5.7GHz without overclocking, depending on the conditions, which is a whopping 27% extra (or a burst at 182.4 GFLOP/s). If we take into consideration that said processor supports AVX-512, depending on the algorithm it may not be worth it to parallelize, just due to the loss of the single-core boost combined with the parallelization overhead (which is something the video doesn't mention).
      As for the GPUs, comparing CPUs vs GPUs is a bit like comparing apples to oranges. That is to say, it's plainly unfair. CPUs are more optimized for branch-oriented algorithms, while GPUs may have some problems with those kinds of algorithms. I tend to think of GPUs as an independent multi-core SIMD coprocessor on steroids. That is because of the way a GPU executes instructions: there's a "scheduler" for each group of 32 cores (64 in older RADEON cards; both modes are supported in newer ones) which dispatches the same operation to each and every one of them, potentially masking the result out. It may be a bit fairer to say that the CPU side has 2 sockets * 8 cores * 8 floats per cycle = 128 "cores", or that the GPU mentioned in the video has 3072 CUDA cores / 32 per group = 96 "cores", which is still a pretty significant difference, but not by as much as before. There are also differences in how the caches are organized, which is especially important when optimizing CPU algorithms, since you have more control over how the code is executed: for example, you can bind each thread to a specific CPU core so that it never changes, and take advantage of the data locality this provides, which is something that (as far as I know) can't be done on a GPU. The GPU also has some architectural advantages, like the local memory, which is a fast cache shared between every core in each group of 32, but it requires manual usage (aka writing and reading to it with explicit code rather than relying on hardware-implemented caching algorithms like the CPU caches do). GPUs also require you to copy data between the CPU's and the GPU's RAM, since they are separate (unless it's an integrated graphics card, mobile, etc...) and (usually) built differently. It depends a lot on the algorithm itself.

    • @MrAzteca1998
      @MrAzteca1998 1 year ago

      On to memory: memory speed depends on a lot of factors too, such as the channel width (or girth ;) and speed, but also the number of channels and where the data is physically located. For example, say your data is on a single RAM stick (and therefore probably a single channel) but your machine supports two independent channels; then you'll be using only half your bandwidth. Of course, we can't choose which RAM stick our data is stored on (as far as I'm aware), and the hardware probably already handles that for us.
      Something the video fails to mention are latencies, which exist both in the CPU instructions and the memory accesses. When the CPU requests some data from the memory, the memory takes some time to reply. I'm not talking about tens of cycles as the video shows but rather hundreds to thousands of cycles. This is because what we call RAM is actually DRAM memory (dynamic random access memory). The dynamic part means that it uses capacitors instead of bistable circuits, which are much more storage dense. The downside is that those capacitors need to be refreshed periodically since otherwise they'll lose the data (hence the volatility of the RAM). This refresh cycle is done every few milliseconds (or whatever the manufacturer decides), but that's not where all the latency comes from. For efficiency purposes, the RAM isn't byte-accessible, but rather it is row-accessible. That is, for every data read or written there's a more than likely chance that the active row has to be changed, further delaying the operations. But it's even worse: if my memory isn't failing me, those row reads are destructive and need to be written back to keep the data where it was.
      Although latencies aren't mentioned, they aren't all that relevant for some algorithms because of the caches. The video mentions that algorithms which use the data exactly once can't possibly benefit from them, but in reality they do benefit. When the CPU requests data from the RAM, it'll be returned as a block of (most likely) 64 bytes. That is the size of a single cache line on x86 hardware. That transfer may take a lot more cycles than the ones mentioned in the video, but remember that once the block has been brought into the L1 cache we'll basically have the entire 64 bytes accessible within a single cycle. So, even though it seems the algorithm isn't profiting from the cache, it actually is, and quite a lot. The point the video makes is that the algorithm can't profit from a cache more than it already does; or in other words: no matter how much we torture our remaining brain cells on that kind of algorithm, it won't make the cache hit rate better.
      Cache management becomes especially important when dealing with SIMD because you're no longer accessing 4 or 8 bytes per cycle (assuming a dispatch width of 1 instruction per cycle), but rather 16 bytes (SSE/AVX/NEON...), 32 bytes (AVX2) or even 64 bytes on newer hardware (AVX-512). In other words, you may potentially end up requesting multiple whole cache lines per core from your (comparatively) slow RAM. It is also important to take this into account in multithreaded algorithms: the video mentions that multiple cores accessing the same cache line is potentially catastrophic, but that only applies when writing data, not when reading it. The problem with writing data is that (under normal circumstances) the data needs to be pulled into the L1 cache, then modified, then marked as dirty (aka telling the neighbouring cores that the shared cache's copy of this block is not up to date). If multiple threads are writing data within the same cache line simultaneously, then for every write the block needs to travel from thread A's L1 to thread A's L2 to the shared L3 (hopefully, otherwise into RAM and back through thread B's L3), then into thread B's L2, to finally reach thread B's L1 just to be modified, marked as dirty, and have the process repeated in reverse. There are some special instructions to avoid this problem (when avoiding it the obvious way is not possible), which are called non-temporal reads and writes. Those read and write directly to RAM (or the shared cache; I'm not sure which one it was, since it should work with both depending on the hardware configuration). The non-temporal instructions have abysmal latencies, so the compiler sometimes generates them for write-only data (aka data that you won't need later on; e.g. in a loop doing a[i] = b[i] + c[i], a is write-only and doesn't really need to be cached within L1 or L2; but a[i] += b[i] does need to be cached, and will have poor performance regardless if, for example, thread A computes the even indices and thread B computes the odd ones; the correct way to fix that would be to make thread A compute the first half and thread B the second half).
      As mentioned in the video, it is true that it is sometimes faster to copy the data with a different memory layout (including but not limited to transposing) and then operate over the new data than use the original one, I can vouch for that. Using the divide and conquer method of splitting the computation into blocks is also ridiculously effective with caches.
      Caches also have their own quirks that need to be addressed, for example cache associativity. In some cases where data is (unfortunately) aligned it may completely destroy the cache usage rate, leaving your program using only 3.125% of your L1 (I can prove it since it happened to me). As a side note, the video mentions that loading aligned data is potentially faster than non-aligned loads, but that is no longer the case in new architectures (at least that I know of).

    • @MrAzteca1998
      @MrAzteca1998 1 year ago

      Alright, let's transition into compute stuff. It is true that compute speed (not just CPU clock speed) has grown at a quite faster rate than memory bus speeds, but that's what caches are supposed to make up for. There's quite a lot of algorithms that can improve a lot in cache misses but don't because of how they are coded. The matrix multiplication shown there is just one of them. When the video is showcasing the matrix multiplication it briefly mentions the divide and conquer algorithm (albeit with a different name), but let's leave that for later.
      The video says that matrix multiplication has a computational complexity of O(n³) and a spatial complexity of O(n²). This means that the algorithm will grow the number of steps it needs by n³ as n keeps growing, while the space used will grow by n². However, the truth is that n² has nothing to do with memory transfers, which is what we're interested in. Each of the operations related to O(n³) will involve 3 reads, then a multiplication of two of those followed by an addition with the third, and finally storing the result back into memory. This means that the amount of data that will need to be transferred will also be O(n³). Spatial complexity just means how fast the space you need grows as the problem size grows, not how many data transfers you need.
      We can all imagine how badly the CPU would perform without caches if every cycle you're not only requesting multiple cache lines per core, but your complexity in memory accesses is also O(n³) instead of O(n²). This is where caches shine: you can copy the data you're working with into the cache, work with only that data locally since it will be available practically instantly, and after you're done with that data you forget about it and move on to the next batch. This is probably what the next video in the series will address. I'm not sure how in-depth that video will be, but I'm looking forward to it, since there's quite a lot of cool stuff you can do with it to make it ludicrously efficient (yes, not joking, I'm talking about improvements of orders of magnitude in cycles, cache misses and number of instructions at the same time when done correctly).

    • @MrAzteca1998
      @MrAzteca1998 ปีที่แล้ว +1

      Finally, some notes on the GEMM algorithm itself. I like to think of matrix multiplication as taking two bidimensional arrays, placing them perpendicularly in a 3d space, generating the product of each intersection and collapsing it by adding everything together across the third axis. Or something like that, I haven't checked if that works on paper but I'm pretty sure it does. It's probably a lot easier to visualize than it is to understand what I just wrote. If you manage to get an actual mental representation of what is going on there, you can instantly understand how tiling works and how to optimize the tiling itself. For example, when using caches, the order in which the tiles are computed does matter, and depending on the tiling configuration it can reduce cache misses by orders of magnitude on all cache levels.
      Also, having that representation available also makes it really easy to think up a really nice and compact piece of vectorized assembly to multiply those tiles. I managed to get between 10 and 11 operations (of those in O(n³)) per cycle per core on my machine, which is pretty impressive for a CPU. Granted that was only compute with zero cache misses, but still.
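
      One hedged way to put that 3D mental model into code (my sketch, not the commenter's kernel): fix one point along the collapsed k axis per middle-loop step, and let the innermost loop sweep a whole row of both c and b. Every access in the hot loop is then stride-1, which is the pattern auto-vectorizers look for:

```c
#include <string.h>

#define N 64  /* arbitrary size for the sketch */

/* i-k-j loop order: for each (i, k), the scalar a[i][k] is broadcast
   across row k of b and accumulated into row i of c. Both rows are
   walked contiguously, so the inner loop vectorizes well. */
static void matmul_ikj(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double aik = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];
        }
}
```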

  • @Shri
    @Shri ปีที่แล้ว

    9:06 Arrays are a single list in contiguous memory, which makes them specifically great for SIMD operations. You can't use SIMD with linked lists.
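
    A small illustration of why (a hedged sketch, not from the video): the array loop below has stride-1 loads the compiler can turn into SIMD, while the list loop is a serial pointer chase where the next address isn't known until the previous load completes:

```c
#include <stddef.h>

/* Contiguous array: addresses are x, x+1, x+2, ... so the compiler can
   load several elements per SIMD instruction (for float reductions it
   usually also needs -ffast-math, since SIMD reorders the additions). */
static float sum_array(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

struct node { float value; struct node *next; };

/* Linked list: each iteration must finish loading n->next before the
   next element's address is even known, so SIMD has nothing to chew on. */
static float sum_list(const struct node *n) {
    float s = 0.0f;
    for (; n != NULL; n = n->next)
        s += n->value;
    return s;
}
```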

  • @defter6800
    @defter6800 10 หลายเดือนก่อน

    There is one more important thing with heavy AVX: the CPU lowers its real frequency when multiple cores execute many AVX instructions, so it doesn't burn out from high temperature (hello Intel)

  • @fantasypvp
    @fantasypvp ปีที่แล้ว

    0:20 yep, can confirm i checked my discord 💀

  • @Dev-Siri
    @Dev-Siri ปีที่แล้ว

    kilochad
    megachad
    gigachad
    terachad
    petachad
    exachad
    zettachad
    yottachad
    bronto**chad

  • @pif5023
    @pif5023 ปีที่แล้ว

    Now I can GIGAFLOP with awareness 😌

  • @sdstorm
    @sdstorm ปีที่แล้ว

    Nice to see that Bill Burr grew hair and got into programming. Good for him.

  • @blarghblargh
    @blarghblargh ปีที่แล้ว +1

    I'd be interested to see how thoroughly and reliably Rust can auto-vectorize code.
    C++ compilers often don't do such a great job, and Intel ISPC demos the benefits of a language specifically designed around vector code.
    It's not so much that the compiler can and does vectorize code, but that the vectorization is stable, and doesn't degrade to scalar code silently when you modify the code slightly, or compile it on a different compiler version or build mode, etc. It's often the case that if you're bothering to ensure code is vectorized once, that you want it to stay that way.

    • @327efrain
      @327efrain ปีที่แล้ว

      I've experienced this as well with c++. I have had to sometimes switch compilers to get it to vectorize some code snippets.

    • @blarghblargh
      @blarghblargh ปีที่แล้ว +1

      @@327efrain dunno if it's still the case, but at least up until a few years ago, game engine devs would use compiler intrinsics instead of trying to coax things to auto-vectorize. Then they could be guaranteed it would emit the code they were looking for.
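
      For reference, a minimal sketch of what that intrinsics approach looks like (my example, x86-64 only; SSE2 is baseline there, so `<emmintrin.h>` is always available). With intrinsics the SIMD instructions are spelled out explicitly, so a compiler or flag change can't silently degrade the loop to scalar code:

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on every x86-64 CPU */

/* out[i] = a[i] + b[i]: four floats per instruction, plus a scalar tail. */
static void add_arrays(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* tail when n % 4 != 0 */
        out[i] = a[i] + b[i];
}
```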

    • @MI08SK
      @MI08SK ปีที่แล้ว

      ​@@327efrain you could've used compiler intrinsics to manually vectorise

    • @i_forgot1477
      @i_forgot1477 11 หลายเดือนก่อน

      in my experience, GCC and clang on unix-like systems tend to be better than rust at auto-vectorising code

  • @KingJellyfishII
    @KingJellyfishII ปีที่แล้ว +2

    SIMDeez nuts

  • @ErazerPT
    @ErazerPT ปีที่แล้ว

    The video is self explanatory when it stresses "cache line" regarding the optimization. A cache miss pulls in a whole cache line not a single element. So, per miss you want to pull in as much data as possible that will be relevant to the following computations. Transposing the matrix achieves that, and as it grows bigger, the cost to transpose is smaller than the cost of all those cache misses. But... and here comes the Clean Code killer... if your matrices are small enough that they can fit wholly within the cache then all data will be in the cache eventually, and the cost of "random access" to the cache might be smaller than the cost of transposing the matrix. So yeah, optimal performance will require a bit of RY to handle all mixes optimally, with a "generic fallback".
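
    In code, the trade the comment describes looks roughly like this (a sketch with an arbitrary size, not a tuned kernel): pay an O(n²) transpose once so that every inner loop of the O(n³) multiply reads both operands row by row:

```c
#define N 64  /* arbitrary size for the sketch */

/* O(n^2): one pass to flip b so its columns become rows. */
static void transpose(double in[N][N], double out[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j][i] = in[i][j];
}

/* O(n^3): after the transpose, a[i][*] and bt[j][*] are both walked
   contiguously, so each cache line pulled in by a miss is fully used. */
static void matmul_transposed(double a[N][N], double b[N][N], double c[N][N]) {
    static double bt[N][N];
    transpose(b, bt);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * bt[j][k];
            c[i][j] = sum;
        }
}
```

    As the comment notes, for matrices small enough to fit entirely in cache the transpose can cost more than it saves.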

  • @robertmaltby4113
    @robertmaltby4113 ปีที่แล้ว

    What if the blow stricken against humanity by the misguided and overhyped attempts at ai isn’t so much ‘robot overlords destroy us in nuclear fire’ as it is ‘copilot ensures novice programmers spend 532x more time making the same mistakes, and even expert programmers spend as much as 52x as much time recoding blunders’…?

  • @Veptis
    @Veptis 2 หลายเดือนก่อน

    As a wgpu-py contributor... It's time to update

  • @KangoV
    @KangoV 10 หลายเดือนก่อน

    IBM NorthPole is 8x faster than the Nvidia A100 and 6x faster than their top-flight H100. It has been in development for 20 years, is more efficient as well, and is on the 12nm process node, so it has plenty of headroom to be shrunk. "It has 256 cores and can perform 2,048 operations per core per cycle". What IBM and AMD are doing is embedding compute into memory on a single chip.

  • @kubre
    @kubre ปีที่แล้ว

    Damn I'm definitely going to that channel and binging some videos

  • @mworld
    @mworld ปีที่แล้ว

    Well, you should have seen what I had to work on for 2 years. Lucee 5 created by Indians. You want 300 line functions, we got em!

  • @XfStef
    @XfStef ปีที่แล้ว

    You missed out on "Dr. Bandwidth" =))))

  • @nevokrien95
    @nevokrien95 ปีที่แล้ว +2

    Prime, we're talking ML programmers...
    We optimize the shit out of everything, obsessively scaling

  • @Optimus6128
    @Optimus6128 ปีที่แล้ว

    The solution is,... using cash..
    Just throw more money at it

  • @br3nto
    @br3nto ปีที่แล้ว

    18:33 do programs get enough info to calculate whether an algorithm will be compute vs memory bounded?

  • @sammavrs
    @sammavrs ปีที่แล้ว

    i have no idea what was going on..... brain hurt

  • @SimonBuchanNz
    @SimonBuchanNz ปีที่แล้ว

    My braincells have shrunk so much that i can fit 5 times more in my skull! JavaScript makes you a genius!

  • @sinasalahshour
    @sinasalahshour ปีที่แล้ว +1

    just went and checked my discord😂

  • @L1Q
    @L1Q ปีที่แล้ว +1

    dude you are hijacking my YT viewing. every single video I watched lately you also react to, meaning I might not actually look for any new videos and watch your reactions right away instead D:

  • @soroushjm1011
    @soroushjm1011 ปีที่แล้ว +1

    Am I the only one who only uses high level languages and now feel dumb?

  • @DemiImp
    @DemiImp ปีที่แล้ว +1

    Not rotate. Transpose. Agen.

  • @BlackwaterEl1te
    @BlackwaterEl1te ปีที่แล้ว

    dam mfer just teasing us with dem Gemms algorithms.

  • @EmreHepsag
    @EmreHepsag ปีที่แล้ว

    If just rotating your matrix works, what are compiler engineers doing?

    • @jakejakeboom
      @jakejakeboom ปีที่แล้ว

      If you use a math package like BLAS/LAPACK/numpy, I believe they will look for those sorts of optimizations, and more complicated ones

  • @ionarevamp
    @ionarevamp 11 หลายเดือนก่อน

    All you ever need to process anything is 1.21 JiggaFlops

  • @nomadshiba
    @nomadshiba ปีที่แล้ว

    cache locality

  • @BozCoding
    @BozCoding ปีที่แล้ว

    ... Just checked discord because of you... lol

  • @juice2
    @juice2 ปีที่แล้ว

    TIL 1 terachad is 1000 gigachads

  • @ChevCastProgramming
    @ChevCastProgramming ปีที่แล้ว

    lol literally checking my discord notifications while prime is literally telling me the sounds were in the video 🤦‍♂

  • @henson2k
    @henson2k 9 หลายเดือนก่อน

    Cash fixes a lot of performance problems.

  • @AlesNajmann
    @AlesNajmann ปีที่แล้ว

    Cash vs cache! 😂

  • @axtract
    @axtract ปีที่แล้ว

    I enjoy these reaction videos, but is this not arguably a breach of copyright? Perhaps someone who is familiar with the specifics of "fair use" can weigh in.

  • @twenty-fifth420
    @twenty-fifth420 ปีที่แล้ว +1

    The original video is a test on how neurotypical vs atypical you are in society.
    I personally loved it and the reaction, it was very informative, dense but not patronising. 🎉

  • @br3nto
    @br3nto ปีที่แล้ว

    6:03 even with the music, you can still hear him 1000x better than most of the dialog in movies and tv shows these days

  • @trapexit
    @trapexit ปีที่แล้ว

    array vs linked list... the problem is the lack of understanding of O notation and also how it fails to really make the nuance more explicit.

  • @kazwalker764
    @kazwalker764 ปีที่แล้ว

    11:50 Did anyone else have a Robot Chicken flashback? 😂
    th-cam.com/video/kdhhQhqi_AE/w-d-xo.html

  • @xpmon
    @xpmon ปีที่แล้ว

    VIRGINIA

  • @abrarmasumabir3809
    @abrarmasumabir3809 ปีที่แล้ว

    Why do you have your hood under the headsets!

    • @Slashx92
      @Slashx92 ปีที่แล้ว

      Swag points

  • @BeamMonsterZeus
    @BeamMonsterZeus ปีที่แล้ว

    You remind me a lot of LobosJr. He's an OG justintv/twitch streamer, codes, and stares at s**t the same way you do. You're a tad more life-experienced

  • @daltonbrady2492
    @daltonbrady2492 11 หลายเดือนก่อน

    For real just checked discord... triggered.

  • @mustaphaben2921
    @mustaphaben2921 ปีที่แล้ว

    I checked my discord

  • @nicoooxx
    @nicoooxx 11 หลายเดือนก่อน

    Look at the like button at this timestamp 7:30

  • @cocosloan3748
    @cocosloan3748 10 หลายเดือนก่อน

    I bet you "parallelize" in 5 minutes tops 🤣 Its called "premature parallelization " 😂😂

  • @codingjango
    @codingjango ปีที่แล้ว

    stop the discord notification. gives me PTSD

  • @danser_theplayer01
    @danser_theplayer01 10 หลายเดือนก่อน

    Programming in python makes your pant snake significantly below average.

  • @carriagereturned3974
    @carriagereturned3974 ปีที่แล้ว

    write programs (not apps) in x64 assembler (not wasm) and watch real cpu speed and brain cell enlargement.

    • @SimonBuchanNz
      @SimonBuchanNz ปีที่แล้ว

      X64 ASM is banned by the Geneva convention.

  • @JuusoAlasuutari
    @JuusoAlasuutari ปีที่แล้ว

    That's a low bar tbh

  • @Pasakoye
    @Pasakoye ปีที่แล้ว +1

    My CPU is Hotter tha-

  • @markjenkins9424
    @markjenkins9424 ปีที่แล้ว

    ML embedded in u tsx files.

  • @RetroGenetic
    @RetroGenetic ปีที่แล้ว

    Nah, you turn those off and set your self to DND ;P

  • @lukasmolcic5143
    @lukasmolcic5143 ปีที่แล้ว

    I like you prime, but that discord, every. single. time

  • @Cracktune
    @Cracktune ปีที่แล้ว

    alt tabbed, dont even have Discord running. fawk!

  • @aimentetbirt1363
    @aimentetbirt1363 ปีที่แล้ว

    Misleading title

  • @georgerogers1166
    @georgerogers1166 ปีที่แล้ว

    Julia

  • @moosethemucha
    @moosethemucha ปีที่แล้ว

    Just do it you coward - zero to OS (btw I work at netflix)

  • @kelvinsmith4894
    @kelvinsmith4894 ปีที่แล้ว +1

    This channel is just for trolls wanna be programmers because they learned react and few others

  • @PinakiGupta82Appu
    @PinakiGupta82Appu ปีที่แล้ว

    😄

  • @discoveryme4975
    @discoveryme4975 ปีที่แล้ว

    so everything made by tailwind is useless :D

  • @BusinessWolf1
    @BusinessWolf1 ปีที่แล้ว

    This video said a lot but said nothing.

  • @saritsotangkur2438
    @saritsotangkur2438 ปีที่แล้ว

    How often do you really need to do this level of optimization for cache coherency? Any well understood matrix math already has a well optimized, benchmarked library that might even already be optimized to use the GPU. You would be laughed at for trying to start from scratch only to lose out to even more arcane low level optimizations in those libraries. Even game developers are now using ECS frameworks to get the benefits of cache coherency without the need for low level data manipulation. Do yourselves a favor and stop the premature optimization, don't you know that's the root of all evil? Optimize only when you need to.

  • @Olixis
    @Olixis ปีที่แล้ว +5

    first, cheers from Brazil

  • @vaisakhkm783
    @vaisakhkm783 ปีที่แล้ว

    first??

  • @huge_letters
    @huge_letters ปีที่แล้ว +210

    I switched my Windows 11 to Ubuntu last evening for the first time in my life - and yes, I can confirm that actually my CPU is much faster than I thought and I should've stopped gaslighting myself into thinking my PC is really laggy because of my CPU and not Windows long ago.

    • @jonathancrowder3424
      @jonathancrowder3424 ปีที่แล้ว +33

      At some point you'll find out that Ubuntu is slower than other distros like I did. Still using it for gaming. Fedora is faster and arch is very fast, but I'm too dumb to use it

    • @Endelin
      @Endelin ปีที่แล้ว

      @@jonathancrowder3424If Arch was too hard for you I recommend NixOS. You can use the graphical installer to install a graphical or barebones version. Most of the drivers get setup for you and any change you make can be rolled back easily. Vimjoyer here on youtube has some nice short videos explaining Nix.

    • @elenaquerida
      @elenaquerida ปีที่แล้ว

      @@jonathancrowder3424I have a pretty beefy computer and used WSL for dev but recently switched to arch and I’m blown away by how fast and instant everything happens.

    • @hypnoz7871
      @hypnoz7871 ปีที่แล้ว +20

      And yet Ubuntu is one of the most bloated distros in the unix realm.
      But the lighter ones are NOT user friendly. (Arch, Void, or god forbid Gentoo)

    • @vali69
      @vali69 ปีที่แล้ว

      I recommend trying EndeavourOS for anyone interested in getting into arch but too scared about its complexity. And stay away from manjaro, it might be based on arch but it's surprisingly really bad.