The animation is sweet, but do people realize that this is a native ultrawide vid? IMMEDIATE subscription for choosing by far the best aspect ratio... y'know, the one we collectively failed to adopt. Seriously, not opting for 21:9 is the big tech fumble of our time, it's wild how narrow and tight 16:9 looks these days if you have some UW experience.
This video flew entirely above my head (still getting into this math in Uni haha), but the presentation is stunning, and I still watched through all of it because these optimizations are weirdly beautiful to me. Awesome job!
I love how in depth this video makes me feel smart. I know that only a few people could make sense of such content like this. But, you make it feel like even more people can get close to it.
I've been teaching myself how to use SIMD for the past 3 weeks and I can't even tell you how helpful this video was. I was baffled when my serial implementation of some image processing code was 10x faster than my naive SIMD implementation. Took me quite a while to understand how that was possible. This video has made my greatly appreciate the simplicity of Rust's (experimental) portable SIMD library. Also, did not know what OpenMP was and it seems somewhat similar to Rust's library. Absolutely incredible video!
You can get better performance by unrolling the loops(do it aggresively, for example, unroll the whole loop by block size(in GCC, #pragma GCC unroll 8)). However, it would still be 3 times slower than intel mkl.
@@axe863GCC in particular doesn't like unrolling loops and vectorizing them as much as LLVM does in my experience. Although they seem to have caught up a lot with the latest few versions.
@@blipman17 Yeah this is some info I was looking for. Based on what I've seen, LLVM would just automatically do these SIMD things without needing to be explicitly told, right?
@@davidr2421 errr… yes and no. It really depends. Moving data to SIMD-registers and back has overhead compared to regular single instruction, single data (SISD) code that we normally program. Some SISD algorithm implementations in assembly can just be faster than the equivalent SIMD algorithms depending on the instruction set. Also, heat. All instructions generate some amounth of heat, which will affect the clock speed. So while an algorithm might be faster on SIMD, on a given processor it might be slower due to the CPU thermal-throttling. This is/was extreme apparent on AVX-512 SIMD instructions on Intel. That’s also why GCC used to be or is hesitant about SIMD instructions. Furtheron, any consumer/server cpu of the latest 20 years can run instructions out-of-order and in paralel. This is extremely apparent on for-loops where for example the 6’th iteration of the loop is started before the first iteration is even finished! That’s parallelism you get for free. And this all changes drastically per specific model of CPU + RAM and sometimes even between different cooling solutions. So… benchmark before making any such statements. This is extremely difficult.
The final part where you say "Oh but there is that library that does everything for you for that case" just made me laugh my head off, great vid full of deep commentary, really great job!
I was thinking the same. The matrix multiplication was very effective. I remembered the first matrix row gets a dot product with 2nd matrix's first column, so I could see that it was lower left being the first matrix and upper right being the second (lower right obviously being the result).
That's awesome! I've been studying parallel programming and a lot of these strategies i had no idea were possible. I wish my university had courses on HPC/Parallel programming like yours. This course you mentioned at the end of the video seems great.
It reminds me of when Matt Parker made a unique wordle combination finding algorithm that took a month to complete, and then his viewers managed to optimize it so it only takes a few ms.
@@okie9025this is so not the same, both algorithms here are o(n^3), but with constant optimizations you reduce the runtime significantly in matt parkers video he basically wrote the most naive bruteforcey approach
@@okie9025Matt Parker *isn’t a programmer* so he implemented a naive solution in a dog-SLOW language Python. It took me a one day to write a C solution; this took 20 mins to run. The next morning I optimized it down to 20 seconds. In the evening I had multithreading working on my 24 core/48 thread Threadripper. It took ~2 seconds to find all 538 solutions. Understanding not only data access but calculations is paramount to getting high performance.
@@amirnuriev9092 Right, this is the kind of speedup that programmers with only python experience would assume to be impossible (multiple order of magnitude improvement despite having the same asymptotic complexity, using the same language, and not even using any stupid premature abstractions in the initial version that could slow things down)
I love it when videos like these give me inspiration to delve into a topic I'm not familiar with. I admire people like you for coming up with such elegant and beautiful ways to communicate these concepts.
Until now I only considered the efficiency of the algorithm for optimizing program execution. Although I admit I didn't understand everything in this video it demonstrates that knowledge of the computer's hardware and taking advantage of it can significantly speed up execution.
I did this exact thing in my intern project this summer at TI. Their processor even had a special instruction to do the matrix multiplication of two small blocks. I was able to achieve around 80% the rated capacity of the processor.
probably one of the best programming videos I have ever seen, as a more senior developer there is a lack of content on this level of production quality when explaining complex ideas
Very good. I haven't thought about optimization for a long time since I was doing Assembly in the 90's. This brings back memories and the feeling. Very good!
This is so beautiful I nearly cried, and at the end when you reviled that the library had a 1000x optimisation I died and came back to life a better person.
this is a well done video and explains the idea of leveraging hardware and machine code to optimize cache lookups super well. But I also want to shout out what may be the best animation of a matrix dot product I’ve ever seen. this feels like the first time I watched a video that got me to understand what monads are
Great video! But I feel like I will never get to do this kind of work during my job, since we use a scripting language, and all bets are off when you store everything on the heap.
While scripting langauge is a huge problem here memory isn't. Stack and heap are just different names of how memory is allocated by OS kernel. All those types of memory are still sitting on the same RAM chip, so there is no difference in access speed and locality of that memory to CPU cache. While OS will almost always store stack memory in CPU cache, every time you access heap address it pull to cache and couple of more addresses. It's not that CPU loads only one value you request, it's value you request + some block of data after this. If you have an array of numbers and you want element zero, CPU is loading not only that but propably 1k more elements after.
I could be wrong, but NVidia's linear algebra library does these types of optimizations to optimize utilization. The cublas library does a bunch of re-ordering and transposing by default.
Great videos, you certainly know what you are talking about and you can share it while keeping it interesting, keep it up, you deserve more subscribers
Transposition works really well if you can combine it as a result of a previous operation. For instance if the matrix that's about to be multiplied is the result of adding together two matrices, when you're adding the elements of the matrix, reverse the order so that the result is the transposition of actual answer.
Software engineers can have two mindsets - one of being the best and the other of being the worst. After watching this video, I quickly fell into the second. 10 out of 10
I have a real-world example that feels-to-me similar what is described here. It involves a Postgres install on a machine with a RAID-5 setup. Benchmarking showed peak ~15k TPS, for about 30-seconds, then the TPS would degrade to ~2k TPS, and widly fluctuate up and down, for about 30-seconds, and then resume at ~15k TPS for another 30-seconds. This behavior was steady over longer benchmarks (10 minutes). I used Postgres's table partitioning ability to solve this. First I used 10 partitions, which made no difference. Then I used 100 partitions, and while the peak throughput capped at ~13k TPS, it was steady for the entire 10-minute benchmark. Another, more human example, is writing the same sentence over and over on a chalkboard. Easier to write "I, I, I, I, will, will, will, will, never, never, never, never, chew, chew, chew, chew, gum, gum, gum, gum." Instead of "I will never chew gum", etc. It's all related.
I cant lie I do not program by any stretch of the imagination... but your videos are something else, the clean and clear explanations are just amazing.
We are not using any textbooks during the course, and the course is also not recorded. But the slides are published here: pages.cs.wisc.edu/~sifakis/courses/cs639-s20/
Its a great videof or optimizing a matrix multiplication from a purely technical/computing science standpoint. But i think improvement you still could make is to multiply the matrices with something like the Strassen algorithmus ( nowdays there even faster algorithm), but a matrix multiplication doesnt need to have a cubic ^3 runtime. You can actually do matrix multiplications in slightly less for example ^2.8, which should give significant improvements when multiplying big matrices
For those interested in how to get 380 size threshold (2:54) Arithmetic Intensity = # ops / # bytes, algorithm property, measures how many operations will be done per moved byte. ops:byte ratio = BW_math / BW_mem, hardware property Math limited: arithmetic intensity > ops:byte ratio Mem limited: arithmetic intensity < ops:byte ratio transition from memory to compute: Arithmetic Intensity = ops:byte ratio (2N^3) / (2 * 3*N^2) = (640 GFLOP/s) / (5.25 GFLOP/s) N = 364 # bytes = 2 * 3*N^2, assuming FP16, 2 bytes # ops = 2*N^3 Any ideas how author got 380?
Beautiful video. Understandable even for a EE grunt like me. The take home message at the end was a little disheartening though. AFAIK only Intel and Nvidia churn out hyper optimized libraries. Binding you to their tools. I'm not sure about the impact of using Python for ML but I guess all the nice libraries do something similar or use MKL directly (numpy does this with specific versions). Just a user of all the shiny tools to build new shiny tools.
Please tell me the next one will only be a month or two. I love learning this level of optimization and it is conveyed so well. I will openly admit I’m being selfish cause I don’t wanna wait XD, though I do understand if that can’t be the case, hardly got a lick of free time with my own courses as well
There is an error in how multiplication of the block matrices is animated. As the current active result block is being calculated the blocks of the input matrices should iterate through rows and columns of blocks just like with regular matrix matrix multiplication.
It would be cool if you would follow up on this using strassen-algorithm with z-ordering (morton ordering), and maybe even avx512 instructions. I think you would get pretty close for large matrices.
This lvl of perf is amazing! The depth of knowledge and understanding is stunning! Is it possible to do something like this in Typescript / React / Next / tRPC?
If you mean in JS, then realistically no. You can maybe do some of these optimizations using some of the newer arraybuffers or whatever they are called to use "real" arrays, but at the end of the day you can't guarantee that the runtime or interpreter won't mess something up. It just wouldn't be worth it, just use a library
If you're new to the field, which seems to be a fair assumption... I highly recommend checking out machine learning papers and their ridiculous names. A tiny selection: - We used Neural Networks to Detect Clickbaits: You won't believe what happened Next! - Learning to learn by gradient descent by gradient descent - BERT has a mouth, and it must speak - Gotta Learn Fast: A New Benchmark for Generalization in RL - Everything involving BERT and ERNIE and the full range of Sesame Street characters Yeah, there's lots of whimsical stuff going on in that particular space, and it often is pretty relevant to the actual topic at hand, too.
The other day I was watching a video that was implying that using less conditions makes the code faster, but then I ran some tests in assembly and saw that without the conditions, the code was longer
it looks like the guthub repo you linked doesn't contain all the code you showcased in the video, im trying to go through your code a bit more thoroughly and was wondering where I could find all the examples shown in the video.
Great video! Do you mind sharing the source for your motion canvas? I like the implementation of the graph animations. Just getting used to motion and was inspired how you used it. Thanks!
Currently, my repository contains source code for my next video so I can’t make it public. But I’m using my own custom fork of motion canvas and some effects in the video require those custom features. You can check it out if you want: github.com/fangjunzhou/motion-canvas
Please, if you mention MKL, do mention that this is a non-portable intel-only library ! It won't work at all on any non-x86 compatible architecture, such as arm, and has poor performance on AMD cpu. There are some open-source portable alternatives to MKL (blis, libflame). Don't trade portability for performance on a single CPU family !
Isnt there an algorithm that uses 7 computations instead of 8 for the 2x2 GEMM? Cant we exploit that to always work in 2x2 blocks? Also, sweet animations dude! What tools do you use for these? Is that just manim?
Maybe a factor in why MKL is 8x faster than your implementation is that it uses something like the Strassen algorithm when it recognizes that you are just doing matrix-matrix multiplication of matrices with real values?
@@dmitrykargin4060 I only said it might be a factor, a few recursions could be applied whilst still using the typical algorithm. Quoting from en.wikipedia.org/wiki/Strassen_algorithm "In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as 500x500". And yeah wikipedia is not necessarily always right, but looking at the paper cited for this statement, it does seem to support the claim.
@@Vaaaaadimit is not 3 times difference for this size. You need far larger matrices to get that. I bet it was avx512 with smart reductions to get that 3x boost
for 4x4 u can load whole values to avx256 registers, floats are 32 bit avx256 has 16x 128bit register, u can store 1 line matrix to single avx register, and these registers has mask flags to multiplicate specific part, no memory access, and because cpu do not have to wait result of multiplication if u do not use result in same cpu pipeline, u can multiplicate 4x4 matrix almost in 3-4 clock cycle. cpu math has big difference from real life
dude, this video is so beautifully animated. i can't wrap my head around how much time you spent on this.
yeah, cache you l8ter! 🤣
The animation is sweet, but do people realize that this is a native ultrawide vid? IMMEDIATE subscription for choosing by far the best aspect ratio... y'know, the one we collectively failed to adopt. Seriously, not opting for 21:9 is the big tech fumble of our time, it's wild how narrow and tight 16:9 looks these days if you have some UW experience.
@@minhuang8848 I like horizontal space. Give me a 32-40 inch 4:3. 4:3 is crazy good to work with.
Good ol’ Canva
@@minhuang8848 21:9 is one of the worst inventions of out time
I saw your video in the ThePrimeTime, and your video is epic. Very well explain for such high level topics.
I believe it's actually rather low level.
This video flew entirely above my head (still getting into this math in Uni haha), but the presentation is stunning, and I still watched through all of it because these optimizations are weirdly beautiful to me.
Awesome job!
u only need basic algebra bro..
I love how in depth this video makes me feel smart. I know that only a few people could make sense of such content like this. But, you make it feel like even more people can get close to it.
I've been teaching myself how to use SIMD for the past 3 weeks and I can't even tell you how helpful this video was. I was baffled when my serial implementation of some image processing code was 10x faster than my naive SIMD implementation. Took me quite a while to understand how that was possible. This video has made my greatly appreciate the simplicity of Rust's (experimental) portable SIMD library. Also, did not know what OpenMP was and it seems somewhat similar to Rust's library. Absolutely incredible video!
You can get better performance by unrolling the loops(do it aggresively, for example, unroll the whole loop by block size(in GCC, #pragma GCC unroll 8)). However, it would still be 3 times slower than intel mkl.
Why wouldnt the compiler pick that up? Make the compiler more aggressive???
@@axe863GCC in particular doesn't like unrolling loops and vectorizing them as much as LLVM does in my experience. Although they seem to have caught up a lot with the latest few versions.
@blipman17 Thanks for the info.
@@blipman17 Yeah this is some info I was looking for. Based on what I've seen, LLVM would just automatically do these SIMD things without needing to be explicitly told, right?
@@davidr2421 errr… yes and no. It really depends. Moving data to SIMD-registers and back has overhead compared to regular single instruction, single data (SISD) code that we normally program. Some SISD algorithm implementations in assembly can just be faster than the equivalent SIMD algorithms depending on the instruction set. Also, heat. All instructions generate some amounth of heat, which will affect the clock speed. So while an algorithm might be faster on SIMD, on a given processor it might be slower due to the CPU thermal-throttling. This is/was extreme apparent on AVX-512 SIMD instructions on Intel. That’s also why GCC used to be or is hesitant about SIMD instructions. Furtheron, any consumer/server cpu of the latest 20 years can run instructions out-of-order and in paralel. This is extremely apparent on for-loops where for example the 6’th iteration of the loop is started before the first iteration is even finished! That’s parallelism you get for free.
And this all changes drastically per specific model of CPU + RAM and sometimes even between different cooling solutions. So… benchmark before making any such statements. This is extremely difficult.
The final part where you say "Oh but there is that library that does everything for you for that case" just made me laugh my head off, great vid full of deep commentary, really great job!
great animations and visual style! you've found yourself a great side hobby :)
I was thinking the same. The matrix multiplication was very effective. I remembered the first matrix row gets a dot product with 2nd matrix's first column, so I could see that it was lower left being the first matrix and upper right being the second (lower right obviously being the result).
That's awesome! I've been studying parallel programming and a lot of these strategies i had no idea were possible. I wish my university had courses on HPC/Parallel programming like yours. This course you mentioned at the end of the video seems great.
The fact that optimizing for cache hits and enabling SIMD was able to bring a given matrix multiplication operation from 2000ms to 15ms is wild.
It reminds me of when Matt Parker made a unique wordle combination finding algorithm that took a month to complete, and then his viewers managed to optimize it so it only takes a few ms.
@@okie9025this is so not the same, both algorithms here are o(n^3), but with constant optimizations you reduce the runtime significantly
in matt parkers video he basically wrote the most naive bruteforcey approach
@@okie9025Matt Parker *isn’t a programmer* so he implemented a naive solution in a dog-SLOW language Python.
It took me a one day to write a C solution; this took 20 mins to run. The next morning I optimized it down to 20 seconds. In the evening I had multithreading working on my 24 core/48 thread Threadripper. It took ~2 seconds to find all 538 solutions.
Understanding not only data access but calculations is paramount to getting high performance.
@@amirnuriev9092 Right, this is the kind of speedup that programmers with only python experience would assume to be impossible (multiple order of magnitude improvement despite having the same asymptotic complexity, using the same language, and not even using any stupid premature abstractions in the initial version that could slow things down)
multithreading was the mvp
I love it when videos like these give me inspiration to delve into a topic I'm not familiar with. I admire people like you for coming up with such elegant and beautiful ways to communicate these concepts.
Subscribed. I can't believe you don't have more subscribers already. Any software engineer dealing with matrix math should watch this video.
Until now I only considered the efficiency of the algorithm for optimizing program execution. Although I admit I didn't understand everything in this video it demonstrates that knowledge of the computer's hardware and taking advantage of it can significantly speed up execution.
Very well done! I make a living optimizing BLAS routines, this will probably become my default “what do you do” to send people
I did this exact thing in my intern project this summer at TI. Their processor even had a special instruction to do the matrix multiplication of two small blocks. I was able to achieve around 80% the rated capacity of the processor.
probably one of the best programming videos I have ever seen, as a more senior developer there is a lack of content on this level of production quality when explaining complex ideas
Very good. I haven't thought about optimization for a long time since I was doing Assembly in the 90's. This brings back memories and the feeling. Very good!
This is so beautiful I nearly cried, and at the end when you reviled that the library had a 1000x optimisation I died and came back to life a better person.
GEMM is the perfect example to demonstrate these concepts. Wonderful video my friend. You earned a subscriber
I loved the presentation and the animations really helped me collect the concepts.
this is a well done video and explains the idea of leveraging hardware and machine code to optimize cache lookups super well. But I also want to shout out what may be the best animation of a matrix dot product I’ve ever seen. this feels like the first time I watched a video that got me to understand what monads are
The most captivating and insightful presentation I have ever witnessed.
Great video! But I feel like I will never get to do this kind of work during my job, since we use a scripting language, and all bets are off when you store everything on the heap.
While scripting langauge is a huge problem here memory isn't. Stack and heap are just different names of how memory is allocated by OS kernel. All those types of memory are still sitting on the same RAM chip, so there is no difference in access speed and locality of that memory to CPU cache. While OS will almost always store stack memory in CPU cache, every time you access heap address it pull to cache and couple of more addresses. It's not that CPU loads only one value you request, it's value you request + some block of data after this. If you have an array of numbers and you want element zero, CPU is loading not only that but propably 1k more elements after.
This seems really cool. Good work, you're on the right path.
love this stuff keep the great work !
Pure gold! Thanks for the video!
Cool videos. Interesting information and great graphics/animations. Thanks!
I could be wrong, but NVidia's linear algebra library does these types of optimizations to optimize utilization. The cublas library does a bunch of re-ordering and transposing by default.
Great videos, you certainly know what you are talking about and you can share it while keeping it interesting, keep it up, you deserve more subscribers
Transposition works really well if you can combine it as a result of a previous operation. For instance if the matrix that's about to be multiplied is the result of adding together two matrices, when you're adding the elements of the matrix, reverse the order so that the result is the transposition of actual answer.
Awesome video! Thank you so much for sharing this!
Software engineers can have two mindsets - one of being the best and the other of being the worst. After watching this video, I quickly fell into the second. 10 out of 10
I have a real-world example that feels-to-me similar what is described here. It involves a Postgres install on a machine with a RAID-5 setup. Benchmarking showed peak ~15k TPS, for about 30-seconds, then the TPS would degrade to ~2k TPS, and widly fluctuate up and down, for about 30-seconds, and then resume at ~15k TPS for another 30-seconds. This behavior was steady over longer benchmarks (10 minutes).
I used Postgres's table partitioning ability to solve this. First I used 10 partitions, which made no difference. Then I used 100 partitions, and while the peak throughput capped at ~13k TPS, it was steady for the entire 10-minute benchmark.
Another, more human example, is writing the same sentence over and over on a chalkboard. Easier to write "I, I, I, I, will, will, will, will, never, never, never, never, chew, chew, chew, chew, gum, gum, gum, gum." Instead of "I will never chew gum", etc.
It's all related.
Visualizations were freaking amazing! Loved those! could you make a video of your process for editing these videos?
I cant lie I do not program by any stretch of the imagination... but your videos are something else, the clean and clear explanations are just amazing.
hats off to the effort and thanks to @primetime for showing us this gem
This has so fucking deeply re-ignited my passion for computer science. Gosh I am on fire right now
anybody got any online course/book similar to the course DepthBuffer mentions he took at university for highly optimized code?
I'm also interested in that :)
same @@simonl1938
also interested :D
We are not using any textbooks during the course, and the course is also not recorded. But the slides are published here: pages.cs.wisc.edu/~sifakis/courses/cs639-s20/
@@depth-buffer awesome! thank you
Its a great videof or optimizing a matrix multiplication from a purely technical/computing science standpoint. But i think improvement you still could make is to multiply the matrices with something like the Strassen algorithmus ( nowdays there even faster algorithm), but a matrix multiplication doesnt need to have a cubic ^3 runtime. You can actually do matrix multiplications in slightly less for example ^2.8, which should give significant improvements when multiplying big matrices
We need to go deeper!
Just seen ThePrimeagen react to your last video lol
Interesting subject, well presented, thank you.
Once you reminded me of the context of the last video, I believe i know where this is going
This is so well explained and the animation are SO good!
Also it would be interesting to see how clang performs compared to gcc.
Great work and its true optimizing the code by splitting up commands in assembler is a pain. saw videos on that too.
Love the video,
Where have you been mate until now?😂
there's also computational optimization similar to karatsuba multiplication but for matrices. probably proprietary lib uses it
kinda wonder why our university barely have such lectures, amazing video
DSA pro: You get 1900ms, take it or leave it
Guy with an intel chil and an ASM course: Hold my cache
For those interested in how to get 380 size threshold (2:54)
Arithmetic Intensity = # ops / # bytes, algorithm property, measures how many operations will be done per moved byte.
ops:byte ratio = BW_math / BW_mem, hardware property
Math limited: arithmetic intensity > ops:byte ratio
Mem limited: arithmetic intensity < ops:byte ratio
transition from memory to compute: Arithmetic Intensity = ops:byte ratio
(2N^3) / (2 * 3*N^2) = (640 GFLOP/s) / (5.25 GFLOP/s)
N = 364
# bytes = 2 * 3*N^2, assuming FP16, 2 bytes
# ops = 2*N^3
Any ideas how author got 380?
This was a great watch, thanks you.
That library you used, MKL, can you make a video about it? It sparked my interest
Incredible video. Feel like I am getting a PhD in Optimization just by watching and I haven’t really coded for 20ys 😂
Beautiful video. Understandable even for a EE grunt like me. The take home message at the end was a little disheartening though. AFAIK only Intel and Nvidia churn out hyper optimized libraries. Binding you to their tools. I'm not sure about the impact of using Python for ML but I guess all the nice libraries do something similar or use MKL directly (numpy does this with specific versions).
Just a user of all the shiny tools to build new shiny tools.
just bruh, what's this video, why you have only 8k sub, why the animation are so precise, why I'm here
Please tell me the next one will only be a month or two. I love learning this level of optimization and it is conveyed so well. I will openly admit I’m being selfish cause I don’t wanna wait XD, though I do understand if that can’t be the case, hardly got a lick of free time with my own courses as well
There is an error in how multiplication of the block matrices is animated. As the current active result block is being calculated the blocks of the input matrices should iterate through rows and columns of blocks just like with regular matrix matrix multiplication.
Maybe Intel used AVX vector intrinsics under the hood if your CPU supports it and that leads to perf difference.
It would be cool if you would follow up on this using strassen-algorithm with z-ordering (morton ordering), and maybe even avx512 instructions. I think you would get pretty close for large matrices.
This lvl of perf is amazing! The depth of knowledge and understanding is stunning! Is it possible to do something like this in Typescript / React / Next / tRPC?
If you mean in JS, then realistically no. You can maybe do some of these optimizations using some of the newer arraybuffers or whatever they are called to use "real" arrays, but at the end of the day you can't guarantee that the runtime or interpreter won't mess something up. It just wouldn't be worth it, just use a library
This is the first time I've ever heard the term 'giga-flops', and I quite like it.
If you're new to the field, which seems to be a fair assumption... I highly recommend checking out machine learning papers and their ridiculous names. A tiny selection:
- We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!
- Learning to learn by gradient descent by gradient descent
- BERT has a mouth, and it must speak
- Gotta Learn Fast: A New Benchmark for Generalization in RL
- Everything involving BERT and ERNIE and the full range of Sesame Street characters
Yeah, there's lots of whimsical stuff going on in that particular space, and it often is pretty relevant to the actual topic at hand, too.
Impressive! Never heard of these optimizations druing my computer engineering grade
man, thank you for work, is amazing.
Memory access now is the bottleneck for CPU
I love this. As a Snr.Sweng, looking to learn more about these god tier optimisations, where can I start? 😮😮
Always Nester? This seems like a direct attack against the Never Nester!
The other day I was watching a video that was implying that using less conditions makes the code faster, but then I ran some tests in assembly and saw that without the conditions, the code was longer
that would depend entirely on what you are doing
Nice video.
I love Cities XL music too. ;)
Came here from primeagen! :D
Great video!
Beautiful!
best visualizations i've seen gj
really good video!
this dude's channel bout to get huge now
very nicely done
man what a great video
it looks like the guthub repo you linked doesn't contain all the code you showcased in the video, im trying to go through your code a bit more thoroughly and was wondering where I could find all the examples shown in the video.
Imagine how much underlying low level mechanism was hidden from programmers.
They aren’t supposed to know if they don’t have to optimize to the limit.
@depthBuffer can you share the book or materials that you have read to understand this topic ?
btw, blazing beautiful video
Beautiful.
Great video! Do you mind sharing the source for your motion canvas? I like the implementation of the graph animations. Just getting used to motion and was inspired how you used it. Thanks!
Currently, my repository contains source code for my next video so I can’t make it public. But I’m using my own custom fork of motion canvas and some effects in the video require those custom features. You can check it out if you want: github.com/fangjunzhou/motion-canvas
Please, if you mention MKL, do mention that this is a non-portable intel-only library !
It won't work at all on any non-x86 compatible architecture, such as arm, and has poor performance on AMD cpu.
There are some open-source portable alternatives to MKL (blis, libflame).
Don't trade portability for performance on a single CPU family !
just amazing
Great video!
Isnt there an algorithm that uses 7 computations instead of 8 for the 2x2 GEMM? Cant we exploit that to always work in 2x2 blocks?
Also, sweet animations dude! What tools do you use for these? Is that just manim?
I’m using motion canvas
Maybe a factor in why MKL is 8x faster than your implementation is that it uses something like the Strassen algorithm
when it recognizes that you are just doing matrix-matrix multiplication of matrices with real values?
You need larger size for strassen to bring such advantage
@@dmitrykargin4060 I only said it might be a factor, a few recursions could be applied whilst still using the typical algorithm.
Quoting from en.wikipedia.org/wiki/Strassen_algorithm "In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as 500x500". And yeah wikipedia is not necessarily always right, but looking at the paper cited for this statement, it does seem to support the claim.
@@Vaaaaadimit is not 3 times difference for this size. You need far larger matrices to get that. I bet it was avx512 with smart reductions to get that 3x boost
No. Strassen accumulates errors.
great video
Shoutouts to CinemaScope. :3
hit that cache like it owes you money
Genius
Amazing.
Great!
6:24 That's like constant/static recursion, which is much faster than dynamic recursion (specially if there's no TCO).
11:16 Can NUMA help with this?
OMFG I love these videos
Motion canvas!
would it even be possible in higher level languages as javascript and python?
for 4x4 u can load whole values to avx256 registers, floats are 32 bit avx256 has 16x 128bit register, u can store 1 line matrix to single avx register, and these registers has mask flags to multiplicate specific part, no memory access, and because cpu do not have to wait result of multiplication if u do not use result in same cpu pipeline, u can multiplicate 4x4 matrix almost in 3-4 clock cycle. cpu math has big difference from real life
Amazing
i just subed this channel
What's your IDE at 13:55 ?
Anybody knows how can i make animation like this video? Which app should i use?