120x Faster Algorithm By Nested Loops

  • Published Sep 27, 2024

Comments • 347

  • @MichaelAbramo
    @MichaelAbramo 11 months ago +1146

    It takes a true legend to make a 15 minute demo on optimization that he probably spent a month making and then at the end say "but instead you can just use this library that's 8x faster"

    • @khhnator
      @khhnator 11 months ago +53

      But only if your problem is linear algebra... the thing is, you will always be fighting cache performance issues.
      It's really a mine of free performance gains if you can optimize the hotspots of your code.

    • @chrism3790
      @chrism3790 11 months ago +52

      Well, those libraries were created by someone who went through the optimization process in a much more thorough way than presented here, and that is interesting in and of itself.

    • @lesscommonsense1804
      @lesscommonsense1804 11 months ago +10

      True legends give the sauces.

    • @pithlyx
      @pithlyx 10 months ago +3

      @@chrism3790 Yea, but optimizing for your specific use case will be more performant than a catch-all solution in most situations. Additionally, I think the video is also telling a story that can be applied to many sectors of optimization.

    • @chrism3790
      @chrism3790 10 months ago +8

      @@pithlyx
      I disagree with that. Most of the time, people overestimate their ability to hard core optimize things. It takes deep knowledge to truly extract every ounce of performance from your hardware. If you're working on well known problems, trying to do these things yourself is a waste of time that will produce suboptimal results, almost guaranteed.
      Ask yourself, would you be able to optimize matrix multiplication yourself, to the degree that it was done in this video? And remember, even the author admits there is more under the hood of the referenced implementation.

  • @christophervsilvas
    @christophervsilvas 11 months ago +552

    I’m doing full stack coming from a game development background. As a game developer I had to deal with this stuff all the time and so when I got my first full stack job I wrote everything like a high performance C++ algorithm even though it was just a typescript nodejs server. At first I thought I was always just a bad programmer and didn’t even know it but now I am realizing good code means different things for different contexts. In game dev it means minimal operations to achieve best result. In web it means minimal complexity and maximum maintainability by the broadest set of programmers. It’s almost like you have to program for the “worst” programmer in the “room”.

    • @boastfultoast
      @boastfultoast 11 months ago +9

      If you don't mind me asking - how did you swap from game dev to full stack? What steps did you take to learn/fill in gaps for full stack?

    • @christophervsilvas
      @christophervsilvas 11 months ago

      @@boastfultoast Hey, the full story is a little complicated (I didn't exactly "swap", it's more like I got a good job from someone who didn't care what I knew, they just knew I could make what they need happen). Anyway, to answer your question I would really just say have a problem that you are interested in and solve it! Build something!
      I didn't really have the luxury of learning fullstack in the abstract. I started by creating a Shopify app with a Typescript NestJS backend, and a NextJS admin UI front end. Redis for caching. MongoDB for db.
      You have to be okay with writing everything badly for a while so you see the problems and go look for a solution. I can't tell you how many times I have suffered through ugly programming experiences only to find a feature I didn't know about that makes everything so much easier. After two years of development I am currently on our third major refactor. Jump in with a simple project and be okay with sucking and not knowing how much you suck.
      Side note: if you come from a typed language I would really recommend learning Typescript. It was a huge help for me to get started. (If you don't come from a typed language I would still recommend Typescript :D).
      I just recently "relearned" how important it can be to have a project to work on in order to learn. I saw the Primeagen talking about Rust and decided to give it a try but couldn't really make heads or tails of it for a whole week... Luckily my boss told me to make a simple service for querying every zipcode in the USA against the AT&T and Verizon coverage maps. I started dissecting both AT&T and Verizon's APIs using Python as a quick mockup. Then I decided to try to create it in Rust. At first it was really really slow... but then it just started to fall into place and after 3 days I knew Rust pretty well.
      That all may or may not be helpful. If you have any more specific questions, feel free to ask!

    • @doc8527
      @doc8527 11 months ago +21

      I think you still got some wrong ideas from the video (not everything, but how you view web and game dev in terms of difference). The video is the opposite of what you are doing. For a given "specific size of dataset" or "specific data structure", sometimes simply dumping the nested array might run faster (node.js in your example - V8 is doing some cache magic behind the scenes), which is what a lot of web devs are doing even if they don't realize it. Many in-place optimizations can end up slow because requirements change constantly: your old in-place optimization only works for a specific size of data, and then the stupid slow nested solution outperforms yours.
      The web is quite dynamic, so premature optimization is pretty bad there. If you are working on a local offline game, rather than a game server whose traffic changes drastically all the time, then your in-place optimization will be best because you know exactly what you will get. For web server development it's often hard to predict the incoming or outgoing traffic (as the db grows), and the stupidest slow solution can have the maximum reliability (regardless of size and structure) and flexibility if you need to change it later.
      The video is about how optimization doesn't work uniformly - size-wise, structure-wise - some of it works best only for a given range or a given structure.
      I have the opposite experience from the other side: I actually want to try game development coming from fullstack (well, traditional software development). I constantly see game devs over-optimizing without knowing the reason, producing unreadable and slow code, while being pretentious about being the smartest, like those leetcode obsessives. Many lack computer science knowledge, but hey, game devs are really passionate (more than traditional software developers by a margin), and I really like that they build creative things. In the end it's case by case, person by person; I wouldn't generalize my experience to the entire field, whether game or web development, or who is smart or not - it depends on the specific area/problem you are working on.

    • @BeefIngot
      @BeefIngot 11 months ago

      I want to know this too. Not because I think it's impossible at all, just curious. I would imagine for a motivated individual it just takes building your own zero-user project/service to a reasonable enough quality to feel confident you know what is going on, front to back @@boastfultoast

    • @martinvuyk5326
      @martinvuyk5326 11 months ago +8

      Yeah, I also started on a more C-performance track then fell into web dev. It's absolutely about maintainability. But there are corners here and there where I still get a kick out of optimization, especially now that I'm doing more data engineering on the DB side. I had to reimplement something with numpy the other day and it was not really as readable as before, but I'll take 22 seconds over an hour every day lol

  • @DexterMorgan
    @DexterMorgan 11 months ago +87

    I'm a prewatch viewer. I watch it once, write down notes, then watch it again and pretend I know the answers while crossing my arms and nodding my head with a smug look on my face. Classic.

    • @ThePrimeTimeagen
      @ThePrimeTimeagen  11 months ago +43

      That is the way to do it

    • @YDV669
      @YDV669 7 months ago +2

      A fellow after my heart. I do that with PVR. Once the thing watches the movie I told it to record, I feel I no longer have to and can get on with my day without too much waste of time.

  • @CattoFace
    @CattoFace 11 months ago +94

    16:10 "Transposing the walk" works without hurting performance only when it doesn't cause cache misses. Earlier in the video he had not accounted for cache size yet, so it caused misses.
    But now the block is in cache, so the transposition doesn't cause extra misses.
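
    A minimal C++ sketch of that blocking idea (an illustration, not the video's exact code): the loops are tiled so each block of B is walked while it is still hot in cache.

        #include <cstddef>

        constexpr std::size_t BLOCK = 64;  // tile edge, tuned to cache size

        // C += A * B for n x n row-major matrices, processed tile by tile.
        void matmul_blocked(const float* A, const float* B, float* C, std::size_t n) {
            for (std::size_t ii = 0; ii < n; ii += BLOCK)
                for (std::size_t kk = 0; kk < n; kk += BLOCK)
                    for (std::size_t jj = 0; jj < n; jj += BLOCK)
                        for (std::size_t i = ii; i < ii + BLOCK && i < n; ++i)
                            for (std::size_t k = kk; k < kk + BLOCK && k < n; ++k) {
                                float a = A[i * n + k];  // reused across the whole j loop
                                // i-k-j order: the innermost walk is contiguous in B and C
                                for (std::size_t j = jj; j < jj + BLOCK && j < n; ++j)
                                    C[i * n + j] += a * B[k * n + j];
                            }
        }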

  • @drooplug
    @drooplug 11 months ago +35

    Matt Parker wrote a python script that he admitted wasn't very good, but it got the job done. It took something like 30 days to complete on his laptop. He shared it with his viewers and a number of people rewrote it to run faster. I think the first rewrite reduced the run time to a few hours, still in python. Eventually, someone submitted code in another language that got it down to something like 2 ms.

    • @GabrielSoldani
      @GabrielSoldani 11 months ago +12

      That’s pretty different, though. Parker used an inefficient algorithm. The community first found more efficient algorithms and then optimized it for cache hits and parallel execution. This video uses the same algorithm from start to finish.

    • @Z3rgatul
      @Z3rgatul 7 months ago +5

      @@GabrielSoldani are you sure the library doesn't use Strassen's algorithm?

    • @Hamstray
      @Hamstray 13 days ago +2

      @@GabrielSoldani yes, and he used an inefficient language.

    • @juniper.318
      @juniper.318 6 days ago

      pretty sure the newest rewrite is down to like, under a millisecond

  • @notapplicable7292
    @notapplicable7292 11 months ago +255

    Cache optimization has been really eye opening for me over the last year.

    • @petrkiyashko4248
      @petrkiyashko4248 11 months ago +14

      Same here, and also branch prediction and cpu pipelining lately

    • @alexhiatt3374
      @alexhiatt3374 11 months ago +19

      same! I think people focus too much on "heap vs. stack" instead of "cached vs. uncached". the stack is way faster because it's basically always in cache, but heap usage doesn't necessarily need to be slow.

    • @astronemir
      @astronemir 11 months ago +1

      It's also all wasted effort whenever the cache layout, cache size, or CPU instructions change.

    • @yegorzakharov8514
      @yegorzakharov8514 11 months ago

      Cache me daddy outside

    • @TheSnowMan2023
      @TheSnowMan2023 11 months ago

      @@Jim-gyswry i think it depends on the problem you are trying to solve and the framework/languages and packages/libraries you are using.

  • @m4rt_
    @m4rt_ 11 months ago +89

    16:40 if you are doing game programming and know what size the matrices will be, for example 4x4, you can just hardcode the multiplication.
    Since the size is fixed you don't need General Matrix Multiplication (GEMM), which is the algorithm he is optimizing.
    With GEMM you don't optimize for a specific size; you make something that can handle all sizes, but then it may not have the best performance.
    When you know that the size will always be 4x4 you can do it the most optimal way.

    • @heavymetalmixer91
      @heavymetalmixer91 11 months ago +4

      Asking from a novice's perspective: Can you put an example in C++ here of how you would hardcode a 4x4 optimized multiplication?

    • @theairaccumulator7144
      @theairaccumulator7144 11 months ago

      @@heavymetalmixer91 use a library like glm

    • @Ulchie
      @Ulchie 11 months ago +25

      @@heavymetalmixer91 My assumption is that you just manually unroll any loops and have every multiplication laid out, probably using SIMD intrinsics to really optimize it.
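
      A minimal sketch of that approach in C++ with SSE intrinsics (an illustration under assumed column-major 4x4 storage, not anyone's production code): each result column is a linear combination of a's columns, so the loop body is fully unrolled multiplies and adds.

          #include <xmmintrin.h>  // SSE

          // r = a * b for 4x4 column-major float matrices.
          void mat4_mul(const float a[16], const float b[16], float r[16]) {
              __m128 a0 = _mm_loadu_ps(a + 0);   // a's four columns
              __m128 a1 = _mm_loadu_ps(a + 4);
              __m128 a2 = _mm_loadu_ps(a + 8);
              __m128 a3 = _mm_loadu_ps(a + 12);
              for (int c = 0; c < 4; ++c) {      // one result column per iteration
                  __m128 col = _mm_mul_ps(a0, _mm_set1_ps(b[4 * c + 0]));
                  col = _mm_add_ps(col, _mm_mul_ps(a1, _mm_set1_ps(b[4 * c + 1])));
                  col = _mm_add_ps(col, _mm_mul_ps(a2, _mm_set1_ps(b[4 * c + 2])));
                  col = _mm_add_ps(col, _mm_mul_ps(a3, _mm_set1_ps(b[4 * c + 3])));
                  _mm_storeu_ps(r + 4 * c, col);
              }
          }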

    • @m4rt_
      @m4rt_ 11 months ago

      @@heavymetalmixer91 I don't have a C++ version handy, but I have a Jai version. It should be relatively readable. (btw --- means that it is uninitialized).
      multiply :: (m: Matrix4, n: Matrix4) -> Matrix4 {
          result: Matrix4 = ---;
          // result._rc = (row r of m) dot (column c of n)
          result._11 = m._11*n._11 + m._12*n._21 + m._13*n._31 + m._14*n._41;
          result._21 = m._21*n._11 + m._22*n._21 + m._23*n._31 + m._24*n._41;
          result._31 = m._31*n._11 + m._32*n._21 + m._33*n._31 + m._34*n._41;
          result._41 = m._41*n._11 + m._42*n._21 + m._43*n._31 + m._44*n._41;
          result._12 = m._11*n._12 + m._12*n._22 + m._13*n._32 + m._14*n._42;
          result._22 = m._21*n._12 + m._22*n._22 + m._23*n._32 + m._24*n._42;
          result._32 = m._31*n._12 + m._32*n._22 + m._33*n._32 + m._34*n._42;
          result._42 = m._41*n._12 + m._42*n._22 + m._43*n._32 + m._44*n._42;
          result._13 = m._11*n._13 + m._12*n._23 + m._13*n._33 + m._14*n._43;
          result._23 = m._21*n._13 + m._22*n._23 + m._23*n._33 + m._24*n._43;
          result._33 = m._31*n._13 + m._32*n._23 + m._33*n._33 + m._34*n._43;
          result._43 = m._41*n._13 + m._42*n._23 + m._43*n._33 + m._44*n._43;
          result._14 = m._11*n._14 + m._12*n._24 + m._13*n._34 + m._14*n._44;
          result._24 = m._21*n._14 + m._22*n._24 + m._23*n._34 + m._24*n._44;
          result._34 = m._31*n._14 + m._32*n._24 + m._33*n._34 + m._34*n._44;
          result._44 = m._41*n._14 + m._42*n._24 + m._43*n._34 + m._44*n._44;
          return result;
      }
      and here is a simplified version of the struct Matrix4
      Matrix4 :: struct {
          _11, _12, _13, _14 : float;
          _21, _22, _23, _24 : float;
          _31, _32, _33, _34 : float;
          _41, _42, _43, _44 : float;
      }

    • @oblivion_2852
      @oblivion_2852 11 months ago

      I have a strong suspicion that the SIMD registers being wider than the data, with the added latency, would actually make this slower than just raw scalar operations on the memory addresses.

  • @viniciuspjardim
    @viniciuspjardim 11 months ago +20

    In my understanding he only stopped transposing the matrix at the end because of the buffers. Without the buffers, a transposed copy of the matrix leads to more cache hits. With the buffer we already have a copy, so he fuses the two steps into one - transposing into the buffer - and it's still a transposed copy when accessing the buffer memory.
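
    A small C++ sketch of that packing idea (an assumption of how it might look, not the video's code): copy a block of B into a contiguous buffer, transposing on the way in, so the kernel then reads both operands sequentially.

        #include <cstddef>

        // Copy a rows x cols block of the n x n row-major matrix B,
        // starting at (row0, col0), into buf in column-major order.
        void pack_transposed(const float* B, float* buf, std::size_t n,
                             std::size_t row0, std::size_t rows,
                             std::size_t col0, std::size_t cols) {
            for (std::size_t j = 0; j < cols; ++j)      // one B column at a time
                for (std::size_t i = 0; i < rows; ++i)  // becomes contiguous in buf
                    buf[j * rows + i] = B[(row0 + i) * n + (col0 + j)];
        }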

    • @gregorymorse8423
      @gregorymorse8423 7 months ago

      Transpose load from cache vs transpose load from RAM. I thought the same.

  • @emanuelfarauanu1760
    @emanuelfarauanu1760 11 months ago +12

    One of my tasks during an internship way back in 2018 was to do such optimisations in C++ on a Raspberry PI, I learnt so much from that. Really cool algs.

  • @PiratePawsLive
    @PiratePawsLive 11 months ago +17

    This was awesome. I knew quite a bit about how memory and CPUs work, but the optimization was a thing of beauty, even if the code looks like an aesthetic war crime xD. Love it.
    Now I'm interested in learning more about optimizing again :).

    • @aerocodes
      @aerocodes 11 months ago

      the code aesthetics are really not good. I'm downvoting it

  • @m.sierra5258
    @m.sierra5258 9 months ago +1

    I optimized this very problem in a university competition, and won the competition. One thing they missed in this video: line stride. Not every memory address can be put into every cache line. In most cases, something like modulo 256 is applied to the address to decide which cache bucket an address maps into. Sadly, a matrix width of 1024 means that every item of a column gets put into the same cache bucket. So for us it increased throughput substantially to apply a padding of one cache entry to every matrix row, so that columns can be cached efficiently.
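
    A minimal C++ illustration of that padding trick (sizes are assumptions): give each row a little extra stride so that walking a column does not keep hitting the same cache set.

        #include <cstddef>
        #include <vector>

        constexpr std::size_t N   = 1024;  // power-of-two width: the pathological case
        constexpr std::size_t PAD = 16;    // one 64-byte cache line of floats

        std::vector<float> matrix(N * (N + PAD));  // padded allocation

        // Element (r, c) with a padded row stride, so consecutive rows of a
        // column land in different cache sets.
        inline float& at(std::vector<float>& m, std::size_t r, std::size_t c) {
            return m[r * (N + PAD) + c];
        }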

  • @resresres1
    @resresres1 11 months ago +11

    Modern programming at its finest: spend forever optimizing code, then find out there's a library for that and you wasted your time.

    • @mort44444
      @mort44444 a month ago +4

      Going thru the pain & learning is never a waste of time

  • @Kotfluegel
    @Kotfluegel 11 months ago +5

    For small matrices up to four dimensions, you usually employ specific types for each dimension, because then you can literally unroll all the loops. So, no need to use block matrices for graphics applications.

  • @PhilfreezeCH
    @PhilfreezeCH 11 months ago +3

    Last year I helped optimize the math library for my university's RISC-V multi-core microprocessors (you can also buy commercial versions called GAP8 and GAP9).
    It was entirely this stuff, just with Fourier transforms and so on.
    We started with C code and by the end almost all inner loops were assembly. It was pretty fun honestly.

  • @manofacertainrage856
    @manofacertainrage856 11 months ago +4

    Wow. I haven't seen this stuff for a while ... or felt the terror of implementing the first few optimizations for a grade. That's after two parallel programming classes in my master's that included OpenMP and many matrix multiplication algorithms. Took a CUDA course a few years afterward and some of the access/transpose problems are still a problem, but the CUDA libs are the answer there too. Awesome video!

  • @ArkDEngal
    @ArkDEngal 11 months ago +3

    Loop ordering (locality, locality, locality) was so important that at least 3 of my college classes covered it.

  • @Hebruwu
    @Hebruwu 8 days ago

    I did something like that during HPC class. It was fun. If you really want to get into HPC, matrix multiplication is just such an amazing example to learn on.

  • @B_dev
    @B_dev 15 days ago

    i fukin love that he goes through all that just to end up with something that's 8x slower than just using a library lol

  • @suede__
    @suede__ 11 months ago +2

    That's a lot of computer science.

  • @zkreso
    @zkreso 8 months ago +1

    Can't wait to implement this optimization in the ERP software I write at work

  • @leschopinesns100
    @leschopinesns100 11 months ago +2

    I've literally implemented vanilla GEMM in parallelism course at uni, multiplying big ass matrices.

  • @theoballet1519
    @theoballet1519 10 months ago +1

    Another optimization not mentioned here: use the superscalar nature of our processors by creating a temp vector that stores the products of elements from the two lines. The multiplications will be performed in parallel, and the sums will be performed once the prior multiplications are done. (This results in more code and more loops, but more perf too, especially for long lines, since you won't get as many branch-prediction misses.)
    This is an optimization used by BLAS. Another option is using an INT representation, since modern CPUs have more integer functional units than float ones.
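
    One common way to realize this in C++ (a sketch, not BLAS code) is multiple independent accumulators: breaking the serial chain of additions lets the CPU's FP units overlap the work.

        #include <cstddef>

        float dot(const float* a, const float* b, std::size_t n) {
            float s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // four independent chains
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                s0 += a[i + 0] * b[i + 0];
                s1 += a[i + 1] * b[i + 1];
                s2 += a[i + 2] * b[i + 2];
                s3 += a[i + 3] * b[i + 3];
            }
            for (; i < n; ++i) s0 += a[i] * b[i];  // remainder
            return (s0 + s1) + (s2 + s3);
        }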

  • @Blubb3rbub
    @Blubb3rbub 11 months ago +1

    Matrix Multiplication becomes even more interesting if you reach the "I can't reasonably fit my matrix into memory"-size of matrices and need to employ MPI and distributed memory.

  • @ismaelvc3728
    @ismaelvc3728 11 days ago

    The transposition can return a view: transposed indexes over the same data.

  • @filiformis
    @filiformis 11 months ago +1

    I encountered this stuff in my Computer Architecture class in college, but I haven't had the opportunity to use this knowledge yet.

  • @satibel
    @satibel 2 days ago

    Both the most optimized (and non-optimized*) code I wrote is a benchmark to see if what I wanted to do is even possible: collision detection over thousands of entities.
    The position is stored in 3 different ways: one normal way, and as the left side and right side of a SIMD less-or-equal comparison, so comparing a collision is 1 instruction.
    * Quad-tree space partitioning is way more efficient, as you don't scale in pyramidal numbers, but the same algorithm can be used on top of QSP (and it's the same in the worst-case scenario where all collisions are inside the smallest cell of your QSP), so it doesn't affect the result. You can also do the calculation on the GPU directly.
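
    A sketch of how such a one-instruction overlap test might look in C++ with SSE (an assumed layout, not the commenter's actual code): pack both sides of the less-or-equal test so a single compare covers both axes of an AABB overlap.

        #include <xmmintrin.h>

        // 2D AABB overlap: a and b overlap iff, per axis,
        // a.min <= b.max and b.min <= a.max.
        //   lhs = (a.minx, a.miny, b.minx, b.miny)
        //   rhs = (b.maxx, b.maxy, a.maxx, a.maxy)
        bool overlaps(__m128 lhs, __m128 rhs) {
            __m128 le = _mm_cmple_ps(lhs, rhs);  // one SIMD comparison
            return _mm_movemask_ps(le) == 0xF;   // all four lanes true
        }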

  • @thapr0digy
    @thapr0digy 11 months ago

    8:43 Prime realizes that this guy's maths is blowing his mind.

  • @besknighter
    @besknighter 11 months ago +1

    In games, we rarely use huge matrices, if ever. However, depending on the game, the number of matrix multiplications itself is so so soooo big that it might be useful to optimize them this much.

  • @BleachWizz
    @BleachWizz 8 months ago

    5:13 - I think the idea is really to transpose the matrix in memory: we want the columns to be close together, so that when we read the first block into cache for the first multiplication, all the subsequent numbers needed for the next multiplications are brought into cache with it. Otherwise the indices would be spaced at intervals of the matrix dimension and you'd bring in far fewer useful numbers at a time, up to rereading memory on every multiplication.
    16:17 - The problem with changing the walk is that you have to check whether you CAN do that. In this case localA and localB are both brought completely into the cache, because now you're calculating different parts of the matrix at a time, so the result needs to fit completely in the cache as well, or THIS PART will start causing problems.

  • @oliverer3
    @oliverer3 10 months ago +1

    This is the stuff that makes me question if I can even call myself a programmer. Any optimization I've ever had to do has been very basic stuff. I can't even begin to wrap my head around this.
    Then again I've never really done any compute-heavy programming.

  • @swannie1503
    @swannie1503 11 months ago +1

    I've had to do quite a bit of cache optimization. Rust is very good for it actually. I enjoy the problems. Similarity matrices with hundreds of thousands of members' features - it makes a huge difference.

  • @Peregringlk
    @Peregringlk 6 months ago

    7:24 If you don't care about the order of elements of the array, you can remove an element at any position by swapping it with the last value and then pop. That way you don't need to "move" things around.
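
    A minimal C++ version of that swap-remove trick:

        #include <cstddef>
        #include <utility>
        #include <vector>

        // O(1) removal when order doesn't matter: overwrite slot i with
        // the last element, then shrink by one.
        template <typename T>
        void swap_remove(std::vector<T>& v, std::size_t i) {
            v[i] = std::move(v.back());
            v.pop_back();
        }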

  • @lameshithead
    @lameshithead 6 months ago

    11:47 this is called a native array, since it is fixed in memory and you can't push

  • @josefpharma4714
    @josefpharma4714 8 months ago

    Transposing the memory layout of a matrix can improve cache hits a lot.

  • @zataritamods7499
    @zataritamods7499 9 months ago

    Transposing the matrix reminds me a lot of the concept of swizzling, and doing the blocks reminds me a lot of the tiling method used by the PS2. It feels like this is generalizing the concept for matrix multiplication.
    It also really helps me conceptualize why someone would do something like this.

  • @josephlabs
    @josephlabs 11 months ago +1

    That’s crazy 😂. I aspire to be doing this one day soon.

  • @notspm9157
    @notspm9157 16 days ago

    For larger matrices it's not optimal to actually do all the multiplications like that. Naive multiplication is O(n^3), and there are tricks (like Strassen's algorithm) to reduce the number of multiplications to roughly O(n^2.81) - can't remember exactly - which is likely where the other implementation benefits. It reduces the overall steps needed.

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 11 months ago

    8:35 This is the point where you can see in Primagen's eyes how his brain got fried by maths.

  • @alexcristache8830
    @alexcristache8830 9 months ago

    This is really cool. I had to do a very similar thing for one of my university courses called High Performance Computing: basically simulate fluids using Lattice Boltzmann on a cluster of the university's supercomputer. It was done in C, and I had to pay attention to all the optimizations he talks about here: memory access, cache, using OMP and so on. It was really cool, but extremely challenging. I guess this is used more in academia for different simulations like the one I mentioned.

  • @aerocodes
    @aerocodes 11 months ago +1

    This video made me feel dumb and I want to go and study linear algebra back again

  • @sanderbos4243
    @sanderbos4243 11 months ago +2

    7:03 If you have a Set of 10 items, rather than transforming it into an array and doing O(n) removal by shifting everything over, you can just make it O(1) using the "swap-remove" `arr[i] = arr[--len];` that swaps the last element into the position of the removed element. Note this presumes it's a Set, so the order doesn't matter, but you said it was a Set. :)

    • @YumekuiNeru
      @YumekuiNeru 11 months ago +1

      Was the point not that O-notation does not take into account how the data structure is stored in memory, so an array will probably be faster even if it naively says O(n) or whatever?

    • @Marius426
      @Marius426 11 months ago

      ... unless you want to keep it sorted for binary search access, then you do have to move the elements. But again, for small enough sets it doesn't matter there either.

    • @sanderbos4243
      @sanderbos4243 11 months ago

      @@YumekuiNeru Sure, but my single-line suggested approach of removing items from the array is an improvement of the approach prime suggested where you're moving way more items. Why shift over say 5 items when you can just swap. It does depend on it being a Set, but that was prime's hypothetical.

    • @sanderbos4243
      @sanderbos4243 11 months ago

      @@Marius426 If your elements are integers and you want to find an element in it, you can do it in O(1) as well rather than binary search's O(log N). You just need to maintain a second array that is the same length, where you index it with the item you're looking for and it tells you what index it's at. No hashing required. This of course doesn't really matter for performance when dealing with 10 integers, but with more integers it can. I hear you say "but my elements often aren't integers", but you can let those integers be indices into an array of the structs or whatever you're looking for. Think of them as entity IDs. This costs you another level of indirection, but on the other hand swapping integers is faster than swapping large structs. But this is admittedly going quite far, just wanted to bring up the possibility.
      "We can solve any problem by introducing an extra level of indirection."
      "…except for the problem of too many levels of indirection,"

  • @dominikmuller4477
    @dominikmuller4477 2 months ago

    there have been recent improvements to the matrix multiplication algorithm. Just like multiplication of scalars has some convoluted improvements, you can multiply some sums of sub-matrices and get away with vastly fewer multiplications.

  • @GRHmedia
    @GRHmedia 8 months ago

    The stated throughput of a CPU assumes the program runs entirely inside cache. If the program has to reach out to memory for new data, it is limited by memory bandwidth instead.

  • @Felix-ve9hs
    @Felix-ve9hs 11 months ago

    I barely know anything about algorithms, programming, math, or optimization, but it's pretty impressive how knowledge about hardware can improve performance :)

  • @7th_CAV_Trooper
    @7th_CAV_Trooper 7 months ago

    Two guys are having an argument over which compiled language is faster, and this guy walks up, transposes some matrices, lights the computer on fire, tells them it could be faster if he just walked the array backwards, and walks out of the room.

  • @Hamstray
    @Hamstray 13 days ago

    This kind of optimization is tailored exactly to specific hardware and may not work on other hardware. I guess that's why Nvidia goes on to recompile specific shaders for different games.

  • @hannessteffenhagen61
    @hannessteffenhagen61 11 months ago

    For small matrices the main thing is that parallelising the work for multiplying two matrices makes no sense anymore, the overhead of distributing the work would far exceed the actual work that needs to be done. Some of the other things mentioned in the video may still make sense though.

  • @Shenepoy
    @Shenepoy 4 months ago

    I once optimized python code from 50 min down to 5 min with some C-binding libraries and multithreading + multiprocessing on a 12-core/24-thread CPU.
    Felt great, the best feeling I ever had after that hot garbage worked. Later I threw it at a colleague to finish it.

  • @richardpowell1425
    @richardpowell1425 7 months ago

    The applications for large matrix multiplication include engineering and science. Important to get fast code.

  • @rotors_taker_0h
    @rotors_taker_0h 11 months ago

    Then you find yourself counting execution port limits, micro-op cache, branchless logic, whatever.
    The very next level of optimization (for those curious enough) is to separate the "logic" of transforming data (matrix multiplication or whatever) from the "schedule" (correct blocking, SIMD, multithreading, etc.) and let a compiler or special optimizer do it for you. And they can optimize _several_ operations in the best way. Google for stuff like Halide, Taichi, TVM, Tiramisu compiler, etc. This stuff can optimize for different hardware too (DSPs, GPUs, etc.). They build camera pipelines with that, neural networks, rendering, all kinds of stuff. Computers are crazy fast and crazy difficult to program.

  • @d_9696
    @d_9696 11 months ago +1

    This level of optimization is probably only needed for numerical applications and gaming

    • @bennymountain1
      @bennymountain1 11 months ago

      optimization and gaming in one sentence LMAO

  • @Hr1s7i
    @Hr1s7i 9 months ago

    3:40 It's the same relation as the one between moments and velocity in mechanics. The CPU can perform that many billions of floating point operations in a single second, while the RAM can only "process", i.e. store/cycle, so many in a single second. RAM is a capacitive, hard-wired reactive circuit, which means you have to wait for rather long transient processes to do their thing. Couple that with the fact that you spend part of the bandwidth on communication (the CPU's memory controller is the equivalent of a train station tossing trains at other stations, be it memory, GPU, or in some cases SSDs) and you might end up with seemingly low performance under certain conditions. A good example of how this weakness can surface is when you operate on enormous active data sets. In the meantime, your PC can look like an absolute monster when playing a video game, where once the data necessary for the simulation is loaded, very little of it changes on the logic side of things. As a matter of fact, most game behaviours are de facto hard-coded specifically to prevent overhead. Meanwhile, driving a PID sim in Matlab can easily employ 4-5 cores when done in a non-cheaty way.

  • @hydrobolix3365
    @hydrobolix3365 11 months ago

    And if you walk into Google with a nested O(n^2), you're fired

  • @mubin-ansari
    @mubin-ansari 11 months ago +1

    Prime without hoodie? This is going to be a great article

  • @TQuantP
    @TQuantP 11 months ago

    I had wanted to watch this video for a while now; it was just sitting there on a tab so old it was beginning to gather dust. Thankfully now I can! Thanks Prime 😂

  • @heck-r
    @heck-r 16 days ago

    Step 1: Spend a lot of time to do a crazy and very effective optimization
    Step 2: Forget to document it properly
    Step 3: Random normal colleague comes across it a month later: "Wow, this code is unreadable, matrix multiplication does not have to be this hard" * refactors it to the original slow version *

  • @jasonchen-alienroid
    @jasonchen-alienroid 9 days ago

    This type of optimization is often done in OS/BSP work. We do a ton of these in mobile OS development.

  • @jamestreleaven7302
    @jamestreleaven7302 17 days ago

    You should see the speed-ups using GPU processing with CUDA. It's crazy.

  • @rogierderuijter7009
    @rogierderuijter7009 11 months ago

    Never seen Prime this quiet. Amazing video

  • @forinda
    @forinda 11 months ago

    I have never seen @primeagen agree with almost everything like this. This guy is fluent in math

  • @BederikStorm
    @BederikStorm 11 months ago

    He didn't optimize the multiplication itself. It's like the first algorithm in Algorithms by Cormen: you don't need so many multiplications, you can use additions and subtractions as well. It decreases from N^3. That's why the library works better

  • @BrainySmurf77
    @BrainySmurf77 11 months ago

    I spent a large part of my PhD doing optimizations like this to my CFD codes so I could graduate faster.

  • @bourne_
    @bourne_ 8 months ago

    6:50 that's like using functional language vs object language. Not really, but kind of 😅

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 11 months ago

    Watched this video a while back. It’s really good.

  • @BeefIngot
    @BeefIngot 11 months ago +2

    Ok, so I don't ever use this word, but did anyone else feel like watching this video is the closest you can feel to being cucked as a programmer?
    Actually that can't be true. Imagine this guy comes through and optimizes a project you just finished - now that has gotta be the biggest source of performance anxiety.
    Shoot, that code is so slick it probably lowers the company's productivity from how anxious everyone else gets... also if anyone else ever has to use it (though I know this is the type of code you write once, test the Jesus out of, and no one ever questions or touches again unless absolutely necessary, and that's the greybeard's task).

  • @LewisMoten
    @LewisMoten 16 days ago

    The average programmer digs into it when performance becomes a problem whose cost the company can justify. Most clients are more interested in getting things working quickly rather than efficiently.

  • @TutoPanographic
    @TutoPanographic 11 months ago

    The optimization instinct in everyone's mind is to try to make it faster, smaller, more efficient, etc., while the real problem is to choose the right algorithm first. Thanks Donald.
    Also the video doesn't bother to mention that even if these low-level optimizations work, they probably compile differently on different machines.
    And every optimization is contextual, because every problem/need is different. And then there's reality: when the client code is not new, you know you'll witness schizophrenia...

  • @fabiano-co
    @fabiano-co 6 months ago

    "I'm pretty good at meth ok?" lol

  • @petermikus2363
    @petermikus2363 11 months ago

    5 for loops? Thats nothing. My end of the school year project had about 20.
    It was a text based adventure game demo and i put it together in 1 day.

  • @markuszeller_official
    @markuszeller_official 11 months ago

    Uncle Bob is going to die if he sees these nested loops.

  • @cblair1353
    @cblair1353 11 months ago

    Linear algebra is fucking magic.

  • @gjermundification
    @gjermundification 11 months ago

    3:25 He is budgeting bandwidth distribution. Memory? Loading data into memory - is the data already in memory? I guess this is where Apple gets some workstation benefit from the short distances of CPU/GPU/RAM on the same chip.

  • @diegocassinera
    @diegocassinera 11 months ago

    You need to move the data from memory to registers, do the operation in registers, and move the result back to memory.

  • @johanngambolputty5351
    @johanngambolputty5351 11 months ago

    Did I see a Manjaro terminal, hell yeah! (oh and something like astronvim too, respect)

  • @gustavcoetzee5018
    @gustavcoetzee5018 4 months ago

    Luke! I will have to rewatch this video to understand what MathMan is saying.

  • @mannycalavera121
    @mannycalavera121 11 months ago

    I'd just be happy to make the visualisations

  • @maervo4179
    @maervo4179 6 months ago

    If I saw such code in my production code, I would cry. It would take me ages to understand what's going on 😂

  • @lameshithead
    @lameshithead 6 months ago

    I just don't get it: if you can avoid redundancy you use an adjacency-matrix nested loop; if not, you either use matrix multiplication with native arrays, or you use tensors

  • @TraceursMonkey
    @TraceursMonkey 11 months ago

    if you know std::thesenuts it will be 120.1x faster

  • @deldrinov
    @deldrinov 11 months ago

    I'm a novice in programming at best, but I've encountered something similar once.
    I wanted to add attributes to each member of an array which themselves were in an array; pretty much extending a 2D array, if my lingo is correct.
    I did that in 2 nested for loops, and the initial code ran like crap. I swapped the for statements without any other changes and speed increased 40x.

    • @DFPercush
      @DFPercush 10 months ago

      There's a similar thing in image processing. It's usually better to loop for (y=...) on the outer loop and for (x=...) in the inner loop. That way you're accessing memory contiguously in a horizontal manner and not skipping several cache lines on every pixel.
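
      A tiny C++ example of that loop order (illustrative):

          #include <algorithm>

          // Row-major image: y outside, x inside, so consecutive accesses
          // touch consecutive bytes within the same cache lines.
          void brighten(unsigned char* px, int w, int h, int amount) {
              for (int y = 0; y < h; ++y)
                  for (int x = 0; x < w; ++x)
                      px[y * w + x] = (unsigned char)std::min(255, px[y * w + x] + amount);
          }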

  • @seancooper5007
    @seancooper5007 11 months ago +1

    That's some for loop nesting right there

  • @tomm.4447
    @tomm.4447 11 months ago

    I'm a web developer and those are certainly all words

  • @smanzoli
    @smanzoli 16 days ago

    It would not scale well on a small matrix (let's say smaller than 1000x1000) IF he was using CUDA (GPU) for parallel computation... it would be much faster yet on a 10k x 10k, but on a 128 x 128 it would be slower using CUDA than using the CPU alone.
    But yes, the point was: I'll add more nested loops AND it will be faster! Nice.

  • @rocstar3000
    @rocstar3000 11 months ago +1

    I saw this on stream and didn't understand shit, now seeing the video I still don't understand shit.

  • @kylek3822
    @kylek3822 9 months ago

    We don't just walk the matrix differently, as we may want to do operations on columns.
    en.wikipedia.org/wiki/In-place_matrix_transposition
    There are algorithms that allow us to store the elements of an MxN matrix and its transpose in less than 2MN. (This shouldn't be too surprising, as the diagonal is shared. Some algorithms actually need only a bounded constant above MN, while others are computationally expensive, have O(MN * ln(MN)) storage, but have better locality for row and column operations.)

  • @dipereira0123
    @dipereira0123 11 months ago

    Being honest, the day I finally reach the point of dealing with this level of problem is the day I'll be confident enough to apply to that Jr. SWE position (imposter syndrome ftw) 😆

  • @nocopyrightgameplaystockvi231
    @nocopyrightgameplaystockvi231 11 months ago

    Just going to use Matmul to escape this Armageddon

  • @7alexze
    @7alexze 11 months ago

    but that's N cubed, you silly goose; squared would be just 1 million

  • @nothingisreal6345
    @nothingisreal6345 9 months ago

    Guys. All this was solved years or decades ago. There are commercial libraries for these purposes. Nobody needs to write a matrix * matrix operator anymore these days. CRAP!

  • @CPSPD
    @CPSPD 11 months ago

    theyre called for loops because you unlock their full power when youre four layers deep

  • @QayLikeKay
    @QayLikeKay 4 months ago

    Even small cache lines are 32 bytes long, and a 4x4 float matrix is 64 bytes, so on a 64-byte-cache-line system each matrix *could* potentially sit in a single line. People in chat saying you're wrong forget that pretty much everything in computing is a trade-off, and many optimizations come at the cost of some fixed amount of setup time (or a setup time that scales slower than the normal case). As one of my professors once said, "Bubble sort beats small children" (english was not his first language), by which he meant that bubble sort beats algorithms like quicksort on small, child-sized data sets, since it can just go: it doesn't have to allocate extra buffers or do anything else. That said, "bubble sort beats small children" sticks in your brain way better.

  • @priyeshagarwal2644
    @priyeshagarwal2644 7 months ago

    Would Strassen's matrix multiplication optimize it even further? O(n^2.8)
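
    For reference, Strassen multiplies two 2x2 block matrices with 7 block products instead of 8, so the recurrence T(n) = 7T(n/2) + O(n^2) solves to O(n^(log2 7)) = O(n^2.807...), which is where the n^2.8 figure comes from. In practice it only pays off above some crossover size, and it is less cache- and SIMD-friendly than a tuned blocked GEMM.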

  • @0xdiane
    @0xdiane 11 months ago +1

    It would be really cool if LLVM could do this automatically, though that seems to be potentially impossible.

    • @asandax6
      @asandax6 11 months ago +3

      AI has entered the chat.

    • @mgord9518
      @mgord9518 11 months ago +3

      ​@@asandax6Would be interesting to see AI-assisted compilation for highly optimized release modes, but it's probably not gonna be a thing any time soon

    • @asandax6
      @asandax6 11 months ago

      @@mgord9518 Considering there's already AI that can handle code generation like ChatGPT and Copilot it won't be too long before we get AI for the compilers.

    • @unbreakablefootage
      @unbreakablefootage 10 months ago

      @@mgord9518 I have seen reports of LLMs doing optimizations without even being trained for it specifically

  • @omdevs
    @omdevs 11 months ago

    This is amazing!

  • @randomdamian
    @randomdamian 10 months ago

    I wish I could benchmark this on my PC... Just curious what a 7950X3D could do; the fact that it has more MB of L3 cache than GPUs had two decades ago is funny...

  • @Spiderboydk
    @Spiderboydk 8 months ago

    RIP never-nesters ;-)

  • @KamillaMirabelle
    @KamillaMirabelle 11 months ago

    We needed to do that in a homework assignment for a course 😅

  • @coolmanthecool603
    @coolmanthecool603 8 months ago

    Giga floppers

  • @Veptis
    @Veptis 11 months ago

    This seems like great engineering. Good to be able to simply import MKL.

  • @tiagoaoa
    @tiagoaoa 11 months ago

    I thought this was pretty basic stuff, since I teach it to undergrads.