Adding Nested Loops Makes this Algorithm 120x FASTER?

  • Published May 20, 2024
  • In the last video, I introduced the concepts of compute-bound and memory-bound tasks. This video goes a step further and uses the theory we discussed to optimize a famous memory-bound algorithm.
    Many of these tricks are counterintuitive but highly effective. By the end of the video, you'll find we can make the algorithm around 120x faster than the naive implementation.
    Join the Discord server: / discord
    My GitHub profile: github.com/fangjunzhou
    The source code used in this video: github.com/fangjunzhou/blas-p...
    Motion Canvas official repository: github.com/motion-canvas/moti...
    My custom fork of Motion Canvas: github.com/fangjunzhou/motion...
  • Science & Technology

Comments • 175

  • @bernardcrnkovic3769
    @bernardcrnkovic3769 7 months ago +442

    dude, this video is so beautifully animated. i can't wrap my head around how much time you spent on this.

    • @mhavock
      @mhavock 7 months ago +7

      yeah, cache you l8ter! 🤣

    • @minhuang8848
      @minhuang8848 7 months ago +6

      The animation is sweet, but do people realize that this is a native ultrawide vid? IMMEDIATE subscription for choosing by far the best aspect ratio... y'know, the one we collectively failed to adopt. Seriously, not opting for 21:9 is the big tech fumble of our time, it's wild how narrow and tight 16:9 looks these days if you have some UW experience.

    • @gordonlawrenz7635
      @gordonlawrenz7635 7 months ago +6

      @@minhuang8848 I like horizontal space. Give me a 32-40 inch 4:3. 4:3 is crazy good to work with.

    • @ImSidgr
      @ImSidgr 7 months ago +2

      Good ol’ Canva

    • @itellyouforfree7238
      @itellyouforfree7238 7 months ago

      @@minhuang8848 21:9 is one of the worst inventions of our time

  • @Zullfix
    @Zullfix 7 months ago +114

    The fact that optimizing for cache hits and enabling SIMD was able to bring a given matrix multiplication operation from 2000ms to 15ms is wild.

    • @okie9025
      @okie9025 7 months ago +23

      It reminds me of when Matt Parker made a unique wordle combination finding algorithm that took a month to complete, and then his viewers managed to optimize it so it only takes a few ms.

    • @amirnuriev9092
      @amirnuriev9092 6 months ago +4

      @@okie9025 This is so not the same: both algorithms here are O(n^3), but with constant-factor optimizations you reduce the runtime significantly.
      In Matt Parker's video he basically wrote the most naive, brute-forcey approach.

    • @MichaelPohoreski
      @MichaelPohoreski 6 months ago

      @@okie9025 Matt Parker *isn't a programmer*, so he implemented a naive solution in a dog-slow language, Python.
      It took me one day to write a C solution; it took 20 mins to run. The next morning I optimized it down to 20 seconds. By the evening I had multithreading working on my 24-core/48-thread Threadripper. It took ~2 seconds to find all 538 solutions.
      Understanding not only data access but also the calculations is paramount to getting high performance.

    • @BosonCollider
      @BosonCollider 2 months ago

      @@amirnuriev9092 Right, this is the kind of speedup that programmers with only Python experience would assume to be impossible (a multiple-orders-of-magnitude improvement despite having the same asymptotic complexity, using the same language, and not even using any stupid premature abstractions in the initial version that could slow things down)

  • @foolriver
    @foolriver 7 months ago +114

    You can get better performance by unrolling the loops (do it aggressively; for example, unroll the whole loop by the block size; in GCC, #pragma GCC unroll 8). However, it would still be 3 times slower than Intel MKL.

    • @axe863
      @axe863 7 months ago +2

      Why wouldn't the compiler pick that up? Make the compiler more aggressive???

    • @blipman17
      @blipman17 6 months ago +7

      @@axe863 GCC in particular doesn't like unrolling loops and vectorizing them as much as LLVM does, in my experience. Although they seem to have caught up a lot in the last few versions.

    • @axe863
      @axe863 6 months ago

      @blipman17 Thanks for the info.

    • @davidr2421
      @davidr2421 6 months ago

      @@blipman17 Yeah this is some info I was looking for. Based on what I've seen, LLVM would just automatically do these SIMD things without needing to be explicitly told, right?

    • @blipman17
      @blipman17 6 months ago

      @@davidr2421 Errr… yes and no. It really depends. Moving data to SIMD registers and back has overhead compared to the regular single instruction, single data (SISD) code that we normally write. Some SISD algorithm implementations in assembly can just be faster than the equivalent SIMD algorithms, depending on the instruction set. Also, heat. All instructions generate some amount of heat, which will affect the clock speed. So while an algorithm might be faster with SIMD, on a given processor it might be slower due to the CPU thermal-throttling. This is/was extremely apparent with AVX-512 SIMD instructions on Intel. That's also why GCC used to be, or is, hesitant about SIMD instructions. Furthermore, any consumer/server CPU of the last 20 years can run instructions out of order and in parallel. This is extremely apparent in for-loops, where for example the 6th iteration of the loop is started before the first iteration is even finished! That's parallelism you get for free.
      And this all changes drastically per specific model of CPU + RAM, and sometimes even between different cooling solutions. So… benchmark before making any such statements. This is extremely difficult.

  • @shimadabr
    @shimadabr 7 months ago +32

    That's awesome! I've been studying parallel programming, and a lot of these strategies I had no idea were possible. I wish my university had courses on HPC/parallel programming like yours. The course you mentioned at the end of the video seems great.

  • @trustytrojan
    @trustytrojan 7 months ago +30

    great animations and visual style! you've found yourself a great side hobby :)

    • @erikjohnson9112
      @erikjohnson9112 7 months ago +1

      I was thinking the same. The matrix multiplication was very effective. I remembered the first matrix row gets a dot product with 2nd matrix's first column, so I could see that it was lower left being the first matrix and upper right being the second (lower right obviously being the result).

  • @Val-vm6qu
    @Val-vm6qu 7 months ago +4

    The final part where you say "Oh but there is that library that does everything for you for that case" just made me laugh my head off, great vid full of deep commentary, really great job!

  • @RoyerAdames
    @RoyerAdames 6 months ago +3

    I saw your video on ThePrimeTime, and your video is epic. Very well explained for such high-level topics.

    • @just-a-hriday
      @just-a-hriday a month ago

      I believe it's actually rather low level.

  • @cefcephatus
    @cefcephatus 6 months ago +8

    I love how in-depth this video is; it makes me feel smart. I know that only a few people could make sense of content like this. But you make it feel like even more people can get close to it.

  • @giordano7703
    @giordano7703 7 months ago +8

    I love it when videos like these give me inspiration to delve into a topic I'm not familiar with. I admire people like you for coming up with such elegant and beautiful ways to communicate these concepts.

  • @fenril6685
    @fenril6685 7 months ago +8

    Subscribed. I can't believe you don't have more subscribers already. Any software engineer dealing with matrix math should watch this video.

  • @1000percent1000
    @1000percent1000 7 months ago +14

    I've been teaching myself how to use SIMD for the past 3 weeks and I can't even tell you how helpful this video was. I was baffled when my serial implementation of some image processing code was 10x faster than my naive SIMD implementation. Took me quite a while to understand how that was possible. This video has made me greatly appreciate the simplicity of Rust's (experimental) portable SIMD library. Also, I did not know what OpenMP was, and it seems somewhat similar to Rust's library. Absolutely incredible video!

  • @Antonio-yy2ec
    @Antonio-yy2ec 7 months ago +3

    Pure gold! Thanks for the video!

  • @Affax
    @Affax 7 months ago +51

    This video flew entirely above my head (still getting into this math in Uni haha), but the presentation is stunning, and I still watched through all of it because these optimizations are weirdly beautiful to me.
    Awesome job!

    • @stxnw
      @stxnw 7 months ago +1

      u only need basic algebra bro..

  • @MahdeenSky
    @MahdeenSky 7 months ago +2

    I loved the presentation and the animations really helped me collect the concepts.

  • @Otomega1
    @Otomega1 7 months ago +8

    love this stuff keep the great work !

  • @naj0916
    @naj0916 6 months ago

    Very good. I haven't thought about optimization for a long time since I was doing Assembly in the 90's. This brings back memories and the feeling. Very good!

  • @Dayal-Kumar
    @Dayal-Kumar 7 months ago +14

    I did this exact thing in my intern project this summer at TI. Their processor even had a special instruction to do the matrix multiplication of two small blocks. I was able to achieve around 80% of the rated capacity of the processor.

  • @nordicus666
    @nordicus666 7 months ago

    Great videos, you certainly know what you are talking about and you can share it while keeping it interesting, keep it up, you deserve more subscribers

  • @EricGT
    @EricGT 6 months ago

    this is a well done video and explains the idea of leveraging hardware and machine code to optimize cache lookups super well. But I also want to shout out what may be the best animation of a matrix dot product I’ve ever seen. this feels like the first time I watched a video that got me to understand what monads are

  • @brucea9871
    @brucea9871 7 months ago +4

    Until now I only considered the efficiency of the algorithm for optimizing program execution. Although I admit I didn't understand everything in this video it demonstrates that knowledge of the computer's hardware and taking advantage of it can significantly speed up execution.

  • @danh9002
    @danh9002 6 months ago

    Awesome video! Thank you so much for sharing this!

  • @U2VidWVz
    @U2VidWVz 6 months ago

    Cool videos. Interesting information and great graphics/animations. Thanks!

  • @LV-ii7bi
    @LV-ii7bi 6 months ago

    This seems really cool. Good work, you're on the right path.

  • @karim-gb5nx
    @karim-gb5nx 7 months ago +2

    Interesting subject, well presented, thank you.

  • @MohammedAbdulatef
    @MohammedAbdulatef 7 months ago +1

    The most captivating and insightful presentation I have ever witnessed.

  • @camofelix
    @camofelix 6 months ago

    Very well done! I make a living optimizing BLAS routines, this will probably become my default “what do you do” to send people

  • @itellyouforfree7238
    @itellyouforfree7238 7 months ago

    very nicely done

  • @amadzarak7746
    @amadzarak7746 7 months ago

    GEMM is the perfect example to demonstrate these concepts. Wonderful video my friend. You earned a subscriber

  • @user-uz3fc3ty2n
    @user-uz3fc3ty2n 6 months ago

    This is so beautiful I nearly cried, and at the end when you revealed that the library had a 1000x optimisation I died and came back to life a better person.

  • @tansanDOTeth
    @tansanDOTeth 6 months ago +1

    Visualizations were freaking amazing! Loved those! could you make a video of your process for editing these videos?

  • @joshpauline
    @joshpauline 6 months ago

    Probably one of the best programming videos I have ever seen. As a more senior developer, there is a lack of content with this level of production quality when explaining complex ideas.

  • @xdevs23
    @xdevs23 6 months ago

    This is so well explained and the animations are SO good!
    Also it would be interesting to see how clang performs compared to gcc.

  • @megaxlrful
    @megaxlrful 7 months ago +59

    Great video! But I feel like I will never get to do this kind of work during my job, since we use a scripting language, and all bets are off when you store everything on the heap.

    • @wiktorwektor123
      @wiktorwektor123 7 months ago +18

      While the scripting language is a huge problem here, memory isn't. Stack and heap are just different names for how memory is allocated by the OS kernel. All those types of memory are still sitting on the same RAM chip, so there is no difference in access speed or locality of that memory relative to the CPU cache. While the OS will almost always have stack memory in the CPU cache, every time you access a heap address it gets pulled into the cache along with a couple more addresses. The CPU doesn't load only the one value you request; it loads the value you request plus some block of data after it. If you have an array of numbers and you want element zero, the CPU loads not only that but probably 1k more elements after it.

  • @gara8142
    @gara8142 7 months ago

    This was a great watch, thank you.
    That library you used, MKL, can you make a video about it? It sparked my interest.

  • @laughingvampire7555
    @laughingvampire7555 2 months ago

    man, thank you for your work, it's amazing.

  • @homematvej
    @homematvej 7 months ago +2

    We need to go deeper!

  • @stavgafny
    @stavgafny 7 months ago +3

    Just seen ThePrimeagen react to your last video lol

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 7 months ago

    Great video!

  • @MNNoxMortem
    @MNNoxMortem 7 months ago

    Amazing.

  • @woolfel
    @woolfel 7 months ago +2

    I could be wrong, but NVidia's linear algebra library does these types of optimizations to optimize utilization. The cublas library does a bunch of re-ordering and transposing by default.

  • @rvoros
    @rvoros 7 months ago

    Nice video.
    I love Cities XL music too. ;)

  • @birdbrid9391
    @birdbrid9391 7 months ago

    best visualizations i've seen gj

  • @DeathSugar
    @DeathSugar 7 months ago +1

    there's also computational optimization similar to karatsuba multiplication but for matrices. probably proprietary lib uses it

  • @HadiAriakia
    @HadiAriakia 6 months ago

    Love the video,
    Where have you been mate until now?😂

  • @Malte133
    @Malte133 6 months ago

    Great work, and it's true, optimizing code by splitting up commands in assembler is a pain. Saw videos on that too.

  • @wiipronhi
    @wiipronhi 7 months ago

    OMFG I love these videos

  • @flightman2870
    @flightman2870 6 months ago

    hats off to the effort and thanks to @primetime for showing us this gem

  • @jacob_90s
    @jacob_90s 2 months ago

    Transposition works really well if you can combine it with a previous operation. For instance, if the matrix that's about to be multiplied is the result of adding together two matrices, then when you're adding the elements of the matrices, reverse the order so that the result is the transposition of the actual answer.

  • @skope2055
    @skope2055 7 months ago +1

    Great!

  • @errodememoria
    @errodememoria 6 months ago

    man what a great video

  • @ElZafro_
    @ElZafro_ 7 months ago +5

    Impressive! Never heard of these optimizations during my computer engineering degree

  • @mfinixone1417
    @mfinixone1417 7 months ago

    Amazing

  • @Kotfluegel
    @Kotfluegel 7 months ago +2

    There is an error in how the multiplication of the block matrices is animated. As the currently active result block is being calculated, the blocks of the input matrices should iterate through rows and columns of blocks, just like with regular matrix-matrix multiplication.

  • @axe863
    @axe863 7 months ago

    Beautiful.

  • @Mefistofy
    @Mefistofy 7 months ago +1

    Beautiful video. Understandable even for a EE grunt like me. The take home message at the end was a little disheartening though. AFAIK only Intel and Nvidia churn out hyper optimized libraries. Binding you to their tools. I'm not sure about the impact of using Python for ML but I guess all the nice libraries do something similar or use MKL directly (numpy does this with specific versions).
    Just a user of all the shiny tools to build new shiny tools.

  • @69k_gold
    @69k_gold 7 months ago +1

    DSA pro: You get 1900ms, take it or leave it
    Guy with an Intel chip and an ASM course: Hold my cache

  • @tarn_jihas
    @tarn_jihas 7 months ago

    subbed :)

  • @sawinjer
    @sawinjer 7 months ago +1

    Software engineers can have two mindsets - one of being the best and the other of being the worst. After watching this video, I quickly fell into the second. 10 out of 10

  • @NIghtmare8791
    @NIghtmare8791 7 months ago +1

    It's a great video for optimizing matrix multiplication from a purely technical/computing-science standpoint. But I think an improvement you could still make is to multiply the matrices with something like the Strassen algorithm (nowadays there are even faster algorithms): matrix multiplication doesn't need to have a cubic n^3 runtime. You can actually do matrix multiplication in slightly less, for example n^2.8, which should give significant improvements when multiplying big matrices.

  • @ujjwalaggarwal7065
    @ujjwalaggarwal7065 7 months ago

    just amazing

  • @rabbitcreative
    @rabbitcreative 7 months ago

    I have a real-world example that feels to me similar to what is described here. It involves a Postgres install on a machine with a RAID-5 setup. Benchmarking showed peak ~15k TPS for about 30 seconds, then the TPS would degrade to ~2k TPS and wildly fluctuate up and down for about 30 seconds, and then resume at ~15k TPS for another 30 seconds. This behavior was steady over longer benchmarks (10 minutes).
    I used Postgres's table partitioning ability to solve this. First I used 10 partitions, which made no difference. Then I used 100 partitions, and while the peak throughput capped at ~13k TPS, it was steady for the entire 10-minute benchmark.
    Another, more human example, is writing the same sentence over and over on a chalkboard. Easier to write "I, I, I, I, will, will, will, will, never, never, never, never, chew, chew, chew, chew, gum, gum, gum, gum." Instead of "I will never chew gum", etc.
    It's all related.

  • @poproporpo
    @poproporpo 7 months ago

    Motion canvas!

  • @__samuel-cavalcanti__
    @__samuel-cavalcanti__ 7 months ago

    @depthBuffer can you share the book or materials that you have read to understand this topic ?
    btw, blazing beautiful video

  • @Gefionius
    @Gefionius 3 months ago

    Incredible video. Feel like I am getting a PhD in Optimization just by watching and I haven’t really coded for 20ys 😂

  • @simian3455
    @simian3455 7 months ago +4

    I can't lie, I do not program by any stretch of the imagination... but your videos are something else; the clean and clear explanations are just amazing.

  • @AREACREWBMX
    @AREACREWBMX 6 months ago

    Genius

  • @insu_na
    @insu_na 6 months ago

    great video

  • @ChaoticNeutralMatt
    @ChaoticNeutralMatt 7 months ago

    Once you reminded me of the context of the last video, I believe I know where this is going

  • @Proprogrammer001
    @Proprogrammer001 6 months ago

    This has so fucking deeply re-ignited my passion for computer science. Gosh I am on fire right now

  • @shakibrahman
    @shakibrahman 7 months ago +21

    anybody got any online course/book similar to the course DepthBuffer mentions he took at university for highly optimized code?

    • @simonl1938
      @simonl1938 7 months ago +2

      I'm also interested in that :)

    • @winda1234567
      @winda1234567 7 months ago

      same @@simonl1938

    • @leshommesdupilly
      @leshommesdupilly 7 months ago +2

      also interested :D

    • @depth-buffer
      @depth-buffer  7 months ago +23

      We are not using any textbooks during the course, and the course is also not recorded. But the slides are published here: pages.cs.wisc.edu/~sifakis/courses/cs639-s20/

    • @shakibrahman
      @shakibrahman 7 months ago +2

      @@depth-buffer awesome! thank you

  • @saltyowl3229
    @saltyowl3229 7 months ago +1

    Please tell me the next one will only be a month or two. I love learning this level of optimization and it is conveyed so well. I will openly admit I’m being selfish cause I don’t wanna wait XD, though I do understand if that can’t be the case, hardly got a lick of free time with my own courses as well

  • @codeChuck
    @codeChuck 3 months ago

    This lvl of perf is amazing! The depth of knowledge and understanding is stunning! Is it possible to do something like this in Typescript / React / Next / tRPC?

    • @samuelwaller4924
      @samuelwaller4924 2 months ago

      If you mean in JS, then realistically no. You can maybe do some of these optimizations using some of the newer arraybuffers or whatever they are called to use "real" arrays, but at the end of the day you can't guarantee that the runtime or interpreter won't mess something up. It just wouldn't be worth it, just use a library

  • @mNikke
    @mNikke 4 months ago

    It would be cool if you would follow up on this using the Strassen algorithm with Z-ordering (Morton ordering), and maybe even AVX-512 instructions. I think you would get pretty close for large matrices.

  • @jupiterbjy
    @jupiterbjy 7 months ago

    kinda wonder why our university barely has such lectures, amazing video

  • @huanxu5140
    @huanxu5140 7 months ago +1

    Maybe Intel uses AVX vector intrinsics under the hood if your CPU supports them, and that leads to the perf difference.

  • @manuelmoscardi6606
    @manuelmoscardi6606 7 months ago

    just bruh, what's this video, why you have only 8k sub, why the animation are so precise, why I'm here

  • @Rudxain
    @Rudxain 7 months ago

    6:24 That's like constant/static recursion, which is much faster than dynamic recursion (specially if there's no TCO).
    11:16 Can NUMA help with this?

  • @TylerHallHiveTech
    @TylerHallHiveTech 7 months ago

    Great video! Do you mind sharing the source for your Motion Canvas? I like the implementation of the graph animations. Just getting used to Motion Canvas and was inspired by how you used it. Thanks!

    • @depth-buffer
      @depth-buffer  7 months ago +1

      Currently, my repository contains source code for my next video so I can’t make it public. But I’m using my own custom fork of motion canvas and some effects in the video require those custom features. You can check it out if you want: github.com/fangjunzhou/motion-canvas

  • @ericraio7145
    @ericraio7145 6 months ago +1

    Came here from primeagen! :D

  • @aeebeecee3737
    @aeebeecee3737 6 months ago

    I just subbed to this channel

  • @groff8657
    @groff8657 7 months ago

    This is the first time I've ever heard the term 'giga-flops', and I quite like it.

    • @minhuang8848
      @minhuang8848 7 months ago +5

      If you're new to the field, which seems to be a fair assumption... I highly recommend checking out machine learning papers and their ridiculous names. A tiny selection:
      - We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!
      - Learning to learn by gradient descent by gradient descent
      - BERT has a mouth, and it must speak
      - Gotta Learn Fast: A New Benchmark for Generalization in RL
      - Everything involving BERT and ERNIE and the full range of Sesame Street characters
      Yeah, there's lots of whimsical stuff going on in that particular space, and it often is pretty relevant to the actual topic at hand, too.

  • @geckoo9190
    @geckoo9190 7 months ago

    The other day I was watching a video that was implying that using fewer conditions makes the code faster, but then I ran some tests in assembly and saw that without the conditions, the code was longer

    • @Erlisch1337
      @Erlisch1337 7 months ago +2

      that would depend entirely on what you are doing

  • @stacksmasherninja7266
    @stacksmasherninja7266 7 months ago

    Isn't there an algorithm that uses 7 multiplications instead of 8 for the 2x2 GEMM? Can't we exploit that to always work in 2x2 blocks?
    Also, sweet animations dude! What tools do you use for these? Is that just manim?

    • @depth-buffer
      @depth-buffer  7 months ago +1

      I’m using motion canvas

  • @machinelearningTR
    @machinelearningTR 3 months ago

    Anybody know how I can make animations like those in this video? Which app should I use?

  • @4SnapIT
    @4SnapIT 2 months ago

    So the first optimization is all about cache-line utilization and not so much about cache hits. Makes sense when you think it through. A big percentage of each cache line is "garbage" data in the vanilla implementation.

  • @thanatosor
    @thanatosor 7 months ago

    Imagine how much underlying low level mechanism was hidden from programmers.
    They aren’t supposed to know if they don’t have to optimize to the limit.

  • @marketsmoto3180
    @marketsmoto3180 6 months ago

    this dude's channel bout to get huge now

  • @colonthree
    @colonthree 6 months ago

    Shoutouts to CinemaScope. :3

  • @user-gt2th3wz9c
    @user-gt2th3wz9c 7 months ago

    would it even be possible in higher-level languages such as JavaScript and Python?

  • @Vaaaaadim
    @Vaaaaadim 7 months ago +1

    Maybe a factor in why MKL is 8x faster than your implementation is that it uses something like the Strassen algorithm
    when it recognizes that you are just doing matrix-matrix multiplication of matrices with real values?

    • @dmitrykargin4060
      @dmitrykargin4060 7 months ago

      You need larger sizes for Strassen to bring such an advantage

    • @Vaaaaadim
      @Vaaaaadim 7 months ago

      @@dmitrykargin4060 I only said it might be a factor; a few recursions could be applied whilst still using the typical algorithm.
      Quoting from en.wikipedia.org/wiki/Strassen_algorithm: "In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as 500x500". And yeah, Wikipedia is not necessarily always right, but looking at the paper cited for this statement, it does seem to support the claim.

    • @dmitrykargin4060
      @dmitrykargin4060 7 months ago +1

      @@Vaaaaadim It is not a 3-times difference for this size. You need far larger matrices to get that. I bet it was AVX-512 with smart reductions that got that 3x boost

    • @theexplosionist2019
      @theexplosionist2019 7 months ago

      No. Strassen accumulates errors.

  • @somniad
    @somniad 7 months ago

    this video is w i d e

  • @abbasballout4441
    @abbasballout4441 5 months ago

    Wait what's the music?

  • @imlassuom
    @imlassuom 7 months ago

    🌟🌟🌟🌟🌟

  • @vbregier
    @vbregier 6 months ago +1

    Please, if you mention MKL, do mention that it is a non-portable, Intel-only library!
    It won't work at all on any non-x86-compatible architecture, such as ARM, and it has poor performance on AMD CPUs.
    There are some open-source portable alternatives to MKL (BLIS, libflame).
    Don't trade portability for performance on a single CPU family!

  • @snoozy355
    @snoozy355 6 months ago

    What's your IDE at 13:55 ?

  • @dalicodes
    @dalicodes 6 months ago

    from Prime

  • @thanatosor
    @thanatosor 7 months ago +1

    Memory access is now the bottleneck for the CPU

  • @fulbone2848
    @fulbone2848 7 months ago

    STRASSENS!!

  • @deckard5pegasus673
    @deckard5pegasus673 7 months ago

    I didn't watch the whole video, just skipped through it, but why even do matrix operations on the CPU nowadays, with other options like CUDA?

  • @Petch85
    @Petch85 7 months ago +1

    What if your matrix is much larger? Like 10Mx10M. Do the algorithms swap places?

    • @depth-buffer
      @depth-buffer  7 months ago +13

      That’s a very interesting topic! As far as I know, people rarely talk about optimization problems for gigantic matrices when they are dense. Because you probably can’t even fit them inside the memory. However, when the matrices are sparse, there are a lot of possibilities. In fact, I worked on a research project this summer related to this topic. Maybe I’ll talk about it in the future!

    • @Petch85
      @Petch85 7 months ago

      @@depth-buffer Please make a video on sparse matrix calculations. I have been toying with the idea of making a simple "topology optimization FEM solver" in python/numpy and one using CUDA (for the GPU), to see if the GPU would increase the speed. FEM is basically just solving the equation f=K*u, where K, the stiffness matrix, is a symmetric sparse matrix that can easily be very large. f is the force vector and u is a displacement vector. You can have constraints in both f and u (but not for the same components). So you need to find the inverse of K. This can be done iteratively, involving a lot of matrix-times-vector calculations, so it is not exactly the same as matrix multiplication. I know there has been a lot of research into this problem, one question being how to solve problems that are way larger than memory. But still, when solving this problem you would love for your problem to fit into memory. Thus you want more memory more than you want more cores, up until you can fit the whole problem in memory. And sparse matrices are a must, because K is 99% zeros.

    • @feha92
      @feha92 7 months ago +1

      @@depth-buffer You forgot to actually answer his yes/no question though xD gave me a laugh

    • @Johnnius
      @Johnnius 7 months ago +2

      @@feha92 The answer is "we don't know". A 10M × 10M dense matrix of 32-bit floats would require 400TB of memory, and we don't have such a big machine.

    • @feha92
      @feha92 7 months ago +1

      @@Johnnius You have clearly never heard of big-O notation, or math?