1. Introduction and Matrix Multiplication

แชร์
ฝัง
  • เผยแพร่เมื่อ 11 ก.ย. 2024

ความคิดเห็น • 97

  • @leixun
    @leixun 4 ปีที่แล้ว +194

    *My takeaways:*
    *1. Why do we study performance engineering **0:57*
    1.1 Why we can't improve the frequency of hardware anymore 10:00, because static power consumption dominates the power consumption in modern hardware, and it is hard to reduce it.
    1.2 Parallelism is used to increase performance 11:49, but the performance is no longer free since we have to write programs that can run in parallel.
    *2. An example of performance engineering: Matrix multiplication **15:35*
    2.1 Hyperthreading can improve performance a little bit, but it makes performance measurement difficult 17:17
    2.2 Python code running time: *~6 hours* 19:38
    2.2 Java code running time: *~46 minutes* 23:44
    2.3 C code running time: *~19 minutes* 24:23
    2.4 Why Python is slow 25:30, because Python is interpreted, C is compiled and Java is in between.
    2.5 Further improvement on the C code.
    - Change the order of loops: *~3 minutes* running time, because cache locality is exploited 29:50, and we can use measurement tool $ cachegrind 33:54
    *-- In C, a matrix are stored row by row in memory, therefore access data column by column can exploit cache locality.*
    *-- In C[i][j] += A[i][k] * B[k][j], we want to maximise the access to the column index which are [j], [k] and [j], and because there are two [j], put [j] in the inner loop could maximise cache locality.*
    *-- We also want to minimise the access to row index because it can break cache locality, and since there are two[i] and one [k], put [i] in the outer loop could maximise cache locality.*
    *-- Put [k] in the middle loop.*
    - Change the compiler flag: *~54s* running time 34:59
    - Use more cores: *~3s* running time 36:24, rule of thumb is parallel outer loops rather than inner loops.
    - Manage cache for data reuse, tiled matrix multiplication with parallelism: *~1.74s* running time 40:10
    - Tiling for two-level cache, recursive matrix multiplication with parallelism: *~94s* running time 45:45, because the overhead of recursion is large, but we can coarsening the recursion to get *~1.3s* running time 49:00
    - Using vector hardware, SIMD with compiler: *~0.7s* running time 51:35
    - Using vector instruction directly: *~0.4s* running time 55:55
    *3. Matrix multiplication is a good example, but we generally won't see such magnitude of improvement in other cases **58:58*

    • @vishnusingh4118
      @vishnusingh4118 4 ปีที่แล้ว +10

      You're God's gift to humankind. Thanks a ton for this Lei Xun! Much appreciated!

    • @leixun
      @leixun 4 ปีที่แล้ว +2

      @@vishnusingh4118 You are welcome

    • @Kingpin.7
      @Kingpin.7 3 ปีที่แล้ว +1

      Can anyone reply if all these videos are to be watched in order or are they all independent of eachother?

    • @leixun
      @leixun 3 ปีที่แล้ว +2

      @@Kingpin.7 Mostly in order

    • @user-mr-m12312
      @user-mr-m12312 3 ปีที่แล้ว +2

      Great takeaways!

  • @hengchunsong301
    @hengchunsong301 4 ปีที่แล้ว +94

    From 6 hrs in python down to less than half a second. Indeed impressive optimization!

    • @takasurazeem
      @takasurazeem 4 ปีที่แล้ว +5

      And you said that exactly 6 months back from today.

    • @jiaming5269
      @jiaming5269 4 ปีที่แล้ว +2

      spoiler warning zzz

    • @jiaming5269
      @jiaming5269 4 ปีที่แล้ว

      spoiler warning zzz

    • @awsumpersun321
      @awsumpersun321 3 ปีที่แล้ว +2

      most of it attributed to parallelization and language though (trivial changes)

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +4

      @@awsumpersun321 Only trivial an an absurdly high level abstraction. Just try to implement either. If you take those two out as not everything can be parrallel(or alternately the task is already parallel), and if you are not totally daft(regarding high demand computation) and assumed a compiled language from the start; then you are still looking at a 92x improvement. Now consider super compute machines that consume $10'000 per day and many scientific processes that need many days or even months of runtime. Then even 2% is a substantial savings, let alone cutting it to 1/92 (~1%) of the original.

  • @ajayrajan8882
    @ajayrajan8882 4 ปีที่แล้ว +31

    The material taught here is gold. Zero dislikes, just how it's supposed to be!

    • @merlinarthur4661
      @merlinarthur4661 3 ปีที่แล้ว +8

      I think Cuz of your comment it got 2 dislikes🥲🥲

    • @ajayrajan8882
      @ajayrajan8882 3 ปีที่แล้ว +1

      @@merlinarthur4661 plausible

    • @larry_the
      @larry_the 2 ปีที่แล้ว

      still no dislikes a year later

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +4

      @@larry_the TH-cam turned off dislikes a couple months ago, its rainbows and unicorns forever....

    • @sarantsogtmunkhbaatar6974
      @sarantsogtmunkhbaatar6974 2 ปีที่แล้ว

      u fucking jinxed it

  • @windmaple
    @windmaple 4 ปีที่แล้ว +48

    This is a truly amazing class.

  • @MIT60346
    @MIT60346 5 ปีที่แล้ว +28

    Glad to see Prof Leiserson again.
    Btw, CLRS is always awesome!

  • @josephgibson2548
    @josephgibson2548 5 ปีที่แล้ว +24

    Thank you MIT! Once again a great video and lecture.

  • @OTadashi162
    @OTadashi162 2 ปีที่แล้ว +1

    I have not been expecting it to be a performance optimization class (i didn't read the description before, i was looking for game dev related stuff), but i have found very interesting. I never had this kind of advanced concept before about how computers REALLY works.

  • @npe_exception
    @npe_exception 3 ปีที่แล้ว +2

    2:59 Changed my whole view of SE

  • @BinuVargheseamd
    @BinuVargheseamd ปีที่แล้ว

    One of the best classes I have attended. Absolutely fabulous. Thank you for the upload.

  • @kythrathesuntamer9715
    @kythrathesuntamer9715 2 ปีที่แล้ว +1

    This is incredible. Thank you so much for this course MIT.

  • @hanw544
    @hanw544 4 ปีที่แล้ว +20

    This guy wrote Intro to Algos, OMG😱

    • @tear728
      @tear728 2 ปีที่แล้ว

      Wow that's legendary

  • @takshpatel8109
    @takshpatel8109 2 ปีที่แล้ว

    I really appreciate this open source courses

  • @Er.Sunil.Pedgaonkar
    @Er.Sunil.Pedgaonkar 9 หลายเดือนก่อน

    A course in System Software and Programming is must for CS / CpE majors

  • @MysterySemicolon
    @MysterySemicolon 3 ปีที่แล้ว +5

    Great class. I had to laugh at how much just two years have given over to specs as he describes the AWS large machine as a "honkin' big machine". Enterprise VD environments are really pushing what constitutes a honkin' big machine upward.

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +3

      There were bigger AWS options even in 2018, the point is that most folks are using less than 1% of it and think they are really cooking or that they just need a faster machine.
      Seeing what 1980s programmers could push through an 8bit console with 2k of memory (like the NES) with various optimization tricks, now those machines had a good utilization ratio.(Yes, they were simpler than modern workstation architectures and thus easier to optimize.)

  • @mandcali7613
    @mandcali7613 11 หลายเดือนก่อน

    Wow! Thank you… a lot of the question marks that I’ve encountered before was answered here…

  • @thehardikraheja
    @thehardikraheja 2 ปีที่แล้ว +1

    As to why we cannot cross 40% peak, is the reason memory accesses? Since the whole matrix of such size cannot be cached, I am curious if there are more reasons as well

  • @asdfghjkl36958
    @asdfghjkl36958 2 ปีที่แล้ว +3

    Can someone explain why the k loop can't be parallelized?

    • @thehardikraheja
      @thehardikraheja 2 ปีที่แล้ว +2

      Because it's the A[i][k] * B[k][j] so k is common between A and B and needs to be executed sequentially for obtaining the result C, this is my understanding of the reason

  • @MathNerdGamer
    @MathNerdGamer 2 ปีที่แล้ว +1

    Was this video unlisted until just 30-ish minutes ago?
    I saw the title and was excited to see another updated 6.172, but then I noticed that the thumbnail showed that I'd already watched it before.

    • @mitocw
      @mitocw  2 ปีที่แล้ว +1

      Good catch! Yes it was. We're glad you got to enjoy it anyways :D

  • @gabrielagalindo6339
    @gabrielagalindo6339 4 ปีที่แล้ว +2

    Thanks MIT, amazing lecture

  • @veramentegina
    @veramentegina 4 ปีที่แล้ว +4

    what a great lecture!!

  • @backpackmusician
    @backpackmusician หลายเดือนก่อน

    26:12 can it be faster if you run it on assembly? And if yes, how much?

  • @bob-the-constructors9912
    @bob-the-constructors9912 2 ปีที่แล้ว

    Thank you MIT

  • @codefast93
    @codefast93 ปีที่แล้ว

    Amazing lecture! Why does it go only up to only 41.6% of peak?

  • @stephenodogwu7359
    @stephenodogwu7359 5 หลายเดือนก่อน

    I went through this very engaging lecture and thought I understood it. Maybe I'm missing something, because I ran the matrix multiplication on my PC and my results came back very fast. Without optimization, C performed the operation in just 0.03 seconds. Please am I getting something wrong as to why it took longer to perform in this lecture

    • @ankushmenat
      @ankushmenat หลายเดือนก่อน

      Did you use 4096x4096 matrix?

  • @Originalimoc
    @Originalimoc 3 ปีที่แล้ว +1

    38:16 because possible racing condition? since the use of +=

  • @Adit86
    @Adit86 4 ปีที่แล้ว +2

    @14:30 - 2000 "OH OHS" = oughts

  • @pravinjoshi8978
    @pravinjoshi8978 4 ปีที่แล้ว +2

    Thanks MIT!!! keep up good work...

  • @chessimprovement8804
    @chessimprovement8804 2 ปีที่แล้ว

    Great resource.

  • @manfredbogner9799
    @manfredbogner9799 หลายเดือนก่อน

    Sehr gut

  • @This.Object
    @This.Object 8 หลายเดือนก่อน

    Did he jus say "well i took this 1.3hr class to prove you don't need performance, just write your god damn code in such a way that it makes some fucking sense" 😂

  • @FernandoJavierSosa
    @FernandoJavierSosa 4 ปีที่แล้ว +2

    Thanks for sharing this excelent lesson!

  • @RonMartinJr
    @RonMartinJr 4 ปีที่แล้ว +1

    That was amazing

  • @user-hn4sf7cn1r
    @user-hn4sf7cn1r 3 ปีที่แล้ว

    It's a cool lecture!

  • @regulus8518
    @regulus8518 4 ปีที่แล้ว +2

    what is the pescribed reading for this course ?.

    • @mitocw
      @mitocw  4 ปีที่แล้ว +18

      6.172 has no required readings. There is a list of supplemental readings on MIT OpenCourseWare at: ocw.mit.edu/6-172F18. Best wishes on your studies!

    • @maloxi1472
      @maloxi1472 4 ปีที่แล้ว

      Thanks @@mitocw

  • @FreedomIsNeverEverFree
    @FreedomIsNeverEverFree 3 ปีที่แล้ว

    how long does a numpy matrix multiplication take?

  • @seunghwacho5889
    @seunghwacho5889 4 ปีที่แล้ว +1

    Thank you for sharing great lessons.

  • @Soulixs
    @Soulixs 4 ปีที่แล้ว

    Thankyou

  • @serkardis292
    @serkardis292 2 ปีที่แล้ว +1

    These optimizations are amazing, python code isn't great though. I managed to write a pure python version that runs in 300s(using pypy). No optimizations other than loop reordering. Still slow, but it's just 5 loc that were written in 3 minutes.

    • @miguelescutia5556
      @miguelescutia5556 ปีที่แล้ว

      You used concepts from his lesson anyway.

    • @serkardis292
      @serkardis292 ปีที่แล้ว

      @@miguelescutia5556 yes, it's an amazing lecture. Just wanted to highlight that barely optimized python is less than 2x slower than C at the same level of optimization.

  • @lange22
    @lange22 3 ปีที่แล้ว

    What is the source/paper/study behind the sldies @ 14:00?

    • @mitocw
      @mitocw  3 ปีที่แล้ว

      Visit ocw.mit.edu/6-172F18 for the course materials. Best wishes on your studies!

  • @aaronrenfroe5199
    @aaronrenfroe5199 4 ปีที่แล้ว +3

    what is absolutely bonkers is python's numpy.matmul holds it's own up until #9 on my 5 year old macbook pro, and it looks like, C = np.matmul(A,B). Lol. So yes, plain python is slow but it can be fast. Good Lecture.

    • @sauravtiwary8513
      @sauravtiwary8513 3 ปีที่แล้ว +8

      @Aaron, The core logic of NumPy that actually does the multiplication is written in C and not Python. github.com/numpy/numpy/tree/master/numpy/core/src

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว

      @@sauravtiwary8513 Yep, can't get around the laws of physics. (Same algorithms pre-compiled vs interpreted in real time.)

  • @SFXFF
    @SFXFF 2 ปีที่แล้ว

    Chalkboard is so cool

  • @Kingpin.7
    @Kingpin.7 3 ปีที่แล้ว

    Can anyone reply if all these videos are to be watched in order or are they all independent of eachother?

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +1

      It is recorded class lectures, as it clearly says in the description. They are a series, not independent.

  • @21fad24
    @21fad24 2 ปีที่แล้ว

    what was changed in this re-upload?

    • @mitocw
      @mitocw  2 ปีที่แล้ว +1

      This was not a re-upload. We just forgot to make this lecture public. Everything else was public, except this one lecture. 🤦‍♂️

  • @tolga1503
    @tolga1503 2 ปีที่แล้ว

    VAHAB BEYAAZ AHMET ÇAKAAR

  • @JP-vg8vl
    @JP-vg8vl 2 ปีที่แล้ว +2

    7:00 so michael jackson is a computer scientist huh

  • @abpccpba
    @abpccpba 2 ปีที่แล้ว

    Why not try Forth programing language; supper fast.

  • @DistortedV12
    @DistortedV12 2 ปีที่แล้ว +1

    Bro why does this have so many views??

  • @diego44551
    @diego44551 4 ปีที่แล้ว

    Do someone knows some book of software? I wanna study an other engineering but I would like to learn software too.

  • @eagledove9
    @eagledove9 3 ปีที่แล้ว +1

    If I wanted to make an Amish-like computer, using very simple and primitive parts, and if my goal was to produce all of it in the United States instead of in China, it might make sense to build computers and appliances that had less power, less memory, a frugal approach to programming. You might make a goal of building a computer that doesn't use certain kinds of rare earth elements or something, using only elements that could be mined locally, so you would give up a lot of the advanced, modern hardware.

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +1

      You can make computers with water and mechanical valves. So what?

  • @p0187
    @p0187 ปีที่แล้ว

    program vs software vs application 🤔

  • @Pabloparsil
    @Pabloparsil 2 หลายเดือนก่อน

    Web monkeys mentioned

  • @xiaoxiashirleywu2799
    @xiaoxiashirleywu2799 2 ปีที่แล้ว

    check

  • @rileydavidjesus
    @rileydavidjesus 2 ปีที่แล้ว

    Run it in Rust though

  • @gp10020
    @gp10020 2 ปีที่แล้ว

    i wonder what he's drinking?

  • @BaptisteCanton1
    @BaptisteCanton1 7 หลายเดือนก่อน

    TIL: web monkey concept 😂😅

  • @kenichimori8533
    @kenichimori8533 5 ปีที่แล้ว

    Rank(0)(1,0)matrix permutix.

  • @Matowix
    @Matowix 2 ปีที่แล้ว

    Basic stuff I learned early on in my first year.

    • @mytech6779
      @mytech6779 2 ปีที่แล้ว +3

      This is a 100 level class ...so your point is?

    • @johnwolves2705
      @johnwolves2705 2 ปีที่แล้ว +2

      @@mytech6779 I think he thought of the 3×3 matrix multiplication in C taught in mostly 1st year of most University 😂😂