SIMD and vectorization using AVX intrinsic functions (Tutorial)

  • Published 21 Nov 2024

Comments • 50

  • @matias-eduardo
    @matias-eduardo 2 years ago +44

    This is the best SIMD intro I've seen. And the bonus (relevant) humor really helped give the topics some breathing room to let them sink in. Just an overall outstanding presentation.

  • @PBlague
    @PBlague 9 months ago +3

    This video, and your channel in general, needs much, much more attention! I'm just starting out on SIMD and man, the serious yet funny and very clear explanation of how everything works is amazing!
    I rarely subscribe to channels after watching just one video... it has happened fewer times than I have fingers on one hand, but your channel was one of them!
    Amazing job! Keep it up!

  • @my_stackoverflow8982
    @my_stackoverflow8982 2 years ago +8

    At the moment, this is the best SIMD video I've seen! Thank you very much!

  • @AugasDopesugerwin
    @AugasDopesugerwin 2 years ago +12

    Literally the best lecture I've ever seen in my entire life. Very good job, and a very big thank you for this!

  • @amj864
    @amj864 6 months ago +2

    Educational, funny, and engaging for such a topic. I don't know what more one could ask for.

  • @anonymouscommentator
    @anonymouscommentator 2 years ago +19

    This was such an awesome video! I assume this is for some kind of university course (which I'm not part of) but I never used AVX instructions before and yet I could easily follow the video. I enjoyed all the jokes and the examples really helped. It really didn't feel like an entire hour video! I might give AVX a try in the future :)

  • @Quancept
    @Quancept 1 year ago +3

    Was stuck on the alignment issue since yesterday. Finally understood what the issue was and solved it. Thank you so much my friend.

  • @azraeldophoom7857
    @azraeldophoom7857 4 months ago

    You got my subscription at the illustration of the composition of two vectors. Thanks :)

  • @torvasdh
    @torvasdh 2 months ago

    Literally the best SIMD video ong

  • @ngissac3411
    @ngissac3411 2 years ago +2

    Thanks for your wonderful video. I'm a new CPU engineer, responsible for the uarch of the FP part. This video truly gives me a new perspective on vectorization and SIMD. Thank you!!!

    • @ngissac3411
      @ngissac3411 2 years ago

      Hello, Joel! I have one question for you. When you mentioned the FMA3 instruction and its operand ordering, I found it didn't match the Intel manual for VFMADD231PS. According to the Intel Developer's Manual, the "231" in the mnemonic indicates that O1 should be a multiplier, O2 the addend, and O3 a multiplicand. That seems right for the O1*O3+O2=O1 formulation, but your slide and the Intel manual say it should be O2*O3+O1=O1. I want to figure out why.

    • @ngissac3411
      @ngissac3411 2 years ago

      I think I was misled by the grammar. But anyway, you are right! Thanks a lot!

  • @deeps-n5y
    @deeps-n5y 2 years ago +5

    This was entertaining and insightful at the same time! Cleared so many confusions I had since I mainly work with Python and deep learning. The production quality (audio+video) too is outstanding! Looking forward to more videos :)

  • @peterwilson69
    @peterwilson69 5 months ago

    Amazing video - I appreciate the work that went into this, thank you.

  • @mariovrpereira
    @mariovrpereira 2 years ago +1

    This is exactly what I was looking for! Thank you so much.

  • @niranjanm5942
    @niranjanm5942 1 year ago

    Best thing on this subject. Thank you very much

  • @abdulwasayikhlaq8013
    @abdulwasayikhlaq8013 5 months ago

    Amazing video! Thank you so much for this

  • @eduardovaldez4013
    @eduardovaldez4013 2 years ago

    Really good video, found it to be very thorough in the explanation of the basics.

  • @gabrielguitiánestrella
    @gabrielguitiánestrella 1 year ago

    Amazing, Incredible, Fantastic work!! Wow!

  • @petrocksgarage
    @petrocksgarage 10 months ago

    This is great! Any chance you could make an intrinsics best practices video? I can't really find anything like that on TH-cam or elsewhere.
    All I've been able to find is Stack Overflow sample code with no explanation of why things are done the way they are, i.e. why it's common to group similar intrinsics together: load, load, load, load, add, add, add, add, store, store, store, store. Why not load, add, store, load, add, store, load, add, store, load, add, store?
    The tip about creating an array of __m256d is pretty interesting. I'm going to have to try that in my current project, where I'm trying to ensure all 16 AVX2 vector registers (or as close as possible) are in use in a 3GB+ matrix MaxPlus algorithm. In some quick test code I wrote, using an element in the array disassembles to a vmovupdy just like a regular _mm256_load_pd does. I'm surprised it doesn't use vmovapdy, since the memory is aligned. Adding __assume_aligned didn't help either; I'm using the Intel icc compiler. Hmmm...

  • @__hannibaalbarca__
    @__hannibaalbarca__ 1 year ago +2

    Things stay mysterious until some _hero_ unveils them.

  • @austinbai1460
    @austinbai1460 2 years ago +3

    This helps me a lot, thank you!

    • @joelacarpenter
      @joelacarpenter  2 years ago +2

      Glad to hear it. Different topic from most of my videos. The official Intel intrinsics guide is comprehensive, but it's pretty intimidating when you're first starting out. Especially for the more complicated instructions, and especially given the guide has no diagrams.

  • @EN-hm6zx
    @EN-hm6zx 9 months ago

    Excellent lecture with great memes. Thanks for sharing!

  • @lukaszm5011
    @lukaszm5011 8 months ago

    Thank you for this video! And very good sense of humor 😂

  • @alexeiionov9812
    @alexeiionov9812 1 year ago

    actually so helpful, cheers from Berkeley ◡̈

  • @hananfawzy7139
    @hananfawzy7139 2 years ago

    Great work, thanks very much!
    Where can I find this presentation?

  • @ramzisabra9235
    @ramzisabra9235 2 years ago

    Such a wonderful video. Thank you!

  • @colinmaharaj50
    @colinmaharaj50 1 year ago

    8:10 Other than Jon Hamm, Kevin Bacon is one of the best smelling actors around

  • @j.r.8176
    @j.r.8176 4 months ago

    Thank you! I love you

  • @AsafRavid
    @AsafRavid 1 year ago

    Great Lecture!

  • @jimamar
    @jimamar 1 month ago

    You don't cover how to handle the return values: an array for vector output versus a single number for regular scalar output.
    Put differently, how do you handle the result, and what do you do during accumulation?

    • @joelacarpenter
      @joelacarpenter  1 month ago

      Not sure what you're asking, but if you look near the end where I do a complex dot product, you can see some code examples.

  • @yuan.pingchen3056
    @yuan.pingchen3056 4 months ago

    Does SIMD have a reverse Polish notation similar to the FPU era to optimize performance?

  • @weicongchen237
    @weicongchen237 2 years ago

    Awesome video! Thank you!

  • @virtsusu
    @virtsusu 10 months ago

    Why weren't _mm_testz_si128 and _mm_movemask_ps covered? They look pretty useful, are you saying that they should be avoided?

    • @joelacarpenter
      @joelacarpenter  8 months ago +1

      Nothing wrong with those intrinsics; I use them sometimes. I just couldn't get through all the instructions because there are so many. Mostly I wanted to give people a taste of the most common ones and of what kinds of operations exist. It should also be a good starting point for learning the other intrinsics that I didn't cover.

  • @bruhbrh7266
    @bruhbrh7266 11 months ago

    Is there a reason we are reading the memory bits from right to left?

    • @joelacarpenter
      @joelacarpenter  11 months ago +1

      Not aware of any fundamental reason; I think it's just to keep with the convention of writing numbers with the least significant digit on the right and the most significant on the left.
      Think of it as one big 256-digit binary number, just like you'd write the decimal number '1234' (one thousand, two hundred and thirty-four) with the least significant digit on the right, each digit to the left being a higher power of 10. Same idea, except this is binary. Or in C/C++, if you were writing a binary literal it'd be 0b1101010101, with the least significant bits on the right.
      The confusion comes if you're thinking of it like an array in C/C++, where you'd define it left-to-right, A = {0,1,2,3}, with the first element on the left. Don't get me wrong, you can also think of the AVX blocks as little arrays; it's just that they're normally illustrated so that the significance of the bits in the whole block increases from right to left, like one big number (even if you address them as blocks of individual numbers). Then no matter how you slice it, every chunk reads like its own binary number, right to left.

    • @bruhbrh7266
      @bruhbrh7266 11 months ago

      @@joelacarpenter Thank you so much for explaining!!

  • @pramodhgopalan1141
    @pramodhgopalan1141 2 years ago +1

    best lecture ever lmao

  • @fakhrihidayat3597
    @fakhrihidayat3597 1 year ago +1

    Lmao awesome lecture! Thanks!

  • @chinmayjain9616
    @chinmayjain9616 1 year ago

    memes were always on point

  • @reik2006
    @reik2006 2 years ago

    13:19 Did you mean to write
    sumTotal256 = _mm256_add_ps(*A256++, sumTotal256);
    assuming
    __m256 *A256 = (__m256 *)A;
    ? Otherwise this will only work if all elements of array A are equal.

    • @joelacarpenter
      @joelacarpenter  2 years ago +1

      Yeah, I think what I was probably trying to write was
      sumTotal256 = _mm256_add_ps(A256[i], sumTotal256);
      and then the for loop itself should iterate i++ instead of i+=8, with the loop running to the number of 8-float blocks (the array length divided by 8).

    • @reik2006
      @reik2006 2 years ago

      @@joelacarpenter Thank you for responding and making this video! You've given a nice overview of the available intrinsics, and your visualizations have been very helpful to me in deciphering the operation syntax in the Intel intrinsics guide.

    • @olegpatraschku3736
      @olegpatraschku3736 1 year ago

      ​@@joelacarpenter Again at 13:19, shouldn't the integers be 32 bits in size? 8 integers * 32 bits = 256 bits, but the ints on the slides are 64 bits: 8 ints * 64 = 512 bits.

    • @joelacarpenter
      @joelacarpenter  1 year ago

      @@olegpatraschku3736 Are you talking about the int64 'i' that I've used as the index? That's just for addressing the array, rather than the number format of the data inside the array itself. Normally I'd use 'size_t' rather than '__int64' for the index, for potentially large arrays. The arrays themselves in this case are 'ps', 'packed singles', i.e. 8 x float32.

    • @olegpatraschku3736
      @olegpatraschku3736 1 year ago

      @@joelacarpenter My mistake, it all made sense later on as I continued to watch the video :) The content is very good!
      (My personal "struggle" was with this "lane" concept; I don't think I fully understood why it's designed like this...)

  • @dandan1364
    @dandan1364 2 months ago

    Amazingly useful video thank you!!!