From Scratch: Matrix Multiplication in CUDA

  • Published 21 Oct 2024

Comments • 32

  • @lor3nzo37 · 4 years ago +7

    This video helped me infinitely to finish my college assignment - GREAT CONTENT MAN!

    • @CoffeeBeforeArch · 4 years ago +2

      Thanks! I'm glad you found it helpful!

  • @kid-vf4lu · 5 years ago +6

    This is such great content. I appreciate your efforts.
    If it's not too much work, can you demo how to work with images in CUDA? That's the most common use of CUDA, right?

    • @CoffeeBeforeArch · 5 years ago +1

      I don't know if it's true that CUDA is primarily used to work with images (people like running ML workloads on lots of things other than images). I can definitely do some work with images though!

  • @nishchayjavara5032 · 1 year ago +1

    Is this the global-memory version of matrix multiplication?

  • @cory99998 · 1 year ago +1

    Thank you! Very clear tutorial

  • @GokulNathXYZ · 3 years ago

    Around 8:30, you mention that "we can do the same padding trick as last time".
    Which previous video are you talking about? Is this video part of some series?
    Btw, thanks a lot for your content; it's really helpful.

  • @chizeu389 · 1 year ago

    I'm following along with the same code but ended up with a segmentation fault (core dumped). Any solutions?

  • @apkiller2812 · 3 years ago

    Hi - great video. Quick question... How would you implement parallel blocked GEMM using CUDA?

  • @exodus8213 · 1 year ago

    Hello,
    how do you do remove_postfix with different starting points in CUDA?

  • @mohammadanas2432 · 4 years ago +2

    Is linearizing the 2D array necessary, or can we use the original array?

    • @CoffeeBeforeArch · 4 years ago +2

      You're more than welcome to use a 2D array (e.g., a[N][N]) if you wish. Whatever works for you!

    • @mohammadanas2432 · 4 years ago +2

      @@CoffeeBeforeArch Thanks for replying to my question. Can you suggest some resources where I can learn how to do it without changing the array?

    • @CoffeeBeforeArch · 4 years ago +2

      The easiest way to do it (with a matrix width known at compile time) is with a quick typedef. In general, though, you should prefer linearized indexing: in many cases you don't know the dimensions of the matrix at compile time, and the workaround for that is rather messy and incurs unnecessary performance penalties. I've rewritten the example from this video to use 2D indexing in the link that follows. Hope this helps! github.com/CoffeeBeforeArch/from_scratch/blob/master/2d_array/matrix_mul.cu
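
      A rough sketch of that idea (row_t is just an illustrative name, and this is not the exact code from the linked file):

        constexpr int N = 1024;
        typedef float row_t[N];  // one row of an N-wide matrix

        // With N fixed at compile time, the compiler generates the
        // row * N + col arithmetic behind the [row][col] syntax.
        __global__ void matrixMul(const row_t *a, const row_t *b, row_t *c) {
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < N && col < N) {
            float tmp = 0;
            for (int k = 0; k < N; k++)
              tmp += a[row][k] * b[k][col];  // plain 2D indexing
            c[row][col] = tmp;
          }
        }

        // Allocation follows the same pattern:
        //   row_t *a;
        //   cudaMallocManaged(&a, N * N * sizeof(float));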

  • @aneesmd7837 · 3 years ago +2

    Hey, I'm getting an assertion-failed error if I set the number of threads to 64. Can anyone tell me what the reason might be and how to fix it?
    Please help me!

    • @CoffeeBeforeArch · 3 years ago +1

      If you're using 64 for threads (and the code from the video), that gives you a thread block with 4096 (64 * 64) threads, which is greater than the maximum of 1024 threads per thread block allowed by the hardware.
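
      For a square 2D block, the per-dimension count has to satisfy threads * threads <= 1024. A sketch in the same launch style (assuming a kernel matrixMul(a, b, c, N)):

        int THREADS = 32;                          // 32 * 32 = 1024, the max
        int BLOCKS = (N + THREADS - 1) / THREADS;  // ceiling division

        dim3 threads(THREADS, THREADS);
        dim3 blocks(BLOCKS, BLOCKS);
        matrixMul<<<blocks, threads>>>(a, b, c, N);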

  • @nguyentu6676 · 3 years ago

    Hi, can you please explain why I get the error "no kernel image is available for execution on the device"? Thank you!

    • @CoffeeBeforeArch · 3 years ago

      It's likely you're using a CUDA version that isn't supported by your GPU.

    • @nguyentu6676 · 3 years ago

      @@CoffeeBeforeArch Thanks for your reply. How can I find out which CUDA version is supported by my GPU?
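
      (One way to check is to query the GPU's compute capability, which determines the architectures a given CUDA toolkit can target. A minimal sketch using the runtime API:)

        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);  // query device 0
          printf("%s: compute capability %d.%d\n",
                 prop.name, prop.major, prop.minor);
          return 0;
        }

      (The CUDA release notes list which compute capabilities each toolkit version supports.)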

  • @tempdeltavalue · 1 year ago

    You didn't use memcpy in this video. Is it not needed?

    • @CoffeeBeforeArch · 1 year ago +2

      No, not if you're using cudaMallocManaged (the CUDA runtime manages the transfer of memory back and forth for you)
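
      The managed-memory pattern in a nutshell (a sketch, assuming a kernel matrixMul and a launch configuration set up as earlier):

        size_t bytes = N * N * sizeof(float);

        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);  // accessible from host and device
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);

        // ... initialize a and b on the host ...

        matrixMul<<<blocks, threads>>>(a, b, c, N);
        cudaDeviceSynchronize();  // wait for the kernel to finish

        // c can now be read directly on the host -- no cudaMemcpy needed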

  • @Teslawaverunner · 5 years ago

    I'm trying to figure out the logic behind cudaMallocManaged(&a, bytes) needing the ampersand when a is an array. Surely it's not wanting the address of the pointer?

    • @CoffeeBeforeArch · 5 years ago

      If you look at the API, the first parameter of cudaMallocManaged is a void** devPtr. The call returns in *devPtr a pointer to the allocated memory. By passing in the address of your pointer, the call simply overwrites it with a pointer to the new memory.
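
      In other words (a sketch):

        float *a = nullptr;  // ordinary pointer, currently points nowhere
        // &a is the ADDRESS of that pointer; cudaMallocManaged writes the
        // address of the new allocation into it via the void** parameter.
        cudaMallocManaged(&a, bytes);
        // a now points at the managed allocation.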

    • @Teslawaverunner · 5 years ago

      CoffeeBeforeArch Thank you for the detailed response. That makes sense now, i.e., it is indeed the address of the pointer that it needs.

  • @apekshamegeri3322 · 2 years ago

    I'm doing an experiment with this code (everything the same except that I initialized matrices A and B in host memory and then copied them to the device using cudaMemcpy). I verified matrix sizes from 2x2 and 8x8 up to 2048x2048. But when I changed int N to 4096, the verification failed.

    • @CoffeeBeforeArch · 2 years ago

      You might just be running into a memory error (depending on your GPU), since each 4k x 4k matrix is 64 MB. CUDA runtime API calls may fail without crashing your program. I'd check whether your kernel is even running on the device (and whether your other API calls are completing successfully).
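
      A minimal error-checking sketch (illustrative; wrapping every runtime call and checking the launch makes silent failures visible; needs <cstdio> and <cstdlib>):

        #define CUDA_CHECK(call)                                        \
          do {                                                          \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
              fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                      cudaGetErrorString(err), __FILE__, __LINE__);     \
              exit(1);                                                  \
            }                                                           \
          } while (0)

        CUDA_CHECK(cudaMallocManaged(&a, bytes));
        matrixMul<<<blocks, threads>>>(a, b, c, N);
        CUDA_CHECK(cudaGetLastError());       // catches launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors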

    • @jugszy · 2 years ago

      @@CoffeeBeforeArch Thanks, we measured execution times for the GPU operation and the CPU verification. They differ drastically as we move up in matrix size, so we think the kernel is being executed on the GPU. We changed BLOCKS = (N + threads - 1) / threads to BLOCKS = 256 directly and it worked. We're not sure why!

  • @LuizFernando-je2lx · 1 year ago +1

    tysm

  • @christopherfranko3531 · 3 years ago

    The audio is soooo low. Like you don't want someone in the next room to hear you.