From Scratch: Matrix Multiplication in CUDA

  • Published 21 Oct 2024

Comments • 32

  • @lor3nzo37 · 4 years ago +7

    This video helped me infinitely to finish my college assignment - GREAT CONTENT MAN!

    • @CoffeeBeforeArch · 4 years ago +2

      Thanks! I'm glad you found it helpful!

  • @kid-vf4lu · 5 years ago +6

    This is such great content. I appreciate your efforts.
    If it's not too much work, can you demo how to work with images in CUDA? That's the most common use of CUDA, right?

    • @CoffeeBeforeArch · 5 years ago +1

      I don't know if it's true that CUDA is primarily used to work with images (people like running ML workloads on lots of things other than images). I can definitely do some work with images though!

  • @nishchayjavara5032 · 1 year ago +1

    Is this the global-memory version of matrix multiplication?

  • @cory99998 · 1 year ago +1

    Thank you! Very clear tutorial

  • @GokulNathXYZ · 3 years ago

    Around 8:30, you mention that "we can do the same padding trick as last time".
    Which previous video are you talking about? Is this video part of some series?
    Btw, thanks a lot for your content; it's really helpful.

  • @chizeu389 · 1 year ago

    I'm following along with the same code but ended up with a segmentation fault (core dumped). Any solutions?

  • @apkiller2812 · 3 years ago

    Hi - great video. Quick question... How would you implement parallel blocked GEMM using CUDA?

  • @exodus8213 · 1 year ago

    Hello,
    how do you do remove_postfix with different starting points in CUDA?

  • @mohammadanas2432 · 4 years ago +2

    Is linearizing the 2D array necessary, or can we use the original array?

    • @CoffeeBeforeArch · 4 years ago +2

      You're more than welcome to use a 2D array (e.g., a[N][N]) if you wish. Whatever works for you!

    • @mohammadanas2432 · 4 years ago +2

      @@CoffeeBeforeArch Thanks for replying to my question. Can you suggest some resources where I can learn how to do it without changing the array?

    • @CoffeeBeforeArch · 4 years ago +2

      The easiest way to do it (with a matrix width known at compile time) is with a quick typedef. In general, though, you should prefer linearized indexing: in many cases you don't know the dimensions of the matrix at compile time, and the workaround for that is rather messy and incurs unnecessary performance penalties. I've rewritten the example from this video to use 2D indexing in the link that follows. Hope this helps! github.com/CoffeeBeforeArch/from_scratch/blob/master/2d_array/matrix_mul.cu
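
      A rough sketch of that idea (row_t is just an illustrative name, and this is not the exact code from the linked file):

        constexpr int N = 1024;
        typedef float row_t[N];  // one row of an N-wide matrix

        // With N fixed at compile time, the compiler generates the
        // row * N + col arithmetic behind the [row][col] syntax.
        __global__ void matrixMul(const row_t *a, const row_t *b, row_t *c) {
          int row = blockIdx.y * blockDim.y + threadIdx.y;
          int col = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < N && col < N) {
            float tmp = 0;
            for (int k = 0; k < N; k++)
              tmp += a[row][k] * b[k][col];  // plain 2D indexing
            c[row][col] = tmp;
          }
        }

        // Allocation follows the same pattern:
        //   row_t *a;
        //   cudaMallocManaged(&a, N * N * sizeof(float));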

  • @aneesmd7837 · 3 years ago +2

    Hey, I'm getting an assertion-failed error if I set the number of threads to 64. Can anyone tell me what the reason might be and how to fix it?
    Please help me!

    • @CoffeeBeforeArch · 3 years ago +1

      If you're using 64 for threads (and the code from the video), that gives you a thread block with 4096 (64 * 64) threads, which is greater than the maximum of 1024 threads per thread block allowed by the hardware.
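
      For a square 2D block, the per-dimension count has to satisfy threads * threads <= 1024. A sketch in the same launch style (assuming a kernel matrixMul(a, b, c, N)):

        int THREADS = 32;                          // 32 * 32 = 1024, the max
        int BLOCKS = (N + THREADS - 1) / THREADS;  // ceiling division

        dim3 threads(THREADS, THREADS);
        dim3 blocks(BLOCKS, BLOCKS);
        matrixMul<<<blocks, threads>>>(a, b, c, N);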

  • @nguyentu6676 · 3 years ago

    Hi, can you please explain why I get the error "no kernel image is available for execution on the device"? Thank you!

    • @CoffeeBeforeArch · 3 years ago

      It's likely you're using a CUDA version that isn't supported by your GPU.

    • @nguyentu6676 · 3 years ago

      @@CoffeeBeforeArch Thanks for your reply. How can I find out which CUDA version is supported by my GPU?
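
      (One way to check is to query the GPU's compute capability, which determines the architectures a given CUDA toolkit can target. A minimal sketch using the runtime API:)

        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);  // query device 0
          printf("%s: compute capability %d.%d\n",
                 prop.name, prop.major, prop.minor);
          return 0;
        }

      (The CUDA release notes list which compute capabilities each toolkit version supports.)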

  • @tempdeltavalue · 1 year ago

    You didn't use memcpy in this video. Is it not needed?

    • @CoffeeBeforeArch · 1 year ago +2

      No, not if you're using cudaMallocManaged (the CUDA runtime manages the transfer of memory back and forth for you)
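
      The managed-memory pattern in a nutshell (a sketch, assuming a kernel matrixMul and a launch configuration set up as earlier):

        size_t bytes = N * N * sizeof(float);

        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);  // accessible from host and device
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);

        // ... initialize a and b on the host ...

        matrixMul<<<blocks, threads>>>(a, b, c, N);
        cudaDeviceSynchronize();  // wait for the kernel to finish

        // c can now be read directly on the host -- no cudaMemcpy needed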

  • @Teslawaverunner · 5 years ago

    I'm trying to figure out the logic behind cudaMallocManaged(&a, bytes) needing the ampersand when a is an array. Surely it's not wanting the address of the pointer?

    • @CoffeeBeforeArch · 5 years ago

      If you look at the API, the first parameter of cudaMallocManaged is a void** devPtr. The call returns in *devPtr a pointer to the allocated memory. By passing in the address of your pointer, the call simply overwrites it with a pointer to the new memory.
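
      In other words (a sketch):

        float *a = nullptr;  // ordinary pointer, currently points nowhere
        // &a is the ADDRESS of that pointer; cudaMallocManaged writes the
        // address of the new allocation into it via the void** parameter.
        cudaMallocManaged(&a, bytes);
        // a now points at the managed allocation.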

    • @Teslawaverunner · 5 years ago

      CoffeeBeforeArch Thank you for the detailed response. That makes sense now, i.e., it is indeed the address of the pointer that it needs.

  • @apekshamegeri3322 · 2 years ago

    I'm doing an experiment with this code (everything the same except that I initialized matrices A and B in host memory and then copied them to the device using cudaMemcpy). I verified matrix sizes from 2x2 and 8x8 up to 2048x2048. But when I changed int N to 4096, the verification failed.

    • @CoffeeBeforeArch · 2 years ago

      You might just be running into a memory error (depending on your GPU), since each 4k x 4k matrix is 64 MB. CUDA runtime API calls may fail without crashing your program. I'd check whether your kernel is even running on the device (and whether your other API calls are completing successfully).
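
      A minimal error-checking sketch (illustrative; wrapping every runtime call and checking the launch makes silent failures visible; needs <cstdio> and <cstdlib>):

        #define CUDA_CHECK(call)                                        \
          do {                                                          \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
              fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                      cudaGetErrorString(err), __FILE__, __LINE__);     \
              exit(1);                                                  \
            }                                                           \
          } while (0)

        CUDA_CHECK(cudaMallocManaged(&a, bytes));
        matrixMul<<<blocks, threads>>>(a, b, c, N);
        CUDA_CHECK(cudaGetLastError());       // catches launch errors
        CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors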

    • @jugszy · 2 years ago

      @@CoffeeBeforeArch Thanks, we measured execution times for the GPU operation and the CPU verification. They differ drastically as we move up in matrix size, so we think the kernel is being executed on the GPU. We changed BLOCKS = (N + threads - 1) / threads to BLOCKS = 256 directly and it worked. We're not sure why!

  • @LuizFernando-je2lx · 1 year ago +1

    tysm

  • @christopherfranko3531 · 3 years ago

    The audio is soooo low. Like you don't want someone in the next room to hear you.