Some days the internet makes me sad. Other days it reminds me of all the people with the same niche interests as me and how incredibly talented some of them are. Thanks for putting so much effort into this :)
Thanks a lot. Glad you liked the video 😃
*skilled. not talented. talent is god-given. skill is developed through practice. I think it's a little disrespectful to call someone talented - almost makes it seem like they didn't work for it.
🙂
This is why social media is amazing. No mainstream outlet would make this: too niche and too hard for the general public. But here you can find hidden diamonds.
"MINI" Project? What the heck?! You just crunched through a lot of hard-to-grasp technical implementations, coded a working example, shared it on your blog, AND made a fully animated video about it!! You make me mad.
😅
That was a click bait for me, cause it ain't MINI at all
Guess it IS mini because he covered a single operation in depth. Great work anyway. Cheers to the author!
@@wlatol6512 makes sense
I aspire toward the day when i can do something similar and have the audacity to call it a "mini project" 😂
Thank you for sharing your brilliance and curiosity with the world my friend 💜
I am reading the CUDA C programming book and your videos are super helpful in visualizing the memory access process! Thank you very much!
Glad it was helpful!
This is not a "mini" project, you made some real content here! Fantastic video, congrats!
Thanks! 😃
Hey, I’ve been going through the Programming Massively Parallel Processors book lately and doing some CUDA and this was a GREAT video!!!
Thanks a lot! Glad the video was helpful!
Great stuff Tushar. I have been keen on learning GPU programming so great to see your videos in my feed. Keep it up and all the best.
Great to hear! 😀
Yay, CUDA video. Feel like my timeline has been needing CUDA content
I am from embedded systems background but I love your work. Keep Going brother, Just don't quit ! There's always an audience for great content.
Thanks a lot! I’m in this for the long run 😃
dude this is actually amazing. you’re the cs version of 3b1b… keep up the great work!
Thanks a lot! I appreciate it 😃
Good job. The reason workgroups are laid out in 1d/2d/3d grids is that all GPU compute APIs were first designed and implemented on top of existing graphics concepts where calculating outputs as e.g. 2D grids is a natural thing.
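The comment above explains why compute APIs expose 1D/2D/3D launch grids. A minimal CPU-side sketch of that indexing (grid and block sizes are illustrative, names mirror CUDA's `blockIdx`/`threadIdx` convention):

```python
# CPU simulation of CUDA-style 2D grid/block indexing (hypothetical sizes).
# Each (block, thread) pair maps to one element of a 2D output, exactly the
# way a graphics API maps workgroups onto an image.

def global_indices(grid_dim, block_dim):
    """Yield (row, col) for every thread in a 2D launch, like
    row = blockIdx.y * blockDim.y + threadIdx.y (and likewise for col)."""
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    row = by * block_dim[1] + ty
                    col = bx * block_dim[0] + tx
                    yield row, col

# A 2x2 grid of 2x2 blocks covers a 4x4 output exactly once.
covered = sorted(global_indices((2, 2), (2, 2)))
assert covered == [(r, c) for r in range(4) for c in range(4)]
print("every element of the 4x4 grid is covered exactly once")
```

The same flattening generalizes to 3D, which is where the sometimes-confusing (z, y, x) loop ordering comes from.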
Super satisfying to see Manim used to show the algorithm like that.
Glad you enjoy it!
Great video. I love the simplicity, and the great explanation.
Thanks. Glad you found it useful.
Very cool project, I will definitely go through the project code in the evening!
Great! Please do let me know what you think 😃
Hey, great video. Are the animations at 4:58 detailing the matrix multiplications correct? It looks like there's a mistake in what's being selected by the animation. The column vector should be repeated for each dot product shown, but that doesn't look to be the case.
I didn't show the complete set of calculations there. My point was to just show the memory access patterns. 🙂
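The access pattern the commenter describes can be traced explicitly. A small sketch (hypothetical 2x2 matrices) that records which elements of A and B each multiply-add touches, confirming that the same column of B really is re-read for every output row:

```python
# Sketch of the memory accesses in a naive matrix multiply (row-major,
# illustrative 2x2 example): every output element C[i][j] reads one full
# row of A and one full column of B.

def matmul_with_trace(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    trace = []  # (element of A, element of B) read per multiply-add
    for i in range(n):
        for j in range(m):
            for p in range(k):
                trace.append((("A", i, p), ("B", p, j)))
                C[i][j] += A[i][p] * B[p][j]
    return C, trace

C, trace = matmul_with_trace([[1, 2], [3, 4]], [[5, 6], [7, 8]])
assert C == [[19, 22], [43, 50]]
# The same column of B (j fixed at 0) is re-read once per row of A:
b_reads_col0 = [b for a, b in trace if b[2] == 0]
assert b_reads_col0 == [("B", 0, 0), ("B", 1, 0)] * 2
```

That redundant re-reading of B's columns is exactly what tiling into shared memory is meant to amortize.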
Thanks for the research! Keep going! I would like to see other algorithms being run and optimized on GPUs...
Beautiful explanation and animation
Beautiful visualization!! i am enjoying watching your videos. Keep up the good work
Thank you! Cheers!
Beautiful video as usual. I'm motivated to pick up PMPP after the sem ends just from watching your videos!
Thanks a lot and good luck 😃
Great explanation. Thanks
Glad you liked the video 😃
Crazy good manim skills! perfect video
Appreciate it!
yay i know a little about this now, thank you!!
Your videos have similar vibes to 3Blue1Brown Channel!
Great content
Glad you liked the video 😃
Yes, because he uses the python library Manim created by 3B1B!
thanks for this amazing vid brother!
Thanks. Glad you liked the video 😃
Your content is very helpful and your teaching method is great.
This is beautiful. I would love to see what it takes to reach the cublas implementation.
You can check out the references in the video description. There’s a blog post (by someone else) who got around 90% of the cuBLAS performance.
Can you talk about scan operations like Blelloch and Hillis/Steele algorithms? Maybe you've already talked about them I haven't seen all your videos. This would be nice to know in what context they're used and provide a cool visualisation of them too.
I can take these topics for some future videos. Thanks a lot for suggesting. 😀
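Since the thread above asks about Hillis/Steele and Blelloch scans, here is a minimal CPU sketch of the Hillis-Steele inclusive scan, the step-efficient variant a GPU would run with one thread per element (the function name and sequential loop are illustrative; on a GPU each step updates all elements in parallel):

```python
# CPU sketch of the Hillis-Steele inclusive scan (prefix sum).
# On a GPU, each of the O(log n) steps is one parallel pass over the array.

def hillis_steele_scan(xs):
    """Inclusive prefix sum in O(log n) parallel steps."""
    out = list(xs)
    step = 1
    while step < len(out):
        # Copy to emulate reading the previous step's values, the way a
        # GPU implementation double-buffers or synchronizes between steps.
        prev = list(out)
        for i in range(len(out)):
            if i >= step:
                out[i] = prev[i] + prev[i - step]
        step *= 2
    return out

assert hillis_steele_scan([1, 2, 3, 4, 5]) == [1, 3, 6, 10, 15]
```

Blelloch's work-efficient scan does the same job in an up-sweep/down-sweep pair with O(n) total additions, which is why it is usually preferred for large arrays.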
Gonna enjoy this knowledge
High quality content. Subscribed.
Thanks a lot. I really appreciate it 😃
This is awesome. Have you considered using streams and unrolling the matrices?
I’ll consider this in future. My objective as of now is to showcase GPU programming, and keep things as simple as possible. Next videos will be on different algorithms and how GPU can be used there. But, I’ll definitely consider covering streams in future. Thanks a lot for your comment and I appreciate your input. Cheers! 😃
What was the reason for surprise at the x-horizontal / y-vertical layout? It's the standard convention for image processing, which is what GPUs are designed for
The y-axis is generally vertical (pointing up), and that part doesn't bother me much either. What confused me was the (z, y, x) ordering. I understand that GPUs weren't designed for these kinds of computations, but it always confused me when I started out.
@@0mean1sigma I think you mean i,j vs x,y. i often means the row in matrix operations, but x is always horizontal in Cartesian coordinates.
Fantastic video
This is so beautiful and magnificent to see ❤❤❤🎉
Thank you so much!
Amazing🎉.
Thank you!
Is the nvidia implementation using wmma (Warp Matrix Functions)?
Either way it would be interesting to see how the performance would be impacted if the tensor cores were used as well! Good video!
I have a video showing the use of tensor cores in a very basic way. I'm not using tensor cores in this video, but I'll definitely consider a similar step-by-step approach using wmma. Thanks a lot for the suggestion 😀
really nice stuff, thanks
Glad you liked it!
this is golden
Great content!
One thing I have never understood is, for matrix multiplication, what is the sort of threshold in size that makes a GPU implementation faster than a CPU implementation? If I want to do one million multiplications of 4 x 4 matrices, ignoring the overhead required to set up the computations, is the GPU faster than the CPU? Surely not. What about 100 x 100? 1000 x 1000?
GPUs are suited for large datasets. I can't specify a number because it will depend on the algorithm and the GPU specs.
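The question above can at least be framed with a back-of-envelope model. All the numbers below are illustrative assumptions, not measurements, and the model ignores kernel-launch overhead and the fact that small kernels can't saturate a GPU, so real crossovers are considerably larger:

```python
# Toy roofline-style model: the GPU wins on an n x n float32 matmul once
# its compute advantage outweighs the PCIe cost of shipping the matrices.
# cpu_gflops, gpu_gflops, and pcie_gb_s are made-up, plausible figures.

def cpu_time(n, cpu_gflops=50):
    return 2 * n**3 / (cpu_gflops * 1e9)          # matmul is ~2n^3 FLOPs

def gpu_time(n, gpu_gflops=5000, pcie_gb_s=16):
    transfer = 3 * n * n * 4 / (pcie_gb_s * 1e9)  # A, B, C as float32
    return 2 * n**3 / (gpu_gflops * 1e9) + transfer

# Smallest n where the GPU (including transfers) beats the CPU in this model.
crossover = next(n for n in range(1, 20000) if gpu_time(n) < cpu_time(n))
print(f"model crossover around n = {crossover}")
```

For the batched 4x4 case in the question, the right comparison is different again: a million independent 4x4 multiplies is plenty of parallel work, so the answer hinges almost entirely on whether the data already lives on the GPU.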
What did you do? For the GFLOPS comparison, did you develop your own function to handle matrix multiplication?
I wrote SGEMM from scratch (it runs on a GPU).
@ got it 🔥great man !!
I'm too dumb to understand this right now, but I know this is something good. I'll understand it someday.
I study linear algebra and I'm shocked right now, because it turns out to be important for programming the hardware.
I have a question... where can I learn these concepts?
I’ve provided some of the links in the video description. There are also some good textbooks. Good luck!
Nice Manim work!
I was hoping that someone would simplify GPU matrix multiplication for me, so thank you.
Glad you found it useful 😃
If you use Strassen's method for multiplying 2x2 blocks, you cut the eight multiplications down to seven, which compounds into large savings for big matrices (O(n^2.807) instead of O(n^3)).
I'll definitely try that in future. Thanks a lot for your comment 😀
@@0mean1sigma 👍👍👍👍
@@0mean1sigma th-cam.com/video/0oJyNmEbS4w/w-d-xo.html
That is where I learnt it 😃
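The Strassen suggestion in this thread can be sketched at the 2x2 level. This is one level of the standard recursion (variable names are illustrative; in practice the entries would themselves be sub-blocks):

```python
# One level of Strassen's method: 7 multiplications instead of 8 for a
# 2x2 block, the source of the O(n^2.807) asymptotic bound.

def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

assert strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

Worth noting that on GPUs the extra additions and irregular memory traffic often eat the savings, which is why tuned SGEMM libraries mostly stick with the classical algorithm.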
Thanks for the video, from where did you learn all this stuff? Any book or course?
I've put links to a couple of Blog posts in the description. They were very helpful (especially when it came to verifying my code).
But you didn’t take any course right?
Nope
Are you using manim?
Yes he is
Yet another Indian banger video
Any plans to make more hands on gpu/cuda tutorials? That show more of the specific syntax instead of algorithms and techniques?
Or do you know where i can find quality tutorials for that?
Oh, I'm guessing your blog is that place? I'll check it out.
Yes, you can find more details on my blog and I also open source my code.
Thanks for your service.
Why does it feel like a 3blue1brown vid... Do you use Manim???
Yes
W project
Where is Cuda/C/C++
Check out the links in the description
So, after trying your benchmarks, I found that cuBLAS is still the fastest compared to any of your approaches.
Nice video, though.
Thanks a lot for your comment. Yes, my implementations are slower than cuBLAS, but my focus was more on understanding GPU programming concepts.
thanks but the music is very annoying can't focus
Will keep this in mind for the future 😃
5 minutes in and I’m starting to stroke
Come on, now use tensor cores
Mini💀
wtf did I just watch? 😬
Are you using manim??
Yes
Where did you learn that? It's so clean asf @@0mean1sigma