Getting Started With CUDA for Python Programmers

  • Published Jun 20, 2024
  • I used to find writing CUDA code rather terrifying. But then I discovered a couple of tricks that actually make it quite accessible. In this video I introduce CUDA in a way that will be accessible to Python folks, & I even show how to do it all for free in Colab!
    Notebooks
    This is lecture 3 of the "CUDA Mode" series (but you don't need to watch the others first). The notebook is available in the lecture3 folder here: github.com/cuda-mode/lectures . Or access it directly via Colab here: colab.research.google.com/dri...
    Here's a link to the thread that shows how to install CUDA on Linux or WSL: / 1697435241152127369
    GPT4 auto-generated summary
    In this comprehensive video tutorial, Jeremy Howard from answer.ai demystifies the process of programming NVIDIA GPUs using CUDA, and simplifies the perceived complexities of CUDA programming. Jeremy emphasizes the accessibility of CUDA, especially when combined with PyTorch's capabilities, allowing for programming directly in notebooks rather than traditional compilers and terminals. To make CUDA more approachable to Python programmers, Jeremy shows step by step how to start with Python implementations, and then convert them largely automatically to CUDA. This approach, he argues, simplifies debugging and development.
    The tutorial is structured in a hands-on manner, encouraging viewers to follow along in a Colab notebook. Jeremy uses practical examples, starting with converting an RGB image to grayscale using CUDA, demonstrating the process step-by-step. He further explains the memory layout in GPUs, emphasizing the differences from CPU memory structures, and introduces key CUDA concepts like streaming multi-processors and CUDA cores.
    Jeremy then delves into more advanced topics, such as matrix multiplication, a critical operation in deep learning. He demonstrates how to implement matrix multiplication in Python first and then translates it to CUDA, highlighting the significant performance gains achievable with GPU programming. The tutorial also covers CUDA's intricacies, such as shared memory, thread blocks, and optimizing CUDA kernels.
    The tutorial also includes a section on setting up the CUDA environment on various systems using Conda, making it accessible for a wide range of users.
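    The Python-first workflow the summary describes (write the kernel as a plain Python function, then run it over a simulated grid of blocks and threads) can be sketched roughly as follows. This is a minimal sketch assuming NumPy; the function names are illustrative rather than the notebook's exact code, and 0.2989/0.5870/0.1140 are the standard Rec. 601 luma weights:

```python
import numpy as np

def grayscale_kernel(i, out, red, green, blue, n):
    # One "thread": convert a single pixel, guarding against out-of-range indices.
    if i < n:
        out[i] = 0.2989 * red[i] + 0.5870 * green[i] + 0.1140 * blue[i]

def run_kernel(kernel, n_blocks, threads_per_block, *args):
    # Simulate CUDA's grid of blocks and threads with plain Python loops.
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            kernel(block * threads_per_block + thread, *args)

img = np.random.rand(3, 4, 5)           # channel-first RGB image
c, h, w = img.shape
n = h * w
flat = img.reshape(c, n)                # flatten each channel to 1D
out = np.empty(n)
threads = 256
blocks = (n + threads - 1) // threads   # ceiling division
run_kernel(grayscale_kernel, blocks, threads, out, flat[0], flat[1], flat[2], n)
gray = out.reshape(h, w)
```

    Each (block, thread) pair maps to one pixel index, and the bounds guard mirrors what a real CUDA kernel needs, since the grid usually overshoots the data size.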
    Timestamps
    - 00:00 Introduction to CUDA Programming
    - 00:32 Setting Up the Environment
    - 01:43 Recommended Learning Resources
    - 02:39 Starting the Exercise
    - 03:26 Image Processing Exercise
    - 06:08 Converting RGB to Grayscale
    - 07:50 Understanding Image Flattening
    - 11:04 Executing the Grayscale Conversion
    - 12:41 Performance Issues and Introduction to CUDA Cores
    - 14:46 Understanding Cuda and Parallel Processing
    - 16:23 Simulating Cuda with Python
    - 19:04 The Structure of Cuda Kernels and Memory Management
    - 21:42 Optimizing Cuda Performance with Blocks and Threads
    - 24:16 Utilizing Cuda's Advanced Features for Speed
    - 26:15 Setting Up Cuda for Development and Debugging
    - 27:28 Compiling and Using Cuda Code with PyTorch
    - 28:51 Including Necessary Components and Defining Macros
    - 29:45 Ceiling Division Function
    - 30:10 Writing the CUDA Kernel
    - 32:19 Handling Data Types and Arrays in C
    - 33:42 Defining the Kernel and Calling Conventions
    - 35:49 Passing Arguments to the Kernel
    - 36:49 Creating the Output Tensor
    - 38:11 Error Checking and Returning the Tensor
    - 39:01 Compiling and Linking the Code
    - 40:06 Examining the Compiled Module and Running the Kernel
    - 42:57 Cuda Synchronization and Debugging
    - 43:27 Python to Cuda Development Approach
    - 44:54 Introduction to Matrix Multiplication
    - 46:57 Implementing Matrix Multiplication in Python
    - 50:39 Parallelizing Matrix Multiplication with Cuda
    - 51:50 Utilizing Blocks and Threads in Cuda
    - 58:21 Kernel Execution and Output
    - 58:28 Introduction to Matrix Multiplication with CUDA
    - 1:00:01 Executing the 2D Block Kernel
    - 1:00:51 Optimizing CPU Matrix Multiplication
    - 1:02:35 Conversion to CUDA and Performance Comparison
    - 1:07:50 Advantages of Shared Memory and Further Optimizations
    - 1:08:42 Flexibility of Block and Thread Dimensions
    - 1:10:48 Encouragement and Importance of Learning CUDA
    - 1:12:30 Setting Up CUDA on Local Machines
    - 1:12:59 Introduction to Conda and its Utility
    - 1:14:00 Setting Up Conda
    - 1:14:32 Configuring Cuda and PyTorch with Conda
    - 1:15:35 Conda's Improvements and Compatibility
    - 1:16:05 Benefits of Using Conda for Development
    - 1:16:40 Conclusion and Next Steps
    Thanks to @wolpumba4099 for the chapter timestamps. Summary description provided by GPT4.

Comments • 44

  • @wadejohnson4542
    @wadejohnson4542 4 months ago +61

    Jeremy Howard: a true hero of the common man. Thank you for this.

  • @wolpumba4099
    @wolpumba4099 4 months ago +33

    For details see: pastebin.com/vMakt9Mq
    I think YouTube kind of shadow-banned me and I can't post Summary 1/2

    • @wolpumba4099
      @wolpumba4099 4 months ago +8

      *Summary 2/2*
      *Examining the Compiled Module and Running the Kernel*
      - 40:06 Observe the module compilation process and identify the resulting files such as `main.cpp`, `main.o`, and others.
      - 41:15 Ensure the input image tensor is contiguous and on CUDA before passing it to the kernel.
      - 41:42 Run the kernel on a full-sized image, highlighting the significant performance improvement from 1.5 seconds to 1 millisecond due to compiled code and GPU acceleration.
      *Cuda Synchronization and Debugging*
      - 42:57 Discusses how synchronization of data between GPU and CPU can be triggered manually or by printing a value.
      - 43:19 After synchronization, the same grayscale image is obtained, confirming successful Cuda kernel execution.
      *Python to Cuda Development Approach*
      - 43:27 Explains that writing Cuda kernels in Python and converting them to Cuda is uncommon but preferred for ease of debugging.
      - 44:25 Argues that this method is effective for developing and debugging Cuda kernels, as it bypasses the painful traditional Cuda development process.
      *Introduction to Matrix Multiplication*
      - 44:54 Describes matrix multiplication as a fundamental operation in deep learning.
      - 45:14 Explains the process of matrix multiplication using input matrices M and N.
      - 46:03 Emphasizes the ubiquity of matrix multiplication in neural networks, although it's typically handled by libraries.
      *Implementing Matrix Multiplication in Python*
      - 46:57 Uses the MNIST dataset to illustrate matrix multiplication.
      - 47:26 Describes creating a single layer of a neural network through matrix multiplication without an activation function.
      - 48:14 Implements a smaller example in Python for performance reasons due to Python's slow execution speed.
      - 49:31 Explains the Python implementation step by step, emphasizing the importance of checking each line for errors.
      - 50:17 Shows the performance of the implemented matrix multiplication in Python, highlighting its slowness.
      *Parallelizing Matrix Multiplication with Cuda*
      - 50:39 Discusses converting the innermost loop of matrix multiplication to a Cuda kernel to allow parallel execution.
      - 51:18 Describes how each cell in the output tensor will be handled by one Cuda thread for the dot product.
      *Utilizing Blocks and Threads in Cuda*
      - 51:50 Explains Cuda's ability to index into 2D and 3D grids using blocks and threads.
      - 53:05 Describes how blocks are indexed with a tuple and how threads are indexed within blocks.
      - 54:44 Develops a kernel runner using four nested loops to iterate through blocks and threads in both dimensions.
      - 57:01 Details running the matrix multiplication kernel, ensuring the current row and column are within the bounds of the tensor.
      *Kernel Execution and Output*
      - 58:21 Completes the kernel execution by performing the dot product and placing the result in the output tensor.
      *Introduction to Matrix Multiplication with CUDA*
      - 58:28 Discusses the ability to call the matrix multiplication function using the height and width of input matrices and the inner dimensions K and K2.
      - 58:42 Explains that threads per block is a pair of numbers for two-dimensional inputs.
      - 58:56 Notes that threads per block must multiply to 256 and should not exceed 1024.
      - 59:31 Mentions the maximum number of blocks for each dimension.
      - 59:49 Describes each symmetric multiprocessor's capability to run blocks and access shared memory.
      *Executing the 2D Block Kernel*
      - 1:00:01 Details how to use the ceiling division for block dimensions.
      - 1:00:14 Outlines the process to call a 2D block kernel runner with necessary inputs and dimensions.
      - 1:00:35 Validates the output of the 2D block matrix multiplication.
      *Optimizing CPU Matrix Multiplication*
      - 1:00:51 Introduces the CUDA version of matrix multiplication.
      - 1:01:16 Describes using a broadcasting approach for a fast CPU-based matrix multiplication.
      - 1:01:34 Compares the broadcasting approach to nested loops and confirms its efficiency.
      - 1:02:00 Measures the time taken for the broadcasting approach on the full input matrices.
      *Conversion to CUDA and Performance Comparison*
      - 1:02:35 Explains how to convert Python code to CUDA using ChatGPT.
      - 1:03:01 Describes the C kernel and its similarities to the Python version.
      - 1:03:20 Discusses the process of invoking the CUDA kernel from PyTorch.
      - 1:03:56 Emphasizes the importance of assertions in CUDA code.
      - 1:04:14 Introduces the dim3 structure for specifying threads per block in CUDA.
      - 1:04:58 Explains how to call the CUDA kernel with the dim3 structure.
      - 1:06:13 Summarizes the steps to compile and run the CUDA module.
      - 1:06:41 Reports the performance improvement using CUDA over the optimized CPU approach.
      *Advantages of Shared Memory and Further Optimizations*
      - 1:07:50 Discusses the benefits of shared memory in the GPU for optimizing matrix multiplication.
      - 1:08:02 Highlights the potential for caching to improve performance with shared memory.
      *Flexibility of Block and Thread Dimensions*
      - 1:08:42 Explains that 2D or 3D blocks are optional and can be replaced with 1D blocks if preferred.
      - 1:09:04 Provides an example of converting RGB to grayscale using 2D blocks instead of 1D.
      - 1:09:18 Compares code complexity between 1D and 2D block versions for the same task.
      - 1:10:17 Emphasizes that either approach yields the same result, and the choice depends on convenience.
      *Encouragement and Importance of Learning CUDA*
      - 1:10:48 Encourages data scientists and Python programmers to learn CUDA.
      - 1:11:03 Suggests that writing CUDA code is increasingly important for modern complex models.
      - 1:11:54 Mentions that models are becoming more sophisticated, often requiring CUDA for efficiency.
      *Setting Up CUDA on Local Machines*
      - 1:12:30 Discusses the possibility of setting up CUDA on personal or cloud machines.
      - 1:12:44 Outlines a simple setup process for CUDA on different operating systems.
      *Introduction to Conda and its Utility*
      - 1:12:59 Mac and CUDA compatibility issues, with a link to be provided in the video notes.
      - 1:13:17 Introduction to Conda as a misunderstood tool, not a replacement for pip or poetry.
      - 1:13:34 Conda's ability to manage multiple versions of Python, Cuda, and C++ compilation systems.
      *Setting Up Conda*
      - 1:14:00 Explaining the ease of setting up Conda with a script for installation.
      - 1:14:24 Instructions on restarting the terminal after running the Conda installation script.
      *Configuring Cuda and PyTorch with Conda*
      - 1:14:32 Finding the correct version of Cuda for PyTorch.
      - 1:14:48 Command for installing the correct version of Cuda.
      - 1:15:04 Installing all necessary Nvidia tools directly from Nvidia.
      - 1:15:22 Installing PyTorch with the correct Cuda version, removing the unnecessary 'nightly' tag.
      *Conda's Improvements and Compatibility*
      - 1:15:35 Mention of Conda's previous slow solver and recent improvements for speed.
      - 1:15:58 Conda's compatibility across various operating systems like WSL, Ubuntu, Fedora, and Debian.
      *Benefits of Using Conda for Development*
      - 1:16:05 Recommending Conda for local development without the need for Docker.
      - 1:16:18 Ability to switch between different versions of tools without hassle.
      - 1:16:31 Conda's efficient use of hard drive space through hard linking shared libraries.
      *Conclusion and Next Steps*
      - 1:16:40 Guidance on getting started with development using Conda on a local machine or the cloud.
      - 1:16:47 Closing remarks and encouragement to create with Cuda.
      - 1:17:01 Suggestions to watch other Cuda mode lectures and to try out personal projects.
      - 1:17:12 Examples of potential projects to implement using Cuda.
      - 1:17:38 Advice on improving skills by reading other people's code and examples provided.
      Disclaimer: I used gpt4-1106 to summarize the video transcript. This
      method may make mistakes in recognizing words. I also had to split the
      text into six segments of 2400 words each and there may be problems at
      the transitions.
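      The 2D block/thread kernel runner that the summary walks through (four nested loops over blocks and threads, ceiling division for the grid size, a bounds guard in the kernel) can be sketched in plain Python. The names `cdiv`, `matmul_kernel`, and `blk_kernel2d_run` follow the summary's description but are illustrative, not the notebook's exact code:

```python
import numpy as np

def cdiv(a, b):
    # Ceiling division: how many blocks are needed to cover `a` items.
    return (a + b - 1) // b

def matmul_kernel(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    # One "thread": compute a single output cell as a dot product.
    r = blockidx[0] * blockdim[0] + threadidx[0]
    c = blockidx[1] * blockdim[1] + threadidx[1]
    if r >= h or c >= w:        # guard: grid usually overshoots the data
        return
    acc = 0.0
    for i in range(k):
        acc += m[r, i] * n[i, c]
    out[r, c] = acc

def blk_kernel2d_run(kernel, blocks, tpb, *args):
    # Four nested loops simulate CUDA's 2D grid of blocks and threads.
    for i0 in range(blocks[0]):
        for i1 in range(blocks[1]):
            for j0 in range(tpb[0]):
                for j1 in range(tpb[1]):
                    kernel((i0, i1), (j0, j1), tpb, *args)

m = np.random.rand(6, 7)
n = np.random.rand(7, 5)
h, k = m.shape
k2, w = n.shape
assert k == k2                          # inner dimensions must match
out = np.zeros((h, w))
tpb = (16, 16)                          # threads per block: a pair for 2D inputs
blocks = (cdiv(h, tpb[0]), cdiv(w, tpb[1]))
blk_kernel2d_run(matmul_kernel, blocks, tpb, m, n, out, h, w, k)
```

      Translating this to a real CUDA kernel mostly means replacing the four loops with the hardware grid and reading `blockIdx`/`threadIdx` instead of the tuples.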

    • @howardjeremyp
      @howardjeremyp 4 months ago +8

      Thank you @wolpumba4099! :D

    • @grandmastergyorogyoro532
      @grandmastergyorogyoro532 4 months ago

      @@wolpumba4099 Thank you, Sir Pumba

  • @markozege
    @markozege 4 months ago +3

    This is amazing, thank you Jeremy! So happy you are continuing with making educational videos. And thanks to all 'Cuda Mode' folks as well...

  • @boydrh
    @boydrh 3 months ago +4

    I ran this notebook on a Jetson Nano DevKit (from 2015) and it took 6 seconds for the CPU greyscale conversion and 8ms for the CUDA Kernel. This was a really cool tutorial!!

  • @oceanograf83
    @oceanograf83 4 months ago +2

    Amazing, thank you for taking the time to put this stuff out, Jeremy, despite doing for-profit work right now!

  • @ilia_zaitsev
    @ilia_zaitsev 4 months ago +6

    I have been following the work that Jeremy Howard publishes for a while, starting when the fastai library used Keras. And since that time, every year or two, great content is published, new ideas shared, new projects started. It is CUDA time! (Always wanted to learn; never had a good starting point.) No doubt, a true pillar of the Machine Learning community :)

  • @mochalatte3547
    @mochalatte3547 4 months ago +3

    Outstanding - Your work always impresses me and part of me tells me you are indeed a great teacher.

  • @dahiruibrahimdahiru2690
    @dahiruibrahimdahiru2690 4 months ago +6

    What better way to spend a Sunday than a Jeremy Howard video

  • @JaySingh-gv8rm
    @JaySingh-gv8rm 4 months ago +1

    Wow... thanks for this, Jeremy. Yet to complete this video, but I know, as always, it will be awesome

  • @carstenmaul7220
    @carstenmaul7220 4 months ago

    Jeremy, your tutorials stand out.

  • @letrillion
    @letrillion 4 months ago +1

    Been looking for something like this for so long

  • @JustSayin24
    @JustSayin24 4 months ago +3

    30:24 is why I love this channel. Why learn low-level GPU programming when ChatGPT can do it for us? A no-fuss, genuinely useful tutorial. Thank you, Jeremy.

  • @Kwolf448
    @Kwolf448 4 months ago

    I love that magic is open-source, thanks, Jeremy!

  • @KiejlA9Armistice
    @KiejlA9Armistice 4 months ago

    Thanks for the excellent course! Very helpful

  • @rw-kb9qv
    @rw-kb9qv 4 months ago

    What a hero! thank you.

  • @Le.Loki.T
    @Le.Loki.T 4 months ago

    Thanks for this!

  • @AM-yk5yd
    @AM-yk5yd 3 months ago

    Really interesting approach, using Python to prototype CUDA.
    Translation back to C++ without ChatGPT can probably be automated using AST traversal (as if TRL and TorchScript were not enough), as the number of available operations is self-limited.

  • @mkmishra.1997
    @mkmishra.1997 4 months ago

    amazing video!!

  • @godiswatching_895
    @godiswatching_895 4 months ago +3

    Thanks as usual for the great video. Also I see you got a new camera haha :]

    • @howardjeremyp
      @howardjeremyp 4 months ago +4

      It’s just my iPhone camera :)

    • @b2prix21
      @b2prix21 4 months ago

      And he's on macOS. When did that happen? Judging from the YouTube videos, about a year ago. I wonder what his reasons were (did he mention it at all?) and whether he's already using the Alfred app for productivity 😉 Switching your OS is not a minor thing IMHO.

  • @ILikeAI1
    @ILikeAI1 4 months ago

    Thanks Jeremy!

  • @sayakpaul3152
    @sayakpaul3152 4 months ago +2

    Who would have thought of writing CUDA kernels like this!?

  • @alpha1968
    @alpha1968 4 months ago +1

    Thank you... For this...

  • @amortalbeing
    @amortalbeing 4 months ago

    Thanks a lot doctor, you are a Godsend.
    God bless you and please keep up the amazing job.

    • @howardjeremyp
      @howardjeremyp 4 months ago

      You’re most welcome- although I’m not a doctor!

    • @sjmeldrum
      @sjmeldrum 4 months ago

      Trust me, I'm not a doctor ;D @@howardjeremyp

  • @alinour7488
    @alinour7488 4 months ago +5

    Thank you for the amazing tutorial.
    Is it possible in the future, when mojo is released, to recreate this tutorial using it?

  • @BlessBeing
    @BlessBeing 4 months ago

    thanks sensei

  • @seikatsu_ki
    @seikatsu_ki 4 months ago

    best teacher

  • @Tuscani2005GT
    @Tuscani2005GT 4 months ago

    Amazing! Thank you so much. How can I buy you a coffee?

  • @eitanporat9892
    @eitanporat9892 4 months ago

    When will the rest of the tutorial be published?

  • @jancijak9385
    @jancijak9385 3 months ago

    Getting started with C++ CUDA operations in Python, coding in a Jupyter notebook. Will not run natively on Windows, because there is no fcntl support. You need to use WSL Linux with Conda. Otherwise great course content.

  • @wolpumba4099
    @wolpumba4099 4 months ago

    I just installed hipcc on my Linux laptop with AMD APU. I wonder if I can run the examples on AMD.

  • @sanawarhussain
    @sanawarhussain 4 months ago +1

    Hi Jeremy, thank you for the amazing introduction, but I am curious: why not simply do
    %%time
    gray = 0.2989*img[:,:,0] + 0.5870*img[:,:,1] + 0.1140*img[:,:,2]
    CPU times: total: 0 ns
    Wall time: 3 ms
    Was it just for the sake of demonstration? Thank you

    • @howardjeremyp
      @howardjeremyp 4 months ago +1

      This course is to teach you CUDA -- using Pytorch ops to do it won't teach you CUDA! :D

    • @sanawarhussain
      @sanawarhussain 4 months ago

      @@howardjeremyp :D

    • @jpiabrantes
      @jpiabrantes 4 months ago

      Would love to get an intuition on the speedup that CUDA can deliver when compared to vectorized operations on the CPU; any pointers to this? @@howardjeremyp
      (How to factor in things like SIMD width, memory transfer overhead, etc.)

  • @chrismarais1999
    @chrismarais1999 4 months ago

    I'm assuming there's no simple way to do something equivalent on Mac GPUs? i.e. with the MPS device support that PyTorch comes with

  • @EkShunya
    @EkShunya 4 months ago

    lecture notes please 🙏🏾