Getting Started With CUDA for Python Programmers

  • Published Jun 20, 2024
  • I used to find writing CUDA code rather terrifying. But then I discovered a couple of tricks that actually make it quite accessible. In this video I introduce CUDA in a way that will be accessible to Python folks, & I even show how to do it all for free in Colab!
    Notebooks
    This is lecture 3 of the "CUDA Mode" series (but you don't need to watch the others first). The notebook is available in the lecture3 folder here: github.com/cuda-mode/lectures . Or access it directly via Colab here: colab.research.google.com/dri...
    Here's a link to the thread that shows how to install CUDA on Linux or WSL: / 1697435241152127369
    GPT4 auto-generated summary
    In this comprehensive video tutorial, Jeremy Howard from answer.ai demystifies the process of programming NVIDIA GPUs using CUDA, and simplifies the perceived complexities of CUDA programming. Jeremy emphasizes the accessibility of CUDA, especially when combined with PyTorch's capabilities, allowing for programming directly in notebooks rather than traditional compilers and terminals. To make CUDA more approachable to Python programmers, Jeremy shows step by step how to start with Python implementations, and then convert them largely automatically to CUDA. This approach, he argues, simplifies debugging and development.
    The tutorial is structured in a hands-on manner, encouraging viewers to follow along in a Colab notebook. Jeremy uses practical examples, starting with converting an RGB image to grayscale using CUDA, demonstrating the process step-by-step. He further explains the memory layout in GPUs, emphasizing the differences from CPU memory structures, and introduces key CUDA concepts like streaming multi-processors and CUDA cores.
    Jeremy then delves into more advanced topics, such as matrix multiplication, a critical operation in deep learning. He demonstrates how to implement matrix multiplication in Python first and then translates it to CUDA, highlighting the significant performance gains achievable with GPU programming. The tutorial also covers CUDA's intricacies, such as shared memory, thread blocks, and optimizing CUDA kernels.
    The tutorial also includes a section on setting up the CUDA environment on various systems using Conda, making it accessible for a wide range of users.
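    The Python-first workflow the summary describes (write the kernel as a plain Python function, then run it over a simulated grid of blocks and threads) can be sketched roughly as follows. This is a minimal sketch assuming NumPy; the function names are illustrative rather than the notebook's exact code, and 0.2989/0.5870/0.1140 are the standard Rec. 601 luma weights:

```python
import numpy as np

def grayscale_kernel(i, out, red, green, blue, n):
    # One "thread": convert a single pixel, guarding against out-of-range indices.
    if i < n:
        out[i] = 0.2989 * red[i] + 0.5870 * green[i] + 0.1140 * blue[i]

def run_kernel(kernel, n_blocks, threads_per_block, *args):
    # Simulate CUDA's grid of blocks and threads with plain Python loops.
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            kernel(block * threads_per_block + thread, *args)

img = np.random.rand(3, 4, 5)           # channel-first RGB image
c, h, w = img.shape
n = h * w
flat = img.reshape(c, n)                # flatten each channel to 1D
out = np.empty(n)
threads = 256
blocks = (n + threads - 1) // threads   # ceiling division
run_kernel(grayscale_kernel, blocks, threads, out, flat[0], flat[1], flat[2], n)
gray = out.reshape(h, w)
```

    Each (block, thread) pair maps to one pixel index, and the bounds guard mirrors what a real CUDA kernel needs, since the grid usually overshoots the data size.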
    Timestamps
    - 00:00 Introduction to CUDA Programming
    - 00:32 Setting Up the Environment
    - 01:43 Recommended Learning Resources
    - 02:39 Starting the Exercise
    - 03:26 Image Processing Exercise
    - 06:08 Converting RGB to Grayscale
    - 07:50 Understanding Image Flattening
    - 11:04 Executing the Grayscale Conversion
    - 12:41 Performance Issues and Introduction to CUDA Cores
    - 14:46 Understanding Cuda and Parallel Processing
    - 16:23 Simulating Cuda with Python
    - 19:04 The Structure of Cuda Kernels and Memory Management
    - 21:42 Optimizing Cuda Performance with Blocks and Threads
    - 24:16 Utilizing Cuda's Advanced Features for Speed
    - 26:15 Setting Up Cuda for Development and Debugging
    - 27:28 Compiling and Using Cuda Code with PyTorch
    - 28:51 Including Necessary Components and Defining Macros
    - 29:45 Ceiling Division Function
    - 30:10 Writing the CUDA Kernel
    - 32:19 Handling Data Types and Arrays in C
    - 33:42 Defining the Kernel and Calling Conventions
    - 35:49 Passing Arguments to the Kernel
    - 36:49 Creating the Output Tensor
    - 38:11 Error Checking and Returning the Tensor
    - 39:01 Compiling and Linking the Code
    - 40:06 Examining the Compiled Module and Running the Kernel
    - 42:57 Cuda Synchronization and Debugging
    - 43:27 Python to Cuda Development Approach
    - 44:54 Introduction to Matrix Multiplication
    - 46:57 Implementing Matrix Multiplication in Python
    - 50:39 Parallelizing Matrix Multiplication with Cuda
    - 51:50 Utilizing Blocks and Threads in Cuda
    - 58:21 Kernel Execution and Output
    - 58:28 Introduction to Matrix Multiplication with CUDA
    - 1:00:01 Executing the 2D Block Kernel
    - 1:00:51 Optimizing CPU Matrix Multiplication
    - 1:02:35 Conversion to CUDA and Performance Comparison
    - 1:07:50 Advantages of Shared Memory and Further Optimizations
    - 1:08:42 Flexibility of Block and Thread Dimensions
    - 1:10:48 Encouragement and Importance of Learning CUDA
    - 1:12:30 Setting Up CUDA on Local Machines
    - 1:12:59 Introduction to Conda and its Utility
    - 1:14:00 Setting Up Conda
    - 1:14:32 Configuring Cuda and PyTorch with Conda
    - 1:15:35 Conda's Improvements and Compatibility
    - 1:16:05 Benefits of Using Conda for Development
    - 1:16:40 Conclusion and Next Steps
    Thanks to @wolpumba4099 for the chapter timestamps. Summary description provided by GPT4.

Comments • 44

  • @wadejohnson4542
    @wadejohnson4542 4 months ago +61

    Jeremy Howard: a true hero of the common man. Thank you for this.

  • @wolpumba4099
    @wolpumba4099 4 months ago +33

    For details see: pastebin.com/vMakt9Mq
    I think YouTube kind of shadow-banned me and I can't post Summary 1/2

    • @wolpumba4099
      @wolpumba4099 4 months ago +8

      *Summary 2/2*
      *Examining the Compiled Module and Running the Kernel*
      - 40:06 Observe the module compilation process and identify the resulting files such as `main.cpp`, `main.o`, and others.
      - 41:15 Ensure the input image tensor is contiguous and on CUDA before passing it to the kernel.
      - 41:42 Run the kernel on a full-sized image, highlighting the significant performance improvement from 1.5 seconds to 1 millisecond due to compiled code and GPU acceleration.
      *Cuda Synchronization and Debugging*
      - 42:57 Discusses how synchronization of data between GPU and CPU can be triggered manually or by printing a value.
      - 43:19 After synchronization, the same grayscale image is obtained, confirming successful Cuda kernel execution.
      *Python to Cuda Development Approach*
      - 43:27 Explains that writing Cuda kernels in Python and converting them to Cuda is uncommon but preferred for ease of debugging.
      - 44:25 Argues that this method is effective for developing and debugging Cuda kernels, as it bypasses the painful traditional Cuda development process.
      *Introduction to Matrix Multiplication*
      - 44:54 Describes matrix multiplication as a fundamental operation in deep learning.
      - 45:14 Explains the process of matrix multiplication using input matrices M and N.
      - 46:03 Emphasizes the ubiquity of matrix multiplication in neural networks, although it's typically handled by libraries.
      *Implementing Matrix Multiplication in Python*
      - 46:57 Uses the MNIST dataset to illustrate matrix multiplication.
      - 47:26 Describes creating a single layer of a neural network through matrix multiplication without an activation function.
      - 48:14 Implements a smaller example in Python for performance reasons due to Python's slow execution speed.
      - 49:31 Explains the Python implementation step by step, emphasizing the importance of checking each line for errors.
      - 50:17 Shows the performance of the implemented matrix multiplication in Python, highlighting its slowness.
      *Parallelizing Matrix Multiplication with Cuda*
      - 50:39 Discusses converting the innermost loop of matrix multiplication to a Cuda kernel to allow parallel execution.
      - 51:18 Describes how each cell in the output tensor will be handled by one Cuda thread for the dot product.
      *Utilizing Blocks and Threads in Cuda*
      - 51:50 Explains Cuda's ability to index into 2D and 3D grids using blocks and threads.
      - 53:05 Describes how blocks are indexed with a tuple and how threads are indexed within blocks.
      - 54:44 Develops a kernel runner using four nested loops to iterate through blocks and threads in both dimensions.
      - 57:01 Details running the matrix multiplication kernel, ensuring the current row and column are within the bounds of the tensor.
      *Kernel Execution and Output*
      - 58:21 Completes the kernel execution by performing the dot product and placing the result in the output tensor.
      *Introduction to Matrix Multiplication with CUDA*
      - 58:28 Discusses the ability to call the matrix multiplication function using the height and width of input matrices and the inner dimensions K and K2.
      - 58:42 Explains that threads per block is a pair of numbers for two-dimensional inputs.
      - 58:56 Notes that threads per block must multiply to 256 and should not exceed 1024.
      - 59:31 Mentions the maximum number of blocks for each dimension.
      - 59:49 Describes each symmetric multiprocessor's capability to run blocks and access shared memory.
      *Executing the 2D Block Kernel*
      - 1:00:01 Details how to use the ceiling division for block dimensions.
      - 1:00:14 Outlines the process to call a 2D block kernel runner with necessary inputs and dimensions.
      - 1:00:35 Validates the output of the 2D block matrix multiplication.
      *Optimizing CPU Matrix Multiplication*
      - 1:00:51 Introduces the CUDA version of matrix multiplication.
      - 1:01:16 Describes using a broadcasting approach for a fast CPU-based matrix multiplication.
      - 1:01:34 Compares the broadcasting approach to nested loops and confirms its efficiency.
      - 1:02:00 Measures the time taken for the broadcasting approach on the full input matrices.
      *Conversion to CUDA and Performance Comparison*
      - 1:02:35 Explains how to convert Python code to CUDA using ChatGPT.
      - 1:03:01 Describes the C kernel and its similarities to the Python version.
      - 1:03:20 Discusses the process of invoking the CUDA kernel from PyTorch.
      - 1:03:56 Emphasizes the importance of assertions in CUDA code.
      - 1:04:14 Introduces the dim3 structure for specifying threads per block in CUDA.
      - 1:04:58 Explains how to call the CUDA kernel with the dim3 structure.
      - 1:06:13 Summarizes the steps to compile and run the CUDA module.
      - 1:06:41 Reports the performance improvement using CUDA over the optimized CPU approach.
      *Advantages of Shared Memory and Further Optimizations*
      - 1:07:50 Discusses the benefits of shared memory in the GPU for optimizing matrix multiplication.
      - 1:08:02 Highlights the potential for caching to improve performance with shared memory.
      *Flexibility of Block and Thread Dimensions*
      - 1:08:42 Explains that 2D or 3D blocks are optional and can be replaced with 1D blocks if preferred.
      - 1:09:04 Provides an example of converting RGB to grayscale using 2D blocks instead of 1D.
      - 1:09:18 Compares code complexity between 1D and 2D block versions for the same task.
      - 1:10:17 Emphasizes that either approach yields the same result, and the choice depends on convenience.
      *Encouragement and Importance of Learning CUDA*
      - 1:10:48 Encourages data scientists and Python programmers to learn CUDA.
      - 1:11:03 Suggests that writing CUDA code is increasingly important for modern complex models.
      - 1:11:54 Mentions that models are becoming more sophisticated, often requiring CUDA for efficiency.
      *Setting Up CUDA on Local Machines*
      - 1:12:30 Discusses the possibility of setting up CUDA on personal or cloud machines.
      - 1:12:44 Outlines a simple setup process for CUDA on different operating systems.
      *Introduction to Conda and its Utility*
      - 1:12:59 Mac and CUDA compatibility issues, with a link to be provided in the video notes.
      - 1:13:17 Introduction to Conda as a misunderstood tool, not a replacement for pip or poetry.
      - 1:13:34 Conda's ability to manage multiple versions of Python, Cuda, and C++ compilation systems.
      *Setting Up Conda*
      - 1:14:00 Explaining the ease of setting up Conda with a script for installation.
      - 1:14:24 Instructions on restarting the terminal after running the Conda installation script.
      *Configuring Cuda and PyTorch with Conda*
      - 1:14:32 Finding the correct version of Cuda for PyTorch.
      - 1:14:48 Command for installing the correct version of Cuda.
      - 1:15:04 Installing all necessary Nvidia tools directly from Nvidia.
      - 1:15:22 Installing PyTorch with the correct Cuda version, removing the unnecessary 'nightly' tag.
      *Conda's Improvements and Compatibility*
      - 1:15:35 Mention of Conda's previous slow solver and recent improvements for speed.
      - 1:15:58 Conda's compatibility across various operating systems like WSL, Ubuntu, Fedora, and Debian.
      *Benefits of Using Conda for Development*
      - 1:16:05 Recommending Conda for local development without the need for Docker.
      - 1:16:18 Ability to switch between different versions of tools without hassle.
      - 1:16:31 Conda's efficient use of hard drive space through hard linking shared libraries.
      *Conclusion and Next Steps*
      - 1:16:40 Guidance on getting started with development using Conda on a local machine or the cloud.
      - 1:16:47 Closing remarks and encouragement to create with Cuda.
      - 1:17:01 Suggestions to watch other Cuda mode lectures and to try out personal projects.
      - 1:17:12 Examples of potential projects to implement using Cuda.
      - 1:17:38 Advice on improving skills by reading other people's code and examples provided.
      Disclaimer: I used gpt4-1106 to summarize the video transcript. This
      method may make mistakes in recognizing words. I also had to split the
      text into six segments of 2400 words each and there may be problems at
      the transitions.
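      The 2D block/thread kernel runner that the summary walks through (four nested loops over blocks and threads, ceiling division for the grid size, a bounds guard in the kernel) can be sketched in plain Python. The names `cdiv`, `matmul_kernel`, and `blk_kernel2d_run` follow the summary's description but are illustrative, not the notebook's exact code:

```python
import numpy as np

def cdiv(a, b):
    # Ceiling division: how many blocks are needed to cover `a` items.
    return (a + b - 1) // b

def matmul_kernel(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    # One "thread": compute a single output cell as a dot product.
    r = blockidx[0] * blockdim[0] + threadidx[0]
    c = blockidx[1] * blockdim[1] + threadidx[1]
    if r >= h or c >= w:        # guard: grid usually overshoots the data
        return
    acc = 0.0
    for i in range(k):
        acc += m[r, i] * n[i, c]
    out[r, c] = acc

def blk_kernel2d_run(kernel, blocks, tpb, *args):
    # Four nested loops simulate CUDA's 2D grid of blocks and threads.
    for i0 in range(blocks[0]):
        for i1 in range(blocks[1]):
            for j0 in range(tpb[0]):
                for j1 in range(tpb[1]):
                    kernel((i0, i1), (j0, j1), tpb, *args)

m = np.random.rand(6, 7)
n = np.random.rand(7, 5)
h, k = m.shape
k2, w = n.shape
assert k == k2                          # inner dimensions must match
out = np.zeros((h, w))
tpb = (16, 16)                          # threads per block: a pair for 2D inputs
blocks = (cdiv(h, tpb[0]), cdiv(w, tpb[1]))
blk_kernel2d_run(matmul_kernel, blocks, tpb, m, n, out, h, w, k)
```

      Translating this to a real CUDA kernel mostly means replacing the four loops with the hardware grid and reading `blockIdx`/`threadIdx` instead of the tuples.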

    • @howardjeremyp
      @howardjeremyp 4 months ago +8

      Thank you @wolpumba4099! :D

    • @grandmastergyorogyoro532
      @grandmastergyorogyoro532 4 months ago

      @@wolpumba4099 Thank you, Sir Pumba

  • @markozege
    @markozege 4 months ago +3

    This is amazing, thank you Jeremy! So happy you are continuing with making educational videos. And thanks to all 'Cuda Mode' folks as well...

  • @boydrh
    @boydrh 3 months ago +4

    I ran this notebook on a Jetson Nano DevKit (from 2015) and it took 6 seconds for the CPU greyscale conversion and 8ms for the CUDA Kernel. This was a really cool tutorial!!

  • @oceanograf83
    @oceanograf83 4 months ago +2

    Amazing, thank you for taking the time to put this stuff out, Jeremy, despite doing for-profit work right now!

  • @ilia_zaitsev
    @ilia_zaitsev 4 months ago +6

    I have been following the work that Jeremy Howard publishes for a while, starting when the fastai library used Keras. And since that time, every year or two, great content is published, new ideas shared, new projects started. It is CUDA time! (Always wanted to learn; never had a good starting point.) No doubt, a true pillar of the Machine Learning community :)

  • @mochalatte3547
    @mochalatte3547 4 months ago +3

    Outstanding - Your work always impresses me and part of me tells me you are indeed a great teacher.

  • @dahiruibrahimdahiru2690
    @dahiruibrahimdahiru2690 4 months ago +6

    What better way to spend a Sunday than a Jeremy Howard video

  • @JaySingh-gv8rm
    @JaySingh-gv8rm 4 months ago +1

    Wow... thanks for this, Jeremy. Yet to complete this video, but I know, as always, it will be awesome

  • @carstenmaul7220
    @carstenmaul7220 4 months ago

    Jeremy, your tutorials stand out.

  • @letrillion
    @letrillion 4 months ago +1

    Been looking for something like this for so long

  • @JustSayin24
    @JustSayin24 4 months ago +3

    30:24 is why I love this channel. Why learn low-level GPU programming when ChatGPT can do it for us? A no-fuss, genuinely useful tutorial. Thank you, Jeremy.

  • @Kwolf448
    @Kwolf448 4 months ago

    I love that magic is open-source, thanks, Jeremy!

  • @KiejlA9Armistice
    @KiejlA9Armistice 4 months ago

    Thanks for the excellent course! Very helpful

  • @rw-kb9qv
    @rw-kb9qv 4 months ago

    What a hero! thank you.

  • @Le.Loki.T
    @Le.Loki.T 4 months ago

    Thanks for this!

  • @AM-yk5yd
    @AM-yk5yd 3 months ago

    Really interesting approach, using Python to prototype CUDA.
    Translation back to C++ without ChatGPT can probably be automated using AST traversal (as if TRL and TorchScript were not enough), as the number of available operations is self-limited.

  • @mkmishra.1997
    @mkmishra.1997 4 months ago

    amazing video!!

  • @godiswatching_895
    @godiswatching_895 4 months ago +3

    Thanks as usual for the great video. Also I see you got a new camera haha :]

    • @howardjeremyp
      @howardjeremyp 4 months ago +4

      It’s just my iPhone camera :)

    • @b2prix21
      @b2prix21 4 months ago

      And he's on macOS. When did that happen? Judging from the YouTube videos, about a year ago. I wonder what his reasons were (did he mention it at all?) and whether he's already using the Alfred app for productivity 😉 Switching your OS is not a minor thing IMHO.

  • @ILikeAI1
    @ILikeAI1 4 months ago

    Thanks Jeremy!

  • @sayakpaul3152
    @sayakpaul3152 4 months ago +2

    Who would have thought of writing CUDA kernels like this!?

  • @alpha1968
    @alpha1968 4 months ago +1

    Thank you... For this...

  • @amortalbeing
    @amortalbeing 4 months ago

    Thanks a lot doctor, you are a Godsend.
    God bless you and please keep up the amazing job.

    • @howardjeremyp
      @howardjeremyp 4 months ago

      You’re most welcome- although I’m not a doctor!

    • @sjmeldrum
      @sjmeldrum 4 months ago

      Trust me, I'm not a doctor ;D @@howardjeremyp

  • @alinour7488
    @alinour7488 4 months ago +5

    Thank you for the amazing tutorial.
    Is it possible in the future, when mojo is released, to recreate this tutorial using it?

  • @BlessBeing
    @BlessBeing 4 months ago

    thanks sensei

  • @seikatsu_ki
    @seikatsu_ki 4 months ago

    best teacher

  • @Tuscani2005GT
    @Tuscani2005GT 4 months ago

    Amazing! Thank you so much. How can I buy you a coffee?

  • @eitanporat9892
    @eitanporat9892 4 months ago

    When will the rest of the tutorial be published?

  • @jancijak9385
    @jancijak9385 3 months ago

    Getting started with C++ CUDA operations in Python, coding in a Jupyter notebook. Will not run natively on Windows, because there is no fcntl support. You need to use WSL Linux with Conda. Otherwise great course content.

  • @wolpumba4099
    @wolpumba4099 4 months ago

    I just installed hipcc on my Linux laptop with AMD APU. I wonder if I can run the examples on AMD.

  • @sanawarhussain
    @sanawarhussain 4 months ago +1

    Hi Jeremy, thank you for the amazing introduction, but I am curious: why not simply do
    %%time
    gray = 0.2989*img[:,:,0] + 0.5870*img[:,:,1] + 0.1140*img[:,:,2]
    CPU times: total: 0 ns
    Wall time: 3 ms
    Was it just for the sake of demonstration? Thank you

    • @howardjeremyp
      @howardjeremyp 4 months ago +1

      This course is to teach you CUDA -- using Pytorch ops to do it won't teach you CUDA! :D

    • @sanawarhussain
      @sanawarhussain 4 months ago

      @@howardjeremyp :D

    • @jpiabrantes
      @jpiabrantes 4 months ago

      Would love to get an intuition on the speedup that CUDA can deliver when compared to vectorized operations on the CPU; any pointers to this? @@howardjeremyp
      (How to factor in things like SIMD width, memory transfer overhead, etc.)

  • @chrismarais1999
    @chrismarais1999 4 months ago

    I'm assuming there's no simple way to do something equivalent on Mac GPUs? i.e. with the MPS device support that PyTorch comes with

  • @EkShunya
    @EkShunya 4 months ago

    lecture notes please 🙏🏾