Jeremy Howard: a true hero of the common man. Thank you for this.
Chapter titles:
- 00:01 Introduction to CUDA Programming
- 00:32 Setting Up the Environment
- 01:43 Recommended Learning Resources
- 02:39 Starting the Exercise
- 03:26 Image Processing Exercise
- 06:08 Converting RGB to Grayscale
- 07:50 Understanding Image Flattening
- 11:04 Executing the Grayscale Conversion
- 12:41 Performance Issues and Introduction to CUDA Cores
- 14:46 Understanding CUDA and Parallel Processing
- 16:23 Simulating CUDA with Python
- 19:04 The Structure of CUDA Kernels and Memory Management
- 21:42 Optimizing CUDA Performance with Blocks and Threads
- 24:16 Utilizing CUDA's Advanced Features for Speed
- 26:15 Setting Up CUDA for Development and Debugging
- 27:28 Compiling and Using CUDA Code with PyTorch
- 28:51 Including Necessary Components and Defining Macros
- 29:45 Ceiling Division Function
- 30:10 Writing the CUDA Kernel
- 32:19 Handling Data Types and Arrays in C
- 33:42 Defining the Kernel and Calling Conventions
- 35:49 Passing Arguments to the Kernel
- 36:49 Creating the Output Tensor
- 38:11 Error Checking and Returning the Tensor
- 39:01 Compiling and Linking the Code
- 40:06 Examining the Compiled Module and Running the Kernel
- 42:57 CUDA Synchronization and Debugging
- 43:27 Python to CUDA Development Approach
- 44:54 Introduction to Matrix Multiplication
- 46:57 Implementing Matrix Multiplication in Python
- 50:39 Parallelizing Matrix Multiplication with CUDA
- 51:50 Utilizing Blocks and Threads in CUDA
- 58:21 Kernel Execution and Output
- 58:28 Introduction to Matrix Multiplication with CUDA
- 1:00:01 Executing the 2D Block Kernel
- 1:00:51 Optimizing CPU Matrix Multiplication
- 1:02:35 Conversion to CUDA and Performance Comparison
- 1:07:50 Advantages of Shared Memory and Further Optimizations
- 1:08:42 Flexibility of Block and Thread Dimensions
- 1:10:48 Encouragement and Importance of Learning CUDA
- 1:12:30 Setting Up CUDA on Local Machines
- 1:12:59 Introduction to Conda and its Utility
- 1:14:00 Setting Up Conda
- 1:14:32 Configuring CUDA and PyTorch with Conda
- 1:15:35 Conda's Improvements and Compatibility
- 1:16:05 Benefits of Using Conda for Development
- 1:16:40 Conclusion and Next Steps
For details see: pastebin.com/vMakt9Mq
I think YouTube has kind of shadow-banned me, so I can't post Summary 1/2.
*Summary 2/2*
*Examining the Compiled Module and Running the Kernel*
- 40:06 Observe the module compilation process and identify the resulting files such as `main.cpp`, `main.o`, and others.
- 41:15 Ensure the input image tensor is contiguous and on CUDA before passing it to the kernel (see the sketch below).
- 41:42 Run the kernel on a full-sized image, highlighting the significant performance improvement from 1.5 seconds to 1 millisecond due to compiled code and GPU acceleration.
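A minimal sketch of that pre-kernel check; the tensor shape and the module/function names here are stand-ins for the video's, not its exact code:

```python
import torch

# Stand-in for the real image used in the video:
img = torch.randint(0, 256, (3, 1330, 1990), dtype=torch.uint8).float()
imgc = img.contiguous().cuda()               # kernels index raw memory, so layout and device matter
# res = module.rgb_to_grayscale(imgc).cpu()  # `module` is the compiled extension from earlier
```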
*CUDA Synchronization and Debugging*
- 42:57 Discusses how synchronization of data between the GPU and CPU can be triggered manually or by printing a value (see the sketch below).
- 43:19 After synchronization, the same grayscale image is obtained, confirming successful CUDA kernel execution.
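A short sketch of the two synchronization triggers mentioned above (my own example, not the video's code):

```python
import torch

x = torch.randn(4096, 4096, device='cuda')
y = x @ x                   # returns immediately: the kernel is queued, not finished
torch.cuda.synchronize()    # manual sync: block until the GPU is done
print(y[0, 0])              # printing a value also forces a device-to-host sync
```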
*Python to CUDA Development Approach*
- 43:27 Explains that writing CUDA kernels in Python first and then converting them to CUDA C is uncommon but preferred for ease of debugging.
- 44:25 Argues that this method is effective for developing and debugging CUDA kernels, as it bypasses the painful traditional CUDA development process.
*Introduction to Matrix Multiplication*
- 44:54 Describes matrix multiplication as a fundamental operation in deep learning.
- 45:14 Explains the process of matrix multiplication using input matrices M and N.
- 46:03 Emphasizes the ubiquity of matrix multiplication in neural networks, although it's typically handled by libraries.
*Implementing Matrix Multiplication in Python*
- 46:57 Uses the MNIST dataset to illustrate matrix multiplication.
- 47:26 Describes creating a single layer of a neural network through matrix multiplication without an activation function.
- 48:14 Implements a smaller example in Python, since pure Python is too slow to run the full matrices.
- 49:31 Explains the Python implementation step by step (a sketch follows below), emphasizing the importance of checking each line for errors.
- 50:17 Shows the performance of the implemented matrix multiplication in Python, highlighting its slowness.
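Roughly the shape of that pure-Python version (a sketch, not the video's exact code):

```python
import torch

def matmul(m, n):
    # Naive triple loop; assumes m is (h, k) and n is (k, w).
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "inner dimensions must match"
    out = torch.zeros(h, w)
    for i in range(h):            # one output row at a time
        for j in range(w):        # one output column at a time
            for p in range(k):    # dot product over the inner dimension
                out[i, j] += m[i, p] * n[p, j]
    return out
```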
*Parallelizing Matrix Multiplication with CUDA*
- 50:39 Discusses converting the innermost loop of matrix multiplication to a CUDA kernel to allow parallel execution.
- 51:18 Describes how each cell in the output tensor will be handled by one CUDA thread computing a dot product.
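A sketch of such a per-cell kernel, simulated in Python in the spirit of the lecture (argument names are illustrative; inputs are flattened 1D views):

```python
def matmul_bk(blockidx, blockdim, threadidx, m, n, out, h, w, k):
    # One "thread" computes one output cell: a single dot product.
    r = blockidx.y * blockdim.y + threadidx.y   # output row
    c = blockidx.x * blockdim.x + threadidx.x   # output column
    if r >= h or c >= w:                        # guard: the grid may overshoot the matrix
        return
    o = 0.0
    for i in range(k):
        o += m[r * k + i] * n[i * w + c]        # flattened 1D indexing
    out[r * w + c] = o
```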
*Utilizing Blocks and Threads in CUDA*
- 51:50 Explains CUDA's ability to index into 2D and 3D grids using blocks and threads.
- 53:05 Describes how blocks are indexed with a tuple and how threads are indexed within blocks.
- 54:44 Develops a kernel runner using four nested loops to iterate through blocks and threads in both dimensions (see the sketch below).
- 57:01 Details running the matrix multiplication kernel, ensuring the current row and column are within the bounds of the tensor.
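A minimal sketch of such a runner, assuming the `matmul_bk` style of kernel above; `SimpleNamespace` stands in for CUDA's built-in index structs:

```python
from types import SimpleNamespace as ns

def blk_kernel2d(f, blocks, threads, *args):
    # Simulates CUDA's launch grid in pure Python: every (block, thread)
    # pair runs the kernel once, sequentially rather than in parallel.
    for i0 in range(blocks.y):
        for i1 in range(blocks.x):
            for j0 in range(threads.y):
                for j1 in range(threads.x):
                    f(ns(x=i1, y=i0), threads, ns(x=j1, y=j0), *args)
```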
*Kernel Execution and Output*
- 58:21 Completes the kernel execution by performing the dot product and placing the result in the output tensor.
*Introduction to Matrix Multiplication with CUDA*
- 58:28 Discusses calling the matrix multiplication function with the height and width of the input matrices and the inner dimensions K and K2.
- 58:42 Explains that threads per block is a pair of numbers for two-dimensional inputs.
- 58:56 Notes that the two threads-per-block numbers multiply to 256 in the example (16×16), and that the product must not exceed 1,024.
- 59:31 Mentions the maximum number of blocks allowed in each dimension.
- 59:49 Describes each streaming multiprocessor's capability to run blocks and access shared memory.
*Executing the 2D Block Kernel*
- 1:00:01 Details how to use ceiling division to compute the number of blocks in each dimension (see the sketch below).
- 1:00:14 Outlines the process to call a 2D block kernel runner with necessary inputs and dimensions.
- 1:00:35 Validates the output of the 2D block matrix multiplication.
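A sketch of the launch math, assuming the `blk_kernel2d` and `matmul_bk` sketches above and contiguous matrices `m` (h×k) and `n` (k×w):

```python
import torch
from types import SimpleNamespace as ns

def cdiv(a, b):
    # Ceiling division: the number of blocks of size b needed to cover a.
    return (a + b - 1) // b

def matmul_2d(m, n):
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "size mismatch"
    out = torch.zeros(h, w, dtype=m.dtype)
    tpb = ns(x=16, y=16)                             # 16*16 = 256 threads per block
    blocks = ns(x=cdiv(w, tpb.x), y=cdiv(h, tpb.y))  # enough blocks to cover every cell
    blk_kernel2d(matmul_bk, blocks, tpb, m.flatten(), n.flatten(), out.flatten(), h, w, k)
    return out
```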
*Optimizing CPU Matrix Multiplication*
- 1:00:51 Introduces the CUDA version of matrix multiplication.
- 1:01:16 Describes using a broadcasting approach for a fast CPU-based matrix multiplication (sketched below).
- 1:01:34 Compares the broadcasting approach to nested loops and confirms its efficiency.
- 1:02:00 Measures the time taken for the broadcasting approach on the full input matrices.
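A hedged sketch of that row-at-a-time broadcasting approach:

```python
import torch

def matmul_broadcast(m, n):
    h, k = m.shape
    out = torch.zeros(h, n.shape[1], dtype=m.dtype)
    for i in range(h):
        # Broadcast row i of m against all of n at once, replacing
        # the two inner Python loops with one vectorized operation.
        out[i] = (m[i, :, None] * n).sum(dim=0)
    return out
```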
*Conversion to CUDA and Performance Comparison*
- 1:02:35 Explains how to convert Python code to CUDA using ChatGPT.
- 1:03:01 Describes the CUDA C kernel and its similarities to the Python version.
- 1:03:20 Discusses the process of invoking the CUDA kernel from PyTorch.
- 1:03:56 Emphasizes the importance of assertions in CUDA code.
- 1:04:14 Introduces the dim3 structure for specifying threads per block in CUDA (see the sketch below).
- 1:04:58 Explains how to call the CUDA kernel with the dim3 structure.
- 1:06:13 Summarizes the steps to compile and run the CUDA module.
- 1:06:41 Reports the performance improvement using CUDA over the optimized CPU approach.
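A sketch of the overall pattern using PyTorch's `torch.utils.cpp_extension.load_inline`, which the lecture relies on; the kernel body, wrapper, and names (`matmul_k`, `matmul`, `matmul_ext`) are my reconstruction, assuming contiguous float32 CUDA tensors, not the video's exact code:

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r'''
#include <torch/extension.h>
#include <c10/cuda/CUDAException.h>

__global__ void matmul_k(float* m, float* n, float* out, int h, int w, int k) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    if (r >= h || c >= w) return;                    // guard against grid overshoot
    float o = 0;
    for (int i = 0; i < k; ++i) o += m[r*k + i] * n[i*w + c];
    out[r*w + c] = o;
}

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    TORCH_CHECK(m.is_cuda() && n.is_cuda(), "inputs must be CUDA tensors");
    TORCH_CHECK(m.size(1) == n.size(0), "inner dimensions must match");
    int h = m.size(0), k = m.size(1), w = n.size(1);
    auto out = torch::zeros({h, w}, m.options());
    dim3 tpb(16, 16);                                // 16*16 = 256 threads per block
    dim3 blocks((w + tpb.x - 1) / tpb.x, (h + tpb.y - 1) / tpb.y);
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(), out.data_ptr<float>(), h, w, k);
    C10_CUDA_KERNEL_LAUNCH_CHECK();                  // surface any launch error
    return out;
}
'''
cpp_src = "torch::Tensor matmul(torch::Tensor m, torch::Tensor n);"

# Compiles the extension on the fly and binds the listed function.
module = load_inline(name='matmul_ext', cpp_sources=[cpp_src],
                     cuda_sources=[cuda_src], functions=['matmul'], verbose=True)
```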
*Advantages of Shared Memory and Further Optimizations*
- 1:07:50 Discusses the benefits of shared memory in the GPU for optimizing matrix multiplication.
- 1:08:02 Highlights the potential for caching to improve performance with shared memory.
*Flexibility of Block and Thread Dimensions*
- 1:08:42 Explains that 2D or 3D blocks are optional and can be replaced with 1D blocks if preferred.
- 1:09:04 Provides an example of converting RGB to grayscale using 2D blocks instead of 1D (compare the sketches below).
- 1:09:18 Compares code complexity between 1D and 2D block versions for the same task.
- 1:10:17 Emphasizes that either approach yields the same result, and the choice depends on convenience.
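Two hedged sketches of the same grayscale kernel, simulated in Python as above, to show the 1D/2D trade-off (a planar C×H×W input flattened to 1D is assumed, as in the lecture):

```python
def rgb2gray_1d(blockidx, blockdim, threadidx, x, out, n):
    # 1D version: one flat index per pixel; n is height*width.
    i = blockidx.x * blockdim.x + threadidx.x
    if i < n:
        out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2gray_2d(blockidx, blockdim, threadidx, x, out, w, h):
    # 2D version: separate row/column indices, flattened at the end.
    c = blockidx.x * blockdim.x + threadidx.x
    r = blockidx.y * blockdim.y + threadidx.y
    if r < h and c < w:
        i = r * w + c
        n = h * w
        out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]
```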
*Encouragement and Importance of Learning CUDA*
- 1:10:48 Encourages data scientists and Python programmers to learn CUDA.
- 1:11:03 Suggests that writing CUDA code is increasingly important for modern complex models.
- 1:11:54 Mentions that models are becoming more sophisticated, often requiring CUDA for efficiency.
*Setting Up CUDA on Local Machines*
- 1:12:30 Discusses the possibility of setting up CUDA on personal or cloud machines.
- 1:12:44 Outlines a simple setup process for CUDA on different operating systems.
*Introduction to Conda and its Utility*
- 1:12:59 Mentions Mac and CUDA compatibility issues, with a link to be provided in the video notes.
- 1:13:17 Introduces Conda as a misunderstood tool, not a replacement for pip or Poetry.
- 1:13:34 Notes Conda's ability to manage multiple versions of Python, CUDA, and C++ compilation systems.
*Setting Up Conda*
- 1:14:00 Explaining the ease of setting up Conda with a script for installation.
- 1:14:24 Instructions on restarting the terminal after running the Conda installation script.
*Configuring CUDA and PyTorch with Conda*
- 1:14:32 Finding the correct version of CUDA for PyTorch.
- 1:14:48 Command for installing the correct version of CUDA.
- 1:15:04 Installing all necessary Nvidia tools directly from Nvidia.
- 1:15:22 Installing PyTorch with the correct CUDA version, removing the unnecessary 'nightly' tag.
*Conda's Improvements and Compatibility*
- 1:15:35 Mention of Conda's previous slow solver and recent improvements for speed.
- 1:15:58 Conda's compatibility across various operating systems like WSL, Ubuntu, Fedora, and Debian.
*Benefits of Using Conda for Development*
- 1:16:05 Recommending Conda for local development without the need for Docker.
- 1:16:18 Ability to switch between different versions of tools without hassle.
- 1:16:31 Conda's efficient use of hard drive space through hard linking shared libraries.
*Conclusion and Next Steps*
- 1:16:40 Guidance on getting started with development using Conda on a local machine or the cloud.
- 1:16:47 Closing remarks and encouragement to create with CUDA.
- 1:17:01 Suggestions to watch other CUDA MODE lectures and to try out personal projects.
- 1:17:12 Examples of potential projects to implement using CUDA.
- 1:17:38 Advice on improving skills by reading other people's code and examples provided.
Disclaimer: I used gpt4-1106 to summarize the video transcript. This method may make mistakes in recognizing words. I also had to split the text into six segments of 2400 words each, and there may be problems at the transitions.
Thank you @wolpumba4099! :D
@wolpumba4099 Thank u Sir Pumba
I ran this notebook on a Jetson Nano DevKit (from 2015) and it took 6 seconds for the CPU greyscale conversion and 8 ms for the CUDA kernel. This was a really cool tutorial!!
I have been following the work Jeremy Howard publishes for a while now, starting back when the fastai library used Keras. And since that time, every year or two, great content is published, new ideas are shared, new projects are started. It is CUDA time! (Always wanted to learn; never had a good starting point.) No doubt, a true pillar of the Machine Learning community :)
What better way to spend a Sunday than a Jeremy Howard video
Quite brilliant to do this in a notebook, because it avoids the normal hassle of setting up a CUDA environment. Even if you have your own GPU, setting up CUDA can be a real pain (e.g. getting the versions right). Well done Jeremy!
This is amazing, thank you Jeremy! So happy you are continuing with making educational videos. And thanks to all 'Cuda Mode' folks as well...
Amazing, thank you for taking the time to put this stuff out, Jeremy, despite doing for-profit work right now!
Outstanding - Your work always impresses me and part of me tells me you are indeed a great teacher.
Really interesting approach, using Python for prototyping CUDA.
Translation back to C++ without ChatGPT can probably be automated using AST traversal (as if trl and torchscript were not enough), since the number of available operations is self-limited.
I love that magic is open-source, thanks, Jeremy!
Wow...thanks for this Jeremy. Yet to complete this video but I know, as always it will be awesome
As long as educators like Jeremy are present, no closed-source company can have a lock on knowledge.
Thanks for doing what you do, so consistently.
One question though: even though there's so much chaos in the education field, what motivates you to do this consistently? Doing great work is okay; doing great work consistently is really hard in this distraction-prone world.
Anyway, as always, thank you and your team for your contribution.
Been looking for something like this for so long
Thanks as usual for the great video. Also I see you got a new camera haha :]
It’s just my iPhone camera :)
And he's on macOS. When has that happened? Judging from the YouTube videos, about a year ago. I wonder what his reasons were (did he mention it at all?) and whether he's already using the Alfred app for productivity 😉 Switching your OS is not a minor thing, IMHO.
Jeremy, your tutorials stand out.
Thanks for the excellent course! Very helpful.
Thank you for the amazing tutorial.
Is it possible in the future, when mojo is released, to recreate this tutorial using it?
Great idea!
Hi Jeremy, thank you for the amazing introduction, but I am curious why not simply do the following:
%%time
gray = 0.2989*img[:,:,0] + 0.5870*img[:,:,1] + 0.1140*img[:,:,2]
CPU times: total: 0 ns
Wall time: 3 ms
Was it just for demonstration purposes? Thank you.
This course is to teach you CUDA -- using Pytorch ops to do it won't teach you CUDA! :D
@@howardjeremyp :D
Would love to get an intuition on the speed-up that CUDA can deliver compared to vectorized operations on the CPU. Any pointers on this, @howardjeremyp?
(How to factor in things like SIMD width, memory transfer overhead, etc.)
Who would have thought writing CUDA kernels like this!?
Thanks a lot doctor, you are a Godsend.
God bless you and please keep up the amazing job.
You’re most welcome- although I’m not a doctor!
Trust me, I'm not a doctor ;D @@howardjeremyp
What a hero! thank you.
Thank you... For this...
I'm assuming there's no simple way to do something equivalent on a Mac GPU? i.e. the MPS device support that PyTorch comes with.
When will the rest of the tutorial be published?
Getting started with C++ CUDA operations in Python, coding in a Jupyter notebook. It will not run natively on Windows, because there is no fcntl support; you need to use WSL Linux with Conda. Otherwise, great course content.
If ChatGPT can convert Python to C code, then surely it must be possible to write a notebook plugin (or whatever) in Python that takes a Python cell and creates an adjacent cell in CUDA C, so the process is automated, thus allowing everyone to code in Python, with its attendant advantages, while targeting the GPU natively. This is exactly like the old days of writing code in C and using a cross-compiler to generate Motorola assembler code for burning EPROM chips.
Thanks for this!
I just installed hipcc on my Linux laptop with an AMD APU. I wonder if I can run the examples on AMD.
amazing video!!
Thanks Jeremy!
👏👏👏
thanks sensei
Amazing! Thank you so much. How can I buy you a coffee?
best teacher
lecture notes please 🙏🏾