GPU MODE
Joined 31 Dec 2023
A GPU reading group and community: discord.gg/gpumode
Supplementary content here: github.com/gpu-mode
Created by Mark Saroufim and Andreas Köpf
Videos
Lecture 40: CUDA Docs for Humans
2.4K views · 1 month ago
x.com/charles_irl/status/1867306225706447023 docs.google.com/presentation/d/15lTG6aqf72Hyk5_lqH7iSrc8aP1ElEYxCxch-tD37PE/edit
Lecture 37: Introduction to SASS & GPU Microarchitecture
2.4K views · 2 months ago
Lecture 36: CUTLASS and Flash Attention 3
3.1K views · 2 months ago
The History of CUDA MODE (Now GPU MODE)
1.3K views · 3 months ago
Lecture 28: Liger Kernel - Efficient Triton Kernels for LLM Training
6K views · 5 months ago
Lecture 27: gpu.cpp - Portable GPU compute using WebGPU
1.9K views · 5 months ago
Lecture 25: Speaking Composable Kernel (CK)
2.1K views · 6 months ago
Lecture 24: Scan at the Speed of Light
2K views · 7 months ago
Lecture 22: Hacker's Guide to Speculative Decoding in VLLM
6K views · 8 months ago
🐐
Hi Mark, IMHO the arithmetic intensity of ReLU should not depend on whether the input is larger than zero or not, since either way you need to write to the output tensor (it is not an in-place update). My understanding is that the NVIDIA article says the intensity of ReLU is 0.25 simply because FP16 is assumed, instead of FP32 as assumed in the video.
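A quick back-of-the-envelope version of that point (a sketch; the one-FLOP, read-plus-write per-element cost model is my assumption, not taken from the video or the NVIDIA article):

```python
# Arithmetic intensity of an out-of-place, element-wise ReLU under a simple
# cost model: one FLOP per element, one load of the input plus one store of
# the output per element.

def relu_arithmetic_intensity(bytes_per_element):
    flops = 1
    bytes_moved = 2 * bytes_per_element   # read one input element + write one output element
    return flops / bytes_moved

print(relu_arithmetic_intensity(4))   # FP32: 1/8 = 0.125
print(relu_arithmetic_intensity(2))   # FP16: 1/4 = 0.25, the figure in the NVIDIA article
```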
it is very useful for me
Please share the presentation.
img.flatten() should equal c × h × w, not 33750; it should be 33750 × 3.
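For illustration only, assuming a hypothetical CHW image whose spatial size is 150 × 225 = 33750 pixels per channel (the actual shape in the lecture may differ):

```python
import numpy as np

# Hypothetical 3-channel image; shape chosen only to match the 33750 figure.
img = np.zeros((3, 150, 225), dtype=np.float32)

print(img.flatten().shape)                   # (101250,)
print(img.flatten().size == 3 * 150 * 225)   # True: c*h*w = 33750 * 3
```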
Thanks for the re-upload so that we can catch it later! This is a banger talk
thanks for the comprehensive tutorial!
This guy is 100% cracked! No slides & no fluff. Super high-density content ♥
Cade is cracked! :) really fun lecture!
Thank you for your excellent work!
There are no sync instructions for global loads and stores; does that mean those instructions are synchronous?
In the PTX code, the instruction on line 139 is @%p11 st.global.b32 [%rd14+0], {%r22};, in which %rd14 is assigned on line 124: add.s64 %rd14, %rd9, 2560;. It looks like those might be fused: the "+0" in the global store seems like it could be replaced with +2560. Were you compiling without the -O3 option?
Great talk
Excellent lecture. Thank you <3
Thank you. I was trying to find the collectives GitHub repository but it looks like it's not public anymore. Do you plan to share it sometime?
Great video -- really good talk by Charles!
Happy new year my friend and happy jobs 🎈
*CUDA Docs for Humans: A Comprehensive Guide to GPU Programming*
* 0:00 Introduction: This presentation discusses "CUDA Docs for Humans," a comprehensive, interconnected resource for understanding GPU programming, particularly the CUDA stack.
* 1:02 Live Demo: The resource is demonstrated, showcasing its interconnected nature, linking terms like "compute capability" to related concepts like "streaming multiprocessor architecture" with diagrams.
* 2:35 Design: The documentation features a visually engaging, "CUDA mode" design for an enjoyable learning experience.
* 3:03 Origin and Motivation: The project originated from the presenter's experiences in machine learning research and deployment, highlighting the need for a unified understanding of the CUDA stack.
* 4:52 Debugging Performance Issues: Tracing tools like the PyTorch profiler are crucial for understanding performance bottlenecks and the asynchronous nature of CUDA operations.
* 6:32 Deployment and Scaling: The presenter's current work focuses on real-world GPU deployment and scaling, leading to extensive debugging and the creation of the document "I am done not understanding the CUDA stack."
* 7:28 Motivation for Public Documentation: The need for a comprehensive, publicly available resource became apparent after discussions with other professionals who had created similar internal documents, recognizing the limitations of existing, scattered documentation.
* 9:49 High-Level Takeaways: The presenter shares key insights gained from compiling the documentation, emphasizing the multifaceted nature of CUDA and the importance of the PTX ISA.
* 10:22 Multiple Meanings of "CUDA": CUDA refers to different layers of the stack: the software platform, the abstract programming model, and the hardware architecture.
* 11:38 CUDA Software Platform: This layer includes the CUDA runtime and driver APIs, facilitating interaction between application code and the GPU.
* 13:28 CUDA Programming Model: This abstract model defines how programs are written at the thread level, emphasizing shared memory and synchronization within thread blocks (see the sketch after this summary).
* 16:47 Independence of Abstraction: The programming model is independent of specific hardware or language implementations, focusing on what can and cannot be assumed about parallel execution.
* 17:48 Compute Unified Device Architecture (CUDA): This approach to hardware design emphasizes a homogeneous array of streaming multiprocessors (SMs) for scalability, contrasting with earlier heterogeneous GPU designs.
* 20:38 Historical Context: The 2008 whitepaper by Lindholm et al. provides a comprehensive overview of the CUDA vision, from hardware to ecosystem.
* 21:31 Recommendation: The presenter strongly recommends reading the Lindholm et al. whitepaper for a deeper understanding of CUDA's foundational principles.
* 21:52 Parallel Thread Execution (PTX) ISA: PTX is highlighted as the most crucial part of the stack, acting as an intermediate representation that enables forward compatibility and transparent scaling.
* 26:25 PTX and Forward Compatibility: PTX allows programs to run on different GPUs and benefit from new hardware features without recompilation.
* 27:45 PTX Virtual Machine: PTX defines a virtual machine with multiple processors and a memory hierarchy, ensuring predictable program behavior.
* 28:41 Constraints and Scalability: The constraints in the CUDA programming model, such as limitations on synchronization, enable transparent scaling across different hardware configurations.
* 30:15 Future of the Project: The presenter discusses plans to enhance the resource, including interactive elements, expanded content, and potential collaborations.
* 31:56 Community Resource: The goal is to make the documentation a valuable community resource, potentially open-sourcing it in the future.
* 32:09 Short-Term Goals: Plans include making the documentation compatible with language models (e.g., "CUDA Docs for Chatbots"), adding interactive code snippets, improving diagrams, and expanding content on synchronization and thread block clusters.
* 38:21 External Feedback: The presenter emphasizes the importance of community feedback and contributions to improve the resource.
* 38:45 Medium-Term Goals: Future plans involve covering performance debugging, GPU fleet management, multi-GPU execution, and potentially partnering with universities for educational content.
* 40:53 Call for Collaboration: The presenter invites collaboration on these medium-term goals, particularly in areas like performance debugging and multi-GPU programming.
* 44:06 Hiring at Modal: The presenter's company, Modal, is hiring GPU experts and offering opportunities for open-source contributions.
* 45:48 Closing Remarks: The presenter thanks the audience and encourages further engagement on Discord and Twitter.
* 46:55 Document Availability: The internal document "I am done not understanding the CUDA stack" has been incorporated into the public GPU Glossary.
* 47:01 Discussion on Documentation Issues: The presenter acknowledges challenges with interlinking information across different documentation sources and suggests that community involvement can help address these issues.
* 49:16 Challenges of Documentation: The presenter notes that documenting the absolute frontiers of performance is inherently difficult due to the breakdown of abstractions and the need for real-world experience to refine understanding.
I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.04 Input tokens: 25218 Output tokens: 1137
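Not from the talk itself, but a minimal sketch of the programming-model ideas in the 13:28 bullet (a grid of thread blocks, per-block shared memory, and synchronization guaranteed only within a block), written with Numba's CUDA bindings purely for illustration; the kernel, names, and sizes are my own assumptions:

```python
import numpy as np
from numba import cuda, float32

BLOCK = 128  # threads per block

@cuda.jit
def block_sum(x, out):
    # Shared memory: visible only to the threads of one block.
    tmp = cuda.shared.array(BLOCK, float32)
    tid = cuda.threadIdx.x
    i = cuda.blockIdx.x * cuda.blockDim.x + tid
    if i < x.size:
        tmp[tid] = x[i]
    else:
        tmp[tid] = 0.0
    cuda.syncthreads()           # barrier is defined only within a thread block
    stride = BLOCK // 2
    while stride > 0:            # naive tree reduction inside the block
        if tid < stride:
            tmp[tid] += tmp[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        out[cuda.blockIdx.x] = tmp[0]

x = np.arange(1024, dtype=np.float32)
blocks = (len(x) + BLOCK - 1) // BLOCK
out = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, BLOCK](x, out)   # launch: grid of blocks x threads per block
print(out.sum(), x.sum())          # per-block partial sums are combined on the host
```

Nothing here lets one block wait on another; that restriction is exactly what the 28:41 bullet credits for transparent scaling across GPUs with different SM counts.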
bruh is a netrunner and nvidia pooed on his head.
very helpful talk! love this
Always liked listening to Charles.
thanks!
Why don't we need a mask in naive_matmul_k?
This is an awesome lecture. Thank you so much!
thank you
Thank you❤
One minute in and my "um" counter already threw an overflow exception.
Thank you for this awesome content, super helpful! Just a small piece of advice for those who don't have CUDA (for instance, macOS users): I am using a Colab Tesla T4 GPU.
Could you please tell me where to download the dataset you demonstrated in your video?
You misspelled his name; it is "Wizard", not Aroun. This was amazing. He is so well spoken and knowledgeable! How can he even make all of this up on the fly?! And there is still a good structure to all of it. I could listen to him speak about this for another couple of hours! This also highly motivates me to convert my own project to CUDA. He conveys this "you can just do things" feeling. I have done a good amount of CUDA coding already, but it always felt like poking a black box. I think the tools and his workflows around them might be even more valuable than the other details.
👑
Could you add a link to Jay's blog? Also, what's the blog by Simon referred to in the beginning? Thanks!
good lecture
Fantastic, thank you!
I’m here. I made it
11:18 I think a previous lecture mentioned that floating-point addition is commutative but not associative, which is actually what causes the problem in the parallel reduction algorithm.
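A tiny illustration of that point: each individual addition is commutative, but changing the grouping (which is exactly what a parallel reduction does) can change the rounded result. The numbers below are an extreme case chosen to make the effect visible:

```python
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)      # 1.0  -- sequential, left-to-right grouping
print(a + (b + c))      # 0.0  -- the 1.0 is rounded away inside (b + c)
print(a + b == b + a)   # True -- commutativity of a single add still holds
```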
Can we get the code file?
This is pure gold!
Thank you
Thank you
Awesome!
Great explanation, thanks
Great video really!
worse but faster
thanks for the video. Triton's documentation tutorial starts directly with coding, and it was really difficult for me. Thanks for sharing the programming model first.
*Lecture 33: Bitblas - Enabling Efficient Low Precision Computing with Hardware Aware Transformations*
* 0:00 Introduction: James Melvin introduces Lei Wang, a research intern at Microsoft Research, who presents Bitblas, a kernel library and end-to-end compiler for high-performance mixed-precision computing. He also introduces a Triton-like programming language called TI Language.
* 1:58 Mixed Precision Computing Background: Lei Wang explains the shift towards lower bit formats in AI models for memory efficiency. He outlines three challenges: lack of custom precision format support in hardware/software, limited mixed-precision instructions, and vast computation combinations requiring extensive optimization.
* 6:20 Insights and Abstractions: Two core insights drive Bitblas: flexible data type representation in memory allows reinterpretation in software, and custom data types can be converted to standard types to leverage existing hardware instructions (illustrated in the sketch after this summary).
* 7:18 Tensor-Centric Abstractions: Bitblas introduces abstractions like TI Type (custom data types), TI Tile (tensors), Index Map (data layout), and scheduling templates to manipulate tensors. This enables defining computations with explicit data types and layouts.
* 13:30 Finding the Right Instructions: Bitblas includes a "Bit Machine Instruction" framework to select the most efficient hardware instructions based on data type and FLOPs. An iterator classification method maps computations to target instructions (e.g., Tensor Cores).
* 17:34 Optimizing Data Layouts: Bitblas infers memory layouts aligned with hardware instructions to minimize memory access issues. The TI approach further optimizes by fusing operators and propagating layouts through the tensor graph.
* 20:40 Layout Propagation: Challenges in layout propagation include misalignment between problem scale and instructions, computations outside core instructions, and layout transformations affecting correctness. Bitblas categorizes layouts and implements specific propagation methods.
* 26:14 Deciding When to Dequantize: Bitblas uses a latency-oriented policy to determine the optimal stage for dequantization (registers, shared memory, or global memory), trading off compute overhead and memory savings.
* 29:00 Bitblas Systems, Ladder and Bitblas: Ladder is an end-to-end compiler that optimizes operator fusion and generates efficient CUDA kernels. Bitblas is a kernel library with a simplified API abstracting tensor transformations.
* 32:58 Optimization Tricks: Bitblas implements fast dequantization techniques using vectorization and specialized instructions for improved performance, especially for low bit widths.
* 40:58 Kernel Code Generation for Dynamic Shapes: Bitblas addresses the challenge of dynamic shapes in LLMs by generating code for segments of the dynamic dimension and storing optimal configurations for dispatch.
* 46:42 Performance Results: Bitblas demonstrates significant speedups over existing systems and hand-written kernels across various hardware and models, including AMD GPUs. Scaling experiments with Llama models show memory and compute benefits with lower precision.
* 51:06 Challenges and Future Work: Kernel compilation time, complexity of Bitblas scheduling, and the limitations of schedule-based implementations are highlighted as areas for future work.
* 51:49 Bitblas Code Overview and TI Language: Lei Wang provides a brief overview of the Bitblas code structure and highlights TI Language, a new programming language designed for ease of kernel development with support for custom data types, layouts, and hardware instructions.
I used gemini-1.5-pro-002 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03 Input tokens: 24672 Output tokens: 716
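Not Bitblas's actual API, just a tiny sketch of the 6:20 insight that low-bit values stored in a standard container type can be reinterpreted in software and dequantized to a hardware-supported type; the packing scheme, values, and scale below are made up for illustration:

```python
import numpy as np

# Hypothetical data: two unsigned 4-bit weights packed into each uint8 byte,
# plus a single per-tensor scale.
packed = np.array([0x4F, 0xA3], dtype=np.uint8)
scale = np.float16(0.05)

lo = (packed & 0x0F).astype(np.float16)   # low nibble of each byte
hi = (packed >> 4).astype(np.float16)     # high nibble of each byte

# Interleave and rescale to FP16, a type that standard GEMM paths understand.
weights = np.stack([lo, hi], axis=-1).reshape(-1) * scale
print(weights)   # roughly [0.75 0.2 0.15 0.5]
```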
what a cutie patootie
Super useful! Thank you!!!
Thanks! Very helpful.