GPU MODE
Joined 31 Dec 2023
A GPU reading group and community: discord.gg/gpumode
Supplementary content here: github.com/gpu-mode
Created by Mark Saroufim and Andreas Köpf
Videos
Lecture 40: CUDA Docs for Humans
2.4K views · 1 month ago
x.com/charles_irl/status/1867306225706447023 docs.google.com/presentation/d/15lTG6aqf72Hyk5_lqH7iSrc8aP1ElEYxCxch-tD37PE/edit
Lecture 37: Introduction to SASS & GPU Microarchitecture
2.4K views · 2 months ago
Lecture 36: CUTLASS and Flash Attention 3
3.1K views · 2 months ago
The History of CUDA MODE (Now GPU MODE)
1.3K views · 3 months ago
Lecture 28: Liger Kernel - Efficient Triton Kernels for LLM Training
6K views · 5 months ago
Lecture 27: gpu.cpp - Portable GPU compute using WebGPU
1.9K views · 5 months ago
Lecture 25: Speaking Composable Kernel (CK)
2.1K views · 6 months ago
Lecture 24: Scan at the Speed of Light
2K views · 7 months ago
Lecture 22: Hacker's Guide to Speculative Decoding in VLLM
6K views · 8 months ago
🐐
Hi Mark, IMHO the arithmetic intensity of ReLU should not depend on whether the input is larger than zero or not, since either way you need to write to the output tensor (it is not an in-place update). My understanding is that the NVIDIA article says the intensity of ReLU is 0.25 simply because FP16 is assumed, instead of FP32 as assumed in the video.
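A quick back-of-the-envelope version of that point (a sketch; the one-FLOP, read-plus-write per-element cost model is my assumption, not taken from the video or the NVIDIA article):

```python
# Arithmetic intensity of an out-of-place, element-wise ReLU under a simple
# cost model: one FLOP per element, one load of the input plus one store of
# the output per element.

def relu_arithmetic_intensity(bytes_per_element):
    flops = 1
    bytes_moved = 2 * bytes_per_element   # read one input element + write one output element
    return flops / bytes_moved

print(relu_arithmetic_intensity(4))   # FP32: 1/8 = 0.125
print(relu_arithmetic_intensity(2))   # FP16: 1/4 = 0.25, the figure in the NVIDIA article
```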
it is very useful for me
Please share the presentation.
img.flatten() should equal c × h × w, not 33750; it should be 33750 × 3.
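For illustration only, assuming a hypothetical CHW image whose spatial size is 150 × 225 = 33750 pixels per channel (the actual shape in the lecture may differ):

```python
import numpy as np

# Hypothetical 3-channel image; shape chosen only to match the 33750 figure.
img = np.zeros((3, 150, 225), dtype=np.float32)

print(img.flatten().shape)                   # (101250,)
print(img.flatten().size == 3 * 150 * 225)   # True: c*h*w = 33750 * 3
```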
Thanks for the re-upload so that we can catch it later! This is a banger talk
thanks for the comprehensive tutorial!
This guy is 100% cracked! No slides & no fluff. Super high-density content ♥
Cade is cracked! :) really fun lecture!
Thank you for your excellent work!
There are no sync instructions for global loads and stores; does that mean those instructions are synchronous?
In the PTX code, the instruction on line 139 is @%p11 st.global.b32 [%rd14+0], {%r22};, in which %rd14 is assigned on line 124: add.s64 %rd14, %rd9, 2560;. It looks like those might be fused: the "+0" in the global store seems like it could be replaced with +2560. Were you compiling without the -O3 option?
Great talk
Excellent lecture. Thank you <3
Thank you. I was trying to find the collectives GitHub repository but it looks like it's not public anymore. Do you plan to share it sometime?
Great video -- really good talk by Charles!
Happy new year my friend and happy jobs 🎈
*CUDA Docs for Humans: A Comprehensive Guide to GPU Programming*
* 0:00 Introduction: This presentation discusses "CUDA Docs for Humans," a comprehensive, interconnected resource for understanding GPU programming, particularly the CUDA stack.
* 1:02 Live Demo: The resource is demonstrated, showcasing its interconnected nature, linking terms like "compute capability" to related concepts like "streaming multiprocessor architecture" with diagrams.
* 2:35 Design: The documentation features a visually engaging, "CUDA mode" design for an enjoyable learning experience.
* 3:03 Origin and Motivation: The project originated from the presenter's experiences in machine learning research and deployment, highlighting the need for a unified understanding of the CUDA stack.
* 4:52 Debugging Performance Issues: Tracing tools like the PyTorch profiler are crucial for understanding performance bottlenecks and the asynchronous nature of CUDA operations.
* 6:32 Deployment and Scaling: The presenter's current work focuses on real-world GPU deployment and scaling, leading to extensive debugging and the creation of the document "I am done not understanding the CUDA stack."
* 7:28 Motivation for Public Documentation: The need for a comprehensive, publicly available resource became apparent after discussions with other professionals who had created similar internal documents, recognizing the limitations of existing, scattered documentation.
* 9:49 High-Level Takeaways: The presenter shares key insights gained from compiling the documentation, emphasizing the multifaceted nature of CUDA and the importance of the PTX ISA.
* 10:22 Multiple Meanings of "CUDA": CUDA refers to different layers of the stack: the software platform, the abstract programming model, and the hardware architecture.
* 11:38 CUDA Software Platform: This layer includes the CUDA runtime and driver APIs, facilitating interaction between application code and the GPU.
* 13:28 CUDA Programming Model: This abstract model defines how programs are written at the thread level, emphasizing shared memory and synchronization within thread blocks (see the sketch after this summary).
* 16:47 Independence of Abstraction: The programming model is independent of specific hardware or language implementations, focusing on what can and cannot be assumed about parallel execution.
* 17:48 Compute Unified Device Architecture (CUDA): This approach to hardware design emphasizes a homogeneous array of streaming multiprocessors (SMs) for scalability, contrasting with earlier heterogeneous GPU designs.
* 20:38 Historical Context: The 2008 whitepaper by Lindholm et al. provides a comprehensive overview of the CUDA vision, from hardware to ecosystem.
* 21:31 Recommendation: The presenter strongly recommends reading the Lindholm et al. whitepaper for a deeper understanding of CUDA's foundational principles.
* 21:52 Parallel Thread Execution (PTX) ISA: PTX is highlighted as the most crucial part of the stack, acting as an intermediate representation that enables forward compatibility and transparent scaling.
* 26:25 PTX and Forward Compatibility: PTX allows programs to run on different GPUs and benefit from new hardware features without recompilation.
* 27:45 PTX Virtual Machine: PTX defines a virtual machine with multiple processors and a memory hierarchy, ensuring predictable program behavior.
* 28:41 Constraints and Scalability: The constraints in the CUDA programming model, such as limitations on synchronization, enable transparent scaling across different hardware configurations.
* 30:15 Future of the Project: The presenter discusses plans to enhance the resource, including interactive elements, expanded content, and potential collaborations.
* 31:56 Community Resource: The goal is to make the documentation a valuable community resource, potentially open-sourcing it in the future.
* 32:09 Short-Term Goals: Plans include making the documentation compatible with language models (e.g., "CUDA Docs for Chatbots"), adding interactive code snippets, improving diagrams, and expanding content on synchronization and thread block clusters.
* 38:21 External Feedback: The presenter emphasizes the importance of community feedback and contributions to improve the resource.
* 38:45 Medium-Term Goals: Future plans involve covering performance debugging, GPU fleet management, multi-GPU execution, and potentially partnering with universities for educational content.
* 40:53 Call for Collaboration: The presenter invites collaboration on these medium-term goals, particularly in areas like performance debugging and multi-GPU programming.
* 44:06 Hiring at Modal: The presenter's company, Modal, is hiring GPU experts and offering opportunities for open-source contributions.
* 45:48 Closing Remarks: The presenter thanks the audience and encourages further engagement on Discord and Twitter.
* 46:55 Document Availability: The internal document "I am done not understanding the CUDA stack" has been incorporated into the public GPU Glossary.
* 47:01 Discussion on Documentation Issues: The presenter acknowledges challenges with interlinking information across different documentation sources and suggests that community involvement can help address these issues.
* 49:16 Challenges of Documentation: The presenter notes that documenting the absolute frontiers of performance is inherently difficult due to the breakdown of abstractions and the need for real-world experience to refine understanding.
I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.04 Input tokens: 25218 Output tokens: 1137
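Not from the talk itself, but a minimal sketch of the programming-model ideas in the 13:28 bullet (a grid of thread blocks, per-block shared memory, and synchronization guaranteed only within a block), written with Numba's CUDA bindings purely for illustration; the kernel, names, and sizes are my own assumptions:

```python
import numpy as np
from numba import cuda, float32

BLOCK = 128  # threads per block

@cuda.jit
def block_sum(x, out):
    # Shared memory: visible only to the threads of one block.
    tmp = cuda.shared.array(BLOCK, float32)
    tid = cuda.threadIdx.x
    i = cuda.blockIdx.x * cuda.blockDim.x + tid
    if i < x.size:
        tmp[tid] = x[i]
    else:
        tmp[tid] = 0.0
    cuda.syncthreads()           # barrier is defined only within a thread block
    stride = BLOCK // 2
    while stride > 0:            # naive tree reduction inside the block
        if tid < stride:
            tmp[tid] += tmp[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        out[cuda.blockIdx.x] = tmp[0]

x = np.arange(1024, dtype=np.float32)
blocks = (len(x) + BLOCK - 1) // BLOCK
out = np.zeros(blocks, dtype=np.float32)
block_sum[blocks, BLOCK](x, out)   # launch: grid of blocks x threads per block
print(out.sum(), x.sum())          # per-block partial sums are combined on the host
```

Nothing here lets one block wait on another; that restriction is exactly what the 28:41 bullet credits for transparent scaling across GPUs with different SM counts.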
bruh is a netrunner and nvidia pooed on his head.
very helpful talk! love this
Always liked listening to Charles.
thanks!
Why don't we need a mask in naive_matmul_k?
This is an awesome lecture. Thank you so much!
thank you
Thank you❤
One minute in and my "um" counter already threw an overflow exception.
Thank you for this awesome content, super helpful! Just a small piece of advice for those who don't have CUDA (for instance, macOS users): I am using a Colab Tesla T4 GPU.
Could you please tell me where to download the dataset you demonstrated in your video?
You misspelled his name; it is "Wizard", not Aroun. This was amazing. He is so well spoken and knowledgeable! How can he even make all of this up on the fly?! And there is still a good structure to all of it. I could listen to him speak about this for another couple of hours! This also highly motivates me to convert my own project to CUDA. He conveys this "you can just do things" feeling. I have done a good amount of CUDA coding already, but it always felt like poking a black box. I think the tools and his workflows around them might be even more valuable than the other details.
👑
Could you add a link to Jay's blog? Also, what's the blog by Simon referred to in the beginning? Thanks!
good lecture
Fantastic, thank you!
I’m here. I made it
11:18 I think a previous lecture mentioned that floating-point addition is commutative but not associative, which is actually what causes the problem in the parallel reduction algorithm.
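A tiny illustration of that point: each individual addition is commutative, but changing the grouping (which is exactly what a parallel reduction does) can change the rounded result. The numbers below are an extreme case chosen to make the effect visible:

```python
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)      # 1.0  -- sequential, left-to-right grouping
print(a + (b + c))      # 0.0  -- the 1.0 is rounded away inside (b + c)
print(a + b == b + a)   # True -- commutativity of a single add still holds
```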
Can we get the code file?
This is pure gold!
Thank you
Thank you
Awesome!
Great explanation, thanks
Great video really!
worse but faster
thanks for the video. Triton's documentation tutorial starts directly with coding, and it was really difficult for me. Thanks for sharing the programming model first.
*Lecture 33: Bitblas - Enabling Efficient Low Precision Computing with Hardware Aware Transformations*
* 0:00 Introduction: James Melvin introduces Lei Wang, a research intern at Microsoft Research, who presents Bitblas, a kernel library and end-to-end compiler for high-performance mixed-precision computing. He also introduces a Triton-like programming language called TI Language.
* 1:58 Mixed Precision Computing Background: Lei Wang explains the shift towards lower bit formats in AI models for memory efficiency. He outlines three challenges: lack of custom precision format support in hardware/software, limited mixed-precision instructions, and vast computation combinations requiring extensive optimization.
* 6:20 Insights and Abstractions: Two core insights drive Bitblas: flexible data type representation in memory allows reinterpretation in software, and custom data types can be converted to standard types to leverage existing hardware instructions (illustrated in the sketch after this summary).
* 7:18 Tensor-Centric Abstractions: Bitblas introduces abstractions like TI Type (custom data types), TI Tile (tensors), Index Map (data layout), and scheduling templates to manipulate tensors. This enables defining computations with explicit data types and layouts.
* 13:30 Finding the Right Instructions: Bitblas includes a "Bit Machine Instruction" framework to select the most efficient hardware instructions based on data type and FLOPs. An iterator classification method maps computations to target instructions (e.g., Tensor Cores).
* 17:34 Optimizing Data Layouts: Bitblas infers memory layouts aligned with hardware instructions to minimize memory access issues. The TI approach further optimizes by fusing operators and propagating layouts through the tensor graph.
* 20:40 Layout Propagation: Challenges in layout propagation include misalignment between problem scale and instructions, computations outside core instructions, and layout transformations affecting correctness. Bitblas categorizes layouts and implements specific propagation methods.
* 26:14 Deciding When to Dequantize: Bitblas uses a latency-oriented policy to determine the optimal stage for dequantization (registers, shared memory, or global memory), trading off compute overhead and memory savings.
* 29:00 Bitblas Systems, Ladder and Bitblas: Ladder is an end-to-end compiler that optimizes operator fusion and generates efficient CUDA kernels. Bitblas is a kernel library with a simplified API abstracting tensor transformations.
* 32:58 Optimization Tricks: Bitblas implements fast dequantization techniques using vectorization and specialized instructions for improved performance, especially for low bit widths.
* 40:58 Kernel Code Generation for Dynamic Shapes: Bitblas addresses the challenge of dynamic shapes in LLMs by generating code for segments of the dynamic dimension and storing optimal configurations for dispatch.
* 46:42 Performance Results: Bitblas demonstrates significant speedups over existing systems and hand-written kernels across various hardware and models, including AMD GPUs. Scaling experiments with Llama models show memory and compute benefits with lower precision.
* 51:06 Challenges and Future Work: Kernel compilation time, complexity of Bitblas scheduling, and the limitations of schedule-based implementations are highlighted as areas for future work.
* 51:49 Bitblas Code Overview and TI Language: Lei Wang provides a brief overview of the Bitblas code structure and highlights TI Language, a new programming language designed for ease of kernel development with support for custom data types, layouts, and hardware instructions.
I used gemini-1.5-pro-002 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03 Input tokens: 24672 Output tokens: 716
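Not Bitblas's actual API, just a tiny sketch of the 6:20 insight that low-bit values stored in a standard container type can be reinterpreted in software and dequantized to a hardware-supported type; the packing scheme, values, and scale below are made up for illustration:

```python
import numpy as np

# Hypothetical data: two unsigned 4-bit weights packed into each uint8 byte,
# plus a single per-tensor scale.
packed = np.array([0x4F, 0xA3], dtype=np.uint8)
scale = np.float16(0.05)

lo = (packed & 0x0F).astype(np.float16)   # low nibble of each byte
hi = (packed >> 4).astype(np.float16)     # high nibble of each byte

# Interleave and rescale to FP16, a type that standard GEMM paths understand.
weights = np.stack([lo, hi], axis=-1).reshape(-1) * scale
print(weights)   # roughly [0.75 0.2 0.15 0.5]
```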
what a cutie patootie
Super useful! Thank you!!!
Thanks! Very helpful.