GTC 2022 - How CUDA Programming Works - Stephen Jones, CUDA Architect, NVIDIA

Christopher Hollinworth

11,934 views

Published on 26 Jun 2024
Come for an introduction to programming the GPU by the lead architect of CUDA. CUDA's unique in being a programming language designed and built hand-in-hand with the hardware that it runs on. Stepping up from last year's "How GPU Computing Works" deep dive into the architecture of the GPU, we'll look at how hardware design motivates the CUDA language and how the CUDA language motivates the hardware design. This is not a course on CUDA programming. It's a foundation on what works, what doesn't work, and why. We'll tell you how to think about a problem in a way that will run well on the GPU, and you'll see how the CUDA programming model is built to run that way. If you're new to CUDA, we'll give you the core background knowledge you need - getting started begins with understanding. If you're an expert, hopefully you'll face your next optimization problem with a new perspective on what might work, and why.

Comments • 14

@citizensmith3074 2 years ago ⁺¹²
This video is pure gold: thanks so much for uploading, I've learnt so much from it. I may have to watch it several times though! A great overview and introduction to so many areas for further study.
@TheAIEpiphany 11 months ago ⁺¹
1) One thing that's confusing: if reading from a memory location in a different row is 3x slower than reading from a memory location in the same row, how come we get a 13x slowdown? In the worst case (if you're deliberately reading from a different row each time), one would expect a 3x slowdown?
What am I missing? Is it the burst mode?
2) You're using the float2 type, so that means your thread is loading 4 bytes (for 2 points), not 8 bytes? Which would put the 4 warps into 512B loading territory instead of the optimal 1024? -> EDIT: ok, I just saw that p1 & p2 are actually float pointers, so that does make sense.
3) How can we guarantee that the p1 & p2 arrays (holding the points) are adjacent, i.e. in the same physical row in memory?
Great video! The sound quality is a bit off though.
@brady1123 5 months ago
It's 3x slower for reading a single value, but it gets worse when reading many contiguous values, because the burst column read can fetch many values in one operation.
For example, let's say that we're reading two sets of 10 values: one set whose values are all contiguous in a row, and one set whose values are all on different rows. And you have the three ops in the video: LOAD a row, READ a column, STORE the row back.
For the contiguous values: time = LOAD + BURST READ + STORE = 3 ops
For the disjoint values: time = (LOAD + READ + STORE)*10 = 30 ops
That's how you get a 10x slowdown.
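The op counting in the reply above can be written out as a small Python sketch. It is only an illustration of that simplified model, not a description of real DRAM timing: the function name `row_access_ops` is hypothetical, and it assumes that LOAD, READ and STORE each cost one "op" and that a burst read of contiguous values in one row counts as a single READ.

```python
def row_access_ops(n_values: int, contiguous: bool) -> int:
    """Count DRAM ops under the simplified LOAD/READ/STORE model.

    Assumptions (not real hardware timings): every op costs 1 unit,
    and one burst READ fetches all contiguous values in an open row.
    """
    LOAD, READ, STORE = 1, 1, 1
    if contiguous:
        # Open the row once, burst-read every value, write the row back.
        return LOAD + READ + STORE
    # Every value sits in a different row, so the full cycle repeats.
    return n_values * (LOAD + READ + STORE)


contiguous_ops = row_access_ops(10, contiguous=True)
scattered_ops = row_access_ops(10, contiguous=False)
slowdown = scattered_ops / contiguous_ops
print(contiguous_ops, scattered_ops, slowdown)
```

Under these assumptions the slowdown grows with the number of values that a single burst could have covered, which is why it can exceed the 3x cost of a single row miss.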
@steveHoweisno1 a year ago
Excellent. For the matrix multiply, you’re reusing the same row multiple times but the columns would have to be loaded in every time. So how do you increase compute intensity of the columns?
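One common answer to this question, which is not given in the thread itself, is tiling: if each step computes a square tile of the output rather than a single row, the loaded columns are reused across the tile's rows just as the rows are reused across its columns. The Python sketch below only does the arithmetic for that idea. The function name `tile_compute_intensity` and the cost model (one multiply and one add per term, every input element loaded from memory exactly once per output tile) are assumptions made for illustration.

```python
def tile_compute_intensity(n: int, tile: int) -> float:
    """FLOPs per input element loaded for one tile x tile output block
    of an n x n matrix multiply.

    Assumed model: the block needs `tile` rows of A and `tile` columns
    of B, each of length n, and each element is loaded exactly once.
    """
    flops = 2 * tile * tile * n   # one multiply + one add per term
    loads = 2 * tile * n          # tile rows of A plus tile columns of B
    return flops / loads


# A 1 x 1 "tile" is the naive one-output-at-a-time case.
naive = tile_compute_intensity(1024, 1)
tiled = tile_compute_intensity(1024, 32)
print(naive, tiled)
```

In this model the intensity equals the tile width, so larger tiles raise reuse for rows and columns alike, limited in practice by how much fast on-chip memory is available to hold the tile.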
@webgpu 7 months ago
Christopher, do you think the long time it takes for RAM to be accessed could be decreased by embedding a basic CPU in those RAM modules?
@christopherhollinworth7405 7 months ago ⁺¹
Good question, I don't know!
@codingmachine2817 a year ago ⁺³
33:10 FlashAttention proved this wrong
@brady1123 5 months ago
"Occupancy is the most powerful tool that you have for tuning a program. **Once you're doing your best for memory access patterns** there's pretty much no algorithmic optimization that you can do that'll speed your program up by as much as 33%"
I thought FlashAttention's major contribution was optimizing memory access patterns, namely reducing the number of HBM loads/stores.
@ChimiChanga1337 3 months ago
Can you please explain this a bit more? I'm trying to teach myself FlashAttention's CUDA code.
@dGooddBaddUgly 2 months ago ⁺²
Looks like Intel is out of the question here.
@christopherhollinworth7405 2 months ago
They are getting better in terms of energy efficiency and performance: www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html
@GeorgePaul82 4 months ago
Is there a chance you can do a video about why AMD's version isn't as good as NVIDIA's?
@christopherhollinworth7405 4 months ago
I don't have an AMD graphics card. ZLUDA means it does not really matter: www.phoronix.com/review/radeon-cuda-zluda
@ryderbrooks1783 3 months ago
AMD's issue is tooling and the general software ecosystem. The hardware is reasonably close.
