You can improve the performance of the CUDA kernel significantly if you set the value of the constant 'a' using the Float32 datatype:
const a = Float32(3.1416)
When you do that, the performance of the CUDA kernel should be about the same as the performance you get using CUBLAS.axpy!().
Also, the @btime macro runs axpy!() multiple times, so the values in y are overwritten multiple times, which is why the values look off.
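Here's a minimal sketch of what I mean (the array size, thread count, and kernel name are just my assumptions, not from the episode):
using CUDA
const a = Float32(3.1416)   # a Float32 literal avoids Float64 promotion inside the kernel
x = CUDA.rand(Float32, 100_000_000)
y = CUDA.rand(Float32, 100_000_000)
function axpy_kernel!(a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]   # all-Float32 arithmetic
    end
    return nothing
end
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(a, x, y)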
The reason why the "y value" is different is that you are overwriting it several times while benchmarking. The reason the last digit is more accurate when broadcasting is that you declared "a" without giving it a type, so Julia defaulted it to Float64 and promoted the results to Float64 during the broadcast. Since the external library is for single precision, it just kept the values as Float32, which is actually the expected behavior.
That is also why the results were slower with broadcasting, but you already mentioned that.
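You can see the promotion in a plain CPU session too (a small sketch; the same idea applies on the GPU):
julia> a = 3.1416              # no type annotation, so Float64
3.1416
julia> x = Float32[1.0, 2.0];
julia> eltype(a .* x)          # broadcasting promotes to Float64
Float64
julia> eltype(Float32(a) .* x) # stays single precision
Float32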
Do you know why I get a different value every time I use the @btime macro with axpy!()? I thought @btime was running it 10,000 times, but I get a different result every time I run it.
@doggodotjl BenchmarkTools runs a benchmark with two constraints, sample time and max samples, whichever is hit first. The defaults are 5 seconds and 10,000 samples; you can see that with the following:
julia> bm = @benchmarkable CUDA.@sync CUBLAS.axpy!($dim, $a, $x, $y)
Benchmark(evals=1, seconds=5.0, samples=10000)
julia> run(bm)
BenchmarkTools.Trial: 821 samples with 1 evaluation.
Range (min … max): 5.933 ms … 12.289 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.994 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.066 ms ± 435.365 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
julia> run(bm)
BenchmarkTools.Trial: 815 samples with 1 evaluation.
Range (min … max): 5.921 ms … 17.940 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.995 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.134 ms ± 821.625 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
You can see that the number of samples varied between the two runs. You can force it to always use the same number of samples with:
julia> run(bm, samples=1000)
And then the number of samples will always be the same.
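And if you want y to start from the same values for every sample, instead of being overwritten repeatedly, you can add a setup expression (a sketch, assuming dim, a, x, and y are defined as before):
julia> bm = @benchmarkable CUDA.@sync CUBLAS.axpy!($dim, $a, $x, yy) setup=(yy = copy($y))
julia> run(bm, samples=1000)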
@inomo Ah, very cool. Thank you for sharing your knowledge! I learned a lot just now by reading your reply.
Hi Doggodotjl! Thanks for your videos, they help me a lot.
I just did what you talked about in episode 06x11 and found that when I use const a = 3.1416, Julia declares a as a Float64 by default, and the CUDA kernel takes 5.396 ms, while it takes 3.932 ms using const a = Float32(3.1416). But it seems to make little difference for the broadcast method from episode 06x10: the broadcast time is around 3.99 ms whether a is Float64 or Float32. I think the broadcast method may do a lot of preprocessing, like type conversion, to make sure the performance is good enough. It's very cool to learn that the CUDA kernel method will perform better, though the programmer should be experienced. However, a newcomer to CUDA.jl like me should probably use the broadcast method instead.
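Here is roughly the broadcast comparison I mean (the array size is just a placeholder; your timings will depend on your GPU):
using CUDA, BenchmarkTools
x = CUDA.rand(Float32, 100_000_000)
y = CUDA.rand(Float32, 100_000_000)
a64 = 3.1416            # Float64 by default
a32 = Float32(3.1416)
@btime CUDA.@sync $y .= $a64 .* $x .+ $y   # broadcast with the Float64 constant
@btime CUDA.@sync $y .= $a32 .* $x .+ $y   # broadcast with the Float32 constant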
Thanks for sharing your experience! You'll become an expert in no time!
Can Julia make it like JAX, where you don't need to specify which device you are using?
I'm not familiar with JAX, so I'll let others weigh in, but with Julia, I think you need to be explicit about which device you want to use.
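That said, a lot of Julia array code ends up device-agnostic in practice, because broadcasting dispatches on the array type. A minimal sketch of that pattern (CUDA.functional() just reports whether a usable GPU is present):
using CUDA
xs = CUDA.functional() ? CUDA.rand(Float32, 10) : rand(Float32, 10)
ys = 2.0f0 .* xs .+ 1.0f0   # the same broadcast code runs on a CPU Array or a CuArray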