You can improve the performance of the CUDA kernel significantly if you set the value of the constant 'a' using the Float32 datatype:
const a = Float32(3.1416)
When you do that, the performance of the CUDA kernel should be about the same as the performance you get using CUBLAS.axpy!().
Also, the @btime macro runs axpy!() multiple times, so the values in y are overwritten multiple times, which is why the values look off.
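Here's a minimal sketch of what I mean (the array size, thread count, and kernel name are just my assumptions, not from the episode):
using CUDA
const a = Float32(3.1416)   # a Float32 literal avoids Float64 promotion inside the kernel
x = CUDA.rand(Float32, 100_000_000)
y = CUDA.rand(Float32, 100_000_000)
function axpy_kernel!(a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]   # all-Float32 arithmetic
    end
    return nothing
end
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(a, x, y)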
The reason why the "y value" is different is that you are overwriting it several times while benchmarking. The reason the last digit is more accurate when broadcasting is that you declared "a" without giving it a type, so Julia defaulted it to Float64 and promoted the results to Float64 during the broadcast. Since the external library is for single precision, it just kept the values as Float32, which is actually the expected behavior.
That is also why the results were slower with broadcasting, but you already mentioned that.
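You can see the promotion in a plain CPU session too (a small sketch; the same idea applies on the GPU):
julia> a = 3.1416              # no type annotation, so Float64
3.1416
julia> x = Float32[1.0, 2.0];
julia> eltype(a .* x)          # broadcasting promotes to Float64
Float64
julia> eltype(Float32(a) .* x) # stays single precision
Float32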
Do you know why I get a different value every time I use the @btime macro with axpy!()? I thought @btime was running it 10,000 times, but I get a different result every time I run it.
@doggodotjl BenchmarkTools runs a benchmark with two constraints, sample time and max samples, whichever is hit first. The defaults are 5 seconds and 10,000 samples; you can see that with the following:
julia> bm = @benchmarkable CUDA.@sync CUBLAS.axpy!($dim, $a, $x, $y)
Benchmark(evals=1, seconds=5.0, samples=10000)
julia> run(bm)
BenchmarkTools.Trial: 821 samples with 1 evaluation.
Range (min … max): 5.933 ms … 12.289 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.994 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.066 ms ± 435.365 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
julia> run(bm)
BenchmarkTools.Trial: 815 samples with 1 evaluation.
Range (min … max): 5.921 ms … 17.940 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.995 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.134 ms ± 821.625 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
You can see that the number of samples varied between the two runs. You can force it to always use the same number of samples with:
julia> run(bm, samples=1000)
And then the number of samples will always be the same.
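And if you want y to start from the same values for every sample, instead of being overwritten repeatedly, you can add a setup expression (a sketch, assuming dim, a, x, and y are defined as before):
julia> bm = @benchmarkable CUDA.@sync CUBLAS.axpy!($dim, $a, $x, yy) setup=(yy = copy($y))
julia> run(bm, samples=1000)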
@inomo Ah, very cool. Thank you for sharing your knowledge! I learned a lot just now by reading your reply.
Hi Doggodotjl! Thanks for your videos, they help me a lot.
I just did what you talked about in episode 06x11 and found that when I use const a = 3.1416, Julia declares a as a Float64 by default, and the CUDA kernel takes 5.396 ms, while it takes 3.932 ms using const a = Float32(3.1416). But it seems to make little difference for the broadcast method from episode 06x10: the broadcast time is around 3.99 ms whether a is Float64 or Float32. I think the broadcast method may do a lot of preprocessing, like type conversion, to make sure the performance is good enough. It's very cool to learn that the CUDA kernel method will perform better, though the programmer should be experienced. However, a newcomer to CUDA.jl like me should probably use the broadcast method instead.
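Here is roughly the broadcast comparison I mean (the array size is just a placeholder; your timings will depend on your GPU):
using CUDA, BenchmarkTools
x = CUDA.rand(Float32, 100_000_000)
y = CUDA.rand(Float32, 100_000_000)
a64 = 3.1416            # Float64 by default
a32 = Float32(3.1416)
@btime CUDA.@sync $y .= $a64 .* $x .+ $y   # broadcast with the Float64 constant
@btime CUDA.@sync $y .= $a32 .* $x .+ $y   # broadcast with the Float32 constant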
Thanks for sharing your experience! You'll become an expert in no time!
Can Julia make it like JAX, where you don't need to specify which device you are using?
I'm not familiar with JAX, so I'll let others weigh in, but with Julia, I think you need to be explicit about which device you want to use.
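That said, a lot of Julia array code ends up device-agnostic in practice, because broadcasting dispatches on the array type. A minimal sketch of that pattern (CUDA.functional() just reports whether a usable GPU is present):
using CUDA
xs = CUDA.functional() ? CUDA.rand(Float32, 10) : rand(Float32, 10)
ys = 2.0f0 .* xs .+ 1.0f0   # the same broadcast code runs on a CPU Array or a CuArray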