I'm a simple man. I see a Bob Steagall video, I click on it. Awesome presenter and you always learn a thing or two.
This was the first time I managed to follow along with a SIMD lecture. I won't be able to do it myself or explain most of it, but I didn't feel lost.
Now that was an amazing tutorial on SIMD and how 'windowing' is applied to input data (signals). 👍
Glad to hear how much you appreciated this presentation.
1:22:33 - my suspicion is that clang was spilling registers, which is always something to check for. LLVM provides a wonderful tool called llvm-mca for exactly this kind of analysis.
llvm-mca takes machine code (assembly) as input and generates the cycle-by-cycle pipeline analysis it predicts a given processor architecture will execute, most importantly where and why it will stall.
AVX-512 with 32 registers was always somewhat of a wild idea, so it wouldn't be surprising if clang's code generation isn't well tuned for it.
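For anyone who wants to try it, here's a minimal sketch of the workflow (the file name kernel.cpp, the axpy loop, and the Skylake-AVX-512 target are my assumptions, not anything from the talk): compile to assembly and pipe it into llvm-mca.

    // Invocation (clang and llvm-mca on PATH):
    //   clang++ -O3 -march=skylake-avx512 -S -o - kernel.cpp | llvm-mca -mcpu=skylake-avx512 -timeline
    #include <cstddef>

    // Simple streaming kernel whose vectorized inner loop llvm-mca can
    // analyze for port pressure, spills, and stall reasons.
    void axpy(float* __restrict y, const float* __restrict x, float a, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

The -timeline view is the part that shows per-instruction dispatch/execute/retire cycles, which is where spills and dependency stalls usually become obvious.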
An excellent talk. Explained everything beautifully.
Glad you liked it!
amazing galaxy brain content is here!
This should be good :)
Strange results; nth_element should be O(N) on average, so at larger inputs it should pull ahead.
Ah, never mind, this is nth_element on 7 elements every time, so yeah, the call overhead will dominate.
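To make the comparison concrete, here's a minimal sketch of the kind of fixed-window call being discussed (the name median7 and the use of a 7-element std::array are my own illustration, not code from the talk):

    #include <algorithm>
    #include <array>

    // Median of a fixed 7-element window via std::nth_element.
    // Expected O(N), but for N == 7 the generic algorithm's branching and
    // call overhead dominate, which is why a branch-free/SIMD approach
    // can win at this size.
    float median7(std::array<float, 7> w)
    {
        std::nth_element(w.begin(), w.begin() + 3, w.end());
        return w[3];
    }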
👍👍👍