Microarchitecture knowledge is essential.
I would say this statement is highly misleading. You are likely to see noticeable improvements with SIMD. Maybe not 30x, but a 4x-8x improvement for non-trivial code is easily reachable in my experience. And it's not going to take that long to write a decent SIMD function.
The sample given is just about the worst example you could possibly show. There is no useful computation, so it's mostly showing loads and stores. As soon as you have chained, meaningful operations like shifts, multiplications, square roots, and masks, you start noticing a big difference.
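For instance, something like this rough SSE sketch (my own illustration, not the talk's example; it assumes SSE support, a length that is a multiple of 4, and non-negative inputs so the square root is defined):

#include <immintrin.h>

// Chained work per element: multiply, square root, clamp.
void scale_sqrt_clamp(const float* in, float* out, int n) {
    const __m128 scale = _mm_set1_ps(2.5f);
    const __m128 limit = _mm_set1_ps(100.0f);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);  // load 4 floats
        v = _mm_mul_ps(v, scale);         // multiply all 4 lanes
        v = _mm_sqrt_ps(v);               // square root of all 4 lanes
        v = _mm_min_ps(v, limit);         // clamp each lane to an upper bound
        _mm_storeu_ps(out + i, v);        // store 4 results
    }
}

The longer the chain of real work between the load and the store, the more the 4-wide (or 8-wide with AVX) arithmetic dominates, and the closer you get to the full lane-count speedup.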
Also, it's really cherry-picking a compiler and code example, because as soon as you have a non-trivial code fragment, you will often see the compiler emit scalar code. Sometimes it might even use xmm registers, but it will still perform a scalar operation on just the first float in the register, for example. Even this misleading example will not be vectorized by all compilers.
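A typical offender (again, a toy example of my own): a data-dependent early exit creates a control dependency that most vectorizers give up on, so you get scalar SSE, one float in the low lane of an xmm register.

// Often compiled to scalar code (movss/comiss on a single lane):
float first_above(const float* a, int n, float threshold) {
    for (int i = 0; i < n; ++i)
        if (a[i] > threshold)  // early exit depends on the data
            return a[i];
    return 0.0f;
}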
My experience so far has been that for very trivial code the compiler manages to make sense of, intrinsics versus a good compiler will be about the same speed, but not worse. But if you compile on multiple compilers, intrinsics will be noticeably faster on some of them. And for non-trivial cases, you are easily looking at a 4x-7x improvement. That means a computation that would take 7 seconds finishes in 1 second, and 7 hours of computation finish in 1 hour.
I find this common disregard for genuinely useful technology extremely harmful. As a good developer you should know where code can be optimized, and you should know when SIMD makes sense. And you should use it, because it can give you at least 5x savings in hardware, energy, and wait times.
Speaker here. I agree with some of the points you made and disagree with others.
Regarding cherry-picking a compiler: there are basically two compilers. And, yes, Clang is better at autovectorization. GCC is worse, but it will autovectorize this example. But you are correct that I intentionally chose the compiler that does a better job. I'm sorry if you felt that was dishonest. It wasn't meant to be.
I chose this example not to cherry pick, but because in a lightning talk, we can't sit there for a minute waiting for the audience to figure out what the code is doing. It has to be simple.
Two comments, regarding both autovectorization and intrinsics. This video is from over a year ago; the compilers keep getting better, and I increasingly prefer to use std::assume_aligned and various branch-free techniques to help guide the compiler to generate good SIMD code. When that fails, turning to Google Highway or intrinsics has been another, sometimes very good, route. You say autovectorization only works on simple examples, but give it a shot. Also, make use of the optimization remarks the compilers output. They can point you to where the compiler failed to vectorize something, and often it's something subtle that many programmers don't understand. For example, a dot product on floats doesn't vectorize because rearranging the order of the adds can produce different answers, so you need the fast-math flag.
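To make that concrete, here is a minimal sketch of what I mean (my own illustration; the 32-byte alignment and the exact flags are assumptions, not something from the video):

#include <memory>

float dot(const float* a, const float* b, int n) {
    // Promise the compiler 32-byte alignment (C++20) so it can use
    // aligned vector loads without a runtime check.
    const float* pa = std::assume_aligned<32>(a);
    const float* pb = std::assume_aligned<32>(b);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += pa[i] * pb[i];  // FP reduction: vectorizing reassociates
                               // the adds, so it needs -ffast-math or
                               // -fassociative-math to be legal
    return sum;
}

And to see why a loop didn't vectorize, ask for the remarks:
clang++ -O3 -Rpass-missed=loop-vectorize file.cpp
g++ -O3 -fopt-info-vec-missed file.cpp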
As for really big (30x) speedups, I've seen them mostly in operations on char vectors. But yes, that's uncommon.
This is so true 😅
For a moment I thought he was going to show how to use SIMD for destructing objects. I was expecting some reinterpret_cast into an object of the same size but with only primitive types, and then, with some magic, parallelized destruction... But I guess not.