Have I helped you with this video? If yes, please, consider buying me a ☕ coffee at www.buymeacoffee.com/janwilczek
Thanks! 🙂
Thanks for your great introduction and lively demo! I really like your pace!
Thanks, this was a great intro to this topic. I wanted to get started with SIMD, and this will put me on the right track.
If you want to make sure you compile with SIMD instructions specific to the host CPU, you can use LLVM bindings for the language of your choice and then compile through LLVM. Interesting vid!
Thank you Jan!
Thanks for commenting! :)
Great video! Looking forward to the next one. Next time, could you include more on the ARM and RISC-V cases?
Great job explaining, and demonstrating. Thank you.
You made the concept easy to understand, thank you. Would like to see some C examples if it's possible too.
Hey, this is a difficult topic, you can't find anything on the Internet, so thank you, Jan!
I'm very glad, thanks to you too!
Line 13 is killing me lol
Very helpful video. I was working on a particle system/simulation, and I use GL to draw the particles. Was wondering with SIMD and GL, how can I draw multiple particles at once? Or is this something more to do with GL buffers?
Many DSP algorithms contain single-sample feedback. Can anything be done to vectorize these algorithms? It seems like the feedback complicates any attempt to use block processing to vectorize.
Nice video & nicely paced and clear. Just what I needed to get this topic a bit more. Just need some more examples of calculations actually taken care of by the SIMD extension sets, and perhaps some alternative SIMD/FFT libraries with info about what does what and how, that would be epic. Not many people teaching this in audio with such good phrasing! Keep up the great work! 👍
I hadn't read the article you wrote about this topic before. It is great, with much more info and more depth, thanks!
I understand the concept of SIMD. But in the code I can see that you add each set of values as soon as it is loaded into the register, which looks equivalent to scalar addition. I assume this avoids one more for loop to store the sums in the result array, which makes sense. This leads me to ask: does the intrinsic function perform the addition only when all 256 bits are filled with values, or can it also perform it otherwise?
Yes, you helped a lot ^_^
That's great ;)
Great job,
but doesn't the second for loop defeat the entire purpose of using SIMD?
Late reply, you probably already have figured it out by now. Responding anyway for others with the same question.
That's like saying planes are pointless for traveling large distances, because you still need to walk the short distance to your destination from the airport.
SIMD will do a large portion of the work, in this case in multiples of 8, and the regular loop will finish the remaining elements.
So for the normal loop you are looking at:
N*scalar
while for SIMD you are getting:
floor(N/8)*SIMD + (N%8)*scalar
Since by design 1*SIMD will be faster than 8*scalar, for sizes greater than or equal to 8 the second algorithm will be faster than just doing the first loop. Otherwise, for sizes smaller than 8, it will be the same as the first loop plus some overhead from the division by 8.
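For anyone who wants to see that split as code, here is a rough C sketch of the structure, not the code from the video. It assumes 8 floats per vector (as with 256-bit registers). The inner fixed-width loop is only a stand-in for one SIMD addition, not a real intrinsic, and the names `add_arrays` and `LANES` are made up for illustration.

```c
#include <stddef.h>

#define LANES 8 /* assumption: 8 floats fit in one 256-bit register */

/* Adds a and b element-wise into out.
 * Returns how many elements were handled by the scalar tail loop. */
size_t add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    size_t vector_end = n - (n % LANES); /* floor(n / LANES) * LANES */

    /* Main loop: floor(n / LANES) blocks of LANES elements.
     * The inner loop stands in for a single SIMD add of one block. */
    for (; i < vector_end; i += LANES) {
        for (size_t k = 0; k < LANES; ++k) {
            out[i + k] = a[i + k] + b[i + k];
        }
    }

    /* Tail loop: the remaining n % LANES elements, one at a time. */
    size_t tail = 0;
    for (; i < n; ++i, ++tail) {
        out[i] = a[i] + b[i];
    }
    return tail;
}
```

With n = 19 the main loop covers two blocks (16 elements) and the tail loop covers the last 3, which matches floor(N/8)*SIMD + (N%8)*scalar.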