That is the most enlightening lesson on cache-related performance effects that I have ever had! Thank you very much, Nick! Cheers, Simone
Glad you enjoyed it! :^)
In every tutorial people talk theory, but you show it in practice, which makes you special! Thank you! BIG FAN :)
Currently trying to break into the HFT space, your video is super helpful!
It's incredible how much of a difference it makes, and it's explained very well and clearly. Thanks for sharing; fantastic video.
Glad you enjoyed it :^)
In the false_sharing example, since each thread has its own counter, there is no need to use std::array for the counters. I found that using std::array makes the program much faster.
Do we still need -lpthread with modern g++? I thought GCC is now smart enough to link the library by default.
What would happen if an array of int was used instead of atomic? Would the other elements of the array be invalidated as well if they are not atomic?
Yep - invalidations occur whether or not you use atomic. If a core wants to write to an integer on a cache line that is currently in another core's cache, it has to invalidate that cache line and get exclusive access to it.
Atomic makes sure that the entire read+modify+write operation occurs as one atomic unit (i.e., it can't be broken up). Atomic is more of a specifier for how actions are performed on a piece of memory (not really the details of what that memory looks like). An atomic int and a regular int are both just 4B of memory. With an atomic int, though, things like increment result in atomic instructions (e.g., an increment with the lock prefix on x86), while an increment of a normal integer will just result in something like a plain increment instruction (without a lock prefix).
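As a minimal sketch of the difference (the assembly notes in the comments are what x86-64 compilers typically emit at -O2; exact output varies by compiler and flags):

```cpp
#include <atomic>

std::atomic<int> atomic_counter{0};
int plain_counter = 0;

void bump_atomic() {
    // Typically a single locked read-modify-write:
    //   lock add dword ptr [atomic_counter], 1
    atomic_counter.fetch_add(1, std::memory_order_relaxed);
}

void bump_plain() {
    // Typically a plain add/inc with no lock prefix
    ++plain_counter;
}
```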
@@CoffeeBeforeArch Hm, I tried it on my machine and got basically no cache misses when running perf c2c, and the program ran in 0.x sec or so, like in the case with alignas. Any thoughts?
@@thisisolorin394 If the program completed almost instantly, there's a good chance that the compiler just optimized away the operation. Compiler optimizers are pretty clever about getting rid of code that generates values that are not used later on in the program (things like atomic prevent that from happening).
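A sketch of what that can look like (illustrative; exact behavior depends on the compiler and optimization level):

```cpp
#include <atomic>

// The plain loop produces a value that nothing ever reads, so at -O2 the
// optimizer is free to delete it entirely, and the "benchmark" finishes
// in essentially zero time.
void plain_loop() {
    int counter = 0;
    for (int i = 0; i < 100'000'000; ++i)
        ++counter; // dead store: counter never escapes this function
}

// Atomic read-modify-write operations are, in practice, not optimized
// away, so this loop (and its cache traffic) actually executes.
void atomic_loop() {
    std::atomic<int> counter{0};
    for (int i = 0; i < 100'000'000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}
```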
Hello. What is your compiler?
GNU C++ (g++); I couldn't see the version.
Isn't this alignas a bit of a dirty hack? ;) I mean, you're using up way more memory than you need to. In this case it doesn't matter, but what do you do in cases where it does? Another thing: what if your code is supposed to work on multiple different types of processors and there simply ISN'T a single cache line size?
I wonder if this problem can be solved in a different way somehow.
It's a pretty standard technique for solving these kinds of problems. A dummy array of padding bytes is also pretty standard, and is shown as a solution in a recent performance blog post by Netflix - netflixtechblog.com/seeing-through-hardware-counters-a-journey-to-threefold-performance-increase-2721924a2822
There is usually a space-speed trade-off in performance (e.g., double buffering doubles the amount of memory allocated to buffers). Padding to avoid false sharing is generally small and not a major overhead in terms of overall memory capacity. You only need the padding at the edges of where a problem is being partitioned, and it's at most a cache line of padding at each boundary (in general, only 64B each). If you are on some super memory-constrained embedded system, your only real option would be to move around the data members that are being written to so that they are naturally spaced apart (or find a way to limit the number of writes you're doing in the first place).
Software optimizations are not always portable - you're inherently tuning to the underlying architecture your code is running on. If you have a different cache line size, you might have to tune it differently. You could just have the spacing controlled by something like a preprocessor macro that is set at compile-time based on the cache line size of the target architecture.
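A sketch of that approach (CACHE_LINE_SIZE here is a hypothetical macro you'd set at build time, e.g. -DCACHE_LINE_SIZE=128; C++17's std::hardware_destructive_interference_size is an implementation-provided hint where available):

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Pick the padding/alignment size at compile time for the target.
#if defined(CACHE_LINE_SIZE)
constexpr std::size_t kCacheLine = CACHE_LINE_SIZE;
#elif defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64; // common default on x86-64 and many ARM cores
#endif

// Each counter is aligned to (and so occupies) its own cache line,
// so concurrent writers don't falsely share.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int> value{0};
};
```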
If the counter variable were local to the lambda, it would also avoid false sharing. Scott Meyers showed a similar example in his presentation about caches.
@@matheusmarchetti628 That's a totally different use case: with the counter variable being thread-local, the performance penalty from cache coherence does not apply.
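For anyone curious, a minimal sketch of that local-accumulator pattern (names are illustrative): each thread counts in a stack-local variable and touches shared state only once at the end, so nothing ping-pongs between caches during the hot loop.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void work(std::atomic<long>& total, long iters) {
    long local = 0;           // thread-private: no coherence traffic
    for (long i = 0; i < iters; ++i)
        ++local;
    total.fetch_add(local);   // a single shared write per thread
}

int main() {
    std::atomic<long> total{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(work, std::ref(total), 100'000'000L);
    for (auto& th : threads)
        th.join();
    return total == 400'000'000L ? 0 : 1;
}
```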