Write contention is the primary scalability limit for write-invalidate cache-coherence protocols: each thread that wants to update a cache line must fetch the line in exclusive (or modified) state and invalidate all other copies. Lock-based and lock-free designs that have the same memory-write behavior, from the coherence protocol's perspective, will therefore scale and perform similarly. There are pitfalls to watch out for, though. A CAS may request the line in exclusive state even if the comparison fails and no update is performed ("false write contention"?). Similarly, spin-lock implementations that use exchange instead of compare_exchange can invalidate the line holding the lock in the owner's cache even while the lock is busy. CPU microarchitecture, inter-CPU interconnect latencies, and thread count/contention level also influence the actual impact on performance and scalability.
I'm curious: if a singly linked list or queue is used only for task dispatch, child threads would each just take one node and leave. They only have to modify the head, and then each consumes its own node. That way is much simpler.
18:11 I don't believe spinlocks are power-efficient at all. If an atomic operation takes more CPU time but doesn't waste power spinning, it's probably more efficient overall even though it takes longer.
It's a shame how badly this presentation was prepared. Hopefully nobody starts learning concurrency here. Don't get me wrong, I like most of Fedor's talks, but in this one I'm not sure what the point is of showing performance results without providing an explanation.
I really hoped you'd test ticket locks as well. They are much more efficient than binary spinlocks in kernels (where the lock holder cannot be preempted).
Does anybody know the URL of Dan's talk that Fedor mentioned at the beginning of his talk?
Did you find it? Thank you
On plenty of architectures (e.g. ARM), incrementing a std::atomic is not wait-free.
ARMv8.1-a introduces atomic instructions such as LDADD and STADD. There could still be delays due to contention and I don't know to what extent fairness is guaranteed.
Great talk!
It's a shame how bad this presentation is prepared. Hopefully nobody starts learning concurrency from here. Don't get me wrong, I like most of the Fedor talks, but in this one, I'm not sure what's the point of showing some performance results without providing explanation.
It's a shame they have great folks but have produced such bad, useless standard-library APIs since C++17.