I've probably given this tip to more people than any other, but if you're optimizing around an array of any variety, use a size that's a power of two. If you need to constrain an index into such an array, a `bitwise and` is the fastest way to do so. Never use division/modulus and oddly sized arrays, even when setting up the backing store for a hash table. Prime number sizes may yield better distribution, but it's always at the expense of performance. Also, with regards to hash tables, always store the unnormalized hash with the keys so that when you need to resize the table you don't need to query the hashing function again to rebuild the entire table.
While I completely agree regarding power of 2 + bitwise and on one minus the size, I would also not be surprised if the hardware didn't recognize this optimization. In my day job I constrain the size to be a power of 2.
41:00 I had that issue too. My solution was to not store objects in the queue, but just store pointers. the new/delete are done outside. I used another fifo of "Available buffers", pre-allocated pointers that can be reused. Instead of new grab one from it, instead of delete push it back to be reused. That way I get rid of the push/pop memcpy's, and ALSO of runtime allocations. Problem is just that you have to handle when the push fails and the queue is full, you need to either throw the pointer back into the available buffers, or hold onto it until it can be pushed.
I’m glad I’m not the only one struggling to understand memory ordering 😂 To understand this talk fully, I read chapter 5 from the book “C++ Concurrency in Action” (2nd edition), which goes into depth on the topic of memory ordering. Had to read it twice to understand everything, but it was worth it. Thank you for this excellent talk!
Wow, what a marvelous presentation! I enjoyed every single second of it. I have always wanted some similar learning material to start with lock-free programming, and this one was just the right one. Thank you very much, Mr. Frasch.
I completely misunderstood Roi's question about 1:03. Yes, I agree that if you grab multiple pushers or poppers in the same block they will be destructed in the reverse order in which they were acquired. And, if you do that you must carefully consider that fact. The pusher and popper destructors just advance the cursor. So, it is possible that at the end of the block the cursors are correct. But, my real answer is, if you need to read or write multiple values from and to the FIFO you should extend the API to add functions that do that correctly.
Well, yeah, they just advance the cursor... But because of that only the cursor values will be correct at the end! The FIFO won't be! Because when you are getting multiple pushers in a row they are getting a reference to the same memory, and when one lifetime ends only the cursor position is changes the still existing pushers won't be, so they will overwrite the previous item and skip one item when they go out of scope. Instead of advancing the cursor implicitly in the destructor I would use a commit() method. That way you can get a pusher, commit and then get another pusher in the same scope. I'm not sure if I would actually implement pusher and popper at all, just use some kind of peek methods to get direct access the write and read location and use commit and free or similarly named methods to advance the write and read head.
Wonderful talk. Have done a similar implementation but never needed the extra performance over fifo2 so never looked into how it worked. Your talk explained everything very well. Thank you.
Me a noob thinking I was optimizing by messing with the std lib and working around those. This talk expanding my understanding of how you can truly optimize a system at a deep level.
One way to handle a pointer request when near the end of the FiFo wrap around point is to allocate extra space past the main circular buffer. Then the get pointer function can in the rare cases where the buffer wraps around to do a copy into this extra space so that the caller has a linear buffer. and then the release can copy data back into the FiFo buffer. The only problem is when the FiFo can auto expand the buffer. In that case a mutex is needed and all callers can't be holding pointers. Luckily in my case I don't need lock free and the code just grabs a mutex for any operation and also keeps the mutex held between the getting of the pointer and the release of the pointer. I also use RAII in that my get pointer returns a class to release the pointer and the mutex.
Great talk! Just a warning though, it is not a working FIFO, once pushCursor will get to the end of max size_t, it will start overwriting the buffer at [0] and on wards.
I am sure I have missed it but in bool Fifo5::push(T const& value) , how does *pusher = value; end up invoking the pusher::get() when no access to the underlying Fifo5 object has been provided (operator*()) to allow access
From a correctness point of view the load order shouldn't make any difference. In the push function and while the cursors are being loaded the fifo can only become emptier. If the cursor comparison shows full but the reader thread popped something off, the function will return full and noop. The next time through a slot will be obtained and the push will succeed. The full condition is already the slow path. From a performance point of view it might be a bit more efficient to start the load acquire first since it is a more expensive operation than load relaxed. OTOH, the processor's instruction scheduler show run the loads concurrently. Regardless, the comparison cannot proceed until both values are available.
Great talk. I wonder how hardware works and how often do fifo1 with out atomic get error, but after 10,000 times of test in release mode, it still work correctly, do any one have any hint of it?
I find out that x86 is a strongly-ordered system; everything is release-acquire by default. So, whether Fiio1 will pass the test or not is decided by how the program is compiled, not at runtime. Am I right about it?
12:07 While this would work for most, this is not correct [when capacity is not 2's power]. It relies on size_t not overflowing. (Which is a safe bet when it's 64 bit since it would take hundreds of years, but it is NOT when it's 32 bit or less, for example when running on a 32 bit or 8 bit microcontroller, where it only takes minutes to overflow.) To demonstrate the problem create a FIFO with 10 capacity and set the cursors to 0xfffffffaUL (or push and pop that many times), then push 10 items into the FIFO (compiled to a 32 bit platform). The write index will be 0,1,2,3,4,5,0,1,2,3. Meaning even though the FIFO was initially empty and had enough space to hold 10 items, it overwrites 4 of those items... and it will try to destruct some items twice. Also I usually don't like using modulo operation for indexing. Because capacity is not a compile time constant modulo division won't be optimized which could be a problem (mostly on processors without hardware division unit) (even when capacity is 2's power). I'd rather limit the cursor to valid index range / capacity to make sure it at all times contains actual valid array index (which makes debugging easier). (of course you would need to differentiate between full and empty in another way) I have written a number of lock free FIFOs for microcontrollers over the decades, so it seemed trivial to me that this is possible (even though the implementation might not be trivial, and may include memory barriers) I also realize that it might not be obvious for "modern"/young developers. I still wondered what could be said about them to fill out a 1 hour long presentation... So I still watched it (on 2x speed) to see if he ever addresses this overflow issue, and the talk did contain some interesting pitfalls. This shows that multi-threading and modern processors are complicated, so nearly everyone should use standard libraries instead of rolling their own FIFO.
That's right. The presentation serves its purpose as is. It would be far too complicated to to deal with ABA issues; perhaps I should have mentioned that.
I've probably given this tip to more people than any other, but if you're optimizing around an array of any variety, use a size that's a power of two. If you need to constrain an index into such an array, a `bitwise and` is the fastest way to do so. Never use division/modulus and oddly sized arrays, even when setting up the backing store for a hash table. Prime number sizes may yield better distribution, but it's always at the expense of performance. Also, with regards to hash tables, always store the unnormalized hash with the keys so that when you need to resize the table you don't need to query the hashing function again to rebuild the entire table.
Are there benchmarks which show that power of 2 size is better?
I very much agree with this advice. Especially for anything that is performance sensitive, bitwise and is a go-to means to restrict value ranges.
While I completely agree regarding power of 2 + bitwise and on one minus the size, I would also not be surprised if the hardware didn't recognize this optimization. In my day job I constrain the size to be a power of 2.
Seems like he was going to say the same thing at 8:30 before he lost his train of thought
Ah, he gets it out by 9:50
Great talk. We need more of this type of content in C++ talks!
41:00 I had that issue too.
My solution was to not store objects in the queue, but just store pointers.
the new/delete are done outside. I used another fifo of "Available buffers", pre-allocated pointers that can be reused. Instead of new grab one from it, instead of delete push it back to be reused.
That way I get rid of the push/pop memcpy's, and ALSO of runtime allocations.
Problem is just that you have to handle when the push fails and the queue is full, you need to either throw the pointer back into the available buffers, or hold onto it until it can be pushed.
One of the best explanations for implementing a lock-free SPSC I've come across. Kudos!
Far better than my implementation which just used a try lock and gave up if it couldn't immediately do the action that it wanted to do!
I’m glad I’m not the only one struggling to understand memory ordering 😂
To understand this talk fully, I read chapter 5 from the book “C++ Concurrency in Action” (2nd edition), which goes into depth on the topic of memory ordering. Had to read it twice to understand everything, but it was worth it.
Thank you for this excellent talk!
Wow, what a marvelous presentation! I enjoyed every single second of it. I have always wanted some similar learning material to start with lock-free programming, and this one was just the right one. Thank you very much, Mr. Frasch.
You are welcome. The references in the slides are also very informative.
I completely misunderstood Roi's question about 1:03. Yes, I agree that if you grab multiple pushers or poppers in the same block they will be destructed in the reverse order in which they were acquired. And, if you do that you must carefully consider that fact. The pusher and popper destructors just advance the cursor. So, it is possible that at the end of the block the cursors are correct. But, my real answer is, if you need to read or write multiple values from and to the FIFO you should extend the API to add functions that do that correctly.
Well, yeah, they just advance the cursor... But because of that only the cursor values will be correct at the end!
The FIFO won't be! Because when you are getting multiple pushers in a row they are getting a reference to the same memory, and when one lifetime ends only the cursor position is changes the still existing pushers won't be, so they will overwrite the previous item and skip one item when they go out of scope.
Instead of advancing the cursor implicitly in the destructor I would use a commit() method. That way you can get a pusher, commit and then get another pusher in the same scope.
I'm not sure if I would actually implement pusher and popper at all, just use some kind of peek methods to get direct access the write and read location and use commit and free or similarly named methods to advance the write and read head.
how is your comment 3 months old when the video was published 2 days ago ?
Wonderful talk. Have done a similar implementation but never needed the extra performance over fifo2 so never looked into how it worked. Your talk explained everything very well. Thank you.
@@youtou252 I'm the presenter; they allowed me to preview the video.
So many good things explained so well in one go. Bravo.
Beautiful talk
Thank you.
Great talk! I enjoyed it a lot.
Me a noob thinking I was optimizing by messing with the std lib and working around those. This talk expanding my understanding of how you can truly optimize a system at a deep level.
One way to handle a pointer request when near the end of the FiFo wrap around point is to allocate extra space past the main circular buffer. Then the get pointer function can in the rare cases where the buffer wraps around to do a copy into this extra space so that the caller has a linear buffer. and then the release can copy data back into the FiFo buffer. The only problem is when the FiFo can auto expand the buffer. In that case a mutex is needed and all callers can't be holding pointers. Luckily in my case I don't need lock free and the code just grabs a mutex for any operation and also keeps the mutex held between the getting of the pointer and the release of the pointer. I also use RAII in that my get pointer returns a class to release the pointer and the mutex.
It's not clear to me what problem you are describing or trying to solve. But your proposal doesn't seem like a sound solution to me.
Great talk! Just a warning though, it is not a working FIFO, once pushCursor will get to the end of max size_t, it will start overwriting the buffer at [0] and on wards.
it's assumed that std::size_t is 64 bits I think, therefore it is not realistic for it to reach the maximum value
I am sure I have missed it but in bool Fifo5::push(T const& value) , how does
*pusher = value; end up invoking the pusher::get() when no access to the underlying Fifo5 object has been provided (operator*()) to allow access
That gentleman looks so much like the cousin of Bjourne, cool presentation
slide 28, in push function, shouldn't we load popCursor with acquire first, and then load pushCursor with relaxed, correct me if I misunderstand this.
From a correctness point of view the load order shouldn't make any difference. In the push function and while the cursors are being loaded the fifo can only become emptier. If the cursor comparison shows full but the reader thread popped something off, the function will return full and noop. The next time through a slot will be obtained and the push will succeed. The full condition is already the slow path. From a performance point of view it might be a bit more efficient to start the load acquire first since it is a more expensive operation than load relaxed. OTOH, the processor's instruction scheduler show run the loads concurrently. Regardless, the comparison cannot proceed until both values are available.
Great talk. I wonder how hardware works and how often do fifo1 with out atomic get error, but after 10,000 times of test in release mode, it still work correctly, do any one have any hint of it?
I find out that x86 is a strongly-ordered system; everything is release-acquire by default. So, whether Fiio1 will pass the test or not is decided by how the program is compiled, not at runtime. Am I right about it?
12:07 While this would work for most, this is not correct [when capacity is not 2's power]. It relies on size_t not overflowing.
(Which is a safe bet when it's 64 bit since it would take hundreds of years, but it is NOT when it's 32 bit or less, for example when running on a 32 bit or 8 bit microcontroller, where it only takes minutes to overflow.)
To demonstrate the problem create a FIFO with 10 capacity and set the cursors to 0xfffffffaUL (or push and pop that many times), then push 10 items into the FIFO (compiled to a 32 bit platform). The write index will be 0,1,2,3,4,5,0,1,2,3. Meaning even though the FIFO was initially empty and had enough space to hold 10 items, it overwrites 4 of those items... and it will try to destruct some items twice.
Also I usually don't like using modulo operation for indexing. Because capacity is not a compile time constant modulo division won't be optimized which could be a problem (mostly on processors without hardware division unit) (even when capacity is 2's power). I'd rather limit the cursor to valid index range / capacity to make sure it at all times contains actual valid array index (which makes debugging easier). (of course you would need to differentiate between full and empty in another way)
I have written a number of lock free FIFOs for microcontrollers over the decades, so it seemed trivial to me that this is possible (even though the implementation might not be trivial, and may include memory barriers) I also realize that it might not be obvious for "modern"/young developers. I still wondered what could be said about them to fill out a 1 hour long presentation...
So I still watched it (on 2x speed) to see if he ever addresses this overflow issue, and the talk did contain some interesting pitfalls. This shows that multi-threading and modern processors are complicated, so nearly everyone should use standard libraries instead of rolling their own FIFO.
That's right. The presentation serves its purpose as is. It would be far too complicated to to deal with ABA issues; perhaps I should have mentioned that.
Don't we have to compare/exchange in order to get thread safety here?
You can certainly use atomic compare and exchange. It is a more complicated and expensive operation that the atomic load and stores I use.