Work Contracts - Rethinking Task Based Concurrency & Parallelism for Low Latency C++ - CppCon 2024
- Published Feb 10, 2025
- cppcon.org
---
Work Contracts - Rethinking Task Based Concurrency and Parallelism for Low Latency C++ - Michael A Maniscalco - CppCon 2024
---
Task-based concurrency offers benefits in streamlining software and enhancing overall responsiveness. Numerous frameworks build upon this approach by furnishing abstractions that harness the capabilities of modern multi-core processors and distributed computing environments. These frameworks achieve this by facilitating the creation of task graphs, which simplify the management of task execution order and dependencies between tasks.
However, these frameworks are typically not well suited for use in low latency environments where every nanosecond matters, and any additional overhead introduced by managing task graphs can impact the system's ability to meet stringent latency requirements.
This presentation introduces Work Contracts, a novel approach specifically tailored for low-latency environments, which re-imagines both task-based concurrency as well as tasks themselves. We present an innovative lock free, often wait free, data structure designed to significantly enhance scalability in parallel task distribution, particularly under high contention. Additionally, we introduce concepts which enhance tasks with internal state allowing for single threaded or parallel execution, recurring execution, and deterministic asynchronous task destruction.
Finally, we examine usage cases for Work Contracts to showcase the advantages that this solution makes possible, resulting in cleaner, more manageable, and more scalable software which is well suited for low latency applications.
---
Slides: github.com/Cpp...
Sponsored by JetBrains: www.jetbrains....
---
Michael Maniscalco
Michael Maniscalco has been a professional C++ developer for over 25 years. Initially he worked in the area of data compression (Intelligent Compression Technologies, later Viasat) and, for the last decade, he has been employed as a principal engineer and software architect at Lime Trading where he designs and authors low latency software for use in HFT.
He has invented several novel data compression algorithms including M99 and M03 as well as MSufSort, a pioneering algorithm for suffix array construction.
His works have been published and cited in numerous academic publications including the Journal of Experimental Algorithms.
On rare occasion he finds the time to add to his website and github repo, "Building C++" which is a platform dedicated to "Architecting systems with modern C++."
---
CppCon is the annual, week-long face-to-face gathering for the entire C++ community. The conference is organized by the C++ community for the community. You will enjoy inspirational talks and a friendly atmosphere designed to help attendees learn from each other, meet interesting people, and generally have a stimulating experience. Taking place this year in Aurora, Colorado, near the Denver airport, and including multiple diverse tracks, the conference will appeal to anyone from C++ novices to experts.
Annual CppCon Conference - www.cppcon.org
x.com/cppcon
mastodon.socia...
---
Videos Filmed & Edited by Bash Films: www.BashFilms.com
YouTube Channel Managed by Digital Medium Ltd: events.digital...
---
#lowlatency #concurrency #cpp #cplusplus #cppcon #cppprogramming #cplusplusprogramming #softwaredevelopment #softwareengineering #coding #code #technology #programming #programmer
Really great talk! Concise and comprehensive... I wish all talks were like this
Such a great talk
This talk is excellent! Great work
I think if your counter is signed then a thread that does fetch_sub() and makes it < 0 can just re-increment it before moving on. It's a waste of some bits but no more CAS loop. Your contracts model is very cool, btw. Thanks for sharing.
Someone else mentioned that idea after the talk as well (perhaps you are that same person?)
The problem with this approach is twofold. First, I believe the CAS will be faster than testing the result for underflow (but I could be wrong here). Second, and more importantly, the nodes actually contain more than one counter. This makes the tree more compact and eliminates a lot of the CAS operations entirely. I detail that in the README in the repo.
@@Michael_19056 It is surprising that with multiple counters packed together on a cache line there is no interference between threads trying to increment different fields in the word, unless you have very many tasks, or at least many task slots with tasks spread out among them. If your counters are packed into 64-bit words, that is 8 such words in a cache line ping-ponging between cores as each thread/core tries to atomically update a few bits in the shared cache line.
@@Michael_19056 Hi! I'm not that person. It was just a random thought. In my personal research, I went with two counters and I check if they match. Sounds like that's the way to go. Thanks for the info!
@@Akio-fy7ep Two things help here. First, I think the odds of such inter-thread contention are fairly low compared to the cost of increasing the number of counters and, therefore, the number of CAS calls needed to avoid it.
Second, there's a simple algorithm inside which ensures that work contracts are selected on leaf nodes spread as far apart from each other as possible, so they share the fewest common parent nodes, which also helps to avoid such contention. And if the tree gets crowded, a bigger tree (or better yet, more trees) helps to avoid that crowding too. These trees are cheap in terms of memory footprint.
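The two claim strategies debated in this thread can be sketched side by side. This is an illustrative sketch only; the function names and counter layout are hypothetical and not taken from the work_contract repository:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Strategy 1 (as presented): a CAS loop that only decrements the
// counter while it is positive, so it can never underflow.
bool try_claim_cas(std::atomic<std::int64_t> & counter)
{
    auto expected = counter.load(std::memory_order_relaxed);
    while (expected > 0)
    {
        if (counter.compare_exchange_weak(expected, expected - 1,
                std::memory_order_acquire, std::memory_order_relaxed))
            return true; // claimed one unit of work
    }
    return false; // counter was already zero
}

// Strategy 2 (the commenter's signed-counter idea): decrement
// unconditionally, and if the result went negative, re-increment
// to undo the underflow and report failure. No CAS loop, at the
// cost of a brief visible negative value.
bool try_claim_fetch_sub(std::atomic<std::int64_t> & counter)
{
    if (counter.fetch_sub(1, std::memory_order_acquire) > 0)
        return true; // claimed one unit of work
    counter.fetch_add(1, std::memory_order_relaxed); // undo underflow
    return false;
}
```

Both routines agree on the observable result; as noted above, the real structure packs several counters into one word, which rules out the plain `fetch_sub` form for those fields.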
Thanks for the talk, it was clear, well prepared, and interesting.
A couple of questions/thoughts though:
a) what happened at 44:18? index 4 is suspiciously missing
b) the benchmark does not seem equivalent, or at least it is specific, since, AFAIK, actual queues do synchronize data that tasks move between them; e.g. task T1 is handling data D, T1 then triggers a new task T2 and moves D to T2 while doing so, so T2 can work on D without further synchronization. It does not seem like the same applies to Work Contracts, so the benefits really require you to have lots of small and very data-independent work (or am I wrong here?)
a) Slideware! (^: These slides go through a bunch of revisions and the 4 doubtless got dropped while copying/pasting. But just to be certain I re-ran that example and, sure enough, 4 is present.
b) I'm not sure that I follow you here. But if I do, then the same is done with work contracts by creating two work contracts. WC1 has an ingress queue; when it gets selected and invoked it pops D from the ingress queue, does whatever it needs to with it, and then pushes D (or some byproduct of the work it did with D) into an egress queue. That egress queue is WC2's ingress queue, and WC2 is scheduled. A second worker thread will select WC2 and invoke it, perhaps even before the first thread has exited WC1.
If I am not understanding (b), please feel free to clarify.
You are correct though. One of the big wins for WC is that its low overhead enables (even encourages) smaller, finer-grained tasks, because the overhead of scheduling/selecting is much less than the enqueue/dequeue overhead of a queue - especially an MPMC queue. But I would also argue that for larger tasks there is no performance loss in choosing WC over MPMC queues. That is what the 'low contention' benchmark is intended to show by making the tasks take a long time to complete. In fact, even in that benchmark WC came out _slightly_ more efficient than MPMC queues. Albeit, not by much.
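The WC1/WC2 pipeline described in this reply can be sketched in a few lines. This is a deliberately single-threaded illustration with "scheduling" modeled as a direct call; the queue types and names are hypothetical, and the real library uses lock-free queues with worker threads selecting contracts concurrently:

```cpp
#include <cassert>
#include <deque>
#include <string>

// Sketch of the two-stage pipeline: WC1 pops D from its ingress
// queue, does its work, and pushes the byproduct into an egress
// queue, which doubles as WC2's ingress queue.
struct Pipeline
{
    std::deque<int> wc1_ingress;            // D arrives here
    std::deque<int> wc2_ingress;            // WC1's egress == WC2's ingress
    std::deque<std::string> results;

    void wc1()                              // first work contract's body
    {
        int d = wc1_ingress.front();
        wc1_ingress.pop_front();
        wc2_ingress.push_back(d * 2);       // byproduct of WC1's work on D
        wc2();                              // "schedule" WC2 (direct call here)
    }

    void wc2()                              // second work contract's body
    {
        int d = wc2_ingress.front();
        wc2_ingress.pop_front();
        results.push_back("result:" + std::to_string(d));
    }
};
```

In the real system, scheduling WC2 only marks it as selectable; any worker thread may then pick it up, potentially overlapping with the tail of WC1's execution.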
kewl. i wonder how a tree with 4 children per node would compare, if there is any benefit in tree navigation compared to the tradeoff of more contention per node
Thanks for the question. I added a section to the repo's README to provide answers. I hope that helps.
Very interesting. It is great to see an alternative to queues of tasks, since that is what I mostly use in application code.
I still would like a clear explanation on when I would use this vs. when to use a concurrent queue. What situations would the concurrent queue be clearly better, and which situations would the work contracts be clearly better? There's no way work contracts are better for every situation, even with future improvements.
Thanks for the question. I added a section to the repo's README to provide answers. I hope that helps.
I spotted a minor mistake on presentation slide 11 6. The resulting output shows 1 then 2 (++i) instead of 0 then 1 (i++).
Anyway, awesome talk.
Glad you liked it. You are correct. The slideware is wrong and it should be either 0 and 1 or ++i.
This is a great approach! I wonder about the cost of going through so many atomics though.
Thanks for the question. I added a section to the repo's README to provide answers. I hope that helps.
Maybe it makes sense to convert the binary tree to an N-children tree? It will reduce the number of atomic ops per contract pickup by Log2(N) and should also decrease the overall contention factor because the worker threads will "part" more actively while traversing down the tree.
@@romanzelenyi982 That is, in fact, how it works. The binary tree was just easier to present within the time constraints of the talk and to get the basic idea across. I believe that I did mention that at some point in the Q/A that the tree is more compact than what was presented. There are more details on this in the README of the repo too.
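The depth reduction from a higher fanout, mentioned in this exchange, is easy to quantify: a tree with `fanout` children per node over `leaves` leaf slots has depth ceil(log(leaves) / log(fanout)), so each contract selection traverses that many levels of counters instead of log2(leaves) for a binary tree. A small illustrative helper (not from the repo):

```cpp
#include <cassert>

// Depth of a complete tree with the given fanout needed to cover
// `leaves` leaf slots: smallest d such that fanout^d >= leaves.
int tree_depth(int leaves, int fanout)
{
    int depth = 0;
    long long span = 1;                 // leaves covered at this depth
    while (span < leaves)
    {
        span *= fanout;
        ++depth;
    }
    return depth;
}
```

For 1024 leaf slots, a binary tree is 10 levels deep, a 4-ary tree 5 levels, and a 16-ary tree only 3, so each selection performs proportionally fewer atomic operations on the way down.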
I'm not quite following how scalability works if the usage pattern ends up biasing a single work contract (all requests are for the same type of work), as it seems the benefits come from work across different work contracts being scheduled.
Thanks for the question. I added a section to the repo's README to provide answers. I hope that helps.
i want to give a hint: when you cannot read the code on the slides, neither can the audience. please put less code on the slides and make it bigger. (this applies to many presentations, not just you)
otherwise great presentation. very interesting technique. thanks ;)
Noted. I'll make the font larger in the future. 😀
"Could you imagine writing your code and queueing lines of code to be executes", bro has not done a lot in modern JS.