It's incredible how coherent you can be in a live session, and on such a tricky topic. Kudos!
This video exposes one concept that you rarely notice in regular code. Computer code looks sequential, when written down as text, but in actuality it defines an oriented graph of operations. If two operations don't lie on the same path through the edges of the graph, their relative order is undefined.
Multi-threaded memory access exposes this weird quirk, because it itself may lack inherent ordering (i.e., it allows for race conditions).
Memory ordering introduces extra edges to the graph, to restrict possible orderings.
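A minimal Rust sketch of that idea (names are mine, not from the video): the two stores below don't lie on a common path through the operation graph, so another thread may observe them in either order, while a Release/Acquire edge pins the order down.

    use std::sync::atomic::{AtomicU32, Ordering};

    static A: AtomicU32 = AtomicU32::new(0);
    static B: AtomicU32 = AtomicU32::new(0);

    // No edge between these two relaxed stores: another thread may
    // observe B == 1 while still reading A == 0.
    fn unordered_writer() {
        A.store(1, Ordering::Relaxed);
        B.store(1, Ordering::Relaxed);
    }

    // The Release store adds an edge: a thread that Acquire-loads
    // B == 1 is then guaranteed to also read A == 1.
    fn ordered_writer() {
        A.store(1, Ordering::Relaxed);
        B.store(1, Ordering::Release);
    }

    fn main() {
        unordered_writer();
        ordered_writer();
        println!("A = {}, B = {}", A.load(Ordering::Relaxed), B.load(Ordering::Relaxed));
    }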
🤯 that’s a really clear explanation. I’d never thought of code in those terms. Thanks for the perspective shift!
You may want to mentally map Ordering to Transaction Isolation levels in SQL databases.
Another important ingredient is out-of-order execution by modern processors. Normally such CPUs and their caches provide a programmer-visible model that appears to execute sequentially. When they run multiple threads, especially on symmetric multi-processors (e.g. a quad-core CPU), these out-of-order execution models can expose far more interdependence than you might think could occur just due to "parallel execution". Ever since the Pentium Pro, for example, your CPU isn't even executing the raw instructions. It's converting them into micro-operations, which are analyzed in real time, much like a modern compiler would, and reordered according to the directed graph of operations you describe. Then as many of those micro-operations as possible are executed simultaneously. Add to that the presence of multi-layer caches shared between multiple processors, and the whole mental model of a processor that executes one instruction at a time just completely breaks down.
This is a great explanation, thanks for sharing!
This is related to how Haskell's IO type is defined (in GHC) - it is a state monad which passes around a special token (of type State# RealWorld) which the compiler treats as being different every time it sees it: newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #)) - this creates a strict ordering between IO actions; the compiler can't move them around, because the RealWorld token would no longer be passed between the actions in the same order. Other Haskell code is exactly what you've said, a graph of values, but it's one in which the order doesn't matter, because the values are always pure.
(I'll note that "ordering of actions" didn't require the use of the m-word, that's what we call the ubiquitous pattern that IO fits nicely into)
"Don't use spinlocks! Don't implement spinlocks! Ok, let's implement a spinlock". Love it!
@Cooper Rene LOOL just use a private tracker, don't be a punk
I want to give feedback:
This is great. It fits my difficulty level perfectly and is exactly what I need - as the 2018 guideline suggests.
I have some experience with Rust and am a computer science student currently just before my bachelor's degree, so I have some experience with programming and with how a computer works on several levels. Heck, I wrote a small subset-of-C compiler. But I had, for example, never heard of Ordering before.
Those topics normally have a super high barrier of entry because of all the prior knowledge required, except that that knowledge has the same barrier. Your videos help break that down and teach a lot about how computers work in general.
Keep on with the great work. I wish all my lectures were even half as engaging, informative, accessible and interesting.
One of the exciting aspects of fetch_update (or any CAS loop that generates the new value from the old) is that your closure may be executed more than once. If your closure has side-effects, it sometimes surprises people that they get the side-effect multiple times.
A more common mistake is for the closure to read and write a value that exists outside the closure (e.g. "if !initialized { initialized = true; new_value = 42 }"). If the first CAS attempt fails, then the second run of the closure generates an unexpected result because it inherits state from the first run of the closure. I've seen this bug multiple times in production code.
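A minimal sketch of that second pitfall (the names and the value 42 are illustrative, not from any real codebase):

    use std::sync::atomic::{AtomicU32, Ordering};

    // The closure passed to fetch_update may run several times if the CAS
    // fails; mutating captured state makes retries observe leftover state.
    fn buggy_init(counter: &AtomicU32) -> Result<u32, u32> {
        let mut initialized = false;
        counter.fetch_update(Ordering::AcqRel, Ordering::Acquire, |old| {
            if !initialized {
                initialized = true; // leaks into the next run of the closure!
                Some(42)
            } else {
                // Reached on a retry even though the counter never became 42.
                Some(old + 1)
            }
        })
    }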
Jeff Preshing has an incredible blog (and I think some videos) on this. He's a C++ developer, but as you said that doesn't matter.
For anyone testing this themselves while following along: while testing the first section with the "bad" mutex, it reliably generated completely random numbers every run when run with just "cargo run" instead of "cargo run --release", even without the yield. However, I'm running on a MacBook Air with the M1-based ARM processor, so that probably increases the randomness, as ARM's memory model is a little more relaxed than Intel's, I believe.
Also, with everything on Relaxed, I still got reliable failures even after switching to the "correct" implementation using compare_exchange. However, the value was very close to correct and passed most of the time.
My wife made me watch the intro 7 times. Makes us laugh every time! (Now if only I could also get her as excited about coding again)
Just have to come at it from an angle that's applicable to her in some way, that way she can see the possibilities and how it benefits/helps her :D
Could not have asked for anything better than this! Thank you so much for the work you do for us!
Very interesting, thanks. The general rule seems to be (as with crypto): don't roll your own (if you can avoid it).
Another topic I might have avoided for much longer. But you manage to explain very complicated things (at least to me) in a very digestible way. Hat off 🎩. And the MESI protocol is indeed an interesting topic. Thanks for that hint as well.
Thank you for explaining this in such a brilliant manner! This is the first time I have been able to understand this ordering issue. Thank you Jon!
Thank you for creating so many of these awesome videos Jon. I find them very beneficial personally, and always enjoy the adventures you take us on. Cheers 🍻
I think this is the best explanation of how to use atomics.
For a master's course I implemented exhaustive and optimal enumeration of valid release-acquire traces using an algorithm called DPOR. Fun stuff.
Thanks a lot for this video! I feel it's a bit of a low-key hardcore topic that is great to know, even though simple mutexes and atomic counters have been enough for me so far :)
7:00 I couldn't live without Dark Reader!
Thank you so much for diving so deep into all the memory orderings, and all of everything regarding atomics! I watched the entire video.
I was going to ask why they say Rust gives you safe multithreading if there are so many issues with it apparently, but then at the very end you basically said that it's just atomics that are the problem; mutexes are what really give you the safety :)
The other big difference is that if you don't use Atomics or unsafe, you can't mess up. C/C++ will happily pass a T* to another thread and let you get all the problems atomics have, and more. In Rust, code that doesn't use unsafe or a preexisting synchronization primitive (which necessarily uses unsafe) can't even experience a race condition (afaik).
@@berylliosis5250 Correction: ...can't even experience a -race condition- data race.
From the Rustonomicon:
"Safe Rust guarantees an absence of data races,..." "...However Rust does not prevent general race conditions."
great video, thanks! this whole memory ordering thing is much clearer now
First of all, it's a great video, thank you.
I thought it might be a good idea to think about the problem in terms of CPU cores, rather than threads.
Later x.load -> c1 denotes "x.load was scheduled to run on core 1".
2nd problem (Ordering::SeqCst section):
- (x.store -> c1, y.store -> c2) finish.
*** At this point c1: {x: true, y: false} c2: {x: false, y: true}
- (!x.load -> c1, !y.load -> c2), so both loops end;
- (x.load -> c2, y.load -> c1)
This way z == 0.
1st problem (Ordering::Relaxed section):
To achieve the case, we need at least "x.load finishes after y.store finishes", since y.store is the only way 42 can get into any core memory.
So this is only achievable with compiler shenanigans, in this case on 1 core:
- y.store -> c1
*** At this point c1:{y: 42}
- y.load -> c1
- x.store -> c1
*** At this point c1:{x: 42}
- x.load -> c1
Also, loom docs docs.rs/loom/latest/loom/#relaxed-memory-ordering state that "B can run before A", not "D can run before C", meaning "compiler can shuffle commands even if the 2nd depends on result of the 1st"? - I hope it's just a typo.
Regarding your example of relaxed ordering, there is a variant that is even more mind bending:
thread 1: { let r1 = y; x = r1; r1 }
thread 2: { let r2 = x; if r2 == 42 { y = 42; } r2 }
(All loads of x and y are relaxed.) The crazy thing is that it is _still_ allowed that both threads read 42. This seems impossible since there is a control dependency from the read of x to the write of y, but this can actually happen in practice if thread 2 speculatively executes y = 42, thread 1 picks that up and stores x = 42, and then thread 2 receives 42 from x and decides that the speculation was correct. In fact, relaxed loads are even worse than that, since in the following program
thread 1: { let r1 = y; x = r1; r1 }
thread 2: { let r2 = x; y = r2; r2 }
it is _still_ allowed for r1 and r2 to be 42, which is now completely nuts since there aren't any 42's in the program. (Actually C11 has a clause saying there has to be a 42 somewhere in the program but you can just have a random 42 elsewhere to satisfy this constraint.) However this is an unrealistic execution, in that no hardware actually can produce 42 in this situation without literal time travel. (This is also known as the "out-of-thin-air" problem since 42 just appears by time travel shenanigans.)
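For the curious, the second program above can be written with Rust atomics like this (a sketch; in practice it always prints "0 0", the memory model merely fails to forbid "42 42"):

    use std::sync::atomic::{AtomicU32, Ordering};
    use std::thread;

    static X: AtomicU32 = AtomicU32::new(0);
    static Y: AtomicU32 = AtomicU32::new(0);

    fn main() {
        let t1 = thread::spawn(|| {
            let r1 = Y.load(Ordering::Relaxed);
            X.store(r1, Ordering::Relaxed);
            r1
        });
        let t2 = thread::spawn(|| {
            let r2 = X.load(Ordering::Relaxed);
            Y.store(r2, Ordering::Relaxed);
            r2
        });
        let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
        // Always "0 0" on real hardware; the model just fails to rule out "42 42".
        println!("{r1} {r2}");
    }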
Hi, Jon. Thanks for the amazing tutorial. Could you do some videos about tokio, crossbeam, and rayon?
The way I understand it is that a load/store instruction only affects the local cache of a particular core. Ordering::Acquire is an analogue of a recv operation and Ordering::Release is an analogue of a send operation. However, there is an additional rule where, if there is no incoming message on the bus, Ordering::Acquire just reads directly from its local cache.
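A small sketch of that send/recv analogy (names are hypothetical): the Release store "sends", and an Acquire load that sees the flag "receives" everything written before the send.

    use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
    use std::thread;

    static DATA: AtomicU32 = AtomicU32::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    // "send": write the payload, then release-store the flag.
    fn producer() {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release);
    }

    // "recv": if the acquire-load sees true, the payload is visible.
    fn consumer() -> Option<u32> {
        if READY.load(Ordering::Acquire) {
            Some(DATA.load(Ordering::Relaxed)) // guaranteed to read 42
        } else {
            None
        }
    }

    fn main() {
        let t = thread::spawn(producer);
        t.join().unwrap();
        assert_eq!(consumer(), Some(42));
    }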
Thank You! I learned a lot from this video! However, I almost feel like parts of this API should require unsafe as an "I know what I'm doing" kind of thing. I was thinking that SeqCst could be the "safe" default and that methods where you can specify orderings would be unsafe. Perhaps that is taking it a bit far since this isn't really about memory safety, but it might go a little way towards fearless concurrency. Do you have any thoughts on that?
Thank you so much for the great video Jon
Just wanna point something out that I think you overlooked a little bit
There is also another guarantee with Relaxed ordering, and that is that the operation is guaranteed to happen at some point in the future.
For example, without any ordering, these two pieces of code are considered equivalent from the compiler's perspective (this optimization is known as hoisting):
while (!stoprequested) {
}

if (!stoprequested) {
    while (true) {
    }
}
However if you set and read the stoprequested variable with relaxed ordering, it is guaranteed that the first loop will break eventually, meaning that the compiler will not translate it to the second form.
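In Rust terms, the claim is about a pattern like this (a sketch assuming a static AtomicBool; note that the replies below debate how strong the "eventually" guarantee formally is):

    use std::sync::atomic::{AtomicBool, Ordering};

    static STOP_REQUESTED: AtomicBool = AtomicBool::new(false);

    // Because the load is atomic, the compiler may not hoist the check
    // out of the loop the way it could for a plain non-atomic bool.
    fn wait_for_stop() {
        while !STOP_REQUESTED.load(Ordering::Relaxed) {
            std::hint::spin_loop();
        }
    }

    fn request_stop() {
        STOP_REQUESTED.store(true, Ordering::Relaxed);
    }

    fn main() {
        let waiter = std::thread::spawn(wait_for_stop);
        request_stop();
        waiter.join().unwrap();
    }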
I actually looked through the spec for text along these lines, assuming that it _must_ be in there, and couldn't find it. Can you point me to where you see it written? The best I found was something along the lines of "the value is eventually visible" or something to that effect, but with no promises about what "eventually" means.
@@jonhoo Two years later and I found this video again:)
Sorry it took so long
It is mentioned in the docs:
" Atomic operations tagged memory_order_relaxed are not synchronization operations; they do not impose an order among concurrent memory accesses. They only guarantee atomicity and modification order consistency. "
Since the operation's atomicity is guaranteed, it must happen at some point in the future. This is part of the promise of atomicity: an atomic operation is something that happens "all at once". It has happened in the current thread (we have the modification order guarantee; the current thread should be able to see it), so it must also be visible in any other thread.
In today's modern cache coherent CPUs, the compiler doesn't have to do anything to get the "relaxed" ordering. All the relaxed ordering does is that it disables some optimizations such as loop-invariant-code-motion if the target variable is marked as atomic and is read with relaxed ordering.
There is an amazing free book called:
A Primer on Memory Consistency and Cache Coherence (Synthesis Lectures on Computer Architecture)
It pretty much explains all of this stuff and much more in great detail.
There is also a blog post by Doug Lea that explains the JDK 9 memory model in great detail. Since version 9, Java has memory ordering modes equivalent to C++'s. You can google it (I don't think I can post a link here):
Using JDK 9 Memory Order Modes by Doug Lea.
I was about to lose hope on understanding atomics... then I saw there's a Jon Gjengset video about it 😍
32:48 is probably the most sarcastic "Oh nice, it works so well" I've ever heard XD.
Thanks a lot for your service.
Learning a lot here. Still can't accurately predict the results when using ordering in certain cases, but this video explains the concepts very well. Perhaps some sketching of the concepts could help, as was done in the leftright/evmap vid.
Thanks!
Gonna leave a plug for "Designing Data-Intensive Applications" by Martin Kleppmann. It talks about a lot of the same things in the context of distributed systems and databases. Specifically in the context of eventual consistency vs. other read/write ordering guarantees, a lot of the same thinking applies. It's a great book, especially if you're working with those types of systems.
Honestly one of the best books I've read
So, if I increment a counter in different threads and I just care about fetching the result after all threads have resolved, I can just use Relaxed, because there are no reads, just writes, and therefore the ordering is not relevant? Given that we take care that everything happens within one execution.
Yes, this is mostly correct. Your statement that "there is no read just writes" isn't correct, updating a counter is both a read *and* a write. But when updating a counter you probably don't care what value your thread reads, so it's OK to use relaxed.
More precisely, if you use `fetch_add` to guarantee that the updates to the counter are atomic, and you call `.join()` on all the threads to provide the Happens-Before relationship that guarantees that the threads' updates to the counter are visible to you before you read the final value, then you can do all the updates with `Ordering::Relaxed` and things will work.
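A minimal sketch of that recipe (thread and iteration counts are arbitrary):

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread;

    static COUNTER: AtomicUsize = AtomicUsize::new(0);

    fn main() {
        let handles: Vec<_> = (0..4)
            .map(|_| {
                thread::spawn(|| {
                    for _ in 0..1000 {
                        // atomic read-modify-write; Relaxed is fine here
                        COUNTER.fetch_add(1, Ordering::Relaxed);
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap(); // join establishes the happens-before edge
        }
        // All 4000 increments are guaranteed to be visible here.
        assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
    }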
So I know I'm late to the party, but I do have a question about the Ordering::SeqCst example. Does this amount of synchronization mean we end up regressing to almost single-threaded performance again? Or can the compiler/CPU ensure this ordering while still maintaining concurrency (with, obviously, at least some overhead for the synchronization)?
A bit late, but no, SeqCst should still be much faster than single threading, provided the algorithms are implemented appropriately. To synchronize, the different cores have channels, and when an atomic instruction is used that requires synchronization, the thread uses a protocol over those channels to notify the relevant cores. Read up on the MOESI protocol. This stalls the participating cores a bit, but is still much faster than doing everything on the same thread, as the pauses are negligible compared to long-running multi-threaded tasks.
I'm not sure whether I understand the example with `SeqCst`. Do atomics not still have the same memory address when they change value?
Can the CPU just move that around?
And wouldn't it be horrible if it did, but the load could still refer to the old location?
How can it at that point even guarantee that it ever gets the new value and that your references to the atomic don't become stale immediately?
How can `load` get an old value after it has been changed in memory 🤔
I think this may, for example, come down to caching. When threads run on different cores, both cores may have a given value copied in their caches. Both cores may modify the value in their cache without the other core seeing the modification. If the ordering of the modifications does not specify a clear “happened before” relation between them, then the cores may choose not to invalidate their caches before the modifications, which would be costly. This is probably also why stricter orderings (than Relaxed) are considered to incur a performance cost. The cores cannot optimize as freely (with caches and out-of-order execution and the like) when an ordering is specified. So if performance is a concern, you always want to find the “most relaxed” ordering that keeps the program correct.
Anyway, caching is just my guess, there might also be other reasons for Threads to see different values of the same memory location.
On the counter example, wouldn’t relaxed allow the counter to be updated to the same value twice? Like, both load a previous value and both add 1 to the same old value.
Hi thanks for the great video.
What LSP plugin are you using? It is really responsive.
I'm using rust-analyzer through coc.nvim :)
omg I will need to see this video 5 times to understand half
Regarding what you say in the beginning, that one thread may never get the update:
If you have memory X = 1
Thread A that writes 2 to X in a loop.
Thread B that reads X in a loop.
Wouldn't Thread B eventually get the value 2 but not in the same step as you later mention?
Thanks for the great tutorial!
Great video, learned a lot. Thanks! I might need to rewatch.
If we were to acquire and release atomic variables inside of a loop, can memory ordering allow the CPU to wrap operations around the loop? For example, if we did a Release operation at the end of the loop, the contents of the loop will now be visible to other threads. If the CPU correctly predicts that the loop will run again, it will have already begun processing instructions for the next iteration of the loop. Is it guaranteed that it won’t move these operations ahead of the release synchronization from the previous iteration of the loop?
Ah, no, the memory model generally guarantees in-order semantics within a given thread. Reordering is only with respect to the actions of other threads.
I got confused by the argument for why z=0 is impossible in the SeqCst example. I think the argument should be that if t1 does not add, it must be that it sees y = false after it sees x = true, which establishes that the x load happens before the y load, so t2 must see x = true after it sees y = true and it adds. Similarly, if t2 does not add then this means the y load happens first and so t1 adds.
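For reference, here is a sketch of the example being argued about (reconstructed from the discussion, so details may differ from the video's exact code):

    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::thread;

    static X: AtomicBool = AtomicBool::new(false);
    static Y: AtomicBool = AtomicBool::new(false);
    static Z: AtomicUsize = AtomicUsize::new(0);

    fn main() {
        let tx = thread::spawn(|| X.store(true, Ordering::SeqCst));
        let ty = thread::spawn(|| Y.store(true, Ordering::SeqCst));
        let t1 = thread::spawn(|| {
            while !X.load(Ordering::SeqCst) {} // wait for x
            if Y.load(Ordering::SeqCst) {
                Z.fetch_add(1, Ordering::SeqCst);
            }
        });
        let t2 = thread::spawn(|| {
            while !Y.load(Ordering::SeqCst) {} // wait for y
            if X.load(Ordering::SeqCst) {
                Z.fetch_add(1, Ordering::SeqCst);
            }
        });
        for t in [tx, ty, t1, t2] {
            t.join().unwrap();
        }
        // Under SeqCst there is a single total order, so z is 1 or 2, never 0.
        assert_ne!(Z.load(Ordering::SeqCst), 0);
    }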
You should give Dark Reader extension a try. It gives a decent enough dark mode experience on pretty much any website.
Absolutely! Love that extension.
Great extension.
Ah, but you see, I don't actually _like_ dark mode for anything that has a lot of text on it like documentation. I *only* use dark mode for streaming because I was told by a bunch of viewers early on that their eyes hurt since they're watching the streams in a dark room ;)
@@jonhoo Blasphemer! Dark mode for life 🕶️
@@jonhoo
Well, I should thank you for being so considerate of us. I do get flash-blinded when everyone in a youtube video opens a very bright website. Maybe it's because I have become too dependent on dark mode.
great video to explain the details, thanks
@jon Gjengset, is it possible to do a stream on Rayon or parallelism in Rust?
Even for C++ this is really useful
1:21:40, that Relaxed (on fail): doesn't it mean it uses that memory ordering to load the value, even before the compare happens? Or is it internally split into two operations: a load for the compare&exchange, and, if the compare failed, then a load (again!!) for the actual value, with the two grouped into one atomic operation? That doesn't seem to make sense. So doesn't it instead load the value with the failure ordering (aka Relaxed in this case), compare it, and, if it's the same as expected, do the store with the success ordering (here Acquire), with the load and the (compare&)store together being internally atomic?! Otherwise it seems kinda weird that it would load the same value twice, once for success and once for fail. So if what I expect is correct, then the ordering args would be Release (for success) and Acquire (for failure), because it would load the value with Acquire and store it with Release. Or, well, maybe storing it with Release isn't needed? Unclear on this part. I'm only talking about compare_exchange_weak here, if it wasn't clear, not the last store of unlocked.
the doc says this:
"compare_exchange_weak takes two Ordering arguments to describe the memory ordering of this operation. success describes the required ordering for the read-modify-write operation that takes place if the comparison with current succeeds. failure describes the required ordering for the load operation that takes place when the comparison fails. Using Acquire as success ordering makes the store part of this operation Relaxed, and using Release makes the successful load Relaxed. The failure ordering can only be SeqCst, Acquire or Relaxed."
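To make the two orderings concrete, a rough sketch of a lock loop in the spirit of the one from the video (reconstructed, not the exact stream code):

    use std::sync::atomic::{AtomicBool, Ordering};

    const UNLOCKED: bool = false;
    const LOCKED: bool = true;

    // Acquire on success pairs with the Release store in unlock(); the
    // failed load can be Relaxed because we never act on its value.
    fn lock(locked: &AtomicBool) {
        while locked
            .compare_exchange_weak(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    fn unlock(locked: &AtomicBool) {
        locked.store(UNLOCKED, Ordering::Release);
    }

    fn main() {
        let l = AtomicBool::new(UNLOCKED);
        lock(&l);
        unlock(&l);
    }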
Hi Jon, I have a small question regarding std::cell::Cell and std::cell::RefCell if you happen to have a minute. So, I get the concept of being able to change an inner value through an immutable reference, and I have seen examples where a clone() implementation takes a shared reference to self but also needs to update some other member of self; hence the issue, since it's only a shared reference, and so the workaround is to use these cells. I was wondering, do you tend to see this usage often outside of things like clone() implementations? I guess it's good to know these cells are there and what they do, but when thinking about when I'd ever use them, I kinda see it as quite rare, but good to know they exist -- is that your take, or are these actually used all over the place in typical Rust code? Thanks for any thoughts.
I most commonly see it in use-cases where you need mutable access to multiple subsets of a collection that are determined at runtime. This comes up frequently in graphs, but that's not the only place. In those cases, you have multiple & to the collection, and then take out &mut through something like RefCell as you, say, traverse nodes in topological order.
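A hypothetical sketch of that pattern (all names invented): shared references into one graph, with interior mutability used to mutate individual nodes during traversal.

    use std::cell::RefCell;

    struct Node {
        value: RefCell<i32>,
        edges: Vec<usize>, // indices into the graph's node list
    }

    // &[Node] is a shared view of the whole graph, but RefCell still
    // lets us mutate each node we visit.
    fn bump_reachable(nodes: &[Node], start: usize) {
        let mut stack = vec![start];
        let mut seen = vec![false; nodes.len()];
        while let Some(i) = stack.pop() {
            if std::mem::replace(&mut seen[i], true) {
                continue; // already visited
            }
            *nodes[i].value.borrow_mut() += 1;
            stack.extend(&nodes[i].edges);
        }
    }

    fn main() {
        let nodes = vec![
            Node { value: RefCell::new(0), edges: vec![1] },
            Node { value: RefCell::new(0), edges: vec![] },
        ];
        bump_reachable(&nodes, 0);
        assert_eq!(*nodes[1].value.borrow(), 1);
    }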
This is new to me. My mind blew up after 1:58:00. LOL
Great topic!
Ok, I have a question on the possible reordering of the program by the CPU when the operations are independent: if I print something in my function, but the printed value doesn't depend on any operation of the function, like if I just put "println!("test");", is the compiler allowed to move that print within the function? Or, because IO is considered to have side effects, can it not assume there is no link with the other operations?
The compiler won't reorder entire functions; what it will reorder is memory loads and stores. Same for the CPU. Atomics are all about the super-low-level load and store primitives.
Hmm... this loom thing reminds me of Kyle Kingsbury's "jepsen" and "Knossos" software to check distributed databases for consistency issues.
Is it possible to get the code for this episode as a gist or git repo?
Sometimes it is hard to follow, and I have to rewind all the way because I forgot the context when switching between examples.
I think my take-home message from this has been: if I do multi-threading in Rust, just use a damned Mutex.
At least learn channels as well. They fit some problems much better. The book has a subchapter on them.
1:14:47 C++ docs are enlightening
I wonder if the data race for the locking of the mutex is more likely on big.LITTLE architectures because of the different speeds of the P and E cores.
In the Ordering::Relaxed section, y can store into x's 42 value because they happen at the same time? The let r1 is step a1, the let r2 is step b1, and they happen at the same time since they are in different threads. The x.store is step a2, and the y.store is step b2. Because it's Relaxed, the compiler and the CPU are free to execute these operations in any order, so a1 can happen at the same time b2 is happening, meaning we see y's stored value cross over into x.
Getting a Raspberry Pi 4 to show the Relaxed Ordering failing could be interesting.
With how much power Ordering has over how the compiler will execute the statements it sees, it would almost be helpful to have syntax lowlighting in the background of a different color to show where the relationships start and end.
So atomics are single assembly instructions; an instruction is only considered atomic when no other application, thread, process, or core has a chance to interject itself into your program flow, because it's a single instruction. Electrons are in flight changing the value; nothing else can happen until that's done.
1:30:55 - Line 864: What on earth is `x @ Ok(_)`?! I've never seen that before.
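(For anyone else wondering: `@` is a binding pattern; it binds the whole matched value to a name while still testing it against the inner pattern. A tiny illustration, unrelated to the stream's actual code:)

    fn describe(r: Result<u32, ()>) -> String {
        match r {
            // matches any Ok(...) and binds the whole Result to `x`
            x @ Ok(_) => format!("an Ok value: {x:?}"),
            Err(_) => "an Err value".to_string(),
        }
    }

    fn main() {
        println!("{}", describe(Ok(7)));
        println!("{}", describe(Err(())));
    }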
A little confused on Box::leak. So leak gives you back a reference to the actual value and consumes the owned binding, but I'm confused as to why one would want it over just allocating with Box::new alone. What benefit does that yield? Thanks.
Box::leak gives you a &'static, which you can then give out many copies of, including to other threads. If you have a Box, any reference to the Box is tied to the lifetime of the Box (i.e., it isn't 'static), so you can't send it to other threads and such!
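A minimal sketch of what that enables (thread count is arbitrary): the leaked &'static reference is Copy, so every thread can take its own.

    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread;

    fn main() {
        // Leak the Box to get a &'static AtomicUsize.
        let counter: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
        let handles: Vec<_> = (0..4)
            .map(|_| {
                // the &'static reference is Copy, so each thread gets its own
                thread::spawn(move || {
                    counter.fetch_add(1, Ordering::Relaxed);
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        assert_eq!(counter.load(Ordering::Relaxed), 4);
    }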
@@jonhoo - gotcha, thanks much. Actually, kinda wondering what makes this thread-safe? A 'static lifetime doesn't itself imply thread safety, does it?
@@jeffg4686 the short version is that it's a non-mut reference, i.e. a "shared" reference, and the type is an atomic, which implements Sync and is thus declaring itself safe to use from multiple threads at the same time.
2:39:17 always an educational experience :)
Thanks
2:36:45 uh, Rust has .as_ptr(), which is surprisingly safe, as dereferencing it is unsafe anyway.
Do you use the vim NERDTree plugin? How do you navigate between different files?
Nope. I just use fzf.vim :) github.com/jonhoo/configs/blob/26409d160dff9a667d3fcb854bf83b37b544b154/editor/.config/nvim/init.vim#L28
Explaining atomics without an Indiana Jones reference.
NOM NOM NOM NOM
Monja monja
Is the Rust language capable of writing a single-process, single-thread, non-atomic-data-type console program?
yes
@@jonhoo this is good news, but it seems like it doesn't have the capability to offload workloads to the GPU, does it? (Heterogeneous computing)
This is great
Thank you!!!!!
I knew the difference but 16:00 really gave me an ah-ha moment on usage.
1:05 the magic happens in the compiler!!!
Nice job, btw why not just get a dark mode browser plug-in 😂
Can machine instruction reordering make the function below panic by making line B execute before line A?
fn foo() -> Result<(), MyErr> {
    ...
    if my_vec.len() < 2 { return Err(MyErr::TooShortErrMsg); } // line A
    let last_value = my_vec.last().unwrap(); // line B; last() returns None if my_vec is empty
    ...
    Ok(())
}
No - the compiler must respect "program order" as far as any visible side-effects. The reason it can seemingly reorder operations when multiple threads are involved is that it's allowed to choose semi-arbitrarily which other threads' writes are visible at what time.
Yum num yum
I have a problem with the part at 1:53:05, you say "there's no requirement that x is true even though tx has run". That's simply false. An Ordering::Acquire will see all Ordering::Release of the same variable, if it's run. Did you mean to say "even though ty has run"?
The C specification is actually extremely vague on this front. From the C specification section 5.1.2.4: "The value observed by a load of an atomic object depends on the “happens before” relation, which in turn depends on the values observed by loads of atomic objects. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here." Section 7.17.3.1 does say that "[i]mplementations should make atomic stores visible to atomic loads within a reasonable amount of time", but that's really all. Ultimately, by my reading at least, if two threads A and B have no relationship except that A writes x (with Release) and B reads x (with Acquire), it doesn't _seem_ like there is a requirement that B _will_ see A's write to x. There _is_ a requirement that _if_ B sees A's write to x, then B's read of x "happens after" A's write to x. Now, _in practice_ you won't generally observe this oddity, since chances are that there are also other "happens before" relationships that end up making the store visible anyway.
The reason I conclude that this is the right reading of the standard, as opposed to "any acquire load of x that occurs after (by wall-clock time) a release store of x _will_ see the x", is that otherwise the example given in the reference (en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering) would not require SeqCst - we'd be back to "either tx or ty happened first" even for Acquire/Release ordering, in which case an outcome of 0 is impossible. But the reference says that 0 _is_ possible under Acquire/Release, which must mean that it's possible for tx to run, but its store only being visible to one thread but not another.
@@jonhoo Thanks for the response. I'm not quite convinced, but I'll think on this some more. I'm wondering if this is maybe a difference between C11 and C++11 given that you're mixing the two languages here.
@@Ergzay They both have the same memory ordering model as far as I am aware, I just had the C11 spec more readily available :)
@@jonhoo As a side note, if you have sway in the loom project, if you could talk them into increasing MAX_THREADS or making it configurable, it's currently a hard coded const value to the number "4" which makes it impossible to use loom with this example as there's 5 threads including the main thread. Using loom with this example and it just instantly panics. It makes it hard for me to understand how it can be used for anything other than trivial examples given it's so limited.
@@Ergzay Yeah, it's a known challenge. Part of the problem is that loom needs to track a lot of per-thread state, and it does so with fixed size arrays to avoid allocating (which would kill performance). Bumping the number of threads would thus also bump the size of the entire working state fairly drastically. Though I wonder if maybe it'd be possible to use the new const generics to make it configurable!
You can actually model this case in loom just by doing the store to x in the main thread. In general I've found it to be rare that you need more than four threads to replicate any given race condition at the level of atomics.
19:36 memo
Bruh ya selling me on coding.
th-cam.com/video/rMGWeSjctlY/w-d-xo.html
Instruction-level parallelism is what you mean, I think.
underrated
th-cam.com/video/rMGWeSjctlY/w-d-xo.html
I think you wanna say data dependencies.
wtf this guy is a vim demon 😭
Stop trying to make fetch happen
The low-freq feedback of your keyboard resonating through your microphone when you're typing, that's really annoying :>