Threads On Multicore Systems
- Published on Feb 5, 2025
Join CodeCrafters and learn by creating your own Redis, Git, HTTP server, interpreter, grep... in your favorite programming language:
app.codecrafte...
In this video we explore the role of threads on multi-core systems.
Questions and business contact:
contact.coredumped@gmail.com
Sponsor my work on Github:
github.com/jdv...
Join our discord server:
/ discord
Follow me on twitter:
twitter.com/c...
Twitch:
/ coredumpped
For those wondering about my computer: x.com/coredumpped/status/1868128051361354032?s=46
This video was sponsored by Brilliant.
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/CoreDumped. You’ll also get 20% off an annual premium subscription.
No channel has every single video worth watching - well, except for the lord Core Dumped
Can we have more computer architecture videos?
Yes, sir!
Yes! Please
Realest 🔥💯
This channel is like a gold mine. Here you can find all those concepts that seem complicated, in a form that almost everyone can understand.
We haven't even touched on synchronization and atomics yet, but I'm already stoked to see how that + this video + your IPC video come together to allow us to finally have insight into true parallel scalability! This journey has been so consistently clear yet thorough every step of the way, I can't wait to watch it culminate :)
atomics? i'm scared to even google this
i will though
10:11 You can still chunk the array into 4 segments, find the minimum in each, and then take the minimum from those 4 results as you join the threads.
Yep! Forkjoin parallelism to the rescue!
We may split, map and reduce if the operation is associative like sum, min, max.
It's good because a monad is just a monoid in the category of endofunctors! ;)
I've finally realized that that statement just refers to things that are not nested in a meaningful way. What @leongtengman6261 is referring to about associativity is that a data structure that's associative for some operation can be thought of as "flat" and easily "chunkable" when trying to apply that operation to it. Trees can be represented as nested tuples whereas something list-like doesn't need parentheses to distinguish the structure. The challenge is that applying a map function to a tree requires knowledge of the internal structure and isn't easily scalable.
But something without structure like lists, arrays, etc (aka anything "monoidal", think "mono" for one possible internal structure) can easily have a map function applied to its parts in chunks as long as the overall ordering of the chunks is preserved (since list-like things still imply an ordering property compared to something like a bag or set). So monoids can be trivially fanned out / split into chunks as small as you want, have a function applied to all the items in each chunk, and then be fanned back in to a single data structure. Then the restriction on time is the overhead of time fanning out and in (or "forkjoin" as @no1ofinterst phrases it) as well as number of available resources.
That's what I thought. It works when you have billions of numbers and split them.
I have a feeling he was leading up to it in a follow up example but the video ended before it happened, maybe in a future video.
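A minimal Rust sketch of the fork-join min discussed in this thread (illustrative only: the chunk count, names, and data are made up; uses std::thread::scope from the standard library):

```rust
use std::thread;

fn parallel_min(data: &[i32], num_threads: usize) -> Option<i32> {
    if data.is_empty() {
        return None;
    }
    // Ceiling division so every element lands in some chunk.
    let chunk_size = data.len().div_ceil(num_threads);
    thread::scope(|s| {
        // Fork: one thread per chunk computes a local minimum.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().copied().min()))
            .collect();
        // Join: reducing the per-chunk minima is the tiny sequential step.
        handles.into_iter().filter_map(|h| h.join().unwrap()).min()
    })
}

fn main() {
    let data: Vec<i32> = (0..1_000_000).map(|i| (i * 31) % 9_973).collect();
    println!("min = {:?}", parallel_min(&data, 4)); // Some(0)
}
```

The same shape works for any associative operation (sum, max, ...), which is exactly the point about associativity made above.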
Babe wake up. Core Dumped released his new video within 15 days.
Yup
Yeah
Babe wake up. @rohithpokala is about to win a subscription to CodeCrafters!
I'm a CS graduate and I think your videos are perfect for software engineers who did not have the chance to attend good CS basics classes. Your videos are simple, yet very informative and well structured.
For those who also like these videos, I would recommend reading the book "The Elements of Computing Systems". It's quite simple and gives a lot of hands-on exercises, building things from the CPU parts up to a game.
I would also like to see a video about CPU-GPU communication.
PS: messed up a computer? seems like someone uses arch btw...
@@stephanbylkov8215 I'm not even in college yet but want to learn about system-level concepts, and this has been especially useful
I'll be sure to check out the book too
Still can’t believe I’m learning more about circuits and operation load from you than from my professors. Don’t know how involved in the subject you are, but getting into the materials science and electrical engineering side of computing would be marvelous to have.
Thanks a ton and keep up the quality content!
This channel is GOAT.
I agree
Amazed at the quality of content you are putting out! Illuminating animations combined with concrete examples from the linux kernel is such a great way to communicate these concepts.
Few other channels give me as many “aha” moments per unit of time! Keep up your style!
To answer the question asked in the video at 9:45, think of it like baking a cake. You have two parts to the task: mixing the ingredients and baking the cake. Mixing can be done quickly and shared among friends, so the more friends you have, the faster you can mix. But once the cake is mixed, it has to go into the oven, and that part can only be done one at a time.
So, even if you have four friends helping you mix, you still have to wait for the oven to bake the cake. This means that while adding more friends speeds up the mixing, the overall time is limited by the baking time. The more time-consuming the baking (which is the sequential part), the less benefit you get from having more friends (or processors).
This is related to Amdahl's Law, which teaches us that not all tasks can be sped up just by adding more resources, especially if there are parts that must be done one after the other. It’s a great reminder of the limits of parallel processing. More info here: en.wikipedia.org/wiki/Amdahl%27s_law.
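Plugging numbers into Amdahl's Law makes the cake analogy concrete. Assuming (hypothetically) that the parallelizable mixing is half the total work:

```latex
S(n) = \frac{1}{(1 - p) + p/n}
% with parallel fraction p = 0.5 and n = 4 friends:
S(4) = \frac{1}{0.5 + 0.5/4} = 1.6
% and even with infinitely many friends:
S(\infty) = \frac{1}{1 - p} = 2
```

So four helpers buy only a 1.6x speedup, and no number of helpers can beat 2x while the baking stays sequential.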
Please also cover hyper-threading and the way it works in future episodes. Does it really provide parallelism, or does it just improve concurrency? I’ve been reading around the internet but I would love a CoreDumpped-level explanation. Thank you for all the great videos so far.
9:40
My thoughts are as follows: one might think that with 4 cores we’d achieve a 4x speedup, but the reality is more complex. I like to use this analogy: it’s like trying to cook faster using several pans - while you have more "processing power," there are other factors to consider:
Communication overhead: The cores need to coordinate with each other.
Shared resources: RAM and cache are shared (like when multiple cooks share utensils).
System competition: Other programs also need CPU time.
From my experience with parallelism, using 4 cores typically yields a 2-3x speedup. It’s a significant improvement, but it doesn’t reach the theoretical 4x due to these limitations.
I found it really interesting to see in the video benchmarks how different parts of the array took varying amounts of time to process. It makes sense-some numbers require more calculations to determine if they’re prime.
Fantastic video! Your explanation of the difference between concurrency and parallelism was incredibly clear-I've never seen it laid out so simply before. The examples of data parallelism vs. task parallelism really helped solidify my understanding.
On the topic of performance gains from parallelism, it might be interesting to dive into **Amdahl's Law** in a future episode. It highlights how the potential speedup of a task is limited by the sequential portion of the task that can't be parallelized. Understanding this could help viewers grasp why adding more cores doesn't always result in linear performance gains.
Keep up the great work! Your videos make complex concepts accessible and engaging. Looking forward to the next one!
This is one of the most knowledgeable channels I have ever found on YouTube
With this, I can now understand threads and parallelism much better.
Pls continue doing more videos like this, and take the time that you need to do it. Thanks, Core Dumped
One of the best things about this channel is that it uses the Rust language to explain concepts
I watched every single video on this channel to learn about the inner secrets of CPU and memory interaction. It has worked well for me so far. Thank you.
Your videos have been a lifesaver for my computer architecture studies. Winning the CodeCrafters subscription would be a huge boost to my learning journey. Thank you for creating such amazing and insightful content!
this is awesome! I have an upcoming OS exam and seeing these concepts come to life through your animations has made grasping the underlying theory so much more intuitive! much love from italy🙏
10:50 In this example the best way would be to run one loop in one thread but add variables there for min, max, etc. - all of it can be done in a single loop.
Great video! The upper limit for parallel speedup can be described with Amdahl's Law. The law states that the sequential part of a program (for instance, synchronization points) has a tremendous impact on the achievable parallel speedup. You can think of it literally like a bottleneck: even though the majority of the bottle is pretty wide, the speed at which the fluid flows out of the bottle depends on the smallest cross-section.
Your video really helped me understand the concepts of OS, especially threads, concurrency, and parallelism, which were tough to grasp before. It even helped me in my exams! As a student, learning about how an OS works and the basic principles of programming feels so much easier and more fun now.
Your channel is such an incredible source of knowledge for students like me who are eager to learn but often struggle with complex topics. Thank you for creating such amazing and free content! Please keep making more videos like this-you’re inspiring the next generation of developers. 🙌
My cop-out for OS exam studying is watching your videos
this channel is slowly becoming one of my favourites
Please keep it up - you are an amazing channel, providing content that makes really hard-to-grasp, hard-to-research knowledge accessible.
By the power vested in me as a YouTube commenter, I declare you the 3Blue1Brown of OS and Computer Architecture - your videos are so pleasing to watch! In terms of visuals/animation, content, and delivery, all are great!
I learn something useful every time I watch your videos, and I watch them on TV cuz I love them so much lol.
I think the state of the system at a given point in time changes the execution times, despite repeating the same task multiple times. You might have another background thread using one of the cores, making the running process get access to the core much later, and so on.
Ben Eater disappeared when the world needed him the most. But I believe Core Dumped could save the world
My only regret is not being able to record my own voice because some people seem to hate TTS to the point where they won't even watch the content but just comment "AI slop"
@@CoreDumpped personally i love the joke of you actually being an LLM, it's a fun spin on the channel.
@@CoreDumppedwell, i gotta say the TTS voice is an *appeal* of the channel to me. it's memorable, comfy, puts me into a listening and study mindset, and ive just come to think of you like this. i legit quite like it man, you're doing an incredible job with these videos
@CoreDumpped AI or not, your work can't be described as "slop". Every video seems to have a lot of thought put in.
@@CoreDumpped What those people don't realize is that just because the narration is AI, it doesn't mean the content wasn't extremely well thought out and high quality and certainly wasn't AI generated like the voice. It could be worth trying your own voice but to be honest it's really not that big a deal. The people who are actually interested in the content won't care.
You are a gift to humanity 🙏🏽
Please share your knowledge - there are millions of people who would greatly benefit
The visualisations help out a ton in understanding these concepts! Great job and thank you!
Your explanations are awesome.
Please make more videos.
Also, can you share some resources from which you are learning these concepts?
An interesting note about threads on multicore systems is that although they generally improve performance, there are cases where performance can suffer compared to a single-threaded program, such as poor implementations where a thread has to wait for another to complete its task.
It's worth mentioning that, depending on your data size, the order in which you process your data between threads may matter for cache performance, and the way in which your threads can sleep and be rescheduled may also impact performance due to the cache. This is one reason we sometimes might lock a thread to be scheduled on a specific core. The cache is typically tiered, meaning there will be a larger cache shared by all cores, and smaller caches exclusive to each core.
For instance, at 10:25 with the example of task parallelism, some of the tasks are starting at the end, some at the beginning. If your data set is sufficiently large that it cannot fit within the shared cache, then your tasks could be competing for space in that cache as they try to fetch from different places in memory, rather than using what another thread already fetched. Of course, caches can be complicated, especially when considering cache access for multiple simultaneous threads, and sometimes things don't work as you expect, so this is all hypothetical. All this is to say that if you want maximum performance, sometimes the order in which data is processed among different threads can be important, and profiling your code with these different scenarios is essential to finding the optimal performance, if that is important to you.
awesome content! It would be interesting to see the differences between dynamic dispatch and generics, and how they work behind the scenes, if the topic allows for that ofc
Hello, your videos are amazing, you have a talent to explain such difficult concepts in very simple words! Visualizations are extremely helpful!
Could you please also create a video regarding compiled and interpreted languages? How do interpreters work and how do they manage memory?
For example, in javascript we don't specify a variable type, does it mean that only identifiers themselves are stored on the stack while the actual values are stored in memory?
I haven't looked at the comments yet so I do not know if it's been said, but I do not agree that it makes no sense to split the problems here (10:10-ish) into subsets. If you have the minimum of each subset, it's very easy to find the global minimum. Same thing with maximum, and with whether or not the list contains 101 (even though I admit that in the last case you are worsening your best-case performance, if the value happens to be at the beginning of the list). It's also easy for the mean (either compute the mean in each thread and then take a weighted average of the means, or just compute the sum in each thread and only do the division on the main thread).
Apart from this nitpick, amazing work as always!
I was about to comment the same, you are correct!
Exactly! This is basically the map/reduce architecture.
Yes, and in the case of contains it is even possible to cancel the remaining threads early when one of the threads returns because it has found the value.
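A sketch of that early-cancelling contains in Rust, using a shared AtomicBool as the single output flag (names and the chunking scheme are illustrative, not from the video):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn parallel_contains(data: &[i32], target: i32, num_threads: usize) -> bool {
    if data.is_empty() {
        return false;
    }
    let found = AtomicBool::new(false);
    let chunk_size = data.len().div_ceil(num_threads);
    thread::scope(|s| {
        for chunk in data.chunks(chunk_size) {
            let found = &found;
            s.spawn(move || {
                for &x in chunk {
                    // Bail out early once any thread has found the target.
                    if found.load(Ordering::Relaxed) {
                        return;
                    }
                    if x == target {
                        found.store(true, Ordering::Relaxed);
                        return;
                    }
                }
            });
        }
    });
    found.load(Ordering::Relaxed)
}

fn main() {
    let data: Vec<i32> = (0..1_000_000).collect();
    println!("{}", parallel_contains(&data, 101, 4)); // true
}
```

Checking the flag on every element is the simplest form; a real implementation might check it only every few hundred elements to keep the hot loop tight.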
We always love a CoreDumpped video. Thank you George😊
Your videos are helping me get through my course 😄
Amazing content. What a wild time to be alive, with free access to such high-quality educational videos. Thank you very much
I read the book Modern Operating Systems by Andrew S. Tanenbaum twice to get the concepts, but never understood them in this much detail. This is by far the best YouTube channel for truly understanding this topic. I have recommended this channel since the first video. Do you have a Patreon?
10:10 @CoreDumpped A small correction:
All of those can be parallelized for a significant performance improvement. For min, max, and mean you can just split the data into n subsets, calculate the result for each of them, and then compute the final value from the results of those n subsets. This is still roughly an n-fold increase in performance (asymptotically it is exactly n-fold), though of course not as much as for a fully parallelizable task.
Likewise, for contains you can run the check on n subsets instead, and once one of them finds 101, that thread sets the output to true (and optionally tells all the other threads to stop looking). This is in fact perfectly parallelizable, since no actual communication between the threads is needed beyond a single shared output variable in memory. The speedup here is EXACTLY n-fold (assuming no overhead for thread creation etc.).
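For the mean case specifically, here is a minimal sketch of the "sum per subset, divide once" idea from this thread (names are illustrative; assumes std::thread::scope):

```rust
use std::thread;

fn parallel_mean(data: &[f64], num_threads: usize) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let chunk_size = data.len().div_ceil(num_threads);
    // Each thread computes only its own chunk's sum, in parallel.
    let total: f64 = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    // The single division is the only work left on the main thread.
    Some(total / data.len() as f64)
}
```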
Love the way the knowledge is organized and presented. This is what is called "wisdom".
Amazing video! Multi threading, if done right, does wonders in improving performance. It works especially well when you are doing anything that is I/O bound because I/O is very slow.
Had a scenario where I had to process an array of tasks; each task fetched data from 4 different sources, then combined the results and stored them elsewhere. Completing 100 tasks took about 3 minutes, whereas introducing threads cut this time down to 15 seconds.
While developing this, I also learned that you really need to be careful with multithreading. The tasks were assigned to a fixed-size thread pool so as to not spawn too many threads. Well, it turns out that if you do not handle exceptions correctly, the thread suspends. Now you have 1 less thread in the thread pool. Have it happen enough times and now you have hardly 1-2 threads trying to do all the tasks. Debugging that issue was a roller coaster of emotions that I won't ever forget.
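The story above presumably isn't about Rust, but the same failure mode - and the fix of catching per-task failures inside the worker - can be sketched in Rust with catch_unwind (the one-thread, channel-based "pool" here is a deliberately minimal stand-in):

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();

    // A worker that survives panicking tasks. Without catch_unwind,
    // one bad task would kill the thread and silently shrink the pool.
    let worker = thread::spawn(move || {
        for task in rx {
            if panic::catch_unwind(AssertUnwindSafe(task)).is_err() {
                eprintln!("task panicked; worker keeps running");
            }
        }
    });

    tx.send(Box::new(|| println!("task 1 ok"))).unwrap();
    tx.send(Box::new(|| panic!("task 2 blows up"))).unwrap();
    tx.send(Box::new(|| println!("task 3 still runs"))).unwrap();
    drop(tx); // close the channel so the worker's loop ends
    worker.join().unwrap();
}
```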
PERFECT visualizations. You really understand what makes concepts click and how they should be visualized.
Interesting to see another viewpoint on thread architecture - I always thought it's better to have a thread per physical core, so threads don't overwrite each other's local data, causing cache misses. I hope you explore more in that area. Nice video!
Thank you for this. As an IT grad, this kind of deep dive into programming concepts was not discussed this thoroughly in our university. That's why, when I got to work with CS grads, I found out that I'm actually missing a lot of programming fundamentals. 🥺
I'm new here but this channel seems one of the best of its kind, good luck for the winner
I wish I had seen all your videos during my BSc.
Hell, I'm in love with your animations
Would love to see more, especially about CPU Scheduling
Soon!
It's impressive how far we've gotten. At first, it was more like electronics, but now we're starting to dive into computing; I love it
I love this channel. Every time I learn something.
Haven’t had this all explained so simply before. Great video!
Superb Explanation.
Thank You.
Please post more such videos.
Greetings from Brazil - I've been a huge fan of yours since the "why is the stack so fast" video. Great video ❤
You're the best George!
Thank you for your videos!
Just discovered this channel and it's great! Much more digestible than my university courses.
This channel is a gold mine! 🥳
Truly
I'm not sure there is a better example, but all four of the operations you mentioned can be parallelised :) min(array) is the same as min([min(subarray) for subarray in chunked(array)]). `contains` is best because the serial phase depends on the number of threads, not on the size of the data! There are even algorithms like partial (prefix) sums which are really important in data parallelism. [sorry for the bad pseudocode!]
Yes, but I never said they couldn't be. What I said is that to get the result of those operations you need the whole dataset anyways so assigning each task to a different thread makes more sense.
Now, whether that's the best solution is out of the scope of the video.
Hello. I understand that you speak Spanish. Thank you for the Castilian Spanish audio option. I'm going to recommend your channel to my students so they can visually understand all these Operating Systems concepts.
Thank you very much.
One of the only channels where I immediately like the video because I already know i will love it!
His explanation is god tier❤
Man, this guy's videos are so intuitive and so well explained.
This deserves its own award category. 🏆🌟
This channel's videos are just 🔥❤️❤️❤️❤️ plz provide the next parts soon
can you make a video about floating-point numbers - how to convert them to decimal, arithmetic operations on them, and some other stuff
Thanks man, you're always on top! Everyone just needs to be cautious when working with chunks on separate threads. Make sure not to mutate the data or array you use, otherwise the program becomes unpredictable.
I'm your first supporter if you decide to make a website like Brilliant but for low-level stuff / computer science.
I really would love to see that; at the same time it feels like a missed opportunity that no other website really takes, maybe because everyone is already familiar with and searches for conventional learning - but then again, the potential is apparent in your vids.
Anyway, keep up the quality content!
Beautifully animated, makes the subject x10 easier to digest!
The video is very cool, I really love the visuals and the animations to explain the concepts, what do you use or how do you make the animations?
Your videos are so informative and valuable, thank you🙏
coredumped releases the video that I was excited about. Thanks
I think of the usefulness of parallelism as proportional to the consistency and complexity of the task.
If the task is computationally simple, it will parallelize well - for example, adding up all the numbers in a big list, or sorting an array with merge sort.
But things like testing for prime numbers, which is not a linear-time problem, don't parallelize so well, as you'll be waiting for the slowest thread.
This brings me to GPUs. GPUs are excellent for parallelization, great at vector math, but for them to shine, every thread (aka fragment) should finish executing as close to the others as possible.
Still, parallelization is a great tool, but it shouldn't be abused. The limiting factor will be the slowest thread, and you should scale the number of threads accordingly.
Edit: As always, great video ❤
Regarding the point raised at 9:51, I think factors like synchronization overhead, cache coherency issues, and imperfect load balancing reduce efficiency. Additionally, physical constraints such as heat generation and power limits can lead to throttling, further hindering performance scaling. These challenges, combined with the inherent difficulty of fully parallelizing tasks, prevent linear scaling even in isolated environments
in the RISC-V ecosystem, there are "harts" - hardware threads. A core can be constructed with extra hardware that lets one core operate on two threads, as in hyper-threading.
The answer to the question asked in the video depends on how much of the process/task/program (sorry, I'm not clear about these terms) is parallelizable, and on the number of cores.
Applying Amdahl's Law:
speedup = 1 / ((1 - P) + P/n)
P = fraction of the program that can be parallelized, e.g. 0.4 for 40%.
n = number of cores.
From this, we can find the speedup. To express the performance increase in terms of time:
Tp = Ts / speedup
where Tp is the execution time after parallelizing and Ts is the sequential time on a single core (before parallelizing).
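The same formulas, typed up as a tiny Rust helper using this comment's example values (purely illustrative):

```rust
// Amdahl's Law as stated above: p = parallel fraction, n = cores.
fn speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    let (p, n) = (0.4, 4.0); // 40% parallelizable, 4 cores
    let s = speedup(p, n);
    let ts = 100.0; // Ts: sequential time, arbitrary units
    // Prints: speedup = 1.429, Tp = 70.0
    println!("speedup = {s:.3}, Tp = {:.1}", ts / s);
}
```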
Great explanation!! Please create a video on kernel internals
I'm putting faith in you guys, up vote ❤
The 4 ops you gave actually allow for a very elegant data split since they are all reductions.
For instance, we can compute the max for each of the 4 slices; then the maximum of those maxima is the true max.
9:41 Here the workload is distributed by splitting the data into 4 even subsets. Instead it could be distributed using a queue, where each thread pops the next number from the queue (or a small chunk of numbers if each pop is costly). This would result in a more even utilisation of all cores.
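A rough Rust sketch of that queue idea, with a shared atomic counter standing in for the queue and the video's prime test as the per-item work (the names, the chunk size grab, and the is_prime helper are all illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn is_prime(n: u64) -> bool {
    n > 1 && (2..).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

fn count_primes_queued(data: &[u64], num_threads: usize, grab: usize) -> usize {
    let next = AtomicUsize::new(0);
    let total = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..num_threads {
            let (next, total) = (&next, &total);
            s.spawn(move || loop {
                // Pop the next small chunk of indices off the shared "queue";
                // fast threads naturally come back for more work.
                let start = next.fetch_add(grab, Ordering::Relaxed);
                if start >= data.len() {
                    break;
                }
                let end = (start + grab).min(data.len());
                let local = data[start..end].iter().filter(|&&n| is_prime(n)).count();
                total.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}

fn main() {
    let data: Vec<u64> = (0..200_000).collect();
    println!("{} primes", count_primes_queued(&data, 4, 256));
}
```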
How much we can get out of multiple cores depends on a lot of factors. Take the example from the video of checking for prime numbers: with a sorted data set, we can get very good results by distributing the data modulo-wise. With 4 cores, the 1st number goes to core 1, the 2nd to core 2, the 3rd to core 3, the 4th to core 4, then the 5th to core 1, the 6th to core 2, the 7th to core 3, the 8th to core 4, and so on. This is very well balanced, but it needs pre-computation or already-sorted data. I also want to point out that to find the min, each core can find the min of its share of the distributed data and then we take the min of those - that is also efficient.
Brilliant as usual. Looking forward to a new episode with a code example.
Great animation and excellent educational video. Thanks.
This channel should have a lot more stars. A new video feels like Christmas arriving before the actual date.
Core Dumped has changed the way I look at computer science - things my university never taught ❤
We also need a video on how having several cores increases overall CPU performance in different ways. I believe multithreading is just one way of utilizing several cores
i don't know if you take suggestions, but a video on file descriptors (or file pointers) could be entertaining - like how file paging works and how processes get access to files (and shared memory space)
I'm early!! Bro I love your videos so much. I'm subscribed and I like all of them. Please can you do a video on containers like Docker? An in-depth explanation please, thank you.
Amazing explanation and video. Glad I came across this. Can you please also explain the difference between a CPU core and a virtual core, why it's needed, and what the difference is at the hardware level? I mean, when we say things like "This Intel chip has 4 physical cores and 8 virtual cores, i.e., two virtual cores per physical core" - what does this mean? How does it work? And how does it relate to the threads in this video? I'll be very grateful if you explain this.
Another amazing video, like always. For me this video is the Christmas gift :)
That's the kind of content that I like on YouTube
I love your videos, man! Somehow your videos match my class exam topics almost exactly lol.
Regarding the performance gain though: to me, there will always be overhead (distributing/broadcasting, then gathering) in parallelism, and I would have to ask myself if the parallelism is worth it. Of course, there's also the problem of distributing the work/data equally.
A simple example my prof used to give: imagine you have to count 10 candies. It would be much faster to count them yourself than to ask 10 people to count them and summarize the result :))
I know it's a bit more complicated, but, man, if you ever need a rest from dropping all these gems - to achieve better quality/results/polish in return - please take it. Your videos will still be played years later (yes, they're that good), and we can wait a couple of weeks
According to my university classes on computer architecture, the maximum theoretical speed-up you can achieve by parallelizing a problem is limited by Amdahl's Law. This depends on the fraction of the problem that can be parallelized. For instance, if a problem is 100% parallelizable, the theoretical maximum speed-up would be N, where N is the number of processors in the system. However, in practice, the actual speed-up is lower due to additional factors, such as the synchronization required for communication between processes, contention for shared resources (like data in memory), and the overhead of creating and managing threads or processes. These practical limitations often prevent parallelization from reaching its theoretical maximum.
This video made me incredibly excited for your next video.
Really good video! Honestly, though, Data vs Task parallelism, to me, starts jumping more into arbitrary distinctions (as it doesn't actually affect what the CPU does) but perhaps the next video will make it more clear. Looking forward to it!
It becomes more distinct when you have threads that depend on other threads, or if you have any writes to shared data. Basically, data parallelization can often imply that the underlying data isn't going to change, so the tasks are more isolated, and thus it's a much more straightforward approach to scaling wide with cores. With task parallelization, those tasks may have side effects that alter the underlying data and/or access a shared resource that can only be safely accessed by one thread at a time, meaning it's more important that these methods involve locking mechanisms to avoid corrupting things. When you have to use locking mechanisms, that's when it does have a tangible effect on what the CPU is doing.
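A tiny Rust illustration of that last point: task-parallel work whose side effects touch shared data, forced through a Mutex (everything here is a made-up example, not from the video):

```rust
use std::sync::Mutex;
use std::thread;

fn main() {
    // Shared output that several tasks write to concurrently.
    let results = Mutex::new(Vec::new());
    thread::scope(|s| {
        for task_id in 0..4 {
            let results = &results;
            s.spawn(move || {
                let value = task_id * 10; // stand-in for the real task's output
                // Only one thread at a time may touch the shared Vec.
                results.lock().unwrap().push((task_id, value));
            });
        }
    });
    println!("{:?}", results.into_inner().unwrap());
}
```

With pure data parallelism over disjoint chunks, no lock would be needed at all - which is exactly why it scales so straightforwardly.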
I'd love to see you cover GPU architecture when you finish up the CPU and OS series!
9:45 How much performance gain can we achieve from parallel operations? Let's try (I put some icons to cross-reference the formulas 🔶️🔺️) -->
Things to consider :
- Time cost of parallelization P : thread creation, data splitting and reconstruction (or result merging) time
- Size of the data we manipulate N
- Number of cores C
- Each split has a size S = N/C
- The executed algo (the prime test here): the bigger the number, the longer it takes to test (looking for divisors up to its square root), in this specific case
Tbase : execution time with no parallelization
Tpara : execution time with parallelization
⭐ Having a gain means Tpara < Tbase 🔺️
Tpara = P + Tslowest_thread 🔶️ (700ms here)
We have this slow thread here because of the algo executed and the fact that the first split contains more big numbers than the others.
If we randomly change the input for each execution during the test, getting a fairer distribution of big numbers on average, we have Tpara = P + Tavg_thread 🔶️. I think this is the intuitive theory ->
Tavg_thread =~ Tbase / C
-> Tpara = P + Tbase / C 🔶️
-> we want P + Tbase / C < Tbase to have a gain 🔺️
-> P < Tbase ( 1 - 1/C ) 🔺️
C = 1 (one core) : no parallelization
C = 2 : the parallelization cost in time has to be less than Tbase/2 or 50%Tbase to have a gain (P < 50%Tbase)
C = 4 : P < 75%Tbase
P has to stay below a bound (a fraction of Tbase) that grows with the number of cores. In other words, the gain seems easier and easier to achieve as C increases. ⚠️ BUT the number of cores used (C) impacts the parallelization time cost (P): P = f(C)
Pause! What am I doing?! First, that was not the question; second, I am not an expert; and third, George will explain this way better. Let's go back to the point.
⭐ Gain = Tbase - Tpara
Using 🔶️ -->
Gain = Tbase - ( P + Tbase / C )
Gain = Tbase * (1 - 1/C) - P 💥
If we force the equal distribution of big numbers in each split, the time of execution of this should be added to P.
I would be glad to be corrected if I made any mistake.
PS : George @Core Dumped is fantastic ! 🔥
3Blue1Brown of low level software eng. I am binge watching and I love it.
Any chance of knowing which software you use?
Thank you for the amazing job !
Keep up the high quality content
This channel is gold. Makes me want to study Computer Engineering
"hi friends!, my name is george" is soo calmly said by TTS :))
Thank you george