For those wondering about my computer: x.com/coredumpped/status/1868128051361354032?s=46
This video was sponsored by Brilliant.
To try everything Brilliant has to offer-free-for a full 30 days, visit brilliant.org/CoreDumped. You’ll also get 20% off an annual premium subscription.
No channel has every single video worth watching, well except for the lord core dumped
Can we have more computer architecture videos?
Yes, sir!
Yes! Please
Realest 🔥💯
This channel is like a gold mine. Here you can find all those concepts that seem complicated, in a form that almost everyone can understand.
We havent even touched on synchronization and atomics yet, but I'm already stoked to see how that + this video + your IPC video come together to allow us to finally have insight into true parallel scalability! This journey has been so consistently clear yet thorough every step of the way, I can't wait to watch it culminate :)
Babe wakeup. Core Dumped released his new video within 15 days.
Yup
Yeah
Babe wakeup. @rohithpokala is about to win a subscription to codecrafters!
10:11 You can still chunk the array into 4 segments, find the minimum in each, and then take the minimum from those 4 results as you join the threads.
Yep! Forkjoin parallelism to the rescue!
We may split, map and reduce if the operation is associative like sum, min, max.
It's good because a monad is just a monoid in the category of endofunctors! ;)
I've finally realized that that statement just refers to things that are not nested in a meaningful way. What @leongtengman6261 is referring to about associativity is that a data structure that's associative for some operation can be thought of as "flat" and easily "chunkable" when trying to apply that operation to it. Trees can be represented as nested tuples whereas something list-like doesn't need parentheses to distinguish the structure. The challenge is that applying a map function to a tree requires knowledge of the internal structure and isn't easily scalable.
But something without structure like lists, arrays, etc (aka anything "monoidal", think "mono" for one possible internal structure) can easily have a map function applied to its parts in chunks as long as the overall ordering of the chunks is preserved (since list-like things still imply an ordering property compared to something like a bag or set). So monoids can be trivially fanned out / split into chunks as small as you want, have a function applied to all the items in each chunk, and then be fanned back in to a single data structure. Then the restriction on time is the overhead of time fanning out and in (or "forkjoin" as @no1ofinterst phrases it) as well as number of available resources.
thats what i thought. works when you have billions of numbers and split them
I'm not sure there is a better example, but all four of the operations you mentioned can be parallelised :) min(array) is the same as min([min(subarray) for subarray in chunked(array)]). `contains` is best because the serial phase depends on the number of threads, not on the size of the data! There are even algorithms like partial sums which are really important in data parallelism. [sorry for the bad pseudocode!]
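To make that shape concrete, here is a minimal Python sketch of the chunked-min idea (my own toy code, not anything from the video; note that CPython's GIL means threads won't actually speed up CPU-bound work here, so this only illustrates the fork/join structure):

from concurrent.futures import ThreadPoolExecutor

def chunked(data, n):
    # split data into n roughly equal contiguous chunks
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_min(data, workers=4):
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(min, chunks))  # min of each chunk, one task per chunk
    return min(partials)                        # tiny serial "join" step at the end

print(parallel_min(list(range(1000, 0, -1))))   # prints 1

The only serial phase is the final min over the per-chunk results, which grows with the number of threads rather than the size of the data, exactly as you said for `contains`.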
To answer the question asked in the video at 9:45, think of it like baking a cake. You have two parts to the task: mixing the ingredients and baking the cake. Mixing can be done quickly and shared among friends, so the more friends you have, the faster you can mix. But once the cake is mixed, it has to go into the oven, and that part can only be done one at a time.
So, even if you have four friends helping you mix, you still have to wait for the oven to bake the cake. This means that while adding more friends speeds up the mixing, the overall time is limited by the baking time. The more time-consuming the baking (which is the sequential part), the less benefit you get from having more friends (or processors).
This is related to Amdahl's Law, which teaches us that not all tasks can be sped up just by adding more resources, especially if there are parts that must be done one after the other. It's a great reminder of the limits of parallel processing. More info here: en.wikipedia.org/wiki/Amdahl%27s_law.
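To put rough numbers on the cake analogy (the 90/10 split below is just an assumption for illustration), Amdahl's law gives:

speedup(N) = 1 / ((1 - p) + p/N), with p = 0.9 for mixing and 0.1 fixed for baking:
N = 2 friends  ->  1 / (0.1 + 0.45)  ≈ 1.8x
N = 4 friends  ->  1 / (0.1 + 0.225) ≈ 3.1x
N = infinity   ->  1 / 0.1 = 10x, no matter how many friends show up.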
You're the best George!
Thank you for your videos!
This channel is GOAT.
I agree
Ben eater disappeared when the world needed him the most. But i believe core dumped could save the world
My only regret is not being able to record my own voice because some people seem to hate TTS to the point where they won't even watch the content but just comment "AI slop"
@@CoreDumpped personally i love the joke of you actually being an LLM, it's a fun spin on the channel.
@@CoreDumppedwell, i gotta say the TTS voice is an *appeal* of the channel to me. it's memorable, comfy, puts me into a listening and study mindset, and ive just come to think of you like this. i legit quite like it man, you're doing an incredible job with these videos
@CoreDumpped AI or not, your work can't be described as "slop". Every video seems to have a lot of thought put in.
@@CoreDumpped What those people don't realize is that just because the narration is AI, it doesn't mean the content wasn't extremely well thought out and high quality and certainly wasn't AI generated like the voice. It could be worth trying your own voice but to be honest it's really not that big a deal. The people who are actually interested in the content won't care.
Your videos have been a lifesaver for my computer architecture studies. Winning the CodeCrafters subscription would be a huge boost to my learning journey. Thank you for creating such amazing and insightful content!
I'm putting faith in you guys, up vote ❤
I'm a CS graduate and I think your videos are perfect for software engineers who did not have the chance to attend good CS basics classes. Your videos are simple, yet very informative and well structured.
For those who also like these videos, I would recommend reading the book "The Elements of Computing Systems". It's quite simple and gives a lot of hands-on exercises, building things from the CPU parts up to a game.
I would also like to see a video about CPU-GPU communication.
PS: messed up a computer? seems like someone uses arch btw...
@@stephanbylkov8215 I'm not even in college yet but want to learn about system-level concepts, and this has been especially useful
I'll be sure to check out the book too
9:45 How much performance gain can we achieve from parallel operations? Let's try (I put some icons for reference of the formulas 🔶️🔺️) -->
Things to consider :
- Time cost of parallelization P : thread creation, data splitting and reconstruction (or result merging) time
- Size of the data we manipulate N
- Number of cores C
- Each split has a size S = N/C
- The executed algo (prime test here): the bigger the number, the longer it takes to test it (looking for divisors under its square root) in this specific case
Tbase : execution time with no parallelization
Tpara : execution time with parallelization
⭐ Having a gain means Tpara < Tbase 🔺️
Tpara = P + Tslowest_thread 🔶️ (700 ns here)
We have this slow thread here because of the algo executed and the fact that the first split contains more big numbers than the others.
If we randomly change the input for each execution during the test, giving a fairer distribution of big numbers on average, we have Tpara = P + Tavg_thread 🔶️. I think this is the intuitive theory ->
Tavg_thread =~ Tbase / C
-> Tpara = P + Tbase / C 🔶️
-> we want P + Tbase / C < Tbase to have a gain 🔺️
-> P < Tbase ( 1 - 1/C ) 🔺️
C = 1 (one core) : no parallelization
C = 2 : the parallelization cost in time has to be less than Tbase/2 or 50%Tbase to have a gain (P < 50%Tbase)
C = 4 : P < 75%Tbase
P has to be lower than a number (a percentage of Tbase) that grows with the number of cores. In other words, the gain seems easier and easier to get as C increases. ⚠️ BUT the number of cores used (C) impacts the parallelization cost in time (P). P = f(C)
Pause! What am I doing?! First, that was not the question; second, I am not an expert; and third, George will explain this way better. Let's go back to the point.
⭐ Gain = Tbase - Tpara
Using 🔶️ -->
Gain = Tbase - ( P + Tbase / C )
Gain = Tbase * (1 - 1/C) - P 💥
If we force the equal distribution of big numbers in each split, the time of execution of this should be added to P.
I would be glad to be corrected if I made any mistake.
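A quick sanity check with made-up numbers (mine, not from the video): take Tbase = 1000 ns, C = 4 and P = 100 ns. Then Gain = 1000 * (1 - 1/4) - 100 = 650 ns, i.e. Tpara = P + Tbase/C = 100 + 250 = 350 ns, and the break-even condition P < Tbase * (1 - 1/C) = 750 ns holds comfortably.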
PS : George @Core Dumped is fantastic ! 🔥
The 3Blue1Brown of low-level software engineering. I am binge watching and I love it.
Any chance to know which software you use ?
Thank you for the amazing job !
My cop-out for OS exam studying is watching your videos
I haven't looked at the comments yet so I don't know if it's been said, but I don't agree that it doesn't make sense to split the problems here (10:10 ish) into subsets. If you have the minimum of each subset, it's very easy to find the global minimum. Same thing with the maximum and with whether or not the list contains 101 (even though I admit that in the last case you are worsening your best possible performance, if the value was at the beginning of the list). It's also easy for the mean (either compute the mean in each thread and then do a weighted average of the means, or just compute the sum in each thread and only do the division on the main thread).
Apart from this nitpick, amazing work as always!
I was about to comment the same, you are correct!
Exactly! This is basically the map/reduce architecture.
Yes, and in the case of the contains it is even possible to cancel the remaining threads early when one of the threads returns because it has found the value.
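A rough Python sketch of that early-exit `contains` (the function name and the Event-based cancellation are just my illustration of the idea, not how any particular library does it):

import threading

def parallel_contains(data, target, workers=4):
    found = threading.Event()               # shared "someone found it" flag
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    def scan(chunk):
        for x in chunk:
            if found.is_set():              # another thread already found the target
                return
            if x == target:
                found.set()                 # tell the other threads they can stop
                return

    threads = [threading.Thread(target=scan, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return found.is_set()

print(parallel_contains(list(range(1000)), 101))  # True

A real implementation would check the flag less often than once per element so the cancellation check itself doesn't dominate the work.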
This is one of the most knowledgeable channels I have ever found on RUclips
PERFECT visualizations. You really understand what makes concepts click and how they should be visualized.
Amazed at the quality of content you are putting out! Illuminating animations combined with concrete examples from the linux kernel is such a great way to communicate these concepts.
Few other channels give me as many “aha” moments per unit of time! Keep up your style!
I think the usefulness of parallelism is proportional to the consistency and complexity of the task.
If the task is computationally simple, it will parallelize well. For example, adding up all the numbers in a big list, or sorting an array with merge sort.
But things like checking for prime numbers, where the work per element isn't constant, don't parallelize so well, as you'll be waiting for the slowest thread.
This brings me to GPUs. GPUs are excellent for parallelization and great at vector math, but for them to shine, you should have every thread (aka fragment) finish executing as close to the others as possible.
Parallelization is still a great tool, but it shouldn't be abused. The limiting factor will be the slowest thread, and you should scale the number of threads accordingly.
Edit: As always, great video ❤
I learn something useful every time I watch your videos, and I watch them on TV cuz I love them so much lol.
I think the state of the system at a given point in time changes the execution times, despite repeating the same task multiple times. You might have another background thread using one of the cores, making the running process get access to the core much later, and so on.
Regarding the point raised at 9:51, I think factors like synchronization overhead, cache coherency issues, and imperfect load balancing reduce efficiency. Additionally, physical constraints such as heat generation and power limits can lead to throttling, further hindering performance scaling. These challenges, combined with the inherent difficulty of fully parallelizing tasks, prevent linear scaling even in isolated environments
this channel is slowly becoming one of my favourites
Your video really helped me understand the concepts of OS, especially threads, concurrency, and parallelism, which were tough to grasp before. It even helped me in my exams! As a student, learning about how an OS works and the basic principles of programming feels so much easier and more fun now.
Your channel is such an incredible source of knowledge for students like me who are eager to learn but often struggle with complex topics. Thank you for creating such amazing and free content! Please keep making more videos like this-you’re inspiring the next generation of developers. 🙌
I know it's a bit more complicated, but, man, if you ever need a rest from dropping all these gems to achieve a better quality/result/polish in a return, please take it. Because your videos will still be played years later (yes, they're so good) and we can wait a couple of weeks
His explanation is god tier❤
Please keep it up you are an amazing channel providing content which makes really hard to grasp and hard to research knowledge really accessible.
The visualisations help out a ton in understanding these concepts! Great job and thank you!
I'm new here but this channel seems one of the best of its kind, good luck for the winner
Amazing video! Multithreading, if done right, does wonders in improving performance. It works especially well when you are doing anything that is I/O bound, because I/O is very slow.
Had a scenario where I had to process an array of tasks; each task fetched data from 4 different sources, combined the results, and stored them elsewhere. Completing 100 tasks took about 3 minutes, whereas introducing threads cut this time down to 15 seconds.
While developing this, I also learned that you really need to be careful with multithreading. The tasks were assigned to a fixed-size thread pool so as to not spawn too many threads. Well, it turns out that if you do not handle exceptions correctly, the thread suspends. Now you have 1 less thread in the thread pool. Have it happen enough times and you have hardly 1-2 threads trying to do all the tasks. Debugging that issue was a roller coaster of emotions that I won't ever forget.
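For what it's worth, the defensive pattern looks roughly like this as a Python sketch (the function names are made up, and Python's ThreadPoolExecutor actually stores exceptions in the returned futures instead of killing the worker, so this only illustrates the "never let a task fail silently" habit):

import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)

def safe(task):
    # wrap a task so an exception is logged instead of disappearing silently
    def wrapper(*args):
        try:
            return task(*args)
        except Exception:
            logging.exception("task %s failed", task.__name__)
    return wrapper

def fetch_and_store(task_id):
    if task_id % 7 == 0:                    # pretend some sources occasionally fail
        raise RuntimeError("source unavailable")
    return f"task {task_id} done"

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(safe(fetch_and_store), range(1, 101)))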
this is awesome! I have an upcoming OS exam and seeing these concepts come to life through your animations has made grasping the underlying theory so much more intuitive! much love from italy🙏
Love the way the knowledge is organized and presented. This is what is called "wisdom".
Amazing content. What a wild time to be alive and have free access to this quality educational videos. Thank you very much
It's impressive how far we've gotten. At first, it was more like electronics, but now we're starting to dive into computing; I love it
Hi. I understand that you speak Spanish. Thanks for the Castilian Spanish audio option. I'm going to recommend your channel to my students so they can understand all these Operating Systems concepts visually.
Thank you very much.
Really good video! Honestly, though, Data vs Task parallelism, to me, starts jumping more into arbitrary distinctions (as it doesn't actually affect what the CPU does) but perhaps the next video will make it more clear. Looking forward to it!
I'd love to see you cover GPU architecture when you finish up the CPU and OS series!
I’d love to see you deep dive into how a http server leverages threads to handle concurrent requests! Maybe analyze popular web servers like Nginx or Apache 😮
Thanks man, you're always on top! Everyone just needs to be cautious when working with chunks on separate threads. Make sure not to mutate the data or array you use, otherwise the program becomes unpredictable.
Great video! The upper limit for parallel speedup can be described with Amdahl's law. The law states that the sequential part of a program (for instance, synchronization points) has a tremendous impact on the achievable parallel speedup. You can think of it literally like a bottleneck: even though the majority of the bottle is pretty wide, the speed at which the fluid flows out depends on the narrowest section.
We always love a CoreDumpped video. Thank you George😊
I love this channel. Everytime i learn something.
coredumped releases the video that I was excited about. Thanks
This channel is a gold mine! 🥳
Truly
Please also include hyper-threading and the way it works in future episodes. Does it really provide parallelism, or does it just improve concurrency? I've been reading around the internet, but I would love a CoreDumpped-level explanation. Thank you for all the great videos so far.
sorry babe, we can do this later,
now core dumped dropped the video, i have to go...✌️
Just discovered this channel and it's great! Much more digestible than my university courses.
Haven’t had this all explained so simply before. Great video!
This channel is gold. Makes me want to study Computer Engineering
One of the only channels where I immediately like the video because I already know i will love it!
An interesting note about threads on multicore systems is that although they generally improve performance, there are cases where performance can suffer compared to a single-threaded program, such as poor implementations where a thread has to wait for another to complete its task
How does hyperthreading work on systems that allow each core to have 2 threads?
Thank you for this. As an IT grad, this kind of deep dive into programming concepts was not discussed this thoroughly in our university. That's why, when I got to work with CS grads, I found out that I'm actually missing a lot of fundamentals in programming. 🥺
Two threads are talking to a barman. One yells at the other, "Stop interrupting me! You're messing with my thoughts!"
The barman sighs and mutters, "You two really need to get synchronized."
Man, this guy's videos are so intuitive and so well explained.
Beautifully animated, makes the subject x10 easier to digest!
As far as massive parallelism is concerned, the use of CUDA to distribute pure function applications across a GPU is about as good as it gets. It's not Rust, but Python's Numba project exposes a nice API for orchestrating these operations, even in non-pure contexts.
That's cool
How much we can get out of multiple cores depends on a lot of factors. Take the example from the video, checking for prime numbers: if we have a sorted data set, we can get a much better balance by distributing the data modulo-wise. With 4 cores: 1st number to core 1, 2nd to core 2, 3rd to core 3, 4th to core 4, then the 5th to core 1, 6th to core 2, 7th to core 3, 8th to core 4. This is well balanced, but it needs pre-computation or sorted data. I also want to point out that to find the min, each core can find the min of its own chunk of the distributed data and then we take the min of those results, which is also efficient.
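In code, that modulo-wise distribution is just striding over the array; a tiny Python sketch (striped_split is my own hypothetical helper, not from the video):

def striped_split(data, cores):
    # element i goes to core i % cores, so the big numbers at the end of a
    # sorted array get spread evenly instead of all landing on the last core
    return [data[c::cores] for c in range(cores)]

chunks = striped_split(list(range(1, 17)), 4)
# chunks[0] == [1, 5, 9, 13], chunks[1] == [2, 6, 10, 14], ...

With a sorted input, each core then gets a similar mix of small and large numbers.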
This deserves its own award category. 🏆🌟
Well, I think the efficiency of parallelism can be measured by first determining the resultant duration to complete a task using resultant parallel resistance formula, substituting t for R.
i.e. 1/t = 1/t1 + 1/t2 + ... + 1/tn
Where n = number of cores
t = resultant time of the combined cores
t1, t2, ..., tn = times taken for each core to complete their subprocesses.
e.g. the time taken for a 2-core CPU to complete a task in parallel would be 0.75s if core 1 takes 1s and core 2 takes 3s to complete their subtasks.
Efficiency = ratio of work output to work input
Work = Q/t * VT. This implies that work output by the processor is increased when t is decreased as Q/t = current
V = voltage in the socket
T = a given period, constant
V is constant in the processor.
'Tis my thought
Brilliant as usual. Looking forward for a new episode with a code example.
Interesting to see another viewpoint on thread architecture; I always thought it's better to have one thread per physical core, so threads don't overwrite each other's local data and cause cache misses. I hope you explore more in that area. Nice video!
The answer to the question asked in the video depends on how much of the process/task/program (sorry, I'm not clear on these terms) can run in parallel, and on how many cores you add.
Then, by applying Amdahl's law:
speedup = 1 / [(1 - P) + P/n]
P = fraction of the program that can be parallelized, e.g. 0.4 for 40%.
n = number of cores.
From this, we can find the speedup.
To calculate the performance increase with respect to time:
Tp = Ts / speedup
Tp is the execution time after parallelizing.
Ts is the sequential time on a single core, i.e. before parallelizing.
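Plugging in the 0.4 from your own example with n = 4 cores: speedup = 1 / [(1 - 0.4) + 0.4/4] = 1 / 0.7 ≈ 1.43, so if Ts were, say, 1000 ns, then Tp ≈ 1000 / 1.43 ≈ 700 ns.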
Something that we can also point out is how to leverage this in higher level languages like JS/TS.
Promise.all() allows you to execute promises in parallel (concurrently? Whatever) and can greatly improve performance if you need to do something like call multiple APIs in the same function. If those calls are independent, Promise.all() is an awesome tool.
Keep up the high quality content
"hi friends!, my name is george" is soo calmly said by TTS :))
Thank you george
Parallelism is impacted by many of the same challenges of single-threaded execution. That is, a single thread’s performance depends on the nature of the operation (I/O bound or CPU bound), the efficiency of the algorithm, the type of data being operated on (for example raw data is much faster to process than text data) etc. So a parallel operation will be as fast as its slowest sub-operation. And while more related to concurrency than parallelism, thread saturation results in performance degradation and can have an impact on the performance of parallel operations. So parallelism done right definitely boosts performance, but it will not necessarily be exactly proportional to the increased hardware capacity.
Great animation and excellent educational video. Thanks.
9:41 Here the workload is distributed by splitting the data into 4 even subsets. Instead it could be distributed using a queue, where each thread pops the next number from the queue (or a small chunk of numbers if each pop is costly). This would result in a more even utilisation of all cores.
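Something like this rough Python sketch (pulling one number at a time for clarity; a real version would pop small batches and guard the shared results more carefully):

import queue, threading

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def worker(jobs, results):
    while True:
        try:
            n = jobs.get_nowait()       # pop the next number; no fixed split per thread
        except queue.Empty:
            return                      # queue drained, this worker is done
        if is_prime(n):
            results.append(n)           # list.append is atomic enough under CPython's GIL

jobs = queue.Queue()
for n in range(2, 10_000):
    jobs.put(n)

results = []
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(results), "primes found")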
This channel should have a lot more stars. Having a new video is like Christmas before the actual date.
Another amazing video, like always. For me this video is the Christmas gift :)
This is better than my Fundamentals of Computer Engineering lectures
This video made me incredibly excited for your next video.
The video is very cool, I really love the visuals and the animations to explain the concepts, what do you use or how do you make the animations?
Interesting note about data parallelism: GPUs are really good at it, but they kinda suck at task parallelism, which is where CPUs really shine. A lot of people seem to think GPUs are just beefy processors, but they're really quite similar to CPUs, just specialized for a different use case. GPUs being really good at data parallelism is why they're so good for graphics rendering, because although each pixel is correlated to the pixels around it by the shared geometry, the color result of each pixel has no influence on the pixels around it.
Your videos are so informative and valuable, thank you🙏
Yo this explanation is perfect mate
Finally a channel explaining real and useful stuff
I bet nobody upvotes such a long and uninteresting comment :(
Increasing the number of threads will lead to diminishing returns eventually, because the OS will not move threads to different cores too frequently, so some cores might be busier than others.
So what happens?
The task might take about 500ns on 3 threads, BUT in the 4-thread scenario, because the 4th thread lands on the busy core, it takes 700ns to complete, so the total process time is 700ns despite the fact that the other 3 threads last 300ns on average.
It's not clear to me how we can optimize for this, for example by making the data each thread has in hand dynamic, so they can help each other once one of them is done! IDK.
Maybe we have some sort of kernel communication that helps us run better in these scenarios.
I wanted to thank you, these videos mean a lot to me, I cannot pay for technical courses in these topics, neither do I know where to find free resources for learning such topics on the internet, having you as one of channels that hint each one of these topics for us in sub 15-min videos is really helpful ❤❤
I'm early!! Bro I love your videos so much. I'm subscribed and I like all of them. Please can you do a video on containers like Docker? An in-depth explanation please, thank you.
The 4 ops you gave actually allow for a very elegant data split, since they are all reductions.
For instance, we can compute the max of each of the 4 slices, then the maximum of those maxima is the true max.
When a core completes its task (the fragment of data allocated to it), could it take half of the unprocessed data from the slowest core and update the finish index on that core? Data could also be allocated differently based on how powerful each core is.
If the same task is implemented the same way on the other cores, then it shouldn't have taken 700ns on just one core.
Could it be that core 1 is running the main thread?
Did the main thread wait for cores 2, 3 and 4 to finish, adding up to the total completion time? (assuming 700ns is the total completion time)
Or is the OS simply using core 1 for other things as well? (assuming we are just working on a 4-core processor)
I would love to know the answer in the next video. This is interesting.
09:50
Based on Paul E. McKenney's book "Is Parallel Programming Hard, And, If So, What Can You Do About It?", it depends on your data and the processing model you want to apply. Some data sets are highly partitionable, and parallel processing could significantly improve performance.
Great explanation!! Please create video for kernel internal
"The system can assign each thread to separate cores."
That info is worth gold
One measure of performance gain in parallelism is to compare against the same task done linearly on a single core.
For an analogy, picture a non-automated McDonald's fulfilling multiple orders that were submitted at once (each staff member is one core). From start to finish, a front cashier prepares the order queue, gives instructions, and collates and serves the output from the drinks, fries, and burger stations: communication and synchronization latency is introduced. These operations are non-parallelizable and assumed to be non-optimised. So the duration of the whole process, regardless of how many staff are manning those stations and executing parallel work, is at least as long as the non-parallelizable part. More doesn't always mean better, only up to a certain extent.
A precise explanation of the laws (Amdahl's law, the Universal Scalability Law, Gustafson's law) can be found in the works of Gene Amdahl, Neil J. Gunther, and John Gustafson respectively.
Things like SIMD (Single Instruction Multiple Data) are also very important (here you had something that was a mix of that and Multiple Instructions Multiple Data - albeit very simplified).
I do love your videos, I apologize I didn't come sooner - life got in the way.
I wonder if you will cover cache (you did do it from "how it's made" point of view) or rather quite complex topics like cache associativity, coherency etc.
Also branch predictors - Zen 5 finally has the branch predictor that was first theorized in Academia in 1996. It's great to finally see it being made.
And of course the law of how much parallelism can help an application is called Amdahl's law. Take care!
I just got a Ryzen 9 9900X. I didn't know about branch prediction. That's an interesting fact.
@@CoreDumpped Look for the paper "Multiple-Block Ahead Branch Predictors" by: André Seznec, Stéphan Jourdan, Pascal Sainrat, Pierre Michaud.
And "Zen 5 Teardown + More" by Alexander Yee (person who makes ycruncher). And pre-zen 5 launch there was a video on techtechpotato channel. It was their Zen 5 Deep Dive: The Tech Poutine - you can see George talking about it starting around 53 minutes in.
That really got me into reading tons of papers about that stuff.
@@CoreDumpped So - my reply got removed.... Probably by RUclips.
Just in case this one remains, try reading: A. Seznec "Multiple-Block Ahead Branch Predictors" and "Zen5's AVX512 Teardown + More..." by Alexander J. Yee.
1:07 Doesn't the I/O operation happen through the CPU > OS > process?
How both CPU and RAM deal with I/O is a topic for another video, maybe several videos on storage, HIDs, display, and network separately.
@DariaTimurovich ok
No, there is a separate hardware component called the direct memory access (DMA) controller which controls the data transfer (I/O). The CPU just starts the DMA and gives it the details of the transfer, and the DMA does the transfer while the CPU works on other tasks.
@@aadhirajan You are right, I just looked it up after reading your comment. Thanks 🙏
Best explanation I've ever seen 😮
Man that last example was PERFECT!
As always excellent material.
Loved the content. Please cover Amdahl's law, since we are talking about parallelism and multi-core systems
no cap, i've been a subscriber from his 1st video and it was worth it
It makes complete sense to use parallelism for min, max, and find (independently).
Hi, although I'm just a first-year student, I'm very interested in Computer Architecture and Computational Complexity. I just found your channel-very cool, keep it up, hope you make more vids about computers! +1 sub
Very well explained! :D Thanks!
Please make one on asynchronous programming.
Not sure if what I'm saying is complete nonsense, but even though we split the task of finding prime numbers in the array across the 4 cores, the total amount of work is still O(n · time to check if arr[i] is prime), whether we split it across 4 cores or 1. So from a pure math perspective, no less work is being done. But looking at the hardware side, if the whole process were a single thread, it would have to wait for the other processes to stop their execution, while if we divide the process into 4 threads, all of them execute in parallel and the program should be more efficient: the CPU now switches between 4 threads that all work toward one task, whereas with 1 thread it only progresses on that task during that one thread's turn before continuing with the others, so more work gets done with 4 threads than with 1. But as you showed in the example, the difference between splitting the process into 4 threads and keeping it as a single thread isn't as big as you might expect.