Это видео недоступно.

Сожалеем об этом.

How to write efficient and fair multi-threaded programs?

Arpit Bhayani

Просмотров 27 тыс.

Добавить в
- Мой плейлист
- Посмотреть позже
Поделиться

HTML-код

Размер видео:

Показать панель управления

Автовоспроизведение

Автоповтор

Опубликовано: 18 мар 2023
System Design for SDE-2 and above: arpitbhayani.m...
System Design for Beginners: arpitbhayani.m...
Redis Internals: arpitbhayani.m...
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafte...
In the video, I discussed the importance of writing good multi-threaded programs for optimal performance. I emphasized the need for correctness, fairness, and optimality in multi-threaded code. By using locking mechanisms and atomic instructions, we can ensure correctness by avoiding race conditions. Additionally, ensuring fairness among threads is crucial to optimize performance. Through examples of counting prime numbers sequentially versus using multiple threads, I demonstrated how fairness can significantly enhance program efficiency. By distributing work evenly among threads, we can achieve faster and more efficient multi-threaded programs.
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.m...
Bookshelf: arpitbhayani.m...
Papershelf: arpitbhayani.m...
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack...
Thank you for watching and supporting! it means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in
love with engineering. If you resonate with this then follow along, I always keep it no-fluff.

Комментарии • 73

@sourabhsingh5286 Год назад ⁺²⁸
Thanks for bringing back the joy of learning about computer science. It had become too much about placements/packages etc…
💪🏼🙏
@KingAstraVerse 3 дня назад
Ohh Man huge respect for you ........ I wish I could have found your playlist last year I would have cleared juspay round.
@yogeshedekar6078 7 месяцев назад ⁺³
This is so darn interesting honestly and moreover I was coding side by side in java as you were explaining. I have been running away from threads right from my college days and it seems like first time some one has explained the intricacies of multi-threading to me after speding 15 years in IT. The reveletion seems fantastic. I am running JDK 15 on a 64 bit windows 10 :) and it took me 38 seconds with fair parallelism with 10 threads. Thanks a lot for this wonderful series.
@ShrihariShetty11 Год назад ⁺⁷
Really Loved your Explanation❤. Fairness among threads is something I never thought of, you just unlock areas in brain🧠 to get deeper understanding /perspective on problems and patterns. I went ahead and implemented the above in Java, made a lot of mistakes but learned a lot in that process. I tried a slightly different approach 🤔 where I gave an additional responsibility to individual threads. The responsibility was that the thread should also return the computed value(count of prime numbers) for the batch of numbers that was assigned to it. Lastly I just sum up the individual values to get the final answer. The program took 4 seconds to compute all the prime numbers between 1 and 100 million. Number of threads used were 5.
Environment: Mac M1 8 core.
It will be really helpful if you also share your thoughts(best practices/scope) on using Intrinsics (SIMD instructions) for solving a similar problem.
@AsliEngineering Год назад ⁺⁷
I am so happy reading this. thanks a ton :)
5 seconds seems too less :) may be it is M1's magic :)
also, given you have 8 cores, you should have at least 8 threads.
also, by returning the count from each thread, you are reducing the overhead of locking and/or atomic update, which is a good decision.
@shaurya3284 Год назад
@@AsliEngineering i too did all the 3 variations in java, while the first two(single thread - 62 seconds, batched multithreading 11 seconds on a macbook pro m1 (2022)) worked exactly how i'd expect it to, while trying out the fairness logic, it is taking even more time because each time the checkPrime(num) is called, in order to ensure thread safety we need to add this logic under synchronized in java else we might get a wrong ans, if i am wrong please do correct me and i believe this is why it is also taking more time (253 seconds)
@PiyushBajaj05 Год назад
@shaurya3284 Is it possible for your Java solution for the same, want to validate on this?
@harshitbangar123 Год назад ⁺¹
Another strategy can be for n threads picking nth number. Each thread (or worker) maintain its own version of counter (thread local) and do the aggregation in the end.
@shishirchaurasiya7374 Год назад ⁺¹
Its really really amazing the last 3rd explanation where you explained keeping the threads busy all the time utilizing the CPU cycles as much as we can was the pure WOAH !! wala moment...
Thanks a lot
@saxena11ashish 7 месяцев назад
And this is why I love engineering.
Perfect explaination for something that is valuable but not given enough thought 💯
@Aryan0 Год назад ⁺¹
This video is super helpful. I'm a CS sophomore but I have also learned a lot from your journey and also the importance of getting the basics right in computer science field.
@akshaymuthal3693 Год назад ⁺¹
Watching your videos is like an eye opening experience in the world of CS!
@AsliEngineering Год назад
Exactly my motive. I want people to fall in love with computer science and become better engineers.
People have become really shallow and I want to change it.
@biswaprakashmishra398 Год назад
Had to watch the 3rd approach 2 to 3 times to understand how the atomic increment is happening and each goroutine is getting a number to determine if it is prime or not. Absolutely brilliant. Excited for this series.
@anishdabhane9132 Месяц назад
Hi Aprit sir, i discovered this video through ur recent tweet on "NON-ATOMIC NATURE OF COUNT++" , the video was really fascinating . Recently I got exposed to mutex and semaphore in college and assembly language was taught in 2nd year itself .
Through this video i ended up diving deep into how cpu works (cores and threads) and after spending 2-3 days finished C++ implemention of the same.
Thanks a ton for the awesome content! I'm excited to finish the CONCURRENCY playlist and write some blogs about it too.
@AsliEngineering Месяц назад ⁺¹
Keep digging deep, because that's where the magic happens :) Glad to see you are chasing the right things in the early stages.
@anishdabhane9132 Месяц назад
@@AsliEngineering 💯
@nandagopal.05 7 месяцев назад ⁺¹
This is really helpful. A small change in approach makes a huge difference. On a similar note, i was trying to code where i will be able to use near 100% of my cpu capacity. So basically if i have some work to do that i can distribute among processes and threads, how i can achieve this in a minimum amount of time by utilising 100% of my computing power. I am not sure whether my thinking is correct but would love to get some clarification to my thinking here
@abhishekmehandiratta4241 Год назад ⁺¹
We could also have a channel with a buffer size of 100million and push every number to that channel. This could serve as the input channel for the goroutines and each of those goroutines could iterate over values of that channel and process the number. The main goroutine could close the channel immediately after pushing all the 100mil values. This way we won't need a global shared variable to keep track of the current number. This approach is very similar to a queue based approach with Kafka/RabbitMQ. Great video btw!
@nikhilneela 10 месяцев назад
wow! brilliant idea. Maintaining queue might be memory overhead for such small application, but yes can be a great idea for something big. this is a lock free solution
@rahulsangvikar7973 2 месяца назад
Absolutely amazing video, however, I'd like to point out a slightly different way of writing a lock-free fair implementation. Say you're using n threads. Start ith thread from i, and increment the counter by n each time. This way you will always calculate isPrime for a unique number in all the threads and the numbers will be fairly distributed among all of them.
@mayanknegi2471 7 месяцев назад ⁺¹
Every thread is reading value from shared variable x in 3rd approach. Why aren't there duplicate counts? What if multiple threads read same x value and increase count multiple times for same prime number?
@AsliEngineering 7 месяцев назад ⁺¹
Because no two threads are picking up the same number, ever.
@subhamtripathi1997 10 месяцев назад ⁺¹
this is just another level for attention to details!
@akshayrr3061 Год назад ⁺¹
Amazing content! Waiting for more videos on concurrency
@faizm4765 Год назад ⁺²
Just like always beautifully explained Arpit! Learnt something new and definitely going to marinate and play around with it more.
@adwaithks Год назад
😊
@adwaithks Год назад
😊
@krishnarajsivaraj Год назад ⁺¹
the sum of the execution time of all threads in the 3rd approach is more than that of the 2nd approach. the increase in execution time is due to the atomic increment. We can improve the execution time by reducing the atomic increment as much as possible.
You can see an improved version here - gist.github.com/kriznaraj/466291a46621507a09e9c3988a59147c
~15-20% improvement is seen in the overall execution time
@pratikkedar4476 2 месяца назад
Yeah its true,
I wrote c++ routine to validate this at my end, second approach took 28 sec for 100M numbers where i see variable execution time for each thread, with approach 3 each thread is taking same time but overall time has increased, i believe this is due to atomic variable access!
@arpit any comments ??
@vibhuyadav6841 Год назад ⁺¹
Thank you for Sharing this. This is such a beautiful explanation. I have learnt a lot from this it was really helpful. I also went ahead and implemented the same in Java. Mind blowing Results ❤.
Awesome.
@PiyushBajaj05 Год назад
@vibhuyadav6841 Is it possible for your Java solution for the same, want to validate on this?
@shaktijain8560 4 месяца назад
Hi Arpit, I really enjoyed your amazing video and appreciated the clear explanation of concepts. However, I couldn’t help but notice the relatively low view count. A small suggestion from my end would be to include code in the video, particularly in Java or C++, as I believe it would help viewers relate more.
@1711nitish Год назад ⁺⁸
Another issue with spawning unnecessary threads is this could lead to CPU spikes in large scale applications .
@MrSnippyv Год назад ⁺¹
The idea of using atomic integer to distribute the integers to the threads in the fair implementation was pretty neat 😅, I had the idea of using a wait-free, lock-free queue to distribute the numbers (I was not seeing them as just numbers, I was seeing them as "inputs", so using a queue to distribute them made sense to me when I thought about it).
I was kinda surprised when I saw the difference in the execution time of fair and the unfair implementations, it was not as substantial as I imagined, may be it will be more substantial if the limit increases.
Also, would it make a difference if the checkPrime() didn't use a shared atomic integer, rather each thread will gets its own int variable to increment, and you add up the values of all those integers at the end? I should take a look at the assembly of atomic add.
@liquidmetal718 Год назад ⁺¹
Aren't you using go routines instead of threads ? Go scheduler has this m:n mapping between threads and go routines right ? Also, go-routines get scheduled using that work stealing algorithm on top of those threads. We didn't modify the thread count & only changed the go routine count. Maybe we can make it better by modifying both. In Java or Rust we use threads but not co-routines.
Do let me know if i'm wrong.
@harishreddy3310 Год назад
I think it would be faster with input channel which just inputs each number and a list of workers taking the numbers from the channel and incrementing their local counter and at the end outputting their count to another channel, you won't need to do atomic operations so frequently.
@ayushanand9013 7 месяцев назад
Incredible, the 3rd approach is a standard use of mutexes. Although, as far as the problem is concerned (and limited to this problem only), can we do something in line with natural distribution of primes over the number line. This wikipedia article gives out an approximate number of prime numbers within say number N.
Given, if we know the number of Worker Threads, and the maximum limit (which is the case in the question), we can distribute the number range with a logarithmic decay. Although that might be too much maths for a problem from a CS perspective. 😅
Ref: en.wikipedia.org/wiki/Prime_number_theorem
@AsliEngineering 7 месяцев назад
Yes. Totally.
@MurtazaLokhandwala 7 месяцев назад
Very well explained! This is really helpful. Thanks a lot!
@eren-qu9uu 7 месяцев назад
average of second approach is 45 second while in 3rd approach average is 51 so i think there is overhead of 5-6 seconds to pick next number using go automic for 100 million number.
@piyushtyagi2131 Год назад
Brilliant Brilliant explanation ❤️ thanks for sharing.
@adsharma916 Год назад
Would be nice to see how this scales with concurrency factor. There should be an upper bound. So what happens when we have 10, vs 20 vs 50
@miralshah4761 11 месяцев назад
Fare way is to use queue the numbers and each thread dequeue each number and process it , No need to increment number.
@parthtrehan8387 Месяц назад
But the atomic section still remains inside the parallel region. I think pulling it out would give us more benefit here.
@nithinsastrytellapuri291 Год назад
thanks for making this free
@tedcse 7 месяцев назад
Beautifully explained!
@srirangamikshwak9471 Год назад
How to decide on number of threads that we can use in any scenario? Does it always depend on number of cores? Any reference would be great.
@shaifibansal5022 Год назад
Beautifully explained video but I have few doubts : can we know which method would be better for a function ? with threads or without threads, without running it and also why are threads always faster in this case as there is context switching between threads which also takes time but if we don't use threads then won't that thread switching time should get saved? Also there must be a point after which threads might become costly like in this case if we create threads equal to the number of digits for which we want to check for prime or not then it won't be efficient. So how can we find using how many threads will be best ? Really hoping for a response
@jayaramans7398 Год назад
Well curated content. Can you please explain how this use resources in machine cpu
@sankeerthnalam512 Год назад
One important thing to note here is that the example taken in the video is 100% CPU intensive(no I/O required). In these cases the performance of the application will depend on the no of cores in the machine.If the machine only has 1 core then no mater whatever approach we follow we will end up with approximately same time.
@rahulsangvikar7973 2 месяца назад
The unfair threading code will always run slower than the fair one, irrespective of the number of cores being used. The flaw is in the software itself, not the hardware.
@syedfarhan7105 Год назад
If we only have one CPU then will multi threaded approach gives any benefit? I think this will only work in a multi core settings. Is my understanding correct?
@sachinjindal4921 Год назад ⁺²
Awesome
@lakshminarayanannandakumar1286 10 месяцев назад
Hi Arpit, I was trying out your code with minor tweak as below
From
```
x := atomic.AddUint64(&CURRENT_NUM, 1)
// Breaking condition
if x > N {
break
}
```
To
```
atomic.AddUint64(&CURRENT_NUM, 1)
// Breaking condition
if CURRENT_NUM > N {
break
}
````
However, the program yields inconsistent results. Could you please explain what is causing such a big difference?
@7guitarlover Год назад
wouldn't a number be prime if its not divisible by all the prime numbers that came before it ? ( barring the exception of 2 ) wont that be a better approach to find prime numbers instead of iterating over to sqrt( number) ?
@AsliEngineering Год назад
yes but that's not the point.
@chapristuff8460 Год назад ⁺¹
Hi arpit, interesting video. Would love to know how you would have sized your threadpool for a particular problem.
@AsliEngineering Год назад ⁺¹
Size of threadpool is always a factor of outbound work ( writing to network, disk, or cpu computation) and it depends on the usecase at hand.
I typically run a small test to tune the threadpool by having a very aggressive monitoring setup. Start small, and then increase until my IO maxes out.
@arihannnt Месяц назад
Interesting! Thanks for sharing this, I was doing it wrong :|
@Pro3512 Год назад
Awesome @Arpit.Just one thing can we not use lock for incrementing prime numbers instead of atomic for mutilthreading ?
@sachinmukherjee29 Год назад ⁺¹
Very good explanation. Is it possible to share this code repository?
@AsliEngineering Год назад ⁺¹
not sure how to share them with members-only. let me figure out the logistics. meanwhile I would highly recommend you to write the same code and make mistakes. your understanding will multi-fold.
@DharanAditya Год назад
This was insightful video. A small great. In the 3rd approach, I see you are breaking the loop when x > MAX_INT. But since we are checking for prime number till 100M, should we breaking when x > 100M. Was it just for demonstration purpose to show the execution speed ? Or there is something i'm missing out
@AsliEngineering Год назад
Max int is set to 100m it is not integer max. That's that global variable set.
@photon628 Год назад
Nice, write the examples in Go 👍
@sraynitjsr 11 месяцев назад
Awesome 🔥
@NaveenSingh-ri4mu 11 месяцев назад
Thanks!
@krsingh.shubham 3 месяца назад
Amazing
@carlosluque4753 10 месяцев назад
Wooow that is a seriously good explanation!
@sarthakjain4571 8 месяцев назад
Awesome

Следующие

Автовоспроизведение

Why most TCP servers are multi threaded and how to build one from scratch