If you are preparing for a system design interview, try get.interviewready.io.
All the best!
"You can think of your database as a data structure" Man... I wish I've had someone explain things so comprehensive when I was in school
I'm reading 'Designing data intensive applications' and needed a reminder of what exactly LSM trees were, this video is amazingly helpful! Everything I've learned so far is clearer!
Same here. I wish there was a channel that summarizes DDIA concepts chapter-by-chapter, so we can watch the videos after reading the text to further solidify our understanding.
I was going through the "Designing Data-Intensive Applications" book and it all read like plain text. I came here, watched the video, and now the same text forms images in my mind. Thanks man, you have created a perennial resource for system design topics.
same here :)
Same here 😅
*Explaining Bloom filter* - example word "cat", let's think of another word with c... "corona" 🤣 I had to check the date of the video and indeed it's post the COVID-19 outbreak. Keep up the good work! Amazing content!
even i checked the date lol :P
Good one. This video summarizes most of the 3rd chapter in Designing Data Intensive Applications.
Thanks :)
which book?
You look very happy in this video, Gaurav! Nice to see this as I'm starting interviews again -- you helped me out last round, glad to have you onboard this time too :)
All the best!
I just finished the Storage and Retrieval chapter of Designing Data-Intensive Applications and this video came out. It adds more insight and flavor to our understanding.
At the end of the video I immediately related it to the internal architecture of Cassandra/Elasticsearch (also learnt from one of your videos 😃). We write data sequentially and periodically dump it into sorted string tables. Then, to reduce the number of operations and get better read times, we occasionally merge those sorted string tables. So the combination of a linked list and sorted arrays helps us scale our database by providing better write/read performance.
Great Content. Thanks a lot Gaurav!
Thank you!
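A minimal Python sketch of the write path described in the comment above (buffer writes in a memtable, flush to immutable sorted runs, search newest-first on reads). The class name and flush threshold are invented for illustration; this is not Cassandra's actual code.

```python
import bisect

MEMTABLE_LIMIT = 4  # hypothetical flush threshold, tiny so the example is visible

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # in-memory writes, cheap to update
        self.sstables = []   # each flush appends an immutable sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value                 # fast write path
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # dump the memtable as a sorted string table (SSTable)
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):      # then newest SSTable first
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```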
"Think of a Database as a DataStructure" is like the keyword for us to imagine and understand things better...thanks a lot dude !!
Man this guy really knows how to teach. Keep it up Gaurav!
I think the explanation of compaction is missing a major point here.
Compaction means throwing away duplicate keys in the log, since there can be multiple updates for a single key. With compaction we keep only the most recent one.
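A sketch of compaction in that sense, assuming the runs are ordered oldest to newest so later values win (real engines also handle deletes via tombstones, which this skips):

```python
def compact(sstables):
    """Merge sorted (key, value) runs, keeping only the newest value per key."""
    latest = {}
    for table in sstables:          # later (newer) tables overwrite earlier ones
        for key, value in table:
            latest[key] = value
    return sorted(latest.items())   # one compacted, sorted run

old = [("cat", 1), ("dog", 1)]
new = [("cat", 2), ("emu", 3)]
print(compact([old, new]))          # [('cat', 2), ('dog', 1), ('emu', 3)]
```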
I am so happy to watch this playlist.
Most of the time I read on platforms like Quora or Reddit that developers never use DSA knowledge in their work, and that it's all just to pass the interview test.
But this series proves that wrong.
If someone is working on a database, or on services that will be used as a black box by others, then knowledge of DSA is very useful.
Please correct me if I am wrong, and add some more examples if I am right.
Great channel, thanks a lot. I personally would love to see some real-world examples of this (open-source code etc.), maybe as posts/short vids. I know it's very time-consuming stuff, but I bet it's very interesting to see and understand how real-world frameworks/projects use fundamental knowledge under the hood.
You can go through Cassandra's docs: docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
Wow... never thought about database optimization from the write and read perspective; I always thought about query optimization.
Thanks a lot for providing such useful information. Thanks again, and keep doing this great work.
Brother, you are a guru. Best explanation, with real-world examples.
Zooming into your eye was goooood, and so was the "once more". I think yours and Clement's videos are like fun with learning. No jokes on card tricks though.
Finally someone explained LSM tree clearly
OMG this guy gives a much much much better explanation than my professor.
If you're here because you heard LSTM at the end and were like "Wait, what ?", He meant LSMT ( Log Structured Merge Trees ) :)
Yes, I messed up with neural networks :P
No time wasted; it's really good for those who want to know all this stuff. Thank you brother, it's so helpful for me to expand my knowledge.
We can also divide the searches into ranges of chunks. The first chunk ranges from, say, 1 to 30, the second chunk from 31 to 60, and so on. If we want to know whether the number 45 is present in the DB, we do a binary search on the second chunk only.
Sharding ?
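That range-per-chunk idea from the comment above can be sketched like this; the boundaries and data are made up:

```python
import bisect

bounds = [1, 31, 61, 91]            # chunk i covers keys in [bounds[i], bounds[i+1])
chunks = [
    [3, 7, 12, 25],                 # keys 1..30
    [33, 45, 58],                   # keys 31..60
    [61, 77, 89],                   # keys 61..90
]

def contains(key):
    i = bisect.bisect_right(bounds, key) - 1   # which range does the key fall into?
    if i < 0 or i >= len(chunks):
        return False
    chunk = chunks[i]
    j = bisect.bisect_left(chunk, key)         # binary search inside that one chunk
    return j < len(chunk) and chunk[j] == key

print(contains(45))   # True: only the second chunk is searched
```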
Hi Gaurav, thanks for the video. With a B+ tree we have read and write complexity of O(log n), whereas you claim that with this approach writes can be O(1), or at least less than O(log n). My concern is that this will vary a lot depending on how many sorted chunks you have, how many merges you do, the size of the sorted chunks, etc. On top of that there is a lot of processing and complexity in implementing/maintaining/monitoring these sorted chunks, the bloom filters, and the corner cases. There are cases where this approach beats B+ trees, but there are also cases where B+ trees stand out. Take Facebook as an example, with billions of records to process. How many chunks would you make, and how many would you merge and apply bloom filters to? You cannot go with 6 chunks as in your example; you might need 60k to 6M chunks, and then a first-level search across those chunks, a second-level search within a chunk, the bloom filter, and so on. And how would you handle record deletion, and how many things (chunks, bloom filters, maybe more) would you need to change?
I was just looking for this kind of response.
B+ trees tend to offer predictable performance across read-heavy workloads, and their balancing properties ensure O(log n) read and write times even at scale.
While bloom filters and chunk-based approaches shine in write-heavy workloads (like log-structured merges in LSM-trees), they may struggle when reads need to scale efficiently across massive datasets, especially with frequent deletions and updates.
There are very few sights more beautiful than the thumbnail of Gaurav Sen in front of a white board
Would like a video on internals of in-memory databases that also offer persistence.
Hi Gaurav - this video makes a lot of sense after solving "merge k sorted lists".
I have a good understanding of how MySQL does all of this, and the only real advantage to this queuing technique is multi-row inserts. Inserting multiple rows at once is faster than doing the same number of individual row inserts. When you insert data into the (InnoDB) storage engine, it stores the data records directly in the primary key BTree, so the sorting is done upon insert. If you want to optimize bulk inserts, simply create a queue of records and periodically dump that queue to the database. The types of workloads that benefit from queued inserts would primarily be ETL data transfers.
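A hedged sketch of that queue-and-dump idea. `conn` is any DB-API 2.0 connection; the table, schema, batch size, and `?` placeholder style are hypothetical and depend on the driver actually used.

```python
class BufferedWriter:
    """Queue rows in memory and flush them as one multi-row insert."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn            # any DB-API 2.0 connection (placeholder style varies)
        self.batch_size = batch_size
        self.buffer = []

    def insert(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        cur = self.conn.cursor()
        cur.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)",  # hypothetical table
            self.buffer,
        )
        self.conn.commit()          # one commit for the whole batch
        self.buffer = []
```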
You are the best. I’ve learned so much from your videos! Thanks so much for sharing!
Thank you Vinnie!
Great video. You are incredible in presenting stuff. Keep up the good work!
You did not consider duplicates caused by updating records. I found your video helpful while reading the Designing Data-Intensive Applications book - I'm exactly at this topic. Thanks for explaining it briefly yet covering the most important parts.
Thanks! I have talked a little more about this in the "Introduction to NoSQL" video. It's mentioned with the SST tables, I think.
I was asked about bloom filters recently in an interview!
Binge watched 16 out of 23 videos of this series!
2:17 It'll actually be slower in C/C++ (etc.) if you do a dynamic allocation for every element at runtime. You'll also lose out massively on caching performance. You should use a linked list of fixed-size arrays (chunks): O(1) insertion plus cache optimization.
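A rough Python sketch of that "linked list of fixed-size chunks" (an unrolled linked list). The cache benefit the comment describes really shows up in C/C++ with contiguous arrays; this only shows the structure.

```python
CHUNK_SIZE = 64  # arbitrary; in C/C++ you'd size this to cache lines or pages

class Chunk:
    def __init__(self):
        self.items = []      # at most CHUNK_SIZE elements, stored together
        self.next = None

class UnrolledList:
    def __init__(self):
        self.head = self.tail = Chunk()

    def append(self, value):
        # O(1): a new chunk is allocated only once every CHUNK_SIZE appends
        if len(self.tail.items) == CHUNK_SIZE:
            new_chunk = Chunk()
            self.tail.next = new_chunk
            self.tail = new_chunk
        self.tail.items.append(value)

    def __iter__(self):
        node = self.head
        while node:
            yield from node.items
            node = node.next
```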
We need more videos this time. If you keep uploading every day, that might help.
It would be good if you can make a video on "which storage engine to use and when", technically telling us when to use traditional sql vs nosql. I know there is no straightforward answer to it but just to know your view.
Correct me if I'm wrong, but this is for analytics DBs. For transactional DBs, which need to change/delete values, a sorted array would slow down writes; transactional DBs still need to use B+ trees (at least the SQL variants). A write-heavy database doesn't automatically mean an analytics DB. Imagine a scenario where you're storing a background job's state in the DB and the number of jobs scales up to 1 billion for 1 billion users every 2 hours: you have a billion records being ingested, but at the same time they need to be updatable.
@Gaurav amazing video, I wish I could hit like a million times. Keep up the amazing content.
watching your videos my brain explodes - in a good way. Thanks for your videos!
The problem with Gaurav Sen's videos is that if one is 17:22 long, you need to watch it for exactly 17:22 to understand the concept. The moment you skip forward 5 seconds anywhere, you need to rewind 20 seconds.
WOW this is a great video, thanks man!!!! You were a great help.
Cheers!
I actually want to write my own database engine. This can help.
Thank you for the video and for explaining it in such a simple way!
Really nice video, few questions
1. What happens if the server goes down before we get a chance to flush the data from memory to the DB?
2. If we already know the ideal size of the segment to read the data efficiently, why not start with having the memory of the same size instead of opting in to merge the sorted segments of a smaller size?
Databases keep those entries in logs called redo logs. In this situation, you apply the redo logs to recover that data.
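A minimal sketch of that write-ahead/redo-log idea: append to the log before touching the memtable, and replay the log on restart. The file format here is invented for illustration.

```python
import json
import os

LOG_PATH = "wal.log"  # hypothetical log file

def write(memtable, key, value, log_path=LOG_PATH):
    # append to the log first, then apply the write in memory
    with open(log_path, "a") as log:
        log.write(json.dumps({"k": key, "v": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())   # make sure the entry survives a crash
    memtable[key] = value

def recover(log_path=LOG_PATH):
    # on restart, rebuild the memtable by replaying the log in order
    memtable = {}
    if os.path.exists(log_path):
        with open(log_path) as log:
            for line in log:
                entry = json.loads(line)
                memtable[entry["k"]] = entry["v"]
    return memtable
```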
Greetings from Peru, keep up the good work Gaurav Sen
Thank you 😁
8:03 It's still O(n log n) imo, because in the worst case you'll go over every single chunk (n of them) and binary search each one (log n), since you can't really separate them into sequentially sorted chunks (e.g. a first chunk holding sorted values from 1 to 50, a second chunk from 51 to 100, ...). That would require sorting every chunk in O(n log n) time, because we're getting random values in every query block.
"Revisiting is always great". I revisit this video after encountering real-world implementation from "RocksDB Internals".
An example you could maybe give here for a bloom-filter-style check: consider the 0th and nth elements of the sorted array. If I need to search for a number and it lies within the range of the smallest and largest elements, there is a chance it could be there (it may not be), but if the number is smaller than a[0] (the smallest number in the sorted array) or greater than a[n] (the largest), there is no way it could be there. Hope that helps.
That's an interesting way to think of it!
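Strictly speaking that min/max test is key-range pruning rather than a bloom filter (a bloom filter can also catch gaps inside the range), but the effect is similar: many runs are skipped without a binary search. A small sketch:

```python
import bisect

def might_contain(run, key):
    # cheap pre-check against the smallest and largest keys only
    return bool(run) and run[0] <= key <= run[-1]

def lookup(runs, key):
    for run in runs:
        if not might_contain(run, key):
            continue                      # skip this run entirely
        i = bisect.bisect_left(run, key)
        if i < len(run) and run[i] == key:
            return True
    return False

runs = [[1, 4, 9], [15, 22, 40], [41, 77, 90]]
print(lookup(runs, 22))    # True: only the second run is binary searched
print(lookup(runs, 100))   # False: every run is rejected by the range check
```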
Thank you Gaurav! You are an amazing teacher! I super super like ur videos!
Thank you!
Great video class! Congrats buddy! ✌️
I thought you got this merge idea from your TIM SORT video. Nice explanation :)
Thank you Gaurav for ur amazing work!
Awesome as always! Keep uploading :)
Great content! Thanks a lot, Gaurav. Keep it up!
Thank you 😁
Nice editing with ONCE MORE
Hi Gaurav,
Pardon me if I understood something wrong but:
1. Around 7:45, if we have 10,000 sorted records (n) in the SST and we are trying to insert 6 new records (m), why do we need to sort all (10,000 + 6) records? Instead, can't we do a binary search on the 10,000 records to find the insertion position (log n) for each of the 6 records?
Then the time complexity would be O(m * log n) = 6 * log(10,000), which is tiny.
2. Also, if we go with the approach of merging sorted arrays (two arrays of 6 elements each merging into one array of 12 elements), why do we need to traverse all the arrays to find a record x? Since the arrays are already sorted, we can easily compare the starting and ending elements of each array with x. If the starting element is more than x, or the ending element is less than x, we don't need to traverse that array and can check another array where a[0] <= x <= a[n].
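On point 1: a sketch of finding the insertion spots by binary search. The comparisons are indeed O(m log n), but inserting into the middle of an array still shifts elements (O(n) each), which is one reason real engines rewrite whole runs during a merge instead.

```python
import bisect

def insert_batch(sstable, batch):
    # O(log n) to locate each slot, but O(n) data movement per insert
    for key in sorted(batch):
        bisect.insort(sstable, key)
    return sstable

print(insert_batch([2, 5, 8, 13, 21], [3, 34]))   # [2, 3, 5, 8, 13, 21, 34]
```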
This is very good content. Very good editing. Thanks 👍
Thanks Gaurav for the wonderful video... I've been your fan since the first video itself.
Thank you! Very useful content.
Thanks, Gaurav. The idea is quite similar to Apache HBase.
I believe compaction can use an optimal merge strategy similar to Huffman coding: take the smaller lists first and then merge the remaining ones, which minimizes the total merge cost.
Is it? I thought the leveled compaction strategy was the most popular one.
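A sketch of that smallest-first policy (the classic optimal merge pattern). Whether a given engine actually does this, versus size-tiered or leveled compaction, is a separate question.

```python
import heapq

def total_merge_cost(run_sizes):
    """Always merge the two smallest runs; each merge costs roughly a + b copies."""
    heap = list(run_sizes)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        cost += a + b
        heapq.heappush(heap, a + b)   # the merged run goes back into the pool
    return cost

print(total_merge_cost([10, 20, 30, 40]))   # merges cost 30, 60, 100 -> 190 total
```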
Great video!!🤯
MORE MORE MORE MORE MORE!!! WE ALL LOVE U!!!!
I think that's how Cassandra makes writes faster: a memtable in memory that is flushed to SSTs for good write latency, and bloom filters plus row/key caches to reduce read latency!
And that's where the leet code question of merging sorted arrays is going to help you in production !
You are an inspiration.
Thanks for the video!
You're welcome!
Hey gaurav! Amazing video! 😁
Awesome. Great video. Very nice explanation. Keep uploading. :)
very well taught ! thanks !
Excellent videos! What's your approach for becoming so knowledgeable about such a broad range of topics?
Isn't this how Cassandra works? Great concept.
Yes :D
Nice informative video..!!
Nice editing as well...))
Thank you for the explanation.
I guess this is basically how Cassandra works as it uses SSTs?
Yes :)
Appending at the end of the linked list, the time complexity is O(N) (we need to traverse the linked list before we can append). Correct me if I'm wrong.
You keep track of the tail and update it whenever there is an insert.
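A tiny sketch of that tail pointer making appends O(1); the class names are made up.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class AppendOnlyLog:
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        node = Node(value)
        if self.tail is None:        # first entry
            self.head = self.tail = node
        else:
            self.tail.next = node    # no traversal needed: O(1)
            self.tail = node
```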
thanks, nice explanation
Thank you!
Hey can you make videos on size tiered and leveled compaction? There aren't great video resources on that. Will be very helpful.
Awesome bro...keep going
Hi Gaurav,
Thanks for the video. I have a doubt regarding constant time for write operation.
1) Whenever a write request is triggered, it takes O(log m) (where m is the size of the memtable) to insert into the memtable, right?
2) A write query comprises two operations, right?
i) memtable
ii) log file
Great video! What would be the practical use of this concept?
What is the difference between an SST and an LSM tree? Both are doing the same thing after all.
Gaurav, can you explain this by taking a real scenario and writing pseudocode?
Why? I feel it's the same as a B+ tree overall, because if we look at a B+ tree abstractly, it's also maintaining sorted chunks of an array. But yes, in a B+ tree we don't have control over the chunk size or over when two sorted chunks get merged.
In a B+ tree, insertion is always O(log n),
while here it is constant most of the time (I guess).
The read operation is a bit heavy here, but I guess the bloom filter can help reduce that time.
Anyway, nice video.
I like the theory and the way you explain it. But how can this be applied? Are the log and the sorted array something supported by an RDBMS?
B+ tree implementations are used in most popular RDBMS.
Are the previous bloom filters of any help in the generation of the new bloom filter? As you said there is a high chance of having a lot of false positives in the new array if we don't increase the bits used by the new bloom filter. So, I'm assuming we need to again recalculate the new bloom filter discarding the previous two bloom filters with lower accuracies.
Yes, but the bloom filter can be made during the merge. That's O(N).
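A minimal bloom filter sketch to make that concrete. The hash scheme (Python's built-in hash with integer salts) and the sizes are purely illustrative, not what any real database uses.

```python
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # derive k bit positions by salting the hash; real filters use proper hash families
        return [hash((i, key)) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present" (false positives allowed)
        return all(self.bits[pos] for pos in self._positions(key))

# built while merging two runs, as the reply above describes
bf = BloomFilter()
for key in ["cat", "car", "corona"]:
    bf.add(key)
print(bf.might_contain("cat"))   # True
print(bf.might_contain("dog"))   # almost certainly False
```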
Please do a NoSQL and SQL system design for ECommerce
But searching a bloom filter means we need to iterate over the query string to compute its hash value. As the size of the string increases, that operation grows, and it can end up costing more than our binary search.
Can you make a video using a practical DB?
Would it be a better approach to implement the bloom filter using a trie data structure?
2:29 I don't get it, why do you care about read speed on the linked list? Doesn't the database have to read it in order? Or is it used as some kind of cache?
Quite insightful. Is it just me, or is there some voice enactment of a TV news anchor in there? 🤣
Which one? 😛
At 5:11 felt like arnab
This guy is very young and knows a lot.
Which is better, a trie or a bloom filter?
Hi, I know this is weird to ask, but can you make a
"How to prepare for competitive programming for absolute beginners" video again, like an updated version, and explain more?
Go through Errichto's channel on YouTube. He's a top competitive programmer and can provide you useful techniques.
(Not taking anything away from this channel)
@@ishaangupta4125 alright thanks
@@ishaangupta4125 You gave very good information; just to add on top: competitive coding and interview preparation are quite different, so even though the topics are the same, the strategy of preparation is different.
When notification arrives from gaurav..
Me: Hey, you don't know anything about this concept. Don't waste your time by watching this.
My brain: you stop, Gaurav makes things clear..
When is the sorted chunk created? Is it during insertion or sometime later? And what happens when a query comes in for data which is not yet in a sorted chunk?
Could you make a user authentication, authorisation, roles and privileges system design video? Can't find any good tutorial on this!
I am on it!
Nice video 😃😃
Good one Gaurav!! Just a thought, isn't this closely related to LSM trees?
It is an LSM tree :)
A large DB will typically have millions of entries, so won't the merging be very inefficient in that case, since merging two sorted arrays has a time complexity of O(n1 + n2)? Even if this compaction happens asynchronously, it can still slow down the DB because it will happen frequently.
Merging millions of entries is very reasonable. It's "trillions of entries" which should concern you 😛
@@gkcs The number of entries will depend on the data size of each unit, but let's say the data is 100 GB. Sorting that much data frequently will not be efficient.