LSM Tree + SSTable Database Indexes | Systems Design Interview: 0 to 1 with Google Software Engineer

Поделиться
HTML-код
  • Опубликовано: 17 янв 2025

Комментарии • 104

  • @SreekantShenoy
    @SreekantShenoy 9 месяцев назад +81

    I took notes bro. great going!
    Indexing techniques:
    1) Hash Index
    - They are basic hashmaps in memory. (NOT disk - as we need random hashmap access instantly and not seek it)
    - Thus limited by memory space, expensive
    - Fast reads and writes
    - Bad for range queries
    - To add durability, we add WAL - write to disk as log (easy as pointer is always at the end) and then write to the hashmap
    2) B-Trees
    - Self balancing tree, entirely kept on disk
    - Hence NO limit on size
    - Advantage having lesser concern over durability
    - Slower reads than “Hash index” technique as - disk vs memory speed.
    - Reads are fast - O(LogN) to retrieve, but slow writes
    - Good for range queries (as stored in tree structure)
    - Durability can be ensured by NOT updating tree in place, rather updating the pages/block separately file and then updating the reference later on.
    3) LSM Tree + SSTable
    - In-memory tree table (AVL, red-black), self balanced
    - Little faster writes compared to B-Trees, but slower than hash index
    - Slower reads speeds than B-Trees, as it has to check all SStables (Also slower range queries)
    - Durability handled by WAL - which can be replayed later
    - Space concern of LSM trees as in-memory, However we have SStables which are outputted to disk
    - Hence No of keys limit not a concern (to store by space)
    - SSTable - sorted, immutable
    - Finding key - can go on taking time - it is first checked in tree (memtable) and only then across SSTable backwards from creation time
    - To delete, its implemented by tombstone value - indicating deleted. It is added to the latest SSTable entry
    LSM Optimizations
    Use Sparse index
    - Also known as index for SSTable
    - Take “certain” keys of SSTable and create a sparse index with corresponding stored memory location.
    - Can use binary search to find in-between keys
    Bloom Filters
    - Can vaguely tell, if certain “key” present in the SSTable
    Compaction
    - Merges SSTables in background - O(N) operation
    - Cons is extra CPU processing in background could be bad

    • @giteshkhanna2633
      @giteshkhanna2633 18 дней назад

      Can also fit in a BST before the B-tree part.
      - Supports ranged queries.
      - The only downside of a BST is that the depth could be huge, hence still our disk would have to go through alot of I/O ops.

  • @youssefhany6345
    @youssefhany6345 Год назад +9

    This part was mentioned in the system design interview boob by Alex Xu in the end of chapter 6. Thank you for the video as it really clarifies this part with deep details.

  • @Anon-bt6dz
    @Anon-bt6dz 10 месяцев назад +8

    Excellent video. Watched this to fill some gaps after reading about this topic on DDIA.

  • @slover4384
    @slover4384 7 месяцев назад +17

    LSM tree is the name industry gives to the entire structure (the in memory BST plus the different SST tables)
    The in memory BST part is called the "memtable"
    That is, memtable + SSTables == LSM Tree

  • @mohanakumaar
    @mohanakumaar 5 месяцев назад +2

    I really enjoyed the pace at which I could consume quality information. Great summary as well.

  • @nintendolover82
    @nintendolover82 11 месяцев назад +2

    This is the exact series I was looking for. Subbed

  • @paolodm
    @paolodm 6 месяцев назад +1

    Just wanted to say this has been super useful. Concise, wastes no times... I am subscribing!

  • @AbhijithVMohan
    @AbhijithVMohan 9 месяцев назад +5

    Hi Jordan. Great series.
    DDIA mentions that binary search is not used to search in the sstable files. Rather the range from the sparse index is searched by a sequential scan. The reason for this is given as variable size of records. Also the block between 2 offsets in the sparse indeed would be compressed as a whole as a further optimization

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 месяцев назад

      Thanks for sharing! I believe once you scan the offsets in the sparse index, you binary search from the offsets you care about, but we may be misunderstanding one another

  • @daisuke.ryomen
    @daisuke.ryomen 7 месяцев назад +1

    Just started watching this series, till now it has been a lot of fun + a lot of learning!

  • @peteraleonov
    @peteraleonov 27 дней назад +1

    Man you're a refreshingly cool lecturer, keep going 🙏

    • @jordanhasnolife5163
      @jordanhasnolife5163  27 дней назад

      Wow I've been called emotionless, nerdy, a gaslighter, and devlishly handsome
      Glad to add cool to the list

  • @vimarshkoul930
    @vimarshkoul930 Год назад +3

    This video literally cleared all my doubts. Thanks

  • @Amonnuns
    @Amonnuns Год назад +1

    This is really great content, I just learned that from a book and wanted to make sure a captured everything. Thanks!

  • @kirillzlobin7135
    @kirillzlobin7135 17 часов назад

    9:27 haha, if it is a reference to that youtuber who creates videos about nothing - this is just AWESOME :))))))))))))))))))))) Great channel!!!!!!!!

  • @ihsannuruliman4005
    @ihsannuruliman4005 Год назад +4

    Boredom and depression are both our lovely friends.

  • @firezdog
    @firezdog 7 месяцев назад +1

    I must be learning something because I was mouthing the words "inorder traversal" the moment you started talking about serializing the BST on disk.

  • @shobhaagarwal6958
    @shobhaagarwal6958 3 месяца назад +2

    Thanks for a great video! When reading a key, if we dont find it in the LSM Tree then you said we have to look in the latest two SSTables.. I did not understand why only two? Wont we need to search all SSTables, whatever be their count? Also, in some real world implementations of this, how many SSTables we usually end up having?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад +1

      You have to look in all of them, that was just my example.
      How many SSTables we have IRL depends on how often we choose to compact them in the background!

  • @Delicatamente
    @Delicatamente 8 месяцев назад +1

    14:16 LSM also faster on writes over B tree because every write operation on B tree will start index restructurization, which is not happening in LSM case.
    Thank you for your videos.

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 месяцев назад +1

      I think some percentage of them will, depends if the leaf node is empty but good point here

  • @MrCanOk
    @MrCanOk 11 месяцев назад +1

    Great opening, Great Explanation, THANKS

  • @dwivedys
    @dwivedys 9 месяцев назад +1

    Boy oh boy! Beautiful this 👍👍👍👌👌👌

  • @erictsend8878
    @erictsend8878 2 месяца назад +1

    bruh... I was starting zone out and 2:55 jolted me awake, hilarious

  • @abhishekmiet
    @abhishekmiet 6 месяцев назад +1

    Nice work Jordan!!!

  • @nm_scarecrow302
    @nm_scarecrow302 20 дней назад +1

    A stupid question I had was what happens to deleted keys upon compaction? If one SSTable has a value for it and most recent one has tombstone.

    • @jordanhasnolife5163
      @jordanhasnolife5163  18 дней назад +1

      The resulting SSTable just doesn't contain any row for that key

    • @nm_scarecrow302
      @nm_scarecrow302 18 дней назад +1

      @@jordanhasnolife5163 I figured but still wanted to confirm. Thanks Jordan! Happy New Year!

  • @mann8737
    @mann8737 2 года назад +1

    Nice work as usual Jordan!

  • @dibyanshupranjal7704
    @dibyanshupranjal7704 Год назад +3

    so it the SS table just the structure in disk? I have got confused reading data intensive applications

  • @mali-qb8vh
    @mali-qb8vh 4 месяца назад +1

    In the example of the ‘tree’ shown at 0:28, are the data incorrectly placed, or were they just randomly chosen?

  • @KratosProton
    @KratosProton 2 года назад +5

    As usual great video Jordan. I do have one suggestion, I believe that your powerpoint method of explaining was more effective. And you know what SWEs say "If it ain't broke, don't fix it" 🤪

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад +5

      Well to be fair, I've done all this stuff on powerpoints haha, and I've gotten relatively feedback surrounding the iPad. I think that someone will always be a little less convenienced no matter what I do, so I'll probably stick with it for now! If it helps, the slides are linked in my channel description and you can use those to help understand these videos better!
      Thanks for the feedback :)

    • @KratosProton
      @KratosProton 2 года назад +1

      @@jordanhasnolife5163 True that

  • @stefanss2494
    @stefanss2494 10 месяцев назад +2

    There are a couple of things which I didn't understand (I assumed that bloom filters and sparse Index are created per SSTable):
    1. So each SSTable has a bloom filter. Basically when we add a new entry to the memtable (the in memory balanced BST) we also update the bloom filter. When a certain threshold has been exceeded => memtable is flushed on disk (SSTable). Is the bloom filter also flushed on disk (and we have a mechanism of loading into memory the bloom filters and cache the most recent ones)?
    2.I assume that Sparse Index are also written on disk. If the bloom filter is an array of bits and for each key we apply k hash functions and set to 1 the respective bits (constant time) how is this sparse index build (how do we know the address on disk for a key)? Do we keep 2 BSTs in memory and we flush both of them ? If that is the case how to we store in the Sparse Index the address on disk of that key ?
    3. In case of compactions we would have to update both bloom filters and sparse Index right ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 месяцев назад +1

      1) Each SSTable has a bloom filter which is created when the SSTable is created (flushed from memory). Yep, you'd store the bloom filter on disk and load it into memory!

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 месяцев назад

      2) I'm not sure what you mean by sparse index. Presumably you mean a secondary index, in which case yeah you basically have to replicate the whole LSM tree, SSTables, and bloom filters for whatever the key is for that index

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 месяцев назад

      3) Again, assuming sparse index = secondary index. You have to update bloom filters on compaction, but I'd think that for the whole sparse index you can compact it separately from when you compact the primary index.

    • @stefanss2494
      @stefanss2494 10 месяцев назад +2

      No that is not what I mean by sparse Index. In the video (minute 8:15) you showed 2 optimisations for the LSM tree, one was bloom filter and the other one was Sparse Index (about which you said it contains some of the keys of the SSTable and speeds up searching for a key in an SSTable).

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 месяцев назад +1

      @@stefanss2494 OMG I totally forgot about the sparse index! Yes that would have to be rebuilt upon a merge because now the addresses change. In theory, you could probably "merge" the two bloom filters by just combining the ones, but that's effectively rebuilding them.

  • @hinata4661
    @hinata4661 11 месяцев назад +1

    I think we store a part of B-Tree if not full in-memory for faster searches and perform write in-memory and do periodical flush to disk. If they are written in-memory then B-Tree ~ LSM Trees. Please correct me if I am wrong.

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 месяцев назад

      There is definitely a lot of caching involved! It helps a lot to use a write back in memory cache of certain nodes within the b-tree.

    • @hinata4661
      @hinata4661 11 месяцев назад +1

      @@jordanhasnolife5163 Can you share the ipad notes for this 2.0 playlist.

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 месяцев назад

      @@hinata4661 Yup, will get to this when I finish my current series.

  • @hassanharera
    @hassanharera Год назад +1

    Awesome explanation 👏

  • @hassantaherian9745
    @hassantaherian9745 6 месяцев назад +1

    Thanks for great explanation👌

  • @JinilSasidharan
    @JinilSasidharan 9 месяцев назад +1

    Excellent, Thanks bro!

  • @yahorzabalotski
    @yahorzabalotski 2 года назад +2

    Hi Jordan! Great job! AFAIK key length may vary, something (e.g. special structure of each row) needs to be done additionally to support bsearch, right? Am I missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад

      I'm a bit confused here. You're saying that when key lengths can vary normal binary search doesn't work? As long as they're sorted and there's a comparison function implemented that shouldn't be the case.

    • @yahorzabalotski
      @yahorzabalotski 2 года назад +1

      Of course it works, but you somehow need to know positions where to jump, right? e.g. additional file with mapping {"key_n" : "starts_at_pos_x_in_sstable"}

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад +2

      @@yahorzabalotski Yeah that's what the sparse index is for

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Год назад

      @@yahorzabalotski slotted pages probably works

  • @Spreadlove5683
    @Spreadlove5683 Год назад +1

    The tombstone just lives in memory right? Well, until you flush memory to disk when you run out of space, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад

      It actually does go to disk, because keep in mind we have multiple sstables, so you need the tombstone until they get compacted together

    • @Spreadlove5683
      @Spreadlove5683 Год назад +1

      ​@@jordanhasnolife5163 thank you! when do they go to disk? Immediately upon creation? What I meant in my original comment is that I figured they live in memory until the memory gets flushed into an SS table (so, not talking about swap space), but I'm afraid I don't understand how things work if that's not the case. Memory is always the primary source of truth anyways, right? And SS tables are immutable you said, but I guess you could associate a tombstone with an SS table entry without modifying the SS table itself, depending on if you consider that a mutation.

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +1

      Yep they first go to memory and then are flushed to disk like a normal write :)@@Spreadlove5683

    • @Spreadlove5683
      @Spreadlove5683 Год назад

      ​@@jordanhasnolife5163thanks!!

  • @m.imranzaheer1368
    @m.imranzaheer1368 6 месяцев назад +1

    Thanks bro. Great work

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +6

    2:55 LMAOO this is when you gained a subscriber

  • @Andron4iKTV
    @Andron4iKTV 8 месяцев назад +1

    What do u think about write amplification in LSM tree? A lot of people talk about it as big cons. BUt rel db still have this issue too. Do u know a db or ways how various db handle this?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 месяцев назад

      Good question - I am not sure off of the top of my head, if you google this and come back with an answer definitely let us all know!

  • @programmingconcepts-d9w
    @programmingconcepts-d9w 7 месяцев назад +1

    bro is goat anywhere stuck come here

  • @saurabhmittal6947
    @saurabhmittal6947 6 месяцев назад +1

    hey i was recently asked why are writes blazing fast in cassandra and i told interviewer that its because of directly to memory in LSM tree and whole stuff... but he was like, postgresql also write to memory, what's causing this major difference in write throughput ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 месяцев назад

      Did you ask your interviewer what the answer was after the fact?
      1) Throughput not limited to a single node
      2) Perhaps different locking constraints, e.g. primary key stuff, but I'm not certain there off the top of my head

  • @zuowang5185
    @zuowang5185 Год назад +1

    Is nlogn really not acceptable when n is ok?

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад

      Can you elaborate on this question? But generally, no - any increase to time complexity is pretty huge when you're working on large enough datasets.

  • @vojindjukic2719
    @vojindjukic2719 Год назад +1

    very well explained, thanks

  • @vankram1552
    @vankram1552 Год назад +29

    This dude just making a youtube video for each chapter of "designing data intesive applications"

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +23

      He sure is - he also makes RUclips videos on non DDIA stuff afterwards too

    • @staffeng
      @staffeng Год назад +11

      There's nothing wrong with that. DDIA gets so much into the weeds that you forget the broader points it's trying to make by the end of the chapter. Videos are faster, concise, and easier to revisit and revise. Good luck revising with DDIA.

    • @arshadhammad
      @arshadhammad Год назад

      @@jordanhasnolife5163 And we love him for it!

    • @MikeTypes
      @MikeTypes Год назад +7

      That’s a good thing, no?

    • @dwivedys
      @dwivedys 7 месяцев назад +1

      And what - if any - may be an issue with that?

  • @Hjksivgx
    @Hjksivgx Год назад +2

    I love the condenced version of DDIA. Hand drawing looks very ugly though. Why not powerpoint or Excalidraw? I would say more professional diagrams will get you way further. Good luck!

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад

      Wow not a fan of my artistic abilities??
      Understand your point with excalidraw, however it would take a me a lot longer to iterate there and I couldn't write on top of them, hence why I opted for hand drawings.
      I could hire an animator, but all of these videos aren't behind a paywall...so maybe that'll be version 3.0

    • @Hjksivgx
      @Hjksivgx Год назад +1

      Many of your early videos featured prewritten content that looked much better than your art, haha. I understand that you already had the content, and also, redrawing for every new video without existing content would take an additional 15 minutes, but I think it's still worth it. Having it pre-drawn would make your videos shorter and much more professional. Anyways, there isn't much content like yours on RUclips, to be honest-so concise and focused solely on the information we need for SDI. Good job!

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад

      @@Hjksivgx Appreciate the feedback!

  • @code_with_om
    @code_with_om Год назад +1

    Amazing video man ❤ed it

  • @itmatterstomepodcast
    @itmatterstomepodcast 8 месяцев назад +1

    Why would any search for a key/value go through multiple SSTables? Shouldn’t it search the tables in the order they were written to starting with the most recent. That way it knows the first time it finds the key/value then that is the most up to date data for it

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 месяцев назад

      That's what we do - multiple just means more than 1, what if our key isn't in the first SSTable file?

  • @dibll
    @dibll 2 года назад +1

    Question - LSM trees have overhead of WAL(Disk) in additional to memory write while B Trees only write to Disk then why B Trees are slower in writes when compare to LSM? Also I believe B trees do not have a WAL as they are not in memory, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 года назад +3

      Btrees still likely have a WAL if you are performing writes in place. This functions to make sure that the tree doesn't accidentally get in an inconsistent state if the db fails. Btree write is WAL plus disk tree traversal, LSM tree write is WAL plus in memory tree traversal

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Год назад +1

      Btree are mutable so they require locked writes and they would probably use a page buffer that is flushed after WAL. Additionally btree on disk requires random access which is slower than sequential.

  • @andriidanylov9453
    @andriidanylov9453 2 года назад +1

    👍✍

  • @unknownman1
    @unknownman1 8 месяцев назад +1

    leaving the course at 6:03

  • @prashlovessamosa
    @prashlovessamosa 2 года назад +1

    Hey yo

  • @shamilgurban6439
    @shamilgurban6439 8 месяцев назад +1

    More clear than DDIA to be honest. Book left some gaps.

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 месяцев назад

      It's too goated so I had to make videos I could understand

  • @anandchowdhury9262
    @anandchowdhury9262 Месяц назад +1

    I would appreciate if the student's names were culturally inclusive!! Ex: Santosh with a score 98 is a more realistic example 😑