LSM Tree + SSTable Database Indexes | Systems Design Interview: 0 to 1 with Google Software Engineer

Поделиться
HTML-код
  • Опубликовано: 31 май 2024
  • I never use a write ahead log because I like living life on the edge.
  • НаукаНаука

Комментарии • 75

  • @SreekantShenoy
    @SreekantShenoy Месяц назад +15

    I took notes bro. great going!
    Indexing techniques:
    1) Hash Index
    - They are basic hashmaps in memory. (NOT disk - as we need random hashmap access instantly and not seek it)
    - Thus limited by memory space, expensive
    - Fast reads and writes
    - Bad for range queries
    - To add durability, we add WAL - write to disk as log (easy as pointer is always at the end) and then write to the hashmap
    2) B-Trees
    - Self balancing tree, entirely kept on disk
    - Hence NO limit on size
    - Advantage having lesser concern over durability
    - Slower reads than “Hash index” technique as - disk vs memory speed.
    - Reads are fast - O(LogN) to retrieve, but slow writes
    - Good for range queries (as stored in tree structure)
    - Durability can be ensured by NOT updating tree in place, rather updating the pages/block separately file and then updating the reference later on.
    3) LSM Tree + SSTable
    - In-memory tree table (AVL, red-black), self balanced
    - Little faster writes compared to B-Trees, but slower than hash index
    - Slower reads speeds than B-Trees, as it has to check all SStables (Also slower range queries)
    - Durability handled by WAL - which can be replayed later
    - Space concern of LSM trees as in-memory, However we have SStables which are outputted to disk
    - Hence No of keys limit not a concern (to store by space)
    - SSTable - sorted, immutable
    - Finding key - can go on taking time - it is first checked in tree (memtable) and only then across SSTable backwards from creation time
    - To delete, its implemented by tombstone value - indicating deleted. It is added to the latest SSTable entry
    LSM Optimizations
    Use Sparse index
    - Also known as index for SSTable
    - Take “certain” keys of SSTable and create a sparse index with corresponding stored memory location.
    - Can use binary search to find in-between keys
    Bloom Filters
    - Can vaguely tell, if certain “key” present in the SSTable
    Compaction
    - Merges SSTables in background - O(N) operation
    - Cons is extra CPU processing in background could be bad

  • @slover4384
    @slover4384 День назад +1

    LSM tree is the name industry gives to the entire structure (the in memory BST plus the different SST tables)
    The in memory BST part is called the "memtable"
    That is, memtable + SSTables == LSM Tree

  • @Anon-bt6dz
    @Anon-bt6dz 2 месяца назад +4

    Excellent video. Watched this to fill some gaps after reading about this topic on DDIA.

    • @atibhiagrawal6460
      @atibhiagrawal6460 2 месяца назад

      Same. This is excellent. This channel is a gem

  • @youssefhany6345
    @youssefhany6345 6 месяцев назад +6

    This part was mentioned in the system design interview boob by Alex Xu in the end of chapter 6. Thank you for the video as it really clarifies this part with deep details.

    • @bezimienny5
      @bezimienny5 Месяц назад

      I like interviewing boobs

  • @nintendolover82
    @nintendolover82 3 месяца назад +2

    This is the exact series I was looking for. Subbed

  • @AbhijithVMohan
    @AbhijithVMohan Месяц назад +1

    Hi Jordan. Great series.
    DDIA mentions that binary search is not used to search in the sstable files. Rather the range from the sparse index is searched by a sequential scan. The reason for this is given as variable size of records. Also the block between 2 offsets in the sparse indeed would be compressed as a whole as a further optimization

    • @jordanhasnolife5163
      @jordanhasnolife5163  Месяц назад

      Thanks for sharing! I believe once you scan the offsets in the sparse index, you binary search from the offsets you care about, but we may be misunderstanding one another

  • @Delicatamente
    @Delicatamente Месяц назад +1

    14:16 LSM also faster on writes over B tree because every write operation on B tree will start index restructurization, which is not happening in LSM case.
    Thank you for your videos.

    • @jordanhasnolife5163
      @jordanhasnolife5163  Месяц назад +1

      I think some percentage of them will, depends if the leaf node is empty but good point here

  • @Amonnuns
    @Amonnuns 4 месяца назад +1

    This is really great content, I just learned that from a book and wanted to make sure a captured everything. Thanks!

  • @vimarshkoul930
    @vimarshkoul930 11 месяцев назад +3

    This video literally cleared all my doubts. Thanks

  • @mann8737
    @mann8737 Год назад +1

    Nice work as usual Jordan!

  • @ihsannuruliman4005
    @ihsannuruliman4005 6 месяцев назад +4

    Boredom and depression are both our lovely friends.

  • @shamilgurban6439
    @shamilgurban6439 22 дня назад +1

    More clear than DDIA to be honest. Book left some gaps.

  • @MrCanokan
    @MrCanokan 3 месяца назад +1

    Great opening, Great Explanation, THANKS

  • @yahorzabalotski
    @yahorzabalotski Год назад +2

    Hi Jordan! Great job! AFAIK key length may vary, something (e.g. special structure of each row) needs to be done additionally to support bsearch, right? Am I missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад

      I'm a bit confused here. You're saying that when key lengths can vary normal binary search doesn't work? As long as they're sorted and there's a comparison function implemented that shouldn't be the case.

    • @yahorzabalotski
      @yahorzabalotski Год назад +1

      Of course it works, but you somehow need to know positions where to jump, right? e.g. additional file with mapping {"key_n" : "starts_at_pos_x_in_sstable"}

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +2

      @@yahorzabalotski Yeah that's what the sparse index is for

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Год назад

      @@yahorzabalotski slotted pages probably works

  • @RandomShowerThoughts
    @RandomShowerThoughts Год назад +5

    2:55 LMAOO this is when you gained a subscriber

  • @vankram1552
    @vankram1552 Год назад +19

    This dude just making a youtube video for each chapter of "designing data intesive applications"

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +14

      He sure is - he also makes RUclips videos on non DDIA stuff afterwards too

    • @staffeng
      @staffeng 9 месяцев назад +4

      There's nothing wrong with that. DDIA gets so much into the weeds that you forget the broader points it's trying to make by the end of the chapter. Videos are faster, concise, and easier to revisit and revise. Good luck revising with DDIA.

    • @arshadhammad
      @arshadhammad 6 месяцев назад

      @@jordanhasnolife5163 And we love him for it!

    • @MikeTypes
      @MikeTypes 5 месяцев назад +2

      That’s a good thing, no?

  • @hassanharera
    @hassanharera 11 месяцев назад +1

    Awesome explanation 👏

  • @Andron4iKTV
    @Andron4iKTV 23 дня назад +1

    What do u think about write amplification in LSM tree? A lot of people talk about it as big cons. BUt rel db still have this issue too. Do u know a db or ways how various db handle this?

    • @jordanhasnolife5163
      @jordanhasnolife5163  23 дня назад

      Good question - I am not sure off of the top of my head, if you google this and come back with an answer definitely let us all know!

  • @dwivedys
    @dwivedys 2 месяца назад +1

    Boy oh boy! Beautiful this 👍👍👍👌👌👌

  • @itmatterstomepodcast
    @itmatterstomepodcast 15 дней назад +1

    Why would any search for a key/value go through multiple SSTables? Shouldn’t it search the tables in the order they were written to starting with the most recent. That way it knows the first time it finds the key/value then that is the most up to date data for it

    • @jordanhasnolife5163
      @jordanhasnolife5163  15 дней назад

      That's what we do - multiple just means more than 1, what if our key isn't in the first SSTable file?

  • @ky747r0
    @ky747r0 Год назад +5

    As usual great video Jordan. I do have one suggestion, I believe that your powerpoint method of explaining was more effective. And you know what SWEs say "If it ain't broke, don't fix it" 🤪

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +5

      Well to be fair, I've done all this stuff on powerpoints haha, and I've gotten relatively feedback surrounding the iPad. I think that someone will always be a little less convenienced no matter what I do, so I'll probably stick with it for now! If it helps, the slides are linked in my channel description and you can use those to help understand these videos better!
      Thanks for the feedback :)

    • @ky747r0
      @ky747r0 Год назад +1

      @@jordanhasnolife5163 True that

  • @vojindjukic2719
    @vojindjukic2719 8 месяцев назад +1

    very well explained, thanks

  • @dibyanshupranjal7704
    @dibyanshupranjal7704 10 месяцев назад +3

    so it the SS table just the structure in disk? I have got confused reading data intensive applications

  • @JinilSasidharan
    @JinilSasidharan Месяц назад +1

    Excellent, Thanks bro!

  • @moidrees4661
    @moidrees4661 3 месяца назад +1

    I think we store a part of B-Tree if not full in-memory for faster searches and perform write in-memory and do periodical flush to disk. If they are written in-memory then B-Tree ~ LSM Trees. Please correct me if I am wrong.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      There is definitely a lot of caching involved! It helps a lot to use a write back in memory cache of certain nodes within the b-tree.

    • @moidrees4661
      @moidrees4661 3 месяца назад +1

      @@jordanhasnolife5163 Can you share the ipad notes for this 2.0 playlist.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      @@moidrees4661 Yup, will get to this when I finish my current series.

  • @code_with_om
    @code_with_om 11 месяцев назад +1

    Amazing video man ❤ed it

  • @stefanss2494
    @stefanss2494 3 месяца назад +2

    There are a couple of things which I didn't understand (I assumed that bloom filters and sparse Index are created per SSTable):
    1. So each SSTable has a bloom filter. Basically when we add a new entry to the memtable (the in memory balanced BST) we also update the bloom filter. When a certain threshold has been exceeded => memtable is flushed on disk (SSTable). Is the bloom filter also flushed on disk (and we have a mechanism of loading into memory the bloom filters and cache the most recent ones)?
    2.I assume that Sparse Index are also written on disk. If the bloom filter is an array of bits and for each key we apply k hash functions and set to 1 the respective bits (constant time) how is this sparse index build (how do we know the address on disk for a key)? Do we keep 2 BSTs in memory and we flush both of them ? If that is the case how to we store in the Sparse Index the address on disk of that key ?
    3. In case of compactions we would have to update both bloom filters and sparse Index right ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      1) Each SSTable has a bloom filter which is created when the SSTable is created (flushed from memory). Yep, you'd store the bloom filter on disk and load it into memory!

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      2) I'm not sure what you mean by sparse index. Presumably you mean a secondary index, in which case yeah you basically have to replicate the whole LSM tree, SSTables, and bloom filters for whatever the key is for that index

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      3) Again, assuming sparse index = secondary index. You have to update bloom filters on compaction, but I'd think that for the whole sparse index you can compact it separately from when you compact the primary index.

    • @stefanss2494
      @stefanss2494 3 месяца назад +1

      No that is not what I mean by sparse Index. In the video (minute 8:15) you showed 2 optimisations for the LSM tree, one was bloom filter and the other one was Sparse Index (about which you said it contains some of the keys of the SSTable and speeds up searching for a key in an SSTable).

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 месяца назад

      @@stefanss2494 OMG I totally forgot about the sparse index! Yes that would have to be rebuilt upon a merge because now the addresses change. In theory, you could probably "merge" the two bloom filters by just combining the ones, but that's effectively rebuilding them.

  • @okthatsnice
    @okthatsnice 7 месяцев назад +1

    The tombstone just lives in memory right? Well, until you flush memory to disk when you run out of space, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 месяцев назад

      It actually does go to disk, because keep in mind we have multiple sstables, so you need the tombstone until they get compacted together

    • @okthatsnice
      @okthatsnice 7 месяцев назад +1

      ​@@jordanhasnolife5163 thank you! when do they go to disk? Immediately upon creation? What I meant in my original comment is that I figured they live in memory until the memory gets flushed into an SS table (so, not talking about swap space), but I'm afraid I don't understand how things work if that's not the case. Memory is always the primary source of truth anyways, right? And SS tables are immutable you said, but I guess you could associate a tombstone with an SS table entry without modifying the SS table itself, depending on if you consider that a mutation.

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 месяцев назад +1

      Yep they first go to memory and then are flushed to disk like a normal write :)@@okthatsnice

    • @okthatsnice
      @okthatsnice 6 месяцев назад

      ​@@jordanhasnolife5163thanks!!

  • @dibll
    @dibll Год назад +1

    Question - LSM trees have overhead of WAL(Disk) in additional to memory write while B Trees only write to Disk then why B Trees are slower in writes when compare to LSM? Also I believe B trees do not have a WAL as they are not in memory, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  Год назад +3

      Btrees still likely have a WAL if you are performing writes in place. This functions to make sure that the tree doesn't accidentally get in an inconsistent state if the db fails. Btree write is WAL plus disk tree traversal, LSM tree write is WAL plus in memory tree traversal

    • @7th_CAV_Trooper
      @7th_CAV_Trooper Год назад +1

      Btree are mutable so they require locked writes and they would probably use a page buffer that is flushed after WAL. Additionally btree on disk requires random access which is slower than sequential.

  • @zuowang5185
    @zuowang5185 4 месяца назад +1

    Is nlogn really not acceptable when n is ok?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 месяца назад

      Can you elaborate on this question? But generally, no - any increase to time complexity is pretty huge when you're working on large enough datasets.

  • @andriidanylov9453
    @andriidanylov9453 Год назад +1

    👍✍

  • @Hjksivgx
    @Hjksivgx 5 месяцев назад +1

    I love the condenced version of DDIA. Hand drawing looks very ugly though. Why not powerpoint or Excalidraw? I would say more professional diagrams will get you way further. Good luck!

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 месяцев назад

      Wow not a fan of my artistic abilities??
      Understand your point with excalidraw, however it would take a me a lot longer to iterate there and I couldn't write on top of them, hence why I opted for hand drawings.
      I could hire an animator, but all of these videos aren't behind a paywall...so maybe that'll be version 3.0

    • @Hjksivgx
      @Hjksivgx 5 месяцев назад +1

      Many of your early videos featured prewritten content that looked much better than your art, haha. I understand that you already had the content, and also, redrawing for every new video without existing content would take an additional 15 minutes, but I think it's still worth it. Having it pre-drawn would make your videos shorter and much more professional. Anyways, there isn't much content like yours on RUclips, to be honest-so concise and focused solely on the information we need for SDI. Good job!

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 месяцев назад

      @@Hjksivgx Appreciate the feedback!

  • @unknownman1
    @unknownman1 12 дней назад +1

    leaving the course at 6:03

  • @prashlovessamosa
    @prashlovessamosa Год назад +1

    Hey yo