LSM Tree + SSTable Database Indexes | Systems Design Interview: 0 to 1 with Google Software Engineer

แชร์
ฝัง
  • เผยแพร่เมื่อ 25 ก.ย. 2024

ความคิดเห็น • 90

  • @SreekantShenoy
    @SreekantShenoy 5 หลายเดือนก่อน +49

    I took notes bro. great going!
    Indexing techniques:
    1) Hash Index
    - They are basic hashmaps in memory. (NOT disk - as we need random hashmap access instantly and not seek it)
    - Thus limited by memory space, expensive
    - Fast reads and writes
    - Bad for range queries
    - To add durability, we add WAL - write to disk as log (easy as pointer is always at the end) and then write to the hashmap
    2) B-Trees
    - Self balancing tree, entirely kept on disk
    - Hence NO limit on size
    - Advantage having lesser concern over durability
    - Slower reads than “Hash index” technique as - disk vs memory speed.
    - Reads are fast - O(LogN) to retrieve, but slow writes
    - Good for range queries (as stored in tree structure)
    - Durability can be ensured by NOT updating tree in place, rather updating the pages/block separately file and then updating the reference later on.
    3) LSM Tree + SSTable
    - In-memory tree table (AVL, red-black), self balanced
    - Little faster writes compared to B-Trees, but slower than hash index
    - Slower reads speeds than B-Trees, as it has to check all SStables (Also slower range queries)
    - Durability handled by WAL - which can be replayed later
    - Space concern of LSM trees as in-memory, However we have SStables which are outputted to disk
    - Hence No of keys limit not a concern (to store by space)
    - SSTable - sorted, immutable
    - Finding key - can go on taking time - it is first checked in tree (memtable) and only then across SSTable backwards from creation time
    - To delete, its implemented by tombstone value - indicating deleted. It is added to the latest SSTable entry
    LSM Optimizations
    Use Sparse index
    - Also known as index for SSTable
    - Take “certain” keys of SSTable and create a sparse index with corresponding stored memory location.
    - Can use binary search to find in-between keys
    Bloom Filters
    - Can vaguely tell, if certain “key” present in the SSTable
    Compaction
    - Merges SSTables in background - O(N) operation
    - Cons is extra CPU processing in background could be bad

  • @mohanakumaar
    @mohanakumaar หลายเดือนก่อน +2

    I really enjoyed the pace at which I could consume quality information. Great summary as well.

  • @youssefhany6345
    @youssefhany6345 10 หลายเดือนก่อน +8

    This part was mentioned in the system design interview boob by Alex Xu in the end of chapter 6. Thank you for the video as it really clarifies this part with deep details.

    • @bezimienny5
      @bezimienny5 5 หลายเดือนก่อน

      I like interviewing boobs

  • @slover4384
    @slover4384 3 หลายเดือนก่อน +12

    LSM tree is the name industry gives to the entire structure (the in memory BST plus the different SST tables)
    The in memory BST part is called the "memtable"
    That is, memtable + SSTables == LSM Tree

  • @Anon-bt6dz
    @Anon-bt6dz 6 หลายเดือนก่อน +6

    Excellent video. Watched this to fill some gaps after reading about this topic on DDIA.

    • @atibhiagrawal6460
      @atibhiagrawal6460 6 หลายเดือนก่อน

      Same. This is excellent. This channel is a gem

  • @paolodm
    @paolodm 2 หลายเดือนก่อน +1

    Just wanted to say this has been super useful. Concise, wastes no times... I am subscribing!

  • @daisuke.ryomen
    @daisuke.ryomen 3 หลายเดือนก่อน +1

    Just started watching this series, till now it has been a lot of fun + a lot of learning!

  • @AbhijithVMohan
    @AbhijithVMohan 5 หลายเดือนก่อน +3

    Hi Jordan. Great series.
    DDIA mentions that binary search is not used to search in the sstable files. Rather the range from the sparse index is searched by a sequential scan. The reason for this is given as variable size of records. Also the block between 2 offsets in the sparse indeed would be compressed as a whole as a further optimization

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 หลายเดือนก่อน

      Thanks for sharing! I believe once you scan the offsets in the sparse index, you binary search from the offsets you care about, but we may be misunderstanding one another

  • @nintendolover82
    @nintendolover82 7 หลายเดือนก่อน +2

    This is the exact series I was looking for. Subbed

  • @Amonnuns
    @Amonnuns 8 หลายเดือนก่อน +1

    This is really great content, I just learned that from a book and wanted to make sure a captured everything. Thanks!

  • @vimarshkoul930
    @vimarshkoul930 ปีที่แล้ว +3

    This video literally cleared all my doubts. Thanks

  • @RandomShowerThoughts
    @RandomShowerThoughts ปีที่แล้ว +6

    2:55 LMAOO this is when you gained a subscriber

  • @firezdog
    @firezdog 3 หลายเดือนก่อน +1

    I must be learning something because I was mouthing the words "inorder traversal" the moment you started talking about serializing the BST on disk.

  • @MrCanokan
    @MrCanokan 7 หลายเดือนก่อน +1

    Great opening, Great Explanation, THANKS

  • @abhishekmiet
    @abhishekmiet 2 หลายเดือนก่อน +1

    Nice work Jordan!!!

  • @Delicatamente
    @Delicatamente 4 หลายเดือนก่อน +1

    14:16 LSM also faster on writes over B tree because every write operation on B tree will start index restructurization, which is not happening in LSM case.
    Thank you for your videos.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน +1

      I think some percentage of them will, depends if the leaf node is empty but good point here

  • @KratosProton
    @KratosProton ปีที่แล้ว +5

    As usual great video Jordan. I do have one suggestion, I believe that your powerpoint method of explaining was more effective. And you know what SWEs say "If it ain't broke, don't fix it" 🤪

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +5

      Well to be fair, I've done all this stuff on powerpoints haha, and I've gotten relatively feedback surrounding the iPad. I think that someone will always be a little less convenienced no matter what I do, so I'll probably stick with it for now! If it helps, the slides are linked in my channel description and you can use those to help understand these videos better!
      Thanks for the feedback :)

    • @KratosProton
      @KratosProton ปีที่แล้ว +1

      @@jordanhasnolife5163 True that

  • @mann8737
    @mann8737 ปีที่แล้ว +1

    Nice work as usual Jordan!

  • @dwivedys
    @dwivedys 6 หลายเดือนก่อน +1

    Boy oh boy! Beautiful this 👍👍👍👌👌👌

  • @vankram1552
    @vankram1552 ปีที่แล้ว +26

    This dude just making a youtube video for each chapter of "designing data intesive applications"

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +20

      He sure is - he also makes TH-cam videos on non DDIA stuff afterwards too

    • @staffeng
      @staffeng ปีที่แล้ว +7

      There's nothing wrong with that. DDIA gets so much into the weeds that you forget the broader points it's trying to make by the end of the chapter. Videos are faster, concise, and easier to revisit and revise. Good luck revising with DDIA.

    • @arshadhammad
      @arshadhammad 10 หลายเดือนก่อน

      @@jordanhasnolife5163 And we love him for it!

    • @MikeTypes
      @MikeTypes 8 หลายเดือนก่อน +3

      That’s a good thing, no?

    • @dwivedys
      @dwivedys 3 หลายเดือนก่อน

      And what - if any - may be an issue with that?

  • @hassantaherian9745
    @hassantaherian9745 2 หลายเดือนก่อน +1

    Thanks for great explanation👌

  • @ihsannuruliman4005
    @ihsannuruliman4005 10 หลายเดือนก่อน +4

    Boredom and depression are both our lovely friends.

  • @JinilSasidharan
    @JinilSasidharan 5 หลายเดือนก่อน +1

    Excellent, Thanks bro!

  • @dibyanshupranjal7704
    @dibyanshupranjal7704 ปีที่แล้ว +3

    so it the SS table just the structure in disk? I have got confused reading data intensive applications

  • @hassanharera
    @hassanharera ปีที่แล้ว +1

    Awesome explanation 👏

  • @mali-qb8vh
    @mali-qb8vh 12 วันที่ผ่านมา +1

    In the example of the ‘tree’ shown at 0:28, are the data incorrectly placed, or were they just randomly chosen?

    • @jordanhasnolife5163
      @jordanhasnolife5163  12 วันที่ผ่านมา +1

      It's a binary search tree. That's organized based on key.

    • @rashmitpankhania6513
      @rashmitpankhania6513 11 วันที่ผ่านมา +1

      Key here is name(string) not the marks

  • @m.imranzaheer1368
    @m.imranzaheer1368 3 หลายเดือนก่อน +1

    Thanks bro. Great work

  • @programmingconcepts-d9w
    @programmingconcepts-d9w 3 หลายเดือนก่อน +1

    bro is goat anywhere stuck come here

  • @hinata4661
    @hinata4661 7 หลายเดือนก่อน +1

    I think we store a part of B-Tree if not full in-memory for faster searches and perform write in-memory and do periodical flush to disk. If they are written in-memory then B-Tree ~ LSM Trees. Please correct me if I am wrong.

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน

      There is definitely a lot of caching involved! It helps a lot to use a write back in memory cache of certain nodes within the b-tree.

    • @hinata4661
      @hinata4661 7 หลายเดือนก่อน +1

      @@jordanhasnolife5163 Can you share the ipad notes for this 2.0 playlist.

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน

      @@hinata4661 Yup, will get to this when I finish my current series.

  • @vojindjukic2719
    @vojindjukic2719 ปีที่แล้ว +1

    very well explained, thanks

  • @stefanss2494
    @stefanss2494 7 หลายเดือนก่อน +2

    There are a couple of things which I didn't understand (I assumed that bloom filters and sparse Index are created per SSTable):
    1. So each SSTable has a bloom filter. Basically when we add a new entry to the memtable (the in memory balanced BST) we also update the bloom filter. When a certain threshold has been exceeded => memtable is flushed on disk (SSTable). Is the bloom filter also flushed on disk (and we have a mechanism of loading into memory the bloom filters and cache the most recent ones)?
    2.I assume that Sparse Index are also written on disk. If the bloom filter is an array of bits and for each key we apply k hash functions and set to 1 the respective bits (constant time) how is this sparse index build (how do we know the address on disk for a key)? Do we keep 2 BSTs in memory and we flush both of them ? If that is the case how to we store in the Sparse Index the address on disk of that key ?
    3. In case of compactions we would have to update both bloom filters and sparse Index right ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน +1

      1) Each SSTable has a bloom filter which is created when the SSTable is created (flushed from memory). Yep, you'd store the bloom filter on disk and load it into memory!

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน

      2) I'm not sure what you mean by sparse index. Presumably you mean a secondary index, in which case yeah you basically have to replicate the whole LSM tree, SSTables, and bloom filters for whatever the key is for that index

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน

      3) Again, assuming sparse index = secondary index. You have to update bloom filters on compaction, but I'd think that for the whole sparse index you can compact it separately from when you compact the primary index.

    • @stefanss2494
      @stefanss2494 7 หลายเดือนก่อน +2

      No that is not what I mean by sparse Index. In the video (minute 8:15) you showed 2 optimisations for the LSM tree, one was bloom filter and the other one was Sparse Index (about which you said it contains some of the keys of the SSTable and speeds up searching for a key in an SSTable).

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 หลายเดือนก่อน +1

      @@stefanss2494 OMG I totally forgot about the sparse index! Yes that would have to be rebuilt upon a merge because now the addresses change. In theory, you could probably "merge" the two bloom filters by just combining the ones, but that's effectively rebuilding them.

  • @saurabhmittal6947
    @saurabhmittal6947 2 หลายเดือนก่อน +1

    hey i was recently asked why are writes blazing fast in cassandra and i told interviewer that its because of directly to memory in LSM tree and whole stuff... but he was like, postgresql also write to memory, what's causing this major difference in write throughput ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 หลายเดือนก่อน

      Did you ask your interviewer what the answer was after the fact?
      1) Throughput not limited to a single node
      2) Perhaps different locking constraints, e.g. primary key stuff, but I'm not certain there off the top of my head

  • @Andron4iKTV
    @Andron4iKTV 4 หลายเดือนก่อน +1

    What do u think about write amplification in LSM tree? A lot of people talk about it as big cons. BUt rel db still have this issue too. Do u know a db or ways how various db handle this?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      Good question - I am not sure off of the top of my head, if you google this and come back with an answer definitely let us all know!

  • @itmatterstomepodcast
    @itmatterstomepodcast 4 หลายเดือนก่อน +1

    Why would any search for a key/value go through multiple SSTables? Shouldn’t it search the tables in the order they were written to starting with the most recent. That way it knows the first time it finds the key/value then that is the most up to date data for it

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      That's what we do - multiple just means more than 1, what if our key isn't in the first SSTable file?

  • @Hjksivgx
    @Hjksivgx 9 หลายเดือนก่อน +2

    I love the condenced version of DDIA. Hand drawing looks very ugly though. Why not powerpoint or Excalidraw? I would say more professional diagrams will get you way further. Good luck!

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 หลายเดือนก่อน

      Wow not a fan of my artistic abilities??
      Understand your point with excalidraw, however it would take a me a lot longer to iterate there and I couldn't write on top of them, hence why I opted for hand drawings.
      I could hire an animator, but all of these videos aren't behind a paywall...so maybe that'll be version 3.0

    • @Hjksivgx
      @Hjksivgx 9 หลายเดือนก่อน +1

      Many of your early videos featured prewritten content that looked much better than your art, haha. I understand that you already had the content, and also, redrawing for every new video without existing content would take an additional 15 minutes, but I think it's still worth it. Having it pre-drawn would make your videos shorter and much more professional. Anyways, there isn't much content like yours on TH-cam, to be honest-so concise and focused solely on the information we need for SDI. Good job!

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 หลายเดือนก่อน

      @@Hjksivgx Appreciate the feedback!

  • @yahorzabalotski
    @yahorzabalotski ปีที่แล้ว +2

    Hi Jordan! Great job! AFAIK key length may vary, something (e.g. special structure of each row) needs to be done additionally to support bsearch, right? Am I missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว

      I'm a bit confused here. You're saying that when key lengths can vary normal binary search doesn't work? As long as they're sorted and there's a comparison function implemented that shouldn't be the case.

    • @yahorzabalotski
      @yahorzabalotski ปีที่แล้ว +1

      Of course it works, but you somehow need to know positions where to jump, right? e.g. additional file with mapping {"key_n" : "starts_at_pos_x_in_sstable"}

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +2

      @@yahorzabalotski Yeah that's what the sparse index is for

    • @7th_CAV_Trooper
      @7th_CAV_Trooper ปีที่แล้ว

      @@yahorzabalotski slotted pages probably works

  • @code_with_om
    @code_with_om ปีที่แล้ว +1

    Amazing video man ❤ed it

  • @Spreadlove5683
    @Spreadlove5683 11 หลายเดือนก่อน +1

    The tombstone just lives in memory right? Well, until you flush memory to disk when you run out of space, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 หลายเดือนก่อน

      It actually does go to disk, because keep in mind we have multiple sstables, so you need the tombstone until they get compacted together

    • @Spreadlove5683
      @Spreadlove5683 11 หลายเดือนก่อน +1

      ​@@jordanhasnolife5163 thank you! when do they go to disk? Immediately upon creation? What I meant in my original comment is that I figured they live in memory until the memory gets flushed into an SS table (so, not talking about swap space), but I'm afraid I don't understand how things work if that's not the case. Memory is always the primary source of truth anyways, right? And SS tables are immutable you said, but I guess you could associate a tombstone with an SS table entry without modifying the SS table itself, depending on if you consider that a mutation.

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 หลายเดือนก่อน +1

      Yep they first go to memory and then are flushed to disk like a normal write :)@@Spreadlove5683

    • @Spreadlove5683
      @Spreadlove5683 10 หลายเดือนก่อน

      ​@@jordanhasnolife5163thanks!!

  • @zuowang5185
    @zuowang5185 8 หลายเดือนก่อน +1

    Is nlogn really not acceptable when n is ok?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 หลายเดือนก่อน

      Can you elaborate on this question? But generally, no - any increase to time complexity is pretty huge when you're working on large enough datasets.

  • @unknownman1
    @unknownman1 4 หลายเดือนก่อน +1

    leaving the course at 6:03

  • @dibll
    @dibll ปีที่แล้ว +1

    Question - LSM trees have overhead of WAL(Disk) in additional to memory write while B Trees only write to Disk then why B Trees are slower in writes when compare to LSM? Also I believe B trees do not have a WAL as they are not in memory, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +3

      Btrees still likely have a WAL if you are performing writes in place. This functions to make sure that the tree doesn't accidentally get in an inconsistent state if the db fails. Btree write is WAL plus disk tree traversal, LSM tree write is WAL plus in memory tree traversal

    • @7th_CAV_Trooper
      @7th_CAV_Trooper ปีที่แล้ว +1

      Btree are mutable so they require locked writes and they would probably use a page buffer that is flushed after WAL. Additionally btree on disk requires random access which is slower than sequential.

  • @andriidanylov9453
    @andriidanylov9453 ปีที่แล้ว +1

    👍✍

  • @shamilgurban6439
    @shamilgurban6439 4 หลายเดือนก่อน +1

    More clear than DDIA to be honest. Book left some gaps.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      It's too goated so I had to make videos I could understand

  • @prashlovessamosa
    @prashlovessamosa ปีที่แล้ว +1

    Hey yo