How do Hash Indexes work? | Systems Design Interview: 0 to 1 with Google Software Engineer

Comments • 30

  • @faceTheCandidates
    @faceTheCandidates 1 year ago +5

    Nice work. Very cool that you mention binary trees towards the end. If asked to design an in memory db, you can actually create two types of indexes - one that supports range queries and one that is used for fast, single-key lookups. The only catch being, your query engine has to decide at run time which index to go against. Maintaining two indexes might be tricky though!
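The two-index approach described above can be sketched as a toy in-memory store (all names here are illustrative, not from any real database): a dict serves as the hash index for point lookups, and a sorted key list serves as the range-capable index. Every write has to maintain both, which is the "tricky" part the comment mentions.

```python
import bisect

class DualIndexStore:
    """Toy in-memory store with two indexes over the same data:
    a hash index (dict) for O(1) point lookups and a sorted key
    list for range queries. Both are updated on every write."""

    def __init__(self):
        self._hash = {}          # key -> value, O(1) point lookups
        self._sorted_keys = []   # kept sorted, enables range scans

    def put(self, key, value):
        if key not in self._hash:                  # maintain both indexes
            bisect.insort(self._sorted_keys, key)  # O(n) insert cost here
        self._hash[key] = value

    def get(self, key):
        # the query engine routes point lookups to the hash index
        return self._hash.get(key)

    def range(self, lo, hi):
        # ...and range queries to the sorted index
        i = bisect.bisect_left(self._sorted_keys, lo)
        j = bisect.bisect_right(self._sorted_keys, hi)
        return [(k, self._hash[k]) for k in self._sorted_keys[i:j]]
```

Note the asymmetry this introduces: reads pick whichever index fits the query, but writes pay for both.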

  • @ritwik121
    @ritwik121 1 year ago +21

    hi, can you use a slightly bolder blue or red ink, or whatever looks prominent? the writing is not prominent in the video. please do look into it

  • @OrbitZyro
    @OrbitZyro 1 year ago +2

    Could we have associated a timestamp with each log entry, then maintained another idx that tells us the range of specific timestamps? Then binary search to find a specific timestamp range and seek to that piece of the file. However, this would only work for queries that can use timestamps.
    One of the things I noticed was that 1 single WAL and 1 idx could get really large over time, so separate segments and an idx for each segment, with a root segment that lets us do range queries (yeah, a lot more overhead)
    Really enjoying your explanations!!
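The root-index-over-segments idea above can be sketched with a binary search: keep the first timestamp of each segment in a sorted list, and bisect to find the segment whose range contains a given timestamp (the segment names and layout here are made up for illustration):

```python
import bisect

# Hypothetical root index: the first timestamp covered by each segment,
# kept sorted, so a binary search picks the segment to seek into.
segment_starts = [0, 1000, 2000, 3000]
segment_files = ["seg0.log", "seg1.log", "seg2.log", "seg3.log"]

def segment_for(ts):
    """Return the segment file whose timestamp range contains ts."""
    i = bisect.bisect_right(segment_starts, ts) - 1  # rightmost start <= ts
    return segment_files[i]
```

A per-segment idx would then narrow the search inside the chosen file; and as noted, none of this helps queries that don't filter on timestamps.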

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 year ago +3

      Yeah, you pointed out a big issue, which is that many queries don't care about timestamps haha!
      As for q2, you can compact the write ahead log, because when rows get updated you can just keep the last value they got updated to.
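Compaction as described, keeping only the last value written per key, might look like this minimal sketch:

```python
def compact(log_entries):
    """Compact an append-only log: keep only the newest value per key.
    log_entries is an ordered list of (key, value) writes."""
    latest = {}
    for key, value in log_entries:   # later writes overwrite earlier ones
        latest[key] = value
    # re-emit as a smaller log, one entry per live key
    return list(latest.items())
```

Real systems run this in the background over closed segments and swap in the compacted file, but the core idea is just this last-write-wins pass.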

  • @slover4384
    @slover4384 25 days ago +1

    If you are using hash index to offset the location of key-values on disk, why do you need a write ahead log too?
    That's 3 writes each time (in memory to hash index, on disk twice). I don't know any database that does this.
    You can always recover the hash index for the active segment of the database on restart by reading the actual segment of key-values from disk.
    For the older inactive (i.e., read-only) segments, the database stores a snapshot of the index for that segment onto disk when the segment goes into read-only mode.
    The write ahead log is useful if you are storing actual data values in memory only without the on disk key/values.
    In general, the write ahead log for any database, even relational databases, stores changes that were made in memory that >were not< made to disk yet.
    Similar for LSM commit logs.
    If a change was made to disk, there is little value in also tracking this in a WAL or commit log.

    • @jordanhasnolife5163
      @jordanhasnolife5163  23 days ago

      What about atomicity of transactions...

    • @slover4384
      @slover4384 21 days ago +1

      I don't know a database like we are discussing that has atomic transactions - though there may be some.
      I mean a database that immediately stores the key-value on disk with an in-memory hash index pointing to offsets on disk.
      ( Note: There are some well known databases that store sorted key-value pairs on disk as SSTables that provide ACID transactions, but I don't mean those. Note: Cassandra is a database that uses SSTables but doesn't/didn't support typical transactions. But since it buffered entries in memory first before disk, it has a subset of a WAL that is just the "redo" entries. Some call that log file a commit log to make it clear that it isn't a full-blown WAL, but I usually call it a redo log. If Cassandra started supporting ACID transactions, it could possibly repurpose the commit logs and make them complete WALs. So my initial comment to you really should have said I was referring to that redo part of a WAL (which is what we were discussing, I think, in the video). That isn't really needed if you are going to persist all changes to disk immediately.)
      If such a database was created, where hashed key-values are immediately written to disk instead of being buffered in memory for a while... and you wanted to implement atomic transactions -> you could use a stripped-down version of a WAL that just has the undo entries. I would call this an undo log. That's still a subset of a typical WAL, but most people would not object to an undo log being called a WAL so long as context makes it clear. You made a great point above though, which is forcing me to write these thoughts down.
      Now, _if_ we were to introduce a WAL (i.e., with redo entries in it) to a database with immediately persisted hashed key/value pairs, the logical step is to actually not write things to disk immediately. That is, you want to spend some time buffering the hashed key/value pairs in memory with redo entries going into WAL first in case the power gets cut.
      TMI: A WAL is really a combined redo log and an undo log.
      1) Redo logs refer to the durability in ACID. They are not in general needed if writes are persisted to disk immediately, because we already get durability when we persist writes. Adding a redo log allows us to program performance boosting features into the database which amount to buffering recently modified items in memory.
      2) Undo logs refer to the atomicity in ACID. These are used to aid in aborting or rolling back partially committed transactions.
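The undo-log idea from point 2, recording each key's prior value before an in-place write so a partial transaction can be rolled back, can be sketched as a toy model (not any real database's format):

```python
class Store:
    """Toy store with an undo log for transaction atomicity."""

    def __init__(self):
        self.data = {}
        self.undo_log = []

    def begin(self):
        self.undo_log.clear()

    def write(self, key, value):
        # record the old value (None as a tombstone for "didn't exist")
        # BEFORE overwriting in place
        self.undo_log.append((key, self.data.get(key)))
        self.data[key] = value

    def rollback(self):
        # replay undo entries newest-first to restore the state
        # as of begin()
        for key, old in reversed(self.undo_log):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.undo_log.clear()
```

Commit in this toy would just clear the undo log; redo entries for durability are the separate half of a full WAL, as the comment lays out.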

    • @jordanhasnolife5163
      @jordanhasnolife5163  20 days ago

      @@slover4384 Thanks for your detailed response, I appreciate it!

  • @wingwong6973
    @wingwong6973 1 year ago +2

    Jordan - can you elaborate on why HashMap is not good for disk?
    - It only requires one disk seek to find my element in the hashmap and doesn’t require moving around the disk.
    - On the contrary, a B-Tree is worse on disk as it's O(log n), and thus you have to seek on disk log(n) times.
    - Given the above, isn’t it better to put B-Tree in memory, and hashMap on disk?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 year ago +1

      Hi Wing! Oftentimes, via a write-back cache, part of the B-tree will go in memory. While you can have the hashmap on disk, it is much slower than in-memory performance. Adjacent keys are not stored next to each other in a hash map, hence all the jumping around - also no support for range queries :(
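The "no range queries" point can be illustrated with a small sketch: keys placed by hash land in unrelated buckets, so a range query over a hash index has nothing better than a full scan (CRC32 stands in for the hash function, and the bucket count of 16 is arbitrary):

```python
import zlib

# Adjacent keys hash to unrelated buckets, so neighbors in key order
# are scattered across the hash index's storage.
keys = [f"user:{i}" for i in range(100, 105)]
buckets = {k: zlib.crc32(k.encode()) % 16 for k in keys}

def range_query(index, lo, hi):
    # A hash index keeps no key order, so a range query degenerates
    # into a scan over every key: O(n), not O(log n + k) as in a B-tree.
    return sorted(k for k in index if lo <= k <= hi)
```

This scatter is also why on-disk hash lookups "jump around": consecutive keys live nowhere near each other, so sequential-read friendliness is lost.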

    • @mercuryredd
      @mercuryredd 7 months ago

      @@jordanhasnolife5163 Just found your channel and I'm binging it cause it's so good! I've been spamming links to your vids to my friends too, I hope your channel grows and beats out the SD grifters who just read pages off of Grokking lmao

  • @shubhamjain1153
    @shubhamjain1153 several months ago +1

    Great videos, great content. I am relying highly on your videos to prepare and rock system design. But can you please improve the mic situation here?

    • @jordanhasnolife5163
      @jordanhasnolife5163  several months ago

      Is it improved in my most recently posted videos?

  • @Ynno2
    @Ynno2 8 months ago +1

    What about hash indexes with SSDs? That's between spinning disks and RAM in terms of cost and performance. We don't have the same seek time limitations with random I/O, though we still have to read at least an entire block at a time, even though maybe we only need a few bytes. For use cases where you are accessing data in a random, uniformly distributed pattern, it might beat b-trees? For range queries or other cases where sequential order matters, b-trees are obviously still going to win. Even for equality queries where reads are clustered, a b-tree may win due to better data locality/efficient use of cache.

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      Yeah people have mentioned this before. At least in the current moment, I simply don't think too many datacenters are using SSDs. I believe their random speeds are still worse than their sequential ones though.

    • @Spreadlove5683
      @Spreadlove5683 7 months ago +1

      @@jordanhasnolife5163 IIRC, when I went to make a postgres database on AWS RDS, SSD was the default and HDD was generally not recommended.

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago

      @@Spreadlove5683 Good to know! As I continue to educate myself, it seems a lot of databases are starting to use NVMe drives as well, which are just a very fast variant of SSDs; nonetheless, I think they can be abstracted away as if they're disks

  • @firezdog
    @firezdog 2 months ago +1

    Is there any way to mitigate the requirement that the keys be kept in memory, say through some kind of caching? I’m working through a toy database program right now and I noticed they divide it into an index file and a data file, so in this case it seems like they are storing all the indices on the disk… but the database itself is implemented using static hashing

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Yeah, you could store index keys in memory and the address on disk as your hash map value. That does require you to seek to the data on disk though.

    • @dwivedys
      @dwivedys 10 days ago +2

      While J has answered your question - the important thing here is to understand the trade-offs of the various options and use them accordingly. Nothing is perfect, but nothing has to be, and none of the options are absolutely insipid either. Hash indexes can give O(1), but if you store them on disk, disk seeks are going to eat away most of the benefit for which we chose the hash index in the first place

  • @u1gut
    @u1gut 1 year ago +3

    Looking back on undergrad, what other CS electives would you have taken and/or which ones would you keep?

    • @u1gut
      @u1gut 1 year ago +1

      I'm debating taking a pure-math Algorithms course, but idk if it's really worth my time (the content doesn't seem to align that well with interview-style qs)

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 year ago

      @@u1gut Truthfully whatever you're interested in! I imagine that 90% of the complicated stuff that you learn in school won't be applied on the job anyways, so just take classes that you like, and if you end up really enjoying that you can choose to specialize in it!

    • @u1gut
      @u1gut 1 year ago +1

      @@jordanhasnolife5163 Are there any courses which you are really happy you took? I guess I'm kinda trying to optimize for classes I can't just google about when I'm out of school.

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 year ago +1

      @@u1gut Sure, I mean I think you can google everything, as most of these classes have publicly available syllabi. Maybe just anything that requires you to do more hands-on work/work with partners. Truthfully, I'm not sure! In my case I wish I had taken a distributed systems class.

  • @moveonvillain1080
    @moveonvillain1080 several months ago +1

    A little confused. At the end of the last video you write that indexes improve read speeds for the specific indexed field but reduce write speeds. But at the end of this video you say hash indexes make reads and writes O(1)?
    The second statement makes more sense, but then why say the first one?

    • @jordanhasnolife5163
      @jordanhasnolife5163  several months ago +2

      Yeah hash indexes are a bit unique in that there's not a big write penalty, but they're also sort of impractical due to not being good for range queries. If you don't *need* range queries, and aren't storing a lot of data (e.g. a caching situation), a hash index could be very good for you!