The cost of Hash Tables | The Backend Engineering Show

แชร์
ฝัง
  • เผยแพร่เมื่อ 31 พ.ค. 2024
  • Hash tables are effective for caching, database joins, sets to check if something is in a list and even load balancing, partitioning and sharding and many other applications. Most programming languages support hash tables. However they don’t come without their cost and limitations lets discuss this.
    0:00 Intro
    1:50 Arrays
    3:50 CPU Cost (NUMA/M1 Ultra)
    6:50 Hash Tables
    10:00 Hash Join
    16:00 Cost of Hash Tables
    20:00 Remapping Cost Hash Tables
    22:30 Consistent hashing
    Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
    database.husseinnasser.com
    Introduction to NGINX (link redirects to udemy with coupon)
    nginx.husseinnasser.com
    Python on the Backend (link redirects to udemy with coupon)
    nginx.husseinnasser.com
    Become a Member on TH-cam
    / @hnasr
    🔥 Members Only Content
    • Members-only videos
    🏭 Backend Engineering Videos in Order
    backend.husseinnasser.com
    💾 Database Engineering Videos
    • Database Engineering
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    Gears and tools used on the Channel (affiliates)
    🖼️ Slides and Thumbnail Design
    Canva
    partner.canva.com/c/2766475/6...
    Stay Awesome,
    Hussein
  • วิทยาศาสตร์และเทคโนโลยี

ความคิดเห็น • 58

  • @hnasr
    @hnasr  2 ปีที่แล้ว +5

    Head to database.husseinnasser.com for a discount coupon to my Introduction to Database Engineering course . Link redirects to udemy with coupon applied.

  • @andydataguy
    @andydataguy 2 ปีที่แล้ว +5

    I love these long-form audio breakdowns. Very natural. Please keep it up. Inch wide mile deep conversations! Just got a new job and going to subscribe to your channel as soon as I get my first check 🙏🏾🙌🏾

  • @EngSamieh
    @EngSamieh 2 ปีที่แล้ว +2

    Thanks for talking about hash tables.
    I think the answer to your question about extending hash table beyond system memory is memory paging (virtual memory), the OS does that on behalf of you (handle page fault exception), and swap memory pages between ram and disk.
    One another thing, DBMS uses hash tables in desk (indexing), and it is very primitive syscall to random access a position in file on desk using fseek syscall.

  • @gyroninjamodder
    @gyroninjamodder 2 ปีที่แล้ว +13

    18:02 isn't that just the idea of swap. You advertise that a bunch of memory is available but if you access something not actually in memory it will have to load the page into memory.

  • @sfsf285
    @sfsf285 2 ปีที่แล้ว +4

    Your videos are always helpful, refreshing and brain opening, thanks a lot man i swear whenever im financially able ill buy all of your courses and memberships

  • @johnrush607
    @johnrush607 2 ปีที่แล้ว

    Awesome talk!! Would love to hear you talk more about consistent hashing and the origins of it in various databases :)

  • @arjundureja
    @arjundureja 2 ปีที่แล้ว +3

    Very interesting. I'd love to hear more about limitations associated with specific data structures in the real-world

  • @SohomBhattacharjee
    @SohomBhattacharjee 2 ปีที่แล้ว

    this is my fav youtube channel !! Thank you

  • @botenjohn1752
    @botenjohn1752 2 ปีที่แล้ว

    Yes!! this is really a topic I need refresh on . Thank you!

  • @teunohooijer6788
    @teunohooijer6788 2 ปีที่แล้ว +2

    there are now array databases which put thete data on disk or S3 (cheaper storage). there useful for (sparse) multidimensional data. An example of the database is TileDB.

  • @techienomadiso8970
    @techienomadiso8970 2 ปีที่แล้ว +4

    Great content. Would be so awesome to see some visuals...as you explain this interesting stuff.

  • @CjqNslXUcM
    @CjqNslXUcM 2 ปีที่แล้ว

    I love your mannerisms and presentation.

  • @longtranbao4177
    @longtranbao4177 2 ปีที่แล้ว

    Loved it! Waiting for consistent hashing

  • @CraneArmy
    @CraneArmy 2 ปีที่แล้ว +4

    I remember reading something a while back, about one of intel's attempts at non-volitile RAM (I think this was optane, before they dumped the tech and used the branding for something else).
    The write up was about the thermodynamic limitations of non-volitile storage at ram speeds, probably given some voltage requirement (i dont remember, but it seems right).
    Basically stated that non-volitile memory at ram speeds (probably given the other specs of the product announcement) would need to be like an order of magnitude better than the theoretical performance limit required to store data.
    I didnt do the work to verify it, but it seemed convincing enough for someone who doesnt eat and breathe in electrical engineering.
    Since then, ive been pretty cynical of anyone claiming they are going to do it. Everyone seems to agree its the holy grail among a certain subset of smart people, they all want to do it, and the announcements seem to come and go never delivering a product.
    Might be an interesting topic for a video too, I'd watch to get an update.

  • @susmitvengurlekar
    @susmitvengurlekar 2 ปีที่แล้ว

    Thought about disk based direct access which avoids index scan by getting / constructing the file path. But file locks when writing when it is being read by many is a tricky part.

  • @hjups
    @hjups 2 ปีที่แล้ว +2

    When you said that disks are not byte addressable, were you referring to the restriction of the API / file system?
    Internally, flash is byte addressable (and technically so is a spinning disk). The caveat is that the access time is substantially higher for NVM verses volatile memory.
    Also, memory indirection via pointers can still be very expensive (I wouldn't put so much faith in "optimizations" by the hardware vendors). For example, if the pointer exists on a different CPU socket, or if you are skipping between L3 and DRAM, or even a swapped page (to disk) and DRAM. There's some tradeoff in performance between embedded fields (within the table) and table size, which also comes into play with some of these specialized NVM key-value pair accelerators (because of the access latency).

  • @thesyd1982
    @thesyd1982 2 ปีที่แล้ว

    Mr you surprise me every episiode :D, thanks fo all , allah ya7afedk

  • @anon_y_mousse
    @anon_y_mousse 2 ปีที่แล้ว

    This is why the lessons on hash tables need to be seriously rethought. The only part of a hash table that actually needs to be an array is the index table itself, but each entry in the table can be a bucket, or as needed point to a new server if you're using it for that purpose. Most databases, at least as far as I am aware actually use trees and things are already sorted in by each relational piece of data. This type of tree is generally called a graph. The implementation of hash tables is actually pretty open, and I've implemented several versions where the data is stored in a tree and the index table just helps to jump to a subtree faster. For trees I'll generally use a deque to allocate decent sized blocks of nodes and memory management is pretty easy that way. I arrange the individual deques in two different trees themselves. On the subject of implementing them as arrays, since in that form they don't need to be sorted, if you store the unmasked hash key, as I do, you can swap the element to delete with the last element and patch one index then decrement the count. TLDR: hash tables don't have to be simple arrays anymore with merely one set of characteristics.

  • @timtreichel3161
    @timtreichel3161 2 ปีที่แล้ว

    Okay, this was kinda interesting. I don't really know anything about data base programming so the insights there were cool. Personally the point of adding/deleting entries was explained pretty poorly. I don't know what strategies are normally used for handling collisions in data base programming, but when you have a hash table you probably have a strategy to handle collisions. So let's say you create a hash table of size 100 and insert 100 elements, there probably will already be collisions. At some point the collisions get so many, that the access time is becoming slow and you want to increase the size of the hash table. And when you do this, you don't want to increase the size from 100 to 101 but probably double it to 200 if not more. At least this it the common approach for dynamic arrays and you already said, that hash tables are array's. Of course it is true that you have to rehash everything with the new size of the hash table, which sadly should be slower as a normal dynamic array resize and copy.

  • @prakanshuray7595
    @prakanshuray7595 ปีที่แล้ว

    If it's a talk , it could go as a podcast as well.

  • @InsightByte
    @InsightByte 2 ปีที่แล้ว +33

    would be gr8 if you would introduce some visuals in your "lectures" - you cover a lot and is hard to keep up

    • @tridao1273
      @tridao1273 2 ปีที่แล้ว +7

      This is a podcast, not a "video"

    • @pemessh
      @pemessh 2 ปีที่แล้ว

      i second this !

    • @dabbopabblo
      @dabbopabblo 2 ปีที่แล้ว +5

      Why, you should be programming while listening, not wasting time watching

    • @arod3295
      @arod3295 2 ปีที่แล้ว +1

      Visuals please!!!

  • @TheHighclick
    @TheHighclick 2 ปีที่แล้ว

    Hi Hussein.
    I started studying Front-End Development, after 5 months of school i managed to land a Jr job.
    I am now in a situation where I work fulltime, study frontend full time, but I would like to learn backend as I find it a lot more interesting.
    What backend project would you recommend if I want to get into backend engineering?
    Thanks for your content

  • @mfbalgo
    @mfbalgo 2 ปีที่แล้ว

    We love you Hussein

  • @sohansingh2022
    @sohansingh2022 9 หลายเดือนก่อน

    Thanks 😮

  • @night-bird1788
    @night-bird1788 ปีที่แล้ว

    For collision Point, there are many usefal approcahes one of these is chain each node or address with a linked list. appearntly the more collision the more time it takes to track the value, but if the entired data is large enough then the best way is to the store using the pair concept it consums more memory but O(1) to get or insert. so why are there collisions to avoid large arrays as the addition process takes time so reduce the primary table size and then add sublists to the main table then the addition process consumes less EX:) 100+100=200 takes 2 seconds but (50+50)+(50+50)=200 takes less time.

  • @rsroyalrahul5
    @rsroyalrahul5 2 ปีที่แล้ว

    Video on consistent hashing please

  • @varunvishwakarma2901
    @varunvishwakarma2901 2 ปีที่แล้ว

    Today's best half an hour spent.

  • @kaushikkumarbora
    @kaushikkumarbora ปีที่แล้ว

    @hnasr Arrays are Contiguous and not continuous.

  • @nikilragav
    @nikilragav 2 ปีที่แล้ว

    Wow that array is just a form of a hash table is super interesting

    • @nilanjanmukhopadhyay8369
      @nilanjanmukhopadhyay8369 2 ปีที่แล้ว

      No array isn't a form of Hash Table. To build hash table you need an array and a hash function at least.

  • @realabodyify
    @realabodyify 2 ปีที่แล้ว

    Sometime when my boss asks me to generate some type of report, I use hashtable for more efficiency. Without hashtable I need to loop 1 million time trying to filter the array to get the result I want, but with hashtable it is going to faster and way way less looping.

  • @SSJ1Igor
    @SSJ1Igor 2 ปีที่แล้ว

    Where can I learn about all the different backend related concepts that you mention in the begining like caching, partitioning, database joins, sharding and so forth? is there somewhere a list of all the different concepts that a backend engineer should be familiar with? Sorry for the offtopic question but i need to know and thank u for ur videos i love your channel !

    • @SSJ1Igor
      @SSJ1Igor 2 ปีที่แล้ว

      i just need like a list of these so i can start researching them one by one and trying out

    • @RyanKOnk
      @RyanKOnk 2 ปีที่แล้ว

      Checkout this channel's playlists. Hussein has a lot of content categorised,so, you may find what you are looking for.

    • @SSJ1Igor
      @SSJ1Igor 2 ปีที่แล้ว

      @@RyanKOnk thank you boo boo

  • @amans6504
    @amans6504 ปีที่แล้ว

    I had this doubt from long and now i got clear. So basically computers literally doesnt know anything, while using arrays or hash tables, we are just doing the operational cost to retrive the first element of the array and it just adds the index so it exactly knows the address of the element we are looking for, this is literally same as just accessing a primitive value.

    • @hnasr
      @hnasr  ปีที่แล้ว

      Exactly right

  • @mohitrathore8808
    @mohitrathore8808 ปีที่แล้ว

    It would be more easier to understand if you include more animations visuals.
    Rest is awesome

  • @nikilragav
    @nikilragav 2 ปีที่แล้ว

    Does anyone implement arraylists as a hash table wherein the address is dependent on the value so it doesn't need to be a consecutive block anymore?

  • @naveedhassan7399
    @naveedhassan7399 2 ปีที่แล้ว +1

    HashTables are glorified arrays. Amazing insight.

  • @arod3295
    @arod3295 2 ปีที่แล้ว +1

    Visuals please!!!

    • @arod3295
      @arod3295 2 ปีที่แล้ว

      So many gems

  • @snake1625b
    @snake1625b 2 ปีที่แล้ว +1

    I thought databases use trees For the index, not hash tables

    • @kwakubiney5175
      @kwakubiney5175 2 ปีที่แล้ว

      Trees solved certain problems the hash tables were facing, for example, the hash tables had to be stored in main memory so that’s a constraint.

    • @nikilragav
      @nikilragav 2 ปีที่แล้ว

      @@kwakubiney5175 doesn't the tree also need to be? It has n nodes for n items in db plus it has a bunch of extra pointers

    • @kwakubiney5175
      @kwakubiney5175 2 ปีที่แล้ว

      @@nikilragav Writes and reads are done on pages and these pages don’t exist in main memory but on disk.

    • @nikilragav
      @nikilragav 2 ปีที่แล้ว +1

      @@kwakubiney5175 ok yes, so what does that have to do with the tree? Are you saying that each node of the tree represents a page? And that you load one page at a time in memory?

    • @nikilragav
      @nikilragav 2 ปีที่แล้ว

      I don't think that directly addresses what Hussain was saying. He was saying it is possible that the hashtable itself is too big to fit in memory. If that is the case, then a tree is not going to be any smaller...

  • @TehPwnerer
    @TehPwnerer 2 ปีที่แล้ว

    Disk = slow memory

  • @richard-social8125
    @richard-social8125 2 ปีที่แล้ว

    Long way of saying you know too much: "We can probably make a whole other video about this topic I don't wanna get into right now."