What Are Bloom Filters?

  • Published Sep 21, 2024
  • Why are bloom filters such useful data structures? How do they work, and what do they do? This video is an introduction to the bloom filter data structure: we'll explore what they are, how they work, and build an understanding of why they're so efficient.
    ***
    Spanning Tree is an educational video series about computer science and mathematics. See more at spanningtree.me
    To be notified when a new video is released, sign up for the Spanning Tree mailing list at spanningtree.s...
    Spanning Tree is created by Brian Yu. brianyu.me/
    Email me at brian@spanningtree.me to suggest a future topic.
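
The structure described above can be sketched in a few lines of Python. This is an illustrative sketch only (not code from the video): the array size, the number of hash functions, and the trick of salting one SHA-256 hash to simulate k independent hashes are all arbitrary choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-slot bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Simulate k independent hash functions by salting one SHA-256 hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("Hamlet")
print(bf.might_contain("Hamlet"))  # always True: no false negatives
```

A lookup that returns False means the item was definitely never added; a True may occasionally be a false positive, when other items happen to cover all k positions.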

Comments • 236

  • @randomyoutuber4500
    @randomyoutuber4500 1 year ago +114

    For some people, this may look crazy and inefficient.
    The bookstore is just an example. Bloom filters are often used for anti-virus scanning, malicious-URL detection, etc., where speed is an important factor. The situation is like having billions and billions of books in the shop with a new customer coming in every millisecond: storing and retrieving data is what slows computers down, while running the hashes costs almost no time.

    • @travcollier
      @travcollier 1 year ago +6

      Bloom filters are also used a lot in genomics. The "bookstore" is all the different sequences of a given length (we call them kmers) in the sequencing data.

  • @jamesdaus
    @jamesdaus 4 years ago +511

    If you re-hashed the final copy and flipped all its bits off, you'd risk marking a book that is in stock as not in stock because of an overlap. My first thought is to hold a counter on each number: for example, #221 would count how many times it's been flipped on, and when you take out a book, you drop it down by 1. This allows two books to overlap and one to be removed without a problem, but maybe it's a bit memory-excessive? Is there a better known solution?

    • @SpanningTree
      @SpanningTree  4 years ago +337

      Exactly the right idea! This is what's known as a "Counting Bloom filter."

    • @ashwinalagiri-rajan1180
      @ashwinalagiri-rajan1180 2 years ago +13

      This is called a Count Min Sketch

    • @jeremycohn9992
      @jeremycohn9992 2 years ago +15

      What is meant by overlap? Do you just mean a collision?

    • @jamesdaus
      @jamesdaus 2 years ago +30

      @@jeremycohn9992 Two books sharing at least one bit, so yes, a collision

    • @leonm8906
      @leonm8906 1 year ago +17

      @@SpanningTree how would you handle an overflow? If you only use 8 bits per "bit", that might overflow rather quickly, but using 32 bits would be less space-efficient, giving you way more cache misses. On a database server I think this would matter a lot, right? Since you could have thousands of search queries per minute, if not many more.
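
The scheme this thread converges on (the "Counting Bloom filter" confirmed by the channel above) might look like the following in Python. This is a hypothetical sketch: the sizes, the salted-SHA-256 hashing, and the 8-bit saturation cap (one answer to the overflow worry raised above) are illustrative choices, not a reference implementation.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter with small counters instead of bits, so items can be removed."""

    def __init__(self, m=1024, k=3, max_count=255):
        self.m, self.k, self.max_count = m, k, max_count
        self.counts = [0] * m

    def _positions(self, item):
        # Simulate k independent hash functions by salting one SHA-256 hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            if self.counts[p] < self.max_count:
                self.counts[p] += 1  # saturate instead of overflowing

    def remove(self, item):
        # Only call for items known to be present; never decrement a
        # saturated counter, since its true count is unknown.
        for p in self._positions(item):
            if 0 < self.counts[p] < self.max_count:
                self.counts[p] -= 1

    def might_contain(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter()
cbf.add("Hamlet")
cbf.add("Macbeth")
cbf.remove("Hamlet")
print(cbf.might_contain("Macbeth"))  # True: unaffected by removing Hamlet
```

Slots stuck at the cap can no longer be decremented safely, so a real implementation rebuilds the filter if that ever matters; with 8-bit counters and a reasonable load it rarely does.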

  • @mazharmumbaiwala7451
    @mazharmumbaiwala7451 1 year ago +221

    It's a shame you're not making videos any more; you are an amazing creator.
    Hope to see more new content someday soon

    • @AidenC.
      @AidenC. 1 year ago +14

      New video 2 months ago

    • @rhysbaker2595
      @rhysbaker2595 1 year ago +15

      Looks like the channel is starting to pick up some attention. If he keeps posting semi-regularly, I can see this being a pretty successful channel

    • @jaywebster624
      @jaywebster624 1 year ago

      is that like 5 euros in EU

    • @vishal-shinde
      @vishal-shinde 4 months ago

      @@jaywebster624 It's 0.44 euros, I think. He donated in Indian rupees

  • @pratikkarangle
    @pratikkarangle 2 years ago +175

    Why did you stop making videos? You are just too good at explaining, please keep posting... I love your narration, your animation, basically everything about your videos.

    • @Galakyllz
      @Galakyllz 1 year ago +6

      I wish he would keep making these excellent videos. They're great!

    • @RandomPerson-id5rb
      @RandomPerson-id5rb 1 year ago +2

      well

    • @rhysbaker2595
      @rhysbaker2595 1 year ago +14

      I don't think he has stopped; these sorts of videos can take a very long time to produce, especially if you have another job or other commitments

    • @erner_wisal
      @erner_wisal 1 year ago +4

      @@LOLWHATBRO it's been 3 months since the last vid

    • @ananttiwari1337
      @ananttiwari1337 1 year ago +2

      @@LOLWHATBRO His video about Propositional logic is 3 months old

  • @grapetoad6595
    @grapetoad6595 1 year ago +9

    I came into this video thinking it was going to be about the graphical processing of bloom, I was then pleasantly surprised by the actual content.

  • @carlosdelamorenacoco8715
    @carlosdelamorenacoco8715 6 months ago +4

    This was the clearest and simplest explanation of such a complicated concept. Hats off!

  • @starman4572
    @starman4572 1 year ago +4

    Green Robot: "Can I get this book please?"
    Blue Robot: "We're all out."
    Green: "Can you check?"
    Blue: "...no."

  • @Lehander100
    @Lehander100 3 years ago +17

    You've just seen the most underrated explanation

  • @themannyzaur
    @themannyzaur months ago +1

    went back to the wikipedia article and now things are clicking
    this is such a fantastic explanation!

  • @mikezhao1838
    @mikezhao1838 2 years ago +8

    This is such a fantastic illustration of the bloom filter. I instantly got the idea after watching the nice robot moving around (or not). The follow-up question is interesting too. Sad that there have been no new videos recently, but keep up the good work!

  • @sanjaymatsuda4504
    @sanjaymatsuda4504 1 year ago +16

    My first thought is to make a "negative" bloom filter. That is, keep the normal bloom filter, but also have a second filter whose bits you'll only turn on when the last copy of a book is removed. When someone asks for a book, you'd first check the negative filter: if all the bits corresponding to that book are turned on, that means you don't have the book. Otherwise, proceed to look at the normal filter: if all the bits corresponding to that book are turned on, that means you have the book.
    Though of course, if selling out a book is a frequent occurrence at the store, there will be collisions in the negative filter, causing false negatives, where you end up claiming to not have a book that you in fact do have.

    • @psychopompous489
      @psychopompous489 1 year ago +2

      But what if you want to re-add a book that's previously been removed?

    • @sanjaymatsuda4504
      @sanjaymatsuda4504 1 year ago +1

      @@psychopompous489 That would be the time to completely reset both filters and repopulate the normal bloom filter with the hashes of all the books you have, plus the re-added book.
      Either that, or have a "double-negative" bloom filter where you turn on only the bits corresponding to books that are back in stock after having been sold out previously.

    • @benjackson7690
      @benjackson7690 1 year ago +1

      @@sanjaymatsuda4504 That wouldn't be a good idea. We can't just keep adding bloom filters, because at a certain point we do need a way to remove an item from the bloom filter. Also, we can't repopulate the bloom filter, because it would be incredibly inefficient to do that every time we sold out of a book. We need some way to reset the bits only if no other books are triggering them, but without having to search through our entire catalogue of books

    • @user-jn4sw3iw4h
      @user-jn4sw3iw4h 1 year ago

      @@benjackson7690 Usually true, and indeed not the solution hinted at in the video.
      However, it's not universal; there's a trade-off in consideration here. For example:
      - when will you be re-adding a book?
      - how big is the collection?
      - which errors are acceptable?
      Take the bookstore example from the video: usually there is a single, specific moment when books are added to the collection, and it usually isn't during opening hours. If your collection is small enough (or the hash fast enough, which it should be: speed is the third thing a hash is optimized for, after 'same input _must_ give the same output' and 'different inputs _should_ give different outputs'), then scheduling a re-indexing session after taking in new inventory is an option. It's not faster than the 'counting bloom filter' solution that was the intended answer, but if the count can exceed 3 it is smaller.
      Also, which errors are acceptable? This solution has the downside that a 'customer returns a book' event won't show up in the system until tomorrow's re-indexing (likely a sufficiently rare occurrence that keeping the full list of those few books at the desk is a viable strategy: check a list of 3 books while waiting for the computer to run 5 hash functions). The original lets errors caused by stolen books or other administrative mistakes linger forever, while this one resolves them at every delivery.

  • @firebraingames
    @firebraingames 3 years ago +11

    Great channel, good animations, good narration. Simply explained.

  • @nguyenduy-sb4ue
    @nguyenduy-sb4ue 2 years ago

    OMG, arguably the best explanation of a Bloom filter I have ever come across

  • @nkugwamarkwilliam8878
    @nkugwamarkwilliam8878 1 year ago

    This is one of the most underrated channels on YouTube. Your content is amazing

  • @wissemaljazairi
    @wissemaljazairi 1 year ago +12

    Replace the true/false values in the bloom filter structure with a counter that is incremented whenever a new entry is added to the filter and decremented whenever an existing element is removed. This way the counter returns to 0 when we remove the last book: 0 is equivalent to false, and > 0 is equivalent to true.

    • @jolie1543
      @jolie1543 1 year ago

      that takes much more space tho

    • @clayupton7439
      @clayupton7439 1 year ago

      @@jolie1543 smaller data type, or build your own custom data type by stacking bloom filters until the stack represents the maximum value in base 2, e.g. 3 stacked bloom filters can represent a maximum of 7 duplicates: 001 010 011 100 101 110 111

    • @AntonioNoack
      @AntonioNoack 1 year ago +1

      @@jolie1543 You could use saturated addition, and maybe 2-4 bits for each entry. If an entry is deleted with a saturated counter (the hope being that this occurs very rarely), you'd have to rebuild the whole index.

    • @localidiot4078
      @localidiot4078 1 year ago

      @@jolie1543 not really. A computer associates bools with 8 bits because that's the smallest piece of data it can work with. Also, off the top of my head, I think the bloom filter size just needs to grow with the size of the array to the root of however many hash functions you have, which can be arbitrarily large.

    • @localidiot4078
      @localidiot4078 1 year ago

      @@AntonioNoack I don't think you can reconstruct this data structure; information is lost.

  • @quemaspana
    @quemaspana 2 months ago

    This is a really great explanation.

  • @SenoritaTJ
    @SenoritaTJ 6 months ago

    Thanks for explaining it in this fun and easy way. The funny thing is, I just finished my second week in a bookstore as a salesperson.

  • @hamzahal-qadasi1771
    @hamzahal-qadasi1771 2 years ago

    One of the best explanations for the Bloom filter.

  • @iwersonsch5131
    @iwersonsch5131 1 year ago +2

    Instead of a single bit, you can assign each element in your bloom filter a variable ranging from 0 to 255. When you add a book, you add 1 to each associated variable, and when you sell the last copy of a book, you subtract 1. This uses 8x as much data but scales almost infinitely.

    • @iwersonsch5131
      @iwersonsch5131 1 year ago

      If a value is at 255, tough luck, you can't add or subtract from that variable anymore. Let's hope that never ends up leading you to run to the shelf needlessly

  • @raz0229
    @raz0229 3 years ago +19

    Best explanation so far for beginners to understand. Subbed!

  • @dhivyashri6554
    @dhivyashri6554 2 years ago

    i have my finals tomorrow and because of you i'll at least know enough to bs, thank you so much

  • @heap-sort
    @heap-sort 1 year ago

    The best explanation I have ever seen. Thank you.

  • @damamoot2291
    @damamoot2291 1 year ago +3

    The problem is that removing bits when the last copy of a book is sold means that if any other book you have shares any of the removed bits, your check would make it seem like you don't have that book, even though you do.
    My idea was to count how many times each bit was activated: if you get a new book and one of its bits is already activated, you just record that it is the second time activating the bit, and when one of the books sells out you reduce the counter for each of that book's bits by 1. Then the bits shared by multiple books will still be on after one of those books sells out, and you will still keep track of the other book(s). (I checked other comments, and they seem to confirm that this is a valid solution.)
    P.S: I know my explanations suck, I am autistic, lol.

  • @shreyash245
    @shreyash245 6 months ago

    Thank you ☺ it was the clearest explanation

  • @nguyenthanhdat93
    @nguyenthanhdat93 3 years ago

    Hey, just wanna say thank you. Superb intuition for bloom filter

  • @GoddamnAxl
    @GoddamnAxl 2 years ago +16

    How about requiring only a percentage of bits to be on for a positive? Then when you delete an item you just turn all of its bits off. Also, when someone comes in and you find a true positive, you turn on all its corresponding bits to make up for the potentially mistakenly turned-off bits. The upside of this method is its minimal memory footprint; the downside is requiring more hash functions (more CPU). Also, it's possible that there can be false negatives. How do you improve this?

  • @zg7897
    @zg7897 2 years ago

    best video explaining bloom filters!!!

  • @sammokkabasi8957
    @sammokkabasi8957 3 years ago +9

    I'd use three different arrays, one for each hash function, to make the bloom filter better.

    • @valoeghese
      @valoeghese 2 years ago +3

      Uses more memory, but definitely reduces the chance of a collision

    • @localidiot4078
      @localidiot4078 1 year ago

      @@valoeghese There is probably some law in statistics that explains this in math terms, but I think it's equivalent to increasing the array size by 3, or some constant. It's not quite true, but splitting into 3 arrays doesn't actually improve much, because of the birthday paradox, which is what you're really fighting.
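
The one-array-per-hash-function idea in this thread is known in the literature as a partitioned Bloom filter: each hash function owns its own slice of bits. With the same total number of bits, its false-positive rate is close to (in theory very slightly worse than) the single-array version. A minimal sketch, with all sizes chosen arbitrarily:

```python
import hashlib

K = 3                 # number of hash functions
M_PER_HASH = 341      # bits per partition (~1024 bits total)

# One independent bit array per hash function.
partitions = [[False] * M_PER_HASH for _ in range(K)]

def positions(item):
    # Simulate K independent hash functions by salting one SHA-256 hash.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
        yield i, int(digest, 16) % M_PER_HASH

def add(item):
    for i, p in positions(item):
        partitions[i][p] = True

def might_contain(item):
    return all(partitions[i][p] for i, p in positions(item))

add("Hamlet")
print(might_contain("Hamlet"))  # always True: no false negatives
```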

  • @arif3012
    @arif3012 2 years ago

    The best explanation so far 1:40 onwards

  • @flargensnirkel
    @flargensnirkel 1 year ago +1

    Yay, going over inserting a book title into 5 giant machines that I have to stand on a stool to reach, and counting a series of hexagons to make sure all five of the numbers are turned on, will be so much faster than pressing Ctrl+H in a Google Doc

  • @styrex4261
    @styrex4261 1 year ago +1

    I work at AutoZone and always wondered how their inventory lookup was so fast, considering the size of it.

  • @JB-ev8lt
    @JB-ev8lt 2 years ago +1

    Very well explained and enjoyable animation to watch!

  • @GoodLuck-dv2zu
    @GoodLuck-dv2zu 4 months ago

    Amazing explanation. Appreciate your effort!

  • @samirrimas
    @samirrimas 2 years ago +1

    Very well explained mate thank you

  • @mjcox242
    @mjcox242 1 year ago +6

    The problem I see with one bloom table and multiple hash functions is that it will eventually increase collisions: say you had 100 bits and 50 hash functions; the bloom filter doesn't care about order, so within the space of 5 books there's a good chance you'd have filled up the filter.

    • @localidiot4078
      @localidiot4078 1 year ago

      I agree; however, if you have a big enough space this isn't a problem. You could just add a constant multiple to the size of the array to offset this.
      Perhaps just the number of hash functions + 1

    • @benjackson7690
      @benjackson7690 1 year ago +1

      You're right that more hash functions cause the filter to fill up faster, but that's why it's important to build a filter large enough. When calculating how large a filter should be, we need to consider both how many items we are storing and how much space they take up on the filter. This is entirely up to the programmer, although there are suggested formulas to determine size

  • @danielmelville1729
    @danielmelville1729 11 months ago +1

    Great explanation, I appreciate the help :)

  • @KelvinShadewing
    @KelvinShadewing 1 year ago +3

    Did anyone else get recommended this and thought it had to do with graphics?

  • @zc4060
    @zc4060 1 year ago +1

    I think increasing the number of hash functions can actually increase the chance of collision. Assume you have an infinite number of hash functions: then adding one "book title" to your bit grid would set all bits to one. The next time you look up another "book title", you will absolutely get a positive, whether you have it on the shelf or not.

  • @anthonyto3
    @anthonyto3 2 years ago

    Fantastic explanation and animation. Thank you so much!

  • @deepaksoundappan3244
    @deepaksoundappan3244 2 years ago

    Beautiful work 👍 great effort went into making this video

  • @user-jn4sw3iw4h
    @user-jn4sw3iw4h 1 year ago

    5:00
    This is the point where trade-offs come into play. Which costs more time:
    - performing 5 hashes (instead of 3) for every incoming request, or
    - going into the back to 'fail to get' the book once in approximately every 10,000 requests?

  • @lastmomenttuitions2.067
    @lastmomenttuitions2.067 9 months ago

    Amazing Explanation

  • @noviian3092
    @noviian3092 1 year ago +3

    without audio, it seems as if two people are playing a board game but one player is trying to cheat with an increasingly ridiculous amount of machines

  • @derekhu2678
    @derekhu2678 1 year ago

    Hi Brian, love your videos! Keep it up!

  • @cmyk8964
    @cmyk8964 1 year ago +1

    You would ideally feed in the book's ISBN instead of something prone to human error or variance like a title, because a hash function will return a completely different result for "Hamlet" vs. "hamlet" vs. "Shakespeare's Hamlet", for example.

    • @benjackson7690
      @benjackson7690 1 year ago

      The bookstore was more of an example than something real. If it were real, you could still probably use the book's title, but you would have to implement some algorithm to clean up the user input, of course

  • @anuragbisht1200
    @anuragbisht1200 1 year ago

    Important point: a bloom filter is a probabilistic data structure. There can be false positives but no false negatives.

  • @gregsak707
    @gregsak707 3 years ago

    Great video, thank you!

  • @mohamedadnanealkhiati5006
    @mohamedadnanealkhiati5006 4 years ago +2

    I don't understand why we would actually need this algorithm. I mean, if we were to code this, why wouldn't we just store the books instead of bits? Aren't we just adding more steps? Wouldn't this just be a simple hashtable that points to books?
    I think I am missing something, but I don't know what exactly.
    Thank you for these amazing videos, Brian. I am a fan!
    edit: Or is it for when we already have a data structure not implemented as a hash table, but we still want to check membership efficiently?

    • @SpanningTree
      @SpanningTree  4 years ago +8

      The motivation here is memory efficiency - you certainly could just store all the books themselves, but that would take far more memory than storing the bits needed for the bloom filter. For storing large amounts of data, this memory efficiency can matter quite a lot!

    • @misiekeloo6114
      @misiekeloo6114 4 years ago +1

      @@SpanningTree medium.com/analytics-vidhya/cbfs-44c66b1b4a78 Also this article helped me to understand the idea of avoiding collisions during removing (the task at the end of the video).

    • @arpit-jain
      @arpit-jain 1 year ago

      Even I have the same question. Memory is so cheap nowadays, and compact as well. Even if we save 1 million books in a hashtable or another data structure, the memory is still affordable.
      And searching through a million records with today's processors' computing power won't take more than a second

    • @randomyoutuber4500
      @randomyoutuber4500 1 year ago +4

      @@arpit-jain The bookstore is just an example. Bloom filters are often used for anti-virus scanning, malicious-URL detection, etc., where speed is an important factor. The situation is like having billions and billions of books in the shop with a new customer coming in every millisecond: storing and retrieving data is what slows computers down, while running the hashes costs almost no time.

  • @rokaq5163
    @rokaq5163 1 year ago

    Seeing English speakers call it "Don Quixoite" always cracks me up. It's way more convoluted than the original name, for no apparent reason other than translation errors. It's "Don Quijote", and it's pronounced like so:
    "don key hoh teh"
    The first "h" in "hoh" should be exaggerated, as if you had something stuck in your throat and were trying to spit it out. That's the Spanish sound associated with the letter "j".

  • @Matyanson
    @Matyanson 1 year ago +2

    I know this is supposed to represent the bits in computer memory... but in the physical example you would still need to search among the bits: for example, find the position of the 100th bit, and only then could you determine whether it is on or off.
    That would be similar to having a sorted table with the books' names.

    • @StreamlineDeet
      @StreamlineDeet 1 year ago +1

      The thing is, it's a lot easier to find a specific bit than a specific book. If you are looking for the 50th bit, you don't need to do any searching, you just need to do a little bit of math to arrive at the right spot. But if you are looking for the book "Frankenstein", you have no idea where it is in the list, so you have to do a search.

    • @akashneelraychaudhuri5687
      @akashneelraychaudhuri5687 1 year ago

      Searching is different from accessing via an index. With an index, you can usually go directly to that memory location; searching generally involves going through multiple memory locations.

    • @benjackson7690
      @benjackson7690 1 year ago +1

      A couple of others have already mentioned why indexing is better than searching, but also consider the fact that he said the time it took to walk from the shelves and back was long. Imagine you don't have a constant array of this data, but instead have to go elsewhere to access it (like connecting to a web server). This is a long process, so you want to do it as little as possible, which means you need to know you have it before you go look.

  • @Llorx
    @Llorx 1 year ago

    Great video. Still this increases the amount of branching, which is also slow. Benchmarks are needed for each situation.

  • @fishum6483
    @fishum6483 1 year ago +2

    "Look at this, your piece of paper is spanning the entire desk, not very space-efficient; better stick three 1x1x3-metre machines in the room and fill half your desk with a hex grid"

  • @lare290
    @lare290 1 year ago +5

    wouldn't increasing the number of hash functions eventually lead to _more_ false positives? imagine the extreme case where you have as many hash functions as you have bits in the filter, and each function gives a different output, therefore one title will turn on every bit on the filter, subsequently making _every_ book come up positive.

    • @ccflan
      @ccflan 1 year ago

      @@leeroyjenkins0 I was thinking the same thing, so this means you should have an idea of how large your library will get before choosing your filter's size in bits, unless you dynamically resize the table when needed
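
The intuition in this thread matches the standard analysis: with m bits, n stored items, and k hash functions, the false-positive rate is approximately (1 - e^(-kn/m))^k, which is minimized near k = (m/n)·ln 2. A quick check in Python (the m and n values are purely illustrative):

```python
from math import exp, log

def false_positive_rate(m, n, k):
    """Standard approximation of a Bloom filter's false-positive rate."""
    return (1 - exp(-k * n / m)) ** k

m, n = 1024, 100                    # 1024 bits, 100 stored items
best_k = round((m / n) * log(2))    # optimal hash count, here 7
for k in (1, 3, best_k, 20, 50):
    print(f"k={k:2d}  fpr={false_positive_rate(m, n, k):.4f}")
# The rate falls until k reaches ~7, then climbs again: past the optimum,
# extra hash functions just fill the array, exactly as suspected above.
```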

  • @danielkaczmarczyk2482
    @danielkaczmarczyk2482 3 years ago

    Great channel - love your videos.

  • @noyz-anything
    @noyz-anything 1 year ago +1

    I'd use one bloom filter for each hash function, rather than just having the singular grid. that way, there's even less of an overlap risk

  • @akulaharshitha1714
    @akulaharshitha1714 2 years ago

    best explanation

  • @youtubeuniversity3638
    @youtubeuniversity3638 1 year ago

    5:00 Could also make separate blooms for each hash perhaps.

  • @Tooley641
    @Tooley641 3 years ago

    Well done!

  • @ultralowspekken
    @ultralowspekken 1 year ago

    Wouldn't using multiple hashes become redundant at some point? Let's say, for example, you have book A that uses bits 1 and 2, book B that uses bits 1 and 3, and book C that uses bits 2 and 3. If books B and C are present, then book A would give a false positive, something that would hardly have happened if only one hash were used. So there is a correct ratio of hashes per book somewhere.
    I wonder if a solution might be using hashes for different aspects of a book rather than the title only:
    a hash for the title, a hash for the author, a hash for the publisher, a hash for the language, and so on. It would reduce the risk of same-named books colliding, as there are fewer authors than books, fewer publishers than authors, fewer languages than publishers, etc.
    It would also require different bit tables, but we're trying to reduce the risk of a false positive here.

  • @alex_zetsu
    @alex_zetsu 3 months ago

    Is there a difference between using 1 hash function that maps to a space N^3 (say 1,000,000,000) and using 3 hash functions that map to space N (say 1,000)?

  • @SuperMichael996
    @SuperMichael996 1 year ago

    Excel + Ctrl+F, problem solved ;)
    Really though, why resort to oversimplifying it into bits when you can just store a table (or even a text document) with the names of the books, the number of the shelf they're on, the row number, and the position of the book on that shelf? (i.e. 3 for the third book from left to right)
    Given our technology nowadays, I don't think the cost difference is noticeable.

  • @theresapan6128
    @theresapan6128 1 year ago

    Thanks!

  • @DeannaEarley
    @DeannaEarley 1 year ago

    How does this differ from, say, a Perl hash, a PHP associative array, a dictionary, etc.? Is it the same lookup, but with no actual data beyond the yes/no?

  • @lafiga776
    @lafiga776 1 year ago

    I would attach a numeric value to each bit that increments by one when you add a book with that hash value and decrements by one when you remove it; that way it'll only go dark if it reaches zero

  • @toshevislombek
    @toshevislombek 3 months ago

    5:55 I'd create a new bloom filter for deleted items and check the deleted filter first; if it is negative, I assume the item has not been deleted yet

  • @estrawitch
    @estrawitch 1 year ago +1

    1. turn off the bits
    2. other books might share the same bits, so if you turn the bits off there will be false negatives
    3. use one hash function that has a lower chance of collisions (quality over quantity, basically)

    • @travcollier
      @travcollier 1 year ago

      I'm not sure if it is part of the formal definition or not, but I've never seen Bloom filters discussed where they weren't guaranteed never to give a false negative. That's often quite important to the applications where they get used.

  • @nightfox6738
    @nightfox6738 1 year ago

    It would take a bit more memory, but changing each bit in the bloom filter to a count would solve the problem: each time you add a book you increment all of the indices from the hash functions, and decrement them when the book is removed.
    This would, however, produce higher counts when multiple copies of the same book are added, which could cause scalability issues: now you're not just counting collisions (which ideally are very minimal) but all copies of any book whose hash functions produce the number corresponding to that index. This number could end up quite large, requiring more and more bits to store, so while it's better than the list of books at the front desk, it's still not ideal. Another option would be to recalculate the filter every time you sell a book, but as you can imagine that has quite some overhead.
    Ideally you'd have a combination of both: only increment the filter if you don't already have the book, so you're tracking collisions and not copies of the same book; then, when you sell a book, check how many of them you have (they should all be in the same place), and if you sell the last one, decrement its indices in the bloom filter. I think this is the best solution, as in most cases it should not require more than one byte per index, and, much as in many hash map implementations, you can dynamically resize the bloom filter to minimize collisions.

    • @benjackson7690
      @benjackson7690 1 year ago +1

      I had essentially the same thought about using integers rather than bits, but you seem to have already seen the downside. I believe in C++ at least, ints are 8 bits, which would then increase the size of your bloom filter 8x, which could be incredibly expensive depending on how large it is. This also has the downside of introducing a maximum number of books, because an int can only count so high before it runs out of space.
      You also mention recalculating the filter when you sell a book, but that would be incredibly expensive, as every time you removed a book you would have to search through your entire storage again. If you had to do that, you might as well not have a filter.

    • @nightfox6738
      @nightfox6738 1 year ago

      @@benjackson7690 in C++ ints are usually 32 bits, but you can use char for 8 bits. It really depends on your expectations, though. Do you foresee ever having over 255 of the same book in your store? If not, this would work. You mention that would require 8x the size, and that's true, but multiplying by a constant isn't a huge issue (similar to runtime complexity: O(n) is basically equivalent to O(8n)). If you're running out of space with n books, it's conceivable that it may not be long before you run out of space with 8n

  • @epimolophant
    @epimolophant 1 year ago +1

    I have the feeling that searching through an index would be a lot faster than using multiple hash functions for every query

    • @gamekyuubi42069
      @gamekyuubi42069 1 year ago

      I thought this too. He should have gone into explaining why this is faster than a list of books, because for a human it's not that much better but for a computer it might be a world of difference.

    • @benjackson7690
      @benjackson7690 1 year ago +1

      A hash function is a very quick, usually one-step math operation; it takes almost no time to do. On the other hand, searching a list takes a long time depending on the size of the list and how it is sorted. The fastest possible time to search a sorted list is log(n), where n is the number of items in the list, so it would take log(n) time to confirm a book is not in our library. Using a hash, by contrast, takes constant time regardless of the number of items, so we can easily confirm we don't have something without searching for it.

  • @lennarthorney3466
    @lennarthorney3466 3 years ago +1

    best video :infinity:)

  • @USSOR-media
    @USSOR-media 1 year ago

    You could also solve this problem with a computer. A computer is an electronic device that embodies a complex and intricately interconnected network of hardware and software components, capable of processing vast amounts of data and performing complex calculations with stunning speed and accuracy. It leverages cutting-edge technologies, including microprocessors, memory chips, storage devices, and input/output interfaces, to execute a wide variety of operations, ranging from arithmetic and logical computations to multimedia processing and artificial intelligence. Its operation is governed by a sophisticated set of protocols and algorithms, which are designed to ensure optimal performance, reliability, and security. The computer's remarkable capabilities are a testament to the ingenuity of human engineering and the power of modern science, and its impact on our daily lives is nothing short of revolutionary.

    • @benjackson7690
      @benjackson7690 ปีที่แล้ว

      I assume your remark is being sarcastic, but just in case, this video is saying how you would do it with a computer

    • @peculiar-coding-endeavours
      @peculiar-coding-endeavours ปีที่แล้ว

      That's a ChatGPT-level comment if ever I saw one :-D

  • @sebbes333
    @sebbes333 ปีที่แล้ว +1

    4:20 What if instead of treating those values as 3 different numbers, instead you treat them as coordinates in a 3D space?

    • @denzelchukwuebukaachonu
      @denzelchukwuebukaachonu ปีที่แล้ว

      Yes! that would further decrease the chances of a collision.. exponentially. It still wouldn't negate it but my would it come close! This is brilliant.

    • @localidiot4078
      @localidiot4078 ปีที่แล้ว

@@denzelchukwuebukaachonu I disagree. The core tradeoff between the hash functions and collisions is space. If you have enough space you can get a 0% chance of a collision; if you have no space, every function can collide. The encoding of the data doesn't matter.
You can interpret any 3D coordinate system as a single number if it has a finite size:
X + Y*maxX + Z*maxX*maxY. In effect you're just cubing the space requirements of the data structure.

    • @benjackson7690
      @benjackson7690 ปีที่แล้ว

      This doesn't really solve anything though. It's the same as just adding a larger filters. In the end, you're still treating them as three different numbers
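The flattening argument in this thread can be checked directly. This hypothetical snippet maps 3D coordinates to 1D indices with the row-major formula mentioned above and confirms the mapping is one-to-one, so a 3D bit cube carries exactly as much information as one larger flat bit array:

```python
def flatten(x, y, z, max_x, max_y):
    # Map a 3D coordinate to a unique 1D index (row-major order),
    # showing that a 3D bit cube is just a larger flat bit array.
    return x + y * max_x + z * max_x * max_y

# Every coordinate in a 4 x 5 x 6 cube gets a distinct flat index.
seen = set()
for x in range(4):
    for y in range(5):
        for z in range(6):
            seen.add(flatten(x, y, z, 4, 5))

print(len(seen))  # 120 distinct indices = 4 * 5 * 6
```

Since the mapping is a bijection, the collision probability only depends on the total number of bits, not on how they are arranged.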

  • @porple4430
    @porple4430 ปีที่แล้ว

    How would this be more efficient than simply looking up the book? Isn’t the bloom filter just another way of organizing the books, because you have one hash value for every book (disregarding overlaps)?

  • @larakayaalp1676
    @larakayaalp1676 ปีที่แล้ว

    isnt this just equality optimized indexing? it’s still O(n) to determine inexistence, you could instead categorize the elements to make it faster, if you have text you “sort” it (sorting is expensive you’d just have a different list for each text’s first character) which i think is just a btree

  • @xxqq5719
    @xxqq5719 4 หลายเดือนก่อน

    Before Amazon, bookshops were a nightmare to deal with...

  • @Lampe2020
    @Lampe2020 ปีที่แล้ว +1

If the set of books isn't TOO large I would simply not remove the book that's out of stock from the filter, accept the few false positives, and at regular intervals (provided the set of books has changed since the last reindex) simply reindex them.

  • @theNeuo13
    @theNeuo13 3 ปีที่แล้ว

    Thanks for the clear explanation. Could you please direct me to inverted version of this filter?

    • @localidiot4078
      @localidiot4078 ปีที่แล้ว

      Inverted? Like you have the data and need the object? This process is destructive unfortunately so you can't gain information from doing this, only lose it.

  • @torazis3286
    @torazis3286 ปีที่แล้ว

    So it's a list of all books with extra steps

  • @naveenda2064
    @naveenda2064 3 ปีที่แล้ว

    Great, keep doing

  • @termisher5676
    @termisher5676 ปีที่แล้ว

Stand at the door yourself and only stamp books when customers leave, while the whole area is monitored.
Then customers will find the books themselves and you will never have to search.

  • @argothiel
    @argothiel ปีที่แล้ว

    Does using multiple hash functions really improve the situation? It's true that you have three times as many checks than in the original example to prevent the collision, but at the same time you fill up three times as many bits.

    • @pies765
      @pies765 ปีที่แล้ว

      You already allocated a block of size n to fit the entire set of possible output values of the hash function. So as long as that size doesn't change as you add functions, you haven't taken up any more space

  • @mrmastaofdesasta6994
    @mrmastaofdesasta6994 ปีที่แล้ว

    First, I would use a 2D or 3D bloom filter, depending on the number of hash functions and then use the results of these functions as coordinates for each book. Since it is just a bit array, it shouldn't require that much space. Then you would probably get a more reliable way of uniquely identifying a book than marking multiple bits.

    • @benjackson7690
      @benjackson7690 ปีที่แล้ว

All this is doing is increasing the size of the filter. It does decrease the number of false positives, but you're still going to get them. Actually, adding more hash functions is a much better way of reducing them (to a certain extent). Consider a 100-bit array. If we add 1 book using 1 hash function, another book has a 1/100 chance of landing on the same spot (1%). If we add 1 book with 3 hash functions, at most 3 bits are set, and another book is a false positive only if all 3 of its hashes land on set bits: roughly (3/100)^3, or about 2.7e-5 (0.0027%).
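A quick way to sanity-check the effect of extra hash functions is a Monte Carlo estimate. This is an illustrative simulation (random indices stand in for hash outputs), not a formal analysis:

```python
import random

def false_positive_rate(m_bits, n_items, k_hashes, trials=10_000, seed=0):
    # Empirically estimate a Bloom filter's false positive rate:
    # fill the filter with n_items random items, then probe it with
    # items that were never inserted and count how often it says "yes".
    rng = random.Random(seed)
    bits = [False] * m_bits
    for _ in range(n_items):
        for _ in range(k_hashes):
            bits[rng.randrange(m_bits)] = True
    hits = 0
    for _ in range(trials):
        if all(bits[rng.randrange(m_bits)] for _ in range(k_hashes)):
            hits += 1
    return hits / trials

# When the filter is sparsely filled, more hash functions means
# fewer false positives, even though more bits get set.
print(false_positive_rate(m_bits=1000, n_items=50, k_hashes=1))
print(false_positive_rate(m_bits=1000, n_items=50, k_hashes=4))
```

With 50 items in a 1000-bit filter, the 4-hash variant sets about four times as many bits but still reports far fewer false positives, because an unseen item must now collide on all four positions at once.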

  • @ThePCguy17
    @ThePCguy17 ปีที่แล้ว

    Marking bits by which hash function they came from would help, though that would require multiple lookup tables. And it wouldn't really make removing an entry easier, just less likely to completely remove an item from your tables. And since not having all three hash functions returning a result on your table is a sign of a false positive 'less likely' doesn't help much. Honestly, the only kludge I can figure out is re-entering every book into the system after removing any book to make sure they're still in correctly, but that would be cumbersome enough to be impractical at best.

  • @dojelnotmyrealname4018
    @dojelnotmyrealname4018 ปีที่แล้ว

    I feel like this is just using a hash function to convert human readable titles to database ID's.
    As for removing items: There's a couple of ways, with varying levels of scaling issues. The main problem you have is that if your own stock causes a hash collision, turning off a bit can cause you to mistakenly believe the hash is unrepresented.
    The first solution is to make sure your own stock does not cause a hash collision. While calculating your hash tables, input an extra variable, and start over with the variable incremented every time a collision occurs. This works, but it also scales horribly with set size. If your hash table is .01% filled, that means that every datapoint has a .01% chance to cause a collision (technically 1/9999, but let's not get overprecise). The advantage of this method is that it only has to be done when a new member is added to the set. So if new members are relatively infrequent, this can be an option. But if you update the set frequently you'll spend more time recalculating your hash table than you did fetching the data.
    To reduce the amount of recalculating massively, you can commit larger amounts of memory to the hash table than a single bit per member. For expedience's sake you want to use fixed length blocks. Then, when calculating the hash table, instead of simply flipping the bit on, you increment the location. This way, your hash table tells you how many members use each single hash. When you run out of stock on a member, simply decrement the location again. This method uses more memory space, but needs to recalculate much less frequently. Even two bits instead of one allows for 3 members to share a hash.
You can even combine both solutions if your system has downtime and you don't mind burning energy. While the system is not actively looking up information, it can recalculate the hash table to fit it into a smaller memory footprint. If a particular spot has 5 members, and the system has idle time, it can attempt to recalculate the hash table to reduce the count back down to 2, and then reorganize the data to fit in 2 bits instead of 4. My suggested rules would be: If a counter is at its max, reorganize the filter to use the next power of 2 bits per spot. If the highest count is above 2, and there are no pending requests, recalculate the hash filter until all counters are less than 3. If no counter is filled up to half minus 1 (countMax => 0b011...1), reorganize the filter to use the previous power of 2 bits per spot.

    • @lancemckenzie1074
      @lancemckenzie1074 ปีที่แล้ว

      Didn't you have to do that back in the 90s - trigger memory compression, that is? An automatic reorganisation routine sounds like garbage for disk longevity.
      I think at small numbers of entries, most of the array is wasted space. Can we make the existing array do more work by using data which is not a count number? Perhaps a two bit addition to each array bit considers "rotation" of the bit, so that each array bit is more expressive.

    • @eitantal726
      @eitantal726 ปีที่แล้ว

      It IS a hash table, until the point where you combine multiple hashes and logically OR them into the same map. THEN it becomes a bloom filter.

  • @cupcakeapplesauce
    @cupcakeapplesauce ปีที่แล้ว

    Would it be useless to have a one to one hash function? I'm guessing having as many bits as there are books is equivalent to having a list with all the names.

  • @amiriqbal1689
    @amiriqbal1689 ปีที่แล้ว +1

This seems like it would not work.
Suppose:
Book A - Hash Values: 1,2,3
Book B - Hash Values: 4,5,6
Bits 1,2,3,4,5,6 will be turned on.
So now, any book with hash values (2,3,4) or (4,5,1) will seem like we have it, but we do not.

  • @ThatJay283
    @ThatJay283 ปีที่แล้ว

    if you used counters instead of booleans, could removing be done? like when the first copy is added, add 1 to each counter. when a last copy is removed, remove 1 from each counter.

    • @lancemckenzie1074
      @lancemckenzie1074 ปีที่แล้ว

      Numbers are bits, mate, just fancier bits. At the end of the day, it has to be expressed as a number of bits.
      If you want to count up to eight uses of each array bit (i.e. Eight book titles result in hash codes that use that bit), then you need: four bits to represent "0", "1", "2", "3"...and "8".

    • @lancemckenzie1074
      @lancemckenzie1074 ปีที่แล้ว

      Suppose that'd be zero through seven...

    • @ThatJay283
      @ThatJay283 ปีที่แล้ว +1

      @@lancemckenzie1074 well yeah lol, numbers are just groups of bits. i should've just said "counters instead of booleans".

    • @benjackson7690
      @benjackson7690 ปีที่แล้ว +1

That would be a solution to it, but it has 2 main problems. The first is that it introduces a maximum number of times a bit can be added to (the integer limit). This problem is unlikely to occur unless you are using extremely massive arrays, so we can probably ignore it. The second problem is that with 8-bit counters you now have an array that is 8 times the size of the original. It retains relatively the same speed, but it is now a lot larger, which can be an issue.
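The counter idea in this thread is a known variant, usually called a counting Bloom filter. Here is a hypothetical minimal sketch (the sizes and the seeded-SHA-256 hashing are illustrative):

```python
import hashlib

M = 64  # number of counters (small, for illustration)
K = 3   # number of hash functions

def hashes(item: str):
    # Derive K different hash values by prefixing one hash with a seed.
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
        yield int(digest, 16) % M

counts = [0] * M  # counters instead of single bits

def add(item: str):
    for h in hashes(item):
        counts[h] += 1

def remove(item: str):
    # Caller must ensure the item was actually added,
    # otherwise the counters become inconsistent.
    for h in hashes(item):
        counts[h] -= 1

def might_contain(item: str) -> bool:
    return all(counts[h] > 0 for h in hashes(item))
```

Removing an item that was never added corrupts the counters, which is why real implementations guard against it; the integer-limit and space concerns raised above still apply.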

  • @weirong1159
    @weirong1159 ปีที่แล้ว

    when we use multiple hashing methods, wouldn’t that increase the odds of collision if we have a lot of books in our library already?
    I was thinking that we could have multiple bloom filters and to check them instead? This would be less space efficient though.

    • @alessandrobalducci4961
      @alessandrobalducci4961 ปีที่แล้ว +1

      when you have multiple hash functions, it's true that each hash function could have a collision, however, for the bloom filter to give a false positive you would need for every hash function to have a collision, which is less likely. If there was a collision on a single hash function you would notice that (as not all the bits corresponding to your value will be flipped). I saw this data structure (or actually a similar one) in one of my uni courses but I didn't remember what it was called, I guess now I know it was called a bloom filter 😂

    • @weirong1159
      @weirong1159 ปีที่แล้ว +1

      @@alessandrobalducci4961 thank you!

  • @Jellyfishhie
    @Jellyfishhie ปีที่แล้ว

    Wouldn't this approach require that your search term be exactly the same as what was used to generate the bloom filter hashes?

    • @1vader
      @1vader ปีที่แล้ว +1

Yes, ultimately, the book store is only an analogy. A real world use case could be something like checking if a person has liked a video, where the "search value" would be the user id, which is of course exact. Or you could for example assume we search by ISBNs instead of book titles.

  • @hwstar9416
    @hwstar9416 10 หลายเดือนก่อน

    how is this different from a bitmap index?

  • @kasturikakatkar
    @kasturikakatkar 3 ปีที่แล้ว

    Very well explained!

  • @surlo69
    @surlo69 ปีที่แล้ว +1

    why not just have all the books in one list and their value would just be their index number instead of generating a value for them that might have already been used by another book?

  • @theNeuo13
    @theNeuo13 3 ปีที่แล้ว

    What about using the output of the hash functions as coordinates for the bit instead of the number of the bit?

    • @softpaw6234
      @softpaw6234 ปีที่แล้ว

      The problem with that is that, like most of this channel's content, this is supposed to be an analogue for a computer science concept, so a computer would need a separate function to find the coordinates whereas it's already built to be able to quickly find a numbered position in memory. Your suggestion would be easier for most humans though.

  • @lightweight1889
    @lightweight1889 ปีที่แล้ว

    You'd just be indexing your table.
    How's looking up hashes any more efficient than looking up strings of book titles?

  • @coding3438
    @coding3438 2 ปีที่แล้ว

    Excellent video but still wondering how we went from writing down book names on a piece of paper to suddenly having arrays of memory cells on the table!

    • @lancemckenzie1074
      @lancemckenzie1074 ปีที่แล้ว

      You only have to search one entry on the array - - the green bot asks for a book and identifies the hash code you need to look at. With a long paper list, you don't know what number entry has your book on it, so you need to search every one.

  • @anjarwicaksono229
    @anjarwicaksono229 ปีที่แล้ว +2

    If you remove it, then just create another bloom filter to record what things that have been removed. If you found both in the first and the second bloom filter, it means it was there but It is no longer there. If it is found only in the first filter, it is still there. Limitation and challenge: when a book is restocked.

    • @1vader
      @1vader ปีที่แล้ว

This doesn't really work because now there's a chance that the second bloom filter will by chance indicate that an unrelated book, one that just happens to hash to the same bits, was removed.
Bloom filters are only useful because they tell you for certain that you don't have a book when its bits aren't on; the fact that they can sometimes incorrectly indicate you do have a book isn't a problem, because you can just check to make sure.
But in your example, assuming the first bloom filter indicated you do have a book: if the second bloom filter doesn't indicate the book, you definitely still have it, but if it does indicate the book, you might still have it, so you have to check either way and the filter provides no value. You'll also continue to think you still have all the books you used to have but sold. Alternatively, if you just assume you don't have a book anymore whenever the second bloom filter indicates it, you may now tell customers you don't have a book even though you do, which is a mistake the original never makes. It's always correct on the "definitely not here" side.

  • @rayoflight62
    @rayoflight62 ปีที่แล้ว

    I leave you with this question:
How can your subscribers and other viewers convince you to make new videos?
    All the best in any case, &
    Greetings...

  • @weckar
    @weckar ปีที่แล้ว

    So how is an array of bits you have to check with possible overlaps better than an ID list of all books in the system? Lookup time would be the same or better with the ID list?

    • @denzelchukwuebukaachonu
      @denzelchukwuebukaachonu ปีที่แล้ว

imagine you have thousands of books in the library. If someone requests a book that is not present, the only way to confirm that with a lookup table would be to check every single entry, from the first all the way to the last, before you can be sure it isn't in the library. With an array of bits, you simply hash the title and check the bit locations it outputs for true or false.

    • @weckar
      @weckar ปีที่แล้ว

      @@denzelchukwuebukaachonu I dont get how that's not the same...

    • @lare290
      @lare290 ปีที่แล้ว +1

@@weckar the bloom filter doesn't store the entries themselves, only a few bits per entry (around 10 bits per item gets you roughly a 1% false positive rate), so when the dataset grows to billions of entries the filter stays a tiny fraction of the dataset's size. it's much faster to check because it's much smaller than the dataset and each lookup touches only a fixed number of bits.
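The usual sizing formulas make the space claim precise. This sketch computes the optimal bit count m and hash count k for n items at a target false positive rate p, using the standard results m = -n ln p / (ln 2)^2 and k = (m/n) ln 2:

```python
import math

def bloom_parameters(n_items: int, fp_rate: float):
    # Standard Bloom filter sizing formulas:
    #   m = -n * ln(p) / (ln 2)^2   bits in the filter
    #   k = (m / n) * ln 2          number of hash functions
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

m, k = bloom_parameters(1_000_000, 0.01)
print(m / 8 / 1024, "KiB,", k, "hash functions")  # ~10 bits per item
```

A million entries at a 1% false positive rate needs only a bit over 1 MiB of filter, regardless of how large each entry itself is, which is why the filter stays tiny compared to the dataset.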

  • @nagasatisha1
    @nagasatisha1 2 ปีที่แล้ว

    nice

  • @DerMarkus1982
    @DerMarkus1982 ปีที่แล้ว

    You missed a good opportunity here to put a certain book in spot #42 of your Bloom Filter Array...

    • @lancemckenzie1074
      @lancemckenzie1074 ปีที่แล้ว

      Yeah, Do androids dream of electric sheep?

  • @minor_2nd
    @minor_2nd ปีที่แล้ว

    okay but this is like a list but with extra steps