A problem so hard even Google relies on Random Chance

āđāļŠāļĢāđŒ
āļāļąāļ‡
  • āđ€āļœāļĒāđāļžāļĢāđˆāđ€āļĄāļ·āđˆāļ­ 9 āļž.āļ„. 2024
  • Head to brilliant.org/BreakingTaps/ to get a 30-day free trial. The first 200 people will get 20% off their annual subscription.
    Watch this video ad free on Nebula: nebula.tv/videos/breakingtaps...
    -------------------------------------------------------
    Today we're looking at HyperLogLog, an algorithm that leverages random chance to count the number of distinct items in a dataset. It does this by tracking the longest run of zeros in a binary sequence, and uses that as an estimate of cardinality.
    HLL is a probabilistic algorithm, meaning it gives a guess rather than the true answer. But thanks to some clever tricks it is usually within 2% of the correct value, and it gets there quickly and in a memory-efficient manner. A 512 KB data structure can accurately process trillions of items and terabytes of data, which is pretty impressive!
    When I made this video, I didn't realize that another #SoME3 was in progress. But a bunch of viewers suggested I enter the video, so I guess this will be part of the event!
    ----------------------
    🔎Patreon if that's your jam: / breakingtaps
    📢Twitter: / breakingtaps
    📷Instagram: / breakingtaps
    💻Discord: / discord
    ----------------------
    Journal papers:
    Flajolet, Philippe, et al. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." _Discrete Mathematics and Theoretical Computer Science_, 2007. algo.inria.fr/flajolet/Public...
    Heule, Stefan, Marc Nunkesser, and Alexander Hall. "Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm." _Proceedings of the 16th International Conference on Extending Database Technology_. 2013. static.googleusercontent.com/...
    Earlier work:
    Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." _Algorithms-ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings 11_. Springer Berlin Heidelberg, 2003.
    Flajolet, Philippe, and G. Nigel Martin. "Probabilistic counting algorithms for data base applications." Journal of computer and system sciences 31.2 (1985): 182-209.
    Articles:
    - towardsdatascience.com/hyperl...
    - engineering. 2018/12/13...
  • Science & Technology

Comments • 1.2K

  • @BreakingTaps
    @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +694

    🚨 The Nitty Gritty Details 🚨 Seems folks were interested after all! 😅
    - "Linear counting": HLL struggles when there isn't much data / the cardinality is low (the large granularity of 2^n when n is a small number means relative error can be quite high). Most implementations use a "linear counting" scheme under a certain threshold, typically a sorted list or hashmap. Up to the threshold, counts are recorded accurately in that datastructure. When the threshold is reached (say, 50000 unique elements) the counts are replayed into an HLL and it starts estimating from then on. If the implementation is good, it can often reuse the same segment of memory to avoid re-allocating too
    - "++ in HLL++": these are some empirical constants and a few other tweaks that Google added to the algorithm, which help smooth out the relative error (particularly at small and very large cardinalities)
    - The size of an HLL sketch varies by implementation, but typically on the order of a few kilobytes up to a few hundred kb at the most. Very compact! If the scorecards/registers are compressed, that means there are tens of thousands of scorecards tracking the runs.
    - In contrast, a hashmap/set could be many orders of magnitude larger. Suppose you have 500M unique 64-bit integers. That ends up being 500M × 8 bytes = 4 GB of memory. Multiply that by 10-100 partitions (not uncommon for advanced reports) and 10-50 instances from multiple servers... you are quickly looking at hundreds of GB of data structures to transfer, store on disk, and page into memory for iterative processing/merging. And that's just the keys alone, ignoring the overhead of the map/set itself. There are tricks to reduce this (shared global dictionaries, memory-mapped files, etc.) but none are free and they are all more expensive than an HLL sketch!
    - Yes, I know hashmaps don't actually allocate all that memory up front. But that was an easier layperson explanation than digging into how load factors, collision resolution and re-allocations work 😉 I wanted to convey that hashmaps trade space for performance, hence the analogy. But it should be noted that even with a modest 0.7 load factor, unless you vastly over-allocate the map you will eat a lot of performance due to memory re-allocations when trying to use it for very large cardinality datasets. And it's made worse by the merge cost on a reduction node. So the analogy isn't that far off!
    - HLL is fast! It may seem like a lot of work, but modern (non-cryptographic) hash functions are highly optimized. And most processors have built-in instructions to count leading or trailing ones/zeros (en.wikipedia.org/wiki/Find_first_set). So in practice, HLL will have about the same runtime cost as a hashmap/set but doesn't need to deal with memory re-allocations. And it's vastly faster than any sorted list/set approach.
    - HLL is best suited in an environment where you expect millions of unique elements, spread over a dataset on multiple servers (in terabytes+ of data). If all your data fits on one machine, it's much faster/easier/better to just count it exactly. 🙂
    - The cardinality metric seems pretty boring, but ends up being super useful at-scale! The distinct count of session IDs, IP addresses, src-dst tuples, etc partitioned by a few different criteria. Very useful for reporting, analytics and forensic style investigations.
    - The hash function ("transmogrifier") really isn't that important, as long as it's considered a good hash (good pseudorandom output) and fast. Something like murmurhash is often used, with 64 or 128bit outputs.
    - HLLs "merge" by taking the maximum run from each respective scorecard. So if you have two sketches that have scorecards [1, 10, 5, 3] and [3, 3, 4, 5] the merged result would be [3, 10, 5, 5]. The maximum operation is commutative (max(a,b) == max(b,a)) and associative (max(a, max(b,c)) == max(max(a,b), c)) so we can just take the max of all values for a specific scorecard slot and it's equivalent to having run the algorithm on all the data serially instead of in parallel.

    • @Splarkszter
      @Splarkszter 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +8

      Quite amazing thing. Super helpful to keep in mind for the future!!!

    • @iwersonsch5131
      @iwersonsch5131 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

      Now what if you have like 10^20 buckets and then build a stack in each bucket? That avoids having to allocate much space, and it avoids getting high stacks that are long to search through

    • @Rin-qj7zt
      @Rin-qj7zt 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +7

      this is a fantastic video with good explanation techniques, but I think you need to clarify more about why exactly this is useful. sure, you can merge cardinality between different computers a lot faster. _why is that useful?_ what i mean is, if so many companies are using this, _what are their programs accomplishing?_

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +36

      @@Rin-qj7zt The short answer is "reporting and analytics", the longer answer is hard to pin down since it's used for a ton of different purposes. But imagine something like "Show me the number of unique session IDs in our game, from the last year, partitioned by tuple". That sort of report could span literally billions of "game event" rows in the database, and generate dozens or hundreds of "buckets" representing each tuple combination. And inside each bucket there's a count of unique session IDs which are proxies for users/gamers sessions
      Then it's all thrown onto a realtime dashboard in an office, fed into a realtime algorithm for scaling the game servers, or a PDF report for management, or off to marketing/sales/whatever.

    • @Peter_Enis
      @Peter_Enis 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@iwersonsch5131 first thing I thought of.... The set grows (and gets slower) but should make for an accurate answer?

  • @bjarnivalur6330
    @bjarnivalur6330 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3039

    This feels like such an "outside the box" cheat: instead of counting the different colours, just find the rarest one and work out how likely/unlikely it is to be in your set.
    I love it

    • @snared_
      @snared_ 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +67

      This is the core essence of it.

    • @_Karlsson
      @_Karlsson 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +65

      It also doesn't solve the original problem at all; it will ~always be off from the exact number that the stack/hashset approach would actually produce.

    • @XerosOfficial
      @XerosOfficial 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +171

      @@_Karlsson It is used in some applications to estimate how much memory you need to allocate, then you can use another algorithm, such as the "magic chest" one he referenced in this video to get exact value, but without requiring nearly as much space.

    • @KatherynneF
      @KatherynneF 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +85

      @@_Karlsson Breaking Taps described at the beginning how this is used for real world applications. Coding problems aren't math problems, so answering the motivating problem is much more important than answering the theoretical analogy exactly.

    • @_Karlsson
      @_Karlsson 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +32

      @@KatherynneF and @XerosOfficial, you are both right. Like I said, it solves a totally different problem than the original one of knowing how many unique items there are. Instead it produces a probabilistic approximation that can be used when you don't need to know the actual count. They aren't equivalent, as you can't swap one method for the other in an arbitrary system.
      To be clearer, the original solutions produced the exact count and a set (HashSet/Array) of all the unique data, while this algorithm returns neither.

  • @pafnutiytheartist
    @pafnutiytheartist 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3270

    Explaining hash tables as "magic chests" is very cute but surprisingly accurate.

    • @phizc
      @phizc 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +91

      Agree. Only thing missing in the description is that the chest will only store 1 of each. Which is perfectly fine for counting distinct items, of course.

    • @pafnutiytheartist
      @pafnutiytheartist 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +54

      @@phizc There are modifications allowing you to store multiple items in case of hash collision; Python dicts are one example. It has a speed and memory penalty but still has a lot of applications.

    • @PattyManatty
      @PattyManatty 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +32

      As described it's more of a hash set than a hash map/table.
      But of course the fundamentals are the same

    • @Arnob-vm4ub
      @Arnob-vm4ub 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +13

      @@phizc *Laughs in std::unordered_multiset*

    • @kgoblin5084
      @kgoblin5084 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +48

      It's a good analogy, although I do think it could be improved, as parts of it were a bit misleading about how hashsets work:
      1. the shelf has a 'warm colors' side, where the red & orange shelves are, & a 'cool colors' side, where green & blue go... so if you're looking in the chest for 'blue' you know right where to find it, instead of having to hunt through all the shelves
      2. the chest DOES NOT start with all the shelves for all the colors it could have, instead it has exactly as many shelves as it has currently stored colors. It's only huge when you put a million colors into it, when you have only 3 to 10, it's tiny. IRL, this property is actually exploited to use hashsets as a space efficient way to store sparse arrays.

  • @BreakingTaps
    @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +686

    Hope the animated format wasn't too disconcerting considering the usual content on the channel! 🤞 I originally filmed this on a whiteboard, but it was hard to see the details and ended up being pretty boring. Figured I'd take a crack at hand-animating instead. Probably won't be a regular occurrence on the channel (too much work!) but was a fun diversion from the usual stuff 🙂

    • @Scyth3934
      @Scyth3934 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +27

      Nope! I'm just fine with it

    • @Yenrabbit
      @Yenrabbit 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +16

      Really great, thanks for sharing!

    • @vegasu9418
      @vegasu9418 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

      nice vid! ever going to try to beat Sam Zeloof at higher res transistor fabrication?

    • @Dinnye01
      @Dinnye01 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +4

      It's perfectly fine. I'll consume your content no matter what.

    • @MrBeaach
      @MrBeaach 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +11

      consider entering this video into the SoME3 competition,

  • @decb.7959
    @decb.7959 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1345

    - That's such a cool algorithm, especially the fact that it can be run asynchronously.
    - Thanks for including the programming concept names in the corner, it really helps me understand at a more concrete level.
    - I would love to see more animated videos or segments to help explain concepts.
    - If you haven't already, this video would make a perfect submission to the Summer of Math Exposition competition.

    • @ColinTimmins
      @ColinTimmins 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

      I hope they do more as well. I can only learn visually as I am very dyslexic. ChatGPT has become invaluable to me as it “erases” errors and I can manipulate data in a way that was not possible before. 😊

    • @peraltarockets
      @peraltarockets 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@ColinTimmins ChatGPT makes things up because it's just a fancy regression model. Please do not use it for data analysis.

    • @cheesebusiness
      @cheesebusiness 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +25

      “asynchronously”
      Probably you meant “in parallel”

    • @sausas8209
      @sausas8209 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

      @@cheesebusiness tHreAdEd

    • @mrbrianakias1
      @mrbrianakias1 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +7

      @@cheesebusiness it's a programming term

  • @ablobofgarbage
    @ablobofgarbage 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +110

    I really like how you describe programming terms in ways that non programmers can easily understand, also putting the actual names of them in the corner was a very nice solution to clarify what you were talking about to us programmers

    • @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii
      @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      As a programmer I don't get it though... computers are excellent at finding unique values in a data set. Even huge data sets; even the oldest mechanical computers did that very efficiently. I fail to see any scenario where you'd run out of computing resources for this class of problems. It is also an extremely simple problem.
      Maybe you can help me here, what is hard in counting unique elements?
      Why would you ever want to produce an inaccurate guess?
      I can only see this as interesting in the fields of statistics on infinite data sets which has little to do with programming.

    • @ablobofgarbage
      @ablobofgarbage 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@DJ_POOP_IT_OUT_FEAT_LIL_WiiWii The video was about how counting unique elements is not an easy problem. If you use an array list or a linked list you will end up with O(n^2), using an ordered list would be O(n log n) (I think), and using a hashmap would be O(n) but use a lot of memory. HLL is O(n) but does not give an exact answer; however, as the size of your data set increases, the accuracy of HLL also increases. Companies like Google have unfathomable amounts of data, so for them this tradeoff is very much worth it.

    • @aidenkt
      @aidenkt 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@DJ_POOP_IT_OUT_FEAT_LIL_WiiWii This mostly becomes a problem when a dataset ranges into the hundreds of thousands and millions. As fast as computers are at scanning through once, the naive approach has to check each new number against every number seen before it. For 4 numbers that's not unreasonable, since that's only about 10 checks in total. However, scale that up to a few million and we have an issue. This is especially the case for larger companies where these numbers need to be continually pulled; it's a massive waste of processing power.
      To give you an idea, say one comparison takes 0.3 nanoseconds (roughly the time a computer needs to compare two numbers). This is SUPER fast. Let's say it's 100 numbers. Unfortunately, the time it takes for 100 is not just 0.3 ns × 100: the pairwise checks grow with roughly the square of the count, about 100 × 99 / 2 ≈ 5,000 comparisons. That's still only a couple of microseconds, which is relatively fine.
      But now take 10,000,000 numbers: that's on the order of 5 × 10^13 pairwise comparisons, which works out to hours of pure comparison time even at 0.3 ns each, before memory access is even considered. This is the point where it becomes entirely unreasonable to find unique values with the other method. It's for reasons like this that you'll never see an exact count of Google accounts in the world computed this way. But of course, they still have to know a general amount, right? That's where the solution presented in this video comes in.
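
      For readers who want to see the trade-offs spelled out in these replies, here are the three exact approaches side by side as simple Python sketches (illustrative only); HLL replaces the memory cost of the last one with a small fixed-size sketch, at the price of a few percent of error.

      ```python
      def distinct_scan(items):
          """O(n^2): check each new item against everything kept so far (the "stack of blocks")."""
          seen = []
          for x in items:
              if x not in seen:          # linear scan for every item
                  seen.append(x)
          return len(seen)

      def distinct_sorted(items):
          """O(n log n): sort, then count the boundaries between runs of equal items."""
          s = sorted(items)
          return sum(1 for i, x in enumerate(s) if i == 0 or x != s[i - 1])

      def distinct_hashed(items):
          """O(n) time, but O(n) memory for the hash set (the "magic chest")."""
          return len(set(items))

      data = [i % 500 for i in range(20_000)]
      assert distinct_scan(data) == distinct_sorted(data) == distinct_hashed(data) == 500
      ```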

    • @BEN-ys6gu
      @BEN-ys6gu 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      Well one use case was at the end, when you have a distributed data management system with multiple databases.
      Also, if the guess is really good, they might simply not need anything more accurate, so a resource-friendly algorithm is the best choice

    • @ME0WMERE
      @ME0WMERE 6 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      @@DJ_POOP_IT_OUT_FEAT_LIL_WiiWii The problem is, as he said, when there are trillions of items. A stack is O(n^2) cpu time so it's infeasible. A hash map needs at least the same order of magnitude of memory as there are unique items - and in this case, you have absolutely no idea how many unique items there are.
      Someone somewhere else in these comments mentioned using this method to estimate the size required for a hash map in order to count the exact number of unique items. That's a real-world use for this method.

  • @macchicken98
    @macchicken98 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +352

    This video was soo well done! I love the way you visualized the nature of how HashMaps and Large Scale computing feel - the idea of using the stack of blocks up into the clouds including sound effects for it makes it so visceral - wow, we need more of this to teach people about computing!

    • @thewhitefalcon8539
      @thewhitefalcon8539 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +4

      Color shelf octarine with the cobwebs

    • @vitulus_
      @vitulus_ 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      I would've preferred hashmaps to be explained as simply measuring some quality of the colour (the hash) and then examining the stack corresponding to that quality. It avoids checking one massive stack in favour of much smaller stacks (buckets). Still simplified, but without losing any accuracy. Furthermore, a hashmap is technically a generalisation of the 'stack' approach, so the part of the video about finding a spot in between too much space and one big stack doesn't quite make sense, given the hashmap can literally be tuned for that!

    • @CrittingOut
      @CrittingOut 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      Hashmap? No, magic chest.

  • @nitsanbh
    @nitsanbh 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +99

    Wow.
    I’m actually taking a scary Uni course called “algorithms for summarizing big data”.
    We’ve learned a variant of the algorithm you showed, but you made it orders of magnitude clearer. Thank you!

  • @cernejr
    @cernejr 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +531

    The hash function is very important here, it is a non-trivial topic itself. Also, it would be nice to mention some practical real-world uses of this algorithm.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +249

      It's both critical, but also trivial 🙂 I.e. you _must_ have a good pseudorandom hash function for this to work... but that's satisfied by any number of modern hash functions, which exist in easily obtained libraries for basically every language. So it ends up being a pretty moot point in practice. Grab a modern, fast hash function (murmurhash or similar) and use 64 or 128bit outputs and you're golden.
      I didn't want to derail the video talking about pseudorandom hash functions so just hand-waved it away :)
      Re: use-cases, it's used a lot in reporting, forensic investigations, analytics! Unique count of session IDs, IP addresses, src-dest tuples, etc etc. Typically it'll be used in a partitioned analytic scheme. Give me the count of unique session IDs over the last six months, partitioned by page on the website, country and language. That sort of thing.
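
      As a rough illustration of that partitioned-report pattern (the event fields and values here are made up for the example), each partition key gets its own distinct-count structure. Plain Python sets are used as a stand-in; in production each inner set would be an HLL sketch, so the per-partition state stays small and mergeable across servers.

      ```python
      from collections import defaultdict

      # hypothetical event stream: (session_id, page, country, language)
      events = [
          ("s1", "/home", "US", "en"),
          ("s2", "/home", "US", "en"),
          ("s1", "/home", "US", "en"),   # repeat visit: must not be double-counted
          ("s3", "/docs", "DE", "de"),
      ]

      # partition key -> distinct session IDs (a set here; an HLL sketch in production)
      partitions = defaultdict(set)
      for session_id, page, country, language in events:
          partitions[(page, country, language)].add(session_id)

      for key, sessions in sorted(partitions.items()):
          print(key, len(sessions))      # unique sessions per (page, country, language)
      ```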

    • @cernejr
      @cernejr 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +21

      @@BreakingTaps Thank you

    • @Kerpeles
      @Kerpeles 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +12

      I also started to wonder how this actually being used

    • @albertrenshaw4252
      @albertrenshaw4252 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +10

      @@Kerpeles Every password is stored as the hash of the password. When you enter your password on a frontend, it hashes it locally and sends the hash back to the database, and the database then compares the hashes of the password rather than the actual raw passwords. This is ideal because in the event of a hack only the hashes are released, not the raw passwords. The hashes are worthless to an attacker because (not mentioned in the video) they are one-way functions. Meaning you can get the hash value of an object easily but you can’t get the original object from a hash (for all intents and purposes; I already know some “ackshually” warrior is salivating with the knowledge of dictionary attacks)

    • @einsteinx2
      @einsteinx2 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

      @@albertrenshaw4252 ackshually he does mention it’s one way in the video at 4:46: “it’s a consistent, one-way transformation” ;)

  • @tanvach
    @tanvach 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +163

    I use HLL all the time at work and this is such a clear and concise explanation of how it works. Also love the animation format. Thank you for putting the video together!

    • @nothayley
      @nothayley 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +11

      Can you share what you use it for? It's a brilliant algorithm, but I'm struggling to think of concrete use cases for it.

    • @alternativeglasto
      @alternativeglasto 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +11

      @@nothayley I have also used this at work, I implemented it about eight (maybe?) years ago; I was working as a SQL compiler writer for a massively parallel database company, and called it "APPROXIMATE COUNT DISTINCT (var)" (I also implemented other APPROXIMATE aggregates using other techniques like this). As it was parallel we could do the work of doing the count of distinct values over multiple nodes, then merge the results, as this video alludes to. In my opinion that is the real point of hyper log log - being able to split the work over many nodes, then merge the results from all of them to get a final result that is close to the real answer.

    • @cannaroe1213
      @cannaroe1213 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +8

      @@alternativeglasto Yeah, throughout this I was like "oh neat, it's got a cute cardinality fingerprint thing going on" and then when he said it was possible to aggregate these scores I had to pause the video and take a moment, that's wild! I mean it's not crazy, it's just sharing information about the used-up "hash space" across a bunch of hashers working off the same map, but it's beautiful in its simplicity. Counting the number of trailing 0s is a single CPU instruction (find-first-set / count-trailing-zeros, which locates the least significant set bit). Putting something into the right bucket is as simple as masking off everything but the last N bits; that's the bucket ID. Concat these together and you have your distributable hash. Wild! It's a half-cryptographic, half-perceptual hash.
      I wish there was a wikipedia of algorithms, i'd love to buy all the programming books and compile such an encyclopedia.

    • @tanvach
      @tanvach 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

      @@nothayley we use it for product analytics via approx distinct count in SQL. The merge-ability allows distinct counts to be run in parallel on a cluster. See Presto/Trino APPROX_SET. We actually store the binary format in aggregated tables to allow fast set operations (not just counting unique users, but rolling time window unique users, churned users, etc). With a small enough error setting it’s a worthwhile trade-off. Consider that I process 10+ terabytes of data per query, and multiple petabytes of data for our org alone…

    • @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii
      @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@tanvach That really doesn't sound like something you use all the time. What I do use all the time is hash tables to count unique elements...

  • @OtherWorldExplorers
    @OtherWorldExplorers 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +14

    Missed opportunity.
    One of the cubes should have been a Portal companion cube.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

      Argh! Didn't even cross my mind, would have been perfect!

  • @alexismandelias
    @alexismandelias 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

    The information was laid out in such a clear way, I was able to effortlessly follow and even be a couple of steps ahead at some points. Good stuff.

  • @miketout
    @miketout 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +15

    I love your description of this. I've watched your hardware / instrument related videos for years, which I learned from but did not start out with the understanding I did on this one. Very impressed with your ability to break difficult subjects down clearly, and I've been leading development both in and out of companies for decades. I'm now lead dev for a large, unlimited scale, open decentralized network that will use this for some powerful applications over time. Now, I have a great explainer to point people to in order to understand it :) We've got some projects in progress that could definitely use your help and possible cooperation in exchange for strong support of your efforts. You clearly have no lack of pursuits, but I'd be happy to have a discussion about it.

  • @areadenial2343
    @areadenial2343 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +4

    This reminds me a lot of weighted reservoir sampling, especially in how you can run the algorithm in parallel and then combine the results. Instead of combining scorecards, you combine the total weights and chosen samples according to their probability. This allows you to sample a very large list of candidates with a uniform (or custom!) probability distribution.
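
    For anyone curious, here is a minimal sketch of the weighted reservoir sampling idea mentioned above (the "A-Res" keying scheme: each item gets the key u^(1/w) for a uniform random u, and the k largest keys are kept). Like HLL, partial results combine cleanly, simply by keeping the k largest keys overall.

    ```python
    import heapq
    import random

    def weighted_reservoir(stream, k):
        """stream yields (item, weight); returns k samples, chosen with probability ~ weight."""
        heap = []                                   # min-heap of (key, item), smallest key on top
        for item, weight in stream:
            key = random.random() ** (1.0 / weight)
            if len(heap) < k:
                heapq.heappush(heap, (key, item))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, item))
        return heap

    def merge_reservoirs(a, b, k):
        # combining results from parallel runs = keep the k largest keys overall
        return heapq.nlargest(k, a + b)

    part1 = weighted_reservoir(((f"a{i}", 1.0) for i in range(1000)), k=10)
    part2 = weighted_reservoir(((f"b{i}", 5.0) for i in range(1000)), k=10)
    combined = merge_reservoirs(part1, part2, k=10)
    print([item for _, item in combined])           # heavier "b" items dominate on average
    ```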

  • @adarco99
    @adarco99 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    The narration and animated diagrams were incredibly helpful in a basic understanding of something i would never even begin to grasp reading that paper. Top tier content as always!

  • @kylewhite5695
    @kylewhite5695 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    This is cool because the video can explain the algorithm to people who are beginning programmers, people who have taken a few coding classes (with the function/algorithm names in the right), and experienced programmers with the sources at the end.

  • @steven_porter
    @steven_porter 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    I like the format! Very different from your usual work, but it felt nicely polished and worked well with the topic.

  • @Scyth3934
    @Scyth3934 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +17

    hi! Interesting to see an animation-only video from you, but you did a great job on it!

  • @GiddyThis
    @GiddyThis 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I'm glad I saw a short of yours so I could come back and watch everything I've missed.

  • @WolfrostWasTaken
    @WolfrostWasTaken 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I love that you put a note at the bottom stating the concept you are explaining, so I can immediately figure out what it is exactly that you are trying to explain instead of trying to guess it from the explanation.

  • @WarlordEnthusiast
    @WarlordEnthusiast 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    This is not a video I expected from your channel but I love it!
    I like videos like this because they give me ideas for projects, implementing this algorithm with a practical use should be fun :)

  • @videoharry_
    @videoharry_ 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +16

    Omg this is amazing!! The animation is so stylish I love it! Thanks for giving something new a go - but I can’t wait for the next materials video! Doing materials at uni next year and your content has gotten me so excited and informed for it. ❤

  • @getuliomartins5005
    @getuliomartins5005 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    Your videos are always amazing! But this one is unique, incredible. But also hope you don't stop with the former format, cause those both are sooo good!

  • @perlguiman
    @perlguiman 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Please make more of these! I like this format and you have always been an exciting teacher!

  • @DarthMarioDM
    @DarthMarioDM 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

    This guy leans so hard into the "programming-as-wizardry" metaphor that his cubes can be the color of magic. Pratchett would be proud.

    • @jonaslind9505
      @jonaslind9505 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      Yeah I loved the discworld reference. Only thing missing is lots of little feet on the chest.

    • @fractalgem
      @fractalgem 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      There's a spider web on that part of the shelf, showing it never actually gets used :P

  • @harmlesscreationsofthegree1248
    @harmlesscreationsofthegree1248 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +21

    I’m not much of a computer or algorithm person, I’ve always worked with my hands, but this video really impressed me. I understand the concept and appreciate how elegant a solution it is! Once I visualise cubes or coins, my brain that works in 3D rather than abstraction, seems to latch onto the concept easily. Great vid sir!

    • @phutureproof
      @phutureproof 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      in your mind imagine putting the coins in a bag, and you ask the bag how many coins are in it, abstraction

    • @harmlesscreationsofthegree1248
      @harmlesscreationsofthegree1248 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@phutureproof I tried that and it didn’t work. Good suggestion though

  • @_abdul
    @_abdul 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    I wasn't expecting any of my favorite programming/math related channels to cover HLL, and it ended up covered in the nicest way by the most unexpected one, another of my favourite channels. Thanks for this, it was a treat ❤

  • @alex.g7317
    @alex.g7317 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Love how you took time to animate all of this! Gives a unique style to your videos

  • @eldoprano
    @eldoprano 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

    I love it! I also worked a lot with probabilistic structures, and it was such a surprise when I found out that hyper log log just counts zeros to do its magic xD
    But I mostly worked with min-hash tables and bloom filters. They are nice for getting similarity of groups ^^

  • @puilp0502
    @puilp0502 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +8

    I just understood HLL in less than 10 minutes, a concept I struggled to understand for more than 2 years. Your explanation is phenomenal!

  • @YeloPartyHat
    @YeloPartyHat 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    I've not seen your videos before but I absolutely love the way you've explained this and its uses. Fun animation too!

  • @DethWench
    @DethWench 8 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    This has helped me understand how data search optimization must work. I always wondered how SQL engines estimated cardinality in developing an execution plan. I see how the hash fits in now. Thanks for a very compelling video.

  • @jamesedwards6173
    @jamesedwards6173 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    2:32 Octarine. I approve of this reference. 🙂 (And 42 shortly after. lol)

  • @BRUXXUS
    @BRUXXUS 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +8

    This was way more interesting than I expected heading into the video. The animation was great!
    I know the example used in this was unique colors, but as someone clueless to programming, I'm curious how this is used in real world applications.

    • @AlmostM
      @AlmostM 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

      It is effectively "counting distinct values in a large set", and that could be any type of value. For example, you could count "how many different users have commented on this video".

  • @markzuckerbread1865
    @markzuckerbread1865 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Amazing video! I love how the core of the video is very simple but you still put the nitty gritty details in the bottom left corner, its a great way to address a diverse audience :)

  • @joshuajoab9313
    @joshuajoab9313 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    This is a really well made video, I dont know a whole lot about programming but the animation visuals and the good explanation helped me a lot!

  • @danielsmullen3223
    @danielsmullen3223 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

    This was a great video and you did a fine job of explaining the concepts behind HLL in simple terms.

  • @nWestie
    @nWestie 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    A wonderful video! It felt very manageable, without simplifying the problem into insignificance(I think, maybe I should try to go implement it and see...) The little notes for the programmers were especially nice, made it clearer what was actually going on very quickly for those who care and therefore are pretty likely to know what a hash is, without sacrificing intuitiveness

  • @NoblePineapples
    @NoblePineapples 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    This was a really good video, I liked the visuals, even if it was different from your usual content.
    Learning is learning, and that is great.

  • @morgan0
    @morgan0 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    It’s different, but my first guess at how it works was a method I saw discussed for estimating the total number of species: you look at how many new ones you find over time, and as that rate drops you can estimate how many are left. Also, selecting which “scorecard” to put each item on might be a little faster than the idea mentioned earlier of generating multiple hashes for each color, considering them separately, and taking the min at the end (though maybe the harmonic mean would still be better). That might be more accurate, but if the hash is slow it would be slower, because you would need to run it several times or more.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

      It's not quite the same, but the scheme you are describing is very similar to an algorithm called CountMin (and a few related algos)! CountMin builds a frequency table in a probabilistic manner like HLL, so you can ask it "how many times have you seen xyz" and it will estimate a count for that item. It does this by hashing the element multiple times and using that to increment different counts in the datastructure.
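
      Since CountMin comes up here, a minimal sketch of that idea too: d hash rows of w counters, every row incremented at its own hashed column, and a query taking the minimum across rows, so estimates can only overcount. Hashing with the stdlib's blake2b and a per-row salt is just to keep the example self-contained; real implementations use faster hashes.

      ```python
      import hashlib

      class CountMin:
          def __init__(self, width=2048, depth=4):
              self.width, self.depth = width, depth
              self.table = [[0] * width for _ in range(depth)]

          def _col(self, row, item):
              # independent-ish hash per row via a per-row salt
              h = hashlib.blake2b(str(item).encode(), digest_size=8, salt=bytes([row])).digest()
              return int.from_bytes(h, "big") % self.width

          def add(self, item, count=1):
              for row in range(self.depth):
                  self.table[row][self._col(row, item)] += count

          def estimate(self, item):
              # min over rows: collisions can inflate a row, so the estimate never undercounts
              return min(self.table[row][self._col(row, item)] for row in range(self.depth))

      cm = CountMin()
      for word in ["red"] * 100 + ["blue"] * 5 + ["green"]:
          cm.add(word)
      print(cm.estimate("red"), cm.estimate("blue"), cm.estimate("green"))   # 100 5 1 (or a bit higher)
      ```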

    • @morgan0
      @morgan0 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@BreakingTaps neat

  • @kasuha
    @kasuha 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +102

    Definitely an interesting algorithm, but I think it only has use in scenarios where the cardinality is huge. Anything with an order of magnitude less than millions can be counted exactly. For billions and above, this algorithm really makes sense.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +84

      Yep, definitely! It excels in situations like counting unique session IDs, transaction IDs, IP addresses, Source-Destination tuples, etc. Very high cardinality events that are likely stored in large (distributed) datasets across multiple servers. If the data fits on one machine, much better to just count it exactly 🙂

    • @monkeytube9
      @monkeytube9 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +27

      @@BreakingTaps Actually, it's useful for things like upvote tallying. Not for 10s of votes, but storing a few thousand user-ids for tons of topics/threads/comments takes more space than a sketch. And the HLL sketch de-duplicates votes very efficiently.

    • @muskyoxes
      @muskyoxes 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      @@BreakingTaps It sounds like being distributed is everything. There isn't an overwhelming total count of anything you listed

    • @gaboqv
      @gaboqv 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

      when the objects you want to count are very complex, this could take you from using 30 GB of RAM down to something like 100 MB, even if you aren't in the millions

    • @kasuha
      @kasuha 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

      @@gaboqv You can always count just their hashes. Statistical methods are only necessary when even the hashes can't fit in reasonable sized storage.
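
      A quick sketch of the "just count the hashes" point: store a fixed-size hash of each (possibly bulky) object instead of the object itself. The count stays exact up to the tiny chance of a 64-bit collision, and memory drops to 8 bytes of key material per unique item (plus container overhead).

      ```python
      import hashlib

      def distinct_by_hash(objects):
          seen = set()
          for obj in objects:
              # 8-byte digest per object instead of storing the whole object
              seen.add(hashlib.blake2b(repr(obj).encode(), digest_size=8).digest())
          return len(seen)

      rows = [{"user": i % 3, "payload": "x" * 1000} for i in range(10)]   # bulky rows, 3 unique
      print(distinct_by_hash(rows))                                        # 3
      ```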

  • @smellsofbikes
    @smellsofbikes 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    The content is great and really interesting, but I love your graphics design on this as well. This is really nicely explained, graphically.

  • @alipony
    @alipony 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Love the animation style! Im glad i clicked on this video. You did a great job of breaking down and explaining this topic

  • @bj_
    @bj_ 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

    6 years of watching math TH-cam and I've never seen this. Great rundown and very straightforward animation

  • @alterego3734
    @alterego3734 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

    2:46 That is _not_ the problem with hash maps. They can actually be quite space-efficient.

    • @alterego3734
      @alterego3734 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      And then there's Bloom filters too.

    • @alterego3734
      @alterego3734 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      Of course, they use exponentially more space than HLL, since they not only (approximately) count seen colors, but also keep track of which colors have been seen. So that is the real advantage of HLL.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      It's a reasonable consideration when dealing with very large datasets. If you want decent performance you need a modest load factor, and it all becomes problematic when you have to transfer all the maps to a central location for merging. But I was trying to keep the topics simple for non-programmers, so grain of salt to the explanation in the video :)
      Bloom and cuckoo filters are indeed cool. But they can saturate far more easily than HLL and give false positives, so have their own considerations like HLL.

    • @Ryan_Thompson
      @Ryan_Thompson 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

      ​ @BreakingTaps Sure, but the problem with your description of hashes was that you said (and showed) that "physics is still physics. It needs a huge amount of storage for every *potential* color, even if we didn't actually see those colors in our pile". That's not at all how hashes work! Hashes have an O(n) worst case storage efficiency, where n is the number of elements you have inserted into the hash. In this case, you'd be inserting elements like Red→1, so n would be the number of *unique* elements *you have seen,* (not all *potential* elements, as you stated) as finding another red block would simply increment Red→2 without requiring more storage (and that's (exactly!) an O(1) operation, too). Now, don't get me wrong, HLL is asymptotically *way* better, a good fit for the datasets you described, and is a very cool algorithm, but as an old Perl programmer from *way* back, I love my hashes, and don't like to see them getting an undeserved bad rap.😜

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

      Haha well, I wasn't trying to disparage hashmaps! They are great. 🙂 But trying to convey loading factor to a lay audience is not an easy task (or really relevant tbh) so I just sidestepped it. Trying to discuss how it's not actually empty shelving but multi-purpose slots that multiple colors can use, but only one at a time, so have complicated collision-resolution mechanism and the shelves then resize when they get too full... way too complicated for something that's not the video's main topic.
      I wanted folks to understand that maps trade space for performance and I think the analogy of long, empty rows of shelving is fine for that.
      And to be honest even a load factor of say 0.7 ends up being pretty expensive when working on enormous datasets. Re-allocating the map's memory multiple times as the cardinality grows is a non-negligible expense. If you don't want to pay that runtime cost you end up in a situation very much like the empty shelves: you over-allocate a large block of memory and it sits idle if you accidentally have a lower-cardinality scenario. If you're potentially in a situation where HLL makes sense -- millions to billions of distinct elements -- that's a very serious cost, and exacerbated by the merge-cost at a reduction node.
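
      A back-of-envelope version of the memory argument in this thread; the 500M-key example is from the pinned comment, while the 2^14 registers at 6 bits each are typical published HLL figures used here as assumptions.

      ```python
      n = 500_000_000                          # unique 64-bit keys, as in the pinned comment
      load_factor = 0.7
      hashset_bytes = n / load_factor * 8      # table slots alone, ignoring per-entry overhead
      hll_bytes = (2 ** 14) * 6 / 8            # assumed: 16384 registers at 6 bits each

      print(f"hash set table: ~{hashset_bytes / 1e9:.1f} GB")   # ~5.7 GB per partition, per server
      print(f"HLL sketch:     ~{hll_bytes / 1e3:.0f} KB")       # ~12 KB, cheap to ship and merge
      ```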

  • @sethputnam5925
    @sethputnam5925 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Loved the Octarine reference hidden within the Magic Chest explanation.

  • @SeaScoutDan
    @SeaScoutDan 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    This feels like drawing a circle inside a square, then counting the number of dots inside the circle vs the square, to estimate PI
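
    That analogy is the classic Monte Carlo estimate of pi, which trades exactness for a cheap estimate that sharpens with more samples, much like HLL. A minimal sketch for comparison:

    ```python
    import random

    def estimate_pi(samples=1_000_000):
        # fraction of random points in the unit square that land inside the quarter circle
        inside = sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(samples))
        return 4 * inside / samples

    print(estimate_pi())   # ~3.14; the error shrinks roughly with the square root of the sample count
    ```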

  • @ozne_2358
    @ozne_2358 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +6

    Interesting video but I would have liked to see a bit more on how this algorithm is used on in practice.

    • @surenpetrosyan1603
      @surenpetrosyan1603 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      imagine you have a large amount of data and you have to find the number of unique items among those. HyperLogLog is an efficient counting algorithm.
      you could use it at an airport company, for example, to find how many different countries your tourists come from, collapsing the “duplicates” from the same country.

  • @rammsund
    @rammsund 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    So cool! More videoes like this please 😊

  • @automaticprojects
    @automaticprojects 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Terrific video and love the range of your content! Are you still using your Avid BenchTop Pro?

  • @greenboy256
    @greenboy256 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    The machining and nano stuff videos are amazing, but I'm pretty sure the kinds of people into both of those will have a healthy interest in algorithms and computers. Great stuff!

  • @huzzzzzzahh
    @huzzzzzzahh 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +7

    This is an amazing video, and a perfect “ah-ha centered” explanation. One critique though is when using colors to distinguish things it always helps to superimpose a symbol of some kind for accessibility reasons.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

      Oh no, I totally didn't think about that! Oof 😢 Pretty big oversight on my part, should have picked something like shapes instead of colors. I'll keep that in mind for future videos, thanks for bringing it up!

    • @huzzzzzzahh
      @huzzzzzzahh 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

      @@BreakingTaps it happens! It’s so incredibly easy to take what we have for granted. There’s some good accessibility guides around too that make it a lot easier; for example there are some text/background combinations that I would never have imagined could be a problem, yet are for some people.

    • @leolen8029
      @leolen8029 5 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      @@BreakingTaps Don't worry, I'm colorblind and I still really enjoyed the video! The graphics were really incredible!

  • @goat9199
    @goat9199 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +5

    Damn, was just about to click away, but then you told me not to. Now I'm a sub. 😢

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

      Condolences! 😢

  • @M0nu5
    @M0nu5 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I love your channel and the videos you make. It's always a highlight to see new videos appearing.

  • @dominikmuller4477
    @dominikmuller4477 5 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    those multiple score cards are very necessary, especially for explaining how you can get within 2% accuracy with a scoring system that for a single score card only spits out powers of 2

  • @christianbruneau9762
    @christianbruneau9762 4 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I was asked in a tech interview how to solve approximate distinct counting over large random event streams in a near-real-time context, for a data engineering position. HyperLogLog was the expected solution but I had never heard of it. Now I've learned it from this video for next time :)

  • @lamphobic
    @lamphobic 7 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Great video/explanation. My first real project when I started at google a few years ago was to do an HLL implementation and this was a great nostalgia trip for me.
    I'd never heard of it when I was given the project so I first had to learn the basics of HLL, and then go and learn a bunch of implementation specific details that complicated it.

  • @jsrodman
    @jsrodman 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    Amusingly, I did bugfixing work on a product that implemented HyperLogLog. When it was added to the product and I was part of the technical review, I asked "What is the boundary for inaccuracy which is expected for this implementation?" and the implementor had no answer. I said "How can we ever know if it is operating correctly if we have no definition of what correct behavior is?" and the implementor had no answer.
    No one was happy that day.

  • @haraseesgill8491
    @haraseesgill8491 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    That’s such a beautiful algorithm with such a wonderful presentation. Thank you for sharing this. I think this way of presenting is highly accessible. People who understand can extrapolate the larger concepts and people who don’t get a chance to understand.

  • @user-vr3ex1fj7c
    @user-vr3ex1fj7c 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Super interesting vid bud, keep up with the good work. Waiting for more algorithm videos in the future ;)

  • @agusmelo1
    @agusmelo1 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Hey, great video, I really liked the animation style. This is my first time hearing about this algorithm and having a visual representation helps a lot. Got one question about collisions: does the algorithm have a way to deal with binary sequences that finish with the same or fewer 0's (other than having multiple counters), or are those cases not probabilistically relevant enough to be taken into consideration?

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

      Just the multiple counters! In some ways it relies on all those collisions, since _most_ of the time it will be seeing duplicate runs in the binary sequence (even if they are coming from different elements). It uses all those collisions to probabilistically ignore them basically.
      But that's one of the reasons it needs a certain threshold of data before it starts predicting good values. I pinned a comment with some more details about that, but basically HLL works best once it has "warmed up" a little with some data so that things start to average out.

  • @jlhudson4521
    @jlhudson4521 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    This is a fantastic and highly engaging, interesting video. Keep up the awesome work, look forward to seeing more like this.

  • @cubandarknez
    @cubandarknez 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    extremely useful video, and it showcases a very powerful concept that shows up in modern problem solving. Sometimes getting the exact perfect answer is extremely hard and expensive either time wise or resource wise. However, there can be clever approaches that either get you an answer for a simpler problem that is good enough, or (like this one) gives you a close enough answer to the same problem.

  • @zxkredo
    @zxkredo 8 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    This is so well explained! Even for me as a programmer it is better to hear about it in an easier way.

  • @clementboutaric3952
    @clementboutaric3952 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I had an exam inspired by Flajolet's work once. It consisted of deriving a PDE from a combinatorial problem and then using series to get probabilities. One of the most interesting exams of my life!

  • @Sylfa
    @Sylfa 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    It might not be your usual style of video, but it got me to subscribe!
    I'm aware that animation is too costly on youtube, takes too long and the algorithm punishes the lack of updates, but you could probably do the video with some props instead for quicker recording and gain most of the benefits. Or just cheat a bit with it, there's plenty of ways to do that, and more coming out with AI tools I imagine.

  • @bill_the_duck
    @bill_the_duck 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    It wasn't until I recognized the voice that I even realized what channel I was watching. An interesting departure from your usual videos, but I liked it. It fit the subject matter well.

  • @summer_rains
    @summer_rains 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Love your style of explaining and the animations!

  • @arcrad
    @arcrad 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Loved the format and animation. Very cool subject. Happy to learn about it!

  • @nsq2487
    @nsq2487 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Wish you had elaborated on why exactly this algorithm is important, giving examples rather than just saying big tech companies use it

  • @Robert_McGarry_Poems
    @Robert_McGarry_Poems 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    My takeaway: it's simpler to hash each new input and check it against the current "longest run" than it is to maintain an indexed list of unique inputs, because the latter means checking every new input against all previous inputs, every time.
    The longest run gives you a rough approximation of how many items in the original list are unique, based on how much information it takes to describe all of the unique colors. The more unique colors, the more binary information is ultimately involved. Therefore a long run of identical bits (zeros, in your example) is only likely to show up once many unique colors exist.

  • @jkid1134
    @jkid1134 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Excellent video. The process I employ intuitively trying to solve your colorful block problem is practically identical to the algorithmic development you are trying to reveal.
    I suspect you would (or do!) very much enjoy what's called the german tank problem. It feels adjacent to this, in a way.

  • @sharpfang
    @sharpfang 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    I was thinking of something a bit similar to handle floating point calculations on vastly different numbers. Adding a very small number to a very big number - so different that the small number would fall through, below the least significant digit of the big number - you change it into a chance. Like, if you'd need 10,000 × Smol to increment the least significant digit of Big, each addition of Smol has a 1 in 10,000 chance to do it alone.
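
    What this comment describes is essentially stochastic rounding. A tiny sketch of the idea (the function and numbers are illustrative): an increment too small to represent is applied with proportional probability, so it survives on average.

    ```python
    import random

    def add_probabilistic(value, small, step):
        """Add `small` to a value that can only change in multiples of `step` (e.g. one ULP)."""
        whole, remainder = divmod(small, step)
        bump = step if random.random() < remainder / step else 0.0
        return value + whole * step + bump

    total = 1e9
    for _ in range(10_000):
        total = add_probabilistic(total, 100.0, step=1e6)   # each add has a 1-in-10,000 chance to land
    print(total - 1e9)   # ~1e6 on average: the tiny additions are preserved in expectation
    ```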

  • @st3althyone
    @st3althyone 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    It was a nice change of pace, thanks for the snazzy explanation.

  • @bensonboys6609
    @bensonboys6609 8 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I love these styles of videos! I’m not currently a programmer, but I want to be! This was so fascinating and you do such a good job at explaining and illustrating what you’re trying to get across! Please I NEED more!

  • @Kreliho
    @Kreliho 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    Excellent video and animation! You've outdone yourself.

  • @Giorgosdnv
    @Giorgosdnv 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I loved the way you explained such a hard thing with a very understandable way

  • @neur303
    @neur303 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Thanks, this was very new to me!
    I would love a part 2 that talks about applications and maybe also an implementation. Are there any physics related problems that can be solved by this?

  • @3thanguy7
    @3thanguy7 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    absolutely love this video. super interesting topic, super stylishly done, really great some3 entry. also an appearance in the wild of the harmonic mean, the most underused mean

  • @AlecThilenius
    @AlecThilenius 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Awesome video! You should make one on Bloom Filters next. Those also blew my mind when I first learned about them.

    • @BreakingTaps
      @BreakingTaps  10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      Oh yeah, same here! Bloom and Cuckoo filters are definitely in my top tier list of favorite algos. Will keep it on the short list!

  • @danbromberg
    @danbromberg 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    This was a wonderful topic, graphics, and presentation...thanks!

  • @deathfxu
    @deathfxu 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    In videos like these, you should talk more about all of the different applications of the algorithm so that we can understand how it is used

  • @elisgrahn6768
    @elisgrahn6768 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Fantastic video! Loving the animations and sound effects. 😄

  • @Amonimus
    @Amonimus 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Thinking.
    Say, we have cubes
    color2 color6 color1 color2 color4 color2 color1,
    that can be encoded as a set [2, 6, 1, 2, 4, 2, 1]. We don't care about previous colors or all possible colors, we just know the value of each box we've looked through.
    Then we take the Nth prime for each value.
    [3, 13, 2, 3, 7, 3, 2]
    and multiply them together. We get the number 2^2 × 3^3 × 7 × 13.
    By factoring that number back into primes, we can tell not only how many unique values were in the set, but also how many of each there were.
    The only problem is how quickly we can get the Nth prime.

  • @zephiask1758
    @zephiask1758 5 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    We covered that in Algorithms and Data Structures in the second semester; it's such a useful concept and algorithm

  • @TheLoneWolfling
    @TheLoneWolfling 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +3

    An important point to note about HLL: don't use it if there's any possibility of adversarial inputs! This is for two reasons.
    For one, an adversary can produce inputs that artificially inflate the resulting count. This is fairly straightforward if the hash function used is not cryptographically strong - just generate inputs whose hashes end in long runs of zeros. (Redis, for instance, uses MurmurHash64A... which is a great hash in many ways, but is _not_ cryptographically secure.) Less obviously, this is still relatively straightforward even if the hash function is strong. The attacker can just generate N values, run them through the same HLL algorithm you're using, and send you only the M items that score highest, one per bucket (M being the number of buckets). (Of course, this requires O(n) work on the part of the attacker.) This makes each of those M items appear to be worth roughly N/M items each, on average. You can work around this by using an HMAC with a secure (and secret!) key, but this is seldom done as it is costly in terms of performance. (And there is no way to rekey if said key is compromised, either.)
    Another issue is that HLL is great for timing attacks. By which I mean, it is rather awful against timing attacks. For one because of the conditional update. (This part can be replaced with an unconditional writeback with a CSEL or similar... however this loses one of the attractive properties of HLLs, namely that actual writes to the HLL tend to be fairly infrequent, so you can share a single HLL across multiple cores very easily with relatively little contention.) And for another because it's essentially indexing into an array with a hash of the input object, so an adversary who can add inputs can often e.g. detect that no-one has added X recently. (...or anything else that hashes into any buckets in said cache line. It's not perfect.)

    • @drdca8263
      @drdca8263 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      Does the HMAC with a key differ from "add a (fixed, secret) salt before taking the hashes" ? (would it be less secure, provided that a cryptographic hash function is used?)

    • @TheLoneWolfling
      @TheLoneWolfling 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      ​@@drdca8263 Look up length extension attacks.
      To oversimplify, an HMAC is "just a hash with salt, but actually secure this time".
      (As a side note: if it has to be secret, it's not a salt. It's a key. Salts retain their usefulness even if the value is public; keys do not.)
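
      To make the HMAC suggestion concrete, here is a minimal Python sketch of keying the hash before it feeds the HLL. The name keyed_hash, the 64-bit truncation, and the 14-bit bucket split are illustrative assumptions, not what Redis or any particular implementation does.

      ```python
      import hmac, hashlib, secrets

      SECRET_KEY = secrets.token_bytes(32)   # must stay secret; there is no way to
                                             # rekey without rebuilding the sketch

      def keyed_hash(item: bytes) -> int:
          """HMAC-SHA256 of the item under a secret key, truncated to 64 bits."""
          digest = hmac.new(SECRET_KEY, item, hashlib.sha256).digest()
          return int.from_bytes(digest[:8], "big")

      h = keyed_hash(b"blue cube")
      bucket = h >> 50                           # top 14 bits choose one of 16384 buckets
      rest = (h & ((1 << 50) - 1)) | (1 << 50)   # low 50 bits plus a sentinel to cap the run
      run = (rest & -rest).bit_length() - 1      # trailing-zero run length fed to the HLL
      ```

      Because the attacker doesn't know SECRET_KEY, they can no longer precompute inputs that land long zero runs in chosen buckets - at the cost of an HMAC per insertion.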

  • @WilliamRoyNelson
    @WilliamRoyNelson 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Oh neat! I've used this via Apache Druid, which uses the Apache DataSketches implementation of it.
    Thank you for the explanation!

  • @pand1024
    @pand1024 8 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    That adjustable precision also allows for anonymization

  • @vheyney
    @vheyney 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    As a software engineer, I love the animation - it's sooo on point. Well done!

  • @hoodiegal
    @hoodiegal 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    very good explanation of a very interesting algorithm. great visual aids to help understanding. well done!

  • @HAMYLABS
    @HAMYLABS 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +2

    The only issue with the hashtable / magic-chests analogy is that hashtables are pretty efficient with sparse data. They do have a problem with high cardinality, but only for items actually inserted - no memory is used for colors that were never logged in this example.

    • @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii
      @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      I fail to see the practical use. In a real-world scenario a hash table gives excellent performance on a huge data set, no problem. The video creator has mentioned in the comments that it's useful when data is distributed across different machines and can't all be transferred to one place. Makes sense, but then the problem of counting unique elements is not as hard or inefficient as the video claims; the real problem is just that they can't get all the data into the same process to check uniqueness.

  • @ChiaDai
    @ChiaDai 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    more of this style of concept explanation please!

  • @elEd0
    @elEd0 8 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I am a developer and had never heard about this, and I love it!
    I am probably going to research more about it and I'll try to implement it as a fun project.

  • @SickegalAlien
    @SickegalAlien 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Probabilistic algorithms are my favourite concept (after recursion of course)
    There's something so satisfying about doing things randomly, but still getting an acceptable result

  • @robinhodson9890
    @robinhodson9890 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    It sounds helpful if you don't need absolute precision in a search result, which for something like decisions in a social media algorithm you don't - you just need a good idea of the figure. Your input data will be massive and constantly changing, so a precise value will be off by a similar margin of error pretty quickly anyway.
    Writing from an HPC perspective: since you're allowing parallel processing anyway, it sounds like it'd be a lot quicker to just merge multiple stacks via a binary tree (see the sketch below). If you want to reduce the amount of parallelism at the cost of speed, increase the number of stacking iterations: merging the results will be near-instantaneous after all, and you don't even need traditional processors for the merge.
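
    A tiny sketch of that binary-tree merge using exact per-worker sets (purely illustrative; note that exact sets grow with the number of distinct items, which is the memory cost HLL sidesteps):

    ```python
    def tree_merge(parts):
        """Pairwise (binary-tree) merge of per-worker sets of distinct items."""
        while len(parts) > 1:
            parts = [parts[i] | parts[i + 1] if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
        return parts[0]

    chunks = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {6, 7}]   # per-worker exact sets
    print(len(tree_merge(chunks)))                    # 7 distinct items
    ```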

    • @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii
      @DJ_POOP_IT_OUT_FEAT_LIL_WiiWii 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

      It's obvious that real-world computing systems do count unique elements very efficiently and accurately with a variety of algorithms - mainframes do it all the time, and could do it in the 1950s.
      As the video creator stated in a comment, the real issue is when data is distributed across different networks and you can't transfer it all to one place to perform the computation.
      I'm at a complete loss to understand why one would use this inaccurate algorithm outside of distributed data hosting scenarios. Even Facebook would benefit from more accurate metrics.
      Governments certainly count unique attributes in all their databases accurately; no reason social networks wouldn't. The processing cost is really not that high either.

  • @DFPercush
    @DFPercush 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    Never heard of this algorithm before, but you explained it so well I think I could code it easily. Nicely done!
    The only part I had to watch twice was the merging of two score cards. So the union of two scorecards A and B is just the max of each slot, right? U = {max(A[0], B[0]), max(A[1], B[1]), ... }
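
    That is indeed the merge rule: the union of two scorecards is the slot-wise maximum. A minimal sketch (illustrative, not from the video):

    ```python
    def merge(a: list[int], b: list[int]) -> list[int]:
        """Union of two HLL scorecards: slot-wise maximum (assumes equal length)."""
        assert len(a) == len(b)
        return [max(x, y) for x, y in zip(a, b)]

    # Slot-wise max is associative and commutative, so scorecards from many
    # machines can be merged in any order (or pairwise in a tree).
    print(merge([3, 0, 5, 2], [1, 4, 5, 0]))   # [3, 4, 5, 2]
    ```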

  • @AntonioNoack
    @AntonioNoack 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Chapters:
    0:00 Introduction to the problem: new HashSet(collection).size()
    1:09 Unsorted list, linear traversal
    2:18 HashSet
    3:02 HyperLogLog
    7:16 Precision / Memory Usage / Stability
    8:21 Parallelization
    10:39 Ad ;)

  • @ViralKiller
    @ViralKiller 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    I found a flaw in this algorithm. It assumes that the data set being inputted has been evenly randomised. It will not work if there are, say, 1000 blue cubes at the beginning of the data set, and only after the 1000th cube do other variations begin to appear. So you would have to scramble your data before feeding it into the algo.

  • @tylerbakeman
    @tylerbakeman 9 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™ +1

    Reminds me of soundwaves.
    Imagine mapping each cube color to an integer (easy, since colors already are integers in hexadecimal).
    Then treat those integers as unique frequencies. Given the full set of integers, we could represent that data as a wave - all of the integers added together, or something.
    Then, we can revisit that wave, any time, and test if a smaller “wavelet” exists within it.
    Wavelets are already a thing too. Cool stuff.

  • @joshroolf1966
    @joshroolf1966 10 āļŦāļĨāļēāļĒāđ€āļ”āļ·āļ­āļ™āļāđˆāļ­āļ™

    That was a very clear presentation and your animations are great!