18: Count Unique Active Users | Systems Design Interview Questions With Ex-Google SWE

Comments • 20

  • @idiot7leon
    @idiot7leon 2 months ago +4

    Brief Outline
    00:00:22 Unique Users Problem
    00:00:55 Problem Requirements
    00:01:29 Terminology/Paradigm Overview
    00:03:42 Session Expiration
    00:05:21 Change Data Capture
    00:06:37 Calculating Number of Unique Active Users - Naive
    00:08:42 Calculating Number of Unique Users - Better
    00:10:50 Calculating Number of Unique Users - Best
    00:11:58 Calculating Number of Unique Users - Best Approach
    00:14:00 Generating Derived Data
    00:16:11 The Problem With Snapshots
    00:17:49 HyperLogLog
    00:19:35 HyperLogLog Example
    00:20:45 Building Our HLLs
    00:22:46 Reading Our HLLs
    00:23:59 Final Diagram - Count Unique Active Users
    Thanks, Jordan~

  • @philiphartman-queen1216
    @philiphartman-queen1216 2 months ago +1

    Struggler completes epic YouTube vid with severe case of ligma 🙏🏻 thankful for your work as always

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Thanks I'm gonna have to get over my ligma rn

  • @Rohit-hs8wp
    @Rohit-hs8wp a month ago +1

    Thank you for the videos. CDC is such a powerful tool and can be used in many system designs.
    In the derived data, where you build a TSDB table to find sessions by start/end time, you are partitioning on start/end time.
    -> If we partition by a hash of the start/end time (hash-based partitioning), wouldn't a query fetching sessions by start/end time be scatter/gather in nature, meaning we have to read from multiple partitions? Reads would be slow.
    -> If we partition by ranges of start/end time (range-based partitioning), wouldn't we create a hotspot partition that all the writes go to?
    Here we also have to read from multiple partitions if the start/end time range doesn't fall on one partition (less common), but reads are better in this case than in the previous one.
    BTW, the query for finding the unique count is somewhat similar to Range Sum Query with Range Update (maybe).

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      1) Agree, range-based partitioning is the way to go here
      2) This is why I propose using derived data: we can buffer the writes and upload them in batches. There will certainly be hot partitions, but we can also introduce either caching or re-partitioning within a given range to keep the amount of scatter/gather low.
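
      A minimal sketch of what this exchange describes: range-based partitioning on session end time, with writes buffered and flushed in batches so the hot (most recent) partition sees bulk uploads rather than one write per session. This is Python with illustrative assumptions (hourly boundaries, a made-up flush threshold, hypothetical names), not the video's exact design:

      import bisect
      from collections import defaultdict

      class RangePartitionedSessionStore:
          """Route sessions to time-range partitions and buffer writes."""

          def __init__(self, boundaries, flush_threshold=1000):
              self.boundaries = boundaries      # sorted epoch-second cutoffs, e.g. hourly
              self.flush_threshold = flush_threshold
              self.buffers = defaultdict(list)  # partition index -> pending rows

          def partition_for(self, ts):
              # Range-based partitioning: find which time range ts falls into
              return bisect.bisect_right(self.boundaries, ts)

          def write(self, user_id, start_ts, end_ts):
              p = self.partition_for(end_ts)
              self.buffers[p].append((user_id, start_ts, end_ts))
              if len(self.buffers[p]) >= self.flush_threshold:
                  self.flush(p)

          def flush(self, p):
              batch = self.buffers.pop(p, [])
              # In a real system this would be one bulk write to partition p's node
              print(f"flushed {len(batch)} sessions to partition {p}")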

  • @parvesh7058
    @parvesh7058 2 months ago +6

    That mustache is straight fire! How many feet pics did you score with that bad boy?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +2

      I feel like I'm always sending them and no one ever sends them back 😞

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      @@response5940 empty body

  • @a.m.6973
    @a.m.6973 a month ago +1

    Another randomized idea: when the clients send their heartbeats, we forward them to a single server which, every minute, inserts the user id into a hash set with probability 0.001 (1/1000). This way, even if we have a billion users, the expected size of the hash set is only 1 million (fits in RAM). At the end of a minute, the approximate number of unique users is 1000 times the size of the hash set, and we clear the hash set afterwards. If we're worried about too many requests going to a single server, we can do this across more servers partitioned by user id and just sum the results (sketched below, after this thread).

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Seems reasonable to me (though my stats knowledge isn't great lol) - my one suggestion would be to do the randomization client side so that you don't overload the network on your single server.

    • @a.m.6973
      @a.m.6973 a month ago

      @@jordanhasnolife5163 Nice, makes sense.
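
      A minimal sketch of this sampling estimator with the client-side randomization folded in (Python; the deterministic hash-based coin flip is an added assumption, not from the thread — it keeps the decision consistent across a single user's many heartbeats):

      import hashlib

      SAMPLE_RATE = 1 / 1000  # as proposed above

      def client_should_report(user_id: str) -> bool:
          # Client-side coin flip on the user id: a given user is either
          # always sampled or never sampled, and only ~0.1% of heartbeats
          # ever reach the counting server.
          h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
          return h % 1000 == 0

      class SampledUniqueCounter:
          """Server-side hash set of sampled user ids for one minute."""

          def __init__(self):
              self.sampled_users = set()

          def on_heartbeat(self, user_id: str):
              self.sampled_users.add(user_id)  # only sampled users arrive here

          def estimate_and_reset(self) -> float:
              # Each sampled user stands in for ~1000 users in expectation
              estimate = len(self.sampled_users) / SAMPLE_RATE
              self.sampled_users.clear()
              return estimate

      Because the hash decision doesn't depend on which server a user lands on, partitioning by user id and summing the per-server estimates works exactly as described above.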

  • @SicknessesPVP
    @SicknessesPVP 2 months ago +1

    Great video!

  • @ichthyz
    @ichthyz 2 months ago +1

    What do you think would be more accurate? 1) (as in the video) averaging the HLL register values and then doing 2^avg, or 2) calculating 2^register for each register and then taking the average?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      I'm no mathematician, but I'm guessing the former; feel free to read the HLL paper though, as they'll be able to explain it way better than I ever could
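
      For reference, a toy comparison of the two options (Python; the register values are made up, and the alpha bias-correction constants are omitted). The original LogLog paper uses the "average the registers, then exponentiate" form, while HyperLogLog replaced it with a harmonic mean of 2^-M_j, which is far less sensitive to one outlier register:

      # Hypothetical register values (max observed leading-zero counts)
      # from m = 8 substreams; real sketches use many more registers.
      registers = [3, 4, 2, 5, 3, 4, 3, 6]
      m = len(registers)

      # Option 1 (as in the video): average the registers, then do 2^avg.
      # Up to a bias constant, this is the LogLog estimator.
      avg_then_exp = m * 2 ** (sum(registers) / m)

      # Option 2: compute 2^M_j per register, then average. The arithmetic
      # mean is dominated by the largest register, so one unlucky hash can
      # blow up the estimate.
      exp_then_avg = m * (sum(2 ** r for r in registers) / m)

      # What the HyperLogLog paper actually uses: a harmonic mean of the
      # per-register estimates.
      harmonic = m * m / sum(2 ** -r for r in registers)

      print(avg_then_exp, exp_then_avg, harmonic)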

  • @terminallychill8029
    @terminallychill8029 2 months ago +1

    Awesome great vid! Can you do a system design for a banking system?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      What do you see the functional requirements/complexity of a banking/payment system being?

  • @NBetweenStations
    @NBetweenStations a month ago +1

    Are the Session DB sharded by user ID and the Session DBs indexed by start time and end time three different databases, or are they the same DB with multiple indexes on the columns?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Most likely different DBs I'd say; it becomes tough to make them the same DB when we have tons of sessions, since otherwise we'd need to use local indexes