18: Count Unique Active Users | Systems Design Interview Questions With Ex-Google SWE
- Published Jun 15, 2024
- Turns out there was just 1 unique user generating the entire million dollars of revenue on my OF, wonder who it could have been
- Science & Technology
Brief Outline
00:00:22 Unique Users Problem
00:00:55 Problem Requirements
00:01:29 Terminology/Paradigm Overview
00:03:42 Session Expiration
00:05:21 Change Data Capture
00:06:37 Calculating Number of Unique Active Users - Naive
00:08:42 Calculating Number of Unique Users - Better
00:10:50 Calculating Number of Unique Users - Best
00:11:58 Calculating Number of Unique Users - Best Approach
00:14:00 Generating Derived Data
00:16:11 The Problem With Snapshots
00:17:49 HyperLogLog
00:19:35 HyperLogLog Example
00:20:45 Building Our HLLs
00:22:46 Reading Our HLLs
00:23:59 Final Diagram - Count Unique Active Users
Thanks, Jordan~
Thanks Leon!!
Struggler completes epic YouTube vid with severe case of ligma 🙏🏻 thankful for your work as always
Thanks I'm gonna have to get over my ligma rn
Thank you for the videos. CDC is such a powerful tool and can be used in many system designs.
In the derived data where you build the TSDB table to find sessions by start/end time, you partition based on start/end time.
-> If we partition by hash of start/end time (hash-based partitioning), wouldn't fetching sessions by start/end time become a scatter/gather query, meaning we have to read from multiple partitions? Reads would be slow.
-> If we partition by range of start/end time (range-based partitioning), wouldn't we create a hotspot partition that all the writes go to?
Here we still have to read from multiple partitions if the start/end time range doesn't fall within a single partition (less common), but reads are better in this case than in the previous one.
BTW, the query for finding the number of unique users is somewhat similar to Range Sum Query with Range Update (maybe).
1) Agree range based partitioning is the way to go here
2) This is why I propose using derived data: we can buffer the writes and upload them in batches. There will certainly be hot partitions, but we can also introduce either caching or re-partitioning within a given range to keep the amount of scatter-gather low.
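To make the "buffer the writes and upload them in batch" idea from the reply above concrete, here's a minimal sketch. It assumes a hypothetical `flush_fn` standing in for a batch insert into the time-partitioned session store, and uses the hour of the session start time as the (hypothetical) range-partition key:

```python
from collections import defaultdict

class BufferedSessionWriter:
    """Buffer session rows in memory, then flush per-partition batches.

    Sketch only: `flush_fn(partition, rows)` stands in for a batch
    write to the TSDB partition; one partition per hour of start time.
    """

    def __init__(self, flush_fn, max_buffered=1000):
        self.flush_fn = flush_fn
        self.max_buffered = max_buffered
        self.buffers = defaultdict(list)  # partition -> buffered rows
        self.count = 0

    def write(self, user_id, start_ts, end_ts):
        partition = int(start_ts // 3600)  # range partition: one per hour
        self.buffers[partition].append((user_id, start_ts, end_ts))
        self.count += 1
        if self.count >= self.max_buffered:
            self.flush()

    def flush(self):
        # One batched write per partition instead of one write per session,
        # which softens the hot-partition problem on the newest time range.
        for partition, rows in self.buffers.items():
            self.flush_fn(partition, rows)
        self.buffers.clear()
        self.count = 0
```

Batching trades a little write latency for far fewer requests against the hottest (most recent) partition.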
That mustache is straight fire! How many feet pics did you score with that bad boy?
I feel like I'm always sending them and no one ever sends them back 😞
Another randomized idea: when the clients are sending their heartbeats, we forward them to a single server which, every minute, inserts a user-id into a hash set with probability of 0.001 (1/1000). This way even if we have a billion users the expected size of the hash set is only 1 million (fits in RAM). At the end of a minute, the approximate number of unique users is 1000 times the size of hash set. And we clear the hash set after. If we're worried about too many requests going to a single server, we can do this across more servers partitioned by user-id's and just sum the result.
Seems reasonable to me (though my stats knowledge isn't great lol) - my one suggestion would be to do the randomization client side so that you don't overload the network on your single server.
@@jordanhasnolife5163 Nice, makes sense.
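A minimal sketch of the sampling idea from this thread, with one tweak: rather than an independent coin flip per heartbeat (which would count a frequently-heartbeating user more than once), it samples deterministically on a hash of the user id, so the same user is either always kept or always dropped. All names here are hypothetical:

```python
import hashlib

SAMPLE_RATE = 0.001  # keep roughly 1 in 1000 users

def _keep(user_id: str, rate: float = SAMPLE_RATE) -> bool:
    # Deterministic per-user "coin flip": hash the id into [0, 1).
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return (h % 10**9) / 10**9 < rate

def estimate_unique_users(heartbeat_user_ids, rate: float = SAMPLE_RATE) -> float:
    # The sampled set holds ~rate * (true uniques) ids, so scale back up.
    sampled = {uid for uid in heartbeat_user_ids if _keep(uid, rate)}
    return len(sampled) / rate
```

Because `_keep` depends only on the user id, it can also run client-side (as suggested above) or on servers partitioned by user id, with the per-partition counts summed at the end.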
Great video!
What do you think would be more accurate? 1) (as in the video) Averaging the HLL values and then doing 2^avg. or 2) Calculating 2^HLL and then taking the average.
I'm no mathematician, but I'm guessing the former, feel free to read the HLL paper though as they'll be able to explain way better than I ever could
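For the curious: option 1 (average the registers, then take 2^avg) is a geometric-mean estimator, which is what the earlier LogLog algorithm used; the HyperLogLog paper's improvement is to use a harmonic mean of the 2^register values instead, which has lower variance. Here's a toy sketch comparing the two on a single machine. The constants are the standard published ones, but this omits HLL's small/large-range corrections, so it's a sketch rather than a production implementation:

```python
import hashlib

M = 256  # number of registers

def _hash(x: str) -> int:
    return int(hashlib.sha256(x.encode()).hexdigest(), 16)

def build_registers(items, m=M):
    regs = [0] * m
    for item in items:
        h = _hash(item)
        bucket = h % m
        rest = h // m
        # rank = 1 + number of trailing zero bits (same distribution
        # as the leading-zero convention for a uniform hash)
        rank = 1
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        regs[bucket] = max(regs[bucket], rank)
    return regs

def estimate_geometric(regs):
    # Option 1: average the registers, then 2^avg (LogLog-style).
    m = len(regs)
    return 0.39701 * m * 2 ** (sum(regs) / m)

def estimate_harmonic(regs):
    # HyperLogLog: harmonic mean of the 2^register values.
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in regs)
```

Both land close to the true cardinality for large inputs; the harmonic-mean version is less sensitive to a single unluckily-large register, which is where its variance win comes from.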
Awesome great vid! Can you do a system design for a banking system?
What do you see the functional requirements/complexity of a banking/payment system being?
Are the session DB sharded by user ID and the session DBs indexed by start time and end time three different databases, or the same DB with multiple indexes on those columns?
Most likely different DBs I'd say; it becomes tough to make them the same DB when we have tons of sessions, since otherwise we'd need to use local indexes