17: Top K Leaderboard | Systems Design Interview Questions With Ex-Google SWE

Comments • 78

  • @idiot7leon
    @idiot7leon 2 months ago +8

    Brief Outline
    00:00:23 Find the Top K...
    00:01:16 Problem Requirement
    00:02:16 Overall Approach
    00:03:23 Precise Solution
    00:06:11 Counting Phase
    00:08:57 Local Top K Step
    00:10:51 Global Top K
    00:12:27 Speeding Things Up - Approximation
    00:15:07 Windowed Top K, How To Calculate
    00:18:10 Windowed Top K, How To Read
    00:19:45 Publishing Faster Results
    00:21:11 Count Min Sketch
    00:23:19 Count Min Sketch - Continued
    00:26:01 Count Min Sketch - Continued
    00:27:44 Count Min Sketch - Fault Tolerance
    00:28:48 Final Diagram - Top K Problem
    Thanks, Jordan~

  • @skinnycucumber
    @skinnycucumber 2 months ago +4

    I have my first system design interview tomorrow and I've been cramming all your videos. Great content!

  • @mimanshabhargav3266
    @mimanshabhargav3266 a month ago +2

    I got this problem for my on-site today. Thanks to you I had some idea on how to do it.

  • @rakeshvarma8091
    @rakeshvarma8091 26 days ago +1

    As usual, great video Jordan.

  • @aliz166
    @aliz166 2 months ago +2

    I love your videos man! Any chance you might cover an ad click aggregation system soon? It's a bit similar to this actually, but of course has its own nuances. Thanks again for your help!

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Hey Ali! I'd just watch this one and the tinyurl video, curious to hear what else you feel this incorporates beyond that

  • @truptijoshi2535
    @truptijoshi2535 10 days ago +1

    Hi Jordan! I have some questions.
    1. For the precise solution, the spark node aggregates the event IDs. Why are we aggregating it again in Hadoop?
    2. After aggregating, we are doing local top k and then global top k. Suppose the local top k of 3 nodes look like this :
    1000->999->998->997->996 etc
    100->99->98->97
    200->300->400
    If we are only taking the top 3 on each node, we might be missing out on some of the bigger values like 997->996, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 days ago

      1) The stream consumer is just putting tuples of (eventId, timestamp) into hadoop via parquet files.
      2) We do a local top k after aggregating as you mentioned, which means we did a shuffle to get all events with the same id on the same node. So we have the full counts of each event. Then, once we get the local top k of the eventIds that were hashed to a given node, we can do a global top k.
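
    A minimal Python sketch of that two-phase flow (full counts per ID on one node after the shuffle, then per-node top k, then a merge); the function names and toy data are illustrative assumptions, not from the video:

    import heapq
    from collections import Counter

    def local_top_k(partition_events, k):
        # After the shuffle, this node sees every occurrence of its event IDs,
        # so these counts are complete, not partial.
        counts = Counter(partition_events)
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

    def global_top_k(local_results, k):
        # Merge the per-node top-k lists into the overall top k.
        merged = [pair for result in local_results for pair in result]
        return heapq.nlargest(k, merged, key=lambda kv: kv[1])

    # Example: three partitions, events already shuffled by ID
    partitions = [["a", "a", "b"], ["c", "c", "c", "d"], ["e"]]
    local_lists = [local_top_k(p, k=2) for p in partitions]
    print(global_top_k(local_lists, k=2))  # [('c', 3), ('a', 2)]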

  • @monihazarika5628
    @monihazarika5628 2 months ago +1

    I really liked the content and the way the technical concepts around building an approximate vs. real-time leaderboard were presented. I'm looking for some insights on how fantasy gaming platforms like Dream11 work; it would make a great video for your India fans, Jordan!!

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Thanks for the kind words! I've never heard of dream11 so I'd have to do some research there!

  • @kevinding0218
    @kevinding0218 a month ago +1

    Thanks for the great content! I have a question about maintaining a K-size min-heap on a Spark streaming node. Since this is a streaming process, it's difficult to know the exact count of a specific keyword in real time as I assume the count may increase over time. If we only maintain a K-size heap, could we end up losing earlier counts for some keywords? Please correct me if I'm wrong.

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      Yup you can't do this if you have active incoming counts - you'd need to keep a sorted list (or treeset or something) of all incoming events to be precise

  • @Anonymous-ym6st
    @Anonymous-ym6st 2 months ago +1

    Thanks for the great content as always! One quick question (I might have missed it in the video): why don't we store more aggregated data (the full counts, not just the top k for every hour) in the approximation + fast solution? It wouldn't be more data than the precise solution, right? And it wouldn't sacrifice accuracy? Or is the trade-off not at pre-calculation time but at read time, i.e. when querying, say, a 3-hour window, summing the full counts would be more expensive than merging each hour's top k?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      Having a little bit of trouble understanding you but:
      1) That's a ton of data to store
      2) Now to calculate the top k over all of those windows I need to aggregate the counts for every event, which is slow!

  • @vincentchee4108
    @vincentchee4108 2 months ago +1

    Thanks for the video. I'm a bit curious about the need for a TSDB. I assume a Postgres DB with time-range partitioning would suffice for this use case. With time-range partitioning, you can still do:
    WITH cte AS (SELECT event_id, SUM(count) AS sum FROM table WHERE timestamp BETWEEN t1 AND t2 GROUP BY event_id)
    SELECT * FROM cte ORDER BY cte.sum DESC LIMIT k

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      I imagine this would work, but would be curious to see how the benchmarks actually look between them

  • @SunilKumar-qh2ln
    @SunilKumar-qh2ln 2 months ago +4

    finally this problem, Giga Chad of system design for a reason

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      With great power (a 3 inch schlong) comes great responsibility

  • @harshbhatt_
    @harshbhatt_ 2 months ago +1

    Hi Jordan, love the content! Quick question - how would this solution change if we need to find the top K for each user?
    Ex: Rather than finding the top 10 songs globally in the past 24 hours; we want to find the top 10 songs for each user in past 1day/7days/etc? (No complicated ML model; just based on counts of each song played)

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Makes sense! This probably becomes a bit easier then actually in the sense that you could potentially store all of this data on a single node per user.
      Basically, we can sink all data to a time series database, partitioning it by user id and ordered by timestamp. Then we can just do the exact computation, since we're hitting just one node at a time.

    • @saber3112
      @saber3112 2 months ago +2

      We can use Flink's windowing capabilities to create time windows for the specified time ranges (e.g., 1 day, 7 days), then aggregate the data within each window to compute the count of each song played by each user. Then, for each window and each user, compute the top N songs based on their play count.
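
    A rough Python sketch of that per-user, per-window aggregation (plain dicts standing in for Flink's keyed windows; the window size and data shapes are assumptions):

    import heapq
    from collections import defaultdict

    WINDOW_SECONDS = 3600  # 1-hour tumbling windows

    def top_n_per_user(plays, n):
        # plays: iterable of (user_id, song_id, epoch_seconds)
        counts = defaultdict(lambda: defaultdict(int))
        for user_id, song_id, ts in plays:
            window_start = ts - (ts % WINDOW_SECONDS)
            counts[(user_id, window_start)][song_id] += 1
        # Top N songs per (user, window), e.g. an hourly per-user leaderboard
        return {
            key: heapq.nlargest(n, songs.items(), key=lambda kv: kv[1])
            for key, songs in counts.items()
        }

    plays = [(1, "songA", 10), (1, "songA", 20), (1, "songB", 30), (2, "songC", 40)]
    print(top_n_per_user(plays, n=2))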

  • @mcee311
    @mcee311 2 months ago +1

    In the time series DB, how are you differentiating the results from the count min sketch versus the ones from Spark Streaming? Does the Spark one just override the count min sketch?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      I think you could do either implementation but considering the count min sketch data is pretty small I figure you may as well keep it around

  • @parthpanchmatia224
    @parthpanchmatia224 2 months ago +1

    Hi Jordan, what iPad app do you use to draw the system design diagrams and present them? I have an iPad so this info will help me put my system design learnings to test

  • @jiananstrackjourney370
    @jiananstrackjourney370 15 hours ago +1

    Thanks Jordan, I really appreciate the video. One question: is there any specific reason Kafka has to partition on event ID? Even if it doesn't, the downstream Spark Streaming will still be able to count and aggregate the data, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  56 minutes ago

      Yes but it's going to take a lot less data shuffling if we partition by event Id

  • @harman654321
    @harman654321 11 days ago +1

    I think you can use Redis to store the intermediate count min sketch from each Flink job, and aggregate

    • @jordanhasnolife5163
      @jordanhasnolife5163  10 days ago

      Can you elaborate on this? I'm not sure in the flow of the problem where it belongs.

  • @hrjhonor
    @hrjhonor 2 months ago +1

    Great video! Two questions: what's the downside of having the first Spark Streaming stage write its local top k results directly to the time series DB, given the TSDB has many aggregation capabilities built in? The other question is: is it possible to use Redis + CDC/Kafka, where the real-time result is computed from Redis?

    • @hrjhonor
      @hrjhonor 2 months ago +1

      To extend the first question: why can't we use the TSDB in place of HDFS? Is it mainly a concern about the ingestion rate?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      1) You actually probably could do all of the writes to the time series database, and then run a big query on it! But keep in mind that we'd then have to store each of the events with their timestamps in the tsdb, which is a lot of data. Also, I do wonder (maybe it wouldn't) if the TSDB would try to perform some locking for a read only query (hadoop definitely wouldn't as the files are immutable). Great point though - I also do think that a spark job allows us to be specific for how we want to do the counting and joins, whereas I'm not sure how optimized a TSDB query would be for this.
      2) Yeah you could probably just put all of the local top k counts in the tsdb, but I'd hope that you'd only actually store the global top K (meaning the local top k's should overwrite one another). Otherwise we have ourselves a pretty big storage cost. I was thinking about the overwriting of data though to merge sorted lists, and I felt that in a db on disk that's probably quite a bit slower than just doing this in memory and then sinking it to a db.

    • @hrjhonor
      @hrjhonor 2 months ago +1

      Thanks for the thoughts!

  • @dinar.mingaliev
    @dinar.mingaliev 2 months ago +1

    hi Jordan, hope you still have more DMs than Megan :)
    Could you please clarify how data gets from the count min sketch to the time-series DB? Basically it's a matrix with dimensions (# of hash functions) x (the modulus we use for hashing). The bigger the matrix, the more counts we can store there, kind of like a dictionary. But it doesn't store keys, i.e. the event IDs whose occurrences we want to count. How do we reconstruct those IDs from the count min sketch?

    • @dinar.mingaliev
      @dinar.mingaliev 2 months ago +1

      For example, if we keep counts of video types in the count min sketch, we need to know all possible types and query for them.
      If we keep video IDs, then to approximately count every video we again have to query our CM sketch as many times as there are videos, right?
      We can also shard the count min sketch :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Hi Dinar! In the export to TSDB phase, we need to basically loop over every possible ID, and get a count for it, and then export that count to the tsdb. We could always pull these Ids from a database or something.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Good point though! It may be nice to have a second set of nodes listening to Kafka to ensure that we have a list of all the keys in some sort of database
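
    A toy Python count min sketch showing why the export needs an external list of IDs (the sketch stores only counters, never keys); the width/depth, hashing scheme, and the source of known_ids are assumptions:

    import hashlib

    class CountMinSketch:
        def __init__(self, width=1024, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, key):
            for row in range(self.depth):
                digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
                yield row, int(digest, 16) % self.width

        def add(self, key, amount=1):
            for row, col in self._buckets(key):
                self.table[row][col] += amount

        def estimate(self, key):
            # May overestimate due to collisions, never underestimates
            return min(self.table[row][col] for row, col in self._buckets(key))

    cms = CountMinSketch()
    for event_id in ["click", "click", "view"]:
        cms.add(event_id)

    # Export step: loop over a separately maintained list of IDs
    # (e.g. a table fed by a second set of Kafka consumers, as mentioned above).
    known_ids = ["click", "view", "purchase"]
    rows = [(event_id, cms.estimate(event_id)) for event_id in known_ids]
    print(rows)  # these (id, approximate count) rows get written to the TSDB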

  • @MrBrown78
    @MrBrown78 2 months ago +1

    Got an interview today, been spamming your videos and they've been really helpful! Hopefully I can perform tho XD

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      Best of luck! Make sure to pop a Viagra before

    • @MrBrown78
      @MrBrown78 2 months ago +1

      @@jordanhasnolife5163 Popped them shits like they were fruit gummies but ended up getting rejected. Man fuck, time to drink the pain away🥳

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      @@MrBrown78 lol no worries, what was the problem and how do you feel it could have gone better

    • @MrBrown78
      @MrBrown78 2 months ago

      ​@@jordanhasnolife5163 Asked me to design a system to allow employees within an organization to connect with those they normally wouldn't contact. I aimed for a simple messaging platform because I didn't really understand what they were looking for and they said it was supposed to be built with a timeline of 2 weeks. I approached it same way you designed whatsapp, but I think I forgot some design choices during it.

  • @andrewcisternino8271
    @andrewcisternino8271 7 days ago +1

    Hi Jordan... So I'm under the impression the time series DB is storing the top k per hour in each chunk. Yet your slide for consumption via the leaderboard service has us using a hashmap to aggregate across 3 separate hours. Wouldn't the consumption be the same process as 'Global top k', using the 'merge sorted lists' leetcode approach, since we are pulling 3 separate top k lists (1 per hour)? I guess I'm also wondering what format the TSDB stores the data in. Thanks for the videos.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 days ago +1

      This is a bit different here because the top K from those hour intervals likely overlap so you'll need to use a hashmap to merge the total counts together first. Then from there, nothing is sorted, so we'll just have to use a heap of size k and loop over the events to figure out the new top k in the 3 hour interval.
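
    A small Python sketch of that read path (hashmap merge of the hourly lists, then a size-k heap over the unsorted totals); the sample data is made up:

    import heapq
    from collections import defaultdict

    def merge_hourly_top_k(hourly_lists, k):
        totals = defaultdict(int)
        for hour in hourly_lists:
            for event_id, count in hour:  # the same ID can appear in several hours
                totals[event_id] += count
        # totals is unsorted, so take the top k with a size-k heap
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    hours = [
        [("a", 50), ("b", 40)],
        [("b", 30), ("c", 25)],
        [("a", 10), ("d", 5)],
    ]
    print(merge_hourly_top_k(hours, k=2))  # [('b', 70), ('a', 60)]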

    • @andrewcisternino8271
      @andrewcisternino8271 2 days ago

      @@jordanhasnolife5163 This makes sense. Thanks.

  • @andreybraslavskiy522
    @andreybraslavskiy522 a month ago +1

    Hi Jordan, thank you for the great video. Can you please help me understand how the count min sketch and the Spark jobs work together? Do we use the count min sketch only for the current, unfinished hour, and does the Spark job overwrite the count min sketch result with a more precise one as soon as it finishes? Do we reset the count min sketch every hour? If a user queries the last 3 hours, does that mean the Spark job results are always available for previous hours and we don't need the count min sketch results at all? Actually, I can't come up with the scenario where the count min sketch data is used; should it be combined with the Spark results somehow?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      I'd think which implementation/how you combine them comes down to the application itself. It's your choice whether this is a worthy enough problem to even run a spark job for. If it is, it depends how often you run it. I think having a count min sketch available just makes the hour by hour computation a bit cheaper compared to something like flink where we actually would perform the counts on a per hour basis.
      I don't think you'd really ever combine the results from spark, if you have the spark results, you have the real answer.

    • @andreybraslavskiy522
      @andreybraslavskiy522 a month ago

      @@jordanhasnolife5163 Thank you!

  • @Henry-wq7ki
    @Henry-wq7ki 2 months ago +1

    Why have a count min sketch server in addition to the global top k in the time series DB?
    Aren't they redundant?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      In theory the count min sketch server would be able to expose metrics within a shorter period of time, albeit with less accuracy than the global top k

  • @tunepa4418
    @tunepa4418 17 days ago +1

    Thanks for the video. How do we get the top k from the count min sketch? I assume the count min sketch is just for counting, not for getting the top k.

    • @jordanhasnolife5163
      @jordanhasnolife5163  16 days ago

      In theory you can maintain an in memory heap of size k with the counts of the top k elements as you compute the count of each while using the count min sketch

  • @GuntherJones
    @GuntherJones 2 months ago +1

    For the top k don't you want a max heap? Thank you so much for these.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +2

      Nope min heap! If I want to get the top k I want to quickly compare with the smallest element of the current top k
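
    A tiny Python illustration of that point: the root of a size-k min-heap is the smallest of the current top k, so each new count needs only one comparison against it (sample counts invented):

    import heapq

    def update_top_k(heap, item, count, k):
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:                    # heap[0] is the smallest of the top k
            heapq.heapreplace(heap, (count, item))  # evict it

    heap = []
    for item, count in [("a", 5), ("b", 9), ("c", 2), ("d", 7)]:
        update_top_k(heap, item, count, k=2)
    print(sorted(heap, reverse=True))  # [(9, 'b'), (7, 'd')]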

  • @harshitnarang6354
    @harshitnarang6354 2 months ago +1

    Chad content, thanks Jordan !
    Will it be okay for you to share the notes ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Thanks Chad! I'll share them, but planning to do so in batch in a few weeks. I'm out of the country at the moment so I don't have much time to export them.

  • @gabbah79
    @gabbah79 2 months ago +2

    Love the content. Just wish you would use a more legible font; it really slows down my reading. 😢

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Sorry will try to work on this and make it a bit more clear

  • @varunigarg6492
    @varunigarg6492 22 days ago +1

    Why can’t we maintain state in Flink for some x hours and run the aggregation there itself? That would remove the need for Spark Streaming, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  22 days ago

      Well what about after those few hours? How do we combine hourly intervals to get the top k over the aggregate?

    • @varunigarg6492
      @varunigarg6492 22 days ago +1

      In Flink, we can create 2 process functions: 1) one which computes the top N at an hourly cadence (can be used to serve fast viz use cases, via Druid for example) and 2) one which computes the top N from the last 24 hourly computed results. The eviction policy is time based. To store state we can make use of the RocksDB state backend, which stores state as serialized bytes on disk with an in-memory cache.
      So all functionality is encapsulated within Flink.

    • @jordanhasnolife5163
      @jordanhasnolife5163  21 days ago

      @@varunigarg6492 Sure but these aren't accurate results. If I want to know the top K for the last week, how would I do that? You could combine the top k from each smaller interval within that week, but you then don't consider that an event that wasn't included from the top k of each mini interval could be in the total top k

  • @AP-eh6gr
    @AP-eh6gr a month ago +1

    17:42 each listening to a partition on a kafka broker to be more precise I guess..? correct me if wrong

  • @oakvillian5
    @oakvillian5 2 months ago +1

    Bro couldn't stay away from amsterdam

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Bro had to go back for his top two hobbies

  • @explorer9004
    @explorer9004 2 months ago +1

    let's get to Google bros

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +2

      Best of luck my friend - just remember don't harp on one dream company and take the best offer you get :)

  • @aa-kj5xi
    @aa-kj5xi 2 months ago +2

    any plans to start a merch shop for feet pics, farts, or bath water?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      Yeah thinking bath water may scale nicely. Any thoughts on pricing? Currently going for $10, and $20 for my post lift bath

    • @Hangar1318
      @Hangar1318 10 days ago

      If it doesn't scale, you could always shart or fartition.

  • @AP-eh6gr
    @AP-eh6gr a month ago +1

    red light district 15 times 😄 someones an addict, jk

  • @loshmiification
    @loshmiification 2 months ago +1

    th-cam.com/video/nQpkRONzEQI/w-d-xo.html
    There is a limit of 1 consumer per Kafka partition. Either you are not sharding Kafka using partitions, or you cannot have more than one consumer reading there. Care to elaborate on this? (Either on the sharding mechanism, or the consumer limit.) Thank you.
    Great content otherwise!

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      To be fair, I am using one consumer per Kafka partition

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      If you're referring to state machine replication the count min sketch replicas could always read from the replica partitions

    • @kbman99
      @kbman99 14 days ago

      Assuming you have a kafka topic with 1 partition, you can use 2 consumers as long as they are both set up with different consumer groups. In this case that would be what you want because you want both CMS state machines to receive the same data. In this manner, with different consumer groups, we are effectively implementing a one-to-many pub-sub.
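
    A short sketch of that consumer-group setup, assuming the kafka-python client; the topic name, broker address, and update_count_min_sketch are placeholders:

    from kafka import KafkaConsumer

    def run_cms_replica(group_id):
        # Each distinct group_id receives its own copy of every record in the topic,
        # so two count min sketch state machines can consume the same single-partition stream.
        consumer = KafkaConsumer(
            "events",
            bootstrap_servers="localhost:9092",
            group_id=group_id,
            auto_offset_reset="earliest",
        )
        for record in consumer:
            update_count_min_sketch(record.value)  # hypothetical downstream update

    # run_cms_replica("cms-replica-1") and run_cms_replica("cms-replica-2")
    # would each see the full stream.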