15: Reddit Comments | Systems Design Interview Questions With Ex-Google SWE

  • Published Sep 27, 2024
  • That chili put more load on my system than all of the upvotes and downvotes on all the comments for a single postId could put on a kafka queue

Comments • 65

  • @SkinnyPaddy 6 months ago +8

    100% this channel helped me get a FAANG offer last week. Cheers dude! ❤

  • @FatherPhi 6 months ago +17

    Congrats on 25k!!! You deserve it man

  • @undefinedrandomness 6 months ago +3

    Watched all your videos over the holidays and locked in my first faang offer. Just wanna say thanks for the goated content, and keep up the good work! I genuinely learned a ton from this channel. Congrats on 25k!📈📈

    • @jordanhasnolife5163 6 months ago

      You're HIM!! Thanks man, I appreciate it!

  • @idiot7leon 5 months ago +3

    Brief Outline
    00:00:51 Reddit Comments
    00:01:58 Problem Requirements
    00:03:37 Capacity Estimates
    00:06:07 Main Consideration
    00:07:56 "Source of Truth" Table
    00:08:48 Base Table Options - Native Graph Database
    00:12:22 Base Table Options - Use a normal database with an index
    00:15:28 Relational vs Graph Based Approach for "New" Ordering
    00:17:31 Top Comments
    00:19:19 Graph DB for Top Comments
    00:21:56 In Memory Graph for Top Comments
    00:24:55 Controversial Comments
    00:27:04 Hot Comments
    00:29:41 Synchronous Updates vs Derived Data
    00:31:07 Causal Consistency
    00:33:58 Change Data Capture in Practice
    00:35:54 Change Data Capture
    00:38:42 Upvotes DB
    00:41:16 Obligatory Caching Slide
    00:53:12 Final Diagram - Reddit Comments
    Thanks, Jordan~

  • @capriworld 9 days ago +1

    Again, a super cool and great video. Just thinking: we could put all the comments in a data lake (internally S3) and keep only the IDs as metadata in the active database. Comments are mostly static data (considering edits are negligible). Not sure if this has any side effects, other than possibly introducing two-phase commit or some kind of distributed atomicity while writing.

    • @jordanhasnolife5163 9 days ago +1

      You're saying keep the comment data in a data lake and just manipulate the metadata? Potentially a distributed join when reading, which I think slows us down quite a bit. We'd have to be smart about how we partition the actual comment data in the data lake.

    • @capriworld 9 days ago +1

      @@jordanhasnolife5163
      yes, thanks for your insights on this thought.

  • @andriidanylov9453 6 months ago +1

    It was a strong description of this task. Thank you!

  • @hristianiliev7979 6 months ago +2

    Hey Jordan, great video. I believe that you did not explain one aspect in the final diagram though. What is this arrow from Spark Streaming to the New Comments/Source of truth DB?

    • @jordanhasnolife5163 6 months ago

      Sorry, brushed over that one! That's just to change the upvote/downvote count per comment in the source of truth table.

  • @thunderzeus8706 1 month ago +1

    Hi Jordan, thanks for another great video! This one is really interesting and beefy.
    I have a question about the "hot comments" section. It is clear to me that when there is a new upvote, the CDC triggers the +1 to the corresponding nodes in the DB. However, how do we know when it is time to expire some upvotes?
    Constantly monitoring the end of each linked list is expensive and wasteful, while getting rid of stale upvotes only when new upvotes come in causes inaccurate, stale "hot" comment rankings. Neither looks good to me.

    • @jordanhasnolife5163 1 month ago

      I really don't think it's expensive to monitor each linked list; it's a single background thread that can run at any configurable interval of your choosing.
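A minimal sketch of that background expiry pass (Python; the names `record_upvote`, `expire_stale`, and the 10-minute window are hypothetical, and in the video this state would live inside the Flink operator partitioned by (postId, commentId)):

```python
import time
from collections import deque

WINDOW_SECONDS = 600  # hypothetical "hot" window: an upvote counts for 10 minutes

# Upvotes are appended in arrival order, so the deque is already sorted
# by timestamp and only its head ever needs to be checked for expiry.
upvote_log = deque()  # entries: (arrival_time, comment_id)
hot_score = {}        # comment_id -> number of non-expired upvotes

def record_upvote(comment_id, now=None):
    now = time.time() if now is None else now
    upvote_log.append((now, comment_id))
    hot_score[comment_id] = hot_score.get(comment_id, 0) + 1

def expire_stale(now=None):
    """The periodic background pass: pop expired upvotes off the head only."""
    now = time.time() if now is None else now
    while upvote_log and now - upvote_log[0][0] > WINDOW_SECONDS:
        _, comment_id = upvote_log.popleft()
        hot_score[comment_id] -= 1  # would also trigger a graph-edge reorder
```

Because the pass stops at the first non-expired entry, each run costs O(number of expired upvotes), not O(all upvotes in the list).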

  • @AP-eh6gr 6 months ago +2

    Could you share some example docs/links on how to implement the logic around the 37th minute? That is: consume from a Flink operator keyed by post+comment_id AND consume from the CDC of the upvotes DB, and have Spark Streaming process a comment and its corresponding score/upvotes only once the comment's presence is true. Not having used these in practice makes it a bit hard to come up with.
    OR do a smallish tutorial on implementing this in a future video so that it feels more real, so to speak, lol.
    Damn though, your final solution is impressive: definitely what I was imagining as the most optimized way to handle upvotes by user and by post, who they came from (which cares about userId), and their aggregated counts (which care about commentId and postId).

    • @jordanhasnolife5163 6 months ago

      Thanks!
      Just pseudocoding here, but:

      map = {}  # commentId -> {"score": int, "commentPresent": bool}

      def processVote(commentId, vote):
          if vote is upvote:
              map[commentId]["score"] += 1
          else:
              map[commentId]["score"] -= 1
          if map[commentId]["commentPresent"]:
              updateDatabases(commentId, map[commentId]["score"])  # here we rearrange our graph edges

      def processComment(commentId):
          map[commentId]["commentPresent"] = True
          if map[commentId]["score"] != 0:
              updateDatabases(commentId, map[commentId]["score"])

  • @tfk933 4 months ago +1

    Could the CDC into Spark be used to calculate comment scores for each ranking system with each incoming vote? e.g. each comment would include: {hot_score: x, controversial_score: y, ...}. Since there are so few comments per post, instead of a derived view, the client could grab all the comments on a post (maybe with exceptions for extremely popular posts), each with all the scores, and handle the sorting client side.

    • @jordanhasnolife5163 4 months ago +2

      Sure, this is a fair approach, but it depends on how many comments there are. Doing client-side sorting implies a lot of repeat work on the client, and for posts where we want to paginate the comments it becomes infeasible.

  • @antonvityazev4823 4 months ago +1

    Hello Jordan!
    Thanks for another good video with detailed explanation. I have a question though after watching it:
    Why don't we use Cassandra here instead of MySQL if we expect a huge load on writes?

    • @jordanhasnolife5163 4 months ago

      I wouldn't really say that I expect a huge load of writes; it's more that I don't want to two-phase commit a bunch of writes to four places.
      More importantly, Reddit comments have parents, which implies we need to maintain causal relationships, and that may be challenging in a leaderless setup.

  • @ariali2067 6 months ago +1

    Not related to this specific video, but Saga and TC/C are not covered in your system design concepts series. Do you think we need to learn those for system design interviews? I see several designs mention these two concepts when discussing distributed transactions, and I'm not sure whether I can just skip them if they're not super important. Curious to hear your thoughts, thank you!

    • @jordanhasnolife5163 6 months ago

      Probably not, but to be fair it's a pretty small topic; you could probably learn it in like 10 minutes.

  • @yrfvnihfcvhjikfjn 6 months ago +1

    Next video I'm going to play the Jordan has no life drinking game where you have to take a shot every time he mentions flink

    • @jordanhasnolife5163 6 months ago

      I've gotta get them to sponsor me

    • @AP-eh6gr 6 months ago

      flink fetish

  • @FoodpediaIndia 6 months ago +1

    @jordan loving your videos. In a lot of your videos you mention having CDC on a database to get up-to-date info into Flink while the same Flink is consuming from Kafka. But say a user recently had no activity in the users DB, and that same user tweeted: you said Flink will then have the user's friends available from CDC and fan out his tweet to all his friends. In that case, would Flink pull details from the users DB directly?

    • @jordanhasnolife5163 6 months ago

      Flink ideally should not have any TTL on that user data, so it should already be there. I suppose what you're describing is more of a caching mechanism, where after a while we drop the data, and then go fetch it from the db, which is also a viable option.

    • @scottyjacobson4655 6 months ago

      @@jordanhasnolife5163 I wonder if @factsfuture9778 is asking the same thing I was wondering yesterday, which is the need to possibly "pre-populate" Flink's key store from the DB you'd be CDC-ing, since there will likely be some users Flink hasn't yet heard any events about, and thus it doesn't know their state "pre-initialization".
      I asked ChatGPT about this, and it said to define an initialization step that, "using Flink's JDBC connectors, execute[s] a query against the MySQL database and fetch[es] the current [applicable state] of all users".
      Is this common? Does Flink have an idiom for this?

  • @aj_agr 3 months ago +1

    Was wondering about the cost of keeping this in-memory graph index for all the posts over the years, and whether it's worth building a hot/cold-tier sort of index.
    Half a billion posts per year * 100 comments per post * 10 years * 8 bytes of overhead per comment = 4TB of RAM. 4GB of RAM in AWS with reserved instances = $200 per year, so for 4TB you pay $200K.
    100 comments per post on average feels like a lot to me, and you could also cut the 8-byte overhead by a factor of 4 by assuming a max # of comments per post.
    On a side note, persisting this index becomes attractive to avoid having to re-stream all of this data in the event of index node failures. Maybe re-streaming from read replicas of the comments and upvotes DBs is not that bad.
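The arithmetic in that estimate checks out; here it is as a few lines of Python, where every input (posts/year, comments/post, 8-byte overhead, the $200-per-4GB-year reserved-instance price) is the comment's own assumption rather than a measured figure:

```python
# Back-of-envelope check of the in-memory index estimate.
posts_per_year = 500_000_000   # assumed: half a billion posts per year
comments_per_post = 100        # assumed average
years = 10
bytes_per_comment = 8          # assumed per-comment index overhead

total_bytes = posts_per_year * comments_per_post * years * bytes_per_comment
total_tb = total_bytes / 10**12            # 4.0 TB of RAM

cost_per_4gb_year_usd = 200                # quoted reserved-instance price
yearly_cost_usd = total_bytes / (4 * 10**9) * cost_per_4gb_year_usd  # $200K/year
```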

    • @jordanhasnolife5163 3 months ago

      One of the reasons we use flink is so that we don't have to restream all data and can restore from a snapshot. I see your point and this is an interesting idea to only keep things in flink for x amount of time and otherwise recompute the data on a batch job or something!

  • @swanv951 6 months ago +1

    For ordering by "hot comments": if a comment doesn't get any activity in x minutes, some periodic job needs to update its score, and it needs to do that for each and every comment every x minutes. How would that scale?

    • @jordanhasnolife5163 6 months ago

      Our flink node that is receiving upvotes and downvotes and is partitioned by (postId, commentId) is maintaining a linked list locally, and checks the head of the list every once in a while to see if the node is expired

    • @sinsaremine 6 months ago

      @@jordanhasnolife5163 I don't understand how Flink has the memory to store all this data. It's 8MB of data per post; if there are 100Ks of posts, this can grow pretty quickly. How do you make sure there is enough memory in Flink to store all comment trees across all posts?

  • @ramannanda 3 months ago +1

    Question: why not a separate topic and a stream-stream join, rather than having two different producers publish to the same Kafka topic?

    • @jordanhasnolife5163 3 months ago

      I'm not sure which part of the design you're talking about; can you elaborate?

    • @ramannanda 3 months ago

      ​@@jordanhasnolife5163 I think in the end you do show different kafka topics, but th-cam.com/video/BO2gRisnBcA/w-d-xo.html

  • @prashlovessamosa 6 months ago +1

    Loved it.

  • @AP-eh6gr 6 months ago +1

    Around the 22nd-minute mark, when you suddenly receive a large number of upvotes on a comment and it needs to be reordered ahead of the formerly most-upvoted comment, what sort of complexity are you looking at? Oh, ok: missed the "maintaining a sorted list of edges" statement you make at 21:41.

    • @jordanhasnolife5163 6 months ago

      Yep, ideally we have to re-sort those edges. It shouldn't be n log n time, because the list is already sorted; just fixing the position of the one changed element is O(n), like a single iteration of insertion sort.
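A sketch of that single-pass reorder (Python; `reorder_after_vote` and the `(count, commentId)` edge representation are hypothetical illustrations, not the video's actual data model):

```python
def reorder_after_vote(edges, comment_id, new_count):
    """edges is a list of (count, comment_id) pairs kept sorted in
    descending order by count. After one comment's count changes,
    restore order with a single insertion-sort-style pass: O(n) worst
    case, rather than an O(n log n) full resort."""
    idx = next(i for i, (_, cid) in enumerate(edges) if cid == comment_id)
    edges[idx] = (new_count, comment_id)
    # Bubble the updated edge toward the head while it outranks its neighbor...
    while idx > 0 and edges[idx][0] > edges[idx - 1][0]:
        edges[idx - 1], edges[idx] = edges[idx], edges[idx - 1]
        idx -= 1
    # ...or toward the tail if its count dropped below the next edge.
    while idx < len(edges) - 1 and edges[idx][0] < edges[idx + 1][0]:
        edges[idx], edges[idx + 1] = edges[idx + 1], edges[idx]
        idx += 1
    return edges
```

A tree-based structure would make the move O(log n) instead, at the cost of extra pointer overhead per edge.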

    • @AP-eh6gr 6 months ago

      @@jordanhasnolife5163 it could be simply log n with a tree set (as opposed to a list): delete the old count and insert the new count, both in log time.

  • @davendrasingh8846 5 months ago +1

    Why not simply use a Redis sorted set for top/controversial posts?

    • @jordanhasnolife5163 5 months ago

      Well you could use sorted sets within sorted sets within sorted sets, at which point you've just built out the data structure I described

  • @joonlee3863 6 months ago +1

    hey jordan, can you make a video on designing a donation system? e.g. donating to list of charities

    • @jordanhasnolife5163 6 months ago

      Hey Joon, besides a normal server and database here what do you consider to be the hard part of this problem?

    • @joonlee3863 6 months ago +1

      @@jordanhasnolife5163 hey jordan, thanks for the reply.
      I just started studying system design, so: topics like real-time data/money aggregation with high load/peaks, how to reprocess failed payments, how to avoid hot spots if one charity is popular, and various other failure scenarios in an async design, etc.

    • @joonlee3863 6 months ago +1

      Also maybe going over little tedious stuff like API designs and DB schema/details for a noob like me? System design interviews seem to ask more in details about these than what I've seen in YT videos :(
      Apologies for a big ask

    • @jordanhasnolife5163 6 months ago

      @@joonlee3863 Thanks for elaborating! I think that for a problem like this, it probably would be employing a lot of the other design patterns that we come across in other problems that I've covered. For example, real time money aggregation is similar to the real time counting or real time order handling that I might cover in the tinyURL or Amazon retail case

    • @jordanhasnolife5163 6 months ago

      @@joonlee3863 As for this one, totally makes sense, I try to go over these where they are relevant and spent more time on them in my 1.0 series. That being said, I think that once you see the pattern in these, covering them in each video can take a decent amount of time and may also feel a bit redundant.

  • @shandilyacodes 6 months ago +7

    Binge watching all your content. It's content dense and fun to watch at the same time. Thanks for doing this!

  • @Kevin-jt4oz 6 months ago +2

    This is amazing: a system design video with enough algorithmic depth. Can you share any links where I can dive into this further?

    • @jordanhasnolife5163 6 months ago

      For this one unfortunately not really, was mostly shit I made up since I couldn't find any great resources for it.
      I think my design might not even be necessary for reddit's current needs, but hey it's a fun thought experiment.

  • @roshanzameer5020 6 months ago +2

    Another great video. Would love an occasional live system design session. Also, congratulations on the 25K!!!

  • @levimatheri7682 2 months ago +1

    Your videos are invaluable! Helped me get an offer from Microsoft!

    • @jordanhasnolife5163 2 months ago +1

      Congrats man!!! Enjoy the new role and it's awesome to see the hard work pay off!

  • @Onislayer 6 months ago +1

    Hey Jordan, why use Spark Streaming and not just use flink for the whole setup with maintaining causal consistency on that second kafka topic?

    • @jordanhasnolife5163 6 months ago

      Hey! Just wanted to use microbatching there, we're just writing a bunch of upvotes and downvotes to a particular comment id, and then potentially reordering the edges of the graph. That could be expensive, so I think batching them up makes for a nice improvement!

    • @Onislayer 6 months ago +1

      also in this setup would you keep the linked list in spark streaming instead to update the edge weights on the hot db?

    • @jordanhasnolife5163 6 months ago

      @@Onislayer yep!

    • @vincentchee4108 6 months ago +1

      Flink can write in a microbatch manner using windows, but I agree it is not super intuitive.

  • @savanatong7041 6 months ago +1

    Hope references for the decisions made in the video could be attached.

    • @jordanhasnolife5163 6 months ago

      Not too many exist about Reddit comment views if you look.
      I see one for materialized path (which I have discussed) and one for Celko trees (which are dumb).
      Other than that, I know the tradeoffs of graph databases, of normalized vs. denormalized data, and of how you choose to partition within a stream processing framework.

  • @mohamedmacow3610 3 months ago +1

    Can you do a video on API gateways?

    • @jordanhasnolife5163 3 months ago

      Will give it some consideration if there's anything non-trivial to be done there.