15: Reddit Comments | Systems Design Interview Questions With Ex-Google SWE

Comments • 58

  • @FatherPhi
    @FatherPhi 3 months ago +15

    Congrats on 25k!!! You deserve it man

  • @roshanzameer5020
    @roshanzameer5020 3 months ago +1

    Another great video. Would love an occasional live system design session. Also, congratulations on the 25K!!!

  • @shandilyacodes
    @shandilyacodes 3 months ago +3

    Binge watching all your content. It's content-dense and fun to watch at the same time. Thanks for doing this!

  • @undefinedrandomness
    @undefinedrandomness 3 months ago +1

    Watched all your videos over the holidays and locked in my first FAANG offer. Just wanna say thanks for the goated content, and keep up the good work! I genuinely learned a ton from this channel. Congrats on 25k!📈📈

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      You're HIM!! Thanks man, I appreciate it!

  • @SkinnyPaddy
    @SkinnyPaddy 3 months ago +3

    100% this channel helped me get a FAANG offer last week. Cheers dude! ❤

  • @andriidanylov9453
    @andriidanylov9453 2 months ago +1

    It was a strong description of this task. Thank you!

  • @antonvityazev4823
    @antonvityazev4823 23 days ago +1

    Hello Jordan!
    Thanks for another good video with a detailed explanation. I have a question though after watching it:
    Why don't we use Cassandra here instead of MySQL if we expect a huge load of writes?

    • @jordanhasnolife5163
      @jordanhasnolife5163 23 days ago

      I wouldn't really say that I expect a huge load of writes, but more so that I don't want to two-phase commit a bunch of writes to four places.
      More importantly, Reddit comments have parents, which implies we need to maintain causal relationships, and that may be challenging in a leaderless setup.

  • @AP-eh6gr
    @AP-eh6gr 3 months ago +2

    Could you share some example docs/links on how to implement the logic around the 37th minute? That is: consume from a Flink operator keyed by post+comment_id AND consume from CDC of the upvotes DB AND have Spark Streaming process a comment and its corresponding score/upvotes only once the comment's presence is confirmed? Not having used these in practice makes it a bit hard to come up with.
    OR do a smallish tutorial on implementing this in a future video so that it feels more real, so to speak, lol.
    Damn though, your final solution is impressive, definitely what I was imagining as the most optimized way to handle upvotes by user and upvotes by post, who they came from (which cares about userId), and their aggregated counts (which care about commentId and postId).

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Thanks!
      Just pseudocoding here, but:

      # commentId -> {"score": int, "present": bool}
      state = {}

      def processVote(commentId, vote):
          entry = state.setdefault(commentId, {"score": 0, "present": False})
          entry["score"] += 1 if vote == "upvote" else -1
          # only write out once we know the comment itself has arrived
          if entry["present"]:
              updateDatabases(commentId, entry["score"])  # here we rearrange our graph edges

      def processComment(commentId):
          entry = state.setdefault(commentId, {"score": 0, "present": False})
          entry["present"] = True
          if entry["score"] != 0:
              updateDatabases(commentId, entry["score"])

  • @tfk933
    @tfk933 26 days ago +1

    Could the CDC into Spark be used to calculate comment scores for each ranking system with each incoming vote? e.g. each comment would include: {hot_score: x, controversial_score: y, ...}. Since there are so few comments per post, instead of a derived view, the client could grab all the comments on a post (maybe with exceptions for extremely popular posts), each with all the scores, and handle the sorting client side.

    • @jordanhasnolife5163
      @jordanhasnolife5163 23 days ago

      Sure, this is a fair approach, but it depends on how many comments there are. Doing client-side sorting does imply a lot of repeat work on the client, and for posts where we want to paginate the comments it becomes infeasible.
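
      To make the tradeoff concrete, a minimal sketch of the client-side version, assuming each comment arrives with precomputed scores per ranking (field names here are illustrative, not from the video):

      # Every comment carries all of its scores, so switching sort modes
      # is a client-side re-sort instead of another server round trip.
      comments = [
          {"id": "c1", "hot_score": 4.2, "controversial_score": 0.9},
          {"id": "c2", "hot_score": 7.5, "controversial_score": 0.1},
      ]

      def rank(comments, score_field):
          # Re-run on every mode switch: the repeat client-side work mentioned above.
          return sorted(comments, key=lambda c: c[score_field], reverse=True)

      hot_first = rank(comments, "hot_score")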

  • @hristianiliev7979
    @hristianiliev7979 2 months ago +1

    Hey Jordan, great video. I believe that you did not explain one aspect in the final diagram though. What is this arrow from Spark Streaming to the New Comments/Source of truth DB?

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      Sorry, brushed over that one! That's just to change the upvote/downvote count per comment in the source of truth table.

  • @Kevin-jt4oz
    @Kevin-jt4oz 3 months ago +1

    This is amazing. A system design video with enough algorithmic depth. Can you share any links where I can dive into this further?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      For this one unfortunately not really, was mostly shit I made up since I couldn't find any great resources for it.
      I think my design might not even be necessary for Reddit's current needs, but hey, it's a fun thought experiment.

  • @ariali2067
    @ariali2067 3 months ago +1

    Not related to this specific video, but Saga & TC/C are not covered in your system design concepts series. Do you think we need to learn those for system design interviews? I see several designs mention these two concepts when discussing distributed transactions, and I'm not sure whether I can just skip them if they're not super important. Curious to hear your thoughts, thank you!

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Probably not, but to be fair it's a pretty small topic; you could probably learn it in like 10 minutes

  • @FoodpediaIndia
    @FoodpediaIndia 3 months ago +1

    @jordan loving your videos. In a lot of your videos you mention having CDC on a database to get up-to-date info into Flink while that same Flink job is consuming from Kafka. But let's say a user recently had no activity in the users DB, and then that same user tweeted. You said Flink would already have the user's friends available from CDC and would fan out his tweet to all of them. In that case, would Flink pull details from the user DB directly?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Flink ideally should not have any TTL on that user data, so it should already be there. I suppose what you're describing is more of a caching mechanism, where after a while we drop the data, and then go fetch it from the db, which is also a viable option.
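
      To make that concrete, a minimal sketch of the cache-style fallback in plain Python (not actual Flink API; fetch_friends_from_db is a hypothetical helper):

      user_state = {}  # keyed state populated from the CDC stream

      def get_friends(user_id):
          # Normal path: CDC events have already populated this user's state.
          if user_id in user_state:
              return user_state[user_id]
          # Cold-start / evicted path: fall back to the source DB once,
          # then keep the result locally as if it had arrived via CDC.
          friends = fetch_friends_from_db(user_id)  # hypothetical helper
          user_state[user_id] = friends
          return friends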

    • @scottyjacobson4655
      @scottyjacobson4655 3 months ago

      @@jordanhasnolife5163 I wonder if @factsfuture9778 is asking the same thing I was wondering yesterday, which is the possible need to "pre-populate" Flink's key store from the DB you'd be CDC-ing, since there will likely be some users Flink hasn't heard any DB events about yet, and thus it doesn't know their state "pre-initialization".
      I asked ChatGPT about this, and it said to define an initialization step that "using Flink's JDBC connectors, execute[s] a query against the MySQL database and fetch the current [applicable state] of all users".
      Is this common? / Does Flink have an idiom for this?

  • @idiot7leon
    @idiot7leon 2 months ago +1

    Brief Outline
    00:00:51 Reddit Comments
    00:01:58 Problem Requirements
    00:03:37 Capacity Estimates
    00:06:07 Main Consideration
    00:07:56 "Source of Truth" Table
    00:08:48 Base Table Options - Native Graph Database
    00:12:22 Base Table Options - Use a normal database with an index
    00:15:28 Relational vs Graph Based Approach for "New" Ordering
    00:17:31 Top Comments
    00:19:19 Graph DB for Top Comments
    00:21:56 In Memory Graph for Top Comments
    00:24:55 Controversial Comments
    00:27:04 Hot Comments
    00:29:41 Synchronous Updates vs Derived Data
    00:31:07 Causal Consistency
    00:33:58 Change Data Capture in Practice
    00:35:54 Change Data Capture
    00:38:42 Upvotes DB
    00:41:16 Obligatory Caching Slide
    00:53:12 Final Diagram - Reddit Comments
    Thanks, Jordan~

  • @savanatong7041
    @savanatong7041 3 months ago +1

    Hope references for the decisions made in the video could be attached.

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      There aren't too many about Reddit comment views if you look.
      I see one for materialized path (which I have discussed) and one for Celko trees (which are dumb).
      Other than that, I know the tradeoffs of graph databases, the tradeoffs of normalized vs. denormalized data, and also the tradeoffs of how you choose to partition within a stream processing framework.

  • @Onislayer
    @Onislayer 3 months ago +1

    Hey Jordan, why use Spark Streaming and not just Flink for the whole setup, maintaining causal consistency on that second Kafka topic?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Hey! Just wanted to use microbatching there: we're writing a bunch of upvotes and downvotes to a particular comment id, and then potentially reordering the edges of the graph. That could be expensive, so I think batching them up makes for a nice improvement!
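
      To sketch that micro-batching with Spark Structured Streaming (assuming votes land on a Kafka topic keyed by comment id; topic and server names are illustrative, not the exact pipeline from the video):

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("vote-batcher").getOrCreate()

      # Hypothetical stream: key = commentId, value = +1 or -1 per vote.
      votes = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "comment-votes")
               .load())

      # Running net score per comment, updated once per micro-batch.
      scores = (votes
                .selectExpr("CAST(key AS STRING) AS commentId",
                            "CAST(CAST(value AS STRING) AS LONG) AS delta")
                .groupBy("commentId")
                .agg(F.sum("delta").alias("score")))

      # One batched edge reorder per comment per trigger, rather than
      # one reorder per individual vote.
      query = (scores.writeStream
               .outputMode("update")
               .trigger(processingTime="10 seconds")
               .foreachBatch(lambda df, epoch_id: df.show())  # stand-in for the DB update
               .start())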

    • @Onislayer
      @Onislayer 3 months ago +1

      Also, in this setup would you keep the linked list in Spark Streaming instead, to update the edge weights in the hot DB?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      @@Onislayer yep!

    • @vincentchee4108
      @vincentchee4108 3 months ago +1

      Flink can write in a microbatch manner using windows, but I agree it is not super intuitive.

  • @prashlovessamosa
    @prashlovessamosa 3 months ago +1

    Loved it.

  • @yrfvnihfcvhjikfjn
    @yrfvnihfcvhjikfjn 3 months ago +1

    Next video I'm going to play the Jordan has no life drinking game where you have to take a shot every time he mentions flink

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      I've gotta get them to sponsor me

    • @AP-eh6gr
      @AP-eh6gr 3 months ago

      flink fetish

  • @aj_agr
    @aj_agr 6 days ago +1

    Was wondering about the cost of keeping this in-memory graph index for all the posts over the years, and whether it's worth building a hot/cold-tier sort of index.
    Half a billion posts per year * 100 comments per post * 10 years * 8 bytes of overhead per comment = 4 TB of RAM. 4 GB of RAM in AWS with reserved instances ≈ $200 per year, so for 4 TB you pay ~$200K per year.
    100 comments per post on average feels like a lot, and you could also shrink the 8-byte overhead by a factor of 4 by assuming a max # of comments per post.
    On a side note, persisting this index becomes attractive to avoid re-streaming all of this data when index nodes fail. Then again, maybe re-streaming from read replicas of the comments and upvotes DBs is not that bad.

    • @jordanhasnolife5163
      @jordanhasnolife5163 4 days ago

      One of the reasons we use Flink is so that we don't have to re-stream all data and can restore from a snapshot. I see your point, though, and it's an interesting idea to only keep things in Flink for x amount of time and otherwise recompute the data in a batch job or something!

  • @ramannanda
    @ramannanda 7 days ago +1

    Question: why not a separate topic and a stream-stream join, rather than having two different producers publishing to the same Kafka topic?

    • @jordanhasnolife5163
      @jordanhasnolife5163 6 days ago

      I'm not sure which part of the design you're talking about; can you elaborate?

    • @ramannanda
      @ramannanda 5 days ago

      @@jordanhasnolife5163 I think in the end you do show different Kafka topics, but th-cam.com/video/BO2gRisnBcA/w-d-xo.html

  • @swanv951
    @swanv951 3 months ago +1

    For ordering by "hot" comments, if a comment doesn't get any activity in x mins then some periodic job needs to update its score, and it needs to do that for each and every comment every x mins. How would that scale?

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Our Flink node that receives upvotes and downvotes, partitioned by (postId, commentId), maintains a linked list locally and checks the head of the list every once in a while to see if that node has expired
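
      Roughly, that bookkeeping could look like this (plain Python, not Flink API; recompute_hot_score is a hypothetical helper):

      import time
      from collections import OrderedDict

      # commentId -> last activity time; insertion order acts as the
      # linked list, so the head is always the least recently active.
      activity = OrderedDict()

      def record_vote(comment_id):
          activity.pop(comment_id, None)
          activity[comment_id] = time.time()  # move to the tail

      def expire_head(max_idle_secs):
          # Peek at the head; anything idle past the threshold gets its
          # hot score recomputed, with no full scan over every comment.
          while activity:
              comment_id, last_seen = next(iter(activity.items()))
              if time.time() - last_seen < max_idle_secs:
                  break
              activity.popitem(last=False)
              recompute_hot_score(comment_id)  # hypothetical helper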

    • @sinsaremine
      @sinsaremine 2 months ago

      @@jordanhasnolife5163 I don't understand how Flink has the memory to store all this data. It's 8 MB of data per post; if there are 100Ks of posts this can grow pretty quickly. How do you make sure there is enough memory in Flink to store all comment trees across all posts?

  • @joonlee3863
    @joonlee3863 3 months ago +1

    hey jordan, can you make a video on designing a donation system? e.g. donating to a list of charities

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Hey Joon, besides a normal server and database here what do you consider to be the hard part of this problem?

    • @joonlee3863
      @joonlee3863 3 months ago +1

      @@jordanhasnolife5163 hey jordan, thanks for the reply.
      I just started studying system design, so topics like real-time data/money aggregation with high load/peaks, how to reprocess failed payments, how to avoid hot spots if one charity is popular, and various other failure scenarios in an async design, etc.

    • @joonlee3863
      @joonlee3863 3 months ago +1

      Also, maybe going over the little tedious stuff like API designs and DB schemas/details for a noob like me? System design interviews seem to ask about these in more detail than what I've seen in YT videos :(
      Apologies for the big ask

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      @@joonlee3863 Thanks for elaborating! I think a problem like this would employ a lot of the design patterns that come up in other problems I've covered. For example, real-time money aggregation is similar to the real-time counting or real-time order handling that I cover in the TinyURL or Amazon retail cases.

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      @@joonlee3863 As for this one, totally makes sense. I try to go over these where they're relevant, and I spent more time on them in my 1.0 series. That being said, I think that once you see the pattern in these, covering them in each video can take a decent amount of time and may also feel a bit redundant.

  • @mohamedmacow3610
    @mohamedmacow3610 14 days ago +1

    Can you do a video on API gateways?

    • @jordanhasnolife5163
      @jordanhasnolife5163 13 days ago

      Will give it some consideration if there's anything non-trivial to be done there

  • @davendrasingh8846
    @davendrasingh8846 2 months ago +1

    Why not simply use a Redis sorted set for top/controversial posts?

    • @jordanhasnolife5163
      @jordanhasnolife5163 2 months ago

      Well, you could use sorted sets within sorted sets within sorted sets, at which point you've just built out the data structure I described
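
      Since Redis can't literally nest sorted sets, you'd emulate the nesting with one sorted set per parent comment, which ends up the same shape as the graph's sorted edge lists. A sketch with redis-py (key names are illustrative):

      import redis

      r = redis.Redis()

      def apply_vote(post_id, parent_id, comment_id, delta):
          # One sorted set per parent: members are child comment ids,
          # scores are their running vote totals.
          key = f"post:{post_id}:children:{parent_id}"
          r.zincrby(key, delta, comment_id)

      def top_children(post_id, parent_id, n=10):
          key = f"post:{post_id}:children:{parent_id}"
          return r.zrevrange(key, 0, n - 1, withscores=True)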

  • @AP-eh6gr
    @AP-eh6gr 3 months ago +1

    Around the 22nd minute mark, when you suddenly receive a large number of upvotes on a comment and it needs to be reordered, putting it ahead of the formerly most-upvoted comment, what sort of complexity are you looking at? Oh ok, missed the "maintaining a sorted list of edges" statement you make at 21:41

    • @jordanhasnolife5163
      @jordanhasnolife5163 3 months ago

      Yep, ideally we have to re-sort those edges. It shouldn't be n log n time because the list is already sorted, so just fixing up the existing order should be O(n), like one iteration of insertion sort.
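
      As a sketch (edges as (commentId, score) tuples, kept sorted descending by score; names are illustrative):

      def reposition(edges, i, new_score):
          # edges is already sorted; after one comment's score changes,
          # shift it to its new slot in one insertion-sort-style pass: O(n).
          comment_id, _ = edges.pop(i)
          j = 0
          while j < len(edges) and edges[j][1] >= new_score:
              j += 1
          edges.insert(j, (comment_id, new_score))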

    • @AP-eh6gr
      @AP-eh6gr 3 months ago

      @@jordanhasnolife5163 it could be simply log n if it's a treeset (as opposed to a list); that way you delete the old count and insert the new count, both in log time