5: Netflix + YouTube | Systems Design Interview Questions With Ex-Google SWE

  • Published 21 Sep 2024
  • Please reach out on LinkedIn for the super secret only fans version of JHNL

Comments • 161

  • @laserbam
    @laserbam 9 months ago +28

    Thanks for doing this series! A few days ago, I signed my L5 offer at Google, so your system design videos (and slide decks) came in clutch

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago +9

      Hell yes dude, extremely proud of you, keep killing it!!

  • @gawadeninad
    @gawadeninad 1 month ago +3

    This is an underrated channel. It should have more views and subscribers. I liked how you deep-dive into the main scenarios rather than just covering everything at a high level.

  • @idiot7leon
    @idiot7leon 5 months ago +8

    Brief Outline
    00:01:04 Problem Requirements
    00:01:46 Capacity Estimates
    00:02:52 Video Streaming Intro
    00:04:00 Video Chunking
    00:05:40 Chunking Advantages
    00:07:09 Database Tables - Subscribers
    00:09:39 Database Tables - User Videos, Users, Video Comments
    00:11:33 Database Tables - Video Chunks
    00:12:45 Database Choices
    00:14:45 Video Uploads
    00:15:57 Video Uploading - Broker
    00:16:46 Video Uploading - Broker
    00:18:51 Video Uploading - Chunks
    00:20:27 Video Uploading - Chunk Storage
    00:22:32 Video Uploading - Aggregation
    00:26:41 Video Uploading - Streaming Datamodels
    00:28:37 Video Uploading - Flink
    00:31:15 Video Uploading - Flink Continued
    00:33:53 Video Uploading - Search
    00:34:59 Search Index - Partitioning
    00:37:17 Search Index - Partitioning Continued
    00:38:57 Search Index Uploads
    00:40:21 Final Diagram - Netflix/YouTube
    Thanks, Jordan~

  • @lelandrb
    @lelandrb 5 days ago +1

    I'm more on the frontend side but dang are these high quality. I wish there were similarly great sys design resources on my stack, but I still took a lot from this. Thanks so much

  • @allenxxx184
    @allenxxx184 6 months ago +13

    Your channel deserves at least 1M subscribers. The highest-quality system design videos!!!

  • @sauravsingh5663
    @sauravsingh5663 6 months ago +5

    This is exactly what I was looking for. Love how you uncover the right level of detail where it is necessary.
    Great work !!

    • @dosya6601
      @dosya6601 3 months ago

      +

  • @MithunSasidharan1989
    @MithunSasidharan1989 9 months ago +9

    Thank you for continuing to do this. It's a goldmine for engineers preparing for interviews : )

  • @muven2776
    @muven2776 2 months ago +3

    This is a great video, f***ing indeed! I got an instant high & confidence boost going through these videos.
    To understand Jordan's videos,
    my suggestion is to go through Jordan's System Design 2.0 playlist:
    0. DB fundamentals (videos 0 to 15)
    1. Replication (videos 16 to 24)
    2. Stream processing and Flink (videos 42 to 45)
    After understanding the above videos, this system design series is like a cakewalk. Note down the terms you come across, like ZooKeeper and Elasticsearch, go back to the 2.0 playlist, and then come back to this series.
    Note down the technical terms he mentions, like:
    "Split brain"
    "Read repairs"
    "Anti-entropy"
    Keep using these terms in the interview to show that you know what a distributed system is :D

  • @vkchgc
    @vkchgc 3 months ago +1

    You really are doing the best system design videos I’ve ever seen ! Keep up the great work

  • @wensongliu5058
    @wensongliu5058 4 months ago +1

    Much appreciation to you, Jordan. This video covers so many detailed components and processes going back and forth; I've already watched it many times and it's really helpful!

  • @rahulnath9655
    @rahulnath9655 9 months ago +4

    This one is so dense and detailed, thanks man. I feel like I really understand these systems now.

  • @KratosProton
    @KratosProton 5 months ago +1

    42:36 Jordan man it's been a long way... from your super wobbly handwriting in the 1st concepts video to this super beautiful amazing handwriting. And as always, quality content!!!

  • @nirajvora9314
    @nirajvora9314 9 months ago +2

    Don't stop making videos bro. Your content is unique and effective.

  • @jordiesteve8693
    @jordiesteve8693 2 months ago +1

    great video as always, really liked the aggregation part. One thing about the search index, if I got it right: you propose to shard by a given term and ideally have all (userId, videoId)s within the same shard. That's not possible: in Elasticsearch, documents (descriptions here) are sharded by documentId, and you end up running a distributed query for each search query. Why is that? Well, the algorithm that has powered the search space until recently is bm25. There are two main components in the bm25 formula: one is how frequent each query term is in the document (the more the better), and the other is how "popular" the query term is across the corpus of documents (the less popular, the better). For example, if the query terms are [a, b], the scoring formula would look like f(a, D) / popularity(a) + f(b, D) / popularity(b). Running bm25 on a single node is easy; however, when going to a distributed system, unexpected things can happen (business as usual). In a distributed setting we HAVE TO (otherwise you are not running bm25 anymore) keep f and popularity consistent within a single node, which is why ES doesn't let you control what you partition on. Basically, when we run a query, we run a distributed query to all nodes, each one runs bm25, and a coordinator/aggregator node (can't recall, there are different setups) gets the results and returns the top-k. Now, this has several implications, an important one being that bm25 runs within each shard and only considers the statistics (f and p) on that shard. That means that if doc1 and doc2 are identical but each lives in a different shard, it could be that bm25(query, doc1) != bm25(query, doc2). Because of this, we need to make sure we don't over-partition and don't mess up the data distributions. Argh, hope people have followed this far!
    I really like inverted indexes, but the new kid on the block (well, not that new) is vector databases. The idea is you encode the text, or whatever information you have at hand, into a vector using deep neural networks, index it into a vector database, and run approximate similarity searches.
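
    A rough Python sketch of the per-term scoring idea described above, on a toy tokenized corpus; this follows the textbook BM25 shape, not the exact Lucene/Elasticsearch implementation, and all names here are illustrative:

        import math
        from collections import Counter

        def idf(term, docs):
            # the "popularity" component: rarer terms across the corpus score higher
            df = sum(1 for d in docs if term in d)
            return math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)

        def bm25(query, doc, docs, k1=1.5, b=0.75):
            # f(term, D): term frequency in the doc, dampened and length-normalized,
            # then weighted by idf (i.e. divided by the term's "popularity")
            tf = Counter(doc)
            avg_len = sum(len(d) for d in docs) / len(docs)
            score = 0.0
            for term in query:
                f = tf[term]
                score += idf(term, docs) * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avg_len))
            return score

        docs = [["cat", "video"], ["dog", "video"], ["cat", "cat", "compilation"]]
        print(bm25(["cat", "video"], docs[0], docs))

    Running the same function with docs restricted to one shard's documents changes the idf statistics, which is exactly the shard-local scoring caveat described above.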

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      That makes a lot of sense to me! BM25 sounds like TF-IDF. I suppose we could do term partitioning without using elasticsearch if we didn't care about scoring, but of course scoring is super relevant.
      I mention more in my Twitter search 1.0 video that sharding by term is actually pretty impossible in practice, and that document-based sharding is what's used IRL. It also makes for faster ingestion :)

    • @jordiesteve8693
      @jordiesteve8693 2 months ago +1

      yep! tf-idf is very similar, ES supports both afaik, but bm25 is more tailored to the search problem

  • @capriworld
    @capriworld 12 days ago +1

    First of all, I have seen multiple channels and finally landed here; I keep coming back and refer people here as well. Thanks a lot for helping me out.
    Can we not use a job scheduler to put in a task that checks whether all the chunks are done & processed? As you said, the client informs the microservice of the upload metadata; that could initiate a task (instead of Kafka/Flink?), and then we remove the task once done.
    Basically both are the same, but from a technology perspective I thought this might be more aligned.
    Thanks again.

    • @jordanhasnolife5163
      @jordanhasnolife5163  11 days ago +1

      Yeah I mean we may as well have re-built a job scheduler, I think that's just the core piece of the problem so I didn't want to abstract it away. I think the difference here is that job schedulers are nice for scheduling one job at a time, but how does it alert us when they're all done? Some do, but they probably use some sort of polling or stream processing like what we do under the hood.

  • @dmitrigekhtman1082
    @dmitrigekhtman1082 7 months ago +4

    The upload and processing pipeline could include lots of different jobs with complicated interdependencies, with the S3 upload stage as one of the first steps. Possibly, a general-purpose workflow orchestration framework (something like Temporal, maybe?) could help coordinate all of it.

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago +1

      Agreed, and I imagine that IRL they do probably have something like this!

    • @user-of5je3sp9n
      @user-of5je3sp9n 7 months ago +1

      You should do a video on workflow orchestration :D

    • @lagneslagnes
      @lagneslagnes 1 month ago

      What dependencies? I personally cannot think of any.
      The chunks can be processed in any order.
      And the metadata about total chunks can also be updated out of order as Jordan explained.
      With additional requirements (beyond scope of interview?) we might have dependencies, but I don't think we have such with the requirements Jordan worked off of.

    • @dmitrigekhtman1082
      @dmitrigekhtman1082 1 month ago

      The processing of each chunk would typically be a multistep process, with different steps perhaps executing on different compute nodes. Imagine, for example, we wanted to run some ML-based object recognition on GPU nodes. Meanwhile, there’s a parallel job generating subtitles. Then the results of the object recognition and subtitles are combined for all of the chunks to feed into a recommendation system… you get the point - video processing involves a lot of systems and you need to coordinate the actions of those systems.

    • @lagneslagnes
      @lagneslagnes 1 month ago

      @@dmitrigekhtman1082 Yeah, true. It's just not part of his requirements.

  • @shetysify
    @shetysify 1 month ago +1

    Need your blessings, going for an interview!!! You should do a DSA course next. Thank you!!

  • @systemdesignlearner
    @systemdesignlearner 4 hours ago

    2 questions:
    1) when a user watches a video, do they get the chunks from the database or the CDN?
    2) why is it better if all the queries for a userid go to the same node?

  • @dkcjenx
    @dkcjenx 4 hours ago

    Hi Jordan! I actually encountered this stream processing aggregation problem at my work. I think your solution is much more elegant than ours, but I have a question -
    Are you going to shard the Flink servers that track video chunk processing status? Otherwise every server instance will need to contain all the video upload information (in memory I assume?) and won't be scalable. Assuming you do shard by videoId, are you going to use consistent hashing to make sure the same videoId goes to the same Flink server? Also, how would you guarantee persistence in case the server goes down and data in memory is lost (use a write-ahead log?).
    Or alternatively are you going to store the chunk upload status in a DB table?

  • @Anonymous-ym6st
    @Anonymous-ym6st 1 month ago +2

    Modern systems being more CPU bound instead of network bound -> not sure I understand it correctly. If it is about latency, the network is definitely taking more time. QPS-wise, being CPU bound can be solved by adding more nodes, but network bandwidth will just stay the same? (open to discussion, I don't have any hands-on experience with storage..)

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      It basically means that in something like AWS, if we want to perform a large analytical query, the main thing slowing us down is the ability of CPUs to parse through the data, as opposed to actually moving data from host to host over the network in order to parse it.

  • @thunderzeus8706
    @thunderzeus8706 2 months ago +1

    Hi Jordan, I have dumb questions about the user-video table slide.
    Please correct me if I am wrong.
    1.The proposed design uses MySQL and assigns (userId, videoId, timestamp) as primary key, using "userId" for partitioning.
    2. You mentioned maybe a secondary index / sort key on videoId and timestamp
    1). Since you eventually chose MySQL, what is a "sort key" in that case? On the other hand, shouldn't we use (userId, timestamp, videoId) instead of (userId, videoId, timestamp) as primary key so that records will be in timestamp order within each userId?
    2). If it's a secondary index, it will be a global secondary index. Won't the "look up by videoId" use case be slow, because you essentially need to find the videoId in logarithmic time and then possibly go to a different partition to get the metadata?
    Thanks (for bearing with me🤯)!

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      2.1) I'd probably just use some combined index of userId and timestamp (which basically is the same as a video id, who needs a videoid anyways)
      2.2) I suppose this answers your questions. If userId and timestamp make up the "video id", we can easily look things up by that combination.

  • @Luzkan
    @Luzkan 7 months ago +2

    Congratz on 21k Jordan! It's the 5th video for me so far and I'm amazed every single time by the details you manage to delve into. How long on average do you think about the whole system before starting the video itself (let's say without refining it into something presentable, just mapping the thoughts out)?
    14:39 / 41:40 - (In my design channel_id is the same thing as user_id) I'm wondering why you suggest sharding on channel_id + video_id, rather than just video_id? I don't see how having nearby comments from other videos of a given user (channel) is helpful. 🤔
    24:49 - What happens if RabbitMQ dies after a successful upload to S3, just after the messages with metadata have been put on the queue (I know there is an option for durable queues and persistent messages, but is that the way to go)?
    Btw, do you know how Discord handled causal dependencies (relationships between messages, like msg-to-msg replies) with Cassandra?

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago

      Hey! I'm basically remaking all of these videos right now, so I don't have to think about them for too long. I mainly just re-watch my old video on it and then try to decide if what I did last time was stupid haha.
      14:39 - yup, typo on my part nice catch.
      24:49 - Ideally we would have multiple replicas of rabbit mq so that if the leader dies the follower can take over and we can proceed as normal.
      I do not know the answer regarding discord! Maybe version vectors, maybe they always write to the same leader for a given parent comment Id, maybe quorums!
      I'd have to look into it.

  • @roshankumar0911
    @roshankumar0911 9 months ago +2

    I recently cleared my system design round after watching your videos.. they're so compact & precise. Thank you for making such videos. Can you please mention your LinkedIn id?

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago

      Glad to hear!! Congrats!
      www.linkedin.com/in/jordan-epstein-69b017177?
      If you don't mind, just don't tag me in stuff so that I don't lose my job haha

    • @roshankumar0911
      @roshankumar0911 9 months ago

      @@jordanhasnolife5163 Sure, thanks :)

  • @isaacneale8421
    @isaacneale8421 2 months ago +1

    I like your idea of data locality in a DB for each of the processed chunks. But I don’t know if I understand if it works.
    When thinking about a single machine (say a personal laptop) reading from disk, the video ought to be stored as continuously as possible to ensure good data locality and no disk jumping. Makes sense.
    But when talking about a distributed service, I can't see how this helps. As I understand a disk, it can only be reading from one location at once. There might be multiple physical hard drives on one machine though.
    Anyway, so let’s say I am watching a youtube video and I grab Chunk1 from the DB. Great. Chunk2 is next, which i’ll request in 5-10 seconds. But what happens if someone else is watching a video partitioned on the same DB shard. And they request their chunkXYZ. The disk jumps to their spot, then back to mine when i request chunk2.
    So it seems like making the distributed DB have good data locality can break down quite easily with concurrent requests.
    Hopefully, however, most videos are read from the CDN, which would be much faster since its cache is in memory. But that's a lot of expensive memory for caching all videos, so maybe that is partially on disk too, which I guess would have the same problem.
    Any thoughts? I suppose good data locality doesn't hurt in the case where my sequential reads are not the system's sequential reads. So you might as well try to have good data locality.

    • @isaacneale8421
      @isaacneale8421 2 months ago +1

      Oops. I just rewatched and realized that you had an S3 location in this DDB. The data locality was for range queries to fetch the next X many chunk locations while buffering. This makes a lot of sense.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Yup, range queries is what you're looking for!
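
      A rough sketch of what that range read could look like, assuming a DynamoDB-style chunks table with partition key videoId and sort key chunkIndex (table and attribute names are made up for illustration):

          import boto3
          from boto3.dynamodb.conditions import Key

          table = boto3.resource("dynamodb").Table("video_chunks")  # hypothetical table name

          def next_chunk_locations(video_id, start_index, count=5):
              # partition key = videoId, sort key = chunkIndex, so this is a single-partition
              # range read that returns the next few chunks in playback order
              resp = table.query(
                  KeyConditionExpression=Key("videoId").eq(video_id) & Key("chunkIndex").gte(start_index),
                  Limit=count,
              )
              return [item["s3Path"] for item in resp["Items"]]

          # e.g. while the player buffers chunk 12, prefetch the S3 locations of chunks 13..17:
          # urls = next_chunk_locations("vid123", 13)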

  • @Anonymous-ym6st
    @Anonymous-ym6st 1 month ago +1

    I am curious whether it is common to use two types of DB in a real use case (of course for a big company like YouTube it's worth it, but consider that we are designing for a team / an org's tech stack). Compared with adopting Cassandra, maybe optimizing on top of MySQL would be more like a real case?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago +1

      Fair enough, consistency of DB choice can be a real draw in some places.

  • @raaamu0007
    @raaamu0007 1 month ago +1

    How do you actually join chunk messages in Apache Flink? I do not have much knowledge of it.
    So for a given tumbling window of, say, 6 hrs, you expect all the chunks to be processed from RabbitMQ, and once all the individual messages are received in Flink from RabbitMQ, you join them with the Kafka message that was already received and publish a video-upload-complete event when the total number of chunks from the RabbitMQ messages equals the count in the Kafka message?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      Yeah basically you just use a hashmap and key on chunkId.
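
      A rough, non-Flink sketch of that join, assuming state keyed by videoId: one stream delivers the expected total (from Kafka), the other delivers per-chunk completions (from the RabbitMQ consumers); all names are illustrative:

          from collections import defaultdict

          class ChunkAggregator:
              # stand-in for Flink keyed state; everything below is conceptually keyed by videoId
              def __init__(self):
                  self.expected = {}            # videoId -> total chunk count (from the Kafka message)
                  self.done = defaultdict(set)  # videoId -> processed chunkIds (from RabbitMQ consumers)

              def on_total(self, video_id, total_chunks):
                  self.expected[video_id] = total_chunks
                  return self._complete(video_id)

              def on_chunk_processed(self, video_id, chunk_id):
                  self.done[video_id].add(chunk_id)  # a set keeps duplicate deliveries idempotent
                  return self._complete(video_id)

              def _complete(self, video_id):
                  # the "upload complete" event fires once counts match, regardless of arrival order
                  total = self.expected.get(video_id)
                  return total is not None and len(self.done[video_id]) == total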

  • @xiaoyinqi7296
    @xiaoyinqi7296 6 months ago +1

    Thanks for the video, Jordan, very impressive.
    I want to understand the reason for using Flink here. I know Flink is a stream processing tool, and I believe we want to confirm whether the transcoding of all the chunks is done; my thought is to use a chunk DB table to mark each chunk's status.

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      You can definitely use a chunk db. However, note that this means:
      1) You need to make an additional network request to the chunk db every time
      2) That request can fail, how do you ensure that we eventually write it there?

  • @reddy5095
    @reddy5095 1 month ago +1

    Since the user stores the entire video in S3, they will have the list of chunks and their links, so they can make one API call to store the details in the chunk table, right? Why are we using the video chunks RabbitMQ instead?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      Also doable, you could run CDC on an event like that, but we want to make sure to process the s3 files (one for each chunk) in the background, so we need something to go into rabbitmq to tell us where those video files live.

  • @weijiachen2850
    @weijiachen2850 4 months ago +4

    How does this guy know all these as a junior engineer? He should be promoted to a staff engineer.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Very unclear if I have what it takes for that

    • @Amin-wd4du
      @Amin-wd4du 7 days ago

      It’s not much about knowing all the technologies to succeed. It’s more about influence

  • @joemiller1057
    @joemiller1057 2 months ago +1

    I would not chunk on the clients. If you want to change that logic you would have to update all the clients; I feel like it could also introduce device-specific bugs. Upload the file, then process and split it up server-side.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      I can see the argument moreso if we have many upload servers that are geographically distributed

  • @alberdgdj1
    @alberdgdj1 9 months ago +1

    Hi Jordan, thanks for your videos they are of a huge value. I wonder if you could do a video about calculating BigO complexity with some exercises, that would be really helpful. Thanks mate!

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago +2

      I appreciate that! I can do this, however it realistically would be a while before I get to it, just due to the fact that I'm mainly trying to focus on systems design. That being said, there are many good resources on the internet for how to calculate this type of thing!

  • @aforty1
    @aforty1 4 months ago +1

    Liked and commented for the algo! Thank you!

  • @meenalgoyal8933
    @meenalgoyal8933 5 months ago +1

    Hey Jordan, I am wondering how the design might change for an audio streaming service like Spotify. I think a lot might remain the same as YouTube, but 2 major things:
    1. Do you think we need to break the audio file into chunks? Sure, we can benefit from parallel uploading and getting one chunk at a time for streaming, but audio files are lighter than video.
    2. What kind of processing might be required for each audio file chunk?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago +1

      Hey! I think 99% of it is probably going to be the same. You'd probably have different bit rates for streaming the audio if you have a worse connection, which is the processing involved. Maybe you wouldn't need chunking since as you mentioned the files are much smaller in size.

  • @asian1599
    @asian1599 7 hours ago

    how would we support resuming the video at the time the user left off?

  • @dinar.mingaliev
    @dinar.mingaliev 9 months ago +1

    Hi Jordan, thank you so much for keeping us educated and sharing your ideas in system design. Short question: don't we also need to add a chunk processor, so that once a user uploads a video into temporary S3 or DFS, the service splits it into chunks?
    And meanwhile one more question: if we have single-leader replication + partitions in Cassandra, comment editing will work correctly, right?
    And also we need a service to create a user feed :)

    • @dinar.mingaliev
      @dinar.mingaliev 9 months ago +1

      also I guess insert, update and delete operations on a single row are atomic, isolated and durable in Cassandra, and assuming that the same user edits their own comments - there should not be a problem with eventual consistency. what do you think man? :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago

      Thanks!
      I had envisioned the user's client breaking the file into chunks.
      Secondly, I'd agree that edits of comments are no issue if we use single leader replication, but for multi leader replication they definitely could be!

  • @kword1337
    @kword1337 9 months ago +1

    Thanks for another banger dude! For complicated stuff like video aggregation, are you getting your ideas from white papers? That level of design seems beyond Designing Data-Intensive Applications?

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago +1

      Well I don't feel like DDIA is ever super opinionated on how to design things in particular.
      That being said, real time aggregation using stream processing seems to be something used across many systems and it also handles pretty much all failure scenarios for us, hence the reason I keep abusing it haha

  • @college7290
    @college7290 8 months ago +1

    Real treasure! Thank you. What resources did you use to learn these concepts? I know your knowledge is not out of books, but based on years of hard work and experience. How can I start learning these concepts myself? What can I do to be as knowledgeable as you in the next 5~10 years?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago +1

      Just reading haha, I'm nothing special! You'd be surprised how much you can learn by looking at "Uber system design" from reputable sources (their site and not YouTubers)

  • @jieguo6666
    @jieguo6666 2 months ago +1

    Hey Jordan! Thanks for the video! If we use DDB we can use a GSI, so it seems we don't need CDC. I'm curious whether Cassandra+CDC is better than DDB, or is it a personal preference thing?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      If it's an eventually consistent global secondary index then I'd say personal preference. If it needs a two phase commit to stay completely consistent with the primary that seems like a pretty big difference then

  • @Randomguu
    @Randomguu 8 months ago +2

    Wonderful series, cannot stop watching. Just one question on something which is bugging me - I heard this suggestion in a few of the other videos as well: how do you decide that a SQL DB will be better when we have a read-heavy system? I understand the B-tree vs LSM tree point, but NoSQL scales better and hence will have less locking and contention on a single SQL node (even if we have master-slave for reads, it still scales poorly, no?). I think LSM vs B-tree is merely a theoretical discussion rather than having practical application here

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      You say "NoSQL" scales better - what makes you say this? That's really only the case when we're running a bunch of distributed joins, which we aren't doing in any of these reads

    • @lagneslagnes
      @lagneslagnes 1 month ago +1

      @@jordanhasnolife5163 I think what @Randomguu is saying is that if you set Cassandra's read quorum to 1, then the reads are ultra fast because different reads can land on different replicas in parallel. So saying SQL is faster for reads is not always true.
      Having said that, I think you are still spot on in using Cassandra _only_ for comments, and an ACID SQL DB for metadata like users, etc., for reasons beyond write vs. read throughput.

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      @@lagneslagnes That's true you can configure cassandra to be very fast for reads and still maintain quorum consistency :)

  • @VidulVerma
    @VidulVerma 8 months ago +2

    Awesome design 🙇

  • @Ryan-g7h
    @Ryan-g7h 22 days ago +1

    jordan quick question, so are we storing both unprocessed/processed chunks in the same S3?

  • @ariali2067
    @ariali2067 6 months ago +1

    Again, sorry, the same question caught me again and again. Is the search index basically a new table, or basically a secondary index on the existing user video table? I had already convinced myself that it's a secondary index on top of existing tables, but in this video it seems that we are creating a new table (with some denormalized data from the user video table) -> if this is the case (create a new table) -> why do we need (user id, video id) as the partition key here? Why can't we use term as the partition key, so that for a given term search all the results are on the same node for faster reads? This really bothered me.. would really appreciate it if you can help clear up my confusion here, thanks again!

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      1) new table
      2) too much data for a given term typically, imagine for "Donald Trump"

  • @xRuneGunx
    @xRuneGunx 8 months ago +1

    At 41:31 you mentioned that using Cassandra increases write throughput. However, doesn't Cassandra use a leaderless replication model such that write availability is increased? I was under the impression that multi-leader replication increases write throughput due to its nature of processing events in parallel. Can you clear up my confusion?
    Thanks for the video

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      Yes sorry and good catch here. Cassandra can be run in multiple different configurations: one with quorum consistency, and another where writes just need to hit one node. I'm mainly referring to the latter, which is effectively multi leader replication.
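
      A minimal sketch of those two modes with the Python cassandra-driver; the keyspace, table, and columns are made up for illustration:

          from cassandra.cluster import Cluster
          from cassandra import ConsistencyLevel
          from cassandra.query import SimpleStatement

          session = Cluster(["127.0.0.1"]).connect("videos")  # hypothetical keyspace

          insert_cql = "INSERT INTO video_comments (video_id, ts, user_id, body) VALUES (%s, %s, %s, %s)"

          # "hit one node" style: maximizes write throughput/availability, eventually consistent
          fast_write = SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.ONE)

          # quorum style: slower writes, but a QUORUM read afterwards is guaranteed to see them
          safe_write = SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.QUORUM)

          session.execute(fast_write, ("vid123", "2024-09-21T00:00:00", "user456", "great video"))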

  • @Anonymous-ym6st
    @Anonymous-ym6st 1 month ago +1

    if we use video id + ts as the index for comments, could it be the case that some comments are posted at exactly the same ts?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      I mean you can always add the user id of the comment poster if you're afraid that duplicates are going to overwrite one another.

  • @siddharthgupta6162
    @siddharthgupta6162 8 months ago +1

    Thanks for the video, Jordan. Awesome content as always.
    Is there any difference between streaming vs chunking? I read somewhere that streaming is an error-prone process so one should prefer chunking over it - but there was no explanation on it.
    Any thoughts on this?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      Yeah to tell you the truth no clue - sounds like some guy spewing some bs as per usual with 99% of systems design videos lol

    • @siddharthgupta6162
      @siddharthgupta6162 8 months ago

      @@jordanhasnolife5163 lol sounds about right

  • @truptijoshi2535
    @truptijoshi2535 3 months ago +1

    Hi Jordan, can CDC have a single point of failure? If yes, how do we avoid? Also does CDC add extra latency?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      I mean in theory kafka, but I tend to imply that our Kafka cluster has replicas.
      CDC does make things slower, but I suppose in the cases where I use it I don't actually care (hence why I use it)

  • @PrabhuMarappan7
    @PrabhuMarappan7 2 months ago +1

    Hey Jordan, great videos for System Design. I was just wondering, will the client upload the whole file to S3 or upload it in chunks. And the backend process or job runner can essentially break it into chunks as it wants. Also, do you think splitting and uploading chunks will be more work on the client (the browser itself)?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Hey, it's possible that the chunks themselves are first routed via an intermediate server, but I think the actual chunking will itself first happen on the client, or else we lose some of the benefits of chunking. I agree that this is more load on the browser.

    • @lagneslagnes
      @lagneslagnes 1 month ago +1

      ​@@jordanhasnolife5163 Current blob/object store services provided by the main public clouds (wasn't true in past) have great support for chunking, streaming uploads, parallel uploads and resumable uploads. That is, even when we upload a single file, the service will do all that internally using their client SDKs and backend servers. i.e., it will chunk and reassemble the chunks to a single file in the blob store.
      For a video/audio file, I don't think we gain a lot by literally managing the chunks in the client, unless I'm missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      @@lagneslagnes If the SDK does it for us, fantastic! I'd also then like to confirm however that loading these files can be done similarly in chunks, so that we can adjust our bitrate/resolution as we load them. Hence the reason for me wanting to store them as conceptually separate files.

    • @lagneslagnes
      @lagneslagnes 1 month ago +1

      @@jordanhasnolife5163 So, the big difference between uploads in Netflix and, say, DropBox++, is that the chunks are more of an internal implementation detail of the streaming download protocol. For YouTube/Netflix (see your requirements) we do not need the client nor the backend to know and track the chunks after the transcoding/packaging of chunks is done. The machines in the processing pipeline will break the uploaded file into chunks, and during the final packaging step will put the metadata for the chunks within the streaming protocol's manifest file (e.g. something like a specially named XML file that sits in a folder with all the separate chunk media files for the overall media item). The backend database(s) do not really need to keep tracking the chunks after that point, because there is no requirement to allow updates of parts of an audio/video file.
      So uploading to a object store, via the object store's regular upload SDK/protocol >might< secretly chunk and re-assemble for us. But overall that is transparent to us, so it just looks like a single file upload.
      Later processing steps run on that raw file, breaks it into chunks and converts to various formats/resolutions, then reassemble into a package that include a manifest file with details for all the chunks/formats/resolutions.
      Later when a client of the streaming download protocol downloads the file, it grabs the manifest file, which includes pointers(URLs) for the chunks within that same folder (for example). There is no need for chunk information to be requested from our actual backend databases, it's all internal implementation details of streaming protocols.
      Summary: Managing chunks with "our" design's metadata DBs is only really useful when we do require updates to happen at the chunk level (i.e., not when the semantics is always to re-upload entire file).
      Minor point btw, your video was full of good content, and showed a great way for creating a new streaming protocol system that just happens to manage chunks less transparently.

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      @@lagneslagnes Thanks! Yeah, was aware this was the case for RTC streaming, wasn't aware S3 does it under the hood. Appreciate the info!

  • @jashmerchant5121
    @jashmerchant5121 2 months ago +1

    your tutorials feel relatively complex and fast-paced compared to others

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      My girlfriend used to say this about our sex life

  • @vigneshraghuraman
    @vigneshraghuraman 3 months ago +1

    once the chunks are uploaded by the user to S3, how does upload service know which chunks to put on the rabbit MQ? is this done via S3 notifications to the upload service?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      The client will upload chunks based on which ones are "new". Then they all go into rabbit mq.

  • @JulianA-rm4ry
    @JulianA-rm4ry 4 months ago +1

    Thank you Jordan

    • @JulianA-rm4ry
      @JulianA-rm4ry 4 months ago +1

      Now i'm only 1/2 screwed

  • @MuhammadUmarHayat-b2d
    @MuhammadUmarHayat-b2d 2 months ago +1

    qq: is the client responsible for chunking the video and uploading to S3? Or should there be a mechanism to upload the video directly to S3 and have some dedicated backend workers chunk it in an async fashion?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      Typically you'd want the client doing chunking to avoid having to retry uploading the full video in the event of some failure.

    • @MuhammadUmarHayat-b2d
      @MuhammadUmarHayat-b2d 2 months ago +1

      Makes sense. I also checked: S3 does provide support for multipart upload of fixed-size chunks, which would be handy here
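
      A minimal boto3 sketch of that multipart flow, splitting the file into fixed-size parts on the client; bucket, key, and file names are placeholders:

          import boto3

          s3 = boto3.client("s3")
          BUCKET, KEY = "raw-uploads", "user123/video456.mp4"  # placeholder names
          PART_SIZE = 8 * 1024 * 1024                          # parts must be >= 5 MiB (except the last)

          upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
          parts = []
          with open("video.mp4", "rb") as f:
              part_number = 1
              while chunk := f.read(PART_SIZE):
                  resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=part_number,
                                        UploadId=upload["UploadId"], Body=chunk)
                  parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
                  part_number += 1  # a failed part can be retried on its own, not the whole file

          s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload["UploadId"],
                                       MultipartUpload={"Parts": parts})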

  • @saurabhmittal6947
    @saurabhmittal6947 3 months ago +1

    hey jordan, I have one question.. how is the client able to uniquely generate the chunk-id and video-id? Here you are showing that the client will be uploading to S3 and then sending that data to the upload service, but who is assigning unique ids to all these entities flowing through our system?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      The video id can just be some userId + a hash or something. The chunk ID is also basically a hash and just needs to be unique per video id
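
      One hypothetical way to generate such ids (purely illustrative, not what the video prescribes):

          import hashlib
          import uuid

          def make_video_id(user_id: str) -> str:
              # userId plus a random suffix: unique without any central coordination
              return f"{user_id}-{uuid.uuid4().hex[:12]}"

          def make_chunk_id(video_id: str, chunk_index: int, chunk_bytes: bytes) -> str:
              # hashing the content plus its position is deterministic, so retrying the upload
              # of the same chunk produces the same id and stays idempotent
              digest = hashlib.sha256(f"{video_id}:{chunk_index}".encode() + chunk_bytes)
              return digest.hexdigest()[:16]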

  • @9527-ljc
    @9527-ljc 8 months ago +1

    Thanks, this is great content. For an entry-level SDE, which part should we focus on more in an SD interview?

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      If you're looking for junior roles, I'd honestly just keep grinding leetcode haha.
      Otherwise, I'd say that the whole video is still relevant. Can't hurt to learn!

  • @adithyabhat4770
    @adithyabhat4770 8 months ago +1

    Thanks Jordan!

  • @vorandrew
    @vorandrew 8 months ago +1

    Chunking stuff question... Why would you want to store chunks except in a cache? Let's say a video is 50 MB; do you want to permanently save 3-4 transcoded resolutions x 1-2 formats? A petabyte here, a petabyte there, and we are talking about big numbers... If you can always re-create them, there's no need to store transcodes for a video that was last viewed 3 years ago... cache them with a last-access timeout set to 1 week, for example... Maybe, at most, you want to store the first chunk for fast access

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      Would appreciate if you could elaborate here! While it's true that we could store the entire video file and never deal with any chunks, assuming we originally upload chunks to S3 when first uploading the file we'll always need at least some chunk metadata in our database to load them

    • @vorandrew
      @vorandrew 8 months ago +1

      @@jordanhasnolife5163 my guess is like this - we receive the file at its original resolution -> chunk it into 2 sec chunks -> long-term storage. Transcode the first chunk into 144, 240, 360, 480, etc. resolutions (don't store) -> CDN expiration = 1Y since last access (just to have a fast-start experience). Whenever somebody starts to watch a video, we transcode the necessary resolution on the fly from the original chunks in parallel and store it in the CDN with expiration = 1 week. I'm sure the combined transcode speed will be faster than the viewing speed, so viewing will stay seamless
      Regarding metadata - as you said, during upload we can store all the necessary chunking info in some NoSQL DB

    • @vorandrew
      @vorandrew 8 months ago +1

      Thank you for your videos! ❤ After viewing some, I can see your designs tend to give out space like the Fed is printing money 😂

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      @@vorandrew Ah I see what you're saying here, I think it's one of those things that we'd have to actually try out and see if the latencies would be low enough. We do care a lot more about lowering read latencies here, so I wonder if this would work in practice but it's an interesting thought!

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago

      @@vorandrew Haha yeah - my personal philosophy here is to use as much disk space as needed, we could always optimize for cost saving measures in the future! At least for the interview I don't know how often it would come up, but it's possible!

  • @indraneelghosh6607
    @indraneelghosh6607 7 months ago +1

    Hi Jordan. Had a few questions related to the video upload flow. Could you please explain why you chose RabbitMQ over Kafka for uploading the metadata? Also, there may be times when there is a spike in the number of videos being uploaded, particularly in the case of a YouTube-like system. I would expect video uploading on YouTube to have a rather irregular traffic pattern compared to a streaming platform like Netflix. Any ideas on how to tackle these spikes without manual intervention?

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago +1

      To be honest, I do think that the uploading on YouTube would be more regular than you think. You've got people in every timezone. But yeah, I guess the way you'd do it is just have your consumers that are doing the encoding be part of some hadoop cluster that also is performing other work in the meantime, and as more jobs come in for uploads you can kill whatever jobs those nodes are currently doing and use them for uploads.
      For your first question, RabbitMQ is going to allow me to use a fan out design such that I don't need a bunch of different partitions (one per consumer) as I would with kafka. I don't care about message ordering at all here, so a fan out is fine.

    • @lagneslagnes
      @lagneslagnes 1 month ago +1

      @@jordanhasnolife5163 I don't get the fan-out comment. In Kafka, you can fan-out to hundreds of consumers even with a single partition by putting each consumer in a different consumer group.

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      @@lagneslagnes that's fair - your suggestion is basically just partitioning the topic and having a bunch of nodes read from each partition, as opposed to using a single JMS broker which doesn't rely on ordered delivery.
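
      A minimal competing-consumers sketch with pika to illustrate the fan-out-to-workers idea; the queue name, message shape, and transcode step are all illustrative:

          import json
          import pika

          # every transcoding worker runs this same loop against the same durable queue, so
          # adding workers scales throughput without partitions or ordering guarantees
          connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
          channel = connection.channel()
          channel.queue_declare(queue="video_chunks", durable=True)
          channel.basic_qos(prefetch_count=1)  # hand each worker one un-acked chunk at a time

          def handle_chunk(ch, method, properties, body):
              task = json.loads(body)  # e.g. {"videoId": ..., "chunkId": ..., "s3Path": ...}
              # ... transcode the chunk, write the outputs to S3, emit a completion event ...
              ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only on success so failures are redelivered

          channel.basic_consume(queue="video_chunks", on_message_callback=handle_chunk)
          channel.start_consuming()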

  • @ankitagarwal4022
    @ankitagarwal4022 4 months ago +1

    @jordanhasnolife5163 Hi Jordan, I have just one question: your processor transforms each chunk into a list of transformed chunks, and the size of that list depends on the number of encodings * resolutions.
    Let's say for example we have 10 encodings and 4 resolutions; that makes 40. So we have to transform 1 chunk into 40 and upload all 40 into S3.
    I assume transforming one chunk into another is itself a heavy process. Can you suggest some optimization here, so that if our event processing fails we don't have to transform every chunk from the beginning?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      I'm pretty confused what you mean here - each resolution/encoding is processed independently in tandem already, so if one fails the rest do not fail, feel free to elaborate!

    • @ankitagarwal4022
      @ankitagarwal4022 4 months ago +1

      @@jordanhasnolife5163 what I understand about the flow of data:
      1. first we upload chunks to S3, let's say (c1, c2, c3, ...)
      2. add chunk details to the broker (RabbitMQ)
      3. The processor consumes chunk details from the broker, let's say C1, and puts a list of transformed videos (C1R1E1, C1R1E2, C1R1E3, C1R2E1, C1R2E2, C1R2E3) into S3, considering resolutions (R) = 2 and encodings (E) = 3. The processor also puts the list details into Flink.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      @@ankitagarwal4022 The only transformation of one chunk to another that we're doing right at the start is creating the list of all of the metadata that we will eventually need to create. So that can all go into rabbit mq, and once it does we can be fairly confident that the chunk will eventually be created downstream because it will only get removed from rabbit mq once the consumer puts the completion message in kafka
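
      A small sketch of that up-front fan-out into per-(chunk, resolution, encoding) work items; the resolution and encoding sets are made up:

          from itertools import product

          RESOLUTIONS = ["240p", "480p", "720p", "1080p"]
          ENCODINGS = ["h264", "vp9"]  # illustrative; real catalogs are larger

          def transcode_tasks(video_id, chunk_ids):
              # one message per (chunk, resolution, encoding); each is consumed and retried
              # independently, so a single failed transcode never restarts the whole video
              for chunk_id, res, enc in product(chunk_ids, RESOLUTIONS, ENCODINGS):
                  yield {"videoId": video_id, "chunkId": chunk_id,
                         "resolution": res, "encoding": enc}

          # e.g. publish each task and expect len(chunk_ids) * len(RESOLUTIONS) * len(ENCODINGS)
          # completion events downstream before marking the video fully processed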

  • @rahulrachh3320
    @rahulrachh3320 6 months ago +1

    Video Timestamp: 10:18
    Part-1:
    For the user videos table, we can omit the timestamp since userId+videoId makes a unique pair, and when you get the videos from the table, you get the timestamp, sort them, and then display the videos for a user who uploads videos. Correct me if I am wrong.
    Part-2:
    Also, in the video comments table, videoId will be unique, so why are we using a timestamp along with it? Does it help in getting output in a sorted manner?
    Thanks :)
    Edit: Added Video Timestamp

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      1: Definitely doable, however it is easier to keep things pre-sorted by timestamp in the metadata database so that you don't have to sort them on the fly for each read.
      2: You answered your own question :). Having a timestamp for comments allows us to easily fetch comments in a pre sorted order, as we can index those comments on timestamp per video.

    • @rahulrachh3320
      @rahulrachh3320 6 months ago

      @@jordanhasnolife5163 Thank you :) I love this series and System Design 2.0. This got me thinking of starting my own series on System Design topics. Maybe one day for sure :)

    • @rahulrachh3320
      @rahulrachh3320 6 months ago +1

      @@jordanhasnolife5163 Thanks got it. This series and System Design 2.0 are gold. I might even start making videos on similar topics sometime sooner :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      @@rahulrachh3320 Just don't take too many of my viewers away from me it's all I've got ;)

    • @rahulrachh3320
      @rahulrachh3320 6 months ago

      @@jordanhasnolife5163 haha, I'll try not to take the viewers ;)

  • @rakeshvarma8091
    @rakeshvarma8091 5 months ago +1

    You Are Awesome Bro!!

  • @niapuchun
    @niapuchun 6 months ago +1

    On the page at the 2:10 mark, the last line should say 1 million videos, shouldn't it?

  • @ravi72munde
    @ravi72munde 8 months ago +1

    For processing chunks, is it possible to use Kafka + Spark, so each Spark job handles a single video but processes its chunks on multiple workers, and at the end marks the job completed when all chunks are processed? That would make keeping state of the video's chunks redundant.

    • @jordanhasnolife5163
      @jordanhasnolife5163  8 months ago +1

      A couple of concerns here that you'd have to address:
      1) how do we know when to trigger the spark job?
      2) You're triggering a lot of spark jobs haha
      In practice, this may work! I think we'd have to try it out.

    • @ravi72munde
      @ravi72munde 8 months ago

      Good point! How about using a Kafka queue to queue jobs? A message would just contain the videoID that has chunks ready to process. A consumer could act as a Spark streaming (master) node: it picks an available message, fetches all the chunk_ids/file URLs for that video and distributes the chunks to worker nodes. Once all chunks are processed, the master node would know and mark the video as complete. As an advantage, it'll be easy to track which video failed rather than which chunks.

    • @lagneslagnes
      @lagneslagnes 1 month ago

      @@ravi72munde Your solution I would go with in an interview. I see nothing wrong with it.
      For interviews, I prefer solutions that abstract away a lot of details simply in favour of time.
      Jordan's solution is super cool though - for a longer more relaxed discussion. Most system design interviews I've had are a mad rush with no real dwell time to think in too much details.

    • @ravi72munde
      @ravi72munde 1 month ago

      I did and I got the job 🥳

    • @lagneslagnes
      @lagneslagnes 1 month ago

      @@ravi72munde Congrats!

  • @davidabu3170
    @davidabu3170 6 months ago +1

    you forgot the userId in the table, it is quite important

  • @calvincruzada1016
    @calvincruzada1016 9 months ago +1

    Awesome

  • @zhonglin5985
    @zhonglin5985 5 months ago +1

    At th-cam.com/video/43bB7oSn190/w-d-xo.html, another queue is needed to stream the total chunk count to Flink. This looks a bit redundant to me. Why don't we just include the total chunk count as an extra field on the events that are sent to RabbitMQ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Totally doable as well, I considered this approach too. I mainly assumed there'd be a lot of other metadata around and didn't wanna bloat the messages.

  • @imutkarshy
    @imutkarshy 9 months ago +8

    Your obsession with Flink 😅

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 months ago +3

      They should be paying me
      Oh wait it's open source

    • @imutkarshy
      @imutkarshy 9 months ago

      @@jordanhasnolife5163 Wait till they open a company like Confluent from this.

    • @sauravkumarsharma6812
      @sauravkumarsharma6812 5 months ago +1

      @@jordanhasnolife5163😂

  • @ankitagarwal4022
    @ankitagarwal4022 5 months ago +1

    @jordanhasnolife5163 thank you for your video. love your content