Great video, thanks! The interviewer did a nice job guiding the candidate away from straying too far off track
Glad it was helpful!
The videos with you alone talking are awesome. But I think this one, with an actual candidate doing a real interview, prepares candidates better, as it includes both the challenges to the design and the psychological prep that this kind of interview entails. This is gold and brings a lot of extra value.
Would love to see more of these
The videos you are posting are great, please continue to post more in the future! These are invaluable :)
great video! I have been following you, and must say the blogs, the feedback, and the question were awesome! Keep on uploading more of these!
Thank you! Will do!
Your content and clarity are way beyond expectations. Posting the comment now and will revisit once I have cleared my interviews. Thanks a lot, buddy!! May the force be with you!!
This is helpful especially after having read the FB Live Comments design on the website. Thanks to the candidate for sharing this and allowing us to all see this.
Great interviewer. His feedback is very useful!
[51:06] Couple of questions:
1. What's the rationale behind hashing over the video_id?
2. And are the videos uploaded by the commenters stored in DynamoDB? And do the video binaries necessarily have to go via the pub-sub? Can't it be short-circuited to blob storage directly, say Amazon S3 (with the video_s3_link being the partition key)?
3. Does it make sense to keep publishing to the topics even when there are no live users? Or is that the very reason (to decouple commenter processing from the viewers)?
4. What if the live users are connected from far-away geographies... should the partition key take into account the geo-location, in addition to the video_id (or video_s3_link)? Does Redis scale well across far-away geographies?
It was a great video! learnt a lot from this. Please keep posting such content.
Isn't it a bad design in the RTMS to consistent-hash based on the video?
1. One server anyway can't handle connections for 180M users for a hot video.
2. If that server goes down, all 180M subscribers will reconnect, leading to many other problems like thundering herd etc.
3. You will need to open a new connection whenever the user switches to a different video stream.
I think a separate mapping of videoId to user + then a user-based consistent hash inside RTMS solves all these issues. What do you think?
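A rough sketch of what I mean, with a toy hash ring and made-up server names (illustrative only, nothing from the video):

```python
import hashlib
from collections import Counter

def stable_hash(key: str) -> int:
    # Stable hash so placement doesn't change between processes/restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring (no virtual nodes, for illustration only)."""
    def __init__(self, nodes):
        self.ring = sorted((stable_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = stable_hash(key)
        for node_hash, node in self.ring:
            if h <= node_hash:
                return node
        return self.ring[0][1]  # wrap around the ring

rtms = HashRing([f"rtms-{i}" for i in range(8)])

# Hashing on video_id: every viewer of one hot video lands on ONE server.
print(rtms.node_for("video:hot-live-stream"))

# Hashing on user_id instead spreads the same audience across all servers;
# the fan-out layer then only needs a video_id -> {servers with subscribers} map.
spread = Counter(rtms.node_for(f"user:{u}") for u in range(100_000))
print(spread)  # roughly even across rtms-0..rtms-7
```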
Thank you so much! This is incredibly helpful.
I do have a question that came to mind while following the solution: why would we choose to implement a push mechanism instead of allowing the client to manage comment updates through polling? It seems to me that simplifying the process and steering clear of various pitfalls and edge cases would make more sense.
I assume there's a solid rationale behind this decision, but I'm not quite seeing it.
With millions of viewers across thousands of videos, to achieve real-time comments you would need to poll very frequently, which would be very inefficient and expensive on the server side. Moreover, the comments still wouldn't be truly real time.
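Rough back-of-envelope, with made-up numbers (not from the video), of why per-client polling gets expensive:

```python
# Made-up numbers, purely to show the shape of the problem.
viewers = 10_000_000        # concurrent viewers across all live videos
poll_interval_s = 2         # polling period needed to still feel "real time"
comments_per_s = 50_000     # total new comments per second, all videos combined

poll_rps = viewers / poll_interval_s
print(f"polling: ~{poll_rps:,.0f} requests/s, most returning no new comments")

# With push, servers do work only when a comment actually arrives,
# and latency is bounded by delivery time rather than the poll interval.
print(f"push: ~{comments_per_s:,} comment events/s to fan out to subscribers")
```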
How does the solution design solve the requirement of viewing old comments? What happens when a client joins live midway, since the new client will start getting messages from the current offset and not the first offset? Do we query Dynamo? Not much depth was provided on that front.
Did he get an offer?
Even we are invested now.
It depends on the level. E3/E4, possibly, but likely borderline. E5+ probably not.
- The candidate clearly demonstrated some experience with building distributed systems
- TC showed some hesitation when explaining certain choices - a red flag in most cases unless you can convince the interviewer that it is a miscommunication (instead of lack of understanding of tradeoffs).
I'm pretty sure the mock helped TC a lot and hopefully TC performed much better in the actual interview.
In your new design, which part handles loading old messages? What does the data flow for that look like in your new design?
Could you share the Excalidraw+ link with us?
@hello_interview, a question related to the answer key. You show the Realtime Messaging Service being responsible for notifying users. You mentioned during the functional requirements that there can be millions of users to notify. Can WebSockets handle that much traffic? How is it going to notify that many users? Even when I ponder on this and think that it can break the notification list into multiple queues and have a worker per queue, it's still a lot of workload. Isn't polling a better solution than pushing in such cases?
There is a single websocket connection for each client. The scale comes from scaling the real-time messaging services, so more servers to handle more client connections.
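A minimal sketch of that idea, assuming each real-time messaging server only fans out to the connections it terminates (class and method names are made up):

```python
from collections import defaultdict

class RtmsServer:
    """One real-time messaging server; it only holds its OWN clients' connections."""
    def __init__(self, name: str):
        self.name = name
        # video_id -> set of local websocket/SSE connections subscribed to it
        self.local_subs = defaultdict(set)

    def on_client_subscribe(self, conn, video_id: str):
        self.local_subs[video_id].add(conn)

    def on_pubsub_message(self, video_id: str, comment: str):
        # Fan out only to connections terminated on THIS server.
        for conn in self.local_subs.get(video_id, ()):
            conn.send(comment)

class FakeConn:
    def __init__(self, user): self.user = user
    def send(self, msg): print(f"{self.user} <- {msg}")

# Each client keeps a single connection to one server; adding servers adds capacity.
a, b = RtmsServer("rtms-a"), RtmsServer("rtms-b")
a.on_client_subscribe(FakeConn("u1"), "video42")
b.on_client_subscribe(FakeConn("u2"), "video42")
for server in (a, b):                 # e.g. both subscribed to the video42 channel
    server.on_pubsub_message("video42", "new comment!")
```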
For Meta E4, would an interviewer actually ask "Design Memcached on a single server" instead of some common product (Dropbox, YouTube, etc.)?
Yes
Great video! May I know which level this candidate is aiming for (E4, E5, or E6)? I have 2.5 YoE and am just a junior starting to prepare for E4 System Design; I'm unsure if I need to perform at the same level as this candidate. Thanks!
This was senior
This was awesome.
When you say this guy is interviewing for senior, what does that mean? 5+ years of experience, or a senior title at Meta?
Thanks for this awesome video! One question here - the candidate said "start with high level design" but then started talking about the API and data model. What's your thought here? Would it have been better if the candidate had started with the high-level flow and talked about the API/data model later?
Regarding the answer key, I think it misses an interesting scalability issue. For popular streams it wouldn't be practical to push ALL chat messages down to viewer devices, as that could be tens of MBs of data per second. Where and how would they be throttled? One potential implementation: AFAIK Redis pub/sub has an outgoing message buffer per subscriber (the realtime message servers in this case). A reasonable max capacity could be set for those buffers, and whenever there's backpressure the content management service would just drop messages on the floor (after persisting to the comments DB) - see the sketch at the end of this comment.
Also why use DynamoDB for the comments db? Higher write throughput could be achieved using Cassandra. It doesn't have strong consistency but eventual consistency is fine for this use case.
Some other interesting requirements that might come up in an interview
- live videos have an international audience. How would chat messages get distributed globally?
- how would inappropriate messages be filtered?
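Re: the throttling point above, a minimal sketch of a bounded per-subscriber buffer that sheds load (just one possible shape; the names are made up):

```python
from collections import deque

class BoundedFanoutBuffer:
    """Per-subscriber outgoing buffer that sheds load instead of backing up."""
    def __init__(self, max_len: int = 1000):
        self.buf = deque(maxlen=max_len)   # oldest entries are evicted when full
        self.dropped = 0

    def offer(self, comment: str):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1              # fine to drop: the comment is already
        self.buf.append(comment)           # persisted in the comments DB

buf = BoundedFanoutBuffer(max_len=3)
for i in range(10):
    buf.offer(f"comment-{i}")
print(list(buf.buf), "dropped:", buf.dropped)   # keeps only the newest few
```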
All great points! The throttling is something that I'll tend to probe in a staff interview actually.
Any preferred approach for throttling ?
The audio was cut in the last couple of minutes, when you started talking about Redis for historical comments. What was that part?
Just to clarify, this is a systems design interview for the Systems role not Product at Meta?
Correct
@@hello_interview If it was a Product interview question (which it can very well be), which areas of this design should one focus more than the other? I understand the high level difference between System and Product interview, but to reinforce my understanding, could you highlight focus areas for a product in reference to this Live Comment Problem?
This is a great video. I have a question: since you told the candidate that, due to possible security concerns, we should not be sending the userId in the request, what should we send instead of the userId, and how does the fetch service identify the particular user in this case?
It should be passed in the header, not as a REST parameter
@@azhp42069 yep, usually in a JWT
To be clear, the JWT or session token should be passed along in the header, and the server verifies/parses it to get the user id
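A small sketch of that flow, assuming the PyJWT package and an illustrative shared secret (not the actual implementation discussed in the video):

```python
import jwt  # PyJWT

SECRET = "server-side-secret"   # illustrative; in reality a managed key

def user_id_from_request(headers: dict) -> str:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise PermissionError("missing bearer token")
    token = auth[len("Bearer "):]
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])  # verifies signature
    return claims["sub"]   # user id comes from the token, never from a request param

# Client sends only the token; the server recovers the user id from it.
token = jwt.encode({"sub": "user-123"}, SECRET, algorithm="HS256")
print(user_id_from_request({"Authorization": f"Bearer {token}"}))
```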
What category will this fit in? Meta System Design or Product Design? Thanks for sharing the video... very helpful
Both :) more system but it’s shown up in both plenty
@hello_interview thank you, this is a great video. I want to ask one question.
The candidate proposed writing to the DB and then emitting a CDC event, while you proposed writing to the DB, waiting for the DB write, and then sending to Redis. Isn't that the dual-write problem? For example, the write to Cassandra succeeds but the write to Redis fails. Couldn't we avoid that by saving the data to the DB and having the save to Redis done via CDC?
We should use CDC; comments come in as a stream. Any change should be propagated to Redis/the cache.
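A tiny sketch of the single-write-path idea, with placeholder names (change_stream, publish_to_redis) standing in for DynamoDB Streams/Debezium and the Redis publish:

```python
def handle_new_comment(db, comment):
    db.append(comment)               # the only synchronous write the service does

def cdc_consumer(change_stream, publish_to_redis):
    # e.g. DynamoDB Streams / Debezium events; can be retried without a dual write
    for change in change_stream:
        publish_to_redis(change)

db_log = []
handle_new_comment(db_log, {"video_id": "v1", "text": "hi"})
cdc_consumer(iter(db_log), lambda change: print("publish:", change))
```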
If 180M users are watching video1, it means 180M SSE connections would have to be handled by a single server under the current design. I don't think SSE scales that well.
I think the idea is the RTMS can have multiple servers that own the same videoId. This way connections are spread around to different servers.
What if we use websockets ?
You can, but when you are pushing notifications to millions of users, it becomes a major challenge for the server to notify all of them. Let's break it down a bit.

Say a new comment comes in and is placed in a message queue (not shown in the video; not sure why the interviewee is relying on the DB to notify the message queue). That message contains the video id. Now we need to know which users are currently active; say a service keeps track of that when a user logs in. So we have a message in the queue and a NotifyUsersService that takes the message and is responsible for notifying users. It goes to LiveUsersService to get the list of all active users for that video, and gets back, say, millions of entries along with their host/port.

How would NotifyUsersService push this notification to millions of users? You could dedicate 10 queues, split the millions across them, and have a dedicated worker thread per queue. Even then, each worker has over a million messages to deliver. You see the issue? Just too much workload.

In this case, you don't want to push. Instead you want a hybrid approach: when online users > 10,000 (just a number I came up with), you switch from push to pull (long poll). This is the same approach taken when a celebrity posts and has many followers; in such cases you are better off pulling than pushing.
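A minimal sketch of that hybrid switch, with a made-up threshold and placeholder functions:

```python
PUSH_THRESHOLD = 10_000   # made-up cutover point

def delivery_mode(active_viewers: int) -> str:
    return "push" if active_viewers < PUSH_THRESHOLD else "long-poll"

def on_new_comment(video_id, comment, active_viewers, push_fn, cache_fn):
    if delivery_mode(active_viewers) == "push":
        push_fn(video_id, comment)    # fan out over the open connections
    else:
        cache_fn(video_id, comment)   # clients' long-polls pick it up in batches

for viewers in (500, 2_000_000):
    on_new_comment("v1", "hi", viewers,
                   push_fn=lambda v, c: print("push", v, c),
                   cache_fn=lambda v, c: print("cache", v, c))
```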
What tool is used for recording?
Built in house
@@hello_interview Thanks for letting me know, but I didn't get it. What is "built in house"? Is it different from Zoom and Teams? I couldn't find it with a quick Google search. I am asking because I want to practice giving interviews with others.
He means that the tool was built by them, a custom solution, not something that's out there for commercial or private use. @@basic-2-advance
@@basic-2-advance He means that it was specifically developed for the HelloInterview platform. It is not publicly available to use like Zoom or Google Meet, etc.
Does anyone know the name of the tool he is using to diagram?
Excalidraw
@@hello_interview Is this the tool Meta uses as well?
Watched.