How about having a queue between the temporary storage and the sender service? That way we get a fan-out approach where multiple sender instances poll the queue for messages and send them to the appropriate micro-service. The queue layer can be a distributed queue partitioned by topic. This will increase the throughput of our service and also make it more fault tolerant, since we decouple the temporary storage and sender components. Inside the sender service, as you mentioned, we can still keep the concept of task creation and execution to further increase throughput. Thoughts?
Yes, I have the same question; see my other comments. However, I don't think a message queue is necessary here, because in principle the temporary storage is the message queue. You can implement the storage with Kafka, or with NoSQL and just use it as a queue. IMO, a message queue is for heavy writes; between the temporary storage and the senders we don't really need heavy writes, just heavy reads. However, I do agree with you that the design should support multiple sender instances.
Great video, thank you! However, I'm a little confused. You list the Frontend service responsibilities (~5:24) including SSL termination, authorization, authentication, request dispatching, ... and then (~7:40) describe it as processing the message. Is that right? Initially I understood your definition of the Frontend Service as an API Gateway. Later I understood it as a Message Processing Service. Are these one and the same thing?
I think there are a few major areas where I would like to see more details if I were the interviewer: 1. What's stored in the temporary storage? What does the schema look like? Is it a denormalized message, say for subscribers a, b, c there are messages 1, 2, 3 not yet delivered? 2. How to handle a large fan-out message? Say Twitter used this notification system and Trump tweeted... XD Is it a single-node sender service that picks up the tweet and tries to notify all followers/subscribers? I guess that would take forever. I would really like to see #2 expanded, because large fan-out is a really hard problem to solve in real life.
Thank you so much for an informative video. I have a couple of questions: 1) When do we remove a topic from the temporary storage? How can we decide whether all the consumers have consumed the data? 2) What should happen if a consumer is added to a topic in the middle? How should the task executor deal with this?
1. We can first disable the topic for, let's say, x units of time (hours/days, etc.). We would not accept new notifications for this topic once it is disabled, which stops processing on the senders' end. The next step is to remove the topic, which can be done on the user's end after checking the pending messages, subscribers of the topic, etc. in the monitoring system.
2. Since for each message the subscriber list is fetched from the metadata storage, once a user has subscribed to a topic the list of users for that topic is updated there. Hence whenever a new message comes in, the sender picks up the updated list (setting aside cache eviction and update policies here) and starts sending the notifications, wrapping them up in new tasks.
The approach used in this video seems to be a synchronous push for the notifications. Does it scale well if the publish and subscribe QPS are in the hundreds of thousands, or even millions?
Fantastic explanation. I just have one question: what is the use of calling the metadata service from the frontend service if we are not passing that data to the downstream services and we call the metadata service from the sender service anyway?
Hi Saikat. Thank you very much for the feedback! Please take a look at my response in this thread: th-cam.com/video/bBTPZ9NdSk8/w-d-xo.html&lc=UgyCIBZD38Zvib7Gf-Z4AaABAg.8zF3OPIIMqv8zR54tNr8mP
This is really great. The best system design video I have found online. I am trying to find some information about how to design a Google Calendar-like system. Would you mind publishing a video on that topic?
The number of writes to the database is relatively small, as writes happen when new topics are created or subscribers subscribe. There are many reads in the system, but only a fraction of those reads actually hit the database, as most of them are served by the Metadata Service (cache). So both SQL and NoSQL can be used. To name specific options: DynamoDB, Cassandra, MySQL, PostgreSQL, Aurora. Personally, I would favor the NoSQL options for this use case (DynamoDB, Cassandra).
I think the frontend service in the slides should be renamed to backend service? Frontend usually implies the user interface, while backend is the application logic / controller / request processor.
Thank you for sharing! This is the most helpful material I have seen online... A question: how to scale the senders? I understand you mention a pool of threads, but that is for a single machine. Sorry if I missed the part where you covered scaling out. A simple solution: use consistent hashing to assign the task to a ring of instances/machines, just like a distributed cache. Would this be a good solution? Thanks!
Hi Hullo, Thank you for the question. The Sender service is scaled both vertically (by having more threads in the pool of a single machine) and horizontally (by adding more Sender instances/machines). The consistent hashing idea you mentioned will work as well, but we do not actually need it; simple random hashing will work. Consistent hashing is usually used to make sure the same machine is chosen for the same key (message). In our case we can just pick a random Sender machine to process the message. Random hashing is simpler and less prone to the "hot" Sender issue. Feel free to ask any follow-up questions. Will be glad to clarify it further.
@@SystemDesignInterview Firstly, thanks a lot for the videos. It's less than an hour, but when I go over it, it has so much material packed into it. Amazing. I have one question about horizontally scaling senders. All the threads from all the senders will now fetch messages for sending from a certain temporary storage. How do we sync between the senders? Within one sender, we can use locks and mark which message has been taken for delivery, so other threads will not process that message. But I am not sure how we can prevent two threads on different senders from processing the same message simultaneously.
@@theboo5857 You can use Kafka / AWS SQS, which have consumer group support that handles all of this out of the box for you. In AWS SQS, due to the concept of a visibility timeout, a message is visible to only one consumer at a time.
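To make the consumer-group point concrete, here is a minimal sketch with the confluent-kafka Python client (the broker address, topic, and group names are assumptions, and the delivery call is a placeholder). Within one consumer group, Kafka assigns each partition to exactly one Sender instance, so two Senders never pull the same message concurrently:

```python
from confluent_kafka import Consumer

def send_to_subscribers(payload: bytes) -> None:
    print("delivering", payload)  # placeholder for the real delivery logic

# Every Sender instance joins the same (hypothetical) consumer group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "notification-senders",      # same group id on every Sender instance
    "enable.auto.commit": False,             # commit offsets only after delivery
})
consumer.subscribe(["notifications"])        # assumed topic name

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    send_to_subscribers(msg.value())
    consumer.commit(message=msg)             # mark the message as processed
```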
Great video Mikhail! One comment though: there is no component tracking successful processing of a notification. Any of the sub-components of the sender may crash and would take down the list of to-be-executed and waiting-for-retry tasks. This would break the at-least-once delivery guarantee.
Hi Sankalp Bose. Thank you for the feedback! You are correct. And I have mentioned in the video that the best guarantee the system gives us is "at-least-once". When Senders retrieve messages and send them out, they need to acknowledge back to the Temporary Storage upon successful delivery. If this acknowledgement does not happen for any reason (e.g. the Sender machine sent a message and crashed right after that), the message will be retried, which may cause duplicates on the Subscriber's end. That is why it is important to de-duplicate messages on the Subscriber's end, if duplicates must be avoided.
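Since duplicates are possible with at-least-once delivery, here is a minimal sketch of subscriber-side de-duplication, assuming each message carries a unique id (the in-memory store and the one-hour window are only illustrative; a real subscriber would more likely keep seen ids in something like Redis with a TTL):

```python
import time

DEDUP_WINDOW_SECONDS = 3600      # assumed de-duplication window
_seen: dict[str, float] = {}     # message id -> time it was first processed

def handle_notification(message_id: str, body: str) -> None:
    now = time.time()
    # Forget ids older than the de-duplication window.
    for mid, ts in list(_seen.items()):
        if now - ts > DEDUP_WINDOW_SECONDS:
            del _seen[mid]
    if message_id in _seen:
        return                   # duplicate delivery of an already processed message
    _seen[message_id] = now
    print(f"processing {message_id}: {body}")
```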
@System Design Interview - We can use Redis for temporary storage as it is in memory, provides persistence, and is easy to use (get and put). Please let me know if there is a better option.
Finally, someone talking about system design who actually knows how things work internally :-).
Kudos to you brother!
This is somehow the best system design tutorial I've ever seen. Keep it up, bro!
This is somehow one of the best feedbacks I've ever seen. Keep sharing your thoughts with us all bro! )))
2nd this. Not only for interviews, but also for day-to-day work! thanks.
It is.
@@SystemDesignInterview thank you for making these videos! What would you recommend to learn these topics in depth? What do you personally use? Any good books, courses, etc.?
The pedagogical steps in all of these videos are perfect. So many other "courses" that say "here's how to design WhatsApp" don't really start from first principles like this! (leaving me scratching my head wondering why they did something!)
I hope you find time to continue making these kinds of videos. I'd absolutely support you via Patreon or an online course if you made one. There's really nothing as well presented as this out there!
The best part of this channel is that Mikhail does not use ready-made solutions available in the market. He shows us how to think simply and pushes us to think about what is under the hood. Thinking simply forces us to challenge our basics. Exactly what we need to prepare for a System Design interview. Sadly, he has not been publishing any new videos for a long time now :-)
Glad you liked the channel, Hadi! Thank you a lot for the feedback.
I do plan to come back to TH-cam with more regular video postings. Just need a bit more time to finish what I am working on currently. Stay tuned.
@@SystemDesignInterview appreciate all the videos!
I had an interview at A Very Big Company that asked me a similar question and this video was extremely helpful.
Very glad to be helpful! Thanks for sharing.
Best System design Video ever. No buzzwords and it also goes into depth of various components reasonably well. I'm waiting for more videos from you
Thank you for the feedback, Saurabh!
Mikhail made a course. More material there now)
Absolutely the best System Design videos on the internet. I especially love that the videos are kept relatively short despite the huge amount of knowledge and explanation. Waiting for the new videos! Thank you so much, Mikhail!
After watching several courses and reading several blogs, yours is truly top class, and one should watch each of your videos more than once to absorb it thoroughly.
These videos are really high quality and in depth. I haven't found any others diving this deep into each individual component, especially with options for each component and pros and cons for the choices made.
Thank you, Jitu! Glad you liked the content.
I feel this channel is a hidden gem!!!
Thank you so much for explaining things in great details!!
Thank you for the feedback, Satyanarayana!
This is the most densely packed sys design video I’ve seen and it’s so full of good information. Really appreciate your work!
Highly underrated content that is not just full of buzzwords.
Please keep it up. Can't wait for more uploads.
Excellent explanation!!!! You are teaching us how to think, which is a very important part. Most YouTube channels just show big and scary company architectures, which are not useful from the interview point of view or from the learning point of view.
Glad you liked the videos, kry kwy. Thank you for the feedback!
This is mind blowing. I didn't expect this kind of system design tutorial on TH-cam. Your style of explaining the HLD and then deep diving into each component is awesome; a hidden gem. Hats off to you. I know it takes a lot of time to create such content, but please try to upload more videos.
Thank you for the feedback, Dheeraj! Glad you liked the video!
Every second of this video is GOLD!
Glad you liked the video, Aman! Thank you for the feedback.
I have been watching Mikhail's videos for 2 years now, and they continue to be extremely valuable, and the best sys arch videos on youtube. If you can grok all of his concepts, you will be prepared. Still, it would be nice to see Mikhail cover some of the more advanced topics that the book "Designing Data Intensive Applications" discusses.
Dude your videos are so good! Best systems design videos I've seen, love how you've especially tailored it for interviews. Thank you!
I was speechless after watching this video. There were so many details and so much thought process being discussed in it that I ended up taking more than an hour (while taking notes and capturing the thought process). A great video covering breadth and depth. I know it is extremely hard to produce such high-quality content, but I would love to see more of it.
Thank you, Hemant, for the feedback. Really appreciate your kind words!
Your videos are highly informative. Please create more such videos. They are way better than other YouTube videos on system design I have seen. The only thing I found missing is scalability numbers and estimates like bandwidth calculations, the number of servers needed, the amount of storage needed, and how these grow as system stress increases with a growing user base. In some videos I noticed a top-down approach, where they start with numbers first and break the question down into feasible small systems. Here I see a bottom-up approach, where we start with a small system and grow it with scale. I believe starting with a small system and growing it is better than starting with a big system, breaking it down, and playing with numbers.
Hi Mangesh. Thanks a lot for the feedback!
Let me add a topic to my TODO list on how to estimate capacity (number of servers, required network bandwidth, storage, etc.) for a distributed system.
As an interviewee, I find it's hard to start with numbers unless the problem domain is well known. Without understanding the API (what data we send into the system and how data is retrieved) and at least a high-level design (where and how we store this data), there is a high chance that the estimated numbers will be far off.
And as an interviewer, I would also recommend postponing numbers until the end of the interview, or providing them during the interview if requested. In many cases the problem domain is not known to us upfront and we need to start with something simple (similar to a brute force solution in a coding interview). What is really important for me as an interviewer is that a candidate is able to identify units of scalability. E.g. for the notification service such units are: number of publishers, average number of topics per publisher, average number of messages per topic, average number of subscribers per topic, amount of time we store messages in the system (retention period), etc. And as long as we can discuss scalability issues at a t-shirt size (small, medium, large) level, it is usually enough to evaluate a candidate's ability to think at scale. Numbers are crucial for real designs, though.
@@SystemDesignInterview I totally agree with you that it is hard to start with numbers. We should have some idea about the numbers to come up with all the design considerations, but it doesn't sound like a great idea to crunch all the numbers and capacity estimates at the beginning, like how many hosts are required, before even getting to the details of the design. So in my mind, we should have some idea about the scale we need to support, but we need not get into capacity estimation initially.
But unfortunately, many other YouTube channels have started following this idea of capacity estimation, and it is becoming such a norm that some inexperienced interviewers also think it is normal to expect it and insist on it.
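To illustrate how those units of scalability translate into rough numbers, here is a minimal back-of-envelope sketch; every value below is made up purely to show the arithmetic, not taken from the video:

```python
# All numbers are assumptions, chosen only to demonstrate the calculation.
publishers = 10_000
topics_per_publisher = 10
messages_per_topic_per_day = 100
subscribers_per_topic = 50
message_size_bytes = 2_000
retention_days = 7

topics = publishers * topics_per_publisher
incoming_per_day = topics * messages_per_topic_per_day
deliveries_per_day = incoming_per_day * subscribers_per_topic
storage_bytes = incoming_per_day * message_size_bytes * retention_days

print(f"incoming messages/sec ~ {incoming_per_day / 86_400:,.0f}")
print(f"deliveries/sec        ~ {deliveries_per_day / 86_400:,.0f}")
print(f"temporary storage     ~ {storage_bytes / 1e9:,.1f} GB")
```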
Very helpful content, not just for interview preparation but for a deeper understanding of the concepts. Truly enjoyed all your series. I actually interviewed with one of the FAANG companies and was given a similar system to design.
Thank you for the feedback, Athanasios! I hope your interview went well!
Please Please keep doing this bro! This is the best system design tutorial I've ever watched.
Thank you, Desheng, for sharing the feedback! Much appreciated.
I browsed through 4 system design channels, you are the best so far
Another excellent System Design video: extremely thorough, detail oriented, and nailing both Functional and Non-functional requirements.
Appreciate the feedback, Salman! Thanks.
Best system design interview resource online!!!! The only drawback would be the number of videos :) keep up the great work!!
Thanks Yue Liang! Working on more videos. The next one will come out in a week.
Yue, got you here:)
Great content! One question I had: currently the frontend server pushes the message to temporary storage and the sender component retrieves the message from it. Here, we may need to keep watching the temporary storage to start the sender process. Instead, we could take a hybrid approach: 1. All publish() calls are forwarded to the sender component after persisting the data in temporary storage; the sender then acts on the task.
2. In the case of message failures, we write them into a retry queue handled by a retry component.
" we may need to keep watching the temporary storage to start the sender process".
But isn't it the Sender service (using a thread pool) that decides whether it wants to read the data? See 15:25.
@@ignashi7plays401 Yes, you are right. I think he meant a push-based system, but the video describes a pull-based one.
Best System Design channel. Very helpful. Please keep posting learning videos.
Glad you liked the channel, Shobhit! Appreciate the feedback!
Thumbs up here!
I implemented a similar but more complicated system (with an additional retry queue, message filtering and transformation, and a complex UI with search capabilities) like the one you described in this video; it is a production platform service used by engineering teams in the company. You are certainly a seasoned engineer with hands-on experience in some of the technologies you talk about in this video.
Thank you for the feedback, junminstorage!
You totally deserve my praise for implementing a real production system like this. There are many small details that are tough to do right. Well done!
I guess this is the best System Design Interview resource available. Thanks for creating this channel.
Damn bro. This was dope. Thank you for making such a detailed video. Deepest gratitude ever expressed.
Deepest gratitude for the feedback, bro )) Appreciate it!
Where were you my entire life bro. This is so good.
Glad we eventually found each other ))) Thank you for the feedback, Dipendra!
Thank you so much for the awesome content. Also, I want to thank you for replying to most of the comments; I learned a lot from your answers there. I would like to see more and more such high-quality videos, but given that these videos require a lot of effort and, as you mentioned in other comments, you are too busy with work, we respect your time and choice.
Thank you very much for the feedback, Abhishek! Please let me know if you have questions/concerns or are looking for advice on a technical problem. Will be glad to help.
Another fantastic addition to my system design favorites! Such a simple and easy-to-understand design of a notification service. Thank you Mikhail, and please keep posting such great videos :)
Thank you, Nitin, for the feedback! Much appreciated.
I'm glad that FPS Russia has found a new passion in systems design.
I am so lucky I stumbled upon this channel. Amazing work, please keep em coming :)
Mikhail - Amazing videos as always. Really good content and explanation. Thanks for spending time creating such videos! One minor suggestion - it will be great if you can also explain the DB choices (SQL vs NoSQL) and which of the popular NoSQL DBs (Dynamo, Cassandra, MongoDB etc) is a good choice for the use case being discussed in all your videos.
Hi N M! Thank you for the feedback!
You are right, DB related topics deserve more attention. And I plan to close this gap. The recent video (th-cam.com/video/bUHFg8CZFws/w-d-xo.html) contains some information on this topic. More to come!
Your designs are probably the best and your explanations are easy to understand. I encourage you to do more system design videos.
Thank you for the feedback, Kanaiya! Appreciate it.
Nice explanation. 👍👏
Often interviewers are looking for the entity models for the stored data. i.e. metadata for a subscription in this case.
It would greatly help to include those aspects as well.
I don't think people realize how much of the functional requirements tie in with knowledge of TOPIC-BASED Pub-Sub. Quite in depth for something that's not explained.
I find the part at 2:34 under-explained.
I just found your videos to help me study, they are the best in depth explanations!
Glad to be helpful, novaaaa. Welcome to the channel!
Thanks for your sharing. This is the best system design video I've ever seen and I will recommend this to others!
That is everything I can ask for - share the knowledge with more people. Thank you for the feedback, Jeremy, and the kind words!
This is simply amazing. Thanks for making these videos. I think you should write a book or create some paid tutorials. I would definitely pay for those. I am learning a lot through your videos already.
He already has a paid course, bro.. buy it.
@@pushpendrasingh1819 Thanks for letting me know. Seems to be a new creation. Definitely checking it out.
Thanks a lot again for such a great channel. That's awesome. I'm in the middle of system design interview preparation and your channel helps me a lot.
A few comments/questions:
1. As far as I understood, a notification event handled by the sender service gets endpoint information from the metadata service. Let's consider a broadcast event (for example, an SMS storm warning, or an email from a system to all users about a new policy) that has to be delivered to more than 100,000 endpoints. If it is a single event processed by a single node, even in a multi-threaded environment, it would take way too long.
I'd probably generate events that contain both the event and the endpoint beforehand, to speed up the actual delivery. Each such event could even contain not the message itself, but just a pair of two IDs. This idea is based on the assumption that delivery itself is a more time-consuming operation than enriching an event with endpoint information.
2. Regarding the temporary storage, you asked a very good question. The best that comes to my mind is probably a hybrid approach, i.e. having some key-value NoSQL DB for failed events (that I'd redeliver later on) and a Redis cache for everything else.
Hi quantumlexa. Apologies for the delayed response. And thank you for the feedback!
Your idea of decoupling event generation and the actual delivery is a good one. Please take a look at this thread, where we discussed resembling ideas: th-cam.com/video/bBTPZ9NdSk8/w-d-xo.html&lc=Ugzg_JJd9yUMUX9ySwt4AaABAg
P.S. Wish you all the luck on your interviews!
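As a small illustration of the decoupling idea from point 1 above, here is a sketch that splits one broadcast event into many small delivery tasks (the batch size and the in-process queue are hypothetical stand-ins); any available Sender instance can then pick up a task independently:

```python
import queue

BATCH_SIZE = 100   # assumed number of endpoints per delivery task

def fan_out(event_id: str, endpoints: list[str], task_queue: "queue.Queue") -> None:
    # One broadcast event becomes many small tasks, each carrying the event id
    # and a slice of endpoints, so the delivery work is spread across Senders.
    for i in range(0, len(endpoints), BATCH_SIZE):
        task_queue.put({"event_id": event_id, "endpoints": endpoints[i:i + BATCH_SIZE]})

tasks = queue.Queue()
fan_out("evt-1", [f"user{i}@example.com" for i in range(250)], tasks)
print(tasks.qsize())   # 3 tasks of up to 100 endpoints each
```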
Great Job! very clear and detailed explanation. Please make more videos on other system design problems and topics.
Awesome! Every component, every process, and every possibility are explained very well. Thank you!
This is awesome. I have some doubts at 19:29; there seem to be redundant components. When the message retriever thread has got the message, why create tasks? It could send directly to the HTTP, email, etc. microservices. What are we achieving by putting messages through a task creator and then running threads again in a task executor, which eventually calls the other microservice? Seems overcomplicated.. One message retrieval thread could take a message and just send it to the HTTP or whichever endpoint is required.
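For what it's worth, here is a minimal sketch of the retriever/executor split (all names and the in-process queue are hypothetical stand-ins): the usual reason for the extra layer is that delivery calls are slow and can fail, so the thread that polls temporary storage only creates tasks and never blocks on a delivery.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

temporary_storage: "queue.Queue[str]" = queue.Queue()   # stand-in for the real storage

def deliver(message: str) -> None:
    # Placeholder for calling the email/SMS/HTTP micro-service; may be slow or fail.
    print(f"delivering: {message}")

def message_retriever(executor: ThreadPoolExecutor, stop: threading.Event) -> None:
    # The retriever only pulls messages and hands them off as independent tasks.
    while not stop.is_set():
        try:
            message = temporary_storage.get(timeout=1)
        except queue.Empty:
            continue
        executor.submit(deliver, message)   # task creation + execution in the pool

stop = threading.Event()
executor = ThreadPoolExecutor(max_workers=8)            # the "task executor" pool
threading.Thread(target=message_retriever, args=(executor, stop), daemon=True).start()

temporary_storage.put("hello subscribers")
time.sleep(2)      # give the retriever a moment to pick up the message
stop.set()
```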
Rarely leave comments on videos, but that one is amazing! Great job! Please post more videos on related design problems (e.g. monitoring/logging systems, Twitter/Instagram clones, etc.)!
Thank you Vadim, appreciate the feedback! There are plans to cover all the topics you mentioned. But cannot promise specific dates.
Amazing content !! Probably the best.
Thanks for sharing your knowledge in such a presentable manner.
Excellent video! Another super effective way to prepare for system design interviews: do mock interviews with FAANG engineers at Meetapro.
You are doing a really great work. Thanks for compiling so much knowledge here.
A great resource for in-depth understanding of System Design.
Thanks a lot! Really the best system design content, with a very good balance of high-level logic and low-level details. This is what I have been looking for!
Thank you for the feedback!
There is so much detailed and thoughtful information packed into every second of this video. Thank you so much 🙏
Thank you, Christopher, for the feedback!
I just found your channel in 2022. You have made a lot of good system design videos. It's sad that there are no new videos anymore. I hope that you can make more quality videos.
Superb, detailed and wonderful presentation !!!
Great step-by-step approach. It would be great if you could provide a link to the final design that viewers can keep handy as the big picture while learning the specifics/details of each sub-component. This recommendation/request applies to all of your system design videos.
Hi Sharath. Thank you for the feedback!
Can you please clarify how you see it? Do you mean combining all components with their details on a single slide? Or do you mean a text blog post version of the video? Or something else?
Very apt... keep up the good work.. content always speaks.
Thank you for the feedback, Harsha.
These are the best videos I have seen, and I could not have learned this material better without them. Why don't you publish anymore? If you have shifted to another publishing channel, please post it in a reply.
Upon revisiting, I think the requirements at 2:34 are very vague. For example, why pub-sub with topics? That assumes notifications need to be fanned out. For 1:1 messaging that needs notifications, maybe individual queues would be better, with producer/consumer-style messaging?
Great video! Keep up the good work! :)
Pros for using SQS: the data size is small, there is no strict requirement around ordering for this problem, and it is reliable. On the other hand, that raises several questions: when exactly do we erase the message from the queue? Only when all the tasks have picked up the message? What happens if one of the tasks fails, etc.
It might also be useful to discuss WebSockets for notifying the clients. I'm assuming that's what the tasks would ultimately do?
Thank you for the feedback, Saranya!
Agree with you. I should have extended the video to cover details of pushing messages to end clients (subscribers). Specifically, talk about HTTP polling (long and short), websockets, server-sent events. Let me leave this topic for a separate discussion.
Very clear and concise presentation of a complex system. As the winner among the Temporary Storage options I would choose a streaming solution like Kafka; I'd like to know your lucky winners as well. Also, I wonder whether the MessageRetriever's implementation actually varies based on the choice of temporary storage.
Agree with you, Xiaoyun. Here is my take on this:
th-cam.com/video/bBTPZ9NdSk8/w-d-xo.html&lc=UgwmaW3Ek0-XnkXb8KB4AaABAg.8tJRBPf4mun8tKuzz6sHVs
As for the MessageRetriever's implementation, you are right, it depends on the temporary storage we use and some other factors, e.g. whether the order of messages is important. If it is, we'd better stick to a single-threaded retriever. If not, we can use a multi-threaded message retriever.
@@SystemDesignInterview I was curious too, but that link takes me back to this same video.
Best design tutorials, really helpful. Thank you for making them!
Thank you for providing feedback on a regular basis!
Can't appreciate more 🙌 Kudos!
Please make videos for: 1. Design TH-cam/Netflix, 2. Design Twitter/Facebook, 3. Design Yelp, etc.
For the temporary storage, I am thinking Apache Kafka might be a good fit, since it can also handle streaming data.
Good call! I also favor the message queue and stream processing platform options. With the built-in mechanisms that many such systems provide, we can filter messages and apply transformations on top of them. And Kafka is a popular choice. One example: www.confluent.io/blog/real-time-financial-alerts-rabobank-apache-kafkas-streams-api/
@@SystemDesignInterview Isn't this whole system basically building a kind of Kafka?
Hi Shalin. Please take a look at this comment and the whole thread: th-cam.com/video/bBTPZ9NdSk8/w-d-xo.html&lc=UgxoaZ_vr1TFpHn6ynx4AaABAg.90FYh-PbeGT91FmzoWHuf-
@@SystemDesignInterview First, as many have mentioned, these are incredibly good quality videos. Thanks for the significant effort that went into these. I have a follow-up question. Is a message queue the favored solution in this case because of the built-in mechanisms and popularity, and therefore the ease of obtaining information and support, or are there other wins such as performance, complexity, or cost?
Unrelated second follow-up question regarding ordering: in a real-world scenario, if I'm already in the AWS environment, is there any justification for implementing my own FIFO queue solution when I can just pay more (apparently 20% more) for SQS FIFO? I have to consider additional resource cost and operational cost, and I have not crunched the numbers. But I was wondering if there are some other limitations or pitfalls of SQS. I think you may have mentioned something in passing in one of your videos, but I can't remember which.
Hi Miao,
One of the main benefits of a queue (e.g. message brokers, Kafka) is ordering support (at least at the partition level). Ordering is not always required for notifications, but it is typically preferable.
Depending on the volume of messages, a queue solution may be substantially cheaper than e.g. a database; as the number of messages grows, a queue helps to save on cost. Queue APIs usually allow batching messages on both the producer and consumer side, reducing the number of calls and, as a result, the total cost.
Also, building a notification service on top of a queue of some sort is a widely used pattern, I would say. For example, message brokers (e.g. RabbitMQ) implement the Pub/Sub mechanism by reading messages from a queue and pushing them to many consumers over TCP connections.
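As a small illustration of the batching point, here is a sketch with boto3 (the queue URL and message bodies are made up); both calls move up to 10 messages per request, which is where the savings on the number of API calls come from:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # assumed

# Producer side: one API call carries up to 10 messages.
entries = [{"Id": str(i), "MessageBody": f"message-{i}"} for i in range(10)]
sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)

# Consumer side: fetch up to 10 messages per call, with long polling.
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for msg in response.get("Messages", []):
    print(msg["Body"])
```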
Great video, but I have to mention the following problems:
1. How does message acking work? Once we start thinking about message acking, it exposes problems in the database choice and the sender design (multi-threaded vs single-threaded, partitioning; acking probably needs some form of transactions).
2. Missing dead letter queue.
Your videos are so good! Thank you!
Great Content and and awesome explanation. keep doing more videos bro
I have a question regarding the Metadata Service and its data storage. In the video, you mentioned that the Metadata Service will be a distributed cache system, but you didn't mention what technology you would use or in which way you would use the cache. I think that for the distributed cache we could use Zookeeper (you mentioned this) and Redis (multiple nodes). The caching strategy would be cache-aside. For storing the data permanently, I think we could use a key/value store such as DynamoDB. The key would be the name of the topic and the value would be a list of subscribers. If we need to store more information about the topic or the subscribers, I would use a document-based DB such as MongoDB/CouchDB. Does this make sense?
Hi Milton. Thank you for the question!
All technologies you mentioned make total sense to me and can be applied here.
I like your thought process and attention to detail. Keep sharing your thoughts!
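A minimal cache-aside sketch along those lines (the table name, key names, and TTL are assumptions): read the topic's subscriber list from Redis first, fall back to the key/value store on a miss, and populate the cache afterwards.

```python
import json
import boto3
import redis

cache = redis.Redis(host="localhost", port=6379)                   # assumed cache node
table = boto3.resource("dynamodb").Table("topic_subscribers")      # assumed table name
CACHE_TTL_SECONDS = 300

def get_subscribers(topic_name: str) -> list:
    cached = cache.get(f"topic:{topic_name}")
    if cached is not None:
        return json.loads(cached)                                  # cache hit
    item = table.get_item(Key={"topic_name": topic_name}).get("Item")
    subscribers = list(item["subscribers"]) if item else []        # durable store read
    cache.set(f"topic:{topic_name}", json.dumps(subscribers), ex=CACHE_TTL_SECONDS)
    return subscribers
```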
Messages need to be stored permanently (at least the last year) for auditing purposes and to keep track of failed deliveries, when the subscriber is not available (offline) or there is an outage with the third-party service used to send out the notifications. Coming to the data store, it should be Cassandra because of the sheer amount of write ops compared to read ops.
If you're using a message queue like Kafka or Amazon SQS, they already handle metadata storage. Then we only need to implement the consumers (i.e. the senders).
18:10 is the key thought process for deciding on an approach in an interview... how would we do it as fast as you did???
I watched these six videos several times. They are really good. Will you consider a video about monitoring system?
Thank you for the feedback, Li Haopeng! I have this topic in my short list. But cannot promise specific dates.
Absolutely amazing... I haven't found any other video from any other YouTuber with such detail. Thanks for the content. Are there any videos of yours which are not available on YouTube, maybe on Udemy? I want to check those out as well. Kindly tell.
Thanks
The videos on this channel are among the best system design videos. I wonder why you stopped making them?
Would there be more videos coming any time soon?
Thanks for the detailed explanation.
Thank you for the feedback!
Great video!
I have a question regarding message retriever component. How do we ensure that same messages are not being read from temporary storage by multiple threads/hosts?
Hi Ashwin. Great question.
It depends on what Storage we use. Let's take a look at different options.
Message queue. There may be several different flavors. For example, in AWS SQS, when a message is retrieved, it is marked as invisible for some period of time. This prevents other consumers from processing the message. More on this here: docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
In the case of Kafka or AWS Kinesis there is the concept of a monotonically increasing sequence number. Consumers keep track of which message (number) they have processed.
In the case of a database, we need to implement our own logic. For example, delete the message from the database when it is retrieved and store it back if message delivery fails.
Please also remember that SQS/Kafka/Kinesis support at-least-once semantics, which means that the same message may be delivered to the consumer more than once.
@@SystemDesignInterview Instead of deleting the message from the database, we can consider marking it inactive using some flag, because if the host processing the messages goes down, all those messages might be permanently lost.
@@SystemDesignInterview With Kafka/Kinesis/Azure EventHub, doesn't Kafka only allow one consumer per partition? If you have multiple threads, Kafka will block other threads from concurrently accessing the same partition. So, in reality, the threads read the Kafka messages sequentially. We could of course have different threads read from different partitions of the same topic, but there is still no chance of those threads re-reading each other's messages, unless some thread/consumer crashed before committing the processed sequenceId. Then another consumer thread might pick it up and re-read the message. A re-read should be fine, since we want an at-least-once guarantee anyway.
2:26 Functional
- publish(topicName, message)
- subscribe(topicName, endpoint)
Why does only subscribe need the endpoint parameter? Is there some difference between how the publisher and the subscriber communicate with the service?
Can someone help me figure it out?
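A hypothetical in-memory sketch of the two calls from 2:26, just to illustrate the asymmetry: the subscriber registers a delivery endpoint (where notifications should later be sent), while the publisher only hands over a topic and a message.

# Toy model of the API, not the actual service implementation.
class NotificationService:
    def __init__(self):
        self.subscribers = {}            # topic_name -> list of endpoints

    def subscribe(self, topic_name: str, endpoint: str) -> None:
        self.subscribers.setdefault(topic_name, []).append(endpoint)

    def publish(self, topic_name: str, message: str) -> None:
        for endpoint in self.subscribers.get(topic_name, []):
            print(f"deliver {message!r} to {endpoint}")   # stand-in for real delivery

service = NotificationService()
service.subscribe("payments", "https://billing.example.com/hooks/notify")
service.publish("payments", "invoice #42 is overdue")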
Hi, this is one of the best videos I have seen. Can you make similar videos for WhatsApp & Twitter as well? I have gone through all your videos and they are all amazing. But I see you haven't been able to upload any videos for some time. A sincere request (and I think I can say this for everyone here): please work on more such videos. You are doing amazing work.
Hi Hanspeter Pfister. Thank you for the feedback! I have both topics you mentioned in my TODO list.
I am pretty busy at work these days. But I keep thinking about all my dear viewers and will come back with more videos.
@@SystemDesignInterview Hey There - Hope you are staying safe - did you get a chance to action the TODO list :) ?
Hi Suren. Thank you for pushing me :)) Appreciate it :))
I am not yet working on the topics mentioned in this thread. But I am working on more content. Let me not give you false promises by specifying any dates. I do plan to create and publish videos more regularly. Waiting for those days to come...
I must say, this is amazing content (probably the best on the internet) and you really are doing a great service to job aspirants and distributed systems enthusiasts. I have one question: what would the database schema be for the metadata database? I guess we are only storing topic and subscriber information in the metadata database, and in the worst case (cache miss), when we call the database, we need to get the full subscriber list for a topic really fast for the Sender service. Also, in the worst case, we need to do request validation (e.g. that the topic exists) fast. What do you think the database schema for the Metadata Service should be, and in what format should we store data in the caching layer?
Any insights to this question?
Subscribers here are users who may or may not have an account/registration to receive notifications. So to store the user information and their notification preferences, PostgreSQL could be used. To store the information about a topic and the subscribers tied to it, we can use a key-value store like DynamoDB, and to cache this information we can use a Redis cluster.
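A purely illustrative shape for one metadata item and its cached form, following the key-value idea above (all field names are made up):

import json

# Hypothetical DynamoDB item: partition key is the topic name, the value
# holds the subscribers with their delivery protocol and endpoint.
topic_item = {
    "topic_name": "payments",
    "owner_account": "1234",
    "subscribers": [
        {"protocol": "email", "endpoint": "ops@example.com"},
        {"protocol": "http",  "endpoint": "https://billing.example.com/hooks/notify"},
    ],
}

# In the Redis cache the same item can be stored as a JSON string under a
# key like "topic:payments", with a TTL for eviction.
cached_value = json.dumps(topic_item)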
How about having a queue between the temporary storage and the Sender service? In this way we get a fan-out approach where multiple Sender instances poll the queue for messages and send them to the appropriate micro-service. The queue layer can be a distributed queue based on topics. This will increase the throughput of our service and also make it more fault tolerant, as we would decouple the temporary storage and Sender components. Inside the Sender service, as you mentioned, we can still have the concept of task creation and execution to further increase throughput. Thoughts?
Yes, I have the same question as you; look at my comments above. However, I don't think a message queue is necessary here, because in principle the temporary storage IS the message queue. You can implement the storage with Kafka. You can also implement it with NoSQL and just use it as a message queue. IMO, a message queue is for heavy writes. Here, between the temporary storage and the Senders, we don't really need heavy writes; we mostly need heavy reads.
However, I do agree with you that there should be support for multiple Sender instances.
Great video, so thank you! However, I'm a little confused. You list the FrontEnd service responsibilities (~5:24) including SSL termination, authorization, authentication, request dispatching, ... and then (~7:40) describe it as processing the message. Is that right? Initially I understood your definition of the FrontEnd Service as an API Gateway. Later I understood it as a message processing service. Are these one and the same thing?
What is the difference between the FrontEnd component and an API Gateway?
I think there are a few major areas where I would like to see more details if I were the interviewer:
1. What's stored in the temporary storage? What does the schema look like? Is it the denormalized message? Say for subscribers a, b, c there are messages 1, 2, 3 not yet delivered?
2. How do we handle large fan-out messages? Say Twitter used this notification system and Trump tweeted... XD Is it a single-node Sender service that picks up the tweet and tries to notify all followers/subscribers? I guess that would take forever.
I really like the idea of expanding on #2 above, because large fan-out is a really hard problem to solve in real life (see the batching sketch below).
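One possible way to attack the large fan-out in #2 is to split the subscriber list into fixed-size batches so many Sender instances can work on a single "celebrity" message in parallel; a rough sketch (the batch size is arbitrary):

# Split a huge subscriber list into batches; each batch can be enqueued as
# its own task/message so no single Sender node has to notify millions of
# followers by itself.
from typing import Iterable, List

def make_fanout_tasks(subscribers: List[str], batch_size: int = 1000) -> Iterable[List[str]]:
    for i in range(0, len(subscribers), batch_size):
        yield subscribers[i:i + batch_size]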
Wow, this video and the channel is soooo helpful to me! Thank you so much!
Glad it was helpful! Thank you for the feedback, Xiangjie!
Awesome system design tutorial.
Nicely explained, thank you for this great video.
Thank you, Abhishek, for the feedback! Much appreciated!
Nice video. @13:40, I think Cassandra is not a columnar database; it's row-oriented.
Still waiting on your monitoring video follow-up! Really good content.
You are too nice, it's fine.
I will listen with the subtitles.
Thanks for the effort.
Glad you found a way out. Thanks for the feedback!
This is a gem. Thank you so much!
Thank you so much for an informative video. I have a couple of doubts
1) When do we remove a topic from the temporary storage? How can we decide whether all the consumers have consumed the data?
2) What should happen if a consumer gets added to a topic in the middle? How should the task executor deal with this?
1. We can first disable the topic for, let's say, x units of time (could be hours/days, etc.). We would not accept new notifications for this topic after it is disabled. This way we stop processing notifications on the Sender's end. The next step is to remove the topic, which can be done on the user's end after checking the pending messages, the subscribers of the topic, etc., from the monitoring system.
2. Since the subscriber list is fetched for each message, once a user has subscribed to a topic, the list of users for that topic will already have been updated in the metadata storage.
Hence, whenever a new message comes in, the Sender picks up the updated list (leaving aside cache eviction and update policies here) and starts sending the notifications, wrapping them up in a new task.
The approach used in this video seems to be a synchronous push for notifications. Does it scale well if the publish and subscribe QPS are at the hundreds of thousands or even millions level?
Thank you for the material. Is Matt Damon a programmer?
good job. very clear explanation.
Fantastic explanation.
I just have one question. What is the point of calling the Metadata Service from the FrontEnd service if we are not using that data to send to the downstream services, and we are calling the Metadata Service from the Sender service anyway?
Hi Saikat. Thank you very much for the feedback!
Please take a look at my response in this thread: th-cam.com/video/bBTPZ9NdSk8/w-d-xo.html&lc=UgyCIBZD38Zvib7Gf-Z4AaABAg.8zF3OPIIMqv8zR54tNr8mP
This is really great. The best system design video I found online. I am trying to find some information about : How to design a Google Calendar like system. Would you mind publishing a video on that topic ?
Thank you for the feedback, Xin! Added your topic to the TODO list. Please do not expect a quick answer though, the list is already quite long ((
Which DB should we use as our metadata DB? We talked about the Metadata Service being a distributed cache, but what about the DB?
The number of writes to the database is relatively small, as writes happen when new topics are created or subscribers subscribe. There are many reads in the system, but only a fraction of those reads actually hit the database, as most of the reads are served by the Metadata Service (cache). So both SQL and NoSQL can be used. To name specific options, here is a possible list: DynamoDB, Cassandra, MySQL, PostgreSQL, Aurora. Personally, I would favor the NoSQL options for this use case (DynamoDB, Cassandra).
I think the FrontEnd service in the slides should be renamed to a backend service? Frontend usually implies the user interface, while backend is the application logic / controller / request processor.
Thank you for sharing! This is the most helpful material I have seen online... A question: how do we scale the Senders? I understand you mention a pool of threads, but that is for a single machine. Sorry if I missed the part where you explained scaling out. A simple solution: use consistent hashing to assign the task to a ring of instances/machines, just like the distributed cache. Would this be a good solution? Thanks!
Hi Hullo,
Thank you for the question. Sender service is scaled both vertically (by having more threads in the pool of a single machine) and horizontally (by adding more Sender instances/machines).
The consistent hashing idea you mentioned will work as well. But we do not actually need consistent hashing. Simple random hashing will work. Consistent hashing is usually used to make sure the same machine is chosen for the same key (message). In our case we can just pick a random Sender machine to process the message. Random hashing is simpler and less prone to the "hot" Sender issue.
Feel free to ask any follow up questions. Will be glad to clarify it further.
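A tiny sketch of the two options from this thread (host names are made up); note the hashing variant below is a simple hash-mod rather than a full consistent-hash ring:

# Random assignment spreads messages uniformly across Sender hosts;
# hashing by message id always maps the same id to the same host.
import hashlib
import random

senders = ["sender-1", "sender-2", "sender-3"]        # hypothetical hosts

def pick_random(message_id: str) -> str:
    return random.choice(senders)

def pick_by_hash(message_id: str) -> str:
    h = int(hashlib.md5(message_id.encode()).hexdigest(), 16)
    return senders[h % len(senders)]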
@SystemDesignInterview Can these multiple Sender instances pick tasks out of a distributed MQ like SQS?
Hi Tej. Why not? AWS SQS is one of the options for the Temporary Storage. We can use other message brokers as well.
@@SystemDesignInterview Firstly, thanks a lot for the videos. It's less than an hour, but when I go over it, it has so much material packed into it. Amazing. I have one question about horizontally scaling the Senders. All the threads from all the Senders will now fetch messages for sending from a certain temporary storage. How do we sync between the Senders? Within one Sender, we can use locks and mark which message has been taken for delivery, so other threads will not process that message. But I am not sure how we can prevent two threads on different Senders from processing the same message simultaneously.
@@theboo5857 You can use Kafka / AWS SQS, which have consumer group support that handles all of this out of the box for you. In AWS SQS, thanks to the concept of a visibility timeout, a message is visible to only one consumer at a time.
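A minimal sketch of the consumer-group idea, assuming the kafka-python client and a hypothetical topic name; within one group Kafka assigns each partition to exactly one consumer, so two Sender hosts never read the same message at the same time:

# Every Sender process joins the same consumer group; offsets are committed
# only after the notification has actually been sent (at-least-once).
from kafka import KafkaConsumer

def send_notification(payload: bytes) -> None:
    print("sending", payload)             # stand-in for the real delivery call

consumer = KafkaConsumer(
    "notifications",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="sender-service",
    enable_auto_commit=False,
)

for record in consumer:
    send_notification(record.value)
    consumer.commit()                     # mark the message as processed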
Great video, Mikhail! One comment though: there is no component tracking successful processing of a notification. Any of the sub-components of the Sender may crash and would take down the list of to-be-executed and waiting-for-retry tasks. This would break the at-least-once delivery guarantee.
I also thought the same
Hi Sankalp Bose. Thank you for the feedback!
You are correct. And I mentioned in the video that the best guarantee the system gives us is "at-least-once". When Senders retrieve messages and send them out, they need to acknowledge back to the Temporary Storage upon successful delivery. If this acknowledgement does not happen for any reason (e.g. a Sender machine sent a message and crashed right after that), the message will be retried, which may cause duplicates on the Subscriber's end. That is why it is important to de-duplicate messages on the Subscriber's end if duplicates must be avoided.
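A hedged sketch of subscriber-side de-duplication, assuming each delivered message carries a unique id (the field name is made up):

# Keep ids of recently seen messages and skip duplicates. In production this
# set would live in a shared store (e.g. Redis with a TTL) rather than in
# process memory.
seen_ids = set()

def process(message: dict) -> None:
    print("processing", message)          # actual business logic goes here

def handle_notification(message: dict) -> None:
    msg_id = message["message_id"]        # hypothetical field name
    if msg_id in seen_ids:
        return                            # duplicate delivery, ignore it
    seen_ids.add(msg_id)
    process(message)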
@System Design Interview - We can use Redis for the temporary storage, as it is in-memory, provides persistence, and is easy to use (get and put). Please let me know if there is a better option.
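If Redis were the temporary storage, the simplest illustrative queue primitive would be a list: LPUSH on the write path and a blocking BRPOP on the Sender side (a rough sketch, not the approach from the video):

# Redis list as a simple message queue: the FrontEnd pushes, Senders pop.
# Note: BRPOP removes the message immediately, so a Sender crash right after
# popping can still lose it; that is the trade-off versus SQS/Kafka.
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(topic: str, payload: str) -> None:
    r.lpush(f"queue:{topic}", payload)

def dequeue(topic: str, timeout: int = 5):
    item = r.brpop(f"queue:{topic}", timeout=timeout)
    return item[1] if item else None      # item is a (key, value) tuple or None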
I did not get it: are all the components within one server? For example, is the load balancer Nginx, as one variant?