I have seen all of your videos and I really love the way you explain all the possible solutions along with tradeoffs with all solutions. Please also make system design videos on Distributed task scheduler, Quora, Google Map, as well as newsfeed. Also, I know you have done a door-dash video with Gaurav Sen but for the sake of completeness on this channel and also the format of an interview makes things go in a certain way, please make video for food delivery system as well
Hey Sharad! I appreciate the suggestion, but I should say this: I feel that a lot of those videos that you mentioned, I don't really need to make and I'll explain why: Google maps - already done News feed - already done Door dash - same as Uber Quora - basically just a bunch of rest services, don't really see the complicated aspect of this one Task Scheduler - I suppose I could make this lol I'm being very careful on this channel to not make redundant videos, as I don't want to waste your guys' time, and I don't want to waste mine either. I'm going to make a dedicated video to this topic, but I hope you understand :)
@@jordanhasnolife5163 Thanks for the detailed reply. I also want you to consider making a video for Promotion Engine(offers, discounts). Also, do you think we need a detailed video on low-level components of E-commerce like Catalog, Order Orchestrator, and Inventory Management or your E-commerce video is good enough?
@@sharad20073024 Could you describe the question further? If it's something that requires a good amount of technical depth and is unique to my other videos I'll happily do it.
Great video. At 14:21, where you describe lock server nodes will hold data in memory, I would add that since we also write to a write ahead log (you mentioned this earlier on), we can always reconstruct the in memory data structure if a node fails from the log and not have to entirely rely on other replicas' in memory contents. I don't think relying on other replicas' data held in memory would be acceptable for a lock system like it would for a simple cache like Redis (just my opinion, would love to be corrected). So, I believe something like this would work well: 1. In memory hash table (not a "log") to keep track of who owns which lock 2. On disk write ahead log which should be fairly quick to append to (just like the one used in LSM based databases).
Hi! I appreciate the comment - I think the general point here is that while a write ahead log works for restoring state that was already on a node, the system state will have changed while that node is down and as such you will want to use the other replicas to make sure that a node that had failed but is back online is up to date. This is the general point of consensus, I'd recommend checking out my video on Raft if you have some time - it should clear things up. Generally speaking though what you've suggested is inline with what I've tried to suggest in my video too :)
So I've taken a few minutes to think about this: how is this problem different than distributed locking? Being able to make a universal increasing counter implies that you have total order broadcast which implies linearizability/consensus. Let me know if you were thinking of something slightly different!
How about we just store key value like resource name to node id which says lock acquired by that node, and whenever we want to acquire a lock, we do something like check and put in hbase which is atomic in nature underlaying it does locking as well for that.
@@tavvayaswanth While this works, it still requires consensus - HBase is built on top of Hadoop which is not fault tolerant due to the NameNode which is a single point of failure. The only way to make it fault tolerant is to use a secondary name node which is kept up to date using an external coordination service like zookeeper.
@@jordanhasnolife5163 got it more over its better to have lock information in memory across hosts as the reads would be more I assume and we don’t need to persist to hbase which would increase latency, storing in memory works well as long as we have slaves which can be consistent using raft protocol as you mentioned.
@@jordanhasnolife5163 hbase already uses zookeeper for leader election so it will work fine it provides acid properties at row key level by using locking also it is consistent because of zookeeper so we don’t need any further changes there.
How can we scale this to multi Data centre ? would that mean Raft instance in each DC? But then if DC2 can grab a lock which DC1 has acquired and not replicated to DC2 yet?
Thanks for the video! It's great to watch this video! One suggestion is that can u write a code(maybe just a piece) for this? Sometimes they asked for code even for the system design round. For this, just like how to lock and release with the linked list etc. thanks!
fencingToken requires the storage system to implement the functionality around verifying and validating the fencingTokens. If its a simple file system then i think need to have an additional layer before the storage system.
You make a good point, I suppose I could have mentioned that more explicitly but placing a caching layer in front of the storage system would probably suffice!
@@jordanhasnolife5163 hmm but this has nothing to do with caching right? I think what they meant was that rejecting write requests based on the fencing token should be handled either in the client or in another auxiliary service in between the client and the object storage. Unless you meant the caching layer would cache and do that.
In the DynamoDB example @10.14, if those were quorum writes and quorum reads, then the scenario of reading the record from only one node won't arise. As in that case, the write itself would have failed. Similarly, if the write was successful, then quorum read would fail if it gets two different values. R+W > N should guarantee reading the up-to-date record.
Hey! So the write "failing" just means the client sees that their write didn't go through. How does the database know that it needs to revert that write? It has to ask the other nodes? What if there's a network partition?
@@jordanhasnolife5163 Writes in DynamoDB are through Leaders, a write is successful only when a quorum of nodes accept it. For Reads(from AWS documentation) -> "DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed." I don’t have the details on whether DynamoDB reverts or just discards the unsuccessful writes, but the point here is subsequent Reads won’t fetch such records.
@@jordanhasnolife5163 Writes in DynamoDB are through Leaders, a write is successful only when a quorum of nodes accept it. For Reads(from AWS documentation) - DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed. I don’t have the details on whether DynamoDB reverts or just disregards the unsuccessful writes, but the point here is subsequent Reads won’t fetch such records.
Writes in DynamoDB are through Leaders, a write is successful only when a majority of nodes accept it. For Reads(from AWS documentation) - "DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed." I am not quite sure whether DynamoDB reverts or just disregards the unsuccessful writes, but the point here is subsequent strongly consistent Reads won’t fetch such records.
In the diagram that you have shown, are the clients running on machines which are completely different from Raft servers (in other words, this means are Raft servers external to the client applications) ? or are the clients running on same nodes of Raft servers (peer-to-peer) ? I have a use case where there are some n machines. I want only one of these machines to perform a task. Is there any peer-to-peer protocol which will allow only one of these machines to have lock at a point of time ? or should these machines make an external call to a RAFT server to acquire lock ? I want to know if there is any solution where we don't want to use an external service for locks and instead use a peer to peer protocol to acquire locks within the same set of machines where client applications are running.
I guess that's kinda the point of something like the Blockchain - but that being said yeah the raft servers should typically be dedicated, or else u can have a job being run on one of the raft servers that happens to be in the minority
In th-cam.com/video/szw54UbPJRE/w-d-xo.html , you said the machine thinks it still has the lock. So it can corrupt the file on s3. I don't understand why it leads to corruption? Since the ttl is expired, lock server releases the lock for the down machine and gives it to another machine. Now if the dead machine goes alive again, it can't edit the file at the same time, because it doesn't have the lock anymore.
Simply grabbing the lock doesn't deem whether or not you can write to s3, as s3 has no knowledge of the distributed lock. S3 will accept writes from anywhere. The distributed lock is only useful for managing the activities of different nodes within our own backend.
I think it is best if you read DDIA chapter 8 - The truth is defined by the majority - sections The leader and the lock & Fencing Tokens. Basically the dead server (which previously held the expired lock) once it comes back alive will still be able to write to the register it was accessing because in its mind it thinks it still holds the lock. It was in the middle of writing when it went dead temporarily. Without proper usage of incrementing Fencing tokens the s3 data would corrupt.
I think node being down means network partition. It may still be possible that the node is fine. If that is the case, then the node will still be holding the lock even though it expired at the remote service that granted the lock. DDIA is indeed a great reading material and a must I would say if you are preparing for interviews.
It is a crime this channel doesn't have more subs. This is easily the best system design content on YT.
The real crime is more people not buying my foot pics
I have seen all of your videos and I really love the way you explain all the possible solutions along with tradeoffs with all solutions. Please also make system design videos on Distributed task scheduler, Quora, Google Map, as well as newsfeed. Also, I know you have done a door-dash video with Gaurav Sen but for the sake of completeness on this channel and also the format of an interview makes things go in a certain way, please make video for food delivery system as well
Hey Sharad! I appreciate the suggestion, but I should say this: I feel that a lot of those videos that you mentioned, I don't really need to make and I'll explain why:
Google maps - already done
News feed - already done
Door dash - same as Uber
Quora - basically just a bunch of rest services, don't really see the complicated aspect of this one
Task Scheduler - I suppose I could make this lol
I'm being very careful on this channel to not make redundant videos, as I don't want to waste your guys' time, and I don't want to waste mine either. I'm going to make a dedicated video to this topic, but I hope you understand :)
@@jordanhasnolife5163 Thanks for the detailed reply. I also want you to consider making a video for Promotion Engine(offers, discounts). Also, do you think we need a detailed video on low-level components of E-commerce like Catalog, Order Orchestrator, and Inventory Management or your E-commerce video is good enough?
@@jordanhasnolife5163 Thanks for the detailed reply. I also want you to consider making a video for Promotion Engine(offers, discounts).
@@sharad20073024 Could you describe the question further? If it's something that requires a good amount of technical depth and is unique to my other videos I'll happily do it.
I'm gonna watch all your videos man, this is awesome.
6:56 Fencing tokens
7:56 Assigning fencing tokens
9:09 Pitfalls of quorum read-writes
10:51 Consensus and Raft
14:38 Thundering herd problem
Thanks man, best and most concise vid I ever watched. please making more
Thank you very much for you contribution to the community.
Great video. At 14:21, where you describe lock server nodes will hold data in memory, I would add that since we also write to a write ahead log (you mentioned this earlier on), we can always reconstruct the in memory data structure if a node fails from the log and not have to entirely rely on other replicas' in memory contents. I don't think relying on other replicas' data held in memory would be acceptable for a lock system like it would for a simple cache like Redis (just my opinion, would love to be corrected).
So, I believe something like this would work well:
1. In memory hash table (not a "log") to keep track of who owns which lock
2. On disk write ahead log which should be fairly quick to append to (just like the one used in LSM based databases).
Hi! I appreciate the comment - I think the general point here is that while a write ahead log works for restoring state that was already on a node, the system state will have changed while that node is down and as such you will want to use the other replicas to make sure that a node that had failed but is back online is up to date. This is the general point of consensus, I'd recommend checking out my video on Raft if you have some time - it should clear things up. Generally speaking though what you've suggested is inline with what I've tried to suggest in my video too :)
How about a system that generates unique auto incrementing numbers
So I've taken a few minutes to think about this: how is this problem different than distributed locking? Being able to make a universal increasing counter implies that you have total order broadcast which implies linearizability/consensus. Let me know if you were thinking of something slightly different!
How about we just store key value like resource name to node id which says lock acquired by that node, and whenever we want to acquire a lock, we do something like check and put in hbase which is atomic in nature underlaying it does locking as well for that.
@@tavvayaswanth While this works, it still requires consensus - HBase is built on top of Hadoop which is not fault tolerant due to the NameNode which is a single point of failure. The only way to make it fault tolerant is to use a secondary name node which is kept up to date using an external coordination service like zookeeper.
@@jordanhasnolife5163 got it more over its better to have lock information in memory across hosts as the reads would be more I assume and we don’t need to persist to hbase which would increase latency, storing in memory works well as long as we have slaves which can be consistent using raft protocol as you mentioned.
@@jordanhasnolife5163 hbase already uses zookeeper for leader election so it will work fine it provides acid properties at row key level by using locking also it is consistent because of zookeeper so we don’t need any further changes there.
Unfortunately S3 doesn't have conditional writes.
Alas
why timestamps are not reliable ?
Ah I've got videos devoted to this, gist is clocks aren't synchronized between different computers
awesome video
How can we scale this to multi Data centre ? would that mean Raft instance in each DC? But then if DC2 can grab a lock which DC1 has acquired and not replicated to DC2 yet?
I think you'd want just one raft instance for your entire job scheduler, that's your source of truth right there
Would love to see a "Design a hotel booking system" video!
Hey! I'd take a look at the ticketmaster video, it's effectively the same thing
Thanks for the video! It's great to watch this video! One suggestion is that can u write a code(maybe just a piece) for this? Sometimes they asked for code even for the system design round. For this, just like how to lock and release with the linked list etc. thanks!
Sure, I can try to do something like this for when I re-make it in a couple of months
@@jordanhasnolife5163 thanks!
fencingToken requires the storage system to implement the functionality around verifying and validating the fencingTokens. If its a simple file system then i think need to have an additional layer before the storage system.
You make a good point, I suppose I could have mentioned that more explicitly but placing a caching layer in front of the storage system would probably suffice!
@@jordanhasnolife5163 hmm but this has nothing to do with caching right? I think what they meant was that rejecting write requests based on the fencing token should be handled either in the client or in another auxiliary service in between the client and the object storage. Unless you meant the caching layer would cache and do that.
can you share the link to your wrist watch?
www.movado.com/us/en/shop-watches/movado-face-3640032.html
In the DynamoDB example @10.14, if those were quorum writes and quorum reads, then the scenario of reading the record from only one node won't arise. As in that case, the write itself would have failed. Similarly, if the write was successful, then quorum read would fail if it gets two different values. R+W > N should guarantee reading the up-to-date record.
Hey! So the write "failing" just means the client sees that their write didn't go through. How does the database know that it needs to revert that write? It has to ask the other nodes? What if there's a network partition?
@@jordanhasnolife5163 Writes in DynamoDB are through Leaders, a write is successful only when a quorum of nodes accept it.
For Reads(from AWS documentation) -> "DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed."
I don’t have the details on whether DynamoDB reverts or just discards the unsuccessful writes, but the point here is subsequent Reads won’t fetch such records.
@@jordanhasnolife5163 Writes in DynamoDB are through Leaders, a write is successful only when a quorum of nodes accept it.
For Reads(from AWS documentation) - DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed.
I don’t have the details on whether DynamoDB reverts or just disregards the unsuccessful writes, but the point here is subsequent Reads won’t fetch such records.
Writes in DynamoDB are through Leaders, a write is successful only when a majority of nodes accept it.
For Reads(from AWS documentation) - "DynamoDB provides read-committed isolation and ensures that read operations always return committed values for an item. The read will never present a view to the item from a write which did not ultimately succeed."
I am not quite sure whether DynamoDB reverts or just disregards the unsuccessful writes, but the point here is subsequent strongly consistent Reads won’t fetch such records.
@@sshah2k9 Yeah I think ultimately you'd have to dive into how they know a record isn't committed - they need to send two network calls to that node
In the diagram that you have shown, are the clients running on machines which are completely different from Raft servers (in other words, this means are Raft servers external to the client applications) ? or are the clients running on same nodes of Raft servers (peer-to-peer) ?
I have a use case where there are some n machines. I want only one of these machines to perform a task. Is there any peer-to-peer protocol which will allow only one of these machines to have lock at a point of time ? or should these machines make an external call to a RAFT server to acquire lock ? I want to know if there is any solution where we don't want to use an external service for locks and instead use a peer to peer protocol to acquire locks within the same set of machines where client applications are running.
I guess that's kinda the point of something like the Blockchain - but that being said yeah the raft servers should typically be dedicated, or else u can have a job being run on one of the raft servers that happens to be in the minority
0:35 says might shit his pants, a few seconds later new outfit, we know what happened. Its ok we all been there.
Must have been a great shit
In th-cam.com/video/szw54UbPJRE/w-d-xo.html , you said the machine thinks it still has the lock. So it can corrupt the file on s3. I don't understand why it leads to corruption? Since the ttl is expired, lock server releases the lock for the down machine and gives it to another machine. Now if the dead machine goes alive again, it can't edit the file at the same time, because it doesn't have the lock anymore.
Simply grabbing the lock doesn't deem whether or not you can write to s3, as s3 has no knowledge of the distributed lock. S3 will accept writes from anywhere. The distributed lock is only useful for managing the activities of different nodes within our own backend.
I think it is best if you read DDIA chapter 8 - The truth is defined by the majority - sections The leader and the lock & Fencing Tokens.
Basically the dead server (which previously held the expired lock) once it comes back alive will still be able to write to the register it was accessing because in its mind it thinks it still holds the lock. It was in the middle of writing when it went dead temporarily. Without proper usage of incrementing Fencing tokens the s3 data would corrupt.
I think node being down means network partition. It may still be possible that the node is fine. If that is the case, then the node will still be holding the lock even though it expired at the remote service that granted the lock.
DDIA is indeed a great reading material and a must I would say if you are preparing for interviews.
Neat
Low key kinda funny, okay never mind he just said hes going to shit himself. Hilarious.
Hi from my shitter
Has someone told you, you look like Zuck... !!!
That's gotta be anti semitic