I don't get the part about putting the tiny url into the db. What's the difference between solution 1 and solution 2? They both check whether a tiny url exists in the db and, if not, put the pair into it.
If a server crashes or restarts, does it get back the same range or a new range? If it gets a new range every time, the ranges could get exhausted after a point. How do we recover the expired URLs with this design? What happens after exhaustion?
I am curious: why not generate the hashes offline and store them somewhere, so that the API to create a short URL can just "grab" a hash key from this store? I guess this assumes that the data store has an atomic operation: "select one row and delete it". Is that feasible to implement?
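The atomic "grab one key" operation asked about above can be sketched with SQLite standing in for the key store. The table and key values are made up; the point is that wrapping the select and delete in one transaction means two callers can never receive the same key.

```python
import sqlite3

def grab_key(conn):
    # Select one key and delete it inside a single transaction, so two
    # callers can never receive the same key.
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = cur.execute("SELECT key FROM keys LIMIT 1").fetchone()
    if row is None:
        conn.rollback()
        return None  # store exhausted
    cur.execute("DELETE FROM keys WHERE key = ?", (row[0],))
    conn.commit()
    return row[0]

# Demo store with two pre-generated keys (values are made up).
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; txns managed manually
conn.execute("CREATE TABLE keys (key TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO keys VALUES (?)", [("aB3xY9k",), ("Qw2rT5u",)])
k1 = grab_key(conn)
k2 = grab_key(conn)
k3 = grab_key(conn)  # None: nothing left
```

A production store would do the same thing with row-level locking or a `DELETE ... RETURNING` statement where the database supports it.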
Tushar, I thoroughly enjoy your design analysis. One suggestion though, it would be beneficial for the audience if, in the description, you directly linked any reference to the dependencies - concepts, articles etc (example, Apache Zookeeper) that you talk about in the video. Thanks and keep being awesome.
For the first 43-bit MD5 solution, what do you do if you encounter a collision between 2 different long URLs, i.e., 2 totally different long URLs that have the same first 43 bits of their MD5 sums?
Perhaps we could pluck some parts from the counter solution, e.g., take 30 bits of the MD5 + the lower 16 bits of the timestamp (since the upper bits change much more slowly) + 7 bits randomly generated or from a counter?
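That bit-packing suggestion might look like the following sketch. The 30/16/7 field widths are the commenter's proposal, not a standard scheme, and the result is 53 bits, so it would encode to more than 7 base62 characters.

```python
import hashlib
import random
import time

def compose_id(long_url, now=None, rand_bits=None):
    # Pack: top 30 bits of the MD5, low 16 bits of the timestamp (the
    # fast-changing part), and 7 random bits -- 53 bits total.
    md5_30 = int.from_bytes(hashlib.md5(long_url.encode()).digest()[:4], "big") >> 2
    ts_16 = int(now if now is not None else time.time()) & 0xFFFF
    r_7 = rand_bits if rand_bits is not None else random.getrandbits(7)
    return (md5_30 << 23) | (ts_16 << 7) | r_7

# Fixed inputs so the result is reproducible for the demo.
example = compose_id("https://example.com", now=0x1234, rand_bits=5)
```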
19:19 , when a worker is given a new range to work with, would it have to generate a number within that range (either randomly or incrementally), get an MD5 hash of that number, and then pick the first 43 bits of that hash?
I am not getting the 3rd technique of checking with the db. How is it not inconsistent? If one puts the tiny url and long url in the db, then a get will always return the latest stored long url for that tiny url.
Awesome video Tushar. I don't understand how Approach 3 would work when you put(tinyURL, longURL), then get(tinyURL) and check whether it equals the longURL we have. If there was a duplicate in the DB, won't it get overwritten when I put(tinyURL, longURL), and won't the get fetch the new longURL value?
Indeed, I have the same understanding. The REST API put will update if the key already exists, no? So not only will this Approach 3 never end up in the "fail and try again" scenario, it will even cause an issue: if a client tries to get the key, they will get the updated longURL, which is not what they wanted.
Instead of covering how to design this starting from what we would ask the interviewer, then creating a high-level design, then scaling that solution and so on, the presenter goes straight into the details of how to generate the tiny url. I would recommend looking at this for a good example of how to design tiny url: www.hiredintech.com/classrooms/system-design/lesson/52
The counter-based approach is not optimal. There are multiple challenges, like maintaining the locking in a worker thread. Another thing: these days worker nodes are deployed multiple times a day, and every time a worker will lose its counter and have to ask Zookeeper for a new range.
Hi Tushar, I have a question about your 3rd technique: if you put the newly generated value in the DB and then, in the 2nd step, get the value back and verify it against your long URL, won't it always match? The put overrides the value at the same key, so the result is always the same and you lose the old value.
I have the same concern as you. I think the third approach should have some strategy to prevent this situation. Otherwise, the old value will be overwritten.
I remember a timestamp is 64 bits (8 bytes). "The internal representation of a timestamp is a string of 7 - 13 bytes. Each byte consists of 2 packed decimal digits. The first 4 bytes represent the date, the next 3 bytes the time, and the last 0 - 6 bytes the fractions of a second."
This is the year 2038 problem: a signed 32-bit integer can represent Unix time, but only up to 2,147,483,647 (2^31 - 1) seconds from the epoch (00:00:00 UTC, 1 January 1970). The last representable second is 03:14:07 UTC on 19 January 2038, and the counter overflows one second later, at 03:14:08.
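The rollover date can be checked directly from the 32-bit limit:

```python
from datetime import datetime, timezone

# A signed 32-bit Unix timestamp tops out at 2**31 - 1 seconds
# after the epoch (00:00:00 UTC, 1 January 1970).
max_ts = 2**31 - 1
rollover = datetime.fromtimestamp(max_ts, tz=timezone.utc)
```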
In the last range based approach where we involve zookeeper, how are we generating the random short URL? Are we taking the integer and doing a base62 conversion? If we do that for smaller integers like say 100, we might not be getting a short URL key which is 6 characters long. Kindly clarify
Would it be correct to pass all randomly generated urls through a bloom filter, so we can use the bloom filter to check for the existence of a tiny url before using it instead of querying the database? We can say a tiny url is only valid if the bloom filter says it is not in the database; otherwise generate a new one and put it into the database. The length of the tiny url could be increased to improve the odds of randomly hitting an unused tiny url, and recovering tiny urls by putting expiration times on already generated ones would also help.
Thanks for this video. I have a doubt. Over time, urls will become invalid and be deleted, and as random urls get generated, scattered counter values will be left open for reuse. In this range-based system, how can we reuse scattered values that do not fit in a range? Say counter values 5, 100, 5000 get deleted. Is there a way to reuse them? Once we exhaust all the keys at some point in the future, how can we reuse the deleted/unused ones?
If I am not mistaken, log 6.5 trillion to base 2 is 42.5 which is approx 43. So 43 bits are taken for a unique value for each of the possible 6.5 trillion tinyurl
@Tushar I have a question: if multiple worker hosts can access the db, then even while executing the second technique, putIfAbsent(), can't we have a conflict with an operation from another worker thread coming in and executing a request? So is there any constraint that only one instruction can interact with the db at a time? The same can happen in the 3rd technique. A stable prod system might not deploy the third technique either.
In a sense using an RDBMS is very similar to the counter from a limitations standpoint -- they scale vertically not horizontally, so you'd have the same single point of failure issue.
I think your system design videos are top notch and helped me tremendously. I usually spend my working hours writing code and not worrying about the design videos, but thanks to you I was able to bridge the gap of missing knowledge
We can optimise the URL generation part in your service layer by delegating key generation to multiple instances of offline services (KeyGenerationService). Each service will generate keys in a particular range and give the pre-generated tiny urls to servers in need. Servers can also keep some unused urls in their local caches to speed up the process further. There will be some race conditions, which can be taken care of by designing the service properly to handle the interaction between different threads. All in all, it is a great video; kudos for the effort you put in to come up with such content!
It's a good idea to pre-generate the tiny urls. Thanks for bringing this up. Hopefully other viewers will read this.
Awesome approach! I think the race condition can be handled via some synchronization approach.
+Tushar Roy - Coding Made Simple thanks for the great video.
you can pin a comment if you want your viewers to see it.
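The pre-generation idea in this thread might look something like the following toy sketch. The class name, range sizes, and batch API are all illustrative; each service instance owns a disjoint counter range, and app servers pull small batches into a local cache.

```python
import itertools
import threading

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def to_base62(n, width=7):
    # Fixed-width base62 so every key is exactly `width` characters.
    chars = []
    for _ in range(width):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

class KeyGenerationService:
    """Owns a disjoint counter range and hands out pre-encoded keys."""

    def __init__(self, start, end):
        self._counter = itertools.count(start)
        self._end = end
        self._lock = threading.Lock()  # guards against the race mentioned above

    def take_batch(self, size):
        # App servers call this to refill their local caches.
        with self._lock:
            batch = []
            for _ in range(size):
                n = next(self._counter)
                if n >= self._end:
                    break
                batch.append(to_base62(n))
            return batch

kgs = KeyGenerationService(0, 1_000_000)  # this instance owns [0, 1000000)
local_cache = kgs.take_batch(3)
batch2 = kgs.take_batch(2)
```

Because ranges are disjoint across instances, no coordination is needed per key, only per range handout.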
The pro of using a range is because of its compact size to be stored and operated on. Pre-generated keys will take up a lot more space and increase a lot more roundtrip time between internal services while calculating a tinyurl from a integer is relatively cheap. So I think it's only worth it if the calculation is expensive, like blockchain addresses.
in this approach you should first check actual url already present, insert only if not.
I appreciate your efforts in making system design tutorial videos. My algorithms for tackling this question:
-6 characters long TinyURL:
a. Use MD5/SHA256 with Base64 encoding for a 6-letter key (you mentioned this too; I extended it a bit)
b. Use a Key Generation Service with 6-digit keys already pre-computed: a key DB of ~500 GB + replication to avoid a single point of failure + a lock system to avoid concurrency issues + Memcache to speed things up -> fetch a key from the DB and use an HTTP 302 redirect to the browser on success, else return 404 to the user.
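Scheme (a) above could be sketched roughly like this. The 43-bit/7-character sizing follows the video; collisions would still need a putIfAbsent-style check against the database.

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def tiny_key(long_url, bits=43, length=7):
    # Keep the top `bits` bits of the MD5 digest and base62-encode them.
    # Values above 62**7 wrap: leftover high bits are simply dropped.
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    n = int.from_bytes(digest, "big") >> (128 - bits)
    chars = []
    for _ in range(length):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

k = tiny_key("https://www.example.com/some/very/long/path?with=query")
```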
Tushar! I'm so grateful for what you're doing here on TH-cam! I've literally spent hours watching your videos on dynamic programming this last week to study for an algorithms module of mine. Even almost got a bit of your accent!
People (multiple people) spend days/weeks/months designing a scalable system. Then comes some hot shot interviewer with all the right answers he found earlier from his google search, and then asks you to design a tinyurl system and expects you to spit it out in 24 minutes? Also consider the fact that his company has a product that is being used by only a few users, that a Windows 98 PC could happily support. Interviewers really need to get off their fucking clouds and stop searching for unicorns.
lol
Vitamin B goddamn tech interviews. (begrudgingly goes back to studying some obscure algorithm that some brilliant mind took ages to figure out)
Couldn't agree more.
lol, exactly said. I was asked to design the same in 30 mins, that too over telephonic discussion. Great video, thanks to Tushar Roy.
I have exactly the same view. People/scientists invest days, months or even years to come to a solution, and some people expect you to answer in 30 or 60 mins.
Again, very good content. I really like that you talk fast and don't waste time on things that are easily found on the internet (definitions etc.). Very nice to watch.
Hi Tushar - Thanks for the awesome video! You asked how other people would design the url service, and here's my 2 cents: I would start the design with a stateless API service and a write-through in-memory cache, backed by a relational database. Collisions are detected at the db level (e.g. a unique primary key on the shortened url) and corrected by hashing the original url + a prefix. The regional issue you mentioned could potentially be mitigated with a combo of load balancer routing configuration and an in-memory cache at each service instance. In the case of a GET request burst, it is likely that the instances will have the URL cached.
I would imagine this starter design is good for a couple thousand URLs per sec. An AWS RDS instance with SSD can handle up to 10K IOPS according to their docs. The in-memory cache size could also be adjusted for read performance. To scale this design up for higher volume, I would probably try a distributed cache with write-behind before considering Zookeeper. In general, I like to stay stateless if possible :)
I created a simple tiny url implementation in case anyone is interested.
github.com/edithau/simple-tiny-url
+Edith Au thanks for your insights
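The db-level collision handling described in this thread (unique primary key, then prefix-and-rehash on conflict) might be sketched like this with SQLite. The table name and salting scheme are illustrative, not from the linked repo.

```python
import hashlib
import sqlite3

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def encode7(n):
    out = []
    for _ in range(7):
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def shorten(conn, long_url, max_tries=5):
    for attempt in range(max_tries):
        # On a collision, retry with a prefix added to the original url.
        salted = f"{attempt}:{long_url}" if attempt else long_url
        n = int.from_bytes(hashlib.md5(salted.encode()).digest()[:6], "big")
        short = encode7(n)
        try:
            # The PRIMARY KEY on `short` is what detects the collision.
            conn.execute("INSERT INTO urls (short, long) VALUES (?, ?)", (short, long_url))
            conn.commit()
            return short
        except sqlite3.IntegrityError:
            # Key taken: if it already maps to this long url, just reuse it.
            row = conn.execute("SELECT long FROM urls WHERE short = ?", (short,)).fetchone()
            if row and row[0] == long_url:
                return short
    raise RuntimeError("no free key found")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (short TEXT PRIMARY KEY, long TEXT NOT NULL)")
s1 = shorten(conn, "https://example.com/a")
s2 = shorten(conn, "https://example.com/a")  # same url comes back with the same key
```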
Hello Tushar, I would really like to thank you for your videos. Recently, I was able to crack an interview by watching your videos. Especially your videos on Dynamic Programming are a treat to the eyes. Thanks again.
+Rishabh Daim great. Happy the videos helped.
Thank you for the video! That's the most organized / structured content (about app layer) I've seen for that problem.
Dear Tushar, it is great to see that you are back in action. Thanks a lot for your efforts. I truly appreciate your hard work. Could you please share your knowledge on design problems. Thanks a lot.
Navdeep Dahiya Wonderful!
I'm trying. Started with tiny url, and more to come.
You are awesome... please keep it up with your design knowledge.
Please continue your good work on design problems. I am eagerly waiting for your upcoming videos on this.
@Tushar, thanks for the video. 12:15-13:55 is a very convoluted way of conveying that log2(62**7) bits are needed to encode 62**7 values. Also, that evaluates to 42, not 43. An oversight, I presume...
Appreciate such videos which give different/alternative ways towards solutions; one must go through various such views before finalizing on a solution. But seriously, only a king of fools can ask such questions and expect good logical answers in those 10-15 minutes. I myself have got such questions with the expectation to say those fancy words of design patterns and algorithms. Man, such interviewers themselves google the questions/answers for the interviews, and they themselves don't know how to solve the trivial problems in their own projects. There are many issues with interviewers' mentality and processes, but videos like these really help a person think differently and find the solutions. Appreciate Tushar for providing such informative videos.
I think it is supposed to be 42 bits, not 43. If you think about it, your language has 62 chars, and to uniquely identify each char you need 6 bits, because 2^6 is 64. Then you can map 000000 to a, 000001 to b and so on till 9. So in a 7-character string each character will need 6 bits, and 7*6 is 42. So you have to save 42 bits in your DB, and when you get them back from your DB, you use the same mapper to get back the characters. Hope that clears up the doubt around 42 vs 43 bits.
@Revanth Kumar Thanks for the clarification. I was thinking why it is an odd number (43).
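The arithmetic in this thread, spelled out:

```python
import math

bits_per_char = math.ceil(math.log2(62))  # 62 symbols need 6 bits each (2**6 = 64)
total_bits = 7 * bits_per_char            # a 7-char key carries at most 42 bits
keyspace = 62**7                          # about 3.5 trillion distinct keys
```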
I have built one too: rdt-li, an open-source free url shortener with built-in analytics. Check it out.
I really appreciate how in-depth your design analysis was! Thank you.
This guy and this video helped millions.
Ommggggg Tushar's first ever design video 😍😍😍
Around 9:14 Tushar describes a solution that first puts (short, long) to the database and then reads to verify nobody wrote after it. Doesn't the put corrupt the previous value that someone else left there (and wanted it as the value in the database)?
no, it is insert if not present.
@@gsb22 I don't think so.. "insert if not present" is the second option suggested by @tusharroy2525, not the third option.
Great articulation & simple enough for non CS folks to understand. Thanks Tushar!
11:17 so let's say you make your GET request on a newly generated md5 tiny url and the first result's long url doesn't match the long url you just entered (i.e. that tiny url already exists). How do you generate a new tiny url such that if the same link is entered again you'll be able to find the same one? Bit shift and repeat?
Thank you so much for the video, I am really excited and glad to have you back!
Keep on killing it!
Very nice explanation. Please share more design/ system design topics. Thanks a lot.
definitely
when can we see more of them?
This is a great video for people who are trying to learn system design. Not so great for people prepping for an interview
For people who are confused why do we need 43 bits:
2**42 = 4398046511104 = 4.3 * 10 ** 12 = 4.3 Trillion.
So, we only need 42 bits to represent 4.3 trillion numbers
But these numbers also include negative numbers, and we only need positive numbers (0 - 3.8 trillion). So we will use 43 bits and keep the MSB as 0 to make it a positive number.
However, Tushar says that 5 bits can be randomized; rather, he should mention it as 4 bits. Please correct me if I am wrong.
@@vipuljain9550 Yes, that should be 4 as he needs first 7 bits to represent number 0 - 63 (MSB is always 0 because number is positive)
I just want to say, you are so gorgeous.
Thank you. This saved my time
I don't think that this is correct. Just assume that the number is unsigned integer.
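The unsigned-integer point in this reply can be shown concretely: masking to 42 bits always yields a non-negative value, so no extra sign bit is needed. Python ints are arbitrary precision, so this is purely illustrative.

```python
MASK_42 = (1 << 42) - 1

def as_unsigned_42(n):
    # Keep only the low 42 bits, guaranteeing a value in [0, 2**42).
    return n & MASK_42
```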
Curious why you choose 43 bits, as you only actually need 42 bits to represent 3.5 trillion (max for 42 bits is 4.398 trillion). Is there a use for the extra bit I'm not seeing?
Yes, and since we have a 7-char tiny url and each char is 6 bits in base64, 42 bits is enough even if we use base64 encoding (a-z, A-Z, 0-9, -, _)?
@@rohanabhutkar he is using base62 because there are 62 unique chars.
2^42 = 4,398,046,511,104 = 4.3 * 10^12 = 4.3 trillion.
So we only need 42 bits to represent 4.3 trillion numbers.
But these numbers also include negative numbers, and we only need positive numbers (0 - 3.8 trillion). So we will use 43 bits and keep the MSB as 0 to make it a positive number.
@@renurani3311 or we can use unsigned data type and save 1 bit :D
The best I have seen on this is to pre-generate the short urls and keep them ready. Have a relational DB with failover (two or more DB servers) to hand out these keys (relational allows implementing locking). Hash-partition the read shards with consistent hashing. You have the full solution without any single point of failure or hot spots.
9:41 if there is already a tiny url existing for a long url and you do a put, it will override the existing tiny url.. does that make sense?
a) 9:00 why does the 1st option not work? Is it because 2 different long urls can generate the same tiny url?
b) Why does the 2nd option, putIfAbsent(), work? Is it due to some database locking where the whole table gets locked, and if 1 thread is writing, then the other 999 threads are blocked?
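For question (b): putIfAbsent works because the existence check and the write happen as one atomic step, not because the whole table is locked; a real database gets the same guarantee from a unique key constraint on the tiny url. A minimal in-memory sketch of the semantics (the lock here plays the role of the per-key atomicity):

```python
import threading

class TinyUrlStore:
    """In-memory sketch of putIfAbsent semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, tiny, long_url):
        # Check and write under one lock, so no second writer can
        # sneak in between them.
        with self._lock:
            if tiny in self._data:
                return False  # someone else already claimed this key
            self._data[tiny] = long_url
            return True

    def get(self, tiny):
        return self._data.get(tiny)

store = TinyUrlStore()
first = store.put_if_absent("abc1234", "https://example.com/a")
second = store.put_if_absent("abc1234", "https://example.com/b")  # rejected
```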
Thank you for this. I have been going through examples and lessons and this made it all click for me.
That was enlightening, Tushar. What do you think about probabilistic data structures to generate the tinyUrl? Like the MD5 hash you mentioned, I think we could use Bloom Filters to tackle the problem too.
Haha, yes I have become interested in bloom filters recently :)
This was actually one of the things I thought of as well. I think an MD5 hash might be too slow for an efficient bloom filter depending on the performance requirements, but maybe it can be combined with the CDN idea. I was reading about bloom filters on wiki and there was an example of how Akamai uses bloom filters in their webservers:
en.wikipedia.org/wiki/Bloom_filter#Examples
Gaurav Sen, thanks for bringing up this point. I agree with you that we can use a bloom filter while generating tinyURLs.
But there is a caveat. As far as I understand, a bloom filter can help you check whether the hashOfBigURL (tinyURL) already exists in the persistence layer or not. It will help us find the existence of the tinyURL faster instead of going to disk and finding it there. The remaining part, how you would generate the hashOfBigURL (tinyURL), is still not clear. Tushar has demonstrated those options for generating it. Please correct me if I made any invalid statement.
My my we have a design principle celebrity here
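A minimal bloom filter along the lines discussed in this thread; the sizes and the derive-k-hashes-from-MD5 trick are arbitrary demo choices. Note what it can and cannot answer: "definitely not stored" (skip the DB read) or "maybe stored" (fall through to the DB); false positives only, never false negatives.

```python
import hashlib

class BloomFilter:
    """Tiny demo bloom filter; parameters are arbitrary."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions by hashing the item with k different prefixes.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("aB3xY9k")
```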
I really learnt a lot from this system design, good job.
Does the #3 Counter technique need to be combined with the #2 MD5 technique to generate a tiny URL? Also, does it have to retry as at 10:12 if the tiny URL is a duplicate, even when using the #3 Counter?
@Tushar, thanks for the video. Please keep up the hard work.
I have a question based on a concept from this video at 7:40. If there is a concurrency race for multiple users buying an inventory item, we can use serializable isolation or optimistic/pessimistic locking. But what if two users read the inventory item with status unsold, and both issue 'update inventory set status = sold, userId = A/B where inventoryId = 1 (for example) and status = unsold', and only one succeeds? This way we avoid optimistic-locking rollbacks (although the intrinsic write lock on the row is still there for all updates); for the other user the update will not change anything, and they can try other items. What is this approach known as? Why don't we use it more often?
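The pattern in that question is usually called an atomic conditional update (a guarded or compare-and-set update; optimistic concurrency without a version column). A sketch with SQLite, where only the first of two racing buyers matches the WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, status TEXT, userId TEXT)")
conn.execute("INSERT INTO inventory VALUES (1, 'unsold', NULL)")
conn.commit()

def try_buy(conn, item_id, user):
    # The status check and the write are one atomic statement; only the
    # winner's UPDATE matches a row, so only the winner sees rowcount 1.
    cur = conn.execute(
        "UPDATE inventory SET status = 'sold', userId = ? "
        "WHERE id = ? AND status = 'unsold'",
        (user, item_id),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the winner

a_wins = try_buy(conn, 1, "A")
b_wins = try_buy(conn, 1, "B")  # loses: the row is no longer 'unsold'
```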
Wonderful, Tushar! Lot of details ! So much passion!
I'm becoming your fan !!
+Yamini L thanks
Thank you so much Tushar, your videos are always helpful! We really appreciate your work!!
I was so so looking forward to your amazing videos ! keep it coming please .
I will
Hi Tushar, awesome video, and thanks for the same. The only minor observation I have is that 3.5 trillion translates to 2^42 = 4,398,046,511,104, i.e. 4.3 trillion. So we should be using 42 bits instead of 43 bits from the 128-bit MD5 hash? 43 bits translates to 2^43 = 8,796,093,022,208, i.e. 8.7 trillion.
This is correct, I observed that as well: log2(62^7) ≈ 42.
Great to see design interviews. Welcome back :)
16:36, can we reduce collisions by including milliseconds in the timestamp?
But that would increase the bits needed for the timestamp, from 32 to 48.
Thanks Tushar your explanation was really simple and it did help me a lot to understand the topic well.
Instead of base 62 we can use base 64. Just add 2 more characters: underscore and dash (_-) on top of the 62 characters we have. This makes it easier to convert an MD5 output to a 7-character printable string, since you can convert blocks of 6 bits directly to a base64 character. Even this YouTube link uses an underscore to represent the URL of this video!
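A minimal sketch of that idea in Python, using the standard library's URL-safe base64 alphabet (which is exactly the 62 alphanumerics plus '-' and '_'); truncating to 7 characters is an assumption matching the key length used in the video:

```python
import base64
import hashlib

def tiny_url_key(long_url: str) -> str:
    """MD5-hash the long URL, then keep the first seven URL-safe
    base64 characters (7 chars x 6 bits = the first 42 bits)."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    # urlsafe_b64encode uses A-Z, a-z, 0-9, '-' and '_': 64 symbols,
    # so each output character encodes exactly 6 bits of the digest.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:7]

key = tiny_url_key("https://example.com/some/very/long/path")
print(key, len(key))  # always a 7-character key
```

Because the alphabet size is a power of two, no big-integer base conversion is needed at all.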
I haven't seen anyone explain how to implement the base62 key properly. If you pass 1 to a typical base62 function, you get back a single character rather than a full-length key.
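One way to handle that: left-pad the base-62 output to a fixed width, so small inputs like 1 still yield a full-length key. This is a sketch; the 7-character width is just the length used in the video:

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int, width: int = 7) -> str:
    """Encode a non-negative integer in base 62, left-padded with '0'
    so every key comes out exactly `width` characters long."""
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(BASE62[rem])
    key = "".join(reversed(chars))
    return key.rjust(width, BASE62[0])

print(encode_base62(1))           # '0000001', not just '1'
print(encode_base62(62 ** 7 - 1)) # 'ZZZZZZZ', the largest 7-char key
```

Padding with the zero digit keeps the mapping reversible, since leading zero digits do not change the decoded value.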
Very concise and clear video.
Easy to follow.
Won't the counter/all-host/range-based approach generate duplicate tiny URLs? With these approaches, how do I check whether a long URL already exists? Won't a DB table [long -> tiny] also be required to avoid duplicate URL generation?
To support ranges we created a service that just gives your process a range (x to x+100000) and increments a counter in the DB. This guarantees that every key is completely unique, and the service is not heavily loaded.
+Aleksey Telyshev fair enough. That should work
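A rough sketch of the range-allocation service described above, with SQLite standing in for the shared DB (the table name and range size are invented for illustration):

```python
import sqlite3

RANGE_SIZE = 100_000

def claim_range(db: sqlite3.Connection):
    """Atomically claim the next block of counter values for one worker.
    The single-row `allocator` table is the only shared state; doing the
    bump and the read in one transaction keeps concurrent claims from
    overlapping (the DB's write lock serialises them)."""
    with db:  # one transaction: bump the counter, then read it back
        db.execute("UPDATE allocator SET next_start = next_start + ?",
                   (RANGE_SIZE,))
        (end,) = db.execute("SELECT next_start FROM allocator").fetchone()
    return end - RANGE_SIZE, end  # worker hands out keys in [start, end)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE allocator (next_start INTEGER)")
db.execute("INSERT INTO allocator VALUES (0)")
print(claim_range(db))  # (0, 100000)
print(claim_range(db))  # (100000, 200000)
```

Each worker then generates keys locally from its range with no further coordination until the range runs out.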
Hi, In the MD5 hash approach:
How can we be sure that, for a 7-character short URL, we need to take 43 bits out of the 128 bits of the MD5 hash of the long URL, convert that to decimal, and then to base 62? And once converted to base 62, that it will give a 7-character short URL? What is the mathematics behind taking 43 out of 128 bits? My doubt is: why exactly 43?
Thanks.
Hi @Anant Saksena, here is the mathematical formula: log2(62^7) ≈ 42. In other words, a 7-character base-62 string can represent 62^7 distinct URLs, which we assumed is enough for the number of URLs to be generated. Since MD5 produces a binary value, the equivalent capacity is 2^42, so approximately 42 bits are required for this calculation.
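The arithmetic can be checked directly:

```python
import math

# Seven base-62 characters give 62**7 distinct keys.
keys = 62 ** 7
print(keys)             # 3521614606208, i.e. ~3.5 trillion
print(math.log2(keys))  # ~41.68, so 42 bits suffice
print(2 ** 42)          # 4398046511104, i.e. ~4.4 trillion >= 62**7
```

So 42 bits already cover every 7-character key; 43 bits (8.8 trillion values) would overshoot the base-62 key space.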
Hurray ! The legend is back!! more design questions please. 😍😍
Impressive! It is also super helpful to practice with FAANG engineers at Meetapro through mock interviews
I don't get the "put tiny URL into DB" part. What's the difference between solution 1 and solution 2? Both check whether a tiny URL exists in the DB and, if not, put the pair into it.
If a server crashes or restarts, does it get back the same range or new range? If it gets a new range every time, the ranges could get exhausted after a point. How do we recover the expired URLs with this design? What happens after exhaustion?
I am curious why not generate the hashes offline and store them somewhere and that way the API to create a short URL can just "grab" a hash key from this store?
I guess this assumes that the data store has an atomic operation: "select one row and delete it". Is that feasible to implement?
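It is feasible: wrapping the select and the delete in one transaction stops two workers from grabbing the same row. A toy sketch with SQLite (the table and key values are invented); in PostgreSQL the same thing can be expressed in a single `DELETE ... RETURNING` statement:

```python
import sqlite3

def grab_key(db: sqlite3.Connection):
    """Atomically take one pre-generated key out of the pool.
    Select and delete run in a single transaction, so concurrent
    grabbers are serialised by the DB's write lock."""
    with db:
        row = db.execute("SELECT key FROM key_pool LIMIT 1").fetchone()
        if row is None:
            return None  # pool exhausted
        db.execute("DELETE FROM key_pool WHERE key = ?", (row[0],))
    return row[0]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE key_pool (key TEXT PRIMARY KEY)")
db.executemany("INSERT INTO key_pool VALUES (?)",
               [("aB3xK9q",), ("Zz01Abc",)])
print(grab_key(db))  # one of the two pooled keys
print(grab_key(db))  # the other key
print(grab_key(db))  # None: the pool is empty
```

Under high contention, Postgres's `SELECT ... FOR UPDATE SKIP LOCKED` is the usual way to let many workers drain the pool without blocking each other.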
Very well explained, Tushar. Liked your presentation and the length of the video.
But the range-based approach discussed at 22:19 will generate multiple short URLs for the same long URL on multiple create requests.
Brilliant video! Well explained and perfectly paced.
Tushar, I thoroughly enjoy your design analysis. One suggestion though, it would be beneficial for the audience if, in the description, you directly linked any reference to the dependencies - concepts, articles etc (example, Apache Zookeeper) that you talk about in the video. Thanks and keep being awesome.
+Sudipta Karmakar I will add that
The best system design video ever!
+erol yeniaras thanks
why 43 bits? I did not get it..
Yes, exactly. I was about to post the same question. I should be 42 bits I think.
Thanks for creating such amazing content!
I love your videos; they're so elegant, clean and beautiful
For the first 43-bit MD5 solution, what do you do if you encounter a collision between two different long URLs, i.e., two totally different long URLs that have the same first 43 bits of their MD5 sum?
Perhaps we could pluck some parts of the counter solution, e.g., take 30 bits of the MD5 + the latter 16 bits of the timestamp (since the first 16 change much more slowly) + 7 bits generated randomly or by a counter?
Hi Tushar, so good to see you back. Please upload more videos on system design.
+Akshay Suman doing it
I have a question: why are we taking 43 bits from the MD5?
Because 42 bits suffice for generating a 7-character tiny URL.
great guy is back! more system design video please.
How does the server ask for a new range from the zookeeper? what zookeeper API will it use?
19:19 , when a worker is given a new range to work with, would it have to generate a number within that range (either randomly or incrementally), get an MD5 hash of that number, and then pick the first 43 bits of that hash?
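A sketch of that step, assuming the worker hashes a counter value from its range and keeps the top 43 bits. The video leaves this detail open, so treat this as one possible interpretation rather than the definitive pipeline:

```python
import hashlib

def key_bits_from_counter(counter: int, bits: int = 43) -> int:
    """MD5-hash the counter value drawn from the worker's range and
    keep the first (most significant) `bits` bits of the 128-bit digest."""
    digest = hashlib.md5(str(counter).encode("utf-8")).digest()
    as_int = int.from_bytes(digest, "big")  # digest as a 128-bit integer
    return as_int >> (128 - bits)           # top `bits` bits

# A worker holding range [100000, 200000) would feed those counters in:
print(key_bits_from_counter(100000))
```

The resulting integer would then be base-62 encoded into the 7-character key; uniqueness still has to be checked, since truncating the hash reintroduces collisions that the raw counter did not have.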
Yes I think so.
I am not getting the 3rd "check with DB" approach. How is it not inconsistent? If one puts the tiny URL and long URL into the DB, then a get will always return the latest stored long URL for that tiny URL.
Awesome video, Tushar. I don't understand how Approach 3 would work when you put(tinyURL, longURL), get(tinyURL) and check whether the result equals the longURL we have. If there was a duplicate in the DB, won't it get overwritten by the put(tinyURL, longURL), so the get fetches the new longURL value?
Indeed, I have the same understanding. A REST PUT will update if the key already exists, no? So not only will Approach 3 never end up in the "fail and try again" scenario, it will even cause an issue: if a client tries to GET the key, they will get the updated longURL, which is not what they wanted.
Great to see you back, bro. Awesome!
How do we handle the case when the base62 encoding of a counter is < 7 characters?
I like the video a lot; hoping to see more system design videos in the future.
Instead of covering the full flow, starting from what to ask the interviewer, through the high-level design, and then scaling that solution and so on, the presenter goes straight into the details of how to generate the tiny URL. I would recommend looking at this for a good example of how to design TinyURL:
www.hiredintech.com/classrooms/system-design/lesson/52
thanks
Too good. I think even the interviewer will have to refer to this. Hope to see more. :)
The counter-based approach is not optimal. There are multiple challenges, like maintaining locking in a worker thread. Another thing: these days a worker node may be deployed multiple times a day, and every time the worker will lose its counter and have to ask Zookeeper for a new range.
Amazing!! You are like my guru, dude. Learnt so much from you.
Hello Tushar, Thank you for the great explanation and making it very simple to understand the problem. Waiting for more designing problems. :)
Great work.. please make more design videos
Hi Tushar, I have a question about your 3rd technique. If you put the newly generated value in the DB and then, in a 2nd step, get the value and verify it against your long URL, won't it always match? The put would overwrite the value at the same key (losing the old value), so the result you read back is always the one you just wrote.
I have the same concern. I think the third approach needs some strategy to prevent this situation; otherwise, the old value will be overwritten.
@Tushar With the Zookeeper bucket-counter approach, how can we check whether a short URL has already been generated for a given long URL?
We can use 6 bits to express 62 characters since 2^6 > 62. So for a 7-character short URL, 7 * 6 = 42 bits are enough.
Thanks for coming back sir
Can you please share how Zookeeper stores these ranges? Which API can a worker node use to access them?
Your videos are super educative thanks
Sir, with 42 bits we can generate about 4.4 * 10^12 combinations, and with seven base-62 characters about 3.5 * 10^12. So why did we take 43 bits and not 42?
What happens if in the actual interview you say, "This is 62^7, and this is about 3.5 trillion"
Lol. I dunno. Interviewer will think you are well prepared or great mathematician
You should say it is 2^(7 * log2(62)).
Why only 43 bits? Based on which logic?
I remember a timestamp is 64 bits (8 bytes). "The internal representation of a timestamp is a string of 7 - 13 bytes. Each byte consists of 2 packed decimal digits. The first 4 bytes represent the date, the next 3 bytes the time, and the last 0 - 6 bytes the fractions of a second."
Due to the Year 2038 problem, a 32-bit integer can represent Unix time, but only up to 2,147,483,647 (2^31 - 1) seconds from the epoch (00:00:00, 1 January 1970), i.e., until 19 January 2038, 03:14:08 GMT.
@@saam6348 I would be more than 70 years old by then LOL
In the last range-based approach where we involve Zookeeper, how are we generating the random short URL? Are we taking the integer and doing a base-62 conversion? If we do that for smaller integers like, say, 100, we might not get a short URL key that is 6 characters long. Kindly clarify.
Would it be correct to pass all randomly generated URLs through a bloom filter, so we can check for the existence of a tiny URL before using it instead of querying the database? We would consider the tiny URL valid only if the bloom filter says it is not in the database; otherwise, generate a new one and put it into the database. The length of the tiny URL could be increased to improve the chance of randomly hitting an unused key, and recovering tiny URLs by putting expiration times on already generated ones would also help.
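A toy Bloom filter illustrating the pre-check described above (the bit-array size and hash count are arbitrary; a real deployment would use a tuned library implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: `might_contain` returning False means the
    key is definitely absent; True may be a false positive, so a DB
    check is still needed before rejecting a candidate key."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("aB3xK9q")
print(bf.might_contain("aB3xK9q"))  # True
print(bf.might_contain("Zz01Abc"))  # almost certainly False
```

The asymmetry is the point: a "definitely absent" answer lets the service accept a fresh key without touching the database at all.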
Thanks for this video. I have a doubt: over time, URLs become invalid and are deleted, so scattered counter values (e.g. 5, 100, 5000) are freed up for reuse. In this range-based system, how can we reuse scattered values that don't fit into a contiguous range? Once we exhaust all the keys at some point in the future, how can we reuse the deleted/unused ones?
Hi, if we generate an MD5 from a unique number and then pick the first 42 bits, how can we guarantee that there won't be a collision?
Can anyone please help me understand how we arrived at the number 43 as the bits to be picked from the MD5?
If I am not mistaken, log base 2 of 6.5 trillion is about 42.5, which is approximately 43.
So 43 bits give a unique value for each of the possible 6.5 trillion tiny URLs.
It was a very nice explanation! :) I listened with pleasure.
Thanks
Why not just use the auto-incremented id of the URL row in the database to save these records, and decode it at runtime?
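That approach is straightforward to sketch: base-62-encode the row id when creating the short URL, and decode it back to the id on redirect. (Its usual trade-offs are that keys are sequential and guessable, and the auto-increment column is a single point of contention; this is just the mechanics.)

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(row_id: int) -> str:
    """Turn an auto-incremented DB id into a short key."""
    if row_id == 0:
        return BASE62[0]
    out = []
    while row_id:
        row_id, rem = divmod(row_id, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))

def decode(key: str) -> int:
    """Recover the DB id from the short key at redirect time."""
    n = 0
    for ch in key:
        n = n * 62 + BASE62.index(ch)
    return n

print(encode(125))          # '21'
print(decode(encode(125)))  # 125
```

Because the mapping is a pure bijection, the redirect path needs no lookup table for the key itself, only the row fetch by decoded id.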
At 9:00, if the entry already exists, the put will update it... then how is it consistent?
@Tushar Roy - Thanks for such helpful videos. Please keep doing more...
Amazing video and explanation.. u got one more subscriber
Great video. Can you talk about the retrieval part? Specifically: given a long URL, how do we check whether we already have a short URL for it, and return that instead?
@Tushar I have a question. If multiple worker hosts can access the DB, then even while executing the second technique's putIfAbsent(), can't we conflict with an operation from another worker thread executing a request? Is there a constraint that only one instruction can interact with the DB at a time? The same can happen in the 3rd technique; a stable production system might not deploy the third technique either.
In a sense using an RDBMS is very similar to the counter from a limitations standpoint -- they scale vertically not horizontally, so you'd have the same single point of failure issue.
In the MD5 technique, what action should be taken when two URLs yield the same 43 bits?