I think this attracts a particular kind of readers. Those looking for more in-depth and first- principles understanding vs memoization. I think you also have a good balance of real use case driven thinking and hand waving where necessary. The latter is crucial and people ( including me at times) tend to overlook the importance of hand waving. Not everything deserves the same level of detail. But knowing which does is the game changer! Anyways, I will be subscribing and reviewing a good portion of this 😊
I've gone through a lot of videos on this subject and this is by far the best. Not just regurgitating what they've read elsewhere, but actually coming from a place of proper understanding.
I appreciate the new 2.0 video style and how in depth it is. I found the older videos too brief considering the complexity of the systems they covered.
You are so underrated brother... this is the best content on the internet regarding System Design interviews 100%... who cares about an interviewer saying 'yes' to every question that the interviewee asks, or an interviewee coming up with the same 10 boxes every single System Design question when you explain stuff like this... keep it up, you are saving lifes!!
You are very good, the best even, at driving the solution of a design problem by applying logical reasons rather than throwing something at my face i wouldn't understand. Thanks for existing!
Thank you very much, I’ve been reading DIDA and Understanding Distributed Systems so it’s great to see how those concepts apply to the use cases you are showing. This is the first video of yours that I’ve seen yet, planning to watch all the series when I can because you highlight the questions that we should be asking when designing and assert the tradeoffs of different solutions, it’s amazing. I don’t know if you have done for the other videos but if I could ask for anything would be for you to change the pencil color when reviewing the flows of data, it makes it easier to find where you are pointing to in the drawing board. Again, thank you very much, these videos are really helpful.
I really like the format for these types of videos of starting with a naive solution and then iterating on each component as we discover performance problems. That way the initial db schema and the overall graph of how requests / responses implement the features can be clearly presented to the viewer + the viewer gets a conceptual framework to start evaluating the ways in which the simple picture needed to become more sophisticated.
At 12:34 -> You do not have to do any predicate query. Relational databases use B-trees and are not append-only like databases that use LSM trees. So an "INSERT" will fail if another INSERT beat it to the race because the 2nd insert will notice the primary key in the B-tree. This sort of "INSERT and fail if primary key exists" is very hard to accomplish in LSM based DBs as they are append-only. In fact to accomplish this in an LSM based database, you usually have to do a special slow instruction that involves multiple phases and uses something like Paxos or raft or 2PC. So with a database like postgres, one solution to this TinyURL is to actually just have different application servers INSERT new rows. If there was a race on the exact same primary key, the 2nd application server will get a "duplicate key exists" error and will know to retry or send the error to the client.
Yeah - I probably shouldn't have included this section in the video since it's just thrown people off. I meant to say that predicate locking was one way of accomplishing this, but like you mentioned in SQL databases they're gonna get this done for you on the primary key. Interesting stuff on the LSM tidbit, that's new to me! I imagine if the key was still in the memtable perhaps we wouldn't need too much additional complexity, as we'd see it was there, but had it been flushed this could get complex. Thanks for the insight, and I appreciate you pointing out these flaws!
Hey, first of all thank you for the videos. They've been really helpful for me prepping for an upcoming interview! I watched a separate video on tinyURL/pastebin link generation, and the main point of the video was that instead of hashing, which can lead collisions when we truncate the MD5/whatever hash, like you mentioned in the video. Their suggestion was that we could have a cluster of key generations/ID generators, and each one would have a range of URLs to distribute when a write request is made. They could be coordinated by ZooKeeper and would allocate only from their own chunks, and would refresh their URLs as needed. The only concern is that we lose that range of URLs (say 1 million) if a specific generator goes down, but since we have 2 trillion combinations if we use base 36 with 8 characters and even more if we use base 62, this isn't a major concern. This would also speed up writes in the case there is a collision. What are your thoughts on this?
Hey yeah, I touch upon that in this video. That's basically the key generation service where you partition the keys and assign in sequence. That works too! There may some lock contention within a partition but if you have enough partitions such that it doesn't matter that's fine too!
Thank you for all the system design content! I’ve been binging through the 2.0 playlist and this video does a great job putting all the pieces together. I don’t have any experience with system design interviews so I have two questions in particular: 1. If in an interview, someone were able to deliver a design with the same level of depth + complexity that you’ve described here, what level would they be placed at? Alternatively, what sort of depth would interviewers be looking for in say an L4 interview? 2. Many of the system design interview guides recommend to lay out a basic architecture first and then go into a deep dive at the end but in your videos, you weave both parts together. How would you recommend to split these parts up? For example, is database schema + choosing a type of database part of the deep dive or should it be part of the basic architecture? And how about for replication/partitioning?
1) At the end of the day, you're interviewing for a certain level, so you'd be placed at that level. I don't really think uplevels are ever going to happen, but failing a systems design interview might lead to a downlevel. I like to think that these videos are sufficient to pass just about any systems design interview, considering the results of my viewers, but I can't be sure, as I'm not an L5 and haven't had the test to try that theory out haha 2) Yeah, and I used to do my old videos this way. However what I've come to realize is that one's choice of database schema is intimately tied to their database architecture choice. So if I just list a schema outright, but don't go into what architecture I'm using, it might not make sense. In an actual interview, it may make sense to list a schema first, and then say you'll explain it later, but my main objective of these videos is to make them as clear as possible - so I hope that my choice to do this makes sense.
Good content technically speaking, but a bit removed from an actual System Design interview. I found HelloInterview to be more geared towards passing the actual interviews (I swear I'm not shilling, just giving you feedback because I genuinely love your channel for the technical aspect).
I'm hearing this a lot these days. I could see it depending on the interviewer/level, but I'm pretty content to overprepare for these since I find it actually helps me at work too! And no worries, you can have preferences :)
Awesome video! Couple thoughts: - For storage, it may be useful to talk about content addressable storage. Many pastes may be the same and that will save space - I’m not sure what the CDN buys us here. Reading from s3 isn’t crazy slow but the big issue is that the object space is huge and my hunch is that most of the time, you’d get CDN misses in most regions. That is assuming you’re not storing every paste to your CDN, which feels expensive. This is more a not that a cdn feels like a premature optimization - I really like your count approach. Another couple be using a counter CRDT and increment server side while serving the link. That being said, yours is more general purpose and scales for other types of analyitics
Thanks! Good call on the first point, you could take some sort of hash to deduplicate. For the CDN point, I mostly have it there just to talk about CDNs. In practice, caches seem to be one of those things that you just introduce as you need, so we'd have to see how our existing latencies were and then experiment here.
Hey Jordan, awesome video as usual! I really appreciate the way you break down tradeoffs. I checked out the TinyURL/Pastebin design from grokking's course, but it felt a bit sloppy to me, your version is so much better :) thank you!
I'd love it if you could make content regarding what level of depth is expected from a Mid level SD interview vs Senior vs Staff! It would be cool to take the same problem and do a mock at each level so we can understand the differences! Would love to see a mock series between you and Gaurav tackling this. I've interviewed with companies where they haven't picked a level for me yet and they say we will determine it based on your performance. But I don't know what the expectation is exactly in the context of a system design interview.
To be honest I just don't have enough context here as I haven't conducted interviews for a company, I do really wish I could say otherwise but at the moment my objective is to just go as in depth as possible.
Once again, learned a lot and read and write scaling techniques were helpful in my system design interview. Another write scaling technique you can talk about is Log structured merge trees for o(1) write and O(logn) reads.
@@jordanhasnolife5163 you are correct they are eventually not o(1) but if they are async writes, for your request sake, it's o(1) and at some point we add the entire log in sorted manner. Again, I am not an expert here so I could be completely wrong, lol. Please let me know if I am. Will do more research for complete understanding.
Hi Jordan. You were talking about 16 TB of potential data, partitioning, single leader replication. Sounds like Mongo fits better than relational solution. Thank you for great content!
Wonder why key-value DB wouldn’t be the db choice. It doesn’t require data relation and we can quickly check if tiny url has been used and also supports fast retrieval
Yeah I tried to say that this would be the best choice, but just doubt that storing all data in memory is feasible from a cost perspective, so wouldn't be surprised if your interviewer pushed you harder here.
I was wondering the same thing. Its weird how advice on costs usually comes up in the "what to say in the beginning" interview advice, but isn't mentioned more in technology tradeoffs. I also would've thought that non relational dbs are, for the most part, cheaper than relational ones, but I guess that's only until the point when they stop being caches and start being proper dbs. Am I right or missing something?
@@adi96adi Yeah not exactly sure regarding the cost of non relational vs. relational dbs but I can say with certainty that storing in memory is more expensive and for something like tinyURL, likely to be overkill.
Thanks for the detailed breakdown for storing url click counts. A couple of questions 1. In a multi region/multi-DC world, assuming we use a single leader writes, there could be atleast two Spark streaming instances writing to DB for the same Key (In a 2 region world). To ensure correctness, do we need to store 2 Idempotent keys here per row? Alternately, I think Kafka mirroring can solve this by replicating clicks from multiple regions to be unified in one region so we can get away with a single idempotent key 2. If we extend this problem to like counts, where we store liked_by users in a DB and rely on the kafka + stream_processing for counts, What happens if Kafka is down? For correctness, do we want to fail the liked_by write to DB or is there a better way to recover counts without impacting like actions availability?
1) I'd say so, assuming they're going to the same database row 2) You could just use change data capture to put a write in kafka when a new liked_by user is written. Ideally, kafka shouldn't be going down. This is why we replicate things.
@@jordanhasnolife5163 I’ve seen cases where company internal/platform level issues can cause temporary Kafka outages. In those cases, I’m thinking if we can write the updates to a log file and when Kafka is back up, process log files to enqueue Kafka? My motivation for the question is to ensure eventual correctness in counts if a major component such as Kafka fails
@@lv0320 Fair enough, I think a log file that eventually sinks to CDC when kafka is back sounds best. Perhaps look into debezium and see if that's something they already support
11:21 hi here you mention about the race condition when 2 users would add/create for a same actual url, but I’m confused as to why would this happen if the hash functions accounts for the userID already when generating the tinyURL? Wouldn’t that make the tinyurls unique for 2 users even for same links?
Hi Jordan, this is the best in-depth design video about tinyurl in any book or on any website. Thanks a lot! I do have two questions: 1. I am confused by the discussion around the concurrent writes part. In the video you first ruled out the multi-leader and leaderless replication designs. Then you discussed how to handle the concurrent writes to a single row. But if we are choosing single-leader replication, would this still happen? Shouldn't the leader handle the concurrent writes with locking or queuing, and determine the "one with later timestamp" to be unsuccessful (any maybe then assign a different unused shortURL to it)? 2. In many other sources, there is the other way to assign the tinyURL, which is through distributing sequentially increasing numbers in base-36/62 via a ticketing server. The ticketing server can scale itself by preallocating different ID ranges. I am curious what kind of partitioning could this design have to handle large read and write traffic? Like if we use hash range partition, clearly all the writes will concentrate on one partition which is BAD. Of course we could do another hash on the shortURL and partition on that to make it evenly distributed, but hashing itself is not free, and adds to complexity. What's your point of view on that? Thanks in advance!
1) Yes, which is why we use single leader replication with locking :) 2) Yeah, that's the key generation approach that we describe. Each database covers a range of keys, and hence we should get linear increases in speed as we add more nodes. I don't see why all writes will concentrate on one partition if we use a hash range, you can just randomly route each incoming request to one of the partitions and the partition will assign it a UUID.
One comment about the writer - if the writer is an API instance serving the requests to the user, and we're writing to the CDN/S3/DB, this could potentially take a long time and you'd likely want some kind of async mechanism to do these writes with a way of updating the user on its progress.
Agreed. That's potentially more of a client side change though, pretty sure that when uploading files to these services they expose the percentage completion
Great video, especially the discussion on single leader vs multi leader databases here. A couple notes: I don’t think write through for the CDN is a good idea. If you wrote a 10 GB file there, it would be at the expense of other files that may be getting accessed frequently. We shouldn’t give priority to a large unknown file over something that has been accessed and we won’t know will be popular. This is assuming that CDNs have a bounded size. Maybe there is a fancy one out there that isn’t bounded. I think you could make an argument that you won’t have records that are accessed for a long time after they are created.
Hi Jordan, great video! I'm not sure that we have to have uderId in the table, because the same long URL can be used by different users, and the eventual short URL (hash) will be the same.
While I agree with you, you want to be able to attribute clicks to the links of different users, so it may be useful to have different short urls for the same long url
great video, man. a lot of system design videos are super generic with just "here's some lego blocks that I know off the top of my head" and don't start from a bottom-up approach with the system architecture designed to support a specific algorithm, not just a top-level cookie-cut use case.
I'm not concerned about stale data, I'm concerned with write conflicts, which would occur when two users both claim the same URL. Single leader replication doesn't have this issue.
Awesome Content. I wasn't expecting this much detail in url shortener. Two questions though: 1. How does partition to kafka will be done? Ideally, servers need to know which kafka patition the message should be sent to and also, some coordination service would be required to maintain that? 2. For writing large files for Pastebin. CDN is a cache and cache will be have eviction policy, so if any url will be hit with gigantic file, which is not used for days and someone hit it after lets say weeks. Considering, we use LRU, it might get evicted and CDN anyways would need to load it again. Is the only reason to put it in CDN first is to make sure popular urls will be in CDN itself and when even first few users hit it, they all will get it pretty quickly?
1) Yeah kafka employs zookeeper internally. You can just tell it a "partition id" you want to read and it figures out the rest IIRC. 2) Yep basically! If it hasn't been hit in a while, no point of keeping it in the CDN anyways.
For storage engine, the reasoning to discard hash indexes is flawed. We are NOT storing 1PB in hash index, we only need to store keys (1T * (8 bytes for hash key + 32 bytes for pointer to value on DB) = 40TB). 40TB still is lot, hence we can discard it.
Why do you need predicate locks for solving inserts in MySQL DB ? Why can't we just use primary key constraints / unique key constraints given by DB to achieve uniqueness ?
To be clear, we totally can, I just would imagine that under the hood that has to use some sort of predicate locking so that we don't create two rows with the same primary key
Can you explain a bit more why an on-disk hash index would not work for this problem? I understand that on-disk hash indexes will suffer from poor data locality and disk I/O optimization, but in this case we don't care about range queries, so locality is not a concern. Also, it seems to me that we can cache disk pages in memory for a hash index just as we would for a B+tree or similar and achieve similar a similar probability for cache hits, as co-located shortURLs on disk have no increased probability of access. At the very least, it seems like we could use a hybrid solution with a B+tree to index our on-disk hash table and achieve constant time value access once the hash table range containing our target key is paged into memory. PS love your channel dude really interesting stuff 🫡
Thanks for this video. One of the best I have seen in this topic. Question for you: You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. How do we know in advance which bucket we will be generating the key for?
Superb video, quick questions Are you storing all possible tinyURL combinations beforehand in DB OR after hashing longURL + timestamp + userID tinyURL is inserted into respective partition? And why the click counter is triggered from the reader's side? why not when we read the DB?
I discuss the pros and cons of doing both. If you're using a SQL table, you can probably store them ad hoc. I don't know what you mean "triggered from the reader's side". It is incremented when we read the database. However, in order to avoid lock contention, we perform the counting asynchronously.
Great video! However, can you elaborate the following things related to the short URL generation? 1. Most of the common hash function, e.g. MD5, SHA-2, generate the string way longer than you need. How do you use it while still keep the hash collision low? 2. Can you elaborate how to detect the hash collision? You mention using the linear probing so it kinda imply you want to do it in the memory, so my guess is the service probably need to first query the DB to check if this hash/key already exist. Won't that be a performance concern? 3. I've read a few other designs of generating the unique short URL, and some of them raise the issues of using the hashing of the original URL (with optionally including other data like timestamp). They usually suggest to use a separate unique ID generation service/component, and perform proper encoding like Base 62 and use it as the short URL(besides the domain name part). Have you ever considered this and what's your thought of this? Btw, unique ID generation strategy in the distributed system may be an interesting topic to discuss as well if you haven't done it yet 🙂
I think you should just be able to truncate, if we found that this was leading to too many collisions we could always use 10 characters instead of 8 for example 2) oh on the contrary I'm basically proposing we use application logic that says if key x exists take it otherwise try to write key x+1 in the db. This could be done a bit faster with something like a stored procedure.
@@jordanhasnolife5163 Grokking says the key gen is great at solving hash collisions since everything is created beforehand on an offline process. It will create two tables in your DB (used & unused). The application can periodically grab a bunch of unused keys from the key gen service (KGS). Once used, app will update KGS to move entry into used. That's their reasoning for it anyways. Yours makes sense as well to just populate db with everything from get go since its only a few TBs which is nothing.
@@jordanhasnolife5163generating a unique id in a sequence, then create a base 36-64 string as a shorten URL allow you to not having to deal with a collision ever. Running number is guaranteed to be unique unlike hash. However, it run the risk of being guessable and could be ddos, by making 2 shorten URLs that point to each other.
@@jordanhasnolife5163 so each url-assign-node will be assigned a range for example (0000 - 5555). This way no 2 nodes will ever try to insert the same short url. Each node will just increment its local counter when ever it's creating a new short url. This way theres no conflict resolution to be solved.
Great video. I wonder if it would be considering overengineering to go for a keyed time windowed stream in flink, sink each trigger to the db idempotently (semi-event sourcing style) and trigger a reduce on the db once in x updates/once in a while (something like what rocksdb does). You know what, I think I just answered my own question writing this comment 😅
Is it really worth the trade off to use a relational db here due to the fear of write conflicts? We are giving up on a ton of write throughput by making this choice aren't we? What if we instead not take the input of the tinyURL from the user, instead just take his longUrl, have our cassandra already prepolulated with tinyURLs and then round robin requests among our leaderless architecture?
I think it's moreso about how you justify it. To clarify, I said to use single leader replication, not necessarily a relational db. That being said, if you round robin requests around your leaderless architecture and two people have the same shortURL that you propose they could very well still take the same shortURL on different replicas which will lead to a write conflict.
@@ninad11081991 it is very hard to shard sql dbs but much easier using mongodb/cassandra. For mongo db can use the same strategy as a sql DB shown in the video. But for cassandra you need a separate KeyGen service since write conflicts can happen in cassandra. Do you think they failed you because they were looking for a cassandra solution?
Great Stuff!! So, to maintain the unique constraint using predicate locks, serializable transactions will have to be used with first a select and then insert, right?
@@jordanhasnolife5163 If there is already a unique constraint on the database for the short_url, I am confused as to why you would need to do anything in this case. The database will return a UNIQUE_CONSTRAINT failure to whoever tries to INSERT after a short_url was already inserted, and that client can retry with an increased number. Am I missing something?
You could say something along the lines of "for key x only write to replica 1 of partition 3". But then you've basically just re-created single leader replication.
Great Video, Have a question 29:51 , it is exactly 1 idempotency key per row and not few if we partition the kafka topic by short URL. can you please clarify.
Sup Jordan! After 2:35 I went ahead and made a design too and I'm looking forward to hear an opinion about my thoughts, feel free to confront and bully me based on my decision. You have hard constrained the choices around no write conflicts (with the example of an identical hash for two different urls by two users made simultaneously). I thought, that situation could be extremally rare in real life (and even if - we could add extra complexity on how we generate tinyurl to assure its uniqueness, with a salt maybe, or a distributed counter, but tbh, only after such event would happen considerable number of times). Thus, I wouldn't want to discard options based on that edge case. That enables the Cassandra option, which looks very nice for a system which is in-fact read heavy as you said, but! each tinyurl visit (which is a read) is also a write because of inc() counter requirement. I really like your idea of going with mini batch updates and Spark Streaming, though. I think this is also a good design, and I'm not holding a strong opinion on either of those, and now what? :D Cheers! Thank you for the content, keep up great work!
Yeah! I think Cassandra could be interesting here, assuming that you used a leaderless counter (so basically a CRDT), and I agree that IRL conflicts should be relatively rare. Since there are a lot fewer hashes than there are URLs different sites may hash to the same result but ideally by then enough time will have passed that Cassandra could have performed anti entropy so we could see that the hash was taken and then probe to another. Thanks, I appreciate the kind words!
Hi Jordan! Great video. I have a question about probing. Should the application logic check if the hash is available? Since we are creating a hash based on the long URL + userID + timestamp, why would there be any conflict? Should we also think of UUID or Twitter's snowflake or Base 62 conversions to avoid probing? Could you please explain more about the Hashing function?
Just because those are our inputs to the hash function doesn't mean we'll generate a unique output. If we only have 6 characters in our short link, for example, some things will hash to the same value.
For the issue with multi leader potentially causing write conflicts, a simple fix could be to hash both the incoming url and outgoing url together. That would leave the chances of same hashed value being infinitely less unless the business is also looking up p0rn sites
Haha yeah I think another thing to note here is the range space. We can hash on lots of fields but if we don't have that many hashing combinations we'll have collisions
You have drawn multiple instances of kafka in the same partition. Does it also follow Single leader replication? If not, how does it determine which click count event should go to which replica of the same partition?
Well each one is basically simultaneously aggregating some of the counts of people that have clicked our link. If one of them goes down, and then comes back up, and tries to hit the database again with an operation to increment the counter for our link, we need to make sure we haven't already applied that operation (and then spark streaming could have gone down before the ack from the DB got back to it). So using an idempotency key would allow the database to say, update y from consumer z has already been applied, don't do it again.
I think they both have their pros and cons here, but a URL that was popular a year ago but nobody looks at now is still technically pretty "frequently used"
Thanks for the vid, wondering what if the URL read succeeds and the write to the Kafka queue doesn't? Is there any decent way to to solve this, I know you mention 2p commit a bit, but was wondering if there's anything faster.
Just found the channel, and now watching all your videos from a year ago. Appreciate that you're going through the rationale and building intuition about these concepts. Also, what would be you're top 3 book recommendations?
I already have! th-cam.com/video/ty9DQhM32mM/w-d-xo.html&ab_channel=Jordanhasnolife Didn't go into Levenshtein distance indexing, I think it's probably too specific for the details of this particular channel but I'm sure there are some good resources explaining it somewhere online!
Hey Jordan, Great video. Wanted to ask how analytics calculation would work in this case if in the system we were to use Atomic write operations (I am referencing DDIA Chapter 7 - Preventing Lost Updates - Atomic Write Operations). From my understanding we are basically trying to implement a read modify write cycle and it would still boil down to a lock which you mentioned. If we designed a distributed Redis store partitioned by URL to handle the counter incrementing would the main issue at hand be maintaining an accurate count during failover? Could using a consensus algorithm help with solving with this issue or would this reduce performance and throughout too much?
We can use atomics, but atomics are built around grabbing locks. If we have too many people incrementing the same counter, we could have contention. In my solution, I batch all of the upgrades, and increment them in intervals, via a real time stream consumer. By having just one consumer per URL, I don't need to worry about locking. Since redis is single threaded, I suppose your idea works. That being said, again, we are limited by the throughput of the redis node. While a consensus algorithm would ensure that we don't lose any writes, I'd agree with you that it's probably too slow and doesn't make sense here.
If you write to S3 first and then the DB write fails, you’re left with untracked orphaned junk data from the failed request. Another approach would be to write to the DB first to persist the intent to upload the pastebin, do the rest async, and make sure to communicate something reasonable to the user while the upload is in progress.
While this is true, I prefer orphaned junk to the alternative of a paste that links to nothing. I could just have a cleanup job run in the background once a month!
A paste that links to nothing is definitely bad. Recurring cleanup works and is necessary in any case, at least as a fallback for bugs. But it's more reliable to have your normal path be to write something to the DB first [chunk_id: xxx, status: need_to_upload], then take (idempotent) action [upload to S3 if not done already] based on the persisted goal state, then record that you're done [chunk_id: xxx, status: uploaded]. Less things to clean up this way. The relevant buzzword is reconciliation.
Hi Jordan, I have a few questions on how this can be deployed globally. Suppose you need to generate URL's for different users across the world. Then you will need to deploy the generator and datastores in different regions and generate unique hashes for each region. If not, there could be hash collisions which will require you to check different data bases in different continents. Maybe we need to generate a UUID kind of solution, so that each region can have its own local store which can create and serve the content better. Also, the view count you might be getting it from different continents. Even in this case, you need to have some sort of sharded counter type of thing to consolidate counts from different regions. What are your thoughts.
I'd use a key generation service with a db leader in each region, and aggregate counts using stream processing via Kafka and flink, as that can take as long as I need. You can have DB follower replicas in other regions for faster reads.
@@jordanhasnolife5163 by key generation service do you mean something that will append a region specific code to the hash, otherwise, how can the key generated be unique, as verifying that it is not conflicting with other urls in other regions will take time.
Hey! You won't believe but I need to make Tiny URL Service for the company which I work for! Any additional advices would be welcome! :) Thank you for your video
5:27 you mentioned that the hash function provides a unique value. Are you sure? Doesn't the hash function of a hash table typically return an index to map keys to indices in the underlying array?
Another question: when mentioning a few TB should be good for one single machine, I am wondering in practice what storage usually can be done in one machine? a few hundred TB?
Hey Jordan, you mentioned that we shouldn't use write back cache b/c that will incur the same issue as multi-leader/leaderless replication. Curious why we cannot use Redis for write back cache here? Redis supports single leader replication right? Thanks in advance!
Hi Aria - you can use redis as a write back cache, and redis supports single leader replication. But a write back cache means that you write to a cache first and eventually flush that to your database. If you haven't watched my playlist prior to this one, I think you may get some benefit from drilling down those concepts first! These videos should then all make more sense.
What strikes me is a question if kafka can actually handle 2 trillion partitions 👀 for example firehose can handle max of 5k partitions per delivery stream. I think some more hybrid approach would be in order here.
29:40 Unless you didn’t mean to have a partition for each url, but just a range-based partitioning. Because that was kinda clear only in the last architecture diagram, not during analytics deep dive. If so then all good 😅
Great Video Jordan. I have couple of questions. 1. You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. But why do we need to do this kind of partitioning ? Consistent Hashing offers a good evenly distributed keys among available servers isn't it. Is it wise to use ConsistentHashing everywhere i.e., For DB partitioning, Cache Partitioning & Kafka Queue Partitioning. 2. At th-cam.com/video/5V6Lam8GZo4/w-d-xo.html, Can you shed some more light on the Idempotency key for achieving the Exactly once processing. How does keeping the last written key will help in avoiding duplicate updates for clicks ? Because we might get the clicks for that same key later as well right i.e., Subsequent clicks on the url ? How do we differentiate between these two ?
1) We are using consistent hashing, albeit on the range of short keys. That's because the short keys themselves are hashes, no need to hash them again (there's no reason you couldn't if you wanted). 2) Each batch of clicks that we handle in spark streaming gets assigned a random key id (you can hash the incoming data or something so it is deterministic), called an idempotency key. When we change the value in the db, we check (in the clicks database) to make sure we haven't seen that idempotency key so that we don't accidentally update our db twice.
25:25 What is the difference b/t Spark Streaming and Flink, if you were to just use a larger batch in Flink that your example (you said 10, but consider we use a more comparable 100)?
I have one question after the reading educative IO site on the same topic. The recommendation there is about precomputing the shortURLs ahead of time (using some out-of-band batch process). What are the potential pros and cons of using the batch approach for write vs real time write?
@@jordanhasnolife5163 - Ty for replying. I am trying to ask if there is pros and cons of having the URL generation as a non real time process. Say we precompute URL's ahead and then when write requests come just give out a short URL instead of computing on the fly.
@@svar1938 Ah, I just don't know if the computation of "next URL" is an intensive enough process to make sense in this case. I suppose it saves you an index read.
How can we achieve or guarantee uniqueness of the tiny url field across multiple shards. Wouldn’t this be too expensive especially if we are using predicate locks
Creating a partition for each short URL in Kafka could be problematic- large number of partitions in Kafka can lead to increased overhead in terms of memory usage, disk space, and metadata management,Each partition in Kafka consumes system resources such as file descriptors, memory, and network connections,Management Complexity.....
Hi Jordan! If we are pre-generating the keys, then we cannot use user-ID and long url for the hash right? Then how do we generate 8character unique short url?
I'm not sure what you mean when you say pregenerating keys, presumably you mean the materialized rows approach. In this case we do no hashing and just randomly assign a write to one of the partitions. If you meant the hashing approach we hash the user id and long URL and deal with any collisions via probing.
I do not have such a video. Sounds like another aggregation problem though with creating small windows of data! I'm using an iPad, oneNote, and an apple pencil
Yeah, perhaps after the service sees a certain number of loads for some paste it can place it in the CDN and replace the link for it in the database accordingly.
Hi Jordan, great vid. Just had a question. In your DB schema, you mentioned having an index on the tinyURL. But given that the tinyURLs in each row are unique, wouldn't they be the primary key for our rows? And if so, why would we add an index on a primary key?
Firstly very awesome content, thanks a ton. Secondly had a query regarding a point you mentioned on using Kafka queue for analytics where each short url would belong to a partition. i.e we might have 2 trillion partitions. I think that would be very bad in general as that would increase the latency and would it even be possible to create that many partitions? (there should be some upper bound I think) thanks!!
I think you could make that argument about any partitioning schema to be fair! When I say partition by x, I typically mean partition by the hash range of x if that clears things up at all. My point is I just want all values for key x to be on the same node.
Hi Jordan, nice content! A question -- why would it be a problem if we use spark streaming to calculate total counts? You mentioned that specifically the it will fail if there are multiple spark streaming consumers. What does that mean?
I just meant that if you have many consumers for one url, now your counts are split between many consumers. That's ok, but you'll face contention when incrementing the counter in the db from multiple places.
What makes you say that noSQL scales horizontally much easier? This may be true when we have a lot of distributed joins to perform, but in the case in which we don't, I really don't see any advantage. We get equal amounts of data locality either way.
Thanks for the great content! I am wondering for the write-back cache and data consistency, is it possible to make sure the same user always write and read from the same cache? in that case will there still be any data consistency problem or anything I am not considering correctly? (apart from edge cases that the cache node is failing and they can only read from a new cache?
Yeah in theory if you use consistent hashing to put the same user on the same write back cache this should be ok, but as you mentioned if that cache goes down we're in trouble
If we are doing something(probing etc) to avoid collisions then why do we have to care what replication we use? As then there should be no conflict, no?
We still have eventual consistency. I can grab a short URL x on node A and you can grab it on node B, and we won't know that there's a write conflict until we synchronize.
@@AizazShahid-ck8cn I think that anything with single leader replication would. I don't think we get much benefit from being column oriented here, and if we're prioritizing for read speeds perhaps we're better suited with a B-tree.
Every read has to access the database anyways more or less, just write the timestamp then and then don't overwrite it if it exists. You'll need to use locking here, but what you can do to avoid the head is first read the TS without the lock, and if null then grab the lock and update it in order to minimize contention.
Put some URLs on multiple database partitions, and have enough partitions that the increased throughput of many partitions is able to offset the lock contention required to assign the next URL (by primary key) to a given user that gets routed to that partition.
Yeah I'd say that's effectively the same. I'd imagine there's some form of string representation of a QR code and we would just store that as the primary key in the database instead of a short link.
Hey Jordan, love your content! I'm a bit confused about when the S3 data is being used here, if we are always writing to CDN first and fetching from it as well. Is it a fallback if some data gets evicted from a particular edge CDN server?
Hey Yash! We aren't always writing to the CDN here! Rather, we use the CDN as a write around cache, where pastes that are accessed many times ideally will eventually be loaded in the CDN. For the first write, we go to S3.
Hey man, Just came across this channel and i see two playlists for system design problems Are the videos from first playlist good enough to cover ground for now until you keep coming up for videos in this playlist ? Just asking if there's any reason behind updating them as they were only 1 year old
I have a question - how we would get the same shortlinks "abc" at 5:58 if we are using hash function( userid, timestamp, long url). if hash function is generating the same value then wouldn't it be resolved by probing? Also hash function has different values in it like userid, long url.
There are only so many short URLs, so even if you have different inputs they could theoretically hash the the same one. And yes, this would be resolved by probing, but in order to actually know that the key is taken and we need to probe, we have to be sure to lock on that row so that we don't have two users grabbing a key at the same time.
Even if they weren't in memory, the performance would be bad due to random seeks. There do seem to be augmentations of normal indexes with hash ones, but those are gonna be in memory.
@@jordanhasnolife5163 afaik, hash indexes used to perform worse in postgres because they were not optimised. They are now, and should outperform b-trees for equality searches. In a b-tree for those searches you are still paying O(logN) every time
Can you explain a bit more about the technology choices? you picked mySQL as it has all the requirements you came up with. but wouldn't scaling be easier with something like mongodb compared to mySQL. mongodb also has indexes. so wouldn't mongo be the choice in terms of faster development
Can you clarify your assumptions here? Why does mongo lead to faster development? Why does mongo scale better? I think that people often say this about NoSQL databases without much evidence to back it up, so that's going to be my retort for you here. Generally, most developers are familiar with SQL databases, and I believe that I can get away with using the relational model here, both in the shape of data, and in the sense that I want the ability to have ACID transactions and use single leader replication. Could I use mongo? Sure - but I don't see any clear advantage that it has over MySQL when both work perfectly fine in this situation.
Heyy @@jordanhasnolife5163, a beginner here, what about the argument where people say that NoSQL is easier to scale horizontally? I always get a bit confused on this point, I think scaling horizontally relates to since we have data locality in NoSQL, we don't need to perform joins across nodes. But do I definitely choose SQL vs NoSQL? Should I always go for SQL, if it fits, choose it, else try with Nosql?
btree is not scalable, and kafka uses LSM tables as its main system, and is highly scalable. so using ULID or UUIDs and replicating over a kafka queue might increase capaity without degrading your performance i think (although am not an expert in this system type)
1) I don't know where you're getting that Kafka uses LSM trees, because it's basically just writing to a commit log, therefore not requiring any sort of indexing 2) What makes you say that B-trees aren't scalable? And what do you mean by "not scalable"? Let me know if I'm missing anything!
Hey Jordan, In the case of one publisher Per Row(29:34) you mentioned that we will have one spark streamer per tiny url id. But here we have trillions of URLs how is that feasible to have trillion streamers for this case?
Hey Rishabh! What I meant there is that the data for each tiny url id belongs on one spark streaming node, but you can have a spark streaming node responsible for many tiny url ids. You basically partition the consumers/kafka queues on tiny url id.
Hi Jordan, thanks for this video! I dont get why we need to worry about two people adding rows with the same primary "tinyUrl" key, doesnt every ACID compliant db prevent adding two of the same row like that or am I missing something very obvious??
Sorry I still dont understand 😢 I was referring to 11:15 where it talks about locking for adding new rows, I thought databases already did this, so why do we need to worry about it?
I swear this channel is such a hidden gem.
Thanks man! Hopefully not too hidden haha
I think this attracts a particular kind of readers. Those looking for more in-depth and first- principles understanding vs memoization. I think you also have a good balance of real use case driven thinking and hand waving where necessary. The latter is crucial and people ( including me at times) tend to overlook the importance of hand waving. Not everything deserves the same level of detail. But knowing which does is the game changer! Anyways, I will be subscribing and reviewing a good portion of this 😊
You stole my thoughts! What is this man!!! Amazing…brilliant
I've gone through a lot of videos on this subject and this is by far the best. Not just regurgitating what they've read elsewhere, but actually coming from a place of proper understanding.
Thanks James!
Great content! Probably the most in-depth TinyURL design I've seen. Most others don't actually get into write conflicts/locking.
I appreciate the new 2.0 video style and how in depth it is. I found the older videos too brief considering the complexity of the systems they covered.
You are so underrated brother... this is the best content on the internet regarding System Design interviews 100%... who cares about an interviewer saying 'yes' to every question that the interviewee asks, or an interviewee coming up with the same 10 boxes every single System Design question when you explain stuff like this... keep it up, you are saving lifes!!
You are very good, the best even, at driving the solution of a design problem by applying logical reasons rather than throwing something at my face i wouldn't understand. Thanks for existing!
Thank you very much, I’ve been reading DIDA and Understanding Distributed Systems so it’s great to see how those concepts apply to the use cases you are showing.
This is the first video of yours that I’ve seen yet, planning to watch all the series when I can because you highlight the questions that we should be asking when designing and assert the tradeoffs of different solutions, it’s amazing.
I don’t know if you have done for the other videos but if I could ask for anything would be for you to change the pencil color when reviewing the flows of data, it makes it easier to find where you are pointing to in the drawing board.
Again, thank you very much, these videos are really helpful.
Sounds good - Ive gotten this suggestion a couple of times now so I'll give it a shot at some point
I really like the format for these types of videos of starting with a naive solution and then iterating on each component as we discover performance problems. That way the initial db schema and the overall graph of how requests / responses implement the features can be clearly presented to the viewer + the viewer gets a conceptual framework to start evaluating the ways in which the simple picture needed to become more sophisticated.
Thanks for the feedback!
At 12:34 -> You do not have to do any predicate query. Relational databases use B-trees and are not append-only like databases that use LSM trees.
So an "INSERT" will fail if another INSERT beat it to the race because the 2nd insert will notice the primary key in the B-tree. This sort of "INSERT and fail if primary key exists" is very hard to accomplish in LSM based DBs as they are append-only. In fact to accomplish this in an LSM based database, you usually have to do a special slow instruction that involves multiple phases and uses something like Paxos or raft or 2PC.
So with a database like postgres, one solution to this TinyURL is to actually just have different application servers INSERT new rows. If there was a race on the exact same primary key, the 2nd application server will get a "duplicate key exists" error and will know to retry or send the error to the client.
Yeah - I probably shouldn't have included this section in the video since it's just thrown people off. I meant to say that predicate locking was one way of accomplishing this, but like you mentioned in SQL databases they're gonna get this done for you on the primary key.
Interesting stuff on the LSM tidbit, that's new to me! I imagine if the key was still in the memtable perhaps we wouldn't need too much additional complexity, as we'd see it was there, but had it been flushed this could get complex.
Thanks for the insight, and I appreciate you pointing out these flaws!
Hey, first of all thank you for the videos. They've been really helpful for me prepping for an upcoming interview!
I watched a separate video on tinyURL/pastebin link generation, and the main point of the video was that instead of hashing, which can lead collisions when we truncate the MD5/whatever hash, like you mentioned in the video. Their suggestion was that we could have a cluster of key generations/ID generators, and each one would have a range of URLs to distribute when a write request is made. They could be coordinated by ZooKeeper and would allocate only from their own chunks, and would refresh their URLs as needed. The only concern is that we lose that range of URLs (say 1 million) if a specific generator goes down, but since we have 2 trillion combinations if we use base 36 with 8 characters and even more if we use base 62, this isn't a major concern. This would also speed up writes in the case there is a collision.
What are your thoughts on this?
Hey yeah, I touch upon that in this video. That's basically the key generation service where you partition the keys and assign in sequence.
That works too! There may some lock contention within a partition but if you have enough partitions such that it doesn't matter that's fine too!
My sister catches me binge watching Jordan again and says:
"Just like Jordan, you have no life too 🤣"
Thank you for all the system design content! I’ve been binging through the 2.0 playlist and this video does a great job putting all the pieces together. I don’t have any experience with system design interviews so I have two questions in particular:
1. If in an interview, someone were able to deliver a design with the same level of depth + complexity that you’ve described here, what level would they be placed at? Alternatively, what sort of depth would interviewers be looking for in say an L4 interview?
2. Many of the system design interview guides recommend to lay out a basic architecture first and then go into a deep dive at the end but in your videos, you weave both parts together. How would you recommend to split these parts up? For example, is database schema + choosing a type of database part of the deep dive or should it be part of the basic architecture? And how about for replication/partitioning?
1) At the end of the day, you're interviewing for a certain level, so you'd be placed at that level. I don't really think uplevels are ever going to happen, but failing a systems design interview might lead to a downlevel. I like to think that these videos are sufficient to pass just about any systems design interview, considering the results of my viewers, but I can't be sure, as I'm not an L5 and haven't had the test to try that theory out haha
2) Yeah, and I used to do my old videos this way. However what I've come to realize is that one's choice of database schema is intimately tied to their database architecture choice. So if I just list a schema outright, but don't go into what architecture I'm using, it might not make sense. In an actual interview, it may make sense to list a schema first, and then say you'll explain it later, but my main objective of these videos is to make them as clear as possible - so I hope that my choice to do this makes sense.
Good content technically speaking, but a bit removed from an actual System Design interview. I found HelloInterview to be more geared towards passing the actual interviews (I swear I'm not shilling, just giving you feedback because I genuinely love your channel for the technical aspect).
I'm hearing this a lot these days. I could see it depending on the interviewer/level, but I'm pretty content to overprepare for these since I find it actually helps me at work too!
And no worries, you can have preferences :)
Personally love the in-depth stuff and trade offs for each choice as that’s the expectations for very senior (L7+) roles.
Awesome video! Couple thoughts:
- For storage, it may be useful to talk about content addressable storage. Many pastes may be the same and that will save space
- I’m not sure what the CDN buys us here. Reading from s3 isn’t crazy slow but the big issue is that the object space is huge and my hunch is that most of the time, you’d get CDN misses in most regions. That is assuming you’re not storing every paste to your CDN, which feels expensive. This is more a not that a cdn feels like a premature optimization
- I really like your count approach. Another couple be using a counter CRDT and increment server side while serving the link. That being said, yours is more general purpose and scales for other types of analyitics
Thanks!
Good call on the first point, you could take some sort of hash to deduplicate.
For the CDN point, I mostly have it there just to talk about CDNs. In practice, caches seem to be one of those things that you just introduce as you need, so we'd have to see how our existing latencies were and then experiment here.
Appreciate. It is a very serious description/implementation for such easy task.
Hey Jordan, awesome video as usual! I really appreciate the way you break down tradeoffs. I checked out the TinyURL/Pastebin design from grokking's course, but it felt a bit sloppy to me, your version is so much better :) thank you!
Thanks Harika!
These are so sick! Much appreciated
I'd love it if you could make content regarding what level of depth is expected from a Mid level SD interview vs Senior vs Staff!
It would be cool to take the same problem and do a mock at each level so we can understand the differences! Would love to see a mock series between you and Gaurav tackling this.
I've interviewed with companies where they haven't picked a level for me yet and they say we will determine it based on your performance. But I don't know what the expectation is exactly in the context of a system design interview.
To be honest I just don't have enough context here as I haven't conducted interviews for a company, I do really wish I could say otherwise but at the moment my objective is to just go as in depth as possible.
I fucking love how you can explain complex shit and be funny at the same time. Keep up the good work!
Learned a lot, concise, well explained, and also funny
I have a sys design interview today, thank you so much
Good luck!!
Once again, learned a lot and read and write scaling techniques were helpful in my system design interview. Another write scaling technique you can talk about is Log structured merge trees for o(1) write and O(logn) reads.
I wouldn't say writes there are O(1) since they also go into a tree, but it should be a much smaller tree and in memory, so maybe practically it is :)
@@jordanhasnolife5163 you are correct they are eventually not o(1) but if they are async writes, for your request sake, it's o(1) and at some point we add the entire log in sorted manner. Again, I am not an expert here so I could be completely wrong, lol. Please let me know if I am. Will do more research for complete understanding.
@@advaitchabukswar4163 I don't think it is asynchronous, the write to the tree is synchronous as far as I'm aware
@@jordanhasnolife5163 I see…
Thanks for explaining the rationale behind your design decisions, very few videos do that
Brief Outline
00:01:15 Problem Requirements
00:02:04 Performance Considerations
00:03:23 Link Generation
00:05:45 Assigning URLs - Replication
00:08:07 Assigning URLs - Caching
00:08:51 Assigning URLs - Partitioning
00:10:34 Assigning URLs - Single Node
00:11:29 Assigning URLs - Predicate Locks
00:14:12 Assigning URLs - Engine Implementation?
00:15:42 Assigning URLs - Database Choice
00:16:17 Maximizing Read Speeds
00:17:35 Maximizing Read Speeds - Hot Links
00:18:50 Maximizing Read Speeds - Populating the Cache
00:21:02 Analytics - Naive Solution
00:22:22 Analytics - Stream Processing
00:23:49 Analytics - Click Consumer
00:26:00 Analytics - Exactly Once?
00:29:16 Analytics - One Publisher Per Row
00:30:17 Deleting Expired Links
00:31:21 PasteBin - Huge Pastes
00:34:23 Final Diagram
Thanks, Jordan!
Thank you man! Legend!
Hi Jordan. You were talking about 16 TB of potential data, partitioning, single leader replication. Sounds like Mongo fits better than relational solution.
Thank you for great content!
I think they're more or less equally feasible for this one, feel free to go with mongo, basically personal preference at this point
Using PH as your example URL makes you a legend
Legend is a strong word to use for what I am
I'm here for your eyes
👀
Wonder why key-value DB wouldn’t be the db choice. It doesn’t require data relation and we can quickly check if tiny url has been used and also supports fast retrieval
Yeah I tried to say that this would be the best choice, but just doubt that storing all data in memory is feasible from a cost perspective, so wouldn't be surprised if your interviewer pushed you harder here.
I was wondering the same thing.
Its weird how advice on costs usually comes up in the "what to say in the beginning" interview advice, but isn't mentioned more in technology tradeoffs. I also would've thought that non relational dbs are, for the most part, cheaper than relational ones, but I guess that's only until the point when they stop being caches and start being proper dbs.
Am I right or missing something?
@@adi96adi Yeah not exactly sure regarding the cost of non relational vs. relational dbs but I can say with certainty that storing in memory is more expensive and for something like tinyURL, likely to be overkill.
Thanks for the detailed breakdown for storing url click counts. A couple of questions
1. In a multi region/multi-DC world, assuming we use a single leader writes, there could be atleast two Spark streaming instances writing to DB for the same Key (In a 2 region world). To ensure correctness, do we need to store 2 Idempotent keys here per row? Alternately, I think Kafka mirroring can solve this by replicating clicks from multiple regions to be unified in one region so we can get away with a single idempotent key
2. If we extend this problem to like counts, where we store liked_by users in a DB and rely on the kafka + stream_processing for counts, What happens if Kafka is down? For correctness, do we want to fail the liked_by write to DB or is there a better way to recover counts without impacting like actions availability?
1) I'd say so, assuming they're going to the same database row
2) You could just use change data capture to put a write in kafka when a new liked_by user is written. Ideally, kafka shouldn't be going down. This is why we replicate things.
@@jordanhasnolife5163 I’ve seen cases where company internal/platform level issues can cause temporary Kafka outages. In those cases, I’m thinking if we can write the updates to a log file and when Kafka is back up, process log files to enqueue Kafka?
My motivation for the question is to ensure eventual correctness in counts if a major component such as Kafka fails
@@lv0320 Fair enough, I think a log file that eventually sinks to CDC when kafka is back sounds best. Perhaps look into debezium and see if that's something they already support
@@jordanhasnolife5163 makes sense, thank you
6:40 never thought I'd laugh out loud during a system design video. Thanks for making my studying somewhat tolerable
Thanks Scott!
11:21 hi here you mention about the race condition when 2 users would add/create for a same actual url, but I’m confused as to why would this happen if the hash functions accounts for the userID already when generating the tinyURL? Wouldn’t that make the tinyurls unique for 2 users even for same links?
It's super unlikely but as long as there are fewer hash combinations than userId and url combinations you can have some collisions
Hi Jordan, this is the best in-depth design video about tinyurl in any book or on any website. Thanks a lot!
I do have two questions:
1. I am confused by the discussion around the concurrent writes part. In the video you first ruled out the multi-leader and leaderless replication designs. Then you discussed how to handle the concurrent writes to a single row. But if we are choosing single-leader replication, would this still happen? Shouldn't the leader handle the concurrent writes with locking or queuing, and determine the "one with later timestamp" to be unsuccessful (any maybe then assign a different unused shortURL to it)?
2. In many other sources, there is the other way to assign the tinyURL, which is through distributing sequentially increasing numbers in base-36/62 via a ticketing server. The ticketing server can scale itself by preallocating different ID ranges. I am curious what kind of partitioning could this design have to handle large read and write traffic? Like if we use hash range partition, clearly all the writes will concentrate on one partition which is BAD. Of course we could do another hash on the shortURL and partition on that to make it evenly distributed, but hashing itself is not free, and adds to complexity. What's your point of view on that?
Thanks in advance!
1) Yes, which is why we use single leader replication with locking :)
2) Yeah, that's the key generation approach that we describe. Each database covers a range of keys, and hence we should get linear increases in speed as we add more nodes. I don't see why all writes will concentrate on one partition if we use a hash range, you can just randomly route each incoming request to one of the partitions and the partition will assign it a UUID.
One comment about the writer - if the writer is an API instance serving the requests to the user, and we're writing to the CDN/S3/DB, this could potentially take a long time and you'd likely want some kind of async mechanism to do these writes with a way of updating the user on its progress.
Agreed. That's potentially more of a client side change though, pretty sure that when uploading files to these services they expose the percentage completion
Great video, especially the discussion on single leader vs multi leader databases here.
A couple notes:
I don’t think write through for the CDN is a good idea. If you wrote a 10 GB file there, it would be at the expense of other files that may be getting accessed frequently. We shouldn’t give priority to a large unknown file over something that has been accessed and we won’t know will be popular.
This is assuming that CDNs have a bounded size. Maybe there is a fancy one out there that isn’t bounded.
I think you could make an argument that you won’t have records that are accessed for a long time after they are created.
Yeah agree with all said, I don't think we know in advance what's popular here anyways so a write around approach seems better.
FK , this man contents has its own system design . Love you buddy
Hi Jordan, great video! I'm not sure that we have to have uderId in the table, because the same long URL can be used by different users, and the eventual short URL (hash) will be the same.
While I agree with you, you want to be able to attribute clicks to the links of different users, so it may be useful to have different short urls for the same long url
great video, man. a lot of system design videos are super generic with just "here's some lego blocks that I know off the top of my head" and don't start from a bottom-up approach with the system architecture designed to support a specific algorithm, not just a top-level cookie-cut use case.
Appreciate that man!!
Jordan does deserve pat on back ❤
oh wow u changed the original system design vids! Great I was thinking of reviewing soon :)
Welcome back ;)
hey jordan, curious, why do we avoid multi leader and use single leader when both can have stale data issue
I'm not concerned about stale data, I'm concerned with write conflicts, which would occur when two users both claim the same URL. Single leader replication doesn't have this issue.
@@jordanhasnolife5163 thank you legend.
Awesome Content. I wasn't expecting this much detail in url shortener.
Two questions though:
1. How does partition to kafka will be done? Ideally, servers need to know which kafka patition the message should be sent to and also, some coordination service would be required to maintain that?
2. For writing large files for Pastebin. CDN is a cache and cache will be have eviction policy, so if any url will be hit with gigantic file, which is not used for days and someone hit it after lets say weeks. Considering, we use LRU, it might get evicted and CDN anyways would need to load it again. Is the only reason to put it in CDN first is to make sure popular urls will be in CDN itself and when even first few users hit it, they all will get it pretty quickly?
1) Yeah kafka employs zookeeper internally. You can just tell it a "partition id" you want to read and it figures out the rest IIRC.
2) Yep basically! If it hasn't been hit in a while, no point of keeping it in the CDN anyways.
For storage engine, the reasoning to discard hash indexes is flawed. We are NOT storing 1PB in hash index, we only need to store keys (1T * (8 bytes for hash key + 32 bytes for pointer to value on DB) = 40TB). 40TB still is lot, hence we can discard it.
In retrospect, I do actually think we can get pretty far with the hash index, it's pretty unlikely we'll hit that scale irl lol
Why do you need predicate locks for solving inserts in MySQL DB ?
Why can't we just use primary key constraints / unique key constraints given by DB to achieve uniqueness ?
To be clear, we totally can, I just would imagine that under the hood that has to use some sort of predicate locking so that we don't create two rows with the same primary key
This series is definitely in a league of its own. Who da man! Jordan da man - Fellow dork
Can you explain a bit more why an on-disk hash index would not work for this problem?
I understand that on-disk hash indexes will suffer from poor data locality and disk I/O optimization, but in this case we don't care about range queries, so locality is not a concern. Also, it seems to me that we can cache disk pages in memory for a hash index just as we would for a B+tree or similar and achieve similar a similar probability for cache hits, as co-located shortURLs on disk have no increased probability of access.
At the very least, it seems like we could use a hybrid solution with a B+tree to index our on-disk hash table and achieve constant time value access once the hash table range containing our target key is paged into memory.
PS love your channel dude really interesting stuff 🫡
In this case it would work great if you keep the hashing part of it in memory and store the address on disk as the value. Agreed!
Thanks for this video. One of the best I have seen in this topic. Question for you: You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. How do we know in advance which bucket we will be generating the key for?
You take the hash of your inputs and then use some load balancer which listens to a service discovery layer (zookeeper) to go to the right node
Superb video, quick questions
Are you storing all possible tinyURL combinations beforehand in DB OR after hashing longURL + timestamp + userID tinyURL is inserted into respective partition?
And why the click counter is triggered from the reader's side? why not when we read the DB?
I discuss the pros and cons of doing both. If you're using a SQL table, you can probably store them ad hoc.
I don't know what you mean "triggered from the reader's side". It is incremented when we read the database. However, in order to avoid lock contention, we perform the counting asynchronously.
Great video! However, can you elaborate the following things related to the short URL generation?
1. Most of the common hash function, e.g. MD5, SHA-2, generate the string way longer than you need. How do you use it while still keep the hash collision low?
2. Can you elaborate how to detect the hash collision? You mention using the linear probing so it kinda imply you want to do it in the memory, so my guess is the service probably need to first query the DB to check if this hash/key already exist. Won't that be a performance concern?
3. I've read a few other designs of generating the unique short URL, and some of them raise the issues of using the hashing of the original URL (with optionally including other data like timestamp). They usually suggest to use a separate unique ID generation service/component, and perform proper encoding like Base 62 and use it as the short URL(besides the domain name part). Have you ever considered this and what's your thought of this?
Btw, unique ID generation strategy in the distributed system may be an interesting topic to discuss as well if you haven't done it yet 🙂
I think you should just be able to truncate, if we found that this was leading to too many collisions we could always use 10 characters instead of 8 for example
2) oh on the contrary I'm basically proposing we use application logic that says if key x exists take it otherwise try to write key x+1 in the db. This could be done a bit faster with something like a stored procedure.
3. I don't really see the difference, feel free to elaborate on how this improves things!
@@jordanhasnolife5163 Grokking says the key gen is great at solving hash collisions since everything is created beforehand on an offline process. It will create two tables in your DB (used & unused). The application can periodically grab a bunch of unused keys from the key gen service (KGS). Once used, app will update KGS to move entry into used. That's their reasoning for it anyways. Yours makes sense as well to just populate db with everything from get go since its only a few TBs which is nothing.
@@jordanhasnolife5163generating a unique id in a sequence, then create a base 36-64 string as a shorten URL allow you to not having to deal with a collision ever.
Running number is guaranteed to be unique unlike hash. However, it run the risk of being guessable and could be ddos, by making 2 shorten URLs that point to each other.
@@jordanhasnolife5163 so each url-assign-node will be assigned a range for example (0000 - 5555). This way no 2 nodes will ever try to insert the same short url. Each node will just increment its local counter when ever it's creating a new short url. This way theres no conflict resolution to be solved.
Great video. I wonder if it would be considering overengineering to go for a keyed time windowed stream in flink, sink each trigger to the db idempotently (semi-event sourcing style) and trigger a reduce on the db once in x updates/once in a while (something like what rocksdb does).
You know what, I think I just answered my own question writing this comment 😅
Ha, I think you could probably do it, just not sure if it's necessary!
Is it really worth the trade off to use a relational db here due to the fear of write conflicts?
We are giving up on a ton of write throughput by making this choice aren't we?
What if we instead not take the input of the tinyURL from the user, instead just take his longUrl, have our cassandra already prepolulated with tinyURLs and then round robin requests among our leaderless architecture?
Fun fact, Bloomberg actually rejected me because I used mysql in my system design interview for this problem
I think it's moreso about how you justify it. To clarify, I said to use single leader replication, not necessarily a relational db.
That being said, if you round robin requests around your leaderless architecture and two people have the same shortURL that you propose they could very well still take the same shortURL on different replicas which will lead to a write conflict.
@@jordanhasnolife5163 Makes sense! and Makes Sense. If I get asked this again, I'll bring up both considerations
@@ninad11081991 it is very hard to shard sql dbs but much easier using mongodb/cassandra. For mongo db can use the same strategy as a sql DB shown in the video.
But for cassandra you need a separate KeyGen service since write conflicts can happen in cassandra. Do you think they failed you because they were looking for a cassandra solution?
Great Stuff!! So, to maintain the unique constraint using predicate locks, serializable transactions will have to be used with first a select and then insert, right?
Yep more or less!
@@jordanhasnolife5163 If there is already a unique constraint on the database for the short_url, I am confused as to why you would need to do anything in this case.
The database will return a UNIQUE_CONSTRAINT failure to whoever tries to INSERT after a short_url was already inserted, and that client can retry with an increased number.
Am I missing something?
for multi leader / leaderless cant we resolve the issues of conflicting writes by using consistent hashing?
You could say something along the lines of "for key x only write to replica 1 of partition 3". But then you've basically just re-created single leader replication.
Great Video, Have a question 29:51 , it is exactly 1 idempotency key per row and not few if we partition the kafka topic by short URL. can you please clarify.
Yes, I was saying this is if you don't partition by short URL
Sup Jordan! After 2:35 I went ahead and made a design too and I'm looking forward to hear an opinion about my thoughts, feel free to confront and bully me based on my decision.
You have hard constrained the choices around no write conflicts (with the example of an identical hash for two different urls by two users made simultaneously). I thought, that situation could be extremally rare in real life (and even if - we could add extra complexity on how we generate tinyurl to assure its uniqueness, with a salt maybe, or a distributed counter, but tbh, only after such event would happen considerable number of times). Thus, I wouldn't want to discard options based on that edge case. That enables the Cassandra option, which looks very nice for a system which is in-fact read heavy as you said, but! each tinyurl visit (which is a read) is also a write because of inc() counter requirement.
I really like your idea of going with mini batch updates and Spark Streaming, though. I think this is also a good design, and I'm not holding a strong opinion on either of those, and now what? :D
Cheers! Thank you for the content, keep up great work!
Yeah! I think Cassandra could be interesting here, assuming that you used a leaderless counter (so basically a CRDT), and I agree that IRL conflicts should be relatively rare. Since there are a lot fewer hashes than there are URLs different sites may hash to the same result but ideally by then enough time will have passed that Cassandra could have performed anti entropy so we could see that the hash was taken and then probe to another.
Thanks, I appreciate the kind words!
That Google logo and the stand behind looks like ghost to me 😰😱
If we are single leader, why do we need predicate lock?
Only one client will accès the row, right ?
Not if it doesn't exist yet!
Hi Jordan! Great video. I have a question about probing. Should the application logic check if the hash is available? Since we are creating a hash based on the long URL + userID + timestamp, why would there be any conflict? Should we also think of UUID or Twitter's snowflake or Base 62 conversions to avoid probing? Could you please explain more about the Hashing function?
Just because those are our inputs to the hash function doesn't mean we'll generate a unique output. If we only have 6 characters in our short link, for example, some things will hash to the same value.
For the issue with multi leader potentially causing write conflicts, a simple fix could be to hash both the incoming url and outgoing url together. That would leave the chances of same hashed value being infinitely less unless the business is also looking up p0rn sites
Haha yeah I think another thing to note here is the range space. We can hash on lots of fields but if we don't have that many hashing combinations we'll have collisions
You have drawn multiple instances of kafka in the same partition. Does it also follow Single leader replication? If not, how does it determine which click count event should go to which replica of the same partition?
As far as I can tell I only drew one Kafka instance per partition, but yes this would use single leader replication
At 28:55 why do we need to store an extra idempotency key for each additional SS consumer? im not quite understanding that part
Well each one is basically simultaneously aggregating some of the counts of people that have clicked our link. If one of them goes down, and then comes back up, and tries to hit the database again with an operation to increment the counter for our link, we need to make sure we haven't already applied that operation (and then spark streaming could have gone down before the ack from the DB got back to it). So using an idempotency key would allow the database to say, update y from consumer z has already been applied, don't do it again.
For the caches, since we want to handle hot-keys, why not use LFU as opposed to LRU for the eviction policy?
I think they both have their pros and cons here, but a URL that was popular a year ago but nobody looks at now is still technically pretty "frequently used"
Thanks for the vid, wondering what if the URL read succeeds and the write to the Kafka queue doesn't? Is there any decent way to to solve this, I know you mention 2p commit a bit, but was wondering if there's anything faster.
Nope, unfortunately nothing better than 2pc. If you figure it out though let me know, we'll be billionaires!
Just found the channel, and now watching all your videos from a year ago. Appreciate that you're going through the rationale and building intuition about these concepts. Also, what would be you're top 3 book recommendations?
DDIA, 50 shades of gray, the art of the deal
Thanks for the kind words!
Been watching all your videos.. Can you do a video on Lucence OR how in general how search index work with a given edit distance?
yeah interested in apache lucene, elastic search and searching indexing
I already have! th-cam.com/video/ty9DQhM32mM/w-d-xo.html&ab_channel=Jordanhasnolife
Didn't go into Levenshtein distance indexing, I think it's probably too specific for the details of this particular channel but I'm sure there are some good resources explaining it somewhere online!
Hey Jordan,
Great video. Wanted to ask how analytics calculation would work in this case if in the system we were to use Atomic write operations (I am referencing DDIA Chapter 7 - Preventing Lost Updates - Atomic Write Operations). From my understanding we are basically trying to implement a read modify write cycle and it would still boil down to a lock which you mentioned.
If we designed a distributed Redis store partitioned by URL to handle the counter incrementing would the main issue at hand be maintaining an accurate count during failover? Could using a consensus algorithm help with solving with this issue or would this reduce performance and throughout too much?
We can use atomics, but atomics are built around grabbing locks. If we have too many people incrementing the same counter, we could have contention. In my solution, I batch all of the upgrades, and increment them in intervals, via a real time stream consumer. By having just one consumer per URL, I don't need to worry about locking.
Since redis is single threaded, I suppose your idea works. That being said, again, we are limited by the throughput of the redis node. While a consensus algorithm would ensure that we don't lose any writes, I'd agree with you that it's probably too slow and doesn't make sense here.
If you write to S3 first and then the DB write fails, you’re left with untracked orphaned junk data from the failed request. Another approach would be to write to the DB first to persist the intent to upload the pastebin, do the rest async, and make sure to communicate something reasonable to the user while the upload is in progress.
While this is true, I prefer orphaned junk to the alternative of a paste that links to nothing. I could just have a cleanup job run in the background once a month!
A paste that links to nothing is definitely bad. Recurring cleanup works and is necessary in any case, at least as a fallback for bugs. But it's more reliable to have your normal path be to write something to the DB first [chunk_id: xxx, status: need_to_upload], then take (idempotent) action [upload to S3 if not done already] based on the persisted goal state, then record that you're done [chunk_id: xxx, status: uploaded]. Less things to clean up this way. The relevant buzzword is reconciliation.
@@dmitrigekhtman1082 Seems like a fair point to me!
Hi Jordan, I have a few questions on how this can be deployed globally. Suppose you need to generate URL's for different users across the world. Then you will need to deploy the generator and datastores in different regions and generate unique hashes for each region. If not, there could be hash collisions which will require you to check different data bases in different continents. Maybe we need to generate a UUID kind of solution, so that each region can have its own local store which can create and serve the content better. Also, the view count you might be getting it from different continents. Even in this case, you need to have some sort of sharded counter type of thing to consolidate counts from different regions. What are your thoughts.
I'd use a key generation service with a db leader in each region, and aggregate counts using stream processing via Kafka and flink, as that can take as long as I need. You can have DB follower replicas in other regions for faster reads.
@@jordanhasnolife5163 by key generation service do you mean something that will append a region specific code to the hash, otherwise, how can the key generated be unique, as verifying that it is not conflicting with other urls in other regions will take time.
Hey! You won't believe but I need to make Tiny URL Service for the company which I work for! Any additional advices would be welcome! :) Thank you for your video
My only advice is to not plan for scale you won't have :)
When you say you would use database partitioning, do you actually mean sharding? How do you approach discussing these two topics during an interview?
I've never really known them to be different terms
5:27 you mentioned that the hash function provides a unique value. Are you sure? Doesn't the hash function of a hash table typically return an index to map keys to indices in the underlying array?
A hash function is any function that deterministically maps some element of a domain to a different element of a range
Another question: when mentioning a few TB should be good for one single machine, I am wondering in practice what storage usually can be done in one machine? a few hundred TB?
Depends how you're storing the data, but just look up the prices of hard drives and look up how many you can fit in an average server
Hey Jordan, you mentioned that we shouldn't use write back cache b/c that will incur the same issue as multi-leader/leaderless replication. Curious why we cannot use Redis for write back cache here? Redis supports single leader replication right? Thanks in advance!
Hi Aria - you can use redis as a write back cache, and redis supports single leader replication.
But a write back cache means that you write to a cache first and eventually flush that to your database.
If you haven't watched my playlist prior to this one, I think you may get some benefit from drilling down those concepts first! These videos should then all make more sense.
In interviews, do we have to talk about all the ways we can do a part. eg all ways to hash/partition/caches
I guess discuss with your interviewer, but it's not unreasonable to touch upon why you wouldn't want to do a particular approach
What strikes me is a question if kafka can actually handle 2 trillion partitions 👀 for example firehose can handle max of 5k partitions per delivery stream. I think some more hybrid approach would be in order here.
29:40 Unless you didn’t mean to have a partition for each url, but just a range-based partitioning. Because that was kinda clear only in the last architecture diagram, not during analytics deep dive. If so then all good 😅
Yep you got it!
Great Video Jordan.
I have couple of questions.
1. You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. But why do we need to do this kind of partitioning ? Consistent Hashing offers a good evenly distributed keys among available servers isn't it. Is it wise to use ConsistentHashing everywhere i.e., For DB partitioning, Cache Partitioning & Kafka Queue Partitioning.
2. At th-cam.com/video/5V6Lam8GZo4/w-d-xo.html, Can you shed some more light on the Idempotency key for achieving the Exactly once processing. How does keeping the last written key will help in avoiding duplicate updates for clicks ? Because we might get the clicks for that same key later as well right i.e., Subsequent clicks on the url ? How do we differentiate between these two ?
1) We are using consistent hashing, albeit on the range of short keys. That's because the short keys themselves are hashes, no need to hash them again (there's no reason you couldn't if you wanted).
2) Each batch of clicks that we handle in spark streaming gets assigned a random key id (you can hash the incoming data or something so it is deterministic), called an idempotency key. When we change the value in the db, we check (in the clicks database) to make sure we haven't seen that idempotency key so that we don't accidentally update our db twice.
25:25
What is the difference b/t Spark Streaming and Flink, if you were to just use a larger batch in Flink that your example (you said 10, but consider we use a more comparable 100)?
I'd say none then really, just wanted the opportunity to use something other than flink lol
I’ve never used either technology (only Kinesis streams from AWS) so was just curious!
I have one question after the reading educative IO site on the same topic. The recommendation there is about precomputing the shortURLs ahead of time (using some out-of-band batch process). What are the potential pros and cons of using the batch approach for write vs real time write?
You're gonna have to elaborate a bit more for me to answer this one, because I'm not sure what out of band means off the bat.
@@jordanhasnolife5163 - Ty for replying. I am trying to ask if there is pros and cons of having the URL generation as a non real time process. Say we precompute URL's ahead and then when write requests come just give out a short URL instead of computing on the fly.
@@svar1938 Ah, I just don't know if the computation of "next URL" is an intensive enough process to make sense in this case. I suppose it saves you an index read.
How can we achieve or guarantee uniqueness of the tiny url field across multiple shards. Wouldn’t this be too expensive especially if we are using predicate locks
You shard by the tinyurl ID so that you don't have to!
Creating a partition for each short URL in Kafka could be problematic- large number of partitions in Kafka can lead to increased overhead in terms of memory usage, disk space, and metadata management,Each partition in Kafka consumes system resources such as file descriptors, memory, and network connections,Management Complexity.....
Not proposing this - when I say partition by x it's me using shorthand for partition by hash range of x
@@jordanhasnolife5163 thanks
Hi Jordan!
If we are pre-generating the keys, then we cannot use user-ID and long url for the hash right? Then how do we generate 8character unique short url?
I'm not sure what you mean when you say pregenerating keys, presumably you mean the materialized rows approach. In this case we do no hashing and just randomly assign a write to one of the partitions.
If you meant the hashing approach we hash the user id and long URL and deal with any collisions via probing.
@@jordanhasnolife5163 Yes, for the materialized rows approach, how are we generating unique keys? Are we using UUID ?
Do we have a video on phone billing for monthly and custom date generation? Also what devide are you using for whiteboard?
I do not have such a video. Sounds like another aggregation problem though with creating small windows of data!
I'm using an iPad, oneNote, and an apple pencil
@@jordanhasnolife5163Thanks. Which iapd model is that?
@@akhilmittalji6816 an air perhaps
Also, could you explain the sequence of workflows for pastebin? Does it redirect to the long url and then fetches the paste content from CDN?
Yeah, perhaps after the service sees a certain number of loads for some paste it can place it in the CDN and replace the link for it in the database accordingly.
Hi Jordan, great vid. Just had a question. In your DB schema, you mentioned having an index on the tinyURL. But given that the tinyURLs in each row are unique, wouldn't they be the primary key for our rows? And if so, why would we add an index on a primary key?
Same thing basically - not all databases sort on that primary key I believe so I'm just saying this to be clear
Firstly very awesome content, thanks a ton.
Secondly had a query regarding a point you mentioned on using Kafka queue for analytics where each short url would belong to a partition. i.e we might have 2 trillion partitions.
I think that would be very bad in general as that would increase the latency and would it even be possible to create that many partitions? (there should be some upper bound I think) thanks!!
I think you could make that argument about any partitioning schema to be fair!
When I say partition by x, I typically mean partition by the hash range of x if that clears things up at all. My point is I just want all values for key x to be on the same node.
@@jordanhasnolife5163 ahh thanks this is helpful
Hi Jordan, nice content! A question -- why would it be a problem if we use spark streaming to calculate total counts? You mentioned that specifically the it will fail if there are multiple spark streaming consumers. What does that mean?
I just meant that if you have many consumers for one url, now your counts are split between many consumers. That's ok, but you'll face contention when incrementing the counter in the db from multiple places.
What are the benefits gained from choosing SQL DB over noSQL? noSQL can scale horizontally much easier
What makes you say that noSQL scales horizontally much easier? This may be true when we have a lot of distributed joins to perform, but in the case in which we don't, I really don't see any advantage. We get equal amounts of data locality either way.
Thanks for the great content! I am wondering for the write-back cache and data consistency, is it possible to make sure the same user always write and read from the same cache? in that case will there still be any data consistency problem or anything I am not considering correctly? (apart from edge cases that the cache node is failing and they can only read from a new cache?
Yeah in theory if you use consistent hashing to put the same user on the same write back cache this should be ok, but as you mentioned if that cache goes down we're in trouble
If we are doing something(probing etc) to avoid collisions then why do we have to care what replication we use? As then there should be no conflict, no?
We still have eventual consistency. I can grab a short URL x on node A and you can grab it on node B, and we won't know that there's a write conflict until we synchronize.
@@jordanhasnolife5163 Wouldn't HBase make the most sense in that case?
@@AizazShahid-ck8cn I think that anything with single leader replication would. I don't think we get much benefit from being column oriented here, and if we're prioritizing for read speeds perhaps we're better suited with a B-tree.
How to find out when the tinyURL was accessed for the first time?
Every read has to access the database anyways more or less, just write the timestamp then and then don't overwrite it if it exists. You'll need to use locking here, but what you can do to avoid the head is first read the TS without the lock, and if null then grab the lock and update it in order to minimize contention.
Can you please clarify how you assign short URL’s ?
The second option with materialized conflicts ?
Put some URLs on multiple database partitions, and have enough partitions that the increased throughput of many partitions is able to offset the lock contention required to assign the next URL (by primary key) to a given user that gets routed to that partition.
@@jordanhasnolife5163 thanks
If we want to add a new feature to the system - to generate QR code instead of short URLs, what'd be the changes if any?
Yeah I'd say that's effectively the same. I'd imagine there's some form of string representation of a QR code and we would just store that as the primary key in the database instead of a short link.
Hey Jordan, love your content!
I'm a bit confused about when the S3 data is being used here, if we are always writing to CDN first and fetching from it as well. Is it a fallback if some data gets evicted from a particular edge CDN server?
Hey Yash! We aren't always writing to the CDN here! Rather, we use the CDN as a write around cache, where pastes that are accessed many times ideally will eventually be loaded in the CDN. For the first write, we go to S3.
Hey man, Just came across this channel and i see two playlists for system design problems
Are the videos from first playlist good enough to cover ground for now until you keep coming up for videos in this playlist ? Just asking if there's any reason behind updating them as they were only 1 year old
Yeah I think so, but I'm just trying to be more visual in the second series. The first one is just a lot of me talking with no backround images.
I have a question - how we would get the same shortlinks "abc" at 5:58 if we are using hash function( userid, timestamp, long url). if hash function is generating the same value then wouldn't it be resolved by probing? Also hash function has different values in it like userid, long url.
There are only so many short URLs, so even if you have different inputs they could theoretically hash the the same one. And yes, this would be resolved by probing, but in order to actually know that the key is taken and we need to probe, we have to be sure to lock on that row so that we don't have two users grabbing a key at the same time.
@@jordanhasnolife5163 Thanks! I watched many videos from this channel and all are great!
@jordan do you have slides for the problems, just like you have for the system design concepts? Thank you, keep rocking!
Yep! I'm waiting until the series is done to get them all posted in batch
Hello Jordern is it possible to get these slides (or whatever they are). It will be very helpful for revision. Thank you :)
Eventually (when I am done with this current series) I will upload everything in batch
- Jordern
Why do you think hash indexes are only in-memory? Afaik, for example postgres supports materialized hash indexes
Even if they weren't in memory, the performance would be bad due to random seeks. There do seem to be augmentations of normal indexes with hash ones, but those are gonna be in memory.
@@jordanhasnolife5163 afaik, hash indexes used to perform worse in postgres because they were not optimised. They are now, and should outperform b-trees for equality searches. In a b-tree for those searches you are still paying O(logN) every time
Can you explain a bit more about the technology choices? you picked mySQL as it has all the requirements you came up with. but wouldn't scaling be easier with something like mongodb compared to mySQL. mongodb also has indexes. so wouldn't mongo be the choice in terms of faster development
Can you clarify your assumptions here? Why does mongo lead to faster development? Why does mongo scale better?
I think that people often say this about NoSQL databases without much evidence to back it up, so that's going to be my retort for you here.
Generally, most developers are familiar with SQL databases, and I believe that I can get away with using the relational model here, both in the shape of data, and in the sense that I want the ability to have ACID transactions and use single leader replication. Could I use mongo? Sure - but I don't see any clear advantage that it has over MySQL when both work perfectly fine in this situation.
Heyy @@jordanhasnolife5163, a beginner here, what about the argument where people say that NoSQL is easier to scale horizontally? I always get a bit confused on this point, I think scaling horizontally relates to since we have data locality in NoSQL, we don't need to perform joins across nodes.
But do I definitely choose SQL vs NoSQL? Should I always go for SQL, if it fits, choose it, else try with Nosql?
btree is not scalable, and kafka uses LSM tables as its main system, and is highly scalable. so using ULID or UUIDs and replicating over a kafka queue might increase capaity without degrading your performance i think (although am not an expert in this system type)
1) I don't know where you're getting that Kafka uses LSM trees, because it's basically just writing to a commit log, therefore not requiring any sort of indexing
2) What makes you say that B-trees aren't scalable? And what do you mean by "not scalable"?
Let me know if I'm missing anything!
Kafka does not use LSM tree or any other B tree index
@@shuaizhang6502 Agree
Hey Jordan, In the case of one publisher Per Row(29:34) you mentioned that we will have one spark streamer per tiny url id. But here we have trillions of URLs how is that feasible to have trillion streamers for this case?
Hey Rishabh! What I meant there is that the data for each tiny url id belongs on one spark streaming node, but you can have a spark streaming node responsible for many tiny url ids. You basically partition the consumers/kafka queues on tiny url id.
Was exactly my doubt too. Thanks!
Hi Jordan, thanks for this video! I dont get why we need to worry about two people adding rows with the same primary "tinyUrl" key, doesnt every ACID compliant db prevent adding two of the same row like that or am I missing something very obvious??
Yup! (Admittedly I don't know how this would work when many partitions are involved), but this is how they'd do it
Sorry I still dont understand 😢 I was referring to 11:15 where it talks about locking for adding new rows, I thought databases already did this, so why do we need to worry about it?
@@OF-nf2cb Because it's a systems design interview, and we should know how things work :)
@@jordanhasnolife5163 Ah I see so you were talking about how dbs internally handle it? I was thinking you meant we were implementing that
@@OF-nf2cb Depends on the db :) as you mentioned, most SQL ones will get it done for ya via primary key constraints