1: TinyURL + PasteBin | Systems Design Interview Questions With Ex-Google SWE

Jordan has no life

มุมมอง 55 041

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 25 ก.ย. 2024
38 minutes in one go, my ex girlfriend punching the air right now

ความคิดเห็น • 279

@prashantgodhwani6304 10 หลายเดือนก่อน ⁺¹⁰⁷
I swear this channel is such a hidden gem.
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺¹⁴
Thanks man! Hopefully not too hidden haha
@ronakshah725 3 หลายเดือนก่อน
I think this attracts a particular kind of readers. Those looking for more in-depth and first- principles understanding vs memoization. I think you also have a good balance of real use case driven thinking and hand waving where necessary. The latter is crucial and people ( including me at times) tend to overlook the importance of hand waving. Not everything deserves the same level of detail. But knowing which does is the game changer! Anyways, I will be subscribing and reviewing a good portion of this 😊
@dwivedys 3 หลายเดือนก่อน
You stole my thoughts! What is this man!!! Amazing…brilliant
@jporritt 6 หลายเดือนก่อน ⁺²⁵
I've gone through a lot of videos on this subject and this is by far the best. Not just regurgitating what they've read elsewhere, but actually coming from a place of proper understanding.
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺¹
Thanks James!
@gchris6434 6 หลายเดือนก่อน ⁺⁸
Great content! Probably the most in-depth TinyURL design I've seen. Most others don't actually get into write conflicts/locking.
@brandonwl 10 หลายเดือนก่อน ⁺¹¹
I appreciate the new 2.0 video style and how in depth it is. I found the older videos too brief considering the complexity of the systems they covered.
@hzoe 2 หลายเดือนก่อน ⁺²
You are so underrated brother... this is the best content on the internet regarding System Design interviews 100%... who cares about an interviewer saying 'yes' to every question that the interviewee asks, or an interviewee coming up with the same 10 boxes every single System Design question when you explain stuff like this... keep it up, you are saving lifes!!
@gopald9962 7 หลายเดือนก่อน ⁺⁶
You are very good, the best even, at driving the solution of a design problem by applying logical reasons rather than throwing something at my face i wouldn't understand. Thanks for existing!
@ww12835 7 หลายเดือนก่อน ⁺⁷
Thank you very much, I’ve been reading DIDA and Understanding Distributed Systems so it’s great to see how those concepts apply to the use cases you are showing.
This is the first video of yours that I’ve seen yet, planning to watch all the series when I can because you highlight the questions that we should be asking when designing and assert the tradeoffs of different solutions, it’s amazing.
I don’t know if you have done for the other videos but if I could ask for anything would be for you to change the pencil color when reviewing the flows of data, it makes it easier to find where you are pointing to in the drawing board.
Again, thank you very much, these videos are really helpful.
@jordanhasnolife5163 6 หลายเดือนก่อน
Sounds good - Ive gotten this suggestion a couple of times now so I'll give it a shot at some point
@firezdog 4 หลายเดือนก่อน ⁺³
I really like the format for these types of videos of starting with a naive solution and then iterating on each component as we discover performance problems. That way the initial db schema and the overall graph of how requests / responses implement the features can be clearly presented to the viewer + the viewer gets a conceptual framework to start evaluating the ways in which the simple picture needed to become more sophisticated.
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
Thanks for the feedback!
@slover4384 3 หลายเดือนก่อน ⁺⁵
At 12:34 -> You do not have to do any predicate query. Relational databases use B-trees and are not append-only like databases that use LSM trees.
So an "INSERT" will fail if another INSERT beat it to the race because the 2nd insert will notice the primary key in the B-tree. This sort of "INSERT and fail if primary key exists" is very hard to accomplish in LSM based DBs as they are append-only. In fact to accomplish this in an LSM based database, you usually have to do a special slow instruction that involves multiple phases and uses something like Paxos or raft or 2PC.
So with a database like postgres, one solution to this TinyURL is to actually just have different application servers INSERT new rows. If there was a race on the exact same primary key, the 2nd application server will get a "duplicate key exists" error and will know to retry or send the error to the client.
@jordanhasnolife5163 3 หลายเดือนก่อน ⁺³
Yeah - I probably shouldn't have included this section in the video since it's just thrown people off. I meant to say that predicate locking was one way of accomplishing this, but like you mentioned in SQL databases they're gonna get this done for you on the primary key.
Interesting stuff on the LSM tidbit, that's new to me! I imagine if the key was still in the memtable perhaps we wouldn't need too much additional complexity, as we'd see it was there, but had it been flushed this could get complex.
Thanks for the insight, and I appreciate you pointing out these flaws!
@mytabelle 7 หลายเดือนก่อน ⁺³
Hey, first of all thank you for the videos. They've been really helpful for me prepping for an upcoming interview!
I watched a separate video on tinyURL/pastebin link generation, and the main point of the video was that instead of hashing, which can lead collisions when we truncate the MD5/whatever hash, like you mentioned in the video. Their suggestion was that we could have a cluster of key generations/ID generators, and each one would have a range of URLs to distribute when a write request is made. They could be coordinated by ZooKeeper and would allocate only from their own chunks, and would refresh their URLs as needed. The only concern is that we lose that range of URLs (say 1 million) if a specific generator goes down, but since we have 2 trillion combinations if we use base 36 with 8 characters and even more if we use base 62, this isn't a major concern. This would also speed up writes in the case there is a collision.
What are your thoughts on this?
@jordanhasnolife5163 7 หลายเดือนก่อน ⁺¹
Hey yeah, I touch upon that in this video. That's basically the key generation service where you partition the keys and assign in sequence.
That works too! There may some lock contention within a partition but if you have enough partitions such that it doesn't matter that's fine too!
@abhimanyushekhawat2626 20 วันที่ผ่านมา ⁺¹
My sister catches me binge watching Jordan again and says:
"Just like Jordan, you have no life too 🤣"
@gammagorom 10 หลายเดือนก่อน ⁺⁷
Thank you for all the system design content! I’ve been binging through the 2.0 playlist and this video does a great job putting all the pieces together. I don’t have any experience with system design interviews so I have two questions in particular:
1. If in an interview, someone were able to deliver a design with the same level of depth + complexity that you’ve described here, what level would they be placed at? Alternatively, what sort of depth would interviewers be looking for in say an L4 interview?
2. Many of the system design interview guides recommend to lay out a basic architecture first and then go into a deep dive at the end but in your videos, you weave both parts together. How would you recommend to split these parts up? For example, is database schema + choosing a type of database part of the deep dive or should it be part of the basic architecture? And how about for replication/partitioning?
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺⁶
1) At the end of the day, you're interviewing for a certain level, so you'd be placed at that level. I don't really think uplevels are ever going to happen, but failing a systems design interview might lead to a downlevel. I like to think that these videos are sufficient to pass just about any systems design interview, considering the results of my viewers, but I can't be sure, as I'm not an L5 and haven't had the test to try that theory out haha
2) Yeah, and I used to do my old videos this way. However what I've come to realize is that one's choice of database schema is intimately tied to their database architecture choice. So if I just list a schema outright, but don't go into what architecture I'm using, it might not make sense. In an actual interview, it may make sense to list a schema first, and then say you'll explain it later, but my main objective of these videos is to make them as clear as possible - so I hope that my choice to do this makes sense.
@YeetYeetYe หลายเดือนก่อน ⁺³
Good content technically speaking, but a bit removed from an actual System Design interview. I found HelloInterview to be more geared towards passing the actual interviews (I swear I'm not shilling, just giving you feedback because I genuinely love your channel for the technical aspect).
@jordanhasnolife5163 หลายเดือนก่อน ⁺²
I'm hearing this a lot these days. I could see it depending on the interviewer/level, but I'm pretty content to overprepare for these since I find it actually helps me at work too!
And no worries, you can have preferences :)
@lv0320 27 วันที่ผ่านมา ⁺¹
Personally love the in-depth stuff and trade offs for each choice as that’s the expectations for very senior (L7+) roles.
@zalooooo 9 หลายเดือนก่อน ⁺³
Awesome video! Couple thoughts:
- For storage, it may be useful to talk about content addressable storage. Many pastes may be the same and that will save space
- I’m not sure what the CDN buys us here. Reading from s3 isn’t crazy slow but the big issue is that the object space is huge and my hunch is that most of the time, you’d get CDN misses in most regions. That is assuming you’re not storing every paste to your CDN, which feels expensive. This is more a not that a cdn feels like a premature optimization
- I really like your count approach. Another couple be using a counter CRDT and increment server side while serving the link. That being said, yours is more general purpose and scales for other types of analyitics
@jordanhasnolife5163 8 หลายเดือนก่อน
Thanks!
Good call on the first point, you could take some sort of hash to deduplicate.
For the CDN point, I mostly have it there just to talk about CDNs. In practice, caches seem to be one of those things that you just introduce as you need, so we'd have to see how our existing latencies were and then experiment here.
@andriidanylov9453 8 หลายเดือนก่อน ⁺²
Appreciate. It is a very serious description/implementation for such easy task.
@harikareddy5759 10 หลายเดือนก่อน ⁺⁴
Hey Jordan, awesome video as usual! I really appreciate the way you break down tradeoffs. I checked out the TinyURL/Pastebin design from grokking's course, but it felt a bit sloppy to me, your version is so much better :) thank you!
@jordanhasnolife5163 10 หลายเดือนก่อน
Thanks Harika!
@widididid8662 5 วันที่ผ่านมา ⁺¹
These are so sick! Much appreciated
@ninad11081991 10 หลายเดือนก่อน ⁺²
I'd love it if you could make content regarding what level of depth is expected from a Mid level SD interview vs Senior vs Staff!
It would be cool to take the same problem and do a mock at each level so we can understand the differences! Would love to see a mock series between you and Gaurav tackling this.
I've interviewed with companies where they haven't picked a level for me yet and they say we will determine it based on your performance. But I don't know what the expectation is exactly in the context of a system design interview.
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺¹
To be honest I just don't have enough context here as I haven't conducted interviews for a company, I do really wish I could say otherwise but at the moment my objective is to just go as in depth as possible.
@psychechip 6 หลายเดือนก่อน ⁺²
I fucking love how you can explain complex shit and be funny at the same time. Keep up the good work!
@kumarc4853 4 หลายเดือนก่อน ⁺¹
Learned a lot, concise, well explained, and also funny
I have a sys design interview today, thank you so much
@jordanhasnolife5163 4 หลายเดือนก่อน
Good luck!!
@advaitchabukswar4163 หลายเดือนก่อน ⁺¹
Once again, learned a lot and read and write scaling techniques were helpful in my system design interview. Another write scaling technique you can talk about is Log structured merge trees for o(1) write and O(logn) reads.
@jordanhasnolife5163 หลายเดือนก่อน
I wouldn't say writes there are O(1) since they also go into a tree, but it should be a much smaller tree and in memory, so maybe practically it is :)
@advaitchabukswar4163 หลายเดือนก่อน ⁺¹
@@jordanhasnolife5163 you are correct they are eventually not o(1) but if they are async writes, for your request sake, it's o(1) and at some point we add the entire log in sorted manner. Again, I am not an expert here so I could be completely wrong, lol. Please let me know if I am. Will do more research for complete understanding.
@jordanhasnolife5163 หลายเดือนก่อน
@@advaitchabukswar4163 I don't think it is asynchronous, the write to the tree is synchronous as far as I'm aware
@advaitchabukswar4163 หลายเดือนก่อน
@@jordanhasnolife5163 I see…
@vedant9173 2 หลายเดือนก่อน ⁺¹
Thanks for explaining the rationale behind your design decisions, very few videos do that
@idiot7leon 6 หลายเดือนก่อน ⁺¹
Brief Outline
00:01:15 Problem Requirements
00:02:04 Performance Considerations
00:03:23 Link Generation
00:05:45 Assigning URLs - Replication
00:08:07 Assigning URLs - Caching
00:08:51 Assigning URLs - Partitioning
00:10:34 Assigning URLs - Single Node
00:11:29 Assigning URLs - Predicate Locks
00:14:12 Assigning URLs - Engine Implementation?
00:15:42 Assigning URLs - Database Choice
00:16:17 Maximizing Read Speeds
00:17:35 Maximizing Read Speeds - Hot Links
00:18:50 Maximizing Read Speeds - Populating the Cache
00:21:02 Analytics - Naive Solution
00:22:22 Analytics - Stream Processing
00:23:49 Analytics - Click Consumer
00:26:00 Analytics - Exactly Once?
00:29:16 Analytics - One Publisher Per Row
00:30:17 Deleting Expired Links
00:31:21 PasteBin - Huge Pastes
00:34:23 Final Diagram
Thanks, Jordan!
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺¹
Thank you man! Legend!
@andreybraslavskiy522 5 หลายเดือนก่อน ⁺¹
Hi Jordan. You were talking about 16 TB of potential data, partitioning, single leader replication. Sounds like Mongo fits better than relational solution.
Thank you for great content!
@jordanhasnolife5163 5 หลายเดือนก่อน
I think they're more or less equally feasible for this one, feel free to go with mongo, basically personal preference at this point
@BRBallin1 3 หลายเดือนก่อน ⁺²
Using PH as your example URL makes you a legend
@jordanhasnolife5163 3 หลายเดือนก่อน
Legend is a strong word to use for what I am
@sarrahdoesthings 3 หลายเดือนก่อน ⁺²
I'm here for your eyes
@jordanhasnolife5163 3 หลายเดือนก่อน ⁺¹
👀
@Didneodl 10 หลายเดือนก่อน ⁺⁵
Wonder why key-value DB wouldn’t be the db choice. It doesn’t require data relation and we can quickly check if tiny url has been used and also supports fast retrieval
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺¹
Yeah I tried to say that this would be the best choice, but just doubt that storing all data in memory is feasible from a cost perspective, so wouldn't be surprised if your interviewer pushed you harder here.
@adi96adi 9 หลายเดือนก่อน ⁺¹
I was wondering the same thing.
Its weird how advice on costs usually comes up in the "what to say in the beginning" interview advice, but isn't mentioned more in technology tradeoffs. I also would've thought that non relational dbs are, for the most part, cheaper than relational ones, but I guess that's only until the point when they stop being caches and start being proper dbs.
Am I right or missing something?
@jordanhasnolife5163 9 หลายเดือนก่อน
@@adi96adi Yeah not exactly sure regarding the cost of non relational vs. relational dbs but I can say with certainty that storing in memory is more expensive and for something like tinyURL, likely to be overkill.
@lv0320 14 วันที่ผ่านมา ⁺¹
Thanks for the detailed breakdown for storing url click counts. A couple of questions
1. In a multi region/multi-DC world, assuming we use a single leader writes, there could be atleast two Spark streaming instances writing to DB for the same Key (In a 2 region world). To ensure correctness, do we need to store 2 Idempotent keys here per row? Alternately, I think Kafka mirroring can solve this by replicating clicks from multiple regions to be unified in one region so we can get away with a single idempotent key
2. If we extend this problem to like counts, where we store liked_by users in a DB and rely on the kafka + stream_processing for counts, What happens if Kafka is down? For correctness, do we want to fail the liked_by write to DB or is there a better way to recover counts without impacting like actions availability?
@jordanhasnolife5163 13 วันที่ผ่านมา
1) I'd say so, assuming they're going to the same database row
2) You could just use change data capture to put a write in kafka when a new liked_by user is written. Ideally, kafka shouldn't be going down. This is why we replicate things.
@lv0320 13 วันที่ผ่านมา ⁺¹
@@jordanhasnolife5163 I’ve seen cases where company internal/platform level issues can cause temporary Kafka outages. In those cases, I’m thinking if we can write the updates to a log file and when Kafka is back up, process log files to enqueue Kafka?
My motivation for the question is to ensure eventual correctness in counts if a major component such as Kafka fails
@jordanhasnolife5163 13 วันที่ผ่านมา
@@lv0320 Fair enough, I think a log file that eventually sinks to CDC when kafka is back sounds best. Perhaps look into debezium and see if that's something they already support
@lv0320 13 วันที่ผ่านมา ⁺¹
@@jordanhasnolife5163 makes sense, thank you
@scottmangiapane 4 หลายเดือนก่อน ⁺¹
6:40 never thought I'd laugh out loud during a system design video. Thanks for making my studying somewhat tolerable
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
Thanks Scott!
@2020goro 8 หลายเดือนก่อน ⁺³
11:21 hi here you mention about the race condition when 2 users would add/create for a same actual url, but I’m confused as to why would this happen if the hash functions accounts for the userID already when generating the tinyURL? Wouldn’t that make the tinyurls unique for 2 users even for same links?
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺²
It's super unlikely but as long as there are fewer hash combinations than userId and url combinations you can have some collisions
@thunderzeus8706 2 หลายเดือนก่อน ⁺¹
Hi Jordan, this is the best in-depth design video about tinyurl in any book or on any website. Thanks a lot!
I do have two questions:
1. I am confused by the discussion around the concurrent writes part. In the video you first ruled out the multi-leader and leaderless replication designs. Then you discussed how to handle the concurrent writes to a single row. But if we are choosing single-leader replication, would this still happen? Shouldn't the leader handle the concurrent writes with locking or queuing, and determine the "one with later timestamp" to be unsuccessful (any maybe then assign a different unused shortURL to it)?
2. In many other sources, there is the other way to assign the tinyURL, which is through distributing sequentially increasing numbers in base-36/62 via a ticketing server. The ticketing server can scale itself by preallocating different ID ranges. I am curious what kind of partitioning could this design have to handle large read and write traffic? Like if we use hash range partition, clearly all the writes will concentrate on one partition which is BAD. Of course we could do another hash on the shortURL and partition on that to make it evenly distributed, but hashing itself is not free, and adds to complexity. What's your point of view on that?
Thanks in advance!
@jordanhasnolife5163 2 หลายเดือนก่อน
1) Yes, which is why we use single leader replication with locking :)
2) Yeah, that's the key generation approach that we describe. Each database covers a range of keys, and hence we should get linear increases in speed as we add more nodes. I don't see why all writes will concentrate on one partition if we use a hash range, you can just randomly route each incoming request to one of the partitions and the partition will assign it a UUID.
@michaelparrott128 10 หลายเดือนก่อน ⁺²
One comment about the writer - if the writer is an API instance serving the requests to the user, and we're writing to the CDN/S3/DB, this could potentially take a long time and you'd likely want some kind of async mechanism to do these writes with a way of updating the user on its progress.
@jordanhasnolife5163 10 หลายเดือนก่อน
Agreed. That's potentially more of a client side change though, pretty sure that when uploading files to these services they expose the percentage completion
@isaacneale8421 3 หลายเดือนก่อน ⁺²
Great video, especially the discussion on single leader vs multi leader databases here.
A couple notes:
I don’t think write through for the CDN is a good idea. If you wrote a 10 GB file there, it would be at the expense of other files that may be getting accessed frequently. We shouldn’t give priority to a large unknown file over something that has been accessed and we won’t know will be popular.
This is assuming that CDNs have a bounded size. Maybe there is a fancy one out there that isn’t bounded.
I think you could make an argument that you won’t have records that are accessed for a long time after they are created.
@jordanhasnolife5163 3 หลายเดือนก่อน
Yeah agree with all said, I don't think we know in advance what's popular here anyways so a write around approach seems better.
@fane1159 3 หลายเดือนก่อน ⁺¹
FK , this man contents has its own system design . Love you buddy
@ihor_1995 2 หลายเดือนก่อน ⁺¹
Hi Jordan, great video! I'm not sure that we have to have uderId in the table, because the same long URL can be used by different users, and the eventual short URL (hash) will be the same.
@jordanhasnolife5163 2 หลายเดือนก่อน
While I agree with you, you want to be able to attribute clicks to the links of different users, so it may be useful to have different short urls for the same long url
@edcoronado 9 หลายเดือนก่อน ⁺¹
great video, man. a lot of system design videos are super generic with just "here's some lego blocks that I know off the top of my head" and don't start from a bottom-up approach with the system architecture designed to support a specific algorithm, not just a top-level cookie-cut use case.
@jordanhasnolife5163 9 หลายเดือนก่อน
Appreciate that man!!
@shubhamjambhe361 13 วันที่ผ่านมา ⁺¹
Jordan does deserve pat on back ❤
@cc-to2jn 10 หลายเดือนก่อน ⁺¹
oh wow u changed the original system design vids! Great I was thinking of reviewing soon :)
@jordanhasnolife5163 10 หลายเดือนก่อน
Welcome back ;)
@Ryan-sd6os 4 หลายเดือนก่อน ⁺²
hey jordan, curious, why do we avoid multi leader and use single leader when both can have stale data issue
@jordanhasnolife5163 4 หลายเดือนก่อน
I'm not concerned about stale data, I'm concerned with write conflicts, which would occur when two users both claim the same URL. Single leader replication doesn't have this issue.
@Ryan-sd6os 4 หลายเดือนก่อน
@@jordanhasnolife5163 thank you legend.
@krish000back หลายเดือนก่อน ⁺¹
Awesome Content. I wasn't expecting this much detail in url shortener.
Two questions though:
1. How does partition to kafka will be done? Ideally, servers need to know which kafka patition the message should be sent to and also, some coordination service would be required to maintain that?
2. For writing large files for Pastebin. CDN is a cache and cache will be have eviction policy, so if any url will be hit with gigantic file, which is not used for days and someone hit it after lets say weeks. Considering, we use LRU, it might get evicted and CDN anyways would need to load it again. Is the only reason to put it in CDN first is to make sure popular urls will be in CDN itself and when even first few users hit it, they all will get it pretty quickly?
@jordanhasnolife5163 หลายเดือนก่อน
1) Yeah kafka employs zookeeper internally. You can just tell it a "partition id" you want to read and it figures out the rest IIRC.
2) Yep basically! If it hasn't been hit in a while, no point of keeping it in the CDN anyways.
@namanagarwal2565 หลายเดือนก่อน ⁺¹
For storage engine, the reasoning to discard hash indexes is flawed. We are NOT storing 1PB in hash index, we only need to store keys (1T * (8 bytes for hash key + 32 bytes for pointer to value on DB) = 40TB). 40TB still is lot, hence we can discard it.
@jordanhasnolife5163 หลายเดือนก่อน
In retrospect, I do actually think we can get pretty far with the hash index, it's pretty unlikely we'll hit that scale irl lol
@harshdalal2891 7 หลายเดือนก่อน ⁺²
Why do you need predicate locks for solving inserts in MySQL DB ?
Why can't we just use primary key constraints / unique key constraints given by DB to achieve uniqueness ?
@jordanhasnolife5163 7 หลายเดือนก่อน
To be clear, we totally can, I just would imagine that under the hood that has to use some sort of predicate locking so that we don't create two rows with the same primary key
@SatinderSingh71 10 หลายเดือนก่อน ⁺¹
This series is definitely in a league of its own. Who da man! Jordan da man - Fellow dork
@cole_dm หลายเดือนก่อน ⁺¹
Can you explain a bit more why an on-disk hash index would not work for this problem?
I understand that on-disk hash indexes will suffer from poor data locality and disk I/O optimization, but in this case we don't care about range queries, so locality is not a concern. Also, it seems to me that we can cache disk pages in memory for a hash index just as we would for a B+tree or similar and achieve similar a similar probability for cache hits, as co-located shortURLs on disk have no increased probability of access.
At the very least, it seems like we could use a hybrid solution with a B+tree to index our on-disk hash table and achieve constant time value access once the hash table range containing our target key is paged into memory.
PS love your channel dude really interesting stuff 🫡
@jordanhasnolife5163 หลายเดือนก่อน
In this case it would work great if you keep the hashing part of it in memory and store the address on disk as the value. Agreed!
@mmdip2001 2 หลายเดือนก่อน ⁺¹
Thanks for this video. One of the best I have seen in this topic. Question for you: You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. How do we know in advance which bucket we will be generating the key for?
@jordanhasnolife5163 2 หลายเดือนก่อน
You take the hash of your inputs and then use some load balancer which listens to a service discovery layer (zookeeper) to go to the right node
@sahilguleria6976 หลายเดือนก่อน ⁺¹
Superb video, quick questions
Are you storing all possible tinyURL combinations beforehand in DB OR after hashing longURL + timestamp + userID tinyURL is inserted into respective partition?
And why the click counter is triggered from the reader's side? why not when we read the DB?
@jordanhasnolife5163 27 วันที่ผ่านมา
I discuss the pros and cons of doing both. If you're using a SQL table, you can probably store them ad hoc.
I don't know what you mean "triggered from the reader's side". It is incremented when we read the database. However, in order to avoid lock contention, we perform the counting asynchronously.
@choucw1206 10 หลายเดือนก่อน ⁺⁴
Great video! However, can you elaborate the following things related to the short URL generation?
1. Most of the common hash function, e.g. MD5, SHA-2, generate the string way longer than you need. How do you use it while still keep the hash collision low?
2. Can you elaborate how to detect the hash collision? You mention using the linear probing so it kinda imply you want to do it in the memory, so my guess is the service probably need to first query the DB to check if this hash/key already exist. Won't that be a performance concern?
3. I've read a few other designs of generating the unique short URL, and some of them raise the issues of using the hashing of the original URL (with optionally including other data like timestamp). They usually suggest to use a separate unique ID generation service/component, and perform proper encoding like Base 62 and use it as the short URL(besides the domain name part). Have you ever considered this and what's your thought of this?
Btw, unique ID generation strategy in the distributed system may be an interesting topic to discuss as well if you haven't done it yet 🙂
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺²
I think you should just be able to truncate, if we found that this was leading to too many collisions we could always use 10 characters instead of 8 for example
2) oh on the contrary I'm basically proposing we use application logic that says if key x exists take it otherwise try to write key x+1 in the db. This could be done a bit faster with something like a stored procedure.
@jordanhasnolife5163 10 หลายเดือนก่อน
3. I don't really see the difference, feel free to elaborate on how this improves things!
@SatinderSingh71 10 หลายเดือนก่อน
@@jordanhasnolife5163 Grokking says the key gen is great at solving hash collisions since everything is created beforehand on an offline process. It will create two tables in your DB (used & unused). The application can periodically grab a bunch of unused keys from the key gen service (KGS). Once used, app will update KGS to move entry into used. That's their reasoning for it anyways. Yours makes sense as well to just populate db with everything from get go since its only a few TBs which is nothing.
@varshard0 9 หลายเดือนก่อน
@@jordanhasnolife5163generating a unique id in a sequence, then create a base 36-64 string as a shorten URL allow you to not having to deal with a collision ever.
Running number is guaranteed to be unique unlike hash. However, it run the risk of being guessable and could be ddos, by making 2 shorten URLs that point to each other.
@adithyasama7558 6 หลายเดือนก่อน
@@jordanhasnolife5163 so each url-assign-node will be assigned a range for example (0000 - 5555). This way no 2 nodes will ever try to insert the same short url. Each node will just increment its local counter when ever it's creating a new short url. This way theres no conflict resolution to be solved.
@fillywinks 10 หลายเดือนก่อน ⁺²
Great video. I wonder if it would be considering overengineering to go for a keyed time windowed stream in flink, sink each trigger to the db idempotently (semi-event sourcing style) and trigger a reduce on the db once in x updates/once in a while (something like what rocksdb does).
You know what, I think I just answered my own question writing this comment 😅
@jordanhasnolife5163 10 หลายเดือนก่อน
Ha, I think you could probably do it, just not sure if it's necessary!
@ninad11081991 10 หลายเดือนก่อน ⁺³
Is it really worth the trade off to use a relational db here due to the fear of write conflicts?
We are giving up on a ton of write throughput by making this choice aren't we?
What if we instead not take the input of the tinyURL from the user, instead just take his longUrl, have our cassandra already prepolulated with tinyURLs and then round robin requests among our leaderless architecture?
@ninad11081991 10 หลายเดือนก่อน ⁺¹
Fun fact, Bloomberg actually rejected me because I used mysql in my system design interview for this problem
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺²
I think it's moreso about how you justify it. To clarify, I said to use single leader replication, not necessarily a relational db.
That being said, if you round robin requests around your leaderless architecture and two people have the same shortURL that you propose they could very well still take the same shortURL on different replicas which will lead to a write conflict.
@ninad11081991 10 หลายเดือนก่อน
@@jordanhasnolife5163 Makes sense! and Makes Sense. If I get asked this again, I'll bring up both considerations
@brandonwl 10 หลายเดือนก่อน
@@ninad11081991 it is very hard to shard sql dbs but much easier using mongodb/cassandra. For mongo db can use the same strategy as a sql DB shown in the video.
But for cassandra you need a separate KeyGen service since write conflicts can happen in cassandra. Do you think they failed you because they were looking for a cassandra solution?
@vedantshinde2277 8 หลายเดือนก่อน ⁺³
Great Stuff!! So, to maintain the unique constraint using predicate locks, serializable transactions will have to be used with first a select and then insert, right?
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺¹
Yep more or less!
@obidan2 7 หลายเดือนก่อน
@@jordanhasnolife5163 If there is already a unique constraint on the database for the short_url, I am confused as to why you would need to do anything in this case.
The database will return a UNIQUE_CONSTRAINT failure to whoever tries to INSERT after a short_url was already inserted, and that client can retry with an increased number.
Am I missing something?
@mcee311 10 หลายเดือนก่อน ⁺³
for multi leader / leaderless cant we resolve the issues of conflicting writes by using consistent hashing?
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺²
You could say something along the lines of "for key x only write to replica 1 of partition 3". But then you've basically just re-created single leader replication.
@PalanivelPurushothaman 2 หลายเดือนก่อน ⁺¹
Great Video, Have a question 29:51 , it is exactly 1 idempotency key per row and not few if we partition the kafka topic by short URL. can you please clarify.
@jordanhasnolife5163 2 หลายเดือนก่อน ⁺¹
Yes, I was saying this is if you don't partition by short URL
@Luzkan 7 หลายเดือนก่อน ⁺¹
Sup Jordan! After 2:35 I went ahead and made a design too and I'm looking forward to hear an opinion about my thoughts, feel free to confront and bully me based on my decision.
You have hard constrained the choices around no write conflicts (with the example of an identical hash for two different urls by two users made simultaneously). I thought, that situation could be extremally rare in real life (and even if - we could add extra complexity on how we generate tinyurl to assure its uniqueness, with a salt maybe, or a distributed counter, but tbh, only after such event would happen considerable number of times). Thus, I wouldn't want to discard options based on that edge case. That enables the Cassandra option, which looks very nice for a system which is in-fact read heavy as you said, but! each tinyurl visit (which is a read) is also a write because of inc() counter requirement.
I really like your idea of going with mini batch updates and Spark Streaming, though. I think this is also a good design, and I'm not holding a strong opinion on either of those, and now what? :D
Cheers! Thank you for the content, keep up great work!
@jordanhasnolife5163 7 หลายเดือนก่อน
Yeah! I think Cassandra could be interesting here, assuming that you used a leaderless counter (so basically a CRDT), and I agree that IRL conflicts should be relatively rare. Since there are a lot fewer hashes than there are URLs different sites may hash to the same result but ideally by then enough time will have passed that Cassandra could have performed anti entropy so we could see that the hash was taken and then probe to another.
Thanks, I appreciate the kind words!
@dhruvpatidar359 3 หลายเดือนก่อน ⁺¹
That Google logo and the stand behind looks like ghost to me 😰😱
@Henry-wq7ki 4 หลายเดือนก่อน ⁺¹
If we are single leader, why do we need predicate lock?
Only one client will accès the row, right ?
@jordanhasnolife5163 4 หลายเดือนก่อน
Not if it doesn't exist yet!
@truptijoshi2535 4 หลายเดือนก่อน ⁺¹
Hi Jordan! Great video. I have a question about probing. Should the application logic check if the hash is available? Since we are creating a hash based on the long URL + userID + timestamp, why would there be any conflict? Should we also think of UUID or Twitter's snowflake or Base 62 conversions to avoid probing? Could you please explain more about the Hashing function?
@jordanhasnolife5163 4 หลายเดือนก่อน
Just because those are our inputs to the hash function doesn't mean we'll generate a unique output. If we only have 6 characters in our short link, for example, some things will hash to the same value.
@akbarkool 6 หลายเดือนก่อน ⁺¹
For the issue with multi leader potentially causing write conflicts, a simple fix could be to hash both the incoming url and outgoing url together. That would leave the chances of same hashed value being infinitely less unless the business is also looking up p0rn sites
@jordanhasnolife5163 6 หลายเดือนก่อน
Haha yeah I think another thing to note here is the range space. We can hash on lots of fields but if we don't have that many hashing combinations we'll have collisions
@truptijoshi2535 2 หลายเดือนก่อน ⁺¹
You have drawn multiple instances of kafka in the same partition. Does it also follow Single leader replication? If not, how does it determine which click count event should go to which replica of the same partition?
@jordanhasnolife5163 2 หลายเดือนก่อน
As far as I can tell I only drew one Kafka instance per partition, but yes this would use single leader replication
@MasterRace023 3 หลายเดือนก่อน ⁺¹
At 28:55 why do we need to store an extra idempotency key for each additional SS consumer? im not quite understanding that part
@jordanhasnolife5163 3 หลายเดือนก่อน ⁺¹
Well each one is basically simultaneously aggregating some of the counts of people that have clicked our link. If one of them goes down, and then comes back up, and tries to hit the database again with an operation to increment the counter for our link, we need to make sure we haven't already applied that operation (and then spark streaming could have gone down before the ack from the DB got back to it). So using an idempotency key would allow the database to say, update y from consumer z has already been applied, don't do it again.
@firezdog 4 หลายเดือนก่อน ⁺¹
For the caches, since we want to handle hot-keys, why not use LFU as opposed to LRU for the eviction policy?
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
I think they both have their pros and cons here, but a URL that was popular a year ago but nobody looks at now is still technically pretty "frequently used"
@jerryhuang472 6 หลายเดือนก่อน ⁺²
Thanks for the vid, wondering what if the URL read succeeds and the write to the Kafka queue doesn't? Is there any decent way to to solve this, I know you mention 2p commit a bit, but was wondering if there's anything faster.
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺²
Nope, unfortunately nothing better than 2pc. If you figure it out though let me know, we'll be billionaires!
@moonbiscuit8742 10 หลายเดือนก่อน ⁺¹
Just found the channel, and now watching all your videos from a year ago. Appreciate that you're going through the rationale and building intuition about these concepts. Also, what would be you're top 3 book recommendations?
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺⁴
DDIA, 50 shades of gray, the art of the deal
@jordanhasnolife5163 10 หลายเดือนก่อน
Thanks for the kind words!
@sachin469D 10 หลายเดือนก่อน ⁺³
Been watching all your videos.. Can you do a video on Lucence OR how in general how search index work with a given edit distance?
@Saurabh2816 10 หลายเดือนก่อน ⁺¹
yeah interested in apache lucene, elastic search and searching indexing
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺¹
I already have! th-cam.com/video/ty9DQhM32mM/w-d-xo.html&ab_channel=Jordanhasnolife
Didn't go into Levenshtein distance indexing, I think it's probably too specific for the details of this particular channel but I'm sure there are some good resources explaining it somewhere online!
@kanani749 5 หลายเดือนก่อน ⁺¹
Hey Jordan,
Great video. Wanted to ask how analytics calculation would work in this case if in the system we were to use Atomic write operations (I am referencing DDIA Chapter 7 - Preventing Lost Updates - Atomic Write Operations). From my understanding we are basically trying to implement a read modify write cycle and it would still boil down to a lock which you mentioned.
If we designed a distributed Redis store partitioned by URL to handle the counter incrementing would the main issue at hand be maintaining an accurate count during failover? Could using a consensus algorithm help with solving with this issue or would this reduce performance and throughout too much?
@jordanhasnolife5163 5 หลายเดือนก่อน
We can use atomics, but atomics are built around grabbing locks. If we have too many people incrementing the same counter, we could have contention. In my solution, I batch all of the upgrades, and increment them in intervals, via a real time stream consumer. By having just one consumer per URL, I don't need to worry about locking.
Since redis is single threaded, I suppose your idea works. That being said, again, we are limited by the throughput of the redis node. While a consensus algorithm would ensure that we don't lose any writes, I'd agree with you that it's probably too slow and doesn't make sense here.
@dmitrigekhtman1082 8 หลายเดือนก่อน ⁺¹
If you write to S3 first and then the DB write fails, you’re left with untracked orphaned junk data from the failed request. Another approach would be to write to the DB first to persist the intent to upload the pastebin, do the rest async, and make sure to communicate something reasonable to the user while the upload is in progress.
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺²
While this is true, I prefer orphaned junk to the alternative of a paste that links to nothing. I could just have a cleanup job run in the background once a month!
@dmitrigekhtman1082 8 หลายเดือนก่อน ⁺¹
A paste that links to nothing is definitely bad. Recurring cleanup works and is necessary in any case, at least as a fallback for bugs. But it's more reliable to have your normal path be to write something to the DB first [chunk_id: xxx, status: need_to_upload], then take (idempotent) action [upload to S3 if not done already] based on the persisted goal state, then record that you're done [chunk_id: xxx, status: uploaded]. Less things to clean up this way. The relevant buzzword is reconciliation.
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺¹
@@dmitrigekhtman1082 Seems like a fair point to me!
@vipulspartacus7771 4 หลายเดือนก่อน ⁺¹
Hi Jordan, I have a few questions on how this can be deployed globally. Suppose you need to generate URL's for different users across the world. Then you will need to deploy the generator and datastores in different regions and generate unique hashes for each region. If not, there could be hash collisions which will require you to check different data bases in different continents. Maybe we need to generate a UUID kind of solution, so that each region can have its own local store which can create and serve the content better. Also, the view count you might be getting it from different continents. Even in this case, you need to have some sort of sharded counter type of thing to consolidate counts from different regions. What are your thoughts.
@jordanhasnolife5163 4 หลายเดือนก่อน
I'd use a key generation service with a db leader in each region, and aggregate counts using stream processing via Kafka and flink, as that can take as long as I need. You can have DB follower replicas in other regions for faster reads.
@vipulspartacus7771 4 หลายเดือนก่อน
@@jordanhasnolife5163 by key generation service do you mean something that will append a region specific code to the hash, otherwise, how can the key generated be unique, as verifying that it is not conflicting with other urls in other regions will take time.
@nobodyspecial946 6 วันที่ผ่านมา ⁺¹
Hey! You won't believe but I need to make Tiny URL Service for the company which I work for! Any additional advices would be welcome! :) Thank you for your video
@jordanhasnolife5163 6 วันที่ผ่านมา
My only advice is to not plan for scale you won't have :)
@5atc 9 หลายเดือนก่อน ⁺²
When you say you would use database partitioning, do you actually mean sharding? How do you approach discussing these two topics during an interview?
@jordanhasnolife5163 9 หลายเดือนก่อน
I've never really known them to be different terms
@ilqarilyasov 5 หลายเดือนก่อน ⁺¹
5:27 you mentioned that the hash function provides a unique value. Are you sure? Doesn't the hash function of a hash table typically return an index to map keys to indices in the underlying array?
@jordanhasnolife5163 5 หลายเดือนก่อน
A hash function is any function that deterministically maps some element of a domain to a different element of a range
@Anonymous-ym6st 5 หลายเดือนก่อน ⁺¹
Another question: when mentioning a few TB should be good for one single machine, I am wondering in practice what storage usually can be done in one machine? a few hundred TB?
@jordanhasnolife5163 5 หลายเดือนก่อน
Depends how you're storing the data, but just look up the prices of hard drives and look up how many you can fit in an average server
@ariali2067 6 หลายเดือนก่อน ⁺¹
Hey Jordan, you mentioned that we shouldn't use write back cache b/c that will incur the same issue as multi-leader/leaderless replication. Curious why we cannot use Redis for write back cache here? Redis supports single leader replication right? Thanks in advance!
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺¹
Hi Aria - you can use redis as a write back cache, and redis supports single leader replication.
But a write back cache means that you write to a cache first and eventually flush that to your database.
If you haven't watched my playlist prior to this one, I think you may get some benefit from drilling down those concepts first! These videos should then all make more sense.
@ShivangiSingh-wc3gk 2 หลายเดือนก่อน ⁺¹
In interviews, do we have to talk about all the ways we can do a part. eg all ways to hash/partition/caches
@jordanhasnolife5163 2 หลายเดือนก่อน
I guess discuss with your interviewer, but it's not unreasonable to touch upon why you wouldn't want to do a particular approach
@radosawmul8963 5 หลายเดือนก่อน ⁺¹
What strikes me is a question if kafka can actually handle 2 trillion partitions 👀 for example firehose can handle max of 5k partitions per delivery stream. I think some more hybrid approach would be in order here.
@radosawmul8963 5 หลายเดือนก่อน ⁺¹
29:40 Unless you didn’t mean to have a partition for each url, but just a range-based partitioning. Because that was kinda clear only in the last architecture diagram, not during analytics deep dive. If so then all good 😅
@jordanhasnolife5163 5 หลายเดือนก่อน
Yep you got it!
@rakeshvarma8091 5 หลายเดือนก่อน ⁺¹
Great Video Jordan.
I have couple of questions.
1. You are saying we will partition by shortKey url where in keys from A-M will go to one node and keys from N-Z will go to another node. But why do we need to do this kind of partitioning ? Consistent Hashing offers a good evenly distributed keys among available servers isn't it. Is it wise to use ConsistentHashing everywhere i.e., For DB partitioning, Cache Partitioning & Kafka Queue Partitioning.
2. At th-cam.com/video/5V6Lam8GZo4/w-d-xo.html, Can you shed some more light on the Idempotency key for achieving the Exactly once processing. How does keeping the last written key will help in avoiding duplicate updates for clicks ? Because we might get the clicks for that same key later as well right i.e., Subsequent clicks on the url ? How do we differentiate between these two ?
@jordanhasnolife5163 5 หลายเดือนก่อน
1) We are using consistent hashing, albeit on the range of short keys. That's because the short keys themselves are hashes, no need to hash them again (there's no reason you couldn't if you wanted).
2) Each batch of clicks that we handle in spark streaming gets assigned a random key id (you can hash the incoming data or something so it is deterministic), called an idempotency key. When we change the value in the db, we check (in the clicks database) to make sure we haven't seen that idempotency key so that we don't accidentally update our db twice.
@matthewsaucedo2471 10 หลายเดือนก่อน ⁺²
25:25
What is the difference b/t Spark Streaming and Flink, if you were to just use a larger batch in Flink that your example (you said 10, but consider we use a more comparable 100)?
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺¹
I'd say none then really, just wanted the opportunity to use something other than flink lol
@matthewsaucedo2471 10 หลายเดือนก่อน
I’ve never used either technology (only Kinesis streams from AWS) so was just curious!
@svar1938 25 วันที่ผ่านมา ⁺¹
I have one question after the reading educative IO site on the same topic. The recommendation there is about precomputing the shortURLs ahead of time (using some out-of-band batch process). What are the potential pros and cons of using the batch approach for write vs real time write?
@jordanhasnolife5163 23 วันที่ผ่านมา
You're gonna have to elaborate a bit more for me to answer this one, because I'm not sure what out of band means off the bat.
@svar1938 12 วันที่ผ่านมา ⁺¹
@@jordanhasnolife5163 - Ty for replying. I am trying to ask if there is pros and cons of having the URL generation as a non real time process. Say we precompute URL's ahead and then when write requests come just give out a short URL instead of computing on the fly.
@jordanhasnolife5163 10 วันที่ผ่านมา
@@svar1938 Ah, I just don't know if the computation of "next URL" is an intensive enough process to make sense in this case. I suppose it saves you an index read.
@tunepa4418 6 หลายเดือนก่อน ⁺¹
How can we achieve or guarantee uniqueness of the tiny url field across multiple shards. Wouldn’t this be too expensive especially if we are using predicate locks
@jordanhasnolife5163 6 หลายเดือนก่อน
You shard by the tinyurl ID so that you don't have to!
@bennyamor 5 หลายเดือนก่อน ⁺¹
Creating a partition for each short URL in Kafka could be problematic- large number of partitions in Kafka can lead to increased overhead in terms of memory usage, disk space, and metadata management,Each partition in Kafka consumes system resources such as file descriptors, memory, and network connections,Management Complexity.....
@jordanhasnolife5163 5 หลายเดือนก่อน
Not proposing this - when I say partition by x it's me using shorthand for partition by hash range of x
@bennyamor 5 หลายเดือนก่อน
@@jordanhasnolife5163 thanks
@truptijoshi2535 2 หลายเดือนก่อน ⁺¹
Hi Jordan!
If we are pre-generating the keys, then we cannot use user-ID and long url for the hash right? Then how do we generate 8character unique short url?
@jordanhasnolife5163 2 หลายเดือนก่อน
I'm not sure what you mean when you say pregenerating keys, presumably you mean the materialized rows approach. In this case we do no hashing and just randomly assign a write to one of the partitions.
If you meant the hashing approach we hash the user id and long URL and deal with any collisions via probing.
@truptijoshi2535 2 หลายเดือนก่อน
@@jordanhasnolife5163 Yes, for the materialized rows approach, how are we generating unique keys? Are we using UUID ?
@akhilmittalji6816 2 วันที่ผ่านมา ⁺¹
Do we have a video on phone billing for monthly and custom date generation? Also what devide are you using for whiteboard?
@jordanhasnolife5163 วันที่ผ่านมา
I do not have such a video. Sounds like another aggregation problem though with creating small windows of data!
I'm using an iPad, oneNote, and an apple pencil
@akhilmittalji6816 วันที่ผ่านมา ⁺¹
@@jordanhasnolife5163Thanks. Which iapd model is that?
@jordanhasnolife5163 17 ชั่วโมงที่ผ่านมา
@@akhilmittalji6816 an air perhaps
@truptijoshi2535 2 หลายเดือนก่อน ⁺¹
Also, could you explain the sequence of workflows for pastebin? Does it redirect to the long url and then fetches the paste content from CDN?
@jordanhasnolife5163 2 หลายเดือนก่อน
Yeah, perhaps after the service sees a certain number of loads for some paste it can place it in the CDN and replace the link for it in the database accordingly.
@tarunnurat 7 หลายเดือนก่อน ⁺¹
Hi Jordan, great vid. Just had a question. In your DB schema, you mentioned having an index on the tinyURL. But given that the tinyURLs in each row are unique, wouldn't they be the primary key for our rows? And if so, why would we add an index on a primary key?
@jordanhasnolife5163 7 หลายเดือนก่อน
Same thing basically - not all databases sort on that primary key I believe so I'm just saying this to be clear
@LeoLeo-nx5gi 10 หลายเดือนก่อน ⁺²
Firstly very awesome content, thanks a ton.
Secondly had a query regarding a point you mentioned on using Kafka queue for analytics where each short url would belong to a partition. i.e we might have 2 trillion partitions.
I think that would be very bad in general as that would increase the latency and would it even be possible to create that many partitions? (there should be some upper bound I think) thanks!!
@jordanhasnolife5163 10 หลายเดือนก่อน ⁺²
I think you could make that argument about any partitioning schema to be fair!
When I say partition by x, I typically mean partition by the hash range of x if that clears things up at all. My point is I just want all values for key x to be on the same node.
@LeoLeo-nx5gi 10 หลายเดือนก่อน
@@jordanhasnolife5163 ahh thanks this is helpful
@zhonglin5985 6 หลายเดือนก่อน ⁺¹
Hi Jordan, nice content! A question -- why would it be a problem if we use spark streaming to calculate total counts? You mentioned that specifically the it will fail if there are multiple spark streaming consumers. What does that mean?
@jordanhasnolife5163 6 หลายเดือนก่อน
I just meant that if you have many consumers for one url, now your counts are split between many consumers. That's ok, but you'll face contention when incrementing the counter in the db from multiple places.
@orhemi8076 8 หลายเดือนก่อน ⁺¹
What are the benefits gained from choosing SQL DB over noSQL? noSQL can scale horizontally much easier
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺²
What makes you say that noSQL scales horizontally much easier? This may be true when we have a lot of distributed joins to perform, but in the case in which we don't, I really don't see any advantage. We get equal amounts of data locality either way.
@Anonymous-ym6st 5 หลายเดือนก่อน ⁺¹
Thanks for the great content! I am wondering for the write-back cache and data consistency, is it possible to make sure the same user always write and read from the same cache? in that case will there still be any data consistency problem or anything I am not considering correctly? (apart from edge cases that the cache node is failing and they can only read from a new cache?
@jordanhasnolife5163 5 หลายเดือนก่อน ⁺¹
Yeah in theory if you use consistent hashing to put the same user on the same write back cache this should be ok, but as you mentioned if that cache goes down we're in trouble
@AizazShahid-ck8cn 2 หลายเดือนก่อน ⁺¹
If we are doing something(probing etc) to avoid collisions then why do we have to care what replication we use? As then there should be no conflict, no?
@jordanhasnolife5163 2 หลายเดือนก่อน
We still have eventual consistency. I can grab a short URL x on node A and you can grab it on node B, and we won't know that there's a write conflict until we synchronize.
@AizazShahid-ck8cn 2 หลายเดือนก่อน ⁺¹
@@jordanhasnolife5163 Wouldn't HBase make the most sense in that case?
@jordanhasnolife5163 2 หลายเดือนก่อน
@@AizazShahid-ck8cn I think that anything with single leader replication would. I don't think we get much benefit from being column oriented here, and if we're prioritizing for read speeds perhaps we're better suited with a B-tree.
@tmehmood007 5 หลายเดือนก่อน ⁺¹
How to find out when the tinyURL was accessed for the first time?
@jordanhasnolife5163 5 หลายเดือนก่อน ⁺¹
Every read has to access the database anyways more or less, just write the timestamp then and then don't overwrite it if it exists. You'll need to use locking here, but what you can do to avoid the head is first read the TS without the lock, and if null then grab the lock and update it in order to minimize contention.
@Henry-wq7ki 4 หลายเดือนก่อน ⁺¹
Can you please clarify how you assign short URL’s ?
The second option with materialized conflicts ?
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
Put some URLs on multiple database partitions, and have enough partitions that the increased throughput of many partitions is able to offset the lock contention required to assign the next URL (by primary key) to a given user that gets routed to that partition.
@Henry-wq7ki 4 หลายเดือนก่อน
@@jordanhasnolife5163 thanks
@vanderer_prashant 8 หลายเดือนก่อน ⁺¹
If we want to add a new feature to the system - to generate QR code instead of short URLs, what'd be the changes if any?
@jordanhasnolife5163 8 หลายเดือนก่อน
Yeah I'd say that's effectively the same. I'd imagine there's some form of string representation of a QR code and we would just store that as the primary key in the database instead of a short link.
@yashagarwal8249 7 หลายเดือนก่อน ⁺¹
Hey Jordan, love your content!
I'm a bit confused about when the S3 data is being used here, if we are always writing to CDN first and fetching from it as well. Is it a fallback if some data gets evicted from a particular edge CDN server?
@jordanhasnolife5163 7 หลายเดือนก่อน
Hey Yash! We aren't always writing to the CDN here! Rather, we use the CDN as a write around cache, where pastes that are accessed many times ideally will eventually be loaded in the CDN. For the first write, we go to S3.
@ishanvijay507 9 หลายเดือนก่อน ⁺¹
Hey man, Just came across this channel and i see two playlists for system design problems
Are the videos from first playlist good enough to cover ground for now until you keep coming up for videos in this playlist ? Just asking if there's any reason behind updating them as they were only 1 year old
@jordanhasnolife5163 9 หลายเดือนก่อน
Yeah I think so, but I'm just trying to be more visual in the second series. The first one is just a lot of me talking with no backround images.
@himanimadan3547 8 หลายเดือนก่อน ⁺¹
I have a question - how we would get the same shortlinks "abc" at 5:58 if we are using hash function( userid, timestamp, long url). if hash function is generating the same value then wouldn't it be resolved by probing? Also hash function has different values in it like userid, long url.
@jordanhasnolife5163 8 หลายเดือนก่อน
There are only so many short URLs, so even if you have different inputs they could theoretically hash the the same one. And yes, this would be resolved by probing, but in order to actually know that the key is taken and we need to probe, we have to be sure to lock on that row so that we don't have two users grabbing a key at the same time.
@himanimadan3547 8 หลายเดือนก่อน
@@jordanhasnolife5163 Thanks! I watched many videos from this channel and all are great!
@pranapavan 6 หลายเดือนก่อน ⁺¹
@jordan do you have slides for the problems, just like you have for the system design concepts? Thank you, keep rocking!
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺¹
Yep! I'm waiting until the series is done to get them all posted in batch
@satadhi 4 หลายเดือนก่อน ⁺¹
Hello Jordern is it possible to get these slides (or whatever they are). It will be very helpful for revision. Thank you :)
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
Eventually (when I am done with this current series) I will upload everything in batch
- Jordern
@fedpet 3 หลายเดือนก่อน ⁺¹
Why do you think hash indexes are only in-memory? Afaik, for example postgres supports materialized hash indexes
@jordanhasnolife5163 3 หลายเดือนก่อน
Even if they weren't in memory, the performance would be bad due to random seeks. There do seem to be augmentations of normal indexes with hash ones, but those are gonna be in memory.
@fedpet 3 หลายเดือนก่อน
@@jordanhasnolife5163 afaik, hash indexes used to perform worse in postgres because they were not optimised. They are now, and should outperform b-trees for equality searches. In a b-tree for those searches you are still paying O(logN) every time
@adithyasama7558 6 หลายเดือนก่อน ⁺¹
Can you explain a bit more about the technology choices? you picked mySQL as it has all the requirements you came up with. but wouldn't scaling be easier with something like mongodb compared to mySQL. mongodb also has indexes. so wouldn't mongo be the choice in terms of faster development
@jordanhasnolife5163 6 หลายเดือนก่อน ⁺¹
Can you clarify your assumptions here? Why does mongo lead to faster development? Why does mongo scale better?
I think that people often say this about NoSQL databases without much evidence to back it up, so that's going to be my retort for you here.
Generally, most developers are familiar with SQL databases, and I believe that I can get away with using the relational model here, both in the shape of data, and in the sense that I want the ability to have ACID transactions and use single leader replication. Could I use mongo? Sure - but I don't see any clear advantage that it has over MySQL when both work perfectly fine in this situation.
@sahilkalamkar5332 6 หลายเดือนก่อน
Heyy @@jordanhasnolife5163, a beginner here, what about the argument where people say that NoSQL is easier to scale horizontally? I always get a bit confused on this point, I think scaling horizontally relates to since we have data locality in NoSQL, we don't need to perform joins across nodes.
But do I definitely choose SQL vs NoSQL? Should I always go for SQL, if it fits, choose it, else try with Nosql?
@mickeyp1291 8 หลายเดือนก่อน ⁺¹
btree is not scalable, and kafka uses LSM tables as its main system, and is highly scalable. so using ULID or UUIDs and replicating over a kafka queue might increase capaity without degrading your performance i think (although am not an expert in this system type)
@jordanhasnolife5163 8 หลายเดือนก่อน ⁺²
1) I don't know where you're getting that Kafka uses LSM trees, because it's basically just writing to a commit log, therefore not requiring any sort of indexing
2) What makes you say that B-trees aren't scalable? And what do you mean by "not scalable"?
Let me know if I'm missing anything!
@shuaizhang6502 5 หลายเดือนก่อน ⁺¹
Kafka does not use LSM tree or any other B tree index
@jordanhasnolife5163 5 หลายเดือนก่อน
@@shuaizhang6502 Agree
@rishabhsaxena8096 8 หลายเดือนก่อน ⁺¹
Hey Jordan, In the case of one publisher Per Row(29:34) you mentioned that we will have one spark streamer per tiny url id. But here we have trillions of URLs how is that feasible to have trillion streamers for this case?
@jordanhasnolife5163 8 หลายเดือนก่อน
Hey Rishabh! What I meant there is that the data for each tiny url id belongs on one spark streaming node, but you can have a spark streaming node responsible for many tiny url ids. You basically partition the consumers/kafka queues on tiny url id.
@siddharthgupta6162 8 หลายเดือนก่อน ⁺¹
Was exactly my doubt too. Thanks!
@OF-nf2cb 4 หลายเดือนก่อน ⁺¹
Hi Jordan, thanks for this video! I dont get why we need to worry about two people adding rows with the same primary "tinyUrl" key, doesnt every ACID compliant db prevent adding two of the same row like that or am I missing something very obvious??
@jordanhasnolife5163 4 หลายเดือนก่อน
Yup! (Admittedly I don't know how this would work when many partitions are involved), but this is how they'd do it
@OF-nf2cb 4 หลายเดือนก่อน ⁺¹
Sorry I still dont understand 😢 I was referring to 11:15 where it talks about locking for adding new rows, I thought databases already did this, so why do we need to worry about it?
@jordanhasnolife5163 4 หลายเดือนก่อน ⁺¹
@@OF-nf2cb Because it's a systems design interview, and we should know how things work :)
@OF-nf2cb 4 หลายเดือนก่อน ⁺¹
@@jordanhasnolife5163 Ah I see so you were talking about how dbs internally handle it? I was thinking you meant we were implementing that
@jordanhasnolife5163 4 หลายเดือนก่อน
@@OF-nf2cb Depends on the db :) as you mentioned, most SQL ones will get it done for ya via primary key constraints

ต่อไป

เล่นอัตโนมัติ

2: Instagram + Twitter + Facebook + Reddit | Systems Design Interview Questions With Ex-Google SWE