1: TinyURL + PasteBin | Systems Design Interview Questions With Ex-Google SWE

Comments • 211

  • @prashantgodhwani6304
    @prashantgodhwani6304 7 months ago +68

    I swear this channel is such a hidden gem.

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +8

      Thanks man! Hopefully not too hidden haha

    • @ronakshah725
      @ronakshah725 2 days ago

      I think this attracts a particular kind of reader: those looking for a more in-depth, first-principles understanding vs. memorization. I think you also have a good balance of real use-case-driven thinking and hand-waving where necessary. The latter is crucial, and people (including me at times) tend to overlook the importance of hand-waving. Not everything deserves the same level of detail, but knowing which does is the game changer! Anyway, I will be subscribing and reviewing a good portion of this 😊

    • @dwivedys
      @dwivedys 1 day ago

      You stole my thoughts! What is this man!!! Amazing…brilliant

  • @jporritt
    @jporritt 3 months ago +15

    I've gone through a lot of videos on this subject and this is by far the best. Not just regurgitating what they've read elsewhere, but actually coming from a place of proper understanding.

  • @gchris6434
    @gchris6434 2 months ago +5

    Great content! Probably the most in-depth TinyURL design I've seen. Most others don't actually get into write conflicts/locking.

  • @ww12835
    @ww12835 3 months ago +4

    Thank you very much, I’ve been reading DIDA and Understanding Distributed Systems so it’s great to see how those concepts apply to the use cases you are showing.
    This is the first video of yours that I've seen; I'm planning to watch the whole series when I can, because you highlight the questions we should be asking when designing and weigh the tradeoffs of different solutions. It's amazing.
    I don't know if you've done this in the other videos, but if I could ask for anything it would be for you to change the pencil color when reviewing the flows of data; it makes it easier to find where you are pointing on the drawing board.
    Again, thank you very much, these videos are really helpful.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      Sounds good - I've gotten this suggestion a couple of times now, so I'll give it a shot at some point

  • @brandonwl
    @brandonwl 6 months ago +6

    I appreciate the new 2.0 video style and how in depth it is. I found the older videos too brief considering the complexity of the systems they covered.

  • @gopald9962
    @gopald9962 3 months ago +5

    You are very good, the best even, at driving the solution of a design problem by applying logical reasoning rather than throwing something at my face I wouldn't understand. Thanks for existing!

  • @gammagorom
    @gammagorom 7 months ago +7

    Thank you for all the system design content! I’ve been binging through the 2.0 playlist and this video does a great job putting all the pieces together. I don’t have any experience with system design interviews so I have two questions in particular:
    1. If in an interview, someone were able to deliver a design with the same level of depth + complexity that you’ve described here, what level would they be placed at? Alternatively, what sort of depth would interviewers be looking for in say an L4 interview?
    2. Many of the system design interview guides recommend to lay out a basic architecture first and then go into a deep dive at the end but in your videos, you weave both parts together. How would you recommend to split these parts up? For example, is database schema + choosing a type of database part of the deep dive or should it be part of the basic architecture? And how about for replication/partitioning?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +5

      1) At the end of the day, you're interviewing for a certain level, so you'd be placed at that level. I don't really think uplevels are ever going to happen, but failing a systems design interview might lead to a downlevel. I like to think that these videos are sufficient to pass just about any systems design interview, considering the results of my viewers, but I can't be sure, as I'm not an L5 and haven't had the test to try that theory out haha
      2) Yeah, and I used to do my old videos this way. However what I've come to realize is that one's choice of database schema is intimately tied to their database architecture choice. So if I just list a schema outright, but don't go into what architecture I'm using, it might not make sense. In an actual interview, it may make sense to list a schema first, and then say you'll explain it later, but my main objective of these videos is to make them as clear as possible - so I hope that my choice to do this makes sense.

  • @harikareddy5759
    @harikareddy5759 6 months ago +3

    Hey Jordan, awesome video as usual! I really appreciate the way you break down tradeoffs. I checked out the TinyURL/Pastebin design from grokking's course, but it felt a bit sloppy to me, your version is so much better :) thank you!

  • @firezdog
    @firezdog 1 month ago +1

    I really like the format for these types of videos of starting with a naive solution and then iterating on each component as we discover performance problems. That way the initial db schema and the overall graph of how requests / responses implement the features can be clearly presented to the viewer + the viewer gets a conceptual framework to start evaluating the ways in which the simple picture needed to become more sophisticated.

  • @BRBallin1
    @BRBallin1 4 days ago +2

    Using PH as your example URL makes you a legend

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 days ago

      Legend is a strong word to use for what I am

  • @andriidanylov9453
    @andriidanylov9453 5 months ago +1

    Appreciate it. This is a very serious description/implementation for such an easy task.

  • @dhruvpatidar359
    @dhruvpatidar359 2 days ago +1

    That Google logo and the stand behind looks like ghost to me 😰😱

  • @zalooooo
    @zalooooo 5 months ago +3

    Awesome video! A couple of thoughts:
    - For storage, it may be useful to talk about content-addressable storage. Many pastes may be the same, and that would save space.
    - I'm not sure what the CDN buys us here. Reading from S3 isn't crazy slow, but the big issue is that the object space is huge, and my hunch is that most of the time you'd get CDN misses in most regions. That is assuming you're not storing every paste in your CDN, which feels expensive. This is more a note that a CDN feels like a premature optimization.
    - I really like your count approach. Another could be using a counter CRDT and incrementing server-side while serving the link. That being said, yours is more general-purpose and scales for other types of analytics.

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Thanks!
      Good call on the first point, you could take some sort of hash to deduplicate.
      For the CDN point, I mostly have it there just to talk about CDNs. In practice, caches seem to be one of those things that you just introduce as you need, so we'd have to see how our existing latencies were and then experiment here.
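
A minimal sketch of that hash-based deduplication idea (content-addressable storage): the in-memory dict is just a stand-in for S3, and all names here are hypothetical.

```python
import hashlib

store = {}  # stand-in for S3: blob key -> bytes

def put_paste(content: bytes) -> str:
    """Content-addressable put: the key is a hash of the bytes,
    so identical pastes are only ever stored once."""
    key = hashlib.sha256(content).hexdigest()
    store.setdefault(key, content)  # no-op if this paste already exists
    return key

k1 = put_paste(b"print('hello world')")
k2 = put_paste(b"print('hello world')")  # duplicate paste -> same key
assert k1 == k2 and len(store) == 1
```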

  • @mytabelle
    @mytabelle 4 months ago +1

    Hey, first of all thank you for the videos. They've been really helpful for me prepping for an upcoming interview!
    I watched a separate video on TinyURL/Pastebin link generation, and its main point was that instead of hashing, which can lead to collisions when we truncate the MD5/whatever hash (like you mentioned in the video), we could have a cluster of key/ID generators, and each one would have a range of URLs to distribute when a write request is made. They could be coordinated by ZooKeeper, would allocate only from their own chunks, and would refresh their URLs as needed. The only concern is that we lose that range of URLs (say 1 million) if a specific generator goes down, but since we have 2 trillion combinations if we use base 36 with 8 characters, and even more if we use base 62, this isn't a major concern. This would also speed up writes in the case there is a collision.
    What are your thoughts on this?
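
A sketch of the range-based key generation service described in this comment. The ZooKeeper coordination is stubbed out (each node is simply constructed with its leased range), and the class and function names are made up for illustration.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base 36, as in the comment

def to_base36(n: int, width: int = 8) -> str:
    """Encode a counter value as a fixed-width base-36 short key."""
    out = []
    for _ in range(width):
        n, r = divmod(n, 36)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

class KeyGenerator:
    """One generator node owning a pre-allocated ID range.
    In the real design, ZooKeeper would lease out these ranges."""
    def __init__(self, start: int, size: int):
        self.next_id, self.end = start, start + size

    def next_key(self) -> str:
        if self.next_id >= self.end:
            raise RuntimeError("range exhausted; lease a new chunk")
        key = to_base36(self.next_id)
        self.next_id += 1
        return key

gen_a = KeyGenerator(0, 1_000_000)          # node A owns [0, 1M)
gen_b = KeyGenerator(1_000_000, 1_000_000)  # node B owns [1M, 2M): no overlap, so no collisions
```

Because the ranges are disjoint, no two nodes can ever hand out the same short key, which is exactly why no conflict detection is needed at write time.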

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      Hey, yeah, I touch upon that in this video. That's basically the key generation service, where you partition the keys and assign them in sequence.
      That works too! There may be some lock contention within a partition, but if you have enough partitions that it doesn't matter, that's fine too!

  • @ninad11081991
    @ninad11081991 6 months ago +2

    I'd love it if you could make content regarding what level of depth is expected from a Mid level SD interview vs Senior vs Staff!
    It would be cool to take the same problem and do a mock at each level so we can understand the differences! Would love to see a mock series between you and Gaurav tackling this.
    I've interviewed with companies where they haven't picked a level for me yet and they say we will determine it based on your performance. But I don't know what the expectation is exactly in the context of a system design interview.

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      To be honest, I just don't have enough context here, as I haven't conducted interviews for a company. I really do wish I could say otherwise, but at the moment my objective is just to go as in-depth as possible.

  • @moonbiscuit8742
    @moonbiscuit8742 6 months ago +1

    Just found the channel, and now watching all your videos from a year ago. Appreciate that you're going through the rationale and building intuition about these concepts. Also, what would be your top 3 book recommendations?

  • @slover4384
    @slover4384 10 days ago +1

    At 12:34 -> You do not have to do any predicate query. Relational databases use B-trees and are not append-only like databases that use LSM trees.
    So an "INSERT" will fail if another INSERT won the race, because the 2nd insert will notice the primary key already in the B-tree. This sort of "INSERT and fail if primary key exists" is very hard to accomplish in LSM-based DBs, as they are append-only. In fact, to accomplish this in an LSM-based database, you usually have to do a special slow operation that involves multiple phases and uses something like Paxos, Raft, or 2PC.
    So with a database like postgres, one solution to this TinyURL is to actually just have different application servers INSERT new rows. If there was a race on the exact same primary key, the 2nd application server will get a "duplicate key exists" error and will know to retry or send the error to the client.
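
A runnable sketch of that "insert and let the loser retry" approach, using sqlite3 as a stand-in for Postgres (the table and function names are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (short_url TEXT PRIMARY KEY, long_url TEXT)")

def claim(short_url: str, long_url: str) -> bool:
    """Insert and let the primary-key index arbitrate the race: the loser
    gets a duplicate-key error and can retry or surface it to the client."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO urls VALUES (?, ?)", (short_url, long_url))
        return True
    except sqlite3.IntegrityError:  # another writer inserted this key first
        return False

assert claim("abc123", "https://example.com/a") is True   # first writer wins
assert claim("abc123", "https://example.com/b") is False  # second sees a duplicate key
```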

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 days ago +1

      Yeah - I probably shouldn't have included this section in the video since it's just thrown people off. I meant to say that predicate locking was one way of accomplishing this, but like you mentioned in SQL databases they're gonna get this done for you on the primary key.
      Interesting stuff on the LSM tidbit, that's new to me! I imagine if the key was still in the memtable perhaps we wouldn't need too much additional complexity, as we'd see it was there, but had it been flushed this could get complex.
      Thanks for the insight, and I appreciate you pointing out these flaws!

  • @yashagarwal8249
    @yashagarwal8249 4 months ago +1

    Hey Jordan, love your content!
    I'm a bit confused about when the S3 data is being used here if we are always writing to the CDN first and fetching from it as well. Is it a fallback for when some data gets evicted from a particular CDN edge server?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Hey Yash! We aren't always writing to the CDN here! Rather, we use the CDN as a write-around cache, where pastes that are accessed many times will ideally end up loaded into the CDN. For the first write, we go to S3.
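
The write-around pattern in that reply can be sketched in a few lines; the two dicts are stand-ins for S3 and the CDN, and the function names are hypothetical.

```python
s3, cdn = {}, {}  # stand-ins for the origin store and the edge cache

def write_paste(key: str, blob: bytes) -> None:
    s3[key] = blob  # write-around: new pastes skip the CDN entirely

def read_paste(key: str) -> bytes:
    if key in cdn:    # CDN hit: served from the edge
        return cdn[key]
    blob = s3[key]    # miss: go back to the origin...
    cdn[key] = blob   # ...and populate the edge for later readers
    return blob

write_paste("abc123", b"some paste body")
assert "abc123" not in cdn  # the first write never touches the CDN
read_paste("abc123")
assert "abc123" in cdn      # a read populates the edge cache
```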

  • @psychechip
    @psychechip 3 months ago +1

    I fucking love how you can explain complex shit and be funny at the same time. Keep up the good work!

  • @fillywinks
    @fillywinks 6 months ago +2

    Great video. I wonder if it would be considered overengineering to go for a keyed, time-windowed stream in Flink, sink each trigger to the DB idempotently (semi-event-sourcing style), and trigger a reduce on the DB once every x updates/once in a while (something like what RocksDB does).
    You know what, I think I just answered my own question writing this comment 😅

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      Ha, I think you could probably do it, just not sure if it's necessary!

  • @andreybraslavskiy522
    @andreybraslavskiy522 1 month ago +1

    Hi Jordan. You were talking about 16 TB of potential data, partitioning, single leader replication. Sounds like Mongo fits better than relational solution.
    Thank you for great content!

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      I think they're more or less equally feasible for this one, feel free to go with mongo, basically personal preference at this point

  • @vedantshinde2277
    @vedantshinde2277 5 months ago +3

    Great stuff!! So, to maintain the unique constraint using predicate locks, serializable transactions will have to be used, with first a SELECT and then an INSERT, right?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago +1

      Yep more or less!

    • @obidan2
      @obidan2 4 months ago

      @@jordanhasnolife5163 If there is already a unique constraint on the database for the short_url, I am confused as to why you would need to do anything in this case.
      The database will return a UNIQUE_CONSTRAINT failure to whoever tries to INSERT after that short_url was already inserted, and that client can retry with an incremented number.
      Am I missing something?

  • @tarunnurat
    @tarunnurat 4 months ago +1

    Hi Jordan, great vid. Just had a question. In your DB schema, you mentioned having an index on the tinyURL. But given that the tinyURLs in each row are unique, wouldn't they be the primary key for our rows? And if so, why would we add an index on a primary key?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Same thing, basically - not all databases sort on the primary key, I believe, so I'm just saying this to be clear

  • @ishanvijay507
    @ishanvijay507 5 months ago +1

    Hey man, Just came across this channel and i see two playlists for system design problems
    Are the videos from the first playlist good enough to cover the ground for now, until you finish putting out videos in this playlist? Just asking if there's any reason behind updating them, as they were only a year old.

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Yeah, I think so, but I'm just trying to be more visual in the second series. The first one is just a lot of me talking with no background images.

  • @scottmangiapane
    @scottmangiapane 29 days ago +1

    6:40 never thought I'd laugh out loud during a system design video. Thanks for making my studying somewhat tolerable

  • @LeoLeo-nx5gi
    @LeoLeo-nx5gi 6 months ago +1

    Firstly very awesome content, thanks a ton.
    Secondly, I had a query regarding a point you mentioned on using a Kafka queue for analytics, where each short URL would belong to a partition, i.e. we might have 2 trillion partitions.
    I think that would be very bad in general, as it would increase latency, and would it even be possible to create that many partitions? (There should be some upper bound, I think.) Thanks!!

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      I think you could make that argument about any partitioning schema to be fair!
      When I say partition by x, I typically mean partition by the hash range of x if that clears things up at all. My point is I just want all values for key x to be on the same node.

    • @LeoLeo-nx5gi
      @LeoLeo-nx5gi 6 months ago

      @@jordanhasnolife5163 ahh thanks this is helpful

  • @kumarc4853
    @kumarc4853 1 month ago +1

    Learned a lot, concise, well explained, and also funny
    I have a sys design interview today, thank you so much

  • @zhonglin5985
    @zhonglin5985 2 months ago +1

    Hi Jordan, nice content! A question -- why would it be a problem if we use Spark Streaming to calculate total counts? You mentioned specifically that it will fail if there are multiple Spark Streaming consumers. What does that mean?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      I just meant that if you have many consumers for one url, now your counts are split between many consumers. That's ok, but you'll face contention when incrementing the counter in the db from multiple places.

  • @michaelparrott128
    @michaelparrott128 7 months ago +2

    One comment about the writer - if the writer is an API instance serving the requests to the user, and we're writing to the CDN/S3/DB, this could potentially take a long time and you'd likely want some kind of async mechanism to do these writes with a way of updating the user on its progress.

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      Agreed. That's potentially more of a client side change though, pretty sure that when uploading files to these services they expose the percentage completion

  • @Didneodl
    @Didneodl 7 months ago +5

    Wonder why key-value DB wouldn’t be the db choice. It doesn’t require data relation and we can quickly check if tiny url has been used and also supports fast retrieval

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      Yeah, I tried to say that this would be the best choice, but I doubt that storing all the data in memory is feasible from a cost perspective, so I wouldn't be surprised if your interviewer pushed you harder here.

    • @adi96adi
      @adi96adi 5 months ago +1

      I was wondering the same thing.
      It's weird how advice on costs usually comes up in the "what to say in the beginning" interview advice but isn't mentioned more in technology tradeoffs. I also would've thought that non-relational DBs are, for the most part, cheaper than relational ones, but I guess that's only up to the point where they stop being caches and start being proper DBs.
      Am I right, or am I missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      @@adi96adi Yeah not exactly sure regarding the cost of non relational vs. relational dbs but I can say with certainty that storing in memory is more expensive and for something like tinyURL, likely to be overkill.

  • @2020goro
    @2020goro 4 months ago +3

    11:21 Hi, here you mention the race condition when 2 users would add/create the same actual URL, but I'm confused as to why this would happen if the hash function already accounts for the userID when generating the tinyURL. Wouldn't that make the tinyURLs unique for the 2 users, even for the same links?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago +2

      It's super unlikely but as long as there are fewer hash combinations than userId and url combinations you can have some collisions
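
That pigeonhole argument can be demonstrated by truncating a real hash aggressively. Here (as a toy, not the video's actual key length) the hash is cut to 2 hex characters, i.e. only 256 possible keys, so distinct (userID, URL) inputs are guaranteed to collide within a few hundred tries.

```python
import hashlib

def tiny_hash(s: str) -> str:
    """Truncated MD5: only 16**2 = 256 possible outputs."""
    return hashlib.md5(s.encode()).hexdigest()[:2]

seen = {}
collision = None
for i in range(300):  # 300 distinct inputs > 256 buckets: a collision is guaranteed
    h = tiny_hash(f"user{i}:https://example.com/page{i}")
    if h in seen:
        collision = (seen[h], i, h)
        break
    seen[h] = i

print("inputs", collision[0], "and", collision[1], "both map to", collision[2])
```

With 8 full characters the key space is vastly larger, so collisions become rare rather than impossible, which is why the probing logic still has to exist.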

  • @ariali2067
    @ariali2067 3 months ago +1

    Hey Jordan, you mentioned that we shouldn't use write back cache b/c that will incur the same issue as multi-leader/leaderless replication. Curious why we cannot use Redis for write back cache here? Redis supports single leader replication right? Thanks in advance!

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      Hi Aria - you can use redis as a write back cache, and redis supports single leader replication.
      But a write back cache means that you write to a cache first and eventually flush that to your database.
      If you haven't watched my playlist prior to this one, I think you may get some benefit from drilling down those concepts first! These videos should then all make more sense.

  • @cc-to2jn
    @cc-to2jn 6 months ago +1

    oh wow u changed the original system design vids! Great I was thinking of reviewing soon :)

  • @kanani749
    @kanani749 1 month ago +1

    Hey Jordan,
    Great video. Wanted to ask how analytics calculation would work in this case if in the system we were to use Atomic write operations (I am referencing DDIA Chapter 7 - Preventing Lost Updates - Atomic Write Operations). From my understanding we are basically trying to implement a read modify write cycle and it would still boil down to a lock which you mentioned.
    If we designed a distributed Redis store partitioned by URL to handle the counter incrementing, would the main issue at hand be maintaining an accurate count during failover? Could using a consensus algorithm help solve this issue, or would it reduce performance and throughput too much?

    • @jordanhasnolife5163
      @jordanhasnolife5163  1 month ago

      We can use atomics, but atomics are built around grabbing locks. If we have too many people incrementing the same counter, we could have contention. In my solution, I batch all of the updates and increment them at intervals, via a real-time stream consumer. By having just one consumer per URL, I don't need to worry about locking.
      Since Redis is single-threaded, I suppose your idea works. That being said, again, we are limited by the throughput of the Redis node. While a consensus algorithm would ensure that we don't lose any writes, I'd agree with you that it's probably too slow and doesn't make sense here.
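
The batching idea in that reply can be sketched as follows; the dict is a stand-in for the analytics table, and the interval/partition plumbing is assumed rather than shown.

```python
from collections import Counter

db_counts = {}  # stand-in for the analytics table: url -> total clicks

def flush(batch: Counter) -> None:
    """One consumer owns this partition of URLs, so each counter row is
    only ever incremented from one place: no lock contention."""
    for url, n in batch.items():
        db_counts[url] = db_counts.get(url, 0) + n  # one write per URL per interval

clicks = ["abc", "abc", "xyz", "abc", "xyz"]  # click stream for this partition
flush(Counter(clicks))
assert db_counts == {"abc": 3, "xyz": 2}
```

Five raw click events became two DB writes; at scale, that batching ratio is what keeps the hot counters from becoming a point of contention.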

  • @5atc
    @5atc 6 months ago +2

    When you say you would use database partitioning, do you actually mean sharding? How do you approach discussing these two topics during an interview?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      I've never really known them to be different terms

  • @ky747r0
    @ky747r0 7 months ago +1

    Great video jordan, one question though. Will this playlist have different questions than your last system designs questions playlist or will it be the same??

    • @ky747r0
      @ky747r0 7 months ago +1

      BTW much more detailed explanation than the last playlist video, highly appreciated! Keep it up!!

    • @ky747r0
      @ky747r0 7 months ago +1

      The quality, not the other thing!

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 months ago +1

      I'm going to incorporate different modifications to questions, but I'd say the questions themselves will be the same services. For example, my last TinyURL video didn't include anything about analytics. I may add a couple of new ones at the end of the series depending on how things are going.

    • @ky747r0
      @ky747r0 7 months ago +1

      @@jordanhasnolife5163Great, thanks dude!

    • @user-vv8fw4fj5i
      @user-vv8fw4fj5i 7 months ago +1

      I also like the way he talks while writing this time; it makes it easier for me to understand.

  • @jerryhuang472
    @jerryhuang472 3 months ago +2

    Thanks for the vid. Wondering: what if the URL read succeeds but the write to the Kafka queue doesn't? Is there any decent way to solve this? I know you mention two-phase commit a bit, but I was wondering if there's anything faster.

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      Nope, unfortunately nothing better than 2pc. If you figure it out though let me know, we'll be billionaires!

  • @truptijoshi2535
    @truptijoshi2535 24 days ago +1

    Hi Jordan! Great video. I have a question about probing. Should the application logic check if the hash is available? Since we are creating a hash based on the long URL + userID + timestamp, why would there be any conflict? Should we also think of UUID or Twitter's snowflake or Base 62 conversions to avoid probing? Could you please explain more about the Hashing function?

    • @jordanhasnolife5163
      @jordanhasnolife5163  23 days ago

      Just because those are our inputs to the hash function doesn't mean we'll generate a unique output. If we only have 6 characters in our short link, for example, some things will hash to the same value.

  • @Anonymous-ym6st
    @Anonymous-ym6st 2 months ago +1

    Thanks for the great content! I am wondering, for the write-back cache and data consistency: is it possible to make sure the same user always writes to and reads from the same cache? In that case, would there still be any data consistency problem, or anything I am not considering correctly? (Apart from edge cases where the cache node fails and they can only read from a new cache.)

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +1

      Yeah in theory if you use consistent hashing to put the same user on the same write back cache this should be ok, but as you mentioned if that cache goes down we're in trouble

  • @sachin469D
    @sachin469D 7 months ago +3

    Been watching all your videos. Can you do a video on Lucene, or on how, in general, a search index works with a given edit distance?

    • @Saurabh2816
      @Saurabh2816 7 months ago +1

      Yeah, interested in Apache Lucene, Elasticsearch, and search indexing

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      I already have! th-cam.com/video/ty9DQhM32mM/w-d-xo.html&ab_channel=Jordanhasnolife
      Didn't go into Levenshtein distance indexing, I think it's probably too specific for the details of this particular channel but I'm sure there are some good resources explaining it somewhere online!

  • @mcee311
    @mcee311 6 months ago +3

    for multi-leader / leaderless, can't we resolve the issues of conflicting writes by using consistent hashing?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +2

      You could say something along the lines of "for key x only write to replica 1 of partition 3". But then you've basically just re-created single leader replication.

  • @rakeshvarma8091
    @rakeshvarma8091 2 months ago +1

    Great Video Jordan.
    I have couple of questions.
    1. You are saying we will partition by short URL key, wherein keys from A-M go to one node and keys from N-Z go to another node. But why do we need this kind of partitioning? Consistent hashing offers a good, even distribution of keys among the available servers, doesn't it? Is it wise to use consistent hashing everywhere, i.e., for DB partitioning, cache partitioning, and Kafka queue partitioning?
    2. At th-cam.com/video/5V6Lam8GZo4/w-d-xo.html, can you shed some more light on the idempotency key for achieving exactly-once processing? How does keeping the last written key help avoid duplicate updates for clicks? Because we might get clicks for that same key later as well, right, i.e., subsequent clicks on the URL? How do we differentiate between these two?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      1) We are using consistent hashing, albeit on the range of short keys. That's because the short keys themselves are hashes, no need to hash them again (there's no reason you couldn't if you wanted).
      2) Each batch of clicks that we handle in spark streaming gets assigned a random key id (you can hash the incoming data or something so it is deterministic), called an idempotency key. When we change the value in the db, we check (in the clicks database) to make sure we haven't seen that idempotency key so that we don't accidentally update our db twice.
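
A minimal sketch of that idempotency-key check: each batch's key is derived deterministically from its contents, so a batch redelivered after a crash is applied exactly once. The set and dict are stand-ins for the clicks database, and the names are illustrative.

```python
import hashlib

seen_keys = set()  # in the real design, stored alongside the clicks table
db_counts = {}

def apply_batch(url: str, click_ids: list) -> None:
    """Derive a deterministic idempotency key from the batch contents;
    if we've already applied this exact batch, do nothing."""
    key = hashlib.sha256(repr((url, click_ids)).encode()).hexdigest()
    if key in seen_keys:  # already applied: redelivery is a no-op
        return
    seen_keys.add(key)
    db_counts[url] = db_counts.get(url, 0) + len(click_ids)

batch = ("abc123", [101, 102, 103])
apply_batch(*batch)
apply_batch(*batch)  # simulated redelivery after a crash changes nothing
assert db_counts["abc123"] == 3
```

A later, genuinely new batch of clicks for the same URL hashes to a different key, so subsequent clicks still get counted; only exact replays are suppressed.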

  • @pranapavan
    @pranapavan 3 months ago +1

    @jordan do you have slides for the problems, just like you have for the system design concepts? Thank you, keep rocking!

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago +1

      Yep! I'm waiting until the series is done to get them all posted in batch

  • @Anonymous-ym6st
    @Anonymous-ym6st 2 months ago +1

    Another question: when you mention that a few TB should be fine for a single machine, I am wondering what storage can usually fit on one machine in practice. A few hundred TB?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Depends how you're storing the data, but just look up the prices of hard drives and look up how many you can fit in an average server

  • @matthewsaucedo2471
    @matthewsaucedo2471 6 months ago +2

    25:25
    What is the difference b/t Spark Streaming and Flink, if you were to just use a larger batch in Flink than in your example (you said 10, but consider we use a more comparable 100)?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      I'd say none then really, just wanted the opportunity to use something other than flink lol

    • @matthewsaucedo2471
      @matthewsaucedo2471 6 months ago

      I’ve never used either technology (only Kinesis streams from AWS) so was just curious!

  • @harshdalal2891
    @harshdalal2891 3 months ago +2

    Why do you need predicate locks for solving inserts in MySQL DB ?
    Why can't we just use primary key constraints / unique key constraints given by DB to achieve uniqueness ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      To be clear, we totally can, I just would imagine that under the hood that has to use some sort of predicate locking so that we don't create two rows with the same primary key

  • @choucw1206
    @choucw1206 6 months ago +4

    Great video! However, can you elaborate on the following things related to short URL generation?
    1. Most common hash functions, e.g. MD5, SHA-2, generate strings way longer than you need. How do you use one while still keeping hash collisions low?
    2. Can you elaborate on how to detect hash collisions? You mention linear probing, which kind of implies you want to do it in memory, so my guess is the service probably needs to first query the DB to check if this hash/key already exists. Won't that be a performance concern?
    3. I've read a few other designs for generating the unique short URL, and some of them raise issues with hashing the original URL (optionally including other data like a timestamp). They usually suggest using a separate unique ID generation service/component and performing a proper encoding like base 62, using that as the short URL (besides the domain name part). Have you ever considered this, and what are your thoughts?
    Btw, unique ID generation strategies in distributed systems may be an interesting topic to discuss as well, if you haven't done it yet 🙂

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +2

      I think you should just be able to truncate; if we found that this was leading to too many collisions, we could always use 10 characters instead of 8, for example.
      2) Oh, on the contrary, I'm basically proposing we use application logic that says: if key x is taken, try to write key x+1 in the DB. This could be done a bit faster with something like a stored procedure.
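
One way to sketch that "if key x is taken, try key x+1" probing logic; the dict stands in for the DB's unique index on the short key, and the function name and key length are illustrative.

```python
import hashlib

table = {}  # stand-in for the DB: short key -> long URL

def shorten(long_url: str, length: int = 8) -> str:
    """Hash, truncate, then linearly probe: if the candidate key is held
    by a different URL, try the next key until an insert succeeds."""
    space = 16 ** length
    candidate = int(hashlib.md5(long_url.encode()).hexdigest(), 16) % space
    while True:
        key = format(candidate, f"0{length}x")
        if table.setdefault(key, long_url) == long_url:  # slot free, or already ours
            return key
        candidate = (candidate + 1) % space  # collision with a different URL: probe on

k = shorten("https://example.com/page")
assert shorten("https://example.com/page") == k  # same URL maps to the same key
assert len(k) == 8
```

In a real DB this loop would be "INSERT, and on a duplicate-key error retry with the next candidate," which is why a stored procedure that does the loop server-side saves round trips.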

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      3. I don't really see the difference, feel free to elaborate on how this improves things!

    • @SatinderSingh71
      @SatinderSingh71 6 months ago

      @@jordanhasnolife5163 Grokking says the key gen approach is great at solving hash collisions since everything is created beforehand in an offline process. It creates two tables in your DB (used & unused). The application can periodically grab a bunch of unused keys from the key generation service (KGS); once used, the app updates the KGS to move the entry into used. That's their reasoning for it, anyway. Yours makes sense as well: just populate the DB with everything from the get-go, since it's only a few TBs, which is nothing.

    • @varshard0
      @varshard0 5 months ago

      @@jordanhasnolife5163 Generating a unique id from a sequence, then creating a base 36-64 string as the shortened URL, lets you never deal with a collision.
      A running number is guaranteed to be unique, unlike a hash. However, it runs the risk of being guessable and could be DDoSed, e.g. by making two shortened URLs that point to each other.
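A minimal sketch of the sequence-number approach described above: encode a unique counter value in base 62 (digits plus both letter cases). Collisions are impossible by construction, though as the comment notes, the output is guessable.

```python
# Sketch: base-62 encoding of a monotonically increasing counter.
B62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode62(n: int) -> str:
    # Repeatedly take the remainder mod 62; digits come out
    # least-significant first, so reverse at the end.
    if n == 0:
        return B62[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(B62[r])
    return "".join(reversed(out))
```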

    • @adithyasama7558
      @adithyasama7558 3 months ago

      @@jordanhasnolife5163 So each url-assign node would be assigned a range, for example (0000 - 5555). This way no two nodes will ever try to insert the same short url. Each node just increments its local counter whenever it creates a new short url, so there's no conflict resolution to be done.

  • @Ryan-sd6os
    @Ryan-sd6os a month ago +2

    Hey Jordan, curious: why do we avoid multi-leader and use single-leader when both can have stale data issues?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      I'm not concerned about stale data, I'm concerned with write conflicts, which would occur when two users both claim the same URL. Single leader replication doesn't have this issue.

    • @Ryan-sd6os
      @Ryan-sd6os a month ago

      @@jordanhasnolife5163 thank you legend.

  • @SatinderSingh71
    @SatinderSingh71 6 months ago +1

    This series is definitely in a league of its own. Who da man! Jordan da man - Fellow dork

  • @vanderer_prashant
    @vanderer_prashant 4 months ago +1

    If we want to add a new feature to the system - to generate QR code instead of short URLs, what'd be the changes if any?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Yeah I'd say that's effectively the same. I'd imagine there's some form of string representation of a QR code and we would just store that as the primary key in the database instead of a short link.

  • @vipulspartacus7771
    @vipulspartacus7771 21 days ago +1

    Hi Jordan, I have a few questions on how this can be deployed globally. Suppose you need to generate URLs for users across the world. Then you'd need to deploy the generator and datastores in different regions and generate unique hashes per region; if not, there could be hash collisions, which would require checking databases on different continents. Maybe we need a UUID-style solution, so that each region can have its own local store that can create and serve content better. The view count might also come from different continents, so even there you'd need some sort of sharded-counter scheme to consolidate counts across regions. What are your thoughts?

    • @jordanhasnolife5163
      @jordanhasnolife5163  19 days ago

      I'd use a key generation service with a db leader in each region, and aggregate counts using stream processing via Kafka and flink, as that can take as long as I need. You can have DB follower replicas in other regions for faster reads.

    • @vipulspartacus7771
      @vipulspartacus7771 19 days ago

      @@jordanhasnolife5163 By key generation service, do you mean something that appends a region-specific code to the hash? Otherwise, how can the generated key be unique, since verifying that it doesn't conflict with URLs in other regions would take time?

  • @Luzkan
    @Luzkan 4 months ago +1

    Sup Jordan! After 2:35 I went ahead and made a design too, and I'm looking forward to hearing your opinion; feel free to confront and bully me based on my decisions.
    You hard-constrained the choices around no write conflicts (with the example of an identical hash for two different urls made simultaneously by two users). I thought that situation would be extremely rare in real life (and even then, we could add extra complexity to how we generate the tinyurl to assure its uniqueness, with a salt maybe, or a distributed counter, but tbh, only after such an event had happened a considerable number of times). Thus, I wouldn't want to discard options based on that edge case. That enables the Cassandra option, which looks very nice for a system that is in fact read-heavy as you said, but! each tinyurl visit (a read) is also a write because of the inc() counter requirement.
    I really like your idea of going with mini-batch updates and Spark Streaming, though. I think this is also a good design, and I'm not holding a strong opinion on either of those, and now what? :D
    Cheers! Thank you for the content, keep up the great work!

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      Yeah! I think Cassandra could be interesting here, assuming that you used a leaderless counter (so basically a CRDT), and I agree that IRL conflicts should be relatively rare. Since there are a lot fewer hashes than there are URLs different sites may hash to the same result but ideally by then enough time will have passed that Cassandra could have performed anti entropy so we could see that the hash was taken and then probe to another.
      Thanks, I appreciate the kind words!

  • @OF-nf2cb
    @OF-nf2cb 26 days ago +1

    Hi Jordan, thanks for this video! I don't get why we need to worry about two people adding rows with the same primary "tinyUrl" key. Doesn't every ACID-compliant db prevent adding two of the same row like that, or am I missing something very obvious??

    • @jordanhasnolife5163
      @jordanhasnolife5163  25 days ago

      Yup! (Admittedly I don't know how this would work when many partitions are involved), but this is how they'd do it

    • @OF-nf2cb
      @OF-nf2cb 25 days ago +1

      Sorry, I still don't understand 😢 I was referring to 11:15, which talks about locking when adding new rows. I thought databases already did this, so why do we need to worry about it?

    • @jordanhasnolife5163
      @jordanhasnolife5163  25 days ago +1

      @@OF-nf2cb Because it's a systems design interview, and we should know how things work :)

    • @OF-nf2cb
      @OF-nf2cb 25 days ago +1

      @@jordanhasnolife5163 Ah I see, so you were talking about how dbs handle it internally? I thought you meant we were implementing that ourselves.

    • @jordanhasnolife5163
      @jordanhasnolife5163  25 days ago

      @@OF-nf2cb Depends on the db :) as you mentioned, most SQL ones will get it done for ya via primary key constraints

  • @aram_mkr
    @aram_mkr 2 months ago +1

    Any thoughts on using sharded counters for the click count vs Kafka stream processing?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Yep, I did mention counter CRDTs, which imo are basically the same

  • @ninad11081991
    @ninad11081991 6 months ago +3

    Is it really worth the trade-off to use a relational db here out of fear of write conflicts?
    We're giving up a ton of write throughput by making this choice, aren't we?
    What if we instead don't take the tinyURL as input from the user, just take his longUrl, have our Cassandra pre-populated with tinyURLs, and then round-robin requests among our leaderless architecture?

    • @ninad11081991
      @ninad11081991 6 months ago +1

      Fun fact, Bloomberg actually rejected me because I used mysql in my system design interview for this problem

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago +1

      I think it's more about how you justify it. To clarify, I said to use single-leader replication, not necessarily a relational db.
      That being said, if you round-robin requests around your leaderless architecture and two people want the same shortURL, they could very well still take the same shortURL on different replicas, which will lead to a write conflict.

    • @ninad11081991
      @ninad11081991 6 months ago

      @@jordanhasnolife5163 Makes sense! and Makes Sense. If I get asked this again, I'll bring up both considerations

    • @brandonwl
      @brandonwl 6 months ago

      @@ninad11081991 It's very hard to shard SQL dbs but much easier with mongodb/cassandra. For mongodb you can use the same strategy as the SQL DB shown in the video.
      But for cassandra you need a separate KeyGen service, since write conflicts can happen in cassandra. Do you think they failed you because they were looking for a cassandra solution?

  • @orhemi8076
    @orhemi8076 5 months ago +1

    What are the benefits of choosing a SQL DB over NoSQL? NoSQL can scale horizontally much more easily.

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago +1

      What makes you say that noSQL scales horizontally much easier? This may be true when we have a lot of distributed joins to perform, but in the case in which we don't, I really don't see any advantage. We get equal amounts of data locality either way.

  • @firezdog
    @firezdog a month ago +1

    For the caches, since we want to handle hot-keys, why not use LFU as opposed to LRU for the eviction policy?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      I think they both have their pros and cons here, but a URL that was popular a year ago but nobody looks at now is still technically pretty "frequently used"

  • @akbarkool
    @akbarkool 3 months ago +1

    For the issue of multi-leader potentially causing write conflicts, a simple fix could be to hash the incoming and outgoing urls together. That would leave the chances of the same hashed value far smaller, unless the business is also looking up p0rn sites

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Haha yeah I think another thing to note here is the range space. We can hash on lots of fields but if we don't have that many hashing combinations we'll have collisions

  • @ilqarilyasov
    @ilqarilyasov 2 months ago +1

    5:27 you mentioned that the hash function provides a unique value. Are you sure? Doesn't the hash function of a hash table typically return an index to map keys to indices in the underlying array?

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      A hash function is any function that deterministically maps some element of a domain to a different element of a range

  • @tunepa4418
    @tunepa4418 3 months ago +1

    How can we achieve or guarantee uniqueness of the tiny url field across multiple shards? Wouldn't this be too expensive, especially if we are using predicate locks?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      You shard by the tinyurl ID so that you don't have to!

  • @edcoronado
    @edcoronado 6 months ago +1

    great video, man. a lot of system design videos are super generic with just "here's some lego blocks that I know off the top of my head" and don't start from a bottom-up approach with the system architecture designed to support a specific algorithm, not just a top-level cookie-cut use case.

  • @himanimadan3547
    @himanimadan3547 4 months ago +1

    I have a question: how would we get the same shortlink "abc" at 5:58 if we are using hash(userid, timestamp, long url)? If the hash function generates the same value, wouldn't that be resolved by probing? Also, the hash function takes different inputs like userid and long url.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      There are only so many short URLs, so even if you have different inputs they could theoretically hash to the same one. And yes, this would be resolved by probing, but in order to actually know that the key is taken and we need to probe, we have to be sure to lock on that row so that we don't have two users grabbing a key at the same time.

    • @himanimadan3547
      @himanimadan3547 4 months ago

      @@jordanhasnolife5163 Thanks! I watched many videos from this channel and all are great!

  • @sarrahdoesthings
    @sarrahdoesthings 7 days ago +1

    I'm here for your eyes

  • @Henry-wq7ki
    @Henry-wq7ki a month ago +1

    Can you please clarify how you assign short URLs?
    The second option, with materialized conflicts?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Put some URLs on multiple database partitions, and have enough partitions that the increased throughput of many partitions is able to offset the lock contention required to assign the next URL (by primary key) to a given user that gets routed to that partition.
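A sketch of the "materialize the conflict" idea described in the reply above: each partition is pre-populated with unclaimed short URLs, and assignment is a compare-and-set that claims the next free row. sqlite3 and all names here are illustrative only.

```python
# Sketch: the conditional UPDATE only succeeds if the row is still unclaimed,
# so two racing writers can never both take the same short URL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pool (short TEXT PRIMARY KEY, long TEXT)")
db.executemany("INSERT INTO pool VALUES (?, NULL)",
               [(f"u{i:04d}",) for i in range(3)])  # tiny pre-populated pool

def assign(db, long_url):
    while True:
        row = db.execute("SELECT short FROM pool WHERE long IS NULL LIMIT 1").fetchone()
        if row is None:
            return None  # pool exhausted
        short = row[0]
        cur = db.execute(
            "UPDATE pool SET long = ? WHERE short = ? AND long IS NULL",
            (long_url, short))
        db.commit()
        if cur.rowcount == 1:
            return short
        # Lost the race for this row: loop and try the next free one.

a = assign(db, "https://a.example")
b = assign(db, "https://b.example")
```

With many partitions, each holding its own pool, this per-row contention is spread out enough that overall throughput stays high.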

    • @Henry-wq7ki
      @Henry-wq7ki a month ago

      @@jordanhasnolife5163 thanks

  • @MasterRace023
    @MasterRace023 10 days ago +1

    At 28:55, why do we need to store an extra idempotency key for each additional SS consumer? I'm not quite understanding that part

    • @jordanhasnolife5163
      @jordanhasnolife5163  9 days ago +1

      Well, each one is simultaneously aggregating some of the counts of people that have clicked our link. If one of them goes down, comes back up, and tries to hit the database again with an operation to increment the counter for our link, we need to make sure we haven't already applied that operation (since Spark Streaming could have gone down before the ack from the DB got back to it). Using an idempotency key allows the database to say "update y from consumer z has already been applied, don't do it again."
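The idempotency-key mechanism in the reply above can be sketched as follows: the DB remembers the last batch id applied per (consumer, url), so a batch replayed after a consumer crash becomes a no-op. sqlite3 and all names are illustrative.

```python
# Sketch: counter increments are only applied if this batch id hasn't
# already been recorded for this (consumer, url) pair.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clicks  (url TEXT PRIMARY KEY, count INTEGER NOT NULL);
    CREATE TABLE applied (consumer TEXT, url TEXT, batch_id INTEGER,
                          PRIMARY KEY (consumer, url));
""")
db.execute("INSERT INTO clicks VALUES ('abc', 0)")

def apply_batch(db, consumer, url, batch_id, delta):
    row = db.execute("SELECT batch_id FROM applied WHERE consumer = ? AND url = ?",
                     (consumer, url)).fetchone()
    if row is not None and row[0] >= batch_id:
        return False  # this batch was already applied: ignore the replay
    db.execute("UPDATE clicks SET count = count + ? WHERE url = ?", (delta, url))
    db.execute("INSERT OR REPLACE INTO applied VALUES (?, ?, ?)",
               (consumer, url, batch_id))
    db.commit()
    return True

apply_batch(db, "consumer-1", "abc", 1, 10)  # applied: count becomes 10
apply_batch(db, "consumer-1", "abc", 1, 10)  # replay after crash: no-op
```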

  • @user-se9zv8hq9r
    @user-se9zv8hq9r 6 months ago +1

    In the future, will you include Envoy in your designs?

  • @rishabhsaxena8096
    @rishabhsaxena8096 5 months ago +1

    Hey Jordan, in the case of one publisher per row (29:34) you mentioned that we will have one Spark streamer per tiny url id. But here we have trillions of URLs; how is it feasible to have a trillion streamers?

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      Hey Rishabh! What I meant there is that the data for each tiny url id belongs on one spark streaming node, but you can have a spark streaming node responsible for many tiny url ids. You basically partition the consumers/kafka queues on tiny url id.
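A tiny illustration of the partitioning in the reply above: events are routed by a hash of the tiny url id, so all counts for one id land on one consumer, while each consumer owns many ids. The partition count and names are hypothetical.

```python
# Sketch: deterministic mapping from tiny url id to a Kafka partition /
# Spark Streaming consumer.
import hashlib

NUM_PARTITIONS = 8  # number of Kafka partitions / streaming consumers

def partition_for(tiny_url_id: str) -> int:
    # Same id always maps to the same partition.
    h = int.from_bytes(hashlib.md5(tiny_url_id.encode()).digest()[:4], "big")
    return h % NUM_PARTITIONS

owners = {tid: partition_for(tid) for tid in ("abc", "xyz")}
```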

    • @siddharthgupta6162
      @siddharthgupta6162 4 months ago +1

      Was exactly my doubt too. Thanks!

  • @idiot7leon
    @idiot7leon 2 months ago

    Brief Outline
    00:01:15 Problem Requirements
    00:02:04 Performance Considerations
    00:03:23 Link Generation
    00:05:45 Assigning URLs - Replication
    00:08:07 Assigning URLs - Caching
    00:08:51 Assigning URLs - Partitioning
    00:10:34 Assigning URLs - Single Node
    00:11:29 Assigning URLs - Predicate Locks
    00:14:12 Assigning URLs - Engine Implementation?
    00:15:42 Assigning URLs - Database Choice
    00:16:17 Maximizing Read Speeds
    00:17:35 Maximizing Read Speeds - Hot Links
    00:18:50 Maximizing Read Speeds - Populating the Cache
    00:21:02 Analytics - Naive Solution
    00:22:22 Analytics - Stream Processing
    00:23:49 Analytics - Click Consumer
    00:26:00 Analytics - Exactly Once?
    00:29:16 Analytics - One Publisher Per Row
    00:30:17 Deleting Expired Links
    00:31:21 PasteBin - Huge Pastes
    00:34:23 Final Diagram
    Thanks, Jordan!

  • @satadhi
    @satadhi a month ago +1

    Hello Jordern, is it possible to get these slides (or whatever they are)? They would be very helpful for revision. Thank you :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Eventually (when I am done with this current series) I will upload everything in batch
      - Jordern

  • @Henry-wq7ki
    @Henry-wq7ki a month ago +1

    If we are single-leader, why do we need a predicate lock?
    Only one client will access the row, right?

  • @mariusz.zlotucha
    @mariusz.zlotucha 3 months ago +1

    Hi, could you estimate cost of all sql dbs from your example? I'm guessing it's pretty huge :)

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      Hey, I'm not too familiar with exact pricing here, but you can go on AWS, check out how much it costs for an Amazon RDS instance, and then multiply that by the number of databases that you need as we discuss in the capacity estimates section.

  • @dmitrigekhtman1082
    @dmitrigekhtman1082 4 months ago +1

    If you write to S3 first and then the DB write fails, you’re left with untracked orphaned junk data from the failed request. Another approach would be to write to the DB first to persist the intent to upload the pastebin, do the rest async, and make sure to communicate something reasonable to the user while the upload is in progress.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago +1

      While this is true, I prefer orphaned junk to the alternative of a paste that links to nothing. I could just have a cleanup job run in the background once a month!

    • @dmitrigekhtman1082
      @dmitrigekhtman1082 4 months ago +1

      A paste that links to nothing is definitely bad. Recurring cleanup works and is necessary in any case, at least as a fallback for bugs. But it's more reliable to have your normal path be to write something to the DB first [chunk_id: xxx, status: need_to_upload], then take (idempotent) action [upload to S3 if not done already] based on the persisted goal state, then record that you're done [chunk_id: xxx, status: uploaded]. Less things to clean up this way. The relevant buzzword is reconciliation.

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 months ago

      @@dmitrigekhtman1082 Seems like a fair point to me!

  • @hemanthaugust7217
    @hemanthaugust7217 6 months ago +1

    How do you generate a hashcode (base36) from the url & timestamp with minimal collisions?
    I've seen other designs generate a sequence number in a distributed fashion and convert the number to base36. We can avoid that approach if you can share how to generate such a hashcode.

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      I'm a bit confused here, why don't you just use any typical hash function?

    • @hemanthaugust7217
      @hemanthaugust7217 6 months ago

      @@jordanhasnolife5163 A hash function like md5/sha256 might generate a unique code, but when it's truncated to 7-8 chars it leads to a lot of collisions. I've not tested it, but the probability can be high.
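The collision concern above can be checked with a back-of-envelope birthday approximation: inserting n random keys into a space of size m yields roughly n*n/(2*m) expected colliding pairs. The URL counts below are illustrative, not from the video.

```python
# Sketch: expected collisions for truncated hashes of k base-62 characters.
def expected_collisions(n: float, chars: int, alphabet: int = 62) -> float:
    m = alphabet ** chars  # number of possible short URLs
    return n * n / (2 * m)

# One billion URLs truncated to 7 vs 8 base-62 characters:
seven = expected_collisions(1e9, 7)  # on the order of 10^5 collisions
eight = expected_collisions(1e9, 8)  # on the order of 10^3: far rarer
```

So truncation does produce collisions, but adding a character or two shrinks them by orders of magnitude, which is why probing on conflict remains workable.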

    • @varshard0
      @varshard0 5 months ago +2

      I wouldn't recommend using a sequence number, as it's likely to be guessable.
      The attacker could then predict the next shortened URL and try to have two shortened URLs pointing at each other, creating an infinite loop that hurts the server.

  • @radosawmul8963
    @radosawmul8963 2 months ago +1

    What strikes me is the question of whether Kafka can actually handle 2 trillion partitions 👀 For example, Firehose can handle a max of 5k partitions per delivery stream. I think some more hybrid approach would be in order here.

    • @radosawmul8963
      @radosawmul8963 2 months ago +1

      29:40 Unless you didn't mean a partition for each url, just range-based partitioning. That was only really clear in the final architecture diagram, not during the analytics deep dive. If so, then all good 😅

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      Yep you got it!

  • @firewater586
    @firewater586 3 months ago +1

    Great content!

  • @tmehmood007
    @tmehmood007 a month ago +1

    How do we find out when the tinyURL was accessed for the first time?

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago +1

      Every read has to access the database anyway, more or less; just write the timestamp then, and don't overwrite it if it already exists. You'll need to use locking here, but to avoid the overhead you can first read the TS without the lock, and only if it's null grab the lock and update it, in order to minimize contention.
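The double-checked pattern in the reply above can be sketched as follows: read first_access without any locking, and only do a conditional write when it is null, so the common case (timestamp already set) pays no contention. sqlite3 and the schema are illustrative.

```python
# Sketch: record a first-access timestamp without overwriting it.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (short TEXT PRIMARY KEY, first_access INTEGER)")
db.execute("INSERT INTO urls VALUES ('abc', NULL)")

def record_first_access(db, short, now):
    row = db.execute("SELECT first_access FROM urls WHERE short = ?",
                     (short,)).fetchone()
    if row is not None and row[0] is None:
        # Conditional write: if a concurrent reader set it first, this is a
        # no-op, so the earliest timestamp is never overwritten.
        db.execute("UPDATE urls SET first_access = ? "
                   "WHERE short = ? AND first_access IS NULL", (now, short))
        db.commit()

record_first_access(db, "abc", 100)
record_first_access(db, "abc", 200)  # later access: does not overwrite
```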

  • @adithyasama7558
    @adithyasama7558 3 months ago +1

    Can you explain a bit more about the technology choices? You picked MySQL as it has all the requirements you came up with, but wouldn't scaling be easier with something like mongodb compared to MySQL? mongodb also has indexes, so wouldn't mongo be the choice in terms of faster development?

    • @jordanhasnolife5163
      @jordanhasnolife5163  3 months ago

      Can you clarify your assumptions here? Why does mongo lead to faster development? Why does mongo scale better?
      I think that people often say this about NoSQL databases without much evidence to back it up, so that's going to be my retort for you here.
      Generally, most developers are familiar with SQL databases, and I believe that I can get away with using the relational model here, both in the shape of data, and in the sense that I want the ability to have ACID transactions and use single leader replication. Could I use mongo? Sure - but I don't see any clear advantage that it has over MySQL when both work perfectly fine in this situation.

    • @sahilkalamkar5332
      @sahilkalamkar5332 2 months ago

      Heyy @@jordanhasnolife5163, a beginner here: what about the argument that NoSQL is easier to scale horizontally? I always get a bit confused on this point. I think it relates to data locality in NoSQL, meaning we don't need to perform joins across nodes.
      But how do I definitively choose between SQL and NoSQL? Should I always go for SQL if it fits, and otherwise try NoSQL?

  • @raunakchoudhary4781
    @raunakchoudhary4781 2 months ago +1

    @Jordan Hey, where do you read from?

  • @bennyamor
    @bennyamor a month ago +1

    Creating a partition for each short URL in Kafka could be problematic: a large number of partitions leads to increased overhead in terms of memory usage, disk space, and metadata management. Each partition in Kafka consumes system resources such as file descriptors, memory, and network connections, and adds management complexity...

    • @jordanhasnolife5163
      @jordanhasnolife5163  a month ago

      Not proposing this - when I say partition by x, it's shorthand for partition by hash range of x

    • @bennyamor
      @bennyamor a month ago

      @@jordanhasnolife5163 thanks

  • @mickeyp1291
    @mickeyp1291 5 months ago +1

    Loving your videos. As an enterprise system architect for the past 7 years, however, I feel I don't understand WHY you're using hashing. Do you plan to find that hash again? This is the classic ULID approach, which was intended to address replication and consistency issues. When looked up, the ULID can be indexed, and this would solve all the issues and problems you faced. Hashing is for when you need to take raw data and authenticate it somehow. I think you need to revisit this solution; you are using hashing as an indexing method, and that is a red flag for me.

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago

      I really don't see a functional difference between hashing and a UUID here, but I agree with you that this is a perfectly acceptable option. Considering that we're taking in things like the userId and timestamp in order to create our hash, I'd imagine that we'd have equally well distributed data.
      According to this guy, github.com/oklog/ulid, the ULID is created from a timestamp and some random data, which effectively just makes it a variant on a hash. We don't really care about whether or not I can sort these, because we're not running any range queries here.
      Another advantage of hashing is that it allows me to deduplicate requests if somebody submits multiple identical ones, to stop them from creating multiple tiny URLs. I think there are many benefits to hashing beyond just authentication.
      Nonetheless, I appreciate your opinion and think that using something like a random UUID makes for a totally fine key here and in practice works just as well or better than a hash.

  • @adrian333dev
    @adrian333dev a month ago +1

    You've got good humor

  • @mickeyp1291
    @mickeyp1291 5 months ago +1

    B-trees are not scalable, and Kafka uses LSM tables as its main system and is highly scalable. So using ULIDs or UUIDs and replicating over a Kafka queue might increase capacity without degrading your performance, I think (although I'm not an expert in this type of system).

    • @jordanhasnolife5163
      @jordanhasnolife5163  5 months ago +2

      1) I don't know where you're getting that Kafka uses LSM trees, because it's basically just writing to a commit log, therefore not requiring any sort of indexing
      2) What makes you say that B-trees aren't scalable? And what do you mean by "not scalable"?
      Let me know if I'm missing anything!

    • @shuaizhang6502
      @shuaizhang6502 2 months ago +1

      Kafka does not use LSM trees or any B-tree index

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago

      @@shuaizhang6502 Agree

  • @davendrasingh518
    @davendrasingh518 6 months ago +1

    I like this, make more of it. It's very real

  • @brycejunkinz
    @brycejunkinz 3 months ago +1

    goated video

  • @foxedex447
    @foxedex447 20 days ago +1

    bruh why he has only 34k

  • @yrfvnihfcvhjikfjn
    @yrfvnihfcvhjikfjn 6 months ago +1

    1 trillion urls? That's a moderately sized pornography collection 😏

  • @siddharthsingh7281
    @siddharthsingh7281 6 months ago +2

    Bro why don't you share those notes

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 months ago

      Ah I'll get to it eventually, just a bit tougher to export them in a decent format

    • @siddharthsingh7281
      @siddharthsingh7281 6 months ago

      @@jordanhasnolife5163 just dump them in whatever format they're in

  • @wij8044
    @wij8044 2 months ago +1

    The capacity planning was a bit lacking. The decision to go with SQL vs NoSQL becomes evident with better capacity planning, so that's my two cents. Good content overall.

    • @jordanhasnolife5163
      @jordanhasnolife5163  2 months ago +2

      Can you elaborate on how the decision of nosql vs SQL becomes evident through capacity estimates? I'm curious what I've missed

    • @wij8044
      @wij8044 2 months ago

      @@jordanhasnolife5163 You can vertically scale until you reach the physical limit of a commodity machine, which is why a distributed network is preferred when you want to split computation, storage, or memory. For example:
      A typical commodity machine might have maybe 10TB of storage and 256 GB of RAM, so if your DB or storage can't feasibly retain your data, you can opt for a distributed network.
      Of course, going with NoSQL you lose the ACID guarantees, so if there's something like payment processing (maybe an airline booking system), you may need to consider optimistic or pessimistic locks alongside 2PC or 3PC.
      Take all that with a grain of salt; StackOverflow actually serves billions of requests per year without a distributed network! I'd consider them an edge case worth looking into if you haven't already.

  • @tahasaleh4697
    @tahasaleh4697 a month ago +1

    *decrease latency 9:47