Introduction to Partitioning | Systems Design Interview 0 to 1 with Ex-Google SWE

แชร์
ฝัง
  • เผยแพร่เมื่อ 22 ม.ค. 2025

ความคิดเห็น • 38

  • @romannagorkin
    @romannagorkin 6 หลายเดือนก่อน +3

    Gosh. I am not a native speaker and have picked up some English apart from engineering stuff. Amazing!

  • @dvsavocs5290
    @dvsavocs5290 4 หลายเดือนก่อน +3

    I would like to point out for anyone viewing this, that partitioning and sharding have different meaning (at least according to stack overflow and some other websites). Partitioning is done locally, sharding = horizontal partitioning, that is distributing the original database stored in a single machine in multiple machines. The following informaton are probably not 100% correct as I am still learning about it, but it should give an idea on why these are two different concepts. If you are reading this and can point out mistakes please do!
    Partitioning is done locally, a 1tera database is partitioned based on its primary key (say id monotonically increasing) in n partitions, for example from 0 to 100, 101 to 200 etc. This is all done locally and the partitions are still stored on the same original machine. This still offers benefits, as when querying the db, we can now know at priori (based on the id) to which partition we should access independently from others. This also means that a CPU can access different partitions and parallelize its workload, since partitions are independent from each other.
    Sharding distributes an original big database onto different machines, which means a huge improvement as we have access to more cpus, ram etc, as well as potentially more performance per money spent (horizontally scaling is less expensive than vertically scaling a machine). The cost is increased complexity, especially with the notions of replication previously covered in the course, now we don't replicate entire databases, but shards of the whole database, increasing the overall complexity.
    I believe this also means that local secondary indexes are applied to single partitions on the single machine, whereas global secondary indexes are applied to every node (shard) of the network

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      I think that's fair enough. You distribute database partitions on multiple shards.

  • @yogita_garud
    @yogita_garud 4 หลายเดือนก่อน

    This is amazing. Learning a lot pretty quickly with these videos!

  • @VedanshBansal-y8f
    @VedanshBansal-y8f 6 หลายเดือนก่อน +2

    So basically the LSIs are stored in the same shard, as an extra copy with different sort key. But do GSI also store in the same shard?

    • @jordanhasnolife5163
      @jordanhasnolife5163  6 หลายเดือนก่อน

      Well they're global. So it can be stored wherever.

  • @FaisalAnees46
    @FaisalAnees46 7 หลายเดือนก่อน +2

    Awesome video man ! Super qq - towards the end, in your secondary global index - where the shards are partitioned by height, why would Dwyane's 6'3" "height" get hashed to the second shard ? Isn't the hashing happening on the value of heights and because of that Dwyane would hash to shard 1 ?

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน

      So the point is the height index is the secondary one. So his primary hash puts him on 2 but his secondary puts him on 1, now we need a two phase commit.

  • @pl5778
    @pl5778 ปีที่แล้ว +2

    Awesome video as always!

  • @saileshsirari2014
    @saileshsirari2014 5 หลายเดือนก่อน +2

    Awesome as always!

  • @jhguygih
    @jhguygih 2 หลายเดือนก่อน +1

    Can we partition in single node? Like create partition in mysql. Whats the benefit like just going directly to the Hard drive section where the data is? Because I assume the db files are already sized limited and in the B tree we know where they are

    • @jordanhasnolife5163
      @jordanhasnolife5163  หลายเดือนก่อน

      I'm a bit confused by your question here can you elaborate/reword it?

    • @jhguygih
      @jhguygih หลายเดือนก่อน +1

      @jordanhasnolife5163 I was trying to bring the idea of same network partitions like we have in mysql. I lookup for ans they partition table files to be quicker.

    • @jordanhasnolife5163
      @jordanhasnolife5163  หลายเดือนก่อน +1

      @@jhguygih Sometimes we may partition on the same node with pre-defined boundaries to be able to move them around more easily

  • @xiyunliu9252
    @xiyunliu9252 4 หลายเดือนก่อน +1

    Thanks for the video! I have a question though. The “partition” usually means partitioning the data on one node and sharding is across multiple nodes, but here the examples you gave is putting data on different shards and the secondary indexes table is using the same set of nodes( because you said the GSI can be on shard 2 while primary index is on shard 1) I’m confused. Could you please explain more on it? Thank you very much!

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      Yes sorry I often use the terms interchangeably. For the purposes of videos when I say partitioning assume I mean sharding. We take the partitions of the tables and place them on different shards.

    • @xiyunliu9252
      @xiyunliu9252 4 หลายเดือนก่อน

      @@jordanhasnolife5163 Thanks!

    • @xiyunliu9252
      @xiyunliu9252 4 หลายเดือนก่อน +1

      @@jordanhasnolife5163 Then are all the partition videos you shared from this video to the next section are about sharding?

    • @jordanhasnolife5163
      @jordanhasnolife5163  4 หลายเดือนก่อน

      @@xiyunliu9252 I think that's fair to say.

  • @gammagorom
    @gammagorom ปีที่แล้ว +2

    Is sharding and partitioning the same thing? In this video, the terms are used interchangeably but online, I’m seeing mixed views about what the two terms mean

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว

      I use them interchangeably, I've seen other people say they aren't the same but to tell you the truth I've never really seen the nuance there

  • @LeoLeo-nx5gi
    @LeoLeo-nx5gi ปีที่แล้ว +2

    Hello Jordan, great video, just one question :- In case of hash based approach won't we need rehashing of entire dataset (again and again) if say any shard goes down? or we add extra shards (I don't know if we can do this later on but yeah)?
    That would be one of the biggest problem correct? If so any idea how do we handle the above scenarios.....

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +1

      I'd say wait until next Thursday :). Thanks for bringing this up

    • @LeoLeo-nx5gi
      @LeoLeo-nx5gi ปีที่แล้ว +1

      @@jordanhasnolife5163 sure sure

    • @pl5778
      @pl5778 ปีที่แล้ว +1

      this is where consistent hashing is the better option?

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว

      @@pl5778 A few good options here I'd say but consistent hashing is one of them!

    • @LeoLeo-nx5gi
      @LeoLeo-nx5gi ปีที่แล้ว

      @@pl5778 Yeah u are correct!!

  • @vankram1552
    @vankram1552 ปีที่แล้ว +2

    Wish my company would do any of these things. We are literally moving to big table (we're getting flown to google for training) because management refuse to investigate not using a managed postgres database that does support the replication strategy that our scale requires 😢

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +1

      Ah rip - for what it's worth partitioning can really complicate things, but sometimes it's necessary. Vertical scaling can be valid, but if they're flying you all out there just to train to learn how do use big table seems dumb to me lol

  • @aloha9938
    @aloha9938 7 หลายเดือนก่อน

    this doesn’t make sense to me, range based partitioning leads to hotspots, so we hash ranged based partitioning for even distribution, then we keep a global secondary index for faster range queries, but wouldnt gsi still lead to hotspots because the shards for the secondary index are partitioned like a normal range based partitioning? it feels like we came full circle

    • @jordanhasnolife5163
      @jordanhasnolife5163  7 หลายเดือนก่อน +1

      If we're using hash based partitioning it's likely because we don't want range based querying. If we need range based querying, we'll have to do range based partitioning and be smart about how we balance the load there.

  • @losco_arti
    @losco_arti 8 หลายเดือนก่อน +2

    Really like ur content!

  • @chaitanyatanwar8151
    @chaitanyatanwar8151 7 หลายเดือนก่อน +2

    Thanks!

  • @htm332
    @htm332 ปีที่แล้ว +4

    👑

  • @asian1599
    @asian1599 4 หลายเดือนก่อน +1

    this is why i love jews

  • @cricket4671
    @cricket4671 ปีที่แล้ว +2

    Have you thought about channel name "Gigachads", m almost tempted to steal it 😛

    • @jordanhasnolife5163
      @jordanhasnolife5163  ปีที่แล้ว +1

      I appreciate the suggestion, but being a gigachad is a lifestyle choice, and I can't steal it for personal gain