Gosh. I am not a native speaker and have picked up some English apart from engineering stuff. Amazing!
I'd like to point out, for anyone viewing this, that partitioning and sharding have different meanings (at least according to Stack Overflow and some other websites). Partitioning is done locally; sharding is horizontal partitioning across machines, that is, distributing a database originally stored on a single machine over multiple machines. The following information is probably not 100% correct as I am still learning about it, but it should give an idea of why these are two different concepts. If you are reading this and can point out mistakes, please do!
Partitioning is done locally: a 1 TB database is partitioned on its primary key (say, a monotonically increasing id) into n partitions, for example ids 0 to 100, 101 to 200, etc. This is all done locally and the partitions are still stored on the same original machine. This still offers benefits: when querying the db, we know a priori (from the id) which partition to access, independently of the others. It also means the machine can work on different partitions in parallel, since partitions are independent of each other.
Sharding distributes an originally large database across different machines, which is a huge improvement since we get access to more CPUs, RAM, etc., as well as potentially more performance per dollar spent (scaling horizontally is usually less expensive than scaling a single machine vertically). The cost is increased complexity, especially combined with the replication covered earlier in the course: we no longer replicate entire databases but individual shards of the whole database.
I believe this also means that a local secondary index covers only the partition stored on its own machine, whereas a global secondary index covers the data on every node (shard) of the network.
I think that's fair enough. You distribute database partitions on multiple shards.
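To make the id-range idea from the comment above concrete, here is a minimal sketch, assuming the hypothetical boundaries 0 to 100, 101 to 200 and 201 to 300. `UPPER_BOUNDS` and `partition_for` are made-up names for illustration, not any database's API.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bound of each partition:
# partition 0 holds ids 0-100, partition 1 holds 101-200, partition 2 holds 201-300.
UPPER_BOUNDS = [100, 200, 300]


def partition_for(row_id: int) -> int:
    """Return the index of the partition whose id range contains row_id."""
    if row_id < 0 or row_id > UPPER_BOUNDS[-1]:
        raise ValueError(f"id {row_id} is outside every partition")
    # bisect_left finds the first upper bound that is >= row_id.
    return bisect_left(UPPER_BOUNDS, row_id)


if __name__ == "__main__":
    for row_id in (0, 100, 101, 250):
        print(row_id, "-> partition", partition_for(row_id))
```

The same lookup works whether the partitions live on one machine or on several shards; only what the returned index points at changes.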
This is amazing. Learning a lot pretty quickly with these videos!
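So basically the LSIs are stored in the same shard, as an extra copy with a different sort key. But are GSIs also stored in the same shard?
Well, they're global, so an index entry can be stored wherever the index's own partitioning puts it, not necessarily on the shard that holds the row.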
Awesome video man! Super quick question: towards the end, in your global secondary index, where the shards are partitioned by height, why would Dwyane's 6'3" height get hashed to the second shard? Isn't the hashing happening on the height value, so Dwyane would hash to shard 1?
So the point is that the height index is the secondary one. His primary hash puts him on shard 2 but his secondary index entry goes on shard 1, so now we need a two-phase commit.
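A rough sketch of why that write ends up spanning two shards, under made-up assumptions: player id 3 stands in for Dwyane, `player_id % NUM_SHARDS` stands in for the real primary hash, and heights up to 76 inches are indexed on shard 1. None of these values come from the video.

```python
NUM_SHARDS = 2


def primary_shard(player_id: int) -> int:
    # Stand-in for a real hash of the primary key; shards are numbered 1..NUM_SHARDS.
    return player_id % NUM_SHARDS + 1


def height_index_shard(height_in: int) -> int:
    # Hypothetical range split for the global height index:
    # heights up to 76 inches live on shard 1, taller ones on shard 2.
    return 1 if height_in <= 76 else 2


def shards_for_write(player_id: int, height_in: int) -> set:
    """Shards that must be updated when one player row is written."""
    return {primary_shard(player_id), height_index_shard(height_in)}


def needs_distributed_commit(player_id: int, height_in: int) -> bool:
    # More than one shard means the row and its index entry must commit together.
    return len(shards_for_write(player_id, height_in)) > 1


if __name__ == "__main__":
    # Hypothetical player 3 at 6'3" (75 inches): row on shard 2, index entry on shard 1.
    print(shards_for_write(3, 75), needs_distributed_commit(3, 75))
```

When the row and its index entry happen to land on the same shard, a plain local transaction is enough; the two-phase commit only comes in when they differ.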
Awesome video as always!
Awesome as always!
Can we partition on a single node, like creating partitions in MySQL? What's the benefit compared with just going directly to the part of the hard drive where the data is? I assume the DB files are already size-limited, and with the B-tree we know where they are.
I'm a bit confused by your question here, can you elaborate/reword it?
@jordanhasnolife5163 I was trying to bring up the idea of same-network partitions like we have in MySQL. I looked up the answer: they partition the table files to make queries quicker.
@jhguygih Sometimes we may partition on the same node with predefined boundaries so that the partitions can be moved around more easily.
Thanks for the video! I have a question though. “Partition” usually means partitioning the data on one node, while sharding is across multiple nodes, but the examples you gave put data on different shards, and the secondary index table uses the same set of nodes (because you said the GSI can be on shard 2 while the primary index is on shard 1). I’m confused. Could you please explain more? Thank you very much!
Yes, sorry, I often use the terms interchangeably. For the purposes of these videos, when I say partitioning assume I mean sharding. We take the partitions of the tables and place them on different shards.
@jordanhasnolife5163 Thanks!
@jordanhasnolife5163 Then are all the partitioning videos from this one through the next section actually about sharding?
@xiyunliu9252 I think that's fair to say.
Are sharding and partitioning the same thing? In this video the terms are used interchangeably, but online I’m seeing mixed views about what the two terms mean.
I use them interchangeably. I've seen other people say they aren't the same, but to tell you the truth I've never really seen the nuance there.
Hello Jordan, great video, just one question: with the hash-based approach, won't we need to rehash the entire dataset (again and again) if, say, a shard goes down or we add extra shards (I don't know if we can do that later on, but yeah)?
That would be one of the biggest problems, correct? If so, any idea how we handle those scenarios?
I'd say wait until next Thursday :). Thanks for bringing this up
@jordanhasnolife5163 sure sure
this is where consistent hashing is the better option?
@pl5778 A few good options here I'd say, but consistent hashing is one of them!
@pl5778 Yeah, you are correct!!
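For anyone curious about the rehashing concern in this thread, here is a small sketch comparing plain `hash % num_shards` placement with a basic consistent-hash ring when a fifth shard is added. The `HashRing` class, the shard names `s0` to `s4`, and the choice of 100 virtual nodes per shard are all illustrative assumptions, not how any particular database does it.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value: str) -> int:
    # SHA-256 gives the same result on every run, unlike Python's built-in hash().
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_owner(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` pseudo-random points on the ring.
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def owner(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user-{i}" for i in range(2000)]

# Going from 4 to 5 shards with modulo placement.
moved_modulo = [k for k in keys if modulo_owner(k, 4) != modulo_owner(k, 5)]

# Adding shard "s4" to a ring that already holds s0..s3.
old_ring = HashRing(["s0", "s1", "s2", "s3"])
new_ring = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = [k for k in keys if old_ring.owner(k) != new_ring.owner(k)]

if __name__ == "__main__":
    print("moved with modulo:", len(moved_modulo), "of", len(keys))
    print("moved with ring:  ", len(moved_ring), "of", len(keys))
```

With the ring, the only keys that change owner are the ones the new shard takes over, whereas modulo placement reshuffles most keys across all shards.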
Wish my company would do any of these things. We are literally moving to Bigtable (we're getting flown to Google for training) because management refuses to look into moving off our managed Postgres database to a setup that does support the replication strategy our scale requires 😢
Ah rip. For what it's worth, partitioning can really complicate things, but sometimes it's necessary. Vertical scaling can be valid, but flying you all out there just to learn how to use Bigtable seems dumb to me lol
This doesn’t make sense to me. Range-based partitioning leads to hotspots, so we hash the keys instead for an even distribution, then we keep a global secondary index for faster range queries. But wouldn't the GSI still lead to hotspots, because the shards for the secondary index are partitioned like normal range-based partitioning? It feels like we came full circle.
If we're using hash based partitioning it's likely because we don't want range based querying. If we need range based querying, we'll have to do range based partitioning and be smart about how we balance the load there.
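A quick sketch of the trade-off in the reply above, assuming four shards and hypothetical range boundaries at 100, 200, 300 and 400. The function names are invented for this example.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds: shard 0 holds keys 0-100, shard 1 holds 101-200, etc.
UPPER_BOUNDS = [100, 200, 300, 400]


def shards_for_range_query_range_partitioned(lo: int, hi: int) -> list:
    """Shards touched by a query for keys in [lo, hi] under range partitioning."""
    if not 0 <= lo <= hi <= UPPER_BOUNDS[-1]:
        raise ValueError("query range must lie inside the partitioned key space")
    first = bisect_left(UPPER_BOUNDS, lo)
    last = bisect_left(UPPER_BOUNDS, hi)
    # Adjacent keys live together, so only a contiguous run of shards is needed.
    return list(range(first, last + 1))


def shards_for_range_query_hash_partitioned(lo: int, hi: int, num_shards: int) -> list:
    """Shards touched by the same query under hash partitioning."""
    # Hashing scatters adjacent keys, so in general every shard has to be asked.
    return list(range(num_shards))


if __name__ == "__main__":
    print(shards_for_range_query_range_partitioned(150, 250))
    print(shards_for_range_query_hash_partitioned(150, 250, num_shards=4))
```

The range-partitioned query stays on a couple of neighbouring shards, which is also exactly why a popular key range can turn those shards into hotspots.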
Really like ur content!
Thanks!
👑
this is why i love jews
Thanks asian1599
Have you thought about the channel name "Gigachads"? I'm almost tempted to steal it 😛
I appreciate the suggestion, but being a gigachad is a lifestyle choice, and I can't steal it for personal gain