Yes when you age them out, they might leave gaps in the index tree but also reuse the gaps regularly. They might not “nowhere near each other” but always in between. Sp with larger pages the splits are less likely. (Of course IOT is to a good usecase)
The whole time I was wonder why not just insert on a new index rather than pointing to a UUID, that way it's always ordered. But then you said MySQL defaults to this at the end of the video. That's critical for the 'why' of this video
He talked about this at 4:45. He explains that MySQL uses the primary key index as the table itself, this implementation is called clustered index. In the case of MySQL you can have only one clustered index per table, because the index is the table itself. The other indexes you create are "common indexes" which stores only the index id and a pointer to the actual data, and this pointer is actually the primary key.
Ive used UUID generated outside the database and stored as primary key varchar within the db. Its worked a lot better for over 100 million devices that I was working with. Primary goal was to use it as hashtables when use other systems like redis/dynamo etc. For lookups, with a bit off magic in UUID generation which is not being generated by DB, you can apply Bloom filters as well. I've been out of engineering field for a decade+ but these sort of videos are always much needed to discuss the fundamentals of how things work.
MySQL 8 uses UUID v1 (kind of compatible with UUID v2). By using the UUID_TO_BIN()/BIN_TO_UUID() function with the optional second argument 'swap_flag' it will reformat the UUID before converting it to a BINARY(16). This will make them sequential (since UUID v1/v2 is based on the timestamp).
@@cedricbrisson7240 "What security and why should it apply for this case?" uuidv1 has some characteristic that make it possible for someone who want to *intentionally* clash/guess your id, they might be able to. If YOUR OWN service is creating this id, why would you intentionally do it? Don't just parrot that something is insecure. Knows if it applies to your use case and attack vector as well.
Jesus chill lol It's simply good practice to avoid using vulnerable services/tools when safer alternatives can be used. As someone on a CSEC team I'd rather pre-emptively protect myself than wait for someone to find a use case for a specific flaw :) But sure, I'm just "parroting". You're the kind that praises himself as a software engineer thinking their above "programmers" right?
@@cedricbrisson7240 do you think “integer” is secure? Do you avoid using integer? It’s nonsense statement. Uuid v1 is insecure for a certain use case. Using it as an id you generate yourself is not one of them.
Will this concept be applied to columns that are Varchar and indexed? Because strings are also random like username, email, url slugs, etc.. If so, what is the workaround for storing string values as indexed?
What about distributed NO SQL DBs (document, key-value, etc) where the recommendation is to use partition keys with high cardinality and avoid sequential values because that will create “hot” partitions for inserts at least? Maybe its a different use case but it would be an interesting topic too
ปีที่แล้ว
where did you read that? Could you share the article?
There are 3 new uuids in rfc draft. ulid is equivalent to uuidv7. uuidv8 let's you define your own data. And uuidv6 has a machine key followed by a timestamp akin to ulid which can help with the situation you're describing.
@alegon2007, that is true for NO SQL DB's as it does not use B+ tree datastructure to store data. If I take DynamoDB as an example, It relies on something like partition key to hash and figure out the partition. This makes sure your single partition is not going to be overloaded with requests. While for sort key, it's recommended to keep it sequential if possible as internally in the shard it must have been using a B+ tree to sort on the sort keys
it is, but dont start avoiding it and enforcing new guidelines yet, wait till you reach shopify level scale where you can benefit from that sort of performance improvement. its all about ROI
The worst is using UUID with SQL Server; since the PK is Clustered; performance becomes awful very quickly. I use Postgresql and never had performance issues with UUIDs.
Can you tell me a bit more about this? What means clustered in that case for SQL Server? I just structured a whole DB with PK UUIDv4 in SQL Server but it’s not shipped to prod yet 👀
At least the prod system is not high-performance required I expect around 2000 writes per day and 1_000_000 reads per day And let’s say 10_000_000 rows in total with around 10 tables that are read mainly, 5 more tables for not so much read tables and 2 views (maybe need to be materialized views)
@@ShinigamiZone Clustered means physically ordered on the storage. Random UUID used as a Clustered PK (the default on SQL Server) means your records will be fragmented everywhere on your storage, leading to awful performance.
Perhaps you don't want to reveal the IDs of your records such that it becomes easy to query your records incrementally. This could be a security risk in API available publicly
I remember reading about an incident in GithHub involving auto-incremented counters: "For example, in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users." Kleppmann, Martin. Designing Data-Intensive Applications (p. 250). O'Reilly Media. Kindle Edition.
I've seen DBs with millions of records across various tables perform just fine with UUIDs leveraged everywhere. Either the performance difference is negligible or it only affects extremely large datasets.
it is like a file that DB put indexes in. Lets say you want to get a record from DB with id number 80. if indexing in pages were not used it would have to scan entire DB to find where a record with id=80 saved. This wont be an eficient way of getting data. this is actualy how file system work. In DB it uses indexes that points which sector that record are saved. And pages are basicly where those indexesare saved and its pointers are where actual data/record are saved. So DB first goes to pages and finds index. then looks for the exact record from where index is pointing. Long story short pages are phonebook for DB...
I just randomly stumbled upon your video, and I've always had this itch about UUID performance. Your video really boosted my confidence in what I was thinking. Thanks a lot! Oh, and quick question: Do databases like MariaDB, MySQL, or PostgreSQL automatically play nice with ULIDs? And how do they know to sort 'em out for indexing?
Postgresql, mysql 8++, mariadb 10.7++ support uuid internally. Postgresql have best implementation since it’s saved as 128bit binary but can be accessed as string without custom function on client. MariaDB/MySql client need to call certain function if you want to query a binary uuid field.
hash index for secondary index is possible, but for primary key not all systems can set it with hash. also, uuid is large, we should only use it as primary key when we really need.
Updates are slower on hash indices in the worst case. When a bucket becomes full, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID
You never had to distribute databases into multiple locations, that insert independently and then access global data. I use UUIDs for all the data tables. You have 100 databases with millions of IDs. Just insert independently whenever you want, the worst thing that can happen is a rare collision hypothetically (never happened to me yet). They will still be OK with UUIDs. Now imagine this with auto-increment integer. Insanity. How would you keep track of this auto-increment ID and ensure it's unique across all databases, other then synchronizing them. Auto-increment integers as a key are for script kiddie databases, when you have one local database. I still wouldn't use integers even then. Because you never know, whether you'll need distributed database later.
"Auto-increment integers as a key are for script kiddie databases" I'm willing to bet the house that most tables in most databases powering most applications use auto-incrementing PKs lol
@@ConsuelaPlaysRS That doesn't automatically mean there's a good reason to have such keys. Also most databases are probably local file databases. Everyone can do whatever they want. I'm just saying after some gathered experience, you realize that little gain in being sequential is worth sacrificing for a more robust and secure database.
for the longest time you wouldnt distribute your databases if you could avoid it, i know that sql sharding exists but that is something you would have to build your persistence layer around. With the rise of nosql databases this has changed, although that in turn also has implications for design. Its engineering. The perfect solution is the one that works.
I am confused at the statement that indexes are b+ tree and it gets indexed and it gets balanced but if gets balanced isnt the order gets messed up...i am a bit confused 😐 🤔
@@SmashingChords order in a balanced BST is never there, it's for search only , in a balanced bst search takes less time compared to unbalanced that's the whole premise, it also supports less than and greater than queries easily.
@@Aspartame12he mentioned the ds to be ordered at 3.40 .. aaaaahhhhhhh i am confused now, had the idea of tree now doesnt have the idea of plant as well😂
@@SmashingChords Okay, so a BST is ordered in the sense that there is a order in which the nodes are arranged left child is smaller right child is bigger and so on, if you balance the tree your root value changes , that means you do not have ready access to smallest value or largest value, but the middle value ( approx ) , so while searching for something you start from middle value ( root ) and go left or right to find it. You can also write a simple recursive/stack based code to print contents of a BST in ascending sorted order or descending sorted order.
Sorry, but having a B+ tree with a page size of 2 is very close to degenerate the B+ tree to a binary tree -- and binary trees either need to be rebalanced regularly, or might degenerate into a linear list. Rebalancing is an expensive operation, especially if it involved a secondary storage like HDD / SSD. And that's what B+ trees are optimized for: to minimize HDD access, leveraging having some leeway for inserts and deletes without "reshuffling", and generally keep the tree depth low. This requires the page size to be considerably larger than two, and be harmonized with the block size on the disk. Besides that, different DBMS might offer different variants of indexes, e.g. hash indexes, that are not implemented as trees, but as hash tables. Depending on the use case this might be the index type to choose, and your argument becomes void. I personally don't like UUIDs very much in many use cases, but for other reasons: first is they are not easily graspable for humans, they don't tell a story, don't have some inherent use case information in their structure, ... yet they are very useful when randomness is required, because the systems involved only synchronize their data eventually. But besides that: too much cognitive load ... Anyhow: I don't buy your argument, sorry.
Hum…Hussein, I’m a kind of surprised here…. To a point I maybe doubtful…. Here you’re showing the ordering of the whole rows ( as stating item as key value) when we’re talking about indexes (indices). Indexes are ordered (by nature ;)) and are pointing to the value, no matter where the value is. But here, you’re saying the primary index, the one for the primary key, is not behaving as a normal index…. If that’s the case, that’s a major (I should use uppercase here) problem. That means that we shouldn’t be allowed, in any case(uppercase again) to use anything but autoincrement number ! See, that’s so big it makes me doubt.
My understanding is that's not a problem in-of-itself. But when you have another vulnerability, this will make it way worse. For example, say I found a way to see a user's phone number. I do this for user 1, then user 2, etc. I can run the attack on every single user rather quickly. If the IDs are random, then I have to guess and check user IDs, and I won't be able to affect as many people.
Using numbers as the example is misleading. UUID are not ordered by design so you cannot sort them. I do not understand why the index has to be sorted? What about hash indexes?
Hash indices are even slower to update. Instead of updating a few nodes, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID.
UULD is not 100% accurate, the issue is the datetime ticks, not every generation, generates a tick. So you can still have this shuffling, though it won't be to the same extent.
@@roy20021 I don't really have time to spend time on a stranger on youtube, and you can't really post links on youtube without getting your content deleted. So I can only guide you what to google for. Read about the partitioner of Cassandra and the hash function used. Search for partitioning hot spots and educate yourself about it. Also all relevant databases optimize their indexes for truly random UUIDs and allow tuples to be indexed. That is all I can do for you, if I could I would have posted you direct links instead. Thank youtube for being a P**e of s***.
Do you need to have your index ordered? A linked list? TERRIBLE. A tree? gosh! I always imagined random uuids would be utilizing hash table indexes, which is completely useless to sort. Using random numbers with a sorted index seems like a pretty dumb mistake. I feel like when you say "index", you don't even consider that there's more than one index type.
Hash indices are really fast for select queries, but they make the updates even worse. If a page gets full, now instead of just updating a few nodes, you have to update every single bucket, since the ID % (number of buckets) will result in a different value now.
@@michawhite7613 If it was a student who wrote the hashing lib then yes. In SQL, however, it's so well optimized that the page overflow happens extremely rarely and it's not as dramatic as it may seem. Plus in vast majority of cases it's totally worth it to have inserts longer if it means having O(1) for reading. And the complexity of the inserts is also linear or at least log(n), so insignificant.
All talk without showing numbers. You can theorize on how the index performs all you want, but you should take the time to run some performance tests before-hand and present the data. Otherwise this video is just basic guessing and a waste of time.
What you expect? Watch him calculate how long it takes to conclude which method they use? You do realise he’s talking about how it works, a performance test won’t help with that. If you don’t have time to learn and just need to know which is faster, then just do that. Instead of shutting on other peoples work.
You can analyse algorithms analytically not just statistically. You can request for a follow up video instead of discrediting the hypothesis and the author
There is a big difference between a database that understands a uuid type and is storing it as a number compared with a uuid stored as a string (varchar). Ultimately though using either a int64/bigint or uuid is the right way to create a primary key for a table. Never should a user entered value be a primary key. Otherwise the rest of this video is mostly garbage or irrelevant in the scheme of things when it comes to performance in a database.
database fundamentals course database.husseinnasser.com
This is good , i'm on udemy business, do keep more courses open to business plan users! thank you! great videos you have there
Let's make it bigger.
That's what she said😂😂
That caught me off guard 9:40 😂😂😂 damn all indians are unhinged
Yes when you age them out, they might leave gaps in the index tree but also reuse the gaps regularly. They might not “nowhere near each other” but always in between. Sp with larger pages the splits are less likely. (Of course IOT is to a good usecase)
The whole time I was wonder why not just insert on a new index rather than pointing to a UUID, that way it's always ordered. But then you said MySQL defaults to this at the end of the video. That's critical for the 'why' of this video
He said that in the start of the video as well
You can use a sequence but keeping them has its own scalability problem, plus the pk won’t be stable
He talked about this at 4:45. He explains that MySQL uses the primary key index as the table itself, this implementation is called clustered index. In the case of MySQL you can have only one clustered index per table, because the index is the table itself. The other indexes you create are "common indexes" which stores only the index id and a pointer to the actual data, and this pointer is actually the primary key.
Actually, I was thinking about that question a couple of days ago
Thank you :)
Ive used UUID generated outside the database and stored as primary key varchar within the db. Its worked a lot better for over 100 million devices that I was working with. Primary goal was to use it as hashtables when use other systems like redis/dynamo etc. For lookups, with a bit off magic in UUID generation which is not being generated by DB, you can apply Bloom filters as well. I've been out of engineering field for a decade+ but these sort of videos are always much needed to discuss the fundamentals of how things work.
MySQL 8 uses UUID v1 (kind of compatible with UUID v2).
By using the UUID_TO_BIN()/BIN_TO_UUID() function with the optional second argument 'swap_flag' it will reformat the UUID before converting it to a BINARY(16). This will make them sequential (since UUID v1/v2 is based on the timestamp).
How costly is that operation tho?
uuidv1 is not secure tho
@@cedricbrisson7240 "What security and why should it apply for this case?" uuidv1 has some characteristic that make it possible for someone who want to *intentionally* clash/guess your id, they might be able to. If YOUR OWN service is creating this id, why would you intentionally do it?
Don't just parrot that something is insecure. Knows if it applies to your use case and attack vector as well.
Jesus chill lol It's simply good practice to avoid using vulnerable services/tools when safer alternatives can be used. As someone on a CSEC team I'd rather pre-emptively protect myself than wait for someone to find a use case for a specific flaw :) But sure, I'm just "parroting". You're the kind that praises himself as a software engineer thinking their above "programmers" right?
@@cedricbrisson7240 do you think “integer” is secure? Do you avoid using integer? It’s nonsense statement. Uuid v1 is insecure for a certain use case. Using it as an id you generate yourself is not one of them.
Awesome! Thanks. Dialectic of randomness and order. Very beautiful.
Never thought, Hussein will come with a pun in the video "that's what she said"
He made it bigger.
Will this concept be applied to columns that are Varchar and indexed? Because strings are also random like username, email, url slugs, etc.. If so, what is the workaround for storing string values as indexed?
me usually used index "order by timestamp", thats why UUID (in this case, just for unique identity) not a problem.
The silent joke at 10:00 That's what she said. Haha, classic.
This one cracked me up 09:59 😂
Great work as usual 👏
Please consider making a video on Pinecone and vector databases 🙏
What about distributed NO SQL DBs (document, key-value, etc) where the recommendation is to use partition keys with high cardinality and avoid sequential values because that will create “hot” partitions for inserts at least? Maybe its a different use case but it would be an interesting topic too
where did you read that? Could you share the article?
Different scenario.
There are 3 new uuids in rfc draft. ulid is equivalent to uuidv7. uuidv8 let's you define your own data. And uuidv6 has a machine key followed by a timestamp akin to ulid which can help with the situation you're describing.
Also possible problem is that ULID doesn't have RFC, so implementation can differ.
@alegon2007, that is true for NO SQL DB's as it does not use B+ tree datastructure to store data. If I take DynamoDB as an example,
It relies on something like partition key to hash and figure out the partition. This makes sure your single partition is not going to be overloaded with requests. While for sort key, it's recommended to keep it sequential if possible as internally in the shard it must have been using a B+ tree to sort on the sort keys
Pro tip: watch at 1.5x speed. You're welcome
thx
does it mean that, generally, indexing on a random value column is hurt?
it is, but dont start avoiding it and enforcing new guidelines yet, wait till you reach shopify level scale where you can benefit from that sort of performance improvement. its all about ROI
The worst is using UUID with SQL Server; since the PK is Clustered; performance becomes awful very quickly.
I use Postgresql and never had performance issues with UUIDs.
Can you tell me a bit more about this? What means clustered in that case for SQL Server?
I just structured a whole DB with PK UUIDv4 in SQL Server but it’s not shipped to prod yet 👀
At least the prod system is not high-performance required
I expect around 2000 writes per day and 1_000_000 reads per day
And let’s say 10_000_000 rows in total with around 10 tables that are read mainly, 5 more tables for not so much read tables and 2 views (maybe need to be materialized views)
@@ShinigamiZone Clustered means physically ordered on the storage. Random UUID used as a Clustered PK (the default on SQL Server) means your records will be fragmented everywhere on your storage, leading to awful performance.
Happy Teachers Day Husseinn
For postgresql, why would anyone use UUID , SERIAL is the default method for incremental ID right?
Perhaps you don't want to reveal the IDs of your records such that it becomes easy to query your records incrementally. This could be a security risk in API available publicly
@@dz5483I'm thinking when you get the record from the DB on the front end rather you show the UUID.
@@dz5483I'm thinking when you get the record from the DB on the front end rather you show the UUID.
I remember reading about an incident in GithHub involving auto-incremented counters:
"For example, in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users."
Kleppmann, Martin. Designing Data-Intensive Applications (p. 250). O'Reilly Media. Kindle Edition.
@@dz5483 in this case, people use slug which is unique , otherwise it's like youtube watch?v=qRv5g5MEF2k , let me know i anyone has a better method.
Please do a video on vector databases.
10:00 would have never expected that joke from you 😂
BRUHHH XDDD
I've seen DBs with millions of records across various tables perform just fine with UUIDs leveraged everywhere.
Either the performance difference is negligible or it only affects extremely large datasets.
i think so too. usually data has no correlation between each other for example customers in a table. for timeseries data uid v7 can be used
Databases are much more powerful than we think.
Any one can explain about what is page in that explanation? i don't really understand
it is like a file that DB put indexes in. Lets say you want to get a record from DB with id number 80. if indexing in pages were not used it would have to scan entire DB to find where a record with id=80 saved. This wont be an eficient way of getting data. this is actualy how file system work. In DB it uses indexes that points which sector that record are saved. And pages are basicly where those indexesare saved and its pointers are where actual data/record are saved. So DB first goes to pages and finds index. then looks for the exact record from where index is pointing. Long story short pages are phonebook for DB...
What about database performance of using a uuid stored in varchar column vs new db versions which support a native uuid type
Doesn't change the problem, and in fact you'll probably get even worse performance since you'll have a less compact data type.
So what are we supposed to use as a primary key identifier in MySQL, if we use distributed database setup?
using UUID for distributed database is fine. the video is about using identifier in general case. using distributed database is a special case.
set page fill to 71% and periodically rebuild your index at low fragmentation percent and problem solved for indexes on uuid4
can you please point to documentation for these? would be helpful.
What shower of a db requires that you rebuild it all the time haha
@@ehfoss set up a service that runs every so often that rebuild your index programatically during periods of low load.
@@durgeshchoudhary To my fellow neckbeards who wish to know the black arts.
th-cam.com/video/jx-FuNp4fOA/w-d-xo.html
If your IDs are public and you make them sequential, be aware that third parties will know the number of entities created over a time period.
Enumeration attack, you can prehash to protect against that
What about NOSql databases like mongo db?
I just randomly stumbled upon your video, and I've always had this itch about UUID performance. Your video really boosted my confidence in what I was thinking. Thanks a lot!
Oh, and quick question: Do databases like MariaDB, MySQL, or PostgreSQL automatically play nice with ULIDs? And how do they know to sort 'em out for indexing?
Postgresql, mysql 8++, mariadb 10.7++ support uuid internally. Postgresql have best implementation since it’s saved as 128bit binary but can be accessed as string without custom function on client.
MariaDB/MySql client need to call certain function if you want to query a binary uuid field.
@@nirnullz I asked about ULID, bro ?
How about when we go ahead and distribute the data. Won't this make the last shard hot all the time?
His videos are so interesting but so slow. I can easily watch it at 2x speed.
how about ULID thought?
ULID leaks timestamp. For sensitive data, you may not want to expose this info.
lowkey "That's what she said" reference at 9:57 😂
Missing the point. A UUID allows an application to create the ID rather having a two step approach of asking the DB for an ID first.
Applications can create a ULID.
if a B-tree index on UUID is hurt, will the hash index on UUID be better?
hash index for secondary index is possible, but for primary key not all systems can set it with hash. also, uuid is large, we should only use it as primary key when we really need.
My guess is that it'd be better but would still suffer because inserts would be stored far apart, and so you'd still have the caching issue
Updates are slower on hash indices in the worst case. When a bucket becomes full, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID
😂😂😂😂
Seems to be a little bit small, so let's make it bigger............ that's what she said
9:56
does this apply also for uuid5?
Yes.
UUIDv6 is the implementation which is proposed to fix this
This is very narrowly scoped on dbs that don't optimize for distributed cases...
You never had to distribute databases into multiple locations, that insert independently and then access global data. I use UUIDs for all the data tables. You have 100 databases with millions of IDs. Just insert independently whenever you want, the worst thing that can happen is a rare collision hypothetically (never happened to me yet). They will still be OK with UUIDs. Now imagine this with auto-increment integer. Insanity. How would you keep track of this auto-increment ID and ensure it's unique across all databases, other then synchronizing them. Auto-increment integers as a key are for script kiddie databases, when you have one local database. I still wouldn't use integers even then. Because you never know, whether you'll need distributed database later.
"Auto-increment integers as a key are for script kiddie databases" I'm willing to bet the house that most tables in most databases powering most applications use auto-incrementing PKs lol
@@ConsuelaPlaysRS That doesn't automatically mean there's a good reason to have such keys. Also most databases are probably local file databases. Everyone can do whatever they want. I'm just saying after some gathered experience, you realize that little gain in being sequential is worth sacrificing for a more robust and secure database.
I work for a major b2c org and we use auto increment IDs to store order info 😅
@@ConsuelaPlaysRS While I agree that's true and it's convenient in many case,it just won't work in large distributed system.
for the longest time you wouldnt distribute your databases if you could avoid it, i know that sql sharding exists but that is something you would have to build your persistence layer around. With the rise of nosql databases this has changed, although that in turn also has implications for design.
Its engineering. The perfect solution is the one that works.
thanks so much its make sense now
If you need an index on usernames, won't you have the same problem?
I am confused at the statement that indexes are b+ tree and it gets indexed and it gets balanced but if gets balanced isnt the order gets messed up...i am a bit confused 😐 🤔
The tree gets balanced by shifting which node is treated as root , balanced binary search tree concept.
@@Aspartame12 but if change the root isnt the root becomes the first node then where is the order still same confuision
@@SmashingChords order in a balanced BST is never there, it's for search only , in a balanced bst search takes less time compared to unbalanced that's the whole premise, it also supports less than and greater than queries easily.
@@Aspartame12he mentioned the ds to be ordered at 3.40 .. aaaaahhhhhhh i am confused now, had the idea of tree now doesnt have the idea of plant as well😂
@@SmashingChords Okay, so a BST is ordered in the sense that there is a order in which the nodes are arranged left child is smaller right child is bigger and so on, if you balance the tree your root value changes , that means you do not have ready access to smallest value or largest value, but the middle value ( approx ) , so while searching for something you start from middle value ( root ) and go left or right to find it.
You can also write a simple recursive/stack based code to print contents of a BST in ascending sorted order or descending sorted order.
so what about storing uuid in hashmap index?
Sorry, but having a B+ tree with a page size of 2 is very close to degenerate the B+ tree to a binary tree -- and binary trees either need to be rebalanced regularly, or might degenerate into a linear list.
Rebalancing is an expensive operation, especially if it involved a secondary storage like HDD / SSD. And that's what B+ trees are optimized for: to minimize HDD access, leveraging having some leeway for inserts and deletes without "reshuffling", and generally keep the tree depth low. This requires the page size to be considerably larger than two, and be harmonized with the block size on the disk.
Besides that, different DBMS might offer different variants of indexes, e.g. hash indexes, that are not implemented as trees, but as hash tables. Depending on the use case this might be the index type to choose, and your argument becomes void.
I personally don't like UUIDs very much in many use cases, but for other reasons: first is they are not easily graspable for humans, they don't tell a story, don't have some inherent use case information in their structure, ... yet they are very useful when randomness is required, because the systems involved only synchronize their data eventually. But besides that: too much cognitive load ...
Anyhow: I don't buy your argument, sorry.
Hash table would still suffer from performance as it would have to load from disk if inserts aren't stored close to each other
it was just for an example dumbass we all know what b trees do
Hum…Hussein, I’m a kind of surprised here…. To a point I maybe doubtful….
Here you’re showing the ordering of the whole rows ( as stating item as key value) when we’re talking about indexes (indices).
Indexes are ordered (by nature ;)) and are pointing to the value, no matter where the value is.
But here, you’re saying the primary index, the one for the primary key, is not behaving as a normal index….
If that’s the case, that’s a major (I should use uppercase here) problem.
That means that we shouldn’t be allowed, in any case(uppercase again) to use anything but autoincrement number !
See, that’s so big it makes me doubt.
For clustered index, also called IOT the value is in the index - however even if the value is not there you need to shuffle them
I learned this the hard way. In mongodb I keep getting 'document not found' even though it does.
Why not use auto incremented IDs instead
My understanding is that's not a problem in-of-itself. But when you have another vulnerability, this will make it way worse. For example, say I found a way to see a user's phone number. I do this for user 1, then user 2, etc. I can run the attack on every single user rather quickly. If the IDs are random, then I have to guess and check user IDs, and I won't be able to affect as many people.
I was just thinking about this 😵
Great info. I need a tutorial on how to do this with PlanetScale. I ordered your Udemy class. You should add it to that.
👍
10:02
I messed up 😂😂😂 gotta talk to my CTO as soon as possible
Using numbers as the example is misleading. UUID are not ordered by design so you cannot sort them. I do not understand why the index has to be sorted? What about hash indexes?
You can sort them by comparing their raw byte or hex string representations, both is used depending on your db type
@@berndeckenfels yes, but it doesn't make sense, you have no control over the sort order.
@@hipertracker for an index which is supposed to find a specific UUID the sort order does not matter as long as it can be used to find it.
Hash indices are even slower to update. Instead of updating a few nodes, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID.
did he just make a that's what she said joke at 10:00 😂
UULD is not 100% accurate, the issue is the datetime ticks, not every generation, generates a tick. So you can still have this shuffling, though it won't be to the same extent.
And now you get locking contention on that last page. Fix one problem, create another.
What about turning the uuid to binary(16) ?
You're like my college professor who lives in a fantasy world.
how do you detect and troubleshoot bad page splits?
That's what she said 🤭
Use escalidraw
They never made sense to me. There's no force powerful enough in the universe to convience me I need them over autoincremented integers.
I get the sense you hate indexed random UUIDs in databases.
the speed of the video is 1.25x
you've never worked with distributed databases 😂
can you elaborate?
@@roy20021you need uuid to handle data migrations, data should be independent of where it's located
@@roy20021 he is correct.
@@malibudancer8472 ok, can you elaborate a bit more?
@@roy20021 I don't really have time to spend time on a stranger on youtube, and you can't really post links on youtube without getting your content deleted. So I can only guide you what to google for.
Read about the partitioner of Cassandra and the hash function used. Search for partitioning hot spots and educate yourself about it.
Also all relevant databases optimize their indexes for truly random UUIDs and allow tuples to be indexed.
That is all I can do for you, if I could I would have posted you direct links instead. Thank youtube for being a P**e of s***.
Do you need to have your index ordered? A linked list? TERRIBLE. A tree? gosh! I always imagined random uuids would be utilizing hash table indexes, which is completely useless to sort. Using random numbers with a sorted index seems like a pretty dumb mistake. I feel like when you say "index", you don't even consider that there's more than one index type.
Hash indices are really fast for select queries, but they make the updates even worse. If a page gets full, now instead of just updating a few nodes, you have to update every single bucket, since the ID % (number of buckets) will result in a different value now.
@@michawhite7613 If it was a student who wrote the hashing lib then yes. In SQL, however, it's so well optimized that the page overflow happens extremely rarely and it's not as dramatic as it may seem. Plus in vast majority of cases it's totally worth it to have inserts longer if it means having O(1) for reading. And the complexity of the inserts is also linear or at least log(n), so insignificant.
All talk without showing numbers. You can theorize on how the index performs all you want, but you should take the time to run some performance tests before-hand and present the data. Otherwise this video is just basic guessing and a waste of time.
I appreciate the conceptual overview
What you expect? Watch him calculate how long it takes to conclude which method they use?
You do realise he’s talking about how it works, a performance test won’t help with that.
If you don’t have time to learn and just need to know which is faster, then just do that. Instead of shutting on other peoples work.
You can analyse algorithms analytically not just statistically. You can request for a follow up video instead of discrediting the hypothesis and the author
10:01 we said it at the same time 😂
There is a big difference between a database that understands a uuid type and is storing it as a number compared with a uuid stored as a string (varchar). Ultimately though using either a int64/bigint or uuid is the right way to create a primary key for a table. Never should a user entered value be a primary key. Otherwise the rest of this video is mostly garbage or irrelevant in the scheme of things when it comes to performance in a database.
why do you have to make this a 20 min video???
ULID > UUID
First!
Congratulations
Incredible achievement
Wow
@@gradientObrought to you by the Honda Gang