1:42 if users can change the ID in a URL to access data they are not allowed to, then the fix isn't to switch from int to uuid. You need to implement proper permission checks before serving the request. There are scenarios where non-guessable IDs won't protect you. For example copy&pasting - most normal users have no idea which part of a URL contains critical data so they will just send the whole thing to a colleage/friend. Or someone bookmarks a page and later their access role is revoked.
Bingo. If you can change ID fields, why can't you change uuid fields? It seems you have the same issue. Correct way is for the middleware or DB to prevent calls in using IDs which you aren't supposed to access. Also, in many DBs, IDENTITY fields are read only.
Great point (and probably one I didn't make that well in my video). It should not be possible for a user to change an ID in a URL to see things for another account or record that they are not allowed to se.
That is correct, but IDs in a sequence still leak information and that might be undesired - e.g. if you have order numbers that work like that and you make a couple of orders a day apart, you would know how many orders a business got just by looking at your collected order numbers. By itself this might not be that useful, but once it's combined with other information this might just become a problem.
RDBMS is a technology developed in the early era of computing history and RDBMS are well optimised to use INT as PK. Then the internet age came and when partitioning and sharding are implemented to solve the problem of scaling horizontally, the INT PK became a limitation. With UUID, multiple nodes can generate the PK, allowing horizontal scaling.
Totally false - this opinion belongs to the hype of nosql blabla buzzwords... PK is never a limitation, because it is a concept, implemented with technology to avoid duplicates.
Agree with this statement. It’s not some nosql thing either as the other responder said - if you truly want databases that can scale with your application, active/active and multi datacenter uuid is the right tool for the job. And yes people will hammer other solutions (even/odd, increment by 3/4) but this is a manual and unnecessary step. Just use uuid if active+ setups are in the cards
But if I use both, for example, a users table: id: bigint, uuid: uuid, name: varchar(255) Considering that I will only expose the uuid to the API, I would have GET: users/:uuid Which would be the query SELECT * FROM users WHERE uuid = param.uuid If I have a table with tens of millions of users, I would have to create an index for this query, right? Then we fall into the disadvantage of uuid, which is index fragmentation. Since we always will have to search by uuid, wouldn't it be better to use only it as the primary key? Does the performance in joins justify having both, even if they take up more space?
Another con of uuid4 as your primary key: Normally you will cluster on primary key, but if the PK is a uuid4 then your table will be constantly fragmenting.
@@avwie132 Sure, but you'd better be sure it doesn't matter in each case. Every PK creates an index for itself, and so a uuid4 PK's index will be constantly page splitting, even if you mark the table nonclustered.
They are separate concerns, that should not be comingled. Primary keys are there for referential integrity within your database, and should be int for maximum performance. UUIDs are there to uniquely identity a records at higher application layers, or across databases , or for replication - so they should be added as an additional column when needed.
@@DatabaseStar Using both Int PK & UUID has some pros but it has some cons of both. It increases row size significantly and looses advantage of preparing data & references before performing actual writes. Also, using ORM or generating models from SQL in application code will require to additionally worry about hiding integer primary keys in API responses. I would rather pick one depending on the table and the amount of writes & reads it will receive.
@@dyto2287 If your ORM entities are somehow making it all the way to your API controllers where you have to hide your primary keys, there's something terribly wrong with your application architecture. ORM entities should not make it past your repository layer, and the rest of the application should be completely isolated from having to deal with database concerns such as primary keys and referential integrity.
@@squintps In a perfect world yes but in reality when you need to build fast and ship features too much of isolation and guardrails will bite you in the ass slowing down all development. Don't over-complicate stuff before it's absolutely necessary.
For the sorting pros of integer primary keys, I always add 3 columns at the end--note as a text field for when something needs to be clarified about the data record, date_created, and date_updated. date_created will be set by default to something like systimestamp and date_updated will update to the date time stamp when the record is updated. a trigger will fire with an on update command clause. then in Oracle I add a virtual date_created_or_updated where its date_created if no updates have been before, and date_updated if updates have been performed. I put date_updated to null at the beginning. That way you can sort to see when the records were created like sorting the integer primary keys.
Well, I just use UUID v7 as the primary key. Additionally, I allow my authenticated clients to define its value when creating or upserting records (though I validate on the backend to ensure it's a valid V7 UUID, as monotonic increment is important for indexing). This way, if a mobile device works offline and needs to sync new or updated items, it can generate a UUIDv7 offline, and the ID won't change after the upload. Moreover, there's no need to define two IDs (local and remote) for each item-there's always one unique ID for a specific item. Since it starts with a Unix timestamp (instead of random noise), I can sort by ID on the server side without needing a separate creation time. The best of both worlds. Yes, it takes up more space compared to an auto-incrementing (big)int, so joins may be slightly slower, but if I ever need to migrate or merge databases, I can use the original IDs without conflict or recalculation across multiple tables. There are so many reasons to use UUIDv7 that I don't mind those nanoseconds of extra processing time.
Integer IDs are NOT a security risk, if proper application security is implemented. I should be able to take any ID - regardless of the type of value - and not be able to access the record unless I'm specifically authorised. And anyone who thinks having a non-guessable ID is a "layer" of security is fooling themselves with "security through obscurity".
@Grumpicles - thank you, this is the first comment done by somebody with experience, all the other comments, in my opinion, are done by people using nosql approach because it is cool. sharding and partitioning for applications that will use at max one thousand records... c'mon!!
Yes, good SEs should not perpetuate this myth. UUID (or any other UID algo) as an identifier is really about generating identifiers at application layer (python etc), in distributed systems (multiple services/process/threads; "scaling" as this video covers).
My personal approach would be like this. Every time I make a new table, my primary key is an integer. But when I need to expose the data to the application, I don't expose the primary key. That's really bad! So instead I make another column just to store UUID and I set the default value to "UUID()" so that every time I insert a new data, an UUID will automatically be generated. But my traditional method use varchar to store UUID. I might need to change it in the future since you said it would be better to store it in binary. Nevertheless, by doing this I can expose the data just by exposing the UUID. Let me know if there's a better approach than mine, cheers.
3:02 Frequency of collisions depends on details of how UUIDs are generated. Actual probability might be much higher than a theoretical probability that assumes a perfectly distributed random variable. I've seen some very poor implementations. Having some (very rare) collisions is not a problem if a collision doesn't lead into a catastrophe. But a software engineer has to make sure it does not. Sometimes you just have to be sure that the IDs will always be unique and no collision will ever happen.
@@Davide73 Yes. It protects from doing potentially harmful inserts case of a collision. But it doesn't stop a careless programmer passing on the UUID before doing the insert or otherwise making sure there is no collision.
@@FINALLYQQQQAVAILABLE Sorry, I don't get yuor point; the primary key avoids any duplicate UUID, and that's all. If a dev makes a mistake, it will receive the error.
@@Davide73 There is a non zero chance of collision, which grows exponentially with table size, so extra error checking code is needed, and if forgotten in places, there is higher chance of app crash or server error. Also extra work in generating ids, so they aren't as care free as video implies.
They don't have to be incremental....just unique. And really, there are rare use cases when one needs to order by UUID or Int anyways. The ordering is most often by one or a few columns in the table.
what I did is retain the default integer id and add another unique field of uuid v4. this way a merger/sorting can happen and you can give the user at some end point the uuid as an id to reduce security risks. the only big con about this is the memory usage (both in runtime and storage).
I prefer to use auto-increment INT or hybrid. I have the kind of luck that I could be pulling on a rope (to wrap it up and put it away) and it would get stuck on a completely spherical object, like a beach ball. The last thing I need is for that type of bad luck to generate identical UUIDs for 2 records in a db. I know they say that UUID's likelihood of duplicates is near zero but INT is exactly zero. They used to say that MD5 duplication was also near zero, but we all saw how that turned out.
I used integers with MySQL along, and then I was learning Postgres and I learned about UUIDs and I learned the advantages of UUIDs and I became converted/converted to UUIDs.
The advantage of UUID is, I can set it manually, I can use it in memory before storing an item to the database. On an autoincrement value, I have to have something temporary in memory before, then I have to save my item, and after that I have to retrieve the value, the database generated. Quiet a lot of steps. Becomes all obsolete by setting UUID myself before saving.
When you create a record, the DB returns the ID generated so there isn't any additional request. On a UUID you have an additional step of creating it, and debugging your app is a more tedious with everything referencing a UUID. Also, realistically there is a chance of collision, which grows exponentially as your table size increases (think birthday problem), so you need extra error checking on all inserts.
One thing that relates to a different discussion which is a use a surrogate key not a name or something for the primary key. So say I have my table city with the column city that stores the name of the city and city_id which stores the primary key as an automatically generated UUID (which implementation will differ in MySQL--bin, Postgres/PostgreSQL--UUID, Microsoft (MS) SQL Server--can't remember how its handled, and Oracle--raw(16)). So the surrogate key means we can update the city if we need to. However, I think there should be a term like functional primary key where its unique and not null. Its not the primary key because we want a value that is independent of the primary key, but its important enough and unique enough that it is commonly used to identify something. Examples of "functional primary keys"/natural keys include first name, last name, email address, phone number, physical address, and name. Each of these should be mapped 1 to 1 to the entity. So if I know your first and last name that should map to 1 person. I say should because maybe 2 people have the same first and last name. We could say well then your middle name has to be unique. Well then the father and the son have the same first, middle, last. well then add the prefix like John James Smith 7 is son of John James Smith 6. An email address should have a one to one relationship with a person, but this is often not the case where you multiple email addresses, phone numbers, physical addresses I guess if you own 2 houses, names, etc. The point I'm getting at is I think with a surrogate key there is a loss of how the functional primary key should have a 1 to 1 relationship and be unique and not null, so a new concept like functional primary key could be introduced. It could be that said that its functionally primary, even though its not really primary because if its really primary it should have no relation to the input data. I hope that made sense.
Good point about the functional primary key and how it's often unique. It's a good idea to separate it from the table primary key. I've heard the term "business key" used to refer to this, or even a "candidate key".
@@DatabaseStar I think candidate key is something that could be a primary key. I think there's a difference between candidate key and "functional primary key".
Ok guys, you won! I am going to change my recipe system which contains 30 recepies using 2000 ingredients to use uuid! Soon a safe merge can be done as my next recipe will be safer once no one will guess next id! Be safe guys!
Then you are probably doing it wrong. UUIDs are not a replacement for sequences. Just use them if you want/need to expose entities. So your api uses UUIDs as parameters instead of actual valid ids. But using integers as primary keys is the most performant way to handle relationships, no question about that.
@@Ewig_Luftenglanz If your distributed model allows for collisions, the issue is the model, not the PK. If you need to put the entities together, then your DB is not distributed. But even if you absolutely realize integers are not the best solution in your very specific case, not even composite keys, UUIDs are no replacement for integers. Those are just two totally separated use cases still.
UUIDs are almost always used as *indexed* secondary columns anyways so the "storage" argument doesn't even make sense at all. Because removing the indexed Integer column means less storage wasted since you only care about your unique IDs. I find a lot of people ignoring them anyways in queries... so again, what's the point? I use a special kind of UUIDs, mostly a more compact base62 nano ID. It saves a lot of space, and you can have a very good random range in just 16 bytes.
I really like int: date+time+sequence or other. Prepend w another unique, such as client id, server id. You can pull entries away from this or that db and know exactly what's up. There's a lot of options, but you can pack info into that. Or simply use a string and parse it. Custom solution.
@@DatabaseStar As long as no crucial info is in the index scheme, it works for app dev, and discovering data genesis without adding columns. You simply need to develop a parser and search. Good for science, POS and logistic integration. Works for me. Maybe not immediately handy, but helps with drill-down.
I do pentest and UUID never stopped me from compromising the system, never. ID or UUID if it's send to the browser then that's just it. You don't get hacked just because you show the ID/UUID, you get hacked because of IDOR/the lack of permissions. The more capable the system is the harder it is to secure it.
Great video. Trying to decide between UUIDv7 and integers for a new app atm. It's the kind of app that would need DB sharding if it ever took off. Wouldn't that be difficult with integer IDs? I know some people say stick with integers until then, but wouldn't switching every table from integers to UUIDs and then rebuilding all the FKs be a nightmare? Would love to hear your thoughts.
Good question. On one hand, optimising for possible DB sharding in the future could be a "premature optimisation". On the other hand, using UUD now is planning for the future. I don't think it's a nightmare to migrate every table, but would be a little tricky. I imagine it's something you can do in stages, perhaps one table (or one relationship) at a time. But if you're pretty sure you'll need UUIDs, then it may be better just to use that from the start.
@@Davide73 Depends on the version, the commonly used UUIDv4 is not sortable, yes, there is other versions that are sortable, but it's as complicated as the naming of Xperia phones (Xperia 5 iii for example), ULIDs are my preference today.
I don't quite understand the disadvantage of integers and the advantage of UUIDs about how integers have to be generated by the database and UUIDs can be generated by the client. I understand, kind of.
The idea is that if you use incremental int, you send the request to the backend, in which in turns does some computations and sends the insert sql query to the db, which then checks which integer is next and inserts that (for example id column) and then gives it back to the backend which then it gives it back to you (something like user created with id=42) In the case of uuid the frontend randomly generates the user uuid as the id column and then sends it to the backend to be put in the db without db need to do anything. Meaning you know in advance the user id without even hitting the database and you dont need db to respond to proceed with what you are doing
The comment from "someoneanonymous6707" is right. The integer IDs need to be generated by the database because that's where the record of available integers is stored (e.g. within a sequence). With a UUID, it could be calculated by the database, or could be calculated by any system that interacts with it.
@@someoneanonymous6707 So, if you have a bug in your application, you won't have a UUID bat a bunch of random chars for example with "/" instead of "-"... totally crazy, if you use a proper RDBMS it is safer to leave this task to the engine itself, not to the application.
@@DatabaseStar No, in particular if you have "any system that interacts with it", you must leave this task to the RDBMS, because that's the only way to be sure that your key will be properly generated. "any system" could create the key in a different way.
2:30 ... I don't know why people keep summarising uuids in general as random. Only v4 is, and parts of some other versions. It's not inherent to uuids in general.
I believe when people say UUID they usually mean UUID v4 which is the "mostly random numbers and text" one. Other types of UUID as mentioned include time components. But it's a good point.
@@HappyCheeryChap its all related, and this is basic Database Design here. Effectively, UUID's are unique. The #1 point of using a UUID or Identity integer is to make each row unique...Normal Form 1...google Database Design Normal Forms (tech from the early 1970s and still massively important today)...there are other DB design normal forms too. The #2 reason is to provide a table with a RELEVENT unique key...the UUID or Integer can be a PK, but its Un-Natural, and not really related directly to the data contents for each row. For example, a Customer Table can have a UUID PK, but this does not stop a user from keying in the same customer many times ( name, address, phone, web site, blah, blah, blah)...a Natural PK must also be assigned to the customer table by one or a combo of other customer fields...and since a table can have only 1 PK, you can specify the natural PK as a UNIQUE KEY...that way a user cannot set up a customer row multiple times for the same customer (having same name, number, address, etc). Let the database enforce its own rules, not the JavaScript or whatever language you use. The Normal Forms will keep the DB out of trouble and guard against bad programs.
I prefer to use multiple columns as the primary key where unique values need to be enforced for normalization. That said, I do think UUID is better for long-term maintainability, but if the table will never scale beyond what a Int32 will do then it doesn't matter.
Great as usual what if we want using uuid as invoice no ? it will be more complicated but as pros it will be scalable and avoid conflicts across store branches
Thanks! We wouldn't want to make the UUID an invoice number as they represent different things and the UUID is hard to remember. The invoice number could (and should) be a separate field.
UUID is the unnatural Primary key, and Invoice number would be the natural key (coded as UNIQUE KEY), so BOTH are required as a best practice. Otherwise, using a UUID or Integer as the PK alone would make the table vulnerable to duplicates. We all need to get away from this thinking that UUID or Integer alone is the PK...in most cases, it is not.
Good question. A UUID is a format or type of value that can be used in a column. A surrogate key is a column that is specifically created to be used as a primary key and has no definition from a user's perspective.
@@DatabaseStar, but users are not supposed to know about UUID's, nor ever see them....if anything, users would see the Natural PK on their UIs and its this value they know the rows by. UUID for internal code, Natural Keys for user-facing. Both are required in most cases.
I think one disadvantage of UUIDs is that they are harder to remember. For the possible security risk of integers, at Marshall University, where I'm a student, I have a student ID that starts with 901 or 903 like 901 234 567. 901 234 567 is a lot easier to remember than 32d7169d-fb3f-4543-8148-94a75641bc55. If I have to remember a value to repeat to someone, I would prefer 901 234 567 to 32d7169d-fb3f-4543-8148-94a75641bc55. Memorizing 32d7169d-fb3f-4543-8148-94a75641bc55 is doable, but not easy.
Good point, and like @edgeeffect mentioned, humans aren't supposed to remember UUIDs. The primary key may not even need to be shown to the user, even if it's an integer. A student table may have a primary key field, which can be either INT or UUID. It could also have a separate student_id field, which is an integer, and can be used when making enquiried about your record.
Having a predictable PK value can also be a security hazard. It makes your item keys guessable. This is a weakness of auto-incrementing values and a strength of UUIDs. Integer PKs can also potentially make future table merges and table migrations a challenge whereas UUIDs never pose such a challenge (they are globally unique and you will not have conflicts).
i used INT as auto increment PK and uuid column that corresponds to PK. use uuid for UI display and INT for JOIN, SELECT and SORT. dont just rely on 1 column. DONT EXPOSE YOUR PK.
Just use a pseudo-random, incrementing number algo like Snowflake. All the benefits of an auto incrementing int without the problems of being able to guess the next one in sequence, worrying about collisions in a distributed setup and much more compact than a GUID, unless you opt for an int64 but you still get the randomness, incrementing, better DB clustering. Because Snowflake and similar algos are time based incremental, even on a distributed setup you have built in time ordering so you can sort on the PK for newest/oldest time ordering which might save you from needing a date/time column if you don’t need one for other reasons.
Thanks for this tip. I‘ve only come across snowflake ids when working with Discords API but deemed it unremarkable. As a side question: Do you know where to get the „machine id“ from? Is it a „random“ static id assigned per backend node?
@@hi_im_julian_kirsch Depends on your architecture. If you don’t have a distributed system then the machine ID can just be some static value you hard code into the back end. If you have multiple back end components that need IDs such as multiple web services or cloud functions, or just multiple instances of the same web service for load balancing or fault tolerance, it’s probably best to consolidate your ID generation into a dedicated service or function. Each one of these would have a static ID as the machine ID. If you need super fast ID generation you can even have your ID generator component generate a pool of IDs to pass back to whatever is using them which would cache them and replenish the cache before it got too low, saving the overhead of doing a call for each ID. Like you could grab 1000 or 10000 at a time.
For distributed systems often each node is given an offset. So node A would insert 1, 5, 9, 13, node B would insert 2, 6, 10, 14, etc. There are advantages to UUID, and may be necessary in many, but INT can still be used in many systems.
Using an Integer for the unnatural PK is not good in most cases. All tables should have a natural PK too, but that's another topic, and I digress. Back to a response to this video: A UUID or GUID is best because that value will in fact be unique across all rows in the table, and across all rows in all tables and on all servers of the world; so long as the table DDL constrait is coded correctly. This is because the UUID is made up of CPU/GPU serial numbers along with many other things. It's 36 bytes long with dashes. Sure it takes up more room then an integer on disk, however storage is dirt cheap and should usually not be a consideration. The problem with integers is that, for example value 62432 on one system could be the ID for a different row in another. But a UUID would guarantee uniqueness across the universe as I stated above. Why would a URL contain the PK? Better to use a REST service PUT, PATCH, or POST to obfuscate the PK. Its 2024 so why does the URL contain a PK? As to data authority, and general table security, the back-end would decide if a request can get to specific data, and regardless what is in the URL or request JSON. In my company, all our tables use a GUID (UUID) for the unnatural PK, and always, always if possible a Natural PK too. Thinking that an integer or UUID (GUID) is the only PK a table needs is foolish and is proof that the table designer knows nothing about proper DB design and the Normal Forms that are applied to design by the pros. Google "Database Normal Forms".
Thanks for your input! You make some good points and there are many benefits of using a kind of UUID as mentioned in your comments and in the video. I’ve seen some websites show the PK in the URL, generally older sites or maybe sites that don’t use an API. Regarding a natural PK - it’s not always needed or recommended to have one. You only need one PK and are only allowed on PK in a table for a relational database. An integer as the only PK for a table is a perfectly valid approach for a table and satisfies the normal forms. Using a UUID does make it easier to be unique across multiple databases though if that’s the architecture you have.
@DatabaseStar, using an integer assigned by the system as an IDENTITY has its use cases, absolutely. However, its a fallicy to think a natural PK is not required if an unnatural is used (integer or UUID). In fact, it is required, and a best practice. Here's an example: Lets say you have a Customer Master table with the expected columns Name, Address, City, Postal, phone, web site, etc. Lets say you have two rows in such a way, that the same customer can be found with different UUID/integer values. Now you have a duplication of customers! Using a UUID/Integer as a PK alone is not going to prevent duplicates. And in fact, this thinking, that UUID/Int is all you need is very, very wrong, and its pervasive with young coders. You should have both, a unnatural PK UUID/Int as defined as with the constraint type PRIMARY KEY, and a natural PK defined as UNIQUE KEY constraint. We must design databases so that they do not allow duplicates at the database level (as defined in the DDL), not the code base level. UUID's are great for joining various tables together via PK's and foreign keys (FK). Remember, UUID or Integer alone will not prevent duplica†es...and no, its not acceptable for the code to enforce this. In my 45 years of designing databases, and working with code that access that DB, I have seen way too many situations where tables have duplicates because the designer wrongly ignored the use of natural PKs.
Unless someone changed the underlying algorithm for auto generated keys in the last decade, int does not garuntee any ordering. In transaction databases with more than a couple of hundred simultaneous users, it is almost certainly not a good assumption to rely on unless you don't care about bottlenecking inserts due to keys.
When you say "ordering" do you mean "working out the order that data was inserted"? If so, then yes that's right, there could be situations where auto-incremented integers are calculated but inserted in a different order. If you mean "generated in an ordered sequence of numbers", then many databases do this, using things such as sequence objects.
@@DatabaseStar because of readability and memorability. If you display the ID in the interface,, you maybe also want to remember some ID, but with UUID this is almost impossible.
I don’t get how little a role ULID played in this video, especially with the conclusion still mentioning v7 as an alternative even though ULID is more compact
@@DatabaseStar yeah, the id was generating in python on lambda from aws. I had to put a check to see if the key already exists and try again 10 times for a new one. it was a messy fix, and i`m not proud of this. recently i changed the way the whole system works to avoid this problem. For a quite while i was thinking that could be a randomizing problem with lambda, but i didn`t research this that much. I needed the key before it was in the bank to debug purposes. If you are in the same situation i would suggest to take the all numbers in timestamp, convert to hex and mix it with uuid, the chances of going wrong almost nullify.
Stop mentioning users can guess the key value as a pro or con. If you built a system that allows for injections of values for users to fetch rows from your DB you already failed. No one should be able to supply a field your queries will use to access anything they shouldn't have access to. Save their user ID (INT Key) in a session that is stored server side. The client can only submit their session ID (cookie) which you use to fetch their user ID from the server, not giving them any means of changing the key value the query will use.
Good point! Yes, if the application is designed correctly as you mentioned, then users won't be able to enter a different key and see things they shouldn't. So, you're right, it's not really a con of using this approach.
I would say that UUIDs are better. I would say short answer use UUIDs and you won't regret it. if you start using integer keys, you might regret it and have to change to UUIDs later on.
Is not as easy as that. The speedup you get from sequential numbers are not that marginal as you would expect especially if you have big monolithic apps. Another thing is space itself. I would actually say you start with sequential first then go UUID. As you probably start with a monolithic app first and then you can split it and make it distributed where the UUID are a bit more used
No! You can expose it... it is the security layer of the application that must protect any non authoized changes, not keeping it hidden... pls remember that "security by obfuscation" is not security
@@Davide73 It's not just about Security. It's about more, like gaining all the benefits of Integer Keys without the costs which come from exposing them. And it's about that exposing keys is an invitation to write bad code, like e.g. Client-Side code that really should be server-side instead. And that you can do something does not mean you should. There is no Problem that has been solved using exposed keys that could not have been solved without using keys. There's no reason to expose keys as they are not properties of the entities. If the backend's lazy and just saying the App is responsible, then that's just a lazy excuse. You don't gain any benefits from exposing keys, you only gain problems. I say this all the time: Just stop following standards and start thinking yourself. But a lot of time, this requires communication between frontend and backend devs and management which requires time. E.g. i wad working on a project where it had this exact issue, they were exposing keys without actually needing to do that and as a result they switched to some sort of UUIDs which then caused all kinds of new issues. I tried to tell the management that there's a better way and they just wouldn't listen and say they do it because it's a standard. And because of that they didn't want to 'waste' any time thinking about it. Ugh so frustrating.
Auto incremental number as key is a wrong design. You have to identify what makes every row unique. Can be one atribute of the table that is unique. For expmple suppose that in uk every number plate in cars is unique. In your table of cars this will be your pk. No need to add another attribute that is not needed as the number plate is unique. Sometimes one attribute is not enough ad you have to use two attributes. Adding unnecessary additional attributes is a wrong design.
Having a primary key that is not a value from the "domain" or the "business" is perfectly valid, and is usually a better design. This is called a surrogate key. Using an auto-incrementing ID is a great way of having a unique value for each row, and is great for primary keys and therefore foreign keys to other tables. In your example, you've used a "business" or "domain" value as the primary key, which is the number plate of the car. Yes, it could be unique. But it doesn't identify the car. What if a car changes it's number plate (for example, the owner purchases a different number plate, or sells the car and wants to keep the number plate)? The number plate is not a way to identify the car. What if you've built constraints on this number plate to ensure it's only 7 characters (or whatever the UK rules are), and then it's changed to 8 characters? You'll need to update the database and application as your keys will break. Perhaps the VIN (Vehicle Identification Number) could be used, but this may also change format and might also be unknown for some cars.
I get your point, and I agree, but this is not true 100% - sometimes you need to keep the "history" of a record, so for the same "business id", you need more records.
@@DatabaseStar I am not sure if you are just trolling or you really do mean it... Auto incrementing integer never is a valid key in a database.. it is a poor man hack for ppl who doesn't know what they are doing.. it is a time based pseudo random value (and badly random) which entangles all inserts into the database.. Just imagine what happens if two ppl are concurrently inserting bunch of rows.. Try to imagine what happens if you have a distributed database.. they get interlocked (or deadlocked) to not get collisions.. such "key" has no relation to the data, so in case of any failure you are not able to recover or reconstruct it... if you need something to index the rows why would you do that by a random value ? Use a hash function if you have to, but any competent db engine does that (and more) for you on the background... and if you are more formal just check what adding of a random numbers to the data does to db normal forms.. maybe you can get ok with atomicity but any higher NF than that breaks instantly... and btw if you really don't have any natural key in the db structure (often because of a bad db model..) then maybe just don't use any key.. tbh uuids don't solve all the issues but at least they drop the time-based randomness for much better randomness, and drop the "incrementing" constrain.. still they are not a true "keys" because they are only statistically unique.. but if done right it is usually good enough.. if you extend "uuids concept" to equivalency classes of uuids it is almost perfect (but well...🤣)
@@DatabaseStar sorry man just wrote you an answer why it is never a valid key (rumbling about time based pseudo randomness, interlocking issues, normal forms breaking, recovery hell, performance hit, concurrency problems.. you name it) but I accidentally deleted that comment and even I am still wondering how useful can be an index on a random value I don't wanna write that again..🥴
No problem, I still see your comment! I wasn't trolling - I actually believe that auto-incrementing integers are a good choice for primary keys. But they aren't perfect, as mentioned in the video. I don't think an auto-incrementing integer is a time-based pseudo-random value. It uses a database feature, such as a sequence, to retrieve the next integer value each time it is called. For example, the first value is 1, the next value is 2, and so on. There's nothing random about it. Generally, these approaches work for multiple transactions. For example, the numbers generated by an Oracle sequence are independent of transactions, so you won't get collisions. I don't see how the format of the primary key relates to normal forms though. You're right that UUIDs can help. They are better for distributed databases and can help avoid locking and dependencies on the database, as mentioned in the video.
@@DatabaseStar Well, second NF basically says that attributes in a row have to depend only on the value of the key... How do they depend on a value of that assigned integer? Someone in the comments tried to fixed that with adding timestamps not realizing that timestamps would be actually a better key... If several ppl are inserting many rows at once (concurrently) how could they predict (if those are not random) the assigned values? You say that each time you call a "db feature" it returns different number... but in db you are not doing things sequentially, db tries to reorder data for efficiency and work with them in bulk... but to avoid a collision each part of the db structure (imagine a cluster) needs to synchronize (and wait) and then assign almost one by one those integers... I mean ok, in the last century there were valid reasons to use such things.. like memory constrains, inefficient db engines (mysql anyone?) and there were a tiny amounts of data in those databases (comparing to today), not having to deal with parallel things and so on.... but today? buy ram, use smart db engine... and focus on proper data structure.. after few years and thousand times more data then planned, you will be happy you haven't had used those autoincrements...
Want to easily remember the SQL commands for your database? Get my free SQL Cheat Sheets here: www.databasestar.com/get-sql-cheat-sheets/?
1:42 if users can change the ID in a URL to access data they are not allowed to, then the fix isn't to switch from int to uuid. You need to implement proper permission checks before serving the request.
There are scenarios where non-guessable IDs won't protect you. For example copy&pasting - most normal users have no idea which part of a URL contains critical data so they will just send the whole thing to a colleage/friend. Or someone bookmarks a page and later their access role is revoked.
Bingo. If you can change ID fields, why can't you change uuid fields? It seems you have the same issue. Correct way is for the middleware or DB to prevent calls in using IDs which you aren't supposed to access. Also, in many DBs, IDENTITY fields are read only.
Great point (and probably one I didn't make that well in my video). It should not be possible for a user to change an ID in a URL to see things for another account or record that they are not allowed to se.
For anyone who wants to look deeper into this. It's also known as an IDOR vulnerability (insecure direct object references)
That is correct, but IDs in a sequence still leak information and that might be undesired - e.g. if you have order numbers that work like that and you make a couple of orders a day apart, you would know how many orders a business got just by looking at your collected order numbers. By itself this might not be that useful, but once it's combined with other information this might just become a problem.
Security risk - predictability ! Hahahahahahahah
RDBMS is a technology developed in the early era of computing history and RDBMS are well optimised to use INT as PK.
Then the internet age came and when partitioning and sharding are implemented to solve the problem of scaling horizontally, the INT PK became a limitation.
With UUID, multiple nodes can generate the PK, allowing horizontal scaling.
Good point!
Totally false - this opinion belongs to the hype of nosql blabla buzzwords... PK is never a limitation, because it is a concept, implemented with technology to avoid duplicates.
@@Davide73 does he talk about Auto Incrementing INT as PK
Agree with this statement. It’s not some nosql thing either as the other responder said - if you truly want databases that can scale with your application, active/active and multi datacenter uuid is the right tool for the job. And yes people will hammer other solutions (even/odd, increment by 3/4) but this is a manual and unnecessary step. Just use uuid if active+ setups are in the cards
we had this debate in another database forum. the consensus was the combo int PK/FK, and UUID for API's.
Good to know! That's a good option.
I've been doing this on my own without knowing it was an established thing. You really get the best of both worlds
@@DatabaseStar this was literally what you suggested, so, this is more like +1 than a suggestion
@@mind.journey its more complex, but yes, otherwise best of both worlds
But if I use both, for example, a users table:
id: bigint,
uuid: uuid,
name: varchar(255)
Considering that I will only expose the uuid to the API, I would have
GET: users/:uuid
Which would be the query
SELECT * FROM users WHERE uuid = param.uuid
If I have a table with tens of millions of users, I would have to create an index for this query, right? Then we fall into the disadvantage of uuid, which is index fragmentation. Since we always will have to search by uuid, wouldn't it be better to use only it as the primary key? Does the performance in joins justify having both, even if they take up more space?
Another con of uuid4 as your primary key: Normally you will cluster on primary key, but if the PK is a uuid4 then your table will be constantly fragmenting.
That's true, good point.
Which doesn’t matter in most cases
@@avwie132 Sure, but you'd better be sure it doesn't matter in each case. Every PK creates an index for itself, and so a uuid4 PK's index will be constantly page splitting, even if you mark the table nonclustered.
True, but MSSQL has a sequential GUID function that guarantees the GUID is incrementing by packing a time based section at the beginning.
@@Joooooooooooosh TIL
They are separate concerns, that should not be comingled. Primary keys are there for referential integrity within your database, and should be int for maximum performance. UUIDs are there to uniquely identity a records at higher application layers, or across databases , or for replication - so they should be added as an additional column when needed.
Good point. From my research, and other comments, a valid approach is to use both the INT PK and a UUID (or similar) in the way you have mentioned.
@@DatabaseStar Using both Int PK & UUID has some pros but it has some cons of both. It increases row size significantly and looses advantage of preparing data & references before performing actual writes. Also, using ORM or generating models from SQL in application code will require to additionally worry about hiding integer primary keys in API responses. I would rather pick one depending on the table and the amount of writes & reads it will receive.
@@dyto2287 If your ORM entities are somehow making it all the way to your API controllers where you have to hide your primary keys, there's something terribly wrong with your application architecture. ORM entities should not make it past your repository layer, and the rest of the application should be completely isolated from having to deal with database concerns such as primary keys and referential integrity.
@@squintps In a perfect world yes but in reality when you need to build fast and ship features too much of isolation and guardrails will bite you in the ass slowing down all development. Don't over-complicate stuff before it's absolutely necessary.
@@DatabaseStar Totally crazy using two "primary key"... if one is a real primary key, that one is enough...
1:01 primary keys always have indexes, you don't need to make a separate one.
That's a good point, yes they do.
For the sorting pros of integer primary keys, I always add 3 columns at the end--note as a text field for when something needs to be clarified about the data record, date_created, and date_updated. date_created will be set by default to something like systimestamp and date_updated will update to the date time stamp when the record is updated. a trigger will fire with an on update command clause. then in Oracle I add a virtual date_created_or_updated where its date_created if no updates have been before, and date_updated if updates have been performed. I put date_updated to null at the beginning. That way you can sort to see when the records were created like sorting the integer primary keys.
That's true, using dates when the record was created or updated is a good way to see how old records are or to sort by when they were created.
Well, I just use UUID v7 as the primary key. Additionally, I allow my authenticated clients to define its value when creating or upserting records (though I validate on the backend to ensure it's a valid V7 UUID, as monotonic increment is important for indexing).
This way, if a mobile device works offline and needs to sync new or updated items, it can generate a UUIDv7 offline, and the ID won't change after the upload. Moreover, there's no need to define two IDs (local and remote) for each item-there's always one unique ID for a specific item. Since it starts with a Unix timestamp (instead of random noise), I can sort by ID on the server side without needing a separate creation time. The best of both worlds.
Yes, it takes up more space compared to an auto-incrementing (big)int, so joins may be slightly slower, but if I ever need to migrate or merge databases, I can use the original IDs without conflict or recalculation across multiple tables. There are so many reasons to use UUIDv7 that I don't mind those nanoseconds of extra processing time.
Great points, thanks for sharing. Using UUIDv7 as you have mentioned seems to be a popular option based on the comments here.
Integer IDs are NOT a security risk, if proper application security is implemented.
I should be able to take any ID - regardless of the type of value - and not be able to access the record unless I'm specifically authorised.
And anyone who thinks having a non-guessable ID is a "layer" of security is fooling themselves with "security through obscurity".
That's a good point!
@Grumpicles - thank you, this is the first comment done by somebody with experience, all the other comments, in my opinion, are done by people using nosql approach because it is cool. sharding and partitioning for applications that will use at max one thousand records... c'mon!!
Yes, good SEs should not perpetuate this myth.
UUID (or any other UID algo) as an identifier is really about generating identifiers at application layer (python etc), in distributed systems (multiple services/process/threads; "scaling" as this video covers).
My personal approach would be like this. Every time I make a new table, my primary key is an integer. But when I need to expose the data to the application, I don't expose the primary key. That's really bad! So instead I make another column just to store UUID and I set the default value to "UUID()" so that every time I insert a new data, an UUID will automatically be generated. But my traditional method use varchar to store UUID. I might need to change it in the future since you said it would be better to store it in binary.
Nevertheless, by doing this I can expose the data just by exposing the UUID.
Let me know if there's a better approach than mine, cheers.
That sounds like a good approach!
@@DatabaseStar Thanks a lot!
So do you index on your UUID column in addition to the PK? I imagine it would be slow to retrieve records based on the UUID?
@@sangramjitchakraborty7845 No, it's still on the integer
Why is exposing your primary key bad?
3:02 Frequency of collisions depends on details of how UUIDs are generated. Actual probability might be much higher than a theoretical probability that assumes a perfectly distributed random variable. I've seen some very poor implementations. Having some (very rare) collisions is not a problem if a collision doesn't lead into a catastrophe. But a software engineer has to make sure it does not. Sometimes you just have to be sure that the IDs will always be unique and no collision will ever happen.
Great point, thanks for sharing!
The primary key of the RDBMS will protect you from non unique values...
@@Davide73 Yes. It protects from doing potentially harmful inserts case of a collision. But it doesn't stop a careless programmer passing on the UUID before doing the insert or otherwise making sure there is no collision.
@@FINALLYQQQQAVAILABLE Sorry, I don't get yuor point; the primary key avoids any duplicate UUID, and that's all. If a dev makes a mistake, it will receive the error.
@@Davide73 There is a non zero chance of collision, which grows exponentially with table size, so extra error checking code is needed, and if forgotten in places, there is higher chance of app crash or server error. Also extra work in generating ids, so they aren't as care free as video implies.
Thanks! I was just querying a database that uses UUIDs as primary keys. I didn't previously know that they are not generated incrementally.
Glad it helped!
They don't have to be incremental....just unique. And really, there are rare use cases when one needs to order by UUID or Int anyways. The ordering is most often by one or a few columns in the table.
what I did is retain the default integer id and add another unique field of uuid v4.
this way a merger/sorting can happen and you can give the user at some end point the uuid as an id to reduce security risks.
the only big con about this is the memory usage (both in runtime and storage).
Good point, good to hear it works for you.
I prefer to use auto-increment INT or hybrid. I have the kind of luck that I could be pulling on a rope (to wrap it up and put it away) and it would get stuck on a completely spherical object, like a beach ball. The last thing I need is for that type of bad luck to generate identical UUIDs for 2 records in a db. I know they say that UUID's likelihood of duplicates is near zero but INT is exactly zero. They used to say that MD5 duplication was also near zero, but we all saw how that turned out.
That’s good to know, and glad it works for you.
I used integers with MySQL along, and then I was learning Postgres and I learned about UUIDs and I learned the advantages of UUIDs and I became converted/converted to UUIDs.
Good to know! My journey was similar.
Just to be aware of when using uuid with mysql (innoDb engine), it destroys the write performance
Sorry guys, but I continue to avoid UUIDs... I don't see any plus in a RDBMS... it is the usual hype after nosql guys arrived on the playground
The advantage of UUID is, I can set it manually, I can use it in memory before storing an item to the database.
On an autoincrement value, I have to have something temporary in memory before, then I have to save my item, and after that I have to retrieve the value, the database generated. Quiet a lot of steps. Becomes all obsolete by setting UUID myself before saving.
That’s true, it can be created manually and outside the DB which is helpful as you mentioned. Thanks for sharing!
When you create a record, the DB returns the ID generated so there isn't any additional request. On a UUID you have an additional step of creating it, and debugging your app is a more tedious with everything referencing a UUID. Also, realistically there is a chance of collision, which grows exponentially as your table size increases (think birthday problem), so you need extra error checking on all inserts.
One thing that relates to a different discussion which is a use a surrogate key not a name or something for the primary key. So say I have my table city with the column city that stores the name of the city and city_id which stores the primary key as an automatically generated UUID (which implementation will differ in MySQL--bin, Postgres/PostgreSQL--UUID, Microsoft (MS) SQL Server--can't remember how its handled, and Oracle--raw(16)). So the surrogate key means we can update the city if we need to. However, I think there should be a term like functional primary key where its unique and not null. Its not the primary key because we want a value that is independent of the primary key, but its important enough and unique enough that it is commonly used to identify something. Examples of "functional primary keys"/natural keys include first name, last name, email address, phone number, physical address, and name. Each of these should be mapped 1 to 1 to the entity. So if I know your first and last name that should map to 1 person. I say should because maybe 2 people have the same first and last name. We could say well then your middle name has to be unique. Well then the father and the son have the same first, middle, last. well then add the prefix like John James Smith 7 is son of John James Smith 6. An email address should have a one to one relationship with a person, but this is often not the case where you multiple email addresses, phone numbers, physical addresses I guess if you own 2 houses, names, etc. The point I'm getting at is I think with a surrogate key there is a loss of how the functional primary key should have a 1 to 1 relationship and be unique and not null, so a new concept like functional primary key could be introduced. It could be that said that its functionally primary, even though its not really primary because if its really primary it should have no relation to the input data. I hope that made sense.
Good point about the functional primary key and how it's often unique. It's a good idea to separate it from the table primary key. I've heard the term "business key" used to refer to this, or even a "candidate key".
@@DatabaseStar I think candidate key is something that could be a primary key. I think there's a difference between candidate key and "functional primary key".
good overview, always good to think about before planning a new program
Thanks!
Ok guys, you won!
I am going to change my recipe system which contains 30 recepies using 2000 ingredients to use uuid! Soon a safe merge can be done as my next recipe will be safer once no one will guess next id!
Be safe guys!
Thanks for sharing!
Brilliant, I been looking for this info for years, uuid much better,no sequence naming,no central authority to control sewiences
Beware "better" is relative. What is "good" or "bad' always depends on context.
Then you are probably doing it wrong.
UUIDs are not a replacement for sequences. Just use them if you want/need to expose entities. So your api uses UUIDs as parameters instead of actual valid ids.
But using integers as primary keys is the most performant way to handle relationships, no question about that.
Glad you liked it!
@@ramireznoy UUID are better as PK for distributed databases because you avoid collisions between those
@@Ewig_Luftenglanz If your distributed model allows for collisions, the issue is the model, not the PK.
If you need to put the entities together, then your DB is not distributed.
But even if you absolutely realize integers are not the best solution in your very specific case, not even composite keys, UUIDs are no replacement for integers. Those are just two totally separated use cases still.
By far I prefer int
Good to know!
Brilliant effort!
Thanks!
Thank you for the wonderful video sir, it is so helpful for my team and thanks again sir
Glad it helps!
UUIDs are almost always used as *indexed* secondary columns anyways so the "storage" argument doesn't even make sense at all. Because removing the indexed Integer column means less storage wasted since you only care about your unique IDs. I find a lot of people ignoring them anyways in queries... so again, what's the point?
I use a special kind of UUIDs, mostly a more compact base62 nano ID. It saves a lot of space, and you can have a very good random range in just 16 bytes.
Thanks for sharing. That’s good to know, especially how you use a specific type of UUID.
Complex scenario made simple... i shall subscribe dear sir & thank you!
Thanks!
I really like int: date+time+sequence or other. Prepend w another unique, such as client id, server id. You can pull entries away from this or that db and know exactly what's up. There's a lot of options, but you can pack info into that. Or simply use a string and parse it. Custom solution.
Oh that's an interesting approach. Thanks for sharing.
@@DatabaseStar As long as no crucial info is in the index scheme, it works for app dev, and discovering data genesis without adding columns. You simply need to develop a parser and search. Good for science, POS and logistic integration. Works for me. Maybe not immediately handy, but helps with drill-down.
I do pentest and UUID never stopped me from compromising the system, never. ID or UUID if it's send to the browser then that's just it. You don't get hacked just because you show the ID/UUID, you get hacked because of IDOR/the lack of permissions. The more capable the system is the harder it is to secure it.
Good point, you explained it better than I did. Thanks for sharing!
Great video. Trying to decide between UUIDv7 and integers for a new app atm. It's the kind of app that would need DB sharding if it ever took off. Wouldn't that be difficult with integer IDs? I know some people say stick with integers until then, but wouldn't switching every table from integers to UUIDs and then rebuilding all the FKs be a nightmare? Would love to hear your thoughts.
Good question. On one hand, optimising for possible DB sharding in the future could be a "premature optimisation". On the other hand, using UUD now is planning for the future.
I don't think it's a nightmare to migrate every table, but would be a little tricky. I imagine it's something you can do in stages, perhaps one table (or one relationship) at a time.
But if you're pretty sure you'll need UUIDs, then it may be better just to use that from the start.
Does auto increment work with sharded/partitioned tables (globally)
I'm not sure, I haven't tried it.
Normally I’ll use ULIDs instead o UUIDs, I find ULIDs more reliable and storage efficient, and also has the bonus to be sortable.
You can always sort a UUID
@@Davide73 Depends on the version, the commonly used UUIDv4 is not sortable, yes, there is other versions that are sortable, but it's as complicated as the naming of Xperia phones (Xperia 5 iii for example), ULIDs are my preference today.
That's good to know, thanks for sharing!
I don't quite understand the disadvantage of integers and the advantage of UUIDs about how integers have to be generated by the database and UUIDs can be generated by the client. I understand, kind of.
The idea is that if you use incremental int, you send the request to the backend, in which in turns does some computations and sends the insert sql query to the db, which then checks which integer is next and inserts that (for example id column) and then gives it back to the backend which then it gives it back to you (something like user created with id=42)
In the case of uuid the frontend randomly generates the user uuid as the id column and then sends it to the backend to be put in the db without db need to do anything. Meaning you know in advance the user id without even hitting the database and you dont need db to respond to proceed with what you are doing
The comment from "someoneanonymous6707" is right.
The integer IDs need to be generated by the database because that's where the record of available integers is stored (e.g. within a sequence). With a UUID, it could be calculated by the database, or could be calculated by any system that interacts with it.
@@someoneanonymous6707 So, if you have a bug in your application, you won't have a UUID bat a bunch of random chars for example with "/" instead of "-"... totally crazy, if you use a proper RDBMS it is safer to leave this task to the engine itself, not to the application.
@@DatabaseStar No, in particular if you have "any system that interacts with it", you must leave this task to the RDBMS, because that's the only way to be sure that your key will be properly generated. "any system" could create the key in a different way.
You don't understand because there are no...
2:30 ... I don't know why people keep summarising uuids in general as random. Only v4 is, and parts of some other versions. It's not inherent to uuids in general.
I believe when people say UUID they usually mean UUID v4 which is the "mostly random numbers and text" one. Other types of UUID as mentioned include time components. But it's a good point.
No, wrong, they are as good as unique across the universe with near zero chance of a dup, and this has been the case for a very long time (decades).
@@TheSilverGlow who are you saying is wrong, and why? We're not even talking about uniqueness here.
@@HappyCheeryChap its all related, and this is basic Database Design here. Effectively, UUID's are unique. The #1 point of using a UUID or Identity integer is to make each row unique...Normal Form 1...google Database Design Normal Forms (tech from the early 1970s and still massively important today)...there are other DB design normal forms too. The #2 reason is to provide a table with a RELEVENT unique key...the UUID or Integer can be a PK, but its Un-Natural, and not really related directly to the data contents for each row. For example, a Customer Table can have a UUID PK, but this does not stop a user from keying in the same customer many times ( name, address, phone, web site, blah, blah, blah)...a Natural PK must also be assigned to the customer table by one or a combo of other customer fields...and since a table can have only 1 PK, you can specify the natural PK as a UNIQUE KEY...that way a user cannot set up a customer row multiple times for the same customer (having same name, number, address, etc). Let the database enforce its own rules, not the JavaScript or whatever language you use. The Normal Forms will keep the DB out of trouble and guard against bad programs.
I prefer to use multiple columns as the primary key where unique values need to be enforced for normalization. That said, I do think UUID is better for long-term maintainability, but if the table will never scale beyond what a Int32 will do then it doesn't matter.
That's a good point. You could also consider using a Unique Constraint if values need to be unique in a table.
You could make gen_random_uuid the default instead of putting it in the insert statements at 7:55.
That's true, I could do that.
i tend to use uuid on some of the tables when needed
That’s good to know
Great as usual
what if we want using uuid as invoice no ?
it will be more complicated but as pros it will be scalable and avoid conflicts across store branches
Thanks! We wouldn't want to make the UUID an invoice number as they represent different things and the UUID is hard to remember.
The invoice number could (and should) be a separate field.
The invoice number is a "business attribute", potentially it could be different from the database id...
UUID is the unnatural Primary key, and Invoice number would be the natural key (coded as UNIQUE KEY), so BOTH are required as a best practice. Otherwise, using a UUID or Integer as the PK alone would make the table vulnerable to duplicates. We all need to get away from this thinking that UUID or Integer alone is the PK...in most cases, it is not.
In short, use integers when you can, and UUIDs when you can't.
Good summary!
What is the difference between a uuid and a surrogate key?
Good question. A UUID is a format or type of value that can be used in a column. A surrogate key is a column that is specifically created to be used as a primary key and has no definition from a user's perspective.
@@DatabaseStar, but users are not supposed to know about UUID's, nor ever see them....if anything, users would see the Natural PK on their UIs and its this value they know the rows by. UUID for internal code, Natural Keys for user-facing. Both are required in most cases.
I think one disadvantage of UUIDs is that they are harder to remember. For the possible security risk of integers, at Marshall University, where I'm a student, I have a student ID that starts with 901 or 903 like 901 234 567. 901 234 567 is a lot easier to remember than 32d7169d-fb3f-4543-8148-94a75641bc55. If I have to remember a value to repeat to someone, I would prefer 901 234 567 to 32d7169d-fb3f-4543-8148-94a75641bc55. Memorizing 32d7169d-fb3f-4543-8148-94a75641bc55 is doable, but not easy.
UUIDs are for computers not machines... if any system requires people to remember arcane identifiers... it's an awful system.
Good point, and like @edgeeffect mentioned, humans aren't supposed to remember UUIDs.
The primary key may not even need to be shown to the user, even if it's an integer.
A student table may have a primary key field, which can be either INT or UUID. It could also have a separate student_id field, which is an integer, and can be used when making enquiried about your record.
Having a predictable PK value can also be a security hazard. It makes your item keys guessable. This is a weakness of auto-incrementing values and a strength of UUIDs. Integer PKs can also potentially make future table merges and table migrations a challenge whereas UUIDs never pose such a challenge (they are globally unique and you will not have conflicts).
@@tubero911 that's what I'm saying.
There is no reason to remember a UUID or row integer identity.
Which I use depends on which DB i use, mySQL i use int, cassandra i use uuid
Good to know!
i used INT as auto increment PK and uuid column that corresponds to PK. use uuid for UI display and INT for JOIN, SELECT and SORT. dont just rely on 1 column. DONT EXPOSE YOUR PK.
That sounds like a good approach
Sure sounds like security through obscurity to me. If a user knowing their ID is a bad thing, your app design is flawed.
Just use a pseudo-random, incrementing number algo like Snowflake. All the benefits of an auto incrementing int without the problems of being able to guess the next one in sequence, worrying about collisions in a distributed setup and much more compact than a GUID, unless you opt for an int64 but you still get the randomness, incrementing, better DB clustering.
Because Snowflake and similar algos are time based incremental, even on a distributed setup you have built in time ordering so you can sort on the PK for newest/oldest time ordering which might save you from needing a date/time column if you don’t need one for other reasons.
Ditto, Snowflake is cool
Thanks for this tip. I‘ve only come across snowflake ids when working with Discords API but deemed it unremarkable.
As a side question: Do you know where to get the „machine id“ from? Is it a „random“ static id assigned per backend node?
@@hi_im_julian_kirsch Depends on your architecture.
If you don’t have a distributed system then the machine ID can just be some static value you hard code into the back end.
If you have multiple back end components that need IDs such as multiple web services or cloud functions, or just multiple instances of the same web service for load balancing or fault tolerance, it’s probably best to consolidate your ID generation into a dedicated service or function. Each one of these would have a static ID as the machine ID.
If you need super fast ID generation you can even have your ID generator component generate a pool of IDs to pass back to whatever is using them which would cache them and replenish the cache before it got too low, saving the overhead of doing a call for each ID.
Like you could grab 1000 or 10000 at a time.
Great point! Snowflake seems like a good approach, and in my research, many people were also recommending Snowflake.
For distributed systems often each node is given an offset. So node A would insert 1, 5, 9, 13, node B would insert 2, 6, 10, 14, etc. There are advantages to UUID, and may be necessary in many, but INT can still be used in many systems.
I would edit the answer to it depends, but in general use UUIDs or something like that.
You get my "thumbs up" for saying "it depends".
That's true, it does depend.
Using an Integer for the unnatural PK is not good in most cases. All tables should have a natural PK too, but that's another topic, and I digress. Back to a response to this video: A UUID or GUID is best because that value will in fact be unique across all rows in the table, and across all rows in all tables and on all servers of the world; so long as the table DDL constrait is coded correctly. This is because the UUID is made up of CPU/GPU serial numbers along with many other things. It's 36 bytes long with dashes. Sure it takes up more room then an integer on disk, however storage is dirt cheap and should usually not be a consideration. The problem with integers is that, for example value 62432 on one system could be the ID for a different row in another. But a UUID would guarantee uniqueness across the universe as I stated above. Why would a URL contain the PK? Better to use a REST service PUT, PATCH, or POST to obfuscate the PK. Its 2024 so why does the URL contain a PK? As to data authority, and general table security, the back-end would decide if a request can get to specific data, and regardless what is in the URL or request JSON. In my company, all our tables use a GUID (UUID) for the unnatural PK, and always, always if possible a Natural PK too. Thinking that an integer or UUID (GUID) is the only PK a table needs is foolish and is proof that the table designer knows nothing about proper DB design and the Normal Forms that are applied to design by the pros. Google "Database Normal Forms".
Thanks for your input! You make some good points and there are many benefits of using a kind of UUID as mentioned in your comments and in the video.
I’ve seen some websites show the PK in the URL, generally older sites or maybe sites that don’t use an API.
Regarding a natural PK - it’s not always needed or recommended to have one. You only need one PK and are only allowed on PK in a table for a relational database.
An integer as the only PK for a table is a perfectly valid approach for a table and satisfies the normal forms.
Using a UUID does make it easier to be unique across multiple databases though if that’s the architecture you have.
@DatabaseStar, using an integer assigned by the system as an IDENTITY has its use cases, absolutely. However, its a fallicy to think a natural PK is not required if an unnatural is used (integer or UUID). In fact, it is required, and a best practice. Here's an example: Lets say you have a Customer Master table with the expected columns Name, Address, City, Postal, phone, web site, etc. Lets say you have two rows in such a way, that the same customer can be found with different UUID/integer values. Now you have a duplication of customers! Using a UUID/Integer as a PK alone is not going to prevent duplicates. And in fact, this thinking, that UUID/Int is all you need is very, very wrong, and its pervasive with young coders. You should have both, a unnatural PK UUID/Int as defined as with the constraint type PRIMARY KEY, and a natural PK defined as UNIQUE KEY constraint. We must design databases so that they do not allow duplicates at the database level (as defined in the DDL), not the code base level. UUID's are great for joining various tables together via PK's and foreign keys (FK). Remember, UUID or Integer alone will not prevent duplica†es...and no, its not acceptable for the code to enforce this. In my 45 years of designing databases, and working with code that access that DB, I have seen way too many situations where tables have duplicates because the designer wrongly ignored the use of natural PKs.
Unless someone changed the underlying algorithm for auto generated keys in the last decade, int does not garuntee any ordering. In transaction databases with more than a couple of hundred simultaneous users, it is almost certainly not a good assumption to rely on unless you don't care about bottlenecking inserts due to keys.
When you say "ordering" do you mean "working out the order that data was inserted"? If so, then yes that's right, there could be situations where auto-incremented integers are calculated but inserted in a different order.
If you mean "generated in an ordered sequence of numbers", then many databases do this, using things such as sequence objects.
I think UUIDs can be better with sharding replication and backup and migration. I think that's right I could be wrong.
That's a good point.
We can use more than one type of keys for different purpose!
Yes, you can use different types of fields, but you can only have one primary key.
thanks
You’re welcome
You can also create your own id generation algorithm.
You can, but why would you need to?
@@DatabaseStar because of readability and memorability. If you display the ID in the interface,, you maybe also want to remember some ID, but with UUID this is almost impossible.
uuid used to identify disk entity in Ubuntu server
Good to know
Not great for large joins.
Which one is not great for large joins? And what do you mean by large joins?
@DatabaseStar uuid is slower for large joins as the primary key. I mean queries with joins between tables of 100k+ rows for analytics or similar.
I don’t get how little a role ULID played in this video, especially with the conclusion still mentioning v7 as an alternative even though ULID is more compact
Good point. I mentioned it briefly as I had covered the concept and differences of UUID earlier in the video, but ULID is a good option.
I used uuidv4 and had collisions all the time.
Oh really? That’s good to know.
@@DatabaseStar yeah, the id was generating in python on lambda from aws. I had to put a check to see if the key already exists and try again 10 times for a new one. it was a messy fix, and i`m not proud of this. recently i changed the way the whole system works to avoid this problem. For a quite while i was thinking that could be a randomizing problem with lambda, but i didn`t research this that much. I needed the key before it was in the bank to debug purposes. If you are in the same situation i would suggest to take the all numbers in timestamp, convert to hex and mix it with uuid, the chances of going wrong almost nullify.
Stop mentioning users can guess the key value as a pro or con. If you built a system that allows for injections of values for users to fetch rows from your DB you already failed. No one should be able to supply a field your queries will use to access anything they shouldn't have access to. Save their user ID (INT Key) in a session that is stored server side. The client can only submit their session ID (cookie) which you use to fetch their user ID from the server, not giving them any means of changing the key value the query will use.
Good point! Yes, if the application is designed correctly as you mentioned, then users won't be able to enter a different key and see things they shouldn't. So, you're right, it's not really a con of using this approach.
UUID4 is mostly random. There is a single 4 in every UUID
That's interesting, I didn't know that, and just did a bit of Googling on it.
Thank you, very useful video.
Glad it was helpful!
Or just create your own PK format that will fit YOUR usecase.
The formats in this video should fit almost all use cases, but yes if you have a situation where none of them will work, you can consider your own.
Nanoid my friend.
Thanks!
ULID’s
Yeah they are a good option.
A good solution is tsid
Good to know!
take both, INT as PK and UUID for the application and as PK fallback if you need to upscale your (D)RDBMS.
Good point, that seems to be a common approach from these comments.
Int
Thanks for sharing!
Integers for PK, uuid with uniq index (supplementary if needed)
Good approach!
I would say that UUIDs are better. I would say short answer use UUIDs and you won't regret it. if you start using integer keys, you might regret it and have to change to UUIDs later on.
Is not as easy as that. The speedup you get from sequential numbers are not that marginal as you would expect especially if you have big monolithic apps. Another thing is space itself. I would actually say you start with sequential first then go UUID. As you probably start with a monolithic app first and then you can split it and make it distributed where the UUID are a bit more used
Beware of thinking anything is "better" in all situations. Everything depends on context.
Good point!
?? Do you know how much more space you need for UUID? And for performances?
CUID is much better than UUID
Oh I haven’t heard of CUID, I’ll have to look into it.
It's GUID, not UUID.
I believe they are used interchangeably
Just don't expose your PK/FKs. Always found it bad practice that REST does this. If you need keys on the client side, you're doing something wrong.
Good point!
So basically every crude is wrong?😅
No! You can expose it... it is the security layer of the application that must protect any non authoized changes, not keeping it hidden... pls remember that "security by obfuscation" is not security
@@Davide73 It's not just about Security. It's about more, like gaining all the benefits of Integer Keys without the costs which come from exposing them. And it's about that exposing keys is an invitation to write bad code, like e.g. Client-Side code that really should be server-side instead. And that you can do something does not mean you should. There is no Problem that has been solved using exposed keys that could not have been solved without using keys. There's no reason to expose keys as they are not properties of the entities. If the backend's lazy and just saying the App is responsible, then that's just a lazy excuse. You don't gain any benefits from exposing keys, you only gain problems. I say this all the time: Just stop following standards and start thinking yourself. But a lot of time, this requires communication between frontend and backend devs and management which requires time. E.g. i wad working on a project where it had this exact issue, they were exposing keys without actually needing to do that and as a result they switched to some sort of UUIDs which then caused all kinds of new issues. I tried to tell the management that there's a better way and they just wouldn't listen and say they do it because it's a standard. And because of that they didn't want to 'waste' any time thinking about it. Ugh so frustrating.
Auto incremental number as key is a wrong design. You have to identify what makes every row unique. Can be one atribute of the table that is unique. For expmple suppose that in uk every number plate in cars is unique. In your table of cars this will be your pk. No need to add another attribute that is not needed as the number plate is unique. Sometimes one attribute is not enough ad you have to use two attributes. Adding unnecessary additional attributes is a wrong design.
Having a primary key that is not a value from the "domain" or the "business" is perfectly valid, and is usually a better design. This is called a surrogate key.
Using an auto-incrementing ID is a great way of having a unique value for each row, and is great for primary keys and therefore foreign keys to other tables.
In your example, you've used a "business" or "domain" value as the primary key, which is the number plate of the car. Yes, it could be unique. But it doesn't identify the car.
What if a car changes it's number plate (for example, the owner purchases a different number plate, or sells the car and wants to keep the number plate)? The number plate is not a way to identify the car.
What if you've built constraints on this number plate to ensure it's only 7 characters (or whatever the UK rules are), and then it's changed to 8 characters? You'll need to update the database and application as your keys will break.
Perhaps the VIN (Vehicle Identification Number) could be used, but this may also change format and might also be unknown for some cars.
I get your point, and I agree, but this is not true 100% - sometimes you need to keep the "history" of a record, so for the same "business id", you need more records.
man.. if someone is using an auto incrementing integer as a key in database, it is not a database guy, it is a butcher...
Hah! An auto incrementing integer for a primary key is valid, but it does have some disadvantages.
@@DatabaseStar I am not sure if you are just trolling or you really do mean it... Auto incrementing integer never is a valid key in a database.. it is a poor man hack for ppl who doesn't know what they are doing.. it is a time based pseudo random value (and badly random) which entangles all inserts into the database.. Just imagine what happens if two ppl are concurrently inserting bunch of rows.. Try to imagine what happens if you have a distributed database.. they get interlocked (or deadlocked) to not get collisions.. such "key" has no relation to the data, so in case of any failure you are not able to recover or reconstruct it... if you need something to index the rows why would you do that by a random value ? Use a hash function if you have to, but any competent db engine does that (and more) for you on the background... and if you are more formal just check what adding of a random numbers to the data does to db normal forms.. maybe you can get ok with atomicity but any higher NF than that breaks instantly... and btw if you really don't have any natural key in the db structure (often because of a bad db model..) then maybe just don't use any key..
tbh uuids don't solve all the issues but at least they drop the time-based randomness for much better randomness, and drop the "incrementing" constrain.. still they are not a true "keys" because they are only statistically unique.. but if done right it is usually good enough.. if you extend "uuids concept" to equivalency classes of uuids it is almost perfect (but well...🤣)
@@DatabaseStar sorry man just wrote you an answer why it is never a valid key (rumbling about time based pseudo randomness, interlocking issues, normal forms breaking, recovery hell, performance hit, concurrency problems.. you name it) but I accidentally deleted that comment and even I am still wondering how useful can be an index on a random value I don't wanna write that again..🥴
No problem, I still see your comment!
I wasn't trolling - I actually believe that auto-incrementing integers are a good choice for primary keys. But they aren't perfect, as mentioned in the video.
I don't think an auto-incrementing integer is a time-based pseudo-random value. It uses a database feature, such as a sequence, to retrieve the next integer value each time it is called. For example, the first value is 1, the next value is 2, and so on. There's nothing random about it.
Generally, these approaches work for multiple transactions. For example, the numbers generated by an Oracle sequence are independent of transactions, so you won't get collisions.
I don't see how the format of the primary key relates to normal forms though.
You're right that UUIDs can help. They are better for distributed databases and can help avoid locking and dependencies on the database, as mentioned in the video.
@@DatabaseStar Well, second NF basically says that attributes in a row have to depend only on the value of the key... How do they depend on a value of that assigned integer? Someone in the comments tried to fixed that with adding timestamps not realizing that timestamps would be actually a better key...
If several ppl are inserting many rows at once (concurrently) how could they predict (if those are not random) the assigned values?
You say that each time you call a "db feature" it returns different number... but in db you are not doing things sequentially, db tries to reorder data for efficiency and work with them in bulk... but to avoid a collision each part of the db structure (imagine a cluster) needs to synchronize (and wait) and then assign almost one by one those integers...
I mean ok, in the last century there were valid reasons to use such things.. like memory constrains, inefficient db engines (mysql anyone?) and there were a tiny amounts of data in those databases (comparing to today), not having to deal with parallel things and so on.... but today? buy ram, use smart db engine... and focus on proper data structure.. after few years and thousand times more data then planned, you will be happy you haven't had used those autoincrements...
Use integers for server->db communication. Uuid's for Client API
That’s a good approach
then how to sync?
@@mukhlimkurowo kan 1 row itu bang, kalo tau UUID nya, pasti bisa tau ID nya apa
@@putrafajarh coba kasus simpel, misal ada tabel customer, product, order, order_item. rancangannya gmn (harus support offline mode)
Thanks
You’re welcome