In case it wasn't obvious in the video, this video is NOT against GUIDs. Distributed systems absolutely need them but distributed systems will probably use NoSQL databases which don't suffer from the problems outlined. Also, of course everything should have proper auth where needed and you shouldn’t rely on unguessable urls. This was never in question and was never brought up as a selling point of the library or the approach. It is assumed to be the bare minimum that you should have. The video is all about how you can keep using sequential IDs internally, if the only reason you wanted to move to GUIDs was the exposure of the data with a concern about losing database performance, without having to worry about exposing guessable ids and opening your system up to potential security problems. Sequential ids, both ints and guids, can give your competitors business intelligence for your system (user/order count, rate of growth etc). We need to acknowledge that there is a huge number of people who don't work in scaled out, cloud native microservices, and this video is for them. For a video focusing on the practicality and security issues with enumerated entities in URLs check out Tom Scott's video here: th-cam.com/video/gocwRvLhDf8/w-d-xo.html
COMB GUIDs are a good alternative if you are worried about both sequential ids and randomness: they have most of the randomness of a normal guid while being mostly sequential, even when generated on multiple servers without contacting the db first.
Depending on what you're concerned about, you could also have a second column that's not guessable (ex: a GUID). Index by the sequential number to get the row; then, when you get the row, make sure that the challenge matches up (assuming you don't have a better mechanism of authentication for your given system/UX).
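A minimal sketch of the "second column as a challenge" idea from the comment above, assuming a hypothetical in-memory table (`ROWS`) and hypothetical helper names; a real system would keep the token in a database column next to the sequential id.

```python
import hmac
import secrets

# Hypothetical in-memory "table": sequential id -> row with an unguessable token.
ROWS = {}

def create_row(row_id: int, payload: str) -> str:
    """Store a row and return the random token to hand out alongside the id."""
    token = secrets.token_urlsafe(16)  # 128 bits from the OS random source
    ROWS[row_id] = {"token": token, "payload": payload}
    return token

def fetch_row(row_id: int, presented_token: str):
    """Look up by the sequential id, then verify the token in constant time."""
    row = ROWS.get(row_id)
    if row is None:
        return None
    if not hmac.compare_digest(row["token"].encode(), presented_token.encode()):
        return None  # same answer as "not found", so ids cannot be probed
    return row["payload"]
```

The lookup still uses the integer key, so the index stays sequential; the token only decides whether the row is returned.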
Distributed systems don't have particular connection to database types. NoSQL is a rather ambiguous and meaningless term, and there are many distributed relational (SQL) databases now. I think that discussion is getting away from the actual topic which is that this is just a simple obfuscation library. Instead of turning numbers into base64, it's like turning them into your custom base format.
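To make the "custom base format" comparison concrete, here is a rough sketch that writes an integer in a base whose digit alphabet is shuffled by a secret. It is NOT the Hashids algorithm (consecutive ids give visibly related outputs here); the function names and the way the alphabet is derived are assumptions made purely for illustration.

```python
import hashlib
import random

BASE_ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def shuffled_alphabet(secret: str) -> str:
    """Derive a deterministic permutation of the alphabet from a secret."""
    seed = int.from_bytes(hashlib.sha256(secret.encode()).digest(), "big")
    chars = list(BASE_ALPHABET)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def encode(number: int, alphabet: str) -> str:
    """Write a non-negative integer using the alphabet as its digits."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    digits = []
    while True:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))

def decode(text: str, alphabet: str) -> int:
    """Reverse encode(); raises ValueError on characters outside the alphabet."""
    base = len(alphabet)
    number = 0
    for char in text:
        number = number * base + alphabet.index(char)
    return number
```

Anyone who knows the alphabet can decode, which is why this counts as obfuscation and not encryption.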
MG, I haven't looked into the code, but there's one question. Let's say we have id 1 and we initialize hashid as ("Salt", 5), and the user posts hashid a2b45; that's going to be OK. But if the user inputs, say, d, do we have a check or a loose overload when we are decoding? Will hashid try to decode d, which is a single character, even though we haven't initialized an object for single-character values? I'm just talking about guessing ids based on indexes. Or would guessing 2-character values work for this as well?
No, it is called neither salt nor pepper. These two terms have specific meanings. It is called a secret. I hate such self-absorbed idiots. Go study a book first.
@@mezzer34 Zeleno might be picky, but for a video regarding anything security related I also agree that you should be diligent in using the right terms. Misusing terms can lure users into thinking it's something it's not. As Nick specifically points out, the library is named hashid but it's not a real hash at all, and that IS important. If you're protecting sensitive data this is NOT the library for you, since it's not based on a well-tested, validated encryption method, which the author of the library also states in the github description. He even renamed the methods from encrypt/decrypt to encode/decode to emphasize that it's not real encryption and it could be vulnerable, so for really sensitive data you probably want something better.
You talk about the security aspect of sequential id's at the beginning, which I appreciate. But if you don't keep the HashId-seed secret, it has basically the same problem, an attacker just needs to decode the hash into a sequential id, increment or decrement and encode again to get another possibly valid hash. Well, eventually, authorization should anyway be enforced in other ways, because hashes, guids and sequential id's can always leak, and that shouldn't give non-authorized people any power in a secure system.
Auth should be in place in the first place. Obfuscating sequential ids is the cherry on top. If people just see the hashid they most likely don't even know there is a sequential int behind it since there are tens of different ways to approach it. So it's extremely unlikely that they will use the same algorithm to decode it and even more unlikely that they will spend the years of computation needed to bruteforce a 32-byte salt.
@@nickchapsas Exactly, but if your software is open source and you just have that seed and method in the code, everyone can look at it and therefore there is no real security benefit of using hash over sequential id's. So the cherry on top is only slightly better than sequential id's, because you won't immediately notice that you can just increment the id. However, of course, there are other advantages to the hash-approach.
@@asdfxyz_randomname2133 In an Open Source software you wouldn't put the salt in the code, you would make it configurable. But authentication should be used anyway. Or you would generate it at initialization or something and store it in the DB or somewhere more secure.
@@blubblurb You should probably use the same method that you would use to store any other salt (for example, for hashing the passwords). Kubernetes has a type of environment variable called Secret which can be used to store such values. Even if it's a closed-source project, it's a bad idea to have salts in the source code (and you should also probably have different salts for each environment).
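A small sketch of keeping the salt out of source control by reading it from the environment at startup. The variable name `HASHIDS_SALT` and the 16-character minimum are assumptions for illustration, not anything the library requires.

```python
import os

def load_id_salt(env_var: str = "HASHIDS_SALT") -> str:
    """Read the salt from the environment and refuse to start without one."""
    salt = os.environ.get(env_var, "")
    if len(salt) < 16:  # arbitrary floor chosen for this sketch
        raise RuntimeError(f"{env_var} must be set to at least 16 characters")
    return salt
```

Each environment (dev, staging, production) can then supply its own value, for example through a Kubernetes Secret exposed as an environment variable.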
@@asdfxyz_randomname2133 In any software, regardless of it being open or closed source, you would never check in any sensitive information like that. That's like checking your API keys into git.
I work on a large project where we originally used GUIDs as the primary key in the database, but for DB2 at least, the indexing was horrible because they were basically random and caused a lot of index cache misses. Switching to a sequential ID was the way to go for efficiency. But for exposing to web UIs, we keep a pair of dictionaries in the session state which map numeric IDs to guid and guid to numeric ID. Obviously just for the IDs we need to return to the user. Works pretty well.
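The pair of session dictionaries described above could look roughly like this. The class and method names are hypothetical, and the sketch assumes one instance lives in each user's session state.

```python
import uuid

class SessionIdMap:
    """Per-session two-way map between internal integer ids and public GUIDs."""

    def __init__(self):
        self._to_public = {}    # internal int -> GUID string
        self._to_internal = {}  # GUID string -> internal int

    def expose(self, internal_id: int) -> str:
        """Return the GUID for an id, minting a new one the first time."""
        public = self._to_public.get(internal_id)
        if public is None:
            public = str(uuid.uuid4())
            self._to_public[internal_id] = public
            self._to_internal[public] = internal_id
        return public

    def resolve(self, public_id: str):
        """Translate a GUID back to the internal id, or None if unknown."""
        return self._to_internal.get(public_id)
```

A GUID minted in one session means nothing in another, at the cost of memory that grows with every id exposed.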
It can be a burden to keep that dictionary as it grows. I think just ciphering the ids like that library does would be a much more memory-efficient solution.
What we did where I used to work is add a Key column to the table where the GUID will live, and when we need to interface with the outside, the key is used instead of the id. Simple: the id is used for query joins and whatever is done internally, but when something is taken out of the API into the world... it is the Key column, not the Id, that is used. 🙂
Thank you Nick. I really like the idea of hiding the real sequential integer id, without the performance drawbacks of a guid. Being able to "hash" multiple ids is great too. Sometimes you need it. Great explanation also. :-)
Hi Nick, very nice video, I've been also using some other ID format called KSUID (k-sortable unique ID), these are basically smaller for storage than UUID but with more entropy bytes and I really loved them. They are sequential sortable by design, so no encode/decode has to take place. There is support in a wide range of programming languages nowadays (originally coming from Go). I would love to see a video on them, too.
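A minimal sketch of the "integer id inside, Key column outside" layout, using SQLite only because it ships with Python. The table and column names are made up for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal, used for joins
        public_key TEXT NOT NULL UNIQUE,               -- exposed through the API
        name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    );
""")

def add_customer(name: str) -> str:
    """Insert a customer and return only the public key."""
    key = str(uuid.uuid4())
    conn.execute("INSERT INTO customers (public_key, name) VALUES (?, ?)", (key, name))
    return key

def orders_for(public_key: str) -> list:
    """Resolve the public key once, then join on the integer ids."""
    rows = conn.execute(
        "SELECT o.total FROM orders o "
        "JOIN customers c ON c.id = o.customer_id "
        "WHERE c.public_key = ? ORDER BY o.id",
        (public_key,),
    ).fetchall()
    return [row[0] for row in rows]
```

The UNIQUE constraint gives the key column its own index, so the only non-sequential lookup is the first hop from key to id.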
4:36 - The term “hash” and “hashing” long predates crypto as a term of art in computing. It simply refers to the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hash ID perfectly describes this use case and is not even slightly arbitrary in meaning or application. In fact, cryptography is a form of hashing but hashing is not necessarily cryptographic. That is to say, hashing doesn’t necessarily obfuscate the original value. Take, for example, a hashtag. To go full circle, GUIDs and UUIDs are also hashes.
"cryptography is a form of hashing" makes no sense as a sentence. Cryptography uses hashing in its practices, perhaps. However hashing in contrast to encryption is a one way process, so "Hash ID" should really be called "Encrypt ID" to describe itself.
This is one way to do it, I typically just use a secondary unique index with uuids. My queries/joins use the typical relationship integer ids but I don't expose those as identifiers or in my db code. That is a better way architecturally imho since the monotonic ids are actually leaking your db implementation details, making it harder to swap out your database and the ID generator inside the database becomes a singleton service that is difficult to replace.
Kudos to that. I remember a friend who used to bash people for using GUIDs as ids, and who would show them the pattern of a clustered int/long index internally with a uniquely indexed GUID externally.
What a great idea. We've been using GUID and had to add an additional incremental column to get around indexing and paging limitations. Too late to rewrite everything now, but for new projects this is definitely a much better way to go about it.
Guid + incremental column is still better for security; the author specifically says it's not cryptographically secure and should not be used for very sensitive data.
@@buriedstpatrick2294 If the data is not sensitive or if it already requires a login, this would be more than enough to hide trade secrets like how many accounts there are and prevent the most basic attempts to circumvent security or prevent easy harvesting of data. But as soon as there is any personal information or otherwise valuable data a sha256 of the id and some secret or internal salt would be much more secure, and almost as easy to implement.
Hello Nick, very good explanation of the topic! A little note I wanted to add: 1. If performance is not an issue and security is preferred, it is recommended to use standard cryptography algorithms to do exactly what you described here, keeping the sequential indexing power of an RDBMS but also obfuscating the ids, with everything done at runtime by a middleware or the endpoints themselves. 2. In performance-sensitive scenarios, doing maths on huge numbers may be way faster than doing string work as `hashids` does, keeping the unpredictable property while using `long`. But anyway, the git repo of hashids looks great and the idea of customizing the obfuscated ids with custom alphabets is fairly interesting. :) Thanks for sharing!
Could not have come at a better time. I was never a big fan of converting to base64 strings, trimming the non-alpha characters etc. Thanks for the video - the solution is perfect. For anyone using long integers as their keys, you can convert those to a hex string first and then use the EncodeHex and DecodeHex calls from the nuget package.
@@omriliad659 The method was to convert the long to a hex string. EncodeHex produces a hash ID (unguessable) from a hex string. But since there is an EncodeLong method I don't need to do this method anyway.
We use Type 1 UUIDs which are sequential. I believe Cassandra uses them for the clustering key also. Also guaranteed unique with no conversion required. You can also represent the UUID as Base64: "Ej5FZ-ibEtOkVkJmVUQAAA". 22 chars instead of 36.
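The 22-character form can be reproduced with URL-safe Base64 over the 16 raw UUID bytes, with the padding stripped. The helper names below are hypothetical; the example string from the comment decodes to the UUID `123e4567-e89b-12d3-a456-426655440000`.

```python
import base64
import uuid

def uuid_to_b64(value: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe Base64 without '=' padding."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")

def b64_to_uuid(text: str) -> uuid.UUID:
    """Restore the padding and rebuild the UUID from its bytes."""
    padded = text + "=" * (-len(text) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))
```

This changes only the text representation; the database can keep storing the value as a native 128-bit UUID.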
Type 1 UUIDs do not have random entropy in the whole value. In fact on the same machine on the same day only a part of the UUID will vary. Doesn't that defeat the purpose?
I use ulids, it has benefit of uuids (non colliding, and okay in distributed systems), and yet you can sort them in order (it uses current time as well), :)
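For readers who have not seen the layout, a ULID is a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford Base32 characters. The sketch below follows that layout but leaves out the spec's optional monotonic handling within the same millisecond; `new_ulid` is a hypothetical name, not a library function.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Base32 without I, L, O, U

def new_ulid(timestamp_ms=None) -> str:
    """Build a 26-character ULID-style id: time in the high bits, randomness below."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):
        value, remainder = divmod(value, 32)
        chars.append(CROCKFORD[remainder])
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, plain string sorting orders ids by creation time (to millisecond resolution).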
I found using Smitty Warbenjagermanjensen as a test name works really well because of its length. Always fun when it shows up in a meeting with my business partners. You know who gets it right away.
It's nice to have options for different types of IDs to be used in different situations. GUIDs are nice sometimes, integers are nice other times. I've really enjoyed Flake IDs in some distributed situations, and hash-based IDs are great for these user-visible URL situations you're describing. Picking the right format of IDs for the right use-cases, and being able to cheaply translate between them when necessary, is important for the good design of many systems.
First of all, I don't think that's a bad library, but in C# there is already an interface called IDataProtectionProvider which can do the same thing, and in terms of memory allocation I think it's not bad either. Also, congrats on the 100K subs.
Interesting. Previously I have used checksums to improve performance when searching strings. More for urls, emails, etc. where you have to store the full thing, but you want more optimal indexing.
Great vid! I've been using hashids lots recently and it's worth pointing out that I've had pentesters rule that the hashids library output is guessable. They aren't entirely wrong: if you generate a bunch of them in sequence you will see that a pattern forms over time. Especially if you limit the alphabet like I recently did to make a hashid more human readable by removing ambiguous letters that could be numbers etc.
Yes, and the author specifically mentioned that it's not cryptographically secure and not intended for very sensitive data; for that you will want to use a real hash like sha256 or better. And you can actually combine them so the URL contains both a hashid number + the real hash, the first for quick lookup and the latter for good security. You could even use the real ID directly. Or if indexing is not a problem you can go with GUIDs.
@@davidmartensson273 At that point, you're probably better off with a well known symmetric key block cipher algorithm, encoding/decoding your sequential key with an app-wide secret cipher key.
@@gabiold Possibly, but depending on the platform that can be more complex to set up. It's more powerful, though. It could also require padding to avoid a too-short result if the id has very few digits.
@@davidmartensson273 Actually, if you treat the block data as an integer, rather than encoding the string representation of that integer, then there is no need for padding.
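To illustrate the "treat the block as an integer" point: a 64-bit id fills a 64-bit block exactly, so nothing needs padding. Python's standard library has no block cipher, so the sketch below swaps in a toy 4-round Feistel network with HMAC-SHA256 as the round function. That construction is an assumption of this example, not a vetted cipher; production code would use a well-known algorithm from a maintained crypto library.

```python
import hashlib
import hmac

ROUNDS = 4
HALF_BITS = 32
HALF_MASK = (1 << HALF_BITS) - 1

def _round_value(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: first 32 bits of HMAC-SHA256(round number + half)."""
    message = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")

def encrypt_id(key: bytes, value: int) -> int:
    """Map a 64-bit integer to another 64-bit integer (toy Feistel network)."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value must fit in 64 bits")
    left, right = value >> HALF_BITS, value & HALF_MASK
    for round_no in range(ROUNDS):
        left, right = right, left ^ _round_value(key, round_no, right)
    return (left << HALF_BITS) | right

def decrypt_id(key: bytes, value: int) -> int:
    """Undo encrypt_id by running the rounds in reverse order."""
    left, right = value >> HALF_BITS, value & HALF_MASK
    for round_no in reversed(range(ROUNDS)):
        left, right = right ^ _round_value(key, round_no, left), left
    return (left << HALF_BITS) | right
```

Small ids such as 1 and 2 come out as unrelated 64-bit numbers of full width, which is the property the padding discussion was after.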
Nice content as always! Tbh the security problem is not the id type, the issue is not checking the user access rights. And GUID Ids are usually recommended for distributed DB or merging multiple DB into a data warehouse. Also, to avoid fragmentation, the best approach is to use sequential GUIDs generated by the DBMS.
The access rights are assumed to be in place. The package tries to make ids unguessable to prevent potential future issues with security when that auth is compromised and to prevent people collecting BI from you, for example finding out how many orders you have in your system. Exposing sequential guids suffers from the same problem. People can calculate guid ranges and get BI on your app, for example how many users you have, or what the rate of growth of your application is.
@@benoittremblay5705 You should always have both and consider status codes too. For example GitHub doesn't return 401 on unauthed api requests for repositories, but 404, so people can't try and guess which repos exist in which organisation.
I would say that if your API is going to return an object, it should be checking who is requesting it (authentication), and that that user has rights to see that object (authorization / access control). Just knowing the ID generally shouldn't give one access to the object. I haven't looked at the source of this library. (How cryptographically secure is it?) However one thing we can see from the examples shown is that *it does leak some information about the size of the ID*. A 32-bit ID (~4.2 billion possible values) (with the value 1) was mapped to two characters in a character set with 52 characters (capital and lowercase letters) (2704 possible values (52^2)) - a reduction from 32 bits of entropy to less than 12. If an attacker knows the algorithm (and depending on them not knowing (security by obscurity) is a bad practice), or just sees one low ID and guesses that others would be the same size, they've only this smaller key space to brute force, in order to find some valid values. If the system accepted 5 requests per second, they could cover those 2704 values in under 10 minutes (or a 12-bit range in under 14 minutes). If you depend on an attacker not knowing an ID (the ID value visible in the API), ideally, it should be *increasing* the size of the ID to something that is clearly infeasible to brute force. A 32-bit key would generally not be considered very secure, but it depends on your application, and how fast the attacker could make requests, and whether they would be locked out after a few invalid ones. Also remember that they don't have to cover the whole ID space to compromise some data. (e.g. if you had 32-bit IDs and your obfuscation function mapped them so that they were randomly distributed over the 32-bit space, and you had 100,000 records (of the same type/class/table) accessible by an ID, then an attacker would find a match after trying 42950 values on average).
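The figures in the comment above can be checked with a few lines of arithmetic. The function names are hypothetical; the inputs (52-character alphabet, 5 requests per second, 100,000 records in a 32-bit space) are the ones the comment uses.

```python
import math

def brute_force_minutes(keyspace: int, requests_per_second: float) -> float:
    """Minutes needed to try every value in the keyspace at a fixed request rate."""
    return keyspace / requests_per_second / 60

def expected_tries(id_space: int, live_records: int) -> float:
    """Average number of random guesses before hitting one existing record."""
    return id_space / live_records

two_char_space = 52 ** 2                  # two characters from a 52-letter alphabet
entropy_bits = math.log2(two_char_space)  # a little over 11 bits

minutes_two_chars = brute_force_minutes(two_char_space, 5)
minutes_twelve_bits = brute_force_minutes(2 ** 12, 5)
tries_for_hit = expected_tries(2 ** 32, 100_000)
```

The last value is simply the id space divided by the number of live records, which is why sparse ids in a small space are still easy to hit.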
Another fundamental problem with lack of authorization is that someone who had access to a record at one time, might not be entitled to access it at another time, but they could still know the ID.
Another thing is: If you use a standard GUID for a salt ("pepper" is a better term for it in the case; but this applies to any salt or pepper), ensure that it is actually (and preferably cryptographically securely) random. A GUID generator might use your network card ID and the time, for example, or a non-secure random number generator. It is not necessarily a requirement, in general, that GUIDs are unpredictable. (The concept is of a way to ensure global uniqueness when those generating the values are cooperating.)
I think it bears mentioning that using a system like this is great for a greenfield project where you have never exposed an int ID. However if you have it would be trivial to back into the salt value from a previously known ID converted to a hashid.
If the hashing is implemented correctly, it shouldn't be feasible to figure out the salt just by knowing some inputs and outputs. If it was, that would mean that whenever a database with hashed passwords gets leaked, by knowing just some of the original passwords, you would be able to crack the salt (and all the other original passwords in the database).
The primary use of GUIDs is so you don't have to synchronize them. You could have two different places generating them, in completely separate processes, and later you could combine some or all of them without having collisions. Whatever your tip is for, it covers less than 0.1% of the use cases of guids. Yes, guids are also pretty good db secrets, but they're not the best for that either. An actual public key and private key is much better, though even more costly.
The video is about those who use auto incremented ids because they don’t have the need for a highly distributed system. If you have synchronisation issues on the pk you shouldn’t be using an RDBMS in the first place
I worked on a project years ago that used nHibernate as the ORM and we had it configured in such a way as to use a specific guid algorithm (I think it was called comb) that generated indexable guids, so it is possible to get database friendly guids. That was actually my first job and we used guids everywhere, felt weird on my next job using ints as PKs
What are you trying to compare though? GUIDs are just generated once and then used. Hashids are converted between numbers and encoded hashes so if you don't want the original number or dont care about integers then just stick with GUIDs. But I find that integers are easier to deal with everywhere and hashids let you encode and "hide" the number when necessary.
I presume you are referring to database performance? It doesn't seem relevant to compare hashids to GUID because the idea is that you expose hashids publicly, but convert them internally to sequential IDs before you hit the DB.
@@anrikezeroti4680 It's microseconds for both. So insignificant that you don't need to worry - and if you really do have to worry about that kind of performance impact then you wouldn't be using any of this anyway.
@@anrikezeroti4680 In a relational database, ids are typically stored as a sequential integer or a UUID. With hashids you encode the integer as a hash, but this is done on the server, not in the database. As far as the database knows, it only sees an integer as the id. So really your question is what the performance difference is between using GUIDs or sequential integers as primary keys in a relational database like SQL Server. The answer is that it has a substantial effect when the rows in a table exceed a large quantity like 10,000+, but there are other videos comparing the performance of those cases.
I love this video, and thanks for highlighting this project. But, since UUIDs are a native datatype for most databases, they are stored in 128 bits internally. So, an alternative to UUIDs for data in flight might be to convert them to base64. You get the advantages of this project plus all the benefits of UUIDs.
The problem is the fragmentation they can cause so if you wanna use them but not suffer from that on the RDBMS level then ULID are a good alternative to that
The biggest problem for me is naming. I don't like when someone makes a function that is reversible by design and calls it "hash". Encoded, Encrypted - OK, but it's not a hash.
And then they are no longer random. Don't know about clashing probability, but they no longer have the U in UUID so strong anymore. (not that it ever was fully unique, but in the classic UUID the U is quite strong).
@@marsovac timeuuid is exactly like a uuid or a guid. Cassandra works well in big clusters, so they need to make sure that on each node the generated ID (timeuuid) is unique 😄
I'm growing with every video you release. Thanks for your efforts towards the community. My kind request - please make book suggestion videos for software design.
An alternative is to have an auto incrementing primary column and a separate GUID column (which is also indexed) that is assigned when the record is created. The GUID is used externally and the uint used internally. No GUID conversion necessary.. no cache or GUID lookup required
@@nickchapsas Having GUIDs in the database layer is not great, but using GUIDs as identifiers lets you use application-generated keys and then insert related objects into a database without needing multiple round trips to get the ids of parent objects. I do love the video though! I would want to see how it works and build my own version, but the idea is superb!
Well done with the 100k subscribers, well deserved. Your videos are excellent. On this topic, you could create a custom binder to decode for you before you hit the action method?
Very weird to see this in your latest videos when I just implemented this exact solution last week lol. After a very successful prototype it had me wondering where else I can use this in older projects instead of a GUID. Solid video man
Just remember that unless you're using cryptographic hashes, it is not really a security thing but obscurity. Creating your own security solutions is almost never a good idea unless you're a world class math expert and have the result independently verified.
That's super cool! I am a JavaScript developer and I honestly like this approach more than GUID/UUIDs. There is a package called *nanoid* which also does the same thing and it would be amazing if you could do a comparison of all three (GUID vs NanoID vs HashID) to generate random ids and check the collision rate of them. Because I heard somewhere that NanoIDs aren't really as universally unique as GUIDs or UUIDs.
HashID is for obfuscation of an int, not anything to do with nanoid/uuids/guids except trivially that guids are not sequentially guessable. So hashid is not the same thing, it's just an obfuscator. If you want a better obfuscator, try Knuth's hash algorithm. It generates reversible ints rather than strings and is way, way, way faster.
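A rough sketch of the reversible integer obfuscation usually meant by "Knuth's hash": multiply by an odd constant modulo 2^32, then undo it with the modular inverse. The multiplier is the constant commonly cited for Knuth's multiplicative hashing; the XOR mask and function names are assumptions added for this example. Requires Python 3.8+ for `pow(x, -1, m)`.

```python
MODULUS = 2 ** 32
MULTIPLIER = 2_654_435_761               # odd, so it is invertible modulo 2**32
INVERSE = pow(MULTIPLIER, -1, MODULUS)   # modular inverse (Python 3.8+)
XOR_MASK = 0x5F3759DF                    # hypothetical secret; keeps 0 from mapping to 0

def obfuscate(number: int) -> int:
    """Scramble a 32-bit id into another 32-bit integer."""
    if not 0 <= number < MODULUS:
        raise ValueError("number must fit in 32 bits")
    return ((number * MULTIPLIER) % MODULUS) ^ XOR_MASK

def reveal(value: int) -> int:
    """Invert obfuscate(): undo the XOR, then multiply by the inverse."""
    return ((value ^ XOR_MASK) * INVERSE) % MODULUS
```

It is one multiplication and one XOR per call, which fits the speed difference reported in the thread, but like hashids it is obfuscation and not encryption.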
Great video Nick! I will definitely check this library out! But during the whole video I kept thinking about collisions (with GUIDs you don’t have to worry) especially if you have a disproportionally big amount of data and have configured the library to a short “hash” length. Really nice video though, I’m learning a lot from you and I am inspired by your passion for deep, well explained knowledge.
GUIDs can overlap as well; it's possible, just very unlikely. A hash is the same way: it can shuffle very, very well, and the odds of an overlap are small.
@@hipihypnoctice I think you are missing the point. GUIDs are made out of the box not to overlap. Hashes on the other hand can easily overlap if you have too much data and too short hashes.
I know that this video is a bit older, but I just wanted to pass on a big thank you for it! I've always hated exposing keys as ints because, as you say, it makes it easy to hack the system. I also love that you took a moment to look at the cpu cost of using the encoding plugin. Very complete. One question I did walk away with: what is the 'cost' of using guids as primary keys compared to just using ints, from your experience? Thanks Again!!
This does not fix the security problem, proper authorization checking is the only way to prevent an attacker from querying your endpoints with id's they shouldn't have access to.
Guids are still useful because they don't have the requirement of knowing what the last number in the sequence is. So you could create a lot of objects and their ID's before saving them to the database and the ID's will be valid even before that.
@@Kingside88 Sure, but have you benchmarked and actually given those benchmarks thought? Sequential integer ids are simply superior ... _if_ you are inserting 1 billion rows sequentially and need to immediately do joins/etc without doing any database maintenance. In general, you still need to occasionally run database maintenance, and usual use-cases do not involve inserting billions of rows in a short time frame. And such use-cases will _often_ benefit more from distributed processing, which _hugely_ benefits from the fact that you don't need to generate ids sequentially on a shared resource, so you'd have to have a particular use-case where inserting these records is actually your bottleneck, not processing them. If that's not your exact use case, then GUIDs are generally equally viable from a performance standpoint. In my experience, the performance benefits of sequential ids are _extremely_ niche and I've never seen a real-world use that actually demonstrates that they were better; I've only seen micro benchmarks that specifically target this niche to show that it exists in theory. Theoretical arguments don't always translate to real-world benefits.
@@Kingside88 JOINs are a normal thing for databases. The reason why non-sequential random ID's like GUID's are bad is that they cause the data being inserted in a not optimal state, the ID's not being in a sorted order. Also, they are bigger so are slower to match and take up more space.
Congrats on 100k subs! I am interested in your thoughts about Twitter Snowflake IDs and how they compare to GUIDs and maybe hashids. Which one would you choose to build a distributed system? Great video as always.
I liked this and was about to use it, but then checked around for alternatives. Found Knuth's hash from ye olden dayes. If you want a quick comparison it generates ints that can be reversed back to the ID instead of strings and is apparently less crackable (though I doubt that matters too much for the use-case). Performance however, Knuth completed my benchmark in 950 ns while HashID took 1,466,137 ns.
I might be wrong about this... I thought the most common reason for introducing GUID/UUID was database replication with multiple write replicas. So this seems only be useful in an environment with a single main database for writing and maybe read replicas. That's a tough decision to make upfront development, since it might be the wrong direction after all and changing all indexes afterwards might be very tedious.
The main reason was uniqueness in distributed systems. You don't need to check if a Guid exists before you do an insert. You just assume it doesn't. It is also very useful when it comes to idempotency.
Partitioning columns could be a better solution for multiple writers. Guids are terrible for clustered indexes so should only be used as a nonclustered index. The idea of a HashID is to reduce the need for a value exposed via presentation layer, which is what you might use a guid for.
Hey Nick, cool package, for sure I can see its use case. What are your thoughts on the hashid using primitive types and not a value type that encapsulates its logic? Also, I would love to hear more on how you optimize the use of GUIDs in your applications as well.
Pretty cool. Maybe I'm just spoiled, but I'm thinking it would be nice if we didn't have to think about any of this, and could just use this kind of package implicitly... Meaning something like: have a model where it's just User.Id on the incoming model, with an attribute to indicate [HashId], and then have a middleware or model binder that automatically decodes incoming hashes back into ints... and outgoing ints into hashes.
This isn't supposed to replace proper auth in endpoints. If you need auth, you should have it. On top of auth, you should also not leak your BI data by allowing endpoint enumeration
One thing not covered but piquing my curiosity is if DR = 2 and Ir = 3. If the array of ids is just a concatenation of individual values, I think there is a major reduction in the value of this lib, as it would become much easier to guess values, which was kind of the entire point.
Great video as always...I don't understand why you don't have 2M subscribers already...honestly. There must be something you can do to increase your visibility... The sequential hashes are great and all in particular scenarios but fail miserably when doing inserts. I started using Guids as PKs before Jesus wore diapers, for that particular reason. And when moving to distributed environments it makes them even more attractive.
I'd be interested in a GUID deep dive for databases. My team is using UUIDs for everything because we can generate them in a distributed fashion without any need for synchronization and still have zero collisions. Sequential IDs in an RDBMS are problematic because you need hi-lo algorithms or other tricks to be able to generate many objects in a single transaction while avoiding clashes. I personally see the advantage of sequential IDs but UUIDs served me much much better.
I just hope you're not using them as the Primary Keys in the Database, that's not very good if you are. Every time you're inserting a new record with the GUID as the PK, sql will be rewriting the index to slot it into the sort order, slowing down the time it takes to insert.
@@harag9 you better believe I am. That's the problem of the DB engine, not mine. Also (unless you're doing very large bulk inserts) any b-tree implementation worth its money will not care very much about it, even less so on SSD storage. That's an argument from the 80s.
As you pointed out, this is not in fact cryptographically secure hash - the ID isn't encrypted, it's just obscured. Personally, I actually prefer separating these values, so a URL pattern like user-{id}-{hash} is more "honest" and easier to debug. And in that case, not much point to this library - just base64 encode your hash, perhaps replacing URL characters like "/", and then validate it in the controller. I've done that dozens of times in different languages, it's just a few lines of code. 🙂
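One way the `user-{id}-{hash}` pattern could be sketched: the id stays readable and a truncated keyed hash acts as the check value. Using HMAC-SHA256 with a server-side secret and a 12-byte tag is an assumption of this example; the comment only says to base64 encode a hash and validate it in the controller.

```python
import base64
import hashlib
import hmac

def sign_id(secret: bytes, record_id: int, tag_bytes: int = 12) -> str:
    """Build 'user-{id}-{tag}' where tag is a truncated HMAC of the id."""
    mac = hmac.new(secret, str(record_id).encode(), hashlib.sha256).digest()
    tag = base64.urlsafe_b64encode(mac[:tag_bytes]).rstrip(b"=").decode("ascii")
    return f"user-{record_id}-{tag}"

def verify_slug(secret: bytes, slug: str):
    """Return the id if the tag matches, otherwise None."""
    parts = slug.split("-", 2)  # the tag itself may contain '-'
    if len(parts) != 3 or parts[0] != "user":
        return None
    if not (parts[1].isascii() and parts[1].isdigit()):
        return None
    record_id = int(parts[1])
    expected = sign_id(secret, record_id)
    if hmac.compare_digest(expected.encode(), slug.encode()):
        return record_id
    return None
```

URL-safe Base64 already avoids "/" and "+", so no extra character replacement is needed before putting the tag in a path.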
Oh, and if security *does* matter, just use a basic two-way (e.g. DES) encryption for the actual key, and base64 encode that. (and of course *don't* put it in the URL.)
There is a performance tradeoff. Random hash id will lead to fast index fragmentation in DB. So, maybe, sequential guid (UUID v1) is a good alternative.
Hey, are you planning on bundling every course together or not? And can we pay in euros or only in pounds? Thanks for the great content and congrats on 100k, well deserved.
If you have an issue with auto-increment "security", it's because you're solving the problem the wrong way. Rate limiting, permissions, and auth in general solve those things.
In case it wasn't obvious in the video, this video is NOT against GUIDs. Distributed systems absolutely need them, but distributed systems will probably use NoSQL databases, which don't suffer from the problems outlined. Also, of course everything should have proper auth where needed and you shouldn’t rely on unguessable URLs. This was never in question and was never brought up as a selling point of the library or the approach. It is assumed to be the bare minimum that you should have. The video is all about how you can keep using sequential IDs internally, if the only reason you wanted to move to GUIDs was the exposure of the data and you were concerned about losing database performance, without having to worry about exposing guessable ids and opening your system up to potential security problems. Sequential ids, both ints and guids, can give your competitors business intelligence about your system (user/order count, rate of growth etc). We need to acknowledge that there is a huge number of people who don't work in scaled-out, cloud-native microservices, and this video is for them.
For a video focusing on the practicality and security issues with enumerated entities in URLs check out Tom Scott's video here: th-cam.com/video/gocwRvLhDf8/w-d-xo.html
COMB GUIDs are a good alternative if you are worried about both sequential ids and randomness: they have most of the randomness of a normal guid while being mostly sequential, even when generated on multiple servers without contacting the db first.
security by obscurity :-( always check permission
Depending on what you're concerned about, you could also have a second column that's not guessable (ex: a GUID). Index by the sequential number to get the row; then, when you get the row, make sure that the challenge matches up (assuming you don't have a better mechanism of authentication for your given system/UX).
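A rough sketch of the second-column idea above, using an in-memory SQLite table so it runs as-is. The table name, column names and helper functions are assumptions made for the example: the row is fetched by its sequential primary key and the random challenge value is compared afterwards.

```python
import hmac
import sqlite3
import uuid
from typing import Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, challenge TEXT NOT NULL, item TEXT)"
)


def create_order(item: str) -> Tuple[int, str]:
    """Insert a row and return (sequential id, unguessable challenge)."""
    challenge = uuid.uuid4().hex  # random, not derived from the id
    cur = conn.execute(
        "INSERT INTO orders (challenge, item) VALUES (?, ?)", (challenge, item)
    )
    return cur.lastrowid, challenge


def get_order(order_id: int, challenge: str) -> Optional[str]:
    """Look up by the integer key, then verify the challenge matches."""
    row = conn.execute(
        "SELECT challenge, item FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return None
    stored, item = row
    if not hmac.compare_digest(stored.encode(), challenge.encode()):
        return None
    return item


order_id, challenge = create_order("book")
print(get_order(order_id, challenge))       # book
print(get_order(order_id, "wrong-value"))   # None
```

Only the integer primary key is indexed; the challenge column is never searched, so it adds storage but no second index to fragment.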
"This video is targeted towards actual .NET users" - Nick Chapsas
Distributed systems don't have particular connection to database types. NoSQL is a rather ambiguous and meaningless term, and there are many distributed relational (SQL) databases now. I think that discussion is getting away from the actual topic which is that this is just a simple obfuscation library. Instead of turning numbers into base64, it's like turning them into your custom base format.
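To make the "custom base format" point concrete, here is a small sketch that encodes integers in base 62 using an alphabet shuffled from a secret string. This is deliberately NOT the hashids algorithm, just an illustration of the general idea; `make_alphabet`, `encode` and `decode` are hypothetical names, and the shuffle uses Python's non-cryptographic `random` module.

```python
import random

BASE_ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def make_alphabet(secret: str) -> str:
    """Shuffle the alphabet deterministically from a secret string."""
    chars = list(BASE_ALPHABET)
    random.Random(secret).shuffle(chars)  # not cryptographically secure
    return "".join(chars)


def encode(number: int, alphabet: str) -> str:
    """Write a non-negative integer in base len(alphabet)."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    digits = []
    while True:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def decode(text: str, alphabet: str) -> int:
    """Inverse of encode; raises ValueError on characters outside the alphabet."""
    base = len(alphabet)
    number = 0
    for char in text:
        number = number * base + alphabet.index(char)
    return number


alphabet = make_alphabet("my secret")
token = encode(123456789, alphabet)
print(token, decode(token, alphabet))
```

In this naive version consecutive numbers differ only in their last character, so the sequence is still easy to spot; it shows why this family of techniques is obfuscation rather than security.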
Ah, I was wondering why the project was suddenly getting PRs today. I'm the current maintainer, thanks for highlighting this!
lol, this is where I got the overload idea from and then found out there was also an issue regarding that.
MG, I haven't looked into the code, but there's one question. Let's say we have id 1 and we initialise hashids as ("Salt", 5).
If the user posts hashid a2b45, that's going to be OK, but what if the user inputs, say, d?
Is there a check or a loose overload when decoding? Will hashids try to decode d, which is a single character, even though we haven't initialised an object for single-character values?
I'm just talking about guessing ids based on indexes. Or would guessing values that are only 2 characters long work again here?
A global seed stored in the app's code is usually called "pepper". "Salt" is what differs for every record, and is stored in the datastore.
No, it is called neither salt nor pepper. These two terms have specific meanings. It is called a secret. I hate such self absorbed idiots. Go study a book first.
@@ZelenoJabko Zeleno is quite Salty about the use of the word "Salt", or indeed "Pepper". Chill bro
@@mezzer34 Zeleno might be picky, but for a video on anything security-related I also agree that you should be diligent in using the right terms. Misusing terms can lure users into thinking it's something it's not.
As Nick specifically points out, the library is named hashid but it's not a real hash at all, and that IS important.
If you're protecting sensitive data this is NOT the library for you, since it's not based on a well-tested, validated encryption method, which the author of the library also states in the GitHub description.
He even renamed the methods from encrypt/decrypt to encode/decode to emphasise that it's not real encryption and could be vulnerable, so for really sensitive data you probably want something better.
And I don't think this is hash algo either, cause hashes are one-way functions (i.e. you can't decode them).
@@alexxx4434 no, as he said in the video, the name is not right, its not a hash, it just looks like one.
When you start with "Hello everybody" and your name is Nick, I immediately picture dr. Nick from the Simpsons. Love your videos!
Congrats on the 100K subs, well deserved man!
You talk about the security aspect of sequential id's at the beginning, which I appreciate.
But if you don't keep the HashId-seed secret, it has basically the same problem, an attacker just needs to decode the hash into a sequential id, increment or decrement and encode again to get another possibly valid hash.
Well, eventually, authorization should anyway be enforced in other ways, because hashes, guids and sequential id's can always leak, and that shouldn't give non-authorized people any power in a secure system.
Auth should be in place in the first place. Obfuscating sequential ids is the cherry on top. If people just see the hashid they most likely don't even know there is a sequential int behind it, since there are tens of different ways to approach it. So it's extremely unlikely that they will use the same algorithm to decode it, and even more unlikely that they will spend the years of computation needed to brute-force a 32-byte salt.
@@nickchapsas Exactly, but if your software is open source and you just have that seed and method in the code, everyone can look at it and therefore there is no real security benefit of using hash over sequential id's.
So the cherry on top is only slightly better than sequential id's, because you won't immediately notice that you can just increment the id.
However, of course, there are other advantages to the hash-approach.
@@asdfxyz_randomname2133 In an Open Source software you wouldn‘t put the salt in the code, you would make it configurable. But authentication should be used anyway. Or you would generate it at initialization or something and store it in the DB or somewhere more secure.
@@blubblurb You should probably use the same method that you would use to store any other salt (for example, for hashing the passwords).
Kubernetes has a type of environment variable called Secret which can be used to store such values. Even if it's a closed-source project, it's a bad idea to have salts in the source code (and you should also probably have different salts for each environment).
@@asdfxyz_randomname2133 In any software, regardless of it being open or closed source, you would never check in any sensitive information like that. That's like checking your API keys into git.
I work on a large project where we originally used GUIDs as the primary key in the database, but for DB2 at least, the indexing was horrible because they were basically random and caused a lot of index cache misses. Switching to a sequential ID was the way to go for efficiency. But for exposing to web UIs, we keep a pair of dictionaries in the session state which map numeric IDs to guid and guid to numeric ID. Obviously just for the IDs we need to return to the user. Works pretty well.
It can be a burden as your dictionary keeps growing.
I think that just ciphering the ids like that library does would be a much more memory-efficient solution.
What we did where I used to work is add a Key column to the table where the GUID lives, and when we need to expose it externally, the Key is used instead of the Id. Simple: the Id is used for query joins and whatever is done internally, but when something is taken out of the API into the world... it is the Key column, not the Id, that is used. 🙂
Thank you Nick. I really like the idea of hiding the real sequential integer id without the performance drawbacks of a guid.
Also, being able to "hash" multiple ids is great. Sometimes you need it.
Great explanation also. :-)
Great video as always nick. Please please create a video in optimizing GUIDs as IDs Nick 🙏
Hi Nick, very nice video, I've been also using some other ID format called KSUID (k-sortable unique ID), these are basically smaller for storage than UUID but with more entropy bytes and I really loved them. They are sequential sortable by design, so no encode/decode has to take place.
There is support in a wide range of programming languages nowadays (originally coming from Go).
I would love to see a video on them, too.
4:36 - The term “hash” and “hashing” long predates crypto as a term of art in computing. It simply refers to the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hash ID perfectly describes this use case and is not even slightly arbitrary in meaning or application. In fact, cryptography is a form of hashing but hashing is not necessarily cryptographic. That is to say, hashing doesn’t necessarily obfuscate the original value. Take, for example, a hashtag. To go full circle, GUIDs and UUIDs are also hashes.
"cryptography is a form of hashing" makes no sense as a sentence. Cryptography uses hashing in its practices, perhaps.
However hashing in contrast to encryption is a one way process, so "Hash ID" should really be called "Encrypt ID" to describe itself.
This is one way to do it, I typically just use a secondary unique index with uuids. My queries/joins use the typical relationship integer ids but I don't expose those as identifiers or in my db code. That is a better way architecturally imho since the monotonic ids are actually leaking your db implementation details, making it harder to swap out your database and the ID generator inside the database becomes a singleton service that is difficult to replace.
Kudos to that. I remember a friend who used to bash people for using guids as Ids, and who would show them the approach of a clustered index on an internal int/long with a unique index on an external guid, and so on.
What a great idea. We've been using GUID and had to add an additional incremental column to get around indexing and paging limitations. Too late to rewrite everything now, but for new projects this is definitely a much better way to go about it.
Guid + incremental column is still better for security; the author specifically says it's not cryptographically secure and should not be used for very sensitive data.
@@davidmartensson273 gotcha. Guessing a particular ID isn't really a problem for the kind of data I'm handling, but worth noting.
@@buriedstpatrick2294 If the data is not sensitive or if it already requires a login, this would be more than enough to hide trade secrets like how many accounts there are and prevent the most basic attempts to circumvent security or prevent easy harvesting of data. But as soon as there is any personal information or otherwise valuable data a sha256 of the id and some secret or internal salt would be much more secure, and almost as easy to implement.
Hello Nick,
very good explanation of the topic!
Little note I wanted to add:
1. If performance is not an issue and security is preferred, it is recommended to use standard cryptographic algorithms to do exactly what you described here, keeping the sequential indexing power of the RDBMS while also obfuscating the ids, with everything done at runtime by a middleware or by the endpoints themselves.
2. In performance-sensitive scenarios, doing maths on large numbers may be much faster than working on strings as `hashids` does, while keeping the unpredictable property using `long`.
But anyway, the git repo of hashids looks great, and the idea of customizing the obfuscated ids with custom alphabets is quite interesting. :)
Thanks for sharing!
I suggest using ULID. They're randomly generated like GUID, but sortable chronologically. This gives the best of both worlds.
ULID and Ksuid are both quite compelling I agree
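For anyone curious what makes a ULID sortable, here is a minimal sketch of the layout as I understand it from the ULID spec: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. The `new_ulid` helper and its optional `timestamp_ms` parameter are hypothetical, and the sketch skips the spec's same-millisecond monotonicity rule.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32: no I, L, O, U


def new_ulid(timestamp_ms=None) -> str:
    """48-bit millisecond timestamp followed by 80 random bits, as 26 chars."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    randomness = int.from_bytes(os.urandom(10), "big")
    value = (timestamp_ms << 80) | randomness  # 128 bits in total
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, so the top 2 bits are always zero
        value, index = divmod(value, 32)
        chars.append(CROCKFORD[index])
    return "".join(reversed(chars))


earlier = new_ulid(1_000)
later = new_ulid(2_000)
print(earlier, later, earlier < later)
```

Because the timestamp occupies the leading characters, plain string comparison orders IDs by creation time, which is what keeps index inserts close to append-only.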
Could not have come at a better time. I was never a big fan of converting to base64 strings, trimming the non-alpha characters etc. Thanks for the video - the solution is perfect. For anyone using long integers as their keys, you can convert those to a hex string first and then use the EncodeHex and DecodeHex calls from the nuget package.
There is a method called EncodeLong and DecodeLong to encode and decode long integers. There is no need to convert long integers to hex.
Hexadecimal values would not solve the Issue of easily guessable values for the next and previous ID
@@Brodeon Thanks - didn't see that. Solves the problem
@@omriliad659 The method was to convert the long to a hex string. EncodeHex produces a hash ID (unguessable) from a hex string. But since there is an EncodeLong method I don't need to do this method anyway.
We use Type 1 UUIDs which are sequential. I believe Cassandra uses them for the clustering key also. Also guaranteed unique with no conversion required. You can also represent the UUID as Base64: "Ej5FZ-ibEtOkVkJmVUQAAA". 22 chars instead of 36.
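The 22-character form mentioned above can be reproduced with URL-safe Base64 and the padding stripped. In this sketch the UUID `123e4567-e89b-12d3-a456-426655440000` is an assumption on my part, picked because it encodes to the exact string quoted in the comment; the helper names are hypothetical.

```python
import base64
import uuid


def uuid_to_short(value: uuid.UUID) -> str:
    """Encode the 16 raw bytes as URL-safe Base64 and drop the '==' padding."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode()


def short_to_uuid(text: str) -> uuid.UUID:
    """Restore the padding and rebuild the UUID from its 16 bytes."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


original = uuid.UUID("123e4567-e89b-12d3-a456-426655440000")
short = uuid_to_short(original)
print(short, len(short), len(str(original)))  # Ej5FZ-ibEtOkVkJmVUQAAA 22 36
print(short_to_uuid(short) == original)       # True
```

This only changes the text representation; the stored value is still the same 128 bits.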
Type 1 UUIDs do not have random entropy in the whole value. In fact on the same machine on the same day only a part of the UUID will vary. Doesn't that defeat the purpose?
I use ulids, it has benefit of uuids (non colliding, and okay in distributed systems), and yet you can sort them in order (it uses current time as well), :)
Yes please on the "optimise Guids for RDB" idea! Thanks for the vid.
It is a kind of relief to see that professionals like Nick code some silly things, such as printing peepee poopoo, just like me 🤡
I found using Smitty Warbenjagermanjensen as a test name works really well because of its length. Always fun when it shows up in a meeting with my business partners. You know who gets it right away.
Time to debug my Java code... hmm, let's see if this method is being executed...
System.out.println("Foda-se");
It's nice to have options for different types of IDs to be used in different situations. GUIDs are nice sometimes, integers are nice other times. I've really enjoyed Flake IDs in some distributed situations, and hash-based IDs are great for these user-visible URL situations you're describing. Picking the right format of IDs for the right use-cases, and being able to cheaply translate between them when necessary, is important for the good design of many systems.
First of all, I don't think that's a bad library, but in C# there is already an interface called IDataProtectionProvider which can do the same thing, and in terms of memory allocation I think it is not bad either. Also, congrats on the 100K subs.
I am not aware of that interface or its implementation. I'll take a look into it, thanks!
it's in Microsoft.AspNetCore.DataProtection assembly, IDataProtectionProvider does pretty much the same thing
Yeah, IDataProtectionProvider works quite similarly
DataProtectionProvider creates enormously long "encoded" strings, however! 134 characters!
Interesting. Previously I have used checksums to improve performance when searching strings.
More for urls, emails, etc. where you have to store the full thing, but you want more optimal indexing.
It was nice to see you directly saying the outro 😀
And congratulations for the 100k subs, you really deserve it 🎉
Oh noooo I forgot to add the patreon scrolling text 😭😭
In MySQL, you can store UUID columns as BINARY. MySQL 8 has the UUID_TO_BIN() function, and BIN_TO_UUID() for the reverse.
Congratulations on reaching 100K subscribers....
As usual you present good value and good explanations! Keep up the good work!
Great vid! I've been using hashids lots recently and it's worth pointing out that I've had pentesters rule that the hashids library output is guessable.
They aren't entirely wrong: if you generate a bunch of them in sequence you will see that a pattern forms over time, especially if you limit the alphabet like I recently did to make a hashid more human-readable by removing ambiguous letters that could be mistaken for numbers etc.
Yes, and the author specifically mentioned that it's not cryptographically secure and not intended for very sensitive data; for that you will want to use a real hash like sha256 or better.
And you can actually combine them so the URL contains both a hashid number + the real hash.
The first for quick lookup and the latter for good security.
You could even use the real ID directly.
Or if indexing is not a problem you can go with GUIDs.
@@davidmartensson273 At that point, you're probably better off with a well-known symmetric-key block cipher algorithm, encoding/decoding your sequential key with an app-wide secret cipher key.
@@gabiold Possibly, but depending on the platform that can be more complex to set up. It is more powerful, though. It could also require padding to avoid too short a result if the id has very few digits.
@@davidmartensson273 Actually, if you treat the block data as an integer, rather than encoding the string representation of that integer, then there is no need for padding.
Nice content as always! Tbh the security problem is not the id type, the issue is not checking the user access rights.
And GUID Ids are usually recommended for distributed DB or merging multiple DB into a data warehouse.
Also, to avoid fragmentation, the best approach is to use sequential GUIDs generated by the DBMS.
The access rights are assumed to be in place. The package tries to make ids unguessable to prevent potential future security issues if that auth is compromised, and to prevent people collecting BI from you, for example finding out how many orders you have in your system. Exposing sequential guids suffers from the same problem. People can calculate guid ranges and get BI on your app, for example how many users you have, or what the rate of growth of your application is.
Incremented IDs are still leaking information about how many things you have, but I agree it's never secure to trust user input without proper IAM.
@@benoittremblay5705 You should always have both, and consider status codes too. For example, GitHub doesn't return 401 on unauthenticated API requests for repositories, but 404, so people can't try to guess which repos exist in which organisation.
Yes we want that video sir
Hi Nick, each video is an incredible class. I'm from Brazil and try to follow along with subtitles.
Well done, the videos are great, the only C# YouTuber I follow!
Congrats on 100K subscribers. Your videos always give some new knowledge and ideas.
I would say that if your API is going to return an object, it should be checking who is requesting it (authentication), and that that user has rights to see that object (authorization / access control). Just knowing the ID generally shouldn't give one access to the object.
I haven't looked at the source of this library. (How cryptographically secure is it?) However, one thing we can see from the examples shown is that *it does leak some information about the size of the ID*. A 32-bit ID (~4.3 billion possible values) with the value 1 was mapped to two characters in a character set with 52 characters (capital and lowercase letters), giving 2704 possible values (52^2): a reduction from 32 bits of entropy to less than 12. If an attacker knows the algorithm (and depending on them not knowing it, i.e. security by obscurity, is bad practice), or just sees one low ID and guesses that others would be the same size, they have only this smaller key space to brute force in order to find some valid values. If the system accepted 5 requests per second, they could cover those 2704 values in under 10 minutes (or a 12-bit range in under 14 minutes).
If you depend on an attacker not knowing an ID (the ID value visible in the API), ideally, it should be *increasing* the size of the ID to something that is clearly infeasible to brute force. A 32-bit key would generally not be considered very secure, but it depends on your application, and how fast the attacker could make requests, and whether they would be locked out after a few invalid ones.
Also remember that they don't have to cover the whole ID space to compromise some data. (e.g. if you had 32-bit IDs and your obfuscation function mapped them so that they were randomly distributed over the 32-bit space, and you had 100,000 records (of the same type/class/table) accessible by an ID, then an attacker would find a match after trying 42950 values on average).
Another fundamental problem with lack of authorization is that someone who had access to a record at one time, might not be entitled to access it at another time, but they could still know the ID.
Another thing is: If you use a standard GUID for a salt ("pepper" is a better term for it in the case; but this applies to any salt or pepper), ensure that it is actually (and preferably cryptographically securely) random. A GUID generator might use your network card ID and the time, for example, or a non-secure random number generator. It is not necessarily a requirement, in general, that GUIDs are unpredictable. (The concept is of a way to ensure global uniqueness when those generating the values are cooperating.)
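The brute-force figures in the comment above are easy to check. The snippet below just redoes that arithmetic under the commenter's stated assumptions: a 52-letter alphabet, 5 requests per second, and 100,000 valid records spread uniformly over a 32-bit space.

```python
import math

two_char_space = 52 ** 2                      # two letters from a 52-letter alphabet
bits_of_entropy = math.log2(two_char_space)   # entropy of that space in bits
requests_per_second = 5

minutes_two_chars = two_char_space / requests_per_second / 60
minutes_12_bits = 2 ** 12 / requests_per_second / 60

# With 100,000 valid records uniformly spread over 2**32 values, roughly one
# guess in (2**32 / 100_000) is a hit, so that is the expected number of tries.
expected_guesses = 2 ** 32 / 100_000

print(two_char_space, round(bits_of_entropy, 1))                # 2704 11.4
print(round(minutes_two_chars, 1), round(minutes_12_bits, 1))   # 9.0 13.7
print(round(expected_guesses))                                  # 42950
```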
Congrats on reaching 100K, you deserve it! This is a very useful topic, thanks. :)
Very interesting. EF Core uses a sequential guid value generator by default (when using MSSQL) to avoid this problem.
Don't sequential guid values defeat the purpose of non-guessable keys, or am I misunderstanding what that means?
@@RaptorMerlin They are sequential, but still random
Exactly, only part of the sequential guid is sequential. And for our use, the important part is distributed unique (across the database) ids.
I think it bears mentioning that using a system like this is great for a greenfield project where you have never exposed an int ID. However if you have it would be trivial to back into the salt value from a previously known ID converted to a hashid.
If the hashing is implemented correctly, it shouldn't be feasible to figure out the salt just by knowing some inputs and outputs.
If it was, that would mean that whenever a database with hashed passwords gets leaked, by knowing just some of the original passwords, you would be able to crack the salt (and all the other original passwords in the database).
@@andrijaantunovic8756 - the featured package does not actually hash the values and the same "salt" is used for everything.
The primary use of GUIDs is so you don't have to synchronize them. You could have two different places generating them, in completely separate processes, and later you could combine some or all of them without having collisions. Whatever your tip is for, it covers less than 0.1% of the use cases of guids. Yes, guids are also pretty good db secrets, but they're not the best for that either. An actual public key and private key is much better, though even higher cost.
The video is about those who use auto incremented ids because they don’t have the need for a highly distributed system. If you have synchronisation issues on the pk you shouldn’t be using an RDBMS in the first place
I worked on a project years ago that used nHibernate as the ORM and we had it configured in such a way as to use a specific guid algorithm (I think it was called comb) that generated indexable guids, so it is possible to get database friendly guids. That was actually my first job and we used guids everywhere, felt weird on my next job using ints as PKs
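As a rough illustration of the COMB idea mentioned above, the sketch below builds a 16-byte value from 10 random bytes plus a 6-byte millisecond timestamp at the end. Putting the timestamp in the trailing bytes is an assumption based on my understanding that SQL Server gives those bytes the most weight when ordering `uniqueidentifier` values; other databases would likely want it at the front. `new_comb` is a hypothetical helper, not NHibernate's implementation.

```python
import os
import time
import uuid


def new_comb(timestamp_ms=None) -> uuid.UUID:
    """10 random bytes followed by a 6-byte big-endian millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    random_part = os.urandom(10)
    time_part = timestamp_ms.to_bytes(6, "big")
    # Note: no UUID version/variant bits are set in this sketch.
    return uuid.UUID(bytes=random_part + time_part)


first = new_comb(1_000)
second = new_comb(2_000)
print(first, second)
print(first.bytes[10:] < second.bytes[10:])  # True: timestamp part increases
```

Values generated on different servers still differ in their 80 random bits, while values generated close together in time land near each other in the index.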
Best feeling when u see that nick uploaded a new video😁
Good and interesting topic btw
Been using that library for ages. I highly recommend it
I am interested in performance comparison between hashId and GUID
What are you trying to compare though? GUIDs are just generated once and then used. Hashids are converted between numbers and encoded hashes so if you don't want the original number or dont care about integers then just stick with GUIDs. But I find that integers are easier to deal with everywhere and hashids let you encode and "hide" the number when necessary.
I presume you are referring to database performance? It doesn't seem relevant to compare hashids to GUID because the idea is that you expose hashids publicly, but convert them internally to sequential IDs before you hit the DB.
What I meant is speed comparison of creating GUID or hashing id.
Caveat will be also decoding hashids.
@@anrikezeroti4680 It's microseconds for both. So insignificant that you don't need to worry - and if you really do have to worry about that kind of performance impact then you wouldn't be using any of this anyway.
@@anrikezeroti4680 In a relational database, IDs are typically stored as a sequential integer or a UUID. With hashids you encode the integer into a hash-like string, but this is done on the server, not in the database. As far as the database knows, it only sees an integer as the id. So really your question is what the performance difference is between using GUIDs or sequential integers as primary keys in a relational database like SQL Server. And the answer is that it has a substantial effect once the rows in a table exceed a large quantity like 10,000+, but there are other videos comparing the performance of those cases.
I love this video, and thanks for highlighting this project. But, since UUIDs are a native datatype for most databases, they are stored in 128 bits internally. So, an alternative to UUIDs for data in flight might be to convert them to base64. You get the advantages of this project plus all the benefits of UUiDs
The problem is the fragmentation they can cause so if you wanna use them but not suffer from that on the RDBMS level then ULID are a good alternative to that
The biggest problem for me is naming. I don't like when someone makes a function that is reversible by design and calls it "hash". Encoded, Encrypted - OK, but it's not a hash.
It’s the first thing they mention in the repo and I mention it in the video too. The naming is bad, I agree, but they don’t claim that it’s a hash
CypherID's would be a much cooler name
Looking forward for the video on guid optimization. Thank you for the video!
Please make a video how to optimise GUIDs as Ids
Yes! I was just thinking about how I might optimize the use of GUIDS as IDs in a NoSql DB.
Another interesting solution that I saw in Cassandra database is to use timeuuid (guid/uuid which contains time and can be sorted)
And then they are no longer random. Don't know about clashing probability, but they no longer have the U in UUID so strong anymore. (not that it ever was fully unique, but in the classic UUID the U is quite strong).
@@marsovac timeuuid is exactly like a uuid or a guid. Cassandra works well in big clusters, so they need to make sure that on each node the generated ID (timeuuid) is unique 😄
I'm growing with every video you release. Thanks for your efforts towards the community. My kind request - please make book suggestion videos for software design.
I'd definitely give this a try on my upcoming projects. Thanks for sharing!
An alternative is to have an auto incrementing primary column and a separate GUID column (which is also indexed) that is assigned when the record is created. The GUID is used externally and the uint used internally. No GUID conversion necessary.. no cache or GUID lookup required
The problem with indexing a guid is fragmentation. It is just not efficient and you're storing 16 extra bytes which will cause more harm than good
@@nickchapsas Time-based UUIDs negate the fragmentation issue.
@@nickchapsas Having GUIDs in the database layer is not great, but using GUIDs as identifiers lets you use application-generated keys and then insert related objects into a database without needing multiple round trips to get the ids of parent objects. I do love the video though! I would want to see how it works and build my own version, but the idea is superb!
Good learning stuff, thanks. Congrats on 100+k subscribers
Your content is gold! Thanks and congrats on 100k!
Hello Nick. Thanks for the insights on hashids. I would love to see the video on how to optimise GUID searches.
Well done with the 100k subscribers, well deserved. Your videos are excellent. On this topic, you could create a custom binder to decode for you before you hit the action method?
You totally could yeah
Very weird to see this in your latest videos when I just implemented this exact solution last week lol. After a very successful prototype it had me wondering where else I can use this in older projects instead of a GUID. Solid video man
Just remember that unless you're using cryptographic hashes, it is not really security but obscurity. Creating your own security solution is almost never a good idea unless you're a world-class math expert and have the result independently verified.
That's super cool! I am a JavaScript developer and I honestly like this approach more than GUID/UUIDs.
There is a package called *nanoid* which also does the same thing and it would be amazing if you could do a comparison of all three (GUID vs NanoID vs HashID) to generate random ids and check the collision rate of them.
Because I heard somewhere that NanoIDs aren't really as universally unique as GUIDs or UUIDs.
HashID is for obfuscation of an int, not anything to do with nanoid/uuids/guids except trivially that guids are not sequentially guessable. So hashid is not the same thing, its just an obfuscator. If you want a better obfuscator, try Knuth's hash algorithm. Generates reversible ints rather than strings and is way, way, way faster.
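A sketch of the reversible integer obfuscation suggested above, in the style of Knuth's multiplicative hashing: multiply by an odd constant modulo 2^32 and undo it with the modular inverse. The multiplier is a constant commonly quoted for this technique, and the `XOR_MASK` value and function names are hypothetical, so treat this as an illustration rather than a vetted scheme.

```python
MODULUS = 2 ** 32
MULTIPLIER = 2654435761                 # odd, so it is invertible modulo 2**32
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse (Python 3.8+)
XOR_MASK = 0x5F3759DF                   # hypothetical app-specific constant


def obscure(record_id: int) -> int:
    """Map a 32-bit id to a scrambled-looking 32-bit integer."""
    if not 0 <= record_id < MODULUS:
        raise ValueError("id must fit in 32 bits")
    return ((record_id * MULTIPLIER) % MODULUS) ^ XOR_MASK


def reveal(public_id: int) -> int:
    """Inverse of obscure()."""
    return ((public_id ^ XOR_MASK) * INVERSE) % MODULUS


for record_id in (1, 2, 3):
    public_id = obscure(record_id)
    print(record_id, public_id, reveal(public_id))
```

It is only integer arithmetic, which is why it is fast, but anyone who learns the two constants can invert it, so the same "obfuscation, not security" caveat applies.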
Great video Nick! I will definitely check this library out! But during the whole video I kept thinking about collisions (with GUIDs you don’t have to worry) especially if you have a disproportionally big amount of data and have configured the library to a short “hash” length. Really nice video though, I’m learning a lot from you and I am inspired by your passion for deep, well explained knowledge.
GUIDs can overlap as well; it's possible, just very unlikely. A hash is the same way: it can shuffle very, very well, and the odds of an overlap are small.
@@hipihypnoctice I think you are missing the point. GUIDs are made out of the box not to overlap. Hashes on the other hand can easily overlap if you have too much data and too short hashes.
Hello Nick! :)
Thank you for clarification!
This came at the right time for me. Thanks Nick.
I know that this video is a bit older, but I just wanted to pass on a big thank you for it! I've always hated exposing keys as ints; as you say, it makes it easy to hack the system. I also love that you took a moment to look at the CPU cost of using the encoding plugin. Very complete. One question I did walk away with: in your experience, what is the 'cost' of using GUIDs as primary keys compared to just using ints? Thanks again!!
Yes please to the GUID optimization video.
Congratulations on 100k!
I like when u said : "A url friendly random looking hash type thing" 😂😂
Hi Nick,
Thank you for sharing this very interesting topic.
Lot’s of greetings, Dennis 🇳🇱
Twitter uses Snowflake ID which is also interesting because it’s unique to your cluster of N machines and is also sortable.
Snowflake ID is amazing for distributed systems
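For context, a Snowflake-style ID packs a timestamp, a machine number and a per-millisecond counter into one 64-bit integer. The sketch below follows the commonly described 41/10/12-bit split; the `SnowflakeGenerator` class and its handling of clock drift are simplified assumptions of mine, not Twitter's actual implementation.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1288834974657  # Twitter's epoch; any fixed past instant works


class SnowflakeGenerator:
    """41 bits of milliseconds, 10 bits of machine id, 12 bits of sequence."""

    def __init__(self, machine_id: int):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = time.time_ns() // 1_000_000
            if now < self._last_ms:
                now = self._last_ms  # simplification: ignore a backwards clock
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:          # 4096 ids used this millisecond
                    while now <= self._last_ms:  # spin until the next one
                        now = time.time_ns() // 1_000_000
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (timestamp << 22) | (self._machine_id << 12) | self._sequence


generator = SnowflakeGenerator(machine_id=7)
sample = [generator.next_id() for _ in range(5)]
print(sample)
```

Each machine needs its own `machine_id`, which is the one piece of coordination this scheme requires; in exchange the IDs fit in a `bigint` and sort by creation time.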
Very nice. Didn’t know about this library. Thanks for sharing Nick!
Been using it for years. Great it’s available for almost all popular languages. Also one note to mention, it’s not as efficient in inserting as guid
Have been wondering about that security flaw for some time. This really eases resolving it in existing applications.
This does not fix the security problem, proper authorization checking is the only way to prevent an attacker from querying your endpoints with id's they shouldn't have access to.
@@baracek8797 I'm not really talking about authorization here, that's another topic. It's just about the fact that you're exposing real identifiers.
GUIDs are still useful because they don't have the requirement of knowing what the last number in the sequence is. So you could create a lot of objects and their IDs before saving them to the database, and the IDs will be valid even before that.
Sure, but using joins in a SQL database is a big performance issue
@@Kingside88 Sure, but have you benchmarked and actually given those benchmarks thought?
Sequential integer ids are simply superior ... _if_ you are inserting 1 billion rows sequentially and need to immediately do joins/etc without doing any database maintenance.
In general, you still need to occasionally run database maintenance, and usual use cases do not involve inserting billions of rows in a short time frame. And such use cases will _often_ benefit more from distributed processing, which _hugely_ benefits from the fact that you don't need to generate ids sequentially on a shared resource, so you'd have to have a particular use case where inserting these records is actually your bottleneck, not processing them.
If that's not your exact use case, then GUIDs are generally equally viable from a performance standpoint. In my experience, the performance benefits of sequential ids are _extremely_ niche and I've never seen a real-world use that actually demonstrates that they were better; I've only seen micro benchmarks that specifically target this niche to show that it exists in theory. Theoretical arguments don't always translate to real-world benefits.
@@Kingside88 JOINs are a normal thing for databases. The reason non-sequential random IDs like GUIDs are bad is that the rows get inserted in a non-optimal order, because the IDs don't arrive sorted. Also, they are bigger, so they are slower to match and take up more space.
Congrats on 100k subs! I am interested in your thoughts about Twitter Snowflake IDs and how they compare to GUIDs and maybe hashids. Which one would you choose to build a distributed system? Great video as always.
A really good library. Thanks for sharing.
I found the easter egg, I knew I would have been rickrolled :)
Great Video!
The best thing about uuids is that you know the id before it's committed to the database.
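A minimal sketch of what "knowing the id before it's committed" buys you, using hypothetical `Order` and `OrderLine` types: the parent and its children can be fully wired together in memory with no database round trip.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderLine:
    order_id: uuid.UUID  # foreign key known before anything is saved
    sku: str


@dataclass
class Order:
    # The ID is generated client-side at construction time.
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    lines: List[OrderLine] = field(default_factory=list)


def build_order(skus: List[str]) -> Order:
    """Build a complete object graph; nothing here touches a database."""
    order = Order()
    order.lines = [OrderLine(order_id=order.id, sku=sku) for sku in skus]
    return order
```

With a database-generated sequential ID, the child rows could only be given their `order_id` after the parent insert had returned.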
I liked this and was about to use it, but then checked around for alternatives. Found Knuth's hash from ye olden dayes. If you want a quick comparison it generates ints that can be reversed back to the ID instead of strings and is apparently less crackable (though I doubt that matters too much for the use-case). Performance however, Knuth completed my benchmark in 950 ns while HashID took 1,466,137 ns.
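For context, the int-to-int approach is usually described as multiplying by an odd constant modulo a power of two and undoing it with the modular inverse. The sketch below is a generic illustration of that idea, not the specific library benchmarked above. The multiplier is the constant commonly associated with Knuth's multiplicative hashing, and `XOR_MASK` is an arbitrary, hypothetical extra scrambling step.

```python
MODULUS = 2 ** 32
MULTIPLIER = 2654435761  # odd, so it has an inverse modulo 2**32
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse (Python 3.8+)
XOR_MASK = 0x5F3759DF  # hypothetical mask, any 32-bit value works


def obfuscate(n: int) -> int:
    """Map an ID in [0, 2**32) to another integer in the same range."""
    if not 0 <= n < MODULUS:
        raise ValueError("id must fit in 32 bits")
    return ((n * MULTIPLIER) % MODULUS) ^ XOR_MASK


def reveal(m: int) -> int:
    """Invert obfuscate()."""
    if not 0 <= m < MODULUS:
        raise ValueError("value must fit in 32 bits")
    return ((m ^ XOR_MASK) * INVERSE) % MODULUS
```

Both directions are a single multiplication and an XOR, which is consistent with the int-based approach being much cheaper than building strings. Like the string encoding, it is obfuscation rather than encryption.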
I might be wrong about this... I thought the most common reason for introducing GUID/UUID was database replication with multiple write replicas. So this seems to only be useful in an environment with a single main database for writing and maybe read replicas. That's a tough decision to make up front, since it might turn out to be the wrong direction, and changing all indexes afterwards might be very tedious.
The main reason was uniqueness in distributed systems. You don't need to check if a GUID exists before you do an insert. You just assume it doesn't. It is also very useful when it comes to idempotency.
Partitioning columns could be a better solution for multiple writers. Guids are terrible for clustered indexes so should only be used as a nonclustered index. The idea of a HashID is to reduce the need for a value exposed via presentation layer, which is what you might use a guid for.
Backpack (school app) had a similar issue. Kids pictures were stored by student id and it was accessible publicly.
Hey Nick, cool package, for sure I can see its use case. What are your thoughts on the hashid using primitive types and not a value type that encapsulates its logic? Also, I would love to hear more on how you optimize the use of GUIDs in your applications as well.
Sounds like a good plan
Great stuff, thanks Nick!
Pretty cool. Maybe I'm just spoiled, but I'm thinking it would be nice if we didn't have to think about any of this, and could just use this kinda package implicitly...
Meaning something like - have a model where it's just User.Id on the incoming model, with an attribute to indicate [HashId] and then have a middleware or modelbinder that automatically decodes incoming hashes back into ints... and outgoing ints into hashes
Very creative console messages!
Thanks for sharing-- interesting package, certainly worth giving it a shot
This looks like a good package, but don't forget to do auth on all endpoints. Remember, security through obscurity is not security at all 🙂
This isn't supposed to replace proper auth in endpoints. If you need auth, you should have it. On top of auth, you should also not leak your BI data by allowing endpoint enumeration
Please enlighten us on Guid optimization and possibly Snowflake Ids. Thanks Nick.
One thing not covered but piquing my curiosity is if DR = 2 and Ir = 3.
If the array of Ids is just a concatenation of individual values I think there is a major reduction in the value of this Lib as it would become much easier to guess values which was kind of the entire point.
Great video as always...I don't understand why you don't have 2M subscribers already...honestly. There must be something you can do to increase your visibility...
The sequential hashes are great and all in particular scenarios but fail miserably when doing inserts. I started using GUIDs as PKs before Jesus wore diapers, for that particular reason. And moving to distributed environments makes them even more attractive.
I'd be interested in a GUID deep dive for databases. My team is using UUIDs for everything because we can generate them in a distributed fashion without any need for synchronization and still have zero collisions. Sequential IDs in an RDBMS are problematic because you need hi-lo algorithms or other tricks to be able to generate many objects in a single transaction while avoiding clashes. I personally see the advantage of sequential IDs but UUIDs served me much much better.
I just hope you're not using them as the primary keys in the database; that's not very good if you are. Every time you're inserting a new record with the GUID as the PK, SQL will be rewriting the index to slot it into the sort order, slowing down the time it takes to insert.
@@harag9 you better believe I am. That's the problem of the DB engine, not mine. Also (unless you're doing very large bulk inserts) any b-tree implementation worth its money will not care very much about it, even less so on SSD storage. That's an argument from the 80s.
@@AlanDarkworld Just make sure they are not the clustered index and that you use a separate column for your clustered index. PKs are clustered by default.
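The hi-lo algorithm mentioned at the start of this thread can be sketched as follows. The idea is that each writer reserves a whole block of IDs with one round trip and then hands them out locally. `BlockAllocator` here is a hypothetical in-memory stand-in for a database sequence or counter table, and the block size of 100 is arbitrary.

```python
import threading


class BlockAllocator:
    """Stand-in for the shared resource (e.g. a database sequence)."""

    def __init__(self) -> None:
        self._next_hi = 0
        self._lock = threading.Lock()

    def next_hi(self) -> int:
        # In a real system this would be one round trip to the database.
        with self._lock:
            hi = self._next_hi
            self._next_hi += 1
            return hi


class HiLoGenerator:
    """Hands out IDs from a reserved block: id = hi * block_size + lo."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 100) -> None:
        self._allocator = allocator
        self._block_size = block_size
        self._hi = 0
        self._lo = block_size  # forces a block fetch on first use

    def next_id(self) -> int:
        if self._lo >= self._block_size:
            # Current block is used up: reserve a new one.
            self._hi = self._allocator.next_hi()
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value
```

Two generators sharing one allocator never clash, and the shared resource is only touched once per block. The trade-off is that IDs are no longer strictly in insertion order across writers, and unused IDs in a block are lost when a writer restarts.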
As you pointed out, this is not in fact a cryptographically secure hash - the ID isn't encrypted, it's just obscured. Personally, I actually prefer separating these values, so a URL pattern like user-{id}-{hash} is more "honest" and easier to debug. And in that case, there's not much point to this library - just base64 encode your hash, perhaps replacing URL characters like "/", and then validate it in the controller. I've done that dozens of times in different languages, it's just a few lines of code. 🙂
Oh, and if security *does* matter, just use a basic two-way (e.g. DES) encryption for the actual key, and base64 encode that. (and of course *don't* put it in the URL.)
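A minimal sketch of the `user-{id}-{hash}` pattern described above. The comment does not say which hash is used, so this assumes a keyed HMAC-SHA256 truncated to 12 bytes; `SECRET_KEY` and the function names are hypothetical. URL-safe base64 takes care of replacing characters like "/".

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical, keep out of source control


def _tag(user_id: int) -> str:
    """Keyed hash of the ID, URL-safe base64 without padding."""
    digest = hmac.new(SECRET_KEY, str(user_id).encode(), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest[:12]).decode().rstrip("=")


def make_slug(user_id: int) -> str:
    return f"user-{user_id}-{_tag(user_id)}"


def parse_slug(slug: str) -> Optional[int]:
    """Return the ID if the slug is well-formed and the tag matches, else None."""
    # maxsplit=2 because the URL-safe base64 tag may itself contain "-".
    parts = slug.split("-", 2)
    if len(parts) != 3 or parts[0] != "user":
        return None
    raw_id, tag = parts[1], parts[2]
    if not (raw_id.isascii() and raw_id.isdigit()):
        return None
    user_id = int(raw_id)
    # Constant-time comparison to avoid leaking how much of the tag matched.
    if not hmac.compare_digest(tag.encode(), _tag(user_id).encode()):
        return None
    return user_id
```

The ID stays readable in the URL, which helps debugging, while a guessed neighbouring ID fails validation because its tag cannot be computed without the key. As other commenters note, this sits on top of proper authorization rather than replacing it.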
Thank you Nick, great video as always...
It is url encryption and we have done it already :))
There is a performance tradeoff. Random hash id will lead to fast index fragmentation in DB. So, maybe, sequential guid (UUID v1) is a good alternative.
The hashid will be translated to a number on server side, so no problem.
I would love to see a video about guid optimization
Hey, are you planning on bundling every course together or not? And can we pay in euros or only in pounds? Thanks for the great content and congrats on 100k, well deserved.
This opened my eyes. On all of our APIs we are still using raw IDs. Thank you!
"Hello everybody I'm naked..." WHAT
Thanks Nick, I love it
That is cool... I use ints for non-secure URLs, and I try to hide the whole query string for personal data... But I will be grabbing this package...
I just ran the benchmarks Nick outlined in the video and they've made some improvements, about 1-2 ns with no overhead now. (I tested with .NET 7.)
This is actually GREAT
Thank you!!!
If you have an issue with auto-incremental "security", it's because you're solving problems the wrong way. Rate limiting, permissions, and auth in general solve those things
This video should be called "Zoomers discovered hash functions", which have been used in IT for some 30 years, if not longer :)