If you enjoyed this, we also have a free MySQL for Developers course with 50+ videos. Thanks for watching! planetscale.com/learn/courses/mysql-for-developers/introduction/course-introduction
As a student that has taken a dbms course in mysql and hated every part of it, I can absolutely tell that I would love it had it been even a bit like this video and series in general. Thank you Aaron, this content is genuinely incredible and is making me want to actually learn the details of mysql.
You'll have to learn it one day. There's no system in the world that doesn't use data - it's the entire point of why we build applications. I used to fall asleep in my database classes when I was in university (15 odd years ago) and my mate used to wake me up when the class finished because the lecturer was terrible. But after graduating I spent a lot of time teaching myself about databases for the reason I mentioned earlier.
I appreciate you talking about collisions because the one case that you do have where it happens would be a nightmare if you didn't plan for it no matter how small the chance may be.
Also when using non cryptographic hash function you have to assume that a malicious user may enter data to ptoduce hash collisions on purpose. In best case this will lead to an annoying bug, in worst case a severe security vurnability.
There are some pitfalls when using concatenated rows as the key, especially when there is no separator, but also when columns are nullable. What I'm thinking about is a row with column values 123 and 45 and another row with values 12 and 345, which would end up having the same concatenated value (when using just concat), and therefore the same hash. You could still end up in a similar situation when using concat_ws when some of the columns are nullable. We saw that the null values are completely omitted, so two sequential nullable columns could end up with the same hash, when a value is present in either one or the other column. There are even more ways where null values in some columns could result in the same input for the hash calculation. If the null values were converted to an empty string, which would create a separator representing every column, which would solve that problem, it would however be hashed the same as an empty string, or any other string you would use to represent null, but it's the best solution I can think of. That's why I think it's really critical to do the full comparisons described at 11:50.
I'm coming from your planetscale mysql db course and just gotta say - it's really amazing. Just the perfect level of depth, engineering knowledge and ease of use. I've been mostly the ml engineer guy through my career, now I'm switching to full-stack role and your course is really amazing for me to delve into optmizing the database shenanigans :) Thank you very much!
Lossy compound indexes seems such a good idea, that it should become a standard built-in feature at some point, e.g. a particular index type. You could then have index types like Postgres' GIST which will handle the deduping automatically.
You missed one big thing. When you add the generated column, you need to also specify the "stored" keyword. This means when inserts/updates are made, the generated column will be computed and stored in the table at the time of writing. The default behaviour is "virtual", which means the generated column is not stored with the data and instead must be computed at the time of reading. This means every time you select from the table, mysql needs to recompute the hash for every row in order to find the one you want, which kinda defeats the purpose of the whole exercise.
In this particular example the data is stored even though it is a virtual column. Because we put an index on it, those values are calculated and then written into the btree. Calculating the hash is pretty cheap, but if you have an expensive generated column it might be worth declaring it stored instead of virtual.
I've been using MySQL for ages now, and yet you still come up with new things and ideas (like this one and the Geo box combined with haverseine calculation). Keep it up!
Excellent Aaron! This technique can also be used to create a Cache key to store data in memory (or on disk), if you need to "roll-you-own" caching mechanism. Basically, sort/uppercase your arguments, then MD5 them together, and then use a hash array to map that MD5 to the response from the server. We do that with address lookups so we aren't constantly going to Google for geocode data (and eventually getting charged). Once we see an address we cache Google's geocode response and we never have to do the lookup again, thus saving money down the road.
My man this is is very greate and the niche you choose to post this video perticularly is very great the quality and everything I hope you make this as a series as this will help everyone a lot ❤
i didn't even use MySQL, yet i was pleasantly entertained by your video (i believe some of the knowledge is applicable to other databases as well) thank you, keep up the good vid 💪🏻
One more thing, should we ignore the cost of concat, md5 or any pre-processing stuff here? I mean, I'm not a MySQL expert so I'm not sure if we do this calcuation in database level is good or not. And when should will consider using a generated column instead of a composite index? Should we always make a experiment or if there is any instuction we can follow?
BTW the generated column is a really nice idea, if there is a column referencing or calculating by the others that our apps will have to keep it correct, it's struggling because we are afarid of there is some place in our app that doesn't handle that calculation correctly, but if this job is done by the database, at least we can avoid lots of code changes and just trust the database will keep the consistency as usual.
Generated columns are so fun. You can use them in so many places. I would probably ignore the cost of a concat or MD5, but if you were to do something very expensive in a generated column, you might consider declaring it STORED so it's not calculated over and over. I like having the database maintain these calculated values so that no matter where the insert or update is coming from, the generated column is always going to be correct and present.
The MD5 hash collision isn't the reason you add the WHERE, it's because sometimes your data can have the separator in it and "123|abc", "def" gets combined to the exact same string as "123", "abc|def" even though the values don't match.
Awesome video! one small nitpick, since you're dropping nulls when joining columns into a string with the "|" separator, the following rows will be converted into the same string before hashing, even if values came from different columns. All these: "A", NULL, "B" NULL, "A", "B" "A", "B", NULL Map to "A|B" I suggest using a more resilient value concat'ing scheme
This is really cool, the only thing I was curious about is the hashing performance? I know the hash is only created in insert and update. It doesn't look like MySQL supports any of the newer hashes like xxh3 which would be my choice since it's so incredibly fast, typically faster than the computers memory bandwidth. It's not cryptographically ( is that a word ) strong, but it's randomness is extremely high
MD5 is incredibly fast also, which is actually one of the reasons it shouldn't be used for cryptography! I haven't benchmarked it, but I think it should be fast enough for most needs. You could try CRC32 but you'd have a ton of collisions so you'd definitely need to add the additional conditions mentioned at the end of the video. Worth a few experiments to see, I suppose!
I really like your presentation style. Clear and explanatory and good video length. I never use MySQL, so one improvement for me would be if you could mention if this would be relevant for MSSQL and Postgres as well and if the syntax is different, just make a short note of that. I assume most concept are similar.
I love this, I’m curious why this isn’t just the default behavior for indices and what the crossover point is for when this is faster than the traditional index
@@PlanetScale In short, all your videos I've watched made lots of sense to me considering my side quest on building a database. I'm now able to see things from a perspective different from the usual "do this don't do that because of so and so..." which is quite rampant in the engineering world.
We use this to search for duplicates on the database. A thing that I would personally change is making all strings to LOWER(), so that if the user uses a big letter instead of a smaller one, it still sees it as a duplicate.
I would rather choose smaller hash and worry about hash collisions than picking wider hash and hoping there will be no collisions. Additionally to that, our CPUs and memory works better with 32/64 bit data that with 128 bit data. And, in this case, I would pick CRC32. Also, MD5 was designed as a cryptographic hash, you don't need cryptograpy overhead here. Simpler hash is better here. It would be nice to see a benchmark of those 3 approaches. Wide index, MD5, CRC32. Anyway, thanks for the video. I learned about auto-generated columns. And it was interesting.
Worrying about collision of md5 maybe a bit paranoid but one other collision I do worry about is concat_ws doesn't entirely solve the null problem, that null values simply disappear. For example, using your table schema, there will be collision if there is an address with street_name HL with no street_suffix, and another with no street_name and street_suffix HL. Boom, collision. I have seen too many weird real life data to be at ease about this.
Thanks! I think the most common use cases would be 1) when you're trying to index something that's not easily indexable, you can reach for this or a functional index 2) when you're trying to "paper over" some weirdness in your data model as a convenience or 3) when you're trying to create an ID for an entity that doesn't already have a natural ID. In this case the address didn't have a natural ID, so calculating a hash of the parts gives us a nice ID that we can use for deduping, uniqueness, or indexed lookups from the application side
First of all that's a great video and thanks a lot for that. I have a question regarding this, as we have 6 columns hashed as index and I need to search in 3 columns instead of six, will it be helpful in that case too?? Like I want to search in address, zip code and state column, will it work??
Ok, wait a minute. How is it possible I haven't come across your content before? You actually know your stuff. Moreover, I can't even play out the jokes in my head before you say them out loud. GET OUT OF MY HEAD. Also, keep producing amazing content. Deal? Deal.
Just being pedantic here, but you would have collisions if two columns can be null, and contain the same data, which is perfectly possible with user fed data, like "city=null, state=texas" and "city=texas, state=null" would have the same hash with you example. Just being pedantic as I said. Great video.
What program do you use to execute the queries and view the results? EDIT: Nvm, a few seconds after posting the comment I reached 8:37 where you mention that it is TablePlus.
Well Aaron, here is where the Real Black Magic happens! Actually, this is really logical for some specific cases, not all of course. Keep introducing more of these hidden gems!
Great technique but I’m not sure why you would have a large composite index (that’s not unique) since in my experience the optimizer only uses the first or at most the first and second columns from the index. I suppose you can use the latter columns to reduce a read to the main table but only if the query reads columns only from the index. But in that case a hash index wouldn’t be a viable solution.
One question. Would it be possible to send a hashed value to MySQL over unencrypted network at use some built in MySQL function to unhash this data to separate columns?
@@PlanetScale In general I have two questions. 1st. You showed that MySQL can (un)hash data and keep it private or concatenate larger information, it can even do so in realtime based on information changed in the fields of a table. Can it also do it in the other way, update fields in the table based on changed data in the hashed field in real time? 2nd. Maybe this is a topic for different time/video. Synchronise two distant databases. An offline version periodically connects to an online and upon connection a synchronisation takes place. Is there a built in MySQL function to perform something like this?
Risk of MD5 collision is abysmal, I agree, but you know how it is, would you feel safe using it with millions of records in production waiting for that phone call one night because the wrong customer was billed? :) Just for my piece of mind I'd probably make it a unique, and catch it when the record is created somehow and handle it there rather than allowing it to be inserted and then down the line have some very very weird stuff happening probably.
Amazing video, I'm a sql newbie, well newbie in any aspect of software engineering. But I was wondering, what's the downside of having an identity column as a primary key? Or rather, in what scenario would this concept be used over using identity?
I'd probably still have a primary key that's an integer, personally. Since these hashes would be randomly ordered, I wouldn't want them as PKs. I'd use them to check for existence or additionally throw a unique constraint on the hash to enforce uniqueness
@@PlanetScale Thanks for the response. So for checking existence, would I be correct to assume this method would be faster than checking "WHERE ... AND ... AND ..."? And as for the unique constraint, its possible to apply a unique constraint over multiple columns. So would applying the unique constraint to only the hash column have any benefits over multiple?
I don't think a trigger is necessary here, as you can use a generated column to do the concat automatically. Perhaps I've misunderstood the question though!
@@PlanetScale yeah to generate it yes. But once you added the column. You said in order to search with that indext you can concat it in the select query and search on the generated column. Bit instead of concating in your select query, could you add a trigger that takes the un concatenated fields to concat it for you and then search on the generated column. Just in case you forget to concat in your query or older parts of the software don’t implement that yet
@@iWhacko Ah gotcha. I actually don't know! Maybe so? I haven't worked with select triggers very much. I'd probably opt against that personally, but it might work!
Is there a hash function in MySQL, such that if I run a query with "order by f(a,b,c)", the result set will be in the same order, as for the query with "order by a, b, c"?
Well now that's an interesting question isn't it. I can't think of a lossless way to do that off the top of my head, but the question will haunt me so I'll let you know if I come up with anything.
If you enjoyed this, we also have a free MySQL for Developers course with 50+ videos. Thanks for watching! planetscale.com/learn/courses/mysql-for-developers/introduction/course-introduction
As a student that has taken a dbms course in mysql and hated every part of it, I can absolutely tell that I would love it had it been even a bit like this video and series in general. Thank you Aaron, this content is genuinely incredible and is making me want to actually learn the details of mysql.
I can't tell you how happy this makes me. It's such a shame to me that so many interesting things are taught in such an uninteresting way.
@@PlanetScale Teaching is a skill that not every expert possesses (assuming they were expert in the first place)
@@ahmad-murery Extremely good point
Aaron, this approach is refreshing. Hope you get to some focused use case series, like SQL for Product Managers!
You'll have to learn it one day. There's no system in the world that doesn't use data - it's the entire point of why we build applications. I used to fall asleep in my database classes when I was in university (15 odd years ago) and my mate used to wake me up when the class finished because the lecturer was terrible. But after graduating I spent a lot of time teaching myself about databases for the reason I mentioned earlier.
I don’t usually post on videos but the quality and the details on these is amazing! Thank you Aaron! Keep them coming!
Thank you so much, I really appreciate you saying that. We're definitely gonna keep em coming 🫡
Agreed! I don't even use or like MySQL very much, but the content is often more broadly applicable.
I appreciate you talking about collisions because the one case that you do have where it happens would be a nightmare if you didn't plan for it no matter how small the chance may be.
Totally. Especially if you use CRC32 which is going to have massive amounts of collisions!
Also when using non cryptographic hash function you have to assume that a malicious user may enter data to ptoduce hash collisions on purpose.
In best case this will lead to an annoying bug, in worst case a severe security vurnability.
@@Pilikio It can very-very theoretically cause Hash DoS attack
There are some pitfalls when using concatenated rows as the key, especially when there is no separator, but also when columns are nullable.
What I'm thinking about is a row with column values 123 and 45 and another row with values 12 and 345, which would end up having the same concatenated value (when using just concat), and therefore the same hash. You could still end up in a similar situation when using concat_ws when some of the columns are nullable. We saw that the null values are completely omitted, so two sequential nullable columns could end up with the same hash, when a value is present in either one or the other column. There are even more ways where null values in some columns could result in the same input for the hash calculation. If the null values were converted to an empty string, which would create a separator representing every column, which would solve that problem, it would however be hashed the same as an empty string, or any other string you would use to represent null, but it's the best solution I can think of.
That's why I think it's really critical to do the full comparisons described at 11:50.
Yup this is a great argument for adding the extra conditions
Rule of thumb - absolutely always do use separators when concatenating multiple values and producing whatever fingerprint you are producing
I wish the results were rendered in a monospace font so that the md5 hashes lined up neatly
You and me both 🫠
I'm coming from your planetscale mysql db course and just gotta say - it's really amazing. Just the perfect level of depth, engineering knowledge and ease of use. I've been mostly the ml engineer guy through my career, now I'm switching to full-stack role and your course is really amazing for me to delve into optmizing the database shenanigans :) Thank you very much!
I'm so glad to hear that! Thank you for letting me know. I put a lot of work into that course so I'm pleased that you've enjoyed it
Lossy compound indexes seems such a good idea, that it should become a standard built-in feature at some point, e.g. a particular index type. You could then have index types like Postgres' GIST which will handle the deduping automatically.
Yeah that's an interesting idea. It would be neat to see that built in
You missed one big thing. When you add the generated column, you need to also specify the "stored" keyword. This means when inserts/updates are made, the generated column will be computed and stored in the table at the time of writing. The default behaviour is "virtual", which means the generated column is not stored with the data and instead must be computed at the time of reading. This means every time you select from the table, mysql needs to recompute the hash for every row in order to find the one you want, which kinda defeats the purpose of the whole exercise.
It all depends on how fast is hashing vs disk I/O.
In this particular example the data is stored even though it is a virtual column. Because we put an index on it, those values are calculated and then written into the btree. Calculating the hash is pretty cheap, but if you have an expensive generated column it might be worth declaring it stored instead of virtual.
I've been using MySQL for ages now, and yet you still come up with new things and ideas (like this one and the Geo box combined with haverseine calculation). Keep it up!
These videos are really high quality. You are doing an excellent job
Thank you so much for saying so
Yeah so much this!
When I see a new video on my feed, I watch it before all my usual content because honestly, this is just so good!
"remember if you hash passwords with md5, straight to jail" the delivery got me to laugh out loud, take your like.
Got em! 😂 Glad you liked that part
You are awesome man:) I am self learning web development and you are such a nice source for wholesome and humouristic learning!
I love to hear that! Thank you for letting me know. And good luck on your web dev journey!
Excellent Aaron! This technique can also be used to create a Cache key to store data in memory (or on disk), if you need to "roll-you-own" caching mechanism. Basically, sort/uppercase your arguments, then MD5 them together, and then use a hash array to map that MD5 to the response from the server. We do that with address lookups so we aren't constantly going to Google for geocode data (and eventually getting charged). Once we see an address we cache Google's geocode response and we never have to do the lookup again, thus saving money down the road.
As someone who is just getting started with databases, this is incredible! Thank you A Aron
A A Ron 😂😭 You're welcome!
As someone who likes reading the MySQL docs, printing them out and binding them is hardcore 😂 props Aaron 🎉
I saw this video and it literally solved a problem I had at work the next day. Stellar job Aaron!
Oooo no way! Can you give me any more hints? What'd you end up hashing?
You got my sub for the "Instead of hanging out with friends, ....we continue to read " line. :)
Haha glad you liked that one 😂
As developer that always use mysql, you very did a great job explaining this topic. Thank you very much
Thank you for talking about collision, I was wondering about it even if I knew this wasn’t really an issue !
It's always fun to look up collision probabilities!
@@PlanetScale not as much as your videos, and I’m learning a ton.
Please keep up the good work !
Hi from India!
It's the best thing I came across today.
I haven't learned such a useful trick from a youtube video in a long time, thx
Super. I remember dozens of usecases, where this implementation would of saved the day.
You do great work! Tough to find solid material to point developers to.
I appreciate that very much
Content like this I love - I hope I can find a useful way to do this in my position but my indexes are always Material + location + date usually.
My man this is is very greate and the niche you choose to post this video perticularly is very great the quality and everything I hope you make this as a series as this will help everyone a lot ❤
i didn't even use MySQL, yet i was pleasantly entertained by your video (i believe some of the knowledge is applicable to other databases as well)
thank you, keep up the good vid 💪🏻
Thank you so much!
Awesome video. Now I just need to find the T-SQL version of your channel and I'm complete :D
I can see how this could be very useful in optimizations. Thank you for your practical guide on deeply doc-founded knowledge.
The generated column got me thinking if there is some places I can actually do better on my work, thanks Aaron!!
One more thing, should we ignore the cost of concat, md5 or any pre-processing stuff here? I mean, I'm not a MySQL expert so I'm not sure if we do this calcuation in database level is good or not. And when should will consider using a generated column instead of a composite index? Should we always make a experiment or if there is any instuction we can follow?
BTW the generated column is a really nice idea, if there is a column referencing or calculating by the others that our apps will have to keep it correct, it's struggling because we are afarid of there is some place in our app that doesn't handle that calculation correctly, but if this job is done by the database, at least we can avoid lots of code changes and just trust the database will keep the consistency as usual.
Generated columns are so fun. You can use them in so many places. I would probably ignore the cost of a concat or MD5, but if you were to do something very expensive in a generated column, you might consider declaring it STORED so it's not calculated over and over. I like having the database maintain these calculated values so that no matter where the insert or update is coming from, the generated column is always going to be correct and present.
@@PlanetScale Fair enough! Will definitely try it, thanks Aaron!!
So well explained with the right pace and example, its incredible. Thank you!
Just 😲, a huge thank you for all these information in all videos.
You're welcome 🤗
The MD5 hash collision isn't the reason you add the WHERE, it's because sometimes your data can have the separator in it and "123|abc", "def" gets combined to the exact same string as "123", "abc|def" even though the values don't match.
Whish there is a channel like this for mongodb :(
Awesome content ❤
Guess you'll have to switch to MySQL 😏😏
I really learned something today, thank you man.
That's was definitely a great piece of knowledge I got in just 13 minutes. Thank you!
Dude your content is 🔥.
You have enough friends and I'm proud to be your friend 👍🏻
Great content. Got yourself a new subscriber!
Awesome video! one small nitpick, since you're dropping nulls when joining columns into a string with the "|" separator, the following rows will be converted into the same string before hashing, even if values came from different columns.
All these:
"A", NULL, "B"
NULL, "A", "B"
"A", "B", NULL
Map to "A|B"
I suggest using a more resilient value concat'ing scheme
True! Unlikely to happen with this data, but that would be a good reason to use the conditions shown at the end of the video
This is really cool, the only thing I was curious about is the hashing performance?
I know the hash is only created in insert and update.
It doesn't look like MySQL supports any of the newer hashes like xxh3 which would be my choice since it's so incredibly fast, typically faster than the computers memory bandwidth. It's not cryptographically ( is that a word ) strong, but it's randomness is extremely high
MD5 is incredibly fast also, which is actually one of the reasons it shouldn't be used for cryptography! I haven't benchmarked it, but I think it should be fast enough for most needs. You could try CRC32 but you'd have a ton of collisions so you'd definitely need to add the additional conditions mentioned at the end of the video. Worth a few experiments to see, I suppose!
Any concern with performance for tables where rows are frequently updated?
Hashes like this will be pretty cheap to calculate, so it's no concern to me.
This is so useful that I feel stupid that didn’t knowing this after 6 years working with mysql 😂
Nah don't feel stupid, we should all be learning new things constantly!
I really like your presentation style. Clear and explanatory and good video length.
I never use MySQL, so one improvement for me would be if you could mention if this would be relevant for MSSQL and Postgres as well and if the syntax is different, just make a short note of that.
I assume most concept are similar.
Thanks Johan! I'm not as familiar with the other databases so I feel hesitant to speak on them
I love this, I’m curious why this isn’t just the default behavior for indices and what the crossover point is for when this is faster than the traditional index
It has pretty severe tradeoffs, like the fact that it only works for strict equality lookups
Thank you for this video. It provides more insight into why some rule of thumb exists in the database world.
You're welcome!
@@PlanetScale In short, all your videos I've watched made lots of sense to me considering my side quest on building a database. I'm now able to see things from a perspective different from the usual "do this don't do that because of so and so..." which is quite rampant in the engineering world.
You can have collisions due to the separator being included in the value stored in the columns as well
Totally! If that's possible with your data, that'd be a great reason to use the redundant conditions (mentioned at the end of the video) for deduping.
Nicely explained! What software are you using for querying?
TablePlus! th-cam.com/video/7V_CJBPZPes/w-d-xo.html
This video screams high quality content!
Amazing video, great explanation and also, nice sql tool you have there. More videos 🌟
Thank you! Got a big one coming out tomorrow on a similar topic 🤐
Thanks for sharing this amazing idea, very well explained, thank you!
This is so great! I hope this all works on MariaDB as well?? And now I also need to print this "book" ;)
Theoretically it should work the exact same! Smaller = faster. Give it a go and let me know
We use this to search for duplicates on the database. A thing that I would personally change is making all strings to LOWER(), so that if the user uses a big letter instead of a smaller one, it still sees it as a duplicate.
Nice! That's probably a good addition. Good thinking
Can you discuss your thoughts on ULID
Yes! It's on my list
this is great, thank you aaron
I would rather choose smaller hash and worry about hash collisions than picking wider hash and hoping there will be no collisions.
Additionally to that, our CPUs and memory works better with 32/64 bit data that with 128 bit data. And, in this case, I would pick CRC32.
Also, MD5 was designed as a cryptographic hash, you don't need cryptograpy overhead here. Simpler hash is better here.
It would be nice to see a benchmark of those 3 approaches. Wide index, MD5, CRC32.
Anyway, thanks for the video. I learned about auto-generated columns. And it was interesting.
CRC32 plus the additional conditions to dedupe hashes would work just fine!
Anyone have any idea of what SQL client he is using? It looks neat
Table Plus
A gem again. Thanks
Worrying about collision of md5 maybe a bit paranoid but one other collision I do worry about is concat_ws doesn't entirely solve the null problem, that null values simply disappear. For example, using your table schema, there will be collision if there is an address with street_name HL with no street_suffix, and another with no street_name and street_suffix HL. Boom, collision. I have seen too many weird real life data to be at ease about this.
Totally! In that case adding the extra conditions makes a lot of sense
Quality videos man! Keep it up
Love this. Is there any tips on thinking about this when building/designing the database schemas ?
Thanks! I think the most common use cases would be 1) when you're trying to index something that's not easily indexable, you can reach for this or a functional index 2) when you're trying to "paper over" some weirdness in your data model as a convenience or 3) when you're trying to create an ID for an entity that doesn't already have a natural ID. In this case the address didn't have a natural ID, so calculating a hash of the parts gives us a nice ID that we can use for deduping, uniqueness, or indexed lookups from the application side
First of all that's a great video and thanks a lot for that.
I have a question regarding this, as we have 6 columns hashed as index and I need to search in 3 columns instead of six, will it be helpful in that case too?? Like I want to search in address, zip code and state column, will it work??
10:50
Unfortunately @LucasSouza1 is correct. It's no longer possible because you've destroyed quite a bit of that data by hashing it.
what if you update one of the columns that is used for indexing, will the hash also be updated ?
So many great tips! Thank you very much!
Excellent video!
This man needs to be more viral; question: you mentioned strict equality, but will this approach also work with statements like ‘is like’?
The hash function obliterates LIKE because there's no substring any more to match on.
The MD5 hash is only suitable for exact matching.
Unfortunately it only works for strict equality
thanks you so much, your videos so helpul for my job. Can you do more video about EXPLAIN ANALYZE and how to optimize execution plan in MYSQL
Niw i wonder what if the we add a cokumn from other table
Good for you, Aaron!
Great content up to the point! What MySQL version are you using and reading docs of?
The docs here were version 8, but this pattern will work in any version!
Do you have any plans to add foreign key support?
We do in fact 🤐 Follow us here or on Twitter to stay up to date, but things are ✨happening✨
@@PlanetScale bro I tried it with drizzle orm before it did not work. foreign key is a big thing for my workflow.When did you release it?
a bit of content addressable storage in your database
You said that MD5 is just fine for this specific purpose. Wouldn't it be highly preferable though? Because it's cheap computationally?
from how many indexed tables does it start to make sense?
Ok, wait a minute. How is it possible I haven't come across your content before? You actually know your stuff. Moreover, I can't even play out the jokes in my head before you say them out loud. GET OUT OF MY HEAD. Also, keep producing amazing content. Deal? Deal.
You drive a hard bargain but you got yourself a deal
What's the program you're using for interacting with mysql? Looks neat!
Table Plus!
thanks! great video btw
Just being pedantic here, but you would have collisions if two columns can be null, and contain the same data, which is perfectly possible with user fed data, like "city=null, state=texas" and "city=texas, state=null" would have the same hash with you example. Just being pedantic as I said. Great video.
You're totally right! Worth adding the extra conditions shown at the end if that's the case (and it may well be!)
What program do you use to execute the queries and view the results?
EDIT: Nvm, a few seconds after posting the comment I reached 8:37 where you mention that it is TablePlus.
Bah you beat me to it! Or I guess in a way I beat me to it. Regardless... TablePlus 😂
So this technique could be used to hash string column with uuid values that are a key, to avoid indexing uuid column?
There is actually a uuid_to_bin column to compress a uuid value, which you could use to create a smaller index!
Mysql docs are blocked for my country Venezuela
Well Aaron, here is where the Real Black Magic happens!
Actually, this is really logical for some specific cases, not all of course.
Keep introducing more of these hidden gems!
Truly excellent video
Very good video!
Great technique but I’m not sure why you would have a large composite index (that’s not unique) since in my experience the optimizer only uses the first or at most the first and second columns from the index.
I suppose you can use the latter columns to reduce a read to the main table but only if the query reads columns only from the index. But in that case a hash index wouldn’t be a viable solution.
In the original case a unique constraint was added, which is a nice further use of a hash.
@@PlanetScale makes sense thanks
Great quality content
Thanks. This is really awesome trick.
Quality content👌🏾, Thank you.
Honestly this video series was a big part of me switching back from neondb.
🤐
this makes me happy
Using that alongside hstore is pretty
This is great content.
One question. Would it be possible to send a hashed value to MySQL over unencrypted network at use some built in MySQL function to unhash this data to separate columns?
Hashing (like shown here) is a one-way function, so it's not reversible. I'm not sure exactly what you're going for, to be honest!
@@PlanetScale In general I have two questions. 1st. You showed that MySQL can (un)hash data and keep it private or concatenate larger information, it can even do so in realtime based on information changed in the fields of a table. Can it also do it in the other way, update fields in the table based on changed data in the hashed field in real time? 2nd. Maybe this is a topic for different time/video. Synchronise two distant databases. An offline version periodically connects to an online and upon connection a synchronisation takes place. Is there a built in MySQL function to perform something like this?
Risk of MD5 collision is abysmal, I agree, but you know how it is, would you feel safe using it with millions of records in production waiting for that phone call one night because the wrong customer was billed? :)
Just for my piece of mind I'd probably make it a unique, and catch it when the record is created somehow and handle it there rather than allowing it to be inserted and then down the line have some very very weird stuff happening probably.
Remember: If you hash passwords with MD5: straight to jail😂. Great video, very educational
Amazing video, I'm a sql newbie, well newbie in any aspect of software engineering. But I was wondering, what's the downside of having an identity column as a primary key? Or rather, in what scenario would this concept be used over using identity?
I'd probably still have a primary key that's an integer, personally. Since these hashes would be randomly ordered, I wouldn't want them as PKs. I'd use them to check for existence or additionally throw a unique constraint on the hash to enforce uniqueness
@@PlanetScale Thanks for the response. So for checking existence, would I be correct to assume this method would be faster than checking "WHERE ... AND ... AND ..."?
And as for the unique constraint, its possible to apply a unique constraint over multiple columns. So would applying the unique constraint to only the hash column have any benefits over multiple?
is it possible to add a trigger on select statements to do the concat automatically? and then make it search on that?
I don't think a trigger is necessary here, as you can use a generated column to do the concat automatically. Perhaps I've misunderstood the question though!
@@PlanetScale yeah to generate it yes. But once you added the column. You said in order to search with that indext you can concat it in the select query and search on the generated column. Bit instead of concating in your select query, could you add a trigger that takes the un concatenated fields to concat it for you and then search on the generated column. Just in case you forget to concat in your query or older parts of the software don’t implement that yet
@@iWhacko Ah gotcha. I actually don't know! Maybe so? I haven't worked with select triggers very much. I'd probably opt against that personally, but it might work!
Is there a hash function in MySQL, such that if I run a query with "order by f(a,b,c)", the result set will be in the same order, as for the query with "order by a, b, c"?
Well now that's an interesting question isn't it. I can't think of a lossless way to do that off the top of my head, but the question will haunt me so I'll let you know if I come up with anything.
Sounds pretty intuitive
awesome!, I had no idea about this, I have a scenario where to use it. Thanks
Love to hear that! Please report back how it goes
See Ya
what about joinning?