If you enjoyed this, we also have a free MySQL for Developers course with 50+ videos. Thanks for watching! planetscale.com/learn/courses/mysql-for-developers/introduction/course-introduction
As a student who took a DBMS course on MySQL and hated every part of it, I can absolutely say I would have loved it had it been even a bit like this video and series in general. Thank you Aaron, this content is genuinely incredible and is making me want to actually learn the details of MySQL.
I can't tell you how happy this makes me. It's such a shame to me that so many interesting things are taught in such an uninteresting way.
@@PlanetScale Teaching is a skill that not every expert possesses (assuming they were expert in the first place)
@@ahmad-murery Extremely good point
Aaron, this approach is refreshing. Hope you get to some focused use case series, like SQL for Product Managers!
You'll have to learn it one day. There's no system in the world that doesn't use data - it's the entire point of why we build applications. I used to fall asleep in my database classes when I was in university (15 odd years ago) and my mate used to wake me up when the class finished because the lecturer was terrible. But after graduating I spent a lot of time teaching myself about databases for the reason I mentioned earlier.
I don’t usually post on videos but the quality and the details on these is amazing! Thank you Aaron! Keep them coming!
Thank you so much, I really appreciate you saying that. We're definitely gonna keep em coming 🫡
Agreed! I don't even use or like MySQL very much, but the content is often more broadly applicable.
I appreciate you talking about collisions because the one case that you do have where it happens would be a nightmare if you didn't plan for it no matter how small the chance may be.
Totally. Especially if you use CRC32 which is going to have massive amounts of collisions!
Also, when using a non-cryptographic hash function you have to assume that a malicious user may enter data to produce hash collisions on purpose.
In the best case this will lead to an annoying bug; in the worst case, a severe security vulnerability.
@@Pilikio It can, very theoretically, enable a hash-DoS attack
I've been using MySQL for ages now, and yet you still come up with new things and ideas (like this one, and the geo box combined with the haversine calculation). Keep it up!
You missed one big thing. When you add the generated column, you need to also specify the "stored" keyword. This means when inserts/updates are made, the generated column will be computed and stored in the table at the time of writing. The default behaviour is "virtual", which means the generated column is not stored with the data and instead must be computed at the time of reading. This means every time you select from the table, mysql needs to recompute the hash for every row in order to find the one you want, which kinda defeats the purpose of the whole exercise.
It all depends on how fast hashing is vs. disk I/O.
In this particular example the data is stored even though it is a virtual column. Because we put an index on it, those values are calculated and then written into the btree. Calculating the hash is pretty cheap, but if you have an expensive generated column it might be worth declaring it stored instead of virtual.
I wish the results were rendered in a monospace font so that the md5 hashes lined up neatly
You and me both 🫠
I'm coming from your PlanetScale MySQL DB course and just gotta say - it's really amazing. Just the perfect level of depth, engineering knowledge, and ease of use. I've mostly been the ML engineer guy throughout my career; now I'm switching to a full-stack role, and your course is really helping me delve into optimizing database shenanigans :) Thank you very much!
I'm so glad to hear that! Thank you for letting me know. I put a lot of work into that course so I'm pleased that you've enjoyed it
Lossy compound indexes seem like such a good idea that they should become a standard built-in feature at some point, e.g. a particular index type. You could then have index types like Postgres's GiST which would handle the deduping automatically.
Yeah that's an interesting idea. It would be neat to see that built in
So well explained with the right pace and example, its incredible. Thank you!
As a developer who always uses MySQL, I think you did a great job explaining this topic. Thank you very much.
"remember if you hash passwords with md5, straight to jail" the delivery got me to laugh out loud, take your like.
Got em! 😂 Glad you liked that part
My man, this is very great, and the niche you chose to post this video in is particularly great - the quality and everything. I hope you make this a series, as it will help everyone a lot ❤
I can see how this could be very useful in optimizations. Thank you for your practical guide on deeply doc-founded knowledge.
That was definitely a great piece of knowledge to get in just 13 minutes. Thank you!
I saw this video and it literally solved a problem I had at work the next day. Stellar job Aaron!
Oooo no way! Can you give me any more hints? What'd you end up hashing?
You are awesome man:) I am self learning web development and you are such a nice source for wholesome and humouristic learning!
I love to hear that! Thank you for letting me know. And good luck on your web dev journey!
Excellent, Aaron! This technique can also be used to create a cache key to store data in memory (or on disk), if you need to roll your own caching mechanism. Basically, sort/uppercase your arguments, MD5 them together, and then use a hash map to map that MD5 to the response from the server. We do that with address lookups so we aren't constantly going to Google for geocode data (and eventually getting charged). Once we see an address, we cache Google's geocode response and never have to do the lookup again, saving money down the road.
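A minimal Python sketch of that caching scheme (the `fetch_from_google` stub and the argument names are hypothetical stand-ins for the real billable API call):

```python
import hashlib

def fetch_from_google(street, city, zip_code):
    # Hypothetical stand-in for the real (billable) geocoding API call.
    return {"lat": 0.0, "lng": 0.0}

def cache_key(*args):
    # Normalize: uppercase every argument and sort, so argument order
    # and letter case don't produce distinct keys for the same address.
    normalized = sorted(str(a).upper() for a in args)
    return hashlib.md5("|".join(normalized).encode("utf-8")).hexdigest()

# Simple in-memory map from MD5 fingerprint to cached response.
geocode_cache = {}

def geocode(street, city, zip_code):
    key = cache_key(street, city, zip_code)
    if key not in geocode_cache:
        geocode_cache[key] = fetch_from_google(street, city, zip_code)
    return geocode_cache[key]
```

Repeat lookups for the same (differently-cased) address then hit the cache instead of the remote API.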
I haven't learned such a useful trick from a youtube video in a long time, thx
These videos are really high quality. You are doing an excellent job
Thank you so much for saying so
Yeah so much this!
When I see a new video on my feed, I watch it before all my usual content because honestly, this is just so good!
As someone who is just getting started with databases, this is incredible! Thank you A Aron
A A Ron 😂😭 You're welcome!
Thanks for sharing this amazing idea, very well explained, thank you!
Content like this I love - I hope I can find a useful way to do this in my position but my indexes are always Material + location + date usually.
You got my sub for the "Instead of hanging out with friends, ....we continue to read " line. :)
Haha glad you liked that one 😂
I don't even use MySQL, yet I was pleasantly entertained by your video (I believe some of the knowledge is applicable to other databases as well).
Thank you, keep up the good vids 💪🏻
Thank you so much!
So many great tips! Thank you very much!
Just 😲, a huge thank you for all this information in all your videos.
You're welcome 🤗
As someone who likes reading the MySQL docs, printing them out and binding them is hardcore 😂 props Aaron 🎉
Super. I remember dozens of use cases where this implementation would have saved the day.
Quality videos man! Keep it up
Great content. Got yourself a new subscriber!
You do great work! Tough to find solid material to point developers to.
I appreciate that very much
Dude your content is 🔥.
Hi from India!
It's the best thing I came across today.
Amazing video, great explanation and also, nice sql tool you have there. More videos 🌟
Thank you! Got a big one coming out tomorrow on a similar topic 🤐
Awesome video. Now I just need to find the T-SQL version of your channel and I'm complete :D
I really learned something today, thank you man.
Thank you for talking about collisions; I was wondering about it even though I knew it wasn't really an issue!
It's always fun to look up collision probabilities!
@@PlanetScale not as much as your videos, and I’m learning a ton.
Please keep up the good work !
A gem again. Thanks
Good for you, Aaron!
There are some pitfalls when using concatenated rows as the key, especially when there is no separator, but also when columns are nullable.
What I'm thinking about is a row with column values 123 and 45 and another row with values 12 and 345, which would end up having the same concatenated value (when using plain CONCAT), and therefore the same hash. You could still end up in a similar situation when using CONCAT_WS if some of the columns are nullable. We saw that the null values are completely omitted, so two sequential nullable columns could end up with the same hash when a value is present in either one or the other column. There are even more ways that null values in some columns could produce the same input for the hash calculation. Converting nulls to an empty string would keep a separator for every column and solve that problem; however, NULL would then hash the same as an actual empty string (or whatever string you use to represent null). Still, it's the best solution I can think of.
That's why I think it's really critical to do the full comparisons described at 11:50.
Yup this is a great argument for adding the extra conditions
Rule of thumb: absolutely always use separators when concatenating multiple values to produce a fingerprint of any kind.
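A small Python sketch of the NULL pitfall and one possible fix (Python is standing in for the SQL expressions here, and the length-prefix scheme is just one option, not the only one):

```python
def naive_key(*cols):
    # Mimics CONCAT_WS('|', ...): NULL (None) columns are silently skipped.
    return "|".join(c for c in cols if c is not None)

def safe_key(*cols):
    # Length-prefix every column and give NULL its own marker, so two
    # different rows can't serialize to the same string.
    parts = []
    for c in cols:
        parts.append("N" if c is None else f"{len(c)}:{c}")
    return "|".join(parts)

# The rows below all collapse to "A|B" under the naive scheme,
# but stay distinct under the length-prefixed one.
rows = [("A", None, "B"), (None, "A", "B"), ("A", "B", None)]
naive = {naive_key(*r) for r in rows}  # 1 distinct key
safe = {safe_key(*r) for r in rows}    # 3 distinct keys
```

Even with a safer encoding, the redundant equality conditions from the video remain the final guard against any remaining collisions.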
Quality content👌🏾, Thank you.
Excellent video!
this is great, thank you aaron
Very good video!
Truly excellent video
I really like your presentation style. Clear and explanatory and good video length.
I never use MySQL, so one improvement for me would be if you could mention if this would be relevant for MSSQL and Postgres as well and if the syntax is different, just make a short note of that.
I assume most concept are similar.
Thanks Johan! I'm not as familiar with the other databases so I feel hesitant to speak on them
Thanks. This is really awesome trick.
This is great content.
Great quality content
The MD5 hash collision isn't the reason you add the WHERE, it's because sometimes your data can have the separator in it and "123|abc", "def" gets combined to the exact same string as "123", "abc|def" even though the values don't match.
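A quick Python sketch of that separator collision (`fingerprint` here is a stand-in for the MD5(CONCAT_WS('|', ...)) expression from the video):

```python
import hashlib

def fingerprint(*cols):
    # Naive pipe-separated join: a value containing the separator
    # can collide with a differently-split row.
    return hashlib.md5("|".join(cols).encode("utf-8")).hexdigest()

# Both rows concatenate to the exact same string "123|abc|def",
# so their hashes match even though the column values differ.
a = fingerprint("123|abc", "def")
b = fingerprint("123", "abc|def")
```

So a hash match alone doesn't prove the rows match, which is a good reason to keep the redundant conditions on the original columns in the WHERE clause.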
Very nice!
You have enough friends and I'm proud to be your friend 👍🏻
This video screams high quality content!
dude u are awesome always!! thank you
🫡 thank you
Thank you for this video. It provides more insight into why some rule of thumb exists in the database world.
You're welcome!
@@PlanetScale In short, all your videos I've watched made lots of sense to me considering my side quest on building a database. I'm now able to see things from a perspective different from the usual "do this don't do that because of so and so..." which is quite rampant in the engineering world.
Earned a Sub
This is so useful that I feel stupid for not knowing it after 6 years of working with MySQL 😂
Nah don't feel stupid, we should all be learning new things constantly!
This is gold
Awesome video! One small nitpick: since you're dropping NULLs when joining columns into a string with the "|" separator, the following rows will be converted into the same string before hashing, even though the values came from different columns.
All these:
"A", NULL, "B"
NULL, "A", "B"
"A", "B", NULL
Map to "A|B"
I suggest using a more resilient value concat'ing scheme
True! Unlikely to happen with this data, but that would be a good reason to use the conditions shown at the end of the video
Thank you so much, your videos are so helpful for my job. Can you do more videos about EXPLAIN ANALYZE and how to optimize execution plans in MySQL?
Awesome! I had no idea about this, and I have a scenario where I can use it. Thanks!
Love to hear that! Please report back how it goes
This is so great! I hope this all works on MariaDB as well?? And now I also need to print this "book" ;)
Theoretically it should work the exact same! Smaller = faster. Give it a go and let me know
Another brilliant fucking video! Keep them coming 😂
The generated column got me thinking about whether there are some places I can actually do better at work. Thanks, Aaron!!
One more thing: should we ignore the cost of CONCAT, MD5, or other pre-processing here? I'm not a MySQL expert, so I'm not sure whether doing this calculation at the database level is good or not. And when should we consider using a generated column instead of a composite index? Should we always run an experiment, or is there any guidance we can follow?
BTW, the generated column is a really nice idea. If there is a column referenced by or calculated from the others that our apps have to keep correct, it's a struggle, because we're afraid some place in our app doesn't handle that calculation correctly. But if this job is done by the database, at least we can avoid lots of code changes and just trust the database to keep things consistent as usual.
Generated columns are so fun. You can use them in so many places. I would probably ignore the cost of a concat or MD5, but if you were to do something very expensive in a generated column, you might consider declaring it STORED so it's not calculated over and over. I like having the database maintain these calculated values so that no matter where the insert or update is coming from, the generated column is always going to be correct and present.
@@PlanetScale Fair enough! Will definitely try it, thanks Aaron!!
Wish there were a channel like this for MongoDB :(
Awesome content ❤
Guess you'll have to switch to MySQL 😏😏
"Straight to jail!"🤣
It's true, don't try it!
Ok, wait a minute. How is it possible I haven't come across your content before? You actually know your stuff. Moreover, I can't even play out the jokes in my head before you say them out loud. GET OUT OF MY HEAD. Also, keep producing amazing content. Deal? Deal.
You drive a hard bargain but you got yourself a deal
Great content, straight to the point! What MySQL version are you using and reading the docs of?
The docs here were version 8, but this pattern will work in any version!
What if you update one of the columns that is used for indexing? Will the hash also be updated?
thanks
What program do you use to execute the queries and view the results?
EDIT: Nvm, a few seconds after posting the comment I reached 8:37 where you mention that it is TablePlus.
Bah you beat me to it! Or I guess in a way I beat me to it. Regardless... TablePlus 😂
Love this. Is there any tips on thinking about this when building/designing the database schemas ?
Thanks! I think the most common use cases would be 1) when you're trying to index something that's not easily indexable, you can reach for this or a functional index 2) when you're trying to "paper over" some weirdness in your data model as a convenience or 3) when you're trying to create an ID for an entity that doesn't already have a natural ID. In this case the address didn't have a natural ID, so calculating a hash of the parts gives us a nice ID that we can use for deduping, uniqueness, or indexed lookups from the application side
That straight to jail part goes hard ngl
Sounds pretty intuitive
Honestly this video series was a big part of me switching back from neondb.
🤐
this makes me happy
I would rather choose a smaller hash and worry about hash collisions than pick a wider hash and hope there will be no collisions.
In addition, our CPUs and memory work better with 32/64-bit data than with 128-bit data. In this case, I would pick CRC32.
Also, MD5 was designed as a cryptographic hash; you don't need the cryptography overhead here. A simpler hash is better.
It would be nice to see a benchmark of those 3 approaches. Wide index, MD5, CRC32.
Anyway, thanks for the video. I learned about auto-generated columns. And it was interesting.
CRC32 plus the additional conditions to dedupe hashes would work just fine!
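For a rough sense of that tradeoff, here's the standard birthday-bound approximation sketched in Python (an estimate, not a benchmark; real collision counts depend on the data and hash quality):

```python
import math

def collision_probability(n_rows, hash_bits):
    # Birthday-bound approximation: p ~ 1 - exp(-n^2 / 2^(bits+1)).
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-(n_rows ** 2) / 2 ** (hash_bits + 1))

# At one million rows:
p_crc32 = collision_probability(1_000_000, 32)   # effectively certain
p_md5 = collision_probability(1_000_000, 128)    # vanishingly small
```

With CRC32 at that scale a collision is a near-certainty, so the redundant equality conditions aren't optional; with MD5 they're belt-and-suspenders.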
Great ♥♥
This is really cool; the only thing I was curious about is the hashing performance.
I know the hash is only computed on insert and update.
It doesn't look like MySQL supports any of the newer hashes like xxh3, which would be my choice since it's so incredibly fast - typically faster than the computer's memory bandwidth. It's not cryptographically (is that a word?) strong, but its randomness is extremely high.
MD5 is incredibly fast also, which is actually one of the reasons it shouldn't be used for cryptography! I haven't benchmarked it, but I think it should be fast enough for most needs. You could try CRC32 but you'd have a ton of collisions so you'd definitely need to add the additional conditions mentioned at the end of the video. Worth a few experiments to see, I suppose!
Nicely explained! What software are you using for querying?
TablePlus! ruclips.net/video/7V_CJBPZPes/видео.html
You are very entertaining. Great content. Thanks 👍
Straight to jail! 😂
Love you Aaron
🤗 Thanks Mahdi! Made my day
We use this to search for duplicates in the database. One thing I would personally change is passing all strings through LOWER(), so that if the user uses a capital letter instead of a lowercase one, it's still seen as a duplicate.
Nice! That's probably a good addition. Good thinking
This man needs to go more viral. Question: you mentioned strict equality, but will this approach also work with operators like LIKE?
The hash function obliterates LIKE because there's no substring any more to match on.
The MD5 hash is only suitable for exact matching.
Unfortunately it only works for strict equality
First of all that's a great video and thanks a lot for that.
I have a question regarding this: we have 6 columns hashed as an index, and I need to search on 3 columns instead of six. Will it be helpful in that case too? Say I want to search on the address, zip code, and state columns - will it work?
10:50
Unfortunately @LucasSouza1 is correct. It's no longer possible because you've destroyed quite a bit of that data by hashing it.
Just wondering: Is an index on a generated column like in your example identical performance-wise and storage-wise to an index on an ULID or optimized UUID column? I assume the fragmentation would be worse for such a generated hash column due to the randomness of the values. (And yeah, I know the two serve totally different purposes, no question in that.)
It'll be about the same as any other secondary index! The bad thing about random UUIDs as *primary* keys is that the entire clustered index has to be rebalanced along with all the data. That's why you (rightly) hear people go on and on about ULIDs and time-ordered UUIDs: so that rebalancing doesn't have to happen. With a secondary index, it's constantly being broken and rebalanced, but it's so much cheaper, because it's only the indexed data, not the entire table's worth of data as is the case with the clustered index. (Note this is accurate for MySQL only; I know very little about the inner workings of other databases.) Hope that makes sense slash answers your question!
@@PlanetScale Thanks!
I love this, I’m curious why this isn’t just the default behavior for indices and what the crossover point is for when this is faster than the traditional index
It has pretty severe tradeoffs, like the fact that it only works for strict equality lookups
Well Aaron, here is where the Real Black Magic happens!
Actually, this is really logical for some specific cases, not all of course.
Keep introducing more of these hidden gems!
from how many indexed tables does it start to make sense?
Now I wonder: what if we add a column from another table?
Remember: If you hash passwords with MD5: straight to jail😂. Great video, very educational
Any concern with performance for tables where rows are frequently updated?
Hashes like this will be pretty cheap to calculate, so it's no concern to me.
Using that alongside hstore is pretty
You can have collisions due to the separator being included in the value stored in the columns as well
Totally! If that's possible with your data, that'd be a great reason to use the redundant conditions (mentioned at the end of the video) for deduping.
Are generated columns in mysql persisted by default or does it allow you to create an index on non-persisted value? both options seem kinda weird/interesting
By default they are virtual, meaning they are calculated at run time. You can create them with the STORED keyword and they'll be written to disk. Interestingly, if you create a virtual generated column and then index it, the value is in fact written to disk, because the index is written to disk.
What app are you using to query request mysql?
Table Plus!
So this technique could be used to hash a string column with UUID values that are a key, to avoid indexing the UUID column?
There is actually a UUID_TO_BIN function to compress a UUID value, which you could use to create a smaller index!
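A rough Python illustration of why that helps, using the uuid module as an analog (this mirrors UUID_TO_BIN without its optional swap flag): the text form is 36 characters while the binary form is 16 bytes, so each index entry shrinks by more than half.

```python
import uuid

u = uuid.uuid4()
text_form = str(u)     # canonical text form: 36 characters
binary_form = u.bytes  # raw 16-byte form, analogous to UUID_TO_BIN output
```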
One question: would it be possible to send a hashed value to MySQL over an unencrypted network and use some built-in MySQL function to unhash this data into separate columns?
Hashing (like shown here) is a one-way function, so it's not reversible. I'm not sure exactly what you're going for, to be honest!
@@PlanetScale In general I have two questions. 1st: You showed that MySQL can (un)hash data to keep it private or to concatenate larger pieces of information, and it can even do so in real time based on information changed in the fields of a table. Can it also go the other way: update fields in the table based on changed data in the hashed field, in real time? 2nd: Maybe this is a topic for a different time/video: synchronising two distant databases. An offline version periodically connects to an online one, and upon connection a synchronisation takes place. Is there a built-in MySQL function to perform something like this?
Is there a hash function in MySQL, such that if I run a query with "order by f(a,b,c)", the result set will be in the same order, as for the query with "order by a, b, c"?
Well now that's an interesting question isn't it. I can't think of a lossless way to do that off the top of my head, but the question will haunt me so I'll let you know if I come up with anything.
Very useful!