One other option for the "hash" variant is to convert the hash into a numeric column (take the bottom half or quarter of MD5, a 128-bit hash, for example) and then modulo that by some arbitrary number. The modulo produces groups that you can then filter by and randomize the order of. Ideally the number is determined by your data size, since a small number gives larger groups: with around 1000 rows, % 3 or % 7 might work well, while on millions of rows something like % 17737 might work better. The concept is similar, in theory, to a hash table, where you take the hash % the size of the table to figure out which bucket to put an entry in. This also lets you "seed" some of the randomness, especially if you do the remaining randomness in your application, which can give you randomized pages of data, and you can just change the modulo if you want to change the "seed".
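Roughly, in MySQL, it could look like this (the `bookmarks` table, column names, and bucket count are all made up for illustration):

```sql
-- Stored generated column: bottom 8 hex chars (a quarter) of the MD5, converted to a
-- number and reduced to one of 97 buckets. 97 is arbitrary; scale it with your row count.
ALTER TABLE bookmarks
  ADD COLUMN bucket INT AS (CONV(RIGHT(MD5(id), 8), 16, 10) % 97) STORED,
  ADD INDEX idx_bucket (bucket);

-- Pick a bucket number in the application (this is the "seed"), then shuffle only that
-- bucket, or do the remaining shuffling application-side.
SELECT * FROM bookmarks WHERE bucket = 42 ORDER BY RAND() LIMIT 10;
```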
hey man your videos kick ass and i cannot thank you enough for your approach with these. your videos can be watched once and understood... every single one of them... i don't know how you do it, but the way you have picked to teach anything you teach is incredible. you freaking rock! thank you!
Gah that really encourages me. Thank you
On your first example, you may have collisions (the same id appears twice or more). I think a simple DISTINCT could avoid that (if the list of ids you're passing is small enough).
In my case the primary key is a UUID, so I can randomly pick 4 characters out of the 32 in the primary key, sort by them, and limit. I can also turn those 4 random positions into a random seed for pagination.
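A sketch of that idea, assuming a hypothetical `bookmarks` table whose `id` is a 32-character hex UUID; the four positions here stand in for positions the app picks at random, and reusing the same positions across requests acts as the pagination "seed":

```sql
-- Order by 4 characters plucked from fixed positions of the UUID primary key.
-- Note this is still a full sort unless the expression (or a generated column on it)
-- is indexed.
SELECT *
FROM bookmarks
ORDER BY CONCAT(
    SUBSTRING(id, 3, 1),
    SUBSTRING(id, 11, 1),
    SUBSTRING(id, 19, 1),
    SUBSTRING(id, 27, 1)
)
LIMIT 10;
```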
Another way: create a generated decimal column whose default value is the RAND() function, add an index to it, and use that.
Good idea! That would lead to rows being really "clumpy", though: rows with the same or similar random numbers would always be returned together. Also, a generated column can't have a RAND() default, unfortunately.
@@PlanetScale why can't Rand() be the default for a generated column? Too expensive computationally?
@@IsaacWalkercox Because the function is not deterministic. A function is deterministic if, given the same data in the tables, multiple invocations produce the same result, independently of the connected user. This requirement is generally in place for features like master-slave database replication.
@@devhammed makes sense. The index would need the same information per user. Seems like something they could get round with some sort of wider context though
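For anyone following along, a minimal sketch of the workaround this thread points to, assuming a hypothetical `bookmarks` table: since a generated column must be deterministic, use a plain column and fill it with RAND() yourself:

```sql
-- Plain (non-generated) column, backfilled with a per-row random value and indexed.
ALTER TABLE bookmarks
  ADD COLUMN rand_key DOUBLE,
  ADD INDEX idx_rand_key (rand_key);

UPDATE bookmarks SET rand_key = RAND();

-- Compute the threshold once (a session variable here) so the read is a plain index
-- range scan. The clumpiness caveat above still applies: index neighbors always come
-- back together, and a threshold near 1.0 can return fewer than 10 rows.
SET @r = RAND();
SELECT * FROM bookmarks WHERE rand_key >= @r ORDER BY rand_key LIMIT 10;
```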
Does this always return the same result?
What client GUI are you using?
Looks like he's using TablePlus
For gaps in auto-incremented ids or UUIDs, you could create a row_number with a window function to run the RAND() function against, and guarantee results will be returned. But I don't know how computationally expensive that'd be.
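A sketch of that for a single random row, on a hypothetical `bookmarks` table; picking n rows is easiest by generating the n positions in the application:

```sql
-- Number the rows densely so gaps in `id` don't matter, then pick a random position.
-- The window function still has to scan and number the whole table, which is the cost.
WITH numbered AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM bookmarks
),
pick AS (
    SELECT FLOOR(1 + RAND() * (SELECT COUNT(*) FROM bookmarks)) AS rn
)
SELECT b.*
FROM bookmarks b
JOIN numbered n ON n.id = b.id
JOIN pick p ON p.rn = n.rn;
```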
Thank you so much for the video. I have an application that I use to learn vocabulary. It sends me ten random words by email every day. I'm using the random() function, which is fine because the vocabulary table is small. However, I've always wondered if there are better ways to do this. Thanks to your video, I now know some alternative methods.
Excellent video. Picking up where you left off, using the hash column index to expedite the lookup: how would generating rand(n) hex values improve randomness (a 16^4 space) over just using two hex buckets? Would it make sense to generate multiple two-digit hash columns and use them both?
Great video as always! 👏
Wow! I never knew about covering indexes. I don't need to do a random pull like in the video but I could definitely put those indexes to use. Great video, keep it up!
Check out the video right before this one, it's all about covering indexes! The one with the timer in the thumbnail.
Would that inner join at 13:14 be effectively replaceable by a where exists keeping the same performance?
I'm wondering this as well.
If I had to make a guess, I would say it is less performant. The inner join is on a subquery, so the subquery is executed first and then joined on the primary key of the bookmarks table.
A "where exists", I would think, gets looked up (or even executed) for each row. (Not an expert, btw.)
He did explain this around 6:00. RAND() is a volatile function, so it being in the WHERE clause calls RAND() over and over again for each row.
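For reference, a sketch of the deferred-join shape being discussed; the table name is hypothetical and this isn't necessarily the exact query from the video. The honest answer is to EXPLAIN both forms, since the optimizer decides whether an EXISTS rewrite ends up doing the same work:

```sql
-- Deferred join: the derived table picks the 10 random ids first (ideally from a covering
-- index), then the outer query fetches only those 10 rows by primary key.
SELECT b.*
FROM bookmarks b
INNER JOIN (
    SELECT id
    FROM bookmarks
    ORDER BY RAND()
    LIMIT 10
) AS picked ON picked.id = b.id;

-- An EXISTS version would have to nest the same LIMITed derived table inside the EXISTS
-- subquery; whether MySQL materializes it once or re-evaluates it is optimizer-dependent,
-- so compare EXPLAIN output rather than assuming.
```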
I love videos that get me thinking.
What if you strip the size of the data down by several orders of magnitude with something like WHERE RAND() < 0.001,
and then ORDER BY RAND() only on that subset?
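A sketch of that, on a hypothetical `bookmarks` table:

```sql
-- RAND() in the WHERE clause runs once per row, so this is still a full scan, but only
-- about 0.1% of rows survive to be sorted, which is the expensive part of ORDER BY RAND().
-- On a small table the filter can leave fewer than 10 rows, so the cutoff needs tuning.
SELECT *
FROM bookmarks
WHERE RAND() < 0.001
ORDER BY RAND()
LIMIT 10;
```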
If I try to add an index on a virtual field, MariaDB tells me this:
1901 - Function or expression 'concat(`fecha`,' ',`hora`)' cannot be used in the GENERATED ALWAYS AS clause of `fecha_hora`
Does it only work in MySQL?
You're an amazing teacher!
Yeah, deferred joins and covering indexes rule! Great idea using hashes for "randomness"!
If it's fast enough, it's fast enough... relevant in so many places...
I had a similar problem with my PlanetScale free db, where my bad code hammered the rows-read quota the first month with ORDER BY RAND().
I ended up with a less random solution using cache and code.
In PHP I generate a random int within the count of table rows (so I don't care about id gaps), and I get n rows from the db with that int as an offset. Then I randomly order them in PHP and return an array. Since I'm extracting a larger number of rows, I cache that query in Laravel for X amount of time, and on each request I pick from the cache; with a big enough number of rows the limited randomness is good enough in my application.
To reduce the time on the wire, I only select the columns I need for the use case, and I have an index on the columns needed.
The cache can be generated by a worker, to avoid handing a slow request to a web user every X seconds.
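The database side of that approach, as I understand it, is just an offset read (table and column names are hypothetical):

```sql
-- 12345 stands in for a random offset the application picks between 0 and COUNT(*) - 50,
-- so gaps in the id sequence don't matter. Only the columns the use case needs are
-- selected; note a large OFFSET still walks the index up to that position before returning.
SELECT id, title, url
FROM bookmarks
ORDER BY id
LIMIT 50 OFFSET 12345;
```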
I'm pretty sure someone will implement a random password generator this way, by storing all possible passwords in a table and then letting MySQL do its magic.
I have a question: what is the application he's using?
This was really interesting, mate! Love your videos, you've filled a huge gap in db-related topics for a lot of people.
Thank you 😌
Super Useful Video! Thank you, Aaron!
I liked the bucket method.
Thanks Aaron!
Reservoir Sampling. 'nough said
Perhaps not quite enough, I've never heard of Reservoir Sampling! Will have to look that up.
@@PlanetScale here's the cppcon talk I've learned about it from: ruclips.net/video/YA-nB2wjVcI/видео.htmlsi=H40wG1Iha9yOygHZ (I resisted the urge to make it a Rick roll)
Yeah, maybe in your SQL
Redis the rows and choose randomly from cache. Problem solved.
Redis the rows? I'm not sure what that means
@@PlanetScale Cache = Performance. Kudos on doing all that work to get your database to cough up the data but caching is always faster.
I understand that 🤗 But are you saying store all the rows in Redis? "Redis the rows" is unclear to me
@@PlanetScale I am always saying cache just enough data to speed up the process. This should always be the goal. Trust me, there is an optimal solution involving a cache.