RAND is too slow (in MySQL)

  • Published: Nov 26, 2024

Comments • 45

  • @Wundero
    @Wundero 11 months ago +5

    One other option for the "hash" variant is to convert the hash into a numeric column (take the bottom half or quarter of md5 (a 128 bit hash), for example) and then modulo that by some arbitrary number. The modulo would help produce groups that you could then filter by and randomize the order of. Ideally the number is determined by your data size, since a small number gives larger groups, so if you have like 1000 rows, % 3 or % 7 might work well, while on millions of rows, something % 17737 might work better. The concept is similar, in theory, to a hash table, where you would take the hash % the size of the table to figure out what group to put them in. This also lets you "seed" some of the randomness, especially if you do the remaining randomness in your application, which can let you have randomized pages of data, and you can just change the modulo if you want to change the "seed".
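
A quick sketch of that modulo-bucket idea in Python (the key format, bucket count, and function names here are made up for illustration):

```python
import hashlib

def bucket(key: str, num_buckets: int) -> int:
    """Map a row's key to a bucket using the bottom 64 bits of its md5."""
    digest = hashlib.md5(key.encode()).hexdigest()
    low_half = int(digest[16:], 16)  # bottom half of the 128-bit hash
    return low_half % num_buckets

# Group 1000 keys into 7 buckets; a query can then filter on one bucket
# and randomize the order of that much smaller group.
groups = {}
for i in range(1000):
    key = f"row-{i}"
    groups.setdefault(bucket(key, 7), []).append(key)
```

Changing the modulus reshuffles every row into a new grouping, which is the "seed" effect described above.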

  • @ijf03208rek
    @ijf03208rek 11 months ago +10

    hey man your videos kick ass and i cannot thank you enough for your approach with these. your videos can be watched once and understood... every single one of them... i don't know how you do it, but the way you have picked to teach anything you teach is incredible. you freaking rock! thank you!

    • @PlanetScale
      @PlanetScale  11 months ago +2

      Gah that really encourages me. Thank you

  • @olivierm.594
    @olivierm.594 11 months ago +1

    In your first example you may get collisions (the same id appearing twice or more); I think a simple "distinct" could avoid that (if the list of ids you're parsing is small enough).

  • @samho321
    @samho321 11 months ago +3

    In my case the primary key is a UUID, so I can randomly pick 4 characters out of the 32 in the primary key, sort by them, and apply a limit. I'm also able to turn those 4 random positions into a random seed for pagination.
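
A rough Python illustration of that UUID trick (the position picking, page size, and names are my own assumptions):

```python
import random
import uuid

def sort_key(id_hex: str, positions: tuple) -> str:
    """Concatenate 4 fixed character positions of a 32-char UUID hex string."""
    return "".join(id_hex[p] for p in positions)

ids = [uuid.uuid4().hex for _ in range(100)]
positions = tuple(random.sample(range(32), 4))  # this choice acts as the "seed"
# Sorting by the extracted characters gives a stable but arbitrary order,
# so the same positions can paginate consistently through "random" results.
page = sorted(ids, key=lambda h: sort_key(h, positions))[:10]
```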

  •  11 months ago +20

    Another way: create a generated decimal column whose default value is the rand() function. Add an index to it and use it.

    • @PlanetScale
      @PlanetScale  11 months ago +3

      Good idea! That would lead to rows being really "clumpy" though. Rows with the same or similar random numbers would be returned together. Also a generated column can't have a RAND default unfortunately

    • @IsaacWalkercox
      @IsaacWalkercox 11 months ago +1

      @@PlanetScale why can't Rand() be the default for a generated column? Too expensive computationally?

    • @devhammed
      @devhammed 11 months ago +4

      @@IsaacWalkercox Because the function is not deterministic. A function is deterministic if, given the same data in tables, multiple invocations produce the same result, independently of the connected user. This restriction is generally in place for features like master-slave database replication.

    • @IsaacWalkercox
      @IsaacWalkercox 11 months ago +2

      @@devhammed makes sense. The index would need the same information per user. Seems like something they could get round with some sort of wider context though

    • @RandomGeometryDashStuff
      @RandomGeometryDashStuff 11 months ago +1

      Does this make the result always the same?

  • @tylerslater
    @tylerslater 9 months ago +1

    What client GUI are you using?

    • @troublescoot
      @troublescoot 9 months ago

      Looks like he's using TablePlus

  • @cafehodge
    @cafehodge 11 months ago

    For gaps in auto-incremented ids or UUIDs, you could create a row_number with a window function, use the rand() function against that, and guarantee results will be returned. But idk how computationally expensive that'd be.
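
A sketch of that ROW_NUMBER idea, using sqlite via Python for illustration (MySQL 8+ has the same ROW_NUMBER() OVER syntax; the table and data are made up). Note the window function still numbers every row, which is the cost being wondered about:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
# Deliberately leave gaps in the ids.
conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)",
                 [(i, f"v{i}") for i in (1, 5, 9, 12, 40, 41, 99)])

# ROW_NUMBER() produces a gap-free 1..n numbering, so a random integer in
# that range always lands on an existing row.
n = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
pick = random.randint(1, n)
row = conn.execute("""
    SELECT id, val FROM (
        SELECT id, val, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM t
    ) AS numbered WHERE rn = ?
""", (pick,)).fetchone()
```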

  • @sungvadan
    @sungvadan 11 months ago

    Thank you so much for the video. I have an application that I use to learn vocabulary. It sends me ten random words by email every day. I'm using the random() function, which is fine because the vocabulary table is small. However, I've always wondered if there are better ways to do this. Thanks to your video, I now know some alternative methods.

  • @roganl
    @roganl 11 months ago

    Excellent Video. Picking up where you left off, using the hash column index to expedite the lookup, how would generating rand (n) hex values improve randomness (16^4 space) over just using two hex buckets ? Would it make sense to generate multiple two digit hash columns and use them both?

  • @gfni
    @gfni 11 months ago +3

    Great video as always! 👏

  • @jacksonzamorano
    @jacksonzamorano 11 months ago +1

    Wow! I never knew about covering indexes. I don't need to do a random pull like in the video but I could definitely put those indexes to use. Great video, keep it up!

    • @PlanetScale
      @PlanetScale  11 months ago

      Check out the video right before this one, it's all about covering indexes! The one with the timer in the thumbnail.

  • @kriansa
    @kriansa 11 months ago +3

    Would that inner join at 13:14 be effectively replaceable by a where exists keeping the same performance?

    • @mme725
      @mme725 11 months ago

      I'm wondering this as well.

    • @balduin_b4334
      @balduin_b4334 11 months ago

      If I had to make a guess, I would say it's less performant. The inner join is against a subquery, so that is executed first and then joined on the primary key of the bookmarks table.
      "Where exists" I would think gets looked up (or even executed) for each row. (Not an expert btw.)

    • @OliverONeill
      @OliverONeill 11 months ago

      He did explain this around 6:00. RAND() is a volatile function, so it being in the WHERE clause calls RAND() over and over again for each row.
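
For illustration, here is the deferred-join shape being discussed, in sqlite via Python (the table and data are made up; sqlite spells RAND() as RANDOM()). One reason a WHERE EXISTS rewrite may not be a drop-in: EXISTS makes the subquery correlated, so a volatile RAND() inside it could be re-evaluated per outer row instead of once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany("INSERT INTO bookmarks VALUES (?, ?)",
                 [(i, f"https://example.com/{i}") for i in range(1, 101)])

# Deferred join: the derived table picks random ids first (in MySQL that part
# can be satisfied from a narrow index), then the outer query joins back on
# the primary key to fetch each full row exactly once.
rows = conn.execute("""
    SELECT b.id, b.url
    FROM bookmarks AS b
    INNER JOIN (
        SELECT id FROM bookmarks ORDER BY RANDOM() LIMIT 5
    ) AS picked ON picked.id = b.id
""").fetchall()
```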

  • @DuniC0
    @DuniC0 11 months ago

    I love videos that get me thinking.
    What if you strip down the size of the data by several orders of magnitude with something like WHERE RAND() < 0.001,
    and then run ORDER BY RAND() only on the subset?
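
A sketch of that two-stage idea in sqlite via Python (data is made up; the modulo stands in for RAND() < 0.01 since sqlite's RANDOM() returns a raw 64-bit integer). The filter still calls the random function once per row, so the table is still scanned; the saving is that the expensive sort only sees the small subset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 10001)])

# Stage 1 keeps roughly 1% of the rows; stage 2 runs ORDER BY RANDOM() only
# on that subset. Caveat: the result can come back short (or even empty) if
# the filter happens to keep fewer than LIMIT rows.
rows = conn.execute("""
    SELECT id FROM (
        SELECT id FROM t WHERE (RANDOM() % 100) = 0
    ) AS subset
    ORDER BY RANDOM() LIMIT 5
""").fetchall()
```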

  • @hannielgutierrez9139
    @hannielgutierrez9139 11 months ago

    If I try to make a virtual field an index, MariaDB tells me this:
    1901 - Function or expression 'concat(`fecha`,' ',`hora`)' cannot be used in the GENERATED ALWAYS AS clause of `fecha_hora`
    Does it work only in MySQL?

  • @___dyego
    @___dyego 9 months ago

    You're an amazing teacher!

  • @AlexCernat
    @AlexCernat 11 months ago

    yeah, deferred joins and covering indexes rule! Great idea using hashes for "randomness"!

  • @Kane0123
    @Kane0123 11 months ago

    If it's fast enough, it's fast enough... relevant in so many places...

  • @daniel_petrica
    @daniel_petrica 11 months ago

    I had a similar problem with my PlanetScale free db, where my bad code hammered the rows read in the first month with ORDER BY RAND().
    I ended up with a less random solution using cache and code.
    In PHP I generate a random int within the table's row count (so I don't care about id gaps), and I get n rows from the db with that int as an offset. Then I randomly order them in PHP and return an array. Since I increase the number of rows extracted, I cache that query in Laravel for some amount of time, and on each request I pick from the cache; with a big enough number of rows, the limited randomness is good enough for my application.
    To reduce time on the wire, I only select the columns I need for the use case, and I have an index on those columns.
    The cache can be generated by a worker, to avoid giving a web user a slow request every so often.
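
A minimal Python/sqlite sketch of that random-offset approach, without the Laravel caching layer (the table name and data are made up; note that MySQL still walks past the OFFSET rows internally, which is part of why the cache helps):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(i, f"word{i}") for i in range(1, 501)])

def random_page(n):
    """Fetch n consecutive rows at a random offset, then shuffle app-side."""
    total = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
    offset = random.randint(0, max(0, total - n))  # id gaps don't matter
    rows = conn.execute(
        "SELECT id, word FROM words ORDER BY id LIMIT ? OFFSET ?", (n, offset)
    ).fetchall()
    random.shuffle(rows)  # the n rows are adjacent, so randomize their order
    return rows

page = random_page(10)
```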

  • @IvanMontilla
    @IvanMontilla 11 months ago

    I'm pretty sure someone will implement a random password generator this way, by storing all possible passwords in a table and then letting MySQL do its magic.

  • @mamad-dev
    @mamad-dev 11 months ago

    I have a question: what is the application he's using?

  • @tamasnogradi4524
    @tamasnogradi4524 11 months ago

    This was really interesting, mate! Love your videos; you've filled a huge gap in db topics for a lot of people.

  • @vasiovasio
    @vasiovasio 11 months ago

    Super Useful Video! Thank you, Aaron!

  • @ahmad-murery
    @ahmad-murery 11 months ago

    I liked the bucket method,
    Thanks Aaron!

  • @Voltra_
    @Voltra_ 11 months ago

    Reservoir Sampling. 'nough said

    • @PlanetScale
      @PlanetScale  11 months ago +1

      Perhaps not quite enough, I've never heard of Reservoir Sampling! Will have to look that up.

    • @Voltra_
      @Voltra_ 11 months ago

      @@PlanetScale here's the cppcon talk I've learned about it from: ruclips.net/video/YA-nB2wjVcI/видео.htmlsi=H40wG1Iha9yOygHZ (I resisted the urge to make it a Rick roll)
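
For anyone else who hadn't heard of it: reservoir sampling ("Algorithm R") draws a uniform sample of k items in a single pass over a stream of unknown length, using O(k) memory. A minimal Python sketch:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform k-sample of a stream without knowing
    its length in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = random.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(100_000), 5)
```

It's a natural fit when the rows stream through the application anyway, though for this video's problem it would still mean reading every row once.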

  • @youtubehotdogdogyoutube
    @youtubehotdogdogyoutube 11 months ago

    Yeah, maybe in your SQL

  • @RayHorn5128088056
    @RayHorn5128088056 11 months ago

    Redis the rows and choose randomly from cache. Problem solved.

    • @PlanetScale
      @PlanetScale  11 months ago

      Redis the rows? I'm not sure what that means

    • @RayHorn5128088056
      @RayHorn5128088056 11 months ago

      @@PlanetScale Cache = Performance. Kudos on doing all that work to get your database to cough up the data but caching is always faster.

    • @PlanetScale
      @PlanetScale  11 months ago

      I understand that 🤗 But are you saying store all the rows in Redis? "Redis the rows" is unclear to me

    • @RayHorn5128088056
      @RayHorn5128088056 11 months ago

      @@PlanetScale I am always saying cache just enough data to speed up the process. This should always be the goal. Trust me, there is an optimal solution involving a cache.