I loaded 100,000,000 rows into MySQL (fast)

  • Published: 22 Nov 2024

Comments • 165

  • @bozgi
    @bozgi 9 months ago +91

    Didn't even realise that this is a company account. Loved that video

  • @tcarreira
    @tcarreira 9 months ago +168

    You're doing it right, PlanetScale: promoting your brand in the background while giving real value to the people watching these videos. Congrats.
    Even having different presenters is well done (I got used to Aaron though :D)

    • @PlanetScale
      @PlanetScale  9 months ago +18

      We could all use some more Aaron.

    • @TrondHus
      @TrondHus 9 months ago

      Every company should have an Aaron, @PlanetScale

  • @dudaseifert
    @dudaseifert 9 months ago +170

    PlanetScale is LITERALLY the only company I'm subbed to on YouTube. Your content is THAT good

    • @darshanrander1950
      @darshanrander1950 9 months ago

      So true!!

    • @electrolyteorb
      @electrolyteorb 9 months ago +1

      But not the platform itself

    • @dudaseifert
      @dudaseifert 9 months ago

      @@electrolyteorb what do you mean?

    • @8starsAND
      @8starsAND 9 months ago

      Yeah, especially loading ready-to-go Python packages… Yeah, rocket science…

    • @Frank2212
      @Frank2212 8 months ago

      Time to unsubscribe. They laid him off. He's the only reason the videos are good

  • @69k_gold
    @69k_gold 9 months ago +21

    Using the "teach a man how to fish" rule in product marketing is one hell of a genius idea. What a win-win!

  • @ahmettek315
    @ahmettek315 9 months ago +4

    Update: I also tried this on a Mac Mini M1. This time, instead of using all those 17 columns, I did as you did and only imported 6 of them. For PostgreSQL, it took 4 m 39 s 810 ms this time :) Almost half of your MySQL timing (and again, this is in a Docker container).
    I knew that PostgreSQL should be better. I fell in love with PostgreSQL once more. Thanks.
    Ah, BTW, maybe this should be considered cheating: I use the open-source PostgreSQL extension Citus (but a single node in a Docker container - it would be insane with multiple nodes).
    (I guess PlanetScale is also using the Go project Vitess and should be fast on multiple nodes)

    • @PlanetScale
      @PlanetScale  9 months ago +3

      Good experimentation! This was loaded into vanilla MySQL, and you're right that the performance is very dependent on hardware, which DBMS you're using, etc.

  • @robbyface3707
    @robbyface3707 9 months ago +21

    Ben! My past lecturer at the UofA, he is an amazing teacher so I’m glad I can still learn from him 🎉

  • @miguel-espinoza
    @miguel-espinoza 9 months ago +7

    as a frequent user of that same lichess database, I would really appreciate that C program you wrote :)

  • @HoverCatzMeow
    @HoverCatzMeow 9 months ago +5

    Love that you kept errors/issues in the video! Shows that even professionals make simple mistakes :)

  • @Richard_GIS
    @Richard_GIS 9 months ago +1

    8:42 That's totally understandable why it takes forever: each ';' is a commit, and that takes processing power. Best would be a bulk load of the data, but the direct loading is great too.
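
    A minimal sketch of the point above, assuming a hypothetical games table and MySQL's default autocommit: grouping many single-row INSERTs inside one explicit transaction commits once at the end instead of after every statement.

      -- With autocommit on, each standalone INSERT is its own commit.
      -- Wrapping a batch in a transaction defers the commit until the end.
      START TRANSACTION;
      INSERT INTO games (white, black, result) VALUES ('playerA', 'playerB', '1-0');
      INSERT INTO games (white, black, result) VALUES ('playerC', 'playerD', '0-1');
      -- ... many more single-row INSERTs ...
      COMMIT;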

  • @DoubleM55
    @DoubleM55 9 months ago +15

    09:23 - Am I the only one who noticed that you got an error at row one, without even importing any data? :)
    Anyway, awesome video :) I always get excited when I see a new one uploaded :)

    • @PlanetScale
      @PlanetScale  9 months ago +1

      Good catch! @dloura07 pointed that out also. Using -f to continue after errors causes it to take even longer! Definitely need a better solution.

  • @athar-q
    @athar-q 9 months ago +1

    I have used SQL Loader in one of the pipelines I developed for the company I work for. The processing time was reduced from 30 minutes to 2 minutes.

  • @codewithantonio
    @codewithantonio 9 months ago +27

    Aaron, you look different!
    jk, always love to see new PlanetScale content! ❤

    • @muhamadrafipamungkas4465
      @muhamadrafipamungkas4465 9 months ago +2

      Always love your courses, planning to do your Miro clone this month

    • @ahmad-murery
      @ahmad-murery 9 months ago

      Yes, I think he got the flu 😁

  • @UTUBMRBUZZ
    @UTUBMRBUZZ 9 months ago +1

    I had the same problem but with 300 million rows. The real pain was hidden characters in the data and commas in the wrong places, so importing that data into the database took over 6 hours and I had to run the script overnight. To optimize it, we compared the changes against the first import, so the second and further imports went through in 1-5 minutes.

  • @mirzashakirbaig1950
    @mirzashakirbaig1950 9 months ago

    It's really awesome.
    Recently I tried importing a CSV into a MySQL database (using Laravel and Excel libraries) which only had a few million records. It was taking 5-7 minutes even after a lot of the improvements I made.
    Using multiple tools and technologies to get the data processed is the main ingredient here. Will give your ideas a try with my import.

  • @Arkounay
    @Arkounay 9 months ago +3

    Great video, I didn't know about the "load data local infile" SQL command 👍
    I would have loved a comparison with an INSERT that inserts multiple rows at once instead of one insert per game, to see if it's faster or slower than LOAD DATA LOCAL INFILE
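
    For reference, a rough sketch of the bulk-load statement mentioned above; the table name, column list, and file path are assumptions, and LOAD DATA LOCAL INFILE requires local_infile to be enabled on both the client and the server. The multi-row INSERT alternative is sketched a few comments further down.

      -- Bulk load a client-side CSV file in a single statement.
      LOAD DATA LOCAL INFILE '/tmp/games.csv'
      INTO TABLE games
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      LINES TERMINATED BY '\n'
      (event, white, black, result, white_elo, black_elo);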

  • @thejonte
    @thejonte 9 months ago +10

    Please use the torrent next time. That'll save them bandwidth and probably also be faster.

    • @I_SEE_RED
      @I_SEE_RED 9 months ago +1

      🤓

    • @thejonte
      @thejonte 9 months ago

      Thanks for calling me smart. Y'know, the nerds are the ones who are successful in this field, @I_SEE_RED

  • @gemmaatroxxibox5322
    @gemmaatroxxibox5322 9 months ago

    I was thinking it must be LOAD DATA INFILE, because that has saved me many a time. But the preprocessing in C has made me think I really need to learn that, or Rust or Zig. Very impressive.

  • @MrPudgyChicken
    @MrPudgyChicken 9 months ago +15

    I would be curious to see the performance difference between multiple independent INSERT statements and one INSERT statement with many values. In my experience, the latter is much, much faster, though I have never tried importing a dataset quite this large.

    • @reed6514
      @reed6514 9 months ago

      Same experience here.
      On another note, server-side, I've had to chunk my inserts into a thousand or so rows at a time because my production DB is a different server than my web server, and it only lets me send so much data at once.
      I also wonder how a transaction would affect the speed of many separate inserts.

    • @PlanetScale
      @PlanetScale  9 months ago +2

      Great idea! That's exactly what mysqldump does when exporting a db. I didn't show a performance comparison, but let me know what you find :)

    • @ToX82
      @ToX82 9 months ago +1

      I did that for an import tool that inserts ~50,000 rows each time. I wrote that tool a few years ago so the numbers could be slightly different, but according to my benchmarks (and my memory), using one insert per 800-1000 records is the fastest option, and it takes about 5-6 seconds. Using one insert per line takes ~30 seconds. Going above 800-1000 records per insert again turned out to be slower.
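
    A small sketch of the batching idea discussed in this thread, with a made-up table and values: one INSERT statement carries many row tuples, and a loader issues such statements repeatedly in chunks of roughly 800-1000 rows.

      -- One statement per chunk instead of one statement per row.
      INSERT INTO games (white, black, result)
      VALUES ('p001', 'p002', '1-0'),
             ('p003', 'p004', '1/2-1/2'),
             -- ... up to ~1000 tuples per statement, then start the next INSERT ...
             ('p999', 'p998', '0-1');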

  • @sounakmondal3342
    @sounakmondal3342 9 months ago +1

    Would really love a playlist dedicated to performance tips

  • @matrizaev
    @matrizaev 9 months ago +1

    I would rather have made that first C converter load the data into MySQL. This way you could use your PGN file as a SoT (source of truth) and load it into both your local MySQL and PlanetScale.

  • @Klinoklaz
    @Klinoklaz 9 months ago +1

    Thought you were going to use awk to process that huge file. I sometimes use it to parse gigantic XML files; it's much easier to write than C. Feels like it shouldn't be too slow in your use case either.

  • @Tjkrusinski
    @Tjkrusinski 9 months ago +5

    Very good. My brutal feedback is that Aaron sometimes has too many opinions sprinkled into his videos, whereas this fella has almost none. A middle ground would be, like, wow, chef's kiss.

  • @IvanRandomDude
    @IvanRandomDude 9 months ago +18

    100k inserts in 1.7s. Yep, your app with 5 active users and 1000 inserts per day definitely needs Cassandra.

    • @llothar68
      @llothar68 9 months ago +7

      But what happens when we go viral and become as rich and huge as Google? We have to be prepared for this. More microservices, more distributed systems please, and we need a registration server for the 3 users that logged in yesterday.

    • @IvanRandomDude
      @IvanRandomDude 9 months ago +6

      @llothar68 We need a separate instance for each user, to make sure it scales.

    • @indramal
      @indramal 9 months ago

      What are you trying to tell us? "Your app with 5 active users and 1000 inserts per day definitely needs Cassandra."
      Are concurrent connections slow in this MySQL database?
      5 x 1000 = 5,000 rows per day, so why would we need a Cassandra DB for such small inserts?

    • @llothar68
      @llothar68 9 months ago +5

      @indramal You did not get the joke. Or the terrible situation the web development world is in at the moment.

    • @AlexeyKlimko
      @AlexeyKlimko 9 months ago

      The thing is that it failed after 21000 rows…

  • @tjnangosha
    @tjnangosha 9 months ago

    This video is so satisfying to watch.

  • @abimbolaobadare6691
    @abimbolaobadare6691 9 months ago +1

    Bro this is wild, the C script got me questioning my life choices. Can we train an AI model with this chess data?

  • @alexrusin
    @alexrusin 9 months ago +2

    Great video, as always. Thank you. I've got a question: why did you use an EC2 instance to load data into PlanetScale? Couldn't you do it directly from your local Mac?

    • @PlanetScale
      @PlanetScale  9 months ago +4

      Good question. Mostly for the sake of improved network performance (which can be a bottleneck in instances like this). Home internet often has very slow upload speeds :)

  • @scottstroz
    @scottstroz 9 months ago +1

    I am curious to know if there would be any performance improvements in the dump and load process if you used multi-threaded dump and load methods of MySQL Shell instead of mysqldump.

  • @regmiller536
    @regmiller536 9 months ago

    I love these kinds of videos, they are so educational.

  • @avramcs
    @avramcs 9 months ago +1

    what a surprise lol, great vid

  • @Mcs1v
    @Mcs1v 9 months ago +1

    "Load data" isn't faster if you write your loader properly, so you just won a running competition against a disabled opponent

  • @igboanugwocollins4452
    @igboanugwocollins4452 9 months ago

    This is phenomenal, thank you very much for this masterpiece

  • @NuncNuncNuncNunc
    @NuncNuncNuncNunc 9 months ago

    In general, should you run multiple tests with different numbers of rows in order to measure any startup time? Should the first test be using MySQL's built-in LOAD DATA INFILE, i.e. should the baseline be the zero-effort case?

  • @G5STU
    @G5STU 9 months ago +1

    Great vid, loving the C, always use the right tool for the job (not what's cool) 👍

  • @ignauy
    @ignauy 9 months ago

    Loved the video. I just have a question: couldn't you have established a connection to PlanetScale even before the dump?
    When you run `mysql -u root chess < inserts.sql`, couldn't you have directly imported that into PlanetScale, like you imported the dump on the EC2 instance?

    • @PlanetScale
      @PlanetScale  9 months ago

      Yes! A connection to the PlanetScale DB could have been established directly from my computer instead of from the EC2 instance. The advantage of the EC2 instance is that, if it is in the same region as the database, it can have better network bandwidth to the server.

  • @Pici3k
    @Pici3k 9 months ago

    Learned a ton from this. Thanks!

  • @permad5189
    @permad5189 9 months ago

    Coolio neato! The only thing that's a problem is the MySQL pronunciation; even the official MySQL docs say the correct way is "My S-Q-L"!

  • @ropoxdev
    @ropoxdev 9 months ago +4

    Aaron where!?

  • @danielgangl6560
    @danielgangl6560 9 months ago +2

    How long should it take to load data into PlanetScale? My application currently takes ~1 second for 1000 (one thousand) rows in a single Prisma createMany statement.
    (1000 entries into 1 table)
    Adding more or fewer seems to scale linearly.
    That seems really slow to me. Wondering if there are some benchmarks to compare against, to see if I am doing something wrong or if it's normal.

    • @PlanetScale
      @PlanetScale  9 months ago +1

      Depends on a bunch of factors including the PlanetScale plan you are on, network connectivity / proximity to your DB, how Prisma's createMany function is implemented, etc. What's your setup?

    • @danielgangl6560
      @danielgangl6560 9 months ago

      @PlanetScale I'm only interested in the query latency, so connectivity and all that should be outside of that. My createMany simply gets an array with 1000 objects, each with 10-15 column values. The plan is Scaler (but does it make a difference in this case vs. e.g. the free plan? From the pricing page it only looks like the upper quota would become an issue later, but otherwise I don't see what would affect it).

  • @reed6514
    @reed6514 9 months ago

    I like that you do everything in the terminal

  • @adityanandula
    @adityanandula 9 months ago

    This is interesting! When I tried moving the same number of records from one SQL instance to another in AWS over the network, it took me 6 hours. I have a doubt: what about the binlog data? Can we apply the same method with the backup SQL data chunks?

  • @wes337
    @wes337 9 months ago +1

    Great video! Would actually love to see the code for that C program you wrote to convert to CSV

    • @PlanetScale
      @PlanetScale  9 months ago

      The pause button is your friend here ⏸

    • @phpnotasp
      @phpnotasp 9 months ago

      @PlanetScale Sarcastic comments aside, even going frame-by-frame in your video, you cannot see all the code.

  • @thegrumpydeveloper
    @thegrumpydeveloper 9 months ago +1

    Curious why the upload went to EC2 rather than direct from local. Is that just so it can run while you do other things, and to take advantage of the network? Otherwise it seems similar to uploading direct rather than uploading, then uploading again.

    • @musashigarami
      @musashigarami 9 months ago +1

      Came to the comments for this; I was also wondering, as surely you'd spend time uploading first to EC2 and then upload again from there... but doesn't the first upload to EC2 negate the benefits? Really would like to know the reasoning behind it.

  • @evko9264
    @evko9264 9 months ago

    What should we do if we have to use Python, S3 buckets and AWS Glue just to do pretty much the same thing: convert some txt file to CSV? The problem is that we need it FAST :)

  • @0xtz_
    @0xtz_ 9 months ago

    man, the best channel ever 😎, cool videos as always and a ton of information

  • @AndrzejPauli
    @AndrzejPauli 9 months ago +1

    Where's the "Seee yyaa"??? ... ?? 😀
    Naa, just kidding. Great content!

  • @equus6460
    @equus6460 9 months ago

    I was able to successfully load 100 million records into an SQLite3 database; it's about 20 GB. Would be interested to know how big the 100-million-row MySQL database was.

    • @PlanetScale
      @PlanetScale  9 months ago +1

      The chess DB in MySQL was around 10.7 gigs after running the tests and importing the big data set. The dump was ~13.

    • @equus6460
      @equus6460 9 months ago

      @PlanetScale Was planning to scale to 1 billion rows but had issues with my SSD space. Guess MySQL would scale up to 1 billion rows.
      Was impressed with SQLite3's ability to scale up with ease to 100 million records.
      The administration overhead for SQLite3 is much less compared to other databases such as Postgres, MySQL and Oracle; that's my opinion.

  • @OscarFromIndia
    @OscarFromIndia 9 months ago

    How can someone learn all the stuff that you're doing here?? Can you share some concepts to get started?

  • @rombael8152
    @rombael8152 9 months ago

    Hardware? Intel Pentium III 500 MHz?

  • @gefyaqiilah6678
    @gefyaqiilah6678 9 months ago

    wow, thank you for sharing

  • @felixisaac
    @felixisaac 9 months ago +1

    Just curious, how did you send the dump file to your AWS instance? Is it through FTP?

    • @PlanetScale
      @PlanetScale  9 months ago

      Transferred via scp!

    • @felixisaac
      @felixisaac 9 months ago

      Thank you! @PlanetScale

  • @vishwanathbondugula4593
    @vishwanathbondugula4593 9 months ago

    Can you share the C code you wrote to parse PGN to CSV?

  • @abdelkadermh
    @abdelkadermh 9 months ago

    Can you please make a video on using awk or sed to transform the data and show how fast it can be?

  • @bilalthepunjabi
    @bilalthepunjabi 9 months ago

    I automated this whole process of downloading, extracting, schema building and migrating into the database in a Laravel project.
    I initially chose Python for this task, but it was slower, which amazed me too. So PHP, with its short lifespan per run and chunking with recursive calls, makes all this happen with real-time updates over a socket...!

    • @PlanetScale
      @PlanetScale  9 months ago

      Cool!

    • @datamini01
      @datamini01 9 months ago

      Hello bilalthepunjabi,
      Could you please share the source code you implemented with PHP (Laravel)? I would greatly appreciate it. Currently, I am also exploring some solutions to accomplish this task with PHP, but I think it may not be as fast as Python.

  • @Mesinko6
    @Mesinko6 9 months ago +1

    Maybe using batch inserts would have taken less time

  • @카라멜땅콩-p2p
    @카라멜땅콩-p2p 9 months ago

    good topic

  • @amranimohamad4741
    @amranimohamad4741 9 months ago

    What about Mojo? Isn't it supposed to make Python more efficient?

  • @gfeher123
    @gfeher123 9 months ago

    Finished watching when you inserted records one by one

  • @tigranrostomyan9231
    @tigranrostomyan9231 9 months ago +2

    Did you just load a 200+ GB file in 1 second with vim..? (2:16)

    • @PlanetScale
      @PlanetScale  9 months ago +1

      In vim, you can control-c to cut loading short. The more you know!

  • @renjithvk492
    @renjithvk492 3 months ago

    How did you get the 'white_elo' value inserted into the 'while_elo' column? 😊

    • @PlanetScale
      @PlanetScale  3 months ago

      Someone is paying attention! It got fixed off screen.

  • @0xmassive526
    @0xmassive526 9 months ago

    Video so good I didn't even need to watch this, but still did. 😅

  • @DavisTibbz
    @DavisTibbz 9 months ago

    Wow impressive. Came here from Google News, this video was featured 😅

  • @Wald246
    @Wald246 9 months ago

    15:59 I'm always baffled that MySQL does not come with an option to import and export data using a binary file format and instead forces the use of a plaintext SQL file. Seems so inefficient
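
    As a hedged aside on the comment above: MySQL's built-in export formats are indeed text-based, but SELECT ... INTO OUTFILE writes a delimited data file (rather than INSERT statements) that LOAD DATA INFILE can re-import. Table names and paths below are assumptions, and the output location is typically restricted by secure_file_priv.

      -- Export rows as a tab-separated data file on the server...
      SELECT * FROM games
      INTO OUTFILE '/var/lib/mysql-files/games.tsv'
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n';

      -- ...and re-import it without parsing any INSERT statements.
      LOAD DATA INFILE '/var/lib/mysql-files/games.tsv'
      INTO TABLE games_copy
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n';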

  • @jeetchheda8916
    @jeetchheda8916 9 months ago

    please tell me this load script can be developed in PostgreSQL
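
    For the PostgreSQL question above, a minimal sketch assuming a comparable table: the rough equivalent of MySQL's LOAD DATA is COPY (or psql's client-side \copy).

      -- PostgreSQL: bulk load a CSV into a table in one statement.
      -- COPY reads a server-side path; use \copy in psql for a client-side file.
      COPY games (event, white, black, result)
      FROM '/tmp/games.csv'
      WITH (FORMAT csv);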

  • @januzi2
    @januzi2 9 months ago +1

    You don't have to use select count(*). You could check the schema for the next auto-increment value or just ask for the id of the last record: select id from chessgame order by id desc limit 1; It will use the index and will be a lot faster than a whole table scan.
    It will return the correct number under these conditions: 1. the table has an auto-increment column, 2. nothing was deleted, 3. the query doesn't contain any "where".
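
    A compact sketch of the two approaches in the comment above (the chessgame table name is the commenter's example, not necessarily the video's schema):

      -- Exact count: InnoDB walks an index over the whole table, slow at 100M rows.
      SELECT COUNT(*) FROM chessgame;

      -- Last auto-increment id: a single index lookup, effectively instant, but it only
      -- matches the row count if no rows were ever deleted or rolled back.
      SELECT id FROM chessgame ORDER BY id DESC LIMIT 1;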

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 9 months ago

    07:38 Where are the mystery numbers (64, 256, 32) coming from?

    • @lian1238
      @lian1238 9 months ago

      I think the numbers could be higher in a real application

    • @tcarreira
      @tcarreira 9 months ago

      When creating a MySQL table and using VARCHAR data types with specific sizes like 32, 64, or 256, you are essentially defining the maximum length of the character data that can be stored in that column.
      The specific numbers like 32, 64, and 256 are often used in VARCHAR column definitions for several reasons, but they are not rigidly fixed; you can choose other sizes based on your specific data requirements.
      Here are some reasons behind the common use of these numbers:
      - Powers of 2: 32, 64, and 256 are powers of 2 (2^5, 2^6, and 2^8), which aligns well with the binary storage and memory management systems used in computer hardware. This alignment can make data manipulation and storage more efficient in some cases.
      - Common Data Sizes: These sizes often align with common data sizes in various contexts. For example, 32 characters might be chosen for things like usernames, which are often limited in length. 64 and 256 characters cover a broad range of data sizes, accommodating longer strings or descriptions while still keeping data within reasonable limits.
      - Historical Precedence: These sizes have been used in database design for a long time, and they have become somewhat standard. Database administrators and developers are accustomed to seeing these sizes in schemas.
      - Practicality: They are practical sizes that work well for many types of data. Using numbers like 32, 64, and 256 simplifies decisions about column sizes because they are easy to work with and understand.
      That said, there's no requirement to use these specific sizes. Your choice should always be based on your specific data needs. If your data requirements don't align with these sizes, it's perfectly fine to choose different sizes that better suit your use case.

    • @PlanetScale
      @PlanetScale  9 months ago +1

      Just "reasonable" values chosen for this example. For a production application, would want to validate that those are good values more thoroughly.

  • @benjaminshtark5977
    @benjaminshtark5977 9 months ago

    It's all good; the only part not usable by most people is writing that C code to convert to CSV :)
    Most people are not that good at classical C programming...

  • @bowenwang
    @bowenwang 9 months ago

    Can you provide a link to your C code that converts PGN to CSV?

  • @dimitar.bogdanov
    @dimitar.bogdanov 9 months ago +2

    Question - why not torrent when downloading the set?

  • @CaboLabsHealthInformatics
    @CaboLabsHealthInformatics 9 months ago

    Did you really decompress and stdout the whole 200GB to the terminal?

    • @sudeshryan8707
      @sudeshryan8707 9 months ago

      I was more impressed that vim was able to open a 200 GB file in no time... while I'm struggling with 300 MB text files in Notepad++ 🤣🤣

    • @CaboLabsHealthInformatics
      @CaboLabsHealthInformatics 9 months ago

      @sudeshryan8707 please avoid sending the whole thing to stdout :)

  • @sudeshryan8707
    @sudeshryan8707 9 months ago

    I was more impressed that vim was able to open a 200 GB file in no time 😅😅

    • @PlanetScale
      @PlanetScale  9 months ago

      CTRL-C is your friend :)

  • @NaourassDerouichi
    @NaourassDerouichi 9 months ago +1

    Wait, have you just loaded a 200-gig text file into the editor without it catching fire?!

    • @Shuroii
      @Shuroii 9 months ago +2

      The art of vim chunked loading

  • @gokukakarot6323
    @gokukakarot6323 3 months ago

    Where is the C code?

  • @vardmans
    @vardmans 9 months ago

    Would you mind sharing that C script?

  • @bitwise2832
    @bitwise2832 9 months ago +1

    FIELDS are not FILEDS.
    To run meaningful queries you need some more indices.
    id alone is limited.
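
    A hedged example of the kind of secondary indexes the comment points at; the column names assume a chess-games schema like the video's and are illustrative only.

      -- Secondary indexes so common filters don't require a full table scan.
      ALTER TABLE games
          ADD INDEX idx_white_elo (white_elo),
          ADD INDEX idx_result (result);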

  • @spacetabs
    @spacetabs 9 months ago +3

    True definition of BigData

    • @DoubleM55
      @DoubleM55 9 months ago +8

      Not even close :)

    • @IvanRandomDude
      @IvanRandomDude 9 months ago

      30 gigs is nothing in big data terms.

  • @samislam2746
    @samislam2746 9 months ago

    Next video: Do this in MongoDB

  • @ahmettek315
    @ahmettek315 9 months ago +2

    Well, that was fast for 100 million rows, but you didn't tell us about the hardware you are using. From the video I can only tell that you are using a Mac; I guess it has an M-series CPU and a decent drive. Likely faster than my AMD Ryzen 9 3900X, 32 GB RAM, Corsair MP600 1 TB NVMe.
    I did what you did, converted that PGN to CSV (using Go, all 17 columns), and imported all 17 columns into PostgreSQL. 100,023,791 rows affected in 8 m 27 s 347 ms. This was done in a Docker container, which I think made it a little bit slower.
    Interestingly, you wait over 23 seconds just for a count(*)? That was really slow. It took 3.7 seconds for me on PostgreSQL. I wonder what would happen if I did this on my Mac mini M1. I named my table "speedo", and this query:
    select count(*) from Speedo where Termination = 'Time forfeit';
    returns 31,471,426 in 4.7 seconds. Can you also time this on that MySQL? Just curious.

  • @TheMainManNorthOfEquator
    @TheMainManNorthOfEquator 9 months ago

    Will try with Postgres 16

  • @towhidulislam6192
    @towhidulislam6192 9 months ago +1

    where's AARON?

  • @AlexeyKlimko
    @AlexeyKlimko 9 months ago

    9:23 The query failed; the 100,000 rows were not inserted. Read all of the output.

  • @syntaxerorr
    @syntaxerorr 9 months ago

    Did vim just open a 214 gigabyte file in seconds? I think I've opened a 12 gig file before, not 214!

    • @PlanetScale
      @PlanetScale  9 months ago +1

      Tip: You can CTRL-C as vim is in the process of loading a file to cut it short, allowing you to view part of the file!

  • @teenigma_
    @teenigma_ 9 months ago +4

    I see that while_elo 😉

  • @dloura07
    @dloura07 9 months ago +2

    Did the multiple insert statements actually work at 9:22, or did it take 1.7s in total because it failed, so it would actually take longer if it worked?

    • @lian1238
      @lian1238 9 months ago

      Yes. Seems like the error cut the operation short

    • @PlanetScale
      @PlanetScale  9 months ago +1

      Nice catch. You're right that the error cut it short. Using -f to continue after errors causes it to take even longer! All the more reason to use LOAD DATA LOCAL INFILE 😄

  • @MrGilbert1995
    @MrGilbert1995 9 months ago

    How do you make the query run faster? 22 sec for a count is long :P

  • @vahanmkrtchyan2504
    @vahanmkrtchyan2504 9 months ago

    Bruh, we have terabytes in MySQL and all is good. Just don't do aggregations; use a warehouse for that.

  • @LewisCowles
    @LewisCowles 9 months ago +1

    Where is Aaron?

  • @chiebidoluchinaemerem5860
    @chiebidoluchinaemerem5860 9 months ago

    Man, Python is really slo............w

  • @KiamatChange-ke8nd
    @KiamatChange-ke8nd 9 months ago

    200 GB of text data? WTH is that? 😳

  • @mbe2916
    @mbe2916 9 months ago

    Python can be fast if you stay in the C world.

  • @patlecat
    @patlecat 9 months ago

    Dude, you had all the data loaded into that C program. It would have been super easy and fast to then write it into MySQL directly from there! Why go through all those complicated detours and persist in using Python? LOL

  • @hqcart1
    @hqcart1 9 months ago

    dude, you could have asked ChatGPT and it's done

  • @NedumEze
    @NedumEze 7 months ago

    Surely, Sir, you're not running the same MySQL version as the rest of us who have phpMyAdmin's integrated version? Or are you?
    If you are, how is it that you can have 100M rows in one table and I can't have even half a million in my version in phpMyAdmin?

  • @peacekeeper7968
    @peacekeeper7968 9 months ago +1

    You discovered that C is faster than Python hahahahahahhahahahhahah

  • @alquinn8576
    @alquinn8576 9 months ago

    Why use MySQL when you could have loaded it into 100 LINKED ACCESS DATABASES?

  • @kodosensei
    @kodosensei 9 months ago

    👀

  • @robertgarrison1738
    @robertgarrison1738 9 months ago

    while_elo instead of white_elo lmao

  • @zrny
    @zrny 9 months ago

    My"SEQUEL"? I'm not watching this LOL :D

  • @HeenDee
    @HeenDee 9 months ago

    Please... PGN to CSV converter C source....... Python very slow.... ㅜ_ㅜ

  • @asdfaf343r4
    @asdfaf343r4 9 months ago

    skill issues

  • @mimbaradinugraha
    @mimbaradinugraha 9 months ago

    subscribed because he is using vim