I loaded 100,000,000 rows into MySQL (fast)
- Published Nov 25, 2024
- Watch me wrangle over 100,000,000 rows of data and load it into MySQL.
To learn more about PlanetScale, head to planetscale.co...!
💬 Follow PlanetScale on social media
• Twitter: / planetscale
• Discord: / discord
• TikTok: / planetscale
• Twitch: / planetscale
• LinkedIn: / planetscale
You are doing it right, PlanetScale. Promoting your brand in the background while you give real value to the people watching these videos. Congrats.
Even letting different people present is well done (I got used to Aaron though :D)
We could all use some more Aaron.
Every company should have an Aaron, @@PlanetScale
PlanetScale is LITERALLY the only company i'm subbed to on youtube. your content is THAT good
So true!!
But not the platform itself
@@electrolyteorb what do you mean?
Yeah, especially loading ready to go python packages…. Yeah, rocket science…
Time to unsubscribe. They laid him off. He's the only reason the videos are good
Didn't even realise that this is a company account. Loved that video
Using the "teach a man how to fish" rule in product marketing is one hell of a genius idea. What a win-win!
Update: I also tried this on Mac Mini M1. This time instead of using all those 17 columns, I did as you did and only imported 6 of them. For postgreSQL, it took 4 m 39 s 810 ms this time :) Almost half of your MySQL timing (and again this is in docker container).
I knew that postgreSQL should be better. I fell in love with postgreSQL once more. Thanks.
Ah, BTW, maybe this should be considered cheating: I use the open source PostgreSQL extension Citus (but single node in a docker container - it would be insane with multiple nodes).
(I guess PlanetScale is also using the Go project Vitess and should be fast on multiple nodes)
Good experimentation! This was loaded into vanilla MySQL, and you're right that the performance is very dependent on hardware, which DBMS you're using, etc.
Ben! My past lecturer at the UofA, he is an amazing teacher so I’m glad I can still learn from him 🎉
09:23 - Am I the only one who noticed that you got an error at row one, without even importing any data? :)
Anyway, awesome video :) I always get excited when I see new one uploaded :)
Good catch! @dloura07 pointed that out also. Using -f to continue after errors causes it to take even longer! Definitely need a better solution.
Aaron, you look different!
jk, always love to see new PlanetScale content! ❤
Always love your courses, planning to do your Miro clone this month
Yes, I think he got the flu 😁
As a frequent user of that same lichess database, I would really appreciate that C program you wrote :)
I would be curious to see the performance difference between multiple independent INSERT statements and one INSERT statement with many values. In my experience, it is much, much faster, though I have never tried importing a dataset quite this large.
Same experience here.
On another note, server-side, I've had to chunk my inserts into a thousand or so rows at a time because my production db is on a different server than my web server, and it only lets me send so much data at once.
I also wonder how a transaction would affect speed of many separate inserts.
Great idea! That's exactly what mysqldump does when exporting a db. I didn't show a performance comparison, but let me know what you find :)
I did that for an import tool that inserts ~50,000 rows each time. I wrote that tool a few years ago so the numbers could be slightly different, but according to my benchmarks (and my memory), using one insert per 800-1000 records is the fastest option, and it takes about 5-6 seconds. Using one insert per line takes ~30 seconds. More than 800-1000 records per insert again turned out to be slower.
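For anyone who wants to try the comparison: a multi-row INSERT is just one statement with many VALUES tuples, roughly like the sketch below. The table and column names here are made up for illustration; they are not the schema from the video.

  -- one statement inserting several rows at once (hypothetical 'games' table)
  INSERT INTO games (white, black, result, white_elo, black_elo)
  VALUES
    ('playerA', 'playerB', '1-0', 1830, 1790),
    ('playerC', 'playerD', '0-1', 2100, 2055),
    ('playerE', 'playerF', '1/2-1/2', 1500, 1510);

This is also the shape of mysqldump's extended-insert output, which is part of why reloading a dump goes so much faster than running one statement per row.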
8:42 That's totally understandable why it takes forever: each ';' is a commit, and that takes processing power. Best would be a bulk load of the data, but the direct loading is great too.
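On the commit point: if the data has to stay as individual INSERT statements, another low-effort option is to wrap them in one explicit transaction so there is only a single commit at the end. A minimal sketch, again with made-up table and column names:

  -- many single-row inserts, one commit
  START TRANSACTION;
  INSERT INTO games (white, black, result) VALUES ('playerA', 'playerB', '1-0');
  INSERT INTO games (white, black, result) VALUES ('playerC', 'playerD', '0-1');
  -- ...thousands more inserts...
  COMMIT;

It won't match a bulk load, but it removes the per-statement commit overhead described above.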
Love that you kept errors/issues in the video! Shows that even professionals make simple mistakes :)
I have used SQL Loader in one of the pipelines I developed for the company I work for. That process time was reduced from 30 minutes to 2 minutes.
Great video, I didn't know about the "load data local infile" sql command 👍
I would have loved a comparison with an INSERT INTO that inserts multiple rows at once instead of one insert per game, to see if it's faster or slower than LOAD DATA LOCAL INFILE
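For readers who haven't seen it before, a LOAD DATA LOCAL INFILE statement generally looks like the sketch below. The file path, table name, and column list are assumptions for illustration, not the exact command from the video, and the LOCAL option has to be enabled on both the server (local_infile) and the client (--local-infile).

  -- bulk-load a CSV straight into a table (hypothetical names and path)
  LOAD DATA LOCAL INFILE '/path/to/games.csv'
  INTO TABLE games
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 1 LINES
  (event, white, black, result, white_elo, black_elo);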
I had the same problem but with 300 million rows. The real pain was hidden characters in the data and commas in the wrong places, so importing that data into the database took over 6 hours and I had to run the script overnight. To optimize it, we compared the changes with the first import, so the second and further imports went through in 1-5 minutes.
I am curious to know if there would be any performance improvements in the dump and load process if you used multi-threaded dump and load methods of MySQL Shell instead of mysqldump.
I would rather make that first C converter load the data into MySQL. This way you could use your PGN file as a SoT and load it into both your local MySQL and PlanetScale.
Please use the torrent next time. That'll save them bandwidth and probably also be faster.
🤓
Thanks for calling me smart. Y'know the nerds are the ones successful in this field@@I_SEE_RED
Very good. My brutal feedback is that Aaron sometimes has too many opinions sprinkled into his videos where this fella has almost none. A middle ground would be, like, wow, chef's kiss.
Thought you were going to use awk to process that huge file. I sometimes use it to parse gigantic XML files; it's much easier to write than C. Feels like it shouldn't be too slow in your use case either.
In general, should you run multiple tests with different numbers of rows in order to measure any startup time? Should the first test be using MySQL's built-in LOAD DATA INFILE, i.e. shouldn't the baseline be the zero-effort case?
Great video, as always. Thank you. I got a question: why did you use ec2 instance to load data to Planetscale? Couldn't you do it directly from your local Mac?
Good question. Mostly for the sake of improved network performance (which can be a bottleneck in instances like this). Home internet often has very slow upload speeds :)
It's really awesome.
Recently I tried importing a CSV into a MySQL database (using Laravel and Excel libraries) which only had a few million records. It was taking 5-7 minutes even after a lot of improvements I made.
Using multiple tools and technologies to get the data processed is the main ingredient here. Will give your ideas a try with my import.
Could you share that code?
How long should it take to load data into Planetscale? My application currently takes ~1 second for 1000 (1 thousand) in a single createMany Prisma statement.
(1000 entries into 1 table)
Adding more or less seems to scale linearly.
That seems really slow to me. Wondering if there are some benchmarks for that to compare against, to see if I am doing something wrong or if it's normal.
Depends on a bunch of factors including the PlanetScale plan you are on, network connectivity / proximity to your DB, how Prisma's createMany function is implemented, etc. What's your setup?
@@PlanetScale I'm only interested in the query latency. So connectivity and all that should be outside of that. My createMany simply gets an array with 1000 objects, each with 10-15 column values. Plan is Scaler. (but does it make a difference in this case with e.g. free? from the pricing page it only looks like the upper quantity would later make an issue but otherwise I don't see what would affect it)
100k inserts in 1.7s. Yep, your app with 5 active users and 1000 inserts per day definitely needs Cassandra.
But what happens when we go viral and become as rich and huge as Google? We have to be prepared for this. More microservices, more distributed systems please, and we need a registration server for the 3 users that logged in yesterday -
@@llothar68 We need a separate instance for each user, to make sure it scales.
What are you trying to tell us? "your app with 5 active users and 1000 inserts per day definitely needs Cassandra."
Are concurrent connections slow in this MySQL database?
5 x 1000 = 5,000 rows per day, so why do we need Cassandra for these small inserts?
@@indramal You did not get the joke. Or the terrible situation the web development world is in at the moment.
The thing is that it failed after 21000 rows…
Curious why the upload to EC2 rather than direct from local. Is that just so you can let it run while you do other things, and to take advantage of the network? Otherwise it seems similar to uploading direct rather than uploading and then uploading again.
Came to comments for this, also was wondering, as surely you'd spend time uploading first to ec2? And then upload again from there.. but the first upload to ec2 negates benefits? Really would like to know the reasoning behind it
I was thinking it must be -> Load data infile because that has saved me many a time. But the preprocessing in C has made me think I really need to learn that, or Rust or Zig. Very impressive.
Loved the video. I just have a question, couldn't you just have established a connection to planetscale even before the dump?
When you run `mysql -u root chess < inserts.sql`, couldn't you have directly imported that into PlanetScale, like you did import the dump on the EC2 instance?
Yes! A connection to the PlanetScale DB could have been established directly from my computer instead of from the EC2 instance. The advantage of the EC2 instance is that, if it is in the same region as the database, it can have better network bandwidth to the server.
Bro this is wild, the C script got me questioning my life choices. Can we train an AI model with this chess data?
This is interesting! When I tried the same amount of records from one SQL instance to another in AWS over the network, it took me 6 hours. I do have a doubt: what about the binlog data? Can we apply the same method with the backup SQL data chunks?
Would really love a playlist dedicated to performance tips
What should we do if we have to use Python, S3 buckets, and AWS Glue to do pretty much the same thing: convert some txt file to CSV? The problem is that we need it FAST)
Aaron where!?
👀
Just curious, how did you send the dump file to your AWS instance? Is it through FTP?
Transferred via scp!
Thank you! @@PlanetScale
How can someone learn all the stuff that you're doing here?? Can you share some concepts to get started?
"Load data" isn't faster if you write your loader properly, so you just won a running competition against a disabled opponent
Hardware? Intel Pentium III 500Mhz?
Coolio neato! The only thing that's a problem is the MySQL pronunciation; even the official MySQL docs say the correct way is "my S-Q-L"!
Did you just load a 200+ GB file in 1 second with vim..? (2:16)
In vim, you can control-c to cut loading short. The more you know!
Well that was fast for 100 million rows, but you didn't tell us about the hardware you are using. From the video I can only tell that you are using a Mac; I guess it has an M-series CPU and a decent drive. Likely faster than my AMD Ryzen 9 3900X, 32 GB RAM, Corsair MP600 1 TB NVMe.
I did what you did, converted that pgn to CSV (using Go, all 17 columns), and imported all 17 columns to postgreSQL. 100,023,791 rows affected in 8 m 27 s 347 ms. This was done in a docker container, thus I think it made it a little bit slower.
Interestingly you wait over 23 seconds just for a count(*)???? That was really slow. It took 3.7 seconds for me on postgreSQL. I wonder what would happen if I do this on my Mac mini M1. I named my table "speedo", this query:
select count(*) from Speedo where Termination = 'Time forfeit';
returns 31,471,426 in 4.7 seconds. Can you also time this on that mySQL, just curious.
I love this kind of video, they are so educational.
15:59 I'm always baffled that MySQL does not come with an option to import and export data using a binary file format and instead forces the use of a plaintext SQL file. Seems so inefficient
This video is so satisfying to watch.
Can you share the C code you have written to parse pgn to CSV ?
Great vid, loving the C, always use the right tool for the job (not what's cool) 👍
Question - why not torrent when downloading the set?
I was able to successfully load 100 million records into an SQLite3 database; it's about 20 GB. Would be interested to know how big the 100 million row MySQL database was.
The chess DB in MySQL was around 10.7 gigs after running the tests and importing the big data set. The dump was ~13.
@@PlanetScale Was planning to scale to 1 billion rows but had issues with my SSD space. Guess MySQL would scale up to 1 billion rows..
Was impressed with SQLite3's ability to scale up with ease to 100 million records.
Administration overhead for SQLite3 is much less compared to other databases such as Postgres, MySQL and Oracle, that's my opinion.
Maybe using batch inserts would have taken less time
This is phenomenal, thank you very much for this masterpiece
Can you please make a video of using AWK or sed to transform the data and show how fast it can be?
please tell me this load script can be developed in PostgreSQL
what a surprise lol, great vid
What about Mojo? Isn't it supposed to make Python more efficient?
I automated this whole process of downloading, extracting, schema building, and migrating into the database in a Laravel project.
I initially chose Python for this task, but it was slower, which amazed me too, so PHP with short-lived processes and chunking with recursive calling makes all this happen, with realtime updates over a socket...!
Cool!
Hello bilalthepunjabi,
Could you please share the source code you implemented with PHP (Laravel)? I would greatly appreciate it. Currently, I am also exploring some solutions to accomplish this task with PHP, but I think it may not be as fast as Python.
9:23 query failed, 100000 rows were not inserted. Read all of the output.
Can you provide a link to your C code that converts pgn to csv
How did you get the 'white_elo' value inserted into the 'while_elo' column? 😊
Someone is paying attention! It got fixed off screen.
Great video! Would actually love to see the code for that C program you wrote to convert to CSV
The pause button is your friend here ⏸
@@PlanetScale Sarcastic comments aside, even going frame-by-frame in your video, you cannot see all the code.
Would you mind sharing that C script?
You don't have to use select count(*). You could check the schema for the next autoinc or just ask for the id from the last record: select id from chessgame order by id desc limit 1; It will use the index and will be a lot faster than the whole table scan.
It will return the correct number under the conditions: 1. the table has the autoinc column, 2. nothing was deleted, 3. the query doesn't contain any "where".
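A quick sketch of both alternatives, assuming a hypothetical 'games' table in a 'chess' schema. As noted above, these only match the true row count when the ids are auto-incremented, gap-free, and nothing has been deleted.

  -- uses the primary key index instead of scanning the whole table
  SELECT MAX(id) FROM games;

  -- reads the next auto-increment value from the schema metadata
  -- (in MySQL 8 this can be cached/approximate unless information_schema_stats_expiry is 0)
  SELECT AUTO_INCREMENT - 1 AS approx_rows
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = 'chess' AND TABLE_NAME = 'games';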
where is the c code?
07:38 Where are the mystery numbers (64, 256, 32) coming from?
I think the numbers could be higher in a real application
When creating a MySQL table and using VARCHAR data types with specific sizes like 32, 64, or 256, you are essentially defining the maximum length of the character data that can be stored in that column.
The specific numbers like 32, 64, and 256 are often used in VARCHAR column definitions for several reasons, but they are not rigidly fixed; you can choose other sizes based on your specific data requirements.
Here are some reasons behind the common use of these numbers:
- Powers of 2: 32, 64, and 256 are powers of 2 (2^5, 2^6, and 2^8), which aligns well with the binary storage and memory management systems used in computer hardware. This alignment can make data manipulation and storage more efficient in some cases.
- Common Data Sizes: These sizes often align with common data sizes in various contexts. For example, 32 characters might be chosen for things like usernames, which are often limited in length. 64 and 256 characters cover a broad range of data sizes, accommodating longer strings or descriptions while still keeping data within reasonable limits.
- Historical Precedence: These sizes have been used in database design for a long time, and they have become somewhat standard. Database administrators and developers are accustomed to seeing these sizes in schemas.
- Practicality: They are practical sizes that work well for many types of data. Using numbers like 32, 64, and 256 simplifies decisions about column sizes because they are easy to work with and understand.
That said, there's no requirement to use these specific sizes. Your choice should always be based on your specific data needs. If your data requirements don't align with these sizes, it's perfectly fine to choose different sizes that better suit your use case.
Just "reasonable" values chosen for this example. For a production application, would want to validate that those are good values more thoroughly.
Did the multiple insert statements actually work at 9:22, or did it take 1.7s in total because it failed, so it would actually take longer if it worked?
Yes. Seems like the error cut the operation short
Nice catch. You're right that the error cut it short. Using -f to continue after errors causes it to take even longer! All the more reason to use LOAD DATA LOCAL INFILE 😄
Did you really decompress and stdout the whole 200GB to the terminal?
I was more impressed that vim was able to open a 200 GB file in no time.. while I'm struggling with 300 MB text files in Notepad++ 🤣🤣
@@sudeshryan8707 please avoid dumping the whole thing to stdout :)
FIELDS are not FILEDS.
To run meaningful queries you need some more indices.
id alone is limited.
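A sketch of what an extra index could look like for the Termination query mentioned earlier (table and column names assumed, not taken from the video):

  -- secondary index so the filter doesn't need a full table scan
  ALTER TABLE games ADD INDEX idx_termination (termination);

  -- this count can now be answered from the index
  SELECT COUNT(*) FROM games WHERE termination = 'Time forfeit';

Building the index on a 100M-row table takes time and disk space of its own, so it only pays off if the query is run repeatedly.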
Learned a ton from this. Thanks!
man the best channel ever 😎, cool videos as always and ton of informations
Wher's "Seee yyaa" ??? ... ?? 😀
Naa, just kidding. Great content!
I like that you do everything in the terminal
Wait, have you just loaded a 200 gig text file into the editor without any fire ?!
The art of vim chunked loading
It's all good; the only part not usable by most people is writing that C code to convert to CSV :)
Most people are not that good at classical C programming...
where's AARON?
Surely, Sir, you're not running the same MySQL version as the rest of us who have phpMyAdmin's integrated version? Or are you?
If you are, how is it that you can have 100M rows in one table and I can't have even half a million in my version of phpMyAdmin?
Did vim just open a 214 gigabyte file in seconds? I think I've opened a 12 gig file before, not 214!
Tip: You can CTRL-C as vim is in the process of loading a file to cut it short, allowing you to view part of the file!
wow, thank you for sharing
Video so good I didn't even need to watch it, but still did. 😅
Bruh, we have terabytes in MySQL. And all good, just don't do aggregations; use a warehouse for that.
Where is Aaron?
👀
Finished watching when you inserted records one by one
good topic
How do you make the query run faster? 22 sec for a count is long :P
I see that while_elo 😉
Wow impressive. Came here from Google News, this video was featured 😅
Will try with Postgres 16
Dude, you had all the data loaded into that C program. It would have been super easy and fast to then write it into MySQL directly from there! Why go through all those complicated detours and persist in using Python? LOL
Next video: Do this in mongoDB
True definition of BigData
Not even close :)
30 gigs is nothing in big data terms.
Why use MySQL when you could have loaded it into 100 LINKED ACCESS DATABASES?
I was more impressed that vim was able to open a 200 GB file in no time 😅😅
CTRL-C is your friend :)
Man, Python is really slo............w
200 GB text data? WTH is that? 😳
Python can be fast if you stay in the C world.
Dude, you could have asked ChatGPT and it's done.
Please.. the PGN to CSV converter C source....... Python is very slow....ㅜ_ㅜ
You discovered that C is faster than python hahahahahahhahahahhahah
Everything is fine, but why have you stopped your free tier for certain countries including India?
Doing 10,000 separate insert prepare/execute SQL statements without surrounding them with transaction calls = extremely bad performance. It's not even close to a fair test. Not that Python is helping matters. MySQL can also insert multiple rows in a single statement, which you also weren't taking advantage of. So, yeah, of course MySQL's command-line client built-ins will outperform even the fastest languages when you're doing the worst possible thing performance-wise.
He never made any queries to MySQL from Python.
Python was only used to generate the queries.
Yes, he should have shown using multiple values instead of separate insert statements. I would argue though… that's effectively what he did with the last mysqldump.
Loaded it directly from the CSV and then dumped it out.
You also don't need to use C to turn the original data into a CSV either. There are plenty of languages faster than Python that can do that.
I was almost hoping to see something that used Awk!
With enough shell-fu you could build an actual “pipeline” (literally, just chain the commands together! That’s what they’re there for!) that goes straight from the PGN file to MySQL with no intermediate files in between…
while_elo instead of white_elo lmao
skill issues
👀