How To Load One BILLION Rows into an SQL Database

  • Published Jul 30, 2024
  • 📝 Get my free SQL Cheat Sheets: www.databasestar.com/get-sql-...
    🎓 Learn and become confident in SQL: databasestar.mykajabi.com/sql...
    Have you ever wondered how to import one billion rows into an SQL database? Or how long it would take?
    In this video, you'll see how to generate a file with 1 billion rows, how to import it into four different SQL databases, and how long it takes in each database.
    ⏱ TIMESTAMPS:
    00:00 - The challenge
    00:31 - Setting up the data
    02:08 - MySQL
    04:08 - Postgres
    05:31 - SQL Server
    07:35 - Oracle method 1
    09:33 - Oracle method 2
    🔗 VIDEO LINKS:
    Gunnar Morling's blog post: www.morling.dev/blog/one-bill...
    1BRC GitHub repo: github.com/gunnarmorling/1brc
    SQL scripts used in this video: github.com/bbrumm/databasesta...
    Commands to clone the repo and generate the file:
    git clone https://github.com/gunnarmorling/1brc.git
    cd 1brc
    java -version
    mvnw verify
    sh create_measurements.sh 1000000000
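    A sketch of the per-city aggregation the video times (table and column names here are assumptions; the exact scripts are in the repo linked above):
    SELECT city, MIN(temperature) AS min_temp,
           AVG(temperature) AS avg_temp,
           MAX(temperature) AS max_temp
    FROM measurements
    GROUP BY city
    ORDER BY city;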
    Specs of the computer used in this video:
    Lenovo Ideapad 5 laptop
    CPU: Intel i5 2.4 GHz
    RAM: 16 GB
    OS: Windows 11
  • Science & Technology

Comments • 103

  • @DatabaseStar
    @DatabaseStar  15 days ago +1

    Thanks for all of the comments everyone! I'm glad a couple of issues have been pointed out. I'll address them when I redo the process in a follow-up to this video, including:
    - use SQL Server Developer Edition rather than Express to avoid the database size limit and import all 1B rows
    - use another version of Oracle to also avoid the database size limit and import all 1B rows
    - consider optimising the queries or changing some parameters to further improve the performance of the SELECT

  • @ahmettek315
    @ahmettek315 17 days ago +32

    If doing this with 250 million rows and multiplying by 4 was a correct approach, then for all databases you could just use 1000 rows and multiply by a million.

    • @jpsolares
      @jpsolares 16 days ago +2

      Perhaps consider using the Developer Edition, as it is equivalent to the Enterprise Edition. Since the project is not intended for production purposes, there should be no issue in utilizing it.

    • @ahmettek315
      @ahmettek315 16 days ago +6

      @@jpsolares We all know that. What I am saying is, you can't test something by doing it with 250 million rows and multiplying by 4, and then claim it is equal to doing it with 1 billion rows. It looks like you read my comment as if I were asking how to do it.

    • @DatabaseStar
      @DatabaseStar  16 days ago +1

      Good point! The only reason I used 250m for Oracle and SQL Server was because of the size limitations mentioned in the video. In hindsight (and from the other comments), I could have used the SQL Server Developer edition to load the full 1 billion rows. I could also have used a different edition for Oracle just for this experiment.

    • @ahmettek315
      @ahmettek315 15 days ago

      @@DatabaseStar No problem. I later read about it a bit, and it was indeed a challenge where you could use any kind of optimization you want. In my view, that made your SQL Server and PostgreSQL samples void for me. I don't use MySQL and Oracle, but at least I know your sample is void for MySQL as well. You are doing the uploads and querying just how a beginner would. No tricks, no optimizations, no use of extensions, CLR, etc.
      I could create better versions for PostgreSQL and SQL Server, but only when I have spare time and/or think it is worth doing.

    • @DatabaseStar
      @DatabaseStar  15 days ago

      Yes, that's true, I mention in the video I don't make any optimisations such as indexes or adjustments to the process. The default settings are used on purpose. There are definitely ways to make it faster in each database!

  • @mortezanaghavimoghadam8258
    @mortezanaghavimoghadam8258 17 days ago +14

    Time needed for a certain query doesn't grow linearly with respect to the size of the data 6:00

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Yes I completely agree. I mentioned in the video it's not a true comparison as I was unable to load the full 1B rows into Oracle and SQL Server.

  • @andyj115
    @andyj115 8 days ago +3

    INSERT or BULK INSERT do not have a linear data-insertion time, but an exponential one. Therefore the test should be repeated using 250,000 records for each database (in the comparison of no-cost databases). Oracle and MSSQL in their paid versions have very respectable performance, as they no longer have RAM or CPU core limits.

    • @DatabaseStar
      @DatabaseStar  6 days ago

      Thanks for the tip, yeah that's right. Others have pointed that out as well. In hindsight I should have used different versions of Oracle and SQL Server to avoid the size limits, so I could import the billion rows in all vendors.

  • @Chris-by8eh
    @Chris-by8eh 17 days ago +9

    In SQL Server, if you change the autogrow to 10,000 from 64, it will run much faster for both data and log files. Also, SQL Server has external tables as well.
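    A minimal T-SQL sketch of that change (database and logical file names are assumptions):
    ALTER DATABASE OneBillion
      MODIFY FILE (NAME = N'OneBillion_data', FILEGROWTH = 10000MB);
    ALTER DATABASE OneBillion
      MODIFY FILE (NAME = N'OneBillion_log', FILEGROWTH = 10000MB);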

    • @lucca5101
      @lucca5101 17 days ago +1

      As does Postgres, but he only used that feature with Oracle. For every other DB he copied the data. Totally not pushing Oracle…
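      For reference, a minimal sketch of reading the file in place with Postgres's file_fdw extension (server name, table name, and file path are assumptions):
      CREATE EXTENSION file_fdw;
      CREATE SERVER measurements_srv FOREIGN DATA WRAPPER file_fdw;
      CREATE FOREIGN TABLE measurements_ext (
        city text,
        temperature numeric
      ) SERVER measurements_srv
      OPTIONS (filename '/data/measurements.txt', format 'csv', delimiter ';');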

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Thanks for the tip, I wasn't aware of that! I left all of the databases at their default settings but this would have been a good one to change to avoid the issue.

    • @DatabaseStar
      @DatabaseStar  16 days ago

      No, I'm not pushing Oracle, and I left all of the settings as the defaults. I did an additional method for Oracle because I was aware of its ability to use an external table.
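      A rough sketch of that Oracle external-table approach (the directory object and all names are assumptions):
      CREATE TABLE measurements_ext (
        city VARCHAR2(100),
        temperature NUMBER
      )
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY data_dir
        ACCESS PARAMETERS (
          RECORDS DELIMITED BY NEWLINE
          FIELDS TERMINATED BY ';'
        )
        LOCATION ('measurements.txt')
      );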

  • @RodrigoMenares
    @RodrigoMenares 19 days ago +4

    To solve the growth problem in SQL Server when creating the database, click on the '...' and choose growth as a percentage. This prevents you from having to specify a value that may be too small (that is, it avoids setting a fixed growth limit), and it solves the problem of the database size.

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good tip, thanks for sharing!

  • @ojikutu
    @ojikutu 15 days ago +4

    Loading 250m rows 4 times is not equal to loading 1 billion rows. The experiment is really flawed. You need to use 250m rows for all. Looking forward to another test with the same yardstick. Kudos!

    • @DatabaseStar
      @DatabaseStar  15 days ago

      Thanks! Yeah a few people had pointed that out, and I mentioned it a couple of times in the video. I plan on doing a follow-up video for it.

  • @PaulBunkey
    @PaulBunkey 16 days ago +2

    You can freely and legally use the Developer Edition of MS SQL Server. As long as there is no production data loaded, the Dev edition is free.

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Thanks for the tip! I should have used that edition, in hindsight, for a better comparison.

  • @ahmettek315
    @ahmettek315 15 days ago

    Just out of curiosity, can you share your machine's specs? From the create_measurements.sh timing it looks like you have a faster machine than mine, but it would be good to know.

    • @DatabaseStar
      @DatabaseStar  15 days ago +1

      Sure! Here are the specs:
      Lenovo Ideapad 5 laptop
      CPU: Intel i5 2.4 GHz
      RAM: 16 GB
      OS: Windows 11

  • @ahmettek315
    @ahmettek315 15 days ago +2

    In a real-world scenario, you likely wouldn't keep that data as just two columns, city and temperature. Checking the original challenge, it is more about parsing data. Plain streaming with languages (the C family, Java, Go, Rust, ...) is expected to beat any database that way, IMHO. They don't need to store all the values, just the min, max, and average (i.e., a running sum and count).

    • @DatabaseStar
      @DatabaseStar  15 days ago

      Yeah the results of the Java challenges were impressive. One of the points of this video is that databases are great at processing data, so I wanted to see how the different databases handled it.

    • @ahmettek315
      @ahmettek315 14 days ago

      @@DatabaseStar Indirectly I started to dig into what that was about: those results were not from a file on disk but from a RAM disk, and the machine used was an AMD EPYC 7502P with 32 cores and 128 GB of memory! That CPU is supposedly more than 5 times better than yours, and it is a server CPU.
      Working directly from memory on beast hardware, those timings are not fair to compare with yours. Maybe you should also run one or more of their solutions to see what they score on the same hardware.
      (Trying to allocate time for a PostgreSQL and/or Go check)

  • @hendrykhoza4782
    @hendrykhoza4782 17 days ago +5

    Please include SQLite

  • @charliesta.abc123
    @charliesta.abc123 9 days ago

    Interesting, thank you for sharing. For the Postgres part, do you think the pagination negatively affected the results?

    • @DatabaseStar
      @DatabaseStar  9 days ago

      No problem! I think it did, actually, based on a couple of other comments. In hindsight I should have shown all results at once in the query to get a more accurate number.

  • @anotherme4638
    @anotherme4638 20 days ago +3

    It is interesting to see a real test of the difference in loading/select speed between DBs, and how out-of-the-box Oracle numbers (select) are more than 13x faster than Postgres and MySQL.
    I wonder if there are ways to tweak some parameters in Postgres/MySQL (especially for MySQL) to decrease those numbers and make them comparable to Oracle.
    Thanks a lot for the demonstration, it is a great video

    • @DatabaseStar
      @DatabaseStar  20 days ago +1

      Glad you liked the video! Yes I thought the test was interesting as I was doing it.
      I think there would be ways to tweak some parameters on the server to improve the import process. The default settings may not be the fastest.
      Also there are indexes that could have been added after the import which could have also improved the query performance.

    • @ahmettek315
      @ahmettek315 15 days ago

      It is only so with the way he did it.
      I don't use MySQL, but look at the Vitess project, for example. AFAIK it is what made it possible for YouTube to store their videos on MySQL. You may also check PlanetScale.

  • @krzysztofcierpia4853
    @krzysztofcierpia4853 2 days ago +1

    You should extend your test to databases like SAP HANA, IBM DB2, Snowflake, Google BigQuery, MariaDB, and SQLite

    • @DatabaseStar
      @DatabaseStar  2 days ago

      Good idea! I don't have any experience with those databases unfortunately.

  • @palkollar7739
    @palkollar7739 21 days ago

    What if you index the city names? How fast would the queries be?

    • @DatabaseStar
      @DatabaseStar  21 days ago +1

      They would probably be quite a bit faster!
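      A minimal sketch of such an index (table and column names are assumptions):
      CREATE INDEX idx_measurements_city ON measurements (city);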

  • @JunaidKhan-gq8nw
    @JunaidKhan-gq8nw 17 days ago

    Very helpful

  • @asrytube
    @asrytube 10 days ago

    Oh, I love Oracle! I've been using it for years and am fascinated by all its features, functions, and properties. Its direct-path loading feature is also awesome.

    • @DatabaseStar
      @DatabaseStar  9 days ago

      Yeah, it does have a lot of features.

  • @ravishmahajan9314
    @ravishmahajan9314 16 days ago +1

    Have you used ClickHouse or DuckDB??

    • @DatabaseStar
      @DatabaseStar  16 days ago

      No, I haven't used them actually.

  • @yanjh222
    @yanjh222 15 days ago

    So, what's your hardware configuration, just a laptop?

    • @DatabaseStar
      @DatabaseStar  15 days ago

      Yeah, it's my Windows laptop:
      Lenovo Ideapad 5 laptop
      CPU: Intel i5 2.4 GHz
      RAM: 16 GB
      OS: Windows 11

  • @timelschner8451
    @timelschner8451 17 days ago

    How fast would it be to load all the data into an in memory Database and query it?

    • @Nachtaktiv2000
      @Nachtaktiv2000 17 days ago

      in memory will only be a bit faster when loading, but up to 1000x faster when querying

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good question. I agree with what Nachtaktiv mentioned, the benefits would be when querying.

  • @Nachtaktiv2000
    @Nachtaktiv2000 17 days ago +1

    … some of the Oracle comparison min/max times are only for fetching 200 rows!!!
    … and why not load 500m rows and multiply by 2?

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good point! I tried to load 500m rows but that also exceeded the maximum size of the database (10 GB).

  • @wbarbosabr
    @wbarbosabr 13 days ago +1

    How about SQL*Loader in Oracle?
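    For reference, a rough SQL*Loader sketch (the control-file contents and all names are assumptions):
    LOAD DATA
    INFILE 'measurements.txt'
    APPEND INTO TABLE measurements
    FIELDS TERMINATED BY ';'
    (city, temperature)
    Run with something like: sqlldr userid=user/pass control=measurements.ctl direct=true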

  • @mariosaenz5923
    @mariosaenz5923 16 days ago

    Why didn't you use SQL Server Developer Edition??? To avoid the Express problems

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good point! In hindsight I should have done this, and it would have been a better comparison.

  • @RolandOkouby
    @RolandOkouby 21 days ago +3

    GREAT CONTENT🔥

  • @amitkumdixit
    @amitkumdixit 11 days ago

    Your PostgreSQL SELECT benchmark was also not correct. It wasn't 27 minutes; it was much lower. The issue was that the query had already returned all the results; it was your DBeaver that was paginating the result

    • @DatabaseStar
      @DatabaseStar  9 days ago

      Oh thanks for pointing that out, that's good to know. I'll look at it again as it seemed like DBeaver took several minutes each time it loaded a new page.

  • @rmcgraw7943
    @rmcgraw7943 12 days ago

    bcp, or load an MDF on a non-production server, then copy the file and sp_attach_db it. You'll need to stop the service, then attach it. I can think of a half dozen ways to do this, depending on business constraints.
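    A rough bcp sketch of that first option (server, database, and table names are assumptions):
    bcp OneBillion.dbo.measurements in measurements.txt -S localhost -T -c -t ";"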

    • @DatabaseStar
      @DatabaseStar  9 days ago

      Good to know there are many other ways to do this

  • @vpn740
    @vpn740 17 days ago

    Concurrency is a critical factor (one of many others) in the context of this scenario. Based on my observations, SMP databases exhibit subpar performance when managing such workloads in a production setting. In contrast, MPP databases such as Teradata, Snowflake, and Netezza offer a more suitable solution. A decade ago, I was involved in an Oracle-to-Teradata migration project, where we conducted parallel testing by executing identical queries on both database platforms. The disparities we encountered were stark and significant.

    • @ramsundararaman6615
      @ramsundararaman6615 16 days ago

      Tell us more about it, please!

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good point, yes I imagine concurrency could impact the performance here.

  • @EMoscosoCam
    @EMoscosoCam 17 days ago +5

    You can get the Developer Edition of SQL Server, which is free

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Thanks for sharing! In hindsight, I should have done this instead of using Express, because it would have been a fairer comparison.

  • @RodrigoMenares
    @RodrigoMenares 19 days ago

    That could change work times, greetings

  • @arbazadam3407
    @arbazadam3407 17 days ago +2

    One of the best channels on database

  • @low-sound
    @low-sound 18 days ago +3

    Great video. But it got me curious: what about NoSQL databases? I know they are faster than SQL, but by how much? Can you make a video on NoSQL databases, like Mongo, Cassandra, etc.?

    • @flybyray
      @flybyray 17 days ago

      Or just use Unix tools efficiently, directly and/or as a pipe with the generator script itself 😂

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Thanks! They could be faster, but I don't have a lot of experience with NoSQL so I just focused on SQL for this video.

  • @nagesh007
    @nagesh007 11 days ago

    Awesome

  • @abhor115
    @abhor115 15 days ago

    Very impressive

  • @RadityoPrasetiantoWibowo
    @RadityoPrasetiantoWibowo 16 days ago +1

    For SQL Server you can use the Developer Edition

    • @RadityoPrasetiantoWibowo
      @RadityoPrasetiantoWibowo 16 days ago

      You should:
      1. use SQL Server Developer Edition
      2. also monitor memory usage and CPU usage
      3. use the same client tool, like DBeaver, or a minimal client like the CLI version of each vendor
      Good work!

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Thanks! Yeah a few other commenters have pointed that out, and in hindsight I should have done that.

  • @syfre9506
    @syfre9506 10 days ago

    You should use SQL Server Developer Edition; it's not limited

    • @DatabaseStar
      @DatabaseStar  9 days ago

      That's true! Others have mentioned it and I wasn't aware of that when I made the video.

  • @lucca5101
    @lucca5101 17 days ago +1

    Clearly trying to push Oracle. Why not use Postgres to directly read the text file for a better comparison?

    • @DatabaseStar
      @DatabaseStar  16 days ago

      Good idea! In my research I couldn't find a way to read a text file directly using Postgres. Do you know how it could be done?
      I'm not trying to push Oracle at all. My preferred database to use is actually Postgres :)

  • @robderrt
    @robderrt 8 days ago

    How would you do it in Python?

    • @DatabaseStar
      @DatabaseStar  6 days ago

      I'm not sure, I don't have a lot of experience in Python. I assume you can write some code to read the CSV file and analyse it directly.

  • @nicku33
    @nicku33 17 days ago +1

    Do DuckDB !!

  • @kwando88
    @kwando88 17 days ago +1

    Now do the same with duckdb :)

  • @JohnoScott
    @JohnoScott 11 days ago

    this sounds like a job for DuckDB 🦆🔥💪

    • @DatabaseStar
      @DatabaseStar  9 days ago

      Oh good to know!
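      For the curious, a minimal DuckDB sketch (the file name and layout are assumed from the 1BRC format):
      SELECT city, MIN(temp), AVG(temp), MAX(temp)
      FROM read_csv('measurements.txt', delim=';', header=false,
                    columns={'city': 'VARCHAR', 'temp': 'DOUBLE'})
      GROUP BY city;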

  • @rezarafieirad
    @rezarafieirad 21 days ago

    perfect

  • @poparab
    @poparab 17 days ago

    Okay, then let's try MariaDB, and then we give them the challenge: under 9 minutes.

  • @yaynative
    @yaynative 21 days ago

    Wow! Why is Oracle so fast?

    • @DatabaseStar
      @DatabaseStar  20 days ago +2

      I think because it's a more "premium" database and it's built for high-end performance. Or maybe the default settings are better for importing large sets of data compared to the others.

    • @lucca5101
      @lucca5101 17 days ago +3

      Or because it wasn’t a fair comparison. Both Postgres and SQL Server can directly read external files but you chose to copy the data on all but Oracle.

    • @yaynative
      @yaynative 16 days ago

      @@lucca5101 Interesting 🤔

    • @DatabaseStar
      @DatabaseStar  15 days ago +1

      I couldn't find any documentation on reading a text file directly in Postgres, SQL Server, or MySQL, so I didn't include those in this video. I still included an "import file into table" method for Oracle to see how they would compare as well.

  • @chudchadanstud
    @chudchadanstud 15 days ago

    Dude, just keep everything the same

    • @DatabaseStar
      @DatabaseStar  15 days ago

      What do you mean?