Comparing duckdb and duckplyr to tibbles, data.tables, and data.frames (CC279)

  • Premiered Jul 4, 2024
  • duckdb has quickly grown in popularity as a database platform that is super fast with large datasets. Watch as Pat shows how to generate a duckdb database and access values from the database. He'll also compare the performance of using duckdb directly, using duckplyr, and using tibbles, data.tables, and data.frames. Pat will discuss how the performance changes with the number of different key values and the size of the database. You'll likely be surprised by the results! This episode is part of an ongoing effort to develop an R package that implements the naive Bayesian classifier.
    If you want to get a physical copy of R Packages: amzn.to/43pMR8L
    If you want a free, online version of R packages: r-pkgs.org/
    You can find my blog post for this episode at www.riffomonas.org/code_club/....
    Check out the GitHub repository at the:
    * Beginning of the episode: github.com/riffomonas/phyloty...
    * End of the episode: github.com/riffomonas/phyloty...
    #rstats #microbenchmark #vectors #rdp #16S #classification #classifier #microbialecology #microbiome
    Support Riffomonas by becoming a Patreon member!
    / riffomonas
    Want more practice on the concepts covered in Code Club? You can sign up for my weekly newsletter at shop.riffomonas.org/youtube to get practice problems, tips, and insights.
    If you're interested in purchasing a video workshop be sure to check out riffomonas.org/workshops/
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Introduction
    6:07 Improve construction of data.table objects
    11:12 Performance of which vs. logical
    16:04 Improved access to values in data.table objects
    20:31 Using duckdb() to store and access data
    27:11 Using duckplyr() to store and access data
    30:05 Evaluating sensitivity to number of rows and sparsity
    32:11 Improving performance of sparse matrix construction
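    The workflow covered in the episode can be sketched roughly as follows; the table and column names here are hypothetical, not taken from Pat's code:

    ```r
    # Minimal sketch: build a duckdb database and pull values from it,
    # both through DBI directly and through dplyr/dbplyr.
    library(duckdb)  # also loads DBI
    library(dplyr)

    con <- dbConnect(duckdb())  # in-memory; use dbdir = "file.duckdb" for on-disk

    dbWriteTable(
      con, "counts",
      data.frame(kmer = 1:5, n = c(3L, 1L, 4L, 1L, 5L))
    )

    # Direct access through DBI
    res_sql <- dbGetQuery(con, "SELECT n FROM counts WHERE kmer = 3")

    # The same lookup with dplyr verbs, translated to SQL by dbplyr
    res_dplyr <- tbl(con, "counts") |>
      filter(kmer == 3) |>
      pull(n)

    dbDisconnect(con, shutdown = TRUE)
    ```

    The episode benchmarks variations on this kind of single-value lookup against tibble, data.frame, data.table, and sparse-matrix equivalents.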
  • Science & Technology

Comments • 41

  • @Riffomonas  2 months ago  +6

    People have been asking about arrow. Here are the benchmarks with get_arrow_single (line 17) and get_arrow_three (line 15) included. Code for the testing is included in the linked GitHub repository.
    rank  function               time (ns)
     1    get_msparseT_three()   129724738
     2    get_msparseT_single()  128000176
     3    get_tbl_three()        120202119
     4    get_df_three()          83281619
     5    get_dt_three()          83114421
     6    get_which_three()       82046474
     7    get_tbl_single()        42185986
     8    get_msparseC_three()    20539934
     9    get_msparseC_single()   20399386
    10    get_which_single()      17729343
    11    get_df_single()         17174326
    12    get_dt_single()         17037181
    13    get_msparseR_single()    8302705
    14    get_msparseR_three()     8169640
    15    get_arrow_three()        7671572
    16    get_dbi_three()          5549166
    17    get_arrow_single()       4361846
    18    get_dbi_single()         2842756
    19    get_duck_three()         2413116
    20    get_duck_single()        1703386
    21    get_dt_singlek()          447658
    22    get_dt_threek()           428696
    23    get_mfull_three()         202766
    24    get_mfull_single()        137412

  • @RubenMejiaCorleto  2 months ago  +4

    Excellent work, thank you for taking my suggestion about duckdb into account.

    • @Riffomonas  2 months ago  +1

      Absolutely! Thanks for the suggestion 🤓

  • @thespaniardinme  2 months ago  +4

    One of those much awaited videos. Thank you, sir!

    • @Riffomonas  2 months ago  +1

      My pleasure - thanks for tuning in!

  • @mabenba  2 months ago  +2

    Great episode! I am starting to learn more about DuckDB as it seems a really useful tool, mostly used with dbt and large datasets.

    • @Riffomonas  2 months ago

      Yeah, it seems pretty awesome. As I understand it, they keep making it more performant. It seems like a great tool

  • @bulletkip  2 months ago  +1

    Thank you sir! Your channel continues to be an excellent resource. Much appreciated

    • @Riffomonas  2 months ago

      My pleasure - thanks!

  • @leonelemiliolereboursnadal6966  2 months ago

    Great to see more of your videos!!!!

    • @Riffomonas  2 months ago

      Thanks!

  • @vlemvlemvlem3659  2 months ago  +1

    It's between you, good sir, and Josiah Parry for the King of R-content on YouTube. I love your stuff.

    • @Riffomonas  2 months ago

      Thanks a bunch!

  • @ColinDdd  2 months ago  +1

    Great video. Benchmarking is such a powerful tool. Of course people can game the benchmarks, but they go to show that you shouldn't get too attached to one particular tech, because everything can change once a new system shows better performance!

    • @Riffomonas  2 months ago  +1

      Absolutely. Hopefully my recent benchmarking has shown that there are a lot of factors that can impact performance. It's really important to be clear about the assumptions that go into the test

  • @rayflyers  2 months ago  +1

    I learned about duckdb at posit::conf last year. It seems like a good tool, but I primarily use arrow when I need speed (for larger data) and DBI and dbplyr when I need to work with a database.

    • @Riffomonas  2 months ago

      Thanks for watching. Check out the pinned comment (be sure to expand it to see the whole thing) where I added arrow to the comparison. For this test, it is actually slower than duckdb!

  • @haraldurkarlsson1147  2 months ago  +1

    Pat,
    Nice to see a video on DuckDB! I have been playing with the arrow package (another space-saving type of approach) but it recently stopped working on my Mac (M1). It is another package worth considering.

    • @Riffomonas  2 months ago

      Thanks for tuning in! Not sure why arrow wouldn't work on an M1. That's what I have and was able to get it to work. Check out the pinned comment (be sure to expand it to see the whole thing) where I added arrow to the comparison. For this test, it is actually slower than duckdb!

    • @haraldurkarlsson1147  2 months ago

      @Riffomonas
      Pat, when I run arrow_info() I get FALSE on every item except the first (acero). I just updated R and RStudio, but that did not fix the issue.

    • @haraldurkarlsson1147  2 months ago

      P. S. The instructions on the Arrow website are of little help to me.

    • @haraldurkarlsson1147  2 months ago

      I am running arrow inside a Quarto book by the way. But it used to work there.

    • @haraldurkarlsson1147  2 months ago

      I get this warning: "This build of the arrow package does not support Datasets." Even after updating R and RStudio and re-installing arrow.

  • @mmcharchuta  2 months ago

    Exciting!

    • @Riffomonas  2 months ago

      Thanks for watching!

  • @mishmohd  2 months ago  +1

    He understands the assignment

    • @Riffomonas  2 months ago

      Thanks for tuning in!

  • @spacelem  2 months ago  +1

    That is remarkably satisfying watching all of those benchmarks jostle for supremacy!
    I think one question that might still be good to examine (although I really don't know how you'd do it), given that your initial problem was that your data was too big to fit in memory, is how memory efficient each of these methods is. "Slow but fits in memory" might beat "fast but my machine can't handle it".

    • @Riffomonas  2 months ago  +1

      Great point! I'll try to follow up on this once I get to the real data
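    One hedged way to start on the memory question raised above, assuming the bench package, which reports allocations alongside timings:

    ```r
    # bench::mark() records the memory allocated by each expression as well
    # as its timing, giving a first look at the speed-vs-memory trade-off.
    library(bench)

    x <- runif(1e5)
    res <- mark(sqrt_fn = sqrt(x), power = x^0.5)
    res[, c("expression", "median", "mem_alloc")]
    ```

    Note this only tracks allocations inside the R session; it would not capture the on-disk footprint of a duckdb database, which is part of duckdb's appeal.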

  • @haraldurkarlsson1147  2 months ago

    Very nice and thought-provoking. My understanding of DuckDB is that it is basically a way to work with large datasets by storing them locally, thus not eating up RAM and slowing things down (the larger-than-memory selling point of DuckDB), and only loading in what you need rather than the entire dataset. So maybe asking about speed compared to a matrix approach may be a bit of an apples-vs-oranges deal?

    • @Riffomonas  2 months ago

      Still learning about duckdb. It's an option for my project so comparing it to any other possible option seems relevant to me

    • @haraldurkarlsson1147  2 months ago

      @Riffomonas Climate scientists have been using NetCDF files for decades. Those are supposed to be very memory efficient. Is that an option for you? I do realize that eventually you have to pick something and move on.

  • @victorcat1377  2 months ago

    Hello! Thanks a lot for your clarity and these useful tutorials! When I have large data to process, I sometimes try to parallelize my script with packages such as doParallel in R. Any thoughts on that?

    • @Riffomonas  2 months ago

      I have used the future and furrr packages in the past. These are great to make it easy to work with parallelization when trying to speed things up. Thanks for watching!
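      The future/furrr approach mentioned here can be sketched as follows (not code from the episode; the squaring example is just for illustration):

      ```r
      # plan() picks the parallel backend, and future_map_*() are drop-in
      # parallel versions of purrr's map_*() functions.
      library(future)
      library(furrr)

      plan(multisession, workers = 2)               # two background R sessions
      res <- future_map_dbl(1:4, function(x) x^2)   # parallel purrr::map_dbl()
      plan(sequential)                              # back to sequential evaluation
      ```

      For cheap per-element work like this the parallel overhead dominates; the payoff comes when each element takes a meaningful amount of time.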

  • @djangoworldwide7925  2 months ago

    I decided not to go with duckplyr since the print output is a bit annoying. I couldn't see enough rows because of all the extra info there. How do you silence this?

    • @Riffomonas  2 months ago  +1

      You can suppress the output with the duckdb.materialize_message option (see rdrr.io/github/duckdblabs/duckplyr/man/config.html for examples)
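      Assuming the option behaves as described on that config page, turning the message off would look like:

      ```r
      # Silence duckplyr's materialization message when printing results
      options(duckdb.materialize_message = FALSE)
      ```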

  • @sven9r  2 months ago

    Hey Pat, great video!
    I see you scrolling a lot. Wouldn't paragraphing help, since your code is getting soooooo long (and the comments still suggest more benchmarking :P)?

    • @Riffomonas  2 months ago

      Yeah, well, I'd have to remember to do that then! 🤓 FWIW, we're done with benchmarking for a bit

  • @michaelmanti  10 days ago

    Comparing keyed data.tables to non-keyed, non-indexed duckdb tables seems unfair, since duckdb does support keys and indices. Have you tested keyed and/or indexed tables in duckdb? If I'm not mistaken, the un-keyed duckdb versions outperformed the un-keyed data.table versions?

    • @Riffomonas  10 days ago  +1

      Thanks for watching! I'm not able to find duckdb/duckplyr documentation on setting keys. Can you point me to it? But you are correct that data.table without keys is slower than duckdb; I showed this in the current (and previous) episodes. The keyed get_dt_threek function took 421k ns, the unkeyed get_dt_three took 104941k ns, and get_duck_three took 2474k ns.

    • @michaelmanti  10 days ago

      @Riffomonas I provided a direct link in an earlier comment, but YouTube appears to have dropped it. But if you search for "indexing" on the DuckDB website, you'll find that keys are "implicitly indexed" by adaptive radix trees (ARTs). I expect that keying the duckdb table would improve performance on your query benchmarks, but I'd be interested in learning how much.
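      The two "explicit key" routes under discussion can be sketched like this; the table and column names are made up for illustration:

      ```r
      # Keyed data.table vs. an explicit duckdb index on the lookup column.
      library(data.table)
      library(duckdb)

      dt <- data.table(kmer = 1:1000, n = runif(1000))
      setkey(dt, kmer)  # keyed data.table: binary search instead of a vector scan

      con <- dbConnect(duckdb())
      dbWriteTable(con, "counts", dt)
      # duckdb's explicit index (backed by an adaptive radix tree, ART)
      dbExecute(con, "CREATE INDEX kmer_idx ON counts (kmer)")
      res <- dbGetQuery(con, "SELECT n FROM counts WHERE kmer = 42")
      dbDisconnect(con, shutdown = TRUE)
      ```

      Whether the index helps at these table sizes is exactly the kind of question the episode's microbenchmark setup could answer.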