Benchmarking R functions for reading tsv files (CC291)

  • Published on Jul 16, 2024
  • Reading data tables into R is a very common activity, and there are many ways to do it: read.delim in base R, the read_tsv function from readr, the vroom function from the vroom package, or the fread function from data.table. Pat benchmarks these four approaches and discusses the tradeoffs between speed and dependencies for package development. He then uses test-driven development (TDD) to build a read_taxonomy function around readr::read_tsv in his phylotypr R package. This episode is part of an ongoing effort to develop an R package that implements the naive Bayesian classifier.
    If you want to get a physical copy of R Packages: amzn.to/43pMR8L
    If you want a free, online version of R packages: r-pkgs.org/
    You can find my blog post for this episode at www.riffomonas.org/code_club/....
    Check out the GitHub repository at the:
    * Beginning of the episode: github.com/riffomonas/phyloty...
    * End of the episode: github.com/riffomonas/phyloty...
    #rstats #readr #vroom #data.table #read.delim #rdp #16S #classification #classifier #microbialecology #microbiome
    Support Riffomonas by becoming a Patreon member!
    / riffomonas
    Want more practice on the concepts covered in Code Club? You can sign up for my weekly newsletter at shop.riffomonas.org/youtube to get practice problems, tips, and insights.
    If you're interested in purchasing a video workshop be sure to check out riffomonas.org/workshops/
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Introduction
    3:30 Benchmarking methods for reading tsv files
    24:23 Writing tests for read_taxonomy
    28:18 Writing read_taxonomy function
    32:09 Refactoring read_taxonomy
    35:51 Package hygiene
  • Science & Technology

Comments • 2

  • @pedrobittencourt_ 22 days ago +3

    You should totally include arrow::read_tsv_arrow

    • @Riffomonas 21 days ago

      Thanks for the suggestion! I just went back and added it to my benchmarking script. If I leave out the stri_replace_last_regex function call, it is pretty similar to vroom (5.7 vs 5.8 ms), which is a smidge faster than data.table (7.6 ms). With the stri_replace_last_regex function call, it is still similar to vroom (33 ms) but a smidge slower than data.table (30.7 ms). I committed the additional test to the repository if you want to see what I did.
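
For readers who want to try a comparison like the one discussed in this thread, here is a minimal sketch in R. It is not the benchmarking script from the repository: the toy file, its column names, the `time_reader` helper, and the use of `system.time` are all assumptions made for illustration. Only base R is required; readr, vroom, data.table, and arrow are timed only if they happen to be installed.

```r
# Hypothetical toy data: the real taxonomy file format is not shown in the source.
tsv_path <- tempfile(fileext = ".tsv")
n <- 1000
toy <- data.frame(
  id = paste0("seq_", seq_len(n)),
  taxonomy = rep("Bacteria;Firmicutes;Bacilli;", n)
)
write.table(toy, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Wrap each reader in a function so all of them can be timed the same way.
readers <- list(
  read.delim = function(f) read.delim(f, stringsAsFactors = FALSE)
)
if (requireNamespace("readr", quietly = TRUE)) {
  readers$read_tsv <- function(f) readr::read_tsv(f, show_col_types = FALSE)
}
if (requireNamespace("vroom", quietly = TRUE)) {
  readers$vroom <- function(f) vroom::vroom(f, delim = "\t", show_col_types = FALSE)
}
if (requireNamespace("data.table", quietly = TRUE)) {
  readers$fread <- function(f) data.table::fread(f, sep = "\t")
}
if (requireNamespace("arrow", quietly = TRUE)) {
  readers$read_tsv_arrow <- function(f) arrow::read_tsv_arrow(f)
}

# Hypothetical helper: median elapsed seconds over a few repetitions.
time_reader <- function(reader, f, reps = 10) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(reader(f))[["elapsed"]],
    numeric(1)
  )
  median(elapsed)
}

timings <- vapply(readers, function(r) time_reader(r, tsv_path), numeric(1))
print(sort(timings))
```

Timings from a sketch like this depend heavily on file size, hardware, and whatever post-processing is applied to the columns after reading, so they should not be expected to match the millisecond figures quoted above. A dedicated benchmarking package would give finer resolution than `system.time`.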