Benchmarking methods for reading text files in R (CC290)

  • Published on Jul 7, 2024
  • Pat revisits his code for reading FASTA-formatted DNA sequence files into R. First, he takes on how to read in the sequence data. Then he removes a for loop. Finally, he revisits some of the functions from stringi to see if he can further improve the performance of the code. With all of the changes combined, the function is now 3 times faster than it was before! He shows how to use scan, readLines, readr::read_lines, data.table::fread, and vroom::vroom_lines. This episode is part of an ongoing effort to develop an R package that implements the naive Bayesian classifier.
    If you want to get a physical copy of R Packages: amzn.to/43pMR8L
    If you want a free, online version of R packages: r-pkgs.org/
    You can find my blog post for this episode at www.riffomonas.org/code_club/....
    Check out the GitHub repository at the:
    * Beginning of the episode: github.com/riffomonas/phyloty...
    * End of the episode: github.com/riffomonas/phyloty...
    #rstats #paste #paste0 #refactor #testthat #tdd #microbenchmark #vectors #rdp #16S #classification #classifier #microbialecology #microbiome
    Support Riffomonas by becoming a Patreon member!
    / riffomonas
    Want more practice on the concepts covered in Code Club? You can sign up for my weekly newsletter at shop.riffomonas.org/youtube to get practice problems, tips, and insights.
    If you're interested in purchasing a video workshop be sure to check out riffomonas.org/workshops/
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Introduction
    7:47 Benchmarking reading in unformatted text data
    15:57 Applying benchmarking results to read_fasta
    16:52 Vectorizing creation of data frame
    27:31 Further optimization of stringi functions
    31:43 Revisiting vroom::vroom_lines
    32:20 Importing packages and functions
  • Science & Technology
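
The description above mentions comparing line readers and removing a for loop. As a rough sketch only (this is not the code from the episode), the snippet below times two base R readers on a tiny made-up FASTA file and then builds a data frame without looping over records. The helper `time_reader`, the column names, and the sample sequences are assumptions made for illustration; readr is timed only if it happens to be installed, and the parsing assumes every header is followed by at least one sequence line.

```r
# Write a tiny FASTA file with made-up sequences to a temporary location.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(
  c(">seq1 first record", "ACGT", "TTGA", ">seq2 second record", "GGCC"),
  fasta_file
)

# Two base R readers that each return one string per line.
read_with_readLines <- function(path) readLines(path)
read_with_scan <- function(path) {
  scan(path, what = character(), sep = "\n", quiet = TRUE)
}

# Hypothetical helper: total elapsed seconds for `times` repeated reads.
time_reader <- function(reader, path, times = 50) {
  system.time(for (i in seq_len(times)) reader(path))[["elapsed"]]
}

timings <- c(
  readLines = time_reader(read_with_readLines, fasta_file),
  scan = time_reader(read_with_scan, fasta_file)
)

# Time readr::read_lines as well, but only when readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  timings["read_lines"] <- time_reader(readr::read_lines, fasta_file)
}
print(timings)

# Vectorized parsing: no explicit loop over records.
fasta_lines <- read_with_readLines(fasta_file)
is_header <- startsWith(fasta_lines, ">")
record <- cumsum(is_header)  # record number that each line belongs to

headers <- sub("^>", "", fasta_lines[is_header])

# Concatenate the sequence lines that belong to the same record.
sequences <- tapply(
  fasta_lines[!is_header], record[!is_header], paste, collapse = ""
)

fasta_df <- data.frame(
  id = sub("\\s.*$", "", headers),  # text before the first whitespace
  sequence = unname(as.vector(sequences)),
  stringsAsFactors = FALSE
)
print(fasta_df)
```

Timings on a file this small are dominated by noise, so a realistic comparison would use a full-sized reference file and a dedicated tool such as the microbenchmark package named in the hashtags.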

Comments • 9

  • @pedrobittencourt_ 20 days ago +1

    How about a function to read genbank files into a tibble? I would definitely use it!

    • @Riffomonas 19 days ago +1

      Nice idea, maybe some day...

  • @JordiRosell 16 days ago

    Another idea would be using Rust and the rextendr package.

    • @Riffomonas 13 days ago +1

      Then I'd have to learn Rust! 😂

  • @SopheaPhon-23 12 days ago

    Could you please make a video series about machine learning using R? Recently, I was unable to load the h2o package and connect to it.

    • @Riffomonas 12 days ago

      You might be interested in using mikropml from my group. I made a series using it about 2 years ago. Here's a link to the playlist: th-cam.com/play/PLmNrK_nkqBpKpzb9-vI4V7SdXC-jXEcmg.html

  • @Slotherdanig 20 days ago

    Just ran into the same issue with a largish TSV (output from chewBBACA), a 60 x 9000 presence/absence table. It took about 7 seconds to load, but was near instant with fread. However, that led to this error: > # Transpose the data for easier manipulation
    > transposed_data colnames(transposed_data)

    • @Riffomonas 19 days ago +1

      Cool! With data.table I would try using melt and dcast to reshape the data frame: cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

    • @Slotherdanig 18 days ago

      @Riffomonas Thanks for the tip! I did, and it works, kind of. About 50% of the time R encounters a fatal error when using dcast. melt creates a table containing 524552 obs. of 4 variables, which is then reshaped to 9044 obs. of 59 variables. This may cause a memory spike or something similar, hence the crash. Since I don't use this script often, I'll probably stick to read.csv for now. The goal is to generate a TSV that is used to create cgMLST and wgMLST UpSet plots to compare different clades with UpSetR.
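
For readers following the melt and dcast suggestion in this thread, here is a minimal sketch of transposing a presence/absence table with data.table. The table contents and the column names (`sample`, `locus`, `present`) are made up for illustration and are not taken from the commenter's chewBBACA output; the sketch also says nothing about the crash described above.

```r
library(data.table)

# Made-up presence/absence table: one row per sample, one column per locus.
wide <- data.table(
  sample = c("s1", "s2", "s3"),
  locusA = c(1L, 0L, 1L),
  locusB = c(0L, 0L, 1L)
)

# Long format: one row per sample/locus pair.
long <- melt(
  wide,
  id.vars = "sample",
  variable.name = "locus",
  value.name = "present"
)

# Back to wide format, now with one row per locus and one column per sample.
transposed <- dcast(long, locus ~ sample, value.var = "present")
print(transposed)
```

The long table has one row per sample and locus combination, which is why the intermediate object in a real data set can be far larger than either wide form.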