How to clean and join data from mothur with the dplyr R package (CC101)

  • Published on 4 Jul 2024
  • With ggplot2, the dplyr R package is the foundation of the tidyverse. In this episode of Code Club, Pat shows how to use dplyr to clean and join data generated from the #mothur software package. He will cover select, rename, rename_all, mutate, separate, pivot_longer, str_replace, str_replace_all, group_by, summarize, inner_join, anti_join, and more. In this overview, you'll get a sense of how powerful dplyr is for working with data.
    Pat will use RStudio and functions from #dplyr and the rest of the tidyverse, further demonstrating the power of #R. The accompanying blog post can be found at www.riffomonas.org/code_club/....
    Do you have a figure that you would like to receive a critique or help improving? Let me know and I'd be happy to arrange a guest appearance!
    If you're interested in taking an upcoming 3-day R workshop, email me at riffomonas@gmail.com!
    R: r-project.org
    RStudio: rstudio.com
    Raw data: github.com/riffomonas/raw_dat...
    Workshops: www.mothur.org/wiki/workshops
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Overview
    6:02 Cleaning up metadata
    8:26 Cleaning up OTU counts table
    11:39 Cleaning up taxonomy data
    17:54 Joining data frames
    21:05 Calculating relative abundances
    23:17 Tidying by taxonomy
    24:53 Conclusion
  • Science & Technology
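
The steps listed in the description can be sketched on toy data. This is only an illustration, not the code from the episode: the three small tibbles below are made up, and their column names (`Group`, `OTU`, `Taxonomy`, `sample_id`, `disease_stat`) are assumptions loosely modeled on mothur-style output.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up stand-ins for the metadata, OTU counts, and taxonomy files
metadata <- tibble(
  sample_id = c("S1", "S2"),
  disease_stat = c("Case", "Control")
)

otu_counts <- tibble(
  Group = c("S1", "S2"),
  Otu001 = c(30, 10),
  Otu002 = c(70, 90)
)

taxonomy <- tibble(
  OTU = c("Otu001", "Otu002"),
  Taxonomy = c(
    "Bacteria(100);Firmicutes(100);Clostridia(98);",
    "Bacteria(100);Bacteroidetes(100);Bacteroidia(100);"
  )
)

# Wide counts to long format: one row per sample/OTU pair
otu_long <- otu_counts %>%
  rename(sample_id = Group) %>%
  pivot_longer(-sample_id, names_to = "otu", values_to = "count")

# Lowercase the column names, strip confidence scores and the trailing
# semicolon, then split the taxonomy string into one column per level
taxonomy_clean <- taxonomy %>%
  rename_all(tolower) %>%
  mutate(
    taxonomy = str_replace_all(taxonomy, "\\(\\d+\\)", ""),
    taxonomy = str_replace(taxonomy, ";$", "")
  ) %>%
  separate(taxonomy, into = c("kingdom", "phylum", "class"), sep = ";")

# Join the three tables and compute relative abundance within each sample
otu_rel_abund <- inner_join(metadata, otu_long, by = "sample_id") %>%
  inner_join(taxonomy_clean, by = "otu") %>%
  group_by(sample_id) %>%
  mutate(rel_abund = count / sum(count)) %>%
  ungroup()
```

Swapping `inner_join` for `anti_join(metadata, otu_long, by = "sample_id")` would instead list the metadata rows that have no matching counts, which is a quick way to spot mismatched sample IDs.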

Comments • 35

  • @Riffomonas  3 years ago +3

    Are there any dplyr functions that you would like to learn more about?

    • @KN-tx7sd  2 years ago +1

      relocate (when working with order of column names)

    • @Riffomonas  2 years ago

      @KN-tx7sd thanks! To be honest, I didn't know about relocate. I've always used select to do these types of things.
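
A small comparison of the two approaches mentioned in this thread, on a made-up one-row tibble (the column names are assumptions for illustration only):

```r
library(dplyr)

df <- tibble(otu = "Otu001", count = 30, sample_id = "S1")

# relocate() moves the named column to the front by default
# and leaves the remaining columns in their original order
moved <- df %>% relocate(sample_id)

# The select() idiom: name the column first, then everything else
selected <- df %>% select(sample_id, everything())

# relocate() can also place a column relative to another one
after_id <- df %>% relocate(count, .after = sample_id)
```

Both `moved` and `selected` end up with the column order `sample_id`, `otu`, `count`; `relocate()` mainly saves typing when only one or two columns need to move.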

  • @JOHNSMITH-ve3rq  3 years ago +2

    Absolutely love this channel - perfect example why. More videos cleaning messy data - the best part of the process!!

    • @Riffomonas  3 years ago +1

      Thank you - great comment! I appreciate the feedback and will be sure to include more steps cleaning messy data in future episodes

  • @fioredelsud  11 months ago

    OMG just came across this video!! Thank you so much for this!! I have been avoiding doing this myself because after so many sleepless nights with my kids, I have been finding it really hard to concentrate and learn the tidyverse tools with this kind of data. With this super clear video you saved me probably days of struggling to do this. You saved an #academicmom with very little time and sleep. THANK YOU!

  • @williamvilchezcruz  5 months ago

    Excellent tutorial Sir!

  • @ericagardner8249  5 months ago

    Thank you Pat! You are the best!

  • @romulocenci6176  2 years ago +1

    I did create a project days ago, but didn't even think about how to connect it to RStudio. Such valuable information, thanks a lot

    • @Riffomonas  2 years ago

      Hey Romulo - glad this was inspiring!

  • @Rydaholic  2 years ago +1

    Another amazing tutorial! Thank you!

    • @Riffomonas  2 years ago

      Glad you enjoyed it! 🤓

  • @keynesmeetsschumpeterinanarrow  3 years ago +1

    I really like your videos on visualisation but this pivot to data cleaning is very much appreciated. Please consider making videos on missing data visualisations (like those in the naniar package). Thanks!

    • @Riffomonas  3 years ago

      Great suggestion - thanks!

  • @sunkumargurung1172  2 years ago +1

    Thanks a lot, it helped me a lot

    • @Riffomonas  2 years ago

      Wonderful - thanks for watching!

  • @dasrotrad  1 year ago +1

    Dang Pat…. Awesome!

    • @Riffomonas  1 year ago

      Thanks! I appreciate you for being on the journey with me🤓

  • @nsaini1029  2 years ago +1

    Pat - these videos are awesome.. learning R from scratch and thanks to you to make this possible!
    I wish you can organize videos for microbiome analysis where I can go through them one by one. It seems most of the videos on youtube are not properly organized currently and hard to locate all microbiome analysis videos - in one list!!

    • @Riffomonas  2 years ago

      Thanks! Have you seen this playlist? th-cam.com/play/PLmNrK_nkqBpIIRdQTS2aOs5OD7vVMKWAi.html

  • @N1loon  3 years ago +3

    Even though it's a task most people despise, I actually really enjoy pre-processing steps before doing visualizations or creating models. It can be really satisfying reading in a messy dataset and cleaning it so that it's in a tidy format :D
    Although I needed some time to fully wrap my head around the gather/spread functions (now pivot_longer and pivot_wider). And I still struggle from time to time conceptualizing how to get from dataframe X to dataframe Y putting it in either a long or wide format...

    • @Riffomonas  3 years ago +1

      Thanks for watching! I’ll be sure to include more of these types of transformations in future episodes
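
For readers working through the long/wide conversion discussed in this thread, a minimal round trip on a made-up counts table may help (all names and values are assumptions):

```r
library(dplyr)
library(tidyr)

# Wide format: one row per sample, one column per OTU
wide <- tibble(
  sample_id = c("S1", "S2"),
  Otu001 = c(30, 10),
  Otu002 = c(70, 90)
)

# Long format: column names become values in "otu",
# cell values become values in "count"
long <- wide %>%
  pivot_longer(-sample_id, names_to = "otu", values_to = "count")

# Back to wide: the inverse operation
wide_again <- long %>%
  pivot_wider(names_from = otu, values_from = count)
```

One way to plan the conversion is to ask which column headers are really data values. Here the OTU names are, so they go into `names_to` on the way to long format and come back out through `names_from`.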

  • @chengchenli1677  3 years ago +1

    Love this channel and enjoyed every R demo video so far! Thank you! Can you do a video on cleaning and matching sequencing SampleIDs (generated from Illumina, for example) with SampleIDs recorded in metadata? In an ideal situation, they should completely match, but oftentimes they only partially match.

    • @Riffomonas  3 years ago

      Thanks! I'm not sure I know what you mean. Can you post a small snippet of what the data look like?

  • @afonsoosorio2099  1 year ago

    Hi Pat, this is great on joining tables using a common id.
    I am an aspiring data analyst and a beginner with R. Do you have an idea how to read multiple files from a given path (*.csv) into R and append (bind) them in a few explicit steps?
    All files have a common structure (similar headers) and are CSV formatted. There are 12 monthly datasets.
    I appreciate your assistance.
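
This question has no reply in the thread. One possible approach, sketched under assumptions: the folder name, file names, and columns below are hypothetical, and two small files are written to a temporary directory only so the example runs on its own.

```r
library(dplyr)
library(readr)
library(purrr)

# Hypothetical folder of monthly files, created here just for the demo
data_dir <- file.path(tempdir(), "monthly_csvs")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, value = c(10, 20)), file.path(data_dir, "jan.csv"))
write_csv(tibble(id = 3:4, value = c(30, 40)), file.path(data_dir, "feb.csv"))

# Step 1: list every CSV file in the folder
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: name each path by its file name so the source can be tracked
csv_files <- set_names(csv_files, basename(csv_files))

# Step 3: read each file into a list of data frames
monthly <- map(csv_files, read_csv, show_col_types = FALSE)

# Step 4: stack the data frames, recording the file each row came from
combined <- bind_rows(monthly, .id = "source_file")
```

This relies on the files sharing the same column names and types, as described in the question; `bind_rows()` fills columns that are missing from a file with `NA`.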

  • @patriciamiller8286  1 year ago

    What do I do if a few of my taxonomic classes are missing and are replaced by NA? For example, if I have Kingdom to Order but Family and Genus are missing ("Kingdom: Bacteria, Phylum: Firmicutes, Class: Bacilli, Order: Bacillales" but nothing else afterwards), following this video, the family and genus become NA. If I omit the NAs, the entire row disappears (at least that's what I think happens).
    But I still want those rows because they add to the diversity calculations... I may remove them later when I want to discuss taxa specifically, but for alpha and beta diversity I want to keep them in. (Hope this is making sense)
    Also, I want to remove the Eukaryota rows without having to go into Excel and do it manually.

    • @patriciamiller8286  1 year ago

      FYI: I solved the last question; I removed Eukaryota and Unassigned by using filter(str_detect(taxonomy, "Bacteria"))... for anyone interested :)

    • @patriciamiller8286  1 year ago

      forgot that str_detect is part of the stringr package

    • @patriciamiller8286  1 year ago

      realizing the video actually answers this but I didn't "get" it the first time. 😝

    • @patriciamiller8286  1 year ago

      Actually, it doesn't, so I added 'mutate(., replace_na(., ""))' in the pipeline after the separate pipe and now I have the blank spaces I needed. For anyone else needing this info, hope that helped! Love this channel!!

    • @Riffomonas  1 year ago

      Thanks so much for watching and working with the code using your own data. That’s the best way to learn!
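
The two fixes described in this thread can be sketched together on made-up data. The taxonomy strings and level names are assumptions, and the `across()` form of the `replace_na()` step is a swapped-in variant, not the commenter's exact code.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up taxonomy table: one full lineage, one that stops at order,
# and one eukaryote that should be dropped
taxonomy <- tibble(
  otu = c("Otu001", "Otu002", "Otu003"),
  taxonomy = c(
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus",
    "Bacteria;Firmicutes;Bacilli;Bacillales",
    "Eukaryota;Chlorophyta"
  )
)

tax_levels <- c("kingdom", "phylum", "class", "order", "family", "genus")

cleaned <- taxonomy %>%
  # Keep only rows whose taxonomy string mentions Bacteria
  filter(str_detect(taxonomy, "Bacteria")) %>%
  # fill = "right" pads short lineages with NA without a warning
  separate(taxonomy, into = tax_levels, sep = ";", fill = "right") %>%
  # Replace the NA levels with empty strings so the rows are kept
  mutate(across(all_of(tax_levels), ~ replace_na(.x, "")))
```

A pattern anchored at the start of the string, such as `"^Bacteria"`, may be safer if other kingdoms could contain the word "Bacteria" further along the lineage.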

  • @CristinaCampbell  2 years ago

    How would you join similar data? I'm pulling temperature data from several dataloggers in the field. The datasets all have the same column names (except for logger ID). I need the data for all the loggers to be aligned by time and grouped by datalogger but I'm not sure how to get there. When I inner join by time I end up with several columns of temp (R gives them all unique names), I'm not sure how to align them in time and then group by datalogger. Thanks for any insight. Love your channel!
    combo
    # A tibble: 1,776 x 5
      time               f.x  dl34   f.y  dl35
    1 10/22/2021 15:00    87     1    87     1
    2 10/22/2021 16:00    87     2    87     2

    • @Riffomonas  2 years ago +1

      Try doing the join without the by argument. Alternatively, you could also do by = c("time", "f", etc.)
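
A sketch of the difference this suggestion makes, using two made-up logger tables shaped like the output in the question (the values are assumptions, and `f` is assumed to match across loggers at each time point):

```r
library(dplyr)

# Made-up data for two loggers
dl34 <- tibble(
  time = c("10/22/2021 15:00", "10/22/2021 16:00"),
  f = c(87, 87),
  dl34 = c(1, 2)
)
dl35 <- tibble(
  time = c("10/22/2021 15:00", "10/22/2021 16:00"),
  f = c(87, 87),
  dl35 = c(1, 2)
)

# Joining on time alone: the shared f column is duplicated as f.x and f.y
by_time <- inner_join(dl34, dl35, by = "time")

# Joining on every shared column keeps a single f column
by_both <- inner_join(dl34, dl35, by = c("time", "f"))

# Leaving out by makes dplyr join on all shared columns (with a message)
natural <- inner_join(dl34, dl35)
```

Note that if `f` differs between loggers at the same time point, joining on both columns drops those rows, so this only fits when the shared columns really should agree.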