How to clean and join data from mothur with the dplyr R package (CC101)

  • Published: Jul 15, 2024
  • Together with ggplot2, the dplyr R package is the foundation of the tidyverse. In this episode of Code Club, Pat shows how to use dplyr to clean and join data generated from the #mothur software package. He will cover select, rename, rename_all, mutate, separate, pivot_longer, str_replace, str_replace_all, group_by, summarize, inner_join, anti_join, and more. In this overview, you'll get a sense of how powerful dplyr is for working with data.
    Pat will use RStudio and functions from #dplyr and the rest of the tidyverse, further demonstrating the power of #R. The accompanying blog post can be found at www.riffomonas.org/code_club/....
    Do you have a figure that you would like to receive a critique or help improving? Let me know and I'd be happy to arrange a guest appearance!
    If you're interested in taking an upcoming 3-day R workshop, email me at riffomonas@gmail.com!
    R: r-project.org
    RStudio: rstudio.com
    Raw data: github.com/riffomonas/raw_dat...
    Workshops: www.mothur.org/wiki/workshops
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Overview
    6:02 Cleaning up metadata
    8:26 Cleaning up OTU counts table
    11:39 Cleaning up taxonomy data
    17:54 Joining data frames
    21:05 Calculating relative abundances
    23:17 Tidying by taxonomy
    24:53 Conclusion
  • Science
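
As a rough illustration of the kind of pipeline the episode describes, the sketch below cleans a tiny made-up OTU counts table and taxonomy table, joins them, and calculates relative abundances. The data values, the column names (`Group`, `OTU`, `Taxonomy`), and the taxonomy string format are assumptions loosely modeled on mothur output, not the files used in the video.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for an OTU counts table (hypothetical values)
otu_counts <- tibble(
  Group = c("sample_a", "sample_b"),
  Otu001 = c(30, 10),
  Otu002 = c(70, 90)
) %>%
  rename_all(tolower) %>%
  pivot_longer(-group, names_to = "otu", values_to = "count")

# Toy stand-in for a taxonomy table (hypothetical values)
taxonomy <- tibble(
  OTU = c("Otu001", "Otu002"),
  Taxonomy = c("Bacteria(100);Firmicutes(100);", "Bacteria(100);Bacteroidetes(98);")
) %>%
  rename_all(tolower) %>%
  mutate(
    otu = tolower(otu),
    taxonomy = str_replace_all(taxonomy, "\\(\\d+\\)", ""), # drop confidence scores
    taxonomy = str_replace(taxonomy, ";$", "")              # drop trailing semicolon
  ) %>%
  separate(taxonomy, into = c("kingdom", "phylum"), sep = ";")

# Join counts to taxonomy, then compute relative abundance within each sample
otu_rel_abund <- inner_join(otu_counts, taxonomy, by = "otu") %>%
  group_by(group) %>%
  mutate(rel_abund = count / sum(count)) %>%
  ungroup()
```

Swapping `inner_join` for `anti_join` in the last step would instead return the rows of `otu_counts` that have no match in `taxonomy`, which can be a quick check for mismatched OTU names.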

Comments • 35

  • @Riffomonas  3 years ago +3

    Are there any dplyr functions that you would like to learn more about?

    • @KN-tx7sd 2 years ago +1

      relocate (when working with order of column names)

    • @Riffomonas  2 years ago

      @KN-tx7sd thanks! To be honest, I didn't know about relocate. I've always used select to do these types of things
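
For anyone comparing the two approaches mentioned in this thread, here is a minimal sketch of reordering columns with `relocate` versus `select`. The data frame and its column names (`x`, `y`, `z`) are made up for illustration.

```r
library(dplyr)

df <- tibble(x = 1, y = 2, z = 3)

# relocate() moves the named column and leaves the others in their original order
moved_with_relocate <- df %>% relocate(z, .before = x)

# select() with everything() produces the same column order
moved_with_select <- df %>% select(z, everything())
```

Both results have the columns in the order `z`, `x`, `y`; `relocate` also accepts `.after` when a column needs to land in the middle of the table.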

  • @williamvilchezcruz 5 months ago

    Excellent tutorial Sir!

  • @ericagardner8249 5 months ago

    Thank you Pat! You are the best!

  • @dasrotrad 1 year ago +1

    Dang Pat…. Awesome!

    • @Riffomonas  1 year ago

      Thanks! I appreciate you for being on the journey with me🤓

  • @JOHNSMITH-ve3rq 3 years ago +2

    Absolutely love this channel - perfect example why. More videos cleaning messy data - the best part of the process!!

    • @Riffomonas  3 years ago +1

      Thank you - great comment! I appreciate the feedback and will be sure to include more steps cleaning messy data in future episodes

  • @fioredelsud 1 year ago

    OMG just came across this video!! Thank you so much for this!! I have been avoiding doing this myself because after so many sleepless nights with my kids, I have been finding it really hard to concentrate and learn the tidyverse tools with this kind of data. With this super clear video you saved me probably days of struggling to do this. You saved an #academicmom with very little time and sleep. THANK YOU!

  • @sunkumargurung1172 2 years ago +1

    Thanks a lot, it helped me a lot

    • @Riffomonas  2 years ago

      Wonderful - thanks for watching!

  • @romulocenci6176 2 years ago +1

    I did create a project days ago, but didn't even think about how to connect to RStudio. Such valuable information, thanks a lot

    • @Riffomonas  2 years ago

      Hey Romulo - glad this was inspiring!

  • @Rydaholic 2 years ago +1

    Another amazing tutorial! Thank you!

    • @Riffomonas  2 years ago

      Glad you enjoyed it! 🤓

  • @keynesmeetsschumpeterinanarrow 3 years ago +1

    I really like your videos on visualisation but this pivot to data cleaning is very much appreciated. Please consider making videos on missing data visualisations (like those in the naniar package). Thanks!

    • @Riffomonas  3 years ago

      Great suggestion - thanks!

  • @nsaini1029 2 years ago +1

    Pat - these videos are awesome. I'm learning R from scratch, and thanks to you for making this possible!
    I wish you could organize the videos for microbiome analysis so that I can go through them one by one. It seems most of the videos on YouTube are not properly organized currently, and it is hard to locate all the microbiome analysis videos in one list!!

    • @Riffomonas  2 years ago

      Thanks! Have you seen this playlist? ruclips.net/p/PLmNrK_nkqBpIIRdQTS2aOs5OD7vVMKWAi

  • @N1loon 3 years ago +3

    Even though it's a task most people despise, I actually really enjoy pre-processing steps before doing visualizations or creating models. It can be really satisfying reading in a messy dataset and cleaning it so that it's in a tidy format :D
    Although I needed some time to fully wrap my head around the gather/spread functions (now pivot_longer and pivot_wider). And I still struggle from time to time conceptualizing how to get from dataframe X to dataframe Y putting it in either a long or wide format...

    • @Riffomonas  3 years ago +1

      Thanks for watching! I’ll be sure to include more of these types of transformations in future episodes
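
For readers working through the long versus wide question raised in this thread, a small round trip may help. This is only a sketch with made-up sample and OTU names; it shows `pivot_longer` stacking the count columns and `pivot_wider` spreading them back out.

```r
library(dplyr)
library(tidyr)

# Wide format: one row per sample, one column per OTU (hypothetical values)
wide <- tibble(sample = c("s1", "s2"), otu1 = c(5, 0), otu2 = c(3, 8))

# Long format: one row per sample-OTU combination
long <- pivot_longer(wide, cols = -sample, names_to = "otu", values_to = "count")

# Back to wide: the otu values become column names again
wide_again <- pivot_wider(long, names_from = otu, values_from = count)
```

One way to think about it: every column named in `cols` becomes a value in the `names_to` column, so the long table has (number of samples) x (number of OTU columns) rows.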

  • @chengchenli1677 3 years ago +1

    Love this channel and enjoyed every R demo video so far! Thank you! Can you do a video on cleaning and matching sequencing SampleIDs (generated from Illumina, for example) with SampleIDs recorded in metadata? In an ideal situation they should match completely, but oftentimes they only partially match.

    • @Riffomonas  3 years ago

      Thanks! I'm not sure I know what you mean. Can you post a small snippet of what the data look like?

  • @afonsoosorio2099 1 year ago

    Hi Pat, this is great on joining tables using a common id.
    I am an aspiring data analyst and a beginner with R. Do you have an idea how to read multiple *.csv files from a given path into R and append (bind) them in a few explicit steps?
    All files have a common structure (similar headers) and are CSV formatted. There are 12 monthly datasets.
    I appreciate your assistance.
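
One possible approach to the question above, offered as a sketch rather than an answer from the video: list the files, read each one, and stack them with `bind_rows`. The folder name, file names, and columns (`id`, `value`) are hypothetical, and the example writes two small files first so that it runs on its own.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical folder with two small files so the example is self-contained
data_dir <- file.path(tempdir(), "monthly_csvs")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(id = 1:2, value = c(10, 20)), file.path(data_dir, "jan.csv"))
write_csv(tibble(id = 3:4, value = c(30, 40)), file.path(data_dir, "feb.csv"))

# Step 1: find every csv file in the folder
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: read each file; Step 3: stack them, recording the source file name
all_months <- csv_files %>%
  set_names(basename(csv_files)) %>%
  map(read_csv, col_types = cols()) %>%
  bind_rows(.id = "source_file")
```

Keeping the file name in a `source_file` column makes it possible to recover the month later with functions such as `str_replace` or `separate`.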

  • @patriciamiller8286 1 year ago

    What do I do if a few of my taxonomic levels are missing and are replaced by NA? For example, if I have kingdom to order but family and genus are missing ("Kingdom: Bacteria, Phylum: Firmicutes, Class: Bacilli, Order: Bacillales", but nothing else afterwards), then following this video, the family and genus become NA. If I omit them, the entire row disappears (at least that's what I think happens).
    But I still want those rows because they add to the diversity calculations... I may remove them later when I want to discuss taxa specifically, but for alpha and beta diversity I want to keep them in. (Hope this is making sense.)
    Also, I want to remove the Eukaryota rows without having to go into Excel and do it manually.

    • @patriciamiller8286 1 year ago

      FYI: I solved the last question; I removed Eukaryota and Unassigned by using filter(str_detect(taxonomy, "Bacteria")) ...for anyone interested :)

    • @patriciamiller8286 1 year ago

      forgot that str_detect is part of the stringr package

    • @patriciamiller8286 1 year ago

      realizing the video actually answers this but I didn't "get" it the first time. 😝

    • @patriciamiller8286 1 year ago

      Actually, it doesn't, so I added 'mutate(., replace_na(., ""))' in the pipeline after the separate pipe and now I have the blank spaces I needed - for anyone else needing this info hope that helped! Love this channel!!

    • @Riffomonas  1 year ago

      Thanks so much for watching and working with the code using your own data. That’s the best way to learn!
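
For anyone following this thread, the two steps discussed (keeping only bacterial rows and turning missing ranks into blanks) might look something like the sketch below. The taxonomy strings and column names are made up, and `across` with `replace_na` is one possible way to fill the blanks, not necessarily what the commenter ran.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical taxonomy table: one complete lineage, one truncated, one eukaryote
tax_raw <- tibble(
  otu = c("otu1", "otu2", "otu3"),
  taxonomy = c(
    "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus",
    "Bacteria;Firmicutes;Bacilli;Bacillales",
    "Eukaryota;Chordata"
  )
)

cleaned <- tax_raw %>%
  # keep only rows whose taxonomy string mentions Bacteria
  filter(str_detect(taxonomy, "Bacteria")) %>%
  # fill = "right" pads missing ranks with NA instead of warning
  separate(taxonomy,
           into = c("kingdom", "phylum", "class", "order", "family", "genus"),
           sep = ";", fill = "right") %>%
  # replace NA with an empty string in every rank column
  mutate(across(kingdom:genus, ~ replace_na(.x, "")))
```

The truncated lineage keeps its row, with empty strings in `family` and `genus`, so it still contributes to downstream diversity calculations.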

  • @CristinaCampbell 2 years ago

    How would you join similar data? I'm pulling temperature data from several dataloggers in the field. The datasets all have the same column names (except for logger ID). I need the data for all the loggers to be aligned by time and grouped by datalogger, but I'm not sure how to get there. When I inner join by time I end up with several columns of temp (R gives them all unique names), and I'm not sure how to align them in time and then group by datalogger. Thanks for any insight. Love your channel!
    combo combo
    # A tibble: 1,776 x 5
    time f.x dl34 f.y dl35
    1 10/22/2021 15:00 87 1 87 1
    2 10/22/2021 16:00 87 2 87 2

    • @Riffomonas  2 years ago +1

      Try doing the join without the by argument. Alternatively, you could also do by = c("time", "f", etc.)
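
A minimal sketch of the suggestion above, using two made-up logger tables modeled on the columns shown in the question (`time`, `f`, and a logger-specific column). The stacked alternative at the end is an assumption about what "grouped by datalogger" is meant to achieve, not something stated in the reply.

```r
library(dplyr)

# Hypothetical logger tables modeled on the columns in the question
logger34 <- tibble(
  time = c("10/22/2021 15:00", "10/22/2021 16:00"),
  f = c(87, 87),
  dl34 = 1:2
)
logger35 <- tibble(
  time = c("10/22/2021 15:00", "10/22/2021 16:00"),
  f = c(87, 87),
  dl35 = 1:2
)

# Joining on both shared columns avoids the duplicated f.x / f.y columns
joined <- inner_join(logger34, logger35, by = c("time", "f"))

# Alternative (assumed goal): stack the tables and label each row with its logger
stacked <- bind_rows(
  list(dl34 = select(logger34, time, f), dl35 = select(logger35, time, f)),
  .id = "logger"
)
```

Note that joining on `f` keeps only rows where both loggers report the same reading at the same time; the stacked form keeps every reading and can be followed by `group_by(logger)`.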