Generating a ROC curve with ggplot2 in R: Balancing the specificity and sensitivity of ASVs (CC058)

Riffomonas Project

Published: Jul 30, 2024
We've seen that by varying the threshold used to define an ASV we can modulate how often genomes are split into multiple bins and how often different species are lumped together. Receiver operating characteristic (ROC) curves are useful tools for visualizing this tradeoff. In this episode we take a deeper look at the tradeoff and go back to get more data to give us more resolution. This episode is part of a larger arc of episodes investigating the sensitivity and specificity of amplicon sequence variants (ASVs), also known as exact sequence variants (ESVs). ASVs are growing in popularity for analyzing microbial communities using 16S rRNA gene sequences. Pat demonstrates these concepts by live coding at the command line interface using GitHub Flow and RStudio.
0:00 Introduction
3:45 Sensitivity & specificity
10:23 Building a ROC curve
12:04 Sens/Spec for 3%
15:24 Balancing sens & spec
19:27 Getting close to perfect
24:40 Getting more data
31:11 Committing & Conclusion
The accompanying blog post, which contains the exercises and solutions, can be found at www.riffomonas.org/code_club/2...
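As a rough illustration of the tradeoff described above, here is a minimal sketch of how the points on a ROC curve can be computed by sweeping a threshold. It is written in Python rather than the R and ggplot2 used in the video, and it is not the code from the episode. The names `sens_spec` and `roc_points`, the scores, the labels, and the thresholds are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) when a case is called positive
    if its score is at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels, thresholds):
    """Return one (threshold, sensitivity, 1 - specificity) tuple per threshold."""
    points = []
    for t in thresholds:
        sens, spec = sens_spec(scores, labels, t)
        points.append((t, sens, 1 - spec))
    return points


# Hypothetical data: four cases, two true positives and two true negatives.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [False, False, True, True]

for t, sens, fpr in roc_points(scores, labels, [0.0, 0.3, 0.5, 0.9]):
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  1-specificity={fpr:.2f}")
```

Plotting sensitivity against 1 - specificity for each threshold gives the ROC curve. In this toy data, raising the threshold from 0.3 to 0.5 trades sensitivity (1.0 to 0.5) for specificity (0.5 to 1.0), which is the same kind of balance the episode explores for ASV and OTU thresholds.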

Comments • 14

@cdeanj 3 years ago ⁺¹
Hi Dr. Schloss, thanks for the cool video. I am a little confused about the problem you're trying to solve though. What I like about ASVs is that they are just error-corrected amplicons, with no clustering by sequence similarity, which is reflective of reality. When ASVs are clustered in the way you are proposing, as has been the standard practice over the last several years, you risk creating something (i.e., an OTU) that is not reflective of something in nature. However, bacteria do contain multiple copies of 16S, with varying degrees of heterogeneity, so clustering in the way you are proposing may help obtain more accurate estimates of richness and diversity, but that assumes the entire community was sequenced. These were just some thoughts I had while watching your video. Great video, thanks for sharing :)!
@Riffomonas 3 years ago
Hi Christopher - thanks for your comment!
I'll grant you that ASVs represent an error corrected sequence. But, should that be the unit of inference in microbial ecology studies? Should we be splitting one genome of E. coli into 5 bins? That doesn't make sense to me. We don't have a bacterial species concept or at least we don't have one that can be captured by 16S rRNA gene sequences. Short of a massive improvement in our databases, the best we can do are OTUs. OTUs, with the right definition will make E. coli one bin rather than 5. To me, that makes a lot more sense as a unit of inference than a part of a genome. So then the question becomes, what threshold should we use to define an OTU. And here we are :)
I don't follow your comment about needing to assume the entire community was sequenced. We can make relative comparisons of richness, diversity, etc. using OTUs (or ASVs if you'd like).
Hope this helps a bit and thanks again for your comments.
@cdeanj 3 years ago
@Riffomonas Ignore the comment about sequencing depth. Rereading it, I am not sure where I was going with that.
Agreed, I think the microbial ecology community needs to continue a discussion around this because there are certainly pros and cons to each approach.
With regards to your E. coli example: that's assuming you knew that E. coli was in your sample. Depending on the hypervariable region you sequenced (or combination of), those amplicon(s) may not have enough information content available to make that call, as might be the case for two species (whatever we take that to mean) within the same genus sharing a similar V4 region, for example.
In this case, we can both agree that splitting them into separate bins (assuming they are not 100% identical) would be ideal. To be honest, though, I am unfamiliar with how distant some of these regions are among closely related species, so maybe I'm talking about a problem that doesn't exist all that much in reality.
Enjoying the debate. I realize that my comments aren't things you haven't thought about yourself, but making them helps advance my knowledge about this fun topic.
@FreeDataScientist 2 years ago
Hello Professor Riffomonas,
You asked why R doesn't open to your root directory.
You can set the working directory from the Session menu or with setwd("C:/PATH NAME")
And I believe it will always open there thereafter.
Blessings
@Riffomonas 2 years ago
Hi John-Eric - the best way to get things to start in the correct working directory is to create an RStudio project. Then when you double click on the project file, RStudio will open directly in that directory. Using setwd is widely frowned upon because it often isn't reproducible across computers. Also, try to use relative paths from the project root directory rather than absolute paths from the _computer's_ root.
@Norainjoe 3 years ago ⁺¹
This is like trying to steal home, when I'm just trying to put the bat on the ball
@Riffomonas 3 years ago
I think Ichiro will be a first ballot hall of famer. Last I checked Barry is still waiting for a call. Right?! You got this 😂
@adrenalinerush2009 1 year ago
Why is the data missing from the GitHub repository? I checked all the folders. :/
@Riffomonas 4 months ago
You can find the data and the repository at the time of the video at github.com/SchlossLab/Schloss_rrnAnalysis_mSphere_2021/tree/63c560670c128fade5eeef717c8c8a9e8ff081e2
@forpirate4695 2 years ago ⁺¹
Hello, thanks a lot for the video. But I have a feeling that you look very similar to Ryan Reynolds (Deadpool).
@Riffomonas 2 years ago
Ha! I think someone else has said that too. Nope, we aren't related and, surprisingly, I've never seen Deadpool.
@louiseweschler1746 4 months ago
Dear Dr. Schloss, Please tell me how to access the data set for this video. I must apologize for this question ~~~ I think I am missing something obvious and wasting your time.
@Riffomonas 4 months ago ⁺¹
You can see what the repository looked like including the data files at this link... github.com/SchlossLab/Schloss_rrnAnalysis_mSphere_2021/tree/63c560670c128fade5eeef717c8c8a9e8ff081e2
@Riffomonas 4 months ago ⁺¹
You can find the data and the repository at the time of the video at github.com/SchlossLab/Schloss_rrnAnalysis_mSphere_2021/tree/63c560670c128fade5eeef717c8c8a9e8ff081e2