StatQuest: DESeq2, part 1, Library Normalization

StatQuest with Josh Starmer

100K views

Published: 26 Nov 2024

Comments • 144

@statquest 4 years ago ⁺¹⁵
Correction:
9:28 I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from each log-transformed gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@deni9264 6 years ago ⁺²⁷
Public Disclaimer: Watching Josh's introductory vids on RNAseq analysis, including this video, (sometimes more than once ;-) ) is a useful primer if you're just starting off in RNA-seq analysis. Watching these videos helped me make sense of the RNA-seq DE pipeline, such as the nature of the inputs and the rationale of the methods and metrics.
@statquest 6 years ago
Awesome!!! Thanks for the endorsement! :)
@sureshkumar-kx2xz 4 years ago ⁺¹¹
I am a neuroscientist from MIT with no previous background in RNA-seq and molecular biology. This video summarized DESeq2 in 12 minutes, which is super cool!!! I quickly understood DESeq2 in 12 mins
@statquest 4 years ago ⁺³
BAM! :)
@jeremyjacobsen6400 1 year ago ⁺¹
I wrote a python script to do this procedure and found what might be an error because of your tutorial. In particular I was filtering inf(s) before calculating the median of the logs. Thanks Josh!
@statquest 1 year ago
bam!
@Seeeevi 3 years ago ⁺³
I have to do a talk on DESeq2 for a data analysis course and this is what I'm starting it off with. Thank you a lot, seriously.
@statquest 3 years ago
Glad it was helpful!
@vanya.antonov 7 years ago ⁺¹
I had a hard time understanding from the original paper how the normalization coefficients are computed. This video helps a lot! Thank you!
@jonathanjavid7206 3 years ago ⁺¹
Well explained!!! These steps give a clearer picture of DEG analysis with DESeq2. Thank you Josh for this valuable video.
@statquest 3 years ago
Thanks!
@deni9264 6 years ago ⁺³
Thank you so much Josh!! Your explanation has helped overcome my anxiety about learning DESeq2...it's complex but not so bad. Thaaank you. I'm sharing this vid with my comp.bio journal club, too
@deni9264 6 years ago
Do you plan on doing an overview of limma-voom as well?
@statquest 6 years ago
A lot of people have started to ask about that. I'll put it onto the to-do list and look into it.
@Reonsi 3 years ago
As a summary, I would say that the geometric average downplays the effect of outliers at the gene level (rows), while the median downplays outliers at the sample level (column). The subtraction allows us to rank samples by sequencing depth, and the division applies the scaling factor to our original data.
@statquest 3 years ago ⁺¹
:)
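A minimal sketch of the procedure summarized above, in plain Python. The gene names and counts are made up for illustration, and the function names `size_factors` and `normalize` are hypothetical. It follows the steps as they are described in the video and in these comments, not the DESeq2 source code.

```python
import math
from statistics import median

def size_factors(counts):
    """counts maps a gene name to a list of raw read counts, one per sample."""
    n_samples = len(next(iter(counts.values())))
    log_diffs = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Genes with a zero count in any sample would give log(0), so skip them
        # here. They are only left out of the factor estimate, not removed.
        if any(c == 0 for c in row):
            continue
        logs = [math.log(c) for c in row]
        mean_log = sum(logs) / len(logs)  # log of the gene's geometric mean
        for j, value in enumerate(logs):
            log_diffs[j].append(value - mean_log)
    # Per sample: median of the log differences, converted back with exp()
    return [math.exp(median(column)) for column in log_diffs]

def normalize(counts, factors):
    """Divide every raw count (including zero-count genes) by its sample's factor."""
    return {gene: [c / f for c, f in zip(row, factors)]
            for gene, row in counts.items()}

# Hypothetical toy table: two samples, sample 2 sequenced twice as deeply
toy = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 50],   # zero in sample 1, so ignored when estimating factors
}
factors = size_factors(toy)
normalized = normalize(toy, factors)
print(factors)
print(normalized["geneA"])
```

In this toy table every usable gene has twice as many reads in sample 2, so the two factors come out near 0.71 and 1.41, and geneA ends up with the same normalized value in both samples. geneD is skipped when the factors are estimated but is still scaled afterwards.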
@ElNick09 5 years ago ⁺⁷
These videos are an absolutely fantastic resource. Really, thank you so much!
@statquest 5 years ago ⁺¹
I'm so happy to hear that you like them! :)
@Mortezakhabiri1 7 years ago
The way that you are explaining is amazing! I was looking for such an explanation for a long time. It is very comprehensive. Thanks!
@Mortezakhabiri1 7 years ago
Great! Great! Great!
It would be great also if you can introduce some well done books!
Thanks!
@Mortezakhabiri1 7 years ago
I could not find any, too!
Thanks!
@congchen170 7 years ago ⁺²²
Found a small mistake from the video: when you explain the library sizes (around 2 minutes), the Sample #2 Gene A2M read counts should be 1126, not 2126.
@romanatorx3949 5 years ago
I was thinking exactly the same :)
@MrZanvine 7 years ago ⁺³
Hey Joshua, thanks so much for making these videos; they are immensely helpful.
I think I noticed a small mistake when you transform the median values into normal numbers. For Sample 2 you have e^0.3, but the median is -0.1.
@footboro 7 years ago ⁺¹
Very nice tutorial, effortless stat learning.
Thank you, Joshua
@williammo4450 4 years ago ⁺²
I like this guy! Thanks for your careful explanation! Keep it up!
@statquest 4 years ago
Thank you very much! :)
@apulunuj 5 years ago ⁺¹
Also, what would you do if you wanted to look at the log infinity values, that is, the cell-type-specific genes (@7:57)?
@statquest 5 years ago ⁺¹
You can always add a "pseudo-count" to the data, like one read for all genes, so that you can avoid the log infinity problem.
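A small sketch of the pseudo-count idea from the reply above, with made-up counts. The function name `log_with_pseudocount` and the default of one read are illustrative choices. This is a general workaround for exploring the data on a log scale, not the step the video describes for computing scaling factors, where genes with zeros are simply left out.

```python
import math

def log_with_pseudocount(counts, pseudocount=1):
    """Add a small constant to every count so that log() is defined for zeros."""
    return [math.log(c + pseudocount) for c in counts]

raw = [0, 9, 99]                 # hypothetical reads for one gene in three samples
logged = log_with_pseudocount(raw)
print(logged)                    # the zero count maps to log(1) = 0, not -infinity
```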
@zuhaibahmed6817 4 years ago ⁺²
Thanks for your videos! They really are a huge help. I just have a question about your explanation for differences in library composition at 3:41. I'm not sure I follow. The way I see it, if those 563 reads don't map to A2M, they aren't going to just move onto other genes to inflate their counts. So the only reason that the other genes in library 2 have higher counts is because they had more reads that matched their sequence, indicating that their transcripts were more abundant. Which would mean those other genes are differentially expressed as well, right? If only A2M was differentially expressed, then those other genes would retain their small counts because they aren't transcribed any more than in library 1. Am I misunderstanding something? Thanks
Edit: I have two other questions as well, if you don't mind:
1) Does this method of normalization take into account the lengths of the different transcripts like TPM/RPKM/FPKM?
2) Is this method more robust than TPM/RPKM/FPKM? If so, then should it be used instead of them?
Sorry for the onslaught of questions. Thanks for the help!
@statquest 4 years ago
In the example at 3:41, there are 635 reads sequenced per sample (yes, these numbers are small compared to a true RNA-seq experiment, but this is just an example). Now, when we do RNA-seq, we extract the mRNA from cells (or a single cell) and then we amplify it with PCR before making the final library that is sequenced. The PCR ensures that we have a lot of stuff to sequence, so much stuff that there is more than we can actually sequence. Thus the example plays out in reality the way it does in this example in the video. When one gene soaks up a lot of reads in one sample, but not in another, then that just means there are more reads going to other genes in the other sample.
This method does not account for transcript lengths, nor should it. DESeq2's model depends only on the number of reads per gene, not the lengths.
Lastly, TPM/FPKM/etc. are useful when just looking at the data and comparing genes of different lengths.
@zuhaibahmed6817 4 years ago
@statquest Thanks for the clarification
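To make the library-composition point in the reply above concrete, here is a toy sketch with invented numbers (the gene names, counts, and the helper `median_ratio_factors` are all hypothetical). Both samples get the same total number of reads, but one gene soaks up most of the reads in sample 1 only. The helper takes the median of the ratios to each gene's geometric mean directly, which is a slight variation on the video's "median of the logs, then exponentiate" wording.

```python
import math
from statistics import median

# Hypothetical counts: [sample 1, sample 2]; "hog" is expressed only in sample 1
counts = {
    "hog":   [600, 0],
    "gene1": [100, 250],
    "gene2": [100, 250],
    "gene3": [100, 250],
    "gene4": [100, 250],
}

# Same sequencing depth in both samples, so total-count scaling changes nothing
totals = [sum(row[j] for row in counts.values()) for j in range(2)]

def median_ratio_factors(table):
    """Per-sample median of count / geometric mean, using genes with no zeros."""
    usable = [row for row in table.values() if all(c > 0 for c in row)]
    n = len(usable[0])
    factors = []
    for j in range(n):
        ratios = []
        for row in usable:
            geo_mean = math.exp(sum(math.log(c) for c in row) / n)
            ratios.append(row[j] / geo_mean)
        factors.append(median(ratios))
    return factors

factors = median_ratio_factors(counts)
gene1_normalized = [c / f for c, f in zip(counts["gene1"], factors)]
print(totals)             # identical library sizes
print(gene1_normalized)   # the apparent 2.5-fold difference disappears
```

With these numbers, gene1 to gene4 look 2.5 times higher in sample 2 even though the totals match, simply because sample 2 has no reads going to "hog". After dividing by the factors, gene1 has the same value in both samples.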
@andydavidson3097 10 months ago ⁺¹
Request: great video on DESeq2 normalization. We already know what the counts are. What I do not understand is how the linear model for each gene is used to calculate the LFC. I really appreciate your explanations
@statquest 10 months ago ⁺¹
Thank you! Unfortunately I haven't done this sort of analysis in a long time so I can't promise I'll follow up on it. :(
@fantasy6611 6 years ago ⁺¹
Another small mistake I found is that, around 10 mins, sample #2 should be e^-0.1=0.9. Anyway, thanks a lot!
@maharshichakraborty3530 6 years ago ⁺²
Great video! Would have been nice if you could have talked about the negative binomial distribution fitting
@statquest 6 years ago ⁺¹
One day I'll get to that part. Hopefully soon.
@nastiaskuba8773 10 months ago
There is a problem at 2:13: the A2M gene in sample 2 should have 1126 reads, not 2126. Anyway, thank you for the video, very useful for beginners, and in general nice and unique style!
@statquest 10 months ago
Sorry for the typo, but I'm glad it didn't get in the way of you understanding the ideas. BAM! :)
@stevebarratt888 2 years ago ⁺¹
Such a great explanation!
@statquest 2 years ago
Glad you think so!
@tinacole1450 3 years ago ⁺¹
Your explanations are very good. Thanks !!! The song is funny
@statquest 3 years ago
Thank you! 😃
@Aviad3587 3 years ago ⁺¹
i
@statquest 3 years ago
:)
@bzaruk 3 years ago
First of all - I LOVE your stuff! so helpful and clear!
quick question though - I have an RNA-Seq of some experiments for 4 different cell-lines, each cell line has 3 biological replicates with 3 technical replicates each - I want to do some normalization on that RNA-Seq results to compare between the cell lines.
You mentioned in the video that DESeq wasn't meant to do normalization between different reads count but between different cells - which is exactly what I am doing - BUT - I do have some delta between the reads of each technical replicate, especially between the 1st biological replicate against both the 2nd and the 3rd biological replicates due to different PCR cycles.
My question is - do I need to perform any kind of normalization based on the reads before I do the DESeq normalization?
@statquest 3 years ago ⁺¹
Nope! At 4:18 we see that DESeq2 (and edgeR) can take care of both situations - when there are differences in library sizes and when there are differences in library composition.
@mrlolzot 5 years ago ⁺¹
Great stuff dude. Thanks for making this.
@katherinemedinaortiz1935 2 years ago ⁺¹
These videos are awesome
@statquest 2 years ago
Thank you! :)
@alfred532008 7 years ago
Do you have a video explaining more technical aspects of DESeq2, please? E.g., the GLM fitting (eq. 2 in the DESeq2 paper), estimation of dispersion, and estimation of logarithmic fold changes.
@lactobacillusacidophilus 5 years ago
One question. DESeq2 uses negative binomial regression, so after applying scaling factors, does it also round the normalized numbers to make a real count table of normalized values? Otherwise, can we still use the negative binomial?
@muffinman1 5 years ago ⁺¹
Fantastic explanation.
@statquest 5 years ago
Thanks! :)
@lycz9869 3 years ago
Around 12:20 you say that the idea of logs and median is to look at house keeping genes and to eliminate all genes which are only transcribed in one sample. But why should we do this? If we knock out a transcription factor to find its function this is exactly what we are interested in. Or does this method serve a different purpose?
Thank you!
@statquest 3 years ago ⁺¹
At this stage, all we are interested in is normalizing the read counts to compensate for differences in sequencing depth and library composition. Later, once the read counts are normalized, then we will use statistics to identify differentially expressed genes.
@nishantshade668 3 years ago
The scaling factor which you mentioned at 4:46, is it the same as the work done by the 'Estimate Size Factor' function (estimateSizeFactors) in R??
@statquest 3 years ago
Unfortunately it's been so long since I used DESeq2 that I can't remember.
@alfred532008 7 years ago
Is there any obvious reason for using geometric mean instead of arithmetic mean when calculating a scaling factor?
@mayling1014 8 months ago
Thank you so much for the great explanation!
2:08 May I know, if all the samples were sequenced at the same time (same sequencing reaction), will the sequencing depth still be different?
@statquest 8 months ago
I believe so, because you'll still end up with different numbers of reads per sample.
@mayling1014 8 months ago
@statquest Does this imply that even if the sequencing depth is standardized to 20x coverage across all samples, the number of reads corresponding to transcripts of gene A may still vary between samples, even if the expression level of gene A is the same in both sample 1 and sample 2?
@statquest 8 months ago
@mayling1014 I believe there is a stochastic (random) nature to the hybridization between the reads and the chip used for sequencing. So there is a chance that not every sample gets exactly the same number of reads because not every sample binds to exactly the same number of spots on the chip. And not every read is the same quality, and that could also result in different numbers of reads per sample after you filter out low quality reads.
@CaveCrack 4 years ago
Josh, thanks for your wonderful series of videos. I have a question about using the DESeq2 normalization method on TPM data. I have TPM from RSEM output, each sample of course sums to 1 million. It seems that using DESeq2 style normalization on this TPM data would be valuable as it will adjust for library composition. I am not using R, so I'm not using the DESeq2 Bioconductor package, just computing the normalization as you describe. Documentation on the DESeq2 package says the counts should be raw counts, however it seems that TPM would be just as valid if normalization is the only step of interest. Is this correct? Thanks
@statquest 4 years ago
DESeq2's normalization assumes the data are raw because it does part of what TPM attempts to do, compensate for sequencing depth differences. When you start with TPM values, DESeq2 can no longer make that adjustment the way it wants to.
@godsperson5571 3 years ago
@statquest so is it good or bad?
@Moominverdatre 4 years ago
Thanks for the great video. What you call "scaling factor" is the output of the function estimateSizeFactors, right? The name is a little bit misleading for someone who's already very confused with all the different normalisation methods!
@statquest 4 years ago ⁺¹
I believe that is correct.
@manuelsokolov 1 year ago ⁺¹
If I have data already in TPM (transcripts per million), can I still apply DESeq2?
@statquest 1 year ago
Nope.
@dhkwnr97 6 years ago
It's really helpful for my research. THANK YOU A LOT!
@AnnaJeanine 7 years ago
Another amazing video!
@haitrieuphan3832 7 years ago
Thank you so much for very useful videos
@apulunuj 5 years ago
Regarding the samples for each DESeq analysis: could they be different biological replicates, or does each sample correspond to a different cell type?
@statquest 5 years ago
It could be anything - it could be technical replicates, biological replicates or different cell types. Whatever it is you want to study.
@tomy34188 3 years ago
So if you want to investigate cell differentiation using RNA-seq data, would it be wise to apply DESeq2? Because non-housekeeping genes would also be of interest here, I assume, and those would be filtered out with DESeq2, or am I mistaken?
@statquest 3 years ago
Yes, I think DESeq2 would be a good tool for that.
@xuxiaochenwu9376 2 years ago
Hi, I noticed a mistake @10:37: for sample #2, e should be raised to -0.1 instead of -0.3. Correct me if I am wrong.
@statquest 2 years ago
Yep, that's a typo.
@michelepierotti2833 4 years ago ⁺²
Average of logs is not the same as log of averages! Around 9:19 you are saying log(reads for geneX) - log(average for geneX) = log of the ratio, correctly. But what you calculated in step 2 is not the log(average for gene X) but the average of the log(reads). If a, b, c were the read counts for the 3 samples for say GENE3, the average you calculated in the example step 2 is (loga +logb + logc)/3. This, in your example is Average of log reads. But when you go on to discuss the logratio you are treating it as the log(average), the log of [ (a+b+c)/3], i.e. the log of the average. These 2 quantities are not the same thing obviously, So either you are wrong in the example at step 2 or you are wrong later when you treat it as a log(average) while you had calculated the average of logs. Could you help clarify and ideally correct the example in the video?
@statquest 4 years ago
You are correct. This error had been noted before in the video's description, and now I have made a pinned comment so that it is easier to see. Sorry for the confusion.
@michelepierotti2833 4 years ago ⁺¹
@statquest Thanks for clarifying and doing it so fast.
@michelepierotti2833 4 years ago
@statquest "log(reads for gene X) - average(log values for gene X)." Then the interpretation in the box is false and we should ignore that, too, right? You have no difference of logs, so no log of ratio, so it is not true that "we are really checking out the ratios of the reads in each sample to the average across samples".
@michelepierotti2833 4 years ago
So how do we move from the corrected expression "log(reads for gene X) - average(log values for gene X)" to the next step, where we are working with "log(ratio reads_for_gene_X / average_reads_for_gene_X)"? What am I missing?
@statquest 4 years ago
@michelepierotti2833 You don't do that next step. We don't have a ratio, we just have a difference, or a "residual", from the geometric mean.
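One way to connect this thread to the pinned correction, using only the standard rules for logarithms. The symbols are introduced here for illustration: $x_j$ is the read count for gene X in sample $j$, and $n$ is the number of samples. Under that notation, the corrected expression can still be read as a log ratio, provided the "average" in the denominator is the geometric mean rather than the arithmetic mean:

```latex
\log x_j - \frac{1}{n}\sum_{k=1}^{n} \log x_k
  = \log x_j - \log\Bigl(\prod_{k=1}^{n} x_k\Bigr)^{1/n}
  = \log \frac{x_j}{\bigl(\prod_{k=1}^{n} x_k\bigr)^{1/n}}
```

So the "residual" from the average of the logs equals the log of the ratio between the sample's reads and the gene's geometric mean. It is not the log of the ratio to the arithmetic mean $\frac{1}{n}\sum_k x_k$, which is the quantity the original slide appeared to describe.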
@gauss238 7 years ago
Please post part 2 soon.
@leixiao169 3 years ago
Thanks for the really helpful video! If DESeq2 removes genes that have 0 reads, does this affect results interpretation? For example, different tissues express different genes (in some tissues the expression of certain genes is 0), for some "0" expression genes in certain tissues, the difference between these tissues and the tissues in which these genes are highly expressed is physiologically relevant. I hope the program still keeps these "0" read genes.
@statquest 3 years ago ⁺¹
Yes, it keeps those genes (with 0 reads), however, those genes are not used to calculate the scaling factor.
@leixiao169 3 years ago
@statquest Thanks Josh, if DESeq2 keeps those genes with 0 reads, then it is possible that those genes with 0 reads will be listed as significantly differentially expressed genes in the volcano plot, do I understand right?
@statquest 3 years ago ⁺¹
@leixiao169 Presumably.
@bzaruk 2 years ago
How would you do a differential expression analysis between multiple cell lines? Do them in pairs and then find the shared highly differentially expressed genes? Or is there a way of doing it in one analysis?
@statquest 2 years ago ⁺¹
This is a good question. Unfortunately it's been a while since I used DESeq2, however, I remember that you can pretty much do any sort of "linear model" type test, so you should be able to do ANOVA or something like that.
@bzaruk 2 years ago ⁺¹
@statquest Thanks! appreciate it!
@aealarco 7 years ago
Thank you very much, it was very useful
@vigneshparasuraman 6 years ago
Can anyone help me with normalizing Excel data in DESeq2? Where can I find a clear script?
@hommejuhyun 5 years ago
Thank you for your good explanation! Umm... So DESeq2 is only used to find moderately expressed genes in different tissues, right?
@statquest 5 years ago ⁺¹
DESeq2 can find differentially expressed genes among different tissues, or within the same tissue if, for example, one is diseased and the other is healthy.
@hommejuhyun 5 years ago ⁺¹
@statquest Oh, I see!! Thank you for your good example.
I have one more question :)
Is there a name for DESeq2-normalized values, like TPM or RPKM?
@statquest 5 years ago
@hommejuhyun Not that I know of.
@couunderbarz 4 months ago ⁺¹
Thanks!
@statquest 4 months ago
TRIPLE BAM!!! Thank you for supporting StatQuest!!! :)
@lealemler2967 3 years ago
Thank you very much. On which paper is this based?
@statquest 3 years ago
The original DESeq2 manuscript.
@lealemler2967 3 years ago
@statquest Thank you, but there are many DESeq2 papers. Do you mean this one: "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2", Michael I Love, Wolfgang Huber and Simon Anders, 2014? Thank you
@statquest 3 years ago ⁺¹
Yes, that's the one. I also went through the code to see exactly what it was doing.
@lealemler2967 3 years ago ⁺¹
@statquest thank you very much!! :) :)
@jyoti9426 3 years ago ⁺¹
It's the intro song for me! \m/
@statquest 3 years ago
bam! :)
@bzaruk 2 years ago
DESeq2 with only one replicate for each group - is it possible? if not, is there any good alternative to detect differential gene expression for one replicate per cell line?
@statquest 2 years ago ⁺¹
I'm almost certain it can. I know I've done it with edgeR before. The manual for edgeR gives an example and tells you how to set certain parameters that are usually estimated when you have more data. Presumably you can do something similar with DESeq2.
@kiarashbike 3 years ago
Hey, sorry, is this what the DESeq and vst functions are doing in the DESeq2 package?
@statquest 3 years ago
I'm not sure I understand your question. Can you rephrase it?
@kiarashbike 3 years ago
@statquest Oh, sorry, I should have explained that more specifically. When we run code in R using the DESeq2 package to analyse RNA-seq data, we have to do a data transformation using the vst (variance-stabilizing transformation) function. In this video, you explained nicely what DESeq2 does for normalizing RNA-seq data. I'm asking whether this normalization does the same thing as the vst function in R?
@statquest 3 years ago
@kiarashbike I believe VST is different.
@aoihana1042 6 years ago
A 1000 Likes! Thank you Josh! 😭😭🙏🙏🙏
@fmetaller 6 years ago ⁺¹
Hi, do you know a DESeq2 alternative in Python?
@statquest 6 years ago ⁺¹
Unfortunately I don't know of anything like DESeq2 for Python.
@fmetaller 6 years ago ⁺¹
I'm an undergraduate medical student who wants to get into bioinformatics. I spent the last months learning Python and reading books like Python Data Science Handbook, Elegant SciPy, Think Stats. From what I see, it seems to me that I can do everything you showed in Python, but I'd appreciate your opinion.
In order to build a career as a bioinformatician, would you suggest that I keep investing in Python or switch to R?
@statquest 6 years ago ⁺¹
This is a great question! If you really want to do bioinformatics, and specifically genomic bioinformatics, then you'll want to have access to the Bioconductor tools - those are all in R. If you want to do more machine learning stuff, Python is probably a better fit. The good news, however, is that once you learn one programming language, learning another isn't that bad. I use both languages pretty frequently.
@johirislam8174 3 years ago
I have some other queries regarding DEG analysis. I want to compare the differentially expressed genes from two datasets; how can I do that? For example, one dataset contains 108 DEGs and the other contains 70, so I want to see the genes common to the two datasets. How can I do that, and how can I make the Venn diagram between them? Moreover, I saw that some GEO datasets come in tsv and txt file formats. In that case, how can I analyse that kind of file? Please help me with these two problems.
@statquest 3 years ago
I'll keep those topics in mind.
@adrichuuu 6 years ago ⁺¹
Thank you very much!
@statquest 6 years ago
You're welcome! :)
@taotaotan5671 4 years ago ⁺¹
Wait... Michael Love is your colleague right...
@statquest 4 years ago ⁺¹
Yes, we are pals. However, I left UNC a few months ago to do StatQuest full time.
@taotaotan5671 4 years ago ⁺¹
StatQuest with Josh Starmer You guys are wonderful! Michael is very active and helpful in Bioconductor forums. Thank you guys for great video and software.
@sunnetinternationalbusines9910 2 years ago
So if log takes away our differential counts, how do we know the differential genes between two different samples? For us developmental scientists, we always like to see which gene is uniquely responsible for one character and how to confirm it by tracing it in the laboratory with knockouts and knock-ins. It's like DESeq2 defeats that. And I have been using it for my data analysis for some time.
@statquest 2 years ago
What time point, minutes and seconds, are you asking about?
@sunnetinternationalbusines9910 2 years ago
@statquest From 4:19. What I mean is that sometimes these differences in library composition are what we are actually looking for. For instance, if we wish to identify unique transcription factors in a tissue type, we look for the differences in library composition of the two tissue types. If DESeq2 adjusts for these differences by silencing them, how do we know which receptors, TFs or chemokines are uniquely expressed at a particular time or in a particular tissue type? Thanks man, you are the best.
@statquest 2 years ago
@sunnetinternationalbusines9910 DESeq2 doesn't "silence" those regions - it simply does not use them when adjusting for differences in library composition and depth. Those genes remain in the dataset, and are normalized just like all the others, but are not part of the pool of genes used to calculate the normalization factor.
@Zonno5 3 years ago
I thought geometric average was defined as the nth root of the product of all the samples, not the average of the log of all samples. It could be a roundabout way to do the same thing; I haven't checked.
@statquest 3 years ago
There is an error at 9:28: I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the log of the geometric mean from each log-transformed gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
@shixiangwang 5 years ago ⁺¹
Thanks.
@statquest 5 years ago ⁺¹
:)
@garyhokawai 7 years ago
Averages calculated with logs are called "geometric averages"? I suppose the geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn. Then in step2, I guess you were just calculating the arithmetic mean of the read counts with logs of each gene across all the samples.
@garyhokawai 7 years ago
I see. Seems that definitions in programming are not always the same as in mathematics. I see the formula in DESeq2 paper, it's mathematics. However, in practice, it's not. Still need to learn~
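A quick numeric check of the question raised in the last few comments, with made-up read counts. The two function names are illustrative. It compares the textbook definition of the geometric mean (nth root of the product) with the "average the logs, then exponentiate" route.

```python
import math

def geometric_mean_root(values):
    """Textbook definition: nth root of the product of n numbers."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_logs(values):
    """Same quantity via logs: exp of the arithmetic mean of the log values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

reads = [10, 20, 40]   # hypothetical read counts for one gene in three samples
by_root = geometric_mean_root(reads)
by_logs = geometric_mean_logs(reads)
print(by_root, by_logs)   # both are approximately 20, up to floating-point rounding
```

The two routes agree mathematically, so averaging the logs is not a different definition. A likely practical reason to prefer the log route is that the product of many large counts can grow very large in floating point, while the logs stay small.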
@ahmadzaimhilmi 6 years ago ⁺¹
I come here only for the intro song
@statquest 6 years ago
Hooray! :)
