I'm an undergraduate medical student who wants to get into bioinformatics. I spent the last few months learning Python and reading books like Python Data Science Handbook, Elegant SciPy, and Think Stats. From what I can see, I can do everything you showed in Python, but I'd appreciate your opinion. To build a career as a bioinformatician, would you suggest I keep investing in Python or switch to R?
This is a great question! If you really want to do bioinformatics, and specifically genomic bioinformatics, then you'll want to have access to the Bioconductor tools - those are all in R. If you want to do more machine learning stuff, Python is probably a better fit. The good news, however, is that once you learn one programming language, learning another isn't that bad. I use both languages pretty frequently.
I have some other queries regarding DEG analysis. I want to compare the differentially expressed genes from two datasets. For example, one dataset contains 108 DEGs and the other contains 70, and I want to find the genes common to both. How can I do that, and how can I make a Venn diagram of them? Also, some GEO datasets come as .tsv and .txt files. How can I analyse that kind of file? Please help me with these two problems.
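For the first part of this question, a minimal sketch using Python sets may help. The gene names and both lists are hypothetical placeholders, not results from any dataset; the three sets it produces are the three regions a two-circle Venn diagram would show.

```python
# Hypothetical DEG lists from two analyses (made-up gene names).
deg_set_a = {"A2M", "TP53", "MYC", "GAPDH"}
deg_set_b = {"MYC", "TP53", "EGFR"}

shared = deg_set_a & deg_set_b   # genes reported in both lists
only_a = deg_set_a - deg_set_b   # genes reported only in the first list
only_b = deg_set_b - deg_set_a   # genes reported only in the second list

print(sorted(shared))
print(len(only_a), len(shared), len(only_b))
```

The three counts at the end are the numbers you would write into the left, middle and right regions of the diagram.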
StatQuest with Josh Starmer You guys are wonderful! Michael is very active and helpful in Bioconductor forums. Thank you guys for great video and software.
So if taking logs takes away our differential counts, how do we find the genes that differ between two samples? We developmental scientists always want to see which gene is uniquely responsible for a trait and how to confirm it in the lab with knockouts and knock-ins. It's like DESeq2 defeats that, and I have been using it for my data analysis for some time.
@@statquest From 4:19. What I mean is that sometimes these differences in library composition are what we are actually looking for. For instance, if we wish to identify unique transcription factors in a tissue type, we look for the differences in library composition between the two tissue types. If DESeq2 adjusts for these differences by silencing them, how do we know which receptors, TFs or chemokines are uniquely expressed at a particular time or in a particular tissue type? Thanks man, you are the best.
@@sunnetinternationalbusines9910 DESeq2 doesn't "silence" those regions - it simply does not use them when adjusting for differences in library composition and depth. Those genes remain in the dataset, and are normalized just like all the others, but are not part of the pool of genes used to calculate the normalization factor.
I thought the geometric average was defined as the nth root of the product of all the samples, not the average of the logs of all samples. It could be a roundabout way to do the same thing; I haven't checked.
There is an error at 9:28: I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the geometric mean from each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
Averages calculated with logs are called "geometric averages"? I suppose the geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn. Then in step 2, I guess you were just calculating the arithmetic mean of the logged read counts of each gene across all the samples.
I see. Seems that definitions in programming are not always the same as in mathematics. I see the formula in DESeq2 paper, it's mathematics. However, in practice, it's not. Still need to learn~
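A quick numerical check of the two definitions discussed in this thread, as a minimal sketch. The read counts are hypothetical and the function names are made up for illustration. Because the log of a product is the sum of the logs, the nth root of the product and the exponential of the average log should agree up to floating-point rounding.

```python
import math

def geometric_mean_via_root(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1.0 / len(values))

def geometric_mean_via_logs(values):
    """Exponential of the arithmetic mean of the natural logs."""
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Hypothetical read counts for one gene in three samples.
reads = [10.0, 20.0, 40.0]
root_version = geometric_mean_via_root(reads)
log_version = geometric_mean_via_logs(reads)
print(root_version, log_version)
```

So the "average of the logs" in step 2 is the log of the geometric mean, which is why the two descriptions refer to the same quantity.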
Correction:
9:28 I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene X). We are subtracting the geometric mean from each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Public Disclaimer: Watching Josh's introductory vids on RNAseq analysis, including this video, (sometimes more than once ;-) ) is a useful primer if you're just starting off in RNA-seq analysis. Watching these videos helped me make sense of the RNA-seq DE pipeline, such as the nature of the inputs and the rationale of the methods and metrics.
Awesome!!! Thanks for the endorsement! :)
I am a neuroscientist from MIT with no previous background in RNA-seq or molecular biology. This video summarized DESeq2 in 12 minutes, which is super cool!!! I quickly understood DESeq2 in 12 mins.
BAM! :)
I wrote a python script to do this procedure and found what might be an error because of your tutorial. In particular I was filtering inf(s) before calculating the median of the logs. Thanks Josh!
bam!
I have to do a talk on deseq2 for a data analysis course and this is what I'm starting it off with. Thank you a lot, seriously.
Glad it was helpful!
I had hard time understanding from the original paper how the normalization coefficients are computed. This video helps a lot! Thank you!
Well explained!!! These steps give a much clearer picture of DEG analysis with DESeq2. Thank you Josh for this valuable video.
Thanks!
Thank you so much Josh!! Your explanation has helped overcome my anxiety about learning DeSeq2...it's complex but not so bad. Thaaank you. I'm sharing this vid with my comp.bio journal club, too
Do you plan on doing an overview of limma-voom as well?
A lot of people have started to ask about that. I'll put it onto the to-do list and look into it.
As a summary, I would say that the geometric average downplays the effect of outliers at the gene level (rows), while the median downplays outliers at the sample level (column). The subtraction allows us to rank samples by sequencing depth, and the division applies the scaling factor to our original data.
:)
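The steps summarised above can be sketched in plain Python. This is only a minimal illustration of the median-of-ratios idea as described in this thread, not DESeq2's actual code; the function names `size_factors` and `normalize` and the toy counts are assumptions made up for the example.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes (rows), each a list of raw reads per sample."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # log(0) is -infinity, which makes the gene's average log -infinity
        # too, so genes with a zero in any sample are left out of the pool.
        if any(c == 0 for c in row):
            continue
        logs = [math.log(c) for c in row]
        mean_log = sum(logs) / len(logs)  # log of the gene's geometric mean
        for j, value in enumerate(logs):
            log_ratios[j].append(value - mean_log)
    # Median per sample (column), then convert from log scale back to a factor.
    return [math.exp(median(column)) for column in log_ratios]

def normalize(counts, factors):
    """Divide every raw count by its sample's scaling factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1,
# and the last gene is only seen in sample 2.
counts = [[10, 20], [30, 60], [5, 10], [0, 50]]
factors = size_factors(counts)
normalized = normalize(counts, factors)
print(factors)
print(normalized)
```

Note that the gene with a zero is skipped only when computing the factors; it is still divided by the scaling factor like every other gene.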
These videos are an absolutely fantastic resource. Really, thank you so much!
I'm so happy to hear that you like them! :)
The way that you are explaining is amazing! I was looking for such an explanation for a long time. It is very comprehensive. Thanks!
Great! Great! Great!
It would be great also if you can introduce some well done books!
Thanks!
I could not find any, either!
Thanks!
Found a small mistake from the video: when you explain the library sizes (around 2 minutes), the Sample #2 Gene A2M read counts should be 1126, not 2126.
I was thinking exactly the same :)
Hey Joshua, thanks so much for making these videos- they are immensely helpful.
I think I noticed a small mistake when you transform the median values into normal numbers. Sample 2 you have e^0.3, but the median is -0.1.
Very nice tutorial, effortless stat learning.
thank you Joshua
I like this guy! Thanks for your careful explanation! Keep it up!
Thank you very much! :)
Also, what would you do if you wanted to look at the log infinity values, that is, the cell-type-specific genes, at 7:57?
You can always add a "pseudo-count" to the data, like one read for all genes, so that you can avoid the log infinity problem.
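A minimal sketch of the pseudo-count idea from the reply above. The counts are hypothetical, and the function name `log_counts` and its default of one read are assumptions for illustration. This is a trick for looking at logged data; it is not claimed here to be part of how DESeq2 computes its scaling factors.

```python
import math

def log_counts(counts, pseudo_count=1):
    """Add a pseudo-count to every value before taking the natural log."""
    return [math.log(c + pseudo_count) for c in counts]

# Hypothetical reads for one gene; the first sample has no reads at all.
reads = [0, 9, 99]
logged = log_counts(reads)
print(logged)
```

With the pseudo-count, the zero becomes log(1) = 0 instead of negative infinity, so the gene can still be plotted or compared.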
Thanks for your videos! They really are a huge help. I just have a question about your explanation for differences in library composition at 3:41. I'm not sure I follow. The way I see it, if those 563 reads don't map to A2M, they aren't going to just move onto other genes to inflate their counts. So the only reason that the other genes in library 2 have higher counts is because they had more reads that matched their sequence, indicating that their transcripts were more abundant. Which would mean those other genes are differentially expressed as well, right? If only A2M was differentially expressed, then those other genes would retain their small counts because they aren't transcribed any more than in library 1. Am I misunderstanding something? Thanks
Edit: I have two other questions as well, if you don't mind:
1) Does this method of normalization take into account the lengths of the different transcripts like TPM/RPKM/FPKM?
2) Is this method more robust than TPM/RPKM/FPKM? If so, then should it be used instead of them?
Sorry for the onslaught of questions. Thanks for the help!
In the example at 3:41, there are 635 reads sequenced per sample (yes, these numbers are small compared to a true RNA-seq experiment, but this is just an example). Now, when we do RNA-seq, we extract the mRNA from cells (or a single cell) and then we amplify it with PCR before making the final library that is sequenced. The PCR ensures that we have a lot of stuff to sequence, so much stuff that there is more than we can actually sequence. Thus the example plays out in reality the way it does in this example in the video. When one gene soaks up a lot of reads in one sample, but not in another, then that just means there are more reads going to other genes in the other sample.
This method does not account for transcript lengths, nor should it. DESeq2's model depends only on the number of reads per gene, not the lengths.
Lastly, TPM/FPKM/etc. are useful when just looking at the data and comparing genes of different lengths.
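The library composition argument above can be played out with a toy calculation. Only the 635 reads per sample and the 563 reads for A2M come from the thread; the other gene names and abundances are hypothetical, and `expected_reads` is a made-up helper that shares a fixed number of sequenced reads in proportion to abundance.

```python
def expected_reads(abundances, reads_sequenced):
    """Share a fixed number of sequenced reads in proportion to abundance."""
    total = sum(abundances.values())
    return {gene: reads_sequenced * a / total for gene, a in abundances.items()}

# Hypothetical relative transcript abundances. Only A2M differs between
# the two samples; the other three genes are identical in both.
sample_1 = {"A2M": 563, "geneB": 30, "geneC": 24, "geneD": 18}
sample_2 = {"A2M": 0, "geneB": 30, "geneC": 24, "geneD": 18}

reads_1 = expected_reads(sample_1, 635)
reads_2 = expected_reads(sample_2, 635)
print(reads_1)
print(reads_2)
```

In this sketch geneB, geneC and geneD get far more reads in sample 2 even though their abundances did not change, simply because A2M is no longer taking its share of the 635 reads.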
@@statquest Thanks for the clarification
Request: great video on DESeq2 normalization. We already know what the counts are. I do not understand how the linear models for each gene are used to calculate the LFC. I really appreciate your explanations.
Thank you! Unfortunately I haven't done this sort of analysis in a long time so I can't promise I'll follow up on it. :(
Another small mistake I found is that, around 10 mins, sample #2 should be e^-0.1 = 0.9. Anyway, thanks a lot!
Great video! Would have been nice if you could have talked about the negative binomial distribution fitting.
One day I'll get to that part. Hopefully soon.
There is a problem at 2:13, reads for A2M gene in sample 2 should have 1126 reads, not 2126. Anyway, thank you for the video, very useful for beginners, and in general nice and unique style!
Sorry for the typo, but I'm glad it didn't get in the way of you understanding the ideas. BAM! :)
such a great explanation!
Glad you think so!
Your explanations are very good. Thanks !!! The song is funny
Thank you! 😃
i
:)
First of all - I LOVE your stuff! so helpful and clear!
quick question though - I have an RNA-Seq of some experiments for 4 different cell-lines, each cell line has 3 biological replicates with 3 technical replicates each - I want to do some normalization on that RNA-Seq results to compare between the cell lines.
You mentioned in the video that DESeq wasn't meant to do normalization between different reads count but between different cells - which is exactly what I am doing - BUT - I do have some delta between the reads of each technical replicate, especially between the 1st biological replicate against both the 2nd and the 3rd biological replicates due to different PCR cycles.
My question is - do I need to perform any kind of normalization based on the reads before I do the DESeq normalization?
Nope! At 4:18 we see that DESeq2 (and EdgeR) can take care of both situations: when there are differences in library sizes and when there are differences in library composition.
Great stuff dude. Thanks for making this.
These videos are awesome
Thank you! :)
Do you have a video explaining more technical aspects of DESeq2, please? E.g. the GLM fitting (eq. 2 in the DESeq2 paper), estimation of dispersion, and estimation of logarithmic fold changes.
One question. Deseq2 uses negative binomial regression, so after applying scaling factors, does it also round the normalized numbers to make a real count table of normalized values? Otherwise can we use negative binomial still?
Fantastic explanation.
Thanks! :)
Around 12:20 you say that the idea of logs and median is to look at house keeping genes and to eliminate all genes which are only transcribed in one sample. But why should we do this? If we knock out a transcription factor to find its function this is exactly what we are interested in. Or does this method serve a different purpose?
Thank you!
At this stage, all we are interested in is normalizing the read counts to compensate for differences in sequencing depth and library composition. Later, once the read counts are normalized, then we will use statistics to identify differentially expressed genes.
The scaling factor which you mentioned at 4:46, is it the same as the work done by the 'Estimate Size Factor' function in R programming??
Unfortunately it's been so long since I used DESeq2 that I can't remember.
Is there any obvious reason for using geometric mean instead of arithmetic mean when calculating a scaling factor?
Thank you so much for the great explanation!
2:08 May I know if all the samples were sequenced at the same time ( same sequencing reaction), will the sequencing depth become different?
I believe so, because you'll still end up with different numbers of reads per sample.
@@statquest Does this imply that even if the sequencing depth is standardized to 20x coverage across all samples, the number of reads corresponding to transcripts of gene A may still vary between samples, even if the expression level of gene A is the same in both sample 1 and sample 2?
@@mayling1014 I believe there is a stochastic (random) nature to the hybridization between the reads and the chip used for sequencing. So there is a chance that not every sample gets exactly the same number of reads because not every sample binds to exactly the same number of spots on the chip. And not every read is the same quality, and that could also result in different numbers of reads per sample after you filter out low quality reads.
Josh, thanks for your wonderful series of videos. I have a question about using the DESeq2 normalization method on TPM data. I have TPM from RSEM output, so each sample of course sums to 1 million. It seems that using DESeq2-style normalization on this TPM data would be valuable as it will adjust for library composition. I am not using R, so I'm not using the DESeq2 Bioconductor package, just computing the normalization as you describe. Documentation on the DESeq2 package says the counts should be raw counts; however, it seems that TPM would be just as valid if normalization is the only step of interest. Is this correct? Thanks
DESeq2's normalization assumes the data are raw because it does part of what TPM attempts to do, compensate for sequencing depth differences. When you start with TPM values, DESeq2 can no longer make that adjustment the way it wants to.
@@statquest so is it good or bad?
Thanks for the great video. What you call "scaling factor" is the output of the function estimateSizeFactors, right? The name is a little bit misleading for someone who's already very confused with all the different normalisation methods!
I believe that is correct.
If I have data already in TPM (transcripts per million), can I still apply DESEQ2?
Nope.
It's really helpful for my research. THANK YOU A LOT!
Another amazing video!
Thank you so much for very useful videos
In regards to the samples for each DESeq analysis. could that be different biological replicates or does each sample correspond to a different cell type ?
It could be anything - it could be technical replicates, biological replicates or different cell types. Whatever it is you want to study.
So if you want to investigate cell differentiation using RNA-seq data, would it be wise to apply DESeq2? Because non-house keeping genes would also be of interest here I assume and those would be filtered out with DESeq2 or am I mistaken?
Yes, I think DESeq2 would be a good tool for that.
Hi, I noticed a mistake @10:37: for sample #2, e should be raised to -0.1 instead of -0.3. Correct me if I am wrong.
Yep, that's a typo.
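The arithmetic behind this correction is easy to check. A minimal sketch, taking the -0.1 median quoted in the comments for sample #2 as given (the variable names are made up for illustration):

```python
import math

median_log_ratio = -0.1  # median of the log ratios quoted for sample #2
scaling_factor = math.exp(median_log_ratio)
print(round(scaling_factor, 1))  # prints 0.9
```

A negative median on the log scale always gives a scaling factor below 1, so dividing by it scales that sample's counts up.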
Average of logs is not the same as log of averages! Around 9:19 you are saying log(reads for geneX) - log(average for geneX) = log of the ratio, correctly. But what you calculated in step 2 is not the log(average for gene X) but the average of the log(reads). If a, b, c were the read counts for the 3 samples for say GENE3, the average you calculated in the example step 2 is (loga +logb + logc)/3. This, in your example is Average of log reads. But when you go on to discuss the logratio you are treating it as the log(average), the log of [ (a+b+c)/3], i.e. the log of the average. These 2 quantities are not the same thing obviously, So either you are wrong in the example at step 2 or you are wrong later when you treat it as a log(average) while you had calculated the average of logs. Could you help clarify and ideally correct the example in the video?
You are correct. This error had been noted before in the video's description, and now I have made a pinned comment so that it is easier to see. Sorry for the confusion.
@@statquest Thanks for clarifying and doing it so fast.
@@statquest "log(reads for gene X) - average(log values for gene X)." Then the interpretation in the box is false and we should ignore that, too, right? You have no difference of logs, so no log of a ratio, so it is not true that "we are really checking out the ratios of the reads in each sample to the average across samples".
So how do we move from the corrected expression "log(reads for gene X) - average(log values for gene X)" to the next step, where we are working with "log(reads_for_gene_X / average_reads_for_gene_X)"? What am I missing?
@@michelepierotti2833 You don't do that next step. We don't have a ratio, we just have a difference, or a "residual", from the geometric mean.
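For readers stuck on the same step, here is a short worked identity. The symbols are assumptions introduced for illustration: x_ij is the read count for gene i in sample j, n is the number of samples, and g_i is the geometric mean of gene i's counts.

```latex
\log x_{ij} - \frac{1}{n}\sum_{k=1}^{n}\log x_{ik}
  = \log x_{ij} - \log\Bigl(\prod_{k=1}^{n} x_{ik}\Bigr)^{1/n}
  = \log\frac{x_{ij}}{g_i},
\qquad
g_i = \Bigl(\prod_{k=1}^{n} x_{ik}\Bigr)^{1/n}
```

Read this way, the difference can still be seen as a log ratio, provided the "average" in the denominator is the geometric mean rather than the arithmetic mean, which seems to be what the pinned correction is saying.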
Please post part 2 soon.
Thanks for the really helpful video! If DEseq2 removes genes that have 0 reads, does this affect results interpretation? For example, different tissues express different genes (in some tissues the expression of certain genes is 0), for some "0" expression genes in certain tissues, the difference between these tissues and the tissues in which these genes are highly expressed is physiologically relevant. I hope the program still keeps these "0" read genes.
Yes, it keeps those genes (with 0 reads), however, those genes are not used to calculate the scaling factor.
@@statquest Thanks Josh, if DEseq2 keeps those genes with 0 reads, that is possible that those genes with 0 reads will be listed as significantly differentially expressed genes in the volcano plot, do I understand right?
@@leixiao169 Presumably.
how would you do a differential expression between multiple cell lines? do them in pairs and then find the shared highly differentially expressed genes? or is there a way of doing it in one analysis?
This is a good question. Unfortunately it's been a while since I used DESeq2, however, I remember that you can pretty much do any sort of "linear model" type test, so you should be able to do anova or something like that.
@@statquest Thanks! appreciate it!
Thank you very much, it was very useful
Can anyone help me normalize Excel data in DESeq2? Where can I find a clear script?
Thank you for your good explanation! So DESeq2 is only used to find moderately expressed genes in different tissues, right?
DESeq2 can find differentially expressed genes among different tissues, or within the same tissue if, for example, one is diseased and the other is healthy.
@@statquest Oh, I see!! Thank you for the good example.
I have one more question :)
Is there a name for the DESeq2 normalized value, like TPM or RPKM?
@@hommejuhyun Not that I know of.
Thanks!
TRIPLE BAM!!! Thank you for supporting StatQuest!!! :)
Thank you very much. On which paper is this based?
The original DESeq2 manuscript.
@@statquest Thank you, but there are many DESeq2 papers. Do you mean this one: "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2" by Michael I Love, Wolfgang Huber and Simon Anders (2014)? Thank you
Yes, that's the one. I also went through the code to see exactly what it was doing.
@@statquest thank you very much!! :) :)
It's the intro song for me! \m/
bam! :)
DESeq2 with only one replicate for each group: is it possible? If not, is there any good alternative to detect differential gene expression with one replicate per cell line?
I'm almost certain it can. I know I've done it with edgeR before. The manual for edgeR gives an example and tells you how to set certain parameters that are usually estimated when you have more data. Presumably you can do something similar with DESeq2.
Hey, sorry, is it what the DESeq and vst functions are doing in the DESeq2 package?
I'm not sure I understand your question. Can you rephrase it?
@@statquest Oh, sorry, I should have been more specific. When we run code in R using the DESeq2 package to analyse RNA-seq data, we have to transform the data using the vst (variance stabilizing transformation) function. In this video, you explained nicely what DESeq2 does to normalize RNA-seq data. I'm asking whether this normalization is the same as what the vst function does in R.
@@kiarashbike I believe VST is different.
A 1000 Likes! Thank you Josh! 😭😭🙏🙏🙏
Hi, Do you know a DESeq2 alternative in Python?
Unfortunately I don't know of anything like DESeq2 for python.
I'm an undergraduate medical student who wants to get into bioinformatics. I spent the last few months learning Python and reading books like Python Data Science Handbook, Elegant SciPy, and Think Stats. From what I see, it seems to me that I can do everything you showed in Python, but I'd appreciate your opinion.
In order to build a career as a bioinformatician, would you suggest that I keep investing in Python or switch to R?
This is a great question! If you really want to do bioinformatics, and specifically genomic bioinformatics, then you'll want to have access to the Bioconductor tools - those are all in R. If you want to do more machine learning stuff, Python is probably a better fit. The good news, however, is that once you learn one programming language, learning another isn't that bad. I use both languages pretty frequently.
I have some other queries regarding DEG analysis. I want to compare the differentially expressed genes from two datasets. How can I do that? For example, one dataset contains 108 DEGs and the other contains 70, so I want to see the genes common to both datasets. How can I do that, and how can I make a Venn diagram of them? Moreover, I saw that some GEO datasets come in tsv and txt file formats. In that case, how can I analyse that kind of file? Please help me with these two problems.
I'll keep those topics in mind.
Thank you very much!
You're welcome! :)
Wait... Michael Love is your colleague right...
Yes, we are pals. However, I left UNC a few months ago to do StatQuest full time.
StatQuest with Josh Starmer You guys are wonderful! Michael is very active and helpful in Bioconductor forums. Thank you guys for great video and software.
So if the log takes away our differential counts, how do we know which genes are differentially expressed between two different samples? We developmental scientists always like to see which gene is uniquely responsible for one character, and how to confirm it by tracing it in the laboratory with knockouts and knock-ins. It's like DESeq2 defeats that. And I have been using it for my data analysis for some time.
What time point, minutes and seconds, are you asking about?
@@statquest From 4:19. What I mean is that sometimes these differences in library composition are what we are actually looking for. For instance, if we wish to identify unique transcription factors in a tissue type, we look for the differences in library composition between the two tissue types. If DESeq2 adjusts for these differences by silencing them, how do we know which receptors, TFs or chemokines are uniquely expressed at a particular time or in a particular tissue type? Thanks man, you are the best.
@@sunnetinternationalbusines9910 DESeq2 doesn't "silence" those regions - it simply does not use them when adjusting for differences in library composition and depth. Those genes remain in the dataset, and are normalized just like all the others, but are not part of the pool of genes used to calculate the normalization factor.
I thought the geometric average was defined as the nth root of the product of all the samples, not the average of the log of all samples. It could be a roundabout way to do the same thing; I haven't checked.
There is an error at 9:28: I have log(reads for gene X) - log(average for gene X), but it should be: log(reads for gene X) - average(log values for gene for gene X). We are subtracting the geometric mean from each gene measurement. In other words, if you take 'the average of reads' to be the geometric average, it all hangs neatly together.
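For anyone who wants to check how the pieces "hang together", here is a short derivation. The symbols are my own shorthand, not from the video: x_1, ..., x_n are the reads for gene X in each of the n samples (all greater than 0), and G is their geometric average, defined the usual way as the nth root of the product.

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\log x_i
  &= \frac{1}{n}\log\Bigl(\prod_{i=1}^{n} x_i\Bigr)
   = \log\Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}
   = \log G,
   \qquad G = \sqrt[n]{x_1 x_2 \cdots x_n},\\[4pt]
\log x_j - \frac{1}{n}\sum_{i=1}^{n}\log x_i
  &= \log x_j - \log G
   = \log\frac{x_j}{G}.
\end{align*}
```

So the average of the logs is the log of the geometric average, and subtracting it from log(reads for gene X) gives the log of the ratio of the reads to the geometric average across samples.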
Thanks.
:)
Averages calculated with logs are called "geometric averages"? I suppose the geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers x1, x2, ..., xn. Then in step2, I guess you were just calculating the arithmetic mean of the read counts with logs of each gene across all the samples.
I see. Seems that definitions in programming are not always the same as in mathematics. I see the formula in DESeq2 paper, it's mathematics. However, in practice, it's not. Still need to learn~
I come here only for the intro song
Hooray! :)