Thank you!!! This is the best DESeq2 tutorial so far. It's easy to follow and every step makes sense. I am sure many others are benefiting and will benefit from you! I hope you are having a wonderful day or night wherever you are! Thanks a lot!!!
Thanks mate, all the tutorials in this series have been top notch. They go at a great pace and I appreciate that you explain pretty much everything you're doing
Good question. Sample size shouldn't matter as long as each is at least n = 3. For three groups the easiest way is to do a 2 v 2 v 2 analysis. So each group is compared against each of the other groups. In the deseq function you can specify which groups to compare. Alternatively, if you are newer to R and using Deseq you can just load in only two of the groups at a time.
Greetings, I greatly enjoyed and appreciated this series of 4 videos. I am definitely your follower now. I would like to know what exactly is the purpose of creating the vsdata transformed dataset. More important query: When creating the results table, why did you do it using the dds dataset instead of the vsdata dataset?? Just asking what is the logic behind it.
Thank you! Good question. I just use vst to get a normalized matrix to run the PCA. This step is not required for the normal deseq workflow, just for visualization. You could remove the vsdata and pca and still have the results at the end. dds is still the main deseq object. Hope this answered your question!
Hi. Sir. Thank you for your video. Just quick question. Once we run DeSeq(dds) function, the generated results are based on normalized data? after you run "res = results(dds, contrast = c("condition", "S", "C"), you ve got 7 columns including log2foldchange. this log2foldchange is calculated based on normalized data? of course, we can get normalized data using estimaterSizeFactors(dds) followed by counts(dds,normalized=T). But, before this code, we just run dds function and then extract result.
Thank you Mark for your all informative videos. Is there a way to produce RPKM/FPKM and TPM values from DESeq2 library and what’s the easiest way to obtain gene length?
Thank you! You can set the mcols(dds)$basepairs column to the size of the genes using the GenomicRanges packages, then you can call the deseq fpkm function directly. If you can't figure it out, let me know and maybe I can include it in a short upcoming video
Hi thanks for the useful tutorial, how do we convert results (differential table) in to dds (DESeq output)? In a way we can apply the padj cut-off in the res -> dds -> vsdata. Or is there any other way to get padj cut-off applied dds? Thank you
Hi, thanks for this informative video on DESeq2. I have been stuck for a while with input data matrix before running DESeq2. I can see that my Gene identifier column automatically becomes the first column when I am arranging the condition and coldata (column data of my htseq readcounts) into matrix format. Can you suggest me how do I fix it? Thanks!
Hi there! That's really the best Deseq2 tutorial I have seen so far, thank you very much!! I have one question: I ran the first command that includes the header and row.names (row.names =1) but I get the following error message: "Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed" I read a lot of sites that suggest to null the row.names but that is not a good idea for my data. Have you ever encountered this error? Do you have any recommendations? Thanks in advance!
Hi. I am really thankful for your videos. Atm i am in a pickle. I looked up results() function man page since i am a bit confused about this "contrast" argument. The confusion comes from the fact that i have 3 types of samples not just "s" and "c". Either the "contrast" has 1 vector with exactly 3 elements like in the video, (and here comes the confusion): or 2 vectors with names of the fold changes for the numerator, and names of the fold changes for the denominator. What are these? The 3rd option that contrast can contain is "a numeric contrast vector with one element for each element in resultsNames(object) (most general case". Should i use the 2nd or the 3rd option? and what these numerators and denominators mean here? Thank you really.
I wouldn't worry about the third option and just use the variable names instead. You can only do pairwise comparisons. Contrast will always have only 3 elements (column, variable1, variable2). If you want to compare a third variable against variable1 it would be (column, variable1, variable3). Hope this helps!
Informative video. Thanks I have a query regarding data analysis if you could please help me in that. I have a data set for tumors that I downloaded from cancer data portal so now I have gene expression data and clinical data for both tumors. I want to compare the gene expression of both tumors but I am no getting from where I should start, how can I compare these tumors by using DESeq2. Please guide me. Thank you
Hi, thanks! You need to have a count matrix for deseq2. A table where the columns are samples, the rows are genes, and the values are read counts. How many tumors do you have? You will need at least 3 both for any meaningful analysis. In actuality, since you are using human (i.e., non-model organisms) you should have many more than 3 in each group. Let me know if you have any other questions!
Please tell how to collect sample data for GSE99816 , how to know which sample is normal / diseased , Please help me sir.( i used geoquery but it didn't contain this information) please help
Could you further clarify with regard to how you would pick a threshold for row sums. Not sure what you meant by "filter their end result by their mean".
Sure: First you should filter out all rows with 0 counts. Then later after DE analysis you can filter out rows with low counts, e.g., 50. Because things with low counts are inherently more noisy. For example, some samples might have 1 count for a specific gene and another sample has 16. Sure that is 16 times higher, but likely does not reflect an actual 16 fold difference in gene expression. Most likely they are both lowly expressed. The actual threshold is a little bit arbitrary, but 50 as the mean expression across all genes is a good place to start.
@@sanbomics got it, thank you for taking the time to clarify for me! Guess its also dependent on type of material (probably expect less reads from 3' FFPE like I'm working with) and how many samples.
hey mate Am getting this error followed all your steps except the filter one pls help I have 5 columns in my matrix (M10,M11,M12,M3 and M5) and EMBSEL gene ids to it The dds step is not working Error in checkForExperimentalReplicates(object, modelMatrix) : The design matrix has the same number of samples and coefficients to fit, so estimation of dispersion is not possible. Treating samples as replicates was deprecated in v1.20 and no longer
Hi ! Thanks so much for the EXTREMELY helpful video ! Im running into an issue, could you please help me out here? Thanks! On running the code to check dispersion ==>> plotDispEsts(dds) I get the following error : Error in class(x)
Before ^^this step i did run the code for deseq on my data like you did: dds < DESeq(dds) The console does show that its run entirely with the following message - .......fitting model and testing LogicalList of length 24397 [["ENSMUSG00000102275.1"]] logical(0) [["ENSMUSG00000025903.14"]] logical(0) . .
Hello. I am getting an error. While running the DESeqDataSetFromMatrix function, an error pops up Error in DESeqDataSet(se, design = design, ignoreRank) : 'design' should be a formula or a matrix can you tell me how to solve this issue? My dataset consists of 8 columns (4 cancer+ 4 normal samples).
I'm also able to generate the PCA plot, with clear differences between the two clusters. Can you have a look at it and give me some more insights as to what additional info this provides? It would be so much appreciated.
Hi, the short answer is: it really isn't possible. But if your data is completely exploratory, maybe you can help generate hypothesis based on some big differences you see. Which you will then test using a different method with more replicates.
There are multiple ways to do it, I am not sure what is the easiest way but this is what I would do: If you have a list of genes that are coding you can subset the DE results by that list. You can parse the gtf file if you need to make your own list. The gtf file should contain whether a gene is coding or not.
You have to decide which to do pairwise comparisons between. Since you have three samples, you only have a maximum of three comparisons. So it is easy enough to just run the DE with the three possible comparisons. Which is assigned by the contrast argument, which should be passed the two group names.
Hi, when use Counts 50),] I get this: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'which': 'x' must be numeric can you help me Thanks for the videos
Hi, it's hard to say without seeing exactly what you did, or what you mean by different. You generated the counts table from the same fastq files, using the same reference, and using the same software/parameters that I did in the previous 3 videos?
@@sanbomics Hi Sanbomics, I use this reference: Homo_sapiens.GRCh38.107.gtf, from ensembl. I think I don't change anything that I know. Maybe I changed something that you didn't explicit mention.
@@sanbomics The order of geneID I got was different from you, so it is a little difficult to compare 2 tables. Anyway, I think the data is the same, just the order is different.
It seems like there is something in your counts table that shouldn't be there. Try opening it in excel and taking a quick look at it for anything obvious
@@sanbomicsfeatureCounts has the options -M and --fraction. They allow fractional counting in case you can map the same reads to different locations in the genome. Each of the possible features gets a fraction of the count. Also thank you very much for the video it was a great help!
Thank you!!! This is the best DESeq2 tutorial so far. It's easy to follow and every step makes sense. I am sure many others are benefiting and will benefit from you! I hope you are having a wonderful day or night wherever you are! Thanks a lot!!!
That is great to hear! Thank you for your kind words! :)
Thanks mate, all the tutorials in this series have been top notch. They go at a great pace and I appreciate that you explain pretty much everything you're doing
Thanks!
I found your video as a diamond in the pile of sand.............. superb bro... Thanks a lot.
Thanks!
Hi, Nice tutorial!!! Thank you so much. May I ask how to compare 3 or more groups with different sample sizes?
Good question. Sample size shouldn't matter as long as each is at least n = 3. For three groups the easiest way is to do a 2 v 2 v 2 analysis. So each group is compared against each of the other groups. In the deseq function you can specify which groups to compare. Alternatively, if you are newer to R and using Deseq you can just load in only two of the groups at a time.
to clarify. Given groups a,b,c. Do a vs b + a vs c + b vs c
I wanted to ask the same question. Please after comparing, how can I combine the results? I want to use the results to run co-expression analysis
Greetings, I greatly enjoyed and appreciated this series of 4 videos. I am definitely your follower now.
I would like to know what exactly is the purpose of creating the vsdata transformed dataset.
More important query: When creating the results table, why did you do it using the dds dataset instead of the vsdata dataset?? Just asking what is the logic behind it.
Thank you! Good question. I just use vst to get a normalized matrix to run the PCA. This step is not required for the normal deseq workflow, just for visualization. You could remove the vsdata and pca and still have the results at the end. dds is still the main deseq object. Hope this answered your question!
Hi. Sir.
Thank you for your video.
Just quick question. Once we run DeSeq(dds) function, the generated results are based on normalized data? after you run "res = results(dds, contrast = c("condition", "S", "C"), you ve got 7 columns including log2foldchange. this log2foldchange is calculated based on normalized data? of course, we can get normalized data using estimaterSizeFactors(dds) followed by counts(dds,normalized=T). But, before this code, we just run dds function and then extract result.
Thank you Mark for your all informative videos. Is there a way to produce RPKM/FPKM and TPM values from DESeq2 library and what’s the easiest way to obtain gene length?
Thank you! You can set the mcols(dds)$basepairs column to the size of the genes using the GenomicRanges packages, then you can call the deseq fpkm function directly. If you can't figure it out, let me know and maybe I can include it in a short upcoming video
@@sanbomics I appreciate your response. I would love to share it. I tried your suggestion but I could not figure it out. Thank you very much.
Ok. I will do this in an upcoming video!
@@sanbomics Thank you so much! I truly appreciate it.
Hi, informative video. I want to know, how to deal with non-intergers data having decimal. The data matrix function doesn't work on such data set
Hi, I'm not sure I am exactly sure what you are asking. Which step are you referring to?
I tried going back to .csv data and round the numbers
@@florawang4684 Thank you 😊
is your data normalized? If not, the data won't be decimals. As far as I know
Hi thanks for the useful tutorial, how do we convert results (differential table) in to dds (DESeq output)? In a way we can apply the padj cut-off in the res -> dds -> vsdata. Or is there any other way to get padj cut-off applied dds? Thank you
Hi, you can take the rownames from the significant values from result and filter the objects by it
Hi, thanks for this informative video on DESeq2. I have been stuck for a while with input data matrix before running DESeq2. I can see that my Gene identifier column automatically becomes the first column when I am arranging the condition and coldata (column data of my htseq readcounts) into matrix format. Can you suggest me how do I fix it? Thanks!
Hi, no problem! I am a little confused at what the issue is you are having. When you make the coldata DF it has the gene ids in it?
Hi there!
That's really the best Deseq2 tutorial I have seen so far, thank you very much!!
I have one question: I ran the first command that includes the header and row.names (row.names =1) but I get the following error message:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed"
I read a lot of sites that suggest to null the row.names but that is not a good idea for my data.
Have you ever encountered this error? Do you have any recommendations?
Thanks in advance!
You can try adding a tag like _2 to each of the duplicate ones and see if that fixes it
Hi. I am really thankful for your videos. Atm i am in a pickle. I looked up results() function man page since i am a bit confused about this "contrast" argument. The confusion comes from the fact that i have 3 types of samples not just "s" and "c". Either the "contrast" has 1 vector with exactly 3 elements like in the video, (and here comes the confusion): or 2 vectors with names of the fold changes for the numerator, and names of the fold changes for the denominator. What are these? The 3rd option that contrast can contain is "a numeric contrast vector with one element for each element in resultsNames(object) (most general case". Should i use the 2nd or the 3rd option? and what these numerators and denominators mean here? Thank you really.
I wouldn't worry about the third option and just use the variable names instead. You can only do pairwise comparisons. Contrast will always have only 3 elements (column, variable1, variable2). If you want to compare a third variable against variable1 it would be (column, variable1, variable3). Hope this helps!
@@sanbomics Thanks! And have a great holiday!
Informative video. Thanks
I have a query regarding data analysis if you could please help me in that. I have a data set for tumors that I downloaded from cancer data portal so now I have gene expression data and clinical data for both tumors. I want to compare the gene expression of both tumors but I am no getting from where I should start, how can I compare these tumors by using DESeq2. Please guide me. Thank you
Hi, thanks! You need to have a count matrix for deseq2. A table where the columns are samples, the rows are genes, and the values are read counts. How many tumors do you have? You will need at least 3 both for any meaningful analysis. In actuality, since you are using human (i.e., non-model organisms) you should have many more than 3 in each group. Let me know if you have any other questions!
Thanks for this video. Please how can I reach out to you?
Please tell how to collect sample data for GSE99816 , how to know which sample is normal / diseased , Please help me sir.( i used geoquery but it didn't contain this information) please help
www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99816
Let me know if this has what you need!
Could you further clarify with regard to how you would pick a threshold for row sums. Not sure what you meant by "filter their end result by their mean".
Sure: First you should filter out all rows with 0 counts. Then later after DE analysis you can filter out rows with low counts, e.g., 50. Because things with low counts are inherently more noisy. For example, some samples might have 1 count for a specific gene and another sample has 16. Sure that is 16 times higher, but likely does not reflect an actual 16 fold difference in gene expression. Most likely they are both lowly expressed. The actual threshold is a little bit arbitrary, but 50 as the mean expression across all genes is a good place to start.
@@sanbomics got it, thank you for taking the time to clarify for me! Guess its also dependent on type of material (probably expect less reads from 3' FFPE like I'm working with) and how many samples.
hey mate
Am getting this error
followed all your steps except the filter one
pls help
I have 5 columns in my matrix (M10,M11,M12,M3 and M5) and EMBSEL gene ids to it
The dds step is not working
Error in checkForExperimentalReplicates(object, modelMatrix) :
The design matrix has the same number of samples and coefficients to fit,
so estimation of dispersion is not possible. Treating samples
as replicates was deprecated in v1.20 and no longer
Hi ! Thanks so much for the EXTREMELY helpful video ! Im running into an issue, could you please help me out here? Thanks!
On running the code to check dispersion ==>> plotDispEsts(dds)
I get the following error :
Error in class(x)
Before ^^this step i did run the code for deseq on my data like you did:
dds < DESeq(dds)
The console does show that its run entirely with the following message -
.......fitting model and testing
LogicalList of length 24397
[["ENSMUSG00000102275.1"]] logical(0)
[["ENSMUSG00000025903.14"]] logical(0)
.
.
Hello. I am getting an error. While running the DESeqDataSetFromMatrix function, an error pops up
Error in DESeqDataSet(se, design = design, ignoreRank) :
'design' should be a formula or a matrix
can you tell me how to solve this issue? My dataset consists of 8 columns (4 cancer+ 4 normal samples).
Hi, often there is a typo or small differences that can cause this. Can you paste the command you ran?
@@sanbomics sure. here it is.
des
if you change -de to ~de does the error go away?
@@sanbomics It worked! It's funny how I overlooked the typo. Thank you so much!
I'm also able to generate the PCA plot, with clear differences between the two clusters. Can you have a look at it and give me some more insights as to what additional info this provides? It would be so much appreciated.
Hi, Is it possible to do DEG analysis without replicates? I have 6 RNA sequences from different plant tissues that I want to compare
Hi, the short answer is: it really isn't possible. But if your data is completely exploratory, maybe you can help generate hypothesis based on some big differences you see. Which you will then test using a different method with more replicates.
@@sanbomics same way like in your nice tutorial?
Hi, could you pls help me on how to filter out only the protein coding genes? Thankyou.
There are multiple ways to do it, I am not sure what is the easiest way but this is what I would do: If you have a list of genes that are coding you can subset the DE results by that list. You can parse the gtf file if you need to make your own list. The gtf file should contain whether a gene is coding or not.
what If I have 3 conditions instead of 2? When I try to run res
You have to decide which to do pairwise comparisons between. Since you have three samples, you only have a maximum of three comparisons. So it is easy enough to just run the DE with the three possible comparisons. Which is assigned by the contrast argument, which should be passed the two group names.
@@sanbomics got it! Thanksb
Hi, when use Counts 50),] I get this:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'which': 'x' must be numeric
can you help me
Thanks for the videos
Do you have a value in the counts column that is not a number?
@@sanbomics I fixed it. Yes, that was my problem. Thanks for the tutorial.
what is the function of row.names=1??
It's just specifying that the first column contains the row names
I followed along but the result I got was different. Could you explain what the possible reason is?
Hi, it's hard to say without seeing exactly what you did, or what you mean by different. You generated the counts table from the same fastq files, using the same reference, and using the same software/parameters that I did in the previous 3 videos?
@@sanbomics Hi Sanbomics, I use this reference: Homo_sapiens.GRCh38.107.gtf, from ensembl. I think I don't change anything that I know. Maybe I changed something that you didn't explicit mention.
I've uploaded the counts table I used to: github.com/mousepixels/sanbomics/blob/main/count_table_for_deseq_example.csv
See if that turns out the same.
@@sanbomics The order of geneID I got was different from you, so it is a little difficult to compare 2 tables. Anyway, I think the data is the same, just the order is different.
Ahh makes sense. Glad you figured it out!
Help me
My coade: dds
It seems like there is something in your counts table that shouldn't be there. Try opening it in excel and taking a quick look at it for anything obvious
@@sanbomicsfeatureCounts has the options -M and --fraction. They allow fractional counting in case you can map the same reads to different locations in the genome. Each of the possible features gets a fraction of the count.
Also thank you very much for the video it was a great help!
Thank you very much for sharing the helpful vdo; however, this happened when I loaded my counts.
counts
You can read it as a csv, remove the relevant rows, then add the rownames based on that column