Nice explanation. I found the overrepresented sequences really useful for confirming hashtag or CITE-seq antibody binding and sequencing quality.
Thanks for the video! What pattern would you expect in the sequence duplication levels for ddRADseq? I have two big peaks at >10 and >100. Thanks!
Hey, Gabriel -
I am not sure. I actually hadn't heard of ddRADseq until you mentioned it. One thing I have read that can increase duplication levels in DNA sequencing is a genome with a lot of repeats in it, just as highly up-regulated genes cause duplication warnings in transcriptome data. You can try running the rest of your analysis and see whether anything strange comes from the higher duplication levels.
Another cause might be PCR amplification after the library has been prepped: depending on how many cycles you run, there is a chance that, after too many, you end up with PCR bias. Not sure how your library was prepped, but that could be something to think about, too.
Sorry for not knowing the exact answer - I'd have to read more into ddRADseq, but it sounds like the dd part makes the selection rather narrow, so some duplication would be expected: reads come only from genomic regions flanked by the restriction enzyme sites AND of a specific size, rather than from a library prepared 100% randomly from genomic DNA. But, like I mentioned, that is just a guess.
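If it helps to look at the peaks outside of the report, here is a rough Python sketch of how reads could be grouped by duplication level. It is not FastQC's actual algorithm; the function name `duplication_bins` and the bin labels are made up for illustration, and it only counts exact full-length sequence matches.

```python
from collections import Counter

def duplication_bins(reads):
    """Return the fraction of reads that fall in each duplication-level bin.

    Hypothetical helper: bins are simplified and chosen for illustration only.
    """
    copies = Counter(reads)  # sequence -> number of exact copies
    bins = {"1": 0, "2-9": 0, "10-99": 0, "100+": 0}
    for n in copies.values():
        if n == 1:
            key = "1"
        elif n < 10:
            key = "2-9"
        elif n < 100:
            key = "10-99"
        else:
            key = "100+"
        bins[key] += n  # weight each bin by the number of reads it holds
    total = len(reads)
    return {key: count / total for key, count in bins.items()}

# Toy input: a few loci sequenced many times, as in a narrow library.
toy_reads = ["AAA"] * 150 + ["CCC"] * 20 + ["GGG"] * 3 + ["TTT"]
print(duplication_bins(toy_reads))
```

With a library restricted to a limited set of loci, most reads would be expected to land in the higher bins, which would be consistent with peaks at >10 and >100. Again, that is a guess rather than a rule.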
I have done 16S metagenomic sequencing and the sequence duplication percentage is 90%. Is that fine?
Without more information, allow me to put forth some thoughts:
1) rRNA sequences are highly conserved, whether bacterial or animal. Depending on what is being explored with such sequences, and the source, it's plausible that the sequences will overlap a great deal.
2) Duplication may mean a few different things, remember. Using a unique molecular identifier (UMI) could help narrow down the source of the duplication - is it from a high abundance of a particular bacterial species? Is it from PCR bias? Is it from some unaccounted-for source? It really depends on the parameters from which the data was derived.
3) Looking at single-isolate sequencing results. If the data being explored is from a single source, then the duplication *should* be high. If I were trying to sequence a pure source yet seeing low duplication, I would be concerned. If you sequenced a short region of your genome (say the 18S rDNA sequence), there should be high coverage, and the deeper the sequencing, the more the duplication will increase (there are only so many locations for unique molecules to originate from, so as you sequence deeper and deeper, the probability of seeing the same start and end increases, even for different molecules).
4) Inclusion of adapter sequences not known to the software. If working with raw data and custom adapters that FastQC is unaware of, I could imagine it flagging them as duplicated sequences when reads run through into the adapter (a technical issue). This is less likely, both because custom adapters are rare (niche sequencing companies) and because it would require the tech running the machine to allow a sample with short inserts to be sequenced with the wrong kit.
Summary - duplication is not always unexpected. You have to understand several things to assess whether any of it is expected or technical. I would reach out to the sequencing service provider to explore the reasons further.
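To illustrate point 3, here is a small simulation sketch in Python. It assumes every read starts at one of `n_loci` equally likely positions, which is a big simplification of a real amplicon library. The function name and parameters are hypothetical.

```python
import random

def duplicate_fraction(n_loci, n_reads, seed=0):
    """Fraction of reads whose start position was already seen.

    Assumes reads are drawn uniformly from n_loci possible positions.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    seen = set()
    duplicates = 0
    for _ in range(n_reads):
        locus = rng.randrange(n_loci)
        if locus in seen:
            duplicates += 1  # same position as an earlier read
        else:
            seen.add(locus)
    return duplicates / n_reads

# Same small target, increasing sequencing depth.
for depth in (100, 1_000, 10_000):
    print(depth, duplicate_fraction(n_loci=1_000, n_reads=depth))
```

With only 1,000 possible positions, 10,000 reads can contain at most 1,000 unique ones, so at least 90% must be duplicates no matter how clean the library is.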
acs
Can I trim my raw reads twice? Meaning, I trim the raw reads, then take the results and trim them again?