RPKM, FPKM and TPM, Clearly Explained!!!

StatQuest with Josh Starmer

211K views

Published: Nov 22, 2024

Comments • 180

@statquest 2 years ago ⁺²
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@asmitaJ 1 month ago ⁺²
I believe this has become the standard video anyone recommends when you want to understand different types of count normalizations. I have been recommended this by both my supervisor and my professor on two separate occasions haha
@statquest 1 month ago ⁺²
That's awesome! DOUBLE BAM! :)
@Anonymous9683 3 years ago ⁺¹⁰
I love how this man knows his content is irreplaceable so he can mess around in the intro without being concerned about losing viewers
@statquest 3 years ago
:)
@solidsnake013579 6 years ago ⁺⁷
hands down the most perfect explanation on the internet
@statquest 6 years ago
Thank you! :)
@ivantsers9445 5 years ago ⁺³⁸
Thank you very much for the explanation! But one thing I should point out: the ORDER of division (i.e. the order of the steps) doesn't matter. What matters is WHAT you divide by: in TPM it's not just the library size (i.e. the raw number of all reads), but the sum of all read counts after normalizing by length (i.e. the total RPK across all genes). This is the root of the difference between RPKM and TPM.
@nicolaikarcher7186 4 years ago
This is correct!
@LiptonTiptonTea 4 years ago ⁺⁴
Couldn't agree more. This video gives the impression that changing the order of division is what makes the difference, while it's all about total reads vs. total length-normalized counts.
@gabriele223 1 year ago
All counts of reads that map onto something, I suppose
@ejvik3238 3 months ago
but why do we normalize by the length even for TPM?
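The point made in this thread can be checked numerically. The following is a minimal sketch, assuming made-up read counts and gene lengths for a single sample (the gene names, numbers, and function names are illustrative, not taken from the video). It shows that swapping the two divisions in RPKM changes nothing, while TPM differs because its per-million denominator is the total of the length-normalized values (RPK) rather than the total raw reads.

```python
# Hypothetical single sample: raw read counts and gene lengths in kilobases.
counts = {"A": 10, "B": 20, "C": 5, "D": 0}
length_kb = {"A": 2.0, "B": 4.0, "C": 1.0, "D": 10.0}


def rpkm(counts, length_kb):
    # Denominator: total raw reads in the sample, in millions.
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / length_kb[g] for g in counts}


def rpkm_swapped(counts, length_kb):
    # Same denominator as rpkm(), but the length division happens first.
    per_million = sum(counts.values()) / 1e6
    return {g: (counts[g] / length_kb[g]) / per_million for g in counts}


def tpm(counts, length_kb):
    # Reads per kilobase (RPK) for each gene.
    rpk = {g: counts[g] / length_kb[g] for g in counts}
    # Denominator: total RPK in the sample, in millions.
    per_million = sum(rpk.values()) / 1e6
    return {g: rpk[g] / per_million for g in counts}


for gene in counts:
    print(gene, rpkm(counts, length_kb)[gene],
          rpkm_swapped(counts, length_kb)[gene],
          tpm(counts, length_kb)[gene])
```

Because length normalization happens before the total is taken, the TPM values of a sample always add up to one million, whereas the RPKM total depends on the sample.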
@efthymiakokkinou1616 5 years ago ⁺⁵⁰
this guy is awesome.
@statquest 5 years ago ⁺²
Thank you! :)
@Tiago211287 9 years ago ⁺⁸
Clearest explanation of TPM/FPKM/RPKM I have ever heard. I don't know why so many PhDs were so confusing when trying to explain this to me before.
@maxfeng4532 8 years ago
+Joshua Starm thank you so much, I feel like I'm cleaning out the dust piled up in my mind, this is perfect!
@marekglombik8887 7 years ago ⁺¹
I've just started my PhD and I'm really glad I found this. Thanks!
@TheBlackCarlo 7 years ago
My initial work for my PhD just got soooooo much easier and more fun. Thanks!
@MariaSamaloisaMarsa-lw4fk 8 months ago ⁺¹
Thank you, sir. I have watched this RPKM video on RUclips and it has really blessed me 🙏🙏
My name is Maria Samaloisa, 4th semester. Thank you, may the Lord Jesus bless us all 🙏🙏👍
@statquest 8 months ago
bam! :)
@tuskofgothos2637 6 years ago ⁺¹
Your channel is an absolute gem! Please do keep up the good work. We need you!!
@fabioPatroni 7 years ago
The best and clearest explanation I've ever seen! Tks
@syednajeebashraf4101 8 years ago ⁺³
I watched this presentation and now I can explain this to even seniors in my place as well !! :)
@de_aquila 5 years ago ⁺²
Thank you very much for this video! It's really very helpful!
For many biologists who have the thirst to understand the logic behind why certain metrics are the way they are with respect to statistics... this is certainly of immense help.
@louisebuijs3221 4 years ago
RPKM = Reads Per Kilobase Million -> normalizes for read depth (some replicates simply have more read depth, which is technical) and for gene length
- SE RNA-seq = RPKM
- PE RNA-seq = FPKM (the rest is the same)
1. divide all reads per gene by the total number of reads per replicate (or sample, whatever you want to call it), in millions
2. divide by gene length (in kilobases)
TPM = different order
1. divide by gene length (in kilobases)
2. divide by the per-replicate total of those length-normalized values, in millions
The result of the difference in order is that relative expression is more easily comparable between replicates, because in TPM the pie charts are all the same size and in RPKM the pies are different sizes
@statquest 4 years ago
bam!
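The "pie chart" remark in this summary can be illustrated with a small sketch, assuming three hypothetical replicates and four genes (all numbers and names here are invented for illustration). It sums the normalized values per replicate: the TPM totals come out identical, while the RPKM totals differ from replicate to replicate.

```python
# Hypothetical gene lengths (kb) and raw counts for three replicates.
length_kb = [2.0, 4.0, 1.0, 10.0]
samples = {
    "rep1": [10, 20, 5, 0],
    "rep2": [12, 25, 8, 0],
    "rep3": [30, 60, 15, 1],
}


def rpkm(counts, length_kb):
    # Scale by total raw reads (in millions), then by gene length.
    scale = sum(counts) / 1e6
    return [c / scale / l for c, l in zip(counts, length_kb)]


def tpm(counts, length_kb):
    # Scale by gene length first, then by total RPK (in millions).
    rpk = [c / l for c, l in zip(counts, length_kb)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


# Size of each "pie": the sum of normalized values within a replicate.
rpkm_totals = {name: sum(rpkm(c, length_kb)) for name, c in samples.items()}
tpm_totals = {name: sum(tpm(c, length_kb)) for name, c in samples.items()}

for name in samples:
    print(name, round(rpkm_totals[name]), round(tpm_totals[name]))
```

With equal totals, a TPM value reads directly as a gene's share of that replicate's length-normalized reads.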
@torlarsen2212 2 years ago ⁺¹
Yet another great explanation StatQuest!!! You keep educating til today!!
@statquest 2 years ago ⁺¹
Thanks!
@dreamyagnes 2 years ago ⁺¹
Hi Josh, thank you so much for your videos.
@statquest 2 years ago
Glad you like them!
@Qaxoontii 6 years ago ⁺²
Thank you so much for this explanation, it is very useful for us biologists who have no background in bioinformatics.
@statquest 6 years ago
You're welcome! I'm glad to know that the video is helpful. :)
@rodolfoaramayo7392 8 years ago
Good Job!
I am going to use this video to explain these concepts in Genomics, a Graduate/Undergraduate class I teach at Texas A&M University
@rodolfoaramayo7392 8 years ago
Thanks!
@Pongant 4 years ago ⁺¹
I love your low-key intros
@statquest 4 years ago
Thanks!
@KeziKing 1 year ago ⁺¹
This was great!!! You really explained it clearly! Thanks so much!
@statquest 1 year ago
Glad it was helpful!
@prachinagpal3112 7 years ago
Concrete explanation .
Concepts explained to the point.
Add more !
@prachinagpal3112 7 years ago
Yes, I will be watching all videos
@victorcampos9064 3 years ago ⁺¹
Thank you so much!! Could not be explained more clearly. Keep up the good work!
@statquest 3 years ago ⁺¹
Thank you! :)
@sambhavmishra1873 5 years ago ⁺¹
Thank you so much, Josh Starmer !! It was a very clear explanation. My doubts are totally cleared.
@statquest 5 years ago
Awesome! Thank you. :)
@kanefoster8780 4 years ago ⁺²
this is fantastic. I'm all over this goddam
@statquest 4 years ago ⁺¹
Bam! :)
@priyankamaripuri8249 6 years ago ⁺¹
I find your videos extremely helpful! Thank you so much!!!! Can you share your presentations too?
@lucyyu2251 9 years ago ⁺¹
This is very very clear! I wish I'd seen this video earlier! Keep it up!
@Jonix-redhat 2 years ago ⁺¹
Thx for a great and easy explanation!
@statquest 2 years ago
Thank you!
@TheLegendOfNiko 4 years ago ⁺²
Perfect explanation, however, one thing was left out - TMM. How does TMM fit into the mix?
@statquest 4 years ago ⁺²
TMM is similar to what they do in DESeq2. For more details, check out: ruclips.net/video/UFB993xufUU/видео.html
@SNAKE1375 7 months ago ⁺¹
Hi Josh, thanks very much for this video, again clearly and well explained. It seems that TPM would be the most appropriate way to measure gene expression between samples. However, internet searches suggest the contrary: some say that TMM would be the best solution. What do you think of this?
@statquest 7 months ago ⁺¹
Thank you!
@SNAKE1375 7 months ago
Thanks Josh, so what do you think about TMM instead of TPM? @@statquest
@statquest 7 months ago ⁺¹
@@SNAKE1375 Unfortunately I haven't been involved with high-throughput sequencing for a long time now, so I don't know the answer.
@bodhisattwabanerjee8936 8 years ago ⁺¹
Wonderful explanation.. So informative, yet explained so easily. Thank you very much. It was indeed a great help.
@Rd-lx8tu 3 years ago ⁺¹
This video is a life saver! Thanks a Million!
@statquest 3 years ago
bam! :)
@mrcoolgs100 1 year ago ⁺¹
Excellent work!!
@statquest 1 year ago
Thanks a lot!
@VenkatNagaraju 4 years ago ⁺¹
Nice explanation
@statquest 4 years ago
Thanks! :)
@asiyazhao3820 3 years ago ⁺¹
very clear explanation best ever
@statquest 3 years ago
Thank you!
@george543 8 years ago
Thank you for the clear explanation. You made it so straightforward and easy!
@glorybasumata7555 6 years ago ⁺¹
Awesome! Pretty well explained and coherent.
@statquest 6 years ago ⁺¹
Thanks!!! :)
@satu272 7 years ago ⁺¹
So good! Thank you, this really helps with my thesis.
@mrnotsoevil 8 years ago
Thank you! Finally a nice and easy-to-understand explanation!
@Adelphos0101 4 years ago ⁺¹
Excellent video!
@statquest 4 years ago
Thanks! :)
@rojinsafavi797 6 years ago ⁺²
Would you please elaborate on what length one should use if they have gene count instead of transcript count?
@statquest 6 years ago
Are you talking about the length of the RNA fragments that are sequenced? I don't think it really matters much either way, however, maybe longer fragments are better for transcript-level counting, since you want the fragments to span exons.
@rojinsafavi797 6 years ago ⁺¹
Thanks for your quick reply :-), and yes, for example if a gene has multiple isoforms I wonder which isoform length should be used for the normalization step. I guess, based on what you mentioned, the longest isoform length should be used
@statquest 6 years ago
If you are just counting reads per gene, I think most people use the longest isoform. However, if you are counting reads per transcript, then you just use that transcript’s length.
@tejasgohil9387 8 years ago
Most Most Useful. I had been beating my head trying to understand RPKM/FPKM for the last 3 days by reading and reading and reading!!! But this 10 min video did it without any confusion. Thank you very much.
@taraeicher4241 5 years ago ⁺²
Great explanation! Thank you!
@statquest 5 years ago
Thanks! :)
@williammo4450 4 years ago ⁺¹
This guy is amazing! So clear!
@statquest 4 years ago ⁺¹
Thanks! :)
@sumitkumar-el3kc 4 years ago
What does sequencing depth really signify? Does more sequencing depth mean higher expression? If so, why is normalization for depth required?
@statquest 4 years ago
For details on what Sequencing Depth means and why we need to normalize, see: ruclips.net/video/tlf6wYJrwKY/видео.html
@rayz1408 3 years ago ⁺¹
This is awesome!! Thank you!
@statquest 3 years ago
Glad you like it!
@lloydy272 8 years ago
Thanks for explaining this in a way I can understand. My only question, how do people manage with R/FPKM if it is so hard to compare between reps?
@maxfeng4532 7 years ago
Hey Joshua, thank you for the great video. Could you please explain why normalized counts are not for statistical test? the absolute values are changed by normalization but the ranks or the relative expression has not been changed... Is it because of isoforms? Thank you!
@blackV199 2 years ago
I have a question, shouldn't we use the effective length rather than transcript length? could you maybe make a video about that?
@statquest 2 years ago
I'll keep that in mind.
@blackV199 2 years ago
@@statquest Apologies, effective lengths can only be calculated when raw data is available (fastq files). Here you discuss processed data (count data). Regardless, it would be pretty awesome if you could discuss the data processing pipeline.
@guigaolin6825 3 years ago ⁺¹
Thanks for the video!
Btw, a paper titled 'Single-cell RNA sequencing technologies and bioinformatics pipelines' published in 2018 seems to borrow your idea as their Fig.3c and without any citation.
What do you think of that figure?
@statquest 3 years ago
You're totally right. Thanks for pointing that out to me.
@fmetaller 6 years ago ⁺¹
First I want to thank you for this great explanation.
There is a point I'm missing. All these normalization techniques assume that each type of cell analyzed produces the same amount of RNA and that all the differences we see are due to variability in sequencing depth. But is this true? Wouldn't it be a better idea to normalize the counts only against some housekeeping genes, like we do with qPCR?
@statquest 6 years ago ⁺¹
This is a great question. The reality is that when you do statistics on RNA-seq data, the normalization methods often use housekeeping genes. I explain how these normalization methods work in these videos: ruclips.net/video/UFB993xufUU/видео.html and ruclips.net/video/Wdt6jdi-NQo/видео.html
@fmetaller 6 years ago ⁺¹
Oh, thanks for the answer(s)
@jamshidkhorashad1998 4 years ago ⁺¹
This was great, thanks
@statquest 4 years ago
Glad you enjoyed it!
@Eduardrssl 4 years ago ⁺¹
Very nice vid!! Thanks!
@statquest 4 years ago
Thank you! :)
@steffimatchado8442 4 years ago
Thanks for the very explanatory video. It is really helpful for students like me. Could you please post a video on N50 values and how they are used to evaluate an assembly?
@rollieize 8 years ago
nicely explained!
@arpitachoudhury9788 4 years ago
Can you please make a detailed video on how limma+voom works
@statquest 4 years ago ⁺¹
I'll keep it in mind.
@yanggao8840 5 years ago ⁺¹
very helpful, thanks very much
@statquest 5 years ago
Thanks! :)
@王吉-q4k 4 years ago ⁺¹
Thumb up every video
@statquest 4 years ago
Thank you! :)
@carlagibbs3223 5 years ago ⁺¹
Excellent
@statquest 5 years ago
Thanks!
@krzysztofkolmus6936 6 years ago ⁺¹
Hi Josh,
Just a quick question regarding the TPM. What am I supposed to use as TPM input? Is it for the given transcript total transcript length (so exons, introns and UTRs) or just length of exons? Many thanks for help!
@statquest 6 years ago
It depends on how the sequencing is done. That said, most of the time, introns are spliced out of the transcript and are not sequenced, so you can exclude those from the length of the sequence. One sure way to know you're doing it right is to look at the alignments using a genome browser - then you'll see where the reads are mapping to - if it's just exons or exons + UTRs.
@easyasperl 8 years ago
So is TPM more like FPKM in the sense that it keeps track of paired end reads?
@sanjaisrao484 2 years ago ⁺¹
Thanks
@statquest 2 years ago
:)
@leixiao169 3 years ago
Great lecture. Thanks StatQuest! I wonder if DESeq2 automatically normalizes counts based on FPKM or TPM?
@statquest 3 years ago ⁺¹
For details on how DESeq2 normalizes reads, see: ruclips.net/video/UFB993xufUU/видео.html
@leixiao169 3 years ago ⁺¹
@@statquest thanks!
@LGARCIA20504 5 years ago
Very good man!
@MrDeking10 5 years ago
What are some typical TPM values? I got a lot of zeros in my dataset. However, there are a lot of values between 1 and 2, and some as high as 13. Thanks
@george543 7 years ago ⁺¹
Josh, could you help answer a question for me?
When normalizing to the total read count (the second step of TPM, after normalizing to gene length), is the total read count the sum of normalized read counts that are mapped to genes only? What about the reads that are not annotated? Thanks for your help!
@stemcell1167 7 months ago
Hello! I am supposed to do TPM normalisation of my counts matrix. Can I use the steps explained here as they are? Or should I use a tool or package?
@statquest 7 months ago
Usually a package will do this for you, but you can also follow these steps.
@biotechsampath 7 years ago
awesome explanation....thanks
@TheBloodyBeat 6 years ago ⁺¹
Thanks for the awesome video ! If I understood well, none of these metrics takes into account the amount of unmapped reads. So does comparing TPM across samples that aren't replicates (e.g. a few environmental metagenomes) make any sense ?
@statquest 6 years ago ⁺¹
You make a very good point. To be honest, TPM, FPKM and RPKM etc. are all just for convenience - they make the data easy to look at and get a general feel for. However, they are not used for any sort of "real" comparisons among samples. For example, DESeq2 and edgeR (and pretty much any other software that looks for differences between sets of "seq" samples) use completely different normalization strategies. These methods take into account that different samples might express different sets of genes - and some samples might not have many reads overall etc. So, my advice is to use edgeR or DESeq2 to normalize your data for you, rather than doing it by hand. I have videos that show how normalization works in edgeR: ruclips.net/video/Wdt6jdi-NQo/видео.html and DESeq2: ruclips.net/video/UFB993xufUU/видео.html if you would like more information.
@TheBloodyBeat 6 years ago ⁺¹
@@statquest Hi Josh, thanks a lot for your very helpful answer. I just watched your DESeq2 video and it looks indeed a lot closer to what I'm looking for than the TPM/RPKM/FPKM metrics. I'll dive into the details and try it on my data.
@statquest 6 years ago
@@TheBloodyBeat Hooray! :)
@krzysztofkolmus6936 6 years ago
Great video! Can anyone recommend an R package for TPM normalisation? Thanks a lot in advance!
@krzysztofkolmus6936 6 years ago
Joshua Starmer, thanks again!
@RonaldCutler 7 months ago
Now you should make a video on why you can't use these to compare genes between samples, only to compare genes to each other within a sample. Since TPM is a proportion, if one gene goes up in a sample, then the rest of the genes will seem like they are going down, when in reality they might be at the same level!
@statquest 7 months ago
I'll keep that in mind.
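The proportion effect described in this comment can be sketched with a toy example. The numbers below are invented, and the sketch assumes that the raw counts stand in for the true amount of each transcript and that all three genes have the same length. Only the third gene changes between the two samples, yet the TPM values of the other two genes drop.

```python
def tpm(counts, length_kb):
    # Length-normalize first, then scale so the sample sums to one million.
    rpk = [c / l for c, l in zip(counts, length_kb)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


# Hypothetical data: three genes of equal length.
length_kb = [1.0, 1.0, 1.0]
sample_1 = [100, 100, 100]
sample_2 = [100, 100, 400]  # only the third gene is more abundant

tpm_1 = tpm(sample_1, length_kb)
tpm_2 = tpm(sample_2, length_kb)

for i, (a, b) in enumerate(zip(tpm_1, tpm_2)):
    print(f"gene {i + 1}: TPM sample 1 = {a:.0f}, TPM sample 2 = {b:.0f}")
```

The first two genes have identical counts in both samples, but their TPM is halved in the second sample because the total they are divided by has doubled.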
@明坤宋 3 years ago
Hi, your video is very helpful! But if I only have the log2RPM data, how can I find the differentially expressed genes? Is there any way to convert the log2RPM data to count data?
@statquest 3 years ago
Not that I know of.
@nnzhou9493 4 years ago
Hey Josh, I used DESeq2 and got the list of significantly differentially expressed genes. Then I checked the TPM of those genes. Some genes' TPMs are quite low (< 1), some are quite high (hundreds or thousands). Should I use a TPM cut-off value to filter out the low-expression genes? If I have to do this, which cut-off value do you prefer? Any suggestions are welcome. Thank you!
@statquest 4 years ago
DESeq2 should do this filtering for you. For more details, see: ruclips.net/video/Gi0JdrxRq5s/видео.html
@areeniiitd 6 months ago ⁺¹
great video ngl.
@statquest 6 months ago
Thanks!
@zekihi6994 7 years ago
so good! Thanks.
@lilhedayat 4 years ago
why is it that longer genes will have more reads mapping to them? are longer genes more amplified or is it because the short fragment of reads can be mismapped?
@statquest 4 years ago ⁺²
Imagine I have mRNA transcripts for two different genes, Gene A and Gene B. The mRNA transcripts for Gene A are 300 bp long and the mRNA transcripts for Gene B are 900 bp long. Now, since the sequencer can only sequence 300 bp long fragments, I break all of the mRNA fragments into pieces that are 300 bp long. That means for each mRNA transcript for Gene A, we get one 300 bp long fragment to sequence. For Gene B, we get 3 fragments to sequence. In other words, we will sequence 3 times as many fragments for every mRNA transcript from Gene B than from Gene A. Does that make sense?
@lilhedayat 4 years ago ⁺¹
@@statquest it absolutely does!!! thankyou so much for explaining, I completely missed that! I always assumed that you would correct for this. I was under the assumption that, not the fragment, but the entire 900bp would count as 1 count by default.
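The Gene A / Gene B example from the reply above can be written out as a short sketch. It assumes an idealized model in which every transcript is cut into non-overlapping pieces of exactly the fragment length; the function name and the copy number of 10 are illustrative choices, not something stated in the thread.

```python
def fragments_per_transcript(transcript_bp, fragment_bp=300):
    # Idealized: a transcript is cut into non-overlapping, full-length pieces.
    return transcript_bp // fragment_bp


copies = 10  # hypothetical: same number of transcripts for both genes

for name, transcript_bp in [("Gene A", 300), ("Gene B", 900)]:
    per_transcript = fragments_per_transcript(transcript_bp)
    print(name, "fragments per transcript:", per_transcript,
          "total fragments:", copies * per_transcript)
```

With equal copy numbers, the longer gene still yields three times as many fragments, which is why raw counts are divided by length.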
@johnswenson6699 7 years ago
Hey Joshua,
Thanks so much for this video. I've a follow-up question: suppose I want to compare relative expression levels of gene A between two samples, but the tissue samples vary in size ... do these normalization methods take into account the fact that some samples will have more genes present than others?
As a hypothetical (but easy to visualize) example, suppose I cut off a hand, ground it up, and sequenced the RNA. This is sample 1. For sample 2, I cut off a different hand AND the attached arm, ground them all up, and sequenced the RNA. If I expected gene A expression only in the fingertips, would I be able to compare the two samples to uncover which sample had more expression of gene A, even though sample 2 had more (and more diverse) input tissue than sample 1?
In short, is there a normalization method that accounts for the fact that there may simply be a greater variety of genes being expressed in one sample relative to another?
Thanks again for this video. You explained these concepts better than any other source I've found!
@johnswenson6699 7 years ago
Brilliant. I didn't realize those programs included that kind of normalization ... Thanks a lot, sir. I'm going to watch those videos pronto!
@elzedliew972 3 years ago ⁺¹
statquest is an encyclopedia of ...
@statquest 3 years ago
bam! :)
@reafdaw01 7 years ago
You are pretty awesome! Thanks.
@eldorado.t 4 years ago ⁺¹
Awesome 😍 thanks
@statquest 4 years ago
Thanks!
@pythonsun996 6 years ago ⁺²
very good！
@statquest 6 years ago
Thank you! :)
@km2052 4 years ago
thx
@ejvik3238 3 months ago
For the TPM, why do we normalize by the gene length?
@statquest 3 months ago
Because the number of reads per gene scales by the length of the gene.
@ejvik3238 3 months ago
@@statquest Even if I sequence the transcriptome of a sample and I'm interested in how much or how little (if at all) genes are expressed?
@statquest 3 months ago
@@ejvik3238 yep
@ejvik3238 3 months ago
@@statquest I just watched one of your videos called "StatQuest: A gentle introduction to RNA-seq" so if I understand that correctly we have to divide by the gene length because we create fragments from the RNA to 200 - 300 bp to be able to even start sequencing. If so my question would be why don't we divide by the number of fragments instead?
@statquest 3 months ago
@@ejvik3238 The number of reads per gene is a function of the gene's length (because a 1kb long gene will create 5 200bp fragments and a 2kb gene will create 10) and its expression level. By dividing by the length, we can then determine expression level, which is what we are interested in.
@tinacole1450 3 years ago ⁺¹
love it...
@tinacole1450 3 years ago ⁺¹
even the corny songs.... because I know something good follows
@statquest 3 years ago
Thank you very much! :)
@尼安德鲁-n6j 9 years ago
Nice!
@anjalipatni2580 3 years ago
Sir,
My data do not have any replicates and it is paired-end data.
@statquest 3 years ago
bummer!
@shichengguo8064 4 years ago
Well explained, but I don't agree that TPM is better than FPKM
@statquest 4 years ago
Noted!
@omarmohammadibrahim2197 6 years ago
the start felt like the PPAP song :P
but everything after that was awesome :D
@IsaacXinPei 5 years ago
Does the title have a typo? TPM => FPM?
@statquest 5 years ago
I don't think there is a typo. The title is: "StatQuest: RPKM, FPKM and TPM". RPKM, FPKM and TPM are three (3) different ways to normalize high-throughput sequencing data.
@IsaacXinPei 5 years ago ⁺¹
@@statquest that's right, the first slide in the video says FPM, I think the slide has a typo
@statquest 5 years ago
Ah! You are correct! That's amazing. This video has been online for 4 years and you are the first person to spot that.
@IsaacXinPei 5 years ago ⁺¹
@@statquest no problem at all, the videos are very useful, thank you for all the hard work!
@joshua20199 1 month ago
Why isn't it TPKM? :/
@statquest 1 month ago
No idea!
@MBCOUGER 8 years ago ⁺¹
Thank you so much for this, I now no longer look like this when trying to explain this: imgur.com/gallery/iWKad22
@fmetaller 6 years ago ⁺¹
@hypno666pl 5 years ago ⁺¹
