False Discovery Rates, FDR, clearly explained
- Published: Jan 9, 2017
- One of the best ways to prevent p-hacking is to adjust p-values for multiple testing. This StatQuest explains how the Benjamini-Hochberg method corrects for multiple testing by controlling the FDR.
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - statquest.gumroad.com/l/wvtmc
Paperback - www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - www.amazon.com/dp/B09ZG79HXC
Patreon: / statquest
...or...
RUclips Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshirt.com/statques...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
#statistics #pvalue #fdr
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
My PhD dissertation relies heavily on bioinformatics and biostatistics, although my background is neuroscience. Naturally, I had a lot of learning to do, and your videos have helped me immensely. Every time I want to learn about a stats concept, I always type in my Google search, "[name of concept] statquest." Seriously, this is almost too good to be true, and I just wanted to thank you for providing this absolute gold mine.
Wow! Thank you very much and good luck with your dissertation.
You make without a doubt the best videos about statistics on RUclips: funny, clear, intuitive, visual. Thank you so much.
Thank you! :)
Totally second this...
God bless you, I made screenshots of this video to explain this concept to my lab. This isn't the first time you've helped me with RNA-seq procedures. I have bumbled through a differential expression analysis. Trying to understand the statistical methods and knowing which option amongst several is the most logical is a mental hurdle. I am the only student in my lab currently undertaking bioinformatics and I am essentially trying to teach myself. There is a huge vacuum of knowledge in this realm amongst biologists and it's daunting. We all can generate data until we're blue in the face, but it doesn't do anyone any good until someone knows how to analyze it.
Awesome! Good luck learning Bioinformatics.
BAM BAM BAM, thanks a lot man... Your 20 minutes most likely saved me hours of trying to understand this from Wikipedia...
Sweet!!! Glad I could help you out. :)
Can't thank you enough!! Your methods are truly amazing. Being able to deliver them to us so cleverly is a true indication of how much effort you must have put into understanding these concepts.
Wow, thank you!
Awesome explanation!! Thanks for taking the time to make these videos and also for answering viewers' questions so well. Going through them already answered some queries that I had :)
I'm from China and I watched your channel on Bilibili, but I couldn't get enough, so I followed you all the way and ended up here, a paradise of data science! Thank you Josh, wish you the best!
Wow, thank you!!!!
Wow wow wow how intuitive and visual. Can’t thank you enough for saving me from spending hours struggling to understand this concept🙏
You're very welcome!
Fantastic video, thank you for taking the time to put this together.
From the way my university teachers (didn't) explain to me Benjamini-Hochberg, and after watching this video, I can claim I now understand Benjamini-Hochberg better than them, at a 99.7% confidence level!
BAM! :)
Thanks for the awesome explanation! Really informative and easy to follow. And the DOUBLE BAM in the end actually made me laugh out loud :D
Awesome! :)
OMG!! This is the most beautiful explanation I've ever experienced...... Thank you so much professor.
Awesome!!! Thanks so much.
Simple, informative, and to the point. Absolutely perfect.
Glad you liked it!
I love you ❤️. I was so afraid of FDR adjustment because I thought the math behind was empirical and worked like magic but you made it surprisingly intuitive.
Thank you! :)
Thank you! This was really helpful and made me smile during my intense evening revision :)
Glad it helped!
I'm currently learning to do RNA-seq data analysis; these videos are extremely helpful.
First and foremost, I extend my heartfelt gratitude for providing such a series that elucidates concepts in an easily comprehensible manner. Bam !☺
Thank you!
This is amazing. Very well explained and easy to understand!
Glad it was helpful!
Cool, thanks for posting this, very intuitive! An equivalent method for eyeballing the # of true null hypotheses is to plot the ranked 1 - p-values on the x-axis and the hypothesis test rank on the y-axis, then fit a line to the scatter plot, starting at the origin. Where the line hits the y-axis is your estimate of the # of true null hypotheses. Would like to see an intuitive explanation for the Benjamini-Yekutieli procedure, used in studies where the tests are not completely independent!
I have always hated math and you just make it clear and interesting! Can't thank you enough
Hooray!!! I'm glad the video is helpful. :)
Great tutorial for FDR. The adjusted p-value is a p-value for the results that remain after cutting off the results you know are not significant just from the distribution. It would be better if you could say something about the q-value and how the q-value reflects the quality of an experiment.
This is simply great!!! Thanks for sharing Joshua.
This is the best video that explains FDR. Thank you,
I love you StatQuest. Thank you for never letting me down. You were always present to answer my deepest and most shameful doubts. You never abandoned me during the darkest hours of my PhD.
I'm so happy to hear my videos helped you. BAM! :)
Great video, your example was clear and very well illustrated.
The clearest explanation of BH correction so far. Quadruple BAM!
Wow, I was seriously struggling with my research since I don't know the first thing about statistics, and I love this so, so, so much. So instructional I had to like.
BAM! :)
Thank you very much indeed for the perfect explanations and examples of the FDR concept. I really got my answer.
Thanks!
As always, by far the best explanation on the web!
Thanks!
Great explanation, thanks! Clear, with an amazing balance between theory and examples.
Thank you!
Thank you so much for this great movie!! Great explanation.
As always, it is a great explanation. Thank you Josh 👏
Thank you!
I have to keep saying that I love this channel so much
Hooray!!! Thank you so much!!! :)
Thanks for your effort and simplified explanation!!! Life saver :))
Glad it helped!
1 thumb down is a case of FDR :)
So true! :)
Josh is a genius. Really appreciate your work statquest.
Thank you! :)
Nice video, simple and fast.
Thanks!
Just wow!! Thank you for this.
This was SUPER helpful, thank you!
Thank you! :)
Dude thanks so much, this video is AWESOME!!!
Hey Josh, love you videos on stats, specifically centered around hypothesis testing. Can you do more videos on the different techniques of hypothesis testing, like (group) sequential testing and multi-armed bandit?
I'll keep that in mind.
Very nice explanation!
I just love your videos. Thank you so much!
Thank you! :)
This is my first time fully understanding FDR ...
bam!
BAM!!! Finally I understand it, after it confused me for half a year!!
BAM! :)
The second half is hard to understand, but I know I will come back later and watch it again, and again, and again until I finally understand it.
Let me know if you have any specific questions.
Thanks, that was precious (and spared me hours of frustration)
Thanks! :)
Thank you very much for the explanation, very, very clear!!
Thank you very much!
Great explanations!
Thanks!
Thanks, Josh. Well done. A short and useful video.
Thank you!
It's so good, I want to give it more than one thumb up!
Double BAM! :)
Thank you sir, was very useful 🙏
Glad it helped
Nicely explained.
Thank you!
Nice explanation!
Thanks!
Thanks a lot, Mr. Joshua
Hi Josh! Great stuff here. Could you please make a video on "Significance Analysis of Microarrays"? Mainly how it differs from t-stat/ANOVA. Really appreciate you for all the videos.
I'll keep it in mind, but I can't promise I'll get to it soon.
This is a great video. Could you help me understand how the intuitive understanding (the histograms of p-values coming from two distributions) connects to the mathematical steps of the B-H procedure? Thank you!
Good explanation
This video is so beautiful.. Thank you so much
I'm glad you like it!
Thank you, bro!
I love the explanation!
Thank you! :)
Thank you, nicely explained
You are welcome!
Great channel and fantastic content! I am wondering if you could make an episode about IDR, Irreproducible discovery rate. It is difficult to find a good explanation or usage guide on it.
I'll keep that in mind.
Very nice video, and I learned a lot from it. The only thing is, when you give examples, you tell us that when you calculate p-values 10,000 times, the distribution of p-values will look like this or like that. But I don't know whether that's true or not. So I'm wondering, can you explain a little bit more, or is there any further reading I can do about p-values and adjusted p-values?
This is amazing. thank youu.
Thank you! :)
Would love a video about the target decoy approach
OK. I've added it to the to-do list. :)
This is very great!!!
Thank you!
One part I don't quite understand is how the intuitive eyeball method translates into the B-H p-value adjustments you explain starting at ~15:00. To me, plotting a line along the H0 = True p-values sounds like you would be fitting a linear regression & identifying the outliers < .05.
I don't understand one thing. If samples are taken from the same population, the p-value bins would NOT be evenly distributed; rather, the distribution would be skewed toward p = 1, because the data are normally distributed and, most of the time, samples close to the average are the ones likely to be picked.
By definition, p-values are uniformly distributed. By definition, a p-value = 0.05 means that 5% of the random tests will give results equal to or more extreme, a p-value = 0.1 means 10%, etc.
Thanks a lot!
This is crystal clear about FDR and the BH method, much clearer than what my professor said.
Thank you Sir🌹
Thank you!
AWESOME! Thank you!
:)
I'd like to know why, when samples come from the same distribution, the p-values are uniformly distributed? Thank you!
This is awesome. Imma save it for later reference hah
thanks josh!
You are welcome!!! I'm glad you like the video! :)
Awesome, this may be too niche but could you do a video on local FDR please?
Thanks for these videos! They are great!!
Can you help me understand the intuition behind why the p-values are uniformly distributed in the samples from the same distribution?
Think about how p-values are defined. If there is no difference, the probability of getting a p-value between 0 and 0.05 is... 0.05. And the probability of getting a p-value between 0.05 and 0.1 is also 0.05, etc.
BAMMMM! Thank you!
Hooray! I'm glad you like the video. :)
Your explanations are very helpful. Can you please make a long video where you discuss all other approaches like SPLOSH, BUM, Pound and Cheng methods, also a comparative explanation between them? I'm eagerly waiting for it. Furthermore, you can explain them with R.
Amazing!!
Thanks!!
i love how he made that joke about wild type with monotone lol
:)
Great video! Congratulations. I've seen the paper of Benjamini and Hochberg 1995, but (guided by my very limited knowledge of math) I was not able to find the formula in the way you explained. Please, could you give some clarifications on this issue, as some kind of transformation of the mathematical procedure? Thank you very much. Best wishes.
I'll keep that in mind.
I have the same questions. Did you figure out the logic behind the mathematical procedure? Thank you!
Hey, thanks for the video. Just a question: don't you have a higher chance of getting samples that come from the middle of the distribution than from the tails, resulting in more large p-values than small ones? I don't get why p-values are uniformly distributed. Thanks :)
You know, I found this puzzling as well. However, imagine we are taking two different samples from a single normal distribution. If we did a t-test on those samples, 5% of the time the p-value would be less than 0.05. Now imagine we created 100 random sets of samples and did 100 t-tests. 5 of those p-values will be less than 0.05. 10 will be less than 0.1, 15 will be less than 0.15.... 50 will be less than 0.5.... 90 will be less than 0.90, etc. This isn't a mathematical proof, but it makes sense - the whole idea of having any p-value threshold, x, is that we are only expecting x percent of the tests with random noise to be below that threshold. Thus, we have a uniform distribution of p-values.
Also keep in mind that when computing p-values for the difference between two sample means, p-values of .05 or less cover a wider range of x values than say p-values between .50 and .55.
@@statquest Wow, I had the same question as Ken. Thanks for giving this super intuitive explanation!
@@Tbxy1 me too! been struggling to understand that part and thank god Ken asked 😅
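The uniform-null claim in this thread can be checked with a quick simulation. This sketch is not from the video: it uses a two-sided z-test with known variance (rather than a t-test) so that everything stays in the Python standard library; the idea is the same - both samples always come from the same N(0, 1) distribution, so every test is a "true null" test.

```python
import math
import random

random.seed(42)

def two_sided_p(z):
    # Two-sided p-value for a z statistic under a standard normal null.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n, trials = 20, 10_000
pvals = []
for _ in range(trials):
    # Both "groups" are drawn from the SAME N(0, 1) distribution,
    # so the null hypothesis is true for every single test.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)  # each sample mean has variance 1/n
    pvals.append(two_sided_p(z))

# Under the null, roughly x% of p-values fall below any cutoff x.
for cut in (0.05, 0.25, 0.50):
    frac = sum(p < cut for p in pvals) / trials
    print(f"fraction of p-values below {cut}: {frac:.3f}")
```

Each printed fraction should land close to its cutoff, which is exactly the flat histogram of null p-values shown in the video.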
I think you previously talked about how to calculate the p-value for one sample set, which tells us how likely it is that the sample set belongs to the distribution. But here we are calculating the p-value for two sample sets and trying to tell whether they belong to the same distribution. How is it calculated? Or is it simply comparing one sample set to the distribution and then the other, and if they both likely belong to the same distribution we say we fail to reject the null hypothesis?
In this video I believe I'm using t-tests. To learn about those, first learn about linear regression (don't worry, it's not a big deal): ruclips.net/video/nk2CQITm_eo/видео.html and then learn how to use linear regression to compare two samples to each other with a t-test: ruclips.net/video/NF5_btOaCig/видео.html
great video
Thank you!
Awesome!!
Thanks!
Holy freaking nuts!! Thank you haha...
Yes! :)
Thank you for the intuitive video. I am awfully new to statistics so I have three questions: Suppose it is a classification problem 1. Are "samples" referred to as "classes" (types of genes) or is it samples of genes? 2. Will the null hypothesis be: there is no dependency between the gene and the samples? 3. Why 10,000 times? (I am bit confused what is relationship between 10,000 genes and 10,000 test as I understand for each test, the distribution plot is based on values of genes)?
1) I'm not sure I understand the question because we are trying to classify the expression as being "the same" or "different" between two groups of mice or humans.
2) The null hypothesis is that all of the measurements come from the same population.
3) When we do this sort of experiment, we test between 10,000 and 20,000 genes to see if they are expressed the same or different between two groups of mice or humans or whatever. So, for each gene in the genome, we do a test to see if it is the same or different. This allows us to identify genes that play a role in cancer or some other disease.
I truly love you...
Thank you! :)
Wonderful.
:)
Oh My Goodness! You explain very clearly! Why should I waste my time in the classes? But for the graphics part, I prefer something like 3B1B. Besides, I searched but I couldn't find any video about A/B tests? Do you have any? Thank you Josh!
I'm sorry you don't like my graphics.
@@statquest It's the best channel on RUclips about Statistics ever! ❤️❤️❤️❤️❤️
@@revolutionarydefeatism Thanks!
I am glad to see this video as I am doing some FDR tests in my project. I have a question: what if false positives remain after the adjustment? Is it still acceptable if the FDR is < 0.05?
You cannot eliminate false positives, but you can use FDR to control how many there are. So typically people call all tests with FDR < 0.05 "significant".
Hi Dr. Josh, I'm curious to get your thoughts on a simulation I'm running. It's very similar to the simulation in this video where you calculate 10,000 p-values by sampling from the same distribution.
When I run my simulation using a Welch t-test and n=3, only ~3.5% of p-values are less than 0.05. The percentage converges on 5% when I increase the sample size or use the Student's t-test.
It seems as though forgoing the equal variances assumption sacrifices some power, especially at low sample sizes. But I'm still trying to grasp why that is and what the implications are for using the Welch t-test with low sample size in real-life situations. For example, if the null hypothesis is that both samples come from the same population, then why not just assume equal variances and use Student's t-test all the time? (I know that last question is probably conflating some concepts that should be separate, but I'm having a hard time keeping track of it all, and I'm really interested to hear how you would respond to that question).
You seem to have a great way of explaining things like this intuitively. I'm curious to hear your thoughts.
Thanks so much! I've benefited greatly from your videos.
It makes sense to me that Welch's t-test has less power with low sample sizes because it makes fewer assumptions - and thus has to squeeze more out of the data by estimating more parameters.
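The simulation described in this thread can be reproduced along these lines (a sketch, assuming NumPy and SciPy are available; the exact fractions depend on the random seed and number of trials):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials, alpha = 3, 5_000, 0.05
welch_hits = student_hits = 0

for _ in range(trials):
    # Both samples come from the same normal distribution,
    # so every "significant" result is a false positive.
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:  # Welch
        welch_hits += 1
    if stats.ttest_ind(a, b, equal_var=True).pvalue < alpha:   # Student
        student_hits += 1

print("Welch false-positive rate:  ", welch_hits / trials)
print("Student false-positive rate:", student_hits / trials)
```

One observation that helps explain the ~3.5% result: with equal sample sizes, the Welch and Student t statistics are identical; only the degrees of freedom differ (Welch's Satterthwaite df is at most 2n - 2), so Welch's p-value is never smaller than Student's, and at n = 3 that conservatism is quite noticeable.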
omg, amazing!
Awesome!
Thanks!
Thank you so much.
Hooray! I'm glad you like the video! :)
I've been reading publications for an hour and you solved my problem in 10 minutes.
Awesome!!! This is definitely one of those things that's easier to "see" than to read about. Glad I could help. :)
Thank you very much. I think now is the time to step these videos up to applications in R or another statistical software.
I have a few videos in R and Python here: statquest.org/video-index/
Thank you for your very helpful video. I have one question: what I understood from the calculation of the FDR is that it will make only the smaller p-values still significant after the correction, am I right? (You suggested this at 12:09.) Nevertheless, I got distracted at 17:20 because there are smaller values in the red area that, based on this, would not be "false positives" if I got your explanation. Could you clarify this? Thank you :)
The numbers in the blue boxes are p-values that were created from two separate distributions. Some of those p-values are below the standard threshold of 0.05 and some are not. The ones that are not are "false negatives". The numbers in the red boxes are p-values that were created from a single distribution. Some of those p-values are below the standard threshold of 0.05 and some are not. The ones below the threshold are false positives. However, in this specific example, after we apply the BH procedure (at 18:02), all of the false positives end up with p-values > 0.05 and are no longer considered statistically significant, so the false positives are eliminated.
@statquest: Josh, Thank you. I have a follow-up though. Sure, we could adjust the p-values to reduce the False positives, but could this adjustment cause an increase in False negatives? Is there a way to quantify that? Apologies if I am missing something obvious.
There are different methods to control the number of false positives, some do a better job than others at keeping the number of false negatives small. FDR is one of the best methods for limiting both types of errors. In contrast, the Bonferroni correction is one of the worst.
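For readers who want to play with the BH adjustment discussed throughout these comments, here is a minimal sketch in plain Python (`bh_adjust` is a hypothetical helper name, not from the video; in practice you would use statsmodels' `multipletests(..., method='fdr_bh')` or R's `p.adjust(..., method='BH')`). Each p-value with rank i out of m tests is multiplied by m/i, and then a running minimum from the largest rank down keeps the adjusted values monotonic:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values:
    adjusted[i] = min over ranks j >= i of (p[j] * m / j), in sorted order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.02, 0.03, 0.20, 0.90]))
# → approximately [0.05, 0.05, 0.05, 0.25, 0.90]
```

Note how the three smallest p-values all end up at 0.05: the rank-3 value 0.03 * 5/3 = 0.05 caps the running minimum, so calling everything with adjusted p < 0.05 "significant" keeps the expected false discovery rate at 5%.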