Good video. Some additional background: the false positive (alpha, or Type I error) and the false negative (beta, or Type II error) must always be kept in mind while doing hypothesis testing. No matter which decision we make, there is always a chance we make one of these errors. The 0.05 is the alpha value: the false positive rate, the probability that the sample statistic lands in the most extreme 5% of the null distribution by chance. We must always decide on an alpha BEFORE running the experiment, otherwise we run the risk of p-hacking (picking an alpha that makes the results significant). The false positive is the 'Boy Who Cried Wolf' error; we say there is an effect when there really isn't. To avoid false positives we can lower the alpha, say to 0.01 or 0.001. But by lowering the alpha, we increase the chance of a false negative: saying the drug has no effect on weight loss when it actually does, and missing the chance of finding a useful drug to treat serious diseases. Replication is the part of the modern scientific method that helps decrease the probability of making both false positives and false negatives. After reading the comments: a meta-analysis is a systematic summary of all the replications done in a field. There is a difference between 'statistical significance' and 'business significance'. Strong Inference, the design of experiments to test two competing theories, is the best way to make sure your experiments are scientific. en.wikipedia.org/wiki/Strong_inference
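To see the alpha/beta trade-off above concretely, here is a minimal Monte Carlo sketch; the two groups of 50, the true effect of 0.4 SD, and the simulation counts are arbitrary assumptions for illustration, not values from the video.

```python
# A rough sketch: lowering alpha cuts false positives but raises false negatives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect, n_sim = 50, 0.4, 10_000   # assumed scenario, purely illustrative

def rejection_rate(effect, alpha):
    """Fraction of simulated two-group experiments with p < alpha."""
    rejections = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sim

for alpha in (0.05, 0.01, 0.001):
    fp = rejection_rate(0.0, alpha)             # Type I error rate (no real effect)
    power = rejection_rate(true_effect, alpha)  # 1 - Type II error rate
    print(f"alpha={alpha}: false-positive rate ~ {fp:.3f}, "
          f"false-negative rate ~ {1 - power:.3f}")
```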
I came here so ready to rip this video apart for uncritically disparaging p-values. This is why we watch videos until the end! Great video about proper applications of hypothesis testing, and some easy misapplications!
P-vals are based on a lot of assumptions. Abusing, misusing, or ignoring these assumptions is where we most often trip up and potentially abuse the concept.
P-values remain highly relevant but the ”magical” critical values 0.10, 0.05 and 0.01 should only be used in very specific cases, and not as some kind of universal measures of what is ”relevant” or not
Thanks for this. I think there is at least one additional thing wrong with the way we treat P values - namely we think 1 in 20 is a small probability. This is, in part, because cognitively we cannot intuit probabilities that are close to 0 or 1. There is a large difference between probabilities of 1 in a hundred and 1 in a million, but we just think both probabilities are small.
I just commented along a similar line. In the physical sciences it's not unheard of to achieve p values of .01 or lower. So I get very suspicious of research using "physical science" tools, e.g. GC-MS showing p values sometimes larger than 0.2, a 1 in 5 chance of a false rejection of the null hypothesis and claiming "strong" evidence for their alternative hypothesis.
Great instructional video. I particularly enjoyed the explanation that a low p-value merely indicates that the data would be unlikely if the null hypothesis were true, and that it does not say anything about the alternative hypothesis. I see this all the time in "independent" studies of traffic data - it does not matter whether the study is for low traffic neighbourhoods, cycling lanes, pedestrianisation or even road widening and similar to improve vehicular traffic flow - the same basic flaws are seen (do not get me started on the assumptions that the researchers use to even get to a significant p-value) - if the p-value is low enough the study concludes not just that the null hypothesis is rejected but that their alternative hypothesis must be true.
Thank you for another engaging and powerful video. When I first read "p-value is not scientific", I was pre-empting discussion of how significance levels themselves are arbitrary and seemingly without rationale. How did we come to decide that a significant event can be observed by chance once in only 20 random samples? (when considering p < 0.05 as significant; ditto for 10% or 1% significance levels). This is grounded in the central limit theorem, but whether expressed in terms of 5% or 1.96SD, the thresholds seem more convenient than scientific. Nevertheless, these standards are important for the universal interpretation and continuation of research and I'm glad for it, though directly interpreting the p-value as a probability may help to meaningfully discuss confidence in a result regardless of where it sits relative to the significance level. Looking forward to the next!
Wonderful video. Very well explained. My 'Engineering Statistics' course starts in a couple of weeks (my ninth year teaching this class) and I will post a link to this video on the course canvas site when we get to hypothesis testing as an excellent explanation of the meaning of the p-value and its strength and weaknesses.
Thank you for your brilliant explanation. The banning of the use of p-values by the psychology journal you mentioned suggests to me that (a) academics who publish in it are not competent in statistics and (b) the journal's peer reviewers aren't either. I recommend that competent researchers avoid submitting articles to such journals. As an aside, I have encountered more weakness in stats competence within the psychology discipline than in any other. Psychology had better watch out. It is on the cusp of being discredited by real scientists such as me who don't want them giving science a bad name. Dr G. PhD, MSc (Statistics) with distinction.
Thanks for the clear explanation. This is basically the 101 of statistics, yet many researchers have no clue about it. Notably, I was NOT taught this most important insight in my biostatistics course but only by my thesis supervisor, who had read Intuitive Biostatistics by Harvey Motulsky and therefore knew it.
Clear, concise, and *very* engagingly presented! You've got a fan. Regarding the content, I'll just 'second' many of the well-stated earlier comments: The problem isn't the p-value _per se_; it's the way in which it's misunderstood by many who perfunctorily apply it -- owed, in large part, to the imprecise way that it's taught in many Elementary Stats courses. (In that regard, good of you to take care to emphasize the essential "...a difference of [x] **or more**...")
Excellent explanation. We were always taught that a p-value was only used for an initial study, for supporting or not supporting further studies, because it is not accurate enough on its own to come to any satisfactory conclusion.
Thank you! That's a great point. A p-value should indeed be seen as just one piece of the puzzle in research. Many thanks for your feedback! Regards Hannah
This depends on the aim of the study. If the question is: whether something has an effect, then p-values are great. If the question is: why something has an effect, one needs more powerful machinery.
The easiest way to understand the p-value in this example, I think, is in terms of simulations: randomly assign fifty subjects to one group and the remaining fifty to the other group; calculate the difference in means between the two groups; repeat many times, and then determine the proportion of differences thus calculated that were at least as great as the actual difference encountered; that's approximately the p-value. As you say, it's not the p-value's fault that it is often abused or misunderstood in statistics, and it remains a useful tool for identifying potential effects. Great video!
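A rough sketch of that simulation view (a permutation test); the group sizes and the made-up weight-change data are illustrative assumptions only.

```python
# Approximate a two-sided p-value by random re-assignment of group labels.
import numpy as np

rng = np.random.default_rng(1)
drug = rng.normal(-2.0, 3.0, 50)      # hypothetical weight changes, drug group
placebo = rng.normal(-1.0, 3.0, 50)   # hypothetical weight changes, placebo group

observed = abs(drug.mean() - placebo.mean())
pooled = np.concatenate([drug, placebo])

n_perm, count = 100_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                               # random re-assignment to groups
    diff = abs(pooled[:50].mean() - pooled[50:].mean())
    if diff >= observed:                              # at least as extreme as observed
        count += 1

print("approximate two-sided p-value:", count / n_perm)
```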
Great video. One of the underlying issues may be that non-statisticians who depend on statistical results have, understandably, a very hard time grasping and accepting what these values actually tell them. In other words, as people want to know "the truth", it is already difficult to accept the notion of likelihood, let alone its twisted sister: an approximation of such a likelihood expressed in terms of the likelihood of the opposite not being true. From a practical standpoint it must be so confusing. All you want is a yes or no, maybe you still accept an "80% perhaps", but a "high confidence that there is a high probability that the opposite is not true" must be so confusing. Your video should be taught in every school, because the basic concepts are what most people struggle with.
*"high confidence that there is a high probability that the opposite is not true"* That must be the most convoluted sentence I've ever seen. If this is the language that you folks have to deal with on a daily basis, then I am sorry for you guys!
Thank you for clarifying the P-value meaning. Data makes sense when shared with a data story. Just a p-value for decision is like missing the context in an outcome.
Thank you! A comment on the 0.05 threshold pointed out by @HughCStevenson1 - 50 years ago, I was taught that it developed, along with much of modern statistical methodology, from the study of agricultural fertilizers; a farmer would accept one failure in their time in charge of the farm, typically 20 years, hence 1/20 chance of the observed benefit being by chance. With drugs, it strikes me that using a higher value for beneficial effects (particularly of a low cost treatment) would make sense, BUT a lower value for harmful effects. Would statins have been more or less widely prescribed if this had been done?
A very good explanation of the issue. It is important to emphasize that the rejection of the null hypothesis does not provide insight into the underlying mechanism. Pharmaceuticals often have secondary effects, and it is conceivable that the drug in question may have an unpleasant taste, prompting individuals to consume large quantities of water before eating. This increased water intake could result in reduced food consumption. Consequently, it may be more prudent to recommend drinking water before meals rather than advocating for a drug with unknown secondary effects.
There are two main considerations when using the t-test. First, the samples must follow a normal distribution, and second, the samples should have equal variances. To address these, start by performing a normality test, such as the Shapiro-Wilk test. Next, conduct a test for homogeneity of variances, like the F-test. If both conditions are met, you can proceed with the t-test. However, in my experience, many samples fail the normality test, necessitating the use of a non-parametric test. Non-parametric tests are generally more robust. Even if all three tests are passed, there's still a risk of being misled due to multiple testing. To mitigate this, it's important to adjust the p-value threshold each time you conduct an additional test. The problem is not the test itself, but the need for a more robust scientific methodology.
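A sketch of the decision flow described above, with one substitution: Levene's test is used here in place of the F-test for the variance check. The data, the `compare` helper, and the 0.05 screening threshold are placeholders.

```python
# Rough pipeline: normality check -> variance check -> t-test or non-parametric fallback.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, 30)
group_b = rng.lognormal(0.0, 1.0, 30)   # deliberately non-normal sample

def compare(a, b, alpha=0.05):
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        equal_var = stats.levene(a, b).pvalue > alpha      # homogeneity of variances
        result = stats.ttest_ind(a, b, equal_var=equal_var)
        name = "t-test" if equal_var else "Welch t-test"
    else:
        result = stats.mannwhitneyu(a, b)                  # non-parametric fallback
        name = "Mann-Whitney U"
    return name, result.pvalue

name, p = compare(group_a, group_b)
print(name, p)

# Bonferroni-style adjustment when the same question is asked m times:
m = 5
print("adjusted threshold for m=5 tests:", 0.05 / m)
```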
The assumption about the distribution of a population characteristic can also be a misleading factor. For example, in social phenomena we rarely observe normal distributions. I think this is an important piece of information to add to the critiques of the p-value, especially when we are trying to compare results from different studies.
Averages of whatever random variable tend to be normal, because they are sums. Variables that arise from many multiplicative random factors tend to be lognormal. Many times, we don't focus on the underlying original value, but on a composite, like an average. Makes sense?
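A quick illustration of that point, under the assumption of 30 skewed (exponential) factors per observation; all numbers are arbitrary.

```python
# Averages of skewed inputs look much more symmetric; products look lognormal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
factors = rng.exponential(1.0, size=(50_000, 30))   # skewed positive inputs

averages = factors.mean(axis=1)     # additive combination
products = factors.prod(axis=1)     # multiplicative combination

print("raw factors skew  :", round(stats.skew(factors.ravel()), 2))   # about 2 (skewed)
print("averages skew     :", round(stats.skew(averages), 2))          # much closer to 0
print("products skew     :", round(stats.skew(products), 2))          # very strongly skewed
print("log(products) skew:", round(stats.skew(np.log(products)), 2))  # near 0: lognormal-ish
```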
You are so awesome!!! Thanks for making such wonderful and informative videos! You have a divine gift for teaching complicated concepts in a stepwise and easy-to-follow manner. Much love and good wishes to you.
Thank you very much for your kind feedback!!! And we're pleased that you think we explain the topics simply. That's exactly what we try to do with a lot of effort. Regards, Hannah
Clear and well presented. The job of statistics is to identify sources of variability and put them all to common measure. That demands an appropriate modesty. The uncertainty is only half of what we need to make decisions. We still must reckon the potential benefit against the cost if we are wrong. We can not remove risk, only make considered judgements.
I just found your channel and I really appreciate that all the information is well organized in the description: a quick note on what the video is about, plus references.
Great video. Thank you. I have only one small nitpick to comment: a bigger sample size doesn't always translate to a more credible result; you also need to consider how the samples are picked. As an easy example, 100 randomly picked samples probably generate better results than 500 biased, hand-picked samples.
#HoldIt Sir Austin Bradford Hill stated several arguments in favor of causality. Eight of them have a refutation each. Only one can be considered a true "criterion", because it lacks a refutation: consistency! Consistency means that several or more of the studies point in the same direction. This is in the same line as the main argument in the refutation of p-values. It was written in the 1960s.
The other way of implicit p-hacking is to use the p-value as a stopping criterion for the experiment (never do this): you can just stop too early on a small sample that happens by chance to be quirky, and the p-value will delude you into thinking that what you have found is significant. It is also a problem of the frequentist approach, which operates on probability spaces in a way that does not conform well with how most people think; the Bayesian approach is more natural in this respect and does not penalize you for re-examining the data. However, it has its own dangerous caveats.
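To make the "stopping on the p-value" problem concrete, here is a rough simulation with no true effect at all; the peeking interval and the sample cap are arbitrary assumptions.

```python
# With no real effect, peeking at the p-value and stopping at p < 0.05
# inflates the false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sim, n_max, alpha = 5_000, 200, 0.05
false_positives = 0

for _ in range(n_sim):
    a, b = rng.normal(size=n_max), rng.normal(size=n_max)   # both groups: no effect
    for n in range(10, n_max + 1, 10):                      # peek every 10 subjects
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

print("false-positive rate with optional stopping:",
      false_positives / n_sim)   # well above 0.05
```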
The approach from machine learning goes along the same lines: not accepting the "first island" of accuracy and trying to go beyond it with more cases. Also, the train/test paradigm is something philosophically very useful. It helps prevent pitfalls and self-deception.
Got an idea. I’m not an overt statistician although have had some training and experience in such. Seems to me that p-value is missing something, and I think this could be the key to the criticisms you mention. That is, I think we also need a “Confidence” value OF the p-value. If people reported such a confidence in addition to the p-value, then I think that might address at least some of the issue people have with p-value. Again, great video!
00:07 Understanding the importance and calculation of P value
02:36 Understanding the significance of the P value in hypothesis testing
05:03 Understanding the significance of P values in hypothesis testing
07:38 Misinterpretations of P-values
10:16 P values can be misused, leading to low-quality research
12:52 P-values should be banned and not used in research
15:25 P-value combines effect size, sample size, and data variance for objective assessment
17:56 Importance of quality in research and statistical software
A very good presentation. I only wish that the video was started by stating that the P in P-value stands for Probability, then it would have been much easier for everyone to understand the concept. 🙏🏾
It should also be pointed out that 1) the correct application of "significance testing" relies on the use of power calculations to determine an appropriate sample size, and these calculations in turn rely on the (somewhat arbitrary) pre-specification of what would constitute a meaningful effect size, and 2) power calculations typically do not take into account subgroup analyses, which are critical for purposes of determining the extent to which the results of a given study are generalizable.
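A minimal example of such a power calculation using statsmodels; the "meaningful" effect size of 0.4 and the 80% power target are arbitrary assumptions, not recommendations.

```python
# Sample size needed per group for a two-sided, two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.8,
                                    alternative="two-sided")
print("required sample size per group:", round(n_per_group))   # roughly 99
```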
_"'By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.' Analysis of the data should not end with the calculation of a p-value. Depending on the context, p-values can be supplemented by other measures of evidence that can more directly address the effect size and its associated uncertainty (e.g., effect size estimates, confidence and prediction intervals, likelihood ratios, or graphical representations)."_ [American Statistical Association]
Econ Journal Watch publishes peer-reviewed articles on economics. In the current online issue (free) you will find the article "Revisiting Hypothesis Testing With the Sharpe Ratio." Although the focus of the article is on a measure of financial performance, not medical treatment performance, the author references research into medical statistics and thoroughly explains the pitfalls of just relying on the p-value.
Thank you for this clear and great reminder. Yes, carefully looking at the fully qualified context before reaching a conclusion (that you might re-visit under new data) is always important. What are the alternative option(s) proposed by the critics of p-value?
Just wanna share my thoughts on this: #1 Saying that rejecting the null hypothesis means the alternative is probably true is not an inaccurate description. Hypotheses are formulated such that they partition the parameter space. This means that once you reject H0, the only option left would be Ha. A more accurate description of the trueness of the alternative hypothesis would also consider the power of the test involved. So though the p-value alone does not paint a full picture, it's not entirely wrong to say that the data are in support of the alternative. #3 Variability is also accounted for in these tests. Highly variable data will yield less significant results, since the sampling distribution (and in turn the test statistic distribution) they produce would be flatter, resulting in a much smaller critical region (or even a wider acceptance region). Sample size is also accounted for in these statistical tests. The standard error of an estimator is a function of the sample size n. If n is small then the s.e. is large, yielding more non-significant test statistic values. I guess the issue is p-hacking, and due diligence or lack thereof.
I'm glad you put out #1 so well. #3 is true also. but it depends on what these naysayers are proposing to actually value after they trash out good ol' p
One needs to be careful about what Ha actually is, though. The Ha says that the observed difference was not due to chance. It does NOT say that the difference was caused by the drug you're testing!
@@gcewing Mathematically, the Ha is simply a partition of the parameter space. As for "The Ha says that the observed difference was not due to chance. It does NOT say that the difference was caused by the drug you're testing": you are right, the rejection of H0 means that the observed difference is unlikely under H0. Mathematically, none of these tests concludes causation. This is the purpose of the literature. Statistical tests and experimental designs are simply powerful tools that help verify this.
Not surprised that a journal in social psychology would ban the p-value. It is useful for those in the hard sciences. I have a PhD in nuclear magnetic resonance and most of my later work was in medical research. When in 2013 "skeptics" began proclaiming a "pause" in rising global temperature beginning in 1998, they would only trust RSS satellite data, which began in 1979, claiming surface data was corrupted. I would routinely point out the lack of statistical significance. The "pause" was dropped as a topic of discussion when the 1998 extreme El Niño year was balanced by the 2016 El Niño year. Mind you, the temperature trends before and after 1998 to 2016 were not statistically significant, but "skeptics" don't do statistical significance. I was told I was only trying to confuse people with those "plus minus thingies." The whole record is, and agrees with surface data within statistical significance. Ironically, surface data has lower uncertainties.
RSS satellite data version 4.0 (Skeptical Science temperature trend calculator)
1979-1998 Trend: 0.116 ±0.161 °C/decade (2σ)
1998-2015 Trend: 0.085 ±0.202 °C/decade (2σ)
1998-2016 Trend: 0.116 ±0.187 °C/decade (2σ)
1979-2024 Trend: 0.210 ±0.050 °C/decade (2σ)
Gistemp v4 surface data
1979-2024 Trend: 0.188 ±0.035 °C/decade (2σ)
Hadcrut 4 surface data
1979-2024 Trend: 0.171 ±0.031 °C/decade (2σ)
Berkeley surface data
1979-2024 Trend: 0.192 ±0.029 °C/decade (2σ)
Thank you for this video, but I was hoping you would expand a little more on p-hacking. Even though you did include some of its elements in the video (without actually naming it as such), I believe it warranted more. Perhaps a future video could be done on p-hacking specifically.
The main problem with using P-value is that the significant value cutoff is completely arbitrary. This is not a problem with P-value directly, but it is a problem with application. Why is 0.049 worthy of further investigation but 0.051 is not, or 0.01, or any other value?
@@rei8820 I am using "arbitrary" to mean not based on an intrinsic property; there is nothing more arbitrary than 0.05 because there is nothing intrinsically limiting about that value. Individuals each have their own criterion based on specific goals; you can't know what would be significant for all of those future readers at the time of publication. There is also the declining art of exercising judgement. Having the experience and knowledge to read a situation and a set of results to form an opinion is critical in all fields. I have seen far too many bureaucrats and papers that try to eliminate judgment with arbitrarily drawn hard lines and just end up with obvious misclassifications because their algorithm failed to include or properly weight some parameter, or couldn't account for complex interactions of parameters. The ability to step back, look at the big picture and think, "hmm, something is off. What is wrong with this picture?"
I was expecting to find included elements of Bayesian thinking as a solution to the problems with the focus on the p-value (as explained in 'Bernoulli's Fallacy' by A. Clayton). I listened to this book and it made a lot of sense to me, but it could use your clarifying touch to explain and/or dispel.
Thank you for your thoughtful comment! I'm glad to hear you found 'Bernoulli's Fallacy' by A. Clayton insightful. Bayesian thinking indeed offers a valuable perspective on addressing issues related to the traditional reliance on p-values. I will put it on my to-do list and try to make a video for it! Regards Hannah
@@ucchi9829 Why? Besides the ideological spin, many of his statistical arguments seem to make sense. But I'm not a statistician and am willing to be convinced that they don't hold up.
@@stefaandumez2319 I don't think he's a statistician either. One reason is his very unreasonable views on people like Fisher. There's a great paper by Stephen Senn that debunks his characterization of Fisher as a racist. The other reason would be his unconvincing criticisms of frequentism.
The video correctly states the definition of the p-value. What the p-value is not is any kind of sensible probability that people want to know. We want to know the probability of each hypothesis being the truth. People tend to think, wrongly, that the p-value is the probability that the null hypothesis is true. And then there's the arbitrary 5% level and the inherent bias towards the null hypothesis. We now have the technology to do better than p-values and we should move on.
It would be interesting to see a discussion of how sample selection bias and population/sample non-normality couple with use of the t-test and p-value to make these problems worse.
*Tests the use of vitamin C for the treatment of 50 cancers with a sample size of 10 patients for each cancer
*Finds an improvement in 1 cancer with p=0.0499
Headline: "VITAMIN C CURES CANCER, STUDY SHOWS!1!1!1!"
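A sketch of why that outcome is expected even if vitamin C does nothing: run 50 null tests and see how often at least one clears 0.05. All numbers here are illustrative.

```python
# With 50 independent tests and no real effect, at least one "significant"
# result is very likely (about 1 - 0.95**50, roughly 92%).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sim, n_tests, n_patients = 1_000, 50, 10
at_least_one_hit = 0

for _ in range(n_sim):
    pvals = [stats.ttest_ind(rng.normal(size=n_patients),
                             rng.normal(size=n_patients)).pvalue
             for _ in range(n_tests)]
    if min(pvals) < 0.05:
        at_least_one_hit += 1

print("P(at least one 'cure') ~", at_least_one_hit / n_sim)
print("Bonferroni-adjusted threshold:", 0.05 / n_tests)
```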
This content is Statistics 101, so it is staggering that it has to be repeated for the benefit of people who are so ignorant that they have no business doing research in the first place.
Based upon your video, it seems that the criticism/rejection of the p-value is really about the fact that a single study won't be conclusive anyway: so what's the point of calculating a p-value (given that, as you said, its function is to generalize one's findings beyond the sample)? So, it seems that the critics are saying (though you are not showing that criticism that way) that the researcher should stop at reporting the sample-based findings instead of generalizing them (via the p-value). As for generalization, that can be done via meta-studies down the line when enough individual studies self-restricted to their samples have accumulated. I hope you could comment on this.
Thank you for your insightful comment! You raise a crucial point about the limitations of p-values and the role of individual studies versus meta-studies. The primary criticism of p-values is that they can be misleading when used in isolation. Meta-analyses can be incredibly valuable for generalizing findings across multiple studies: by combining results from different studies, they can provide a more comprehensive understanding of an effect and mitigate the limitations of individual studies. Of course, a meta-analysis requires many individual studies, and it is important that each of them reports all the relevant figures so that researchers can combine them later. We will discuss this topic in part in the following video. Regards Hannah
She did not really grasp the issue. Classical statistics does not answer the questions people are likely to want to ask, but rather some nonsensical, recondite questions. The p-value does not have the meaning people think it has. What it means is: (1) if the null hypothesis is correct, and (2) you repeat the same experiment many times, then (3) a certain proportion of the experiments will yield an effect size equal to or larger than the observed one. There are several weird assumptions here.

First: "If the null hypothesis is correct…" What if the null hypothesis is wrong? The p-value is only meaningful if the null hypothesis is correct. But since it can be wrong, we never know if the p-value is valid or not. It then follows it can only be used to reject a true null hypothesis, but not a false one, which is nonsense. The null hypothesis might be false, and if it is, the p-value is invalid. It is a logical fallacy. Traditionally the p-value was thought of as "evidence against H0". Consider a "q-value" that is similar except it is "evidence against HA": now we assume HA is true and compute the fraction of equal or smaller effect sizes in an infinite repetition of the experiment. In general p + q ≠ 1; in fact we can even think of a situation where both are small.

Second, the p-value assumes we repeat the same experiment many times with a true null hypothesis. Only an idiot would do that. So we do not need to calculate this, as we have no use for this information.

Third, it takes into account larger effect sizes than the one we obtained. We did not observe larger effect sizes than the observed one, so why should we care about them? In mathematical statistics this means that the p-value violates the likelihood principle, which is the fundamental law of statistics. The likelihood principle was, ironically, discovered by the inventor of the p-value. The likelihood principle says that statistical evidence is proportional to the likelihood function.

Fourth, if you fix the significance level to 0.05 and run a Monte Carlo, the p-value will on average be 0.025. It is inconsistent.

The summed effect of this weirdness is that the p-value can massively exaggerate the statistical evidence and is invalid for any inference if the null hypothesis happens to be false. In conclusion we cannot use it. There is a silly answer to this: what if we compute the probability that the null hypothesis is true, given the observations we have made? That is what we want to know, right? Can we do this? Yes. Welcome to Bayesian statistics. The theory was developed by the French mathematician Laplace, a century before the dawn of "classical" statistics and p-values. There was only a minor problem: it often resulted in equations that had to be solved numerically, by massive calculations, and modern computers were not invented. Classical statistics developed out of a need to do statistics with the tools at hand around 1920: female secretaries (they were actually called "computers"), mechanical calculators that could add and subtract, and slide rules that could compute and invert logarithms. With these tools one could easily compute things like sums of squares. To compute a square, you would take the logarithm with the slide rule, type it in on the calculator, push the add button, type it in again, pull the handle, and use the slide rule to inverse-log the number it produced. Write it down on a piece of paper. Repeat for the next number. Now use the same mechanical calculator to sum them up. Et voilà, a sum of squares.
The drawback was that classical statistics did not answer the questions that were asked, but it was used from a practical point of view. Today we have powerful computers everywhere, and efficient algorithms for computing Bayesian statistics have been developed, e.g. Markov Chain Monte Carlo. Therefore we can just compute the probability that the null hypothesis is true, and be done with it. The main problem is that most researchers think they do this when they compute the p-value. They do not. Convincing them otherwise has so far been futile. Many statisticians are pulling their hair out in frustration.

Then there is a second take on this as well: maybe statistical inference (i.e. hypothesis testing) is something we should not be doing at all? What if we focus on descriptive statistics, estimation, and data visualization? If we see the effect size, we see it. There might simply be a world where no p-values are needed, classical or Bayesian. This is what the journal Epidemiology enforced for over a decade. Then the zealot editor retired, and the p-values snuck back in.

Related to this is the futility of testing a two-sided null hypothesis. It is known to be false a priori, i.e. the probability of the effect size being exactly zero is, well, also exactly zero. All you have to do to reject any null hypothesis is to collect enough data. This means that you can always get a "significant result" by using a large enough data set. Two-sided testing is the most common use case for the p-value, but also where it is most misleading. In this case a Bayesian approach is not any better, because the logical fallacy is in the specification of the null hypothesis. With a continuous probability distribution a point probability is always zero, so a sharp null hypothesis is always known to be false.

This leads to a common abuse of statistics, often seen in the social sciences: obtaining "significant results" by running statistical tests on very large data sets. Under such conditions, any test will come out "significant", and then be used to make various claims. It is then common to focus on the p-value rather than the estimated effect size, which is typically so small that it has no practical consequence. This is actually pseudo-science. This is a good reason to just drop the whole business of hypothesis testing and focus on descriptive statistics and estimation.
@@sturlamolden Thanks for the detailed text and info. I always found the p-test so empty... It starts with the assumption that a 5% threshold is a good target, based on nothing! I was always challenged by the idea that the p-value cutoff was someone's choice and everybody adopted it without ever questioning its real meaning! At best it is used to create steps between small groups to classify them, losing the infinite possibilities in the big population.
@@sturlamolden I don't understand what you propose. While Bayesian methods are very powerful, they require a probability distribution of the effect before the experiment. Generally there is no objective way to summarize all the evidence available before the experiment in a probability distribution. Therefore the probability distribution of the effect size cannot be calculated after the experiment, regardless of how much computational power you have. I think a Bayesian example calculation can in many cases help the understanding of the results and complement the p-value. The Bayesian results will, however, always depend on more arbitrary assumptions than the p-value.
You named it in passing: causality shouldn't be taken for granted. Let's say a drug has a far-fetched biochemical effect; a certain effect can then be mistakenly attributed to it in full, because there are background interactions/noise which haven't been factored in, rendering either conclusion void.
I went to an environmental chemistry talk at an environmental justice conference that analyzed dust collected under couches in low income homes. A graph of flame retardant concentration vs cancer incidence was shown to support banning flame retardants. The statistical output from Excel indicated that 1) there was about 40% chance the correlation was coincidental and of course 2) gave no support to the causation since the ONLY thing they analyzed for was the flame retardant. To me, this was an irresponsible presentation aimed at an unsophisticated audience by a researcher who, by their own admission, arrived at a conclusion before collecting the data and were unwilling to let poor statistics get in the way of the party line.
9:02 If the only results ever published are those with low p-values, then when replication studies are done, only those with low p-values will be published too. This means we have an unrepresentative sample of published studies informing our assessment of hypotheses. After enough time and studies, eventually all null hypotheses will be proven wrong unless we start publishing studies with negative results.
Add a deep-knowledge course on p-values, p-hacking, removing data points, finding outliers, and including reports on outliers removed to any research position. Discussing the p-value is almost an umbrella now for ethical reporting and interpretation of trends in a data set.
Excellent explanation of a complicated topic. Do you think some companies p-hack by using very large sample sizes? A large enough sample (say, 10000 or more) will often result in significant p-values even if the effect size is minor, but they can make claims in their advertising, like, "Our flu medicine is proven to significantly reduce the duration of flu symptoms." The fine print has to state what they mean about reduced duration (I saw one that said half a day).
Having a large sample means that the result would be closer to the truth. It's also common practice to include more samples, given proper adjustment in the statistical tests to account for the increase in the type 1 error rate. See: adaptive bioequivalence. The deception here is not the p-value, but rather the non-reporting (or misreporting) of the effect size.
Some things: I'm aware of situations like residuals testing, where doing a hypothesis test actually ends up incentivising smaller samples. In this case, p-values might be problematic. If you're gonna criticise p-values for incentivising researchers to come up with ad hoc null hypotheses, it's probably fair to criticise Bayesian methods for incentivising ad hoc prior distributions. You can't do Bayesian stats without loading on the assumptions about the prior distribution.
The original, great article was "The significance of statistical significance tests in marketing research" by Sawyer and Peters in 1983. Another abused measure is using R2 instead of S.E. and b.
The trouble with abandoning the p-value and calling it quits is that it won't remove the incentives in research that lead to p-hacking. So any replacement will still be subject to the same bad incentives; you might just end up in a situation where people will be p-replacement-hacking. Also, the significance level is set at 0.05 quite arbitrarily; a discussion could be had about whether we need to define stricter cutoff points for significance (e.g. 0.01, 0.001, or even the 5 sigma that is used in astronomy, for instance).
Thanks for the great video! P-values are essential, but they have limitations. For example, in heterogeneous populations, p-values might miss significant effects because they can't differentiate between subgroups. Another issue is with limited data: small but crucial contributions might not show significance due to insufficient sample size. In such cases, like with AI models or complex biological processes (e.g., protein folding), alternative approaches can demonstrate significance where p-values fall short. So, while useful, p-values shouldn't be seen as an absolute measure of validity. Moreover, they should not be demanded as unique method required for a proof, without a rationale of why that is required. Appreciate your work on this important topic!
Most of these are problems with study design rather than with the use of p-values. A p-value just gives you a sense of the reliability of the results of specific types of experiments. It isn't a tool to go fishing for information within a random pool of data, nor is it a tool for model evaluation.
@@davidwright7193 I agree that this is a different issue. However, the real problem in scientific (molecular biology) journals is that many reviewers lack strong statistical knowledge and focus mainly on finding asterisks (***), leading researchers to rely on p-values or even engage in p-hacking.

Confidence intervals, on the other hand, are more informative and visually explanatory. They not only show the direction and variability within each individual in a cohort but can also provide insights into potential subpopulations, though identifying these typically requires further analysis. With enough control data, you can make statements like "6 out of 10 cases show significantly higher values of protein X, and 2 show significantly lower values compared to the control cohort with 95% confidence." In this scenario, a t-test might yield non-significant results due to case variation, yet this finding could still be biologically significant, suggesting possible activation states.

It's important to understand that statistical significance and biological significance are related but distinct concepts. There are processes that are biologically significant but not statistically significant, and vice versa. This disconnect highlights the dangers of over-reliance on p-values, as it can lead to the acceptance of papers based solely on statistical results without considering their biological relevance.

Therefore, while banning p-values isn't necessary, doing so in some journals could encourage reviewers to apply critical thinking rather than just looking for asterisks. They will still need to find robust evidence to accept or reject a paper, ensuring that the biological significance is not overshadowed by purely statistical considerations.
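A small sketch of the confidence-interval reporting suggested above, using made-up measurements; the cohort sizes and values are placeholders, and the degrees of freedom are the simple pooled ones rather than the Welch approximation.

```python
# Report the estimated difference with a 95% confidence interval,
# which shows direction and magnitude, not just "significant or not".
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
control = rng.normal(10.0, 2.0, 12)
treated = rng.normal(13.0, 2.0, 12)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
df = len(treated) + len(control) - 2          # simple pooled df; Welch df would differ slightly
ci_low, ci_high = stats.t.interval(0.95, df, loc=diff, scale=se)

print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```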
Well, this was a nice explanation of a lot of statistics, but it also revealed how misleading the p-value can really be. To get to the core of the problem, you have to understand that there are basically no two natural things you could compare that do not have the slightest difference at all. Choose the sample size of your test big enough, and that difference will show up as (statistically) significant. So, ironically, the fact that a particular difference in the data is statistically significant tells you MORE about potential relevance if the sample size was small. Why? Because with small sample sizes, significance can only happen as a result of relatively big and relatively consistent differences.

If we care about relevance and meaning, p-values are just the wrong tool. You can use them, individually consider all the other factors and then make an educated guess about relevance, but why should you? There are better tools (Hedges' g, Cohen's d) for that purpose. That does not mean that p-values are worthless. A high p-value instantly tells you that any difference you measured would likely also have occurred as a result of random chance. That's good to know. At least it keeps scientists from talking too confidently about a lot of effects they supposedly measured, which could easily be random variation in the samples, and it is fine to routinely use them for that.
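A sketch of that point: with a large enough sample, a trivial difference reaches a tiny p-value while Cohen's d stays negligible. The sample size and the 0.02 SD difference are arbitrary assumptions for illustration.

```python
# Large n: "statistically significant" yet practically irrelevant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.02, 1.0, n)          # true difference of 0.02 SD: negligible in practice

t = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print("p-value  :", t.pvalue)              # usually well below 0.05
print("Cohen's d:", round(cohens_d, 3))    # around 0.02, i.e. practically nothing
```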
There's probably some of that. Part of the problem, though, is that human analysis is multivariate. When designing an experiment, it is best to change one thing and measure the effect. Humans have a variety of things and others who influence them to varying degrees. It's very difficult to design experiments to eliminate those effects. It's essentially analogous to the many-body problem in physics.
Those who blame the mistreatment of the p-value are no different from those who seek scapegoats. This is a natural phenomenon, I proclaim. 100 years ago, the average researcher could read, analyze and digest all the publications in their own and related fields. There have never been so many people on this planet as today. Even if we assume the percentage of academia in the population stays the same (it's increasing), we arrive at an unprecedented number of researchers. With the exponential growth of publications, researchers either have to adjust their toolkit (i.e. the scientific method) or adapt otherwise. May AI help us.
An additional criticism I would add is people using P-values from tests for normal data on non-normal data. They get irritated when you point out the comparison is invalid and they used the wrong test method.
I loved the video, informative and engaging. But I think you are also missing another criticism: the problem is not the p-value per se but the comparison to an arbitrary threshold to make a decision.
If you like, please find our e-Book here: datatab.net/statistics-book
Why the p-value fell from grace: “because people abused the crap out of it to state flatly false shit”
Well put!! I wrote "p-value is useless, because 1) people use it as a bad answer to the wrong question; 2) people think it's magic; and 3) people are unethical". But forget about it, you've nailed that!
Pretty much. I don't like this way of thinking - people don't understand p-values and how to interpret them, therefore we must do away with p-values.
No, just freaking learn to use your brain and how to interpret data!
You may be right. My personal opinion however is that the main problem is improper use by unethical or incompetent people (or both). Proper use of the p-value is highly valuable (pun intended) for certain problems. Thus, the best solution IMHO is enforcement of proper use (which unfortunately journals don't seem to care for) combined with evaluation of alternate and complementary analyses when practical.
Goodhart's Law (originally an economics thing, but I think it applies to the magic p-value too).
@@brentmabey3181 kinda like how any stable biological or stable psycho social aspect of life will be used to maintain and advance the system of contained opposition within larger counter insurgency tactics for a system to maintain power. It’s like a sort of 4 dimensional chess figuring this out as reasons for things can exist on different levels but the higher levels always build on top of what already existed and simply amplify it
I’m a PhD and I’ve worked in the pharmaceutical industry for 25 years. It never ceases to amaze me that physicians lack a basic understanding of statistics.
Physicists also usually don't know much about inferential statistics. They just look at how well a theoretical curve fits the experimental data.
Even worse in social sciences. But then their expertise isn't in mathematics. Think it's good to scrap it since it's pretty much used to give a veneer of being scientific while it's often really not.
@@KarlFredrik I was about to say the same. It's painfully obvious when reading published studies from the social sciences that most social science researchers don't understand the statistics they use. You'd think they'd swallow their pride and ask someone from the statistics department for advice. But no, they're the "experts." Makes me question what else the "experts" are blind to, but don't question their expertise or you're "anti-science"
Microbiologist here. Same! I'm blown away by how often I ask someone why they used the statistical test that they used and their answer is "it's what my advisor wanted" or "it's what I've used historically". That's not how you're supposed to choose your statistical test.
Most academics don't understand basic statistics.
The problem is not the p-value; it is understanding that statistics is only a tool based on math, not a medical conclusion. We assign numbers to the project and then assume that the calculation is the answer. But the calculation is only the answer to the numbers we have assigned, and that is not necessarily the medical question. Assigning numbers is based on assumptions, sometimes good assumptions, other times completely wrong ones, which basically means garbage in, garbage out.
Not to mention that people were using statistical tests incorrectly, without understanding their data, just doing a one-tailed test or whatever; there are criteria!
The art of good judgment is a necessary skill.
Fantastic video. Your clarity of statistical concepts is only bested by your rare capability of simplifying complicated concepts into lucid chunk-sized explanations. Thank you for your amazing work!
Hi! Thank you very much for the nice comment! Yes, we make a lot of effort to explain everything as simply as possible, but it's also a hell of a lot of work.
I think some of the bias against the p-value, especially perhaps in the social sciences, has come from the prevalence of p-hacking. This, as you mentioned, is basically trawling through the data, finding something with a "good" p-value, and then testing the appropriate null hypothesis (which, by construction, you can then reject!). While this is, as you said, a bad scientific approach, I assume the most honest among them think it's OK because "there really is something significant there, see!" without realizing that it breaks the underlying model (when there is no real effect, about 5% of tests will still come out at p=0.05 or below by chance).
Your solution to some of the ills, replication studies, is a problem in some areas, as in academia there is often little reward for repeating a study, and if it does find the same result as the original, it may be hard to publish ("We already knew that"). Perhaps for larger scale things, like drug treatments, this is less of an issue, but even there a company is more likely to test their "me too" drug to show its effectiveness rather than repeat studies of older treatments (in case they turn out to show the old is perfectly effective!).
Thank you very much for your detailed feedback! I completely agree with you! Difficult topic, but the p-value itself is probably not to blame for that. We will try to address some of the solutions in a following video, but part of it is also replication. Thanks and Regards Hannah
The failure of many "sciences" to replicate studies is a basic one, calling into question any claim to be science at all.
In my view the root cause is the whole science journal system, where publishing houses publish for profit work that has already been paid for and done by academics. They don't publish confirmation reports, nor do they publish reports where "we tested this idea but it didn't work".
It's not just social sciences, any field with science in its name is also suspect. Computer Science, for example. As rife with fraud as any "soft" science.
The other issue is this: with Big Data and its huge number of observations, many, many relationships can be seen as statistically significant (not attributable to random chance) when those relationships are merely artifacts of sample (or population) size. The issue of p-value harvesting is also in play here, but no one with any professional ethics and statistical integrity would accept such findings outside of a rigorous multivariate analysis that reveals the magnitude, i.e. strength, of the presumed or indicated relationship, for example by using a simple regression coefficient.
The solution to multiple comparisons was laid out back in 1991 in the fledgling journal Epidemiology in an editorial by the prominent methodologist Charles Poole entitled "Multiple comparisons? No problem!".
@@kodowdus I'm fairly certain that there were known solutions to the problem of multiple comparisons decades before 1991. This is the problem of academia being so insular.
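For readers who want the mechanics, here is a rough sketch of one standard correction for multiple comparisons, Holm's step-down procedure; the p-values below are invented purely for illustration.

```python
import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down procedure: flags which hypotheses can be rejected
    while controlling the family-wise error rate at `alpha`."""
    p = np.asarray(pvals)
    order = np.argsort(p)                 # test the smallest p-value first
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):  # threshold relaxes at each step
            reject[idx] = True
        else:
            break                         # stop at the first failure
    return reject

# Example: 20 tests, one genuinely small p-value among noise.
pvals = [0.001, 0.04, 0.20, 0.51] + [0.7] * 16
print(holm_bonferroni(pvals))             # only the 0.001 survives correction
```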
Another reason I've seen for abandoning the p-value (especially from the Bayesian crowd) is that p-values don't answer typical research questions. Critics say that researchers (implicitly) want to know the probability of their hypothesis given their data. That is, they want the posterior probability of their hypothesis. But p-values are not posterior probabilities. The p-value gives you the probability of data at least as extreme as those observed conditional on the hypothesis. That looks backwards relative to the goal of figuring out the probability of the hypothesis. The criticism is related to the misinterpretation problem: If a researcher wants to know the probability of their hypothesis, they may be more likely to misinterpret the p-value as a posterior probability.
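To make that criticism concrete, here is a tiny back-of-the-envelope calculation (the prior, power, and alpha are assumptions chosen for illustration, not values from the video) showing how the posterior probability that a "significant" finding reflects a real effect can differ sharply from what the p-value threshold seems to promise.

```python
# How often is a "significant" result actually a real effect?
# Assumed inputs: prior chance a tested hypothesis is true, study power,
# and the usual alpha.
prior = 0.10    # 1 in 10 tested hypotheses is actually true
power = 0.80    # probability of detecting a true effect
alpha = 0.05    # false-positive rate when the null is true

p_sig = power * prior + alpha * (1 - prior)     # P(significant result)
posterior = power * prior / p_sig               # P(effect is real | significant)

print(f"P(real effect | p < alpha) = {posterior:.2f}")   # about 0.64 here
```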
The problem is though the original question you state cannot be answered altogether, neither by Frequentist nor Bayesian approaches. To have posterior, one must have a prior. But there is no way in the universe of knowing a prior on a scientific hypothesis. Heck, something as established as universal law of gravity was proven wrong by Einstein's theory of relativity. The data collected over thousands of years of observation was right, the fit was excellent, but the hypothesis was still wrong. P-values are great for quick and dirty identification of potential candidates for effects in a giant pile of data. They tell you that something is off and might be interesting, but don't tell you what exactly. Bayesian methods are great at delving deeper, and infusing prior knowledge about the function of the world into a causal model. P-values are thus better for exploratory analysis, and Bayesian for confirmatory, IMHO, as our ultimate goal is to get to a causal model of how something works, not whether it works. But expecting poor biologists to do Bayesian modeling for every experiment is excessive, as there is a significant overhead in complexity.
@@ArgumentumAdHominem This. The most vocal Bayesian supremacists always strike me as way overconfident about having solved an immortal problem that will always be with us.
The p-value answers a specific hypothetical question no one was really interested in. Bayes gives a fake answer to an ill-formed question we *only wish* we could ask and answer. Choose your poison.
Heck, a lot of the time, if your prior probability is assigned 0.5, Bayes and the p-value are *identical*. The interpretation is heuristic at best, with the double-edged convenience and danger that comes with it.
@@ArgumentumAdHominem what sort of Bayesian overhead are you talking about?
@@DaveE99 Doing a t-test requires somewhat less skill and time than designing a DAG model, fitting it and interpreting results.
"The p-value does not answer a scientific question" is the most "scientific gatekeeping" statement I have read in a long time.
I generally align with Bayesian statistics. My feeling about p-values is that they should not be used to draw conclusions in hypothesis tests, but they are useful in getting a gut feel for the implausibility of data given some hypothesis. I find them useful in evaluating my priors: "My intuition tells me that a p-Value of 1% corresponds to a posterior probability of 75% for the alternative hypothesis so I should select a prior for which this is the case." They're also useful as a quick gut check: if your p-Value is so small you need to use scientific notation to express it you know ahead of time more or less what a full Bayesian analysis will conclude.
Good video. Some additional background.
The False Positives (Alpha or Type I error) and False Negatives (Beta or Type II error) must always be kept in mind while doing hypothesis testing. No matter which decision we make, there is always a chance we make one of these errors.
The 0.05 is the alpha value. Alpha is the false positive rate: the probability of rejecting the null hypothesis when it is actually true, i.e., the chance that the sample statistic lands in the most extreme 5% of the null distribution purely by chance. We must always decide on an alpha BEFORE running the experiment, otherwise we run the risk of p-hacking (picking an alpha that makes the results significant).
The false positive is the 'Boy Who Cries Wolf' error; we say there is an effect when there really isn't. To avoid the false positive we can lower the alpha, say to 0.01 or 0.001. But the problem with this is that by lowering the alpha, we increase the chance of a false negative. A false negative is when we say the drug has no effect on weight loss when it does have an effect. We miss the chance of finding a useful drug to treat serious diseases.
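A quick simulation of that tradeoff (the sample size, effect size, and number of runs are my own assumptions, purely for illustration): lowering alpha suppresses false positives but inflates false negatives.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect, runs = 30, 0.5, 5000     # assumed group size and effect (in SDs)

def rejection_rate(effect, alpha):
    """Fraction of simulated two-sample t-tests with p < alpha."""
    hits = 0
    for _ in range(runs):
        a = rng.normal(0, 1, n)
        b = rng.normal(effect, 1, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / runs

for alpha in (0.05, 0.01, 0.001):
    fp = rejection_rate(0.0, alpha)               # false positives (null is true)
    fn = 1 - rejection_rate(true_effect, alpha)   # false negatives (effect is real)
    print(f"alpha={alpha:<6} false-positive rate~{fp:.3f}  false-negative rate~{fn:.2f}")
```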
Replication is the part of the modern scientific method that helps decrease the probability of making both false positives and false negatives. After reading the comments: a meta-analysis is a systematic summary of all the replications done in a field.
There is a difference between 'statistical significance' and 'business significance'.
Strong Inference, the design of experiments to test two competing theories, is the best way to make sure your experiments are scientific.
en.wikipedia.org/wiki/Strong_inference
Many many thanks for your additional comments! Regards Hannah
It's ok. You can call out psychology by name. No need to throw extra words at it.
I really love your clear and straightforward explanations with examples. This is the first time I've truly understood the p-value. Kudos to you!
Glad it was helpful! Many thanks for your nice feedback! Regards Hannah
1. Nice Summary of p-value. Good for people learning stats.
2. "If the facts don't conform to the theory they must be disposed of"
This is pure gold!! The clarity of the explanation is outstanding! Congratulations Prof.
I came here so ready to rip this video apart for uncritically disparaging p-values. This is why we watch videos until the end! Great video about proper applications of hypotheses testing, and some easy misapplications!
P-vals are based on a lot of assumptions. Abusing, misusing, or ignoring these assumptions is where we most often trip up and potentially abuse the concept.
Fantastic insights. Great job done. With these debates, we expand beyond simple interpretations.
P-values remain highly relevant but the ”magical” critical values 0.10, 0.05 and 0.01 should only be used in very specific cases, and not as some kind of universal measures of what is ”relevant” or not
So, if there is no decision threshold, what is the p-value relevant for?
@@ihorb7346 to quantify the extent to which the data agree with the null hypothesis
Thanks for this. I think there is at least one additional thing wrong with the way we treat p-values - namely, we think 1 in 20 is a small probability. This is, in part, because cognitively we cannot intuit probabilities that are close to 0 or 1. There is a large difference between probabilities of 1 in a hundred and 1 in a million, but we just think both probabilities are small.
I just commented along a similar line. In the physical sciences it's not unheard of to achieve p values of .01 or lower. So I get very suspicious of research using "physical science" tools, e.g. GC-MS showing p values sometimes larger than 0.2, a 1 in 5 chance of a false rejection of the null hypothesis and claiming "strong" evidence for their alternative hypothesis.
Great instructional video. I particularly enjoyed the explanation that a low p value merely indicates that the null hypothesis is unlikely to be true but that it does not say anything about the alternative hypothesis. I see this all the time in "independent" studies of traffic data - it does not matter whether the study is for low traffic neighbourhoods, cycling lanes, pedestrianisation or even road widening and similar to improve vehicular traffic flow - the same basic flaws are seen (do not get me started on the assumptions that the researchers use to even get to a significant p-value) - if the p-Value is low enough the study concludes not that the null hypothesis is untrue but that their alternative hypothesis must be true.
Thank you for another engaging and powerful video. When I first read "p-value is not scientific", I was pre-empting discussion of how significance levels themselves are arbitrary and seemingly without rationale. How did we come to decide that a significant event can be observed by chance once in only 20 random samples? (when considering p < 0.05 as significant; ditto for 10% or 1% significance levels). This is grounded in the central limit theorem, but whether expressed in terms of 5% or 1.96SD, the thresholds seem more convenient than scientific. Nevertheless, these standards are important for the universal interpretation and continuation of research and I'm glad for it, though directly interpreting the p-value as a probability may help to meaningfully discuss confidence in a result regardless of where it sits relative to the significance level. Looking forward to the next!
Wonderful video. Very well explained. My 'Engineering Statistics' course starts in a couple of weeks (my ninth year teaching this class) and I will post a link to this video on the course canvas site when we get to hypothesis testing as an excellent explanation of the meaning of the p-value and its strength and weaknesses.
Thank you for your brilliant explanation. The banning of the use of p-values by the psychology journal you mentioned suggests to me that (a) academics who publish in it are not competent in statistics and (b) the journal’s peer reviewers aren’t either. I recommend the competent researchers should avoid submitting articles to such journals. As an aside, I have encountered more weakness with stats competence within the psychology discipline than any other. Psychology had better watch out. It is on the cusp of being discredited by real scientists such as me who don’t want them giving science a bad name. Dr G. PhD, MSc (Statistics) with distinction.
Thanks for the clear explanation. This is basically the 101 of statistics, yet many researchers have no clue about this. Notably, I was NOT taught this most important insight in my biostatistics course, but only by my thesis supervisor, who had read Intuitive Biostatistics by Harvey Motulsky and therefore knew it.
Clear, concise, and *very* engagingly presented! You've got a fan.
Regarding the content, I'll just 'second' many of the well-stated earlier comments: The problem isn't the p-value _per se_; it's the way in which it's misunderstood by many who perfunctorily apply it -- owed, in large part, to the imprecise way that it's taught in many Elementary Stats courses. (In that regard, good of you to take care to emphasize the essential "...a difference of [x] **or more**...")
Excellent explanation. We were always taught that a p-value was only used for an initial study, for supporting or not supporting further studies, because it is not accurate enough on its own to come to any satisfactory conclusion.
Thank you! That's a great point. A p-value should indeed be seen as just one piece of the puzzle in research. Many thanks for your feedback! Regards Hannah
This depends on the aim of the study. If the question is: whether something has an effect, then p-values are great. If the question is: why something has an effect, one needs more powerful machinery.
The easiest way to understand the p-value in this example, I think, is in terms of simulations: randomly assign fifty subjects to one group and the remaining fifty to the other group; calculate the difference in means between the two groups; repeat many times, and then determine the proportion of differences thus calculated that were at least as great as the actual difference encountered; that's approximately the p-value. As you say, it's not the p-value's fault that it is often abused or misunderstood in statistics, and it remains a useful tool for identifying potential effects. Great video!
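That permutation idea can be written down in a few lines; here is a minimal sketch (the data are made up, and 10,000 shuffles is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up weight-loss data (kg) for two groups of 50 subjects each.
drug    = rng.normal(3.0, 4.0, 50)
placebo = rng.normal(1.5, 4.0, 50)

observed = drug.mean() - placebo.mean()
pooled = np.concatenate([drug, placebo])

count, n_perm = 0, 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)                       # random re-assignment to groups
    diff = pooled[:50].mean() - pooled[50:].mean()
    if abs(diff) >= abs(observed):            # "at least as extreme", two-sided
        count += 1

print(f"approximate p-value: {count / n_perm:.4f}")
```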
Thank you for putting this in such a simple and clear way. My statistics professors should have learnt from you!
Great video. One of the underlying issues may be that non-statisticians who depend on statistical results have, understandably, a very hard time grasping and accepting what these values actually tell them. In other words, as people want to know "the truth", it is already difficult to accept the notion of likelihood, let alone its twisted sister: an approximation of such a likelihood expressed in terms of the likelihood of the opposite not being true. From a practical standpoint it must be so confusing. All you want is a yes or no; maybe you still accept an "80% perhaps", but a "high confidence that there is a high probability that the opposite is not true" must be so confusing. Your video should be taught in every school, because the basic concepts are what most people struggle with.
*"high confidence that there is a high probability that the opposite is not true"*
That must be the most convoluted sentence I've ever seen. If this is the language that you folks have to deal with on a daily basis, then I am sorry for you guys!
Thank you for clarifying the P-value meaning. Data makes sense when shared with a data story. Just a p-value for decision is like missing the context in an outcome.
Coming from an engineering background I've often found 0.05 to be quite a high probability! Certainly just an indication...
Thank you! A comment on the 0.05 threshold pointed out by @HughCStevenson1 - 50 years ago, I was taught that it developed, along with much of modern statistical methodology, from the study of agricultural fertilizers; a farmer would accept one failure in their time in charge of the farm, typically 20 years, hence 1/20 chance of the observed benefit being by chance.
With drugs, it strikes me that using a higher value for beneficial effects (particularly of a low cost treatment) would make sense, BUT a lower value for harmful effects. Would statins have been more or less widely prescribed if this had been done?
Yeah... Would you trust a bridge if the engineer that designed it said "It'll be fine, it only has a 5% chance of falling down."
but 0.05 sounds small, until you realize it is literally just 1/20. That is actually a crazy high number at massive scales
The best video on the topic of p-values that I have ever seen in my life.
What a fantastic video! Thank you! Were I still working with students in a research methods course, your video would be mandatory. Excellent!
A very good explanation of the issue. It is important to emphasize that the rejection of the null hypothesis does not provide insight into the underlying mechanism. Pharmaceuticals often have secondary effects, and it is conceivable that the drug in question may have an unpleasant taste, prompting individuals to consume large quantities of water before eating. This increased water intake could result in reduced food consumption. Consequently, it may be more prudent to recommend drinking water before meals rather than advocating for a drug with unknown secondary effects.
That's a good class. We should teach people how to write the null hypothesis.
There are two main considerations when using the p-test. First, the samples must follow a normal distribution, and second, the samples should have equal variances. To address these, start by performing a normality test, such as the Shapiro-Wilk test. Next, conduct a test for homogeneity of variances, like the F-test. If both conditions are met, you can proceed with the t-test.
However, in my experience, many samples fail the normality test, necessitating the use of a non-parametric test. Non-parametric tests are generally more robust. Even if all three tests are passed, there's still a risk of being misled due to multiple testing. To mitigate this, it's important to adjust the p-value threshold each time you conduct an additional test. The problem is not in the test itself, but in the need for more robust scientific methodology.
I think you meant to say t-test, not p-test, right?
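For what it's worth, a rough sketch of the workflow described above (normality check, equal-variance check, then a t-test or a non-parametric fallback; and yes, the t-test is meant). Note that I use Levene's test here as a common, robust stand-in for the F-test mentioned, and the data and the 0.05 thresholds are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(10, 2, 40)     # placeholder samples
b = rng.normal(11, 2, 40)

# 1) Normality (Shapiro-Wilk) for each sample.
normal = all(stats.shapiro(x).pvalue > 0.05 for x in (a, b))

# 2) Homogeneity of variances.
equal_var = stats.levene(a, b).pvalue > 0.05

# 3) Pick the comparison accordingly.
if normal:
    result = stats.ttest_ind(a, b, equal_var=equal_var)  # Welch's t-test if variances differ
else:
    result = stats.mannwhitneyu(a, b)                    # non-parametric fallback

print(result)
```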
OMG: this video is the greatest explanation of the p-value ever! Thanks!
Glad it was helpful! And thanks!!! Regards Hannah
The null hypothesis is like the serious version of the "nothing ever happens" meme
This is an excellent review of the basics. Much needed (at least for me)
Excellent video. Clear explanations and very nice graphical elements. Congratulations!!
Thank you. I intend to use p-values in future data sets.
The assumption about the distribution of a population characteristic can also be a misleading factor. For example, in social phenomena we rarely observe normal distributions. I think this is an important piece of information to add to the critiques of the p-value, especially when we are trying to compare results from different studies.
Averages of most random variables tend to be normal, because they are sums of many contributions.
Variables that arise from many multiplicative random factors tend to be lognormal.
Many times we don't focus on the underlying original value, but on a composite, like an average.
Makes sense?
You are so awesome!!! Thanks for making such wonderful and informative videos! You have a divine gift for teaching complicated concepts in a stepwise and easy-to-follow manner. Much love and good wishes to you.
Thank you very much for your kind feedback!!! And we're pleased that you think we explain the topics simply. That's exactly what we try to do with a lot of effort. Regards, Hannah
Clear and well presented. The job of statistics is to identify sources of variability and put them all to common measure. That demands an appropriate modesty. The uncertainty is only half of what we need to make decisions. We still must reckon the potential benefit against the cost if we are wrong. We can not remove risk, only make considered judgements.
I just found your channel and I really appreciate all the information being well organized in the description. A quick note on what's the video about and references.
Again an absolutely wonderful clear and pedagogical video.❤️🥰🙏🏻
Hi Per, many many thanks for your nice feedback : )
Great Video!
I like your presentation style. Calm, slow paced and very good emphasis. Keep it going!
Great video. Thank you.
I have only one small nitpick to comment on: a bigger sample size doesn't always translate into a more credible result; you also need to consider how the samples are picked. For an easy example, 100 randomly picked samples probably generate better results than 500 biased, hand-picked samples.
Loved it! As a scientist from a different field, I found this a great explanation.
Excellent!!! I'd forgotten everything I know about stats, and now I can see there is a place I might actually use a p value!!
#HoldIt
Sir Austin Bradford-Hill stated several arguments in favor of causality. Eight of them have a refutation each; the only one that can be considered a "criterion", because it lacks a refutation, is consistency! Consistency means that several or more studies point in the same direction. This is in the same spirit as the main argument in the critique of p-values, and it was written in the 60s.
p-values have been critiqued as unscientific since inception. It's amazing how it came to be used as a modern day oracle.
The other form of implicit p-hacking is to use the p-value as a stopping criterion for an experiment (never do this): you can stop too early on a small sample that happens, by accident, to be quirky, and the p-value will delude you into thinking that what you have found is significant. It is also a problem of the frequentist approach, which operates on probability spaces that don't conform well to how most people think; the Bayesian approach is more natural in this respect and doesn't punish you for re-examining the data, although it has its own dangerous caveats.
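A small simulation of that "peeking" problem (the group sizes, step size, and number of runs are my own assumptions): even with no true effect at all, checking the p-value after every batch of subjects and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
runs, max_n, step = 2000, 200, 10

false_positives = 0
for _ in range(runs):
    a, b = [], []
    for _ in range(max_n // step):
        a.extend(rng.normal(0, 1, step))   # no true effect in either group
        b.extend(rng.normal(0, 1, step))
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1           # "significant" -> stop and publish
            break

print(f"false-positive rate with peeking: {false_positives / runs:.2f}")  # well above 0.05
```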
The approach from machine learning goes along the same lines: not accepting the "first island" of accuracy and trying to go beyond it with more cases.
Also, the train/test paradigm is philosophically very useful. It helps prevent pitfalls and self-cheating.
Got an idea. I'm not an expert statistician, although I have had some training and experience in the area. It seems to me that the p-value is missing something, and I think this could be the key to the criticisms you mention: we also need a "confidence" value OF the p-value. If people reported such a confidence in addition to the p-value, that might address at least some of the issues people have with the p-value. Again, great video!
00:07 Understanding the importance and calculation of P value
02:36 Understanding the significance of the P value in hypothesis testing
05:03 Understanding the significance of P values in hypothesis testing
07:38 Misinterpretations of P-values
10:16 P values can be misused, leading to low-quality research.
12:52 P-values should be banned and not used in research
15:25 P-value combines effect size, sample size, and data variance for objective assessment
17:56 Importance of quality in research and statistical software
Fantastic video! Very clear and concise! Thank you for this content!
very good video and content, thanks. I love it how excited you get about statistics :)
A very good presentation.
I only wish that the video had started by stating that the "p" in p-value stands for probability; then it would have been much easier for everyone to understand the concept. 🙏🏾
It should also be pointed out that 1) the correct application of "significance testing" relies on the use of power calculations to determine an appropriate sample size, and these calculations in turn rely on the (somewhat arbitrary) pre-specification of what would constitute a meaningful effect size, and 2) power calculations typically do not take into account subgroup analyses, which are critical for purposes of determining the extent to which the results of a given study are generalizable.
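As a concrete illustration of point 1, a minimal power-calculation sketch using statsmodels (assuming it is installed); the 0.5 SD "meaningful effect", 80% power, and alpha = 0.05 are exactly the kind of arbitrary pre-specifications being described, chosen here only as an example.

```python
# Sample size needed to detect an assumed "meaningful" effect of 0.5 SD
# with 80% power at alpha = 0.05 (two-sided, two independent groups).
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.8, alternative='two-sided')
print(round(n_per_group))   # roughly 64 subjects per group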
_"'By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.' Analysis of the data should not end with the calculation of a p-value. Depending on the context, p-values can be supplemented by other measures of evidence that can more directly address the effect size and its associated uncertainty (e.g., effect size estimates, confidence and prediction intervals, likelihood ratios, or graphical representations)."_
[American Statistical Association]
Econ Journal Watch publishes peer-reviewed articles on economics. In the current online issue (free) you will find the article "Revisiting Hypothesis Testing With the Sharpe Ratio." Although the focus of the article is on a measure of financial performance, not medical treatment performance, the author references research into medical statistics and thoroughly explains the pitfalls of just relying on the p-value.
Great video! Best explanation I have ever heard.
Thank you for this clear and great reminder. Yes, carefully looking at the fully qualified context before reaching a conclusion (that you might re-visit under new data) is always important. What are the alternative option(s) proposed by the critics of p-value?
Just wanna share my thoughts on this:
#1
"Rejecting the null hypothesis means that the alternative is probably true" is not an inaccurate description. Hypotheses are formulated such that they partition the parameter space. This means that once you reject H0, the only option left would be Ha. A more accurate description of the trueness of the alternative hypothesis would also consider the power of the test involved. So though the p-value alone does not paint a full picture, it's not entirely wrong to say that the data are in support of the alternative.
#3
Variability is also accounted for in these tests. Highly variable data will yield less significant results since the sampling distribution (and in turn test stat distribution) they’ll produce would be more flat, resulting in a much smaller critical region (or even wider acceptance region).
Sample size is also accounted for in these statistical tests. The standard error of an estimator is a function of the sample size n. If n is small then the s.e. is large, yielding more nonsignificant test statistic values.
I guess the issue is p-hacking, and due diligence or lack thereof.
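A tiny sketch of the point in #3 (the data are made up): the test statistic literally combines the effect size in the numerator with the variance and sample size in the denominator (the standard error).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a = rng.normal(0.0, 1.0, 25)          # placeholder data
b = rng.normal(0.6, 1.0, 25)

# Welch's t statistic written out: effect size over standard error.
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_by_hand = (b.mean() - a.mean()) / se

# Should match scipy's Welch t-test statistic.
print(t_by_hand, stats.ttest_ind(b, a, equal_var=False).statistic)
```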
I'm glad you put out #1 so well.
#3 is true also, but it depends on what these naysayers are proposing to actually value after they trash good ol' p.
One needs to be careful about what Ha actually is, though. The Ha says that the observed difference was not due to chance. It does NOT say that the difference was caused by the drug you're testing!
@@gcewing Mathematically, the Ha is simply a partition of the parameter space. As for "The Ha says that the observed difference was not due to chance. It does NOT say that the difference was caused by the drug you're testing": you are right, the rejection of H0 means that the observed difference is unlikely under H0. Mathematically, none of these tests concludes causation. This is the purpose of the literature. Statistical tests and experimental designs are simply powerful tools that help verify this.
Not surprised that a Journal in social psychology would ban the p-value.
It is useful for those in the hard sciences. I have a PhD in nuclear magnetic resonance and most of my later work was in medical research.
When in 2013 "skeptics" began proclaiming a "pause" in rising global temperature beginning in 1998 they would only trust RSS satellite data which began in1979, claiming surface data was corrupted. I would routinely point out the lack of statistical significance. The "pause" was dropped as a topic of discussion when the 1998 extreme el nino year was balanced by the 2016 el nino year.
Mind you the temperatures before and after 1998 to 2016 were not statistically significant, but "skeptics" don't do statistical significance. I was told I was only trying to confuse people with those "plus minus thingies." The whole record is, and agrees with surface data within statistical significance. Ironically surface data has lower uncertainties.
RSS satellite data version 4.0 (Skeptical Science temperature trend calculator)
1979-1998 Trend: 0.116 ±0.161 °C/decade (2σ)
1998-2015 Trend: 0.085 ±0.202 °C/decade (2σ)
1998-2016 Trend: 0.116 ±0.187 °C/decade (2σ)
1979-2024 Trend: 0.210 ±0.050 °C/decade (2σ)
Gistemp v 4 surface data
1979-2024 Trend: 0.188 ±0.035 °C/decade (2σ)
Hadcrut 4 surface data
1979-2024 Trend: 0.171 ±0.031 °C/decade (2σ)
Berkeley surface data
1979- 2024 Trend: 0.192 ±0.029 °C/decade (2σ)
Thank you for this video, but I was hoping you would expand a little more on p-hacking; even though you did include some of its elements in the video (without actually naming it as such), I believe it warranted more. Perhaps a future video could be done on p-hacking specifically.
The main problem with using P-value is that the significant value cutoff is completely arbitrary. This is not a problem with P-value directly, but it is a problem with application. Why is 0.049 worthy of further investigation but 0.051 is not, or 0.01, or any other value?
How would you solve this problem?
@@rei8820 Just include the calculated p value and let the reader decide if it is sufficiently significant for their purposes.
@@mytech6779 It doesn't solve the problem. It just makes things even more arbitrary.
@@rei8820 I am using "arbitrary" to mean not based on an intrinsic property; there is nothing more arbitrary than 0.05, because there is nothing intrinsically limiting about that value. Individuals each have their own criteria based on specific goals, and you can't know what would be significant for all of those future readers at the time of publication.
There is also the declining art of exercising judgement. Having the experience and knowledge to read a situation and a set of results to form an opinion is critical in all fields. I have seen far too many bureaucrats and papers that try to eliminate judgment with arbitrarily drawn hard lines and just end up with obvious misclassifications, because their algorithm failed to include or properly weight some parameter, or couldn't account for complex interactions of parameters. The ability to step back, look at the big picture and think, "Hmm, something is off. What is wrong with this picture?"
Thank you very much for this wonderful explanation
I was expecting to find elements of Bayesian thinking included as a solution to the problems with the focus on the p-value (as explained in 'Bernoulli's Fallacy' by A. Clayton). I listened to this book and it made a lot of sense to me, but it could use your clarifying touch to explain and/or dispel.
Thank you for your thoughtful comment! I'm glad to hear you found 'Bernoulli's Fallacy' by A. Clayton insightful. Bayesian thinking indeed offers a valuable perspective on addressing issues related to the traditional reliance on p-values. I will put it on my to-do list and try to make a video for it! Regards Hannah
Not worth reading imo.
@@ucchi9829 Why? Besides the ideological spin, many of his statistical arguments seem to make sense. But I'm not a statistician and am willing to be convinced that it doesn't?
@@stefaandumez2319 I don't think he's a statistician either. One reason is his very unreasonable views on people like Fisher. There's a great paper by Stephen Senn that debunks his characterization of Fisher as a racist. The other reason would be his unconvincing criticisms of frequentism.
The video correctly states the definition of the p-value. What the p-value is not is any kind of sensible probability that people want to know. We want to know the probability of each hypothesis being the truth. People tend to think, wrongly, that the p-value is the probability that the null hypothesis is true. And then there's the arbitrary 5% level and the inherent bias towards the null hypothesis.
We now have the technology to do better than p-values and we should move on.
Discuss confidence intervals next, please.
Yes, I agree this would be a great topic and I think this is extremely important.
Yes I would watch this
It would be interesting to see a discussion of how sample selection bias and population/sample non-normality couple with use of the t-test and p-value to make these problems worse.
No Sir, we in healthcare will continue to use this inappropriately to make your medical decisions, thanks.
*Tests the use of vitamin C for the treatment of 50 cancers with a sample size of 10 patients for each cancer
*Finds an improvement in 1 cancer with p=0.0499
Headline: "VITAMIN C CURES CANCER, STUDY SHOWS!1!1!1!"
This content is Statistics 101, so it is staggering that it has to be repeated for the benefit of people who are so ignorant that they have no business doing research in the first place.
Great content! I have been confused by this.
New to the channel. Looks great! Quick question: how likely are your videos to get proper subtitles?
Thank you for this great analysis
My pleasure!
Based upon your video, it seems that the criticism/rejection of the p-value is really about the fact that a single study won't be conclusive anyway, so what's the point of calculating a p-value (given that, as you said, its function is to generalize one's findings beyond the sample)? So, it seems that the critics are saying (though you are not framing the criticism that way) that the researcher should stop at reporting the sample-based findings instead of generalizing them (via the p-value). As for generalization, that can be done via meta-studies down the line, when enough individual studies self-restricted to their samples have accumulated. I hope you could comment on this.
Thank you for your insightful comment! You raise a crucial point about the limitations of p-values and the role of individual studies versus meta-studies.
The primary criticism of p-values is that they can be misleading when used in isolation. Meta-analyses can be incredibly valuable for generalizing findings across multiple studies. By combining results from different studies, meta-analyses can provide a more comprehensive understanding of an effect and mitigate the limitations of individual studies. In order to carry out a meta-analysis, there must of course be many individual studies, whereby it is important that all relevant figures are named so that researchers can carry out a meta-analysis.
We will discuss this topic in part in the following video. Regards Hannah
She did not really grasp the issue. Classical statistics do not answer the questions people are likely to want to ask, but rather some nonsensical, recondite questions. The p-value does not have the meaning people think it has. What it means is: (1) if the null hypothesis is correct, and (2) you repeat the same experiment many times, then (3) a certain proportion of the experiments will yield an effect size equal to or larger than the one observed.
There are several weird assumptions here:
First: “If the null hypothesis is correct…” What if the null hypothesis is wrong? The p-value is only meaningful if the null hypothesis is correct. But since it can be wrong, we never know if the p-value is valid or not. It then follows that it can only be used to reject a true null hypothesis, but not a false one, which is nonsense. The null hypothesis might be false, and if it is, the p-value is invalid. It is a logical fallacy.
Traditionally the p-value was thought of as “evidence against H0”. Consider a “q-value” that is similar except it is “evidence against HA”: now we assume HA is true and compute the fraction of equal or smaller effect sizes in an infinite repetition of the experiment. In general p + q ≠ 1; in fact we can even think of a situation where both are small.
Second, the p-value assumes we repeat the same experiment many times with a true null hypothesis. Only an idiot would do that. So we do not need to calculate this, as we have no use for this information.
Third, it takes into account larger effect sizes than the one we obtained. We did not observe effect sizes larger than the one observed, so why should we care about them? In mathematical statistics this means that the p-value violates the likelihood principle, which is the fundamental law of statistics. The likelihood principle was, ironically, discovered by the inventor of the p-value. The likelihood principle says that statistical evidence is proportional to the likelihood function.
Fourth, if you fix the significance level to 0.05 and run a Monte Carlo, the p-value will on average be 0.025. It is inconsistent.
The summed effect of this weirdness is that the p-value can massively exaggerate the statistical evidence and is invalid for any inference if the null hypothesis happens to be false.
In conclusion we cannot use it.
There is a silly answer to this: what if we compute the probability that the null hypothesis is true, given the observations we have made? That is what we want to know, right?
Can we do this? Yes. Welcome to Bayesian statistics. The theory was developed by the French mathematician Laplace, a century before the dawn of “classical” statistics and p-values. There was only a minor problem: it often resulted in equations that had to be solved numerically, by massive calculations, and modern computers had not been invented.
Classical statistics developed out of a need to do statistics with the tools at hand around 1920: female secretaries (they were actually called “computers”), mechanical calculators that could add and subtract, and slide rules that could compute and invert logarithms. With these tools one could easily compute things like sums of squares. To compute a square, you would take the logarithm with the slide rule, type it in on the calculator, push the add button, type it in again, pull the handle, and use the slide rule to inverse-log the number it produced. Write it down on a piece of paper. Repeat for the next number. Now use the same mechanical calculator to sum them up. Et voilà, a sum of squares.
The drawback was that classical statistics did not answer the questions that were asked, but it was used for practical reasons.
Today we have powerful computers everywhere, and efficient algorithms for computing Bayesian statistics have been developed, e.g. Markov chain Monte Carlo. Therefore we can just compute the probability that the null hypothesis is true, and be done with it. The main problem is that most researchers think they do this when they compute the p-value. They do not. Convincing them otherwise has so far been futile. Many statisticians are pulling their hair out in frustration.
Then there is a second take on this as well: Maybe statistical inference (i.e. hypothesis testing) is something we should not be doing at all? What if we focus on descriptive statistics, estimation, and data visualization? If we see the effect size we see it. There might simply be a world where no p-values are needed, classical or Bayesian. This is what the journal Epidemiology enforced for over a decade. Then the zealot editor retired, and the p-values snuck back in.
Related to this is the futility of testing a two-sided null hypothesis. It is known to be false a priori, i.e. the probability of the effect size being exactly zero is, well, also exactly zero. All you have to do then to reject any null hypothesis is to collect enough data. This means that you can always get a "significant result" by using a large enough data set. Two-sided testing is the most common use case for the p-value, but also where it is most misleading. In this case a Bayesian approach is not any better, because the logical fallacy is in the specification of the null hypothesis. With a continuous probability distribution a point probability is always zero, so a sharp null hypothesis is always known to be false. This leads to a common abuse of statistics, often seen in the social sciences: obtaining "significant results" by running statistical tests on very large data sets. Under such conditions, any test will come out "significant", and can then be used to make various claims. It is then common to focus on the p-value rather than the estimated effect size, which is typically so small that it has no practical consequence. This is actually pseudo-science. This is a good reason to just drop the whole business of hypothesis testing and focus on descriptive statistics and estimation.
@@sturlamolden Thanks for the detailed text and info. I always saw the p-test as rather empty... It starts with the assumption that 5% is a good target, based on nothing! I was always challenged by the idea that the p-value threshold was someone's choice and everybody adopted it without ever questioning its real meaning! At best it is used to create steps between small groups to classify them, losing the infinite possibilities in the big population.
@@sturlamolden By the way, what about classifications in medicine based on such steps, like IQ classification? Does it make any sense?
@@sturlamolden I don't understand what you propose. While Bayesian methods are very powerful, they require a probability distribution of the effect before the experiment.
Generally there is no objective way to summarize all the evidence available before the experiment in a probability distribution. Therefore the probability distribution of the effect size cannot be calculated after the experiment, regardless of how much computational power you have.
I think a Bayesian example calculation can in many cases help the understanding of the results and complement the p-value. The Bayesian results will, however, always depend on more arbitrary assumptions than the p-value.
You named it in passing: causality shouldn't be taken for granted. Say a drug has a far-fetched biochemical effect; a certain effect can be mistakenly attributed to it entirely when there are background interactions/noise that haven't been factored in, rendering either conclusion void.
I went to an environmental chemistry talk at an environmental justice conference that analyzed dust collected under couches in low-income homes. A graph of flame retardant concentration vs. cancer incidence was shown to support banning flame retardants. The statistical output from Excel indicated that 1) there was about a 40% chance the correlation was coincidental and, of course, 2) it gave no support for causation, since the ONLY thing they analyzed for was the flame retardant.
To me, this was an irresponsible presentation aimed at an unsophisticated audience by a researcher who, by their own admission, arrived at a conclusion before collecting the data and was unwilling to let poor statistics get in the way of the party line.
The p-value is not dead, it is not misleading, and it is scientific.
Totally agree
9:02 If the only results ever published are those with low p-values, then when replication studies are done, only those with low p-values will be published too. This means we have an unrepresentative sample of published studies informing our assessment of hypotheses.
After enough time and studies, eventually all null hypotheses will be proven wrong unless we start publishing studies with negative results.
Wow... a very clear explanation. Thank you.
Many thanks!!!
Excellent explanation. Thanks again.
Add a deep-knowledge course on p-values, p-hacking, removing data points, finding outliers, and reporting on removed outliers to any research position.
Discussing the p-value is almost an umbrella now for ethical reporting and interpretation of trends in a data set.
Excellent explanation of a complicated topic. Do you think some companies p-hack by using very large sample sizes? A large enough sample (say, 10,000 or more) will often result in significant p-values even if the effect size is minor, but then they can make claims in their advertising like, "Our flu medicine is proven to significantly reduce the duration of flu symptoms." The fine print has to state what they mean by reduced duration (I saw one that said half a day).
Having a large sample means the estimate should be closer to the truth. It's also common practice to include more samples, given proper adjustment of the statistical tests to account for the increase in the type 1 error rate; see adaptive designs in bioequivalence. The deception here is not the p-value, but rather the non-reporting (or misreporting) of the effect size.
@@berdie_a In theory. In reality it means that they have far more room to fudge the data...
Some things:
I'm aware of situations like residuals testing, where doing a hypothesis test actually ends up incentivising smaller samples. In this case, p-values might be problematic.
If you're gonna criticise p-values for incentivising researchers to come up with ad hoc null hypotheses, it's probably fair to criticise Bayesian methods for incentivising ad hoc prior distributions. You can't do Bayesian stats without loading on the assumptions about the prior distribution.
The original, great article was "The significance of statistical significance tests in marketing research" by Sawyer and Peters (1983). Another abused measure is using R² instead of the S.E. and b.
The trouble with abandoning the p-value and calling it quits is that it won't remove the incentives in research that lead to p-hacking. So any replacement will still be subject to the same bad incentives. You might just end up in a situation where people will be p-replacement-hacking.
Also, the 0.05 significance threshold is quite arbitrary; a discussion could be had about whether we need to define stricter cutoff points for significance (e.g. 0.01, 0.001, or even the 5 sigma that is used in astronomy, for instance).
Very nicely explained. thanks
Thanks for the great video! P-values are essential, but they have limitations. For example, in heterogeneous populations, p-values might miss significant effects because they can't differentiate between subgroups. Another issue is with limited data: small but crucial contributions might not show significance due to insufficient sample size. In such cases, like with AI models or complex biological processes (e.g., protein folding), alternative approaches can demonstrate significance where p-values fall short. So, while useful, p-values shouldn't be seen as an absolute measure of validity. Moreover, they should not be demanded as unique method required for a proof, without a rationale of why that is required. Appreciate your work on this important topic!
Most of these are problems with study design rather than with the use of P-values. A P-value just gives you a sense of the reliability of the results of specific types of experiments. It isn't a tool to go fishing for information within a random pool of data, nor is it a tool for model evaluation.
@@davidwright7193 I agree that this is a different issue. However, the real problem in scientific (molecular biology) journals is that many reviewers lack strong statistical knowledge and focus mainly on finding asterisks (***), leading researchers to rely on p-values or even engage in p-hacking. Confidence intervals, on the other hand, are more informative and visually explanatory. They not only show the direction and variability within each individual in a cohort but can also provide insights into potential subpopulations, though identifying these typically requires further analysis.
With enough control data, you can make statements like “6 out of 10 cases show significantly higher values of protein X, and 2 show significantly lower values compared to the control cohort with 95% confidence.” In this scenario, a t-test might yield non-significant results due to case variation, yet this finding could still be biologically significant, suggesting possible activation states.
It’s important to understand that statistical significance and biological significance are related but distinct concepts. There are processes that are biologically significant but not statistically significant, and vice versa. This disconnect highlights the dangers of over-reliance on p-values, as it can lead to the acceptance of papers based solely on statistical results without considering their biological relevance. Therefore, while banning p-values isn’t necessary, doing so in some journals could encourage reviewers to apply critical thinking rather than just looking for asterisks. They will still need to find robust evidence to accept or reject a paper, ensuring that the biological significance is not overshadowed by purely statistical considerations.
beautifully explained!
Well, this was a nice explanation of a lot of statistics, but it also revealed how misleading the p-value can really be. To get to the core of the problem, you have to understand that there are basically no two natural things you could compare that do not differ at all. Choose the sample size of your test big enough, and that difference will show up as (statistically) significant. So, ironically, the fact that a particular difference in the data is statistically significant tells you MORE about potential relevance if the sample size was small. Why? Because with small sample sizes, significance can only happen as a result of relatively big and relatively consistent differences. If we care about relevance and meaning, p-values are just the wrong tool. You can use them, individually consider all the other factors and then make an educated guess about relevance, but why should you? There are better tools (Hedges' g, Cohen's d) for that purpose.
That does not mean that p-values are worthless. A high p-value instantly tells you that any difference you measured would likely also have occurred as a result of random chance. That's good to know. At the very least, it keeps scientists from talking too confidently about a lot of effects they supposedly measured, which could easily be random variation in the samples, and it is fine to routinely do that.
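For reference, a minimal sketch of those effect-size measures (the data are invented): Cohen's d from the pooled standard deviation, and Hedges' g as the same quantity with a small-sample bias correction.

```python
import numpy as np

rng = np.random.default_rng(11)
a = rng.normal(0.0, 1.0, 20)   # made-up data
b = rng.normal(0.8, 1.0, 20)

n1, n2 = len(a), len(b)
pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (b.mean() - a.mean()) / pooled_sd

# Hedges' g: same idea with a small-sample bias correction factor.
hedges_g = cohens_d * (1 - 3 / (4 * (n1 + n2) - 9))

print(f"Cohen's d = {cohens_d:.2f}, Hedges' g = {hedges_g:.2f}")
```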
5% is a purely fictional number for deciding a hypothesis test.
Excellent work!
Many thanks!
This was excellent! Thanks 🙌
There's probably some of that.
Part of the problem, though, is that any analysis of humans is multivariate. When designing an experiment, it is best to change one thing and measure the effect. But humans are influenced by a variety of factors, and by other people, to varying degrees. It's very difficult to design experiments that eliminate those effects.
It's essentially analogous to the Many Body Problem in Physics.
Those who blame the mistreatment of the p-value are no different from those who seek scapegoats. This is a natural phenomenon, I proclaim. 100 years ago, the average researcher could read, analyze and digest all the publications in their own and related fields. There have never been so many people on this planet as today. Even if we assume the percentage of academia in the population stays the same (it's increasing), we arrive at an unprecedented number of researchers. With the exponential growth of publications, researchers either have to adjust their toolkit (i.e. the scientific method) or adapt otherwise. May AI help us.
Great content!
Very informative video : )
An additional criticism I would add is people using P-values from tests for normal data on non-normal data. They get irritated when you point out the comparison is invalid and they used the wrong test method.
I loved the video, informative and engaging. But I think you are also missing another criticism: the problem is not the p-values per se, but the comparison to an arbitrary threshold to make a decision.
A great explanation ❤