Hmmmmmm. I'm not convinced confidence intervals are better. If you repeated the experiment then, yes, the next mean would lie in that interval, probably. But if you repeat it many times, the distribution isn't going to be centered around the first mean you obtained. Parametric bootstrap maybe at least gives you some idea of how lucky you've been.
p-values are a minimum standard, confidence intervals are a useful addition, not a replacement. Especially for more complicated statistics, with a p-value you need only convince me that your null hypothesis is a reasonable null hypothesis, and then you're free to invent whatever test statistic you wish. With a confidence interval you need to convince me that all of your prior beliefs are reasonable. I also quite like null hypothesis non-rejection regions. For a t-test this has the same size as the confidence interval, but it's around zero. Magnitude of effect in excess of the average magnitude of effect given the null-hypothesis also seems useful as a conservative estimate of effect size -- this is something I'm currently using, to rank a list of things by effect size, to find items that we're fairly certain have a large effect size, and avoid a regression-to-the-mean effect. (It seems a fairly general thing to use, but I still have some doubts on this one.)
Charles R. Twardy Thanks Charles. And there may be progress happening. Watch out for imminent announcement from 'Psychological Science' about new submission guidelines to apply from Jan. tiny dot cc slash pssubguide Geoff
Paul Harrison Thanks Paul. Para 1: At least for familiar cases I disagree: CI gives more info than p. Giving p as well adds nothing. Evidence that it's better to use CI without p, at least in some common situations: tiny dot cc slash cisbetter Agree that effect sizes are what we almost always are, or should be, most interested in, and what interpretation should focus on. Geoff
Although I agree with Cumming's call for CIs and meta-analysis, I disagree with some of the assumptions in this video. I commented on that in a recent article in Frontiers in Psychology, and here are some excerpts from that comment: Firstly, Cumming’s "dance of p’s ...is not suitable for representing Fisher’s ad hoc approach (and, by extension, most NHST projects). It is, however, adequate for representing Neyman-Pearson’s repeated sampling approach". The role of the p-value for each approach is different, for Neyman-Pearson's approach being "a simple count of significant tests irrespective of their p-value". Secondly, as it turns out, Cumming’s simulation is "a textbook example of what to expect given power", under Neyman-Pearson's approach. For example, 52% of tests should be significant at α = .05 in the long run, when power is set to 52%. Thirdly, Cumming doesn't compare p's and CI's fairly. "To be fair, a comparison of both requires equal ground. At interval level, CIs compare with power". While Cumming’s simulation reveals that about 95% of sample statistics fall within the population CI (out of 95% expected), 52% of those sample statistics are statistically significant (out of 52% expected). Furthermore, "at point estimate level, means (or a CI bound) compare with p-values, and Cumming’s figure reveals a well-choreographed dance between those. Namely, CIs are not superior to Neyman-Pearson’s tests when properly compared although, as Cumming discussed, CIs are certainly more informative." ---- Perezgonzalez JD (2015). Frontiers in Psychology (doi: 10.3389/fpsyg.2015.00034, journal.frontiersin.org/Journal/10.3389/fpsyg.2015.00034/full).
Thanks for your comments, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com Yes, power specifies what % of p values will, in the long run, be less than .05 (or whatever value is chosen for alpha). But even so the variability of p from study to study (where each is an exact replication, just with a new sample) is very large. If power is very high, sure, most p values are small, but there is still great variability. I don't agree that "at interval level, CIs compare with power". Or maybe I don't understand what you mean. Power is a single number, as is a p value. That's a big part of the problem, because it leads to dichotomous thinking, black-and-white thinking, whereas the world is actually a million (or more) colours. An interval, e.g. a CI, is more informative than any dichotomous declaration, notably by quantifying the amount of uncertainty. I've more recently taken a different approach to illustrating the variability of p values: See two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy! Geoff
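The point about power and variability can be checked with a small simulation. What follows is a minimal sketch, not the ESCI software. It assumes two independent groups of N = 32, a true difference of 10 and a known population SD of 20 (values picked here only because they give power of about .52 with a z test), and the function names are invented for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, quantiles

STD_NORMAL = NormalDist()


def two_tailed_p(control, treated, sigma):
    """Two-tailed p from a z test on two equal-sized groups (sigma assumed known)."""
    n = len(control)
    standard_error = sigma * sqrt(2 / n)
    z = (mean(treated) - mean(control)) / standard_error
    return 2 * STD_NORMAL.cdf(-abs(z))


def theoretical_power(n, effect, sigma, alpha=0.05):
    """Long-run share of replications with p < alpha for the same z test."""
    noncentrality = effect / (sigma * sqrt(2 / n))
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(noncentrality - z_crit) + STD_NORMAL.cdf(-noncentrality - z_crit)


def dance_of_p(n=32, effect=10.0, sigma=20.0, replications=5000, seed=1):
    """Run many exact replications and collect the p value from each one."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(replications):
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(effect, sigma) for _ in range(n)]
        p_values.append(two_tailed_p(control, treated, sigma))
    return p_values


p_values = dance_of_p()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
deciles = quantiles(p_values, n=10)  # nine cut points: 10th, 20th, ... 90th percentile

print(f"theoretical power:       {theoretical_power(32, 10.0, 20.0):.3f}")
print(f"share of p < .05:        {share_significant:.3f}")
print(f"10th percentile of p:    {deciles[0]:.4f}")
print(f"90th percentile of p:    {deciles[8]:.4f}")
```

The share of p values below .05 tracks the power, as expected, while the 10th and 90th percentiles show how far apart individual p values from identical studies can land. Passing sigma=10.0 raises power to roughly .98, and the same percentile lines can be used to see how much spread remains.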
It would have been important to also mention the fact that the number of observations drawn from the population (i.e. the sample size) increases the "predictability" of inferences from the p-value.
Hendrik, Thanks for your comment. For a given effect size in the population, if we increase N (sample size) then the p values we get are on average smaller. Yes, that's certainly true. But the perhaps surprising thing is that they still bounce around to a very large extent. Statistical power sets the sampling distribution of the p value, so if we increase power by increasing N and/or increasing population effect size, then we shift to a different sampling distribution of p, with a lower mean. But the p interval (meaning the interval within which 80% of p values will lie) is still very long.
Consider another approach. Suppose we know only that an initial study gives p=.05. Then a single exact replication will give p that's likely to be very different from that initial p. The distribution of such a replication p is derived, illustrated and explained in my 2008 article below. The remarkable thing is that this distribution does not depend on the N of the initial study--assuming the replication has the same N (and everything else) as the initial study. It depends just on that initial p value. Hard to grasp, I know--it took me ages--but note that getting p=.05, for example, with very large N will happen only if the observed effect size is very small. In which case, a replication will give, most likely, a slightly different (but also small) effect size, and this is likely to have a very different p.
I think the best way to appreciate just how unreliable p is, even for large samples, is to watch two videos: At RUclips, search for 'significance roulette'
A p value is never to be trusted, even with large or very large N!
Geoff
I discuss all this, with pictures, simulations, and formulas, in tiny.cc/pintervals
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300. doi: 10.1111/j.1745-6924.2008.00079.x
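The idea that the replication p depends only on the initial p can be sketched numerically. This is a rough illustration under stated assumptions, not a reproduction of the 2008 article's derivation. It assumes a z test, and that the initial and replication z values each vary with SD 1 around the same unknown true value, so the replication z is treated as normal around the initial z with SD sqrt(2). All function names are invented here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def replication_z_distribution(p_obt):
    """Assumed distribution of the replication z, given the initial two-tailed p."""
    z_obt = -STD_NORMAL.inv_cdf(p_obt / 2)  # z that matches the initial two-tailed p
    # Assumption: two independent z values around one unknown true value,
    # so their difference has variance 1 + 1 = 2.
    return NormalDist(mu=z_obt, sigma=sqrt(2))


def prob_replication_p_above(p_obt, alpha=0.05):
    """Probability that an exact replication gives two-tailed p > alpha."""
    rep = replication_z_distribution(p_obt)
    z_alpha = -STD_NORMAL.inv_cdf(alpha / 2)
    return rep.cdf(z_alpha) - rep.cdf(-z_alpha)


def replication_p_quantile(p_obt, q):
    """Value x with P(replication two-tailed p <= x) = q, by bisection on |z|."""
    rep = replication_z_distribution(p_obt)
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = (lo + hi) / 2
        tail = 1 - (rep.cdf(mid) - rep.cdf(-mid))  # P(|z_rep| > mid), falls as mid grows
        if tail > q:
            lo = mid  # cutoff still too small
        else:
            hi = mid
    return 2 * STD_NORMAL.cdf(-lo)


def p_interval(p_obt, coverage=0.80):
    """Central prediction interval for the replication two-tailed p."""
    tail = (1 - coverage) / 2
    return replication_p_quantile(p_obt, tail), replication_p_quantile(p_obt, 1 - tail)


for p_obt in (0.001, 0.01, 0.05, 0.2):
    low, high = p_interval(p_obt)
    print(f"initial p = {p_obt}: 80% p interval ({low:.2g}, {high:.2g}), "
          f"P(replication p > .05) = {prob_replication_p_above(p_obt):.2f}")
```

Note that none of the functions takes N as an argument: under these assumptions the initial p value is the only input, which is the point made above.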
Thanks very much Norma! Clearly, you are a highly intelligent and insightful person! Do you also know two videos of a different demo of just how crazy p values can be? Suppose you do an initial study and calculate p = .01. What p value would an exact replication (exactly the same, just with a new random sample) give? Turns out that a VERY wide range of p values are perfectly possible. If initial p = .05, then replication p will, of course, be, on average, a bit larger, but still there is massive uncertainty. For a demo, with explanation, search RUclips for 'significance roulette' and find two videos. Enjoy! Geoff
Well done. Have you talked with Ken Rothman? When he was editor of Epidemiology, any paper with a p value in it was immediately rejected... especially if it was not an RCT. But it does confirm that CIs give real information, not p. Thanks
Thanks Nicholas for your positive comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com Yes, I've met Ken Rothman several times, usually for lunch in Boston, and have emailed over the years. A while back we published a paper describing and evaluating the 'no p values' policy he instituted at the American Journal of Public Health, then, as you say, insisted on in the journal he founded and edited for almost a decade--Epidemiology. Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science, 15, 119-126. You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy! Geoff
hahahaha..... B..R..I..L..L..I..A..N..T Many thanks Geoff for providing such a wonderful and intuitive explanation of the unreliability of the p values!
Thanks! If you or anyone is really keen, I have a paper on it: Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300. Enjoy... Geoff
First of all, I loved the dance. Both my partner and I laughed at the random (or not so random) honking paired with emotionally labile dancers. A wonderful program. (sorry in advance, neither of us are statisticians, just trying to learn enough to do our job well) So my partner and I have a bit of disagreement resulting from watching this video. We were wondering in what applications the p-value shouldn't be trusted. Effect size & CI are a given as extremely important & required context. Assuming bias to be negligible & sample size to be sufficient for the field of research due to a wonderfully constructed experiment... One argument was that a p-value is not a great instrument for planning a study, but is better for planning management. For example, if I did a study & achieved p = 0.05 showing a +'ve correlation between poking you in the right eye & improving your tax returns then anyone who comes in with poor tax returns I could poke them in the eye & feel 19/20 likelihood to improve their tax returns. However if I attempted to prove this association with further research I may have difficulties. The other interpretation was that p-value is a poor tool for both research planning & developing management. The poor predictability of where the p-value may fall in any one experiment means that its utility, even once obtained to be 'significant', lends poor predictive value for both.
Thanks Niels! We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site www.thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish. You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters. Enjoy! Geoff
Hi Geoff, you mentioned the width of the CI is predictive in some way of future CI widths, but what value is that to a researcher? Won't the range of the CI bounce around as shown in your simulation, and doesn't that imply that the confidence interval itself will miss the true mean as it shifts along the axis of possible means some percentage of the time? Perhaps as often as the p value shifts?
Thanks Randvids, You are absolutely correct that both CIs and p values dance around with replication--as the simulation illustrates. The large extent of the dancing, in both cases, is probably way more than most folks would predict or expect. Yep, the world is full of random variation, unless we're lucky enough to be able to work with samples that are huge. By definition, a 95% confidence interval will miss the true population value on 5% of occasions (assuming random sampling, etc)--these are displayed red in ESCI. So, in a lifetime of seeing and interpreting CIs, some unknown 5% will miss what they are estimating. (In real life, those intervals don't come red, unfortunately!) BUT there is a vital difference between the dancing of CIs and p values. Any single CI gives a pretty good idea how wide, how frenetic, the dance is. In real life we get only a single CI, not part of the dance, so it's highly valuable that any one CI gives us a good idea of the extent of uncertainty. We can be 95% confident (no more, but no less) that our single CI has landed so that it includes the true population value. Hooray! In stark contrast, any single p value gives us virtually no information about the dance it came from. The next p value in the dance may be much bigger, or much smaller. But a p value is a single value, sometimes even reported to 3 decimal places (!), which shouts 'accurate', 'trustworthy' at us--despite it telling us very little. In contrast, the single CI makes the uncertainty salient--its length shouts 'there is uncertainty', 'there is doubt'. It even quantifies the extent of that uncertainty, so we can be very happy if we get a very short CI, and be appropriately disappointed and circumspect if we get a very long CI--indicating that our study may have been pretty useless. Overall, it's of great value to a researcher to know how precise any result is. The CI gives the best information available in the data on that. 
We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site www.thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish. You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters. Enjoy! Geoff
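The contrast drawn above, that a 95% CI misses about 5% of the time yet any single CI gives a good idea of the width of the whole dance, can be checked with a short simulation. This is a minimal sketch under assumptions chosen here for illustration: a single group of N = 32 drawn from a normal population with mean 50 and SD 20, and a t-based interval whose critical value is hard-coded for df = 31. It is not the ESCI code.

```python
import random
from math import sqrt
from statistics import mean, quantiles, stdev

N = 32
T_CRIT = 2.0395  # two-tailed 95% critical t for df = N - 1 = 31; change it if N changes


def dance_of_cis(mu=50.0, sigma=20.0, replications=5000, seed=2):
    """Return (coverage, widths) for 95% CIs on the mean over many replications."""
    rng = random.Random(seed)
    captured = 0
    widths = []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(N)]
        centre = mean(sample)
        margin = T_CRIT * stdev(sample) / sqrt(N)
        if centre - margin <= mu <= centre + margin:
            captured += 1  # this interval would not be shown in red
        widths.append(2 * margin)
    return captured / replications, widths


coverage, widths = dance_of_cis()
width_deciles = quantiles(widths, n=10)  # 10th, 20th, ... 90th percentile of CI width

print(f"share of CIs capturing mu: {coverage:.3f}")
print(f"10th percentile of width:  {width_deciles[0]:.2f}")
print(f"90th percentile of width:  {width_deciles[8]:.2f}")
```

Comparing the 10th and 90th percentiles of CI width with the same percentiles of p from a matching simulation is one way to see the difference described above: widths stay within a modest range of each other, while p values from identical studies can differ by orders of magnitude.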
Sorry, collector's item! There's also a blue version, for the intro book, but I'm afraid I haven't been able to arrange an online store to sell that one either :-( For fun, try searching, at RUclips, for 'significance roulette'. Enjoy--and thanks for your interest! Geoff
Hi there, Interesting talk (and APS poster) but... Ok, so you have a book to sell. I think what you are highlighting is nothing more than looking at two complementary ways of expressing the results of a single experiment. If I replicate an experiment exactly, then I will in effect have a larger sample size overall. This will have the happy effect of improving the precision of the measurement (as shown by a reduced range of the confidence interval) and improving my prospects of returning a low p-value (yes, maybe even < 0.05). Both p-values and confidence intervals are sensitive to sample size and in the case of the p-value, the effect size. Many students and researchers forget that establishing statistical significance says nothing about clinical significance. I put it to you that your angle on this is nothing more than a (commendable) attempt to improve the statistical literacy of researchers and consumers of research. However, I am always especially sceptical of educators/researchers who seem to be peddling the 'flavour of the month'. This is not 'new statistics' - for example check out the book 'Statistics with confidence: confidence intervals and statistical guidelines', by Douglas Altman (BMJ books, 2000), where they expressed a similarly zealous approach to changing our 'bad' ways. All the best, Philip philip.dee@bcu.ac.uk
Thanks Philip for your comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) And now I have a second book to sell, an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com Yes, the 'new statistics' are not themselves new, CIs having been around for approaching a century. However, it would indeed be new, and imho highly beneficial, for researchers to switch from p values to better ways, for example estimation and meta-analysis. Yes, in medicine CIs have been very widely used since the 1980s (thanks to Ken Rothman and others) but p values are still almost universally used, and usually provide the basis for interpretation. Alas! You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy! Geoff
Thanks Nikolai! I agree. I was blown away when I first started playing with these simulations, inspired by a 1995 paper by Frank Schmidt. That led to my videos and a paper explaining how the enormous variability in the p value remains true, even for large N, large true effects. (Yes, the average p may be smaller for large N, large effects, high power, but the amount of dancing around is still enormous.) See tiny.cc/pintervals For more, you may care to search RUclips for 'significance roulette', to find two videos. Enjoy! Geoff
@@CienciacomCerteza The article by Frank Schmidt, details below. It was published in the first volume of the new journal 'Psych Methods'. The editor was much criticised for publishing such an article that was so strongly 'undermining' what was then regarded as best practice. History has vindicated both Schmidt and the editor: the article has received more than 1400 citations! See especially Tables 1 and 2, and the discussion around there. Sampling variability is often so large, and has effects that most researchers don't have good intuitions about! Thanks for asking. Enjoy! Geoff
Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129. doi: 10.1037/1082-989X.1.2.115
Geoff, what you have done here is a very misguided analysis. You have simply shown that statistical testing based on inadequate sample sizes does not make sense. Question: did you purposefully choose sample sizes of 32, leading to a power of only 0.52? Your last sentence in the video should have been: this is a classic case for not performing underpowered studies, not that the p-values have erratic behavior. In addition, you have selected huge standard deviations (relative to the means), hence there is a substantial amount of overlap between two populations. Suggestion 1: Increase the sample sizes, achieve better power, and your p-value dance will be totally different. Suggestion 2: Decrease both standard deviations to 10 and keep the sample sizes unchanged (n=32). You will be highly disappointed when you see the results: no p-dances at all. Almost all p-values will be < 0.05. Your video about "p-dances" should be retitled to "how to intentionally perform bad statistics".
@@miodraglovric5093 Miodrag, Thank you for this further comment, and especially for your positive and encouraging remarks. Your 3 paras:
1. Yes, Trafimow & Marks really threw out the baby with the bathwater! In my view it was a great move to outlaw NHST, but they should have continued to evaluate for possible publication manuscripts that used classical or Bayesian estimation. Actually, I hope NHST and p values will quietly wither away, as people grow to understand that they are superfluous and damaging. I hope that heavy-handed banning won't be necessary, but I may easily be wrong. The paper you mention is by committed Bayesians. I know 3 of them well, and have had many discussions with them over the years. We agree on many things, but not on the value of single CIs. They take the hard line, also taken by strict Frequentists, that only the formal definition of a CI can be used for interpretation. Meaning that we can only recite 'my interval comes from a notional infinite sequence of intervals, 95% of which include the population value'. Which on Planet Earth is not much help. In my books and papers I advocate and explain a broader more pragmatic approach, based on familiarity with dances--the dance of the CIs, dance of the p values. In most practical cases it's quite reasonable to say 'I'm reasonably confident that my interval includes the true value, although I always keep in mind that it may be red'. (In ESCI dances of CI, those intervals that miss the population value are shown in red.)
2. Yes, move away from any dichotomous decision making. Having an interval null hypothesis is an advance, but doesn't go the full way to estimation, which I advocate. I'm not a fan of renaming CIs as compatibility intervals, although that may be one of several not-too-bad ways to think of CIs.
3. Yes, high power gives different dances of p values, but even with very high power, when virtually all p values are less than .05, there is still large and erratic jumping around.
The sig roulette videos are relevant. Below is a table based on the 2008 paper I mentioned. (I hope it displays reasonably well).

pobt    p interval         Probability(replication p > .05)
.001    (.0000003, .14)    .17
.01     (.00001, .41)      .33
.05     (.0002, .65)       .50
.2      (.002, .79)        .67 (.33 chance p < .05)

'pobt' is the two-tailed p value given by an initial study (of any size, power, true effect size)
'p interval' is the 80% prediction interval for the two-tailed p given by an exact replication.
third column is the probability that the exact replication gives p>.05

So, even obtaining .001 does not guarantee that an exact replication will give even p
Thank you Ollie! You may also be interested in Significance Roulette. There are two videos--the easiest way to find them is by searching RUclips for 'significance roulette'. Enjoy! Geoff
The reasoning behind this video is flawed if you don't mention statistical power. The simulated experiments in the video have a power of 52%. This means that there is only about a 52% chance that the null hypothesis will be rejected (i.e. a p-value < 0.05) in each repetition of the experiment. Conversely this means that about 48% of the p-values will be larger than 0.05, which can easily be seen from the distribution of the simulated p-values, where the two categories of p > 0.1 and 0.1 > p > 0.05 have a theoretical sum of 48.4% and the simulated experiment converges to this value asymptotically.
I think you are being very unfair. Statistical power is only easy to calculate if you know the true means and variances and other aspects of the underlying distributions. If you did know them you would not need to do any experiments because you could just look at what you already know to decide if there is an effect and how big it is. If you don't know them yet then you still have to do experiments that are guided by fairly blind guesses and rules of thumb, and face the possibility that your p-value, as well as your estimates of underlying variances on which you might want to base power calculations, may be deeply unreliable. This is an excellent video that should be compulsory viewing for all scientists and your criticism of it seems to me rather pseudo-clever.
This is totally backwards. You are looking at the probability of getting a good p value if an effect is real. We want the opposite. How much more likely an effect, given a p value? If you ran such a simulation, you'd find the p value correlates with the existence of an effect better than any other measure. It's random, because data is random. Sometimes a few patients given a good drug will drop dead anyway. Your p value will be high because it should be. It's not a convincing result, get more data.
Thanks for the comment! You are correct that, with p values, we have to deal with weird, counter-intuitive backwards logic! That's at the heart of the problem of statistical significance testing. It's why it's so hard to understand intuitively. Yes, as researchers, we see a p value and would love to know the effect size in the population. Sure, a larger ES will, on average, give a smaller p, BUT there is huge variation in the p value, simply because of sampling variability. That's what the dance illustrates. You can play for yourself with the wonderful 'esci web' software built by Gordon Moore. Runs in any browser: www.esci.thenewstatistics.com/ Click 'dances', then 'dance of the p values' at red 9, bottom panel on left. Click '?' top right in left hand panel to see tips on mouse hover--to explain what's going on. Set small or large ES, small or large N, and you'll see p values dance wildly. Sure, smaller N gives p that is, on average, larger, etc, etc, but in just about every situation p varies widely. The single value of p gives no idea of the extent of uncertainty, whereas a CI does: the length gives us excellent info about the amount of uncertainty. That's all explained in multiple ways, and illustrated, in my 2008 article: tiny.cc/pintervals Here's another way to think about the issue: Suppose you do an initial study and calculate p = .01. What p value would an exact replication (exactly the same, just with a new random sample) give? Turns out that a very wide range of p values are perfectly possible. If initial p = .05, then replication p will, of course, be, on average, a bit larger, but still there is massive uncertainty. That's also explained in that 2008 article, with formulas. For a demo, with explanation, search RUclips for 'significance roulette' and find two videos. Enjoy! Geoff
Josh, I hope you had sweet dreams, of p values leaping all over the place! To teachers who find this video useful--good on you, way to go! You might also find two more recent videos diverting--more attempts by me to make dramatic some basic statistical ideas. You could use the two links below, or go to RUclips and search for 'Significance Roulette'. Enjoy (as you weep...). Sweet dreams... Geoff tiny.cc/SigRoulette1 tiny.cc/SigRoulette2
Thanks a lot for letting us understand the "camouflaged" side of p-values. I love Confidence Intervals!
Thanks Eva for your positive comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com
You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy!
And continue your love of confidence intervals--they always tell the truth!
Geoff
Thank you. It was an excellent explanation and easy to understand.
Thanks Geyoul for your positive words, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy! Geoff
Really excellent! An utterly convincing demonstration.
Thanks!
Very interesting. Thanks for this.
Thanks Taylor and Robert for your positive comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com
You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy!
Geoff
A very nice demonstration of what the p-value is NOT. Thanks!
Thank you! There's more in my article 'The New Statistics: Why and How' to appear in 'Psychological Science' in Jan. Just released online:
tiny dot cc slash tnswhyhow
Enjoy...
Geoff
Watching this in 2021 and wow what a great explanation
Thanks Mohammad, glad you liked it!
It's such an important idea that the p value is simply not reliable, and not nearly enough folks understand that. Confidence intervals are WAY more informative and useful!
We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site www.thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish.
You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters.
Enjoy!
Geoff
I wish I could double like this. Amazing demonstration and fantastic educator!
Thanks Erik, very nice of you to say so!
In case of interest, I'll mention that Bob and I are working on a second ed. of our intro textbook, with totally new software. Full info, our blog, and download of the new software (some still being refined) at thenewstatistics.com
Also, here's part of a recent reply to an earlier comment:
It's such an important idea that the p value is simply not reliable, and not nearly enough folks understand that. Confidence intervals are WAY more informative and useful!
We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish.
You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters.
Enjoy!
Geoff
Thanks Geyoul Kim!
Did you laugh or weep?!
Enjoy...
Geoff
Hmmmmmm. I'm not convinced confidence intervals are better. If you repeated the experiment then, yes, the next mean would lie in that interval, probably. But if you repeat it many times, the distribution isn't going to be centered around the first mean you obtained.
Parametric bootstrap maybe at least gives you some idea of how lucky you've been.
p-values are a minimum standard, confidence intervals are a useful addition, not a replacement. Especially for more complicated statistics, with a p-value you need only convince me that your null hypothesis is a reasonable null hypothesis, and then you're free to invent whatever test statistic you wish. With a confidence interval you need to convince me that all of your prior beliefs are reasonable.
I also quite like null hypothesis non-rejection regions. For a t-test this has the same size as the confidence interval, but it's around zero.
Magnitude of effect in excess of the average magnitude of effect given the null-hypothesis also seems useful as a conservative estimate of effect size -- this is something I'm currently using, to rank a list of things by effect size, to find items that we're fairly certain have a large effect size, and avoid a regression-to-the-mean effect. (It seems a fairly general thing to use, but I still have some doubts on this one.)
Charles R. Twardy Thanks Charles. And there may be progress happening. Watch out for imminent announcement from 'Psychological Science' about new submission guidelines to apply from Jan.
tiny dot cc slash pssubguide
Geoff
Paul Harrison Thanks Paul. Para 1: At least for familiar cases I disagree: CI gives more info than p. Giving p as well adds nothing. Evidence that it's better to use CI without p, at least in some common situations:
tiny dot cc slash cisbetter
Agree that effect sizes are what we almost always are, or should be, most interested in, and what interpretation should focus on.
Geoff
Although I agree with Cumming's call for CIs and meta-analysis, I disagree with some of the assumptions in this video. I commented on that in a recent article in Frontiers in Psychology, and here are some excerpts from that comment:
Firstly, Cumming’s "dance of p’s ...is not suitable for representing Fisher’s ad hoc approach (and, by extension, most NHST projects). It is, however, adequate for representing Neyman-Pearson’s repeated sampling approach". The role of the p-value for each approach is different, for Neyman-Pearson's approach being "a simple count of significant tests irrespective of their p-value".
Secondly, as it turns out, Cumming’s simulation is "a textbook example of what to expect given power" (under Neyman-Pearson's approach). For example, 52% of tests should be significant at α = .05 in the long run, when power is set to 52%.
Thirdly, Cumming doesn't compare p's and CI's fairly. "To be fair, a comparison of both requires equal ground. At interval level, CIs compare with power". While Cumming’s simulation reveals that about 95% of sample statistics fall within the population CI (out of 95% expected), 52% of those sample statistics are statistically significant (out of 52% expected). Furthermore, "at point estimate level, means (or a CI bound) compare with p-values, and Cumming’s figure reveals a well-choreographed dance between those. Namely, CIs are not superior to Neyman-Pearson’s tests when properly compared although, as Cumming discussed, CIs are certainly more informative."
----
Perezgonzalez JD (2015). Frontiers in Psychology (doi: 10.3389/fpsyg.2015.00034, journal.frontiersin.org/Journal/10.3389/fpsyg.2015.00034/full).
Thanks for your comments, all that time ago. (I don't know why I wasn't automatically alerted to your comment.)
Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com
Yes, power specifies what % of p values will, in the long run, be less than .05 (or whatever value is chosen for alpha). But even so the variability of p from study to study (where each is an exact replication, just with a new sample) is very large. If power is very high, sure, most p values are small, but there is still great variability.
I don't agree that "at interval level, CIs compare with power". Or maybe I don't understand what you mean. Power is a single number, as is a p value. That's a big part of the problem, because it leads to dichotomous thinking, black-and-white thinking, whereas the world is actually a million (or more) colours. An interval, e.g. a CI, is more informative than any dichotomous declaration, notably by quantifying the amount of uncertainty.
I've more recently taken a different approach to illustrating the variability of p values: See two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy!
Geoff
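A minimal sketch of the point about power and variability, for readers who want to try it outside ESCI. It assumes a two-group design with 32 per group and a population effect of half a standard deviation, analysed with a z test with known SD (a simplification chosen so that only the Python standard library is needed; it is not the t test ESCI uses). The function name and parameter values are illustrative assumptions.

```python
import random
from statistics import NormalDist

def dance_of_p_values(n_per_group=32, effect_size=0.5, replications=25, seed=1):
    """Simulate exact replications of a two-group study; return two-tailed p values.

    Assumes both populations are normal with SD 1 and that the SD is known,
    so a z test applies (an illustrative simplification).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se_diff = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    p_values = []
    for _ in range(replications):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        z = diff / se_diff
        p_values.append(2 * std_normal.cdf(-abs(z)))  # two-tailed p
    return p_values

if __name__ == "__main__":
    # A short dance: 25 replications, significant ones marked with *.
    for p in dance_of_p_values():
        print(f"p = {p:.3f}" + ("  *" if p < 0.05 else ""))
    # The long run: the share of p < .05 should be close to the power.
    many = dance_of_p_values(replications=4000, seed=2)
    share = sum(p < 0.05 for p in many) / len(many)
    print(f"proportion with p < .05: {share:.2f}")
```

Under these assumptions the long-run share of p < .05 should sit near the power of about 52%, while the individual p values in the short run still range from tiny to large.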
It would have been important to also mention the fact that the number of observations drawn from the population (i.e. the sample size) increases the "predictability" of inferences from the p-value.
Hendrik, Thanks for your comment. For a given effect size in the population, if we increase N (sample size) then the p values we get are on average smaller. Yes, that's certainly true. But the perhaps surprising thing is that they still bounce around to a very large extent. Statistical power sets the sampling distribution of the p value, so if we increase power by increasing N and/or increasing population effect size, then we shift to a different sampling distribution of p, with a lower mean. But the p interval (meaning the interval within which 80% of p values will lie) is still very long.
Consider another approach. Suppose we know only that an initial study gives p=.05. Then a single exact replication will give p that's likely to be very different from that initial p. The distribution of such a replication p is derived, illustrated and explained in my 2008 article below. The remarkable thing is that this distribution does not depend on the N of the initial study--assuming the replication has the same N (and everything else) as the initial study. It depends just on that initial p value. Hard to grasp, I know--it took me ages--but note that getting p=.05, for example, with very large N will happen only if the observed effect size is very small. In which case, a replication will give, most likely, a slightly different (but also small) effect size, and this is likely to have a very different p.
I think the best way to appreciate just how unreliable p is, even for large samples, is to watch two videos: At RUclips, search for 'significance roulette'
A p value is never to be trusted, even with large or very large N!
Geoff
I discuss all this, with pictures, simulations, and formulas, in tiny.cc/pintervals
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300. doi: 10.1111/j.1745-6924.2008.00079.x
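The claim about sample size can be explored with a rough Monte Carlo sketch. It estimates the interval containing 80% of two-tailed p values for a fixed population effect as N grows. The assumptions are mine, not taken from the article: a two-group z test with known SD, so that the test statistic is normal with mean effect_size * sqrt(N / 2) and SD 1, and an effect size of 0.5. The function name is hypothetical.

```python
import random
from statistics import NormalDist

def simulated_p_interval(n_per_group, effect_size, coverage=0.80, draws=20000, seed=3):
    """Monte Carlo estimate of the interval holding `coverage` of two-tailed p values.

    Assumes a two-group z test with known SD, so the test statistic is
    Normal(effect_size * sqrt(n_per_group / 2), 1).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    shift = effect_size * (n_per_group / 2) ** 0.5
    p_values = sorted(
        2 * std_normal.cdf(-abs(rng.gauss(shift, 1.0))) for _ in range(draws)
    )
    tail = (1 - coverage) / 2
    low = p_values[round(tail * draws)]             # about the 10th percentile
    high = p_values[round((1 - tail) * draws) - 1]  # about the 90th percentile
    return low, high

if __name__ == "__main__":
    for n in (32, 64, 128):
        low, high = simulated_p_interval(n, effect_size=0.5)
        print(f"N = {n:3d} per group: 80% of p values lie in ({low:.2g}, {high:.2g})")
```

Larger N moves the whole interval towards smaller p values, but under these assumptions the two limits remain orders of magnitude apart.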
Best statistics video ever! 💞
Thanks very much Norma! Clearly, you are a highly intelligent and insightful person!
Do you also know two videos of a different demo of just how crazy p values can be?
Suppose you do an initial study and calculate p = .01. What p value would an exact replication (exactly the same, just with a new random sample) give? Turns out that a VERY wide range of p values are perfectly possible. If initial p = .05, then replication p will, of course, be, on average, a bit larger, but still there is massive uncertainty. For a demo, with explanation, search RUclips for 'significance roulette' and find two videos.
Enjoy!
Geoff
That's pretty amazing Excel magic there.
Thanks! For more, you may care to search RUclips for 'significance roulette', to find two videos. Enjoy! Geoff
well done.
Have you talked with Ken Rothman?
When he was editor of Epidemiology, any paper with a p value in it was immediately rejected... especially if it was not an RCT.
But it does confirm that CIs give real information, not p.
thanks
Thanks Nicholas for your positive comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) Since then I've published a second book, joint with Bob Calin-Jageman. It's an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com
Yes, I've met Ken Rothman several times, usually for lunch in Boston, and have emailed over the years. A while back we published a paper describing and evaluating the 'no p values' policy he instituted at the American Journal of Public Health, then, as you say, insisted on in the journal he founded and edited for almost a decade--Epidemiology.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science, 15, 119-126.
You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy!
Geoff
You are the best instructor ever... Why did I just find this T_T Where were you... hahah
??? He's right here. And you can bask in his instruction right here.
hahahaha..... B..R..I..L..L..I..A..N..T
Many thanks Geoff for providing such a wonderful and intuitive explanation of the unreliability of the p values!
Thanks!
If you or anyone is really keen, I have a paper on it:
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300.
Enjoy...
Geoff
Will be grateful. mail?
First of all, I loved the dance. Both my partner and I laughed at the random (or not so random) honking paired with emotionally labile dancers. A wonderful program.
(sorry in advance, neither of us are statisticians, just trying to learn enough to do our job well)
So my partner and I have a bit of a disagreement resulting from watching this video. We were wondering in what applications a p-value shouldn't be trusted. Effect size & CI are a given as extremely important & required context. Assuming bias to be negligible & sample size to be sufficient for the field of research due to a wonderfully constructed experiment...
One argument was that a p-value is not a great instrument for planning a study, but is better for planning management. For example, if I did a study & achieved p = 0.05 showing a +'ve correlation between poking you in the right eye & improving your tax returns then anyone who comes in with poor tax returns I could poke them in the eye & feel 19/20 likelihood to improve their tax returns. However if I attempted to prove this association with further research I may have difficulties.
The other interpretation was that p-value is a poor tool for both research planning & developing management. The poor predictability on where p-value may fall in any one experiment means that its utility, even once obtained to be 'significant', lends poor predictive value for both.
That is a sick freakin beat my dude
Thanks Liz! May all your confidence intervals be short...
Geoff
❤❤❤
Brilliant video
Thanks Niels!
We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site www.thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish.
You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters.
Enjoy!
Geoff
Hi Geoff, you mentioned the width of the CI is predictive in some way of future CI widths, but what value is that to a researcher? Won't the range of the CI bounce around as shown in your simulation, and doesn't that imply that the confidence interval itself will miss the true mean as it shifts along the axis of possible means some percentage of the time? Perhaps as often as the p value shifts?
Thanks Randvids,
You are absolutely correct that both CIs and p values dance around with replication--as the simulation illustrates. The large extent of the dancing, in both cases, is probably way more than most folks would predict or expect. Yep, the world is full of random variation, unless we're lucky enough to be able to work with samples that are huge. By definition, a 95% confidence interval will miss the true population value on 5% of occasions (assuming random sampling, etc)--these are displayed red in ESCI. So, in a lifetime of seeing and interpreting CIs, some unknown 5% will miss what they are estimating. (In real life, those intervals don't come red, unfortunately!)
BUT there is a vital difference between the dancing of CIs and p values. Any single CI gives a pretty good idea how wide, how frenetic, the dance is. In real life we get only a single CI, not part of the dance, so it's highly valuable that any one CI gives us a good idea of the extent of uncertainty. We can be 95% confident (no more, but no less) that our single CI has landed so that it includes the true population value. Hooray!
In stark contrast, any single p value gives us virtually no information about the dance it came from. The next p value in the dance may be much bigger, or much smaller. But a p value is a single value, sometimes even reported to 3 decimal places (!), which shouts 'accurate', 'trustworthy' at us--despite it telling us very little. In contrast, the single CI makes the uncertainty salient--its length shouts 'there is uncertainty', 'there is doubt'. It even quantifies the extent of that uncertainty, so we can be very happy if we get a very short CI, and be appropriately disappointed and circumspect if we get a very long CI--indicating that our study may have been pretty useless.
Overall, it's of great value to a researcher to know how precise any result is. The CI gives the best information available in the data on that.
We now have software that anyone can run in a browser that allows exploration of the dance of the p values. At our site www.thenewstatistics.com go to the ESCI menu and click on 'ESCI on the web'. Click (top left) on 'dances' and then explore as you wish. Click '?' (top right) to get mouse-over tips. Click to open Panel 9, Dance of the p values, then explore as you wish.
You may also care to search at RUclips for 'Significance Roulette' for yet more demos of how crazy p values are, and see that it's even more crazy that anyone uses them to make any decision that matters.
Enjoy!
Geoff
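The contrast drawn in this reply can be checked with a small simulation of the dance of the CIs. This is a sketch under stated assumptions: a single-group study with N = 32, a normal population with hypothetical mean 50 and SD 20, and the 95% critical value of t for 31 degrees of freedom hard-coded as 2.0395 so that only the standard library is needed. All names and values are illustrative.

```python
import random
from statistics import fmean, stdev

N = 32                  # sample size of each study (hypothetical)
MU, SIGMA = 50.0, 20.0  # hypothetical population mean and SD
T_CRIT = 2.0395         # two-sided 95% critical t for N - 1 = 31 df, hard-coded

def dance_of_cis(replications=4000, seed=4):
    """Return (lower, upper) 95% CIs for the mean from repeated samples of size N."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(replications):
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        centre = fmean(sample)
        margin = T_CRIT * stdev(sample) / N ** 0.5  # margin of error from this sample alone
        intervals.append((centre - margin, centre + margin))
    return intervals

if __name__ == "__main__":
    cis = dance_of_cis()
    # Share of intervals that capture the population mean (the non-red ones).
    coverage = sum(lo <= MU <= hi for lo, hi in cis) / len(cis)
    lengths = sorted(hi - lo for lo, hi in cis)
    short = lengths[len(lengths) // 10]      # 10th percentile of CI length
    long_ = lengths[9 * len(lengths) // 10]  # 90th percentile of CI length
    print(f"coverage: {coverage:.3f}")
    print(f"80% of CI lengths lie between {short:.1f} and {long_:.1f}")
```

Under these assumptions about 95% of intervals capture the population mean, and the lengths of most intervals differ from one another by well under a factor of two, which is why one interval gives a fair idea of the whole dance.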
Where can I get that tshirt!?
Sorry, collector's item! There's also a blue version, for the intro book, but I'm afraid I haven't been able to arrange an online store to sell that one either :-(
For fun, try searching, at RUclips, for 'significance roulette'. Enjoy--and thanks for your interest!
Geoff
Hi there,
Interesting talk (and APS poster) but...
Ok, so you have a book to sell. I think what you are highlighting is nothing more than looking at two complementary ways of expressing the results of a single experiment.
If I replicate an experiment exactly, then I will in effect have a larger sample size overall. This will have the happy effect of improving the precision of the measurement (as shown by a reduced range of the confidence interval) and improving my prospects of returning a low p-value (yes, maybe even < 0.05).
Both p-values and confidence intervals are sensitive to sample size and in the case of the p-value, the effect size. Many students and researchers forget that establishing statistical significance says nothing about clinical significance. I put it to you that your angle on this is nothing more than a (commendable) attempt to improve the statistical literacy of researchers and consumers of research.
However, I am always especially sceptical of educators/researchers who seem to be peddling the 'flavour of the month'. This is not 'new statistics' - for example, check out the book 'Statistics with confidence: confidence intervals and statistical guidelines', by Douglas Altman (BMJ books, 2000), where they expressed a similarly zealous approach to changing our 'bad' ways.
All the best,
Philip
philip.dee@bcu.ac.uk
Thanks Philip for your comment, all that time ago. (I don't know why I wasn't automatically alerted to your comment.) And now I have a second book to sell, an intro stats textbook, a kind of prequel to the first New-Statistics book. Info at our blog site www.thenewstatistics.com
Yes, the 'new statistics' are not themselves new, CIs having been around for approaching a century. However, it would indeed be new, and imho highly beneficial, for researchers to switch from p values to better ways, for example estimation and meta-analysis. Yes, in medicine CIs have been very widely used since the 1980s (thanks to Ken Rothman and others) but p values are still almost universally used, and usually provide the basis for interpretation. Alas!
You may be interested in two more recent videos: At RUclips, search for 'Significance roulette'. Enjoy! Geoff
Ryan Faulk sent me here
Thank you to Ryan!
Hope you enjoy, and wonder...
Geoff
amazing!
Thanks Nikolai! I agree. I was blown away when I first started playing with these simulations, inspired by a 1995 paper by Frank Schmidt. That led to my videos and a paper explaining how the enormous variability in the p value remains true, even for large N, large true effects. (Yes, the average p may be smaller for large N, large effects, high power, but the amount of dancing around is still enormous.) See tiny.cc/pintervals For more, you may care to search RUclips for 'significance roulette', to find two videos. Enjoy! Geoff
@@geoffdcumming Thanks for the video, really exciting and worthy. I'm curious about which paper by F. Schmidt you were talking about...?
@@CienciacomCerteza
The article by Frank Schmidt, details below. It was published in the first volume of the new journal 'Psych Methods'. The editor was much criticised for publishing such an article that was so strongly 'undermining' what was then regarded as best practice. History has vindicated both Schmidt and the editor: the article has received more than 1400 citations!
See especially Tables 1 and 2, and the discussion around there. Sampling variability is often so large, and has effects that most researchers don't have good intuitions about!
Thanks for asking. Enjoy!
Geoff
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129. doi: 10.1037/1082-989X.1.2.115
Geoff, what you have done here is a very misguided analysis. You have simply shown that statistical testing based on inadequate sample sizes does not make sense. Question: did you purposefully choose sample sizes of 32, leading to a power of only 0.52? Your last sentence in the video should have been: this is a classic reason not to perform underpowered studies, not that the p-values have erratic behavior. In addition, you have selected huge standard deviations (relative to the means), hence there is a substantial amount of overlap between the two populations. Suggestion 1: Increase the sample sizes, achieve better power, and your p-value dance will be totally different. Suggestion 2: Decrease both standard deviations to 10 and keep the sample sizes unchanged (n=32). You will be highly disappointed when you see the results: no p-dances at all. Almost all p-values will be < 0.05. Your video about "p-dances" should be retitled to "how to intentionally perform bad statistics".
Miodrag, thank you for your comment. You are correct that different power will give different proportions of replications that give p
@@miodraglovric5093 Miodrag, Thank you for this further comment, and especially for your positive and encouraging remarks. Your 3 paras:
1. Yes, Trafimow & Marks really threw out the baby with the bathwater! In my view it was a great move to outlaw NHST, but they should have continued to evaluate for possible publication manuscripts that used classical or Bayesian estimation. Actually, I hope NHST and p values will quietly wither away, as people grow to understand that they are superfluous and damaging. I hope that heavy-handed banning won't be necessary, but I may easily be wrong. The paper you mention is by committed Bayesians. I know 3 of them well, and have had many discussions with them over the years. We agree on many things, but not on the value of single CIs. They take the hard line, also taken by strict Frequentists, that only the formal definition of a CI can be used for interpretation. Meaning that we can only recite 'my interval comes from a notional infinite sequence of intervals, 95% of which include the population value'. Which on Planet Earth is not much help. In my books and papers I advocate and explain a broader more pragmatic approach, based on familiarity with dances--the dance of the CIs, dance of the p values. In most practical cases it's quite reasonable to say 'I'm reasonably confident that my interval includes the true value, although I always keep in mind that it may be red'. (In ESCI dances of CI, those intervals that miss the population value are shown in red.)
2. Yes, move away from any dichotomous decision making. Having an interval null hypothesis is an advance, but doesn't go the full way to estimation, which I advocate. I'm not a fan of renaming CIs as compatibility intervals, although that may be one of several not-too-bad ways to think of CIs.
3. Yes, high power gives different dances of p values, but even with very high power, when virtually all p values are less than .05, there is still large and erratic jumping around. The sig roulette videos are relevant. Below is a table based on the 2008 paper I mentioned. (I hope it displays reasonably well).
pobt    p interval          Probability(replication p > .05)
.001    (.0000003, .14)     .17
.01     (.00001, .41)       .33
.05     (.0002, .65)        .50
.2      (.002, .79)         .67 (.33 chance p < .05)
'pobt' is the two-tailed p value given by an initial study (of any size, power, true effect size)
'p interval' is the 80% prediction interval for the two-tailed p given by an exact replication.
third column is the probability that the exact replication gives p>.05
So, even obtaining .001 does not guarantee that an exact replication will give even p
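The values in the table above can be approximated with a short calculation. This sketch makes assumptions that are not stated in the comment: a z test with known SD, an initial two-tailed p converted to z_obt, and a replication z that is normally distributed around z_obt with variance 2 (one unit of sampling variance from each study). The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_z_dist(p_obt):
    """Distribution of the replication z, given the initial two-tailed p (assumed model)."""
    z_obt = STD_NORMAL.inv_cdf(1 - p_obt / 2)
    return NormalDist(mu=z_obt, sigma=2 ** 0.5)

def prob_replication_p_above(p_obt, threshold=0.05):
    """Probability that an exact replication gives two-tailed p > threshold."""
    z_crit = STD_NORMAL.inv_cdf(1 - threshold / 2)
    rep = replication_z_dist(p_obt)
    return rep.cdf(z_crit) - rep.cdf(-z_crit)

def replication_p_interval(p_obt, coverage=0.80):
    """Interval expected to contain `coverage` of two-tailed replication p values."""
    rep = replication_z_dist(p_obt)

    def abs_z_quantile(prob):
        # Bisection for c such that P(|z_rep| < c) = prob.
        lo, hi = 0.0, rep.mean + 10 * rep.stdev
        for _ in range(100):
            mid = (lo + hi) / 2
            if rep.cdf(mid) - rep.cdf(-mid) < prob:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    tail = (1 - coverage) / 2
    # Large |z| means small p, so the upper |z| quantile gives the lower p limit.
    p_low = 2 * STD_NORMAL.cdf(-abs_z_quantile(1 - tail))
    p_high = 2 * STD_NORMAL.cdf(-abs_z_quantile(tail))
    return p_low, p_high

if __name__ == "__main__":
    for p_obt in (0.001, 0.01, 0.05, 0.2):
        low, high = replication_p_interval(p_obt)
        prob = prob_replication_p_above(p_obt)
        print(f"pobt = {p_obt}: p interval ({low:.2g}, {high:.2g}), P(p > .05) = {prob:.2f}")
```

Note that N appears nowhere in the calculation: under this model the replication p distribution depends only on the initial p value.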
10/10
Thank you Ollie! You may also be interested in Significance Roulette. There are two videos--the easiest way to find them is by searching RUclips for 'significance roulette'.
Enjoy!
Geoff
The reasoning behind this video is flawed if you don't mention statistical power. The simulated experiments in the video have a power of 52%. This means that there is only about a 52% chance that the null hypothesis will be rejected (i.e. a p-value < 0.05) in any one repetition of the experiment. Conversely, this means that about 48% of the p-values will be larger than 0.05, which can easily be seen from the distribution of the simulated p-values, where the two categories p > 0.1 and 0.1 > p > 0.05 have a theoretical sum of 48.4% and the simulated experiment converges to this value asymptotically.
I think you are being very unfair. Statistical power is only easy to calculate if you know the true means and variances and other aspects of the underlying distributions. If you did know them you would not need to do any experiments because you could just look at what you already know to decide if there is an effect and how big it is. If you don't know them yet then you still have to do experiments that are guided by fairly blind guesses and rules of thumb, and face the possibility that your p-value, as well as your estimates of underlying variances on which you might want to base power calculations, may be deeply unreliable. This is an excellent video that should be compulsory viewing for all scientists and your criticism of it seems to me rather pseudo-clever.
This is totally backwards. You are looking at the probability of getting a good p value if an effect is real. We want the opposite. How much more likely an effect, given a p value?
If you ran such a simulation, you'd find the p value correlates with the existence of an effect better than any other measure. It's random, because data is random. Sometimes a few patients given a good drug will drop dead anyway. Your p value will be high because it should be. It's not a convincing result, get more data.
Thanks for the comment!
You are correct that, with p values, we have to deal with weird, counter-intuitive backwards logic! That's at the heart of the problem of statistical significance testing. It's why it's so hard to understand intuitively. Yes, as researchers, we see a p value and would love to know the effect size in the population. Sure, a larger ES will, on average, give a smaller p, BUT there is huge variation in the p value, simply because of sampling variability. That's what the dance illustrates. You can play for yourself with the wonderful 'esci web' software built by Gordon Moore. Runs in any browser: www.esci.thenewstatistics.com/ Click 'dances', then 'dance of the p values' at red 9, bottom panel on left. Click '?' top right in left hand panel to see tips on mouse hover--to explain what's going on. Set small or large ES, small or large N, and you'll see p values dance wildly. Sure, smaller N gives p that is, on average, larger, etc, etc, but in just about every situation p varies widely. The single value of p gives no idea of the extent of uncertainty, whereas a CI does: the length gives us excellent info about the amount of uncertainty.
That's all explained in multiple ways, and illustrated, in my 2008 article: tiny.cc/pintervals
Here's another way to think about the issue: Suppose you do an initial study and calculate p = .01. What p value would an exact replication (exactly the same, just with a new random sample) give? Turns out that a very wide range of p values are perfectly possible. If initial p = .05, then replication p will, of course, be, on average, a bit larger, but still there is massive uncertainty. That's also explained in that 2008 article, with formulas. For a demo, with explanation, search RUclips for 'significance roulette' and find two videos.
Enjoy!
Geoff
How many of y'all teachers sent you here zzzzzzzzz
Josh, I hope you had sweet dreams, of p values leaping all over the place!
To teachers who find this video useful--good on you, way to go!
You might also find two more recent videos diverting--more attempts by me to make dramatic some basic statistical ideas. You could use the two links below, or go to RUclips and search for 'Significance Roulette'. Enjoy (as you weep...). Sweet dreams...
Geoff
tiny.cc/SigRoulette1
tiny.cc/SigRoulette2