How SIMPSON'S PARADOX explains weird COVID19 statistics
- Published: Dec 18, 2024
- Simpson's Paradox is a statistical paradox and ecological fallacy that can sometimes give very strange results. Consider the case of CFR, or Case Fatality Rate, between countries. China has a higher overall survivability for coronavirus than Italy. However, if you break it down by age group, every single age group (people in their 50s, for example) has higher survivability in Italy than in China, despite China being better overall. Why? We will explore how Simpson's Paradox and age demographics explain the apparent contradiction. The issue arises when using aggregate data vs compartmentalized data, and we will see how it is still possible to have good data and bad conclusions.
The data, and the observation that Simpson's paradox is the issue, came from this paper: arxiv.org/abs/...
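A minimal sketch of the reversal described above, in Python. The counts are made up for illustration and are NOT the figures from the paper; `country_A` stands in for a country whose cases skew old and `country_B` for one whose cases skew young.

```python
# Hypothetical (cases, deaths) per age group. These numbers are invented
# to show the effect and are not real COVID-19 data.
counts = {
    "country_A": {"young": (100, 1), "old": (900, 90)},   # cases skew old
    "country_B": {"young": (900, 18), "old": (100, 12)},  # cases skew young
}

def overall_cfr(groups):
    """Aggregate CFR: total deaths divided by total cases."""
    total_cases = sum(cases for cases, _ in groups.values())
    total_deaths = sum(deaths for _, deaths in groups.values())
    return total_deaths / total_cases

# CFR inside each age group, per country.
per_group = {
    country: {age: deaths / cases for age, (cases, deaths) in groups.items()}
    for country, groups in counts.items()
}
# CFR over all cases, per country.
overall = {country: overall_cfr(groups) for country, groups in counts.items()}

for country in counts:
    print(country, per_group[country], f"overall={overall[country]:.3f}")
```

With these invented counts, `country_A` has the lower CFR in both age groups yet the higher CFR overall, because most of its cases sit in the high-risk group.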
****************************************************
COURSE PLAYLISTS:
►CALCULUS I: • Calculus I (Limits, De...
► CALCULUS II: • Calculus II (Integrati...
►MULTIVARIABLE CALCULUS (Calc III): • Calculus III: Multivar...
►DIFFERENTIAL EQUATIONS (Calc IV): • How to solve ODEs with...
►DISCRETE MATH: • Discrete Math (Full Co...
►LINEAR ALGEBRA: • Linear Algebra (Full C...
***************************************************
► Want to learn math effectively? Check out my "Learning Math" Series:
• 5 Tips To Make Math Pr...
►Want some cool math? Check out my "Cool Math" Series:
• Cool Math Series
****************************************************
►Follow me on Twitter: / treforbazett
*****************************************************
This video was created by Dr. Trefor Bazett. I'm an Assistant Teaching Professor at the University of Victoria.
BECOME A MEMBER:
►Join: / @drtrefor
MATH BOOKS & MERCH I LOVE:
► My Amazon Affiliate Shop: www.amazon.com...
Maybe Simpson's paradox should be called an "aggregation error"
Thank you for this video, I finally understood this paradox. I wish I had found your channel when I was studying for my undergrad. You are the best
Thank you for this. This is a very timely and useful lesson, and you are a great teacher!
Best explanation of Simpson's Paradox on YT. Thank you so much for solving my problem.
Thanks for putting all of this amazing content up. It's like going back to high school, but if the teachers were awesome!
Great explanation...keep posting on probability and statistics
Thank you on behalf of my class! This is a great example of "Simpson's Paradox" with a relevant application.
I remember watching your videos for my UC calculus classes! Good to see you make something relevant to what’s going on and super informative :)
Trefor Bazett haha yup! We all miss you too! Been recommending your videos for all the freshmen and sophomores who came in after you left :) those were what got me through calc 2 haha
Thank you Dr. Bazett! You have explained this better than my textbook and resources for class. It is so clear now!
this seems like a key part of why correlation is not always causation.
If that arbitrary x axis from your graph at 3:54 was the level of an intervention applied (e.g. an increase in a certain nutrient in people's diet), it could seem like the intervention causes a negative effect, when really in every subgroup it causes a positive effect. I wonder if having a control group and an intervention group fixes this, as is done in many scientific studies? I think it does, as the intervention graph would look like the control graph shifted right and up (in the case of the graph in the video) even though the overall graph is trending down...
with correlation studies, I guess this is why they have to "control" for certain factors such as age race etc. but couldn't you tell whatever story you want by grouping the nodes differently. in the graph from the video, you could further subdivide each group into 2-4 "strip" parts per group, that follow a downward trend.
I wonder, is there a general way to know you are picking the groups in a good/honest way? Maximizing the relative distance between categories, maybe? Or are the relevant factors not shown on the graph (e.g. the boolean x boolean graph of x-axis = (Italy, China), y-axis = (died, didn't die), where age is not represented on the graph)?
so would a control group fix this still with any arbitrary group combos? I can't tell i think maybe not actually, i think each dot on the graph would have a percent chance to move in any direction (that is on the right half of the compass) and that percent chance would be subject to the simpson's paradox? its confusing.
How do I use this to understand if studies will actually apply to me? Or to conduct good studies? Is the best we've got to throw in as many categories as we can think of to try to get closer to the individual, then look at the aggregate correlation for the group of people with close values across that whole set of categories, and hope that it's narrow enough that Simpson's paradox doesn't apply? Or can we suss out groups visually (or using some algorithm) like in your simplified example?
my main questions:
- do control/intervention studies with random selection improve this or fix this completely? (vs just looking at existing data and finding correlations)
- how do we as statisticians best account for this and pick which groups to look at?
you don't have to answer, i will probably find out the known answers if i continue to research, i am just thinking out loud. maybe this will be a record i can look at one day when i already understand the topic and can see what I was thinking as a beginner.
pretty interesting/important fact, thanks for the explainer, it was helpful,
Last bit of writing: OK, near the end you say that in randomized data it's most likely [not] to occur, not guaranteed not to occur... I'm wondering whether it is guaranteed not to occur given an infinite sample size. You also say to identify causal relationships in order to decide how to group. Do you want to group people with similar traits that are likely to be causal to the y axis, or non-causal? I think causal; how old you are is causal to dying, right? (idk), and I'm imagining that 2x2 graph I wrote about earlier.
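One way to probe the randomization question is a small simulation. The sketch below is illustrative only: the sample size, the 30% share of older people, and the names `N_PEOPLE`, `P_OLD` and `arms` are all assumptions, not anything from the video. It randomly assigns people to a control or intervention arm and checks how similar the age mix of the two arms ends up.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
N_PEOPLE = 100_000       # hypothetical sample size
P_OLD = 0.3              # hypothetical share of older people

# Each person is old or young, then a coin flip picks their arm.
arms = {"control": [], "intervention": []}
for _ in range(N_PEOPLE):
    is_old = rng.random() < P_OLD
    arm = "intervention" if rng.random() < 0.5 else "control"
    arms[arm].append(is_old)

# Fraction of older people in each arm after random assignment.
old_share = {arm: sum(flags) / len(flags) for arm, flags in arms.items()}
for arm, share in old_share.items():
    print(f"{arm}: share of older people = {share:.3f}")
```

The reversal needs the two groups to have different subgroup mixes. Random assignment makes the mixes equal in expectation, and if the mixes were exactly equal, a treatment that is better in every subgroup would also be better in aggregate. In a finite sample the mixes only match approximately, which fits "most likely not to occur" rather than "guaranteed".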
3:35 So, how does this graph map into your Italy-China Covid example? What is the trend that changes direction when we look at the subcategories, and what are the y and x variables of that trend, and what are the subcategories? I thought it would be Case Fatality Rate (CFR) vs Age with the subcategories being the countries, but the trend doesn't change in that setting. CFR goes up with age at both global and country levels. Or is it that the y axis is CFR, the x axis is income level, and the subcategories are ages?
3:35 is a made-up situation to illustrate the point cleanly, so it wouldn't look exactly like that for Italy/China. Let horizontal be age and vertical be CFR. You'd have to create color-coded groups of dots for each age range. Then, if you only looked at the 70-80 year olds in Italy, they would be lower than the 70-80 year olds in China.
@@DrTrefor But CFR vs age goes up at both global and country levels. How does color-coding reveal a different trend?
@@enisten As the color-coded trends reveal: each of the three formerly invisible subcategories shows exactly that upwards trend. BUT the leftmost batch is probably supposed to represent a *very big group* of young people while the rightmost batch is supposed to represent a *vastly smaller group* of old people. If you plot those in the same graph then it *seems* like more young people died and far fewer old people, but that is only because you haven't normalized your group sizes to a directly comparable number and are instead trying to compare absolute values of two vastly different batch sizes! If you'd looked at the same number of young people as old people then of course it would be obvious that the mortality rate of older people is higher, and the "global" trend would show the same upwards tilt as the "country" trends.
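The "normalize your group sizes" idea can be sketched as direct standardization: apply one shared reference case mix to both countries' age-specific rates. Everything here is hypothetical (the rates, the case counts, the equal-weight `reference_mix`, and the helper name `weighted_cfr`); it shows the mechanics, not real data.

```python
# Hypothetical age-specific CFRs and case counts (not real data).
cfr = {
    "country_A": {"young": 0.01, "old": 0.10},
    "country_B": {"young": 0.02, "old": 0.12},
}
cases = {
    "country_A": {"young": 100, "old": 900},  # cases skew old
    "country_B": {"young": 900, "old": 100},  # cases skew young
}

def weighted_cfr(rates, weights):
    """Average the per-group rates using the given group weights."""
    total = sum(weights.values())
    return sum(rates[g] * weights[g] / total for g in rates)

# Crude CFR: each country weighted by its own case mix.
crude = {c: weighted_cfr(cfr[c], cases[c]) for c in cfr}

# Standardized CFR: both countries weighted by the same reference mix
# (equal numbers of young and old cases, an arbitrary choice).
reference_mix = {"young": 1, "old": 1}
standardized = {c: weighted_cfr(cfr[c], reference_mix) for c in cfr}

print("crude:", crude)
print("standardized:", standardized)
```

With a shared mix the ordering of the two countries matches the ordering inside every age group. The choice of reference mix is itself a judgment call, so the standardized figure is a comparison tool rather than a real-world rate.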
I love your enthusiasm.
Great explanation. I am curious to learn how you adjusted for the lack of random selection into the two groups of classes to do a fair comparison.
Love the presentation, content, enthusiasm, and likability all around
Much appreciated!
CFR data is often sliced into buckets across just one dimension (category) at a time, i.e. by age or by existence of a co-morbidity, e.g. obesity, hypertension, diabetes, etc. The problem is that assessing one's personal risk is complicated by the fact that poor health doesn't always line up exactly with age, and even if it does in a large enough population, how do we know whether we really have two independent variables to consider? So is it riskier to be an obese and diabetic 10 year old, a hypertensive 30 year old, or a healthy and fit 60 year old when faced with the SARS-CoV-2 virus? I know this video is almost a year old now, but a follow-up with another level of complexity would make for interesting math using this still topical example from virology. Thanks.
Amazing topics, man. I'm enjoying your channel pretty much wtg
Glad you enjoy it!
Cool!!! Great time to teach people some statistics. Funny how we all have opinions about the pandemic, sometimes very strong, without having even a basic understanding of statistics.
This was so helpful! Thank you for the great explanation. :)
I actually presented this to Reddit back in April of 2020, along with South Korea's data, since those were easily available (i.e. WHO data posted on the wiki), to showcase that this was a clear example of Simpson's Paradox
I just wanted to brag that I paused the video and figured it out. It was easy, because I got here by reading about Simpson's paradox, not COVID-19.
Your example is great, should be added to the Wikipedia article about this paradox!
Nice job!
Thank you for this, Dr. Trefor! I'm a bit of an amateur when it comes to statistics, so I just wanted to clarify: would it be accurate to say that this is because of skewness of the data, in statistical terms? That is, Italy is more left skewed (skewed towards older people) and China is more right skewed (skewed towards younger people), and as a result the weight of the numbers leans more towards where the skew is? Additionally, my initial thought was: why is the fatality rate in older people higher in China than in Italy? Again, I think I got drawn into the fallacy by looking at the overall numbers and thinking that would always apply across all age groups. I'm guessing the answer is that older people have a higher fatality rate no matter which country you are in. In the case of China, because they had fewer older people than Italy, the higher fatality did not impact the overall numbers.
Great and clear explanation with a very relatable and real example, thanks!
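The "weight of the numbers" intuition can be written down directly. As a sketch, with notation that is assumed here rather than taken from the video: let $N_{c,a}$ be the number of cases and $D_{c,a}$ the number of deaths in country $c$ and age group $a$. The overall CFR is then a case-weighted average of the age-specific CFRs:

```latex
\mathrm{CFR}_c
  \;=\; \frac{\sum_a D_{c,a}}{\sum_a N_{c,a}}
  \;=\; \sum_a w_{c,a}\,\mathrm{CFR}_{c,a},
\qquad
w_{c,a} = \frac{N_{c,a}}{\sum_{a'} N_{c,a'}},
\quad
\mathrm{CFR}_{c,a} = \frac{D_{c,a}}{N_{c,a}}.
```

Because the weights $w_{c,a}$ differ between countries, a country can have a lower $\mathrm{CFR}_{c,a}$ in every age group and still have a higher overall $\mathrm{CFR}_c$ if its weights concentrate on the high-risk groups.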
That was a very clear explanation of the paradox, thank you
My brain hurts
This is amazing... 🔥
I think a good example of this paradox is also in University admissions. Sometimes it looks like a University is being discriminatory when it turns out that the internal schools that certain demographics prefer are too full to continue to accept more students. It gets a few Universities in hot water but they usually know how to explain it these days.
Indeed there is a pretty infamous example of this
@@DrTrefor the Cal Berkeley admissions case for Grad School?
Great video, Dr. Trefor. One question: if one were to ask which country tackled COVID well, what would the answer be?
Great video! Have you looked into systems dynamics?
So, improperly combining different populations and ignoring confounding variables can lead to faulty conclusions. Who knew?
Mind blown
Well Explained. Thank you!
Glad it was helpful!
TQVM!! Nice, enjoyable lesson!
Amazing! Thank you sir!
What an outstanding explanation. Congrats!
Glad you liked it!
First rate as usual. Thanks Dr. Bazett.
The video quality is great, and I learned a lot. But the audio quality is a bit lacking. I'd invest in a good microphone. Keep up the great content!
Great video.
👌 very nice video 👍👍
I wonder what one would find if they applied this to crime statistics
Very interesting!
Very!
So Simpson's paradox leads to the conclusion that one should not conflate data unless...?
... one actually switches on their brain and looks at what exactly they're seeing and whether it is applicable in the way people claim it to be. Usually it helps to look at several different graphs from different sources, compare them, think about what made them so different, and then draw your own conclusion about who tried to sweep things under the rug and who is being extra dramatic...
Got it. Thank you for the crystal-clear explanation
That was super useful, thanks!
Also, kudos for the hypotemoose.
So glad it helped!
Is it a paradox? It seems quite obvious and simply a matter of defining the parameter you want to measure precisely.
Well sure, but most seeming "paradoxes" can be resolved when you think about the issues in the right way. Rightly or wrongly, it's the standard name so here we are:D
@@DrTrefor A true paradox is light: is it a wave or a particle? It can behave as both.
@@nn-uj1iv If you change the way you look at things, the things you look at change. Who said a particle and a wave cannot be the same thing that only *seem* different depending on your perspective? Human perception is very susceptible to expectations and physical and mental limitations we have...
so very well explained!
Glad you think so!
Can you explain your shirt?
How does this paradox contrast/compare with Lord's Paradox?
I want to work in a field of mathematics which is not taught in undergraduate education at university. Can you tell me what I should study? 🙆🏻
@2:04 you ask how it might be that for each age group, you seem more likely to survive in Italy than in China, but the overall fatality rate was higher in Italy.
. . . I've heard of a similar apparent paradox. It's from politics, so less suited for instruction except if showing politics is intended: the civil rights legislation in 1965 was supported by more Republicans than Democrats in Congress. But if you looked on the old Confederacy and the rest of the country separately, then Democrats were more supportive than Republicans both from the old Confederacy and from the rest. The solution is that there were hardly any Republicans elected from the old Confederacy in 1965.
. . . I suspect a similar case here between Italy and China: there were very few young people diagnosed with the disease in Italy, but there were plenty of young patients in China, who furthermore also died more often. But the total number of cases in Italy was dominated by old people, many of whom died, while the cases in China were more evenly distributed among age groups: the young patients in China were less likely to die than old patients in Italy.
. . . But let's watch and see what Simpson's paradox is...
@4:25 "This is an important thing to keep in mind whenever you see statistics about pretty much any phenomenon." - Yes, indeed. And if writers have agendas, they put the conclusion they want in the title where it sticks in the minds of many readers. Good reporting would be showing clear graphics of the data, split up in ALL the different ways where the data show something and then writing a title like "Both (Cause 1) And (Cause 2) Clearly Impact (Important Issue) (How)".
@@Achill101 The main thing is to give you adequate, fairly distributed and even data. That way you don't inject your inference or opinion into how you portray information by trying to rationalize your summation.
People in politics oftentimes use statistics like this to promote their agenda on both sides -- for obvious reasons. So it's important to analyse it for yourself and draw your own objective conclusion.
Reminds me of gerrymandering too, which I won't get into. But that is purely politics
@@arthurmorgan2026 - many studies have shown that HEADLINES stick in people's minds.
You might give fairly distributed data at the end of the article, but the headline will stick for most. It's still better than not giving the data at all, because some readers will read to the end and build their own opinion.
Please explain your shirt with the moose on the hypotenuse.
The hypoteMOOSE
Right. But why are there lines going from the head and tail to the corners?
Awesome
is your shirt a hypotamoose?
I figured it out, I feel so smart, let's go
This is not so much a "statistical" paradox as a simple fact that this virus is more deadly for older people and one should always classify risk in relation to age group (as one category, for example), instead of just taking the entire population.
It doesn't even account for the risks an individual is taking..
Ay Caramba!
Well, you are preaching to the converted.
What we now need to do is to force feed this video to all the YouTubers who comment on COVID. This would have 2 benefits:
1) your viewing figures would sky rocket
2) some of them might, I repeat might, talk sense.
Bayesian statistics...
So....no relation to The Simpsons
China lies about its stats, this guy's tripping haha
Damn, I thought you'd be talking about Simpsons animation (facepalm) sorry
Please give this lesson to the extreme left media
They don't use mathematics. They just make up lies to manipulate people.
Yep yep all spot on! Might check Fenton's videos also ruclips.net/video/6umArFc-fdc/видео.html
First view son...
worst explanations i've ever heard