The best part of the name is that it appears to be a paradox at first, but then isn't a paradox at all when you examine it more closely
Yes love it
Thank you, this is the first video I've found that's explained this in a way I can easily understand
What a great explanation. Instantly subscribed ❤
I'd certainly like a series on data ethics.
Thanks!
Are there any things you should think about when analyzing data to avoid falling into the pitfall of Simpson's paradox?
Yes, the key concept to understand is called 'conditional probability'. RUclips comments are too short to give a good intro, but if you search for videos on it, you should get a decent introduction.
Also, a related concept is Bayes' Theorem, which is one of the most important tools for calculating conditional probabilities based on other conditional probabilities.
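As a rough illustration (all the numbers below are made up purely for the example), a tiny Python sketch of Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), might look like this:

```python
# Toy disease-test example; every probability here is an assumption for illustration.
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Total probability of a positive test, P(B), via the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: conditional probability of disease given a positive test, P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # roughly 0.16, despite the 95% sensitivity
```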
In my experience working with data, it is of paramount importance to always look at the data graphically, to get a feel for what's going on. Relying purely on statistical methods, including standard methods, is not enough. Judgement about why distributions look a certain way, and about the existence of clusters, is key.
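For example, a quick sketch along those lines (the data here is entirely made up) shows how plotting raw points coloured by group can reveal structure that summary statistics hide:

```python
# Each group has a clearly positive trend, but pooling the two clusters together
# makes the overall trend look negative - a Simpson's-paradox shape you would
# only notice by actually looking at the scatter plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
for label, x_center, y_offset in [("group A", 0.0, 6.0), ("group B", 4.0, 0.0)]:
    x = rng.normal(x_center, 1.0, 100)
    y = 0.8 * x + y_offset + rng.normal(0.0, 0.5, 100)  # positive slope within each group
    plt.scatter(x, y, s=10, label=label)

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```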
Thank you for such good examples and explanation
I think the case without discrete groups is also relevant. Suppose in the last example you had a continuous variable (say age as a proxy for amount of education) and a test that did not create two discrete groups: instead of consisting entirely of early graduate content, it was content selected with equal probability from all relevant difficulty levels (so presumably all college-level material), and students with more education studied less for it, decreasing smoothly with age. (That depends on why the graduate students studied less as a group; it could be a discrete cutoff because of the qualitative differences between grad and undergrad in, say, free time to study for an unimportant test, but it could be a response to how hard they expect the test to be, in which case the change to the test should work perfectly.) You would now have a thick downward-sloping band in your data sample of test score versus hours studied, and instead of some weird subgroup effect it just looks like you got screwed over by the confounding variable of age, when really hours studied causes higher test scores.
I thought it was worth mentioning just because it helps me remember that Simpson's paradox really isn't anything different from normal confounding-variable problems, so you don't actually need any special reasoning; it's just discrete instead of continuous (see the sketch below). And in terms of bad behaviour it has the same problems: not reporting a subgroup effect can push the wrong conclusion, but how do you know that a subgroup is relevant? Collecting a ton of subgroup information basically p-hacks your new "real" results. The real problem is that the results are taken far too seriously for a non-causal study of data that is too small to reliably and rigorously detect the presence of confounding variables. Though in the 2nd example you probably could easily detect the clustering effect even with a dozen other unrelated group classifiers thrown in
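A rough simulation of that continuous-confounder case (every coefficient below is an assumption chosen just to produce the effect, not taken from the video) could look like:

```python
# Hours studied genuinely raises the score, but age pushes hours down and score up,
# so the pooled data shows a downward-sloping band of score vs. hours studied.
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(18, 30, 2000)                              # continuous "education" proxy
hours = 60 - 1.5 * age + rng.normal(0, 2, 2000)              # older students study less
score = 2.0 * hours + 5.0 * age + rng.normal(0, 5, 2000)     # studying genuinely helps

# Pooled correlation looks negative even though the causal effect of hours is positive
print(np.corrcoef(hours, score)[0, 1])

# Conditioning on a narrow age band recovers the positive relationship
band = (age > 23) & (age < 24)
print(np.corrcoef(hours[band], score[band])[0, 1])
```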
thanks for your videos as always
Glad you like them!
I actually faced both situations you described during research on some soil data.
Nowadays we have data science and data scientists (who are engineers)... The principle? Program a class in Python for a deep ANN or a random forest, fill it up with data thanks to Hadoop and other data lakes, and you have a perfect result......
BUT, when I was young, we studied statistics and econometrics, and our teachers gave us another mentality! They said 'play with your data'.
This means 'before you create a model, you have to analyse the problem deeply, and this will take time and patience'.
Such a paradox was not a problem then, but it is with a deep ANN.
Cheers
I've never been this early to a video before
me too
@@rewanthnayak2972 Did you also have JEE Advanced yesterday?
@@Yaara_1 😆😆 No bro, I'm already in BTech. How did your exam go?
@@rewanthnayak2972 Badly. The result comes out in 3 days.
Jay-Z: "Numbers don't lie, check the scoreboard."
Statistician: "umm...Simpson's Paradox"
Very good topic!
Thanks!