Simple Linear Regression: Checking Assumptions with Residual Plots
- Published: 5 Feb 2025
- An investigation of the normality, constant variance, and linearity assumptions of the simple linear regression model through residual plots.
The pain-empathy data is estimated from a figure given in:
Singer et al. (2004). Empathy for pain involves the affective but not sensory components of pain. Science, 303:1157-1162.
The Janka hardness-density data is found in:
Hand, D.J., Daly, F., Lunn, A.D., McConway, K., and Ostrowski, E., editors (1994). The Handbook of Small Data Sets. Chapman & Hall, London.
Original source: Williams, E.J. (1959). Regression Analysis. John Wiley & Sons, New York. Page 43, Table 3.7.
This was posted 11 years ago T-T and has the best explanations and videos on statistics I have ever found. Thank you so much for all your hard work and legacy; I hope you know you're my savior.
I'm glad to be of help! 11 years, where'd they go? :)
I'M GIVIN THIS VIDEO THE BIG CHECK MARK
Thanks!
ME TOO + A BIG SUBSCRIBE
@@read89simo + A BIG LIKE BUTTON
One of the best, if not the best, video on regression analysis I've seen. Thank you very much for creating it. Your service is highly appreciated.
Nkululeko Shabane You are very welcome, and thank you very much for the compliment!
"think you guys should get more views..."
Thanks! (And I'll take as a compliment that you said "you guys", since this is a one man show.) Getting lots of views isn't very high on my priority list -- I'm just trying to provide the best resources for my students that I can. (I haven't done any promotion, and I don't allow ads on the videos.)
There are many students in intro stats in North America and around the world, and I'm glad that some of them find my videos helpful.
I'd upvote you x10 if I could just for the anti-advert policy.
This is exactly what I have needed. My professor goes over these plots but has been doing statistics at a high level so long that I think it's hard for him to relate to someone who is new to it. I really needed someone to just explain it all from start to finish, and you did that. Thank you so much! Your videos are so, so helpful. Sincerely, a first year statistics graduate student.
I'm glad to be of help!
Excellent methods used to help students learn in this vid. This is the future of education!
"The residual plot removes that increasing trend and then re-scales the y axis, so it's a little bit easier to see these issues.. sometimes in the residual plot." Now that is some serious insight. Thank you so much and this video was superb with really excellent examples!
Thanks for the kind words!
I was struggling to understand the assumptions in simple linear regression through other sources. This video has made it clear
I'm glad you find them useful John. Best of luck in your course!
I'm so fricking glad these videos align well with my UIUC stats class. Much appreciated!
Small...and then they're big...and then they're small...and then they're big..
Great video, pretty simple but very useful, thank you!
The more I watch your video, the more I hate my uni. Much love man
Thank you for this excellent lecture. It certainly helps.
Prof Balka knocks it out of the park every time! We miss your videos. Could you do some videos on multiple linear regression? Hope you come back soon with new vids!
Thanks for the compliment! I'm trying to make time for video production, but probably won't get back to it until the new year. It's been a busy few years, but returning to the videos has always been part of the plan (with multiple regression videos up near the top of the list). Cheers.
YAY!! Thanks Prof !! I will look out for them.
Thank you
Need more videos like these, on outliers in residuals.
This was very useful, thank you for all the information
You are very welcome Simon!
You are a great man. Thanks for your content. I am forever grateful to you.
JB thank you so much you have helped me more than you'll ever know! My only suggestion to you would be to create playlists for associated topics. Other than that your teaching methods are incredible! Thanks!
Thanks very much for the compliment Shaydoyle! I believe I do have playlists ordered by topic. I've also set up a website (www.jbstatistics.com), which keeps the videos in a more organized fashion. (I'm not plugging anything on the site - it's just organized lists of my videos.) Cheers.
At 1:56, you can't plot against Y because there is dependence between Y and the residuals? You mean the residuals are the difference between the observed and the estimated values, so it makes no sense to plot against the observed? But why? Could you clarify this?
This is definitely a great video, thank you! You are awesome!
We often simply rely on an appropriate sampling design or experimental design to ensure independence. But if, say, we have recorded the observations in some sort of time order, then plots of the residuals through time can give us some indication of whether the residuals are correlated.
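For anyone who wants to try this themselves, here is a minimal sketch in Python (not from the video; it assumes numpy and matplotlib and uses made-up data) of plotting the residuals in the order the observations were recorded:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: x and y recorded in collection (time) order.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2 + 0.5 * x + rng.normal(0, 1, 50)

# Fit the simple linear regression by least squares and get residuals.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Plot residuals against observation order; a trend or slow drift
# would hint that the residuals are correlated through time.
plt.plot(np.arange(len(resid)), resid, marker="o", linestyle="-")
plt.axhline(0, color="grey")
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()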
Very good explanation, Sir. Thank you.
Still extremely helpful in 2024
Thanks a lot! All your videos on stats are very clear and have been very helpful!
You are very welcome!
very clear, easily understandable video
Thank you very much jb statistics. This is incredibly helpful and well explained.
+Peter Song You are very welcome. Thanks for the compliment!
good and simple explanation of residual plots and assumptions.
Thanks!
Very nicely explained 👍
Great video.
Great video!
You are Awesome! Thank you so much for sharing your valuable knowledge.
This was very helpful! Thank you!
Thanks for the video. Short and sweet...!!!
You are very welcome!
Got my GLMs exam tomorrow, thank you.
Shouldn't we analyse the standardized residual plot? I mean, the residuals will naturally be bigger as the y value gets bigger, won't they? If the y range goes from 0.1 to 10 thousand, we expect bigger residual absolute values near the 10 thousand mark. Correct me if I'm wrong, please.
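One way to look at this yourself: a small sketch (assuming Python with statsmodels; the data here are purely made up) that computes internally studentized (standardized) residuals, which divide each residual by its estimated standard deviation:

import numpy as np
import statsmodels.api as sm

# Hypothetical data with a wide range of y values.
rng = np.random.default_rng(2)
x = rng.uniform(0, 100, 60)
y = 10 + 5 * x + rng.normal(0, 20, 60)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Internally studentized (standardized) residuals:
# e_i / (s * sqrt(1 - h_ii)), so each one has roughly unit variance.
std_resid = fit.get_influence().resid_studentized_internal
print(std_resid[:5])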
Thanks for this great explanation, sir.
Very useful video! Thank you
Good video, thank you very much for uploading it.
+Mostafa Ali You are very welcome. I'm glad you found it useful!
While interpreting the residual plots, can I first pool the residuals in specific bins of X (say each bin 1 unit long or whatever), so that it looks more like the previous plot with residuals for a given value of X, enabling me to verify the homoscedasticity (and also normality somewhat) more clearly?
Edit: Q) You mentioned that one of the assumptions was that for a given value of X, the error terms are normally distributed with a constant variance sigma-squared (the same for each X). Then at 5:50 you took all the residuals, disregarding the value of X, and graphically checked them for normality using a Q-Q plot. Didn't you mention that the normality assumption was for errors at a given value of X? I am confused. Please help.
Answer to edit: If we assume that sampling was completely random, then data from all treatments/groups/sub-populations/values of X were equally likely to be represented in your sample. In that case, all the residuals can be pooled together and checked for normality. It's the same as checking each treatment group. Note this applies only to the residuals, not the variables.
In regression the predictor variables are usually continuous, so it is impossible to check normality at each value of X. In ANOVA, the predictor is usually categorical, and you can venture to check residual normality for each treatment group/category. Both ANOVA and regression come under the Generalized Linear Model (GLM), so the assumptions are the same but they play out differently.
Actually, all the assumptions are on the error terms. But since the residuals are an estimate of the error, we check for "good behavior" in the residuals. We have to make do with what we have (which is the residuals; the error is unknown).
This also follows from the assumption that the errors are i.i.d. N(0, sigma^2). So all residuals (used in place of the errors as a good estimate) are identically distributed (same mean and variance) and are independent of X, which implies you can't look at a set of residuals and figure out which value of X it came from. For all you know, they could all be from the same value of X or from different values of X. Needless to say, they must come from the same population; you can't pool residuals from different populations/different predictor(s). So for checking the normality of residuals, you can disregard the value of X. This is not the case for Y (the dependent variable).
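A quick illustration of that point, as a sketch in Python with numpy and scipy (the data are simulated and purely hypothetical): when the assumptions hold, the residuals pooled across all values of X can be checked for normality in a single Q-Q plot:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data meeting the model assumptions:
# the same error distribution N(0, sigma^2) at every value of x.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 1 + 2 * x + rng.normal(0, 1.5, 200)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Pool all residuals, disregarding x, and check normality in one Q-Q plot.
stats.probplot(resid, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of pooled residuals")
plt.show()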
4:12 "I'm giving this the joker variance, because *let's put a SMILE on that FACE!* "
giving this video 5 BIG BOOMS
Very nice teaching. Thanks
So helpful! Thank you for this :)
LMAO BRO WHY ARE YOU HERE
Seline Chung WHY ARE YOU HERE
@@Jemimakl WHY ARE YOU SO HARDWORKING
@@Jemimakl BRO YOU STARTED A WEEK AGO
Seline Chung I WAS DOING HOMEWORK
These are very good videos
great video. such a clear explanation. subbed.
At 4:40 you said that there is another feature that we didn't include in our model, but couldn't we also conclude that my model is not good?
At 3:23, what kind of graph indicates non-normality?
this really helps, thank you
At 4:21, can we determine which model will solve this issue just by looking at this residual plot?
perfect video
So I got the 4:12 graph; how can I find out what kind of data I have?
im giving this video A BIG CHECK MARK (2)
very clear explanation.
I have a question regarding the Normal Q-Q plot. On the y-axis, does it show the quantiles of the residual distribution, or the residuals themselves? On the x-axis, it shows the quantiles of the residual distribution if it were normal, correct? Thank you, great video!
There are different ways of formatting these plots, but here I have the ordinary residuals on the y axis. (The y axis value for any point is the ordinary residual of that point.) Any value could be considered a quantile. The x axis represents the corresponding quantile from the standard normal distribution. So if the residuals were normally distributed, we'd expect those values to fall (roughly) in a linear pattern. (There are some technical issues here, as the observed residuals aren't technically iid normal, even if the OLS assumptions are true, but it's a rough approximation.)
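Here's a rough sketch of that construction in Python (the (i - 0.5)/n plotting positions are just one common convention, and the residuals here are simulated, not from the video): ordered residuals on the y axis against standard normal quantiles on the x axis:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from some least-squares fit.
rng = np.random.default_rng(4)
resid = rng.normal(0, 2, 40)

n = len(resid)
# Plotting positions (one common convention) and the corresponding
# quantiles of the standard normal distribution for the x axis.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = stats.norm.ppf(probs)

# y axis: the ordered residuals themselves; x axis: standard normal quantiles.
plt.scatter(theoretical_q, np.sort(resid))
plt.xlabel("Standard normal quantiles")
plt.ylabel("Ordered residuals")
plt.show()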
Nice! The only improvement I would suggest is that you actually name the violated assumptions. I mean, people can draw that conclusion on their own, but it would make things even clearer.
Question: Why do you assume normally distributed errors? From my understanding, in large samples iid errors from any distribution should be sufficient (Central Limit Theorem).
Is it OK in linear regression if the dependent and independent variables are not normally distributed? If not, what would be the optimum solution for negative skew and negative kurtosis?
Prof, Could you do some videos on multiple linear regression? Hope you come back soon with new vids!
Great video! Thank you!
Hello, thank you for this material.
Could you please elaborate on why we do not plot residuals vs observed values? The notion that they are "related" seems a bit vague to me. After all, residuals and fitted values are "related" as well. I would very much appreciate some clarification on what you meant.
As far as my understanding goes, we could use the observed, predicted, or X values in a plot to detect heteroscedasticity. My logic goes as follows: consider the model you presented, Y = β + αX + ε, and a simple case of heteroscedasticity where ε ~ N(0, σ1^2) for the lower half of X values and ε ~ N(0, σ2^2) for the upper half of X values. The lower half and upper half of X values correspond respectively to the lower half and upper half of the Y values. Thus plotting residuals vs X or vs Y will result, in both cases, in the lower half of the plot displaying variance σ1^2 and the upper half displaying σ2^2. So we would succeed in identifying heteroscedasticity in both cases.
The residuals are positively correlated with the Y values, but not correlated with the Y hat values. So a plot of the residuals against Y will show an increasing trend. If the model assumptions are correct, then a plot of the residuals against the fitted will tend to show a random scattering.
"The lower half and upper half or X values corresponds respectively to the lower half and upper half of the Y values."
We can't say this, since the Y values are random variables. It might correspond *roughly* to that, depending on the specifics and randomness, but the Y values are random variables and could take on any value.
If we happen to get a Y value that's 4 SD above its theoretical mean, say, then that value will have a large positive residual (with high probability, at least).
You might see certain hints of heteroscedasticity in a plot of residuals vs Y, but it would be hard to tell precisely what the plot means. A plot of the residuals vs the fitted values would give a much clearer picture.
@@jbstatistics Thank you so much for taking the time to answer. I will meditate on your feedback !
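For anyone curious, here is a small simulation sketch (Python with numpy and matplotlib; the data are hypothetical) illustrating the point above: the residuals show no trend against the fitted values, but a clear increasing trend against the observed Y values:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical simulation: residuals are correlated with the observed y,
# but not with the fitted values.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = 3 + 1.5 * x + rng.normal(0, 2, 300)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
resid = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid)           # random scatter, no trend
axes[0].set_xlabel("Fitted values")
axes[1].scatter(y, resid)                # increasing trend
axes[1].set_xlabel("Observed y")
for ax in axes:
    ax.axhline(0, color="grey")
    ax.set_ylabel("Residual")
plt.tight_layout()
plt.show()

print(np.corrcoef(fitted, resid)[0, 1])  # approximately 0
print(np.corrcoef(y, resid)[0, 1])       # clearly positive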
thank you for this amazing video!!!!!!
You're very welcome!
This is gold!
nice and clear
thanks a lot, you're a king
Hi Sir. May I know what statistical tests/treatments are used on residual plots to confirm what is allowed and what is not? Thank you for your help.
Hi, I have a question: let's say I built a model and the R^2 value came out to 70%.
How do I make sure that is the maximum variance I can explain, by looking at the residuals?
Please could you share a link to your video on how to correct the unequal variance problem shown in the plots? Thanks in advance.
At 4:40, why did you say the residuals are small, then big, then small...
Don't you mean they're negative, then positive... since their magnitude is the same?
I think it's because "ε" is a random variable (as he mentioned in previous videos), and it should stay that way. If they appear up for a stretch of time and then a bunch of them appear down, that randomness breaks up, since when a whole bunch of them are up you can foresee they'll be down next time (then where's the randomness?).
I think that if they're all up by the same amount that they're down (as you see at 4:40), would they still have a normal distribution? No, it would just be a straight-line probability distribution, in which you know that the moment you're up, the next will be down, and so on.
This model assumes ε follows a normal distribution, which is reasonable, since in real life many events occur this way.
If they are jumping up and down in clusters then we're no longer dealing with this reasonable distribution. But of course, in the end he'd have to deal somehow with this time effect he didn't know beforehand was causing this, maybe so as to normalize the residuals, as they should be to fit the model 🤔. I don't know yet how he tackles this problem. If I find out about it I'll tell you.
Hope this reply was helpful.
Best regards.
Nice video. Thank you.
But it is just plotted as the random part vs the independent variable (x). What if we have multiple independent variables (say z, t, w, etc.)? Do we need to check each of those separately, expecting the same variance again regardless of the independent variable (random vs z, random vs t, etc.)? Or is it OK to just plot the random part vs the predicted y?
I have one doubt: all of these videos in the playlist are about simple linear regression. So are these assumptions also true for linear regression, multiple regression, and polynomial regression? And all of this theory of finding confidence intervals and hypothesis tests at the end to determine whether coefficients are statistically significant or not, do these methods also apply to other kinds of linear regression?
The general idea still holds, yes. The specific formulas for the standard errors, degrees of freedom, etc., will change when there is more than one predictor. And there are many subtleties when it comes to multiple regression, so it's best to learn all about MLR rather than think something like "well, it's just like simple linear regression but with more predictors." That said, yes, the general ideas port over from simple linear regression to multiple linear regression in a natural way. Polynomial regression is a type of multiple regression, so same idea there.
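As a rough illustration of how the residuals-vs-fitted check carries over to multiple regression, here is a sketch in Python with statsmodels (the two-predictor data are made up):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data with two predictors, to show that the same
# residuals-vs-fitted check carries over to multiple regression.
rng = np.random.default_rng(6)
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(0, 5, 100)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 1, 100)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Residuals vs fitted values (and, optionally, vs each predictor).
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residual")
plt.show()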
Thanks! Helped a lot!
You are welcome!
Thank you very much.
If I can't find out if the variance is constant from the plot what should I do?
So helpful! Thanks.
think you guys should get more views... maybe there are not enough stats students in the country
my graph is blank, what does that mean?
In the many examples of the plots, what does the X axis denote? I'm guessing the y axis is e_i = y - y-hat?
The axes are labelled. It's not a generic "X"; it's the X from the linear regression model that relates Y to X. As given on the plots, the Y axis represents the value of the residual. And yes, a residual is y - y hat as discussed in the video.
@@jbstatistics Thank you! Your videos are awesome :).
Great video. How are a plot of e vs time and a plot of e(t) vs e(t-1) different?
Thank you!
You are very welcome!
@jbstatistics, thank you so much for helping me understand these plots! You are the best teacher:) I give you a big check mark for this video too. awesome explanation!
Please help: Do the residuals have a unit or are they unitless???
The residuals are the differences between the observed values of Y and the predicted values of Y. The units of both the observed and predicted values of Y are just the units of Y, and thus the units of the residuals are the units of Y.
@@jbstatistics Thanks a lot! Really helpful!!
Would it be relevant to make residual plots if I want to check a categorical variable in a linear regression?
Yes, some types of residual plots are still informative for categorical explanatory variables. With a categorical variable, a check for linearity is not required, but residual plots can still help to check the normality and common variance assumptions.
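A small sketch of that idea (Python with statsmodels and matplotlib; the three-group data are invented): with a categorical predictor, side-by-side residual plots for each group can be used to eyeball the common variance and normality assumptions:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data: one categorical predictor with three groups.
rng = np.random.default_rng(7)
groups = np.repeat(["A", "B", "C"], 30)
means = {"A": 10.0, "B": 12.0, "C": 9.0}
y = np.array([means[g] for g in groups]) + rng.normal(0, 1, len(groups))

# Dummy-code the groups and fit by least squares.
dummies = np.column_stack([(groups == "B").astype(float),
                           (groups == "C").astype(float)])
fit = sm.OLS(y, sm.add_constant(dummies)).fit()

# Side-by-side residuals for each group: look for similar spread
# (common variance) and roughly symmetric, bell-shaped scatter (normality).
resid_by_group = [fit.resid[groups == g] for g in ["A", "B", "C"]]
plt.boxplot(resid_by_group)
plt.xticks([1, 2, 3], ["A", "B", "C"])
plt.ylabel("Residual")
plt.show()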
Brill video! If residuals appear to show an inverted U, how can I improve the model?
Saved my day.
Thank you
How do we check independence in the residual plot?
brilliant
Thanks!
But why are we doing this? Please explain.
Damn, awesome video.
No one people like data analytics
Very helpful
Is there a way to t test the residual plot?
What kind of test are you hoping to do? There isn't going to be an overall increasing or decreasing trend in the residuals in simple linear regression. There may be curvature, and we could test to see whether adding higher order terms (e.g. X^2) results in a significantly improved fit. Cheers.
Can you use the t statistic to test H_0: E(ê_i) = 0 vs. H_a: E(ê_i) ≠ 0? (E being the mean and ê being the error.)
Preston C No, we can't test that. The (observed) residuals always sum to 0 in simple linear regression. When we say the expectation of epsilon is 0 (at every X), we are in effect saying that E(Y|X) falls on the line beta_0 + beta_1X. Conceptually, we could have a different model where the expectation of epsilon was assumed to be 2 instead of 0. This would change very little, except that beta_0 in this model would be 2 less than beta_0 in our usual model. This would unnecessarily complicate things, so we define epsilon to be a random variable with a mean of 0. Cheers.
But if we tested e_i = 0 vs. not equal to 0 and rejected the null hypothesis that e_i = 0, wouldn't that indicate that the residuals did not sum to zero and that our previous assumptions were false?
Preston C The observed residuals sum to 0. That is not an assumption, it is a consequence of the least squares fit. If we attempted to test the null hypothesis that the true mean residual is 0 with a t test, we would end up with a test statistic of 0 and a p-value of 1. So that wouldn't really be a test. If you're wondering about testing the null hypothesis that E(epsilon) = 0 at *any given value of X*, that's a bit of a different story. We do something along those lines when we carry out a lack-of-fit test. (This tests the null hypothesis that the means do indeed fall on a line. We can do this sort of thing when we have multiple observations at at least some of the X's.)
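A sketch of the kind of higher-order-term test mentioned a few replies above (Python with statsmodels; the curved data are simulated and hypothetical): fit the model with and without an x^2 term and compare the two fits with an F-test:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical curved data: does adding an x^2 term significantly
# improve the fit over the straight-line model?
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 80)
y = 2 + 1.0 * x - 0.3 * x**2 + rng.normal(0, 1, 80)
df = pd.DataFrame({"x": x, "y": y})

linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()

# F-test comparing the nested models; a small p-value suggests the
# curvature seen in the residual plot is real.
print(anova_lm(linear, quadratic))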
I'm not sure how to get the Q-Q plot... can anyone explain?
It's almost always created using software. My intro to normal QQ plots is found here: ruclips.net/video/X9_ISJ0YpGw/видео.html
Thanks a lot!! Much appreciated
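For reference, most software will produce the plot in a single call; for example, a sketch using statsmodels' qqplot (with made-up residuals, just for illustration):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical residuals; in practice use the residuals from your fit.
rng = np.random.default_rng(9)
resid = rng.normal(0, 1, 50)

# Normal Q-Q plot with a reference line fitted through the quartiles.
sm.qqplot(resid, line="q")
plt.show()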
E X C E L L E N T
Thanks a lot
Probably a simple model for college students, not high school.
The "simple" in simple linear regression refers to there being only one predictor (one x), and not because it's simple or easy. It's just the well-established name of the model. Unlike many others, I don't use any clickbait words like "easy" or "simple".
@@jbstatistics Oh, I understand. Thanks.
A GOOD GUIDE program
Nothing he is explaining makes any sense.
Thank you for this useful video !