Linear Regression in R, Step-by-Step
- Published: Jul 24, 2017
- This video, which walks you through a simple regression in R, is a companion to the StatQuest on Linear Regression • Linear Regression, Cle...
If you want to just copy and paste the R code, you can get it from the StatQuest GitHub site: github.com/StatQuest/linear_r...
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
YouTube Membership: / @statquest
...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
#statquest #regression
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
3 years later, still saving lost souls! Intros are amazing as usual :D
Bam! :)
Hooray!
Came for the intro, Stayed for the infos
That is an awesome rhyme! I love it. :)
The Intro made me subscribe
I know I have written a few comments on your videos,,, but ,,, Bam,,,, they are good.
I am doing a uni degree in statistics and having trouble understanding the books and lectures, these videos are really helping me out - THANK YOU!!!!!!!
What a clear and awesome explanation! Thank you.
I think this is the best explanation I've ever seen of linear regression output, and in just 5 minutes.
I'm really impressed. Thanks a lot for your great videos. Please, keep up the great work.
Wow, thanks!
I have been struggling to understand simple regression and other statistics concepts. While reading books and searching other YouTube channels, I came across your video. I must say I stopped searching because your videos are so great and explain things in simple terms that both beginners and advanced students can understand. Thank you so much.
Hooray! :)
I love this video!! explained everything super clearly and quickly, thanks so much!
Thank you very much! :)
Another great video, Josh! Can't wait for logistic and multiple linear/non-linear regression videos, hopefully with an explanation of the coding in R if needed.
I was reading about DESeq2 and edgeR, which deal with design matrices; that's why I jumped onto your videos about GLMs. I'll be waiting eagerly for that video. Thanks again.
Like StatQuest!! More like love StatQuest!! Thanks for all the work...
Hooray! Thank you very much! :)
Good, Josh, I am listening to your songs on your website. I liked specially Saturday and I Try My Best. Also we people from Brazil are learning a lot from you, you and some youtubers are like the neoEnlightment. God bless!
Awesome! Thank you!
Thank you for explaining this in a simple way!
BAM! :)
Nice video again.., amazing Josh..., keep pushing your limits
Thank you! :)
You are precious!!! Thank for the brilliant explanation!!!
Thank you! 😃
This is the best explanation of regression I have ever seen.
Made me a subscriber on just one vid, Bravo!
Thank you!
OMG such an exciting video. Thank You and have a nice day.
:)
I like StatQuest! Thanks for the video - very helpful and concise
Thank you! :)
Thank you so much.....wid help of your video m able to finish out my assignment. Trust me I ended up watching almost 10 videos...but only got to do with yours ..thank you once again.
Thank you!
😁You got me with the intro song! 😂... Now I'll pay attention
bam!
I started tuning into this channel after i listening to your awesome explanation of gradient descent. Now i am a regular. BAAMMMM!!
BAM!!!! :)
@@statquest haha. Merry Christmas in advance!!
Thank you very much, that was a uniquely clear explanation!
Glad it was helpful!
What a clear explanation; Thank you.
Thanks!
Thanks mate. Really good explanation!
Thank you!!!
Wonderfully explained !
Thank you!
statquest I love you! I will only pass stats bc of you thank you so much guys
good luck! :)
Ur genius, literally saved my life
Thanks!
You are amazing! thank you for doing this!
Thanks!
You saved my day.!!! Thank youuu..!!
Glad it helped!
Thanks for this amazing video Buddy :)
No problem 👍
this channel is the garden of paradise
Thanks!
Do I *like* StatQuest? I love StatQuest. I'm getting all the knowledge for my final project from you and feeling like I am starting to understand statistics a bit. Thank you so much for saving my ass
Hooray! :)
Thank you! This helped a lot.
Glad it helped!
Great video, thank you!
Thanks!
Thank you! A great video!
Glad you liked it!
I admit i'm only subscribed because of the awesome intros.... I'm not even taking my statistics course anymore?!
Going good keep it up
Thanks! :)
THANK YOU
Thanks for all of the videos. They really help me a lot understanding statistics. I was always trying to avoid statistics, but your videos make me stay to learn.
Btw, can I request GLMM on R please?
Thanks! I'll keep GLMM in R in mind.
Statistics is an easy and enjoyable subject, you just need a good teacher to show you that. Stat quest is amazing at presenting this in an easy and enjoyable way. It’s hard to find good content, but when you do it changes everything.
I don't like StatQuest - I LOVE StatQuest, I really do! Thank you so much Josh!
BAM! :)
you are incredible thank you
Wow, thank you!
You are the best!!!
Thank you very much!!! :)
Thank you!
You're welcome!
Very useful! :) Thank you! Maybe one day you could also do sth on Categorical regression, different types of coding and how to do it in R :)
Like this? ruclips.net/video/CqLGvwi-5Pc/видео.html and ruclips.net/video/Hrr2anyK_5s/видео.html
amazing video! channel is also outstanding, makes learning stats a (little) less difficult haha I would suggest, if you have expertise, to do a series of simples videos in Stata (same topics as the ones you have specifically for R)… thanks!
I'll keep that in mind.
Thank you
I like StatQuest tuuuuuuu!!! la la la!
BAM! :)
MY FAVORITE
Thanks! :)
I keep coming back here
Bam! :)
Hey Josh! Good job again. I think the addition of an applied/ how to companion video to the one presenting the theory looks really good. It is pretty much what I had in mind when I mentioned it to you. You really managed to keep the "how to" video very short but in the same time to add enough detail so that one can directly apply the theory to a dataset at hand. Me likes it ;)
May I suggest an add-on video about doing multiple linear regression? As other topic, multivariate analysis with R companions. Maybe adding R companions to the topics you have already covered theoretically: PCA, heatmaps, clustering etc?
Thanks a lot for your work. Maybe you could get compensated for it? Why not add the option to those who appreciate your work?
I just realized that you actually have a website with more elaborate examples of R tutorials/applications that you don't advertise enough. If I were you, I would include discreetly the url on each of my slides. Like a header or footer...
Re funding, from what I hear, youtube is not a moneymaker anymore due to changes in policies. I noticed that people with youtube channels have an associated website on which they accept donations or subscriptions via Paypal. Here's an example: freedomainradio.com/donate/
That should be pretty easy to setup if you have a paypal account.
Content producers also use Patreon. www.patreon.com/ Get yourself rich(er)!
I totally agree with Ed. Patreon is a good platform for contribution that a lot of YouTubers use. This channel is going to get bigger as videos like these keep coming. There is a lack of YouTube videos that explain statistical concepts with such clarity, and a lack of videos that link them to practical applications.
Joshua Starmer
Hey Josh,
As another possible platform for your content and for more exposure (the academish kind), you may want to check Udemy. I took a bunch of courses from there and they were pretty useful. Not sure how much money u can make because the courses seem to be discounted quite a bit, but you'll definitely have a more "formal" audience and you could put in your resume that u taught a course online.
I love stats quest
Thanks!
Hi Josh, It would be great if you can develop/upload the videos for linear, logistic, and other regressions (which you already have explained) with STATA as well.
And yes, all your videos are super BAM!! please keep them coming.
Compared to R and Python, STATA is very expensive. If you can convince them to give me a free copy, I'll consider making videos for it.
@@statquest I was of this impression that a lot of universities have STATA available for the student and staff. Although I can't convince them for free licenses 🙂, one thing is there that STATA is easy to learn for users with no/little experience in programming (like me), and videos on STATA will be quite useful and quick to learn. 🙏
But again it should be at your convenience.
@@ambrishsingh8326 Unfortunately I no longer work at a university. I'm doing StatQuest full time, so I would have to pay for it.
@@statquest Frankly speaking, I have learned as much (if not more) from you as I have from any of my professors. Thanks a lot. 🙌
2:47 The p-values of the estimated parameter values (i.e. slope and intercept) are obtained from t-tests against the null hypothesis (that the parameter is zero).
yep
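A minimal sketch of that t-test in R, for anyone who wants to see where the numbers in the Pr(>|t|) column come from. The data values are an assumption (they are meant to match the mouse example in the linked GitHub script; substitute your own), and the variable names t.values and p.values are made up for illustration.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.regression <- lm(size ~ weight, data = mouse.data)

# The coefficient table that summary() prints.
coefs <- summary(mouse.regression)$coefficients

# t-value = estimate / standard error, i.e. the distance from 0
# measured in standard errors.
t.values <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from a t-distribution with the residual
# degrees of freedom.
p.values <- 2 * pt(-abs(t.values), df = df.residual(mouse.regression))

print(cbind(t.values, p.values))
```

The p.values computed by hand should agree with the Pr(>|t|) column of the summary() output.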
my master degree got saved by this video
bam!
very good explanation. there is not such a good video like this in german
Thank you! :)
1:11 example
1:56 summary
4:37 add abline
bam!
I LIKE STAT QUEST! IF YOU DO NOT LIKE STAT QUEST, YOU GET OUT HERE!
Dang! :)
Hi Josh! Another great video as always! I just want to ask what if the P Value in my lm function shows p-value:
That means that your p-value is very close to 0.
I am looking for the linear regression section where you discuss categorical or discrete predictors? (love your videos, btw!)
If you are interested in discrete predictors, then you are interested in logistic regression, not linear regression. To learn more about logistic regression see: ruclips.net/p/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
@@statquest Thankyou! I'm looking now!
Hi Josh, massive fan of the content. I had a question: Why are the R-Sq and Adj R-Sq different in this example (since we only have one independent variable)? Appreciate your response!
Because the formula for the adjusted R^2 penalizes for any number of variables in the equation.
@@statquest But in this video, you only have one variable (weight) to predict size, so it shouldn't make any difference, right?
@@litfox5951 Just google the formula for adjusted R-squared and you'll see that any number of variables > 0 will affect the adjusted r-squared value.
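For reference, here is a small sketch of that calculation in R. It assumes the commonly used formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors (not counting the intercept). The data values are assumed to match the mouse example from the linked GitHub script, and the names n, p and adj.r.squared are just for illustration.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.regression <- lm(size ~ weight, data = mouse.data)
fit.summary <- summary(mouse.regression)

n <- nrow(mouse.data)                    # number of observations
p <- length(coef(mouse.regression)) - 1  # predictors, not counting the intercept

r.squared <- fit.summary$r.squared

# Even with p = 1 the ratio (n - 1) / (n - p - 1) is greater than 1,
# so the adjusted value ends up below the plain R-squared.
adj.r.squared <- 1 - (1 - r.squared) * (n - 1) / (n - p - 1)

print(c(r.squared = r.squared, adj.r.squared = adj.r.squared))
```

With a single predictor the penalty is small but not zero, which is why the two values in the summary() output differ.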
At 2:54 you say we test whether the estimates for the intercept and the slope are 0 or not. I think our null hypothesis for the slope is that it is equal to 0 i.e. no (linear) dependence on X. But the null hypothesis for intercept should be that it is equal to the mean of Y's; since we also assess the relative variance explained against our simplest model that regardless of the value of X, the Y's are equal to the sample mean of Y. Correct me if I am wrong.
While it is true that we are comparing the linear regression model to the model that consists of just the mean of Y values (the size of the mice), that is not what's going on when we assess the significance of the individual parameters. When we assess the significance of the individual parameters in the model with both a slope and an intercept, rather than comparing the slope intercept model to just an intercept model, we simply compare the model to one where the parameter is set to zero. For example, for the intercept, we are comparing the fit of the "full model", where we let both the intercept and the slope be optimized to the data, to a model where we force the intercept to be 0 and allow the slope to be optimized to the data, given the constraint that the line must go through the origin. When we test the significance of the value for the slope, we compare the full model to one where we force the slope to be 0 and only allow the intercept to be optimized to the data, given the constraint that the line must be horizontal. Thus, this second test, testing if the slope is zero, is the same as testing whether the full model is better than just using the mean Y value, and you'll notice that the p-value for the t-test for the slope parameter and the p-value for the F-test for comparing the models (at 4:24) are the same - both are 0.0126.
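The model comparisons described in that reply can be sketched in R with anova(), which runs an F-test between two nested models. This is only an illustration under assumptions: the data values are assumed to match the mouse example from the linked GitHub script, and the names full.model, no.intercept and mean.only are made up here.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

full.model   <- lm(size ~ weight, data = mouse.data)      # intercept and slope
no.intercept <- lm(size ~ 0 + weight, data = mouse.data)  # line forced through the origin
mean.only    <- lm(size ~ 1, data = mouse.data)           # horizontal line at mean(size)

# Each comparison drops exactly one parameter from the full model.
intercept.test <- anova(no.intercept, full.model)
slope.test     <- anova(mean.only, full.model)

# p-values from the two F-tests (row 2 holds the comparison).
p.intercept <- intercept.test[["Pr(>F)"]][2]
p.slope     <- slope.test[["Pr(>F)"]][2]

# p-values from the t-tests in the coefficient table, for comparison.
t.test.p <- summary(full.model)$coefficients[, "Pr(>|t|)"]

print(c(p.intercept = p.intercept, p.slope = p.slope))
print(t.test.p)
```

Because each comparison removes a single parameter, the F-test p-values should match the corresponding t-test p-values in the summary() output.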
@@statquest Beautifully explained. I GET IT! THANKS!
@@abhishekbhatia6092 Hooray! :)
Thank you so much for asking this question. And
@Josh for the detailed and clear answer! Do you have any videos on how the Standard Errors for the t-tests are estimated in this case by any chance?
@@daltakid I don't, and it would take a whole new StatQuest video to explain how it all works. In the meantime, I found this webpage to be very useful: www.chem.utoronto.ca/coursenotes/analsci/stats/ErrRegr.html
Hello Josh, thanks for the great videos! I'm a fan!
I was wondering if you could explain
1. The difference between Pr(>|t|) for intercept and weight and p-value at the bottom.
2. What's F? You said "the square root of the denominator in the equation for F."
Thank you!
1) The p-values for the intercept and weight compare the estimated parameter values to 0. The p-value at the bottom is the same as the p-value for weight (comparing a model where the parameter for weight = 0 to a model where the parameter for weight is the least squares estimate).
2) For details on F (and a lot more insight into the answer to question #1), see: ruclips.net/video/nk2CQITm_eo/видео.html ruclips.net/video/zITIFTsivN8/видео.html and ruclips.net/video/hokALdIst8k/видео.html
@@statquest Hello! Thanks for your awesome work :) In a regression with multiple explanatory variables, what p-value should we look at ? Does the R² p-value give us a sense of the impact of all the variables on the explained variable and the specific p-value tell us the specific impact of one explanatory variable ? Thank you !
@@anastasiachery5874 The answers to your questions are in this video: ruclips.net/video/zITIFTsivN8/видео.html
I'm waiting for a heavy metal style intro
I think the closest I get is: ruclips.net/video/azXCzI57Yfc/видео.html
n i c e
Great video, what is a good reason not to standardize parameters estimates?
If we leave the parameters in their raw form they can be much easier to interpret in terms of changes in the underlying variables.
Hi. I have 2 questions:
- Should we used Multiple R-squared or Adjusted R-squared in our conclusion?
- What does it mean 1 and 7 DF specifically? Is 1 the numerator and 7 is the denominator?
Thanks!
I answer these questions in my video on linear regression: ruclips.net/video/nk2CQITm_eo/видео.html
you are God
:)
What kind of a god you’re sir!?
:)
For example, can you redo this example out of data, from say the ALL package which contains acute leukemia information. Maybe show genes against any pData type such as age, remission, biology (mol.biol)
Noted
OMG I HAVE FALLEN AGAIN THIS CHANNEL
bam! :)
I am wondering how the std error is calculated for the slope (or the intercept), which i reckon is the standard deviation of the sampling distribution of the slope (or the intercept) by definition. It does not seem to appear in any of your other videos. BTW, I study CS but i have to do data science internship every summer, so I rewatch your video every year for 3 years😂
That would require a whole new video and I'll keep that in mind.
why doesn't the intercept value found in the summary correlate with the y-intercept seen in the plot? I must be missing something. Thanks for the video!
The x-axis in the plot does not go all the way to 0, so that could be throwing you off. If the plot had an x-axis that went all the way to 0, then we would see the line cross the y-axis at 0.58.
Thank you for the video, but for example like a regression formula yi= b0+ b1Xi+ the degree of deviation, so where can I find the degree of deviation in R, is that the same as standard errors?
Unfortunately I'm not familiar with the term "degree of deviation" so I can't tell you what it means.
is that same with partial least square (PLS)?
I have some small doubts. How are the degrees of freedom 7? and how do we assume the significance of p-value. For example, in hypothesis testing, we reject the null hypothesis if p-value < 0.05 (significance level) but on the other hand.. we want it to be less here in this example? Great video as always though :D
The degrees of freedom come from the equation for F. The number of observations - the number of parameters etc. As for your second question, I'm not sure I understand it.
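As a rough check of where the "1 and 7 DF" come from, here is a short R sketch. It assumes the nine-mouse example data from the linked GitHub script; the names n and n.parameters are made up for illustration.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.regression <- lm(size ~ weight, data = mouse.data)

n            <- nrow(mouse.data)               # 9 observations
n.parameters <- length(coef(mouse.regression)) # intercept + slope = 2

# Denominator degrees of freedom: observations minus parameters in the fit.
print(n - n.parameters)

# Numerator degrees of freedom: parameters in the fit minus the
# one parameter (the mean) in the model it is compared against.
print(n.parameters - 1)

# summary() stores the F statistic with both degrees of freedom.
f.stat <- summary(mouse.regression)$fstatistic
print(f.stat)
```

With nine mice and two parameters, this gives 1 degree of freedom for the numerator and 7 for the denominator.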
@@statquest Oh thank you so much for answering. I was confused about the 2nd question but I actually realised it so no issues. Thank you again.
Good explanation. Just noted that Rsq and adj Rsq are different although there is only one independent variable. Isn’t it counter-intuitive?
Also, can you explain the difference between Rsq (not the adj Rsq) and multiple Rsq ?
I explain that there is no difference between R-squared and Multiple R-squared at 3:57. And yes, the equation for the adjusted R-squared penalizes for every parameter except for the intercept.
@@statquest Thanks for prompt response. Since there is only 1 independent variable, shouldn't the R-sq and Adj R-sq be same?
@@dhruvdesai3728 You would think, but that's not how adjusted R-squared is defined. The adjusted R-squared penalizes for every parameter except for the intercept.
@@statquest Ok , got it now. Great videos by the way. Keep it up!
Which p-value we will consider for interpreting statistical significance?
The individual parameters have p-values (see: 3:04) and the entire model also has a p-value (see: 4:29)
nice intro
Thanks!
Hi, Josh, I have a question about the distribution of the residuals. If we expect them to be symmetrically distributed, why should the mean and the max be the same distance from zero? I think the mean should be close to zero, which means that the data fits the line, but the max is the extreme value, so why should it be the same distance from zero as the mean? Thanks!
What time point in the video, minutes and seconds, are you asking about?
@@statquest at 2:07
@@chunzili6256 When you say "mean", do you mean "min", which is short for "minimum"? If that is the case, then you want them symmetrically distributed because that implies that the straight line is a good model for the data. If they are not symmetrically distributed, then maybe a curved line would be better.
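To look at the residuals directly, something like the following sketch may help. It assumes the nine-mouse example data from the linked GitHub script; the name res is made up for illustration.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.regression <- lm(size ~ weight, data = mouse.data)

# Residuals: observed size minus the size predicted by the line.
res <- residuals(mouse.regression)

# Min, 1st quartile, median, 3rd quartile and max of the residuals,
# the same five numbers shown in the Residuals section of summary().
five.numbers <- quantile(res)
print(five.numbers)

# A quick, informal symmetry check: compare |min| with max,
# and |1st quartile| with the 3rd quartile.
print(c(abs.min = abs(min(res)), max = max(res)))
```

If the absolute value of the minimum is far from the maximum, that is a hint (not a formal test) that a straight line may not be the best model.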
Hi Josh, I have a question.
When i use simple linear regression, the P value of the coefficient of independent variable 1 is very low.
But when i add 1 more independent variable, the P value of the coefficient of independent variable 1 is very high.
Can you explain that problems, thank you!!
I discuss adding multiple independent variables and interpreting the p-values in the StatQuest, Multiple Regression In R: ruclips.net/video/hokALdIst8k/видео.html
Hi Josh, I have a doubt. How this model will be useful in day-to-day real life? I mean, How this model is useful in predicting the size of the given weight? Thanks
It's just an example of the method, not something that is actually used in day-to-day real life.
Hello! I'm wondering why the intercept equals 0.58 but when weight = 0 the line cut the y-axis at the point beyond 1? Is the figure just partly shown?
If you look at the x and y-axes in the figure, you see that it does not show the origin (when x or y = 0). The figure just shows the region where we have data.
I had the same remark, you can actually define the range of the axes when making the xy plot by passing 2 boundaries values vectors to the arguments xlim = and ylim = . plot(mouse.data$weight,mouse.data$size,xlim=c(0,8),ylim = c(0,8)).
Hi, just a question why is the intercept the y value 0.58, but when it is plotted on the graph it is just above 1
When you look at the x-axis on the graph, you'll see that it doesn't go all the way to 0, and this is why it appears like the y-axis intercept (on the graph) is > 1. However, if you replot the graph and include 0 on the x-axis, you'll see that the y-axis intercept is correct. Here's the code to include 0 on the x-axis...
plot(mouse.data$weight, mouse.data$size, xlim=c(0,7), ylim=c(0,7))
abline(mouse.regression, col="blue")
Dear Josh, is there any video on nonlinear regression on your channel?
The closest thing I have to non-linear regression is fitting a Lowess curve: ruclips.net/video/Vf7oJ6z2LCc/видео.html
Could you also upload a video about linear regression in Python, sir?
I'll keep that in mind.
Here is the code
github.com/StatQuest/linear_regression_demo/blob/master/linear_regression_demo.R
That's right. I'm in the process of moving all the code to GitHub. I hope to be done by the end of the day.
I only come for the song
bam!
Do you have anything on interpreting OR and RRs from this output??
What do you mean by "OR and RR from this output?"
@@statquest thanks for replying! I meant odds ratio and risk ratio, but my tutor explained it to me! (have to use 'logit' and 'log' functions and then exp() (if that makes sense) complete stats/R newb here.
It's very confusing when you say 'we want to'. When you said 'we want the min value and the max value to have approximately the same distance from zero', I was assuming you meant that if they aren't the model is not reliable. But then you say 'we want the p-value to be
I'm sorry my phrasing is so confusing. For p-values, people, generally, but not always, will only go through the trouble of collecting data if they suspect that their hypothesis is correct, so people often "want" to find significance, even if they are supposed to be impartial to the result. That being said, the state of the residuals does have bearing on the significance in that if the residuals are highly skewed, then it's a good sign that the results should not be trusted, regardless of the p-value.
Do you know where I can get a dataset that I can download for linear regression? I can't find any viable ones ANYWHERE. They're either longitudinal, split the CDV into categories, unable to be accessed, or no CDV in the whole set. I need to analyse a dataset for linear regression - with a good sample and like 10 variables
Check archive.ics.uci.edu/ml/index.php
hi there, if the regression line is straight horizontal, does it means that the data set is not suitable for the linear regression model?
Correct
@@statquest thank you so much! I'm new to R so am quite confused about how to answer the statistics questions
@@psiko_nini4362 For more information on Regression, see: ruclips.net/video/nk2CQITm_eo/видео.html
@@statquest thank you!
Hello 👋
:)
Hi thank u.
Could u please tell me, how can I do regression through orgin
Just set the intercept to 0.
@@statquest but , how should I code for this in R?
@@munshir.c6161 If you have two variables in your data, x1 and x2, then lm(y~ 0 + x1+ x2, data)
@@statquest thank you :) let me try it
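A minimal sketch of regression through the origin with a single predictor, using the same 0 + syntax as the reply above. The data values are assumed to match the mouse example from the linked GitHub script, and the name origin.fit is made up for illustration.

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

# "0 +" removes the intercept, so the line is forced through (0, 0).
# Writing size ~ weight - 1 is an equivalent way to say the same thing.
origin.fit <- lm(size ~ 0 + weight, data = mouse.data)

# Only one coefficient (the slope) is estimated.
print(coef(origin.fit))

# For a single predictor, the least squares slope through the origin
# is sum(x * y) / sum(x^2).
slope.by.hand <- sum(mouse.data$weight * mouse.data$size) / sum(mouse.data$weight^2)
print(slope.by.hand)
```

Note that R-squared is computed differently for models without an intercept, so it is not directly comparable to the R-squared from a model that includes one.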
Error in model.frame.default(formula = p ~ t, data = pressure, drop.unused.levels = TRUE) :
invalid type (NULL) for variable 't'
help me
What time point, minute and seconds, in the video, are you asking about?
@@statquest it's my question to you when I do I get this error and how to solve this just I asked
@@statquest why this come tell the reason
@@mscit_08_omprakash40 To be clear, you are not asking about the video. Is that correct? Instead, you are asking about some code that you wrote on your own and you have an error?
@@statquest yes
Abline() function is not working, giving error plot.new has not been called yet! Can smbdy help!!
You need to draw a graph before you call abline().
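In other words, the order of the calls matters. A short sketch, assuming the nine-mouse example data from the linked GitHub script:

```r
# Assumed example data: weight and size of nine mice.
mouse.data <- data.frame(
  weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
  size   = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3))

mouse.regression <- lm(size ~ weight, data = mouse.data)

# 1) Draw the scatter plot first. This opens the plotting window.
plot(mouse.data$weight, mouse.data$size)

# 2) Then add the fitted line on top of the existing plot.
#    Calling abline() before plot() gives the
#    "plot.new has not been called yet" error.
abline(mouse.regression, col = "blue")
```

If the error still appears, check that the plot() call itself ran without an error before abline() was called.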
I couldn't put the names , I don't know where the problem is?
:(
Please make more python tutorials
Will do!
python is not that good i feel
RAJAT BHOSALE why do you think that?
nah i just said out of sarcasm bro...nothing to worry python is good compare to R because its coding is lil easy compare to R. Also, it is in boom now.
how can i type "~" in R studio. Anyone can help me?
On my keyboard it's a key in the upper left hand corner.
what if you have a HUGE amount of data?
If you have a HUGE amount of data, then that's awesome!
TIL we do not care about the p-value of the intercept.
True
how can you write the tilde?
It depends on your keyboard. For me, it's the key right below the "esc" key in the upper left hand corner.
@@statquest thank you, i have another crucial question, where do i find some data to analyze?
i would like to use this knowledge in a practice
@@yeahyeah54 Here's a place with all kinds of data: archive.ics.uci.edu/ml/index.php
@@statquest another question, can i make regression between time and another variable?
for example i make the linear regression between quarters and log(revenue) of alphabet, i get a p-value of 2.2*e^(-16) and a R^2 of 0.97, does this mean anything?
@@yeahyeah54 Yes. If the slope is positive, it means that as time passes, alphabet makes more money.
What the hell....!!! Paid🤨
I am really sorry! I don't know what is going on. I've contacted YouTube and have not heard anything back. This is breaking my heart because I never wanted this to happen, but somehow it is. I am sorry and doing everything I can to fix this.