Seriously, with this enthusiasm of yours, you could easily explain any subject in the whole world to me and I would never get bored.
I wish tons of likes and subscriptions to you!
Thank you!
Nobody taught me how to identify control variables, until now. It will be so helpful to my PhD thesis. Thanks from Brazil. God bless you!
Your videos should be given to everyone in high school so we would start having a better society
This is the best video of control variables that I've seen.
Thanks a lot, you saved my day. I couldn't find a channel explaining this in my native language.
This is the best explanation and animation I’ve ever seen for multiple regression and control variables! 🎉🤩
I'm taking a multiple regression course for a data analysis masters, and this video really helped piece things together! Thinking about control variables as variables that cut out the parts of the relationship we don't want to consider in our model is a really useful way of thinking about controls... hopefully I got that right. Thanks!
Nice! I am a technical business bachelor's student and need to integrate control variables into my regression, and I have absolutely no idea how. Nobody ever taught us.
This is the best explanation of control variables I've ever seen. Thank you, HK.
Loved this...Smooth and Straight to the point👏
Excellent contents in the subject matter. It is valuable to build knowledge and skills. I am really thankful for such efforts.
Super!!! I finally learned why clustering can show a positive relationship when in fact there's a negative relationship! Thank you!
6:23 a great explanation, explains a lot of misleading results where a positive relationship turns out to be negative. Thank you!!
Thank you! Very clear and ENERGETIC which is rare in these parts (of youtube)
Superb video, helped me a lot to refresh my understanding in under 10 minutes. Comes in really helpful as I am working on my Bachelors Thesis! Thank you for your work!
Amazing Video on Control Variables. Why to use
Great explanation of the control variables! Thank you, Professor!
Thank you. The concepts are simple enough but my ADHD makes it incredibly difficult to focus. Your video helped.
Excellent video
Your content is awesome
This animation that explains what a control variable does is very helpful!
Thank you so much, I finally understand what a control variable is.
Great explanation!!
Remarkably good explanation.
You explain things so well. Thank you for posting this!
just the explanation I was looking for, thank you!
Thank you for explaining this in a simple way.
Very clear explanations. Thanks
Thank you this was very well explained
It's a very good visualization. Would be perfect to see a numerical illustration of this control element.
I walk through a numerical illustration in chapter 16 of my book, at theeffectbook.net
@@NickHuntingtonKlein thank you very much! Do I get it right that numerically it's similar to multiple regression? It's just a name to denote the fact that the variables have some relationship to each other?
@@mikayilmajidov yes, a regression with control variables in it is inherently a multiple regression.
@@NickHuntingtonKlein thank you! Will subscribe to the channel
A bit late to the video, but this was extremely useful! Million thanks :)
Phenomenal explanation! Thank you for your effort. I have one follow-up question: can we also estimate the effect of two variables by controlling other variables? And do you recommend any book to read more. Thanks!
You can - as long as each of them is separately identified, ie you have the controls for each. There is an issue with doing this via regression since the effects of the two variables "contaminate" each other a bit, but you can avoid this by saturation (or just estimating the two effects separately).
As for a book I will of course recommend my own! Theeffectbook.net
@@NickHuntingtonKlein Thank you Professor!
Absolutely Amazing!
Thank you! Thank you! Thank you!
Hey Nick! Could you answer me a question?
I have a model (OLS) with a key explanatory variable and its effects on my dependent variable, and some (5) control variables. My main explanatory variable is significant, but only two of the control variables are significant; although, the model is itself statistically significant. My objective for the paper is just to tell if some effects caused by my explanatory variable are found, and its direction (positive or negative). Do you recommend keeping all the variables in the regression output table, telling which are not significant, and making clear that it doesn't fully matter for the objective of the paper?
Sorry for this long and broad question. Thanks!
Keep em in! Even imprecisely estimated controls can improve the coef on your variable of interest. Also, generally, you almost never want to make a decision about model building on the basis of a statistical significance test. Model building is a theoretical task, sig tests are sample based
@@NickHuntingtonKlein Thanks! You have really helped me with your videos. Keep going!
well-explained and easy to capture the intuition. Thanks a lot. :D
Thank you so much, you explained very well.
Very clear explanations, thank you very much!
Very helpful, thanks!
PLEASE REPLY ASAP!!
GDP = shadow economy + inflation + government debt + Unemployment
I am looking to investigate the impact of Shadow economy on GDP, and what I just listed is my econometric model.
Would inflation, gov debt and unemployment thus be my control variable?
thank you
Yes
@@NickHuntingtonKlein I love you. thank you
Hello, Nick, could you explain what it means by "conditioning on a set of covariates?" Does it mean the same as controlling for these variables?
Yep, same thing
But sir, in the example used, temperature can also partly explain shorts-wearing... so does the problem of multicollinearity arise when we add temperature to the model?
Yes, that's the idea - you want to take out the part of shorts-wearing that is explained by temperature as well.
Having multiple correlated predictors is not a problem in regression unless the correlation between them is extremely strong (or perfect, in the case of perfect multicollinearity). If that's the case here - if nearly all of shorts-wearing is explained by temperature, then the regression estimates would be high variance and there'd be a multicollinearity problem, yes. But if that's the case, where we have to control for temperature but doing so removes nearly all the variation from shorts-wearing, that means that we simply don't have the variation in the data necessary to identify the effect we want.
@@NickHuntingtonKlein thank you very much sir for this clarification :)
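To put rough numbers on "extremely strong" in the reply above, here is a minimal sketch that computes a variance inflation factor (VIF) for a shorts-wearing-like variable given a temperature-like control. The data are simulated and the two scenarios are assumptions picked for illustration; with a single other predictor, the VIF is simply 1 / (1 - r^2), where r is the correlation between the two predictors.

```python
import numpy as np

def vif(x, z):
    """Variance inflation factor for x given one other predictor z: 1 / (1 - R^2).

    With a single other predictor, R^2 from regressing x on z (with an
    intercept) equals the squared correlation between x and z.
    """
    r = np.corrcoef(x, z)[0, 1]
    return 1.0 / (1.0 - r**2)

rng = np.random.default_rng(1)
n = 10000
z = rng.normal(size=n)                        # hypothetical temperature

# Scenario A (assumed): x is related to z but has plenty of its own variation
x_moderate = 0.8 * z + rng.normal(size=n)     # population correlation about 0.62

# Scenario B (assumed): x is almost entirely determined by z
x_extreme = z + 0.05 * rng.normal(size=n)     # population correlation about 0.999

vif_moderate = vif(x_moderate, z)
vif_extreme = vif(x_extreme, z)

print(vif_moderate, vif_extreme)
```

In the first scenario the variance of the coefficient on `x` is inflated only modestly, while in the second it is inflated by a factor in the hundreds, which is the "we simply don't have the variation in the data" situation described in the reply.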
At the end slide as you add to the scatterplot variable W, you write as Z, also I think it is a little confusing because you start showing the relationship already controlled by Z (or W) instead of showing it in a scatterplot first without control.
Excellent video sir. Quick question. When using control variables, let's say exchange rates from the World Bank database (time series data), do you make the values constant by using one specific value throughout the years, or do you use the time series data as is for the different years?
It would depend on what you were trying to control for - using a single value would control for aggregate differences between countries (sort of like a fixed effects control that doesn't go all the way to control for *all* fixed between-country differences, just exchange rates), but the time series variation would control for both between-country and within-country exchange rate differences over time. I'd imagine in most applications you'd want the full time series.
@@NickHuntingtonKlein thank you for the timely response. For clarity, let's say I am analyzing the impact of international commodity trade on a certain country. Goods exports and goods imports would be my independent variables, and GDP my dependent variable. Would it be okay to use services as a control variable? And do I use the actual time series for services, or do I select a constant value to use throughout all the years?
@@leticiaasiimirwe8822 Yes, services as a control for overall trade level (which would then make the effect on goods more about the proportion of all trade that is goods trade, rather than about the absolute level of goods trade) makes sense. I'd recommend using the actual time series in that case.
@@NickHuntingtonKlein Thank you very much sir.
Great video! had to like and subscribe
I wonder tho, if I pick a control variable like, e.g., gender on a topic like wages... does that mean that I believe that gender has an impact on wages?
Sort of. You're saying that gender is *related to* and *upstream of* both treatment and outcome, or on a back door path. It doesn't necessarily need to have a *direct* effect on wage
And thanks!
@@NickHuntingtonKlein thanks for the answer! very kind of you :)
What is the difference between a moderator and a control variable? Are they the same?
They're not really the same category. A control variable is any variable you adjust for/control for in your statistical model. A moderator is a variable that theoretically affects the relationship between treatment and outcome (for example, a treatment for cervical cancer reduces cancer rates by much more for people with cervixes vs those without).
Moderators can be included in a statistical model as control variables, but also you might include a variable as a control for other reasons, like being on a back door path.
Thank you for the video. I have a few questions. How do we know what covariates to include/ exclude in/from our model? Also, how do we determine how many covariates to include in our model? Do we simply use theoretical knowledge or do we have tests that we can do?
My series on causality, especially on causal diagrams, goes deep on which controls to include, at least if your goal is causal identification. The theory should do most of the work in determining your model. That said, there are tools like LASSO, variance inflation factors, and information criteria to help with model selection when you're thinking of adding/removing variables for statistical reasons instead of theoretical ones
Very helpful!
Hello Nick, great video!
I would like to ask two questions regarding the use of control variables:
1. Shouldn't we worry about multicollinearity since we know in fact that shorts-wearing and temperature are correlated?
2. Can we have a meaningful interpretation of the control variable coefficient as well (temperature) when we know it is correlated with shorts-wearing, or is its use purely to fight the endogeneity problem?
Thank you in advance.
The main point of adding controls is that they *are* correlated - otherwise adding the control doesn't reduce omitted variable bias. Adding uncorrelated controls can improve your model's precision but it doesn't do anything for endogeneity. Multicollinearity is only a problem in terms of variance inflation if the degree of correlation is extremely strong (and if it's so strong, you have to ask whether it's actually a necessary control or just another way of measuring your primary variable).
@@NickHuntingtonKlein Thank you for your reply. So as far as I understand variables being correlated doesn't necessarily mean that one would get highly inflated variance, and if that's the case then the control may be redundant (extremely high correlation).
@@21LeonidasZ correct
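A small simulation can make the reply above concrete. This is only a sketch with made-up numbers: the variable names (`z` for a temperature-like control, `w` for a control unrelated to `x`) and all coefficients are assumptions chosen for illustration, not anything from the video. It compares the coefficient on `x` with and without each control.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Hypothetical data-generating process (all numbers are illustrative assumptions)
z = rng.normal(size=n)                 # control correlated with x, like temperature
w = rng.normal(size=n)                 # control unrelated to x
x = 0.8 * z + rng.normal(size=n)       # x depends on z but not on w
y = 0.5 * x + 1.5 * z + 1.0 * w + rng.normal(size=n)  # true effect of x is 0.5

def coef_on_x(*controls):
    """OLS coefficient on x when y is regressed on a constant, x, and the given controls."""
    design = np.column_stack([np.ones(n), x, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

b_none = coef_on_x()      # no controls: picks up the effect of z through x
b_w_only = coef_on_x(w)   # uncorrelated control: does not remove that bias
b_z = coef_on_x(z)        # correlated control: recovers something close to 0.5

print(b_none, b_w_only, b_z)
```

Under these assumptions, leaving out `z` pushes the estimate well above 0.5, adding only `w` leaves it about where it was, and adding `z` brings it back near the true value. Adding `w` would still shrink the standard errors, which is the precision point made in the reply.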
Man, you’re fucking good explaining this! Thanks a lot
So if I was looking to see if a person with a higher IQ earns a higher wage, what would be 2 good control variables to use out of the following? Education, experience, tenure, age, married or not, number of siblings, birth order, father's education, mother's education, or average weekly hours?
Please answer as soon as possible I’ve been trying to figure out what would be best to use haha!
why two?
if that's a homework question or something it's not very well done, i don't think there is a single right answer
@@NickHuntingtonKlein we are required to create a research paper and I have this data, so my teacher wants two control variables to be implemented on the right side of the equation. I came up with the "does IQ affect wage" part because I thought it would be interesting to see the results. Do you think it's fine?
I see. In that case I'd probably say father's and mother's education are the best two. They are both proxies for your parents' socioeconomic standing (which affects your job opportunities and thus wages) and also your genetic endowment (which affects your IQ). So they're on back doors you'd want to close. The rest either affect only wages and not IQ (like hours, age, and experience), which are OK to include as controls to improve precision but don't solve any identification problems, or are mixes of things that both cause and are caused by IQ (like education and marriage) and so have collider bias issues. I certainly don't think you can identify the effect of IQ on wages using only parental education as controls, but for the assignment you have that's what makes the most sense to go with. See my chapter on back door paths theeffectbook.net/ch-CausalPaths.html @@fernplayz6369
@@NickHuntingtonKlein is there any way to live chat? Maybe I should try a different research question with my data?
what the hell happens at 0:50?
That's a fly
you're cool
If I can give more than one like I will do it ❤
ah ha pandemic haircut , I caught you!
It was 666 likes, sorry I ruined it 😉
Thank you for the clear explanation and visual illustration! I've been confused by "what is controlling" for a long time. Thank you! So the 2-subtractions process is automatically done when we are doing OLS?
You're welcome! And the 2 subtractions process isn't *actually done* by OLS but it produces the exact same result with the same interpretation
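The equivalence mentioned in this reply is usually called the Frisch-Waugh-Lovell theorem, and it can be checked numerically. The sketch below uses simulated data with made-up coefficients (the names `x`, `y`, `z` and the helper `ols` are assumptions for illustration): it runs one multiple regression, then does the two subtractions by hand and regresses the leftovers on each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: z (temperature) drives both x (shorts-wearing) and y (outcome)
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.5 * x + 1.5 * z + rng.normal(size=n)

def ols(design, target):
    """Least-squares coefficients for target regressed on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

const = np.ones(n)

# Route 1: a single multiple regression of y on a constant, x, and z
beta_multiple = ols(np.column_stack([const, x, z]), y)[1]

# Route 2: the "two subtractions"
Z = np.column_stack([const, z])
x_resid = x - Z @ ols(Z, x)   # part of x not explained by z
y_resid = y - Z @ ols(Z, y)   # part of y not explained by z
beta_two_step = ols(x_resid[:, None], y_resid)[0]

print(beta_multiple, beta_two_step)
```

The two coefficients agree up to floating-point error, so OLS does not literally perform the subtractions, but the number it reports for `x` is the same one the two-step procedure gives.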