Amazing video! I have covered all of this so many times and wanted a refresher, and this has made it much clearer than all my undergrad and now PhD studies. Thank you!
Thanks for the vid. At minute six, I can see that the intercept value (beta 0) in the summary is a large number (-3.258), completely different from where the line crosses the y axis on the graph. Is there a reason for that? Thank you
Nice question - this is something people often get confused by!
The intercept value is not actually "the value of y where the line crosses the y axis of the plot", as is often taught. Instead it is "the value of y where x is equal to 0"; the two only coincide when the y axis is drawn at x = 0. In this particular model it is purely a mathematical constant needed to plot the line, and should not be interpreted, given that x = 0 is so far outside the range of observed values.
In some cases it would make sense to draw your x axis starting at 0, but in this particular instance it would be totally nonsensical, as the x axis relates to data recorded between 1640 and 1700. Taking our x axis all the way back to 0 would squash all of our data into a very tiny corner of the plot. And extrapolating our model back to 0 AD is even more stupid: in general we should only extrapolate outside the observed range of values with extreme caution. In this case it is pretty clearly a bad move, since it predicts a very large negative number of deaths due to a specific cause. But often extrapolation is not so evidently incorrect and can be a tempting thing to do, only to be proved wrong almost immediately when people want to predict the future, as we see later in the video!
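To make the point about the intercept concrete, here is a minimal sketch in plain Python. The death counts are invented purely for illustration (they are not the data from the video), and `fit_line` is a hypothetical helper written for this example. It fits the same straight line twice: once with the raw year as x, and once with x re-expressed as "years since 1640".

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented numbers, only to mimic an x axis covering 1640 to 1700.
years = [1640, 1650, 1660, 1670, 1680, 1690, 1700]
deaths = [120, 135, 160, 170, 195, 210, 230]

# Raw years: the intercept is the predicted value at year 0,
# which is far outside the observed range.
intercept, slope = fit_line(years, deaths)

# Shifted x: the intercept is now the predicted value in 1640.
years_since_1640 = [year - 1640 for year in years]
intercept_shifted, slope_shifted = fit_line(years_since_1640, deaths)

print("raw years:        intercept =", intercept, "slope =", slope)
print("years since 1640: intercept =", intercept_shifted, "slope =", slope_shifted)
```

The slope is the same in both fits, and only the intercept changes. With the raw years it is a large negative number that means nothing on its own; after shifting x it is simply the fitted value at the start of the observed period.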
Hi Sam, I am curious to hear your professional opinion about how to fit a model that I find challenging. I have measurements of oxygen consumption taken at 4 different temperatures for 520 eggs belonging to 80 species of insects. The goal of my model is to look at how a series of predictors like location, mortality etc. influence the slopes of the lines that describe the change in oxygen consumption across the 4 measurement temperatures. Oxygen consumption intuitively increases with egg mass, measurement temperature and age of the egg. What I do not understand is how to make these slopes the dependent variable. If I first fit a model like oxygen~egg mass+age+measurement temperature and then use the slopes for each species as the dependent variable in a model like slopes~site+mortality+ambient temperature, is that correct? Or is that discouraged because I am doing statistics on statistics? Conversely, if I build only one model like oxygen~egg mass+age+measurement temperature+site+mortality+ambient temperature, is this model actually looking at slopes? Sorry for the long question, but I find it hard to get a reasonable answer by myself. Cheers
Hi there - sounds like a fun challenge! I think the key thing you need to look into more is the concept of interactions. Specifically you are interested in the interactions between temperature and your other variables - there's a nice video here: ruclips.net/video/yJnHmCMb1q4/видео.html (not by me!). It sounds like your main hypothes(e)s are about whether there are interactions between temperature and your other 5 variables in how they relate to oxygen consumption - oxygen consumption is still going to be the dependent variable there.
The other key thing is to understand whether the trends you have approximately follow a straight line within the observed range of values. E.g. it's pretty much universal that with ever-increasing temperature you would not expect a continuously increasing response - at some point that relationship is going to break down. Depending on the shape of the relationship you might need to take logs, or perhaps approach it differently by treating temperature as a categorical rather than a continuous variable. Or indeed it may be that within the range of temperatures you have, a straight line is fine - the best way to assess this is to do lots of exploratory graphical analysis of your data before trying to fit any models - ggplot2 is great for this!
As you only have 4 distinct temperatures I wouldn't want to try to fit any sort of more complex curve, as it would likely be extremely overfitted, even if there was something that looked like a quadratic-type relationship - treating temperature as a categorical variable in that case is likely a safer option. But either way, what you need to be investigating is the size of the interactions between temperature and those other 5 variables.
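As a rough illustration of why an interaction answers the "slopes as the dependent variable" question, here is a small sketch in plain Python. Everything in it is an assumption made for the example: the coefficients, the two-level `site` coding and the helper names are invented, and the data are generated without noise so the arithmetic is easy to follow. In R formula terms the idea corresponds to something like oxygen ~ temperature * site, where the names are again hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented coefficients for a model with a temperature-by-site interaction.
B0, B_TEMP, B_SITE, B_INTERACTION = 1.0, 0.5, 2.0, 0.3


def oxygen(temp, site):
    # site is coded 0 or 1; the temp * site term is the interaction.
    return B0 + B_TEMP * temp + B_SITE * site + B_INTERACTION * temp * site


temps = [10, 15, 20, 25]  # four measurement temperatures, values made up

# Fit a separate straight line of oxygen against temperature for each site.
slopes = {}
for site in (0, 1):
    ys = [oxygen(t, site) for t in temps]
    _, slopes[site] = fit_line(temps, ys)

# The gap between the two temperature slopes is the interaction coefficient.
slope_difference = slopes[1] - slopes[0]
print("temperature slope per site:", slopes)
print("difference in slopes:", slope_difference)
```

The difference between the two temperature slopes recovers the interaction coefficient, which is why a single model with oxygen consumption as the response and temperature interactions already tests whether site (or mortality, and so on) changes the slope. With real, noisy data the estimates would come with standard errors from that one model, rather than from a second analysis run on fitted slopes.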
@Stats4SD I am really grateful for the time you took to explain your rationale about how to approach the matter. Thank you very much! All of this was very useful.
That is so descriptive and makes me understand so much
Thanks Sam! You are a good teacher on statistical modelling in R. This is an amazing video...
Thanks Sam for the simple explanation of modeling. I have learnt a lot
Thank you so much. So many clarifications right now. Very clear explanation. Helped me a lot. Thanks
Amazing