Correction for a mistake made at 23:45. I stated that "For every 10 years older, the odds of death increase by 43% while controlling for all other predictors in the model." That statement is incorrect because I took the percent change in the odds of death per one-year increase in age (4.3%) and multiplied it by 10 to get 43%. This was the incorrect calculation: 10 * [(exp(.042)-1)*100]. I needed to instead multiply the log-odds coefficient by 10 prior to exponentiating. This is the correct calculation: (exp(.042*10)-1)*100 = 53%. Therefore, for every 10 years older, the odds of death increase by 53% while controlling for all other predictors in the model. Thank you Utku Pamuksuz for spotting that.
Here is the code:
# Dataset of patients with heart failure
# find and load dataset downloaded from
# www.kaggle.com/andrewmvd/heart-failure-clinical-data
heart <- read.csv("heart_failure_clinical_records_dataset.csv") # assumes the default Kaggle file name
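The corrected scaling can be checked numerically. A minimal sketch in Python (the .042 coefficient is the rounded value from the model; with it, the exact result is about 52%, and the 53% quoted above presumably comes from the unrounded estimate):

```python
import math

b_age = 0.042  # rounded log-odds coefficient for age from the model

# Per-year percent change in the odds of death
per_year = (math.exp(b_age) - 1) * 100         # about 4.3%

# WRONG: multiply the per-year percent change by 10
wrong_10yr = 10 * (math.exp(b_age) - 1) * 100  # about 42.9%

# RIGHT: scale the log-odds coefficient by 10 first, then exponentiate
right_10yr = (math.exp(b_age * 10) - 1) * 100  # about 52.2%
```

The key point is that log-odds coefficients are additive, so it is the coefficient that gets scaled before exponentiating, not the percent change after.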
Very logical and lucid explanation. Thank you very much.
Thanks so much, I am glad you liked it.
Very helpful, especially the logistic regression section. Thank you.
Best half an hour invested!
Those are some very kind words. It is very much appreciated.
Amazing tutorial. Thank You.
Thank you so much for the kind words, I am glad you found it useful.
Excellent explanation! It is exactly what I needed! It will help me to finish my certification project. Subscribed.
Thank you so much I am glad you found it useful.
nicely explained.
Thank you for the compliment.
For how many features does logistic regression work well? I have over 300 features; does logistic regression work, or is another model suggested? Thank you.
I do not see much of a limit; it is just that your run time will be longer the larger the number of features you have. You may want to consider reviewing your data for like features, i.e., is there a cluster of features in your dataset that all provide the same information?
Thanks for this video! Can I use this same code for running a stepwise linear regression?
Yes, you can use the stepAIC() function for linear regression models lm() as well.
@@statsguidetree great! thanks for the reply!
Please, how do you do stepwise selection with glmer? Is there a package?
Thank you for this great video. Could you please share the code and data?
Great video. However, I need to verify a statement. I am not sure we can say "for every 10 years older the odds of death increases by 43%". The sigmoid function is not linear; we can't simply multiply 4.3 by 10. It depends on the X value. A 1-unit increment (in our case, one year older) will equal beta times mu(1-mu) in terms of the change in estimated probability at that specific x point.
That is correct, I just noticed the mistake. For that part, I used a calculation of 10 * [(exp(.042)-1)*100]; it should have been (exp(.042*10)-1)*100. The answer should be about 53%, not 43%. I will add the correction to the description and R code.
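To illustrate the commenter's point about the probability scale: the change in estimated probability for a one-unit increase is approximately beta * mu * (1 - mu), so it varies along the curve. A small Python sketch (the 0.042 coefficient is the rounded age coefficient; the linear-predictor values are made up for illustration):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1 / (1 + math.exp(-z))

b = 0.042  # rounded log-odds coefficient for age

# Approximate change in probability for +1 year at different points
# on the curve: d(mu)/dx = b * mu * (1 - mu)
for z in (-2.0, 0.0, 2.0):
    mu = sigmoid(z)
    marginal = b * mu * (1 - mu)
    print(f"at mu={mu:.3f}, +1 year changes probability by ~{marginal:.4f}")
```

The odds-ratio interpretation (a constant percent change in the odds) holds everywhere; it is the change in probability that depends on the x value.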
Can you help me interpret this interaction term?
logit(DEATH_EVENT)=−1.698+0.0385×age+0.8267×serum_creatinine−0.0006520×ejection_fraction×time
Generally, the interaction term would be interpreted as: the effect ejection fraction has on death is conditional on the values of time, controlling for the other variables in the model. When you include interactions, it is often a good idea to also include the main effect of each variable in the model. In addition, to make interpretation easier, you can center each variable before multiplying them together to form the interaction. Here is a good resource on working with interactions that goes into more detail: www3.nd.edu/~rwilliam/stats2/l55.pdf
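The centering advice can be sketched as follows (Python for illustration; the variable names mirror the dataset's columns, and the simulated values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the dataset's columns
ejection_fraction = rng.normal(38, 12, size=299)
time = rng.normal(130, 77, size=299)

# Center each variable at its mean before forming the product term.
# The main-effect coefficients are then interpreted at the mean of the
# other variable instead of at a (possibly meaningless) zero.
ef_c = ejection_fraction - ejection_fraction.mean()
time_c = time - time.mean()
interaction = ef_c * time_c  # include ef_c, time_c, AND this term in the model
```

Centering changes the interpretation of the main effects (and reduces collinearity with the product term) without changing the interaction coefficient itself.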
How would you interpret the coefficients for a categorical variable in logistic regression? I could not see this in your model.
Great question. For categorical variables it would be very similar, but instead of talking in terms of units it would be comparing a category to some baseline category. For example, let's say a binary variable 'previous_heart_issue' is coded 1=yes and 0=no, and the odds-ratio percent change was 4.3 for the variable 'previous_heart_issue_yes'. To interpret the coefficient in a sentence, you could say the following: the odds of death are 4.3% greater for patients with previous heart issues compared to patients with no previous heart issues, while controlling for all other predictors in the model. You do not have to calculate a percent; you can just use the odds ratio. I believe it was 1.04, so you can say: the odds of death are 1.04 times greater for patients with previous heart issues compared to patients with no previous heart issues, while controlling for all other predictors in the model.
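A quick sketch of the dummy coding behind that interpretation (Python/pandas for illustration; 'previous_heart_issue' is the hypothetical variable from the reply above):

```python
import pandas as pd

df = pd.DataFrame({"previous_heart_issue": ["yes", "no", "yes", "no", "no"]})

# Dummy-code with 'no' as the baseline category; the model then
# estimates the log odds for 'yes' relative to 'no'
dummies = pd.get_dummies(df["previous_heart_issue"],
                         prefix="previous_heart_issue", drop_first=True)

# Converting a hypothetical odds ratio of 1.043 into a percent change
odds_ratio = 1.043
percent = (odds_ratio - 1) * 100  # 4.3% greater odds vs the baseline
```

R's glm() does this coding automatically for factor variables, which is why the output shows a term like 'previous_heart_issue_yes' rather than the original column.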
Thank you for teaching, very helpful ! One more question, may I use stepwise selection according to P-value instead of AIC?
I am glad you found it useful. The p-value is specific to each independent variable (IV) in your model, and the significance of an IV can change depending on what else is included in the model. On the other hand, stepwise-AIC considers the overall model and the impact on that overall model of removing one variable at a time. One simplistic/basic alternative is to create a model with all the IVs (a saturated model), then select only the significant IVs from that to include in a final model. But again, the issue is that which IVs have a significant p-value changes based on what you include in the model. Using stepwise-AIC may result in a more parsimonious model (i.e., a model that contains the fewest IVs without compromising the overall model). There are model shrinkage approaches other than stepwise that are preferred; that is why I include a bootstrap approach along with it. Let me know if this did not help answer your question.
@@statsguidetree Thanks so much !
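For intuition on the stepwise-AIC point above: AIC = 2k − 2·lnL trades goodness of fit against the number of parameters, so it can prefer a smaller model even when the dropped variable looks marginally significant. A toy calculation (the log-likelihood values are made up):

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2*lnL (lower is better)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical values: dropping one variable barely hurts the fit
full_model = aic(log_likelihood=-150.0, k=6)     # 312.0
reduced_model = aic(log_likelihood=-150.8, k=5)  # 311.6

# The reduced model wins: the small loss in fit is outweighed
# by the penalty saved on the extra parameter
```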
Can you make a video on finding the best model for prediction in RStudio? For example, if I fit on 70 percent of the data and need to predict the remaining 30 percent, how do I find the best model?
If I am not mistaken, I believe you are referring to model evaluation/validation. I do have sort of a part 2 to this tutorial video, where I go over a number of validation techniques (e.g., test/train split, k-fold CV, etc.) for the model developed on this dataset. Here is the link: ruclips.net/video/tY48B6e__h0/видео.html
@@statsguidetree Thank you for replying. Do you have any video on propensity-score logistic regression?
@@kirtansuvarna I haven't done any yet but was thinking of doing a tutorial on Propensity Score in the very near future.
Many, many thanks for introducing me to bootStepAIC::boot.stepAIC!!! I have two quick (perhaps dumb) questions. How large is each bootstrapped sample? Can you change it? If you cannot, and it replicates the sample size used for fitting the model, shouldn't you sample a data set for training the model and then bootstrap from the original data set? I hope I explained myself clearly. Cheers!
Great questions! The size of each bootstrapped sample is the same as your sample. I do not believe you can adjust the sample size using that function, though I could be mistaken. For your second question, what you are referring to are data-splitting methods (they go by many names) designed to estimate the overall accuracy of the model. These data-splitting methods fit a model on one subset of data and test it against some other subset the model has not seen. Some examples include test/train splitting, k-fold cross-validation, and out-of-bag bootstrap. For out-of-bag bootstrap, the model fit on each bootstrap sample is tested against the data that weren't selected into it. However, boot.stepAIC is not so much interested in the accuracy of the overall model; instead it attempts a diagnostic of what the model is comprised of. I only showed one method to mitigate the inconsistency problem of stepwise regression as a model shrinkage method (i.e., developing a parsimonious model). That inconsistency problem is that stepwise regression may include variables that likely should not be in the model, and vice versa. Using this bootstrap approach outlines how often these variables would be included. But to answer your question, you should use some data-splitting method to evaluate the overall model accuracy. In the near future, I will try to do a part 2 that includes a review of some data-splitting techniques to evaluate overall model accuracy.
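The idea behind boot.stepAIC's diagnostic can be sketched without the R package: resample the data with replacement (each resample the same size as the original), rerun a selection rule on each resample, and tally how often each variable is kept. The sketch below uses a simple correlation screen as a stand-in for stepAIC, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 4
X = rng.normal(size=(n, p))
# Only the first two columns actually drive the outcome
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)

B = 200
included = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)  # same-size resample, with replacement
    Xb, yb = X[idx], y[idx]
    # Stand-in selection rule (|correlation| > 0.1) instead of stepAIC
    r = np.array([np.corrcoef(Xb[:, j], yb)[0, 1] for j in range(p)])
    included += np.abs(r) > 0.1

selection_freq = included / B  # how often each variable was "selected"
```

Variables with real effects should show selection frequencies near 1, while noise variables are selected far less often; that stability profile, rather than out-of-sample accuracy, is what the bootstrap diagnostic reports.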