Your videos are very informative! I love that you take the time to show the data first and explain what the variables are.
And the fact that you explain the tidy functions and even repeat a bit of what you said in earlier videos is great! You use just the right amount of detail for me at least. Thank you.
Very nice example. You show all the process, very illustrative. Thank you Julia
Thank you so much for this video (and for all your videos). I've been using R for about two or three years and this was just the right amount of detail and exposition for me. Your workflow is clean and easy to follow, I like how you used the help function and your overall layout is nice too (console in the top right). I look forward to trying XGBoost on some data sets now! :)
Great work, please keep it up! As an idea for another video, it would be nice to use the package bonsai to train a lightgbm regression model and perhaps show the differences against an XGB and a RF model.
Thank you so much Julia for all your tutorial videos. They are easy to follow and very informative.......just great! Please keep posting them. I hope you can find some time to post a video on neural network optimization with Keras in R. I can even start a petition for that. LOL
I always learn something from your videos.
These videos are super informative. Keep them coming. Thanks
Very helpful video! I look forward to following this example in a future project!
Watched more than half of your videos within one week. Don't even want to blink! Saw you plotted XGB importance - wonder if there is tidymodel way to plot SHAP values from XGB. Thanks, Julia!
If you are only doing xgboost, you might try the SHAPforxgboost package: cran.r-project.org/package=SHAPforxgboost (it takes a bit of munging the model to get it to work with that package)
For modeling in general, I like DALEX for explainability, which also supports tidymodels:
modeloriented.github.io/DALEXtra/reference/explain_tidymodels.html
We have a chapter in process on explainability in our upcoming book, so keep your eyes out for that:
www.tmwr.org/
@@JuliaSilge Got it. Thanks for the direction! Again, amazing video series! Really really tidy.
Amazing video, super clear! Thank you, Julia!
Hi Julia, this video was amazing and very informative! Would you be able to help us find resources for (or post a video about :) ) the math behind these models? I.e. gradient descent for XGBoost models. Thank you very much for posting these videos! I am learning a ton!
Very nice presentation of xgboost by the way.
You are the most amazing person I've ever come across
Thanks a lot
Blessings =)
Would be amazing if you do a video using nested data (instead of having a nominal variable, nest it and generate a model for each of the levels for example), also using the map_workflow etc..
great as always!
Hi Julia! Great video. Have you done a video on multiclass classification? I am struggling to find guidance for this type with text classification. Thanks!!
Check out these two:
- juliasilge.com/blog/nber-papers/
- juliasilge.com/blog/multinomial-volcano-eruptions/
Thank you!
48:44 my autoplot was flipped along the X = Y axis, I wonder why.
It's because of a global change in how yardstick finds the "first" or base level event:
juliasilge.com/blog/xgboost-tune-volleyball/#comment-5015180544
Julia,
I ran this model on a new Mac mini and it produced results in about 7 minutes. Much faster than my old Mac desktop, which I did not dare run it on.
Similar time here
Does anyone know how you can save the workflow for later use? I have problems with it since it is not of format 'xgb.booster', whereas using the function saveRDS might result in compatibility issues in case of future package versions.
Great video, thanks. But I’ve got a question. Say, my local computer is too small to fit a model fast enough. How would I train a model in the cloud? Do you have any best practices?
One of the easiest ways to go is to use RStudio on SageMaker:
posit.co/blog/getting-started-rstudio-sagemaker/
If RStudio crashes with a message like "Initializing libomp.dylib, but found libomp.dylib already initialized" when you fit the final workflow, you can use this workaround on OS X:
Sys.setenv(KMP_DUPLICATE_LIB_OK = TRUE)
Can someone tell me why we used sample_prop inside the search grid?
It's what proportion of the total available sample is used for modeling within one boosting iteration:
dials.tidymodels.org/reference/trees.html#details
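To make that concrete, here is a minimal sketch (not the code from the video) of how `sample_prop()` can show up in a tuning grid. It assumes the dials package; pairing it with `tree_depth()` and using `grid_regular()` with three levels are choices made purely for illustration.

```r
library(dials)

# sample_prop() is the dials parameter object for the proportion of rows
# sampled in each boosting iteration. Naming it sample_size lines the grid
# column up with the sample_size argument of boost_tree().
prop_grid <- grid_regular(
  tree_depth(),
  sample_size = sample_prop(),
  levels = 3
)

prop_grid
```

Every value in the `sample_size` column is a proportion between 0 and 1, so each candidate model sees that share of the training rows per iteration.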
I should mention that the mini ran this quietly and I heard no noise from an overworked fan. The unit is also cool to the touch.
Always an amazing content thank you
Great explanation, but I have one question: When you call last_fit() you make use of your split object. In my particular case I was only provided with the train and test sets initially, so I don't have a split object. Is there any way to call last_fit() nevertheless? Thanks!
You can't call last_fit() directly if you don't have the split, but you *can* manually do what it is a wrapper for, which is train one last time on the training data and then evaluate one last time on the testing data.
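A rough sketch of that manual route, for anyone in the same situation. Everything here is an assumption for illustration: the simulated data, the helper `make_data()`, the column names, and the swap to a plain logistic regression instead of the tuned XGBoost workflow from the video.

```r
library(tidymodels)

set.seed(123)

# Hypothetical helper that simulates a small two-class data set; it stands in
# for training and testing sets that were handed over separately
make_data <- function(n_obs) {
  tibble(x1 = rnorm(n_obs), x2 = rnorm(n_obs)) %>%
    mutate(
      win = factor(
        if_else(x1 + x2 + rnorm(n_obs) > 0, "win", "lose"),
        levels = c("win", "lose")
      )
    )
}
train_df <- make_data(300)
test_df <- make_data(100)

# Stand-in for a finalized workflow (logistic regression, not XGBoost)
final_wf <- workflow() %>%
  add_formula(win ~ x1 + x2) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Step 1: train one last time on the training data
final_fit <- fit(final_wf, data = train_df)

# Step 2: evaluate one last time on the testing data
test_preds <- augment(final_fit, new_data = test_df)
eval_metrics <- metric_set(accuracy, roc_auc)
test_metrics <- eval_metrics(
  test_preds,
  truth = win,
  estimate = .pred_class,
  .pred_win
)

test_metrics
```

With a real project, `final_wf` would be the finalized workflow and the two data frames would be the provided training and testing sets.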
All your videos are such a great learning resource for real world EDA and modelling. I was just wondering what theme you are using in rstudio ?
It's one of the themes available via rsthemes: www.garrickadenbuie.com/project/rsthemes/
Julia,
I was able to follow along and everything looked fine until the final roc_auc curve. I get a mirror image of your curve. I have combed through the code and found nothing wrong. The confusion matrix outcome is similar to yours etc. It seems like a systematic error. I noticed when I looked at the data that will generate the curve that indeed my numbers for specificity are somehow switched. While your table starts with specificity of 1, mine starts at zero, so the values seem more like 1-specificity to begin with in my case. I am puzzled.
You can look at the first comment at the relevant blog post here:
juliasilge.com/blog/xgboost-tune-volleyball/
Since I published this blog post, there was a change in yardstick in version 0.0.7:
github.com/tidymodels/yardstick/blob/master/NEWS.md#yardstick-007
that changed how to choose which level (win or lose) is the "event". You can change this by using the `event_level` argument for functions like `roc_curve()`:
yardstick.tidymodels.org/reference/roc_curve.html
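Here is a small sketch of how the mirrored curve comes about, using the `two_class_example` data that ships with yardstick rather than the volleyball data from the video. The object names are made up for illustration.

```r
library(yardstick)
data(two_class_example, package = "yardstick")

# truth has levels Class1, Class2; the default event_level = "first" treats
# Class1 as the event, so the Class1 probability column is the right one
auc_first <- roc_auc(two_class_example, truth, Class1)

# Mismatch: Class2 is declared the event but Class1 probabilities are passed.
# This is the situation that produces a mirrored curve.
auc_mismatch <- roc_auc(two_class_example, truth, Class1, event_level = "second")

# Consistent again: Class2 is the event and Class2 probabilities are passed
auc_second <- roc_auc(two_class_example, truth, Class2, event_level = "second")

auc_first
auc_mismatch
auc_second
```

The same `event_level` argument works for `roc_curve()`, so a flipped plot is usually fixed either by setting it or by passing the probability column that matches the event level.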
What appearance theme are you using here?
I use one of the themes from rsthemes:
www.garrickadenbuie.com/project/rsthemes/
I think Oceanic Plus? There are lots of nice ones available in that package.
Hi Julia ! Great video as always :) ! Can i ask you something please? At around 34.08 if we don't want to use the xgb_grid you are using and we use in the tune_grid() function, something else for the grid parameter, let's say grid = 50 is this ok ? I mean generally is it ok to use grid equal a number ? Thank you very much !
Yes, that argument can take a couple of different kinds of values, either a dataframe or an integer value:
tune.tidymodels.org/reference/tune_grid.html
You can read a bit more about this here:
www.tmwr.org/grid-search.html#evaluating-grid
@@JuliaSilge Thank you again ! :) .
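For reference, a minimal sketch of both kinds of `grid` value side by side. The simulated data, the three folds, and the swap to a small rpart decision tree (so it runs in seconds) are all assumptions for illustration, not the setup from the video.

```r
library(tidymodels)

set.seed(234)

# Simulated two-class data, purely for illustration
sim_df <- tibble(x1 = rnorm(300), x2 = rnorm(300)) %>%
  mutate(
    win = factor(
      if_else(x1 + x2 + rnorm(300) > 0, "win", "lose"),
      levels = c("win", "lose")
    )
  )
folds <- vfold_cv(sim_df, v = 3)

tree_wf <- workflow() %>%
  add_formula(win ~ x1 + x2) %>%
  add_model(
    decision_tree(cost_complexity = tune()) %>%
      set_engine("rpart") %>%
      set_mode("classification")
  )

# Integer: tune_grid() builds its own grid with up to that many candidates
res_int <- tune_grid(tree_wf, resamples = folds, grid = 5)

# Data frame: one column per tuning parameter, one row per candidate
manual_grid <- tibble(cost_complexity = c(0.001, 0.01, 0.1))
res_df <- tune_grid(tree_wf, resamples = folds, grid = manual_grid)

# Number of candidate configurations actually evaluated in each run
n_int <- n_distinct(collect_metrics(res_int)$.config)
n_df <- n_distinct(collect_metrics(res_df)$.config)
```

An integer is convenient for a first pass; a data frame gives full control over which parameter combinations are tried.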
These vids are great. Can we see a classification model with calibration curves, and then recalibrate it, within the tidymodels framework? How long did the hyperparameter tuning take here?
Thanks a lot Julia. I really love your videos. Do you have any plans for making a video on neural network and tuning it in tidymodels? That would be awesome if possible. Please continue these videos. They are really great.
Cheers
Thank you for the great tutorial. I have been having a problem with a confusion matrix. Namely, when I run the code "final_res_r %>% collect_predictions() %>% roc_curve(dependent_var, .pred_dependent_var) %>% autoplot()", I get the error: Can't subset columns that don't exist.
x Column `.pred_dependent_var` doesn't exist. I cannot understand how to solve the problem. What am I doing wrong?
Hmmmm, do you see the column with the predicted class probability in it, after you run `collect_predictions()`? You can check out the documentation for `roc_curve()` here:
yardstick.tidymodels.org/reference/roc_curve.html
And if you continue to have trouble, I recommend creating a reprex and posting it on RStudio Community:
rstd.io/tidymodels-community
It's often easier to get help with coding problems in a format like that rather than comments.
@@JuliaSilge Dear Julia! Just amazing to read your response :). I have solved that problem :). however, another problem that I could not solve was related to the variable importance. I managed to create a figure but I can not get the actual values per variable. I tried to use varImp(model_name), xgb.importance(model = model_name). but getting just lovely red text around, without the results :)
@@tamaraabzhandadze2712 I typically use the vip package for variable importance, as I show in this blog post:
juliasilge.com/blog/xgboost-tune-volleyball/
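To get the actual numbers rather than only the plot, something along these lines may help. It is a sketch under assumptions: the data are simulated, and an rpart decision tree is swapped in for XGBoost to keep the example light. `vi()` is the vip function that returns the importance scores as a table.

```r
library(tidymodels)
library(vip)

set.seed(345)

# Simulated data: x1 matters most, x2 less, noise not at all
sim_df <- tibble(x1 = rnorm(300), x2 = rnorm(300), noise = rnorm(300)) %>%
  mutate(
    win = factor(
      if_else(2 * x1 + x2 + rnorm(300) > 0, "win", "lose"),
      levels = c("win", "lose")
    )
  )

# Decision tree stand-in for the tuned XGBoost workflow
tree_fit <- workflow() %>%
  add_formula(win ~ .) %>%
  add_model(decision_tree(mode = "classification") %>% set_engine("rpart")) %>%
  fit(data = sim_df)

# Pull out the underlying engine fit, then ask vip for a table of scores
importance_tbl <- tree_fit %>%
  extract_fit_engine() %>%
  vi()

importance_tbl
```

The result has a `Variable` column and an `Importance` column, which can be sorted or filtered like any other data frame.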
@@JuliaSilge thank you! I have actually posted the question there as well :) . I read your answer and got the results :). I just really have to decide now the cutoff coefficient for choosing some variables out of ten features.
p.s. i did factor analyses as well, and could identify 3 variables with good loading, but there it was a bit easier as there are cutoffs for loading :). For XGboost i have no idea what to do :)
Julia,
I do like Markdown but for testing out code I prefer R script simply because I make a lot of mistakes. So I am curious to know why you work in Markdown. Is it so because you have already written and debugged your code and would like to save the lesson in a nicer format?
No, I work in R Markdown regularly. In R I basically am either building package code or I am working in R Markdown. I'm a huge believer in the idea of "literate programming" as a real way to work. I make a lot of mistakes too, but I don't think that reduces the value of combining narrative and code in one document.
I am working on setting up a class for students in my department and am quite torn on whether to go the Markdown or R script route. Since most of the class work will be around coding and simply learning how to R I am inclined to start with the regular setup (script) and then move on to Markdown later. Thanks.
@@haraldurkarlsson1147 The person I know who has thought the most about this is Mine Çetinkaya-Rundel; you can see one of her resources for teaching here: datasciencebox.org/
She recommends teaching R Markdown to emphasize reproducible analyses.
I see. Thanks a lot for the tip.
Julia,
I will have a deeper dive into the datasciencebox. However, I will be teaching grad students that should have some inkling of what the basic statistics concepts are. Most have already worked with data, done some data processing, and generated tables and graphs. I would like to teach them R to simplify their lives and give them hopefully a new valuable skill for the current or future work. As grad students the science part is covered.
Julia thank you for these great videos keep it up ! Quick question once using last_fit if wanting to predict on NEW data what are the workflow steps ? Last_fit doesn’t really work on new data that wasn’t in the original split. Thank you !
Once you get to last_fit(), check out the objects that are inside of it. One of the columns contains a *fitted model* that can be used on new data. In fact, that fitted model is used on the testing data to compute the metrics!
@@JuliaSilge Thank you Julia! Last quick Q, noticed you always process the commands in console from the notebook Rmd, what button do you click to run in console instead of in the notebook?
@@Matthew-px9nu That's probably my most used keyboard shortcut! Ctrl+Shift+Enter for a chunk, Cmd+Enter for a line
In RStudio, you can find them under Tools -> Keyboard Shortcuts Help, but there's just a handful that I use regularly.
@@JuliaSilge Hi Julia! Where exactly do I find this? The columns I have are splits, id, .metrics, .notes, .predictions, .workflow. I can't find the fitted model in .workflow either so I'm not sure where it is. Thanks!
@@vincentpepe1064 The .workflow is a *fitted* workflow at this point. For example, try tidying it or predicting on it. I show how to tidy it here: juliasilge.com/blog/palmer-penguins/
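A minimal sketch of pulling that fitted workflow out and predicting on new data. The simulated data, the helper `make_data()`, and the logistic regression stand-in are assumptions for illustration, not the model from the video.

```r
library(tidymodels)

set.seed(456)

# Hypothetical helper that simulates a small two-class data set
make_data <- function(n_obs) {
  tibble(x1 = rnorm(n_obs), x2 = rnorm(n_obs)) %>%
    mutate(
      win = factor(
        if_else(x1 + x2 + rnorm(n_obs) > 0, "win", "lose"),
        levels = c("win", "lose")
      )
    )
}

sim_split <- initial_split(make_data(300), strata = win)
new_df <- make_data(10) %>% select(-win) # new data with no outcome column

wf <- workflow() %>%
  add_formula(win ~ x1 + x2) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

final_res <- last_fit(wf, sim_split)

# The .workflow column is a list column holding one fitted workflow
# (recent versions of tune also offer extract_workflow() for this)
fitted_wf <- final_res$.workflow[[1]]

new_preds <- predict(fitted_wf, new_data = new_df)
new_preds
```

Because the fitted workflow carries its preprocessing with it, the new data only need the same predictor columns as the training data.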
Great video! Is there any difference between “pivot_longer” and “gather”? They look identical to me, just with the arguments having different names, but want to make sure I’m not missing something.
You can read this blog post that introduced the pivot verbs: www.tidyverse.org/blog/2019/09/tidyr-1-0-0/
Julia Silge oh awesome thanks!
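As a quick side-by-side, here is a tiny sketch on a made-up table. `pivot_longer()` is the newer interface introduced in tidyr 1.0.0, and `gather()` is kept for existing code. The data and column names are invented for illustration.

```r
library(tidyr)
library(dplyr)

wide <- tibble(id = 1:2, a = c(10, 20), b = c(30, 40))

# Older interface: name the key and value columns, then the columns to gather
long_gather <- gather(wide, key = "name", value = "value", -id)

# Newer interface: cols, names_to, and values_to are spelled out explicitly
long_pivot <- pivot_longer(
  wide,
  cols = -id,
  names_to = "name",
  values_to = "value"
)

# Row order differs between the two, so sort before comparing
same_rows <- all.equal(
  as.data.frame(arrange(long_gather, id, name)),
  as.data.frame(arrange(long_pivot, id, name))
)
```

For a simple case like this the two give the same rows; `pivot_longer()` additionally handles things like splitting column names into several variables.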
Hi, I'm getting a warning-error: ! Fold01: model 1/20: The `x` argument of `as_tibble.matrix()` must have colum...
when the tune_grid function runs... I found in a GitHub issue that it's related to "name repairing"...
Do you have any idea if it really affects the results of the tuning process, or if there's an update/solution for it?
Hmmmm, do you want to make sure all your packages are updated? That sounds like a message from an older version of the packages. If you are still getting that warning, I recommend creating a reprex and posting on RStudio Community: community.rstudio.com/c/ml/15
@@JuliaSilge After reading your response, I did update all my packages, and the error still occurs, but the process seems to keep running. I will let it finish, and see if it affects the results of the tune_grid
Is .pred_win = .pred_class ?
No, .pred_win should be a class probability (like a number) and .pred_class should be the predicted class (like the factor level).
@@JuliaSilge Ah ok, thank you!
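A small sketch of the difference, using simulated data and a logistic regression as stand-ins (both are assumptions for illustration). The `win`/`lose` levels mirror the ones in the video.

```r
library(tidymodels)

set.seed(567)

sim_df <- tibble(x1 = rnorm(200), x2 = rnorm(200)) %>%
  mutate(
    win = factor(
      if_else(x1 + x2 + rnorm(200) > 0, "win", "lose"),
      levels = c("win", "lose")
    )
  )

glm_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(win ~ x1 + x2, data = sim_df)

# type = "class": one factor column, .pred_class, holding the predicted level
class_preds <- predict(glm_fit, new_data = sim_df, type = "class")

# type = "prob": one numeric column per level, .pred_win and .pred_lose
prob_preds <- predict(glm_fit, new_data = sim_df, type = "prob")

head(class_preds)
head(prob_preds)
```

Functions like `roc_curve()` need the probability column, while `conf_mat()` and `accuracy()` need the class column.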
Do you have a course on tidymodels?? Video Course or Tutorials?
You can check out this interactive course on tidymodels: supervised-ml-course.netlify.app/
@@JuliaSilge Amazing resource, thank you
You are the best!
Hey Julia, congrats again: show up this error:
xgb_res
You need to *install* xgboost, actually; you don't have the package installed: install.packages("xgboost")
@@JuliaSilge yeah...but I don't know what is happening. When I try to install the xgboost package, it gives an error telling me that xgboost is not available for my R version. My RStudio is the current version.
@@wecsleyprates3205 Ah, a classic problem that folks run into when things get borked! Check out this SO question + answers:
stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-for-r-version-x-y-z-wa
Thanks @@JuliaSilge...Do you know what the error below means?
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘predict’ for signature ‘"xgb.Booster"’
@@wecsleyprates3205 That sounds like xgboost still isn't getting loaded correctly to me. Could you try creating a reprex showing your problem and posting on RStudio Community? rstd.io/tidymodels-community