Julia, I can't thank you enough. I just submitted my dissertation in machine learning and I wouldn't have been able to do it without you. Tidymodels literally saved me. Your videos are great and you are an inspiration to budding data scientists. I hope you keep up the Tidy Tuesday data screencasts because they are great! Thanks again!
Thank you Julia for this great tutorial; it means a lot to see someone doing modelling in action!!
Better than any online course.
Good luck
It's a good one, and thanks for doing the screencasts every week.
Looking forward to next week's lesson.
Hi Julia - thank you for doing these vids - they really, really help some of us who are trying to improve our R skills!
Hi Julia, thank you for the wonderful tutorial! Could you explain a little bit about how "class imbalance" impacts the calculation or accuracy in machine learning? Thanks.
You can see more about this in this more recent blog post: juliasilge.com/blog/project-feederwatch/
@@JuliaSilge Thank you Julia.
Julia is so awesome!!!!
Hi Julia, I have now watched this highly informative video twice and will watch it again and again. Very well put together and articulated, with concise explanations and a well-structured sequence. I will look for more videos by you and the community in conjunction with my studies. Regards, Mark
Thank you so much for this video Julia. It's been exceedingly helpful to me in starting to use recipes and tidymodels.
Thank you, Julia. Ever since I subscribed to your channel, I've become better at using R.
Hey, thanks Julia... great training and tutorial! :) Bye
I'm so happy I found your blog. Thanks for the help. Your book is awesome! I learned this stuff in the caret package. For some reason I remember the model fitting steps and model evaluation steps getting tangled up. I could be mistaken, but caret allowed me to estimate model parameters and evaluate models (and maybe tune) almost simultaneously. And if I'm understanding correctly, with tidymodels we fit the chosen model to the entire training data in a separate step. Tidy is great!
If you want to see a more detailed walk through with this data, check it out here: www.tidymodels.org/start/case-study/
As always your work is great. I am learning a lot. Thank you very much.
Thanks for sharing, Julia. One minor issue at 26:48: I would normalize the training and test sets separately, though. By normalizing first and then splitting, we "leak information." It's also incorrect in the caret documentation.
Pretty nice screencast, thanks so much. I'm waiting for more... 🤓
Thank you very much for sharing this knowledge. Congratulations
I love your videos!! so many great tips and tricks and great explanations of everything!!
That was soooooo informative! Thank you so much for this. Please keep going!
Absolutely brilliant brilliant brilliant
Amazing tutorial, thank you for sharing with us!
At 44:25, how do you do that?
It's called "reindent lines" in RStudio and on a Mac it is Cmd+I. It's one of my most-used shortcuts! You can see shortcuts in RStudio itself, but there is also a list here:
support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts
Thank you for this great tutorial. The available steps in recipes amaze me, and I am keen to learn about the tuning functionality.
I like the tile arrangement
I couldn't get the code below to work unless I put `knn_spec` before `children ~ .` (around the 40 min mark). Great video!
Error in `fit_resamples()`: ! The first argument to [fit_resamples()] should be either a model or workflow.
knn_res
Dear 420coolbro69, I had the same issue. After switching the positions of knn_spec and the formula, it worked. Thank you very much.
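For reference, this is how the call looks with the current tune API, where the model spec (or a workflow) comes first and the formula or recipe second; this is a minimal sketch, and the data/object names (hotel_train, folds) are illustrative:

library(tidymodels)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

set.seed(123)
folds <- vfold_cv(hotel_train, v = 10)

# model (or workflow) first, then the preprocessor (formula or recipe)
knn_res <- fit_resamples(knn_spec, children ~ ., resamples = folds)
collect_metrics(knn_res)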
Julia, thanks for these videos. Immensely helpful. Quick question: how can we get training statistics that are used during prepping the recipe? For example, if we use step_impute_median() for a variable, somewhere in that recipe object, that training median must be stored. How can we extract that?
You can extract that info by using `tidy()` with the recipe:
www.tmwr.org/recipes.html#tidy-a-recipe
@@JuliaSilge thanks for the reply! tidy(rec) produces a table with the operations and their ids. I am looking more for the training statistics for the imputation steps; it does not seem to have the statistics saved. Or maybe I am missing something. I believe rec$steps has it as a list of detailed information, but I am unable to extract them. For example, say a numerical column has a few missing points, and its median for the non-missing points is 10. Then when we step_impute_median(), somewhere in the recipe object the sample median 10 should be saved for imputing the testing data. I want to extract and see those training statistics.
@@TURALOWEN Yep, keep reading in that section I linked to for tidying an individual recipe step.
@@JuliaSilge I see now. Thank you! [One needs to identify the step, and then tidy the prepped recipe with that step's number or id.]
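To make that concrete, a minimal sketch of pulling the stored training median out of a prepped recipe (assuming a numeric column such as lead_time with missing values; names are illustrative):

library(tidymodels)

rec <- recipe(children ~ ., data = hotel_train) %>%
  step_impute_median(lead_time)   # step number 1 in this recipe

rec_prepped <- prep(rec)

# tidy() on the whole recipe lists the steps and their ids;
# tidy() with `number` (or `id`) returns that step's learned estimates,
# here the training-set median that will be used to impute new data
tidy(rec_prepped, number = 1)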
Great video, thanks for your effort. Question: when you apply a recipe that includes scaling and centering, (value - mean(values)) / sd(values), to the test data, does the scaling use the standard deviation from the training data or from the test data?
It uses the mean and standard deviation from the training data.
As Fateh says below, the recipe uses the transformation estimates (for scaling/centering, standard deviation and such) from the *training* data and will apply it to any new dataset, such as the testing data. The prep() function estimates the parameters from the data, while the bake() function applies the learned parameters to new data.
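A small sketch of that behavior (hotel_train and hotel_test are illustrative names):

library(tidymodels)

rec <- recipe(children ~ ., data = hotel_train) %>%
  step_normalize(all_numeric_predictors())

# prep() learns the means and standard deviations from the training data
rec_prepped <- prep(rec, training = hotel_train)

# bake() applies those *training* estimates to any new data, so the
# test set is centered and scaled with the training mean and sd
test_baked <- bake(rec_prepped, new_data = hotel_test)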
Julia Silge great. Thank you both.
I'm trying to apply an SVM to this data but it overfits; how can I fix it?
You might want to check out this chapter (along with the previous ones), which walks through tuning an SVM model: www.tmwr.org/iterative-search.html#svm
@@JuliaSilge thank you!!!
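For anyone else landing here: tuning the SVM's cost and kernel parameters against resamples, rather than judging a single fit on the training set, is the usual guard against picking an overfit model. A rough sketch under the same assumptions as the video (rec is a preprocessing recipe; names are illustrative):

library(tidymodels)

svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

set.seed(234)
svm_res <- tune_grid(
  workflow() %>% add_recipe(rec) %>% add_model(svm_spec),
  resamples = vfold_cv(hotel_train, v = 5),
  grid = 10
)

# pick hyperparameters by resampled performance, not training fit
show_best(svm_res, metric = "roc_auc")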
Thank you very much Julia
Thanks for the great video; I have a question about dummy variables. Sometimes R automatically makes them, but when do we need to make them manually?
If you are using a model with a formula interface, R typically will make dummy variables for you. This is really a point of confusion in the R modeling world, though, and one we are trying to address using recipes!
Julia Silge thanks very much
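A tiny illustration of the difference (hotels and customer_type stand in for any data frame and factor column):

# the formula interface expands factor predictors into 0/1 columns for you
head(model.matrix(~ customer_type, data = hotels))

# recipes make that same step explicit, so you control when it happens
recipe(children ~ ., data = hotels) %>%
  step_dummy(all_nominal_predictors())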
TidyTuesday is my new Netflix binge series 😂
Perfect as always!
That was a hilarious mistype at 33:18 😂
Great demonstration of recipes, loved it! It would be nice to see an example of boosted trees with the xgboost engine in another video.
Great channel !!
Definitely subscribing :)
This is really great work! Thanks Julia. By the way, great book on Tidytext and it was a great read too!
Hi Julia, thank you again for the great tutorial! At 27:34 you applied the dummy transform and then normalized all the numeric predictors. This means that step_normalize() will also normalize the dummy variables. Is this OK? I don't think we need to normalize dummy variables, so maybe step_normalize() should come before step_dummy(), or does it not matter at all?
That was definitely on purpose, yep! For models like k-nearest neighbors that are based on a distance metric, all the predictors need to be on the same scale. This includes predictors that have been converted to dummy or one-hot numeric variables. You can check out this vignette for some advice on ordering recipe steps:
recipes.tidymodels.org/articles/Ordering.html
And this appendix for advice on what kind of preprocessing is needed for different models:
www.tmwr.org/pre-proc-table.html
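In recipe code, that ordering looks like this (a sketch: step_dummy() first, so the new indicator columns exist when step_normalize() computes its estimates; names are illustrative):

rec <- recipe(children ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%    # creates the 0/1 columns
  step_normalize(all_numeric_predictors())    # now scales those columns too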
@@JuliaSilge These 2 links are diamonds! Really helpful. Thank you very much!
Thank you Julia. It would be great if you could make a video on sentiment analysis 🙂
Julia, I would love to hear whether you stand behind this way of using recipes, because in a different talk Max Kuhn said it is better not to use bake() if not necessary, and in general that we want to perform changes only on the training data, not the test data. Thank you
Ah yes, when I made this screencast, the tooling around workflows was not as robust as it is today. I know Max and I are on the same page in terms of how useful workflows are and how you don't typically need to use `bake()` except for debugging. You can read more about that here:
www.tmwr.org/dimensionality.html#recipe-functions
It's definitely important to carefully use the test data so that you only apply learned transformations to it and don't use it for any training. This post/screencast does stick with that and use the testing data in a correct way, but it's easier now with the tidymodels workflows infrastructure.
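A minimal sketch of the workflow approach, assuming a recipe rec and model spec knn_spec as in the video (data names are illustrative):

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(knn_spec)

# fit() preps the recipe on the training data; predict() then applies the
# learned preprocessing to new data internally, so no manual bake() is needed
wf_fit <- fit(wf, data = hotel_train)
predict(wf_fit, new_data = hotel_test)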
@@JuliaSilge Thank you so much for your detailed response. You are such a thoughtful educator! I have this link in my favorites already ;)
The happy weekend question:
Do you have any suggestions on how I could use tidymodels for linear mixed models? I am comparing how well different regression models perform for predicting a response ratio (a continuous variable) based on many hundreds of covariates.
Thank you!
If you are interested in experimenting with our new package for mixed effects models, you can check that out here:
github.com/tidymodels/multilevelmod
If you are interested in an approach for evaluating different predictor sets, you might try something like this:
workflowsets.tidymodels.org/articles/evaluating-different-predictor-sets.html
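A rough sketch of the multilevelmod pattern, with entirely hypothetical column and data names (response_ratio, covariate1, site, my_data); the random-effects part of the formula is supplied through add_model():

library(tidymodels)
library(multilevelmod)   # registers mixed-effects engines such as "lmer"

lmm_spec <- linear_reg() %>% set_engine("lmer")

lmm_wf <- workflow() %>%
  add_variables(outcomes = response_ratio, predictors = c(covariate1, site)) %>%
  add_model(lmm_spec, formula = response_ratio ~ covariate1 + (1 | site))

fit(lmm_wf, data = my_data)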
Nice tutorial, but is there any possibility of making a tutorial series on NetCDF/HDF satellite files? It would be very useful.
Hello Julia Silge,
I am trying to reproduce the model, but unfortunately step_downsample() from the recipes package doesn't work. I searched the internet to see if I could find out why, tried the downsampling function from the caret package, and even installed the development version from GitHub (devtools::install_github("tidymodels/recipes")), but the function still wasn't found. What should I do now? This is the error:
> require(recipes)
> recipe(children ~. , data = training_hotel) %>%
+ step_downsample(children) %>%
+ step_dummy(all_nominal(), -all_outcomes())
Error in step_downsample(., children) :
could not find function "step_downsample"
Ah, half an hour later: found it in the themis package. Sorry!
Yep, it is here: themis.tidymodels.org/reference/step_downsample.html
@@JuliaSilge I'm having the same problem. It seems that step_downsample is not available in the latest version of recipes that I can find online (1.0.3). I get the error: Error in step_downsample(., sample_type) : could not find function "step_downsample". If I try recipes::step_downsample I get: Error: 'step_downsample' is not an exported object from 'namespace:recipes'
I think I've figured it out: you need to install the themis package.
@@King_of_carrot_flowers Yep, it is here: themis.tidymodels.org/reference/step_downsample.html
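For completeness, here is the working version of the recipe from the question above, with themis loaded (training_hotel is the questioner's own training data):

install.packages("themis")   # once
library(recipes)
library(themis)   # step_downsample() lives here, not in recipes

recipe(children ~ ., data = training_hotel) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes())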
Excellent lesson, thanks very much.
Is there an advantage to step_downsample as part of the recipe versus using strata = in the initial split?
Yes, when you use strata in the initial split, both the training and testing sets will have the same proportion of positive/negative cases, but that proportion is still small (
@@JuliaSilge of course, thank you
@@JuliaSilge This cross-validation is not satisfying at all, since the validation sets coming from juice() are not representative of the real world (the original imbalanced data). Right?
@@mahdip.4674 The test set has the original imbalance, like the "real world". This validation set is balanced like the training set; it is in fact made of resamples of the training set, as you noticed, because it is being used to compute performance metrics for the training data.
@@JuliaSilge Thanks for the reply, and great video. But I would like to make sure that what I am doing is correct. Usually I split the data into train and test, then use the training set for cross-validation. In the cross-validation phase I still preserve the target distribution, mimicking the real world, in the validation segment; with this approach I try to find the best parameter space. The test set I use only once, for final evaluation. In cross-validation, for every round of modelling I can then down- or upsample the training part while the validation set is preserved as it is, and I apply bake() to each validation set, where bake() uses the results of prep() on each training part. Is that right? Not necessary?
Thanks for the reply.
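With current tidymodels tooling, much of that bookkeeping is automatic: subsampling steps like step_downsample() default to skip = TRUE, so they are applied only when the recipe is trained on each analysis set, and the assessment (validation) sets keep their original class balance. A sketch with illustrative names, assuming a model spec knn_spec:

library(tidymodels)
library(themis)

set.seed(123)
split <- initial_split(hotels, strata = children)   # test set keeps the real-world imbalance
hotel_train <- training(split)
folds <- vfold_cv(hotel_train, v = 10, strata = children)

rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children)   # skip = TRUE by default: not applied to assessment data

# each analysis set is downsampled; each assessment set keeps its imbalance
fit_resamples(workflow() %>% add_recipe(rec) %>% add_model(knn_spec),
              resamples = folds)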
Wonderful!
Is there a comprehensive book about tidymodels with more examples?
Just yesterday we announced our new book, currently with eleven chapters released: www.tmwr.org/
I'm not finding any info on how to load tidymodels. I am getting an error. I restarted R and ran it as administrator to try to get it to work, but it's still not happening. Any suggestions?
Are you saying you are having trouble installing the tidymodels metapackage? I don't have specific suggestions based on just what you have said here, but you can set yourself up for getting effective help by creating a reprex:
www.tidyverse.org/help/
And then posting on a forum like RStudio Community:
community.rstudio.com/
@@JuliaSilge OK, I'll give those a try. It looks like someone else suggested installing rlang from CRAN first, restarting R again, and then installing and loading tidymodels.
There's a "loadNamespace" error message that mentions rlang inside the error message. Just noting this here in case someone else comes across this message down the road.
@@JuliaSilge At about the 40 min mark I am getting this error:
Warning: The `...` are not used in this function but one or more objects were passed: ''
Error: The `resamples` argument should be an 'rset' object, such as the type produced by `vfold_cv()` or other 'rsample' functions
It happens on the knn_res call.
@@infamousprince88 Yes, this screencast is a bit older and some of the tidymodels tuning functions have changed since this time. If you would like to see a more up-to-date example with this dataset, you can check this out:
www.tidymodels.org/start/case-study/
Thank you!
24:23: addressing the wrong conclusions that might be reached with class imbalance, you said, "No one has children. Wow! Look at all those people without children." I thought that was funny.
Hi Julia, besides all the other great things about your videos and techniques, may I have your permission to give you a compliment... you are very beautiful...
Then... corona happened. The model went to the dogs...