Absolutely in love with these tidymodels videos you're dropping every week! Keep rocking!
Hey Julia, this is really great. I hope you keep these coming. I think it is fascinating and valuable to simply hear how people think through and process a problem or question. I really love hearing you think through the process to get to the end. Thank you.
Julia, This was amazing. Thank you for your video.
Hi Julia, thank you for the great tutorial! In your example, you normalize both the numeric variables and dummy variables. Should we normalize the dummy variables? Also, in a regression model, should we normalize the outcome variable? Thanks.
It depends on the model, but for something like k-nearest neighbors, yes, you do need all the predictors on the same scale, including categorical variables that have been transformed to dummy/indicator variables. You might find this link helpful for understanding when we do and don't want to normalize variables:
www.tmwr.org/pre-proc-table.html
In most cases, I don't find it useful to normalize an outcome. The exception is sometimes for deep learning models; in some situations, they do a terrible job converging unless the outcome is scaled and centered.
@JuliaSilge Hi Julia, thanks a lot for your explanation!
This is gold, thank you Julia!
Thank you, Julia! It was very helpful, very informative, and brought many aha moments.
Really helpful, thank you! Quick question. Should we use step_normalize after step_dummy? Wouldn’t that normalize our dummy variables (1s and 0s)? How does that affect different models?
Yep, it does center and scale the 1s and 0s from the dummy variables. If you are dealing with a model that is sensitive to scaling, this is actually a good idea.
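A minimal sketch of that ordering in a recipe (the data frame and outcome names here are placeholders, not from the video):

```r
library(tidymodels)

rec <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%  # categorical -> 0/1 indicator columns
  step_normalize(all_predictors())          # centers/scales everything, dummies included
```

For a model that is insensitive to predictor scale, like a tree-based model, you could drop `step_normalize()` entirely.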
Hey Julia, great video! At 42:10 you used some shortcut to format your code; was that the styler plug-in? Thanks!
Not a plug-in really, just commands in RStudio: I use a Mac, so first I did Command-A to select all, and then Command-I to reindent lines. I feel like I do this all the time as I'm working. 😅
Hello Julia, thanks for your video! Quick question: could I use CV folds to fit a model and compute its performance, and then finally test the model using the test data? Does tidymodels have a function for this?
When you use resampled data like CV folds, your goal is estimating performance (maybe for different sets of hyperparameters, like tuning): www.tmwr.org/resampling.html
You don't fit a final model using the CV folds.
In tidymodels, the function to fit your final model to the whole training set and evaluate on the test set is `last_fit()`:
www.tmwr.org/workflows.html#evaluating-the-test-set
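As a rough sketch (the workflow and split objects here are placeholder names):

```r
library(tidymodels)

# `final_wf` is your finalized workflow; `data_split` comes from initial_split()
final_res <- last_fit(final_wf, split = data_split)  # fits on the training set, evaluates on the test set
collect_metrics(final_res)                           # test-set performance metrics
```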
@JuliaSilge Ok, I got it. Thanks!
Hi Julia, in this part of the code I received the following error:

set.seed(234)
glm_rs <- glm_spec %>%
  fit_resamples(diversity ~ .,
                folds,
                metrics = metric_set(roc_auc, sens, spec),
                control = control_resamples(save_pred = TRUE))

Error:
The combination of metric functions must be:
- only numeric metrics
- a mix of class metrics and class probability metrics
The following metric function types are being mixed:
- prob (roc_auc)
- class (sens)
- other (spec)
Hmmmmm, I can't reproduce that problem. Can you make sure you have updated versions of the packages?
@JuliaSilge The code used the abbreviations for sensitivity and specificity; I spelled them out entirely and that worked for me. Awesome videos! Love the standardization of tidymodels! Super helpful.
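For anyone hitting the same error, the fully spelled-out version of the metric set (assuming a current version of yardstick) would be:

```r
library(yardstick)

# sensitivity/specificity are the spelled-out equivalents of sens/spec
my_metrics <- metric_set(roc_auc, sensitivity, specificity)
```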
Julia, this is the most amazing tidymodels tutorial on YouTube! Great job! If I may ask a question: why is it important to keep numeric features with zero variance out of the training step? Is a feature with zero variance detrimental to the model, or is it something else? Lastly, when we predict on a test set with a model that used 10-fold cross-validation, which of the models from the 10 folds does R use to predict? Thank you so much!
Yes, there are many models where predictors with a single unique value will cause the model to fail. On the other question, when you use `fit_resamples()` you do NOT keep any of those models; those models are used only for computing performance metrics and then thrown away. If you check out the blog post here, you'll notice that the model used for predicting the test set is trained on the whole training set:
juliasilge.com/blog/tuition-resampling/
Thanks again for a wonderful TidyTuesday! Just a question, is there a reason you're both specifying the formula in the recipe and in the fit-call?
Both need to know about what is being fit, either to create a model matrix or for actual fitting. You can check out some docs here: tidymodels.github.io/parsnip/reference/fit.html
And here: tidymodels.github.io/recipes/reference/recipe.html
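One way to avoid stating the formula twice (a sketch with placeholder data names) is to bundle the recipe and model in a workflow, so the formula lives only in the recipe:

```r
library(tidymodels)

rec <- recipe(diversity ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%                             # workflow takes roles/formula from the recipe
  add_model(logistic_reg() %>% set_engine("glm"))

fitted_wf <- fit(wf, data = train_data)           # no formula needed in the fit call
```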
Thanks for your video. I learned some dplyr techniques from it.
Julia, this is a great tutorial, but I'm unable to run the code. This is the error I'm getting. Thanks in advance for the help:

filter(category == "Total Minority") %>%
  mutate(TotalMinority = enrollment / total_enrollment)
Error in filter(category == "Total Minority") %>% mutate(TotalMinority = enrollment/total_enrollment) :
  could not find function "%>%"
> filter(category == "Total Minority")
Error in as.ts(x) : object 'category' not found
> mutate(TotalMinority = enrollment / total_enrollment)
Error in mutate(TotalMinority = enrollment/total_enrollment) :
  could not find function "mutate"
Hmmmm, sounds like you haven't loaded the tidyverse: `library(tidyverse)`
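That is, something like this at the top of the script (with `df` standing in for whatever data frame the tutorial uses):

```r
library(tidyverse)  # provides dplyr::filter(), dplyr::mutate(), and the %>% pipe

df %>%
  filter(category == "Total Minority") %>%
  mutate(TotalMinority = enrollment / total_enrollment)
```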