Absolutely in love with these tidymodels videos you're dropping every week! Keep rocking!
Hey Julia, this is really great. I hope you keep these coming. I think it is fascinating and valuable to simply hear how people think through and process a problem or question. I really love hearing you think through the process to get to the end. Thank you.
Julia, This was amazing. Thank you for your video.
Hi Julia, thank you for the great tutorial! In your example, you normalize both the numeric variables and dummy variables. Should we normalize the dummy variables? Also, in a regression model, should we normalize the outcome variable? Thanks.
It depends on the model, but for something like k-nearest neighbors, yes, you do need all the predictors on the same scale, including categorical variables that have been transformed to dummy/indicator variables. You might find this link helpful for understanding when we do and don't want to normalize variables:
www.tmwr.org/pre-proc-table.html
In most cases, I don't find it useful to normalize an outcome. The exception is sometimes for deep learning models; in some situations, they do a terrible job converging unless the outcome is scaled and centered.
@JuliaSilge Hi Julia, thanks a lot for your explanation!
This is gold, thank you Julia!
Thank you, Julia! It was very helpful, very informative, and brought many aha moments.
Really helpful, thank you! Quick question. Should we use step_normalize after step_dummy? Wouldn’t that normalize our dummy variables (1s and 0s)? How does that affect different models?
Yep, it does center and scale the 1s and 0s from the dummy variables. If you are dealing with a model that is sensitive to scaling, this is actually a good idea.
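A minimal sketch of that ordering in a recipe (the data frame and outcome names here are placeholders, not from the video):

```r
library(tidymodels)

rec <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%  # categorical -> 0/1 indicator columns
  step_normalize(all_predictors())          # centers/scales everything, dummies included
```

For a model that is insensitive to predictor scale, like a tree-based model, you could drop `step_normalize()` entirely.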
Hey Julia, great video! At 42:10 you used some shortcut to format your code; was that the styler plug-in? Thanks!
Not a plug-in really, just commands in RStudio: I use a Mac, so first I did Command-A to select all, and then Command-I to reindent lines. I feel like I do this all the time as I'm working. 😅
Hello Julia, thanks for your video! Quick question: could I use CV folds to fit a model and compute its performance, and then finally test the model using the test data? Does tidymodels have a function for this?
When you use resampled data like CV folds, your goal is estimating performance (maybe for different sets of hyperparameters, like tuning): www.tmwr.org/resampling.html
You don't fit a final model using the CV folds.
In tidymodels, the function to fit your final model to the whole training set and evaluate on the test set is `last_fit()`:
www.tmwr.org/workflows.html#evaluating-the-test-set
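As a rough sketch (the workflow and split objects here are placeholder names):

```r
library(tidymodels)

# `final_wf` is your finalized workflow; `data_split` comes from initial_split()
final_res <- last_fit(final_wf, split = data_split)  # fits on the training set, evaluates on the test set
collect_metrics(final_res)                           # test-set performance metrics
```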
@JuliaSilge Ok, I got it. Thanks!
Hi Julia, in this part of the code I received the following error:

set.seed(234)
glm_rs <- glm_spec %>%
  fit_resamples(diversity ~ .,
                folds,
                metrics = metric_set(roc_auc, sens, spec),
                control = control_resamples(save_pred = TRUE))

Error:
The combination of metric functions must be:
- only numeric metrics
- a mix of class metrics and class probability metrics
The following metric function types are being mixed:
- prob (roc_auc)
- class (sens)
- other (spec)
Hmmmmm, I can't reproduce that problem. Can you make sure you have updated versions of the packages?
@JuliaSilge The code used the abbreviations for sensitivity and specificity; I spelled them out entirely and that worked for me. Awesome videos! Love the standardization of tidymodels! Super helpful.
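For anyone hitting the same error, the fully spelled-out version of the metric set (assuming a current version of yardstick) would be:

```r
library(yardstick)

# sensitivity/specificity are the spelled-out equivalents of sens/spec
my_metrics <- metric_set(roc_auc, sensitivity, specificity)
```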
Julia, this is the most amazing tidymodels tutorial on YouTube! Great job! If I may ask a question: why is it important to keep numeric features with zero variance out of the training step? Is a feature with zero variance detrimental to the model, or is it something else? Lastly, when we predict on a test set with a model that used 10-fold cross-validation, which of the models from the 10 folds does R use to predict? Thank you so much!
Yes, there are many models where predictors with a single unique value will cause the model to fail. On the other question, when you use `fit_resamples()` you do NOT keep any of those models; those models are used only for computing performance metrics and then thrown away. If you check out the blog post here, you'll notice that the model used for predicting the test set is trained on the whole training set:
juliasilge.com/blog/tuition-resampling/
Thanks again for a wonderful TidyTuesday! Just a question, is there a reason you're both specifying the formula in the recipe and in the fit-call?
Both need to know about what is being fit, either to create a model matrix or for actual fitting. You can check out some docs here: tidymodels.github.io/parsnip/reference/fit.html
And here: tidymodels.github.io/recipes/reference/recipe.html
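One way to avoid stating the formula twice (a sketch with placeholder data names) is to bundle the recipe and model in a workflow, so the formula lives only in the recipe:

```r
library(tidymodels)

rec <- recipe(diversity ~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%                             # workflow takes roles/formula from the recipe
  add_model(logistic_reg() %>% set_engine("glm"))

fitted_wf <- fit(wf, data = train_data)           # no formula needed in the fit call
```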
Thanks for your video. I learned some dplyr techniques from it.
Julia, this is a great tutorial, but I'm unable to run the code. This is the error I'm getting. Thanks in advance for the help:

filter(category == "Total Minority") %>%
  mutate(TotalMinority = enrollment / total_enrollment)
Error in filter(category == "Total Minority") %>% mutate(TotalMinority = enrollment/total_enrollment) :
  could not find function "%>%"
> filter(category == "Total Minority")
Error in as.ts(x) : object 'category' not found
> mutate(TotalMinority = enrollment / total_enrollment)
Error in mutate(TotalMinority = enrollment/total_enrollment) :
  could not find function "mutate"
Hmmmm, sounds like you haven't loaded the tidyverse: `library(tidyverse)`
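That is, something like this at the top of the script (with `df` standing in for whatever data frame the tutorial uses):

```r
library(tidyverse)  # provides dplyr::filter(), dplyr::mutate(), and the %>% pipe

df %>%
  filter(category == "Total Minority") %>%
  mutate(TotalMinority = enrollment / total_enrollment)
```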