Feature engineering & interpretability for xgboost with board game ratings

  • Published: Nov 4, 2024

Comments • 36

  • @michal.tomczyk
    @michal.tomczyk 2 years ago +7

    Great stuff. Amazing R programming and data analysis skills!

  • @IamHopeman47
    @IamHopeman47 2 years ago

    Thanks for the great screencast!
    For me the little introduction to the finetune package was a big eye-opener. Very smart approach.
    Looking forward to more of your content (especially the advanced topics).

  •  2 years ago +3

    5:12 your excitement for that distribution LOL

  • @N1loon
    @N1loon 2 years ago +3

    Amazing stuff... I really admire your knowledge in this super complex field.
    I think it would be cool if you did an episode mainly centered around all the nuances of tuning. I think this video offered a good general introduction to tuning principles but didn't go too deep into the details, such as finding the right balance between over- and underfitting, working with grids, etc. Just an idea!
    Anyways, really love your content, Julia!

    • @JuliaSilge
      @JuliaSilge  2 years ago

      For now, you might check out some of my other blog posts that focus on hyperparameter tuning, like these:
      juliasilge.com/blog/scooby-doo/
      juliasilge.com/blog/sf-trees-random-tuning/

  • @whynotfandy
    @whynotfandy 2 years ago

    I love the new lights! They're a great addition to your intros and outros.

  • @briancostello939
    @briancostello939 2 years ago +2

    Fantastic content as always! Quick question: what are your thoughts on training multiple models using the top parameters from hyperparameter tuning and ensembling the predictions? Is there an easy way to do something like this with tidymodels? Thanks!

    • @JuliaSilge
      @JuliaSilge  2 years ago +3

      Yep, absolutely a great approach to work toward a bit of performance gain! You can implement ensembling in tidymodels with the stacks package (a rough sketch follows after this thread):
      stacks.tidymodels.org/

    • @briancostello939
      @briancostello939 2 years ago +1

      @JuliaSilge Great, thanks so much!
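
For readers who want to try the stacks approach mentioned above, here is a minimal, hypothetical sketch. The tuning results `xgb_res` and `svm_res` and the hold-out data `game_test` are stand-ins, not objects from the video; the only assumption is that the candidates were tuned with predictions saved (e.g. `control = control_stack_grid()`).

```r
library(tidymodels)
library(stacks)

# `xgb_res` and `svm_res` are hypothetical tuning results, e.g. from
# tune_grid(..., control = control_stack_grid()) so predictions are kept
game_stack <-
  stacks() %>%
  add_candidates(xgb_res) %>%
  add_candidates(svm_res) %>%
  blend_predictions() %>%   # regularized meta-learner chooses member weights
  fit_members()             # refit the retained members on the training set

# `game_test` is a hypothetical hold-out set
predict(game_stack, new_data = game_test)
```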

  • @seadhna
    @seadhna 2 years ago +1

    Great video as always!
    Would it be possible to use native parsnip functions to clean the features instead of doing base string manipulation in a custom function? I think in another video you cleaned the tokens within the recipe step.

    • @JuliaSilge
      @JuliaSilge  2 years ago +1

      We definitely have tidymodels functions for lots of text manipulation, like those in textrecipes:
      textrecipes.tidymodels.org/
      Or functions like:
      recipes.tidymodels.org/reference/step_dummy_multi_choice.html
      But sometimes there isn't something that fits your particular data out of the box, in which case you can extend tidymodels like I walked through in this screencast.
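
To make that concrete, here is a small, hypothetical sketch of handling a comma-separated category column with textrecipes rather than a hand-rolled function. The toy `games` tibble and the 30-token cutoff are illustrative assumptions, not the video's actual code.

```r
library(tidymodels)
library(textrecipes)

# Toy data: each game lists its categories in one comma-separated string
games <- tibble(
  average  = c(7.2, 6.5, 8.1),
  category = c("Card Game, Fantasy", "Wargame", "Economic, Negotiation")
)

# A custom tokenizer that splits on the comma delimiter instead of on words
split_categories <- function(x) strsplit(x, ", ", fixed = TRUE)

rec <- recipe(average ~ category, data = games) %>%
  step_tokenize(category, custom_token = split_categories) %>%
  step_tokenfilter(category, max_tokens = 30) %>%   # keep the most common categories
  step_tf(category)                                  # one indicator-style column per category

rec %>% prep() %>% bake(new_data = NULL)
```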

  • @hnagaty
    @hnagaty 2 years ago

    Great screencast and very useful. Many thanks.

  • @codygoggin1097
    @codygoggin1097 2 years ago

    Great video Julia! What would be the proper function to use in order to fit this best model onto new data and view these predictions?

    • @JuliaSilge
      @JuliaSilge  2 years ago

      You use the `predict()` function! workflows.tidymodels.org/reference/predict-workflow.html
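
A small, hypothetical sketch of that, assuming `final_fitted` is the result of `last_fit()` and `new_games` is fresh data with the same predictor columns (both names are stand-ins):

```r
library(tidymodels)

# Pull the fitted workflow out of the last_fit() result
fitted_wf <- extract_workflow(final_fitted)

# Predictions only
predict(fitted_wf, new_data = new_games)

# Or predictions bound to the original columns
augment(fitted_wf, new_data = new_games)
```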

  • @ammarparmr
    @ammarparmr 2 years ago +1

    Thank you, fantastic video!
    If you don't mind, I have a question regarding `mtry`: how did we end up with `mtry` greater than 6 (the number of predictors)?
    Maybe I am confused about the concept.

    • @JuliaSilge
      @JuliaSilge  2 years ago +2

      After the feature engineering, there are a lot more predictors, 30 from the board game category alone. The data that goes into xgboost is the _processed_ data, not the data in its pre-feature-engineering form. (A small sketch of checking this follows after this thread.)

    • @ammarparmr
      @ammarparmr 2 years ago

      @JuliaSilge Well explained.
      Thank you so much!
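
A quick, hypothetical way to see this for yourself: prep and bake the recipe, then count columns. `game_rec` stands in for a recipe containing the feature engineering steps; it is not code from the video.

```r
library(tidymodels)

# `game_rec` is a hypothetical recipe with the feature engineering steps
processed <- game_rec %>% prep() %>% bake(new_data = NULL)

# mtry is bounded by the number of processed predictor columns,
# not by the handful of raw columns you started with
ncol(processed) - 1   # minus one for the outcome column
```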

  • @avnavcgm
    @avnavcgm 2 years ago

    Great video! What would then be the best way to 'save' the best trained model so you can predict with new observations in the future that aren't in the train/test split?

    • @JuliaSilge
      @JuliaSilge  2 years ago +2

      You can _extract_ the workflow from the trained "last fit" object and then save that as a binary, like with `readr::write_rds()`. I show some of that at the end of this blog post:
      juliasilge.com/blog/chocolate-ratings/
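
A minimal, hypothetical version of that save-and-reload workflow; `final_fitted`, `new_games`, and the file name are stand-ins:

```r
library(tidymodels)

# Extract the trained workflow from the last_fit() result and save it
fitted_wf <- extract_workflow(final_fitted)
readr::write_rds(fitted_wf, "board_game_xgb.rds")

# Later, in a fresh session: read it back and predict on new observations
fitted_wf <- readr::read_rds("board_game_xgb.rds")
predict(fitted_wf, new_data = new_games)
```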

  • @jaredwsavage
    @jaredwsavage 2 years ago

    Great video, Julia. Just a quick question: have you tried using lightgbm or catboost with `boost_tree()`? They are available in the treesnip package and generally run much faster than xgboost.

    • @JuliaSilge
      @JuliaSilge  2 years ago +1

      HA oh I have had SUCH installation issues with both of those. 🙈 I have a Mac M1 and you can see the current situation for catboost here:
      github.com/catboost/catboost/issues/1526
      I'll have to dig up the lightgbm problems somewhere.
      Anyway, those are great implementations if you can get them to install!

    • @jaredwsavage
      @jaredwsavage 2 years ago

      @JuliaSilge Wow, as a Windows user I'm usually the one on the wrong end of installation issues. 😁

  • @charithwijewardena9493
    @charithwijewardena9493 2 years ago

    Hi Julia, I have a question. I'm trying to get my head around the concept of data leakage. You build your model with the outcome being "average", but before you do your split you do EDA on everything. Are we not gaining insight into the test set by doing that? Should we be doing EDA only AFTER splitting our data? Thanks. :)

    • @JuliaSilge
      @JuliaSilge  2 years ago +1

      This is definitely an important thing to think about and make good decisions on. On the one hand, anything you do before the initial split could lead to data leakage. On the other hand, you need to understand something about your data in order to even create that initial split (like stratified sampling). It's most important that anything you will literally use in creating predictions (like feature engineering) is done after data splitting.

    • @charithwijewardena9493
      @charithwijewardena9493 2 years ago

      Cool, thank you for the reply. 🙏🏽

  • @d_b_
    @d_b_ 1 year ago

    Is one takeaway from 44:03 that I should create a short play game for older people with few players that has printed miniatures based in deductive fantasy animal war?

  • @davidjackson7675
    @davidjackson7675 2 years ago

    Thanks, that is interesting as always.

  • @PA_hunter
    @PA_hunter 2 years ago

    Would it be bad if I used tidymodels steps for non-ML data wrangling? Haha!

    • @JuliaSilge
      @JuliaSilge  2 years ago +3

      I think some people do this for sure. Some things to keep in mind are that it is set up for learning from training data and applying to testing data, so I'd keep that design top of mind for using in other contexts. You can see this blog post where I used recipes for unsupervised work, without heading into a predictive model:
      juliasilge.com/blog/cocktail-recipes-umap/
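
A tiny, hypothetical sketch of using a recipe purely for preprocessing, in the spirit of that post; the toy data and steps are assumptions, not Julia's code:

```r
library(tidymodels)

# Toy numeric data, no outcome and no model anywhere in sight
df <- tibble(
  x1 = c(1, 2, 3, 4),
  x2 = c(10, 8, 6, 4),
  x3 = c(0.2, 0.4, 0.6, 0.8)
)

rec <- recipe(~ ., data = df) %>%        # all columns treated as predictors
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

rec %>% prep() %>% bake(new_data = NULL)  # just the transformed data frame
```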

  • @ndiyekebonye208
    @ndiyekebonye208 2 years ago

    Still getting an error at `tune_race_anova()` despite updating all my packages and installing the latest versions of dials and finetune. Is there a way to overcome this?

    • @JuliaSilge
      @JuliaSilge  2 years ago

      If you are having trouble with one of the racing functions, I recommend just trying a plain old `fit()` with your workflow, or perhaps using `tune_grid()`. Those functions will help you diagnose where the model failures are happening. (A short sketch follows after this thread.)

    • @ndiyekebonye208
      @ndiyekebonye208 2 years ago

      @JuliaSilge Thank you so much, will surely try this!
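
A short, hypothetical sketch of that debugging path; `game_wf`, `game_train`, and `game_folds` are stand-ins for your own workflow, training data, and resamples:

```r
library(tidymodels)

# Step 1: try a single plain fit (any tune() placeholders in the workflow
# would need to be replaced with fixed values first, e.g. via finalize_workflow())
fit(game_wf, data = game_train)

# Step 2: if that works, try ordinary grid tuning instead of racing
res <- tune_grid(game_wf, resamples = game_folds, grid = 5)

# Inspect warnings and errors captured during tuning
collect_notes(res)
```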

  • @russelllavery2281
    @russelllavery2281 1 year ago

    Cannot read the fonts.

  • @danmungai555
    @danmungai555 2 years ago

    Enlightening

    • @barasatu123
      @barasatu123 2 years ago

      Would you please make a video on the funModeling package and its different functions?