XGBoost Part 1 (of 4): Regression

  • Published: Dec 1, 2024

Comments • 822

  • @statquest
    @statquest  4 years ago +64

    Corrections:
    16:50 I say "66", but I meant to say "62.48". However, either way, the conclusion is the same.
    22:03 In the original XGBoost documents they use the epsilon symbol to refer to the learning rate, but in the actual implementation, this is controlled via the "eta" parameter. So, I guess to be consistent with the original documentation, I made the same mistake! :)
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
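
    A minimal sketch of where "eta" appears in practice, assuming the Python xgboost package's native API (the data and settings below are placeholders, not the video's example):

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(50, 2)                  # placeholder features
    y = np.random.rand(50)                     # placeholder target
    dtrain = xgb.DMatrix(X, label=y)

    # "eta" is the learning rate; "lambda" is the regularisation parameter.
    params = {"objective": "reg:squarederror", "eta": 0.3, "lambda": 1.0}
    booster = xgb.train(params, dtrain, num_boost_round=10)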

    • @blacklistnr1
      @blacklistnr1 4 years ago +3

      Terminology alert!! "eta" refers to the Greek letter Η (upper case)/η (lower case); it is one of Greek's many "ee" sounds (as in wheeeeee), and it's definitely not epsilon.

    • @MrPopikeyshen
      @MrPopikeyshen 3 years ago +4

      like just for this sound 'bip-bip-pilulipup'

    • @servaastilkin7733
      @servaastilkin7733 1 year ago

      @@blacklistnr1 I came here to say the same thing.
      Maybe this helps:
      èta - η sounds somewhat like the vowels in "air"
      epsilon - ε sounds somewhat like the vowel in "get"

  • @pulkitkapoor4091
    @pulkitkapoor4091 3 years ago +225

    I got my first job in Data Science because of the content you prepare and share.
    Can't thank you enough Josh. God bless :)

  • @giannislazaridis6788
    @giannislazaridis6788 4 years ago +52

    I'm starting to write my Master's thesis and there were still some things I needed to make clear before using XGBoost for my classification problem. God bless you

  • @Hardson
    @Hardson 5 years ago +396

    That's why I pay for my Internet.

  • @nikilisacrow2339
    @nikilisacrow2339 3 years ago +53

    Can I just say I LOVE STATQUEST! Josh covers the intuition of a complex algorithm and the math of it so well, and then to make it into an engaging video that is so easy to watch is just amazing! I just LOVE this channel. You boosted the gradient of my learning on machine learning in an extreme way. Really appreciate these videos

    • @statquest
      @statquest  3 years ago +5

      Wow! Thank you very much!!! I'm so glad you like the videos. :)

  • @johnhutton5491
    @johnhutton5491 5 months ago +4

    This dude puts the STAR in starmer. You are an international treasure.

    • @statquest
      @statquest  5 months ago +2

      Thank you! :)

  • @guoshenli4193
    @guoshenli4193 4 years ago +2

    I am a graduate student at Duke. Since some of the material is not covered in class, I always watch your videos to boost my knowledge. Your videos help me a lot in learning the concepts of these tree models!! Great thanks to you!!!!! You make a lot of great videos and contribute a lot to online learning!!!!

    • @statquest
      @statquest  4 years ago

      Thank you very much and good luck with your studies! :)

  • @prasanshasatpathy6664
    @prasanshasatpathy6664 2 years ago +4

    Nowadays I write a "bam note" for important notes for algorithms.

  • @nitinvijayy
    @nitinvijayy 2 years ago +2

    Best Channel for anyone Working in the Domain of Data Science and Machine Learning.

  • @ChingFungChan-b4l
    @ChingFungChan-b4l 3 months ago +1

    Hi Josh,
    I just bought your illustrated guide in PDF. This is the first time I've supported someone on social media. Your videos helped me a lot with my learning. Can't express how grateful I am for these learning materials. You broke down monster math concepts and equations into baby monsters that I can easily digest. I hope that by making this purchase, you get the most contribution out of my support.
    Thank you!

    • @statquest
      @statquest  3 months ago

      Thank you very much for supporting StatQuest! It means a lot to me that you care enough to contribute. BAM! :)

  • @mainhashimh5017
    @mainhashimh5017 2 years ago +3

    Man, the quality and passion put into this. As well as the sound effects! I'm laughing as much as I'm learning. DAAANG.
    You're the f'ing best!

    • @statquest
      @statquest  2 years ago +1

      Thank you very much! :)

  • @glowish1993
    @glowish1993 5 years ago +4

    You make learning math and machine learning interesting and allow viewers to understand the essential points behind complicated algorithms, thank you for this amazing channel :)

  • @shhdeshp
    @shhdeshp 10 months ago +1

    I just LOVE your channel! Such a joy to learn some complex concepts. Also, I've been trying to find videos that explain XGBoost under the hood in detail and this is the best explanation I've come across. Thank you so much for the videos and also boosting them with an X factor of fun!

    • @statquest
      @statquest  10 months ago

      Awesome, thank you!

  • @DonDon-gs4nm
    @DonDon-gs4nm 4 years ago +7

    After watching your video, I understood the concept of 'understanding'.

  • @kennywang9929
    @kennywang9929 4 years ago +2

    Man, you do deserve all the thanks from the comments! Waiting for part2! Happy new year!

    • @statquest
      @statquest  4 years ago +1

      Thanks!!! I just recorded Part 2 yesterday, so it should be out soon.

  • @PauloBuchsbaum
    @PauloBuchsbaum 4 years ago +2

    An incredible job of clear, concise and non-pedantic explanation. Absolutely brilliant!

    • @statquest
      @statquest  4 years ago

      Thank you very much!

  • @andreitolkachev8295
    @andreitolkachev8295 3 years ago +2

    I wanted to watch this video last week, but you sent me on a magical journey through adaboost, logistic regression, logs, trees, forests, gradient boosting.... Good to be back

  • @breopardo6691
    @breopardo6691 3 years ago +5

    In my heart, there is a place for you! Thank you Josh!

  • @jaikishank
    @jaikishank 4 years ago +2

    Thanks Josh for your explanation. XGBoost explanation cannot be made simpler and illustrative than this. I love your videos.

    • @statquest
      @statquest  4 years ago +1

      Thank you very much! :)

  • @pavankumar6992
    @pavankumar6992 5 years ago +3

    Fantastic explanation for XGBoost. Josh Starmer, you are the best. Looking forward to your Neural Network tutorials.

    • @statquest
      @statquest  5 years ago +2

      Thanks! I hope to get to Neural Networks as soon as I finish this series on XGBoost (which will have at least 3 more videos).

  • @hanyang4321
    @hanyang4321 4 years ago +2

    I watched all of the videos in your channel and they're extremely awesome! Now I have much deeper understanding in many algorithms. Thanks for your excellent work and I'm looking forward to more lovely videos and your sweet songs!

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @modandtheganggaming3617
    @modandtheganggaming3617 4 years ago +5

    Thank you! I'd been waiting for an XGBoost explanation for so long

    • @statquest
      @statquest  4 years ago +2

      I'm recording part 2 today (or tomorrow) and it will be available for early access on Monday (and for everyone a week from monday).

  • @gawdman
    @gawdman 4 years ago +5

    Hey Josh! This is fantastic. As an aspiring data scientist with a couple of job interviews coming up, this really helped!

    • @statquest
      @statquest  4 years ago

      Awesome!!! Good luck with your interviews and let me know how they go. :)

  • @moidhassan5552
    @moidhassan5552 4 years ago +9

    Wow, I am really interested in Bioinformatics and was learning Machine Learning techniques to apply to my problems and out of curiosity, I checked your LinkedIn profile and turns out you are a Bioinformatician too. Cheers

  • @tusharsub1000
    @tusharsub1000 3 years ago +1

    I had left all hope of learning machine learning owing to its complexity. But because of you I am still giving it a shot..and so far I am enjoying...

  • @RidWalker
    @RidWalker 1 year ago

    I've never had so much fun learning something new! Not since I stared at my living room wall for 20 min and realized it wasn't pearl, but eggshell white! Thanks for this!

    • @statquest
      @statquest  1 year ago +2

      Glad you got the wall color sorted out! Bam! :)

  • @kamalamarepalli1165
    @kamalamarepalli1165 7 months ago +1

    I have never seen a data science video like this... so informative, very clear, a super explanation of the math, and wonderful animation and an energetic voice... Learning many things very easily... thank you so much!!

    • @statquest
      @statquest  7 months ago

      Thank you very much!

  • @CDALearningHub
    @CDALearningHub 4 years ago +2

    Thank you! Super easy to understand one of the important ML algorithms, XGBoost. The visual illustrations are the best part!

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @Jagentic
    @Jagentic 3 months ago +2

    At some point I realized that AI/ML for data science (which I'm currently amid) is really just the ultimate expression of statistics, using machine learning to produce mind-boggling scale - and that calc and trig, linear algebra and Python, and some computer science are all just tools in the statistician's box, which makes the data science. But just like someone with a toolbox full of hammers and saws, one needs to know how, when and why to use them to build a fine house. Holy Cow ¡BAM!

  • @jjlian1670
    @jjlian1670 4 years ago +5

    I have been waiting for your video for XGBoost, hope for LightGBM next!

  • @anupriy
    @anupriy 2 years ago +2

    Thanks for making such great videos, sir! You indeed get each concept CLEARLY EXPLAINED.

  • @hellochii1675
    @hellochii1675 5 years ago +19

    XGBoosting! This must be my Christmas 🎁 ~~ Happy holidays ~

    • @statquest
      @statquest  5 years ago +5

      Yes, this is sort of an early christmas present. :)

  • @karannchew2534
    @karannchew2534 3 years ago +1

    For my future reference.
    1) Initialize with a predicted value, e.g. 0.5.
    2) Get residuals: each sample's actual value vs. the initial predicted value.
    3) Build a mini tree using the residual values of the samples.
    .Residuals
    .Different values of the feature act as cut-off points at branches. Each value gives a set of Similarity and Gain scores.
    ..Similarity (uses lambda, the regularisation parameter) - measures how close the residual values are to each other
    ..Gain (affected by lambda)
    .Pick the feature value that gives the highest Gain - this determines how to split the data - which creates the branch (and leaves) - which produces a mini tree.
    4) Prune the tree, using the gain threshold (aka complexity parameter), gamma.
    If Gain > gamma, keep the branch, else prune.
    5) Get the Output Value (OV) for each leaf. Mini tree done.
    OV = sum of Residuals / (number of Residuals + lambda)
    6) Predict a value for each sample using the newly created mini tree.
    Run each sample through the mini tree.
    New predicted value = last predicted value + eta * OV
    7) Get a new set of residuals: new predicted value vs. actual value of each sample.
    8) Repeat from step 3. Create more mini trees...
    .Each tree 'boosts' the prediction - improving the result.
    .Each tree creates new residuals as input for creating the next new tree.
    ...until there is no more improvement or the number of trees is reached.
    (A minimal code sketch of these steps follows below.)
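
    A minimal from-scratch sketch of the steps above, assuming a single feature and depth-1 "mini trees" (the function names and data values are made up for illustration; this is not the xgboost library itself):

    # Toy illustration of steps 1-8 for one feature and depth-1 trees.
    def similarity(res, lam):
        # Similarity = (sum of residuals)^2 / (number of residuals + lambda)
        return sum(res) ** 2 / (len(res) + lam)

    def output_value(res, lam):
        # Leaf Output Value = sum of residuals / (number of residuals + lambda)
        return sum(res) / (len(res) + lam)

    def build_stump(x, res, lam, gamma):
        # Try each midpoint between adjacent x values as a cut-off and keep the
        # split with the highest Gain; return None if that Gain does not beat gamma.
        root_sim = similarity(res, lam)
        order = sorted(range(len(x)), key=lambda i: x[i])
        best = None
        for a, b in zip(order, order[1:]):
            thr = (x[a] + x[b]) / 2.0
            left = [res[i] for i in range(len(x)) if x[i] < thr]
            right = [res[i] for i in range(len(x)) if x[i] >= thr]
            if not left or not right:
                continue
            gain = similarity(left, lam) + similarity(right, lam) - root_sim
            if best is None or gain > best[0]:
                best = (gain, thr, output_value(left, lam), output_value(right, lam))
        if best is None or best[0] <= gamma:
            return None  # prune: the split is not worth keeping
        return best

    # One boosting round on made-up (dosage, effectiveness) data.
    x = [10.0, 20.0, 25.0, 35.0]
    y = [-10.5, 6.5, 7.5, -7.5]
    eta, lam, gamma = 0.3, 1.0, 0.0
    pred = [0.5] * len(y)                               # step 1
    res = [yi - pi for yi, pi in zip(y, pred)]          # step 2
    stump = build_stump(x, res, lam, gamma)             # steps 3-5
    if stump is not None:
        _, thr, left_ov, right_ov = stump
        pred = [p + eta * (left_ov if xi < thr else right_ov)   # step 6
                for p, xi in zip(pred, x)]
    res = [yi - pi for yi, pi in zip(y, pred)]          # step 7; then repeat from step 3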

    • @statquest
      @statquest  3 years ago

      Noted

    • @carlpiaf4476
      @carlpiaf4476 1 year ago

      Could be improved by adding how the decision cut off point is made.

  • @machi992
    @machi992 4 years ago +1

    I actually started looking for XGBoost, but every video assumed I knew something else first. I ended up watching more than 8 videos just so I could understand everything and meet the prerequisites, and I found them awesome.

    • @statquest
      @statquest  4 years ago +1

      Bam! Congratulations!

  • @guillemperdigooliveras5351
    @guillemperdigooliveras5351 5 years ago +24

    As always, loved it! I can now wear my Double Bam t-shirt even more proudly :-)

    • @statquest
      @statquest  5 years ago +1

      Awesome!!!!!! :)

    • @anggipermanaharianja6122
      @anggipermanaharianja6122 3 years ago +1

      why not wear the Triple Bam?

    • @guillemperdigooliveras5351
      @guillemperdigooliveras5351 3 years ago +1

      @@anggipermanaharianja6122 for a second you gave me hopes about new Statquest t-shirts being available with a Triple Bam drawing!

  • @SaraSilva-zu7wn
    @SaraSilva-zu7wn 3 years ago +1

    Clear explanations, little songs and a bit of silliness. Please keep them all, they're your trademark. :-)

  • @oldguydoesntmatter2872
    @oldguydoesntmatter2872 4 years ago +2

    I've been using Random Forests with various boosting techniques for a few years. My regression (not classification) database has 500,000 - 5,000,000 data points with 50-150 variables, many of them highly correlated with some of the others. I like to "brag" that I can overfit anything. That, of course, is a problem, but I've found a tweak that is simple and fast that I haven't seen elsewhere.
    The basic idea is that when selecting a split point, pick a small number of data vectors randomly from the training set. Pick the variable(s) to split on randomly. (Variables plural because I usually split on 2-4 variables into 2^n boosting regions - another useful tweak.) The thresholds are whatever the data values are for the selected vectors. Find the vector with the best "gain" and split with that. I typically use 5 - 100 tries per split and a learning rate of .5 or so. It's fast and mitigates the overfitting problem.
    Just thought someone might be interested...
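
    A rough sketch of the tweak described above, simplified to one split variable per try (the names and defaults are made up; this is not how any particular library implements it):

    import random

    def similarity(res, lam):
        return sum(res) ** 2 / (len(res) + lam)

    def random_split(X, res, n_tries=20, lam=1.0):
        # Instead of scanning every threshold, sample a few random (row, column)
        # pairs and use those data values as candidate thresholds, keeping the
        # candidate with the highest Gain.
        n_rows, n_cols = len(X), len(X[0])
        root_sim = similarity(res, lam)
        best = None  # (gain, column, threshold)
        for _ in range(n_tries):
            i, j = random.randrange(n_rows), random.randrange(n_cols)
            thr = X[i][j]
            left = [r for row, r in zip(X, res) if row[j] < thr]
            right = [r for row, r in zip(X, res) if row[j] >= thr]
            if not left or not right:
                continue
            gain = similarity(left, lam) + similarity(right, lam) - root_sim
            if best is None or gain > best[0]:
                best = (gain, j, thr)
        return best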

  • @liuxu7879
    @liuxu7879 2 years ago +1

    Hey Josh, I really love your content; you are the one who really explains the model details.

    • @statquest
      @statquest  2 years ago

      WOW! Thank you so much for supporting StatQuest!

  • @jackytsui422
    @jackytsui422 4 years ago +1

    I am learning machine learning from scratch and your videos helped me a lot. Thank you very much!!!!!!!!!!!

  • @geminicify
    @geminicify 4 years ago +2

    Thank you for posting this! I have been waiting for it for long!

  • @nickbohl2555
    @nickbohl2555 5 years ago +1

    I have been super excited for this quest! Thanks as always Josh

  • @antoniojunior-dados
    @antoniojunior-dados 2 years ago +1

    You are the best, Josh. Greetings from Brazil! We are looking forward to your video clearly explaining LightGBM!

    • @statquest
      @statquest  2 years ago +1

      I hope to have that video soon.

  • @gorilaz0n
    @gorilaz0n 2 years ago +2

    Gosh! I love your fellow-kids vibe!

  • @mangli4669
    @mangli4669 4 years ago +3

    Hey Josh, first I wanted to say thank you for your awesome content. You are the number one reason I am finishing my degree haha! I would love a behind-the-scenes video about how you make your videos: how you prepare for a topic, how you make your animations and your fancy graphs! And some more singing of course!

    • @statquest
      @statquest  4 years ago

      That would be awesome. Maybe I'll do something like this in 2020. :)

  • @kn58657
    @kn58657 4 years ago +8

    I'm doing a club remix of the humming during calculations. Stay tuned!

    • @statquest
      @statquest  4 years ago +4

      Awesome!!!!! I can't wait to hear.

  • @shubhambhatia4968
    @shubhambhatia4968 4 years ago +1

    woah woah woah woah!... now i got the clear meaning of understanding after coming to your channel...as always i loved the xgboost series as well. thank you brother.;)

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

  • @王涛-d3y
    @王涛-d3y 4 years ago +1

    Awesome video!!! It's the best tutorial I have ever seen about XGBoost. Thank you very much!

  • @alex_zetsu
    @alex_zetsu 4 years ago +1

    So what if the input data contains multiple inputs? So like "drug dosage, patient is adult, patient resident nation"? In our video example, you compared "Dosage < 22.5" and "Dosage < 30" where we decided "Dosage < 30" had a better gain. So with more than one input would we be considering "Dosage < 22.5," "Dosage < 30," "Patient is adult," "Patient lives in America," "Patient lives in Japan," "Patient lives in Germany,"... and "Patient lives in none of the above" to find the most gain? Also, I just realized that you'd want more samples than you have categories if you have categorical input since if all the patients lived in separate countries, you'd be able to get high similarity scores even if patient's residence was irrelevant to our output.

    • @statquest
      @statquest  4 years ago +1

      When you have more than one feature/variable that you are using to make a prediction, you calculate the gain for all of them and pick the one with the highest gain. And yes, if one feature has 100% predictive power, then that's not very helpful (unless it is actually related to what you want to predict).

    • @alex_zetsu
      @alex_zetsu 4 years ago +1

      Well, if we had a large sample where all 3,000 people who live in Japan had drug effectiveness less than 5 and people from other nations varied from 0 to 30 (even before counting drug dose), we'd be sure residence was relevant. If the sample had 4 people and we had 30 nations (plus none of the above) as inputs, the 100% predictive power of residence wouldn't be very helpful, since they would get high similarity scores regardless of whether it was relevant or not.

  • @monkeydrushi
    @monkeydrushi 2 years ago +1

    God, thank you for your "beep boop" sounds. They just made my day!

  • @vijayyarabolu9067
    @vijayyarabolu9067 4 years ago +2

    8:45 checking my headphones - BAM; no problem with my headphones; 10:17 Double BAM; headphones are perfect

  • @lxk19901
    @lxk19901 5 years ago +3

    This is really helpful, thanks for putting them together!

  • @smarttradzt4933
    @smarttradzt4933 3 years ago +1

    whenever i can't understand anything, I always think of statquest...BAM!

  • @urvishfree0314
    @urvishfree0314 3 years ago +1

    Thank you so much, I watched it 3-4 times already but finally everything makes sense. Thank you so much

  • @DrJohnnyStalker
    @DrJohnnyStalker 4 years ago +1

    Best XGBoost explanation i have ever seen! This is Andrew Ng Level!

    • @statquest
      @statquest  4 years ago +1

      Thank you very much! I just released part 4 in this series, so make sure you check them all out. :)

    • @DrJohnnyStalker
      @DrJohnnyStalker 4 years ago +1

      @@statquest
      I have binge watched them all. All are great and by far the best intuitive explanation videos on XGBoost.
      A series on lightgbm and catboost would complete the pack of gradient boosting algorithms. Thx for this great channel.

    • @statquest
      @statquest  4 years ago

      @@DrJohnnyStalker Thanks! :)

  • @shivasaib9023
    @shivasaib9023 4 years ago +2

    I fell in love with XGBOOST. While Pruning every node I was like whatttt :p

  • @fivehuang7557
    @fivehuang7557 5 years ago +1

    Happy holiday man! Waiting for your next episode

    • @statquest
      @statquest  5 years ago +1

      It should be out in the first week in 2020.

  • @vladimirmihajlovic1504
    @vladimirmihajlovic1504 8 months ago +1

    Love StatQuest. Please cover lightGBM and CatBoost!

    • @statquest
      @statquest  7 months ago

      I've got catboost, you can find it here: statquest.org/video-index/

  • @gokulprakash8694
    @gokulprakash8694 3 years ago +1

    Stat quest is the bestttttt!!!
    love it love it love it!!!!!!

  • @natashadavina7592
    @natashadavina7592 4 years ago +2

    your videos have helped me a lot!! thank you so much i hope you keep on making these videos:)

  • @palvinderbhatia3941
    @palvinderbhatia3941 1 year ago +1

    Wow woww wowww!! How can you explain such complex concepts so easily? I wish I could learn this art from you. Big Fan!! 🙌🙌

  • @nilanjana1588
    @nilanjana1588 1 year ago +1

    You make it a little bit easier to understand, Josh. I am saved.

  • @firesongs
    @firesongs 2 years ago

    1. Higher similarity score = Better?
    2. How do you determine what gamma is? You just randomly pick it?

    • @statquest
      @statquest  2 years ago

      1) Yes 2) Use cross validation. See: ruclips.net/video/GrJP9FLV3FE/видео.html
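
      For reference, one way "use cross validation" could look in practice, assuming the Python xgboost and scikit-learn packages (the data and candidate gamma values are placeholders):

      # Sketch: pick gamma (the pruning threshold) by cross-validation.
      import numpy as np
      from sklearn.model_selection import GridSearchCV
      from xgboost import XGBRegressor

      rng = np.random.default_rng(0)
      X = rng.random((100, 3))                    # placeholder features
      y = rng.random(100)                         # placeholder target

      search = GridSearchCV(
          estimator=XGBRegressor(n_estimators=100, learning_rate=0.3, reg_lambda=1.0),
          param_grid={"gamma": [0, 0.1, 1, 10]},  # candidate pruning thresholds
          scoring="neg_mean_squared_error",
          cv=5,
      )
      search.fit(X, y)
      print(search.best_params_)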

  • @eytansuchard8640
    @eytansuchard8640 1 year ago

    Thank you for this explanation. In python there is another regularization parameter, Alpha. Also, to the best of my knowledge the role of Eta is to reduce the error correction by subsequent trees in order to avoid sum explosion and in order to control the residual error correction by each tree.

    • @statquest
      @statquest  1 year ago

      I believe that alpha controls the depth of the tree.

    • @eytansuchard8640
      @eytansuchard8640 1 year ago

      @@statquest The maximal depth is a different parameter. Maybe Alpha regulates how often the depth can grow if it did not reach the maximal depth.

    • @statquest
      @statquest  1 year ago +1

      @@eytansuchard8640 Ah, I should have been more clear - I believe alpha controls pruning. At least, that's what it does here: ruclips.net/video/D0efHEJsfHo/видео.html

    • @eytansuchard8640
      @eytansuchard8640 1 year ago

      @@statquest Thanks for the link. It will be watched.

  • @andrewnguyen5881
    @andrewnguyen5881 4 years ago

    Thank you for all of your videos! Super helpful and educational. I did have some questions for follow-up:
    - With Gamma being so important in the pruning process, how do you select gamma? I ask because aren't there situations where you could select a Gamma that would/wouldn't prune ALL branches, which would defeat the purpose of pruning right?
    - Is lambda a parameter where:
    a. You have to test multiple values and tune your model to find the most suitable lambda (i.e. set your model to use one lambda), or
    b. You test multiple lambdas per tree so different trees will have different lambdas?

    • @statquest
      @statquest  4 years ago

      If you want to know all about using XGBoost in practice, see: ruclips.net/video/GrJP9FLV3FE/видео.html

    • @andrewnguyen5881
      @andrewnguyen5881 4 years ago +1

      @@statquest Great! I was saving that video until i finished the other XGBoost videos

    • @andrewnguyen5881
      @andrewnguyen5881 4 years ago

      @@statquest Will this video also cover Cover from the Classification video?

    • @statquest
      @statquest  4 years ago

      Not directly, since I simply limited the size of the trees rather than worry too much about the minimum number of observations per leaf.

  • @shaz-z506
    @shaz-z506 5 years ago +1

    Extreme Bam! Finally xgboost is here

    • @statquest
      @statquest  5 years ago +1

      That's a good one! :)

  • @anggipermanaharianja6122
    @anggipermanaharianja6122 3 years ago +1

    Awesome... this vid should be mandatory in every school

  • @ashfaqueazad3897
    @ashfaqueazad3897 5 years ago +1

    Life saver. Was waiting for this.

  • @bernardmontgomery3859
    @bernardmontgomery3859 5 years ago +1

    xgboosting! my Christmas gift!

  • @sajjadabdulmalik4265
    @sajjadabdulmalik4265 3 years ago +2

    You are always awesome, I've never seen a better explanation ❤️❤️ big fan 🙂🙂.. Triple bammm!!! Hope we have LightGBM coming soon.

    • @statquest
      @statquest  3 years ago +1

      I've recently posted some notes on LightGBM on my twitter account. I hope to convert them into a video soon.

  • @HANTAIKEJU
    @HANTAIKEJU 4 years ago +2

    Hi Josh, love your videos. Currently preparing for Data Science interviews based on your videos. Actually, I really want to hear one about LGBM!

    • @statquest
      @statquest  4 years ago

      I'll keep that in mind.

  • @whenmathsmeetcoding1836
    @whenmathsmeetcoding1836 4 years ago

    Gain in the Similarity score for the nodes can be considered a weighted reduction of the variance of the nodes. BTW, good attempt at making this digestible to all.
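
    A tiny numeric check of that claim, using made-up residuals and lambda = 0: the Gain from a split equals the drop in the sum of squared residuals, i.e. a variance reduction weighted by the leaf sizes.

    # Numeric check (lambda = 0): Gain == reduction in sum of squared residuals.
    def sim(r):
        return sum(r) ** 2 / len(r)          # Similarity score with lambda = 0

    def sse(r):
        m = sum(r) / len(r)
        return sum((x - m) ** 2 for x in r)  # squared residuals around the leaf mean

    left, right = [-10.0, -9.0], [7.0, 8.0, 6.0]   # made-up residuals
    both = left + right
    gain = sim(left) + sim(right) - sim(both)
    drop = sse(both) - (sse(left) + sse(right))
    print(gain, drop)                        # both print the same value (326.7)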

  • @yulinliu850
    @yulinliu850 5 years ago +1

    Great Xmas present! Thanks Josh!

  • @sidbhatia4230
    @sidbhatia4230 4 years ago +1

    Thanks, it helped a lot!
    Looking forward to part 2, and if possible please make one on catboost as well!

  • @vithaln7646
    @vithaln7646 4 years ago +2

    JOSH is the top data scientist in the world

    • @statquest
      @statquest  4 years ago

      Ha! Thank you very much! :)

  • @stylianosiordanis9362
    @stylianosiordanis9362 5 years ago

    please post slides, this is the best channel for ML. thank you

  • @omkarjadhav13
    @omkarjadhav13 5 years ago +5

    You are just amazing, Josh. Xtreme Bam!!!
    You make our lives so easy.
    Waiting for the neural net video and further XGBoost parts.
    Please plan a meetup in Mumbai. #queston

    • @statquest
      @statquest  5 years ago +2

      Thanks so much!!! I hope to visit Mumbai in the next year.

    • @ksrajavel
      @ksrajavel 4 years ago +2

      @@statquest Happy New Year, Mr. Josh.
      New year arrived. Awaiting you in India.

    • @statquest
      @statquest  4 years ago

      @@ksrajavel Thank you! Happy New Year!

  • @willw4096
    @willw4096 1 year ago

    1:51 2:36 XGBoost default setting is 0.5 3:11 XGBoost uses a special type of tree 3:51 4:00 5:12 5:25 8:15 10:22 10:49 12:35 12:58 14:00 15:08 15:23 16:10 17:40 18:41 19:17 20:22 21:40 22:03 23:30 23:54

  • @metiseh
    @metiseh 3 years ago +1

    Bam!!! I am totally hypnotized

  • @tc322
    @tc322 5 years ago +1

    Xtreme Christmas gift!! :) Thanks!!

  • @ramnareshraghuwanshi516
    @ramnareshraghuwanshi516 3 years ago

    Thanks for uploading this.. I am your biggest fan!! I have noticed too many ads these days, which are really disturbing :)

    • @statquest
      @statquest  3 years ago

      Sorry about the ads. YouTube does that and I cannot control it.

  • @aksaks2338
    @aksaks2338 4 years ago +4

    Hey Josh! Thanks for the video, just wanted to know when will you release part 2 and 3 of this?

    • @statquest
      @statquest  4 years ago

      Part 2 is already available for people with early access (i.e. channel members and patreon supporters). Part 3 will be available for early access in two weeks. I usually release videos to everyone 1 or 2 weeks after early access.

  • @tobiasksr23
    @tobiasksr23 3 years ago +1

    I just found this channel and I think it's amazing.

  • @iop09x09
    @iop09x09 5 years ago +1

    Wow! Very well explained, hats off.

  • @0xZarathustra
    @0xZarathustra 4 years ago +35

    pro tip: speed to 1.5x

  • @zachariahmarrero9358
    @zachariahmarrero9358 5 years ago

    You can change Xgboost’s default score. Set ‘base_score’ equal to the mean of your target variable (if using regression) or to the ratio of the majority class over sample size (if using classification). This will reduce the number of trees needed for fitting the algorithm and it will save a lot of time. If you don’t set the base score then the algorithm will, effectively, start by solving the problem of the mean. The reason why is because the mean has the unique property of being a ‘pretty good guess’ in the absence of any other meaningful information in the dataset. As another intuition, you’ll find too, that if you apply regularization too strongly that Xgboost will “predict” that essentially every case is either the mean or very close to it.

    • @statquest
      @statquest  5 years ago

      I'm not sure I understand what you mean by saying that if you don't set "base_score" then the algorithm starts by solving the problem of the mean. At 2:42 I mention that you can set the default "base_score" to anything, but the default value is 0.5. At least in R that's the default, which I'm pretty sure is different from solving the problem of the mean. But I might be missing something.

    • @zachariahmarrero9358
      @zachariahmarrero9358 5 years ago +1

      @@statquest Oh I see, I misinterpreted what you meant when you said 'this prediction can be anything'. The problem of the mean is just an ad hoc expression to say that the algorithm will spend roughly its first 25% of run time getting performance that is as good as simply starting with the mean when your eval metric is rmse. It's not literally trying to determine what the mean is, but your errors will pass 'through' the error achieved with a simple mean prediction. So rather than letting the algorithm do that, you can 'jump ahead' and have it start right at the mean. The end result is a model that relies on building fewer trees, which means your hyperparameter tuning effort will go faster. There's a github comment/thread about the base_score default for regression, and I believe someone there posted a more formal estimate of how much time is saved. I can say from personal experience that this one tweak has shaved days off my own analyses.

    • @statquest
      @statquest  5 years ago +2

      Ah! I see. And I saw that GitHub thread as well. I think it is interesting that "regular" gradient boost does exactly what you say, use the mean (for regression) or the odds of the data (for classification), rather than have a fixed default. In fact, starting with the mean or odds of the data is a fundamental part of Gradient Boosting, so, technically speaking, XGBoost is not a complete implementation of the algorithm since it omits that step. Anyway, thanks for the practical/applied advice. It's is very helpful.

    • @zachariahmarrero9358
      @zachariahmarrero9358 5 years ago +1

      @@statquest You're right, I hadn't realized that, but you even have it illustrated in your gradient boost video.
      btw I have probably seen a hundred tutorials/XGBoost explainers and yours is head and shoulders above the rest. It's incredibly clear, accessible, and accurate!
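
      A minimal sketch of the tip above, assuming the Python xgboost scikit-learn wrapper (placeholder data): start boosting from the mean of the target instead of the default base_score of 0.5.

      import numpy as np
      from xgboost import XGBRegressor

      rng = np.random.default_rng(0)
      X = rng.random((200, 4))                 # placeholder features
      y = 50 + 10 * rng.random(200)            # placeholder target, nowhere near 0.5

      # base_score sets the initial prediction that the trees correct from.
      model = XGBRegressor(n_estimators=100, base_score=float(y.mean()))
      model.fit(X, y)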

  • @iBenutzername
    @iBenutzername 2 years ago

    Hey Josh, the series is fantastic! I'd like to ask you to consider two more aspects of tree-based methods: 1) SHAP values (e.g., feature importance, interactions) and 2) nested data (e.g., daily measurements --> nested sampling?). I am more than happy to pay for that :-) thanks!

    • @statquest
      @statquest  2 years ago +1

      I'm working on SHAP already and I'll keep the other topic in mind.

    • @iBenutzername
      @iBenutzername 2 years ago +1

      @@statquest That's great news, can't wait to see it in my sub box! Thanks a lot!

  •  4 years ago +1

    Thank you for sharing this amazing video!

  • @sarrae100
    @sarrae100 4 years ago +1

    Love u Ppl, StatQuest the 👍💯, Super BAM!!!

  • @sachinrathi7814
    @sachinrathi7814 5 years ago +1

    Waiting for this video since long back.

    • @statquest
      @statquest  5 years ago

      I hope it was worth the wait! :)

    • @sachinrathi7814
      @sachinrathi7814 5 years ago +1

      @@statquest Indeed. I have gone through many posts, but everyone just says that it combines weak classifiers to make a strong classifier... the same description everywhere.
      The way of describing things is what sets Josh Starmer apart from the others.
      Merry Christmas 🤗

  • @ecotrix132
    @ecotrix132 6 months ago +1

    Thanks for the wonderful content!
    How does xgboost select which feature to split on? As I understand from the explanation, does each feature get its own full tree, unlike the bootstrapped subsets in random forests where multiple features are used in each tree?

    • @statquest
      @statquest  6 months ago

      To select which feature to split on, XGBoost tests each feature in the dataset and selects the one that performs the best.

  • @Erosis
    @Erosis 5 years ago +3

    Woohooo! Does that mean LightGBM in the future?

    • @statquest
      @statquest  5 years ago +2

      The current plan is to spend the next month or so just on XGBoost - we're going to be pretty thorough and cover every little thing it does. And then I was planning on moving on to neural networks, but I might be able to squeeze in LightGBM and CatBoost in between. If not, I'll put them on the to-do list and swing back to them later.

    • @Erosis
      @Erosis 5 years ago +1

      @@statquest BAM!

  • @irynap9262
    @irynap9262 4 years ago

    Fantastic explanation again!!! Thank you for your work 😊 The only things that were not mentioned and I can't figure out by myself are:
    1. Does XGBoost use one variable at a time when it builds each tree?
    2. In the case of more than one predictor variable, how and why would XGBoost choose a certain variable to be used to build the first tree 🌳 and other variables for the rest of the trees?

    • @statquest
      @statquest  4 years ago +1

      1 and 2) If you have more than one variable, then, at each branch, it checks all of the thresholds for all of the variables. The threshold/variable combination with the best Gain value is selected for the branch.

    • @irynap9262
      @irynap9262 4 years ago +1

      @@statquest can’t thank you enough 👍🏻👍🏻👍🏻 👏🏻👏🏻👏🏻 and really happy with the tree progress I am making watching your videos.

  • @burstingsanta2710
    @burstingsanta2710 3 years ago +2

    that DANG!!! just brought my attention back😂

  • @adityanimje843
    @adityanimje843 3 years ago +1

    Hey Josh, love your videos :)
    Any idea when you will make the videos for CatBoost and Light GBM ?

    • @statquest
      @statquest  3 years ago

      Maybe as early as July.

    • @adityanimje843
      @adityanimje843 3 years ago

      @@statquest Thank you :)
      One more question - I was reading the LightGBM documentation and it said LightGBM grows "leaf wise" whereas most DT algorithms grow "level wise", and that is a major advantage of LightGBM.
      But in your videos (RF and the other DT algorithm ones), all of the trees are shown grown "leaf wise".
      Am I misunderstanding something here?

    • @statquest
      @statquest  3 года назад

      @@adityanimje843 I won't know the answer to that until I start researching Light GBM in July

    • @adityanimje843
      @adityanimje843 3 года назад

      @@statquest Sure - thank you for the swift reply.
      Looking forward to your new videos in July :)

  • @SeitzAl1
    @SeitzAl1 5 years ago +1

    amazing lesson as always. thanks josh!

  • @ayenewyihune
    @ayenewyihune 2 years ago +1

    I'm enjoying your videos. I'd love if you can do one on Tabnet.

    • @statquest
      @statquest  2 years ago +1

      I'll keep that in mind!

  • @oldguydoesntmatter2872
    @oldguydoesntmatter2872 4 years ago +1

    Bravo! Excellent presentation. I've been through it a bunch of times trying to write my own code for my own specialized application. There's a lot of detail and nuance buried in a really short presentation (that's a compliment - congratulations!). Since you have nothing else to do (ha! ha!), would you consider writing a "StatQuest" book? I'll bid high for the first autographed copy!

    • @statquest
      @statquest  4 years ago +2

      Thank you very much!

  • @dandyyu0220
    @dandyyu0220 2 years ago +1

    Thank you for such a great video. I'm just wondering if lambda can be a negative value?

    • @statquest
      @statquest  2 years ago +1

      Presumably, but I'm not sure that's a good idea.

  • @keizerneptune4594
    @keizerneptune4594 5 years ago +1

    Great video! When r u gonna release part 2?

    • @statquest
      @statquest  5 years ago

      It should be out for early access viewing on January 6th.

  • @hubert1990s
    @hubert1990s 4 years ago +1

    Can't wait for part 2

    • @statquest
      @statquest  4 years ago +2

      I'm recording it this weekend. It should be available for early access by Monday afternoon.