Classification Trees in Python from Start to Finish

  • Published: Jul 3, 2024
  • NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    This webinar was recorded 20200528 at 11:00am (New York time).
    NOTE: This StatQuest assumes you are already familiar with:
    Decision Trees: • StatQuest: Decision Trees
    Cross Validation: • Machine Learning Funda...
    Confusion Matrices: • Machine Learning Funda...
    Cost Complexity Pruning: • How to Prune Regressio...
    Bias and Variance and Overfitting: • Machine Learning Funda...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying my book, The StatQuest Illustrated Guide to Machine Learning:
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    RUclips Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    5:23 Import Modules
    7:40 Import Data
    11:18 Missing Data Part 1: Identifying
    15:57 Missing Data Part 2: Dealing with it
    21:16 Format Data Part 1: X and y
    23:33 Format Data Part 2: One-Hot Encoding
    37:29 Build Preliminary Tree
    46:31 Pruning Part 1: Visualize Alpha
    51:22 Pruning Part 2: Cross Validation
    56:46 Build and Draw Final Tree
    #StatQuest #ML #ClassificationTrees
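The One-Hot Encoding chapter (23:33) boils down to `pd.get_dummies`. A minimal sketch with toy values standing in for the heart-disease column `cp` (chest pain type); this is not the notebook's actual code:

```python
import pandas as pd

# Toy stand-in for the heart-disease data: 'cp' (chest pain type) is
# categorical with 4 levels, 'sex' is already 0/1
df = pd.DataFrame({"cp": [1, 2, 3, 4], "sex": [0, 1, 0, 1]})

# One-hot encode 'cp' so the tree doesn't treat 1 < 2 < 3 < 4 as ordered
X_encoded = pd.get_dummies(df, columns=["cp"])
print(list(X_encoded.columns))  # ['sex', 'cp_1', 'cp_2', 'cp_3', 'cp_4']
```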

Comments • 582

  • @statquest · 4 years ago · +26

    NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @ezzouaouia.r1127 · 4 years ago

      The site is offline. 11/07 12:00

    • @statquest · 4 years ago

      Thanks for the note. It's back up.

    • @ezzouaouia.r1127 · 4 years ago

      @@statquest Thanks very much .

    • @dfinance2260 · 3 years ago

      Still offline unfortunately. Would love to check the code.

    • @statquest · 3 years ago

      @@dfinance2260 It should be back up now.

  • @funnyclipsutd · 4 years ago · +67

    BAM! My best decision this year was to follow your channel.

  • @renekokoschka707 · 3 years ago · +7

    I just started my bachelor thesis and i really wanted to thank you!
    Your videos are helping me so much.
    You are a LEGEND!!!!!

    • @statquest · 3 years ago · +1

      Thank you and good luck! :)

  • @montserratramirez4824 · 4 years ago · +7

    I love your content! Definitely my favorite channel this year
    Regards from Mexico!

    • @statquest · 4 years ago · +2

      Wow, thanks! Muchas gracias! :)

  • @jahanvi9429 · 1 year ago · +5

    You are so so helpful!! I am a data science major and your videos saved my academics. Thank you!!

  • @ccuny1 · 4 years ago · +2

    I have already commented, but I watched the video again and I have to say I am even more impressed than before. Truly fantastic tutorial: not too verbose, but with every action clarified and commented in the code, and beautifully presented (I have to work on my markdown; there are quite a few markdown formats you use that I cannot replicate... to study when I get the notebook). So all in all, one of the very top ML tutorials I have ever watched (including paid training courses). Can't wait for today's or tomorrow's webinars. Can't join in real time as I'm based in Europe, but will definitely pick it up here and get the accompanying study guides/code.

    • @statquest · 4 years ago

      Hooray!!! Thank you very much!!!

  • @1988soumya · 4 years ago · +3

    Hey Josh, it’s so good to see you are doing this, I am preparing for some interviews, it will help a lot

  • @dhruvishah9077 · 3 years ago · +2

    I'm an absolute beginner and this is what I was looking for. Thank you so much for this. Much appreciated, sir!!

    • @statquest · 3 years ago

      Glad it was helpful! :)

  • @jefferyg3504 · 3 years ago · +1

    You explain things in a way that is easy to understand. Bravo!

  • @3ombieautopilot · 4 years ago · +2

    Thank you very much for this one! Your channel is incredible! Hats off to you

  • @beebee_0136 · 2 years ago

    I'd like to thank you so much for making this stream cast available!

  • @ccuny1 · 4 years ago · +1

    Another hit for me. I will be getting the Jupyter notebook and some if not all of you study guides (I only just realised they existed).

    • @statquest · 4 years ago

      BAM! :) Thank you very much! :)

  • @aryamohan7533 · 3 years ago · +1

    This entire video is a triple bam! Thank you for all your content, I would be lost without it :)

  • @ozzyfromspace · 3 years ago · +2

    I dunno how I stumbled on your channel a few videos ago, but you've really got me interested in statistics. Nice Work sir 😃

  • @jonastrex05 · 2 years ago · +1

    Amazing video! One of the best out there for this Education! Thank you Josh

  • @fuckooo · 3 years ago · +1

    Love your videos Josh, the notebook missing values sounds like a great one to do!

  • @robertmitru7234 · 3 years ago · +1

    Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!

  • @creativeo91 · 3 years ago · +4

    This video helped me a lot for my Data Mining assignment.. Thank you..

  • @ericwr4965 · 4 years ago · +1

    I absolutely love your videos and I love your channel. Thanks for this.

  • @bayesian7404 · 3 months ago · +1

    You are fantastic! I'm hooked on your videos. Thank you for all your work.

    • @statquest · 3 months ago

      Glad you like them!

  • @Mohamm-ed · 3 years ago · +2

    This voice reminds me of listening to the radio in the UK. Love that. I want to go again

  • @utkarshsingh2675 · 1 year ago · +1

    this is what I have been looking for on youtube... thanks a lot, sir!!

  • @liranzaidman1610 · 4 years ago · +2

    Fantastic, this is exactly what I needed

  • @kaimueric9390 · 4 years ago · +6

    I actually think it can be great if you created more videos for other ML algorithms. After teaching us almost every aspect of machine learning algorithms as far as the mechanics and the related fundamentals are concerned, I feel it is high time to see those in action, and Python is, of course, the best way to go.

    • @statquest · 4 years ago · +4

      I'm working on them!!! :)

  • @gbchrs · 2 years ago · +1

    your channel is the best at explaining complex machine learning algorithm step by step. please make more videos

    • @statquest · 2 years ago

      Thank you very much!!! Hooray! :)

  • @bessa0 · 1 year ago · +1

    Kind Regards from Brazil. Loved your book!

  • @filosofiadetalhista · 2 years ago · +1

    Loved it. I am working on Decision Trees on my job this week.

  • @naveenagrawal_nice · 5 months ago · +1

    Love this channel, Thank you Josh

    • @statquest · 5 months ago

      Glad you enjoy it!

  • @ravi_krishna_reddy · 3 years ago · +4

    I was searching for a tutorial related to statistics and landed here. At first, I thought this was just one of the many low-quality tutorials out there, but I was wrong. This is one of the best statistics and data science channels I have seen so far, with wonderful explanations by Josh. Addicted to this channel and subscribed. Thank you Josh for sharing your knowledge and making us learn in a constructive way.

    • @statquest · 3 years ago

      Thank you very much! :)

  • @JoRoCaRa · 1 year ago · +1

    brooo... this is insane!! Thanks so much! This is amazing, saving me so many headaches

  • @anishchhabra5313 · 1 year ago · +1

    This is legen..... wait for it
    ....dary!! 😎
    This detailed coding explanation of Decision Tree is hard to find but Josh you are brilliant. Thank you for such a great video.

  • @sameepshah3835 · 12 days ago · +1

    I love you so much Josh. Thank you so much for everything.

  • @DANstudiosable · 4 years ago · +5

    OMG... I thought you'd ignore me when I asked you to post this webinar on YouTube. I'm glad you posted it. Thank you!

  • @Kenwei02 · 2 years ago · +1

    Thank you so much for this tutorial! This has helped me out a lot!

  • @user-lc8gc6vb3j · 9 months ago · +2

    Thank you, this video helped me a lot! For anyone else following along in 2023, the way the confusion matrix is drawn here didn't work for me anymore (plot_confusion_matrix was removed from scikit-learn). I replaced it with the following code:
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    import matplotlib.pyplot as plt
    cm = confusion_matrix(y_test, clf_dt_pruned.predict(X_test), labels=clf_dt_pruned.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Does not have HD", "Has HD"])
    disp.plot()
    plt.show()

    • @statquest · 9 months ago

      BAM! Thank you. Also, I updated the jupyter notebook.

  • @umairkazi5537 · 4 years ago · +1

    Thank you very much. This video is very helpful and clears up a lot of concepts for me

  • @magtazeum4071 · 4 years ago · +2

    BAM...!!! I'm getting notifications from your channel again

  • @simaykazc1508 · 3 years ago · +1

    Josh is the best. I learned a lot from him!

  • @joaomanoellins2219 · 4 years ago · +25

    I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!

    • @statquest · 4 years ago · +20

      Muito obrigado!!!

    • @cindinishimoto9528 · 3 years ago · +2

      @@statquest paying homage to Brazil!!

    • @statquest · 3 years ago · +5

      @@cindinishimoto9528 Eu amo do Brasil!

  • @liranzaidman1610 · 4 years ago · +10

    Josh,
    this is really great.
    Can you upload videos with some insights on your personal research and which methods did you use?
    And some examples of why you prefer to use one method instead of the other? I mean, not only because you get a better result in ROC/AUC, but is there a "biological" reasoning for using a specific method?

  • @amc9520 · 1 year ago · +1

    Thanks for making my life easy.

  • @jihowoo9667 · 4 years ago · +1

    I really love your video, it helps me a lot!! Regards from China.

  • @srmsagargupta · 3 years ago · +1

    Thank you Sir for this wonderful webinar

  • @xiolee7597 · 4 years ago · +4

    Really enjoy all the videos! Can you do a series about mixed models as well, random effects, choosing models, interpretation etc. ?

    • @statquest · 4 years ago · +4

      It's on the to-do list.

  • @rajatjain7465 · 1 year ago · +1

    wowowowwo, the best course ever, even better than all those paid quests. Thank you Josh Starmer for these materials

  • @nataliatenoriomaia1635 · 3 years ago · +1

    Great video, Josh! Thanks for sharing it with us. And I have to say: the Brazilian shirt looks great on you! ;-)

  • @_ahahahahaha9326 · 2 years ago · +1

    Really learn a lot from you

  • @amalsakr1381 · 4 months ago · +1

    Thank you for your powerful tutorial

    • @statquest · 4 months ago

      Glad it was helpful!

  • @fernandosicos · 2 years ago · +1

    Greetings from Brazil!

  • @avramdagoat · 8 months ago · +1

    great insight and refresher, thank you for documenting

    • @statquest · 8 months ago

      Glad you enjoyed it!

  • @floral7448 · 3 years ago · +1

    Finally have the honor to see Josh :)

  • @junaidmalik9593 · 3 years ago

    Hi Josh, one amazing thing about the playlist is the song you sing before starting each video; it refreshes me. You know how to keep the listener awake for the next video, hehe. And really, thanks for the amazing explanation.

  • @juniotomas8563 · 3 months ago · +1

    Come on, buddy! I just saw a recommendation for your channel, and in the very first video I see you in a Brazilian t-shirt. Nice surprise!

    • @statquest · 3 months ago

      Muito obrigado! :)

  • @sharmakartikeya · 3 years ago · +1

    Hurray! I saw your face for the first time! Nice to see someone I have subscribed to

  • @pfunknoondawg · 3 years ago · +1

    Wow, this is super helpful!

  • @pratyushmisra2516 · 3 years ago · +4

    My intro song for this channel:
    " It's like Josh has got his hands on python right,
    He teaches Ml and AI really Well and tight ---- STAT QUEST"
    btw thanks Brother for so much wonderful content for free.....

  • @michelchaghoury870 · 2 years ago · +1

    MANNNN, so useful! Please keep going

  • @hanaj4870 · 3 years ago · +1

    Thank you sir!! Best ever!!!! BAM!!

    • @statquest · 3 years ago

      Thank you very much! :)

  • @douglasaraujo9763 · 3 years ago · +1

    Your videos are always very good. But today I’ll have to commend you on your fashion choice as well. Great-looking shirt! I hope you have had the opportunity to visit Brazil.

    • @statquest · 3 years ago

      Muito obrigado! Eu amo do Brasil! :)

  • @TalesLimaFonseca · 2 years ago · +1

    Man, you are awesome! Vai BRASIL!!!

  • @krishanudebnath1959 · 2 years ago · +1

    love the tabla and your content

    • @statquest · 2 years ago

      Thanks! My father used to teach at IIT-Madras so I spent a lot of time there when I was young.

  • @josephgan1262 · 2 years ago

    Hi Josh, thanks for the video again! I have some questions I hope you don't mind clarifying regarding pruning and hyperparameter tuning in general. The video does the following to find the best alpha:
    1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32
    2) Recheck the best alpha by doing CV @52:33. This shows there is huge variation in the accuracy, which implies that alpha is sensitive to the choice of training set.
    3) Redo the CV to find the best alpha by taking the mean accuracy for each alpha.
    a) At step 2, do we still need to plot the training-set accuracy to check for overfitting? (It is always mentioned that we should compare training & testing accuracy to check for overfitting, but there is debate on this as well. Some argue that, for a model A with 99/90% train/test accuracy vs. a model B with 85/85%, we should pick model A because 90% testing accuracy is higher than 85%, even though model B has no gap (overfitting) between train & test.) What's your thought on this?
    b) What if I skip steps 1) and 2) and go straight to step 3)? Is that bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2?
    c) I always see the final hyperparameter decided on the highest mean accuracy across all K folds. Do we need to consider the variance across folds? Surely we don't want our accuracy to jump all over the place in production. If yes, what's the general rule of thumb for when the variance in accuracy is considered bad?
    Sorry for the long post. Thanks!

    • @statquest · 2 years ago

      a) Ultimately the optimal model depends on a lot of things - and often domain knowledge is one of those things - so there are no hard rules and you have to be flexible about the model you pick.
      b) You can skip the first two steps - those were just there to illustrate the need for using cross validation.
      c) It's probably a good idea to also look at the variation.
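The alpha-selection workflow discussed in this thread (build candidate alphas from the cost-complexity pruning path, cross-validate each, look at both the mean and the spread) can be sketched as follows; synthetic data stands in for the video's heart-disease dataset, and all variable names are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Candidate alphas from the cost-complexity pruning path; drop the last
# one (it prunes the tree down to the root) and clip float noise at 0
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

# For each alpha, record the mean and spread of 5-fold CV accuracy
results = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(clf, X, y, cv=5)
    results.append((alpha, scores.mean(), scores.std()))

# Pick the alpha with the best mean accuracy (the spread is worth a look too,
# per point c above)
best_alpha, best_mean, best_std = max(results, key=lambda r: r[1])
print(f"best alpha={best_alpha:.4f}, mean acc={best_mean:.3f} +/- {best_std:.3f}")
```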

  • @mahdimj6594 · 4 years ago · +1

    Neural Network Pleaseee, Bayesian and LARS as well. And Thank you. You actually make things much easier to understand.

  • @julescesar4779 · 2 years ago · +1

    thank you so much sir for sharing

  • @rhn122 · 3 years ago · +6

    Great tutorial! One question: judging by the features included in the final tree, does it mean that only those 4 features are considered for prediction, i.e., we don't need the rest, so we could drop those columns for further use?
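One way to check this programmatically: after fitting, any feature with zero importance never appears in the pruned tree, so in principle it could be dropped. A hypothetical sketch on synthetic data (not the video's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# A feature with zero importance is never used for a split in this tree
used = np.flatnonzero(clf.feature_importances_)
print("features used by the pruned tree:", used)
```

Keep in mind the selected features can change if the tree is refit on different data, so dropping columns based on one fit is a judgment call.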

  • @paulovinicius5833 · 3 years ago · +1

    I know I'll love all the content, but I started liking the video immediately because of the music! haha

  • @chaitanyasharma6270 · 3 years ago · +1

    i loved your video support vector machines in python from start to finish and this one too!!! can you make more on different algorithms?

  • @marcooliveira9249 · 4 years ago · +1

    Congratulations ! Ten times triple bam !!

  • @mcmiloy3322 · 3 years ago

    Really nice video. I thought you were actually going to implement the tree classifier itself, which would have been a real bonus but I guess that would have taken a lot longer.

  • @vipanpatial2243 · 2 years ago · +2

    BAM!! You are best.

  • @ramendrachaudhary9784 · 3 years ago · +2

    We need to see you play some tabla to one of your songs. Double BAM!! Great content btw :)

  • @korcankomili7398 · 1 year ago · +1

    I wish you were my uncle, Josh.
    I can imagine how hard I would have argued with my parents to spend time with my TRIPLE cool uncle.

  • @shindepratibha31 · 3 years ago

    I have almost completed the Machine learning playlist and it was really helpful. One request, can you please make a short video on 'handling the imbalanced dataset'?

    • @statquest · 3 years ago

      I've got a rough draft on that topic here: ruclips.net/video/iTxzRVLoTQ0/видео.html

  • @aleksandartta · 2 years ago

    How do you implement a pipeline with cost complexity pruning? Consider the part which starts just before 49:00... Thanks in advance! You are the best teacher...
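One way to combine a preprocessing pipeline with cost-complexity pruning is to grid-search ccp_alpha through the pipeline's step name. A hypothetical sketch on synthetic data (the imputer is just a placeholder for whatever preprocessing the pipeline needs; step names and alpha values are made up):

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Any preprocessing can go in front; the imputer is a placeholder step
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# "tree__ccp_alpha" routes the pruning parameter to the tree inside the
# pipeline; GridSearchCV cross-validates each candidate alpha
grid = GridSearchCV(pipe, {"tree__ccp_alpha": [0.0, 0.005, 0.01, 0.02, 0.05]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```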

  • @kaimueric9390 · 4 years ago · +2

    I liked before watching

  • @awahritaengwari2915 · 1 year ago · +1

    Thank you so much,

  • @bgrguric555 · 3 years ago · +1

    Awesome video

  • @teetanrobotics5363 · 3 years ago · +1

    Amazing man. I love your channel. Could you please reorder this video , SVMs and Xgboost in the correct order in the playlist ?

  • @toniiicarbonelll287 · 2 years ago · +1

    we love you we always will

  • @saiakhil4751 · 3 years ago · +1

    Wow!! Josh on live? made my day...

  • @mrlfcynwa · 3 years ago

    Thanks for this! I just have a quick feedback that it would've been great had you touched upon how to interpret the leaves of the decision tree

  • @bressanini · 1 year ago · +1

    Hey Josh, follow this equation: You + Brazilian Flag Polo Shirt + Awesome Content = TRIPPLE BAM!!!

  • @randyluong6275 · 2 years ago · +1

    We have data scientist out there. We have "data artist" right in this video.

  • @Moiez101 · 1 year ago · +1

    1 hour statquest? in the words of Barney Rubble's son: "BAM BAM!"

  • @AK-nx9lg · 3 years ago · +1

    Thank you!!!

  • @BeSharpInCSharp · 4 years ago

    I wanted to learn decision trees from scratch, but it seems here we should already know things like confusion matrices. I'd better study that first and come back to this video

  • @rogertea1857 · 3 years ago

    Pruning is better overall than setting max_depth or min_samples beforehand, I guess. Thanks for another great tutorial : )

  • @breopardo6691 · 3 years ago · +5

    As Tina Turner would say: "You are simply the best!" 🎵🎵🎵

  • @willw4096 · 10 months ago

    1:00:20 Use color to visualize the category and the Gini impurity

  • @amitsaxena6530 · 4 years ago · +1

    Hi Josh, please make more such ML videos in Python that cover all ML concepts holistically. I am sure this course will then become more popular than any of the available ML courses. Pls pls pls....

  • @jimwest63 · 1 year ago · +1

    Thanks!

    • @statquest · 1 year ago

      Wow! Thank you so much for supporting StatQuest!!! BAM! :)

  • @DishantKothia · 2 years ago · +1

    You're the best

  • @alexyuan1622 · 3 years ago · +1

    Hi Josh, thank you so much for this awesome post! Quick question: when doing the cross validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? I'm wondering, since the point of cross validation is to let each chunk of the data set be the testing data, should we then use the full data set (X_encoded and y) for the cross validation? Thank you!!

    • @statquest · 3 years ago · +1

      There are different ideas about how to do this, and they depend on how much data you have. If you have a lot of data, it is common to hold out a portion of the data to only be used for the final evaluation of the model (after optimizing and cross validation) as demonstrated here. When you have less data, it might make sense to use all of the data for cross validation.

    • @alexyuan1622 · 3 years ago · +1

      @@statquest Thanks for the quick response. That makes perfect sense.
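The hold-out pattern described in the reply (cross-validate on the training portion, touch the held-out test set only once at the end) can be sketched like this on synthetic data; the dataset and alpha are made up:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Hold out a final test set; cross-validate only on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# The held-out test set is used once, after model selection is finished
clf.fit(X_train, y_train)
print("CV mean:", cv_scores.mean(), "held-out accuracy:", clf.score(X_test, y_test))
```

With less data, as the reply notes, cross-validating on the full X_encoded and y can make sense instead.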

  • @prnv5 · 2 years ago

    Hi Josh! I'm a HS student trying to learn ML algorithms and your videos are genuinely my saving grace. They're so concise, information heavy and educational. I understand concepts perfectly through your statquests, and I'm really grateful for that.
    One quick question: The algorithm used in this case to build a decision tree: is it the CART algorithm? I'm writing a paper on the CART algorithm and would hence like to confirm the same. Thanks again!

    • @statquest · 2 years ago · +1

      Yes, this is the "classification tree" in CART.

    • @prnv5 · 2 years ago · +1

      @@statquest Thank you so much 🥰

  • @PinkFloydTheDarkSide · 2 years ago

    Somehow your room and furniture remind me of my grad building room at the Univ. of Chicago.

  • @patite3103 · 3 years ago

    thank you for this video! Would it be possible to do a similar video with random forest and regression trees?

    • @statquest · 3 years ago

      I don't like the random forest implementation in Python. Instead, if you're going to use random forests, you should do it in R. And I have a video for that: ruclips.net/video/6EXPYzbfLCE/видео.html

  • @catdef9028 · 4 years ago

    Hi Josh... awesome videos. I have a request that you make videos on the Python implementation of XGBoost. Thanks, greetings from India.

    • @statquest · 4 years ago · +1

      statquest.org/product/webinar-july-14-11am-xgboost-in-python/
      statquest.org/product/webinar-july-16-11am-xgboost-in-python/

  • @lautarocisterna3339 · 3 years ago · +1

    Statistics and ML GOAT

    • @statquest · 3 years ago

      BAM! And thank you for supporting me! :)

  • @estebannantes8567 · 4 years ago · +1

    Hi Josh. Loved this video. I have two questions: 1) Is there any way to save our final decision tree model to use later on unseen data without having to train it all again? 2) Once you have decided on your final alpha, why not train your tree on the full, unsplit dataset? I know you won't be able to generate a confusion matrix, but wouldn't your final tree be better if it were trained on all the examples?

    • @statquest · 4 years ago · +1

      Yes and yes. You can write the decision tree to a file if you don't want to keep it in memory (or want to back it up). See: scikit-learn.org/stable/modules/model_persistence.html
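Following the scikit-learn persistence docs linked in the reply, a minimal sketch of saving and reloading a fitted tree with joblib (toy data, hypothetical file name):

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save the fitted tree to disk, then reload it for later predictions
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "tree.joblib")
    dump(clf, path)
    restored = load(path)

# The reloaded model predicts identically to the original
assert (restored.predict(X) == clf.predict(X)).all()
```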

  • @ayatkhrisat5964 · 4 years ago · +2

    kindly add this video to the machine learning list