Classification Trees in Python from Start to Finish
- Published: Jul 3, 2024
- NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: statquest.gumroad.com/l/tzxoh
This webinar was recorded on 2020-05-28 at 11:00am (New York time).
NOTE: This StatQuest assumes you are already familiar with:
Decision Trees: • StatQuest: Decision Trees
Cross Validation: • Machine Learning Funda...
Confusion Matrices: • Machine Learning Funda...
Cost Complexity Pruning: • How to Prune Regressio...
Bias and Variance and Overfitting: • Machine Learning Funda...
For a complete index of all the StatQuest videos, check out:
statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying my book, The StatQuest Illustrated Guide to Machine Learning:
PDF - statquest.gumroad.com/l/wvtmc
Paperback - www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - www.amazon.com/dp/B09ZG79HXC
Patreon: / statquest
...or...
RUclips Membership: / @statquest
...a cool StatQuest t-shirt or sweatshirt:
shop.spreadshirt.com/statques...
...buying one or two of my songs (or go large and get a whole album!)
joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
0:00 Awesome song and introduction
5:23 Import Modules
7:40 Import Data
11:18 Missing Data Part 1: Identifying
15:57 Missing Data Part 2: Dealing with it
21:16 Format Data Part 1: X and y
23:33 Format Data Part 2: One-Hot Encoding
37:29 Build Preliminary Tree
46:31 Pruning Part 1: Visualize Alpha
51:22 Pruning Part 2: Cross Validation
56:46 Build and Draw Final Tree
#StatQuest #ML #ClassificationTrees
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
The site is offline. 11/07 12:00
Thanks for the note. It's back up.
@@statquest Thanks very much.
Still offline unfortunately. Would love to check the code.
@@dfinance2260 It should be back up now.
BAM! My best decision this year was to follow your channel.
BAM! :)
Mood, been so useful!
I just started my bachelor thesis and i really wanted to thank you!
Your videos are helping me so much.
You are a LEGEND!!!!!
Thank you and good luck! :)
I love your content! Definitely my favorite channel this year
Regards from Mexico!
Wow, thanks! Muchas gracias! :)
You are so so helpful!! I am a data science major and your videos saved my academics. Thank you!!
Happy to help!
I have already commented, but I watched the video again and I have to say I am even more impressed than before. Truly a fantastic tutorial: not too verbose, but with every action clarified and commented in the code, and beautifully presented (I have to work on my markdown; there are quite a few markdown formats you use that I cannot replicate... to study when I get the notebook). So all in all, one of the very top ML tutorials I have ever watched (including paid-for training courses). Can't wait for today's or tomorrow's webinars. Can't join in real time as I'm based in Europe, but I will definitely pick it up here and get the accompanying study guides/code.
Hooray!!! Thank you very much!!!
Hey Josh, it’s so good to see you doing this. I am preparing for some interviews, and it will help a lot.
Good luck! :)
I'm an absolute beginner and this is what I was looking for. Thank you so much for this. Much appreciated, sir!!
Glad it was helpful! :)
You explain things in a way that is easy to understand. Bravo!
Thank you! :)
Thank you very much for this one! Your channel is incredible! Hats off to you.
Bam! :)
I'd like to thank you so much for making this stream cast available!
:)
Another hit for me. I will be getting the Jupyter notebook and some, if not all, of your study guides (I only just realised they existed).
BAM! :) Thank you very much! :)
This entire video is a triple bam! Thank you for all your content, I would be lost without it :)
Glad you enjoyed it!
@@statquest This is Quadruple BAM!!!! Thank you Mr. Josh :)
I dunno how I stumbled on your channel a few videos ago, but you've really got me interested in statistics. Nice work, sir 😃
Hooray!
Amazing video! One of the best out there for this Education! Thank you Josh
Thank you!
Love your videos, Josh; a notebook on missing values sounds like a great one to do!
Awesome!
Awesome StatQuest! Great channel! Make more videos like this one for the other topics. Thank you for your time!
Thanks! Will do!
This video helped me a lot for my Data Mining assignment.. Thank you..
Glad it helped!
I absolutely love your videos and I love your channel. Thanks for this.
Thanks! :)
You are fantastic! I'm hooked on your videos. Thank you for all your work.
Glad you like them!
This voice reminds me of listening to the radio in the UK. Love that. I want to go again.
:)
This is what I have been looking for on YouTube... thanks a lot, sir!!
Thanks!
Fantastic, this is exactly what I needed
Hooray! :)
I actually think it would be great if you created more videos for other ML algorithms. After teaching us almost every aspect of machine learning algorithms, as far as the mechanics and the related fundamentals are concerned, I feel it is high time to see them in action, and Python is, of course, the best way to go.
I'm working on them!!! :)
Your channel is the best at explaining complex machine learning algorithms step by step. Please make more videos!
Thank you very much!!! Hooray! :)
Kind Regards from Brazil. Loved your book!
Thank you!
Loved it. I am working on decision trees at my job this week.
bam!
Love this channel, Thank you Josh
Glad you enjoy it!
I was searching for a tutorial related to statistics and landed here. At first, I thought this was just one among many low-quality tutorials out there, but I was wrong. This is one of the best statistics and data science channels I have seen so far, with wonderful explanations by Josh. Addicted to this channel and subscribed. Thank you, Josh, for sharing your knowledge and making us learn in a constructive way.
Thank you very much! :)
Brooo... this is insane!! Thanks so much! This is amazing, saving me so many headaches.
Glad it helped!
This is legen..... wait for it
....dary!! 😎
This detailed coding explanation of decision trees is hard to find, but Josh, you are brilliant. Thank you for such a great video.
Glad you liked it!
I love you so much Josh. Thank you so much for everything.
Thanks!
OMG... I thought you'd ignore me when I asked you to post this webinar on YouTube. I'm glad you posted it. Thank you!
BAM!!!!
Double BAM
Thank you so much for this tutorial! This has helped me out a lot!
Glad it helped!
Thank you, this video helped me a lot! For anyone else following along in 2023, the way the confusion matrix is drawn here didn't work for me anymore. I replaced it with the following code:
# imports needed for this snippet:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, clf_dt_pruned.predict(X_test), labels=clf_dt_pruned.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Does not have HD", "Has HD"])
disp.plot()
plt.show()
BAM! Thank you. Also, I updated the jupyter notebook.
Thank you very much. This video is very helpful and cleared up a lot of concepts for me.
Bam! :)
BAM...!!! I'm getting notifications from your channel again
BAM! :)
Josh is the best. I learned a lot from him!
Wow! Thank you!
I loved your Brazil polo shirt! Triple bam!!! Thank you for your videos. Regards from Brazil!
Muito obrigado!!!
@@statquest paying homage to Brazil!!
@@cindinishimoto9528 Eu amo o Brasil!
Josh,
this is really great.
Can you upload videos with some insights into your personal research and the methods you used?
And some examples of why you prefer one method over another? I mean, not only because you get a better result in ROC/AUC, but is there a "biological" reason for using a specific method?
Great suggestion!
Thanks for making my life easy.
Any time!
I really love your video, it helps me a lot!! Regards from China.
Awesome! Thank you!
Thank you Sir for this wonderful webinar
Thank you!
Really enjoy all the videos! Can you do a series about mixed models as well, random effects, choosing models, interpretation etc. ?
It's on the to-do list.
Wowowowwo, the best course ever, even better than all those paid courses. Thank you @Josh Starmer for these materials!
Thank you! :)
Great video, Josh! Thanks for sharing it with us. And I have to say: the Brazilian shirt looks great on you! ;-)
BAM! :)
Really learn a lot from you
Thanks!
Thank you for your powerful tutorial!
Glad it was helpful!
Greetings from Brazil!
Muito obrigado! :)
Great insight and refresher; thank you for documenting it.
Glad you enjoyed it!
Finally have the honor to see Josh :)
:)
Hi Josh, one amazing thing about the playlist is the song you sing before starting each video; that refreshes me. You know how to keep the listener awake for the next video, hehe. And really, thanks for the amazing explanations.
Awesome thank you!
Come on, buddy! I just saw a recommendation for your channel, and in the first video I see you with a Brazilian t-shirt. Nice surprise!
Muito obrigado! :)
Hurray! I saw your face for the first time! Nice to see one of the people I've subscribed to.
bam!
Wow, this is super helpful!
Glad you think so!
My intro song for this channel:
" It's like Josh has got his hands on python right,
He teaches ML and AI really well and tight ---- STAT QUEST"
btw thanks Brother for so much wonderful content for free.....
Thank you! :)
MANNNN so useful, please keep going!
Thanks!
Thank you sir!! Best ever!!!! BAM!!
Thank you very much! :)
Your videos are always very good. But today I’ll have to commend you on your fashion choice as well. Great-looking shirt! I hope you have had the opportunity to visit Brazil.
Muito obrigado! Eu amo o Brasil! :)
Man, you are awesome! Vai BRASIL!!!
Muito obrigado!
Love the tabla and your content!
Thanks! My father used to teach at IIT-Madras so I spent a lot of time there when I was young.
Hi Josh, thanks for the video again!! I have some questions; I hope you don't mind clarifying them, regarding pruning and hyperparameter tuning in general. I see that the video does the following to find the best alpha:
1) After the train/test split, find the best alpha by comparing test and training accuracy (single split). @50:32
2) Recheck the best alpha by doing CV @52:33. This shows that there is huge variation in the accuracy, which implies that alpha is sensitive to the choice of training set.
3) Redo the CV to find the best alpha by taking the mean accuracy for each alpha.
a) At step two, do we still need to plot the training set accuracy to check for overfitting? (It is always mentioned that we should compare training & test set accuracy to check for overfitting, but there is a debate on this as well, where the other party argues that for a model A with training/test accuracy of 99/90% vs. another model B with 85/85%, we should pick model A because 90% test accuracy is higher than 85%, even though model B has no gap (overfitting) between train & test.) What's your thought on this?
b) What if I skip steps 1) and 2) and go straight to step 3)? Is this bad practice? Do I still need to plot the training accuracy to compare with the test accuracy if I skip steps 1 and 2? Thanks.
c) I always see that the final hyperparameter is decided by the highest mean accuracy across all K folds. Do we need to consider the impact of variance across the K folds? Surely we don't want our accuracy to jump all over the place if the model is taken into production. If yes, what is the general rule of thumb for when the variance in accuracy is considered bad?
Sorry for the long post. Thanks!
a) Ultimately the optimal model depends on a lot of things - and often domain knowledge is one of those things - so there are no hard rules and you have to be flexible about the model you pick.
b) You can skip the first two steps - those were just there to illustrate the need for using cross validation.
c) It's probably a good idea to also look at the variation.
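For reference, here is a minimal sketch (not from the video) of the cross-validated alpha search discussed in (b) and (c), assuming the X_train/y_train split from the video; it keeps the standard deviation alongside the mean so you can check the variation too:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# candidate alphas come from the cost complexity pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # drop the last alpha, which prunes the tree down to the root

results = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(clf, X_train, y_train, cv=5)  # 5-fold CV accuracy
    results.append({"alpha": alpha, "mean": scores.mean(), "std": scores.std()})

results = pd.DataFrame(results)
print(results.loc[results["mean"].idxmax()])  # alpha with the highest mean accuracy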
Neural Networks pleaseee, Bayesian and LARS as well. And thank you. You actually make things much easier to understand.
Thanks! :)
thank you so much sir for sharing
Thanks!
Great tutorial! One question: by looking at the features included in the final tree, does it mean that only those 4 features are considered for prediction, i.e., we don't need the rest, so we could drop those columns for further usage?
That is correct.
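In case it helps anyone, here is a minimal sketch (not from the video) of one way to list the columns the fitted tree actually splits on, assuming the clf_dt_pruned and X_encoded variables from the video:

import numpy as np

# internal nodes store the index of the feature they split on;
# leaves store a negative sentinel value, so we filter those out
split_features = clf_dt_pruned.tree_.feature
used_columns = X_encoded.columns[np.unique(split_features[split_features >= 0])]
print(used_columns)

# columns the tree never splits on could then be dropped:
# X_reduced = X_encoded[used_columns]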
I know I'll love all the content, but I start liking the video immediately because of the music! haha
Thank you! :)
I loved your video "Support Vector Machines in Python from Start to Finish" and this one too!!! Can you make more on different algorithms?
I will try!
Congratulations! Ten times triple BAM!!
Hooray! :)
Really nice video. I thought you were actually going to implement the tree classifier itself, which would have been a real bonus but I guess that would have taken a lot longer.
Noted
BAM!! You are the best.
Double bam!
We need to see you play some tabla to one of your songs. Double BAM!! Great content btw :)
Maybe one day!
I wish you were my uncle, Josh, or something.
I can imagine how hard I would have argued with my parents to spend time with my TRIPLE cool uncle.
bam! :)
I have almost completed the Machine learning playlist and it was really helpful. One request, can you please make a short video on 'handling the imbalanced dataset'?
I've got a rough draft on that topic here: ruclips.net/video/iTxzRVLoTQ0/видео.html
How do you implement a pipeline with cost complexity pruning? Consider the marking part which starts before 49:00... Thanks in advance! You are the best teacher...
I'll work on that
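For anyone curious in the meantime, here is a hedged sketch of one way to tune ccp_alpha inside a Pipeline with GridSearchCV; the alpha grid and the X_train/y_train names are assumptions, not from the video:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("tree", DecisionTreeClassifier(random_state=42))])

# "tree__ccp_alpha" routes the parameter to the "tree" step of the pipeline
param_grid = {"tree__ccp_alpha": np.linspace(0.0, 0.04, 20)}  # assumed grid for illustration

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)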
I liked it before watching!
Hooray!!!
Thank you so much!
Thanks!
Awesome video
Thanks! :)
Amazing, man. I love your channel. Could you please reorder this video, SVMs, and XGBoost in the correct order in the playlist?
Yes!
We love you, and we always will.
Thanks!
Wow!! Josh live? Made my day...
:)
Thanks for this! Just one quick piece of feedback: it would've been great had you touched upon how to interpret the leaves of the decision tree.
noted
Hey Josh, follow this equation: You + Brazilian Flag Polo Shirt + Awesome Content = TRIPLE BAM!!!
Muito bem! :)
We have data scientists out there. We have a "data artist" right in this video.
Wow! Thank you!
A 1-hour StatQuest? In the words of Barney Rubble's son: "BAM BAM!"
double bam! :)
Thank you!!!
Bam! :)
I wanted to learn decision trees from scratch, but it seems that here we should already know things like confusion matrices. I'd better study that first and come back to this video.
Yep.
Pruning is better overall than setting max_depth or min_samples beforehand, I guess. Thanks for another great tutorial :)
Thanks!
As Tina Turner would say: "You are simply the best!" 🎵🎵🎵
BAM! :)
1:00:20 Use color to visualize the category and the Gini impurity
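For anyone looking for that, here is a minimal sketch (assuming the fitted clf_dt_pruned and X_encoded from the video): with filled=True, plot_tree colors each node by its majority class, with paler shades for higher Gini impurity:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(15, 7.5))
plot_tree(clf_dt_pruned,
          filled=True,  # color nodes by class; paler shades = higher impurity
          rounded=True,
          class_names=["No HD", "Yes HD"],
          feature_names=X_encoded.columns)
plt.show()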
Hi Josh, I request that you make more such ML videos in Python which cover all ML concepts holistically. I am sure this course will then become more popular than any of the available ML courses. Pls pls pls....
I'll consider it.
Thanks!
Wow! Thank you so much for supporting StatQuest!!! BAM! :)
You're the best
Thank you! :)
Hi Josh, thank you so much for this awesome posting! Quick question: when doing the cross validation, should cross_val_score() use [X_train, y_train] or [X_encoded, y]? If the point of doing cross validation is to let each chunk of the data set be the testing data, should we then use the full data set, X_encoded and y, for the cross validation? Thank you!!
There are different ideas about how to do this, and they depend on how much data you have. If you have a lot of data, it is common to hold out a portion of the data to only be used for the final evaluation of the model (after optimizing and cross validation) as demonstrated here. When you have less data, it might make sense to use all of the data for cross validation.
@@statquest Thanks for the quick response. That makes perfect sense.
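To make the two options concrete, here is a hedged sketch; the clf_dt name and alpha value are illustrative assumptions, not from the video:

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.016)  # illustrative alpha

# Option 1 (plenty of data, as in the video): hold out a test set for the
# final evaluation and cross validate only on the training portion
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42)
scores_train_only = cross_val_score(clf_dt, X_train, y_train, cv=5)

# Option 2 (less data): cross validate on everything, accepting that
# there is no untouched test set left for a final evaluation
scores_all = cross_val_score(clf_dt, X_encoded, y, cv=5)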
Hi Josh! I'm a HS student trying to learn ML algorithms and your videos are genuinely my saving grace. They're so concise, information heavy and educational. I understand concepts perfectly through your statquests, and I'm really grateful for that.
One quick question: The algorithm used in this case to build a decision tree: is it the CART algorithm? I'm writing a paper on the CART algorithm and would hence like to confirm the same. Thanks again!
Yes, this is the "classification tree" in CART.
@@statquest Thank you so much 🥰
Somehow your room and furniture remind me of my grad building room at the Univ. of Chicago.
Cool!
Thank you for this video! Would it be possible to do a similar video with random forests and regression trees?
I don't like the random forest implementation in Python. Instead, if you're going to use random forests, you should do it in R. And I have a video for that: ruclips.net/video/6EXPYzbfLCE/видео.html
Hi Josh... awesome videos. I have a request: please make videos on the Python implementation of XGBoost. Thanks, greetings from India.
statquest.org/product/webinar-july-14-11am-xgboost-in-python/
statquest.org/product/webinar-july-16-11am-xgboost-in-python/
Statistics and ML GOAT
BAM! And thank you for supporting me! :)
Hi Josh. Loved this video. I have two questions: 1) Is there any way to save our final decision tree model to use it later on unseen data without having to train it all again? 2) Once you have decided on your final alpha, why not train your tree on the full, unsplit dataset? I know you will not be able to generate a confusion matrix, but wouldn't your final tree be better if it were trained with all the examples?
Yes and yes. You can write the decision tree to a file if you don't want to keep it in memory (or want to back it up). See: scikit-learn.org/stable/modules/model_persistence.html
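Following that link, here is a minimal sketch of the save/load step with joblib (the file name is arbitrary; clf_dt_pruned and X_test are assumed from the video):

import joblib

joblib.dump(clf_dt_pruned, "clf_dt_pruned.joblib")  # save the fitted tree to disk
clf_restored = joblib.load("clf_dt_pruned.joblib")  # reload it later
predictions = clf_restored.predict(X_test)          # use it on unseen data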
Kindly add this video to the machine learning playlist.
Will do!