StatQuest: Random Forests in R

  • Published: 25 Feb 2018
  • Random Forests are an easy-to-understand, easy-to-use machine learning technique that is surprisingly powerful. Here I show you, step by step, how to use them in R.
    NOTE: There is an error at 13:26. I meant to call "as.dist()" instead of "dist()".
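    For reference, a minimal sketch of the corrected step (object names like "data.imputed" are illustrative assumptions, not necessarily the video's exact ones). dist() would treat the matrix as raw data and compute new distances between its rows, while as.dist() simply reinterprets an existing distance matrix:

```r
library(randomForest)

## assumed: a forest fit with proximity = TRUE, e.g.
## model <- randomForest(hd ~ ., data = data.imputed, proximity = TRUE)

## 1 - proximity is already a distance matrix, so reinterpret it with
## as.dist() instead of recomputing distances with dist()
distance.matrix <- as.dist(1 - model$proximity)

## classical multidimensional scaling on those distances
mds.stuff <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)
```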
    The code that I used in this video can be found on the StatQuest GitHub:
    github.com/StatQuest/random_f...
    If you're new to Random Forests, here's a video that covers the basics...
    • StatQuest: Random Fore...
    ... and here's a video that covers missing data and sample clustering...
    • StatQuest: Random Fore...
    For a complete index of all the StatQuest videos, check out:
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Support StatQuest by buying The StatQuest Illustrated Guide to Machine Learning!!!
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
    / joshuastarmer
    #statquest #randomforest #ML

Comments • 406

  • @statquest
    @statquest  2 years ago +3

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @RaushanKumar-fq7bo
      @RaushanKumar-fq7bo 6 months ago

      I am using this loop command for random forest,
      oob.error.data

    • @statquest
      @statquest  6 months ago

      @@RaushanKumar-fq7bo Are you using my code, or did you write your own?

  • @jasperobico1459
    @jasperobico1459 5 years ago +1

    Your tutorial video was really helpful! I am not sure if I would be able to do Random Forest without seeing this one! Great job on making a tutorial video that is easy to follow and to understand for non-R users like me. Kudos!

  • @cajogos
    @cajogos 4 years ago +11

    These videos using R are a lifesaver (quite literally!) Thanks a lot for these Josh!

  • @BT-jh3dq
    @BT-jh3dq 3 years ago +4

    I've got so much more out of a couple of hours watching your videos than out of a couple of weeks trying to understand RFs through papers/books. Going back to the papers now, but with much more of a handle on what's going on. Thanks!

  • @BayAreaLakers
    @BayAreaLakers 3 years ago +10

    Can't believe I went from not knowing anything about Machine Learning to learning so much after just a few days. Thanks Josh!

    • @statquest
      @statquest  3 years ago +1

      BAM!

    • @shaiguitar
      @shaiguitar 1 year ago +1

      I second this. Priceless channel. DOUBLE BAM!

  • @sudiptomitra
    @sudiptomitra 3 years ago +3

    This demo is end-to-end and complete for RF in R!! It could easily be regarded as the "GOAT" on this subject. Thanks, and I'm looking forward to viewing more great demos on ML topics.

  • @nurinurlailasetiawan2689
    @nurinurlailasetiawan2689 1 year ago +1

    Josh, your channel is super awesome! I've been struggling to understand ML because I need to work with RF for my hyperspectral data. I've read a lot of papers and books, but so far, your videos are the ones that have helped me the most! Very effectively communicated!!! Big thanks!!!

  • @dr.sangramsinha2784
    @dr.sangramsinha2784 3 years ago +5

    Recently I have become a regular follower of your channel. This is awesome. I learned a lot despite coming from neither a mathematics nor a computer science background. Even as an experimental biologist, I understood most of your videos on regression analysis and am now getting familiar with machine learning. I wonder if you could create some videos on protein-protein or protein-ligand interactions using machine learning. I pay my deep respect to the effort you have made to teach us all of this complex stuff in such a simple way. Furthermore, you have a beautiful voice too; I love hearing the StatQuest tunes. Lastly, I pray for your good health and wealth.

    • @statquest
      @statquest  3 years ago +1

      Thank you very much! I'm glad my videos are helpful.

  • @lauraeli2286
    @lauraeli2286 1 year ago +1

    You really are the best here on YouTube at explaining these 'complex' topics, I think - I put it in inverted commas because they're actually not so complex anymore after watching your videos! :)

  • @BrianUrlacherPoliSci
    @BrianUrlacherPoliSci 5 years ago +2

    This was awesome. I've been working for 2 days to wrap my head around the R implementation of this. The code I was working with now makes perfect sense.

  • @Rpekeno
    @Rpekeno 6 years ago

    This video is SO good. I'm a newcomer at this, and your materials have helped me a lot! Thanks!

  • @kinwong6383
    @kinwong6383 5 years ago +2

    Love the way you show both ways of doing certain things. It really helps R beginners like me a lot!
    Thank you very much! Wish I could come see you perform one day.

    • @statquest
      @statquest  5 years ago +1

      Thank you so much! I'm glad to hear my videos are helpful. :)

  • @glauberbrito8685
    @glauberbrito8685 4 years ago +6

    You saved my day, Josh. You did a GREAT JOB !! Congrats.

  • @chrisvaccaro229
    @chrisvaccaro229 4 years ago +44

    Jesus Christmas this is incredibly useful. I code in R and
    A) it's almost impossible to find ML tutorials for R
    B) it's really hard to find straightforward ML tuts that are free of jargon ANYWAY
    C) it's hard to find tuts in plain English and without talking about "y-hat" and crap I don't even remember from calculus
    D) it's hard to find stat videos with such a good musical score ;)
    and E) these are just awesome.
    I'd literally given up on finding decent ML tuts for R and just said "screw it, I'll learn python" but then I found these accidentally. These are freaking epic. I literally just went through like 25% of your videos hitting "Shift + N" then liking them (next video, like button, next video, like button, next video, like button, etc.)
    These videos are the BEST. You should make a MOOC. Yours would be better and easier to follow than Andrew Ng or Jeremy Howard (who are the superstars of ML.)
    Maybe even make a course on DataCamp. You can make interactive ones that way.
    Either way, these videos are straight from AI heaven.

    • @chrisvaccaro229
      @chrisvaccaro229 4 years ago +4

      You know what would be really, really useful? If you made a teaching tutorial. Like, if you made a tutorial outlining your teaching philosophy and how you're able to make explainer videos so clear and concise. That way other teachers, professors, or even YouTubers could watch it and apply it to their OWN subjects. That would be like a full-blown meta-improvement to the educational world.

    • @statquest
      @statquest  4 years ago

      Thank you very much! :)

    • @statquest
      @statquest  4 years ago +6

      Wow! That is very flattering. I recently gave a talk at Duke University about my teaching style. The talk was called "The elements of StatQuest". Maybe I'll turn that into a video.

    • @chrisvaccaro229
      @chrisvaccaro229 4 years ago

      @@statquest Yea - please do!

    • @chrisvaccaro229
      @chrisvaccaro229 4 years ago

      @@statquest Is there any chance you have a video copy of the talk in the meantime you'd be willing to send? I just looked up "The elements of StatQuest" and found a zoom link from Duke, but there was no recorded version available. You don't happen to have a recording, do you?

  • @MrRoshanchoudhary
    @MrRoshanchoudhary 6 years ago

    Hi Joshua, Your explanations are mindblowing. I'm loving it. The way you explain each and every notes are simply awesome. I'm grateful to you. Thank you so much. Keep making such videos. Waiting eagerly for Logistic Regression. Bammm!!!!! :)

  • @Lucrezio81
    @Lucrezio81 3 years ago +1

    It's rare to find a video like this. The libraries, scripts, methodology, and processes are so well explained and coherently organized. Even the technical language was amazingly clear for a non-native English speaker like me. I'm amazed that 12 people disliked it!

  • @ffloresalfaro
    @ffloresalfaro 5 years ago +2

    Love your videos! Proximity matrix is excellent. Thanks so much for making these great videos!!

    • @statquest
      @statquest  5 years ago

      Hooray! I'm glad you like StatQuest! :)

  • @justarandomchannel5246
    @justarandomchannel5246 1 year ago +1

    I was falling asleep reading my coursework material; the ukulele touch and the fun bits you put in make this dreadfully boring subject a bit interesting. Thanks mate!

  • @shahrizalmuhammadabdillah3127
    @shahrizalmuhammadabdillah3127 11 months ago +1

    I can't believe I'm only watching this now, and I love this StatQuest. Thanks Josh... you've made me open-minded again about another job.

    • @statquest
      @statquest  11 months ago

      Happy to help!

  • @alexandersierraa
    @alexandersierraa 5 years ago +1

    Thanks a lot Josh, your presentation is very clear and in-depth.

  • @adityanjsg99
    @adityanjsg99 4 years ago +1

    You are such an awesome narrator!
    I depend more on your videos than my teacher.

  • @nathaliatf
    @nathaliatf 5 years ago +1

    Great Video! Efficient and not boring at all!!

  • @rehab4e2
    @rehab4e2 4 years ago +1

    Thank you very much, it is clear and very detailed. I also love your introduction!

  • @shahrizalmuhammadabdillah3127
    @shahrizalmuhammadabdillah3127 11 months ago +1

    The tricks are so fancy, and they help me. I'm cheering as I watch this...

  • @amirgharavi4082
    @amirgharavi4082 5 years ago

    Thanks so much for making these great videos. Really appreciate it

  • @revenez
    @revenez 4 years ago +1

    Brilliant and enjoyable!
    Thank you and please keep up the good work.

  • @teetanrobotics5363
    @teetanrobotics5363 3 years ago +12

    I love your channel and have almost finished the entire ML playlist. Your explanations, animations, and diagrams are just amazing🔥🔥 and far better than most university curricula. I have a request: just like the R tutorials, could you please make Python versions of the machine learning models?

    • @statquest
      @statquest  3 years ago +2

      I'd like to do that as soon as I have time.

    • @pacificbloom1
      @pacificbloom1 3 years ago +1

      @@statquest Kindly consider this a request from one more fan of yours... I really need Python videos, because this is the only channel I have subscribed to for learning data science/machine learning.

  • @ChunLin_UoE
    @ChunLin_UoE 5 years ago +6

    Thank you very much - very detailed explanation! It may be easier to convert the err.rate matrix to a data frame and use tidyr::gather() to transform it for ggplot2.
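    That suggestion would look something like this (a sketch assuming a fitted forest called "model"; tidyr::gather() has since been superseded by pivot_longer(), which works the same way here):

```r
library(tidyr)
library(ggplot2)

## model$err.rate has one row per tree and one column per error type
## (OOB, plus one column per class)
err <- as.data.frame(model$err.rate)
err$Trees <- 1:nrow(err)

## wide -> long: one (Trees, Type, Error) row per measurement
err.long <- gather(err, key = "Type", value = "Error", -Trees)

ggplot(err.long, aes(x = Trees, y = Error, color = Type)) +
  geom_line()
```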

  • @himanshu8006
    @himanshu8006 5 years ago +1

    It can't be explained more easily than this ...... great job Josh, thanks a lot

  • @veducatube5701
    @veducatube5701 4 years ago +6

    Dear Sir!
    You saved a lot of my time and a lot of my energy. Thank You... God Bless You with health and Wealth.
    Please keep making videos and keep saving our lives...

  • @tizhang9635
    @tizhang9635 3 years ago +1

    Thanks very much for your channel!!!! Way easier to understand than reading papers.

  • @francinagoh2541
    @francinagoh2541 3 years ago +1

    Thanks, I learned a lot from your video. Have a nice day!

  • @melaniemax6437
    @melaniemax6437 1 year ago +1

    Thank you so much! Really helpful for me as a beginner in machine learning.

  • @benben0814
    @benben0814 6 years ago

    Hey Josh this is very helpful and thanks for all the work! Does your code include cross validation for the random forest?

  • @anushkabanerjee2510
    @anushkabanerjee2510 1 year ago +1

    Fantastically explained !!

  • @alecvan7143
    @alecvan7143 4 years ago +1

    Super helpful, thanks Josh

    • @statquest
      @statquest  4 years ago

      Hooray! (by the way, you might be in the running for the most comments from a single viewer! Keep'em up!)

  • @user-uz1wz4gp9d
    @user-uz1wz4gp9d 5 years ago

    Fantastic video! Very clear!
    Just one more question: does randomForest work with multiple columns of missing values?

  • @j.jayelynnshin4289
    @j.jayelynnshin4289 3 years ago +2

    I don't understand ppl who clicked on "dislike" at all. Thank you for doing this!!

  • @PetalGamesStudios
    @PetalGamesStudios 4 years ago +1

    Awesome video! Thanks again!

  • @andreaballestero7780
    @andreaballestero7780 3 years ago +1

    This was very helpful, thank you!! :)

    • @statquest
      @statquest  3 years ago

      Glad it was helpful!

  • @mamahotel1308
    @mamahotel1308 5 years ago

    Love this, thank you!

  • @AOLFlyersNewsletters
    @AOLFlyersNewsletters 4 years ago +1

    Josh - you are like a god! Thanks man.

  • @kaam975
    @kaam975 2 years ago +1

    Thanks for the code!

  • @vivianhu3389
    @vivianhu3389 4 years ago +1

    Super Clear! THANK YOU!

  • @christiansetzkorn6241
    @christiansetzkorn6241 3 years ago +1

    Great stuff! Thanks!

  • @kkondur7619
    @kkondur7619 4 years ago

    Hi Josh! If I am not wrong, rfImpute(..) cannot be used if the target variable has missing values, right? What would help in a case where my training dataset has a large number of missing values in the target variable?

  • @hiteshpant
    @hiteshpant 4 years ago +1

    Hi Josh, I really enjoy watching your videos and like the way you have made statistical topics so easy to interpret. Do you have a video on feature selection (varImp) using Random Forest?

  • @yumikowiranto4330
    @yumikowiranto4330 3 years ago +1

    Thank you so much!!!!! This is really helpful for my assignment

    • @statquest
      @statquest  3 years ago +1

      Glad it was helpful!

    • @yumikowiranto4330
      @yumikowiranto4330 3 years ago

      @@statquest Is there a limitation in terms of the kinds of variables I can include as predictors? For example, can I include race (e.g., white, hispanic, african-american, asian, other)?

    • @statquest
      @statquest  3 years ago +1

      @@yumikowiranto4330 As far as I know, there are no limitations on the types of variables you can use as predictors.

  • @thuanpin
    @thuanpin 5 years ago

    Hi, thanks so much for your great lecture. May I ask some questions?
    1) Why did you relabel sex and hd, but not the other categorical variables? The levels of ca and thal change after conversion; do they influence the model?
    2) Do we need to normalize continuous variables before running a random forest?
    Many thanks!

  • @amitt9053
    @amitt9053 5 years ago

    How do you fill in missing values if they are numeric? (For classification, samples could be created using the possible classes, say Y or N.)

  • @wsgsantos
    @wsgsantos 5 years ago +1

    Very good explanation! Thanks from Brazil! :-)

    • @statquest
      @statquest  5 years ago +1

      Muito obrigado! :)

    • @pedrosenna100
      @pedrosenna100 5 years ago +1

      @@statquest I am a professor in an industrial engineering course in Brazil and just discovered your channel; I simply loved the videos! I teach logistics, but I wanted to bring in some data science practices, and your channel is just perfect. I can't thank you enough for the help you've given me by being so didactic!

    • @statquest
      @statquest  5 years ago

      @@pedrosenna100 Hooray!!! I'm so glad to hear that my videos are helpful in Brazil. It's a beautiful country with an amazing culture. I visited once a few years ago and hope to visit again as soon as I can.

  • @fritz3555
    @fritz3555 5 years ago +1

    Thanks for the great video series. What about the randomForestSRC package? If we have data with missing values, is it better to use the randomForestSRC package, or should we use the randomForest package?

    • @statquest
      @statquest  5 years ago

      Unfortunately, I’ve only used the randomForest library, so I can’t tell you which one is better.

  • @JoelAgarwal-yl2kw
    @JoelAgarwal-yl2kw 1 year ago

    Hi Josh! Amazing video - has been super helpful in my understanding. Quick question, how would I find the AUC and ROC curve for the random forest model based on the code that you made? I'm trying to compare different models to see which is best (as well as compare to logistic regression).

    • @statquest
      @statquest  1 year ago

      I show how to do that exact thing (AUC and ROC for random forest) in this video: ruclips.net/video/qcvAqAH60Yw/видео.html
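      For context, a rough sketch of one way to do it with the pROC package (an assumption here, along with the class label "Unhealthy" and the data frame name "data.imputed"): the OOB vote fractions stored in model$votes can serve as the scores for the ROC curve.

```r
library(pROC)

## model$votes holds, for each sample, the OOB fraction of trees voting
## for each class; use one class's vote fraction as the ROC score
roc(data.imputed$hd, model$votes[, "Unhealthy"],
    plot = TRUE, print.auc = TRUE, legacy.axes = TRUE)
```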

  • @sam_AI_Dr
    @sam_AI_Dr 6 years ago +1

    Hello Joshua, at the point where you were determining the optimal number of variables at each internal node, is there a reason why you selected the empty vector length to be 10?

    • @Rpekeno
      @Rpekeno 6 years ago

      I'm new to this and have been wondering, this is the thing they call "curse of dimensionality" isn't it? You wanted to make sure you didn't try out too many variables (increasing dimension, and thus overfitting) or too few variables, did I get it right?
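      For anyone else wondering, the loop in question looks roughly like this (reconstructed from the video, so treat the names as approximate). The vector has length 10 simply because mtry values 1 through 10 are tried, which comfortably brackets the default of sqrt(13) ≈ 3.6 for this 13-predictor dataset:

```r
## try mtry = 1..10 and record the final OOB error rate for each
oob.values <- vector(length = 10)
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed, mtry = i, ntree = 1000)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}

oob.values             # inspect all ten OOB error rates
which.min(oob.values)  # the mtry value with the lowest OOB error
```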

  • @peerzadimusavir9557
    @peerzadimusavir9557 5 years ago

    Sir, please tell me how I can calculate the F-measure for this program.

  • @moniquebrogan7206
    @moniquebrogan7206 2 years ago

    Thanks so much for your great videos. Do you cover Variable Importance in any of your videos?

    • @statquest
      @statquest  2 years ago +1

      Yes. The most conventional approach is with regularization: ruclips.net/video/Q81RR3yKn30/видео.html

  • @angelique3062
    @angelique3062 4 years ago +3

    Thank you Josh! :) You really have a gift for teaching! Could you please do a random forest regression in R?

    • @statquest
      @statquest  4 years ago +3

      Possibly! I'll put it on the to-do list.

    • @imanep4902
      @imanep4902 4 years ago

      @@statquest nice, looking forward to it!

    • @yoyohu6522
      @yoyohu6522 4 years ago

      @@statquest Thanks! looking forward to the RF regression in R.

    • @mariyapak428
      @mariyapak428 2 years ago

      @@statquest -- Thank you Josh!

  • @steliosgiannopoulos8297
    @steliosgiannopoulos8297 3 years ago +1

    Change your nickname to Josh R-Charmer. Excellent work, thank you for all of your videos!!!

  • @PaulO-mv6ku
    @PaulO-mv6ku 5 years ago +1

    Brilliant - many thanks.

  • @Pavijace
    @Pavijace 6 years ago

    ukulele... lol... serious concept explained with fun... thank you... keep going... :-)) "am going home to you"... nice song, btw

  • @charangrewal6113
    @charangrewal6113 6 years ago +1

    How do we know which variables the random forest chose to use in the final model?

    • @statquest
      @statquest  6 years ago +1

      If you build a random forest...
      model

  • @lucpr4501
    @lucpr4501 4 years ago

    Good morning. Thank you for your video and your time. May I ask why you use the randomForest package for a binary response variable (Y equal to 0 or 1)? Shouldn't we use a Bernoulli loss function instead of the quadratic loss function when splits are performed in the tree?

    • @statquest
      @statquest  4 years ago

      For classification, randomForest() uses Gini impurity to decide if it should create a new branch. For more information about how Gini impurity is used, see: ruclips.net/video/7VeUPuFGJHk/видео.html

  • @IamCaptainMan
    @IamCaptainMan 3 years ago +1

    Thanks man, you're awesome!

  • @SergeySkripko
    @SergeySkripko 5 years ago +1

    Josh, you used cmdscale() on a default dist(method="euclidean") matrix. Does that mean you did PCA, according to your MDS and PCoA video?

    • @statquest
      @statquest  5 years ago +1

      Great question! Technically you could say that we did PCA on the distance matrix - but PCA is generally thought of as being applied to the raw data and MDS is applied to a distance matrix. So the difference is sort of in the spirit of how the data is processed, which is relatively minor.

  • @lifeboston853
    @lifeboston853 6 years ago +1

    Hello Joshua, I watched all your videos and they are so awesome! Will you be able to teach us shrinkage methods (Ridge, Lasso and PCR), neural networks, deep learning, image analysis, and video analysis?

    • @lifeboston853
      @lifeboston853 6 years ago

      Thanks so much! I am looking forward to all your future videos :)

  • @jitenjaipuria
    @jitenjaipuria 5 months ago +1

    Thank youuuuuuuuuuuu. I will acknowledge you in my scientific paper.

    • @statquest
      @statquest  5 months ago

      Thank you very much!

  • @serman5671
    @serman5671 1 year ago +1

    so well explained

  • @monicasteffimatchado1780
    @monicasteffimatchado1780 4 years ago +1

    Thank you so much for the clear explanation. I have a microbiome dataset with 133 samples and 431 features. I would like to try RF. How do I decide the range of mtry values?

    • @statquest
      @statquest  4 years ago +2

      I talk about this in the original Random Forest video: ruclips.net/video/J4Wdy0Wc_xQ/видео.html You start with the default, which is the square root of number of variables, but can use cross validation to try other values.

  • @DanTaninecz
    @DanTaninecz 5 years ago +1

    That mtry trick is pretty slick.

  • @random-ds
    @random-ds 5 years ago +1

    Hello, thank you again for your excellent video. However, I still have one question: what is the difference between what you did (rfImpute with 6 iterations) and the MissForest algorithm?
    Thank you again!

    • @statquest
      @statquest  5 years ago

      That's a good question. Unfortunately, I've never used MissForest, so I can't tell you the answer.

  • @drtlfletcher
    @drtlfletcher 2 years ago

    Why do you make the mds plot? Is it to analyse which variables are having the most effect on the classification by looking at the weighting of contributing variables to MDS1 vs 2? Or for more generally exploring the data?

    • @statquest
      @statquest  2 years ago +1

      We draw the MDS plot to see how the samples cluster and to identify potential outliers. Plus it's cool.

  • @gabrielcrone6753
    @gabrielcrone6753 2 months ago

    Hi, Josh. Excellent video! So helpful and clear! 😄 I am using a new version of randomForest, and I cannot seem to locate the err.rate vector within my model object. When I write "model$err.rate", it returns nothing. Do you know if there are equivalent objects now inside the model from which to extract the error rate info? Thanks!

    • @statquest
      @statquest  2 months ago

      What is the exact version you are using? 4.7-1.1 has err.rate. You can see it in the documentation here: cran.r-project.org/web/packages/randomForest/randomForest.pdf

  • @luciapintoferro7747
    @luciapintoferro7747 5 years ago

    First of all, thank you very much for your video! I really appreciate your teaching method.
    I've got a problem: the variable I'm trying to predict has NAs in it, and when using the function rfImpute(name of the variable ~ ., dataset, number of iterations), R returns the following error: "Error in rfImpute.default(m, y, ...) : Can't have NAs in y"

  •  3 months ago +1

    Awesome StatQuest, I did not know you could also impute data using random forests :) How does the analysis of the parameters (ntree, mtry) change if we are doing regression instead of classification? Would also love to see a regression example.

    • @statquest
      @statquest  3 months ago +1

      I've never used it for regression, but I'll keep that topic in mind.

  • @rubenpinnata4626
    @rubenpinnata4626 4 years ago

    Hi Josh! Great videos as always.
    A quick question: once you have declared a variable as a factor, can you use MDS?
    You said it is very similar to PCA, and from what I know, PCA needs scaling, which I am not sure will work with categorical variables unless you one-hot encode them, which I don't see here.
    Can you verify that it's okay to use an MDS plot for data with both continuous and categorical variables?
    Thanks, and stay safe

    • @statquest
      @statquest  4 years ago +1

      We apply MDS to the proximity/distance matrix, which is not the same thing as applying it to the raw data. In other words, the process of creating the proximity matrix converts the factors into distances that are suitable for MDS.

    • @rubenpinnata4626
      @rubenpinnata4626 4 years ago +1

      @@statquest perfect! Thanks as always Josh

  • @rubenpinnata4626
    @rubenpinnata4626 4 years ago

    @statquest At 4:10, the thal values change from "3 = normal; 6 = fixed defect; 7 = reversible defect" to 2, 3 and 4.
    Is this a problem?
    Thank you

  • @iBenutzername
    @iBenutzername 1 year ago +1

    Awesome as always! Can I ask you to make a video about feature importance in RF models?

    • @statquest
      @statquest  1 year ago +1

      I'll keep that in mind.

  • @AngelBautistaVII
    @AngelBautistaVII 1 year ago

    Hello! May I ask what the code would be if I had an unknown sample and we wanted to use the model we built here to classify it as "healthy" or "unhealthy"? Also, what if my unknown sample had missing values in some of the parameters?

    • @statquest
      @statquest  1 year ago

      To make a prediction with a random forest model, called "model", we call predict(model, data), where 'data' is a vector of values. And, unfortunately, the randomForest package has not implemented unsupervised imputation. :(
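      A minimal sketch of that (note that in practice predict() wants the new observations as a data frame whose columns and factor levels match the training data; here a few training rows stand in for "unknown" samples, and na.roughfix() from the randomForest package is one crude option for filling NAs in the predictors before predicting):

```r
## stand-in "unknown" samples: training rows with the outcome column dropped
unknown <- data.imputed[1:3, names(data.imputed) != "hd"]

## crude imputation of any NAs (column medians for numeric, modes for factors)
unknown <- na.roughfix(unknown)

predict(model, newdata = unknown)                 # predicted classes
predict(model, newdata = unknown, type = "prob")  # class vote fractions
```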

  • @sterlingwong9589
    @sterlingwong9589 4 years ago

    Great video! Thanks for the step-by-step guidance. I have a question: would it be possible to program the results of the random forest into Excel so that people can use it as a tool? For example, in the Excel sheet, as long as we input the values for the independent variables included in the random forest model, we would get the predicted result (has heart disease or not) right away. I know it could be extremely time-consuming to type it into Excel manually, since you might have 500 trees. So I was wondering if there are any easier ways/packages that can do this. Any help is appreciated! Thanks!

    • @statquest
      @statquest  4 years ago

      I have no idea how you would do this in excel...

    • @sterlingwong9589
      @sterlingwong9589 4 years ago

      @@statquest I googled and found that we can actually print the rules for each tree in the forest using the "printRandomForests" function ("rattle" package). For example, if using the default number of trees in randomForest, it will print the results of 500 trees with the thresholds of the independent variables and the predicted dependent variable. Then I was thinking about whether the printed results could be programmed into Excel (i.e., using VBA).

  • @RPDBY
    @RPDBY 6 years ago +1

    Thank you for the great tutorial. I am confused, though: why do we need to impute our outcome variable? Is it justified? Wouldn't it be more reasonable to treat the NAs in our outcome variable as unlabeled data and train the model on labeled data only? Imputing an outcome variable seems like a dubious practice, but maybe I am wrong.
    Also, on the technical side, how can we access the actual predicted values per id (i.e., in this case, per patient)? Thanks a lot for the video once again!

    • @statquest
      @statquest  6 years ago +1

      In an ideal world, you would never have to impute anything. But in practice, sometimes data isn't complete and you don't have a lot of it. So, in these situations, you may not have a choice - it's definitely not ideal, though. Your word, "dubious" is a good description!
      You can get the predicted values, which correspond to the rows in the input data, with "model$predicted".

    • @RPDBY
      @RPDBY 6 years ago +1

      Thank you so much for the prompt answers!

  • @HarshKumar-zc4ox
    @HarshKumar-zc4ox 5 years ago +2

    Great job, Starmer. You explained everything quite nicely.
    However, while explaining the confusion matrix, you went wrong: the vertical columns are for the ground truth and the horizontal rows are for the predicted values. The explanation should have been that 28 healthy patients were misclassified as unhealthy patients, but you explained the opposite. Same with the false positives. I saw your confusion matrix lecture; there you explained the confusion matrix correctly.

    • @wei2674
      @wei2674 4 years ago

      Harsh Kumar, I think R outputs it this way so that 0.14 is the type 1 error rate / false positive rate, which means 23 healthy patients were classified as unhealthy (false positives).

  • @beautyisinmind2163
    @beautyisinmind2163 1 year ago

    One question: during the train/test split, choosing different random_state values gives different accuracies on the test set. Why is that? For example, with random_state 0 the test-set accuracy is 82, but when we change random_state to 42 the accuracy changes to 78. Why does this issue arise, and which is the correct model here?

    • @statquest
      @statquest  1 year ago

      Different random subsets of data for training and testing will give different results. One way to deal with this problem is to use cross validation: ruclips.net/video/fSytzGwwBVw/видео.html

  • @4ZaKing
    @4ZaKing 3 years ago

    Thanks for a great video!
    I would be happy to get some guidance, as I am trying to apply this to a "real world" model. It puzzles me that I don't see the division between training and testing data in this example. Does that happen by default when running the glm/random forest model? Sorry if I'm missing something obvious =o

    • @statquest
      @statquest  3 years ago +2

      Because of the bootstrapping that random forests use, they do not typically require separate 'training' and 'testing' datasets. Instead we use the "Out-of-Bag" dataset in lieu of a testing dataset. For more details on "out-of-bag" and bootstrapping with random forests, see: ruclips.net/video/J4Wdy0Wc_xQ/видео.html

  • @bernardromey4084
    @bernardromey4084 4 years ago

    How do you now use your trained RF model to make predictions for data with no response variable? It's easy with regression.

  • @j30ma
    @j30ma 3 years ago

    When I print the model outcome (like at 6:44), I don't get the OOB estimate and confusion matrix. I get "Mean of squared residuals" and "% Var explained". Is this because of an update to the package? Where can I find the OOB estimate and confusion matrix again, or how should I interpret these outputs? Thank you!
    EDIT: It's because I did a regression RF; could you explain how to further analyse these kinds of RF?

    • @statquest
      @statquest  3 years ago +1

      Just replace the classification trees with regression trees. Regression trees are explained here: ruclips.net/video/g9c66TUylZ4/видео.html

  • @srinivasv3268
    @srinivasv3268 5 years ago +2

    Hi, could you please upload a multi-class prediction example? For instance: we have one training and one test dataset; first we need to predict on the training data, then make predictions on the test data.
    Thanks

    • @statquest
      @statquest  5 years ago

      I've only done multi-class prediction in Python, but the documentation for randomForest (the R package) indicates that, just like with Python, there's no difference between predicting two classes and predicting more than two classes.

  • @minhthinhnguyen2884
    @minhthinhnguyen2884 4 years ago

    I got some problems when I ran rfImpute() with my dataset, and the error is: "Error in rfImpute.default(m, y, ...) : No NAs found in m". Must the predictor variables contain missing or NA values?

    • @statquest
      @statquest  4 years ago

      rfImpute() only imputes missing values for the prediction (independent) variables.
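For example (a sketch that deliberately punches NAs into the iris predictors), rfImpute() expects at least one NA among the predictors and imputes only those; the response must be complete:

```r
library(randomForest)

set.seed(42)
iris.na <- iris
iris.na$Sepal.Length[c(3, 50, 120)] <- NA  # create missing predictor values

# rfImpute() fills in NAs among the predictors; it stops with
# "No NAs found" if every predictor value is already present.
iris.imputed <- rfImpute(Species ~ ., data = iris.na, iter = 6)
sum(is.na(iris.imputed))                   # 0 after imputation
```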

  • @balaji.r2735
    @balaji.r2735 4 years ago +1

    Thank you very much

  • @charliepierce6218
    @charliepierce6218 3 years ago +1

    Amazing!

  • @hunadamfeher
    @hunadamfeher 4 years ago

    Dear Josh! What should I do when I get an extremely high class.error value in the confusion matrix (around 79%)? (The total OOB error stays around a satisfactory level of 13%.) Details: I used RF for classification (categorical variable: 0/1) with many continuous and categorical variables, and the proportion of "0" points was "dominant" (86%) over "1" points (14%) in my dataset. 1) Should I create a representative subset of the "0" points? 2) Throw away this model and find a suitable one with cross-validation? 3) Or could a conditional random forest be useful here (though I fear it would produce the same result)?

    • @statquest
      @statquest  4 years ago +1

      I would try Adaboost or Gradient Boost with your dataset. It could be that since so many points are "0", that bootstrapping isn't working as well as it should.
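A hedged sketch of that suggestion using the gbm package (the package choice and the toy data are assumptions; any boosting implementation would work), with a made-up imbalanced 0/1 outcome like the one in the question:

```r
library(gbm)   # assumes the gbm package is installed

set.seed(42)
# Toy imbalanced data: mostly zeros, a minority of ones.
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- rbinom(500, 1, plogis(-2 + d$x1))

# Gradient boosting with a bernoulli (0/1) loss.
model <- gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
             n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)

# Predicted probabilities of class "1".
head(predict(model, d, n.trees = 1000, type = "response"))
```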

    • @hunadamfeher
      @hunadamfeher 4 years ago

      @@statquest Thanks for the tip!

  • @alyerart
    @alyerart 4 years ago

    Josh, I see that you've optimized the RF model's hyperparameters ntree and mtry using OOB error in a fairly simple way. The CRAN package randomForest comes with tuneRF() and rfcv() to do the same thing (sort of) using (nested) cross-validation. Have you tried these yourself? I'm trying to understand the documentation (and use-case examples) but still don't understand how to use them properly ...

    • @statquest
      @statquest  4 years ago

      I haven't used those functions, but they sound very helpful.
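For what it's worth, here is a hedged sketch of tuneRF() based on the randomForest help page (treat the exact argument defaults as an assumption): it searches over mtry, scaling it by stepFactor and keeping a step only if the OOB error improves by at least `improve`.

```r
library(randomForest)

set.seed(42)
# Search for a good mtry by doubling/halving it (stepFactor = 2) and
# keeping a step only if OOB error improves by at least 5%.
tuned <- tuneRF(x = iris[, -5], y = iris$Species,
                ntreeTry = 500, stepFactor = 2, improve = 0.05)
tuned   # matrix of mtry values tried and their OOB errors
```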

  • @victormalmsjo5613
    @victormalmsjo5613 5 years ago +1

    Is there any good way to impute missing values in the test data ?

    • @statquest
      @statquest  5 years ago

      That's a great question. Off the top of my head (and I just woke up, so my head isn't in top shape right now) I can't think of anything...

    • @victormalmsjo5613
      @victormalmsjo5613 5 years ago +1

      Thank you for the quick answer, and for the amazing videos

  • @user-bz8nm6eb6g
    @user-bz8nm6eb6g 4 years ago +1

    Thanks!!

  • @reimiranda3213
    @reimiranda3213 4 years ago +1

    If you have any ecology examples for these stat quests that would be really useful!

  • @andrezaluko
    @andrezaluko 6 years ago +2

    Josh Starmer, I am your fan! You are very funny =D

  • @AR_Wald
    @AR_Wald 3 years ago +1

    Hooray!

  • @jacquelinmontoyahidalgo6714
    @jacquelinmontoyahidalgo6714 2 years ago +1

    Great video! Do you have any tutorial on regression random forests?

  • @ademolaadelekan8400
    @ademolaadelekan8400 1 year ago

    Thanks. How do you use Random Forest for feature selection?

    • @statquest
      @statquest  1 year ago

      You should be able to print out some metrics about how each feature is used in a random forest and use those to decide.
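For example, fitting with importance=TRUE makes randomForest record both the mean decrease in accuracy and the mean decrease in Gini for each feature, which are the usual starting point for feature selection (iris used here just for illustration):

```r
library(randomForest)

set.seed(42)
model <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(model)   # per-feature MeanDecreaseAccuracy / MeanDecreaseGini
varImpPlot(model)   # dot plot of the same metrics
```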

  • @datdao6982
    @datdao6982 3 years ago

    Hi Josh, just a question. In the argument Trees=rep(1:nrow(model$err.rate), times=3), I was wondering what times=3 means? When I tried RF on my dataset, it works if I change times to 1 or 2, but with times=3 it doesn't work. I was hoping you could dwell on that a little more. Thank you

    • @statquest
      @statquest  3 years ago

      In this example, the matrix "model$err.rate" has 3 columns: one for "OOB", which describes how well the OOB data performed on the trees in the random forest (where the random forest contains between 1 and 500 trees, or 1 and 1000 trees), one for "Healthy", which describes how the people labeled "Healthy" performed on those trees, and one for "Unhealthy", which describes how the people labeled "Unhealthy" performed. We want to plot each of these columns in a graph where the y-axis represents the error and the x-axis represents the size of the random forest, so x=1 represents a random forest with 1 tree and x=500 represents a random forest with 500 trees. So, with "Trees=rep(1:nrow(model$err.rate), times=3)", we are creating x-axis values that correspond to the values in each of the 3 columns of the "model$err.rate" matrix. So, check how many columns are in model$err.rate and set "times=" to that number of columns.
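That reshaping looks roughly like this (a sketch following the video's code, assuming `model` is a fitted classification randomForest so that model$err.rate has columns like OOB/Healthy/Unhealthy); written with ncol() so it adapts to however many columns err.rate actually has:

```r
library(ggplot2)

# model$err.rate has one row per tree and one column per error series.
n <- nrow(model$err.rate)
k <- ncol(model$err.rate)            # set times= to this column count

oob.error.data <- data.frame(
  Trees = rep(1:n, times = k),                      # x-axis, repeated per column
  Type  = rep(colnames(model$err.rate), each = n),  # which column each value came from
  Error = as.vector(model$err.rate))                # columns stacked into one vector

ggplot(oob.error.data, aes(x = Trees, y = Error, color = Type)) +
  geom_line()
```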

    • @datdao6982
      @datdao6982 3 years ago

      ​@@statquest Thank you very much. One more question if possible. When I compute the oob.values and substitute the corresponding mtry into the model, it seems to do worse than the default setting. I.e., I created a new model with mtry=7 (the default setting is 5), and it turns out the OOB estimate went from 12.66% up to 13.16%. Maybe it is not a lot, but I just wonder why that would be, since the for-loop method should be an optimization method, yet it isn't in my case.

    • @statquest
      @statquest  3 years ago

      @@datdao6982 I'm not sure I understand your question. In the video at 11:46, we have a loop that tries different values for "mtry". In this example, we found that the default setting was ideal. It sounds like you're getting a similar result.
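The loop mentioned at 11:46 looks roughly like this (adapted from the video's code, with the built-in iris data standing in for your dataset as an assumption): for each candidate mtry, fit a forest and record the final OOB error.

```r
library(randomForest)

set.seed(42)
# Try mtry = 1..4 and record the final OOB error rate for each.
oob.values <- vector(length = 4)
for (i in 1:4) {
  temp.model <- randomForest(Species ~ ., data = iris, mtry = i, ntree = 500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), "OOB"]
}
oob.values
which.min(oob.values)   # the mtry value with the lowest OOB error
```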

    • @datdao6982
      @datdao6982 3 years ago

      @@statquest Sorry, perhaps I wasn't being clear. What I mean is that I used a different dataset (actually, the Math one in Cortez & Silvia (2008)). In my case the default setting wasn't ideal, so I changed the mtry value to the supposedly "optimal" value, but then the model doesn't fit as well (the OOB error increases, as I mentioned in my previous comment, and there are more wrong classifications in the confusion matrix).

  • @afcc777f
    @afcc777f 6 years ago +14

    Can you make a video about random forests for regression in R?
    Thanks

    • @statquest
      @statquest  6 years ago +8

      I've added it to the to-do list, but it might be a while before I get to it.

    • @afcc777f
      @afcc777f 6 years ago +2

      thanks

    • @baherazzam8863
      @baherazzam8863 6 years ago +3

      Thank you! I am also looking forward to that

    • @cynical_dd
      @cynical_dd 6 years ago +3

      Hi, I'm hoping for this too! Pretty pleaseeee, thank you!

    • @rajatbhosale8188
      @rajatbhosale8188 5 years ago +2

      Even I would like to get that.