Dear Sir, thank you for your video. Can you do a tutorial in R where multiple models (decision tree, random forest, gradient boosting, logistic regression, etc.) are compared with each other on the same chart using ROC curves, split by training vs. validation data set? It would be a great help for this type of visualization, especially when presenting to management. Thank you!
Thanks for the comments and the suggestion, which I'll work on in the near future. Meanwhile, here is a link where you can quickly get an ROC plot that compares several methods such as decision tree, logistic regression, SVM, random forest, etc., on the same plot. ruclips.net/video/J2a9yV3kl-M/видео.html
Thank you Sir. The above YouTube tutorial is really good. Looking forward to your tutorial comparing multiple classification models in one graph, split between train and validate.
Thank you for your video. I'd like to know what you mean by "set.seed(1234)". Why not use set.seed(2) or some other number? And can we use "ifelse" instead of defining "pd"? Which way is better?
+Info set.seed(1234) is just an example; you may use any other number. The idea is to make results reproducible, which any number can achieve. 'pd' stands for 'partitioning data' and it's just a name; any other name will be fine too.
Hi sir, thank you for the video, it's very helpful! But I still do not understand why your model could not predict the 3 model? If we use all the items, could we predict more precisely? Thank you!
Hello sir, I'm implementing the same steps for my own set of data. But I am getting an error in the Misclassification part as "all arguments must have the same length". Will it be ok if you can check my code and let me know where I am going wrong? If it's ok for you then I will send you the code and data.
For this, you will have to build an rpart model and then prune the tree based on the CP value (check printcp(rpart_model) and choose the cp value with the lowest cross-validated error to prune the tree further).
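A minimal sketch of the pruning steps described above, assuming the built-in iris data as a stand-in for the video's data set and assuming that "minimum" refers to the lowest cross-validated error (the xerror column of the CP table). The control values are illustrative only.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(1234)  # rpart's internal cross-validation is random

# Grow a deliberately large tree (illustrative data and settings)
rpart_model <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0.001, minsplit = 5))

printcp(rpart_model)  # CP table, including the cross-validated error (xerror)

# Pick the CP value whose cross-validated error is lowest
cp_table <- rpart_model$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned_model <- prune(rpart_model, cp = best_cp)
```

The pruned tree can never have more nodes than the original, so comparing nrow(pruned_model$frame) with nrow(rpart_model$frame) is a quick sanity check.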
RStudio doubt: I am building a predictive model with 1 million observations and 15 variables. I am getting errors like "Can not allocate the vector of 432GB" or "Can not allocate the vector of 3.8 GB". I am using 16 GB RAM, my file size is just 140 MB, and I closed all other applications on my system, but the error remains the same. Any suggestions are much appreciated.
With huge data, you can take a sample for creating the model. The difference between a model based on a good sample and one based on all the data may not be significant. You can also try faster algorithms such as extreme gradient boosting: ruclips.net/video/woVTNwRrFHE/видео.html
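A small sketch of the sampling idea, assuming a simulated data frame (big_data, a hypothetical name) in place of the real file and an arbitrary 10% sample size:

```r
set.seed(1234)  # so the same rows are drawn each time

# Hypothetical large data frame standing in for the real data
n <- 100000
big_data <- data.frame(
  x1 = rnorm(n),
  x2 = runif(n),
  y  = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Draw 10% of the row numbers without replacement and keep only those rows
sample_rows <- sample(nrow(big_data), size = 0.1 * nrow(big_data))
small_data  <- big_data[sample_rows, ]

# Compare class shares to confirm the sample resembles the full data
round(prop.table(table(big_data$y)), 3)
round(prop.table(table(small_data$y)), 3)
```

The model is then fitted on small_data; whether 10% is enough depends on the data, so it is worth repeating with a couple of sample sizes.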
When you have decision trees that are too big, 'pruning' helps to reduce the size of the tree by removing those parts that do not help much in correct prediction of the outcome. It helps to avoid over-fitting and improve prediction model accuracy.
Hello Bharatendra Sir, can you please guide me on how to implement the perturbation method in R? Currently I have classified data using a decision tree. Now I want to perturb the data and follow the same classification. I am unable to proceed. Can you please upload some videos illustrating how to implement the perturbation method using R? It's very urgent for me.
Megha, here is the link for perturbation analysis. Note that it can be used only for regression-like models. It may not work with decision trees. ruclips.net/video/Jz97ccAIyj8/видео.html
The decision tree algorithm will automatically choose the attributes or independent variables depending on the parameters that you choose, such as minimum sample size for splitting, statistical significance, etc.
When terminal nodes have very small sample sizes, the decision tree model is likely to overfit. Due to small sample sizes, decisions arrived at in the terminal nodes may not be very stable.
I watched your videos to help through a data analytics degree and I'm now working in a job type similar to business analyst and looking back at these videos. Very easy to follow, punctual, and informative for getting the job done. Thank you
You are welcome and good luck!
Sir, I've been following a lot of courses but never found something with so much clarity. Thanks for posting these!
Thanks for the feedback!
Thanks Doc, after my 6 hrs class ...you went through all my confusions in just 18:43 mins. Such a worthy job!!!
Thanks for your feedback and comments!
Take a bow sir! For the first time, I had full clarity on Decision Tree and its usage! Thanks a lot for this superb tutorial, lucky to find your channel, stay blessed! 👌👍🙏
Thanks for comments!
You really made it simple. I have been watching others tutorial, but not anymore. I already subscribed. Thanks a lot.
You are welcome!
Hello sir, your way of explaining is so simple and effective. You made the topic simple.
I would like to add a comment for everyone as well: I was getting an error while using controls=ctree_control, and after some googling and forum support, I am now able to run it and view the tree. Great work sir.
Thanks for the update!
Hello Dr. Rai, thank you for a very informative video.
One thing that I would like to add based on my limited knowledge:
For a skewed class distribution such as in the data, it is more important that the model is able to predict the abnormal cases than it is to predict normal cases. If we just look at the mis-classification error, then the model may be aligned towards the class with the higher percentage of data. One way to avoid that is to reduce the disparity between the class types by over/under sampling techniques. Another way is to use the area under the precision-recall curve as a measure of model evaluation.
Your comments and feedback on this would be appreciated.
That's correct. For more details about class imbalance problem, refer to this link:
ruclips.net/video/Ho2Klvzjegg/видео.html
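To make the over-sampling idea from the comment above concrete, here is a minimal base-R sketch. The data frame, the 90/10 class split, and the variable names are all made up for illustration; it is plain random over-sampling, not necessarily the method used in the linked video.

```r
set.seed(1234)

# Hypothetical imbalanced data: 90 "normal" rows and 10 "abnormal" rows
df <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("normal", "abnormal"), times = c(90, 10)))
)

counts <- table(df$y)
target <- max(counts)  # bring every class up to the majority-class size

# For each class, resample its rows with replacement until it reaches `target`
idx <- unlist(lapply(names(counts), function(cls) {
  rows <- which(df$y == cls)
  if (length(rows) < target) {
    rows[sample.int(length(rows), target, replace = TRUE)]
  } else {
    rows
  }
}))

balanced <- df[idx, ]
table(balanced$y)  # both classes now have the same count
```

Over-sampling is normally applied to the training partition only, so that the validation data still reflects the original class mix.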
Hi, video posted 4 years ago today has become a saviour for my internal assessment
Thank you 😃
Welcome! You may also find this recent one useful:
ruclips.net/p/PL34t5iLfZddvGr66DPf-L-sSJ50XNwN3K
You are truly remarkable! The way you explain things is very simple to understand.
Thanks for comments!
Thank you again for these complete episodes. You have been of a great help to me "Rai". Please, I'd appreciate a complete episode on the ensembles, essentially, heterogeneous ensemble using DT, SVM etc. inclusive as the base classifiers.
Comprehensive videos on ensembles are not common, in fact, I haven't come across any. It will go a long way If you could put something together on this. Thank you for your help!
Thanks for the suggestion, I'll do it in near future!
Sounds really great. Looking forward to it. Can't wait!
Sir, so much clarity ...How simple and easy you created ! Thank you .
Thanks for comments!
This is a great example of decision trees. Thank you!
Thanks for comments!
Dr Rai, thanks for your videos. I have found them useful in explaining basic machine learning methods. Thank you!
Thanks for comments!
Thank you so much for your videos. I am learning everyday with them. May God bless you
Thanks for comments!
Thank you Professor Rai for taking the time to show us the ropes. Regarding the mis-classification error table, may I know: what is the difference between that and the Confusion Matrix. I notice the calculation for "accuracy" is the same as the Confusion Matrix, simply "sum(diag(tab))/sum(tab)", but for Confusion Matrix, the Actual is on the vertical versus what you stated in video for Actuals in the horizontal. Thanks, and looking forward to more videos from you
A confusion matrix and a mis-classification table are the same thing.
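A short illustration of why the orientation does not matter for accuracy, using made-up actual and predicted labels (not the video's data). Transposing the table swaps rows and columns but leaves the diagonal, and therefore sum(diag(tab))/sum(tab), unchanged.

```r
# Made-up labels for 10 cases with three classes
actual    <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 1, 2), levels = 1:3)
predicted <- factor(c(1, 1, 2, 2, 2, 3, 1, 3, 1, 2), levels = 1:3)

# Predicted in rows, actual in columns
tab <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(tab)) / sum(tab)   # share of correct predictions
error    <- 1 - accuracy                # mis-classification error

# Same table with actual in rows and predicted in columns
tab_t <- t(tab)
accuracy_t <- sum(diag(tab_t)) / sum(tab_t)

accuracy == accuracy_t  # TRUE: the diagonal is the same either way
```

Naming the dimensions (Predicted, Actual) in the table() call is what keeps the orientation readable in the printed output.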
Sir,
Greetings from the US,
I have enrolled in the machine learning course through Udemy as well, but your explanation is super simple and easier to implement.
Please do guide me with any book which I can use to practice more of such datasets
Deep learning is currently the hottest topic within the machine learning field. To get started with practical examples you can try:
www.amazon.com/Advanced-Deep-Learning-designing-improving/dp/1789538777
Sir, it'd be really nice if you could make a blog explaining the output in more detail. For instance, an explanation of the statistical parameters measured in the confusion matrix. Your videos are really helpful! :)
Thanks for your comments and suggestion! You may find decision tree related explanations in the following video too:
ruclips.net/video/J2a9yV3kl-M/видео.html
Hi Sir, I am so glad to see all your videos related to machine learning in R. One request: it would be great if you could share the datasets you used in your sessions.
You can get the data file from the link in the description area below the video.
Wow, thank you sir! Please make a video on the entropy splitting calculation; it would be very useful.
Thanks for the suggestion, I've added it to my list.
Your videos are honestly so amazing.
Thanks for comments!
very very clear and helpful. thanks tons
Thanks for comments!
Sir, I have a few questions:
1. How do you find statistically significant variables after developing a decision tree model with all variables?
2. Suppose all variables in a decision tree are coded as POOR, FAIR, GOOD, then how to find the probabilities of each (POOR, FAIR, GOOD) at non-terminal nodes of the tree and also the number of samples in each category? I need to show this in my plot.
3. What is the best approach in developing a decision tree model? Developing a model on the training data using K Fold Cross Validation OR Developing a model on training data and then going for cross-validation and pruning process using a function like cv.tree() which allows us to choose the tree with lowest cross validation error rate? Which method is better?
4. How to find out the value of the standardized importance of independent variables using CART in R?
1. P-values on the tree indicate statistical significance.
2. You can find it only at the terminal node.
3. k-fold CV is always better to avoid over-fitting.
4. The higher a variable is on the tree, the more important it is. For variable importance you can also try this link:
ruclips.net/video/dJclNIN-TPo/видео.html
Hi! thank you for all your videos.
I'd just like to make a little comment: the ctree function implements 'Conditional Inference Tree', not 'Classification Tree'. In fact, it can develop classification trees, but the fundamentals are different.
Thank you for all the work you are doing! Very useful.
Carlos
Thanks for the update!
Thank you for your help and all your videos. They have helped me a lot.
Thanks for your comments!
Very informative and easy to understand. Thanks for sharing such a useful video.
Thanks for the feedback!
Really Great Explanation
Thanks for comments!
Also here is a link to more recent one:
ruclips.net/video/RCdu0z2Vyrw/видео.html
Your videos are always very easy to follow!!
+Tara Paider thanks for the feedback 👍
This video has helped me a lot with my assignment, thank you so much.
that's great!
Thanks sir for detailed video..
Most welcome!
You're the man. Keep up the great work!
Love all your videos... Please keep uploading.
+pradeep paul Thanks for your feedback!
Hi Sir, your videos are really helpful. They have really helped me a lot. I have a few doubts though. I have just started learning data science, so these doubts may be naive.
1) On what basis do we decide how much data to put into training, validation, and testing respectively?
2) Is there any criterion (such as R-squared in regression models, chi-square for logistic regression) for decision trees so that we can say how good our model is?
1) one may experiment with different partitions such as 50:50, 60:40, 70:30, etc., and see what works best. There is no single partition ratio that will work well in all situations.
2) if your y variable is categorical, mis-classification error is used for model performance assessment.
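A rough sketch of the "try several partitions" advice, assuming the built-in iris data and an rpart tree in place of the video's data and model. The helper name split_error is hypothetical; it returns the mis-classification error on the validation part for a given training share.

```r
library(rpart)  # ships with standard R installations

# Hypothetical helper: split the data, fit a tree, return validation error
split_error <- function(train_share, data, seed = 1234) {
  set.seed(seed)
  pd <- sample(2, nrow(data), replace = TRUE,
               prob = c(train_share, 1 - train_share))
  train    <- data[pd == 1, ]
  validate <- data[pd == 2, ]

  fit  <- rpart(Species ~ ., data = train, method = "class")
  pred <- predict(fit, newdata = validate, type = "class")

  tab <- table(Predicted = pred, Actual = validate$Species)
  1 - sum(diag(tab)) / sum(tab)   # mis-classification error
}

# Compare 50:50, 60:40 and 70:30 partitions
errors <- sapply(c(0.5, 0.6, 0.7), split_error, data = iris)
names(errors) <- c("50:50", "60:40", "70:30")
errors
```

With a small data set the errors will move around from seed to seed, so a single run should not be over-interpreted.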
Thank you, sir!!
Thank you Dr Rai. I have a question about the tree pruning. Prior to the pruning some of the trees were able to classify patients as pathological but after pruning( by changing the control functions) none of the trees identify the pathological patients. If we were to specifically identify patients with suspected pathology how can we modify the control functions or the initial formula included in the "ctree()" function?
Greetings! I came back to this video after a while as it still seems to be the best one regarding Decision Trees out there. I have a question regarding significance of variables. Do you have a video covering this subject? Any techniques I could apply while working on my Decision Tree? Thank you.
You can use this link. For tree based methods, it provides variable importance plots to show which variables are important and which ones do not contribute much.
ruclips.net/video/hCLKMiZBTrU/видео.html
Nice explanation. thanks.
Sir plz can u suggest a good book for beginners in machine learning to have basic knowledge of all statistical tools ??
brilliant! thank you Dr
You're most welcome!
Thank you so much for your awesome video. I've learned a lot from it.
Thanks for your feedback!
Thanks for your feedback!
That was awesome but I found that with my dataset I get a completely different decision tree using the rpart package. Without rpart, the tree is what I expected it to be and with rpart - in some ways it's almost opposite. I'm only comparing the two trees with my training data.
I think I know what the problem is - with rpart trees you only get a little "yes" and "no" marker on the root node. In my case "yes" goes to the left of the tree and "no" goes to the right of the tree. If I assume that direction is always the case then things are okay. I do wish that the little white "yes"/"no" boxes were printed at every non-leaf node so it's very clear which way the path is going. (I wonder if there's an option for that?) Thanks for the great video.
See link below that has more detailed coverage:
ruclips.net/video/6SMrjEwFiQY/видео.html
Hi Dr.Rai. I encountered an error on #Misclassification part. I got the table for using the library(party), but I got "all argument must have the same length" when using the rpart() one. But if I use validate set with the rpart package, the table can be generated.
Difficult to say much without looking at the code. But you can review your code again; there may be a typo.
Is it only useful for numerical data? when all the independent variable are continuous? or it can be used for categorical ones too?
It's useful for both. See this more detailed example:
ruclips.net/video/6SMrjEwFiQY/видео.html
Reading /Preparing csv data : 0:32
Decision Tree using rpart Package : 11:22
Thanks!
How can we export the first tree prediction (View(predict(tree,validate,type="prob"))) into XL? When using a data frame they come out horizontally and unreadable.
Hi sir ,
I am S. Nagaraj Adiga. Your videos are very simple to follow and easy to understand. Thank you very much.
Thanks for the feedback!
in confusion matrix(tab), the column is predicted data and row-wise actual data
In this video I have used predicted data in row and actual in column for the confusion matrix.
Kindly check it: table(predict(tree), data$NSP). Then the output will be as follows: columns are predicted data and rows are actual data.
Try this, it will make it more clear:
table(Predicted = predict(tree), Actual = data$NSP)
I am a beginner. Could you help me understand if we can use linear/logistic regression to do the prediction here? I have referred to your vehicle example and got confused about whether we can use that model here.
Yes, you can use logistic regression as the response variable is of factor type. For more see:
ruclips.net/video/AVx7Wc1CQ7Y/видео.html
Would this work just as well if some variables were categorical? I.e. written in text but limited options
Thanks for the video
Yes, absolutely
You may also try this link:
ruclips.net/p/PL34t5iLfZddvGr66DPf-L-sSJ50XNwN3K
Thank you, great channel. Subscribed!
Thanks!
Sir, if you could create a video for how to calculate gini, KS using R that would be really great
Thanks for the suggestion, I've added this to my list.
ctree doesn't support dates. I tried dates converted from POSIX. Can you please suggest the parameter in ctree that resolves this problem?
A decision tree is not a good method for working with dates. For dates you should use time series:
ruclips.net/p/PL34t5iLfZddt9X6Q6aq0H38gn-_JQ1RjS
Nice video Bharatendra. One question.. you said that we need to optimize the model.... how to do that ie how to optimize our model! Thanks
You can make changes to settings in 'control' to see what helps to improve the model. In the example, I used only 3 variables just for illustration, but you must start with all variables for a better performance.
thanks :)
thank you ! so helpful !
Thanks for comments!
Hello! I gave my decision tree 97 different features but the decision tree only picked one of these features to make its decision. Is it normal that it doesn't consider all the features for its decision?
It runs with default setting. By making changes to default settings you may be able to make it include some more. But features that have very little impact on the response are unlikely to be included.
It can happen when one of the features is a close predictor for y. Then that feature alone is enough to predict y.
i followed ur method with a dataset i created...its a simple one but the output is just printing the values of my dataset rather than plotting a tree and predicting...can u help me understand why
Difficult to say much without looking at data and code.
Sir, please post a video on Regression Splines, Polynomial Regression & Step Functions etc
Thanks for the suggestion, I've added it to my list.
great video, everything explained step by step. I have a question tho. some of my data in the DB file is char and i keep getting an error "data class "character" is not supported". how can i include this data in my experiments?
You change such variables to ‘factor’.
@@bkrai omg thank you. so I can just use data$variableF
yes that should work.
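A minimal sketch of the character-to-factor conversion discussed in this thread, on a small made-up data frame (the column names are hypothetical). It shows converting one column by name and then converting every remaining character column in one step.

```r
# Made-up data with two character columns
data <- data.frame(
  age    = c(25, 40, 31, 58),
  grade  = c("POOR", "GOOD", "FAIR", "GOOD"),
  region = c("north", "south", "south", "east"),
  stringsAsFactors = FALSE
)

# Convert a single column by name
data$grade <- as.factor(data$grade)

# Convert all remaining character columns at once
char_cols <- sapply(data, is.character)
data[char_cols] <- lapply(data[char_cols], as.factor)

str(data)  # grade and region are now factors, age stays numeric
```

After this step, tree functions that reject the "character" class should accept the data.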
How do you remove ticks on the axes? Or realign the axis labels?
Amazing, thanks
A comparative analysis on pre/post pruning of model would have completed the tutorial on Decision Tree.
Thank you, Dr. Roy, for sharing a simple and detailed explanation of decision trees. My query is: can we plot an ROC curve for multiclass data? (The pROC package can calculate the AUC, but I could not find how to plot an ROC graph for multinomial data.)
At this time it only does it for binomial situation. You can now find roc curve video here:
ruclips.net/video/ypO1DPEKYFo/видео.html
how can we validate the accuracy or discriminatory from this model?
i believe you can use the model outputs from train and validate to somehow calculate chi-square etc?
You can validate the model built on training data with the help of validate data.
Hi,
I wanted to ask which is most appropriate software for conducting SEM along with moderation analysis, in case of categorical, nominal (binary and multinomial) and ordinal variables as outcome/dependent/endogenous variables ?
P.S:The predictor variables are scale,nominal and ordinal variables.
Regards
let me ask, top of the variable of the picture is not dependent variable right? 5:46
It's an independent variable.
@@bkrai sir can i ask some simple questions about tree diagram if you do not mind. I leave it here my gmail adress: ogzhnyvzz@gmail.com
Hi sir, how do we apply the test set to the predict function when the target variable has NA values? When I run the function it says predictor must have 2 levels.
You need to impute missing values before developing the model.
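One simple way to do the imputation mentioned above is sketched below, assuming median imputation for numeric columns and most-frequent-level imputation for factors. The data frame is made up and these are only basic choices; other imputation methods may suit the data better.

```r
# Made-up data with one missing numeric value and one missing factor value
df <- data.frame(
  x = c(1, NA, 3, 5),
  g = factor(c("a", NA, "a", "b"))
)

# Numeric column: replace NA with the median of the observed values
df$x[is.na(df$x)] <- median(df$x, na.rm = TRUE)

# Factor column: replace NA with the most frequent observed level
mode_level <- names(which.max(table(df$g)))
df$g[is.na(df$g)] <- mode_level

anyNA(df)  # FALSE once every NA has been filled
```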
Is it just for numeric variables? Is there other code for character variables?
Change character variables to factor variables before using this.
very helpful thanks
Thanks for comments!
Thanks for comments!
very nice explanation keep it up
thanks for the feedback!
Thanks sir for this interesting video. I am facing a problem. My dependent variable is binary (0,1). When I run predict, the estimated values appear in decimals despite removing "type". So, the misclassification error is close to 1. Could you please suggest how I can get the predicted value as 0/1?
When I try to create the misclassification table, it always gives me an error "all arguments have to be the same". Please, what can I do? I am new to data science.
I am also getting same error message
You should always pass the model as the first argument in the predict function. The second parameter should be a data frame containing the predictor variables. You can specify type="prob" as an extra argument to get probabilities for every level of y, or type="class" to directly get the predicted class. The default for the type argument depends on the package that built the model, so it is safer to set it explicitly.
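The advice above can be sketched with rpart on the built-in iris data (a stand-in for the video's data; the 80:20 split is illustrative). The "all arguments must have the same length" error comes from table() when the prediction vector and the actual values come from data sets of different sizes, for example predictions made on the training data but actuals taken from the validation data.

```r
library(rpart)  # ships with standard R installations

set.seed(1234)
pd <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train    <- iris[pd == 1, ]
validate <- iris[pd == 2, ]

model <- rpart(Species ~ ., data = train, method = "class")

# Model first, then the data to predict on, then the type explicitly
pred_class <- predict(model, newdata = validate, type = "class")  # factor of classes
pred_prob  <- predict(model, newdata = validate, type = "prob")   # matrix of probabilities

# Predictions and actuals must come from the same rows
length(pred_class) == nrow(validate)

tab <- table(Predicted = pred_class, Actual = validate$Species)
1 - sum(diag(tab)) / sum(tab)   # mis-classification error on validation data
```

If newdata is left out, predict() returns predictions for the training rows, which is a common way to end up with mismatched lengths.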
Thanks for the update!
Sir, what is the difference between rpart() and ctree()? And when should each be used?
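They are two different algorithms, not just different ways to represent a tree. rpart() follows the CART approach and chooses splits using an impurity measure such as the Gini index, while ctree() builds conditional inference trees and chooses splits using statistical significance tests.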
Hi sir, nice explanation... learnt about the ctree function. Can you please illustrate how we can tune the decision tree model?
Around the 7:30 mark in the video, tuning is shown using "mincriterion" and "minsplit".
Bharatendra Rai my mistake sir... I meant pruning the decision tree.
You can do pruning by increasing the values of "mincriterion" and "minsplit".
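A rough sketch of the effect of these two controls, assuming the party package is installed and using the built-in iris data rather than the data from the video. The specific values (0.50 vs 0.99, 10 vs 60) are arbitrary choices for illustration.

```r
library(party)  # assumes install.packages("party") has already been run

# Loose settings: splits allowed at lower confidence and in smaller nodes
loose <- ctree(Species ~ ., data = iris,
               controls = ctree_control(mincriterion = 0.50, minsplit = 10))

# Strict settings: a split needs higher confidence and a larger node
strict <- ctree(Species ~ ., data = iris,
                controls = ctree_control(mincriterion = 0.99, minsplit = 60))

# Count terminal nodes via the node each training row falls into
n_loose  <- length(unique(where(loose)))
n_strict <- length(unique(where(strict)))
c(loose = n_loose, strict = n_strict)
```

The stricter settings can only stop splitting earlier, so the strict tree should never have more terminal nodes than the loose one.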
Bharatendra Rai thank you for clarifying, sir.
Like your videos... can you upload some on ensembles and AIC as well? It would be very kind of you.
Thanks for comments and suggestion, I've added it to my list.
Thanks a ton!!
+Preeyank Pable 👍👍👍
Sir, please tell me about classical or crisp decision trees.
Very useful.
Thanks!
I have a query; I tried to Google it but couldn't find a satisfactory answer. The question is: what is the difference between a ctree and an rpart tree?
+Mahum Khan ctree is a function within a package called "party" for decision trees. Similarly, rpart is a function within a package with the same name, "rpart". Both are used for decision trees. I prefer party as it is said to be more accurate. If you search "party vs rpart" you can see many good explanations.
Sir, why do you use set.seed(1234)? Why not set.seed(12345)? Can you please explain?
It can be any number, but to get the same samples, use the same number next time too.
Hello sir... can you please make a tutorial on how to implement FP-growth in RStudio? It's urgent! Please help!
Sir, can you provide the dataset that you used?
There is a link below this video.
Excuse me, sir. Can you help me? I tried this script on my data. I have 100 observations of 1383 variables. I got the result "Conditional inference tree with 1 terminal nodes" and "Number of observations: 83". However, I can't get the decision tree; I just get the histogram. Can you help me, sir? Why does this happen? Thank you, sir.
+Gebri Adinda you can send data and I can look into it.
@@bkrai, sir, I get the same error, "Conditional inference tree with 1 terminal nodes", with only a histogram and number of observations = 144. Can you help?
Sir, could you make a video on random forest?
Dear Sir
Thank you for your video. Can you do a tutorial in R where multiple models (decision tree, random forest, gradient boosting, logistic regression, etc.) are compared with each other on the same chart using ROC curves, split by training vs. validation data set? This type of visualization would be a great help, especially when presenting to management. Thank you!
Thanks for comments and suggestion that I'll work on in near future. Meanwhile here is a link where you can quickly get ROC that plots and compares several methods such as decision tree, logistic regression, svm, random forest, etc., on the same ROC plot.
ruclips.net/video/J2a9yV3kl-M/видео.html
Thank you Sir. The above YouTube tutorial is really good. Looking forward to your awesome tutorial comparing multiple classification models in one graph, split between train and validate.
Thanks!
Sir how would you increase the number of nodes?
You can change mincriterion and minsplit in the controls part for that.
For a more recent one, see below:
ruclips.net/p/PL34t5iLfZddvGr66DPf-L-sSJ50XNwN3K
Thank you for your video. I'd like to know what you mean by "set.seed(1234)". Why not use set.seed(2) or some other number?
And can we use "ifelse" instead of defining "pd"? Which way is better?
+Info A set.seed(1234) is just an example; you may use any other number. The idea is to make results reproducible, which any number can achieve. 'pd' was used for 'partitioning data' and it's just a name; you may use any other name and that will be fine too.
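A short sketch of what the seed does, using the built-in iris data as a stand-in. The 80/20 split and the names `pd_a`, `pd_b`, `train` and `validate` are assumptions made for illustration.

```r
# Same seed, same random draws
set.seed(1234)
pd_a <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))

set.seed(1234)
pd_b <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))

identical(pd_a, pd_b)  # TRUE

# Use the indicator to partition the data: 1 = training, 2 = validation
train    <- iris[pd_a == 1, ]
validate <- iris[pd_a == 2, ]
c(train = nrow(train), validate = nrow(validate))
```

Any seed works in the same way. A different seed will generally give a different, but equally valid, partition.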
Hi sir, thank you for the video, it's very helpful! But I still do not understand why your model could not predict class 3. If we use all the items, could we predict more precisely? Thank you!
That's correct! To obtain the final model we need to include all items and that will improve model performance.
Hello sir, I'm implementing the same steps for my own set of data. But I am getting an error in the Misclassification part as "all arguments must have the same length". Will it be ok if you can check my code and let me know where I am going wrong? If it's ok for you then I will send you the code and data.
yes send the code.
Thank you sir. To which email ID should I send the code? My email ID is subashinivec@gmail.com
Sir, how do I choose the complexity parameter (CP value) for tree pruning?
For this, you will have to build an rpart model; you can then prune the tree based on the CP value. Use printcp(rpart_model) to list the CP table, and choose the CP value with the minimum cross-validated error (xerror) to prune the tree.
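The steps in this reply can be sketched as follows, using the built-in iris data as a stand-in. Growing the tree with cp = 0 and minsplit = 5 first is an assumption made so that there is something to prune.

```r
library(rpart)  # ships with standard R installations

set.seed(1234)  # cross-validation inside rpart is random

# Grow a deliberately large tree
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 5))

printcp(fit)  # CP table with cross-validated error (xerror) per tree size

# Pick the CP value with the smallest cross-validated error
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned <- prune(fit, cp = best_cp)
print(pruned)
```

A common alternative is the one-standard-error rule, which picks the simplest tree whose xerror is within one xstd of the minimum.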
Could you please explain this to me a little bit more?
pd
You can go over this that has more detail:
ruclips.net/video/aS1O8EiGLdg/видео.html
RStudio doubt:
I am building a predictive model with 1 million observations and 15 variables. I am getting errors like "Can not allocate the vector of 432GB"
or "Can not allocate the vector of 3.8 GB".
I have 16 GB of RAM and my file size is just 140 MB. I closed all other applications on my system, but the error remains the same.
Any suggestions would be much appreciated.
You can probably take a sample for creating the model when the data is huge. The difference between a model based on a good sample and one based on all the data may not be significant. You can also try faster algorithms such as extreme gradient boosting:
ruclips.net/video/woVTNwRrFHE/видео.html
Bharatendra Rai sure sir, I will try today.
Dear sir, how can I get the data set that you are using?
your email?
Actually I don't need email. You can get data from:
sites.google.com/site/raibharatendra/home/decision-tree
may add more about CHAID trees
Thanks! I'll keep it in mind.
Right after I installed the package, my commands aren't working. Please reply ASAP.
install.packages("party")
library(party)
data$NSPF
which line is not working?
try
footballtree
Hello sir, it's a great video. Does rpart use the Gini index?
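Yes. For classification trees, rpart splits on the Gini index by default (the alternative is split = "information"). The altered priors method is how it incorporates prior probabilities and losses.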
Hi sir, can you please explain pruning of a tree? On what basis do we prune?
When you have decision trees that are too big, 'pruning' helps to reduce size of the tree by removing those parts that do not help much in correct prediction of the outcome. It helps to avoid over-fitting and improve prediction model accuracy.
Hello Bharatendra Sir, can you please guide me on how to implement the perturbation method in R?
Currently I have classified data using a decision tree. Now I want to perturb the data and run the same classification, but I am unable to proceed. Can you please upload some videos illustrating how to implement the perturbation method in R? It's very urgent for me.
Megha, here is the link for perturbation analysis. Note that it can be used only for regression-like models. It may not work with decision trees.
ruclips.net/video/Jz97ccAIyj8/видео.html
What if there are two target variables, like NSP and some other? Which decision tree technique should be used? What will the formula be?
You can make two separate trees.
How will we derive the formula? Based on what attributes?
The decision tree algorithm will automatically choose the attributes or independent variables depending on the parameters that you choose, such as minimum sample size for splitting, statistical significance, etc.
Getting an error in "#Misclassification error in testing data".
It is prompting "all arguments must have the same length".
Sir, please help me out.
There could be some mix-up between the training and testing data.
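A sketch of how such a mix-up produces this error, using rpart and the built-in iris data as stand-ins (assumptions, not the code from the video). The names `pd`, `train`, `validate`, `tree`, `pred` and `tab` are illustrative.

```r
library(rpart)  # ships with standard R installations

set.seed(1234)
pd <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train    <- iris[pd == 1, ]
validate <- iris[pd == 2, ]

tree <- rpart(Species ~ ., data = train, method = "class")

# Mix-up: predictions for the TRAINING rows against VALIDATION labels.
# The two vectors differ in length, so table() stops with an error.
wrong <- tryCatch(
  table(predict(tree, type = "class"), validate$Species),
  error = function(e) conditionMessage(e)
)
print(wrong)

# Fix: predict on the same data set whose labels go into the table
pred <- predict(tree, newdata = validate, type = "class")
tab  <- table(Predicted = pred, Actual = validate$Species)
1 - sum(diag(tab)) / sum(tab)  # misclassification error on validation data
```

Checking length(pred) against nrow(validate) before calling table() is a quick way to catch this.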
Bharatendra Rai okay sir! Let me try once again... if I get stuck again, can I share my code here?
Bharatendra Rai sir, it was my fault, you were right.
Now it is working fine.
I cannot download the data from the link. It does not exist.
Here is the link for data:
sites.google.com/site/raibharatendra/home/decision-tree
I've created new links now:
drive.google.com/open?id=0B5W8CO0Gb2GGa09Ma3NzTVpyOWM
drive.google.com/open?id=0B5W8CO0Gb2GGMzJGbkdGUGREYjA
What does the p-value represent?
+divya damodaran A p-value of 0.05 corresponds to 95% confidence (1 - 0.05 = 0.95) when concluding that the variable is statistically significant. The smaller the p-value, the stronger the evidence for the split.
Okay, thank you.
Hi, can you explain how over-fitting occurs in a decision tree?
When terminal nodes have very small sample sizes, the decision tree model is likely to over-fit. Because of the small sample sizes, decisions reached in those terminal nodes may not be very stable.
Sir, how can a decision tree be used for variable selection?
The importance of a variable in the tree is reflected by its position. For example, the one at the top of the tree is the most important.
Sir, I am getting the following error when I try to execute "#Misclassification error in testing data":
tab
You can use the following lines:
pred
It's throwing an error within the same project as well as outside that project. Why is that? It worked the first time, but now it's not working and throws an error.
+Megha Dabhade send me the entire code to look into.
I was using the wrong tree in the predict function, which had a different length. Now it's working fine.
Thanks for your guidance.
+Megha Dabhade 👍