Hi Sebastian, thanks for the video. It would also have been great if you could have shown the features you were working with, or just what the data looked like, after introducing the problem you were trying to solve.
21:33 This is a great example. But why not cluster analysis or logistic regression or Q-factor analysis (it's not famous anyway)?
One advantage I see here is that we don't have to worry about the assumptions, because it is nonparametric. The other advantage might be that it's super fast.
My exact question is: what is the selling point of Decision Trees or Random Forest?
We didn't do cluster analysis because we had class label information, so it was a supervised problem. Logistic regression was one of the models we tried, though. The logistic regression coefficients (and also logistic regression with sequential feature selection) led to the same conclusion as the decision tree (this slide shows the logistic regression + feature selection results: speakerdeck.com/rasbt/building-hypothesis-driven-virtual-screening-pipelines-for-millions-of-molecules-at-odsc-west-2017?slide=44). The decision tree was easy to explain to non-machine-learning people (my collaborators), which was nice.
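For anyone curious what such a check can look like in code, here is a minimal sketch, not the pipeline from the talk: it pairs a logistic regression with scikit-learn's SequentialFeatureSelector on a stand-in dataset (the breast cancer data is purely illustrative).

    # Minimal sketch (illustrative data, not the molecule dataset from the talk):
    # logistic regression + forward sequential feature selection.
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    # Scale features so the logistic regression coefficients are comparable.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # Greedily add features one at a time until 5 are selected.
    sfs = SequentialFeatureSelector(clf, n_features_to_select=5, direction="forward")
    sfs.fit(X, y)

    print("Selected features:", list(X.columns[sfs.get_support()]))

If the few selected features then carry most of the predictive signal, that points to the same conclusion as inspecting a short decision tree.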
-> selling point of Decision Trees or Random Forest?
Decision tree: it's easy to understand the model's behavior for short trees. Random forest: good performance out of the box, without hyperparameter tuning.
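A rough illustration of both points on a stand-in dataset (again, not the molecule data): a depth-2 tree prints as a handful of readable if/else rules, while an untuned random forest is often competitive as-is.

    # Sketch: a short, readable tree next to a default random forest.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    # A depth-2 tree is small enough to read and explain in one slide.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))

    # A random forest with default hyperparameters, no tuning at all.
    forest = RandomForestClassifier(random_state=0)
    print("Forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())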
Hi Sebastian, in the published graph of your decision tree for the chemical molecules, some of the leaf nodes have non-zero Gini values. Does that mean the decision tree is truncated? Thanks!
Yes, I think this was a truncated tree. However, you can also get non-zero Gini values even if you don't truncate. This happens if you have examples with exactly the same feature values but different target values.
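A tiny sketch of that corner case, on made-up data: two examples share an identical feature vector but disagree on the label, so no split can ever separate them, and a fully grown tree still ends in an impure leaf.

    # Sketch: an untruncated tree with a non-zero-Gini leaf.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[0.0], [0.0], [1.0], [1.0]])  # two pairs of duplicate feature vectors
    y = np.array([0, 1, 0, 0])                  # the first pair has conflicting labels

    tree = DecisionTreeClassifier().fit(X, y)   # no max_depth: the tree grows fully

    # The leaf holding the conflicting duplicates keeps gini = 0.5,
    # because no threshold on x can split two identical feature vectors.
    leaves = tree.tree_.children_left == -1
    print(tree.tree_.impurity[leaves])          # one leaf at 0.5, one at 0.0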