This looks like an amazing library. Where it looks thin is on documentation. I hope the documentation is updated to the same standard as LightGBM. This has the potential to be a game changer
The documentation was recently updated and is now much improved. We still need to do more, but it's getting there.
@@richcaruana2067 thank you 🙏🏾.
XGBoost & friends are good at capturing interactions between features, but can EBMs do the same? Being restricted to just one feature per tree would intuitively seem to limit the model's ability to discover interaction effects (which of course comes at the cost of limiting interpretability)?
Yes, by default EBMs also model pairwise interactions. EBMs do this by fitting the main effects first using trees that can only use one feature at a time. After fitting the mains, EBMs then use a procedure to find which pairwise interactions look like they would help improve the model most. Then EBMs fit these important pairwise interactions using boosted trees that can only use two features in each tree, and thus can model pairwise interactions. Pairwise interactions are important in some problems, and less important in others. By being able to model both main effects and pairwise interactions, EBMs are able to be as accurate as XGBoost on most tabular datasets.
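In case a concrete example helps, here is a minimal sketch of enabling pairwise terms with the interpret package (the interactions parameter controls how many pairs the EBM searches for; double-check the name and default against the current docs):
from interpret.glassbox import ExplainableBoostingClassifier

# Fit main effects first, then search for and fit up to 10 pairwise terms.
# interactions=0 would give a pure main-effects GAM.
ebm = ExplainableBoostingClassifier(interactions=10)
ebm.fit(X_train, y_train)   # X_train / y_train are your own training data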
@@richcaruana2067 Cool, can you link to some data explaining exactly how these features are chosen?
I (for some unknown reason) thought that it would multiply two features together to construct a new feature, and then fit a tree to that new feature, rather than having the tree split on either of the "original" features?
@@allanhansen6923 Multiplying two features together to create pairwise interaction terms is a common practice in statistics if you then want to fit the interaction using things like linear or logistic regression. But if you're using modern machine learning methods there are advantages to directly modeling the interaction instead of multiplying features. Our code uses a heuristic procedure to quickly determine which interaction terms are most important. It's designed to be fast (because there are order n^2 pairwise interactions from which to select), and pretty good. In practice it works well, but there are other methods in the literature for selecting pairwise interactions that are based on analyzing what is learned by random forests or by neural nets. If you prefer an alternate method for selecting what pairs to include in the model, our code allows you to bypass automatic pair selection and instead specify which pairs you want in the model. This is also useful if you've been working with the same data set for a long time and want to use the same pairs you've used in previous models.
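For example, assuming the interactions argument also accepts an explicit list of feature-index pairs (as the interpret docs describe), manual pair selection would look roughly like this:
from interpret.glassbox import ExplainableBoostingClassifier

# Bypass automatic pair selection: only these pairwise terms are fit,
# in addition to all of the main effects.
ebm = ExplainableBoostingClassifier(interactions=[(0, 3), (2, 5)])
ebm.fit(X_train, y_train)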
@@richcaruana2067 Cool! Would you be able to provide a GitHub link to the code handling the pairwise interactions? Or perhaps a link to an open-access article or preprint server describing the technique?
If two collinear features are ordered so that, among many features, one comes right before the other, will EBMs assign more predictive power to whichever of the two comes first?
Can anyone explain why the order of features does not matter with a small learning rate?
If the learning rate is very small, then so little is learned each time you add a tree for a feature that most of the variance is still unexplained, and as you move to subsequent features they also get an opportunity to learn something that is useful for reducing variance. The lower the learning rate, the less important the order in which you visit the features is. If the learning rate is very low, it's almost as if you're doing parallel updates for all the features simultaneously.
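Here is a toy simulation (not the EBM code itself; the one-feature trees are replaced by simple least-squares steps) that illustrates the point with two nearly collinear features: with a large learning rate, whichever feature is visited first absorbs almost all of the signal, while with a small learning rate the credit is split almost evenly regardless of order.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # nearly collinear with x1
y = x1 + x2 + rng.normal(scale=0.1, size=n)
features = {"x1": x1, "x2": x2}

def cyclic_boost(order, lr, rounds=500):
    # Round-robin boosting: each "tree" is a least-squares fit of the
    # current residual on a single feature, shrunk by the learning rate.
    contrib = {name: np.zeros(n) for name in features}
    resid = y.copy()
    for _ in range(rounds):
        for name in order:
            x = features[name]
            beta = (x @ resid) / (x @ x)     # one-feature least-squares step
            update = lr * beta * x
            contrib[name] += update
            resid = resid - update
    # variance of each feature's learned term = how much signal it absorbed
    return {name: round(float(np.var(c)), 2) for name, c in contrib.items()}

for lr in (1.0, 0.01):
    print("lr =", lr,
          "| x1 first:", cyclic_boost(["x1", "x2"], lr),
          "| x2 first:", cyclic_boost(["x2", "x1"], lr))
With lr = 1.0 the two orderings give very different terms; with lr = 0.01 they are nearly identical, which is the "almost parallel updates" behavior described above.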
I'm having trouble understanding the explanation at 9:52 for why the model determined that cancer/asthma is good for COVID. I don't understand the explanation of the sample bias.
Please tell me, can we use EBMs for multi-class classification?
What is the difference between a white-box and a glass-box model? Are they the same?
Are these models robust to correlated features?
This is sooo sick
Broroooo...
Is CatBoost available for EBMs now?
I like CatBoost, but it's not available right now as a method for training EBMs. Because EBMs are restricted by the need to be intelligible (i.e., they are restricted to being GAMs of main effects and pairwise interactions), we don't think CatBoost would make a significant change in the expected accuracy of the final EBMs. But it would be fun to try anyway, and behind the scenes we have explored a variety of algorithms for training EBMs, including things like BART (Bayesian Additive Regression Trees) and LightGBM.
But we can create the same kind of plots from gradient boosting machines, so what's the point? Having separate trees does not provide any insight by itself, nor does their being additive; the only interesting thing I see is that the predictions go flat when the data become sparse.
Can anyone explain how I can get each feature's importance from an EBM?
Try something like this:
# ebm is a fitted EBM (e.g. ExplainableBoostingClassifier); attribute
# names may differ slightly between interpret versions.
for f_name, importance in zip(ebm.feature_names, ebm.feature_importances_):
    print(f"{f_name}: {importance}")
Can you share a source for the estimate that 2,500 lives a year would be saved (5:26) if the BUN risk curve were flattened?
We are working on that paper now, but don't currently have any other publications that explain this in detail. BTW, the dataset used to discover this dates back to 1989 and thus is quite old, and since then doctors have changed how they treat patients with high BUN and have begun to treat patients with BUN lower than 100 more aggressively, in part because the treatment itself has changed and become safer. Hope this helps --- sorry we don't have the BUN work written up yet!