The probabilities you get back from your models are ... usually very wrong. How do we fix that? My Patreon : www.patreon.co... Link to Code : github.com/rit...
Great video, it's a very interesting concept that I'd never heard about, but mathematically speaking it makes sense. It's also interesting that a linear model was able to correct the error so profoundly. Nevertheless, isn't this a kind of meta-learning?
Also, I think you shouldn't use the name "testing" dataset for training the calibration model, but rather e.g. a meta-dataset. The test dataset is reserved solely for the final, cross-validated model.
This is super amazing!! It's such an important concept that like you said, doesn't get all the credit it deserves. And sometimes we forget this step.
Hey Ritvik, thank you so much for making all these videos. I'm pursuing my Financial Math Masters, and this channel has been a game changer for me in terms of understanding proper ML. Although I have applied it many times, this level of in-depth understanding is, to say the least, very satisfying. Thank you so much for what you're doing for all of us.
Also, is it possible to connect with you somewhere?
I got asked about this in an interview. Thank you so much for posting this!!!
I would have called your "test" set the "calibration" set. Nice video.
Awesome video!!
I was trying to figure out how this concept works using the scikit-learn documentation, but I found the material too theoretical. In your video you present things in a much friendlier way!!
Many thanks :)
does it generalize or is it just overfitting with more steps?
I also have some trouble with the second phase of calibration. I'm curious what happens to the out-of-sample performance after the calibration. I don't claim I understand the background here, but I easily get the feeling: "the model fit did not produce predictions that match the observed distribution, so let's wrap the random forest in a logistic function and fit it to the empirical distribution." Naturally this would perform better on that data, but does the out-of-sample performance also improve? Sorry for my confusion; this is a pretty new concept for me as well.
I thought we don't touch test dataset until we have decided which model we are going to use?
Thank you! Brilliant video on such an important applied-ML topic.
Though I haven't seen, in the top section of comments, any mention of isotonic regression (which can also be found in the scikit-learn package). More often than not, it performs far better on this task than logistic regression, due to its inherent monotonicity constraint and piecewise nature.
Personally, I found most useful the part about using different sets (test / val) for calibration and calibration validation. Right now I am developing a production classification ML model, and I think I have made the mistake of performing calibration on the training set. Oops.
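The isotonic approach mentioned above can be sketched in a few lines of scikit-learn; the held-out scores and labels here are purely illustrative:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative held-out scores from some classifier, with true 0/1 labels.
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0,   0,   1,    0,   1,   0,   1,   1])

# Fit a monotone, piecewise-constant map from raw score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)

calibrated = iso.predict(scores)
# Monotonicity: calibrated probabilities never decrease as scores increase.
assert np.all(np.diff(calibrated) >= 0)
```

The monotonicity constraint is exactly what the comment highlights: unlike a sigmoid fit, isotonic regression can bend the calibration map arbitrarily as long as it never decreases.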
Shouldn't we first do min-max scaling on the original probabilities we get from the models?
Let's say I have three models and I run them on the same training data to get the following distributions of probabilities:
1) Naive Bayes: all predicted values between 0.1 and 0.8
2) Random Forest: all predicted values between 0.2 and 0.7
3) XGBoost: all predicted values between 0.1 and 0.9
If I want to take an average prediction, I am giving an undue advantage to XGBoost, so we should scale all of them to lie between 0 and 1.
The second step is then to feed these scaled probabilities into the logistic regression model to get the calibrated probabilities.
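The commenter's proposal can be sketched as follows. Note that this min-max rescaling is the commenter's own suggestion, not a standard calibration step, and the model outputs below are made up for illustration:

```python
import numpy as np

def min_max_scale(p):
    """Rescale a vector of scores to span [0, 1] (the commenter's proposal)."""
    p = np.asarray(p, dtype=float)
    return (p - p.min()) / (p.max() - p.min())

nb  = np.array([0.1, 0.45, 0.8])   # illustrative Naive Bayes outputs
rf  = np.array([0.2, 0.45, 0.7])   # illustrative Random Forest outputs
xgb = np.array([0.1, 0.5,  0.9])   # illustrative XGBoost outputs

scaled = [min_max_scale(p) for p in (nb, rf, xgb)]
avg = np.mean(scaled, axis=0)

# After scaling, every model spans the full [0, 1] range,
# so no single model dominates the average by sheer spread.
assert all(s.min() == 0.0 and s.max() == 1.0 for s in scaled)
```

Whether this rescaling helps is the open question here: it equalizes spread, but it is not itself calibration, since the rescaled values need not match empirical frequencies.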
This is such an important concept. I feel guilty of deploying models without a calibration layer.
Great material as usual!! Always look forward to learning from you.
Question: Are you planning on doing any material covering xgboost in the future?
Wow, that is an amazing video. I might be wrong, but generally we use the validation set for calibration, and the test set is for unseen data; it's like that in hyperparameter tuning, so I assumed it should be the same here. Correct me if I'm wrong.
I'm late to the party, but surely, since the random forest is not performing optimally in your example, you need to tweak its hyperparameters (tweak the data, tune the model) to fit a better curve. What if you create a badly performing model and try to calibrate it with logistic regression, when you could have gotten a better-performing model just by tuning the random forest?
The notebook section of this video is quite misleading - it is basically just plotting a line of best fit on a calibration curve. To actually calibrate the predictions, the trained logistic regression model should make predictions on a set of model outputs, and those 'calibrated' outputs can then be used to plot a newly calibrated calibration curve.
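The procedure this comment describes might be sketched like this, assuming a binary classifier and an illustrative synthetic dataset (all split names are made up):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_eval, y_cal, y_eval = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Fit the calibrator on held-out raw probabilities (not the training set).
raw_cal = rf.predict_proba(X_cal)[:, 1].reshape(-1, 1)
calibrator = LogisticRegression().fit(raw_cal, y_cal)

# Apply it to fresh raw probabilities, then plot the *new* curve from these
# calibrated outputs rather than fitting a line through the old curve.
raw_eval = rf.predict_proba(X_eval)[:, 1].reshape(-1, 1)
calibrated = calibrator.predict_proba(raw_eval)[:, 1]
frac_pos, mean_pred = calibration_curve(y_eval, calibrated, n_bins=10)
```

The key point the comment makes: the calibrated probabilities (`calibrated`) are the calibrator's *predictions*, and the diagnostic curve is recomputed from them on data the calibrator never saw.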
I think I know how you computed empirical probability. For me, it would have helped to see an explicit calculation, just to be sure.
You could solve the calibration issue more easily by tuning hyperparameters. Specifically, you choose to tune hyperparameters to optimize a cost function that is considered a "proper scoring rule", such as logistic loss/cross entropy (the cost function of logistic regression, actually). At least in my RF implementations, that has resulted in calibrated probabilities right off the bat, without any post-processing. That being said, you'll probably notice that scikit-learn's LogisticRegression() class doesn't return calibrated probabilities all of the time. You can blame that on the class using regularization by default. Just turn it off, and you'll likely get calibrated probabilities again :)
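A rough sketch of both suggestions in this comment. The hyperparameter grid is illustrative, and a very large `C` is used here to approximate turning scikit-learn's default L2 regularization off:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, random_state=0)

# Tune the forest against log loss, a proper scoring rule, as the
# comment suggests (the grid values here are just for illustration).
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 20], "max_depth": [None, 5]},
    scoring="neg_log_loss",
    cv=3,
).fit(X, y)

# Effectively unregularized logistic regression: a huge C makes the
# default L2 penalty negligible, per the comment's point about defaults.
unregularized = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
```

In recent scikit-learn versions, `LogisticRegression(penalty=None)` turns the penalty off outright; the large-`C` form above is just a version-agnostic approximation.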
Isn't it weird that the empirical probability is not monotonically increasing as a function of the uncalibrated probability? This would mean that the calibration model needs to learn to transform, e.g. 0.4 to 0.3 but 0.5 to 0.2.
Some people are wondering whether the initial calibration shouldn't be done on the calibration set rather than the test set. I'd say the presenter in this video has the right concepts, but he's calling what's usually called the validation set the test set, and vice versa. Usually, the set that's kept out for our final testing of the model's performance is called the test set, and the validation set is used before that to do whatever adjustments and tuning we want to do.
I get that it works, but ultimately, I can't help but feel like this is a band-aid fix to a more underlying issue, namely that something is wrong fundamentally with the model (in this case random forest). It feels like throwing in a fudge factor and hoping for the best.
This was FABULOUS, thank you.
Glad you enjoyed it!
Thank you very much for such an amazing video. I like that your videos explain the reasons behind something and then show the math. Could you please do the same for probability calibration? It is not clear to me why this happens, and whether changing the loss function in the classifier would change anything.
It looks to me like it's already calibrated during the training phase, because we minimize the error between the predicted and empirical probability. I don't quite understand its necessity.
Thanks Man !!
No problem!
Can you do a video on calibrating scorecards? like doubling of odds?
Very good tutorial. I have one question: is this concept based on any background theory/algorithm? If so, could you please give the specific name? Thanks.
It is called Platt scaling.
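Platt scaling fits a sigmoid σ(a·s + b) to a model's raw scores s on held-out data; scikit-learn exposes it via `CalibratedClassifierCV(method="sigmoid")`. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, random_state=0)

# SVMs were Platt's original use case: their decision_function outputs
# are scores, not probabilities, until the sigmoid map is fitted.
platt = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3).fit(X, y)

probs = platt.predict_proba(X)[:, 1]
assert np.all((probs >= 0) & (probs <= 1))
```

Internally, cross-validation is used so the sigmoid is always fitted on scores the base model did not train on, which matches the video's point about using a separate set for calibration.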
If I already use log loss as loss function, do I need to calibrate it again? Thank you
I like the explanation; it is very clear. But one thing I've noticed is data snooping. In the training setup that you proposed, why not train both the classifier and the calibrator on the training set and optimise them using a validation set? We may not (and should not) have access to the testing set.
Thanks.
great video, keep it up!
On the surface, it looks like you are using ML twice, with the second iteration correcting the error from the first run. I can't see why that second iteration is a legitimate step. It's like you made a bad prediction, and now we give you another chance and coach you to adjust your prediction to arrive at a more accurate one. I know you used test data, but I still can't see how you won't be overfitting.
exactly my thoughts
You're right, I've seen this exact phenomenon happen in the wild, and the model needed adjustment as such. Does anyone know why this happens?
Hey, thanks for the great video! I have a question regarding the predicted probability versus the empirical probability plot. I'm a bit confused because, if I understand correctly, the empirical observations are either 0 or 1 (or in this plot, are you grouping multiple observations together to obtain empirical observations that represent a probability?) Could you clarify this to help me understand it better? thanks very much again :)
If you calibrate it on the test set, wouldn't that introduce bias? Shouldn't it be the validation set?
Great video!
But I find it difficult to understand why non-probabilistic models aren't calibrated by default... The probability is derived from the dataset itself... So if the dataset is large enough, shouldn't it already be calibrated?
I am confused about why you train the logistic regression with the input being predicted probabilities and the output being the targets themselves. It seems you would train it with the input being predicted probabilities and the output being empirical probabilities. The probabilities should have nothing to do with the actual targets, only with how likely the prediction is to match the actual target, which is what we compute when we calculate the empirical probabilities. What am I missing?
You linked the code but not the data. Please add that link.
Is this the same way we implement calibration for a multi-class problem?
Why do you assume the blue line is not correct?
If you listen carefully, he does say the Y axis is the actual value and the X axis the predicted value. So the blue line is telling us that the predictions are way off, but by calibrating the blue line we can get better results.
Thanks. So calibration is basically done to reduce error, right?
Thank you!
How do you calculate the empirical probability if all the data in the dataset are unique? Because if every datapoint is unique, the empirical probability will be 0 or 1.
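To the question above: the labels are indeed 0 or 1 per point; the empirical probability comes from binning the predictions and taking the fraction of positives per bin, which is what a calibration curve plots. A hand-rolled sketch with illustrative data:

```python
import numpy as np

def empirical_probs(y_true, y_pred, n_bins=10):
    """Bin predictions and take the fraction of positives in each bin.
    This is how a calibration curve turns 0/1 labels into probabilities."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.digitize(y_pred, bins[1:-1])  # bin index per prediction
    emp, mean_pred = [], []
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            emp.append(y_true[mask].mean())        # empirical probability
            mean_pred.append(y_pred[mask].mean())  # average predicted prob
    return np.array(mean_pred), np.array(emp)

y_true = np.array([0, 0, 1, 0, 1, 1, 1, 1])
y_pred = np.array([0.05, 0.12, 0.18, 0.45, 0.52, 0.58, 0.85, 0.92])
mean_pred, emp = empirical_probs(y_true, y_pred, n_bins=4)
```

scikit-learn's `sklearn.calibration.calibration_curve` does the same binning; with few points per bin the empirical values can be noisy, and even non-monotone, which also speaks to the monotonicity question raised earlier in the thread.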
Thank you. This makes sense in a regression task, but how about a binary classification task? What would the real empirical probability be to fit the calibration on?
Great vid. I can't find the data on your GitHub account.
Thank you for this video! I didn't understand why we introduce bias if we train the calibration on the training set rather than the test set. Could you give us an example, please? +Subscribe
I know you gave an example later in the notebook, but what if the data were the other way around? I mean, the training set is the testing set and the testing set is the training set; would we still see this behavior?
awesome!
thanks
Despite appearances, you are simply a great teacher... (By looks I mean your attitude and style are more like a quirky artist's than a studious person's) :D :D