Awesome Explanation...
I suggest you attach the Jupyter notebook code with your video.
Thank you! 😊 We used to provide notebooks too but stopped due to IP infringement.
Awesome presentation. Kindly also make a presentation on Hybrid Sampling/Ensemble Systems. Thanks
Thank you! We'll keep these suggestions in mind.
Great videos! Thank you for sharing!!!
Glad you like them!
awesome, but what about stratify when splitting?
Thank you! Stratify maintains the same proportion of 0s and 1s in the train and val/test sets as in the overall data, but it won't resolve the class imbalance issue. We can stratify at the time of the split to preserve whatever imbalance we have, and then apply the imbalance treatment only to the train set.
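A minimal sketch of that workflow, assuming a feature matrix `X` and label vector `y` are already defined and that scikit-learn and imbalanced-learn are installed (the variable names here are illustrative, not from the video):

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stratified split: train and test keep the same 0/1 ratio as the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imbalance treatment (here SMOTE) is applied to the training set only;
# the test set keeps its original, realistic class distribution.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("Train after SMOTE:", Counter(y_train_res))
print("Test (untouched): ", Counter(y_test))
```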
Thanks for the presentation. Can I use SMOTE before splitting the dataset into training and testing sets?
Welcome! Good question. Any imbalance treatment needs to be applied only to the train data, i.e. for training the model; because the test data represents future data, it is not supposed to be treated for imbalance.
@prosmartanalytics I mean when we use oversampling on the whole dataset (before splitting), because when I used it this way I got a good confusion matrix and better metrics (accuracy, recall, F1, precision), and there was no problem of overfitting.
Yes, but there is a leakage problem. The results so obtained won't be considered reliable. Test data is supposed to represent the future. So if we are predicting defaults for a bank where the historical default rate is only 2%, the test data should reflect that value, not 50%. If we use the entire data for imbalance treatment, the data that we will later use as test has already participated in the training process, because we generated our synthetic samples using it too.
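To make that concrete, here is an illustrative sketch (synthetic data and hypothetical names, not the video's code) comparing the two orderings. Oversampling before the split both leaks information from future test rows into training and leaves you evaluating on an artificially balanced test set, so the metrics usually look much better than they would on realistic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data (~2% positives, similar to a bank-default setting).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.98, 0.02], random_state=0
)

# Leaky order: oversample first, then split. Synthetic minority points are
# interpolated from the full data, so rows that later land in the test set
# have already influenced the training set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
leaky = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
print("Leaky F1 (balanced 50/50 test set):", f1_score(yte, leaky.predict(Xte)))

# Correct order: stratified split first, oversample the train set only.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
Xtr_res, ytr_res = SMOTE(random_state=0).fit_resample(Xtr, ytr)
clean = RandomForestClassifier(random_state=0).fit(Xtr_res, ytr_res)
print("Leak-free F1 (realistic ~2% test set):", f1_score(yte, clean.predict(Xte)))
```

The exact numbers will vary by dataset and model, but only the second evaluation reflects how the model would perform on genuinely unseen, imbalanced future data.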