Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced
Hi. You should perform under/oversampling (including SMOTE) only on the training data, and measure F1 on the original data distribution (the test data). Moreover, if you split oversampled data with train_test_split, you have no control over how the duplicated items are distributed between test and train. That means the same observation can end up in both test and train, so you are partially testing on the training set, which is why the results increase. So first split into train/test, then resample only the training set, and leave the test set unchanged.
Still, it's a very good tutorial, it's nice that you share your knowledge !!
yes thats true
yeah, we should never touch the test set.
True... it might overfit, right?
sad but true
@@MMSakho Yes you are right
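A minimal sketch of the split-first order described in this thread, written with only the standard library so it runs anywhere. The function name split_then_oversample and the (row_id, label) row layout are made up for illustration; with real data you would more likely use train_test_split and then resample X_train/y_train only.

```python
import random

def split_then_oversample(rows, test_fraction=0.2, seed=0):
    """Split first, then duplicate minority rows in the training part only.

    `rows` is a list of (row_id, label) pairs with label 0 (majority)
    and label 1 (minority). The layout is hypothetical.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # Hold out the test rows BEFORE any resampling happens.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    majority = [r for r in train if r[1] == 0]
    minority = [r for r in train if r[1] == 1]
    # Draw minority rows with replacement until both classes are the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced_train = majority + minority + extra
    rng.shuffle(balanced_train)
    return balanced_train, test

# 100 minority rows and 900 majority rows, identified by their index.
rows = [(i, 1 if i < 100 else 0) for i in range(1000)]
train, test = split_then_oversample(rows)
```

Because the duplicates are drawn from the training part only, no test row can also appear in the training set, and the test set keeps the original class imbalance.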
For those watching recently: the SMOTE method is called "fit_resample" now. Also, if you can't import imblearn (the imbalanced-learn package) properly, try restarting the kernel.
Thank you
Will this work for categorical response too?
@Ma Aleem it means n_jobs = -1, i.e. use all your cores for processing
Thank you
Thanks, friend!!
Hey codebasics, love this video series! I think there’s a pretty big mistake in the oversampling though. You upsample, then do the train/test split. This means there will be overlapping samples in both the training and testing data, so the model will already have seen some of the data you are testing it on. I think you need to do your train/test split first, then do the upsampling on the training data only.
Yup, that's true. My professor said you should always oversample after splitting the data, and undersample before. If you oversample before splitting the data, your model will be in danger of overfitting.
Yay, go me, commenting on a 3 year old comment!
@@shivi_was_never_here this comment just helped me avoid this mistake. So, awesome, yay!
Thank you so much for sharing this interesting information about data transformation. I was training a neural network that gave an AUC of 0.85, after balancing the class with the SMOTE it reached 0.93 AUC. Obviously, the f1-score and accuracy also improved. Thanks!
The way you are introducing the information is very very excellent, thanks for sharing your knowledge and I'm happy to watch your video
Nice video, pretty clear. I think there are 2 things that are missing though:
1) Doing the under/oversampling only on the training data
2) You could also have chosen a different operating point (a different threshold instead of np.round(y_pred)), or just used the AUC measure without rounding at all, which could have been more indicative
PS: SMOTE doesn't actually give any lift in AUC. You could just as well have adjusted the threshold to y_pred > 0.35 or something like that and gotten better F1 scores
True. Good points!
My thoughts exactly. Nice!
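To illustrate the operating-point suggestion above, here is a small sketch that compares F1 at the default 0.5 cut-off with a lowered one. The labels and probabilities are invented for the example, f1_at_threshold is a hypothetical helper, and 0.35 is simply the value mentioned in the comment, not a recommended setting.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive class when predicting 1 for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.36, 0.45, 0.55, 0.90]

f1_default = f1_at_threshold(y_true, y_prob, 0.5)   # similar to rounding y_pred
f1_lowered = f1_at_threshold(y_true, y_prob, 0.35)  # lower operating point
```

On this toy data the lower threshold recovers the two positives scored below 0.5 at the cost of one extra false positive, so F1 goes up. In practice the threshold would be picked on a validation split, not on the test set.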
I always learn something new watching your videos. Thank you 🙏🏻
I'm so glad!
Thank you very much for this video. This actually helps in solving real world scenarios.
:)
Tremendous respect sir, I love your tutorial. I sincerely follow your tutorials and practice all the exercises that you provide. However, I went through some comments on this video and found that people are suggesting to oversample/SMOTE the training sample only, and not to disturb the test sample (which I too believe is quite apparent, as this avoids duplicate or redundant entries across the training and test data sets). Hence, I separated out the train and test datasets first, then applied the oversample/SMOTE technique on the training dataset only. Unfortunately, the precision, recall, and f1-score are not increasing for the minority class. This is quite logical though. What I understood is that duplicate entries of the same sample in both the train and test datasets were the reason for that huge increase in minority class precision, recall, and f1-score in your case.
This happened when I tried the second exercise of the Bank customer churn prediction problem. Oversampling/SMOTE on train data gives around 0.51, 0.63, and 0.56 for precision, recall, and f1-score. When I follow your method for the Bank customer churn problem, the figures are 0.77, 0.90, and 0.83 respectively.
Thank you. Very clear instruction and linked to Ann too, as I've only used with supervised ml.
Only in this video looks like your patience was out of your control sir....huhaaaa....but still quality content delivery and great explanation....Tks a lot Sir....
I think we should first apply train test split and then over/under sample the train data.
Hats off to u Dhaval, Loved ur way of teaching and clearing my concepts, thank u so much
Sir, Is there any better method from SMOTE for Class Imbalance? if yes please guide me...I am a Research Scholar (Doing Ph.D) from TOP 30 NIRF ranking institute. My area of research is classification problem in machine learning including dealing with imbalance data set. Thank you
Thank you again Dhaval. I really appreciate your efforts!!
In my opinion the SMOTE part is not wrong, but it is tricky. Using SMOTE on the entire dataset will make the X_test performance much better for sure, since the model will be predicting values it has already seen. If instead you split your data before SMOTE, you can see that the performance improves, but not by much: it will not reach 0.8 if it was 0.47 without SMOTE. The X_test in the video could probably be interpreted as X_validation, and the testing data should be imported from other sources, or the dataset should be divided into training and test at the very beginning, like on Kaggle.
Don't you want to apply SMOTE just to the training data, and leave the test data untouched?
True. SMOTE must be applied after the train/test split.
@@lorizhuka6938 What about the others? Oversampling for instance.
Great presentation! I think I just needed SMOTE for my assignment but I liked how you explained every method.
Thanks for providing us the path and please keep doing the good work and don’t get upset by lesser views you are a true inspiration for all of us.
31:40 the ANN function is using the same old X_test and y_test. I think that's why the accuracy is so bad.
Thanks a lot, codebasics for all of your valuable and knowledgeable content
thanks for the great content, for the ensemble method could we use a random sample of the majority class (n=minority class length) then we could create more models for the vote
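The idea in this comment (each model sees all minority rows plus a fresh random draw of the majority class) could be sketched roughly like this. balanced_subsamples is a made-up name and the rows are placeholders; it assumes the majority class has at least as many rows as the minority class.

```python
import random

def balanced_subsamples(majority, minority, n_models, seed=0):
    """Yield n_models training sets, each pairing all minority rows with an
    equally sized random draw (without replacement) from the majority rows."""
    rng = random.Random(seed)
    for _ in range(n_models):
        drawn = rng.sample(majority, len(minority))
        yield drawn + minority

# Placeholder rows: 900 majority and 100 minority examples.
majority = [("maj", i) for i in range(900)]
minority = [("min", i) for i in range(100)]
batches = list(balanced_subsamples(majority, minority, n_models=5))
```

Unlike fixed, non-overlapping batches, random draws allow any number of models, at the price of some majority rows being reused across models and others never being used.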
Undersampling 7:34
Oversampling 15:04
Note: The fit_sample() method has been replaced by the fit_resample() method in newer versions of imblearn
THANKS
Good experiments with different methods! How about Auto-encoders methods? You encode and decode all good data (customer staying per your example) within DNN, calculate its reconstruction error. Now you run customer leaving data in your model. If your error from customer leaving data is not within the reconstruction error (from your staying data), then you have detected an anomaly. What do you think?
Very interesting, amazing video...at 22:34 when using SMOTE method , smote.fit_sample(X,y) is now smote.fit_resample(X,y).
You answered my question with only 4 minutes. Great! thank you!
Happy to help!
@@codebasics if we have ratio of data in 54% and 46%. Do we need balancing?
very helpful, your video makes everything easier ,thousand thumbs up for you 👍👍
Glad it helped!
i was actually doing the churn modeling project and this video popped up! thanks a lot :)
Glad I could help!
Which is the best method: doing the sampling before splitting the dataset, or after splitting the dataset?
Hi @codebasics. I find your tutorial series very informative and interesting. I am learning a lot from your videos.
I have a doubt in ensemble technique. While voting you are taking votes from three different predictions. But those predictions are not for the same data set. Is voting ensemble valid for such cases?
Same thought.
Voting isn't ideal
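On the voting question: a majority vote is only comparable row by row if every model predicts on the same held-out rows, even though each model was trained on a different batch. A minimal sketch of that vote, with invented predictions and a hypothetical majority_vote helper:

```python
def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models on the SAME test rows.

    predictions_per_model: list of equal-length lists, one per model.
    Returns 1 for a row when more than half of the models predict 1.
    """
    n_models = len(predictions_per_model)
    lengths = {len(p) for p in predictions_per_model}
    if len(lengths) != 1:
        raise ValueError("every model must predict on the same test rows")
    return [1 if sum(votes) * 2 > n_models else 0
            for votes in zip(*predictions_per_model)]

# Hypothetical outputs of three models evaluated on one shared test set.
pred_a = [1, 0, 1, 0]
pred_b = [1, 1, 0, 0]
pred_c = [0, 1, 1, 0]
final = majority_vote([pred_a, pred_b, pred_c])
```

Using an odd number of models avoids ties; with an even number this sketch resolves a tie to 0.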
So fun the laugh at 22:31 hehe really cool video!
It seems we should not calculate accuracy on the training sample; with oversampling it is pretty obvious that precision and recall will improve. We need to measure accuracy on a test sample where we have not artificially increased or decreased the number of samples.
Hi! Why dont directly use the train_test_split with the stratify argument? Thank u!
Great stuff, but there's an error, I believe. At 31:07, in the ensemble method, you've used the function 'get_train_batch' to get X_train and y_train, but you're not redefining X_test and y_test
I am getting error Failed to convert a NumPy array to a Tensor (Unsupported object type int).
Thanks for sharing, but I think there is a problem with the test metric. You use processed data for training (oversampling etc., which is okay), but you cannot use the same preprocessed data for testing, because in a real setting you do not know the test data's target, so you cannot apply imbalance techniques to it. You should first separate the data, apply the imbalance handling only to the training data, and test on unprocessed test data.
Wonderful video. Great effort. Thank you.
Glad you enjoyed it!
I think there's also a risk of overfitting the model when using SMOTE, as the synthetic data points might look like test data points(unseen).
That's true. Especially if the data is in text
@@MMSakho Did anyone manage to find out if that's true?
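For readers wondering what the "synthetic data points" in this thread look like: SMOTE creates a new minority row by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea (it uses only the single nearest neighbour and hypothetical helper names), not the imblearn implementation.

```python
import random

def smote_like_point(point, neighbor, rng):
    """Return a synthetic point on the line segment between two minority rows."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]

def nearest_neighbor(point, candidates):
    """Closest other row by squared Euclidean distance."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

rng = random.Random(0)
# Three made-up minority rows with two features each.
minority = [[1.0, 2.0], [1.5, 2.5], [4.0, 4.0]]
base = minority[0]
synthetic = smote_like_point(base, nearest_neighbor(base, minority), rng)
```

Since every synthetic row is built from existing minority rows, running this on the full dataset before splitting lets information from future test rows shape the training rows, which is the leakage concern raised in other comments.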
Nice tutorial on this topic, excellent teaching. Could you please post topics on supervised learning and unsupervised learning separately, so we can learn them in sequence?
u r awesome teacher plz stay with us long live
thanks for your kind wishes Vinod
Could someone elaborate a little bit on how exactly data is getting overlapped. I see many people saying to first split data and then sample it, will it work because here in this video we are dividing class 0 and 1 well in advance and then combining the data. I am going through many comments on this issue and having a hard time to figure this out.
Did you manage to figure it out?
In oversampling the minority class by duplication:
If we duplicate the minority class, both classes will have an equal number of samples.
After that we use train_test_split, which selects samples at random.
The problem is that those duplicate samples will be present in the training samples as well as the testing samples, which inflates precision, F1 score and all of those.
That is where the overlap comes from.
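The overlap described above can be counted directly. This is a rough simulation under assumed sizes (900 majority rows, 100 minority rows, 20% test split); the function name and row layout are invented for the example.

```python
import random

def overlap_after_oversample_then_split(n_majority, n_minority,
                                        test_fraction=0.2, seed=0):
    """Duplicate minority rows to balance the classes, THEN split at random,
    and count test rows whose original id also appears in the training part."""
    rng = random.Random(seed)
    minority_ids = list(range(n_minority))
    # Oversample by blind copy: draw minority ids with replacement.
    copies = [rng.choice(minority_ids) for _ in range(n_majority)]
    rows = [("min", i) for i in copies] + [("maj", i) for i in range(n_majority)]
    rng.shuffle(rows)

    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]
    train_set = set(train)
    leaked = sum(1 for r in test if r in train_set)
    return leaked, len(test)

leaked, n_test = overlap_after_oversample_then_split(n_majority=900, n_minority=100)
```

Majority rows are unique, so every leaked row is a minority duplicate, which is exactly the class whose precision and recall look so much better after this procedure.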
This is very insightful... thank you.
Please can you do a video on Click through rate prediction
sure
You talked about Focal Loss but didn't show the practical application of it. Is there another video on Focal Loss?
It's a great tutorial! But I have a comment on the evaluation part. You applied resampling before splitting the data, so it's possible that data leaks from the training set into the test set, right? That's why the prediction scores come out equal. It's good practice to split the dataset first and then resample only the training set. Hope this helps. Thanks
00:00 Overview
00:01 Handle imbalance using undersampling
02:05 Oversampling (blind copy)
02:35 Oversampling (SMOTE)
03:00 Ensemble
03:39 Focal loss
04:47 Python coding starts
07:56 Code - undersampling
14:31 Code - oversampling (blind copy)
19:47 Code - oversampling (SMOTE)
24:26 Code - Ensemble
35:48 Exercise
Hello Sir, I tried this exercise, but for the ensemble the f1 score did not change much. For the individual batches, the f1 scores for classes 0 and 1 were around 0.80 and 0.50, and they hardly changed overall.
excellent approach very helpfull
🤩 love your tutorials brother
video is really helpful.Thanks for sharing.
Glad it was helpful!
Do you have video for imbalanced data for text classification problem. Please suggest.
Perfect explanation
Glad you think so!
do we need to check for imbalance for unsupervised learning problem or clustering problem?? if yes, why and how??
Can we use variational auto encoder for synthetic data generation in case of minority class?
sir, I am following your deep learning playlist. please make a video on cross validation with keras for neural network.
sure
Thanks for these good videos, these are very helpful for me
Thank you so much. It was very informative.
Glad it was helpful!
One more question: suppose we build a fraud detection model on a dataset with, say, 40% defaulters and 60% non-defaulters. What happens if we pass it a different dataset with a different distribution, size, and quality, for example a new dataset with approx. 70% defaulters and 30% non-defaulters? How can we overcome this problem? Do we build two models, or combine the two datasets to build one model? Please comment.
thanks, amazing illustration , do these methods work with multi-class labels ( means the lable column may contain over 10 labels)
Very useful and fruitful, big up
Glad it was helpful!
This tutorial covers how to deal with imbalanced datasets with only 2 or 3 classes. How to deal with a dataset with 64 classes in which some classes do not occur and or only occur only once in the dataset and in which the samples are grayscale images? That would be a great tutorial.
Hi, I have a problem to solve with imbalanced data: ~98% is positive and only ~2% is negative.
[[64894 0]
[ 1423 0]]
Model with Trial Data, Accuracy : 97.85 %
Model with Trial Data, Precision : 0.0 %
Model with Trial Data, Recall : 0.0 %
Model with Trial Data, F1-Score : 0.0 %
None of the classifiers is able to predict the negative cases. Can you please suggest the best method to increase the f1 score?
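The numbers in this comment show why accuracy alone is misleading here. Assuming the matrix uses the scikit-learn layout (rows are true labels, columns are predicted labels), the model predicts the first class for every row. The helper below is a hypothetical sketch that recomputes the reported metrics for the second class from the matrix.

```python
def metrics_from_confusion(matrix):
    """Accuracy, precision, recall and F1 for class 1 from a 2x2 confusion matrix
    laid out as [[tn, fp], [fn, tp]] (rows = true label, columns = predicted)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # With no predicted positives the ratios are 0/0; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Confusion matrix quoted in the comment above.
accuracy, precision, recall, f1 = metrics_from_confusion([[64894, 0], [1423, 0]])
```

All 1423 rows of the rare class are missed, yet accuracy still rounds to 97.85% because the common class makes up almost all of the data.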
I think, in the same way , a method get_test_batch() also is required.
After balancing the dataset may I know what values can be placed in that place
Can we apply same technique if we have more than 2 classes?
does this approach work for more than 2 categories in Target variable?
Sir, please clear my doubt. In method 2, i.e. oversampling, when we use train_test_split the precision, recall and f1-score values do not look realistic, because my test data is not unique (the training data is already in the test data because of oversampling). Could you please clarify? Thanks
True, when you over sample there is a good chance that there will be data leakage. It would be helpful if you split the data and then oversample the train data to avoid any influence on the result.
@@piyushdandagawhal8843 Thank you Piyush. Please suggest me some research direction on Handling imbalanced data set in machine learning and Deep Learning. I am a full time research scholar so your suggestions mean a lot for me.
Thank you
I think the test train split should be done before under or oversampling. Otherwise, the results are not reliable.
I was trying to implement smote.fit_resample(X,y). But got this error "'NoneType' object has no attribute 'split'". Couldn't find solution. Can anyone help?
Hi Sir, can you please tell me about how to augment data(not image data) for regression problems?
Hey, great video.
Can you also make one video on how to handle the class overlapping (that too in imbalanced binary classification)??
Thank you
Sir, I have Python code implementing a deep neural network on the KDD dataset. Can you explain the code to me in a Google Meet session? I will be forever indebted to you. Thank you.
If we have imbalanced dataset but still get good F1-score, should we still be concerned about the data being imbalanced and use one of those techniques or not?
Hi Dhaval,
When I run, multiple times, I am getting different F1 score, Accuracy etc. I have tried fixing it by giving below random seed also (in the very beginning of the code). Still getting different results. Kindly let me know how to get reproducible results.
from numpy.random import seed
seed(1)
import tensorflow as tf
tf.random.set_seed(2)
Even I have used random_state in below methods as well:
train_test_split, sample and SMOTE
Really helpful. Could you please tell whether oversampling strategy is okay if we do cross-validation instead of train-test-split?
Thank you so much and appreciate for your work.
In the ensemble method code, is it okay to split the data into batches first and then apply the train_split and train it for each, and then take the majority?
can anyone give a solution of SMOTE memory allocation error problem. maybe many of u say that use premium GPU but it's too costly. Is there any other solution for solving this problem??
Can someone please explain why, in the example in this video, the accuracy and loss change so frequently? Is this overfitting?
I also had a similar observation in all videos in this series
Sir, can you please also add adasyn sampling technique and also other different sampling techniques. Differences between SMOTE vs ADASYN
Best Teacher!!!!!
👍😃
Thank you for your sharing.
Amazing video. One question. What if I use under/over sampling and accuracy or precision decrease?
Single or combined under/over sampling methods let us to use features for further methods, for example, training multiple weak learners and then use ensemble methods. Is it possible for ensemble resampling methods?
Great video as usual sir , wish you more success
So nice of you. I hope you are doing good my friend fahad.
Hello Dhaval,
Very Nice explanation.. Does SMOTE work for highly imbalanced data like I have data set where one class has less than 1% representation in the distributions ?
Please clarify
How to oversample the balanced dataset with less samples without blind copy
can we use SMOTE while working with audio dataset ?
awesome. cannot thank you enough
Is it the same process in multi label classification ?
are there similar methods for balancing datasets with continuous targets?
Great video !
i'll thank you with subscription
Thanks for sharing it. I am wondering that how we can treat imbalance dataset of time series ? Can all mentioned techniques in video be performed on timeseries data?
In general, it depends on type of data. Most of the imbalanced time-series dataset can be handled using SMOTE approach or combination of SMOTE with ENN/TOMEK.
JUST THE BEST
Great explanation
Please make a video on abstractive dialogue summarization, where the same imbalanced dataset problem occurs!
sure
Please comment: what if we pass different datasets to a classification model? I got lower accuracy because the model was built on one dataset but the new datasets have a different distribution and size. How can we solve this problem and improve performance? What should we do next?
Hello Sir .i was looking everywhere for class imbalance problem.Thanks a lot for this video. Do you have any videos for implementing rule based classification?
the evil laugh at 22:28 😂😂
Sir, can I use the methods used in this tutorial for training my image classification model or should I use augmentation for that purpose?
I think for image classification, augmentation is a better approach.