Handling Imbalanced Datasets SMOTE Technique

  • Published: 27 Nov 2024

Comments • 228

  • @pandharpurkar_
    @pandharpurkar_ 4 years ago +5

    best teacher i have ever seen! Explaining in very proper way! in short time explaining exact things!!!

  • @JainmiahSk
    @JainmiahSk 4 years ago +7

    Data Mites is a hidden gem now but soon they will be a Brand for Data Science. Keep my note for Future.

  • @akshiwakoti7851
    @akshiwakoti7851 4 years ago +2

    A real pro! Subbed this channel after watching first 3 minutes. Glad to have found it.

  • @donaloleary5514
    @donaloleary5514 3 years ago +1

    Thank you, Ashok! This is an outstanding explanation of a complex subject. You make it all feel very intuitive. Awesome stuff - I will look for more DataMites videos in the future!

    • @DataMites
      @DataMites 3 years ago

      Hi Donal O'Leary,
      Thanks for your comment, and keep visiting our channel for more updated content.

  • @bhagwatchate7511
    @bhagwatchate7511 4 years ago +2

    Amazing in depth explanation! I was exactly searching for this type of explanation.. Thanks for sharing

    • @DataMites
      @DataMites 4 years ago

      Glad it was helpful!

  • @lalithapriya9484
    @lalithapriya9484 3 years ago

    extreme clarification really superb teaching skills along with good communications

    • @DataMites
      @DataMites 3 years ago

      Hi lalitha priya, thank you for your comment.

  • @SurajSingh-wn4wu
    @SurajSingh-wn4wu 4 years ago +1

    Great Ashok.!! Genuinely liked your way of explanation in depth and the solution... Glad i landed on your page...
    Thank You..!

  • @Parvathy-e1p
    @Parvathy-e1p 11 months ago

    Wow sir liked u r session .please continue posting such videos

  • @inspiritlashi9994
    @inspiritlashi9994 3 years ago

    Thank you so much for the great tutorial.. As someone who does not have even the basic knowledge of python, I could learn many things from you, sir.

    • @DataMites
      @DataMites 3 years ago

      Glad it was helpful!

  • @MLA263
    @MLA263 2 years ago

    Thanks Ashok, very clear and simple explanation.

  • @alisalariyan6676
    @alisalariyan6676 3 years ago

    The best smote tutorial I've seen. Thanks

    • @DataMites
      @DataMites 3 years ago

      Glad it was helpful!

  • @jagannadhareddykalagotla624
    @jagannadhareddykalagotla624 3 years ago

    DataMites is like hidden pattern in unsupervised learning thank you so much ashok❤️❤️

  • @YizhuoLi-q8w
    @YizhuoLi-q8w 1 year ago +1

    This is really helpful and thank you again!

    • @DataMites
      @DataMites 1 year ago

      Glad it was helpful! Keep Watching!

  • @binoypaul9772
    @binoypaul9772 3 years ago

    Nice and informative. Please keep up the good work.

  • @b1k1m1
    @b1k1m1 4 years ago +2

    Hello Sir, Thanks for explaining this very clearly.. keep it up....

  • @milliekim5072
    @milliekim5072 3 years ago +1

    Thank you so much, sir! I hope I see more videos

  • @nasreenbanu2245
    @nasreenbanu2245 2 years ago

    Hai sir! thanks a lot for very simple and clear explanation.keep going we expect more videos from you...

  • @adeyinkasotunde6870
    @adeyinkasotunde6870 4 years ago +1

    wow...... i am very well impressed. well explained. thanks

    • @DataMites
      @DataMites 4 years ago

      You are most welcome

  • @8sharkey8
    @8sharkey8 3 years ago

    Excellent content, brilliantly presented. Thank you. Subscribed.

  • @bhanukiran4317
    @bhanukiran4317 3 years ago

    Great content sir !! Keep on spreading knowledge

    • @DataMites
      @DataMites 3 years ago

      Thank you, Keep watching

  • @osamaamir9311
    @osamaamir9311 2 years ago +1

    Such an amazing topic

  • @heenagirdher6443
    @heenagirdher6443 4 years ago

    Great tutorial. Very good explanation sir.

  • @niswandi6122
    @niswandi6122 1 year ago +1

    Thank you Ashok, clear explanation, but how do we handle imbalanced datasets if we have 4 classes?

    • @DataMites
      @DataMites 1 year ago

      The same technique applies to multiclass problems as to 2 classes.
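
A minimal sketch of what "the same technique" means for several classes, assuming the goal is to bring every smaller class up to the size of the largest one. The function name `synthetic_per_class` and the class counts are hypothetical. As far as I know, this mirrors the default `sampling_strategy="auto"` of imbalanced-learn's over-samplers, which resamples every class except the majority.

```python
from collections import Counter


def synthetic_per_class(labels):
    """Return how many synthetic samples each non-majority class needs
    to reach the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}


# Hypothetical 4-class label column.
labels = ["a"] * 500 + ["b"] * 200 + ["c"] * 50 + ["d"] * 10
plan = synthetic_per_class(labels)
print(plan)
```

Each minority class is oversampled independently, using only its own members as neighbours.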

  • @ChrisHalden007
    @ChrisHalden007 1 year ago +1

    Great video. Thanks

    • @DataMites
      @DataMites 1 year ago

      Glad you like it! Keep Supporting

  • @riorizkiaryanto
    @riorizkiaryanto 3 years ago

    Great video and explanation! Thanks.

  • @defres15
    @defres15 3 years ago

    Great video. Great explanation. Thank you

  • @siddhantagarwal274
    @siddhantagarwal274 4 years ago +1

    Nicely explained. Thanks!

  • @ombb3576
    @ombb3576 3 years ago

    Thank you for your sincere lecture sir

    • @DataMites
      @DataMites 3 years ago

      You are most welcome

  • @mohamedoutghratine8538
    @mohamedoutghratine8538 4 years ago

    Amazing in depth explanation

  • @aftabnaseem
    @aftabnaseem 3 years ago

    Great job ....made it look very easy

  • @dewipurnamasari5814
    @dewipurnamasari5814 2 years ago +2

    Thank you very much

    • @DataMites
      @DataMites 2 years ago

      Most welcome! Keep Watching

  • @abhijitkamune3976
    @abhijitkamune3976 4 years ago

    Nice explanation .. Looking for more NLP related video

  • @swastiknayak5173
    @swastiknayak5173 4 years ago

    At 8.15 you have said it is taking the average of centroids which is completely wrong. SMOTE is calculated over the feature space...it goes like this
    1. we take the feature vector of the minority class point.
    2. we calculate the distance between the neighbours (neighbours=5).
    3. we multiply the distance between the neighbours with a random number that is created between 0 &1.
    4. Then we create the synthesized point.
    hope you got it 😀

    • @DataMites
      @DataMites 4 years ago

      SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
      Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbours for that example are found (typically k=5). A randomly selected neighbour is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.
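
The steps described in this reply can be sketched in plain Python. This is an illustrative sketch, not the imbalanced-learn implementation: the function name `smote_sample`, the toy points and the use of Euclidean distance are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    rng = rng or random.Random()
    # 1. Pick a random minority example.
    base = rng.choice(minority)
    # 2. Find its k nearest minority neighbours (Euclidean distance, excluding itself).
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # 3. Pick one of those neighbours at random.
    neighbour = rng.choice(neighbours)
    # 4. Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class points with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.1], [1.4, 2.3]]
synthetic = smote_sample(minority, k=5, rng=random.Random(0))
print(synthetic)
```

Because the new point lies on the segment between two real minority points, it always stays inside the region the minority class already occupies.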

  • @AMITSHARMA-fy4wv
    @AMITSHARMA-fy4wv 4 years ago

    Really appreciate sir..Lot off.🙏🏼🙏🏼🙏🏼🙏🏼🤗🤗🤗👌👌👌👌😊😊😊😊

  • @dikshitlenka
    @dikshitlenka 4 years ago

    Very clear explanation. Thanks

  • @alishahsaber3795
    @alishahsaber3795 3 years ago

    Thank you so much!!! Really helpful. thanks

  • @manishbolbanda9872
    @manishbolbanda9872 4 years ago

    wonderfully explained.thank you.

  • @akshayjadhav2213
    @akshayjadhav2213 4 years ago

    very nicely explained sir ..thank you

    • @DataMites
      @DataMites 4 years ago +1

      You are most welcome

  • @ringgaershaikhwani3478
    @ringgaershaikhwani3478 1 year ago +1

    hello sir, the material that you explain is very easy to understand. I want to ask about my project. I have imbalanced data, then I do smote and I model it with KNN, but why after smote does the accuracy go down? 79% to 78%, is there something wrong with my data? Can you help explain this? I am very grateful if you respond to my comment.

    • @DataMites
      @DataMites 1 year ago +1

      With SMOTE, your model will start detecting more cases of the minority class, which increases recall but usually decreases precision. SMOTE puts more weight on the small class and biases the model towards it, so the model predicts the small class more accurately while the overall accuracy may decrease. Accuracy is not a good measure of performance on imbalanced classes.
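
A small worked example of that trade-off. The confusion-matrix counts below are made up purely for illustration (they are not results from the video or from any dataset), and the helper name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for the positive (minority) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical test set: 100 minority and 900 majority samples.
before = metrics(tp=40, fp=10, fn=60, tn=890)   # trained on the imbalanced data
after = metrics(tp=80, fp=120, fn=20, tn=780)   # trained after oversampling

print("before:", before)
print("after: ", after)
```

In this made-up case accuracy drops while minority recall doubles, which is why precision, recall and F1 per class are the more useful numbers to compare.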

  • @babukoshy
    @babukoshy 4 years ago

    This was a great lesson. Thanks a lot

    • @DataMites
      @DataMites 4 years ago

      You're very welcome!

  • @Cobra-bo1fy
    @Cobra-bo1fy 2 years ago

    excellent explanation!

  • @tanvipataskar4597
    @tanvipataskar4597 4 years ago

    Amazing Explanation!!! Thankyou.

  • @svitirur1665
    @svitirur1665 3 years ago

    very good explanation

  • @mozaffarhussain5496
    @mozaffarhussain5496 4 years ago

    Best Explanation sir ..............!

  • @ffckode
    @ffckode 4 years ago

    Thanks for sharing. Very helpful

    • @DataMites
      @DataMites 4 years ago

      Glad it was helpful!

  • @MrMehshankhan
    @MrMehshankhan 4 years ago

    thank you so much man. great thumbs up...

  • @sabbirahmmed7161
    @sabbirahmmed7161 2 years ago

    Thanks, nice explanation

  • @parsayadpa5446
    @parsayadpa5446 3 years ago

    thanks alot for this good tutorial.

  • @AsiaMSaeed
    @AsiaMSaeed 2 years ago

    Amazing. Thanks a lot.

  • @nehaurade4917
    @nehaurade4917 3 years ago

    Perfect video..thank you

  • @samhugh9891
    @samhugh9891 3 years ago

    great video, thank you!

  • @canancetin7897
    @canancetin7897 4 years ago

    Great video! Thanks a lot!!!

  • @muhamm3dali
    @muhamm3dali 1 year ago

    Firstly, Thank you for sharing. I wanna ask something about time series. I have lots of data. But datas are different frequency. I wonder how deal with all datas. And assume that datas edited to same frequency. By the way datas are not fitted normal distribution so imbalanced that's why i am asking. If datas be same frequency, Smote can be appliable for time series? If not how to resample my time series?

  • @amruthakommu4695
    @amruthakommu4695 3 years ago +1

    Great Ashok. That was a well explained video. I tried the same thing on my data set but my accuracy came down from 94 to 86. What could be the cause?

    • @DataMites
      @DataMites 3 years ago

      Hi, we cannot comment until we look at your data and all the approaches you have taken. One possibility is that your earlier model was overfitted.

  • @sasidharansathiyamoorthy6918
    @sasidharansathiyamoorthy6918 3 years ago

    Thank you for the informative video! In this video, you have used SMOTE to rectify imbalance in target label. What methods can we use to deal with class imbalance in categorical features( input) in order to make the model more robust?

    • @DataMites
      @DataMites 3 years ago +1

      Hi Sasidharan Sathiyamoorthy, that imbalance is a property of the input, so balancing the input might affect the target variable. Build two models, with and without balancing, and compare their performance.

    • @RoyalRealReview
      @RoyalRealReview 3 years ago

      @@DataMites sir if we have 54% persons cancer patients and 46% non-cancer patients then do we need balancing? If yes then which balancing technique should be selected?

  • @seeutube8860
    @seeutube8860 2 years ago

    Nice video.
    After applying smote, balanced data was obtained. But balanced data (X_smote,y_smote) was not split (80:20) in to train n test data sets before reapplying classification model?
    Is it necessary or not to split the data again? Or orginal dataset itself was considered as test dataset.

    • @DataMites
      @DataMites 2 years ago

      We have already split and then we balanced the data. So not required to split again.
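
The order described here (split first, then balance only the training part) can be sketched as follows. Random duplication is used as a simple stand-in for SMOTE so the example needs no third-party packages; the helper names `split` and `oversample` and the 90/10 toy data are hypothetical.

```python
import random
from collections import Counter


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle and split (features, label) rows into train and test lists."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def oversample(train, seed=0):
    """Stand-in for SMOTE: duplicate random minority rows until classes match."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in train)
    target = max(counts.values())
    balanced = train[:]
    for label, count in counts.items():
        pool = [row for row in train if row[1] == label]
        balanced += [rng.choice(pool) for _ in range(target - count)]
    return balanced


rows = [([i], "nc") for i in range(90)] + [([i], "c") for i in range(10)]
train, test = split(rows)            # 1. split the original data
train_balanced = oversample(train)   # 2. balance the training part only

print(Counter(label for _, label in train_balanced))
print(Counter(label for _, label in test))  # test set keeps its original imbalance
```

The model is then fitted on `train_balanced` and evaluated on the untouched `test`, so no synthetic or duplicated rows leak into the evaluation.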

  • @sandeshbapu1567
    @sandeshbapu1567 4 years ago

    Nicely explained

    • @DataMites
      @DataMites 4 years ago

      Thank you so much 🙂

  • @perusona_desu5534
    @perusona_desu5534 2 years ago

    in oversampling do you have to make the minority class instances equals the majority class instances ?
    for example:
    can it be 900 nc
    and 800 c

    • @DataMites
      @DataMites 2 years ago

      Oversampling is increasing the samples for minority class to match with the majority class. Undersampling is reducing the samples for majority class to match with minority class.
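
On the question of whether the classes must end up exactly equal: a hedged sketch of the arithmetic for a partial balance. The helper `synthetic_needed` and the starting count of 100 minority samples are assumptions for illustration. In imbalanced-learn, the `sampling_strategy` parameter of `SMOTE` accepts a float for binary problems, interpreted as the desired minority-to-majority ratio after resampling.

```python
def synthetic_needed(n_minority, n_majority, ratio=1.0):
    """Number of synthetic minority samples needed so that
    minority / majority reaches `ratio` (1.0 means fully balanced)."""
    target = round(ratio * n_majority)
    return max(0, target - n_minority)


# Hypothetical start: 100 "c" samples against 900 "nc" samples.
full = synthetic_needed(100, 900)                 # match the majority exactly
partial = synthetic_needed(100, 900, 800 / 900)   # stop at 800 "c" vs 900 "nc"

print(full, partial)
```

So a 900 / 800 outcome is possible: it corresponds to a ratio of 800/900 instead of the default full balance.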

  • @athilakshmir8589
    @athilakshmir8589 3 years ago

    nice explanation

  • @younesgasmi8518
    @younesgasmi8518 9 months ago

    Thanks so much bro..i have shown some data scientists used undersampling and oversampling before Splitting the dataset into training and testing..in my research paper we heve used NEARMISS technique to balance the dataset..i have got a good results with using cross validation Splitting and Extra tree classifier as model and also the same model to select the best importance features where my results are : (ACC 0.97 , F1 0.97 and AUC 0.99) are there results may be accepted for publishing?

    • @DataMites
      @DataMites 9 months ago +1

      You achieved good results. However, whether your results are acceptable for publishing depends on several other factors too.

  • @tahanics901
    @tahanics901 3 years ago

    Very good explanation Thanks. but this code, is applicable with text data (tweets) or not?

    • @DataMites
      @DataMites 3 years ago

      Yes, after converting the text to numerical vectors. Use fit_resample().
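
One simple way to turn text into numerical vectors is a bag-of-words count, sketched below in plain Python. The function `bag_of_words` and the two sample tweets are hypothetical; in practice a library vectorizer such as scikit-learn's `CountVectorizer` or `TfidfVectorizer` would usually do this step.

```python
def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of texts."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for text in texts:
        row = [0] * len(vocab)
        for word in text.lower().split():
            row[index[word]] += 1
        vectors.append(row)
    return vocab, vectors


tweets = ["great video", "bad video bad audio"]
vocab, X = bag_of_words(tweets)
print(vocab)
print(X)
```

The resulting numeric matrix `X`, together with the labels, is what would then be passed to the sampler's `fit_resample()`.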

  • @chinedumjoseph9875
    @chinedumjoseph9875 3 years ago

    Oh! I got it. Don't worry. Thanks

    • @DataMites
      @DataMites 3 years ago

      You're welcome

    • @inspiritlashi9994
      @inspiritlashi9994 3 years ago

      Hi, can I know how did you correct it? i got the same error message

  • @kurniawandk5078
    @kurniawandk5078 2 years ago

    Very informative. I have a question, sir: is it possible to set how many synthetic samples SMOTE creates? For example, I want to increase n_sample to 200%, so how do I set this parameter in Python code?

    • @DataMites
      @DataMites 2 years ago

      Your question is not clear. Can you elaborate plz?

  • @michaelpanashemudimbu7405
    @michaelpanashemudimbu7405 3 years ago

    Awesome video

  • @lavanyanayak8707
    @lavanyanayak8707 3 years ago

    Thank you very much for this video. I have a precipitation dataset containing 4 columns and 8000 rows, each of them has a lot of zeros and only a few continuous values. I would like to know if I can use smote in this case?

    • @DataMites
      @DataMites 3 years ago +1

      Hi Lavanya Nayak, the GitHub link is provided in the description. Please check it out.

  • @HarishKumar-qj9pp
    @HarishKumar-qj9pp 3 years ago

    getting attribute error: 'SMOTE' object has no attribute 'fit_sample' but I have all the packages requirement satisfied still showing the error

    • @DataMites
      @DataMites 3 years ago

      Hi please check imbalanced-learn.org/stable/over_sampling.html for any update in imbalance learn package

  • @wenshanpan8726
    @wenshanpan8726 3 years ago

    Excellent!

  • @cliffordtarimo1511
    @cliffordtarimo1511 4 years ago

    Great video on SMOTE. Do you have a video on undersampling? Can someone perform both undersampling and oversampling in one line of code??? THANKS.

    • @DataMites
      @DataMites 3 years ago

      Another flavor of SMOTE is SMOTETomek, which combines undersampling of the majority class with oversampling of the minority class.

  • @jongcheulkim7284
    @jongcheulkim7284 3 years ago

    Thank you so much. ^^

  • @zakariaabderrahmanesadelao3048
    @zakariaabderrahmanesadelao3048 4 years ago

    what a crystal clear explanation. thank you.

    • @DataMites
      @DataMites 4 years ago

      You're very welcome!

  • @kunalgoyal8529
    @kunalgoyal8529 4 years ago

    While dividing training and test data shouldn't you be doing "stratify=y" ? To ensure test data and training data set have equal proportion of outcome variable?

    • @mr.techwhiz4407
      @mr.techwhiz4407 4 years ago

      that would be undersampling

    • @DataMites
      @DataMites 4 years ago

      The aim of a machine learning model is to generalize from the training set so that performance on unseen data is good. We don't care what the test data consists of; instead, we try to give the algorithm a more generalized pattern.
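
For readers wondering what the `stratify=y` option mentioned in this thread does: a stratified split keeps the class proportions the same in the training and test parts. It does not add or remove samples, so it is neither undersampling nor a fix for the imbalance itself. A plain-Python sketch, with the hypothetical helper `stratified_split` and made-up 90/10 labels:

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return train_idx, test_idx


labels = ["nc"] * 90 + ["c"] * 10
train_idx, test_idx = stratified_split(labels)

print(sum(labels[i] == "c" for i in train_idx), "minority samples in train")
print(sum(labels[i] == "c" for i in test_idx), "minority samples in test")
```

With a plain random split on a small minority class, the test part can by chance contain very few minority samples, which makes the per-class metrics noisy.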

  • @OriginalBernieBro
    @OriginalBernieBro 4 years ago

    Running into a problem with sklearn 'support' column still looking unbalanced after smoting on print(classification_report(y_test, y_pred)) what gives?

    • @DataMites
      @DataMites 4 years ago

      The support is the number of samples of the true response that lie in that class.
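
A short illustration of why the support column still looks unbalanced after SMOTE: support is counted on the true test labels, and the test set is not resampled. The label lists below are hypothetical.

```python
from collections import Counter

# Hypothetical labels after a train/test split.
y_train = ["nc"] * 72 + ["c"] * 8
y_test = ["nc"] * 18 + ["c"] * 2

# Hypothetical training labels after oversampling: only this part changes.
y_train_balanced = ["nc"] * 72 + ["c"] * 72

# "support" in a classification report is simply the class count in y_test.
support = Counter(y_test)
print(support)
```

Seeing the original imbalance in the support column is therefore expected and does not mean the resampling failed.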

  • @insidiousmaximus
    @insidiousmaximus 3 years ago

    great video thank you. I am trying to figure out how to use this with a generator flowing from directory?

    • @DataMites
      @DataMites 3 years ago

      Hi insidiousmaximus, thanks for reaching out to us with your query.
      Can you please state your query more precisely so that we can help you?

  • @ShubhamKumar-id6pf
    @ShubhamKumar-id6pf 4 years ago

    SIr, I went on as per the recommended procedures but my jupyter environment giving an AttributeError that SMOTE object has no attribute '_validate_data'.
    Can you please help me with the.

    • @DataMites
      @DataMites 4 years ago +1

      You need to upgrade scikit-learn to version 0.23.1.

  • @abhimynampati2929
    @abhimynampati2929 2 years ago

    Hey Ashok, can u make a video on dsste algorithm for removing class imbalance?

  • @rukaiyaa191
    @rukaiyaa191 2 years ago

    which module is used for alternative module of imblearn in python sir(for handling imbalance dataset)

    • @DataMites
      @DataMites 2 years ago

      For balancing the dataset we have only imblearn module. But there are other ways to deal with the imbalanced dataset.

  • @petersq5532
    @petersq5532 3 years ago

    how split stratify solves the problem?

  • @JainmiahSk
    @JainmiahSk 4 years ago +1

    you haven't encoded the target variable?

    • @DataMites
      @DataMites 4 years ago

      The target variable doesn't require encoding.

  • @patrickbormann8103
    @patrickbormann8103 4 years ago

    Amazing!

  • @anaghadamame196
    @anaghadamame196 3 years ago

    Thank you sir...👍

    • @anaghadamame196
      @anaghadamame196 3 years ago

      Can you explain which algorithm should be selected for regression problem....it will help me alot

    • @DataMites
      @DataMites 3 years ago

      All the best

  • @ishan7491
    @ishan7491 3 years ago

    Can you please explain this part of the code in the label encoder section:

    • @DataMites
      @DataMites 3 years ago

      Hi Ishan, please reframe your query.

  • @inspiritlashi9994
    @inspiritlashi9994 3 years ago

    Sir,
    Can I know how to run a logistic regression on the oversampled dataset?

    • @DataMites
      @DataMites 3 years ago

      Hi Inspirit Lashi, you can use SMOGN for preprocessing of your dataset. For more information: proceedings.mlr.press/v74/branco17a/branco17a.pdf

  • @sushmithajanapati7785
    @sushmithajanapati7785 2 years ago

    Does Smote algorithm support Multi output classification?

    • @DataMites
      @DataMites 2 years ago

      Yes, you can use SMOTE.

  • @oumaimasouid5229
    @oumaimasouid5229 3 years ago

    i find this error >> plz help !

    • @DataMites
      @DataMites 3 years ago

      Hi, please use fit_resample

  • @shivki23
    @shivki23 4 years ago

    subscribed for ur content

  • @hendripriyambowo1427
    @hendripriyambowo1427 4 years ago

    hi sir i have question how did we implement those resampling technique in neural network, let say if we implement embedding layer and work with multiple kind of data
    is that resampling technique make our data losing such information?

    • @DataMites
      @DataMites 3 years ago

      You can use a mini-batch SGD optimizer to handle an imbalanced dataset.

  • @terryterry3733
    @terryterry3733 3 years ago

    Hi sir what is the data type for outcome ? i think it is in object . Did u convert that into float or int?

    • @DataMites
      @DataMites 3 years ago

      Hi Terry, thanks for reaching out to us with your query.
      The Outcome column is a string, and we label-encoded it to an integer.

  • @chinedumjoseph9875
    @chinedumjoseph9875 3 years ago

    Thank you for this nice explanation. I was making progress with the codes but when I tried to fit using the command X_train_smote, y_train_smote = smote.fit_sample(X_train.astype('float'),y_train), I got error saying AttributeError: 'SMOTE' object has no attribute 'fit_sample'. I need urgent help please. Thank you

    • @DataMites
      @DataMites 3 years ago

      Hi Chinedum Joseph, can you please list the version of python and scikit learn in your system?

    • @ObaidoGeorge
      @ObaidoGeorge 2 years ago

      Use smote.fit_resample instead of smote.fit_sample.

    • @AbdulLatif-fu9jz
      @AbdulLatif-fu9jz 1 year ago

      @@ObaidoGeorge Tqvm for your help
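
The fix in this thread is simply the method rename: recent imbalanced-learn releases provide `fit_resample`, and as far as I know the older `fit_sample` name has been removed. For code that has to run against both old and new installs, a small compatibility helper is one option. The helper `resample` and the two stand-in classes below are hypothetical and exist only to make the sketch runnable without the library.

```python
def resample(sampler, X, y):
    """Resample with `fit_resample`, falling back to the legacy `fit_sample` name."""
    method = getattr(sampler, "fit_resample", None) or getattr(sampler, "fit_sample")
    return method(X, y)


class LegacySampler:
    """Hypothetical stand-in for an old sampler that only has fit_sample."""

    def fit_sample(self, X, y):
        return X, y


class CurrentSampler:
    """Hypothetical stand-in for a current sampler that has fit_resample."""

    def fit_resample(self, X, y):
        return X, y


X, y = [[0.0], [1.0]], [0, 1]
print(resample(LegacySampler(), X, y))
print(resample(CurrentSampler(), X, y))

# With imbalanced-learn installed, the call from the comment above would become:
#   from imblearn.over_sampling import SMOTE
#   X_train_smote, y_train_smote = resample(SMOTE(), X_train.astype("float"), y_train)
```

If only a current install matters, calling `smote.fit_resample(X_train, y_train)` directly is enough.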

  • @wajeehanaz9115
    @wajeehanaz9115 3 years ago

    Hello Sir!
    can you please tell me how to generate images using smote technique ???
    Thanks in advance...

    • @DataMites
      @DataMites 3 years ago

      For image generation there is a different method called data augmentation, which creates new synthetic data from existing data.

  • @Adinasa2
    @Adinasa2 1 year ago

    AttributeError: 'SMOTE' object has no attribute 'fit_sample'

  • @patelajay1010
    @patelajay1010 3 years ago

    I have one doubt. What if data contains Nan values and you want to do under_sampling? If you impute Nan values with Mean() then there will be information leakage because we impute data before splitting it into train and test dataset. Could you please tell me what should be the possible solution in this case?

    • @DataMites
      @DataMites 3 years ago +1

      Hi Ajay Patel, if you have a large dataset, you can certainly drop the NaN values.

    • @patelajay1010
      @patelajay1010 3 years ago

      @@DataMites Sir I have continuous data coming from sensors. Dropping few rows will lead to break a pattern.

    • @DataMites
      @DataMites 3 years ago

      @@patelajay1010 In that case without knowing the source and significance of your nan value, we cannot comment on anything.

    • @patelajay1010
      @patelajay1010 3 years ago

      @@DataMites ok sir. Thank you for your response.
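
On the leakage concern raised in this thread, one common pattern is to split first and compute the imputation value on the training part only, then reuse it for the test part. This is a sketch under that assumption, not the approach from the video; the helper names and the toy values are hypothetical, and for ordered sensor data a time-aware method such as interpolation may suit better than a global mean.

```python
import math


def mean_ignoring_nan(values):
    """Mean of the values that are not NaN."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def impute(values, fill):
    """Replace every NaN in `values` with `fill`."""
    return [fill if math.isnan(v) else v for v in values]


nan = float("nan")
train = [1.0, nan, 3.0, 5.0]   # hypothetical training column
test = [nan, 4.0]              # hypothetical test column

train_mean = mean_ignoring_nan(train)    # learned from the training data only
train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)   # the training mean is reused on the test set

print(train_filled, test_filled)
```

Any resampling would then be applied to the imputed training part, leaving the test part untouched.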

  • @dkandasamypandian719
    @dkandasamypandian719 3 years ago

    Good

  • @snehasamadder3790
    @snehasamadder3790 2 years ago

    after I resample an imbalance dataset how can I download the resampled dataset from colab?

    • @DataMites
      @DataMites 2 years ago

      Combine the resampled x and y and create a new dataframe, then convert that dataframe to a csv file using to_csv()

  • @mohan250s
    @mohan250s 2 years ago

    ur awesome

  • @sunnyarora4916
    @sunnyarora4916 3 years ago

    Any video where we use SMOTE for regression??

    • @DataMites
      @DataMites 3 years ago +1

      Hi Sunny Arora, you can use SMOGN for it. For more information: proceedings.mlr.press/v74/branco17a/branco17a.pdf

    • @sunnyarora4916
      @sunnyarora4916 3 years ago

      @@DataMites Thank you, is it less likely to use SMOGN?

  • @wajeehanaz9115
    @wajeehanaz9115 3 years ago

    Thank you for informative video! I used your coding but got error "
    ValueError: could not convert string to float: '5more'"...plz tell me how can I resolve this error...Thanks in advance:)

    • @DataMites
      @DataMites 3 years ago

      We have to look into your code. But please check if you have converted all the categorical values to numerical values in your dataset.
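
An error like `could not convert string to float: '5more'` typically means a categorical column is still stored as text. A minimal sketch of an ordinal encoding for such a column follows; the column name `doors`, its category order and the helper `ordinal_encode` are assumptions made for illustration, since the actual dataset is not shown in the thread.

```python
def ordinal_encode(values, order):
    """Map each category to its position in `order`."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]


# Hypothetical categorical column containing the value from the error message.
doors = ["2", "4", "5more", "3"]
encoded = ordinal_encode(doors, order=["2", "3", "4", "5more"])
print(encoded)
```

Every non-numeric column needs a step like this (or one-hot encoding) before the data is passed to SMOTE.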

    • @RoyalRealReview
      @RoyalRealReview 3 years ago

      @@DataMites sir I am predicting heart disease and out of my sample 54% people have heart disease and rest 46% don't have so which method I should use for balancing?

  • @sanyajain2127
    @sanyajain2127 4 years ago

    Getting an error: ValueError: Unknown label type: 'continuous-multioutput'

    • @DataMites
      @DataMites 4 years ago

      It can be due to multiple reasons, such as using logistic regression for classification with more than 2 classes, or using a classifier when the target variable is continuous.

  • @aiswaryalakshmi1349
    @aiswaryalakshmi1349 2 years ago

    Cannot install imblearn. Kindly help me with this

    • @DataMites
      @DataMites 2 years ago

      Once you install imblearn, restart the kernel. If it doesn't work try any of these codes: "!pip install delayed" or "pip install --user imblearn"

  • @parthasarathyk5476
    @parthasarathyk5476 2 years ago

    Hi, did anyone applied this concept to image dataset. please anyone let me know...

    • @DataMites
      @DataMites 2 years ago

      For image generation you can use a method called data augmentation, which creates new synthetic data from existing data.

  • @faisalshehzad9504
    @faisalshehzad9504 4 years ago

    thanks.