(Code) Iterative Imputer | MICE Imputer in Python | Machine Learning

  • Published: 21 Aug 2024
  • #mice #python #iterative
    In this tutorial, we'll look at sklearn's IterativeImputer, which implements the Multivariate Imputation by Chained Equations (MICE) algorithm: a technique that imputes missing values in a dataset by modelling each incomplete column from the other columns and estimating the best prediction for each missing value.
    Machine learning models can't inherently work with missing data, so it is important to learn how to choose between different imputation techniques to achieve the best possible model for a given use case.
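    The workflow described above can be sketched in a few lines. This is a minimal, hedged example (the array values are made up for illustration, not taken from the video's dataset); note that IterativeImputer is experimental in sklearn and needs the explicit enable import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: each column has one missing value
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [np.nan, 6.0, 9.0],
    [4.0, 8.0, 12.0],
])

# Each incomplete column is modelled from the others, round-robin
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)  # every NaN replaced by a model-based estimate
```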
    I've uploaded all the relevant code and datasets used here (and all other tutorials for that matter) on my github page which is accessible here:
    Link:
    github.com/rac...
    Some useful resources that might be helpful for further reading:
    cran.r-project...
    stefvanbuuren....
    www.ncbi.nlm.n...
    towardsdatasci...
    towardsdatasci...
    towardsdatasci...
    If you like my content, please do not forget to upvote this video and subscribe to my channel.
    If you have any questions regarding any of the content here, please feel free to comment below and I'll be happy to assist you in whatever capacity possible.
    Thank you!

Comments • 54

  • @SodaPy_dot_com
    @SodaPy_dot_com 2 months ago

    Very detailed with the parameters. Love it!

  • @chandravardhansinghkhichi2648
    @chandravardhansinghkhichi2648 3 years ago +2

    Rachit, I'm learning a lot from your tutorials. I find your content very informative yet very easy to understand. The cute little 'Namaste' at the start is warming :). Thanks a lot. Also, I have some doubts:
    1) Can we use a classifier as the estimator for categorical and discrete features only, and a regressor for numerical features? Would that be a good practice?
    2) I'm in the learning phase, so I often wonder which imputation I should choose. For example, if the multiple imputer or KNN imputer is more advanced, they should be used in all cases, so why do they teach other imputers at the start (like random sample / arbitrary imputation, end-of-distribution, or mean/median/mode imputers)?
    Thanks in advance :D. I really appreciate that you try to respond to everyone you can, taking your valuable time.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      Hi ChandraVardhan! Thank you for the kind words, means a lot :)
      For Q1: the choice between classifier and regressor depends on the target variable, right? If the target is continuous, you'd need to use a regressor on the whole data, and similarly, for a categorical target column you'd need to use a classifier.
      For Q2: ML is hardly an exact science. For some data, a simple logistic regression can yield better results than, say, a random forest classifier or other work-intensive algorithms lol. So you gotta try different things and pick the one that gives the best results on your data.
      Thanks for the "namaste" thing too, haha! xD

  • @itsamanrai
    @itsamanrai 2 years ago +1

    Thank you for this informative video, Rachit. Quick question: the dataset I am working on has some columns with ~75% missing values; would iterative imputation work there? Also, can we use iterative imputation during EDA, i.e. before the train/test split?

  • @rukiakuchiki629
    @rukiakuchiki629 2 years ago

    Hello... I really love your explanation 💕💕💕 Thank you so much... but I have a question for you... because you have just 1 NA in each column, what if we have more than 2 missing values in a column? For example, if the first column has 3 missing values, will all three be predicted simultaneously, or one by one like in your explanation?
    Sorry if my English is bad :" I hope you respond :" Thanks in advance^^

    • @rachittoshniwal
      @rachittoshniwal  2 years ago

      Hi Rukia! I'm so glad it helped! :)
      so if we have multiple missing values in a column, all of those rows will behave like a "test set" of sorts, with the model being fit on the fully-filled rows of that column. Once we have the model, we can use it to predict each of the "test set" rows.
      Hope it helps!
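      The "test set" idea in this reply can be sketched as follows: a hedged illustration with a made-up array whose first column has three missing values, all of which get filled by the model fit on the observed rows.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Column 0 has three NaNs; those rows act like a "test set" for the
# per-column model, which is fit on the fully-observed rows.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [np.nan, 30.0],
    [np.nan, 40.0],
    [np.nan, 50.0],
    [6.0, 60.0],
])

X_filled = IterativeImputer(random_state=0).fit_transform(X)
print(X_filled[:, 0])  # all three missing entries are now predicted
```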

  • @SQDLowkey
    @SQDLowkey 3 years ago +1

    Hello Rachit,
    Thank you for the Great video.
    Is there any method or attribute using which we can get the value of change from the IterativeImputer object and store it in a variable?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      I don't see any direct method/attribute, but Python is your good friend:

      import sys

      original_stdout = sys.stdout
      with open('filename.txt', 'w') as f:
          sys.stdout = f
          imp.fit_transform(X)  # the imputer's printed output now goes to the file
      sys.stdout = original_stdout

      This should save the printed statements of imp.fit_transform(X) (assuming the imputer was created with verbose > 0) in filename.txt, and then you can read that file and go berserk with whatever string manipulation/regex methods you wanna apply to extract whatever you want lol.
      credit to : stackabuse.com/writing-to-a-file-with-pythons-print-function/
      thanks Aman, I got to learn this new thing today as well!

    • @SQDLowkey
      @SQDLowkey 3 years ago +1

      @@rachittoshniwal thanks a lot for this.🙏you rock

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@SQDLowkey haha!

  • @briankantanka3273
    @briankantanka3273 1 year ago

    What did you press to expand the function and see all of its applicable parameters/arguments at 1:33?

    • @rachittoshniwal
      @rachittoshniwal  1 year ago

      Once inside the function, hold shift and press "tab" twice to get a floating documentation. Hold shift and press tab 4 times to fix it at the bottom of the screen.

  • @datascientist2958
    @datascientist2958 3 years ago

    And thank you for the implementation.

  • @soheilaahmadi4807
    @soheilaahmadi4807 2 years ago

    Hi Rachit. Thank you for the explanation; it was very useful. Actually, I have a big dataset containing 6068 rows and 10 variables, which are anthropometric measurements such as stature, weight, waist circumference, etc. There are no missing values in the dataset, but there are some new users who should enter their 10 measurements and may miss some of them. For example, a new user (the 6069th sample) may enter just 4 out of 10 measurements. I want to predict the other 6 missing variables based on my old dataset of 6068 samples without any missing values. I wanted to know if I can use the MICE approach in this way? I mean, imagine that the new samples can be appended to the old dataset with missing values. And if so, how should I know which estimator is better?

  • @seshilrs
    @seshilrs 3 years ago

    Hi Rachit, I am a newbie to ML. In KNN imputation you split the data, but in the iterative one you didn't; which is right?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Hi Talari. Yes, we have to split the data. In the first example I wanted to explain how it works, hence I didn't dive into splitting. However, in the second half I have shown an example with splitting as well.
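      The split-then-impute workflow from this reply can be sketched like so (a hypothetical random dataset, not the one from the video): fit the imputer on the training rows only, then reuse the fitted imputer on the test rows to avoid leakage.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

# Hypothetical data with ~10% of entries knocked out at random
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

imputer = IterativeImputer(random_state=0)
X_train_imp = imputer.fit_transform(X_train)  # learn only from training rows
X_test_imp = imputer.transform(X_test)        # reuse the fitted models on test rows
```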

  • @rithikmathur8944
    @rithikmathur8944 3 years ago

    Hi Rachit, great explanation. I have two questions:
    1. I used MICE imputation with linear regression and then ran ridge regression for prediction. I found the R² score to be 62.9% and the RMSE to be 7.6. Can you explain how this is possible? Such a good RMSE with such a low R² score.
    2. I used a decision tree regressor for MICE imputation, and with it the imputation takes around 2 to 3 hours. Is there anything you can suggest?
    Thanks!

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      RMSE is a kinda relative measure. An RMSE of 80 if you're predicting salaries of employees (who earn 100k) is brilliant, but that same RMSE of 80 while predicting a student's exam score is atrocious (cuz the exam itself is out of 100 marks!), so you gotta see the context IMO.
      Idk why it's taking so long; maybe the data size is too large? Or some parameters possibly need tweaking.

    • @rithikmathur8944
      @rithikmathur8944 3 years ago

      Hi Rachit, thanks for the reply. I will consider it. Also, I have a complete dataset of 900 rows, which I think is not too big.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@rithikmathur8944 yeah 900 ain't big. Idk then what's the problem sorry.

    • @rithikmathur8944
      @rithikmathur8944 3 years ago +1

      @@rachittoshniwal No problem, thanks for the help👍

  • @vigneshnathan4317
    @vigneshnathan4317 2 years ago

    Even after doing the steps, I am still having null values in the dataset. Is it because it is in array format that it doesn't get transformed? My code isn't showing any error though.

  • @saumyashah6622
    @saumyashah6622 3 years ago

    Hey, what about the case when we can't see a correlation in the scatter plot (when the scatter is actually random)? In that case we can't apply any regression algorithms. Can we use KNNImputer in that case? Please suggest a way.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      An ML model is kinda like a black box sometimes. You can try and check accuracies with different imputers, check over/underfitting etc., and take a wise call.

  • @jjxed
    @jjxed 3 years ago

    Hi Rachit, do you have any idea why running IterativeImputer's fit_transform method in a Jupyter notebook causes my computer to freeze? No other Python method has this effect.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Hey Jack. I've no idea why this is happening. Have you tried these:
      Update sklearn
      Restart kernel
      Try running it afresh in a new notebook
      First off try the update.

  • @joeyk2346
    @joeyk2346 3 years ago

    Great video Rachit!! I am looking for a tutorial on how to apply MICE while performing cross validation in Python. Could you please share a link/code where you performed cross validation while applying MICE? Otherwise do you have any insights on how to accomplish this? Thank you very much!!

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Hi Joey! I'm glad it helped!
      Are you facing any particular problem while applying CV to MICE?

    • @joeyk2346
      @joeyk2346 3 years ago

      @@rachittoshniwal Hi Rachit, thank you for your prompt reply. Just to clarify, I am trying to figure out how to code the "normal cross-validation" in order to find the optimal hyperparameters when predicting the response (not cross-validation to optimize the imputation). I am working with missing data. So say you want to do 5-fold CV: you start by training your model on 4 folds and you predict the 5th fold. Before training on the 4 folds you apply MICE imputation (using only the 4 folds), then you impute the 5th fold using the pretrained MICE model. You do not want to impute using all 5 folds together, since that would bias your results. I was just wondering if you have some code/resources on how to accomplish this in Python? I am looking forward to hearing back from you. Thanks a lot - Joey

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@joeyk2346 you're looking for a grid search probably?

    • @joeyk2346
      @joeyk2346 3 years ago

      @@rachittoshniwal Yes exactly! Maybe a grid search with a Pipeline. Just looking for a way to implement it in Python. This is crucial since you always need to tune/evaluate your model before deployment. Any insight would be very appreciated

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@joeyk2346 you're kinda in luck here lol. I've done a video on this: ruclips.net/video/KzIQ3G_TEFg/видео.html
      Both with and without pipeline approaches. Hope it helps :)

  • @123TK
    @123TK 3 years ago

    Is it possible to use MICE to impute categorical variables?

    • @thepresistence5935
      @thepresistence5935 2 years ago

      Yes, he said clearly that after encoding we could do that.

  • @datascientist2958
    @datascientist2958 3 years ago

    Sir, is this the predictive mean matching approach? Actually, it was not used as a parameter.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      I'm sorry I didn't understand?

    • @datascientist2958
      @datascientist2958 3 years ago

      @@rachittoshniwal predictive mean matching is a method used in multiple imputation.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@datascientist2958 yes... and?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Oh, ok. No, this approach is not PMM; it is the regression-based method.

    • @datascientist2958
      @datascientist2958 3 years ago

      @@rachittoshniwal can you please make a video on that. Thanks in advance

  • @scifimoviesinparts3837
    @scifimoviesinparts3837 3 years ago

    Can I use it with RandomForestClassifier for Categorical data ?

    • @plemplem94
      @plemplem94 3 years ago

      Yes, I did it myself; however, you need to set the parameter 'initial_strategy' to 'most_frequent'.

  • @karteekmenda3282
    @karteekmenda3282 3 years ago

    Hey Rachit. Can you please check your GitHub once? I didn't find any notebook on this iterative imputer.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      hi Karteek,
      you'll find it now :)
      github.com/rachittoshniwal/machineLearning/blob/master/Iterative%20Imputer%20demo.ipynb

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +2

      @@karteekmenda3282 Wow, thanks man! means a lot :)

  • @datascientist2958
    @datascientist2958 3 years ago

    If we have a categorical feature, can we use this approach?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Technically, you'd have to convert them to integers using an ordinal/one-hot encoder. But since they're discrete in nature, it makes less sense when their imputations turn out to be floats like 0.56 etc.
      Hence it's better to avoid the iterative imputer for categorical data IMO.

    • @datascientist2958
      @datascientist2958 3 years ago

      Doesn't multiple imputation work with discrete values?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@datascientist2958 it does, but since we're using a regression model, the output will be a continuous variable. So if we encode red, green, blue as 0, 1, 2 and have a few missing values in the column, the predictions will not strictly be 0/1/2; they will most likely be floats.
      PMM would work for discrete data.
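      A small hypothetical illustration of this point: with the default regression estimator, the imputed value for an encoded colour column is a continuous estimate, not guaranteed to land exactly on 0, 1, or 2.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Column 0: red/green/blue encoded as 0/1/2 with one NaN;
# column 1: a numeric feature
X = np.array([
    [0.0, 1.0],
    [1.0, 2.0],
    [2.0, 3.0],
    [0.0, 1.2],
    [1.0, 2.1],
    [np.nan, 2.9],
])

# Default estimator is a regressor, so the fill is a float estimate
X_filled = IterativeImputer(random_state=0).fit_transform(X)
print(X_filled[5, 0])  # a continuous value, not necessarily an integer class
```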

    • @datascientist2958
      @datascientist2958 3 years ago

      Thank you very much. I would appreciate it if you make a tutorial on PMM.