Step-by-Step procedure of KNN Imputer for imputing missing values | Machine Learning

  • Published: 11 Sep 2024

Comments • 74

  • @kumarnikhil8197
    @kumarnikhil8197 3 years ago +9

    Very well explained! Teachers like you must be appreciated!!

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +2

      Wow, thanks Nikhil ! Appreciate your kind words! :)

    • @kumarnikhil8197
      @kumarnikhil8197 3 years ago +4

      @@rachittoshniwal I know it's a really tough job to make educational videos, which hardly get many views compared to the filth piling up on YT. Please don't lose motivation; just remember there is always that one struggling person who, with your help, can sleep peacefully that night.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +2

      @@kumarnikhil8197 you're making me nervous now Nikhil :p thanks btw!

  • @r.s.572
    @r.s.572 3 months ago

    Thank you for explaining this! :) Poor PhDs are thankful for people like you who use their free time to make such videos!

  • @akshatjain1746
    @akshatjain1746 16 days ago

    Short, simple, informative!

  • @ritvikpalvankar1903
    @ritvikpalvankar1903 2 years ago

    Hello, thank you so much for the clear explanation. I was asked this question in an interview, and I think I did a good job thanks to watching this video the day before. :)

    • @rachittoshniwal
      @rachittoshniwal  2 years ago

      Wow, I'm so glad it helped Ritvik! I hope you get the job! :)

  • @pushpakkothekar9271
    @pushpakkothekar9271 2 years ago

    Learned KNN imputation, buddy, thank you. Liked and subscribed, brother...

  • @tridibpal857
    @tridibpal857 2 years ago

    Sir, you are awesome. Please take a bow.

  • @prithvisingh4173
    @prithvisingh4173 3 years ago +1

    Nice, bro, you got my concepts and doubts cleared. Thanks...

  • @ethiopiansickness
    @ethiopiansickness 3 years ago

    I'm surprised you don't have more subscribers. A lot of your videos are at the top of search queries on YouTube, so I am sure you will eventually get the subscribers and views that you deserve. Keep up the great work!

  • @ivanrazu
    @ivanrazu 3 years ago +5

    Nice example, I do have a question. When you do the imputation for the other missing values, do you use the imputed value you just found when computing distances with respect to that row? Or do you do all imputations simultaneously?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +3

      hi Ivan!
      No, we do not pay attention to any newly imputed values for imputing other values.
      All NaNs get imputed independently of each other. So basically, yeah, in a sense they all get imputed simultaneously.

    • @ivanrazu
      @ivanrazu 3 years ago

      @@rachittoshniwal Ok got it. Thank you, Rachit!

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      @@ivanrazu :)
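
Rachit's point in this thread, that freshly imputed values are never fed back into the distance computations, can be checked with a small sketch (the 4×2 array is made up; scikit-learn's `KNNImputer` is assumed):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows 1 and 2 each miss one value; each must be filled from the
# *originally* observed values only, never from the other row's
# freshly imputed entry.
X = np.array([[1.0, 1.0],
              [2.0, np.nan],
              [np.nan, 2.0],
              [4.0, 4.0]])

imputed = KNNImputer(n_neighbors=1).fit_transform(X)
# Row 1's NaN is filled from row 0, its nearest neighbor that has
# that column observed; row 2's own NaN plays no part in it.
```

Because both NaNs are filled from the original observed data, the order of imputation does not matter, which is what "simultaneously" means here.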

  • @DrizzyJ77
    @DrizzyJ77 6 months ago

    Thanks
    Needed a clear explanation for my missed class😅

  • @md.faisalsohail9108
    @md.faisalsohail9108 4 years ago +1

    Simply awesome. Thanks, brother.

    • @rachittoshniwal
      @rachittoshniwal  4 years ago

      I'm glad you liked it! :)

    • @md.faisalsohail9108
      @md.faisalsohail9108 4 years ago

      @@rachittoshniwal hope you and your channel grow exponentially.

    • @rachittoshniwal
      @rachittoshniwal  4 years ago

      @@md.faisalsohail9108 whoa! Thank you for the kind words! :)

  • @kennethbassett6330
    @kennethbassett6330 2 years ago +1

    Thanks for the great video! I have a question:
    Let's say I am finding the 5 nearest neighbors, and I am trying to fill in a missing value in column A for a certain data point. If one of its nearest neighbors is also missing a value in column A, should I take the average of the remaining 4 neighbors, or should I include the next closest neighbor (the 6th closest) in the average?
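
For what it's worth, scikit-learn's `KNNImputer` resolves exactly this case by choosing the nearest neighbors that actually have a value for the feature being filled, so a neighbor that is itself missing column A is skipped in favor of the next-closest donor. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 1 is by far the closest row to row 0, but it is also missing
# column 1, so it cannot donate; the imputer uses rows 2 and 3 instead.
X = np.array([[0.0, np.nan],
              [0.1, np.nan],
              [1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

imputed = KNNImputer(n_neighbors=2).fit_transform(X)
# Row 0, column 1 becomes mean(10, 20) = 15, taken from the two
# nearest rows that have column 1 observed.
```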

  • @TheReluctantCoder
    @TheReluctantCoder 3 years ago

    Very good explanation! Thank you!

  • @bhavnatanwar8591
    @bhavnatanwar8591 3 years ago +1

    Your video is really helpful :)). I have a question: does the KNN imputer use the imputed values in further calculations? What I mean is, suppose we have imputed the missing values in the first column and now have to impute the values in the second column. Does it use the values imputed in the first column, or does it treat them as missing and count them as missing entries in the weight?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      Hi Bhavna, thanks!
      Well, no. It doesn't take into account the imputed values in one column while imputing other columns. It considers the "original" missing values as missing. All columns are independent of each other during imputations basically.

    • @bhavnatanwar8591
      @bhavnatanwar8591 3 years ago +1

      @@rachittoshniwal Thanks, this was really helpful :))

  • @pumpitup1993
    @pumpitup1993 4 years ago +4

    Very nicely explained, can you do the same for MICE imputation?

    • @rachittoshniwal
      @rachittoshniwal  4 years ago

      I'm glad you liked it!
      I'll look into MICE!

    • @rachittoshniwal
      @rachittoshniwal  4 years ago

      Hi Sourav, I've just published one on MICE here: ruclips.net/video/WPiYOS3qK70/видео.html
      Do check it out and let me know if you do (or if you do not!) find it useful :)

    • @pumpitup1993
      @pumpitup1993 4 years ago +1

      @@rachittoshniwal Yes, I just saw it. Really helpful, thanks a lot!

  • @ismafoot11
    @ismafoot11 2 years ago +1

    Excellent video! However, what is the impact of doing this when the features have extreme variability? For example, if one column ranged between 0 and 1 million and the other columns hovered around 10-20. Should you normalize/standardize your data beforehand? If so, should you normalize or standardize, and how would you do it if you have missing values in that column?

    • @rachittoshniwal
      @rachittoshniwal  2 years ago +1

      Thanks!
      Yes, we should ideally normalize the data if the features are on different scales.
      Scikit-learn will ignore the presence of missing values and scale the columns based on the non-missing values; the NaNs remain NaNs. You can then impute those values.
      There is no single correct answer as to whether to normalize or standardize. Trial and error: whichever works best on your data.
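
The recipe above (scale first, impute after, with NaNs passing through the scaler) can be wired as a pipeline. A minimal sketch, assuming scikit-learn; the toy array mimics one column in the millions and one around 10-20:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X = np.array([[500_000.0, 12.0],
              [np.nan,    15.0],
              [250_000.0, np.nan],
              [750_000.0, 18.0]])

pipe = Pipeline([
    # min/max are computed from the non-missing values; NaNs stay NaN
    ("scale", MinMaxScaler()),
    # distances are then computed on comparable [0, 1] columns
    ("impute", KNNImputer(n_neighbors=2)),
])
X_imputed = pipe.fit_transform(X)   # all NaNs filled, both columns on [0, 1]
```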

  • @ajeethmajhi1358
    @ajeethmajhi1358 3 years ago +1

    Very nice explanation, bro. Is there any way/method to select the number of neighbours while imputing values?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Thanks Ajeeth. Glad it helped!
      Well, you could try a grid search or even give the elbow method a shot
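
The grid-search suggestion can be sketched by tuning `n_neighbors` against downstream model performance. The synthetic regression data and `Ridge` model below are stand-ins, assuming scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data: 200 samples, 5 features, with ~10% of entries knocked out.
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Tune the imputer's n_neighbors jointly with the downstream model via CV.
pipe = Pipeline([("impute", KNNImputer()), ("model", Ridge())])
search = GridSearchCV(pipe, {"impute__n_neighbors": [2, 3, 5, 7]}, cv=3)
search.fit(X, y)
best_k = search.best_params_["impute__n_neighbors"]
```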

  • @tugce2326
    @tugce2326 2 years ago

    Hi Rachit,
    Very nicely explained! That's why I want to ask you something. I have 440 data points belonging to 9 precipitation observation stations (data matrix: 440×9). There are missing values at each station. None of the 9 precipitation series shows a normal distribution. However, the missing values in the 9 precipitation series are completely random. My questions are: 1) Can I use the k-NN/Random Forest/MICE methods even though the 9 precipitation series are not normally distributed? 2) Are there any prerequisites/conditions for using these methods? 3) Could I use these methods if my data were not MCAR?

    • @rachittoshniwal
      @rachittoshniwal  2 years ago

      Hi! Following a distribution is not a prerequisite for imputation. However, if the data is skewed, it is better to go for median imputation than mean, because the median is a better estimate of the center in that case.
      MICE works better when the data is MAR; if not, you might get suboptimal results. At the end of the day, though, it is mostly trial and error to find the best method. Hope it helps!

  • @shoaibahmed5848
    @shoaibahmed5848 10 months ago

    What about when both row 1 and row 4 have missing values? Is it necessary for those values to be filled?

  • @heteromodal
    @heteromodal 3 years ago +1

    Thank you again for a great tutorial! Can you give an outline to when this method would be preferable to MICE for example?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      First of all, thanks! I'm glad you liked it!
      Well, mice is helpful when the features are "correlated", and if you know that's the case, go ahead with it. Otherwise, look for other methods (like knn for example)
      But it's more of trial and error really. The Imputer that gives the best results is the best one!

    • @heteromodal
      @heteromodal 3 years ago

      @@rachittoshniwal Thanks again! Really appreciate your videos and responses! :)

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@heteromodal thanks! My pleasure!

  • @ThePablo505
    @ThePablo505 2 years ago

    Thank you so much

  • @SumitKumar-sj5xw
    @SumitKumar-sj5xw 2 years ago

    very good explanation

  • @NitinMukeshIITB
    @NitinMukeshIITB 3 years ago

    Awesome explanations

  • @rizkiekaputri2122
    @rizkiekaputri2122 2 years ago

    Please add subtitles to this video. My final project is about this topic, and I really hope you put subtitles here so I can understand what you are explaining.

  • @TheElementFive
    @TheElementFive 2 years ago

    Suppose you want to apply this technique to a dataset where the outcome variable is discrete. Would it be logical to limit your set of neighbors to those belonging to the class associated with the row you are imputing (i.e., calculate the Euclidean distance between the row to impute and only the other rows for which y_current_row == y_neighbor_row)?

  • @noorbariahmohamad8759
    @noorbariahmohamad8759 2 years ago

    Prof, what if the NaNs happen at the same time? I mean, Friends, GOT, Suits, Breaking Bad, HIMYM all missing in row 2. Can we still impute using the kNN method?

  • @KartikRai-YrIDDCompSciEngg
    @KartikRai-YrIDDCompSciEngg 2 years ago

    What if (row 3, col 0) and (row 4, col 0) also had missing values, so the mean ((50+29)/2) would not be possible? How does the algorithm proceed then?

  • @venkateshwarlusonnathi4137
    @venkateshwarlusonnathi4137 3 years ago

    Shouldn't you normalize the values before doing KNN? Or is it because all of them are supposed to be in the same range of 0 to 100 that we don't need to here?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Yes, precisely. Since all are on the same scale, it isn't mandatory to perform normalization. But if the features are on different ranges, then you should.

  • @barathwajas6702
    @barathwajas6702 2 years ago

    Hi Rachit, quick question: how do you evaluate and tune the models, i.e., how do you tell whether the imputer predicted the correct or a nearby value?
    Thanks in advance.

    • @rachittoshniwal
      @rachittoshniwal  2 years ago

      We can only judge the goodness of the imputation by the model performance. If we get a good final model, it means the imputer was able to get close to the real values.

    • @barathwajas6702
      @barathwajas6702 2 years ago

      @@rachittoshniwal Correct, but in your example case, was any tuning done? If so, can you share that insight? TIA.

  • @RS-fe1hk
    @RS-fe1hk 3 years ago

    2 doubts:
    1) If we set k-neighbors = 2 and the NaN value is present in the 1st row instead of the 2nd row, which rows will be selected for calculating the Euclidean distance?
    2) For the 'weight' value ('total' / 'present coords'): what are the total and present-coord values if both values are NaN? Example: in your example, say instead of 85 there is a NaN value. Then what are the total and present-coord values while computing against the 1st row?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      If I understand your second question,
      Total will always be 4 in this case, because there are 4 other columns.
      For a combination to be considered as "present coord", both values must be present, so if 85 was a nan, then while comparing person 2 with 1, both will be nan for HIMYM column and hence won't be considered in "present coord"
      I didn't quite get your first question. By 1st row, you mean 1st person or 0th person?

    • @RS-fe1hk
      @RS-fe1hk 3 years ago

      @@rachittoshniwal That answers the 2nd question, thanks! And for the 1st question, 1st row means the 1st index, not the 0th index (i.e., Friends = 44).

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      @@RS-fe1hk so you mean to say for person 1, both Friends and HIMYM are nans?

    • @RS-fe1hk
      @RS-fe1hk 3 years ago

      @@rachittoshniwal Yeah... if we try to fill the NaN for person 1 and set k-neighbors to 2, how will it select rows? Because above person 1, only 1 row (person 0) is present, right? So in that case, which rows will get selected for imputation?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago +1

      @@RS-fe1hk It doesn't matter how many rows are above or below the row in which there is a missing value. It will scan through all rows in the dataset and find the top 2 neighbors
      I hope that solves your query. Let me know if it doesn't!
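
The "total / present coords" weighting discussed in this thread is what scikit-learn's `nan_euclidean_distances` implements; here is a worked pair of rows with made-up numbers:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

a = np.array([[3.0, np.nan, 5.0, 7.0]])
b = np.array([[1.0, 4.0, np.nan, 4.0]])

# Only columns 0 and 3 are present in BOTH rows: 2 "present coords" out of 4.
# distance = sqrt( (total / present) * sum of squared diffs over present coords )
#          = sqrt( (4 / 2) * ((3 - 1)**2 + (7 - 4)**2) ) = sqrt(2 * 13)
d = nan_euclidean_distances(a, b)[0, 0]   # ~5.099
```

A column where either row is NaN simply never counts toward "present coords", which is the case Rachit describes when 85 is replaced by a NaN.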

  • @mohitgoyal229
    @mohitgoyal229 3 years ago

    Rachit, can you recommend some books where we can find techniques like these in detail?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      I don't really have any good recommendations, but you can check the scikit-learn documentation of the algorithm you want; it usually refers to a research paper or a good reference on which the implementation is based.

  • @yv4000
    @yv4000 3 years ago

    Do we need to scale the features before imputation?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      If the features are on different scales, then yes.

    • @yv4000
      @yv4000 3 years ago

      @@rachittoshniwal How should we scale a feature with null values present? A min-max scaler won't work with null values present.

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      @@yv4000 It does work with missing data. It just ignores the presence of those missing values.

    • @yv4000
      @yv4000 3 years ago +1

      @@rachittoshniwal Thanks!

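The claim that min-max scaling works in the presence of missing values can be verified directly (a minimal sketch, assuming scikit-learn's `MinMaxScaler`; the three-value column is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [np.nan], [30.0]])
scaled = MinMaxScaler().fit_transform(X)
# min and max come from the observed values (10 and 30); the NaN is
# kept as NaN, ready for a later imputation step.
# scaled column: [0.0, nan, 1.0]
```
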
  • @anirudhgupta455
    @anirudhgupta455 3 years ago

    How would imputation happen if any of these variables were categorical?

    • @rachittoshniwal
      @rachittoshniwal  3 years ago

      Yeah, so the KNN imputer wouldn't work particularly well for categorical data.