(Code) KNN Imputer for imputing missing values | Machine Learning

Rachit Toshniwal

22K views

Published: 21 Aug 2024
#knn #imputer #python
In this tutorial, we'll be implementing KNN Imputer in Python, a technique for imputing missing values in a dataset by looking at neighboring values.
Most machine learning models can't work with missing data out of the box, so it is important to learn how to choose between different imputation techniques to get the best possible model for our use case.
KNN imputation works on the intuition that a missing value is best filled using the rows that are most similar to the row containing it. Mathematically, it finds the points (other rows in the dataset) that lie closest to that row in feature space and uses their values to compute the value to impute.
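As a minimal sketch of that idea (the toy array below is made up for illustration and is not the dataset used in the video), scikit-learn's `KNNImputer` can fill a gap with the average of the nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data, invented for this sketch: each row is a point, np.nan marks a gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [4.0, 8.0],
    [2.2, np.nan],   # second feature is missing here
])

# Use the 2 closest rows, with distance measured on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows are [2.0, 4.0] and [1.0, 2.0], so the gap is filled
# with the mean of 4.0 and 2.0.
print(X_filled[3])
```

With the default uniform weighting, the imputed value is a plain average of the neighbors' values for that column.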
I've uploaded all the relevant code and datasets used here (and all other tutorials for that matter) on my github page which is accessible here:
Link:
github.com/rac...
If you like my content, please do not forget to upvote this video and subscribe to my channel.
If you have any questions regarding any of the content here, please feel free to comment below and I'll be happy to assist you in whatever capacity possible.
Thank you!

Comments • 102

@faizikramulla1210 3 years ago ⁺³
this is the best video (of any kind) i have watched all week... will replay to get a better understanding, but even the first pass was informative and useful, thank you!
@rachittoshniwal 3 years ago
Hi Faiz, I'm glad it was helpful!
@ranjitgawande 2 years ago
Rachit, you nicely explain KNN imputation coding. Great
@edidiongesu4035 1 year ago
This Video came in clutch for me on a project. Had to subscribe. Thank you so much!!!
@ivanrazu 3 years ago ⁺²
Thank you for your video, this helps a lot!
I have a few questions:
1) It is not clear to me what values you are imputing. The original data has missing values only on categorical columns.
Then you introduce some nans in 'age' and 'hours-per-week'. Is that to create a training set?
2) Before you do the KNN, you create X_train, X_test,y_train, and y_test. Is that always necessary? What if your data is like in your example with the movies (but larger), will you still do this?
3) You are using income as your target column. But what happens if you don't have such a kind of binary column?
4) After you do the imputation, you lost labels on columns and rows. How do you recover that? More importantly, how do you display the original data but including the imputed data.
Thanks again for your videos!
@rachittoshniwal 3 years ago ⁺⁵
1. KNNImputer works only for numerical columns, and this data didn't have missing values in numerical columns. So I introduced them in age and hours per week cols
(side note: it can work on categorical columns as well, but we'd need to first convert them to numerical form by ordinal/ one hot encoding)
2. Technically we're "learning" the top N neighbors, right? and when we learn anything, it should be from the training set only.
3. No, I'm not using the target (income) column anywhere. You see, I'm fitting the KNNImputer in X_train[num], and X_train only has the features. Target col is in y_train/ y_test.
4. scikit-learn does lose the column names as its transformed output is a numpy array.
At 4:10 I make a list of all the numerical column names. We'd have to do a pd.DataFrame(output, columns=num_cols) to get a dataframe of the transformed num cols.
To display all the data with the imputations, we'd need to do a few pandas manipulations to merge the two dataframes (one made out of the num cols imputations, and other one made out of the remaining cols in the dataset)
or we could use a column transformer to do the merging work for us, and return us the whole output array with all the columns (without the col names, of course. We'd have to do that ourselves)
Hope I've cleared your doubts. If not, let me know!
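The workflow described in this reply could look roughly like the sketch below. The small DataFrame is a hypothetical stand-in for the income data, and `select_dtypes` is used here in place of the dtype check from the video; treat both as assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the income data: two numeric features,
# one categorical feature and a target column.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, np.nan, 29],
    "hours-per-week": [40, np.nan, 35, 60, 45, np.nan, 20, 40],
    "workclass": ["Private", "State-gov", "Private", "Private",
                  "Self-emp", "Private", "State-gov", "Private"],
    "income": [0, 0, 0, 1, 1, 0, 0, 0],
})

# The target is separated first, so the imputer never sees it.
X = df.drop(columns="income")
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric column names, kept so they can be restored after imputation.
num_cols = X_train.select_dtypes(include="number").columns.tolist()

# Learn the neighbors from the training rows only.
imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train[num_cols])

# transform() returns a NumPy array; writing it back into copies of the
# original frames keeps the column names, the row index and the other columns.
X_train_imp = X_train.copy()
X_test_imp = X_test.copy()
X_train_imp[num_cols] = imputer.transform(X_train[num_cols])
X_test_imp[num_cols] = imputer.transform(X_test[num_cols])

print(X_test_imp)
```

Writing the imputed array back into a copy is one way to avoid the manual merge; a `ColumnTransformer` would be another, at the cost of re-attaching the column names afterwards.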
@ivanrazu 3 years ago
Hi Rachit! Thank you so much for your response.
I am still a little confused about points 2 and 3.
What I was wondering is in the case where you have data like in your TV shows example (but lets suppose is way larger) where you have NaNs all over the place. What would you choose as X and y to pass to train_test_split()?
Thank you again!
@rachittoshniwal 3 years ago ⁺²
@ivanrazu hi Ivan!
X is the feature space, and y is the target vector.
In this income dataset example, the income column is the target vector. Hence that column became my "y" and other variables (known as predictors) became "X".
And by segregating the data into X and y, we made sure we're working with only the "X" data while imputing, and not looking at the "y" at all.
In the tv show example, I didn't include any target variable, just for convenience purposes. Every column there was X.
The target can be continuous or discrete, it doesn't matter because we ain't using it in our calculations anyway.
Is it clear now? Let me know if not!
@ivanrazu 3 years ago
Ok, I get it now.
Thank you so much for answering my questions, I really appreciate it!
@sanketpadwal7298 3 years ago ⁺¹
very crisp and clear
@rachittoshniwal 3 years ago
Glad you think so, Sanket!
@Sinister_Rewind 3 years ago
You explained it in such a great way 🤩🤩, thanks for this amazing video
@muthierry1 2 years ago
Great Video Sir . thank you very much for your clear & valuable explanation .
@rohitjagdale4648 3 years ago
Nice one ! Sorry I didn't understand what you said @8:15, about imputing done on training set even if we are doing it for X_test. Request you to elucidate more. Thanks.
@prashu25925 3 years ago ⁺²
Brilliant 👏
Thanks
@rachittoshniwal 3 years ago
Thanks Prashant! I'm glad it helped :)
@abrahammathew8698 3 years ago
Another awesome video...I got following queries after watching.
1. Since KNN is a distance-based algorithm, should we perform outlier treatment and standardization before doing the imputation?
2. You mentioned about indicator column that "Missingness:Quality of row being missing can be feature for us". Could you tell how it will be useful?
3. Is there any rule of thumb for when to use KNN vs MICE?
Thanks!
@rachittoshniwal 3 years ago
Hi Abraham,
1. Yes, it would be better to perform standardization before the imputation in the case of KNN. Outlier treatment depends on the business case, as in, how important are those outliers in the context.
2. For example in a survey there are two (among others) questions: "do you have a pet" and "what food brand you feed them". So if someone does not have a pet, they will skip the second question, which will show up as a nan, and we can estimate the reason behind the missingness as it being because of them not having a pet. (a stupid example, but I hope it helps)
3. Not to my knowledge, no. It's more a matter of what performs better with the dataset in question.
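Points 1 and 2 can be combined in a short pipeline. This is only a sketch on invented numbers: `StandardScaler` ignores NaNs when fitting and passes them through, and `add_indicator=True` appends one 0/1 "was missing" column per feature that had gaps during fit.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented training data: age and salary live on very different scales,
# so unscaled distances would be dominated by salary.
X_train = np.array([
    [25.0, 40000.0],
    [32.0, np.nan],
    [47.0, 90000.0],
    [np.nan, 62000.0],
    [51.0, 85000.0],
])

pipe = make_pipeline(
    StandardScaler(),                               # scale first, NaNs are kept
    KNNImputer(n_neighbors=2, add_indicator=True),  # then impute + flag gaps
)
out = pipe.fit_transform(X_train)

# Columns 0-1: scaled and imputed features; columns 2-3: missingness flags.
print(out.shape)
print(out[:, 2:])
```

Note that the imputed features stay in standardized units here; whether to map them back depends on what the downstream model expects.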
@Chinmay4luv 3 years ago ⁺²
Hi rachit your way of teaching is excellent. can you please make a coding video on MICE imputer.
@rachittoshniwal 3 years ago
Hi Chinmay!
Glad you found it useful !
I've thought about making a coding video on MICE, but scikit-learn's implementation is currently still experimental. Hence I've stayed away from doing it.
But I'll look into it.
Btw, if you don't know yet, I've made a video on the MICE algorithm
ruclips.net/video/WPiYOS3qK70/видео.html :)
@CreatingUtopia 3 years ago
@rachittoshniwal make a video on implementation too, please
@rachittoshniwal 3 years ago ⁺¹
hi Shubham,
Sure, I'll make one on MICE soon :)
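For reference, the experimental scikit-learn implementation mentioned in this thread is `IterativeImputer`, which is MICE-inspired rather than a full multiple-imputation procedure. A minimal sketch on invented data, assuming a scikit-learn version where the estimator still has to be enabled explicitly:

```python
import numpy as np
# This import switches on the experimental estimator; without it the
# IterativeImputer import below fails.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Invented data where the second column is roughly twice the first.
X_train = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each feature with gaps is modelled from the other features, round after round.
imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp.fit_transform(X_train)

print(X_filled)
```

Because the API is flagged as experimental, its behavior may change between releases.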
@vish183 3 years ago ⁺²
Brilliant one. Thank you. Quick question, the dataset after KNN transformation didn't have column names. Any tips on retaining the variable names?
@rachittoshniwal 3 years ago ⁺²
Thanks!
pd.DataFrame(output_array, columns=num_cols) where num_cols is the list of column names.
Hope it helps!
@vish183 3 years ago ⁺²
@rachittoshniwal You are a star
@rachittoshniwal 3 years ago
@vish183 ah, it's okay! ;)
@chiragsharma9430 2 years ago
Great video as always Rachit. Can you make videos on Target encoding, Mean Encoding, Weight of Evidence Encoding, Probability Encoding, etc. techniques, and how can we use them in a Pipeline?
That would be very helpful. Also, how can we use these techniques (mentioned ones) with Cross-validation?
@sriraj8392 2 years ago
thank u thank u ....❤ u ....well explained...
@vamseesworld1035 2 years ago
thanks for the video. i have converted categorical data into numerical data then applied knn imputer. some values were imputed with floats like 2.37, 1.89 etc, which doesn't make any sense. can you suggest the best algorithms to impute categorical data without float values?
@stoicsimulation2131 3 years ago
I have a question, I noticed that you did train test split, isn't that only necessary when you want to build a predictive model? Is it necessary to use train test split for imputation purposes? Thank you !
@rachittoshniwal 3 years ago ⁺¹
Hi there.
Well, we need to know how well the imputation is doing, right? And that's why we do the split - to check if the imputed values are any good when faced with new unseen data.
@rachittoshniwal 3 years ago ⁺¹
@stoicsimulation2131 no worries, and thanks btw!
By checking the accuracy/ rmse/ any other relevant metric for example. If the score is good, it means our guesses with the imputations are working, right?
@prernajha8318 3 years ago
Thanks for the video, I have a small doubt, here you have treated numerical columns, what about categorical ? Can we use cat in place of 'num' to treat the values?
@rachittoshniwal 3 years ago
No, categorical columns in their original string form cannot be treated using KNN; you'd have to convert them into numbers first. But if you convert a color column with red, blue, green into 1, 2, 3, you run the risk of imputing a missing value in this column with a float value like, say, 1.78, which doesn't make sense.
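The float problem is easy to reproduce. In this sketch the column names and the red=1, blue=2, green=3 encoding are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "x": [1.0, 1.1, 1.2, 5.0],
    # hypothetical encoding: red=1, blue=2, green=3; one value is missing
    "color_code": [1, 2, np.nan, 3],
})

imputed = KNNImputer(n_neighbors=2).fit_transform(df)

# The two nearest rows carry codes 2 and 1, so the gap becomes their mean,
# which is not a valid color code.
print(imputed[2, 1])  # 1.5
```

The average of two category codes is not itself a category, which is why mode imputation or a dedicated "unknown" label is usually a safer choice for such columns.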
@ajaykushwaha4233 3 years ago
Hi Rachit, there are lots of imputation techniques. But how do we know which one is the correct one to use?
@rachittoshniwal 3 years ago
There is no one correct or incorrect answer. Trial and error really.
@ajaykushwaha4233 3 years ago
@rachittoshniwal just a small doubt. I did imputation of missing values and compared the result between df and the X_train values. The distplot was almost the same, with a change in variance of approx. 5-10%; the same amount of outliers was present before and after. Am I on the right track?
@shahedabdulhadi8226 2 years ago
thanks for the video, now after imputing how to export the data after filling the missing values?
@nandinisharma4311 2 years ago
how do we impute missing data in categorical variables?
@bhushantayade7984 3 years ago
Plz make a video on hot deck and expectation-maximization imputation methods in python.
@MechiShaky 4 years ago ⁺¹
A very nice video, but why don't you impute before train_test_split?
@rachittoshniwal 4 years ago
Hi, I'm glad you liked it!
I don't impute it before train_test_split because of a concept called 'data leakage'. (machinelearningmastery.com/data-leakage-machine-learning/)
We can't see future data, so we need to find a way to simulate that future data, and we do that using train_test_split.
This ensures that we hold a set of data points which have never been seen before, so we can test our model against this data.
So this is data we shouldn't "peek" into while doing any kind of feature engineering/ data preprocessing.
Hence I didn't impute before train_test_split. Was I clear? Please let me know!
@MechiShaky 4 years ago ⁺¹
@rachittoshniwal Thanks bro got it
@rachittoshniwal 4 years ago
@MechiShaky happy to help!
@prernajha8318 3 years ago
One more thing- I had a total of 864 missing values across cat and num variables, but after imputing through knn the count of missing values is 0, even though I have not done any one hot encoding for the cat variables. Does it mean they're already treated?
@rachittoshniwal 3 years ago
Are you sure about the categorical columns getting filled? because I tried this simple script, and it is throwing an exception just as it should, as I am trying to impute a categorical column
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame([
    [1, 'C', 3],
    [5, np.nan, np.nan]], columns='col1 col2 col3'.split())
# names of the numerical columns only
num = [col for col in df if df[col].dtypes != 'O']
knn = KNNImputer()
# fitting on the whole frame, string column included, raises an exception
knn.fit(df)
@priyarafael 3 years ago
Hi! Thanks for your video.. they are always useful ! Can you explain to me why you dropped the income variable in the Train and test split? What does this variable represent when you are trying to fill in missing values with knn imputer. I am confused because I have a couple of columns with missing values but I am not sure which one should be dropped in this case.
@rachittoshniwal 3 years ago
Hi!
The income variable is the target, which we are trying to predict.
@priyarafael 3 years ago
@rachittoshniwal I see. Thanks for your prompt reply! The problem that I am currently facing is that I have more than one target variable in my dataframe that I want to impute, i.e. say I have one missing value in one column but some other missing values in other columns but may not necessarily be in the same row, so how do I impute that?
I face the same problem when I try to find the ideal number of neighbors for my dataframe. I have written below the code I am using to find the no. of neighbors with the least error. If you see the first line of the code, it also asks me for the target variable, but again I have multiple target variables. So I am unsure how to proceed..
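import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    # root mean squared error (helper not included in the snippet as posted)
    return np.sqrt(mean_squared_error(y_true, y_pred))

def optimize_k(scaled_data, target):
    errors = []
    for k in range(1, 20, 2):
        imputer = KNNImputer(n_neighbors=k)
        imputed = imputer.fit_transform(scaled_data)
        scaled_data_imputed = pd.DataFrame(imputed, columns=scaled_data.columns)
        # Split the dataset into training and testing subsets:
        X = scaled_data_imputed.drop(target, axis=1)
        y = scaled_data_imputed[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)
        # Fit the Random Forests model and predict on the test set:
        model = RandomForestRegressor()
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        error = rmse(y_test, preds)
        errors.append({'K': k, 'RMSE': error})

    return errors

k_errors = optimize_k(scaled_data=df, target='MEDV')
k_errors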
@priyarafael 3 years ago
Correct me if I am wrong, because I am not a professional coder but I need to use Python for a university project.
@rachittoshniwal 3 years ago
can't you use a grid search to find the optimum number of neighbors?
@priyarafael 3 years ago
I wasn't able to find the Grid search code for Knn imputer parameters on the internet.... I mostly came across examples of Grid search to find SVC hyperparameters. Would you by any chance have an idea what the code might look like in my case?
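One way such a grid search might look is sketched below. The synthetic data, the linear model and the step names ("scale", "impute", "model") are all assumptions made for illustration. The imputer fills every feature column with gaps at once, so the only "target" involved is the one the final model predicts.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with about 10% of the feature values knocked out.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Putting the imputer inside the pipeline means it is refit on the training
# part of every CV fold, so the validation part never leaks into it.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer()),
    ("model", LinearRegression()),
])

# "<step name>__<parameter>" addresses a parameter of one pipeline step.
grid = GridSearchCV(
    pipe,
    param_grid={"impute__n_neighbors": [1, 3, 5, 7, 9]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_)
print(-grid.best_score_)  # cross-validated RMSE of the best setting
```

The same pattern should work with any downstream estimator; only the `"model"` step changes.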
@pradeebhabenildus8891 3 years ago
Hi,
Can you help me with the below doubt?
Here in the video, we are taking numeric columns and imputing values. This is done in the statement
num = [col for col in X_train.columns if X_train[col].dtypes != 'O']
My question is how should we impute categorical values? I am using the wine dataset, where I need to impute values in the 'quality' column, which has the values Quality A, Quality B and null. If I leave out this column and carry on with the other numerical columns, the null values in this column do not get imputed. How do we handle such columns which hold categorical values and have dtype Object? Pls explain. TIA!
@rachittoshniwal 3 years ago
one way is to use strategy='most_frequent' in SimpleImputer, which is nothing but mode imputation
another way is to replace those nan values with an "unknown" label
but each has its own advantages/ disadvantages
you can also read about it here:
medium.com/analytics-vidhya/ways-to-handle-categorical-column-missing-data-its-implementations-15dc4a56893
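Both options from this reply can be sketched with `SimpleImputer`. The tiny 'quality' column below is invented to mirror the question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented example column; dtype=object keeps the strings and NaNs as-is.
df = pd.DataFrame(
    {"quality": ["Quality A", "Quality B", np.nan, "Quality A", np.nan]},
    dtype=object,
)

# Option 1: mode imputation, i.e. fill gaps with the most frequent label.
mode_imp = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imp.fit_transform(df[["quality"]])

# Option 2: treat "missing" as its own category.
const_imp = SimpleImputer(strategy="constant", fill_value="unknown")
filled_const = const_imp.fit_transform(df[["quality"]])

print(filled_mode.ravel())
print(filled_const.ravel())
```

Mode imputation keeps the set of labels unchanged but inflates the majority class, while the "unknown" label preserves the fact that a value was missing.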
@ajaykushwaha-je6mw 3 years ago
Can anyone help here for fixing position issue for a graph.
All graphs are coming in 1 column and in 3 rows and what I am looking for is single row and in three column .
ax=plt.subplots()
ax=sns.distplot(data['LotFrontage'],hist=False)
ax=sns.distplot(X_train['LotFrontage'],hist=False)
ax=plt.subplots()
ax=sns.distplot(data['MasVnrArea'],hist=False)
ax=sns.distplot(X_train['MasVnrArea'],hist=False)
ax=plt.subplots()
ax=sns.distplot(data['GarageYrBlt'],hist=False)
ax=sns.distplot(X_train['GarageYrBlt'],hist=False)
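A possible fix, sketched with plain matplotlib on random placeholder data (the real `data` and `X_train` frames are not available here, and step histograms are swapped in for the density curves): create the figure once with one row and three columns, then draw each column on its own axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder frames standing in for the original and the imputed data.
rng = np.random.default_rng(0)
cols = ["LotFrontage", "MasVnrArea", "GarageYrBlt"]
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=cols)
X_train = data.sample(frac=0.7, random_state=0)

# One call to subplots: 1 row, 3 columns. Calling plt.subplots() three
# times creates three separate figures, which is why they stack vertically.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))
for ax, col in zip(axes, cols):
    ax.hist(data[col].dropna(), bins=30, density=True,
            histtype="step", label="original")
    ax.hist(X_train[col].dropna(), bins=30, density=True,
            histtype="step", label="imputed")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
```

With seaborn, passing `ax=ax` to the plotting call (for example `sns.kdeplot(data[col], ax=ax)`) should place each curve on the intended axis in the same way.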
@mooncake4511 3 years ago
Hi I'm trying to use KNNImputer to impute numerical categorical columns but I end up getting a value in between the two values. Ex:- I have a boolean column with values 1 and 0 populated in the column, but after I run the KNNImputer to replace missing values I get 0.4, 0.93, 0.3 etc as the values which replaced the np.nan values. Is there something I'm missing ?
@rachittoshniwal 3 years ago
It isn't advisable to use KNN Imputer on discrete columns exactly because of the results you got.
For such columns, try using mode imputation or PMM.
@mooncake4511 3 years ago
@rachittoshniwal Thanks for the reply but what is PMM ? I am new to this field so I lack knowledge.
@rachittoshniwal 3 years ago ⁺¹
@mooncake4511 PMM is predictive mean matching
Read this article:
stefvanbuuren.name/fimd/sec-pmm.html
@mooncake4511 3 years ago
@rachittoshniwal thanks will check it out! Also any thoughts on this paper ? www.google.com/search?q=%22Automatic+Discovery+of+the+Statistical+Types+of+Variables+in+a+Dataset%22&oq=%22Automatic+Discovery+of+the+Statistical+Types+of+Variables+in+a+Dataset%22&aqs=chrome..69i57j0i30.8177j0j9&client=ms-android-xiaomi-rev1&sourceid=chrome-mobile&ie=UTF-8
@mooncake4511 3 years ago
@rachittoshniwal I ran into a problem while trying to automatically ascertain what statistical dtype the data is and how to automatically determine them, I found this paper by Dr Valera but it's beyond my comprehension, is there any roadmap on reading material you can suggest so that I can understand this paper completely ?
@ektatripathi3088 3 years ago
Rachit I have one dataset where sales column has 0 values..how can we impute values using knr..
@rachittoshniwal 3 years ago
Has 0 values as in?
@ektatripathi3088 3 years ago
@rachittoshniwal 0.0 entries in sales column..i find these entries in 1000 rows.So i think this is some data issue.So for that reason i want to impute using knr..
@rachittoshniwal 3 years ago
@ektatripathi3088 if you feel these are "wrong" readings, you can set them to null and then do the imputations. Does it help?
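A sketch of that suggestion on invented numbers ('price' is a hypothetical helper column; only the 'sales' zeros are treated as missing):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Invented data: a 0.0 in 'sales' is assumed to be a recording error.
df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 30.0, 31.0],
    "sales": [100.0, 0.0, 110.0, 300.0, 0.0],
})

# Step 1: mark the suspicious zeros as missing, in this column only.
df["sales"] = df["sales"].replace(0.0, np.nan)

# Step 2: impute as usual; neighbors are found via the remaining columns.
cols = ["price", "sales"]
df[cols] = KNNImputer(n_neighbors=2).fit_transform(df[cols])

print(df)
```

Replacing the zeros column by column keeps legitimate zeros in other columns untouched.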
@atiaspire 4 years ago
how to decide the best K, as I think it will impact the processing time of finding the nearest neighbor. Thank You
@rachittoshniwal 4 years ago ⁺¹
Indeed, it will significantly increase the computational time as the data size increases, because it is going to scan the entire data for finding the neighbors.
You can use a similar strategy for finding best K as while finding the best K in KNN for regression/ classification
machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/
@seshilrs 3 years ago
Hi, how do we split the unsupervised data to impute missing values?
@rachittoshniwal 3 years ago
I'm sorry I don't quite understand the question?
@seshilrs 3 years ago
@rachittoshniwal Hi, I have a dataset with some missing values with no class labels (unsupervised data). How do we split the data in this case?
@rachittoshniwal 3 years ago
@seshilrs if your data is in a dataframe format, you can just do train = df.sample(frac=0.7). This will put 70% of the data into the train dataframe.
train_labels = train.index
test_labels = [label for label in df.index if label not in train_labels]
this will make a list of all the datapoints which are not in train and store them in test_labels
now you can do:
test = df.loc[test_labels]
is this what you want?
@swetapadmadalai889 4 years ago
Plz make video on categorical data Imputation..
@rachittoshniwal 4 years ago
Hello,
I've already done a tutorial on categorical data imputation, the link of which is here: ruclips.net/video/g21swrBOcBs/видео.html
This tutorial on Simple Imputer contains handling of both categorical and numerical missing values.
Hope you like it :)
@deepikanadarajan3407 3 years ago
Unable to import impute from sklearn.... Using python version 3. scikit-learn is already installed
@rachittoshniwal 3 years ago
KNNImputer was added in sklearn v0.22
so check what version you are on. maybe that's the issue
do this:
import sklearn
print(sklearn.__version__)
@deepikanadarajan3407 3 years ago
Yes
@deepikanadarajan3407 3 years ago
Version 0.19
@rachittoshniwal 3 years ago
@deepikanadarajan3407 aha.
@deepikanadarajan3407 3 years ago ⁺¹
Thank you for clarification. I will try to upgrade the version.
@seshilrs 3 years ago
Hi, do we need to split the data for imputation?
@rachittoshniwal 3 years ago ⁺¹
yes, absolutely. All inferences must be based off training data only, and passed onto the test set as is.
@seshilrs 3 years ago
@rachittoshniwal thank you, how do we determine the value of K?
@rachittoshniwal 3 years ago ⁺¹
@seshilrs we can use grid search
@kouhtahiyodou9390 3 years ago
Is there any way to tune the n_neighbors parameter?
@rachittoshniwal 3 years ago
You can try Elbow method or a grid search
@kouhtahiyodou9390 3 years ago
@rachittoshniwal Ok Thanks
@humbleman8476 3 years ago
Hi Rachit, can you please make the video a little slower as you are going too fast and therefore, some of us couldn't really understand what is happening. Can you also help me with this: if I want to impute missing values in a categorical variable using classification models, how can we make train data having non-missing values and test data with null values?
@rachittoshniwal 3 years ago
Sure, thanks for the feedback!
I don't quite understand your question unfortunately
@humbleman8476 3 years ago
@rachittoshniwal I was saying, there was a technique of Imputing missing values to categorical columns by applying certain ML algorithms..for that you need to divide the data into train and test. In train, you will keep all the data which have no missing values while in test, you will keep all the data having missing values. Now you will run algorithms into Train data and then apply the algorithm onto test data in order to find out the missing values in columns. I hope you understand now. How will you divide data into train and test???
@rachittoshniwal 3 years ago
@humbleman8476 if you want the entire training dataset to be free of missing values, and all the rows with missing values to be in the test set, then you could do something like:
train_idx = X.dropna().index
test_idx = [ix for ix in X.index if ix not in train_idx]
and then you can do a loc operation to fetch the rows from these indices.
is this what you need?
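The same split can also be written with a boolean mask, which avoids building index lists by hand. A small sketch on an invented frame with a non-default index:

```python
import numpy as np
import pandas as pd

# Invented frame; the index deliberately does not start at 0.
X = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [5.0, 6.0, np.nan, 8.0]},
    index=[10, 11, 12, 13],
)

# True for every row that has at least one missing value.
has_nan = X.isna().any(axis=1)

train = X[~has_nan]   # complete rows
test = X[has_nan]     # rows with something to impute

print(train.index.tolist(), test.index.tolist())
```

Because the mask is aligned on the index, it works the same way whatever the row labels are.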
@humbleman8476 3 years ago
@rachittoshniwal Thanks Rachit...you're too kind to help me ... Tomorrow I'll look into this and apply these codes...will tell you for sure if it works..but for now.. You'll get a new follower..and I wish you will reach 100k soon
@rachittoshniwal 3 years ago ⁺¹
@humbleman8476 haha! thanks!
@geekyprogrammer4831 2 years ago
Say Assalamualaikum not Namaste.
