Handle Missing Values: Imputation using R ("mice") Explained
- Published: 21 Mar 2020
- Data cleaning and missing-data handling are very important in any data analytics effort. In this video, we discuss substitution approaches and Multiple Imputation by Chained Equations (MICE) in R.
R installation steps:
Please install the R framework on your system. It is available for Linux, Windows, and Mac at the link below.
cran.utstat.utoronto.ca/
Also, after you install the R framework, install the IDE (Integrated Development Environment), i.e. RStudio Desktop, from the link below.
www.rstudio.com/products/rstu...
I am totally new to MICE imputation and searched for clues on the internet but failed. However, your video was PERFECT and now I totally understand how it works. Love the work you've done here 👍
You've done a super job explaining the content so well that I subscribed! Thanks for sharing!
Channels don't need to have a thousand subscribers, good content like this is sufficient. Thanks!
Thank you my friend. Subscribers are important for making more content like this.. 😆
Very good video with in-depth explanations!
Thank you....
This is great content. Thanks for patiently explaining.
Thank you so much, Sir, for a very great explanation. I have my project and was worried about imputing the missing values, and this has really helped me a lot. God bless you.
You're welcome...
That is so valuable. Thank you for creating this video!
Thanks
The information on the nhanes variables is readily available in the package section of RStudio. There it states that hyp = 1 is for 'no' and hyp = 2 is for 'yes'. Might have been better to convert them to 0 (for no) and 1 (for yes).
this was so well explained and to the point. thank you for your knowledge.
Thank you..
The best explanation of this I've ever come across..please make more R videos!
Thanks a lot.. more coming soon....
@@dataexplained7305 would you happen to know two different ways to compute the upper quartile of the variable BMI? Like what is the specific syntax?
@@raihankhan4374 Try this...
nhanes$bmi[which(is.na(nhanes$bmi))] = mean(nhanes$bmi, na.rm = T)
Uppr_Quartile_A = summary(nhanes$bmi)[5]
Uppr_Quartile_B = quantile(nhanes$bmi)[4]
Hope this helps !
Cheers.
Thanks! I'll give it a go! I tried a different way earlier; it worked but feels super cheap lol, check it out:
First Method:
step1
Looks great..
Cheers
Thank you, very well explained. Appreciate it
Great tutorial! Thank you!
Incredible brother!! Very well explained 👍
Thank you, vinit...
Did you compare the imputed dataset with the mean of the raw data, or with the pooled data distribution?
Thank you so much for your effort, love this explanation
You're welcome...
Thank you for this material!
You're welcome...
Ohhh, now I got it. Thank you!!
You're welcome...
Great explanation, thank you!
My pleasure...
22:14 saved the experiment for my paper! I still do not understand how to fit all the imputed datasets in one model. But at least I got something! Much love.
Thanks so much for the support.
You can choose the best dataset in a statistical sense, e.g. by regression or by comparing means across the imputed datasets. That gives you one dataset as a whole whose values replace the missing values. Or, if you want, with a few lines of code you can pick and choose values from all 5 datasets, whichever you think are closest to the missing values.
Check this out..
ruclips.net/video/_ymR-FFG44c/видео.html
Stay tuned... more coming soon !!
Thank you for the in-depth explanation of MICE. One question though: I understand the first step of MICE is a simple imputation, but what is the point of doing so if the MICE imputation command used the original dataset (input_dt2 = nhanes), in addition to checking it against its mean?
Sorry for the delay... I'm just making a copy of the nhanes dataset into another variable, as I don't want to alter the original dataset, in which case I'd need to reinstall the library.. just being lazy.. lol
The variable age is also categorical, with categories 1="20-39"; 2="40-59"; and 3="60-99"', which are already coded in the data.frame nhanes2, also in the mice package. This does not detract from your very helpful explanations! I am really concerned about imputation of mixed datasets, with both continuous and categorical variables. There are many journal papers in medicine where the authors say they used multivariate normal imputation, and I always wonder how they could possibly handle missing data in categorical variables using that method. The point is that they cannot, and they did not.
Thank you very much.
Even without speaking English, I understood!!!! Congratulations.
Thanks a lot, my friend.. Stay tuned !!
Thanks for the detailed explanation...
You are welcome...
The point of multiple imputation is to perform the analyses in each imputed dataset and then to pool the results. If you just want to work with one dataset, it would be better to use VIM or a similar package to perform single imputation.
Thanks for your comment. In mice you can also do that, like below:
1) mice() — this gives you a mids object.
2) with() — applies a statistical function to the mids object above; this gives a mira object.
3) pool(mira) — combines the results together.
Having said that, I still find this way simple and very effective... agreed that we can use VIM as well..
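Those three steps can be sketched as a minimal R example (assuming the nhanes data that ships with the mice package, as used in the video):

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)  # step 1: mids object
fit <- with(imp, lm(chl ~ bmi + age))                    # step 2: mira object (one model per imputed set)
summary(pool(fit))                                       # step 3: pooled estimates
```

The pooled summary combines the five fits via Rubin's rules, which is the statistically principled alternative to hand-picking one imputed set.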
Very well explained Senthil!!
I have a question though. When we do the MICE imputation, we get 5 datasets, and in this case you chose the one closest to the mean value of BMI. Here the number of NAs was very small, so you could look through the data and decide which dataset to use.
What happens in a real-life situation when the number of NAs to be replaced is high? How do we decide which dataset to use then? If the number of NAs is huge, manually going through the datasets and deciding which one to use would be a cumbersome task, right?
Thanks Heral.. you can get the distribution of each of these datasets as well, like below:
summary(my_imp$imp$bmi[, 1]) .. Let me know how it goes..
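A sketch of summarising all m imputed columns at once, assuming my_imp is the mids object from the video:

```r
library(mice)

my_imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# my_imp$imp$bmi is a data frame with one column per imputed dataset
sapply(my_imp$imp$bmi, summary)   # distribution of each imputed column
summary(nhanes$bmi)               # compare against the observed data
```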
Thank you very much!
Is it still ok to match the 5th choice with our original dataset's mean if we have non-normal data? Why?
Thanks man, this was very useful
You are welcome
great video. thank you very much.
You are welcome!
Thank you for this tutorial video. It was really helpful. Is there any specific rule for the number of datasets (in your case: 5) we choose?
Thanks. I selected based on the correlation and closeness of the data to the original set.. there are a couple of ways to do this. Some don't choose a dataset for substitution but just use them all for analysis…
This was great, thank you
You are welcome..
Amazing!
Is there a way to impute only certain variables with the mice command? I'm looking for a way to specifically include certain predictor and auxiliary variables in my imputation model, since I am only working with a subset of variables of a bigger dataset. Thanks in advance.
Thanks for your comment.
The answer for your question is YES.
Go to the 14:50 mark of this video. Whichever column you don't want to impute, you simply set its entry to "" in the method argument. For the others, you specify whatever statistical method you'd like to use.
Hope this helps
Yes, you skip variables by choosing method "" for the feature you don't want to be imputed (like the age feature in the video).
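A minimal sketch of skipping a column, using mice's make.method() helper to build the per-column method vector:

```r
library(mice)

meth <- make.method(nhanes)  # one method entry per column, sensible defaults
meth["age"] <- ""            # empty string: leave age out of the imputation
imp <- mice(nhanes, m = 5, method = meth, seed = 1, printFlag = FALSE)
```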
Sir, can we use pmm for nominal features which have 4 categories, or do we have to reduce the cardinality?
Yes, pmm would be a good choice.. make sure to have them converted to a factor before you apply pmm..
I really enjoyed your video! You selected the 5th column by comparing the original mean to the columns of imputed values, which is practical for a small dataset. In cases where there are hundreds or thousands of imputed values, what are the steps for calculating the mean of each column of imputed values and comparing those results to the original mean, in order to select the best column?
You can do summary(my_imp$imp$bmi[, 1]) and compare it to the summary of the source column, then summary(my_imp$imp$bmi[, 2]) and compare again, and so on.. or simply use the mean() function...
@@dataexplained7305 Excellent - Thank you !!!
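That comparison can be automated along these lines (an ad-hoc heuristic, not the pooled-analysis route other comments describe; assumes my_imp is the mids object from the video):

```r
library(mice)

my_imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

obs_mean  <- mean(nhanes$bmi, na.rm = TRUE)        # mean of the observed values
imp_means <- sapply(my_imp$imp$bmi, mean)          # mean of each imputed column
best      <- which.min(abs(imp_means - obs_mean))  # column closest to the observed mean
completed <- complete(my_imp, action = best)       # completed dataset for that pick
```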
Hey, thank you for the video! It was very informative!
Is it possible that the table (my_imp$imp$bmi) could show different results because of the random multiple iterations?
This can show you the specified number of imputed datasets. E.g., if you selected 5 like in the video, you will see five different imputed sets. You can check my_imp$chainMean, which shows the chained means computed across the imputation iterations you specify... let me know if this helps..
@@dataexplained7305 thank you for the prompt reply! If I’m not mistaken, the chained mean is identical!
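Chain means are only identical across runs when the seed is fixed; a quick sketch for inspecting them and checking convergence:

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, printFlag = FALSE)
imp$chainMean   # array: variable x iteration x imputation chain
plot(imp)       # trace plots of chain means and variances (convergence check)
```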
also, what method do you suggest for ordinal variables??
My apologies for the delay.. Check this published paper, when you get a chance..
www.researchgate.net/publication/326435546_Missing_Data_Imputation_for_Ordinal_Data
thx mate
nice explanation .. thank you
Thank you
Do you know from which author this multiple imputation is? I have to cite the method used in my article.,
I don’t know either.. sorry for delay
hey! thanks for the video, I am getting this error:
error in str2lang(x) : :1:5: unexpected Symbol
1: 5531Atrialappendectomy
^
I don't know what to do. I tried to delete that column, but then it says the same error for the header of the next column! Anyone know how to solve this?
Thanks for your comment. Looks like a simple syntax issue: a column name starting with a number (5531Atrialappendectomy) is not valid in an R formula, so try renaming your columns with make.names() first. Hope you have fixed it by now..
The missing values in my dataset are not marked as NA but left blank. Will there be an issue?
Yes — mice only treats NA as missing, so recode the blanks first: data$column[data$column == ""] <- NA
Is there a better way to choose which column is closest to your mean instead of eyeballing it? Like, what if you had over 100 rows?
Summarize your output set... that is the best way..
@@dataexplained7305 thanks. that worked .
I am getting this error: Error: $ operator is invalid for atomic vectors. I am using a panel dataset.
Any idea what I should do?
Hi,
Not sure if you have figured it out yet but i found something real quick for you..
stackoverflow.com/questions/23299684/r-error-in-xed-operator-is-invalid-for-atomic-vectors
Thanks for your video!! But I don't know if you could explain the other parameters of the mice package.. greetings
Anything specific that you are looking for ?
Yes, the "seed" parameter — is it important to set a specific number, and if so, what number should I use? Also, you said to pick the dataset that is closest to the mean, but is there another, more statistical method to choose the correct one? And also, thank you so much. Greetings.
Thanks for the Q. Highly appreciate it.
SEED: I suggest you tune your iterations (ideally 20 to 40) and also your "m" value. The seed parameter is not required by mice() as per the R documentation. Having said that, you can always leave seed as NA for random number generation, or fix a number yourself if that gives a better (and reproducible) imputation.
OUTPUT_PICK: The one I showed above is a mean-based pick. You can also regress (linear/logistic, for example) your variables for a closer pick. However, considering the length of explanation involved, that should be a separate video, I guess.
Cheers.
@@dataexplained7305 Thanks so much for the explanation; I hope you will bring us a new video with that fuller explanation. Greetings.
Stay tuned ! I will do one soon..
Hello! I have a dataset with 475 rows where the NAs are in categorical variables (factors with 4-13 levels); these variables have 2-3 missing values each. Can I use the mode (most frequent value) for imputation?
Also, there are 2 categorical variables (factors with 12 levels) where the percentages of missing values are 18% and 25%. I think these variables are important; how can I fill them in? Thank you!
Hi Jay, thanks for your comment. My quick comments below
First case: either the mode or a categorical predictor algorithm like logreg will not make a big difference. So feel free there..
Second: use polyreg for this, as it has >2 levels. It is likely the default method for these kinds of variables anyway.
@@dataexplained7305 Hello sir, thank you for your response. I have filled the missing categorical variables with their mode just now.
In my second question, my dataset (475 observations) has 11 variables, and 3 of those have a lot of missing data (17%, 25%, 27%). In using mice (polyreg), should I include all 11 variables in one go to fill all the missing data? Or should I include only the variables that I think have an effect on the missing values?
CODE: try
@@jaygomez2320 Not a problem. Impute only the columns that have missing values; leave the others as they are...
@@dataexplained7305 so that code will do?
@@jaygomez2320 yes. Thats right.
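A sketch of the polyreg advice above, assuming a hypothetical data frame df whose factor columns var_a and var_b carry the missing values (all names here are placeholders, not from the video):

```r
library(mice)

# hypothetical data frame with two multi-level categorical columns
df <- data.frame(
  var_a = factor(sample(c(letters[1:4], NA), 100, replace = TRUE)),
  var_b = factor(sample(c(LETTERS[1:5], NA), 100, replace = TRUE)),
  x     = rnorm(100)
)

meth <- make.method(df)                 # defaults per column type
meth[c("var_a", "var_b")] <- "polyreg"  # multinomial imputation for >2 levels
imp <- mice(df, m = 5, method = meth, seed = 1, printFlag = FALSE)
```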
Really very good content, but I have a doubt: here we took the summary of bmi and selected the 5th column because its values are near the mean. Why are we considering only the bmi summary, and not the chl summary, to select the nearest column? Thanks in advance.
Sorry for the delay.. if you still have this question: the baseline is that the NA values need to be replaced with the best possible values per our judgment. The target variable's distribution can be compared to that of the imputed datasets, and, like you said, each column can be imputed using a separate imputed dataset rather than one common dataset. Whatever you choose as the output variable for your analysis can be used to compare distributions with the final datasets. Hope this helps.
@@dataexplained7305 Hi, was wondering if you could help me with a similar query.
I have 20 continuous variables in my dataset which I need to impute missing data for.
In this case, how do I impute each column using separate datasets? Do I need to do these steps separately for 20 different datasets? And how would I then combine all those columns at the end into one complete dataset?
If you know the MATLAB code for MICE, please let me know.
Do you have your syntax for this uploaded anywhere that I could just review your entire syntax for this video and compare it to my code?
Can you check the video itself.. I don't have the code locally..
@@dataexplained7305 Yes, thank you. I just didn't know if you have it exported on a notepad file or something.
can you explain hmisc on R, or can we do imputation through mice when analysing categorical variables
Re: Hmisc.. I will upload a video.. but yes, you can impute categorical variables..
@@dataexplained7305 what we have to do if doing categorical variables, simply follow the steps or have to add another step
@@zoiyaehtisham818 Yea.. you will just need to apply a categorical algorithm to that feature/column.. if you watch the video, you'll see I applied logreg to the hyp column... hope this helps..
@@dataexplained7305 Thank you so much. I have data of 528 obs. of 165 variables, and all my data is binary (categorical variables); I imported it from SPSS into R. I have to do the analysis after the imputation and I have little time; I have been learning R for 3 months and have still failed to impute the data. I applied the algorithm to my data but this error came up:
Error in formula.character(object, env = baseenv()) :
invalid formula "scale#1 ~ 0+gender+residence+age+religion+mothertounge+cultralbackground+academicbackground+occupation+income+ty( and so on). Can you advise me what I should do? When I was doing imputation in SPSS, the MAXMODELPARAM error occurred, so I switched to R, and I am still unable. I'll be very thankful if you have any idea and would tell me.
I am getting this error after entering the mice function: Error in str2lang(x) : :1:27: unexpected symbol
1: Company ~ Year+Contingent liabilities
Sorry, missed this somehow.. that error usually means a non-syntactic column name — here likely the space in "Contingent liabilities" — so try renaming your columns with make.names() first. If not fixed, please send your code to dataandyou@gmail.com
Can anyone explain why he chose these methods in the mice function?
Depending on the variable type, there are a bunch of method options to choose from.
What if you have over 50,000 variables? When I try to use these functions, it's not possible to call out a specific variable.
How can we extract the pooled/imputed dataset from RStudio?
yes.. you can try..
pool(with(imp_function, lm(chl ~ bmi + age)))
(a single completed dataset can also be extracted with complete(imp_function, 1))
and to understand the pooled results..
www.rdocumentation.org/packages/mice/versions/2.8/topics/pool
Good Job.
Thanks buddy...
Up to what percentage of the data can be imputed? Any references?
Thanks for your comment. I suggest keeping the outcome variable at least 75% complete for better predictions. Other auxiliary variables can be a bit less complete, depending on whether they are continuous, categorical, ordinal, etc. Bottom line: if any variable is less than 60% complete, think about deletion instead, e.g. pairwise or listwise...
@@dataexplained7305 Thanks!
Hi. Can you please let me know how I can impute the missing values if I have 172 variables, and I need to impute in all of them?
It depends on the percentage of missing values.. if it's high, it might not make sense to impute, due to the loss of natural value in the data.
@@dataexplained7305 thanks.
but the data I have is only 4% missing, and the gaps are spread across different areas, not just the first variable.
Sounds possible.. 👍
Good explainer re: using mice. However, in addition to @annap9782's comment, I'd also flag that finding that your imputed data is distributed similarly to the observed data is not a meaningful test of imputation performance. If we had a (conditionally) ignorable missingness mechanism (i.e., MAR), it may even be expected that these distributions will differ, without this saying anything about the performance of a given imputation.
Why did you choose the fifth column, please?
If you look, the mean of the original set is closest to the distribution of the 5th output set.
Substituting the mean for NAs will have accumulated error...... I still cannot agree... say I have data 1, 10, 10000, 50, 20, NA, NA, NA, 3, etc.... please explain, how is that justified?
Thanks for your comment... you are correct. Mean/mode substitution will only work in certain cases, when the distribution is tight rather than loose like the one you specified.. that's why we get estimates using mice.. watch the second half of the video as well and let me know what you think..
Bro what about copula based imputation
Thanks for your comment. I will make a video on the CoImp() soon... stay tuned..
Plz bro
what is you have categorical variables
nvm, this handles categorical as well as numeric
Yes.. you just need to choose the appropriate functions to handle the categorical columns...
@@dataexplained7305 Yeah, I didn't realize that in the first part of the vid you weren't using mice, just a simple imputation. In the second part of the vid you use mice. Thanks for the vid.
How can you just pick one imputation? It must be pooled.
Nicely explained. However, you should not just pick one set after your multiple imputation. You run the analysis on all generated datasets and produce a pooled analysis.
Agreed. Something like this:
my_imp = mice(input_dt2, m=5, method=c("","pmm","logreg","pmm"), maxit=20) # create several sets of imputed data as seen in video
my_analysis_model = with(my_imp, model(...)) # fit your model of choice to each imputed dataset
pooled_results = pool(my_analysis_model) # pool the fits into one analysis
@@lesliezhen4256 Hi, I wonder if you could expand on that second line of code there for me? The one which fits the model to each set of imputed data. What do I put in the brackets of model( ) ?
@@paigecox347 The part of the 2nd line that reads model(...) is a placeholder for whatever is the actual model that you are fitting. For example, if you are fitting a mixed effects model, you would replace model(...) with something like lmer(outcome ~ 1 + variable1 * variable2 + (1|subject), data=name_of_dataset)
@@lesliezhen4256 Thanks Leslie. I think for my data I won't be able to do this, as it's only my dependent variables, sitting in a separate dataset, that need to be imputed and then averaged.
My independent variables are in a different format (so I can't build the model from the imputed dataset in the traditional way).
Thanks for the help though 👍
It's a great presentation, but it keeps saying "my_imp" is not found... how do I get past this step?
I put the error I faced below, thanks a lot.
summary(input_dta3$faminc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4.0 50.0 72.0 134.4 150.0 3000.0 494
> my_imp$imp$faminc
Error: object 'my_imp' not found
> my_imp=imp$faminc
Error: object 'imp' not found
> my_imp$imp$faminc
Error: object 'my_imp' not found...... this is the problem with me..
Thanks for your comment.
Two things here. 1) Make sure you have executed the line my_imp = mice(....)
2) If this still doesn't help, paste the code below with the sensitive details taken out.. just the code flow..
@@dataexplained7305 I also executed it before, like below:
my_imp=mice(input_dta3,m=5,method=c("","pmm","logreg","pmm"),maxit=20)
Error: Length of method differs from number of blocks
> summary(input_dta3$faminc)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4.0 50.0 72.0 134.4 150.0 3000.0 494
> my_imp$imp$faminc
Error: object 'my_imp' not found
> my_imp$imp$faminc
Error: object 'my_imp' not found
this is what I executed
@@idealtube281 Check that the length of the method argument matches the number of columns in your data. Still not working? Then email me the code at dataandyou@gmail.com and I will check..
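That "Length of method differs from number of blocks" error means the method vector has fewer entries than the data has columns; a sketch of the fix, using the variable names from this thread (input_dta3 is the commenter's data frame, not shown here):

```r
library(mice)

ncol(input_dta3)                  # the method vector needs one entry per column
meth <- make.method(input_dta3)   # start from mice's per-column defaults
meth["faminc"] <- "pmm"           # override only the columns you care about
my_imp <- mice(input_dta3, m = 5, method = meth, maxit = 20)
```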
the jeet imputatoor
22 mins that could be summed up in 5.
unnecessarily stretched
very basic .. this video could have been made in 5 mins
Probably...
Stop saying 'right' all the time. I literally had to mute the audio and use CC captions because of that.
Lol... will try to fix it next time. Stay tuned though..
Well explained. Thanks for this!
You are welcome..