R demo | How to impute missing values with machine learning

  • Published: 21 Aug 2024

Comments • 55

  • @mustafa_sakalli
    @mustafa_sakalli 2 years ago +4

    You are a legend! I've spent hours trying to find a proper tutorial on missing data imputation. They were all about mice, and they applied it to 20-row, 5-column datasets :D Since my dataset is relatively big, the mice package was struggling to compute the missing values. But with the help of your small script I was able to get a result in approximately 45 minutes. Thank you again

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago

      Great to hear, Mustafa! I am glad it's useful not only for me :)

  • @45tanviirahmed82
    @45tanviirahmed82 3 months ago

    This video ends abruptly 🤣 I was so into it that I thought there was a problem.
    Great video! Your playlist on R is becoming an addiction

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  3 months ago

      Awesome! Really happy you like it. I think I did cut it at the end, because the ending turned out to be useless, which killed the retention. So, in more recent videos I try to provide value every second... it doesn't always work, but I think the videos have gotten a bit better since then :) Thank you so much for the feedback!

  • @haythemboucharif7750
    @haythemboucharif7750 2 years ago +1

    Mister, I am French, so I can tell you that we have problems with English, but let me say that you speak really smoothly and very, very well. Thank you very much

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago

      Thanks, Haythem! Glad you liked it. I can recommend the Deep Exploratory Analysis video. It's long but very useful. Cheers

  • @muhammedhadedy4570
    @muhammedhadedy4570 1 year ago

    You are a true legend. I enjoy every single video of your tutorials.

  • @auliakhoirunnisa9447
    @auliakhoirunnisa9447 1 month ago

    Thank you for your explanation. It really helps me a lot! Your voice is indeed calming and soothing 😃
    Will definitely subscribe, Sir!

  • @jameswhitaker4357
    @jameswhitaker4357 1 year ago

    I'm just a mere junior analyst, but I am enamored with cool statistical methods. I had a lot of questions that you answered. While I have a minor mistrust of algos and ML, I have a major intrigue in how accurate and precise imputations can be.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Glad it was helpful! The missRanger command can even tell you, for every variable, how good the imputation was via the OOB (out-of-bag) error rate. I don't remember anymore whether I talk about it in the video.
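
      A minimal sketch of retrieving those per-variable errors, assuming a data frame d (a placeholder name) and the {missRanger} package:

      library(missRanger)
      d_imputed <- missRanger(d, formula = . ~ ., returnOOB = TRUE, seed = 1)
      attr(d_imputed, "oob")  # named vector of out-of-bag errors per imputed variable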

  • @angelajulienne3122
    @angelajulienne3122 10 months ago

    AMAZIIIINGGGGG !! You're incredible thanks :D

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  10 months ago +1

      Thanks for your feedback! And thanks for watching!

  • @rayray0313
    @rayray0313 3 years ago

    Excellent stuff. Thanks for making this video.

  • @mkklindhardt
    @mkklindhardt 1 month ago

    Amazing 👏🏽 thank you

  • @robertasampong
    @robertasampong 2 years ago

    Absolutely excellent! Thank you!

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +1

      Glad you enjoyed it! Check out the later videos. You might like those too. Thanks for the feedback!

  • @yaoliao3517
    @yaoliao3517 2 years ago

    Really helpful to me. I saw your recommendation on Twitter.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago

      Glad it was helpful! And thanks for nice comments! They help :)

  • @TheBaudoing2007
    @TheBaudoing2007 1 year ago

    Thank you! This is great

  • @angezoclanclounon1751
    @angezoclanclounon1751 3 years ago

    Awesome video! Thanks a lot.

  • @edaemanet26
    @edaemanet26 2 years ago

    Thank you, sir, this is perfect!

  • @ntran04299
    @ntran04299 8 months ago

    Thank you for this great video. May I ask what assumptions should/must be met before using missRanger to impute data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  8 months ago +1

      Yes, you may :) but I am afraid they are just common sense. For example, I never impute the response variable, I don't impute when there are a lot (>20%) of missing values, and I always check the imputation results and accept them or not depending on the result. Like, when I impute categories and after imputation only one category was filled up while the others were not (in cases where something like 10% or more needed to be imputed), then I don't accept that. So, no formal assumptions, but your own sanity checks are important here. Hope that helps! Cheers
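
      A hedged sketch of such checks in base R (d and d_imputed are placeholder names for the data before and after imputation; `group` is a hypothetical categorical column):

      # share of missing values per variable; consider skipping anything above ~20%
      colMeans(is.na(d))

      # compare a category's distribution before vs. after imputation
      table(d$group, useNA = "ifany")
      table(d_imputed$group)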

    • @ntran04299
      @ntran04299 8 months ago

      @@yuzaR-Data-Science I see thank you sir!

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  8 months ago

      you are very welcome!

  • @syhusada1130
    @syhusada1130 2 years ago

    Thank you

  • @dle3528
    @dle3528 2 years ago

    This video is awesome. Congrats! Can I use this method before estimating ML models? Should I impute data before or after partitioning the data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +2

      Thanks a lot! Imputation before partitioning is for sure better, because you have more data for the model to learn from, so the imputation quality will be better. Cheers!
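
      A minimal sketch of that order of operations, with d as a placeholder data frame:

      library(missRanger)
      d_imputed <- missRanger(d, formula = . ~ ., seed = 1)  # impute on the full data first
      set.seed(1)
      train_idx <- sample(nrow(d_imputed), size = round(0.8 * nrow(d_imputed)))
      train <- d_imputed[train_idx, ]   # then partition
      test  <- d_imputed[-train_idx, ]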

    • @dle3528
      @dle3528 2 years ago

      @@yuzaR-Data-Science thank you so much ! 😃😃

  • @chacmool2581
    @chacmool2581 1 year ago +1

    Using 'ggplotly' to make a missing value heatmap interactive is too computationally expensive and slow for anything but very small datasets. Instead, you could try making an interactive heatmap directly using 'd3heatmap'. Much faster. Plus you can control the aesthetics of 'd3heatmap' to a greater degree than the 'vis_miss()' function.
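
    A rough sketch of that idea ({d3heatmap} has been archived from CRAN, so this assumes it is installed, e.g. from GitHub; d is a placeholder data frame):

    library(d3heatmap)
    m <- 1 * is.na(d)   # binary matrix: 1 = missing, 0 = observed
    d3heatmap(m, Rowv = FALSE, Colv = FALSE,   # no clustering, keep original order
              colors = c("grey90", "firebrick"))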

  • @chrisdietrich6400
    @chrisdietrich6400 1 year ago

    Thanks a lot! Super helpful video! I am just wondering at which point in the data management process it would be the best to apply the imputation - I have some categorical items that I use for multiple factor analysis, which I then use for multilevel modelling. I am currently applying the imputation after I created the factors - however my intuition says it might be wiser to impute as a very first step. Do you have an opinion on that? (or some literature in regard to this?)

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  1 year ago +1

      Glad it was helpful! Well, I am not familiar with any hard-core rules or rules of thumb. For me it depends on the imputation goals and common sense. I once did 3 imputations because the dataset needed lots of operations, so, in order not to lose a few points here and a few there, I did 3 rounds. Another thing is that the categories or factors are supposed to be recognised automatically. So, factorising before imputation makes sense to me. If you have 3 categories, 1, 2 and 3, and ask for imputation of such a "numeric" variable, you might get some odd continuous numbers. However, if you want exactly that - go for it.
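
      A small sketch of that point (d and group are placeholder names): declaring the variable as a factor before imputation keeps the imputed values within the existing levels:

      d$group <- factor(d$group)                    # code categories 1/2/3 as a factor
      d_imputed <- missRanger(d, formula = . ~ ., seed = 1)
      levels(d_imputed$group)                       # imputations stay within these levels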

  • @warmtaylor
    @warmtaylor 1 year ago

    Thank you very much for your informative, succinct video! Is the {missRanger} package considered the best package for multivariate imputation? Is {missRanger} better than the {mice} or {miceRanger} packages? How did you discover the {missRanger} package? I'm sorry if I asked too many questions; I'm new to data imputation and would like to select the best package to impute my data.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  11 months ago +1

      Sorry for the late reply. I was traveling a lot lately. I believe missRanger is the best, but it's just a belief, not a fact. I think so because it connects predictive mean matching and multiple imputation. It iterates till the OOB stops improving. I did not, however, directly compare the results or usability of the packages. Usability is also important, because there are tons of packages that don't run without some special setup. missRanger does, and does it quickly. Having said this, if you do compare the results of different imputations, I would be grateful to know how it went. Kind regards and thanks for your nice feedback!

    • @warmtaylor
      @warmtaylor 11 months ago

      @@yuzaR-Data-Science No worries. Thank you very much for your answer.😀I think I will probably stick with {missRanger} package for now due to the fact that it is easy to implement and the great features you have discussed.😄// I was wondering if you could provide me with quantitative method(s) which could be used to assess the accuracy of the imputation rather than visualisation?// In your demonstration of using {missRanger} package (5:28), I think that it is essential to include the argument `pmm.k` (e.g. pmm.k = 3) to conserve data structure/format. This is because, when I first ran the code without the argument `pmm.k`, it gave different rounding to my values. I have checked in the package's vignette, and it is confirmed that setting the `pmm.k` argument to a positive number is needed so that all imputations done during the process can be combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values.🤔 Best regards, Poss.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  11 months ago +1

      Oh, cool, thanks for pointing out the "pmm.k" option! I actually often forgot it. What I usually use to follow up on predictions is "num.trees = 10000, verbose = 2, seed = 1, returnOOB = T", which displays the out-of-bag error for each variable at each iteration. After some iterations, when the OOB stops improving, it stops imputing and you have a final dataset. I try not to accept any OOB above 10% ... but yeah, it depends on the situation. I also usually plot the data, just to see whether some very strange values were predicted ... it was never the case till now. Of course, the more data you have, the better the predictions. Cheers ;)
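
      Putting the options from this thread together in one hedged call (d is a placeholder data frame):

      d_imputed <- missRanger(d, formula = . ~ .,
                              pmm.k = 3,          # predictive mean matching keeps imputed values "natural"
                              num.trees = 10000, verbose = 2,
                              seed = 1, returnOOB = TRUE)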

    • @warmtaylor
      @warmtaylor 11 months ago

      @@yuzaR-Data-Science Oh, I see. Because my data contains over 3,000 rows and 30 variables, I reduced "num.trees" to 100 to shorten the processing time. Consequently, this led to different rounding of the imputed values, so I added the argument "pmm.k" to retain their data structure. Thank you very much for your clarification! :)

  • @vyshnavisanagapalli4314
    @vyshnavisanagapalli4314 5 months ago

    Hi sir, I have gone through this video but I'm not able to get the plot_na_pareto function in RStudio. It's throwing an error: "builtin function not found". Can you help me overcome this issue?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  5 months ago

      Hi, it works perfectly on my PC. Have you installed and loaded the {dlookr} library?
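
      For reference, plot_na_pareto() comes from {dlookr}; a minimal example using the built-in airquality dataset:

      library(dlookr)
      plot_na_pareto(airquality)   # Pareto chart of missing values per variable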

    • @vyshnavisanagapalli4314
      @vyshnavisanagapalli4314 5 months ago

      Thank you for replying, Sir. I am getting the plot now; there was actually some problem with my RStudio, and I rectified it. And I must say your videos are so informative and easy to understand.

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  5 months ago

      that's so nice of you! huge thanks! I am happy my content is useful!

  • @syhusada1130
    @syhusada1130 2 years ago

    I've been coming back to this video. For a dataset with 165,040 rows, missRanger crashed my RStudio. I ended up using imputate_na with mode as the method, since I'm not sure what yvar I should use in the dataset. It produced an "imputation" class object, and I'm not sure what to do with it. Can I just insert the result into the dataset?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +2

      I would still recommend using missRanger. I have had the best experience with it. Are all 165k rows and all variables important? If not, reduce the dataset or split it into smaller sets. By the way, it never crashed my RStudio, it only took a little longer if the dataset was huge. Then ask yourself: do all variables (rows) contribute to a meaningful imputation? E.g. IDs or a too-diverse categorical variable don't, but they make missRanger think more for no return. And if some variables have too many missing values, like 30%, do you actually want them to be imputed?
      I suggest missRanger over "imputate_na" because you can track the OOB error (which is amazing) and because you create a new dataset, which you can immediately use if the OOB is low:
      d_imputed <- d %>%   # d is your dataset
        missRanger(formula = . ~ .,
                   num.trees = 1000, verbose = 2, seed = 999, returnOOB = T)

    • @syhusada1130
      @syhusada1130 2 years ago

      @@yuzaR-Data-Science thank you for the extra tips, amazing channel by the way, love it!

  • @achual1909
    @achual1909 2 years ago

    Can I use this for time-series data?

    • @yuzaR-Data-Science
      @yuzaR-Data-Science  2 years ago +1

      If you mean the date format itself (day/month/year : sec/min/hour), I don't think so. But if your timepoints are columns, and you just have some things measured and sometimes missing, then for sure.
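
      A toy sketch of that second case, with measurements spread over timepoint columns (all names and values are made up):

      library(missRanger)
      set.seed(1)
      d_wide <- data.frame(t1 = rnorm(50), t2 = rnorm(50), t3 = rnorm(50))
      d_wide$t2[sample(50, 8)] <- NA            # some measurements missing at timepoint 2
      d_imputed <- missRanger(d_wide, formula = . ~ ., seed = 1)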