
Dealing With Missing Data - Multiple Imputation

  • Published: 20 Aug 2024

Comments • 76

  • @user-hz2ef7cy7z (5 months ago, +3)

    Wow, you have a natural ability to make complicated concepts become simple.

  • @amiriqbal1871 (3 years ago, +2)

    I was struggling with the concept, but your video made it crystal clear to me, thanks

  • @ysc2652 (3 years ago, +3)

    This was so clear and easy to understand! Thank you!

  • @ThuHuongHaThi (3 years ago, +1)

    A thousand thanks, your explanation is very easy to understand, it's really helpful.

  • @robertcsalodi1207 (3 years ago, +1)

    This explanation is awesome! Congratulations!

  • @bhushantayade7984 (3 years ago, +2)

    Amazing sir. It's really helpful.

  • @zhaoqian58 (1 year ago)

    Thank you for producing this high-quality video.

  • @mehmetkaya4330 (4 years ago, +1)

    Very very clear. Very helpful. Thank you!

  • @alexslayerking (2 years ago)

    This is an outstanding explanation. Thank you so much for making this.

  • @sean_gruber (2 years ago)

    VERY clear explanation. Thank you!

  • @emicat7045 (4 years ago, +1)

    Thank you very much! Love your videos; they are always clearly explained.

  • @StarFlex21 (5 years ago)

    Thank you for the interesting and helpful series about missing data. Also, great video quality.

  • @newbie8051 (1 year ago)

    Great explanation! Thanks a lot.

  • @jessicalambert4019 (1 year ago)

    Thanks, very clear and useful!

  • @jimpauls7765 (1 year ago)

    Very informative! Thank you, good sir :)

  • @ayselceferzade8587 (2 years ago)

    clearly explained! thanks a lot!

  • @phumlanimbabela-thesocialc3285 (3 years ago)

    Thank you very much.

  • @alfinpradana (4 years ago, +4)

    Great explanation, and excellent at describing how multiple imputation works! But I have a question: how do I choose the final value for the imputation if there are 5 values? Should I just average the 5 values, or is there a better approach? Thank you.

    • @qwertyuiop-qy6hb (4 years ago)

      Yes. As explained in the video, you calculate the mean of the 5 values. You also calculate the standard deviation.

    • @tinghuachen7844 (3 years ago, +1)

      @@qwertyuiop-qy6hb I am confused: is the final chosen value the mean of the predicted values, or the mean of the sample means? Using the mean of the predicted values makes more sense to me, but why do we want the standard deviation of the sample means? What will it actually change in our decision?

    • @qwertyuiop-qy6hb (3 years ago, +2)

      @@tinghuachen7844 I am not a statistician, so I will tell you the way I understand it. Multiple imputation gives you an estimate of the missing data points of a specific variable in a data set. The estimate is based on the values of the same variable (one column) for the other individuals (rows) in the data set. Every time you perform the imputation, the resulting value depends on the selected individuals, which should be chosen randomly. So if you do the imputation, say, 5 times, you end up with five estimated values. The mean of these values gives the estimate of the missing data point. Calculating the standard deviation (from here on this is my own understanding) gives you an idea of how variable these estimates are; the same goes for the standard error. If the estimated values are spread out (not close in value), the SD will be high, and I'd be careful in my assumptions and interpretation of the final analysis. I have not done multiple imputation in my domain (medicine) and would be very careful using it, but it is certainly a great method to avoid discarding data and to use the whole sample.
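
    The pooling step discussed in this thread can be sketched in Python. The five values below are invented for illustration; they stand in for the five estimates of one missing fine produced by five imputation rounds:

    ```python
    import statistics

    # Five hypothetical estimates of one missing value, one per imputation round
    imputed_values = [12.4, 11.8, 13.1, 12.0, 12.7]

    # Pooled point estimate: the mean of the five imputed values
    pooled_mean = statistics.mean(imputed_values)

    # Between-imputation spread: how strongly the rounds disagree.
    # A large spread is the warning sign the reply above describes.
    spread = statistics.stdev(imputed_values)

    print(pooled_mean, spread)
    ```

    If the spread is small relative to the scale of the variable, the pooled mean is a defensible fill-in value; if it is large, the imputation model deserves scrutiny before use.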

  • @davidrussell3433 (3 years ago)

    Very helpful, thank you!

  • @brendali5803 (3 years ago)

    Great job!

  • @elkalaiibrahim365 (4 years ago, +1)

    Thanks for the clear explanation. One thing I'm struggling to understand is, when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing values in the raw dataset?

    • @vishnumohank1299 (2 years ago)

      Data point selection when sampling is almost always done randomly, in order to avoid bias. A similar sample-and-test approach is taken when you perform cross-validation. The same logic applies when choosing the right sample size.
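
    One such round — draw a random subset of complete cases, fit a line, predict the missing value — can be sketched as below. The data, variable names, and subset size of 50 are illustrative, not the video's actual dataset:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical complete cases: distance to the library (mi) and fine ($)
    distance = rng.uniform(0.5, 5.0, size=500)
    fine = 2.0 * distance + 1.0 + rng.normal(0.0, 0.5, size=500)

    # One imputation round: draw 50 complete cases at random, without replacement
    idx = rng.choice(distance.size, size=50, replace=False)
    slope, intercept = np.polyfit(distance[idx], fine[idx], deg=1)

    # Use the fitted line to predict the missing fine for a known distance
    predicted_fine = slope * 2.1 + intercept
    ```

    Repeating this with fresh random subsets yields the several candidate values that are later pooled.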

  • @tracykakyoalexis2155 (4 years ago, +6)

    This was my aaahaaa moment. Thank you!

  • @ericazombie793 (3 years ago)

    Very clear!

  • @user-ek5rc5kl7o (1 year ago)

    This is so clearly explained. Thank you very much for this concise and informative video! I have a question. I believe the purpose of step 2 - calculating the standard deviation - is to confirm that the mean is a reliable one. What if the standard deviation is too large? Does it imply that the imputation method is not a reliable one and should not be adopted? Thank you!

  • @lucavisconti1872 (2 years ago, +1)

    Thanks for the practical example. It's not clear to me which value, in the end, we should use to fill in the missing value with the multiple imputation method. Could you please clarify?

  • @cynical_dd (5 years ago)

    Good job! Great video!

  • @BelleandBoos (8 months ago)

    oh my gosh thank you so much!

  • @PedroRibeiro-zs5go (5 years ago)

    Thanks! That was a really nice explanation!!

  • @SirDerRosen (1 year ago)

    Thank you very much :)

  • @andreibarbulescu7812 (3 years ago)

    Isn't it actually even more complicated than that? Isn't it that for each regression, instead of imputing the missing fine value with the value predicted by the regression, we actually randomly sample from the distribution of fine values around that predicted value (the distribution of fine conditional on distance)? This adds even more of the uncertainty involved in our guess to the imputation process.
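
    This is a fair point: many implementations do add noise around the regression prediction rather than impute the bare fitted value. A minimal sketch of that idea, on made-up data (the normal-residual assumption here is illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up complete cases
    x = rng.uniform(0.5, 5.0, size=200)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)

    slope, intercept = np.polyfit(x, y, deg=1)
    residual_sd = np.std(y - (slope * x + intercept))

    # Deterministic imputation would use the point on the line;
    # stochastic imputation draws from the distribution around it
    x_missing = 2.1
    deterministic = slope * x_missing + intercept
    stochastic = deterministic + rng.normal(0.0, residual_sd)
    ```

    The added draw keeps the imputed values from sitting unrealistically on the regression line, which preserves the variance of the completed variable.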

  • @chancesofrain6480 (4 years ago, +2)

    What do we do with the 5 imputations that have been calculated? Which of them can finally be considered the imputed value if we just want to show this in a graph?

    • @bonflaneur3194 (3 years ago, +1)

      Yeah, I am missing the final information too. We have the values for the total final mean and the standard deviation of the means around it, but what are we supposed to do with these values? How do we decide which values to impute for the missing data?

  • @qwertyuiop-qy6hb (4 years ago)

    Great explanation, thanks.
    I have done many retrospective clinical research projects and I have never dealt with missing data. I always left these blank knowing that they will automatically be excluded from analysis.
    I believe leaving these missing data unfilled is better to avoid any chance of bias influenced by data of other patients in the study cohort.
    What do you think?
    Now looking at your clear video I am thinking about this approach as well for future projects.
    I am not a statistician and I've done all these while in training.

    • @Ampelman123 (2 years ago)

      You have to think about what happens when you just omit the rows with missing values: you raise the statistical weight of the remaining entries. Why are the values missing? When the data is missing for a reason, you introduce bias into your data set and reduce its variance, and unless you can prove that the data is missing completely at random, it's best practice to act as if it is missing not at random. With methods like this you try your best to preserve the variance of the data set. It's important to use a method that tries to model the most plausible value; otherwise you would reduce the variance.
      I recommend you look into the types of missing data: "Missing Completely at Random", "Missing at Random", and "Missing Not at Random".
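
    The distinction in this reply can be made concrete with a small simulation. The variables and probabilities below are invented for illustration:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    age = rng.uniform(20, 70, size=1000)      # always observed
    income = rng.normal(50, 10, size=1000)    # may go missing

    # MCAR: every value has the same 20% chance of being missing
    mcar_mask = rng.random(1000) < 0.20

    # MAR: missingness in income depends on the *observed* variable age
    mar_mask = rng.random(1000) < np.where(age > 50, 0.40, 0.05)
    ```

    Under MCAR, dropping incomplete rows just shrinks the sample; under MAR, it also skews the age mix of what remains, which is the bias the reply warns about.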

  • @diptikalyan (5 years ago)

    It would be great if you could share links to some of the papers or books that you refer to here.

  • @ajanasoufiane3903 (5 years ago, +1)

    Thanks a lot for this very clear video. Do you know if we can combine multiple imputation with variable selection (with lasso for example) for prediction purposes?

  • @MyMy-tv7fd (2 years ago)

    Very clear and easy to follow, thanks. But would we not get as good a result by taking one regression sample of 250 data items, as opposed to five sets of fifty and then taking the mean of the means?

  • @psychoriginal1670 (5 years ago)

    That was very helpful, thank you!

  • @Discogolf97 (5 months ago)

    I don't understand why it is more unbiased to run 5 OLS regressions with only 50 out of the 2000 rows.
    Why not just run a single OLS regression with all rows and use that as my predictor for the missing values?

  • @user-ey2np8ff8k (2 years ago)

    This is an amazing video. Thank you so much. Do we have to check the assumptions for linear regression for each model for each imputed variable?

  • @kyliestaraway2492 (4 years ago)

    Can you do regression imputation next? I really loved this vid

  • @alecvan7143 (4 years ago)

    Awesome!

  • @rorysamuels2829 (3 years ago)

    Thanks for the video! If the subsets are random, all the estimators are unbiased right? The aggregated estimator would just have lower variability.

  • @AnkushSharma-zv5hv (4 years ago)

    thank you so much

  • @keeszethof6272 (4 years ago)

    Thank you!

  • @bevansmith3210 (5 years ago, +1)

    Thank you very much. Quick question: which imputed values do you end up leaving in the dataset for further analysis? Say I want to impute values to be used later for a variety of machine learning applications. Surely I can't use multiple imputation every time I want to implement a new machine learning model and measure a metric?

    • @SolomonHatcher (3 years ago, +1)

      You would use the grand mean that you calculated at the end after investigating whether it is a valid metric.

  • @raterake (2 years ago, +1)

    Thanks for the great video! Question: suppose I have 5 different random samples with which I can get 5 regressions, and then \mu_1, ..., \mu_5, to find an aggregate mean \mu_A. Why not just pool those 5 data sets into one large data set and compute the grand mean \mu_B that way? Wouldn't my answer \mu_B be more precise (less variable) than just taking the average of the 5 means to get \mu_A?

    • @vishnumohank1299 (2 years ago)

      A few things to note:
      > The 5 random samples that you've taken from the existing data may have common elements, so straight-up combining them might increase bias towards the repeated data points.
      > Let's say you avoid repeated data points; then combining the 5 samples only creates one subset of the original dataset, and you would be better off just running a single regression imputation.
      > In my opinion, the whole point here is to have multiple versions of the estimated values, so that you may better understand how well the estimation fits the data. Usually, if the variance or spread of the final values is quite high, we might not want to go ahead with imputation, or we may want to use something other than regression to estimate the missing values.
      > Therefore, multiple imputation gives you a clearer and broader picture of what compromises you are making by imputing the missing values.
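
    One detail worth checking numerically: for equal-size, non-overlapping samples, the mean of the five means equals the pooled mean exactly, so the extra information multiple samples provide is the between-sample spread, not a different point estimate. A quick check on synthetic data:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(10.0, 2.0, size=250)
    samples = data.reshape(5, 50)        # five disjoint samples of 50

    mean_of_means = samples.mean(axis=1).mean()
    grand_mean = data.mean()
    between_sample_sd = samples.mean(axis=1).std(ddof=1)

    # mean_of_means and grand_mean agree (up to floating point);
    # between_sample_sd is what the multiple-sample approach adds
    ```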

    • @raterake (2 years ago)

      @@vishnumohank1299 thanks for the thoughtful reply. I think that I was missing the point of multiple imputation, but your last two points clear that up for me. Thanks again!

  • @user-sn1dl7vm6w (3 years ago)

    Sir, you said we need 5 to 10 models. How do we calculate the exact number needed? Thank you.

  • @judewells1 (2 years ago)

    It wasn't apparent to me why this estimator would be less biased than a single imputation. You mentioned that doing multiple regressions and then aggregating "washes away the noise", but each of your individual regressions would also be noisier than a single regression that uses the whole dataset, so how do I know that in aggregate they are less noisy than a single regression?

  • @deepakarumugam5866 (5 years ago)

    Here you know that the fine amount is a dependent variable that depends on the distance from the library. But what if you have a data set with missing values in a column that is actually an independent variable? How would you use multiple imputation in that case? Can you do something like using the distribution to find the values?

    • @TheOraware (5 years ago)

      Imputation is for treating missing data, no matter whether the variable is dependent or independent.

  • @hamishthecat666 (1 year ago)

    How does PMM identify nearby candidates when there are a mixture of numeric and categorical variables? Thanks :)
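
    For the purely numeric case, predictive mean matching (PMM) can be sketched as below. How mixed numeric/categorical predictors enter the matching depends on the implementation (they are typically absorbed into the fitted model's predictions); this toy version sidesteps that question by using a single numeric predictor and a single nearest donor:

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    x_obs = rng.uniform(0.5, 5.0, size=100)
    y_obs = 2.0 * x_obs + 1.0 + rng.normal(0.0, 0.5, size=100)

    slope, intercept = np.polyfit(x_obs, y_obs, deg=1)

    # PMM: predict for the missing row, then borrow the *observed* value of
    # the donor whose own predicted value is closest
    x_missing = 2.1
    pred_missing = slope * x_missing + intercept
    pred_obs = slope * x_obs + intercept
    donor = int(np.argmin(np.abs(pred_obs - pred_missing)))
    imputed = y_obs[donor]
    ```

    Because the fill-in value is always one actually observed in the data, PMM never produces impossible values, which is part of its appeal.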

  • @mtcloris (5 years ago)

    Thanks for the clear explanation. One thing I'm struggling to understand is, when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing values in the raw dataset?

    • @cynical_dd (5 years ago)

      I guess he randomly picks 50 data points; you might want to listen at 3:27.

    • @mtcloris (5 years ago)

      Thanks!

  • @leoncioblp (2 years ago, +1)

    Wouldn't this be problematic if your objective with the dataset is precisely to demonstrate whether there is a relationship (like a linear relationship) between those 2 variables? Filling in a missing value with a method that assumes the very same linear relationship you are trying to demonstrate would be begging the question, wouldn't it?

  • @claymarzobestgoofy (3 years ago)

    Can you actually compute a standard deviation here? Won't that just reduce the SD for each regression by adding a bunch of points that fit the regression perfectly?

  • @me3jab1 (5 years ago)

    thank you

  • @deojeetsarkar2006 (7 months ago)

    Nice!

  • @joefishq11 (4 years ago)

    Great explanation! But it also seems at odds with what I'm reading from other sources, which make it sound like the parameters of the model estimating the outcome are what get randomly drawn for each iteration, not the observations used to make the prediction.
    Is what I'm describing an alternative approach to the same thing, or am I misunderstanding the approach?

    • @lbryan250 (4 years ago)

      What do you mean by randomly selecting parameters? His choice of single imputation method is least squares regression, and the parameters (the a and b in your "ax + b" regression line) have a closed-form solution. If you use the same dataset, the least squares parameters have no variability in and of themselves. Maybe you can elaborate on what you mean?
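
    For reference, the closed form this reply mentions is a = cov(x, y) / var(x) and b = mean(y) - a * mean(x); on a fixed dataset these are fully determined. A quick check on made-up data:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0.0, 5.0, size=100)
    y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=100)

    # Closed-form least squares for the line y = a*x + b
    a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - a * x.mean()

    # Re-running on the same data always returns the same (a, b);
    # variability only enters through which rows are sampled
    ```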

  • @PS3HARDCOREZOCKER (2 months ago)

    Is this the PMM approach?

  • @f2gms647 (4 years ago, +1)

    I found it confusing! Especially when you move the paper up and down as you talk!

  • @ayakhaled5316 (4 years ago)

    Thank you very, very much!

  • @sultanmehmood5022 (3 years ago)

    The data for 1.7 & 2.1 mi is not, prima facie, true.

  • @Hari-888 (4 years ago)

    I just wish you were neater: you write everything on that one sheet of paper, you keep moving it, and it isn't clear what you are referring to when you point your finger at the paper, since you've written in every nook and corner of it.