Euclidean distance and the Mahalanobis distance (and the error ellipse)

  • Published: 5 Jul 2024
  • See all my videos at www.tilestats.com/
    In this video, we will discuss the difference between the Euclidean distance and the Mahalanobis distance and how the Mahalanobis distance can be used to create an error ellipse and to identify outliers in the multivariate space.

Comments • 64

  • @tilestats
    @tilestats 2 years ago +6

    Note that the covariance matrix shown at 6:10 should be
    [0.724 0.687
     0.687 1.046]
    for more accurate calculations.

    • @azibatorbanigo4043
      @azibatorbanigo4043 2 years ago

      How did you compute the covariance matrix from the green data points?
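
A minimal sketch of how a sample covariance matrix can be computed from paired data points, in plain Python. The x and y values are the replacement data the channel owner posted in another reply in these comments (the original points were randomly generated and are no longer available). The function name `sample_covariance` is only an illustrative choice, and the n - 1 denominator is an assumption, made because it reproduces the values quoted above.

```python
# Replacement data posted by the channel owner in the comments
x = [4.6, 4.4, 3.9, 3.9, 3.8, 3.5, 3.8, 3.4, 3.0, 2.7,
     3.7, 3.0, 2.5, 2.2, 2.9, 2.5, 2.3, 2.1, 2.1, 1.5]
y = [4.6, 4.1, 4.5, 3.9, 3.5, 4.0, 3.3, 3.2, 3.7, 3.5,
     2.1, 2.7, 3.1, 3.2, 2.3, 2.0, 2.3, 1.8, 1.4, 1.0]


def sample_covariance(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


# Variances on the diagonal, covariance of x and y off the diagonal
cov = [
    [sample_covariance(x, x), sample_covariance(x, y)],
    [sample_covariance(y, x), sample_covariance(y, y)],
]

for row in cov:
    print([round(value, 3) for value in row])
```

With this data the entries round to 0.724, 0.687 and 1.046, which agrees with the corrected matrix in the comment above.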

  • @youngzproduction7498
    @youngzproduction7498 3 years ago +11

    I love the way you explicitly explain every step of the calculations. It helps me, as someone who is not a math expert, understand the concept with ease. Thanks.

  • @alecmunnur5918
    @alecmunnur5918 2 years ago +4

    That was a heck of a good explanation. Thanks very much 👍

  • @merythegirl
    @merythegirl 1 year ago +1

    This video helped a lot, thank you for this!

  • @startupeco2257
    @startupeco2257 6 months ago +1

    Very well explained! Even for a non-mathematician.

  • @forrestoakley4882
    @forrestoakley4882 1 year ago +1

    Thank you! Very clear explanation

  • @tabyonyt8091
    @tabyonyt8091 1 year ago +1

    this was enlightening, thanks a lot

  • @szymonk.7237
    @szymonk.7237 2 years ago +3

    So clearly explained! 😮
    Thank you for it ❤️

  • @lba7238
    @lba7238 1 year ago +1

    Excellent video. I'm currently studying how to break up a single model into submodels, and I'm trying to use the Mahalanobis distance.

  • @ricardpunsola
    @ricardpunsola 1 year ago +1

    Very helpful, thanks 👍🏻

  • @guidenote771
    @guidenote771 3 years ago +2

    Thank you sir for another great video!

  • @shivamsharma6255
    @shivamsharma6255 1 year ago +1

    Really enjoyed it, brother.

  • @tilestats
    @tilestats 3 years ago +6

    I got this comment: "Are you sure the inverse of the covariance matrix is correct? This is what I get when I put it into Symbolab: [4.1 -2.82 -2.82 2.95]."
    This is because the covariance matrix shown in the video has been rounded. Here is the covariance matrix with more decimals:
              x         y
    x 0.7241053 0.6869474
    y 0.6869474 1.0462105

    • @compsci91
      @compsci91 3 years ago +1

      Got it! Thank you for clearing that up!

  • @ya00278
    @ya00278 2 years ago +1

    Super clear. Thank you!!

  • @Nada-yc8uo
    @Nada-yc8uo 3 years ago +3

    Thank you sir

  • @amankushwaha8927
    @amankushwaha8927 2 years ago +1

    Thanks. It was really informative

  • @TM-vg4mx
    @TM-vg4mx 2 years ago +1

    great video, thanks

  • @tone5875
    @tone5875 2 years ago +1

    Hi, can you elaborate on generating the 95% error ellipse? Do we use a random number generator with a normal distribution to create it? Is there a simple example of generating random numbers with an intended distribution? I read a long time ago, in the context of Monte Carlo methods, that a Cholesky decomposition can be used to create data from a correlation matrix. Curious to know the mechanics behind this.

    • @tilestats
      @tilestats 2 years ago

      You simply draw the ellipse based on the eigenvectors and eigenvalues of the covariance matrix. I used the package ellipse in R to draw the ellipse, but if you would like to know the details, I suggest this page:
      www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/#google_vignette

    • @tone5875
      @tone5875 2 years ago

      @tilestats Thanks a lot.
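
The eigenvector and eigenvalue construction mentioned in the reply can be sketched in plain Python as below. This is not the code behind the video (which used the R package ellipse), only an illustration under stated assumptions: the covariance matrix is the more precise one the channel owner posted in these comments, the center (3.1, 3.0) is the approximate mean implied by the differences quoted in another thread, and `error_ellipse` is a hypothetical helper name. No random numbers are involved; the ellipse is drawn directly from the covariance matrix.

```python
import math

# Covariance matrix with more decimals, as posted by the channel owner
cov = [[0.7241053, 0.6869474],
       [0.6869474, 1.0462105]]
center = (3.1, 3.0)  # assumed approximate mean of the data


def error_ellipse(cov, center, coverage=0.95, n_points=100):
    """Return points on the error ellipse of a 2x2 covariance matrix."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    # Eigenvalues of a symmetric 2x2 matrix (closed form)
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Direction of the eigenvector that belongs to the larger eigenvalue
    theta = 0.5 * math.atan2(2 * b, a - c)
    # Chi-square quantile for 2 df, using P(X > s) = exp(-s / 2)
    s = -2 * math.log(1 - coverage)
    # Semi-axis lengths along the two eigenvectors
    r1, r2 = math.sqrt(s * lam1), math.sqrt(s * lam2)
    points = []
    for k in range(n_points):
        t = 2 * math.pi * k / n_points
        u, v = r1 * math.cos(t), r2 * math.sin(t)
        # Rotate into the eigenvector directions and shift to the center
        px = center[0] + u * math.cos(theta) - v * math.sin(theta)
        py = center[1] + u * math.sin(theta) + v * math.cos(theta)
        points.append((px, py))
    return points


ellipse = error_ellipse(cov, center)
print(ellipse[:3])
```

Every point returned this way has the same squared Mahalanobis distance from the center, equal to the chi-square quantile, which is why the ellipse can double as an outlier boundary.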

  • @Jonathan_wow
    @Jonathan_wow 3 years ago

    How did you get the critical value 13.82 at 9:50 in the video if the cutoff is 0.001? Could you kindly explain?

    • @tilestats
      @tilestats 3 years ago +2

      If you want a cutoff of 0.001, you should extract the corresponding value from a chi-square distribution, that is, the value that cuts off 0.001 of the upper tail. In this example, the area to the right of 13.82 in a chi-square distribution with 2 degrees of freedom is 0.001. Use software or a chi-square table to get this value. The cutoff 0.001 is an arbitrary, but common, choice for detecting outliers.
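
For the two-variable case discussed here, the critical value can be checked without a table, because the upper-tail area of a chi-square distribution with 2 degrees of freedom is exp(-x / 2). The sketch below relies on that closed form, so it only applies to 2 df; the function names are illustrative. For other degrees of freedom a table or software is needed (for example `scipy.stats.chi2.ppf`, if SciPy is available).

```python
import math


def chi2_upper_tail_2df(x):
    """Area to the right of x under a chi-square distribution with 2 df."""
    return math.exp(-x / 2)


def chi2_critical_2df(alpha):
    """Value whose upper-tail area equals alpha (2 df only)."""
    return -2 * math.log(alpha)


print(round(chi2_critical_2df(0.001), 2))   # 13.82, the outlier cutoff
print(round(chi2_critical_2df(0.05), 2))    # 5.99, used for a 95% ellipse
print(round(chi2_upper_tail_2df(13.82), 4)) # back to roughly 0.001
```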

  • @wagon19
    @wagon19 2 years ago +1

    Can you tell me how you built the ellipse?
    Preferably in Scilab.

    • @tilestats
      @tilestats 2 years ago

      I answered a similar question below. Hope that helps.

  • @jacksonchen8679
    @jacksonchen8679 2 years ago

    Thank you

  • @yd3130
    @yd3130 1 year ago

    Is it the centroid that has to be computed, or the mean? I think they aren't always the same, right?

    • @tilestats
      @tilestats 1 year ago +1

      I would say the overall mean in the multivariate space. As you point out, a centroid might have different meanings in different fields.

  • @Unaimend
    @Unaimend 10 months ago

    Hi Andreas, could you explain why I should expect a chi-square distribution at 8:26? As always, a nice video :)

    • @tilestats
      @tilestats 9 months ago +1

      If you square values drawn from a standard normal distribution, the squared values follow a chi-square distribution with 1 df. So calculations that involve squaring normally distributed quantities usually lead us to the chi-square distribution.

    • @Unaimend
      @Unaimend 9 months ago

      @tilestats Thanks for the explanation.
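
A small simulation can make this concrete. The sketch below adds the squares of two independent standard normal values, which gives a chi-square distribution with 2 df, and checks how often the sum exceeds 5.991, the approximate 95% quantile for 2 df. The seed and sample size are arbitrary choices for illustration.

```python
import random

random.seed(1)   # fixed seed so the run is repeatable
n = 100_000
cutoff = 5.991   # approximate 95% quantile of chi-square with 2 df

exceed = 0
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    # Sum of two squared standard normal values
    if z1 * z1 + z2 * z2 > cutoff:
        exceed += 1

fraction = exceed / n
print(fraction)  # expected to be close to 0.05
```

The squared Mahalanobis distance of bivariate normal data behaves the same way, which is why a chi-square quantile with 2 df is used for the ellipse and the outlier cutoff.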

  • @cmindaaa
    @cmindaaa 2 years ago +1

    How do you get 6.45 as the MD for point 2? When I calculate it using the same method as for point 1, I get the same MD as for point 1.

    • @tilestats
      @tilestats 2 years ago

      Go to 6:32 and replace the vector [5 5] with [5 1] for data point 2. Try to do the math again and let me know if it works.

    • @cmindaaa
      @cmindaaa 2 years ago

      @tilestats Yep, I have tried and I still do not get it. My workings: [1.9 -2] * matrix * [1.9 -2]. Eventually, I get sqrt(5.080360804). I took 5 - 3.1 = 1.9 and 1 - 3 = -2.

    • @tilestats
      @tilestats 2 years ago +1

      @cmindaaa If you multiply the row vector [1.9 -2.0] by the matrix, you should get the row vector [11.83 -9.56]. If you multiply this row vector by the column vector [1.9 -2.0], you should get the number 41.597. The square root of this number is about 6.45.

    • @cmindaaa
      @cmindaaa 2 years ago

      @tilestats OMG, I got it! Thank you so much!!
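
The same steps can be checked in plain Python. This sketch assumes the more precise covariance matrix that the channel owner posted in these comments and the differences from the mean quoted in the thread ([1.9 2.0] for point 1 and [1.9 -2.0] for point 2); the helper names are illustrative. Intermediate values may differ in the second decimal from those in the thread because of rounding.

```python
import math

# Covariance matrix with more decimals, as posted by the channel owner
cov = [[0.7241053, 0.6869474],
       [0.6869474, 1.0462105]]


def invert_2x2(m):
    """Inverse of a 2x2 matrix via the determinant."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mahalanobis(diff, cov_inv):
    """sqrt(diff * cov_inv * diff'), where diff = point - mean."""
    # Step 1: row vector times the inverse covariance matrix
    row = [diff[0] * cov_inv[0][0] + diff[1] * cov_inv[1][0],
           diff[0] * cov_inv[0][1] + diff[1] * cov_inv[1][1]]
    # Step 2: resulting row vector times the column vector, then the root
    return math.sqrt(row[0] * diff[0] + row[1] * diff[1])


cov_inv = invert_2x2(cov)
md_point1 = mahalanobis([1.9, 2.0], cov_inv)    # point [5 5]
md_point2 = mahalanobis([1.9, -2.0], cov_inv)   # point [5 1]
print(round(md_point1, 2), round(md_point2, 2)) # roughly 2.25 and 6.45
```

Note that the squared distance for point 1 comes out near 5.08, the value under the square root in the question, so losing the minus sign on -2 somewhere in the multiplication is one plausible way to end up with point 1's distance again.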

  • @lorenzotagliari6699
    @lorenzotagliari6699 1 month ago

    I did not understand why the cutoff of 0.001 would not be appropriate in cases where we have many data points. Could you clear this up for me?

    • @tilestats
      @tilestats 1 month ago +1

      Because 0.1% of the data points will fall outside the ellipse by chance alone. If you have, for example, 1 million data points, you should expect about 1000 of them to be outside the ellipse, right? It would then not be appropriate to define all of these as outliers.

  • @MrTOCSY
    @MrTOCSY 3 years ago

    Is it correct to calculate the error ellipse for the autoscaled data for PCA calculation?

    • @tilestats
      @tilestats 3 years ago +1

      Not sure I understand. It would be OK to calculate the error ellipse based on the scores in 2D (if that is what you mean).

    • @MrTOCSY
      @MrTOCSY 3 years ago

      @tilestats Yes, I meant the 2D score plot. "It would be OK to calculate the error ellipse based on the scores in 2D." But why? The data were previously autoscaled, i.e. divided by the standard deviation. Is it correct to calculate the error ellipse for the scores, since scores and autoscaled data are DIFFERENT in nature?

    • @MrTOCSY
      @MrTOCSY 3 years ago

      @tilestats Sorry to bother you, but you explain things clearly and simply. A rare phenomenon in statistics :)

    • @tilestats
      @tilestats 3 years ago +1

      Yes, since scaling does not affect the Mahalanobis distances of the points from the mean. If you create an error ellipse for unscaled data and, for example, identify 2 points outside that ellipse, the same points will be outside the ellipse if you scale the data, provided, of course, that you calculate the ellipse on the scaled data. Try this on a simple data set, which will help you understand.
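
The suggested experiment can be sketched as follows, using the replacement data the channel owner posted elsewhere in these comments. The code computes the Mahalanobis distance of every point from the mean, once on the raw data and once after autoscaling (centering and dividing by the standard deviation), and compares the two sets of distances. The helper names are illustrative, and the n - 1 denominator is an assumption.

```python
import math

x = [4.6, 4.4, 3.9, 3.9, 3.8, 3.5, 3.8, 3.4, 3.0, 2.7,
     3.7, 3.0, 2.5, 2.2, 2.9, 2.5, 2.3, 2.1, 2.1, 1.5]
y = [4.6, 4.1, 4.5, 3.9, 3.5, 4.0, 3.3, 3.2, 3.7, 3.5,
     2.1, 2.7, 3.1, 3.2, 2.3, 2.0, 2.3, 1.8, 1.4, 1.0]


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (n - 1 denominator)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def mahalanobis_distances(xs, ys):
    """Mahalanobis distance of every (x, y) point from the overall mean."""
    mx, my = mean(xs), mean(ys)
    sxx, sxy, syy = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)
    det = sxx * syy - sxy * sxy
    distances = []
    for xi, yi in zip(xs, ys):
        dx, dy = xi - mx, yi - my
        # diff * inverse(cov) * diff' written out for the 2x2 case
        distances.append(math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det))
    return distances


def autoscale(values):
    """Center to mean 0 and scale to standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(covariance(values, values))
    return [(v - m) / sd for v in values]


raw = mahalanobis_distances(x, y)
scaled = mahalanobis_distances(autoscale(x), autoscale(y))
largest_gap = max(abs(r - s) for r, s in zip(raw, scaled))
print(largest_gap)  # differs from zero only by floating-point rounding
```

Because the distances are unchanged, any point outside the ellipse for the raw data is also outside the ellipse computed on the scaled data.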

  • @MrTOCSY
    @MrTOCSY 3 years ago

    Is it correct to calculate the MD using the correlation matrix instead of the covariance matrix?

    • @tilestats
      @tilestats 3 years ago

      No, you will then not get the correct value unless you have standardized data, in which case the covariance and correlation matrices are identical. Have a look at my video about this:
      ruclips.net/video/2bcmklvrXTQ/видео.html

    • @MrTOCSY
      @MrTOCSY 3 years ago

      The data are autoscaled. Numerical values of elements of correlation matrix and covariance matrix are equal.

    • @MrTOCSY
      @MrTOCSY 3 years ago

      And one more question, if I may. If we want to find an outlier on a 2D score plot of principal components, should we use the covariance matrix of the SCORES?

    • @tilestats
      @tilestats 3 years ago

      Yes, but note that PC1 and PC2 are uncorrelated.
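
Because the PC scores are uncorrelated, their covariance matrix is diagonal, and the squared Mahalanobis distance reduces to a sum of squared standardized scores. A short worked form follows; the symbols are notation introduced here, not taken from the video: t_1 and t_2 are the PC1 and PC2 scores of one observation (scores of centered data have mean zero), and λ_1 and λ_2 are the variances of the PC1 and PC2 scores.

```latex
\[
\Sigma_{\text{scores}} =
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix},
\qquad
MD^2 =
\begin{pmatrix} t_1 & t_2 \end{pmatrix}
\Sigma_{\text{scores}}^{-1}
\begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
= \frac{t_1^2}{\lambda_1} + \frac{t_2^2}{\lambda_2}
\]
```

In this case the error ellipse is aligned with the PC1 and PC2 axes.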

  • @eyupyondem4818
    @eyupyondem4818 1 year ago

    Hi sir, this is a really nice and clear explanation. However, the inversion of the covariance matrix may be incorrect, since I get a different result when I compute it in R:
    > X
         [,1] [,2]
    [1,] 0.72 0.69
    [2,] 0.69 1.00
    > solve(X)
              [,1]      [,2]
    [1,]  4.100041 -2.829028
    [2,] -2.829028  2.952030
    > X %*% solve(X)
         [,1] [,2]
    [1,]    1    0
    [2,]    0    1

    • @tilestats
      @tilestats 1 year ago

      That is because I show rounded values in the covariance matrix. In the first comment below the video, I show the covariance matrix with more decimals.
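
To see how much the rounding matters, the two inverses can be compared side by side. This is a sketch in plain Python rather than R, using the matrix typed into R above and the more precise covariance matrix the channel owner posted in these comments; `invert_2x2` is an illustrative helper.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix via the determinant."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Matrix as entered in the R session above
rounded = [[0.72, 0.69],
           [0.69, 1.00]]
# Covariance matrix with more decimals, as posted by the channel owner
precise = [[0.7241053, 0.6869474],
           [0.6869474, 1.0462105]]

inv_rounded = invert_2x2(rounded)
inv_precise = invert_2x2(precise)

for name, inv in (("rounded", inv_rounded), ("precise", inv_precise)):
    print(name, [[round(v, 4) for v in row] for row in inv])
```

The determinant of this matrix is small (about 0.24 to 0.29), so small changes in the entries shift the inverse noticeably. In particular, the bottom-right entry entered as 1.00 is about 1.046 in the more precise matrix.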

  • @rambisneves2077
    @rambisneves2077 2 years ago +1

    Hi Tile, could you share these points in an Excel file?

    • @tilestats
      @tilestats 2 years ago +1

      I do not have the original data since that was randomly generated. However, the data below should work to reproduce the calculations:
      x=[4.6, 4.4, 3.9, 3.9, 3.8, 3.5, 3.8, 3.4, 3.0, 2.7, 3.7, 3.0, 2.5, 2.2, 2.9, 2.5, 2.3, 2.1, 2.1, 1.5]
      y=[4.6, 4.1, 4.5, 3.9, 3.5, 4.0, 3.3, 3.2, 3.7, 3.5, 2.1, 2.7, 3.1, 3.2, 2.3, 2.0, 2.3, 1.8, 1.4, 1.0]

    • @rambisneves2077
      @rambisneves2077 2 years ago

      @tilestats Thanks. What do you think about drawing the ellipse in Excel?

  • @juhoke
    @juhoke 2 years ago

    I wish I had seen this video during my clustering methods course. I had to drop it because I did not understand, for example, the meaning of centroids.

    • @tilestats
      @tilestats 2 years ago

      I have two vids on clustering if you like to catch up:
      ruclips.net/video/uWf__KIKzPQ/видео.html
      ruclips.net/video/4E_DFMt60rc/видео.html