Note that the covariance matrix shown at 6:10 should be
[0.724 0.687
0.687 1.046] for more accurate calculations.
How did you compute the covariance matrix from the green data points?
I love the way you explicitly explain every step of the calculations. It helps me, someone who is not a math expert, understand the concept with ease. Thanks.
Great!
That was a heck of a good explanation. Thanks very much👍
Thank you!
Very well explained! Even for a non-mathematician.
So clearly explained ! 😮
Thank you for it ❤️
Thank you!
This video helped a lot, thank you for this!
Excellent video. I'm currently studying up to be able to break up a single model into sub-models, and I'm trying to use the Mahalanobis distance.
Very comprehensive explanation. Thank you.
this was enlightening, thanks a lot
Thank you! Very clear explanation
Thank you sir for another great video!
Thank you!
I got this comment: "Are you sure the inverse of the covariance matrix is correct? This is what I get when I put it into symbolab. [4.1 -2.82 -2.82 2.95]."
This is because the covariance matrix shown in the video has been rounded. Here is the covariance matrix with more decimals:
          x          y
x 0.7241053  0.6869474
y 0.6869474  1.0462105
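To see how much the rounding matters, here is a minimal sketch that inverts both versions of the matrix. The helper name `invert_2x2` is made up for this illustration; the rounded matrix is the one quoted from R further down in the comments.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Covariance matrix rounded to two decimals (as typed into R by a commenter)
rounded = [[0.72, 0.69], [0.69, 1.00]]
# Covariance matrix with more decimals (from the comment above)
precise = [[0.7241053, 0.6869474], [0.6869474, 1.0462105]]

inv_rounded = invert_2x2(rounded)
inv_precise = invert_2x2(precise)

for name, inv in (("rounded", inv_rounded), ("precise", inv_precise)):
    print(name, [[round(v, 3) for v in row] for row in inv])
```

The rounded input reproduces the 4.10 / -2.83 / 2.95 values people report, while the more precise input gives an inverse close to 3.66 / -2.40 / 2.53. The determinant is small, so small changes in the input move the inverse noticeably.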
Got it! Thank you for clearing that up!
Very helpful, thanks 👍🏻
Thanks. It was really informative
Thank you!
Super clear. Thank you!!
Thank you!
Excellent
great video, thanks
Thank you!
Thank you sir
Loved it, brother!
How did you arrive at the critical value 13.82 at 9:50 in the video if the cutoff is 0.001? Could you kindly explain?
If you want a cutoff of 0.001, you should extract the corresponding value from a chi-square distribution, that is, the value that cuts off 0.001 of the upper tail. In this example, the area to the right of 13.82 in a chi-square distribution with 2 degrees of freedom is 0.001. Use software or a chi-square table to get this value. The cutoff 0.001 is an arbitrary, but common, value to use to detect outliers.
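For the special case of 2 degrees of freedom, the upper-tail area of the chi-square distribution has the closed form exp(-x/2), so the critical value can be checked without a table. This is only a sketch for 2 df; for other degrees of freedom you would need a statistics library or a table. The function names are made up for this example.

```python
import math


def chi2_upper_tail_2df(x):
    """Area to the right of x for a chi-square distribution with 2 df."""
    return math.exp(-x / 2)


def chi2_critical_2df(alpha):
    """Value with upper-tail area alpha (valid for 2 df only)."""
    return -2 * math.log(alpha)


critical = chi2_critical_2df(0.001)
print(round(critical, 2))                    # 13.82
print(round(chi2_upper_tail_2df(13.82), 4))  # 0.001
```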
Can you tell me how you built the ellipse?
Preferably in the program scilab
I answered a similar question below. Hope that helps.
Hi Andreas, could you explain why I should expect a chi-square distribution at 8:26? As always, a nice video :)
If you square values drawn from a standard normal distribution, the squared values follow a chi-square distribution with 1 df. So calculations that involve sums of squared values usually lead us to the chi-square distribution.
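A quick simulation can make this concrete. The sketch below squares standard normal draws and compares them with two known chi-square facts: a 1 df chi-square has mean 1 and 5% of its area above 3.841, and the sum of two independent squared draws (2 df) has about 0.1% of its area above 13.82. The seed and sample size are arbitrary choices.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
n = 200_000

# Squared standard normal values: chi-square with 1 df
squared = [random.gauss(0, 1) ** 2 for _ in range(n)]
mean_sq = sum(squared) / n
share_1df = sum(s > 3.841 for s in squared) / n

# Sum of two independent squared values: chi-square with 2 df
sums = [random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2 for _ in range(n)]
share_2df = sum(s > 13.82 for s in sums) / n

print("mean of squared values:", mean_sq)
print("share above 3.841 (1 df):", share_1df)
print("share above 13.82 (2 df):", share_2df)
```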
@@tilestats Thanks for the explanation
Is it the centroid that has to be computed, or the mean? I think they aren't always the same, right?
I would say the overall mean in the multivariate space. As you point out, a centroid might have different meanings in different fields.
Hi, can you elaborate more on generating the 95% error ellipse? Do we use a random number generator with a normal distribution to create it? Is there a simple example of generating random numbers with an intended distribution? I read a long time ago, in connection with Monte Carlo, that we can use Cholesky decomposition to create data from a correlation matrix. Curious to know the mechanics behind them.
You simply draw the ellipse based on the eigenvectors and eigenvalues of the covariance matrix. I used the package ellipse in R to draw the ellipse, but if you would like to know the details, I suggest this page:
www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/#google_vignette
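No random numbers are needed for this. As a rough sketch of the eigenvalue approach (not the code behind the R package), the function below uses the closed-form eigen decomposition of a symmetric 2x2 matrix. The name `error_ellipse`, the centre (3.1, 3.0) and the number of points are assumptions for illustration; 5.99 is the chi-square value with 2 df that leaves 5% in the upper tail.

```python
import math


def error_ellipse(cov, center, chi2_value, n_points=100):
    """Return points on the ellipse where squared MD equals chi2_value."""
    a, b = cov[0]
    c = cov[1][1]
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]]
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Angle of the eigenvector that belongs to the larger eigenvalue
    theta = 0.5 * math.atan2(2 * b, a - c)
    # Half-lengths of the major and minor axes
    r1 = math.sqrt(chi2_value * lam1)
    r2 = math.sqrt(chi2_value * lam2)
    points = []
    for k in range(n_points):
        t = 2 * math.pi * k / n_points
        u, v = r1 * math.cos(t), r2 * math.sin(t)
        # Rotate into the eigenvector directions and shift to the centre
        x = center[0] + u * math.cos(theta) - v * math.sin(theta)
        y = center[1] + u * math.sin(theta) + v * math.cos(theta)
        points.append((x, y))
    return points


cov = [[0.7241053, 0.6869474], [0.6869474, 1.0462105]]
center = (3.1, 3.0)             # assumed means, rounded
chi2_95 = -2 * math.log(0.05)   # about 5.99 for 2 df
points = error_ellipse(cov, center, chi2_95)
print(points[:3])
```

Every point returned has the same Mahalanobis distance from the centre, which is why the same construction with 13.82 gives the 0.001 outlier boundary.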
@@tilestats thx a lot.
I did not understand why the cutoff of 0.001 would not be appropriate in cases where we have many data points. Could you clear this up for me?
Because 0.1% of the data points are expected to fall outside the ellipse by chance alone. If you, for example, have 1 million data points, you should expect about 1000 of them to be outside the ellipse, right? It would then not be appropriate to define all of these as outliers.
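The arithmetic behind this can be sketched in a few lines. It assumes independent points that all come from the same bivariate normal distribution, so that each one falls outside the ellipse with probability equal to the cutoff; the function names are made up.

```python
def expected_outside(n_points, cutoff):
    """Expected number of points outside the ellipse by chance."""
    return n_points * cutoff


def prob_at_least_one_outside(n_points, cutoff):
    """Chance that at least one ordinary point lands outside the ellipse."""
    return 1 - (1 - cutoff) ** n_points


for n in (20, 1_000, 1_000_000):
    print(n, expected_outside(n, 0.001), prob_at_least_one_outside(n, 0.001))
```

With 20 points, a point outside the ellipse is unusual; with a million points, about a thousand such points are expected even when nothing is wrong.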
I wish I had seen this video during my clustering methods course. I had to drop it because I did not understand, for example, the meaning of centroids.
I have two vids on clustering if you would like to catch up:
ruclips.net/video/uWf__KIKzPQ/видео.html
ruclips.net/video/4E_DFMt60rc/видео.html
How do you get 6.45 as the MD for point 2? When I calculate it using the same method as for point 1, I get the same MD as for point 1.
Go to 6:32 and replace the vector [5 5] with [5 1] for data point 2. Try to do the math again and let me know if it works.
@@tilestats Yep, I have tried and I still did not get it. My workings: [1.9 -2] * matrix * [1.9 -2]. Eventually, I get sqrt(5.080360804). I took 5 - 3.1 = 1.9 and 1 - 3 = -2
@@cmindaaa If you multiply the row vector [1.9 -2.0] by the matrix, you should get the row vector [11.83 -9.56]. If you multiply this row vector by the column vector [1.9 -2.0], you should get the number 41.597. The square root of this number is about 6.45.
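The two multiplications can be written out step by step. In this sketch the inverse covariance matrix [3.7 -2.4; -2.4 2.5] is an assumption: it is what you get when the inverse of the more precise covariance matrix is rounded to one decimal, and it reproduces the intermediate numbers quoted in the reply. The function name `mahalanobis` is made up.

```python
import math


def mahalanobis(point, mean, inv_cov):
    """Mahalanobis distance of a 2D point, computed in two steps."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Step 1: row vector [dx dy] times the inverse covariance matrix
    r0 = dx * inv_cov[0][0] + dy * inv_cov[1][0]
    r1 = dx * inv_cov[0][1] + dy * inv_cov[1][1]
    # Step 2: resulting row vector times the column vector [dx dy]
    d2 = r0 * dx + r1 * dy
    return math.sqrt(d2)


mean = (3.1, 3.0)
inv_cov = [[3.7, -2.4], [-2.4, 2.5]]  # assumed rounded inverse

md_point2 = mahalanobis((5, 1), mean, inv_cov)
print(round(md_point2, 2))  # 6.45
```

The sign of -2 has to be kept in both steps. If it is dropped, the calculation becomes the one for point [5 5], which may explain a result near sqrt(5.08).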
@@tilestats omg i got it! thank you so much!!
Hi Tile, could you share these points in an Excel file?
I do not have the original data since it was randomly generated. However, the data below should work to reproduce the calculations:
x=[4.6, 4.4, 3.9, 3.9, 3.8, 3.5, 3.8, 3.4, 3.0, 2.7, 3.7, 3.0, 2.5, 2.2, 2.9, 2.5, 2.3, 2.1, 2.1, 1.5]
y=[4.6, 4.1, 4.5, 3.9, 3.5, 4.0, 3.3, 3.2, 3.7, 3.5, 2.1, 2.7, 3.1, 3.2, 2.3, 2.0, 2.3, 1.8, 1.4, 1.0]
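As a sketch of how the means and the covariance matrix follow from these points, the code below uses the sample covariance with n - 1 in the denominator. The helper names `mean` and `sample_cov` are made up for this example.

```python
x = [4.6, 4.4, 3.9, 3.9, 3.8, 3.5, 3.8, 3.4, 3.0, 2.7,
     3.7, 3.0, 2.5, 2.2, 2.9, 2.5, 2.3, 2.1, 2.1, 1.5]
y = [4.6, 4.1, 4.5, 3.9, 3.5, 4.0, 3.3, 3.2, 3.7, 3.5,
     2.1, 2.7, 3.1, 3.2, 2.3, 2.0, 2.3, 1.8, 1.4, 1.0]


def mean(values):
    return sum(values) / len(values)


def sample_cov(u, v):
    """Sample covariance of u and v (n - 1 in the denominator)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


cov = [[sample_cov(x, x), sample_cov(x, y)],
       [sample_cov(y, x), sample_cov(y, y)]]

print("means:", round(mean(x), 2), round(mean(y), 2))
for row in cov:
    print([round(v, 7) for v in row])
```

With these points the result agrees with the covariance matrix with more decimals given in the first comment, and the means come out at about 3.09 and 3.01.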
@@tilestats Thanks. What do you think about drawing the ellipse in the Excel file?
Thank you
Thank you!
Is it correct to calculate the error ellipse for the autoscaled data for PCA calculation?
Not sure I understand. It would be OK to calculate the error ellipse based on the scores in 2D (if that is what you mean).
@@tilestats, yes, I meant the 2D score plot. "It would be OK to calculate the error ellipse based on the scores in 2D" But why? The data were previously autoscaled, i.e. divided by the standard deviation. Is it correct to calculate the error ellipse for scores, since scores and autoscaled data are DIFFERENT in their own nature?
@@tilestats Sorry for bothering you, but you explain things transparently and simply. A rare phenomenon when it comes to statistics :)
Yes, since scaling does not affect the relative distances between the points. If you create an error ellipse from unscaled data and, for example, identify 2 points outside that ellipse, the same points will be outside the ellipse if you scale the data, provided, of course, that you calculate the ellipse on the scaled data. Try this on a simple data set; that will help you understand.
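One way to try this is sketched below. The six data points are made up for the illustration, and the function names are hypothetical. The code computes the Mahalanobis distance of every point to the mean, once on the raw data and once after autoscaling (subtract the mean, divide by the standard deviation), each time using the covariance matrix of the data at hand.

```python
import math


def mahalanobis_all(xs, ys):
    """Mahalanobis distance of each point to the mean of the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs) / (n - 1)
    syy = sum((b - my) ** 2 for b in ys) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for a, b in zip(xs, ys):
        dx, dy = a - mx, b - my
        # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]]
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


def autoscale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


# Made-up example data with very different scales
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
ys = [10.0, 14.0, 13.0, 18.0, 21.0, 12.0]

raw = mahalanobis_all(xs, ys)
scaled = mahalanobis_all(autoscale(xs), autoscale(ys))
for r, s in zip(raw, scaled):
    print(round(r, 6), round(s, 6))
```

The two columns agree, so any point outside the ellipse before scaling is also outside it afterwards.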
Hi sir; this is a really nice and clear explanation. However, there may be an incorrect covariance matrix inversion since when I compute the values in R, it gave me another result.
> X
     [,1] [,2]
[1,] 0.72 0.69
[2,] 0.69 1.00
> solve(X)
          [,1]      [,2]
[1,]  4.100041 -2.829028
[2,] -2.829028  2.952030
> X %*% solve(X)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
That is because I show rounded values in the covariance matrix. In the first comment below the video, I show the covariance matrix with more decimals.
Is it correct to calculate MD using the correlation matrix instead of the covariance matrix?
No, you will then not get the correct value, unless you have standardized data, in which case the covariance and correlation matrices are identical. Have a look at my video about this:
ruclips.net/video/2bcmklvrXTQ/видео.html
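The statement about standardized data can be checked numerically. In this sketch the data are made up and the helper names are hypothetical; it compares the correlation of the raw data with the covariance matrix of the standardized data.

```python
import math


def sample_cov(u, v):
    """Sample covariance of u and v (n - 1 in the denominator)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sample_cov(values, values))
    return [(v - m) / sd for v in values]


# Made-up example data
xs = [2.0, 4.0, 5.0, 7.0, 11.0]
ys = [30.0, 20.0, 45.0, 50.0, 80.0]

# Correlation coefficient of the raw data
r = sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Covariance matrix of the standardized data
zx, zy = standardize(xs), standardize(ys)
cov_z = [[sample_cov(zx, zx), sample_cov(zx, zy)],
         [sample_cov(zy, zx), sample_cov(zy, zy)]]

print("correlation of raw data:", r)
print("covariance of standardized data:", cov_z)
```

The diagonal of `cov_z` is 1 and the off-diagonal equals `r`, so for standardized data either matrix gives the same MD. For raw data the two matrices differ, and the MD has to be computed with the covariance matrix.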
The data are autoscaled. The numerical values of the elements of the correlation matrix and the covariance matrix are equal.
And one more question, if I may. If we want to find an outlier on a 2D score plot of principal components, should we use the covariance matrix of the SCORES?
Yes, but note that PC1 and PC2 are uncorrelated.
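Because the scores are uncorrelated, their covariance matrix is diagonal and the MD reduces to a sum of squared scores, each divided by its own variance. A minimal sketch, assuming mean-centered scores; the function name and the example numbers are made up.

```python
import math


def mahalanobis_uncorrelated(score, variances):
    """MD to the origin for mean-centered, uncorrelated scores."""
    return math.sqrt(sum(t * t / v for t, v in zip(score, variances)))


# Hypothetical example: PC1 variance 4.0, PC2 variance 1.0
md = mahalanobis_uncorrelated((2.0, -1.0), (4.0, 1.0))
print(md)
```

Here each squared score is one variance unit, so the squared MD is 1 + 1 = 2. The same chi-square cutoff with 2 df can then be applied to the squared MD.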