I am now convinced that there are no tough subjects, only ineffective tutors. I have been struggling to understand this concept for over 3 years, and here I am, within 11 minutes things have fallen into place.
An expert is not necessarily a great teacher. There may be great experts assigned to teach such concepts in educational institutions.
But someone like you is what we need in our schools and colleges (expert and well-articulated).
Simplicity is the ultimate form of sophistication.
Thanking you from the bottom of my heart.
Keep on helping people like us.
Perhaps another video on how to do it in R would be a great hit.
Thank you so much for your kind words, I'm really flattered! I'm glad it was useful and it cleared up concepts for you :) Great idea about an R tutorial, will definitely add it to my to-do list!
100% CORRECT
Hi there. Yes, please add a video where you show us how to do this analysis on R. Thank you :)
You are for sure principal component #1! You're the best at describing information ;)
I have suffered with PCA for two years, and this video just made it so easy for me.
This video just solved half of my problem in understanding PCA stats. The other half is translating the info to my actual research.
Was going insane looking for an understandable explanation of "what" a PCA is, until I found this video! Thank you very much!
Thank you for your kind words!! Glad it helped:)
This is the best tutorial I have ever seen. Thank you so much.
A big thank you!! I have this topic in my semester exams, and everyone around me is mugging up; when I ask what it actually is, they give me definitions that don't satisfy me. This single video clears up a lot about PCA. I wish I had a teacher like you in my college.
I have been struggling to understand PCA for days 🤯, despite reading many articles and watching countless videos, but this is by far the best and easiest to understand explanation, thank you!! 🥳
I have watched so many videos trying to understand PCA, and this is by far the most interesting, with the fundamentals fully explained.
Very good teacher! Thanks!
A very clear and engaging introduction to PCA. It was new to me, and I came away with a good impression of how it would be used. Thanks very much!😀
It would be great to have PCA explained conceptually, mathematically, as well as programmatically. When push comes to shove, we'll need to do it on a computer, running an algorithm that we either write ourselves or call from a Python library.
Thank you for all the work you do educating us!
Great video. It explained PCA in a very simple and clear way. Thank you!
Wonderfully explained! Keep up the great work, Biostasquid.
This is the best video I found on PCA!
Can't thank you enough Biostatsquid!❤
You really understand what you're talking about, big up!
Wow! Best PCA video on YouTube.
super elegant and clear explanations, thank you!
Thank you, I'm happy you found it useful:)
Hi thank you so much for explaining PCA in such a clear way. I've been really stressed about understanding it for my uni stats exam, but now I feel much more confident :)
I'm currently watching without logging into my Google account. 😊 However, halfway through, I made the decision to log in, hit the like button, and subscribe to your channel. 🎉 Thank you for your valuable content-it's truly helpful, and I encourage you to keep up the great work! 👍
This was excellent. Some people just know how to explain things
Thanks for making an amazing video; it helped me understand things I had been researching for days.
Thanks!
Thank you for such informative, easy-to-understand content!
So simple and clear. U are awesome.
Loved it! It's a really comprehensive explanation!😍
Incredibly clear! Thank you, and congratulations!
Amazingly well explained
Perfect explanation for Principal Component Analysis
Although I don’t use PCA in my workday, I think this is the best video out there explaining how PCA works. Good job 👍
This was brilliant, thank you
Best explanation on YouTube, awesome.
Absolutely brilliant explanation.
the best explanation. easy to understand.
Amazing content, clearly explained! :)
Best video to understand PCA plot 😊
wow!!! that was explained so nicely by you..... thank you!
So well explained. Thanks a bunch!
Fantastic presentation.
Great I benefited a lot!
Lovely video, thank you for explaining!
Glad it was helpful! You're very welcome:)
I have a question:
Let's say I want to calculate the PCA of children's grades at school to know how it impacts the final grade average of each child. For that I have my sample, which would be all the children in a class, for example. Then I have the grades of those children in different subjects such as Math, History, Physical Education, etc. And I also have the average. Do I add the average as another variable to the PCA analysis? Or should I make a correlation, for example, between PC1 and the average and see the PCA loadings?
Just on point! Loved it!
Thanks for the video, this really explains it in a nutshell! This may not be up your alley, but what would PC2 be if you plotted geochemical data with each variable being an element (PC1 seems to be rock type)? Thanks in advance!
Thank you very much for your clear explanation.
Great example, this was exactly what I'm doing too!
i wish i could hug you, thank you so much
Simply excellent !
Good explanation, that's great!
I have one question: if I have 60 variables (A1-A60) with a 2k sample size,
A1 is the first and A60 the last; in between, A10, A20, A30, A40 and A50 are the confirmed outputs, but for some of the samples the A19 and A29 outputs don't exist, as A20 was reached earlier; the data is of this type for some reason.
Will PCA work in the same way as explained?
Very good high level video!
Thank you very much, very well explained
How do I obtain the loadings? Are they the same as the eigenvectors, or the scaled coordinates? In my geochemical software ioGAS, the PCA report contains these items: Correlation, Eigenvectors, Eigenvector Plots, Eigenvalues, Scree Plot, Scaled Coordinates, PC1 vs PC2, PC1 vs PC3, PC1 vs PC4, and so on (the last is PC3 vs PC4). My input was 32 chemical elements previously transformed with CLR.
Here is the ioGAS description for Scaled Coordinates:
"Created by scaling the length of the eigenvector to the eigenvalue. All eigenvectors have a length of 1 so scaling by the eigenvalue changes the lengths so that the length is proportional to the variance (eigenvalue) accounted for by that eigenvector.
Click on a PC header column to sort the scaled coordinates from lowest to highest or vice versa."
And for Eigenvectors:
"Eigenvectors are PCA coordinate values that correspond to the projected location of the original input variables onto the calculated PCA axes. PC1, or the first eigenvector, is a calculated line of best fit through the maximum direction of variation for the selected variables. The PC1 eigenvectors represent the value of each input put along this line. PC2 is a line of best fit through the maximum variation at right angles to PC1 so the PC2 eigenvalues are the original input variable values projected onto this axis, and so on for each of the number of principal components.
An eigenvector may be in either of two opposite directions. ioGAS will always choose the eigenvector whose first element is positive. Click on a PC header column to sort the eigenvectors from lowest to highest or vice versa."
Ahhh, I think the Loadings are equal to the Scaled Coordinates 😅
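For anyone comparing conventions: loadings are often taken as eigenvectors scaled by the square root of their eigenvalue, so that the loadings reconstruct the correlation matrix (note the ioGAS description quoted above says its Scaled Coordinates scale by the eigenvalue itself, so conventions differ between tools). A minimal NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]                     # make two variables correlated

Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardise
C = np.cov(Xs, rowvar=False)                 # ~correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # rank PCs by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# one common convention: loading = eigenvector * sqrt(eigenvalue),
# so that loadings @ loadings.T reconstructs the correlation matrix
loadings = eigvecs * np.sqrt(eigvals)
print(loadings.shape)                        # (4, 4): variables x PCs
```

Each row is one input variable, each column one PC; sorting a column from lowest to highest (as ioGAS does) ranks the variables' contributions to that PC.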
I'd love to have a tutorial on how to perform this on R. This was very well explained.
Great suggestion! I cover it a bit in the preprocessing video but maybe a specific video for PCA in R would be good - I'll keep it in mind! Thanks!
You are exceptional 🤩
Very well explained, thank you!
Woow.. this is so helpful
Well explained. Thank You.
Thank you. Well explained!
good work. thanks
Really nice, congratulations on your video! I follow you now :)
At ~3:58 you say the principal components explain 85% of the variance in life expectancy. I don't think that's right. I think it's 85% of the variance in the predictor variables. Or am I totally confused?
Amazing explanation
How does someone find the linear combination for PC1?
How do you tell which factors contribute to PC1 and PC2? With a biplot graph?
It's well explained for a beginner to understand the plot, but if you want to know how to do it, this video can't help you.
Please can you tell me how we can calculate the principal loadings? I am a bit confused by this part.
the best video i found regarding loadings, but you don't mention the scores
Thank you for your video. After you have assigned PC1 to PC5 ..., you show the PC matrix in order reflecting the amount of variation explained, where there are a variety of values listed under each PC from - 6 to +6. What do these values represent?
Hi! Thanks for your question! So the values are just an example, they don't necessarily go from -6 to +6. Basically, the values represent the 'contribution' of that variable to a specific PC. Since PCs are ranked by the variation of the dataset they explain (PC1 explains more than PC2, which in turn explains more than PC3...), variables with higher (more positive) or lower (more negative) scores for lower PCs (i.e., PC1) are 'more important', in other words, they explain more variability in the dataset. Hope this helped!
Thank you very much for your rapid reply and explanation. I thought that this was the case, but was not certain. As an extension of my question, do these + or - values under each PC align with a tick mark on the x:y and -x:-y axes? (for reference the axes you use to demonstrate these concepts around 5:10 to 5:30 minutes into your presentation). If "yes", and by way of feedback, having a scale on these axes would be helpful. I have watched 3 separate presentations on PCA today, and I have found yours the most useful. Thank you again, and in particular for responding to my question so quickly. Best wishes.
Hi thanks so much for your feedback! No, they're not! The tick marks represent increments of 1 (so 1, 2, 3, 4...) and I think my intention was to make them match the PC scores, but I must have changed the labels around to make it make sense with the biology and forgot to update the table. But they should match, so thanks for pointing that out! Will correct it if I ever do a part 2 on this:) Cheers @@brettlidbury4110
@@biostatsquid My pleasure and looking forward to the next installment (o:
Very well explained!
Nice video, thanks!
Very good explanation, ma'am.
Hi,
Good presentation on PCA. Can we apply PCA to a dataset that has both numeric and categorical data? Also, do we need to ensure that each variable follows a normal distribution, and if it doesn't, what should we do? And do we need to normalise each of the variables? Appreciate your comments.
Hi, great questions. PCA is not recommended for categorical data, even if you one-hot encode it. For mixed data types, there are better alternatives, like Factor Analysis of Mixed Data (FAMD()) in the FactoMineR R package; Multiple Factor Analysis (MFA()) is also an option. I haven't got experience with either, but you can check the thread here: stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont
Yes, it is necessary to standardise data before performing PCA, because PCA basically maximises the variance. So if you have some variables with a very large variance and some with little variance, it will give more importance to the variables with large variance. If you change the scale of one of your variables, e.g., the weight of mice, from kg to g, the variance increases, and the variable 'weight' will go from having little impact to being the main feature that explains variance in your dataset. Standardising will do the trick, since it makes the SD of all the variables the same (normalisation does not make all variables have the same variance). Hope this was clear!
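The scaling effect described in this reply can be sketched in a few lines of NumPy (toy data; the "grams vs kilograms" column is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=500)
X = np.empty((500, 3))
X[:, 0] = 1000 * (base + 0.2 * rng.normal(size=500))  # weight in grams: huge variance
X[:, 1] = base + 0.2 * rng.normal(size=500)           # correlated with column 0
X[:, 2] = rng.normal(size=500)                        # unrelated variable

def pc1_direction(M):
    """Absolute components of the first principal axis of M."""
    M = M - M.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return np.abs(eigvecs[:, np.argmax(eigvals)])

raw = pc1_direction(X)                   # dominated by the large-variance column
std = pc1_direction(X / X.std(axis=0))   # standardised: columns 0 and 1 share PC1
print(raw.round(3), std.round(3))
```

Without standardising, PC1 is essentially just the grams column; after standardising, the two genuinely correlated columns share PC1, which is the structure you actually care about.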
SUPERB THANKS
Nice video😊
Very helpful video, but I'm not sure I understand when to use PCA - do the variables need to be correlated or not?
Hi Niki, not sure if I understand your question, could you rephrase it, please?
@@biostatsquid Sorry it was not clear...I just wonder if there is a limitation at applying PCA only in cases of data where there is some correlation among the factors or some factors for example height and weight are correlated etc.
@@nikitrianta9896 Oh I see! No, not at all. Actually, PCA allows you to gather insights about the features describing your data: by looking at the coefficients of the features/variables for each PC, you can find out if they are positively, negatively, or not correlated.
If you want to visualise this, you can draw a plot of the coefficients for PC1 vs PC2 (for example) for all features. For each feature, imagine (or draw) a vector from the origin (0, 0) to the point (coefficient PC1, coefficient PC2). Features that are positively correlated with each other have an angle between their vectors close to 0 degrees; if they are negatively correlated, the angle between them is close to 180 degrees; and if they are uncorrelated, the angle is close to 90 degrees.
Does this answer your question? :)
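The angle idea in that reply can be sketched with NumPy on made-up data (the angles are only approximate: the reading is reliable when PC1 and PC2 capture most of the variance):

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=300)
X = np.column_stack([
    base + 0.1 * rng.normal(size=300),    # feature A
    base + 0.1 * rng.normal(size=300),    # feature B: positively correlated with A
    -base + 0.1 * rng.normal(size=300),   # feature C: negatively correlated with A
    rng.normal(size=300),                 # feature D: uncorrelated with A
])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
P = (eigvecs[:, order] * np.sqrt(eigvals[order]))[:, :2]  # features in the PC1-PC2 plane

def angle(i, j):
    """Angle in degrees between feature vectors i and j in the PC1-PC2 plane."""
    c = P[i] @ P[j] / (np.linalg.norm(P[i]) * np.linalg.norm(P[j]))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

print(angle(0, 1), angle(0, 2), angle(0, 3))  # roughly 0, 180 and 90 degrees
```

This is exactly what a correlation-circle / biplot shows graphically: each feature drawn as an arrow from the origin, with the angles between arrows reflecting the correlations.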
How do we interpret the data, including hierarchical?
Looking for a response from the author - What is the significance of a low PCA for a large biological data set? - Does a PC1 of
If PC1 is 20%, it means it explains 20% of the variability of the dataset. You can then check the top contributing variables of PC1 to figure out which features of your dataset explain the most variability. In complex scenarios you might be happy with 20% of variability. For example, say you are studying height in the human population and want to figure out which genes contribute to height. You 'take' a sample of people with different heights and do RNAseq to measure gene expression (this is a very simple example, but let's go with it). You do PCA on the gene expression counts of all genes in the human genome. PC1 explains 20% of variability (i.e., differences in height in the sample you took). Then you check, and the top PC1-contributing genes are X, Y, Z. So you know that genes X, Y, Z most probably play an important role in height. But of course this is only 20% of the variability of your data. What about the other 80%? Well, you forgot about other important factors that contribute to height, like diet, gender, and genomic variability (not only transcriptomics; epigenetics and genomics might play an important role too!)... etc. Hope this made it a bit easier to understand!
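That workflow (check the % variance explained by PC1, then its top contributors) can be sketched with NumPy; here simulated data stands in for gene expression, and the "genes" sharing a signal are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_vars = 200, 10
X = rng.normal(size=(n_samples, n_vars))
signal = rng.normal(size=n_samples)
X[:, :3] += 3.0 * signal[:, None]        # 'genes' 0-2 share a common signal

Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()                  # fraction of variance per PC
top = np.argsort(np.abs(eigvecs[:, 0]))[::-1][:3]    # top 3 contributors to PC1
print(f"PC1 explains {explained[0]:.0%} of the variance")
print("top PC1 variables:", sorted(top.tolist()))    # the shared-signal 'genes'
```

Even though PC1 captures only a modest slice of the total variance, its top-loading variables still single out the ones driving the shared signal, which is the point of the reply above.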
@@biostatsquid oh myyy. What a good teacher you are! Thumbs up🎉
Hi! I really love your explanation! Would it be possible to get a copy of the dataset? I need to teach PCA and i think this a nice example cause the relationships between the factors are easy to understand! Would definitely point them to this video!
Hi Liz, thanks for your feedback! Unfortunately I cannot share my dataset, not because I don't want to, but because there is no dataset! I just made up the categories and figures for illustration purposes, because it is easier to understand when the factors are 'obvious'. So sorry to disappoint you...
However, you can check out my post here in case it is helpful: biostatsquid.com/pca-simply-explained/
@@biostatsquid no worries thanks so much!
Do you by chance make time for appointments? I would be grateful. Thanks
Hi Ruth! Just send me an email describing your issue and I'll tell you if I can help:)
Nice, very nice
very nice
Whole world creator's godfather bless you all always and you all love and remember godfather with your pure hearts.
Very well done!