Applied Principal Component Analysis in R

  • Published: 7 Aug 2024
  • Updated: 01-21-2023
    Theory and Application for Principal Component Analysis
    Thanks for watching! Let me know what you think. Are there any issues? Please let me know in the comments below.
    Please Like and Subscribe! :)
    GitHub link for the code - follow along!
    github.com/SpencerPao/Data_Sc...
    0:00 - Intro
    0:26 - Theory behind PCA (No math)
    2:45 - Exploring Data
    3:55 - Princomp package
    8:38 - Screeplot
    10:17 - Scores
    10:50 - Prcomp package
    12:43 - Principal package
  • Science

Comments • 77

  • @AndrzejFLena • 3 years ago

    Bloody love your channel mate - thanks for all your great work! Peace from UK :D

  • @joshpoland804 • 2 years ago +1

    This is so incredibly helpful. Thank you!

  • @nicholaslewis7148 • 1 year ago +1

    Thank you for doing such a straightforward and simple video. No one seems to have a video with PCA and eigenvalues in real application to a dataset.

  • @jonashansen6391 • 2 years ago +1

    Great video. Simple, intuitive rundown of something complex.

  • @SwoleMastrChase • 3 years ago

    Good stuff! I'm doing PCA for my senior project and this video helped me out a ton!

  • @shivangigujar3184 • 3 years ago

    Very nicely explained!!!

  • @andrewdelgado7536 • 3 years ago

    Dude, you're a lifesaver for these videos! THANK YOU! - Clinical Research PhD Student

  • @lauradiaz5829 • 2 years ago

    Amazing video!

  • @dendeibrahimadekanmbi8022 • 2 years ago

    Thanks for this presentation.

  • @telmamendes9423 • 3 years ago

    Thank you so much! It helped a lot!

  • @PinkWitch • 2 years ago

    Very helpful in preparing for Exam PA. Thx ~

  • @marlonedy55 • 1 year ago

    Thank you for sharing your knowledge. You could make a video about RDA and CCA. Greetings from Ecuador 🇪🇨🙋‍♂️

  • @oliesting4921 • 2 years ago +1

    Great video! I hope you will do more tutorials in R. Do you use stepAIC for feature importance? How is the AIC method different from the PCA method? Thank you!

    • @SpencerPaoHere • 2 years ago +1

      Hi! I'm glad you liked it. stepAIC is used to determine which model is best; it does this by computing AIC scores as features are removed (so it can be considered a form of feature selection).
      PCA, by contrast, shrinks the number of features, reducing complexity while retaining as much of the variance in your features as possible.
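A rough sketch of that feature-shrinking idea with prcomp (the built-in iris data and the 95% threshold here are just stand-ins for illustration):

```r
# PCA on the four numeric iris columns; scale. = TRUE standardizes each feature
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of total variance captured by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Keep only as many components as needed to explain ~95% of the variance
k <- which(cumsum(prop_var) >= 0.95)[1]
reduced <- pca$x[, 1:k]   # same 150 rows, but fewer columns than the original 4
```

For iris this keeps two components, so the four correlated measurements collapse into two uncorrelated ones.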

  • @davekimmerle9453 • 1 year ago +1

    Hey Spencer, great video, keep up the good work!
    I wanted to ask again whether you have an academic article, paper, or book that I could cite in my thesis when I do PCA?

    • @SpencerPaoHere • 1 year ago

      Try this:
      royalsocietypublishing.org/doi/10.1098/rsta.2015.0202

  • @XxRoos898xX • 3 years ago

    Hey Spencer Pao,
    Thank you for the great video!
    Just a few questions:
    - Why are you entering cor = TRUE (pc.teeth

    • @SpencerPaoHere • 3 years ago +1

      Hello, I'm glad you liked it!
      1) The cor = TRUE argument indicates whether to use the correlation or the covariance matrix for the PCA calculation. Since it is set to TRUE here, I am using the correlation matrix.
      2) PCA naturally tries to rotate the observations' axes for the best fit.
      3) Deciding on a rotation is something of an art form. You'd have to try different rotation methods to see which one is best suited to your use case. I would try 2-3 popular rotation algorithms and compare the outcomes.
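A small sketch of that cor argument in action (USArrests is a stand-in dataset; note R is case-sensitive, so it must be cor = TRUE, not cor = True):

```r
# princomp's cor argument chooses the input matrix for the decomposition
pc_cor <- princomp(USArrests, cor = TRUE)   # PCA on the correlation matrix
pc_cov <- princomp(USArrests, cor = FALSE)  # PCA on the covariance matrix

# With the covariance matrix, the highest-variance column (Assault in
# USArrests) dominates the first component; the correlation matrix puts
# all variables on a common scale first
round(unclass(loadings(pc_cov))[, 1], 3)
round(unclass(loadings(pc_cor))[, 1], 3)
```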

    • @XxRoos898xX • 3 years ago

      @@SpencerPaoHere Thank you for the clarification! I am only familiar with PCA in SPSS.
      My teacher explained PCA in short as the following steps:
      - Examine the correlation matrix.
      Variables have to be correlated for PCA to be appropriate (i.e., if they are not correlated, they are unlikely to share common factors).
      + As a guide, look for correlations > 0.35 in absolute size.
      - Extract all potential factors.
      - Examine the eigenvalues.
      The total variance explained by each factor is its eigenvalue.
      The most common method is to retain only factors with eigenvalues > 1.
      An alternative is a scree plot (look for the break in the curve as the point at which further factors stop adding a worthwhile amount of explained variance).
      - Examine the factor matrix (loadings): determine which variables load heavily on which components (does it make sense theoretically that they load heavily on a certain component?).
      - Examine the final statistics (communalities fall, since only a subset of the factors is used).
      Low communalities suggest that a variable may need to be excluded (i.e., it is not explained well by your components).
      - Explore how rotations influence your PCA (if needed).
      How do you feel about the above explanation?
      Is it possible to have two highly correlated variables? E.g., would a correlation > .90 be a problem for PCA?
      Also, if I wanted to examine communalities in R, do you know how I would code that?
      I am just starting to learn about PCA, so I apologise if there are any mistakes in the above text.

    • @SpencerPaoHere • 3 years ago

      @@XxRoos898xX Hello. Are you referring to the mathematics of PCA? I like the shortest explanations possible that address the main steps of the algorithm, haha. But by all means, when it comes to teacher explanations, I'd stick with what they are saying, since they are the ones grading your descriptions :p
      High-level overview:
      1) Compute the covariance or correlation matrix.
      2) Calculate the eigenvectors/eigenvalues of that matrix.
      3) Sort the eigenvectors by their eigenvalues and keep however many components your scree plot suggests.
      4) Transform your data into the new subspace using the chosen eigenvectors.
      The communalities you are referring to are equivalent to the sum of squared loadings. Check out 15:02.
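Those four steps can be hand-rolled in a few lines of R (USArrests as a stand-in dataset; component signs may flip relative to princomp/prcomp):

```r
X <- scale(USArrests)        # standardize so cov(X) is the correlation matrix
C <- cov(X)                  # 1) covariance/correlation matrix
e <- eigen(C)                # 2) eigenvalues and eigenvectors
# 3) eigen() returns eigenvalues in decreasing order; keep the top k
k <- 2
W <- e$vectors[, 1:k]
# 4) project the data onto the chosen eigenvectors
scores <- X %*% W

# Matches prcomp's scores up to a per-component sign flip
ref <- prcomp(USArrests, scale. = TRUE)$x[, 1:k]
max(abs(abs(scores) - abs(ref)))
```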

    • @XxRoos898xX • 3 years ago

      @@SpencerPaoHere Thank you! I will rewatch your video again :) it truly is amazing! I will check out 15:02 again; thank you for your quick replies.
      Sorry, my notes were from my teacher's explanation of how to interpret/work through an SPSS output of a PCA - we didn't touch the algorithms behind PCA (I think that may be why it is difficult for me to follow some of the steps in R).

  • @menoknow2 • 2 years ago

    Hi Spencer, great video! I was wondering if you could use PCA for binary data? I see that the bot.canine feature is binary. Thanks!

    • @SpencerPaoHere • 2 years ago

      Yes! You absolutely can use PCA on binary features. But there are better options when it comes to modeling categorical variables.

    • @azarael77 • 2 years ago

      I read an article saying there might be problems if the difference in item difficulties between the variables (the proportion of people agreeing to an item) is too high. It recommends using correspondence analysis in that case.

  • @dcsekhar • 10 months ago

    Sir, I have data with categorical independent variables, which I have converted into 0/1, and I am trying to fit a logistic regression because the response variable is also categorical (0/1). Can this technique be used to avoid the multicollinearity problem in the dataset and to do discriminant analysis for prediction?

    • @SpencerPaoHere • 5 months ago

      Yes. There are other penalization techniques out there, but you can use PCA to avoid multicollinearity (in fact, that is one of the main purposes of PCA).

  • @kavyashreenm6815 • 2 years ago

    Please do a video on how to avoid overlapping labels in a k-means cluster plot and a PCA scatterplot.

    • @SpencerPaoHere • 2 years ago

      Hmm. I don't really follow. Are you referring to a preprocessing step on the visualization side?

  • @fishfish20 • 1 year ago

    Where is the next part? I love your work.

    • @SpencerPaoHere • 1 year ago

      Hmm. You can probably scroll through the videos on my channel to see what you're looking for. If not, let me know!

  • @adriantyanandya4338 • 1 year ago

    Hello Spencer. First, I want to thank you for your great video. I have a question: why can't I use 'princomp'? The error says "can only be used with more units than variables". Is there a solution? Thank you.

    • @SpencerPaoHere • 1 year ago

      Thanks! Hmm, well, princomp requires more observations (rows) than features (columns), and it is perfectly reasonable (and ideal) to have more observations than features - that also helps with collinearity. If your data has fewer rows than columns, try prcomp, which does not have that restriction.

  • @justin2icy • 3 years ago

    Thank you, this was very helpful! One quick question I had: if I wanted to know which variables were closely related to another variable, how could I interpret that? For example, of the 8 variables, which 3 are strongly related to bot.canine? How would I go about doing that?

    • @SpencerPaoHere • 3 years ago

      Hi! Try computing the correlations among your variables. If the correlation between, say, X and Y is close to 1 (or -1), you know they are strongly related. You can run the cor() function on all the features to generate a correlation matrix.
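A possible sketch of that correlation-matrix approach (mtcars and its mpg column are stand-ins for the video's teeth data and bot.canine):

```r
# Correlation matrix across all numeric features
cm <- cor(mtcars)

# For one variable of interest, rank the others by absolute correlation
target  <- "mpg"   # stand-in for a column like bot.canine
related <- sort(abs(cm[target, colnames(cm) != target]), decreasing = TRUE)
head(related, 3)   # the three variables most strongly related to the target
```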

  • @asirbillah4987 • 3 years ago

    Hello, could you please tell me whether it's possible for me to pursue an MS in ML/DL at a good US university with a ~3.5 CGPA in undergrad?

    • @SpencerPaoHere • 3 years ago +1

      I'd argue that anything is possible. I'd take a look at Georgia Tech (note that there are a ton of excellent MS programs out there), but I specifically know people who have gone through Georgia Tech's program and have heard good things about it. They have a DS program that is remote/part-time.

  • @kx7522 • 2 years ago

    How do we construct an index with PCA? Do we multiply the raw data of each column by its proportion of variance and sum them up?

    • @SpencerPaoHere • 2 years ago +1

      There is a really lengthy answer to that question, and I would not be doing it justice without this post - in essence, it depends.
      This might better answer your question:
      stats.stackexchange.com/questions/133492/creating-a-single-index-from-several-principal-components-or-factors-retained-fr

    • @kx7522 • 2 years ago

      @@SpencerPaoHere Thanks :) I have another question: do I need to scale my dataset before doing PCA if I use the 'prcomp' function in R?

    • @SpencerPaoHere • 2 years ago

      @@kx7522 Yes! Because PCA is driven by variances (sums of squares), unscaled features with large variances will dominate. You should center and scale your data.
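A small illustration of why that matters with prcomp, which centers by default but does not scale unless asked (USArrests is a stand-in dataset):

```r
# scale. = FALSE is prcomp's default, so request scaling explicitly
p_unscaled <- prcomp(USArrests)
p_scaled   <- prcomp(USArrests, scale. = TRUE)

# Unscaled, the first component's sd is inflated by the highest-variance
# column (Assault); scaled, each component sd is on a comparable footing
p_unscaled$sdev[1]
p_scaled$sdev[1]
```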

  • @Aaarya299 • 3 years ago

    How do you analyze principal components using variance values?

    • @SpencerPaoHere • 3 years ago

      Well... technically the variances are the eigenvalues. You can check the covariance matrix of the principal components and see how much variability is explained by each component: take the diagonal values and divide each by the sum of the diagonal values to get the 'explainability' of each component... I am not sure if I answered your question.
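That diagonal-over-total calculation might look like this in R (USArrests as a stand-in):

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues = variances of the components (the diagonal of their
# covariance matrix, since components are uncorrelated)
eig <- pca$sdev^2

# 'Explainability' of each component: its eigenvalue over the total
explained <- eig / sum(eig)
cumsum(explained)   # cumulative proportion of variance explained
```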

  • @adityapratapsingh6068 • 2 years ago

    Does having a categorical variable along with continuous variables make any difference? What modifications are needed in the analysis if the dataset is like that? Thank you.

    • @SpencerPaoHere • 2 years ago

      You'd have to one-hot encode your categorical variables. Otherwise, your categorical variables would be interpreted as ordered (which is something you don't want).

    • @adityapratapsingh6068 • 2 years ago

      @@SpencerPaoHere So suppose my variable is a rating from 1 to 5, so its values are like 0, 2, 5, 1 and so on. Can I use it as-is for the analysis?

    • @SpencerPaoHere • 2 years ago

      @@adityapratapsingh6068 Yep! Just one-hot encode that feature. Then you should be good to go!
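One way to one-hot encode such a rating column in base R (the data frame here is made up purely for illustration):

```r
# Hypothetical rating column; treat it as a factor, not a number
df <- data.frame(rating = factor(c(0, 2, 5, 1, 2)), x = rnorm(5))

# model.matrix() one-hot encodes factors (~ 0 + ... drops the intercept,
# so every level gets its own indicator column)
onehot <- model.matrix(~ 0 + rating, data = df)
onehot   # one 0/1 column per rating level, no implied ordering
```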

  • @dimariscolonmolina2223 • 3 years ago

    Hello, I have a PCA analysis where I used prcomp, and I want to show in the graph the species and the environmental parameter I am sorting by (grain-size particle), but the graph does not show the sorting. This is the code I have at the moment. I really appreciate all the help, thank you!!!
    pr.envt1

    • @SpencerPaoHere • 3 years ago

      This sounds like a graphing issue. Have you taken a look at the sort() function? It can 'order' your independent variables when graphing. You can use sort() on the "X" variables that you are trying to plot, and it should order the variables to your liking.

    • @dimariscolonmolina2223 • 3 years ago

      @@SpencerPaoHere OK, so what would the R code I sent you look like with the sort function? If you have an email address, I'd be happy to send you my R script to see what I'm doing wrong, if that's no problem for you.

    • @SpencerPaoHere • 3 years ago

      I am wary about revealing an email address publicly. Do you have a GitHub? Maybe you can push your stuff up there and I can take a look at it?
      Or, even better, if you have reproducible code with sample data (e.g., the iris dataset) that can reproduce the problem, I can help diagnose the issue.
      But, in essence, for the variable that you want to plot, try something like plot(sort(name_variable), ...)
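A toy version of that plot(sort(...)) idea (grain_size here is a hypothetical variable standing in for the commenter's data):

```r
# Hypothetical grain-size measurements; sorting orders the x-axis by rank
grain_size <- c(0.42, 0.11, 0.95, 0.27, 0.63)
plot(sort(grain_size), type = "b",
     xlab = "rank", ylab = "grain size", main = "Sorted values")
```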

    • @dimariscolonmolina2223 • 3 years ago

      @@SpencerPaoHere OK, awesome, will do.

  • @ziddneyyy • 3 years ago

    Thanks for the tutorial, but do you know how to convert a principal component analysis into a principal component regression? I'm stuck after getting the components that will be used for the regression, but I don't know how to convert them. Thanks!

    • @SpencerPaoHere • 3 years ago +1

      Hi!
      You could run lm() on your components and your Y variable.
      So it would be something like this:
      df
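The reply's code is cut off; a sketch of the same idea (mtcars and mpg are stand-ins, and the choice of 3 components is arbitrary) might look like:

```r
# Principal component regression: PCA on the predictors, then lm() on scores
pca <- prcomp(mtcars[, -1], scale. = TRUE)   # all columns except mpg

# Keep the first few components and regress y on them
k   <- 3
df  <- data.frame(y = mtcars$mpg, pca$x[, 1:k])
pcr_fit <- lm(y ~ ., data = df)
summary(pcr_fit)$r.squared
```

For new/test observations, predict(pca, newdata = test_rows) projects them into the same component space before calling predict() on the lm fit.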

    • @ziddneyyy • 3 years ago

      @@SpencerPaoHere OMG, THANKS A LOT, YOU ARE HELPING SO MUCH RIGHT NOW

    • @AJman24 • 2 years ago

      @@SpencerPaoHere Would we include all the components or just the main ones? And how would we do prediction using this regression, since the test data will be in a different format?

    • @SpencerPaoHere • 2 years ago

      @@AJman24 The idea behind PCA is to explain as much variance as possible with as few components as possible. So your variance threshold will determine how many components you want to use.

    • @AJman24 • 2 years ago

      @@SpencerPaoHere But what if we want to compare the test results with another model, say ridge, and we have already set aside a few observations for that, and we want to test our PCR on the same observations we used for ridge?

  • @MarinaUganda • 3 years ago

    I have been trying to plot a biplot of the varimax-rotated components. Is this possible?

    • @SpencerPaoHere • 3 years ago +1

      It should be! You can store the rotated component values in X and Y variables and plot(X, Y) as needed.

    • @MarinaUganda • 3 years ago

      @@SpencerPaoHere It unfortunately won't work when I use fviz.

    • @SpencerPaoHere • 3 years ago

      @@MarinaUganda Hmm, weird. Once you have your fit rotated with varimax, you would plot the loadings to get the visualization.
      Your code should look something like this:
      fit
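The snippet above is truncated; one base-R sketch of a varimax-rotated "biplot" (USArrests as a stand-in, using stats::varimax instead of fviz) could be:

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Varimax-rotate the loadings of the first two components
raw      <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])
rot      <- varimax(raw)
load_rot <- unclass(rot$loadings)

# Apply the same rotation to the (standardized) component scores
scores_rot <- scale(pca$x[, 1:2]) %*% rot$rotmat

# Manual biplot: points for observations, arrows for rotated loadings
plot(scores_rot, xlab = "RC1", ylab = "RC2")
arrows(0, 0, load_rot[, 1], load_rot[, 2], length = 0.1, col = "red")
text(load_rot[, 1] * 1.1, load_rot[, 2] * 1.1, rownames(load_rot))
```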

  • @ajaydhungana1921 • 2 years ago

    Does the number of entries in the columns matter or not?

    • @SpencerPaoHere • 2 years ago

      Are you referring to records/rows? It really depends on what type of machine learning model you use. In general, the more data you have, the better off your model will be. (Not always the case - e.g., billions of records for a PCA might be a bit overboard.)
      But if you don't have a lot of data, you will be worse off (e.g., < 100 records/observations).

  • @chiennguyenminh8109 • 3 years ago

    Hi, I wonder how you can autocomplete variables so fast? At 11:02.

    • @SpencerPaoHere • 3 years ago +1

      Oh, haha. I just edited the video so that you won't see me type the words out (keeping out the more mundane parts).
      However, in R, there are ways to autocomplete. Try 'Tab' + 'Enter' when writing variables/functions, etc.

    • @chiennguyenminh8109 • 3 years ago

      @@SpencerPaoHere Thank you so much!!!

  • @pramitthapa283 • 8 months ago

    All the experts in YouTube videos say "PCA is dimensionality reduction and blah blah blah...". However, no one explains in a simple way what the reduced dimensions or principal components (explaining x% of the variance) actually mean in terms of the variables, in a way a beginner in statistics can understand.

  • @muhammadzubairchishti1795 • 2 years ago

    Dear respected professor! Thank you so much for providing us free knowledge. I highly appreciate your precious efforts. Could you please give me your email address, since I want to send you my issue regarding R code? Thank you.

    • @SpencerPaoHere • 2 years ago +1

      I am not a professor. :p
      Though I can answer any questions you might have in the comment section, I can be reached at
      business.inquiry.spao@gmail.com

    • @muhammadzubairchishti1795 • 2 years ago

      @@SpencerPaoHere Thank you so much for your kind reply.