StatQuest: PCA in R

  • Published: 1 Dec 2024

Comments • 340

  • @statquest
    @statquest 2 years ago +6

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @Buckybarnesfan22
    @Buckybarnesfan22 1 year ago +21

    Josh, whenever I have a question about data science, your channel is the go-to. Thank you for the clear teaching!

  • @MancoMusic
    @MancoMusic 3 years ago +12

    In more or less 30 minutes I understood more about PCA than in hours of university degree and PhD courses! This video along with the one about theory background are amazing.
    I was just looking for something to understand it better and try to apply it on my "a lot of species" VOCs dataset.
    I'm going to try a first raw run on R!
    And you are both scientist and songmaker like me! I'm not a lonely weirdo!
    Awesome!
    Thank you!

    • @statquest
      @statquest 3 years ago +1

      Thank you very much! And keep making music!

    • @MancoMusic
      @MancoMusic 3 years ago +1

      🤟

  • @N0rmad
    @N0rmad 3 years ago +28

    It blows my mind how much people like you know about several topics. Statistics, probability, coding, machine learning, biology, and I'm sure more. You're nuts. And I meant that as a compliment.

  • @fgfanta
    @fgfanta 7 months ago +5

    Six years later, this is still gold.

  • @suzandilaratokac2692
    @suzandilaratokac2692 4 years ago +17

    This was amazing! I was afraid of the name "PCA" in R but you made it so easy!

  • @kathrynoneill5570
    @kathrynoneill5570 5 years ago +39

    This is amazing!!! You are saving me so much grad school struggle right now. I am singing StatQuest from the rafters!

  • @janaw768
    @janaw768 4 years ago +35

    Not all heroes wear capes

  • @niceday2015
    @niceday2015 2 years ago +1

    No matter how many times I say thank you, I cannot fully express my thanks to you

  • @EmapMe
    @EmapMe 4 years ago +66

    Dangit you even have a video on how to do this stuff in R. Now I have absolutely no excuse to not do my PCA assignment.

  • @arfix2077
    @arfix2077 1 year ago +1

    I'm always happy when I see that you have covered a topic I am studying. Thank you!

  • @leylayim
    @leylayim 3 years ago +3

    You sir, deserve a medal! Thank you so much!

  • @yutassmilehealsme6572
    @yutassmilehealsme6572 3 years ago +4

    Your channel saved my grades during my entire stats undergrad

    • @statquest
      @statquest 3 years ago +1

      BAM! I'm glad my videos are helpful. :)

    • @chainemusique1792
      @chainemusique1792 2 years ago +1

      @@statquest double bam I guess, me too
      Thank you

  • @ahsanjaved8313
    @ahsanjaved8313 6 years ago +1

    Awesome. Watched 100's of tutorials and all were confusing. It's the best one and as said clearly explained.. Thanks Joshua :)

    • @ahsanjaved8313
      @ahsanjaved8313 6 years ago

      Great. I will definitely watch it :) Thanks for the update

  • @jiwoosong6055
    @jiwoosong6055 4 years ago +5

    Best stats lecture ever! Thank you!! This needs more views :)

    • @statquest
      @statquest 4 years ago

      Thanks! :)

    • @mossammadu.c.sultana8548
      @mossammadu.c.sultana8548 4 years ago

      Dear Song, could you help me to handle my data? I would appreciate receiving a suggestion from you. Please help me. I am in a critical period and need to learn.

  • @josy4767
    @josy4767 1 year ago +1

    I was really struggling to get a grip on using PCA in R - you explained prcomp() so well! Thank you

  • @asfarlathif9943
    @asfarlathif9943 5 years ago +1

    Hello Josh! I just wanted to say a huge thanks for making these explanation videos on stats concepts. It's been a huge help for me to get my head around these complicated topics on my own. I was also looking for videos explaining the mathematics behind the SINGULAR VALUE DECOMPOSITION method and how to implement it in R. I would greatly appreciate it if you make a video explaining SVD (and its practical implementation in R). Thanks again!

    • @Vinladar
      @Vinladar 4 years ago

      I second this request. I've been racking my brain trying to understand the intuition behind SVD and how it relates to PCA. I've watched lots of videos and read tons of articles on the subject, but I am still mystified by it.
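
For the SVD question in this thread, here is a minimal sketch of how SVD lines up with prcomp(). The matrix `x` and its size are made up for illustration. The idea it tries to show: after centering the columns, the right singular vectors match the `rotation` columns up to sign, and the singular values divided by sqrt(n - 1) match `sdev`.

```r
set.seed(42)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)   # simulated: 20 samples, 3 variables

pca <- prcomp(x)                               # centers the columns by default
xc  <- scale(x, center = TRUE, scale = FALSE)  # the same centering, done by hand
s   <- svd(xc)                                 # xc = U D V^T

# Singular values, rescaled to standard deviations along each PC
sdev.from.svd <- s$d / sqrt(nrow(x) - 1)

# Both differences should be essentially zero (signs of the vectors may differ)
max.sdev.diff <- max(abs(sdev.from.svd - pca$sdev))
max.rot.diff  <- max(abs(abs(s$v) - abs(pca$rotation)))
```

The PC coordinates themselves are `xc %*% s$v`, which is what prcomp() stores in `pca$x` (again, up to sign).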

  • @rhoninpowers
    @rhoninpowers 6 years ago +2

    Oh man... I am subscribing!! I love how he (Josh) steps through the code... I thought there were some other great videos out there... and there are... but hands down these videos are excellent. (I want to punch the person who downvoted, however... not sure how someone can downvote something with such detail and clarity... such are the perils of choice... lol.) Thanks, Josh!

  • @DanObscur
    @DanObscur 5 years ago +1

    Clear, simple, practical. Superb!

  • @aman09astana
    @aman09astana 6 years ago +1

    Thanks Joshua!!!
    This is brilliant! BAM!!!
    I have watched a few of your videos and am amazed by your methods of explanation!

  • @Viking_Walrus
    @Viking_Walrus 4 years ago +8

    This was a great help to me. Thank you for the tutorial!

  • @SandraMaged-m3c
    @SandraMaged-m3c 11 months ago +1

    great teaching! And very simple to comprehend.

  • @kubrateksen8845
    @kubrateksen8845 2 years ago +1

    I wish we could like each of your videos more than once :)

  • @elovyn7210
    @elovyn7210 6 years ago +3

    Well explained, thanks! I used PCA, trying out different packages, would be nice to have some information on those mentioned as well, some of them provide nice functions for visualisation. When I do/try PCA, I find it most challenging to prepare the dataset to be suitable for PCA, but I guess that is a different topic of how to standardize variables and what to do with categorical variables etc.

  • @Krie7ananas
    @Krie7ananas 3 years ago +2

    Thank you sir, you are saving my lazy-ass once again

  • @rosalieo5045
    @rosalieo5045 4 years ago +2

    tip: if you want a nice looking scree plot in R then just do install.packages("devtools") and then install.packages("factoextra"). EDIT: use fviz_eig(df) to make the plot itself.
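
If installing extra packages is not an option, a scree plot can also be sketched in base R. The data below are simulated, and the variable names are only illustrative. (As far as I know, fviz_eig() expects a PCA result such as the object returned by prcomp(), rather than the raw data, so it is worth checking its help page before use.)

```r
set.seed(1)
x <- matrix(rnorm(10 * 6), nrow = 10, ncol = 6)   # simulated: 10 samples, 6 variables

pca <- prcomp(x, scale. = TRUE)

# Variance along each PC, expressed as a percentage of the total
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var / sum(pca.var) * 100, 1)

barplot(pca.var.per,
        names.arg = paste0("PC", seq_along(pca.var.per)),
        main = "Scree Plot",
        xlab = "Principal Component",
        ylab = "Percent Variation")
```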

  • @peterh5960
    @peterh5960 3 years ago +1

    Joining the list of many others that have thanked you for the awesome explanation :)

  • @asalkhoshravanazar9173
    @asalkhoshravanazar9173 2 years ago +2

    It's amazing that you can teach stuff in 10 minutes with Hooray! and BAM! that I cannot learn with weeks of studying books

  • @yibletdagnachew5997
    @yibletdagnachew5997 2 years ago +3

    This was the first time that I saw your videos and it's absolutely great. Thank you. As a beginner, I am struggling with the R software package. Do you have some videos on basic R so that I can successfully deal with my statistics?

  • @delphinedesmedt440
    @delphinedesmedt440 3 years ago +2

    Thanks a lot, very clear and helpful video!

    • @statquest
      @statquest 3 years ago

      Glad it was helpful!

  • @casdessers9105
    @casdessers9105 4 years ago +1

    I beg beg beg you to make a video about exploratory factor analysis, my professor really sucks at it and you made it so perfectly clear!!!

  • @FJParravicini
    @FJParravicini 5 years ago +1

    Your tutorials are amazing!

  • @adelutzaification
    @adelutzaification 7 years ago +1

    Hey Josh, Thanks again for putting all this work in making these videos. As a new student of R, I appreciate every time I learn a new trick from reading a script and I am glad when I can decipher and understand the scripts I read. You really distilled a lot of valuable "R knowledge" in this video... That being said, I was wondering if you are open to receive and maybe post comments / potential useful tricks for data analyzing in R/ applied to a certain topic. Like a mini-forum. I am thinking that by doing so, it would stimulate more discussion and that more people would benefit. I figure that your audience might grow as well ;)
    As a potential topic for discussion, what do you think about talking about contrasting various types of distributions, the types of data they apply to, and the proper statistical tests that are applicable for such data sets ?

    • @adelutzaification
      @adelutzaification 7 years ago

      As a biologist, I would primarily like to know more about the distributions encountered in various biological experiments, but of course, learning about other types of distribution encountered in other fields would be useful too. The important points would be , in my opinion, type of distribution, example of experiments/ real life situations, how you test that the data fits the criterion for the distribution (statistical test and comparison with a standard plot), what statistical tests can then be applied to measure various things in that population, contrast various distributions. Normal, Poisson, negative binomial, binomial definitely, and whatever you consider useful and have time for. I bet it takes you quite a bit of time to put together these youtube videos.
      Much obliged for learning so much from them.
      I see that, in your videos, you give examples of experiments encountered in the studying of gene expression. How about short and concise tutorials about how to analyze RNAseq/ChIpseq data?

    • @adelutzaification
      @adelutzaification 7 years ago

      Great! Look forward to seeing them. If you come to NYC I'll buy you a whole distribution of pints ;)

  • @henrydanielpetocovarrubias3666
    @henrydanielpetocovarrubias3666 22 days ago +1

    Stat Quest I love you so much

  • @manon8600
    @manon8600 4 years ago +2

    I love this! thank you, it makes so much sense!

    • @statquest
      @statquest 4 years ago

      Glad it was helpful! BAM! :)

  • @marceno4963
    @marceno4963 3 years ago +2

    Thanks so much! this was very useful :)

  • @argael1
    @argael1 4 years ago +2

    Thank you so much! You saved me a lot of time

  • @jenicksoncosta
    @jenicksoncosta 2 years ago +1

    Perfect. Thank you!

  • @maf4421
    @maf4421 3 years ago +1

    Thank you Josh Starmer for explaining PCA in detail. Can you please explain how to find weights of a variable by PCA for making a composite index? Is it rotation values that are for PC1, PC2, etc.? For example, if I have (I=w1*X+w2*Y+w3*Z) then how to find w1, w2, w3 by PCA.

    • @statquest
      @statquest 3 years ago

      Unfortunately I don't know that off the top of my head.

  • @joeykandre4641
    @joeykandre4641 6 years ago +2

    hello Josh can you please explain in Matlab for the same example. Love the way you explain the codes line by line. thank you so much

    • @statquest
      @statquest 6 years ago

      I wish I could. I'll reach out to Matlab and see if they will give me a copy of their software for my videos.

  • @naziarais1045
    @naziarais1045 6 years ago +2

    Very nice explanation of PCA. Can you explain variance partitioning analysis (VPA) in R?

    • @statquest
      @statquest 6 years ago

      I've added it to the to-do list.

  • @veducatube5701
    @veducatube5701 4 years ago +4

    statQuest!
    statQuest!
    U r the best!
    U r the best!

  • @vikasvarma9299
    @vikasvarma9299 4 years ago +1

    Making complex ideas simple!!! BAM!!

    • @statquest
      @statquest 4 years ago

      Thank you very much! :)

  • @donaldhunter6913
    @donaldhunter6913 5 years ago +2

    seriously helpful, thanks

  • @niranjansivalingam5512
    @niranjansivalingam5512 6 years ago +2

    Thanks for your efforts in explaining...Could you please suggest some data sources for practicing?

    • @statquest
      @statquest 6 years ago +4

      There are lots of great datasets here at the UCI Machine Learning Repository: archive.ics.uci.edu/ml/index.php

  • @waverider80
    @waverider80 4 years ago +1

    Why can't all teachers teach like you?

  • @ByounggookCho
    @ByounggookCho 6 years ago

    StatQuest man! thank you so much

  • @joseoviedo4529
    @joseoviedo4529 1 year ago

    Hello, I like your videos, keep it up. One note, though, regarding the naming of your first data frame (data.matrix): it is confusing because later on you actually call data.frame() when you are doing the ggplot2 part, and it's easy to get confused. Just an FYI, cheers. Great video!

    • @statquest
      @statquest 1 year ago

      The variable names are actually pretty accurate. "data.matrix" is not a data.frame object. It's a matrix, which, in R, is different from a data.frame. In contrast, "pca.data", which is used for ggplot2, is an actual data.frame object.

    • @joseoviedo4529
      @joseoviedo4529 1 year ago

      @@statquest maybe comment them in your code, and keep the accuracy but also improve the readability. Thanks for the reply!
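
The matrix versus data.frame distinction in this thread can be checked directly in R. The sketch below uses small simulated stand-ins for the two objects; the values and dimensions are made up, and only the names `data.matrix` and `pca.data` come from the discussion above.

```r
set.seed(2)
# A numeric matrix of simulated counts: 4 genes (rows) x 5 samples (columns)
data.matrix <- matrix(rpois(20, lambda = 50), nrow = 4,
                      dimnames = list(paste0("gene", 1:4),
                                      c("wt1", "wt2", "wt3", "ko1", "ko2")))

pca <- prcomp(t(data.matrix))   # prcomp() expects samples in rows, hence t()

# A data.frame built from the PCA coordinates, the kind of object ggplot2 wants
pca.data <- data.frame(Sample = rownames(pca$x),
                       X = pca$x[, 1],
                       Y = pca$x[, 2])

is.matrix(data.matrix)       # TRUE
is.data.frame(data.matrix)   # FALSE
is.data.frame(pca.data)      # TRUE
```

A matrix holds a single data type, while a data.frame can mix types per column (here a character `Sample` column next to numeric coordinates).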

  • @romanatorx3949
    @romanatorx3949 5 years ago +1

    That was amazing!

  • @aliciachen9750
    @aliciachen9750 5 years ago +2

    josh I didn't realize you're also from UNC... awesome video & GO HEELS :]

  • @burcakotlu7858
    @burcakotlu7858 6 years ago +1

    Thank you for these videos. Having said that, I want to ask you a question. At 5:58 you have drawn "My PCA Graph": samples with positive X are on the right of the graph and samples with negative X are on the left. However, although sample "ko2" has the highest Y value, it is not at the top right of the graph. What is the reason for that? I guess other PCs also have an effect, and this graph shows the overall result with only two PCs. Is that the reason? And also, what must we do if our samples are not separated this clearly? Thanks.

  • @mariagabriel3704
    @mariagabriel3704 3 years ago +1

    Thank you so much!

  • @leeicey2843
    @leeicey2843 7 years ago

    Very clear! Thanks very much!

  • @GO-wr3de
    @GO-wr3de 4 years ago +1

    Now I am good to go. Thanks, big time!

  • @matthewdong9368
    @matthewdong9368 5 years ago +3

    I recall in the introductory vid to PCA you explained how to calculate loading score for each gene when there are only 2 variables, I am curious how do you calculate them in this case with multiple variables.

    • @statquest
      @statquest 5 years ago +1

      So I have a couple of "introductory" PCA videos. Did you see this one? ruclips.net/video/FgakZw6K1QQ/видео.html At around 16 minutes and 25 seconds into it I show how PCA works for three variables and the ideas generalize from there.

  • @thilinikalpana7206
    @thilinikalpana7206 3 years ago

    This is awesome as always!. Can we apply PCA on categorical data?

    • @statquest
      @statquest 3 years ago

      In theory, no, but people seem to do it anyway.

    • @thilinikalpana7206
      @thilinikalpana7206 3 years ago +1

      @@statquest Thanks! Good to know that.

  • @mayankkaashyap2724
    @mayankkaashyap2724 5 years ago +1

    Thanks for the nice video, Josh. Can you tell me how I can show samples as colored dots and show the legend at the top? Please, I am a newbie to this!
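
One possible way to get colored dots with a legend at the top, sketched here in base R graphics rather than ggplot2 so that it has no package dependencies. The expression values are simulated, and the group labels are derived from the sample names "wt1" ... "ko5" by stripping the trailing digits; the colors are arbitrary choices.

```r
set.seed(7)
# Simulated expression matrix: 50 genes x 10 samples (5 wt, 5 ko)
wt <- matrix(rpois(50 * 5, lambda = 100), nrow = 50)
ko <- matrix(rpois(50 * 5, lambda = 200), nrow = 50)
expr <- cbind(wt, ko)
colnames(expr) <- c(paste0("wt", 1:5), paste0("ko", 1:5))

pca <- prcomp(t(expr), scale. = TRUE)

# Group label for each sample: "wt" or "ko"
group <- factor(sub("[0-9]+$", "", rownames(pca$x)), levels = c("wt", "ko"))
group.cols <- c(wt = "steelblue", ko = "firebrick")

plot(pca$x[, 1], pca$x[, 2],
     col = group.cols[as.character(group)], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA by group")
legend("top", legend = levels(group), col = group.cols[levels(group)],
       pch = 19, horiz = TRUE, bty = "n")
```

In ggplot2 the equivalent idea would be to map the group column to `color` inside `aes()` and add `theme(legend.position = "top")`.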

  • @elghanbary5183
    @elghanbary5183 6 years ago +1

    manyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy thanks, you helped me very much. great and complete explanation

    • @statquest
      @statquest 6 years ago

      Hooray!!! I'm glad the video was helpful :)

  • @munaalhammadi4237
    @munaalhammadi4237 4 years ago +1

    Thank you so much for the tutorial. I have a question, is it better to do PCA analysis only for genes that have a significant differential expression?

    • @statquest
      @statquest 4 years ago

      It depends, however, it is typical to do it on just the genes that have different expression.

    • @munaalhammadi4237
      @munaalhammadi4237 4 years ago +1

      @@statquest Thanks for your quick response.

  • @luisa1551
    @luisa1551 4 years ago +1

    First: Thank you for the video. It is a lifesaver. I am a very practical person, and until I see the values I have problems getting the abstract idea. I have a question: which version of R did you use? It gave me an error when I tried to use the function sort() with the parameter decreasing=TRUE in my R (3.6). Thanks and have a great day!

    • @statquest
      @statquest 4 years ago

      I just ran it fine in 3.6.1, but the code is old and has run in earlier versions as well, so I'm not sure what your bug is.

  • @jacobmoore8734
    @jacobmoore8734 5 years ago +3

    How do I take results from PCA and use in a predictive model? Specifically, linear regression

    • @statquest
      @statquest 5 years ago +3

      You can select the variables/features that have the highest loading scores and use those in your linear regression.

    • @statquest
      @statquest 5 years ago +4

      Alternatively, you could use Elastic-Net regression to select the variables/features that are best for your linear regression. Elastic-Net is a form of regularization and I have a bunch of videos on that topic.

    • @jacobmoore8734
      @jacobmoore8734 5 years ago +2

      You're a wizard, Harry. Seriously, thank you :)
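
A rough sketch of the first suggestion in this thread: rank variables by the magnitude of their PC1 loading scores and feed the top few into a linear regression. The predictors, the response `y`, and the choice of three variables are all made up for illustration. Note that PCA never looks at `y`, so high loading scores do not guarantee that a variable is a good predictor.

```r
set.seed(3)
n <- 40
x <- matrix(rnorm(n * 8), nrow = n,
            dimnames = list(NULL, paste0("gene", 1:8)))   # simulated predictors
y <- 2 * x[, "gene1"] - x[, "gene2"] + rnorm(n)           # simulated response

pca <- prcomp(x, scale. = TRUE)

# Rank variables by the magnitude of their PC1 loading scores
# (large negative scores count as much as large positive ones)
loading.scores <- pca$rotation[, 1]
ranked <- sort(abs(loading.scores), decreasing = TRUE)
top.genes <- names(ranked[1:3])

# Ordinary linear regression on just the selected variables
fit <- lm(y ~ ., data = data.frame(y = y, x[, top.genes]))
```

`summary(fit)` then shows how well the selected variables explain `y`.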

  • @Jacksonmrodrigues
    @Jacksonmrodrigues 5 years ago

    Suggestion: Principal Curve analysis

  • @uohzihz
    @uohzihz 6 years ago

    Thanks for another good video!

  • @giuliafabbri523
    @giuliafabbri523 4 years ago +1

    Thank you for the series of videos on PCA, they were life-saving, really!!!
    I have one question: if my dataset consists of SNPs and "pure" and quite recently admixed populations (info on admixture comes from both historical records and ADMIXTURE analysis), does it make sense to first perform PCA on the "pure" populations and then add the admixed ones so that the latter don't bias the PC calculation? How can I project the admixed individuals after the PCA with pure populations is performed? I should mention that I'm using dudi.pca (that can perform the centering of the data) in adegenet package, and that I found the function predict (or suprow in ade4 but it gives the same result as predict), but I'm not sure this is a correct way to do it.

    • @statquest
      @statquest 4 years ago

      Unfortunately I have no idea what "admixed" means so I can't help you. :(

    • @giuliafabbri7937
      @giuliafabbri7937 4 years ago

      @@statquest Oh it just means that population C is the result of mating between population A and population B (which are genetically differentiated) in the past and then population C evolved separately!

    • @statquest
      @statquest 4 years ago

      @@giuliafabbri7937 I'm not sure how to extract the loading scores from dudi.pca. However, if you can use prcomp(), you can extract the loading scores (called "rotation") and use those to multiply and sum the new (admixed) data to get PCA coordinates in the plot made with the original populations.
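
A minimal sketch of the projection described in this reply, using prcomp() on simulated data. The matrices `reference` and `new.samples` and the "snp" column names are hypothetical. The new samples are centered and scaled with the values learned from the reference samples, then multiplied by the loading scores; predict() on a prcomp object does the same calculation.

```r
set.seed(11)
# Simulated data: rows are samples, columns are variables
reference <- matrix(rnorm(30 * 5), nrow = 30,
                    dimnames = list(NULL, paste0("snp", 1:5)))
new.samples <- matrix(rnorm(4 * 5), nrow = 4,
                      dimnames = list(NULL, paste0("snp", 1:5)))

pca <- prcomp(reference, scale. = TRUE)

# Center and scale the new samples using the reference means and sds,
# then multiply by the loading scores and sum (a matrix product)
new.scaled <- scale(new.samples, center = pca$center, scale = pca$scale)
projected  <- new.scaled %*% pca$rotation

# predict() gives the same coordinates
via.predict <- predict(pca, newdata = new.samples)
max.diff <- max(abs(projected - via.predict))
```

The rows of `projected` can then be added to a plot of `pca$x` with points().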

  • @shortfun4626
    @shortfun4626 4 years ago +1

    This is the 200th comment, I am feeling lucky. Sir, I like your "Hello" at the start...

  • @ritamdutta2078
    @ritamdutta2078 4 years ago +2

    This is an outstanding explanation. I tried the PCA and in my case PC2 is greater than PC1. The values are in negative axis. What does it mean? Can you help me?

    • @statquest
      @statquest 4 years ago

      If PC2 accounts for more variation than PC1, then something went wrong.

  • @콘충이
    @콘충이 4 years ago +1

    Thank you so much

  • @janiceoou
    @janiceoou 3 years ago

    autoplot is another handy tool for plotting PCA

  • @shawnkimisback
    @shawnkimisback 4 years ago +3

    Thank you very much for this. Do you have any specific recommendations for online courses to learn R in a biological context for someone coming from the bench side of research looking to perform more detailed RNA-seq data analysis from publicly available databases?

    • @luisa1551
      @luisa1551 4 years ago

      Hi! I found a R for biologist from Data Carpentry (datacarpentry.org/semester-biology/readings/) Maybe it will help you.

  • @糜家睿
    @糜家睿 6 years ago +1

    Hi, Joshua, to clarify: in prcomp, does scale = T mean that each value in the original dataset (not the transposed one) has the row mean subtracted and is then divided by the standard deviation of the row?

    • @statquest
      @statquest 6 years ago

      Generally speaking, scale=TRUE scales the columns in the dataset passed to prcomp(). So if you pass a matrix to prcomp() with scale=TRUE, then the columns of that matrix will be scaled. If you pass a transposed matrix to prcomp(), the columns of the transposed matrix will be scaled. Centering is also done by column and prcomp() does it by default. If you don't want to center, set center=FALSE.

    • @糜家睿
      @糜家睿 6 years ago +1

      That makes a lot of sense. Thanks, Joshua!
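
The column-wise behavior described in this thread can be checked with a small sketch. The data are simulated, with columns deliberately put on very different scales. Scaling by hand (subtract each column's mean, divide by each column's standard deviation) and then running prcomp() without centering or scaling should give the same result as prcomp() with scaling turned on.

```r
set.seed(5)
# Simulated data: 25 samples (rows), 3 variables (columns) on different scales
x <- cbind(a = rnorm(25, sd = 1),
           b = rnorm(25, sd = 100),
           c = rnorm(25, sd = 10))

pca.scaled <- prcomp(x, scale. = TRUE)

# The same thing by hand, column by column
x.manual <- sweep(x, 2, colMeans(x), "-")
x.manual <- sweep(x.manual, 2, apply(x, 2, sd), "/")
pca.manual <- prcomp(x.manual, center = FALSE, scale. = FALSE)

max.diff <- max(abs(pca.scaled$sdev - pca.manual$sdev))   # essentially zero
```

The formal argument name in prcomp() is `scale.` (with a trailing dot); writing `scale=TRUE` works because R matches argument names partially.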

  • @shreyasi2
    @shreyasi2 4 years ago +1

    Thanks Josh, amazing video! One question: My PC1 and PC2 account for only 40% variation and I have to cluster my observations. Thus, I need to use 7 PCs that account for 75% variation, for a fair clustering. How should I go about that, and more so how should I show it graphically in a 2D space?

    • @statquest
      @statquest 4 years ago

      It depends. Again, the dimension reduction is sometimes not ideal, but you do the best you can. If PC1 and PC2 only account for 40% of the variation, you note that on the graph. Alternatively, you can plot pc1 vs pc2 and pc1 vs pc3 etc.

    • @shreyasi2
      @shreyasi2 4 years ago

      StatQuest with Josh Starmer thanks a lot for the reply Josh. The issue I am facing is with clustering. I am trying to run k means to cluster observations on these 7 PCs, but not really sure how to achieve that. Each of my clusters would be some combination of all 7 PCs, and I don’t think I can just take their means to describe each cluster. Is there a simpler way I can do this? Thanks again
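
One way the clustering question in this thread could be approached, as a sketch only: run k-means on the coordinates of however many PCs are needed to reach the target share of variation, then color an ordinary PC1 vs PC2 plot by cluster. The data are simulated, and both the 75% threshold and the choice of two clusters are assumptions taken from the question, not recommendations.

```r
set.seed(9)
# Simulated data: two groups of 30 observations on 12 variables
x <- rbind(matrix(rnorm(30 * 12, mean = 0), nrow = 30),
           matrix(rnorm(30 * 12, mean = 2), nrow = 30))

pca <- prcomp(x, scale. = TRUE)
cum.var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of PCs whose cumulative variation reaches 75%
k.pcs <- which(cum.var >= 0.75)[1]
scores <- pca$x[, 1:k.pcs, drop = FALSE]

# Cluster on the retained PC coordinates
km <- kmeans(scores, centers = 2, nstart = 20)

# Show the clusters in 2-D; the axes still only carry PC1 and PC2
plot(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

`km$centers` holds each cluster's mean position on the retained PCs, which is one way to describe the clusters.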

  • @amoloney_
    @amoloney_ 6 months ago

    This video has been super useful, but I do have a question:
    When using eigendecomposition to do PCA, there's a comment in the code saying we need to multiply by -1, as eigen() flips the x axis in this case. I assume this means that this step is not always necessary, but how do we know when we need to flip the axis?

    • @statquest
      @statquest 6 months ago +1

      The only reason we are flipping the axis is to make the graph look exactly the same as the others - and we do that for easy comparison. Otherwise, you never need to do it since the mirror image is equivalent and contains the exact same information.
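
A small sketch of the mirror-image point, on simulated data: PCA done with eigen() on the covariance matrix and PCA done with prcomp() give coordinates that agree column by column, except that any column may be multiplied by -1. Which columns get flipped depends on the numerical routine, so the sketch checks both possibilities.

```r
set.seed(21)
x <- matrix(rnorm(15 * 4), nrow = 15)   # simulated: 15 samples, 4 variables

pca <- prcomp(x)          # SVD-based PCA
e   <- eigen(cov(x))      # eigendecomposition of the covariance matrix

# PC coordinates from the eigenvectors
scores.eigen <- scale(x, center = TRUE, scale = FALSE) %*% e$vectors

# Each column matches prcomp() either as is or as its mirror image
same.up.to.sign <- sapply(1:4, function(j) {
  isTRUE(all.equal(scores.eigen[, j], unname(pca$x[, j]))) ||
    isTRUE(all.equal(scores.eigen[, j], -unname(pca$x[, j])))
})

# The eigenvalues equal the squared sdev values from prcomp()
var.diff <- max(abs(e$values - pca$sdev^2))
```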

  • @ibrahimlawan9663
    @ibrahimlawan9663 2 years ago +1

    Oh! You're great

  • @PeihuiBrandonYeo
    @PeihuiBrandonYeo 6 years ago +1

    This is amazing :)

  • @stephaniefaithravelo3510
    @stephaniefaithravelo3510 3 years ago

    Hello. I would like to ask if you standardize your data before doing the PCA? It seems that I did not see it but I could also be wrong.

    • @statquest
      @statquest 3 years ago

      In this case, all of the variables are on the same scale, so there is no need to standardize. However, if different variables have different scales, then you should standardize. I talk about this in this video: ruclips.net/video/oRvgq966yZg/видео.html

  • @Kickflip1904
    @Kickflip1904 4 years ago +1

    Hey, thanks for that nice tutorial!!!
    I have one question: How did you keep the gene names in the rotation data? When I do the last step to determine the top 10 genes, I never get the names of the genes...
    Thanks in advance!

    • @statquest
      @statquest 4 years ago

      I just re-ran my code and it worked. I believe you might be missing the following lines:
      pca.data

    • @ergosumdre
      @ergosumdre 4 years ago

      @@statquest I'm experiencing the same issue

  • @sayedelsedimy9988
    @sayedelsedimy9988 5 years ago +1

    thanks very vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvverrrrrrrrrrrrrrrrrrry much sir

    • @statquest
      @statquest 5 years ago +1

      BAAAAAAAMMMMMMM!!!!!

  • @Juanguaqueta
    @Juanguaqueta 2 years ago

    Hello dear Doctor Josh Starmer.
    I know that to do a PCA data ought to be scaled and centered.
    However, I have seen in a book that the authors do not scale and center data to plot loadings against wavelengths in r.
    Then, I am on the fence about this. I plot my loadings against wavelengths from two types of PCA in r:
    1). prcomp(data, scale = TRUE, center = TRUE)
    2). prcomp(data)
    Then, I am deeply concerned.
    Would you please help me with the next question?
    Which type of PCA should I do in order to plot loadings against wavelengths in the right way?

    • @statquest
      @statquest 2 years ago +1

      The data always needs to be centered if it is not already. If it is not, then you get problems. However, you don't always need to scale the data if it is all measured in the same units. For details on both of these aspects of PCA, see: ruclips.net/video/oRvgq966yZg/видео.html

    • @Juanguaqueta
      @Juanguaqueta 2 years ago +1

      @@statquest Thank you so much dear Doctor Josh Starmer.

  • @alaamuayad5877
    @alaamuayad5877 4 years ago +1

    sooooo nice.....thanks

  • @lelamakharadze9411
    @lelamakharadze9411 4 years ago +1

    Thank you Josh! I managed to make PCA and got an impression of my RNAseq data clustering. However, since I got an error 'cannot rescale a constant/zero column to unit variance' while trying to scale data, I left it out. What exactly does this problem mean and how could I fix it? no proper explanations can be found in google.. thanks !!

    • @statquest
      @statquest 4 years ago +1

      I believe that means that you have a column where every single measurement is the same. This means there is no variation in the data. Presumably this is for a gene that is not transcribed, so you should filter it out before you scale.
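
A sketch of the filtering step suggested in this reply, on simulated counts. The matrix `counts` and the all-zero "gene3" are made up to reproduce the situation: a column with no variation cannot be rescaled to unit variance, so it is dropped before calling prcomp() with scaling.

```r
set.seed(13)
# Simulated counts: 6 samples (rows) x 5 genes (columns)
counts <- matrix(rpois(30, lambda = 20), nrow = 6,
                 dimnames = list(paste0("sample", 1:6), paste0("gene", 1:5)))
counts[, "gene3"] <- 0     # a gene that is never transcribed

# prcomp(counts, scale. = TRUE) would stop here with
# "cannot rescale a constant/zero column to unit variance"

# Keep only the columns that actually vary
keep <- apply(counts, 2, var) > 0
filtered <- counts[, keep, drop = FALSE]

pca <- prcomp(filtered, scale. = TRUE)
```

If genes are stored in rows, as in the video, the same check would be applied to the rows before transposing.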

  • @anonymoustiger
    @anonymoustiger 3 years ago

    In the PCA Step by Step video, PC1 and PC2 correspond to Gene 1 and Gene 2. But here the PCs stand for the samples. Why is there a difference? With 100 genes I expected 100 PCs, with the PC with the highest percentage of variation being the gene responsible for the most variation between samples.

    • @statquest
      @statquest 3 years ago

      In both videos, the PCs are created from linear combinations of the genes (not the samples). In the "step-by-step" video, that was the "cocktail recipe". In this video, that cocktail recipe is the Loading Scores. The number of PCs is limited to the minimum of the number of samples (mice) and the number of features (genes).
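
The limit on the number of PCs can be seen from the dimensions of what prcomp() returns. The sketch below uses simulated data shaped like the example under discussion (100 genes, 10 samples); the values themselves are made up.

```r
set.seed(17)
# Simulated data: 100 genes (rows) x 10 samples (columns)
expr <- matrix(rpois(100 * 10, lambda = 100), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c(paste0("wt", 1:5), paste0("ko", 1:5))))

pca <- prcomp(t(expr), scale. = TRUE)   # samples must be rows for prcomp()

dim(pca$x)          # 10 x 10: one row per sample, one column per PC
dim(pca$rotation)   # 100 x 10: one loading score per gene, for each PC
```

So there are 10 PCs, not 100, and each PC is a recipe that involves all 100 genes. Because the data are centered, the last of the 10 PCs carries essentially no variation.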

  • @wei2674
    @wei2674 4 years ago

    I am a bit confused about sdev being used to calculate the percentage of variance explained by each PC. If the data points are centered based on the gene1 and gene2 values, does that guarantee that the projected points on the PC1 and PC2 axes are also centered on the origin? I am asking because in the video "PCA in R", sdev is used to calculate the percentage of variance explained by each PC, so I guess the mean must be 0 so that SS(distance)=n*sdev^2. Am I right?

    • @statquest
      @statquest 4 years ago

      In R, the prcomp() function returns the standard deviation of the data along each principal component (sdev). We square those values to get the variances.

    • @wei2674
      @wei2674 4 years ago

      StatQuest with Josh Starmer Thanks for the reply! But in the previous video PCA step by step, the variation explained by PC is calculated using eigenvalues, ie SS(distance to the origin), not SS(distance to the mean).

    • @wei2674
      @wei2674 4 years ago

      StatQuest with Josh Starmer Let me try to ask the question again. In this R example, percentage of variation explained by PC1 is calculated as: variance of PC1/ sum of variances of all PCs. However, in the video PCA step by step, percentage of variation explained by PC1 is calculated as: SS(distance to origin) for PC1/ sum of SS(distance to origin) for all PCs. There seems to be a gap here because variance is usually calculated using SS(distance to the mean), not SS(distance to the origin), and this is my confusion. Thanks!

    • @statquest
      @statquest 4 years ago

      @@wei2674 That's why it is important to center your data before PCA, then the mean = origin.

    • @wei2674
      @wei2674 4 years ago +1

      StatQuest with Josh Starmer Thanks Josh! I just figured the projected points are also centered from origin!
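
The conclusion of this thread can be checked numerically with a short sketch on simulated data. After centering, the projected points on every PC have mean zero, so the sum of squared distances to the origin equals (n - 1) times the variance. The divisor is n - 1 rather than n because prcomp() uses the usual sample variance; the percentages are the same either way since the factor cancels.

```r
set.seed(23)
x <- matrix(rnorm(12 * 4, mean = 50, sd = 5), nrow = 12)   # simulated: 12 samples, 4 variables

pca <- prcomp(x)      # centers each column by default
n <- nrow(x)

pca.var     <- pca$sdev^2                     # variance along each PC
pca.var.per <- pca.var / sum(pca.var) * 100   # percent of variation per PC

score.means <- colMeans(pca$x)                # all essentially zero
ss.origin   <- colSums(pca$x^2)               # SS(distances to the origin) per PC

ss.diff <- max(abs(ss.origin - (n - 1) * pca.var))   # essentially zero
```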

  • @ta2245
    @ta2245 4 years ago

    Hi! So, after performing PCA, which variables should I include in the model? The genes with positive loading scores? Are the genes that have positive loading scores correlated with the first principal component? Variables that are correlated with PC1 (i.e., Dim.1) are the most important in explaining the variability in the data set. So, should we keep the genes that have positive loading scores for the model?

    • @statquest
      @statquest 4 years ago +1

      You look at the magnitude of the loading scores. Loading scores with large negative values are also important. For more details, see: ruclips.net/video/FgakZw6K1QQ/видео.html

    • @ta2245
      @ta2245 4 years ago

      @@statquest Thanks a lot for your answer. I have seen that video also. For more clarification, could you please tell me which genes or variables you would keep based on your PCA results in this video?

    • @statquest
      @statquest 4 years ago

      @@ta2245 It's subjective. It depends on what you need and what you want to do. In other words, there's no fixed answer.

    • @ta2245
      @ta2245 4 years ago +1

      @@statquest Many thanks.

  • @AndresEOA
    @AndresEOA 5 years ago +1

    Hi from Spain Josh!!! Very nice video! Many thanks! I have one question: How could I do to circle in different color different treatment in the PCA ggplot??? (Grouped by wt and ko) I am trying to do it but can't. Many thanks

    • @statquest
      @statquest 5 years ago

      Glad you like my video! I'm terrible at ggplot. When I want to add annotation, I export the image and do it in another program.

    • @AndresEOA
      @AndresEOA 5 years ago +1

      @@statquest Many thanks for the response!!!

  • @adelutzaification
    @adelutzaification 7 years ago

    In your comments and elsewhere, I encountered the notions of "covariance" and "correlation" (matrix) in the context of PCA.
    I assume that it is the correlation between the different variables (cells in your example). This is my current gestalt. To me, at an intuitive level, correlation implies redundancy (as one variable can be predicted from the other), and therefore, allows the opportunity for simplification/dimension reduction performed by PCA.
    Reciprocally, correlation between the initial variables / cells also implies an intrinsic similarity/"core pattern" that is further revealed by PCA (in the form of "density clouds" when 2 principal components are plotted). How far out am I? 😉
    On a more practical side, I read about the so-called adequacy tests such as KMO, Bartlett’s sphericity tests (that test for some measure of correlation it seems) to determine whether PCA is pertinent as an analysis for a particular dataset. Do you apply these tests in your work or you know a priori that the type of experiment lends itself to PCA analysis?
    You also mention the QC of PCA in the scree plot example, and you indicate the importance of the first 2 PCs. I assume that the amount of variation a PC accounts for relates somehow to the scale/range of that PC: the wider the range, the better the clusters separate. The choice of the first 2 PCs seems like a good fit for gene expression, as it can separate clusters more readily. I also encountered analyses that use criteria such as Kaiser's criterion (stdev of a PC > 1) to select PCs for further analysis, such as factor analysis or clustering, to reveal more hidden patterns. Do you know of any such examples applied in biology/gene expression? Thank you for your patience 😊

  • @jmflowers74
    @jmflowers74 3 years ago

    This is great except that I'm not sure that the "rotation" matrix contains the loadings. According to the prcomp documentation the rotation matrix contains the eigenvectors, which they also call the "loadings" which is not correct.

    • @jmflowers74
      @jmflowers74 3 years ago

      My problem is that I would like to interpret the loadings in terms of correlation coefficients between the PC and my variables. I believe I need to divide the eigenvector (i.e., the "loadings" in rotation matrix) by the square root of the eigenvalue, which I believe is what sdev is. So for example, for PC1, I would do pca$rotation[,1] / pca$sdev[1]. Is this correct?

    • @statquest
      @statquest  3 years ago

      Yes! The documentation is confusing and I didn't help any by just repeating it. However, for the purposes of this video, both loadings and eigenvectors will give us the same results (the same rank order of the variables), so at least that's OK.

    • @jmflowers74
      @jmflowers74 3 years ago

      @@statquest Thanks for your quick reply. Yes, if all we want is the rank order of the "loadings" in terms of their contribution to a PC then I agree. In other words, if my solution is correct, dividing by the corresponding sdev is just a scaling factor.

    • @statquest
      @statquest  3 years ago

      @@jmflowers74 Yep.
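For anyone who wants the correlation-style loadings discussed in this thread, a quick numerical check is possible. The sketch below uses hypothetical toy data and relies on one property of PCA on standardized variables: the correlation between a variable and a PC equals the eigenvector entry multiplied (not divided) by that PC's standard deviation. Either way, the values within a PC are rescaled by a single positive constant, so the rank order of the variables does not change.

```r
set.seed(1)
# Hypothetical data: 30 samples (rows) x 4 variables (columns).
toy <- matrix(rnorm(30 * 4), nrow = 30, ncol = 4)
toy[, 2] <- toy[, 1] + rnorm(30, sd = 0.5)   # make two variables correlated
colnames(toy) <- paste0("var", 1:4)

pca <- prcomp(toy, scale. = TRUE)

# Eigenvector entries for PC1 (what prcomp() stores in $rotation).
eigvec.pc1 <- pca$rotation[, 1]

# Correlation-style loadings: eigenvector times sqrt(eigenvalue), i.e. sdev.
loadings.pc1 <- eigvec.pc1 * pca$sdev[1]

# Direct check: correlate each scaled variable with the PC1 scores.
direct.cor <- as.vector(cor(scale(toy), pca$x[, 1]))

all.equal(direct.cor, unname(loadings.pc1))
```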

  • @visionarynjy5491
    @visionarynjy5491 4 years ago

    Hey Josh, is the plot at 4:30 the same as what we would get with biplot? And why not use biplot in this video?

    • @statquest
      @statquest  4 years ago

      There is a lot of variety in the data that go into biplots - there isn't just one way to draw them. And when you have a lot of data, they get really cluttered, so that's why I did not draw one.

  • @t_thyme5845
    @t_thyme5845 3 years ago

    any advice on RDA, when to use RDA, and vegan package usage?

    • @statquest
      @statquest  3 years ago

      Are you asking about reading R files into python?

  • @geehach6886
    @geehach6886 4 years ago

    Hi @StatQuest with Josh Starmer, I am trying to follow the video exactly, but I am unable to get the ggplot() graph that is drawn here using X and Y. I created all the variables and they all seem fine, and the code runs, but no graph appears. Do you happen to know why?
    ggplot(data = P4, aes(x=X4, y=Y4, type = "b", label = sample), +
    geom_text() , +
    xlab(paste("PC1 - ", p4.vp[1], "%", sep = " ")), +
    ylab(paste("PC2 - ", p4.vp[2], "%", sep = " ")), +
    theme_bw(), +
    ggtitle(" My PCA graph for xyz "))
    The values I used are:
    :: pca4

    • @statquest
      @statquest  4 years ago

      Unfortunately debugging code in a youtube comment is pretty hard to do so I can't help you. :(
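Judging only from the code as pasted in the question, one likely culprit is the comma before each `+`, which keeps every layer inside the `ggplot()` call as an extra argument. In ggplot2, layers are added after the closing parenthesis of `ggplot()`. A minimal sketch of that layout, with a hypothetical data frame standing in for `P4` and made-up percentages standing in for `p4.vp`:

```r
library(ggplot2)

# Hypothetical stand-ins for the commenter's objects.
P4 <- data.frame(sample = paste0("s", 1:6),
                 X4 = c(-3.1, -2.8, -3.0, 2.9, 3.2, 2.8),
                 Y4 = c(0.4, -0.2, 0.1, -0.5, 0.3, -0.1))
p4.vp <- c(87.5, 4.2)

# Close ggplot() first, then chain each layer with a trailing "+".
p <- ggplot(data = P4, aes(x = X4, y = Y4, label = sample)) +
  geom_text() +
  xlab(paste("PC1 - ", p4.vp[1], "%", sep = "")) +
  ylab(paste("PC2 - ", p4.vp[2], "%", sep = "")) +
  theme_bw() +
  ggtitle("My PCA graph")

# Inside a script, function, or loop the plot is only drawn when printed.
print(p)
```

The `type = "b"` argument from the question is left out here because it is a base `plot()` option rather than a ggplot2 aesthetic.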

  • @1220MrCool
    @1220MrCool 3 years ago

    What is a loading score exactly?
    Also, if you were to find the top ten genes that account for the second highest variation (from PCA2), you would do:
    loading_scores

    • @statquest
      @statquest  3 years ago

      To learn about PCA and loading scores, see: ruclips.net/video/FgakZw6K1QQ/видео.html
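As a rough illustration of the PC2 question above, the same kind of ranking can be pointed at the second column of `$rotation`. This is a sketch on simulated data; the matrix `expr` and the gene names are hypothetical.

```r
set.seed(7)
# Hypothetical expression matrix: 50 genes (rows) x 8 samples (columns).
expr <- matrix(rpois(50 * 8, lambda = 100), nrow = 50, ncol = 8)
rownames(expr) <- paste0("gene", 1:50)
colnames(expr) <- paste0("sample", 1:8)

# prcomp() expects samples in rows, so transpose first.
pca <- prcomp(t(expr), scale. = TRUE)

# Column 2 of $rotation holds one value per gene for PC2.
loading_scores <- pca$rotation[, 2]
gene_scores <- abs(loading_scores)               # magnitude, ignoring sign
gene_score_ranked <- sort(gene_scores, decreasing = TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

pca$rotation[top_10_genes, 2]                    # signed values for the top 10
```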

  • @mikaylafeldbauer2182
    @mikaylafeldbauer2182 2 years ago

    If I normalize my data first using a method like median of ratios or quantile normalization, do I still need to use scale = TRUE in the prcomp call? (Btw thank you for making these amazing videos! They really help me understand difficult statistics topics)

    • @statquest
      @statquest  2 years ago +1

      As long as your data are all on the same scale, you do not need to normalize it.

    • @mikaylafeldbauer2182
      @mikaylafeldbauer2182 2 years ago +1

      @@statquest Thank you!
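To see what `scale = TRUE` changes, it can help to compare runs on variables with very different spreads. A sketch on hypothetical data (note that the full argument name in `prcomp()` is `scale.`, with a trailing dot; `scale = TRUE` works through R's partial argument matching):

```r
set.seed(3)
# Hypothetical data: two variables on very different scales.
toy <- cbind(small = rnorm(25, mean = 0, sd = 1),
             large = rnorm(25, mean = 0, sd = 1000))

pca.raw    <- prcomp(toy, scale. = FALSE)
pca.scaled <- prcomp(toy, scale. = TRUE)

# Without scaling, PC1 is dominated by the high-variance variable.
abs(pca.raw$rotation[, 1])

# With scaling, both variables contribute equally in magnitude.
abs(pca.scaled$rotation[, 1])

# scale. = TRUE matches standardizing the columns yourself beforehand.
pca.manual <- prcomp(scale(toy), scale. = FALSE)
all.equal(pca.scaled$sdev, pca.manual$sdev)
```

If the columns are already on comparable scales after normalization, the two runs should tell a similar story, which is one way to check whether scaling matters for a given dataset.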

  • @propyne717
    @propyne717 4 years ago +1

    Thanks for the tutorial, but I'm completely new to R. Can you tell me how to read custom data from a CSV file for PCA, and how to export the loading scores to another CSV/txt file?

    • @mikhaelaneelin8405
      @mikhaelaneelin8405 4 years ago +1

      read.csv("insert file path here", header = TRUE) to read in a CSV file
      write.csv(name, file="insert file path here") to export to CSV
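A fuller round trip might look like the sketch below. The file names and the example table are hypothetical, and a temporary directory is used only so the example is self-contained; the first column is assumed to hold gene names.

```r
# Hypothetical file names inside a temporary directory.
in.file  <- file.path(tempdir(), "my_data.csv")
out.file <- file.path(tempdir(), "loading_scores.csv")

# Create a small example CSV: first column holds gene names, the rest are samples.
set.seed(11)
example <- data.frame(gene = paste0("gene", 1:20),
                      matrix(rpois(20 * 6, lambda = 50), nrow = 20,
                             dimnames = list(NULL, paste0("sample", 1:6))))
write.csv(example, file = in.file, row.names = FALSE)

# Read it back, using the first column as row names so the data stay numeric.
my.data <- read.csv(in.file, header = TRUE, row.names = 1)

# Samples must be rows for prcomp(), so transpose the genes-by-samples table.
pca <- prcomp(t(as.matrix(my.data)), scale. = TRUE)

# Export the loading scores (one row per gene, one column per PC).
write.csv(pca$rotation, file = out.file)
```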

  • @blankaroje8853
    @blankaroje8853 4 years ago

    Hi, I couldn't find an answer to this: why did the randomly generated wt and ko samples cluster in the PCA if the same code generated them?
    Thank you for the videos :)

    • @statquest
      @statquest  4 years ago +1

      At 2:03 you'll see that I have two lines of code within the for loop that generate read counts and a third line of code that adds the read counts to the data.matrix. The first line creates 5 random values for the WT samples and the second line creates 5 random values for the KO samples. The key is that each time we go through the loop I pick a new value for lambda (the "rate" parameter) using the sample() function (this picks 1 number between 10 and 1000) and use that value for lambda for all 5 WT values. I then pick another value for lambda (through another call to sample()) and use that value for lambda for all 5 KO values. So the 5 WT values were generated with one value for lambda and the 5 KO values were generated with another value for lambda. As a result, the WT samples are more similar to each other than they are to the KO samples.
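The loop described in this reply can be sketched roughly as follows. The names `data.matrix`, `wt.values`, and `ko.values` follow the wording above; the count of 100 genes and the `set.seed()` call are assumptions added for the sketch.

```r
set.seed(123)  # only so the sketch is reproducible

data.matrix <- matrix(nrow = 100, ncol = 10)
colnames(data.matrix) <- c(paste("wt", 1:5, sep = ""), paste("ko", 1:5, sep = ""))
rownames(data.matrix) <- paste("gene", 1:100, sep = "")

for (i in 1:100) {
  # One lambda per gene, shared by all 5 WT samples...
  wt.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  # ...and a separately drawn lambda shared by all 5 KO samples.
  ko.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  data.matrix[i, ] <- c(wt.values, ko.values)
}

pca <- prcomp(t(data.matrix), scale. = TRUE)

# WT and KO samples should land on opposite sides of PC1.
round(pca$x[, 1], 1)
```

Because most genes get two different lambdas, the WT-versus-KO difference is the largest source of variation in the matrix, and PC1 picks it up.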

    • @blankaroje8853
      @blankaroje8853 4 years ago +1

      ​@@statquest great, thanks :)

  • @ishtiaquezaman197
    @ishtiaquezaman197 3 years ago

    ​ @StatQuest with Josh Starmer Hi Josh, just wanted to confirm, is it because of the fact that the number of samples are lower than the number of features/variables, therefore there is an upper bound for how many PCs we can expect in this example which is 10?

    • @statquest
      @statquest  3 years ago

      I answer this question in this video: ruclips.net/video/oRvgq966yZg/видео.html
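The upper bound asked about here is easy to check numerically: `prcomp()` returns at most min(number of samples, number of variables) components. A sketch on hypothetical simulated data:

```r
set.seed(5)
# Hypothetical data: 100 variables (genes) measured in only 10 samples.
expr <- matrix(rpois(100 * 10, lambda = 200), nrow = 100, ncol = 10)

pca <- prcomp(t(expr), scale. = TRUE)

ncol(pca$x)        # number of PCs returned
length(pca$sdev)

# Centering uses up one dimension, so the last PC carries essentially no variation.
signif(pca$sdev, 3)
```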

  • @saintofthesinners
    @saintofthesinners 2 years ago

    The "data frame" section keeps throwing the error "arguments imply differing number of rows", and I'm using the code from this example. Any ideas how to fix it?

    • @statquest
      @statquest  2 years ago

      I just re-ran the code and it worked fine so I'm not sure what's wrong on your end. Did you download the code as a file or copy and paste it? If you copied and pasted it, you might have skipped a line and that could be messing things up.

    • @saintofthesinners
      @saintofthesinners 2 years ago

      @@statquest copy and pasted, and I went through and had nothing missing. I’ve tried it a few times and haven’t been able to get it to work by adding the “Sample” column. The X and Y fill fine though.

  • @ravikrishnacheemakurthi7680
    @ravikrishnacheemakurthi7680 4 years ago

    PCA is generally done on numerical variables, but your data has a genes column, which is character. How do you handle this in the prcomp function? I was trying a similar experiment where I have MS-MS data with 100 proteins and 20 samples. One column holds the protein names, and when I run the prcomp function it displays: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric. In the final step, how did you get the top 10 gene names? If I omit that column, I get NULL in the loading score column. Sir, please explain this.

    • @statquest
      @statquest  4 years ago

      If you look at the code, you'll see that I transpose it with the t() function while passing it to prcomp(). Specifically...
      pca
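One way around the `'x' must be numeric` error is to keep the names as row names rather than as a column, so the matrix passed to `prcomp()` is entirely numeric. A sketch with a hypothetical protein table (the object names and sizes are made up for illustration):

```r
# Hypothetical table: protein names in the first column, samples in the rest.
set.seed(9)
raw <- data.frame(protein = paste0("prot", 1:12),
                  matrix(rnorm(12 * 5, mean = 20, sd = 3), nrow = 12,
                         dimnames = list(NULL, paste0("s", 1:5))),
                  stringsAsFactors = FALSE)

# Move the names into the row names so only numeric columns remain.
numeric.data <- as.matrix(raw[, -1])
rownames(numeric.data) <- as.character(raw$protein)

# Samples must be rows for prcomp(), hence t().
pca <- prcomp(t(numeric.data), scale. = TRUE)

# Row names carry through to $rotation, so the top names can be looked up.
ranked <- sort(abs(pca$rotation[, 1]), decreasing = TRUE)
top.names <- names(ranked)[1:10]
```

If the names column is simply dropped without being stored as row names, `$rotation` has no names to return, which may explain a NULL result when asking for the top names.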

  • @huichanghuang9580
    @huichanghuang9580 4 years ago +1

    Hi Josh, thanks for your amazing video!
    I got an error when doing PCA in R: "Error in prcomp.default(data1, center = T, scale. = T) :
    cannot rescale a constant/zero column to unit variance"
    I can understand that an all-zero column is invalid, but why a constant column as well?
    Thanks for your efforts in explaining.

    • @statquest
      @statquest  4 years ago

      My guess is that one of your columns doesn't have any variation in the data - i.e. every observation has the exact same value.

    • @huichanghuang9580
      @huichanghuang9580 4 years ago +1

      @@statquest I've checked my data again, and found it's actually because of the reason you explained. Thank you soooooo much!!!
      I've watched your videos for about one month, and they help me a lot!!

    • @statquest
      @statquest  4 years ago

      @@huichanghuang9580 Awesome! Good luck!!!
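A constant column fails for the same reason an all-zero one does: scaling divides each column by its standard deviation, and a column with no variation has a standard deviation of 0. A sketch of one way to find and drop such columns before PCA, on hypothetical data:

```r
set.seed(21)
# Hypothetical data: 8 samples (rows), one column with no variation at all.
toy <- cbind(a = rnorm(8), b = rnorm(8), flat = rep(5, 8), c = rnorm(8))

# prcomp(toy, scale. = TRUE) would stop here with the
# "cannot rescale a constant/zero column to unit variance" error.
col.sd <- apply(toy, 2, sd)
constant.cols <- names(col.sd)[col.sd == 0]

# Drop the constant columns, then run PCA on what is left.
toy.clean <- toy[, col.sd > 0, drop = FALSE]
pca <- prcomp(toy.clean, scale. = TRUE)
```

Dropping these columns loses nothing for PCA, since a variable that never changes cannot contribute to any direction of variation.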

  • @danielnakamura6430
    @danielnakamura6430 4 years ago

    If our variables are on different scales (e.g. temperature, body mass...), we should log-transform our data, right?

    • @statquest
      @statquest  4 years ago

      For the answer, see: ruclips.net/video/oRvgq966yZg/видео.html

  • @Elise_Barton
    @Elise_Barton 4 years ago

    Hi, great video! The link to the R code isn't working, is there any other way I could access it? thanks!

    • @statquest
      @statquest  4 years ago

      I just tried it and it worked, so there might be something strange going on. Here it is again: github.com/StatQuest/pca_demo/blob/master/pca_demo.R

    • @Elise_Barton
      @Elise_Barton 4 years ago

      @@statquest Hi, thanks very much. I just tried it again and it worked, so it must have been a problem on my end, sorry! Could I just ask how I would go about incorporating my own dataset into this code? Do I have to make it into a data matrix or do I just import my dataset? Thanks again

    • @statquest
      @statquest  4 years ago

      @@Elise_Barton It depends on your data. You may be able to use it without much tinkering. However, one thing to be aware of is that prcomp() expects the rows to be the "samples" and the columns to be the "variables" (or features). In the video (and in the example code) the data are the opposite (columns are samples and rows are variables), which is why I use the transpose function, t(). If your data already have samples as rows, you do not need to use t() like I do in this tutorial.

  • @dekhangmigmar5403
    @dekhangmigmar5403 4 years ago

    Hi
    I would like to know what to do if I have more than one variable for the wt and ko samples (for example, an experiment with read counts plus some other variables, like nucleotide coverage). Where do I need to make changes?

    • @statquest
      @statquest  4 years ago

      I have some tips on mixing data in this video: ruclips.net/video/oRvgq966yZg/видео.html