StatQuest: PCA - Practical Tips

  • Published: 19 Dec 2024

Comments •

  • @statquest
    @statquest  2 years ago +5

    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @buihung3704
    @buihung3704 1 year ago +19

    This is a gold mine for Data Scientists, Data Engineers, and ML/DL engineers. I can hardly think of anyone else who could teach the same concepts more clearly.

    • @statquest
      @statquest  1 year ago +1

      Thank you very much! :)

  • @geethanjalikannan5527
    @geethanjalikannan5527 5 years ago +54

    Dear Josh, I had so many issues with stats as I am from a totally different background. Watching your videos helped me overcome my insecurities. Thank you so much.

    • @statquest
      @statquest  5 years ago +5

      Hooray! I'm glad the videos are helpful.

  • @caperucito5
    @caperucito5 5 years ago +12

    Josh's videos are so cool that I usually like them before watching.

  • @Jason-xe4tt
    @Jason-xe4tt 6 years ago +4

    All profs in the world need to learn how to teach from you! Thanks!

  • @stevenmugishamizero8471
    @stevenmugishamizero8471 1 year ago +3

    The best on this platform hands down🙌

  • @iloveno3
    @iloveno3 6 years ago +7

    The intro with you singing is so cute, made me smile...

  • @shwetankagrawal4253
    @shwetankagrawal4253 5 years ago +16

    Your initial music always makes me smile 😂😂

  • @bendiknyheim6936
    @bendiknyheim6936 3 years ago +4

    Thank you for all the amazing videos. I would be having a really hard time without them

    • @statquest
      @statquest  3 years ago +1

      Glad you like them!

  • @jesusfranciscoquevedoosegu4933
    @jesusfranciscoquevedoosegu4933 6 years ago +4

    Thank you so much for, basically, all your videos on PCA

    • @statquest
      @statquest  6 years ago

      You're welcome!!! I'm glad that you like them. :)

  • @yurobert3007
    @yurobert3007 1 year ago +2

    This PCA series (step-by-step, practical tips, then R) is brilliant! I found it very helpful. Thank you for these great videos!
    Would you consider doing a series on factor analysis?

    • @statquest
      @statquest  1 year ago +3

      Thanks! One day I hope to do a series on factor analysis.

  • @paulotarso4483
    @paulotarso4483 3 years ago +1

    Hey Josh thx so much for your videos... 3 quick questions:
    1. 7:54 says "if there are fewer samples than variables, the number of samples puts an upper bound on the number of PCs with eigenvalues greater than 0", but in the example there, the number of samples is equal to the number of variables, not less. Should the statement be "if # of samples

    • @statquest
      @statquest  3 years ago +1

      1) What matters is that there is an upper bound and it depends on the number of variables and the number of samples, and that means we can actually write it both ways: "if # of samples

    • @ptflecha
      @ptflecha 3 years ago +1

      Thanks so much!!

  • @AakashOnKeys
    @AakashOnKeys 1 year ago +1

    Thanks for the headsup! Very helpful!

  • @johnfinn9495
    @johnfinn9495 1 year ago +2

    Very nice videos. Have you considered a segment on kernel PCA?

    • @statquest
      @statquest  1 year ago +1

      I'll keep that in mind.

  • @boultifnidhal2600
    @boultifnidhal2600 2 years ago

    Thank you so much for switching to math and reading, because the genes-and-cells examples were giving me headaches. Nevertheless, thank you so much for your efforts ♥♥

  • @ylazerson
    @ylazerson 6 years ago +3

    Fantastic video once again!

    • @statquest
      @statquest  6 years ago

      Hooray!!!! I'm glad you're enjoying them. :)

  • @Patrick881199
    @Patrick881199 4 years ago +1

    Hi Josh, I am a little confused: at 2:37 you mentioned using the standard deviation. Well, if we have math scores (0-100) with a standard deviation of 5 and, at the same time, the reading scores (0-10) also have an sd of 5, then by dividing by the sd, math and reading are still NOT on the same scale.

    • @statquest
      @statquest  4 years ago

      Regardless of the original scale, if you divide each value in a set of measurements by the standard deviation of that set, the standard deviation of the new values will be 1. And that puts all variables on the same scale.

    • @statquest
      @statquest  4 years ago

      For more details, see: stats.idre.ucla.edu/stata/faq/how-do-i-standardize-variables-in-stata/

    • @Patrick881199
      @Patrick881199 4 years ago +1

      @@statquest Thanks, Josh
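
    A minimal R sketch of the scaling step discussed in this thread (the subjects and numbers are made up for illustration): dividing each variable by its own standard deviation gives every variable a standard deviation of 1, so math scores on a 0-100 scale and reading scores on a 0-10 scale end up directly comparable.

      # toy data: math on a 0-100 scale, reading on a 0-10 scale (made-up numbers)
      math    <- c(90, 70, 60, 85, 55)
      reading <- c(9.0, 7.5, 6.0, 8.5, 5.5)

      scaled_math    <- math / sd(math)       # divide each variable by its own standard deviation
      scaled_reading <- reading / sd(reading)

      sd(scaled_math)     # 1
      sd(scaled_reading)  # 1 -- both variables are now on the same scale
      # prcomp(cbind(math, reading), center = TRUE, scale. = TRUE) does the same thing for you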

  • @urd4651
    @urd4651 3 years ago +1

    well explained !!!!! thank you very much!

  • @arifahafdila5531
    @arifahafdila5531 3 years ago +1

    Thank you so much for the videos 👍

  • @mrweisu
    @mrweisu 4 years ago

    At 6:19, even though the two points are on a line, does the line necessarily go through (0,0)? If not, there can still be two PCs. Can you help clarify? Thanks.

    • @statquest
      @statquest  4 years ago

      PC1 always goes through the origin. That's why we center the data to begin with.

    • @mrweisu
      @mrweisu 4 years ago

      @@statquest Yes, but the line connecting the two points might not.

    • @statquest
      @statquest  4 years ago

      If the data are centered, then the line connecting the two points will go through the origin. If they are not centered, then, technically, you are correct, we will have 2 PCs - but neither PC will reflect the relationships in the data as well as the single PC derived from the centered data.

    • @mrweisu
      @mrweisu 4 years ago +1

      @@statquest does centering data make the connecting line go through (0,0)?

    • @statquest
      @statquest  4 years ago

      @@mrweisu Yes

  • @basharabdulrazeq4349
    @basharabdulrazeq4349 7 months ago

    Hello Josh. @ 7:57 you explained that if there are fewer samples than variables, then the number of samples puts an upper bound on the number of PCs. In the last example, there are 3 samples and 3 variables (therefore the number of samples isn't fewer than the number of variables), and the number of PCs should be 3 (not 2). Could you explain why you decided that the number of PCs should be 2? (BTW I watched all of your videos about PCA, but I don't understand this specific example.)

    • @statquest
      @statquest  7 months ago

      The answer to your question starts at 5:09. The key is that we don't include PCs that have an eigenvalue = 0. If there's no variation in a direction, then there is no need for an axis in that direction. Thus, 3 data points can only define a 2-dimensional plane, so PC3 will have an eigenvalue = 0 and we can exclude it.

    • @basharabdulrazeq4349
      @basharabdulrazeq4349 7 months ago

      @@statquest I agree, but I can still see variation in a third direction. I just can't comprehend the idea that there isn't, because there are three variables for three students and all of them change with each other. I'd be so grateful if you could prove, or point me to a source with a proof, that there shouldn't be a PC3, because I really need to comprehend the idea.

    • @statquest
      @statquest  7 months ago +1

      @@basharabdulrazeq4349 For each student, we have 3 values for the 3 variables that represent a single point in the 3-dimensional space. Thus, we have 3 points in the 3-dimensional space, one per student. 3 points define a plane, which is only a 2-dimensional space. Thus, only 2 PCs can possibly have eigenvalues > 0.

    • @basharabdulrazeq4349
      @basharabdulrazeq4349 7 months ago +1

      @@statquest Thanks a lot. I understand now that no matter which direction you arrange any three points, they will always lie on the same plane.
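
    A quick R illustration of the point made in this thread, with made-up scores for 3 students on 3 subjects: because 3 centered points always lie in a plane, the third eigenvalue comes out as (numerically) 0, so only 2 PCs are worth keeping.

      # made-up scores: 3 students (rows) x 3 subjects (columns)
      scores <- matrix(c(90, 60, 30,
                         85, 55, 35,
                         70, 80, 95),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(paste0("student", 1:3), c("math", "reading", "gym")))

      pca <- prcomp(scores, center = TRUE)
      pca$sdev^2   # the eigenvalues: the 3rd one is 0 (up to floating-point error)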

  • @sane7263
    @sane7263 1 year ago

    Great Video Josh!
    I am wondering about 7:32, "Find the line perpendicular to PC1 that fits best": what does this mean?
    I mean, you can either have a perpendicular line or a best-fit line.

    • @statquest
      @statquest  1 year ago +1

      When you have more than 2-dimensions, then the first perpendicular line can rotate around PC1 and still be perpendicular. Thus, any line in that plane will be perpendicular. For more details, see: ruclips.net/video/FgakZw6K1QQ/видео.html

    • @sane7263
      @sane7263 1 year ago

      @@statquest Thanks for the lightning-fast reply Josh!
      I have already seen that video, and after watching this one I had the same question.
      If PC2 (a line) passes through PC1 (another line) perpendicularly, i.e., at 90 degrees, how can it rotate and still maintain that angle?

    • @statquest
      @statquest  1 year ago +1

      @@sane7263 If we have 3 dimensions, PC1 can go anywhere. PC2, however, can go anywhere in the plane that is perpendicular to PC1, and PC3 has no choice but to be perpendicular to both PC1 and PC2. I try to illustrate this here: ruclips.net/video/FgakZw6K1QQ/видео.html

    • @sane7263
      @sane7263 1 year ago

      @@statquest Ahh! I see!
      So if we have a 2D plane PC1 can go anywhere but in this case PC2 will have no choice but to be perpendicular. Right?
      I think now I got it.

    • @statquest
      @statquest  1 year ago +1

      @@sane7263 That's right. When we only have 2-dimensions, then the first line can go anywhere, but once that is determined, the second line has no choice. When we have 3-dimensions, things are a little more interesting for the second line.

  • @doubletoned5772
    @doubletoned5772 5 years ago

    I have a trivial question at 1:39 . If the recipe to make PC1 is using approx 10 parts Math and only 1 part reading, why does that mean that Math is '10' times more important than Reading to explain the variation in data? I mean I understand that it will be more important but is that specific number (10) correct?

    • @statquest
      @statquest  5 years ago

      I think my wording may have been sloppy here.

  • @samirsivan8134
    @samirsivan8134 7 months ago +1

    I love StatQuest ❤

  • @mostafael-tager8908
    @mostafael-tager8908 4 years ago +1

    Thanks for the video, but I think there is a simple mistake at 2:08 when you said to mix 0.77 math with 0.77 reading. I thought that both must add up to 1, or did I get something wrong?

    • @statquest
      @statquest  4 years ago +3

      0.77 for math and 0.77 for reading represent 2 sides of a triangle that has been normalized so that the hypotenuse = 1. In other words, using the Pythagorean theorem, sqrt(0.77^2 + 0.77^2) = 1. For more details about this, see minute 11 and second 16 in this video: ruclips.net/video/FgakZw6K1QQ/видео.html

  • @samggfr
    @samggfr 1 year ago +1

    Hi Josh. Thanks for your videos, especially when you are diving into details and tips.
    In tip #2 concerning centering, you show 2 sets of 3 points and you present centering on the mean. Let's imagine an experiment with 3 patients on drug A and 3 patients on drugs A and B. Let's say the lower/left set is the reference, drug A, and the upper/right set is the test, drug A+B. What about centering on A (so that set A sits at the origin)? That centering should show the total effect of adding drug B to drug A, whereas mean centering shows half the effect. In the same vein, the variables plot should show the variables that change from the drug A set to the drug A+B set instead of showing variables that change from the mean experiment, i.e. ((drugA + drugAB)/2). What's your view?

    • @statquest
      @statquest  1 year ago +1

      Centering using all of the data does not change the relationship between the two groups of points - they are still the same distance apart from each other, and the eigenvalue will reflect this and give you a sense of how different A is from A+B.

    • @samggfr
      @samggfr 1 year ago

      @@statquest Thanks for your reply concerning the distance, which I might interpret as the effect size. Could you tell me your view concerning the plot of variables?

    • @statquest
      @statquest  1 year ago +1

      @@samggfr I'm not 100% certain I understand your question about the variables plot, but the loadings for the variables on PC1 will tell you which variables have the largest influence in causing variation in that direction.

  • @joyousmomentscollection
    @joyousmomentscollection 5 years ago +1

    Thanks Josh... If your data contains one-hot encoded data (transformed from categorical data) and discrete data along with continuous data types, what kind of scaling would be preferred before applying the PCA technique?

    • @statquest
      @statquest  5 years ago

      It may be better to use lasso or elastic-net regularization, rather than PCA, to select the variables that are most important. Regularization can remove variables that are not useful for making predictions. If you're interested in this subject, I have several videos on it. Just look for "regularization" on my video index page: statquest.org/video-index/
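
    If you want to try the regularization route suggested above, here is a hedged sketch using the glmnet package with made-up data (alpha = 1 gives the lasso; values between 0 and 1 give elastic-net):

      library(glmnet)

      set.seed(4)
      x <- matrix(rnorm(100 * 20), nrow = 100)   # made-up predictors (one-hot columns work the same way)
      y <- x[, 1] - 2 * x[, 2] + rnorm(100)      # made-up outcome driven by the first two columns

      cv_fit <- cv.glmnet(x, y, alpha = 1)       # alpha = 1 -> lasso; 0 < alpha < 1 -> elastic-net
      coef(cv_fit, s = "lambda.min")             # variables shrunk to exactly 0 were not useful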

  • @mjifri2000
    @mjifri2000 5 years ago +1

    Man ;
    you are the best.

  • @mehreensoomro3545
    @mehreensoomro3545 1 month ago

    Hi Professor Starmer, many thanks for your videos. Are the loadings from a PC the same as beta coefficients per variable, and can they be used as effect estimates?

    • @statquest
      @statquest  1 month ago

      To learn more about how loadings are interpreted, see: ruclips.net/video/FgakZw6K1QQ/видео.html

  • @vigneshvicky6720
    @vigneshvicky6720 3 years ago +1

    You're using data points to get the PCs, but in general we use the covariance matrix to get the PCs. Why?

    • @statquest
      @statquest  3 years ago +1

      The old way to do PCA was to use a covariance matrix. However, no one does that anymore. Instead, we apply Singular Value Decomposition directly to the data.

    • @vigneshvicky6720
      @vigneshvicky6720 3 years ago +2

      @@statquest tq love frm india💖
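
    A small R sketch of the two routes described above, using a made-up data matrix: applying svd() to the centered data and applying eigen() to the covariance matrix recover the same principal component directions (up to sign) and the same eigenvalues.

      set.seed(42)
      X  <- matrix(rnorm(20 * 3), nrow = 20)          # made-up data: 20 samples, 3 variables
      Xc <- scale(X, center = TRUE, scale = FALSE)    # center each column

      sv <- svd(Xc)          # modern route: SVD applied directly to the centered data
      ev <- eigen(cov(X))    # old route: eigen-decomposition of the covariance matrix

      sv$v[, 1]               # PC1 direction from SVD
      ev$vectors[, 1]         # PC1 direction from the covariance matrix (same, up to sign)
      sv$d^2 / (nrow(X) - 1)  # eigenvalues recovered from the singular values
      ev$values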

  • @rezkyilhamsaputra8472
    @rezkyilhamsaputra8472 1 year ago

    Are these tips also applied in principal component regression (PCR)?

    • @statquest
      @statquest  1 year ago +1

      They apply to any time you want to use PCA, so yes, they would also apply to PCR.

    • @rezkyilhamsaputra8472
      @rezkyilhamsaputra8472 1 year ago

      @@statquest And if the software gives us the option of whether or not to center and/or scale the data, is there a condition where we shouldn't center/scale the data, or must we always do it?

    • @statquest
      @statquest  1 year ago

      @@rezkyilhamsaputra8472 I can't think of a reason you wouldn't want to center your data. Scaling depends on the data itself. If it's already on the same scale, you might not want to do it.

    • @rezkyilhamsaputra8472
      @rezkyilhamsaputra8472 1 year ago +1

      @@statquest alright, thank you so much for the crystal clear explanation!

  • @mojojojo890
    @mojojojo890 3 years ago

    If you could explain why the first PC is the eigenvector, that would be nice.
    I know that eigenvectors are the vectors whose span doesn't change after a transformation, even if they are scaled... so what exactly is the transformation applied here?

    • @statquest
      @statquest  3 years ago

      I explain eigen vectors in this video: ruclips.net/video/FgakZw6K1QQ/видео.html

    • @mojojojo890
      @mojojojo890 3 years ago

      @@statquest I watched that video, but it does not explain why the first PC is an eigenvector.

    • @statquest
      @statquest  3 years ago

      @@mojojojo890 Ah, I see. First, that video focuses on Singular Value Decomposition, which is the modern way to do PCA and doesn't actually involve calculating eigenvectors. However, the old method does, by applying eigen-decomposition to the variance-covariance matrix of the raw data. And in the old method, PC1 is an eigenvector of the variance-covariance matrix. In other words, if the variance-covariance matrix is V, then V x PC1 = eigenvalue * PC1, which makes PC1 an eigenvector.

    • @mojojojo890
      @mojojojo890 3 years ago +1

      @@statquest That sends me somewhere to find my answer... Thanx a lot !!
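
    Here is a tiny R check of the relationship written out above (V x PC1 = eigenvalue x PC1), using made-up data:

      set.seed(1)
      X   <- matrix(rnorm(30 * 4), nrow = 30)   # made-up data: 30 samples, 4 variables
      V   <- cov(X)                             # the variance/covariance matrix
      pca <- prcomp(X, center = TRUE)

      pc1    <- pca$rotation[, 1]   # PC1, as a unit vector of loadings
      lambda <- pca$sdev[1]^2       # its eigenvalue

      V %*% pc1      # ...equals...
      lambda * pc1   # ...the eigenvalue times PC1, so PC1 is an eigenvector of V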

  • @paulohmarco
    @paulohmarco 4 years ago

    Hi professor Josh Starmer,
    Thanks a lot for your videos. This a very joyful way to teach these methods!
    Please, let me ask you: I am giving an online lecture about PCA in Brazil, in Portuguese, and I would like to ask your permission to use some of your examples to teach PCA. Of course, I will credit your StatQuest channel.
    Thanks in advance!

    • @statquest
      @statquest  4 years ago

      Feel free to use the examples and cite the video.

  • @alexisvivoli8963
    @alexisvivoli8963 4 years ago +1

    Hi Josh! Thanks a lot for your videos. I can visualize very well how it works for 3 variables thanks to your animation, but I'm struggling to understand what happens if you add more variables: how do you project/center and calculate the PCs, since you can't visualize more than 3 dimensions? So how does it work if you have, let's say, 4 or even more, like 100 variables? Thanks!

    • @statquest
      @statquest  4 years ago +1

      If I have one variable, called var1, then I can center it by calculating the mean for var1 and subtracting that from each value.
      If I have two variables, var1 and var2, I can center the data by calculating the mean for var1 and subtracting that from all of the var1 values and calculating the mean for var2 and subtracting that from all of the var2 values.
      If I have 3 variables, var1, var2 and var3, then I can center it by calculating the mean for var1 and subtracting that from all of the var1 values, calculating the mean for var2 and subtracting that from all of the var2 values and calculating the mean for var3 and subtracting that from all of the var3 values.
      If I have 4 variables, var1, var2, var3 and var4, then I can center it by calculating the mean for var1 and subtracting that from all of the var1 values, calculating the mean for var2 and subtracting that from all of the var2 values, calculating the mean for var3 and subtracting that from all of the var3 values and calculating the mean for var4 and subtracting that from all of the var4 values.
      If I have N variables, var1, var2, var3 ... varN, then I can center it by calculating the mean for var_i and subtracting that from all of the var_i values, where I is a value from 1 to N. etc. Does that make sense?

    • @jasperli7794
      @jasperli7794 2 years ago

      @@statquest Thanks, I understand this idea of centering the data for all variables. But then how do you draw the principal components, for all the variables, beyond 3? After you draw principal component 1 through the origin (so that it best fits the data, using SVD etc.), and place principal component 2 through the origin perpendicular to it, and principal component 3 perpendicular to both 1 and 2, how do you continue placing principal components perpendicular to the first 3? Is there an explanation for further principal components which does not rely on the restrictions of the physical 3D world? Thank you very much!

    • @statquest
      @statquest  2 years ago

      @@jasperli7794 It's just relatively abstract math, which isn't limited to 3-dimensions. However, the concepts are the same, regardless of the number of dimensions.

    • @jasperli7794
      @jasperli7794 2 years ago +1

      @@statquest Okay, so if I understand correctly, the principal components capture various axes which are related to each other by position, and which explain (decreasing amounts of) variance within the data and the relative contributions of each feature/variable to each principal component. Thanks!

    • @statquest
      @statquest  2 years ago

      @@jasperli7794 Yep!
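
    The centering recipe spelled out earlier in this thread (subtract each variable's mean from that variable's values) is one line in R no matter how many variables there are; a sketch with a made-up 100-variable matrix:

      set.seed(3)
      X <- matrix(rnorm(10 * 100, mean = 50, sd = 10), nrow = 10)   # 10 samples, 100 made-up variables

      X_centered <- scale(X, center = TRUE, scale = FALSE)   # subtracts each column's mean
      round(colMeans(X_centered), 10)                         # every variable now has mean 0

      pca <- prcomp(X, center = TRUE)   # prcomp() will do the centering for you as well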

  • @k_a_shah
    @k_a_shah 1 year ago

    Which application is used to plot this graph?
    Or any software?

    • @statquest
      @statquest  1 year ago

      I give all my secrets away in this video: ruclips.net/video/crLXJG-EAhk/видео.html

  • @lucaliberato
    @lucaliberato 3 years ago

    Hello Josh, I have a question. You say that, to find a 3rd PC, we should find a line perpendicular to PC1 and PC2, and that it's not possible. But in the first video you say we can find a PC3 that goes through the origin and is perpendicular to PC1 and PC2. I lost something in the video for sure, can you help me please?

    • @statquest
      @statquest  3 years ago +1

      In the first video, we have enough data points on the graph that we can meaningfully create 3 axes. However, in this example, we don't have enough data to do that. The point being made in this video is that the maximum number of PCs can be limited by the number of data points. So, even if you have 3-D data, if you only have 2 points, then you will only have 1 PC, because 2 points only define a specific line. We need 3 points to define a specific plane (for 2 pcs) and we'd need 4 points to define 3 PCs etc.

    • @lucaliberato
      @lucaliberato 3 years ago +1

      @@statquest Thank you so much Josh, you're super 😎

  • @reytns1
    @reytns1 6 years ago +1

    Another question regarding PLS: as I understand it, PLS is a regression on top of PCA. Is that right? And can PLS regress only one variable (I mean the Y variable)? Umm, another question: if you have a lot of traits in a PCA, is there a statistic that shows me which trait is the most important and which is the second most important for that PCA? I mean, not the eigenvalue? Thanks

    • @statquest
      @statquest  6 years ago +1

      Partial Least Squares (PLS) and Principal Component Regression (PCR) are both ways to combine regression with PCA, as a way to avoid overfitting the model (if there are more variables than samples, you'll overfit your model and your future predictions will be bad). PCR does PCA on just the variables (the measurements used to predict something). As a result, it focuses on the variables responsible for most of the variation in the data. In contrast, PLS does PCA on both the variables and the thing you want to predict. This makes PLS focus on variation in the variables as well as variables that correlate with the thing you want to predict.
      As for statistics on which variable is the most important for PCA (other than just looking at the loading scores), you could probably use bootstrapping, but, at least at this time, I don't have a lot of experience with this.
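
    If you want to compare the two approaches described above, the pls package provides both; a hedged sketch with made-up data (30 samples, 50 predictors, 10 components):

      library(pls)

      set.seed(6)
      dat <- data.frame(y = rnorm(30), matrix(rnorm(30 * 50), nrow = 30))   # made-up outcome + predictors

      pcr_fit <- pcr(y ~ ., ncomp = 10, data = dat, scale = TRUE, validation = "CV")   # PCA on the predictors only
      pls_fit <- plsr(y ~ ., ncomp = 10, data = dat, scale = TRUE, validation = "CV")  # components chosen with y in mind

      validationplot(pcr_fit)   # pick the number of components with the lowest cross-validation error
      validationplot(pls_fit)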

  • @misseghe3239
    @misseghe3239 4 years ago

    Can PCA be used for regression problems or only classification problems? Thanks.

    • @statquest
      @statquest  4 years ago

      There are actually several types of regression that use PCA. PCA reduces the number of variables in your model.

  • @jo91218
    @jo91218 6 years ago

    Great video! Quick question: would converting raw scores into z-scores both center and scale my data? Thanks!
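
    A note on the z-score question above: converting to z-scores subtracts each variable's mean and divides by its standard deviation, so it does perform both the centering and the scaling described in the video; a small R check with made-up numbers:

      math <- c(90, 70, 60, 85, 55)             # made-up scores
      z    <- (math - mean(math)) / sd(math)    # the z-score transformation

      mean(z)   # ~0 -> centered (up to floating-point error)
      sd(z)     # 1  -> scaled
      # scale(math) does the same thing in one call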

  • @Asia25Asia
    @Asia25Asia 3 years ago

    Hi Josh! Thank you for your videos. Could you please give some hints about what to do with NA (not available) values in PCA? How do I deal with them? Additionally, what is better to use as an input to PCA: raw data (abundance) or relative abundance (percentage)?

    • @statquest
      @statquest  3 years ago +1

      You can try to impute the missing values. And depending on what you want to show, it can be better to use raw data or some sort of transformed version.

    • @Asia25Asia
      @Asia25Asia 3 years ago

      @@statquest Thanks a lot for quick response. Can you recommend some easy and friendly function for imputing biological data?

    • @statquest
      @statquest  3 years ago +1

      @@Asia25Asia Not off the top of my head.

    • @Asia25Asia
      @Asia25Asia 3 years ago

      @@statquest OK, no problem :)

  • @kartikmalladi1918
    @kartikmalladi1918 1 year ago

    What is the need for PCA if you can use average scores as the contribution?

    • @statquest
      @statquest  1 year ago

      My main PCA video gives a reason to use it: ruclips.net/video/FgakZw6K1QQ/видео.html

    • @kartikmalladi1918
      @kartikmalladi1918 1 year ago

      @@statquest I mean I've gone through your videos. Great work, by the way. The main goal of PCA is to understand the contribution of each variable to a sample. However, finding the average of each variable and its percentage contribution still gives some idea. So how is this average contribution different from the PCA loading scores?

    • @statquest
      @statquest  1 year ago

      @@kartikmalladi1918 What time point in the video, minutes and seconds are you asking about?

    • @kartikmalladi1918
      @kartikmalladi1918 1 year ago

      @@statquest it's from the main PCA video, discussing about loading score contribution.

    • @statquest
      @statquest  1 year ago

      @@kartikmalladi1918 What time point?

  • @Amf313
    @Amf313 4 years ago

    How should we scale variables that don't have clear upper or lower bounds?
    For example, if our 2 variables are human height and weight...
    is it rational to scale them based on the maximum height and weight existing in the whole sample?
    What if we have just one person weighing above 100 kg, and his weight is 160 kg?
    If we drop only this sample from the data, the scale and the whole PCA will differ significantly. So is it rational to base the variable's scale on the max and min of the values existing in the sample (for such variables without intrinsic upper and lower bounds)?
    🤔

    • @statquest
      @statquest  4 years ago

      You scale the data based on the data itself, not theoretical bounds.

  • @addisonmcghee9190
    @addisonmcghee9190 4 years ago

    So Josh, would the upper bound for the number of principal components be min{# of variables, # of samples - 1}?

    • @statquest
      @statquest  4 years ago

      I answer this question at 3:30

    • @addisonmcghee9190
      @addisonmcghee9190 4 years ago

      @@statquest
      Ok, so if we had 2 students and 5 variables, wouldn't we only have 1 principal component? These are two points in a 5-dimensional space, so it would be a line, right?
      So, (# of samples - 1)?
      I'm just trying to find a pattern...

    • @statquest
      @statquest  4 years ago

      @@addisonmcghee9190 That is correct.

    • @addisonmcghee9190
      @addisonmcghee9190 2 years ago +1

      @@statquest Revisiting this comment because I'm learning about PCA in G-school....old StatQuest for the win!
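
    A quick R confirmation of the pattern worked out in this thread: with 2 samples and 5 variables, only 1 PC ends up with an eigenvalue above 0, i.e. the number of useful PCs is bounded by min(# of variables, # of samples - 1).

      set.seed(7)
      X <- matrix(rnorm(2 * 5), nrow = 2)   # 2 samples (rows), 5 made-up variables (columns)

      pca <- prcomp(X, center = TRUE)
      pca$sdev^2   # only the first eigenvalue is > 0; the rest are numerically 0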

  • @shubhamgupta6567
    @shubhamgupta6567 4 years ago

    Can you make a video on partial least squares regression, please?

  • @urjaswitayadav3188
    @urjaswitayadav3188 6 years ago

    Thanks for the video Joshua! Would you please consider doing a video on hypergeometric distribution and hypergeometric test? I have seen that it is often used to check the significance of overlaps between lists generated by high throughput analyses, but I am always confused on how to set it up when I have to do one myself. Thanks a lot!

  • @namithacherian1743
    @namithacherian1743 2 years ago

    DOUBT: When there are only 2 points, you mentioned that you can fit only one line through them (correct). However, there is no guarantee that it will pass through the origin. In other words, when there are only 2 points, you can draw a line that goes through the origin and one data point for sure, but having it pass through both data points is a matter of chance. Right?

    • @statquest
      @statquest  2 years ago

      What time point, minutes and seconds, are you asking about?

  • @reytns1
    @reytns1 6 years ago

    I have a question: Could I enter a percentage value in order to obtain a PCA?

  • @giosang1111
    @giosang1111 4 years ago

    Hi, is it always true that to scale the values of the variables we divide the values by their SDs?

    • @statquest
      @statquest  4 years ago +1

      For PCA, yes.

    • @giosang1111
      @giosang1111 4 years ago

      Thanks! Can you make a video that summarizes which statistical methods are used in which cases? There are so many methods out there and I am really confused about which to use and when. Thanks a lot.

    • @statquest
      @statquest  4 years ago +1

      @@giosang1111 Since there are so many methods, this would probably be a series of videos, rather than a single video, but either way, it's on the to-do list. However, it will probably be a while before I can get to it.

    • @giosang1111
      @giosang1111 4 years ago +1

      @@statquest Hi! I am looking forward to it. All the bests!

  • @MyKornflake
    @MyKornflake 4 years ago

    Great explanation. I made a PCA plot with 96 genes from 6 different samples using SPSS, but I am having a hard time interpreting PC1 and PC2 and what they represent. Could you please give me some ideas on this? Thanks in advance.

    • @statquest
      @statquest  4 years ago +1

      Look at the magnitude of the loading scores for PC1 and PC2.

  • @etornamtsyawo6407
    @etornamtsyawo6407 2 years ago +1

    Let me like the video before I even start watching.

  • @danielasanabria3242
    @danielasanabria3242 4 years ago

    Did you already make a video about Partial Least Squares?

  • @糜家睿
    @糜家睿 6 years ago +1

    Two quick questions, Joshua. When we deal with RNA-seq data, we should log-transform the data before running PCA, right? Can I say it is a way to minimize the effect of outliers when determining the PCs? My second question: in R, there is a built-in function called prcomp; also, in many other packages there are functions like runPCA and plotPCA. How do I know whether these functions will center the data before calculating variation and doing projections? Thanks!

    • @statquest
      @statquest  6 years ago

      Log transforming RNA-seq data before PCA is a good idea and I generally do it. For prcomp(), there is a parameter "scale." that you can set to TRUE. When you do this, prcomp() will center and scale your data for you. In general, you can always look up the documentation for the PCA function you are using. In R you can get the documentation for prcomp() with the call "?prcomp()".

    • @bz6445
      @bz6445 2 months ago

      @@statquest Or how about not using the log or scaling? I'm trying to decide if I should just use the normalized read counts, as I typically get meaningful results relevant to our interests in the organism's biology. Is that a good reason not to scale or log-transform? The explained variance of the principal components is high, and the contributing variables are highly expressed genes that are relevant to the development of the organism. What do you think, is this a valid reason?

    • @statquest
      @statquest  2 months ago

      @@bz6445 Maybe you can create a histogram of the reads to see how things are distributed. Then do the same thing with the log scaling. You may notice a few outlier genes - look into those - see if they are relevant to your experiment. (You could also look at the PCA loading scores).
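
    A minimal sketch of the RNA-seq workflow discussed in this thread, assuming a made-up matrix of normalized read counts with genes in rows and samples in columns:

      set.seed(5)
      counts <- matrix(rnbinom(100 * 6, mu = 200, size = 2), nrow = 100,
                       dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))   # made-up counts

      log_counts <- log2(counts + 1)   # log-transform to tame the most extreme values

      # prcomp() wants samples as rows; scale. = TRUE also centers and scales each gene
      # (genes with zero variance across samples would need to be removed before scaling)
      pca <- prcomp(t(log_counts), center = TRUE, scale. = TRUE)

      summary(pca)         # proportion of variance explained by each PC
      head(pca$rotation)   # loading scores: which genes drive each PC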

  • @lingxinhe4627
    @lingxinhe4627 4 years ago +1

    Hi Josh,
    Thank you for the amazing videos, the content on this channel on stats is so much better than everything else I've found online. I have a quest(ion): Once you get PC1 and PC2 as the main components that explain variation, how can we get back to the variables that compose them?
    Thank you!

    • @statquest
      @statquest  4 years ago

      I show how to do this exact thing in my PCA in R ruclips.net/video/0Jp4gsfOLMs/видео.html and PCA in Python ruclips.net/video/Lsue2gEM9D0/видео.html videos.
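
    As a quick pointer alongside those videos: the piece of the prcomp() output that maps PCs back to variables is the rotation matrix (the loading scores); a small sketch with made-up data:

      set.seed(2)
      X <- matrix(rnorm(12 * 10), nrow = 12,
                  dimnames = list(NULL, paste0("gene", 1:10)))   # made-up data: 12 samples, 10 variables

      pca <- prcomp(X, center = TRUE, scale. = TRUE)

      loadings <- pca$rotation[, 1:2]   # loading scores for PC1 and PC2
      loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ][1:5, ]   # variables that drive PC1 the most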

  • @kushaltm6325
    @kushaltm6325 6 years ago +1

    Josh, thank you very much for helping us out with stats. When I get a job, I will surely contribute towards your efforts.
    I am struggling to understand things @3:10.
    Why should it be a problem if we do NOT center the data?
    Can you please explain with respect to your "PCA - Clearly Explained" video? My prof wouldn't answer it, so I'm asking a Cool-Stat-Guru about it :)
    If it requires too much elaboration, please point me to other resources.... Thanks again.
    Best wishes from India... :)

    • @statquest
      @statquest  6 years ago

      Thanks!! Do you mean try to explain it in terms of "PCA - Clearly Explained" or in terms of "PCA Step-By-Step"? The former shows the "old" or "original" method of PCA, which was to find the eigenvectors of the covariance matrix. The latter, "Step-by-Step", shows how PCA is done using the more modern technique, Singular Value Decomposition. I think it is easier to understand centering in terms of SVD.

  • @haydergfg6702
    @haydergfg6702 1 year ago

    which programming using

    • @statquest
      @statquest  1 year ago

      I'm not sure I understand your question. Are you asking how I created the video? If so, see: ruclips.net/video/crLXJG-EAhk/видео.html

  • @sarahjamal86
    @sarahjamal86 6 years ago +1

    OK... since PCA uses the SVD and covariance matrix, not centering the data at the origin means that the data is not mean-free, and freeing the data from the mean is part of constructing the covariance matrix. So not having zero-mean data means that our eigenvector will not be 100% correctly derived.

    • @statquest
      @statquest  6 years ago +3

      PCA uses SVD or the covariance matrix. It doesn't use both. Older PCA methods use a covariance matrix and a covariance matrix is automatically centered, so you don't need to worry about this. However, newer PCA methods use SVD because it is more likely to give you the correct result (SVD is more "numerically stable"), and when using SVD, you need to center your data (or make sure that the program you are using will center it for you). Otherwise you get the errors illustrated at 2:55.
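
    To see why centering matters for the SVD route, here is a small R experiment with made-up data: svd() on the raw (uncentered) data returns a first direction dominated by the overall mean, while the centered version recovers the direction of the actual spread (signs may flip).

      set.seed(9)
      x <- rnorm(50, mean = 10)              # made-up data sitting far from the origin
      y <- 20 - x + rnorm(50, sd = 0.3)      # points spread along the (1, -1) direction, centered near (10, 10)
      X <- cbind(x, y)

      svd(X)$v[, 1]                                        # ~ c(0.71, 0.71): dominated by the uncentered mean
      svd(scale(X, center = TRUE, scale = FALSE))$v[, 1]   # ~ c(0.71, -0.71): the direction of the real variation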

  • @survivio8937
    @survivio8937 4 years ago

    Thank you so much for these amazing videos. With my new-found free time, I am trying to learn about PCA in preparation for upcoming RNA-seq experiments. I have yet to do this and will probably understand more once I have practical experience, but one thing struck me as odd in your video. When scaling data, you state that the typical method is to divide by the standard deviation, assuming large values will have a larger SD. But intuitively it would make sense to scale data based on the mean rather than the SD. For example, if I had one gene which is highly expressed but not variable, then it would not be scaled down appropriately and would have an oversized contribution to PC1. Am I thinking about this wrong, or is there some reason why the mean is a bad choice? Next, it seems that with scaling, small changes in rare transcripts (that might just be error and not true transcripts) would contribute a lot to the variability and thus to PC1; does this not present a problem?
    Also, another comment: I find this and the prior video on PCA from 2018 much more intuitive than the one you produced previously, in which you discuss generating a PC axis by looking at the spread of data for 2 cells with multiple transcripts and coming up with weights or "loading scores" for each transcript based on high and low expression.
    Thank you

    • @statquest
      @statquest  4 years ago +1

      In the PCA step-by-step video ( ruclips.net/video/FgakZw6K1QQ/видео.html ) one of the first things we do is center the data. This is the equivalent of subtracting the mean value from each dimension in the data. So, for your example, if you have a gene with high expression, but no variation, we will subtract the mean of that gene from each replicate. So that part of the data standardization is already taken care of.
      The original PCA video is based on the old way of doing PCA, which is still taught as if the new way does not exist. The old way is based on creating a variance/covariance matrix of all the observations. I agree, that it is not as intuitive to understand as the new way, which is to use Singular Value Decomposition.

    • @gspb4
      @gspb4 4 years ago

      ​@@statquest Hi Josh. You mention the "old way" of performing PCA using the variance/covariance matrix versus the new way of using SVD. Do both techniques produce identical results?
      Further, have you considered producing videos for non-linear PCA?
      Anyways, thanks so much for what you do. I'm currently taking a computational biology course in grad school and wouldn't be able to get through it without your videos!!

  • @mojo9Y7OsXKT
    @mojo9Y7OsXKT 4 years ago

    How come this video is gone "Private"? The screen says: "Video Unavailable" "This video is private"!!

    • @statquest
      @statquest  4 years ago

      This specific video? Or are you asking about another video?

    • @mojo9Y7OsXKT
      @mojo9Y7OsXKT 4 years ago

      @@statquest This video was showing as private yesterday. It's come back up today! Could've been a glitch. Thanks for all your vids.

    • @statquest
      @statquest  4 years ago

      @@mojo9Y7OsXKT Yeah, something strange must have happened. I'm glad it's back. :)

  • @DabeerRoy
    @DabeerRoy 4 months ago

    Hi, where can I practice this? Please, can anyone help me?

    • @statquest
      @statquest  4 months ago

      in R: ruclips.net/video/0Jp4gsfOLMs/видео.html and in PyThon: ruclips.net/video/Lsue2gEM9D0/видео.html

  • @AyatUllah-zr6ij
    @AyatUllah-zr6ij 8 months ago +1

    Good ❤

  • @wolfisraging
    @wolfisraging 6 years ago

    Thank u sooooooooo much, that's damn awesome.

  • @thuyduongnguyen1231
    @thuyduongnguyen1231 4 years ago

    Dear Josh, I have watched PCA (Step-by-Step) and this video of yours. It really helps me get over my fear of the math and try to understand the terminology. However, I wonder: what if our data has 20 attributes (not 2 or 3 like in the video)? Does that mean we will have 20 PCs, or is there another approach to determine the maximum number of PCs? Thank you very much.

    • @statquest
      @statquest  4 years ago

      I answer this question at 3:30

  • @raisaoliveira7
    @raisaoliveira7 1 year ago +1

    • @statquest
      @statquest  1 year ago

      You're welcome again! :)

  • @marianaferreiracruz5398
    @marianaferreiracruz5398 1 year ago

    love the music

  • @JaspreetSingh-eh1vy
    @JaspreetSingh-eh1vy 4 years ago

    So technically speaking, the number of PCs = the number of features, but if the number of samples < the number of features, then the number of PCs = the number of samples - 1. Am I right?

  • @pattiknuth4822
    @pattiknuth4822 3 years ago

    Drop the song.

  • @shwetankagrawal4253
    @shwetankagrawal4253 5 years ago

    Hey Josh, I am not able to understand kernel PCA. Can you explain it, or tell me the name of a book that would give me a clear understanding of it?

  • @seazink5357
    @seazink5357 5 years ago

    love you

  • @1989ENM
    @1989ENM 6 years ago +1

    ...for me?