Principal Component Analysis in R Programming | How to Apply PCA | Step-by-Step Tutorial & Example

  • Published: 21 Aug 2024
  • This video explains how to apply a Principal Component Analysis (PCA) in R. More details: statisticsglob...
    The video is presented by Cansu Kebabci, a data scientist and statistician at Statistics Globe. Find more information about Cansu here: statisticsglob...
    In the video, Cansu explains the steps and application of Principal Component Analysis in R. Watch the video to learn more about this topic!
    Here you can find the first part of this series:
    Introduction to Principal Component Analysis (Pt. 1 - Theory): • Introduction to Princi...
    Links to the tutorials mentioned in the video:
    Can PCA be Used for Categorical Variables? (Alternatives & Example): statisticsglob...
    PCA Using Correlation & Covariance Matrix (Examples): statisticsglob...
    Biplot of PCA in R (Examples): statisticsglob...
    R code of this video:
    # Install packages
    install.packages("MASS")
    install.packages("factoextra")
    install.packages("ggplot2")
    # Load libraries
    library(MASS)
    library(factoextra)
    library(ggplot2)
    # Import biopsy data
    data(biopsy)
    dim(biopsy)
    # Structure of data
    str(biopsy)
    summary(biopsy)
    # Delete cases with missingness
    biopsy_nomiss <- na.omit(biopsy)
    # Exclude categorical data (ID and class columns)
    biopsy_sample <- biopsy_nomiss[ , -c(1, 11)]
    # Run PCA
    biopsy_pca <- prcomp(biopsy_sample,
                         scale = TRUE)
    # Summary of analysis
    summary(biopsy_pca)
    # Elements of PCA object
    names(biopsy_pca)
    # Standard deviations of components
    biopsy_pca$sdev
    # Eigenvectors (loadings)
    biopsy_pca$rotation
    # Means and standard deviations of variables
    biopsy_pca$center
    biopsy_pca$scale
    # Principal component scores
    biopsy_pca$x
    # Scree plot of variance
    fviz_eig(biopsy_pca,
             addlabels = TRUE,
             ylim = c(0, 70))
    # Biplot with default settings
    fviz_pca_biplot(biopsy_pca)
    # Biplot with labeled variables
    fviz_pca_biplot(biopsy_pca,
                    label = "var")
    # Biplot with colored groups
    fviz_pca_biplot(biopsy_pca,
                    label = "var",
                    habillage = biopsy_nomiss$class)
    # Biplot with customized colored groups and variables
    fviz_pca_biplot(biopsy_pca,
                    label = "var",
                    habillage = biopsy_nomiss$class,
                    col.var = "black") +
      scale_color_manual(values = c("orange", "purple"))
    Follow me on Social Media:
    Facebook - Statistics Globe Page: / statisticsglobecom
    Facebook - R Programming Group for Discussions & Questions: / statisticsglobe
    Facebook - Python Programming Group for Discussions & Questions: / statisticsglobepython
    LinkedIn - Statistics Globe Page: / statisticsglobe
    LinkedIn - R Programming Group for Discussions & Questions: / 12555223
    LinkedIn - Python Programming Group for Discussions & Questions: / 12673534
    Twitter: / joachimschork
    Instagram: / statisticsglobecom
    TikTok: / statisticsglobe

Comments • 76

  • @aguscmzz
    @aguscmzz 17 hours ago

    You are the best team and people dedicated to teaching Data Science. Thanks for all the information you share with us.

  • @user-wq3df5wd2q
    @user-wq3df5wd2q 10 months ago +2

    This was an excellent presentation, and doubly good when paired with the intro one. I agree with others that taking the final solution and being able to unscale and unrotate to get back to an original-variable solution would have made this off-the-charts great!

    • @cansustatisticsglobe
      @cansustatisticsglobe 10 months ago

      Hello hello!
      Thank you for your encouraging feedback. I am not sure if I got your last point. Are you interested in finding the original values from the calculated principal components?
      Best,
      Cansu
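
      For readers asking about going back to the original variables: a minimal sketch (not from the video) that reverses the PCA transformation, assuming all components of the biopsy_pca object from this tutorial are kept:
      # Undo the rotation, then undo the scaling and centering
      reconstructed_scaled <- biopsy_pca$x %*% t(biopsy_pca$rotation)
      reconstructed <- sweep(reconstructed_scaled, 2, biopsy_pca$scale, "*")
      reconstructed <- sweep(reconstructed, 2, biopsy_pca$center, "+")
      head(reconstructed)  # should match head(biopsy_sample) up to rounding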

  • @deeptimittal8420
    @deeptimittal8420 4 months ago +1

    Thank you, Statistics Globe, for the insightful and informative video. Please keep posting these kinds of tutorials.

    • @StatisticsGlobe
      @StatisticsGlobe 4 months ago

      Thanks a lot, will do! :)

    • @fisherh9111
      @fisherh9111 2 months ago

      The part 1 video does this well I think.

  • @johneagle4384
    @johneagle4384 1 year ago +2

    Thank you Joachim and Cansu. This is a good overview of PCA.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello John,
      I'm glad to hear that you're interested. If you're also considering learning how to perform PCA in Python, be sure not to miss the upcoming tutorial in this series.
      Best,
      Cansu

  • @greggunter5975
    @greggunter5975 4 months ago +2

    If this is Part 2, please label the video "Part 2" so it is easy for people to watch them in sequence.
    Thanks

    • @StatisticsGlobe
      @StatisticsGlobe 4 months ago

      Hey, thanks for the feedback, Greg! You can also watch this video without watching the first one, if you are only interested in how to apply PCA in R.

  • @KameshwarChoppella
    @KameshwarChoppella 5 months ago +1

    Well done! This was simple and straightforward.

    • @StatisticsGlobe
      @StatisticsGlobe 5 months ago

      Thanks a lot, glad you found it helpful! :)

  • @darrylmorgan
    @darrylmorgan 1 year ago +3

    Really helpful tutorial. Thank you, Cansu and Joachim!!

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello Darryl,
      I am glad that the tutorial was helpful for you. Welcome!
      Best,
      Cansu

  • @wakjiratesfahun3682
    @wakjiratesfahun3682 1 year ago +1

    Welcome, Cansu! Excellent tutor. I love her way of teaching. Just 😮

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago +1

      Hello!
      Thank you for such kind words. Feel free to let me know if you have any questions about this topic.
      Best,
      Cansu

    • @wakjiratesfahun3682
      @wakjiratesfahun3682 1 year ago +1

      @@cansustatisticsglobe Sure. I would love to see you cover principal coordinate analysis. Is that okay?

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      @@wakjiratesfahun3682 Noted!

  • @rubicleisgomes323
    @rubicleisgomes323 1 year ago +1

    I will use this example in my classes! Thank you very much.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello,
      I am glad that you liked the example. You are welcome.
      Best,
      Cansu

  • @AlbertoFCabreraCasillas
    @AlbertoFCabreraCasillas 1 year ago +4

    Excellent presentation of PCA analyses by Cansu. You may want to alert the reader that either the MASS or factoextra library affects the tidyverse select() function. I had to reinstall tidyverse after completing this session. On another point, I wonder if Cansu will describe the process of rotating the factor solution. I assume the example dealt with an unrotated solution.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago +2

      Hello Alberto,
      I am happy to hear that you liked the tutorial. Thank you for the feedback.
      I have double-checked the functions of the packages. There shouldn't be any function called select() in either of the packages to mask the select() function. However, there can be complex interactions between packages, and it is conceivable that other functions in these packages may somehow interfere with select(). If you are having issues, one way to potentially avoid them is to explicitly use the dplyr::select() syntax when you want to use the select() function.
      Regarding the second point, rotation is often used in factor analysis, like EFA, SEM, etc., to ease the interpretation of the factors. However, rotating may undermine the purpose of PCA, which is to capture as much variability as possible with the first few components. In other words, the first few rotated components might not explain as much of the variance in the data as the original principal components do. That is why, in practice, it's often more useful to interpret the loadings of the original variables on the principal components directly.
      I hope it is clear. Please let me know if I am missing something, or in case of any further questions.
      Best,
      Cansu
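
      If select() ever seems masked, a minimal sketch (assuming dplyr/tidyverse is installed) of the namespace-qualified call suggested above:
      # Explicitly call dplyr's select() to avoid any masking by other packages
      library(dplyr)
      biopsy_sub <- dplyr::select(biopsy_nomiss, -ID, -class)
      head(biopsy_sub)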

  • @anuraratnasiri5516
    @anuraratnasiri5516 1 year ago +1

    Thank you so much for sharing valuable information about PCA!

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello!
      You are very welcome! You can always visit our Statistics Globe webpage: statisticsglobe.com/ for further details about PCA.
      Best,
      Cansu

  • @macanbhaird1966
    @macanbhaird1966 1 year ago +3

    Great stuff. Clearly explained and easy to follow for a somewhat complicated analysis. Thanks for this!

  • @preeyawangsomnuk189
    @preeyawangsomnuk189 1 year ago +2

    Thanks for your video.

  • @PaoloCondo
    @PaoloCondo 1 month ago

    Thanks for the video! Çok faydalıydı (it was very helpful)!

    • @micha.statisticsglobe
      @micha.statisticsglobe 1 month ago

      Thank you very much for your kind feedback. Glad it helped! 🙂

  • @USKalemao
    @USKalemao 7 months ago

    Thanks a lot for this valuable video! It was very easy to follow the explanations.

    • @Ifeanyi.StatisticsGlobe
      @Ifeanyi.StatisticsGlobe 6 months ago

      Thanks, Kalemao for your kind words. We are happy that you found the video helpful!

  • @Mustafa-dw3wm
    @Mustafa-dw3wm 8 months ago +1

    Very understandable. Thanks a lot. But one question: isn't it necessary to run a Kaiser-Meyer-Olkin test before the PCA?

    • @cansustatisticsglobe
      @cansustatisticsglobe 7 months ago

      Hello Mustafa,
      The Kaiser-Meyer-Olkin test is more commonly used for other factor analysis techniques like EFA and CFA. It may also be a useful tool for assessing the adequacy of your data for PCA. However, in practice, PCA usually proceeds without it, as PCA is used more for pattern recognition and dimensionality reduction rather than strict factor extraction like in EFA and CFA.
      Best,
      Cansu
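
      For anyone who still wants to run the check, a minimal sketch using the KMO() function from the psych package (assuming psych is installed; this is not part of the video):
      # Kaiser-Meyer-Olkin measure of sampling adequacy on the correlation matrix
      install.packages("psych")
      library(psych)
      KMO(cor(biopsy_sample))  # overall MSA plus per-variable MSA values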

  • @ginnistrehle426
    @ginnistrehle426 7 months ago +1

    Very helpful! Is there a way to only project the observations in the space instead of a biplot of both the observations and variables?

    • @cansustatisticsglobe
      @cansustatisticsglobe 7 months ago

      Hello!
      I am glad that you found the tutorial helpful. Sure! You can use scatterplots for that. Please see the tutorials on our website: statisticsglobe.com/scatterplot-pca-r
      Best,
      Cansu
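
      Within factoextra itself, a minimal sketch that plots only the observations (assuming the objects created in this tutorial):
      # Individuals-only plot, colored by tumor class, without variable arrows
      fviz_pca_ind(biopsy_pca,
                   label = "none",
                   habillage = biopsy_nomiss$class)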

  • @Geology_monster
    @Geology_monster 3 months ago

    Love u guys 🙌🏼

  • @nikolaostziokas6847
    @nikolaostziokas6847 1 month ago +1

    Great video. What if someone wants to weight the PCs by, e.g., their variance? Is it possible to do that using the prcomp() function?

    • @StatisticsGlobe
      @StatisticsGlobe 1 month ago

      Thanks a lot! To weight the principal components (PCs) by their variance using prcomp() in R, use the scale. = TRUE argument. This scales the data so that each variable has unit variance, ensuring that PCs are weighted appropriately by their variance.
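
      If the goal is instead to weight the component scores themselves by the variance each component explains, a minimal sketch (one possible interpretation, not shown in the video):
      # Proportion of variance explained by each principal component
      var_explained <- biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)
      # Multiply each column of scores by its component's variance share
      weighted_scores <- sweep(biopsy_pca$x, 2, var_explained, "*")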

  • @thezenithanalysis7541
    @thezenithanalysis7541 2 months ago +1

    What variable will we use as an independent variable as an index? I mean, what is the biopsy index variable we will use for the analysis?

    • @StatisticsGlobe
      @StatisticsGlobe 2 months ago

      I'm not an expert in the biopsy field, but as far as I know, the biopsy index variable used for the analysis will typically be the first principal component derived from the PCA on the set of dummy variables. This component captures the most significant variation in the data, serving as a comprehensive index.
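
      A minimal sketch of extracting such an index from the PCA object in this tutorial (illustration only):
      # Use the first principal component scores as a single index variable
      biopsy_index <- biopsy_pca$x[, 1]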

  • @Nico_boost
    @Nico_boost 9 months ago +1

    Nice video! How can one remove principal components to simplify the model?

    • @cansustatisticsglobe
      @cansustatisticsglobe 9 months ago

      Hello Nico!
      I am glad that you liked the video. If you want to extract, let's say, the first two principal components to simplify your dataset, you can do the following for the dataset in this tutorial.
      # Principal Component Scores
      biopsy_pca$x
      # Retrieve the first two components
      biopsy_pca$x[,1:2]
      As you can see, it is a simple dataset subsetting operation.
      Best,
      Cansu

    • @Nico_boost
      @Nico_boost 9 months ago +1

      Thank you!

    • @cansustatisticsglobe
      @cansustatisticsglobe 9 months ago

      Welcome @@Nico_boost !

  • @BhanuBhaktaSharma-ut5zw
    @BhanuBhaktaSharma-ut5zw 9 months ago +1

    Hi, is there any way of doing PCA for variables with unequal numbers of values without omitting cases?

    • @cansustatisticsglobe
      @cansustatisticsglobe 9 months ago

      Hello!
      Do you mean variables with different lengths of inputs? If so, yes, you can, but the missing entries would be treated as missing values. Then, dealing with missingness comes into play. You should decide on a method for handling missingness. The most straightforward but hazardous one is omitting the cases with missingness. For more advanced methods, see statisticsglobe.com/missing-data/.
      Best,
      Cansu
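
      For illustration, a minimal sketch of simple mean imputation as one alternative to na.omit() (a basic approach; more robust methods are covered in the linked tutorial):
      # Replace missing values in each numeric column with that column's mean
      biopsy_imp <- biopsy[, 2:10]  # the nine numeric variables
      for (j in seq_along(biopsy_imp)) {
        biopsy_imp[[j]][is.na(biopsy_imp[[j]])] <- mean(biopsy_imp[[j]], na.rm = TRUE)
      }
      biopsy_pca_imp <- prcomp(biopsy_imp, scale = TRUE)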

  • @ariskoitsanos607
    @ariskoitsanos607 1 year ago +1

    Nice intro. If you had added at least some basic interpretation of the results, it would have been even better.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hey!
      Thank you for your feedback. It is the second video of our PCA series. I explain how to interpret the results in our first video, which you can find here: ruclips.net/video/DngS4LNNzc8/видео.html. I hope it helps!
      Best,
      Cansu

  • @pegah_95
    @pegah_95 1 month ago +1

    Hi, thank you for your videos. Can we do PCA on binary data? My features are a mix of binary and numeric. Could you please tell me what the best approach is in this case?

    • @StatisticsGlobe
      @StatisticsGlobe 1 month ago

      Hey, thanks for the kind comment. Yes, PCA with dummy variables is possible. However, depending on your data, it may not always yield meaningful results. You might consider using techniques like Multiple Correspondence Analysis (MCA) instead.

    • @pegah_95
      @pegah_95 1 month ago +1

      @@StatisticsGlobe Thank you for your response. I am doing PCA prior to my cluster analysis, and since most of my features are binary (I think only 2 or 3 are numeric), would you suggest I scale the binary variables (and then do PCA) in this case?

    • @StatisticsGlobe
      @StatisticsGlobe 1 month ago

      Hey, applying PCA to binary data is controversial and often not recommended, so unfortunately I can't provide a definitive answer. For a detailed discussion, check out this thread on Cross Validated, which includes various links to relevant papers and articles: stats.stackexchange.com/questions/159705/would-pca-work-for-boolean-binary-data-types

    • @pegah_95
      @pegah_95 1 month ago +1

      @@StatisticsGlobe Thank you!
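
      For mixed binary and numeric features, one commonly suggested alternative to plain PCA is Factor Analysis of Mixed Data; a minimal sketch with the FactoMineR package (an assumption, not an approach shown in the video; my_mixed_data is a placeholder data frame with factor and numeric columns):
      install.packages("FactoMineR")
      library(FactoMineR)
      # FAMD treats numeric and categorical/binary columns together
      famd_res <- FAMD(my_mixed_data, ncp = 5, graph = FALSE)
      famd_res$eig  # variance explained by the dimensions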

  • @SekyereNyantakyiwaaJosephine
    @SekyereNyantakyiwaaJosephine 10 days ago +1

    Please, when given a multivariate dataset, is it necessary to test the assumptions of PCA before applying it?

    • @StatisticsGlobe
      @StatisticsGlobe 9 days ago

      Yes, it is important to test the assumptions of PCA, such as linearity, large sample size, and the absence of outliers, to ensure the results are valid and meaningful. However, PCA is often robust to minor violations of these assumptions.

    • @SekyereNyantakyiwaaJosephine
      @SekyereNyantakyiwaaJosephine 9 days ago +1

      Thank you ​@@StatisticsGlobe

    • @StatisticsGlobe
      @StatisticsGlobe 9 days ago

      You are very welcome!
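
      As one example of such a check, a minimal sketch that flags multivariate outliers via Mahalanobis distances (illustration only, using the data from this tutorial):
      # Mahalanobis distance of each observation from the multivariate mean
      md <- mahalanobis(biopsy_sample, colMeans(biopsy_sample), cov(biopsy_sample))
      # Flag observations beyond a chi-squared cutoff (9 variables, 99.9% quantile)
      which(md > qchisq(0.999, df = ncol(biopsy_sample)))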

  • @erdavg
    @erdavg 10 months ago +1

    Muchas graciaaaas (thank you so much)!

    • @matthias.statisticsglobe
      @matthias.statisticsglobe 10 months ago

      Thank you very much for the positive feedback. Hope the video has been helpful.

  • @thezenithanalysis7541
    @thezenithanalysis7541 2 months ago +1

    Can we use PCA to make an index using a set of dummy variables?

    • @StatisticsGlobe
      @StatisticsGlobe 2 months ago

      Hey! Yes, PCA can be used to create an index from a set of dummy variables by transforming them into principal components that summarize the underlying patterns.
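
      A minimal sketch of that idea (hypothetical data; df, region, and education are placeholders, not objects from the video):
      # Build dummy variables, run PCA on them, and use PC1 as the index
      dummies <- model.matrix(~ region + education - 1, data = df)
      pca_dummies <- prcomp(dummies, scale = TRUE)
      index <- pca_dummies$x[, 1]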

  • @alessandrorosati969
    @alessandrorosati969 1 year ago +1

    Can a dataset consisting of the principal components and the target variable be used to perform machine learning techniques?

  • @freya_yuen
    @freya_yuen 10 months ago +1

    Is it possible to perform PCA without using the factoextra library, relying solely on the tidyverse?

    • @cansustatisticsglobe
      @cansustatisticsglobe 10 months ago

      Hello Melody!
      Yes, you don't need to install the factoextra package. It is just needed to visualize the results easily. You can also use only ggplot2 for the visualization, but then you need to write more lines of code to obtain the same visual. Please let me know if you still have some questions.
      Best,
      Cansu

    • @freya_yuen
      @freya_yuen 10 months ago +1

      @@cansustatisticsglobe Got it!! Thanks for the reply!!

    • @cansustatisticsglobe
      @cansustatisticsglobe 10 months ago

      Welcome @@freya_yuen !
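
      For reference, a minimal sketch of a score plot built with ggplot2 alone (assuming the objects created in this tutorial):
      library(ggplot2)
      # Put the first two principal component scores into a data frame
      scores <- data.frame(biopsy_pca$x[, 1:2], class = biopsy_nomiss$class)
      ggplot(scores, aes(x = PC1, y = PC2, color = class)) +
        geom_point() +
        labs(x = "PC1", y = "PC2")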

  • @UsamahXiyad
    @UsamahXiyad 1 month ago +1

    Sorry, I want to ask: why did I get this error: could not find function "fviz_eig"?

    • @StatisticsGlobe
      @StatisticsGlobe 1 month ago

      Hey, fviz_eig is a function of the factoextra package. I assume you haven't installed and loaded the package properly. Try this before running the rest of your code:
      install.packages("factoextra")
      library("factoextra")

  • @ambujmishra695
    @ambujmishra695 7 months ago

    If the resulting variable labels are overlapping, what can be done to resolve that? I used geom_text_repel with nudge_x = c(4, 5, 6) (adjust horizontal position for each label) and nudge_y = c(0, 0, 0) (adjust vertical position for each label), etc., but nothing was working. Maybe I was using it incorrectly.
    Kindly guide me, please.

    • @Ifeanyi.StatisticsGlobe
      @Ifeanyi.StatisticsGlobe 6 months ago

      Hi Ambujmishra. Sorry about the late response to your comment. Do you still require assistance with the problem?

    • @ambujmishra695
      @ambujmishra695 3 months ago

      @@Ifeanyi.StatisticsGlobe yes
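
      One option worth trying (a sketch, assuming the factoextra setup from this tutorial): the fviz_pca_* functions accept a repel argument that nudges overlapping labels apart.
      # Let factoextra reposition overlapping labels automatically
      fviz_pca_biplot(biopsy_pca,
                      label = "var",
                      repel = TRUE)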

  • @pramitthapa283
    @pramitthapa283 8 months ago

    Everyone makes simple videos about PCA. No one explains well what PCA and dimension reduction mean. Maybe the video makers also don't understand what PCA is.

    • @StatisticsGlobe
      @StatisticsGlobe 8 months ago +1

      Hey, do you have a specific question that is not answered by the video? We are happy to help. Regards, Joachim