Cluster analysis in R | Finding out Intra and Inter cluster distances and optimum number of clusters

  • Published: 7 Aug 2024
  • Hello everyone. In this tutorial I cover almost all the concepts involved in cluster analysis: how to find the optimum number of clusters for the k-means and hierarchical clustering algorithms using the wss plot function, the silhouette method from the factoextra package, and the NbClust package. For hierarchical clustering I explain dendrograms: how to plot, export and hang them. I also show how to find the inter- and intracluster distances using an R package called clv, and finally how to find the cluster means or averages.
    0:00 Introduction
    2:10 Preliminary steps
    6:40 k means
    7:20 Opt no of clusters
    12:50 Hierarchical clustering
    19:00 Cluster distances
    21:27 Cluster means
    Other youtube videos
    Coding
    1) • Introduction to Cluste...
    2) • Cluster analysis
    Theory
    1) • 4 Basic Types of Clust...
    2) • Clustering: K-means an...
    3) • StatQuest: K-means clu...
    4) • StatQuest: Hierarchica...
    Descriptive
    ~NbClust
    1) www.rdocumentation.org/packag...
    ~Finding the optimum number of clusters
    1) uc-r.github.io/kmeans_clustering
    2) stackoverflow.com/questions/1...
    ~Distances
    1) www.geeksforgeeks.org/ml-inte...
    Script
    docs.google.com/document/d/19...
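
The first steps listed in the description (scaling, k-means, then choosing the number of clusters from the wss and silhouette criteria) can be sketched roughly as follows. This is not the author's script: the built-in USArrests data stands in for the unpublished pea.xlsx, and the criteria are computed by hand with `kmeans()` and `cluster::silhouette()` rather than with factoextra or NbClust.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

# Stand-in data (assumption): the video's pea.xlsx is not public.
x <- scale(USArrests)          # standardise so no single trait dominates
d <- dist(x)                   # Euclidean distances between rows

set.seed(123)                  # k-means starts from random centres
ks <- 2:8
wss <- numeric(length(ks))
sil <- numeric(length(ks))
for (i in seq_along(ks)) {
  km <- kmeans(x, centers = ks[i], nstart = 25)
  wss[i] <- km$tot.withinss                                # elbow criterion
  sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"]) # average silhouette
}

best_k <- ks[which.max(sil)]   # k with the highest average silhouette width
print(data.frame(k = ks, wss = round(wss, 2), silhouette = round(sil, 3)))
```

The wss column is read by eye for an "elbow", while the silhouette column has a clear maximum, so the two criteria need not agree.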

Comments • 106

  • @Guruprasad_A 2 years ago +1

    Check the script in the video description.

  • @Guruprasad_A 3 years ago +1

    Hi there, don't hesitate to ask your doubts here. Feel free to ask me.

  • @PlantBreeding_is_my_passion 2 months ago

    Excellent information, thanks for explaining every step very clearly

  • @aberaseboka2456 2 years ago +2

    It is very informative, Sir. Thank you very much!

  • @priyankaparihar2217 5 months ago +1

    Thank you. This is very helpful.

  • @s.praveenmscvegetablescien1632 6 months ago +1

    Very useful lecture, sir, thanks.

  • @bartolodimattia2501 9 months ago

    Amazing, BIG!

  • @gpb100percent_bkpraveen 2 years ago

    Nicely explained

  • @zeruyimer3764 3 years ago +1

    Thanks a lot for sharing the video. If possible, can you make another video on Prep-design and lattice design (i.e. simple and triple lattice) data analysis? Thank you.

    • @Guruprasad_A 3 years ago

      You are welcome, Zeru. Check the video on agricultural statistical tools that I recently uploaded to my data analysis playlist; you may find what you're looking for in the STAR app manual.

  • @linuxvegalinuxvega1918 1 month ago

    Very good explanation. Can you post the pea.xlsx file? Did you publish your findings?

  • @babitabhatt6469 9 days ago

    Hello sir, my design was Augmented 2 with 7 blocks: 210 RILs of sorghum, with 6 checks replicated in each block. In this case, should I go for hierarchical clustering analysis using k-means and include the adjusted means of the 210 lines in the analysis? Am I right?
    I have to do the interpretation on the basis of fodder yield, so is there any specification for doing trait-wise clustering?
    I have taken quantitative traits along with shoot-fly-associated traits such as dead heart percent,
    as my RILs are the population of a shoot fly susceptible x resistant cross.
    So how should I proceed with the diversity analysis for forage sorghum based on my characters? A total of 18 characters were taken. Should I limit my traits according to my objective of interpretation, or should I include all the traits?
    For non-replicated data, is non-hierarchical clustering different from hierarchical?

  • @sandipbohara9534 3 months ago

    Sir, how to make the circular cluster dendrogram?

  • @deepikachandrasekaran3554 2 years ago

    Thank you for the video, sir. If scaling is only for quantitative traits, what can we do for qualitative traits? How to do cluster analysis for qualitative traits?

    • @Guruprasad_A 2 years ago +1

      Most often we don't consider qualitative traits; if you include them, the clustering will not be good. If you want to do that, you can do it in SPSS using the two-step clustering option.

    • @deepikachandrasekaran3554 2 years ago

      @Guruprasad_A OK sir. Thank you for your reply.
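
For readers who want to stay in R, one commonly used alternative to the SPSS route suggested above is to compute a Gower dissimilarity, which handles numeric and categorical columns together, and cluster on that. The sketch below is that swapped-in technique, not something shown in the video; the trait table is entirely hypothetical.

```r
library(cluster)  # provides daisy() for mixed-type dissimilarities

# Hypothetical data: two quantitative and two qualitative traits, six genotypes.
traits <- data.frame(
  plant_height   = c(62, 75, 58, 80, 66, 71),
  pods_per_plant = c(12, 18, 10, 20, 14, 16),
  flower_colour  = factor(c("white", "purple", "white", "purple", "pink", "pink")),
  seed_shape     = factor(c("round", "wrinkled", "round", "round", "wrinkled", "wrinkled")),
  row.names      = paste0("G", 1:6)
)

gower  <- daisy(traits, metric = "gower")   # dissimilarities scaled to [0, 1]
hc     <- hclust(gower, method = "average") # hierarchical clustering on them
groups <- cutree(hc, k = 2)                 # k = 2 is an arbitrary choice here
print(groups)
```

Because `daisy()` rescales each numeric column by its range, no separate `scale()` step is applied in this sketch.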

  • @JMSBRK 2 years ago

    The topic was explained very simply, clearly & well.
    Your video is now among my go-to resources in data science.
    Just a small constructive criticism: you have already added your R code, but the dataset you used (pea.xlsx) hasn't been included.
    Thus, your script won't work without it & it may be of no use to people who wish to practice themselves.
    So it would be much better if you included the Excel file as well, imho.
    But overall a fantastic tutorial & very helpful for rookies
    👍

  • @user-ce8fu8sf4s 11 months ago

    Which method will be suitable for cluster analysis of 300 treatments/genotypes, and how can I represent it using a dendrogram or any clustering technique for data visualization?

    • @Guruprasad_A 11 months ago

      Ward's method is more suitable.

  • @rahulm.r2269 3 years ago

    How can I get the percent contribution of each independent trait to the dependent variable, similar to the one used in Tocher's method?

    • @Guruprasad_A 3 years ago

      stackoverflow.com/questions/62332254/is-there-a-way-to-determine-the-weight-of-different-attributes-used-for-r-cluste

    • @Guruprasad_A 3 years ago

      Check out the above link on FeatureImpCluster; I think you need to install it with devtools.

  • @revanthsrirangaraju8863 2 years ago +1

    I wanted to perform hierarchical clustering on my data, which has 4k observations.
    I applied 'NbClust' with a smaller number of clusters, i.e. 2 to 9, and it throws this error:
    " Error in plot.new() : figure margins too large "

    • @Guruprasad_A 2 years ago +1

      Try adjusting the number of clusters to between 1 and 15 or so.

    • @revanthsrirangaraju8863 2 years ago +1

      @Guruprasad_A Sure, I will try that.
      Thank you

  • @vikramana7796 3 months ago

    Sir, I have a doubt: how should I write up multivariate D2 analysis and Ward D2 hierarchical clustering in my thesis? They differ in cluster partition.

    • @Guruprasad_A 3 months ago

      Go with the one that suits you. For diversity studies, the D2 test is mostly better.

  • @mukeshdhadhich2733 1 month ago

    After calculating the D² distance according to the previous video,
    the number of clusters according to this video is different. How can we solve this? Please respond.

    • @Guruprasad_A 1 month ago

      You won't get similar results; stick to one method.

  • @warwan.search 1 year ago

    Can you please do an analysis for path and correlation? I am not able to run the program.

  • @im_hung9058 2 years ago

    Hi. I am new to this field. I am trying to follow what you illustrated, but I got stuck at clustering with NbClust: Error in NbClust(data = newdata, diss = NULL, distance = "euclidean", :
    The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
    Could you help me?

    • @Guruprasad_A 2 years ago +1

      stackoverflow.com/questions/46067602/the-tss-matrix-is-indefinite-there-must-be-too-many-missing-values-the-index

    • @Guruprasad_A 2 years ago

      stats.stackexchange.com/questions/204760/r-how-to-fix-nbclust-error-with-error-message-the-tss-matrix-is-indefinite?newreg=6dbb3e394e5f40cf8b1e50a16f88ee84
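
Before rerunning NbClust after this error, it may help to inspect the data for the conditions the message hints at. The helper below is a hypothetical diagnostic written for this note, not part of NbClust, and it only approximates what the package checks: it counts missing values, compares the matrix rank with the number of columns, and reports the smallest eigenvalue of the covariance-like matrix. The demo data (USArrests plus a deliberately duplicated column) is also an assumption.

```r
# Hypothetical helper: reports conditions that plausibly trip the TSS check.
check_for_nbclust <- function(x) {
  x <- as.matrix(x)
  complete <- x[stats::complete.cases(x), , drop = FALSE]  # rows without NA
  centred  <- scale(complete, center = TRUE, scale = FALSE)
  eig <- eigen(crossprod(centred) / (nrow(centred) - 1),
               symmetric = TRUE, only.values = TRUE)$values
  list(
    n_missing      = sum(is.na(x)),      # NA cells in the input
    n_rows         = nrow(complete),     # usable observations
    n_cols         = ncol(complete),     # number of traits
    rank           = qr(centred)$rank,   # rank < n_cols means redundant traits
    min_eigenvalue = min(eig)            # near zero or negative is a warning sign
  )
}

demo <- scale(USArrests)
demo <- cbind(demo, Copy = demo[, "Murder"])  # deliberately redundant column
report <- check_for_nbclust(demo)
print(report)
```

If the rank is lower than the number of columns, removing redundant or derived traits is one thing to try; if there are missing values, they need to be handled before clustering.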

  • @rakhimahto376 2 years ago

    Can you please make a video on the basics of R, or recommend a source for the same?

  • @user-yv8tu3tq7v 3 years ago

    How should we interpret wss, silhouette and NbClust suggesting different numbers of clusters? Which should we trust more?

    • @Guruprasad_A 3 years ago

      Just look at the links given in my video description.

    • @Guruprasad_A 3 years ago

      NbClust, because it is based on 30 different indices.

  • @samrathbaghel9026 1 year ago

    Hi, I did clustering using ward.D and got 5 as the optimal number of clusters, but the cluster means are given for only 4 clusters. Why?

    • @Guruprasad_A 1 year ago

      I suspect there is a problem in how the number-of-clusters object was selected. In that case you can calculate the averages manually: add an extra column alongside the treatments indicating which cluster each belongs to, then sort by it and get the averages in MS Excel.
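
The manual route described above (a cluster column next to the treatments, then per-cluster averages) can also be done directly in R. This is a minimal sketch on the built-in USArrests data as a stand-in; the object names are illustrative and not taken from the video's script.

```r
x  <- scale(USArrests)                      # stand-in for the scaled trait data
hc <- hclust(dist(x), method = "ward.D")
k  <- 5                                     # the number of clusters chosen earlier
membership <- cutree(hc, k = k)             # one cluster label per row

# Means on the original (unscaled) scale, one row per cluster.
cluster_means <- aggregate(USArrests, by = list(cluster = membership), FUN = mean)
print(table(membership))                    # how many rows fall in each cluster
print(cluster_means)
```

Checking `table(membership)` first shows whether the tree was really cut into the intended number of groups before the means are computed.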

  • @abc_def789 2 years ago

    Which inter class and intra class cluster should we use?

  • @anilkhar 2 years ago +1

    When I used NbClust I got this message: "Error in NbClust(data = Table.s, diss = NULL, distance = "euclidean", :
    The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated."
    The silhouette method gives the number of clusters as 7.

    • @Guruprasad_A 2 years ago

      stats.stackexchange.com/questions/204760/r-how-to-fix-nbclust-error-with-error-message-the-tss-matrix-is-indefinite?newreg=6dbb3e394e5f40cf8b1e50a16f88ee84

    • @Guruprasad_A 2 years ago

      Use 7, but only for k-means clustering.

  • @minhaoling3056 2 years ago

    Dear Sir, I am doing a cluster analysis of 40,000 points. I used an efficient hierarchical clustering algorithm and successfully did the clustering. Now I want to choose the optimal number of clusters. Which method would you suggest for dealing with a large dataset? I need to test from k=1, …, k=40000, which is huge.

    • @Guruprasad_A 2 years ago

      www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/

    • @Guruprasad_A 2 years ago

      ?NbClust
      Run the above command in the R console to open the help page, where you can see what NbClust can do.

    • @minhaoling3056 2 years ago

      @Guruprasad_A Hi, I tried NbClust, but it's still quite slow to compute them all one by one.

    • @Guruprasad_A 2 years ago

      @minhaoling3056 Try defining a narrow range for the number of clusters in the NbClust function, i.e. with min.nc and max.nc.
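
Combining the two suggestions in this thread (the CLARA approach from the linked page and a narrow range of k), a rough sketch for a data set of this size could look like the following. The 40,000 points are simulated, and the range 2 to 6 and the sampling settings are arbitrary assumptions chosen only to keep the example quick.

```r
library(cluster)  # provides clara(), a sampling-based k-medoids method

set.seed(42)
# Simulated stand-in: 40,000 two-dimensional points around three centres.
centres <- matrix(c(0, 0, 5, 5, 0, 6), ncol = 2, byrow = TRUE)
big <- centres[sample(1:3, 40000, replace = TRUE), ] +
  matrix(rnorm(80000), ncol = 2)

ks <- 2:6                                   # narrow range instead of 1..40000
avg_sil <- sapply(ks, function(k) {
  fit <- clara(big, k = k, samples = 20, sampsize = 200)
  fit$silinfo$avg.width                     # silhouette width on the best sample
})
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
```

Because CLARA evaluates the silhouette on subsamples, the cost per candidate k stays small even when the full data set is large.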

  • @bhuvaneshwaris8146 3 months ago

    I have loaded corrected means from an augmented RCBD, but one numerical trait is read as character. After changing it to numeric, it shows 'NA' for all genotypes. So I am unable to proceed with k-means; it shows an error. How should I go about it? Kindly help.

    • @Guruprasad_A 3 months ago

      Check whether there is a problem in the data you entered. There might be a double full stop, or a letter such as o in place of a zero.
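
The kind of check suggested above can be automated: convert the column and list the entries that are present but fail to parse. The column below is hypothetical and only illustrates the two typos mentioned (a double full stop and a letter O for zero).

```r
# Hypothetical trait column as it might arrive from a spreadsheet.
yield <- c("12.5", "13..2", "1O.4", "11.8", NA, " 9.7")

# Entries that are present but cannot be converted to a number.
as_num <- suppressWarnings(as.numeric(yield))
bad <- which(!is.na(yield) & is.na(as_num))
print(data.frame(row = bad, value = yield[bad]))
```

If every row shows up as bad, the cause is more likely something systematic in the column (for example a decimal comma) than individual typos.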

  • @mahidargowd3432 11 months ago

    Can we make the label font size clear and small?

  • @tinymee589 1 year ago

    How do we generate an output file for this whole process?

    • @Guruprasad_A 1 year ago

      sink("output.txt")
      # your code to get the output
      .
      .
      sink()

  • @shivangitare7102 1 year ago

    In this result, where is the grouping of genotypes by cluster?
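
One way to list which genotype falls in which cluster is to cut the tree and tabulate the labels. A sketch under assumptions: USArrests stands in for the trait data (its row names play the role of genotype names), k = 4 is arbitrary, and the output file name is hypothetical.

```r
x  <- scale(USArrests)                      # row names act as genotype names
hc <- hclust(dist(x), method = "ward.D2")
membership <- cutree(hc, k = 4)             # named vector: genotype -> cluster

grouping <- data.frame(genotype = names(membership),
                       cluster  = unname(membership))
grouping <- grouping[order(grouping$cluster), ]
by_cluster <- split(grouping$genotype, grouping$cluster)  # one vector per cluster

# Save the table so it can be opened in a spreadsheet (file name is illustrative).
out_file <- file.path(tempdir(), "cluster_membership.csv")
write.csv(grouping, out_file, row.names = FALSE)
print(by_cluster)
```

For k-means the same table can be built from the `cluster` component of the fitted object instead of `cutree()`.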

  • @muhiuddinfaruqueeirribd8336 2 months ago

    When I run NbClust(data=rice.s, diss= NULL, distance= "euclidean", min.nc = 2, max.nc = 15,
    method = "complete", index= "all", alphaBeale = 0.1), then I got the following error:
    Error: division by zero!Error in solve.default(W) : Lapack routine dgesv: system is exactly singular: U[6,6] = 0
    How can I solve this?
    Thank you.

  • @rukoochawla9714 2 years ago

    How to fill the rectangles and lines with color in a hierarchical dendrogram?
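
In base R, `rect.hclust()` draws coloured rectangles around the clusters of an existing dendrogram plot. It colours the borders only, so the sketch below is a partial answer; filled boxes and coloured branches need an add-on plotting package. The data, the choice of k = 4 and the colours are assumptions.

```r
x  <- scale(USArrests)                      # stand-in data
hc <- hclust(dist(x), method = "ward.D2")
k  <- 4

# hang = -1 aligns all labels at the bottom of the tree.
plot(hc, hang = -1, cex = 0.6, main = "Cluster dendrogram", xlab = "", sub = "")

# One coloured border per cluster; the return value lists the members of each box.
boxes <- rect.hclust(hc, k = k, border = c("red", "blue", "darkgreen", "purple"))
```

The `cex` argument in `plot()` controls the label size, which helps when there are many genotypes.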

  • @matinakand8874 2 years ago

    Can I have the code for viewing this dendrogram vertically?

    • @Guruprasad_A 2 years ago

      www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning

  • @charupriyachauhan8654 3 years ago

    An error comes up while loading the factoextra package: package or namespace load failed for 'ggplot2' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace 'ellipsis' 0.3.1 is already loaded, but >= 0.3.2 is required.
    Please provide a solution for this.

    • @Guruprasad_A 3 years ago

      Have you already installed ggplot2? Whether or not it is installed, try installing it once again with this command:
      install.packages("ggplot2")

    • @Guruprasad_A 3 years ago

      After loading ggplot2, load factoextra.

    • @charupriyachauhan8654 3 years ago +1

      @Guruprasad_A Thanks for the guidance. An update was needed.

  • @anandjaiswar3929 1 year ago

    Where can I get the pea dataset?

  • @anoopnehra7298 2 years ago

    Hello sir, I'm getting an error after running this command:
    NbClust(data = clust.s, diss = NULL, distance = "euclidean", min.nc = 2,
    max.nc =15,method = "kmeans",index = "all",alphaBeale = 0.1)
    The error is:
    Error in NbClust(data = clust.s, diss = NULL, distance = "euclidean", :
    The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
    Please give your valuable suggestions regarding this.

    • @Guruprasad_A 2 years ago +1

      stackoverflow.com/questions/46067602/the-tss-matrix-is-indefinite-there-must-be-too-many-missing-values-the-index

    • @Guruprasad_A 2 years ago +1

      stats.stackexchange.com/questions/204760/r-how-to-fix-nbclust-error-with-error-message-the-tss-matrix-is-indefinite?newreg=6dbb3e394e5f40cf8b1e50a16f88ee84

    • @Guruprasad_A 2 years ago +1

      Consider the silhouette method.

  • @samrathbaghel9026 2 years ago

    Error in t(jeu) %*% jeu :
    requires numeric/complex matrix/vector arguments
    I am getting this error while computing the optimum number of clusters using NbClust. Please help.

    • @Guruprasad_A 2 years ago +1

      Does your dataset contain any non-numeric variables?

    • @samrathbaghel9026 2 years ago

      @Guruprasad_A No, it does not. I am using the adjusted mean data from the augmented RCBD design. I did that analysis by following your video "augmented RCBD in R".

    • @Guruprasad_A 2 years ago +1

      While importing the data, check whether all variables are of type double.

    • @samrathbaghel9026 2 years ago

      @Guruprasad_A Got it fixed, thanks a lot for the wonderful video. Please share how to extract the analysis report in Word format, as you did in the previous video.

    • @Guruprasad_A 2 years ago

      It's not possible in this package.

  • @samrathbaghel9026 1 year ago

    Error in hclust(d, "ward.D") : Invalid clustering method. I am getting this error.

    • @Guruprasad_A 1 year ago +1

      Try using hclust(dist(data), method = "ward.D")
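
A slightly fuller sketch of that call, with the distance and the linkage kept apart: `dist()` takes the distance measure, while the linkage name goes to `hclust()` through its `method` argument. The data is a stand-in. Note also that the names "ward.D" and "ward.D2" are only accepted by reasonably recent versions of R, so an "invalid clustering method" message on an old installation may simply call for an update.

```r
x <- scale(USArrests)                   # stand-in numeric data
d <- dist(x, method = "euclidean")      # distance measure belongs to dist()

hc_d  <- hclust(d, method = "ward.D")   # linkage belongs to hclust()
hc_d2 <- hclust(d, method = "ward.D2")  # implements Ward's criterion on plain distances

print(hc_d$method)
print(hc_d2$method)
```

The two Ward variants can give different trees on the same distances, so it is worth stating which one was used when reporting results.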

  • @anilkhar 3 years ago

    When I run NbClust it gives me this error: Error in solve.default(W) :
    system is computationally singular: reciprocal condition number = 8.06347e-18

    • @Guruprasad_A 3 years ago

      stackoverflow.com/questions/56160674/how-do-i-fix-nbclust-error-solve-defaultw-in-mtcars-dataset

    • @Guruprasad_A 3 years ago

      It's because of your data. Just ignore this method, and if you are looking for the optimum number of clusters for k-means, use the silhouette method.

    • @Guruprasad_A 3 years ago

      www.biostars.org/p/407232/

    • @Guruprasad_A 3 years ago

      If possible, try removing highly correlated variables; if you have replication, use the Tocher method/D2 statistics.

    • @anilkhar 3 years ago

      @Guruprasad_A Yes, the silhouette method is giving 2 clusters.

  • @ikrambashir8604 2 years ago

    > NbClust(data = pea.s, diss =NULL, distance = "euclidean", min.nc = 2, max.nc = 15,
    + method = "kmeans" , index = "all", alphaBeale = 0.1)
    Error in NbClust(data = pea.s, diss = NULL, distance = "euclidean", min.nc = 2, :
    The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.
    Can you tell me how to overcome this problem?

    • @Guruprasad_A 2 years ago

      You can use the results from other methods like silhouette.

  • @deeptitiwari7097 2 years ago

    When I tried to make the dendrogram by following your video, it says that the figure margins are too large and I'm not getting anything in the plots.

    • @Guruprasad_A 2 years ago

      Try exporting at an optimum resolution.

    • @Guruprasad_A 2 years ago

      Or try it on another computer and see.

    • @Guruprasad_A 2 years ago

      stackoverflow.com/questions/23050928/error-in-plot-new-figure-margins-too-large-scatter-plot
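
The "figure margins too large" message usually means the plotting area is too small for the margins requested, which is why exporting to a file with explicit dimensions tends to help. A minimal sketch, assuming stand-in data and an illustrative file name; a PDF device is used here, and `png()` with `width`, `height` and `res` works along the same lines.

```r
x  <- scale(USArrests)                      # stand-in data
hc <- hclust(dist(x), method = "ward.D2")

out_file <- file.path(tempdir(), "dendrogram.pdf")  # illustrative file name
pdf(out_file, width = 14, height = 8)       # page size in inches
par(mar = c(2, 4, 2, 1))                    # smaller margins than the default
plot(hc, hang = -1, cex = 0.6, main = "Cluster dendrogram", xlab = "", sub = "")
dev.off()                                   # close the device to write the file
```

In RStudio, enlarging the Plots pane before plotting is another way around the same message.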

  • @harshinikasturi2983 1 year ago

    Error in solve.default(W) :
    system is computationally singular: reciprocal condition number = 6.36052e-19
    In addition: Warning message:
    In log(det(P)/det(W)) : NaNs produced
    Sir please help with this error.

    • @Guruprasad_A 1 year ago

      It's because of too much correlation in your data.

    • @sreevandana520 3 months ago

      @Guruprasad_A Then how to solve this problem, sir?
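
One possible way to act on the advice about correlated data is to drop one variable from each highly correlated pair before calling NbClust. The helper below is hypothetical (it is not from the video or from any package), and the 0.95 cutoff is an arbitrary assumption; the demo adds a deliberately redundant column to USArrests.

```r
# Hypothetical helper: drop the later column of each pair whose absolute
# correlation exceeds `cutoff`.
drop_correlated <- function(x, cutoff = 0.95) {
  cm <- abs(cor(x, use = "pairwise.complete.obs"))
  cm[lower.tri(cm, diag = TRUE)] <- 0          # look at each pair only once
  to_drop <- which(apply(cm, 2, max) > cutoff) # columns too close to an earlier one
  if (length(to_drop) == 0) return(x)
  x[, -to_drop, drop = FALSE]
}

demo <- as.data.frame(scale(USArrests))
demo$Murder2 <- demo$Murder * 2 + 1            # perfectly correlated with Murder
reduced <- drop_correlated(demo)
print(names(reduced))
```

Which variable of a pair to keep is a subject-matter decision; the helper simply keeps the one that appears first.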

  • @drchinmoymishra 2 years ago

    Please share the data file "pea"

    • @Guruprasad_A 2 years ago

      It's my research data, sir. I have to wait until it gets published in Krishikosh.

  • @tejapegada1781 15 days ago

    I have a problem with finding clusters!

  • @gpb100percent_bkpraveen 2 years ago

    library(cluster)
    library(class)
    library(clv)
    data(faba.s)
    # compute intercluster distances and intracluster diameters
    #kmeans
    kid
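
The comment above is cut off, so the clv call it was leading up to is not reconstructed here. As a rough base-R illustration of the two quantities named in its comment line, the sketch below computes an intracluster diameter (largest pairwise distance inside a cluster) and intercluster distances (between centroids) for a k-means fit. The data, k = 3 and these particular definitions are assumptions; clv offers several variants of each.

```r
x <- scale(USArrests)                       # stand-in for the scaled trait data
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)
cl <- km$cluster

# Intracluster diameter: largest pairwise distance inside each cluster.
intra <- sapply(split(as.data.frame(x), cl), function(g) max(dist(g)))

# Intercluster distance: Euclidean distance between cluster centroids.
inter <- as.matrix(dist(km$centers))

print(round(intra, 2))
print(round(inter, 2))
```

Compact, well-separated clusters show small diameters relative to the distances between centroids.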

  • @amalubiju8255 3 years ago

    Could you share the data set, please?

    • @Guruprasad_A 3 years ago

      It's my research data. I will share the link once it's published in Krishikosh by our university. I hope you understand.