Sir, you are a saint! Thank you thank you thank you!!! Not only did you make this easy but you gave me peace of mind. If we ever meet in person, I hope you will give me the honor of buying you a drink.
Best Video i ever saw on clustering algorithms. Great Work. Thanks for posting!
This is not an algorithm.
just to like this video and add a comment, I logged in from my google account.. awesome work Hefin... really appreciate your efforts.. :)
Very helpful! I've found the k-means and hierarchical algorithms most useful for my specific data. Thumbs up for this video!
Thanks! I'm glad you've been able to apply the algorithms and get meaningful results from your data.
Wonderful! I'm impressed. You're very bright! God has used you to bless me. Thank you!
Keep on making more videos. Congrats & Cheers!
Please do not stop posting videos! Congratulations on the explanations. Brazil here
Thank you! I'm glad you enjoyed it!
Dr Hefin Rhys, your sessions on clustering are truly amazing and well explained, thank you. Please bring some on PAM and CLARA.
This is a great video on Clustering. Thank you for putting it together.
wow! very well explained, thank you so much.
I appreciate the details and the beginner-friendliness of your tutorial!
Hi, I keep getting the error:
"Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ"
when I try to plot the graph to see the number of k with this code:
plot(1:10, betweenss_totss, type = "b",
ylab = "Between SS / Total SS", xlab = "Clusters (k)")
Do you know how I can solve this? I have tried looking online and have found how to make the x and y axes the same, however for this graph we don't need them to be the same. Any information you could spare would be great!!
Hi Sophie, the first two arguments to the plot() function need to be vectors representing the x and y axes, respectively. At the moment, you only supply a vector for the x axis (1:10) that only contains values 1 through 10. What are you trying to plot here? The easiest way to plot the clusters is to create a new column in the data.frame indicating cluster membership, plotting two variables against each other and colouring by this cluster variable.
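To make the suggestion above concrete, here is a minimal sketch, assuming the built-in iris data stands in for your own and that k = 3 is a reasonable choice. The names irisClustered and kmFit are illustrative and do not come from the video. It adds a cluster column to a data.frame and colours a scatter plot by it.

```r
# Assumption: iris is a stand-in dataset; k = 3 is chosen only for illustration.
data(iris)
irisScaled <- scale(iris[, 1:4])           # standardise the four numeric columns

set.seed(42)                               # k-means uses random starting centres
kmFit <- kmeans(irisScaled, centers = 3, nstart = 25)

# New column indicating cluster membership for each observation
irisClustered <- iris
irisClustered$cluster <- factor(kmFit$cluster)

# Plot two variables against each other, coloured by the cluster column
plot(irisClustered$Petal.Length, irisClustered$Petal.Width,
     col = as.integer(irisClustered$cluster), pch = 19,
     xlab = "Petal length", ylab = "Petal width")
legend("topleft", legend = levels(irisClustered$cluster),
       col = seq_along(levels(irisClustered$cluster)), pch = 19,
       title = "Cluster")
```

Once the cluster column exists, the same data.frame can be passed to write.csv() if you need the assignments in a spreadsheet.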
Congratulations on how you made these tutorials! They are really clear and helpful! Thank you
So clear and concise. Thank you!
Wow! Absolutely great video, well explained, with really helpful tips and tricks! Thank you for that!
Best video and explanation on clustering
Fantastic video! The only thing that would have been nice to see is how do you take these clustering solutions and create a new column with the cluster number for each observation that could then be exported to Excel and used to create a PPT slide. This would have really been helpful for work situations.
Hello, how do I find the gene cluster of a secondary metabolite if the genome of the organism is not sequenced? Can you help me?
You are a great teacher. Thanks a million.
Thank you for the information. Please help me understand how the results of PCA are used in clustering, because the PCA and clustering videos do not follow on from each other. If we want to continue to clustering with the results of a PCA, how should we do that? Again, thank you for the detailed information.
If you wish to cluster the results of a PCA, simply select the number of principal components you wish to retain (the first few that explain most of the variance), and use the data projected onto these components as the input data to your clustering algorithm. But I would suggest you compare the clusters based on the original data, and the clusters based on the reduced dimensional data, to see which performs better.
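As a rough illustration of the workflow described above, here is a minimal sketch, assuming the built-in iris data, a 90% cumulative variance cut-off, and k = 3. All three are assumptions made for the example, not recommendations from the video.

```r
# Assumptions: iris as example data, 90% variance threshold, k = 3.
data(iris)
irisScaled <- scale(iris[, 1:4])

# PCA on the scaled data
pca <- prcomp(irisScaled)
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# Retain the first components that together explain at least 90% of the variance
nComp <- which(cumsum(varExplained) >= 0.9)[1]
scores <- pca$x[, 1:nComp, drop = FALSE]   # data projected onto those components

set.seed(42)
kmPca <- kmeans(scores, centers = 3, nstart = 25)      # clusters from PCA scores
kmRaw <- kmeans(irisScaled, centers = 3, nstart = 25)  # clusters from original data

# Cross-tabulate the two solutions to see how far they agree
agreement <- table(pca = kmPca$cluster, original = kmRaw$cluster)
print(agreement)
```

Cluster labels are arbitrary, so look for one dominant cell per row in the table rather than matching numbers on the diagonal.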
Thank you so much
This is high quality content. Thanks.
Thanks Glenn!
Wonderful explanation. Thank you so much!!!!
Is there a way that you could figure out what metrics contribute to the hierarchical break?
At 30.37: in theory, a lower BIC means the model fits better, not the reverse as you said.
Nice spot. Using the usual definition of BIC, yes, lower is better. The mclust package rearranges the equation to be:
BIC = L - 0.5 * p * ln(n)
Such that the expression should be maximised instead of minimised.
@hefinrhys8572 OK, I didn't know that definition. Maybe they shouldn't call it BIC, to avoid confusing people. This part is for people who read this reply and may not know what we are talking about: BIC = k ln(n) - 2 ln(L), where k is the number of parameters estimated by the model, L is the maximized value of the likelihood function of the model, and n is the number of observations (sample size). The lower the BIC, the less information we lose.
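The two forms quoted in this thread can be related in a couple of lines. This is a sketch under two assumptions: that L in the rearranged form denotes the maximised log-likelihood, and that p there is the same parameter count as k here.

```latex
% Usual definition (lower is better), with \hat{L} the maximised likelihood:
\mathrm{BIC}_{\text{usual}} = k \ln(n) - 2 \ln(\hat{L})

% Multiply both sides by -1/2:
-\tfrac{1}{2}\,\mathrm{BIC}_{\text{usual}} = \ln(\hat{L}) - \tfrac{1}{2}\, k \ln(n)

% The right-hand side has the form  L - 0.5 * p * ln(n)  with  L = \ln(\hat{L}),  p = k,
% so maximising it is equivalent to minimising the usual BIC.
```

As far as I recall, the mclust documentation states its BIC as 2 * loglik - npar * log(n), which is the same quantity up to a factor of 2, so the ranking of models is unchanged.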
Very good introductions to clustering!!
Best on the web re clustering, thanks.
How can I perform cluster analysis on data that I specify as survey data first (with svydesign)? thanks!
Thank you for the wonderful tutorial!!! I have one question. Is there any way to create different data frames based on the clusters identified? For instance, having a new column for 'cluster' with the cluster # for each row. I am doing the following but am not sure if it is right.
kc
What a brilliant tutorial, thank you!
Thank you! Glad to be of help :)
Is there a clustering method suited to a particular number of variables? I have 32 variables in my study and I'm wondering whether there is a specific procedure for that many variables. Thanks for sharing your knowledge.
Do you know about Ward's method for cluster analysis?
Thank you for this great video. I need to use the GMM clustering algorithm with this data. Could you help me do that, please?
Incredible, so well explained, thanks!
Very Nice Explanation.
Learnt a lot of things :)
Hi sir, I'm new to R as I'm from a different background. Is clustering available for panel data?
Thank you for the quality content, brother. I am a beginner and this has been very helpful. Can you please provide more videos 🙂 Thank you
You are amazing! really helpful tutorials! Thank you!
❤️ Thank you so much! this was brilliant!
This is the best video I have watched about cluster analysis. I subscribed immediately after watching this. Here's my question.
I don't know if I skipped this in the video but how do I extract the vector containing the specific cluster each observation belongs to? I tried doing the model-based method.
I didn't show this in the video, sorry, but you can extract the vector of most probable clusters by accessing the $classification component of your mclust model. If you want more detailed information, i.e. the matrix of probabilities for each datum belonging to each class, extract the $z component. In R, it's always useful to call str() on your model objects so you can understand and inspect their structure. Using ?Mclust and reading the Value section also shows you what each component of the model object means.
@hefinrhys8572 oh okay thanks. I didn't think of using ?Mclust. I'm going to use your videos as reference. Very informative and understandable. Thanks again.
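Here is a minimal sketch of those steps, assuming the mclust package is installed and using the built-in iris data as a stand-in. The object names fit, clusters and probs are illustrative.

```r
library(mclust)   # assumption: mclust is installed

data(iris)
irisScaled <- scale(iris[, 1:4])

# Fit the model-based clustering; Mclust() chooses the number of components by BIC
fit <- Mclust(irisScaled)

clusters <- fit$classification   # most probable cluster for each observation
probs <- fit$z                   # matrix of membership probabilities (rows = observations)

head(clusters)
head(round(probs, 3))

# Inspect the top level of the model object to see what else is available
str(fit, max.level = 1)
```

Each row of probs sums to 1, and the classification is simply the column with the highest probability in that row.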
Hello Hefin,
another great tutorial. Allow me a question: how can I assign the appropriate cluster to a value (or value pair x, y) from the data set?
One-dimensional example:
values
1 - 10 = cluster 1
11 - 20 = cluster 2
21 - 30 = cluster 3
Which cluster does Var = 19 belong to?
Answer: Var (= 19) belongs to cluster 2.
How do I calculate this with kmeans or R (ggplot2)?
I am just curious: how would I go about removing a trend in my data and then using all the observations for clustering? I have several countries and several variables, but observed over time [annually]. Picking a single year to do the clustering is an option, but not a great one, since the explanation can be faulty if it's based on only one year, if I am not mistaken xD
Hi Wildfox, sorry for the late reply. So I presume you are looking for clusters of countries? One approach would be to model the relationship between time and a dependent variable (if you have one) for each country using a linear model. Then, use the estimated marginal means (also called least squares means) to cluster the countries. Estimated marginal means are the predicted means of variables in a linear model after accounting for all the other variables (i.e. after removing the effect of time).
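One way to sketch this idea in base R is shown below. Everything in it is an assumption for illustration: the panel data are simulated, the variable names gdp and health are hypothetical, and the time-adjusted means are computed by predicting from an additive lm() at the average year rather than with a dedicated package such as emmeans.

```r
# Simulated, purely illustrative panel: 6 countries observed over 10 years.
set.seed(1)
countries <- paste0("country", 1:6)
years <- 2000:2009
panel <- expand.grid(country = countries, year = years, stringsAsFactors = TRUE)

# Countries 1-3 and 4-6 differ in level; both variables also trend over time.
level <- rep(c(0, 0, 0, 5, 5, 5), times = length(years))
panel$gdp    <-  level + 0.5 * (panel$year - 2000) + rnorm(nrow(panel), sd = 0.3)
panel$health <- -level + 0.2 * (panel$year - 2000) + rnorm(nrow(panel), sd = 0.3)

# For one variable: fit value ~ country + year, then predict each country's
# mean at the average year, i.e. after accounting for the linear time effect.
adjustedMean <- function(variable) {
  fit <- lm(reformulate(c("country", "year"), response = variable), data = panel)
  newdata <- data.frame(
    country = factor(countries, levels = levels(panel$country)),
    year = mean(panel$year)
  )
  predict(fit, newdata = newdata)
}

emm <- sapply(c("gdp", "health"), adjustedMean)   # countries x variables matrix
rownames(emm) <- countries

# Cluster the countries on their time-adjusted means
clusters <- cutree(hclust(dist(scale(emm))), k = 2)
print(clusters)
```

With real data you would also want to check whether a single linear trend shared by all countries is a sensible model before clustering on the adjusted means.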
Excellent study material, but the video became blurred from 6.45 to about 16.05 minutes. How can I study the entire video clearly? Thank you for such a summary.
What a very useful tutorial video!!! Thank you so much!!
Fantastic. Subscribed!
Excellent job. Please provide the sample data file publicly so that we can arrange our data easily. Thanks in advance.
Great explanation 👍 you have such a soothing voice btw 😘
Thanks! Glad you enjoyed the voice ;)
Thank you for this excellent tutorial !
Thank you! Glad you enjoyed.
Great video!
Absolutely amazing!! Thank you! Please do more. I'm trying to learn R to do CFA to fit theory of planned behavior models. Can you do one on this?
You're very welcome. I'm sorry for the late reply to this; I'm not familiar with theory of planned behaviour models, but if you describe the problem you're trying to solve, I may be able to help.
Can anybody tell me how to find the defect clusters in the above dataset?
Hi Nikhil, I'm not quite sure I understand what you are trying to do. Do you mean defect clustering in application testing? This isn't a statistical computing application per se, but here is some information which may be of use: www.pitsolutions.ch/blog/defect-clustering-and-pesticide-paradox/
Does anyone else think his voice and accent sound like the man at Headspace? I cannot help deepening my breath when learning this...
Can you provide me with the data you used (iris)?
Hi ghdoia, so R and its packages come with datasets built in. To list them all, simply run data(), to load one (such as the iris dataset), run data(iris). Then, you can access the iris dataset by referring to it by name. I hope that helps.
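A short sketch of those two calls follows. The variable name available is illustrative; the listing is captured in it instead of being printed in full.

```r
# Capture the listing of built-in datasets instead of printing all of it
available <- data()
head(available$results[, "Item"])   # first few dataset names

# Load iris and refer to it by name
data(iris)
str(iris)    # 150 observations of 5 variables
head(iris)
```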
@hefinrhys8572
Thank you very much. I learnt a lot from you.
I've just had an issue when I ran the code below:
plot(1:10, betweenss_totalss, type = "b",
ylab = "Between SS/Total SS", xlab = "Cluster(K)")
It shows this message:
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Please can you tell me what that means and how I can solve this problem?
Thanks, you've helped me a lot 👍
Congratulations for your videos. One question: how can I get the means of the columns of irisScaled? (irisScaled is a list).
Thank you! Sorry for the late reply. So irisScaled is a matrix (which I think I omitted to mention in the video), and a succinct way to get the means of each column would be to use the apply() function:
apply(irisScaled, 2, mean)
where the first argument is the data, the second is an index (either 1 to iterate over rows or 2 for columns) and the third argument is the function to apply. You could also find the mean for each column individually like this:
mean(irisScaled[, 1])
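For reference, a minimal runnable version is sketched here. It assumes irisScaled was created with scale() on the four numeric iris columns, which is how I read the video. colMeans() is a base R shortcut that gives the same result as the apply() call.

```r
# Assumption: irisScaled is the scaled numeric part of iris
data(iris)
irisScaled <- scale(iris[, 1:4])
is.matrix(irisScaled)                      # scale() returns a matrix, not a list

meansApply <- apply(irisScaled, 2, mean)   # 2 = iterate over columns
meansCol <- colMeans(irisScaled)           # equivalent base R shortcut

all.equal(meansApply, meansCol)
round(meansCol, 10)                        # effectively zero, because the data are centred
```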
is there an easy way to output the Mclust cluster assignments to a csv file?
Sorry for the slow reply to this. I hope you worked it out, but I would extract the assigned groups using something like this:
# ADD NEW COLUMN TO DATAFRAME WITH MCLUST GROUP ASSIGNMENTS
irisScaled$Group
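The snippet above is cut off in this thread, so what follows is not the original code but a minimal sketch of one way to do it. It assumes the mclust package is installed, uses iris as stand-in data, and writes to a temporary file. The names irisGroups and outFile and the file name are hypothetical. Because irisScaled is a matrix, it is converted to a data.frame before a column is added with $.

```r
library(mclust)   # assumption: mclust is installed

data(iris)
irisScaled <- scale(iris[, 1:4])
fit <- Mclust(irisScaled)

# A matrix does not support `$`, so convert to a data.frame first
irisGroups <- as.data.frame(irisScaled)
irisGroups$Group <- fit$classification   # mclust group assignment per row

# Hypothetical output location; replace with your own path
outFile <- file.path(tempdir(), "iris_mclust_groups.csv")
write.csv(irisGroups, outFile, row.names = FALSE)
```

The resulting CSV has one row per observation, with the group assignment in the last column.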
another question: what's a recommended number of variables for a cluster analysis?
Honestly, it depends on what you are clustering. But there is a risk that too many variables will cause problems for the clustering. I think this comes down to trial and error, along with domain knowledge of which variables to include.
You helped me :) Thanks for the video
Big thumbs up ! Huge thumbs up... Really
Very well explained! Thanks
Thank you! I hope it was useful.
Thank you very much, extremely helpful
Thank you sir for the videos...!!!!!
You are welcome!
Can I have the code, please?
Hi Pankal, sorry for the late reply. I have now added a link to the R script in the video description above.
> betweenss_totss
Hi Priyanwada, I'm reading this on my phone so cannot check in R right now, but it looks like you have 'list()' in front of 'for(i in 10)'. This 'list()' function is empty and I'm not sure you need it anyway.
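For comparison, here is a minimal sketch of a loop that fills betweenss_totss for k = 1 to 10. It assumes the built-in iris data and is not the script from the video. Note that for(i in 10) runs only once, with i equal to 10, whereas for(i in 1:10) runs ten times.

```r
# Assumption: iris is used as stand-in data
data(iris)
irisScaled <- scale(iris[, 1:4])

set.seed(123)
betweenss_totss <- numeric(10)   # pre-allocate one slot per value of k

for (i in 1:10) {                # 1:10, not 10, so the loop runs for every k
  fit <- kmeans(irisScaled, centers = i, nstart = 10)
  betweenss_totss[i] <- fit$betweenss / fit$totss
}

# x and y now both have length 10, so plot() accepts them
plot(1:10, betweenss_totss, type = "b",
     ylab = "Between SS / Total SS", xlab = "Clusters (k)")
```

If the y vector is shorter than 1:10, empty, or all NA, plot() fails with errors like the ones reported elsewhere in this thread, so checking length(betweenss_totss) first is a quick diagnostic.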
Nice One ! really helpful.
Thanks Hefin for such an excellent video. Can you please upload some videos on time series?
Thanks Sahil! Glad it was useful. Thanks for the feedback, I may do a video on time series in the future!
It helped a lot. well done.
Glad it was of use :)
well explained video ,thanks!
You're welcome Rakesh! Thanks!
thanks for the video , well done.
You're very welcome. Happy clustering!
thank you! super helpful!
Thanks! Glad to help!
BTW GREAT video!!
Subscribed, good video
very nice tutorial, thanks a lot