Here's all the code for this video - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . Hope you enjoy it!
Outstanding! Thank you, man! This really helped me with my master's thesis. I really appreciate that you explained every small step, used as many visuals as possible, and focused on us being able to learn!
- In case others run into the same problem: with scikit-learn's KMeans, when calling fit(data), I got a "split" error message (AttributeError: 'NoneType' object has no attribute 'split'). I checked my BLAS, updated all libraries through conda, then restarted everything, and that resolved the problem, though it took a long time.
(I asked ChatGPT for help.)
Thank you, sir. This is how tutorials should be conducted: with in-depth explanations, step-by-step implementation, and the release of all code and datasheets to enable everyone to practice and advance their own personal projects. Congrats!
This was amazing. Brilliantly explained, demonstrated and presented clearly. Helped me so much with my current bootcamp task. Thank you.
From the bottom of my heart; thank you. This was so clear and easily understandable, fantastic video!
Terrific implementation! I also really liked the way you used PCA for iterative visualization... Nicely done
Thanks a lot, Tim! -Vik
Your explanation is absolutely clear. You have the best knowledge. Keep posting new topics and encourage us ❤
One of the best tutorials on the internet, thank you.
It's a very great job, the only one on YouTube that explains every piece of the code 👍👍
I never thought that we could visualize k-means by using dimension reduction (PCA)!! Awesome tutorial, sir
Absolutely fantastic
Would love a similar video on PAM clustering for mixed integer and categorical variables
Thanks for the suggestion :)
This is a nice and powerful way to learn. Thanks for teaching.
This is THE best tutorial online. I am so grateful for this! Thank you
Awesome stuff, Vik. Thanks for sharing.
Amazingly clear! Thank you so much, Dataquest!
Thank you, thank you, thank you!!! Being able to perform and explain what runs under the hood is really important- I agree. Please keep these videos coming 🙌🏼❤️ The “From Scratch” series :)
That's a great idea :) I'm working on linear regression from scratch.
Great video. Really helpful looking at implementing it manually. Thank you so much
Such good and clearly delivered material. Thanks a lot!
very helpful and clear explanations - thank you!
Very insightful and step by step code explanation.
Thank you for this excellent tutorial
:)
Glad it was helpful! -Vik
@@Dataquestio Vik,
how do I assign new data points to a cluster? I.e., once I have run my k-means clustering, how do I use it to assign a cluster to new datasets, such as out-of-time datasets or testing/validation datasets?
There doesn't seem to be anything online about this. Is it the case that I'd have to re-run k-means with the new data included?
Thanks in advance
Elvy
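In case this helps anyone else with the same question: you don't have to re-run k-means. One common approach is to assign each new point to the nearest existing centroid. A minimal sketch (my own helper, assuming `centroids` has one row per cluster and the new data uses the same columns and scaling as the training data):

```python
import numpy as np
import pandas as pd

def assign_clusters(new_data, centroids):
    """Assign each row of new_data to its nearest centroid (Euclidean distance)."""
    # One row of distances per centroid, one column per new data point
    distances = centroids.apply(
        lambda c: np.sqrt(((new_data - c) ** 2).sum(axis=1)), axis=1
    )
    # For each data point, take the label of the closest centroid
    return distances.idxmin(axis=0)

# Toy example: two centroids (rows), three new points
centroids = pd.DataFrame([[0.0, 0.0], [10.0, 10.0]], columns=["x", "y"])
new_data = pd.DataFrame([[1.0, 1.0], [9.0, 9.5], [0.5, 0.0]], columns=["x", "y"])
labels = assign_clusters(new_data, centroids)
print(labels.tolist())  # → [0, 1, 0]
```

The main caveat is that the new data must be scaled with the same parameters (e.g. the same min/max) that were fit on the training data before assigning.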
This is amazing, keep up the good work
I can't thank you enough. Thank you for this content.
You might be a hero... thanks a lot for the content...
Great video! BTW, why did you use the geometric mean instead of the arithmetic mean for finding the clusters? Please make a whole series on building models from scratch.
Thanks a LOT for this tutorial!😀
Thank you very much for this clearly understood video.
I loved your video, it is so well explained!! I had only used scikit-learn, but now I understand better how it works.
But I have a question: why is it not good to use height and weight as features?
Excellent video !! Many thanks 🙏🏼
Thanks, teacher. Could you show how to calculate the SSE for a k-means clustering solution when you choose not to use KMeans directly from the sklearn package?
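In case it helps: the SSE is just the sum of squared distances from each point to its assigned centroid, so it can be computed directly from the from-scratch outputs. A sketch (my own helper names; assuming `centroids` is indexed by cluster label and `labels` aligns with `data`'s rows):

```python
import pandas as pd

def sse(data, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for cluster, points in data.groupby(labels):
        # Squared per-feature deviations from this cluster's centroid, summed
        total += ((points - centroids.loc[cluster]) ** 2).sum().sum()
    return total

# Toy example: two points near centroid 0, one exactly on centroid 1
data = pd.DataFrame({"x": [0.0, 2.0, 10.0], "y": [0.0, 0.0, 0.0]})
labels = pd.Series([0, 0, 1])
centroids = pd.DataFrame({"x": [1.0, 10.0], "y": [0.0, 0.0]}, index=[0, 1])
print(sse(data, labels, centroids))  # → 2.0
```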
More videos like these please on other algos
Thanks for the video. It is just brilliant. One of the best ones on Clustering that I have seen for sure!
I just had a question. I tried using this on data with 13 variables. It worked perfectly, but when I scale the data with standardization or sklearn's scaler rather than min-max, I get an error after the PCA transformation code saying there are NaNs in the data variable, when there clearly were none before. I can't put my finger on what is causing this. Would appreciate any insights on your part. Thanks
great video, you are a great teacher
Can you make a video implementing Local Outlier Factor (LOF) with Pandas and NumPy in Python for identifying outliers?
It's a very well explained video. Just a quick question, how can we add random_state in the final model code?
Very insightful explanation of the code. By the way, how can I plot the elbow plot of SSE vs. k, computing the SSE at every k value iteratively? This would help me optimise the k value with this code... Looking forward to hearing from you
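One way to do this, if you don't mind using sklearn just for the elbow check: KMeans exposes the SSE as `inertia_`, so you can loop over k and plot. A sketch (with synthetic stand-in data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))  # stand-in for your scaled data

ks = range(1, 9)
sses = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sses.append(model.inertia_)  # inertia_ is the SSE for this k

plt.plot(list(ks), sses, marker="o")
plt.xlabel("k")
plt.ylabel("SSE")
plt.title("Elbow plot")
plt.savefig("elbow.png")
```

With a from-scratch implementation, you'd replace the KMeans call with your own loop and an SSE helper, but the plotting part stays the same.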
Please unpack what is going on in centroid = data.apply(lambda x: float(x.sample())). Without the float cast, the line returns a DataFrame with NaN values in the non-sampled rows. There appears to be some voodoo magic going on here, driven by the float cast!
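In case a demonstration helps: x.sample() returns a one-element Series that keeps its random row index, so apply aligns each column's result on those indices and fills the gaps with NaN; collapsing each sample to a scalar makes apply return a plain Series with one value per column. A small sketch (using `.iloc[0]`, which collapses to a scalar the same way the float cast does):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Each column's sample() keeps its row index, so apply returns a DataFrame;
# when the columns happen to sample different rows, the misalignment shows up as NaNs.
messy = df.apply(lambda x: x.sample(random_state=0))
print(type(messy))

# Collapsing each sample to a scalar gives one value per column: a plain Series.
centroid = df.apply(lambda x: x.sample(random_state=0).iloc[0])
print(centroid)
```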
Thank you so much.
Hi, Thanks so much for the video!! Can you please advise on how one adds a legend to the cluster scatter plots? I've been trying but can't figure it out.
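One way that may help (a sketch with stand-in data; matplotlib's `legend_elements()` builds one legend entry per scatter color, i.e. per cluster):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))  # stand-in for your 2D (PCA) data
labels = np.repeat([0, 1, 2], 10)  # stand-in cluster labels

scatter = plt.scatter(points[:, 0], points[:, 1], c=labels)
# legend_elements() returns one handle per unique color (i.e. per cluster)
handles, _ = scatter.legend_elements()
plt.legend(handles, [f"Cluster {i}" for i in range(3)], title="Clusters")
plt.savefig("clusters_with_legend.png")
```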
Thanks a lot, that was a great help!
Your explanation is great. I noticed that the "k" parameter of the "new_centroids" method has no effect in the application. Correct me if I'm wrong.
make a video on ''customer segmentation and clustering in retail using machine learning'' using real retail dataset
Hi, thank you so much for this clear tutorial.
I need one another help from you. How do we get this cluster result exported to a CSV file?
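In case it helps: one simple way is to attach the labels as a column and write the frame out (assuming `labels` aligns with the data's index). A sketch:

```python
import pandas as pd

data = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [4, 5, 6]})
labels = pd.Series([0, 1, 0])  # stand-in cluster assignments

# Attach the cluster labels as a column, then write everything out
result = data.assign(cluster=labels)
result.to_csv("clusters.csv", index=False)
print(result["cluster"].tolist())  # → [0, 1, 0]
```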
Great video!
I think k = 4, because the young players include two groups, high overall and low overall: young stars at a high league level and normal young players.
can we follow up based on the identified clusters, by using them to regress for another variable, e.g. with a logistic regression?
Thanks for this, I really don't get how I can possibly use it for fraud detection
I didn't understand why we took the geometric mean instead of the arithmetic mean. Can you explain that, please?
Could you explain the meaning of the x- and y-axis?
good tutorial thank you
what is the maximum amount of variables recommendable for a clustering analysis?
How do you know which 5 features to pick at the beginning?
Can someone help with the issue at 29:48? When I use old_centroids = centroids in my code, this error comes up:
'DataFrame' object has no attribute 'equal'
it should be .equals with an s
TYSM :)
I am getting an error when calculating centroids - 'float' object has no attribute 'sqrt'..... Please help
How would you include Ordinal features ?
do we have to get rid of outliers beforehand?
Can we also use player position as one of the features? If yes, then how? (Because that isn't numeric.)
Keep sending the emails, thanks for the vids
Can a feature with dichotomous data be used?
Cool 👍
Which platform you are using for coding??
Sir how to find out the individual elements present in each cluster? For example, I'm working on a dataset of genes. How will i get the names of the individual genes that are present in each cluster?
I'm looking for the same thing right now. Were you able to find anything? If yes, then please help me too 😊
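In case it helps either of you: if the gene names are the DataFrame's index and `labels` is aligned with it, boolean indexing gives you the members of each cluster. A sketch with made-up gene names:

```python
import pandas as pd

data = pd.DataFrame(
    {"expr1": [0.1, 0.9, 0.2, 0.8], "expr2": [0.2, 0.8, 0.1, 0.9]},
    index=["geneA", "geneB", "geneC", "geneD"],  # gene names as the index
)
labels = pd.Series([0, 1, 0, 1], index=data.index)  # cluster per gene

# List the gene names present in each cluster
for cluster in sorted(labels.unique()):
    members = data.index[labels == cluster].tolist()
    print(cluster, members)
```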
If we have IP addresses in the data, should we still scale the data? I had a dataset where IP addresses and fraud transactions are given, and I converted the IP addresses to numerical data.
Why did you not apply fit_transform to the centroids_2d variable as well?
Fit transform will both compute the fit and transform the data. In this case, we already computed the fit on the data, and we want to just apply the same fit to the centroids, so that they're all on the same scale and can be visualized. -Vik
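For anyone who wants to see this in code, the pattern looks roughly like this (a sketch with sklearn's PCA and random stand-in data; the point is that `transform` reuses the projection learned during `fit_transform`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 5)))      # stand-in for the scaled data
centroids = pd.DataFrame(rng.normal(size=(5, 3)))  # one centroid per column

pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)          # learns the projection AND applies it
centroids_2d = pca.transform(centroids.T)  # reuses the SAME learned projection
print(data_2d.shape, centroids_2d.shape)   # → (50, 2) (3, 2)
```

Calling fit_transform a second time on the centroids would fit a new, different projection, and the centroids would no longer line up with the data in the plot.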
Amazing!! But how do I implement the scatter plot without PCA?
Did you figure it out? I'd like to know too.
@@animal40 Just leave out the PCA. Still transpose the centroids, though, and remember to use iloc. Here's my code:
def plot_clusters(data, labels, centroids, iteration):
    centroid_T = centroids.T
    plt.title(f'Iteration {iteration}')
    plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=labels)
    plt.scatter(x=centroid_T.iloc[:, 0], y=centroid_T.iloc[:, 1])
    plt.show()
@@akosuakoranteng3327 thanks very much for this. Tried a few things today but couldn't quite get it working. Will try again tomorrow with this. Appreciate it, cheers.
Can my clusters be different from yours with the same code?
What does groupby() return? How can I see what groupby() has returned? Can you please also share code showing what data.groupby(labels) does?
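In case a small demonstration helps: data.groupby(labels) returns a DataFrameGroupBy object, which you can inspect by iterating over it or via `.groups`. A sketch:

```python
import pandas as pd

data = pd.DataFrame({"x": [1.0, 2.0, 10.0, 11.0], "y": [1.0, 2.0, 10.0, 11.0]})
labels = pd.Series([0, 0, 1, 1])  # stand-in cluster assignments

grouped = data.groupby(labels)
print(type(grouped))   # a DataFrameGroupBy object
print(grouped.groups)  # mapping of label -> row indices

# Iterating yields (label, sub-DataFrame) pairs
for label, group in grouped:
    print(label, group.shape)

# Aggregations collapse each group to one row, e.g. the arithmetic centroid
print(grouped.mean())
```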
At 10:08, how did you know row 0 belongs to Lionel Messi?
Please, can someone tell me how to apply the arithmetic mean instead of the geometric mean in the lambda function for getting new centroids? I am dealing with negative-valued datasets, so the geometric mean is of no use to me. Would it be like this: data.groupby(labels).apply(lambda x: np.mean(x, axis=0))
Thank you, I required arithmetic mean too and your code worked for me.
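For anyone else needing this, here's that arithmetic-mean update on a tiny example with negative values (the `.T` at the end assumes the video's convention of centroids as columns; `groupby(labels).mean()` would be an equivalent shortcut):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [-2.0, -4.0, 5.0, 7.0], "y": [1.0, 3.0, -1.0, -3.0]})
labels = pd.Series([0, 0, 1, 1])  # stand-in cluster assignments

# Arithmetic mean handles negative values fine (unlike the geometric mean)
new_centroids = data.groupby(labels).apply(lambda x: np.mean(x, axis=0)).T
print(new_centroids)
```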
does anyone have the code ?
Code is here - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . It's linked in the description
@@Dataquestio Sir, I'm getting an error doing it from scratch. Is there any platform where I can send my query?
Nice but a little too much for a newbie 😅
SUUUUIII
Great!