Clustering Algorithm for mixed datatypes - K-Prototypes
- Published: Oct 1, 2024
- #datascience #machinelearning #ml
The k-means based methods are efficient for processing large data sets, but they are limited to numeric data. K-means optimizes a cost function defined on the Euclidean distance between data points and cluster means, and minimizing that cost function by calculating means restricts it to numeric data.
This is where K-Prototypes shines. When applied to numeric data the algorithm is identical to k-means. For categorical data, the algorithm uses a simple matching dissimilarity measure, replaces the cluster means with modes, and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function.
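The combined cost described above can be sketched in a few lines. This is a minimal illustration of the dissimilarity measure, not the kmodes library's implementation; the sample values and the gamma weight are made up:

```python
import numpy as np

# Sketch of the K-Prototypes dissimilarity between a record and a cluster
# prototype: squared Euclidean distance on the numeric attributes, plus a
# weight gamma times the count of mismatched categorical attributes.
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    numeric_part = np.sum((np.asarray(x_num, float) - np.asarray(proto_num, float)) ** 2)
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# One numeric gap of 2 (squared -> 4) plus one categorical mismatch weighted by 0.5
d = mixed_dissimilarity([1.0, 3.0], ["Basic", "CA"], [1.0, 5.0], ["Premium", "CA"], gamma=0.5)
# d -> 4.5
```

The algorithm assigns each record to the prototype minimizing this mixed dissimilarity, then recomputes means for numeric columns and modes for categorical ones.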
There is a copy-paste bug in my code while converting datatypes in the numpy array. Fixed code is below. Sorry for the inconvenience.
github.com/srivatsan88/RUclipsLI/blob/master/K_Prototype_for_Mixed_Datatypes.ipynb
Concepts remain the same
Thanks for this great video. May I ask where the dataset comes from? I'd like to know the details of each column. Thanks.
How do we identify the best number of clusters for this particular data? We have the elbow method or silhouette coefficient for finding a good number of clusters, but in the case of K-Prototypes how do we find it?
Have you got an answer for your question?
Hi! Great video! Just a small doubt. Does this algorithm allow the calculation of the silhouette to evaluate the model performance?
@AIEngineering I don't agree with the approach. If you run this at the end (sb.pairplot(marketing_df[['CLV','Income','monthly_premium','Months_Since_Policy_Inception','cluster']], hue='cluster')) you will see clusters separated just by straight lines. Please, could you run the seaborn plot and give me your feedback? Many, many thanks :)
I was stuck with a dataset that has mixed data types when I tried to apply k-means. Thanks for sharing! This really helps me.
First time I am hearing about the K-Prototypes algorithm; I never read about it anywhere, nor in any other YouTube videos. Whether it is a long or a short video, there is so much to learn in your videos... some special ingredient you mix :)
This video is really helpful. I didn't even know about k-prototype algorithm.
This is one of the best channels for ML, because you not only teach ML but also deployment over Kubernetes and other platforms (which is one of the most important things).
Will surely recommend the channel to my friends.
Thank you Manan :)
Firstly, the tutorial was crisp and informative. Shout out to the maker.
But how does this model differentiate between categorical nominal vs. ordinal data?
E.g., in this dataset the state feature is nominal and the coverage feature is ordinal; how is this information interpreted by the model?
Shhiv, it does not differentiate. I would say try treating the ordinal data as numerical and see. I have not tried it myself, but depending on the data and related variables it should be able to separate them into clusters.
Nicely done. Can you also confirm if it can help deduplicate categorical records? The problem at hand: a customer enters his or her info through multiple entry points, like a dealer portal, dealer mobile app, etc. To be able to send a promotional item, can K-Modes help with identifying duplicates among the many customer records coming in from multiple sources and give me a unique record, so I can send the promotion to only that record?
Rajaram, I do not think deduplication can be handled by any model. I would prefer to pre-engineer the data; maybe you could handle the multiple entry points as additional features.
LIFE SAVER!!!! Never heard of it. I have SO much categorical data, I almost lost hope. Thank you!!
The reason he was able to distinguish clusters based on CLV scores is that, compared to the other numerical variables, CLV scores are much higher in magnitude and hence get more importance. Like he said, standardisation/normalisation of all numerical variables is a mandate when you have variables with varying magnitudes (which is almost always).
Hi, first of all thank you so much for this thorough tutorial on K-prototypes!
I've got a question: I noticed that you didn't normalize the numerical columns, could you please explain why?
Chuhan, as I mentioned in the video, this was just a quick overview of K-Prototypes, not a focus on the model outcome. But in the real world, if you have numeric values on different scales it is better to normalize. I will walk through this once I create a detailed video on it.
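As a quick sketch of that normalization step (the values below are invented, not from the video's dataset):

```python
import numpy as np

# Two numeric columns on very different scales (think CLV vs. monthly premium)
X = np.array([[8000.0, 60.0],
              [12000.0, 90.0],
              [4000.0, 120.0]])

# Z-score normalization so no single high-magnitude column dominates the
# Euclidean part of the K-Prototypes cost
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
# Each column of X_scaled now has mean 0 and standard deviation 1
```

Without this step, a column like CLV (thousands) swamps a column like monthly premium (tens), which is exactly what happened in the demo once the conversion bug was fixed.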
@@AIEngineeringLife Thank you so much for your timely reply! Sorry I didn't pay close attention. And thanks again for your video! This is the most helpful tutorial I've ever found!
This was a great video. Today I have learnt something new. Thanks for sharing.
However, I would like to know how the results compare to the original k-means. Did including the categorical variables provide any benefit?
K-means does not work on categorical data, as it computes distances as if the values were numerical. K-Prototypes is similar to k-means for numerical data and to k-modes for categorical data. I would say rather than compare them, they complement each other.
Hello Sir,
Nice Video!
Is there a way to find the Outliers of each cluster?
Amazing work. but how to visualize it as scatter plot?
Hi, how can we know the profile of each cluster segment? It would be great to have a visualisation of the different profiles. Thanks.
Why haven't you scaled the numeric variables like we do in k-means?
Can't we plot a graph for the above 3 clusters? It would make it easy to find patterns or data insights. Any idea how to plot a graph for these 3 cluster datasets?
I have the same question @AIEngineering
Very well explained, sir! Can we have a video on the Dask framework and how it compares to Pandas data frames for parallel computation?
Karndeep, I have one on cuDF that can run on Dask. Let me see if I can create one on Dask itself.
@@AIEngineeringLife It would be a great help to get everything from your side as I am learning from you always.
I have converted my categorical nominal variables to numeric, so some numeric columns are categorical and some are continuous. Can I use K-Prototypes with such a numeric dataset?
Yes, that is fine. Just make sure to give the positions of the categorical variables in the K-Prototypes fit method.
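For example, those positions can be derived from the dtypes rather than hard-coded. This is a sketch with illustrative column names, not the video's exact dataset:

```python
import pandas as pd

# Toy frame mixing categorical (object) and numeric columns
df = pd.DataFrame({
    "State": ["CA", "WA", "CA"],                # categorical
    "CLV": [8000.0, 12000.0, 4000.0],           # numeric
    "Coverage": ["Basic", "Premium", "Basic"],  # categorical
    "Income": [56000.0, 72000.0, 31000.0],      # numeric
})

# Positional indices of the categorical columns; these are what you pass
# as the categorical= argument of KPrototypes.fit_predict
cat_positions = [i for i, dtype in enumerate(df.dtypes) if dtype == object]
# cat_positions -> [0, 2]
```

With the kmodes package this would then look like kproto.fit_predict(df.to_numpy(), categorical=cat_positions).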
I think there is a mistake at 6:34, while creating the numpy array: all the values are the same?
What is the most sensible way to choose the number of clusters and to plot these clusters? Does anyone have any idea? Thank you.
Aveen, I don't think there are a lot of options, frankly. The silhouette and elbow methods are options, but you might need to run over a range of cluster counts to find the optimum.
Hi, Thanks for posting this! As you demo-ed here, this algo works well over mixed data types. Would it be of much help if I have a dataset that has all categorical/qualitative variables? TIA
Divya, you can check the K-Modes algorithm for categorical data. In fact, in K-Prototypes, if you pass only categorical data it becomes K-Modes.
Thank you very much for this. This saved me.
What if any of my categorical variables have null values in the original dataset?
For categorical columns, missing values can be set to np.nan, but it is always advisable to impute missing values based on domain understanding.
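A minimal sketch of one simple baseline, mode imputation for a categorical column (the column and its values are made up):

```python
import pandas as pd

# A categorical column with a missing entry
coverage = pd.Series(["Basic", None, "Premium", "Basic"])

# Simple baseline: fill missing values with the most frequent category.
# Domain-driven imputation is still preferable when you understand the data.
coverage_imputed = coverage.fillna(coverage.mode().iloc[0])
# coverage_imputed -> ["Basic", "Basic", "Premium", "Basic"]
```

Mode imputation keeps the column usable by the matching dissimilarity measure, but it can inflate the majority category, which is why domain knowledge should drive the choice.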
Sir, how do I calculate the K-Prototypes formula manually, step by step, in a table?
k-prototypes = k-means (Euclidean distance on numeric attributes) + k-modes (simple matching dissimilarity on categorical attributes)
How do we draw a conclusion about which cluster each data point belongs to?
Thank you so much, man, you explain very well. I watched a lot of k-means videos but they gave me lots of errors; finally found one that doesn't.
Your GitHub link is not connecting. Can you drop the link?
Here you go...github.com/srivatsan88/RUclipsLI
Sir,
Just a small doubt.
The code snippet for mark_array: you want to convert columns 1, 3, 5 and 6 to float,
so why are you copying the converted float values of column 1 into columns 3, 5 and 6?
Is it a mistake while copying? Please correct me if I am wrong.
I mean:
mark_array[:,1] = mark_array[:,1].astype(float) --> this converts column 1's values to float and assigns them back to column 1.
So why is mark_array[:,1].astype(float) being copied into mark_array[:,3], mark_array[:,5] and mark_array[:,6]?
Venkatesh, extremely sorry. It looks like a copy-paste mistake, and that is why I think it grouped the clusters only by the CLV column. I did not pay too much attention to it; my bad, did it in a hurry. I will update the notebook on GitHub and pin a comment to this video as well. Thanks for pointing it out.
Thanks sir for your reply.
I have added a note as well to the video. Thanks again for pointing it out :)
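For anyone landing here, the corrected conversion loops over each intended column instead of reusing column 1's values. The sample rows below are invented; the column positions follow the video:

```python
import numpy as np

# Object array mixing strings and numbers, as produced from the DataFrame
mark_array = np.array([["CA", 8000, "Basic", 60, "M", 12, 3],
                       ["WA", 4000, "Premium", 90, "F", 24, 1]], dtype=object)

# Convert each numeric column in place -- the original notebook mistakenly
# assigned column 1's converted values to columns 3, 5 and 6 as well
for col in [1, 3, 5, 6]:
    mark_array[:, col] = mark_array[:, col].astype(float)
```

Looping over the column indices makes the intent explicit and avoids the copy-paste class of bug entirely.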
@@AIEngineeringLife Hi, I'm from Ecuador. I copied the code from GitHub but I get an error: "could not convert string to float". Please help.
It's taking a lot of time for a dataset with 260k rows and 7 columns. Any suggestions on how to speed it up?
Very nicely explained, thanks! How do you handle ordinal variables, e.g., if you have model year as 2009, 2010, etc.? Do you have to make them categorical? They should not be treated like numbers.
Any replies?
Even I was wondering the same.
I tried training with ordinal values anyway, but I explicitly marked them as categorical indices (one of the parameters when you call the model), and I got good results.
I think you mean cluster_dict is a list (same length as the DataFrame row count), not a dictionary.
Hi sir, thank you for creating this video. It was an amazing video! But I want to ask about the data type: why do you convert the data to float?
Very good, to the point. Great job!
Sir, a really informative post. We could have done this segregation manually based on CLV. I would just like to know, in general, how it adds value over traditional methods.
In this case it was just a demo, and you are right. But with further feature engineering, multiple variables can contribute to a cluster, and it becomes difficult to do that manually in a high-dimensional feature space.
@@AIEngineeringLife Thank you, sir. But is there any limit on the number of features we can use in this algorithm when we have, say, more than a million rows?
How did you decide the value for the number of clusters? For example, you set cluster=3 here.
Vignesh, in this video I am mainly showing how K-Prototypes works. I will dive deeper into clustering in future videos, where I will cover selecting the number of clusters.
Thanks
Have you got an answer for your question?
Does the clustering data also include the outcome/response variable? Secondly, can we run a decision tree on this clustered data?
Please can you make a video on the mathematics of K prototype method
Hi! First of all, thank you so much for this video, it is super helpful.
can you explain the mathematics behind this with a suitable simple example
I used this for facial expression mixed data, it worked wonders. Thank you!
How did you decide the right number of clusters?
@@priyankamehta2827 You can use the elbow approach. KPrototypes in the kmodes package has an attribute called cost_, defined as the sum of the distances of all points to their respective cluster centroids.
# Choosing the optimal K
from kmodes.kprototypes import KPrototypes
import matplotlib.pyplot as plt

cost = []
for num_clusters in range(1, 8):
    kproto = KPrototypes(n_clusters=num_clusters, init='Cao')
    kproto.fit_predict(Data, categorical=[0,1,2,3,4,5,6,7,8,9])
    cost.append(kproto.cost_)
plt.plot(cost)
source: github.com/aryancodify/Clustering
Hi sir, thank you very much. I just have one question: what happens if there's a binary variable like "has a loan?", where the values are 1 for yes and 0 for no? How can I treat it? Do I have to convert it to float too?
Can you share the link to this code, please?
It is in my git repo
github.com/srivatsan88/RUclipsLI/
Hello! Thanks for sharing! I have a question: how do we identify the best number of clusters for this particular data? We have the elbow method or silhouette coefficient for finding a good number of clusters, but in the case of K-Prototypes how do we find it?
Nicely explained !! Thanks.
I searched and read many sources over the last 2 weeks to understand K-Prototypes; you are the only one who made me understand. Thanks.
Thanks Sir. It is extremely helpful
Thanks, sir. Can we use this approach on textual data, like clustering reviews?
Hi sir, can you suggest something on variable clustering ?
is it possible to plot this?
thanks for this great explanation
Thank you for providing great info.
Can you please provide brief details about the intuition?
As mentioned, for numeric data it uses Euclidean distance and for categorical data it uses modes. But how do these get combined? Moreover, for categorical data does it use any tree-based method like Gini or entropy?
I really appreciate your assistance.
Sumit, check this paper; it has all the details you are asking for:
grid.cs.gsu.edu/~wkim/index_files/papers/kprototype.pdf
Thank you, that was really helpful
Hi