Clustering Algorithm for mixed datatypes - K-Prototypes

  • Published: Oct 1, 2024
  • #datascience #machinelearning #ml
    K-means-based methods are efficient for processing large data sets, but they are often limited to numeric data. K-means optimizes a cost function defined on the Euclidean distance between data points and cluster means, and minimizing that cost by calculating means is what restricts it to numeric data.
    This is where K-Prototypes shines. Applied to numeric data, the algorithm is identical to k-means. For categorical data, it uses a simple matching dissimilarity measure, replaces the cluster means with modes, and uses a frequency-based method to update the modes during clustering so as to minimize the clustering cost function.
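
The description above can be illustrated with a small sketch of the combined dissimilarity. It assumes the usual K-Prototypes form: squared Euclidean distance over numeric columns plus a weight (called `gamma` here) times the number of mismatching categorical columns. The function name, the sample rows and the value of `gamma` are illustrative assumptions, not taken from the video.

```python
def mixed_dissimilarity(point, prototype, categorical, gamma=1.0):
    """Squared Euclidean distance on numeric columns plus gamma times
    the number of mismatching categorical columns (simple matching)."""
    cat = set(categorical)
    numeric_part = sum(
        (float(point[j]) - float(prototype[j])) ** 2
        for j in range(len(point))
        if j not in cat
    )
    categorical_part = sum(1 for j in cat if point[j] != prototype[j])
    return numeric_part + gamma * categorical_part


# Hypothetical record and cluster prototype: columns 1 and 3 are categorical.
customer = [2.0, "Basic", 1.0, "Urban"]
prototype = [1.0, "Basic", 3.0, "Rural"]
print(mixed_dissimilarity(customer, prototype, categorical=[1, 3], gamma=0.5))
```

With only numeric columns the categorical term vanishes and the measure reduces to the k-means distance; with only categorical columns it reduces to the simple matching count.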

Comments • 86

  • @AIEngineeringLife
    @AIEngineeringLife  4 years ago +12

    There is a copy-paste bug in my code where the data types are converted in the NumPy array. The fixed code is below. Sorry for the inconvenience.
    github.com/srivatsan88/RUclipsLI/blob/master/K_Prototype_for_Mixed_Datatypes.ipynb
    Concepts remain the same

    • @xixiongguo2986
      @xixiongguo2986 4 years ago +1

      thanks for this great video. May I ask where the dataset comes from, as I'd like to know about the details of each column, tks.

  • @karndeepsingh
    @karndeepsingh 4 years ago +5

    How do we identify the best number of clusters for this particular data? We have the elbow method or the silhouette coefficient for finding a good number of clusters, but in the case of k-prototypes, how are we going to find it?

    • @Aveen3
      @Aveen3 4 years ago +1

      Have you got an answer for your question?

  • @fri0
    @fri0 3 years ago +5

    Hi! Great video! Just a small doubt. Does this algorithm allow the calculation of the silhouette to evaluate the model performance?

  • @JorgeAlvarezLopategui
    @JorgeAlvarezLopategui 4 years ago +1

    @AIEngineering I don't agree with the approach. If you run this at the end:
    sb.pairplot(marketing_df[['CLV','Income','monthly_premium','Months_Since_Policy_Inception','cluster']], hue = 'cluster')
    you will see that the clusters are "easily defined" by straight lines alone. Could you please run the seaborn plot and give me your feedback? Many, many thanks :)

  • @subbaraogannavarapu7405
    @subbaraogannavarapu7405 4 years ago +4

    I was stuck with a dataset that has mixed data types when I tried to apply k-means. Thanks for sharing! This really helps me.

  • @raviirla459
    @raviirla459 4 years ago +3

    This is the first time I am hearing about the K-Prototypes algorithm. I have never read about it anywhere or seen it in any other YouTube videos. Whether your videos are long or short, there is so much to learn in them. You mix in some special ingredient :)

  • @mananbedi
    @mananbedi 4 years ago +4

    This video is really helpful. I didn't even know about the k-prototypes algorithm.
    This is one of the best channels for ML because you teach not just ML but also deployment on Kubernetes and other platforms (which is one of the most important things).
    Will surely recommend the channel to my friends.

  • @shhivram929
    @shhivram929 4 years ago +2

    Firstly, the tutorial was crisp and informative. Shout-out to the maker.
    But how does this model differentiate between categorical nominal and ordinal data?
    E.g., in this dataset the state feature is nominal and the coverage feature is ordinal. How is this information interpreted by the model?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Shhiv, it does not differentiate. I would suggest treating the ordinal data as numerical and seeing how it goes. I have not tried it, though; depending on the data and the related variables, it should be able to separate it across clusters.
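
A minimal sketch of the "treat ordinal data as numerical" suggestion: map each level to its rank so that the column can go in with the numeric features. The level names and their order below are hypothetical; the real ones have to come from the dataset itself.

```python
# Hypothetical ordering for an ordinal column; the real levels and their
# order must come from the dataset documentation.
COVERAGE_ORDER = ["Basic", "Extended", "Premium"]


def encode_ordinal(values, ordered_levels):
    """Map each level to its rank (0.0, 1.0, 2.0, ...) in the given order."""
    rank = {level: float(i) for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


coverage = ["Premium", "Basic", "Extended", "Basic"]
print(encode_ordinal(coverage, COVERAGE_ORDER))
```

The encoded column would then be left out of the categorical index list so that it is handled with the numeric distance. Whether equal spacing between ranks is appropriate depends on the variable.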

  • @sofluzik
    @sofluzik 4 years ago +2

    Nicely done. Can you also confirm whether it can help deduplicate categorical records? The problem at hand is that a customer enters his or her info through multiple entry points, such as a dealer portal, a dealer mobile app, etc. To be able to send a promotional item, can k-modes help identify duplicates among the many customer records coming in from multiple sources and give me a unique record, so that I send the promotion only to that record?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Rajaram, I do not think it can be handled by any model. I would prefer to pre-engineer the data, and you might want to handle the multiple entry points as an additional feature recording the channel.

  • @nmuralikrishna4599
    @nmuralikrishna4599 3 years ago +1

    LIFE SAVER !!!! Never Heard of it . I have SOOOO much categorical data , i almost lost hope . TQQQQ

  • @amithnambiar9818
    @amithnambiar9818 2 years ago +1

    The reason he was able to distinguish clusters on the basis of CLV scores is that, compared with the other numerical variables, CLV scores are far higher in magnitude and hence get more importance. As he said, standardisation/normalisation of all numerical variables is a must when you have variables with varying magnitudes (almost always).

  • @chuhanwang8252
    @chuhanwang8252 4 years ago +3

    Hi, first of all thank you so much for this thorough tutorial on K-prototypes!
    I've got a question: I noticed that you didn't normalize the numerical columns, could you please explain why?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Chuhan, as I mentioned in the video, this was just to give a quick overview of K-Prototypes and not to focus on the model outcome. But in the real world, if you have numeric values on different scales, it is better to normalize. I will maybe walk through this once I create a detailed video on it.
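
The normalization point can be sketched with a plain z-score, assuming each numeric column is standardized on its own before clustering. The column names and values are made up for illustration; the notebook in the video does not include this step.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information for the distance.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical columns on very different scales.
clv = [2000.0, 4000.0, 6000.0]
months = [20.0, 40.0, 60.0]
print(standardize(clv))
print(standardize(months))
```

After scaling, both columns contribute comparable amounts to the numeric part of the distance, so a large-magnitude column such as CLV no longer dominates the clusters.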

    • @chuhanwang8252
      @chuhanwang8252 4 years ago

      @AIEngineeringLife Thank you so much for your timely reply! Sorry I didn't pay close attention. And thanks again for your video! This is the most helpful tutorial I've ever found!

  • @zorangen
    @zorangen 4 years ago +2

    This was a great video. Today I have learnt something new. Thanks for sharing.
    However, I would like to know how the results compare to the original k-means. Did including the categorical variables provide any benefit?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      K-means does not work on categorical data, since it computes distances as it would for numerical values. K-Prototypes behaves like k-means for numerical data and like k-modes for categorical data. Rather than comparing them, I would say they complement each other.
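
One way to picture "k-means for numerical, k-modes for categorical" is the prototype update step: the mean for numeric columns and the most frequent value for categorical ones. This is a simplified sketch under that assumption, with hypothetical rows, not the code used in the video.

```python
from collections import Counter
from statistics import mean


def update_prototype(rows, categorical):
    """Mean for numeric columns, most frequent value for categorical ones."""
    cat = set(categorical)
    prototype = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if j in cat:
            prototype.append(Counter(column).most_common(1)[0][0])
        else:
            prototype.append(mean(float(v) for v in column))
    return prototype


# Hypothetical members of one cluster: column 1 is categorical.
cluster_rows = [
    [10.0, "Urban"],
    [20.0, "Rural"],
    [30.0, "Urban"],
]
print(update_prototype(cluster_rows, categorical=[1]))
```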

  • @piet7583
    @piet7583 1 year ago +1

    Hello Sir,
    Nice Video!
    Is there a way to find the Outliers of each cluster?

  • @JainmiahSk
    @JainmiahSk 3 years ago +1

    Amazing work. But how can it be visualized as a scatter plot?

  • @tusharkumarsaxena6029
    @tusharkumarsaxena6029 5 months ago

    Hi, how can we find the profile of each cluster segment? It would be great to have a visualisation of the different profiles. Thanks

  • @GurveenKaur-bd9gm
    @GurveenKaur-bd9gm 4 months ago

    Why haven't you scaled the numeric variables like we do in k-means?

  • @kapilgupta4586
    @kapilgupta4586 4 years ago +1

    Can't we plot a graph for the 3 clusters above? It would make it easier to find patterns or data insights. Any idea how to plot these 3 clusters?

  • @karndeepsingh
    @karndeepsingh 4 years ago +1

    Very well explained, sir! Can we have a video on the Dask framework and how it compares to pandas DataFrames for parallel computation?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Karndeep, I have one on cuDF, which can run on Dask. Let me see if I can create one on Dask.

    • @karndeepsingh
      @karndeepsingh 4 years ago

      @AIEngineeringLife It would be a great help to get everything from your side, as I am always learning from you.

  • @gaayathrisankar2555
    @gaayathrisankar2555 3 years ago +1

    I have converted my categorical nominal variables to numeric, so some numeric columns are categorical and some are continuous. Can I use k-prototypes with this kind of numeric dataset?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      Yes, that is fine. Just make sure to give the positions of the categorical variables in the k-prototypes fit method.
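
A small sketch of how the positions of the categorical columns might be collected before fitting. The helper and the column names are hypothetical; the commented-out call mirrors the `fit_predict(..., categorical=[...])` usage from the kmodes package quoted elsewhere in this thread and needs that package installed.

```python
def categorical_positions(columns, categorical_names):
    """Return the positions of the categorical columns, in column order."""
    wanted = set(categorical_names)
    return [i for i, name in enumerate(columns) if name in wanted]


# Hypothetical layout: the *_code columns hold integer-encoded categories.
columns = ["state_code", "clv", "coverage_code", "income"]
cat_idx = categorical_positions(columns, ["state_code", "coverage_code"])
print(cat_idx)

# With the kmodes package installed, the indices would be passed like this
# (X is the feature array, assumed to exist):
# from kmodes.kprototypes import KPrototypes
# labels = KPrototypes(n_clusters=3, init="Cao").fit_predict(X, categorical=cat_idx)
```

Columns not listed in the index are treated as numeric, so integer-encoded categories that are left out would be compared by magnitude rather than by simple matching.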

  • @skh7056
    @skh7056 1 year ago

    I think there is a mistake at 6:34, while creating the NumPy array: all the values are the same?

  • @Aveen3
    @Aveen3 4 years ago +1

    What is the most intelligent way to calculate the number of clusters and plot these clusters? Does anyone have any idea? Thank you.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Aveen, frankly I don't think there are a lot of options. Silhouette and the elbow method are options, but you might need to run over a range of cluster counts to find the optimum.
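
For the silhouette option, one possible approach is to compute the score from a precomputed matrix of pairwise distances between rows, built with whatever mixed dissimilarity is used for clustering. The sketch below is a plain-Python version of the standard silhouette definition; applying it to mixed data this way is an assumption, not something shown in the video.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette coefficient from a square distance matrix."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy example: two tight, well-separated groups on a line.
points = [0.0, 0.1, 10.0, 10.1]
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette_from_distances(dist, [0, 0, 1, 1]))
```

Running this for each candidate number of clusters and keeping the highest score is one way to do the range scan described above.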

  • @divyadixit8340
    @divyadixit8340 4 years ago +1

    Hi, Thanks for posting this! As you demo-ed here, this algo works well over mixed data types. Would it be of much help if I have a dataset that has all categorical/qualitative variables? TIA

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Divya, you can check the k-modes algorithm for categorical data. In fact, if you pass only categorical data to k-prototypes, it becomes k-modes.

  • @natasha16340
    @natasha16340 3 years ago +1

    Thank you very much for this. This saved me.

  • @gaayathrisankar2555
    @gaayathrisankar2555 3 years ago +1

    What if any of my categorical variables have null values in the original dataset?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago

      For categorical variables, missing values can be set to np.nan, but it is always advisable to impute missing values based on domain understanding.

  • @arran5918
    @arran5918 4 years ago

    Sir, how do I work through the k-prototypes formula manually, from scratch, in a table?
    k-prototypes = k-means (elbow method) + k-modes (simple matching dissimilarity)
    How do I conclude which cluster each data point belongs to?

  • @prafullachaudhari7018
    @prafullachaudhari7018 2 years ago

    Thank you so much, man. You explain very well. I watched a lot of k-means videos, but they gave me a lot of errors; I finally found one that does not.

  • @chidiedim3166
    @chidiedim3166 4 years ago +1

    Your GitHub link is not working. Can you post the link?

  • @venkateshsatagopan2963
    @venkateshsatagopan2963 4 years ago +1

    Sir,
    Just a small doubt.
    In the code snippet for mark_array, you want to convert columns 1, 3, 5 and 6 to float.
    Why are you copying the converted float values of column 1 into columns 3, 5 and 6?
    Is it a mistake made while copying? Please correct me if I am wrong.
    I mean:
    mark_array[:,1] = mark_array[:,1].astype(float) --> This converts the values of column 1 to float and assigns them back to column 1.
    Why is mark_array[:,1].astype(float) being copied to mark_array[:,3], mark_array[:,5] and mark_array[:,6]?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Venkatesh, extremely sorry. It looks like a copy-paste mistake, and I think that is why it has grouped the clusters only by the CLV column. I did not pay much attention to it. My bad, I did it in a hurry. I will update the notebook on GitHub and also pin a comment to this video. Thanks for pointing it out.
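
The bug discussed above comes from reusing column 1 on the right-hand side of every assignment. A loop that converts each column from its own values avoids it. This sketch uses plain Python lists and made-up rows so that it runs without NumPy; the commented lines show the assumed NumPy equivalent for the `mark_array` and the columns 1, 3, 5 and 6 named in the question.

```python
def convert_columns_to_float(rows, numeric_columns):
    """Convert each listed column to float using that column's own values."""
    converted = [list(row) for row in rows]  # copy so the input stays intact
    for col in numeric_columns:
        for row in converted:
            row[col] = float(row[col])
    return converted


# Hypothetical rows; positions 1 and 3 hold numbers stored as strings.
rows = [
    ["Washington", "2763.5", "Basic", "56274"],
    ["Arizona", "6979.5", "Extended", "0"],
]
fixed = convert_columns_to_float(rows, numeric_columns=[1, 3])
print(fixed[0])

# Assumed NumPy equivalent for an object array named mark_array:
# for col in (1, 3, 5, 6):
#     mark_array[:, col] = mark_array[:, col].astype(float)
```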

    • @venkateshsatagopan2963
      @venkateshsatagopan2963 4 years ago +1

      Thanks sir for your reply.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      I have added a note as well to the video. Thanks again for pointing it out :)

    • @fiammayepez4999
      @fiammayepez4999 2 years ago

      @AIEngineeringLife
      Hi, I'm from Ecuador. I copied the code from GitHub, but I get the error "could not convert string to float".
      Please help.

  • @jaspartapsingh8774
    @jaspartapsingh8774 3 years ago

    It's taking a lot of time for a dataset with 260k rows and 7 columns. Any suggestions on how to speed it up?

  • @purnimasharma9734
    @purnimasharma9734 4 years ago +1

    Very nicely explained, thanks! How do you handle ordinal variables, e.g. if you have model year as 2009, 2010, etc.? Do you have to make them categorical? They should not be treated like numbers.

    • @gaayathrisankar2555
      @gaayathrisankar2555 3 years ago

      Any replies?

    • @shivamsoliya6529
      @shivamsoliya6529 3 years ago

      I was wondering the same.
      I tried training with the ordinal values anyway, but I explicitly listed them in the categorical index (one of the parameters when you call the model), and I got good results.

  • @sndselecta
    @sndselecta 3 years ago

    I think you mean cluster_dict is a list (same length as the df row count), not a dictionary.

  • @malikafuhamid4258
    @malikafuhamid4258 2 years ago

    Hi sir, thank you for creating this video. It was an amazing video! I want to ask about the data types, though: why do you convert the data to float?

  • @GM-qv1ql
    @GM-qv1ql 4 years ago +1

    Very good, to the point. Great job!

  • @srinu7j
    @srinu7j 4 years ago +1

    Sir, really informative post. We could also have done this segmentation manually based on CLTV. I would just like to know, in general, how it adds value over traditional methods.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      In this case it was just a demo, and you are right. But with further feature engineering, multiple variables can contribute to a cluster, and doing it manually becomes difficult in a high-dimensional feature space.

    • @srinu7j
      @srinu7j 4 years ago

      @AIEngineeringLife Thank you, sir. But are there any limitations on the number of features we can use in this algorithm when we have, say, more than a million rows?

  • @vigneshwarravichandrababu581
    @vigneshwarravichandrababu581 4 years ago +1

    How did you decide the number of clusters? For example, here you set the number of clusters to 3.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Vignesh, in this video I am mainly showing how k-prototypes works. I will dive into clustering in future videos, where I will cover selecting the number of clusters.

    • @vigneshwarravichandrababu581
      @vigneshwarravichandrababu581 4 years ago

      Thanks

    • @Aveen3
      @Aveen3 4 years ago

      Have you got an answer for your question?

  • @javeda
    @javeda 3 years ago

    Does the clustering data include the outcome/response variable too? Secondly, can we fit a decision tree on this clustered data?

  • @laikasatish3236
    @laikasatish3236 2 years ago

    Could you please make a video on the mathematics of the K-Prototypes method?

  • @anac4818
    @anac4818 4 years ago

    Hi! First of all thank you so much for this video super helpful

  • @satishkannan6176
    @satishkannan6176 2 years ago

    Can you explain the mathematics behind this with a suitable simple example?

  • @joanesperanza7519
    @joanesperanza7519 4 years ago +1

    I used this for facial expression mixed data, it worked wonders. Thank you!

    • @priyankamehta2827
      @priyankamehta2827 4 years ago

      How did you decide the right number of clusters?

    • @arran5918
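      @arran5918 4 years ago +1

      @priyankamehta2827 You can use the elbow approach. KPrototypes in the kmodes package has an attribute called cost_, defined as the sum of the distances of all points to their respective cluster centroids.
      # Choosing the optimal K with the elbow method
      import matplotlib.pyplot as plt
      from kmodes.kprototypes import KPrototypes

      cost = []
      k_values = list(range(1, 8))
      for num_clusters in k_values:
          kproto = KPrototypes(n_clusters=num_clusters, init='Cao')
          # Data and the categorical column indices come from your own dataset
          kproto.fit_predict(Data, categorical=[0,1,2,3,4,5,6,7,8,9])
          cost.append(kproto.cost_)
      plt.plot(k_values, cost)  # look for the "elbow" where the cost stops dropping sharply
      source: github.com/aryancodify/Clustering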

  • @project-du6ei
    @project-du6ei 3 years ago

    Hi sir, thank you very much. I have just one question: what happens if there's a binary variable, say "has a loan?", with values 1 for yes and 0 for no? How should I treat it? Do I have to convert it to float too?

  • @aniacharya9
    @aniacharya9 4 years ago +1

    Can you share the link to this code, please?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      It is in my git repo
      github.com/srivatsan88/RUclipsLI/

  • @oscarperezraggio1264
    @oscarperezraggio1264 4 years ago

    Hello! Thanks for sharing! I have a question: how do we identify the best number of clusters for this particular data? We have the elbow method or the silhouette coefficient for finding a good number of clusters, but in the case of k-prototypes, how are we going to find it?

  • @dr.rajkishorbisht
    @dr.rajkishorbisht 2 years ago

    Nicely explained !! Thanks.

  • @MuhammadFirdaus-tz6uq
    @MuhammadFirdaus-tz6uq 1 year ago

    I searched and read many sources over the last 2 weeks to understand k-prototypes, and you are the only one who made me understand. Thanks

  • @onetirtha
    @onetirtha 3 years ago

    Thanks Sir. It is extremely helpful

  • @FindMultiBagger
    @FindMultiBagger 3 years ago

    Thanks, Sir. Can we use this approach on textual data, such as clustering reviews?

  • @GR-vx7kh
    @GR-vx7kh 3 years ago

    Hi sir, can you suggest something on variable clustering?

  • @go1chase1the1sun1set
    @go1chase1the1sun1set 3 years ago

    Is it possible to plot this?

  • @mohammedibrahim5642
    @mohammedibrahim5642 4 years ago

    Thanks for this great explanation.

  • @sumitchandak6131
    @sumitchandak6131 4 years ago

    Thank you for providing great info.
    Can you please provide brief details about the intuition?
    As mentioned, for numeric data it uses Euclidean distance and for categorical data it uses the mode. But how do these get combined? Moreover, for categorical data, does it use any tree-based method like Gini or entropy?
    I really appreciate your assistance.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Sumit, check this paper; it has all the details you are asking for:
      grid.cs.gsu.edu/~wkim/index_files/papers/kprototype.pdf

    • @sumitchandak6131
      @sumitchandak6131 4 years ago

      Thank you, that was really helpful

  • @vigneshwarravichandrababu581
    @vigneshwarravichandrababu581 4 years ago

    Hi