K-means Clustering From Scratch In Python [Machine Learning Tutorial]

  • Published: Nov 24, 2024

Comments • 87

  • @vikasparuchuri
    @vikasparuchuri 1 year ago +15

    Here's all the code for this video - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . Hope you enjoy it!

  • @MarianneHMiettinen
    @MarianneHMiettinen 11 months ago +4

    Outstanding! Thank you, man! This really helped me with my master's thesis. I really appreciate that you explained every small step, used as many visuals as possible, and focused on us being able to learn!
    - In case others run into the same problem: with scikit-learn's K-means, calling fit(data) gave me a "split" error (AttributeError: 'NoneType' object has no attribute 'split'). I checked my BLAS and updated all libraries through conda, then shut everything down and reopened it. That resolved the problem, but it took a long time.
    (I asked ChatGPT for help.)

  • @maleck25
    @maleck25 7 months ago

    Thank you, sir. This is how tutorials should be conducted: with in-depth explanations, step-by-step implementation, and the release of all code and datasheets to enable everyone to practice and advance their own personal projects. Congrats!

  • @animal40
    @animal40 1 year ago +6

    This was amazing. Brilliantly explained, demonstrated and presented clearly. Helped me so much with my current bootcamp task. Thank you.

  • @stevenlomon
    @stevenlomon 1 year ago +2

    From the bottom of my heart; thank you. This was so clear and easily understandable, fantastic video!

  • @TimHerrin
    @TimHerrin 2 years ago +2

    Terrific implementation! I also really liked the way you used PCA for iterative visualization... Nicely done

  • @Risewithvishwas
    @Risewithvishwas 10 months ago

    Your explanation is absolutely clear. You have the best knowledge. Keep posting new topics and encouraging us ❤

  • @mo_l9993
    @mo_l9993 2 years ago +1

    One of the best tutorials on the internet, thank you.

  • @allaguimaouia6510
    @allaguimaouia6510 1 year ago

    It's a great job, the only one on YouTube that explains every part of the code 👍👍

  • @photoish3863
    @photoish3863 2 years ago

    I never thought that we could visualize K-means by using dimensionality reduction (PCA)!! Awesome tutorial, sir

  • @Silverwing_99
    @Silverwing_99 2 years ago +1

    Absolutely fantastic
    Would love a similar video on PAM clustering for mixed integer and categorical variables

    • @Dataquestio
      @Dataquestio 2 years ago

      Thanks for the suggestion :)

  • @elu1
    @elu1 9 months ago

    This is a nice and powerful way to learn. Thanks for teaching.

  • @amandamorrow73
    @amandamorrow73 2 years ago

    This is THE best tutorial online. I am so grateful for this! Thank you

  • @jessemunson7091
    @jessemunson7091 1 year ago

    Awesome stuff, Vik. Thanks for sharing.

  • @tejasvinnarayan2887
    @tejasvinnarayan2887 2 years ago +1

    Amazingly clear! Thank you so much, Dataquest!

  • @oskeeg619
    @oskeeg619 2 years ago +1

    Thank you, thank you, thank you!!! Being able to perform and explain what runs under the hood is really important- I agree. Please keep these videos coming 🙌🏼❤️ The “From Scratch” series :)

    • @Dataquestio
      @Dataquestio 2 years ago +3

      That's a great idea :) I'm working on linear regression from scratch.

  • @krlwshu
    @krlwshu 1 year ago

    Great video. Really helpful looking at implementing it manually. Thank you so much

  • @hounddog1
    @hounddog1 1 year ago

    Such good and clearly delivered material. Thanks a lot!

  • @sashagalanova818
    @sashagalanova818 10 months ago

    very helpful and clear explanations - thank you!

  • @elvykamunyokomanunebo1441
    @elvykamunyokomanunebo1441 2 years ago

    Very insightful and step by step code explanation.
    Thank you for this excellent tutorial
    :)

    • @Dataquestio
      @Dataquestio 2 years ago +1

      Glad it was helpful! -Vik

    • @elvykamunyokomanunebo1441
      @elvykamunyokomanunebo1441 2 years ago

      @@Dataquestio Vik,
      how do I assign new data points to a cluster? That is, once I have run my K-means clustering, how do I use it to assign clusters to new data, such as out-of-time datasets or testing/validation datasets?
      There doesn't seem to be anything online about this. Would I have to re-run K-means with the new data included?
      Thanks in advance
      Elvy
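
      One common approach that avoids re-running K-means is to give each new point the label of its nearest existing centroid. The sketch below assumes the centroids are stored as in the thread's other snippets (one column per cluster, one row per feature) and that the new data has been scaled with the same parameters as the training data. The function name assign_to_clusters and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def assign_to_clusters(new_data: pd.DataFrame, centroids: pd.DataFrame) -> pd.Series:
    """Return, for each row of new_data, the label of the nearest centroid.

    Assumes centroids has one column per cluster and one row per feature,
    with the same feature names as new_data's columns.
    """
    # For each centroid (column), compute the Euclidean distance from every row.
    distances = centroids.apply(
        lambda c: np.sqrt(((new_data - c) ** 2).sum(axis=1))
    )
    # idxmin picks the column (cluster label) with the smallest distance.
    return distances.idxmin(axis=1)

# Toy centroids and new points (illustrative values only).
centroids = pd.DataFrame({0: [1.0, 1.0], 1: [9.0, 9.0]}, index=["x", "y"])
new_points = pd.DataFrame({"x": [2.0, 8.0], "y": [1.5, 9.5]})
new_labels = assign_to_clusters(new_points, centroids)
```

      The centroids are left unchanged here; the new points are only labelled, not used to refit the clusters.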

  • @obeynjanjeni4466
    @obeynjanjeni4466 7 months ago

    This is amazing, keep up the good work

  • @TidianeDiallo-s3s
    @TidianeDiallo-s3s 1 year ago

    I can't thank you enough. Thank you for this content.

  • @ytustatistics
    @ytustatistics 9 months ago

    You might be a hero... thanks a lot for the content...

  • @virendrakhanduri4897
    @virendrakhanduri4897 2 years ago +1

    Great video. BTW, why did you use the geometric mean instead of the arithmetic mean for finding the clusters? Please make a whole series on building models from scratch.

  • @shreshthasingh
    @shreshthasingh 1 year ago

    Thanks a LOT for this tutorial!😀

  • @ahmetatasever8315
    @ahmetatasever8315 1 year ago

    Thank you very much for this clearly understood video.

  • @HelloIamLauraa
    @HelloIamLauraa 5 months ago

    I loved your video, it is so well explained!! I had only used scikit-learn, but now I understand better how it works.
    But I have a question: why is it not good to use height and weight as features?

  • @VaradKashmire
    @VaradKashmire 2 years ago

    Excellent video !! Many thanks 🙏🏼

  • @soothingszelam2607
    @soothingszelam2607 8 months ago

    Thanks, teacher. Could you show how to calculate the SSE for a k-means clustering solution when you choose not to use k-means directly from the sklearn package?
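
    For a from-scratch solution, the SSE is the sum of squared distances between every point and the centroid of the cluster it was assigned to. A minimal sketch, assuming the data is a DataFrame with one row per point, the labels give one cluster label per row, and the centroids are stored with one column per cluster; the function name sum_squared_error and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def sum_squared_error(data, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid.

    Assumes centroids has one column per cluster label and one row per feature.
    """
    # centroids.T has one row per cluster; look up the row for each point's label.
    assigned = centroids.T.loc[np.asarray(labels), data.columns].to_numpy()
    return float(((data.to_numpy() - assigned) ** 2).sum())

# Toy example: two clusters of two points each.
data = pd.DataFrame({"x": [0.0, 2.0, 10.0, 10.0], "y": [0.0, 0.0, 10.0, 12.0]})
labels = pd.Series([0, 0, 1, 1])
centroids = pd.DataFrame({0: [1.0, 0.0], 1: [10.0, 11.0]}, index=["x", "y"])

sse = sum_squared_error(data, labels, centroids)  # each point is 1 unit away, so 4.0
```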

  • @saemamiftah1669
    @saemamiftah1669 1 year ago

    More videos like these please on other algos

  • @a3i3m1an
    @a3i3m1an 1 year ago +1

    Thanks for the video. It is just brilliant. One of the best ones on clustering that I have seen for sure!
    I just had a question. I tried using this on data with 13 variables. It worked perfectly, but when I scale the data using n. distrb or skscalar rather than using min-max, I get an error following the PCA transformation code saying there are NaNs in the data variable when there clearly were not before. I can't put my finger on what is causing this. Would appreciate any insights on your part. Thanks

  • @rajeshmanjrekar3614
    @rajeshmanjrekar3614 2 years ago

    great video, you are a great teacher

  • @dataprofessor_
    @dataprofessor_ 2 years ago

    Can you make a video implementing Local Outlier Factor (LOF) with Pandas and NumPy in Python for identifying outliers?

  • @payalpatel2560
    @payalpatel2560 1 year ago

    It's a very well explained video. Just a quick question, how can we add random_state in the final model code?

  • @UkrainVsRussoReaction
    @UkrainVsRussoReaction 2 years ago

    Very insightful explanation of the code. By the way, how can I draw the elbow plot of SSE vs. K, computing the SSE at every K value iteratively? This would help me optimise the K value using this code... Looking forward to hearing from you
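
    An elbow curve needs one SSE value per candidate K, so the usual recipe is to run K-means once for each K and record the final SSE. The sketch below is a compact NumPy stand-in, not the video's implementation: it uses the arithmetic mean for the centroid update, and the helper name kmeans_sse and the toy blobs are made up for illustration.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means on array X and return the final SSE."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Arithmetic mean per cluster; keep the old centroid if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    # SSE: squared distance from each point to its nearest final centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Toy data: three well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5, 10)])

# One SSE value per candidate K.
sse_by_k = {k: kmeans_sse(X, k) for k in range(1, 7)}
```

    With matplotlib available, plt.plot(list(sse_by_k), list(sse_by_k.values()), marker="o") draws the curve; the "elbow" is the K after which the SSE stops dropping sharply.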

  • @ZigBehaviour
    @ZigBehaviour 1 year ago +1

    Pls unpack what is going on in centroid = data.apply(lambda x: float(x.sample())). Without the float cast, the line returns a DataFrame with NaN values in the non-sampled/selected columns. There appears to be some voodoo magic going on here, driven by the float cast!
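
    The difference comes from what the lambda returns. x.sample() gives a one-element Series that still carries its original row label, and apply() lines those Series up by row label, leaving NaN wherever a column was sampled from a different row. Reducing the sample to a plain number makes apply() return one value per column instead. The sketch below uses fixed row picks in place of random sampling so the NaN pattern is reproducible, and .iloc[0] in place of float(), since recent pandas versions warn about calling float() on a one-element Series.

```python
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Stand-in for x.sample(): column "a" "samples" row 0, column "b" row 2.
# Each result is a one-element Series that keeps its row label.
as_frame = data.apply(lambda x: x.iloc[[0]] if x.name == "a" else x.iloc[[2]])
# as_frame is a DataFrame indexed by rows 0 and 2, with NaN where the
# other column picked a different row.

# Returning a scalar per column gives a Series indexed by column name:
# one random value per feature, i.e. one random centroid.
centroid = data.apply(lambda x: x.sample().iloc[0])
```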

  • @irenemayyy4293
    @irenemayyy4293 20 days ago

    Thank you so much.

  • @akosuakoranteng3327
    @akosuakoranteng3327 1 year ago

    Hi, Thanks so much for the video!! Can you please advise on how one adds a legend to the cluster scatter plots? I've been trying but can't figure it out.

  • @VarunSingh-b2e
    @VarunSingh-b2e 1 year ago

    Thanks a lot, that was a great help!

  • @sadeepmihiranga6958
    @sadeepmihiranga6958 1 year ago

    Your explanation is great. I found out that the "k" parameter of the "new_centroids" method has no effect on the application. Correct me if I'm wrong.

  • @AbrarMuhtasim
    @AbrarMuhtasim 2 years ago

    make a video on ''customer segmentation and clustering in retail using machine learning'' using real retail dataset

  • @ayushadhikari2357
    @ayushadhikari2357 1 year ago

    Hi, thank you so much for this clear tutorial.
    I need one more bit of help from you. How do we get this cluster result exported to a CSV file?
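
    One way to do this, assuming the cluster labels are a Series aligned with the rows of the feature DataFrame, is to attach them as a new column and call to_csv. The column names, values, and file name below are placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for the feature table and the cluster labels.
data = pd.DataFrame({"overall": [93, 60, 75], "age": [34, 19, 27]})
labels = pd.Series([0, 1, 0], index=data.index)

# Attach the labels as a new column.
clustered = data.assign(cluster=labels)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = clustered.to_csv(index=False)

# To write a file instead (hypothetical file name):
# clustered.to_csv("clusters.csv", index=False)
```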

  • @adriancondie831
    @adriancondie831 2 years ago

    Great video!

  • @viencong
    @viencong 11 months ago

    I think k = 4, because the young players include both high-overall and low-overall players, like a young star in a high-level league and a young ordinary player

  • @SaraM-c7f
    @SaraM-c7f 8 months ago

    can we follow up based on the identified clusters, by using them to regress for another variable, e.g. with a logistic regression?

  • @goodnessawe4262
    @goodnessawe4262 2 years ago

    Thanks for this, I really don't get how I can possibly use it for fraud detection

  • @sukshithshetty8349
    @sukshithshetty8349 2 years ago +1

    I didn't understand why we took the geometric mean instead of the arithmetic mean. Can you explain that, please?

  • @jakubharas9477
    @jakubharas9477 1 year ago

    Could you explain the meaning of the x- and y-axis?

  • @jagajaga6908
    @jagajaga6908 1 year ago

    good tutorial thank you

  • @SaraM-c7f
    @SaraM-c7f 8 months ago

    what is the maximum amount of variables recommendable for a clustering analysis?

  • @Anae2003
    @Anae2003 8 months ago

    How do you know which 5 features to pick at the beginning?

  • @swayamjoshi7667
    @swayamjoshi7667 1 year ago +1

    can someone help with the issue at 29:48
    when we use old_centroids=centroids
    in my code
    this error comes
    'DataFrame' object has no attribute 'equal'

    • @engineervol
      @engineervol 7 months ago

      it should be .equals with an s

  • @itsamankumar403
    @itsamankumar403 11 months ago

    TYSM :)

  • @2919091986
    @2919091986 8 months ago

    I am getting an error when calculating centroids - 'float' object has no attribute 'sqrt'..... Please help

  • @causticmonster
    @causticmonster 1 year ago

    How would you include Ordinal features ?

  • @SaraM-c7f
    @SaraM-c7f 8 months ago

    do we have to get rid of outliers beforehand?

  • @anirudhpurohit2251
    @anirudhpurohit2251 11 months ago

    Can we also use a player's position as one of the features? If yes, then how (because that isn't numeric)?

  • @rodneymawero9063
    @rodneymawero9063 2 years ago

    Keep sending the emails, thanks for the vids

  • @tivchack
    @tivchack 3 months ago

    Can a feature with dichotomous data be used?

  • @Hgrewssauujdkhvcjjipp
    @Hgrewssauujdkhvcjjipp 2 years ago

    Cool 👍

  • @NadeemAkhtar-gu4up
    @NadeemAkhtar-gu4up 1 year ago

    Which platform are you using for coding??

  • @63_mayukhdebnath22
    @63_mayukhdebnath22 1 year ago

    Sir how to find out the individual elements present in each cluster? For example, I'm working on a dataset of genes. How will i get the names of the individual genes that are present in each cluster?

    • @subhasishtripathy6933
      @subhasishtripathy6933 1 year ago

      I'm looking for the same thing right now. Were you able to find anything? If yes, then please help me too 😊
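
      If the item names (genes, players, and so on) are kept in a Series in the same row order as the clustered data, grouping the names by the label Series lists the members of each cluster. The names and labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical item names and cluster labels sharing the same row order.
names = pd.Series(["geneA", "geneB", "geneC", "geneD"])
labels = pd.Series([1, 0, 1, 0])

# Group the names by label and collect each group into a list.
members = names.groupby(labels).apply(list)

cluster_0 = members.loc[0]
cluster_1 = members.loc[1]
```

      The same labels can also be used as a boolean filter, for example names[labels == 0], to pull out a single cluster.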

  • @prgyagupta555
    @prgyagupta555 1 year ago

    If we have IP addresses in the data, should we still scale the data? I had a dataset where IP addresses and fraud transactions were given, and I converted the IP addresses to numerical data

  • @dataprofessor_
    @dataprofessor_ 2 years ago

    Why did you not apply fit_transform to the centroids_2d variable as well?

    • @Dataquestio
      @Dataquestio 2 years ago +1

      Fit transform will both compute the fit and transform the data. In this case, we already computed the fit on the data, and we want to just apply the same fit to the centroids, so that they're all on the same scale and can be visualized. -Vik
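
      The split between fitting and transforming can be seen in a small hand-written PCA. This is a NumPy stand-in for illustration, not scikit-learn's implementation, and the names fit_pca and transform_pca are hypothetical: "fit" learns the centering mean and the principal axes from the data, and "transform" projects any points onto those already-learned axes.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """'Fit': learn the centering mean and the principal axes from X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def transform_pca(X, mean, components):
    """'Transform': project X onto axes that were already learned."""
    return (X - mean) @ components.T

# Made-up data: 100 points with 5 features, and two "centroids" taken as
# the means of the first and second half of the points.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
centroids = np.vstack([data[:50].mean(axis=0), data[50:].mean(axis=0)])

# Fit once on the data, then reuse the same projection for the centroids.
mean, components = fit_pca(data)
data_2d = transform_pca(data, mean, components)
centroids_2d = transform_pca(centroids, mean, components)
```

      Because both sets of points go through the same projection, each projected centroid sits at the mean of its projected points. Fitting a second time on the centroids alone would learn a different mean and different axes, so the two plots would no longer line up.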

  • @bgizzanm
    @bgizzanm 2 years ago

    Amazing!! But, how to implement the scatter without PCA?

    • @animal40
      @animal40 1 year ago

      Did you figure out? I'd like to know too.

    • @akosuakoranteng3327
      @akosuakoranteng3327 1 year ago +1

      @@animal40 Just leave out the PCA. Still transpose the centroids (centroids.T), though, and remember to include iloc. Here's my code:
      import matplotlib.pyplot as plt

      def plot_clusters(data, labels, centroids, iteration):
          # Centroids are stored one per column, so transpose to one per row
          centroid_T = centroids.T
          plt.title(f'Iteration {iteration}')
          # Plot the first two feature columns directly instead of PCA components
          plt.scatter(x=data.iloc[:, 0], y=data.iloc[:, 1], c=labels)
          plt.scatter(x=centroid_T.iloc[:, 0], y=centroid_T.iloc[:, 1])
          plt.show()

    • @animal40
      @animal40 1 year ago

      @@akosuakoranteng3327 thanks very much for this. Tried a few things today but couldn't quite get it working. Will try again tomorrow with this. Appreciate it, cheers.

  • @ThePowerofInspiration-ym7vr
    @ThePowerofInspiration-ym7vr 2 months ago

    can my cluster be different from yours with the same code ?

  • @sukshithshetty8349
    @sukshithshetty8349 2 years ago

    What does groupby() return? How can I see what groupby() has returned? Can you please also share code showing what data.groupby(labels) does?
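
    In pandas, groupby() does not return a table. It returns a lazy grouping object that only produces results once it is iterated over or aggregated. A small sketch with made-up data, assuming the labels are a Series aligned with the rows of the data:

```python
import pandas as pd

data = pd.DataFrame({"x": [1.0, 2.0, 9.0, 11.0], "y": [1.0, 3.0, 8.0, 10.0]})
labels = pd.Series([0, 0, 1, 1], index=data.index)

# A DataFrameGroupBy object: rows are bucketed by label, nothing is computed yet.
grouped = data.groupby(labels)

# Iterating yields (label, sub-DataFrame) pairs, which is one way to look inside.
group_sizes = {label: len(rows) for label, rows in grouped}

# An aggregation collapses each group to one row, indexed by label.
means = grouped.mean()
```

    Calling grouped.get_group(0) returns the sub-DataFrame for a single label.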

  • @itsmitasha
    @itsmitasha 9 months ago

    At 10:08, how did you know row 0 belongs to Lionel Messi?

  • @shreyanshkhandelwal6499
    @shreyanshkhandelwal6499 2 years ago

    Please can someone tell me how to apply the arithmetic mean instead of the geometric mean in the lambda function for getting new centroids? I am dealing with datasets that contain negative values, so the geometric mean is of no use to me. Will it be like this: data.groupby(labels).apply(lambda x: np.mean(x, axis=0))

    • @animal40
      @animal40 1 year ago

      Thank you, I required arithmetic mean too and your code worked for me.
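
      For comparison, here is a sketch of both centroid updates side by side on made-up data. The arithmetic version uses groupby(...).mean(), which computes the same per-cluster column means as the lambda above. The geometric version goes through logarithms and is written here as an assumption about how such an update could look, not as the video's exact code. The transpose mirrors the one-column-per-centroid layout used elsewhere in this thread.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.0, 4.0, 10.0, 10.0], "y": [2.0, 8.0, 5.0, 20.0]})
labels = pd.Series([0, 0, 1, 1], index=data.index)

# Arithmetic mean of each cluster; transpose so each column is one centroid.
arithmetic = data.groupby(labels).mean().T

# Geometric mean via logs: exp(mean(log(x))). Only defined when every value
# is strictly positive, because log of zero or a negative number is not a real number.
geometric = np.exp(np.log(data).groupby(labels).mean()).T
```

      On this toy data, cluster 0 has x values 1 and 4, so the arithmetic centroid is 2.5 and the geometric one is 2.0.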

  • @DeepakKumarBCH
    @DeepakKumarBCH 2 years ago +1

    does anyone have the code ?

    • @Dataquestio
      @Dataquestio 2 years ago

      Code is here - github.com/dataquestio/project-walkthroughs/tree/master/kmeans . It's linked in the description

    • @DeepakKumarBCH
      @DeepakKumarBCH 2 years ago

      @@Dataquestio Sir, I'm getting an error when doing it from scratch. Is there any platform where I can send my query?

  • @jinluwang5671
    @jinluwang5671 5 months ago

    Nice but a little too much for a newbie 😅

  • @I_balit
    @I_balit 1 year ago

    SUUUUIII

  • @kesharikumar5878
    @kesharikumar5878 6 days ago

    grt