Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial

  • Published: 2 Oct 2024

Comments • 119

  • @patloeber
    @patloeber  4 years ago +14

    There is a slight fix in the fit method that must be applied if the class labels do not start at 0 (see the sketch at the end of this thread):
    for idx, c in enumerate(self._classes)
    instead of
    for c in self._classes

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago +3

      How do I solve this problem? What should I do?
      for idx, c in enumerate(self._classes):

      X_c = X[y==c]
      self._mean[idx, :] = X_c.mean(axis=0)
      self._var[idx, :] = X_c.var(axis=0)
      self._priors[idx] = X_c.shape[0] / float(n_samples)
      boolean index did not match indexed array along dimension 1; dimension is 5 but corresponding boolean dimension is 1

    • @alitaangel8650
      @alitaangel8650 4 years ago

      @@AliHussain-kb3ew The code above works fine for me; maybe something is wrong with your input data?

    • @Dhanush-zj7mf
      @Dhanush-zj7mf 4 years ago +1

      I was stuck for 2 days and even posted a question on Stack Overflow. I should have read the comments first.

    • @robinsonnadar5457
      @robinsonnadar5457 3 years ago

      @@AliHussain-kb3ew I'm stuck on the same error too :(

    • @umarmughal5922
      @umarmughal5922 2 years ago

      @Python Engineer could you please explain how to apply Laplace to this?
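
  A minimal sketch of the corrected fit method described at the top of this thread, assuming X is a NumPy array of shape (n_samples, n_features) and y is a 1-D label array of the same length. The "boolean index did not match indexed array along dimension 1" error quoted above typically means y is a column vector of shape (n_samples, 1) rather than 1-D; flattening it with y.ravel() before fitting usually resolves it.

      import numpy as np

      class NaiveBayes:
          def fit(self, X, y):
              n_samples, n_features = X.shape
              self._classes = np.unique(y)
              n_classes = len(self._classes)

              self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
              self._var = np.zeros((n_classes, n_features), dtype=np.float64)
              self._priors = np.zeros(n_classes, dtype=np.float64)

              # enumerate gives a 0-based row index even if the labels are e.g. [1, 4, 8]
              for idx, c in enumerate(self._classes):
                  X_c = X[y == c]                  # rows of X whose label equals c
                  self._mean[idx, :] = X_c.mean(axis=0)
                  self._var[idx, :] = X_c.var(axis=0)
                  self._priors[idx] = X_c.shape[0] / float(n_samples)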

  • @kougamishinya6566
    @kougamishinya6566 2 years ago +2

    I love the way you explain what each line is doing and relate it back to the formulae, that's super helpful thank you!

  • @heidycespedes9220
    @heidycespedes9220 1 year ago

    Awesome explanation! It helped me to understand the concept and work on my project. Thanks a lot!

  • @posadzd7343
    @posadzd7343 3 years ago +1

    Good video, learned a lot. Could you please implement a Bayes classifier based on Parzen-window density estimation?

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 4 years ago +2

    @PythonEngineer I'm using this on a large dataset with 8 columns and ~16000 rows. It's saying "IndexError: index 10000 is out of bounds for axis 0 with size 210". Do you know how I can fix this?

  • @ozysjahputera7669
    @ozysjahputera7669 2 years ago

    The pdf implemented here is only for a univariate Gaussian, correct? Multivariate would have involved a covariance matrix inverse and determinant.
    Never mind. You assume all features are independent of each other.

  • @mattgoodman2687
    @mattgoodman2687 4 years ago +4

    Thank you for this. I had no clue how to conceptually grasp Naive Bayes, but after watching your video I understand it very well

    • @patloeber
      @patloeber  4 years ago +1

      I’m glad it is helpful :)

  • @andreaq.y1770
    @andreaq.y1770 5 years ago +4

    very good tutorial !!! hope you will update more about algorithm implementations

    • @patloeber
      @patloeber  5 years ago +1

      Thank you! Yes more videos are coming soon :)

  • @T4l0nITA
    @T4l0nITA 4 years ago

    Really good explanation.

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 3 years ago +1

    What do we change about this program if the class isn't just True/False, e.g. self._classes isn't just [0, 1]?

    • @patloeber
      @patloeber  3 years ago

      It works for multiple classes; however, you have to change the for loop like this: for idx, c in enumerate(self._classes):
      I have already updated this fix in my GitHub repo.

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 1 year ago

    No __init__ function inside the NaiveBayes class?

  • @akshaygoel2184
    @akshaygoel2184 2 years ago +2

    Amazing implementation!
    Small question/point - for the PDF shouldn't the numerator var have a square term? i.e. (2 * var**2)?

    • @BlackHeart-AI
      @BlackHeart-AI 1 year ago

      f(x) = (1 / (σ * sqrt(2π))) * e^(-((x-μ)^2) / (2σ^2))
      In statistics, σ (the Greek letter sigma) represents the standard deviation of a population. The standard deviation is a measure of the spread or dispersion of a set of data around its mean.
      Standard deviation is closely related to the variance, which is equal to the square of the standard deviation, and is denoted by σ^2.
      Just σ^2 == variance
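
  A minimal sketch of the _pdf calculation in terms of the formula above: since var is already σ^2, it is used directly with no extra square (written here as a standalone function for illustration):

      import numpy as np

      def gaussian_pdf(x, mean, var):
          # var is sigma squared, so no additional squaring is needed
          numerator = np.exp(-((x - mean) ** 2) / (2 * var))
          denominator = np.sqrt(2 * np.pi * var)
          return numerator / denominator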

  • @prateekarora4549
    @prateekarora4549 3 years ago

    very good tutorial !

  • @debatradas9268
    @debatradas9268 2 years ago

    thank you

  • @OnlineGreg
    @OnlineGreg 2 years ago

    hey, thanks a lot for this series. One question: why do you often put an underscore _ in front of a function or a variable?

    • @derilraju2106
      @derilraju2106 2 years ago

      It's a common convention for marking methods and attributes as "private", i.e. internal helpers that are not meant to be called from outside the class.
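
  A small illustration of the convention (hypothetical method bodies): the leading underscore is purely a naming convention for internal helpers; Python does not enforce any access restriction.

      class NaiveBayes:
          # public method: part of the class's interface
          def predict(self, X):
              return [self._predict(x) for x in X]

          # leading underscore: internal helper, not meant to be called directly
          def _predict(self, x):
              ...  # compute the most probable class for a single sample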

  • @Fresh290PL
    @Fresh290PL 2 years ago +1

    Great video, thanks! Just one thing: how can we avoid the zero-frequency problem in this implementation?
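
  Laplace (additive) smoothing in its usual form applies to count-based variants such as multinomial Naive Bayes. In this Gaussian implementation, the analogous failure mode is a feature whose variance is zero within a class, which makes _pdf divide by zero. A common remedy, sketched here as an assumption rather than as part of the video's code, is to add a tiny epsilon to every variance (scikit-learn's GaussianNB does something similar through its var_smoothing parameter). The hypothetical helper below could be called at the end of fit() as self._var = smooth_variances(self._var, X):

      import numpy as np

      def smooth_variances(var, X, eps_factor=1e-9):
          # add a tiny constant, scaled to the data's overall spread, so that no
          # class/feature variance is exactly zero in the Gaussian density
          eps = eps_factor * X.var(axis=0).max()
          return var + eps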

  • @tkaczoro
    @tkaczoro 7 months ago

    Looks like, for the same reason you removed P(X) from the formula for y, you can also remove the prior term P(y). You will get the same result when calculating accuracy.

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 1 year ago

    Can anyone please tell me why we are adding the prior and class_conditional variables?
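
  The addition comes from working in log space: the posterior is proportional to P(y) * Π P(x_i|y), and taking the logarithm turns that product into the sum log P(y) + Σ log P(x_i|y), which is numerically much more stable than multiplying many small probabilities. A minimal sketch of the _predict step in that spirit, assuming fit() has already filled in the per-class means, variances, and priors:

      import numpy as np

      def _predict(self, x):
          posteriors = []
          for idx, c in enumerate(self._classes):
              prior = np.log(self._priors[idx])
              # sum of log densities over all features = log of their product
              class_conditional = np.sum(np.log(self._pdf(idx, x)))
              posteriors.append(prior + class_conditional)
          # choose the class with the highest log-posterior
          return self._classes[np.argmax(posteriors)]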

  • @nobody2937
    @nobody2937 2 years ago

    Also, make sure var is NOT 0 ...

  • @shehanjanidu2334
    @shehanjanidu2334 3 years ago

    I was using my own CSV file as my dataset, but it gives ufunc 'subtract' did not contain a loop with signature matching types (dtype('

  • @ramazanburakguler5842
    @ramazanburakguler5842 1 year ago

    In terms of regularization, what can be done?

  • @Lanipops
    @Lanipops 4 years ago +1

    Tried to run this but I keep getting this error:
    ~/anaconda3/envs/XXXXXX6/aima-python-master/naivebayes.py in fit(self, X, y)
    15 for c in self._classes:
    16 X_c = X[y==c]
    ---> 17 self._mean[c, :] = X_c.mean(axis=0)
    18 self._var[c, :] = X_c.var(axis=0)
    19 self._priors[c] = X_c.shape[0] / float(n_samples)
    IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

    • @omkarpatil4386
      @omkarpatil4386 4 years ago

      Make your labels binary or encode the labels.
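
  A hedged sketch of the encoding fix suggested above, together with the enumerate-based alternative from earlier in the comments; either avoids indexing self._mean with a non-integer or out-of-range class label (LabelEncoder is just one common option and is not part of the video's from-scratch code):

      import numpy as np
      from sklearn.preprocessing import LabelEncoder

      y = np.array(["benign", "malignant", "benign"])   # hypothetical raw labels

      # Option 1: map arbitrary labels to 0..n_classes-1 before calling fit()
      y_encoded = LabelEncoder().fit_transform(y)        # -> array([0, 1, 0])

      # Option 2: keep the original labels and use a 0-based index inside fit():
      #   for idx, c in enumerate(self._classes):
      #       self._mean[idx, :] = X[y == c].mean(axis=0)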

  • @_Shrivi_
    @_Shrivi_ 4 years ago

    Hi, very good explanation. Can I use this code to train data for sentiment analysis as well?

  • @amauryribeiro1860
    @amauryribeiro1860 4 years ago +2

    just... thank you !! for your help! ^^

  • @samii8104
    @samii8104 2 years ago

    So I'm trying to run the algorithm on a dataset where the first half of y_train is 0 and the second half is 1.
    The problem is that when I try to predict for the first half of y_train, I get a division-by-zero error.
    Is there any way that using Laplace smoothing in the code could help me?

  • @jossyrayonieram5231
    @jossyrayonieram5231 2 years ago

    Hi. What do you mean by "classes" here? You mention classes "0" and "1", but I'm still not sure what you meant or why they are called "classes".

  • @bryanchambers1964
    @bryanchambers1964 3 years ago

    Hey there, I like your videos and you explain things well, but I am confused about something. There is a step in your code where you have:
    for c in self.classes:
    X_c = X[c==y]
    I understand the first line (for c in self.classes:), but I have no idea why you have X_c = X[c==y].
    If my c values are, for example, [1, 4, 8], then X_c = X[1==1] just gives me X_c with an extra dimension. For example, if X is a 3x4 matrix, X_c is now the same matrix except it has dimension 1x3x4. Am I just dumb, or am I overthinking this detail?

    • @patloeber
      @patloeber  3 years ago

      Note that y is an array as well, not just a number, and the length of y has to be the same as the first dimension of X! So X[y==1] gives you all rows of X where y is 1. Please note also that my code has a slight bug. It should be this (compare with my code on GitHub):
      for idx, c in enumerate(self._classes):
      X_c = X[y==c]
      self._mean[idx, :] = X_c.mean(axis=0)

    • @bryanchambers1964
      @bryanchambers1964 3 years ago +1

      @@patloeber Thanks, yeah I kind of realized this after a while. So, this will extract the rows of X that have that class y=1. Makes sense.

  • @vanshikajain8353
    @vanshikajain8353 3 years ago +1

    In the second function, predict, under the for loop there is a misplaced x which can be replaced by c in the class_conditional line; otherwise you get a ValueError exception.

    • @chandank5266
      @chandank5266 1 year ago

      Yeah! Actually I got confused at that point, but now it's clear. Thanks for confirming :)

  • @MuhammadAli-pf4ww
    @MuhammadAli-pf4ww 2 years ago

    Can anyone explain what X_c = X[c==y] is doing? I'm a little confused
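
  X_c = X[c==y] (more commonly written X[y==c]) is NumPy boolean masking: y == c produces a boolean array with one entry per sample, and using it as an index keeps only the rows of X whose label equals c. A tiny self-contained illustration:

      import numpy as np

      X = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
      y = np.array([0, 1, 0])

      mask = (y == 0)          # array([ True, False,  True])
      X_c = X[mask]            # rows of X belonging to class 0 -> [[1., 2.], [5., 6.]]
      print(X_c.mean(axis=0))  # per-feature mean for class 0 -> [3. 4.]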

  • @ГарикКубич
    @ГарикКубич 3 years ago +1

    Thank you so much, friend, very helpful

  • @ragaistanto6722
    @ragaistanto6722 4 years ago

    Thank you. For other friends, I also have a video tutorial on coding Naive Bayes in Python 3 that you can check out in case it suits you.
    ruclips.net/video/m0HVDfe0k90/видео.html

  • @anjaliacharya9506
    @anjaliacharya9506 4 years ago +1

    I tried to implement this on the WBCD dataset but I'm getting an error on the line "numerator = np.exp(- (x-mean)**2 / (2 * var))": UFuncTypeError. Could you help me with this?

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago

      I have used a label encoder to change the 'diagnosis' target column to integer type, but the error persists on the same line I mentioned. UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @jonn6897
      @jonn6897 4 years ago

      I have the same error with another dataset, looking forward to any help!

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago +2

      @@jonn6897 I tried converting all feature columns (everything except the target) to a NumPy array for the probability calculation, and then it works. In my case it is the WBCD dataset.
      y = wbcd_data.diagnosis
      X = wbcd_data.drop('diagnosis',axis=1)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
      #convert all columns with feature except target to numpy array to calculate probability
      X_train = np.array(X_train)
      X_test = np.array(X_test)

    • @patloeber
      @patloeber  4 years ago +2

      try casting your x to dtype=np.float64 before calling fit(), and yes of course it must be a numpy array
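
  A minimal, self-contained illustration of the casting fix suggested above: the UFuncTypeError usually means the features were read in as strings (or an object dtype), so the subtraction (x - mean) inside _pdf cannot broadcast; casting to np.float64 before calling fit() resolves it.

      import numpy as np

      X_raw = [["1.5", "2.0"], ["3.25", "4.0"]]    # e.g. values read in as strings
      X = np.asarray(X_raw, dtype=np.float64)      # cast features to numeric floats
      print(X - X.mean(axis=0))                    # the subtraction now works

      # any non-numeric column (IDs, free text, ...) has to be dropped or encoded first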

  • @redhwanalgabri7281
    @redhwanalgabri7281 3 years ago

    ('Naive Bayes classification accuracy', 0)

  • @BlueSkyGoldSun
    @BlueSkyGoldSun 1 year ago

    Any book you recommend for learning ML in native Python?

  • @madsmith1352
    @madsmith1352 1 year ago

    Gauss... rhymes with "house".

  • @AliHaider-hg7lj
    @AliHaider-hg7lj 4 years ago +1

    How can we train any model on it? I mean, if we have a CSV file, how can we use it with this model?

    • @patloeber
      @patloeber  4 years ago

      Load the data with pandas, or just manually with open(filename), and convert each line to your X and y vectors. Then create training and testing data and train your model.

    • @patloeber
      @patloeber  4 years ago

      I'm actually planning to release a short video in the next 1-2 days on how to load your own datasets from csv

    • @AliHaider-hg7lj
      @AliHaider-hg7lj 4 years ago

      @@patloeber Perfect & Thanks:)

    • @T4l0nITA
      @T4l0nITA 4 years ago +3

      data = pandas.read_csv("file_name.csv")
      X = data.iloc[samples, features].values
      y = data.iloc[samples, y_column].values

  • @tanziahkhanam6451
    @tanziahkhanam6451 3 years ago

    I got very low accuracy on my own dataset, only 0.3. What is the reason? I also got a warning: RuntimeWarning: divide by zero encountered in true_divide numerator = np.exp(- (x - mean) ** 2 / (2 * var))

    • @bong-techie
      @bong-techie 3 years ago

      How did you fix it? I'm facing the problem now, please help

  • @srikaramanaganti1285
    @srikaramanaganti1285 3 years ago

    Can you model the class-conditional probability using a Multinomial distribution?

  • @dinarakhaydarova4898
    @dinarakhaydarova4898 2 years ago

    exactly what i needed! thank you bunchesss

  • @robertrey7002
    @robertrey7002 2 years ago

    Hey man, that was a great tutorial! I would just like to ask, however: is there a way to know when you should use the Naive Bayes classifier?

    • @no_guarantees
      @no_guarantees 2 years ago

      Simplest application would be a binary classifier (0/1) or (no/yes) such as spam classification. You could experiment with NB where you would typically use logistic regression to build your intuition.

  • @kritamdangol5349
    @kritamdangol5349 4 years ago

    I got this error while running it. Please provide me a solution for this.
    line 54, in
    predicted_values=(model.predict(Features_test))
    line 20, in predict
    y_pred=[self._predict(x) for x in X]
    , in
    y_pred=[self._predict(x) for x in X]
    line 29, in _predict
    line 40, in _pdf
    numerator=np.exp(-(x-mean)**2/(2*var))
    numpy.core._exceptions.UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber  4 years ago +1

      probably your datatype or the shape of your vector is not correct. try casting to np.float32

    • @kritamdangol5349
      @kritamdangol5349 4 years ago

      @@patloeber Thank u !

  • @nafesafirdous3670
    @nafesafirdous3670 4 years ago

    If I have my own dataset which is not present in the sklearn datasets, how can I do the classification?
    Please help!

    • @patloeber
      @patloeber  4 years ago +1

      You need to load the dataset (probably from a CSV file) and set up your X and y NumPy arrays

    • @nafesafirdous3670
      @nafesafirdous3670 4 years ago

      @@patloeber Helpful
      Thanks
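
  A minimal, hedged sketch of the workflow described in the replies above, using a hypothetical my_data.csv with a "label" column; pandas and scikit-learn's train_test_split are just one convenient route and are not part of the video's from-scratch code:

      import numpy as np
      import pandas as pd
      from sklearn.model_selection import train_test_split

      data = pd.read_csv("my_data.csv")                            # hypothetical file
      X = data.drop("label", axis=1).to_numpy(dtype=np.float64)    # feature columns as floats
      y = data["label"].to_numpy()                                 # target column

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

      nb = NaiveBayes()            # the class implemented in the video
      nb.fit(X_train, y_train)
      predictions = nb.predict(X_test)
      print("accuracy:", np.mean(predictions == y_test))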

  • @kidspast7294
    @kidspast7294 2 years ago

    Great tutorial thanks!

  • @prithviamin6847
    @prithviamin6847 4 years ago

    Hi,
    I'm getting this error:
    UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber  4 years ago

      Try converting your data to np.float. And check if all your data is valid, probably you have NaN for some data points...

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi, I'm facing the same problem; you got it right.
      If you have corrected the code, please suggest what I should do.

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi

  • @FoodieTechVoyager
    @FoodieTechVoyager 3 years ago

    Hi, I am new to machine learning; it would be very helpful if you could provide the dataset too, or share a tutorial on how to create one.

    • @patloeber
      @patloeber  3 years ago

      thanks for the suggestion

  • @changsinlee4634
    @changsinlee4634 3 years ago

    A great tutorial and implementation. Just one correction on the implementation.
    _pdf is implemented differently than the formula. It should be:
    numerator = np.exp(- (x-mean)**2 / (2 * var**2))
    denominator = np.sqrt(2 * np.pi * var**2)
    The implemented code is missing the squared part.
    numerator = np.exp(- (x-mean)**2 / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)

    • @patloeber
      @patloeber  3 years ago

      Thanks for the feedback, but you are wrong; you may have confused standard deviation and variance. In most formulas (and in this video) it is written with the squared standard deviation, which is equal to the variance (so no square when using the variance directly) :)

    • @changsinlee4634
      @changsinlee4634 3 years ago

      @@patloeber Thanks for the quick reply. Ah, yes, I see it. In that case, it should be std**2. You get different values based on whether you use var or std**2. I was comparing the results with those of the standard library (from scipy.stats import norm) and that's when I discovered the differences.

    • @patloeber
      @patloeber  3 years ago

      @@changsinlee4634 oh this is interesting. Thanks for noticing this! I would expect that std**2 and var are exactly the same except for rounding errors
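
  A quick way to settle the var vs. std**2 question above, sketched here with SciPy as an outside check (not part of the video's code): NumPy's var() is by definition std()**2, and the hand-written density written in terms of var matches scipy.stats.norm.pdf when the scale is set to the standard deviation.

      import numpy as np
      from scipy.stats import norm

      rng = np.random.default_rng(0)
      data = rng.normal(loc=2.0, scale=3.0, size=1000)

      mean, var, std = data.mean(), data.var(), data.std()
      print(np.isclose(var, std ** 2))      # True: var equals std**2 up to rounding

      x = 1.5
      manual = np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
      print(np.isclose(manual, norm.pdf(x, loc=mean, scale=std)))   # True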

  • @seyeeet8063
    @seyeeet8063 4 years ago

    So NB does not have any update rule like gradient descent?

    • @patloeber
      @patloeber  4 years ago +1

      No, you just have to precalculate the priors, means, and variances, and then apply the formula using Bayes' theorem

  • @amitupadhyay6511
    @amitupadhyay6511 3 years ago

    What if the values in the _pdf matrix are inf, what then?

    • @patloeber
      @patloeber  3 years ago

      then you have a problem ;) yeah you should add some error checking and maybe clip the allowed range in the calculation

  • @boooringlearning
    @boooringlearning 3 years ago

    great video!

  • @tsotnegams
    @tsotnegams 4 years ago

    In the pdf method you wrote (2*var); it should be (2*var**2) because of the squared variance in the formula. Great tutorial otherwise.

    • @patloeber
      @patloeber  4 years ago +2

      No. The formula shows the squared standard deviation, which is equal to the variance (a small sigma is always used in statistics for the standard deviation). Probably I should have pointed this out better. Thanks for watching :)

    • @tsotnegams
      @tsotnegams 4 years ago +1

      @@patloeber You are right, thanks for the reply.

    • @patloeber
      @patloeber  4 years ago +1

      No problem :) you can always reach out when you have questions or find different errors

  • @joydeepkr.devnath193
    @joydeepkr.devnath193 4 years ago

    Hi, great video btw... One question about 4:43, where you define P(x_i|y) with the Gaussian formula: the Gaussian pdf is a density, so to get actual probabilities we would need integration. Do we approximate that integral as the area of a rectangle with height = pdf and width = some delta? And since we have a ratio of probabilities in the Bayesian formula, the numerator delta cancels the denominator delta, which is why we don't include the delta term in our formula. Is this how you are doing it?

    • @patloeber
      @patloeber  4 years ago +1

      This is a very good question! I hope this helps: stats.stackexchange.com/questions/26624/pdfs-and-probability-in-naive-bayes-classification

    • @joydeepkr.devnath193
      @joydeepkr.devnath193 4 years ago

      @@patloeber yes this link was helpful. Thanks !

    • @patloeber
      @patloeber  4 years ago

      @@joydeepkr.devnath193 sure :)

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    How do I use this code in Python with Anaconda?

    • @patloeber
      @patloeber  4 years ago

      I have a tutorial for Anaconda setup

  • @Lanipops
    @Lanipops 4 years ago

    Need to make the Naive Bayes file allow a 2D array

    • @patloeber
      @patloeber  4 years ago

      try to cast y to int before fitting the data: y = y.astype(np.int)

  • @marcosraphael3390
    @marcosraphael3390 4 years ago

    Is this an unlabeled classifier?

    • @patloeber
      @patloeber  4 years ago +1

      No, it is supervised learning

  • @viperz301
    @viperz301 4 years ago

    Hi! What do you mean by the self that you pass into every function? Is it the data frame?

    • @patloeber
      @patloeber  4 years ago +1

      This is an essential concept of object oriented programming and using classes in Python. self represents the instance of the class. By using the “self” keyword we can access the attributes and methods of the class in python. It binds the attributes with the given arguments.

    • @jossyrayonieram5231
      @jossyrayonieram5231 2 years ago

      @@patloeber out of all the things Python does for you automatically, they stopped with "self". >_
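
  A tiny illustration of the point above, with hypothetical names: self is simply the instance the method was called on, which is how fit() can store values that a later method call reads back; Python passes it automatically.

      class Example:
          def fit(self, value):
              self.stored = value      # saved on this particular instance

          def show(self):
              return self.stored       # the same instance's attribute, read back

      e = Example()
      e.fit(42)        # Python passes e as `self` behind the scenes
      print(e.show())  # 42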

  • @godwingeorgethekkanath
    @godwingeorgethekkanath 3 years ago

    Great tutorial😍
    It was useful for me.

    • @patloeber
      @patloeber  3 years ago +1

      thanks, glad you like it!

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    I tried to run this code in Anaconda on another dataset (iris), but I face a problem.

  • @reellezahl
    @reellezahl 2 years ago

    You need either a better microphone or to better adjust your sound settings. Your volume levels keep crashing and it's very grating on the ear.