Naive Bayes in Python - Machine Learning From Scratch 05 - Python Tutorial

  • Published on 7 Sep 2024

Comments • 119

  • @patloeber
    @patloeber 4 years ago +14

    There is a slight fix in the fit method that must be applied if class labels do not start at 0:
    for idx, c in enumerate(self._classes)
    instead of
    for c in self._classes
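
A minimal sketch of the corrected loop, assuming NumPy arrays X (samples by features) and y (one label per sample). The function name fit_gaussian_nb and the toy data are illustrative and not from the video; only the loop body mirrors the fix described above.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class mean, variance, and prior (hypothetical helper)."""
    n_samples, n_features = X.shape
    classes = np.unique(y)
    n_classes = len(classes)

    mean = np.zeros((n_classes, n_features), dtype=np.float64)
    var = np.zeros((n_classes, n_features), dtype=np.float64)
    priors = np.zeros(n_classes, dtype=np.float64)

    # idx is the row position (0..n_classes-1); c is the actual label value.
    for idx, c in enumerate(classes):
        X_c = X[y == c]
        mean[idx, :] = X_c.mean(axis=0)
        var[idx, :] = X_c.var(axis=0)
        priors[idx] = X_c.shape[0] / float(n_samples)
    return classes, mean, var, priors

# Labels 3 and 7 would be out of bounds if used directly as row indices.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]])
y = np.array([3, 3, 7, 7])
classes, mean, var, priors = fit_gaussian_nb(X, y)
```

With the original loop, `self._mean[c, :]` would try to write to row 3 of a 2-row array here, which is why the position from enumerate is used instead of the label.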

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago +3

      How do I solve this problem? What should I do?
      for idx, c in enumerate(self._classes):
          X_c = X[y==c]
          self._mean[idx, :] = X_c.mean(axis=0)
          self._var[idx, :] = X_c.var(axis=0)
          self._priors[idx] = X_c.shape[0] / float(n_samples)
      I get this error: boolean index did not match indexed array along dimension 1; dimension is 5 but corresponding boolean dimension is 1
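
One likely cause of this message (an assumption, since the input data is not shown) is a label array of shape (n_samples, 1), for example a single DataFrame column converted to a 2-D array. The comparison y == c then yields a 2-D mask that cannot select rows of X. The sketch below uses made-up data to reproduce the failure and then flattens y with ravel():

```python
import numpy as np

X = np.arange(20, dtype=np.float64).reshape(4, 5)   # 4 samples, 5 features
y_column = np.array([[0], [1], [0], [1]])           # shape (4, 1)

try:
    X[y_column == 0]      # 2-D mask of shape (4, 1) against X of shape (4, 5)
    mask_failed = False
except IndexError:
    mask_failed = True

y = y_column.ravel()      # shape (4,)
X_c = X[y == 0]           # selects rows 0 and 2
```

Checking `X.shape` and `y.shape` before calling fit is a quick way to confirm whether this applies to your data.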

    • @alitaangel8650
      @alitaangel8650 4 years ago

      @AliHussain-kb3ew The above code works fine for me. Maybe something is wrong with your input data?

    • @Dhanush-zj7mf
      @Dhanush-zj7mf 3 years ago +1

      I was stuck for 2 days and also posted a question on Stack Overflow. I think I should have read the comments first.

    • @robinsonnadar5457
      @robinsonnadar5457 3 years ago

      @AliHussain-kb3ew I am stuck on the same error too :(

    • @umarmughal5922
      @umarmughal5922 2 years ago

      @Python Engineer could you please explain how to apply Laplace to this?

  • @mattgoodman2687
    @mattgoodman2687 4 years ago +4

    Thank you for this. I had no clue how to conceptually grasp Naive Bayes, but after watching your video I understand it very well

    • @patloeber
      @patloeber 4 years ago +1

      I’m glad it is helpful :)

  • @kougamishinya6566
    @kougamishinya6566 2 years ago +2

    I love the way you explain what each line is doing and relate it back to the formulae, that's super helpful thank you!

  • @tkaczoro
    @tkaczoro 6 months ago

    Looks like for the same reason you removed P(X) from formula for y, you can also remove the prior term P(y). You will get the same result in calculation of accuracy.
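
A hedged note on this point: P(X) is the same for every class, so dropping it cannot change the argmax. P(y) differs between classes unless they are balanced, so dropping it can change predictions on imbalanced data. The numbers below are invented for illustration and are not taken from the video:

```python
import math

def log_gauss(x, mean, var):
    """Log of the univariate Gaussian density, parameterised by variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Two classes with one feature each: (mean, variance, prior). Made-up values.
params = {0: (0.0, 1.0, 0.9), 1: (2.0, 1.0, 0.1)}
x = 1.2

# Score each class with and without the log prior.
without_prior = {c: log_gauss(x, m, v) for c, (m, v, p) in params.items()}
with_prior = {c: log_gauss(x, m, v) + math.log(p) for c, (m, v, p) in params.items()}

pred_without = max(without_prior, key=without_prior.get)
pred_with = max(with_prior, key=with_prior.get)
```

Here the two predictions disagree, so the accuracy would only match when the priors are (close to) equal, as they may be on a balanced dataset.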

  • @vanshikajain8353
    @vanshikajain8353 3 years ago +1

    In the second function, predict, there is a misplaced x under the for loop. It should be replaced by c in the class conditional; otherwise you get a ValueError.

    • @chandank5266
      @chandank5266 1 year ago

      Yeah! Actually I got confused at that point, but now it's clear. Thanks for confirming :)

  • @andreaq.y1770
    @andreaq.y1770 4 years ago +4

    very good tutorial !!! hope you will update more about algorithm implementations

    • @patloeber
      @patloeber 4 years ago +1

      Thank you! Yes more videos are coming soon :)

  • @heidycespedes9220
    @heidycespedes9220 1 year ago

    Awesome explanation! It helped me to understand the concept and work on my project. Thanks a lot!

  • @Fresh290PL
    @Fresh290PL 2 years ago +1

    Great video, thanks! Just one thing: how can we avoid the zero-frequency problem in this implementation?
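
For context: the Gaussian version in the video estimates means and variances rather than counts, so the zero-frequency problem in its usual sense applies to count-based variants (categorical or multinomial Naive Bayes). A minimal sketch of add-one (Laplace) smoothing for a hypothetical categorical feature follows; the function name, the alpha parameter, and the data are assumptions for illustration only.

```python
from collections import Counter

def smoothed_likelihoods(values, vocabulary, alpha=1.0):
    """P(value | class) with additive smoothing; alpha=1 is Laplace smoothing."""
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}

# Feature values observed for one class; "blue" never occurs in it.
vocab = ["red", "green", "blue"]
probs = smoothed_likelihoods(["red", "red", "green"], vocab)
```

The unseen value "blue" receives a small nonzero probability instead of zero, so it no longer wipes out the whole product of likelihoods.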

  • @akshaygoel2184
    @akshaygoel2184 2 years ago +2

    Amazing implementation!
    Small question/point - for the PDF shouldn't the numerator var have a square term? i.e. (2 * var**2)?

    • @BlackHeart-AI
      @BlackHeart-AI 1 year ago

      f(x) = (1 / (σ * sqrt(2π))) * e^(-((x-μ)^2) / (2σ^2))
      In statistics, σ (the Greek letter sigma) represents the standard deviation of a population. The standard deviation is a measure of the spread or dispersion of a set of data around its mean.
      Standard deviation is closely related to the variance, which is equal to the square of the standard deviation, and is denoted by σ^2.
      Just σ^2 == variance

  • @amauryribeiro1860
    @amauryribeiro1860 4 years ago +2

    just... thank you !! for your help! ^^

    • @patloeber
      @patloeber 4 years ago

      You are welcome!

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 3 years ago +2

    @PythonEngineer I'm using this on a large dataset with 8 columns and ~16000 rows. It's saying "IndexError: index 10000 is out of bounds for axis 0 with size 210". Do you know how I can fix this?

  • @user-tp7ry2sf4l
    @user-tp7ry2sf4l 3 years ago +1

    Thank you so much friend, very helpful

    • @patloeber
      @patloeber 3 years ago

      Glad you like it!

  • @posadzd7343
    @posadzd7343 3 years ago +1

    Good video, learnt a lot. Could you please implement a Bayes classifier based on Parzen window density estimation?

  • @dinarakhaydarova4898
    @dinarakhaydarova4898 2 years ago

    exactly what i needed! thank you bunchesss

  • @changsinlee4634
    @changsinlee4634 3 years ago

    A great tutorial and implementation. Just one correction on the implementation.
    _pdf is implemented differently than the formula. It should be:
    numerator = np.exp(- (x-mean)**2 / (2 * var**2))
    denominator = np.sqrt(2 * np.pi * var**2)
    The implemented code is missing the squared part.
    numerator = np.exp(- (x-mean)**2 / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)

    • @patloeber
      @patloeber 3 years ago

      thanks for the feedback. but you are wrong, you may have confused standard deviation and variance. in most formulas (and this video) it is written with the squared standard deviation, which is equal to the variance (so no square when using the variance directly) :)

    • @changsinlee4634
      @changsinlee4634 3 years ago

      @patloeber Thanks for the quick reply. Ah, yes, I see it. In that case, it should be std**2. You get different values based on whether you use var or std**2. I was comparing the results with those of the standard library (from scipy.stats import norm) and that's when I discovered the differences.

    • @patloeber
      @patloeber 3 years ago

      @changsinlee4634 Oh, this is interesting. Thanks for noticing this! I would expect std**2 and var to be exactly the same except for rounding errors.
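
One possible explanation for the mismatch (an assumption, since the actual call is not shown in the thread): scipy.stats.norm takes the standard deviation as its scale argument, so passing the variance there gives a different density. The pure-Python sketch below, with made-up numbers, shows that var and std**2 agree and that mixing the two up does not:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density written in terms of the variance, as in the thread's _pdf."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x, mean, std = 1.0, 0.0, 2.0
var = std ** 2

p_var = gaussian_pdf(x, mean, var)          # variance passed as variance
p_std_sq = gaussian_pdf(x, mean, std ** 2)  # same value
p_mixed = gaussian_pdf(x, mean, std)        # std passed where variance is expected
```

When comparing against a library, it is worth checking whether its parameter is the standard deviation or the variance before concluding that the formula is wrong.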

  • @matthewcallinankeenan2034
    @matthewcallinankeenan2034 3 years ago +1

    What do we change about this program if the class isn't just True/False eg self._classes isn't just [0,1]

    • @patloeber
      @patloeber 3 years ago

      It works for multiple classes; however, you have to change the for loop like this: for idx, c in enumerate(self._classes):
      In my GitHub repo I already applied this fix.

  • @ramazanburakguler5842
    @ramazanburakguler5842 1 year ago

    In terms of regularization, what can be done?

  • @kidspast7294
    @kidspast7294 2 years ago

    Great tutorial thanks!

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 1 year ago

    can anyone pls tell why are we adding prior+class_conditional variables?
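
A short illustration of why the terms are added: the posterior is proportional to a product of the prior and the per-feature likelihoods, and the logarithm of a product is the sum of the logarithms. Working with sums of logs also avoids floating-point underflow. The likelihood values below are made up to force the underflow:

```python
import math

prior = 0.3
likelihoods = [1e-120, 3e-150, 2e-110]   # invented per-feature densities

# Direct product: too small for a float, collapses to 0.0.
product = prior
for p in likelihoods:
    product *= p

# Same quantity in log space: log(prior) + sum of log(likelihood).
log_posterior = math.log(prior) + sum(math.log(p) for p in likelihoods)
```

Because the logarithm is monotonic, the class with the largest log posterior is also the class with the largest posterior.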

  • @abhisheksuryavanshi979
    @abhisheksuryavanshi979 1 year ago

    No init function inside the NaiveBayes class?

  • @debatradas9268
    @debatradas9268 2 years ago

    thank you

  • @ozysjahputera7669
    @ozysjahputera7669 2 years ago

    The pdf implemented here is only for univariate gaussian, correct? Multivariate would have involved covariance matrix inverse, and determinant.
    Never mind. You assume all features are independent of each other.

  • @MuhammadAli-pf4ww
    @MuhammadAli-pf4ww 2 years ago

    Can anyone explain what X_c = X[c==y] is doing? I'm a little confused
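
A small sketch of what that line does, using toy arrays that are not from the video: comparing the label array y with a class value c produces a boolean mask with one entry per sample, and indexing X with that mask keeps only the rows where the mask is True.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0])

mask = (y == 0)   # one boolean per sample: True, False, True
X_c = X[mask]     # keeps rows 0 and 2, i.e. the samples of class 0
```

Writing c == y or y == c gives the same mask; the result X_c is the subset of samples used to compute that class's mean and variance.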

  • @T4l0nITA
    @T4l0nITA 4 years ago

    Really good explanation.

  • @samii8104
    @samii8104 2 years ago

    So I'm trying to run the algorithm on a dataset whose y_train labels are 0 in the first half and 1 in the second half.
    The problem is that when I try to get predictions for the first half of y_train, I get a division-by-zero error.
    Is there any way Laplace smoothing in the code could help me?
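
Laplace smoothing applies to count-based models. For the Gaussian version, a common workaround for division by zero is to add a small constant to the variance, since a feature that is constant within a class has variance 0. scikit-learn's GaussianNB exposes a var_smoothing parameter for a similar purpose. The sketch below is an assumption-laden illustration: the eps value, the function name, and the data are hypothetical.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var, eps=1e-9):
    """Log density per feature; eps keeps the variance strictly positive."""
    var = var + eps
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Samples of one class; the first feature is constant, so its variance is 0.
X_c = np.array([[1.0, 5.0], [1.0, 7.0]])
mean, var = X_c.mean(axis=0), X_c.var(axis=0)

log_p = gaussian_log_pdf(np.array([1.0, 6.0]), mean, var)
```

A suitable eps depends on the scale of the features, so treat 1e-9 as a starting point rather than a recommendation.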

  • @srikaramanaganti1285
    @srikaramanaganti1285 3 years ago

    Can you model the class-conditional probability using a multinomial distribution?

  • @godwingeorgethekkanath
    @godwingeorgethekkanath 3 years ago

    Great tutorial😍
    It was useful for me.

    • @patloeber
      @patloeber 3 years ago +1

      thanks, glad you like it!

  • @robertrey7002
    @robertrey7002 2 years ago

    Hey man that was a great tutorial! I would just like to ask however, is there a way to know when you should use the Naive Bayes classifier?

    • @no_guarantees
      @no_guarantees 2 years ago

      Simplest application would be a binary classifier (0/1) or (no/yes) such as spam classification. You could experiment with NB where you would typically use logistic regression to build your intuition.

  • @OnlineGreg
    @OnlineGreg 2 years ago

    hey, thanks a lot for this series. One question: why do you often put an underscore _ in front of a function or a variable?

    • @derilraju2106
      @derilraju2106 2 years ago

      It's a general way to describe private methods which need not be called in the main function

  • @shehanjanidu2334
    @shehanjanidu2334 3 years ago

    I was using my own csv file as my dataset but it gives ufunc 'subtract' did not contain a loop with signature matching types (dtype('

  • @Lanipops
    @Lanipops 4 years ago +1

    Tried to run this but i keep getting this error:
    ~/anaconda3/envs/XXXXXX6/aima-python-master/naivebayes.py in fit(self, X, y)
    15 for c in self._classes:
    16 X_c = X[y==c]
    ---> 17 self._mean[c, :] = X_c.mean(axis=0)
    18 self._var[c, :] = X_c.var(axis=0)
    19 self._priors[c] = X_c.shape[0] / float(n_samples)
    IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

    • @omkarpatil4386
      @omkarpatil4386 4 years ago

      Make your labels binary or encode the labels.
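
One way to encode labels, sketched with NumPy only: np.unique with return_inverse=True maps arbitrary labels (strings or floats) to integer codes 0..n_classes-1. The label values below are made up; this assumes the IndexError comes from non-integer labels being used as array indices.

```python
import numpy as np

y_raw = np.array(["malignant", "benign", "benign", "malignant"])

# classes holds the sorted unique labels; y_encoded holds their positions.
classes, y_encoded = np.unique(y_raw, return_inverse=True)

# Decoding a prediction back to the original label is a plain lookup.
decoded_first = classes[y_encoded[0]]
```

With the enumerate fix from the pinned comment applied, encoding is no longer strictly required, but integer codes still make the labels easier to handle.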

  • @prateekarora4549
    @prateekarora4549 3 years ago

    very good tutorial !

  • @jossyrayonieram5231
    @jossyrayonieram5231 2 years ago

    Hi. What do you mean by "classes" here? You mention classes "0" and "1", but I'm still not sure what you meant or why they are called "classes".

  • @anjaliacharya9506
    @anjaliacharya9506 4 years ago +1

    I tried to implement this on the WBCD dataset but I am getting a UFuncTypeError in the line "numerator = np.exp(- (x-mean)**2 / (2 * var))". Could you help me with this?

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago

      I have used label encoder to change 'diagnosis' target column to integer type but the error persists in the same line I mentioned. UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @jonn6897
      @jonn6897 4 years ago

      I have the same error with another dataset, looking forward to any help!

    • @anjaliacharya9506
      @anjaliacharya9506 4 years ago +2

      @jonn6897 I tried converting all feature columns (everything except the target) to NumPy arrays for the probability calculation, and then it works. In my case it is the WBCD dataset.
      y = wbcd_data.diagnosis
      X = wbcd_data.drop('diagnosis',axis=1)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
      #convert all columns with feature except target to numpy array to calculate probability
      X_train = np.array(X_train)
      X_test = np.array(X_test)

    • @patloeber
      @patloeber 4 years ago +2

      try casting your x to dtype=np.float64 before calling fit(), and yes of course it must be a numpy array
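
A minimal sketch of that cast, assuming the features were read as strings (one plausible cause of the ufunc 'subtract' error; the actual data in this thread is not shown). The array contents are invented:

```python
import numpy as np

# Features that arrived as text, e.g. from a CSV read without type conversion.
X_str = np.array([["1.5", "2"], ["3", "4.25"]])

try:
    X_str - 1.0                      # arithmetic on a string array fails
    subtraction_failed = False
except TypeError:                    # UFuncTypeError is a TypeError subclass
    subtraction_failed = True

X = X_str.astype(np.float64)         # numeric array suitable for fit()
has_nan = bool(np.isnan(X).any())    # missing values also need handling
```

If the cast itself raises a ValueError, some cell contains non-numeric text and needs cleaning or encoding first.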

  • @FoodieTechVoyager
    @FoodieTechVoyager 3 years ago

    Hi, I am new to Machine learning, it would be very helpful if you could provide the dataset too , or share a tutorial on how to create that

    • @patloeber
      @patloeber 3 years ago

      thanks for the suggestion

  • @tanziahkhanam6451
    @tanziahkhanam6451 3 years ago

    I got very low accuracy on my own dataset, only 0.3. What is the reason? I also got a warning: RuntimeWarning: divide by zero encountered in true_divide numerator = np.exp(- (x - mean) ** 2 / (2 * var))

    • @bong-techie
      @bong-techie 2 years ago

      How did you fix it? I'm facing the same problem now, please help.

  • @_Shrivi_
    @_Shrivi_ 4 years ago

    Hi, very good explanation. Can I use this code to train on data for sentiment analysis as well?

  • @BlueSkyGoldSun
    @BlueSkyGoldSun 1 year ago

    Any book you recommend to learn ml in native python?

  • @joydeepkr.devnath193
    @joydeepkr.devnath193 4 years ago

    Hi, great video btw. One question at 4:43, where you define P(x_i|y) = Gaussian formula: the Gaussian pdf is a density, so to get probabilities we need integration. Do we approximate this integral as the area of a rectangle with height = pdf and width = some delta? Since we have a ratio of probabilities in the Bayes formula, the delta in the numerator cancels the delta in the denominator, which is why we don't include the delta term in our formula. Is this how you are doing it?

    • @patloeber
      @patloeber 4 years ago +1

      This is a very good question! I hope this helps: stats.stackexchange.com/questions/26624/pdfs-and-probability-in-naive-bayes-classification

    • @joydeepkr.devnath193
      @joydeepkr.devnath193 3 years ago

      @patloeber Yes, this link was helpful. Thanks!

    • @patloeber
      @patloeber 3 years ago

      @joydeepkr.devnath193 Sure :)

  • @AliHaider-hg7lj
    @AliHaider-hg7lj 4 years ago +1

    How can we train any model on it? I mean if we have a csv file so how can we use it on this model?

    • @patloeber
      @patloeber 4 years ago

      load the data with pandas or just manually with open(filename) and convert each line to your x and y vectors. then create training and testing data and train your model

    • @patloeber
      @patloeber 4 years ago

      I'm actually planning to release a short video in the next 1-2 days on how to load your own datasets from csv

    • @AliHaider-hg7lj
      @AliHaider-hg7lj 4 years ago

      @patloeber Perfect & thanks :)

    • @T4l0nITA
      @T4l0nITA 4 years ago +3

      data = pandas.read_csv("file_name.csv")
      X = data.iloc[samples, features].values
      y = data.iloc[samples, y_column].values
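
In the snippet above, samples, features, and y_column are placeholders for your own row and column selections. As a dependency-light alternative, here is a sketch using only the standard csv module and NumPy. It assumes a header row, numeric features, and the label in the last column; the inline CSV text stands in for a real file.

```python
import csv
import io

import numpy as np

# Stand-in for a real file; with a file use open("file_name.csv", newline="").
csv_text = """f1,f2,label
1.0,2.0,0
3.0,4.0,1
5.0,6.0,0
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)                              # skip the column names
rows = [[float(value) for value in row] for row in reader]

data = np.array(rows, dtype=np.float64)
X = data[:, :-1]                                   # every column except the last
y = data[:, -1].astype(int)                        # last column as integer labels
```

From here, X and y can be split into training and test sets and passed to fit and predict as in the video.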

  • @boooringlearning
    @boooringlearning 3 years ago

    great video!

  • @bryanchambers1964
    @bryanchambers1964 3 years ago

    Hey there, I like your videos and you explain well, but I am confused about something. There is a step in your code where you have:
    for c in self.classes:
        X_c = X[c==y]
    I understand the first line (for c in self.classes:), but I have no idea why you have X_c = X[c==y].
    If my c values are, for example, [1, 4, 8], then X_c = X[1==1] just gives me X_c with an extra dimension. For example, if X is a 3x4 matrix, X_c is now the same matrix except it has dimension 1x3x4. Am I just dumb or overthinking this detail?

    • @patloeber
      @patloeber 3 years ago

      Note that y is an array as well, not just a number, and the length of y has to be the same as the first dimension of X! So X[1==y] gives you all rows of X where y is 1. Please also note that my code has a slight bug. It should be this (compare with my code on GitHub):
      for idx, c in enumerate(self._classes):
          X_c = X[y==c]
          self._mean[idx, :] = X_c.mean(axis=0)

    • @bryanchambers1964
      @bryanchambers1964 3 years ago +1

      @patloeber Thanks, yeah I kind of realized this after a while. So, this will extract the rows of X that have that class y=1. Makes sense.

  • @nobody2937
    @nobody2937 2 years ago

    Also, make sure var is NOT 0 ...

  • @seyeeet8063
    @seyeeet8063 4 years ago

    So NB does not have any update rule like gradient descent?

    • @patloeber
      @patloeber 4 years ago +1

      No you just have to pre calculate priors and mean and var, and then apply the formula using Bayes‘ theorem
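
To make that concrete, here is a sketch of the prediction step once the priors, means, and variances are known. It scores each class by log prior plus summed log likelihoods and returns the best one. The function name and the parameter values are hypothetical, and the log density is written out directly rather than copied from the video's _pdf.

```python
import numpy as np

def predict_one(x, classes, mean, var, priors):
    """Return the class with the highest log posterior for one sample."""
    posteriors = []
    for idx in range(len(classes)):
        log_prior = np.log(priors[idx])
        # Sum of per-feature Gaussian log densities (independence assumption).
        log_likelihood = np.sum(
            -0.5 * np.log(2.0 * np.pi * var[idx])
            - (x - mean[idx]) ** 2 / (2.0 * var[idx])
        )
        posteriors.append(log_prior + log_likelihood)
    return classes[int(np.argmax(posteriors))]

# Made-up, already "fitted" parameters for two classes and two features.
classes = np.array([0, 1])
mean = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])

label = predict_one(np.array([4.5, 5.2]), classes, mean, var, priors)
```

No iteration over the training data happens at prediction time; fitting is a single pass that computes the summary statistics.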

  • @viperz301
    @viperz301 4 years ago

    Hi! what do you mean by the self that you pass into every function? is it the data frame?

    • @patloeber
      @patloeber 4 years ago +1

      This is an essential concept of object-oriented programming and of using classes in Python. self represents the instance of the class. Through self (a naming convention for the first method parameter, not a reserved keyword) we can access the attributes and methods of the instance. It binds the attributes with the given arguments.
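
A tiny illustration with a hypothetical Tally class (not part of the tutorial code): each instance keeps its own state, and self is how a method reaches the state of the instance it was called on.

```python
class Tally:
    def __init__(self):
        self.count = 0        # attribute stored on this particular instance

    def increment(self):
        self.count += 1       # modifies only the instance the method is called on
        return self.count

a = Tally()
b = Tally()
a.increment()
a.increment()
b.increment()

# a.increment() is shorthand for Tally.increment(a): the instance is passed as self.
same_call = Tally.increment(b)
```

In the NaiveBayes class, self._mean and self._var work the same way: fit stores them on the instance and predict reads them back.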

    • @jossyrayonieram5231
      @jossyrayonieram5231 2 years ago

      @patloeber out of all the things Python does for you automatically, they stopped with "self". >_

  • @nafesafirdous3670
    @nafesafirdous3670 4 years ago

    If I have my own dataset, which is not in sklearn's datasets, how can I do classification?
    Please help!

    • @patloeber
      @patloeber 4 years ago +1

      You need to load the dataset (probably from a csv file) and setup your X and y numpy arrays

    • @nafesafirdous3670
      @nafesafirdous3670 4 years ago

      @patloeber Helpful
      Thanks

  • @madsmith1352
    @madsmith1352 11 months ago

    Gauss.. rhymes with house..

  • @prithviamin6847
    @prithviamin6847 4 years ago

    hi
    i'm getting this error:
    UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber 4 years ago

      Try converting your data to float (for example np.float64; the np.float alias was removed in newer NumPy versions). And check if all your data is valid, probably you have NaN for some data points...

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi, I'm facing the same problem. Did you get it working?
      If you corrected the code, please tell me what to do.

    • @AliHussain-kb3ew
      @AliHussain-kb3ew 4 years ago

      Hi

  • @amitupadhyay6511
    @amitupadhyay6511 3 years ago

    what if the values in _pdf matrix are inf, then?

    • @patloeber
      @patloeber 3 years ago

      then you have a problem ;) yeah you should add some error checking and maybe clip the allowed range in the calculation
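
One way to reduce such problems, sketched under the assumption that only the log of the density is needed for classification: compute the log density directly instead of taking the log of the density. For a sample far from the mean the density underflows to 0 and its log becomes -inf, while the direct formula stays finite. The numbers are made up.

```python
import numpy as np

x, mean, var = 100.0, 0.0, 1.0

# Density first, then log: exp(-5000) underflows to 0.0, so the log is -inf.
pdf = np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
with np.errstate(divide="ignore"):
    naive = np.log(pdf)

# Log density computed directly: the same quantity, but it stays finite.
stable = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
```

This does not help when var itself is 0; that case still needs a check or a small additive constant on the variance.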

  • @kritamdangol5349
    @kritamdangol5349 4 years ago

    I got this error when running the code. Please suggest a solution.
    line 54, in
    predicted_values=(model.predict(Features_test))
    line 20, in predict
    y_pred=[self._predict(x) for x in X]
    , in
    y_pred=[self._predict(x) for x in X]
    line 29, in _predict
    line 40, in _pdf
    numerator=np.exp(-(x-mean)**2/(2*var))
    numpy.core._exceptions.UFuncTypeError: ufunc 'subtract' did not contain a loop with signature matching types (dtype('

    • @patloeber
      @patloeber 4 years ago +1

      probably your datatype or the shape of your vector is not correct. try casting to np.float32

    • @kritamdangol5349
      @kritamdangol5349 4 years ago

      @patloeber Thank you!

  • @Lanipops
    @Lanipops 4 years ago

    need to make the naive bayes file allow 2d array

    • @patloeber
      @patloeber 4 years ago

      Try casting y to int before fitting the data: y = y.astype(int)

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    How do I use this code in Anaconda Python?

    • @patloeber
      @patloeber 4 years ago

      I have a tutorial for Anaconda setup

  • @marcosraphael3390
    @marcosraphael3390 4 years ago

    This is an unlabeled classifier?

    • @patloeber
      @patloeber 4 years ago +1

      No, it is supervised learning

  • @tsotnegams
    @tsotnegams 4 years ago

    In the pdf method you wrote (2*var); it should be (2*var**2) because of the squared variance in the formula. Great tutorial otherwise.

    • @patloeber
      @patloeber 4 years ago +2

      No. The formula shows the squared standard deviation, which is equal to the variance (small sigma is always used in statistics for standard deviation). probably i should have pointed this out better. thanks for watching :)

    • @tsotnegams
      @tsotnegams 4 years ago +1

      @patloeber You are right, thanks for the reply.

    • @patloeber
      @patloeber 4 years ago +1

      No problem :) you can always reach out when you have questions or find different errors

  • @redhwanalgabri7281
    @redhwanalgabri7281 3 years ago

    ('Naive Bayes classification accuracy', 0)

  • @AliHussain-kb3ew
    @AliHussain-kb3ew 4 years ago

    I tried to run this code in Anaconda on another iris dataset, but I ran into a problem.

    • @patloeber
      @patloeber 4 years ago

      Which problem?

  • @ragaistanto6722
    @ragaistanto6722 4 years ago

    Thank you. For everyone else, I also have a video tutorial on coding Naive Bayes in Python 3; feel free to check it out, it might suit you.
    th-cam.com/video/m0HVDfe0k90/w-d-xo.html

  • @reellezahl
    @reellezahl 2 years ago

    You need either a better microphone or to better adjust your sound settings. Your volume levels keep crashing and it's very grating on the ear.