wish i could add more thousands of likes from my side. such a great explanation!!
Thank you sir!
Thanks, Krish for making videos in Hindi. You always make things easy to understand.
Thank you so much, Krish sir, for making videos in Hindi. Your way of explaining is very easy to follow; you make even complex things easy 😊😊
After 1 year, today I understood why we have a log term in the cost function of logistic regression 22:00
You are doing amazing work man
23:22 need to keep in mind ? because i am very bad with logs
explanation is good.
But the Explanation of Nitish sir Campusx is another level.
True
Superb Explanation Sir ❤❤
Great explanation, thank you dear sir, be happy😍
The intuition is good, but it would help if you could give a proper derivation and also the thought process, i.e. how we came to think about it this way. That would be deep!!!
quality content ❤🔥❤🔥
Loved it, quick and understandable
Thank you so much this one clear my whole droughts
bro its doubts not droughts
Thank you sir.
thank you sir ..so helpful for me
Very Helpful Video
great tutorial
Bro, please make a video on Text Mining and Sentiment Analysis.
Nice 👍
You are legend!!
Thank you sir
Krish, when will next community session start
Pass=1 and Fail=0 is okay so far, but what is higher than 1? And how can study hours be less than 0? Time cannot be less than 0.
You cut corners when getting from the local minima to the global one!..
thanks a lot
👌
The maths for logistic regression you uploaded in the ML playlist is completely different from the Hindi playlist. Which one is correct? 🙆♂️😰😰
Even I had the same confusion, @krishnaik could you please clarify?
In that case what does "Maximum Likelihood" mean?
Maximum likelihood is simply used to estimate the parameters, i.e. the coefficients; these coefficients are then used in the odds and log odds.
The likelihood of the parameters is the probability of having observed the dataset you have, given that you choose a particular set of parameters. Maximum likelihood estimation says you want the set of parameters that maximises that probability. You do that by taking the gradient of the likelihood (or, more conveniently, the log-likelihood) with respect to the parameters, setting it to 0, and solving for the parameters. When that equation has no closed-form solution, as in logistic regression, you solve it iteratively, e.g. with gradient ascent.
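A minimal sketch of that recipe on the simplest possible case, a single Bernoulli probability p (not logistic regression yet). The ten pass/fail outcomes and the function names are made up for illustration. Setting the derivative of the log-likelihood to zero gives p = mean(y), and a coarse grid search is used only as a sanity check.

```python
import numpy as np

def bernoulli_log_likelihood(p, y):
    """Log-likelihood of success probability p for 0/1 observations y."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def bernoulli_score(p, y):
    """Derivative of the log-likelihood with respect to p."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(y / p - (1.0 - y) / (1.0 - p)))

# Hypothetical sample of 10 pass (1) / fail (0) outcomes, invented for this sketch.
y_obs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Solving "derivative = 0" by hand gives the closed form p_hat = mean(y).
p_hat = float(np.mean(y_obs))

# Cross-check: evaluate the log-likelihood on a grid and pick the best p.
grid = np.linspace(0.01, 0.99, 99)
log_liks = [bernoulli_log_likelihood(p, y_obs) for p in grid]
p_grid = float(grid[int(np.argmax(log_liks))])

print("closed-form estimate:", p_hat)
print("grid-search estimate:", p_grid)
print("derivative at p_hat:", bernoulli_score(p_hat, y_obs))
```

Both estimates should agree, and the derivative at p_hat should be zero up to floating-point error.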
Don't we need to square the last equation?
no
@@parul15137 why???
@@uroojmalik8454 because squared error combined with the sigmoid makes the cost function non-convex
sir, for classification we have classifier model. so, why logistic Regression
You can use any model whichever gives you best performance wrt training and testing data
Logistic regression is used for classification problems; its name says regression, but it is actually a classifier.
@@RonaldoRewind-cr7 exactly
Because in logistic regression we use the sigmoid function, and the sigmoid returns values between 0 and 1.
Bessssssssssssssssttttttttttttttt
Sir you didn't teach here about loss function in logistic regression
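The cost function has the same overall shape as in regression (an average of per-example losses, with the hypothesis function replaced by the logistic regression hypothesis), but the per-example loss is the log loss rather than the squared error.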
@@kartikeysingh5781 thanks kartikey, I got it that was 1 year ago 😂
9:48
Hello sir
Can you please provide notes in pdf form?
Thanks
Please tell us why a log function is used as a cost function(if you know at all)
if you know -we are all ears.
@@shaileshkumar-rg9tg Sure thing! It's done to ensure the cost function is convex.
There are various flavours of ML algorithms. Logistic regression takes the approach of learning a discriminative function that classifies a point into a particular label: a function f: X -> Y such that f(datapoint) = class_label (belonging to the set Y). If you put the sigmoid hypothesis inside a mean_squared_error loss, the resulting loss is not a convex function of the weights. I have attempted a proof of it, but it involves a bit of intricate mathematics: you show that the Hessian of the loss is neither positive semi-definite everywhere nor negative semi-definite everywhere, hence the loss is neither convex nor concave. With the logistic loss, the log-likelihood is a concave function, so you do gradient ascent to reach its maximum; equivalently, the negative log-likelihood (the log loss) is convex and you do gradient descent to reach its minimum. These are concepts from Convex Optimization, which you can read about in Boyd if interested.
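This is not the Hessian proof mentioned above, only a rough numerical check under simplifying assumptions: one weight w and one made-up training example (x = 1, y = 1). It estimates the second derivative of each loss by finite differences. The squared-error-on-sigmoid loss has curvature that changes sign, while the log loss has positive curvature everywhere on the grid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(w, x=1.0, y=1.0):
    """Squared error with a sigmoid hypothesis, for one example."""
    return (sigmoid(w * x) - y) ** 2

def log_loss(w, x=1.0, y=1.0):
    """Negative log-likelihood (log loss) for one example."""
    p = sigmoid(w * x)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def second_derivative(f, w, h=1e-3):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / h**2

# Grid of weight values to probe (the range is an arbitrary choice).
ws = np.linspace(-6.0, 6.0, 121)
mse_curvature = second_derivative(squared_error, ws)
log_curvature = second_derivative(log_loss, ws)

# Non-convex: curvature is negative in some places and positive in others.
mse_changes_sign = bool(mse_curvature.min() < 0.0 < mse_curvature.max())
# Convex on this grid: curvature is positive everywhere.
log_is_convex = bool(np.all(log_curvature > 0.0))

print("squared error curvature changes sign:", mse_changes_sign)
print("log loss curvature always positive:", log_is_convex)
```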
Can you explain probabilistic approach for logistic regression?
Maximum likelihood
Let's say you have a 2-class classification problem. You assume that the values of your random variable Y come from a Bernoulli distribution, with each label being either 0 or 1. Y takes the value 1 with some probability theta (say); since the probabilities of a pmf add up to one, Y takes the value 0 with probability (1 - theta). Now you have a dataset consisting of features X (n features, say) and a target Y, with m observations (samples). What you want to learn is a mapping f such that f: X -> Y. This f can be a probabilistic function as well. You define the probability that a particular datapoint has y value 1, given its features x, as Pr(y_i = 1 | x_i). What you want now is the probability of having observed the values of Y across the dataset in their particular order (e.g. y_1 takes value 1, y_2 takes value 0; these y values are what you have in the dataset) given the features X across the whole dataset (in the same order), i.e. Pr(Y | X; theta), read as the probability of having observed Y given that you have observed X, parameterised by theta. You define your likelihood function L(theta), the likelihood of theta, as this probability of having observed Y given X. Since the observations are independent and believed to come from the same distribution (sampled with replacement), i.i.d. for short, Pr(Y | X; theta) = product over all i of Pr(y_i | x_i; theta), where each factor is the probability of the label actually observed: Pr(y_i = 1 | x_i; theta) if y_i = 1, and 1 - Pr(y_i = 1 | x_i; theta) if y_i = 0. This uses the independence property of probability, which says Pr(A and B) = Pr(A) * Pr(B) if events A and B are independent. Taking the log of both sides makes the calculation easier and turns the product into a sum: sum over all i of log(Pr(y_i | x_i; theta)). This is called the log-likelihood.
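Written out in symbols, the likelihood and log-likelihood described above would look roughly like this. The shorthand \theta_i for Pr(y_i = 1 | x_i; theta) is an assumption of this sketch, not notation from the comment, and each factor is the probability of the label actually observed.

```latex
\begin{align*}
L(\theta) &= \Pr(Y \mid X; \theta)
           = \prod_{i=1}^{m} \Pr(y_i \mid x_i; \theta)
           = \prod_{i=1}^{m} \theta_i^{\,y_i} (1 - \theta_i)^{1 - y_i}, \\
\ell(\theta) &= \log L(\theta)
           = \sum_{i=1}^{m} \bigl[\, y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \,\bigr].
\end{align*}
```

The exponents simply select \theta_i when y_i = 1 and 1 - \theta_i when y_i = 0.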
What you now want is the value of theta for which this expression is maximised, which is known as maximum likelihood estimation. I should also add that theta is assumed to be a function of w^T x, namely theta = g(w^T x), where g is typically the sigmoid function. So you substitute this function into the log-likelihood expression and then take the gradient w.r.t. w.
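One way this could look in code, as a sketch rather than the method from the video: substituting the sigmoid into the log-likelihood and differentiating w.r.t. w gives the gradient X^T (y - sigmoid(Xw)), and plain gradient ascent follows it uphill. The hours/pass data, the learning rate and the step count are all made-up assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """Bernoulli log-likelihood with Pr(y=1|x) = sigmoid(w^T x)."""
    p = sigmoid(X @ w)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def gradient(w, X, y):
    """Gradient of the log-likelihood w.r.t. w: X^T (y - sigmoid(Xw))."""
    return X.T @ (y - sigmoid(X @ w))

def fit_gradient_ascent(X, y, learning_rate=0.05, n_steps=5000):
    """Climb the concave log-likelihood with fixed-step gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w + learning_rate * gradient(w, X, y)
    return w

# Hypothetical data: hours studied -> pass (1) / fail (0).
# The labels overlap on purpose so that a finite maximum exists.
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
passed = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

# Prepend a column of ones so w[0] acts as the intercept.
X = np.column_stack([np.ones_like(hours), hours])

w_hat = fit_gradient_ascent(X, passed)
print("estimated weights:", w_hat)
print("gradient norm at the estimate:", np.linalg.norm(gradient(w_hat, X, passed)))
```

A gradient norm close to zero at the end is the numerical counterpart of "set the gradient to 0 and solve".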
y=0, y=1, y is predicted value right?
Timestamp?
Sir this ML playlist is enough to learn complete machine learning.
nope
What do you mean by this, @@anonymousperson7054?
Sir, I want to ask how z = theta0 + theta1 X1 is converted to z = theta transpose of x. Waiting for your reply.
Here theta = [theta0, theta1] and X = [1, X1]. We transpose the theta vector so that the multiplication gives a single value, z: theta transpose * X = theta0 * 1 + theta1 * X1. So z = theta0 + theta1 * X1 is just the written-out form, while z = theta transpose * X is the general form, which also covers the case of multiple features (more than one X column).
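A small sketch of that equivalence with NumPy. The numbers are arbitrary, chosen only so the arithmetic is easy to follow; the leading 1 in x is what lets theta0 act as the intercept.

```python
import numpy as np

# Arbitrary illustrative values, not taken from the video.
theta = np.array([0.5, 2.0])     # [theta0, theta1]
x1 = 3.0
x = np.array([1.0, x1])          # [1, X1]; the 1 multiplies theta0

z_expanded = theta[0] + theta[1] * x1   # theta0 + theta1 * X1
z_vectorised = theta @ x                # theta transpose * X

print(z_expanded, z_vectorised)

# The same expression works unchanged with more features.
theta_multi = np.array([0.5, 2.0, -1.0, 0.25])
x_multi = np.array([1.0, 3.0, 2.0, 4.0])
z_multi = theta_multi @ x_multi         # 0.5 + 6.0 - 2.0 + 1.0
```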
I didn't understand anything, sir
Sorry sir, but everything went over my head
Well said
Thanks, I saw this comment before watching the whole video
@@supriyasaxena5053 I'm glad I saved you time
Thanks 👍