Best explanation on the internet. Thank you.
Thank you! I am glad you enjoyed it, feel free to check out my other videos and subscribe to the channel
This is so far the best video I've watched on youtube that explains this topic
I have been looking for a good explanation in books and other videos but I couldn't understand this topic until I found your video. Thank you! :)
I have searched through so many explanations, but I finally understood it thanks to your video. Thanks!
Damn, this guy is awesome at explaining stuff. Really good work mate!
Thank you for watching! I am glad you found the video useful
Just enough math to capture the intuition behind the algorithm. You got yourself a sub sir.
Thanks for watching and subscribing! Glad you liked the video! Let me know if you have any topics you would like to see videos on in the future
I'm wondering how the hell this is filmed. Some mirror magic or does he actually write backwards?
It is filmed using Engineering Magic!
He did write backwards... that's an engineering concept: if adding doesn't solve the problem, then subtract it :)
Video camera placed behind the board, video footage flipped horizontally while editing.
@@pratikd5882 nah, that makes sense...
@@pratikd5882 Nah, I think this guy is obviously bored, so he did the entire video writing backwards
When we talk about linear regression we normally have to satisfy a few assumptions (linearity, normality of the errors, homoskedasticity, etc.). Do these conditions also have to hold in order to perform a logistic regression? You have, after all, a kind of linear regression inside the sigmoid function, right?
The logistic model assumes a linear relationship between the predictor variables and the log-odds of the event. Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression, but it makes two key different assumptions. First, the conditional distribution y | x is a Bernoulli distribution rather than a Gaussian distribution. Second, the predicted values are probabilities and are therefore restricted to (0, 1).
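To make that concrete, the linearity assumption is on the log-odds, not on the probability itself. With p = P(y = 1 | x), the model says
log( p / (1 - p) ) = theta_0 + theta_1*x_1 + ... + theta_n*x_n = theta^T x_bar
and solving for p gives back the sigmoid, p = 1 / (1 + exp(-theta^T x_bar)), which is why the predictions are restricted to (0, 1).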
You forgot to write the (1 - y_i) * log(1 - sigma(theta...)) term right after taking the log :)
Edit: never mind, you fixed it right after
Thanks for watching! And paying attention 🙂
I need more videos man!!
Awesome work. Please post more videos or helpful links that explain as well as you do!
Thank you for the kind words! I have been a little busy and not posting much lately, I am sorry about that! Let me know what topics you would like videos on! Thank you for watching
Why is the probability multiplied in the example, but written as a power in the actual equation? (Time: 4:52 of the video)
Praneeth, that is the equation for a Bernoulli distribution, which takes the value 1 with probability p and the value 0 with probability 1 - p. This type of distribution is a natural fit for a binary problem like the one we are solving in the video.
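Written as a single formula, the Bernoulli probability mass function is
P(Y = y) = p^y * (1 - p)^(1 - y), for y in {0, 1}
Plugging in y = 1 gives p^1 * (1 - p)^0 = p, and plugging in y = 0 gives p^0 * (1 - p)^1 = 1 - p. So the power form in the equation is just a compact way of writing the two separate cases that are multiplied out in the example.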
Just a fabulous explanation. There are tons of videos on logistic regression, but most of them gloss over the mathematical details.
Thank you for watching! Glad you found it clear and useful. Feel free to share and subscribe to the channel
Great video dude and very neatly explained. Please keep up the great!
*great work!
Thank you for watching!
Best video on YouTube on this topic!
Thanks for the video.
Please, how can one optimize the coefficients of a Logistic Regression Model using a Genetic Algorithm?
Thank you! This video helped me a lot to understand what MLE does in logistic regression.
Thanks a ton!! You explained it so well!! Please keep making videos on ML math topics.
This is the best explanation of the math behind logistic regression I have ever seen!
Thanks for watching! Glad you found it useful
Please come out with new lessons! Very clear and cool!
Thank you, I am glad you enjoy the videos
Thank you a lot!
Could you explain kernel methods?
Thank you for watching Fadi. I have more videos coming, I will try to do one on Kernel methods
That was really helpful. Thank you very much
Endless Engineering, fantastic work. You explain every bit; really appreciated.
Thank you Abdulkareem. I am glad you found the video useful. Thanks for watching!
Well done, a wonderful explanation. I remembered you from when you explained aircraft structures.
This is the best explanation ever. Thx!
Thank you! I am glad you found this video clear and useful. Please let me know if there are other topics you would like to see videos on
Well done.
Thank you Anil, glad you enjoyed the video.
Amazing explanation....just wow
Thank you Shrish, glad you enjoyed the video. Please feel free to like the video and subscribe to the channel; thank you for the support.
The explanation is very good. I still have some doubts regarding logistic regression
Gradient descent/ascent is used to minimize the error, right? In linear regression what we did was choose random values for b0 and b1, then check our error (MSE) using that slope and intercept. We iterated again and again until we got a slope and intercept (b0 and b1) where the error was at its minimum, i.e. the global minimum.
1) Why are we using gradient ascent here?
2) In linear regression we chose random values for b0 and b1, so what will those random values be here in logistic regression?
1) We are using gradient ascent here to maximize the log likelihood function that we derived. You could use gradient descent if you multiply the log likelihood by -1; minimizing -1 * (log likelihood) is the same as maximizing the log likelihood.
2) The randomly initialized values here are the model parameters in the hypothesis, i.e. the vector theta mentioned in the video. In the example I am showing, the hypothesis inside the sigmoid is the same linear model as in linear regression; the number of parameters depends on the length of the input vector x.
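If it helps, here is a rough sketch of gradient ascent on the log likelihood in Python (just an illustration under the same model assumptions, not the exact code behind the video; X, y, rate and iters are made-up names):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, rate=0.1, iters=1000):
    # X: (m, n) inputs, y: (m,) labels in {0, 1}
    X_bar = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a 1 so theta[0] is the intercept
    theta = np.zeros(X_bar.shape[1])                   # initial guess (could also be random)
    for _ in range(iters):
        p = sigmoid(X_bar @ theta)                     # predicted probabilities
        grad = X_bar.T @ (y - p)                       # gradient of the log likelihood
        theta = theta + rate * grad                    # ascent step: move up the gradient
    return theta

Flipping the sign, i.e. descending on -1 * (log likelihood), gives the gradient descent version mentioned in 1) and converges to the same theta.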
Thanks, bald Jack Black. It was very helpful.
subscribed because this was so good :D
Thank you! I am glad you like the video and subscribed, please feel free to provide any other feedback on topics you would like to see covered. Endless Engineering is committed to provide our followers with content they want!
It was explained simply and was easy to comprehend. Thanks!
Hi Salman, I am glad you enjoyed the video. Thank you for watching
Nice explanation sir
What if there is an intercept?
Do we have to estimate multiple parameters simultaneously?
Hi Kailas, thank you for watching!
That is a great question, and the answer is YES! In fact, the formulation I show here does estimate an intercept; it is included in the notation x_bar (a bar on top of the vector x). x_bar is the vector x with one element of value 1 prepended, like this --> x_bar = [1, x0, x1, ..., xn]. So the parameter that multiplies the element of value 1 is the intercept.
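For illustration (a made-up numeric example in Python, not from the video), building x_bar looks like:

import numpy as np
x = np.array([2.0, -1.5, 0.7])        # original features x0, x1, x2
x_bar = np.concatenate(([1.0], x))    # x_bar = [1, x0, x1, x2]
# theta[0] multiplies that leading 1, so it plays the role of the intercept

So the intercept is estimated together with the other parameters; there is no separate step for it.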
@@EndlessEngineering understood!!
Thank you sir for replying.
When I was trying to understand the entire mechanism I came across various methods like
gradient descent and Newton-Raphson.
In gradient descent:
x_new = x_initial - rate * slope
Newton-Raphson:
x_new = x_initial - log likelihood(x_initial) / derivative of it
Am I correct?
Which is better?
Thank you once again for your reply!!🙏🙏🙏
@@user-xt9js1jt6m Newton's method requires computing the second derivative of your cost function (the Hessian) and inverting it. This uses more information and would converge faster than gradient descent (which only requires the first derivative).
However, requiring the Hessian to exist imposes a stricter smoothness condition, and inverting it might be computationally challenging in some cases.
So I would say neither is absolutely better, it just depends on the problem/data you are dealing with.
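For reference, the two update rules for maximizing the log likelihood l(theta) would roughly be
Gradient ascent: theta_new = theta + rate * dl/dtheta
Newton's method: theta_new = theta - H^{-1} * dl/dtheta, where H is the Hessian (matrix of second derivatives) of l(theta)
Note that for maximum likelihood, Newton's method is applied to solving dl/dtheta = 0, so the update uses the gradient and the (inverted) Hessian rather than the log likelihood divided by its derivative.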
@@EndlessEngineering okay!!!
Thank you!!!
I appreciate your sincerity!!
Thank you once again sir!!!
Wow! Nice explanation. Thank you so much.
Glad you liked it! Let me know if there are any other topics you are interested in learning!
Why are you writing the sum symbol and then still doing +...+? I don't think it's supposed to be like that.
Thanks for your question. The term affected by the sum symbol was too long to fit on one page, so I had to break it into multiple lines. It is valid to use the +...+ notation in that case since I am using the same index (i) to show that the other term belongs in the sum as well.
But why is it x_bar? Isn't it x_i?
How can we help? Do you have a Udemy course we can buy? I feel like a mathematician after this video.
first things first, thank you very much for this explanation!! Liked and subscribed immediately!
One thing that is still unclear to me is how at time stamp 12:06 you get the value of x_bar from that derivative. At the beginning x_bar was set to be a vector of the form x_bar = [1, x_0, ..., x_n]. I did the math (i.e. the derivative of the sigmoid function w.r.t. theta), and I do not understand that connection. Any help would be welcome!
Hi Karl. Thanks for watching and subscribing!
That is a very good question, let me see if I can clarify.
The derivative we want to compute is d sigma(theta^T x_bar) / d theta. Using the chain rule, let
a = theta^T x_bar    Eq(1)
Then we can write the derivative as
d sigma(a) / d theta = [d sigma(a) / d a] * [d a / d theta]    Eq(2)
Based on Eq(1) we can write
d a / d theta = d (theta^T x_bar) / d theta = x_bar    Eq(3)
and
d sigma(a) / d a = sigma(a) * (1 - sigma(a)) = sigma(theta^T x_bar) * (1 - sigma(theta^T x_bar))    Eq(4)
Substituting Eq(3) and Eq(4) into Eq(2), and substituting the value of a from Eq(1) into the original derivative, we get
d sigma(theta^T x_bar) / d theta = sigma(theta^T x_bar) * (1 - sigma(theta^T x_bar)) * x_bar
I hope that clears it up
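If you want to sanity-check the algebra numerically, here is a small illustrative Python snippet (made-up values, not from the video) comparing the analytic gradient with finite differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.3])
x_bar = np.array([1.0, 2.0, -0.5])   # [1, x0, x1]

a = theta @ x_bar
analytic = sigmoid(a) * (1 - sigmoid(a)) * x_bar   # Eq(2) with Eq(3) and Eq(4) substituted

eps = 1e-6
numeric = np.array([
    (sigmoid((theta + eps * e) @ x_bar) - sigmoid((theta - eps * e) @ x_bar)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric))   # should print True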
@@EndlessEngineering Very helpful explanation thank you!
Very good explanation!
Thank you Keven. I am glad you found it helpful and enjoyed the video!
Thank you indeed!
Thank you for watching! Glad you found the video useful
Thank you, sir! Great explanation!
You are welcome! Glad you enjoyed the video
Why does the theta we want to find need to be the one that maximizes the likelihood function?
Normally we want to maximize the likelihood (consequently the log likelihood). This is the reason why we call this method maximum likelihood estimation. We want to determine the parameters in such a way that the likelihood (or log likelihood) is maximized.
When we think about a loss function, we want something that is bounded below by 0 and unbounded above. Our goal is to minimize the cost function. Hence, we take the negative of the log likelihood and use it as our cost function.
It is important to note that this is just a convention. You could also take the log likelihood and maximize it, but in this case, we would not be able to interpret it as a cost function.
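Concretely, with h_i = sigma(theta^T x_bar_i), the negative log likelihood used as a cost function is
J(theta) = - sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]
which is the cross-entropy loss: it approaches 0 as every prediction approaches its correct label and grows without bound as predictions approach the wrong extreme, which is exactly the "bounded below by 0, unbounded above" behaviour described above.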
What does he mean by x_bar?
Thank you very much, it really helps.
Hello Tayyeb, I am glad you find this helpful. Let me know if there are any other topics you are interested in learning!
Awesome!
Glad you liked the video Gabriel! Thanks for watching
hero!
What is x_bar at 13:18?
It is the input vector with a 1 added as the first element, see 1:47
art
Thanks for watching. Glad you enjoyed it