And thanks a lot for replying to the comments. I'm trying to calculate the coefficients manually, without any software; my question is about hours studied by students to pass an exam. My problem is how to calculate the coefficients by hand the way we do in linear regression, simply by using formulas.
The coefficients are not calculated like linear regression. Linear Regression has an analytical solution - there's a formula and you can plug numbers into it and get a solution. Logistic Regression, however, does not have an analytical solution. It's solved using an iterative algorithm, like Gradient Descent.
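To illustrate the contrast described in the reply above, here is a minimal sketch (my own example with made-up pass/fail data, not from the video): linear regression gets its coefficients from the normal equation in one step, while logistic regression has to improve a guess iteratively.

```python
import numpy as np

# Hypothetical data: x = hours studied, y = 1 if the student passed, 0 otherwise.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1])

# Linear regression: analytical solution (the normal equation).
X = np.column_stack([np.ones_like(x), x])
intercept_slope_linear = np.linalg.solve(X.T @ X, X.T @ y)

# Logistic regression: no closed form, so iterate. Plain gradient ascent on the
# log-likelihood is shown here; real software uses fancier iterative algorithms.
b0, b1 = 0.0, 0.0                            # starting guess for the candidate line
lr = 0.05
for _ in range(10_000):
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))     # predicted probabilities
    b0 += lr * np.sum(y - p)                 # d(log-likelihood)/d(intercept)
    b1 += lr * np.sum((y - p) * x)           # d(log-likelihood)/d(slope)

print(intercept_slope_linear, (b0, b1))
```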
I have a question (I don't really know about regression). At 7:56, when he calculates log(likelihood) = Sum[log(p)] + Sum[log(1-p)], why is he not using the formula log(likelihood) = Sum[Y*log(p)] + Sum[(1-Y)*log(1-p)]? Where are the (Y) and (1-Y)?
Remember, Y represents what is known about each mouse and is either 0 (if the mouse is not obese) or 1 (if the mouse is obese). So, when using your formula, Sum[Y*log(p)] + Sum[(1-Y)*log(1-p)], if we expand the summation, we get: [Y1*log(p1) + (1-Y1)*log(1-p1)] + [Y2*log(p2) + (1-Y2)*log(1-p2)] + etc. So, now imagine Y1 = 0, because mouse #1 is not obese, and Y2 = 1, because mouse #2 is obese. That gives us... [0*log(p1) + (1-0)*log(1-p1)] + [1*log(p2) + (1-1)*log(1-p2)] = [0 + (1)*log(1-p1)] + [1*log(p2) + (0)*log(1-p2)] = log(1-p1) + log(p2) ...and that is the formula that I'm using.
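A quick numerical check of the equivalence described in the reply above (a sketch with hypothetical probabilities and labels, not data from the video):

```python
import numpy as np

# Hypothetical predicted probabilities of "obese" and the known labels (1 = obese).
p = np.array([0.9, 0.7, 0.6, 0.4, 0.2])
y = np.array([1,   1,   0,   1,   0])

# General form: Sum[Y*log(p)] + Sum[(1-Y)*log(1-p)]
ll_general = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Video's form: log(p) for the obese mice plus log(1-p) for the not-obese mice.
ll_video = np.sum(np.log(p[y == 1])) + np.sum(np.log(1 - p[y == 0]))

print(ll_general, ll_video)  # identical, because each Y is either 0 or 1
```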
Hi StatQuest! I would like to ask about 8:21, where we need to rotate the line to obtain the best sigmoid function. Rotating a line needs a pivot, so how do we find the optimal pivot? I found that using the mid-point of the two clusters (in this case, finding the mean of the not-obese and obese mice separately, and then finding the mid-point of those means) could be problematic, because the mean can be affected by outliers. But using the mid-point of the medians could be a solution. Thank you.
The log-likelihood part around 6:00 is really confusing: why are the likelihoods here the same as the probabilities? Would you help explain more? Please!!! I saw a lot of people in the comments are confused too.....
Likelihoods are the y-axis coordinates created by statistical distributions. Always. However, depending on the distribution, probabilities are either the y-axis coordinates or the area under the curve. When the distribution is for continuous outcomes (like height), probabilities are calculated as "the area under the curve" (as illustrated in this video: ruclips.net/video/pYxNSUDSFH4/видео.html ). In contrast, in this case the distribution is for discrete outcomes (obese or not obese) and with discrete outcomes, the y-axis is also equal to the probabilities.
@3:20, is the candidate line there for the projection randomly initialized? If not, where could we get the 'candidate line' on the log-odd graph during the 1st iteration of the maximum likelihood estimate?
What I don't understand is: why do you transform the y-axis to the log(odds)? Why can't we just use maximum likelihood on the S-curve? Why do we need Beta_0 and Beta_1?
It's way, way easier to optimize something linear (which is what we do in the log(odds) space) than non-linear (which is what we have in the probability space). Non-linear functions like this sigmoid shape are sometimes impossible to fit because there are too many options for where and how to bend the line.
@@statquest First of all, thank you for your response. So, does this mean that by replacing the parameter \( z \) with the function of a linear equation, we are essentially optimizing the linear equation to fit the data, and because this is a component of logistic regression, we are indirectly optimizing the S-curve as well? Is that how it should be understood?
@@sayyamplay By optimizing the straight line, we optimize the squiggle. So we replace an impossible problem with something we can solve and in the process solve both.
@@statquest wow, thanks for the great explanation. That was the missing piece. Btw, I‘m currently reading Essential Math for Data Science, they mentioned your videos multiple times there. Living legend
First, the line at 3:45 isn't a best fitting line, it's just a candidate "best fitting" line (as stated at 2:44 ). The projection is done by plugging the x-axis coordinates for each data point into the equation for the line to determine the y-axis coordinates. (for example, if the equation for the candidate line is y = 2x - 5, and if the x-axis coordinate for the first point was 0.5, then y = 2*0.5 - 5 = -4.)
Could anyone help me understand where the sigmoid function comes into the picture here? Also, can anyone help me understand where we are using the log loss function to optimize things?
The sigmoid function is used to convert log(odds) to probability. If we flip the sign of the log(likelihood) (which we want to maximize) we get the log(loss) (which we want to minimize).
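A minimal sketch of those two pieces (my own illustration with a made-up candidate line and data, not from the video): the sigmoid turns the straight line's log(odds) into probabilities, and the log loss is just -1 times the log-likelihood.

```python
import numpy as np

def sigmoid(log_odds):
    # Converts a log(odds) value into a probability between 0 and 1.
    return 1 / (1 + np.exp(-log_odds))

def log_loss(y, p):
    # Negative log-likelihood: maximizing the likelihood = minimizing this.
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical candidate line (intercept -5, slope 2) applied to weights.
weights = np.array([1.0, 2.0, 3.0, 4.0])
obese   = np.array([0, 0, 1, 1])
p = sigmoid(-5.0 + 2.0 * weights)

print(p, log_loss(obese, p))
```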
Thanks Josh for these great videos. I'm wondering if it's possible to optimize the coefficients of a Logistic Model using a Genetic Algorithm. If yes, please can you demonstrate how? Many thanks.
The goal is to find the squiggle with the maximum likelihood, and that is why this video is titled "Maximum Likelihood". However, at 7:41, I say that the squiggle that maximizes the likelihood is the same one that maximizes the log-likelihood, so we maximize the log-likelihood instead (because adding logs of the likelihoods on a computer is easier than multiplying the likelihoods, due to something called "underflow problems").
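The underflow problem mentioned above is easy to demonstrate (a sketch with made-up likelihoods): multiplying many probabilities rounds to zero in floating point, while summing their logs does not.

```python
import numpy as np

# 10,000 hypothetical individual likelihoods, each equal to 0.5.
likelihoods = np.full(10_000, 0.5)

print(np.prod(likelihoods))         # 0.0 -- the raw product underflows to zero
print(np.sum(np.log(likelihoods)))  # about -6931.5 -- the log-likelihood is fine
```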
NOTE: In statistics, machine learning and most programming languages, the default base for the log() function is 'e'. In other words, when I write, "log()", I mean "natural log()", or "ln()". Thus, the log to the base 'e' of 2.717 = 1.
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
So why don't you write e^log(odds) as odds directly. Why do you need to keep the exponent as it is? P=odds/(1+odds)
@@rkbshiva Because often times we have a value, x, that is equal to log(odds). So, essentially, we have e^x, and the equality is not as obvious.
@@statquest Thanks for the explanation Josh...I also want to thank you for the immense contribution you are making by publishing such quality educational content for free!
Had 1 doubt
Is maximum-likelihood different from the cost function where we use gradient descent?
@@anurodhchoudhary1689 You can probably multiply the likelihood by -1 and end up with a cost function that you can minimize with gradient descent.
Dear Josh, I want you to know that there are many of us who are so thankful to have you.
Thank you very much! :)
Count me in
Count me in
In the past one year of my MBA, I have been taught concepts of logistic regression by Math PhDs and Analytics Gurus, but no one could beat the simplicity and elegance of your explanation.
BAM! :)
This is where our tuition fees should have gone to.
:)
Exactly
The simplicity of your work is what is truly needed in education. Too often are professors and tutors trying to teach complex stuff in a complex way to make themselves seem smarter. Too many ignore the fact that learning complex things in a simplified manner is much more beneficial. Your work does this perfectly.
Thank you!
Easily the most intuitive and detailed explanation of logistic regression + max likelihood on the web. period.
Thank you very much! :)
Yes
@@arundas7760 Thanks! :)
This explanation is so good that I feel kinda guilty for having access to it. There is no doubt that if people in the past had access to this, then their lives would've been a lot easier. I feel like a spoiled brat. This explanation is too good for this world.
Thank you very much! :)
@@statquest I too feel guilty for having access to such great learning at no cost. Once I get job as ML engineer will donate 1000 dollars. Till then I will never skip your any ads does that help?
@@raj-nz4bj That's awesome and thank you for your support!!!
@Rakesh Shaw DS The algorithm mentioned at 9:01 probably refers to "Gradient Descent", you can select some initial values for the intercept and the slope for the candidate line and let Gradient Descent find the line with the best fit.
Thank you Josh, now we understand, during classroom days classmates don't do better because they are necessarily smarter but because they get exceptional teachers like you. Thank you for democratising statistics and machine learning and bridging the gap, more power to you.
Thank you!
I teach graduate level operations research courses which require some understanding of probability and statistics. My students really had limited understanding of some fundamental knowledge even if they took some related courses before. I found your videos organized and concise. I will recommend your channel to my future students. Thank you, Josh!
Awesome! Thanks!
Dear Josh, your videos are amazing and I would have never passed my qualification exam without them. Also, this is the first time I actually understood how MLE works. Thank you so much!
Hooray and congratulations!
Hi Josh.. I hope you know how you are changing people's lives directly.. a lot of people are earning very good salaries because of your quality content.. if one person gets a job, his entire family benefits because of him.. so, you are not just helping one single person but thousands of families.. May the almighty bless you and keep you happy, healthy and wise..
Wow! Thank you very much! It's great to hear! BAM! :)
Hi Josh, you are single-handedly carrying me through my Masters program with your videos. I was seriously considering dropping out earlier this year cause I was having a lot of difficulty understanding anything, but a lot of things are starting to click now thanks to you. Your vids are a godsend to students everywhere.
I'm glad my videos are helpful! :)
@@statquest Josh, how many times should we do the trial-and-error process in order to fit the squiggle to the data? For linear regression there is a direct formula, using matrices, to find the slope and the intercept. Is there any formula to determine the curve? ...we can't perform the operation that many times.
@@GireeshAbburi The squiggle can be optimized using Gradient Descent: ruclips.net/video/sDv4f4s2SB8/видео.html
@@statquest Could you please explain in detail what gradient descent is?
@@GireeshAbburi I explain how gradient descent works here: ruclips.net/video/sDv4f4s2SB8/видео.html
this is called GENIUS AT TEACHING !!
The best explanation ever on logistic regression which includes every details. Thanks a lot Josh, i adore you so much!
Thank you very much! :)
This is a true blessing for data science students 🎉 love u 3000 😊
Thank you!
So far, the best explanation of maximum likelihood estimation on RUclips. Log odds of me being better at math would be significantly high had you been my math teacher at highschool. Thanks Josh.
Thanks!
Wowwww!!! This is the most useful video on this topic! I was really struggling to understand this concept in my class but this video explained the concept so well! I'm sooo grateful for your existence!
Thank you!
This dude is a saint. These videos really condense the ideas to some easy to follow steps.
Thanks! :)
Hey Josh, I can only echo everybody's thankful words here. I couldn't be more grateful. You make me think that statistics is very daunting most of the time for lack of experts like you who can really explain its details in a very simple way. For me, this is a true sign of mastery! I had one question regarding this video though, based on your other video where you explain the difference between probability (P(data | mean, sd)) and likelihood (L(mean, sd | data)), where in the former you find the distribution of the data under a fixed hypothesis and in the latter you find the best distribution that fits the data.
Somehow, in this video about logistic regression, I feel that you talk about the likelihood in terms of the probability. As in: "the likelihood that this mouse is obese [given the shape of the squiggle] is the same as the predicted probability." So here the likelihood of the data point (the mouse) is based on a fixed distribution? Could you explain that if you have time? :) Again, really appreciate all that you do!!!
In the probability vs likelihood video I'm specifically talking about the difference between probability and likelihood with respect to continuous probability distributions. In this case, the s-shaped squiggle is not a probability distribution (you can tell because the area under the squiggle, from 0 to positive infinity, is > 1). However, like in the probability vs likelihood video, likelihood still refers to the y-axis. The big difference now, however, is that so does probability. Thus, in this case (because the s-shaped squiggle is not a continuous probability distribution), both likelihood and probability are the same. That said, we choose to call it "likelihood" when we are fitting the squiggle to the data, to be consistent with "maximum likelihood".
@@statquest Thanks very much for your answer, Josh! I understand what you mean that the squiggle is not a probability distribution. The overall goal of Maximum Likelihood Estimation is also pretty clear: to find the squiggle that best fits the data. What remains unclear to me (and which was pretty clear in your likelihood vs probability video) is how I should interpret MLE in formal terms: is it the highest probability (likelihood) achieved by the data given the squiggle (a case of p(y|Hi)), or the highest probability given the data, p(Hi|y)? From your video and explanation I think it is the former?
@@bernardoleivas8296 Given that the data is fixed, and we're wiggling the squiggle to fit it best, we want to maximize the likelihood by changing the parameters that determine the shape of the squiggle (the slope and intercept in log(odds) space) given the data.
@@statquest okay! Again, Thank you very much for the amazing content =)
This is the first time I clearly understood Maximum Likelihood and Logistic Regression. Thanks for your videos.
Awesome! :)
I must say you are a magician. You have the tricks to communicate and deliver just what people want. We want more like you. Thank you so much in shaping the world.
Thank you! :)
Hello Josh,
first of all I wanted to thank you for these enormously helpful, world class educational videos. They are my lifeboat at the moment in my "Data Literacy" class.
One thing I noticed was: @3:35 it should say -1.4. This is not irrelevant, since the "fancy looking formula" produces values closer to 0 for negative values.
Again, thank you so much for what you did here. I would probably have to go back to copying slides by hand, which omit all the details and assume way too much
previous knowledge.
Best
Markus
Thanks!
Thank you. This is a genius way to explain the concept to a dummy like me. The rotating graph on 08:55 is icing on the cake.
Bam!
Finding this channel was a pure blessing, especially now that my econometrics classes are held online due to the pandemic and it's even harder to understand the material. Thank you so much for providing free content with such high academic value (and with very lovely jingles too)!
Thanks and good luck with your econometrics course.
@@statquest I really loved all your videos, but when I was trying to apply hyperparameter tuning for logistic regression, I was unable to understand what (penalty, C) is. I searched and got to know that the penalty is the cost function, but in all your videos on logistic regression you didn't mention a cost function. Can you please help me out?
@@nitishkumar-bk8kd The "cost" function for logistic regression is the negative log-likelihood (it is simply -1 times the log-likelihood described in this video. We can either maximize the likelihood with the log-likelihood, or, if we multiply the log-likelihood by -1, we can minimize the "cost").
However, what language are you using? Python? In python 'C' is the inverse to the regularization strength. If you want to learn about regularization, see: ruclips.net/video/Q81RR3yKn30/видео.html
@@statquest Thanks for your reply professor josh, I am happy u replied to me :) but still, I have a doubt :(
you are telling me that the cost function is -1*log(likelihood) of the data right? if so why didn't you multiply the log(likelihood) of the data with -1 to find the best fit line for the data?
And in the hyperparameters, the solver parameter has options like 'lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'.
Each solver tries to find the parameter weights that minimize a cost function. How are these related to the likelihood?
@@nitishkumar-bk8kd Logistic Regression has traditionally been solved by maximum likelihood, which is why my video describes that approach. It is only recently that we have general purpose code that minimizes cost functions and includes regularization. Thus, when people talk about the theory of optimizing logistic regression, which is what I do in this video, they talk about maximizing the log-likelihood. In practice, things are always a little different, especially if you are using one of the newer methods that include regularization. Unfortunately, the questions you are asking are really more appropriate for a "how to do logistic regression in Python" video (NOTE: I already have a "how to do logistic regression in R" video: ruclips.net/video/C4N3_XJJ-jU/видео.html ) and I'll keep that topic in mind for a future video.
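To make the C / penalty discussion above concrete, here is a small scikit-learn sketch (my own hypothetical data; defaults can vary between library versions): C is the inverse of the regularization strength, so a small C means a strong penalty, while a large C behaves more like plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one column of weights, labels where 1 = obese.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# Small C = strong L2 penalty, large C = weak penalty (closer to pure MLE).
strong_penalty = LogisticRegression(penalty="l2", C=0.01, solver="lbfgs").fit(X, y)
weak_penalty   = LogisticRegression(penalty="l2", C=100.0, solver="lbfgs").fit(X, y)

print(strong_penalty.coef_, weak_penalty.coef_)  # the weakly penalized slope is larger
```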
You're straight up G, Josh ! Keep up the great work, I hope you sell loads of your songs and really grow Statquest to the next level !
Thank you so much! :)
Dear Josh, you are an awesome teacher!! This lecture really cleared all of my doubts!!
Thanks!
This was so helpful! You explained the likelihood really intuitively (calculate likelihood of this data given shape of the squiggle) and it makes a lot of sense now. thank you!!!
Hooray! Thank you very much! :)
Words cannot express how grateful I am for this amazing explanation.
Wow, thank you!
Thank you for this. You make our life easier, especially for data analytics students like me. Your explanations are so great that is s easy to understand. Such a talent.
Thank you very much! :)
The minus sign may have been left out:
3:32 Should be (-)2.1
3:37 Should be (-)1.4
Thank you for producing this lesson!
That is correct. I left out a few minus signs. :(
I will be using these videos for the rest of my life. Thank you Josh
Bam! :)
I wish I could see the backend efforts you put to make concepts so easy I understood all concepts god bless you bro
Thank you! Yes, it takes a lot of time and work to make a video. I spend a lot of time researching a subject and then a lot of time trying to find a new way to present the information.
Best explanation for this concept in the internet, most just say: "a bunch of iterative stuff that statistical software do" -_- ! Thanks again sir!
Hooray!!! You're welcome. :)
This is my new favourite channel for ML really love the explanations and the speed at which you talk is easy to follow. I am a person who learns by examples so it would be great if you added more examples. Thanks for the great content.
Thank you! :)
You need to learn Binomial distribution, Bernoulli trial, likelihood, MLE, loss function, Gradient Descent, odds, log(odds), Logit function, sigmoid function, decision boundary, and expected values. And of course the mathematical intuitions too. This is overwhelming for a beginner like me. But Josh your part 1 and 2 of Logistic regression solves many problems for me. Thank you and a lot more to learn from you. BAM!!!
Dear josh, I have a question too. Is this Maximum likelihood equivalent to loss function?
That's the idea. However, to make it "loss" and something we want to minimize, we multiply the likelihood by -1.
This was the best explanation on youtube. I wish I found this before. You’re awesome!
Thanks!!! I'm glad you like the video! :)
I am a student of data science. When I saw this logistic algorithm calculation, I was scared and wondered whether I could survive in this field or not. But after seeing your content on this algorithm, I'm going to play with it. Thank you so much, sir, for this valuable content.
Thanks!
Thanks so much, stats doesn't seem so hard any more with your videos. You truly have a talent for teaching :)
Thank you!!! :)
One of the best videos on Logistic Regression.... Awesome...
In your words "Tripple BAM"... 😊
Glad you liked it!
You are an awesome teacher!!! Thank you for the visualizations during your teaching, it helps to learn the concept so well!!
Glad it was helpful!
My first thought when learning about this was why couldn't we use least squares on the log odds thing graph, then straight away it's explained with the pos/neg infinity thing. Nice.
bam
@ 10:00 the fact that we both said cool at the same time was pretty cool
Double Cool!!! :)
When you use the training data to derive the coefficients, how do you know the log(odds) from the binary response in the training data set? At 3:31, how do you get that the log(odds) for the first response is -2.1? How do you derive that candidate line? To get the log(odds) we have to know p(y=1), right? When you train the model, you don't actually know how to convert the binary (0,1) response to p(y=1). I am wondering: is the candidate line used to project the data from binary to log(odds) selected at random, and then we keep improving it?
The candidate line is just a randomly selected line - it's a starting point that is then improved on by using maximum likelihood. So you start with a guess, calculate the likelihood of that guess, and then move the line - change the slope and change the intercept - and see if that gives you a higher or lower likelihood. If the likelihood is lower, then you move the line in the other direction. If it's higher, you keep moving in that same direction. Does that make sense?
@@statquest Yes! Thanks very much
Thanks for this question and answer, Zhihao & Josh
@@statquest BAM!!
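The "start with a guess, wiggle the line, keep whatever improves the likelihood" idea from the reply above can be sketched as a brute-force search (a toy example with made-up weights and labels; real software uses gradient-based methods instead of a grid):

```python
import numpy as np

# Hypothetical training data: weight vs. obese (1) or not obese (0).
weight = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
obese  = np.array([0,   0,   1,   0,   1,   1])

def log_likelihood(intercept, slope):
    p = 1 / (1 + np.exp(-(intercept + slope * weight)))  # squiggle values
    p = np.clip(p, 1e-12, 1 - 1e-12)                      # avoid log(0)
    return np.sum(obese * np.log(p) + (1 - obese) * np.log(1 - p))

# Try a grid of candidate lines and keep the one with the highest log-likelihood.
best = max(((b0, b1) for b0 in np.linspace(-10, 10, 101)
                     for b1 in np.linspace(-5, 5, 101)),
           key=lambda line: log_likelihood(*line))
print(best, log_likelihood(*best))
```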
Hey Josh,
I really like your explanations but the logistic regression is where i got stuck real bad. But it's clear now.
The problem is that you never linked the log(odds) and the line equation. I saw on the internet that there's a link function between them, and that's how we can use the line equation in place of the log(odds). If you had included that explanation as well, I wouldn't have had to depend on other explanations. But anyway, thank you, I am always a fan of your explanations. Keep up the good work!!!!
BAAAMMM!!!!!
well, this is the second channel i want to watch alllll of the videos.. thank you for your enthusiasm and good illustration of concepts🥰
Thank you! :)
Thanks Josh!! your channel is the best thing happen in my quarantine period :) love from India!
Hooray! :)
Brilliant!! From the very depth...explanation is splendid..this is all i wanted! Thank you so much
Hi Josh, I have a question. At 2:45, how do we plot our actual data at infinity and negative infinity? If I understand correctly, the x-axis is p, so there are some points plotted that would be beyond infinity, since the corresponding p value on the x-axis is > 1, and at p=1 the log(odds) is infinity.
The x-axis is the variable we measured, so in this case the x-axis represents weight.
@statquest Oh now it makes sense, Thanks josh, appreciate your support🙏
At 3:24 you said that we project the original data onto the candidate line, and you got values of 2.1 and so on. Is the assumption that these points fall in those regions on the line?
I'm not sure I understand your question. By "projecting onto the line", we are simply plugging the x-axis coordinates for each data point into the equation for the link to get the y-axis coordinates. The combination of the x- and y-axis coordinates is the "projection".
Thanks
TRIPLE BAM! Thank you for supporting StatQuest!
I just discover your channel and it is one of the way to learn stats! Thank you so much!
Thanks! :)
These videos are amazing!! Congratulations on this fine job!
Thank you! :)
Thank you!!! Thank you!!! Thank you!! for your dedication on the videos, you are helping a lot!!
You are so welcome!
Excuse me if this was explained in the video(I guess it went over my head), but I have one question:
At 3:51 we already have a squiggle with data points projected onto it. Then we transform it with log(p/(1-p)) and then we transform it back to the squiggle. I wonder why is this done, couldn't we automatically get the probabilities by projecting the data onto the squiggle?
Sometimes I think I understand why it's needed to transform the data but it just doesn't click. Could you please elaborate?
We start with a random straight line in the log(odds) space (on the right side of the screen). However, in order to evaluate that line, we need to calculate residuals - and we can't do that in the log(odds) space since the training (known) data is at -infinity and +infinity. So we transform it into probability space (on the left side) so we can calculate the residuals and evaluate the straight line on the right. We then rotate the straight line, transform and evaluate to see if the rotated line is improving its fit, etc.
I am 15 and your videos are really helpful to me❤
I'm so glad my videos are helpful! BAM! :)
This is a brilliant video - an in-depth discussion of how to fit a logistic regression line for non-mathematicians. Do consider creating a lesson giving more details of the conversion between the squiggle and the log-likelihood graph. Bet it will be awesome! BAM :) PS - purchased all your lessons :)
Awesome!!! Thank you very much!!! Have you seen the other videos in this series, they also describe the conversion between the squiggle and the log-likelihood. See: ruclips.net/p/PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
Coming from lecture from MIT on generalized linear model which was difficult to get a grasp, this series on logistic regression helps me understand GLM better. Thank you! I also wonder if you would make a video that deeper explains multinomial logistic regression.
I'll keep that in mind.
BAM !!!! I just found a gem of a teacher.
Hooray!
Hey Josh, I have a question on the process of iterating the fitting of the candidates line. We start off with a random line, this line is then put through the process of max (log) likelihood. After this is done, we simply change the line and the slope around until we find the best likelihood fit. I have 2 questions.
1. When do we know if the likelihood is right? you show one that is log(of-data-likelihood) = -3.77. I am confused on when I will be able to say, "yes this line has the best likelihood, done."
2. Is there a method to iterating through the possible 'best' fitted line candidates? You mention some algorithms for choosing. I think since I am not sure which likelihood would be optimal, I can not visualize what algorithm would test the next line.
Your videos are fantastic and I am learning more from you than I ever did in school. Really appreciate your teaching methods.
Logistic Regression uses an iterative method called Gradient Descent to find the best fitting line. For more details on how this works, see: ruclips.net/video/sDv4f4s2SB8/видео.html
@@statquest You are an absolute hero. Thank you for your response.. I have been telling everyone about your videos!
@@Ash_Industries Thank you very much! :)
@@statquest If I may run this past you to ensure I am doing it right (based on what I got from your recommended video).
I would take all the possible y-intercepts of my log(odds) lines, process the corresponding max-likelihood values for each intercept, then calculate the slope of this relationship? Once a slope closest to 0 is found, I have now found the intercept and the slope of the best fitting log(odds) line?
@@Ash_Industries For logistic regression, we take the derivative of the log likelihood function with respect to the y-axis intercept (in log(odds) space) and slope (also in log(odds) space) and then plug those into the gradient descent algorithm and let it go. For details on how to take those derivatives (they are basically the result of applying the chain rule a few times), see: medium.com/analytics-vidhya/logistic-regression-with-gradient-descent-explained-machine-learning-a9a12b38d710
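As a rough sketch of what the reply above describes (my own toy data, not the video's mice): the chain rule gives simple expressions for the derivatives of the negative log-likelihood with respect to the intercept and slope in log(odds) space, and gradient descent just keeps stepping against them.

```python
import numpy as np

weight = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical x-axis values
obese  = np.array([0,   1,   0,   1])     # hypothetical labels

def gradient(b0, b1):
    # Chain-rule result: d(-LL)/db0 = sum(p - y), d(-LL)/db1 = sum((p - y) * x)
    p = 1 / (1 + np.exp(-(b0 + b1 * weight)))
    return np.array([np.sum(p - obese), np.sum((p - obese) * weight)])

# Gradient descent: start with a random line and step downhill on the cost.
line = np.array([0.0, 0.0])   # [intercept, slope] in log(odds) space
step_size = 0.1
for _ in range(5_000):
    line -= step_size * gradient(*line)

print(line)   # the intercept and slope of the best fitting log(odds) line
```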
Thank you very much Sir ! . Very informative ! . You're a great teacher. Quest ON!
Thank you! :)
So, let me get this straight. We're supposed to calculate the likelihood by multiplying the collective set of probabilities (or adding the logs of those probabilities) at each point...and the probability for each point is defined by a (squiggly) line. And the optimal line is the one that produces probabilities at each point that, when multiplied by each other, result in the maximum likelihood. The apparent circularity of this statement is EXTREMELY confusing...but your videos are excellent. Thank you.
You have the main idea. We wiggle the line around a bit, calculate the likelihood (or log-likelihood), then wiggle some more and calculate the likelihood and repeat until we have found the squiggle that gives us the highest likelihood.
Thank you, sir! Its an iterative process, I now see.
Hi Josh, this is wonderful, thanks for your series of presentations. Very informative. Well understood!
Thank you very much! :)
This Logistic Regression series is one of the best means I've found online for understanding the theoretical foundations behind those models in a simple way. In both Linear and Logistic Regression, you reference the idea of iterating and optimizing the line. Do you imply the usage of Gradient Descent? Do you consider solving those problems using partial derivatives less relevant?
I'm glad you like my videos! I use the concept of "iterating and optimizing" more just to convey the concept that some lines fit better than others and that one possible way to find a solution is to step towards the best fitting line. However, in practice, other solutions can be used more efficiently.
3:55 So basically log odds to probability transformation is nothing but substituting sigmoid equation p = 1/(1+e^-z) with z = log(odds). Since sigmoid converts any real number to a value between 0 and 1, this should make sense, right? And we also know that log odds is nothing but b0 + b1X, so basically we are using the fit line equation.
yep
@@statquest BAM! Mind = Blown 🤯
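For anyone who wants to check the comment above numerically, here is a tiny sketch (an arbitrary log(odds) value, not from the video) showing that the video's "fancy formula" is the sigmoid, and that the logit transform takes you straight back:

```python
import math

log_odds = -1.4                                             # a point on the candidate line
p_fancy   = math.exp(log_odds) / (1 + math.exp(log_odds))   # e^log(odds) / (1 + e^log(odds))
p_sigmoid = 1 / (1 + math.exp(-log_odds))                   # sigmoid form: same value
back      = math.log(p_fancy / (1 - p_fancy))               # logit: back to log(odds)

print(p_fancy, p_sigmoid, back)   # p_fancy == p_sigmoid, and back == -1.4
```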
Great tutorials.
Something I missed: in previous video the probability was used to calculate the Log. Here the Log(odd) is used to calculate back the probabilities. I guess that 80-85 kg individuals were clinically distributed to obese/not-obese and this ratio provides the probability for this weight range. So, why we need to convert to log, and back, as probabilities are the raw data?
It's relatively easy to fit lines to things, and relatively hard to fit squiggles. So we use the log() space to fit a line to the data and then translate that back to probabilities (and a squiggly line).
Love your lessons so far.. FUN, engaging, and doesn't feel academic.. I'm only on my 3rd video.. Thanks !!
Glad you like them!
Liking the video immediately after the song!!
BAM! :)
You are Stats God!! Thank you 🙇🏻♀
:)
Hi Josh, your explanation is superb, but I have a doubt: at 8:15, is "-3.77" an intercept or a slope? I have gone through the part 1 video also. Please throw the light of your knowledge on this area so that I can get out of the dark. Thank you.
-3.77 is not the intercept or the slope; -3.77 is the log-likelihood of that specific line. In other words, -3.77 is a measure of how well the line fits the data. The goal is to find the line that gives us the maximum likelihood, and we can do this by finding the line that gives us the maximum log-likelihood. Thus, for each line that we try, we calculate its log-likelihood. If that value is larger, then the new line is a better fit. Does that make sense?
Projecting to the candidate line is explained in the coefficients video around 6:00.
It links automatically, so you're all set.
Damn great videos. So the idea of the Maximum Likelihood method is to choose the model that maximizes the product of the likelihoods of getting the right results from your observations.
Yep!
You are an amazing teacher!! Made it crystal clear!! Thank you soo much!! :)
Thank you! :)
Hey there, can you speak a bit to the process of 'projecting the observed data' onto the candidate line? In your video, you do not show what the actual values of x are, which makes it very easy to project the values onto the line, but in reality (when working with real data) this becomes a bit confusing.
I have ticket prices for the Titanic (range of 0.0 - 500.0), and I have classified the data per passenger by whether they survived or not (1 or 0). When I choose a candidate line (y-intercept = -3.5, slope = 0.5) I get a line. I am having trouble with how to determine where the observed x-axis data would land on this line in order to perform maximum log-likelihood on the new y-values.
In your case, ticket price is your x-axis variable (and the y-axis is the log(odds) of survival). So plug ticket price into the equation...
y-axis coordinate = -3.5 + (0.5 * ticket price)
...and you'll get the data projected onto the line.
@@statquest OK, that is what I did! I am so proud of myself lol hahahahha. It just felt like it was too easy, and all the previous videos about why you use log(odds) made me feel like I had missed a step!
@@Ash_Industries bam! :)
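For anyone following along with a similar dataset, a tiny sketch of that projection step (the fares below are made up; only the intercept and slope come from the comment above):

```python
import numpy as np

intercept, slope = -3.5, 0.5                          # the candidate line from above
ticket_price = np.array([7.25, 26.0, 71.3, 250.0])    # made-up example fares

# projecting the observed data onto the candidate line is just plugging
# each x value into the line's equation
log_odds_of_survival = intercept + slope * ticket_price

# and converting those projected log(odds) back into probabilities
p_survival = 1.0 / (1.0 + np.exp(-log_odds_of_survival))
print(p_survival)
```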
Hi Josh. I hope you are doing well. My question is regarding the maximum likelihood (6:41):
When calculating, the probability values are used: the blue dots represent the probability values for obese mice, and in the same way we could get the probability values for not obese mice.
My question is, instead of reading off the probability values for not obese directly, why is the value (1 - probability of obese) used?
The y-axis represents the likelihood and probability that a mouse is obese. In order to calculate the likelihood or probability that a mouse is not obese, we subtract the y-axis value from 1 to flip it.
You are too good. All the best wishes for the good work.
Thank you so much 😀!
I see at 6:25 there is a switch in thinking from Obese to Not Obese points, causing an extra step of doing 1-prob rather than directly using prob previously. How would this operation work when we have more than 2 classes, like 3 instead? I assume we can't calculate MLE from this logistic regression anymore. I know we can still use this framework by doing OVR, but are there other ways of calculating MLE for 3 classes if we're not doing OVR? Not sure if the "1-prob" thing will still be done, and what new graphs we'll be reading from to get the probabilities for calculating MLE.
Multinomial Logistic Regression and Cross Entropy generalize this to > 2 classes.
If this channel doesn't help you with your course, I don't think there is anything else to try, other than bribery. Great videos, thanks!
Thank you! :)
Your videos have cleared up my basics. Thank you for these ☺️☺️
You are welcome!! I'm glad you like the videos and that they help you. I'm really excited about the next video (R-squared and p-values for Logistic Regression). :)
Superb!!!! What an explanation...
Thanks a lot 😊
I am your biggest fan. Thanks for explaining these things in a way that is understandable. Thank you, thank you!!! :-)
Wow, thanks!
StatQuest gives the maximum likelihood...of learning!
bam!
Hi Josh,
Re: 05:36, "The likelihood the mouse is obese, given the shape of the squiggle, is the same as the predicted probability. In this case the probability is not calculated as the area under the curve, but instead is the Y-axis value, and that's why it's the same as the likelihood."
But in other videos, you mentioned that probability is not the same as likelihood. Could you elaborate a little, please? I often use them interchangeably and have been confused. Thanks.
The squiggle is not a continuous statistical distribution. If it were, then probability and likelihood would be different. In this case, however, both are y-axis coordinates.
@@statquest Thanks, Josh.
3:16 How do we choose what the candidate line is, and at 8:20, how do we know how much to rotate the line by?
We can start with any random line, and we can use Gradient Descent to iteratively determine the optimal slope and intercept. For details about how Gradient Descent works, see: ruclips.net/video/sDv4f4s2SB8/видео.html
@@statquest thank you!
And thanks a lot for replying to the comments. I'm trying to calculate the coefficients manually, without any software, and my question is about hours studied by students to pass an exam. My problem is that I want to calculate the coefficients by hand, the way we do in linear regression simply by using formulas.
The coefficients are not calculated like linear regression. Linear Regression has an analytical solution - there's a formula and you can plug numbers into it and get a solution. Logistic Regression, however, does not have an analytical solution. It's solved using an iterative algorithm, like Gradient Descent.
@@statquest thanks sir
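To illustrate the iterative idea, here is a minimal gradient-ascent sketch on made-up "hours studied vs. passed" data (real software typically uses more sophisticated iterative methods, such as Newton-Raphson/IRLS, but the principle is the same):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up data: hours studied and whether the exam was passed
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
passed = np.array([0,   0,   1,   0,   1,   0,   1,   1  ])

b0, b1 = 0.0, 0.0        # start with any line
learning_rate = 0.1

for _ in range(5000):
    p = sigmoid(b0 + b1 * hours)                     # current squiggle
    # gradient of the log-likelihood with respect to b0 and b1
    b0 += learning_rate * np.sum(passed - p)
    b1 += learning_rate * np.sum((passed - p) * hours)

print(b0, b1)            # estimated intercept and slope
```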
The best tutorial on logistic regression
Thank you. :)
Glad you think so!
Great video, thank you so much for clearly explaining the topic!
Glad it was helpful!
I have a question (I don't really know about regression)
Why, at 7:56, when he calculates log(likelihood) = Sum[log(p)] + Sum[log(1-p)],
is he not using the formula
log(likelihood) = Sum[Y * log(p)] + Sum[(1-Y) * log(1-p)]?
Where are the Y and (1-Y)??
Remember, Y represents what is known about each mouse and is either 0 (if the mouse is not obese) or 1 (if the mouse is obese). So, when using your formula, Sum[Y * log(p)] + Sum[(1-Y) * log(1-p)], if we expand the summation, we get:
[Y1*log(p1) + (1-Y1)*log(1-p1)] + [Y2*log(p2) + (1-Y2)*log(1-p2)] + etc. So, now imagine Y1 = 0, because mouse #1 is not obese, and Y2 = 1, because mouse #2 is obese. That gives us...
[0*log(p1) + (1-0)*log(1-p1)] + [1*log(p2) + (1-1)*log(1-p2)] =
[0 + (1)*log(1-p1)] + [1*log(p2) + (0)*log(1-p2)] =
log(1-p1) + log(p2)
...and that is the formula that I'm using.
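A quick numerical check of that equivalence (with made-up probabilities and labels):

```python
import numpy as np

p = np.array([0.1, 0.8, 0.6, 0.9])   # predicted probabilities of being obese
y = np.array([0,   1,   0,   1  ])   # 1 = obese, 0 = not obese

# textbook form: Sum[Y * log(p)] + Sum[(1 - Y) * log(1 - p)]
ll_formula = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# form used in the video: log(p) for the obese mice, log(1 - p) for the rest
ll_video = np.sum(np.log(p[y == 1])) + np.sum(np.log(1 - p[y == 0]))

print(ll_formula, ll_video)   # identical values
```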
These StatQuest videos are giving me Homestar Runner vibes... but educational
TROGDOR!!! :)
Hi StatQuest! I would like to ask: at 8:21, we need to rotate the line to obtain the best sigmoid function. But rotating a line needs a pivot; how do we find the optimal pivot?
I found that using the mid-point of the two clusters (in this case, finding the mean of the not-obese and obese rats separately, and then finding the mid-point of those means) could be problematic, because the mean can be affected by outliers. But using the mid-point of the medians could be a solution.
Thank you.
The line is actually fit using gradient descent. For details, see this video: ruclips.net/video/sDv4f4s2SB8/видео.html
The log-likelihood part around 6:00 is really confusing regarding why the likelihoods here are the same as the probabilities. Would you help explain more? Please!!! I saw a lot of people in the comments are confused too.....
Likelihoods are the y-axis coordinates created by statistical distributions. Always. However, depending on the distribution, probabilities are either the y-axis coordinates or the area under the curve. When the distribution is for continuous outcomes (like height), probabilities are calculated as "the area under the curve" (as illustrated in this video: ruclips.net/video/pYxNSUDSFH4/видео.html ). In contrast, in this case the distribution is for discrete outcomes (obese or not obese) and with discrete outcomes, the y-axis is also equal to the probabilities.
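A small numerical illustration of the distinction (using a normal distribution for the continuous case; all the numbers are made up):

```python
import numpy as np

# continuous case (e.g. height ~ Normal(170, 10)): the curve's height at a
# point is the likelihood, but a probability is an area under the curve
mean, sd = 170.0, 10.0
likelihood_at_172 = np.exp(-0.5 * ((172.0 - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(165.0, 175.0, 10001)
pdf = np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
prob_165_to_175 = np.sum(pdf[:-1] * np.diff(x))   # crude area under the curve

# discrete case (obese / not obese): the y-axis value from the squiggle is
# both the likelihood and the probability that the mouse is obese
p_obese = 0.91
likelihood_obese = p_obese

print(likelihood_at_172, prob_165_to_175, likelihood_obese)
```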
Fantastic video Josh.... Clear-cut explanation...
Thanks so much! :)
@3:20, is the candidate line used for the projection randomly initialized? If not, where do we get the 'candidate line' on the log(odds) graph during the 1st iteration of the maximum likelihood estimation?
The candidate line starts out randomly selected.
What I don't understand is: why do you transform the y-axis to the log(odds)? Why can't we just use maximum likelihood on the S-curve directly? Why do we need Beta_0 and Beta_1?
It's way, way easier to optimize something linear (which is what we do in the log(odds) space) than non-linear (which is what we have in the probability space). Non-linear functions like this sigmoid shape are sometimes impossible to fit because there are too many options for where and how to bend the line.
@@statquest First of all, thank you for your response. So, does this mean that by replacing the parameter z with a linear equation, we are essentially optimizing the linear equation to fit the data, and because this is a component of logistic regression, we are indirectly optimizing the S-curve as well? Is that how it should be understood?
@@sayyamplay By optimizing the straight line, we optimize the squiggle. So we replace an impossible problem with something we can solve and in the process solve both.
@@statquest wow, thanks for the great explanation. That was the missing piece. Btw, I‘m currently reading Essential Math for Data Science, they mentioned your videos multiple times there. Living legend
@@sayyamplay BAM!
You mention at 9:25 an algorithm that finds the perfect line in a few iterations; are you referring to gradient descent? Awesome video as always!
Yes - but there are other iterative methods that people use as well.
How does the projection work? How are those values projected onto the best fitting line at 3:45? Using log(what)? How is the first point 2.1 and not something else?
First, the line at 3:45 isn't a best fitting line, it's just a candidate "best fitting" line (as stated at 2:44 ). The projection is done by plugging the x-axis coordinates for each data point into the equation for the line to determine the y-axis coordinates. (for example, if the equation for the candidate line is y = 2x - 5, and if the x-axis coordinate for the first point was 0.5, then y = 2*0.5 - 5 = -4.)
Could anyone help me understand where the sigmoid function comes into the picture here? Also, can anyone help me understand where we are using the log loss function to optimize things?
The sigmoid function is used to convert the log(odds) into a probability. If we flip the sign of the log(likelihood) (which we want to maximize), we get the log(loss) (which we want to minimize).
@@statquest Thanks Josh.. you are the best.
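As a rough illustration of that relationship (made-up numbers; many libraries report the average rather than the sum):

```python
import numpy as np

p = np.array([0.2, 0.9, 0.7, 0.4])   # predicted probabilities of class 1
y = np.array([0,   1,   1,   0  ])   # observed classes

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# "log loss" (binary cross entropy) is the log-likelihood with its sign
# flipped, usually averaged over the observations
log_loss = -log_likelihood / len(y)

print(log_likelihood, log_loss)
```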
Thanks Josh for these great videos.
I'm wondering if it's possible to optimize the coefficients of a Logistic Model using a Genetic Algorithm. If yes, please can you demonstrate how?
Many thanks.
Maybe - but I don't know how to do it off the top of my head.
4:14 Why didn't you cancel out e and log on the right-hand side of this equation like you did on the left-hand side?
Because often we have the log(odds) values and not the (odds) values. So it's easier to plug in the log(odds) into the equation.
It seems to be a useful model. Thank you
:)
8:39
How do you determine which value is better?
Like, why is -3.77 better than -4.15?
The goal is to find the squiggle with the maximum likelihood, and that is why this video is titled "Maximum Likelihood". However, at 7:41, I say that the squiggle that maximizes the likelihood is the same one that maximizes the log-likelihood, so we maximize the log-likelihood instead (because adding the logs of the likelihoods on a computer is easier than multiplying the likelihoods, due to something called "underflow"). Since -3.77 is larger than -4.15, the line with a log-likelihood of -3.77 is the better fit.
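A quick demonstration of why the log version is friendlier to a computer (2000 made-up likelihoods of 0.5 each):

```python
import numpy as np

likelihoods = np.full(2000, 0.5)      # 2000 made-up likelihoods

print(np.prod(likelihoods))           # 0.0 -- the raw product underflows
print(np.sum(np.log(likelihoods)))    # about -1386.3 -- the log version is fine
```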