If you found any value in the video, hit the red subscribe button and the like button 👍. I would really love your support! 🤗🤗
👉 You will get a new video on Machine Learning every Sunday if you subscribe to my channel here: ruclips.net/channel/UCJFAF6IsaMkzHBDdfriY-yQ
Finally! I found something useful. Thanks a lot. Everyone teaches the workings of gradient descent in a very crude way, but almost no one teaches the math behind it. Almost everyone simply imports gradient descent from some library, and no one shows pseudocode. I wanted to understand the workings behind those functions, how the parameters get adjusted, and what math is used behind the scenes, so that if required we can create our own functions, and this video fulfilled all of those requirements.
So so true!!
❤
This is a great explanation of gradient descent! Thank you!
You're welcome!
Your explanation is really good. It would be helpful if you could make video playlists on Linear Algebra, Optimization and Calculus.
Hi Shreedhar, thanks for the compliment and the suggestion. I will consider making videos on these topics too, just that, it might take some time. 🙂
Last 5 minutes were epic 😍... Thanks 💙
Thank you so much! Your comment means a lot to me.
The best ever explanation with detailed mathematical explanation
Thank you a lot for this. Your explanation helped to wrap my head around gradient descent !
i usually never comment, but this was so simple and easy to understand ty
Don't stop! This was more than helpful!
Sure 😇… glad to help!
hello, at 11:00, why did you multiply the m by 2? In the previous video there was only m.
Your way of explaining things was just amazing!! I got everything you wanted to explain, thanks.
Thank you so much! I really appreciate your comment.
Keep going, bro, you are clearing up my concepts. Please make a playlist of Python tutorials.
Thank you! I will try covering Python tutorials if I get time… until then, you can check out some other playlist on RUclips for Python.
Nice point-to-point explanation; others only create confusion 😅
Thanks for this. I'm learning data analytics but I come from a profession with little math, so it's challenging.
You're welcome, James! I will make more videos on Machine Learning with Mathematics for sure!
Can you solve questions too, please, in all the videos you explained...
hey, can you please help me to solve this question?
Question: You run gradient descent for 15 iterations with α = 0.4 and compute J(θ) after each iteration. You find that the value of J(θ) increases over time. Based on this, please describe how you would choose a suitable value for α.
If J(θ) is your cost function and it is increasing over time, you need to choose a smaller learning rate α so that it decreases over time instead.
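To see this concretely, here is a toy sketch (a hypothetical cost J(θ) = θ², not the cost function from the video): a too-large α makes every step overshoot the minimum so J grows, while a smaller α makes J shrink each iteration.

```python
def run_gd(alpha, theta=1.0, iterations=15):
    """Gradient descent on the toy cost J(theta) = theta**2,
    whose gradient is 2*theta; records J after each iteration."""
    costs = []
    for _ in range(iterations):
        theta = theta - alpha * (2 * theta)
        costs.append(theta ** 2)
    return costs

# Too-large alpha: every step overshoots the minimum, so J(theta) grows.
diverging = run_gd(alpha=1.1)
# Smaller alpha: J(theta) shrinks on every iteration.
converging = run_gd(alpha=0.1)
```

Plotting (or just printing) the two cost lists makes the diagnosis obvious: if the list is increasing, halve α and try again.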
You explain really well..... watching in 2023
Can you make a video based on the XGboost algorithm with mathematical formulas?
Thank you for your suggestion! I will consider making video on it, but it will take time. 😊
@@CodingLane make it as soon as possible
Thanks for your response.
I have another doubt about machine learning algorithms. Please can you clarify it... how do you find the cost function of K-means clustering?
@@v1hana350 There is no need for a cost function in K-means clustering. It is a clustering algorithm, which works differently from linear regression.
It works as:
- you randomly initialize cluster points.
- calculate the distance between each cluster point and all the other points in the dataset
- group the data points into clusters such that each point goes into the cluster of its nearest cluster point
- recompute the cluster points as the average of all the points in each cluster
- repeat the process
You don't need a cost function here. Still, if you want to use one, you can take the sum of the distances from each cluster point to the other points in its cluster.
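The steps above can be sketched in Python like this (a minimal hypothetical implementation, with the optional distance-sum "cost" computed at the end):

```python
import numpy as np

def kmeans(X, k, iterations=10, seed=0):
    """Minimal K-means following the steps above: random init,
    assign each point to its nearest cluster point (centroid),
    recompute centroids as cluster means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iterations):
        # distance from every point to every centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # the optional "cost": total distance of points to their own centroid
    cost = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, cost
```

For example, on two well-separated pairs of points, the two pairs end up in different clusters and the cost (total within-cluster distance) is small.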
I think something is missing at 12:08, where you omitted the SUM without explaining; all you showed was the matrix differentiation.
The sum will already happen with matrix multiplication… like instead of having 1^2 + 2^2 + 3^2 … we are writing [1 2 3]*[1 2 3].T
@@CodingLane Yeah, I figured it out after watching a few times, but in the video you mentioned that we used the derivative of x^2, so I think you should have emphasized that part. Overall a great video; you made it very easy to clear up some of my doubts at the beginner stage. Plus, I would be very grateful if you could create a community channel on Telegram or Discord for anyone who wants to clear up doubts, as that's not possible on YT.
Great explanation. Please make a video on knn too.
Sure... I will make a video on it too! Thanks for the suggestion.
Thank you bro for this explanation 🙏
I am a COBOL programmer who has started machine learning.
I have a doubt:
why do we arbitrarily fix 1000 iterations?
As you mentioned, the derivative of the cost w.r.t. theta is a slope, so why don't we stop iterating as soon as the derivative reaches ZERO (meaning at the bottom center, where no slope exists)?
OR
why don't we determine that the cost function has reached the minimum by checking whether its previous value is less than the current value?
I searched many sites for the reason, but nowhere is dynamic iteration mentioned instead of a constant iteration count.
I'm not sure if I'm missing something. Please guide me.
If your derivative reaches 0, then you will stop whether you want to or not. Learning comes from a non-zero derivative (it tells you the direction you need to move in), so if it's 0, you stop. This is typically bad for larger problems because we don't usually have an obvious global minimum, so we want our code to run as long as the cost is decreasing. But if you get a 0, this essentially "kills" the neuron, which results in no learning. This is a common problem when using the ReLU activation function and is why leaky ReLU was created to mitigate this issue.
But if you truly did reach the global minimum and your derivative is 0, then there's no problem. Your model will stop updating each iteration, but since you reached the minimum, you should be good.
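Both dynamic stops the question asks about can be combined with a fixed cap. A minimal sketch in Python (a hypothetical toy setup, not code from the video): stop early once the cost barely decreases, otherwise fall back to the max_iters limit.

```python
import numpy as np

def train(X, y, alpha=0.1, max_iters=1000, tol=1e-8):
    """Gradient descent for linear regression with a dynamic stop:
    quit early once the cost stops decreasing by more than `tol`,
    instead of always burning through the fixed iteration cap."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev_cost = np.inf
    for i in range(max_iters):
        error = X @ theta - y
        cost = (error @ error) / (2 * m)
        if prev_cost - cost < tol:   # cost barely moved: converged
            break
        prev_cost = cost
        theta -= alpha * (X.T @ error) / m
    return theta, i

# Fit y = 3x + 2; training stops well before the 1000-iteration cap.
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2.0 + 3.0 * np.arange(5.0)
theta, stopped_at = train(X, y)
```

In practice both are used: the fixed cap guards against the tolerance never being hit, and the tolerance check saves the wasted iterations the question is asking about.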
What an explanation, thanks sir.
tysm really appreciate your explanation
You’re welcome!
Very good and end-to-end explanation.
Thank you!!
Very helpful
Thank you!
That's an awesome explanation.
Thank you so much, Veeresh!
Why do we take a column of zeros in the (m×n) feature matrix? Why can't we multiply directly?
Hello, I believe the sigma goes from zero to m, not from 1 to m. Anyway, thanks for the great explanation.
Hello, I have a question on the impact of increasing the value of theta when d(cost) / d(theta) is negative. Since the rate of change of the cost function is determined to be positive or negative by (Y - Y_predicted), does this mean that when we INCREASE theta, the value of Y_predicted decreases? I am having trouble understanding this since I assumed because X and Y_predicted share a linear relationship, increasing theta should also increase the value of Y_predicted. Would be grateful if you are able to find the time to clarify this point for me, and by the way, great video I learned a ton!
Hi Paul, we don't manually set (increase or decrease) the value of theta. The model sets it automatically. That is why we use the Gradient Descent algorithm: to set the appropriate value of theta to make correct predictions. If you manipulate the value of theta manually yourself, your results won't be accurate.
The point you should focus on here is why and how the cost function decreases, and how that helps to automatically adjust the value of theta. The value of theta can be very small or very large, positive or negative. It doesn't matter. What matters is that it is automatically adjusted (whether positive/negative/small/large) in a way that makes correct predictions.
Why do you ignore the negative sign in the partial derivative?
it was helpful!
OMG, THANK YOU!
What does theta represent in GD? Please explain.
Theta is a parameter, which we first initialize to zero. Then we train the model to change the theta value in such a way that, with this changed value, we can make accurate predictions.
Think of it like the parameters of a straight line.
Say the equation of a straight line is y = ax + b.
Then a and b are the parameters of this straight line. If we have many such parameters, we represent them with theta.
So initially, our straight line will be y = 0. And after training the model, the parameter values will have changed, and with these parameters our straight line will fit best on our dataset.
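A tiny illustration of that idea (using NumPy's closed-form least-squares solve rather than gradient descent, just to show theta going from the initial y = 0 line to the fitted parameters):

```python
import numpy as np

# Data generated from the line y = 3x + 2, so the "true" parameters
# (a = 3, b = 2) are known in advance.
x = np.arange(10.0)
y = 3.0 * x + 2.0

theta = np.zeros(2)               # initially the model is the line y = 0
X = np.c_[np.ones_like(x), x]     # the column of ones carries the intercept b
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # "training" in closed form
```

After the solve, theta holds [b, a] = [2, 3]: the parameters have moved from zero to the values that make the line fit the data.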
Check out my “What is Linear Regression?” And “Linear Regression Cost function” video from this playlist for better understanding: ruclips.net/p/PLuhqtP7jdD8AFocJuxC6_Zz0HepAWL9cF
How did anyone formulate the equations for theta and alpha?
They were provided by researchers in their papers on linear regression.
What is theta ?
At 8:15, I did not get why y-hat is equal to that summation ending with theta 0.
I revisited it and got the answer at 9:00, thanks a lot. It's just that, because there is no animation while you point things out, it's a bit of a task to listen and figure it out. I wish you reach the next level in presentation, because you are doing a great job with all the logic and fluency! I had a small confusion, as I am also doing Stanford's machine learning course on Coursera, and your video helped in no time. Thanks, grow well.
@@abhzme1 glad it helped. And thanks for suggesting. I have added presentation and animation in the videos uploaded in Neural Network Playlist. Hope you find it better than this. Let me know if you have any specific suggestion while you go through those videos. I will greatly appreciate it.
WOW explanation
Haha… thanks!
You also hinted at the gradient descent problem, where the local value disappears like a ghost....... 👻👻👻
Need gradient descent for logistic regression, and the derivation.
Hi Salman... I have already made a video on it... you can check that out in logistic regression playlist
@@CodingLane thanks
Hi JP, you stopped uploading videos; I hope everything is fine with you.
thanks bro :)
Please explain the cost function using graphs .....
Okay... Thanks for the feedback Mudassir ! I will try to cover it in my future videos.
Man, the name is Coding Lane; what is "boost"?
Hi Supriya... previously, the name of the channel was Code Booster... that is why.
@@CodingLane bro, I request you to make video on a roadmap on how to learn ML engineering from scratch to adv, and specify the resources for the same, so every self taught get an idea
@@supriyamanna715 Thank you for the suggestion. I will create a video on it.
I am starting my machine learning journey now; I feel like I am late 😪
I don't understand why your cost function is divided by 2 times the number of samples, instead of just m. Every other guide shows it should only be m.
Hi, the final performance won't be affected whether you divide by m or 2m. You can check my detailed answer in the comments below (on this video or some other video in this playlist).
After differentiation, the entire function gets multiplied by 2. To eliminate that 2, he divided by 2m at the beginning. Once the 2 is removed, updating the values is much easier.
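In symbols, with the 1/(2m) convention (using the same cost as the video):

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2
\qquad\Longrightarrow\qquad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}
```

The 2 brought down by the power rule cancels the 2 in the denominator, leaving a clean 1/m in the gradient and the update rule.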
Who will give the alpha value?
Please assume people don't know calculus. That could be your niche, where other channels give up on their audience.
Ohh... that's very valuable feedback. I am definitely going to take action on this. Thanks a lot!!
If you don’t know BASIC CALCULUS GO BACK TO SCHOOL AND PICK ART CLASSES YOU ARENT SMART ENOUGH FOR THIS FIELD
STUPID
Nothing interesting in this😢
Yupp… Machine Learning is not interesting, but powerful 😇
@@CodingLane Hey, it is interesting too 😠
Pretty bad explanation... it lacks flow and seems to be copied from somewhere.