These lectures deserve to be recognized as the bible of machine learning
Prof Ng is the man!
As a layman clicking buttons to try get a better understanding of large language models I thought I was making some progress, then I watched this video, now I think I should go back to primary school 😢
Have you been watching the previous videos in the playlist? If so, I'm very surprised this video was the one you found challenging. Feels like a breather after the last few videos
It was like I was blind about linear activation functions, but now I'm gifted with vision :)
*computer vision
I needed this, thank you!
nice explanation
Great video!!! I am still confused about why ReLU works when its properties are quite linear. I mean, I know it's a piecewise-linear function and therefore does not meet the mathematical definition of a linear function. But by using ReLU, the output is still just a linear combination. Perhaps some neurons don't 'contribute', but the output is still the mathematical result of a linear combination of numbers.
Santosh Gupta Hi, in my understanding, the aim of a neural network is to simulate a function that usually cannot be represented by a closed-form expression. A linear activation function would cause the final result to be linear, which is useless, as the professor explained. ReLU avoids this problem: a linear combination of several outputs that have passed through ReLU is equivalent to a piecewise function, and although each piece is linear, the whole can be viewed as an approximation of another, more complex function. It is just like what we do in computer graphics: we do not draw a curve directly, we draw a lot of straight line segments to simulate a curve. Hope this is helpful.
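To make that "straight lines to simulate a curve" idea concrete, here is a minimal NumPy sketch (not from the video; the knots and weights are just illustrative) showing that a weighted sum of shifted ReLUs traces a parabola as a piecewise-linear curve:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Approximate f(x) = x**2 on [0, 1] with a sum of shifted ReLU "hinges".
# Each hinge adds a change of slope at its knot.
knots = np.linspace(0.0, 1.0, 6)
x = np.linspace(0.0, 1.0, 101)
target = x ** 2

seg_slopes = np.diff(knots ** 2) / np.diff(knots)   # slope of x**2 on each segment
hinge_weights = np.diff(seg_slopes, prepend=0.0)    # slope *changes* at each knot

approx = sum(w * relu(x - k) for w, k in zip(hinge_weights, knots[:-1]))
print("max abs error:", np.abs(approx - target).max())   # ~0.005 with these 6 knots

More knots means shorter line segments and a smaller error, which is the same trick as drawing a curve with many short straight lines.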
Thanks!!!
For a function to be linear, its slope must be constant throughout. Since ReLU has a little kink at zero, it is a non-linear function.
Hi, I am confused. Does ReLU kill the neuron only during the forward pass, or also during the backward pass?
An activation function like ReLU is applied during forward propagation, and its derivative is used during backward propagation when you take the gradients and update the weights and biases. So a unit whose ReLU output is zero also passes zero gradient backward for that example.
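For what it's worth, here is a tiny NumPy sketch (my own illustration, not course code) of a single ReLU layer in both passes:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 where z > 0, 0 where z < 0 (0 at z = 0 by convention).
    return (z > 0).astype(float)

z = np.array([-2.0, 0.5, 3.0])        # pre-activations of three hidden units
a = relu(z)                           # forward pass: [0. , 0.5, 3. ]

upstream = np.array([1.0, 1.0, 1.0])  # gradient arriving from the layer above
local = upstream * relu_grad(z)       # backward pass: [0., 1., 1.]
# The first unit output 0 in the forward pass, so it also passes 0 gradient
# backward -- its incoming weights get no update from this example.
print(a, local)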
superb
I've got a question: does the use of a non-linear activation increase model capacity?
Exactly my thoughts. I came here to gain insight into how non-linear functions like ReLU drastically change the usefulness of hidden layers. Instead it just reiterates the common knowledge that linear functions produce no value when used on hidden layers.
Definitely yes. Without non-linearity you can only express linear functions. With non-linearities such as ReLU and sigmoid you can approximate any continuous function on a compact set (search "Universal approximation theorem" for more information).
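A quick numerical check of the first point, as a sketch with made-up weights (nothing from the lecture): two stacked layers with an identity activation collapse to a single linear layer, while ReLU in between breaks that collapse.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two layers with identity activation: W2 @ (W1 x + b1) + b2 ...
deep_linear = W2 @ (W1 @ x + b1) + b2

# ... is exactly one linear layer with W = W2 W1 and b = W2 b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(deep_linear, collapsed))   # True: the hidden layer adds nothing

# With ReLU in between there is no single (W, b) reproducing the map for all x,
# which is exactly what makes the hidden layer worthwhile.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
print(np.allclose(deep_linear, with_relu))   # generally False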
@@almoni127 Sure, but look at that statement and tell me if it is always true, even for a single-layer network. How can we claim that model capacity has increased after using a non-linear activation instead of a linear activation, since it is not quantifiable? How much has capacity increased?
@@TheThunderSpirit A good analogue is Boolean circuits. With no hidden layers you don't have much expressiveness. Already with one hidden layer you have universality, albeit the required hidden-layer size might be too large. With arbitrarily deep networks you can approximate all polynomial computations.
@@almoni127 Thank you
It's so ridiculous that a video whose speaker is Chinese only has Korean subtitles!!
You know you can actually choose English subtitles in the settings...
And hopefully you could add Chinese subtitles to the video.
From the videos above, I understood that ReLU is a linear function... the rest are non-linear functions. But how can we consider the sigmoid function as binary?? A binary function always gives output as either 0 or 1, but the sigmoid function varies between -infinity and +infinity and crosses the y-axis at 0.5?
ReLU is NOT a linear function, and neither are the sigmoid function or tanh, for example. Also, the sigmoid function's output does not vary between -inf and +inf; it is in the range (0, 1).
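That's easy to check numerically; a small sketch (my own, just illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The *input* can range over all of (-inf, +inf); the *output* stays in (0, 1).
for z in [-100.0, -5.0, 0.0, 5.0, 100.0]:
    print(z, sigmoid(z))
# -100 -> ~3.7e-44, 0 -> 0.5, 100 -> 1.0 only to float precision; the output is
# never exactly 0 or 1, so sigmoid "squashes" rather than giving a hard binary output.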
@@Mats-Hansen How can you say ReLU is a non-linear function? It's a combination of two linear functions.
@@saurabhshubham4448 Yes it is. But a combination of two linear functions doesn't need to be linear.
@@Mats-Hansen But whenever you apply it to an input, the output will be linear, and thus ReLU doesn't help in adding non-linearity to the model.
@@saurabhshubham4448 The output is linear, yes, but as far as I understand you lose the linear dependencies between the weights. Let's take two weights, w1 = -0.2 and w2 = 0.5, and a linear function, say f(x) = 3x + 0.2. Then f(w1) = -0.4 and f(w2) = 1.7. This function preserves the difference (up to a linear factor) between the two weights: w2 - w1 = 0.7, and f(w2) - f(w1) = 3*(w2 - w1) = 2.1. You will always have this with linear functions, but with a function like ReLU you will not (only interesting, of course, if at least one of the weights is negative). Now maybe this math is just nonsense, but I think I have a point here somewhere. In a sense you cannot write the old weights as a "linear combination" of the new weights.
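For what it's worth, the arithmetic there checks out; here is a small plain-Python sketch of the same calculation plus the ReLU comparison (illustrative only):

w1, w2 = -0.2, 0.5

def f(x):          # the affine map from the comment above
    return 3 * x + 0.2

def relu(x):
    return max(0.0, x)

print(f(w1), f(w2))                   # ~ -0.4 and 1.7 (up to float rounding)
print(f(w2) - f(w1), 3 * (w2 - w1))   # both ~ 2.1: the difference is scaled by exactly 3

print(relu(w1), relu(w2))             # 0.0 and 0.5
print(relu(w2) - relu(w1), w2 - w1)   # 0.5 vs 0.7: the negative side is clipped,
# so differences are no longer preserved up to a fixed factor once a value is negative.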
This person has a lot of knowledge; he picks one thing and then starts explaining another and another. This disease is called explainailibalible, sorry.
Grandpa telling bedtime story.........
Then get out of here. SOB!!!
Yeps.. ✌️
Some of us very much appreciate his way of teaching.
Your English is difficult to understand. I keep going back to figure out what you mean by some words.
lol he is a doctor at Stanford
Maybe learning the basics such as ReLU or sigmoid will help? I don't think these are everyday English words.
Do you know who this guy is? You should cherish this opportunity that such a great talent teaches you online. Also, this is the first time I have seen someone pick on his English. Maybe it is time for you to improve your listening skills...
Such a sick person you are. Your brain is messed up, you crazy fool. (translated from Telugu)
his English sucks!