Great video, super helpful!
thx Dan love u
You are both awesome
I absolutely love the energy you both have in your videos :)
Be soo cool if both did a collab video!
Thanks, my biological neural network now has learned how to choose activation functions!
awesome
Hahahah
Remember, the whole is more than its parts. The behaviour of the whole is different from that of its elements.
From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.
I'm using tanh, but I always read saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesn't disappear.
Really enjoyed the video as you add subtle humor in between.
Just watched your speech @TNW Conference 2017. I am really happy that you are growing every day. You are my motivation and my idol. Proud of you, love you.
thx stevey love u
I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!
hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!
Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff is really making sense to me!
I love you man, 4 f***** months passed and my stupid prof. could not explain it as you did, not even partially. keep up the good work.
Thanks a lot
I really like your videos as they strike the very sweet spot between being concise and precise!
Excellent and entertaining at a high level of entropy reduction. A fan.
Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!
Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.
Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)
Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
(And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my high school maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)
Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!
thx will do
"I can't control the gradient" is the best part of the video.
So the slide at 4:00 says "Activation functions should be differentiable", but the conclusion of the video is that you should use the ReLU activation function, which is not differentiable. (Great video btw.)
Daniel O'Connor 2 years later and you've probably figured it out, but I believe ReLU is non-differentiable only at exactly x=0, which is really rare in practice
this guy needs more subs. Finally a good explanation. Thanks man!
Learning more from your videos than all my college classes together!
Super Siraj Raval!!!!! Great compilation Bro.
But isn't ReLU a linear function? You mentioned at the beginning that linear functions should be avoided, since with non-linear functions both computing backpropagation and classifying data points that do not fit a single hyperplane are easier.
Or did I get the whole thing wrong?
It's not linear because any negative x sits at zero on the y-axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you're only looking at x > 0 or x < 0, but if you look at the whole line it's kinked in the middle, which makes it non-linear.
It is a piecewise linear function, which is essentially a non-linear function. For more info, google "piecewise linear functions".
The sparsity of the activations adds to the non-linearity of the neural net.
@@10parth10 that explanation helped. Thanks
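For anyone who wants to check the non-linearity claim above numerically, here's a minimal NumPy sketch (the `relu` helper is just defined here for illustration): a linear function must satisfy f(a + b) = f(a) + f(b), and ReLU breaks that as soon as the inputs straddle zero.

```python
import numpy as np

def relu(x):
    # ReLU: element-wise max(0, x)
    return np.maximum(0.0, x)

a = np.array([2.0])
b = np.array([-3.0])

# A linear function would have to satisfy f(a + b) == f(a) + f(b).
print(relu(a + b))        # [0.]  because a + b = -1, and relu(-1) = 0
print(relu(a) + relu(b))  # [2.]  because relu(2) = 2 and relu(-3) = 0
# The two results differ, so ReLU is not linear; it is only piecewise linear.
```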
Now I understood why we are using this activation function; till now I was just using them, now I know why I'm using them. Thanks Siraj!
I love watching these videos, even if I don't understand 90% of what he is saying.
Valuable introduction to generative methods for the establishment of sense in artificial intelligence. A great way of bringing things together and expressing it in one single, discrete language.
Thanks Siraj Raval, great!
Your way of teaching is so cool and crazy :)
Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.
Which software do you use to create the neural network and activation function animations, like @1:15 to @2:03 and @5:27 to @5:54?
Super clear & concise. Amazing simplicity. You Rock !!!
@Siraj
NN can potentially grow in so many directions, you will always have something to explain to us.
As you used to say 'this is only the beginning'.
And ohh maaan ! you're so clear when you explain NN ;)
Please keep doing what you're doing again and again and again...and again !
You are for NNs what Neil deGrasse Tyson is for astrophysics.
thx for sharing the GitHub source that details each activation function
By far the best Machine Learning videos I've watched. Amazing work! Love the energy and vibe!
If we use GAs (genetic algorithms) we do not need differentiable activation functions; we can even build our own function. The issue is the backpropagation method; that is what limits the activation functions.
Another question is what the difference is if I use more hidden layers or more hidden neurons.
I think that at this moment there's no clear-cut approach to how to choose the NN architecture.
More layers make learning very slow compared to more neurons. Before training, all the biases will overcome the inputs and make the output side of the network static. It takes a long time to get past that.
Maybe you should limit the starting biases so you can get past that phase quicker. I always initialize biases between 0 and 0.5.
2 Hidden layers are enough.
Depends on the situation: a simple text recognition task is fine with 2 layers, but something like a convolutional neural network may have to have 10. For the majority of things in this day and age, though, 2 is plenty.
1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaking minimum around 0; shouldn't ReLU also have a (leaking) maximum around 1?
2. Is there one best activation function, delivering the best neural network with the least amount of effort, like the amount of tests needed, and computer power?
3. Should weights and biases be between 0 and 1 or between -1 and 1? Or any different values?
4. Against vanishing and exploding gradients: can this be prevented with a (leaking) correction minimum and maximum for the weights and biases? There would be some symmetry then with the activation function suggested in the first paragraph.
Just gotta say, Siraj, you are amazing because I only understand half of what you say.
thx keep watching
OMG, this is the first time I am seeing his videos and it's quite entertaining.
Your channel is GOLD!
Crystal clear explanation, just loved it
Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.
This is a well thought out comment, and so is the reply to it, I see. Making them more relevant and sparse should help. I'll do that.
Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.
I read papers and articles... but a 10 min video helped me more than all of that :D
@@SirajRaval It keeps it fresh and helps me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....
8:44 I liked this motto on the wall.
This guy makes learning so much fun!
Why does it show at 6:38 that the derivative of tanh is 1 - x^2? It is very different from that.
This video is very easy to understand!
Love this video so much. Helped me so much with my LSTM RNN network
Hi Siraj:
Your videos are great!
CONGRATULATIONS!
You covered half of what my AI principles course covered on learning in 3 and a half hours, in 8 minutes. Nice.
digging your vids and enthusiasm from Portland Oregon!
hi Siraj,
you nailed it in a very short period of time. Loved it. Would like you to keep it up always. Cheers....
Thanks @Siraj. What an amazing and easy-to-digest explanation.
8x better than my data mining professor, thank you 🙏
siraj you are a good ai teacher
According to Andrew Ng, sigmoid is helpful at the output node, isn't it?
He relies more on ReLU. Sigmoid is passé.
Sigmoid is definitely helpful on the output layer/node
I think it’s fine to use on the output layer for some binary classification problem
Very helpful video, thanks a lot. Actually, to introduce non-linearities we are introducing the activation function. But how does ReLU, which looks linear, do better than other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)
Hello, can anyone tell me how the partial derivative at 1:52 works? I don't know what f(x) is in the first place (is that the sigmoid, a.k.a. our activation function?), so let's suppose f'(x) is the sigmoid derivative; then what is h? Is it some number tending toward 0? And why do we differentiate the sigmoid at (almost) 0? Also, is the result we obtain y or y hat? It's just this single point I'm trying to understand, the rest is clear.
Woah ! thanks man, you made things so clear !!!
Hard stuff made easy. Congrats to a great video! Keep it up, mate!
How do you detect dead ReLUs in your model though?
By viewing the activation function values in each layer?
After each epoch, check to see if any neurons have activations that are converging toward zero. The best way to do this would be to monitor the neurons over a series of epochs and calculate a delta or differential between training epochs.
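A rough sketch of that monitoring idea in plain NumPy (the `dead_relu_report` helper, the fake activations, and the 95% threshold are all illustrative assumptions on my part, not anything from the video): record the fraction of inputs for which each ReLU unit outputs zero, and flag units that are almost always zero.

```python
import numpy as np

def dead_relu_report(activations, zero_fraction_threshold=0.95):
    """activations: array of shape (num_samples, num_units) holding the
    post-ReLU outputs of one hidden layer on a batch of inputs."""
    # Fraction of samples for which each unit output exactly zero.
    zero_fraction = np.mean(activations == 0.0, axis=0)
    dead_units = np.where(zero_fraction >= zero_fraction_threshold)[0]
    return zero_fraction, dead_units

# Fake activations for a 4-unit layer over 1000 samples: unit 2 is pushed
# far into the negative pre-activation region, so it is effectively dead.
rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(1000, 4)) + np.array([0.0, 0.0, -5.0, 0.5])
acts = np.maximum(0.0, pre_acts)

zero_frac, dead = dead_relu_report(acts)
print("zero fraction per unit:", np.round(zero_frac, 2))
print("possibly dead units:", dead)  # expected: [2]
```

Storing `zero_frac` after every epoch and looking at how it changes between epochs is the delta the comment above describes; a unit whose zero fraction climbs to 1 and stays there is the usual sign of a dead ReLU.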
Dude.... exactly what i needed.. Thanks again!
Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?
will do thx
If I understand the subject right, you'll always only need one hidden layer, because of Cover's Theorem
Siraaaj you are greaattt 😍. Saved a lot of time of going through books 😂
thank you :)
Siraj Raval when are you coming to london? We hope to meet you soon buddy :)
Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!
Curious why does ReLU avoid vanishing gradient problem? When z is below 0, since y is always 0, the gradient seems to be 0, which means the gradient vanishes? Or do I misunderstand about the vanishing gradient?
Great explanation of activation functions. Now I need to tweak my model.
hard humor with gifs and memes makes me lose track of what Siraj is saying and had to rewind a bit ... LoL :)
In the case of using an LSTM, does ReLU make any difference?
Why does sigmoid function use e?
CUZ NOTHING LIKE e
Siraj, ur videos inspired me to study machine learning. I've been learning python for the past month, and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?
awesome. watch my playlist Learn Python for Data Science
Siraj Raval Do you have videos on MATLAB using NNs?
Excellent, as usual.
I think that the reason ReLU hasn't been popular prior to now is that it is mathematically inelegant, in that it can't be used in commutable functions, and a sigmoid function can.
It does beg the question though: if ReLU is being used, do we need to use the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.
Will two RTX 2080 OC cards be good with an i5 9400F for deep learning only?
Hi, I am confused. ReLU will kill the neuron only during the forward pass? Or also during the backward pass?
Yes!!! A new episode. SWEET!!! Thanks Siraj.
Dude.. I'm so happy i subscribed. Keep doing what you're doing please.
welcome! will do
Great video Siraj. Keep up the good work
thx love u
How do you make those animations? They are really nice! 1:41
final cut pro
Siraj Raval -> do you have any videos on continuous Hopfield networks, or just an article for me to read? I had a hard time finding a good one.
What is your thought on softplus?
How did you learn machine learning. What are the sources you used? I want to start from scratch.
use Coursera ML stanford course. best course that teaches you everything about ML.
Riken Maharjan Thank you. Is there any good book for ML?
My university uses "The Elements of Statistical Learning" for the introductory ML class. The book is intensely heavy with details and maths. It's free, so google the title. I think this book is used at most US universities. Try Coursera first and then this book.
My plan for this year:
first: do Coursera. (done)
second: watch and do HW of University of California Berkeley's AI course. (partially done)
third: watch and do HW of University of California Berkeley's Deep learning course. They have 3 deep learning courses all on the internet/youtube. (prereq is the second step)
fourth: watch the Neural Network videos of AI guru,Geoffrey Hinton at coursera. He is well respected in deep learning and neural network. (partially done)
Then finally watch Stanford's "The Elements of Statistical Learning" video lectures by the author of the book. (not planned yet)
This is my plan for a year. I am done with Coursera and have started the AI course at UCB. I will take ML again next semester officially at my university, then in spring probably deep learning and AI.
It will probably take a year to do everything, homework and projects included.
Riken Maharjan Awesome, but I'm not an undergraduate. So is it possible with other sources to obtain a data scientist position?
Hi Siraj, you mentioned that activation functions should be differentiable, but from my understanding ReLU is not. I was wondering how this affects backpropagation in our neural net.
From the math point of view it's not. But the only point where it's not differentiable is 0, for which you declare that the gradient is 0 or the gradient of the identity; it doesn't matter much because you're using float32 for an optimization problem, so you're very unlikely to hit this 0 case exactly. Just approximate it.
The purpose of the ReLU is to have sparse outputs and sparse gradients; it allows the network to 'activate paths'.
stackoverflow.com/questions/30236856/how-does-the-back-propagation-algorithm-deal-with-non-differentiable-activation
It doesn't matter in practice. You can return 0 or 1 when the input is at the non-differentiable point and it would do fine. Remember that neural networks are just approximators. Its algorithm is plain simple and dumb but it does the job.
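Here's a minimal sketch of what that convention looks like in a hand-written backward pass (NumPy; the function names are made up for illustration). The only decision is what local gradient to return at exactly x = 0, and as the replies above say, 0 or 1 both work fine.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(grad_output, x, grad_at_zero=0.0):
    # Derivative is 1 for x > 0 and 0 for x < 0. At exactly x == 0 we just
    # pick a value (0 here; 1 would also be fine), as discussed above.
    local_grad = np.where(x > 0, 1.0, 0.0)
    local_grad = np.where(x == 0, grad_at_zero, local_grad)
    return grad_output * local_grad

x = np.array([-2.0, 0.0, 3.0])
upstream = np.ones_like(x)
print(relu_backward(upstream, x))                    # [0. 0. 1.]
print(relu_backward(upstream, x, grad_at_zero=1.0))  # [0. 1. 1.]
```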
Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually appreciate the exact 0 output of the ReLU; it's valuable because it gives sparse outputs and gradients. If you use your function you can't 'deactivate' some paths (you only push them very close to 0, which is quite different). Also you have to consider the computation time: max(0, x) is far easier to compute than ln(1+exp(x)).
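A small sketch comparing the two, in case it helps (NumPy; writing softplus via `np.logaddexp` for numerical stability is an implementation choice on my part):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # ln(1 + exp(x)), computed stably as log(exp(0) + exp(x))
    return np.logaddexp(0.0, x)

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu(x))      # [0.     0.     0.     0.5    5.    ]  exact zeros, i.e. sparse
print(softplus(x))  # [0.0067 0.4741 0.6931 0.9741 5.0067]  never exactly zero
```

Those exact zeros are the sparsity the reply talks about, and max(0, x) is a single comparison while softplus needs an exp and a log.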
@siraj loved the explanation and the analogy in the beginning. Cheers! *in deep voice* Deep learning
thx swaroop love u
Is there any article which I can refer to ... for citation purposes ... I found out this is the best combo for my LSTM from training ... but it would be good if I could get a paper which says use ReLU ...
Thanks, for the video!
I have a question: Why should't I use tanh?
Tanh suffers from the vanishing gradient problem, i.e. the gradients become so small that the weights barely change the model, so we use ReLU, whose gradient doesn't vanish for positive inputs.
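A tiny numerical illustration of that point (NumPy sketch, with values picked arbitrarily): the derivative of tanh is 1 - tanh(x)^2, which collapses toward 0 as |x| grows, while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

x = np.array([0.5, 2.0, 5.0, 10.0])
print(np.round(tanh_grad(x), 4))  # [0.7864 0.0707 0.0002 0.    ]  shrinks fast
print(relu_grad(x))               # [1. 1. 1. 1.]                  stays at 1
```

Multiplying many of those tiny tanh gradients together across layers is exactly the vanishing-gradient effect being asked about.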
Are you serious, he literally just told you.
I've been wondering what loss function to use D: Can you make a video for loss functions pls :)
Crashed2DesktoP This is a little less generically answerable than which activation to use.
For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem", and so can be a little trickier to answer beyond the well-studied settings.
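To make the standard pairings concrete, here's a minimal Keras-style sketch (hedged: the layer sizes, the 10-dimensional input, and the `build` helper are arbitrary illustration choices; the loss strings are the usual Keras identifiers):

```python
# Keras-style sketch of the standard task -> loss pairings described above.
import tensorflow as tf

def build(output_units, output_activation, loss):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(output_units, activation=output_activation),
    ])
    model.compile(optimizer='adam', loss=loss)
    return model

binary_clf = build(1, 'sigmoid', 'binary_crossentropy')           # 2-class labels in {0, 1}
multiclass_clf = build(5, 'softmax', 'categorical_crossentropy')  # one-hot labels, 5 classes
regressor = build(1, 'linear', 'mean_squared_error')              # real-valued targets
```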
Sina Samangooei thanks for your answer. It is useful for me too. 😃
hmm. sina has a good answer but more vids similar to this coming
Log-likelihood cost function with a softmax output layer for classification.
Why don't we use e^x as an activation function, or any other polynomial function which is differentiable and does thresholding?
Quick question: if we use ReLU or tanh, then our output is no longer a probability?
Despised the stale memes, loved the explanation.
Great insight on Activation Functions , thanks
Entire video is a GEM 💎
Totally makes sense to use ML
What's a good way to test if your neurons are dying? Any heuristics to check?
Hey Siraj, if I want to visualize and understand a neural network using C/C++ data structures and syntaxes, how would I do it?
Excellent! Great educator! Thanks for producing and sharing!
thx
I have a question... For sigmoid activation functions with an output close to 1, would the vanishing gradient problem still cause no signal to flow through it? Or would it instead cause the output to be permanently saturated? Either way it would be an issue, but I'm just trying to wrap my head around this.
While the activation function must be non-linear, neural nets store weights as binary numbers. If the range is small enough, you can store each activation function value by looking up the weight in a table. In other words, for every possible x, given a function f(x), simply store the result f(x) in a table of x+1 entries where for every x value, value_table[x] = f(x). The time it takes to calculate the activation function becomes 0 for all intents and purposes, no matter how complex it might be. In an age when I can purchase gigabytes of memory for a couple of hundred bucks, it's hard to see why anyone would include a hyperbolic function calculation embedded in their innermost loops. Even the modified ReLU variants require more work than a simple table lookup. Furthermore, a simple table lookup can be much more easily coded into a matrix library calculation.
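Here's a rough sketch of that lookup-table idea (NumPy; the range [-8, 8], the table size, and the clamping are arbitrary illustrative choices, and this is not a claim about actual speed-ups):

```python
import numpy as np

# Precompute sigmoid over a fixed input range; anything outside is clamped.
TABLE_MIN, TABLE_MAX, TABLE_SIZE = -8.0, 8.0, 4096
_grid = np.linspace(TABLE_MIN, TABLE_MAX, TABLE_SIZE)
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def sigmoid_lookup(x):
    # Map each input to the nearest precomputed table index.
    scaled = (x - TABLE_MIN) / (TABLE_MAX - TABLE_MIN) * (TABLE_SIZE - 1)
    idx = np.clip(np.rint(scaled).astype(int), 0, TABLE_SIZE - 1)
    return SIGMOID_TABLE[idx]

x = np.array([-3.0, 0.0, 2.5])
print(sigmoid_lookup(x))         # approx [0.0474 0.5    0.9241]
print(1.0 / (1.0 + np.exp(-x)))  # exact  [0.0474 0.5    0.9241]
```

Whether this actually beats a vectorised np.exp depends on the hardware and the framework, so treat it purely as an illustration of the idea in the comment above.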
Bro, I loved your content.
What if I have a classification problem with a 2-class label? Should I use softmax in the output layer (2 neurons)? Or can I treat it as a binary problem and use sigmoid (1 neuron)?
You can just use sigmoid
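They're equivalent for two classes, which a quick NumPy sketch can show (the logits here are made-up numbers): a two-class softmax gives the same class-1 probability as a sigmoid applied to the difference of the two logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Two-class softmax over logits (z0, z1) gives the same class-1 probability
# as a sigmoid applied to the single logit difference z1 - z0.
z0, z1 = 0.3, 1.7
print(softmax(np.array([z0, z1]))[1])  # ~0.8022
print(sigmoid(z1 - z0))                # ~0.8022
```

So a 1-neuron sigmoid head and a 2-neuron softmax head describe the same set of models for binary classification; the sigmoid version just has fewer parameters, which is why it's the usual choice.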
ReLU has a sharp bend at 0. How is it differentiable then?
Thanks, very nice explanation.
Please do a detailed video regarding the difference between multilayer neural network and deep neural network and the evolution. Pleeeease!
But if you use a ReLU, couldn't the value passed from layer to layer get too big to compute?
So if ReLU is best for the hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Which video is this at 3:25?
careless whisper parody
Thanks and nice channel btw :)