Great video, super helpful!
thx Dan love u
You are both awesome
I absolutely love the energy you both have in your videos :)
Be soo cool if both did a collab video!
Thanks, my biological neural network now has learned how to choose activation functions!
awesome
Hahahah
Remember, the whole is more than its parts. The behaviour of the whole is different from that of its elements.
From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.
I'm using tanh, but I always read saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesn't disappear.
Really enjoyed the video as you add subtle humor in between.
Just watched your speech @TNW Conference 2017. I am really happy that you are growing every day. You are my motivation and my idol. Proud of you, love you.
thx stevey love u
I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!
hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!
Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff is really making sense to me!
I love you man, 4 f***** months passed and my stupid prof. could not explain it as you did, not even partially. keep up the good work.
Thanks a lot
I really like your videos as they strike the very sweet spot between being concise and precise!
Excellent and entertaining at a high level of entropy reduction. A fan.
Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!
Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.
Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)
Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
(And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my high school maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)
Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!
thx will do
"I can't control the gradient" is the best part of the video.
So the slide at 4:00 says "Activation functions should be differentiable", but the conclusion of the video is that you should use the ReLU activation function, which is not differentiable. (Great video btw.)
Daniel O'Connor 2 years later and you've probably figured it out, but I believe ReLU is non-differentiable only at exactly x=0, which is really rare in practice
this guy needs more subs. Finally a good explanation. Thanks man!
Learning more from your videos than all my college classes together!
Super Siraj Raval!!!!! Great compilation Bro.
But isn't ReLU a linear function? You mentioned at the beginning that linear functions should be avoided, since with non-linear functions both computing backpropagation and classifying data points that do not fit a single hyperplane are easier.
Or did I get the whole thing wrong?
It's not linear because any negative x sits at zero on the y-axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you're only looking at x > 0 or x < 0, but if you look at the whole line it's kinked in the middle, which makes it non-linear.
It is a piecewise linear function, which is essentially a non-linear function. For more info, google "piecewise linear functions".
The sparsity of the activations adds to the non-linearity of the neural net.
@@10parth10 that explanation helped. Thanks
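For anyone who wants to check the non-linearity claim above numerically, here's a minimal NumPy sketch (the `relu` helper is just defined here for illustration): a linear function must satisfy f(a + b) = f(a) + f(b), and ReLU breaks that as soon as the inputs straddle zero.

```python
import numpy as np

def relu(x):
    # ReLU: element-wise max(0, x)
    return np.maximum(0.0, x)

a = np.array([2.0])
b = np.array([-3.0])

# A linear function would have to satisfy f(a + b) == f(a) + f(b).
print(relu(a + b))        # [0.]  because a + b = -1, and relu(-1) = 0
print(relu(a) + relu(b))  # [2.]  because relu(2) = 2 and relu(-3) = 0
# The two results differ, so ReLU is not linear; it is only piecewise linear.
```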
Now I understood why we are using this activation function; till now I was just using them, now I know why I'm using them. Thanks Siraj!
I love watching these videos, even if I don't understand 90% of what he is saying.
Valuable introduction to generative methods for the establishment of sense in artificial intelligence. A great way of bringing things together and expressing it in one single, discrete language.
Thanks Siraj Raval, great!
Your way of teaching is so cool and crazy :)
Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.
Which software do you use to create the neural network and activation function animations, like @1:15 to @2:03 and @5:27 to @5:54?
Super clear & concise. Amazing simplicity. You Rock !!!
@Siraj
NN can potentially grow in so many directions, you will always have something to explain to us.
As you used to say 'this is only the beginning'.
And ohh maaan ! you're so clear when you explain NN ;)
Please keep doing what you're doing again and again and again...and again !
You are for NNs what Neil deGrasse Tyson is for astrophysics.
thx for sharing the GitHub source that details each activation function
By far the best Machine Learning videos I've watched. Amazing work! Love the energy and vibe!
If we use GAs (genetic algorithms) we do not need differentiable activation functions; we can even build our own function. The issue is the backpropagation method; that is what limits the activation functions.
Another question is what the difference is if I use more hidden layers or more hidden neurons.
I think that at this moment there's no clear-cut approach to how to choose the NN architecture.
More layers make learning very slow compared to more neurons. Before training, all the biases will overcome the inputs and make the output side of the network static. It takes a long time to get past that.
Maybe you should limit the starting biases so you can get past that phase quicker. I always initialize biases between 0 and 0.5.
2 Hidden layers are enough.
Depends on the situation: a simple text recognition task is fine with 2 layers, but something like a convolutional neural network may have to have 10. For the majority of things in this day and age, though, 2 is plenty.
1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaking minimum around 0; shouldn't ReLU also have a (leaking) maximum around 1?
2. Is there one best activation function, delivering the best neural network with the least amount of effort, like the amount of tests needed, and computer power?
3. Should weights and biases be between 0 and 1 or between -1 and 1? Or any different values?
4. Against vanishing and exploding gradients: can this be prevented with a (leaking) correction minimum and maximum for the weights and biases? There would be some symmetry then with the activation function suggested in the first paragraph.
Just gotta say, Siraj, you are amazing because I only understand half of what you say.
thx keep watching
OMG, this is the first time I am seeing his videos and it's quite entertaining.
Your channel is GOLD!
Crystal clear explanation, just loved it
Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.
This is a well thought out comment, and so is the reply to it, I see. Making them more relevant and sparse should help. I'll do that.
Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.
I read papers and articles... but a 10 min video helped me more than all of that :D
@@SirajRaval It keeps it fresh and helps me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....
8:44 I liked this motto on the wall.
This guy makes learning so much fun!
Why does it show at 6:38 that the derivative of tanh is 1 - x^2? It is very different from that.
This video is very easy to understand!
Love this video so much. Helped me so much with my LSTM RNN network
Hi Siraj:
Your videos are great!
CONGRATULATIONS!
You covered half of what my AI principles course covered on learning in 3 and a half hours, in 8 minutes. Nice.
digging your vids and enthusiasm from Portland Oregon!
hi Siraj,
you nailed it in a very short period of time. Loved it. Would like you to keep it up always. Cheers....
Thanks @Siraj. What an amazing and easy-to-digest explanation.
8x better than my data mining professor, thank you 🙏
siraj you are a good ai teacher
According to Andrew Ng, sigmoid is helpful at the output node, isn't it?
He relies more on ReLU. Sigmoid is passé.
Sigmoid is definitely helpful on the output layer/node
I think it’s fine to use on the output layer for some binary classification problem
Very helpful video, thanks a lot. Actually, to introduce non-linearities we are introducing the activation function. But how does ReLU, which looks linear, do better than other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)
Hello, can anyone tell me how the partial derivative at 1:52 works? I don't know what f(x) is in the first place (is that the sigmoid, a.k.a. our activation function?), so let's suppose f'(x) is the sigmoid derivative; then what is h? Is it some number tending toward 0? And why do we differentiate the sigmoid at (almost) 0? Also, is the result we obtain y or y hat? It's just this single point I'm trying to understand, the rest is clear.
Woah ! thanks man, you made things so clear !!!
Hard stuff made easy. Congrats to a great video! Keep it up, mate!
How do you detect dead ReLUs in your model though?
By viewing the activation function values in each layer?
After each epoch, check to see if any neurons have activations that are converging toward zero. The best way to do this would be to monitor the neurons over a series of epochs and calculate a delta or differential between training epochs.
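A rough sketch of that monitoring idea in plain NumPy (the `dead_relu_report` helper, the fake activations, and the 95% threshold are all illustrative assumptions on my part, not anything from the video): record the fraction of inputs for which each ReLU unit outputs zero, and flag units that are almost always zero.

```python
import numpy as np

def dead_relu_report(activations, zero_fraction_threshold=0.95):
    """activations: array of shape (num_samples, num_units) holding the
    post-ReLU outputs of one hidden layer on a batch of inputs."""
    # Fraction of samples for which each unit output exactly zero.
    zero_fraction = np.mean(activations == 0.0, axis=0)
    dead_units = np.where(zero_fraction >= zero_fraction_threshold)[0]
    return zero_fraction, dead_units

# Fake activations for a 4-unit layer over 1000 samples: unit 2 is pushed
# far into the negative pre-activation region, so it is effectively dead.
rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(1000, 4)) + np.array([0.0, 0.0, -5.0, 0.5])
acts = np.maximum(0.0, pre_acts)

zero_frac, dead = dead_relu_report(acts)
print("zero fraction per unit:", np.round(zero_frac, 2))
print("possibly dead units:", dead)  # expected: [2]
```

Storing `zero_frac` after every epoch and looking at how it changes between epochs is the delta the comment above describes; a unit whose zero fraction climbs to 1 and stays there is the usual sign of a dead ReLU.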
Dude.... exactly what i needed.. Thanks again!
Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?
will do thx
If I understand the subject right, you'll always only need one hidden layer, because of Cover's Theorem
Siraaaj you are greaattt 😍. Saved a lot of time of going through books 😂
thank you :)
Siraj Raval when are you coming to london? We hope to meet you soon buddy :)
Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!
Curious why does ReLU avoid vanishing gradient problem? When z is below 0, since y is always 0, the gradient seems to be 0, which means the gradient vanishes? Or do I misunderstand about the vanishing gradient?
Great explanation of activation functions. Now I need to tweak my model.
hard humor with gifs and memes makes me lose track of what Siraj is saying and had to rewind a bit ... LoL :)
In the case of using an LSTM, does ReLU make any difference?
Why does sigmoid function use e?
CUZ NOTHING LIKE e
Siraj, ur videos inspired me to study machine learning. I've been learning python for the past month, and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?
awesome. watch my playlist Learn Python for Data Science
Siraj Raval Do you have videos on MATLAB using NNs?
Excellent, as usual.
I think that the reason ReLU hasn't been popular prior to now is that it is mathematically inelegant, in that it can't be used in commutable functions, and a sigmoid function can.
It does beg the question though: if ReLU is being used, do we need to use the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.
Will two RTX 2080 OC cards be good with an i5 9400F for deep learning only?
Hi, I am confused. ReLU will kill the neuron only during the forward pass? Or also during the backward pass?
Yes!!! A new episode. SWEET!!! Thanks Siraj.
Dude.. I'm so happy i subscribed. Keep doing what you're doing please.
welcome! will do
Great video Siraj. Keep up the good work
thx love u
How do you make those animations? They are really nice! 1:41
final cut pro
Siraj Raval -> do you have any videos on continuous Hopfield networks, or just an article for me to read? I had a hard time finding a good one.
What is your thought on softplus?
How did you learn machine learning. What are the sources you used? I want to start from scratch.
use Coursera ML stanford course. best course that teaches you everything about ML.
Riken Maharjan Thank you. Is there any good book for ML?
My university uses "The Elements of Statistical Learning" for the introductory ML class. The book is intensely heavy with details and maths. It's free, so google the title. I think this book is used at most US universities. Try Coursera first and then this book.
My plan for this year:
first: do Coursera. (done)
second: watch and do HW of University of California Berkeley's AI course. (partially done)
third: watch and do HW of University of California Berkeley's Deep learning course. They have 3 deep learning courses all on the internet/youtube. (prereq is the second step)
fourth: watch the Neural Network videos of AI guru,Geoffrey Hinton at coursera. He is well respected in deep learning and neural network. (partially done)
Then finally watch Stanford's "The Elements of Statistical Learning" video lectures by the author of the book. (not planned yet)
This is my plan for a year. I am done with Coursera and have started the AI course at UCB. I will take ML again next semester officially at my university, then in spring probably deep learning and AI.
It will probably take a year to do everything, homework and projects included.
Riken Maharjan Awesome, but I'm not an undergraduate. So is it possible with other sources to obtain a data scientist position?
Hi Siraj, you mentioned that activation functions should be differentiable, but from my understanding ReLU is not. I was wondering how this affects backpropagation in our neural net.
From the math point of view it's not. But the only point where it's not differentiable is 0, for which you declare that the gradient is 0 or the gradient of the identity; it doesn't matter much because you're using float32 for an optimization problem, so you're very unlikely to hit this 0 case exactly. Just approximate it.
The purpose of the ReLU is to have sparse outputs and sparse gradients; it allows the network to 'activate paths'.
stackoverflow.com/questions/30236856/how-does-the-back-propagation-algorithm-deal-with-non-differentiable-activation
It doesn't matter in practice. You can return 0 or 1 when the input is at the non-differentiable point and it would do fine. Remember that neural networks are just approximators. Its algorithm is plain simple and dumb but it does the job.
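Here's a minimal sketch of what that convention looks like in a hand-written backward pass (NumPy; the function names are made up for illustration). The only decision is what local gradient to return at exactly x = 0, and as the replies above say, 0 or 1 both work fine.

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(grad_output, x, grad_at_zero=0.0):
    # Derivative is 1 for x > 0 and 0 for x < 0. At exactly x == 0 we just
    # pick a value (0 here; 1 would also be fine), as discussed above.
    local_grad = np.where(x > 0, 1.0, 0.0)
    local_grad = np.where(x == 0, grad_at_zero, local_grad)
    return grad_output * local_grad

x = np.array([-2.0, 0.0, 3.0])
upstream = np.ones_like(x)
print(relu_backward(upstream, x))                    # [0. 0. 1.]
print(relu_backward(upstream, x, grad_at_zero=1.0))  # [0. 1. 1.]
```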
Don't we use ln(1+exp(x)) instead of the real ReLU in practice? As far as I know, it's differentiable (and its derivative is super easy to calculate), has a similar shape to ReLU, and so on.
@Yunchan Hwang We actually appreciate the exact 0 output of the ReLU; it's valuable because it gives sparse outputs and gradients. If you use your function you can't 'deactivate' some paths (you only push them very close to 0, which is quite different). Also you have to consider the computation time: max(0, x) is far easier to compute than ln(1+exp(x)).
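A small sketch comparing the two, in case it helps (NumPy; writing softplus via `np.logaddexp` for numerical stability is an implementation choice on my part):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # ln(1 + exp(x)), computed stably as log(exp(0) + exp(x))
    return np.logaddexp(0.0, x)

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu(x))      # [0.     0.     0.     0.5    5.    ]  exact zeros, i.e. sparse
print(softplus(x))  # [0.0067 0.4741 0.6931 0.9741 5.0067]  never exactly zero
```

Those exact zeros are the sparsity the reply talks about, and max(0, x) is a single comparison while softplus needs an exp and a log.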
@siraj loved the explanation and the analogy in the beginning. Cheers! *in deep voice* Deep learning
thx swaroop love u
Is there any article which I can refer to ... for citation purposes ... I found out this is the best combo for my LSTM from training ... but it would be good if I could get a paper which says use ReLU ...
Thanks, for the video!
I have a question: Why should't I use tanh?
Tanh suffers from the vanishing gradient problem, i.e. the gradients become so small that the weights barely change the model, so we use ReLU, whose gradient doesn't vanish for positive inputs.
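A tiny numerical illustration of that point (NumPy sketch, with values picked arbitrarily): the derivative of tanh is 1 - tanh(x)^2, which collapses toward 0 as |x| grows, while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

x = np.array([0.5, 2.0, 5.0, 10.0])
print(np.round(tanh_grad(x), 4))  # [0.7864 0.0707 0.0002 0.    ]  shrinks fast
print(relu_grad(x))               # [1. 1. 1. 1.]                  stays at 1
```

Multiplying many of those tiny tanh gradients together across layers is exactly the vanishing-gradient effect being asked about.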
Are you serious, he literally just told you.
I've been wondering what loss function to use D: Can you make a video for loss functions pls :)
Crashed2DesktoP This is a little less generically answerable than which activation to use.
For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem", and so can be a little trickier to answer beyond the well-studied settings.
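To make the standard pairings concrete, here's a minimal Keras-style sketch (hedged: the layer sizes, the 10-dimensional input, and the `build` helper are arbitrary illustration choices; the loss strings are the usual Keras identifiers):

```python
# Keras-style sketch of the standard task -> loss pairings described above.
import tensorflow as tf

def build(output_units, output_activation, loss):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(output_units, activation=output_activation),
    ])
    model.compile(optimizer='adam', loss=loss)
    return model

binary_clf = build(1, 'sigmoid', 'binary_crossentropy')           # 2-class labels in {0, 1}
multiclass_clf = build(5, 'softmax', 'categorical_crossentropy')  # one-hot labels, 5 classes
regressor = build(1, 'linear', 'mean_squared_error')              # real-valued targets
```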
Sina Samangooei thanks for your answer. It is useful for me too. 😃
hmm. sina has a good answer but more vids similar to this coming
Log-likelihood cost function with a softmax output layer for classification.
Why don't we use e^x as an activation function, or any other polynomial function which is differentiable and does thresholding?
Quick question: if we use ReLU or tanh, then our output is no longer a probability?
Despised the stale memes, loved the explanation.
Great insight on Activation Functions , thanks
Entire video is a GEM 💎
Totally makes sense to use ML
What's a good way to test if your neurons are dying? Any heuristics to check?
Hey Siraj, if I want to visualize and understand a neural network using C/C++ data structures and syntaxes, how would I do it?
Excellent! Great educator! Thanks for producing and sharing!
thx
I have a question... For sigmoid activation functions with an output close to 1, would the vanishing gradient problem still cause no signal to flow through it? Or would it instead cause the output to be permanently saturated? Either way it would be an issue, but I'm just trying to wrap my head around this.
While the activation function must be non-linear, neural nets store weights as binary numbers. If the range is small enough, you can store each activation function value by looking up the weight in a table. In other words, for every possible x, given a function f(x), simply store the result f(x) in a table of x+1 entries where for every x value, value_table[x] = f(x). The time it takes to calculate the activation function becomes 0 for all intents and purposes, no matter how complex it might be. In an age when I can purchase gigabytes of memory for a couple of hundred bucks, it's hard to see why anyone would include a hyperbolic function calculation embedded in their innermost loops. Even the modified ReLU variants require more work than a simple table lookup. Furthermore, a simple table lookup can be much more easily coded into a matrix library calculation.
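Here's a rough sketch of that lookup-table idea (NumPy; the range [-8, 8], the table size, and the clamping are arbitrary illustrative choices, and this is not a claim about actual speed-ups):

```python
import numpy as np

# Precompute sigmoid over a fixed input range; anything outside is clamped.
TABLE_MIN, TABLE_MAX, TABLE_SIZE = -8.0, 8.0, 4096
_grid = np.linspace(TABLE_MIN, TABLE_MAX, TABLE_SIZE)
SIGMOID_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def sigmoid_lookup(x):
    # Map each input to the nearest precomputed table index.
    scaled = (x - TABLE_MIN) / (TABLE_MAX - TABLE_MIN) * (TABLE_SIZE - 1)
    idx = np.clip(np.rint(scaled).astype(int), 0, TABLE_SIZE - 1)
    return SIGMOID_TABLE[idx]

x = np.array([-3.0, 0.0, 2.5])
print(sigmoid_lookup(x))         # approx [0.0474 0.5    0.9241]
print(1.0 / (1.0 + np.exp(-x)))  # exact  [0.0474 0.5    0.9241]
```

Whether this actually beats a vectorised np.exp depends on the hardware and the framework, so treat it purely as an illustration of the idea in the comment above.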
Bro, I loved your content.
What if I have a classification problem with a 2-class label? Should I use softmax in the output layer (2 neurons)? Or can I treat it as a binary problem and use sigmoid (1 neuron)?
You can just use sigmoid
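They're equivalent for two classes, which a quick NumPy sketch can show (the logits here are made-up numbers): a two-class softmax gives the same class-1 probability as a sigmoid applied to the difference of the two logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Two-class softmax over logits (z0, z1) gives the same class-1 probability
# as a sigmoid applied to the single logit difference z1 - z0.
z0, z1 = 0.3, 1.7
print(softmax(np.array([z0, z1]))[1])  # ~0.8022
print(sigmoid(z1 - z0))                # ~0.8022
```

So a 1-neuron sigmoid head and a 2-neuron softmax head describe the same set of models for binary classification; the sigmoid version just has fewer parameters, which is why it's the usual choice.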
ReLU has a sharp bend at 0. How is it differentiable then?
Thanks, very nice explanation.
Please do a detailed video regarding the difference between multilayer neural network and deep neural network and the evolution. Pleeeease!
But if you use a ReLU, couldn't the value passed from layer to layer get too big to compute?
So if ReLU is best for the hidden layers and softmax/linear is best for the output, what is best for the input layer? Sorry, I'm new, but your video makes a lot of sense.
Which video is this at 3:25?
careless whisper parody
Thanks and nice channel btw :)