Can I ask why lost information can be recovered if ReLU is applied to a high-dimensional activation? What about when it is applied to a low-dimensional activation?
You're welcome, brother. I'm not sure how else I can help, because I explained the authors' justification for exactly this question in that same video.
@zardouayassir7359 Thank you ^^
Great explanation. Just one question: if ReLU destroys important information for negative inputs, why use it at all? Wouldn't it be better to use an activation function that returns a non-zero output for negative inputs? That way we could reduce the number of dimensions and wouldn't need to do all of this.
* As far as I understand from the paper, what destroys information is not ReLU per se but rather activation functions in general.
* An activation function is non-linear; non-linearity distorts a layer's activations and causes information loss. Recall that lost information can be recovered if ReLU is applied to a high-dimensional activation (see the small numerical sketch below).
* The ReLU example I explained serves as a good intuition for why non-linearity destroys information.
* There is a formal mathematical explanation of this phenomenon in the paper's supplemental material.
Apologies for the late answer. Good luck!
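To make that intuition concrete, here is a minimal NumPy sketch. It is my own illustration, not the authors' code; the random matrix T, the random 2-D point set x, and the least-squares recovery step are all assumptions on my part. The idea: embed 2-D points into n dimensions with a random linear map, apply ReLU, and then check whether the original point can be recovered exactly from the coordinates ReLU left positive. ReLU acts as the identity on those coordinates, so as long as at least two independent ones survive, the original 2-D point can be solved for, and that becomes almost certain as the output dimension grows.

```python
import numpy as np

# Minimal sketch (assumed setup, not from the paper or video): embed 2-D points
# into n dimensions with a random matrix T, apply ReLU, and try to recover each
# point from the coordinates ReLU did not zero out.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2))           # 1000 random 2-D input points

for n in [2, 3, 5, 15, 30]:
    T = rng.standard_normal((n, 2))          # random expansion from 2 to n dims
    recovered = 0
    for p in x:
        y = np.maximum(T @ p, 0.0)           # ReLU on the n-dimensional activation
        active = y > 0                       # coordinates where ReLU acted as identity
        # If at least two active coordinates remain (and their rows of T are
        # independent), the surviving linear equations can be solved for p.
        if active.sum() >= 2 and np.linalg.matrix_rank(T[active]) == 2:
            p_rec, *_ = np.linalg.lstsq(T[active], y[active], rcond=None)
            if np.allclose(p_rec, p):
                recovered += 1
    print(f"output dim = {n:2d}: {100.0 * recovered / len(x):5.1f}% of points exactly recoverable")
```

With a low output dimension (n = 2 or 3), a large fraction of points lose too many coordinates to be recovered; by n = 15 or 30, essentially every point is exactly recoverable. That is the sense in which applying ReLU on a high-dimensional (expanded) activation lets the lost information be recovered, while on a low-dimensional activation it cannot.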
@zardouayassir7359 Understood. Thank you!
Why don't they use a different activation function, like ELU or leaky ReLU, to prevent information loss?
* As far as I understand from the paper, what destroys information is not ReLU per se but rather activation functions in general.
* An activation function is non-linear; non-linearity distorts a layer's activations and causes information loss. Recall that lost information can be recovered if ReLU is applied to a high-dimensional activation.
* The ReLU example I explained serves as a good intuition for why non-linearity destroys information.
* There is a formal mathematical explanation of this phenomenon in the paper's supplemental material.
Good luck!
Great explanation. Thank you.
Thank you 😊
Happy I helped :)