Sigmoid vs ReLU — The battle of the activation functions.
Sigmoid has been our friend while training neural networks, but I can’t help but notice that ReLU has overtaken it!
Advantages of ReLU:
No vanishing gradient (well, mostly). Sigmoid squashes its input to a value between 0 and 1, and its gradient is always small: at most 0.25, and far smaller when the input is far from zero. If the gradient is that close to zero, each update has less impact and training takes longer to converge.
ReLU’s gradient is faster to compute, too. Instead of sigmoid(x)*(1-sigmoid(x)) for sigmoid’s derivative, ReLU’s derivative is simply 0 or 1, a single comparison with no exponential. The quick check below makes both points concrete.
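Here is a minimal NumPy sketch of that check (purely illustrative, not taken from any framework):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)          # needs an exponential first
    return s * (1.0 - s)    # peaks at 0.25 when x = 0

def relu_grad(x):
    return (x > 0).astype(float)  # just a comparison: 0 or 1

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(xs))  # [4.5e-05 1.05e-01 2.5e-01 1.05e-01 4.5e-05] -> tiny at both tails
print(relu_grad(xs))     # [0. 0. 0. 1. 1.] -> exactly 1 for every positive input
```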
Okay. Hold on. So sigmoid can have a vanishing gradient: when the input is very large or very small, the gradient goes near zero. Why does it go near zero in both cases? Because of the sigmoid(x)*(1-sigmoid(x)) form. If x is very large, sigmoid(x) is close to 1, so the (1-sigmoid(x)) factor is close to 0 and kills the product; if x is very negative, sigmoid(x) itself is close to 0 and kills the product. ReLU is different: its gradient is either 0 or 1. But it still gives 0 for negative inputs, isn’t that a vanishing gradient too? Yes, it is (those units are the so-called dead neurons). Still, it fixes half of what sigmoid brings, so we use ReLU, and when dead neurons become a problem we reach for better versions like LeakyReLU, ELU, etc.
So dead neurons cause problems. What do we do? We use batch norm to keep the pre-activations centered, and we apply other forms of ReLU, like the ones sketched below.
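For reference, here is roughly what those variants look like; a minimal NumPy sketch, with the commonly used default slopes (the alpha values) assumed:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for x < 0, so the gradient never becomes exactly zero
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth negative branch that saturates at -alpha instead of dying at 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.     0.5    3.   ]
print(elu(x))         # [-0.9502 -0.3935  0.     0.5    3.   ]
```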
RELU BACKPROP
What are the backward prop values?
Voila! That’s where you are wrong! When backpropagating, think of every value in terms of the impact it has on the loss function. In the forward pass, ReLU zeroes out the negative values, so those positions have no effect on the loss. So we keep track of which values were turned into zeros, and the gradients at every other position are what we keep! Difficult?
Check the answer below.
See, only the positions where the input was positive (where the mask is 1) have their gradient passed through in backprop; every other value is discarded.
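To make it concrete, here is a minimal NumPy sketch of that forward/backward masking (the names relu_forward and relu_backward are just illustrative):

```python
import numpy as np

def relu_forward(x):
    mask = (x > 0)        # remember which inputs were positive
    out = x * mask        # negative inputs are zeroed out
    return out, mask

def relu_backward(dout, mask):
    # gradient from the layer above passes through only where the input was positive
    return dout * mask

x = np.array([[-2.0, 3.0], [1.5, -0.5]])
out, mask = relu_forward(x)
dout = np.array([[10.0, 20.0], [30.0, 40.0]])  # pretend upstream gradients
dx = relu_backward(dout, mask)
print(out)  # [[0.  3. ] [1.5 0. ]]
print(dx)   # [[ 0. 20.] [30.  0.]] -> zeros exactly where ReLU zeroed the input
```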