Dropout,Momentum,Adam with code — I love this!

4 min readMay 26, 2020

DROPOUT

I am not telling what drop out is. We will look into the code directly!

We have keep_prob, so many neurons do we actually want to keep and how many do we want to deactivate? If we want to deactivate 30%, keep prob is 0.7.

We use dropout in both forward and backward propagation, we keep track of where we want to use the dropout in forward propagation and later use it in backward propagation!

We cache the value of dropout in D1. D1 is like mask. This is little tricky. The probability that value is greater than 0.7 is less than it being less than 0.7. since it is keep probability, so if any entries in D1 is less than 0.7, we set it as 1, not 0 (which seems true at the first glance)!

Why we divide by keep_prob? Make sure the expected or overall effect remains as it is.Diving doesn’t make it equal though!

BackPropagation with Dropout

We stored the mask and apply them again!

A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.

Momentum

Before momentum, I would like to explain mini-batch gradient descent, as that brought the need of momentum. So what happened with gradient descent is that we train the whole dataset to get the gradient, and calculate the mean gradient and then update the grads. The problem with this is that having millions of dataset requires more waiting time for them to forward propagate.

So they came with an idea of Stochastic Gradient Descent which says : “I dont want to wait till all dataset gets forward propagated for cost, I can depend on one training example to calculate dJ/dW and update the grads accordingly!”

It worked. Perfectly, faster than stochastic as you don’t have to wait for all training examples to forward propagate! The problem with this approach is that it ran into oscillations of high degree!

To prevent this, they introduced mini-batch gradient descent! So instead of depending on single dataset to update the parameters, we take bunch of them, (like 32,64) depending on total number of datasets! So, we take the gradient step on cost function of only those datasets (not whole dataset, not single dataset, but a batch of them).

It suffered less oscillations, but yet the oscillations made them go for MOMENTUM!

So what is momentum?

The oscillation controller. Gradient direction is always perpendicular to the parameters. So we take into account the gradients of previous steps and weight it with beta (the hyperparameter of momentum).

For every parameter we have momentum.

v is the momentum. We first initialise v as zeros, v is same shape as parameters (we need momentum for every params!). So we update v with the hyperparameter beta and update param with (1-beta).

How to choose beta?

Larger the momentum, smoother the update! Do not make it too big, it would smoothen the param too much. The common value is taken as 0.9 for beta!