Optimization — Neural Network

Sanjiv Gautam
4 min read · Mar 21, 2020


Many things can go wrong once the gradient is calculated, so we will look at optimization. Derived from the cs.toronto.edu lecture notes. Thanks to them!

Today, we will learn about gradient descent geometrically, though without figures!

Gradient Descent: the whole training set is used, the mean gradient over all examples is computed via backpropagation, and then a single weight update is made.

Stochastic Gradient Descent: a single example is taken, its gradient is computed via backpropagation, and the weights are updated immediately. A concrete sketch of the two update rules follows below.
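To make the contrast concrete, here is a minimal NumPy sketch of both update rules on a toy linear regression with a squared-error cost; the data and learning rate are made up for illustration:

```python
import numpy as np

# Toy data for illustration: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)

def grad(w, xb, yb):
    # Gradient of the mean squared error 0.5*mean((xw - y)^2) w.r.t. w
    return xb.T @ (xb @ w - yb) / len(yb)

alpha, w_gd, w_sgd = 0.1, np.zeros(1), np.zeros(1)

# Gradient descent: one update per pass over the WHOLE training set
for epoch in range(50):
    w_gd -= alpha * grad(w_gd, X, y)

# Stochastic gradient descent: one update per single example
for epoch in range(50):
    for i in rng.permutation(len(y)):
        w_sgd -= alpha * grad(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd, w_sgd)  # both approach the true weight 3
```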

Debugging Gradient Descent:

Incorrect gradient computations

If your computed gradients are wrong, then all bets are off. If you’re lucky, the training will fail completely and you’ll notice something is wrong. If you’re unlucky, it will sort of work, but it will also somehow be broken. This is much more common than you might expect: it’s not unusual for an incorrectly implemented learning algorithm to perform reasonably well. But it will perform a bit worse than it should, and it will be harder to tune, since some of the diagnostics may give misleading results when the gradients are wrong. Therefore, it’s completely useless to do anything else until you’re sure the gradients are correct. If you’re using one of the major neural net frameworks, you’re pretty safe, because the gradients are computed automatically by a system that has been thoroughly tested. I use TensorFlow, so I am pretty safe!
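If you do write your own backward pass, the standard sanity test is a finite-difference gradient check. Here is a minimal sketch, where the quadratic cost is just a stand-in for your own loss:

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference estimate of df/dtheta, one coordinate at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d.flat[i] = eps
        g.flat[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

# Stand-in cost and its hand-derived gradient
f = lambda t: 0.5 * np.sum(t ** 2)
analytic_grad = lambda t: t

theta = np.random.randn(5)
err = np.max(np.abs(numerical_grad(f, theta) - analytic_grad(theta)))
print("max abs error:", err)  # should be tiny (~1e-9); anything large means a bug
```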

Local Optima

In general, it’s very hard to diagnose whether you’re stuck in a local optimum. In many areas of machine learning, one tries to ameliorate the issue using random restarts: initialize the training from several random locations, run the training procedure from each one, and pick whichever result has the lowest cost (a sketch follows below). In practice, local optima are usually fine, so we think about training in terms of converging faster to a local optimum rather than finding the global optimum.
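A minimal sketch of random restarts, where `train(seed)` is just a stand-in for a real training run that initializes randomly and returns the final cost and parameters:

```python
import random

def train(seed):
    """Stand-in for a real training run: random init -> train -> (final cost, params)."""
    random.seed(seed)
    params = random.uniform(-1, 1)        # pretend this is the trained parameter
    return (params - 0.5) ** 2, params    # fake final cost for illustration

best_cost, best_params = float("inf"), None
for seed in range(5):                     # several random restarts
    cost, params = train(seed)
    if cost < best_cost:                  # keep whichever run found the lowest cost
        best_cost, best_params = cost, params
print(best_cost, best_params)
```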

Symmetries

Suppose we initialize all the weights and biases of a neural network to zero. All the hidden activations will be identical, and we can check by inspection that all the weights feeding into a given hidden unit will have identical derivatives. Therefore, these weights will have identical values in the next step. With nothing to distinguish different hidden units, no learning will occur. That is why we generally initialize weights randomly, e.g. with the Glorot initializer or uniform initialization!
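In Keras, for example, breaking symmetry is the default: `Dense` layers use the Glorot uniform initializer unless you override it. A minimal sketch, assuming TensorFlow 2.x:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # kernel_initializer='glorot_uniform' is already the default; shown explicitly here
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="glorot_uniform",
                          bias_initializer="zeros",  # zero biases are fine; zero WEIGHTS are not
                          input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
```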

Slow progress

Caused by a learning rate that is too small!

Instability and oscillations

Learning rate too high => instability: the cost shoots off (diverges)!

Learning rate low enough to avoid instability, but still too high => oscillations around the minimum!

Since oscillations are hard to detect, we simply try to tune the learning rate, finding the best value we can. Typically, we do this using a grid search over values spaced by factors of roughly 3, i.e. {0.3, 0.1, 0.03, …, 0.0001}. The learning rate is one of the most important hyperparameters, and one of the hardest to choose a good value for a priori, so it is usually worth tuning carefully (a sketch of the grid search follows below). One more idea is called momentum.
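A minimal sketch of that grid search; `train_and_evaluate` is a stand-in for your real training loop returning a validation cost:

```python
def train_and_evaluate(lr):
    # Stand-in for the real training loop; returns a fake validation cost here
    return (lr - 0.03) ** 2   # pretend 0.03 is the sweet spot

learning_rates = [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001]
results = {lr: train_and_evaluate(lr) for lr in learning_rates}
best_lr = min(results, key=results.get)   # keep the rate with the lowest validation cost
print("best learning rate:", best_lr)
```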

Momentum: HERE WE GOOOOO!

The formula is as follows:

p ← µp − α∇J(θ)

θ ← θ + p

where µ is the momentum parameter (0 ≤ µ ≤ 1), α is the learning rate, and ∇J(θ) is the gradient of the cost.

If µ = 0, this reduces to ordinary gradient descent. If µ = 1, the momentum never decays, i.e. it corresponds to frictionless motion: even when we reach the minimum, the accumulated momentum carries us past it and the cost can increase again. So we always set it to less than 1. In practice, µ = 0.9 is a reasonable value. Momentum sometimes helps a lot, and it hardly ever hurts, so using it is standard practice. How momentum works will be explained in another post.
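A minimal NumPy sketch of the momentum update, again on a toy quadratic cost so the gradient is easy to write down:

```python
import numpy as np

grad = lambda theta: theta          # gradient of the toy cost 0.5*theta^2
theta = np.array([5.0])
p = np.zeros_like(theta)            # the "velocity" / momentum buffer
alpha, mu = 0.1, 0.9

for step in range(100):
    p = mu * p - alpha * grad(theta)   # p <- mu*p - alpha*gradient
    theta = theta + p                  # theta <- theta + p

print(theta)   # close to the minimum at 0
```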

Fluctuations

The difference between fluctuations and oscillations is the cause. Oscillations are caused by a badly chosen learning rate, whereas fluctuations are caused by the stochastic nature of SGD: the cost can worsen when an individual example’s gradient points in the wrong direction. On average the gradient direction is right, but individual gradients may be misleading. One solution to fluctuations is to decrease the learning rate, but this can slow down progress too much. It’s actually fine to have fluctuations during training, since the parameters are still moving in the right direction “on average.” A better approach to deal with fluctuations is learning rate decay:

α(t) = α₀ exp(−t/τ)

where t = number of iterations and τ = decay timescale.
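A minimal sketch of that schedule in plain Python; the α₀ and τ values are made up:

```python
import math

def decayed_lr(alpha0, t, tau):
    """alpha(t) = alpha0 * exp(-t / tau)"""
    return alpha0 * math.exp(-t / tau)

# Example: start at 0.1 with a decay timescale of 1000 iterations
for t in [0, 500, 1000, 2000]:
    print(t, round(decayed_lr(0.1, t, 1000), 4))
# 0 0.1, 500 0.0607, 1000 0.0368, 2000 0.0135
```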

Another neat trick for dealing with fluctuations is iterate averaging. Not interesting enough for me to give it attention.

Dead and saturated units

Saturated unit: the activation is very large. For a sigmoid, it is almost 1.

Dead unit: the activation is very small. For a sigmoid, it is near 0.

Such units contribute almost nothing to the gradient, because the sigmoid’s derivative is nearly zero at both extremes. We can check for saturated units by plotting a histogram of the activations: if the mass is concentrated at the endpoints (at a fixed value), the unit is saturated.

The solution is to use Xavier (Glorot) initialization: properly scaled weights keep the activations from saturating. Or we can use a non-saturating activation such as ReLU instead of the sigmoid. A quick sketch of the histogram check follows below.
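Here is one way to do that histogram check in Keras; the tiny model and random batch are just placeholders for your own network and data:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Toy model and data, just to illustrate the check
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="sigmoid", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
x = np.random.randn(256, 10).astype("float32")

# Feed a batch through the hidden layer and plot the activation histogram
acts = model.layers[0](x).numpy().ravel()
plt.hist(acts, bins=50)
plt.title("Hidden sigmoid activations")  # mass piled up at 0 or 1 => saturation
plt.show()
```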

Badly conditioned Curvature

This is by far the worst problem we have to deal with: badly conditioned curvature! In directions of high curvature, we want to take a small step, because we can overshoot very quickly; in directions of low curvature, we want to take a large step to make progress. Gradient descent does the opposite: it takes large steps in directions of high curvature and small steps in directions of low curvature. This is the problem called badly conditioned curvature.

In practice, neural network training is very badly conditioned. This is likely a big part of why modern neural nets can take weeks to train. We just have to live with badly conditioned curvature.

A small attempt to mitigate it? Batch Norm and the Adam optimizer. A sketch of both follows below.
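A minimal Keras sketch of those two mitigations together; the architecture and hyperparameters are just placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(10,)),
    tf.keras.layers.BatchNormalization(),  # normalizes activations, improving conditioning
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(1),
])
# Adam adapts a per-parameter step size, which helps with uneven curvature
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```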
