Cost Functions in Machine Learning

Sanjiv Gautam
3 min read · Mar 21, 2020


This is the second part of my notes from studying the deep learning course on the Toronto website. Most of the wording is derived from it, so I claim no copyright; however, I have added some things of my own. These stories I am posting are for me, not for others! You may not understand them even if you ever come across them!

Choosing a Cost Function

Squared Loss:

We cannot really use MSE for classification problems. Well, we can, but it is too error prone. For example, if one of our predicted values is 9 when it was supposed to be 0, the squared error would be 81/2. It would still work, but it is not convenient.

Logistic Function:

σ(z) = 1/(1 + exp(−z)). We add the logistic function, which squashes the output between 0 and 1, and now we can use MSE. Still not effective, though. Why? We will see that next.
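A tiny numpy sketch of the squashing, just for intuition (the helper name sigmoid is my own):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # roughly [0.0067, 0.5, 0.9933]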

Using MSE for classification gives small gradients. The problem with squared-error loss in the classification setting is that it does not distinguish bad predictions from extremely bad predictions for positive samples. If t = 1, then a prediction of y = 0.01 has roughly the same squared-error loss as a prediction of y = 0.00001.

For 0.01, we have MSE = 0.5*(1 - 0.01)**2 = 0.49005.

For 0.00001, we have MSE = 0.5*(1 - 0.00001)**2 = 0.49999000005. The two losses are almost identical, even though one prediction is 1000 times smaller than the other. This brings us to the CROSS ENTROPY LOSS.
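A quick numpy check of the arithmetic above, before moving on:

import numpy as np

t = 1.0
for y in (0.01, 0.00001):
    print(y, 0.5 * (t - y) ** 2)
# 0.01  -> 0.49005
# 1e-05 -> 0.49999000005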

Cross Entropy Loss

L(y, t) = −t log y − (1 − t) log(1 − y), where y is the predicted value and t is the actual value. This loss also has a problem: suppose y is so small that it is close to 0 while t = 1, i.e. our positive sample gets predicted as (nearly) negative during the initial phase of training. Then the derivative dL/dy becomes very large, which causes numerical difficulties.
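To see both points on the same two predictions as before (the derivative dL/dy = −t/y + (1 − t)/(1 − y) is just the calculus, written out for the check):

import numpy as np

t = 1.0
for y in (0.01, 0.00001):
    loss = -t * np.log(y) - (1 - t) * np.log(1 - y)
    dL_dy = -t / y + (1 - t) / (1 - y)
    print(f"y={y}: loss={loss:.2f}, dL/dy={dL_dy:.0f}")
# y=0.01:  loss=4.61,  dL/dy=-100
# y=1e-05: loss=11.51, dL/dy=-100000

Cross entropy now clearly separates the two predictions, but the gradient with respect to y explodes as y goes to 0.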

So here comes the logistic cross-entropy: L(z, t) = t log(1 + exp(−z)) + (1 − t) log(1 + exp(z)), where z is the input to the logistic function. We can implement this in numpy as E = t * np.logaddexp(0, -z) + (1-t) * np.logaddexp(0, z). Surprisingly, its derivative with respect to z is (y − t), just like the MSE derivative (you can check it yourself), and it handles the situation where the predicted value is very small and the actual value is 1.
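A minimal sketch of that expression, assuming t ∈ {0, 1} and z is the pre-sigmoid logit; the extreme logits are chosen to show that np.logaddexp stays finite where a naive log(1 + exp(z)) would overflow:

import numpy as np

def logistic_cross_entropy(z, t):
    # Stable form: log(1 + exp(-z)) is np.logaddexp(0, -z).
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)

z = np.array([-800.0, 0.0, 800.0])   # logits this large would overflow np.exp
t = np.array([1.0, 1.0, 0.0])
print(logistic_cross_entropy(z, t))  # roughly [800, 0.693, 800] -- finite, no overflow
# The gradient with respect to z is simply y - t, where y = sigmoid(z).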

Hinge Loss

L(y, t) = max(0, 1 − ty), where t is the actual value (here t ∈ {−1, +1}) and y is the predicted score.
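A minimal sketch, assuming t ∈ {−1, +1} targets and raw (unsquashed) scores y:

import numpy as np

def hinge_loss(y, t):
    # Zero loss once the prediction is on the correct side with margin >= 1.
    return np.maximum(0.0, 1.0 - t * y)

t = np.array([1.0, 1.0, -1.0])
y = np.array([2.0, 0.3, 0.5])    # confidently correct, weakly correct, wrong
print(hinge_loss(y, t))          # [0.  0.7  1.5]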

Multiclass Classification

We use the softmax function for multiclass classification.

loss = −∑ t log(y), which can be written in vector form as −transpose(t).log(y).

z = Wx + b

y = softmax(z)

L = −transpose(t).(log y)

The loss function is called softmax cross-entropy.

∂L/∂z = y − t (derive it yourself). Before, we used to have y = 0.05 (for example) and t = 1, but now we have them as vectors, i.e. y = [0.1, 0.2, 0.7], t = [0, 0, 1]. Nothing is different!
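A small end-to-end sketch of the three equations above (W, b and x are made-up values; the max-subtraction inside softmax is the usual stability trick, not something from the course notes):

import numpy as np

def softmax(z):
    # Subtracting max(z) does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

W = np.array([[0.2, -0.1], [0.0, 0.3], [0.5, 0.1]])  # made-up weights
b = np.array([0.1, -0.2, 0.0])                       # made-up biases
x = np.array([1.0, 2.0])
t = np.array([0.0, 0.0, 1.0])                        # one-hot target

z = W @ x + b          # z = Wx + b
y = softmax(z)         # y = softmax(z)
L = -t @ np.log(y)     # L = -transpose(t).(log y)
print(L, y - t)        # the loss, and dL/dz = y - t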

Most of us use softmax as the output layer and not as an activation function in hidden layers. Why? The answer is from Stack Overflow, but I will write it here.

Variable independence: if you use a softmax layer as a hidden layer, you make all your nodes (hidden variables) dependent on each other, because their activations must sum to 1. This may result in many problems and poor generalization.

Expressive power is lost: two activations might both provide useful contributions, but because softmax forces them to sum to 1, their magnitudes are suppressed, and this can hurt the model.
