Face Recognition Problem Solving

4 min read · Jun 7, 2020

I have a grasp of YOLO, but I still need to implement it myself in order to understand it fully. I cannot afford to waste time, though, so on to Face Recognition!

One Shot Learning

Learn from one example: that is what we have to do. If the network has seen a person's face just once, it must still recognize that person and say that any intruder is not them! So ONE example should be enough to recognize the person. What is the approach we learnt first?

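If we are to recognize about 5 persons, we might use a NN with a softmax output of 6 units: one for each of the 5 persons and a 6th for anyone who is none of them. But can a single example per person train this? No, it can’t. And every time a new person is added, we would have to change the output layer and retrain. We need our network to learn more, and do more!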

So instead of learning the characteristics of each person, we train our network to learn the difference between two images!

So d(img1,img2) measures the difference between two images. If both images are of the same person, d should be small; if they are of different persons, d should be large.

So we predict the same person when d(img1,img2) ≤ tau, where tau is a hyperparameter. Seems easy, right? Compute d(img1,img2): if the images are of the same person it returns a very small number, and if they are of different persons it returns a very large number. So what we actually need is this “d”.
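To make the decision rule concrete, here is a minimal sketch. The names `placeholder_d` and `same_person`, the value of tau, and the toy images are all made up for illustration; the pixel-wise placeholder is not a good d, it only stands in for the learned function so that the thresholding can be run.

```python
import numpy as np

def placeholder_d(img1, img2):
    # Stand-in for the learned d: mean squared pixel difference.
    # A real system would use a learned function here instead.
    return float(np.mean((img1 - img2) ** 2))

def same_person(img1, img2, d=placeholder_d, tau=0.1):
    # Decision rule from the text: same person if d(img1, img2) <= tau.
    return d(img1, img2) <= tau

# Toy "images": an all-black image, a nearly identical copy, and an all-white image.
img_a = np.zeros((8, 8))
img_a_again = img_a + 0.01
img_b = np.ones((8, 8))

print(same_person(img_a, img_a_again))  # small d, so treated as the same person
print(same_person(img_a, img_b))        # large d, so treated as different persons
```

Everything interesting is hidden inside d; the threshold check itself is one line.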

It is not as trivial as it sounds, but hopefully we can bang it!

Siamese Network

If we have two images, we send the first image through the network to get f(x1), and the second image through the same network with the same parameters to get f(x2). The distance is then d(x1,x2) = ||f(x1)-f(x2)||². This setup, where both images x1 and x2 are sent through the same network with the same parameters, is called a Siamese network.

So what we want is for our NN to learn the d function. More precisely, the network learns the encoding f, and d is the squared distance between the two encodings, so training f is what shapes d.

So if x1 and x2 are the same person, we want our network to learn encodings for which d is small. If they are different persons, we want d to be large.
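The shared-parameter idea can be sketched as below. The encoder `f` here is a toy single linear layer with a tanh, and the weights `W`, the 8x8 image size and the 4-dimensional encoding are assumptions picked only to keep the example small; a real Siamese network would use a deep convolutional encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameters: one linear layer from a flattened
# 8x8 image (64 values) to a 4-dimensional encoding.
W = 0.1 * rng.normal(size=(64, 4))

def f(x, W):
    # Toy encoder: flatten the image, apply the linear layer, squash with tanh.
    return np.tanh(x.reshape(-1) @ W)

def d(x1, x2, W):
    # Both images go through the SAME encoder with the SAME parameters W.
    diff = f(x1, W) - f(x2, W)
    return float(np.sum(diff ** 2))  # squared L2 distance between encodings

x1 = rng.random((8, 8))
x2 = rng.random((8, 8))

print(d(x1, x1, W))  # an image compared with itself gives exactly 0.0
print(d(x1, x2, W))  # two different images give some non-negative distance
```

Because there is only one set of parameters, d is symmetric: swapping x1 and x2 gives the same value.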

Triplet Loss

In our network we feed in three images (hence the name triplet): the first is a reference image called the “anchor”, the second is another image of the same person, the “positive”, and the third is an image of a different person, the “negative”. So we feed in three images, and what do you think we want our loss function to do?

Minimize the distance between anchor and positive example and maximize the distance between anchor and negative example.

loss = d(anchor,positive) - d(anchor,negative). So when we minimize the loss function, we minimize d(anchor,positive) and maximize d(anchor,negative).

Sometimes the NN can give trivial solutions, where d(anchor,positive) and d(anchor,negative) are both 0. So we modify the loss function to d(anchor,positive) - d(anchor,negative) + alpha, where alpha is a margin hyperparameter, so that there is still some loss to backpropagate in that case. Why do we need to do this?

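Because the NN has only one objective: to drive the loss down to zero. Researchers have found that the NN can sometimes find a trivial solution in which it encodes anchor, positive and negative all as ZEROS. Then both terms are 0, and 0 - 0 gives a loss of 0 without the network having learnt anything. With alpha added, that solution leaves a loss of alpha, so the network has to push d(anchor,negative) above d(anchor,positive) to bring the loss down.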
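A quick numeric check of this argument, as a sketch: the function name `margin_term` and the value alpha = 0.2 are assumptions for illustration, and d is taken to be the squared L2 distance between encodings as in the Siamese section.

```python
import numpy as np

def margin_term(anchor, positive, negative, alpha):
    # d(anchor, positive) - d(anchor, negative) + alpha,
    # with d the squared L2 distance between encodings.
    d_ap = float(np.sum((anchor - positive) ** 2))
    d_an = float(np.sum((anchor - negative) ** 2))
    return d_ap - d_an + alpha

zeros = np.zeros(128)  # the collapsed "everything is zero" encoding
ones = np.ones(128)    # an encoding far away from the anchor

# Collapsed encodings: without a margin the term is 0, so nothing is left to learn.
collapsed_no_margin = margin_term(zeros, zeros, zeros, alpha=0.0)
# With a margin, the same collapsed encodings leave a term of alpha.
collapsed_with_margin = margin_term(zeros, zeros, zeros, alpha=0.2)
# A well separated negative drives the term far below zero.
separated = margin_term(zeros, zeros, ones, alpha=0.2)

print(collapsed_no_margin, collapsed_with_margin, separated)
```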

Here is what we do in the loss function. Suppose our batch has 10 triplets. The network outputs an encoding for each image, a vector of 128 numbers, for each anchor, positive and negative example. So the output has shape (10,128) for each of the three.

What we do next is calculate the squared L2 distance between the anchor and positive encodings, summing over the 128 encoding values. What would be the shape of the result? It should be (10,) or (10,1), as we don’t sum over examples at the moment.

The same goes for anchor and negative…

Then we take the difference for each example and add alpha. So we subtract the anchor-negative distances of shape (10,) from the anchor-positive distances of shape (10,), add alpha, and take the maximum between 0 and the result so that our loss never goes negative.

Then we sum over all the examples to get our triplet loss!
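The steps above can be sketched in NumPy as follows. This is a sketch under assumptions, not a reference implementation: the function name `triplet_loss`, the margin alpha = 0.2, and the random encodings standing in for real network outputs are all made up for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Each input holds one encoding per row, e.g. shape (10, 128).
    # Squared L2 distance per example: sum over the encoding axis only.
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)  # shape (10,)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)  # shape (10,)
    # Add the margin and clip at 0 so no example contributes a negative loss.
    per_example = np.maximum(pos_dist - neg_dist + alpha, 0.0)
    # Sum over all the examples in the batch.
    return float(np.sum(per_example))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(10, 128))
positive = anchor + 0.01 * rng.normal(size=(10, 128))  # close to the anchor
negative = rng.normal(size=(10, 128))                  # unrelated encodings

easy_loss = triplet_loss(anchor, positive, negative)
collapsed_loss = triplet_loss(anchor, anchor, anchor)  # all distances are 0

print(easy_loss)       # positives are near, negatives are far
print(collapsed_loss)  # every example contributes exactly alpha
```

In the collapsed case every distance is 0, so each of the 10 examples contributes alpha and the total is 10 * alpha.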

Difference between Face Recognition and Face Verification

Face Verification is when you say “I am this person” and the machine checks that you are not an impostor. So it is 1:1 checking!

Face Recognition is when you don’t say anything: you come near the machine, it checks you against the whole database, and if it finds you, you are in, else you are not! So it is 1:K checking for a database of K persons.

In Face Recognition, the minimum distance is first set to a large value. The encoding of your image is then compared against the encoding of every image in the database, and the minimum distance is updated whenever a smaller one is found. After the iteration is done, if the minimum distance is less than the threshold, you are recognized as the closest person in the database and you are good to go; if not, you are not someone the machine wants to see.
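Both checks can be sketched side by side. Everything named here is hypothetical: the functions `verify` and `recognize`, the threshold of 0.7, the tiny 3-dimensional encodings and the names in the database are placeholders, and the distance is assumed to be the squared L2 distance between encodings.

```python
import numpy as np

def distance(enc1, enc2):
    # Squared L2 distance between two encodings.
    return float(np.sum((enc1 - enc2) ** 2))

def verify(encoding, claimed_name, database, threshold=0.7):
    # 1:1 check: compare only against the person you claim to be.
    return distance(encoding, database[claimed_name]) < threshold

def recognize(encoding, database, threshold=0.7):
    # 1:K check: start with a large minimum distance and scan the whole database.
    min_dist = float("inf")
    identity = None
    for name, stored in database.items():
        dist = distance(encoding, stored)
        if dist < min_dist:
            min_dist, identity = dist, name
    # Accept the closest match only if it is close enough.
    if min_dist < threshold:
        return identity
    return None

# Toy database of stored encodings (real ones would have e.g. 128 values).
database = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
visitor = np.array([0.9, 0.1, 0.0])   # close to alice's stored encoding
stranger = np.array([0.0, 0.0, 5.0])  # far from everyone in the database

print(verify(visitor, "alice", database))  # the claim matches
print(verify(visitor, "bob", database))    # an impostor claim
print(recognize(visitor, database))        # closest match within the threshold
print(recognize(stranger, database))       # nobody is close enough
```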

Written by Sanjiv Gautam