ResNets, Inception and more!

Sanjiv Gautam
5 min read · Jun 4, 2020

I know a bit of CNNs from cs231, but is that enough? Let's learn more from Andrew Ng.

Filters — Edge Detection

Okay. Tell me, what does this filter do? It detects vertical edges. Okay, but how?

Look at the filter closely. We are going to convolve this filter over the image, right? The left column of the filter is 1, so it responds to the brighter part, while the right column is -1, so it responds to the darker part. What we are interested in is finding whatever sits between a brighter left side and a darker right side, and that is a vertical edge.

Transitions of filters: in one case the left portion of the image is bright and the right is dark; in the other, the right portion is bright and the left is dark. No matter the sign of the result, check out the convolved figure below each one: the edge gets detected in either case.
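Here is a minimal numpy sketch of that idea (the 3*3 vertical-edge filter and the tiny half-bright/half-dark test image are my own toy values, just to illustrate):

```python
import numpy as np

# Vertical edge filter: 1s on the left column, -1s on the right column.
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# A 6x6 toy image: left half bright (10), right half dark (0).
image = np.zeros((6, 6))
image[:, :3] = 10

# Naive "valid" convolution (really cross-correlation, which is what CNNs use).
out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
output = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        output[i, j] = np.sum(image[i:i+3, j:j+3] * vertical_filter)

print(output)
# Large values appear only in the middle columns, exactly where the edge is.
```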

If we have a 10*10*3 image and we want to use a 3*3 filter, we actually use a filter of shape 3*3*3, matching the number of channels. So we convolve over the volume: we multiply filter and image over all channels and then sum (remember, we don't keep a separate sum per channel, we add everything at once). So for an RGB image, for the first 3*3*3 cube of the image, we take the summation over R, G and B and write it into the first pixel of the output. That's why the output volume's depth is not the input's channel count but the number of filters we use.
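A rough numpy sketch of convolving over volume, with shapes matching the example (the random values and the choice of 5 filters are just placeholders):

```python
import numpy as np

image = np.random.rand(10, 10, 3)                # H x W x channels
num_filters = 5
filters = np.random.rand(num_filters, 3, 3, 3)   # each filter is 3x3x3

out_h, out_w = 10 - 2, 10 - 2                    # "valid" convolution, stride 1
output = np.zeros((out_h, out_w, num_filters))

for f in range(num_filters):
    for i in range(out_h):
        for j in range(out_w):
            # one number per position: sum over height, width AND all 3 channels
            output[i, j, f] = np.sum(image[i:i+3, j:j+3, :] * filters[f])

print(output.shape)  # (8, 8, 5): depth = number of filters, not input channels
```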

ResNets

We have skip connections in ResNets. We connect the output of one layer to the output of a layer a few layers further on, and call it a skip connection. What does this do? Well, in theory the training error should keep decreasing as we add layers, but that only happens in theory; what is observed in practice is that for the plain networks we used to work with, the training error starts increasing with depth. So we use ResNets, which keep working well at least for the depths we use. One thing to notice: when we combine the two paths (the one coming from the skip connection and the current layer), we add them first and then apply the activation function to get the output.
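A minimal sketch of such a block in Keras, assuming the skip spans two conv layers and the shapes already match so the identity shortcut can be added directly (the layer sizes here are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Two conv layers plus a skip connection: out = relu(z + shortcut)."""
    shortcut = x                                             # this is a[l]
    x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, (3, 3), padding='same')(x)    # this produces z[l+2]
    x = layers.Add()([x, shortcut])                          # add BEFORE the activation
    return layers.Activation('relu')(x)                      # a[l+2] = relu(z[l+2] + a[l])

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```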

Why use ResNets?

Remember, in ResNets we combine z[l+2] with a[l]. You know what this is. So what are ResNets doing? Suppose z[l+2] gets killed off, say poor gradients drive W and b to zero, so z[l+2] becomes 0 overall. In that case a[l+2] (the output after applying the activation to z[l+2] + a[l]) simply becomes the identity function, i.e. a[l], since a[l] already came out of a ReLU and applying ReLU again leaves it unchanged. So adding the layers doesn't hurt the network's ability to do as well as the simpler network without the 2 layers we added!
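A tiny numpy check of that argument, with W and b forced to zero by hand (toy sizes):

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

a_l = relu(np.random.randn(4))         # activation entering the block (non-negative)
W, b = np.zeros((4, 4)), np.zeros(4)   # pretend learning collapsed these to zero

z_l2 = W @ a_l + b                     # -> all zeros
a_l2 = relu(z_l2 + a_l)                # -> relu(a_l) == a_l

print(np.allclose(a_l2, a_l))          # True: the block acts as an identity function
```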

But we don't pick a different architecture just because it doesn't hurt; ResNets also help improve performance. When we go deeper, it is hard for the network to be specific, so it becomes difficult for it to choose parameters that output the right result. So what we can tell the network is: the least you can do is make this layer (z[l+2]) at least as good as a[l].

So what ResNets are doing is telling the network: "I know it is difficult for you to be specific when we dig deeper, and you have no base case for what your output should look like at first glance, so make sure your starting baseline is at minimum the identity function a[l]; you can improve on that, but you must be at least the identity."

1*1 Convolutions

What a 1*1 convolution does is help reduce the number of channels!

Suppose we have a 28*28*256 volume in the kth layer, and we want to reduce the channels to get a 28*28*128 output. What we do is use 128 filters of 1*1 convolution, each of shape 1*1*256, to get the 28*28*128 output!
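A quick Keras sketch of that channel reduction (shapes taken from the example; the ReLU is just a typical choice):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(28, 28, 256))
# 128 filters, each of shape 1x1x256: every output pixel is a weighted
# combination of the 256 input channels at that same spatial position.
y = layers.Conv2D(128, (1, 1), activation='relu')(x)
print(y.shape)  # (None, 28, 28, 128)
```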

So what's the benefit? One use of it is lowering the computation cost, as we will see in the Inception network.

Inception

We stack several CONV operations into one layer. For example: a 3*3 conv, a 5*5 conv and max pooling, all in a single layer. Let's see how it works!

See how MaxPool gives an output with the same height and width as its input. We achieve that by padding the input ('same' padding). Not the neatest way of computing things, but it works!

See what the output is here? It is 28*28*(32+32+128+64)!

But there is a problem here: the computation!

Let's check 32 CONV filters of shape 5*5*192 applied to a 28*28*192 input (with 'same' padding), giving a 28*28*32 output!

How many values do we need to compute in the output? 28*28*32 = 25,088. Each of those values is the result of one convolution, i.e. 5*5*192 multiplications. So that is 28*28*32*5*5*192 multiplications (roughly 120 million). See what is happening? 120M multiplications, which is pretty expensive for the network.

What we do now is use 1*1 convolution.

So instead of going from 28*28*192 to 28*28*32 directly, we first use 16 filters of shape 1*1*192 (16 is half of 32) to get a 28*28*16 bottleneck, and then apply 32 filters of shape 5*5*16 to that output! Let's see the outcome.

For the input to the bottleneck 1*1 conv, we have 28*28*16*1*1*192 multiplications ≈ 2.4M.

For the bottleneck to the output, we have 28*28*32*5*5*16 multiplications, which is roughly 10M.

So in total we have about 12.4M multiplications, a significant improvement over our 120M multiplications.
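Spelling out the arithmetic in plain Python (numbers taken straight from the example above):

```python
# Direct 5x5 convolution: 192 -> 32 channels on a 28x28 feature map
direct = 28 * 28 * 32 * (5 * 5 * 192)
# With a 1x1 bottleneck: 192 -> 16 channels, then a 5x5 conv from 16 -> 32
bottleneck = 28 * 28 * 16 * (1 * 1 * 192) + 28 * 28 * 32 * (5 * 5 * 16)

print(f"direct:     {direct:,}")      # 120,422,400  (~120M)
print(f"bottleneck: {bottleneck:,}")  # 12,443,648   (~12.4M)
```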

We also need a 1*1 conv on the MaxPooling branch to control its size. Remember, MaxPooling doesn't change the depth (channels) of the volume, so first we apply PADDED ('same') MaxPooling to keep the width and height, which in our case gives us 28*28*192 after pooling, and then we use 32 filters of shape 1*1*192 to get a 28*28*32 output, just like the 3*3 conv and 5*5 conv branches.

How do we stack up the layers programmatically?

np.concatenate([X1, X2], axis=2). Axis 0 means you add elements row-wise, axis 1 means column-wise, and axis 2 means channel-wise. So if X1 of shape (10, 10, 3) is concatenated channel-wise with (10, 10, 8), we get (10, 10, 11).
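For instance, a minimal check of those shapes:

```python
import numpy as np

X1 = np.zeros((10, 10, 3))
X2 = np.zeros((10, 10, 8))

stacked = np.concatenate([X1, X2], axis=2)  # stack along the channel axis
print(stacked.shape)  # (10, 10, 11)
```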

So the Inception network is just a chain of these Inception modules connected together!
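Putting one whole module together, here is a rough Keras sketch (the branch filter counts loosely follow the 64 + 128 + 32 + 32 example above, not any exact GoogLeNet configuration, and the 96-channel bottleneck on the 3*3 branch is my own choice):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x):
    # Branch 1: plain 1x1 conv
    b1 = layers.Conv2D(64, (1, 1), padding='same', activation='relu')(x)
    # Branch 2: 1x1 bottleneck, then 3x3 conv
    b2 = layers.Conv2D(96, (1, 1), padding='same', activation='relu')(x)
    b2 = layers.Conv2D(128, (3, 3), padding='same', activation='relu')(b2)
    # Branch 3: 1x1 bottleneck, then 5x5 conv
    b3 = layers.Conv2D(16, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(32, (5, 5), padding='same', activation='relu')(b3)
    # Branch 4: 'same'-padded max pooling, then a 1x1 conv to shrink channels
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding='same')(x)
    b4 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(b4)
    # Stack all branches channel-wise: 64 + 128 + 32 + 32 = 256 channels
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
print(outputs.shape)  # (None, 28, 28, 256)
```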

Fun fact: the Inception name actually comes from a meme based on the movie Inception!

https://i.kym-cdn.com/entries/icons/original/000/012/886/wntgd.jpg

"We need to go deeper!"

This one is for me: https://www.youtube.com/watch?v=cFFu__mcoIw&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&index=19
