Neural Network with Convolutions!

Sanjiv Gautam
14 min read · Apr 3, 2020


First thing: why do we even need convolution when we have ANNs that could be modified to work the same way? We will look into this, transfer learning, and much more! So let's get this going!

This post follows the Stanford CS231n course. Now let me get going.

ConvNet architectures make the explicit assumption that the inputs are images. This assumption makes the forward function more efficient to implement and vastly reduces the number of parameters in the network.

Suppose we have an image of shape (32,32,3). If we build a fully connected structure, a single neuron would need 32*32*3 = 3072 weights (you know, once you flatten it). Okay, 3072 doesn't seem like much. But what if we have a 200*200*3 image? That single neuron would now need 120,000 weights, and that's just one neuron in one layer. As we stack up layers, we stack up many more neurons, and the parameter count explodes.
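Here is a quick back-of-the-envelope sketch in plain Python (my own arithmetic, nothing official) of how fast a fully connected layer blows up with image size:

def fc_params(height, width, channels, num_neurons):
    # weights + biases for one dense layer fed a flattened image
    flat = height * width * channels            # length of the flattened input
    return flat * num_neurons + num_neurons     # one weight per input per neuron, plus biases

print(fc_params(32, 32, 3, 1))      # 3073 -> one neuron on a 32*32*3 image
print(fc_params(200, 200, 3, 1))    # 120001 -> one neuron on a 200*200*3 image
print(fc_params(200, 200, 3, 100))  # 12000100 -> and we usually want many neurons per layer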

A CNN assumes the input is an image and makes a sensible tweak (we may call it that) which cuts the number of params considerably.

Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations.

Layers used to build ConvNets

i. Convolutional Layer

ii. Pooling Layer

iii. Fully-Connected Layer

Building a CNN HOUSE!

INPUT => a 32*32*3 image, with height 32, width 32 and 3 channels for an RGB image. How do we picture this? Say at every pixel we go to a depth of 3.

CONV => We will go into much greater detail about this in the coming part. Just remember that if we decide to use 12 filters (weights), then we might have a 32*32*12 output from this CONV layer. At every pixel we go to a depth of 12.

ACTIVATION (RELU) => It just applies a non-linearity to the volume, which results in a (32,32,12) sized output.

POOL => It downsamples the spatial size. Generally it is of 2 types: Average and Max. So, picture it this way. Suppose we apply MaxPooling with a (2,2) window size. Then the maximum value inside that window is chosen. For e.g. [[1,4],[3,6]]: if we apply MaxPooling to it, we get the value 6 (there is a tiny NumPy sketch of this just after this walkthrough).

FULLY CONNECTED LAYER => Every neuron is connected to every activation in the previous volume, just like in a simple ANN. It would result in 1*1*10 (if we select 10 neurons).
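Here is the tiny MaxPooling sketch I promised above (a NumPy toy of mine, not from the course):

import numpy as np

window = np.array([[1, 4],
                   [3, 6]])
print(window.max())    # 6 -> the whole 2*2 window collapses to its maximum

# On a full feature map, every non-overlapping 2*2 window is reduced the same way:
fmap = np.arange(16).reshape(4, 4)                  # a made-up 4*4 feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max over each 2*2 block
print(pooled.shape)    # (2, 2) -> spatial size halved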

Convolution Layer

The convolution layer contains kernels (filters, weights), whatever you may call them. Just think of it this way: these terminologies are made to confuse you; simply put, they are the weights you learn, like in an ANN. Every filter (weight, you know) expands to the full depth of the input. What does this mean? Say you have an image of 32*32*3; it has a depth of 3. So if we pick a filter of spatial size (2,2), its actual shape is 2*2*3 (it expands to the full depth). We convolve (take a dot product) between the image and the filter (weight), just like we used to do in an ANN. You know, W.X. Here X is the image pixels and W is the kernel. As we slide the filter over the width and height of the input volume, we produce a 2-dimensional activation map. What is an activation map? The output after the dot product is called an activation map. If we have 12 filters, each of shape (2*2*3), then we get 12 activation maps.
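To make that dot product concrete, here is a minimal NumPy sketch of one 2*2 filter sliding over one 32*32*3 image (random values and variable names are mine; note that with no padding the map comes out 31*31, not 32*32 as in the walkthrough above, which implicitly kept the size with padding):

import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # height x width x depth
kernel = rng.standard_normal((2, 2, 3))    # a 2*2 filter, expanded to the full depth of 3
bias = 0.1

# One position of the activation map: element-wise multiply a 2*2*3 patch with the
# 2*2*3 filter and sum everything up -- exactly the W.X dot product from an ANN.
one_value = np.sum(image[0:2, 0:2, :] * kernel) + bias

# Sliding the filter over every valid position gives one full activation map.
activation_map = np.zeros((31, 31))
for i in range(31):
    for j in range(31):
        activation_map[i, j] = np.sum(image[i:i+2, j:j+2, :] * kernel) + bias
print(activation_map.shape)   # (31, 31); 12 such filters would stack into a 31*31*12 volume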

Local Connections between Layers

When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The connections are local in space (along width and height), but always full along the entire depth.

For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the filter size is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter).

In the course's figure, the left side is the image and the right side is the weight volume. There are 5 filters in total. Each filter is of size 5*5*3 and there are 5 of them, so the total weight volume is (5*5*3) * 5 (volume of each filter times the total number of filters).

Output Volume

So far we discussed the input; now it is the turn of the output of the convolution, the volume of the activation maps. It generally depends on 3 factors.

a. Depth.

b. Stride .

c. Padding.

Remember that all of these are hyperparameters, so the shape of the output volume depends on our choices.

Depth: The depth of the output volume corresponds to the total number of filters you use. So for a 227*227*3 image, if we use 30 filters, the output depth is 30, e.g. something like 227*227*30 (the exact spatial size depends on stride and padding, as we will see). The 3 RGB channels get replaced by 30 filter channels.

Stride: What is stride then? Like we said, the filter is slid over the image; the stride is the number of pixels we jump each time we slide it.

Padding: Sometimes it is convenient to pad the border of the image with zeros to keep the border values in play. In some applications of CNNs, border values carry significant importance, so we pad the borders to keep their information. Padding also helps us keep the output volume shape equal to the input volume shape.

Let’s do some math here:

Suppose we have an input image of width W, stride S, padding P and filter size (weight size) F. Then the output width would be:

W(output) = (W(input)-F+2*P)/S + 1. The same formula holds for the height. Let's take an example. For a 7*7 input image with a 3*3 weight (filter), stride 1 and 0 padding, we get W(output) = (7-3+2*0)/1 + 1 = 5. So the output would be 5*5.
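A small helper I use to sanity-check these numbers (just the formula above wrapped in a function):

def conv_output_size(w_in, f, p, s):
    # W(output) = (W(input) - F + 2*P)/S + 1
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(7, 3, 0, 1))     # 5  -> the 7*7 input, 3*3 filter, stride 1, no padding
print(conv_output_size(5, 3, 1, 1))     # 5  -> the 'same size' padding example coming next
print(conv_output_size(227, 11, 0, 4))  # 55 -> the AlexNet first layer discussed further below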

Use of padding:

In order for the input and output volume to be the same size with stride S = 1, we choose padding P = (F-1)/2. For example, for S = 1, W = 5, F = 3, P = (F-1)/2 gives P = 1. So, W(output) = (5-3+2*1)/1 + 1 = 5 (as we expect).

What would be the formula for P, if stride = 2, to keep the same dimension as the input? It is something to think about. Do the math, dude. It's easy.

Parameter Sharing

The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227–11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. So, what is the total number of neurons here?

55*55*96 = 290,400. How? 55 along the width, 55 along the height and 96 of them along the depth. However, each neuron has weights, right? In an ANN, each neuron had a one-dimensional bundle of weights coming from the previous neurons. Here, however, each one has 11*11*3 = 363 weights and 1 bias. So the total number of parameters in the first layer becomes 364*290,400. I am not multiplying it out, but it is over 100 million.

But how does a CNN reduce params then? The basic assumption here is that we use the same weights across a single depth slice, i.e. we do not use 55*55 different sets of weights for one depth slice; rather, one filter is shared across all spatial positions of that slice. So the number of distinct filters reduces to 96: 96 different filters, each of size 11*11*3. The total number of parameters becomes 11*11*3*96 = 34,848 weights + 96 biases = 34,944. Look how the mighty has fallen. That's why the name filter comes in, you see. We use the same filter across the image in one depth slice, so it is called a filter instead of calling it a weight.
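A quick sanity check of those numbers in plain Python (my own arithmetic, following the figures above):

F, depth_in, K = 11, 3, 96                       # filter size, input depth, number of filters
out_w = out_h = 55                               # (227 - 11)/4 + 1

weights_per_neuron = F * F * depth_in            # 363
neurons = out_w * out_h * K                      # 290400

no_sharing = neurons * (weights_per_neuron + 1)  # every neuron keeps its own filter and bias
with_sharing = K * weights_per_neuron + K        # one filter and one bias per depth slice

print(no_sharing)    # 105705600
print(with_sharing)  # 34944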

HOWEVER, sometimes the parameter sharing assumption may not make sense. What if we have, say, faces centered in the image? Completely different features should be learned on one side of the image (the center) than on another (the side which contains no information whatsoever). You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.

Coding with NumPy (let me understand it code-wise, else I would not sleep with so much confusion in my mind)

Input Layer

Suppose an input image of size 11*11*4.

If X is the input volume, then the depth column at position (x,y) would be X[x,y,:] (get all of the depth). Its shape would be (4,), i.e. 1*1*4.

A depth slice at depth d would be X[:,:,d]. In other words, it would be 11*11*1.

Conv Layer

Let's say X has a shape of 11*11*4. If we use no zero padding (P = 0), a filter size of 5 and stride = 2, the output shape should be (11-5)/2 + 1 = 4, i.e. 4*4 in space (times the number of filters in depth, of course). The activation map would be:

V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0

Okay, what the hell is going on here? We know that our V (the activation map volume) would be of shape 4*4*noOfFilters. Since we are using the PARAMETER SHARING technique, we multiply the first 5 rows and columns of the image with the weight, which is of shape 5*5*(channels of the image, i.e. 4), sum everything up and add the bias. Remember this is the value at a single position of the 3-dimensional V.

V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0

What did we do here? We took a stride of 2. So width-wise, we went 2 pixels to the right and used the same weight and bias. The process continues.

V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0

Notice that the * sign here means element-wise multiplication. Okay, so we do not use a dot product? The thing is, np.sum() adds up the element-wise products, which is the same as a dot product. The process continues along the width in a similar fashion.

The second activation map would be:

V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1

This is self explanatory.

V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1

The line shown above belongs to the second activation map. How do I know this? W1. The value is placed at the 2nd position along the width, the 1st along the height and the 2nd along the depth (indices 1, 0, 1).

V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1

Why 4:9 and 6:11 in X? Because this V sits at index 2 along the width (2 * stride = 4, so 4:9), at index 3 along the height (3 * stride = 6, so 6:11), and at depth 1, which means W1 and b1. Easy!

Just remember, V[0,0,0] and V[0,0,1] differ only by W0/b0 versus W1/b1. Otherwise they both look at the same X[:5,:5,:] patch of input pixels.
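Putting all those V[...] lines into one loop, here is a naive forward pass sketch (my own code, written to mirror the indexing above; no padding, stride 2, two filters W0 and W1 stacked into one array):

import numpy as np

def conv_forward_naive(X, W, b, stride):
    # X: (H, W, D) input volume, W: (F, F, D, K) filters, b: (K,) biases.
    # Loops everywhere on purpose -- the point is to make the indexing above explicit.
    H, _, _ = X.shape
    F, _, _, K = W.shape
    out = (H - F) // stride + 1
    V = np.zeros((out, out, K))
    for k in range(K):                           # one activation map per filter
        for i in range(out):                     # position along the width
            for j in range(out):                 # position along the height
                patch = X[i*stride:i*stride+F, j*stride:j*stride+F, :]
                V[i, j, k] = np.sum(patch * W[..., k]) + b[k]
    return V

rng = np.random.default_rng(0)
X = rng.standard_normal((11, 11, 4))
W = rng.standard_normal((5, 5, 4, 2))            # W0 and W1 stacked along the last axis
b = rng.standard_normal(2)
V = conv_forward_naive(X, W, b, stride=2)
print(V.shape)                                                            # (4, 4, 2)
print(np.isclose(V[1, 0, 0], np.sum(X[2:7, :5, :] * W[..., 0]) + b[0]))   # True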

Convolution Demo

Check awesome visualization at http://cs231n.github.io/convolutional-networks/.

Implementation using Matrix Multiplication

So how do we do matrix multiplication in convolution, since we are dealing with a 3D input and 3D weights?

Let’s see.

At first, each local patch of the input image is stretched out into a column, i.e. we take each 11*11*3 patch of the input (not the weight; I know it sounds like we are talking about the weight, but I am talking about an 11*11*3 patch of the input image of shape 227*227*3). This patch is to be convolved with an 11*11*3 filter at stride 4. So each patch is stretched out into a column of 11*11*3 = 363 rows. You know what this is!

What is our output size? W = (227-11)/4 + 1 = 55. So spatially it would be of shape 55*55. Our stretched input matrix would therefore be of shape [363 x 3025] (3025 = 55*55). Every column is a receptive field. How many such receptive fields are there? 3025!!!

Okay shhh. Our input is of shape 227*227*3 = 154,587 pixels. The matrix we just built is of shape 363*3025 = 1,098,075 values. What am I missing here?

See, the thing is, we are overlapping the inputs. The receptive fields overlap, so every number in the input volume may be duplicated in multiple distinct columns. We took 11*11*3 patches from the 227*227*3 input with a stride of 4. So our first patch would be input[:11,:11,:]. The second (sliding along one dimension) would be input[4:15,:11,:]. So the pixels from index 4 to 10 get repeated, right? Does this explain it? I think yes!

How do we handle the filter then? We handled the input by stretching patches into column vectors. We have 96 filters, right? So the weights are reshaped to (96, 363). You know how 363 came about. Each filter becomes a row vector of shape (1, 363) (363 columns), and there are 96 of them.

The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. So W_row is of shape (96, 363) and X_col is (363, 3025), and the output is (96, 3025). We can then reshape it into (55, 55, 96). THAT'S MIND BOGGLING!

The disadvantage of this method is the HUGE MEMORY, as you can see.
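Here is a rough sketch of that stretching step (usually called im2col; the function and variable names are mine, and it is deliberately slow and loopy just to show the shapes):

import numpy as np

def im2col(X, F, stride):
    # Stretch every F*F*D receptive field of X into one column.
    H, _, D = X.shape
    out = (H - F) // stride + 1
    cols = np.zeros((F * F * D, out * out))
    idx = 0
    for i in range(out):
        for j in range(out):
            patch = X[i*stride:i*stride+F, j*stride:j*stride+F, :]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols

rng = np.random.default_rng(0)
X = rng.standard_normal((227, 227, 3))
W = rng.standard_normal((96, 11, 11, 3))         # 96 filters of shape 11*11*3

X_col = im2col(X, F=11, stride=4)                # (363, 3025)
W_row = W.reshape(96, -1)                        # (96, 363)
out = W_row @ X_col                              # (96, 3025) -- one big matrix multiply
print(out.reshape(96, 55, 55).transpose(1, 2, 0).shape)   # (55, 55, 96)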

Dilated Convolution

It's possible to have filters that have spaces between each cell, called dilation. As an example, in one dimension a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is a dilation of 1 (the ordinary convolution). For dilation 2, the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4]; in other words, there is a gap of 1 between the weight values.

Check the second picture: W[0]*X[0] + W[1]*X[2] + W[2]*X[4]. What is filled in those gaps when the dilation is 2 or more? ZEROS (if you think of the dilated filter as one bigger, zero-stuffed filter)!

What's the point of dilated convolution? It lets you grow the receptive field much faster (with fewer layers) at the same computation and memory cost, while also preserving resolution.
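A 1-D sketch of the idea (my own toy code, using the same dilation numbering as above: dilation 1 is the ordinary convolution, dilation 2 leaves a gap of 1):

import numpy as np

def dilated_conv1d(x, w, dilation):
    # The filter taps are `dilation` apart in the input.
    span = (len(w) - 1) * dilation + 1           # how much input one output position covers
    out = []
    for start in range(len(x) - span + 1):
        out.append(sum(w[k] * x[start + k * dilation] for k in range(len(w))))
    return np.array(out)

x = np.arange(10.0)
w = np.array([1.0, 2.0, 3.0])
print(dilated_conv1d(x, w, dilation=1)[0])   # w[0]*x[0] + w[1]*x[1] + w[2]*x[2] = 8.0
print(dilated_conv1d(x, w, dilation=2)[0])   # w[0]*x[0] + w[1]*x[2] + w[2]*x[4] = 16.0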

Pooling Layer

Its function is to progressively reduce the spatial size of the representation, which reduces the amount of parameters and computation in the network and hence also helps control overfitting. What params are required for pooling? One is the window size F and the other is the stride S. It slides over the feature map and picks a value from each window (max or average, depending on your selection).

What is the output shape after pooling?

W(output) = (W(input) - F)/S + 1.

Backpropagation of MaxPooling

Recall from the backpropagation that backprop of max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation so that gradient routing is efficient during backpropagation.
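Here is a sketch of that bookkeeping for a single feature map (my own code, not an official implementation): the forward pass remembers where each max came from, and the backward pass routes the gradient only to those positions.

import numpy as np

def maxpool_forward(X, F=2, stride=2):
    # X is one H*W feature map; remember the (row, col) of every winning max.
    H, W = X.shape
    out_h, out_w = (H - F) // stride + 1, (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    max_idx = np.zeros((out_h, out_w, 2), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            window = X[i*stride:i*stride+F, j*stride:j*stride+F]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            out[i, j] = window[r, c]
            max_idx[i, j] = (i*stride + r, j*stride + c)
    return out, max_idx

def maxpool_backward(dout, max_idx, input_shape):
    # Route each upstream gradient only to the input that won the max in the forward pass.
    dX = np.zeros(input_shape)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            r, c = max_idx[i, j]
            dX[r, c] += dout[i, j]
    return dX

X = np.random.default_rng(0).standard_normal((4, 4))
out, idx = maxpool_forward(X)
dX = maxpool_backward(np.ones_like(out), idx, X.shape)
print(out.shape, dX.shape)   # (2, 2) (4, 4)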

However, many people dislike MaxPooling and instead suggest CONV layers with a larger stride to downsample. It seems likely that future architectures will feature very few to no pooling layers.

Converting Fully Connected Layers to CONV layer

We saw that a CONV layer is just a constrained version of a Fully Connected layer: we use the same filter across the whole image, and neurons are connected only to local regions rather than to the whole input like in Fully Connected layers. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers.

CONV-> FC

For any CONV layer, there is a Fully Connected layer that implements the same forward function. Its weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing). This is easy, right? The same filter values get repeated across blocks of one big FC weight matrix.

FC->CONV

Any FC layer can be converted to a CONV layer. For example, an FC layer with K(no of neurons)=4096 that is looking at some input volume of size 7×7×512 can be equivalently expressed as a CONV layer with F=7,P=0,S=1,K=4096(no of filters used). So what would be the output of it? 1*1*4096.

Let’s check FC->CONV in more detail. Shall we?

Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Take an AlexNet-style network which uses two FC layers of size 4096 and finally a last FC layer with 1000 neurons. Before these FC layers, due to a series of downsampling steps, the last CONV volume is of shape 7*7*512.

So how do we convert those 4096-neuron FC layers (remember we have 2 of them) sitting on top of the 7*7*512 volume into CONV layers?

Well, for the first one we use F=7, P=0, S=1 and number of filters (K) = 4096. So the output becomes a 1*1*4096 CONV volume. (That is the first 4096 layer.)

The second 4096 FC layer is converted to a CONV layer with a filter size of... 1. Why 1? Because the previous CONV output is of shape 1*1*4096, so a 1*1 filter spans it completely. So this layer also produces a 1*1*4096 output.

The final 1000-neuron FC layer is converted to a CONV layer using filter size F = 1 (you know why), which gives an output of 1*1*1000.
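To convince myself these really are the same computation, here is a scaled-down NumPy check (7*7*8 instead of 7*7*512 and 16 neurons instead of 4096, purely so it runs instantly; the shapes and names are mine):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 7, 8))            # stand-in for the last 7*7*512 CONV volume
W_fc = rng.standard_normal((16, 7 * 7 * 8))   # stand-in for the 4096-neuron FC layer
b = rng.standard_normal(16)

# FC view: flatten the volume, one matrix-vector product.
fc_out = W_fc @ X.reshape(-1) + b

# CONV view: each FC neuron becomes one F=7 filter spanning the full depth,
# so the convolution has exactly one valid position and gives a 1*1*16 output.
W_conv = W_fc.reshape(16, 7, 7, 8)
conv_out = np.array([np.sum(X * W_conv[k]) + b[k] for k in range(16)])

print(np.allclose(fc_out, conv_out))   # True -- same numbers, different bookkeeping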

Common Patterns of CONVNET

INPUT -> [[CONV -> RELU->BatchNorm (I added BatchNorm for myself)]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3)

It is preferred to use a stack of small-filter CONV layers rather than one CONV layer with a large receptive field. Here is why.

Suppose we stack three 3*3 CONV layers on the image (with non-linearities in between, of course). A neuron in the first layer sees a 3*3 view of the input volume; a neuron in the second layer sees a 3*3 window of the first CONV layer's output. You see what I am telling you? The first CONV layer looks at the input. The second looks at the output of the first, so effectively it sees a 5*5 region of the input volume. You get it? Two 3*3 CONV layers watch 5*5 of the input. Check that out in your copy. It's easy!

So three 3*3 CONV layers are equivalent to watching the input through a 7*7 receptive field. What if we use a single 7*7 layer instead of three 3*3 CONV layers?

First disadvantage: the single 7*7 layer's neurons compute a linear function over the input with just one non-linearity applied afterwards, whereas the stack of three 3*3 layers interleaves three non-linearities and can learn more complex structure.

Second disadvantage: the number of parameters! If we use one 7*7 CONV with C input and C output channels, we have 1*(7*7*C)*C = 49C² weights. For the stack of 3*3 CONV layers with C channels, we have 3 (number of CONV layers) * ((3*3*C)*C) = 27C². So the number of params gets reduced too.
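The same counting, as a quick check (biases ignored, like in the paragraph above):

def conv_params(f, c_in, c_out, layers=1):
    return layers * (f * f * c_in) * c_out   # weights only, no biases

C = 64                                       # any channel count; the 49C² vs 27C² ratio is what matters
print(conv_params(7, C, C))                  # 49*C*C = 200704
print(conv_params(3, C, C, layers=3))        # 27*C*C = 110592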

I WILL DISCUSS THE WEIGHTS OF THE CONV LAYER IN SOME OTHER POST. I PROMISE!

In PRACTICE: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.

Layer Sizing Patterns:

The input layer (that contains the image) should be divisible by 2 many times; common heights and widths are 32, 64, 224, 512.

The conv layers should be using small filters (e.g. 3x3 or at most 5x5). Try to preserve the input shape for some layers until you downsample it eventually.

The pool layers usually use a 2*2 window with a stride of 2.

Why use stride of 1 in CONV? Smaller strides work better in practice.

Why use padding? To preserve information of the borders and to keep size as you want.

Compromising based on memory constraints. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, counting both the activations and their gradients). People prefer to make the compromise at only the first CONV layer of the network, because all it mostly learns is edges; the finer details are learnt in the layers that come later. AlexNet uses filter sizes of 11x11 and a stride of 4 in the first layer.
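My own rough arithmetic for that memory figure (assuming float32; the ~72MB counts both the activations and their gradients):

activations = 3 * 224 * 224 * 64           # three [224x224x64] volumes
mbytes = activations * 4 / (1024 ** 2)     # 4 bytes per float32 value
print(activations)                         # 9633792 -> about 10 million activations
print(round(mbytes, 1))                    # ~36.8 MB; roughly double (≈ 72 MB) once gradients are stored too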

SO WE ARE AT THE END. CONGRATS TO ME. NOW ALL THAT IS LEFT IS TO LEARN ABOUT THE WEIGHTS OF THE CONV NET. AND THEN I AM EMANCIPATED.

Transposed Convolution

fig: Transposed Convolution
