Understanding Neural Style Transfer from another Medium post

Sanjiv Gautam
9 min read · Mar 27, 2020


Here is the post: https://medium.com/tensorflow/neural-style-transfer-creating-art-with-deep-learning-using-tf-keras-and-eager-execution-7d541ac31398. Kudos to the writer for making it simpler than the research paper. Let me try to understand it and explain it to myself here on Medium.

Style transfer is an easy concept: blend two images together so that the features of one image transfer to the other. So what features? Its colors, its textures, etc. are transferred from one image to another. Who does that? We do, of course. This style transfer is based on Leon A. Gatys' paper, A Neural Algorithm of Artistic Style, and that research paper is easy to understand as well.

We take three images. The first is the content image, the second is the style image, and the third is our input image. We transfer content from the content image to our input image using a content loss; then we transfer style from the style image to our input image with the help of a style loss.

So here is how we go. We take a pre-trained VGG model and use its intermediate layers. We will choose about 4–5 of those intermediate layers. Okay, hang on! What do those intermediate layers give us? Their outputs are feature maps of the image passed through the network. But what do they actually learn?

Let me take you back to what a neural network does with an image; in our case it is a CNN. The first layer of the network detects corners and edges of the image. Every CNN model does this. Why does it do that? Well, that question remains unanswered to this date; if you are able to crack it, all hail the new KING of Neural Networks.

As the layers progress, i.e. from the 1st to the 2nd to the 3rd and so on, they learn new features of the image. The first one learns edges and the like, the second might learn other features like shapes, colors, textures. Across the stack, a lot of different features get learned.

In other words, these intermediate layers learn the content and style of images, right? So what we do is take our input image, compute how much it differs in style from the style image and how much it differs in content from the content image, and then try to minimize both differences.

Getting our hands dirty with code

Now we should code. Yeah, that's the boring part. But I will explain this to you as much as I have understood, or as much as I am understanding along the way.

Since I am not going to redo everything from scratch, the code from the link I gave above will be used as reference.

The Content Image and the Style Image

The image on the left is the content image. Our input image is going to learn all the content from it, and the style will come from the image on the right.

VGG has many layers. We will use conv2 from the 5th block for the content loss. Why that one out of so many blocks of layers? Because it is given in the link? :P Well, it's because a layer chosen too early would not transfer the content accurately, since the layers before it haven't learned anything fruitful for us. We take our chances with the 5th block because it has learned richer, higher-level content than the blocks before it. As I said earlier, the deeper we go, the higher-level the features the network learns. Why not choose the last layer? Because it is the softmax layer, and it is of no use to us: we are trying to create an image, and the softmax has learned to categorize images in the VGG network. Does that ring a bell?

So, our content layer will be block5_conv2. For the style loss, however, we choose more than one layer from VGG. Why? Guess. Because the first layers of the network learn shapes and boundaries, the second and third learn patterns, color combinations and so on, and it continues in a similar fashion. We want all of those styles from the style image to be transferred to our input. Are you connecting the dots now? Well, I certainly am.

So, our content_layers and style_layers are defined. We have taken one layer from each of blocks 1 to 5 for style_layers. Here is the actual VGG network.
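(The network figure is an image in the original; in code, the layer names, following the referenced post and the tf.keras VGG19 naming, are:)

content_layers = ['block5_conv2']

style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']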

So, here is the code that defines the model input and output

The original function contains a lot of comments; I removed them, which is why you only see the return at the end.
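Roughly, it looks like this (a sketch following the referenced post, with the comments stripped as mentioned):

import tensorflow as tf

def get_model():
    # Load pretrained VGG19 (without the classifier head) and freeze it.
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    # Collect the outputs of the chosen style and content layers.
    style_outputs = [vgg.get_layer(name).output for name in style_layers]
    content_outputs = [vgg.get_layer(name).output for name in content_layers]
    model_outputs = style_outputs + content_outputs
    # Build a model mapping vgg.input to those 6 outputs.
    return tf.keras.Model(vgg.input, model_outputs)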

Let us decode the code. The first thing we do is load the model. Then we set its trainable attribute to False, which means we prevent the VGG network from training; since it is trained well enough already, we don't need to train it further. The third and fourth lines save the style and content outputs from the VGG network into two lists, and then we combine them into a single list of outputs. Why do we combine them? So that one model can have multiple outputs. Our network will yield 6 outputs: the first 5 are the style outputs and the last one is the content output, for one input. What is the input? We get it from vgg.input, which we then pass into the model. So this function returns a model with one input and 6 outputs.

Now comes the most dangerous part of all: THE LOSS FUNCTIONS.

CONTENT LOSS

All this time we have been talking about minimizing the content loss between the input and the given content image. Sounds fair enough? So let's learn the content loss. Believe me, this is going to be exciting!

So, assume our input is x and the content image is p. Now, believe me, the content loss is very, very difficult. You think so? NOOOO! It is just the Euclidean loss. Remember the squared distance loss you have always been using for linear regression and other things? That mean squared loss is the CONTENT LOSS! Let us look at its mathematical expression.
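(The expression itself is an image in the original post; written out, following Gatys' paper, it is:)

L_content(p, x, l) = 1/2 * Σ_{i,j} ( F^l_ij − P^l_ij )²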

F here is the feature map of image x at layer l, and P is the feature map of the content image p at layer l. The summation over i and j runs over the rows and columns of the feature maps. Since they have the same dimensions, which they must, we can take the squared distance between them. How do we do that in code?

We have a feature map of size 14*14*512 at conv2 of block 5. So we take the squared distance between the 14*14*512 feature map of the content image and the 14*14*512 feature map of the input image. Believe me, it is that simple.
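In code that is just a mean of squared differences; here is a minimal sketch, assuming base_content and target are the two 14*14*512 feature maps pulled from the model above:

def get_content_loss(base_content, target):
    # Mean squared difference between the input's and the content image's feature maps.
    return tf.reduce_mean(tf.square(base_content - target))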

STYLE LOSS

I am not going to lie, this thing is a bit difficult. I mean, it won't be after you read this post, but I found it a bit overwhelming on my first try; I can't be sure how you will find it.

Remember us taking the MSE of the intermediate layers for the content loss? Well, the style loss is built on Gram matrices. We compare the Gram matrices of the two images.

So what is a Gram matrix, then?

The difference between Gram matrices is a difference between the feature distributions of two images. Content loss was about the difference between specific features at specific locations. For style, we need something that does not care about the specific presence or location of the detected features within an image. The Gram matrix is perfectooo for this.

In linear algebra, the Gram matrix G of a set of vectors (v1, …, vn) is the matrix of dot products, whose entries are G_ij = v_i · v_j = np.dot(v_i, v_j). In other words, G_ij compares how similar v_i is to v_j. A set of vectors is linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero.

pic credit: Stanford

Within any specific layer, we check how related one channel is to another. We know that every channel learns different attributes, so in order to check how one channel relates to another, we need the Gram matrix. But why does this capture style?

Say, for the Lth layer, the first channel learns to detect a texture and the second channel learns to detect a color. If they are correlated, then whenever we have that particular texture, we will also have that color. You see, being correlated means the existence of one goes with the existence of the other! So this tells us how often these components tend to occur, or not occur, together. If they have a high correlation, they might work together to form a higher-level feature in a deeper layer.

So the degree of correlation between channels gives the degree of style of an image. One important part of the Gram matrix is that the diagonal elements such as G_ii also measure how active filter i is. For example, suppose filter i detects vertical textures in the image. Then G_ii measures how common vertical textures are in the image as a whole. If G_ii is large, the image has a lot of vertical texture.

The Gram matrix is given by:
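(That formula is also an image in the original; written out, with F^l_ijc the activation of channel c at spatial position (i, j) of layer l, and I*J the number of spatial positions, it amounts to:)

G^l_cd = ( Σ_{i,j} F^l_ijc * F^l_ijd ) / (I*J)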

So, we first ravel the input along the spatial dimensions but keep the number of channels as it is, take the dot product, and then divide by the number of spatial positions (averaging across height, width and batch). For example, if we have a feature map of shape 114*114*64, we reshape it to (114*114, 64): the dot product of (64, 114*114) and (114*114, 64) gives a Gram matrix of shape (64, 64), which is then divided by 114*114, i.e. the number of pixels it was spread over.
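Here is a minimal sketch of that computation, along the lines of the referenced post (input_tensor being a feature map such as the 114*114*64 one above):

def gram_matrix(input_tensor):
    # Ravel all spatial positions into rows, keeping channels as columns.
    channels = int(input_tensor.shape[-1])         # e.g. 64
    a = tf.reshape(input_tensor, [-1, channels])   # (114*114, 64) for a 114x114x64 map
    n = tf.shape(a)[0]                             # number of spatial positions, 114*114
    # (64, 114*114) . (114*114, 64) -> (64, 64) channel correlations
    gram = tf.matmul(a, a, transpose_a=True)
    return gram / tf.cast(n, tf.float32)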

The style loss is then given by:
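(Again an image in the original; roughly, it is a weighted sum, over the chosen style layers, of the mean squared difference between the Gram matrices of the input image and the style image:)

L_style = Σ_l w_l * mean( ( G^l_input − G^l_style )² )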

The Gram matrices can also be averaged over the batch.
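A minimal sketch of the per-layer style loss, assuming gram_target is the precomputed Gram matrix of the style image at that layer:

def get_style_loss(base_style, gram_target):
    # Mean squared difference between the Gram matrix of the input image's features
    # and the Gram matrix of the style image's features at this layer.
    gram_style = gram_matrix(base_style)
    return tf.reduce_mean(tf.square(gram_style - gram_target))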

Running “TRICKY” gradient descent.

The gradient descent here is a bit tricky. Remember how we used to train a network by running gradient descent that updates the weights and biases? This doesn't work that way. We train our input image to minimize the loss; we don't train the weights and biases. I could have tried different methods, but since we are following the Medium post, I am going to explain everything it has done instead of trying a different approach.

So what we do here is take the content image as our input image. 😮 This came as a shock to me as well. We take the content loss against the content image, the style loss against the style image, and make content_image the variable that gradient descent updates.

We have the style_outputs of the style image and the content output of the content image, and then we stack up the losses. tf.add_n() adds the tensors element-wise. Then we take the style loss and give it some weight (less weight than the content loss we use below, because we want the content to transfer more than the style). Then we add them up.
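Here is a rough sketch of that combined loss, assuming the model outputs come in the 5-style-plus-1-content order from earlier, gram_style_targets and content_targets are the precomputed targets for the style and content images, and style_weight / content_weight are hypothetical default hyperparameters (content weighted more heavily than style, as described above):

def compute_loss(model_outputs, gram_style_targets, content_targets,
                 style_weight=1e-2, content_weight=1e4):
    # First 5 outputs are style feature maps, the last one is the content feature map.
    style_features = model_outputs[:len(style_layers)]
    content_features = model_outputs[len(style_layers):]
    # Style loss summed over the 5 layers, then weighted.
    style_loss = tf.add_n([get_style_loss(feat, target)
                           for feat, target in zip(style_features, gram_style_targets)])
    style_loss *= style_weight / len(style_layers)
    # Content loss at block5_conv2, then weighted.
    content_loss = tf.add_n([get_content_loss(feat, target)
                             for feat, target in zip(content_features, content_targets)])
    content_loss *= content_weight / len(content_layers)
    return style_loss + content_loss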

This is the gradient function. We extract the outputs of the input image (they contain the style and content outputs of the input image). The input here happens to be the content image. So we take its loss and compute the gradient with respect to the image. But hold on, the image is not a variable, right? How can we take the gradient with respect to it?

The answer is here.

image = tf.Variable(content_image)

We need to do this to make our content_image a trainable tf.Variable, so that gradients can flow to it.
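Putting it together, one optimization step looks roughly like this (a sketch only: model is the 6-output model from get_model() above, compute_loss and the targets follow the earlier sketches, and the optimizer and learning rate are assumptions):

opt = tf.optimizers.Adam(learning_rate=0.02)  # assumed optimizer and learning rate

@tf.function
def train_step(image):
    with tf.GradientTape() as tape:
        model_outputs = model(image)
        loss = compute_loss(model_outputs, gram_style_targets, content_targets)
    grad = tape.gradient(loss, image)                 # gradient w.r.t. the image itself
    opt.apply_gradients([(grad, image)])              # update the image, not the weights
    image.assign(tf.clip_by_value(image, 0.0, 1.0))   # assuming pixels are kept in [0, 1]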

THAT’s IT!

We are done with NEURAL STYLE TRANSFER!
