Transformers — A new paradigm in NLP

Sanjiv Gautam
11 min read · Aug 14, 2020

Before I commit the Internet plague of PLAGIARISM, I would like to give full credit for the transformer material to Jay Alammar's The Illustrated Transformer:

http://jalammar.github.io/illustrated-transformer/.

A helicopter view of Transformers!

Thank you.

Hold on, we have not even started yet.

The transformer consists of an encoder and a decoder. YES! The plain old encoder and decoder.

fig: The Transformer network (I repeat, I don't own the copyright of the image; it's from the link above)
A more detailed overview of the transformer network

Needless to explain anything here. Just the transformer, its general overview and then a zoomed view, but still not zoomed enough to clear up the TRANSFORMER CONUNDRUM. (See the words I have used here, just practising them to make me look more professional, but sometimes I just drop in a word without understanding its proper usage.)

If you observe the figure carefully, the encoder is a stack of 6 layers connected to each other. It is mandatory that you put exactly 6 layers; any other number might cause chaos. (I am kidding, this is an arbitrary number.)

I think there is nothing to panic about so far. Everything is going as smoothly as it should!

The structure of every encoder layer is the same, but they do not share parameters, which means the first encoder's weights and biases are different from the second's, and so on and so forth.

Let’s see single encoder’s structure.

One single encoder architecture

The encoder input is fed to a self-attention layer (which we will discuss), and the output of this layer is fed to a feed-forward neural network. (I don't think I need to explain this, the figure is self-explanatory, but more words means I can practise using heavy words here. I am sorry, BOMBASTIC words.)

The decoder architecture:

The decoder has a similar architecture to the encoder, but one more layer is sandwiched between those two: the Encoder-Decoder Attention layer.

Till now, we know the architecture in brief. We know roughly how the pieces work overall. But we still need to explain two things here.

SELF-ATTENTION & ENCODER-DECODER ATTENTION

We will skip them for now. We will check on them later.

THE FLOW OF VECTORS

Let’s now see how the vectors flow inside these components!

In NLP, we capture associations between words via a thing called Word Embeddings. A word embedding is simply a vector representation of a word, and every entry of that vector holds a property (not 0s and 1s like one-hot encoding). They encode meaningful attributes. So if the word “Gautam” = [0.2 0.8 0.6 0.4] (suppose), then every column of this row vector would mean something. That's why everybody loves word embeddings!
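Just to make this concrete, here is a tiny toy example in NumPy (the words and numbers are mine, not from any trained model): each word maps to a small dense vector, and related words end up closer together than unrelated ones.

```python
import numpy as np

# A toy embedding table (4-dimensional, made-up values for illustration).
embeddings = {
    "gautam":  np.array([0.2, 0.8, 0.6, 0.4]),
    "student": np.array([0.1, 0.9, 0.5, 0.3]),
    "street":  np.array([0.7, 0.1, 0.2, 0.9]),
}

# Related words tend to have a higher cosine similarity.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings["gautam"], embeddings["student"]))  # relatively high
print(cosine(embeddings["gautam"], embeddings["street"]))   # lower
```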

The bottom-most encoder receives these word embeddings as its input.

Single Encoder Layer Input Overview

The encoder receives a list of vectors as input. Each item in the list is the embedding vector of one word. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.

A question comes up here (what if our test example has a sentence that is longer than the ones the model was actually trained on? How does the self-attention model handle that? We will come to this later).

Self-attention produces a list of vectors of its own, and each of them is passed to the neural network. Don't be confused by the picture shown above: the outputs of self-attention (a list of vectors) are passed through the NN separately, one at a time. We do not have an independent neural network for each output; they all share the same neural network.

Recap till now

So let's recap. We have an encoder and a decoder architecture in the Transformer. Each encoder has a self-attention layer and a neural network. Each decoder has self-attention, encoder-decoder attention and a neural network.

The self-attention layer in the encoder receives a list of inputs and produces a list of outputs; however, all those outputs are fed into the NN one by one!!
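Here is a minimal NumPy sketch of that idea (the sizes and random weights are just my toy choices): one feed-forward network with a single set of weights, applied to each position's self-attention output one by one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_words = 4, 8, 3

# One shared feed-forward network: the SAME W1, W2 are used for every position.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(z):
    return np.maximum(0, z @ W1 + b1) @ W2 + b2  # ReLU(z W1 + b1) W2 + b2

# Pretend these are the self-attention outputs for a 3-word sentence.
z_vectors = rng.normal(size=(n_words, d_model))

# Each position goes through the same network, one by one.
outputs = np.stack([feed_forward(z) for z in z_vectors])
print(outputs.shape)  # (3, 4) -- one output vector per word
```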

Self-Attention Layer

Here is something I promised earlier. Self-attention layer.

General Overview

Check this sentence => “The animal didn't cross the street because it was too tired”. Okay, we know that “it” here means the ANIMAL. Does the model know this? Absolutely NOT! Self-attention will make the machine figure it out!!!

What self-attention does is this: it tries to find meaningful connections between a word and the other words inside the input sentence. In our case, between the word “it” and “animal”, and so on!

Think of it like an RNN (remember, an RNN has hidden states that learn dependencies between words).

A perfect example of self-attention layer for given input!

For every word, we check how strong the bond is between that word and every other word in the sentence. One simple question: “what defines the strength of this bond?” Well, it's the WEIGHT from the given word to every other input word!

Detail

You know the problem with these Machine Learning things? They look easy until you actually do them. Having said that, let me start making you more sad.

Steps:

The first step in the self-attention layer is to create three vectors from each of the input vectors (the embeddings; remember, the input is a list of vectors). So for each word in the input, we create 3 vectors. These vectors are called the QUERY, KEY & VALUE vectors. How are these vectors created? By multiplying each word's embedding with a weight matrix whose parameters are learned during training. The size of these vectors is 64 (an arbitrary choice). Here is how it looks figuratively.

W(Q) gives the query when multiplied with the embedding, W(K) gives the key when multiplied with the embedding, and W(V) gives the value vector when multiplied with the embedding!

So for every word we create query, keys and values.

Why? In fact, what are they in the first place?

Ans: It will become clearer when we go deeper into the other steps. So keep your head up, cause more load is INCOMINGGGGG!
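Since step 1 is really just three matrix multiplications, here is a rough NumPy sketch of it (the weights are random stand-ins for the learned matrices; 512 is the embedding size from the paper and 64 is the Q/K/V size we just mentioned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64   # embedding size and Q/K/V vector size

# Learned projection matrices (random here, trained in a real model).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x1 = rng.normal(size=d_model)  # embedding of one word

q1 = x1 @ W_Q   # query vector for this word
k1 = x1 @ W_K   # key vector
v1 = x1 @ W_V   # value vector
print(q1.shape, k1.shape, v1.shape)  # (64,) (64,) (64,)
```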

Step 2:

Calculating the Score. Remember how I said the bond between words is given by weight? I lied. I LIED. So if anyone read this blog up to that point, somehow skipped the rest and went for an interview, he/she is pretty screwed.

As I keep reiterating, the score gives the strength/focus of a word with respect to every other word in the input.

How is the score calculated ?

We take the dot product between the query and key vectors. As the name implies, the QUERY belongs to the word we are currently looking at. Say, for instance, q1 is the query vector for the first input word (embedding), and k1 is the key for it. So we learn about the first word's strength, or score, with itself by taking the dot product between q1 and k1. If we have a second word, and we want to find the relation between the first word and the second, then we take the dot product between q1 and k2 (the key of the second word).

I hope this clears up the concept of Query and Key Vector!

The q1·k2 = 96 in the figure is the score between the word “thinking” and “machines”!
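As a toy sanity check (the vectors below are made up, they just reproduce a 96-style number): the score really is nothing more than a dot product between one word's query and another word's key.

```python
import numpy as np

q1 = np.array([2.0, 4.0, 8.0])   # query for word 1 ("thinking"), made-up values
k1 = np.array([3.0, 5.0, 9.0])   # key for word 1, made-up values
k2 = np.array([4.0, 6.0, 8.0])   # key for word 2 ("machines"), made-up values

score_11 = q1 @ k1   # how much word 1 attends to itself
score_12 = q1 @ k2   # how much word 1 attends to word 2
print(score_11, score_12)  # 98.0 96.0
```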

Step 3:

This step does a bit of processing. Remember we took 64 as the dimension of those K, V, Q vectors? We divide the scores by sqrt(64) = 8. Why? Well, the paper says that this division leads to MORE STABLE GRADIENTS!

Remember Xavier initialization, where we divide the weights by sqrt(number of inputs)? Well, this is similar. I am not very good at it, but I think the two ideas are closely related.

Step 4:

Then we pass the scores through a softmax so that they are normalized (positive, and summing to 1). Don't tell me to explain things here. You should know why we do softmax!

Step 5: Multiply each value vector by its softmax score. That way we keep the focus on the words we are interested in.

Step 6: Then, finally, sum up those weighted value vectors!
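Here is how steps 2 to 6 look for a single word, in a small NumPy sketch (two words, random stand-in vectors; the only thing that matters is the divide-by-8, softmax, weight-the-values, sum-them-up pipeline):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(0)

# Query for word 1, and keys/values for a 2-word sentence (random stand-ins).
q1 = rng.normal(size=d_k)
K  = rng.normal(size=(2, d_k))   # k1, k2 stacked as rows
V  = rng.normal(size=(2, d_k))   # v1, v2 stacked as rows

scores  = K @ q1                  # step 2: dot products q1.k1, q1.k2
scaled  = scores / np.sqrt(d_k)   # step 3: divide by sqrt(64) = 8
weights = softmax(scaled)         # step 4: normalize with softmax
z1 = (weights[:, None] * V).sum(axis=0)   # steps 5-6: weight the values and sum
print(weights, z1.shape)          # two attention weights, one 64-dim output
```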

The whole process of self-attention layer is:

Calculation (with matrices, because it is faster!)

You know what this is.

You know why our input X has 2 rows? It's because we pack the input (one row per word) into a matrix instead of separate vectors!

The softmax in the figure gives a 2×2 matrix and the values form a 2×3 matrix. We multiply them and get a 2×3 Z matrix. (This is the summing that step 6 talks about!)
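In matrix form the whole layer collapses into one line, softmax(QKᵀ / √d_k) · V. A rough NumPy version with the same shapes as the figure (2 words, 2×2 softmax, 2×3 values, 2×3 output):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (2, 2) score matrix
    return softmax_rows(scores) @ V      # (2, 2) @ (2, 3) -> (2, 3)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3))   # one query row per word (random stand-ins)
K = rng.normal(size=(2, 3))
V = rng.normal(size=(2, 3))

Z = self_attention(Q, K, V)
print(Z.shape)  # (2, 3): one output row per word
```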

Okay, we've got a problem. The paper “Attention Is All You Need” talks about MULTI-HEADED ATTENTION.

So this beast of self-attention model has multiple heads!

MULTI-HEADED ATTENTION

Now this is a very easy thing! Multi-headed attention means self-attention, but with more than one set of Q, K, V matrices. Remember our Q, K, V matrices for each word? Now we have multiple sets of them, so there are several Q, K, V matrices per word!

In the paper there are 8 of these self-attention heads.

See? Z0, Z1, …, Z7 are the score matrices, one per head. Let's unravel Z0.

Each row in Z0 corresponds to the score vector for one word against every other word.

Each column in Z0 corresponds to the actual score for that word.

For example, in Z0, the first row, 4th column means the score between the first word and the 4th word!

But we've got a situation here. The feed-forward network expects a single attention matrix, not the 8 that we have now.

So we concatenate them and multiply the result by another weight matrix, WO!

Concatenation process!
The weight matrix
Resulting matrix that combines the values from all those multi-headed attention matrices!
The overall picture!
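A rough sketch of that fix in NumPy (8 heads and the 512/64 sizes are from the paper, the weights are random stand-ins): run 8 attention heads, concatenate their outputs side by side, and project back down with WO.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n_words, d_model, n_heads, d_k = 2, 512, 8, 64

X = rng.normal(size=(n_words, d_model))   # word embeddings for 2 words

# One set of projection matrices per head (random stand-ins for learned weights).
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # each head: (2, 64)

Z_concat = np.concatenate(heads, axis=-1)      # (2, 512): 8 heads side by side
W_O = rng.normal(size=(n_heads * d_k, d_model))
Z = Z_concat @ W_O                             # (2, 512): ready for the feed-forward net
print(Z.shape)
```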

Positional Encoding

How do we account for the order of the words in the input sequence? What we do is add a vector to each word embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word. By position we also mean the distance between different words in the sequence. OK. So we add a vector to each word embedding with the hope that it somehow captures the position of each word and the distance between words?

YEAH.

The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

See how x1 is changed to x1+position encoding?

Everything will become clearer when we look at an example.

See how x1's positional encoding has the value 0 in its first and second positions, and 1 and 1 in the others?

So for the 512-dimensional vector of each word (the figure looks at 20 words), we can see the positional encoding of each word. The first row corresponds to the first word.
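The specific pattern used in the original paper is built from sines and cosines; here is a short NumPy version of that formula, set up like the figure (20 positions, 512 dimensions). One caveat: this version interleaves sine and cosine dimensions as in the paper, while the figure appears to put the whole sine half first and the cosine half second, which is just a rearrangement of the same values.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]         # (20, 1) positions
    i   = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / np.power(10000, i / d_model)   # (20, 256)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = positional_encoding(20, 512)   # one row per position, as in the figure
# The encoding is simply added to the word embeddings:
# x1 = embedding_of_word_1 + pe[0]
print(pe.shape)  # (20, 512)
```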

One more thing the transformer has is the RESIDUAL!

THE RESIDUALS

It is a combination of a residual connection (like in ResNet) and a normalization layer. Where do we add them? After the self-attention layer and after the feed-forward neural network layer.

This is better explained visually here:

This little tweak is also added to the decoder side!

Residual added Encoder and Decoder
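In code, the whole “Add & Norm” trick is just output = LayerNorm(x + Sublayer(x)), applied once after self-attention and once after the feed-forward network. A minimal sketch (the sublayer outputs below are random stand-ins):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std  = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection (like ResNet) followed by layer normalization.
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))        # input to one encoder layer (2 words)

z   = rng.normal(size=(2, 512))      # stand-in for self_attention(x)
x   = add_and_norm(x, z)             # Add & Norm after self-attention
ffn = rng.normal(size=(2, 512))      # stand-in for feed_forward(x)
out = add_and_norm(x, ffn)           # Add & Norm after the feed-forward layer
print(out.shape)
```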

DECODER

The decoder is fairly simple, apart from the Encoder-Decoder Attention (the kind of attention Andrew Ng has described in his Coursera videos).

Let’s see how they work as a unit.

The input, after passing through the encoder stack, is transformed by the top encoder layer into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence. (This is the ANDREW NG section I told you about. It's on YouTube/Coursera.)

The K and V of the final encoder layer are passed to the decoder. Now the decoder outputs a value.

This value is fed back into the decoder as input for the next step.

This step is repeated until the decoder outputs the special end of sentence token.

First the decoder outputs “I”, then “I” is fed into the decoder, which produces “am”, and then “a” and “student” are produced similarly.

Now for the self-attention layer inside the decoder. Remember the decoder architecture?

Here let me remind you.

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This means a word is not allowed to look at the words that come after it (no peeking into the future part of the sentence).

This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. We set the score to -inf so that the softmax gives those positions zero attention.
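A rough sketch of that masking trick: before the softmax, every score that points at a future position is set to -inf, so its softmax weight comes out as exactly zero.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4  # 4 output positions
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))        # raw Q.K^T scores in the decoder (stand-ins)

# Mask: position i may only look at positions 0..i (no peeking at the future).
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

weights = softmax_rows(scores)
print(np.round(weights, 2))  # the upper triangle is all zeros
```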

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.

The image is not self-explanatory. I suggest the ANDREW NG attention video.
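If it helps, here is the same idea in code form (a hedged sketch, with random stand-ins for both sides): it is the same attention function as before, only the queries come from the decoder while the keys and values come from the encoder output.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_k = 64
decoder_states = rng.normal(size=(3, d_k))   # from the decoder layer below (stand-in)
encoder_output = rng.normal(size=(5, d_k))   # from the top of the encoder stack (stand-in)

# Queries come from the decoder; keys and values come from the encoder.
# (In a real model each of these would first go through its own W_Q, W_K, W_V.)
Q = decoder_states
K = V = encoder_output
context = attention(Q, K, V)
print(context.shape)  # (3, 64): one context vector per decoder position
```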

Final layer

The decoder stack outputs a vector of floats.

How do we turn that into a word? If you don't know how to do this and are reading this line, then you have come into transformers without the basics.

You know that we get a vector as output from the decoder. We then project this vector to logits using a linear layer (a simple NN) and then use softmax to turn it into probabilities.
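A minimal sketch of that final step (the vocabulary and weights are made up): project the decoder's output vector to one logit per vocabulary word, softmax it, and pick a word, here simply the most probable one (more on that choice below).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model = 512
vocab = ["i", "am", "a", "student", "<eos>"]   # toy vocabulary

W_proj = rng.normal(size=(d_model, len(vocab)))   # the final linear layer (stand-in)

decoder_output = rng.normal(size=d_model)   # the vector the decoder stack spits out
logits = decoder_output @ W_proj            # one score per vocabulary word
probs  = softmax(logits)                    # probabilities over the vocabulary

predicted_word = vocab[int(np.argmax(probs))]   # pick the most probable word
print(predicted_word, probs.round(2))
```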

Accessing Output

Choices of the highest probability

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding).

But is this safe? I mean, I wouldn't have brought it up if it were, so no, it is not always the best idea to pick the highest-probability word as the selected word.

That’s why we go for BEAM SEARCH

The way beam search works: suppose our model predicted “I” as the most likely first word. We don't commit to it right away. Instead we keep two candidates: “I”, the word with the highest probability, and “a”, the second most probable word. Then we run the model once assuming the output starts with “I” and again assuming it starts with “a”. Whichever version gives us the better overall probability (less error), we keep that word. How deep should we go for each candidate? By depth I mean: if I choose “I” as the first word, how many more words should I look ahead before scoring the error. Since we kept 2 candidate words, this beam search has a width of 2; if we kept 3 words, the beam width would be 3.
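Here is a toy beam search with width 2 (the next_word_probs function is a made-up stand-in for the transformer decoder): at every step we keep the 2 best partial sentences by total log-probability instead of committing to a single greedy pick.

```python
import numpy as np

# Made-up stand-in for the decoder: given a partial sentence, return
# a probability for every word in a tiny vocabulary.
VOCAB = ["i", "am", "a", "student", "<eos>"]

def next_word_probs(prefix):
    rng = np.random.default_rng(len(prefix))   # deterministic toy distribution
    p = rng.random(len(VOCAB))
    return p / p.sum()

def beam_search(beam_width=2, max_len=5):
    beams = [([], 0.0)]                        # (partial sentence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == "<eos>":
                candidates.append((words, logp))   # finished sentence, keep as-is
                continue
            probs = next_word_probs(words)
            for w, p in zip(VOCAB, probs):
                candidates.append((words + [w], logp + np.log(p)))
        # Keep only the `beam_width` best partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(beam_search())
```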
