Seq to Seq model — RNN
This blog post is inspired by another blog on seq2seq machine translation that I am following.
Sequence to Sequence (seq2seq) models solve complex language-related problems like machine translation, question answering, chatbots, and text summarization.
We will look at machine translation. I mean, why not?
Encoder-Decoder Architecture:
The most common architecture used to build seq2seq models is the encoder-decoder architecture.
The encoder (an LSTM) reads the input sequence and summarizes the information into what are called the internal state vectors. We discard the outputs of the encoder and preserve only the internal states.
The decoder is an LSTM whose initial states are initialized to the final states of the encoder LSTM. Using these initial states, the decoder starts generating the output sequence.
We will look into the detailed flow of this.
Encoder:
We take an LSTM as the network. An LSTM generally consists of inputs, outputs, hidden states and a cell state.
The blog I am following uses English to Marathi as the example. Since I am just following it, it seems like a nice idea to do whatever it does.
Input sentence (English)=> “Rahul is a good boy”
Output sentence (Marathi) => “राहुल चांगला मुलगा आहे”
We feed the sentence into the model word by word instead of character-wise. So, X1 = ‘Rahul’, X2 = ‘is’, X3 = ‘a’, X4 = ‘good’, X5 = ‘boy’.
We will use an embedding layer instead of one-hot vectors as the input to the encoder LSTM, as an embedding gives much more information about a word than a one-hot vector does.
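To see the contrast, here is a tiny made-up illustration (the numbers are invented, not taken from any trained model):

```python
import numpy as np

# Toy contrast for the word 'good' in a 5-word vocabulary.
one_hot_good = np.array([0, 0, 0, 1, 0])        # sparse: only marks the word's index
embedding_good = np.array([0.12, -0.83, 0.45])  # dense: learned values that capture meaning
print(one_hot_good, embedding_good)
```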
The question is: what are the roles of the internal states (hi and ci) at each time step?
In very simple terms, they remember what the LSTM has learned so far. For example:
h3, c3 => these two vectors remember that the network has read “Rahul is a” so far. Basically, they are the summary of the information up to time step 3, stored in the vectors h3 and c3.
The states coming out of the last time step (h5, c5) are called the “thought vectors”, as they summarize the entire sequence in vector form. h5 and c5 contain the information of the whole sentence.
We will discard the outputs Yi of the encoder for our problem, as we are not using them.
Easy till now?
Decoder:
Given the input sentence “Rahul is a good boy”, the goal of the training process is to train the decoder to output “राहुल चांगला मुलगा आहे”. Just as the encoder scanned the input sequence word by word, the decoder will generate the output sequence word by word.
Two tokens need to be added to mark the start and end of the sentence: START_ and END_.
The figure explains it all. The input and the output of the decoder LSTM are the same target sentence, just offset by one time step. h0, c0 of the decoder are h5, c5 of the encoder. So h5, c5 of the encoder preserve the information of the English sentence and pass it to the decoder.
Notice how the first input to the decoder is the START_ token and the last output is the END_ token. The final states of the decoder are discarded.
We use a technique called “Teacher Forcing”, wherein the input at each time step is the actual output (and not the predicted output) from the previous time step.
Teacher Forcing:
Say our target sentence is “I am awesome”. To send it to the decoder we use “START_ I am awesome END_”. So, at the first time step, we send START_ and the decoder gives us some word, hopefully “I”. What if it gives “hello” instead of “I”? That would throw off the rest of the sequence, so we force the next input to be the actual word “I” from the target, not whatever the decoder predicted.
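To make that shift concrete, here is a tiny toy illustration in plain Python (just the word lists, not the model code):

```python
# Teacher forcing on the toy target "I am awesome":
# the decoder input at step t is the ground-truth word from step t-1,
# no matter what the decoder actually predicted at step t-1.
decoder_input_words = ["START_", "I", "am", "awesome"]
decoder_target_words = ["I", "am", "awesome", "END_"]

for t, (fed, expected) in enumerate(zip(decoder_input_words, decoder_target_words)):
    print(f"t={t}: feed '{fed}' -> train to predict '{expected}'")
```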
Decoder in INFERENCE mode:
[The original post walks through the inference steps with figures: the first step, step 2, t = 4 and t = 5.]
Actual Inference Algorithm:
- Only one word is generated at a time, so the decoder is called in a loop.
- The initial input to the decoder is always the START_ token.
- At each time step, we preserve the states of the decoder and set them as the initial states for the next time step (within a single forward pass the LSTM does this itself; across loop iterations we pass the states in explicitly).
- We break the loop when the decoder predicts the END_ token.
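Put together, the loop looks roughly like this (pseudocode only; encoder_predict and decoder_step are hypothetical stand-ins for the Keras inference models built later in the post):

```python
def translate(input_sentence, max_len=50):
    # encoder_predict / decoder_step are hypothetical helpers, not real API calls.
    states = encoder_predict(input_sentence)       # thought vectors (h, c)
    word, output = "START_", []
    while True:
        word, states = decoder_step(word, states)  # one word at a time, states carried over
        if word == "END_" or len(output) >= max_len:
            break
        output.append(word)
    return " ".join(output)
```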
I am doing this in my notebook, so I will update here soon.
So I went to the code and needed a refresher, because boy, LSTMs are tough! So I decided to write a separate post on LSTM usage in Keras.
That separate post is now written. I now have to do one thing: explain the code.
You should be able to work out the preprocessing steps yourself, because explaining them would give me a headache for sure.
So, I will explain from the data generation step onwards.
So, our x_train is all the English sentences, and y_train is all the Marathi sentences, where each sentence has the START_ {words} END_ structure. The data is produced by a generator function. We do not build and return a full list, because that consumes an ample amount of memory. So here is the tip: if you can wait and don’t want to overload your machine, or your machine simply cannot hold that much in memory, use a generator. A rough skeleton is shown below.
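This is roughly what the generator skeleton looks like (generate_batch and max_length_tar are my own placeholder names; max_length_src, decoder_input_data, decoder_output_data and number_decoder_tokens follow the description below):

```python
import numpy as np

def generate_batch(X, y, batch_size=128):
    """Yield batches forever instead of materialising one huge list in memory."""
    while True:
        for j in range(0, len(X), batch_size):
            # max_length_src / max_length_tar: longest source / target sentence;
            # number_decoder_tokens: target vocabulary size (+1). Assumed to be
            # computed in the preprocessing step.
            encoder_input_data = np.zeros((batch_size, max_length_src), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar), dtype='float32')
            decoder_output_data = np.zeros(
                (batch_size, max_length_tar, number_decoder_tokens), dtype='float32')
            # ... fill the three arrays for this slice of X and y (sketch further below) ...
            yield [encoder_input_data, decoder_input_data], decoder_output_data
```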
So, we have encoder_input_data, which is a matrix of shape (batch_size, max_length_src), where max_length_src is the maximum length of an input sentence. Since we cannot make the number of time steps dynamic, instead of cutting the longer sentences short we pad the shorter sentences so that they have the same length as the longest one. The same applies to decoder_input_data.
But something is different about decoder_output_data. Also, recall what else the decoder receives: the final state of the encoder, right? So we take the final state from the encoder and pass it to the decoder. We will come to that later.
Next we fill those arrays.
The encoder_input_data is filled with the index of each word; we have not one-hot encoded it. So if our sentence is [he is a good boy], its numeric representation is [1, 3, 2, 6, 5] and the max sentence length is 7, then for the word “he” we fill position 0, giving [1, 0, 0, 0, 0, 0, 0], for “is” we fill position 1, and so on, until the row reads [1, 3, 2, 6, 5, 0, 0], zero-padded at the end.
You see what I did there?
Now for decoder_input_data, we do the same but with the target sentences, and we don’t use the final word from the target output, i.e. the END_ token.
Why? We studied that before.
Now, decoder_output_data is the one-hot encoded version of each word: every word of the target is one-hot encoded. As you can see, number_decoder_tokens has the value len(total_target_words) + 1. So why did we do +1 in the preprocessing step? To make room for the END_ token.
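Roughly, the filling step inside the generator looks like this (input_token_index and target_token_index are assumed word-to-index dictionaries; the names are my guesses based on the description above):

```python
# Continuing the generator sketch: fill one batch worth of arrays.
for i, (input_text, target_text) in enumerate(zip(X[j:j + batch_size], y[j:j + batch_size])):
    for t, word in enumerate(input_text.split()):
        encoder_input_data[i, t] = input_token_index[word]        # plain word indices, zero-padded
    target_words = target_text.split()                            # 'START_ ... END_'
    for t, word in enumerate(target_words):
        if t < len(target_words) - 1:
            decoder_input_data[i, t] = target_token_index[word]   # decoder input: drops END_
        if t > 0:
            # decoder output: one-hot, shifted one step ahead (drops START_)
            decoder_output_data[i, t - 1, target_token_index[word]] = 1.0
```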
Encoder Model Code:
We have an Input layer, because the functional API needs it. We have the embedding layer, and we have the encoder LSTM. So what values do we need from the encoder LSTM? Its final cell state and final hidden state.
So our input on the encoder side would be [‘Rahul is a good boy’], and the corresponding target would be the one-hot sequence for [‘राहुल एक चांगला मुलगा आहे END_’]. You know what this is and why it is one-hot: one-hot targets are easier to work with.
But in the code we have not included the output yet; we will come to that later.
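Here is a minimal sketch of that encoder, assuming num_encoder_tokens is the English vocabulary size and latent_dim the embedding / LSTM size (both are placeholder names, not necessarily the ones in my notebook):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM

latent_dim = 256  # assumed LSTM / embedding size

encoder_inputs = Input(shape=(None,))                         # sequence of word indices
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)     # we only keep the states
encoder_states = [state_h, state_c]                           # the thought vectors
```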
Decoder Model Code:
We have the decoder input, and the decoder LSTM takes its initial_state from the encoder LSTM. Remember how we made a list [state_h, state_c] from the encoder? That is what initializes the decoder. What values does the decoder input take? We feed in ‘START_ राहुल एक चांगला मुलगा आहे’. Remember, this decoder input does not include the END_ token. But here is the most important thing: where is teacher forcing? We need teacher forcing, and that is exactly why the decoder gets its own input during training! Every input is the next actual word from the target, not whatever the decoder has predicted. If we were to depend on the words generated by the decoder itself during training, we might get bad results and it might take longer to train, so we use TEACHER FORCING.
Now for the output. What should the output of this decoder be? Each word has the next word as its output, a one-hot version of ‘राहुल एक चांगला मुलगा आहे END_’. Wait, this sounds familiar; I think I have used this output somewhere else as well. Yes, it is the same sequence we described as the target back in the encoder section.
So our encoder-side target and decoder output follow the same structure. But hold on, every time I pass an input, the LSTM should give me a word index back, right? So what we do is produce a one-hot-sized output vector. How do we do that? We use SOFTMAX.
The output of the softmax is a vector as long as the total number of target words, which plays the role of the one-hot output. So below is a model that has two inputs and a single output. Isn’t that cool?
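A minimal sketch of the decoder side and the combined training model, continuing the encoder sketch above (number_decoder_tokens is the target vocabulary size from the preprocessing step):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Decoder for training: teacher forcing means the shifted target sentence is fed in as input.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(number_decoder_tokens, latent_dim, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

# Softmax over the target vocabulary at every time step plays the role of the one-hot output.
decoder_dense = Dense(number_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Two inputs, one output.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```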
Inference Model:
Now what is left is the inference model, and this one seems a bit tricky. Let me boil it down properly for myself.
The initial state of the decoder is the final state of the encoder. This final state, as you might recall, is the pair of thought vectors.
After each iteration, we preserve the decoder’s states and set them as the initial states for the next time step. The predicted word is then fed back in as the next input. When the decoder predicts the END_ token, we break the loop.
That was a short recap, but let’s understand the code now!
This DECODER INFERENCE code looks way too difficult, so let’s break it down piece by piece.
For the decoder inference model, we need the thought vectors from the encoder. So the encoder_model input is encoder_inputs and its output is the final states. Those final states give us the thought vectors!
Next we set up the decoder. What are the inputs to the decoder here? The final states of the encoder, which are also called the thought vectors. Then we use decoder_lstm, whose initial_state is set to the thought vectors. The output is the next word to be predicted!
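Something like the following, reusing the layers trained above (the variable names follow the standard Keras seq2seq recipe and are assumptions on my part):

```python
# Encoder inference model: input sentence -> thought vectors.
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder inference model: runs one step, taking the states in and handing them back.
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = dec_emb_layer(decoder_inputs)                    # reuse the trained embedding
decoder_outputs2, state_h2, state_c2 = decoder_lstm(
    dec_emb2, initial_state=decoder_states_inputs)
decoder_outputs2 = decoder_dense(decoder_outputs2)          # softmax over the target vocabulary

decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                      [decoder_outputs2, state_h2, state_c2])
```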
#Predicting data using Inference
So we get the final states from the encoder.
Then what is the first word to be sent into the model? The START_ token, and that is exactly what we do!
We stop the loop when the model predicts the END_ token, or when it just keeps looping and the generated length becomes greater than 50.
During each loop iteration, what we do is:
We get the predicted next word, plus the hidden state and cell state, for the current input. For the first iteration, the input word is START_ and the model predicts some word. We join that word onto our target sentence.
Then we check the stopping condition.
Then we set up the next target sequence. The next target sequence is just the word that was predicted. And what is the initial state for this word as input? The [h, c] produced while processing the START_ token (or the previous token, to speak generally).
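Putting those steps into code, the prediction loop looks roughly like this (reverse_target_word_index is an assumed index-to-word dictionary; the structure follows the standard Keras seq2seq recipe rather than the exact code in my notebook):

```python
import numpy as np

def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)       # thought vectors from the encoder

    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_token_index['START_']       # the first word fed to the decoder

    decoded_sentence = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Pick the most likely next word from the softmax output.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = reverse_target_word_index[sampled_token_index]

        if sampled_word == 'END_' or len(decoded_sentence) > 50:
            break
        decoded_sentence += ' ' + sampled_word

        # The predicted word becomes the next input; the states carry over.
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]

    return decoded_sentence
```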
That’s it!