keras and sequence to sequence

In the previous article we implemented an LSTM model, and now we will implement a sequence to sequence model. Nowadays, sequence to sequence and attention-based models are widely used in natural language processing tasks such as machine translation, and BERT is also built on the attention mechanism.

In this section, we will review and implement the basic sequence to sequence model. We will build a model that translates $y=\sin x$ into $y=\cos x$. We will not go into the details of the model here, as plenty of information about it can be found by searching. Depending on the literature, textbook, or engineer, the sequence to sequence model is also referred to as the “Encoder-Decoder model” or the “sequence transformation model”.

In the following, we will implement seq2seq using keras. For more details, please refer to the official keras blog.

github

  • The file in jupyter notebook format is available here.

google colaboratory

  • To run it on google colaboratory, click here.

Author’s environment

The author’s OS is macOS, so command options may differ from those on Linux and Unix.

sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G6020

python -V
Python 3.7.3

Import the basic libraries and keras and check their versions.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np

import tensorflow as tf
from tensorflow import keras
import gensim

print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)
print('tensorflow version : ', tf.__version__)
print('keras version : ', keras.__version__)
print('gensim version : ', gensim.__version__)
matplotlib version : 3.0.3
scipy version : 1.4.1
numpy version : 1.19.4
tensorflow version : 2.4.0
keras version : 2.4.0
gensim version : 3.8.3

Input and output data for the sequence to sequence model

First, let's briefly review the sequence to sequence algorithm and the kind of input and output data it needs when run in keras.

Image of data input/output

A sequence to sequence model consists of two parts, an encoder and a decoder, each built from a recurrent model such as an RNN or LSTM. This structure makes it well suited to analyzing time series data, and it is used in fields such as machine translation and speech recognition.

To implement seq2seq in keras, we need input data for the encoder and for the decoder (datasets 1 and 2 in the figure) and correct answer data for the decoder (dataset 3).

The key point is that the decoder input must be offset by one time step from the correct answer data: at each step the decoder receives the previous correct value.
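
As a minimal sketch of this offset (using a hypothetical toy target sequence, not the data from this article), the decoder input is simply the correct answer shifted right by one step:

import numpy as np

# Toy example: the decoder input is the target shifted right by one time step,
# so at each step the decoder sees the previous correct value.
target = np.array([10.0, 20.0, 30.0, 40.0])  # what the decoder should produce
decoder_input = np.zeros_like(target)
decoder_input[1:] = target[:-1]

print(decoder_input)  # [ 0. 10. 20. 30.]
print(target)         # [10. 20. 30. 40.]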


Sample data

We will use the following expressions as encoder and decoder data for the sample.

$$ \text{encoder} : y = \sin x $$

$$ \text{decoder} : y = \cos x $$

That is, the model converts a sine curve into a cosine curve.

## seq2seq sin cos
x = np.linspace(-2*np.pi, 2*np.pi, 100) # from -2 pi to 2 pi
seq_in = np.sin(x)
seq_out = np.cos(x)

Check the data.

Let’s look at the details of the $x$, seq_in, and seq_out arrays.

print('shape : ', x.shape)
print('ndim : ', x.ndim)
print('data : ', x[:10])
shape : (100,)
ndim : 1
data : [-6.28318531 -6.15625227 -6.02931923 -5.9023862 -5.77545316 -5.64852012
 -5.52158709 -5.39465405 -5.26772102 -5.14078798]
print('shape : ', seq_in.shape)
print('ndim : ', seq_in.ndim)
print('data : ', seq_in[:10])
shape : (100,)
ndim : 1
data : [2.44929360e-16 1.26592454e-01 2.51147987e-01 3.71662456e-01
 4.86196736e-01 5.92907929e-01 6.90079011e-01 7.76146464e-01
 8.49725430e-01 9.09631995e-01]
print('shape : ', seq_out.shape)
print('ndim : ', seq_out.ndim)
print('data : ', seq_out[:10])
shape : (100,)
ndim : 1
data : [1. 0.99195481 0.9679487 0.92836793 0.87384938 0.80527026
 0.72373404 0.63055267 0.52722547 0.41541501]

Let’s check the graph.

plt.plot(x, seq_in, label=r'$y=\sin x$')
plt.plot(x, seq_out, label=r'$y=\cos x$')
plt.legend()
plt.grid()
plt.show()

Prepare data and parameters

Set the values of the hyperparameters to be used for training.

# Parameters for LSTM network
NUM_LSTM = 24
NUM_MID = 75

# Parameters for training.
batch_size = 10
epochs = 35

Store the data to be fed to keras in numpy arrays.

n = len(x) - NUM_LSTM
ex = np.zeros((n, NUM_LSTM))  # encoder input
dx = np.zeros((n, NUM_LSTM))  # decoder input
dy = np.zeros((n, NUM_LSTM))  # decoder target (correct answer)

for i in range(0, n):
  ex[i] = seq_in[i : i + NUM_LSTM]
  dx[i, 1:] = seq_out[i : i + NUM_LSTM - 1]  # decoder input is the target shifted one step back
  dy[i] = seq_out[i : i + NUM_LSTM]

# Reshape to (samples, time steps, features) as expected by the LSTM layers
ex = ex.reshape(n, NUM_LSTM, 1)
dx = dx.reshape(n, NUM_LSTM, 1)
dy = dy.reshape(n, NUM_LSTM, 1)
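
As a quick sanity check (assuming the cells above have been run), we can confirm the shapes and the one-step offset between the decoder input and the correct answer data:

# With 100 points and NUM_LSTM = 24, n = 76
print(ex.shape, dx.shape, dy.shape)              # (76, 24, 1) for each array
print(np.allclose(dx[0, 1:, 0], dy[0, :-1, 0]))  # True: the decoder input lags the target by one step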

Building the model

We will implement the sequence to sequence model using keras. Unlike with simple RNNs and LSTMs, we use the functional API (Model) instead of Sequential, because the model consists of multiple parts and needs an input for each of them.

For more details, please refer to the official blog of keras.

from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

e_input = Input(shape=(NUM_LSTM, 1))
e_lstm = LSTM(NUM_MID, return_state=True)
e_output, e_state_h, e_state_c = e_lstm(e_input)

# The hidden state and cell state must be passed to the decoder side, so collect them in a list
e_state = [e_state_h, e_state_c]

d_input = Input(shape=(NUM_LSTM, 1))
d_lstm = LSTM(NUM_MID, return_sequences=True, return_state=True)
d_output, _, _ = d_lstm(d_input, initial_state=e_state)
d_dense = Dense(1, activation='linear')
d_output = d_dense(d_output)

seq2seq_model = Model([e_input, d_input], d_output)

# Define the optimizer and the loss function; mean squared error suits this regression task
seq2seq_model.compile(optimizer="adam", loss="mean_squared_error")
# seq2seq_model.compile(optimizer='rmsprop', loss='mean_squared_error')  # an alternative with little difference

# Check the model
print(seq2seq_model.summary())
Model: "model".
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 24, 1)] 0
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, 24, 1)] 0
__________________________________________________________________________________________________
lstm (LSTM) [(None, 75), (None, 23100 input_1[0][0])
__________________________________________________________________________________________________
lstm_1 (LSTM) [(None, 24, 75), (None, 23100 input_2[0][0])
                                                                 lstm[0][1]
                                                                 lstm[0][2]
__________________________________________________________________________________________________
dense (Dense) (None, 24, 1) 76 lstm_1[0][0]
==================================================================================================
Total params: 46,276
Trainable params: 46,276
Non-trainable params: 0
__________________________________________________________________________________________________
None

Train the model

history = seq2seq_model.fit([ex, dx], dy, epochs=epochs, batch_size=batch_size, verbose=False)

Loss function

Visualize how the loss decreases during training.

loss = history.history['loss']
plt.plot(np.arange(len(loss)), loss, label='loss')
plt.grid()
plt.legend()
plt.show()

You can see that it converges well enough.

Next, we create the encoder and decoder models used for prediction, reusing the layers trained above.

Build the encoder model for prediction.

It takes the same encoder input and outputs the trained hidden state and cell state.

pred_e_model = Model(e_input, e_state)
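
As a quick check (an illustrative snippet, assuming the training above has finished), the inference encoder returns the hidden state and cell state, each of shape (batch size, NUM_MID):

# The encoder inference model maps an input sequence to its final LSTM states
h, c = pred_e_model.predict(ex[0:1])
print(h.shape, c.shape)  # (1, 75) (1, 75)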

Build the decoder model for prediction.

### Build the decoder model for prediction
pred_d_input = Input(shape=(1, 1))

pred_d_state_in_h = Input(shape=(NUM_MID,))
pred_d_state_in_c = Input(shape=(NUM_MID,))
pred_d_state_in = [pred_d_state_in_h, pred_d_state_in_c]

# Use the LSTM that we used during training
pred_d_output, pred_d_state_h, pred_d_state_c = d_lstm(pred_d_input, initial_state=pred_d_state_in)
pred_d_state = [pred_d_state_h, pred_d_state_c]

# Use the DENSE layer that was used during training
pred_d_output = d_dense(pred_d_output)
pred_d_model = Model([pred_d_input] + pred_d_state_in, [pred_d_output] + pred_d_state)

Define a function for prediction.

Build a function that takes the input data, decodes it, and outputs the result.

def get_output_data(input_data):
  # Encode the input sequence to get the initial states for the decoder
  state_value = pred_e_model.predict(input_data)
  _dy = np.zeros((1, 1, 1))

  output_data = []
  for i in range(0, NUM_LSTM):
    # Predict one step, then feed the prediction and the new states back in for the next step
    y_output, y_state_h, y_state_c = pred_d_model.predict([_dy] + state_value)

    output_data.append(y_output[0, 0, 0])
    _dy[0, 0, 0] = y_output[0, 0, 0]
    state_value = [y_state_h, y_state_c]

  return output_data

Check the result

init_points = [0, 24, 49, 74]

for i in init_points:
  _x = ex[i : i + 1]
  _y = get_output_data(_x)

  if i == 0:
    plt.plot(x[i : i + NUM_LSTM], _x.reshape(-1), color="b", label='input')
    plt.plot(x[i : i + NUM_LSTM], _y, color="r", label='output')
  else:
    plt.plot(x[i : i + NUM_LSTM], _x.reshape(-1), color="b")
    plt.plot(x[i : i + NUM_LSTM], _y, color="r")

plt.plot(x, seq_out, color = 'r', linestyle = "dashed", label = 'correct')
plt.grid()
plt.legend()
plt.show()

The model seems to capture the overall shape. With a bit more tuning the conversion could probably be improved, but since this is a demo, I’ll leave it at that.

Various seq2seq

Let’s turn the last code into a function and see what happens to the output for various inputs.

from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model


def main(x, seq_in, seq_out):

  # Prepare the encoder input, decoder input and decoder target
  n = len(x) - NUM_LSTM
  ex = np.zeros((n, NUM_LSTM))
  dx = np.zeros((n, NUM_LSTM))
  dy = np.zeros((n, NUM_LSTM))

  for i in range(0, n):
    ex[i] = seq_in[i : i + NUM_LSTM]
    dx[i, 1:] = seq_out[i : i + NUM_LSTM - 1]
    dy[i] = seq_out[i : i + NUM_LSTM]

  ex = ex.reshape(n, NUM_LSTM, 1)
  dx = dx.reshape(n, NUM_LSTM, 1)
  dy = dy.reshape(n, NUM_LSTM, 1)

  # Build the training model
  e_input = Input(shape=(NUM_LSTM, 1))
  e_lstm = LSTM(NUM_MID, return_state=True)
  e_output, e_state_h, e_state_c = e_lstm(e_input)

  # List the encoder states
  e_state = [e_state_h, e_state_c]

  d_input = Input(shape=(NUM_LSTM, 1))
  d_lstm = LSTM(NUM_MID, return_sequences=True, return_state=True)
  d_output, _, _ = d_lstm(d_input, initial_state=e_state)
  d_dense = Dense(1, activation='linear')
  d_output = d_dense(d_output)

  seq2seq_model = Model([e_input, d_input], d_output)
  seq2seq_model.compile(optimizer='adam', loss='mean_squared_error')

  ## Training
  history = seq2seq_model.fit([ex, dx], dy, epochs=epochs, batch_size=batch_size, verbose=False)

  ## Rebuild the prediction models so that get_output_data uses the layers trained in this call
  global pred_e_model, pred_d_model
  pred_e_model = Model(e_input, e_state)

  pred_d_input = Input(shape=(1, 1))
  pred_d_state_in_h = Input(shape=(NUM_MID,))
  pred_d_state_in_c = Input(shape=(NUM_MID,))
  pred_d_state_in = [pred_d_state_in_h, pred_d_state_in_c]
  pred_d_output, pred_d_state_h, pred_d_state_c = d_lstm(pred_d_input, initial_state=pred_d_state_in)
  pred_d_state = [pred_d_state_h, pred_d_state_c]
  pred_d_output = d_dense(pred_d_output)
  pred_d_model = Model([pred_d_input] + pred_d_state_in, [pred_d_output] + pred_d_state)

  ## Check the result
  init_points = [0, 24, 49, 74]

  for i in init_points:
    _x = ex[i : i + 1]
    _y = get_output_data(_x)

    if i == 0:
      plt.plot(x[i : i + NUM_LSTM], _x.reshape(-1), color="b", label='input')
      plt.plot(x[i : i + NUM_LSTM], _y, color="r", label='output')
    else:
      plt.plot(x[i : i + NUM_LSTM], _x.reshape(-1), color="b")
      plt.plot(x[i : i + NUM_LSTM], _y, color="r")

  plt.plot(x, seq_out, color='r', linestyle="dashed", label='correct')
  plt.grid()
  plt.legend()
  plt.show()

seq2seq damped cosine to damped sine

x = np.linspace(0, 5 * np.pi, 100)
seq_in = np.exp(-x / 5) * (np.cos(x))
seq_out = np.exp(-x / 5) * (np.sin(x))

main(x, seq_in, seq_out)

seq2seq $x^2$ to $x^{0.5}$

x = np.linspace(0, 1.5, 100)
seq_out = np.array(x ** 0.5)
seq_in = np.array(x ** 2)

main(x, seq_in, seq_out)

For anything other than the trigonometric functions, the model would need further tuning to be practically useful (in particular, for the damped oscillation curve the output is 180 degrees out of phase, which is a critical error). Since this is an exercise to get used to seq2seq in keras, I’ll leave it at that.

At the risk of repeating myself, please refer to the official keras blog for more details.