Basics of keras and GRU, Comparison with LSTM

GRU is a model designed to address the large number of parameters in the LSTM, i.e., its high computational cost. It merges the memory-update and memory-forget operations into a single gate, which reduces the number of parameters and the computational cost. I’ll spare you the details, as you can find plenty of explanations by searching. Here is a comparison between GRU and LSTM.
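
For reference, one common formulation of the GRU (Cho et al., 2014) is shown below. There is no separate cell state, and the single update gate $z_t$ takes over the roles of the LSTM’s forget and input gates:

$$ \begin{aligned} z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\ r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\ \tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\ h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned} $$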

github

  • The file in jupyter notebook format is here

google colaboratory

  • If you want to run it in google colaboratory, it is here

Author’s environment

The author’s OS is macOS, so some command-line options differ from those on Linux and other Unix systems.

! sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G6020
!python -V
Python 3.7.3

Import the basic libraries and keras and check their versions.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np

import tensorflow as tf
from tensorflow import keras

print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)
print('tensorflow version : ', tf.__version__)
print('keras version : ', keras.__version__)
matplotlib version : 3.0.3
scipy version : 1.4.1
numpy version : 1.19.4
tensorflow version : 2.1.0
keras version : 2.2.4-tf

Damped vibration curve

For the sample data, we will sample from the following equation.

$$ y = \exp\left(-\frac{x}{\tau}\right)\cos(x) $$

For comparison with the LSTM, we use the same function for the sample data:

x = np.linspace(0, 5 * np.pi, 200)
y = np.exp(-x / 5) * (np.cos(x))

Checking the data

Let’s look at the details of the $x$ and $y$ data.

print('shape : ', x.shape)
print('ndim : ', x.ndim)
print('data : ', x[:10])
shape : (200,)
ndim : 1
data : [0. 0.07893449 0.15786898 0.23680347 0.31573796 0.39467244
 0.47360693 0.55254142 0.63147591 0.7104104 ]
print('shape : ', y.shape)
print('ndim : ', y.ndim)
print('data : ', y[:10])
shape : (200,)
ndim : 1
data : [1. 0.98127212 0.9568705 0.92712705 0.89239742 0.85305798
 0.80950282 0.76214062 0.71139167 0.65768474]

Let’s check the graph.

plt.plot(x,y)
plt.grid()
plt.show()

With $\tau=5$, we get a nicely decaying curve.

Building the neural net

We will preprocess the data into overlapping windows so it can be fed into keras, and then build the recurrent neural nets.
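
Before the actual code, here is a minimal toy sketch of the sliding-window preprocessing used below (the array and window size here are made up purely for illustration): each input window of consecutive values is paired with the same window shifted one step ahead.

import numpy as np

seq = np.arange(8, dtype=float)  # stand-in for the sampled y values
window = 3                       # stand-in for NUM_GRU

m = len(seq) - window
toy_x = np.zeros((m, window))
toy_y = np.zeros((m, window))
for i in range(m):
    toy_x[i] = seq[i: i + window]          # input window
    toy_y[i] = seq[i + 1: i + window + 1]  # same window shifted one step ahead

print(toy_x[0], toy_y[0])  # [0. 1. 2.] [1. 2. 3.]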

The specification of compile is as follows.

compile(self, optimizer, loss, metrics=None, sample_weight_mode=None, weighted_metrics=None, target_tensors=None)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import GRU

NUM_GRU = 20
NUM_MIDDLE = 40

# Preprocess the data
n = len(x) - NUM_GRU
r_x = np.zeros((n, NUM_GRU))
r_y = np.zeros((n, NUM_GRU))
for i in range(0, n):
  r_x[i] = y[i: i + NUM_GRU]
  r_y[i] = y[i + 1: i + NUM_GRU + 1]

r_x = r_x.reshape(n, NUM_GRU, 1)
r_y = r_y.reshape(n, NUM_GRU, 1)

# Build a gru neural net
gru_model = Sequential()
gru_model.add(GRU(NUM_MIDDLE, input_shape=(NUM_GRU, 1), return_sequences=True))
gru_model.add(Dense(1, activation="linear"))
gru_model.compile(loss="mean_squared_error", optimizer="sgd")

# Build the LSTM neural net
lstm_model = Sequential()
lstm_model.add(LSTM(NUM_MIDDLE, input_shape=(NUM_GRU, 1), return_sequences=True))
lstm_model.add(Dense(1, activation="linear"))
lstm_model.compile(loss="mean_squared_error", optimizer="sgd")

Let’s check the shape of the data to be fed in and a summary of each model.

print(r_y.shape)
print(r_x.shape)
(180, 20, 1)
(180, 20, 1)

Comparing the two models, we can see that the LSTM has more parameters than the GRU: the GRU has roughly 20% fewer. The LSTM also takes longer to train.

print(gru_model.summary())
print(lstm_model.summary())
Model: "sequential".
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru (GRU) (None, 20, 40) 5160
_________________________________________________________________
dense (Dense) (None, 20, 1) 41
=================================================================
Total params: 5,201
Trainable params: 5,201
Non-trainable params: 0
_________________________________________________________________
None
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 20, 40) 6720
_________________________________________________________________
dense_1 (Dense) (None, 20, 1) 41
=================================================================
Total params: 6,761
Trainable params: 6,761
Non-trainable params: 0
_________________________________________________________________
None
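
As a rough sanity check of the counts above, the standard parameter formulas for tf.keras (GRU with the default reset_after=True, and LSTM) reproduce them. This is just a back-of-the-envelope sketch:

units = NUM_MIDDLE  # 40
input_dim = 1

# GRU (reset_after=True): three blocks (update gate, reset gate, candidate),
# each with input weights, recurrent weights and two bias vectors
gru_params = 3 * (input_dim * units + units * units + 2 * units)     # 5160
# LSTM: four blocks (input, forget, output gates and candidate),
# each with input weights, recurrent weights and one bias vector
lstm_params = 4 * (input_dim * units + units * units + units)        # 6720
dense_params = units * 1 + 1                                         # 41

print(gru_params + dense_params, lstm_params + dense_params)         # 5201 6761
print(1 - (gru_params + dense_params) / (lstm_params + dense_params))  # about 0.23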

Training

We will use the fit method to perform training. The specification of the fit method is as follows. See here.

fit(self, x=None, y=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None)
batch_size = 10
epochs = 1000

# use validation_split to use the last 10% for validation
gru_history = gru_model.fit(r_x, r_y, epochs=epochs, batch_size=batch_size, validation_split=0.1, verbose=0)

# use validation_split to use the last 10% for validation
lstm_history = lstm_model.fit(r_x, r_y, epochs=epochs, batch_size=batch_size, validation_split=0.1, verbose=0)
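
If you also want to compare training time directly, a rough sketch is to wrap fit() with a timer. Note that this continues training the already-fitted models, which is fine for a ballpark comparison; the numbers depend heavily on your hardware.

import time

start = time.perf_counter()
gru_model.fit(r_x, r_y, epochs=10, batch_size=batch_size, verbose=0)
print('GRU  : {:.2f} sec'.format(time.perf_counter() - start))

start = time.perf_counter()
lstm_model.fit(r_x, r_y, epochs=10, batch_size=batch_size, verbose=0)
print('LSTM : {:.2f} sec'.format(time.perf_counter() - start))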

Visualization of the loss function

Let’s visualize how the error is reduced by training.

gru_loss = gru_history.history['loss']          # loss on the training data
gru_val_loss = gru_history.history['val_loss']  # loss on the validation data

lstm_loss = lstm_history.history['loss']          # loss on the training data
lstm_val_loss = lstm_history.history['val_loss']  # loss on the validation data

plt.plot(np.arange(len(gru_loss)), gru_loss, label='gru_loss')
plt.plot(np.arange(len(gru_val_loss)), gru_val_loss, label='gru_val_loss')
plt.plot(np.arange(len(lstm_loss)), lstm_loss, label='lstm_loss')
plt.plot(np.arange(len(lstm_val_loss)), lstm_val_loss, label='lstm_val_loss')
plt.grid()
plt.legend()
plt.show()

Check the result

Starting from the first window of true values (r_y[0]), each model predicts the next point, which is then appended to the input, so the rest of the curve is generated recursively from the model’s own outputs.

# Initial input values
gru_res = r_y[0].reshape(-1)
lstm_res = r_y[0].reshape(-1)

for i in range(0, n):
  _gru_y = gru_model.predict(gru_res[- NUM_GRU:].reshape(1, NUM_GRU, 1))
  gru_res = np.append(gru_res, _gru_y[0][NUM_GRU - 1][0])

  _lstm_y = lstm_model.predict(lstm_res[- NUM_GRU:].reshape(1, NUM_GRU, 1))
  lstm_res = np.append(lstm_res, _lstm_y[0][NUM_GRU - 1][0])

plt.plot(np.arange(len(y)), y, label=r"$\exp\left(-\frac{x}{\tau}\right) \cos x$")
plt.plot(np.arange(len(gru_res)), gru_res, label="GRU result")
plt.plot(np.arange(len(lstm_res)), lstm_res, label="LSTM result")
plt.legend()
plt.grid()
plt.show()

In the case of this damped vibration curve, the GRU did not reproduce the oscillation very well. This is not to say that the GRU is bad; it simply was not a good match for this particular model of oscillation with damping. We also confirmed that the GRU has fewer parameters than the LSTM. In practice, I think the LSTM is still used more often than the GRU. My guess is that the reduction in parameters is not that drastic, and if that is the case, many people feel that keeping memory updating and forgetting as separate operations gives better overall accuracy.