
Recurrent Neural Network Introduction

This project introduces recurrent neural networks used in deep learning, mainly for sequence and timeseries processing, and explains their working principle. It also compares recurrent neural networks to 1D convolutional networks to identify the most suitable applications for each type of network.
RECURRENT NEURAL NETWORK INTRODUCTION
The main feature of convolutional and standard neural networks is that they have no memory. Each layer receives some inputs and processes them independently of any previous state. This makes such networks unsuitable for a sequence or a temporal series of data points unless the whole sequence is provided all at once. An example might be a text: if the full text is provided all at once as a large vector of words, the network can recognize it. However, if the text is provided word by word, the network will not properly recognize it, since the links among words are lost. These kinds of networks with no memory, such as convolutional or standard dense layers, are called feedforward networks.

In contrast, a recurrent neural network (RNN) is a network with memory: each node is able to store some state information relative to the previously processed information. This resembles the biological human approach, since when reading a text a human being reads word by word, storing some information about the preceding words to provide context for each new word. If that context is not stored and provided, the new word might lack sense. Therefore, a recurrent neural network is a network able to process sequential information while maintaining an internal state of what has already been processed. That internal state is created from past information and is constantly updated as new information comes into the network.

A simple diagram of a recurrent neural network is depicted below, in which the recurrent network needs both the input and the stored state information built from past inputs to generate the network output. It is also compared to a feedforward neural network, which does not include recurrent loops.
The above graph for a recurrent network can be split in further detail to expand the processing of each input in the sequence. As depicted below, when the input is X2, the output H2 is a function of the input X2 itself and also of past information related to X0 and X1.
The general neural network node operation is defined as follows:

OUTPUT = ACTIVATION FUNCTION(WEIGHTS x INPUT + BIAS)

However, for the case of a recurrent neural network, the operation is modified to consider the past information as follows:

OUTPUT = ACTIVATION FUNCTION(WEIGHTS_1 x INPUT + WEIGHTS_2 x STATE + BIAS)

Therefore, the next picture shows a more detailed basic representation of a recurrent neural network operation.
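As an illustration of the recurrence above, below is a minimal NumPy sketch of a simple recurrent step, assuming a tanh activation and arbitrary toy dimensions; the variable names (state_t, Wx, Wh) are chosen here only for clarity and are not part of any library API.

import numpy as np

timesteps = 10        # length of the input sequence
input_dim = 8         # dimensionality of each input vector
output_dim = 4        # dimensionality of the output/state vector

inputs = np.random.random((timesteps, input_dim))   # toy input sequence
state_t = np.zeros((output_dim,))                   # initial state: all zeros

# Random weights standing for WEIGHTS_1, WEIGHTS_2 and BIAS in the equation above
Wx = np.random.random((output_dim, input_dim))
Wh = np.random.random((output_dim, output_dim))
b = np.random.random((output_dim,))

outputs = []
for x_t in inputs:
    # OUTPUT = ACTIVATION(WEIGHTS_1 x INPUT + WEIGHTS_2 x STATE + BIAS)
    output_t = np.tanh(np.dot(Wx, x_t) + np.dot(Wh, state_t) + b)
    outputs.append(output_t)
    state_t = output_t   # the output becomes the state for the next timestep

final_output_sequence = np.stack(outputs, axis=0)   # shape: (timesteps, output_dim)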
LSTM AND GRU RECURRENT LAYERS
The recurrent network concept explained above is a good starting point to understand the operating principle of a recurrent network, but in practice it has little application. The main reason is the vanishing gradient problem.

The vanishing gradient problem refers to the risk for deep networks of losing the information of the initial layers, especially under certain activation functions that squash the output into a small range regardless of the input values. Therefore, a simple recurrent network as explained above is not, in practice, able to retain information seen several timesteps earlier; it is only able to retain the most recent past information, as depicted below. Thus, the recurrent network memorizes the previous states, but they progressively vanish, which is not suitable for operating with sequences.
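To make the effect tangible, below is a minimal NumPy sketch (an illustrative assumption, not a formal derivation) showing how a gradient that is repeatedly multiplied by small recurrent factors shrinks towards zero as it is propagated back through many timesteps.

import numpy as np

timesteps = 50
np.random.seed(0)

# Assume the backpropagated gradient is multiplied at each timestep by a factor
# combining the recurrent weight and the activation derivative (here below 1).
per_step_factor = 0.6 * np.random.uniform(0.5, 1.0, size=timesteps)

gradient = 1.0
for t, factor in enumerate(per_step_factor, start=1):
    gradient *= factor
    if t in (1, 10, 25, 50):
        print(f"gradient contribution after {t:2d} timesteps: {gradient:.2e}")
# The contribution of the oldest timesteps becomes vanishingly small,
# so the simple recurrent network effectively forgets them.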
However, there are two recurrent layers designed to fix this challenge: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Both layers operate under the same principle, but GRU is designed with reduced representational power in exchange for a lower computational workload. Therefore, choosing between LSTM and GRU depends on the trade off between computational cost and representational power for the particular case under analysis.

The operating principle for LSTM and GRU is to implement a way to carry past information from many timesteps before, unlike the simple recurrent network concept explained above. Thus, the past information is kept safe, preventing older signals from gradually vanishing during processing. To do so, these recurrent layers include an additional data flow running parallel to the sequence under processing, making this parallel data flow available to be used when needed. A good analogy is a conveyor belt running alongside the sequence, carrying the past information from many timesteps before intact and accessible whenever it is needed.

The operation of a realistic recurrent neural network is captured in the next equation, taking into consideration both the state and the carry track information.

OUTPUT = ACTIVATION FUNCTION(WEIGHTS_1 x INPUT + WEIGHTS_2 x STATE + WEIGHTS_3 x CARRY_TRACK + BIAS)

The next picture depicts an implementation of a realistic recurrent neural network.
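As a reference, below is a minimal Keras sketch, assuming a timeseries input of shape (timesteps, features), showing how an LSTM or a GRU layer would be used in practice; the layer sizes and input dimensions are arbitrary placeholders.

from tensorflow import keras
from tensorflow.keras import layers

timesteps, features = 100, 8   # arbitrary placeholder input shape

model = keras.Sequential()
model.add(layers.Input(shape=(timesteps, features)))
# LSTM layer with 32 output units; layers.GRU(32) could be swapped in
# when a lower computational workload is preferred.
model.add(layers.LSTM(32))
model.add(layers.Dense(1))     # e.g. a single regression output

model.compile(optimizer="rmsprop", loss="mse")
model.summary()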
DROPOUT IN RECURRENT LAYERS
Dropout is generally known as a layer which sets to zero a certain percentage of the output features of the previous layer. Thus, the model has more difficulty overfitting, since some features are randomly disabled. The core idea behind dropout is that introducing noise in the output values of a layer can prevent the network from memorizing non-meaningful patterns, which would otherwise be learned if the noise were not present. Therefore, by applying dropout, the network is challenged to really learn what is meaningful in the input dataset.

For a recurrent network, the standard application of dropout would be to set to zero a certain percentage of the information in the inputs and outputs of the recurrent layers, marked in blue in the diagram below.
However, that approach does not help in mitigating overfitting in a recurrent network. The reason is that applying a standard dropout mask that varies randomly from timestep to timestep would disrupt the error signal and might be harmful to the learning process instead of helping with regularization. The proper way to use standard dropout with a recurrent network is to apply a consistent dropout mask, with the same pattern of dropped units, at every timestep.

Moreover, standard dropout by itself does not prevent overfitting in a recurrent network. In addition, a constant dropout mask must be applied to the inner recurrent activations of the layer. That dropout is known as the recurrent dropout mask, and it is marked in blue in the diagram below. Applying the same recurrent dropout mask at every timestep allows the network to properly propagate its learning error through time, helping model regularization by reducing overfitting.
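In Keras, both masks are exposed as arguments of the recurrent layers themselves: dropout applies to the layer inputs and recurrent_dropout to the inner recurrent activations. Below is a minimal sketch with arbitrary rates chosen only for illustration.

from tensorflow.keras import layers

# dropout: mask applied to the layer inputs (same pattern at every timestep).
# recurrent_dropout: mask applied to the inner recurrent activations.
gru_layer = layers.GRU(32, dropout=0.2, recurrent_dropout=0.2)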
STACKING RECURRENT LAYERS
When the network capacity needs to be increased to achieve better results, a common action is to increase either the number of nodes per layer or the number of layers. As a reference, recurrent layer stacking is a classic way to build more powerful recurrent networks, in exchange for a higher computational cost. Note that adding layers has diminishing returns: the benefits become progressively smaller while the computational workload increases significantly.

A recurrent layer has intermediate cells according to the timestep length, and in turn, each cell contains as many units as the required output dimensionality. Each unit performs the same operation in parallel with a different initialization, leading to different behavior and different final weights after training. By default, the final output of the layer corresponds to the output of all units of the last cell. However, when stacking recurrent layers, it is important that all intermediate cells return the full output sequence and not just the last cell. In Keras, this is done by specifying return_sequences=True in the recurrent layer. The next picture summarizes a recurrent layer architecture.
Below is an example of an LSTM layer with 32 outputs, applying 0.1 standard dropout and 0.5 recurrent dropout, and returning the complete output sequence to enable stacking another LSTM layer on top.

model.add(layers.LSTM(32, dropout=0.1, recurrent_dropout=0.5, return_sequences=True))
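For completeness, below is a minimal sketch of how such a layer could be stacked with a second LSTM in a full Keras model; the input shape and the final Dense output are arbitrary assumptions for illustration.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Input(shape=(None, 8)))   # variable-length sequences of 8 features
# First LSTM returns the full sequence so the next LSTM receives one vector per timestep.
model.add(layers.LSTM(32, dropout=0.1, recurrent_dropout=0.5, return_sequences=True))
# Second (last) LSTM only returns the output of the final timestep.
model.add(layers.LSTM(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))
model.compile(optimizer="rmsprop", loss="mse")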
BIDIRECTIONAL RECURRENT NETWORKS
As explained earlier, a recurrent network is highly order or time dependent, since it processes the timesteps in order, and shuffling or reversing the timesteps can completely change the representations extracted from the sequence. However, there are particular scenarios in which it might lead to better results to process the input sequence in both the chronological and the anti-chronological direction. That is what bidirectional recurrent neural networks do: they use two standard recurrent networks, each processing the input sequence in one direction (one in chronological order and the other in anti-chronological order), and finally merge both representations, as depicted below. The main benefit of a bidirectional recurrent network is that, by processing a sequence both ways, it might catch patterns that would be overlooked by a standard unidirectional network.
Note that a recurrent network trained with a reversed sequence will learn different representations than one trained with the original sequence, similarly to how a human brain would keep different memories if you lived a life where you died on your first day and were born on your last day. In machine learning, ensembling different representations is generally a good practice to obtain further improved results. This is because a different representation offers a fresh view of the dataset, capturing aspects of the data that were missed by other approaches, so combining different approaches can help boost performance. That ensembling concept, assessing the data both in the chronological and the anti-chronological direction, is the idea behind a bidirectional recurrent network: it can improve performance over a simple chronological-order network by potentially finding richer representations and patterns.

However, timeseries prediction problems are, in general, chronological-order dependent. The success of the model then depends on the order in which the input sequence is processed, leading to better results when the recent past is remembered more accurately than the older one. For example, for a weather prediction, the immediate past information leads to a more meaningful representation than knowing the weather conditions one month ago. Thus, the improvement from using a bidirectional recurrent network is not always guaranteed, since the most benefit is reached in problems without chronological dependency, such as natural language processing. For example, in natural language processing the importance of a word for understanding the sentence does not usually depend on its position in the sentence.

The next picture depicts a diagram of a bidirectional recurrent network in a natural language processing example. As explained, there are two recurrent networks: one processing the recurrence in chronological order and the other one in anti-chronological order.
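In Keras, this is implemented with the Bidirectional wrapper, which instantiates a second copy of the wrapped recurrent layer, runs it on the reversed sequence, and merges both outputs. Below is a minimal sketch for a text-classification setup, where the vocabulary size and layer sizes are arbitrary assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(10000, 32))            # assumed vocabulary of 10,000 tokens
# One LSTM processes the sequence forwards, a second copy processes it backwards,
# and both representations are merged (concatenated by default).
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation="sigmoid"))  # e.g. binary sentiment output
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])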
SEQUENCE MASKING IN RECURRENT NETWORKS
One important concept, especially for natural language processing when batching several sentences as a single input, is SEQUENCE MASKING. Sequence masking enables handling variable-length inputs in a recurrent network. The idea is to create a mask initialized to 0 with a length equal to the longest sequence in the input batch or dataset, and then set the mask to 1 at all positions in which the sample has real values. The next example depicts the concept of sequence masking.
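In Keras, this can be achieved either with mask_zero=True in the Embedding layer or with an explicit Masking layer, so that downstream recurrent layers skip the padded timesteps. Below is a minimal sketch assuming sequences padded with zeros to a common length; the token values and layer sizes are placeholders.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Two sentences of different length, zero-padded to the longest one.
padded_batch = np.array([[12, 5, 9, 0, 0],
                         [7, 3, 24, 18, 2]])

model = keras.Sequential()
# mask_zero=True builds the mask (1 for real tokens, 0 for padding) automatically.
model.add(layers.Embedding(input_dim=100, output_dim=8, mask_zero=True))
model.add(layers.LSTM(16))       # the LSTM ignores the masked (padded) timesteps
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy")

predictions = model(padded_batch)   # shape: (2, 1)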
SEQUENCE PROCESSING WITH 1D CONVOLUTIONAL NETWORK
Convolutional neural networks are widely used for computer vision problems, thanks to their ability to extract features from local input patches, which allows for representation modularity and data efficiency. The same properties that make convolutional networks suitable for computer vision also make them highly relevant to sequence processing, treating time as a 1D spatial dimension; they are competitive with recurrent networks in some particular problems while coming with a lower computational workload.

The operation of a 1D convolutional network is depicted as follows.
1D convolutional networks can recognize local patterns in a sequence, and since the same transformation is performed on every patch, a pattern learned at a certain position in a sentence can later be recognized at a different position. This leads to the same translation-invariance property as the 2D convolutional networks used in computer vision. However, that same translation invariance makes convolutional networks insensitive to the order of the timesteps, unlike recurrent networks. Note that a convolutional network looks for patterns anywhere in the input timeseries and has no knowledge of the temporal position of the patterns.

Therefore, they are not suitable for problems with chronological dependency in which order sensitivity is key and more recent data should be interpreted differently from older data points to produce accurate predictions. However, this convolutional limitation is not an issue for problems in which chronological order is not required, such as natural language processing applications. For those non-chronologically-dependent cases, convolutional networks are a faster alternative to recurrent networks.
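Below is a minimal Keras sketch of a 1D convolutional model for sequence data, assuming an embedding input for text; the filter count, kernel size and vocabulary size are arbitrary placeholders.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(10000, 32))                   # assumed vocabulary of 10,000 tokens
# Conv1D slides a window of 7 timesteps over the sequence, applying the same
# transformation at every position (translation invariance along time).
model.add(layers.Conv1D(32, 7, activation="relu"))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation="relu"))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])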

Next, the recommended usage of 1D convolutional and recurrent networks is summarized:

- Recurrent neural networks should be used if global order matters in your sequence data. This is typically the case for timeseries, in which the recent past is likely to be more informative than the distant past. Also, bidirectional recurrent networks will not provide extra benefits in these chronologically dependent cases.

- 1D convolutional neural networks are a suitable fast choice if global ordering is not fundamentally meaningful. This is often the case for text data, where a keyword found at the beginning of a sentence is just as meaningful as a keyword found at the end.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero