
Convolutional Neural Network Introduction

This project introduces convolutional neural networks, which are used in deep learning mainly for computer vision applications, and explains their working principle.
CONVOLUTIONAL NEURAL NETWORK INTRODUCTION
Convolutional neural networks (also known as CNNs or convnets) are neural networks that contain convolution layers. The main difference between a standard layer and a convolution layer lies in how patterns are identified in the input feature space. A detailed explanation follows.

- A standard layer looks for global patterns in the input feature space. For an image, a standard layer would look for patterns involving all pixels. Thus, a standard layer will detect the pattern of an animal in a picture as a whole, and will treat it as a new pattern if the animal appears at a different position in another image.

- A convolution layer looks for local patterns in the input feature space. For the same animal image example as before, the convolution layer will detect a group of local patterns such as ears, tail, eyes, fins...  Thus, when evaluating a different image of the same animal, the convolution layer will detect the local patterns regardless of their position in the new image. Note that the level of detail in the local patterns depends on the depth of the convolutional neural network. An example of local patterns is depicted as follows.
CONVOLUTION LAYER PROPERTIES
Convolution layers have two important properties:

- Their patterns are translation invariant. As mentioned before, once a pattern is learnt it will be identified at any position in other images. If the original pattern is in the upper-right corner and it appears in the lower-left corner of a new image, it will still be identified as the same pattern, whereas a standard layer would consider them two different patterns. This translation invariance gives convolutional neural networks strong generalization power, which is what makes them the best option for processing images.

- They learn spatial hierarchies of patterns. This means that the deeper the convolutional network, the more detailed the patterns learnt for a particular image. Thus, a first convolution layer will learn small local patterns such as edges. A second convolution layer will learn higher level patterns based on the output features of the first layer, and so on. The output of a convolutional layer therefore becomes progressively more abstract and less visual. This property allows convolutional neural networks to efficiently learn complex and abstract visual concepts. For example, let's focus on the next picture and apply convolutional layers progressively to demonstrate the spatial hierarchy property.
Some features are depicted next after applying a first convolution layer. The image is still visually clear.
A second convolution layer goes into deeper details of the picture, making the features more abstract, as depicted next.
After applying a third convolution layer, the image is barely recognizable. At this stage, particular and specific details of the picture are captured.
And with a fourth convolution layer, the features are too abstract to be understood, as depicted below.
CONVOLUTIONAL LAYER OPERATION
A convolution operation works by sliding window patches across the input feature map at every possible location and applying a certain operation to extract meaningful information at each step. The operation is a dot product with a kernel (also called a filter), and its result is stored in the output feature map. This operation is repeated for each output feature with a different kernel.

Note that a convolutional layer defines as many kernels as output features, and each kernel must be tuned to the proper parametrization during training. In general, it is good to have a significant number of kernels to obtain more convolutional features, but the higher the number of kernels, the higher the number of parameters to tune, with the corresponding computational time.

Next picture depicts a convolution operation on a 5 x 5 input feature map with a 3 x 3 kernel. The outcome is 5, corresponding to the dot product between the depicted area of the input feature map and the kernel.

1 x 1 + 2 x 0 + 1 x 1 + 2 x 0 + 0 x 1 + 0 x 0 + 1 x 1 + 0 x 0 + 2 x 1 = 5
The convolution operation is performed at the other positions until the output feature map is completely filled for the kernel under analysis. Thus, next picture shows the same 5 x 5 input feature map, applying the 3 x 3 kernel to the last position of the current feature. The result in this particular case is 7, corresponding to the dot product between the depicted area of the input feature map and the kernel.

2 x 1 + 1 x 0 + 0 x 1 + 0 x 0 + 2 x 1 + 1 x 0 + 1 x 1 + 0 x 0 + 2 x 1 = 7
Once the output feature is filled in by applying the kernel at all possible locations of the input feature, the next step is to repeat the convolution operation with the next kernel, leading to a new output feature and increasing the depth dimension. The operation is completed when all output features have been filled in by applying all kernels.
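To make the operation concrete, below is a minimal NumPy sketch in Python. The kernel values and the two 3 x 3 patches are inferred from the worked products above, since the full 5 x 5 input is only visible in the pictures; convolve2d is an illustrative helper showing the sliding-window mechanics, not a library function.

import numpy as np

# 3 x 3 kernel inferred from the worked products above
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# The two 3 x 3 input patches from the worked examples
first_patch = np.array([[1, 2, 1],
                        [2, 0, 0],
                        [1, 0, 2]])
last_patch = np.array([[2, 1, 0],
                       [0, 2, 1],
                       [1, 0, 2]])

# Dot product: elementwise multiplication followed by a sum
print(np.sum(first_patch * kernel))  # 5
print(np.sum(last_patch * kernel))   # 7

# Single-channel convolution over a full input feature map
def convolve2d(feature_map, kernel):
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i:i + kh, j:j + kw]
            output[i, j] = np.sum(window * kernel)  # dot product at this location
    return output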

Note that for simplicity the above graphs focus on a single-channel input feature map, which in turn leads to kernels with depth = 1. However, color images have 3 channels, which increases the depth of both the input feature map and the kernels to 3. The convolution operation stays the same, but it performs a 3D dot product instead of a 2D one.

Next picture depicts a convolution operation for a 7 x 7 input feature map with 3 channels, using 3 x 3 kernels also with 3 channels. The output feature map is 5 x 5 (due to border effects, explained later) with a depth equal to the number of output features.
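Below is a rough NumPy sketch of the 3-channel case, assuming a 7 x 7 x 3 input as in the picture and 4 hypothetical kernels of 3 x 3 x 3; the random values are placeholders, since only the shapes matter here.

import numpy as np

def convolve3d(feature_map, kernels):
    # feature_map: (height, width, channels); kernels: (n_kernels, kh, kw, channels)
    n_kernels, kh, kw, _ = kernels.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    output = np.zeros((out_h, out_w, n_kernels))
    for k in range(n_kernels):
        for i in range(out_h):
            for j in range(out_w):
                window = feature_map[i:i + kh, j:j + kw, :]
                output[i, j, k] = np.sum(window * kernels[k])  # 3D dot product
    return output

feature_map = np.random.rand(7, 7, 3)           # 7 x 7 input with 3 channels
kernels = np.random.rand(4, 3, 3, 3)            # 4 hypothetical 3 x 3 x 3 kernels
print(convolve3d(feature_map, kernels).shape)   # (5, 5, 4): one 5 x 5 output per kernel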
In summary, the convolution operation is defined by two key parameters:

- Size of the window patches used to extract meaningful information from the inputs. These are typically 3 x 3 or 5 x 5, and they correspond to the kernel width and height.

- Depth of the output feature map. The framework creates a number of kernels equal to the number of output features. The kernel values belong to the convolutional layer parametrization and are learnt during neural network training, starting from a random initialization. The proper number of outputs depends on the case under analysis and on the layer position within the neural network.
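For reference, in a framework such as Keras both parameters map directly to the Conv2D arguments; a minimal sketch, where the filter count of 32 is an arbitrary example:

from tensorflow.keras import layers

# filters sets the depth of the output feature map (number of kernels);
# kernel_size sets the width and height of the window patches
conv = layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')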
UNDERSTANDING BORDER EFFECTS AND STRIDES
When applying a convolution operation, the output width and height can be smaller than those of the input. This is due to border effects and strides, which are detailed below.

BORDER EFFECTS
Border effects are an inherent consequence of the convolution operation. Let's focus on the example below to understand them visually. Next picture shows a 5 x 5 input feature map; when applying a 3 x 3 window patch, there are only 9 possible locations due to border effects. Therefore, after convolution the feature map has been reduced from 5 x 5 to 3 x 3.
If that size reduction is undesired or needs to be avoided, the solution is to apply padding. Padding consists of adding the required number of extra rows and columns on the sides of the input feature map so that the window patch fits at every input feature map location. For example, next picture depicts padding applied to a 5 x 5 input feature map to perform a 3 x 3 window patch without affecting the width and height. One row and column per side was added; note that two additional rows and columns per side would be needed for a 5 x 5 window patch. A sketch of how padding is configured is shown below.
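In Keras, for instance, this behavior is controlled by the padding argument of Conv2D; a minimal sketch, with an arbitrary filter count:

from tensorflow.keras import layers

# padding='valid' (the default) adds no padding: a 5 x 5 input becomes 3 x 3
conv_valid = layers.Conv2D(32, (3, 3), padding='valid')

# padding='same' adds the required rows and columns so the output
# keeps the input width and height
conv_same = layers.Conv2D(32, (3, 3), padding='same')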
STRIDE
The other factor influencing the output size is the stride. The description of the convolution operation so far considers windows at all possible locations in the input feature map. However, the stride defines the distance between two successive windows. All previous graphs assume stride = 1, but it is possible to use higher stride values.

For example, next picture depicts a convolution operation on a 5 x 5 input feature map with stride = 1 and stride = 2. Using stride = 1 leads to a 3 x 3 output feature map, limited by border effects but enabling windows at all possible locations. Instead, using stride = 2 leads to a 2 x 2 output feature map, limited both by border effects and by not allowing windows at contiguous locations.
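The relation between input size, kernel size, padding, and stride can be captured in the standard output size formula; a small Python sketch reproducing the two cases above:

def output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard formula for the output width/height of a convolution
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(output_size(5, 3, stride=1))  # 3, as in the stride = 1 case
print(output_size(5, 3, stride=2))  # 2, as in the stride = 2 case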
POOLING OPERATION
Pooling is a downsampling process that reduces the width and height of each feature. The operation is conceptually similar to convolution, except that pooling applies a hardcoded operation instead of a learnt kernel per feature. The most common operations are max and average, but max tends to lead to better results. Another difference between pooling and convolution is that pooling is usually performed with 2 x 2 patches and stride = 2, which leads to a downsampling factor of 2, whereas convolution is normally done with 3 x 3 patches and stride = 1.

Next picture depicts a max pooling operation with a 2 x 2 patch and stride = 2. A sketch of this operation is included below.
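A minimal NumPy sketch of max pooling, with a hypothetical 4 x 4 feature map as input (max_pool2d is an illustrative helper, not a library function):

import numpy as np

def max_pool2d(feature_map, pool=2, stride=2):
    out_h = (feature_map.shape[0] - pool) // stride + 1
    out_w = (feature_map.shape[1] - pool) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + pool,
                                 j * stride:j * stride + pool]
            output[i, j] = window.max()  # keep only the strongest activation
    return output

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 5, 6]])
print(max_pool2d(x))  # [[6. 4.]
                      #  [7. 9.]]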
There are two major reasons to apply pooling downsampling:

1. If pooling is not applied, the spatial hierarchy property is lost. Even in the last convolution layer, each 3 x 3 patch would only contain information from a small window of the initial input, since the receptive field grows very slowly across stacked convolution layers, leading to very small high level patterns compared to the initial input. Applying downsampling induces spatial filter hierarchies by making successive convolution layers look at increasingly large windows, as a fraction of the initial input, so they are able to capture more significant patterns.

2. If pooling is not applied, the final feature map is not manageable for the same number of features. For example, 128 features of size 5 x 5 produce 3200 values when flattened, whereas 128 features of size 25 x 25 would increase that amount to 80000, with a corresponding number of parameters when connected to the output of the neural network. Apart from the computational burden, having too many parameters in a small model leads to severe overfitting.
CONVOLUTIONAL NEURAL NETWORK OVERVIEW
Below picture summarizes a convolutional neural network. There are two main steps:

STEP 1 --> Identify local patterns to create the convolutional features

The purpose of this step is to apply successive convolution operations to detect high level patterns in the initial input. This process requires several convolutional layers, which progressively reduce the feature size but increase the number of features. In this stage, a pooling operation is applied after each convolutional layer for downsampling, as detailed above.

STEP 2 --> Flatten the network to apply standard layers and calculate the network output

A convolutional layer cannot provide the required network output, so the neural network must end with standard layers. However, prior to that, the feature map must be flattened to one dimension. Once flattened, a standard layer with a suitable activation function is defined to produce the network output.

This process is summarized in below picture for a multiclass classification, ending with a standard layer with 10 nodes and softmax as activation function. The model applies two convolution layers, each followed by pooling (step 1). Then it flattens the network and applies standard layers to provide the network output (step 2).
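As an illustration, a minimal Keras sketch matching this structure could look as follows; the 28 x 28 x 1 input shape and the filter counts are assumptions, since the picture does not fully specify them:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Step 1: convolution + pooling layers to extract local patterns
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Step 2: flatten and apply a standard layer to reach the network output
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),  # 10 nodes with softmax
])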
CONVOLUTIONAL NEURAL NETWORK EXAMPLE
Let's assess the following convolutional neural network as an example.
Main features of the defined neural network:

- The model is sequential, which means there is a single input that is processed by a stack of sequential layers, with the last layer providing the single output, as depicted below.
- Model input corresponds to RGB images of 60 x 60 pixels, so input size is 60 x 60 x 3.
- The first part of the model consists of three convolution layers to extract the meaningful image features. It starts with 32 features in the first layer and ends with 128 features in the third layer, because the priority is to accumulate information about general high level patterns rather than small low level patterns.
- The second part of the model consists of standard layers that apply the weights tuned during training to the convolution features and provide the neural network output.
- All internal layers use ReLU as activation function to avoid vanishing gradient issues.
- The output is a single sigmoid neuron, which is the suitable output for binary classification problems. A reconstruction of the model is sketched after this list.
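The model definition itself is shown in the picture; a best-effort Keras reconstruction, consistent with the layer summary commented below, would be:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Three convolution layers (32 -> 64 -> 128 features), each followed by pooling
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(60, 60, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    # Standard layers on top of the flattened convolution features
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # single sigmoid neuron for binary output
])
model.summary()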
The summary of the convolutional neural network is depicted as follows. Let's comment on its main features:

- The CONV2D layer receives inputs of 60 x 60 and outputs 32 features of 58 x 58 each. The reduction is due to border effects, as explained above. Note that it also happens in the following convolution layers.

- The MAX_POOLING2D layer downsamples the 32 features of 58 x 58 to half their size (29 x 29 each).

- The CONV2D_1 layer takes the 32 features of 29 x 29 and transforms them into 64 features of 27 x 27 each.

- The MAX_POOLING2D_1 layer downsamples the 64 features of 27 x 27 to half their size (13 x 13 each, rounding down).

- The CONV2D_2 layer takes the 64 features of 13 x 13 and transforms them into 128 features of 11 x 11 each.

- The MAX_POOLING2D_2 layer downsamples the 128 features of 11 x 11 to half their size (5 x 5 each, rounding down).

- The FLATTEN layer transforms the 5 x 5 x 128 tensor into a single dimension of 3200 features so that standard layers can be applied.

- The DENSE layer reduces the 3200 features to 256.

- And finally, the DENSE_1 layer produces the single outcome of the neural network using the sigmoid activation function.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero