Machine learning: Convolutional Neural Networks for Machine Learning

These networks preserve the spatial structure of the problem.

CNNs are popular because they achieve state-of-the-art results on difficult computer vision and natural language processing tasks.

Given a dataset of grayscale images with a standardized size of 32x32 pixels each, a traditional feedforward neural network would require 1024 input weights (plus one bias) for each neuron in its first hidden layer.

But flattening the image's matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image.
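To make the loss of structure concrete, here is a minimal sketch using NumPy. The image contents (a vertical bar in columns 10 and 11) are made up purely for illustration.

```python
import numpy as np

# Hypothetical 32x32 grayscale image with a vertical bar in columns 10-11.
image = np.zeros((32, 32))
image[:, 10:12] = 1.0

# Flatten to the 1024-element vector a fully connected layer would consume.
flat = image.reshape(-1)
print(flat.shape)  # (1024,)

# Pixels (0, 10) and (1, 10) are vertical neighbours in the image,
# but they end up 32 positions apart in the flattened vector.
idx_a = np.ravel_multi_index((0, 10), image.shape)
idx_b = np.ravel_multi_index((1, 10), image.shape)
print(idx_a, idx_b, idx_b - idx_a)  # 10 42 32
```

Nothing in the flat vector tells the network that positions 10 and 42 were adjacent pixels; that relationship has to be relearned from data.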

CNNs expect and preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data. Features are learned and used across the whole image, so objects that are shifted or translated in the scene are still detectable by the network.

In summary, below are some benefits of using CNNs:

They use fewer parameters (weights) to learn than a fully connected network.
They are designed to be invariant to object position and distortion in the scene.
They automatically learn and generalize features from the input domain.
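The first benefit can be checked with simple arithmetic. The sketch below compares a fully connected layer with a convolutional layer on the 32x32 grayscale input from earlier; the choice of 32 hidden units and 32 filters of size 3x3 is an assumption made only so the two layers are comparable.

```python
# Illustrative sizes (assumptions, not taken from a specific model).
n_inputs = 32 * 32   # flattened 32x32 grayscale image
n_units = 32         # hidden units in the fully connected layer
n_filters = 32       # filters in the convolutional layer
kernel_size = 3      # each filter is 3x3
in_channels = 1      # grayscale input

# Fully connected: every unit has one weight per input pixel, plus a bias.
dense_params = n_inputs * n_units + n_units

# Convolutional: every filter has kernel_size^2 weights per channel, plus a bias.
conv_params = kernel_size * kernel_size * in_channels * n_filters + n_filters

print(dense_params)  # 32800
print(conv_params)   # 320
```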

Building Blocks of Convolutional Neural Networks

There are three types of layers in a CNN:

Convolutional Layers
Pooling Layers
Fully-Connected Layers

A. Convolutional Layers

Convolutional layers are composed of filters and feature maps.

Filters are the neurons of the layer. They have input weights and output a value.

The input size is a fixed square called a patch or a receptive field.

A feature map is the output of one filter applied to the previous layer.

For example, applying a 3x3 kernel to an 8x8 pixel input image produces a feature map with dimensions of 6x6 (8 - 3 + 1 = 6). This convolutional layer has 10 parameters: 9 weights for the 3x3 filter plus one weight for the bias. With more than one filter or more than one input channel, the number of weights is 3 x 3 x number of channels x number of filters, plus one bias per filter.
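The arithmetic above can be written as two small helper functions. The function names are hypothetical and the formulas assume a square input, a square kernel, a stride of 1, and no padding.

```python
def conv_output_size(input_size, kernel_size):
    """Width (or height) of the feature map with stride 1 and no padding."""
    return input_size - kernel_size + 1


def conv_param_count(kernel_size, in_channels=1, n_filters=1):
    """Trainable parameters: filter weights plus one bias per filter."""
    weights = kernel_size * kernel_size * in_channels * n_filters
    biases = n_filters
    return weights + biases


print(conv_output_size(8, 3))                            # 6
print(conv_param_count(3))                               # 10
print(conv_param_count(3, in_channels=3, n_filters=16))  # 448
```

The last call is an illustrative case (a 3-channel input with 16 filters) that does not appear in the text: 3 x 3 x 3 x 16 = 432 weights plus 16 biases.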

The filter is applied to the input image. It starts at the top-left corner of the image and is moved from left to right, one pixel column at a time, until the edge of the filter reaches the edge of the image.

For a 3x3 filter applied to an 8x8 input image, the filter can only be applied six times along each row, resulting in a width of six in the output feature map.
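The sliding procedure can be sketched directly in NumPy with two loops. This is an illustration of the idea rather than how Keras implements it; the helper name `apply_filter`, the vertical-line image, and the hand-picked filter weights are all assumptions.

```python
import numpy as np


def apply_filter(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Dot product between the filter and the patch it currently covers.
            patch = image[row:row + k_h, col:col + k_w]
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map


# 8x8 image containing a vertical line two pixels wide.
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# Hand-written 3x3 filter that responds to vertical lines.
kernel = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])

feature_map = apply_filter(image, kernel)
print(feature_map.shape)  # (6, 6)
print(feature_map[0])     # [0. 0. 3. 3. 0. 0.]
```

Each row of the output has six entries because the 3x3 filter fits into an 8-pixel-wide row at exactly six horizontal positions.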

The reduction in the size of the input to the feature map is referred to as border effects.

This is not a problem for large images and small filters, but it can be a problem for small images. It can also become a problem once a number of convolutional layers are stacked.

For example, below is a model with two stacked convolutional layers.

This means that a 3x3 filter is applied to the 8x8 input image to result in a 6x6 feature map. A 3x3 filter is then applied to the 6x6 feature map.

# example of stacked convolutional layers
from keras.models import Sequential
from keras.layers import Conv2D
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), input_shape=(8, 8, 1)))
model.add(Conv2D(1, (3,3)))
# summarize model
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 6, 6, 1)           10
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 1)           10
=================================================================
Total params: 20
Trainable params: 20
Non-trainable params: 0

We can see that applying a filter to the feature map output of the first layer, in turn, results in a smaller 4x4 feature map.

This can become a problem as we develop very deep convolutional neural network models with tens or hundreds of layers.
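A short sketch shows how quickly the feature map shrinks when unpadded layers are stacked. The helper name is hypothetical, and it assumes square inputs, the same kernel size in every layer, and a stride of 1.

```python
def stacked_sizes(input_size, kernel_size):
    """Feature map width after each unpadded layer, until the filter no longer fits."""
    sizes = [input_size]
    while sizes[-1] >= kernel_size:
        sizes.append(sizes[-1] - kernel_size + 1)
    return sizes


print(stacked_sizes(8, 3))  # [8, 6, 4, 2]
```

Under these assumptions, an 8x8 input supports only three 3x3 layers before the feature map is smaller than the filter itself.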

Zero Padding

The fix for border effects is to add pixels around the edge of the input, known as padding. For example, in the case of applying a 3x3 filter to the 8x8 input image, we can add a border of one pixel around the outside of the image. This has the effect of creating a 10x10 input image. When the 3x3 filter is applied, it results in an 8x8 feature map. The added pixels can be given the value zero, which has no effect on the dot product when the filter is applied.

In Keras, the ‘padding’ value of ‘same’ calculates and adds the padding required to the input image (or feature map) to ensure that the output has the same shape as the input.
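The effect of a one-pixel zero border can be reproduced by hand in NumPy. This is a sketch of the idea and not the Keras implementation; it assumes NumPy 1.20 or later for `sliding_window_view`, and the all-ones image and filter are placeholders chosen to make the numbers easy to read.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.ones((8, 8))    # placeholder 8x8 input
kernel = np.ones((3, 3))   # placeholder 3x3 filter

# Add a one-pixel border of zeros on every side: 8x8 becomes 10x10.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0.0)

# Every 3x3 window of the padded image, then the dot product with the filter.
windows = sliding_window_view(padded, kernel.shape)  # shape (8, 8, 3, 3)
feature_map = (windows * kernel).sum(axis=(2, 3))

print(padded.shape)       # (10, 10)
print(feature_map.shape)  # (8, 8)
print(feature_map[0, 0])  # 4.0 (corner window covers only four real pixels)
print(feature_map[4, 4])  # 9.0 (interior window covers nine real pixels)
```

The padded zeros contribute nothing to the sums, which is why the corner value is 4 rather than 9.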

Downsample Input with Stride

The filter is moved across the image left to right, top to bottom, with a one-pixel column change on the horizontal movements, then a one-pixel row change on the vertical movements.

The amount of movement between applications of the filter to the input image is referred to as the stride. The default stride or strides in two dimensions is (1,1) for height and width movement. The stride can be changed, which has an effect both on how the filter is applied on the image and, in turn, the size of the resulting feature map.

For example, the stride can be changed to (2,2). This has the effect of moving the filter two pixels right for each horizontal movement of the filter and two pixels down for each vertical movement of the filter when creating the feature map.
With an 8x8 input image, a 3x3 filter, and a stride of (2,2), the output feature map is 3x3: floor((8 - 3) / 2) + 1 = 3.
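The stride calculation can be checked with a small sketch. The helper name and the numbered test image are assumptions, no padding is used, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def strided_output_size(input_size, kernel_size, stride):
    """Feature map width with no padding: floor((n - k) / s) + 1."""
    return (input_size - kernel_size) // stride + 1


print(strided_output_size(8, 3, 1))  # 6
print(strided_output_size(8, 3, 2))  # 3

# Apply a 3x3 filter with stride (2, 2) by keeping every second window.
image = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.ones((3, 3))
windows = sliding_window_view(image, kernel.shape)[::2, ::2]
feature_map = (windows * kernel).sum(axis=(2, 3))
print(feature_map.shape)  # (3, 3)
```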

B. Pooling Layer

Pooling layers downsample the previous layer's feature maps.

Convolutional layers in a convolutional neural network systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.

A limitation of the feature map output of convolutional layers is that it records the precise position of features in the input. This means that small movements in the position of a feature in the input image will result in a different feature map.
A common approach to addressing this problem, borrowed from signal processing, is called downsampling. This is where a lower-resolution version of an input signal is created that still contains the large or important structural elements.

Downsampling can be achieved with convolutional layers by changing the stride of the convolution. A more robust and common approach is to use a pooling layer.
A pooling layer is a new layer added after the convolutional layer, specifically after a nonlinearity (e.g. ReLU) has been applied to the feature maps output by the convolutional layer.

For example, the layers in a model may look as follows:

Input image
Convolution layer
Nonlinearity
Pooling layer
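The ordering above can be sketched end to end in NumPy. This is an illustrative pipeline, not a trained model: the function names are hypothetical, the filter weights are hand-picked to respond to the left edge of a vertical line, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def convolve(image, kernel):
    """Convolution layer: stride 1, no padding."""
    windows = sliding_window_view(image, kernel.shape)
    return (windows * kernel).sum(axis=(2, 3))


def relu(x):
    """Nonlinearity: negative activations become zero."""
    return np.maximum(x, 0.0)


def max_pool(feature_map, size=2):
    """Pooling layer: non-overlapping size x size patches."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Input image: 8x8 with a vertical line two pixels wide.
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# Hand-picked filter (an assumption): positive at a dark-to-bright transition.
kernel = np.array([[-1.0, 1.0, 0.0]] * 3)

conv_out = convolve(image, kernel)
activated = relu(conv_out)
pooled = max_pool(activated)

print(conv_out[0])   # [ 0.  0.  3.  0. -3.  0.]
print(activated[0])  # [0. 0. 3. 0. 0. 0.]
print(pooled.shape)  # (3, 3)
print(pooled[0])     # [0. 3. 0.]
```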

Pooling involves selecting a pooling operation. The size of the pooling operation or filter is smaller than the size of the feature map; specifically, it is almost always 2x2 pixels applied with a stride of 2 pixels.

This means that the pooling layer will always reduce each dimension of a feature map by a factor of 2, reducing the number of pixels to one quarter. For example, a pooling layer applied to a feature map of 6x6 (36 pixels) will result in an output pooled feature map of 3x3 (9 pixels).

Two common functions used in the pooling operation are:

Average pooling: calculate the average value for each patch of the feature map.
Maximum pooling (or Max pooling): calculate the maximum value for each patch of the feature map.

For example, consider this 6x6 feature map, produced by a filter that responds to a vertical line:
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]

Average pooling layer

On two-dimensional feature maps, pooling is typically applied in 2x2 patches of the feature map with a stride of (2,2).

[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]

average(0.0, 0.0, 0.0, 0.0) = 0.0
average(3.0, 3.0, 3.0, 3.0) = 3.0
average(0.0, 0.0, 0.0, 0.0) = 0.0

The result is the first line of the average pooling operation:

[0.0, 3.0, 0.0]

Final result:

[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]

Max pooling layer

[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
max(0.0, 0.0, 0.0, 0.0) = 0.0
max(3.0, 3.0, 3.0, 3.0) = 3.0
max(0.0, 0.0, 0.0, 0.0) = 0.0

The result is the first line of the max pooling operation:

[0.0, 3.0, 0.0]

Final result

[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
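Both worked examples can be reproduced with a NumPy reshape that splits the 6x6 feature map into a 3x3 grid of 2x2 patches. This is a minimal sketch that assumes the map divides evenly into patches. The shifted variant at the end is an added illustration, not part of the example above.

```python
import numpy as np

# The 6x6 feature map from the example above.
feature_map = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]] * 6)

# Axes after the reshape: (patch row, row in patch, patch column, column in patch).
patches = feature_map.reshape(3, 2, 3, 2)

average_pooled = patches.mean(axis=(1, 3))
max_pooled = patches.max(axis=(1, 3))
print(average_pooled[0])  # [0. 3. 0.]
print(max_pooled[0])      # [0. 3. 0.]

# Added illustration: shift the line one column left so it straddles two patches.
shifted = np.roll(feature_map, -1, axis=1)  # rows become [0, 3, 3, 0, 0, 0]
shifted_patches = shifted.reshape(3, 2, 3, 2)
shifted_avg = shifted_patches.mean(axis=(1, 3))
shifted_max = shifted_patches.max(axis=(1, 3))
print(shifted_avg[0])  # [1.5 1.5 0. ]
print(shifted_max[0])  # [3. 3. 0.]
```

The two operations agree on the original map only because the line happens to align with the 2x2 patches. Once it straddles a patch boundary, average pooling dilutes the activation while max pooling keeps its full strength.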

Global pooling layer

Instead of downsampling patches of the input feature map, global pooling downsamples the entire feature map to a single value. This would be the same as setting the pool size to the size of the input feature map.

With global max pooling, the outcome is a single value that summarizes the strongest activation, or the presence, of the vertical line in the input image.
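Applied to the same 6x6 feature map used in the pooling examples, global pooling reduces to a single NumPy call. This is a sketch for one feature map; a real layer would do this once per feature map.

```python
import numpy as np

# The 6x6 feature map from the pooling examples.
feature_map = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]] * 6)

# Global max pooling: the strongest activation anywhere in the map.
global_max = feature_map.max()

# Global average pooling: the mean activation over the whole map.
global_average = feature_map.mean()

print(global_max)      # 3.0
print(global_average)  # 1.0
```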

C. Fully Connected Layers

Fully connected layers are the normal flat feedforward neural network layers.
These layers may have a nonlinear activation function or a softmax activation in order to output probabilities of class predictions.

Fully connected layers are used at the end of the network after feature extraction and consolidation have been performed by the convolutional and pooling layers.
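A minimal sketch of this final step: the pooled feature map is flattened, passed through one fully connected layer, and turned into class probabilities with softmax. The weights are random and untrained, and the choice of two output classes is an assumption made only for illustration.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


# Pooled 3x3 feature map from the pooling examples, flattened to 9 values.
pooled = np.array([[0.0, 3.0, 0.0]] * 3)
features = pooled.reshape(-1)

# Untrained fully connected layer with 2 output classes (illustrative only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(features.size, 2))
bias = np.zeros(2)

probabilities = softmax(features @ weights + bias)
print(probabilities.shape)  # (2,)
print(probabilities.sum())
```

Because the weights are random, the individual probabilities carry no meaning here; the point is only that the outputs are non-negative and sum to 1.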

References:

A Gentle Introduction to Padding and Stride for Convolutional Neural Networks (machinelearningmastery.com)

A Gentle Introduction to Pooling Layers for Convolutional Neural Networks (machinelearningmastery.com)