CNNs are popular because they have achieved state-of-the-art results on difficult computer vision and natural language processing tasks.
Given a dataset of grayscale images with a standardized size of 32x32 pixels each, a traditional feedforward neural network would require 1024 input weights (plus one bias) for each neuron in the first hidden layer.
But flattening the image matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image.
CNNs expect and preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data. Features are learned and used across the whole image, so objects that are shifted or translated in the scene are still detectable by the network.
In summary, below are some benefits of using CNNs:
- They have fewer parameters (weights) to learn than a fully connected network.
- They are designed to be invariant to object position and distortion in the scene.
- They automatically learn and generalize features from the input domain.
There are three types of layers in a CNN:
- Convolutional Layers
- Pooling Layers
- Fully-Connected Layers
Filters are the neurons of the layer. They have input weights and output a value.
The input is a fixed-size square called a patch or a receptive field.
A feature map is the output of one filter applied to the previous layer.
For example, an 8x8 pixel input image and a 3x3 kernel will produce a feature map with dimensions of 6x6 (8 - 3 + 1 = 6). The convolutional layer will have 10 parameters: 9 weights for the 3x3 filter plus one weight for the bias. With more than one filter and more than one input channel, the number of weights is 3x3 x number of channels x number of filters (plus one bias per filter).
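To make the parameter counting concrete, here is a minimal Keras sketch (with an assumed 8x8 input, 3 channels, and 2 filters) that prints the layer summary; the expected count is 3x3 x 3 channels x 2 filters + 2 biases = 56 parameters.
# example of counting convolutional layer parameters (assumed shapes)
from keras.models import Sequential
from keras.layers import Conv2D
# create model with 2 filters over a 3-channel 8x8 input
model = Sequential()
model.add(Conv2D(2, (3,3), input_shape=(8, 8, 3)))
# summarize model: 3*3*3*2 weights + 2 biases = 56 parameters
model.summary()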
The filter is applied to the input image. It starts at the top left corner of the image and is moved from left to right one pixel column at a time until the edge of the filter reaches the edge of the image.
For a 3x3 filter applied to an 8x8 input image, we can see that it can only be applied six times across each row, resulting in a width of six in the output feature map.
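As a rough sketch of this sliding behavior, the plain NumPy loop below (with an assumed all-zero image and all-one filter, used only to check the shapes) applies a 3x3 patch at every valid position of an 8x8 image and confirms the 6x6 output size.
# example of manually sliding a 3x3 filter over an 8x8 image (illustrative only)
import numpy as np
image = np.zeros((8, 8))
kernel = np.ones((3, 3))
# the filter fits 8 - 3 + 1 = 6 times along each dimension
output = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        patch = image[i:i+3, j:j+3]
        output[i, j] = np.sum(patch * kernel)
print(output.shape)  # (6, 6)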
The reduction in size from the input to the feature map is referred to as the border effect.
This is not a problem for large images and small filters, but it can be a problem for small images. It can also become a problem once a number of convolutional layers are stacked.
For example, below is the same model updated to have two stacked convolutional layers.
This means that a 3x3 filter is applied to the 8x8 input image to result in a 6x6 feature map. A 3x3 filter is then applied to the 6x6 feature map.
# example of stacked convolutional layers
from keras.models import Sequential
from keras.layers import Conv2D
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), input_shape=(8, 8, 1)))
model.add(Conv2D(1, (3,3)))
# summarize model
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 6, 6, 1)           10
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 1)           10
=================================================================
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________
We can see that the application of the filter to the feature map output of the first layer, in turn, results in a smaller 4x4 feature map.
This can become a problem as we develop very deep convolutional neural network models with tens or hundreds of layers.
Zero Padding
A simple fix for the border effect is to pad the input image with extra pixels around its border before the filter is applied. For example, in the case of applying a 3x3 filter to the 8x8 input image, we can add a border of one pixel around the outside of the image. This has the effect of creating a 10x10 input image. When the 3x3 filter is applied, it results in an 8x8 feature map. The added pixels can have the value zero, which has no effect on the dot product operation when the filter is applied.
In Keras, the ‘padding’ value of ‘same’ calculates and adds the padding required to the input image (or feature map) to ensure that the output has the same shape as the input.
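As a minimal sketch, the single-layer model below uses padding='same' so that the 8x8 input produces an 8x8 feature map.
# example of a convolutional layer with 'same' padding
from keras.models import Sequential
from keras.layers import Conv2D
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), padding='same', input_shape=(8, 8, 1)))
# summarize model: output shape is (None, 8, 8, 1)
model.summary()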
Downsample Input with Stride
The amount of movement between applications of the filter to the input image is referred to as the stride. The default stride or strides in two dimensions is (1,1) for height and width movement. The stride can be changed, which has an effect both on how the filter is applied on the image and, in turn, the size of the resulting feature map.
For example, the stride can be changed to (2,2). This has the effect of moving the filter two pixels right for each horizontal movement of the filter and two pixels down for each vertical movement of the filter when creating the feature map.
With an 8x8 input image, a 3x3 filter, and a stride of (2,2), the output feature map is 3x3 (a stride of one would give 8 - 3 + 1 = 6 positions per row, and moving two pixels at a time halves this to 3).
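A minimal sketch of this strided convolution is below; the 8x8 input with a (2,2) stride produces a 3x3 feature map.
# example of a convolutional layer with a (2,2) stride
from keras.models import Sequential
from keras.layers import Conv2D
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), strides=(2, 2), input_shape=(8, 8, 1)))
# summarize model: output shape is (None, 3, 3, 1)
model.summary()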
Pooling Layer
A pooling layer down-samples the feature maps of the previous layer.
Convolutional layers in a convolutional neural network systematically apply learned filters to input images in order to create feature maps that summarize the presence of those features in the input.
A limitation of the feature maps output by convolutional layers is that they record the precise position of features in the input. This means that small movements in the position of the feature in the input image will result in a different feature map.
A common approach to addressing this problem, borrowed from signal processing, is called down sampling. This is where a lower resolution version of an input signal is created that still contains the large or important structural elements.
Down sampling can be achieved with convolutional layers by changing the stride of the convolution. A more robust and common approach is to use a pooling layer.
A pooling layer is a new layer added after the convolutional layer, specifically after a nonlinearity (e.g. ReLU) has been applied to the feature maps output by a convolutional layer.
For example, the layers in a model may look as follows (a minimal sketch of this ordering appears after the list):
- Input image
- Convolution layer
- Nonlinearity
- Pooling layer
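Here is a minimal Keras sketch of that ordering, with the nonlinearity added as a separate ReLU activation layer (the pool size of 2x2 is an assumption matching the discussion below).
# example of the convolution, nonlinearity, pooling ordering
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D
# create model
model = Sequential()
model.add(Conv2D(1, (3,3), input_shape=(8, 8, 1)))  # convolutional layer
model.add(Activation('relu'))                       # nonlinearity
model.add(MaxPooling2D(pool_size=(2, 2)))           # pooling layer
# summarize model
model.summary()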
With a pool size of 2x2 and a stride of (2,2), the pooling layer will always reduce the size of each feature map by a factor of 2 in each dimension. For example, a pooling layer applied to a feature map of 6x6 (36 pixels) will result in an output pooled feature map of 3x3 (9 pixels).
Two common functions used in the pooling operation are:
- Average pooling: calculate the average value for each patch on the feature map
- Maximum pooling (or Max pooling): calculate the maximum value for each patch of the feature map.
For example, consider the following 6x6 feature map, produced by a filter that detects a vertical line in the input image:
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
Average pooling layer
On two-dimensional feature maps, pooling is typically applied in 2x2 patches of the feature map with a stride of (2,2). The first pooling operations are applied to the first two rows of the feature map above:
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
average(0.0, 0.0, 0.0, 0.0) = 0.0
average(3.0, 3.0, 3.0, 3.0) = 3.0
average(0.0, 0.0, 0.0, 0.0) = 0.0
The result is the first line of the average pooling operation:
[0.0, 3.0, 0.0]
Final result:
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
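The sketch below reproduces this result with a Keras AveragePooling2D layer applied to the 6x6 feature map above (the values are hard-coded rather than produced by a convolution).
# example of average pooling on the hand-crafted 6x6 feature map
import numpy as np
from keras.models import Sequential
from keras.layers import AveragePooling2D
# the 6x6 feature map with a vertical line of activations
data = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]] * 6).reshape(1, 6, 6, 1)
# create model with a single average pooling layer
model = Sequential()
model.add(AveragePooling2D(pool_size=(2, 2), input_shape=(6, 6, 1)))
# apply pooling and print the 3x3 result
print(model.predict(data).reshape(3, 3))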
Max pooling layer
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]
max(0.0, 0.0, 0.0, 0.0) = 0.0
max(3.0, 3.0, 3.0, 3.0) = 3.0
max(0.0, 0.0, 0.0, 0.0) = 0.0
The result is the first line of the max pooling operation:
[0.0, 3.0, 0.0]
Final result
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
[0.0, 3.0, 0.0]
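The same sketch with MaxPooling2D in place of AveragePooling2D reproduces the max pooling result.
# example of max pooling on the hand-crafted 6x6 feature map
import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling2D
data = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]] * 6).reshape(1, 6, 6, 1)
# create model with a single max pooling layer
model = Sequential()
model.add(MaxPooling2D(pool_size=(2, 2), input_shape=(6, 6, 1)))
# apply pooling and print the 3x3 result
print(model.predict(data).reshape(3, 3))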
Global pooling layer
Instead of down sampling patches of the input feature map, global pooling down samples the entire feature map to a single value. This would be the same as setting the pool size to the size of the input feature map.
The outcome is a single value that summarizes the strongest activation, or presence, of the vertical line in the input image.
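A minimal sketch with GlobalMaxPooling2D shows the whole 6x6 feature map above collapsing to the single value 3.0.
# example of global max pooling on the hand-crafted 6x6 feature map
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalMaxPooling2D
data = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0]] * 6).reshape(1, 6, 6, 1)
# create model with a single global max pooling layer
model = Sequential()
model.add(GlobalMaxPooling2D(input_shape=(6, 6, 1)))
# apply pooling and print the single summary value
print(model.predict(data))  # [[3.]]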
Fully connected layers are used at the end of the network, after feature extraction and consolidation has been performed by the convolutional and pooling layers. These layers may have a nonlinear activation function or a softmax activation in order to output probabilities of class predictions.
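To show where the fully connected layers sit, here is a minimal end-to-end sketch; the 8x8 input, the layer sizes, and the assumed 10-class softmax output are illustrative choices, not taken from the text above.
# example of a small CNN ending in fully connected layers
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# create model
model = Sequential()
model.add(Conv2D(8, (3,3), activation='relu', input_shape=(8, 8, 1)))  # feature extraction
model.add(MaxPooling2D(pool_size=(2, 2)))                              # consolidation
model.add(Flatten())                                                   # flatten feature maps to a vector
model.add(Dense(16, activation='relu'))                                # fully connected layer
model.add(Dense(10, activation='softmax'))                             # class probability outputs
# summarize model
model.summary()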