
01/08/2021

How to choose an Activation Function for Deep Learning

Activation functions are a critical part of the design of a neural network.

The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.


As such, a careful choice of activation function must be made for each deep learning neural network project.

  • Activation functions are a key part of neural network design.
  • The modern default activation function for hidden layers is the ReLU function.
  • The activation function for output layers depends on the type of prediction problem.

 Activation Functions

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network.

Sometimes the activation function is called a “transfer function”. Many activation functions are nonlinear.

The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

Technically, the activation function is used within or after the internal processing of each node in the network, although networks are designed to use the same activation function for all nodes in a layer.
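
As a minimal sketch (plain Python, with made-up inputs, weights, and bias values), a single node first computes the weighted sum of its inputs plus a bias, and the activation function then transforms that sum into the node's output:

# What a single node computes (illustrative, made-up values).
inputs = [0.5, -1.2, 3.0]    # outputs from the previous layer
weights = [0.4, 0.1, 0.6]    # one weight per input
bias = 0.2

# Weighted sum of the inputs plus the bias.
weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias

# The activation function (ReLU here) transforms the sum into the node's output.
output = max(0.0, weighted_sum)
print(weighted_sum, output)  # both roughly 2.08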

A network may have three types of layers: input layers that take raw input from the domain, hidden layers that take input from another layer and pass output to another layer, and output layers that make a prediction.

All hidden layers typically use the same activation function. The output layer will typically use a different activation function from the hidden layers.

There are many types of activation functions used in neural networks, although perhaps only a small number of functions are used in practice for hidden layers and output layers.

Activation for Hidden Layers

A hidden layer receives input from another layer (such as another hidden layer or an input layer) and provides output to another layer (such as another hidden layer or output layer).

A neural network may have zero or more hidden layers.

There are perhaps three activation functions you may want to consider for use in hidden layers; they are:

  • Rectified Linear Activation (ReLU)
  • Logistic (Sigmoid)
  • Hyperbolic Tangent (Tanh)

This is not an exhaustive list of activation functions used for hidden layers, but they are the most commonly used.

ReLU Hidden Layer Activation Function 

The Rectified Linear Activation function is perhaps the most common function used for hidden layers.

It is common because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh.

The ReLU function is calculated as follows:

  • max(0.0, x)


Plot of Inputs vs. Outputs for the ReLU Activation Function.
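
As a quick illustration (a Python sketch, not from the original article), the ReLU function can be implemented directly with max() and applied to a few sample inputs:

# Rectified linear activation: returns x for positive inputs, 0.0 otherwise.
def relu(x):
    return max(0.0, x)

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, relu(x))  # negative inputs map to 0.0, positive inputs pass through unchanged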


Sigmoid Hidden Layer Activation Function

The sigmoid activation function is also called the logistic function.

It is the same function used in the logistic regression classification algorithm. 

Input: any real value. 

Output: values in the range 0 to 1.

The sigmoid activation function is calculated as follows:  

  • 1.0 / (1.0 + e^-x)


Plot of Inputs vs. Outputs for the Sigmoid Activation Function
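
A similar Python sketch (sample inputs chosen only for illustration) shows how the sigmoid squashes any real value into the range 0 to 1:

import math

# Logistic (sigmoid) activation: squashes any real value into the range 0 to 1.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, sigmoid(x))  # close to 0.0 for large negative x, 0.5 at 0.0, close to 1.0 for large positive x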

Tanh Hidden Layer Activation Function

The hyperbolic tangent activation function is also referred to simply as the Tanh function.

It is very similar to the Sigmoid activation function and even has the same S-shape.

Input: any real value. Output: values in the range -1 to 1.

The Tanh activation function is calculated as follows:

  • (e^x - e^-x) / (e^x + e^-x)

Plot of Inputs vs. Outputs for the Tanh Activation Function
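
Again as a small Python sketch (sample inputs are illustrative), Tanh squashes any real value into the range -1 to 1:

import math

# Hyperbolic tangent activation: squashes any real value into the range -1 to 1.
def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, tanh(x))  # close to -1.0 for large negative x, 0.0 at 0.0, close to 1.0 for large positive x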

How to Choose a Hidden Layer Activation Function

A neural network will almost always have the same activation function in all hidden layers.

Historically, Sigmoid was the default activation function for hidden layers in the 1990s; from the mid-to-late 1990s through the 2010s, Tanh was the default instead.

The activation function is typically chosen based on the type of network architecture.

Modern neural network models with common architectures, such as MLPs and CNNs, will typically use the ReLU activation function.

Recurrent networks still commonly use Tanh or Sigmoid activation functions. For example, the LSTM commonly uses Sigmoid for recurrent connections and Tanh for output.

  • Multilayer Perceptron (MLP): ReLU activation function.
  • Convolutional Neural Network (CNN): ReLU activation function.
  • Recurrent Neural Network: Tanh and/or Sigmoid activation function.
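
As an example of the MLP convention, here is a minimal sketch using the Keras Sequential API (this assumes TensorFlow/Keras is installed; the input shape, layer sizes, and output activation are made up for illustration):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

model = Sequential([
    Input(shape=(10,)),              # 10 input features (made-up number)
    Dense(32, activation='relu'),    # hidden layer 1: ReLU
    Dense(32, activation='relu'),    # hidden layer 2: ReLU
    Dense(1, activation='sigmoid'),  # output layer (binary classification example)
])
model.summary()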


Activation for Output Layers

The output layer is the layer in a neural network model that directly outputs a prediction.

All feed-forward neural networks have an output layer.

There are three activation functions you may want to consider for use in the output layer; they are:

  • Linear
  • Logistic (Sigmoid)
  • Softmax  

Linear Output Activation Function

The linear activation function is also called the 'identity' function or 'no activation'.

This is because the linear activation function does not change the weighted sum of the input and instead returns the value directly.

Plot of Inputs vs. Outputs for the Linear Activation Function

Sigmoid Output Activation Function

The sigmoid or logistic activation function was described in the previous section.

Softmax Output Activation Function 

The softmax function outputs a vector of values that sum to 1.0 and that can be interpreted as probabilities of class membership.

The input to the function is a vector of real values, and the output is a vector of the same length with values that sum to 1.0, like probabilities.

The softmax function is calculated as follows:

  • e^x / sum(e^x)
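
A direct Python sketch of this calculation (with made-up class scores) looks like the following:

import math

# Softmax: converts a vector of real values into positive values that sum to 1.0.
def softmax(values):
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 3.0, 2.0]  # e.g. raw scores for three classes
print(softmax(scores))    # roughly [0.09, 0.67, 0.24], and the values sum to 1.0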

The softmax function is not plotted here because it maps a vector of inputs to a vector of outputs, rather than a single input to a single output.

How to Choose an Output Activation Function

You must choose an activation function for your output layer based on the type of prediction problem that you are solving.

You may divide prediction problems into two main groups, predicting a categorical variable (classification) and predicting a numerical variable (regression).

If your problem is a regression problem, you should use a linear activation function.

  • Regression: One node, linear activation.

If your problem is a classification problem, then there are three main types of classification problems to consider.

If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation should be used.

If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. 

If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation is used. 

  • Binary Classification: One node, sigmoid activation.
  • Multiclass Classification: One node per class, softmax activation.
  • Multilabel Classification: One node per class, sigmoid activation.
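
To make the mapping concrete, here is a hedged Keras-style sketch of the output layer for each case (this assumes TensorFlow/Keras; the class counts are made up, and only the final Dense layer of each model is shown):

from tensorflow.keras.layers import Dense

regression_output = Dense(1, activation='linear')    # regression: one node, linear
binary_output     = Dense(1, activation='sigmoid')   # binary classification: one node, sigmoid
multiclass_output = Dense(5, activation='softmax')   # multiclass classification: one node per class (here 5), softmax
multilabel_output = Dense(5, activation='sigmoid')   # multilabel classification: one node per class (here 5), sigmoid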



Reference

How to Choose an Activation Function for Deep Learning (machinelearningmastery.com)
