10/10/2021

Dimensionality Reduction - Part 1 - What is Dimensionality Reduction?

The number of input variables or features for a dataset is referred to as its dimensionality. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. More input features often make a predictive modeling task more challenging, a problem generally referred to as the curse of dimensionality.

High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

In this tutorial, you will discover a gentle introduction to dimensionality reduction for machine learning.

After reading this tutorial, you will know:
  • Large numbers of input features can cause poor performance for machine learning algorithms.
  • Dimensionality reduction is a general field of study concerned with reducing the number of input features.
  • Dimensionality reduction methods include feature selection, linear algebra methods, projection methods, and autoencoders.

This tutorial is divided into three parts; they are:
  • Problem With Many Input Variables
  • Dimensionality Reduction
  • Techniques for Dimensionality Reduction

A. Problem With Many Input Variables

The performance of machine learning algorithms can degrade with too many input variables.

If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. 

Input variables are also called features. We can consider the columns of data as representing dimensions of an n-dimensional feature space and the rows of data as points in that space. This is a useful geometric interpretation of a dataset.

Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that we have in that space (rows of data) often represent a small and non-representative sample.
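
To make this concrete, here is a minimal sketch using NumPy (my choice of library; the sample size and dimensions are illustrative). With a fixed number of random points in a unit hypercube, the average distance between points grows with the number of dimensions, so the same sample covers the space ever more thinly:

    import numpy as np

    rng = np.random.default_rng(42)

    for d in (2, 10, 100, 1000):
        # 100 random points in the d-dimensional unit hypercube
        points = rng.random((100, d))
        # all pairwise Euclidean distances between the points
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # the mean pairwise distance grows with d, so a fixed-size
        # sample becomes an ever sparser cover of the space
        mean = dists[np.triu_indices(100, k=1)].mean()
        print(f"d={d:4d}  mean pairwise distance = {mean:.2f}")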

This can dramatically impact the performance of machine learning algorithms fit on data with many input features; this challenge is generally referred to as the curse of dimensionality.

Therefore, it is often desirable to reduce the number of input features.
This reduces the number of dimensions of the feature space, hence the name dimensionality reduction.


B. Dimensionality Reduction

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

High dimensionality might mean hundreds, thousands, or even millions of input variables.

Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model; these are referred to as degrees of freedom.

A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.

Any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.
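
Here is a minimal sketch of this pattern, assuming scikit-learn with PCA as the chosen reduction technique and a synthetic dataset (none of these specifics come from the tutorial): the transform is learned on the training data only and then applied unchanged to the test data:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # learn the reduction on the training data only...
    pca = PCA(n_components=5)
    X_train_reduced = pca.fit_transform(X_train)

    # ...then apply the same learned transform to new data
    X_test_reduced = pca.transform(X_test)
    print(X_train_reduced.shape, X_test_reduced.shape)  # (375, 5) (125, 5)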


C. Techniques for Dimensionality Reduction

There are many techniques that can be used for dimensionality reduction. In this section, we will review the main techniques.

1. Feature Selection Methods

Feature selection methods keep a subset of the input features that are most relevant to the target variable. Two main classes of feature selection techniques include wrapper methods, which search for subsets of features that perform best with a chosen model, and filter methods, which score each feature using a statistical measure of its relationship with the target.
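
Here is a minimal sketch of both classes, assuming scikit-learn (the tutorial does not name a library); the synthetic dataset and the choice of 4 features are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, random_state=1)

    # filter method: keep the 4 features with the best ANOVA F-scores
    X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

    # wrapper method: recursive feature elimination driven by a model
    rfe = RFE(estimator=LogisticRegression(max_iter=1000),
              n_features_to_select=4)
    X_wrapper = rfe.fit_transform(X, y)

    print(X_filter.shape, X_wrapper.shape)  # (200, 4) (200, 4)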

2. Matrix Factorization

Techniques from linear algebra can be used for dimensionality reduction. Specifically, matrix factorization methods can be used to reduce a dataset matrix into its constituent parts.

Examples include the eigendecomposition and the singular value decomposition (SVD). The resulting parts can then be ranked, and a subset of them selected that best captures the salient structure of the matrix and can be used to represent the dataset. The most common method for ranking the components is Principal Component Analysis, or PCA for short.
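
Here is a minimal sketch of the idea using NumPy's SVD (one way to implement it; the tutorial prescribes no particular code, and the data and the choice of three components are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((100, 10))

    # center the data, then factorize the matrix with SVD
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # the singular values rank the components; projecting onto
    # the top 3 gives a reduced representation (this is PCA)
    X_reduced = Xc @ Vt[:3].T
    print(X_reduced.shape)  # (100, 3)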

3. Manifold Learning

Techniques from high-dimensionality statistics can also be used for dimensionality reduction.

These techniques are sometimes referred to as manifold learning and are used to create a low-dimensional projection of high-dimensional data, often for the purposes of data visualization.

The projection is designed to create a low-dimensional representation of the dataset while best preserving the salient structure or relationships in the data. Examples of manifold learning techniques include the following (a t-SNE sketch follows the list):
  • Kohonen Self-Organizing Map (SOM)
  • Sammon's Mapping
  • Multidimensional Scaling (MDS)
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
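
Here is a minimal sketch of the last of these, assuming scikit-learn's TSNE and its bundled digits dataset (both are my choices for illustration): 64-dimensional digit images are projected down to two dimensions, typically for plotting:

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    # each digit image is a point in a 64-dimensional feature space
    X, y = load_digits(return_X_y=True)

    # project to 2 dimensions for visualization
    X_2d = TSNE(n_components=2, random_state=1).fit_transform(X)
    print(X_2d.shape)  # (1797, 2)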

4. Autoencoder Methods

An autoencoder is a kind of unsupervised neural network that is used for dimensionality reduction and feature discovery. More precisely, an autoencoder is a feedforward neural network that is trained to predict its own input.
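
Here is a minimal sketch with Keras (an assumed library choice; the layer sizes and training settings are illustrative): the network is trained to reconstruct its input, and the narrow bottleneck layer then serves as the reduced representation:

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(500, 20).astype("float32")

    # the encoder compresses 20 inputs into a 3-dimensional bottleneck;
    # the decoder tries to reconstruct the original 20 inputs
    inputs = keras.Input(shape=(20,))
    bottleneck = keras.layers.Dense(3, activation="relu")(inputs)
    outputs = keras.layers.Dense(20, activation="linear")(bottleneck)

    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=10, verbose=0)  # the target is the input itself

    # the trained encoder alone performs the dimensionality reduction
    encoder = keras.Model(inputs, bottleneck)
    print(encoder.predict(X, verbose=0).shape)  # (500, 3)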

5. Tips for Dimensionality Reduction

There is no best technique for dimensionality reduction and no mapping of techniques to problems. Instead, the best approach is to use systematic controlled experiments to discover what dimensionality reduction techniques, when paired with your model of choice, result in the best performance on your dataset.

Typically, linear algebra and manifold learning methods assume that all input features have the same scale or distribution. This suggests that it is good practice to either normalize or standardize data prior to using these methods if the input variables have differing scales or units.
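
Here is a minimal sketch of that practice, again assuming scikit-learn (Pipeline, StandardScaler, and the parameter choices are mine for illustration): standardization and the reduction are chained so that both are learned from the training data together and can be reapplied to new data:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=15, random_state=1)

    # standardize, then reduce; the pipeline keeps the steps together
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("reduce", PCA(n_components=5)),
    ])
    X_reduced = pipeline.fit_transform(X)
    print(X_reduced.shape)  # (300, 5)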



