Feature selection is a process of reducing the number of input variables when developing a predictive model.
It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in many cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable.
After reading this tutorial, you will know:
- There are two main types of feature selection techniques: supervised and unsupervised, and supervised methods may be divided into wrapper, filter and intrinsic.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between the input variables and the target variable; these scores can then be used to filter out all but the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.
This tutorial is divided into four parts; they are:
- Feature Selection
- Statistics for Filter Feature Selection Methods
- Feature Selection With Any Data Type
- Common Questions
A. Feature Selection
One way to think about feature selection methods is in terms of supervised and unsupervised methods. The difference has to do with whether features are selected with reference to the target variable or not.
- Unsupervised Selection: Do not use the target variable (e.g. remove redundant variables).
- Supervised Selection: Use the target variable (e.g. remove irrelevant variables).
Supervised feature selection methods may be further classified into three groups: intrinsic, filter, and wrapper methods.
- Intrinsic: Algorithms that perform automatic feature selection during training.
- Filter: Select subsets of features based on their relationship with the target.
- Wrapper: Search for subsets of features that perform well according to a predictive model.
Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive.
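As a minimal sketch of a wrapper-style method, the example below uses scikit-learn's RFE (Recursive Feature Elimination) wrapped around a logistic regression on a synthetic dataset; the base model, feature counts, and data are arbitrary choices for illustration.

```python
# Sketch of a wrapper method: RFE repeatedly fits the wrapped model
# and drops the weakest features until the desired number remain.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic data: 10 input features, only 5 informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=1)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier
```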
Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to rank and choose those input variables that will be used in the model.
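As a minimal sketch of the filter pattern, the example below scores each input against the target with mutual information (one of the measures discussed later) and ranks the features by score; the dataset and feature counts are synthetic and chosen arbitrarily.

```python
# Sketch of a filter method: score each input variable against the
# target with mutual information, then rank the features by score.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=1)

scores = mutual_info_classif(X, y, random_state=1)
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
for index, score in ranked:
    print(index, round(score, 3))  # feature index and its score, best first
```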
There are some machine learning algorithms that perform feature selection automatically as part of learning the model. We might refer to these techniques as intrinsic feature selection methods. This includes algorithms such as penalized regression models like Lasso and decision trees, including ensembles of decision trees like random forest.
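As a minimal sketch of an intrinsic method, the example below fits a Lasso-penalized regression on a synthetic dataset; coefficients driven to exactly zero correspond to features the model has effectively discarded. The penalty strength is an arbitrary choice for illustration.

```python
# Sketch of an intrinsic method: Lasso drives the coefficients of
# irrelevant inputs to exactly zero as part of fitting the model.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=10, n_informative=4, noise=5.0, random_state=1)

model = Lasso(alpha=1.0)
model.fit(X, y)
# non-zero coefficients indicate the features the model kept
selected = [i for i, coef in enumerate(model.coef_) if coef != 0.0]
print(selected)
```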
Feature selection is also related to dimensionality reduction techniques in that both seek fewer input variables for a predictive model. The difference is that feature selection selects features to keep or remove from the dataset, whereas dimensionality reduction creates a projection of the data, resulting in entirely new input features. As such, dimensionality reduction is an alternative to feature selection rather than a type of feature selection.
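To make the contrast concrete, the sketch below projects a synthetic dataset with PCA: the three output columns are newly constructed features, not a subset of the original ten. The component count is an arbitrary choice for illustration.

```python
# Contrast: dimensionality reduction builds new features rather than
# keeping a subset of the originals (sketch using PCA).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

pca = PCA(n_components=3)
X_projected = pca.fit_transform(X)
print(X_projected.shape)  # (500, 3) -- three new constructed features
```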
B. Statistics for Filter Feature Selection Methods
It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection.
The choice of statistical measures is highly dependent upon the variable data types. Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.
The type of output variable typically indicates the type of predictive modeling problem being performed.
- Numerical Output: Regression predictive modeling problem.
- Categorical Output: Classification predictive modeling problem.
The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures.
We can consider a grid of input and output variable types and select statistical measures of relationship or correlation designed to work with these data types.
1. Numerical Input, Numerical Output
This is a regression predictive modeling problem with numerical input variables. The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.
- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear).
- Mutual Information.
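As a minimal sketch of this case, the example below scores each numerical input against a numerical target with Pearson's and Spearman's correlation from SciPy; the synthetic regression dataset is an arbitrary choice for illustration.

```python
# Sketch: score numerical inputs against a numerical target with
# Pearson's (linear) and Spearman's (rank-based) correlation.
from scipy.stats import pearsonr, spearmanr
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=1)

for i in range(X.shape[1]):
    pearson_corr, _ = pearsonr(X[:, i], y)
    spearman_corr, _ = spearmanr(X[:, i], y)
    print(i, round(pearson_corr, 3), round(spearman_corr, 3))
```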
2. Numerical Input, Categorical Output
This might be the most common example of a classification problem. Again, the most common techniques are correlation-based, although in this case, they must take the categorical target into account.
- ANOVA correlation coefficient (linear).
- Kendall’s rank coefficient (nonlinear).
- Mutual Information.
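As a minimal sketch of this case, the example below uses scikit-learn's SelectKBest with the ANOVA F-statistic (f_classif) on a synthetic classification dataset; the value of k and the data are arbitrary choices for illustration.

```python
# Sketch: filter selection with the ANOVA F-test for numerical inputs
# and a categorical (class label) target.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=1)

selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # ANOVA F-score for each input feature
print(X_selected.shape)   # (500, 4)
```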
3. Categorical Input, Numerical Output
This is a regression predictive modeling problem with categorical input variables. It is an unusual kind of regression problem (i.e. you would not encounter it often). Nevertheless, you can use the same Numerical Input, Categorical Output methods (described above), but in reverse.
4. Categorical Input, Categorical Output
This is a classification predictive modeling problem with categorical input variables. The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.
- Chi-Squared test (contingency tables).
- Mutual Information.
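As a minimal sketch of this case, the example below applies scikit-learn's chi-squared score to categorical inputs and a categorical target; note that chi2 in scikit-learn requires the inputs to be encoded as non-negative numbers (e.g. ordinal-encoded). The synthetic data and value of k are arbitrary choices for illustration.

```python
# Sketch: chi-squared filter selection for categorical inputs and a
# categorical target. Inputs must be encoded as non-negative numbers.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(1)
X = rng.randint(0, 3, size=(500, 6))      # 6 categorical inputs, 3 levels each
y = (X[:, 0] + X[:, 1] > 2).astype(int)   # target depends on the first two inputs

selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)
print(selector.scores_)  # chi-squared statistic for each input
```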
C. Feature Selection With Any Data Type
It is rare that we have a dataset with just a single input variable data type.
The following methods can be used to handle datasets that mix different input variable data types:
- Tree-Searching Methods (depth-first, breadth-first, etc.).
- Stochastic Global Search (simulated annealing, genetic algorithms).
- Step-Wise Models.
- Recursive Feature Elimination (RFE).
- Classification and Regression Trees (CART).
- Random Forest.
- Bagged Decision Trees.
- Gradient Boosting.
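One practical sketch for mixed input types, using the tree-based options from the list above, is to one-hot encode the categorical columns and let a random forest score every resulting column through its feature importances. The column names and data below are hypothetical and purely for illustration.

```python
# Sketch: handle mixed input types by encoding the categorical columns
# and ranking everything with random forest feature importances.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# hypothetical mixed-type dataset (numerical + categorical inputs)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 44, 36],
    "income": [40000, 52000, 61000, 58000, 70000, 38000, 66000, 49000],
    "city": ["a", "b", "a", "c", "b", "c", "a", "b"],
    "bought": [0, 1, 1, 1, 1, 0, 1, 0],
})
X = pd.get_dummies(df.drop(columns="bought"))  # one-hot encode 'city'
y = df["bought"]

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
for name, importance in zip(X.columns, model.feature_importances_):
    print(name, round(importance, 3))
```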
D. Common Questions
1. How Do You Filter Input Variables?
There are two main techniques for filtering input variables:
- Select the top k variables: SelectKBest.
- Select the top percentile variables: SelectPercentile.
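A minimal sketch contrasting the two scikit-learn classes, assuming ANOVA F-test scoring on a synthetic dataset (the values of k and percentile are arbitrary choices for illustration):

```python
# Sketch: keep the top-k features vs. the top percentile of features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=1)

top_k = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
top_pct = SelectPercentile(score_func=f_classif, percentile=25).fit_transform(X, y)
print(top_k.shape, top_pct.shape)  # (500, 5) and (500, 5)
```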
2. How Can I Use Statistics for Other Data Types?
Consider transforming the variables in order to access different statistical methods. For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out. You can also make a numerical variable discrete to try categorical-based measures.
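For example, the sketch below discretizes numerical inputs with scikit-learn's KBinsDiscretizer so that a categorical-style measure such as chi-squared can be applied; the bin count and dataset are arbitrary choices for illustration.

```python
# Sketch: make numerical inputs discrete so categorical measures apply.
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=500, n_features=5, random_state=1)

# discretize each numerical input into 5 ordinal bins (non-negative integers)
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)

scores, _ = chi2(X_binned, y)
print(scores)  # chi-squared statistic for each discretized input
```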
3. What is the Best Feature Selection Method?
This is unknowable. Just as there is no single best machine learning algorithm, there is no single best feature selection technique, at least not universally. Instead, you must discover what works best for your specific problem using careful, systematic experimentation: try a range of different techniques and see which one gives the best results on your dataset.
4. What is Univariate Feature Selection?
In filter-based feature selection, the statistical measures are calculated considering only a single input variable at a time with the target (output) variable.
These statistical measures are termed univariate statistical measures, which means that the interaction between input variables is not considered in the filtering process.
Univariate feature selection selects the best features on the basis of univariate statistical tests: each feature is compared to the target variable to determine whether there is a statistically significant relationship between them.
A common univariate test for this purpose is the analysis of variance (ANOVA) F-test. Most filter techniques are univariate, meaning that they evaluate each predictor in isolation.