Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm. RFE is popular because it is easy to configure and use, and because it is effective at selecting those features (columns) in a training dataset that are most relevant in predicting the target variable.
In this tutorial, you will discover how to use Recursive Feature Elimination (RFE) for feature selection in Python. After completing this tutorial, you will know:
- RFE is an efficient approach for eliminating features from a training dataset for feature selection.
- How to use RFE for feature selection for classification and regression predictive modeling problems.
- How to explore the number of selected features and the wrapped algorithm used by the RFE procedure.
This tutorial is divided into three parts; they are:
- Recursive Feature Elimination
- RFE with scikit-learn
- RFE Hyperparameters
A. Recursive Feature Elimination
A machine learning dataset for classification or regression is comprised of rows and columns, like a spreadsheet. Rows are often referred to as instances and columns are referred to as features.
Feature selection refers to techniques that select a subset of the most relevant features (columns) for a dataset. Fewer features can allow machine learning algorithms to run more efficiently (less space or time complexity) and be more effective. Some machine learning algorithms can be misled by irrelevant input features, resulting in worse predictive performance.
RFE is a wrapper-type feature selection algorithm. This means that a separate machine learning algorithm is wrapped by RFE and used in the core of the method to help select features. This is in contrast to filter-based feature selection methods that score each feature and select those features with the largest (or smallest) score. Technically, RFE is a wrapper-style feature selection algorithm that also uses filter-based feature selection internally.
RFE works by searching for a subset of features, starting with all features in the training dataset and successively removing features until the desired number remains. This is achieved by fitting the machine learning algorithm used in the core of the method, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until the specified number of features remains.
Features are scored either using the provided machine learning model (e.g. some algorithms like decision trees offer importance scores) or by using a statistical method.
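To make the procedure concrete, below is a minimal sketch of the elimination loop, assuming a wrapped model that exposes feature_importances_ after fitting (such as a decision tree) and removing one feature per iteration. This is illustrative only; the RFE class from scikit-learn, covered in the next section, implements this procedure for you, including support for coefficient-based models and removing multiple features per step.
# minimal sketch of the recursive elimination loop (illustrative, not the scikit-learn implementation)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
# define a dataset to select features from
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
n_features_to_select = 5
# start with all column indices as candidates
remaining = list(range(X.shape[1]))
while len(remaining) > n_features_to_select:
    # re-fit the model on the surviving columns
    model = DecisionTreeClassifier(random_state=1)
    model.fit(X[:, remaining], y)
    # rank the surviving columns by importance and drop the least important one
    least = int(np.argmin(model.feature_importances_))
    remaining.pop(least)
print('Selected columns: %s' % remaining)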
B. RFE with scikit-learn
The scikit-learn Python machine learning library provides an implementation of RFE via the RFE class. RFE is a transform. To use it, the class is first configured with the chosen algorithm, specified via the estimator argument, and the number of features to select, specified via the n_features_to_select argument.
RFE requires a nested algorithm that is used to provide the feature importance scores, such as a decision tree.
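Because RFE is a transform, a configured and fit RFE object can be used directly to produce a reduced version of a dataset. A minimal sketch, using a decision tree as the wrapped algorithm:
# use RFE as a data transform (a minimal sketch)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# configure RFE with a wrapped algorithm and the number of features to select
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit and transform to keep only the selected columns
X_selected = rfe.fit_transform(X, y)
print(X.shape, X_selected.shape)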
1. RFE for Classification
First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 10 input features, five of which are informative and five of which are redundant.
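A short snippet to create the dataset and confirm its shape:
# summarize the synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)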
Next, we can evaluate an RFE feature selection algorithm on this dataset. We will use a DecisionTreeClassifier to choose features and set the number of features to five. We will then fit a new DecisionTreeClassifier model on the selected features.
We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
# evaluate RFE for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
-----Result-----
Accuracy: 0.886 (0.030)
We can see that the RFE that uses a decision tree to select five features, and then fits a decision tree on the selected features, achieves a classification accuracy of about 88.6 percent.
We can also use the RFE model pipeline as a final model and make predictions for classification.
# make a prediction with an RFE pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# fit the model on all available data
pipeline.fit(X, y)
# make a prediction for one example
data = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = pipeline.predict(data)
print('Predicted Class: %d' % (yhat[0]))
-----Result-----
Predicted Class: 1
2. RFE for Regression
In this section, we will look at using RFE for a regression problem. First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 10 input features, five of which are informative. Next, we can evaluate an RFE feature selection algorithm on this dataset, this time using a DecisionTreeRegressor as both the wrapped algorithm and the final model. We will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds, and report the mean absolute error (MAE) across all repeats and folds.
The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better and a perfect model has a MAE of 0.
# evaluate RFE for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv,
n_jobs=-1)
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
-----Result-----
MAE: -26.853 (2.696)
We can also use the RFE as part of the final model and make predictions for regression.
# make a regression prediction with an RFE pipeline
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# fit the model on all available data
pipeline.fit(X, y)
# make a prediction for one example
data = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381,
0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = pipeline.predict(data)
print('Predicted: %.3f' % (yhat[0]))
-----Result-----
Predicted: -84.288
C. RFE Hyperparameters
In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the RFE method for feature selection and their effect on model performance.
1. Explore Number of Features
An important hyperparameter for the RFE algorithm is the number of features to select.
In the previous section, we used an arbitrary number of selected features, five, which matches the number of informative features in the synthetic dataset. In practice, we cannot know the best number of features to select with RFE; instead, it is good practice to test different values.
The example below demonstrates selecting different numbers of features, from two to nine, on the synthetic binary classification dataset.
# explore the number of selected features for RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(2, 10):
        rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=i)
        model = DecisionTreeClassifier()
        models[str(i)] = Pipeline(steps=[('s',rfe),('m',model)])
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
-----Result-----
>2 0.715 (0.044)
>3 0.825 (0.031)
>4 0.876 (0.033)
>5 0.887 (0.030)
>6 0.890 (0.031)
>7 0.888 (0.025)
>8 0.885 (0.028)
>9 0.884 (0.025)
Box Plot of RFE Number of Selected Features vs. Classification Accuracy
We can see that performance improves as the number of features increases and perhaps peaks around four-to-seven features, as we might expect, given that only five features are relevant to the target variable.
2. Automatically Select the Number of Features
It is also possible to automatically select the number of features chosen by RFE. This can be achieved by performing cross-validation evaluation of different numbers of features as we did in the previous section and automatically selecting the number of features that resulted in the best mean score. The RFECV class implements this.
The RFECV is configured just like the RFE class regarding the choice of the wrapped algorithm. Additionally, the minimum number of features to be considered can be specified via the min_features_to_select argument (defaults to 1), and we can also specify the type of cross-validation and scoring to use via the cv argument (defaults to 5-fold) and the scoring argument (defaults to accuracy for classification).
# automatically select the number of features for RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# create pipeline
rfe = RFECV(estimator=DecisionTreeClassifier())
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
-----Result-----
Accuracy: 0.886 (0.026)
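If we want to know how many features RFECV actually settled on, we can fit it on all available data and inspect its n_features_ attribute. A minimal sketch is below; note that within the cross-validation above, the number chosen may differ from fold to fold.
# report the number of features chosen by RFECV (a minimal sketch)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# fit RFECV on all available data
rfecv = RFECV(estimator=DecisionTreeClassifier())
rfecv.fit(X, y)
# n_features_ holds the number of selected features after fitting
print('Number of selected features: %d' % rfecv.n_features_)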
3. Which Features Were Selected
When using RFE, we may be interested to know which features were selected and which were removed. This can be achieved by reviewing the attributes of the fit RFE object (or fit RFECV object).
# report which features were selected by RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
random_state=1)
# define RFE
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit RFE
rfe.fit(X, y)
# summarize all features
for i in range(X.shape[1]):
    print('Column: %d, Selected=%s, Rank: %d' % (i, rfe.support_[i], rfe.ranking_[i]))
-----Result-----
Column: 0, Selected=False, Rank: 4
Column: 1, Selected=False, Rank: 5
Column: 2, Selected=True, Rank: 1
Column: 3, Selected=True, Rank: 1
Column: 4, Selected=True, Rank: 1
Column: 5, Selected=False, Rank: 6
Column: 6, Selected=True, Rank: 1
Column: 7, Selected=False, Rank: 2
Column: 8, Selected=True, Rank: 1
Column: 9, Selected=False, Rank: 3
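If only the integer indices of the retained columns are needed, the fit RFE object also provides the get_support() method from scikit-learn's selector interface; a minimal sketch:
# report just the indices of the columns kept by RFE (a minimal sketch)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit RFE
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
rfe.fit(X, y)
# get_support(indices=True) returns the integer indices of the selected columns
print(rfe.get_support(indices=True))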
4. Explore Base Algorithm
It might be helpful to explore the use of different algorithms wrapped by RFE.
# explore the algorithm wrapped by RFE
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    # lr
    rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)
    model = DecisionTreeClassifier()
    models['lr'] = Pipeline(steps=[('s',rfe),('m',model)])
    # perceptron
    rfe = RFE(estimator=Perceptron(), n_features_to_select=5)
    model = DecisionTreeClassifier()
    models['per'] = Pipeline(steps=[('s',rfe),('m',model)])
    # cart
    rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
    model = DecisionTreeClassifier()
    models['cart'] = Pipeline(steps=[('s',rfe),('m',model)])
    # rf
    rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)
    model = DecisionTreeClassifier()
    models['rf'] = Pipeline(steps=[('s',rfe),('m',model)])
    # gbm
    rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=5)
    model = DecisionTreeClassifier()
    models['gbm'] = Pipeline(steps=[('s',rfe),('m',model)])
    return models
# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
-----Result-----
>lr 0.893 (0.030)
>per 0.843 (0.040)
>cart 0.887 (0.033)
>rf 0.858 (0.038)
>gbm 0.891 (0.030)
Box Plot of RFE Wrapped Algorithm vs. Classification Accuracy
We can see the general trend of good performance with logistic regression, CART, and perhaps GBM. This highlights that even though the actual model used to fit the selected features is the same in each case, the model used within RFE can make an important difference to which features are selected and, in turn, to performance on the prediction problem.