The input features for a predictive modeling task often interact in unexpected and nonlinear ways. These interactions can be identified and modeled by a learning algorithm. Another approach is to engineer new features that expose these interactions and see whether they improve model performance.
Transforms like raising input variables to a power can help to better expose the important relationships between input variables and the target variable.
In this tutorial, you will discover how to use polynomial feature transforms for feature engineering with numerical input variables. After completing this tutorial, you will know:
- Some machine learning algorithms prefer or perform better with polynomial input features.
- How to use the polynomial features transform to create new versions of input variables for predictive modeling.
- How the degree of the polynomial impacts the number of input features created by the transform.
This tutorial is divided into four parts; they are:
- Polynomial Features
- Polynomial Feature Transform
- Polynomial Feature Transform Example
- Effect of Polynomial Degree
A. Polynomial Features
Polynomial features are those features created by raising existing features to an exponent. For example, if a dataset had one input feature X, then a polynomial feature would be the addition of a new feature (column) where values were calculated by squaring the values in X, e.g. X^2.
This process can be repeated for each input variable in the dataset, creating a transformed version of each. As such, polynomial features are a type of feature engineering, e.g. the creation of new input features based on the existing features.
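Before looking at the scikit-learn transform, the idea can be sketched directly with NumPy; the values below are purely illustrative, showing a squared copy of a single input column appended as a new feature.

# sketch: manually engineering a squared feature with NumPy (illustrative values only)
from numpy import asarray
from numpy import hstack
# a single input feature with a few example rows
X = asarray([[2.0], [3.0], [4.0]])
# create a new column by squaring the existing column, then append it
X_new = hstack((X, X ** 2))
print(X_new)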
The degree of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables for each input variable. Typically a small degree is used such as 2 or 3.
It is also common to add new variables that represent the interaction between features, e.g. a new column that represents one variable multiplied by another. This too can be repeated for each input variable, creating a new interaction variable for each pair of input variables.
A squared or cubed version of an input variable will change the probability distribution, separating the small and large values, a separation that is increased with the size of the exponent.
This separation can help some machine learning algorithms make better predictions and is common for regression predictive modeling tasks, and more generally for tasks that have numerical input variables. Typically, linear algorithms such as linear regression and logistic regression respond well to the use of polynomial input variables.
For example, when used as input to a linear regression algorithm, the method is more broadly referred to as polynomial regression.
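As a rough sketch of this idea (not part of the tutorial's worked example), a polynomial regression can be built by chaining the transform with a linear model in a pipeline; the synthetic quadratic data below is assumed purely for illustration.

# sketch: polynomial regression as PolynomialFeatures followed by LinearRegression
# the synthetic quadratic dataset is assumed for illustration
from numpy import asarray
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
# data generated from y = x^2
X = asarray([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = asarray([1.0, 4.0, 9.0, 16.0, 25.0])
# expand the input to [1, x, x^2], then fit a linear model on the expanded features
model = Pipeline(steps=[('t', PolynomialFeatures(degree=2)), ('m', LinearRegression())])
model.fit(X, y)
# prediction for x=6 should be close to 36
print(model.predict(asarray([[6.0]])))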
B. Polynomial Feature Transform
The polynomial features transform is available in the scikit-learn Python machine learning library via the PolynomialFeatures class. The features created include:
- The bias (the value of 1.0)
- Values raised to a power for each degree (e.g. x^1, x^2, x^3, ...)
- Interactions between all pairs of features (e.g. x1 × x2, x1 × x3, ...)
For example, with two input variables with values 2 and 3 and a degree of 2, the features created would be:
- 1 (the bias)
- 2^1 = 2
- 3^1 = 3
- 2^2 = 4
- 3^2 = 9
- 2 * 3 = 6
# demonstrate the types of features created
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# define the dataset
data = asarray([[2,3],[2,3],[2,3]])
print(data)
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
print(data)
-----Result-----
[[2 3]
[2 3]
[2 3]]
[[1. 2. 3. 4. 6. 9.]
[1. 2. 3. 4. 6. 9.]
[1. 2. 3. 4. 6. 9.]]
- The degree argument controls the number of features created and defaults to 2.
- The interaction_only argument means that only the raw values (degree 1) and the interaction (pairs of values multiplied with each other) are included, defaulting to False.
- The include_bias argument defaults to True to include the bias feature.
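To get a feel for the last two arguments, the short sketch below repeats the small example above with interaction_only=True and include_bias=False; this particular configuration is just for illustration.

# sketch: effect of the interaction_only and include_bias arguments (illustrative configuration)
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# the same two-feature rows used above
data = asarray([[2,3],[2,3],[2,3]])
# keep the raw values and their pairwise interaction only, and drop the bias column
trans = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(trans.fit_transform(data))
# expected columns per row: x1, x2, x1*x2 -> [2. 3. 6.]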
C. Polynomial Feature Transform Example
We will use the Sonar dataset in this tutorial. It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.
We can apply the polynomial features transform to the Sonar dataset directly. In this case, we will use a degree of 3.
# visualize a polynomial features transform of the sonar dataset
from pandas import read_csv
from pandas import DataFrame
from sklearn.preprocessing import PolynomialFeatures
# load dataset
dataset = read_csv('sonar.csv', header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=3)
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# summarize
print(dataset.shape)
-----Result-----
(208, 39711)
Next, let's evaluate the same KNN model as in the previous section, but in this case on a polynomial features transform of the dataset.
# evaluate knn on the sonar dataset with polynomial features transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
# load dataset
dataset = read_csv('sonar.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = PolynomialFeatures(degree=3)
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
-----Result-----
Accuracy: 0.800 (0.077)
D. Effect of Polynomial Degree
The degree of the polynomial dramatically increases the number of input features. To get an idea of how much this impacts the number of features, we can perform the transform with a range of different degrees and compare the number of features in the dataset.
# compare the effect of the degree on the number of created features
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from matplotlib import pyplot
# get the dataset
def get_dataset(filename):
    # load dataset
    dataset = read_csv(filename, header=None)
    data = dataset.values
    # separate into input and output columns
    X, y = data[:, :-1], data[:, -1]
    # ensure inputs are floats and output is an integer label
    X = X.astype('float32')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y
# define dataset
X, y = get_dataset('sonar.csv')
# calculate change in number of features
num_features = list()
degrees = [i for i in range(1, 6)]
for d in degrees:
    # create transform
    trans = PolynomialFeatures(degree=d)
    # fit and transform
    data = trans.fit_transform(X)
    # record number of features
    num_features.append(data.shape[1])
    # summarize
    print('Degree: %d, Features: %d' % (d, data.shape[1]))
# plot degree vs number of features
pyplot.plot(degrees, num_features)
pyplot.show()
-----Result-----
Degree: 1, Features: 61
Degree: 2, Features: 1891
Degree: 3, Features: 39711
Degree: 4, Features: 635376
Degree: 5, Features: 8259888
Line Plot of the Degree vs. the Number of Input Features for the Polynomial Feature Transform
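These counts follow the standard combinatorics of the transform: with the bias included, n input features expanded to degree d produce C(n + d, d) output features. The quick sketch below (using math.comb, available in Python 3.8+) should reproduce the numbers printed above for n = 60.

# sketch: number of created features as the binomial coefficient C(n + d, d)
from math import comb
n = 60
for d in range(1, 6):
    print('Degree: %d, Features: %d' % (d, comb(n + d, d)))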
More features may result in more overfitting, and in turn, worse results.
# explore the effect of degree on accuracy for the polynomial features transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# get the dataset
def get_dataset(filename):
    # load dataset
    dataset = read_csv(filename, header=None)
    data = dataset.values
    # separate into input and output columns
    X, y = data[:, :-1], data[:, -1]
    # ensure inputs are floats and output is an integer label
    X = X.astype('float32')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# get a list of models to evaluate
def get_models():
    models = dict()
    for d in range(1, 5):
        # define the pipeline
        trans = PolynomialFeatures(degree=d)
        model = KNeighborsClassifier()
        models[str(d)] = Pipeline(steps=[('t', trans), ('m', model)])
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define dataset
X, y = get_dataset('sonar.csv')
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
-----Result-----
>1 0.797 (0.073)
>2 0.793 (0.085)
>3 0.800 (0.077)
>4 0.795 (0.079)
In this case, we can see that performance is generally worse than no transform (degree 1), except for degree 3, which gives a small lift in accuracy.
Box Plots of Degree for the Polynomial Feature Transform vs. Classification Accuracy of KNN on the Sonar Dataset
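As an aside (not part of the tutorial above), the same comparison could also be framed as a hyperparameter search, letting scikit-learn choose the degree; the sketch below assumes the same Sonar dataset and pipeline.

# sketch: tuning the polynomial degree with a grid search instead of a manual loop
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
# load and prepare the dataset as before
dataset = read_csv('sonar.csv', header=None)
data = dataset.values
X = data[:, :-1].astype('float32')
y = LabelEncoder().fit_transform(data[:, -1].astype('str'))
# pipeline whose transform step exposes the degree as a searchable parameter
pipeline = Pipeline(steps=[('t', PolynomialFeatures()), ('m', KNeighborsClassifier())])
grid = {'t__degree': [1, 2, 3, 4]}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=cv, n_jobs=-1)
result = search.fit(X, y)
print('Best: %s, Accuracy: %.3f' % (result.best_params_, result.best_score_))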