26/09/2021

Data Transform - Part 3 - How to Encode Categorical Data

Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical variables, you must encode them to numbers before you can fit and evaluate a model. The two most popular techniques are ordinal encoding and one hot encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data. After completing this tutorial, you will know:
  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one hot encoding for categorical variables that do not have a natural rank ordering.
This tutorial is divided into six parts; they are:
  • Nominal and Ordinal Variables
  • Encoding Categorical Data
  • Breast Cancer Dataset
  • OrdinalEncoder Transform
  • OneHotEncoder Transform
  • Common Questions

A. Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values. 

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal. Some examples include:
  • A pet variable with the values: dog and cat.
  • A color variable with the values: red, green, and blue.
  • A place variable with the values: first, second, and third.
Nominal Variable. A variable that comprises a finite set of discrete values with no rank-order relationship between values. The pet and color variables above are nominal.

Ordinal Variable. A variable that comprises a finite set of discrete values with a ranked ordering between values. The place variable above is ordinal, since first, second, and third have a natural ordering.

Some algorithms can work with categorical data directly. For example, a decision tree can be learned directly from categorical data with no data transform required.

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric. 

In general, this is mostly a constraint of efficient implementations of machine learning algorithms rather than a hard limitation on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.


B. Encoding Categorical Data

There are three common approaches for converting categorical variables to numerical values. They are:
  • Ordinal Encoding
  • One Hot Encoding
  • Dummy Variable Encoding

1. Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value. For example, red is 1, green is 2, and blue is 3. This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

This is a natural encoding for ordinal variables. For nominal variables, it imposes an ordinal relationship where no such relationship may exist; this can cause problems, and a one hot encoding may be used instead. The ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, the encoder assigns integers to the sorted unique labels (alphabetical order for string labels), which is why blue maps to 0, green to 1, and red to 2 in the example below. If a specific order is desired, it can be specified via the categories argument as a list with the rank order of all expected labels; a sketch of this usage follows the example below.

# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

-----Result-----

[['red']
 ['green']
 ['blue']]
[[2.]
[1.]
[0.]]
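
To impose an explicit order instead, pass the categories argument. The following is a minimal sketch of this usage, which also demonstrates that the encoding is easily reversible via inverse_transform:

# example of an ordinal encoding with an explicit category order
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
# define ordinal encoding with a fixed rank order: red=0, green=1, blue=2
encoder = OrdinalEncoder(categories=[['red', 'green', 'blue']])
# transform data
result = encoder.fit_transform(data)
print(result)
# reverse the encoding back to the original labels
print(encoder.inverse_transform(result))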


This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix. If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.
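
As a minimal sketch (using a small sample of the class labels from the breast cancer dataset used later in this tutorial), the LabelEncoder is used in the same way, but on a one-dimensional array:

# example of encoding a target variable with LabelEncoder
from sklearn.preprocessing import LabelEncoder
# define a small sample of target labels
y = ['no-recurrence-events', 'recurrence-events', 'no-recurrence-events']
# encode the labels as integers
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y)
print(y_enc)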

2. One Hot Encoding

For categorical variables where no ordinal relationship exists, an integer encoding may not be enough and may even mislead the model.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results.

In this case, a one hot encoding can be applied: the variable is replaced with one new binary variable for each unique category.

This one hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

# example of a one hot encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding (note: scikit-learn >= 1.2 renames sparse to sparse_output)
encoder = OneHotEncoder(sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

-----Result-----

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]]


If new data contains categories not seen in the training dataset, the handle_unknown argument can be set to 'ignore' so that no error is raised; an unseen category is then encoded as all zeros.
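
A minimal sketch of this behavior, using a hypothetical unseen category ('yellow'):

# example of one hot encoding an unseen category
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# fit the encoder on the known categories
train = asarray([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
encoder.fit(train)
# 'yellow' was not seen during fit, so it is encoded as all zeros
print(encoder.transform(asarray([['yellow']])))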

3. Dummy Variable Encoding

The one hot encoding creates one binary variable for each category. The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents blue and [0, 1, 0] represents green, we don't need another binary variable to represent red; instead, we could use zero values for both, e.g. [0, 0]. This is called a dummy variable encoding, and it always represents C categories with C - 1 binary variables.

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot encoding. The drop argument can be set to indicate which category will become the one that is assigned all zero values, called the baseline. We can set this to ‘first’ so that the first category is used. When the labels are sorted alphabetically, the blue label will be the first and will become the baseline.

# example of a dummy variable encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define dummy variable encoding (note: scikit-learn >= 1.2 renames sparse to sparse_output)
encoder = OneHotEncoder(drop='first', sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

-----Result-----

[['red']
 ['green']
 ['blue']]
[[0. 1.]
[1. 0.]
[0. 0.]]



C. Breast Cancer Dataset

We will use the Breast Cancer dataset in this tutorial. This dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and 9 input variables, all of which are categorical, which makes the dataset a good basis for exploring these encodings.

# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)

-----Result-----

Input (286, 9)
Output (286,)



D. OrdinalEncoder Transform

In this example, we ordinal encode the input variables, label encode the target variable, and then fit and evaluate a logistic regression model on the encoded dataset.

# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# summarize the transformed data (transform the full dataset for display)
X_enc = ordinal_encoder.transform(X)
y_enc = label_encoder.transform(y)
print('Input', X_enc.shape)
print(X_enc[:5, :])
print('Output', y_enc.shape)
print(y_enc[:5])
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

-----Result-----

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
[3. 0. 2. 0. 0. 0. 1. 0. 0.]
[3. 0. 6. 0. 0. 1. 0. 1. 0.]
[2. 2. 6. 0. 1. 2. 1. 1. 1.]
[2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]


Accuracy: 75.79


E. OneHotEncoder Transform

In this example, we repeat the evaluation with a one hot encoding of the input variables, again label encoding the target variable and fitting a logistic regression model.

# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# load the dataset
dataset = read_csv('breast-cancer.csv', header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables (note: scikit-learn >= 1.2 renames sparse to sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# summarize the transformed data (transform the full dataset for display)
X_enc = onehot_encoder.transform(X)
print('Input', X_enc.shape)
print(X_enc[:5, :])
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

-----Result-----

Input (286, 43)
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]]


Accuracy: 70.53

In this case, the model achieved a classification accuracy of about 70.53 percent, which is worse than the ordinal encoding in the previous section.


F. Common Questions

1. What if I have a mixture of numeric and categorical data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model. Alternatively, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables, as in the sketch below.
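
A minimal sketch of the ColumnTransformer approach, using a hypothetical toy frame with one categorical and one numeric column:

# example of applying different transforms to different columns
from pandas import DataFrame
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# hypothetical data: 'color' is categorical, 'size' is numeric
df = DataFrame({'color': ['red', 'green', 'blue'], 'size': [1.0, 3.5, 2.0]})
# one hot encode the categorical column and scale the numeric column
transformer = ColumnTransformer(
    [('cat', OneHotEncoder(), ['color']), ('num', MinMaxScaler(), ['size'])],
    sparse_threshold=0.0)  # force a dense output array
print(transformer.fit_transform(df))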

2. What if I have hundreds of categories?

You can use a one hot encoding for variables that have thousands, or even tens of thousands, of categories. Having large vectors as input may sound intimidating, but models can generally handle it.

3. What encoding technique is the best?

This is unknowable in advance. Test each technique (and more) on your dataset with your chosen model and discover what works best for your case, for example with a comparison like the sketch below.
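
A minimal sketch of such a comparison on the breast cancer dataset, wrapping each encoder and a logistic regression in a pipeline so the encoder is fit within each cross-validation fold (the handle_unknown settings guard against categories that appear only in a validation fold):

# example of comparing encoding techniques with cross-validation
from numpy import mean
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
# load the dataset and separate inputs and target
data = read_csv('breast-cancer.csv', header=None).values
X = data[:, :-1].astype(str)
y = LabelEncoder().fit_transform(data[:, -1].astype(str))
# evaluate each encoding inside a pipeline
encoders = [('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))]
for name, encoder in encoders:
    pipeline = Pipeline([('encode', encoder), ('model', LogisticRegression())])
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10)
    print(name, mean(scores))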


