Machine learning: Data Preparation - Part 2

Time series data must be re-framed as a supervised learning dataset before we can apply machine learning algorithms.

There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.

In this tutorial, you will discover how to perform feature engineering on time series data with Python to model your time series problem with machine learning algorithms.

After completing this tutorial, you will know:

The rationale and goals of feature engineering time series data.
How to develop basic date-time based input features.
How to develop more sophisticated lag and sliding window summary statistics features.

A. Feature Engineering for Time Series

A time series dataset must be transformed before it can be modeled as a supervised learning problem. That is, we go from something that looks like:

time 1, value 1

time 2, value 2

time 3, value 3

To something that looks like:

input 1, output 1

input 2, output 2

input 3, output 3

We only want input features that best help the learning methods model the relationship between the inputs (X) and the outputs (y) that we would like to predict.
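As a minimal sketch of this re-framing, the snippet below pairs each observation with the one that follows it, so that every value becomes the input for the next value. The helper name `to_supervised` is hypothetical and used only for illustration; the sample values are the first five temperatures printed later in this tutorial.

```python
# Minimal sketch: re-frame a univariate series as (input, output) pairs.
# `to_supervised` is a hypothetical helper name, not part of any library.

def to_supervised(values):
    """Pair each observation with the observation that follows it."""
    return [(values[i], values[i + 1]) for i in range(len(values) - 1)]


# First five observations of the temperature series used in this tutorial.
values = [20.7, 17.9, 18.8, 14.6, 15.8]
pairs = to_supervised(values)

for x, y in pairs:
    print(f"input={x}, output={y}")
```

Note that a series of five observations yields only four pairs, because the last value has no following value to predict.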

In this tutorial, we will look at three classes of features that we can create from our time series dataset:

Date Time Features: these are components of the time step itself for each observation.
Lag Features: these are values at prior time steps.
Window Features: these are a summary of values over a fixed window of prior time steps.

B. Goal of Feature Engineering

The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm to model.

C. Minimum Daily Temperatures Dataset

In this lesson, we will use the Minimum Daily Temperatures dataset as an example. This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

D. Date Time Features

The supervised learning problem we are proposing is to predict the daily minimum temperature given the month and day, as follows:

Month, Day, Temperature

# create date time features of a dataset
from pandas import read_csv
from pandas import DataFrame
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
dataframe = DataFrame()
dataframe['month'] = [series.index[i].month for i in range(len(series))]
dataframe['day'] = [series.index[i].day for i in range(len(series))]
# use positional access (iloc) because the series is indexed by date
dataframe['temperature'] = [series.iloc[i] for i in range(len(series))]
print(dataframe.head(5))

   month  day  temperature
0      1    1         20.7
1      1    2         17.9
2      1    3         18.8
3      1    4         14.6
4      1    5         15.8

Using only the month and day to predict temperature is unsophisticated and will likely result in a poor model. Nevertheless, this information, coupled with additional engineered features, may ultimately result in a better model.

You may enumerate all the properties of a time-stamp and consider what might be useful for your problem, such as:

Minutes elapsed for the day.
Hour of day.
Business hours or not.
Weekend or not.
Season of the year.
Business quarter of the year.
Daylight saving time or not.
Public holiday or not.
Leap year or not.
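Several of the properties listed above can be read directly from a pandas DatetimeIndex. The sketch below is only illustrative: the four timestamps are made up (the temperature dataset itself is daily and has no time-of-day component), and the column names are assumptions chosen for readability.

```python
# Sketch: derive a few extra date-time features from a DatetimeIndex.
# The timestamps are invented for illustration only.
import pandas as pd

index = pd.to_datetime([
    "1981-01-03 08:00",  # a Saturday
    "1981-01-05 20:30",  # a Monday
    "1984-02-29 00:15",  # leap day
    "1990-12-31 23:59",  # last quarter of the year
])

features = pd.DataFrame(index=index)
features["hour"] = index.hour
features["minutes_elapsed"] = index.hour * 60 + index.minute
features["is_weekend"] = index.dayofweek >= 5  # Monday=0 ... Sunday=6
features["quarter"] = index.quarter
features["is_leap_year"] = index.is_leap_year

print(features)
```

Features such as public holidays or business hours depend on a calendar and locale, so they would need an external lookup rather than a property of the timestamp.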

E. Lag Features

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the current time (t).

The supervised learning problem with shifted values looks as follows:

Value(t), Value(t+1)

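The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Shifting the series forward by one time step produces a column of prior values, with a NaN in the first row because no earlier observation exists: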

Shifted, Original

NaN, 20.7

20.7, 17.9

17.9, 18.8

We can concatenate the shifted columns together into a new DataFrame using the concat() function along the column axis (axis=1).

# create a lag feature
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = DataFrame(series.values)
# shift(1) moves every value down one row, so row i holds the value from row i-1
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
print(dataframe.head(5))

      t   t+1
0   NaN  20.7
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8

We are not limited to a single lag. For example, below is the above case modified to include the last 3 observed values to predict the value at the next time step.

# create lag features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = DataFrame(series.values)
# the largest shift gives the oldest value (t-2); the unshifted column is the target (t+1)
dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe.head(5))

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8
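The lag construction above generalizes to any number of lags. The following is a minimal sketch under a few assumptions: the helper name `make_lag_frame` is hypothetical, the input is a plain list of values rather than the CSV file, and rows containing NaN are dropped because most learning algorithms cannot use them.

```python
# Sketch: build a frame with an arbitrary number of lag features.
# `make_lag_frame` is a hypothetical helper, not a pandas function.
import pandas as pd


def make_lag_frame(values, n_lags):
    """Return columns t-(n_lags-1) ... t plus the target column t+1."""
    s = pd.Series(values, dtype=float)
    columns = {}
    # The largest shift produces the oldest observation, so iterate downwards.
    for lag in range(n_lags, 0, -1):
        name = "t" if lag == 1 else f"t-{lag - 1}"
        columns[name] = s.shift(lag)
    columns["t+1"] = s
    # Rows at the start lack a full history and contain NaN; drop them.
    return pd.DataFrame(columns).dropna().reset_index(drop=True)


frame = make_lag_frame([20.7, 17.9, 18.8, 14.6, 15.8], n_lags=3)
print(frame)
```

With five observations and three lags, only two complete rows remain, which illustrates the trade-off: each additional lag costs one row of training data.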

F. Rolling Window Statistics

We can calculate the mean of the current and previous values and use that to predict the next value.

mean(t-1, t), t+1

mean(20.7, 17.9), 18.8

19.3, 18.8

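Pandas provides a rolling() function that creates a new data structure with the window of values at each time step. We first shift the series by one time step so that the window never includes the value being predicted.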

# create a rolling mean feature
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = DataFrame(series.values)
# shift first so each window holds only values observed before the target
shifted = temps.shift(1)
window = shifted.rolling(window=2)
means = window.mean()
dataframe = concat([means, temps], axis=1)
dataframe.columns = ['mean(t-1,t)', 't+1']
print(dataframe.head(5))

   mean(t-1,t)   t+1
0          NaN  20.7
1          NaN  17.9
2        19.30  18.8
3        18.35  14.6
4        16.70  15.8

Running the example prints the first 5 rows of the new dataset. We can see that the first two rows are not useful.

The first NaN was created by the shift of the series.
The second arises because a window that contains a NaN cannot be used to calculate a mean value.
Finally, the third row shows the expected value of 19.30 (the mean of 20.7 and 17.9) used to predict the 3rd value in the series of 18.8.
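Because the leading rows carry no usable input, they are typically removed before fitting a model. The sketch below shows one way to do that with dropna() and then separate inputs from the output. It assumes the first five temperatures as an in-memory series, and the names `usable`, `X`, and `y` are illustrative choices.

```python
# Sketch: drop rows whose rolling-mean input is NaN, then split into X and y.
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

frame = pd.DataFrame({
    # Shift by one step, then average the two most recent prior values.
    "mean(t-1,t)": temps.shift(1).rolling(window=2).mean(),
    "t+1": temps,
})

usable = frame.dropna()       # removes the first two rows
X = usable[["mean(t-1,t)"]]   # input features (2-D)
y = usable["t+1"]             # output to predict (1-D)

print(usable)
```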

There are more statistics we can calculate, and even different ways of defining the window itself. Below is another example that uses a window width of 3 and a dataset composed of more summary statistics, specifically the minimum, mean, and maximum value in the window.

# create rolling statistics features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = DataFrame(series.values)
width = 3
# shift by one step so the window ends at the observation just before the target;
# shifting by width - 1 would leave a gap between the window and the target
shifted = temps.shift(1)
window = shifted.rolling(window=width)
dataframe = concat([window.min(), window.mean(), window.max(), temps], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))

    min       mean   max   t+1
0   NaN        NaN   NaN  20.7
1   NaN        NaN   NaN  17.9
2   NaN        NaN   NaN  18.8
3  17.9  19.133333  20.7  14.6
4  14.6  17.100000  18.8  15.8
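Other summary statistics can be produced from a rolling window in the same way. The sketch below is illustrative only: it assumes the window should cover the three observations immediately preceding the target (a shift of one step), uses the first five temperatures as an in-memory series, and picks the median, standard deviation, and range as example statistics.

```python
# Sketch: additional rolling-window statistics over the three prior values.
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Shift by one step so the target value is never inside its own window.
window = temps.shift(1).rolling(window=3)

stats = pd.DataFrame({
    "median": window.median(),
    "std": window.std(),                   # sample standard deviation
    "range": window.max() - window.min(),  # spread of the window
    "t+1": temps,
})

print(stats)
```

Whether any of these statistics helps depends on the problem, so it is worth evaluating each candidate feature against a held-out portion of the series.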

G. Expanding Window Statistics

Another type of window that may be useful includes all previous data in the series. This is called an expanding window and can help with keeping track of the bounds of observable data.

Like the rolling() function on DataFrame, Pandas provides an expanding() function that collects sets of all prior values for each time step.

For example, below are the lists of numbers in the expanding window for the first 5 time steps of the series:

#, Window Values

1, 20.7

2, 20.7, 17.9

3, 20.7, 17.9, 18.8

4, 20.7, 17.9, 18.8, 14.6

5, 20.7, 17.9, 18.8, 14.6, 15.8

Again, you can see that we must shift the series one time step to ensure that the output value we wish to predict is excluded from these window values. Therefore the input windows look as follows:

#, Window Values

1, NaN

2, NaN, 20.7

3, NaN, 20.7, 17.9

4, NaN, 20.7, 17.9, 18.8

5, NaN, 20.7, 17.9, 18.8, 14.6

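Thankfully, the statistical calculations exclude the NaN values in the expanding window. The example below achieves the same alignment in a slightly different way: it leaves the inputs unshifted and instead shifts the output column back by one step with shift(-1), so each row's window still contains only values observed before the value being predicted.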

# create expanding window features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv's squeeze argument was removed in pandas 2.0, so squeeze the single column explicitly
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
temps = DataFrame(series.values)
window = temps.expanding()
# shift(-1) pulls the next value up one row so it becomes the target for the current window
dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))

    min       mean   max   t+1
0  20.7  20.700000  20.7  17.9
1  17.9  19.300000  20.7  18.8
2  17.9  19.133333  20.7  14.6
3  14.6  18.000000  20.7  15.8
4  14.6  17.560000  20.7  15.8
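For comparison, the alternative described in the text, shifting the inputs forward rather than the output backward, can be sketched as follows. This is an illustration on the first five temperatures held in memory, not a replacement for the listing above. It relies on the expanding window ignoring NaN values, so only the very first row, which has no prior observations at all, is left empty.

```python
# Sketch: expanding-window features built from the shifted input series.
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Shift the inputs by one step; the leading NaN is skipped by the statistics.
window = temps.shift(1).expanding()

frame = pd.DataFrame({
    "min": window.min(),
    "mean": window.mean(),
    "max": window.max(),
    "t+1": temps,  # the target stays unshifted in this variant
})

print(frame)
```

Both variants pair the same windows with the same targets; they differ only in which end of the frame holds the incomplete row.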

04/11/2021

Data Preparation - Part 2 - Basic Feature Engineering
