Machine learning: Forecast Models - Part 2 - Autoregression Models for Forecasting

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

After completing this tutorial, you will know:

How to explore your time series data for autocorrelation.
How to develop an autoregression model and use it to make predictions.
How to use a developed autoregression model to make rolling predictions.

A. Autoregression

A regression model, such as linear regression, models an output value based on a linear combination of input values. For example:

yhat = b0 + (b1 * X1)

Where yhat is the prediction, b0 and b1 are coefficients found by optimizing the model on training data, and X1 is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.

For example, we can predict the value for the next time step (t+1) given the observations at the current (t) and previous (t-1) time steps.

As a regression model, this would look as follows:

X(t + 1) = b0 + (b1 * X(t)) + (b2 * X(t − 1))

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).
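
As a minimal sketch of the equation above, the forecast is the intercept plus each coefficient multiplied by the matching lagged observation. The function name ar_predict and the coefficient values below are hypothetical, chosen only for illustration; they are not fitted to any dataset.

```python
def ar_predict(history, intercept, lag_coefs):
    """One-step forecast: intercept + b1*X(t) + b2*X(t-1) + ..."""
    if len(history) < len(lag_coefs):
        raise ValueError("need at least as many observations as lag coefficients")
    yhat = intercept
    for k, coef in enumerate(lag_coefs):
        # k = 0 pairs with the most recent observation X(t), k = 1 with X(t-1), ...
        yhat += coef * history[-(k + 1)]
    return yhat

# hypothetical coefficients b0, b1, b2 (illustration only)
b0, b1, b2 = 1.0, 0.6, 0.3
observations = [10.0, 12.0, 11.0]  # X(t-2), X(t-1), X(t)
forecast = ar_predict(observations, b0, [b1, b2])
print('X(t+1) forecast: %.3f' % forecast)
```

Here the forecast works out to 1.0 + 0.6 * 11.0 + 0.3 * 12.0, so only the two most recent observations contribute.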

B. Autocorrelation

An autoregression model makes an assumption that the observations at current and previous time steps are useful to predict the value at the next time step. This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight the autoregression model can put on that variable when modeling.

Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation.
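
To make the idea concrete, the sketch below computes the Pearson correlation between a series and a copy of itself shifted by one time step, in plain Python. The helper names pearson and lag_correlation and the two toy series are assumptions made for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def lag_correlation(series, lag=1):
    # pair each observation with the one `lag` steps before it
    return pearson(series[:-lag], series[lag:])

rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]            # moves in the same direction as its lag
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # moves opposite to its lag
print('rising: %.3f' % lag_correlation(rising))
print('alternating: %.3f' % lag_correlation(alternating))
```

The steadily rising series is positively correlated with its own previous value, while the alternating series is negatively correlated with it.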

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not. Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.
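
One possible way to use these statistics for lag selection is to compute the correlation at several lags and keep only the lags whose absolute correlation exceeds a cutoff. The sketch below does this in plain Python; the helper names, the toy series that repeats every four steps, and the 0.5 cutoff are all illustrative assumptions rather than recommended settings.

```python
from math import sqrt

def lag_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    a, b = series[:-lag], series[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def select_lags(series, max_lag, cutoff):
    """Return the lags whose absolute correlation is above the cutoff."""
    return [lag for lag in range(1, max_lag + 1)
            if abs(lag_correlation(series, lag)) > cutoff]

# toy series that repeats every four steps
series = [0.0, 1.0, 0.0, -1.0] * 5
selected = select_lags(series, max_lag=4, cutoff=0.5)
print('selected lags:', selected)
```

If the returned list were empty, that would be the situation described above, where no lag carries much information about the next value.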

C. Minimum Daily Temperatures Dataset

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

D. Quick Check for Autocorrelation

We can plot the observation at the current time step (t) with the observation at the previous time step (t-1) as a scatter plot.

Pandas provides a built-in plot to do exactly this, called the lag_plot() function.

# lag plot of time series
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import lag_plot
# load the single data column as a Series
# (newer pandas releases no longer accept squeeze=True in read_csv)
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
lag_plot(series)
pyplot.show()

-----Result-----

Lag plot of the Minimum Daily Temperatures dataset

Correlation can be calculated easily using the corr() function on the DataFrame of the lagged dataset.

# correlation of lag=1
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# pair each observation (t+1) with the previous one (t)
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# Pearson correlation matrix of the two columns
result = dataframe.corr()
print(result)

-----Result-----

           t      t+1
t    1.00000  0.77487
t+1  0.77487  1.00000

It shows a strong positive correlation (0.77) between the observation and the lag=1 value.

E. Autocorrelation Plots

Pandas provides a built-in plot called the autocorrelation_plot() function.

# autocorrelation plot of time series
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
autocorrelation_plot(series)
pyplot.show()

-----Result-----

Autocorrelation plot of the Minimum Daily Temperatures dataset with pandas

The Statsmodels library also provides a version of the plot in the plot_acf() function as a line plot.

# autocorrelation plot of time series
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# show the autocorrelation for lags 0 to 31
plot_acf(series, lags=31)
pyplot.show()

-----Result-----

Autocorrelation plot of the Minimum Daily Temperatures dataset with Statsmodels

F. Persistence Model

Before fitting an autoregression model, it helps to have a baseline to compare against. The persistence model simply uses the observation at the previous time step (t) as the forecast for the next time step (t+1). Here it is evaluated on the last 7 days of the dataset.

# evaluate a persistence model
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets (skip row 0, whose lag value is NaN)
X = dataframe.values
train, test = X[1:len(X)-7], X[len(X)-7:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model: the forecast is the previous observation
def model_persistence(x):
    return x
# walk-forward validation
predictions = list()
for x in test_X:
    yhat = model_persistence(x)
    predictions.append(yhat)
rmse = sqrt(mean_squared_error(test_y, predictions))
print('Test RMSE: %.3f' % rmse)
# plot predictions vs expected
pyplot.plot(test_y)
pyplot.plot(predictions, color='red')
pyplot.show()

-----Result-----

Test RMSE: 1.850

Line plot of the persistence forecast (red) on the Minimum Daily Temperatures dataset (blue)

G. Autoregression Model

An autoregression model is a linear regression model that uses lagged variables as input variables. We could calculate the linear regression model manually using the LinearRegression class in scikit-learn and manually specify the lag input variables to use.

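Alternatively, the Statsmodels library provides an autoregression model that automatically selects an appropriate lag value using statistical tests and trains a linear regression model. It is provided in the AR class. Newer Statsmodels releases deprecate AR in favor of the AutoReg class, which expects the lags to be specified explicitly.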
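
To illustrate what fitting such a model by hand involves, the sketch below builds the lag input for a single lag and solves the least-squares problem in closed form using plain Python. It is a simplified stand-in, not the scikit-learn or Statsmodels implementation; the function name fit_ar1 and the synthetic series are assumptions for illustration.

```python
def fit_ar1(series):
    """Least-squares fit of X(t+1) = b0 + b1 * X(t); returns (b0, b1)."""
    x = series[:-1]   # lag input: observation at time t
    y = series[1:]    # target: observation at time t+1
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# synthetic, noise-free series generated by X(t+1) = 2.0 + 0.5 * X(t)
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

b0, b1 = fit_ar1(series)
print('b0=%.3f, b1=%.3f' % (b0, b1))
```

Because the synthetic series contains no noise, the fit recovers the generating coefficients; with more lags the same idea applies, but the closed form is replaced by a general linear regression solver.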

# create and evaluate a static autoregressive model
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# split dataset: the last 7 days form the test set
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]
# train autoregression
model = AR(train)
model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)
# make predictions for the 7 test days
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

-----Result-----

Lag: 29
Coefficients: [ 5.57543506e-01 5.88595221e-01 -9.08257090e-02 4.82615092e-02
 4.00650265e-02 3.93020055e-02 2.59463738e-02 4.46675960e-02
 1.27681498e-02 3.74362239e-02 -8.11700276e-04 4.79081949e-03
 1.84731397e-02 2.68908418e-02 5.75906178e-04 2.48096415e-02
 7.40316579e-03 9.91622149e-03 3.41599123e-02 -9.11961877e-03
 2.42127561e-02 1.87870751e-02 1.21841870e-02 -1.85534575e-02
 -1.77162867e-03 1.67319894e-02 1.97615668e-02 9.83245087e-03
 6.22710723e-03 -1.37732255e-03]
predicted=11.871275, expected=12.900000
predicted=13.053794, expected=14.600000
predicted=13.532591, expected=14.000000
predicted=13.243126, expected=13.600000
predicted=13.091438, expected=13.500000
predicted=13.146989, expected=15.700000
predicted=13.176153, expected=13.000000
Test RMSE: 1.225

Line plot of the AR model forecast (red) on the Minimum Daily Temperatures dataset (blue)
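
The reported score can be checked by hand from the seven predicted and expected values printed above. The sketch below recomputes the root mean squared error in plain Python; the function name rmse is an illustrative choice, and the numbers are copied from the output.

```python
from math import sqrt

def rmse(expected, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# values copied from the printed results of the AR model
predicted = [11.871275, 13.053794, 13.532591, 13.243126,
             13.091438, 13.146989, 13.176153]
expected = [12.9, 14.6, 14.0, 13.6, 13.5, 15.7, 13.0]

score = rmse(expected, predicted)
print('Test RMSE: %.3f' % score)
```

The largest single contribution comes from the sixth day, where the forecast of about 13.1 misses the observed 15.7.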

An alternative would be to use the learned coefficients and manually make predictions. This requires that the history of 29 prior observations be kept and that the coefficients be retrieved from the model and used in the regression equation to come up with new forecasts.

The coefficients are provided in an array with the intercept term followed by the coefficients for each lag variable starting at t to t-n. We simply need to use them in the right order on the history of observations, as follows:

# create and evaluate an updated autoregressive model
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AR
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# split dataset: the last 7 days form the test set
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]
# train autoregression
model = AR(train)
model_fit = model.fit()
window = model_fit.k_ar
coef = model_fit.params
# walk forward over time steps in test
history = train[len(train)-window:]
history = [history[i] for i in range(len(history))]
predictions = list()
for t in range(len(test)):
    length = len(history)
    # the most recent `window` observations, oldest first
    lag = [history[i] for i in range(length-window,length)]
    # intercept plus each coefficient times its lag, newest observation first
    yhat = coef[0]
    for d in range(window):
        yhat += coef[d+1] * lag[window-d-1]
    obs = test[t]
    predictions.append(yhat)
    # the real observation becomes available for the next forecast
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

-----Result-----

predicted=11.871275, expected=12.900000
predicted=13.659297, expected=14.600000
predicted=14.349246, expected=14.000000
predicted=13.427454, expected=13.600000
predicted=13.374877, expected=13.500000
predicted=13.479991, expected=15.700000
predicted=14.765146, expected=13.000000
Test RMSE: 1.204

Line plot of the manual predictions with the AR model (red) on the Minimum Daily Temperatures dataset (blue)

We can see a small improvement in the forecast: the RMSE drops from 1.225 for the static model to 1.204 for the manual walk-forward predictions.
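
The two runs share their first forecast and diverge afterwards, which is consistent with the first run feeding its own forecasts back in as inputs while the manual loop uses each real observation as it arrives. The sketch below contrasts the two strategies on a one-lag model in plain Python; the function names, coefficients, and observations are hypothetical values chosen for illustration.

```python
def forecast_recursive(last_obs, b0, b1, steps):
    """Multi-step forecast that feeds each prediction back in as the next input."""
    predictions = []
    current = last_obs
    for _ in range(steps):
        current = b0 + b1 * current
        predictions.append(current)
    return predictions

def forecast_walk_forward(last_obs, b0, b1, actuals):
    """One-step forecasts that use the real observation once it is known."""
    predictions = []
    current = last_obs
    for obs in actuals:
        predictions.append(b0 + b1 * current)
        current = obs  # replace the forecast with what was actually observed
    return predictions

# hypothetical one-lag model and test observations
b0, b1 = 2.0, 0.5
last_train_obs = 10.0
test_obs = [8.0, 6.0, 9.0]

recursive = forecast_recursive(last_train_obs, b0, b1, steps=len(test_obs))
rolling = forecast_walk_forward(last_train_obs, b0, b1, test_obs)
print('recursive:', recursive)
print('walk-forward:', rolling)
```

Both lists start with the same value because the first forecast depends only on training data; from the second step on, the walk-forward forecasts follow the observed values instead of the model's own output.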

16/11/2021
