Machine learning: Evaluate Models - Part 4 - Visualize Residual Forecast Errors

Forecast errors on time series regression problems are called residuals or residual errors. Careful exploration of residual errors on your time series prediction problem can tell you a lot about your forecast model and even suggest improvements. In this tutorial, you will discover how to visualize residual errors from time series forecasts. After completing this tutorial, you will know:

How to create and review line plots of residual errors over time.
How to review summary statistics and plots of the distribution of residual errors.
How to explore the correlation structure of residual errors.

A. Residual Forecast Errors

Forecast errors on a time series forecasting problem are called residual errors or residuals. A residual error is calculated as the expected outcome minus the forecast, for example:

residual error = expected − forecast

Or,

e = y − yhat
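As a minimal sketch of this formula, the snippet below computes residuals for a handful of values. The numbers and the names `expected` and `forecast` are invented for illustration and are not taken from the dataset used later.

```python
# Toy values chosen for illustration only; they are not from the births dataset.
expected = [35, 32, 30, 31, 44]
forecast = [33, 35, 30, 28, 40]

# residual error = expected - forecast, computed element by element
residuals = [y - yhat for y, yhat in zip(expected, forecast)]
print(residuals)
```

A positive residual means the forecast was too low, and a negative residual means it was too high.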

B. Daily Female Births Dataset

We will use the Daily Female Births Dataset as an example. This dataset describes the number of daily female births in California in 1959.

C. Persistence Forecast Model

The simplest forecast we can make is to predict that the value at the next time step will be the same as the value at the previous time step. This is called the naive forecast or the persistence forecast model.

# calculate residuals from a persistence forecast
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets (row 0 is skipped because its lagged value is NaN)
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
print(residuals.head())

-----Result-----

0    9.0
1  -10.0
2    3.0
3   -6.0
4   30.0

D. Residual Line Plot

The first plot to review is a line plot of the residual forecast errors over time. Ideally, the errors fluctuate randomly around zero and show no trend or cyclic structure.

# line plot of residual errors
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
# plot residuals
residuals.plot()
pyplot.show()

-----Result-----

Line plot of the forecast residual errors for the Daily Female Births dataset

E. Residual Summary Statistics

We can calculate summary statistics on the residual errors. Primarily, we are interested in the mean value of the residual errors. A value close to zero suggests no bias in the forecasts, whereas positive and negative values suggest a positive or negative bias in the forecasts made.

# summary statistics of residual errors
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
# summary statistics
print(residuals.describe())

-----Result-----

count  125.000000
mean     0.064000
std      9.187776
min    -28.000000
25%     -6.000000
50%     -1.000000
75%      5.000000
max     30.000000

Running the example shows a mean error value close to zero, but perhaps not close enough. It suggests that there may be some bias and that we may be able to further improve the model by performing a bias correction.
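One simple form of bias correction is to add the mean residual, estimated on the training data, to every forecast. The sketch below is an assumption about how this could be done, not a method prescribed by this tutorial. The series and the names `train_pairs`, `test_pairs` and `bias` are hypothetical.

```python
from statistics import mean

# Hypothetical series with an upward drift, invented for illustration.
series = [30, 34, 33, 37, 40, 41, 45, 44, 48, 50]

# Lagged pairs (value at t, value at t+1), as in the persistence model.
pairs = list(zip(series[:-1], series[1:]))
train_pairs, test_pairs = pairs[:5], pairs[5:]

# Estimate the bias as the mean residual (expected - forecast) on the training pairs.
bias = mean(actual - previous for previous, actual in train_pairs)

# Persistence forecasts on the test pairs, with and without the correction.
raw_residuals = [actual - previous for previous, actual in test_pairs]
corrected_residuals = [actual - (previous + bias) for previous, actual in test_pairs]

print("estimated bias:", bias)
print("mean residual before correction:", mean(raw_residuals))
print("mean residual after correction:", mean(corrected_residuals))
```

Because the residual is defined as expected minus forecast, a positive mean residual means the forecasts are too low on average, which is why the bias is added rather than subtracted. Estimating the bias on the training data only keeps the test set untouched for evaluation.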

F. Residual Histogram and Density Plots

Plots can be used to better understand the distribution of errors beyond summary statistics. We would expect the forecast errors to be normally distributed around a zero mean. Plots can help discover skews in this distribution. We can use both histograms and density plots to better understand the distribution of residual errors.

# histogram and density plots of residual errors
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
# histogram plot
residuals.hist()
pyplot.show()
# density plot
residuals.plot(kind='kde')
pyplot.show()

-----Result-----

Histogram plot of the forecast residual errors for the Daily Female Births dataset

Density plot of the forecast residual errors for the Daily Female Births dataset
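A skew that is visible in the histogram can also be summarized as a single number. The sketch below computes the moment-based sample skewness by hand. The function name `sample_skewness` and both lists of values are hypothetical, and library functions may apply a different small-sample correction, so results need not match them exactly.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical residuals, invented for illustration.
symmetric = [-3, -1, 0, 1, 3]
right_skewed = [-2, -1, 0, 1, 12]

print("symmetric:", sample_skewness(symmetric))
print("right skewed:", sample_skewness(right_skewed))
```

A value near zero is consistent with a symmetric distribution, while a clearly positive or negative value points to a longer right or left tail.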

G. Residual Q-Q Plot

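A Q-Q plot, or quantile plot, compares two distributions and can be used to see how similar or different they happen to be. We can create a Q-Q plot using the qqplot() function in the Statsmodels library. By default, qqplot() compares the sample against a normal distribution, and the line='r' argument adds a fitted regression line for reference.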

# qq plot of residual errors
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
import numpy
from statsmodels.graphics.gofplots import qqplot
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = numpy.array(residuals)
# Q-Q plot of the residuals with a fitted regression line
qqplot(residuals, line='r')
pyplot.show()

-----Result-----

Q-Q plot of the forecast residual errors for the Daily Female Births dataset
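To make clear what a Q-Q plot draws, the sketch below computes the pairs of points by hand using only the Python standard library. The residual values are invented, and the plotting positions (i - 0.5) / n are one common convention assumed here, so the numbers need not match what a plotting library produces.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals, invented for illustration.
residuals = [9.0, -10.0, 3.0, -6.0, 30.0, -2.0, 1.0, -4.0, 5.0, -7.0]

n = len(residuals)

# Sample quantiles are simply the sorted observations.
sample_quantiles = sorted(residuals)

# Plotting positions: an assumed convention, one probability per observation.
probabilities = [(i - 0.5) / n for i in range(1, n + 1)]

# Matching quantiles of a normal distribution with the sample mean and standard deviation.
reference = NormalDist(mean(residuals), stdev(residuals))
theoretical_quantiles = [reference.inv_cdf(p) for p in probabilities]

# Each (theoretical, sample) pair is one point on the Q-Q plot.
for theoretical, sample in zip(theoretical_quantiles, sample_quantiles):
    print(f"{theoretical:8.2f} {sample:8.2f}")
```

If the residuals were normally distributed, the pairs would fall close to a straight line. Points that bend away from the line at either end indicate heavier or lighter tails than the normal distribution.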

H. Residual Autocorrelation Plot

Autocorrelation calculates the strength of the relationship between an observation and observations at prior time steps. We can calculate the autocorrelation of the residual error time series and plot the results. This is called an autocorrelation plot.

# autocorrelation plot of residual errors
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot
# read_csv no longer accepts squeeze=True in pandas 2.x; squeeze the single column into a Series instead
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
# persistence model
predictions = [x for x in test_X]
# calculate residuals
residuals = [test_y[i]-predictions[i] for i in range(len(predictions))]
residuals = DataFrame(residuals)
# autocorrelation plot
autocorrelation_plot(residuals)
pyplot.show()

-----Result-----

ACF plot of the forecast residual errors for the Daily Female Births dataset
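Each point on an autocorrelation plot is a correlation between the series and a lagged copy of itself. The sketch below computes that value by hand for a few lags, using a common sample estimate. The function name `autocorrelation` and the residual values are hypothetical.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (lag 0 is always 1.0)."""
    n = len(values)
    mu = sum(values) / n
    # Total variation around the mean, used to normalize every lag.
    denominator = sum((v - mu) ** 2 for v in values)
    # Co-variation between each value and the value `lag` steps later.
    numerator = sum((values[i] - mu) * (values[i + lag] - mu) for i in range(n - lag))
    return numerator / denominator

# Hypothetical residuals that alternate in sign, invented for illustration.
residuals = [4, -4, 4, -4, 4, -4, 4, -4]

for lag in range(4):
    print(lag, autocorrelation(residuals, lag))
```

Residuals that flip sign at every step, as in this toy series, produce a strongly negative value at lag 1. Large values at any lag suggest structure in the errors that the forecast model has not captured.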

14/11/2021
