Time series data is different from more traditional classification and regression predictive modeling problems. The temporal structure imposes an order on the observations. For example, when modeling, there is an assumption that the summary statistics of the observations are consistent over time. In time series terminology, we refer to this expectation as the time series being stationary.
These assumptions can be easily violated in time series by the addition of a trend, seasonality, and other time-dependent structures.
In this tutorial, you will discover how to check if your time series is stationary with Python. After completing this tutorial, you will know:
- How to identify obvious stationary and non-stationary time series using line plots.
- How to spot-check summary statistics like mean and variance for a change over time.
- How to use statistical tests with statistical significance to check if a time series is stationary.
A. Stationary Time Series
The observations in a stationary time series are not dependent on time.
Time series are stationary if they do not have trend or seasonal effects.
When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
Below is an example of the Daily Female Births dataset that is stationary.
# load time series data
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
-----Result-----
Line plot of the stationary Daily Female Births time series dataset
B. Non-Stationary Time Series
Observations from a non-stationary time series show seasonal effects, trends, and other structures that depend on the time index. Summary statistics like the mean and variance do change over time, providing a drift in the concepts a model may try to capture.
Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects. Below is an example of the Airline Passengers dataset that is non-stationary, showing both trend and seasonal components.
# load time series data
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()
-----Result-----
Line plot of the non-stationary Airline Passengers time series dataset
C. Types of Stationary Time Series
There are some finer-grained notions of stationarity that you may come across if you dive deeper into this topic:
- Stationary Process: A process that generates a stationary series of observations.
- Stationary Model: A model that describes a stationary series of observations.
- Trend Stationary: A time series that does not exhibit a trend.
- Seasonal Stationary: A time series that does not exhibit seasonality.
- Strictly Stationary: A mathematical definition of a stationary process, specifically that the joint distribution of observations is invariant to time shift (stated formally below).
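For the strictly stationary case, the invariance can be stated formally; this is the standard textbook condition rather than anything specific to this tutorial. For any set of time points and any shift \tau, the joint distribution of the observations is unchanged:

F(x_{t_1}, \ldots, x_{t_n}) = F(x_{t_1 + \tau}, \ldots, x_{t_n + \tau}) \quad \text{for all } n, \; t_1, \ldots, t_n, \; \text{and shifts } \tau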
D. Stationary Time Series and Forecasting
Should you make your time series stationary? Generally, yes. If you have clear trend and seasonality in your time series, then model these components, remove them from observations, then train models on the residuals.
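As a rough sketch of that workflow (not part of the original tutorial), a classical decomposition can supply the residuals once trend and seasonality are estimated and removed. The filename, the multiplicative model, and the period of 12 below are assumptions based on the monthly Airline Passengers data used later in this tutorial.
# sketch: estimate and remove trend and seasonality, keep the residuals for modeling
from pandas import read_csv
from statsmodels.tsa.seasonal import seasonal_decompose
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
# multiplicative model assumed because the seasonal swings grow with the level; period=12 for monthly data
result = seasonal_decompose(series, model='multiplicative', period=12)
residuals = result.resid.dropna()  # what remains after the trend and seasonal components are removed
print(residuals.head())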
E. Checks for Stationarity
There are many methods to check whether a time series (direct observations, residuals, or otherwise) is stationary or non-stationary.
- Look at Plots: You can review a time series plot of your data and visually check if there are any obvious trends or seasonality.
- Summary Statistics: You can review the summary statistics for your data across seasons or random partitions and check for obvious or significant differences.
- Statistical Tests: You can use statistical tests to check if the expectations of stationarity are met or have been violated.
F. Summary Statistics
A quick and dirty check to see if your time series is non-stationary is to review summary statistics. You can split your time series into two (or more) partitions and compare the mean and variance of each group. If they differ and the difference is statistically significant, the time series is likely non-stationary. Next, let’s try this approach on the Daily Births dataset.
1. Daily Births Dataset
Because we are looking at the mean and variance, we are assuming that the data conforms to a Gaussian (also called the bell curve or normal) distribution. We can also quickly check this by eyeballing a histogram of our observations.
# plot a histogram of a time series
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.hist()
pyplot.show()
-----Result-----
Histogram plot of the Daily Female Births dataset
We clearly see the bell curve-like shape of the Gaussian distribution, perhaps with a longer right tail.
Next, we can split the time series into two contiguous sequences. We can then calculate the mean and variance of each group of numbers and compare the values.
# calculate statistics of partitioned time series data
from pandas import read_csv
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
-----Result-----
mean1=39.763736, mean2=44.185792
variance1=49.213410, variance2=48.708651
Running this example shows that the mean and variance values are different, but in the same ballpark.
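If you want something more formal than eyeballing the two sets of numbers, one option (not shown in the original tutorial) is a two-sample t-test on the two halves; scipy is an extra dependency here, and because the observations are autocorrelated the independence assumption of the test is only loosely met, so treat it as a rough check.
# optional rough check: is the difference in means between the two halves significant?
from pandas import read_csv
from scipy.stats import ttest_ind
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
stat, p = ttest_ind(X1, X2, equal_var=False)  # Welch's t-test, does not assume equal variances
print('t=%f, p=%f' % (stat, p))  # a small p-value suggests the means differ significantly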
2. Airline Passengers Dataset
We can split our dataset and calculate the mean and variance for each group.
# calculate statistics of partitioned time series data
from pandas import read_csv
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
-----Result-----
mean1=182.902778, mean2=377.694444
variance1=2244.087770, variance2=7367.962191
Running the example, we can see that the mean and variance look very different. We have a non-stationary time series. Next, we can plot a histogram of the observations to check whether they look Gaussian.
# plot a histogram of a time series
from pandas import read_csv
from matplotlib import pyplot
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
series.hist()
pyplot.show()
-----Result-----
Histogram plot of the Airline Passengers dataset
Running the example shows that indeed the distribution of values does not look like a Gaussian, therefore the mean and variance values are less meaningful.
Reviewing the line plot of the time series again, we can see that there is an obvious seasonal component, and it looks like the seasonal component is growing. This may suggest exponential growth from season to season. A log transform can be used to flatten out exponential change back to a linear relationship. Below is the same histogram with a log transform of the time series, followed by a line plot of the transformed data.
# histogram and line plot of log transformed time series
from pandas import read_csv
from matplotlib import pyplot
from numpy import log
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
X = log(X)
pyplot.hist(X)
pyplot.show()
pyplot.plot(X)
pyplot.show()
-----Result-----
Histogram plot of the log-transformed Airline Passengers dataset
We also create a line plot of the log transformed data and can see the exponential growth seems diminished, but we still have a trend and seasonal elements.
Line plot of the log-transformed Airline Passengers dataset
We can now calculate the mean and variance of the values of the log-transformed dataset.
# calculate statistics of partitioned log transformed time series data
from pandas import read_csv
from numpy import log
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
X = log(X)
split = int(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
-----Result-----
mean1=5.175146, mean2=5.909206
variance1=0.068375, variance2=0.049264
Running the example shows that the variances are now similar, but the means still differ noticeably, so this quick check alone does not settle whether the log-transformed series is stationary.
G. Augmented Dickey-Fuller test
Statistical tests make strong assumptions about your data. They can only be used to inform the degree to which a null hypothesis can be rejected (or fail to be rejected). The result must be interpreted for a given problem to be meaningful. Nevertheless, they can provide a quick check and confirmatory evidence that your time series is stationary or non-stationary.
The Augmented Dickey-Fuller test is a type of statistical test called a unit root test. The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend.
There are a number of unit root tests and the Augmented Dickey-Fuller may be one of the more widely used. It uses an autoregressive model and optimizes an information criterion across multiple different lag values. The null hypothesis of the test is that the time series can be represented by a unit root, that it is not stationary (has some time-dependent structure). The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.
- Null Hypothesis (H0): If it fails to be rejected, it suggests the time series has a unit root, meaning it is non-stationary. It has some time-dependent structure.
- Alternate Hypothesis (H1): If the null hypothesis is rejected, it suggests the time series does not have a unit root, meaning it is stationary. It does not have time-dependent structure.
We interpret this result using the p-value from the test. A p-value below a threshold (such as 5% or 1%) suggests we reject the null hypothesis (stationary); a p-value above the threshold suggests we fail to reject the null hypothesis (non-stationary).
- p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
- p-value ≤ 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
Below is an example of calculating the Augmented Dickey-Fuller test on the Daily Female Births dataset. The Statsmodels library provides the adfuller() function that implements the test.
# calculate stationarity test of time series data
from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
series = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
-----Result-----
ADF Statistic: -4.808291
p-value: 0.000052
Critical Values:
5%: -2.870
1%: -3.449
10%: -2.571
We can see that our test statistic of -4.808 is less than the critical value of -3.449 at the 1% level. This suggests that we can reject the null hypothesis with a significance level of less than 1%. Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.
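For convenience, the decision rule above can be wrapped in a small helper function; this is just a sketch, not part of the original tutorial, and the 0.05 threshold is the conventional default rather than anything mandated by the test.
# sketch: wrap the ADF test and the p-value decision rule in a helper
from statsmodels.tsa.stattools import adfuller

def check_stationarity(values, alpha=0.05):
    # autolag='AIC' (the adfuller default) picks the lag length by minimizing the AIC
    result = adfuller(values, autolag='AIC')
    stat, p = result[0], result[1]
    print('ADF Statistic: %f, p-value: %f' % (stat, p))
    if p <= alpha:
        print('Reject H0: the series looks stationary')
    else:
        print('Fail to reject H0: the series looks non-stationary')
    return p <= alpha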
We can perform the same test on the Airline Passenger dataset.
# calculate stationarity test of time series data
from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
-----Result-----
ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
5%: -2.884
1%: -3.482
10%: -2.579
The test statistic is positive, meaning we are much less likely to reject the null hypothesis (it looks non-stationary). Comparing the test statistic to the critical values, it looks like we would have to fail to reject the null hypothesis that the time series is non-stationary and does have time-dependent structure.
Let's log transform the dataset again to flatten out the exponential growth and better meet the expectations of this statistical test.
# calculate stationarity test of log transformed time series data
from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
from numpy import log
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = series.values
X = log(X)
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
-----Result-----
ADF Statistic: -1.717017
p-value: 0.422367
5%: -2.884
1%: -3.482
10%: -2.579
We can see that the test statistic of -1.717 is larger than all of the critical values, again meaning that we fail to reject the null hypothesis and in turn that the log-transformed time series is still non-stationary.
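A natural next step, not covered in this tutorial, would be to remove the remaining trend and seasonality before testing again, for example by differencing the log-transformed series. The sketch below is one way to do that; the seasonal lag of 12 is an assumption based on the monthly frequency of the Airline Passengers data.
# sketch: difference the log transformed series and re-run the ADF test (assumed next step)
from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
from numpy import log
series = read_csv('airline-passengers.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
X = log(series.values)
diff = X[1:] - X[:-1]                  # first difference removes the trend
seasonal_diff = diff[12:] - diff[:-12] # seasonal difference at lag 12 removes the yearly pattern
result = adfuller(seasonal_diff)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
If the resulting p-value falls below 0.05, that would support treating the differenced series as stationary.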