Your time series dataset may contain a trend. A trend is a sustained increase or decrease in the series over time. Identifying, modeling, and even removing trend information from a time series dataset can be beneficial. In this tutorial, you will discover how to model and remove trend information from time series data in Python.
After completing this tutorial, you will know:
- The importance and types of trends that may exist in time series and how to identify them.
- How to use a simple differencing method to remove a trend.
- How to model a linear trend and remove it from a sales time series dataset.
A. Trends in Time Series
A trend is a long-term increase or decrease in the level of the time series.
Identifying and understanding trend information can aid in improving model performance; below are a few reasons:
- Faster Modeling: Perhaps the knowledge of a trend or lack of a trend can suggest methods and make model selection and evaluation more efficient.
- Simpler Problem: Perhaps we can correct or remove the trend to simplify modeling and improve model performance.
- More Data: Perhaps we can use trend information, directly or as a summary, to provide additional information to the model and improve model performance.
1. Types of Trends
There are all kinds of trends. Two general classes that we may think about are:
- Deterministic Trends: These are trends that consistently increase or decrease.
- Stochastic Trends: These are trends that increase and decrease inconsistently.
In general, deterministic trends are easier to identify and remove, but the methods discussed in this tutorial can still be useful for stochastic trends.
We can also think about trends in terms of their scope of observations:
- Global Trends: These are trends that apply to the whole time series.
- Local Trends: These are trends that apply to parts or subsequences of a time series.
Generally, global trends are easier to identify and address.
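To make the deterministic and stochastic classes above more concrete, here is a minimal sketch on synthetic data. The series, the seed, and the helper name residuals_from_line are all made up for illustration and are not part of the Shampoo Sales example. A fixed line plus noise stands in for a deterministic trend, and a random walk stands in for a stochastic trend.

```python
# contrast a deterministic trend with a stochastic trend (synthetic data)
import numpy

rng = numpy.random.default_rng(1)  # fixed seed so the run is repeatable
n = 200
t = numpy.arange(n)

# deterministic trend: a fixed line plus independent noise
deterministic = 0.5 * t + rng.normal(0.0, 1.0, n)

# stochastic trend: a random walk, the running sum of random shocks
shocks = rng.normal(0.0, 1.0, n)
stochastic = numpy.cumsum(shocks)

def residuals_from_line(y):
    # hypothetical helper: fit a straight line and return what is left over
    slope, intercept = numpy.polyfit(t, y, 1)
    return y - (slope * t + intercept)

print('deterministic residual std: %.3f' % residuals_from_line(deterministic).std())
print('stochastic residual std: %.3f' % residuals_from_line(stochastic).std())
```

For the deterministic series, the residual spread should sit close to the noise level of 1.0. For the random walk, a straight line usually leaves more structure behind, although the exact figure depends on the seed.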
2. Identifying a Trend
You can plot time series data to see if a trend is obvious or not. The difficulty is that in practice, identifying a trend in a time series can be a subjective process.
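One possible aid when a plot is noisy is to smooth the series before judging it by eye. The sketch below uses a rolling mean from pandas on a handful of made-up values. The window size of 4 is an arbitrary choice for illustration, and this is only one of several ways to look for a trend.

```python
# smooth a noisy series with a rolling mean to make a trend easier to see
import pandas

# made-up observations that zigzag while drifting upward
series = pandas.Series([5, 7, 6, 9, 8, 11, 10, 13, 12, 15])

# average each observation with the three before it; the first three
# positions have no full window, so they are dropped
smoothed = series.rolling(window=4).mean().dropna()

print(series.is_monotonic_increasing)    # the raw series moves up and down
print(smoothed.is_monotonic_increasing)  # the smoothed series only rises
print(smoothed.tolist())
```

Plotting smoothed alongside series with pyplot.plot() would show the same thing visually.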
3. Removing a Trend
A time series with a trend is called non-stationary. An identified trend can be modeled. Once modeled, it can be removed from the time series dataset. This is called detrending the time series.
If removing the trend leaves a stationary series, the original series is said to be trend stationary.
4. Using Time Series Trends in Machine Learning
From a machine learning perspective, a trend in your data represents two opportunities:
- Remove Information: To remove systematic information that distorts the relationship between input and output variables.
- Add Information: To add systematic information to improve the relationship between input and output variables.
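As a rough illustration of the "Add Information" option, the sketch below builds a small supervised learning frame in which the time index is supplied as an input feature next to the previous observation. The values and the column names (time_index, lag_1, target) are made up for this example.

```python
# add trend information to a supervised learning dataset as an input feature
import pandas

sales = pandas.Series([10.0, 12.0, 15.0, 14.0, 18.0, 21.0])  # made-up values

frame = pandas.DataFrame({
    'time_index': range(len(sales)),  # position in time, carries the trend
    'lag_1': sales.shift(1),          # observation at the previous time step
    'target': sales,                  # observation to predict
})

# the first row has no previous observation, so drop it
frame = frame.dropna()
print(frame)
```

Whether a feature like this helps depends on the model and the data, so it is worth evaluating with and without it.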
B. Shampoo Sales Dataset
We will use the Shampoo Sales dataset as an example. This dataset describes the monthly number of sales of shampoo over a 3-year period.
C. Detrend by Differencing
One simple way to detrend a time series is by differencing. A new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step.
value(t) = observation(t) − observation(t − 1)
This has the effect of removing a trend from a time series dataset.
Below is an example that creates the difference detrended version of the Shampoo Sales dataset.
# detrend a time series using differencing
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    # prefix '190' so that a value such as '1-01' parses as 1901-01
    return datetime.strptime('190'+x, '%Y-%m')

# load the single data column as a Series, then parse the date index
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(x) for x in series.index]
X = series.values
# subtract the previous observation from each observation
diff = list()
for i in range(1, len(X)):
    value = X[i] - X[i - 1]
    diff.append(value)
pyplot.plot(diff)
pyplot.show()
-----Result-----
This approach works well for data with a linear trend.
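To see why differencing suits a linear trend, consider a small sketch on made-up series rather than the Shampoo Sales data. Differencing a straight line leaves a constant. A curved (here quadratic) trend is still rising after one round, and differencing a second time is one common way to handle that case, although it goes beyond what is covered above. The pandas Series.diff() method is shown as an equivalent to the manual loop.

```python
# first and second differences on synthetic trends
import numpy
import pandas

t = numpy.arange(10)
linear = 3.0 * t + 5.0   # straight-line trend with slope 3
quadratic = t ** 2       # curved trend

first_linear = numpy.diff(linear)           # every value is the slope, 3.0
first_quadratic = numpy.diff(quadratic)     # 1, 3, 5, ... still increasing
second_quadratic = numpy.diff(quadratic, n=2)  # every value is 2

print(first_linear)
print(first_quadratic)
print(second_quadratic)

# pandas equivalent: diff() leaves NaN in the first position, so drop it
builtin = pandas.Series(linear).diff().dropna()
print(builtin.tolist())
```

Note that each round of differencing shortens the series by one observation.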
D. Detrend by Model Fitting
A trend can also be modeled directly and then subtracted from the series. For example, a linear model can be fit on the time index to predict the observation. The dataset for such a model would look as follows:
X, y
1, obs1
2, obs2
3, obs3
4, obs4
5, obs5
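The predictions from this model form a straight line that can be taken as the trend for the dataset. Subtracting these predictions from the original observations gives the detrended series:
value(t) = observation(t) − prediction(t)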
# use a linear model to detrend a time series
from datetime import datetime
from pandas import read_csv
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot
import numpy

def parser(x):
    # prefix '190' so that a value such as '1-01' parses as 1901-01
    return datetime.strptime('190'+x, '%Y-%m')

# load the single data column as a Series, then parse the date index
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = [parser(x) for x in series.index]
# fit linear model on the integer time index
X = [i for i in range(0, len(series))]
X = numpy.reshape(X, (len(X), 1))
y = series.values
model = LinearRegression()
model.fit(X, y)
# calculate trend
trend = model.predict(X)
# plot trend over the original observations
pyplot.plot(y)
pyplot.plot(trend)
pyplot.show()
# detrend by subtracting the fitted trend from each observation
detrended = [y[i]-trend[i] for i in range(0, len(series))]
# plot detrended
pyplot.plot(detrended)
pyplot.show()
-----Result-----
Line plot of the Shampoo Sales dataset (blue) and the linear fit (green)
Line plot of the detrended Shampoo Sales dataset using the linear fit
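The same idea can be extended to a curved trend by fitting a polynomial instead of a straight line. The sketch below is an illustration on synthetic data, not the Shampoo Sales dataset, and it swaps LinearRegression for numpy.polyfit to keep the example short. The coefficients, noise level, seed, and the choice of degree 2 are all assumptions made for the example.

```python
# detrend a synthetic series with a fitted quadratic trend
import numpy

rng = numpy.random.default_rng(7)  # fixed seed so the run is repeatable
t = numpy.arange(36)               # 36 time steps, like 3 years of months

# made-up series: quadratic trend plus random noise
y = 0.2 * t ** 2 + 3.0 * t + 100.0 + rng.normal(0.0, 5.0, len(t))

# fit a degree-2 polynomial on the time index and evaluate it as the trend
coefficients = numpy.polyfit(t, y, deg=2)
trend = numpy.polyval(coefficients, t)

# value(t) = observation(t) - prediction(t)
detrended = y - trend

print('mean of detrended series: %.6f' % detrended.mean())
print('slope left in detrended series: %.6f' % numpy.polyfit(t, detrended, 1)[0])
```

Because the fit is a least squares fit with an intercept, the detrended values should average out to roughly zero and carry no remaining linear slope. A higher degree can follow the data more closely but also risks fitting noise, so the degree is a modeling choice.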