Machine learning: Data Preparation - Part 4 - Resampling and Interpolation

You may have observations at the wrong frequency. Maybe they are too granular or not granular enough. The Pandas library in Python provides the capability to change the frequency of your time series data.

In this tutorial, you will discover how to use Pandas in Python to both increase and decrease the sampling frequency of time series data. After completing this tutorial, you will know:

About time series resampling, the two types of resampling, and the 2 main reasons why you need to use them.
How to use Pandas to upsample time series data to a higher frequency and interpolate the new observations.
How to use Pandas to downsample time series data to a lower frequency and summarize the higher frequency observations.

A. Resampling

Resampling involves changing the frequency of your time series observations. Two types of resampling are:

Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds.
Downsampling: Where you decrease the frequency of the samples, such as from days to months.

In both cases, data must be invented. In the case of upsampling, care may be needed in determining how the fine-grained observations are calculated using interpolation. In the case of downsampling, care may be needed in selecting the summary statistics used to calculate the new aggregated values.

There are perhaps two main reasons why you may be interested in resampling your time series data:

Problem Framing: Resampling may be required if your data is not available at the same frequency that you want to make predictions.
Feature Engineering: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

B. Shampoo Sales Dataset

In this lesson, we will use the Shampoo Sales dataset as an example. This dataset describes the monthly number of sales of shampoo over a 3 year period.

The Shampoo Sales dataset only specifies year number and months.

C. Upsampling Data

The observations in the Shampoo Sales are monthly. Imagine we wanted daily sales information. We would have to upsample the frequency from monthly to daily and use an interpolation scheme to fill in the new daily frequency. The Pandas library provides a function called resample() on the Series and DataFrame objects. This can be used to group records when downsampling and making space for new observations when upsampling.

# upsample to daily intervals
from pandas import read_csv
from pandas import datetime
def parser(x):
return datetime.strptime('190'+x, '%Y-%m')
series = read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)
upsampled = series.resample('D').mean()
print(upsampled.head(32))

Month

1901-01-01 266.0

1901-01-02 NaN

1901-01-03 NaN

1901-01-04 NaN

1901-01-05 NaN

1901-01-06 NaN

1901-01-07 NaN

1901-01-08 NaN

1901-01-09 NaN

1901-01-10 NaN

1901-01-11 NaN

1901-01-12 NaN

1901-01-13 NaN

1901-01-14 NaN

1901-01-15 NaN

1901-01-16 NaN

1901-01-17 NaN

1901-01-18 NaN

1901-01-19 NaN

1901-01-20 NaN

1901-01-21 NaN

1901-01-22 NaN

1901-01-23 NaN

1901-01-24 NaN

1901-01-25 NaN

1901-01-26 NaN

1901-01-27 NaN

1901-01-28 NaN

1901-01-29 NaN

1901-01-30 NaN

1901-01-31 NaN

1901-02-01 145.9

Freq: D, Name: Sales, dtype: float64

The Series Pandas object provides an interpolate() function to interpolate missing values, and there is a nice selection of simple and more complex interpolation functions.

A good starting point is to use a linear interpolation.

# upsample to daily intervals with linear interpolation

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True,

squeeze=True, date_parser=parser)

upsampled = series.resample('D').mean()

interpolated = upsampled.interpolate(method='linear')

print(interpolated.head(32))

interpolated.plot()

pyplot.show()

Month

1901-01-01 266.000000

1901-01-02 262.125806

1901-01-03 258.251613

1901-01-04 254.377419

1901-01-05 250.503226

1901-01-06 246.629032

1901-01-07 242.754839

1901-01-08 238.880645

1901-01-09 235.006452

1901-01-10 231.132258

1901-01-11 227.258065

1901-01-12 223.383871

1901-01-13 219.509677

1901-01-14 215.635484

1901-01-15 211.761290

1901-01-16 207.887097

1901-01-17 204.012903

1901-01-18 200.138710

1901-01-19 196.264516

1901-01-20 192.390323

1901-01-21 188.516129

1901-01-22 184.641935

1901-01-23 180.767742

1901-01-24 176.893548

1901-01-25 173.019355

1901-01-26 169.145161

1901-01-27 165.270968

1901-01-28 161.396774

1901-01-29 157.522581

1901-01-30 153.648387

1901-01-31 149.774194

1901-02-01 145.900000

Freq: D, Name: Sales, dtype: float64

Line Plot of upsampled Shampoo Sales dataset with linear interpolation

Another common interpolation method is to use a polynomial or a spline to connect the values. This creates more curves and can look more natural on many datasets.

# upsample to daily intervals with spline interpolation

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)

upsampled = series.resample('D').mean()

interpolated = upsampled.interpolate(method='spline', order=2)

print(interpolated.head(32))

interpolated.plot()

pyplot.show()

Month

1901-01-01 266.000000

1901-01-02 258.630160

1901-01-03 251.560886

1901-01-04 244.720748

1901-01-05 238.109746

1901-01-06 231.727880

1901-01-07 225.575149

1901-01-08 219.651553

1901-01-09 213.957094

1901-01-10 208.491770

1901-01-11 203.255582

1901-01-12 198.248529

1901-01-13 193.470612

1901-01-14 188.921831

1901-01-15 184.602185

1901-01-16 180.511676

1901-01-17 176.650301

1901-01-18 173.018063

1901-01-19 169.614960

1901-01-20 166.440993

1901-01-21 163.496161

1901-01-22 160.780465

1901-01-23 158.293905

1901-01-24 156.036481

1901-01-25 154.008192

1901-01-26 152.209039

1901-01-27 150.639021

1901-01-28 149.298139

1901-01-29 148.186393

1901-01-30 147.303783

1901-01-31 146.650308

1901-02-01 145.900000

Freq: D, Name: Sales, dtype: float64

Line Plot of upsampled Shampoo Sales dataset with spline interpolation

D. Downsampling Data

The sales data is monthly, but perhaps we would prefer the data to be quarterly. The year can be divided into 4 business quarters, 3 months a piece. Instead of creating new rows between existing observations, the resample() function in Pandas will group all observations by the new frequency.

# downsample to quarterly intervals

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True,

squeeze=True, date_parser=parser)

resample = series.resample('Q')

quarterly_mean_sales = resample.mean()

print(quarterly_mean_sales.head())

quarterly_mean_sales.plot()

pyplot.show()

Month

1901-03-31 198.333333

1901-06-30 156.033333

1901-09-30 216.366667

1901-12-31 215.100000

1902-03-31 184.633333

Freq: Q-DEC, Name: Sales, dtype: float64

Line Plot of downsampling the Shampoo Sales dataset to quarterly mean values

Perhaps we want to go further and turn the monthly data into yearly data, and perhaps later use that to model the following year.

# downsample to yearly intervals

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True,

squeeze=True, date_parser=parser)

resample = series.resample('A')

yearly_mean_sales = resample.sum()

print(yearly_mean_sales.head())

yearly_mean_sales.plot()

pyplot.show()

Line Plot of downsampling the Shampoo Sales dataset to yearly sum values

Machine learning

Menu bar

06/11/2021

Data Preparation - Part 4 - Resampling and Interpolation

No comments:

Post a Comment