Machine learning: Data Preparation - Part 3

Line plots of observations over time are popular, but there is a suite of other plots that you can use to learn more about your problem.

The more you learn about your data, the more likely you are to develop a better forecasting model. In this tutorial, you will discover 6 different types of plots that you can use to visualize time series data with Python. Specifically, after completing this tutorial, you will know:

How to explore the temporal structure of time series with line plots, lag plots, and autocorrelation plots.
How to understand the distribution of observations using histograms and density plots.
How to tease out the change in distribution over intervals using box and whisker plots and heat map plots.

A. Time Series Visualization

We will take a look at 6 different types of visualizations that you can use on your own time series data. They are:

1. Line Plots.

2. Histograms and Density Plots.

3. Box and Whisker Plots.

4. Heat Maps.

5. Lag Plots or Scatter Plots.

6. Autocorrelation Plots.

The focus is on univariate time series, but the techniques are just as applicable to multivariate time series, when you have more than one observation at each time step.

B. Minimum Daily Temperatures Dataset

We will use the Minimum Daily Temperatures dataset as an example. This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

C. Line Plot

In this plot, time is shown on the x-axis with observation values along the y-axis.

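# create a line plot
from pandas import read_csv
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
# (the squeeze=True argument was removed from read_csv in pandas 2.0)
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
series.plot()
pyplot.show()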

Line plot of the Minimum Daily Temperatures dataset

Below is an example of changing the style of the line to black dots instead of a connected line (the style='k.' argument).

# create a dot plot
from pandas import read_csv
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# 'k.' draws each observation as a black dot
series.plot(style='k.')
pyplot.show()

Dot line plot of the Minimum Daily Temperatures dataset

It can be helpful to compare line plots for the same interval, such as from day-to-day, month-to-month, and year-to-year. The Minimum Daily Temperatures dataset spans 10 years. We can group data by year and create a line plot for each year for direct comparison. The example below shows how to do this. First, the observations are grouped by year (series.groupby(Grouper(freq='A'))). The groups are then collected into a new DataFrame with one column per year, and each column is drawn as its own subplot.

# create stacked line plots
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# group the observations by year
groups = series.groupby(Grouper(freq='A'))
# build a DataFrame with one column per year
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.plot(subplots=True, legend=False)
pyplot.show()

Stacked line plots of the Minimum Daily Temperatures dataset
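The grouping step can be tried without the CSV file. The following is a minimal sketch on synthetic data, not the dataset itself: the values, the date range, and the use of series.index.year in place of Grouper are all assumptions made for illustration. It also shows that the one-column-per-year layout relies on every year having the same number of observations.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature series (assumption): three
# non-leap years of daily values with a yearly cycle plus noise.
index = pd.date_range('1981-01-01', '1983-12-31', freq='D')
rng = np.random.default_rng(0)
cycle = 4 * np.cos(2 * np.pi * index.dayofyear.to_numpy() / 365)
series = pd.Series(11 + cycle + rng.normal(0, 1, len(index)), index=index)

# Group by calendar year and record how many observations each year has.
groups = series.groupby(series.index.year)
lengths = {year: len(group) for year, group in groups}
print(lengths)

# Equal lengths are what allow each year to become one column.
years = pd.DataFrame({year: group.values for year, group in groups})
print(years.shape)
```

If the groups differed in length, for example because one year contained a leap day, the columns could not be assembled this way and would need to be padded or trimmed first.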

D. Histogram and Density Plots

A histogram groups values into bins, and the frequency or count of observations in each bin can provide insight into the underlying distribution of the observations.

# create a histogram plot
from pandas import read_csv
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
series.hist()
pyplot.show()

Running the example shows a distribution that looks strongly Gaussian. The plotting function automatically selects the size of the bins based on the spread of values in the data.
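To see the numbers behind a histogram, the binning can be reproduced with NumPy. This is a sketch on randomly generated stand-in values (an assumption, not the temperature data), using 10 equal-width bins, which is the documented default bin count for the pandas hist() function.

```python
import numpy as np

# Stand-in values (assumption): roughly Gaussian, like the plot described above.
rng = np.random.default_rng(1)
values = rng.normal(loc=11.0, scale=4.0, size=3650)

# Split the range from min to max into 10 equal-width bins and count
# how many observations fall into each one.
counts, edges = np.histogram(values, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:6.2f} to {right:6.2f}: {count}')
```

The bin edges depend only on the minimum and maximum of the data, so a few extreme values can widen every bin.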

Histogram of the Minimum Daily Temperatures dataset

We can get a better idea of the shape of the distribution of observations by using a density plot.

# create a density plot
from pandas import read_csv
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
series.plot(kind='kde')
pyplot.show()

Density Plot of the Minimum Daily Temperatures dataset
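A density plot can be thought of as a smoothed histogram. The sketch below computes a Gaussian kernel density estimate by hand with NumPy to show the idea. The stand-in values and the bandwidth choice (Scott's rule of thumb) are assumptions made for illustration, and the result may not match exactly what the plotting function draws.

```python
import numpy as np

# Stand-in values (assumption), not the temperature data.
rng = np.random.default_rng(2)
values = rng.normal(loc=11.0, scale=4.0, size=500)

# Scott's rule-of-thumb bandwidth for one-dimensional data.
bandwidth = values.std(ddof=1) * len(values) ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little past the data.
grid = np.linspace(values.min() - 3 * bandwidth,
                   values.max() + 3 * bandwidth, 200)

# Place one Gaussian bump on every observation and average them.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1)
density /= len(values) * bandwidth * np.sqrt(2 * np.pi)

# A density should integrate to about 1 (trapezoid rule over the grid).
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and looks bumpier; a larger one gives a smoother curve that may hide detail.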

E. Box and Whisker Plots by Interval

Box and whisker plots can be created and compared for each interval in a time series, such as years, months, or days.

# create a boxplot of yearly data
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# group by year and build one column per year
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
pyplot.show()

Yearly Box and Whisker Plots of the Minimum Daily Temperatures dataset
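The box in each plot is drawn from the first quartile to the third quartile, with a line at the median. As a rough sketch, the same summary values can be computed per year directly. The data here is randomly generated stand-in data, and the column names q1, median, q3, and iqr are labels chosen for this example.

```python
import numpy as np
import pandas as pd

# Stand-in daily series covering two years (assumption).
index = pd.date_range('1981-01-01', '1982-12-31', freq='D')
rng = np.random.default_rng(3)
series = pd.Series(rng.normal(11.0, 4.0, len(index)), index=index)

# Quartiles per year: these are the values that define each box.
summary = series.groupby(series.index.year).quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ['q1', 'median', 'q3']

# The interquartile range is the height of the box.
summary['iqr'] = summary['q3'] - summary['q1']
print(summary)
```

Comparing these rows across years gives the same information as comparing the boxes side by side, in tabular form.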

We may also be interested in the distribution of values across months within a year.

# create a boxplot of monthly data
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
from pandas import concat
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# select a single year and group it by month
one_year = series['1990']
groups = one_year.groupby(Grouper(freq='M'))
# one column per month; shorter months are padded with NaN
months = concat([DataFrame(x[1].values) for x in groups], axis=1)
months = DataFrame(months)
months.columns = range(1,13)
months.boxplot()
pyplot.show()

Monthly Box and Whisker Plots of the Minimum Daily Temperatures dataset

F. Heat Maps

In the case of the Minimum Daily Temperatures, the observations can be arranged into a matrix of year-columns and day-rows, with minimum temperature in the cell for each day. A heat map of this matrix can then be plotted. Below is an example of creating a heatmap of the Minimum Daily Temperatures data. The matshow() function from the Matplotlib library is used as no heatmap support is provided directly in Pandas.

# create a heat map of yearly data
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# group by year and build one column per year
groups = series.groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
# transpose so that each row is one year and each column is one day
years = years.T
pyplot.matshow(years, interpolation=None, aspect='auto')
pyplot.show()

The plot shows the cooler minimum temperatures in the middle days of each year and the warmer minimum temperatures at the start and end of each year, with all the fading and complexity in between.

Yearly Heat Map Plot of the Minimum Daily Temperatures dataset

As with the box and whisker plot example above, we can also compare the months within a year. Below is an example of a heat map comparing the months of the year in 1990. Each column represents one month, with rows representing the days of the month from 1 to 31.

# create a heat map of monthly data
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
from pandas import concat
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
# select a single year and group it by month
one_year = series['1990']
groups = one_year.groupby(Grouper(freq='M'))
# one column per month; shorter months are padded with NaN
months = concat([DataFrame(x[1].values) for x in groups], axis=1)
months = DataFrame(months)
months.columns = range(1,13)
pyplot.matshow(months, interpolation=None, aspect='auto')
pyplot.show()

Monthly Heat Map Plot of the Minimum Daily Temperatures dataset
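The day-by-month matrix behind this heat map can also be built with a pivot, which makes the missing cells explicit. This is a sketch under assumptions: the values are randomly generated for the year 1990, and the column names month, day, and value are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Stand-in daily values for one non-leap year (assumption).
index = pd.date_range('1990-01-01', '1990-12-31', freq='D')
rng = np.random.default_rng(4)
one_year = pd.Series(rng.normal(11.0, 4.0, len(index)), index=index)

# Long format: one row per observation with its month and day of month.
frame = pd.DataFrame({
    'month': index.month,
    'day': index.day,
    'value': one_year.to_numpy(),
})

# Pivot into a matrix with days as rows and months as columns.
# Days that do not exist (such as 30 February) become NaN.
months = frame.pivot(index='day', columns='month', values='value')
print(months.shape)
print(int(months.isna().sum().sum()), 'empty cells')
```

The empty cells correspond to the blank patches at the bottom of the shorter months in the heat map.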

G. Lag Scatter Plots

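Time series modeling assumes a relationship between an observation and the one before it. A useful way to explore that relationship is a scatter plot of each observation against the observation one time step away. Pandas has a built-in function for exactly this, lag_plot(). It plots the observation at time t on the x-axis and the observation at the next time step (t+1) on the y-axis.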

If the points cluster along a diagonal line from the bottom-left to the top-right of the plot, it suggests a positive correlation.
If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests a negative correlation.
Either relationship is good, as both can be modeled.

Points clustered tightly around the diagonal line suggest a stronger relationship, and more spread from the line suggests a weaker one. A ball in the middle or a spread across the whole plot suggests a weak relationship or none at all.

# create a scatter plot
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import lag_plot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
lag_plot(series)
pyplot.show()

Lag scatter plot of the Minimum Daily Temperatures dataset

We can repeat this process for an observation and any lag values.

# create multiple scatter plots
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
values = DataFrame(series.values)
lags = 7
# build one shifted copy of the values per lag
columns = [values]
for i in range(1,(lags + 1)):
    columns.append(values.shift(i))
dataframe = concat(columns, axis=1)
# name the columns t, t-1, ..., t-7
columns = ['t']
for i in range(1,(lags + 1)):
    columns.append('t-' + str(i))
dataframe.columns = columns
# one scatter plot per lag on a 2x4 grid
pyplot.figure(1)
for i in range(1,(lags + 1)):
    ax = pyplot.subplot(240 + i)
    ax.set_title('t vs t-' + str(i))
    pyplot.scatter(x=dataframe['t'].values, y=dataframe['t-'+str(i)].values)
pyplot.show()

Multiple Lag scatter plots of the Minimum Daily Temperatures dataset
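The lagged columns used for these scatter plots come from shift(), which moves values down by a number of rows. A tiny sketch on made-up numbers (an assumption, chosen so the result is easy to read) shows what the t, t-1, and t-2 columns contain.

```python
import pandas as pd

# Five made-up observations in time order.
values = pd.Series([10, 12, 11, 13, 15])

# shift(k) pushes every value down k rows, so row t holds the value from t-k.
lagged = pd.concat([values, values.shift(1), values.shift(2)], axis=1)
lagged.columns = ['t', 't-1', 't-2']
print(lagged)
```

The first rows of the shifted columns are empty (NaN) because there is no earlier observation to fill them.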

H. Autocorrelation Plots

We can quantify the strength and type of relationship between observations and their lags.

In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation). A correlation value calculated between two groups of numbers, such as observations and their lag=1 values, results in a number between -1 and 1.

The sign of this number indicates a negative or positive correlation respectively. A value close to zero suggests a weak correlation, whereas a value closer to -1 or 1 indicates a strong correlation.

Correlation values, called correlation coefficients, can be calculated for each observation and different lag values. Once calculated, a plot can be created to help better understand how this relationship changes over the lag.

This type of plot is called an autocorrelation plot, and Pandas provides this capability built in through the autocorrelation_plot() function.

# create an autocorrelation plot
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot
# load the data; .squeeze('columns') returns a Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0,
    parse_dates=True).squeeze('columns')
autocorrelation_plot(series)
pyplot.show()

The plot captures the relationship of an observation with past observations in the same and opposite seasons or times of year. Sine waves like those seen in this example are a strong sign of seasonality in the dataset.

Autocorrelation Plot of the Minimum Daily Temperatures dataset
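Individual correlation coefficients like those behind the plot can be checked with the autocorr() method of a pandas Series. The sketch below uses a synthetic seasonal series (an assumption: a yearly cosine cycle plus noise), and the lags 1, 182, and 365 are picked to represent the next day, the opposite season, and the same season one year later. The values may differ slightly from those drawn by autocorrelation_plot(), which uses its own estimator.

```python
import numpy as np
import pandas as pd

# Synthetic seasonal series (assumption): three years of daily values.
rng = np.random.default_rng(5)
days = np.arange(3 * 365)
seasonal = 4 * np.cos(2 * np.pi * days / 365)
series = pd.Series(11 + seasonal + rng.normal(0, 1, len(days)))

# Correlation between the series and a copy of itself shifted by each lag.
coefficients = {lag: series.autocorr(lag=lag) for lag in (1, 182, 365)}
for lag, value in coefficients.items():
    print(lag, round(value, 2))
```

With a yearly cycle, the coefficient should be positive at lags near a full year, negative near half a year, and this alternation is what produces the sine-wave shape in the plot.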

05/11/2021

Data Preparation - Part 3 - Data Visualization
