Machine learning: November 2021

21/11/2021

Project - Predictive Modeling Project Template

Applied machine learning is an empirical skill. You cannot get better at it by reading books and articles. You have to practice. In this lesson you will discover the simple six-step machine learning project template that you can use to jump-start your project in Python. After completing this lesson you will know:

1. How to structure an end-to-end predictive modeling project.

2. How to best use the structured project template to ensure an accurate result for your dataset.

Project 1 - Monthly Sales of French Champagne

We will work through a time series forecasting project from end-to-end, from downloading the dataset and defining the problem to training a final model and making predictions. This project is not exhaustive, but shows how you can get good results quickly by working through a time series forecasting problem systematically.

The steps of this project that we will work through are as follows.

Problem Description.
Test Harness.
Persistence.
Data Analysis.
ARIMA Models.
Model Validation.

Guide For Time Series Forecast Projects

A time series forecast process is a set of steps or a recipe that leads you from defining your problem through to the outcome of having a time series forecast model or set of predictions.

In this lesson, you will discover time series forecast processes that you can use to guide you through your forecast project. After reading this lesson, you will know:

The 5-Step forecasting task by Hyndman and Athanasopoulos to guide you from problem definition to using and evaluating your forecast model.
The iterative forecast development process by Shmueli and Lichtendahl to guide you from defining your goal to implementing forecasts.
Suggestions and tips for working through your own time series forecasting project.

Forecast Models - Part 8 - Forecast Confidence Intervals

Confidence intervals provide an upper and lower expectation for the real observation. These can be useful for assessing the range of real possible outcomes for a prediction and for better understanding the skill of the model.
In this tutorial, you will discover how to calculate and interpret confidence intervals for time series forecasts with Python.
Specifically, you will learn:

How to make a forecast with an ARIMA model and gather forecast diagnostic information.
How to interpret a confidence interval for a forecast and configure different intervals.
How to plot the confidence interval in the context of recent observations.

Forecast Models - Part 7 - Save Models and Make Predictions

Selecting a time series forecasting model is just the beginning. Using the chosen model in practice can pose challenges, including data transformations and storing the model parameters on disk.

In this tutorial, you will discover how to finalize a time series forecasting model and use it to make predictions in Python.

After completing this tutorial, you will know:

How to finalize a model and save it and required data to file.
How to load a finalized model from file and use it to make a prediction.
How to update data associated with a finalized model in order to make subsequent predictions.

Forecast Models - Part 6 - Grid Search ARIMA Model Hyperparameters

The ARIMA model for time series analysis and forecasting can be tricky to configure. We can automate the process of evaluating a large number of hyperparameters for the ARIMA model by using a grid search procedure.

In this tutorial, you will discover how to tune the ARIMA model using a grid search of hyperparameters in Python.

After completing this tutorial, you will know:

A general procedure that you can use to tune the ARIMA hyperparameters for a rolling one-step forecast.
How to apply ARIMA hyperparameter optimization on a standard univariate time series dataset.
Ideas for extending the procedure for more elaborate and robust models.

Forecast Models - Part 5 - Autocorrelation and Partial Autocorrelation

Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting. These are plots that graphically summarize the strength of a relationship with an observation in a time series with observations at prior time steps.

The difference between autocorrelation and partial autocorrelation can be difficult and confusing for beginners to time series forecasting.

In this tutorial, you will discover how to calculate and plot autocorrelation and partial autocorrelation plots with Python.

After completing this tutorial, you will know:

How to plot and review the autocorrelation function for a time series.
How to plot and review the partial autocorrelation function for a time series.
The difference between autocorrelation and partial autocorrelation functions for time series analysis.
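As a rough illustration of what an autocorrelation plot summarizes, the sketch below computes the sample autocorrelation at a few lags in plain Python. The function name `autocorrelation` and the toy series are assumptions made for this example; partial autocorrelation is not covered here.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (lag >= 0)."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    variance_sum = sum((x - mean) ** 2 for x in series)
    # Numerator: co-movement between each value and the value `lag` steps earlier.
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# Toy series (made up for illustration) that repeats every 4 steps.
data = [1.0, 2.0, 3.0, 2.0] * 6
acf = [autocorrelation(data, k) for k in range(5)]
print(acf)
```

On this toy series the value at lag 4 is strongly positive and the value at lag 2 is strongly negative, which is the kind of repeating structure an autocorrelation plot makes visible.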

Forecast Models - Part 4 - ARIMA Model for Forecasting

A popular and widely used statistical method for time series forecasting is the ARIMA model. ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data.

In this tutorial, you will discover how to develop an ARIMA model for time series data with Python. After completing this tutorial, you will know:

About the ARIMA model, the parameters used, and the assumptions made by the model.
How to fit an ARIMA model to data and use it to make forecasts.
How to configure the ARIMA model on your time series problem.

Forecast Models - Part 3 - Moving Average Models for Forecasting

The residual errors from forecasts on a time series provide another source of information that we can model. Residual errors themselves form a time series that can have temporal structure.

A simple autoregression model of this structure can be used to predict the forecast error, which in turn can be used to correct forecasts. This type of model is called a moving average model, the same name but very different from moving average smoothing.

In this tutorial, you will discover how to model a residual error time series and use it to correct predictions with Python.

After completing this tutorial, you will know:

How to model residual error time series using an autoregressive model.
How to develop and evaluate a model of residual error time series.
How to use a model of residual error to correct predictions and improve forecast skill.

Forecast Models - Part 2 - Autoregression Models for Forecasting

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

After completing this tutorial, you will know:

How to explore your time series data for autocorrelation.
How to develop an autoregression model and use it to make predictions.
How to use a developed autoregression model to make rolling predictions.
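To make the idea of regressing on a previous time step concrete, here is a minimal sketch of a first-order autoregression fitted by ordinary least squares. The helper names `fit_ar1` and `predict_next` and the noise-free toy series are assumptions for illustration, not the tutorial's own code.

```python
def fit_ar1(series):
    """Fit y[t] = intercept + slope * y[t-1] by ordinary least squares."""
    x = series[:-1]  # lagged observations (inputs)
    y = series[1:]   # next-step observations (targets)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_next(last_value, intercept, slope):
    """One-step forecast from the most recent observation."""
    return intercept + slope * last_value


# Toy series (made up) generated exactly by y[t] = 2 + 0.5 * y[t-1].
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

intercept, slope = fit_ar1(series)
forecast = predict_next(series[-1], intercept, slope)
print(intercept, slope, forecast)
```

Because the toy series has no noise, the fit recovers the generating coefficients; on real data the estimates would only approximate the underlying relationship.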

Forecast Models - Part 1 - A Gentle Introduction to the Box-Jenkins Method

The Autoregressive Integrated Moving Average Model, or ARIMA for short, is a standard statistical model for time series forecast and analysis.
In this lesson, you will discover the Box-Jenkins Method and tips for using it on your time series forecasting problem. Specifically, you will learn:

About the ARIMA model and the 3 steps of the general Box-Jenkins Method.
How to choose the parameters for an ARIMA model.
How to use overfitting and residual errors to diagnose a fit ARIMA model.

Evaluate Models - Part 5 - Reframe Time Series Forecasting Problems

There are many ways to reframe your forecast problem that can both simplify the prediction problem and potentially expose more or different information to be modeled. A reframing can ultimately result in better and/or more robust forecasts. In this tutorial, you will discover how to reframe your time series forecast problem with Python. After completing this tutorial, you will know:

How to reframe your time series forecast problem as an alternate regression problem.
How to reframe your time series forecast problem as a classification prediction problem.
How to reframe your time series forecast problem with an alternate time horizon.

Evaluate Models - Part 4 - Visualize Residual Forecast Errors

Forecast errors on time series regression problems are called residuals or residual errors. Careful exploration of residual errors on your time series prediction problem can tell you a lot about your forecast model and even suggest improvements. In this tutorial, you will discover how to visualize residual errors from time series forecasts. After completing this tutorial, you will know:

How to create and review line plots of residual errors over time.
How to review summary statistics and plots of the distribution of residual errors.
How to explore the correlation structure of residual errors.

Evaluate Models - Part 3 - Persistence Model for Forecasting

Establishing a baseline is essential on any time series forecasting problem. A baseline in performance gives you an idea of how well all other models will actually perform on your problem. In this tutorial, you will discover how to develop a persistence forecast that you can use to calculate a baseline level of performance on a time series dataset with Python. After completing this tutorial, you will know:

The importance of calculating a baseline of performance on time series forecast problems.
How to develop a persistence model from scratch in Python.
How to evaluate the forecast from a persistence model and use it to establish a baseline in performance.
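A persistence baseline can be sketched in a few lines: predict each value as the one observed just before it, then score the predictions. The series values, the split point, and the function name `persistence_forecast` below are made up for illustration.

```python
from math import sqrt


def persistence_forecast(history):
    """Predict the next value as the most recent observation."""
    return history[-1]


# Toy series (made-up values) split into a training and a test portion.
series = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]
split = 3
train, test = series[:split], series[split:]

history = list(train)
predictions = []
for actual in test:
    predictions.append(persistence_forecast(history))
    history.append(actual)  # the true value becomes known before the next step

rmse = sqrt(sum((p - a) ** 2 for p, a in zip(predictions, test)) / len(test))
print(predictions, rmse)
```

Any more elaborate model would need to achieve a lower error than this on the same test data to be considered skillful.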

Evaluate Models - Part 2 - Forecasting Performance Measures

In this tutorial, you will discover performance measures for evaluating time series forecasts with Python. Time series forecasting generally focuses on predicting real values, which makes these regression problems. The performance measures in this tutorial therefore focus on methods for evaluating real-valued predictions.

After completing this tutorial, you will know:

Basic measures of forecast performance, including residual forecast error and forecast bias.
Time series forecast error calculations that have the same units as the expected outcomes such as mean absolute error.
Widely used error calculations that punish large errors, such as mean squared error and root mean squared error.
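The measures listed above can be computed directly from paired expected and predicted values. The following sketch uses made-up numbers and assumes the convention that a forecast error is the expected value minus the predicted value.

```python
from math import sqrt

# Made-up expected and predicted values for four time steps.
expected = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Residual forecast error: expected minus predicted, one value per time step.
errors = [e - p for e, p in zip(expected, predicted)]

bias = sum(errors) / len(errors)                     # mean forecast error
mae = sum(abs(err) for err in errors) / len(errors)  # same units as the data
mse = sum(err ** 2 for err in errors) / len(errors)  # punishes large errors
rmse = sqrt(mse)                                     # back in the data's units

print(bias, mae, mse, rmse)
```

With this sign convention, a negative bias indicates that the predictions are too high on average.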

Evaluate Models - Part 1 - Backtest Forecast Models

The goal of time series forecasting is to make accurate predictions about the future. The fast and powerful methods that we rely on in machine learning, such as using train-test splits and k-fold cross-validation, do not work in the case of time series data. This is because they ignore the temporal components inherent in the problem.

In this tutorial, you will discover how to evaluate machine learning models on time series data with Python. In the field of time series forecasting, this is called backtesting or hindcasting.

After completing this tutorial, you will know:

The limitations of traditional methods of model evaluation from machine learning and why evaluating models on out-of-sample data is required.
How to create train-test splits and multiple train-test splits of time series data for model evaluation in Python.
How walk-forward validation provides the most realistic evaluation of machine learning models on time series data.
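Walk-forward validation can be sketched as a loop that forecasts one step, reveals the true value, and then moves on. The function `walk_forward_validation`, the stand-in model `mean_of_last_two`, and the toy series are assumptions for this example.

```python
def walk_forward_validation(series, n_train, model):
    """Evaluate `model` one step at a time, growing the history after each step."""
    history = list(series[:n_train])
    errors = []
    for actual in series[n_train:]:
        prediction = model(history)   # forecast using only past observations
        errors.append(abs(actual - prediction))
        history.append(actual)        # reveal the true value, then move forward
    return sum(errors) / len(errors)  # mean absolute error


def mean_of_last_two(history):
    """Hypothetical stand-in model: average of the two latest observations."""
    return (history[-1] + history[-2]) / 2.0


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mae = walk_forward_validation(series, n_train=3, model=mean_of_last_two)
print(mae)
```

The key property is that every prediction uses only observations that precede it in time, which respects the temporal ordering that shuffled k-fold cross-validation ignores.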

Temporal Structure - Part 6 - Stationarity in Time Series Data

Time series is different from more traditional classification and regression predictive modeling problems. The temporal structure adds an order to the observations. For example, when modeling, there are assumptions that the summary statistics of observations are consistent. In time series terminology, we refer to this expectation as the time series being stationary.
These assumptions can be easily violated in time series by the addition of a trend, seasonality, and other time-dependent structures.
In this tutorial, you will discover how to check if your time series is stationary with Python. After completing this tutorial, you will know:

How to identify obvious stationary and non-stationary time series using a line plot.
How to spot-check summary statistics like mean and variance for a change over time.
How to use statistical tests with statistical significance to check if a time series is stationary.
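One quick spot-check is to split the series in two and compare the mean and variance of each half. The sketch below uses two made-up series and a hypothetical helper `split_summary`; it is an informal check, not a statistical test.

```python
from statistics import mean, pvariance


def split_summary(series):
    """Return (mean, variance) for the first and second half of the series."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


trending = [float(t) for t in range(20)]  # steadily increasing (non-stationary)
level = [5.0, 6.0] * 10                   # fluctuates around a fixed level

trend_first, trend_second = split_summary(trending)
level_first, level_second = split_summary(level)
print(trend_first, trend_second)
print(level_first, level_second)
```

A large difference between the two halves, as in the trending series, suggests that the summary statistics are changing over time.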

Temporal Structure - Part 5 - Use and Remove Seasonality

Time series datasets can contain a seasonal component. This is a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal that we wish to model when forecasting, and in turn may provide a strong signal to our predictive models. In this tutorial, you will discover how to identify and correct for seasonality in time series data with Python.

After completing this tutorial, you will know:

The definition of seasonality in time series and the opportunity it provides for forecasting with machine learning methods.
How to use the difference method to create a seasonally adjusted time series of daily temperature data.
How to model the seasonal component directly and explicitly subtract it from observations.
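One simple way to model a seasonal component directly is to average the observations that share a position in the cycle and subtract that average. This is a swapped-in technique chosen for brevity and may differ from the tutorial's own approach; the function names and toy data are assumptions.

```python
def seasonal_means(series, period):
    """Average the observations that fall at the same position in each cycle."""
    return [
        sum(series[i::period]) / len(series[i::period]) for i in range(period)
    ]


def remove_seasonality(series, period):
    """Subtract the per-position seasonal average from every observation."""
    means = seasonal_means(series, period)
    return [value - means[t % period] for t, value in enumerate(series)]


# Toy series (made up): a pattern of length 4 repeated three times.
series = [10.0, 20.0, 30.0, 20.0] * 3
profile = seasonal_means(series, 4)
adjusted = remove_seasonality(series, 4)
print(profile)
print(adjusted)
```

Here the adjusted series is flat because the toy data contains nothing but the seasonal pattern; real data would leave trend and noise behind.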

Temporal Structure - Part 4 - Use and Remove Trends

Our time series dataset may contain a trend. A trend is a continued increase or decrease in the series over time. There can be benefit in identifying, modeling, and even removing trend information from your time series dataset. In this tutorial, you will discover how to model and remove trend information from time series data in Python.

After completing this tutorial, you will know:

The importance and types of trends that may exist in time series and how to identify them.
How to use a simple differencing method to remove a trend.
How to model a linear trend and remove it from a sales time series dataset.
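Differencing removes a trend by replacing each observation with its change from the previous one. The sketch below, with hypothetical helpers `difference` and `invert_difference`, also shows that the transform can be reversed when the first observation is kept.

```python
def difference(series, interval=1):
    """Return the change between each observation and the one `interval` steps back."""
    return [series[t] - series[t - interval] for t in range(interval, len(series))]


def invert_difference(first_value, diffs):
    """Rebuild the original series from its first value and first-order differences."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


# Toy series (made up) with a linear trend of +2 per step.
series = [3.0 + 2.0 * t for t in range(6)]
diffs = difference(series)
restored = invert_difference(series[0], diffs)
print(diffs)
print(restored)
```

The differenced series is constant, which means the linear trend has been removed.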

Temporal Structure - Part 3 - Decompose Time Series Data

Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components.

After completing this tutorial, you will know:

The time series decomposition method of analysis and how it can help with forecasting.
How to automatically decompose time series data in Python.
How to decompose additive and multiplicative time series problems and plot the results.

Temporal Structure - Part 2 - A Gentle Introduction to the Random Walk

How do you know your time series problem is predictable? This is a difficult question with time series forecasting. There is a tool called a random walk that can help you understand the predictability of your time series forecast problem. In this tutorial, you will discover the random walk and its properties in Python.

After completing this tutorial, you will know:

What the random walk is and how to create one from scratch in Python.
How to analyze the properties of a random walk and recognize when a time series is and is not a random walk.
How to make predictions for a random walk.
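A random walk can be built from scratch by repeatedly adding a random step to the previous value. The sketch below assumes steps of -1 or +1 and a fixed seed; the function name `random_walk` is made up for this example.

```python
import random


def random_walk(n_steps, seed=1):
    """Build a walk where each value is the previous value plus -1 or +1."""
    rng = random.Random(seed)
    walk = [rng.choice([-1, 1])]
    for _ in range(n_steps - 1):
        walk.append(walk[-1] + rng.choice([-1, 1]))
    return walk


walk = random_walk(1000)
# Differencing the walk recovers the underlying random steps.
steps = [walk[t] - walk[t - 1] for t in range(1, len(walk))]
print(walk[:10])
print(steps[:10])
```

Each value depends on the one before it, yet the steps themselves are random. This is why the previous observation is usually the best available forecast for a random walk.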

Temporal Structure - Part 1 - A Gentle Introduction to White Noise

White noise is an important concept in time series forecasting. If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the series of forecast errors are not white noise, it suggests improvements could be made to the predictive model.

In this tutorial, you will discover white noise time series with Python. After completing this tutorial, you will know:

The definition of a white noise time series and why it matters.
How to check if your time series is white noise.
Statistics and diagnostic plots to identify white noise in Python.
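A simple white noise check is to generate (or load) a series and confirm that its mean is near zero, its spread is roughly constant, and its values are not correlated with their predecessors. The sketch below uses Gaussian noise with a fixed seed; the helper `lag1_autocorrelation` is an assumption for illustration.

```python
import random
from statistics import mean, stdev

# Generate 1000 Gaussian random values with mean 0 and standard deviation 1.
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

sample_mean = mean(noise)
sample_std = stdev(noise)


def lag1_autocorrelation(series):
    """Correlation between each value and the value one step earlier."""
    m = mean(series)
    numerator = sum(
        (series[t] - m) * (series[t - 1] - m) for t in range(1, len(series))
    )
    denominator = sum((x - m) ** 2 for x in series)
    return numerator / denominator


lag1 = lag1_autocorrelation(noise)
print(sample_mean, sample_std, lag1)
```

The same three numbers can be computed on a series of forecast errors; values far from these expectations hint that some structure is still left to model.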

08/11/2021

Data Preparation - Part 6 - Moving Average Smoothing

Moving average smoothing is a naive and effective technique in time series forecasting. It can be used for data preparation, feature engineering, and even directly for making predictions.

After completing this tutorial, you will know:

How moving average smoothing works and some expectations of your data before you can use it.
How to use moving average smoothing for data preparation and feature engineering.
How to use moving average smoothing to make predictions.
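A trailing moving average replaces each observation with the mean of itself and the preceding observations in a fixed window. The sketch below uses a window of 3 on made-up data; the function name `trailing_moving_average` is an assumption.

```python
def trailing_moving_average(series, window):
    """Mean of the current value and the (window - 1) values before it."""
    return [
        sum(series[t - window + 1 : t + 1]) / window
        for t in range(window - 1, len(series))
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = trailing_moving_average(series, 3)

# Used as a naive forecast: predict the next value as the latest window mean.
next_prediction = smoothed[-1]
print(smoothed, next_prediction)
```

A trailing window uses only past values, so the same calculation can serve as a feature or as a simple forecast without looking into the future.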

Data Preparation - Part 5 - Power Transform

Data transforms are intended to remove noise and improve the signal in time series forecasting.

It can be very difficult to select a good, or even best, transform for a given prediction problem.

There are many transforms to choose from and each has a different mathematical intuition. In this tutorial, you will discover how to explore different power-based transforms for time series forecasting with Python. After completing this tutorial, you will know:

How to identify when to use and how to explore a square root transform.
How to identify when to use and explore a log transform and the expectations on raw data.
How to use the Box-Cox transform to perform square root, log, and automatically discover the best power transform for your dataset.
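The intuition behind the square root and log transforms can be shown on two made-up series: one that grows quadratically and one that grows exponentially. This sketch uses only the standard library and does not cover the Box-Cox transform.

```python
from math import exp, log, sqrt

# A quadratic growth series becomes linear under a square root transform.
quadratic = [float(t * t) for t in range(1, 8)]
sqrt_transformed = [sqrt(x) for x in quadratic]

# An exponential growth series becomes linear under a log transform.
# Note: the log transform requires strictly positive values.
exponential = [exp(t) for t in range(1, 8)]
log_transformed = [log(x) for x in exponential]

print(sqrt_transformed)
print(log_transformed)
```

In both cases the transformed series increases by a constant amount per step, which is generally easier to model than the original curve.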

Data Preparation - Part 4 - Resampling and Interpolation

You may have observations at the wrong frequency. Maybe they are too granular or not granular enough. The Pandas library in Python provides the capability to change the frequency of your time series data.

In this tutorial, you will discover how to use Pandas in Python to both increase and decrease the sampling frequency of time series data. After completing this tutorial, you will know:

About time series resampling, the two types of resampling, and the 2 main reasons why you need to use them.
How to use Pandas to upsample time series data to a higher frequency and interpolate the new observations.
How to use Pandas to downsample time series data to a lower frequency and summarize the higher frequency observations.

A. Resampling

Resampling involves changing the frequency of your time series observations. Two types of resampling are:

Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds.
Downsampling: Where you decrease the frequency of the samples, such as from days to months.

In both cases, data must be invented. In the case of upsampling, care may be needed in determining how the fine-grained observations are calculated using interpolation. In the case of downsampling, care may be needed in selecting the summary statistics used to calculate the new aggregated values.
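Both directions can be tried on small synthetic series without any data file. The sketch below assumes pandas is installed; the values, dates, and frequencies are made up for illustration.

```python
import pandas as pd

# Upsampling: observations every 2 days -> daily, gaps filled by interpolation.
sparse = pd.Series(
    [10.0, 20.0, 40.0],
    index=pd.date_range("2021-01-01", periods=3, freq="2D"),
)
daily = sparse.resample("D").mean()  # newly created days are NaN at this point
interpolated = daily.interpolate(method="linear")

# Downsampling: 14 daily observations -> weekly totals (weeks end on Sunday).
dense = pd.Series(
    range(1, 15),
    index=pd.date_range("2021-01-04", periods=14, freq="D"),  # starts on a Monday
)
weekly = dense.resample("W").sum()

print(interpolated)
print(weekly)
```

The interpolation method and the summary statistic (here a linear fill and a sum) are the two choices that determine how the new values are calculated.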

There are perhaps two main reasons why you may be interested in resampling your time series data:

Problem Framing: Resampling may be required if your data is not available at the same frequency that you want to make predictions.
Feature Engineering: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

B. Shampoo Sales Dataset

In this lesson, we will use the Shampoo Sales dataset as an example. This dataset describes the monthly number of sales of shampoo over a 3 year period.

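The Shampoo Sales dataset specifies only a year number and a month for each observation, not a full calendar date. The examples in this lesson therefore prefix the year number with '190' so that the dates can be parsed, which is why the output shows dates starting in 1901.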

C. Upsampling Data

The observations in the Shampoo Sales dataset are monthly. Imagine we wanted daily sales information. We would have to upsample the frequency from monthly to daily and use an interpolation scheme to fill in the new daily frequency. The Pandas library provides a function called resample() on the Series and DataFrame objects. This can be used to group records when downsampling and to make space for new observations when upsampling.

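# upsample to daily intervals
from datetime import datetime
from pandas import read_csv

def parser(x):
    # dates are stored as a year number and a month; prefix '190' to form a 4-digit year
    return datetime.strptime('190' + x, '%Y-%m')

# load the single data column as a Series, then parse the index into dates
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = series.index.map(parser)
# resample to daily frequency; the new days have no data yet and appear as NaN
upsampled = series.resample('D').mean()
print(upsampled.head(32))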

Month
1901-01-01 266.0
1901-01-02 NaN
1901-01-03 NaN
1901-01-04 NaN
1901-01-05 NaN
1901-01-06 NaN
1901-01-07 NaN
1901-01-08 NaN
1901-01-09 NaN
1901-01-10 NaN
1901-01-11 NaN
1901-01-12 NaN
1901-01-13 NaN
1901-01-14 NaN
1901-01-15 NaN
1901-01-16 NaN
1901-01-17 NaN
1901-01-18 NaN
1901-01-19 NaN
1901-01-20 NaN
1901-01-21 NaN
1901-01-22 NaN
1901-01-23 NaN
1901-01-24 NaN
1901-01-25 NaN
1901-01-26 NaN
1901-01-27 NaN
1901-01-28 NaN
1901-01-29 NaN
1901-01-30 NaN
1901-01-31 NaN
1901-02-01 145.9
Freq: D, Name: Sales, dtype: float64

The Series Pandas object provides an interpolate() function to interpolate missing values, and there is a nice selection of simple and more complex interpolation functions.

A good starting point is to use a linear interpolation.

# upsample to daily intervals with linear interpolation
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = series.index.map(parser)
upsampled = series.resample('D').mean()
# fill the NaN days on a straight line between the known monthly values
interpolated = upsampled.interpolate(method='linear')
print(interpolated.head(32))
interpolated.plot()
pyplot.show()

Month
1901-01-01 266.000000
1901-01-02 262.125806
1901-01-03 258.251613
1901-01-04 254.377419
1901-01-05 250.503226
1901-01-06 246.629032
1901-01-07 242.754839
1901-01-08 238.880645
1901-01-09 235.006452
1901-01-10 231.132258
1901-01-11 227.258065
1901-01-12 223.383871
1901-01-13 219.509677
1901-01-14 215.635484
1901-01-15 211.761290
1901-01-16 207.887097
1901-01-17 204.012903
1901-01-18 200.138710
1901-01-19 196.264516
1901-01-20 192.390323
1901-01-21 188.516129
1901-01-22 184.641935
1901-01-23 180.767742
1901-01-24 176.893548
1901-01-25 173.019355
1901-01-26 169.145161
1901-01-27 165.270968
1901-01-28 161.396774
1901-01-29 157.522581
1901-01-30 153.648387
1901-01-31 149.774194
1901-02-01 145.900000
Freq: D, Name: Sales, dtype: float64

Line Plot of upsampled Shampoo Sales dataset with linear interpolation

Another common interpolation method is to use a polynomial or a spline to connect the values. This creates more curves and can look more natural on many datasets.

# upsample to daily intervals with spline interpolation
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = series.index.map(parser)
upsampled = series.resample('D').mean()
# fit a second-order spline through the known values (requires SciPy to be installed)
interpolated = upsampled.interpolate(method='spline', order=2)
print(interpolated.head(32))
interpolated.plot()
pyplot.show()

Month
1901-01-01 266.000000
1901-01-02 258.630160
1901-01-03 251.560886
1901-01-04 244.720748
1901-01-05 238.109746
1901-01-06 231.727880
1901-01-07 225.575149
1901-01-08 219.651553
1901-01-09 213.957094
1901-01-10 208.491770
1901-01-11 203.255582
1901-01-12 198.248529
1901-01-13 193.470612
1901-01-14 188.921831
1901-01-15 184.602185
1901-01-16 180.511676
1901-01-17 176.650301
1901-01-18 173.018063
1901-01-19 169.614960
1901-01-20 166.440993
1901-01-21 163.496161
1901-01-22 160.780465
1901-01-23 158.293905
1901-01-24 156.036481
1901-01-25 154.008192
1901-01-26 152.209039
1901-01-27 150.639021
1901-01-28 149.298139
1901-01-29 148.186393
1901-01-30 147.303783
1901-01-31 146.650308
1901-02-01 145.900000
Freq: D, Name: Sales, dtype: float64

Line Plot of upsampled Shampoo Sales dataset with spline interpolation

D. Downsampling Data

The sales data is monthly, but perhaps we would prefer the data to be quarterly. The year can be divided into 4 business quarters, 3 months apiece. Instead of creating new rows between existing observations, the resample() function in Pandas will group all observations by the new frequency.

# downsample to quarterly intervals
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = series.index.map(parser)
# 'Q' groups observations by calendar quarter (pandas 2.2 and later spell this alias 'QE')
resample = series.resample('Q')
# summarize each quarter by the mean of its three monthly values
quarterly_mean_sales = resample.mean()
print(quarterly_mean_sales.head())
quarterly_mean_sales.plot()
pyplot.show()

Month
1901-03-31 198.333333
1901-06-30 156.033333
1901-09-30 216.366667
1901-12-31 215.100000
1902-03-31 184.633333
Freq: Q-DEC, Name: Sales, dtype: float64

Line Plot of downsampling the Shampoo Sales dataset to quarterly mean values

Perhaps we want to go further and turn the monthly data into yearly data, and perhaps later use that to model the following year.

# downsample to yearly intervals
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
series.index = series.index.map(parser)
# 'A' groups observations by calendar year (pandas 2.2 and later spell this alias 'YE')
resample = series.resample('A')
# summarize each year by the total of its monthly values
yearly_sum_sales = resample.sum()
print(yearly_sum_sales.head())
yearly_sum_sales.plot()
pyplot.show()

Line Plot of downsampling the Shampoo Sales dataset to yearly sum values

05/11/2021

Data Preparation - Part 3 - Data Visualization

Line plots of observations over time are popular, but there is a suite of other plots that you can use to learn more about your problem.

The more you learn about your data, the more likely you are to develop a better forecasting model. In this tutorial, you will discover 6 different types of plots that you can use to visualize time series data with Python. Specifically, after completing this tutorial, you will know:

How to explore the temporal structure of time series with line plots, lag plots, and autocorrelation plots.
How to understand the distribution of observations using histograms and density plots.
How to tease out the change in distribution over intervals using box and whisker plots and heat map plots.

Data Preparation - Part 2 - Basic Feature Engineering

Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms.

There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.

In this tutorial, you will discover how to perform feature engineering on time series data with Python to model your time series problem with machine learning algorithms.

After completing this tutorial, you will know:

The rationale and goals of feature engineering time series data.
How to develop basic date-time based input features.
How to develop more sophisticated lag and sliding window summary statistics features.

Data Preparation - Part 1 - Load and Explore Time Series Data

The Pandas library in Python provides excellent, built-in support for time series data. Once loaded, Pandas also provides tools to explore and better understand your dataset. In this lesson, you will discover how to load and explore your time series dataset.

After completing this tutorial, you will know:

How to load your time series dataset from a CSV file using Pandas.
How to peek at the loaded data and query using date-times.
How to calculate and review summary statistics.

Time Series as Supervised Learning

Time series forecasting can be framed as a supervised learning problem. This re-framing of your time series data allows you access to the suite of standard linear and nonlinear machine learning algorithms on your problem. In this lesson, you will discover how you can re-frame your time series problem as a supervised learning problem for machine learning.
After reading this lesson, you will know:

What supervised learning is and how it is the foundation for all predictive modeling machine learning algorithms.
The sliding window method for framing a time series dataset and how to use it.
How to use the sliding window for multivariate data and multi-step forecasting.
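The sliding window method can be sketched as a function that pairs the previous observations (inputs) with the next observation (output). The function name `sliding_window`, the parameter `n_lags`, and the values are assumptions for this univariate, one-step example.

```python
def sliding_window(series, n_lags=1):
    """Turn a series into (inputs, output) rows using the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags : t], series[t]))
    return rows


# Toy series (made-up values).
series = [100, 110, 108, 115, 120]
rows = sliding_window(series, n_lags=2)
for inputs, output in rows:
    print(inputs, "->", output)
```

Each row can then be passed to any standard supervised learning algorithm, with the order of the rows preserved.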

What Is Time Series Forecasting

Time series forecasting is an important area of machine learning that is often neglected. It is important because there are so many prediction problems that involve a time component. These problems are neglected because it is this time component that makes time series problems more difficult to handle. In this lesson, you will discover time series forecasting.

After reading this lesson, you will know:

Standard definitions of time series, time series analysis, and time series forecasting.
The important components to consider in time series data.
Examples of time series to make your understanding concrete.
