The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
It can be used to estimate summary statistics such as the mean or standard deviation.
It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.
In this tutorial, you will discover the bootstrap resampling method for estimating the skill of machine learning models on unseen data.
After completing this tutorial, you will know:
- The bootstrap method involves iteratively resampling a dataset with replacement.
- When using the bootstrap, you must choose the size of the sample and the number of repeats.
- The scikit-learn library provides a function that you can use to resample a dataset for the bootstrap method.
This tutorial is divided into 4 parts; they are:
- Bootstrap Method
- Configuration of the Bootstrap
- Worked Example
- Bootstrap in Python
A. Bootstrap Method
The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.
Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.
The process for building one sample can be summarized as follows (a small Python sketch follows the list):
- Choose the size of the sample.
- While the size of the sample is less than the chosen size
- Randomly select an observation from the dataset
- Add it to the sample
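A minimal sketch of this loop in plain Python might look like the following; the six-value dataset and the sample size of 4 are illustrative assumptions for the sketch.

import random

random.seed(1)

# illustrative dataset and chosen sample size (assumptions for this sketch)
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample_size = 4

# draw observations one at a time, with replacement
sample = []
while len(sample) < sample_size:
    observation = random.choice(data)  # the chosen value is not removed from the dataset
    sample.append(observation)

print(sample)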
The bootstrap method can be used to estimate a quantity of a population. This is done by repeatedly taking small samples, calculating the statistic, and taking the average of the calculated statistics.
We can summarize this procedure as follows (see the sketch after the list):
- Choose a number of bootstrap samples to perform
- Choose a sample size
- For each bootstrap sample
- Draw a sample with replacement with the chosen size
- Calculate the statistic on the sample
- Calculate the mean of the calculated sample statistics.
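As a hedged sketch of this procedure, assuming the statistic of interest is the mean and using a synthetic data sample (both illustrative choices), the loop might look like the following.

import random

random.seed(1)

# illustrative synthetic data sample (an assumption for this sketch)
data = [random.gauss(50, 5) for _ in range(100)]

n_bootstraps = 100        # number of bootstrap samples to perform
sample_size = len(data)   # chosen sample size

# draw each bootstrap sample with replacement and record its statistic
statistics = []
for _ in range(n_bootstraps):
    sample = [random.choice(data) for _ in range(sample_size)]
    statistics.append(sum(sample) / len(sample))  # the statistic: the sample mean

# the final estimate is the mean of the calculated sample statistics
estimate = sum(statistics) / len(statistics)
print('Estimated mean: %.3f' % estimate)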
The procedure can also be used to estimate the skill of a machine learning model. This is done by training the model on the sample and evaluating the skill of the model on those observations not included in the sample. The observations not included in a given bootstrap sample are called the out-of-bag samples, or OOB for short.
This procedure of using the bootstrap method to estimate the skill of the model can be summarized as follows (a sketch follows the list):
- Choose a number of bootstrap samples to perform
- Choose a sample size
- For each bootstrap sample
- Draw a sample with replacement with the chosen size
- Fit a model on the data sample
- Estimate the skill of the model on the out-of-bag sample.
- Calculate the mean of the sample of model skill estimates.
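A sketch of this loop using scikit-learn is shown below; the synthetic classification dataset, the DecisionTreeClassifier model, and classification accuracy as the measure of skill are illustrative assumptions, not part of the procedure itself.

from numpy import arange, mean
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# illustrative synthetic dataset and bootstrap configuration (assumptions for this sketch)
X, y = make_classification(n_samples=100, random_state=1)
n_bootstraps = 30
n_size = len(X)

scores = []
for i in range(n_bootstraps):
    # draw a bootstrap sample of row indices, with replacement
    ix = resample(arange(n_size), replace=True, n_samples=n_size, random_state=i)
    # rows not drawn form the out-of-bag sample
    oob_ix = [j for j in range(n_size) if j not in ix]
    # fit a model on the bootstrap sample
    model = DecisionTreeClassifier().fit(X[ix], y[ix])
    # estimate the skill of the model on the out-of-bag sample
    scores.append(accuracy_score(y[oob_ix], model.predict(X[oob_ix])))

# the final estimate is the mean of the skill estimates
print('Mean OOB accuracy: %.3f' % mean(scores))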
B. Configuration of the Bootstrap
There are two parameters that must be chosen when performing the bootstrap: the size of the sample and the number of repetitions of the procedure to perform.
1. Sample Size
In machine learning, it is common to use a sample size that is the same as the original dataset.
Some observations will be represented multiple times in the bootstrap sample while others will not be selected at all.
If the dataset is enormous and computational efficiency is an issue, smaller samples can be used, such as 50% or 80% of the size of the dataset.
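For example, the chosen size maps directly to the n_samples argument of scikit-learn's resample() function, covered later in this tutorial; the 1,000-row dataset below is an illustrative assumption.

from sklearn.utils import resample

data = list(range(1000))  # illustrative dataset (an assumption for this sketch)

# sample size equal to the original dataset (the common choice)
boot_full = resample(data, replace=True, n_samples=len(data), random_state=1)

# a reduced sample size, e.g. 50% of the dataset, for very large datasets
boot_half = resample(data, replace=True, n_samples=int(0.5 * len(data)), random_state=1)

print(len(boot_full), len(boot_half))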
2. Repetitions
The number of repetitions must be large enough to ensure that meaningful statistics, such as the mean, standard deviation, and standard error, can be calculated on the sample.
A minimum might be 20 or 30 repetitions.
Ideally, the sample of estimates would be as large as possible given the time resources, with hundreds or thousands of repeats.
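As a sketch, once the repeats have been run, the resulting sample of bootstrap statistics can be summarized as follows; the list of statistics here is an illustrative placeholder, not real results.

from numpy import mean, sqrt, std

# illustrative placeholder sample of 20 bootstrap statistics (an assumption for this sketch)
statistics = [0.62, 0.58, 0.65, 0.61, 0.59, 0.63, 0.60, 0.64, 0.57, 0.66,
              0.61, 0.62, 0.60, 0.63, 0.59, 0.64, 0.58, 0.65, 0.62, 0.61]

print('Mean: %.3f' % mean(statistics))
print('Standard deviation: %.3f' % std(statistics))
# standard error of the mean across the repeats
print('Standard error: %.3f' % (std(statistics) / sqrt(len(statistics))))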
C. Worked Example
We can make the bootstrap procedure concrete with a small worked example. We will work through one iteration of the procedure. Imagine we have a dataset with 6 observations:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
The first step is to choose the size of the sample. Here, we will use 4.
Next, we must randomly choose the first observation from the dataset. Let’s choose 0.2.
sample = [0.2]
This observation is returned to the dataset and we repeat this step 3 more times.
sample = [0.2, 0.1, 0.2, 0.6]
The example purposefully demonstrates that the same value can appear zero, one or more times in the sample. Here the observation 0.2 appears twice. An estimate can then be calculated on the drawn sample.
statistic = calculation([0.2, 0.1, 0.2, 0.6])
Those observations not chosen for the sample may be used as out-of-bag observations.
oob = [0.3, 0.4, 0.5]
In the case of evaluating a machine learning model, the model is fit on the drawn sample and evaluated on the out-of-bag sample.
train = [0.2, 0.1, 0.2, 0.6]
test = [0.3, 0.4, 0.5]
model = fit(train)
statistic = evaluate(model, test)
That concludes one repeat of the procedure. It can be repeated 30 or more times to give a sample of calculated statistics.
statistics = [...]
This sample of statistics can then be summarized by calculating a mean, standard deviation, or other summary values to give a final usable estimate of the statistic.
estimate = mean([...])
D. Bootstrap in Python
The scikit-learn library provides an implementation that will create a single bootstrap sample of a dataset: the resample() function.
# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)
# out of bag observations (this check works here because all values in data are unique)
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)
-----Result-----
Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]
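Note that the list comprehension above recovers the out-of-bag values correctly only because every value in data is unique; if the dataset contained duplicate values, every copy of a drawn value would be excluded. A hedged alternative sketch, assuming we resample row indices instead of values, avoids this.

from numpy import arange
from sklearn.utils import resample

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# resample row indices rather than values, so duplicate values are handled correctly
ix = resample(arange(len(data)), replace=True, n_samples=4, random_state=1)
boot = [data[i] for i in ix]
oob = [data[i] for i in range(len(data)) if i not in ix]
print('Bootstrap Sample: %s' % boot)
print('OOB Sample: %s' % oob)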