Data is the currency of applied machine learning. Therefore, it is important that it is both collected and used effectively.
Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter.
Data resampling, in contrast, refers to methods for economically using a collected dataset to improve the estimate of the population parameter and to help quantify the uncertainty of that estimate.
Both data sampling and data resampling are methods that are required in a predictive modeling problem.
After completing this tutorial, you will know:
- Sampling is an active process of gathering observations with the intent of estimating a population variable.
- Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of a population parameter.
- Some evaluation procedures, such as nested cross-validation, in fact apply one resampling method inside another.
This tutorial is divided into two parts; they are:
- Statistical Sampling
- Statistical Resampling
A. Statistical Sampling
Each row of data represents an observation about something in the world. When working with data, we often do not have access to all possible observations. This could be for many reasons; for example:
- It may be difficult or expensive to make more observations.
- It may be challenging to gather all observations together.
- More observations are expected to be made in the future.
Even if we use big data infrastructure on all available data, the data still represents a sample of observations from an idealized population.
Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.
1. How to Sample
Statistical sampling is the process of selecting subsets of examples from a population with the objective of estimating properties of the population. Sampling is an active process.
There are many benefits to sampling compared to working with complete datasets, including reduced cost and greater speed.
Some aspects to consider prior to collecting a data sample include:
- Sample Goal: The population property that you wish to estimate using the sample.
- Population: The scope or domain from which observations could theoretically be made.
- Selection Criteria: The methodology that will be used to accept or reject observations in your sample.
- Sample Size: The number of observations that will constitute the sample.
Statistical sampling is a large field of study, but in applied machine learning, there are three types of sampling that you are likely to use.
- Simple Random Sampling: Samples are drawn with a uniform probability from the domain.
- Systematic Sampling: Samples are drawn using a pre-specified pattern, such as at intervals.
- Stratified Sampling: Samples are drawn within pre-specified categories.
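The three sampling schemes above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy population of 100 values; the seed, sample sizes, and the two-stratum split are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = np.arange(100)  # toy population of 100 observations

# Simple random sampling: uniform probability, drawn without replacement
simple = rng.choice(population, size=10, replace=False)

# Systematic sampling: every k-th observation from a random starting point
k = len(population) // 10
start = rng.integers(k)
systematic = population[start::k]

# Stratified sampling: draw separately within pre-specified categories
strata = {"low": population[:50], "high": population[50:]}
stratified = np.concatenate(
    [rng.choice(group, size=5, replace=False) for group in strata.values()]
)

print(len(simple), len(systematic), len(stratified))
```

Note that the stratified draw guarantees each category is represented in proportion to the per-stratum sizes you choose, which a simple random draw does not.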
2. Sampling Errors
Sampling requires that we make a statistical inference about the population from a small set of observations.
We can generalize properties from the sample to the population. This process of estimation and generalization is much faster than working with all possible observations, but will contain errors.
There are many ways to introduce errors into your data sample. Two main types of errors include selection bias and sampling error:
- Selection Bias. Caused when the method of drawing observations skews the sample in some way.
- Sampling Error. Caused by the random nature of drawing observations, which skews the sample in some way.
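Sampling error can be demonstrated with a small simulation: repeatedly draw samples of a given size from a known population, estimate the mean each time, and measure how much the estimates vary. The population parameters and repeat counts below are arbitrary values chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.normal(loc=50.0, scale=5.0, size=100_000)

# For each sample size, repeat the draw-and-estimate process and
# measure the spread (standard deviation) of the mean estimates.
spreads = []
for n in (10, 100, 1000):
    estimates = [rng.choice(population, size=n).mean() for _ in range(200)]
    spreads.append(float(np.std(estimates)))
    print(f"n={n:4d}  spread of mean estimates={spreads[-1]:.3f}")
```

Larger samples give a tighter spread of estimates, i.e. less sampling error, though the error never disappears entirely.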
B. Statistical Resampling
Once we have a data sample, it can be used to estimate the population parameter.
The problem is that we only have a single estimate of the population parameter, with little idea of the variability or uncertainty in the estimate.
One way to address this is by estimating the population parameter multiple times from our data sample. This is called resampling.
The resampling methods are easy to learn and easy to apply. They require no mathematics beyond introductory high-school algebra, but are applicable in an exceptionally broad range of subject areas.
A downside of the methods is that they can be computationally very expensive, requiring tens, hundreds, or even thousands of resamples in order to develop a robust estimate of the population parameter.
The key idea is to resample from the original data - either directly or via a fitted model - to create replicate datasets.
Each new subsample from the original data sample is used to estimate the population parameter.
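The idea of estimating the population parameter many times from one sample can be sketched with a bootstrap of the mean. The sample itself is simulated here, and the 1,000 resamples and the percentile interval are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
sample = rng.normal(loc=20.0, scale=3.0, size=50)  # our single data sample

# Re-estimate the mean from many bootstrap resamples of the sample
estimates = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

# The spread of the estimates quantifies uncertainty; a simple
# percentile interval gives an approximate 95% confidence range
low, high = np.percentile(estimates, [2.5, 97.5])
print(f"mean estimate: {sample.mean():.2f}, "
      f"95% interval: [{low:.2f}, {high:.2f}]")
```

Instead of a single point estimate, we now have a whole distribution of estimates from which to judge uncertainty.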
Two commonly used resampling methods that you may encounter are k-fold cross-validation and the bootstrap.
- Bootstrap. Samples are drawn from the dataset with replacement (allowing the same observation to appear more than once), and the instances not drawn into the sample may be used as the test set.
- k-fold Cross-Validation. A dataset is partitioned into k groups, where each group is given the opportunity to be used as a held-out test set, leaving the remaining groups as the training set.
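Both splitting schemes can be sketched on a small toy dataset, the bootstrap with plain NumPy indexing and the k-fold split with scikit-learn's KFold class. The 12-element dataset, seeds, and k=3 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(12)
rng = np.random.default_rng(seed=3)

# Bootstrap: draw indices with replacement; the rows never drawn
# (the "out-of-bag" instances) can serve as a test set.
boot_idx = rng.integers(0, len(data), size=len(data))
oob_idx = np.setdiff1d(np.arange(len(data)), boot_idx)
train, test = data[boot_idx], data[oob_idx]

# k-fold cross-validation: each of the k groups is used once as the
# held-out test set while the remaining groups form the training set.
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
for train_idx, test_idx in kfold.split(data):
    print("train:", data[train_idx], "test:", data[test_idx])
```

In k-fold cross-validation every observation appears in exactly one test fold, whereas in the bootstrap the out-of-bag set varies in size from resample to resample.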