18/08/2021

Examples Of Statistics In Machine Learning

Statistics and machine learning are two very closely related fields. In fact, the line between the two can be very fuzzy at times.

It would be fair to say that statistical methods are required to effectively work through a machine learning predictive modeling project.

We are going to look at 10 examples of where statistical methods are used in an applied machine learning project.
  • Problem Framing
  • Data Understanding
  • Data Cleaning
  • Data Selection
  • Data Preparation
  • Model Evaluation
  • Model Configuration
  • Model Selection
  • Model Presentation
  • Model Predictions

1. Problem Framing

This is the selection of the type of problem, such as regression or classification, and perhaps the structure and types of the inputs and outputs for the problem.

The framing of the problem is not always obvious. For newcomers to a domain, it may require significant exploration of the observations in the domain.

Even domain experts who are stuck seeing the issues from a conventional perspective may benefit from considering the data from multiple perspectives.

Statistical methods that can aid in the exploration of the data during the framing of a problem include:

Exploratory Data Analysis: summarization and visualization used to explore ad hoc views of the data.

Data Mining: Automatic discovery of structured relationships and patterns in the data.


2. Data Understanding

Data understanding means having an intimate grasp of both the distribution of variables and the relationship between the variables.

Some of this knowledge may come from domain expertise. Nevertheless, both experts and novices to a field of study will benefit from actually handling real observations from the domain.

Summary Statistics: methods used to summarize the distribution of and relationships between variables using statistical quantities.

Data Visualizations: methods used to summarize the distribution of and relationships between variables using visualizations such as charts, plots, and graphs.
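As a minimal sketch of summary statistics, the following uses Python's standard `statistics` module on a hypothetical sample (the variable name and values are illustrative, not from any real dataset):

```python
import statistics

# A hypothetical sample of a single variable (e.g. heights in cm)
heights = [162.1, 168.5, 171.0, 175.4, 180.2, 158.9, 169.7]

# Summary statistics that describe the distribution of the variable
mean = statistics.mean(heights)      # central tendency
median = statistics.median(heights)  # central tendency, robust to outliers
stdev = statistics.stdev(heights)    # spread around the mean

print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")
```

Comparing the mean and median is a quick first check: a large gap between them suggests a skewed distribution or the presence of outliers.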


3. Data Cleaning

Data may have the following problems:
  • Data corruption
  • Data errors
  • Data loss
The process of identifying and repairing issues with the data is called data cleaning.

Statistical methods are used for data cleaning; for example:

Outlier detection: methods for identifying observations that are far from the expected value in the distribution.

Imputation: methods for repairing or filling in corrupt or missing values in observations.
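A minimal sketch of both methods, assuming a small hypothetical sample where `None` marks a missing value; mean imputation and the 2-standard-deviation threshold are just two simple choices among many:

```python
import statistics

# Hypothetical observations with a missing value (None) and an outlier
values = [5.1, 4.9, 5.3, 5.0, None, 5.2, 25.0]

# Imputation: fill the missing value with the mean of the observed values
observed = [v for v in values if v is not None]
fill_value = statistics.mean(observed)
filled = [v if v is not None else fill_value for v in values]

# Outlier detection: flag observations more than 2 standard deviations
# from the mean of the (now complete) sample
mu = statistics.mean(filled)
sigma = statistics.stdev(filled)
outliers = [v for v in filled if abs(v - mu) > 2 * sigma]
```

Note that in practice the order matters: an extreme outlier inflates the mean, so it can be preferable to detect and handle outliers before imputing with a statistic of the distribution.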


4. Data Selection                

Not all observations or all variables may be relevant when modeling. The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection.

Two types of statistical methods are used for data selection:

Data Sampling: methods to systematically create smaller representative samples from larger datasets.

Feature Selection: methods to automatically identify those variables that are most relevant to the outcome variable.
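One simple feature selection strategy is to keep only the variables that correlate strongly with the outcome. The sketch below implements this with a hand-rolled Pearson correlation; the feature names, values, and the 0.8 threshold are all hypothetical:

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical features and outcome variable
features = {
    "x1": [1, 2, 3, 4, 5],   # strongly related to y
    "x2": [3, 1, 4, 1, 5],   # unrelated noise
}
y = [2, 4, 6, 8, 10]

# Keep features whose absolute correlation with the outcome exceeds 0.8
selected = [name for name, col in features.items()
            if abs(pearson(col, y)) > 0.8]
```

Correlation-based filtering only captures linear relationships; other feature selection methods (e.g. mutual information or wrapper methods) can catch nonlinear ones.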


5. Data Preparation

Data often cannot be used directly for modeling. Some transformation is often required in order to change the shape or structure of the data to make it more suitable for the chosen framing of the problem or the learning algorithms.

Scaling: methods such as standardization and normalization.

Encoding: methods such as integer encoding and one hot encoding.

Transforms: methods such as power transforms like the Box-Cox method.
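The scaling and encoding transforms above can be sketched in a few lines of plain Python; the feature values and category labels are hypothetical:

```python
# Hypothetical raw feature values and categories
values = [50.0, 20.0, 30.0, 90.0, 10.0]
colors = ["red", "green", "blue", "green", "red"]

# Normalization: rescale values into the range [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Integer encoding: map each category to an integer index
labels = sorted(set(colors))                      # ['blue', 'green', 'red']
encoding = {label: i for i, label in enumerate(labels)}
encoded = [encoding[c] for c in colors]

# One hot encoding: one binary column per category
one_hot = [[1 if encoding[c] == i else 0 for i in range(len(labels))]
           for c in colors]
```

Integer encoding implies an ordering between categories, which is often meaningless for nominal variables like color; one hot encoding avoids that implied ordering at the cost of more columns.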
 

6. Model Evaluation

A crucial part of a predictive modeling project is evaluating a learning method. This often requires estimating the skill of the model when making predictions on data not used to train the model.

Experimental Design: methods to design systematic experiments to compare the effect of the independent variable on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.

As part of implementing an experimental design, methods are used to resample a dataset in order to make economical use of the available data when estimating the skill of the model.

Resampling Methods: methods for systematically splitting a dataset into subsets for the purpose of training and evaluating a predictive model.
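A minimal sketch of one such resampling method, k-fold cross-validation, which splits the row indices of a dataset into k disjoint test folds (the dataset size, k, and seed are arbitrary illustrative choices):

```python
import random

def k_fold_splits(n_rows, k, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed for reproducibility
    fold_size = n_rows // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [idx for idx in indices if idx not in test]
        yield train, test

# Each row appears in exactly one test fold (when k divides n_rows evenly)
splits = list(k_fold_splits(10, 5))
```

Averaging a model's score across the k test folds gives a less noisy estimate of skill than a single train/test split, at the cost of k model fits.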
   

7. Model Configuration

A given machine learning algorithm often has a suite of hyperparameters that allow the learning method to be tailored to a specific problem.

The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model. 

The interpretation and comparison of the results between different hyperparameter configurations is made using one of two subfields of statistics, namely:

Statistical Hypothesis Tests: methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).

Estimation Statistics: methods that quantify the uncertainty of a result using confidence intervals.
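As a sketch of a statistical hypothesis test applied to hyperparameter comparison, the following computes a paired t statistic over per-fold scores from two configurations. The scores are hypothetical, and the critical value is the standard two-tailed value for alpha=0.05 with 4 degrees of freedom:

```python
import statistics

# Hypothetical accuracy scores for two hyperparameter configurations,
# evaluated on the same five cross-validation folds
config_a = [0.81, 0.79, 0.84, 0.80, 0.82]
config_b = [0.76, 0.74, 0.78, 0.75, 0.77]

# Paired t statistic on the per-fold differences
diffs = [a - b for a, b in zip(config_a, config_b)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
t_stat = mean_d / (sd_d / len(diffs) ** 0.5)

# Two-tailed critical value for alpha=0.05, 4 degrees of freedom
critical = 2.776
significant = abs(t_stat) > critical
```

A significant result suggests the observed difference between the two configurations is unlikely to be due to the randomness of the folds alone; pairing on the same folds is what makes the comparison fair.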


8. Model Selection

One among many machine learning algorithms may be appropriate for a given predictive modeling problem. The process of selecting one method as the solution is called model selection.

As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection.

Statistical Hypothesis Tests: methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).

Estimation Statistics: methods that quantify the uncertainty of a result using confidence intervals.
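A sketch of estimation statistics applied to model selection: compute a confidence interval around each candidate model's mean cross-validation score and compare. The model names, scores, and t value (for 4 degrees of freedom at 95% confidence) are illustrative assumptions:

```python
import statistics

# Hypothetical cross-validation accuracy scores for two candidate models
scores = {
    "logistic_regression": [0.80, 0.82, 0.78, 0.81, 0.79],
    "decision_tree": [0.74, 0.77, 0.72, 0.75, 0.73],
}

def confidence_interval(sample, t_value=2.776):
    """95% confidence interval for the mean (t value for 4 degrees of freedom)."""
    mean = statistics.mean(sample)
    margin = t_value * statistics.stdev(sample) / len(sample) ** 0.5
    return mean - margin, mean + margin

intervals = {name: confidence_interval(s) for name, s in scores.items()}

# Select the model with the highest mean score
best = max(scores, key=lambda name: statistics.mean(scores[name]))
```

If the intervals do not overlap, as in this hypothetical case, the difference in skill is unlikely to be an artifact of the particular folds used.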


9. Model Presentation

Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data. 

A part of presenting a final model involves presenting the estimated skill of the model. 

Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals.

Estimation Statistics: methods that quantify the uncertainty of a result using confidence intervals.
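For a classifier, a confidence interval on the estimated accuracy can be sketched with the normal approximation to the binomial distribution; the test-set size and number of correct predictions below are hypothetical:

```python
# Hypothetical final-model result: 88 correct predictions on 100 test examples
correct, n = 88, 100
accuracy = correct / n

# Approximate 95% confidence interval for classification accuracy,
# using the normal approximation to the binomial distribution
z = 1.96
margin = z * (accuracy * (1 - accuracy) / n) ** 0.5
lower, upper = accuracy - margin, accuracy + margin
```

Presenting the interval ("roughly 82% to 94% accuracy") is more honest to stakeholders than the point estimate alone, and the interval narrows as the test set grows.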


10. Model Predictions

Finally, it will come time to start using a final model to make predictions for new data where we do not know the real outcome. 

As part of making predictions, it is important to quantify the confidence of the prediction. 

Just like with the process of model presentation, we can use methods from the field of estimation statistics to quantify this uncertainty, such as confidence intervals and prediction intervals.

Estimation Statistics: methods that quantify the uncertainty of a result using confidence intervals.
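One simple way to sketch a prediction interval for a regression model is from the spread of its residuals on held-out data, assuming the residuals are roughly Gaussian; the residual values and point prediction below are hypothetical:

```python
import statistics

# Hypothetical residuals (actual - predicted) from a held-out validation set
residuals = [-1.2, 0.8, 0.3, -0.5, 1.1, -0.9, 0.4, 0.0]

# A simple 95% prediction interval: point prediction +/- 1.96 standard
# deviations of the residuals (assumes roughly Gaussian residuals)
point_prediction = 42.0
margin = 1.96 * statistics.stdev(residuals)
lower, upper = point_prediction - margin, point_prediction + margin
```

Note the distinction from the previous section: a confidence interval bounds an estimated quantity such as mean skill, while a prediction interval bounds a single future observation and is therefore wider.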

