25/08/2021

Data Preparation in a Machine Learning Project

Data preparation may be one of the most difficult steps in any machine learning project. The reason is that each dataset is different and highly specific to the project.

After completing this tutorial, you will know:
  • Each predictive modeling project with machine learning is different, but there are common steps performed on each project. 
  • Data preparation involves best exposing the unknown underlying structure of the problem to the learning algorithms.
  • The steps before and after data preparation in a project can inform what data preparation methods to apply, or at least explore.

This tutorial is divided into three parts; they are:
  • Applied Machine Learning Process
  • What is Data Preparation
  • How to choose Data Preparation Techniques

1. Applied Machine Learning Process

Even though your project is unique, the steps on the path to a good result are generally the same from project to project.  

Further, the steps are written sequentially, but in practice we will jump back and forth between the steps on any given project. We like to define the process using four high-level steps:
  • Define Problems
  • Prepare data
  • Evaluate Models
  • Finalize Models

Step 1: Define Problems

This step is concerned with learning enough about the project to select the framing or framings of the prediction task. For example, is it classification or regression, or some other higher-order problem type?

Step 2: Prepare data

This step is concerned with transforming raw data that was collected into a form that can be used in modeling.

Step 3: Evaluate Models

This step is concerned with evaluating machine learning models on your dataset. It requires that you design a robust test harness used to evaluate your model so that the results you get can be trusted and used to select among the models that you have evaluated.

This involves tasks such as selecting a performance metric for evaluating the skill of a model.

It is common to use k-fold cross-validation as a resampling technique, often with repeats of the process to improve the robustness of the result.
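
To make the resampling idea concrete, here is a minimal sketch of repeated k-fold cross-validation using only the Python standard library. The function name `repeated_kfold_indices` and its parameters are hypothetical and chosen for illustration; in practice a library routine (for example, `RepeatedKFold` in scikit-learn) would normally be used instead.

```python
import random

def repeated_kfold_indices(n_samples, k=5, repeats=3, seed=1):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        # Slicing with a stride keeps fold sizes within one of each other.
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            # Every fold except the current one forms the training set.
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

splits = list(repeated_kfold_indices(n_samples=10, k=5, repeats=3))
print(len(splits))  # 15 train/test splits: 5 folds x 3 repeats
```

A model would be fit on each training set and scored on the matching test set, and the scores averaged across all 15 splits to give a more stable estimate.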

This step also involves tasks for getting the most out of well-performing models such as hyperparameter tuning and ensembles of models.

Step 4: Finalize Models

Once a suite of models has been evaluated, you must choose a model that represents the solution to the project. This is called model selection and may involve further evaluation of candidate models on a hold-out validation dataset, or selection via other project-specific criteria such as model complexity.
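
As a rough illustration of reserving a hold-out validation dataset, the sketch below shuffles the rows and sets a fraction aside. The helper `holdout_split`, the 20% fraction, and the example rows are assumptions made for this example, not part of any specific library.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=1):
    """Return (train_rows, holdout_rows) after a seeded shuffle."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

rows = list(range(10))  # stand-in for ten rows of data
train, holdout = holdout_split(rows)
print(len(train), len(holdout))  # 8 2
```

The hold-out rows are kept away from model fitting and tuning, and are used only to compare the final candidate models.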


2. What is Data Preparation

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly. This is for reasons such as:
  • Machine learning algorithms require data to be numeric.
  • Some machine learning algorithms impose requirements on the data.
  • Statistical noise and errors in the data may need to be corrected.
  • Complex nonlinear relationships may be teased out of the data.

As such, raw data must be pre-processed prior to being used to fit and evaluate a machine learning model.
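
For example, a categorical column cannot be passed to most algorithms as text. One common way to turn it into numbers is one-hot encoding, sketched below with the standard library only. The function `one_hot_encode` and the `colors` column are hypothetical names used for illustration.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicator values."""
    categories = sorted(set(values))  # fixed, reproducible column order
    encoded = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```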

There are common or standard tasks that you may use or explore during the data preparation step in a machine learning project. These tasks include:
  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.
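
Two of the tasks above, data cleaning and a data transform, can be sketched in a few lines. This is a minimal illustration under stated assumptions: missing values are marked with `None`, mean imputation is an acceptable fix, and the helper names and the `ages` column are hypothetical.

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Data cleaning: replace missing values (None) with the column mean."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Data transform: rescale to zero mean and unit standard deviation.

    Assumes the column is not constant, otherwise sigma is zero.
    """
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

ages = [20.0, None, 40.0, 60.0]  # hypothetical column with one missing value
cleaned = impute_mean(ages)
scaled = standardize(cleaned)
print(cleaned)  # [20.0, 40.0, 40.0, 60.0]
```

Whether mean imputation or standardization is appropriate depends on the data and the algorithm, which is why these tasks are explored rather than applied blindly.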

3. How to choose Data Preparation Techniques

On the surface, this is a challenging question, but if we look at the data preparation step in the context of the whole project, it becomes more straightforward.

The steps in a predictive modeling project before and after the data preparation step inform the data preparation that may be required. The step before data preparation involves defining the problem. Defining the problem may involve many sub-tasks, such as:
  • Gather data from the problem domain.
  • Discuss the project with subject matter experts.
  • Select those variables to be used as inputs and outputs for a predictive model.
  • Review the data that has been collected.
  • Summarize the collected data using statistical methods.
  • Visualize the collected data using plots and charts.
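
Summarizing the collected data with statistical methods can be as simple as the sketch below, which uses the standard `statistics` module. The `summarize` helper and the `heights_cm` column are hypothetical and only meant to show the kind of summary that can hint at which preparation is needed.

```python
from statistics import mean, median, stdev

def summarize(column):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "count": len(column),
        "min": min(column),
        "max": max(column),
        "mean": mean(column),
        "median": median(column),
        "stdev": stdev(column),  # sample standard deviation
    }

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical column
summary = summarize(heights_cm)
for name, value in summary.items():
    print(name, value)
```

A large gap between the mean and median, or a very wide range, might suggest that outlier handling or a distribution-changing transform is worth exploring.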

There may also be interplay between the data preparation step and the evaluation of models. Model evaluation may involve sub-tasks such as:
  • Select a performance metric for evaluating model predictive skill.
  • Select a model evaluation procedure.
  • Select algorithms to evaluate.
  • Tune algorithm hyperparameters.
  • Combine predictive models into ensembles.

The choice of algorithms may impose requirements and expectations on the type and form of input variables in the data. This might require variables to have a specific probability distribution, the removal of correlated input variables, and/or the removal of variables that are not strongly related to the target variable.
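
As one illustration of such a requirement, the sketch below flags pairs of input variables with a high Pearson correlation, which are candidates for removal. The 0.9 threshold, the helper names, and the example columns are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return pairs of column names whose absolute correlation exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]

columns = {  # hypothetical input variables
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30.0, 22.0, 41.0, 35.0],
}
pairs = highly_correlated_pairs(columns)
print(pairs)  # [('height_cm', 'height_in')]
```

Here the two height columns carry almost the same information, so one of them could be dropped before fitting an algorithm that is sensitive to correlated inputs.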

The choice of performance metric may also require careful preparation of the target variable in order to meet expectations. For example, if regression models are scored on prediction error in a specific unit of measure, any scaling transforms applied to the target variable for modeling must be inverted before the error is calculated.
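
The following minimal sketch shows the idea of inverting a scaling transform on the target before scoring. It assumes min-max scaling and mean absolute error as the metric; the helper names, the `prices` target, and the scaled predictions are hypothetical values made up for illustration.

```python
def fit_min_max(values):
    """Learn the minimum and maximum used for scaling."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values into the range [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def invert_scale(values, lo, hi):
    """Map scaled values back to the original unit of measure."""
    return [v * (hi - lo) + lo for v in values]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

prices = [100.0, 200.0, 300.0]  # hypothetical target in original units
lo, hi = fit_min_max(prices)
scaled_prices = scale(prices, lo, hi)  # what a model would be trained on

scaled_predictions = [0.1, 0.5, 0.9]  # hypothetical model outputs, scaled space
predictions = invert_scale(scaled_predictions, lo, hi)

mae = mean_absolute_error(prices, predictions)
print(round(mae, 2))  # 13.33, expressed in the original price units
```

Scoring the scaled predictions directly would report an error of about 0.07, which has no meaning in the original unit of measure.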

These examples, and more, highlight that although data preparation is an important step in a predictive modeling project, it does not stand alone. Instead, it is strongly influenced by the tasks performed both before and after data preparation.

This highlights the highly iterative nature of any predictive modeling project.
  
