
26/08/2021

Why Data Preparation is So Important

Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine. 

As such, the most challenging part of each predictive modeling project is how to prepare the one thing that is unique to the project: the data used for modeling. 

In this tutorial, you will discover the importance of data preparation for each machine learning project. 

After completing this tutorial, you will know:
  • Structured data in machine learning consists of rows and columns.
  • Data preparation is a required step in each machine learning project.
  • The routineness of machine learning algorithms means the majority of effort on each project is spent on data preparation.

This tutorial is divided into three parts; they are:
  • What Is Data in Machine Learning?
  • Raw Data Must Be Prepared
  • Predictive Modeling Is Mostly Data Preparation

1. What Is Data in Machine Learning?

What we call data are observations of real-world phenomena. Each piece of data provides a small window into a limited aspect of reality.

The most common type of input data is typically referred to as tabular data or structured data. This is data as you might see it in a spreadsheet, in a database, or in a comma-separated values (CSV) file. This is the type of data that we will focus on.

Think of a large table of data. In linear algebra, we refer to this table of data as a matrix.

Row: A single example from the domain, often called an instance, example or sample in machine learning.

Column: A single property recorded for each example, often called a variable, predictor, or feature in machine learning.

The rows used to train a model are referred to as the training dataset and the rows used to evaluate the model are referred to as the test dataset.

Input Variables: Columns in the dataset provided to a model in order to make a prediction.

Output Variable: Column in the dataset to be predicted by a model.
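
As a rough sketch of how these terms map to code, the snippet below loads a hypothetical file named data.csv with pandas, splits its columns into input variables (X) and an output variable (y), and holds back rows as a test dataset with scikit-learn. The file name and the assumption that the last column is the target are illustrative only.

    from pandas import read_csv
    from sklearn.model_selection import train_test_split

    # load the whole table; 'data.csv' is a hypothetical file name
    df = read_csv('data.csv')

    # input variables: every column except the last
    X = df.iloc[:, :-1]
    # output variable: the last column, the value to be predicted
    y = df.iloc[:, -1]

    # rows used to fit the model (training dataset) and rows held back
    # to evaluate it (test dataset)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    print(X_train.shape, X_test.shape)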

When you collect your data, you may have to transform it so it forms one large table. For example, if you have your data in a relational database, it is common to represent entities in separate tables in what is referred to as a normal form so that redundancy is minimized.  In order to create one large table with one row per subject or entity that you want to model, you may need to reverse this process and introduce redundancy in the data in a process referred to as denormalization.

If your data is in a spreadsheet or database, it is standard practice to extract and save the data in CSV format. This is a standard representation that is portable, well understood, and ready for the predictive modeling process with no external dependencies. 
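The sketch below illustrates this idea with pandas, using two small made-up tables (customers and orders). The table names and columns are hypothetical; the merge simply shows one way that denormalization and export to CSV might look in practice.

    from pandas import DataFrame

    # hypothetical normalized tables: one row per customer, many rows per order
    customers = DataFrame({'customer_id': [1, 2], 'region': ['north', 'south']})
    orders = DataFrame({'customer_id': [1, 1, 2], 'amount': [10.0, 15.0, 7.5]})

    # denormalize: join so each row carries the (now redundant) customer attributes
    flat = orders.merge(customers, on='customer_id', how='left')

    # save the single large table as CSV, ready for the modeling process
    flat.to_csv('orders_flat.csv', index=False)
    print(flat)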


2. Raw Data Must Be Prepared

Raw data: Data in the form provided from the domain.

In almost all cases, raw data will need to be changed before you can use it as the basis for modeling with machine learning.

A. Machine Learning Algorithms Expect Numbers

Even though your data is represented in one large table of rows and columns, the variables in the table may have different data types. 

Some variables may be numeric, such as integers, floating-point values, ranks, rates, percentages, and so on. 

Other variables may be names, categories, or labels represented with characters or words, and some may be binary, represented with 0 and 1 or True and False. The problem is, machine learning algorithms at their core operate on numeric data. 

They take numbers as input and predict a number as output. All data is seen as vectors and matrices, using the terminology from linear algebra.

As such, raw data must be changed prior to training, evaluating, and using machine learning models. 

Sometimes the changes to the data can be managed internally by the machine learning algorithm; most commonly, this must be handled by the machine learning practitioner prior to modeling in what is commonly referred to as data preparation or data pre-processing.
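
As a minimal illustration of this encoding step, the sketch below uses scikit-learn's OrdinalEncoder and LabelEncoder to map made-up category labels to integers. It is one common approach among several (one hot encoding is another), not the only way to prepare categorical data.

    from numpy import array
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    # hypothetical raw input column of category labels
    colors = array([['red'], ['green'], ['blue'], ['green']])

    # map each category to an integer so the algorithm receives numbers
    colors_numeric = OrdinalEncoder().fit_transform(colors)
    print(colors_numeric)

    # a categorical target variable can be encoded with LabelEncoder
    labels = ['spam', 'ham', 'spam']
    print(LabelEncoder().fit_transform(labels))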


B. Machine Learning Algorithms Have Requirements

Some algorithms are known to perform worse if there are input variables that are irrelevant or redundant to the target variable. 

There are also algorithms that are negatively impacted if two or more input variables are highly correlated. 

In these cases, irrelevant or highly correlated variables may need to be identified and removed, or alternate algorithms may need to be used.

There are also algorithms that have very few requirements about the probability distribution of input variables or the presence of redundancies, but in turn, may require many more examples (rows) in order to learn how to make good predictions.
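
One simple way to spot such redundancy, sketched below with pandas on a small synthetic dataset, is to compute the absolute correlation matrix of the input variables and flag pairs above a chosen threshold. The 0.95 cutoff and the column names are arbitrary choices for illustration.

    from numpy.random import default_rng
    from pandas import DataFrame

    # hypothetical dataset in which x2 is almost a copy of x1
    rng = default_rng(1)
    df = DataFrame({'x1': rng.normal(size=100), 'x3': rng.normal(size=100)})
    df['x2'] = df['x1'] * 2.0 + rng.normal(scale=0.01, size=100)

    # absolute pairwise correlation between input variables
    corr = df.corr().abs()

    # flag pairs above an arbitrary threshold (0.95 here) as candidates for removal
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > 0.95:
                print(f'{a} and {b} are highly correlated ({corr.loc[a, b]:.2f})')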


C. Model Performance Depends on Data

The performance of a machine learning algorithm is only as good as the data used to train it.

A dataset may be a weak representation of the problem we are trying to solve for many reasons, but the reasons generally fall into two classes.
First, complex nonlinear relationships may be compressed in the raw data and need to be unpacked using data preparation techniques. Second, the data may not be perfect, with issues ranging from mild random fluctuations in the observations, referred to as statistical noise, to errors that result in out-of-range values and conflicting data.

Complex Data: Raw data contains compressed complex nonlinear relationships that may need to be exposed.

Messy Data: Raw data contains statistical noise, errors, missing values, and conflicting examples.
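
As a small sketch of handling messy data, the example below replaces an assumed out-of-range sentinel value with a missing-value marker and fills the gaps with the column mean using scikit-learn's SimpleImputer. Both the sentinel value and the mean strategy are illustrative assumptions, not a recommendation.

    from numpy import nan
    from pandas import DataFrame
    from sklearn.impute import SimpleImputer

    # hypothetical messy column: a missing value and an out-of-range sentinel
    df = DataFrame({'age': [33.0, 41.0, nan, 29.0, -999.0]})

    # treat the out-of-range sentinel as missing as well
    df['age'] = df['age'].replace(-999.0, nan)

    # fill missing values with the column mean (one simple strategy among many)
    df[['age']] = SimpleImputer(strategy='mean').fit_transform(df[['age']])
    print(df)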


3.  Predictive Modeling Is Mostly Data Preparation

The vast majority of the common, popular, and widely used machine learning algorithms are decades old. Linear regression is more than 100 years old. That is to say, most algorithms are well understood and well parameterized, and there are standard definitions and implementations available in open-source software, such as the scikit-learn machine learning library in Python.
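
To illustrate how routine model fitting has become, the sketch below fits a standard logistic regression on a synthetic dataset with scikit-learn. The dataset and parameters are placeholders; the point is only how little modeling code is required once the data is prepared.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # a synthetic dataset stands in for already-prepared project data
    X, y = make_classification(n_samples=200, n_features=5, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

    # fitting and evaluating a standard, well-parameterized algorithm takes a few lines
    model = LogisticRegression()
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))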

Although the algorithms are well understood operationally, most do not have satisfying theories about why they work or how to map algorithms to problems.

This is why each predictive modeling project is empirical rather than theoretical, requiring a process of systematic experimentation of algorithms on data. 

Given that machine learning algorithms are routine for the most part, the one thing that changes from project to project is the specific data used in the modeling.

It has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data. Because data preparation is a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it is essential that you become fluent and efficient in data wrangling techniques.


