30/09/2021

What is Exploratory Data Analysis (EDA)?

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Based on the results of EDA, companies also make business decisions, which can have repercussions later.
  • If EDA is not done properly, it can hamper the further steps in the machine learning model building process.
  • If done well, it may improve the efficacy of everything we do next.

29/09/2021

Introduction to Univariate, Bivariate and Multivariate Analysis

In the field of data, there is nothing more important than understanding the data that you are trying to analyze. To understand the data, it is important to understand the purpose of the analysis, because this will help you save time and dictate how to go about analyzing the data.

There are lots of different tools, techniques and methods that can be used to conduct your analysis, such as software libraries, visualization tools and statistical testing methods.

Regardless of whether you are a Data Analyst or a Data Scientist, it is crucial to know Univariate, Bivariate and Multivariate statistical analysis.

28/09/2021

Data Transform - Part 7 - How to Derive New Input Variables

Often, the input features for a predictive modeling task interact in unexpected and often nonlinear ways. These interactions can be identified and modeled by a learning algorithm. Another approach is to engineer new features that expose these interactions and see if they improve model performance.

Transforms like raising input variables to a power can help to better expose the important relationships between input variables and the target variable.
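A minimal sketch of this idea using scikit-learn's PolynomialFeatures; the synthetic data and degree=2 are illustrative assumptions, not the tutorial's exact setup:

```python
# Derive polynomial and interaction features from numeric inputs
from sklearn.datasets import make_regression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data stands in for a real dataset
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=1)

# degree=2 adds squared terms and pairwise interaction terms
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X.shape, '->', X_poly.shape)  # (100, 3) -> (100, 9)
```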

Data Transform - Part 6 - How to Transform Numerical to Categorical Data

Many machine learning algorithms prefer or perform better when numerical input variables have a standard probability distribution. 

The discretization transform provides an automatic way to change a numeric input variable to have a different data distribution, which in turn can be used as input to a predictive model. In this tutorial, you will discover how to use discretization transforms to map numerical values to discrete categories for machine learning. After completing this tutorial, you will know:
  • Many machine learning algorithms prefer or perform better when numerical features with non-standard probability distributions are made discrete.
  • Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.
  • How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.
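A minimal sketch of the discretization transform with KBinsDiscretizer; the Gaussian toy data, 10 bins and the 'uniform' strategy are illustrative assumptions:

```python
# Discretize a numeric variable into ordinal bins
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy Gaussian data stands in for a real numeric input variable
rng = np.random.RandomState(1)
X = rng.normal(loc=50, scale=5, size=(1000, 1))

# 10 equal-width bins, encoded as ordinal integers
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
X_discrete = kbins.fit_transform(X)

print(np.unique(X_discrete))  # ordinal bin indices in 0..9
```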

27/09/2021

Data Transform - Part 5 - How to Change Numerical Data Distributions

Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a standard probability distribution, such as a Gaussian (normal) or a uniform distribution. 

The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model.
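A minimal sketch of the quantile transform with scikit-learn's QuantileTransformer; the exponential toy data and the parameter choices are illustrative assumptions:

```python
# Map a skewed numeric variable onto a Gaussian-like distribution
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Exponentially distributed toy data stands in for a skewed input variable
rng = np.random.RandomState(1)
X = rng.exponential(scale=2.0, size=(1000, 1))

# output_distribution='normal' targets a Gaussian; 'uniform' is the other option
qt = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=1)
X_trans = qt.fit_transform(X)

print(round(float(X_trans.mean()), 2), round(float(X_trans.std()), 2))  # roughly 0 and 1
```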

Data Transform - Part 4 - How to Make Distributions More Gaussian

Machine learning algorithms like Linear Regression and Gaussian Naive Bayes assume the numerical variables have a Gaussian probability distribution. Your data may not have a Gaussian distribution and instead may have a Gaussian-like distribution (e.g. nearly Gaussian but with outliers or a skew) or a totally different distribution (e.g. exponential).

As such, you may be able to achieve better performance on a wide range of machine learning algorithms by transforming input and/or output variables to have a Gaussian or more Gaussian distribution.

Power transforms like the Box-Cox transform and the Yeo-Johnson transform provide an automatic way of performing these transforms on your data and are provided in the scikit-learn Python machine learning library.
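A minimal sketch with scikit-learn's PowerTransformer; the skewed toy data and the Yeo-Johnson choice are illustrative assumptions (Box-Cox would require strictly positive values):

```python
# Make a skewed variable more Gaussian with a power transform
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed toy data stands in for a real input variable
rng = np.random.RandomState(1)
X = rng.exponential(scale=2.0, size=(1000, 1))

# Yeo-Johnson works with zero/negative values; Box-Cox requires strictly positive data
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_trans = pt.fit_transform(X)

print(round(float(pt.lambdas_[0]), 3))  # fitted lambda for the power transform
```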

26/09/2021

Data Transform - Part 3 - How to Encode Categorical Data

Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are an Ordinal encoding and a One Hot encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data. After completing this tutorial, you will know:
  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one hot encoding for categorical variables that do not have a natural rank ordering.
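A minimal sketch of both encodings with scikit-learn; the tiny colour column is a made-up example:

```python
# Ordinal encoding vs one hot encoding for categorical inputs
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Tiny made-up categorical column
X = np.array([['red'], ['green'], ['blue'], ['green']])

# Ordinal: each category becomes a single integer (implies an order)
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(X).ravel())   # [2. 1. 0. 1.] (alphabetical category order)

# One hot: each category becomes its own binary column (no order implied)
onehot = OneHotEncoder()
print(onehot.fit_transform(X).toarray())  # 4 rows x 3 binary columns
```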

Data Transform - Part 2 - How to Scale Data with Outliers

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors. 

Standardizing is a popular scaling technique that subtracts the mean from values and divides by the standard deviation, transforming the probability distribution for an input variable to a standard Gaussian (zero mean and unit variance). Standardization can become skewed or biased if the input variable contains outlier values.
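One common remedy in this situation, sketched below, is robust scaling with scikit-learn's RobustScaler, which centres on the median and scales by the interquartile range so that outliers have less influence; the toy data with injected outliers is an illustrative assumption:

```python
# Robust scaling: median/IQR-based scaling that is less sensitive to outliers
import numpy as np
from sklearn.preprocessing import RobustScaler

# Gaussian toy data with a few extreme outliers appended
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(100, 10, size=(100, 1)), [[500.0], [600.0]]])

# Centre on the median and scale by the interquartile range
scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_scaled = scaler.fit_transform(X)

print(round(float(np.median(X_scaled)), 2))  # ~0.0
```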

24/09/2021

Data Transforms - Part 1 - How to Scale Numerical Data

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. 

This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

The two most popular techniques for scaling numerical data prior to modeling are normalization and standardization.

Normalization scales each input variable separately to the range 0-1, which is the range for floating-point values where we have the most precision. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.
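A minimal sketch of both scalers with scikit-learn; the toy Gaussian data is an illustrative assumption:

```python
# Normalization (0-1 range) vs standardization (zero mean, unit variance)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric data stands in for real input variables
rng = np.random.RandomState(1)
X = rng.normal(loc=50, scale=5, size=(100, 2))

X_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(), X_norm.max())                                    # 0.0 1.0
print(round(float(X_std.mean()), 2), round(float(X_std.std()), 2))   # ~0.0 ~1.0
```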

20/09/2021

Feature Selection - Part 6 - How to Use Feature Importance

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.

Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem. In this tutorial, you will discover feature importance scores for machine learning in Python. After completing this tutorial, you will know:
  • The role of feature importance in a predictive modeling problem.
  • How to calculate and review feature importance from linear models and decision trees.
  • How to calculate and review permutation feature importance scores.
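A minimal sketch of two of these importance sources on synthetic data; the random forest and the parameter choices are illustrative assumptions:

```python
# Two views of feature importance: tree-based scores and permutation importance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: 5 informative features out of 10
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances learned by the trees
print(model.feature_importances_.round(3))

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
print(result.importances_mean.round(3))
```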

Feature Selection - Part 5 - How to Use RFE for Feature Selection

Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm. RFE is popular because it is easy to configure and use, and because it is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable.

In this tutorial, you will discover how to use Recursive Feature Elimination (RFE) for feature selection in Python. After completing this tutorial, you will know:
  • RFE is an efficient approach for eliminating features from a training dataset for feature selection.
  • How to use RFE for feature selection for classification and regression predictive modeling problems.
  • How to explore the number of selected features and wrapped algorithm used by the RFE procedure.
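A minimal sketch of RFE wrapping a decision tree on synthetic data; the estimator and the choice of 5 features are illustrative assumptions:

```python
# Recursive Feature Elimination wrapped around a decision tree
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 informative features out of 10
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# Keep the 5 features the wrapped estimator finds most useful
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1), n_features_to_select=5)
rfe.fit(X, y)

for i in range(X.shape[1]):
    print('Column %d: selected=%s, rank=%d' % (i, rfe.support_[i], rfe.ranking_[i]))
```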

18/09/2021

Feature Selection - Part 4 - How to Select Features for Numerical Output

The simplest case of feature selection is the case where there are numerical input variables and a numerical target for regression predictive modeling. 

This is because the strength of the relationship between each input variable and the target, called correlation, can be calculated, and the input variables can then be compared relative to each other.

In this tutorial, you will discover how to perform feature selection with numerical input data for regression predictive modeling. After completing this tutorial, you will know:
  • How to evaluate the importance of numerical input data using the correlation and mutual information statistics.
  • How to perform feature selection for numerical input data when fitting and evaluating a regression model.
  • How to tune the number of features selected in a modeling pipeline using a grid search.
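A minimal sketch of correlation-based selection for a regression target with SelectKBest; the synthetic data and k=10 are illustrative assumptions:

```python
# Select the k best numerical inputs for a numerical (regression) target
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 10 informative features out of 100
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)

# Correlation-based F-statistic; swap in mutual_info_regression for mutual information
fs = SelectKBest(score_func=f_regression, k=10)
X_selected = fs.fit_transform(X, y)

print(X.shape, '->', X_selected.shape)  # (1000, 100) -> (1000, 10)
```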

Feature Selection - Part 3 - How to Select Numerical Input Features

The two most commonly used feature selection methods for numerical input data when the target variable is categorical (e.g. classification predictive modeling) are the ANOVA F-test statistic and the mutual information statistic. 

In this tutorial, you will discover how to perform feature selection with numerical input data for classification. After completing this tutorial, you will know:
  • The diabetes predictive modeling problem with numerical inputs and binary classification target variables.
  • How to evaluate the importance of numerical features using the ANOVA F-test and mutual information statistics.
  • How to perform feature selection for numerical data when fitting and evaluating a classification model.
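A minimal sketch of ANOVA-based selection for a classification target; synthetic binary-classification data stands in here for the tutorial's diabetes dataset, and k=4 is an illustrative choice:

```python
# Rank numerical inputs against a categorical (classification) target
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic binary-classification data stands in for the diabetes dataset
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4, random_state=1)

# ANOVA F-test; swap in mutual_info_classif for the mutual information statistic
fs = SelectKBest(score_func=f_classif, k=4)
X_selected = fs.fit_transform(X, y)

print(fs.scores_.round(2))              # one F-statistic per input feature
print(X.shape, '->', X_selected.shape)  # (1000, 8) -> (1000, 4)
```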

16/09/2021

Feature Selection - Part 2 - How to Select Categorical Input Features

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

Feature selection is often straightforward when working with real-valued data, such as using the Pearson’s correlation coefficient, but can be challenging when working with categorical data.

The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic. 

In this tutorial, you will discover how to perform feature selection with categorical input data. 
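A minimal sketch of the chi-squared approach; the made-up categorical columns and the ordinal pre-encoding (chi-squared needs non-negative numbers) are illustrative assumptions, and mutual_info_classif could be swapped in as the score function:

```python
# Chi-squared feature selection for categorical inputs and a categorical target
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Tiny made-up categorical dataset (3 input columns, binary target)
rng = np.random.RandomState(1)
X_raw = rng.choice(['a', 'b', 'c'], size=(100, 3))
y_raw = rng.choice(['yes', 'no'], size=100)

# Chi-squared needs non-negative numbers, so ordinal-encode the categories first
X = OrdinalEncoder().fit_transform(X_raw)
y = LabelEncoder().fit_transform(y_raw)

fs = SelectKBest(score_func=chi2, k=2)
fs.fit(X, y)
print(fs.scores_.round(3))  # one chi-squared score per input column
```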

Feature Selection - Part 1 - What is Feature Selection

Feature selection is a process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in many cases, to improve the performance of the model.

Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable.

14/09/2021

Optimal Threshold for Imbalanced Classification

Many machine learning algorithms are capable of predicting a probability or score of class membership, and this must be interpreted before it can be mapped to a class label.

This is achieved by using a threshold, such as 0.5, where all values equal or greater than the threshold are mapped to one class and all other values are mapped to another class.

For those classification problems that have a severe class imbalance, the default threshold can result in poor performance. As such, a simple and straightforward approach to improving the performance of a classifier that predicts probabilities on an imbalanced classification problem is to tune the threshold used to map probabilities to class labels.
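A minimal sketch of threshold tuning on synthetic imbalanced data; maximizing the geometric mean of sensitivity and specificity is one common criterion assumed here (the F-measure is another option), and the data and model are illustrative:

```python
# Tune the probability threshold instead of using the default 0.5
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~1% positive class
X, y = make_classification(n_samples=10000, n_features=10, weights=[0.99], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Pick the threshold that maximizes the geometric mean of sensitivity and specificity
fpr, tpr, thresholds = roc_curve(y_test, probs)
gmeans = np.sqrt(tpr * (1 - fpr))
best = np.argmax(gmeans)
print('Best threshold=%.3f, G-mean=%.3f' % (thresholds[best], gmeans[best]))

# Apply the tuned threshold to map probabilities to class labels
y_pred = (probs >= thresholds[best]).astype(int)
```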

13/09/2021

Understanding the ROC curve

The Receiver Operating Characteristic (ROC) curve is a visual representation of how well your classification model works.

In this blog, we will explore how the ROC curve is constructed from scratch in three visual steps.

12/09/2021

When to Use ROC Curves and Precision-Recall Curves

In this tutorial, you will discover ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems. 

After completing this tutorial, you will know:
  • ROC Curves summarize the trade-off between the True Positive Rate and False Positive Rate for a predictive model using different probability thresholds.
  • Precision-Recall Curves summarize the trade-off between the True Positive Rate and the positive predictive value for a predictive model using different probability thresholds.
  • ROC Curves are appropriate when the observations are balanced between each class, whereas Precision-Recall curves are appropriate for imbalanced datasets.
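A minimal sketch computing both curves and their areas for a probabilistic classifier; the synthetic data and logistic regression model are illustrative assumptions:

```python
# ROC curve vs precision-recall curve for a probabilistic binary classifier
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# ROC: true positive rate vs false positive rate across thresholds
fpr, tpr, _ = roc_curve(y_test, probs)
print('ROC AUC: %.3f' % roc_auc_score(y_test, probs))

# Precision-recall: precision (positive predictive value) vs recall across thresholds
precision, recall, _ = precision_recall_curve(y_test, probs)
print('PR AUC: %.3f' % auc(recall, precision))
```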

11/09/2021

Interview Questions

1. Difference between predict and predict_proba

predict_proba() returns the class membership probabilities.
predict() returns the predicted class label.

The class label is used wherever the evaluation metric is accuracy, recall, precision and so on.

Whereas the probabilities are used wherever the evaluation metric is AUC/ROC AUC, log loss or the Brier score (the mean squared error of the predicted probabilities).
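A minimal sketch of the difference on a fitted classifier; the synthetic data and logistic regression model are illustrative assumptions:

```python
# predict() returns class labels; predict_proba() returns class probabilities
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

print(model.predict(X[:3]))        # one class label per row, e.g. [0 1 0]
print(model.predict_proba(X[:3]))  # one row per sample, one column per class, rows sum to 1
```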

2. Bias Variance Tradeoff


03/09/2021

Resampling Methods - Part 3 - Estimation with Cross-Validation

Cross-validation is a statistical method used to estimate the skill of machine learning models. 

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods. 

In this tutorial, you will discover a gentle introduction to the k-fold cross-validation procedure for estimating the skill of machine learning models. 
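A minimal sketch of 10-fold cross-validation with scikit-learn; the synthetic data, logistic regression model and accuracy scoring are illustrative assumptions:

```python
# Estimate model skill with 10-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='accuracy', cv=cv)

print('Accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))
```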

Resampling Methods - Part 2 - Estimation with Bootstrap

The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. 

It can be used to estimate summary statistics such as the mean or standard deviation. 

It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.

In this tutorial, you will discover the bootstrap resampling method for estimating the skill of machine learning models on unseen data. 
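A minimal sketch of a single bootstrap sample using scikit-learn's resample utility; the tiny data array is a made-up example, and in practice the sampling and evaluation would be repeated many times:

```python
# Bootstrap: sample with replacement, evaluate on the out-of-bag rows
import numpy as np
from sklearn.utils import resample

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# One bootstrap sample of the same size as the original dataset
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
oob = np.array([x for x in data if x not in boot])

print('Bootstrap sample:', boot)
print('Bootstrap mean  :', round(float(boot.mean()), 3))
print('Out of bag      :', oob)
```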

02/09/2021

Resampling Methods - Part 1 - Introduction to Resampling

Data is the currency of applied machine learning. Therefore, it is important that it is both collected and used effectively.

Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter.

Whereas data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and help to quantify the uncertainty of the estimate.

Both data sampling and data resampling are methods that are required in a predictive modeling problem.

01/09/2021

Data Cleaning - Part 6 - How to Use Iterative Imputation

A sophisticated approach involves defining a model to predict each missing feature as a function of all other features and to repeat this process of estimating feature values multiple times. This is generally referred to as iterative imputation.

After completing this tutorial, you will know:
  • Missing values must be marked with NaN values and can be replaced with iteratively estimated values.
  • How to impute missing values with iterative models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.
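A minimal sketch with scikit-learn's IterativeImputer (note the experimental enable import it still requires); the tiny NaN-marked array is a made-up example:

```python
# Iterative imputation: model each feature with missing values from the others
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: required to enable the class
from sklearn.impute import IterativeImputer

# Tiny made-up dataset with NaN marking the missing entries
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

imputer = IterativeImputer(max_iter=10, random_state=1)
print(imputer.fit_transform(X).round(2))  # NaNs replaced by iteratively estimated values
```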

Data Cleaning - Part 5 - How to Use KNN Imputation

A popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. 

Although any one of a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective; used this way, it is often referred to as nearest neighbor imputation.

In this tutorial, you will discover how to use nearest neighbor imputation strategies for missing data in machine learning.
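A minimal sketch with scikit-learn's KNNImputer; the tiny NaN-marked array and the choice of 2 neighbours are illustrative assumptions:

```python
# KNN imputation: fill missing entries from the k most similar rows
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each NaN is replaced by the (uniformly weighted) mean of its 2 nearest neighbours
imputer = KNNImputer(n_neighbors=2, weights='uniform')
print(imputer.fit_transform(X))
```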

Data Cleaning - Part 4 - How to Use Statistical Imputation

It is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing. 

A popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and replace all missing values for that column with the statistic. 

It is a popular approach because the statistic is easy to calculate using the training dataset and because it often results in good performance.
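A minimal sketch with scikit-learn's SimpleImputer; the tiny NaN-marked array and the 'mean' strategy are illustrative assumptions:

```python
# Statistical imputation: replace NaNs with a per-column statistic (here the mean)
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# strategy can also be 'median', 'most_frequent' or 'constant'
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))
```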