
25/01/2022

Foundation - Intuition for Imbalanced Classification

In this tutorial, you will discover how to develop a practical intuition for imbalanced and highly skewed class distributions.

After completing this tutorial, you will know:
  1. How to create a synthetic dataset for binary classification and plot the examples by class.
  2. How to create synthetic classification datasets with any given class distribution.
  3. How different skewed class distributions actually look in practice.

This tutorial is divided into three parts; they are:
  1. Create and Plot a Binary Classification Problem
  2. Create Synthetic Dataset With a Class Distribution
  3. Effect of Skewed Class Distributions

A. Create and Plot a Binary Classification Problem

The scikit-learn Python machine learning library provides functions for generating synthetic datasets. The make_blobs() function can be used to generate a specified number of examples from a test classification problem with a specified number of classes.

For example, the snippet below will generate 1,000 examples for a two-class (binary) classification problem with two input variables. The class labels are 0 and 1.

X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
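As a quick sanity check, the shapes of the returned arrays and the number of examples per class can be printed. This is a minimal sketch rather than part of the original listing; the use of Counter is an addition for illustration.

```python
# inspect the generated dataset: array shapes and examples per class
from collections import Counter
from sklearn.datasets import make_blobs

# same call as in the snippet above
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# X holds the two input variables per example, y holds the class label (0 or 1)
print(X.shape, y.shape)
# with an integer n_samples, make_blobs divides the examples evenly among the centers
counts = Counter(y.tolist())
print(counts)
```

Each class should receive half of the 1,000 examples, which is why a separate step is needed later to obtain a skewed distribution.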

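Because there are only two input variables, we can create a scatter plot that shows each example as a point, using the Matplotlib scatter() function. Plotting the examples of each class in a separate call gives each class its own color. The complete example is listed below.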

# generate binary classification dataset and plot
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)[0]
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()


Scatter Plot of Binary Classification Dataset
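The call where(y == class_value) returns a tuple that contains one array of row indexes. A boolean mask selects the same rows without the intermediate index array. The sketch below is an alternative for illustration, not the tutorial's own approach, and checks that both selections agree.

```python
# compare index-based and mask-based row selection for each class
from numpy import where, array_equal
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
for class_value in range(2):
    # index-based selection: take the first (and only) array from the tuple
    row_ix = where(y == class_value)[0]
    by_index = X[row_ix, :]
    # mask-based selection: a boolean array with one entry per row
    mask = y == class_value
    by_mask = X[mask, :]
    # both approaches return the same rows in the same order
    print(class_value, by_index.shape, array_equal(by_index, by_mask))
```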


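B. Create Synthetic Dataset with a Class Distribution

When n_samples is an integer, make_blobs() divides the examples evenly among the classes. To obtain a chosen class distribution, the get_dataset() function below takes a dictionary that maps each class label to the number of examples required. It generates as many examples per class as the largest class needs, then keeps only the requested number from each class. The plot_dataset() function draws the result with one color per class.
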
# create and plot synthetic dataset with a given class distribution
from numpy import unique
from numpy import hstack
from numpy import vstack
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs
# create a dataset with a given class distribution
def get_dataset(proportions):
    # determine the number of classes
    n_classes = len(proportions)
    # determine the number of examples to generate for each class
    largest = max([v for k,v in proportions.items()])
    n_samples = largest * n_classes
    # create dataset
    X, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)
    # collect the examples
    X_list, y_list = list(), list()
    for k,v in proportions.items():
        row_ix = where(y == k)[0]
        selected = row_ix[:v]
        X_list.append(X[selected, :])
        y_list.append(y[selected])
    return vstack(X_list), hstack(y_list)
# scatter plot of dataset, different color for each class
def plot_dataset(X, y):
    # create scatter plot for samples from each class
    n_classes = len(unique(y))
    for class_value in range(n_classes):
        # get row indexes for samples with this class
        row_ix = where(y == class_value)[0]
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
    # show a legend
    pyplot.legend()
    # show the plot
    pyplot.show()
# define the class distribution
proportions = {0:5000, 1:5000}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)

Scatter Plot of Binary Classification Dataset With Provided Class Distribution.
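To confirm that get_dataset() returns the requested number of examples per class, the labels of the result can be counted. The sketch below repeats the function from the listing above without the plotting code. The 300 to 100 distribution and the use of Counter are assumptions chosen for illustration.

```python
# check the class counts returned by get_dataset()
from collections import Counter
from numpy import hstack, vstack, where
from sklearn.datasets import make_blobs

def get_dataset(proportions):
    n_classes = len(proportions)
    # size every class to the largest requested count (same value as the list comprehension above)
    largest = max(proportions.values())
    n_samples = largest * n_classes
    X, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2,
                      random_state=1, cluster_std=3)
    X_list, y_list = list(), list()
    for k, v in proportions.items():
        # keep only the first v rows that belong to class k
        row_ix = where(y == k)[0]
        selected = row_ix[:v]
        X_list.append(X[selected, :])
        y_list.append(y[selected])
    return vstack(X_list), hstack(y_list)

# hypothetical distribution: 300 examples of class 0, 100 of class 1
X, y = get_dataset({0: 300, 1: 100})
counts = Counter(y.tolist())
print(X.shape, counts)
```

Note that the function assumes the dictionary keys are the labels 0 to n_classes - 1, because these are the labels that make_blobs() assigns.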


C. Effect of Skewed Class Distributions

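It is common in imbalanced classification to label the majority class 0 and the minority class 1. We can follow this convention by defining the majority class (label 0) first and then the minority class (label 1) in the dictionary passed to the get_dataset() function.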

...
# define the class distribution
proportions = {0:10000, 1:10}
# generate dataset
X, y = get_dataset(proportions)
...


1. 1:10 Imbalanced Class Distribution

# define the class distribution
proportions = {0:10000, 1:1000}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)

Scatter Plot of Binary Classification Dataset With a 1 to 10 Class Distribution


2. 1:100 Imbalanced Class Distribution

# define the class distribution
proportions = {0:10000, 1:100}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)


Scatter Plot of Binary Classification Dataset With a 1 to 100 Class Distribution


3. 1:1000 Imbalanced Class Distribution

# define the class distribution
proportions = {0:10000, 1:10}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)

Scatter Plot of Binary Classification Dataset With a 1 to 1000 Class Distribution
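It can help to translate each ratio into the share of the dataset held by the minority class. The sketch below does this arithmetic for the three distributions used above. The helper name minority_share() is hypothetical and not part of the tutorial's code.

```python
# share of the dataset held by the minority class for each distribution
def minority_share(proportions):
    # fraction of all examples that belong to the smallest class
    counts = list(proportions.values())
    return min(counts) / sum(counts)

# the three class distributions plotted in this section
distributions = {
    '1:10': {0: 10000, 1: 1000},
    '1:100': {0: 10000, 1: 100},
    '1:1000': {0: 10000, 1: 10},
}
shares = {name: minority_share(p) for name, p in distributions.items()}
for name, share in shares.items():
    print('%s -> minority class is %.2f%% of all examples' % (name, share * 100))
```

The shares come out at roughly 9.09%, 0.99% and 0.10%. A 1:10 ratio therefore means the minority class is about one eleventh of the data, not one tenth.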



 

