
25/01/2022

Foundation - Challenge of Imbalanced Classification

In this tutorial, you will discover data characteristics that compound the challenge of imbalanced classification. After completing this tutorial, you will know:
  1. Imbalanced classification is specifically hard because of the severely skewed class distribution and the unequal misclassification costs.
  2. The difficulty of imbalanced classification is compounded by properties such as dataset size, label noise, and data distribution.
  3. How to develop an intuition for the compounding effects on modeling difficulty posed by different dataset properties.
This tutorial is divided into four parts; they are:
1. Why Imbalanced Classification Is Hard
2. Compounding Effect of Dataset Size
3. Compounding Effect of Label Noise
4. Compounding Effect of Data Distribution


A. Why Imbalanced Classification Is Hard

Because the class distribution is not balanced, most machine learning algorithms will perform poorly and require modification to avoid simply predicting the majority class in all cases.

Additionally, metrics like classification accuracy lose their meaning, and alternative methods for evaluating predictions on imbalanced classification problems are required, such as the area under the ROC curve (ROC AUC). This is the foundational challenge of imbalanced classification.
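
As a quick illustration of this point (not from the original tutorial), the sketch below fits a naive model that always predicts the majority class on a synthetic 1:100 dataset; its accuracy is near perfect while its ROC AUC shows it has no skill. The dataset parameters and the use of DummyClassifier are illustrative assumptions.

# illustrative sketch: accuracy vs ROC AUC for a no-skill model on a 1:100 dataset
# (dataset parameters and model choice are assumptions, not from the tutorial)
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
# define a synthetic 1:100 imbalanced dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# a naive model that always predicts the majority class
model = DummyClassifier(strategy='most_frequent')
model.fit(X, y)
yhat = model.predict(X)
# accuracy looks excellent, but ROC AUC shows the model has no skill (0.5)
print('Accuracy: %.3f' % accuracy_score(y, yhat))
print('ROC AUC: %.3f' % roc_auc_score(y, model.predict_proba(X)[:, 1]))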

Misclassifying an example from the majority class as an example from the minority class, a so-called False Positive, is often undesirable, but it is less critical than classifying an example from the minority class as belonging to the majority class, a so-called False Negative.

This is referred to as cost sensitivity of misclassification errors and is a second foundational challenge of imbalanced classification.
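
To make this asymmetry concrete, the minimal sketch below scores a handful of illustrative predictions with an assumed cost matrix in which a False Negative is ten times as costly as a False Positive; the labels, predictions, and costs are assumptions for illustration only.

# illustrative sketch: weighting misclassification errors with an assumed cost matrix
# (labels, predictions, and costs are assumptions, not from the tutorial)
from sklearn.metrics import confusion_matrix
# true labels and predictions for a tiny imbalanced example
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# assume a False Negative costs ten times as much as a False Positive
cost_fp, cost_fn = 1, 10
total_cost = fp * cost_fp + fn * cost_fn
print('FP=%d, FN=%d, Total cost=%d' % (fp, fn, total_cost))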

Nevertheless, there are other characteristics of the classification problem that, when combined with these properties, compound their effect.
There are many such characteristics, but perhaps three of the most common include:
  1. Dataset Size.
  2. Label Noise.
  3. Data Distribution.

B. Compounding Effect of Dataset Size

Scatter plots are created for each differently sized dataset.
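
A sketch of how these plots might be generated is given below, assuming the same 1:100 weighting and dataset sizes of 100, 1,000, 10,000, and 100,000 examples; the exact sizes and the make_classification() parameters are illustrative.

# sketch: vary the dataset size for a 1:100 imbalanced dataset (sizes are illustrative)
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# dataset sizes to compare
sizes = [100, 1000, 10000, 100000]
# create and plot a dataset at each size
for i in range(len(sizes)):
    n = sizes[i]
    # define a 1:100 dataset with two informative features
    X, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    counter = Counter(y)
    print('Size=%d, Ratio=%s' % (n, counter))
    # define subplot
    pyplot.subplot(2, 2, 1+i)
    pyplot.title('n=%d' % n)
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()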

These plots highlight the critical role that dataset size plays in imbalanced classification.

It is hard to see how a model given only 990 examples of the majority class and 10 examples of the minority class could hope to capture the structure that is evident in the same problem once 100,000 examples are drawn.


Scatter Plots of an Imbalanced Classification Dataset With Different Dataset Sizes


C. Compounding Effect of Label Noise

Label noise refers to examples that belong to one class but are labeled as belonging to another class.

This can make determining the class boundary in feature space problematic for most machine learning algorithms, and this difficulty typically increases in proportion to the percentage of noise in the labels.
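
The sketch below shows one way such datasets could be generated, by varying the flip_y argument of make_classification() so that a given fraction of labels is assigned randomly; the noise levels of 0%, 1%, 5%, and 7% are chosen to match the counts printed beneath it.

# sketch: vary the label noise for a 1:100 imbalanced dataset via flip_y
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# label noise ratios to compare
noise = [0, 0.01, 0.05, 0.07]
# create and plot a dataset at each noise level
for i in range(len(noise)):
    n = noise[i]
    # define dataset, randomly reassigning the class of a fraction n of examples
    X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=n, random_state=1)
    counter = Counter(y)
    print('Noise=%d%%, Ratio=%s' % (int(n*100), counter))
    # define subplot
    pyplot.subplot(2, 2, 1+i)
    pyplot.title('noise=%d%%' % int(n*100))
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()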

Noise=0%, Ratio=Counter({0: 990, 1: 10})
Noise=1%, Ratio=Counter({0: 983, 1: 17})
Noise=5%, Ratio=Counter({0: 963, 1: 37})
Noise=7%, Ratio=Counter({0: 959, 1: 41})



Scatter Plots of an Imbalanced Classification Dataset With Different Label Noise


D. Compounding Effect of Data Distribution

Another important consideration is the distribution of examples in feature space. If we think about feature space spatially, we might like all examples in one class to be located on one part of the space, and those from the other class to appear in another part of the space.

We can use the number of clusters in the dataset as a proxy for concepts and compare a dataset with one cluster of examples per class to a second dataset with two clusters per class. This can be achieved by varying the n_clusters_per_class argument of the make_classification() function used to create the dataset, as in the listing below.

# vary the number of clusters for a 1:100 imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# number of clusters per class
clusters = [1, 2]
# create and plot a dataset with different numbers of clusters
for i in range(len(clusters)):
    c = clusters[i]
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=c, weights=[0.99], flip_y=0, random_state=1)
    counter = Counter(y)
    # define subplot
    pyplot.subplot(1, 2, 1+i)
    pyplot.title('Clusters=%d' % c)
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()

 


Scatter Plots of an Imbalanced Classification Dataset With Different Numbers of Clusters




