In this tutorial, you will discover how to develop a practical intuition for imbalanced and highly skewed class distributions.
After completing this tutorial, you will know:
- How to create a synthetic dataset for binary classification and plot the examples by class.
- How to create synthetic classification datasets with any given class distribution.
- How different skewed class distributions actually look in practice.
This tutorial is divided into three parts; they are:
- Create and Plot a Binary Classification Problem
- Create Synthetic Dataset with a Class Distribution
- Effect of Skewed Class Distributions
A. Create and Plot a Binary Classification Problem
The scikit-learn Python machine learning library provides functions for generating synthetic datasets. The make_blobs() function can be used to generate a specified number of examples from a test classification problem with a specified number of classes.
For example, the snippet below will generate 1,000 examples for a two-class (binary) classification problem with two input variables.
The class labels are 0 and 1.
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
Because there are only two input variables, we can create a scatter plot to plot each example as a point. This can be achieved with the scatter() Matplotlib function.
# generate binary classification dataset and plot
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
Scatter Plot of Binary Classification Dataset
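The scatter plot suggests two groups of about the same size. One way to confirm this numerically is to count the examples in each class. The sketch below is an optional check rather than part of the original listing; it uses Counter from the Python standard library and assumes the same make_blobs() arguments as above.

```python
# summarize the generated dataset (optional check, not part of the original listing)
from collections import Counter
from sklearn.datasets import make_blobs
# generate the same dataset as above
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# report the shape of the inputs and outputs
print(X.shape, y.shape)
# count the number of examples in each class (tolist() gives plain Python ints)
counter = Counter(y.tolist())
for label, count in sorted(counter.items()):
    print('Class=%d, Count=%d' % (label, count))
```

Because make_blobs() divides the requested examples evenly among the centers when n_samples is an integer, each class should contain 500 examples.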
B. Create Synthetic Dataset with a Class Distribution
# create and plot synthetic dataset with a given class distribution
from numpy import unique
from numpy import hstack
from numpy import vstack
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs

# create a dataset with a given class distribution
def get_dataset(proportions):
    # determine the number of classes
    n_classes = len(proportions)
    # determine the number of examples to generate for each class
    largest = max([v for k,v in proportions.items()])
    n_samples = largest * n_classes
    # create dataset
    X, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)
    # collect the examples
    X_list, y_list = list(), list()
    for k,v in proportions.items():
        row_ix = where(y == k)[0]
        selected = row_ix[:v]
        X_list.append(X[selected, :])
        y_list.append(y[selected])
    return vstack(X_list), hstack(y_list)

# scatter plot of dataset, different color for each class
def plot_dataset(X, y):
    # create scatter plot for samples from each class
    n_classes = len(unique(y))
    for class_value in range(n_classes):
        # get row indexes for samples with this class
        row_ix = where(y == class_value)[0]
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
    # show a legend
    pyplot.legend()
    # show the plot
    pyplot.show()

# define the class distribution
proportions = {0:5000, 1:5000}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)
Scatter Plot of Binary Classification Dataset With Provided Class Distribution.
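The get_dataset() function generates enough examples for every class to match the largest requested count, then keeps only the first v examples of each class. A quick way to check that the returned dataset matches the requested distribution is sketched below. The function body is repeated only so that the snippet runs on its own, and the use of Counter is an addition that does not appear in the original listing.

```python
# check that get_dataset() returns the requested number of examples per class
from collections import Counter
from numpy import hstack, vstack, where
from sklearn.datasets import make_blobs

# same logic as the get_dataset() listing above, repeated so this snippet is self-contained
def get_dataset(proportions):
    n_classes = len(proportions)
    # generate the largest requested count for every class
    largest = max(proportions.values())
    n_samples = largest * n_classes
    X, y = make_blobs(n_samples=n_samples, centers=n_classes, n_features=2, random_state=1, cluster_std=3)
    # keep only the first v examples of each class k
    X_list, y_list = list(), list()
    for k, v in proportions.items():
        row_ix = where(y == k)[0]
        selected = row_ix[:v]
        X_list.append(X[selected, :])
        y_list.append(y[selected])
    return vstack(X_list), hstack(y_list)

# define the class distribution and generate the dataset
proportions = {0: 5000, 1: 5000}
X, y = get_dataset(proportions)
# compare requested and returned counts for each class
counter = Counter(y.tolist())
for label in sorted(proportions):
    print('Class=%d, Requested=%d, Returned=%d' % (label, proportions[label], counter[label]))
```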
C. Effect of Skewed Class Distributions
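We can keep our class distributions consistent by defining the majority class first and the minority class second in the dictionary passed to the get_dataset() function. For example, the snippet below requests 10,000 examples for class 0 and 10 examples for class 1.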
...
# define the class distribution
proportions = {0:10000, 1:10}
# generate dataset
X, y = get_dataset(proportions)
...
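If you prefer to think in terms of ratios such as 1:10 or 1:100, the dictionary can be derived from the ratio instead of typed by hand. The sketch below uses a hypothetical helper, make_proportions(), which is not part of the tutorial code. It assumes class 0 is the majority class, class 1 is the minority class, and the majority count divides evenly by the ratio.

```python
# hypothetical helper (not part of the tutorial) that builds a class distribution from a 1:ratio imbalance
def make_proportions(n_majority, ratio):
    # the minority class (1) gets one example for every `ratio` majority (0) examples
    return {0: n_majority, 1: n_majority // ratio}

# build the class distributions for three increasingly skewed ratios
for ratio in [10, 100, 1000]:
    print('1:%d ->' % ratio, make_proportions(10000, ratio))
```

The resulting dictionaries can be passed directly to get_dataset().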
1. 1:10 Imbalanced Class Distribution
# define the class distribution
proportions = {0:10000, 1:1000}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)
Scatter Plot of Binary Classification Dataset With a 1 to 10 Class Distribution
2. 1:100 Imbalanced Class Distribution
# define the class distribution
proportions = {0:10000, 1:100}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)
Scatter Plot of Binary Classification Dataset With a 1 to 100 Class Distribution
3. 1:1000 Imbalanced Class Distribution
# define the class distribution
proportions = {0:10000, 1:10}
# generate dataset
X, y = get_dataset(proportions)
# plot dataset
plot_dataset(X, y)
Scatter Plot of Binary Classification Dataset With a 1 to 1000 Class Distribution
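It can also help to express each of the three distributions as the percentage of the dataset that belongs to the minority class. The sketch below does this arithmetic for the class counts used in this section. The minority_percentage() helper is a hypothetical name introduced here for illustration only.

```python
# express each class distribution as the percentage of minority class examples
distributions = {
    '1:10': {0: 10000, 1: 1000},
    '1:100': {0: 10000, 1: 100},
    '1:1000': {0: 10000, 1: 10},
}

# hypothetical helper: share of the dataset held by the smallest class, in percent
def minority_percentage(proportions):
    total = sum(proportions.values())
    return min(proportions.values()) / total * 100.0

# report the minority share for each distribution
for name, proportions in distributions.items():
    print('%s: minority class is %.3f%% of the dataset' % (name, minority_percentage(proportions)))
```

Note that the percentage is computed against the total number of examples, so a 1:10 ratio corresponds to roughly 9 percent minority examples rather than 10 percent.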