Imbalanced classification is the problem of classification when there is an unequal distribution of classes in the training dataset.
The imbalance in the class distribution may vary, but a severe imbalance is more challenging to model and may require specialized techniques.
Many real-world classification problems have an imbalanced class distribution, such as fraud detection, spam detection, and churn prediction.
This tutorial is divided into five parts:
- Classification Predictive Modeling
- Imbalanced Classification Problems
- Causes of Class Imbalance
- Challenge of Imbalanced Classification
- Examples of Imbalanced Classification
A. Classification Predictive Modeling
Binary Classification Problem: A classification predictive modeling problem where all examples belong to one of two classes.
Multiclass Classification Problem: A classification predictive modeling problem where all examples belong to one of three or more classes.
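As a minimal sketch of the two definitions above, the problem type can be read off the number of distinct class labels in a dataset. The helper name `problem_type` is hypothetical and used here only for illustration.

```python
def problem_type(labels):
    """Name the problem type from the number of distinct class labels.

    Hypothetical helper for illustration: two classes means binary,
    three or more means multiclass.
    """
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("a classification problem needs at least two classes")
    return "binary" if n_classes == 2 else "multiclass"


print(problem_type(["spam", "ham", "spam"]))
print(problem_type(["cat", "dog", "bird", "cat"]))
```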
B. Imbalanced Classification Problems
An imbalanced classification problem is a classification predictive modeling problem where the distribution of examples across the classes is not equal.
There are other less general names that may be used to describe these types of classification problems, such as:
- Rare event prediction.
- Extreme event prediction.
- Severe class imbalance.
It is common to describe the imbalance of classes in a dataset in terms of a ratio.
For example, an imbalanced binary classification problem with an imbalance of 1 to 100 (1:100) means that for every one example in one class, there are 100 examples in the other class.
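The ratio can be computed directly from the class labels. The following is a minimal sketch, assuming a binary dataset held as a plain Python list; the function name `imbalance_ratio` and the labels "fraud" and "normal" are made up for illustration.

```python
from collections import Counter
from math import gcd


def imbalance_ratio(labels):
    """Return (minority, majority) counts reduced to lowest terms.

    Assumes exactly two classes; hypothetical helper for illustration.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    minority, majority = sorted(counts.values())
    divisor = gcd(minority, majority)
    return minority // divisor, majority // divisor


# Made-up dataset: 10 examples in one class, 1,000 in the other.
labels = ["fraud"] * 10 + ["normal"] * 1000
minority, majority = imbalance_ratio(labels)
print(f"{minority}:{majority}")  # prints 1:100
```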
Another way to describe the imbalance of classes in a dataset is to summarize the class distribution as percentages of the training dataset. For example, an imbalanced multiclass classification problem may have 80 percent of examples in the first class, 18 percent in the second class, and 2 percent in the third class.
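A percentage summary of this kind can be sketched in a few lines. The class names "first", "second", and "third" and the helper `class_percentages` are assumptions made for this illustration; the counts are chosen to reproduce the 80/18/2 split described above.

```python
from collections import Counter


def class_percentages(labels):
    """Return each class's share of the dataset as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up multiclass dataset of 100 examples.
labels = ["first"] * 80 + ["second"] * 18 + ["third"] * 2
for label, pct in class_percentages(labels).items():
    print(f"{label}: {pct:.0f}%")
```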
C. Causes of Class Imbalance
The causes of class imbalance fall into two main groups: data sampling and properties of the domain.
It is possible that the imbalance in the examples across the classes was caused by the way the examples were collected or sampled from the problem domain.
- Biased Sampling.
- Measurement Errors.
The imbalance might be a property of the problem domain. For example, the natural occurrence or presence of one class may dominate other classes.
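To illustrate the first group of causes, the sketch below starts from a balanced population and applies a deliberately biased collection rule that records every example of one class but only every tenth example of the other. The population, the class names "a" and "b", and the `biased_sample` helper are all invented for this illustration; real sampling bias is rarely this tidy.

```python
from collections import Counter

# Made-up population in which both classes are equally common.
population = ["a", "b"] * 500


def biased_sample(examples, keep_every=10):
    """Keep every 'a' example but only every keep_every-th 'b' example."""
    kept = []
    b_seen = 0
    for label in examples:
        if label == "a":
            kept.append(label)
        else:
            b_seen += 1
            if b_seen % keep_every == 0:
                kept.append(label)
    return kept


sample = biased_sample(population)
print("population:", dict(Counter(population)))
print("sample:    ", dict(Counter(sample)))
```

The domain here is balanced, yet the collected sample has a 1:10 imbalance that comes entirely from how the examples were gathered.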
D. Challenge of Imbalanced Classification
Majority Class: The class (or classes) in an imbalanced classification predictive modeling problem that has more examples.
Minority Class: The class in an imbalanced classification predictive modeling problem that has fewer examples.
When working with an imbalanced classification problem, the minority class is typically of the most interest. This means that a model's skill in correctly predicting the class label or probability for the minority class is more important than its skill on the majority class or classes.
The minority class is harder to predict because there are few examples of this class, by definition. This means it is more challenging for a model to learn the characteristics of examples from this class, and to differentiate examples from this class from the majority class (or classes).
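One way to see why skill on the minority class deserves separate attention is to score a naive model that always predicts the majority class. The sketch below is an illustration on made-up labels with a 1:100 imbalance, not a recommended evaluation procedure; the label encoding (1 for minority, 0 for majority) is an assumption.

```python
# Made-up labels: 10 minority examples (1) and 1,000 majority examples (0).
y_true = [1] * 10 + [0] * 1000

# A naive model that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Fraction of all examples predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of minority examples predicted correctly.
minority_total = sum(t == 1 for t in y_true)
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / minority_total

print(f"overall accuracy:      {accuracy:.3f}")
print(f"minority-class recall: {minority_recall:.3f}")
```

The naive model is correct on about 99 percent of all examples while identifying none of the minority class, so an overall score alone can hide a complete failure on the class of most interest.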
E. Examples of Imbalanced Classification
Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced. Examples include:
- Fraud Detection.
- Claim Prediction.
- Default Prediction.
- Churn Prediction.
- Spam Detection.
- Anomaly Detection.
- Outlier Detection.
- Intrusion Detection.
- Conversion Prediction.