13/09/2021

Understanding the ROC curve

The Receiver Operating Characteristic (ROC) curve is a visual representation of how well your classification model works.

In this blog post, we will explore how the ROC curve is constructed from scratch in three visual steps.

Step 1: Getting classification model predictions

When we train a classification model, we get a predicted probability for each observation. In our example, this is the likelihood that a person will repay a loan.

The probabilities range between 0 and 1. The higher the value, the more likely the person is to repay the loan.

The next step is to choose a threshold to classify the probabilities as "will repay" or "won't repay".


Classification model example


In the figure above, we have selected a threshold of 0.35:
  • All predictions at or above this threshold are classified as "will repay".
  • All predictions below this threshold are classified as "won't repay".
All actual positives, those who did repay, are the blue dots.
  • If they were classified as “will repay”, we have a True Positive (TP)
  • If they were classified as “won’t repay”, we have a False Negative (FN)
All actual negatives, those who didn’t repay, are the red dots.
  • If they were classified as “won’t repay”, we have a True Negative (TN)
  • If they were classified as “will repay”, we have a False Positive (FP)


Confusion Matrix
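
To make the counting above concrete, here is a minimal sketch (with made-up probabilities and labels, not the data behind the figures) of how a threshold turns predicted probabilities into classes and how the four confusion-matrix counts are obtained:

import numpy as np

# hypothetical probabilities and true labels (1 = repaid, 0 = did not repay);
# these are made-up numbers for illustration, not the data behind the figures
proba = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
actual = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1])

threshold = 0.35
predicted = (proba >= threshold).astype(int)  # 1 = "will repay", 0 = "won't repay"

tp = np.sum((predicted == 1) & (actual == 1))  # true positives
fn = np.sum((predicted == 0) & (actual == 1))  # false negatives
tn = np.sum((predicted == 0) & (actual == 0))  # true negatives
fp = np.sum((predicted == 1) & (actual == 0))  # false positives

print(f"TP={tp}, FN={fn}, TN={tn}, FP={fp}")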


Step 2: Calculate the True Positive Rate and False Positive Rate

Calculations for TPR and FPR

  • True positive rate (TPR): of all the people who “did repay” in the past, what percentage did we classify correctly? TPR = TP / (TP + FN)
  • False positive rate (FPR): of all the people who “didn’t repay” in the past, what percentage did we misclassify? FPR = FP / (FP + TN)
We can see our original example at the threshold of 0.35. At this point, we:
  • correctly classified 90% of all positives, those who “paid back” (TPR)
  • misclassified 40% of all negatives, those who “didn’t pay back” (FPR)
Notice that both the TPR and the FPR decrease as the threshold gets larger. If we look at the first case, where the threshold is at 0:
  • All positives were correctly classified, therefore TPR = 100%
  • All negatives were misclassified, hence FPR = 100%
At the other extreme, where the threshold is at 1:
  • All positives were misclassified, therefore TPR = 0%
  • All negatives were correctly classified, hence FPR = 0%




Overall, we can see this is a trade-off. As we increase our threshold, we’ll be better at classifying negatives, but at the expense of misclassifying more positives.
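
Reusing the made-up proba and actual arrays from the Step 1 sketch, we can recompute the TPR and FPR at a few thresholds and see the same trade-off appear (the exact numbers differ from the loan example in the figures):

# same made-up proba and actual arrays as in the Step 1 sketch
for threshold in [0.0, 0.35, 0.5, 1.0]:
    predicted = (proba >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    tn = np.sum((predicted == 0) & (actual == 0))
    tpr = tp / (tp + fn)  # share of actual positives classified correctly
    fpr = fp / (fp + tn)  # share of actual negatives misclassified
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")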


Step 3: Plot the TPR and FPR for each cut-off

To plot the ROC curve, we need to calculate the TPR and FPR for many different thresholds.

For each threshold, we plot the FPR value on the x-axis and the TPR value on the y-axis. We then join the dots with a line. That’s it!

ROC curve example

The area below the line is called the “Area Under the Curve” (AUC). It is used to evaluate the performance of a classification model: the higher the AUC, the better the model is at distinguishing between the classes.
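
Before handing the work to scikit-learn, here is a rough from-scratch version of Step 3, again on the illustrative arrays from the sketches above: sweep a grid of thresholds, collect the (FPR, TPR) pairs, join them with a line, and approximate the AUC with the trapezoidal rule.

import numpy as np
import matplotlib.pyplot as plt

# sweep a grid of thresholds over the same made-up proba/actual arrays
# and collect the (FPR, TPR) pair at each one
thresholds = np.linspace(0, 1, 101)
tpr_list, fpr_list = [], []
for t in thresholds:
    predicted = (proba >= t).astype(int)
    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    tn = np.sum((predicted == 0) & (actual == 0))
    tpr_list.append(tp / (tp + fn))
    fpr_list.append(fp / (fp + tn))

# reverse so the curve runs from (0, 0) to (1, 1), then approximate the AUC
fpr_arr = np.array(fpr_list)[::-1]
tpr_arr = np.array(tpr_list)[::-1]
auc = np.trapz(tpr_arr, fpr_arr)  # trapezoidal-rule estimate of the area
print(f"AUC = {auc:.3f}")

plt.plot(fpr_arr, tpr_arr, marker='o')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve (from scratch)')
plt.show()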

Understanding the AUC-ROC Curve in Python

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate two-class dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, random_state=27)

# split into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)

# train models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# logistic regression
model1 = LogisticRegression()
# knn
model2 = KNeighborsClassifier(n_neighbors=4)

# fit models
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)

# predict probabilities
pred_prob1 = model1.predict_proba(X_test)
pred_prob2 = model2.predict_proba(X_test)

from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:, 1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:, 1], pos_label=1)

# roc curve for tpr = fpr (random classifier baseline)
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)

from sklearn.metrics import roc_auc_score

# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1[:, 1])
auc_score2 = roc_auc_score(y_test, pred_prob2[:, 1])

print(auc_score1, auc_score2)

# matplotlib
import matplotlib.pyplot as plt

plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'

# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--', color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--', color='green', label='KNN')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.savefig('ROC', dpi=300)
plt.show()




It is evident from the plot that the AUC for the Logistic Regression ROC curve is higher than that for the KNN ROC curve. Therefore, we can say that logistic regression did a better job of classifying the positive class in the dataset.


AUC-ROC for Multi-Class Classification


The AUC-ROC curve is defined for binary classification problems, but we can extend it to multiclass problems by using the One vs All (also called One vs Rest) technique.

So, if we have three classes 0, 1, and 2, the ROC for class 0 will be generated as classifying 0 against not 0, i.e. 1 and 2. The ROC for class 1 will be generated as classifying 1 against not 1, and so on.

# multi-class classification
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# generate 3-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=20, n_informative=3, random_state=42)

# split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# fit model
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
pred_prob = clf.predict_proba(X_test)

# roc curve for each class (one vs rest)
fpr = {}
tpr = {}
thresh = {}

n_class = 3

for i in range(n_class):
    fpr[i], tpr[i], thresh[i] = roc_curve(y_test, pred_prob[:, i], pos_label=i)

# plotting
plt.plot(fpr[0], tpr[0], linestyle='--', color='orange', label='Class 0 vs Rest')
plt.plot(fpr[1], tpr[1], linestyle='--', color='green', label='Class 1 vs Rest')
plt.plot(fpr[2], tpr[2], linestyle='--', color='blue', label='Class 2 vs Rest')
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.savefig('Multiclass ROC', dpi=300)
plt.show()
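
If you also want a single summary number for the multiclass case, scikit-learn’s roc_auc_score accepts the full probability matrix and can macro-average the One vs Rest AUCs; a short sketch using the y_test and pred_prob variables from the block above:

# macro-average of the One-vs-Rest AUCs across the three classes
macro_auc = roc_auc_score(y_test, pred_prob, multi_class='ovr', average='macro')
print(f"Macro-averaged One-vs-Rest AUC: {macro_auc:.3f}")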
  

