The Receiver Operating Characteristic (ROC) curve is a visual representation of how well your classification model separates the positive and negative classes.
In this post, we will explore how the ROC curve is constructed from scratch in three visual steps.
Step 1: Getting classification model predictions
When we train a classification model, it gives us a predicted probability for each observation. In our example, this is the likelihood that a person will repay a loan.
These probabilities range between 0 and 1: the higher the value, the more likely the person is to repay the loan.
The next step is to find a threshold (cut-off) that classifies each probability as "will repay" or "won't repay".
[Figure: Classification model example]
- All predictions at or above this threshold are classified as "will repay".
- All predictions below this threshold are classified as "won't repay".
For the people who actually did repay their loan:
- If they were classified as "will repay", we have a True Positive (TP).
- If they were classified as "won't repay", we have a False Negative (FN).
For the people who actually didn't repay their loan:
- If they were classified as "won't repay", we have a True Negative (TN).
- If they were classified as "will repay", we have a False Positive (FP).
[Figure: Confusion matrix]
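To make the four outcomes concrete, here is a minimal Python sketch (the labels, probabilities, and threshold are made-up example values, not the data behind the figures above) that applies a cut-off and counts each case:
import numpy as np
# made-up example data: 1 = repaid the loan, 0 = didn't repay
actual = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
# made-up predicted probabilities of repaying
prob = np.array([0.9, 0.8, 0.65, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1])
threshold = 0.5
predicted = (prob >= threshold).astype(int)   # 1 = "will repay", 0 = "won't repay"
tp = np.sum((predicted == 1) & (actual == 1))   # actually repaid, classified "will repay"
fn = np.sum((predicted == 0) & (actual == 1))   # actually repaid, classified "won't repay"
tn = np.sum((predicted == 0) & (actual == 0))   # didn't repay, classified "won't repay"
fp = np.sum((predicted == 1) & (actual == 0))   # didn't repay, classified "will repay"
print("TP:", tp, "FN:", fn, "TN:", tn, "FP:", fp)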
Step 2: Calculate the True Positive Rate (TPR) and the False Positive Rate (FPR)
[Figure: Calculations for TPR and FPR]
For every possible cut-off, we calculate two rates:
- True Positive Rate (TPR): of all people who did repay in the past, what percentage did we classify correctly? TPR = TP / (TP + FN)
- False Positive Rate (FPR): of all people who didn't repay in the past, what percentage did we misclassify? FPR = FP / (FP + TN)
At one particular cut-off in our example, we:
- correctly classified 90% of all positives, those who "paid back" (TPR)
- misclassified 40% of all negatives, those who "didn't pay back" (FPR)
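As a quick sanity check of those two numbers, the calculation takes only a few lines of Python (the four confusion-matrix counts below are made-up values chosen so the rates come out to 90% and 40%; they are not the article's underlying data):
# hypothetical confusion-matrix counts at one cut-off
tp, fn = 9, 1    # actual repayers: 9 classified correctly, 1 missed
fp, tn = 4, 6    # actual non-repayers: 4 misclassified, 6 classified correctly
tpr = tp / (tp + fn)   # share of actual repayers classified correctly
fpr = fp / (fp + tn)   # share of actual non-repayers misclassified
print(f"TPR = {tpr:.0%}, FPR = {fpr:.0%}")   # TPR = 90%, FPR = 40%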
Two extreme cut-offs show the range these rates can take:
If the cut-off is set at the lowest predicted probability, everyone is classified as "will repay":
- All positives are correctly classified, therefore TPR = 100%
- All negatives are misclassified, hence FPR = 100%
If the cut-off is set above the highest predicted probability, everyone is classified as "won't repay":
- All positives are misclassified, therefore TPR = 0%
- All negatives are correctly classified, hence FPR = 0%
Step 3: Plot the TPR and FPR for each cut-off
[Figure: ROC curve example]
The area covered below the line is called “Area Under the Curve (AUC)”. This is used to evaluate the performance of a classification model. The higher the AUC, the better the model is at distinguishing between classes.
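Before switching to scikit-learn, here is a small from-scratch sketch of how the curve and its AUC come together: sweep the cut-off across the predicted probabilities, record an (FPR, TPR) pair at each one, and approximate the area with the trapezoidal rule (again, the labels and probabilities are made-up example values):
import numpy as np
actual = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])                        # made-up outcomes
prob = np.array([0.9, 0.8, 0.65, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1])   # made-up probabilities
tprs, fprs = [], []
# start above every probability ("classify nobody as positive"), then lower the cut-off step by step
for threshold in np.concatenate(([1.1], np.sort(prob)[::-1])):
    predicted = (prob >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    tn = np.sum((predicted == 0) & (actual == 0))
    fp = np.sum((predicted == 1) & (actual == 0))
    tprs.append(tp / (tp + fn))
    fprs.append(fp / (fp + tn))
tprs, fprs = np.array(tprs), np.array(fprs)
auc = np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2)   # trapezoidal rule
print(f"AUC = {auc:.3f}")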
Understanding the AUC-ROC Curve in Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# generate two class dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=20, random_state=27)
# split into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
# train models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# logistic regression
model1 = LogisticRegression()
# knn
model2 = KNeighborsClassifier(n_neighbors=4)
# fit model
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
# predict probabilities
pred_prob1 = model1.predict_proba(X_test)
pred_prob2 = model2.predict_proba(X_test)
from sklearn.metrics import roc_curve
# roc curve for each model, using the probability of the positive class (column 1)
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)
# roc curve for tpr = fpr
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
from sklearn.metrics import roc_auc_score
# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1[:,1])
auc_score2 = roc_auc_score(y_test, pred_prob2[:,1])
print(auc_score1, auc_score2)
# matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # the plain 'seaborn' style name was removed in newer Matplotlib versions
# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--', color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--', color='green', label='KNN')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')
plt.legend(loc='best')
plt.savefig('ROC', dpi=300)
plt.show()
ROC curves can also be extended to multi-class problems using a one-vs-rest approach. If we have three classes 0, 1, and 2, the ROC for class 0 is generated by classifying 0 against not-0, i.e. classes 1 and 2. The ROC for class 1 is generated by classifying 1 against not-1, and so on.
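A minimal sketch of that one-vs-rest idea with scikit-learn might look like this (the three-class dataset and the logistic regression model are stand-ins for illustration, not the example from the code above):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
# made-up three-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4, random_state=27)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred_prob = model.predict_proba(X_test)  # one column of probabilities per class
# turn the labels into three binary columns: class i vs "not class i"
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
for i in range(3):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], pred_prob[:, i])
    auc_i = roc_auc_score(y_test_bin[:, i], pred_prob[:, i])
    print(f"class {i} vs rest: AUC = {auc_i:.3f}")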