Using ROC Curves & AUC


This is a snippet from my upcoming book ‘Data Badass’ (pictured below):


The ROC curve and the Area Under the Curve (AUC) are used for binary classification problems. The ROC curve plots the True Positive Rate against the False Positive Rate. Ideally, you want as many true positives and as few false positives as possible, so the point closest to the very top left of the chart corresponds to the best threshold to choose.

To compute the points of the ROC curve, we run the model using different classification thresholds. The threshold is the cut-off point between YES and NO. We do this because a binary classifier outputs the probability of belonging to class one or class two, and the default threshold of 0.5 (where anything above is YES and anything below is NO) does not always provide the best results. The ROC curve enables us to identify the threshold where the true positive rate is as high as possible while the false positive rate stays as low as possible, keeping misclassified predictions to a minimum.
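The threshold sweep above can be sketched in a few lines of NumPy. The labels and scores here are made-up illustrative values, and `roc_points` is a hypothetical helper name, not part of any library:

```python
import numpy as np

# Hypothetical true labels (1 = positive) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

def roc_points(y_true, y_score, thresholds):
    """One (FPR, TPR) point per classification threshold."""
    points = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)  # anything at or above t is YES
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        tpr = tp / (tp + fn)  # true positive rate (sensitivity)
        fpr = fp / (fp + tn)  # false positive rate
        points.append((fpr, tpr))
    return points

for fpr, tpr in roc_points(y_true, y_score, [0.3, 0.5, 0.7]):
    print(f"FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Each threshold gives one point on the curve; a stricter threshold (0.7) lowers both rates, while a looser one (0.3) raises both, which is exactly the trade-off the ROC curve visualises.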

As you can see in the above chart, we can compare two models. Generally, the model closest to the top left part of the chart is the best. We can summarise this with the AUC (Area Under the Curve) measure. An AUC of 1.0 means the model ranks every positive example above every negative one, an AUC of 0.5 is no better than random guessing, and an AUC of 0.0 means every prediction is ranked the wrong way round. The higher the AUC, the better the model is at separating the two classes.
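One helpful way to think about AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch, using the same made-up labels and scores as above (and ignoring tied scores, of which this toy data has none):

```python
import numpy as np

# Hypothetical true labels (1 = positive) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

pos = y_score[y_true == 1]  # scores of the actual positives
neg = y_score[y_true == 0]  # scores of the actual negatives

# Fraction of (positive, negative) pairs where the positive outranks the negative
auc = np.mean(pos[:, None] > neg[None, :])
print(auc)  # 0.875
```

In practice you would usually call a library routine (e.g. scikit-learn's `roc_auc_score`) rather than the pairwise comparison, but the pairwise version makes the ranking interpretation concrete.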

So in short, ROC curves help us find the best model threshold, while the AUC helps us measure the model's predictive power.

Terminology:
The True Positive Rate (also known as sensitivity) is the proportion of positive points which were correctly determined as positive. That's True Positive / (True Positive + False Negative). False Negative points SHOULD have been predicted as positive, so the total number of positive points in the dataset is the sum of true positives and false negatives.

The False Positive Rate is the proportion of negative points which were incorrectly determined as positive: False Positive / (False Positive + True Negative). Its complement, True Negative / (True Negative + False Positive), is known as specificity, so the False Positive Rate equals 1 − specificity.
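The terminology above boils down to a few ratios over the confusion matrix. A quick sketch with made-up counts (the numbers are purely illustrative):

```python
# Hypothetical confusion-matrix counts
tp = 40  # positives correctly predicted positive
fn = 10  # positives wrongly predicted negative
fp = 5   # negatives wrongly predicted positive
tn = 45  # negatives correctly predicted negative

tpr = tp / (tp + fn)          # sensitivity / true positive rate -> 0.8
fpr = fp / (fp + tn)          # false positive rate -> 0.1
specificity = tn / (tn + fp)  # -> 0.9, i.e. 1 - fpr

print(tpr, fpr, specificity)
```

Note that the denominators are the actual class totals: TPR divides by all real positives (tp + fn), while FPR and specificity divide by all real negatives (fp + tn).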

Kodey