Thursday, May 4, 2017

Revisiting ROC curves for biased datasets

In a recent discussion with a lab mate, I raised the question of whether ROC curves are sensitive to imbalances in the number of positive and negative samples.

This question comes up in the context of protein science, where it is common for a scientist to collect a small data set of observed functional proteins and later develop a model from that data set to predict other functional sequences. This objective boils down to a classification task in which the training data set is a small sample of true functional proteins with no true negatives. The naive approach is to assume that all other proteins are non-functional; however, this hypothesized negative space is often orders of magnitude larger than the set of true functional proteins.

Typically, ROC (Receiver Operating Characteristic) curves are used to evaluate classifier performance. To generate a ROC curve, we start with a test set of positive and negative samples. Each test sample is evaluated with the model, which maps it to a numerical score. The samples are sorted by predicted score, and a range of thresholds is then evaluated to generate the curve. At each threshold, a False Positive Rate and a True Positive Rate are calculated and plotted against each other. The area under this curve is known as the AUC, and it summarizes model performance in a single number.

True Positive Rate = TP / (TP + FN) = Percent of positive samples correctly classified as positive
False Positive Rate = FP / (FP + TN) = Percent of negative samples incorrectly classified as positive

(A model with a threshold that perfectly separates the two populations has 100% TPR and 0% FPR at that threshold.)
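The procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from my notebook: the function names `roc_points` and `auc`, the convention that a sample is called positive when its score is at or above the threshold, and the toy scores are all assumptions made for the example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold from high to low.

    labels: 1 for positive, 0 for negative. A sample is called positive
    when its score is >= the threshold (an assumed convention).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort samples by descending score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy test set with made-up scores: three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
toy_points = roc_points(scores, labels)
toy_auc = auc(toy_points)
print(toy_points)
print(toy_auc)
```

In this toy set, 8 of the 9 positive/negative pairs are ranked correctly (only the positive at 0.6 falls below the negative at 0.7), so the area comes out to 8/9.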

In the extremely imbalanced case described above, we have many more hypothesized negatives than positive samples. The question I would like to address is whether varying the number of negatives changes the AUC. Does it matter if we have 1,000 negatives or 10,000 negatives?

In short, it depends on how you vary the negative population.

First, it should be noted that at any given threshold the FPR and TPR are computed from disjoint sets of samples: the TPR depends only on how the positives were classified, and the FPR depends only on how the negatives were classified. Because changing the negatives leaves the TPR untouched, I will only discuss what happens to the FPR when we vary the population of negatives.

AUC is insensitive to the number of negatives if the FPR at each threshold remains unchanged between the small and large populations. If the small and large negative populations come from the same distribution, the FPR should not change (up to sampling noise), and thus the AUC should not change.

However, if the FPR at a given threshold differs between the small and large populations, then the AUC will change. If you add a large number of negatives that are easy to classify as negative, your AUC will be inflated. These easy negatives increase TN without adding false positives, thus pushing the FPR closer to zero.
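The two cases can be checked with a little arithmetic at a single fixed threshold. The counts below (100 false positives and 900 true negatives out of 1,000 negatives) are hypothetical numbers picked only to make the effect visible.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical counts at one fixed threshold, with 1,000 negatives.
fp, tn = 100, 900
fpr_small = false_positive_rate(fp, tn)

# Case 1: 11,000 negatives from the same distribution. On average both
# FP and TN grow by the same factor, so the ratio is unchanged.
fpr_same = false_positive_rate(fp * 11, tn * 11)

# Case 2: add 10,000 easy negatives that all score below the threshold.
# They only add to TN, so the FPR shrinks.
fpr_easy = false_positive_rate(fp, tn + 10_000)

print(fpr_small, fpr_same, fpr_easy)
```

Here the FPR stays at 0.1 in the first case and drops to 100 / 11,000 (about 0.009) in the second, even though the model and the threshold are exactly the same.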

It makes more sense once we see an example.
In the test case, we observe 1,000 positives, and this number is held fixed. We want to see whether it makes a difference if we have 1,000 negatives or 11,000 negatives. In the notebook, we see that when we add 10,000 more negatives from the same population, the AUC is insensitive. However, when we add 10,000 negatives that are very easy for the model to classify, the AUC is inflated.

http://nbviewer.jupyter.org/gist/xuevin/edc323adfe9de2ed4c2d34d39f906c7b
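For readers who do not want to open the notebook, here is a rough, independent sketch of the same kind of experiment using only the standard library. It is not the notebook's code: the Gaussian score distributions (positives around 1, negatives around 0, "easy" negatives around -5), the seed, and the helper name `rank_auc` are assumptions chosen for illustration. The AUC is computed as the fraction of positive/negative pairs in which the positive scores higher, which is equivalent to the area under the ROC curve.

```python
import random


def rank_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as one half.
    """
    ranked = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    wins = 0.0
    neg_below = 0  # negatives with a strictly lower score than the current group
    i = 0
    while i < len(ranked):
        j = i
        pos_tied = neg_tied = 0
        # Group samples that share the same score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                pos_tied += 1
            else:
                neg_tied += 1
            j += 1
        wins += pos_tied * (neg_below + 0.5 * neg_tied)
        neg_below += neg_tied
        i = j
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed score distributions (illustrative only).
positives = [rng.gauss(1.0, 1.0) for _ in range(1000)]
negatives_small = [rng.gauss(0.0, 1.0) for _ in range(1000)]
more_same = [rng.gauss(0.0, 1.0) for _ in range(10_000)]   # same population
more_easy = [rng.gauss(-5.0, 1.0) for _ in range(10_000)]  # trivially negative

auc_small = rank_auc(positives, negatives_small)
auc_same = rank_auc(positives, negatives_small + more_same)
auc_easy = rank_auc(positives, negatives_small + more_easy)

print("1,000 negatives:                 ", round(auc_small, 3))
print("11,000 negatives, same population:", round(auc_same, 3))
print("1,000 + 10,000 easy negatives:    ", round(auc_easy, 3))
```

Under these assumed distributions, the first two AUCs should agree closely (around 0.76 in expectation), while the third is pushed close to 1. One way to see why: the AUC against a mixed set of negatives is the size-weighted average of the AUC against each subset, so 10,000 negatives with an AUC near 1 swamp the 1,000 harder ones.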


The problem with applying AUC analysis to imbalanced data is that we can inflate the AUC score by adding easily classified negative sequences. If varying the number of negatives changes the FPR, the AUC changes with it. That is, if we add many easy negative samples, the FPR is pushed closer to zero, and the AUC moves closer to 1.

Coming full circle back to the protein space, it is clear that the selection of the negative samples can influence the reported AUC. If millions of "easy" negatives are added in the evaluation of the test set, then we should be wary of the reported AUC.