# Compare Classifier Predictions using Reliability Diagrams

This notebook generates reliability diagrams for some classifiers on an artificial data set. Reliability diagrams allow checking if the predicted probabilities of a binary classifier are well calibrated. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal/identity. This would correspond to a situation in which among $N$ instances for which a classifier predicts probability $p$ for class $A$, the ratio of instances which actually belong to class $A$ is approx. $p$ (for any $p$ and sufficiently large $N$).

This notebook reproduces some of the results from the paper Predicting Good Probabilities with Supervised Learning.

In [1]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn

Jan Hendrik Metzen 16/08/2014

CPython 2.7.7
IPython 2.1.0

numpy 1.8.1
scikit-learn 0.14.1

compiler   : GCC 4.1.2 20080704 (Red Hat 4.1.2-54)
system     : Linux
release    : 3.13.0-29-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

In [2]:
import numpy as np
np.random.seed(0)

from sklearn import datasets

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline


### Function for reliability curve computation¶

In [4]:
def reliability_curve(y_true, y_score, bins=10, normalize=False):
"""Compute reliability curve

Reliability curves allow checking if the predicted probabilities of a
binary classifier are well calibrated. This function returns two arrays
which encode a mapping from predicted probability to empirical probability.
For this, the predicted probabilities are partitioned into equally sized
bins and the mean predicted probability and the mean empirical probabilties
in the bins are computed. For perfectly calibrated predictions, both
quantities whould be approximately equal (for sufficiently many test
samples).

Note: this implementation is restricted to binary classification.

Parameters
----------

y_true : array, shape = [n_samples]
True binary labels (0 or 1).

y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive
class or confidence values. If normalize is False, y_score must be in
the interval [0, 1]

bins : int, optional, default=10
The number of bins into which the y_scores are partitioned.
Note: n_samples should be considerably larger than bins such that
there is sufficient data in each bin to get a reliable estimate
of the reliability

normalize : bool, optional, default=False
Whether y_score needs to be normalized into the bin [0, 1]. If True,
the smallest value in y_score is mapped onto 0 and the largest one
onto 1.

Returns
-------
y_score_bin_mean : array, shape = [bins]
The mean predicted y_score in the respective bins.

empirical_prob_pos : array, shape = [bins]
The empirical probability (frequency) of the positive class (+1) in the
respective bins.

References
----------
.. [1] Predicting Good Probabilities with Supervised Learning
<http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf>_

"""
if normalize:  # Normalize scores into bin [0, 1]
y_score = (y_score - y_score.min()) / (y_score.max() - y_score.min())

bin_width = 1.0 / bins
bin_centers = np.linspace(0, 1.0 - bin_width, bins) + bin_width / 2

y_score_bin_mean = np.empty(bins)
empirical_prob_pos = np.empty(bins)
for i, threshold in enumerate(bin_centers):
# determine all samples where y_score falls into the i-th bin
bin_idx = np.logical_and(threshold - bin_width / 2 < y_score,
y_score <= threshold + bin_width / 2)
# Store mean y_score and mean empirical probability of positive class
y_score_bin_mean[i] = y_score[bin_idx].mean()
empirical_prob_pos[i] = y_true[bin_idx].mean()
return y_score_bin_mean, empirical_prob_pos


### Training data¶

Generate a toy dataset on which different classifiers are compared. Among the 20 features, only 2 are actually informative. 2 further features are redundant, i.e., linear combinations of the 2 informative features. The remaining features are just noise.

In [5]:
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)

In [6]:
bins = 25
train_samples = 100  # Samples used for training the models
calibration_samples =  400  # Additional samples for claibration using Isotonic Regression

X_train = X[:train_samples]
X_calibration = X[train_samples:train_samples+calibration_samples]
X_test = X[train_samples+calibration_samples:]
y_train = y[:train_samples]
y_calibration = y[train_samples:train_samples+calibration_samples]
y_test = y[train_samples+calibration_samples:]


### Compute reliability curves for different classifiers¶

Compute reliability curves for different classifiers:

• Logistic Regression
• Naive Bayes
• Random Forest
• Support-Vector Classification (scores)
• Support-Vector Classification + Isotonoc Calibration
In [7]:
classifiers = {"Logistic regression": LogisticRegression(),
"Naive Bayes": GaussianNB(),
"Random Forest": RandomForestClassifier(n_estimators=100),
"SVC": SVC(kernel='linear', C=1.0),
"SVC + IR": SVC(kernel='linear', C=1.0)}

In [ ]:
reliability_scores = {}
y_score = {}
for method, clf in classifiers.items():
clf.fit(X_train, y_train)
if method == "SVC + IR":  # Calibrate SVC scores using isotonic regression.
n_plus = (y_calibration == 1.0).sum()  # number of positive examples
n_minus = (y_calibration == 0.0).sum()  # number of negative examples
# Determine target values for isotonic calibration. See
# "Predicting Good Probabilities with Supervised Learning"
# for details.
y_target = np.where(y_calibration == 0.0,
1.0 / (n_minus + 2.0),
(n_plus + 1.0) / (n_plus + 2.0))
# Perform actual calibration using isotonic calibration
svm_score = clf.decision_function(X_calibration)[:, 0]
ir = IsotonicRegression(out_of_bounds='clip').fit(svm_score, y_target)
y_score[method] = ir.transform(clf.decision_function(X_test)[:, 0])
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)
elif method == "SVC":
# Use SVC scores (predict_proba returns already calibrated probabilities)
y_score[method] = clf.decision_function(X_test)[:, 0]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=True)
else:
y_score[method] = clf.predict_proba(X_test)[:, 1]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)


### Plot reliability diagram¶

In [9]:
plt.figure(0, figsize=(8, 8))
plt.subplot2grid((3, 1), (0, 0), rowspan=2)
plt.plot([0.0, 1.0], [0.0, 1.0], 'k', label="Perfect")
for method, (y_score_bin_mean, empirical_prob_pos) in reliability_scores.items():
scores_not_nan = np.logical_not(np.isnan(empirical_prob_pos))
plt.plot(y_score_bin_mean[scores_not_nan],
empirical_prob_pos[scores_not_nan], label=method)
plt.ylabel("Empirical probability")
plt.legend(loc=0)

plt.subplot2grid((3, 1), (2, 0))
for method, y_score_ in y_score.items():
y_score_ = (y_score_ - y_score_.min()) / (y_score_.max() - y_score_.min())
plt.hist(y_score_, range=(0, 1), bins=bins, label=method,
histtype="step", lw=2)
plt.xlabel("Predicted Probability")
plt.ylabel("Count")
plt.legend(loc='upper center', ncol=2)

Out[9]:
<matplotlib.legend.Legend at 0x7f8e75ee1890>

The following observations can be made:

• Logistic regression returns well-calibrated probabilities close to the "perfect" line
• Naive Bayes tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.
• Random Forest shows the opposite behavior: The histograms show peaks at approx. 0.1 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically. A post-processing such as Platt-calibration, which fits a sigmoid to the probabilities on a separate calibration dataset, would typically help if the calibration curve is sigmoid.
• The scores of a Support Vector Classification (SVC), which are linearly related to the distance of the sample from the hyperplane, show a similar but even stronger effect as the Random Forest. This is not too surprising as the scores are in no sense probabilties and must not be interpreted as such as the curve shows.
• One alternative to Platt-calibration is Isotonic Regression. While Platt-calibration fits a sigmoid, Isotonic Regression fits an arbitrary increasing (isotonic) function. Thus, it has a weaker inducttive bias and can be applied more broadly (also in situations where the calibration curve is not sigmoid). The downside is that it typically requires more calibration data because its inductive bias is weaker. This can also be seen in the SVC + IR curve: While the sigmoid shape of the pure SVC scores is removed and the calibration curve does not show a clear bias, it is quite noisy, indicating much variance. Thus, the used calibration dataset (even thoug 4 times larger than the training data) is too small in this case.

For a further discussion and more extensive experiments, please refer to Niculescu-Mizil and Caruana.

This post was written as an IPython notebook. You can download this notebook.