Can you trust 70%?
A model says “70% confident.” Does that mean, across many such predictions, seven in ten turn out right? Sometimes. Often not. The number on the screen and the long-run frequency are two different quantities; their alignment is called calibration.
Confidence is a number. Truth is a frequency. Calibration is the gap. Calibration compares a predicted probability with the frequency of being right.
The promise of probability
When a model emits a number like 0.7 for a class, what should that number mean? The honest contract is the long-run one: across many examples that the model labeled “70%,” the truth lights up about 70% of the time. That contract is not enforced by anything in training; it has to be checked against held-out data.
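What that contract looks like when it holds, as a toy sketch with made-up numbers rather than a real model: draw outcomes so that reality honors every stated 70%, and the long-run frequency lands on 70%.
import numpy as np
# Hypothetical perfectly calibrated forecaster: 10,000 predictions, all stated
# at 0.7, with outcomes drawn so the stated probability is exactly honest.
rng = np.random.default_rng(1)
hits = rng.binomial(1, 0.7, size=10_000)
hits.mean()   # ≈ 0.70 — across many "70%" calls, about seven in ten come true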
Confidently wrong showed why softmax produces something that looks like a probability. This page asks the next question: does it mean what it looks like?
The reliability diagram
The standard way to see calibration is a reliability diagram: bin the predictions by confidence and, for each bin, plot the observed accuracy against the mean predicted probability. Perfect calibration puts every bar on the diagonal.
Bars sagging below the diagonal — the model claims more than it delivers — are the signature of overconfidence. Bars rising above it mean the model is too humble: it is right more often than it says. The vertical gap at each bin, weighted by how many predictions land there, sums to a single number — the expected calibration error, ECE.
import numpy as np
# Bin predictions; for each bin compute mean predicted prob and observed
# accuracy. The vertical gaps are the calibration error, by bin.
def reliability(probs, labels, n_bins=10):
    edges = np.linspace(0, 1, n_bins + 1)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1 else (probs <= hi))
        if mask.sum() == 0:
            out.append((float((lo + hi) / 2), None, 0))
            continue
        mean_p = float(probs[mask].mean())
        accuracy = float(labels[mask].mean())  # labels ∈ {0, 1}
        out.append((mean_p, accuracy, int(mask.sum())))
    return out
# Toy: 1000 examples drawn from a known truth(p), with labels sampled
# Bernoulli(truth(p)). The model SAYS p; reality returns truth(p), which is
# p squashed toward 0.5 (logit scaled by T < 1) — so the model is overconfident.
rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, size=1000)
truth = lambda p, T=0.55: 1 / (1 + np.exp(-T * np.log(p / (1 - p))))
labels = rng.binomial(1, truth(probs))
reliability(probs, labels, n_bins=10)
# → roughly [(0.05, 0.16, ...), ..., (0.95, 0.84, ...)]
# At "95% confident", reality delivers only about 84% — the model is overconfident.
Real models are overconfident — predictably
An empirical pattern reported across image classifiers, language models, and tabular networks alike: the bars sit below the diagonal, and the gap is widest in the high-confidence tail. Models that say “very sure” are wrong more often than the number admits; models that say “uncertain” are roughly honest. The shape, drawn over the diagonal, looks like a sigmoid that has been slightly squashed toward 0.5. That’s not a coincidence — it’s what you get if the true posterior is the model’s stated probability with its logit shrunk by a factor T < 1 (squashed back toward 0.5), which is exactly the truth(p) of the toy above.
In the widget, drag the temperature slider to the “overconfident” preset. The bars in the right tail sag dramatically — the highest-confidence bins land far below their labels. ECE jumps. Drag the slider the other way and the curve flips above the diagonal: underconfident, with bins landing higher than the confidence they are labeled with.
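The same toy pipeline reproduces the underconfident case, here with a hypothetical T = 1.8 rather than any particular widget preset: make reality sharper than the stated probability and the right-half bars rise above the diagonal.
# Underconfident variant of the toy above: T > 1 makes truth(p) MORE extreme
# than the stated p, so high-confidence bins land above the diagonal.
labels_under = rng.binomial(1, truth(probs, T=1.8))
reliability(probs, labels_under, n_bins=10)
# → bins in the right half land above their stated probability; left half below.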
# Expected calibration error (ECE): weighted average of bin gaps.
def ece(probs, labels, n_bins=10):
    bins = reliability(probs, labels, n_bins)
    n = sum(c for _, _, c in bins)
    return sum(c * abs(p - a) for p, a, c in bins if a is not None) / n
ece(probs, labels, 10)  # ≈ 0.08–0.10 (an 8–10% calibration gap on average for the toy above)
# 0 means perfect — every bar lies on the diagonal. ~0.05 is "lab-grade
# calibrated"; modern deep nets often start at 0.10–0.30 out of the box.
Linearize at one bin — the local fix
Click any bar in the widget. The brown line that appears is the local tangent: the straight-line approximation of the reliability curve at that bin’s center.
Why does this matter? Because it tells you what kind of fix is needed. Slope ≈ 1 means “the curve is parallel to the diagonal here, just shifted” — a constant additive correction works. Slope ≠ 1 means the gap changes with confidence — and the right correction must rotate the curve toward the diagonal, not just shift it. That rotation is exactly what one parameter — temperature — buys you, globally.
# Local linearization at one bin: y ≈ accuracy(c) + slope·(p - c).
# If slope ≈ 1, the curve is parallel to truth — a constant shift, easy to
# fix. If slope ≠ 1, the gap CHANGES with confidence, which is exactly
# what one scalar (temperature) can rotate away.
def local_slope(p_centers, accuracies, i):
    # central difference; falls back to one-sided at the edges.
    if i == 0:
        return (accuracies[1] - accuracies[0]) / (p_centers[1] - p_centers[0])
    if i == len(p_centers) - 1:
        return (accuracies[-1] - accuracies[-2]) / (p_centers[-1] - p_centers[-2])
    return (accuracies[i+1] - accuracies[i-1]) / (p_centers[i+1] - p_centers[i-1])
# At the bin centered at 0.85, the slope tells you the "local fix":
# slope == 1 means subtract a constant; slope < 1 means squash toward 0.5.
Temperature scaling — one scalar, post-hoc
The recipe: take the trained model. Don’t retrain. Don’t change architecture. Take the raw logits, divide every one by a single scalar T fit on a held-out validation set, and put the softmax back on top.
Why this works: dividing the logits by T > 1 flattens the softmax — every output probability moves toward the uniform 1/K. Across many examples, that pulls down the right-tail bars (the overconfident region) more than it pulls up the middle, exactly undoing the squash that produced the sigmoid bend. It’s a remarkably cheap fix for a remarkably common failure mode — and the first thing to try whenever a reliability diagram bows below the diagonal.
# Temperature scaling: divide every logit by T before softmax.
# argmax is preserved (accuracy unchanged); only confidence is rescaled.
def softmax(z, T=1.0):
    s = z / T
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)
# Fit T on a held-out validation set by minimizing log-loss in T.
from scipy.optimize import minimize_scalar
def fit_temperature(logits, y):
    def nll(T):
        p = softmax(logits, T=T)
        # negative log-likelihood of the true class
        return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(res.x)
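# A quick sanity check on synthetic data — made-up logits, not the page's model.
# With unit Gaussian noise and uniform classes, softmax(z) below is the true
# posterior; sharpening the logits 2.5x fakes an overconfident model, and the
# fitted temperature should recover roughly that factor.
y_demo = rng.integers(0, 3, size=5000)
z = rng.normal(size=(5000, 3))
z[np.arange(5000), y_demo] += 1.0
fit_temperature(2.5 * z, y_demo)   # ≈ 2.5, up to sampling noise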
# T > 1 → softer; T < 1 → sharper. Fitted temperatures for modern deep nets
# typically land around 1.5–3, taming overconfidence in the high-probability tail.
Temperature scaling assumes the miscalibration is the same shape everywhere in input space — one global rotation can fix it. Real models often miscalibrate differently on different slices: easy examples confidently right, hard examples confidently wrong, out-of-distribution inputs absurdly confident. A single T averages the gap, which can leave both regimes worse than nothing. The honest tell: ECE drops on validation, but the right tail of the held-out reliability diagram is still bowed. The fix isn’t a bigger T; it’s a model that asks “is this input near anything I’ve seen?” — a separate piece of machinery (selective prediction, conformal sets, density estimators) that lives outside the softmax entirely.
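One of those pieces is simple enough to sketch. A minimal selective-prediction baseline, using a made-up threshold rather than a recommendation, answers only when the temperature-scaled confidence clears a bar and reports how much coverage that costs.
# Minimal selective-prediction sketch (illustrative threshold tau, not a recipe):
# abstain when the top softmax confidence is below tau; report coverage and
# accuracy on the predictions that remain.
def selective(logits, y, T=1.0, tau=0.8):
    p = softmax(logits, T=T)
    conf = p.max(axis=1)
    keep = conf >= tau                       # answer only when confident enough
    coverage = float(keep.mean())            # fraction of inputs answered at all
    accuracy = float((p.argmax(axis=1) == y)[keep].mean()) if keep.any() else None
    return coverage, accuracy
# Raising tau buys accuracy by refusing to answer; it cannot, by itself, catch
# inputs the model is confidently wrong about — that needs an out-of-distribution
# signal, which is exactly the machinery named above.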
Confidence is a number. Truth is a frequency. Calibration is the gap between them. A reliability diagram makes the gap visible; a tangent at one bin tells you the local fix; one scalar — temperature — rotates the whole curve back toward the diagonal.
In the widget, choose the overconfident preset. Click the bin centered at 0.85 and read off the bar height. Out of every 100 predictions the model labels “85% confident,” about how many are actually correct? Now do the same for a bin near the middle of the range. What kind of input is the model still being honest about?
A friend looks at a reliability diagram where every bar in the right half lies below the diagonal and says: “great — the model is being modest.” Write a one-sentence reply that distinguishes modest from overconfident. Which interpretation matches “below the diagonal”?
In the overconfident regime, click the bin centered at 0.85. The widget reports a local slope. Call that slope s and the bar height a. Write the local linear approximation in the form y ≈ a + s·(p − 0.85), i.e., compute the predicted accuracy as a function of the stated confidence p. Use it to predict the calibration error at a nearby confidence.
From the confidently wrong page: cross-entropy is the negative log of the probability the model assigns to the true class. Sketch — in words, no math required — why a model trained to minimize this loss has an incentive to be overconfident on the training set, and why this incentive doesn’t translate into good calibration on held-out data.
ML courses introduce softmax, cross-entropy, and accuracy. They almost never introduce calibration. The reason is partly cultural — accuracy is what leaderboards track — and partly historical: classical learning theory cared about decision boundaries (which the argmax decides), not the probabilities the model emits along the way. But every shipped model that says “70% sure” makes the calibration claim implicitly. Lemma puts the diagram next to softmax so the reader can see the gap between “looks like a probability” and “matches frequency” — a separate measurement, against held-out truth, that no amount of cross-entropy training can replace.