Can you trust 70%?
A model says “70% confident.” Does that mean, across many such predictions, seven in ten turn out right? Sometimes. Often not. The number on the screen and the long-run frequency are two different quantities; their alignment is called calibration.
Confidence is a number. Truth is a frequency. Calibration is the gap. Calibration compares a predicted probability with the frequency of being right.
The promise of probability
When a model emits a number like 0.7 for a class, what should that number mean? The honest contract is the long-run one: across many examples that the model labeled “70%,” the truth lights up about 70% of the time. That contract is not enforced by anything in training; it has to be checked against held-out data.
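What that contract looks like when it holds, as a toy sketch with made-up numbers rather than a real model: draw outcomes so that reality honors every stated 70%, and the long-run frequency lands on 70%.
import numpy as np
# Hypothetical perfectly calibrated forecaster: 10,000 predictions, all stated
# at 0.7, with outcomes drawn so the stated probability is exactly honest.
rng = np.random.default_rng(1)
hits = rng.binomial(1, 0.7, size=10_000)
hits.mean()   # ≈ 0.70 — across many "70%" calls, about seven in ten come true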
Confidently wrong showed why softmax produces something that looks like a probability. This page asks the next question: does it mean what it looks like?
The reliability diagram
The standard way to see calibration is a reliability diagram: bin the predictions by confidence and, for each bin, plot the observed accuracy against the mean predicted probability. Perfect calibration puts every bar on the diagonal.
Bars sagging below the diagonal — the model claims more than it delivers — are the signature of overconfidence. Bars rising above it mean the model is too humble: it is right more often than it says. The vertical gap at each bin, weighted by how many predictions land there, sums to a single number — the expected calibration error, ECE.
import numpy as np
# Bin predictions; for each bin compute mean predicted prob and observed
# accuracy. The vertical gaps are the calibration error, by bin.
def reliability(probs, labels, n_bins=10):
    edges = np.linspace(0, 1, n_bins + 1)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1 else (probs <= hi))
        if mask.sum() == 0:
            out.append((float((lo + hi) / 2), None, 0))
            continue
        mean_p = float(probs[mask].mean())
        accuracy = float(labels[mask].mean())  # labels ∈ {0, 1}
        out.append((mean_p, accuracy, int(mask.sum())))
    return out
# Toy: 1000 examples drawn from a known truth(p), with labels sampled
# Bernoulli(truth(p)). The model SAYS p; reality returns truth(p), which is
# p squashed toward 0.5 (logit scaled by T < 1) — so the model is overconfident.
rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, size=1000)
truth = lambda p, T=0.55: 1 / (1 + np.exp(-T * np.log(p / (1 - p))))
labels = rng.binomial(1, truth(probs))
reliability(probs, labels, n_bins=10)
# → roughly [(0.05, 0.16, ...), ..., (0.95, 0.84, ...)]
# At "95% confident", reality delivers only about 84% — the model is overconfident.
Real models are overconfident — predictably
An empirical pattern reported across image classifiers, language models, and tabular networks alike: the bars sit below the diagonal, and the gap is widest in the high-confidence tail. Models that say “very sure” are wrong more often than the number admits; models that say “uncertain” are roughly honest. The shape, drawn over the diagonal, looks like a sigmoid that has been slightly squashed toward 0.5. That’s not a coincidence — it’s what you get if the true posterior is the model’s stated probability with its logit shrunk by a factor T < 1 (squashed back toward 0.5), which is exactly the truth(p) of the toy above.
In the widget, drag the temperature slider to the “overconfident” preset. The bars in the right tail sag dramatically — the highest-confidence bins land far below their labels. ECE jumps. Drag the slider the other way and the curve flips above the diagonal: underconfident, with bins landing higher than the confidence they are labeled with.
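The same toy pipeline reproduces the underconfident case, here with a hypothetical T = 1.8 rather than any particular widget preset: make reality sharper than the stated probability and the right-half bars rise above the diagonal.
# Underconfident variant of the toy above: T > 1 makes truth(p) MORE extreme
# than the stated p, so high-confidence bins land above the diagonal.
labels_under = rng.binomial(1, truth(probs, T=1.8))
reliability(probs, labels_under, n_bins=10)
# → bins in the right half land above their stated probability; left half below.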
# Expected calibration error (ECE): weighted average of bin gaps.
def ece(probs, labels, n_bins=10):
    bins = reliability(probs, labels, n_bins)
    n = sum(c for _, _, c in bins)
    return sum(c * abs(p - a) for p, a, c in bins if a is not None) / n
ece(probs, labels, 10)  # ≈ 0.08–0.10 (an 8–10% calibration gap on average for the toy above)
# 0 means perfect — every bar lies on the diagonal. ~0.05 is "lab-grade
# calibrated"; modern deep nets often start at 0.10–0.30 out of the box.
Linearize at one bin — the local fix
Click any bar in the widget. The brown line that appears is the local tangent: the straight-line approximation of the reliability curve at that bin’s center.
Why does this matter? Because it tells you what kind of fix is needed. Slope ≈ 1 means “the curve is parallel to the diagonal here, just shifted” — a constant additive correction works. Slope ≠ 1 means the gap changes with confidence — and the right correction must rotate the curve toward the diagonal, not just shift it. That rotation is exactly what one parameter — temperature — buys you, globally.
# Local linearization at one bin: y ≈ accuracy(c) + slope·(p - c).
# If slope ≈ 1, the curve is parallel to truth — a constant shift, easy to
# fix. If slope ≠ 1, the gap CHANGES with confidence, which is exactly
# what one scalar (temperature) can rotate away.
def local_slope(p_centers, accuracies, i):
    # central difference; falls back to one-sided at the edges.
    if i == 0:
        return (accuracies[1] - accuracies[0]) / (p_centers[1] - p_centers[0])
    if i == len(p_centers) - 1:
        return (accuracies[-1] - accuracies[-2]) / (p_centers[-1] - p_centers[-2])
    return (accuracies[i+1] - accuracies[i-1]) / (p_centers[i+1] - p_centers[i-1])
# At the bin centered at 0.85, the slope tells you the "local fix":
# slope == 1 means subtract a constant; slope < 1 means squash toward 0.5.
Temperature scaling — one scalar, post-hoc
The recipe: take the trained model. Don’t retrain. Don’t change architecture. Take the raw logits, divide every one by a single scalar T fit on a held-out validation set, and put the softmax back on top.
Why this works: dividing the logits by T > 1 flattens the softmax — every output probability moves toward the uniform 1/K. Across many examples, that pulls down the right-tail bars (the overconfident region) more than it pulls up the middle, exactly undoing the squash that produced the sigmoid bend. It’s a remarkably cheap fix for a remarkably common failure mode — and the first thing to try whenever a reliability diagram bows below the diagonal.
# Temperature scaling: divide every logit by T before softmax.
# argmax is preserved (accuracy unchanged); only confidence is rescaled.
def softmax(z, T=1.0):
    s = z / T
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)
# Fit T on a held-out validation set by minimizing log-loss in T.
from scipy.optimize import minimize_scalar
def fit_temperature(logits, y):
    def nll(T):
        p = softmax(logits, T=T)
        # negative log-likelihood of the true class
        return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(res.x)
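# A quick sanity check on synthetic data — made-up logits, not the page's model.
# With unit Gaussian noise and uniform classes, softmax(z) below is the true
# posterior; sharpening the logits 2.5x fakes an overconfident model, and the
# fitted temperature should recover roughly that factor.
y_demo = rng.integers(0, 3, size=5000)
z = rng.normal(size=(5000, 3))
z[np.arange(5000), y_demo] += 1.0
fit_temperature(2.5 * z, y_demo)   # ≈ 2.5, up to sampling noise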
# T > 1 → softer; T < 1 → sharper. Fitted temperatures for modern deep nets
# typically land around 1.5–3, taming overconfidence in the high-probability tail.
Temperature scaling assumes the miscalibration is the same shape everywhere in input space — one global rotation can fix it. Real models often miscalibrate differently on different slices: easy examples confidently right, hard examples confidently wrong, out-of-distribution inputs absurdly confident. A single T averages the gap, which can leave both regimes worse than nothing. The honest tell: ECE drops on validation, but the right tail of the held-out reliability diagram is still bowed. The fix isn’t a bigger T; it’s a model that asks “is this input near anything I’ve seen?” — a separate piece of machinery (selective prediction, conformal sets, density estimators) that lives outside the softmax entirely.
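One of those pieces is simple enough to sketch. A minimal selective-prediction baseline, using a made-up threshold rather than a recommendation, answers only when the temperature-scaled confidence clears a bar and reports how much coverage that costs.
# Minimal selective-prediction sketch (illustrative threshold tau, not a recipe):
# abstain when the top softmax confidence is below tau; report coverage and
# accuracy on the predictions that remain.
def selective(logits, y, T=1.0, tau=0.8):
    p = softmax(logits, T=T)
    conf = p.max(axis=1)
    keep = conf >= tau                       # answer only when confident enough
    coverage = float(keep.mean())            # fraction of inputs answered at all
    accuracy = float((p.argmax(axis=1) == y)[keep].mean()) if keep.any() else None
    return coverage, accuracy
# Raising tau buys accuracy by refusing to answer; it cannot, by itself, catch
# inputs the model is confidently wrong about — that needs an out-of-distribution
# signal, which is exactly the machinery named above.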
Confidence is a number. Truth is a frequency. Calibration is the gap between them. A reliability diagram makes the gap visible; a tangent at one bin tells you the local fix; one scalar — temperature — rotates the whole curve back toward the diagonal.
In the widget, choose the overconfident preset. Click the bin centered at 0.85 and read off the bar height. Out of every 100 predictions the model labels “85% confident,” about how many are actually correct? Now do the same for a bin near the middle of the range. What kind of input is the model still being honest about?
A friend looks at a reliability diagram where every bar in the right half lies below the diagonal and says: “great — the model is being modest.” Write a one-sentence reply that distinguishes modest from overconfident. Which interpretation matches “below the diagonal”?
In the overconfident regime, click the bin centered at 0.85. The widget reports a local slope. Call that slope s and the bar height a. Write the local linear approximation in the form y ≈ a + s·(p − 0.85), i.e., compute the predicted accuracy as a function of the stated confidence p. Use it to predict the calibration error at a nearby confidence.
From the confidently wrong page: cross-entropy is the negative log of the probability the model assigns to the true class. Sketch — in words, no math required — why a model trained to minimize this loss has an incentive to be overconfident on the training set, and why this incentive doesn’t translate into good calibration on held-out data.
ML courses introduce softmax, cross-entropy, and accuracy. They almost never introduce calibration. The reason is partly cultural — accuracy is what leaderboards track — and partly historical: classical learning theory cared about decision boundaries (which the argmax decides), not the probabilities the model emits along the way. But every shipped model that says “70% sure” makes the calibration claim implicitly. Lemma puts the diagram next to softmax so the reader can see the gap between “looks like a probability” and “matches frequency” — a separate measurement, against held-out truth, that no amount of cross-entropy training can replace.