Lemma
math, backwards
the hook · scores → probabilities → confidence

Why is the model so confident about a wrong answer?

A model does not know it is right. It has scores. Softmax turns those scores into numbers that look like probabilities. Cross-entropy punishes the probability the model gave to the correct answer. The trap is hidden in plain sight: a bad score can still become a very confident probability. That is the entire page. Everything else is consequence.

Softmax does not check truth. It only compares scores. Softmax returns a distribution over labels, not a truth certificate.

Widget — Score cooker
[interactive widget: logit sliders, a temperature dial, and a true-label star. Bars show the softmax probabilities (cat 62.9%, dog 23.1%, fox 14.0%); readouts show the winner (cat, 62.9%), the true label (cat), p_true (62.9%), and the loss −log p_true (0.464).]
Try this. Set cat as the true label, switch the view to loss, then drag fox's logit up to 5.0. The bars show fox winning at ~94%; the orange column above the cat bar is the loss and climbs past 3. The model is confident and wrong. Now drag temperature toward 0.1: confidence rises and the loss column climbs further. The truth never moved. Softmax doesn't check it; it only compares scores.
the arc
1

Scores aren't probabilities

The last layer of a classifier outputs three numbers — one per class. They are not constrained to be positive. They are not constrained to sum to anything. They are just logits: raw scores. z = [2.0, 1.0, 0.5] says only that the model “leaned” toward class 0 the most. It does not say “65% chance of class 0.” That number doesn’t exist yet.

We need a function that maps three real numbers to three positive numbers summing to 1. Many such functions exist. The one we use is softmax — and the reason it’s that one, not another, is the point of the next step.
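
A two-line check makes the gap concrete (a minimal numpy sketch; the logit values are made up): raw scores are not a distribution, and dividing by their sum does not make them one.

import numpy as np

# Raw logits: no positivity constraint, no sum-to-1 constraint.
z = np.array([2.0, -1.0, 0.5])
z.sum()          # → 1.5 — not 1, and nothing forces it to be
z / z.sum()      # → [ 1.333, -0.667,  0.333] — "normalizing" directly
                 #    still leaves a negative entry: not a distribution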

2

Softmax — exponentiate, then normalize

Softmax is two steps. First, exponentiate everything: [e^2.0, e^1.0, e^0.5] ≈ [7.39, 2.72, 1.65]. Now they’re all positive. Second, divide by the sum so they sum to 1: ≈ [0.629, 0.231, 0.140]. Done. Three numbers, all positive, sum to 1 — looks exactly like a probability distribution.

The exp step is not arbitrary. It guarantees positivity (e raised to any real exponent is positive) and it makes softmax depend only on differences of logits — adding the same constant to every logit changes nothing. The downstream effect is profound: softmax doesn’t know the absolute scale of your scores. It only sees who’s ahead, by how much.

import numpy as np

# Three logits — raw scores. No truth check anywhere.
z = np.array([2.0, 1.0, 0.5])

# Numerically stable softmax: subtract max before exp.
def softmax(z, T=1.0):
    s = z / T
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

p = softmax(z)
# p ≈ [0.629, 0.231, 0.140]   (sums to 1)
# Same logits + 100 give the same p — softmax depends only on differences.
softmax(z + 100)
# → [0.629, 0.231, 0.140]
3

Cross-entropy — punish the probability you gave the correct answer

Now there’s a probability vector. Suppose the truth is class 0. The number we care about is p_0 — what the model gave to the right answer. The training signal we want should be 0 when p_0 = 1 (perfect) and large when p_0 is small (confidently wrong). The simplest function that does this: −log p_0. That’s cross-entropy. For one correct label, cross-entropy is the negative log-likelihood of that label — same number, two names from two framings (CE: truth as a distribution, one-hot here; NLL: truth as an observation).

Why log? Because independent observations multiply their likelihoods, and log turns multiplication into addition. The total loss across a dataset becomes a clean sum: L = Σ −log p(y_i | x_i). That’s the same identity the logarithm module is built on.

# In PyTorch, log_softmax + nll_loss is the numerically stable cross-entropy.
import torch
import torch.nn.functional as F

z = torch.tensor([[2.0, 1.0, 0.5]])      # logits, batch of 1
y = torch.tensor([0])                     # true class — 'cat' at index 0

log_p = F.log_softmax(z, dim=1)           # avoids log(softmax) overflow
loss  = F.nll_loss(log_p, y)
# loss ≈ 0.464  =  −log p_true  =  −log(0.629)
#
# Why log_softmax not log(softmax)?
# log(softmax) computes exp() first → overflow when logits are large.
# log_softmax keeps things in log-space the whole way through.
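
The dataset-level claim is easy to check: on a batch, the sum of per-example cross-entropies equals minus the log of the product of the probabilities given to the true labels. A minimal PyTorch sketch — the batch of logits and labels is made up for illustration:

import torch
import torch.nn.functional as F

Z = torch.tensor([[2.0, 1.0, 0.5],
                  [0.2, 1.5, 0.3],
                  [1.0, 1.0, 3.0]])
y = torch.tensor([0, 1, 2])

p_true     = F.softmax(Z, dim=1)[torch.arange(3), y]   # prob. given to each true label
likelihood = p_true.prod()                             # product of independent likelihoods
total_ce   = F.cross_entropy(Z, y, reduction='sum')    # Σ −log p(y_i | x_i)
# total_ce == −log(likelihood), up to float error — the log turned a product into a sum.
torch.isclose(total_ce, -likelihood.log())             # → True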
4

Temperature — the confidence dial

Replace softmax(z) with softmax(z/T). T < 1 divides logits by something small, magnifying their differences — the winner pulls away. T > 1 shrinks differences, flattening the distribution. The widget shows it: drag T toward 0.1 and watch one bar reach for the ceiling. The winner doesn’t change. The wrongness doesn’t change. Only the reported confidence changes — and so the cross-entropy loss, which depends on that confidence, changes with it.
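
A quick check, reusing the softmax helper defined in step 2 (the temperature values are arbitrary):

import numpy as np

# Same logits at three temperatures — the ranking never changes, only the sharpness.
z = np.array([2.0, 1.0, 0.5])
softmax(z, T=1.0)   # ≈ [0.629, 0.231, 0.140]          baseline
softmax(z, T=0.1)   # ≈ [0.99995, 0.000045, 0.0000003]  winner takes (almost) all
softmax(z, T=5.0)   # ≈ [0.391, 0.320, 0.289]           nearly uniform, same winner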

5

Confidence ≠ truth — the trap, made explicit

Set logits to [5.0, 1.0, 0.5] — the model has decided hard for class 0. Softmax says p ≈ [0.971, 0.018, 0.011]: 97.1% confidence. But truth is independent of this calculation. If the real label is class 1, then p_true = 0.018 and the loss is −log 0.018 ≈ 4.0. The model is confident _and_ wrong. Lowering temperature makes it more confident, and the loss rises faster.

This is why “the model said 97% so it must be right” is a category error. _Looks like a probability_ and _matches reality_ are two unrelated claims. Aligning them is its own field — calibration — and it requires data the model never sees during training.

# The trap: a wrong score can produce a confident probability.
import numpy as np

z = np.array([5.0, 1.0, 0.5])        # model is sure of class 0
true_idx = 1                          # but truth is class 1
p = softmax(z)
# p ≈ [0.971, 0.018, 0.011]
# 97.1% confidence — and wrong.
loss = -np.log(p[true_idx])
# loss ≈ 4.0   (huge — log explodes as p_true → 0)
#
# Lower the temperature, and confidence rises further while truth is unchanged.
softmax(z, T=0.5)[0]                 # ≈ 0.9995    (even more sure)
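
The calibration point above has a simplest possible repair: temperature scaling, which fits a single T on held-out data by minimizing the average −log p_true of softmax(z/T). A minimal grid-search sketch — the "validation" logits and labels below are made up for illustration, not a real dataset:

import numpy as np

val_logits = np.array([[4.0, 1.0, 0.0],
                       [0.5, 3.5, 0.5],
                       [1.0, 0.5, 3.0],
                       [4.0, 1.0, 0.0]])
val_labels = np.array([0, 1, 2, 1])   # last example: confidently wrong

def avg_nll(T):
    s = val_logits / T
    P = np.exp(s - s.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True)
    return -np.log(P[np.arange(len(val_labels)), val_labels]).mean()

Ts = np.linspace(0.5, 5.0, 100)
T_best = Ts[np.argmin([avg_nll(T) for T in Ts])]
# With this toy set T_best comes out above 1: the raw logits were overconfident.
# Dividing by T_best softens the reported probabilities without changing any winner.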
now break it

Cross-entropy can disagree with accuracy in both directions. Try logits [2.0, 1.9, 1.8] with true class 0: argmax is correct, but p_true ≈ 0.37, loss ≈ 0.99 — a “right” answer with terrible loss. Now flip to [5.0, 1.0, 0.5] with true class 1: argmax is wrong, but loss ≈ 4.0. Selecting the model with the lowest training loss is not the same as selecting the model that gets the most answers right. The metric you optimize is not the metric you care about; the gap is where shipped models hide.
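
The same disagreement, computed (a minimal numpy sketch reusing the softmax helper from step 2):

import numpy as np

# Two single examples: argmax accuracy and cross-entropy pull in different directions.
cases = [
    (np.array([2.0, 1.9, 1.8]), 0),   # argmax correct, low confidence
    (np.array([5.0, 1.0, 0.5]), 1),   # argmax wrong, high confidence
]
for z, true_idx in cases:
    p = softmax(z)
    correct = (p.argmax() == true_idx)
    loss = -np.log(p[true_idx])
    print(correct, round(float(loss), 2))
# → True 1.0    right answer, poor loss
# → False 4.03  wrong answer, loss dominated by confident wrongness

Averaging the two losses (≈ 2.5) already ranks this pair behind a toy model that is wrong on both examples with p_true = 0.2 each (average loss ≈ 1.6) — lower loss, zero accuracy.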

Softmax doesn’t check truth. It compares scores, exponentiates, normalizes. Cross-entropy reads off the bar that happens to belong to the correct answer. Confidence and rightness are two different things; the model only computes the first.

exercises · 손으로 풀기
1 · read the bars

In the widget, set logits to [2.0, 1.0, 0.5] and lower the temperature from 1.0 to 0.1. Which class wins at each temperature? How does the winning probability change? Now raise T to 5.0 — what happens to the bars? Why does the winner never change just because of T?

2 · compute by hand · softmax · no calculator

Estimate softmax([2, 1, 0]) at T = 1. Use e ≈ 2.72, e² ≈ 7.39. Round and check the bars match.

3 · write the loss

The loss is −log p_true. Compute it for p_true ∈ {0.9, 0.5, 0.1, 0.01} (natural log). What does the gap between 0.1 and 0.01 say about how cross-entropy treats _confident wrongness_?

4 · the evil one · confidence ≠ truth

A model gives the wrong class a probability of 0.99, leaving 0.01 for the truth. What is the loss? Now imagine someone asks: “but it was 99% sure — isn’t that close to right?” Write a one-sentence reply that distinguishes _looks like a probability_ from _is actually likely_.

why this isn't taught this way

ML courses usually present softmax and cross-entropy back-to-back as the “classification recipe” — exponentiate, normalize, take the log of the right one, done. Lemma keeps them apart. Softmax compresses scores into something that looks like probability; cross-entropy punishes the probability the model assigned to the right answer. They are two unrelated jobs glued together by custom. Treating them as one obscures the trap shown in arc 5: a model can be confidently wrong, and the recipe gives no signal that it is.

glossary · used on this page · 7
softmax·소프트맥스
The function that turns a vector of logits `z = (z₁, …, zₙ)` into a vector of positive numbers summing to 1: `softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)`. Two facts to keep close. (1) It depends only on differences `zᵢ − zⱼ` — adding the same constant to every logit changes nothing. (2) It never outputs exactly 0 or 1, only their limits. The output _looks like_ a probability distribution; it does not guarantee that the assigned probability matches any real-world frequency.
cross-entropy·교차 엔트로피
The loss function used to train classifiers. Given a true class `y` and a predicted distribution `p` over `n` classes, cross-entropy is `−log p_y` — the negative log of the probability the model assigned to the correct answer. Equal to negative log-likelihood when the target is one-hot. It punishes confident wrong answers harshly (`p_y → 0` sends loss `→ ∞`) and rewards confident right ones (`p_y → 1` sends loss `→ 0`). Built on log because log turns the product of independent likelihoods into a sum.
distribution·분포
A _shape of uncertainty_: how probability is allocated across possible outcomes. A discrete distribution assigns a number to each outcome — `P(X = "cat") = 0.6, P(X = "dog") = 0.3, P(X = "bird") = 0.1`. A continuous distribution assigns _density_ across an interval — there is no probability at any single point, only over a range. The numbers must sum (discrete) or integrate (continuous) to 1, because _something_ must happen. A single probability is one number; a distribution is the whole shape behind it. Most quantities a model predicts, an asset can return, or a pixel can take, are not single numbers but distributions — and the _spread_ of those distributions is often what matters more than the center.
logit·로짓
A real-valued score a model produces before normalization — the raw output of a final linear layer, with no constraint to be positive or sum to anything. Logits can be any size; only their differences matter, since softmax depends only on `zᵢ − zⱼ`. A logit means nothing on its own. It only acquires interpretation as a "probability" after softmax — and even then, only in the same sense as any number between 0 and 1.
negative log-likelihood·음의 로그우도
Abbreviated NLL. Given a probabilistic model and observed data, the _likelihood_ is the probability the model assigns to the data; the _log-likelihood_ is its logarithm; the _negative_ log-likelihood flips the sign so that minimization makes sense. For a single classification example with true class `y` and predicted distribution `p`, NLL is `−log p_y` — identical to cross-entropy in the one-hot case. The `−log p` form makes independent observations _add_, and makes confident-but-wrong predictions diverge to infinity.
logarithm·로그
The inverse of exponentiation. `log_b(x)` asks: what power of `b` gives `x`?
temperature·온도
A positive scalar `T` applied as `softmax(z/T)`. Low `T` _sharpens_ the distribution — the largest logit dominates, and the model looks more confident. High `T` _flattens_ it — every class gets a similar share, and the model looks more uncertain. `T = 1` is the unaltered softmax. Crucially, `T` does not change _which_ class wins, only how strongly the winning probability is reported. Used for sampling control in language models, for soft targets in distillation, and for repairing overconfidence in calibration.