Lemma
math, backwards
the hook · one probability vs. a whole shape

A probability is one guess. A distribution is the whole shape of uncertainty.

If you only know the most likely outcome of an uncertain thing, you barely know it. “Most likely heads” and “heads 99% of the time” and “heads 51% of the time” all share the same most-likely answer; the difference between them is the whole story. The right object isn’t the most likely outcome, or the expected one — it’s the distribution: the entire shape of uncertainty, every possible outcome and how much weight it carries. Most of what a model predicts, a portfolio holds, or an image’s histogram counts is a distribution before it is a number.

Drag the bars below — five outcomes, five weights, auto-normalized into probabilities. Watch the mean slide and the spread grow and shrink. Same numbers, four shapes: uniform, peaked, skewed, bimodal. The shape is the thing this module is about.

tool spec
what

A distribution assigns probability across possible outcomes of a random variable. Discrete: a probability mass function p(x) with Σ p(x) = 1. Continuous: a probability density function f(x) with ∫ f(x) dx = 1. The expected value is the weighted average; the variance is the average squared distance from that center.

applies when

Any time uncertainty has a shape, not just a single value. Softmax outputs in ML, return distributions in finance, color histograms in compression, pixel-value bins in graphics, term frequencies in text. Once you can name the distribution behind a quantity, the right summary (mean? variance? entropy? a quantile?) usually picks itself.

breaks when

Two big traps. (1) Confusing the variable with its distribution — X is the unknown quantity; p(x) is the rule that says how likely each value is. (2) Confusing mass with density — at a single point of a continuous distribution the probability is zero; only the density is meaningful there. Once those two distinctions are clean, the rest of probability theory falls into place.
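The mass-vs-density trap fits in a few lines — a sketch using a uniform density whose interval and height are chosen purely for illustration:

```python
# Density is not probability. A uniform density on [0, 0.5] has height 2
# everywhere on the interval — perfectly legal, even though 2 > 1.
a, b = 0.0, 0.5
f = 1.0 / (b - a)          # density height: 2.0

# Probability is area: density × interval width.
p_left  = f * (0.25 - a)   # P(0 ≤ X ≤ 0.25) = 0.5
p_point = f * 0.0          # P(X = 0.25): width 0 → probability 0

(f, p_left, p_point)
# → (2.0, 0.5, 0.0)
```

The density at a point can be any non-negative number; only areas under it are probabilities.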

Widget — Distribution shape lab
[interactive widget: five draggable bars over outcomes x = 1…5, auto-normalized so probabilities sum to 1.00; shown state: bars 0.06, 0.19, 0.50, 0.19, 0.06 with readouts mean E[X] = 3.00, variance = 0.88, spread σ = 0.94, max p = 0.50, active outcomes = 5; preset selector]
Five outcomes with values x = 1, 2, 3, 4, 5. Drag a weight, the widget normalizes, the bars sum to 1. Click uniform: each p = 0.20, mean lands on 3, variance is 2. Click peaked: probability concentrates near 3, mean still 3, variance falls. Click skewed: small values are likely, large values rare — mean shifts left of 3. Click bimodal: probability splits between 1 and 5, mean lands back on 3 but variance jumps. Same mean, very different shape. That gap is why a distribution is the right object and the mean alone is not.
the arc
1

A distribution allocates probability across outcomes

A distribution is a rule for spreading total probability 1 across possible outcomes. If a model says “this image is a cat with probability 0.6, a dog with 0.3, a bird with 0.1,” the model has produced a distribution over three labels. The numbers must add up to 1 because exactly one of the outcomes will happen: probability is a constrained budget, not an unrestricted score.

The naive question — what’s the probability? — implicitly asks for one number, and the answer often is one number. The deeper question — what’s the whole distribution? — gives you the rest of the picture: every outcome the model considers possible, weighted by how much probability it spent on each. A distribution is what survives once you stop privileging one outcome over the rest.
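The budget constraint is one line of code — a minimal sketch of the normalization the widget performs (the raw weights here are made up):

```python
import numpy as np

# Any non-negative weights become a distribution once divided by their
# total: the probability budget is forced to be exactly 1.
weights = np.array([3.0, 1.0, 1.0, 0.0, 5.0])   # arbitrary raw scores
probs = weights / weights.sum()

probs        # ≈ [0.3, 0.1, 0.1, 0.0, 0.5]
probs.sum()  # ≈ 1.0 — something must happen
```

Scores can be anything non-negative; dividing by the total is what turns them into a constrained budget.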

2

Discrete distributions — categories, words, classes

The simplest case has finitely many outcomes. A coin flip: two outcomes, two probabilities. A die: six. A softmax output: as many as there are classes. A word in a document: as many as there are vocabulary entries. The whole distribution is a list of (outcome, probability) pairs — a function from outcomes to [0, 1] whose values sum to 1.

This function has a name — the probability mass function, or pmf. The “mass” terminology comes from picturing probability as a literal mass you distribute across discrete buckets; the height of the bar over outcome x is p(x). The widget above is a pmf editor — every drag re-allocates the mass without changing the total.

import numpy as np

# A discrete distribution is a list of (outcome, probability) pairs that
# sum to 1. The 'outcome' can be a label, a class, a number, anything — but
# the probabilities have to add up to one because *something* must happen.
outcomes = ["cat", "dog", "bird"]
probs    = [0.6,   0.3,   0.1]
assert abs(sum(probs) - 1.0) < 1e-9

# Sampling: pick one outcome with the given probability. The law of large
# numbers says long-run sample frequencies converge to these probabilities.
rng = np.random.default_rng(0)
draws = rng.choice(outcomes, size=10_000, p=probs)
[(o, (draws == o).mean()) for o in outcomes]
# → [('cat',  0.6021),
#    ('dog',  0.2978),
#    ('bird', 0.1001)]
# Each empirical frequency ≈ the probability we set. A distribution is what
# the long-run frequency *is*; a single draw is a finite, noisy peek at it.
3

Expected value — the center of mass

When outcomes are numbers, the distribution has a mean. The expected value E[X] = Σ x · p(x) is the weighted average — the center of mass of the bar chart. Drag the widget to uniform with outcomes 1..5: the mean lands on 3, the average. Drag to skewed so small values are likely and large values rare: the mean shifts toward the heavy side.

The name expected is a little misleading. With X ∈ {0, 1000} and equal probability, the expected value is 500 — a value X will never take. The mean is a summary of the distribution, not a forecast of any single draw. It is what the long-run average of independent draws converges to (the law of large numbers), and that’s why it deserves the name E.

# Expected value E[X] of a numerical random variable.
# Outcomes need to be numbers (else there is no mean) — labels do not.
xs = np.array([1, 2, 3, 4, 5])
ps = np.array([0.05, 0.20, 0.50, 0.20, 0.05])  # peaked at 3, sums to 1
assert abs(ps.sum() - 1.0) < 1e-9

mu = (xs * ps).sum()
# → 3.00   the weighted average; equals the center of mass of the bars.

# Variance: weighted average of (x − μ)². Spread, in squared units.
var = (ps * (xs - mu)**2).sum()
sigma = var ** 0.5
(mu, var, sigma)
# → (3.0, 0.8, 0.8944)

# Sanity check by direct sampling:
rng = np.random.default_rng(0)
draws = rng.choice(xs, size=200_000, p=ps)
(draws.mean(), draws.var(), draws.std())
# → ≈ (3.0, 0.8, 0.894)
# Same numbers as the exact computation, up to sampling noise at N = 200k.
#
# Two distributions can share a mean and disagree wildly on variance.
# Same μ, different shape — and the shape is what drives risk in finance,
# calibration in ML, and bits-per-symbol in compression.
4

Variance — spread, not just center

Two distributions can share a mean and disagree wildly on shape. Peaked and bimodal in the widget both have mean 3 — but peaked piles probability on outcome 3 itself, while bimodal splits it between 1 and 5. The mean alone can’t tell them apart; variance can.

Var[X] = Σ p(x) · (x − μ)² — the weighted average of squared distance from the mean. Squared, because we want a measure that doesn’t cancel positive and negative deviations against each other. The square root, σ = √Var, is the standard deviation — same idea, restored to the original units of X.

Variance is the second-order summary of a distribution. In finance it is the working definition of risk; in ML it bounds how much an estimator can wobble; in physics it shows up everywhere from thermal noise to measurement error. Two distributions with the same mean and the same variance are still not the same distribution, but the gap between them is much smaller than the gap when only the mean matches.
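A quick numerical check of the same-mean/different-variance claim, using illustrative weights for the peaked and bimodal shapes (not necessarily the widget's exact presets):

```python
import numpy as np

xs = np.array([1, 2, 3, 4, 5])

def mean_var(ps):
    mu = (xs * ps).sum()
    return mu, (ps * (xs - mu) ** 2).sum()

peaked  = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
bimodal = np.array([0.50, 0.00, 0.00, 0.00, 0.50])

mean_var(peaked)   # ≈ (3.0, 0.8)
mean_var(bimodal)  # ≈ (3.0, 4.0)
# Identical centers, a 5× gap in variance — the mean alone sees no difference.
```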

5

Continuous distributions — density, not mass

Most distributions in the real world are over the real line, not a finite set of buckets. Daily stock returns, measurement errors, response times, pixel intensities at sub-bit precision — these live on a continuum, and discrete masses don’t make sense: there are uncountably many possible values, and any single one has probability zero.

The right object is a probability density function f(x). Density is not probability — f(x) can exceed 1 — but it is the rate at which probability accumulates around x. Probability comes from integrating density over an interval: P(a ≤ X ≤ b) = ∫_a^b f(x) dx. The whole line integrates to 1.

This is where distributions meet the integral. For a discrete distribution, the expected value is a sum — Σ x · p(x). For a continuous distribution, it is an integral — ∫ x · f(x) dx. Same idea, different machinery: integration is the continuous version of the sum, just as it is the continuous version of “accumulate the rate” we used for distance and present value. Distributions inherit calculus the moment they become continuous.
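The sum-becomes-integral claim can be checked numerically — a midpoint Riemann sum over the density f(x) = 2x on [0, 1], chosen for illustration (exact answers: total mass 1, mean 2/3):

```python
import numpy as np

dx = 1e-5
x = np.arange(dx / 2, 1.0, dx)   # midpoint of each sliver of [0, 1]
f = 2 * x                        # density f(x) = 2x; ∫₀¹ 2x dx = 1

total_mass = (f * dx).sum()      # ≈ 1.0     (Σ f(x)·dx  →  ∫ f(x) dx)
mean       = (x * f * dx).sum()  # ≈ 0.6667  (Σ x·f(x)·dx → ∫ x·f(x) dx = 2/3)
```

Shrink dx and the sums converge to the integrals: integration is the continuous limit of “accumulate value × probability.”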

6

Where this shows up — same shape, three pillars

Distributions are the type behind a surprising fraction of the math Lemma already covers. Every place where a number is uncertain, a distribution sits underneath it. Three pillars, one structure.

ml       : softmax is a distribution over classes; calibration compares
         predicted distributions with observed frequencies.
graphics : a color histogram is a distribution over pixel values; entropy
         on that distribution sets the compression floor.
finance  : a return is drawn from a distribution; risk is a property of
         that distribution, not of any one return.

Confidently wrong — softmax produces a distribution over labels, not a truth certificate. The probabilities sum to 1 because some label has to be picked; the height of each bar is the model’s bet, not its knowledge.

Model calibration — calibration is the gap between a predicted distribution and the observed frequency. A reliability diagram asks: in the bin where the model said p = 0.8, was the answer actually correct 80% of the time? Calibration is a property of distributions, not of single predictions.
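A minimal reliability check, sketched on synthetic data: the predictor below is calibrated by construction — outcomes are drawn with exactly the predicted probability — so each bin's confidence and accuracy should nearly agree.

```python
import numpy as np

rng = np.random.default_rng(0)
p_pred  = rng.uniform(0.5, 1.0, size=100_000)       # predicted P(correct)
correct = rng.uniform(size=p_pred.size) < p_pred    # true with prob p_pred

edges = np.linspace(0.5, 1.0, 6)                    # five confidence bins
idx = np.digitize(p_pred, edges[1:-1])
for k in range(5):
    m = idx == k
    print(f"said {p_pred[m].mean():.2f}  ->  right {correct[m].mean():.2f}")
# Each row's two numbers match to within sampling noise. For a real model,
# the gap between the columns is what a reliability diagram plots.
```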

TF-IDF — term frequencies are a distribution over a vocabulary, document by document. Comparing documents is comparing distributions; cosine similarity, KL divergence, and BM25 are all distribution-comparison tools dressed up under different names.

Image compression — a histogram is a distribution over pixel values, and entropy on that distribution is the lower bound on bits-per-pixel for independent encoding. The compression story is a distributions-and-entropy story stitched together.
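Entropy on a histogram is one line — here on a toy 4-level histogram (illustrative numbers):

```python
import numpy as np

# Entropy of a pixel-value histogram: the floor on bits-per-pixel for any
# code that treats pixels as independent draws from this distribution.
hist = np.array([0.3, 0.2, 0.4, 0.1])
H = -(hist * np.log2(hist)).sum()
# ≈ 1.846 bits/pixel — no independent encoding can average below this.
```

The extremes bracket it: a uniform histogram over 4 levels gives log₂ 4 = 2 bits, and a single certain level gives 0.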

Portfolio risk — risk is not a worst case; risk is a property of the return distribution. Variance, covariance, the whole shape of how returns scatter around their mean — those are the inputs to a portfolio decision, not the mean alone.

Same object, different pillar — five applications were already speaking distribution-language without a module to back them. Now they share one.

# Where this module shows up — five existing applications, one shape.

# (1) ML: softmax output is a distribution over labels.
def softmax(logits):
    e = np.exp(logits - logits.max())   # max-subtract for numerical safety
    return e / e.sum()
softmax(np.array([2.0, 1.0, 0.1]))
# → array([0.659, 0.242, 0.099])   sums to 1, is a distribution

# (2) Graphics: a color histogram is a distribution over pixel values.
img = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], dtype=int)
counts = np.bincount(img, minlength=4)
hist = counts / counts.sum()
# → array([0.3, 0.2, 0.4, 0.1])   sums to 1, is a distribution

# (3) Finance: a return distribution is the input to risk.
# Two assets with identical mean but different spread.
rets_A = np.array([0.04, 0.05, 0.06])         # tight
rets_B = np.array([-0.10, 0.05, 0.20])        # wide
[(r.mean(), r.std()) for r in (rets_A, rets_B)]
# → [(0.05, 0.0081), (0.05, 0.1224)]
# Same expected return, very different distribution. *That gap is risk.*
# Portfolio variance, calibration gap, histogram entropy — all three start
# from a distribution and ask different questions of the same object.

A distribution is the whole shape of uncertainty. Discrete or continuous, the rule is the same: probability sums (or integrates) to 1 across all outcomes. The mean summarizes its center; the variance summarizes its spread; the full shape carries everything else. Most things you can call a “number” in modern math are distributions first.

exercises · solve by hand
1 · normalize by hand · no calculator

A model outputs raw scores (logits) [2.0, 1.0, 0.1]. Convert them into a distribution by softmax: p_i = e^(x_i) / Σ_j e^(x_j). Round to two decimals. Confirm the result sums to 1.

2 · same mean, different variance · no calculator

Two discrete distributions with outcomes 1, 2, 3. Distribution A: p = [0, 1, 0]. Distribution B: p = [0.5, 0, 0.5]. Compute the mean and variance of each. State which is bigger on each dimension.

3 · density isn't probability

A continuous distribution has density f(x) = 2x on x ∈ [0, 1] and 0 elsewhere. Confirm it is a valid density (integrates to 1). Compute P(X ≤ 0.5). Then point out the value of f(0.5) and explain why that value alone is not the probability of X = 0.5.

4 · recognize the distribution

In each of the five applications listed in arc 6, name the random variable, the outcome space, and whether the distribution is discrete or continuous.

glossary · used on this page · 10
distribution·분포
A _shape of uncertainty_: how probability is allocated across possible outcomes. A discrete distribution assigns a number to each outcome — `P(X = "cat") = 0.6, P(X = "dog") = 0.3, P(X = "bird") = 0.1`. A continuous distribution assigns _density_ across an interval — there is no probability at any single point, only over a range. The numbers must sum (discrete) or integrate (continuous) to 1, because _something_ must happen. A single probability is one number; a distribution is the whole shape behind it. Most quantities a model predicts, an asset can return, or a pixel can take, are not single numbers but distributions — and the _spread_ of those distributions is often what matters more than the center.
random variable·확률변수
A _quantity whose value is uncertain until it is drawn_ — but whose possible values have a distribution. A coin flip is the random variable `X ∈ {heads, tails}` with `P(X = heads) = 0.5`. A daily stock return is a real-valued random variable with some continuous distribution over `(−∞, ∞)`. Random variables are how probability talks about _quantities_: they have means (the expected value), spread (variance), and they combine into other random variables (sums, products, ratios). The variable is not the distribution — the variable _has_ a distribution. Confusing the two leads to bad math.
probability mass·확률질량
The _number a discrete distribution assigns to one outcome_. For a fair die, `P(X = 3) = 1/6`; the probability mass at `3` is `1/6`. The function `pmf(x) = P(X = x)` is the _probability mass function_ — the whole distribution as a lookup table. The masses sum to 1 across all outcomes. Discrete distributions are made of masses; continuous distributions are made of _densities_. The distinction matters because at a single point of a continuous distribution the probability is _zero_; only the density is meaningful there.
probability density·확률밀도
The _rate at which a continuous distribution accumulates probability around a point_. For a continuous random variable `X` with density `f(x)`, the probability that `X` falls in a small interval `[x, x + dx]` is approximately `f(x) · dx`. The density is _not_ a probability — it can exceed 1 — but its integral over any interval gives a probability, and its integral over the whole line is 1. Asking "what is `P(X = 3.7)`?" for a continuous `X` is a category error; the right question is "what is `P(3.6 ≤ X ≤ 3.8)`?" — an area under the density curve. Continuous distributions live as densities; integrals turn densities back into probabilities.
expected value·기댓값
The _weighted average_ of a random variable. For a discrete `X` with `P(X = xᵢ) = pᵢ`, the expected value is `E[X] = Σ xᵢ · pᵢ`. For a continuous `X` with density `f(x)`, `E[X] = ∫ x · f(x) dx`. It is the value you would converge to if you sampled `X` many times and averaged — the _center of mass_ of the distribution. "Expected" is a slightly misleading name: in `X ∈ {0, 1000}` with `P(0) = P(1000) = 0.5`, the expected value is 500 — a value `X` will _never_ take. The mean is a _summary_ of the distribution, not a _forecast_ of any one draw.
variance·분산
The expected square-distance of a random quantity from its mean. For returns `r` with mean `μ`, `Var(r) = E[(r − μ)²]`; the units are the _square_ of whatever you started with, which is why people usually quote the square root, _standard deviation_, instead. Variance is the workhorse measure of _risk_ in finance: not "how much could I lose worst-case?" but "how much does the outcome jitter from the average?". Two assets with the same mean return can have wildly different variances, and that gap is the entire reason a portfolio is more than the sum of its parts.
softmax·소프트맥스
The function that turns a vector of logits `z = (z₁, …, zₙ)` into a vector of positive numbers summing to 1: `softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)`. Two facts to keep close. (1) It depends only on differences `zᵢ − zⱼ` — adding the same constant to every logit changes nothing. (2) It never outputs exactly 0 or 1, only their limits. The output _looks like_ a probability distribution; it does not guarantee that the assigned probability matches any real-world frequency.
histogram·히스토그램
A count of how often each _value_ appears in a dataset. For an image: how many pixels are this dark, how many are that light, ignoring _where_ in the picture they sit. The histogram throws away spatial structure on purpose — it answers "what's the distribution of brightness?" but not "is the picture smooth or noisy?". When you compute _entropy_ over a histogram, you get a lower bound on the bits-per-symbol needed to encode each value _independently_; for an image, that's almost always a loose bound, because real pixels aren't independent of their neighbors.
calibration·캘리브레이션
The match between a model's predicted probabilities and actual frequencies. A model is _calibrated_ if, on examples it labels "70% confident," it is correct close to 70% of the time. Calibration is plotted as a _reliability diagram_: predicted probability on the x-axis, observed accuracy on the y-axis; perfect calibration is the diagonal `y = x`. Most modern neural networks are _overconfident_ — their high-probability predictions are correct less often than they claim — and the standard fix is _temperature scaling_, a single dial that pulls the curve back toward the diagonal. Calibration is distinct from accuracy: a perfectly accurate classifier on a tiny dataset can be wildly miscalibrated, and a calibrated model can be wrong about every individual prediction as long as the _frequencies_ line up.
entropy·엔트로피
`H(X) = −Σ pᵢ log₂ pᵢ`. The expected number of yes/no questions needed, on average, to identify which outcome occurred. Reaches `log₂ N` when all `N` outcomes are equally likely (maximum); collapses to 0 when one outcome has probability 1 (no uncertainty). The base of the log picks the unit: log₂ → bits, ln → nats, log₁₀ → bans. Built on `log` so that the entropy of independent variables adds: `H(X, Y) = H(X) + H(Y)` when X and Y are independent.