A probability is one guess. A distribution is the whole shape of uncertainty.
If you only know the most likely outcome of an uncertain thing, you barely know it. “Most likely heads” and “heads 99% of the time” and “heads 51% of the time” all share the same most-likely answer; the difference between them is the whole story. The right object isn’t the most likely outcome, or the expected one — it’s the distribution: the entire shape of uncertainty, every possible outcome and how much weight it carries. Most of what a model predicts, a portfolio holds, or an image’s histogram counts is a distribution before it is a number.
Drag the bars below — five outcomes, five weights, auto-normalized into probabilities. Watch the mean slide and the spread grow and shrink. Same numbers, four shapes: uniform, peaked, skewed, bimodal. The shape is the thing this module is about.
- what
A distribution assigns probability across possible outcomes of a random variable. Discrete: a probability mass function p(x) with Σₓ p(x) = 1. Continuous: a probability density f(x) with ∫ f(x) dx = 1. The expected value is the weighted average; the variance is the average squared distance from that center.
- applies when
Any time uncertainty has a shape, not just a single value.
Softmax outputs in ML, return distributions in finance, color histograms in compression, calibration bins, term frequencies in text. Once you can name the distribution behind a quantity, the right summary (mean? variance? entropy? a quantile?) usually picks itself.
- breaks when
Two big traps. (1) Confusing the variable with its distribution — X is the unknown quantity; p(x) is the rule that says how likely each value is. (2) Confusing mass with density — at a single point of a continuous distribution the probability is zero; only the density is meaningful there. Once those two distinctions are clean, the rest of probability theory falls into place.
A distribution allocates probability across outcomes
A distribution is a rule for spreading total probability 1 across possible outcomes. If a model says “this image is a cat with probability 0.6, a dog with 0.3, a bird with 0.1,” the model has produced a distribution over three labels. The numbers must add up to 1 because exactly one of the outcomes will happen: probability is a constrained budget, not an unrestricted score.
The naive question — what’s the probability? — implicitly asks for one number, and the answer often is one number. The deeper question — what’s the whole distribution? — gives you the rest of the picture: every outcome the model considers possible, weighted by how much probability it spent on each. A distribution is what survives once you stop privileging one outcome over the rest.
Discrete distributions — categories, words, classes
The simplest case has finitely many outcomes. A coin flip: two outcomes, two probabilities. A die: six. A softmax output: as many as there are classes. A word in a document: as many as there are vocabulary entries. The whole distribution is a list of (outcome, probability) pairs — a function from outcomes to [0, 1] whose values sum to 1.
This function has a name — the probability mass function, or pmf: the probability that X takes the value x is p(x). The widget above is a pmf editor — every drag re-allocates the mass without changing the total.
import numpy as np
# A discrete distribution is a list of (outcome, probability) pairs that
# sum to 1. The 'outcome' can be a label, a class, a number, anything — but
# the probabilities have to add up to one because *something* must happen.
outcomes = ["cat", "dog", "bird"]
probs = [0.6, 0.3, 0.1]
assert abs(sum(probs) - 1.0) < 1e-9
# Sampling: pick one outcome with the given probability. The law of large
# numbers says long-run sample frequencies converge to these probabilities.
rng = np.random.default_rng(0)
draws = rng.choice(outcomes, size=10_000, p=probs)
[(o, (draws == o).mean()) for o in outcomes]
# → [('cat', 0.6021),
# ('dog', 0.2978),
# ('bird', 0.1001)]
# Each empirical frequency ≈ the probability we set. A distribution is what
# the long-run frequency *is*; a single draw is a finite, noisy peek at it.
Expected value — the center of mass
When outcomes are numbers, the distribution has a mean. The expected value is the weighted average — the center of mass of the bar chart. Drag the widget to uniform with outcomes 1..5: the mean lands on 3, the average. Drag to skewed so small values are likely and large values rare: the mean shifts toward the heavy side.
The name expected is a little misleading. With outcomes 0 and 1000 at equal probability, the expected value is 500 — a value X will never take. The mean is a summary of the distribution, not a forecast of any single draw. It is what the long-run average of independent draws converges to (the law of large numbers), and that’s why it deserves the name E[X].
# Expected value E[X] of a numerical random variable.
# Outcomes need to be numbers (else there is no mean) — labels do not.
xs = np.array([1, 2, 3, 4, 5])
ps = np.array([0.05, 0.20, 0.50, 0.20, 0.05]) # peaked at 3, sums to 1
assert abs(ps.sum() - 1.0) < 1e-9
mu = (xs * ps).sum()
# → 3.00 the weighted average; equals the center of mass of the bars.
# Variance: weighted average of (x − μ)². Spread, in squared units.
var = (ps * (xs - mu)**2).sum()
sigma = var ** 0.5
(mu, var, sigma)
# → (3.0, 0.8, 0.8944)
# Sanity check by direct sampling:
rng = np.random.default_rng(0)
draws = rng.choice(xs, size=200_000, p=ps)
(draws.mean(), draws.var(), draws.std())
# → ≈ (3.00, 0.80, 0.89)
# Same numbers, with the sampling noise you would expect at N = 200k.
#
# Two distributions can share a mean and disagree wildly on variance.
# Same μ, different shape — and the shape is what drives risk in finance,
# calibration in ML, and bits-per-symbol in compression.
Variance — spread, not just center
Two distributions can share a mean and disagree wildly on shape. Peaked and bimodal in the widget both have mean 3 — but peaked piles probability on outcome 3 itself, while bimodal splits it between 1 and 5. The mean alone can’t tell them apart; variance can.
Var(X) = E[(X − μ)²] = Σₓ (x − μ)² p(x) — the weighted average of squared distance from the mean. Squared, because we want a measure that doesn’t cancel positive and negative deviations against each other. The square root, σ = √Var(X), is the standard deviation — same idea, restored to the original units of X.
Variance is the second-order summary of a distribution. In finance it is the working definition of risk; in ML it bounds how much an estimator can wobble; in physics it shows up everywhere from thermal noise to measurement error. Two distributions with the same mean and the same variance are still not the same distribution, but the gap between them is much smaller than the gap when only the mean matches.
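To see that numerically, here is a minimal sketch with two shapes over the outcomes 1..5. The weights are illustrative stand-ins for the widget’s peaked and bimodal presets, not its exact numbers.
# Same mean, very different variance. The weights below are illustrative
# stand-ins for the widget's "peaked" and "bimodal" shapes, not its exact numbers.
import numpy as np
xs = np.array([1, 2, 3, 4, 5])
peaked  = np.array([0.05, 0.20, 0.50, 0.20, 0.05])
bimodal = np.array([0.45, 0.05, 0.00, 0.05, 0.45])
def mean_var(ps):
    mu = (xs * ps).sum()
    return mu, (ps * (xs - mu) ** 2).sum()
mean_var(peaked)    # → (3.0, 0.8)
mean_var(bimodal)   # → (3.0, 3.7)
# Identical center of mass; the spread differs by more than a factor of four.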
Continuous distributions — density, not mass
Most distributions in the real world are over the real line, not a finite set of buckets. Daily stock returns, measurement errors, response times, pixel intensities at sub-bit precision — these live on a continuum, and discrete masses don’t make sense: there are uncountably many possible values, and any single one has probability zero.
The right object is a probability density f(x). Density itself is not a probability: f(x) can exceed 1 — but it is the rate at which probability accumulates around x. Probability comes from integrating density over an interval: P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx. The whole line integrates to 1: ∫ f(x) dx = 1.
This is where distributions meet the integral. For a discrete distribution, the expected value is a sum — E[X] = Σₓ x·p(x). For a continuous distribution, it is an integral — E[X] = ∫ x·f(x) dx. Same idea, different machinery: integration is the continuous version of the sum, just as it is the continuous version of “accumulate the rate” we used for distance and present value. Distributions inherit calculus the moment they become continuous.
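A minimal numeric sketch of that distinction. The density here is a normal with a deliberately small σ (my choice, not something from the widget), so the peak exceeds 1 while every probability stays in [0, 1].
# Density vs. mass, assuming a normal density with sigma = 0.1 (illustrative).
import numpy as np
mu, sigma = 0.0, 0.1
f = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
f(0.0)
# → ≈ 3.989   a density, not a probability; it is allowed to exceed 1
# Probability = area under the density. Crude Riemann sum over a fine grid:
x = np.linspace(-1, 1, 200_001)
dx = x[1] - x[0]
(f(x) * dx).sum()                              # ≈ 1.0    the whole line integrates to 1
(f(x) * dx)[(x >= -0.1) & (x <= 0.1)].sum()    # ≈ 0.683  P(-0.1 ≤ X ≤ 0.1), one sigma each side
(x * f(x) * dx).sum()                          # ≈ 0.0    E[X] as an integral, not a sum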
Where this shows up — same shape, three pillars
Distributions are the type behind a surprising fraction of the math Lemma already covers. Every place where a number is uncertain, a distribution sits underneath it. Three pillars, one structure.
ml : softmax is a distribution over classes; calibration compares predicted distributions with observed frequencies.
graphics : a color histogram is a distribution over pixel values; entropy on that distribution sets the compression floor.
finance : a return is drawn from a distribution; risk is a property of that distribution, not of any one return.
Confidently wrong — softmax produces a distribution over labels, not a truth certificate. The probabilities sum to 1 because some label has to be picked; the height of each bar is the model’s bet, not its knowledge.
Model calibration — calibration is the gap between a predicted distribution and the observed frequency. A reliability diagram asks: in the bin where the model said p = 0.8, was the answer actually correct 80% of the time? Calibration is a property of distributions, not of single predictions.
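A sketch of that check on synthetic data. The predictions and labels below are generated to be perfectly calibrated by construction, so the two columns should agree; a real reliability diagram bins real model outputs the same way.
# Reliability-diagram-style check on synthetic, perfectly calibrated data.
import numpy as np
rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 1, size=50_000)                        # predicted P(y = 1)
y = (rng.uniform(0, 1, size=50_000) < p_pred).astype(int)      # labels drawn at exactly those rates
bins = np.linspace(0, 1, 11)
which = np.digitize(p_pred, bins) - 1                          # assign each prediction to a bin
for b in range(10):
    in_bin = which == b
    print(f"predicted {bins[b]:.1f}-{bins[b+1]:.1f}   observed {y[in_bin].mean():.2f}")
# Each bin's observed frequency tracks its predicted probability; a miscalibrated
# model is one where these two columns drift apart.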
TF-IDF — term frequencies are a distribution over a vocabulary, document by document. Comparing documents is comparing distributions; cosine similarity, KL divergence, and BM25 are all distribution-comparison tools dressed up under different names.
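A toy version of that comparison; the vocabulary and counts below are made up for illustration.
# Two made-up documents as term-frequency distributions over a toy vocabulary.
import numpy as np
vocab = ["cat", "dog", "fish", "bird"]
counts_d1 = np.array([4, 3, 0, 1], dtype=float)
counts_d2 = np.array([1, 5, 2, 0], dtype=float)
p1 = counts_d1 / counts_d1.sum()       # each document becomes a distribution over the vocabulary
p2 = counts_d2 / counts_d2.sum()
cosine = (p1 @ p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
cosine
# → ≈ 0.68   comparing documents is comparing these two distributions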
Image compression — a histogram is a distribution over pixel values, and the entropy of that distribution sets the compression floor: the average number of bits per pixel below which no lossless code can go.
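To put a number on that floor, here is the entropy of the same four-bin histogram that appears in the code block further down; the 10-pixel image is of course a toy.
# Entropy of the histogram [0.3, 0.2, 0.4, 0.1] from the tiny image below.
import numpy as np
hist = np.array([0.3, 0.2, 0.4, 0.1])
H = -(hist * np.log2(hist)).sum()
H
# → ≈ 1.846 bits per pixel: the average code length no lossless scheme can beat.
# A flat histogram over 4 values would cost the full 2 bits; the skew saves ~0.15.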
Portfolio risk — risk is not a worst case; risk is a property of the return distribution. Variance, covariance, the whole shape of how returns scatter around their mean — those are the inputs to a portfolio decision, not the mean alone.
Same object, different pillar — five applications were already speaking distribution-language without a module to back them. Now they share one.
# Where this module shows up — five existing applications, one shape.
# (1) ML: softmax output is a distribution over labels.
def softmax(logits):
e = np.exp(logits - logits.max()) # max-subtract for numerical safety
return e / e.sum()
softmax(np.array([2.0, 1.0, 0.1]))
# → array([0.659, 0.242, 0.099]) sums to 1, is a distribution
# (2) Graphics: a color histogram is a distribution over pixel values.
img = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3], dtype=int)
counts = np.bincount(img, minlength=4)
hist = counts / counts.sum()
# → array([0.3, 0.2, 0.4, 0.1]) sums to 1, is a distribution
# (3) Finance: a return distribution is the input to risk.
# Two assets with identical mean but different spread.
rets_A = np.array([0.04, 0.05, 0.06]) # tight
rets_B = np.array([-0.10, 0.05, 0.20]) # wide
[(r.mean(), r.std()) for r in (rets_A, rets_B)]
# → [(0.05, 0.0081), (0.05, 0.1224)]
# Same expected return, very different distribution. *That gap is risk.*
# Portfolio variance, calibration gap, histogram entropy — all three start
# from a distribution and ask different questions of the same object.
A distribution is the whole shape of uncertainty. Discrete or continuous, the rule is the same: probability sums (or integrates) to 1 across all outcomes. The mean summarizes its center; the variance summarizes its spread; the full shape carries everything else. Most things you can call a “number” in modern math are distributions first.
A model outputs raw scores (logits). Convert them into a distribution with softmax: pᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Round to two decimals. Confirm the result sums to 1.
Two discrete distributions with outcomes . Distribution A: . Distribution B: . Compute the mean and variance of each. State which is bigger on each dimension.
A continuous distribution has density on and elsewhere. Confirm it is a valid density (integrates to 1). Compute . Then point out the value of and explain why that value alone is not the probability of .
In each of the five applications listed in arc 6, name the random variable, the outcome space, and whether the distribution is discrete or continuous.