How many yes/no questions to find one item in 1,024?
Halve the list each time and the answer is 10 — log₂ 1,024.
Entropy = the average number of yes/no questions needed. When everything is equally likely, it collapses to log₂ N. When one outcome dominates, it falls toward 0. Wordle, Huffman compression, password strength, decision trees — all bump against the same number.
- what
H(X) = −Σ p_i log₂ p_i. The expected number of yes/no questions, on average, needed to identify which outcome of X occurred. Reduces to log₂ N when the N outcomes are equally likely (the maximum); collapses to 0 when one outcome has probability 1 (the minimum).
- applies when
Uncertainty has structure: Wordle (information gain per guess), Huffman / arithmetic compression (the Shannon bound), password strength (NIST entropy), decision-tree splits (information-gain criterion), language-model perplexity (2^H), KL divergence and mutual information.
- breaks when
You don’t actually know p — entropy is defined on a distribution, and estimating one from a sample is its own problem. Continuous distributions need differential entropy (different sign behavior: it can be negative). Entropy uses only the marginal p(x); identical marginals can hide very different joint structure (mutual information, not entropy, captures that).
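To feel the estimation caveat concretely: the plug-in estimator (compute H from empirical frequencies) systematically underestimates entropy on small samples. A minimal sketch using only the standard library; the uniform-1,024 target and the sample sizes are illustrative choices, not from the original text:

```python
import math
import random
from collections import Counter

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(0)
N = 1024                      # true distribution: uniform over 1,024 outcomes
true_H = math.log2(N)         # 10 bits

for n in (100, 1_000, 100_000):
    counts = Counter(random.randrange(N) for _ in range(n))
    est_H = entropy_bits(c / n for c in counts.values())
    print(f"{n:>7} samples: plug-in estimate {est_H:.2f} bits (true {true_H:.0f})")
```

With 100 samples you cannot even observe 1,024 distinct outcomes, so the estimate is capped well below 10 bits; the bias shrinks only as the sample grows much larger than the number of outcomes.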
log₂ N — the equally-likely floor
Start where the log module ended. With N equally-likely outcomes, halving the candidate set each question takes log₂ N questions. A standard deck: log₂ 52 ≈ 5.7 bits — six questions identify any card. Two coin flips: log₂ 4 = 2 bits. A 64-bit random integer: 64 bits, by definition. The unit “bit” is exactly that: one yes/no question’s worth of uncertainty.
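The arithmetic in that paragraph is one log2 call each; a quick check in plain Python:

```python
import math

print(math.log2(1024))    # 10.0  : ten questions find one item in 1,024
print(math.log2(52))      # ~5.70 : a standard deck, so six whole questions suffice
print(math.log2(4))       # 2.0   : two coin flips
print(math.log2(2**64))   # 64.0  : a 64-bit random integer
```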
Surprise, weighted
When outcomes have unequal probabilities, “halve the list” stops being optimal. A question that splits the list 90% / 10% yields little information on average: the 90% answer barely narrows anything. The right accounting is per-outcome:
H(X) = −Σ p_i log₂ p_i = Σ p_i · (−log₂ p_i)
(probability × surprise, summed)
Two anchors. Uniform, p_i = 1/N for every i: H = log₂ N — the equally-likely floor from arc 1, now recovered as the maximum. Spike, p = 1 on one outcome and 0 elsewhere: H = 0 — no questions, no information, no uncertainty. Every distribution sits between those two, and the widget above lets you slide along that line.
Why the average matters — Shannon's bound
In 1948 Shannon proved that no lossless code can compress a stream of symbols below H(X) bits per symbol on average — the Shannon bound. Huffman and arithmetic coding are strategies for approaching that floor, but nothing crosses it.
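A worked instance of the bound rather than a proof. The distribution and the prefix code below are illustrative choices, not from the original text; because every probability is a power of 1/2, this code meets the bound exactly, while a Huffman code for an arbitrary distribution gets within 1 bit of H:

```python
import math

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}    # H = 1.75 bits/symbol
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}    # a prefix-free code

H = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(f"entropy           : {H:.2f} bits/symbol")        # 1.75, the floor
print(f"this prefix code  : {avg_len:.2f} bits/symbol")  # 1.75, meets the floor
print(f"fixed 2-bit code  : 2.00 bits/symbol")           # wastes 0.25 bits/symbol
```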
Information gain — Wordle's optimal first guess
Wordle: pick a 5-letter guess; the game returns one of 3⁵ = 243 feedback patterns (each tile gray/yellow/green); each pattern selects a subset of the remaining candidate words, and together they partition the list. A guess gives you, on average, bits of information gain equal to the entropy of its pattern distribution: the expected reduction in uncertainty about the answer.
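That bookkeeping fits in a few lines. The partition sizes below are made up for illustration, not real Wordle data:

```python
import math

def information_gain_bits(partition_sizes):
    """Expected bits learned from the feedback: the entropy of the pattern
    distribution induced by how a guess splits the current candidates."""
    total = sum(partition_sizes)
    return sum((s / total) * -math.log2(s / total) for s in partition_sizes if s > 0)

# Hypothetical splits of 100 remaining candidates:
print(information_gain_bits([50, 25, 25]))   # 1.5 bits
print(information_gain_bits([90, 10]))       # ~0.47 bits: a lopsided split tells you little
print(information_gain_bits([1] * 100))      # ~6.64 bits = log2(100): the ideal guess
```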
Same machinery underlies decision-tree splits in machine learning (“which feature partitions labels into the most uncertainty-reducing groups?”), 20 Questions strategy (“ask the question with the highest expected information gain”), and the binary-search-vs-fingerprint-lookup contrast in algorithms.
Where this shows up — same equation, two pillars
Entropy is not randomness as a vibe. It is the expected surprise of a distribution. Once that sinks in, the same four lines describe two applications that look unrelated:
rare word → high surprise
rare color → high surprise
predictable distribution → low entropy
uniform distribution → high entropy
TF-IDF is the rare-word version: each query term is weighted by idf = log₂(N/df) = −log₂ p, exactly the surprise of seeing that word in a random document. Stop-words have near-zero surprise, so they get near-zero weight without anyone hard-coding the list.
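The IDF half of that weighting, written as surprise; the corpus size and document frequencies below are invented for illustration:

```python
import math

N_DOCS = 1_000_000                                            # hypothetical corpus size
doc_freq = {"the": 950_000, "entropy": 1_200, "wordle": 40}   # invented counts

for word, df in doc_freq.items():
    p = df / N_DOCS              # probability a random document contains the word
    idf_bits = -math.log2(p)     # surprise of seeing it, i.e. log2(N/df)
    print(f"{word:>8}: {idf_bits:5.2f} bits")
```

The stop-word lands near 0 bits and the rare terms near 10 or more, which is the whole mechanism: no list of banned words, just low surprise.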
Why images compress is the rare-color version on a histogram of pixel values — and then again on a histogram of neighbor differences, where smooth pictures look extremely predictable (low entropy → small file) and scrambled ones look uniform (high entropy → big file). Same H, two pillars: the number isn’t a feature of language or of pixels — it’s a feature of distributions, and any field that names a distribution is implicitly bumping into this module.
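A sketch of that two-histogram comparison using NumPy on synthetic images (a smooth gradient versus uniform noise); a real experiment would load actual image files instead:

```python
import numpy as np

def entropy_bits(values, bins=256):
    """Entropy of the empirical histogram of `values`, in bits."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
smooth    = np.tile(np.arange(256, dtype=np.int16), (256, 1))      # horizontal gradient
scrambled = rng.integers(0, 256, size=(256, 256), dtype=np.int16)  # uniform noise

for name, img in (("smooth", smooth), ("scrambled", scrambled)):
    pixel_H = entropy_bits(img.ravel())                   # histogram of pixel values
    diff_H  = entropy_bits(np.diff(img, axis=1).ravel())  # histogram of neighbor differences
    print(f"{name:>9}: pixels {pixel_H:4.1f} bits, differences {diff_H:4.1f} bits")
```

On the gradient the pixel histogram is still uniform (8 bits), so its predictability shows up only in the differences: they are all exactly 1, that histogram has essentially zero entropy, and that is the structure a compressor exploits; the noise image stays high-entropy under both views.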
Where this module flows next
The same definition reappears under new names.
H = −Σ p log₂ p. Probability times surprise, summed. Equals log₂ N when uniform, falls to 0 when one outcome wins outright. Everything else — the Shannon bound, Wordle’s first guess, password strength, cross-entropy loss — is a corollary.
On the Shannon Bars widget, click the uniform preset. What is H? Now click spike: estimate H without computing — is it above or below 1 bit?
Compute H for uniform distributions over N = 2, 4, 8, 16. State the pattern.
Compute H for a skewed four-outcome distribution, for example p = (1/2, 1/4, 1/8, 1/8). Compare to the uniform-4 case.
You guess CRANE; against a 2,300-word answer list it produces 200 distinct feedback patterns and partitions the candidates accordingly. The largest partition has 168 words; the average IG is 5.79 bits. SALET produces 205 patterns, largest 119, IG 5.88. The full list carries log₂ 2300 ≈ 11.2 bits of uncertainty, and the two average gains differ by less than 0.1 bit. Why, then, is SALET “better”? Answer in one sentence about the worst case.
A junior says: “uniform distribution has minimum entropy because it’s the most boring — every outcome the same.” Refute in one sentence. State the actual minimum.
Two passwords. (a) 8 random characters from {a–z, A–Z, 0–9} = 62 symbols. (b) 6 random English words from a dictionary of 7,776 (the Diceware list). Which has higher entropy? Use log₂ 62 ≈ 5.95 and log₂ 7,776 ≈ 12.9.