Lemma
math, backwards
the hook · 20 questions, with a twist

How many yes/no questions to find one item in 1,024?

Halve the list each time and the answer is 10. The dictionary search from the log module, in new clothes. But the items are not all equally likely. If “the” is a thousand times more frequent than “antidisestablishmentarianism”, asking about the rare one first is wasted breath. The right number of questions, on average, is no longer log₂ N. It is entropy.

Entropy = the average number of yes/no questions needed. When everything is equally likely, it collapses to log₂ N. When one outcome dominates, it falls toward 0. Wordle, Huffman compression, password strength, decision trees — all are bumping against the same number.

tool spec
what

H(X) = −Σ pᵢ log₂ pᵢ. The expected number of yes/no questions needed to identify which outcome of X occurred. Reduces to log₂ N when the N outcomes are equally likely (the maximum); collapses to 0 when one outcome has probability 1 (the minimum).

applies when

Uncertainty has structure: Wordle (information gain per guess), Huffman / arithmetic compression (the Shannon bound), password strength (NIST entropy), decision-tree splits (information-gain criterion), language-model perplexity (2^H), KL divergence and mutual information.

breaks when

You don’t actually know p — entropy is defined on a distribution, and estimating one from a sample is its own problem. Continuous distributions need differential entropy, which behaves differently (it can be negative). Entropy uses only the marginal p; identical marginals can hide very different joint structure (mutual information, not entropy, captures that).

Widget — Shannon bars
[Interactive widget: four bars A = 0.40, B = 0.30, C = 0.20, D = 0.10, with readouts — entropy H = 1.85 bits, uniform max log₂(4) = 2.00 bits, fraction of max 92%, max p = 0.40, min p = 0.10, active outcomes 4.]
Click uniform: all four bars equal, every p = 0.25, surprise per outcome = 2 bits, H = 2 bits — the maximum for four outcomes. Click spike: one bar swallows the rest, surprise on the spike near 0, surprise on the others ≥ 2.5 bits, H falls below 1. Click certain: one outcome at probability 1, three at 0, H = 0. Bar height in probability and bar height in surprise tell opposite stories — entropy is the average of the second, weighted by the first.
the arc
1

log₂ N — the equally-likely floor

Start where the log module ended. With N equally-likely outcomes, halving the candidate set each question takes log₂ N questions. A standard deck: log₂ 52 ≈ 5.7 bits — six questions identify any card. Two coin flips: log₂ 4 = 2 bits. A 64-bit random integer: 64 bits, by definition. The unit “bit” is the answer to one yes/no question with two equally likely outcomes — and log₂ N is how many such questions it takes to specify one of N.
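A quick check of those three numbers, using nothing beyond Python's math module:

```python
import math

# log2(N) = yes/no questions needed to pin down one of N equally likely outcomes
for name, n in [("card in a standard deck", 52),
                ("two coin flips", 4),
                ("64-bit random integer", 2**64)]:
    print(f"{name}: log2({n}) = {math.log2(n):.2f} bits")
# card in a standard deck: 5.70 bits
# two coin flips: 2.00 bits
# 64-bit random integer: 64.00 bits
```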

2

Surprise, weighted

When outcomes have unequal probabilities, “halve the list” stops being optimal. A question that splits a list 90% / 10% yields less than half a bit on average: the 90% answer barely narrowed anything. The right accounting is per-outcome: the surprise of an event with probability p is −log₂ p — the bits revealed by learning that outcome. Common events carry low surprise; rare events carry high. The expected surprise is entropy:

H(X) = −Σ p_i log₂ p_i  =  Σ p_i · (−log₂ p_i)
                          ↑
               probability × surprise, summed

Two anchors. Uniform, p = 1/N for every i: H = log₂ N — the equally-likely floor from arc 1, now recovered as the maximum. Spike, p = 1 on one outcome, 0 elsewhere: H = 0 — no questions, no information, no uncertainty. Every distribution sits between those two, and the widget above lets you slide along that line.
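A minimal entropy function makes both anchors, and the widget's default bars, checkable in a few lines of plain Python:

```python
import math

def entropy(probs):
    """Expected surprise in bits: sum of p * (-log2 p), skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0    -- uniform max, log2(4)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0    -- spike: no uncertainty left
print(entropy([0.40, 0.30, 0.20, 0.10]))  # ~1.846 -- the widget's default bars, ~92% of max
```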

3

Why the average matters — Shannon's bound

In 1948 Shannon proved that no code can compress a stream of symbols below H(X) bits per symbol on average — the Shannon bound. English text has per-character entropy around 1.0–1.5 bits (down from the naive log₂ 26 ≈ 4.7) because letters are not independent and are not equally likely; that gap is exactly the room compression has to work in. zip and gzip live there. The bound is tight in two senses: no code can beat it, and some codes (Huffman within 1 bit, arithmetic arbitrarily close) hit it. Entropy is not a measurement people invented — it is the floor that every field independently rediscovered by hitting it.
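One way to feel the bound, a minimal sketch using the standard library's zlib: repetitive (low-entropy) bytes compress to a sliver of their raw size, while uniformly random bytes barely compress at all.

```python
import os
import zlib

n = 100_000
repetitive = b"the quick brown fox " * (n // 20)  # heavy repetition: low per-byte entropy
random_bytes = os.urandom(n)                      # ~8 bits of entropy per byte: near the ceiling

for name, data in [("repetitive text", repetitive), ("random bytes", random_bytes)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: compressed to {ratio:.1%} of original size")
# repetitive text: well under 5% of original
# random bytes: roughly 100% of original (no room below the entropy floor)
```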

4

Information gain — Wordle's optimal first guess

Wordle: pick a 5-letter guess; the game returns one of 3⁵ = 243 possible feedback patterns (each tile gray/yellow/green); each pattern partitions the remaining candidate words. A guess gives you IG = H(words) − E[H(words | feedback)] bits of information gain on average. Maximizing IG is what every Wordle solver, ever, does — though most players do it by intuition. The numerical winner over the official ~2,300-word answer list is SALET (an old word for a helmet) at ≈ 5.88 bits, beating CRANE (5.79) and ADIEU (5.49). The “obvious” answer (vowels) loses to “split letters across feedback patterns evenly”, which is exactly maximum-entropy behavior on the feedback distribution.
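A sketch of that computation under a uniform prior over the candidates; the feedback scorer below is simplified and the word list is a toy stand-in, not the official answer list:

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Wordle-style pattern: 2 = green, 1 = yellow, 0 = gray (simplified double-letter handling)."""
    pattern = [0] * 5
    remaining = list(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):  # greens first
        if g == a:
            pattern[i] = 2
            remaining.remove(g)
    for i, g in enumerate(guess):                    # then yellows, from letters not yet used
        if pattern[i] == 0 and g in remaining:
            pattern[i] = 1
            remaining.remove(g)
    return tuple(pattern)

def expected_information_gain(guess, candidates):
    """IG = H(uniform over candidates) - E[H(candidates | feedback pattern)]."""
    buckets = Counter(feedback(guess, ans) for ans in candidates)
    n = len(candidates)
    h_before = math.log2(n)
    h_after = sum((k / n) * math.log2(k) for k in buckets.values())
    return h_before - h_after

# toy candidate list, for illustration only
words = ["crane", "slate", "adieu", "salet", "trace", "caret", "react", "heart"]
for g in words:
    print(g, round(expected_information_gain(g, words), 3))
```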

Same machinery underlies decision-tree splits in machine learning (“which feature partitions labels into the most uncertainty-reducing groups?”), 20 Questions strategy (“ask the question with the highest expected information gain”), and the binary-search-vs-fingerprint-lookup contrast in algorithms.

5

Where this shows up — same equation, two pillars

Entropy is not randomness as a vibe. It is the expected surprise of a distribution. Once that sinks in, the same four lines describe two applications that look unrelated:

rare word                → high surprise
rare color               → high surprise
predictable distribution → low entropy
uniform distribution     → high entropy

TF-IDF is the rare-word version: a search query weighted by −log P(word), exactly the surprise of seeing that word in a random document. Stop-words have near-zero surprise, so they get near-zero weight without anyone hard-coding the list.
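A tiny illustration of that weighting; the document frequencies below are hypothetical, picked only to show the shape:

```python
import math

# hypothetical fraction of documents containing each word
doc_freq = {"the": 0.98, "of": 0.95, "entropy": 0.02, "salet": 0.0005}
for word, p in doc_freq.items():
    print(f"{word:10s} -log2 p = {-math.log2(p):5.2f} bits")
# "the" and "of" earn ~0.03-0.07 bits; "salet" earns ~11 -- stop-words weight themselves out
```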

Why images compress is the rare-color version on a histogram of pixel values — and then again on a histogram of neighbor differences, where smooth pictures look extremely predictable (low entropy → small file) and scrambled ones look uniform (high entropy → big file). Same H, two pillars: the number isn’t a feature of language or of pixels — it’s a feature of distributions, and any field that names a distribution is implicitly bumping into this module.
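The rare-color version, sketched on a toy 1-D “image”: the neighbor differences of a smooth gradient pile into one bucket (entropy near 0), while random pixels spread across many (entropy near the uniform maximum).

```python
import math
import random
from collections import Counter

def entropy(counts):
    """H in bits of the empirical distribution behind a histogram of counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def neighbor_diff_histogram(pixels):
    return Counter(b - a for a, b in zip(pixels, pixels[1:]))

smooth = list(range(256))                            # a smooth 1-D gradient
noisy = [random.randrange(256) for _ in range(256)]  # scrambled pixels

print(entropy(list(neighbor_diff_histogram(smooth).values())))  # 0.0      -> tiny file
print(entropy(list(neighbor_diff_histogram(noisy).values())))   # ~7+ bits -> barely compresses
```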

6

Where this module flows next

The same definition reappears under new names. Cross-entropy (the loss at the heart of classification) is your entropy plus a penalty when your model’s predicted distribution differs from the truth. KL divergence, mutual information, perplexity in language models — all consume this module. Password strength is just log₂ of the keyspace under the assumption of a uniform attacker, which is the special case of arc 1. Every time someone says “this carries more information” or “that’s redundant”, they are gesturing at a number this module makes precise.
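That “entropy plus a penalty” reading is the decomposition cross-entropy(p, q) = H(p) + KL(p || q); a quick numerical check in the same sketch style, using the glossary's 90%-heads coin:

```python
import math

def entropy(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Bits paid per outcome when the truth is p but your code/model is built for q."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.9, 0.1]   # the biased coin from the glossary's "bit" entry
model = [0.5, 0.5]   # a model that assumes a fair coin

h, ce = entropy(truth), cross_entropy(truth, model)
print(round(h, 3), round(ce, 3), round(ce - h, 3))  # 0.469 1.0 0.531 -- the gap is KL(truth || model)
print(round(cross_entropy(truth, truth), 3))        # 0.469 -- the penalty vanishes when the model is right
```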

H = −Σ p log₂ p. Probability times surprise, summed. Equals log₂ N when uniform, falls to 0 when one outcome wins outright. Everything else — the Shannon bound, Wordle’s first guess, password strength, cross-entropy loss — is a corollary.

exercises · 손으로 풀기
1 · read the widget · the uniform max

On the Shannon Bars widget, click the uniform preset. What is H? Now click spike: estimate H without computing — is it above or below 1 bit?

2 · compute by hand · uniform N (no calculator)

Compute H for uniform distributions over N = 2, 4, 8, 16. State the pattern.

3 · write the equation · 1/2, 1/4, 1/8, 1/8

Compute H for p = (1/2, 1/4, 1/8, 1/8). Compare to the uniform-4 case.

4 · Wordle · why splits matter

You guess CRANE; against a 2,300-word answer list it produces 200 distinct feedback patterns and partitions the candidates accordingly. The largest partition has 168 words; the average IG is 5.79 bits. SALET produces 205 patterns, largest partition 119, IG 5.88. The total uncertainty is log₂ 2300 ≈ 11.2 bits and the two average gains differ by less than 0.1 bits — so why is SALET “better”? Answer in one sentence about the worst case.

5 · the evil one · 'uniform = lowest entropy' (no calculator)

A junior says: “uniform distribution has minimum entropy because it’s the most boring — every outcome the same.” Refute in one sentence. State the actual minimum.

6 · passwords · NIST entropy

Two passwords. (a) 8 random characters from {a–z, A–Z, 0–9} = 62 symbols. (b) 6 random English words from a dictionary of 7,776 (the Diceware list). Which has higher entropy? Use log₂ 62 ≈ 5.95 and log₂ 7776 ≈ 12.92.

glossary · used on this page · 6
logarithm·로그
The inverse of exponentiation. log_b(x) asks: what power of b gives x?
bit (information)·비트 (정보)
The unit of information when the log base is 2. A coin flip carries exactly 1 bit because two equally-likely outcomes need one yes/no question to resolve. Distinct from "bit" the binary digit: a binary digit _can_ hold 1 bit, but a binary digit storing the result of a biased coin (90% heads) carries less than 1 bit on average. Information bits and storage bits are the same word for two different things.
surprise·놀람도
`I(x) = −log₂ p(x)`, the bits of information you gain by learning that outcome `x` occurred. Rare outcomes (small p) carry high surprise; certain outcomes (p = 1) carry zero. Entropy is the _expected_ surprise — `H = E[I(X)] = Σ p log₂(1/p)`. The Morse code intuition lives here: 'E' is most frequent in English, so its surprise per occurrence is small, so it earns a short code.
Shannon bound·섀넌 한계
The source-coding theorem (1948): no lossless code can use fewer than `H(X)` bits per symbol on average. Huffman codes get within 1 bit of the bound; arithmetic coding gets arbitrarily close. The reason a 5GB random file barely compresses and a 5GB English-text file compresses to a fifth: the English text has lower per-symbol entropy than its raw bytes suggest.
information gain·정보 이득
`IG(X | Y) = H(X) − H(X | Y)`. How many bits of uncertainty about X are removed, on average, by observing Y. The objective every Wordle solver, decision-tree split, and 20-questions strategy is implicitly maximizing. Bounded above by `H(X)`: you cannot gain more bits than the variable holds.
cross-entropy·교차 엔트로피
The loss function used to train classifiers. Given a true class `y` and a predicted distribution `p` over `n` classes, cross-entropy is `−log p_y` — the negative log of the probability the model assigned to the correct answer. Equal to negative log-likelihood when the target is one-hot. It punishes confident wrong answers harshly (`p_y → 0` sends loss `→ ∞`) and rewards confident right ones (`p_y → 1` sends loss `→ 0`). Built on log because log turns the product of independent likelihoods into a sum.