Lemma
math, backwards
the hook · 20 questions, with a twist

How many yes/no questions to find one item in 1,024?

Halve the list each time and the answer is 10. The dictionary search from the log module, in new clothes. But the items are not all equally likely. If “the” is a thousand times more frequent than “antidisestablishmentarianism”, asking about the rare one first is wasted breath. The right number of questions, on average, is no longer log₂ N. It is entropy.

Entropy = the average number of yes/no questions needed. When everything is equally likely, it collapses to log₂ N. When one outcome dominates, it falls toward 0. Wordle, Huffman compression, password strength, decision trees — all are bumping against the same number.

tool spec
what

H(X) = −Σ pᵢ log₂ pᵢ. The expected number of yes/no questions needed to identify which outcome of X occurred. Reduces to log₂ N when the N outcomes are equally likely (the maximum); collapses to 0 when one outcome has probability 1 (the minimum).

applies when

Uncertainty has structure: Wordle (information gain per guess), Huffman / arithmetic compression (the Shannon bound), password strength (NIST entropy), decision-tree splits (information-gain criterion), language-model perplexity (2^H), KL divergence and mutual information.

breaks when

You don’t actually know p — entropy is defined on a distribution, and estimating one from a sample is its own problem. Continuous distributions need differential entropy, which behaves differently (it can be negative). Entropy uses only the marginal p; identical marginals can hide very different joint structure (mutual information, not entropy, captures that).

Widget — Shannon bars
[Interactive widget: four bars A = 0.40, B = 0.30, C = 0.20, D = 0.10, with readouts — entropy H = 1.85 bits, uniform max log₂(4) = 2.00 bits, fraction of max 92%, max p = 0.40, min p = 0.10, active outcomes 4.]
Click uniform: all four bars equal, every p = 0.25, surprise per outcome = 2 bits, H = 2 bits — the maximum for four outcomes. Click spike: one bar swallows the rest, surprise on the spike near 0, surprise on the others ≥ 2.5 bits, H falls below 1. Click certain: one outcome at probability 1, three at 0, H = 0. Bar height in probability and bar height in surprise tell opposite stories — entropy is the average of the second, weighted by the first.
the arc
1

log₂ N — the equally-likely floor

Start where the log module ended. With N equally-likely outcomes, halving the candidate set each question takes log₂ N questions. A standard deck: log₂ 52 ≈ 5.7 bits — six questions identify any card. Two coin flips: log₂ 4 = 2 bits. A 64-bit random integer: 64 bits, by definition. The unit “bit” is the answer to one yes/no question with two equally likely outcomes — and log₂ N is how many such questions it takes to specify one of N.
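A quick check of those three numbers, using nothing beyond Python's math module:

```python
import math

# log2(N) = yes/no questions needed to pin down one of N equally likely outcomes
for name, n in [("card in a standard deck", 52),
                ("two coin flips", 4),
                ("64-bit random integer", 2**64)]:
    print(f"{name}: log2({n}) = {math.log2(n):.2f} bits")
# card in a standard deck: 5.70 bits
# two coin flips: 2.00 bits
# 64-bit random integer: 64.00 bits
```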

2

Surprise, weighted

When outcomes have unequal probabilities, “halve the list” stops being optimal. A question that splits a list 90% / 10% yields less than half a bit on average: the 90% answer barely narrowed anything. The right accounting is per-outcome: the surprise of an event with probability p is −log₂ p — the bits revealed by learning that outcome. Common events carry low surprise; rare events carry high. The expected surprise is entropy:

H(X) = −Σ p_i log₂ p_i  =  Σ p_i · (−log₂ p_i)
                          ↑
               probability × surprise, summed

Two anchors. Uniform, p = 1/N for every i: H = log₂ N — the equally-likely floor from arc 1, now recovered as the maximum. Spike, p = 1 on one outcome, 0 elsewhere: H = 0 — no questions, no information, no uncertainty. Every distribution sits between those two, and the widget above lets you slide along that line.
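A minimal entropy function makes both anchors, and the widget's default bars, checkable in a few lines of plain Python:

```python
import math

def entropy(probs):
    """Expected surprise in bits: sum of p * (-log2 p), skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0    -- uniform max, log2(4)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0    -- spike: no uncertainty left
print(entropy([0.40, 0.30, 0.20, 0.10]))  # ~1.846 -- the widget's default bars, ~92% of max
```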

3

Why the average matters — Shannon's bound

In 1948 Shannon proved that no code can compress a stream of symbols below H(X) bits per symbol on average — the Shannon bound. English text has per-character entropy around 1.0–1.5 bits (down from the naive log₂ 26 ≈ 4.7) because letters are not independent and are not equally likely; that gap is exactly the room compression has to work in. zip and gzip live there. The bound is tight in two senses: no code can beat it, and some codes (Huffman within 1 bit, arithmetic arbitrarily close) hit it. Entropy is not a measurement people invented — it is the floor that every field independently rediscovered by hitting it.
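One way to feel the bound, a minimal sketch using the standard library's zlib: repetitive (low-entropy) bytes compress to a sliver of their raw size, while uniformly random bytes barely compress at all.

```python
import os
import zlib

n = 100_000
repetitive = b"the quick brown fox " * (n // 20)  # heavy repetition: low per-byte entropy
random_bytes = os.urandom(n)                      # ~8 bits of entropy per byte: near the ceiling

for name, data in [("repetitive text", repetitive), ("random bytes", random_bytes)]:
    ratio = len(zlib.compress(data, 9)) / len(data)
    print(f"{name}: compressed to {ratio:.1%} of original size")
# repetitive text: well under 5% of original
# random bytes: roughly 100% of original (no room below the entropy floor)
```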

4

Information gain — Wordle's optimal first guess

Wordle: pick a 5-letter guess; the game returns one of 3⁵ = 243 possible feedback patterns (each tile gray/yellow/green); each pattern partitions the remaining candidate words. A guess gives you IG = H(words) − E[H(words | feedback)] bits of information gain on average. Maximizing IG is what every Wordle solver, ever, does — though most players do it by intuition. The numerical winner over the official ~2,300-word answer list is SALET (an old word for a helmet) at ≈ 5.88 bits, beating CRANE (5.79) and ADIEU (5.49). The “obvious” answer (vowels) loses to “split letters across feedback patterns evenly”, which is exactly maximum-entropy behavior on the feedback distribution.
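A sketch of that computation under a uniform prior over the candidates; the feedback scorer below is simplified and the word list is a toy stand-in, not the official answer list:

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Wordle-style pattern: 2 = green, 1 = yellow, 0 = gray (simplified double-letter handling)."""
    pattern = [0] * 5
    remaining = list(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):  # greens first
        if g == a:
            pattern[i] = 2
            remaining.remove(g)
    for i, g in enumerate(guess):                    # then yellows, from letters not yet used
        if pattern[i] == 0 and g in remaining:
            pattern[i] = 1
            remaining.remove(g)
    return tuple(pattern)

def expected_information_gain(guess, candidates):
    """IG = H(uniform over candidates) - E[H(candidates | feedback pattern)]."""
    buckets = Counter(feedback(guess, ans) for ans in candidates)
    n = len(candidates)
    h_before = math.log2(n)
    h_after = sum((k / n) * math.log2(k) for k in buckets.values())
    return h_before - h_after

# toy candidate list, for illustration only
words = ["crane", "slate", "adieu", "salet", "trace", "caret", "react", "heart"]
for g in words:
    print(g, round(expected_information_gain(g, words), 3))
```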

Same machinery underlies decision-tree splits in machine learning (“which feature partitions labels into the most uncertainty-reducing groups?”), 20 Questions strategy (“ask the question with the highest expected information gain”), and the binary-search-vs-fingerprint-lookup contrast in algorithms.

5

Where this shows up — same equation, two pillars

Entropy is not randomness as a vibe. It is the expected surprise of a distribution. Once that sinks in, the same four lines describe two applications that look unrelated:

rare word                → high surprise
rare color               → high surprise
predictable distribution → low entropy
uniform distribution     → high entropy

TF-IDF is the rare-word version: a search query weighted by −log P(word), exactly the surprise of seeing that word in a random document. Stop-words have near-zero surprise, so they get near-zero weight without anyone hard-coding the list.
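A tiny illustration of that weighting; the document frequencies below are hypothetical, picked only to show the shape:

```python
import math

# hypothetical fraction of documents containing each word
doc_freq = {"the": 0.98, "of": 0.95, "entropy": 0.02, "salet": 0.0005}
for word, p in doc_freq.items():
    print(f"{word:10s} -log2 p = {-math.log2(p):5.2f} bits")
# "the" and "of" earn ~0.03-0.07 bits; "salet" earns ~11 -- stop-words weight themselves out
```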

Why images compress is the rare-color version on a histogram of pixel values — and then again on a histogram of neighbor differences, where smooth pictures look extremely predictable (low entropy → small file) and scrambled ones look uniform (high entropy → big file). Same H, two pillars: the number isn’t a feature of language or of pixels — it’s a feature of distributions, and any field that names a distribution is implicitly bumping into this module.
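The rare-color version, sketched on a toy 1-D “image”: the neighbor differences of a smooth gradient pile into one bucket (entropy near 0), while random pixels spread across many (entropy near the uniform maximum).

```python
import math
import random
from collections import Counter

def entropy(counts):
    """H in bits of the empirical distribution behind a histogram of counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def neighbor_diff_histogram(pixels):
    return Counter(b - a for a, b in zip(pixels, pixels[1:]))

smooth = list(range(256))                            # a smooth 1-D gradient
noisy = [random.randrange(256) for _ in range(256)]  # scrambled pixels

print(entropy(list(neighbor_diff_histogram(smooth).values())))  # 0.0      -> tiny file
print(entropy(list(neighbor_diff_histogram(noisy).values())))   # ~7+ bits -> barely compresses
```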

6

Where this module flows next

The same definition reappears under new names. Cross-entropy (the loss at the heart of classification) is your entropy plus a penalty when your model’s predicted distribution differs from the truth. KL divergence, mutual information, perplexity in language models — all consume this module. Password strength is just log₂ of the keyspace under the assumption of a uniform attacker, which is the special case of arc 1. Every time someone says “this carries more information” or “that’s redundant”, they are gesturing at a number this module makes precise.
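That “entropy plus a penalty” reading is the decomposition cross-entropy(p, q) = H(p) + KL(p || q); a quick numerical check in the same sketch style, using the glossary's 90%-heads coin:

```python
import math

def entropy(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Bits paid per outcome when the truth is p but your code/model is built for q."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.9, 0.1]   # the biased coin from the glossary's "bit" entry
model = [0.5, 0.5]   # a model that assumes a fair coin

h, ce = entropy(truth), cross_entropy(truth, model)
print(round(h, 3), round(ce, 3), round(ce - h, 3))  # 0.469 1.0 0.531 -- the gap is KL(truth || model)
print(round(cross_entropy(truth, truth), 3))        # 0.469 -- the penalty vanishes when the model is right
```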

H = −Σ p log₂ p. Probability times surprise, summed. Equals log₂ N when uniform, falls to 0 when one outcome wins outright. Everything else — the Shannon bound, Wordle’s first guess, password strength, cross-entropy loss — is a corollary.

exercises · 손으로 풀기
1 · read the widget · the uniform max

On the Shannon Bars widget, click the uniform preset. What is H? Now click spike: estimate H without computing — is it above or below 1 bit?

2 · compute by hand · uniform N (no calculator)

Compute H for uniform distributions over N = 2, 4, 8, 16. State the pattern.

3 · write the equation · 1/2, 1/4, 1/8, 1/8

Compute H for p = (1/2, 1/4, 1/8, 1/8). Compare to the uniform-4 case.

4 · Wordle · why splits matter

You guess CRANE; against a 2,300-word answer list it produces 200 distinct feedback patterns and partitions the candidates accordingly. The largest partition has 168 words; the average IG is 5.79 bits. SALET produces 205 patterns, largest partition 119, IG 5.88. The total uncertainty is log₂ 2300 ≈ 11.2 bits and the two average gains differ by less than 0.1 bits — so why is SALET “better”? Answer in one sentence about the worst case.

5 · the evil one · 'uniform = lowest entropy' (no calculator)

A junior says: “uniform distribution has minimum entropy because it’s the most boring — every outcome the same.” Refute in one sentence. State the actual minimum.

6 · passwords · NIST entropy

Two passwords. (a) 8 random characters from {a–z, A–Z, 0–9} = 62 symbols. (b) 6 random English words from a dictionary of 7,776 (the Diceware list). Which has higher entropy? Use log₂ 62 ≈ 5.95 and log₂ 7776 ≈ 12.92.

glossary · used on this page · 6
logarithm·로그
The inverse of exponentiation. log_b(x) asks: what power of b gives x?
bit (information)·비트 (정보)
The unit of information when the log base is 2. A coin flip carries exactly 1 bit because two equally-likely outcomes need one yes/no question to resolve. Distinct from "bit" the binary digit: a binary digit _can_ hold 1 bit, but a binary digit storing the result of a biased coin (90% heads) carries less than 1 bit on average. Information bits and storage bits are the same word for two different things.
surprise·놀람도
`I(x) = −log₂ p(x)`, the bits of information you gain by learning that outcome `x` occurred. Rare outcomes (small p) carry high surprise; certain outcomes (p = 1) carry zero. Entropy is the _expected_ surprise — `H = E[I(X)] = Σ p log₂(1/p)`. The Morse code intuition lives here: 'E' is most frequent in English, so its surprise per occurrence is small, so it earns a short code.
Shannon bound·섀넌 한계
The source-coding theorem (1948): no lossless code can use fewer than `H(X)` bits per symbol on average. Huffman codes get within 1 bit of the bound; arithmetic coding gets arbitrarily close. The reason a 5GB random file barely compresses and a 5GB English-text file compresses to a fifth: the English text has lower per-symbol entropy than its raw bytes suggest.
information gain·정보 이득
`IG(X | Y) = H(X) − H(X | Y)`. How many bits of uncertainty about X are removed, on average, by observing Y. The objective every Wordle solver, decision-tree split, and 20-questions strategy is implicitly maximizing. Bounded above by `H(X)`: you cannot gain more bits than the variable holds.
cross-entropy·교차 엔트로피
The loss function used to train classifiers. Given a true class `y` and a predicted distribution `p` over `n` classes, cross-entropy is `−log p_y` — the negative log of the probability the model assigned to the correct answer. Equal to negative log-likelihood when the target is one-hot. It punishes confident wrong answers harshly (`p_y → 0` sends loss `→ ∞`) and rewards confident right ones (`p_y → 1` sends loss `→ 0`). Built on log because log turns the product of independent likelihoods into a sum.