Engineers carried wooden rulers until 1972.
They used them to multiply by adding — sliding two log-spaced scales past each other, reading the product off the meeting point. The trick was three and a half centuries old by then and is still the trick: log(a·b) = log(a) + log(b). The same one-line identity is why a modern language model can train at all. Multiply fifty probabilities of 0.1 in float32 and the result rounds silently to zero — every gradient that touched it dies with it. PyTorch never multiplies them. It calls log_softmax and adds fifty negative numbers. The model trains because somebody decided to live in log-space.
The whole module is one equation. Everything else is consequence.
The identity that does all the work
Log is defined by one rule: log(a·b) = log(a) + log(b). Pick any base. The rule is the same. Every other property falls out of that line. log(1) = 0: from log(1·a) = log(1) + log(a). log(a/b) = log(a) − log(b): since b·(1/b) = 1 and log(1) = 0, log(1/b) = −log(b); apply the rule to a·(1/b), done. log(aⁿ) = n·log(a): apply the rule repeatedly to a · a · … · a. There is no fourth rule because there is no fourth way to combine multiplications. Practically: the log of a number tells you how many factors of the base it is built from. log₁₀(1000) = 3 because 1000 is three tens, multiplied. Counting factors. That's it.
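A minimal check of those lines in Python's standard library (math.log happens to be the natural log, but any base obeys the same rule; the variable values are arbitrary):

```python
import math

a, b, n = 3.7, 12.4, 5

# The one rule: the log of a product is the sum of the logs.
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))

# Consequences, each derived from that rule alone:
assert math.log(1) == 0.0                                        # log(1) = 0
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))  # quotient
assert math.isclose(math.log(a ** n), n * math.log(a))           # power

# Counting factors of the base: 1000 is three tens, multiplied.
assert math.isclose(math.log10(1000), 3.0)
```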
Napier and the slide rule (×→+ embodied)
John Napier published the first log tables in 1614 because astronomers were dying inside, multiplying nine-digit numbers by hand to predict eclipses. His tables let them look up log(a) and log(b), add the two, and look up what number had that log — the answer to a·b with no multiplication anywhere. Three centuries later, every engineer carried a slide rule: a wooden ruler with two log-spaced scales that slid past each other. Aligning 2 on one against 3 on the other physically performed log(2) + log(3) and showed 6 at the meeting point. The slide rule is the identity from § 1, made into furniture. Apollo got to the moon on these.
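A toy sketch of what the tables and the slide rule mechanize, in Python rather than wood; the helper name is ours, not Napier's:

```python
import math

def multiply_like_napier(a: float, b: float) -> float:
    """Multiply without multiplying: look up two logs, add them,
    then look up which number has that log (the antilog)."""
    log_a = math.log10(a)          # first table lookup
    log_b = math.log10(b)          # second table lookup
    return 10 ** (log_a + log_b)   # antilog of the sum: the product

print(multiply_like_napier(2, 3))  # ~6.0, read off where the scales meet
```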
Underflow — and why log-space saves your model
A float32 can hold normal numbers down to about 10⁻³⁸, and subnormals buy only a few more decades below that, down to roughly 10⁻⁴⁵. Multiply fifty probabilities of 0.1 and you've crossed it — the result rounds to zero, silently. No exception. No warning. Every gradient that depended on it dies with it. This isn't a numerical-analysis curiosity; it's why every deep-learning library reports loss as a sum, not a product. The fix is the identity from § 1, applied mechanically: take logs the moment a product would otherwise form. log(p₁·p₂·…·pₙ) = Σ log(pᵢ). Each log(pᵢ) is a comfortable negative number; their sum is a larger but still perfectly representable negative number. No underflow can reach you. This is what log-likelihood is doing, and what log_softmax was built to do. Live in log-space; sums replace products; floats stop lying.
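A sketch of both the failure and the fix, assuming PyTorch is available (the same experiment works with NumPy's float32):

```python
import torch

p = torch.full((50,), 0.1, dtype=torch.float32)  # fifty probabilities of 0.1

# Naive product: 0.1^50 = 1e-50, far below anything float32 can represent.
print(p.prod())       # tensor(0.): silent underflow; every gradient through it dies

# Log-space: fifty comfortable negative numbers, summed.
print(p.log().sum())  # ~ -115.13, the log of the product, with no underflow anywhere
```

This is the same pattern the cross-entropy loss relies on: log_softmax produces log-probabilities directly, the loss is their (negative) sum, and the product of probabilities is never formed at all.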
Back to the application
This module is consumed by the Bitcoin Pizza application. There you're asked to compute 10⁹ · 2.89²⁰ by hand. You can't, until you take log₁₀ of the whole product — and then you're adding 9 + 20·log₁₀(2.89), two numbers a human can manage. That hand computation is only possible because of the identity in § 1. Same trick as Napier's, same trick as log_softmax. Different decade, different stakes, identical mechanism.
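The same arithmetic, run once in Python to sanity-check the hand computation; the ≈ 1.65 × 10¹⁸ figure is simply what the logs give:

```python
import math

log10_value = 9 + 20 * math.log10(2.89)  # log10 of 10^9 * 2.89^20
print(log10_value)                       # ~ 18.218
print(10 ** log10_value)                 # ~ 1.65e+18
```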