Engineers carried wooden rulers until 1972.
They used them to multiply by adding — sliding two log-spaced scales past each other, reading the product off the meeting point. The trick was three and a half centuries old by then and is still the trick: log(a·b) = log(a) + log(b). The same one-line identity is why a modern language model can train at all. Multiply fifty probabilities of 0.1 in float32 and the result rounds silently to zero — every gradient that touched it dies with it. PyTorch never multiplies them. It calls log_softmax and adds fifty negative numbers. The model trains because somebody decided to live in log-space.
The whole module is one equation. Everything else is consequence.
The identity that does all the work
Log is defined by one rule: log(a·b) = log(a) + log(b). Pick any base. The rule is the same. Every other property falls out of that line. log(1) = 0: from log(1·a) = log(1) + log(a). log(a/b) = log(a) − log(b): since b·(1/b) = 1 and log(1) = 0, log(1/b) = −log(b); apply the rule to a·(1/b), done. log(aⁿ) = n·log(a): apply the rule repeatedly to a · a · … · a. There is no fourth rule because there is no fourth way to combine multiplications. Practically: the log of a number tells you how many factors of the base it is built from. log₁₀(1000) = 3 because 1000 is three tens, multiplied. Counting factors. That's it.
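A minimal check of those lines in Python's standard library (math.log happens to be the natural log, but any base obeys the same rule; the variable values are arbitrary):

```python
import math

a, b, n = 3.7, 12.4, 5

# The one rule: the log of a product is the sum of the logs.
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))

# Consequences, each derived from that rule alone:
assert math.log(1) == 0.0                                        # log(1) = 0
assert math.isclose(math.log(a / b), math.log(a) - math.log(b))  # quotient
assert math.isclose(math.log(a ** n), n * math.log(a))           # power

# Counting factors of the base: 1000 is three tens, multiplied.
assert math.isclose(math.log10(1000), 3.0)
```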
Napier and the slide rule (×→+ embodied)
John Napier published the first log tables in 1614 because astronomers were dying inside, multiplying nine-digit numbers by hand to predict eclipses. His tables let them look up log(a) and log(b), add the two, and look up what number had that log — the answer to a·b with no multiplication anywhere. Three centuries later, every engineer carried a slide rule: a wooden ruler with two log-spaced scales that slid past each other. Aligning 2 on one against 3 on the other physically performed log(2) + log(3) and showed 6 at the meeting point. The slide rule is the identity from § 1, made into furniture. Apollo got to the moon on these.
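A toy sketch of what the tables and the slide rule mechanize, in Python rather than wood; the helper name is ours, not Napier's:

```python
import math

def multiply_like_napier(a: float, b: float) -> float:
    """Multiply without multiplying: look up two logs, add them,
    then look up which number has that log (the antilog)."""
    log_a = math.log10(a)          # first table lookup
    log_b = math.log10(b)          # second table lookup
    return 10 ** (log_a + log_b)   # antilog of the sum: the product

print(multiply_like_napier(2, 3))  # ~6.0, read off where the scales meet
```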
Underflow — and why log-space saves your model
A float32 can hold normal numbers down to about 10⁻³⁸, and subnormals buy only a few more decades below that, down to roughly 10⁻⁴⁵. Multiply fifty probabilities of 0.1 and you've crossed it — the result rounds to zero, silently. No exception. No warning. Every gradient that depended on it dies with it. This isn't a numerical-analysis curiosity; it's why every deep-learning library reports loss as a sum, not a product. The fix is the identity from § 1, applied mechanically: take logs the moment a product would otherwise form. log(p₁·p₂·…·pₙ) = Σ log(pᵢ). Each log(pᵢ) is a comfortable negative number; their sum is a larger but still perfectly representable negative number. No underflow can reach you. This is what log-likelihood is doing, and what log_softmax was built to do. Live in log-space; sums replace products; floats stop lying.
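A sketch of both the failure and the fix, assuming PyTorch is available (the same experiment works with NumPy's float32):

```python
import torch

p = torch.full((50,), 0.1, dtype=torch.float32)  # fifty probabilities of 0.1

# Naive product: 0.1^50 = 1e-50, far below anything float32 can represent.
print(p.prod())       # tensor(0.): silent underflow; every gradient through it dies

# Log-space: fifty comfortable negative numbers, summed.
print(p.log().sum())  # ~ -115.13, the log of the product, with no underflow anywhere
```

This is the same pattern the cross-entropy loss relies on: log_softmax produces log-probabilities directly, the loss is their (negative) sum, and the product of probabilities is never formed at all.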
Back to the application
This module is consumed by the Bitcoin Pizza application. There you're asked to compute 10⁹ · 2.89²⁰ by hand. You can't, until you take log₁₀ of the whole product — and then you're adding 9 + 20·log₁₀(2.89), two numbers a human can manage. That hand computation is only possible because of the identity in § 1. Same trick as Napier's, same trick as log_softmax. Different decade, different stakes, identical mechanism.
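The same arithmetic, run once in Python to sanity-check the hand computation; the ≈ 1.65 × 10¹⁸ figure is simply what the logs give:

```python
import math

log10_value = 9 + 20 * math.log10(2.89)  # log10 of 10^9 * 2.89^20
print(log10_value)                       # ~ 18.218
print(10 ** log10_value)                 # ~ 1.65e+18
```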