Lemma
math, backwards
the hook · what's the exponent

How many times do you multiply by 2 to get 1,024?

The answer is 10. That 10 is log₂(1,024). Log is the inverse of exponentiation — it pulls the exponent out of the result. Most big numbers in nature are built from exponents: cells double, interest compounds, starlight dims as 1/r², sound spans twelve orders of magnitude. The result you see (1,024 cells, a 100× gain, magnitude-7.2) is rarely the natural parameter. The exponent (how many doublings, how many years) is. Log is the function that recovers it. You already use a special case — log₁₀(1,000,000) = 6 because the number has six zeros. The everyday digit count is the exponent for base 10. Generalize the digit count to any base and you have the log. And because exponents add when powers of the same base multiply, log turns multiplication into addition: log(a·b) = log(a) + log(b). That one line drives the rest of this page.

log = what’s the exponent. Everything else — log(a·b) = log(a) + log(b), slide rules, log-likelihood, digit-count estimates — is a consequence.

tool spec
what

The inverse of exponentiation. log(a·b) = log(a) + log(b) — products become sums. The whole module is that one identity.

applies when

A quantity is built from exponents — compound interest, half-life, decibels, earthquake magnitudes, sequence probabilities. The natural parameter is how many factors, and you want to recover it from the result.

breaks when

Argument is zero or negative — real log is undefined. Base 1 — every exponent gives 1 and the inverse collapses. The most common student error is logging across an addition: log(a + b) ≠ log(a) + log(b). The identity needs a product underneath, every time.
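A two-line check of both claims (a sketch; the numbers are arbitrary):

import math

# The identity holds for a product:
math.log10(2 * 50)                   # 2.0
math.log10(2) + math.log10(50)       # 2.0 — identical

# It fails for a sum — there is no decomposition of log(a + b):
math.log10(2 + 50)                   # ≈ 1.716, not 2.0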

Widget A — Doubling ladder
[Interactive widget: forward — exponentiation, 2⁴ = 16; inverse — logarithm, log₂(16) = 4.]
Three sliders — base b, exponent n, result r — all live on the same equation bⁿ = r. Drag n and r grows or shrinks. Drag r and it snaps to the nearest power of b; the answer falls out as n — that is log_b(r). The fact that n and r move together (orange follows ink) is the picture of log pulling the exponent out of the result.
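The same inversion in one call (a sketch; math.log takes the base as an optional second argument):

import math

b, r = 2, 16
n = math.log(r, b)        # "what exponent?" → 4.0, i.e. log₂(16)
b ** n                    # round-trip: exponentiation undoes the log → 16.0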
Widget B — Two Stacks
[Interactive widget: paired linear (0–1000) and log₁₀ (1–1000) axes. log₁₀(a) = 0.30, log₁₀(b) = 0.48, log₁₀(a) + log₁₀(b) = 0.78; a = 2.00, b = 3.00, a·b = 6.00.]
Drag a to 2 and b to 3. The marker lands on 6 — but you never multiplied. You added two log-distances. Drag b to 5. The marker jumps to 10. Same trick.
the arc
1

The identity that does all the work

Log is defined by one rule: log(a·b) = log(a) + log(b). Pick any base. The rule is the same. (Context picks the base by convention: log means log₁₀ in engineering, ln in ML and statistics, log₂ in algorithms — read the base from the domain when the subscript is omitted.) Every other property falls out of that line. log(a/b) = log(a) − log(b): take the rule, replace b with 1/b, done. log(aⁿ) = n·log(a): apply the rule n times to a · a · … · a. log(1) = 0: from log(1·a) = log(1) + log(a). There is no fourth rule because there is no fourth way to combine multiplications. Practically: the log of a number tells you how many factors of the base it is built from. log₁₀(1000) = 3 because 1000 is three tens, multiplied. Counting factors. That’s it.

import math

# every log law from one identity:
math.log10(2 * 50)              # ≈ math.log10(2) + math.log10(50)
math.log10(2 ** 10)             # ≈ 10 * math.log10(2)
math.log10(1)                   # 0.0
2

Same trick, five places

Exponential quantities scatter across many places — time-growth (compound interest, carbon dating), perceptual compression (decibels), scale-of-nature units (earthquake magnitude), counting information (bits). Same trick each time: set up the equation, take logs, pull the exponent. By hand — and, after the list, in a few lines of Python.

  • Compound interest. A million won at 7%/year — when does it double? 1.07ᵗ = 2 → t = log(2) / log(1.07) ≈ 0.301 / 0.0294 ≈ 10.24 years. The Rule of 72 (72/7 ≈ 10.3) is this formula sloppily memorized.
  • Carbon-14 dating. Carbon-14 halves every 5,730 years after death. If 25% remains: (1/2)^(t/5730) = 0.25 = (1/2)² → t = 11,460 years. For odd ratios (33%, 17%) only the log expression closes in a single line.
  • Decibels. dB = 10·log₁₀(P/P₀). Conversation 60 dB, rock concert 110 dB → acoustic power differs by 10⁵ = 100,000×. Your ears don’t perceive a hundred-thousand-fold gap; hearing is logarithmic in power, and decibels track that compression directly.
  • Earthquake magnitude. E = E₀·10^(1.5·M). Tōhoku 2011 (M 9.0) vs an ordinary large quake (M 7.0): E₉/E₇ = 10^(1.5×2) = 10³ = 1,000×. Two units of magnitude, three orders of energy. Natural earthquake energies span 19 orders of magnitude — comparison is hopeless without the compression Richter applies.
  • Bits and binary search. A 1,024-page dictionary, halving each step → log₂(1024) = 10 steps to find any word. A 32-bit int holds 2³² ≈ 4.3 billion values; identifying N items needs log₂(N) bits. A deck of cards has log₂(52!) ≈ 226 bits of shuffle entropy — 226 yes/no questions to specify a single shuffle exactly.
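All five, checked in Python (a sketch mirroring the numbers above):

import math

# Compound interest: solve 1.07**t == 2 for t
math.log(2) / math.log(1.07)             # ≈ 10.24 years

# Carbon-14: years until 25% remains (half-life 5,730)
5730 * math.log(0.25) / math.log(0.5)    # 11460.0 years

# Decibels: power ratio behind 110 dB vs 60 dB
10 ** ((110 - 60) / 10)                  # 100000.0

# Earthquake energy: M 9.0 vs M 7.0
10 ** (1.5 * (9.0 - 7.0))                # ≈ 1000.0

# Bits: binary-search steps; shuffle entropy of a deck
math.log2(1024)                          # 10.0
math.log2(math.factorial(52))            # ≈ 225.6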

Five problems, one shape: nature’s equation is exponential, take logs both sides, the exponent falls out. The identity from § 1 — multiplication into addition — is doing this work every single time.

3

Napier and the slide rule (×→+ embodied)

John Napier published the first log tables in 1614 because astronomers were dying inside, multiplying nine-digit numbers by hand to predict eclipses. His tables let them look up log(a) and log(b), add the two, and look up what number had that log — the answer to a·b with no multiplication anywhere. Three centuries later, every engineer carried a slide rule: a wooden ruler with two log-spaced scales that slid past each other. Aligning 2 on one against 3 on the other physically performed log(2) + log(3) and showed 6 at the meeting point. The slide rule is the identity from § 1, made into furniture. Apollo got to the moon on these.
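Napier's procedure, simulated (a sketch — math.log10 and the final power stand in for the table lookups):

import math

a, b = 2.0, 3.0
log_a = math.log10(a)     # lookup #1: log(a)
log_b = math.log10(b)     # lookup #2: log(b)
s = log_a + log_b         # the only arithmetic: one addition
10 ** s                   # lookup #3, reversed: which number has this log? → 6.0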

4

Underflow — and why log-space saves your model

A float32 can hold numbers down to about 10⁻³⁸. Multiply fifty probabilities of 0.1 and you’ve crossed it — the result rounds to zero, silently. No exception. No warning. Every gradient that depended on it dies with it. This isn’t a numerical-analysis curiosity; it’s why every deep-learning library reports loss as a sum, not a product. The fix is the identity from § 1, applied mechanically: take logs the moment a product would otherwise form. log(p₁·p₂·…·pₙ) = Σ log(pᵢ). Each log(pᵢ) is a comfortable negative number; their sum is a larger but still comfortable negative number. No underflow can reach you. This is what log-likelihood is doing, and what log_softmax was built to do. Live in log-space; sums replace products; floats stop lying.

import numpy as np

# Naive: multiply 50 probabilities. Underflows in float32.
p = np.float32(0.1)
np.prod([p] * 50)               # → 0.0  (silent death)

# Log-space: add 50 log-probabilities. Survives.
np.sum(np.log([p] * 50))        # → -115.13  (well-defined)
5

Where this shows up — same identity, two pillars

Log is what makes multiplication answer addition’s questions. Every field that compounds things multiplicatively — and many do — eventually needs to ask “how many?” or “how big?” or “how confident?” in a form that adds. Log is the bridge.

finance : rates compose multiplicatively;
        log makes them add (years to target, CAGR, continuous compounding).
ml      : independent likelihoods multiply;
        log makes them add — and *negative* log makes them a loss.

Five live consumers, all leaning on the single identity from arc 1 (two of them sketched in code after the list):

  • Bitcoin pizza inverts compound growth: F = P·(1+r)ᵗ can’t be solved for t without taking logs. t = log(F/P) / log(1+r). CAGR is the same identity solved for r instead. Three unknowns, one equation, three log-shaped answers.
  • Present value bridges from discrete compounding (1 + r/n)^(n·t) to the continuous form e^(r·t): take the log of the discrete expression, watch it reduce to r·t in the limit. Continuous compounding isn’t a separate operation — it’s the discrete one looked at through log.
  • Confidently wrong builds the loss −log(p_true). Multiple training examples have likelihoods that multiply; logs turn that product into a sum, and the negative sign makes “more confident, more wrong” climb instead of vanish.
  • TF-IDF measures rarity in bits: idf(t) = log(N / df(t)). The logarithm is what turns ‘twice as rare’ into ‘one bit more surprising’ — directly comparable to other bit-measured quantities like password strength and English-letter entropy.
  • Model calibration fits temperature T by minimizing log-loss on held-out validation; the logit function log(p/(1−p)) is the basis change that linearizes the calibration curve in the first place. Two different log uses inside one workflow.
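Two of the five in code (a sketch; P, F, r, and p are illustrative values, not the page's originals):

import math

# Years to grow P into F at rate r: F = P*(1+r)**t, solved for t
P, F, r = 1.0, 10.0, 0.07
math.log(F / P) / math.log(1 + r)     # ≈ 34.0 years

# CAGR: the same equation solved for r, given t
t = 10.0
(F / P) ** (1 / t) - 1                # ≈ 0.259, i.e. ~25.9%/year

# Logit: the transform that linearizes a calibration curve
p = 0.9
math.log(p / (1 - p))                 # ≈ 2.197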

Five problems across two pillars, one identity: the swap from × into +. Napier’s slide rule from arc 3 is the same machine running today inside log_softmax and the calibration optimizer.

log(a·b) = log(a) + log(b). The whole module. Everything else — the digit-count rule, the slide rule, the Rule of 72, hand-computing 10⁹ × 2.89²⁰ — is a corollary.

exercises · solve by hand
1 · read the graph

On the Two Stacks widget, set a = 4. What value of b makes a·b land exactly on 100? Read it off the log axis without computing.

2 · compute by hand · the digit rule · no calculator

Without a calculator, give log₁₀(2,000,000) using only log₁₀(2) ≈ 0.301.

3 · write the equation · sequence probability

You evaluate a 50-token sequence; each token has probability ~0.05. Write the formula your code should compute, and the formula it should avoid. Use log(0.05) ≈ −3.00.

4 · compute by hand · Stirling on a napkin · no calculator

Stirling’s approximation: log₁₀(n!) ≈ n·log₁₀(n) − n·log₁₀(e), with log₁₀(e) ≈ 0.434. Estimate log₁₀(100!). How many digits does 100! have?

5 · read the graph · equal log-distance = equal ratio

On Two Stacks, drag a and b so that the gap log(b) − log(a) is exactly the gap from log(1) to log(10). What does b/a always equal, regardless of where you placed them?

6 · write the equation · logsumexp

You’re given two probabilities p and q, but you only know log p and log q (not p, q themselves — they’d underflow). Derive a numerically stable expression for log(p + q). (This is the logsumexp trick.)

7 · the evil one · 'just multiply'

A junior says: “Log-space is just a perf optimization. Mathematically you could just multiply the probabilities — switch to float64 if you’re worried.” Write a one-paragraph rebuttal that holds for both float32 and float64. Then state the single equation that makes log-space work.

glossary · used on this page · 14
common log·상용로그
Logarithm with base 10. Written log(x) in most engineering contexts.
natural logarithm (ln)·자연로그
Logarithm with base e ≈ 2.71828. Written ln(x) or log_e(x).
logarithm·로그
The inverse of exponentiation. log_b(x) asks: what power of b gives x?
compound interest·복리
Interest computed on principal plus accumulated interest. Each period multiplies, not adds.
Rule of 72·72의 법칙
Doubling time ≈ 72 / (rate in %). At 8% per year, money doubles in ~9 years.
⚠ Why 72? ln(2) ≈ 0.693, and ln(1+r) ≈ r for small r. So t ≈ 0.693 / r. The 72 absorbs the small-r approximation error to land at a number with many divisors.
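The derivation, checked numerically (a sketch; the rates are arbitrary):

import math

for pct in (2, 4, 8, 12):
    exact = math.log(2) / math.log(1 + pct / 100)   # true doubling time
    print(pct, round(exact, 2), 72 / pct)           # vs the Rule of 72
# 2% → 35.0 vs 36.0 · 8% → 9.01 vs 9.0 · 12% → 6.12 vs 6.0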
factorial·계승
`n!` is `1 × 2 × 3 × … × n`. It counts orderings: `n!` is the number of distinct ways to arrange n items in a line. `5! = 120`. `52!` (the number of distinct shuffles of a deck) is a 68-digit number. Grows faster than any exponential `aⁿ`, which is why `log(n!)` and Stirling's approximation matter.
slide rule·계산자
A mechanical analog calculator (~1620 to ~1972) made of two log-spaced sliding scales. Aligning numbers performed multiplication by physically adding their log-distances. Apollo trajectories were checked on these. Replaced abruptly by the pocket electronic calculator.
float32·float32
The 32-bit IEEE-754 floating-point format. About 7 decimal digits of precision, magnitude range roughly 1e-38 to 3.4e38. Numbers outside that range round to zero (underflow) or infinity (overflow).
⚠ Different from float64. float32 underflows below ~1e-38 and overflows above ~3.4e38. Most ML frameworks default to float32 because GPU memory bandwidth is the bottleneck.
gradient·그래디언트
The slope of a function in many directions at once. For a function f(x, y, z, ...), the gradient is the vector of partial derivatives — it points in the direction of steepest ascent. In machine learning, this vector tells the optimizer which way to step the parameters to reduce loss.
⚠ In ML: the vector of partial derivatives of the loss with respect to every parameter. Optimization moves opposite the gradient ("gradient descent"). If a value used in the gradient becomes 0 from underflow, the entire chain collapses — that's why log-space matters.
underflow·언더플로우
When a number is too small in magnitude for the float type to represent, hardware rounds it to zero. float32 underflows below ~1e-38, float64 below ~1e-308.
⚠ Different from overflow. Underflow rounds *to zero*, silently — no exception, no warning. Gradients that touched it die.
log-likelihood·로그우도
The log of a probability (or product of probabilities). Used because products of small probabilities underflow floats; their logs sum cleanly.
⚠ Always negative for probabilities in (0,1). Bigger (less negative) is better. "Negative log-likelihood" (NLL) flips the sign so loss can be minimized.
log_softmax·log_softmax
A numerically stable function that computes log(softmax(x)) without ever forming the underflowing softmax probabilities. softmax turns a vector of scores into probabilities; taking the log of that directly would underflow on small entries. log_softmax computes both at once via the logsumexp trick. Used in every modern classifier loss because it gives clean gradients without losing precision.
Stirling's approximation·스털링 근사
A formula for the log of a factorial that avoids computing the factorial itself: ln(n!) ≈ n·ln(n) − n + ½·ln(2πn). The leading two terms (n·ln(n) − n) are usually accurate to a few percent and let you estimate the digit count of huge factorials by hand.
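Checking it for one n (a sketch; n = 50 is arbitrary):

import math

n = 50
math.log(math.factorial(n))                            # 148.4778 (exact)
n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)  # 148.4761 (Stirling)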
logsumexp·logsumexp
A numerically stable function for computing log of a sum of exponentials. Available as torch.logsumexp and scipy.special.logsumexp. The building block of softmax, log_softmax, and almost every probabilistic loss.
⚠ The "max-shift" trick — log(Σ exp(xᵢ)) = max(x) + log(Σ exp(xᵢ − max(x))) — keeps the inner exponents ≤ 0, avoiding both overflow and underflow.