Lemma
math, backwards
the hook · counts → bits → ranking

Why does Google show those results in that order?

For thirty years the decision rule barely changed. TF-IDF ranks documents by a score that looks like a sum of bits — and the log inside it is the same log that runs through the entropy module. A rare query term carries many bits of information; a common one carries almost none. The score learned to ignore stopwords without anyone hard-coding the list.

Rarity is the signal. Everything else is consequence. Term frequencies are distributions over words.

Widget — TF-IDF cooker
query: cat · or try your own
corpus: N = 4 · idf = log₂(N / df) · idf("cat") = log₂(4/2) = 1.00 bit (df = 2)
#1 cat and dog are friends · len 5 · score 0.200 (cat: 0.200 × 1.00)
#2 the cat sat on the mat · len 6 · score 0.167 (cat: 0.167 × 1.00)
#3 the dog ran fast · len 4 · score 0.000
#4 the quick brown fox · len 4 · score 0.000
Try this. Search "the" with the raw count ranking — "the cat sat on the mat" wins because it uses "the" twice. Switch to TF-IDF: the score collapses to almost nothing for every doc, because "the" appears in 3 of 4 docs and log₂(4/3) ≈ 0.42 bits is barely any signal. Now search "fox": log₂(4/1) = 2 bits, so the only doc that contains it leaps to the top. Rarity is the signal.
the arc
1

Counting matches ranks docs by length

Search "thecatsat""the cat sat". Every doc has thethe. The longest doc has it most. Counting matches ranks documents by how many words they contain — not by how relevant they are. We need a score that knows which words mean something.

2

TF — count, divided by length

Term frequency: tf(t, d) = count(t, d) / length(d). The fraction of words in d that equal t. Without dividing by length, a long doc would win every comparison just by repeating every word more often. Length normalization is the “level the playing field” step — same evidence per word, regardless of how many words there are.

# tf — count, normalized by doc length.
docs = [
    "the cat sat on the mat",
    "the dog ran fast",
    "cat and dog are friends",
    "the quick brown fox",
]
toks = [d.split() for d in docs]

def tf(term, doc):
    return doc.count(term) / len(doc)

# 'cat' in each doc — note doc 0 is longer, so its tf is smaller
# even though both contain 'cat' once.
[round(tf("cat", d), 3) for d in toks]
# → [0.167, 0.0, 0.2, 0.0]
#
# That length normalization is the whole reason: doc 2 "wins" between
# the two cat-containing docs because its sentence is shorter.
3

IDF — rarity is signal

Inverse document frequency: idf(t) = log₂(N / df(t)), where N is the corpus size and df(t) counts how many docs contain t. Walk the extremes:

  • df = N (every doc has it): log₂(1) = 0 bits. Stopword. Zero contribution.

  • df = 1 (only one doc has it): log₂(N) bits. Maximum signal. The query word singles out that doc.

In the 4-doc widget above: "the" shows up in 3 of 4 docs, so its IDF is log₂(4/3) ≈ 0.42 bits — almost nothing. "fox" shows up in 1 of 4, so its IDF is log₂(4/1) = 2 bits. The same word “fox” is worth 5× more to a query than “the”, and nobody hard-coded that.

from math import log2

N = len(toks)

def df(term):
    return sum(1 for d in toks if term in d)

def idf(term):
    d = df(term)
    return log2(N / d) if d else 0.0

[(t, df(t), round(idf(t), 2)) for t in ("the", "cat", "fox")]
# → [('the', 3, 0.42),    # 3 of 4 docs → barely any signal
#    ('cat', 2, 1.00),    # half the corpus → 1 bit
#    ('fox', 1, 2.00)]    # rarest → 2 bits
#
# A query word that lives in every doc gives 0 bits — the score
# learned to ignore stopwords without anyone hard-coding them.
4

Why log? — the same log as in entropy

The log is not a smoothing trick. Treat df/N as “the probability that a random doc contains t”. Then

idf(t) = log₂(N / df(t)) = −log₂ P(t) = the surprise of seeing t

That is exactly the self-information from the entropy module: the bits of surprise you’d feel if a random doc contained that term. A “surprising” word — appears in few docs — carries many bits. A “boring” word — appears in nearly all of them — carries near zero. IDF is just self-information wearing a different hat.

The frame is heuristic — it pretends “random doc” is uniform over the corpus and that presence is binary. Real corpora are messier. But the intuition the frame delivers is rigorous, and it’s the reason the log isn’t going anywhere: rarity logarithmically beats frequency, every time.
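A two-line check of that identity, assuming the step-2 and step-3 snippets above have already run (it reuses their `N`, `df`, and `idf`); `self_information` is a helper name introduced only for this sketch:

# IDF is self-information: idf(t) = log2(N/df) = -log2(df/N) = -log2 P(t),
# reading P(t) as "the probability that a random doc contains t".
from math import log2, isclose

def self_information(term):
    p = df(term) / N        # heuristic frame: uniform over docs, binary presence
    return -log2(p)

all(isclose(idf(t), self_information(t)) for t in ("the", "cat", "fox"))
# → True: same number, two readings — rarity in the corpus, surprise in bits.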

5

Why product (tf × idf)? — borrowing from log-probability

Independent events have probabilities that multiply; their logs add. So if a doc had a probability for each query term independently, the log of the joint probability would be a clean sum: Σᵢ −log P(tᵢ). TF-IDF imitates that shape:

score(q, d) = Σ_{t ∈ q} tf(t, d) · idf(t)

But the imitation is not the thing. tf is a count, not a probability. Multiplying it by idf is not literally taking the log of a joint probability. TF-IDF is a useful ranking score that borrows the “logs of independent probabilities add” shape — not a real likelihood. Treat it as a heuristic motivated by information theory; that is exactly the right amount of trust to give it.

# The score: tf · idf, summed over query terms.
def score(query, doc):
    return sum(tf(t, doc) * idf(t) for t in query.split())

ranking = sorted(
    range(N),
    key=lambda i: -score("the", toks[i]),  # try also "fox", "cat dog"
)

# Compare to the raw-count baseline.
def raw(query, doc):
    return sum(doc.count(t) for t in query.split())

# Query "the": raw count picks doc 0 (uses 'the' twice, longest doc).
# TF-IDF picks... barely anything — every score is near zero, because
# log2(4/3) ≈ 0.42 bits. The trap of "popular doc wins" is closed.
[round(raw("the", d), 3)   for d in toks]   # → [2, 1, 0, 1]
[round(score("the", d), 3) for d in toks]   # → [0.138, 0.104, 0.0, 0.104]
6

The vector view — cosine, not Euclidean

Pin a column for every word in the vocabulary. Each doc becomes a vector of TF-IDF weights — mostly zeros, a few non-zeros where the doc actually uses those words. (This sparse-vector representation is called bag of words: it keeps frequencies but throws away order.) The query is the same kind of vector. Ranking docs becomes “find the doc-vectors closest to the query-vector”.

Why cosine, not Euclidean distance? Because doc length should not move the score. Take a doc d and its concatenation with itself, d ++ d: the second has every component doubled — a vector pointing in the same direction, twice as long. Cosine of the angle between the two: 1 (identical direction). Euclidean distance: ‖d‖ (non-zero). Doubling a doc shouldn’t change its meaning; cosine is the metric that agrees.
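A sketch of that argument in code, reusing `toks` and `idf` from the snippets above; `vocab`, `vec`, and `cosine` are helper names introduced here, and the weights use raw count × idf (not length-normalized tf) so the doubling is visible:

# Doc → vector of weights, one slot per vocab word. Raw count × idf is used
# here (not length-normalized tf) so that concatenating a doc with itself
# visibly doubles every component, which is the situation described above.
vocab = sorted({w for d in toks for w in d})

def vec(doc):
    return [doc.count(w) * idf(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

d = toks[0]          # "the cat sat on the mat"
dd = d + d           # the doc concatenated with itself: every count doubles

round(cosine(vec(d), vec(dd)), 6)
# → 1.0: identical direction; doubling the doc doesn't move the cosine.

euclid = sum((a - b) ** 2 for a, b in zip(vec(d), vec(dd))) ** 0.5
round(euclid, 3)
# → non-zero (it equals ‖vec(d)‖): Euclidean distance punishes length.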

7

Why this became the paradigm — and what came after

TF-IDF won three races at once. It was fast (sparse vectors, inverted index), scalable (web-sized corpora fit in commodity hardware), and interpretable (every score has a per-term breakdown — exactly what the widget shows). It also produced rankings that were hard to embarrass: the model has one strong assumption (term independence) and a probability-flavored intuition, and the failure modes were familiar.

BM25 = TF-IDF + two empirical patches. (a) Saturation: the 5th occurrence shouldn’t count as much as the 1st; tf is replaced by (k₁+1)·tf / (k₁+tf), saturating around k₁ ≈ 1.2. (b) Length normalization: docs longer than the average get their tf damped further. Both patches were tuned on TREC evaluations in the 1990s and stayed.
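A quick numeric look at the saturation patch; this is a sketch of that one term only, with the length-normalization part (and its b parameter) left out:

# BM25's saturation patch: replace the raw within-doc count tf with
# (k1+1)·tf / (k1+tf). Each extra occurrence buys less than the one before.
k1 = 1.2

def saturate(tf_raw):
    return (k1 + 1) * tf_raw / (k1 + tf_raw)

[(n, round(saturate(n), 2)) for n in (1, 2, 5, 20)]
# → [(1, 1.0), (2, 1.38), (5, 1.77), (20, 2.08)]
# Raw tf grows 20×; the saturated weight barely doubles, capped near k1 + 1.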

The bridge to ML: TF-IDF was the first feature vector for text. Naive Bayes classifiers, logistic regression on text, latent semantic analysis — all stand on this representation. Modern dense neural retrievers (DPR, ColBERT, the dual-encoder family) replaced sparse TF-IDF vectors with dense embeddings, but they still benchmark against BM25. A new method that doesn’t beat the 30-year-old baseline on a real retrieval task is not a new method. The log is still there.

now break it

The independence assumption sometimes lies. A query for "new york" on a TF-IDF index returns docs that contain either “new” or “york” with high score, including docs about new things in York, England. TF-IDF cannot tell that “new york” is one thing. Phrase matching, n-grams, and (eventually) dense embeddings are how search engines patched this — none of them removed TF-IDF, they layered on top.
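A sketch of that failure mode on a tiny made-up corpus (the doc strings and every name ending in `2` are invented for this illustration; the scoring is the same tf·idf from step 5, rebuilt locally so the snippet runs on its own):

# Term independence failing: a query for "new york" rewards any doc that
# contains both words, even when they are not the phrase "new york".
from math import log2

docs2 = [
    "hotels in new york city near central park",
    "new exhibitions in york england this spring",
    "best pizza slices ranked by borough",
]
toks2 = [d.split() for d in docs2]
N2 = len(toks2)

def idf2(t):
    d = sum(1 for doc in toks2 if t in doc)
    return log2(N2 / d) if d else 0.0

def score2(query, doc):
    return sum(doc.count(t) / len(doc) * idf2(t) for t in query.split())

[round(score2("new york", d), 3) for d in toks2]
# → [0.146, 0.167, 0.0]
# The York, England doc actually wins (it is shorter), even though it has
# nothing to do with New York; the score only sees that both words are present.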

The search score behaves like a sum of bits. Common words give near zero, rare words give many. The log is not a trick — it is the unit of information. But tf × idf is not literally a sum of bits; it is a heuristic that borrows the shape from log-probability. That borrowed shape ran the web for thirty years, and still defines what every neural retriever has to beat.

exercises · by hand
1 · read the widget

In the widget, search "the" with the raw count ranking. Which doc wins, and why? Now switch to TF-IDF. What happens to the ranking, and what does the IDF panel say about "the"?

2 · compute by hand · TF-IDF · no calculator

For the 4-doc corpus in the widget and query "cat": compute (a) tf(cat, d) for each doc, (b) idf(cat), (c) the TF-IDF score for each doc. Which doc wins, and is it the longer or the shorter cat-containing doc?

3 · predict · adding a duplicate

Imagine we add 100 copies of doc 1 (“the cat sat on the mat”) to the corpus, growing N from 4 to 104. For the query "cat": predict (without recomputing) what happens to (a) idf(cat), (b) the score of doc 3 (“cat and dog are friends”), (c) the ranking of doc 3 vs the (now numerous) copies of doc 1.

4 · the evil one · zero signal

Suppose a query word appears in literally every doc in the corpus. Compute its IDF and explain why the score for that word, on every doc, is exactly zero. Then: write a one-sentence explanation of why this matches intuition (rather than being a bug).

why this isn't taught this way

IR textbooks usually present TF-IDF as a sequence of definitions: tf, then idf, then their product, then cosine. The product gets a sentence (“we multiply because that’s the formula”). Lemma keeps the product honest: tf × idf is not a log probability — it borrows that shape. Hiding that distinction is what makes “why log?” feel like a trick rather than a unit of information.

glossary · used on this page · 11
TF-IDF·TF-IDF
A document score equal to `tf(term, doc) · idf(term, corpus)`, summed over the query terms. Term frequency rewards a doc for using the query word a lot; inverse document frequency discounts words that show up in many docs. The product mimics "log of a joint probability" intuition — but only mimics: tf is a count, not a probability, so TF-IDF is a heuristic motivated by information theory, not a true probabilistic model. It was the dominant search-ranking score for two decades and remains the baseline (BM25 = TF-IDF + saturation + length normalization) every neural retriever benchmarks against.
bit (information)·비트 (정보)
The unit of information when the log base is 2. A coin flip carries exactly 1 bit because two equally-likely outcomes need one yes/no question to resolve. Distinct from "bit" the binary digit: a binary digit _can_ hold 1 bit, but a binary digit storing the result of a biased coin (90% heads) carries less than 1 bit on average. Information bits and storage bits are the same word for two different things.
distribution·분포
A _shape of uncertainty_: how probability is allocated across possible outcomes. A discrete distribution assigns a number to each outcome — `P(X = "cat") = 0.6, P(X = "dog") = 0.3, P(X = "bird") = 0.1`. A continuous distribution assigns _density_ across an interval — there is no probability at any single point, only over a range. The numbers must sum (discrete) or integrate (continuous) to 1, because _something_ must happen. A single probability is one number; a distribution is the whole shape behind it. Most quantities a model predicts, an asset can return, or a pixel can take, are not single numbers but distributions — and the _spread_ of those distributions is often what matters more than the center.
term frequency·단어 빈도 (TF)
`tf(t, d) = count(t, d) / length(d)`. The fraction of words in document `d` that are equal to term `t`. Dividing by doc length is the "level the playing field" step — without it, longer docs would score higher just for using more of every word. Variants exist (raw count, log-scaled, augmented), but the normalized form is the standard intuition.
inverse document frequency·역문서빈도 (IDF)
`idf(t) = log(N / df(t))`, where `N` is the corpus size and `df(t)` is the number of docs containing term `t`. A term in every doc gives `log(1) = 0` — no signal. A term in 1 doc gives `log(N)` — maximum signal. The log isn't a tuning knob: treating `df/N` as "the probability a random doc contains this term", IDF is exactly the self-information `−log P(t)` from the entropy module — the bits of surprise the term carries. The probability frame is heuristic (assumes uniform random doc, binary presence), but the intuition is rigorous: rarity is signal.
document frequency·문서빈도 (df)
`df(t)` = the number of documents in the corpus that contain term `t` at least once. Binary presence per doc (a doc with the term ten times still counts as 1). Distinct from _collection frequency_ (total occurrences across the corpus). IDF is built on `df`, not collection frequency, because "appears at all" is the signal a query cares about.
surprise·놀람도
`I(x) = −log₂ p(x)`, the bits of information you gain by learning that outcome `x` occurred. Rare outcomes (small p) carry high surprise; certain outcomes (p = 1) carry zero. Entropy is the _expected_ surprise — `H = E[I(X)] = Σ p log₂(1/p)`. The Morse code intuition lives here: 'E' is most frequent in English, so its surprise per occurrence is small, so it earns a short code.
vector·벡터
A quantity that has both magnitude and direction. Concretely a tuple of numbers — `(3, 4)` in 2D, `(1, 0, −2)` in 3D — but the _meaning_ of those numbers depends on what you're doing. In graphics they describe a control-point offset; in physics, a velocity or force; in ML, a parameter update or a feature representation. The tuple is the same; the _role_ changes. The arithmetic — addition component-wise, scaling by a number — is the same in every role, which is why one set of math serves all of them.
bag of words·bag of words (BoW)
A representation that treats a document as a multiset of its tokens — counts only, no order. "the cat sat" and "sat the cat" produce the same bag. Crude (it loses syntax, negation, anaphora), but the right level of abstraction for the task TF-IDF solves: ranking docs by which words they contain. Every word-counting feature for text — TF-IDF, naive Bayes on text, latent Dirichlet allocation — sits on top of this representation.
cosine similarity·코사인 유사도
`cos(u, v) = (u · v) / (‖u‖ ‖v‖)`. The cosine of the angle between two vectors — a value in `[−1, 1]` (or `[0, 1]` when components are non-negative, as in TF-IDF). It measures _direction_, not _magnitude_: doubling a doc (concatenating it with itself) leaves cosine unchanged but blows up Euclidean distance. That length-invariance is why retrieval ranks by cosine, not by raw dot product or Euclidean.
BM25·BM25
`Okapi BM25` — TF-IDF with two empirical fixes. (1) **Saturation:** raw `tf` grows linearly, but the 5th occurrence of a term in a doc shouldn't count as much as the 1st. BM25 replaces `tf` with `(k₁+1)·tf / (k₁+tf)`, which saturates around `k₁ ≈ 1.2`. (2) **Length normalization:** docs longer than the average get their `tf` damped further. Both corrections were tuned on retrieval evaluations (TREC) in the 1990s and held up: BM25 is still the default sparse-retrieval baseline that dense neural retrievers must beat to claim progress.