Why does Google show those results in that order?
For thirty years the decision rule barely changed.
Rarity is the signal. Everything else is consequence. Term frequencies are the easy half; the log of rarity is what does the work.
Counting matches ranks docs by length
Search “the”. Nearly every doc has “the”. The longest doc has it most. Counting matches ranks documents by how many words they contain — not by how relevant they are. We need a score that knows which words mean something.
TF — count, divided by length
# tf — count, normalized by doc length.
docs = [
    "the cat sat on the mat",
    "the dog ran fast",
    "cat and dog are friends",
    "the quick brown fox",
]
toks = [d.split() for d in docs]

def tf(term, doc):
    return doc.count(term) / len(doc)
# 'cat' in each doc — note doc 0 is longer, so its tf is smaller
# even though both contain 'cat' once.
[round(tf("cat", d), 3) for d in toks]
# → [0.167, 0.0, 0.2, 0.0]
#
# That length normalization is the whole reason: doc 2 "wins" between
# the two cat-containing docs because its sentence is shorter.

IDF — rarity is signal
df(t) = N (every doc has it): idf = log₂(N/N) = 0 bits. Stopword. Zero contribution.
df(t) = 1 (only one doc has it): idf = log₂ N bits. Maximum signal. The query word singles out that doc.
In the 4-doc widget above: “the” shows up in 3 of 4 docs, so its IDF is log₂(4/3) ≈ 0.42 bits — almost nothing. “fox” shows up in 1 of 4, so its IDF is log₂ 4 = 2 bits. The same word “fox” is worth 5× more to a query than “the”, and nobody hard-coded that.
from math import log2
N = len(toks)
def df(term):
    return sum(1 for d in toks if term in d)

def idf(term):
    d = df(term)
    return log2(N / d) if d else 0.0
[(t, df(t), round(idf(t), 2)) for t in ("the", "cat", "fox")]
# → [('the', 3, 0.42), # 3 of 4 docs → barely any signal
# ('cat', 2, 1.00), # half the corpus → 1 bit
# ('fox', 1, 2.00)] # rarest → 2 bits
#
# A query word that lives in every doc gives 0 bits — the score
# learned to ignore stopwords without anyone hard-coding them.

Why log? — the same log as in entropy
The log is not a smoothing trick. Treat df(t)/N as “the probability that a random doc contains t”. Then
idf(t) = log₂(N / df(t)) = −log₂ P(t) = I(t)
That is exactly the self-information from the entropy module: the bits of surprise in learning that a random doc contains t.
The frame is heuristic — it pretends “random doc” is uniform over the corpus and that presence is binary. Real corpora are messier. But the intuition the frame delivers is rigorous, and it’s the reason the log isn’t going anywhere: rarity logarithmically beats frequency, every time.
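That identity is checkable in two lines: a quick numeric sketch, reusing df, idf, N, toks, and log2 from the blocks above.
# Check: idf(t) is the self-information of "a random doc contains t"
# (reuses df, idf, N, toks, and log2 from the blocks above).
for t in ("the", "cat", "fox"):
    p = df(t) / N                          # P(t): fraction of docs with t
    assert abs(idf(t) + log2(p)) < 1e-12   # idf(t) == −log₂ P(t)
    print(t, round(-log2(p), 2))
# → the 0.42 · cat 1.0 · fox 2.0, the same numbers as the idf table.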
Why product (tf × idf)? — borrowing from log-probability
Independent events have probabilities that multiply; their logs add. So if a doc had a probability for each query term independently, the log of the joint probability would be a clean sum: log P(q | d) = Σt ∈ q log P(t | d). TF-IDF imitates that shape:
score(q, d) = Σt ∈ q tf(t, d) · idf(t)
But the imitation is not the thing. tf(t, d) is a normalized count, not a probability. Multiplying it by idf(t) is not literally taking the log of a joint probability. TF-IDF is a useful ranking score that borrows the “logs of independent probabilities add” shape — not a real likelihood. Treat it as a heuristic motivated by information theory; that is exactly the right amount of trust to give it.
# The score: tf · idf, summed over query terms.
def score(query, doc):
    return sum(tf(t, doc) * idf(t) for t in query.split())

ranking = sorted(
    range(N),
    key=lambda i: -score("the", toks[i]),  # try also "fox", "cat dog"
)
# Compare to the raw-count baseline.
def raw(query, doc):
    return sum(doc.count(t) for t in query.split())
# Query "the": raw count picks doc 0 (uses 'the' twice, longest doc).
# TF-IDF picks... barely anything — every score is near zero, because
# log2(4/3) ≈ 0.42 bits. The trap of "popular doc wins" is closed.
[raw("the", d) for d in toks]              # → [2, 1, 0, 1]
[round(score("the", d), 3) for d in toks]  # → [0.139, 0.104, 0.0, 0.104]

The vector view — cosine, not Euclidean
Pin a column for every word in the vocabulary. Each doc becomes a sparse vector: tf · idf in the column of every word it contains, zero everywhere else. Direction is topic; length is mostly verbosity. That is why docs are compared by the angle between their vectors (cosine similarity) rather than Euclidean distance: a doc pasted onto itself doubles its raw-count vector without saying anything new, and cosine is blind to that doubling.
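A sketch of the vector view on the toy corpus. vec, dot, and cosine are names introduced here for illustration (not part of the widget); it reuses toks, tf, idf, and N from above.
# Each doc as a vector with one column per vocabulary word, weighted
# by tf·idf (reuses toks, tf, idf, N from the blocks above).
vocab = sorted({w for d in toks for w in d})

def vec(doc):
    return [tf(w, doc) * idf(w) for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (dot(u, u) ** 0.5 * dot(v, v) ** 0.5)

# Rank docs against a query expressed as the same kind of vector.
q = vec("cat dog".split())
sorted(range(N), key=lambda i: -cosine(q, vec(toks[i])))
# → [2, 1, 0, 3]: the doc with both query words wins.

# Why angle, not distance: paste a doc onto itself and its raw-count
# vector doubles. Same direction (cosine 1.0), nonzero Euclidean gap.
u = [toks[0].count(w) for w in vocab]
u2 = [2 * x for x in u]
round(cosine(u, u2), 3)                          # → 1.0
sum((a - b) ** 2 for a, b in zip(u, u2)) ** 0.5  # → 2.83, not 0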
Why this became the paradigm — and what came after
TF-IDF won three races at once. It was fast (sparse vectors, inverted index), scalable (web-sized corpora fit in commodity hardware), and interpretable (every score has a per-term breakdown — exactly what the widget shows). It also produced rankings that were hard to embarrass: the model has one strong assumption (term independence) and a probability-flavored intuition, and the failure modes were familiar.
The bridge to ML: TF-IDF was the first feature vector for text. Naive Bayes classifiers, logistic regression on text, latent semantic analysis — all stand on this representation. Modern dense neural retrievers (DPR, ColBERT, the dual-encoder family) replaced sparse TF-IDF vectors with dense embeddings, but they still benchmark against BM25. A new method that doesn’t beat the 30-year-old baseline on a real retrieval task is not a new method. The log is still there.
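For reference, a minimal BM25 sketch: the classic k1/b scoring shape with the usual default parameters (k1 = 1.5, b = 0.75) and the smoothed, nonnegative idf variant. This is a sketch of the baseline under those assumptions, not anything this page's widget runs; it reuses toks, df, and N from above.
from math import log

# BM25: term frequency saturates (k1) and is length-normalized (b);
# the log is still the heart of the score, thirty years on.
def bm25(query, doc, k1=1.5, b=0.75):
    avglen = sum(len(d) for d in toks) / N
    s = 0.0
    for t in query.split():
        f = doc.count(t)  # raw count; BM25 does its own normalization
        i = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)  # smoothed idf
        s += i * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avglen))
    return s

[round(bm25("fox", d), 2) for d in toks]
# → [0.0, 0.0, 0.0, 1.3]: the rare word still dominates.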
The independence assumption sometimes lies. A query for “new york” on a TF-IDF index returns docs that contain either “new” or “york” with high score, including docs about new things in York, England. TF-IDF cannot tell that “new york” is one thing. Phrase matching, n-grams, and (eventually) dense embeddings are how search engines patched this — none of them removed TF-IDF, they layered on top.
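The cheapest version of that patch, sketched on the toy corpus: index each adjacent pair of words as an extra “word”, so a phrase gets its own document frequency and its own idf. with_bigrams is a hypothetical helper introduced here, not part of the widget code; it reuses toks.
# Bigrams as extra tokens: a phrase becomes a single (rarer) term,
# layered on top of the unigram index rather than replacing it.
def with_bigrams(tokens):
    return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]

toks2 = [with_bigrams(d) for d in toks]
# "cat and" only matches the doc where the words are adjacent, in
# that order, so it gets df = 1 and a full 2 bits of idf.
sum(1 for d in toks2 if "cat and" in d)   # → 1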
The search score behaves like a sum of bits. Common words give near zero, rare words give many. The log is not a trick — it is the unit of information. But tf × idf is not literally a sum of bits; it is a heuristic that borrows the shape from log-probability. That borrowed shape ran the web for thirty years, and still defines what every neural retriever has to beat.
In the widget, search “the” with the raw count ranking. Which doc wins, and why? Now switch to TF-IDF. What happens to the ranking, and what does the IDF panel say about “the”?
For the 4-doc corpus in the widget and query “cat”: compute (a) tf(“cat”, d) for each doc, (b) idf(“cat”), (c) the TF-IDF score for each doc. Which doc wins, and is it the longer or the shorter cat-containing doc?
Imagine we add 100 copies of doc 1 (“the cat sat on the mat”) to the corpus, growing N from 4 to 104. For the query “cat”: predict (without recomputing) what happens to (a) idf(cat), (b) the score of doc 3 (“cat and dog are friends”), (c) the ranking of doc 3 vs the (now numerous) copies of doc 1.
Suppose a query word appears in literally every doc in the corpus. Compute its IDF and explain why the score for that word, on every doc, is exactly zero. Then: write a one-sentence explanation of why this matches intuition (rather than being a bug).
IR textbooks usually present TF-IDF as a sequence of definitions: tf, then idf, then their product, then cosine. The product gets a sentence (“we multiply because that’s the formula”). Lemma keeps the product honest: tf × idf is not a log probability — it borrows that shape. Hiding that distinction is what makes “why log?” feel like a trick rather than a unit of information.