TF-IDF
A document's score for a query is `tf(term, doc) · idf(term, corpus)`, summed over the query terms. Term frequency rewards a doc for using a query word often; inverse document frequency, typically `log(N / df(term))` for a corpus of `N` docs, discounts words that appear in many of them. The product mimics a "log of a joint probability" intuition, but only mimics it: tf is a raw count, not a probability, so TF-IDF is a heuristic motivated by information theory, not a true probabilistic model. It remains the baseline every neural retriever benchmarks against (BM25 is TF-IDF plus term-frequency saturation and document-length normalization).
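The scoring rule is short enough to spell out in code. Below is a minimal sketch, assuming raw-count tf and the textbook `idf = log(N / df)`; the function and corpus are illustrative, and real systems add refinements like sublinear tf and idf smoothing:

```python
import math
from collections import Counter

def tf_idf_score(query, doc, corpus):
    """Score one tokenized doc for a query: sum of tf(term, doc) * idf(term, corpus)."""
    counts = Counter(doc)          # tf: raw term counts within the doc
    n_docs = len(corpus)
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)   # number of docs containing the term
        if df == 0:
            continue                           # unseen terms contribute nothing
        idf = math.log(n_docs / df)            # rare across the corpus -> large idf
        score += counts[term] * idf
    return score

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks rallied on earnings news".split(),
]
query = "cat mat".split()
for doc in corpus:
    print(round(tf_idf_score(query, doc, corpus), 3), doc)
```

On this toy corpus the rare term "mat" (df = 1, idf ≈ 1.10) contributes far more to the first doc's score than the common term "cat" (df = 2, idf ≈ 0.41), which is exactly the discounting the definition describes.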
1958 TF (Hans Peter Luhn, IBM Yorktown) → 1972 IDF (Karen Spärck Jones, Cambridge)
Luhn (1958, IBM) noticed that *frequent words within a document* signal what it's about: term frequency. Fourteen years later, Spärck Jones (1972, Cambridge) added the missing half, that *rare words across the corpus* carry the discriminating signal: inverse document frequency. Their product, TF-IDF, was the dominant search-ranking score for more than two decades before BM25 and Google.