Lemma
math, backwards
journey · 4 days · graphics → ml
How Compression Works
JPEG squeezes a 1 MB photo into 100 KB. TF-IDF squeezes a million-word vocabulary into the handful of words that actually mean something. The two pages look unrelated — one is a graphics codec, one is a search-relevance score — but they run the same three steps: change basis, drop small, reconstruct. This path opens with the entropy bound that decides what is possible, then walks the three pages that obey it.
Compression in Lemma is a shape — a procedural skeleton that recurs under different names. Read the abstract floor first, then meet the three instances. By day four the bridge between pixels and words will look like a translation problem, not a coincidence.
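The three steps named above (change basis, drop small, reconstruct) can be sketched on a single row of eight samples. This is a minimal illustration, not any codec's actual pipeline: the basis is an orthonormal 1-D DCT, "drop small" is simplified to keeping the largest coefficients, and the sample values, `drop_small`, `compress`, and the `keep` parameter are all hypothetical names chosen for the example.

```python
import math

def dct(x):
    """Change basis: orthonormal DCT-II, samples -> frequency coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct(c):
    """Reconstruct: inverse of dct (the transpose of the same orthonormal basis)."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

def drop_small(coeffs, keep):
    """Drop small: zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]

def compress(signal, keep):
    """Run the three steps in order and return the approximation."""
    return idct(drop_small(dct(signal), keep))

signal = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical row of pixel intensities
approx = compress(signal, keep=3)           # 3 of 8 coefficients survive
max_error = max(abs(a - b) for a, b in zip(signal, approx))
```

Keeping all eight coefficients returns the original row up to rounding error; lowering `keep` trades accuracy for fewer numbers to store, which is the trade the rest of the path examines.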
the path
- 1 · module · day 1 · /modules/entropy
  Open with the bound the rest of the path bumps against. Entropy $H = -\sum_i p_i \log_2 p_i$ is the minimum average number of bits per symbol that any lossless coder can reach for a source with symbol probabilities $p_i$. Read Arc 5, *same equation, two pillars*, and note that the compression story already extends across pillars before you finish the abstract.
- 2 · application · day 2 · /graphics/image-compression
  First instance, pillar: graphics. The input is a raw pixel grid; the *change of basis* is the move from raw pixels to neighbour-differences (or to a histogram); for typical images, the entropy of the new representation is much smaller than that of the raw one. PNG lives inside this gap.
- 3 · application · day 3 · /graphics/jpeg-compression
  Second instance, still graphics. Same three steps, more aggressive. The basis is now the DCT: each 8×8 block becomes 64 frequency coefficients, and the signal concentrates in the low-frequency ones. *Drop small* is quantisation; *reconstruct* runs the inverse DCT. Crucially, this is *lossy*: JPEG accepts irreversible loss in exchange for going below the lossless floor that day 1 named.
- 4 · application · day 4 · /ml/tf-idf
  Third instance, and the pillar jumps to ML. The basis is the bag-of-words representation. *Drop small* is $\mathrm{idf}(t) = \log(N / \mathrm{df}(t))$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain term $t$: common words get near-zero weight and effectively *vanish*. The procedure is the same three steps, applied to a document instead of an image, and to ranking instead of bytes. *Same skeleton, different unit of compression.*
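The day 1 bound and the day 2 change of basis can be checked numerically. The sketch below measures empirical entropy (treating symbols as independent draws) for a row of pixels and for its neighbour-differences. The row is a hypothetical smooth gradient, and the differencing here is a stand-in for the kind of prediction step a lossless image format might use, not PNG's actual pipeline.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy H = -sum p_i log2 p_i, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def neighbour_differences(row):
    """Change basis: keep the first value, then each value minus its left neighbour."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def undo_differences(diffs):
    """Reconstruct: a running sum restores the original row exactly."""
    row = [diffs[0]]
    for d in diffs[1:]:
        row.append(row[-1] + d)
    return row

row = list(range(100, 116))        # hypothetical gradient: 16 distinct values
diffs = neighbour_differences(row)  # one 100 followed by fifteen 1s

h_raw = entropy_bits(row)    # 16 equally likely symbols -> 4.0 bits/symbol
h_diff = entropy_bits(diffs)  # about 0.34 bits/symbol
```

Nothing is lost in this version: `undo_differences(diffs)` returns `row` unchanged. The saving comes entirely from the new representation having a more lopsided symbol distribution.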
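The day 4 entry can be sketched the same way. The code below uses the $\mathrm{idf}$ formula as given above; the term-frequency part (count divided by document length), the three toy documents, and the helper names `tf_idf` and `top_terms` are assumptions made for illustration, since the path only states the idf half.

```python
import math
from collections import Counter

def idf(term, docs):
    """idf(t) = log(N / df(t)); df(t) counts the documents that contain t.

    Assumes the term occurs in at least one document, so df(t) > 0.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(doc, docs):
    """Weight each term by (count / document length) * idf. The tf choice is an assumption."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf(t, docs) for t, c in counts.items()}

def top_terms(doc, docs, keep):
    """Drop small: keep only the `keep` highest-weighted terms."""
    weights = tf_idf(doc, docs)
    return sorted(weights, key=weights.get, reverse=True)[:keep]

# Hypothetical three-document corpus, already tokenised into bags of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the codec quantises the image".split(),
]

weights = tf_idf(docs[2], docs)
survivors = top_terms(docs[2], docs, keep=3)
```

Here "the" appears in every document, so its idf is $\log(3/3) = 0$ and its weight is exactly zero however often it occurs. The three terms that survive are the ones unique to the third document.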