JPEG throws information away. Why does the picture still look like a picture?
The previous page closed at the entropy bound: the hard floor below which no lossless coder can shrink a stream. This page is about what JPEG buys by giving up on losslessness.
Lossy compression isn’t compression — it’s prioritization.
Cut the image into 8×8 blocks
JPEG starts by chopping the image into independent 8×8 pixel tiles. Why blocks? Two reasons. Locality: real pictures aren’t statistically uniform across the whole frame — a face has different structure from sky. Working in small windows lets the coder adapt without modeling global structure. Tractability: an 8×8 block is 64 numbers, small enough that direct algebra (an 8×8 transform) is fast and exact. The trade-off is blockiness — at low quality, the 8×8 tile boundaries become visible because each block was rounded independently. The widget’s “checkerboard” preset is one block; everything below scales to the rest of the picture by repetition.
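In code, the tiling step is nothing more than slicing. A minimal sketch, assuming a grayscale image whose sides are multiples of 8 (real JPEG pads partial edge blocks; to_blocks is an illustrative helper, not part of any codec):

import numpy as np

def to_blocks(img):
    # Slice an H×W image into independent 8×8 tiles, row-major.
    H, W = img.shape
    return [img[i:i+8, j:j+8]
            for i in range(0, H, 8)
            for j in range(0, W, 8)]

len(to_blocks(np.zeros((16, 24))))  # → 6 tiles, each coded on its own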
See the block in a different basis
A 64-entry pixel block is a vector in 64-dimensional space. The standard basis for that space is the pixel basis: each basis vector lights up exactly one pixel, and a block's coordinates in it are its 64 raw brightness values.
Crucially: switching basis is lossless. The information in the block doesn’t change; only the labels do. If the new basis is orthonormal, switching is just a matrix multiplication, and switching back is multiplying by the transpose. So the question is: is there a basis that’s better than pixels for what JPEG wants to do?
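That losslessness is easy to check numerically. A minimal sketch, using a random orthonormal basis rather than the DCT, since the property holds for any orthonormal basis:

import numpy as np

rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # random orthonormal basis

x = rng.integers(0, 256, size=64).astype(float)  # a flattened pixel block
y = B @ x                   # new coordinates: different numbers, same vector
np.allclose(x, B.T @ y)     # → True: the transpose undoes the change
np.isclose(x @ x, y @ y)    # → True: energy (squared length) is preserved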
DCT — a basis where natural images are sparse
JPEG's chosen basis is the discrete cosine transform (DCT): 64 basis vectors, each a 2D cosine pattern with a fixed horizontal and vertical frequency, running from the constant (DC) pattern in one corner to the fastest oscillation in the opposite corner.
The empirical claim that makes JPEG work: natural images are sparse in the DCT basis. Most of the energy in any given 8×8 block of a typical photograph concentrates in maybe 5–10 of the 64 DCT coefficients, almost always the low-frequency ones. The widget makes this readable. Toggle to gradient: nearly all energy sits at the DC corner and a couple of its immediate low-frequency neighbors. Toggle to flat: literally one coefficient (the DC term) carries everything. Toggle to checkerboard: an artificial extreme — most of the energy lands in a single high-frequency cell, but the sparsity is still there.
import numpy as np
# 8x8 DCT-II in matrix form. The cosine matrix M is the same one JPEG uses;
# applying it twice (rows then columns) gives the 2D DCT.
N = 8
def dct_matrix(N=8):
    M = np.zeros((N, N))
    for k in range(N):
        for n in range(N):
            M[k, n] = np.cos((2*n + 1) * k * np.pi / (2*N))
    M[0, :] *= 1 / np.sqrt(N)     # DC row normalization
    M[1:, :] *= np.sqrt(2 / N)    # AC rows: makes M orthonormal
    return M

M = dct_matrix(N)

def dct2d(block):
    return M @ block @ M.T   # rows, then columns

def idct2d(coef):
    return M.T @ coef @ M    # inverse: just transpose
# DCT itself is lossless. Round-trip an 8x8 block and the error is zero
# (up to floating point).
block = np.random.default_rng(0).integers(0, 256, size=(8, 8)).astype(float)
coef = dct2d(block)
back = idct2d(coef)
np.allclose(block, back)  # → True: the transform alone loses nothing

Quantization — drop what doesn't matter
The compression happens here. After the DCT, each coefficient is divided by an integer from a quantization table and rounded to the nearest whole number. The quantization table is hand-tuned (and standardized) to divide more aggressively in high-frequency cells than in low-frequency ones, because the human visual system is less sensitive to high-frequency error. Small high-frequency coefficients round straight to zero; the kept coefficients lose precision but survive.
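Real JPEG's version of this step is only a few lines. A sketch, using the example luminance table from Annex K of the JPEG standard; quality scaling of the table, and the level shift of pixels by −128 before the DCT, are omitted here:

# Example luminance quantization table from the JPEG standard (Annex K).
# Divisors grow toward the bottom-right: high frequencies are hit hardest.
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coef):
    return np.round(coef / Q).astype(int)   # divide and round: the lossy step

def dequantize(q):
    return q * Q                            # the decoder multiplies back

grad = np.add.outer(*[np.linspace(0, 255, 8)]*2) / 2   # the gradient preset
quantize(dct2d(grad))   # a handful of non-zeros in the top-left; the rest is 0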
The widget uses a simplified version: keep the top k coefficients by magnitude, zero the rest. Real JPEG quantization is per-coefficient with a fixed table per quality setting, but the qualitative effect is identical. Drop the slider to 4 on the texture preset and you'll see the reconstruction is still recognizable — most of what you saw was carried by those four numbers. Drop to 1 and you get a flat block at the average brightness; the reconstruction has zero error on the flat preset because flat was one number's worth of information.
This is also where lossy earns its name: rounding is irreversible. Once a coefficient has been rounded to zero, no decoder can bring it back.
# Keep top K coefficients by magnitude; zero the rest. JPEG's quantization
# step is more elaborate (a per-coefficient divisor table), but the
# qualitative effect — kill small / high-frequency entries — is the same.
def keep_top_k(coef, k):
    flat = coef.flatten()
    if k >= flat.size:
        return coef.copy()
    threshold = np.sort(np.abs(flat))[-k]
    out = coef.copy()
    out[np.abs(out) < threshold] = 0
    return out

def reconstruct(coef, k):
    return idct2d(keep_top_k(coef, k))
# Compare three block types: how many of 64 coefficients does each one need?
def kept_to_target_error(block, target_mae=2.0):
    coef = dct2d(block)
    for k in range(1, 65):
        err = np.mean(np.abs(reconstruct(coef, k) - block))
        if err <= target_mae:
            return k, err
    return 64, np.mean(np.abs(reconstruct(coef, 64) - block))
[(name, *kept_to_target_error(b()))
 for name, b in (("flat", lambda: np.full((8, 8), 128.0)),
                 ("gradient", lambda: np.add.outer(*[np.linspace(0, 255, 8)]*2) / 2),
                 ("checker", lambda: 130 + 100*((np.indices((8, 8)).sum(0) % 2)*2 - 1)))]
# → roughly [('flat', 1, 0.0),        DC alone reconstructs it perfectly
#            ('gradient', ~4, ~1.5),  a handful of low-frequency entries
#            ('checker', 17, 0.0)]    DC plus all 16 odd-odd cells
# The checker result is the instructive one: most of its energy lands in the
# single highest-frequency cell (7,7), but (-1)^n is not exactly a DCT basis
# vector, so an exact round-trip needs the whole tail of odd-frequency cells.
# (Exact k values can shift by one because keep_top_k keeps ties.)
# Same data, very different sparsity in the DCT basis.

Reconstruct — inverse DCT brings the picture back
To decode, JPEG runs the inverse DCT on the (now mostly zero) coefficient grid. The inverse is the same matrix machinery as the forward DCT, just with the cosine matrix transposed. The output is no longer the original block — it’s a projection of the original onto the subspace spanned by the kept basis vectors. That projection is the closest approximation to the original under the L² metric, given that you’re only allowed to use the kept directions.
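Because the basis is orthonormal, the projection claim is checkable with the helpers above: the reconstruction's squared error in pixel space equals exactly the energy of the coefficients that were dropped (Parseval), which is the least any reconstruction confined to the kept directions can achieve.

coef = dct2d(block)                      # `block` and the helpers from above
kept = keep_top_k(coef, 8)
err_pixels = np.sum((block - idct2d(kept))**2)   # error energy, pixel side
err_coefs = np.sum((coef - kept)**2)             # energy of dropped coefficients
np.isclose(err_pixels, err_coefs)                # → True: Parseval's theorem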
Two failure modes show up here. Blocking: each 8×8 tile was rounded independently, so adjacent tiles can disagree along their shared edge. Ringing: dropping high-frequency coefficients near a sharp edge causes oscillations because the remaining basis vectors can’t reproduce a step function. Both are visible at low quality settings. They’re the price of the trade.
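Ringing is easy to provoke with the same helpers. A sketch: a block holding a hard vertical edge, reconstructed from only its three largest coefficients:

step = np.zeros((8, 8))
step[:, 4:] = 255.0                    # hard 0 → 255 edge down the middle
np.round(reconstruct(dct2d(step), k=3)[0])
# → each row overshoots and ripples around 0 and 255 instead of jumping
#   cleanly: the kept cosines cannot build a step, and the residue is ringing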
Entropy coding — the final wrap
After quantization, each block is a stream of integers, mostly zero, with the kept values arranged in a zigzag scan order (low-frequency first). That stream goes into Huffman or arithmetic coding — the entropy module's Shannon-bound business — and that's where the file actually shrinks on disk. JPEG's contribution isn't the entropy coder; that's standard machinery. JPEG's contribution is producing a stream the entropy coder can pack tightly. Long runs of zeros compress to almost nothing; small integers carry few bits each.
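The scan itself is short enough to sketch. The (run, value) pairing below shows the flavor of JPEG's run-length symbols, not the exact bit-level format; zigzag and run_lengths are illustrative helpers, and grad is the gradient block from the quantization sketch above:

def zigzag(coef):
    # Walk anti-diagonals, alternating direction, so low frequencies come
    # first and the zeros bunch up at the end of the scan.
    idx = sorted(((i, j) for i in range(8) for j in range(8)),
                 key=lambda ij: (ij[0] + ij[1],
                                 ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]))
    return [int(round(coef[i, j])) for i, j in idx]

def run_lengths(stream):
    pairs, run = [], 0
    for v in stream:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append((run, "EOB"))   # end-of-block stands in for the zero tail
    return pairs

run_lengths(zigzag(keep_top_k(dct2d(grad), 5)))
# → a few (run, value) pairs up front, then one EOB covering the long zero run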
So the file size has three multiplicative savings: fewer non-zero coefficients (most rounded to zero), smaller magnitudes for the kept ones, and runs of zeros that entropy-code beautifully. The page in three bullets:
- Change basis (DCT) so the picture’s information concentrates in a few coordinates.
- Quantize (round) the small coordinates to zero — that’s the lossy step.
- Entropy-code (Huffman) the resulting sparse integer stream — that’s where the bytes are saved.
# Why the file actually shrinks: after quantization, the coefficient
# stream has lots of zeros and small ints; entropy coding (Huffman or
# arithmetic) packs that stream tightly. Same entropy module that bounds
# tf-idf and the lossless image-compression page — JPEG just feeds it a
# stream that's already been pre-sparsified by DCT + quantization.
from collections import Counter
from math import log2
def entropy(symbols):
    counts = Counter(symbols)
    N = len(symbols)
    return sum(-(c / N) * log2(c / N) for c in counts.values() if c > 0)
# Pretend a small image strip. Compare the entropy of the raw pixel stream
# to the entropy of the kept-DCT-coefficient stream after rounding.
rng = np.random.default_rng(1)
img = rng.integers(50, 200, size=(8, 32))  # 8 high × 32 wide = 4 blocks of 8x8
# This is illustrative; real JPEG quantizes per coefficient (zigzag table).
raw_h = entropy(img.flatten().tolist())
print(f"raw pixel H ≈ {raw_h:.2f} bits/symbol")
# After DCT + top-8-of-64 + integer rounding, most symbols are zero.
coef_stream = []
for bj in range(4):
    block = img[:, bj*8:(bj+1)*8].astype(float)
    kept = keep_top_k(dct2d(block), k=8)
    coef_stream.extend(np.round(kept).astype(int).flatten().tolist())
sparse_h = entropy(coef_stream)
print(f"kept-8 DCT stream H ≈ {sparse_h:.2f} bits/symbol")
# Typical run: raw ~7 bits/symbol, sparse-DCT ~1-2 bits/symbol.
# Same entropy bound, very different alphabet — the gap is what JPEG
# saves in file size on top of what discarding coefficients already saved.

Two images can have the same DCT-coefficient histogram (same multiset of values) and very different perceptual quality after quantization, because perceptual quality depends on which cell a coefficient sits in — high-frequency error is hidden, low-frequency error is glaring. Histogram entropy can't tell them apart; the human eye can. JPEG's quantization table encodes this asymmetry: smaller divisors for low-frequency cells, larger for high-frequency. The entropy of the rounded stream tells you the file size; the quantization table tells you the perceptual quality. They are different objectives, both stacked on the same DCT basis.
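That asymmetry is concrete enough to sketch with the pieces above: put the same 60-unit error in a low-frequency cell and in a high-frequency cell. The value histograms match, so entropy cannot tell the blocks apart; dividing by the Annex K table Q from the quantization sketch (a crude stand-in for perceptual weighting) can:

a = np.zeros((8, 8)); a[0, 1] = 60.0   # 60 units of error, low frequency
b = np.zeros((8, 8)); b[7, 7] = 60.0   # same magnitude, high frequency
entropy(a.flatten().tolist()) == entropy(b.flatten().tolist())   # → True
np.isclose(np.sum(idct2d(a)**2), np.sum(idct2d(b)**2))           # → True
60 / Q[0, 1], 60 / Q[7, 7]   # table-weighted severity: ≈5.5 vs ≈0.6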
Lossy compression isn’t compression — it’s prioritization. JPEG changes basis (DCT) so the picture becomes sparse, throws away the coordinates that don’t matter (quantization), and Huffman-packs the rest (entropy coding). Three steps, one savings: fewer coefficients, smaller values, longer zero runs. The math doesn’t beat entropy — it picks a different alphabet.
In the widget, pick the flat preset and slide down to 1. The reconstruction is still exactly the original. Why does keeping just one coefficient suffice? Which coefficient is it, and what does it carry?
In the widget, pick checkerboard and look at the DCT panel. Most of the energy concentrates at one cell. Where? Why is it that cell, and what does it tell you about how JPEG would handle a real-image patch full of fine texture?
In the widget, pick texture and slide k from 64 down to 1. The reconstruction degrades as k falls. Which coefficients are dropped first, and why does that match what JPEG actually does?
Two compressed images of the same scene end up with byte streams that have identical entropy. One looks fine; the other has visible blocky artifacts. How is that possible? In one sentence, distinguish what entropy bounds and what it doesn’t.
Image-processing courses introduce DCT and quantization as JPEG-specific machinery. Information theory courses introduce entropy and entropy coding as Shannon-specific. The bridge between them — change of basis is what makes the entropy bound survivable on real signals — gets left implicit. Lemma puts the three steps (basis change, quantization, entropy coding) in one arc so the reader can see what each one does and what it doesn't. The hard part of lossy compression isn't the entropy coder (that's standard) and isn't the transform (that's reversible). The hard part is the quantization table — the hand-tuned weights that decide which information humans don't notice. That table is where psychophysics meets information theory, and it's also where every codec since 1992 has competed.