Lemma
math, backwards
the hook · same pixels, different file sizes

Why do some images compress and others don’t?

Two photos, same dimensions, same number of pixels — same data in the trivial sense. Save them as PNG: one is 18 KB, the other is 312 KB. What makes the second one resist compression? The answer is one quantity from the entropy module, applied twice — once to the histogram, once to the structure the histogram throws away.

A file’s size is bounded by its information, not its pixel count. A color histogram is a distribution over pixel values.

Widget — Image compression
H · histogram: 4.00 bits/px
H · neighbor diff: 1.00 bits/px (gradient selected)
ratio: 0.25×
histogram (counts per level)

All three patterns share this histogram — each level appears 16 times.

All three patches use the same 256 pixel values — each of 16 grayscale levels appears 16 times. So the histogram entropy is identical: 4.00 bits/pixel everywhere. But the neighbor-difference entropy isn't: the gradient sits at exactly 1 bit (every adjacent pair differs by 0 or 1, each half the time), blocks sit slightly lower at ~0.92 bits (mostly zeros, with jumps at block boundaries), and scrambled shoots up to ~3.7 bits — adjacent pixels are essentially independent. Histogram counts symbols; neighbor differences see structure. Real compressors (PNG, JPEG, gzip) exploit the second view; that's why same-histogram images can compress to wildly different sizes.
the arc
1

Same data, different information

A 16×16 grayscale image is 256 pixels — 256 bytes if you store one byte per pixel. That’s the data. But how many bits of information does the image actually contain? That’s a different question, and the answer can be much less than 8 bits/pixel. If every pixel is the same shade, you need almost no bits to describe the picture: “all of them, value 7.” If every pixel is independent random noise, you need close to 8 bits each, because there’s no pattern to exploit. Real photos sit between, and the gap between data and information is exactly what a compressor harvests.
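You can see both extremes in a few lines. A minimal sketch (self-contained; the same entropy helper is built properly in the next arc):

import numpy as np
from collections import Counter
from math import log2

def entropy_bits(values):
    # Shannon entropy of a symbol sequence, in bits per symbol.
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

rng = np.random.default_rng(0)
flat  = [7] * 256                             # "all of them, value 7"
noise = list(rng.integers(0, 256, size=256))  # independent 8-bit noise

entropy_bits(flat)   # → 0.0 bits/pixel — nothing to describe
entropy_bits(noise)  # → ~7 bits/pixel — 256 samples undershoot the true 8;
                     #   more pixels push the estimate toward the ceiling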

2

The histogram bound

The first place to look is the pixel value distribution — the histogram (we’ll wrap it shortly). Count how many pixels are at each level, divide by the total to get probabilities, plug into $H = -\sum_i p_i \log_2 p_i$. The number you get is the average bits per pixel needed if you encoded each pixel independently, ignoring its neighbors.

For our 16-level patch where each level appears exactly 16 times, every level has probability $1/16$, so $H = \log_2 16 = 4$ bits/pixel. That’s the histogram bound — the Shannon limit for symbol-by-symbol encoding. Halving 8 bits/pixel to 4 already explains why a 16-shade image compresses to half the size of a 256-shade one.

import numpy as np
from collections import Counter
from math import log2

# A 16x16 patch with each of 16 grayscale levels appearing exactly 16 times.
# We'll build three layouts that share this histogram and compare entropies.
def gradient():
    img = np.zeros((16, 16), dtype=np.int32)
    for i in range(16):
        img[i, :] = i           # row i is constant level i
    return img

def blocks():
    img = np.zeros((16, 16), dtype=np.int32)
    for i in range(16):
        for j in range(16):
            img[i, j] = (i // 4) * 4 + (j // 4)
    return img

def scrambled(seed=0):
    rng = np.random.default_rng(seed)
    flat = gradient().flatten()
    rng.shuffle(flat)
    return flat.reshape(16, 16)

# Histogram entropy — bits/pixel if you encoded each pixel independently.
def histogram_entropy(img):
    flat = img.flatten()
    counts = Counter(flat)
    N = len(flat)
    return sum(-(c / N) * log2(c / N) for c in counts.values() if c > 0)

[histogram_entropy(p()) for p in (gradient, blocks, scrambled)]
# → [4.0, 4.0, 4.0]   identical — same histogram, same H.
3

What the histogram doesn't see

Toggle between gradient, blocks, and scrambled in the widget. The pixels are exactly the same multiset — same 16 copies of each level. Histogram entropy: 4.0 bits/pixel for all three. Identical.

But the pictures don’t look the same, and they don’t compress the same. The gradient is smooth: every adjacent pair is either equal or differs by 1. The scrambled patch has neighbors that fight each other — adjacent levels are essentially independent. Same histogram, very different spatial structure, and that structure is where most of the real compression happens.

# Neighbor-difference entropy — what a real compressor sees.
# For each pair of adjacent pixels (right + down), count abs(diff). The
# resulting distribution carries the spatial structure the histogram missed.
def neighbor_diff_entropy(img):
    diffs = []
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            if j + 1 < W: diffs.append(abs(int(img[i, j]) - int(img[i, j+1])))
            if i + 1 < H: diffs.append(abs(int(img[i, j]) - int(img[i+1, j])))
    counts = Counter(diffs)
    N = len(diffs)
    return sum(-(c / N) * log2(c / N) for c in counts.values() if c > 0)

[neighbor_diff_entropy(p()) for p in (gradient, blocks, scrambled)]
# → [1.00, 0.92, ~3.7]   gradient exactly 1 bit; scrambled approaches (but
# can't exceed) log2(16) = 4, the exact value depending on the shuffle seed.
# Histograms saw three identical sources; neighbor differences see three
# very different ones. That gap is the entire point of spatial compression.
4

Neighbor differences — entropy of a different alphabet

The trick: instead of encoding each pixel as a value 0..15, encode each pixel as the difference from its left or top neighbor. The first pixel still costs full bits, but every subsequent one is usually a small number — often zero. Now run $H = -\sum_i p_i \log_2 p_i$ on the distribution of differences. For the gradient: every difference is 0 or 1, so the entropy is exactly 1 bit/pixel. For scrambled: differences spread across the full range and the entropy climbs back toward 4. The widget shows both numbers side by side; the ratio is the speedup a structure-aware compressor delivers over a histogram-only one.
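Here’s the re-encoding itself as code — a minimal sketch using per-row left-neighbor differences (one simple choice; PNG’s real filters also use the top neighbor and more), reusing gradient and histogram_entropy from earlier:

# Keep each row's first pixel verbatim; replace the rest with signed
# differences from the left neighbor. Invertible, so nothing is lost.
def delta_encode_rows(img):
    enc = img.astype(np.int32).copy()
    enc[:, 1:] = img[:, 1:] - img[:, :-1]
    return enc

def delta_decode_rows(enc):
    return np.cumsum(enc, axis=1)   # undo: running sum along each row

g = gradient()
assert np.array_equal(delta_decode_rows(delta_encode_rows(g)), g)

# Same pixels, new alphabet: each gradient row becomes [i, 0, 0, ..., 0].
histogram_entropy(delta_encode_rows(g))   # → ≈0.55 bits/pixel, down from 4.0

(This row-only variant reads lower than the widget’s 1 bit because the widget also counts vertical neighbor pairs — for the gradient, those are all 1s.)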

This is the page in one sentence: histograms see what pixels appear, neighbor differences see how they sit next to each other. Real compressors (PNG, gzip on raw pixel arrays) operate on the second view.

5

Lossless vs lossy — two ways out of the histogram bound

Histogram entropy is a hard floor for symbol-by-symbol compression: no scheme that encodes each pixel independently can average fewer bits per pixel than the histogram’s entropy. But there are two escape hatches. One — change the alphabet. Encode neighbor differences instead of raw pixels; encode pairs instead of singletons; encode 8×8 blocks via a transform (DCT for JPEG, wavelets for JPEG 2000). The new alphabet has lower entropy because real images are not symbol-independent. PNG and FLAC live here.
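The “pairs instead of singletons” hatch is easy to check — a sketch that treats each non-overlapping horizontal pair of pixels as one symbol (the helper name is ours, not the widget’s):

# Entropy per *pixel* when the alphabet is horizontal pixel pairs.
def pair_entropy_per_pixel(img):
    pairs = [(int(row[j]), int(row[j + 1]))
             for row in img for j in range(0, len(row), 2)]
    counts = Counter(pairs)
    n = len(pairs)
    H_pairs = sum(-(c / n) * log2(c / n) for c in counts.values())
    return H_pairs / 2            # one pair symbol covers two pixels

[round(pair_entropy_per_pixel(p()), 2) for p in (gradient, blocks, scrambled)]
# → [2.0, 2.0, ~3.2]  gradient/blocks: only 16 distinct (v, v) pairs occur,
#   so 4 bits buy two pixels. scrambled barely budges — and its ~3.2 is an
#   artifact of having only 128 sample pairs; the true value is 4.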

Two — give up exactness. Lossy compression decides what to throw away based on what humans don’t notice — fine color shifts, high-frequency texture, sub-millisecond audio. The decompressed file is not the original; it’s a perceptual approximation. JPEG, MP3, H.264 live here. They aren’t bound by the source’s entropy at all — only by how much quality you’ll trade for size.
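A lossy sketch to contrast with the lossless check below — assuming Pillow’s JPEG encoder is available; exact byte counts vary by platform and library version, but the trend is the point:

import io
import numpy as np
from PIL import Image

# A larger synthetic "photo": smooth horizontal gradient plus mild noise.
rng = np.random.default_rng(0)
photo = np.linspace(0, 255, 256)[None, :] + rng.normal(0, 8, (256, 256))
photo = photo.clip(0, 255).astype('uint8')

def jpeg_size(quality):
    buf = io.BytesIO()
    Image.fromarray(photo, mode='L').save(buf, 'JPEG', quality=quality)
    return len(buf.getvalue())

{q: jpeg_size(q) for q in (10, 50, 90)}
# Size falls as quality drops: the encoder throws away more, while the
# entropy of the source never changed. That knob is what entropy can't see.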

# Real-world check: encode all three with PNG and compare file sizes.
# PNG uses filter prediction (essentially neighbor differences) followed by
# DEFLATE (LZ77 + Huffman). Same pixel multiset, very different output.
import io
from PIL import Image

def png_size(img):
    buf = io.BytesIO()
    Image.fromarray((img * 16).astype('uint8'), mode='L').save(buf, 'PNG')
    return len(buf.getvalue())

sizes = {p.__name__: png_size(p()) for p in (gradient, blocks, scrambled)}
# → {'gradient': ~120, 'blocks': ~140, 'scrambled': ~310}
#
# The histograms predicted 4 bits × 256 pixels = 1024 bits = 128 bytes of
# *symbol payload*; PNG's framing overhead is fixed. The interesting number
# is the ratio of payloads: gradient ≈ 1×, scrambled ≈ 2-3× larger. Same
# histogram, different file size — the signature of spatial compression.
now break it

Histogram entropy is a lower bound for symbol-independent encoding, not an upper bound on what a real codec achieves. The compressed payload in a PNG of the gradient patch is much smaller than 4 bits/pixel × 256 pixels = 128 bytes — because PNG also uses LZ77, which finds the literal repeated runs that neighbor-difference entropy doesn’t quite capture. The histogram says 4 bits, neighbor differences say 1, and LZ77 on top pushes the effective rate lower still. Each step down corresponds to a richer model of what the picture actually is. The same image is more or less compressible depending on the question you ask about it.
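You can watch that ladder with zlib alone — DEFLATE is the LZ77 + Huffman engine inside PNG, minus the filters (a sketch; byte counts are approximate and vary with zlib level):

import zlib

def deflate_size(img):
    return len(zlib.compress(img.astype('uint8').tobytes(), 9))

{p.__name__: deflate_size(p()) for p in (gradient, blocks, scrambled)}
# gradient/blocks: LZ77 eats the literal 16-byte runs — a few dozen bytes.
# scrambled: no runs to find, so DEFLATE falls back to entropy coding and
# lands near the ~128-byte histogram bound (4 bits × 256 pixels).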

A file’s size is bounded by its information, not its pixel count. Histograms see what values appear; neighbor differences see structure; transform codecs see frequency content. Each view exposes a smaller information budget than the last — and that’s exactly the budget the encoder has to spend.

exercises · solve by hand
1 · compute by hand · histogram entropy · no calculator

A 4-color image has 100 pixels: 50 red, 25 green, 15 blue, 10 yellow. Compute the histogram entropy in bits per pixel (recall $\log_2 x = \ln x / \ln 2$; for this one just use the rough values $\log_2 2 = 1$, $\log_2 4 = 2$, $\log_2 6.67 \approx 2.74$, $\log_2 10 \approx 3.32$).

2 · why scrambled is harder

You have two 100×100 images. Both contain the exact same 10,000 pixel values: 5,000 black and 5,000 white. Image A is a clean checkerboard pattern; image B is the same pixels randomly scattered. Their histogram entropies are identical ($H = 1$ bit/pixel, since p(black) = p(white) = 0.5). Which compresses smaller as a PNG, and why? Predict the rough ratio.
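Once you’ve committed to a prediction, a check in the spirit of the arc-5 code (the helper names here are ours):

import io
import numpy as np
from PIL import Image

def png_size_bytes(img):
    buf = io.BytesIO()
    Image.fromarray(img.astype('uint8') * 255, mode='L').save(buf, 'PNG')
    return len(buf.getvalue())

idx = np.arange(100)
checker = (idx[:, None] + idx[None, :]) % 2                        # image A
rng = np.random.default_rng(0)
scattered = rng.permutation(checker.flatten()).reshape(100, 100)   # image B

png_size_bytes(checker), png_size_bytes(scattered)   # compare to your guess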

3 · the evil one · histogram is blind

In the widget, switch between gradient and scrambled. Both report $H_{\text{histogram}} = 4.00$ bits/pixel. So a naive engineer concludes: “Both images need 256 × 4 = 1024 bits = 128 bytes of payload, plus PNG overhead.” Run the actual PNG sizes (the code in arc 5 measures it). Why does the scrambled file end up roughly twice as large — despite the histogram saying they should be the same?

4 · lossy is a different game

A JPEG of a photograph is often smaller than what the histogram bound of the same photograph would allow for any lossless encoding. How is that possible? In one sentence, distinguish what entropy bounds and what it doesn’t.

why this isn't taught this way

Information theory courses define entropy on i.i.d. sources and stop there — pixels, audio samples, and language tokens get treated as independent draws. Real signals never are. The gap between symbol entropy and contextual entropy (what real codecs exploit) is the whole reason a 4 MB photograph fits in 400 KB without visible loss; it’s also why “compression” is a ladder, not a single number. Lemma keeps the histogram and the neighbor-difference views in the same widget so the gap is visible from the start: same histogram, same H, three different pictures, three different file sizes. The entropy module names the bound; this page shows what the bound isn’t bounding.

glossary · used on this page · 5
distribution·분포
A _shape of uncertainty_: how probability is allocated across possible outcomes. A discrete distribution assigns a number to each outcome — `P(X = "cat") = 0.6, P(X = "dog") = 0.3, P(X = "bird") = 0.1`. A continuous distribution assigns _density_ across an interval — there is no probability at any single point, only over a range. The numbers must sum (discrete) or integrate (continuous) to 1, because _something_ must happen. A single probability is one number; a distribution is the whole shape behind it. Most quantities a model predicts, an asset can return, or a pixel can take, are not single numbers but distributions — and the _spread_ of those distributions is often what matters more than the center.
bit (information)·비트 (정보)
The unit of information when the log base is 2. A coin flip carries exactly 1 bit because two equally-likely outcomes need one yes/no question to resolve. Distinct from "bit" the binary digit: a binary digit _can_ hold 1 bit, but a binary digit storing the result of a biased coin (90% heads) carries less than 1 bit on average. Information bits and storage bits are the same word for two different things.
histogram·히스토그램
A count of how often each _value_ appears in a dataset. For an image: how many pixels are this dark, how many are that light, ignoring _where_ in the picture they sit. The histogram throws away spatial structure on purpose — it answers "what's the distribution of brightness?" but not "is the picture smooth or noisy?". When you compute _entropy_ over a histogram, you get a lower bound on the bits-per-symbol needed to encode each value _independently_; for an image, that's almost always a loose bound, because real pixels aren't independent of their neighbors.
lossless·무손실
A compression scheme that _throws nothing away_: decompressing the file gives back the exact original bits. PNG, ZIP, FLAC are lossless; the file is smaller because the encoder finds a more compact representation of the same information, not because it dropped any. The Shannon bound is the floor: no lossless scheme, no matter how clever, beats the entropy of the source. The opposite of _lossy_, which trades fidelity for size.
lossy·손실
A compression scheme that _throws information away_ on purpose, picking what to drop based on what humans don't notice — slight color shifts, high-frequency texture, sub-millisecond audio detail. JPEG, MP3, H.264 are lossy. The decompressed file is _not_ the original bits; it's an approximation chosen to look (or sound) close enough at a much smaller size. Lossy compression is not bound by entropy of the source — only by what perceptual quality the encoder is willing to sacrifice. The opposite of _lossless_.