A model is wrong. Which knob should move, and by how much?
You have a parameter w and a way to measure how wrong the model is — a loss L(w).
In the widget below, drag the starting w and the η slider, then press step. Three preset rates: too small crawls down, good snaps to the minimum in a couple of steps, too big oscillates and explodes. The boundary between “good” and “explodes” is the entire stability theory of the method, in one screen. Gradient descent is the canonical example of optimization — five steps that recur in finance and calibration too.
Wrongness as a number
Take the simplest possible model: one parameter w, one input x, prediction w·x. We want w·x to match the true value y. “Closeness” needs a number; the standard choice is squared error L(w) = (w·x − y)² — zero when the prediction is exact, large when it’s far, and (importantly) smooth everywhere so we can take derivatives.
The widget fixes (x, y) = (2, 6), so the loss is L(w) = (2w − 6)² — a parabola with minimum at w* = 3. The whole problem reduces to: how do we get from any starting w to w* = 3?
Which direction? — the derivative
The derivatives module gives the answer in one line: L'(w) — the slope of the loss curve at the current w. If L'(w) > 0, the loss is rising as w grows; we should make w smaller. If L'(w) < 0, the opposite. The descent direction is always −L'(w).
In the widget, the green arrow at the current point shows that direction. As w approaches the minimum, the arrow shrinks; right at w* it has zero length — there’s no signal left, because the gradient is exactly zero at the minimum.
# Toy: one parameter w, one example (x, y) = (2, 6).
# Loss(w) = (w·x − y)² = quadratic in w, minimum at w* = y/x = 3.
def loss(w, x, y):
    return (w * x - y) ** 2
def grad(w, x, y):
    # d/dw [(wx − y)²] = 2x(wx − y).
    return 2 * x * (w * x - y)
x, y = 2, 6
loss(0.0, x, y)   # → 36 (way off)
loss(3.0, x, y)   # → 0 (perfect)
grad(0.0, x, y)   # → −24 (loss decreases as w grows)
grad(3.0, x, y)   # → 0 (no signal at the minimum)
How far? — the learning rate
The direction is settled. The size still isn’t. We multiply the gradient by a positive number η (the learning rate) and update:
w ← w − η · L'(w)
Different η shapes the trajectory. Try η = 0.04 in the widget: each step shrinks the distance to w* by a constant factor (about 0.68 here), so the iterates creep toward the minimum — stable, but slow.
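To see that constant factor in numbers, here is a minimal check (a sketch assuming the grad, x, y defined in the code above are still in scope; the five-step loop length is arbitrary):
# Each step multiplies the distance to w* = 3 by r = 1 − η·2x².
# With η = 0.04 and x = 2: r = 1 − 0.04·8 = 0.68.
eta = 0.04
w = 0.0
for _ in range(5):
    w_next = w - eta * grad(w, x, y)
    print(abs(w_next - 3) / abs(w - 3))   # → 0.68 every time
    w = w_next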
Iterate — the descent loop
The whole method is just step → step → step until either the gradient is small enough (we’re near a minimum) or a budget runs out:
for _ in range(steps): w = w − η · L'(w)
Five lines of code. That’s gradient descent. The widget runs exactly this loop when you press auto. Press it with η = 0.12 and watch w converge to 3. Press it with η = 0.27 and watch the iterates leave the screen.
# The descent loop, in five lines. The shape that scales to neural nets.
def descent(w0, lr, x, y, steps=20):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w, x, y)
    return w
descent(0.0, lr=0.04, x=x, y=y, steps=20)   # → 2.9987 slow but stable
descent(0.0, lr=0.12, x=x, y=y, steps=20)   # → 3.0000 fast, near-Newton
descent(0.0, lr=0.27, x=x, y=y, steps=20)   # → ≈ −55  diverged
# The single requirement for stability on a quadratic with second
# derivative c is η < 2/c. Here c = L''(w) = 2x² = 8, so η < 0.25.
# Beyond that, every "step" overshoots more than it corrects, and
# the iterates explode geometrically.
Diverge — when η goes too far
For our quadratic loss, write u = w − w*. The update rule becomes u ← (1 − η·L''(w*))·u = r·u — the distance to the minimum gets multiplied by r = 1 − η·L''(w*) at each step. Stable convergence requires |r| < 1, i.e. 0 < η < 2/L''(w*). For our specific loss, L''(w) = 2x² = 8, so any η < 0.25 converges and any η > 0.25 diverges.
Above η = 0.25, each step overshoots the minimum by more than it corrects. The signs of u alternate (we land on opposite sides) and the magnitude grows. The widget makes this concrete: at η = 0.27, the iterates ping-pong and balloon. The fix in real ML is never “trial and error” alone — it’s measuring local curvature (second-order methods, Adam-style adaptive rates) and using a schedule that shrinks η as the optimum gets closer.
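A quick sketch of the boundary, reusing the descent function from above (the three rates here were chosen only to straddle 0.25, and 40 steps is an arbitrary budget):
# r = 1 − 8η: |r| < 1 converges, |r| = 1 oscillates forever, |r| > 1 explodes.
for eta in (0.24, 0.25, 0.26):
    print(eta, descent(0.0, lr=eta, x=x, y=y, steps=40))
# 0.24 → ≈ 2.89  (r = −0.92: oscillates, but shrinks toward 3)
# 0.25 → 0.0     (r = −1: ping-pongs between 0 and 6 forever)
# 0.26 → ≈ −62   (r = −1.08: oscillates and grows without bound)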
Why this is everywhere in ML
Real ML doesn’t have one parameter; it has millions or billions. It doesn’t have one example; it has millions of them. It doesn’t have a clean parabolic loss; it has a high-dimensional, non-convex landscape full of saddles and plateaus. None of that changes the recipe — it just changes what each symbol stands for.
One parameter → vector. w becomes a parameter vector; the gradient becomes a vector of partial derivatives. Vector subtraction in place of the scalar update — the same two operations the vectors module walks through (add, scale).
One example → mini-batch. Sum (or average) the loss over a small random sample of examples each step. The gradient is now noisy — that’s stochastic gradient descent (SGD). The noise is a feature: it helps escape bad local geometry. (A runnable sketch of these first two substitutions follows this list.)
Clean parabola → real loss surface. Cross-entropy on a deep network is non-convex. There is no single optimum; we settle for “good enough” local minima, found by the same descent loop.
Hand-derived gradient → autograd. The chain rule is mechanical. Modern ML frameworks build a computation graph during the forward pass and walk it backward to produce the gradient automatically. The derivative we computed by hand for the toy is the same operation, scaled.
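Here is that minimal sketch of the first two substitutions, in NumPy; the synthetic data, batch size of 32, and rate of 0.05 are illustrative choices, not values from the widget:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # 1000 examples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=1000)    # noisy linear data

theta = np.zeros(3)                                 # parameter VECTOR, not a scalar
lr = 0.05
for step in range(500):
    idx = rng.integers(0, 1000, size=32)            # random mini-batch
    xb, yb = X[idx], Y[idx]
    g = 2 * xb.T @ (xb @ theta - yb) / len(idx)     # gradient of the batch's mean squared error
    theta = theta - lr * g                          # same update, more axes
print(theta)                                        # → close to [1.0, -2.0, 0.5]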
The bridge to Confidently Wrong: the squared loss here is the warm-up. The cross-entropy loss there has the same “compute gradient, take a step” geometry — just with −log p(correct class) in place of (w·x − y)². Same descent algorithm, different landscape.
# Same loop, real ML.
# Replace the toy parameter w with a parameter VECTOR θ.
# Replace the toy gradient with the partial derivatives ∂L/∂θᵢ.
# Replace the single example with a SUM (or mini-batch average) over data.
#
# Pseudocode for a one-layer linear classifier with cross-entropy loss
# (the loss from /ml/confident-wrong) — same descent, more axes:
#
# for batch in data_loader:
#     ŷ = softmax(W @ batch.x)         # forward
#     L = cross_entropy(ŷ, batch.y)    # loss
#     g = autograd.grad(L, W)          # ∇_W L via reverse-mode autodiff
#     W = W - lr * g                   # one step
#
# autograd ≠ a new idea; it's *bookkeeping for the chain rule* applied
# to the same gradient operation arc 3 derives by hand. Walk downhill,
# but with millions of axes and a clock.
Direction from the derivative. Distance from η. Repeat. The whole machinery of modern ML is one variable substitution away from this single line: w ← w − η · L'(w).
With x = 2, y = 6, and a starting w of your choice, compute the prediction w·x and the loss (w·x − y)².
Use the secant-to-tangent recipe from the derivatives module to derive L'(w) = 2x(w·x − y) from scratch. Then evaluate it at w = 0 with (x, y) = (2, 6).
Starting at w = 0 with η = 0.12 and gradient L'(w) = 2x(w·x − y), what is w after one step? After two? You should land near the optimum quickly.
For our quadratic loss, the update on the displacement u = w − w* is u ← (1 − 8η)·u. Show that |1 − 8η| > 1 gives |u| growing exponentially in the step count. At what η does the iteration sit at the boundary (constant amplitude, no growth, no decay)? What’s special about that η?