Lemma
math, backwards
the hook · which knob, how much

A model is wrong. Which knob should move, and by how much?

You have a parameter w and a way to measure how wrong the model is — a loss function L(w). Which direction should w move to make L smaller? How far? Answers: the derivative tells you the direction; a number called the learning rate tells you how far. Together they make one of the simplest, oldest, most-used recipes in numerical mathematics — and the engine of every modern ML system.

In the widget below, drag the starting w₀ and the η slider, then press step. Three preset rates: too small crawls down, good snaps to the minimum in a couple of steps, too big oscillates and explodes. The boundary between “good” and “explodes” is the entire stability theory of the method, in one screen. Gradient descent is the canonical example of optimization — the same recipe recurs in finance and calibration too.

Widget — Loss landscape
[interactive readout: current w, L(w), L'(w) = ∇L, step #, target w* = 3]
One parameter w, one example (x, y) = (2, 6), quadratic loss L(w) = (w·x − y)² with minimum at w* = 3. Hit step repeatedly with η = 0.04 and watch a slow descent. Switch to η = 0.12: nearly Newton-fast — almost one shot. Switch to η = 0.27 and click auto: the iterates oscillate and grow without bound. The boundary η < 2/L''(w*) (here 2/8 = 0.25) is the entire stability theory of the method, distilled into one widget.
the arc
1

Wrongness as a number

Take the simplest possible model: one parameter w, one input x, prediction ŷ = w·x. We want ŷ ≈ y for the true value y. “Closeness” needs a number; the standard choice is squared error L(w) = (w·x − y)² — zero when the prediction is exact, large when it’s far, and (importantly) smooth everywhere so we can take derivatives.

The widget fixes (x, y) = (2, 6), so the loss is L(w) = (2w − 6)² = 4(w − 3)² — a parabola with minimum at w* = 3. The whole problem reduces to: how do we get from any starting w₀ to 3?
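The algebra can be spot-checked in a couple of lines of Python (the helper names are mine, for illustration only):

```python
def loss_expanded(w):
    return (2 * w - 6) ** 2          # (w·x − y)² with x = 2, y = 6

def loss_factored(w):
    return 4 * (w - 3) ** 2          # the same parabola, factored

# Both forms agree everywhere, and both vanish only at w = 3.
for w in (-1.0, 0.0, 1.5, 3.0, 10.0):
    assert loss_expanded(w) == loss_factored(w)

loss_expanded(0.0)   # → 36.0  (the widget's starting readout)
loss_expanded(3.0)   # → 0.0   (the minimum)
```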

2

Which direction? — the derivative

The derivatives module gives the answer in one line: L'(w) = 2x(wx − y) — the slope of the loss curve at the current w. If L'(w) > 0, L is rising as w grows; we should make w smaller. If L'(w) < 0, the opposite. The descent direction is always −L'(w).

In the widget, the green arrow at the current point shows that direction. As w approaches the minimum, the arrow shrinks; right at w* it has zero length — there’s no signal left, because the gradient is exactly zero at the minimum.

# Toy: one parameter w, one example (x, y) = (2, 6).
# Loss(w) = (w·x − y)²   = quadratic in w, minimum at w* = y/x = 3.
def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    # d/dw [(wx − y)²] = 2x(wx − y).
    return 2 * x * (w * x - y)

x, y = 2, 6
loss(0.0, x, y)        # → 36     (way off)
loss(3.0, x, y)        # → 0      (perfect)
grad(0.0, x, y)        # → −24    (loss decreases as w grows)
grad(3.0, x, y)        # → 0      (no signal at the minimum)
3

How far? — the learning rate

The direction is settled. The size still isn’t. We multiply the gradient by a positive number η (the learning rate) and subtract:

w  ←  w  −  η · L'(w)

The choice of η shapes the trajectory. Try η = 0.04 in the widget: each step shrinks the distance to w* by a constant factor (about 0.68 here), so convergence is geometric but slow. η = 0.12 is near the Newton-optimal η = 1/L''(w*) = 1/8 = 0.125 — convergence in two or three steps. η = 0.27 is past the boundary; we’ll see what happens to it next.
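Those contraction factors are easy to verify numerically. A minimal sketch on the same toy loss (the function names are mine, not the widget’s):

```python
def step(w, eta):
    return w - eta * 8 * (w - 3)     # L'(w) = 2x(wx − y) = 8(w − 3) here

def contraction(eta, w0=0.0):
    # How much one step shrinks (or grows) the distance to w* = 3.
    w1 = step(w0, eta)
    return abs(w1 - 3) / abs(w0 - 3)

contraction(0.04)    # ≈ 0.68  slow geometric shrink
contraction(0.12)    # ≈ 0.04  near Newton-optimal
contraction(0.125)   # = 0.0   exact one-shot (η = 1/L'')
contraction(0.27)    # ≈ 1.16  each step lands FARTHER away
```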

4

Iterate — the descent loop

The whole method is just step → step → step until either the gradient is small enough (we’re near a minimum) or a budget runs out:

for _ in range(steps):
  w = w − η · L'(w)

Five lines of code. That’s gradient descent. The widget runs exactly this loop when you press auto. Press it with η = 0.12 and watch w converge to 3. Press it with η = 0.27 and watch the iterates leave the screen.

# The descent loop, in five lines. The shape that scales to neural nets.
def descent(w0, lr, x, y, steps=20):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w, x, y)
    return w

descent(0.0, lr=0.04, x=x, y=y, steps=20)   # → ≈2.999  slow but stable
descent(0.0, lr=0.12, x=x, y=y, steps=20)   # → 3.00    fast, near-Newton
descent(0.0, lr=0.27, x=x, y=y, steps=20)   # → ≈ −55   diverged, still growing

# The single requirement for stability on a quadratic with second
# derivative c is η < 2/c. Here c = L''(w) = 2x² = 8, so η < 0.25.
# Beyond that, every "step" overshoots more than it corrects, and
# the iterates explode geometrically.
5

Diverge — when η goes too far

For our quadratic loss, write u = w − w*. The update rule becomes u ← (1 − η · L''(w*)) · u — the distance to the minimum gets multiplied by r = 1 − η · L''(w*) at each step. Stable convergence requires |r| < 1, i.e. 0 < η < 2/L''(w*). For our specific loss, L''(w) = 8, so any η < 0.25 converges and any η > 0.25 diverges.
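The geometric picture can be checked directly: the descent loop and the closed form uₙ = rⁿ · u₀ should agree for any rate (a sketch under the same toy setup; the names are mine):

```python
def iterate(w0, eta, n):
    w = w0
    for _ in range(n):
        w = w - eta * 8 * (w - 3)    # the descent update; L'(w) = 8(w − 3)
    return w

def closed_form(w0, eta, n):
    r = 1 - 8 * eta                  # per-step multiplier on u = w − w*
    return 3 + (r ** n) * (w0 - 3)

for eta in (0.04, 0.12, 0.27):
    for n in (1, 5, 20):
        assert abs(iterate(0.0, eta, n) - closed_form(0.0, eta, n)) < 1e-6

# |r| < 1 at η = 0.04 and 0.12, so u shrinks; |r| = 1.16 at η = 0.27, so u explodes.
```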

Above 0.25, each step overshoots the minimum by more than it corrects. The signs of u alternate (we land on opposite sides) and the magnitude grows. The widget makes this picture concrete: at η = 0.27, the iterates ping-pong and balloon. The fix in real ML is rarely trial and error alone — it’s measuring local curvature (second-order methods, Adam-style adaptive rates) and using a schedule that shrinks η as the optimum gets closer.
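As one illustration of the schedule idea (the 1/(1 + t) decay below is a common textbook choice, not something the widget implements): a rate that starts above the 0.25 boundary can still converge if it shrinks over time.

```python
def descent_scheduled(w0, eta0, steps):
    w = w0
    for t in range(steps):
        eta = eta0 / (1 + t)             # decaying learning rate
        w = w - eta * 8 * (w - 3)        # L'(w) = 8(w − 3)
    return w

# η starts at 0.27 — unstable if held constant — but decays below 0.25
# by t = 1, and the iterates settle onto w* = 3 anyway.
descent_scheduled(0.0, 0.27, 50)         # → close to 3
```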

6

Why this is everywhere in ML

Real ML doesn’t have one parameter; it has millions or billions. It doesn’t have one example; it has millions of them. It doesn’t have a clean parabolic loss; it has a high-dimensional, non-convex landscape full of saddles and plateaus. None of that changes the recipe — it just changes what each symbol stands for.

  • One parameter → vector. w becomes a parameter vector θ; the gradient becomes a vector of partial derivatives. Vector subtraction takes the place of the scalar update — the same two operations the vectors module walks through (add, scale).

  • One example → mini-batch. Sum (or average) the loss over a small random sample of examples each step. The gradient is now noisy — that’s SGD, stochastic gradient descent. The noise is a feature: it helps escape bad local geometry.

  • Clean parabola → real loss surface. Cross-entropy on a deep network is non-convex. There is no single optimum; we settle for a “good enough” local minimum, found by the same descent loop.

  • Hand-derived gradient → autograd. The chain rule is mechanical. Modern ML frameworks build a computation graph during the forward pass and walk it backward to produce ∇L automatically. The derivative we computed by hand for the toy is the same operation, scaled.
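Here is a minimal sketch of the “one example → mini-batch” jump, still with a single scalar parameter. The dataset, batch size, and learning rate below are made up for illustration; real training runs the same loop over vectors and tensors.

```python
import random

random.seed(0)
# Twenty (x, y) pairs, all drawn from the same true rule y = 3x plus a
# little noise — the scalar analogue of a training set.
data = [(0.5 * i, 3 * (0.5 * i) + random.gauss(0, 0.1)) for i in range(1, 21)]

def batch_grad(w, batch):
    # Average of per-example gradients 2x(wx − y) over the mini-batch.
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

w, lr = 0.0, 0.01
for _ in range(200):
    batch = random.sample(data, 4)       # random subsample → noisy gradient
    w = w - lr * batch_grad(w, batch)    # the S in SGD

# w ends near 3, wobbling slightly: each batch sees different noise.
```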

The bridge to Confidently Wrong: the squared loss here is the warm-up. The cross-entropy loss there has the same “compute gradient, take a step” geometry — just with −log p_true instead of (ŷ − y)². Same descent algorithm, different landscape.

# Same loop, real ML.
# Replace the toy parameter w with a parameter VECTOR θ.
# Replace the toy gradient with the partial derivatives ∂L/∂θᵢ.
# Replace the single example with a SUM (or mini-batch average) over data.
#
# Pseudocode for a one-layer linear classifier with cross-entropy loss
# (the loss from /ml/confident-wrong) — same descent, more axes:
#
# for batch in data_loader:
#     ŷ = softmax(W @ batch.x)            # forward
#     L = cross_entropy(ŷ, batch.y)       # loss
#     g = autograd.grad(L, W)             # ∇_W L  via reverse-mode autodiff
#     W = W - lr * g                       # one step
#
# autograd ≠ a new idea; it's *bookkeeping for the chain rule* applied
# to the same gradient operation arc 3 derives by hand. Walk downhill,
# but with millions of axes and a clock.

Direction from the derivative. Distance from η. Repeat. The whole machinery of modern ML is one variable substitution away from this single line: w ← w − η · L'(w).

exercises · solve by hand
1 · compute the loss · no calculator

With x = 2, y = 6, and w = 1, compute the prediction ŷ and the loss L(w) = (ŷ − y)².

2 · gradient by hand · no calculator

Use the secant-to-tangent recipe from the derivatives module to derive L'(w) = 2x(wx − y) from scratch. Then evaluate at w = 1 with (x, y) = (2, 6).

3 · one step by hand · no calculator

Starting at w = 1 with η = 0.12 and gradient −16, what is w after one step? After two? You should land near the optimum quickly.

4 · why too-large η explodes

For our quadratic loss, the update on the displacement u = w − w* is u ← (1 − 8η) · u. Show that η = 0.27 gives |uₙ| → ∞ exponentially in n. At what η does the iteration sit at the boundary (constant amplitude, no growth, no decay)? What’s special about that η?

glossary · used on this page · 6
loss function·손실 함수
A function that turns "the model is wrong" into a single number. Given the model's predictions `ŷ` and the true labels `y`, the loss is `L(ŷ, y)` — small when the predictions are close, large when they're far. Squared error `(ŷ − y)²` is the standard choice for regression because it's smooth and its derivative is easy. Cross-entropy is the standard for classification. Training a model means changing its parameters until the loss, summed over the dataset, gets as small as possible.
derivative·미분
The _instantaneous_ rate of change of a function at a point — defined as the limit of secant slopes as the interval between the two sample points shrinks to zero: `f'(a) = lim_{h→0} (f(a+h) − f(a)) / h`. The derivative of position is velocity; of velocity, acceleration; of mass-with-respect-to-time, mass flow. Geometrically, the slope of the tangent line. Algebraically, the operation that turns `x²` into `2x` and `sin x` into `cos x`. Almost every quantity called a _rate_ anywhere in physics, ML, and engineering is some derivative.
learning rate·학습률
The step size used in gradient descent: at each iteration, the parameter `w` updates as `w ← w − η · ∇L(w)`, where `η` (eta) is the learning rate. Too small and convergence crawls; too large and the iterates overshoot the minimum and may diverge. The "Goldilocks" range depends on the curvature of the loss — for a quadratic loss with second derivative `c`, the upper bound for stable convergence is `η < 2/c`. Real ML training uses schedules (decreasing `η` over time) and adaptive rules (per-parameter `η`) — both born from the same observation: one constant rarely works.
convergence·수렴
The state where further iterations no longer change the objective function meaningfully. Concretely: the change `|f(xₙ₊₁) − f(xₙ)|` falls below some tolerance, or the gradient magnitude `‖∇f‖` falls below it, or the iterate `xₙ` itself stops moving. Convergence is the _stopping condition_ — without one, an optimizer would run forever. It is also where the danger lives: convergence to a local minimum looks identical to convergence to a global one; both look like "no more improvement nearby." Whether the final answer is the _right_ answer depends on the problem's geometry (convex vs. not), the starting point, the step-size schedule, and luck. _Stopped_ and _solved_ are two different things.
vector·벡터
A quantity that has both magnitude and direction. Concretely a tuple of numbers — `(3, 4)` in 2D, `(1, 0, −2)` in 3D — but the _meaning_ of those numbers depends on what you're doing. In graphics they describe a control-point offset; in physics, a velocity or force; in ML, a parameter update or a feature representation. The tuple is the same; the _role_ changes. The arithmetic — addition component-wise, scaling by a number — is the same in every role, which is why one set of math serves all of them.
local minimum·국소 최솟값
A point where the objective function is no worse than every nearby point — but not necessarily the lowest point anywhere. A function with a single bowl shape (a _convex_ function) has only one minimum; finding it solves the optimization completely. A function with multiple valleys has multiple local minima; descending from a starting point only guarantees landing in _whichever valley you happened to start near_, not the deepest one (the _global minimum_). Most real-world optimization problems are non-convex — neural network training, portfolio selection with constraints, physics with non-quadratic potentials — and the math of "is this the best?" usually collapses to "does anywhere nearby look better?" The honest answer is: not always.