Lemma
math, backwards
the hook · when a formula will not answer

A function tells you the height. How do you know which way to go down?

A function says how good your current choice is — a single number. It does not tell you the best choice. For simple functions you can write down the answer; for almost everything else, you cannot. There is no formula for the temperature that calibrates this network, or the portfolio weights that minimize this variance under these constraints, or the model parameters that fit this dataset. Yet all three problems get solved, every day, by the same five moves: name the objective, name the search space, compare nearby choices, choose how far to step, and decide when to stop.

Optimization is not finding the formula. It is choosing a quantity to improve, then moving through possible choices until improvement stops. This module is about the moves. Different algorithms do them differently; the bones are identical.

tool spec
what

Given an objective f over a search space, find a point x* that makes f as small (or large) as possible. The standard machinery: start somewhere, take steps in a direction that improves f, control how far each step goes (the step size), and stop when no nearby move improves f — that is convergence. The point where you stop is a local minimum; whether it is also the global minimum depends on the geometry of f and where you started.

applies when

Any time you have a choice and a number that says how good the choice is. Fitting model parameters in ML, picking portfolio weights in finance, tuning a temperature scalar in calibration, finding equilibrium points in physics, designing a cheaper part in engineering. The same five-step loop runs all of them; what differs is the objective, the space, and the move rule.

breaks when

Three places the simple loop fails. (1) The objective has many local minima and descent finds a bad one. (2) The search space has constraints (probabilities sum to 1, weights are non-negative, parameters live on a manifold) and naive steps leave it. (3) The objective is cheap to evaluate but its gradient is not, so deciding the direction is the costly step. Each of these has its own family of techniques; this module is the base case they all extend.
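
Failure (2) in miniature — a sketch of ours, not from any page. A plain gradient step on portfolio weights can leave the feasible set; a repair step pulls it back. The clip-and-renormalize repair below is a simple heuristic, not the exact Euclidean projection onto the simplex, and the gradient is invented for illustration.

import numpy as np

# Weights must be non-negative and sum to 1. A naive step breaks both.
w = np.array([0.7, 0.2, 0.1])
g = np.array([1.0, -0.2, -0.2])      # made-up gradient, for illustration only
w_naive = w - 1.0 * g                # → [-0.3, 0.4, 0.3]: negative, sums to 0.4
w_fixed = np.clip(w_naive, 0, None)  # repair 1: clip away negative weights
w_fixed = w_fixed / w_fixed.sum()    # repair 2: renormalize to sum to 1
w_naive.sum(), w_fixed
# → ≈0.4 and array ≈ [0, 0.571, 0.429] — back on the simplex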

Widget — Objective explorer
One input x, one objective f(x) you are trying to minimize. The dashed vertical line marks the *true* minimum (if it exists in the window). Click step to apply one update x ← x − α · f'(x): move in the direction the slope says is better, by α. The orange tangent shows that direction at your current x. Try convex with a small α — steady progress to the bottom. Try plateau — the slope is near zero, so each step barely moves. Try two valleys from x = −2.4 with small α — you fall into the *left* valley, a local minimum, and stop short of the deeper right one. Try noisy — you stop at the first ripple your step size can't escape. *Same machine, four outcomes.*
the arc
1

Objective — what are we trying to improve?

Every optimization problem starts with one number. Smaller is better, or bigger is better, but only one number gets that vote. In machine learning that number is the loss function — sum of errors over a dataset. In finance it can be portfolio variance, expected shortfall, or the negative of risk-adjusted return. In physics it is often potential energy, and “equilibrium” is the word physicists use for “the place where the objective stops decreasing.” In calibration it is negative log-likelihood. Different fields, different words, identical role: one quantity that determines whether a candidate is preferable to its neighbours.

“Minimize” and “maximize” are the same problem twice — maximizing f is minimizing −f. By convention this module talks about minimization; everything carries over to maximization by flipping the sign. The objective is what locks the problem in place. Without one, “make it better” has no agreed direction. With one, every candidate has a number attached and the comparison becomes mechanical.
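
A brute-force check of the sign-flip claim — our toy function, not from any page. Over a dense grid, the x that maximizes a reward is exactly the x that minimizes its negation.

import numpy as np

xs = np.linspace(-5, 5, 10001)           # a dense 1-D grid
reward = lambda x: 5 - (x - 2) ** 2      # toy objective, peak at x = 2
x_by_max = xs[np.argmax(reward(xs))]     # "bigger is better"
x_by_min = xs[np.argmin(-reward(xs))]    # "smaller is better", sign flipped
x_by_max, x_by_min
# → both ≈ 2.0 — the same point wins either way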

2

Search space — what choices are allowed?

The objective alone is not enough; you also need to say what can vary. The set of all allowed choices is the search space. For ML it is the high-dimensional vector of model parameters. For a portfolio it is the vector of weights, often constrained to be non-negative and sum to 1. For calibration it can be a single scalar (one temperature value) or a small vector (per-class temperatures). For physics, it is the set of positions or configurations consistent with the constraints.

Whether the search space is finite (pick one of 100 hyperparameter combinations), discrete and combinatorial (assign tasks to workers), or continuous (anywhere in ℝⁿ) changes which algorithms apply. This module focuses on the continuous case, because that is where the derivative — the move rule that makes the rest of Lemma tick — has the most to say. Discrete optimization (integer programming, scheduling, the travelling salesman) exists; it uses different math.
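
When the space is finite, the move rule can simply be “try everything.” A sketch with invented numbers: 100 candidate values of a regularization strength, scored by a made-up validation loss.

import numpy as np

candidates = np.linspace(0.01, 1.0, 100)            # the whole search space
val_loss = lambda c: (np.log10(c) + 1) ** 2 + 0.3   # pretend score, best near c = 0.1
best = min(candidates, key=val_loss)
round(float(best), 2)
# → 0.1 — no derivative needed, because 100 evaluations cover the space.
# In ℝⁿ there is no "try everything"; that is what the derivative is for.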

3

Move — compare nearby choices

Given the current point x, the optimizer must decide which way to go. The simplest, most general answer: evaluate f at x and at points near x, then move toward the smaller value. When f is differentiable, “near” collapses into a single object — the derivative, or in higher dimensions the gradient — which says, to first order, exactly which direction makes f smaller fastest. Movement becomes one operation: x ← x − α · ∇f(x).

The widget makes this concrete. The orange tangent at the current x shows the slope f'(x); pressing step moves x to the left if the slope is positive (function is increasing → go the other way) or to the right if negative. The size of the move is α · f'(x) — proportional to the slope itself. Where the slope is steep the step is big; where the slope is shallow the step is small. The optimizer slows down on its own as it approaches a minimum.

This is not the only move rule. Newton’s method uses the second derivative to choose the step in one shot. Coordinate descent moves one variable at a time. Stochastic gradient descent uses a noisy approximation of the gradient. All of them are answers to the same question: which way is better, and how far?

import numpy as np

# A general "improve until you can't" loop. Given an objective f and its
# derivative f', start somewhere, take steps in the direction f' says is
# downhill, stop when f' is essentially zero (or you run out of budget).
def descend(f, df, x0, alpha=0.1, tol=1e-4, max_iters=500):
    x = x0
    history = [x]
    for _ in range(max_iters):
        g = df(x)
        if abs(g) < tol:
            break
        x = x - alpha * g
        history.append(x)
    return x, history

# Convex bowl: f(x) = (x - 1)², f'(x) = 2(x - 1). Unique min at x = 1.
f  = lambda x: (x - 1)**2
df = lambda x: 2 * (x - 1)

x_star, hist = descend(f, df, x0=-2.4, alpha=0.3)
(round(x_star, 4), len(hist), round(f(x_star), 6))
# → (1.0, 14, 0.0)
# 13 steps from x = -2.4 to x = 1.0. Each step uses f'(x) — the
# "which way is better" question, answered locally.
4

Step size — moving too little, too much

The step size α is the second knob, and it has at least as much practical impact as the direction. Try the widget: with the convex bowl preset and small α, the position creeps left-to-right toward the minimum — many tiny correct steps. Increase α — the descent gets faster, then too fast: each step overshoots, and the trajectory oscillates back and forth. Push α further and the position blows up entirely, walking off the screen with each iteration amplifying the previous error.

For a quadratic f with second derivative c, the stability bound is sharp: α < 2/c converges, α > 2/c diverges. For non-quadratic f the bound moves as you do — the local curvature at one point differs from the curvature elsewhere. Real algorithms either fix α conservatively, schedule it (start big, decrease over time), or compute it adaptively each step (line search, trust regions). All three are answers to the same fact: no single constant works for every geometry.

In ML, α has a special name — the learning rate — because tuning it dominates training outcomes. In the rest of optimization it is just α. Same number, same role; different communities give it different folklore.

# Step size α decides whether descent crawls, lands, or diverges.
# Same convex bowl, three different α.
for alpha in (0.005, 0.3, 1.05):
    x_star, hist = descend(f, df, x0=-2.4, alpha=alpha, max_iters=300)
    print(f"α = {alpha:5.3f}  →  iters = {len(hist):4d}  final x = {x_star:.4g}")
# α = 0.005  →  iters =  301  final x = 0.8333      (crawls, hit max_iters)
# α = 0.300  →  iters =   14  final x = 1           (lands)
# α = 1.050  →  iters =  301  final x ≈ -8.9e+12    (overshoots, diverges)
#
# For this quadratic, the curvature is f''(x) = 2, and the stability
# bound is α < 2 / f''(x) = 1. α = 1.05 sits past the edge, so each step
# amplifies the previous error. The right α is geometry-dependent.

# Adaptive trick: shrink α whenever a step *increases* f. Backtracking
# line search is one widely-used version.
def descend_backtrack(f, df, x0, alpha0=1.0, tol=1e-4, max_iters=500):
    x = x0
    alpha = alpha0
    for _ in range(max_iters):
        g = df(x)
        if abs(g) < tol:
            break
        # Halve α until the step actually improves f.
        while f(x - alpha * g) > f(x):
            alpha *= 0.5
        x = x - alpha * g
        alpha = min(alpha * 1.5, alpha0)  # gently grow back
    return x

round(descend_backtrack(f, df, x0=-2.4, alpha0=2.0), 4)
# → 1.0   converges even with a deliberately too-big starting α
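
The other common treatment from above — scheduling α — in sketch form. The decay numbers are our choice, nothing canonical.

def descend_scheduled(f, df, x0, alpha0=0.9, decay=0.99, tol=1e-4, max_iters=500):
    x = x0
    for t in range(max_iters):
        g = df(x)
        if abs(g) < tol:
            break
        x = x - (alpha0 * decay ** t) * g   # α shrinks geometrically with t
    return x

round(descend_scheduled(f, df, x0=-2.4), 4)
# → 1.0 — big early steps cover distance, small late steps refine.
# Unlike backtracking, the schedule never looks at f; it just trusts that
# shrinking α eventually suits whatever geometry it meets.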
5

Stopping — local minimum and convergence

The simplest stopping rule: stop when the move would not change f much. Concretely, when the gradient magnitude ‖∇f‖ falls below a small tolerance, or when the change |f(xₙ₊₁) − f(xₙ)| does, or when the iterate xₙ stops moving. Without a rule like this, the loop runs forever. With one, something stops the optimizer — but the stopping point is only guaranteed to be a local minimum.
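
The three tests, written out as predicates — a sketch of ours; real optimizers usually combine two or more, with an iteration budget on top.

# Three interchangeable exit conditions for the descent loop.
def grad_small(g, tol=1e-4):                    # ‖∇f‖ ≈ 0: nothing nearby is better
    return abs(g) < tol
def objective_flat(f_prev, f_next, tol=1e-9):   # f stopped improving
    return abs(f_next - f_prev) < tol
def iterate_stuck(x_prev, x_next, tol=1e-7):    # x stopped moving
    return abs(x_next - x_prev) < tol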

This is the most important caveat the module carries. A convex objective (a single bowl, no other dips) has exactly one minimum, and any descent reaches it. Almost nothing in the real world is convex. Neural network loss surfaces are spectacularly non-convex; portfolio problems with constraints become non-convex; calibration loss is convex but its higher-dimensional cousins are not. The widget’s two valleys preset shows the problem in one dimension: starting from x = −2.4, descent lands in the left valley — a local minimum — and stops, oblivious to the deeper right valley.

The honest position: stopped and solved are different things. Local minima look identical to global ones from the inside (no nearby move helps). Strategies that try to escape — random restarts, momentum, stochastic noise, simulated annealing — all add some way to avoid stopping at the first place the gradient vanishes.
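
One escape strategy from that list, sketched: restarts (here hand-picked rather than random) on a two-valley objective of our own making, echoing the widget's preset.

# Two valleys: the right one (near x ≈ +1.07) is deeper than the left
# (near x ≈ −0.91). Descent from the left basin stops in the left valley.
f2  = lambda x: 0.5 * x**4 - x**2 - 0.3 * x
df2 = lambda x: 2 * x**3 - 2 * x - 0.3

starts   = [-2.4, -0.5, 0.5, 2.4]
finishes = [descend(f2, df2, x0, alpha=0.05)[0] for x0 in starts]
best = min(finishes, key=f2)
[round(x, 2) for x in finishes], round(best, 2)
# → ([-0.91, -0.91, 1.07, 1.07], 1.07) — two starts fall into the shallow
# valley, two into the deep one; keeping the best f recovers the global
# minimum. More restarts buy more coverage; none of them *guarantee* it.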

6

Where this shows up — same five steps, three pillars

Optimization is the math behind every “we don’t have a formula, but we have a procedure” page in Lemma. Three applications already speak this language; they share the same skeleton, with different choices for what to minimize, where, and how.

ml      · gradient descent : loss over training data; parameters in ℝⁿ;
          gradient descent with a learning rate.
ml      · calibration      : negative log-likelihood; a single temperature scalar;
          one-dimensional descent.
finance · portfolio risk   : portfolio variance; weights summing to 1;
          minimize a quadratic, sometimes closed-form, sometimes iterative.

Gradient descent — the canonical example. Objective = a regression loss summed over a dataset. Search space = the model’s parameters. Move = the gradient, computed via backpropagation. Step size = the learning rate. Stopping = the loss plateaus or you run out of training budget. Every detail is one of the five steps with a domain-specific name.

Model calibration — temperature scaling is optimization with a one-dimensional search space. Objective = negative log-likelihood on a held-out set. Search space = T ∈ (0, ∞). Move = the derivative of NLL with respect to T. Step size = small enough that a single scalar fit converges in seconds. The page solves the problem by descent; this module is why that descent works.

Portfolio risk — finding the minimum-variance weights is optimization with a closed-form answer for two assets, and an iterative answer in higher dimensions. Objective = portfolio variance. Search space = weights w summing to 1. Move = the derivative of variance with respect to weight. The arc-5 derivation on that page is exactly step three in closed form — set the gradient to zero and solve. When closed form exists, optimization collapses to algebra; when it does not, the same machinery iterates instead.

Same five-step skeleton, different bones. Naming the skeleton is what makes it portable.

# Where this shows up — three real optimization stories share the same loop.

# (1) ML: gradient descent on a regression loss.
# Data y = 3·x; fit w to minimize L(w) = Σ (w·x_i − y_i)².
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 3.0 * X
def L(w):  return float(((w * X - Y) ** 2).sum())
def dL(w): return float((2 * (w * X - Y) * X).sum())
w_star, _ = descend(L, dL, x0=0.0, alpha=0.02)
round(w_star, 4)
# → 3.0   the slope of Y = 3x, recovered by descent

# (2) Calibration: fit one scalar T (temperature) to minimize NLL.
# Logits z, true labels y; we pick T to make softmax(z/T) match observed
# correctness frequencies. One parameter, well-behaved objective.
def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
np.random.seed(0)
z_logits = np.random.randn(200, 5) * 3.0       # overconfident logits
labels   = np.random.randint(0, 5, size=200)
def nll(T): return float(-np.log(softmax(z_logits / T)[range(200), labels] + 1e-12).mean())
T_star = 1.0
for _ in range(80):
    g = (nll(T_star + 1e-3) - nll(T_star - 1e-3)) / 2e-3  # numerical gradient
    T_star -= 0.1 * g
round(T_star, 3), round(nll(T_star), 3)
# → (>1.0, <nll(1.0))   T > 1 softens overconfident predictions

# (3) Portfolio risk: minimize variance over weight w ∈ [0, 1].
# Two assets, σ_A = 0.2, σ_B = 0.1, ρ = 0.0 (independent).
def var_p(w):  return float(w**2 * 0.04 + (1 - w)**2 * 0.01)
def dvar_p(w): return float(2 * w * 0.04 - 2 * (1 - w) * 0.01)
w_star, _ = descend(var_p, dvar_p, x0=0.5, alpha=2.0)
round(w_star, 2)
# → 0.2    20% in the volatile asset minimizes variance. The closed-form
#          answer w* = σ_B² / (σ_A² + σ_B²) = 0.01 / 0.05 = 0.2. Match.
#
# Three pillars, one loop: objective + derivative + step + stopping.
# The science is in the choices (which f? which start? which α?). The
# *machinery* is identical.

Optimization is choosing a quantity to improve, then moving through possible choices until improvement stops. Five moves: objective, search space, move rule, step size, stopping. Almost everything called fitting, tuning, minimizing, or finding equilibrium in the rest of Lemma is some version of this loop.

exercises · 손으로 풀기
1 · recognize the five steps

Pick one of the three retro-fitted applications (gradient descent, model calibration, or portfolio risk). Name what that page uses as the objective, the search space, the move rule, the step size, and the stopping condition.

2 · step-size stability by hand · no calculator

Take f(x) = (x − 3)². Compute f'(x). Starting at x₀ = 0, write the first three iterates x₁, x₂, x₃ for (a) α = 0.1, (b) α = 0.5, (c) α = 1.1. State which one converges, which one stays steady, and which one diverges.

3 · convex vs not — local vs global

In the widget, switch to two valleys. Reset and use the position slider to start at x = −2.4, then take steps until you stop moving. Record where you stopped and the value of f there. Now restart at x = +2.4 and do the same. Are the two stopping points the same? What does this say about the descent rule?

4 · maximize vs minimize · no calculator

A trader wants to maximize expected return μ(w) over a portfolio weight w. Rewrite this as a minimization problem in the form this module uses. What is the objective function? What is its derivative in terms of μ'?

glossary · used on this page · 10
objective function·목적 함수
The single number an optimization process tries to make as good as possible. _Good_ means small for a _loss_ (training a model, minimizing risk, fitting a curve) or large for a _reward_ (maximizing return, maximizing entropy under constraints, maximizing log-likelihood) — but the two cases are interchangeable: maximizing `f` is the same problem as minimizing `−f`. The objective locks an optimization problem in place. Without one, "make it better" has no agreed direction; with one, every candidate choice has a single number attached and the comparison becomes mechanical. Loss functions in ML, risk in portfolios, energy in physics, posterior likelihood in inference — all are objective functions wearing different clothes.
step size·스텝 크기
How far each iteration moves through search space. Concretely, given a candidate `x` and a direction of improvement `d`, the next candidate is `x + α · d`, where `α` is the step size. Too small and the search creeps; too large and it overshoots the goal and may oscillate or diverge. The "right" step depends on the local geometry — a steep narrow valley wants a small step, a wide flat plateau wants a large one. In gradient descent the step size has a special name (the _learning rate_); the rest of optimization just calls it `α`. Modern algorithms either fix `α`, schedule it (decrease over time), or compute it adaptively (line search, trust regions) — three names for "one constant rarely works."
convergence·수렴
The state where further iterations no longer change the objective function meaningfully. Concretely: the change `|f(xₙ₊₁) − f(xₙ)|` falls below some tolerance, or the gradient magnitude `‖∇f‖` falls below it, or the iterate `xₙ` itself stops moving. Convergence is the _stopping condition_ — without one, an optimizer would run forever. It is also where the danger lives: convergence to a local minimum looks identical to convergence to a global one; both look like "no more improvement nearby." Whether the final answer is the _right_ answer depends on the problem's geometry (convex vs. not), the starting point, the step-size schedule, and luck. _Stopped_ and _solved_ are two different things.
local minimum·국소 최솟값
A point where the objective function is no worse than every nearby point — but not necessarily the lowest point anywhere. A function with a single bowl shape (a _convex_ function) has only one minimum; finding it solves the optimization completely. A function with multiple valleys has multiple local minima; descending from a starting point only guarantees landing in _whichever valley you happened to start near_, not the deepest one (the _global minimum_). Most real-world optimization problems are non-convex — neural network training, portfolio selection with constraints, physics with non-quadratic potentials — and the math of "is this the best?" usually collapses to "does anywhere nearby look better?" The honest answer is: not always.
calibration·캘리브레이션
The match between a model's predicted probabilities and actual frequencies. A model is _calibrated_ if, on examples it labels "70% confident," it is correct close to 70% of the time. Calibration is plotted as a _reliability diagram_: predicted probability on the x-axis, observed accuracy on the y-axis; perfect calibration is the diagonal `y = x`. Most modern neural networks are _overconfident_ — their high-probability predictions are correct less often than they claim — and the standard fix is _temperature scaling_, a single dial that pulls the curve back toward the diagonal. Calibration is distinct from accuracy: a perfectly accurate classifier on a tiny dataset can be wildly miscalibrated, and a calibrated model can be wrong about every individual prediction as long as the _frequencies_ line up.
equilibrium·평형
A state where opposing influences cancel exactly, so the system stops changing. _Mechanical_ equilibrium: net force is zero, so acceleration is zero (Newton's first law). _Chemical_ equilibrium: forward and reverse reaction rates match, so concentrations are stable. _Economic_ equilibrium: supply and demand price match, so the market price stops drifting. The shared shape across all of them: a quantity has two competing influences that depend on the quantity itself, and the _fixed point_ where they balance is the equilibrium. Terminal velocity is the equilibrium of `dv/dt = g − kv` — the velocity at which "gravity adds to v" and "drag removes from v" cancel.
gradient·그래디언트
The slope of a function in many directions at once. For a function f(x, y, z, ...), the gradient is the vector of partial derivatives — it points in the direction of steepest ascent. In machine learning, this vector tells the optimizer which way to step the parameters to reduce loss.
⚠ In ML: the vector of partial derivatives of the loss with respect to every parameter. Optimization moves opposite the gradient ("gradient descent"). If a value used in the gradient becomes 0 from underflow, the entire chain collapses — that's why log-space matters.
variance·분산
The expected square-distance of a random quantity from its mean. For returns `r` with mean `μ`, `Var(r) = E[(r − μ)²]`; the units are the _square_ of whatever you started with, which is why people usually quote the square root, _standard deviation_, instead. Variance is the workhorse measure of _risk_ in finance: not "how much could I lose worst-case?" but "how much does the outcome jitter from the average?". Two assets with the same mean return can have wildly different variances, and that gap is the entire reason a portfolio is more than the sum of its parts.
negative log-likelihood·음의 로그우도
Abbreviated NLL. Given a probabilistic model and observed data, the _likelihood_ is the probability the model assigns to the data; the _log-likelihood_ is its logarithm; the _negative_ log-likelihood flips the sign so that minimization makes sense. For a single classification example with true class `y` and predicted distribution `p`, NLL is `−log p_y` — identical to cross-entropy in the one-hot case. The `−log p` form makes independent observations _add_, and makes confident-but-wrong predictions diverge to infinity.
vector·벡터
A quantity that has both magnitude and direction. Concretely a tuple of numbers — `(3, 4)` in 2D, `(1, 0, −2)` in 3D — but the _meaning_ of those numbers depends on what you're doing. In graphics they describe a control-point offset; in physics, a velocity or force; in ML, a parameter update or a feature representation. The tuple is the same; the _role_ changes. The arithmetic — addition component-wise, scaling by a number — is the same in every role, which is why one set of math serves all of them.