A function tells you the height. How do you know which way to go down?
A function says how good your current choice is — a single number. It does not tell you the best choice. For simple functions you can write down the answer; for almost everything else, you cannot. There is no formula for the temperature that calibrates this network, or the portfolio weights that minimize this variance under these constraints, or the model parameters that fit this dataset. Yet all three problems get solved, every day, by the same five moves: name the objective, name the search space, compare nearby choices, choose how far to step, and decide when to stop.
Optimization is not finding the formula. It is choosing a quantity to improve, then moving through possible choices until improvement stops. This module is about the moves. Different algorithms do them differently; the bones are identical.
- what
Given an objective function f over a search space, find a point that makes f as small (or large) as possible. The standard machinery: start somewhere, take steps in a direction that improves f, control how far each step goes (the step size), and stop when no nearby move improves — that is convergence. The point where you stop is a local minimum; whether it is also the global minimum depends on the geometry of f and where you started.
- applies when
Any time you have a choice and a number that says how good the choice is. Fitting model parameters in ML, picking portfolio weights in finance, tuning a temperature scalar in calibration, finding equilibrium points in physics, designing a cheaper part in engineering. The same five-step loop runs all of them; what differs is the objective, the space, and the move rule.
- breaks when
Three places the simple loop fails. (1) The objective has many local minima and descent finds a bad one. (2) The search space has constraints (probabilities sum to 1, weights are non-negative, parameters live on a manifold) and naive steps leave it. (3) The objective is cheap to evaluate but its gradient is not, so deciding the direction is the costly step. Each of these has its own family of techniques; this module is the base case they all extend.
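Failure mode (2) fits in a few lines. The sketch below is illustrative — the weights and gradient are made-up numbers — but it shows why a naive step leaves a constrained space, and one standard repair: project the step back onto the constraint.

import numpy as np
# Weights must sum to 1, but a plain gradient step does not know that.
w = np.array([0.5, 0.3, 0.2])         # feasible: sums to 1
g = np.array([0.4, -0.1, 0.1])        # a hypothetical gradient
w_naive = w - 0.5 * g
print(w_naive.sum())                  # → ≈ 0.8 — stepped off the constraint set
# One standard fix for a sum constraint: remove the component of the
# gradient that changes the sum (its mean), so every step stays feasible.
g_proj = g - g.mean()
w_proj = w - 0.5 * g_proj
print(w_proj.sum())                   # → ≈ 1.0 — still feasible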
Objective — what are we trying to improve?
Every optimization problem starts with one number. Smaller is better, or bigger is better, but only one number gets that vote. In machine learning that number is the loss function — sum of errors over a dataset. In finance it can be portfolio variance — one number for the risk of a choice of weights.
“Minimize” and “maximize” are the same problem twice — maximizing f is minimizing −f. By convention this module talks about minimization; everything carries over to maximization by flipping the sign. The objective is what locks the problem in place. Without one, “make it better” has no agreed direction. With one, every candidate has a number attached and the comparison becomes mechanical.
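One flipped sign is the whole translation. A minimal sketch (the quality score q here is invented for illustration):

# Maximizing q is minimizing f = -q; the comparison stays mechanical.
q = lambda x: -(x - 3) ** 2 + 5       # hypothetical "bigger is better" score
f = lambda x: -q(x)                   # now smaller is better
candidates = [0.2, 1.7, 3.1, 4.6]
print(min(candidates, key=f))         # → 3.1, the candidate nearest the peak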
Search space — what choices are allowed?
The objective alone is not enough; you also need to say what can vary. The set of all allowed choices is the search space. For ML it is the high-dimensional space of model parameters; for a portfolio it is the weights, constrained to sum to 1; for temperature scaling it is a single positive scalar.
Whether the search space is finite (pick one of 100 hyperparameter combinations), discrete and combinatorial (assign tasks to workers), or continuous (anywhere in ℝⁿ) changes which algorithms apply. This module focuses on the continuous case, because that is where the derivative — the move rule that makes the rest of Lemma tick — has the most to say. Discrete optimization (integer programming, scheduling, the travelling salesman) exists; it uses different math.
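The split matters operationally: a finite space can be searched by brute force, while a continuous one cannot be enumerated at all. A sketch, with a stand-in objective and a hypothetical 100-point grid:

import numpy as np
# Finite search space: evaluate every allowed choice and keep the best.
f = lambda x: (x - 1) ** 2
grid = np.linspace(-5, 5, 100)             # 100 allowed choices
best = grid[np.argmin(f(grid))]
print(round(best, 2))                      # → 0.96 — limited by grid spacing
# Continuous search space: no enumeration possible; you need a move rule.
# That is what the derivative provides in the next section.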
Move — compare nearby choices
Given the current point x, the optimizer must decide which way to go. The simplest, most general answer: evaluate f at x and at points near x, then move toward the smaller value. When f is differentiable, “near” collapses into a single object — the derivative f′(x), or in higher dimensions the gradient — which says, to first order, exactly which direction makes f smaller fastest. Movement becomes one operation: x ← x − α·f′(x).
The widget makes this concrete. The orange tangent at the current x shows the slope f′(x); pressing step moves x to the left if the slope is positive (function is increasing → go the other way) or to the right if negative. The size of the move is α·f′(x) — proportional to the slope itself. Where the slope is steep the step is big; where the slope is shallow the step is small. The optimizer slows down on its own as it approaches a minimum.
This is not the only move rule. Newton’s method uses the second derivative to choose the step in one shot. Coordinate descent moves one variable at a time. Stochastic gradient descent uses a noisy approximation of the gradient. All of them are answers to the same question: which way is better, and how far?
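To make one alternative concrete: for a quadratic, the second derivative tells Newton's method exactly how far to go, so it lands in one move. A sketch on the same bowl used throughout this module (α = 0.3 for the gradient step is an arbitrary hand-picked value):

# Newton's method picks the step from curvature: x ← x − f'(x)/f''(x).
f   = lambda x: (x - 1) ** 2
df  = lambda x: 2 * (x - 1)
d2f = lambda x: 2.0                     # curvature of a quadratic is constant

x0 = -2.4
x_newton = x0 - df(x0) / d2f(x0)        # step length chosen by curvature
x_grad   = x0 - 0.3 * df(x0)            # step length chosen by hand
print(round(x_newton, 4), round(x_grad, 4))   # → 1.0 (done) vs -0.36 (one of many)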
import numpy as np
# A general "improve until you can't" loop. Given an objective f and its
# derivative f', start somewhere, take steps in the direction f' says is
# downhill, stop when f' is essentially zero (or you run out of budget).
def descend(f, df, x0, alpha=0.1, tol=1e-4, max_iters=500):
x = x0
history = [x]
for _ in range(max_iters):
g = df(x)
if abs(g) < tol:
break
x = x - alpha * g
history.append(x)
return x, history
# Convex bowl: f(x) = (x - 1)², f'(x) = 2(x - 1). Unique min at x = 1.
f = lambda x: (x - 1)**2
df = lambda x: 2 * (x - 1)
x_star, hist = descend(f, df, x0=-2.4, alpha=0.3)
(round(x_star, 4), len(hist), round(f(x_star), 6))
# → (1.0, 14, 0.0)
# 13 steps from x = -2.4 to x = 1.0 (14 points in history). Each step uses
# f'(x) — the "which way is better" question, answered locally.
Step size — moving too little, too much
The step size α is the second knob, and it has at least as much practical impact as the direction. Try the widget: with the convex bowl preset and small α, the position creeps left-to-right toward the minimum — many tiny correct steps. Increase α — the descent gets faster, then too fast: each step overshoots, and the trajectory oscillates back and forth. Push further and the position blows up entirely, walking off the screen with each iteration amplifying the previous error.
For a quadratic with second derivative f″, the stability bound is sharp: α < 2/f″ converges, α > 2/f″ diverges. For non-quadratic f the bound moves as you do — the local curvature at one point differs from the curvature elsewhere. Real algorithms either fix α conservatively, schedule it (start big, decrease over time), or compute it adaptively each step (line search, trust regions). All three are answers to the same fact: no single constant works for every geometry.
In ML, α has a special name — the learning rate — because tuning it dominates training outcomes. In the rest of optimization it is just α. Same number, same role; different communities give it different folklore.
# Step size α decides whether descent crawls, lands, or diverges.
# Same convex bowl, three different α.
for alpha in (0.02, 0.3, 1.05):
x_star, hist = descend(f, df, x0=-2.4, alpha=alpha, max_iters=300)
print(f"α = {alpha:5.2f} → iters = {len(hist):4d} final x = {x_star:8.4f}")
# α =  0.02 → iters =  274   final x =   1.0000  (crawls — 20× the steps)
# α =  0.30 → iters =   14   final x =   1.0000  (lands)
# α =  1.05 → iters =  301   final x ≈ -8.9e+12  (overshoots, diverges)
#
# For this quadratic, the curvature is f''(x) = 2, and the stability
# bound is α < 2 / f''(x) = 1. α = 1.05 sits past the edge, so each step
# amplifies the previous error. The right α is geometry-dependent.
# Adaptive trick: shrink α whenever a step *increases* f. Backtracking
# line search is one widely-used version.
def descend_backtrack(f, df, x0, alpha0=1.0, tol=1e-4, max_iters=500):
x = x0
alpha = alpha0
for _ in range(max_iters):
g = df(x)
if abs(g) < tol:
break
        # Halve α until the step no longer increases f.
while f(x - alpha * g) > f(x):
alpha *= 0.5
x = x - alpha * g
alpha = min(alpha * 1.5, alpha0) # gently grow back
return x
round(descend_backtrack(f, df, x0=-2.4, alpha0=2.0), 4)
# → 1.0   converges even with a deliberately too-big starting α
Stopping — local minimum and convergence
The simplest stopping rule: stop when the next move would barely change anything. Concretely, when the gradient magnitude |f′(x)| falls below a small tolerance, or when the change in f does, or when the iterate x stops moving. Without a rule like this, the loop runs forever. With one, something stops the optimizer — but the stopping point is only guaranteed to be a local minimum.
This is the most important caveat the module carries. A convex objective (a single bowl, no other dips) has exactly one minimum, and any descent reaches it. Almost nothing in the real world is convex. Neural network loss surfaces are spectacularly non-convex; portfolio problems with constraints become non-convex; the one-dimensional calibration loss is convex but its higher-dimensional cousins are not. The widget’s two valleys preset shows the problem in one dimension: starting on the left, descent lands in the left valley — a local minimum — and stops, oblivious to the deeper right valley.
The honest position: stopped and solved are different things. Local minima look identical to global ones from the inside (no nearby move helps). Strategies that try to escape — random restarts, momentum, stochastic noise, simulated annealing — all add some way to not stop at the first place gradient vanishes.
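The cheapest of those strategies fits in a few lines. A sketch of random restarts, reusing the descend loop from above on an invented two-valley function (the seed and restart count are arbitrary):

# Two valleys: a local minimum near x ≈ +1.1, a deeper one near x ≈ -1.3.
f  = lambda x: x**4 - 3 * x**2 + x
df = lambda x: 4 * x**3 - 6 * x + 1

rng = np.random.default_rng(0)
starts = rng.uniform(-3, 3, size=8)            # restart from several places
finishes = [descend(f, df, x0, alpha=0.02)[0] for x0 in starts]
best = min(finishes, key=f)
print(round(best, 3), round(f(best), 3))       # → ≈ (-1.301, -3.514)
# Any single start can land in the shallow valley and stop; the best of
# several restarts is far more likely to find the deep one.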
Where this shows up — same five steps, three pillars
Optimization is the math behind every “we don’t have a formula, but we have a procedure” page in Lemma. Three applications already speak this language; they share the same skeleton, with different choices for what to minimize, where, and how.
- ml: loss over training data; parameters in ℝⁿ; gradient descent with a learning rate.
- ml: negative log-likelihood; a single temperature scalar; one-dimensional descent.
- finance: portfolio variance; weights summing to 1; minimize a quadratic, sometimes closed-form, sometimes iterative.

Gradient descent — the canonical example. Objective = a regression loss summed over a dataset. Search space = the model’s parameters. Move = the gradient, computed via backpropagation. Step size = the learning rate. Stopping = the loss plateaus or you run out of training budget. Every detail is one of the five steps with a domain-specific name.
Model calibration — temperature scaling is optimization with a one-dimensional search space. Objective = negative log-likelihood on a held-out set. Search space = a single temperature scalar T > 0. Move = the derivative of NLL with respect to T. Step size = small enough that a single scalar fit converges in seconds. The page solves the problem by descent; this module is what justifies that descent works.
Portfolio risk — finding the minimum-variance weights is optimization with a closed-form answer for two assets, and an iterative answer in higher dimensions. Objective = portfolio variance. Search space = weights summing to 1. Move = the derivative of variance with respect to the weight. The arc-5 derivation on that page is exactly step three done in closed form — set the gradient to zero and solve. When closed form exists, optimization collapses to algebra; when it does not, the same machinery iterates instead.
Same five-step skeleton, different flesh. Naming the skeleton is what makes it portable.
# Where this shows up — three real optimization stories share the same loop.
# (1) ML: gradient descent on a regression loss.
# Data y = 3·x; fit w to minimize L(w) = Σ (w·x_i − y_i)².
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 3.0 * X
def L(w): return float(((w * X - Y) ** 2).sum())
def dL(w): return float((2 * (w * X - Y) * X).sum())
w_star, _ = descend(L, dL, x0=0.0, alpha=0.02)
round(w_star, 4)
# → 3.0 the slope of Y = 3x, recovered by descent
# (2) Calibration: fit one scalar T (temperature) to minimize NLL.
# Logits z, true labels y; we pick T to make softmax(z/T) match observed
# correctness frequencies. One parameter, well-behaved objective.
def softmax(z):
e = np.exp(z - z.max(axis=-1, keepdims=True))
return e / e.sum(axis=-1, keepdims=True)
np.random.seed(0)
z_logits = np.random.randn(200, 5) * 3.0 # overconfident logits
labels = np.random.randint(0, 5, size=200)
def nll(T): return float(-np.log(softmax(z_logits / T)[range(200), labels] + 1e-12).mean())
T_star = 1.0
for _ in range(80):
g = (nll(T_star + 1e-3) - nll(T_star - 1e-3)) / 2e-3 # numerical gradient
T_star -= 0.1 * g
round(T_star, 3), round(nll(T_star), 3)
# → (>1.0, <nll(1.0)) T > 1 softens overconfident predictions
# (3) Portfolio risk: minimize variance over weight w ∈ [0, 1].
# Two assets, σ_A = 0.2, σ_B = 0.1, ρ = 0.0 (independent).
def var_p(w): return float(w**2 * 0.04 + (1 - w)**2 * 0.01)
def dvar_p(w): return float(2 * w * 0.04 - 2 * (1 - w) * 0.01)
w_star, _ = descend(var_p, dvar_p, x0=0.5, alpha=2.0, tol=1e-6)  # tight tol so w matches closed form to 4 dp
round(w_star, 4)
# → 0.2 20% in the volatile asset minimizes variance. The closed-form
# answer w* = σ_B² / (σ_A² + σ_B²) = 0.01 / 0.05 = 0.2. Match.
#
# Three pillars, one loop: objective + derivative + step + stopping.
# The science is in the choices (which f? which start? which α?). The
# *machinery* is identical.
Optimization is choosing a quantity to improve, then moving through possible choices until improvement stops. Five moves: objective, search space, move rule, step size, stopping. Almost everything called fitting, tuning, minimizing, or finding equilibrium in the rest of Lemma is some version of this loop.
Pick one of the three retro-fitted applications (gradient descent, model calibration, or portfolio risk). Name what that page uses as the objective, the search space, the move rule, the step size, and the stopping condition.
Take f(x) = (x − 1)². Compute f′(x) = 2(x − 1). Starting at x₀ = 0, write the first three iterates for (a) α = 0.25, (b) α = 1, (c) α = 1.25. State which one converges, which one stays steady, and which one diverges.
In the widget, switch to two valleys. Reset and use the position slider to start well to the left, then take steps until you stop moving. Record where you stopped and the value of f there. Now restart well to the right and do the same. Are the two stopping points the same? What does this say about the descent rule?
A trader wants to maximize expected return μ(w) over a portfolio weight w. Rewrite this as a minimization problem in the form this module uses. What is the objective function? What is its derivative in terms of μ′(w)?