Lemma
math, backwards
the hook · the everyday move

Most functions are hard. Their tangent line is easy.

Every time you say “for small θ, sin θ ≈ θ” or “near here, things grow linearly” or “to a first approximation,” you are linearizing. The pattern is so common we stop noticing it. The pendulum clock works because of it. Newton’s method is built on it. Gradient descent is built on it. Most engineering is the discipline of staying in the regime where it holds.

In the widget below, drag a (the anchor) and x (where to evaluate). With a = 0, the tangent line of sin is exactly y = x — the small-angle approximation, in pictures. Drag x away from 0 and watch the error grow as the square of the distance.

Widget — Tangent approximation

    f(x) — true          0.7174
    approx — tangent     0.8000
    error               -0.0826
    error / (x − a)²    -0.129

With the anchor at a = 0, the tangent line is y = x — the famous sin x ≈ x. Drag x: at x = 0.1 the error is below 0.001; at x = 1 it's near −0.16; at x = π/2 the tangent says π/2 ≈ 1.57 while the true value is 1. The "error / (x − a)²" row is roughly constant near the anchor — the "error grows as the square of the deviation" rule, made directly visible.
the arc
1

Why we approximate

The functions that govern the world are mostly nonlinear: a pendulum’s sin θ, a transistor’s exponential current, a gravitational force’s 1/r², a neural network’s softmax. Solving them in closed form is, in most cases, impossible. So we trade away exactness in a controlled way: pick a point we care about, replace the nonlinear function with the closest linear function in a neighbourhood of that point, and accept that the answer is correct only “near enough.”

2

The tangent line at a point

At any smooth point a, the function f has both a value f(a) and a slope f'(a) (from the derivatives module). The unique line passing through (a, f(a)) with that slope is the tangent line:

L_a(x)  =  f(a)  +  f'(a) · (x − a)

That is the linearization of f at a. It matches the function in two ways: L_a(a) = f(a) (same value at the anchor) and L_a'(a) = f'(a) (same slope at the anchor). No other line can claim both. The widget draws this line with the dashed brown stroke; for f(x) = sin x at a = 0, the line is L_0(x) = 0 + 1 · (x − 0) = x.

import math

# Linearization of f at a:  L_a(x) = f(a) + f'(a) · (x − a).
# Choose anchor a, compare with the true value over a range of x.
def linearize(f, fprime, a):
    fa, slope = f(a), fprime(a)
    return lambda x: fa + slope * (x - a)

L0 = linearize(math.sin, math.cos, a=0)    # tangent at 0 is y = x
[(round(x, 2), round(math.sin(x), 4), round(L0(x), 4))
 for x in (0.05, 0.2, 0.5, 1.0)]
# → [(0.05, 0.0500, 0.0500),    # < 0.0001 error
#    (0.2,  0.1987, 0.2000),    # 0.001
#    (0.5,  0.4794, 0.5000),    # 0.02
#    (1.0,  0.8415, 1.0000)]    # 0.16  — visibly bad
3

Error grows as a square

For any smooth function, the error f(x) − L_a(x) behaves like a quadratic (or better) in the deviation x − a. Doubling the deviation roughly quadruples the error; halving it cuts the error to a quarter. This is *quadratic, not linear* — and it is the reason linearization is useful: the gap closes very quickly as you approach the anchor.

Concretely for sin at 0: the second derivative vanishes at the anchor, so the leading error term is the cubic −x³/6, and error / (x − a)² drifts slowly with x rather than staying constant — the cubic term dominates here. Other functions (e^x, √(1 + x), 1/(1 − x)) have a roughly constant error / (x − a)² ratio near the anchor because their second derivative there is nonzero. Either way, the rule of thumb is the same: “small” deviations make linear approximation fine; “large” deviations make it wrong, fast.

# Error scales as (x − a)², not as (x − a). Quadratic, not linear.
# Doubling the deviation quadruples the error.
def error_ratio(f, L, a, x):
    return (f(x) - L(x)) / (x - a) ** 2 if x != a else None

[error_ratio(math.sin, L0, 0, x) for x in (0.05, 0.1, 0.2, 0.4, 0.8)]
# → [-0.0083, -0.0167, -0.0333, -0.0661, -0.1291]   — roughly −x/6
# The leading Taylor remainder for sin near 0 is −x³/6, so dividing by
# (x − a)² gives roughly −x/6, drifting slowly with x. The shape "error
# = constant·deviation²" is the dominant term in every linearization
# with f''(a) ≠ 0; all you have to read off is the constant.
4

Where this shows up — one tool, three pillars

Linearization is the first honest lie: replace a curved thing by the line that tells the truth nearby. The lie shows up under different names in different pillars; the math is the same.

physics : sin θ ≈ θ near zero
ml      : calibration curve ≈ tangent near one bin
finance : ΔPV ≈ -D · PV · Δr near the current rate (bond duration)

The pendulum clock runs on a single linearization: sin θ ≈ θ for small angles. The nonlinear ODE θ̈ = −(g/L) sin θ becomes θ̈ = −(g/L) θ — a linear oscillator with a closed-form sinusoidal solution and a constant-period swing. The whole 17th-century clock technology lives inside the small-angle regime where the lie holds, and the page’s widget makes that regime visible.
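A numerical sketch of that claim. The units are an assumption — g/L = 1, so the linearized period is exactly 2π — and the integrator is a plain semi-implicit Euler step, not anything from the page's widget:

```python
import math

# Compare the nonlinear pendulum θ̈ = −sin θ with its small-angle
# linearization θ̈ = −θ (units with g/L = 1, so linear period = 2π).
# Semi-implicit Euler; one quarter-swing detected by the zero crossing.
def pendulum_period(theta0, linear=False, dt=1e-4):
    theta, omega, t = theta0, 0.0, 0.0
    while theta > 0:                            # integrate to first zero
        acc = -theta if linear else -math.sin(theta)
        omega += acc * dt                       # update velocity first
        theta += omega * dt                     # then position
        t += dt
    return 4 * t                                # quarter swing × 4

pendulum_period(0.1, linear=True)   # ≈ 6.283 — 2π, amplitude-independent
pendulum_period(0.1)                # ≈ 6.287 — the lie is barely visible
pendulum_period(2.0)                # ≈ 8.35  — large swing: lie exposed
```

At 0.1 rad the nonlinear clock drifts by well under 0.1%; at 2 rad (about 115°) the period is a third longer than the linear prediction — the regime boundary, in numbers.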

The damped oscillator extends the same lie one step further: ẍ + 2γẋ + ω₀²x = F(t) is the small-amplitude linearization of every physical system that oscillates with friction and forcing. Car suspensions, building sway, RLC circuits, a singer's voice driving a wine glass — the same equation runs all of them inside the regime where the linearization holds.
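A sanity check on the closed form, with illustrative constants γ = 0.3 and ω₀ = 2 (assumptions, not values from the text): the underdamped solution x(t) = e^(−γt) cos(ω_d t), with ω_d = √(ω₀² − γ²), should make the unforced left-hand side vanish.

```python
import math

# Verify that x(t) = e^(−γt)·cos(ω_d·t) solves ẍ + 2γẋ + ω₀²x = 0
# by plugging it into the ODE with finite-difference derivatives.
gamma, omega0 = 0.3, 2.0                      # illustrative constants
omega_d = math.sqrt(omega0**2 - gamma**2)     # damped frequency

def x(t):
    return math.exp(-gamma * t) * math.cos(omega_d * t)

def residual(t, h=1e-4):
    xdd = (x(t + h) - 2 * x(t) + x(t - h)) / h**2   # ẍ, central difference
    xd = (x(t + h) - x(t - h)) / (2 * h)            # ẋ, central difference
    return xdd + 2 * gamma * xd + omega0**2 * x(t)

max(abs(residual(t)) for t in (0.5, 1.0, 2.0, 5.0))
# → close to zero (finite-difference noise only): the closed form holds
```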

Model calibration does the same trick on a curve a model produces. Click any bin in the reliability diagram and the widget draws the tangent to the calibration curve at that bin’s center: locally, actual(p) ≈ m·p + c. Two numbers — slope and intercept — describe the gap between confidence and frequency near that bin. Slope ≈ 1 means the curve is parallel to the diagonal there (a constant shift); slope ≠ 1 means the gap changes with confidence, which a global rotation (temperature scaling) can fix. Same tangent-line tool, completely different use.
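The same finite-difference move works on any calibration curve. The curve below is synthetic — an over-confident model of the form actual(p) = σ(logit(p)/T) with temperature T = 1.5 — a stand-in assumption, not data from the page's widget:

```python
import math

# Tangent line to a (synthetic) calibration curve at a bin center.
# actual(p) = σ(logit(p)/T): a temperature-miscalibrated model, T = 1.5.
T = 1.5

def actual(p):
    z = math.log(p / (1 - p))           # logit of the model's confidence
    return 1 / (1 + math.exp(-z / T))   # true frequency at confidence p

def tangent_at(p0, h=1e-6):
    m = (actual(p0 + h) - actual(p0 - h)) / (2 * h)   # local slope
    c = actual(p0) - m * p0                            # local intercept
    return m, c

m, c = tangent_at(0.8)
# Locally, actual(p) ≈ m·p + c. Here m < 1: the confidence–frequency
# gap widens with confidence — the case temperature scaling can fix.
```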

Present value — the bond market’s working approximation. PV is a nonlinear function of the interest rate r (an integral of e^(−rt) discount factors), but for small rate moves it linearizes to ΔPV ≈ −D · PV · Δr, where D is the modified duration. Traders quote duration instead of recomputing the integral after every rate tick. The approximation is honest inside a small Δr neighbourhood; outside it, the same page’s convexity correction (the second-order term) catches what duration misses. Linearization first, second-order correction second — the standard pattern.
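A sketch with a hypothetical bond — 5% annual coupon on face 100, 5 years, continuous discounting — every number here is illustrative:

```python
import math

# Duration as the linearization of PV in the rate r.
# Hypothetical bond: coupons of 5 at t = 1..4, coupon + face = 105 at t = 5.
cashflows = [(t, 5.0) for t in range(1, 5)] + [(5, 105.0)]

def pv(r):
    return sum(c * math.exp(-r * t) for t, c in cashflows)

def duration(r):
    # Modified duration under continuous compounding: D = −PV'(r) / PV(r)
    return sum(t * c * math.exp(-r * t) for t, c in cashflows) / pv(r)

r0, dr = 0.04, 0.005                       # a 50 bp rate move
exact = pv(r0 + dr) - pv(r0)               # recompute the whole sum
linear = -duration(r0) * pv(r0) * dr       # ΔPV ≈ −D · PV · Δr

round(exact, 3), round(linear, 3)
# → ≈ (-2.343, -2.371): duration captures the move; the small gap
#   between them is the convexity (second-order) term
```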

The same machine also drives Newton’s method (linearize, find the line’s zero, repeat) and gradient descent (the first-order Taylor approximation says L(w − η·∇L) ≈ L(w) − η·‖∇L‖²; if η is small enough that the linearization is trustworthy, the loss decreases — past the ceiling the linearization lies, which is exactly the explosion at η = 0.27 in that page’s widget). Newton, gradient descent, the pendulum, calibration, and bond duration all run the same step: replace the curve with its tangent for as long as the tangent is honest.

# Newton's method: solve f(x) = 0 by repeatedly linearizing at the
# current guess, then finding where THAT line crosses zero.
def newton(f, fprime, x0, steps=5):
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)         # the root of the tangent line
    return x

# Fixed point of cos:  solve cos(x) − x = 0 starting near 0.5.
newton(lambda x: math.cos(x) - x,
       lambda x: -math.sin(x) - 1,
       x0=0.5)
# → 0.7390851332151607     (the Dottie number)
#
# Each Newton step IS a linearization step. Gradient descent is the
# same recipe applied to ∇L instead of f, with a fixed-size step (η)
# instead of the exact root of the tangent.
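The η ceiling is easy to reproduce on a toy quadratic. Assume L(w) = ½·k·w² with k = 7.5 — an illustrative curvature chosen so the ceiling 2/k lands near the quoted 0.27; the page's widget uses its own constant. The gradient step multiplies w by (1 − ηk), so the iteration converges exactly when η < 2/k:

```python
# Gradient descent on L(w) = ½·k·w², where ∇L = k·w and L''(w) = k.
# Each update is w ← w − η·k·w = (1 − ηk)·w: geometric shrink or blow-up.
k = 7.5                                # illustrative curvature, 2/k ≈ 0.267

def gd(eta, w0=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w -= eta * k * w               # the first-order (tangent) step
    return abs(w)

gd(0.20)   # → ≈ 8.9e-16 — below the ceiling, converges to the minimum
gd(0.28)   # → ≈ 117    — above 2/k, every step overshoots and grows
```

Below 2/k the linearization is trustworthy at each step and the loss falls; above it, the second-order term the tangent ignores flips the step into an overshoot that compounds.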
5

The catch — only 'almost'

Every linearization is wrong outside its anchor’s neighbourhood. The discipline of using the tool well is the discipline of measuring and respecting that neighbourhood:

  • Quote a bound on the deviation for which the linear answer is good enough — not just a slogan that “linearization works.”
  • Compute or estimate the next-order term and check that it is small (or that its sign won’t bite you).
  • Stay inside the regime by design: clock escapements force small arcs; ML training schedules shrink the learning rate; control systems stay near operating points; circuit designers bias transistors into the linear region. When you can’t stay there, switch to a higher-order method or a nonlinear solver — and accept the cost.

That phrase — “the regime where the lie holds” — is the same one the pendulum page closes on. That is no coincidence; it is the pattern this module names. Every applied-math discipline has a private inventory of such regimes, kept by people who know exactly how far they can lean.

The tangent line is the cheapest answer that still gets the slope right. Linearization replaces a hard problem with an easy one — valid in a regime, wrong outside it, always. The discipline is the regime.

exercises · work by hand
1 · small-angle by hand · no calculator

Linearize f(x) = sin x at a = 0. Use the linearization to estimate sin 0.1. The true value (to 4 decimals) is 0.0998. What is the error? About what fraction of x is it?

2 · exponential at zero · no calculator

Linearize f(x) = e^x at a = 0. Estimate e^0.2. The true value is about 1.2214. Compare with e^0.5 (true: 1.6487). What does the relative error pattern look like as the deviation grows?

3 · Newton step as linearization

Suppose you want to solve f(x) = 0 for some nonlinear f. You have a guess x_n. Linearize f at x_n and find where that line crosses zero. Show that the next guess is x_(n+1) = x_n − f(x_n) / f'(x_n). What breaks this iteration?

4 · why gradient descent's η has a ceiling

The first-order Taylor approximation of the loss, L(w + d) ≈ L(w) + ∇L(w) · d, is honest only for small ‖d‖. The gradient-descent step chooses d = −η · ∇L. Use the second-order term to argue why the toy quadratic loss in /ml/gradient-descent has a hard η ceiling at 2/L''(w).

glossary · used on this page · 2
linearization·선형화
Replacing a nonlinear function near a chosen point by its tangent line — keeping only the constant and first-derivative terms of the Taylor expansion. Near `x = 0`, `sin x ≈ x`, `cos x ≈ 1 − x²/2`, `e^x ≈ 1 + x`. The approximation is excellent for small `|x|` and grows wrong as `|x|` increases. Mechanical clocks, electrical circuit analysis, control systems, and most of "the engineering equations" are linearized versions of much harder nonlinear ones, valid in the small-deviation regime where everything in the system is supposed to live.
tangent line·접선
A straight line that touches a curve at a single point and matches the curve's direction there. Its slope at the point of contact equals the derivative of the function at that point: `m_tangent = f'(a)`. The tangent is what the secant becomes in the limit as its two intersection points merge — the curve's _instantaneous direction_ made visible as a line. Distinct from the trigonometric tangent; same word, different concept.