📐 Module 4 · The Math Arsenal · Chapter 4.3 · 12 min read

Derivative and Gradient

The calculus that sets the direction of learning — why SIDRA cares about it for training.

What you'll learn here

  • Write single-variable derivative and multi-variable gradient definitions
  • Show the gradient is the direction of steepest ascent
  • Restate that the chain rule is the mathematical atom of backprop
  • Compute the Jacobian of an MVM and explain why it matters for SIDRA
  • Summarize the difficulty of gradient estimation in analog hardware and SIDRA's practical workarounds

Hook: The Math of Finding Direction

You’re on a hill and want to descend. It’s foggy, but you can feel the slope under your feet. Which way do you step? The direction of steepest descent. That direction is always the negative gradient.

Every bit of AI training boils down to this:

  1. Compute the value of the loss function (your altitude in the fog).
  2. Compute the loss gradient with respect to every weight (the slope).
  3. Take a small step along the negative gradient (descend).
  4. Repeat.

Billions of weights, trillions of steps. But each step is a calculus operation — a derivative.

The interesting question for SIDRA: can we compute the gradient physically? Ohm + KCL give us MVM (4.2). The gradient is also mathematically an MVM (a Jacobian product). So in principle, the crossbar can do backprop. Practical hurdles exist (discussed in 3.6), but the direction is right.

This chapter builds derivative and gradient from scratch, revisits the chain rule, and outlines SIDRA’s analog gradient-compute strategy.

Intuition: Derivative = Local Slope

What is a derivative?

Given a function $f(x)$, we want to know how fast it changes at a point $x_0$. Definition:

$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$

Intuition: if $x$ wiggles a little (by $h$), how much does $f$ move? The ratio is the derivative.

Example: $f(x) = x^2$, so $f'(x) = 2x$. At $x = 3$ the derivative is 6 → wiggle $x$ up by 0.01 and $f$ rises by about 0.06 ($f(3) = 9$, $f(3.01) = 9.0601$).
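The wiggle check above can be reproduced in a few lines of Python: a minimal sketch of the difference quotient (function and step sizes chosen here for illustration).

```python
# Minimal sketch: the difference quotient for f(x) = x^2 at x0 = 3.
def f(x):
    return x ** 2

def diff_quotient(f, x0, h):
    # (f(x0 + h) - f(x0)) / h approaches f'(x0) as h -> 0
    return (f(x0 + h) - f(x0)) / h

for h in (0.1, 0.01, 0.001):
    print(h, diff_quotient(f, 3.0, h))   # values approach 6
```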

Single-variable derivative rules:

  • $(c)' = 0$ (constant)
  • $(x^n)' = n x^{n-1}$
  • $(e^x)' = e^x$
  • $(\ln x)' = 1/x$
  • $(\sin x)' = \cos x$
  • Sum: $(f + g)' = f' + g'$
  • Product: $(fg)' = f'g + fg'$
  • Chain: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$
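As a sanity check on the chain rule entry above, a short numerical sketch (the function $\sin(x^2)$ is an illustrative choice, not from the text):

```python
import math

# Sketch: check the chain rule on h(x) = sin(x^2), so h'(x) = cos(x^2) * 2x.
def h_fn(x):
    return math.sin(x ** 2)

def h_prime_analytic(x):
    return math.cos(x ** 2) * 2 * x

def h_prime_numeric(x, eps=1e-6):
    # central difference: (h(x+eps) - h(x-eps)) / (2 eps)
    return (h_fn(x + eps) - h_fn(x - eps)) / (2 * eps)

assert abs(h_prime_analytic(1.3) - h_prime_numeric(1.3)) < 1e-6
```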

Gradient — multivariable derivative:

$f: \mathbb{R}^N \to \mathbb{R}$ (vector to scalar). Take partial derivatives with respect to each variable:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_N} \right)$$

A vector: the direction of steepest ascent. Its negative: steepest descent.

What it means in AI: the loss $L$ is a function of the weight vector $\mathbf{w}$. The gradient $\nabla_{\mathbf{w}} L$ tells you which way to change the weights to raise the loss; the opposite direction lowers it. That’s the whole math of optimization.
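The steepest-ascent claim can be checked empirically. A hedged sketch: sample many random unit directions and confirm none has a larger directional derivative than the gradient direction (the function $f(x, y) = x^2 + 3y^2$ is an assumed example).

```python
import numpy as np

# Sketch: among unit directions, the gradient direction maximizes the
# directional derivative. Example function f(x, y) = x^2 + 3y^2 (assumed).
def grad_f(v):
    x, y = v
    return np.array([2 * x, 6 * y])

p = np.array([1.0, 1.0])
g = grad_f(p)
g_unit = g / np.linalg.norm(g)

rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)           # random unit direction
    best = max(best, g @ d)          # directional derivative along d

# no sampled direction beats the gradient direction (Cauchy-Schwarz)
assert best <= g @ g_unit + 1e-12
```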

Formalism: Partial Derivatives, Jacobian, Chain Rule

L1 · Beginner

Partial derivative:

Hold one variable, treat others as constants:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(\ldots, x_i + h, \ldots) - f(\ldots, x_i, \ldots)}{h}$$

Example: $f(x, y) = x^2 y + 3y$. Then $\frac{\partial f}{\partial x} = 2xy$ and $\frac{\partial f}{\partial y} = x^2 + 3$.

Gradient (compact):

$\nabla f(\mathbf{x})$ = the vector of all partial derivatives. The steepest-ascent direction.

Practical compute — in AI:

A neuron’s output: $a = f(z)$, with pre-activation $z = \mathbf{w}^\top \mathbf{x}$. Loss: $L = (a - y)^2$.

What’s the gradient with respect to $\mathbf{w}$? Chain rule:

$$\nabla_{\mathbf{w}} L = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}}$$
  • $\partial L / \partial a = 2(a - y)$
  • $\partial a / \partial z = f'(z)$ (activation derivative)
  • $\partial z / \partial \mathbf{w} = \mathbf{x}$

Result: $\nabla_{\mathbf{w}} L = 2(a - y) f'(z)\, \mathbf{x}$.

A scalar-times-vector operation. For SIDRA: $\mathbf{x}$ is already at the crossbar input; $(a - y) f'(z)$ is a scalar → multiply every component of $\mathbf{x}$ by that scalar to get the gradient vector.
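A small sketch of this neuron gradient, with an assumed sigmoid activation (so $f'(z) = a(1 - a)$), checked against central differences:

```python
import numpy as np

# Sketch (assumed sigmoid activation): gradient 2(a - y) f'(z) x checked
# against central differences. For sigmoid, f'(z) = a (1 - a).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(w @ x)
    return (a - y) ** 2

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
y = 0.7

a = sigmoid(w @ x)
analytic = 2 * (a - y) * a * (1 - a) * x

eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```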

L2 · Full

Jacobian matrix:

$f: \mathbb{R}^N \to \mathbb{R}^M$, vector-to-vector. Sensitivity of each output to each input:

$$\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$$

$\mathbf{J}$ is $M \times N$. The gradient is the $M = 1$ case (a scalar-valued function).

Jacobian of an MVM:

$\mathbf{y} = \mathbf{W} \mathbf{x}$. Jacobian with respect to $\mathbf{x}$:

$$\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{W}$$

So: the Jacobian of an MVM is the matrix itself. Programmed into the crossbar, readable. That’s the mathematical foundation for why a backward pass can run naturally on SIDRA.

Backward pass = transpose MVM:

Given $\frac{\partial L}{\partial \mathbf{y}}$ at a layer output, we want the input gradient:

$$\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^\top \frac{\partial L}{\partial \mathbf{y}} = \mathbf{W}^\top \frac{\partial L}{\partial \mathbf{y}}$$

Also an MVM, with $\mathbf{W}^\top$ instead of $\mathbf{W}$. The SIDRA crossbar can swap row/column roles to run a transpose MVM → the backward pass is feasible in hardware.
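A quick numerical sketch of the transpose-MVM backward pass (the loss $L = 0.5\lVert\mathbf{Wx}\rVert^2$ is an assumed example, so $\partial L/\partial \mathbf{y} = \mathbf{y}$):

```python
import numpy as np

# Sketch: for y = W x and a downstream loss, the input gradient is W^T (dL/dy).
# Assumed loss here: L = 0.5 * ||W x||^2, so dL/dy = y.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

y = W @ x
grad_y = y                      # dL/dy for this loss
grad_x = W.T @ grad_y           # transpose MVM = backward pass

def L(x):
    y = W @ x
    return 0.5 * y @ y

eps = 1e-6
numeric = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_x, numeric, atol=1e-5)
```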

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^\top$$

This is an outer product — two vectors become a matrix. Its size is $M \times N$, the same as the weight matrix. Outer products are hard in hardware; specialized circuits are required (the 5.9 compute engine).
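A sketch of the outer-product weight gradient under an assumed loss $L = 0.5\lVert\mathbf{Wx}\rVert^2$, verified entry by entry against finite differences:

```python
import numpy as np

# Sketch: the weight gradient of the assumed loss L = 0.5 * ||W x||^2 is the
# outer product (dL/dy) x^T, with shape M x N matching W.
rng = np.random.default_rng(2)
M, N = 4, 3
W = rng.normal(size=(M, N))
x = rng.normal(size=N)

grad_y = W @ x                       # dL/dy = y for this loss
grad_W = np.outer(grad_y, x)         # outer product, M x N
assert grad_W.shape == W.shape

def L(W):
    y = W @ x
    return 0.5 * y @ y

# finite-difference check, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(M):
    for j in range(N):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (L(W + E) - L(W - E)) / (2 * eps)
assert np.allclose(grad_W, numeric, atol=1e-5)
```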

L3 · Deep

Multi-layer chain rule:

$L$ is the loss, $\mathbf{a}^{(l)}$ the output of layer $l$. Backprop:

$$\delta^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$$

We saw this in 3.6. Each layer = one transpose MVM + elementwise multiply (Hadamard).
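The delta recursion can be sketched on a tiny two-layer tanh network (architecture, loss, and sizes assumed for illustration); a finite-difference check confirms that one transpose MVM plus a Hadamard product reproduces $\partial L/\partial \mathbf{z}^{(1)}$.

```python
import numpy as np

# Sketch of delta^(l) = (W^(l+1))^T delta^(l+1) ⊙ f'(z^(l)) on an assumed
# 2-layer tanh network with L = 0.5 * ||a2 - t||^2. For tanh, f'(z) = 1 - a^2.
rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = rng.normal(size=2)

def loss_from_z1(z1):
    a1 = np.tanh(z1)
    a2 = np.tanh(W2 @ a1)
    return 0.5 * np.sum((a2 - t) ** 2)

z1 = W1 @ x
a1 = np.tanh(z1)
a2 = np.tanh(W2 @ a1)

delta2 = (a2 - t) * (1 - a2 ** 2)            # dL/dz2
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)     # transpose MVM + Hadamard

eps = 1e-6
numeric = np.array([
    (loss_from_z1(z1 + eps * e) - loss_from_z1(z1 - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(delta1, numeric, atol=1e-5)
```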

SIDRA backward pass:

Forward pass is natural on the crossbar:

  • $V_i = a_i^{(l-1)}$ (row input)
  • $I_j = \sum_i G_{ij} V_i = z_j^{(l)}$ (column output)

For the backward pass, transpose is needed:

  • $V_j = \delta_j^{(l+1)}$ (input on columns)
  • $I_i = \sum_j G_{ij} V_j$ (output from rows)

Physically possible — the crossbar is symmetric. But practical issues:

  1. Activation derivative: $f'(\mathbf{z})$ requires an elementwise multiply. Extra circuitry.
  2. Noise accumulation: each backward pass adds ~5% noise. In a 10-layer network, the gradient degrades.
  3. Outer product: needed for the weight gradient. Not direct on the crossbar; two MVMs + an outer-product accelerator.

SIDRA strategy (realistic):

  • Y1 (today): No backprop. Training on GPU.
  • Y3 (2027): Prototype transpose MVM for the last-layer + bias update.
  • Y10 (2029): Hybrid training on the last 2-3 layers (transfer learning). Gradient noise tolerable.
  • Y100 (2031+): Full analog backward pass + circuits for outer product.

Gradient estimation techniques:

Estimate rather than compute the exact gradient (much cheaper):

  1. Finite difference: $\partial f/\partial w_i \approx (f(\mathbf{w} + h\mathbf{e}_i) - f(\mathbf{w}))/h$. One parameter at a time; crude.
  2. Simultaneous perturbation (SPSA): random perturbation → two inferences → gradient estimate. Parallelizable. Interesting for SIDRA.
  3. Forward-mode AD: tiny parallel changes across all weights → observe output shifts. Hardware-cheap.
  4. Feedback alignment: random backward matrix → approximates backprop. Used in SNNs.
  5. Evolution strategies: gradient-free — weight mutation + selection. Candidate for SIDRA online learning.

SIDRA Y100 vision: SPSA + forward-forward combination. Gradient never explicitly computed, but the end effect matches (local, parallel, analog).
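A sketch of the SPSA estimator from item 2 on an assumed quadratic loss (dimensions, constants, and iteration count are illustrative): two function evaluations per iteration, and the averaged estimate approaches the true gradient.

```python
import numpy as np

# Sketch of SPSA on an assumed quadratic loss L(w) = 0.5 w^T A w, whose true
# gradient is A w. Each iteration costs two loss evaluations, regardless of N.
rng = np.random.default_rng(4)
N = 16
A = rng.normal(size=(N, N))
A = A.T @ A + np.eye(N)                      # symmetric positive definite
w = rng.normal(size=N)

def loss(w):
    return 0.5 * w @ A @ w

true_grad = A @ w
c = 1e-3
iters = 5000
est = np.zeros(N)
for _ in range(iters):
    delta = rng.choice([-1.0, 1.0], size=N)  # Rademacher perturbation
    # two perturbed evaluations -> simultaneous estimate of all N components
    g_hat = (loss(w + c * delta) - loss(w - c * delta)) / (2 * c) * (1.0 / delta)
    est += g_hat
est /= iters

# each single estimate is noisy but unbiased; the average tracks the gradient
rel_err = np.linalg.norm(est - true_grad) / np.linalg.norm(true_grad)
assert rel_err < 0.2
```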

Experiment: Compute a Gradient by Hand

$f(x, y, z) = x^2 y + y z + z^3$. Find $\nabla f$ and evaluate it at $(1, 2, 3)$.

Partials:

$\frac{\partial f}{\partial x} = 2xy$ → at $(1,2,3)$: $2 \cdot 1 \cdot 2 = 4$.

$\frac{\partial f}{\partial y} = x^2 + z$ → $1 + 3 = 4$.

$\frac{\partial f}{\partial z} = y + 3z^2$ → $2 + 27 = 29$.

Gradient:

$$\nabla f(1, 2, 3) = (4, 4, 29)$$

Interpretation: at $(1, 2, 3)$, the steepest-ascent direction is $(4, 4, 29)$. The $z$ direction dominates — nudging $z$ changes $f$ the most. Steepest descent: a small step along $(-4, -4, -29)$.

Numerical check: $f(1, 2, 3) = 1 \cdot 2 + 2 \cdot 3 + 27 = 35$.

$f(1.01, 2, 3) = 1.0201 \cdot 2 + 6 + 27 = 2.0402 + 33 = 35.0402$.

Change: 0.0402. Estimate (gradient × delta): $4 \cdot 0.01 = 0.04$. Consistent.
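The same check in code; a minimal sketch using central differences:

```python
# Sketch: central-difference gradient of f(x, y, z) = x^2 y + y z + z^3.
def f(x, y, z):
    return x ** 2 * y + y * z + z ** 3

def num_grad(p, eps=1e-6):
    grads = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))
    return grads

g = num_grad((1.0, 2.0, 3.0))
# g is approximately [4, 4, 29], matching the hand computation
```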

SIDRA parallel:

If $x, y, z$ are SIDRA weights (conductances) and $f$ is the loss:

  • Apply a small perturbation $\Delta G$ to each weight.
  • Measure the new loss (a small difference).
  • Gradient estimate ≈ (new loss − old loss) / $\Delta G$.

That’s SPSA-style gradient estimation. On a 256×256 crossbar, perturbations to all weights can be applied at once → one batched MVM estimates the entire gradient vector. SIDRA Y100 target.


Lab Exercise

Estimate a layer’s gradient on a SIDRA crossbar via SPSA.

Scenario:

  • 256×256 crossbar, 65,536 weights.
  • Loss $L$ computed in external CMOS (e.g. cross-entropy).
  • Goal: estimate $\partial L / \partial G_{ij}$ for each weight.

SPSA algorithm:

  1. Generate a random $\pm 1$ perturbation matrix $\Delta$ (256×256).
  2. Original weights: $\mathbf{G}$.
  3. Perturb: $\mathbf{G}^+ = \mathbf{G} + c \Delta$, $\mathbf{G}^- = \mathbf{G} - c \Delta$ (small $c$, e.g. $0.01 \cdot G_{\max}$).
  4. Two separate MVM + loss evaluations: $L^+ = L(\mathbf{G}^+)$, $L^- = L(\mathbf{G}^-)$.
  5. Gradient estimate: $\hat{g}_{ij} = \frac{L^+ - L^-}{2c\, \Delta_{ij}}$.
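The five steps above can be sketched in Python on a shrunk crossbar (8×8 instead of 256×256; the toy target-matching MSE loss is an assumption):

```python
import numpy as np

# Sketch of the lab's SPSA steps on a small 8x8 "crossbar" (256x256 in the
# lab, shrunk here). Assumed toy loss: MSE between y = G x and a target.
rng = np.random.default_rng(5)
n = 8
G = rng.uniform(0.1, 1.0, size=(n, n))        # conductance matrix
x = rng.normal(size=n)
target = rng.normal(size=n)

def loss(G):
    return np.mean((G @ x - target) ** 2)

Delta = rng.choice([-1.0, 1.0], size=(n, n))  # step 1: random +/-1 matrix
c = 0.01 * G.max()                            # step 3: small c
L_plus = loss(G + c * Delta)                  # step 4: two perturbed
L_minus = loss(G - c * Delta)                 #          MVM + loss evals
g_hat = (L_plus - L_minus) / (2 * c * Delta)  # step 5: per-weight estimate

assert g_hat.shape == G.shape                 # one estimate per weight
```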

Questions:

(a) How many MVMs per iteration? (How many does classical backprop need?)
(b) SIDRA Y1 MVM = 10 ns. How long is one SPSA iteration?
(c) Total time and energy for 100 iterations?
(d) Classical backprop (2 MVMs = 20 ns + 50 ns backward overhead) ≈ 70 ns/iter. Compare.
(e) How noisy is SPSA’s gradient vs backprop? When is it acceptable?

Solutions

(a) SPSA: 2 MVMs/iteration (two perturbed forwards). Classical backprop: 1 forward + 1 backward = 2 MVMs (same count, but backward is more complex).

(b) One iteration: 2 MVM × 10 ns + 100 ns loss compute + 100 ns perturbation write = ~220 ns.

(c) 100 iterations: 22 µs. SIDRA Y1 at 3 W → energy: 3 W × 22 µs = 66 µJ. Small.

(d) Backprop 70 ns/iter × 100 = 7 µs. SPSA 22 µs (3× slower). But backprop hardware is complex; SPSA is simple. Trade-off: speed vs hardware complexity.

(e) SPSA noise: $O(1/\sqrt{N})$ estimation error per iteration. Averaged over 100 iterations, the error drops. Acceptable for online learning, edge fine-tuning, and when hardware simplicity matters. Unacceptable for from-scratch training (full backprop wins on total FLOPs).

Conclusion: SPSA or similar gradient-free methods are a sensible prototype path for SIDRA Y10.

Cheat Sheet

  • Derivative: local rate of change, $f'(x) = \lim_{h \to 0} (f(x+h) - f(x))/h$.
  • Gradient: the vector of partial derivatives; points in the steepest-ascent direction.
  • Chain rule: $\partial L/\partial w = \partial L/\partial y \cdot \partial y/\partial z \cdot \partial z/\partial w$. Backprop’s mathematical atom.
  • Jacobian: the derivative matrix of a vector-to-vector function. For an MVM, $\mathbf{J} = \mathbf{W}$.
  • Transpose MVM: $\partial L/\partial \mathbf{x} = \mathbf{W}^\top \partial L/\partial \mathbf{y}$. The backward pass.
  • SIDRA strategy: Y1 inference-only; Y10 hybrid; Y100 full analog backward; SPSA/forward-forward as prototype candidates.

Vision: Analog Gradient Hardware

Today’s gradient compute is digital (GPU). If it goes analog, AI training energy drops by orders of magnitude:

  • Y1 (today): Gradients on GPU only. SIDRA = inference.
  • Y3 (2027): Prototype last-layer gradient + SPSA trial.
  • Y10 (2029): Hybrid gradient — last N layers analog, rest digital. Edge fine-tuning.
  • Y100 (2031+): Full analog backward + forward-forward. Training 100× cheaper.
  • Y1000 (long horizon): Analog equilibrium propagation. Training + inference on the same device, continuously.

Meaning for Türkiye: we can skip the GPU training-data-center race through ASIC + power advantage. Analog gradient + STDP + R-STDP combination → brain-budgeted AI. SIDRA is a credible first manufacturer in this category.

Unexpected future: gradient-native hardware language. A programming model where gradients are first-class citizens. SIDRA’s software stack (Module 6) heads in that direction.

Further Reading

  • Next chapter: 4.4 — Probability and Noise
  • Previous: 4.2 — Ohm + Kirchhoff = Analog MVM
  • Classical calculus: Stewart, Calculus: Early Transcendentals — the standard.
  • Vector calculus: Marsden & Tromba, Vector Calculus.
  • Backprop math: Rumelhart, Hinton, Williams, Nature 1986.
  • SPSA original: Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE TAC 1992.
  • Analog gradient hardware: Ambrogio et al., Equivalent-accuracy accelerated neural-network training using analog memory, Nature 2018.