📐 Module 4 · The Math Arsenal · Chapter 4.3 · 12 min read

Derivative and Gradient

The calculus that sets the direction of learning — why SIDRA cares about it for training.

What you'll learn here

  • Write single-variable derivative and multi-variable gradient definitions
  • Show the gradient is the direction of steepest ascent
  • Restate that the chain rule is the mathematical atom of backprop
  • Compute the Jacobian of an MVM and explain why it matters for SIDRA
  • Summarize the difficulty of gradient estimation in analog hardware and SIDRA's practical workarounds

Hook: The Math of Finding Direction

You’re on a hill and want to descend. It’s foggy, but you can feel the slope under your feet. Which way do you step? The direction of steepest descent. That direction is always the negative gradient.

Every bit of AI training boils down to this:

  1. Compute the value of the loss function (your altitude in the fog).
  2. Compute the loss gradient with respect to every weight (the slope).
  3. Take a small step along the negative gradient (descend).
  4. Repeat.

Billions of weights, trillions of steps. But each step is a calculus operation — a derivative.

The interesting question for SIDRA: can we compute the gradient physically? Ohm + KCL give us MVM (4.2). The gradient is also mathematically an MVM (a Jacobian product). So in principle, the crossbar can do backprop. Practical hurdles exist (discussed in 3.6), but the direction is right.

This chapter builds derivative and gradient from scratch, revisits the chain rule, and outlines SIDRA’s analog gradient-compute strategy.

Intuition: Derivative = Local Slope

What is a derivative?

Given a function $f(x)$, we want to know how fast it changes at a point $x_0$. Definition:

$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$

Intuition: if $x$ wiggles a little (by $h$), how much does $f$ move? The ratio is the derivative.

Example: $f(x) = x^2$, so $f'(x) = 2x$. At $x = 3$ the derivative is 6 → wiggle $x$ up by 0.01 and $f$ rises by about 0.06 ($f(3) = 9$, $f(3.01) = 9.0601$).
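The wiggle check above can be reproduced in a few lines of Python: a minimal sketch of the difference quotient (function and step sizes chosen here for illustration).

```python
# Minimal sketch: the difference quotient for f(x) = x^2 at x0 = 3.
def f(x):
    return x ** 2

def diff_quotient(f, x0, h):
    # (f(x0 + h) - f(x0)) / h approaches f'(x0) as h -> 0
    return (f(x0 + h) - f(x0)) / h

for h in (0.1, 0.01, 0.001):
    print(h, diff_quotient(f, 3.0, h))   # values approach 6
```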

Single-variable derivative rules:

  • $(c)' = 0$ (constant)
  • $(x^n)' = n x^{n-1}$
  • $(e^x)' = e^x$
  • $(\ln x)' = 1/x$
  • $(\sin x)' = \cos x$
  • Sum: $(f + g)' = f' + g'$
  • Product: $(fg)' = f'g + fg'$
  • Chain: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$
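As a sanity check on the chain rule entry above, a short numerical sketch (the function $\sin(x^2)$ is an illustrative choice, not from the text):

```python
import math

# Sketch: check the chain rule on h(x) = sin(x^2), so h'(x) = cos(x^2) * 2x.
def h_fn(x):
    return math.sin(x ** 2)

def h_prime_analytic(x):
    return math.cos(x ** 2) * 2 * x

def h_prime_numeric(x, eps=1e-6):
    # central difference: (h(x+eps) - h(x-eps)) / (2 eps)
    return (h_fn(x + eps) - h_fn(x - eps)) / (2 * eps)

assert abs(h_prime_analytic(1.3) - h_prime_numeric(1.3)) < 1e-6
```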

Gradient — multivariable derivative:

$f: \mathbb{R}^N \to \mathbb{R}$ (vector to scalar). Take partial derivatives with respect to each variable:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_N} \right)$$

A vector: the direction of steepest ascent. Its negative: steepest descent.

What it means in AI: the loss $L$ is a function of the weight vector $\mathbf{w}$. The gradient $\nabla_{\mathbf{w}} L$ tells you which way to change the weights to raise the loss; the opposite direction lowers it. That’s the whole math of optimization.
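The steepest-ascent claim can be checked empirically. A hedged sketch: sample many random unit directions and confirm none has a larger directional derivative than the gradient direction (the function $f(x, y) = x^2 + 3y^2$ is an assumed example).

```python
import numpy as np

# Sketch: among unit directions, the gradient direction maximizes the
# directional derivative. Example function f(x, y) = x^2 + 3y^2 (assumed).
def grad_f(v):
    x, y = v
    return np.array([2 * x, 6 * y])

p = np.array([1.0, 1.0])
g = grad_f(p)
g_unit = g / np.linalg.norm(g)

rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)           # random unit direction
    best = max(best, g @ d)          # directional derivative along d

# no sampled direction beats the gradient direction (Cauchy-Schwarz)
assert best <= g @ g_unit + 1e-12
```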

Formalism: Partial Derivatives, Jacobian, Chain Rule

L1 · Beginner

Partial derivative:

Hold one variable, treat others as constants:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(\ldots, x_i + h, \ldots) - f(\ldots, x_i, \ldots)}{h}$$

Example: $f(x, y) = x^2 y + 3y$. Then $\frac{\partial f}{\partial x} = 2xy$ and $\frac{\partial f}{\partial y} = x^2 + 3$.

Gradient (compact):

$\nabla f(\mathbf{x})$ = the vector of all partial derivatives. The steepest-ascent direction.

Practical compute — in AI:

A neuron’s output: $a = f(z)$, with pre-activation $z = \mathbf{w}^\top \mathbf{x}$. Loss: $L = (a - y)^2$.

What’s the gradient with respect to $\mathbf{w}$? Chain rule:

$$\nabla_{\mathbf{w}} L = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}}$$
  • $\partial L / \partial a = 2(a - y)$
  • $\partial a / \partial z = f'(z)$ (activation derivative)
  • $\partial z / \partial \mathbf{w} = \mathbf{x}$

Result: $\nabla_{\mathbf{w}} L = 2(a - y) f'(z)\, \mathbf{x}$.

A scalar-times-vector operation. For SIDRA: $\mathbf{x}$ is already at the crossbar input; $(a - y) f'(z)$ is a scalar → multiply every component of $\mathbf{x}$ by that scalar to get the gradient vector.
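A small sketch of this neuron gradient, with an assumed sigmoid activation (so $f'(z) = a(1 - a)$), checked against central differences:

```python
import numpy as np

# Sketch (assumed sigmoid activation): gradient 2(a - y) f'(z) x checked
# against central differences. For sigmoid, f'(z) = a (1 - a).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(w @ x)
    return (a - y) ** 2

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
y = 0.7

a = sigmoid(w @ x)
analytic = 2 * (a - y) * a * (1 - a) * x

eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```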

L2 · Full

Jacobian matrix:

$f: \mathbb{R}^N \to \mathbb{R}^M$, vector-to-vector. Sensitivity of each output to each input:

$$\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$$

$\mathbf{J}$ is $M \times N$. The gradient is the $M = 1$ case (a scalar-valued function).

Jacobian of an MVM:

$\mathbf{y} = \mathbf{W} \mathbf{x}$. Jacobian with respect to $\mathbf{x}$:

$$\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{W}$$

So: the Jacobian of an MVM is the matrix itself. Programmed into the crossbar, readable. That’s the mathematical foundation for why a backward pass can run naturally on SIDRA.

Backward pass = transpose MVM:

Given $\frac{\partial L}{\partial \mathbf{y}}$ at a layer output, we want the input gradient:

$$\frac{\partial L}{\partial \mathbf{x}} = \mathbf{J}^\top \frac{\partial L}{\partial \mathbf{y}} = \mathbf{W}^\top \frac{\partial L}{\partial \mathbf{y}}$$

Also an MVM, with $\mathbf{W}^\top$ instead of $\mathbf{W}$. The SIDRA crossbar can swap row/column roles to run a transpose MVM → the backward pass is feasible in hardware.
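A quick numerical sketch of the transpose-MVM backward pass (the loss $L = 0.5\lVert\mathbf{Wx}\rVert^2$ is an assumed example, so $\partial L/\partial \mathbf{y} = \mathbf{y}$):

```python
import numpy as np

# Sketch: for y = W x and a downstream loss, the input gradient is W^T (dL/dy).
# Assumed loss here: L = 0.5 * ||W x||^2, so dL/dy = y.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

y = W @ x
grad_y = y                      # dL/dy for this loss
grad_x = W.T @ grad_y           # transpose MVM = backward pass

def L(x):
    y = W @ x
    return 0.5 * y @ y

eps = 1e-6
numeric = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_x, numeric, atol=1e-5)
```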

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^\top$$

This is an outer product — two vectors become a matrix. Its size is $M \times N$, the same as the weight matrix. Outer products are hard in hardware; specialized circuits are required (the 5.9 compute engine).
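A sketch of the outer-product weight gradient under an assumed loss $L = 0.5\lVert\mathbf{Wx}\rVert^2$, verified entry by entry against finite differences:

```python
import numpy as np

# Sketch: the weight gradient of the assumed loss L = 0.5 * ||W x||^2 is the
# outer product (dL/dy) x^T, with shape M x N matching W.
rng = np.random.default_rng(2)
M, N = 4, 3
W = rng.normal(size=(M, N))
x = rng.normal(size=N)

grad_y = W @ x                       # dL/dy = y for this loss
grad_W = np.outer(grad_y, x)         # outer product, M x N
assert grad_W.shape == W.shape

def L(W):
    y = W @ x
    return 0.5 * y @ y

# finite-difference check, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(M):
    for j in range(N):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (L(W + E) - L(W - E)) / (2 * eps)
assert np.allclose(grad_W, numeric, atol=1e-5)
```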

L3 · Deep

Multi-layer chain rule:

$L$ is the loss, $\mathbf{a}^{(l)}$ the output of layer $l$. Backprop:

$$\delta^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$$

We saw this in 3.6. Each layer = one transpose MVM + elementwise multiply (Hadamard).
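The delta recursion can be sketched on a tiny two-layer tanh network (architecture, loss, and sizes assumed for illustration); a finite-difference check confirms that one transpose MVM plus a Hadamard product reproduces $\partial L/\partial \mathbf{z}^{(1)}$.

```python
import numpy as np

# Sketch of delta^(l) = (W^(l+1))^T delta^(l+1) ⊙ f'(z^(l)) on an assumed
# 2-layer tanh network with L = 0.5 * ||a2 - t||^2. For tanh, f'(z) = 1 - a^2.
rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = rng.normal(size=2)

def loss_from_z1(z1):
    a1 = np.tanh(z1)
    a2 = np.tanh(W2 @ a1)
    return 0.5 * np.sum((a2 - t) ** 2)

z1 = W1 @ x
a1 = np.tanh(z1)
a2 = np.tanh(W2 @ a1)

delta2 = (a2 - t) * (1 - a2 ** 2)            # dL/dz2
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)     # transpose MVM + Hadamard

eps = 1e-6
numeric = np.array([
    (loss_from_z1(z1 + eps * e) - loss_from_z1(z1 - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
assert np.allclose(delta1, numeric, atol=1e-5)
```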

SIDRA backward pass:

Forward pass is natural on the crossbar:

  • $V_i = a_i^{(l-1)}$ (row input)
  • $I_j = \sum_i G_{ij} V_i = z_j^{(l)}$ (column output)

For the backward pass, transpose is needed:

  • $V_j = \delta_j^{(l+1)}$ (input on columns)
  • $I_i = \sum_j G_{ij} V_j$ (output from rows)

Physically possible — the crossbar is symmetric. But practical issues:

  1. Activation derivative: $f'(\mathbf{z})$ requires an elementwise multiply. Extra circuitry.
  2. Noise accumulation: each backward pass adds ~5% noise. In a 10-layer network, the gradient degrades.
  3. Outer product: needed for the weight gradient. Not direct on the crossbar; two MVMs + an outer-product accelerator.

SIDRA strategy (realistic):

  • Y1 (today): No backprop. Training on GPU.
  • Y3 (2027): Prototype transpose MVM for the last-layer + bias update.
  • Y10 (2029): Hybrid training on the last 2-3 layers (transfer learning). Gradient noise tolerable.
  • Y100 (2031+): Full analog backward pass + circuits for outer product.

Gradient estimation techniques:

Estimate rather than compute the exact gradient (much cheaper):

  1. Finite difference: $\partial f/\partial w_i \approx (f(\mathbf{w} + h\mathbf{e}_i) - f(\mathbf{w}))/h$. One parameter at a time; crude.
  2. Simultaneous perturbation (SPSA): random perturbation → two inferences → gradient estimate. Parallelizable. Interesting for SIDRA.
  3. Forward-mode AD: tiny parallel changes across all weights → observe output shifts. Hardware-cheap.
  4. Feedback alignment: random backward matrix → approximates backprop. Used in SNNs.
  5. Evolution strategies: gradient-free — weight mutation + selection. Candidate for SIDRA online learning.

SIDRA Y100 vision: SPSA + forward-forward combination. Gradient never explicitly computed, but the end effect matches (local, parallel, analog).
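A sketch of the SPSA estimator from item 2 on an assumed quadratic loss (dimensions, constants, and iteration count are illustrative): two function evaluations per iteration, and the averaged estimate approaches the true gradient.

```python
import numpy as np

# Sketch of SPSA on an assumed quadratic loss L(w) = 0.5 w^T A w, whose true
# gradient is A w. Each iteration costs two loss evaluations, regardless of N.
rng = np.random.default_rng(4)
N = 16
A = rng.normal(size=(N, N))
A = A.T @ A + np.eye(N)                      # symmetric positive definite
w = rng.normal(size=N)

def loss(w):
    return 0.5 * w @ A @ w

true_grad = A @ w
c = 1e-3
iters = 5000
est = np.zeros(N)
for _ in range(iters):
    delta = rng.choice([-1.0, 1.0], size=N)  # Rademacher perturbation
    # two perturbed evaluations -> simultaneous estimate of all N components
    g_hat = (loss(w + c * delta) - loss(w - c * delta)) / (2 * c) * (1.0 / delta)
    est += g_hat
est /= iters

# each single estimate is noisy but unbiased; the average tracks the gradient
rel_err = np.linalg.norm(est - true_grad) / np.linalg.norm(true_grad)
assert rel_err < 0.2
```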

Experiment: Compute a Gradient by Hand

$f(x, y, z) = x^2 y + y z + z^3$. Find $\nabla f$ and evaluate it at $(1, 2, 3)$.

Partials:

$\frac{\partial f}{\partial x} = 2xy$ → at $(1,2,3)$: $2 \cdot 1 \cdot 2 = 4$.

$\frac{\partial f}{\partial y} = x^2 + z$ → $1 + 3 = 4$.

$\frac{\partial f}{\partial z} = y + 3z^2$ → $2 + 27 = 29$.

Gradient:

$$\nabla f(1, 2, 3) = (4, 4, 29)$$

Interpretation: at $(1, 2, 3)$, the steepest-ascent direction is $(4, 4, 29)$. The $z$ direction dominates — nudging $z$ changes $f$ the most. Steepest descent: a small step along $(-4, -4, -29)$.

Numerical check: $f(1, 2, 3) = 1 \cdot 2 + 2 \cdot 3 + 27 = 35$.

$f(1.01, 2, 3) = 1.0201 \cdot 2 + 6 + 27 = 2.0402 + 33 = 35.0402$.

Change: 0.0402. Estimate (gradient × delta): $4 \cdot 0.01 = 0.04$. Consistent.
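The same check in code; a minimal sketch using central differences:

```python
# Sketch: central-difference gradient of f(x, y, z) = x^2 y + y z + z^3.
def f(x, y, z):
    return x ** 2 * y + y * z + z ** 3

def num_grad(p, eps=1e-6):
    grads = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))
    return grads

g = num_grad((1.0, 2.0, 3.0))
# g is approximately [4, 4, 29], matching the hand computation
```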

SIDRA parallel:

If $x, y, z$ are SIDRA weights (conductances) and $f$ is the loss:

  • Apply a small perturbation $\Delta G$ to each weight.
  • Measure the new loss (a small difference).
  • Gradient estimate ≈ (new loss − old loss) / $\Delta G$.

That’s SPSA-style gradient estimation. On a 256×256 crossbar, perturbations to all weights can be applied at once → one batched MVM estimates the entire gradient vector. SIDRA Y100 target.


Lab Exercise

Estimate a layer’s gradient on a SIDRA crossbar via SPSA.

Scenario:

  • 256×256 crossbar, 65,536 weights.
  • Loss $L$ computed in external CMOS (e.g. cross-entropy).
  • Goal: estimate $\partial L / \partial G_{ij}$ for each weight.

SPSA algorithm:

  1. Generate a random $\pm 1$ perturbation matrix $\Delta$ (256×256).
  2. Original weights: $\mathbf{G}$.
  3. Perturb: $\mathbf{G}^+ = \mathbf{G} + c \Delta$, $\mathbf{G}^- = \mathbf{G} - c \Delta$ (small $c$, e.g. $0.01 \cdot G_{\max}$).
  4. Two separate MVM + loss evaluations: $L^+ = L(\mathbf{G}^+)$, $L^- = L(\mathbf{G}^-)$.
  5. Gradient estimate: $\hat{g}_{ij} = \frac{L^+ - L^-}{2c\, \Delta_{ij}}$.
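The five steps above can be sketched in Python on a shrunk crossbar (8×8 instead of 256×256; the toy target-matching MSE loss is an assumption):

```python
import numpy as np

# Sketch of the lab's SPSA steps on a small 8x8 "crossbar" (256x256 in the
# lab, shrunk here). Assumed toy loss: MSE between y = G x and a target.
rng = np.random.default_rng(5)
n = 8
G = rng.uniform(0.1, 1.0, size=(n, n))        # conductance matrix
x = rng.normal(size=n)
target = rng.normal(size=n)

def loss(G):
    return np.mean((G @ x - target) ** 2)

Delta = rng.choice([-1.0, 1.0], size=(n, n))  # step 1: random +/-1 matrix
c = 0.01 * G.max()                            # step 3: small c
L_plus = loss(G + c * Delta)                  # step 4: two perturbed
L_minus = loss(G - c * Delta)                 #          MVM + loss evals
g_hat = (L_plus - L_minus) / (2 * c * Delta)  # step 5: per-weight estimate

assert g_hat.shape == G.shape                 # one estimate per weight
```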

Questions:

(a) How many MVMs per iteration? (How many does classical backprop need?)
(b) SIDRA Y1 MVM = 10 ns. How long is one SPSA iteration?
(c) Total time and energy for 100 iterations?
(d) Classical backprop (2 MVMs = 20 ns + 50 ns backward overhead) ≈ 70 ns/iter. Compare.
(e) How noisy is SPSA’s gradient vs backprop? When is it acceptable?

Solutions

(a) SPSA: 2 MVMs/iteration (two perturbed forwards). Classical backprop: 1 forward + 1 backward = 2 MVMs (same count, but backward is more complex).

(b) One iteration: 2 MVM × 10 ns + 100 ns loss compute + 100 ns perturbation write = ~220 ns.

(c) 100 iterations: 22 µs. SIDRA Y1 at 3 W → energy: 3 W × 22 µs = 66 µJ. Small.

(d) Backprop 70 ns/iter × 100 = 7 µs. SPSA 22 µs (3× slower). But backprop hardware is complex; SPSA is simple. Trade-off: speed vs hardware complexity.

(e) SPSA noise: $O(1/\sqrt{N})$ estimation error per iteration. Averaged over 100 iterations, the error drops. Acceptable for online learning, edge fine-tuning, and when hardware simplicity matters. Unacceptable for from-scratch training (full backprop wins on total FLOPs).

Conclusion: SPSA or similar gradient-free methods are a sensible prototype path for SIDRA Y10.

Cheat Sheet

  • Derivative: local rate of change, $f'(x) = \lim_{h \to 0} (f(x+h) - f(x))/h$.
  • Gradient: the vector of partial derivatives; points in the steepest-ascent direction.
  • Chain rule: $\partial L/\partial w = \partial L/\partial y \cdot \partial y/\partial z \cdot \partial z/\partial w$. Backprop’s mathematical atom.
  • Jacobian: the derivative matrix of a vector-to-vector function. For an MVM, $\mathbf{J} = \mathbf{W}$.
  • Transpose MVM: $\partial L/\partial \mathbf{x} = \mathbf{W}^\top \partial L/\partial \mathbf{y}$. The backward pass.
  • SIDRA strategy: Y1 inference-only; Y10 hybrid; Y100 full analog backward; SPSA/forward-forward as prototype candidates.

Vision: Analog Gradient Hardware

Today’s gradient compute is digital (GPU). If it goes analog, AI training energy drops by orders of magnitude:

  • Y1 (today): Gradients on GPU only. SIDRA = inference.
  • Y3 (2027): Prototype last-layer gradient + SPSA trial.
  • Y10 (2029): Hybrid gradient — last N layers analog, rest digital. Edge fine-tuning.
  • Y100 (2031+): Full analog backward + forward-forward. Training 100× cheaper.
  • Y1000 (long horizon): Analog equilibrium propagation. Training + inference on the same device, continuously.

Meaning for Türkiye: we can skip the GPU training-data-center race through ASIC + power advantage. Analog gradient + STDP + R-STDP combination → brain-budgeted AI. SIDRA is a credible first manufacturer in this category.

Unexpected future: gradient-native hardware language. A programming model where gradients are first-class citizens. SIDRA’s software stack (Module 6) heads in that direction.

Further Reading

  • Next chapter: 4.4 — Probability and Noise
  • Previous: 4.2 — Ohm + Kirchhoff = Analog MVM
  • Classical calculus: Stewart, Calculus: Early Transcendentals — the standard.
  • Vector calculus: Marsden & Tromba, Vector Calculus.
  • Backprop math: Rumelhart, Hinton, Williams, Nature 1986.
  • SPSA original: Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE TAC 1992.
  • Analog gradient hardware: Ambrogio et al., Equivalent-accuracy accelerated neural-network training using analog memory, Nature 2018.