🧠 Module 3 · From Biology to Algorithm · Chapter 3.6 · 14 min read

Backpropagation

The chain rule that trains modern AI — and why doing it in hardware is hard.

What you'll learn here

  • Define a loss function and why it's needed (MSE, cross-entropy)
  • Compute $dL/dw$ step by step using the chain rule
  • Distinguish SGD, mini-batch, momentum, and Adam optimizers
  • Explain vanishing/exploding gradients and the fixes (ReLU, batch norm, residual)
  • State why backprop is hard in hardware and how SIDRA's online-learning approach differs

Hook: The One Algorithm of 1986

In 1986 Rumelhart, Hinton, and Williams published Learning representations by back-propagating errors. Three pages. One algorithm. One mathematical idea: the chain rule.

That paper birthed modern AI. ChatGPT, AlphaGo, Tesla Autopilot — all trained with backprop. But backprop’s roots go further back: Werbos described the same algorithm in his 1974 PhD thesis (largely unnoticed), and Linnainmaa published it in 1970 as reverse-mode automatic differentiation. The 1986 contribution was showing it could train multi-layer neural networks.

Nearly 40 years later: GPT-4 reportedly ran backprop for months on a cluster of roughly 25,000 A100 GPUs. Same algorithm; model scale grew roughly 10⁹×.

Why this matters for SIDRA: backprop is for training, not inference. SIDRA Y1 is inference-focused — training still happens on GPUs. But doing backprop in hardware would break the GPU dependency. That’s Y100’s real claim. This chapter unpacks both the math and the hardware difficulty.

Intuition: The Error Signal Walks Backward

Training a neural network is a four-step loop:

  1. Forward pass: data goes in → output computed.
  2. Loss compute: output vs target → loss $L$ measured.
  3. Backward pass: sensitivity of $L$ to each weight ($dL/dw$) computed.
  4. Update: $w \leftarrow w - \eta \cdot dL/dw$ (SGD).

Repeat: millions of times. Each iteration, a small step. Optimization = roll downhill.

Loss function — why? Network output isn’t right/wrong as a single bit; “how close” must be a number. Two popular choices:

  • MSE (Mean Squared Error): $L = \frac{1}{N} \sum (y - \hat{y})^2$. For regression.
  • Cross-entropy: $L = -\sum y \log \hat{y}$. For classification.

Both are differentiable — you can take a gradient.
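Both losses fit in a few lines. A minimal NumPy sketch (function names are my own, not from any particular library), including the MSE gradient that the backward pass will consume:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error plus its gradient w.r.t. the prediction."""
    return np.mean((y - y_hat) ** 2), 2 * (y_hat - y) / y.size

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy for a one-hot target y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

loss, grad = mse(np.array([1.0]), np.array([0.269]))
# loss = (1 - 0.269)^2 ≈ 0.534; grad = 2(0.269 - 1) ≈ -1.462
```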

The intuition behind backprop: the error at the output came from the last layer’s weights. “If this weight had been a bit larger, would the error have shrunk?” — that question is $dL/dw_{\text{last}}$. The same question applies to the previous layer, but indirectly (chain rule). The error signal walks backward from output to input, answering “am I to blame?” for every weight.

Versus Hebbian: Hebbian uses only pre + post info → local. Backprop pushes the error signal globally → global. Backprop is more powerful (targeted learning), but harder in hardware.

Formalism: From the Chain Rule to Adam Optimizer

L1 · Basics

Single-neuron example:

$z = wx + b$, $a = f(z)$, $L = (a - y)^2$.

Question: how much does $L$ change when $w$ changes?

Chain rule:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$

Compute:

  • $dL/da = 2(a - y)$
  • $da/dz = f'(z)$
  • $dz/dw = x$

Result:

$$\frac{dL}{dw} = 2(a - y) \cdot f'(z) \cdot x$$

Update: $w \leftarrow w - \eta \cdot dL/dw$, where $\eta$ is the learning rate (~0.01).
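The single-neuron gradient above can be checked numerically in a few lines. A sketch (the sigmoid choice for $f$ is illustrative), verifying the chain-rule result against a central finite difference:

```python
import numpy as np

def f(z):                       # sigmoid as the example activation
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, b, y):
    return (f(w * x + b) - y) ** 2

w, x, b, y = 0.3, 0.5, 0.0, 1.0
a = f(w * x + b)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw, with sigmoid f' = f(1 - f)
dL_dw = 2 * (a - y) * a * (1 - a) * x

# Sanity check against a central finite difference
h = 1e-6
dL_dw_num = (loss(w + h, x, b, y) - loss(w - h, x, b, y)) / (2 * h)
```

If the two numbers disagree, the analytic gradient is wrong — the same finite-difference trick is the standard way to debug any hand-written backward pass.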

L2 · Full

Multi-layer network — backward pass:

$L$ is the loss, $\mathbf{a}^{(L)}$ the last-layer output. At each layer define:

$$\delta^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$$

Last layer: $\delta^{(L)} = \nabla_a L \odot f'(\mathbf{z}^{(L)})$.

Backprop equation: $\delta^{(l)}$ from $\delta^{(l+1)}$:

$$\delta^{(l)} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$$

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} \, (\mathbf{a}^{(l-1)})^\top$$

This is an outer product — a column vector times a row vector gives a matrix: one gradient entry per weight.
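These three equations are only a handful of NumPy lines. A sketch for a tiny tanh network with squared-error loss (the layer sizes are arbitrary), with a numerical spot-check of one gradient entry:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))   # layers: 4 -> 3 -> 2
x, y = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))

# Forward pass (tanh hidden layer, linear output)
z1 = W1 @ x
a1 = np.tanh(z1)
a2 = W2 @ a1

# Backward pass: delta recursion, then outer-product weight gradients
delta2 = 2 * (a2 - y)                              # output layer: f' = 1
delta1 = (W2.T @ delta2) * (1 - np.tanh(z1) ** 2)  # backprop equation
dW2 = delta2 @ a1.T                                # outer product -> shape of W2
dW1 = delta1 @ x.T

# Spot-check one entry of dW1 against a central finite difference
h = 1e-6
def L(W):
    return float(np.sum((W2 @ np.tanh(W @ x) - y) ** 2))
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += h; Wm[0, 0] -= h
num = (L(Wp) - L(Wm)) / (2 * h)
```

Note how `W2.T` appears in the delta recursion — the backward pass reuses the forward weights, transposed, which is exactly what makes hardware backprop need a transpose read.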

Optimization algorithms:

  • SGD (Stochastic Gradient Descent): $w \leftarrow w - \eta g$
  • SGD + Momentum: $v \leftarrow \mu v + g$, $w \leftarrow w - \eta v$. Smooths noisy gradients and helps escape shallow local minima.
  • AdaGrad: per-weight learning rate: $w \leftarrow w - \eta g / \sqrt{\sum g^2}$
  • RMSProp: AdaGrad with a decaying average of squared gradients.
  • Adam (Kingma & Ba 2015): momentum + RMSProp combined; tracks both the first and second moment of the gradient. The modern default.

Mini-batch SGD: $L = \frac{1}{B} \sum_{i \in \text{batch}} L_i$, where $B$ is the batch size, typically 32–512. The stochastic noise also helps escape local minima.
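The Adam update — the combination of momentum and RMSProp described above — written out as a minimal sketch (the hyperparameter defaults follow the usual conventions; `lr=0.05` is larger than the common 0.001 so the toy problem converges in few steps):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) + RMSProp (second moment)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)      # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(w) = w^2 starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    g = 2 * w                      # gradient of w^2
    w, m, v = adam_step(w, g, m, v, t)
# w ends up near the minimum at 0
```

Plain SGD is the one-liner `w -= lr * g`; everything Adam adds is bookkeeping about the gradient's history.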

L3 · Deep

Vanishing/exploding gradients:

In deep networks, the backward recursion $\delta^{(l)} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$ is applied once per layer. If $|f'| < 1$ everywhere (true for sigmoid and tanh): $\delta$ shrinks exponentially with depth → vanishing gradient → the early layers stop learning.

If instead the per-layer factors $\|W\| \cdot |f'|$ exceed 1: $\delta$ grows exponentially → exploding gradient → training instability.

Fixes (the foundations of modern deep learning):

  1. ReLU activation: $f'(z)$ is $0$ or $1$ — no shrinking factor on the active path ($z > 0$).
  2. Batch normalization (Ioffe & Szegedy 2015): normalize each layer’s outputs → smoother gradient flow.
  3. Residual connections (He et al. 2015, ResNet): $\mathbf{a}^{(l+1)} = \mathbf{a}^{(l)} + f(\ldots)$. The gradient flows back through the identity path → 1000+ layers become trainable.
  4. LSTM gates (for RNNs): gates open and close to preserve the gradient over time.
  5. Gradient clipping: if $|g|$ exceeds a threshold, rescale it. Prevents explosions.
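The exponential shrinkage is easy to see numerically. A toy sketch: with $\|W\| \approx 1$, the gradient scale after $d$ layers is roughly $|f'|^d$:

```python
# Rough gradient scale after `depth` layers, assuming |W| ~ 1 so that
# only the activation derivative f' multiplies in at each layer.
def grad_scale(depth, fprime):
    return fprime ** depth

sigmoid_scale = grad_scale(50, 0.25)   # sigmoid's f' peaks at 0.25
relu_scale = grad_scale(50, 1.0)       # ReLU's f' is exactly 1 when active
# sigmoid: ~8e-31 (vanished); ReLU: 1.0 (survives)
```

Fifty sigmoid layers attenuate the error signal by thirty orders of magnitude — no float precision or learning-rate trick recovers that, which is why the fixes above changed the activation and the architecture instead.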

Hardware difficulty for backprop:

Backprop poses three difficulties:

  1. Bidirectional dataflow: forward + backward. Hardware needs two distinct signal paths.
  2. Non-local information: $dL/dw$ depends on information from across the network. A memristor cell cannot know the gradients of other cells (non-local).
  3. High precision: gradients can be very small (~10⁻⁶). 8-bit memristors (256 levels) → quantization error corrupts the gradient.

Hardware attempts:

  • Equilibrium propagation (Scellier & Bengio 2017): energy-based, local learning. Still at the prototype stage.
  • Feedback alignment (Lillicrap et al. 2016): replaces the transposed weights with fixed random backward projections. Local, with surprisingly small accuracy loss on modest networks.
  • Forward-forward (Hinton 2022): no backward pass at all; two forward passes are compared.
  • In-memory backprop: all weights stay in the crossbar; the backward pass is a transposed crossbar read. Active research direction for SIDRA.

SIDRA strategy (realistic):

  • Y1-Y3: inference-focused, training on external GPU.
  • Y10: hybrid training — last layer updates on SIDRA (transfer learning); earlier layers frozen.
  • Y100: online incremental learning — Hebbian + reinforcement signal. Not exact backprop, but similar effect.
  • Y1000: hardware backprop — new devices (ferroelectric FET, magneto-tunnel junction).

Experiment: Backprop Steps in a 2-Layer Network

Data: $x = 0.5$, target $y = 1$.

Network:

  • Hidden: $h = \sigma(w_1 x + b_1)$, init $w_1 = 0.3$, $b_1 = 0$, $\sigma(z) = 1/(1+e^{-z})$
  • Output: $\hat{y} = w_2 h + b_2$, init $w_2 = 0.5$, $b_2 = 0$
  • Loss: $L = (\hat{y} - y)^2$

Forward pass:

  • $z_1 = 0.3 \times 0.5 + 0 = 0.15$
  • $h = \sigma(0.15) = 0.537$
  • $\hat{y} = 0.5 \times 0.537 + 0 = 0.269$
  • $L = (0.269 - 1)^2 = 0.534$

Backward pass:

  • $dL/d\hat{y} = 2(0.269 - 1) = -1.462$
  • $dL/dw_2 = dL/d\hat{y} \times d\hat{y}/dw_2 = -1.462 \times 0.537 = -0.785$
  • $dL/dh = dL/d\hat{y} \times d\hat{y}/dh = -1.462 \times 0.5 = -0.731$
  • $dh/dz_1 = \sigma(0.15)(1 - \sigma(0.15)) = 0.537 \times 0.463 = 0.249$
  • $dL/dw_1 = dL/dh \times dh/dz_1 \times dz_1/dw_1 = -0.731 \times 0.249 \times 0.5 = -0.091$

Update ($\eta = 1$, deliberately large for this demo):

  • $w_2 \leftarrow 0.5 - 1 \times (-0.785) = 1.285$
  • $w_1 \leftarrow 0.3 - 1 \times (-0.091) = 0.391$

New forward pass:

  • $z_1 = 0.391 \times 0.5 = 0.196$
  • $h = \sigma(0.196) = 0.549$
  • $\hat{y} = 1.285 \times 0.549 = 0.706$
  • $L = (0.706 - 1)^2 = 0.087$

Loss dropped from 0.534 to 0.087 — a 6× improvement in one iteration. That’s how backprop works: each step nudges every weight in the right direction.
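The whole worked example fits in a short script, which is handy for checking the hand arithmetic (a sketch; biases are left frozen, matching the numbers above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.3, 0.0, 0.5, 0.0
eta = 1.0                      # deliberately large for the demo

# Forward pass
h = sigmoid(w1 * x + b1)
y_hat = w2 * h + b2
L0 = (y_hat - y) ** 2

# Backward pass (exactly the chain-rule steps above)
dL_dyhat = 2 * (y_hat - y)
dL_dw2 = dL_dyhat * h
dL_dw1 = dL_dyhat * w2 * h * (1 - h) * x

# Update, then re-run the forward pass
w2 -= eta * dL_dw2
w1 -= eta * dL_dw1
h = sigmoid(w1 * x + b1)
L1 = (w2 * h + b2 - y) ** 2
# L0 ≈ 0.53 drops to L1 ≈ 0.09 after a single step
```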

SIDRA parallel: to do this on a memristor crossbar: forward pass → analog MVM (easy, natural). Backward pass → MVM with the transpose of the weight matrix. SIDRA crossbars support transpose reads but need extra circuitry + calibration. Planned in the Y10 prototype.
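The precision difficulty from L3 can be illustrated with an idealized 8-bit crossbar model (a hypothetical quantization model for illustration, not SIDRA’s actual device physics):

```python
import numpy as np

def quantize(W, lo=-4.0, hi=4.0, bits=8):
    """Snap weights to one of 2**bits evenly spaced conductance levels
    (an idealized, hypothetical memristor model)."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(W, lo, hi) - lo) / step) * step

rng = np.random.default_rng(1)
W = quantize(rng.normal(size=(64, 64)))   # weights stored in the crossbar
x = rng.normal(size=(64,))
delta = rng.normal(size=(64,))

y_forward = W @ x           # forward pass: ordinary crossbar read
g_backward = W.T @ delta    # backward pass: transpose read of the same array

# The catch: a gradient update smaller than one conductance step
# (8 / 255 ≈ 0.031 here) is rounded away entirely.
tiny_update = 1e-6
assert np.array_equal(quantize(W - tiny_update), W)
```

The transpose read itself is just `W.T @ delta` mathematically; the hard part is that typical gradient magnitudes (~10⁻⁶) fall far below one 8-bit conductance step, so naive in-memory updates are silently lost.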

Quick Quiz

1/6 — What's the mathematical foundation of backprop?

Lab Exercise

Backprop budget for training GPT-3.

Data:

  • GPT-3: 175 billion parameters
  • Training tokens: ~300 billion
  • Total training FLOPs: ~6 × params × tokens = 6 × 1.75 × 10¹¹ × 3 × 10¹¹ = 3.15 × 10²³ FLOP (forward ≈ 2 × params × tokens; backward ≈ 2× the forward pass)
  • NVIDIA A100: 312 TFLOPS peak (BF16); assume ~40% sustained utilization → ~1.25 × 10¹⁴ FLOP/s
  • Total training time (compute-bound): 3.15 × 10²³ / 1.25 × 10¹⁴ ≈ 2.5 × 10⁹ A100-seconds ≈ 700,000 A100-hours
  • A100 SXM ≈ 400 W → 700,000 h × 0.4 kW ≈ 280 MWh of GPU power alone (parallelism reduces wall-clock time, not energy)
  • Patterson et al. 2021: 1287 MWh — whole-system power, interconnect, cooling, and PUE sit on top of the GPU-only figure

Questions:

(a) GPT-3 training A100-hours required? (b) How many A100s in parallel for a 30-day run? (c) How many SIDRA Y100 (analog) chips for inference of the same model? (d) Y1000 hypothesis (hardware backprop) energy estimate to train?

Solutions

(a) ~700,000 A100-hours. With 1024 A100s in parallel: 700K / 1024 ≈ 684 hours ≈ 28 days of core compute; in practice longer (data pipeline, synchronization, restarts).

(b) 30 days = 720 hours → 700K / 720 ≈ 970, i.e. on the order of 1000 A100s in parallel — consistent with public estimates of ~1000+ GPUs and with Patterson’s 1287 MWh.

(c) GPT-3 has 175B parameters. Y100 target: 100 billion memristors per chip, so 2 Y100 chips = 200B cells hold all of GPT-3. Inference: one forward pass costs ~2 × params ≈ 3.5 × 10¹¹ ops per token. At the Y100 target of ~3 × 10¹⁶ ops/s, that is ~12 µs per token. Two chips at ~100 W each → real-time GPT-3 inference at ~200 W.

(d) Y1000 hypothesis: if analog hardware drops energy/op ~1000×, then 1287 MWh / 1000 ≈ 1.3 MWh = 1287 kWh — roughly a few households’ monthly electricity use. GPT-3 training in an ordinary apartment. Speculative today, but the direction is right.
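The arithmetic in (a)–(b) is easy to re-derive. A back-of-envelope sketch — note the ~40% sustained-utilization figure is an assumption for this exercise, not a measured number:

```python
params = 175e9                       # GPT-3 parameters
tokens = 300e9                       # training tokens
total_flop = 6 * params * tokens     # ~6 FLOP per parameter per token (fwd + bwd)

peak = 312e12                        # A100 BF16 peak FLOPS
sustained = 0.4 * peak               # assumed ~40% utilization
a100_hours = total_flop / sustained / 3600

energy_mwh = a100_hours * 0.4 / 1000  # A100 SXM ~400 W, GPU board power only
# total_flop ≈ 3.15e23; roughly 7e5 A100-hours; ~280 MWh GPU-only
```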

Note: A100 → H100 → B200 → … GPUs evolve too. SIDRA isn’t competing head-on — it’s a category difference: the trained model runs on SIDRA for inference and edge-AI scenarios.

Cheat Sheet

  • Backprop = chain rule: $\partial L / \partial w$ as a product of local derivatives.
  • Four steps: forward pass → loss → backward pass → update.
  • Loss: MSE (regression), Cross-entropy (classification).
  • Optimizer family: SGD → SGD+momentum → AdaGrad → RMSProp → Adam (Kingma & Ba 2015, modern default).
  • Vanishing gradient: sigmoid kills it in deep nets. Fixes: ReLU + BatchNorm + Residual.
  • Hardware difficulty: bidirectional + non-local + high precision → doesn’t map to analog crossbars.
  • SIDRA strategy: inference-focused (Y1-Y3); hybrid training (Y10); online incremental (Y100); hardware backprop (Y1000 horizon).

Vision: Post-Backprop AI and SIDRA's Shot

Backprop has been AI’s foundation for 40 years. The next 20 are scouting alternatives:

  • Y1 (today): SIDRA inference, training on GPU (backprop). Hybrid economics.
  • Y3 (2027): Hybrid fine-tuning — last layer updates on SIDRA, the rest frozen.
  • Y10 (2029): Equilibrium propagation or forward-forward prototypes. Limited training on the SIDRA crossbar.
  • Y100 (2031+): Online + incremental + reinforcement learning integrated. Backprop’s role taken by “neuromorphic plasticity” — Hebbian + STDP + dopaminergic reward.
  • Y1000 (long horizon): Full backprop analog on bio-compatible devices. Not brain-style, but post-brain AI.

Strategic chance for Türkiye: the US/China race was built on backprop + huge GPU farms. SIDRA is built on an “alternative learning paradigm”. That category difference is Türkiye’s leadership chance — if we validate online learning in the Y10 prototype, we could open a category of our own.

Unexpected future: synthetic learning. Brain-like systems that learn from few examples and generalize from one image + one sentence. Classical backprop wants millions of examples; biology takes a handful. SIDRA’s online + sparse + plastic foundation fits this paradigm. The first large-scale few-shot AI prototype could be SIDRA Y10-Y100.

Further Reading

  • Next chapter: 3.7 — Memristor ↔ Synapse Mapping
  • Previous: 3.5 — From Artificial Neuron to Transformer
  • Backprop original: Rumelhart, Hinton, Williams, Learning representations by back-propagating errors, Nature 1986.
  • Werbos priority: P. Werbos, Beyond Regression: New Tools for Prediction and Analysis…, Harvard PhD thesis 1974.
  • Adam optimizer: Kingma & Ba, Adam: A Method for Stochastic Optimization, ICLR 2015.
  • ResNet: He et al., Deep residual learning for image recognition, CVPR 2016.
  • Equilibrium propagation: Scellier & Bengio, Equilibrium propagation: bridging the gap between energy-based models and backpropagation, Front. Comput. Neurosci. 2017.
  • Forward-forward: G. Hinton, The forward-forward algorithm, arXiv 2022.