Backpropagation
The chain rule that trains modern AI — and why doing it in hardware is hard.
Prerequisites
What you'll learn here
- Define a loss function and why it's needed (MSE, cross-entropy)
- Compute $dL/dw$ step by step using the chain rule
- Distinguish SGD, mini-batch, momentum, and Adam optimizers
- Explain vanishing/exploding gradients and the fixes (ReLU, batch norm, residual)
- State why backprop is hard in hardware and how SIDRA's online-learning approach differs
Hook: The One Algorithm of 1986
In 1986 Rumelhart, Hinton, and Williams published Learning representations by back-propagating errors. Three pages. One algorithm. One mathematical idea: the chain rule.
That paper birthed modern AI. ChatGPT, AlphaGo, Tesla Autopilot — all trained with backprop. But backprop’s roots go further back: Werbos’s 1974 PhD thesis described the same algorithm (largely unnoticed at the time), and Linnainmaa published the core technique in 1970 as reverse-mode automatic differentiation. The contribution of Hinton and colleagues was making it work for deep neural networks.
40 years later: GPT-4 reportedly ran backprop for months on tens of thousands of data-center GPUs. Same algorithm; scale grew ~10⁹×.
Why this matters for SIDRA: backprop is for training, not inference. SIDRA Y1 is inference-focused — training still happens on GPUs. But doing backprop in hardware would break the GPU dependency. That’s Y100’s real claim. This chapter unpacks both the math and the hardware difficulty.
Intuition: The Error Signal Walks Backward
Training a neural network is a four-step loop:
- Forward pass: data goes in → output computed.
- Loss compute: output vs target → loss measured.
- Backward pass: sensitivity of the loss to each weight ($\partial L/\partial w$) computed.
- Update: $w \leftarrow w - \eta\,\frac{\partial L}{\partial w}$ (SGD).
Repeat: millions of times. Each iteration, a small step. Optimization = roll downhill.
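The four-step loop can be sketched in a few lines. Here is a minimal, self-contained toy: one weight, squared error, data generated from $y = 2x$; the dataset, learning rate, and step count are illustrative choices, not from the chapter.

```python
import random

random.seed(0)  # reproducible run

# Toy model y = w*x, fit to data generated by y = 2x.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = 0.0      # initial weight
eta = 0.1    # learning rate

for step in range(200):
    x, t = random.choice(data)     # pick one training example
    y = w * x                      # 1) forward pass
    loss = (y - t) ** 2            # 2) loss compute (squared error)
    grad = 2 * (y - t) * x         # 3) backward pass: dL/dw by the chain rule
    w -= eta * grad                # 4) update (SGD)

print(w)  # converges toward 2.0
```

Each iteration takes one small downhill step; after a few hundred steps the weight settles near the true slope.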
Loss function — why? Network output isn’t right/wrong as a single bit; “how close” must be a number. Two popular choices:
- MSE (Mean Squared Error): $L = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. For regression.
- Cross-entropy: $L = -\sum_i y_i \log \hat{y}_i$. For classification.
Both are differentiable — you can take a gradient.
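Both losses can be written directly from the formulas; a minimal sketch in plain Python, with `y` the targets and `yhat` the network outputs:

```python
import math

def mse(y, yhat):
    """Mean squared error: for regression."""
    return sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) / len(y)

def cross_entropy(y, yhat):
    """Cross-entropy: for classification (one-hot y, softmax yhat)."""
    return -sum(yi * math.log(yh) for yi, yh in zip(y, yhat) if yi > 0)

print(mse([1.0, 2.0], [0.5, 2.5]))                # 0.25
print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # -ln(0.7), about 0.357
```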
The intuition behind backprop: the error at the output came from the last layer’s weights. “If this weight had been a bit larger, would the error have shrunk?” = $\partial L/\partial w$. Same question for the previous layer — but indirectly (chain rule). The error signal walks backward from output to input, answering “am I to blame?” for every weight.
Versus Hebbian: Hebbian uses only pre + post info → local. Backprop pushes the error signal globally → global. Backprop is more powerful (targeted learning), but harder in hardware.
Formalism: From the Chain Rule to Adam Optimizer
Single-neuron example:
$z = wx + b$, $a = \sigma(z)$, $L = \frac{1}{2}(a - t)^2$.
Question: how much does $L$ change when $w$ changes?
Chain rule: $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$
Compute: $\frac{\partial L}{\partial a} = a - t$, $\frac{\partial a}{\partial z} = \sigma(z)\,(1 - \sigma(z))$, $\frac{\partial z}{\partial w} = x$
Result: $\frac{\partial L}{\partial w} = (a - t)\,\sigma(z)\,(1 - \sigma(z))\,x$
Update: $w \leftarrow w - \eta\,\frac{\partial L}{\partial w}$, learning rate $\eta$ (~0.01).
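The single-neuron gradient is easy to check numerically. A sketch with arbitrary illustrative values for $w$, $b$, $x$, $t$, assuming the loss $L = \frac{1}{2}(a - t)^2$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single neuron: z = w*x + b, a = sigmoid(z), L = (a - t)^2 / 2.
w, b, x, t = 0.5, 0.1, 1.0, 1.0
z = w * x + b
a = sigmoid(z)

# Analytic gradient via the chain rule: dL/dw = (a - t) * sigma'(z) * x
grad = (a - t) * a * (1 - a) * x

# Numerical check by central finite difference
def loss(w_):
    a_ = sigmoid(w_ * x + b)
    return 0.5 * (a_ - t) ** 2

eps = 1e-6
num_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(grad, num_grad)  # the two agree to many decimal places
```

The finite-difference check is the standard way to validate a hand-derived gradient.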
Multi-layer network — backward pass:
Let $\delta^l \equiv \partial L/\partial z^l$ be the error term at layer $l$, with $L$ the loss and $a^L$ the last-layer output. At each layer:
Last layer: $\delta^L = \nabla_a L \odot \sigma'(z^L)$.
Backprop equation, from $\delta^{l+1}$ to $\delta^l$: $\delta^l = \big((W^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)$
Weight gradient: $\frac{\partial L}{\partial W^l} = \delta^l\,(a^{l-1})^T$
This is an outer product — a column vector times a row vector gives a matrix. One gradient per weight.
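The backward recursion and the outer-product gradient can be verified on a tiny two-layer sigmoid network. The weights below are arbitrary illustrative values, and the loss is assumed to be $\frac{1}{2}\sum (a - t)^2$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

# Two sigmoid layers; all values are illustrative.
W1 = [[0.2, -0.3], [0.4, 0.1]]
W2 = [[0.5, -0.2]]
x = [1.0, 0.5]
t = [1.0]

# Forward pass, keeping activations per layer
z1 = matvec(W1, x)
a1 = [sigmoid(z) for z in z1]
z2 = matvec(W2, a1)
a2 = [sigmoid(z) for z in z2]

# Backward pass: delta^L = (a - t) * sigma'(z^L), then
# delta^l = (W^{l+1}ᵀ delta^{l+1}) * sigma'(z^l), elementwise
d2 = [(a - ti) * a * (1 - a) for a, ti in zip(a2, t)]
d1 = [s * a * (1 - a) for s, a in zip(matvec(transpose(W2), d2), a1)]

# Weight gradients: dL/dW^l = delta^l (a^{l-1})ᵀ, an outer product
gW2 = [[di * aj for aj in a1] for di in d2]
gW1 = [[di * xj for xj in x] for di in d1]
print(gW1, gW2)
```

Note how the backward pass reuses the transposed weight matrix: exactly the operation a crossbar must support for in-memory backprop.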
Optimization algorithms:
- SGD (Stochastic Gradient Descent): $w \leftarrow w - \eta\,\nabla_w L$
- SGD + Momentum: $v \leftarrow \beta v - \eta\,\nabla_w L$, $w \leftarrow w + v$. Accumulated velocity dampens oscillation and rolls through shallow local minima.
- AdaGrad: per-weight learning rate, scaled down by the accumulated squared gradients.
- RMSProp: AdaGrad with an exponentially decaying average instead of a full sum.
- Adam (2014): Momentum + RMSProp combined. Modern default. Tracks both the first and second moment of the gradient.
Mini-batch SGD: $\nabla L \approx \frac{1}{B}\sum_{i=1}^{B} \nabla L_i$, where $B$ = batch size, typically 32-512. Stochastic noise → escapes local minima.
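As a concrete instance, here is the Adam update for a single weight, with the default hyperparameters from the Kingma & Ba paper; the toy objective $L = w^2$ (gradient $2w$) is an illustrative choice:

```python
import math

# Adam keeps a first moment m (momentum) and a second moment v
# (RMSProp-style) of the gradient, both bias-corrected by step count t.
def adam_step(w, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment
    v = b2 * v + (1 - b2) * grad ** 2      # second moment
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize L = w^2; the step size adapts to gradient scale automatically.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # settles near the minimum at 0
```

Because the update divides by $\sqrt{\hat{v}}$, step sizes stay near $\eta$ regardless of the raw gradient magnitude, which is why Adam tolerates poorly scaled problems.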
Vanishing/exploding gradients:
In deep networks, the backward pass multiplies by $\sigma'(z)$ and $W^T$ once per layer. If $|\sigma'(z)| < 1$ (true for sigmoid, where $\sigma'(z) \le 0.25$, and for tanh away from zero): the product shrinks exponentially with depth → vanishing gradient → early layers don’t learn.
If $\|W\| > 1$ and the activations don’t compensate: the product grows exponentially → exploding gradient → instability.
Fixes (the foundations of modern deep learning):
- ReLU activation: $\mathrm{ReLU}(z) = \max(0, z)$; derivative is exactly 1 for $z > 0$ — no vanishing on the active path.
- Batch normalization (Ioffe & Szegedy 2015): normalize each layer’s outputs → smoother gradient flow.
- Residual connections (He et al. 2015, ResNet): $y = F(x) + x$. Gradient flows back through the identity → 1000+ layers possible.
- LSTM gates (for RNNs): open/close to preserve the gradient across time steps.
- Gradient clipping: if $\|\nabla L\|$ exceeds a threshold, rescale it to the threshold. Prevents explosions.
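The vanishing effect is easy to see numerically: even the sigmoid's best-case derivative, $\sigma'(0) = 0.25$, dies after a few dozen layers, while ReLU's unit derivative survives:

```python
import math

# Backward through `depth` layers multiplies by sigma'(z) each time.
def sigmoid_deriv(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1 - s)

depth = 50
sig_factor = sigmoid_deriv(0.0) ** depth  # sigma'(0) = 0.25, the best case
relu_factor = 1.0 ** depth                # ReLU derivative for z > 0

print(sig_factor)   # ~8e-31: vanished
print(relu_factor)  # 1.0: intact
```

0.25⁵⁰ = 2⁻¹⁰⁰: the early layers receive essentially zero gradient, which is exactly why pre-ReLU deep networks stalled.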
Hardware difficulty for backprop:
Backprop carries 3 difficulties:
- Bidirectional dataflow: forward + backward. Hardware needs two distinct signal paths.
- Non-local information: the error term $\delta$ comes from across the network. A memristor cell can’t know “the gradients of other cells” (non-local).
- High precision: gradients can be very small (~10⁻⁶). 8-bit memristors (256 levels) → quantization error corrupts the gradient.
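The precision problem can also be seen numerically. A sketch assuming a uniform 256-level conductance range over $[-1, 1]$ (illustrative numbers, not a specific device model):

```python
# 256 levels over [-1, 1] gives a step of 2/255, about 0.0078. A gradient
# update of ~1e-6 is far below one step, so the quantized update rounds to
# zero and the weight never moves.
levels = 256
w_min, w_max = -1.0, 1.0
step = (w_max - w_min) / (levels - 1)

def quantize(w):
    return w_min + round((w - w_min) / step) * step

w = 0.5
grad_update = 1e-6                    # typical late-training gradient step
w_new = quantize(w + grad_update)
print(step, w_new == quantize(w))     # the update is lost to quantization
```

This is why training typically needs 16- or 32-bit accumulators even when inference runs happily at 8 bits.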
Hardware attempts:
- Equilibrium propagation (Scellier & Bengio 2017): energy-based, local learning. Still prototype.
- Feedback alignment (Lillicrap et al. 2016; “direct” variant: Nøkland 2016): sends the error back through fixed random weights instead of $W^T$. More hardware-friendly, and accuracy holds up surprisingly well on small and mid-size networks.
- Forward-forward (Hinton 2022): no backward pass at all; two forward passes are compared.
- In-memory backprop: all weights in the crossbar; backward pass via transposed crossbar read. Active research for SIDRA.
SIDRA strategy (realistic):
- Y1-Y3: inference-focused, training on external GPU.
- Y10: hybrid training — last layer updates on SIDRA (transfer learning); earlier layers frozen.
- Y100: online incremental learning — Hebbian + reinforcement signal. Not exact backprop, but similar effect.
- Y1000: hardware backprop — new devices (ferroelectric FET, magneto-tunnel junction).
Experiment: Backprop Steps in a 2-Layer Network
Data: input $x$, target $y$.
Network:
- Hidden: $h = \sigma(w_1 x + b_1)$
- Output: $\hat{y} = w_2 h + b_2$
- Loss: $L = \frac{1}{2}(\hat{y} - y)^2$
Forward pass: compute $h$, then $\hat{y}$, then $L$.
Backward pass: $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$; $\frac{\partial L}{\partial w_2} = (\hat{y} - y)\,h$; $\frac{\partial L}{\partial w_1} = (\hat{y} - y)\,w_2\,\sigma'(w_1 x + b_1)\,x$.
Update (learning rate $\eta$): $w \leftarrow w - \eta\,\frac{\partial L}{\partial w}$ for each weight.
New forward pass: recompute $\hat{y}$ and $L$ with the updated weights.
Loss dropped 0.534 → 0.087 — 6× improvement in one iteration. That’s how backprop works — each step nudges in the right direction.
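The iteration can be reproduced in a few lines. The input, target, initial weights, and learning rate below are illustrative stand-ins, so the exact loss values will differ from those above, but the mechanics (forward, backward, update, forward again) are the same:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One full backprop iteration on the 2-layer net; all values illustrative.
x, y = 1.0, 1.0
w1, b1 = 0.5, 0.0     # hidden layer
w2, b2 = 0.5, 0.0     # output layer
eta = 0.5             # learning rate

def forward(w1, b1, w2, b2):
    h = sigmoid(w1 * x + b1)
    yhat = w2 * h + b2
    return h, yhat, 0.5 * (yhat - y) ** 2

h, yhat, loss0 = forward(w1, b1, w2, b2)

# Backward pass via the chain rule
d_yhat = yhat - y                    # dL/dyhat
g_w2, g_b2 = d_yhat * h, d_yhat     # output-layer gradients
d_h = d_yhat * w2                    # error pushed back to the hidden layer
g_w1 = d_h * h * (1 - h) * x        # chain through the sigmoid
g_b1 = d_h * h * (1 - h)

# Update every weight, then re-run the forward pass
w1, b1 = w1 - eta * g_w1, b1 - eta * g_b1
w2, b2 = w2 - eta * g_w2, b2 - eta * g_b2
_, _, loss1 = forward(w1, b1, w2, b2)
print(loss0, loss1)  # loss drops after a single step
```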
SIDRA parallel: to do this on a memristor crossbar: forward pass → analog MVM (easy, natural). Backward pass → MVM with the transpose of the weight matrix. SIDRA crossbars support transpose reads but need extra circuitry + calibration. Planned in the Y10 prototype.
Quick Quiz
Lab Exercise
Backprop budget for training GPT-3.
Data:
- GPT-3: 175 billion parameters
- Training tokens: ~300 billion
- Total training FLOPs (forward + backward): ~6 × params × tokens = 6 × 1.75 × 10¹¹ × 3 × 10¹¹ ≈ 3.15 × 10²³ FLOP (rule of thumb: forward ≈ 2 FLOP per parameter per token, backward ≈ 2× forward)
- NVIDIA A100: 312 TFLOPS peak (BF16); ~40% sustained in practice → ~1.25 × 10¹⁴ FLOP/s
- Total training time (compute-bound): 3.15 × 10²³ / 1.25 × 10¹⁴ ≈ 2.5 × 10⁹ A100-seconds ≈ 700,000 A100-hours
- A100 at 250 W → 700K h × 0.25 kW ≈ 175 MWh of GPU power alone (parallelism shortens wall-clock time, not GPU-hours)
- Patterson et al. 2021 report 1287 MWh for GPT-3 (whole-system: servers, interconnect, cooling, PUE)
Questions:
(a) GPT-3 training A100-hours required? (b) How many A100s in parallel for a 30-day run? (c) How many SIDRA Y100 (analog) chips for inference of the same model? (d) Y1000 hypothesis (hardware backprop): energy estimate to train?
Solutions
(a) ~700,000 A100-hours. With 1024 A100s in parallel: 700K / 1024 ≈ 685 hours ≈ 28.5 days of core compute; in practice 30+ days (data pipeline, synchronization, overhead).
(b) 30 days = 720 hours → 700K / 720 ≈ 970 A100s in parallel. OpenAI is estimated to have used ~1000+ (consistent with Patterson’s 1287 MWh).
(c) GPT-3 has 175B parameters. Y100 target: 100 billion memristors per chip → 2 Y100 chips = 200B, which fits all of GPT-3. Inference: 1 forward pass ≈ 6 × 175B ≈ 10¹² ops per token (a generous count; forward alone is ~2 ops per parameter). At Y100’s ~3 × 10¹⁶ ops/s → ~35 µs/token. Two chips at ~100 W each: GPT-3 inference at ~200 W, real-time.
(d) Y1000 hypothesis: if analog hardware drops energy/op ~1000×, then 1287 MWh / 1000 ≈ 1.3 MWh = 1300 kWh, roughly a few households’ monthly electricity. GPT-3 training on an apartment’s power budget. Speculative today, but the direction is right.
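Solution (c) can be sanity-checked with the chapter's own figures (~6 ops per parameter per token, a Y100 target of 100 billion memristors and ~3 × 10¹⁶ ops/s per chip):

```python
import math

# Back-of-envelope check of the inference budget from solution (c).
params = 175e9                      # GPT-3 parameter count
ops_per_token = 6 * params          # ~1.05e12 ops, the chapter's rough count
chips = math.ceil(params / 100e9)   # weights must fit on-chip
latency_s = ops_per_token / 3e16    # seconds per token at Y100 throughput

print(chips)            # 2 chips hold all 175B weights
print(latency_s * 1e6)  # roughly 35 microseconds per token
```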
Note: A100 → H100 → B200 → … GPUs evolve too. SIDRA isn’t directly competing — it’s a category difference. Trained model on SIDRA for inference + edge AI scenarios.
Cheat Sheet
- Backprop = chain rule: $\frac{\partial L}{\partial w}$ expanded as a product of layer-by-layer derivatives.
- Four steps: forward pass → loss → backward pass → update.
- Loss: MSE (regression), Cross-entropy (classification).
- Optimizer family: SGD → SGD+momentum → AdaGrad → RMSProp → Adam (2014, modern default).
- Vanishing gradient: sigmoid kills it in deep nets. Fixes: ReLU + BatchNorm + Residual.
- Hardware difficulty: bidirectional + non-local + high precision → doesn’t map to analog crossbars.
- SIDRA strategy: inference-focused (Y1-Y3); hybrid training (Y10); online incremental (Y100); hardware backprop (Y1000 horizon).
Vision: Post-Backprop AI and SIDRA's Shot
Backprop has been AI’s foundation for 40 years. The next 20 years are about scouting alternatives:
- Y1 (today): SIDRA inference, training on GPU (backprop). Hybrid economics.
- Y3 (2027): Hybrid fine-tuning — last layer updates on SIDRA, the rest frozen.
- Y10 (2029): Equilibrium propagation or forward-forward prototypes. Limited training on the SIDRA crossbar.
- Y100 (2031+): Online + incremental + reinforcement learning integrated. Backprop’s role taken by “neuromorphic plasticity” — Hebbian + STDP + dopaminergic reward.
- Y1000 (long horizon): Full backprop analog on bio-compatible devices. Not brain-style, but post-brain AI.
Strategic chance for Türkiye: the US/China race was built on backprop + huge GPU farms. SIDRA is built on an “alternative learning paradigm”. That category difference is Türkiye’s leadership chance — if we validate online learning at the Y10 prototype, we can open a category alone in the world.
Unexpected future: synthetic learning. Brain-like systems that learn from few examples and generalize from one image + one sentence. Classical backprop wants millions of examples; biology takes a handful. SIDRA’s online + sparse + plastic foundation fits this paradigm. The first large-scale few-shot AI prototype could be SIDRA Y10-Y100.
Further Reading
- Next chapter: 3.7 — Memristor ↔ Synapse Mapping
- Previous: 3.5 — From Artificial Neuron to Transformer
- Backprop original: Rumelhart, Hinton, Williams, Learning representations by back-propagating errors, Nature 1986.
- Werbos priority: P. Werbos, Beyond Regression: New Tools for Prediction and Analysis…, Harvard PhD thesis 1974.
- Adam optimizer: Kingma & Ba, Adam: A Method for Stochastic Optimization, ICLR 2015.
- ResNet: He et al., Deep residual learning for image recognition, CVPR 2016.
- Equilibrium propagation: Scellier & Bengio, Equilibrium propagation: bridging the gap between energy-based models and backpropagation, Front. Comput. Neurosci. 2017.
- Forward-forward: G. Hinton, The forward-forward algorithm, arXiv 2022.