🧠 Module 3 · From Biology to Algorithm · Chapter 3.6 · 14 min read

Backpropagation

The chain rule that trains modern AI — and why doing it in hardware is hard.

What you'll learn here

  • Define a loss function and why it's needed (MSE, cross-entropy)
  • Compute $dL/dw$ step by step using the chain rule
  • Distinguish SGD, mini-batch, momentum, and Adam optimizers
  • Explain vanishing/exploding gradients and the fixes (ReLU, batch norm, residual)
  • State why backprop is hard in hardware and how SIDRA's online-learning approach differs

Hook: The One Algorithm of 1986

In 1986 Rumelhart, Hinton, and Williams published Learning representations by back-propagating errors. Three pages. One algorithm. One mathematical idea: the chain rule.

That paper birthed modern AI. ChatGPT, AlphaGo, Tesla Autopilot — all trained with backprop. But backprop’s roots go further back: Werbos described the same algorithm in his 1974 PhD thesis (largely unnoticed), and Linnainmaa published it in 1970 as reverse-mode automatic differentiation. The 1986 contribution was showing it could train multi-layer neural networks.

Nearly 40 years later: GPT-4 reportedly ran backprop for months on a cluster of roughly 25,000 A100 GPUs. Same algorithm; model scale grew roughly 10⁹×.

Why this matters for SIDRA: backprop is for training, not inference. SIDRA Y1 is inference-focused — training still happens on GPUs. But doing backprop in hardware would break the GPU dependency. That’s Y100’s real claim. This chapter unpacks both the math and the hardware difficulty.

Intuition: The Error Signal Walks Backward

Training a neural network is a four-step loop:

  1. Forward pass: data goes in → output computed.
  2. Loss compute: output vs target → loss $L$ measured.
  3. Backward pass: sensitivity of $L$ to each weight ($dL/dw$) computed.
  4. Update: $w \leftarrow w - \eta \cdot dL/dw$ (SGD).

Repeat: millions of times. Each iteration, a small step. Optimization = roll downhill.

Loss function — why? Network output isn’t right/wrong as a single bit; “how close” must be a number. Two popular choices:

  • MSE (Mean Squared Error): $L = \frac{1}{N} \sum (y - \hat{y})^2$. For regression.
  • Cross-entropy: $L = -\sum y \log \hat{y}$. For classification.

Both are differentiable — you can take a gradient.
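Both losses fit in a few lines. A minimal NumPy sketch (function names are my own, not from any particular library), including the MSE gradient that the backward pass will consume:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error plus its gradient w.r.t. the prediction."""
    return np.mean((y - y_hat) ** 2), 2 * (y_hat - y) / y.size

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy for a one-hot target y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

loss, grad = mse(np.array([1.0]), np.array([0.269]))
# loss = (1 - 0.269)^2 ≈ 0.534; grad = 2(0.269 - 1) ≈ -1.462
```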

The intuition behind backprop: the error at the output came from the last layer’s weights. “If this weight had been a bit larger, would the error have shrunk?” — that question is $dL/dw_{\text{last}}$. The same question applies to the previous layer, but indirectly (chain rule). The error signal walks backward from output to input, answering “am I to blame?” for every weight.

Versus Hebbian: Hebbian uses only pre + post info → local. Backprop pushes the error signal globally → global. Backprop is more powerful (targeted learning), but harder in hardware.

Formalism: From the Chain Rule to Adam Optimizer

L1 · Basics

Single-neuron example:

$z = wx + b$, $a = f(z)$, $L = (a - y)^2$.

Question: how much does $L$ change when $w$ changes?

Chain rule:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$

Compute:

  • $dL/da = 2(a - y)$
  • $da/dz = f'(z)$
  • $dz/dw = x$

Result:

$$\frac{dL}{dw} = 2(a - y) \cdot f'(z) \cdot x$$

Update: $w \leftarrow w - \eta \cdot dL/dw$, where $\eta$ is the learning rate (~0.01).
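The single-neuron gradient above can be checked numerically in a few lines. A sketch (the sigmoid choice for $f$ is illustrative), verifying the chain-rule result against a central finite difference:

```python
import numpy as np

def f(z):                       # sigmoid as the example activation
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, b, y):
    return (f(w * x + b) - y) ** 2

w, x, b, y = 0.3, 0.5, 0.0, 1.0
a = f(w * x + b)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw, with sigmoid f' = f(1 - f)
dL_dw = 2 * (a - y) * a * (1 - a) * x

# Sanity check against a central finite difference
h = 1e-6
dL_dw_num = (loss(w + h, x, b, y) - loss(w - h, x, b, y)) / (2 * h)
```

If the two numbers disagree, the analytic gradient is wrong — the same finite-difference trick is the standard way to debug any hand-written backward pass.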

L2 · Full

Multi-layer network — backward pass:

$L$ is the loss, $\mathbf{a}^{(L)}$ the last-layer output. At each layer define:

$$\delta^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$$

Last layer: $\delta^{(L)} = \nabla_a L \odot f'(\mathbf{z}^{(L)})$.

Backprop equation: $\delta^{(l)}$ from $\delta^{(l+1)}$:

$$\delta^{(l)} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$$

Weight gradient:

$$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} \, (\mathbf{a}^{(l-1)})^\top$$

This is an outer product — a column vector times a row vector gives a matrix: one gradient entry per weight.
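These three equations are only a handful of NumPy lines. A sketch for a tiny tanh network with squared-error loss (the layer sizes are arbitrary), with a numerical spot-check of one gradient entry:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))   # layers: 4 -> 3 -> 2
x, y = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))

# Forward pass (tanh hidden layer, linear output)
z1 = W1 @ x
a1 = np.tanh(z1)
a2 = W2 @ a1

# Backward pass: delta recursion, then outer-product weight gradients
delta2 = 2 * (a2 - y)                              # output layer: f' = 1
delta1 = (W2.T @ delta2) * (1 - np.tanh(z1) ** 2)  # backprop equation
dW2 = delta2 @ a1.T                                # outer product -> shape of W2
dW1 = delta1 @ x.T

# Spot-check one entry of dW1 against a central finite difference
h = 1e-6
def L(W):
    return float(np.sum((W2 @ np.tanh(W @ x) - y) ** 2))
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += h; Wm[0, 0] -= h
num = (L(Wp) - L(Wm)) / (2 * h)
```

Note how `W2.T` appears in the delta recursion — the backward pass reuses the forward weights, transposed, which is exactly what makes hardware backprop need a transpose read.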

Optimization algorithms:

  • SGD (Stochastic Gradient Descent): $w \leftarrow w - \eta g$
  • SGD + Momentum: $v \leftarrow \mu v + g$, $w \leftarrow w - \eta v$. Smooths noisy gradients and helps escape shallow local minima.
  • AdaGrad: per-weight learning rate: $w \leftarrow w - \eta g / \sqrt{\sum g^2}$
  • RMSProp: AdaGrad with a decaying average of squared gradients.
  • Adam (Kingma & Ba 2015): momentum + RMSProp combined; tracks both the first and second moment of the gradient. The modern default.

Mini-batch SGD: $L = \frac{1}{B} \sum_{i \in \text{batch}} L_i$, where $B$ is the batch size, typically 32–512. The stochastic noise also helps escape local minima.
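The Adam update — the combination of momentum and RMSProp described above — written out as a minimal sketch (the hyperparameter defaults follow the usual conventions; `lr=0.05` is larger than the common 0.001 so the toy problem converges in few steps):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) + RMSProp (second moment)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)      # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(w) = w^2 starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    g = 2 * w                      # gradient of w^2
    w, m, v = adam_step(w, g, m, v, t)
# w ends up near the minimum at 0
```

Plain SGD is the one-liner `w -= lr * g`; everything Adam adds is bookkeeping about the gradient's history.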

L3 · Deep

Vanishing/exploding gradients:

In deep networks, the backward recursion $\delta^{(l)} = (\mathbf{W}^{(l+1)})^\top \delta^{(l+1)} \odot f'(\mathbf{z}^{(l)})$ is applied once per layer. If $|f'| < 1$ everywhere (true for sigmoid and tanh): $\delta$ shrinks exponentially with depth → vanishing gradient → the early layers stop learning.

If instead the per-layer factors $\|W\| \cdot |f'|$ exceed 1: $\delta$ grows exponentially → exploding gradient → training instability.

Fixes (the foundations of modern deep learning):

  1. ReLU activation: $f'(z)$ is $0$ or $1$ — no shrinking factor on the active path ($z > 0$).
  2. Batch normalization (Ioffe & Szegedy 2015): normalize each layer’s outputs → smoother gradient flow.
  3. Residual connections (He et al. 2015, ResNet): $\mathbf{a}^{(l+1)} = \mathbf{a}^{(l)} + f(\ldots)$. The gradient flows back through the identity path → 1000+ layers become trainable.
  4. LSTM gates (for RNNs): gates open and close to preserve the gradient over time.
  5. Gradient clipping: if $|g|$ exceeds a threshold, rescale it. Prevents explosions.
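The exponential shrinkage is easy to see numerically. A toy sketch: with $\|W\| \approx 1$, the gradient scale after $d$ layers is roughly $|f'|^d$:

```python
# Rough gradient scale after `depth` layers, assuming |W| ~ 1 so that
# only the activation derivative f' multiplies in at each layer.
def grad_scale(depth, fprime):
    return fprime ** depth

sigmoid_scale = grad_scale(50, 0.25)   # sigmoid's f' peaks at 0.25
relu_scale = grad_scale(50, 1.0)       # ReLU's f' is exactly 1 when active
# sigmoid: ~8e-31 (vanished); ReLU: 1.0 (survives)
```

Fifty sigmoid layers attenuate the error signal by thirty orders of magnitude — no float precision or learning-rate trick recovers that, which is why the fixes above changed the activation and the architecture instead.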

Hardware difficulty for backprop:

Backprop poses three difficulties:

  1. Bidirectional dataflow: forward + backward. Hardware needs two distinct signal paths.
  2. Non-local information: $dL/dw$ depends on information from across the network. A memristor cell cannot know the gradients of other cells (non-local).
  3. High precision: gradients can be very small (~10⁻⁶). 8-bit memristors (256 levels) → quantization error corrupts the gradient.

Hardware attempts:

  • Equilibrium propagation (Scellier & Bengio 2017): energy-based, local learning. Still at the prototype stage.
  • Feedback alignment (Lillicrap et al. 2016): replaces the transposed weights with fixed random backward projections. Local, with surprisingly small accuracy loss on modest networks.
  • Forward-forward (Hinton 2022): no backward pass at all; two forward passes are compared.
  • In-memory backprop: all weights stay in the crossbar; the backward pass is a transposed crossbar read. Active research direction for SIDRA.

SIDRA strategy (realistic):

  • Y1-Y3: inference-focused, training on external GPU.
  • Y10: hybrid training — last layer updates on SIDRA (transfer learning); earlier layers frozen.
  • Y100: online incremental learning — Hebbian + reinforcement signal. Not exact backprop, but similar effect.
  • Y1000: hardware backprop — new devices (ferroelectric FET, magneto-tunnel junction).

Experiment: Backprop Steps in a 2-Layer Network

Data: $x = 0.5$, target $y = 1$.

Network:

  • Hidden: $h = \sigma(w_1 x + b_1)$, init $w_1 = 0.3$, $b_1 = 0$, $\sigma(z) = 1/(1+e^{-z})$
  • Output: $\hat{y} = w_2 h + b_2$, init $w_2 = 0.5$, $b_2 = 0$
  • Loss: $L = (\hat{y} - y)^2$

Forward pass:

  • $z_1 = 0.3 \times 0.5 + 0 = 0.15$
  • $h = \sigma(0.15) = 0.537$
  • $\hat{y} = 0.5 \times 0.537 + 0 = 0.269$
  • $L = (0.269 - 1)^2 = 0.534$

Backward pass:

  • $dL/d\hat{y} = 2(0.269 - 1) = -1.462$
  • $dL/dw_2 = dL/d\hat{y} \times d\hat{y}/dw_2 = -1.462 \times 0.537 = -0.785$
  • $dL/dh = dL/d\hat{y} \times d\hat{y}/dh = -1.462 \times 0.5 = -0.731$
  • $dh/dz_1 = \sigma(0.15)(1 - \sigma(0.15)) = 0.537 \times 0.463 = 0.249$
  • $dL/dw_1 = dL/dh \times dh/dz_1 \times dz_1/dw_1 = -0.731 \times 0.249 \times 0.5 = -0.091$

Update ($\eta = 1$, deliberately large for this demo):

  • $w_2 \leftarrow 0.5 - 1 \times (-0.785) = 1.285$
  • $w_1 \leftarrow 0.3 - 1 \times (-0.091) = 0.391$

New forward pass:

  • $z_1 = 0.391 \times 0.5 = 0.196$
  • $h = \sigma(0.196) = 0.549$
  • $\hat{y} = 1.285 \times 0.549 = 0.706$
  • $L = (0.706 - 1)^2 = 0.087$

Loss dropped from 0.534 to 0.087 — a 6× improvement in one iteration. That’s how backprop works: each step nudges every weight in the right direction.
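The whole worked example fits in a short script, which is handy for checking the hand arithmetic (a sketch; biases are left frozen, matching the numbers above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.3, 0.0, 0.5, 0.0
eta = 1.0                      # deliberately large for the demo

# Forward pass
h = sigmoid(w1 * x + b1)
y_hat = w2 * h + b2
L0 = (y_hat - y) ** 2

# Backward pass (exactly the chain-rule steps above)
dL_dyhat = 2 * (y_hat - y)
dL_dw2 = dL_dyhat * h
dL_dw1 = dL_dyhat * w2 * h * (1 - h) * x

# Update, then re-run the forward pass
w2 -= eta * dL_dw2
w1 -= eta * dL_dw1
h = sigmoid(w1 * x + b1)
L1 = (w2 * h + b2 - y) ** 2
# L0 ≈ 0.53 drops to L1 ≈ 0.09 after a single step
```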

SIDRA parallel: to do this on a memristor crossbar: forward pass → analog MVM (easy, natural). Backward pass → MVM with the transpose of the weight matrix. SIDRA crossbars support transpose reads but need extra circuitry + calibration. Planned in the Y10 prototype.
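The precision difficulty from L3 can be illustrated with an idealized 8-bit crossbar model (a hypothetical quantization model for illustration, not SIDRA’s actual device physics):

```python
import numpy as np

def quantize(W, lo=-4.0, hi=4.0, bits=8):
    """Snap weights to one of 2**bits evenly spaced conductance levels
    (an idealized, hypothetical memristor model)."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(W, lo, hi) - lo) / step) * step

rng = np.random.default_rng(1)
W = quantize(rng.normal(size=(64, 64)))   # weights stored in the crossbar
x = rng.normal(size=(64,))
delta = rng.normal(size=(64,))

y_forward = W @ x           # forward pass: ordinary crossbar read
g_backward = W.T @ delta    # backward pass: transpose read of the same array

# The catch: a gradient update smaller than one conductance step
# (8 / 255 ≈ 0.031 here) is rounded away entirely.
tiny_update = 1e-6
assert np.array_equal(quantize(W - tiny_update), W)
```

The transpose read itself is just `W.T @ delta` mathematically; the hard part is that typical gradient magnitudes (~10⁻⁶) fall far below one 8-bit conductance step, so naive in-memory updates are silently lost.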

Quick Quiz

1/6 — What's the mathematical foundation of backprop?

Lab Exercise

Backprop budget for training GPT-3.

Data:

  • GPT-3: 175 billion parameters
  • Training tokens: ~300 billion
  • Total training FLOPs: ~6 × params × tokens = 6 × 1.75 × 10¹¹ × 3 × 10¹¹ = 3.15 × 10²³ FLOP (forward ≈ 2 × params × tokens; backward ≈ 2× the forward pass)
  • NVIDIA A100: 312 TFLOPS peak (BF16); assume ~40% sustained utilization → ~1.25 × 10¹⁴ FLOP/s
  • Total training time (compute-bound): 3.15 × 10²³ / 1.25 × 10¹⁴ ≈ 2.5 × 10⁹ A100-seconds ≈ 700,000 A100-hours
  • A100 SXM ≈ 400 W → 700,000 h × 0.4 kW ≈ 280 MWh of GPU power alone (parallelism reduces wall-clock time, not energy)
  • Patterson et al. 2021: 1287 MWh — whole-system power, interconnect, cooling, and PUE sit on top of the GPU-only figure

Questions:

(a) GPT-3 training A100-hours required? (b) How many A100s in parallel for a 30-day run? (c) How many SIDRA Y100 (analog) chips for inference of the same model? (d) Y1000 hypothesis (hardware backprop) energy estimate to train?

Solutions

(a) ~700,000 A100-hours. With 1024 A100s in parallel: 700K / 1024 ≈ 684 hours ≈ 28 days of core compute; in practice longer (data pipeline, synchronization, restarts).

(b) 30 days = 720 hours → 700K / 720 ≈ 970, i.e. on the order of 1000 A100s in parallel — consistent with public estimates of ~1000+ GPUs and with Patterson’s 1287 MWh.

(c) GPT-3 has 175B parameters. Y100 target: 100 billion memristors per chip, so 2 Y100 chips = 200B cells hold all of GPT-3. Inference: one forward pass costs ~2 × params ≈ 3.5 × 10¹¹ ops per token. At the Y100 target of ~3 × 10¹⁶ ops/s, that is ~12 µs per token. Two chips at ~100 W each → real-time GPT-3 inference at ~200 W.

(d) Y1000 hypothesis: if analog hardware drops energy/op ~1000×, then 1287 MWh / 1000 ≈ 1.3 MWh = 1287 kWh — roughly a few households’ monthly electricity use. GPT-3 training in an ordinary apartment. Speculative today, but the direction is right.
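The arithmetic in (a)–(b) is easy to re-derive. A back-of-envelope sketch — note the ~40% sustained-utilization figure is an assumption for this exercise, not a measured number:

```python
params = 175e9                       # GPT-3 parameters
tokens = 300e9                       # training tokens
total_flop = 6 * params * tokens     # ~6 FLOP per parameter per token (fwd + bwd)

peak = 312e12                        # A100 BF16 peak FLOPS
sustained = 0.4 * peak               # assumed ~40% utilization
a100_hours = total_flop / sustained / 3600

energy_mwh = a100_hours * 0.4 / 1000  # A100 SXM ~400 W, GPU board power only
# total_flop ≈ 3.15e23; roughly 7e5 A100-hours; ~280 MWh GPU-only
```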

Note: A100 → H100 → B200 → … GPUs evolve too. SIDRA isn’t competing head-on — it’s a category difference: the trained model runs on SIDRA for inference and edge-AI scenarios.

Cheat Sheet

  • Backprop = chain rule: $\partial L / \partial w$ as a product of local derivatives.
  • Four steps: forward pass → loss → backward pass → update.
  • Loss: MSE (regression), Cross-entropy (classification).
  • Optimizer family: SGD → SGD+momentum → AdaGrad → RMSProp → Adam (Kingma & Ba 2015, modern default).
  • Vanishing gradient: sigmoid kills it in deep nets. Fixes: ReLU + BatchNorm + Residual.
  • Hardware difficulty: bidirectional + non-local + high precision → doesn’t map to analog crossbars.
  • SIDRA strategy: inference-focused (Y1-Y3); hybrid training (Y10); online incremental (Y100); hardware backprop (Y1000 horizon).

Vision: Post-Backprop AI and SIDRA's Shot

Backprop has been AI’s foundation for 40 years. The next 20 are scouting alternatives:

  • Y1 (today): SIDRA inference, training on GPU (backprop). Hybrid economics.
  • Y3 (2027): Hybrid fine-tuning — last layer updates on SIDRA, the rest frozen.
  • Y10 (2029): Equilibrium propagation or forward-forward prototypes. Limited training on the SIDRA crossbar.
  • Y100 (2031+): Online + incremental + reinforcement learning integrated. Backprop’s role taken by “neuromorphic plasticity” — Hebbian + STDP + dopaminergic reward.
  • Y1000 (long horizon): Full backprop analog on bio-compatible devices. Not brain-style, but post-brain AI.

Strategic chance for Türkiye: the US/China race was built on backprop + huge GPU farms. SIDRA is built on an “alternative learning paradigm”. That category difference is Türkiye’s leadership chance — if we validate online learning in the Y10 prototype, we could open a category of our own.

Unexpected future: synthetic learning. Brain-like systems that learn from few examples and generalize from one image + one sentence. Classical backprop wants millions of examples; biology takes a handful. SIDRA’s online + sparse + plastic foundation fits this paradigm. The first large-scale few-shot AI prototype could be SIDRA Y10-Y100.

Further Reading

  • Next chapter: 3.7 — Memristor ↔ Synapse Mapping
  • Previous: 3.5 — From Artificial Neuron to Transformer
  • Backprop original: Rumelhart, Hinton, Williams, Learning representations by back-propagating errors, Nature 1986.
  • Werbos priority: P. Werbos, Beyond Regression: New Tools for Prediction and Analysis…, Harvard PhD thesis 1974.
  • Adam optimizer: Kingma & Ba, Adam: A Method for Stochastic Optimization, ICLR 2015.
  • ResNet: He et al., Deep residual learning for image recognition, CVPR 2016.
  • Equilibrium propagation: Scellier & Bengio, Equilibrium propagation: bridging the gap between energy-based models and backpropagation, Front. Comput. Neurosci. 2017.
  • Forward-forward: G. Hinton, The forward-forward algorithm, arXiv 2022.