🧠 Module 3 · From Biology to Algorithm · Chapter 3.3 · 12 min read

Hebbian Learning

Cells that fire together wire together — learning's one-line equation.

What you'll learn here

  • Recall Hebb's 1949 statement and its modern one-line form
  • Show how the Hebbian update $\Delta w = \eta \cdot x \cdot y$ captures correlation
  • Explain why pure Hebbian blows weights up and how Oja's rule fixes it
  • State BCM theory's sliding threshold and why it's necessary
  • Sketch how a Hebbian update can be implemented in hardware on a SIDRA crossbar

Hook: One Sentence Since 1949

In The Organization of Behavior (1949), Donald Hebb wrote a single hypothesis for the biological basis of learning:

“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”

Carla Shatz later compressed it into six words: “Cells that fire together, wire together.”

This sentence is the mathematical foundation of modern AI. Backprop is its conceptual descendant; the Hopfield network runs on it; STDP is its spike-timing version; SIDRA’s future “online learning” target is implementing it on the crossbar.

77 years, six words. This chapter unfolds the math and the hardware feasibility behind it.

Intuition: Correlation = Weight Increase

Hebbian rule, simplest form:

$\Delta w = \eta \cdot x \cdot y$
  • $w$ — synaptic weight (presynaptic to postsynaptic)
  • $x$ — presynaptic activity (rate or 0/1)
  • $y$ — postsynaptic activity
  • $\eta$ — learning rate (~0.01–0.1)

Why does it work? When $x$ and $y$ are simultaneously high, $\Delta w > 0$ → connection strengthens. When either is low, $\Delta w \approx 0$ → no change. The synapse becomes a measure of statistical correlation between two neurons.
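
A two-synapse toy simulation (the 0/1 spike coding, trial count, and $\eta$ here are arbitrary illustrative choices, not from the chapter) makes the correlation-counting visible — the synapse whose postsynaptic neuron fires together with the input grows roughly twice as fast as one whose output fires independently:

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.05
w_corr = 0.0    # synapse whose post neuron fires together with the pre neuron
w_uncorr = 0.0  # synapse whose post neuron fires on its own

for _ in range(500):
    x = float(rng.random() < 0.5)          # presynaptic spike (0/1)
    y_corr = x                             # fires together with x
    y_uncorr = float(rng.random() < 0.5)   # fires independently of x
    w_corr += eta * x * y_corr             # Δw = η·x·y
    w_uncorr += eta * x * y_uncorr

print(w_corr, w_uncorr)                    # correlated synapse grows ~2× faster
```

Note that both weights only ever grow: with nonnegative rates, pure Hebbian never decreases a weight — exactly the instability discussed below.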

Concrete example: if seeing a dog (input $x$) and “wag tail” (output $y$) co-occur, the “see-dog → wag-tail” synapse strengthens. The next dog auto-triggers the response. Classical conditioning (Pavlov 1897, predating Hebb) is the behavioral version of the same rule.

The danger: pure Hebbian is unstable. $w$ only ever grows → weights blow up, or pin at whatever ceiling the substrate imposes. The brain solves this with several mechanisms:

  • Synaptic scaling (normalize all synapses)
  • LTD (anti-Hebbian update)
  • BCM (sliding threshold)
  • STDP (timing asymmetry)

In modern neural-net training, the same problem is solved with weight decay + batch normalization + Adam optimizer. Same problem, different fix.

Formalism: Pure Hebbian → Oja → BCM

L1 · Basics

Vector form:

A neuron has $N$ synapses, weight vector $\mathbf{w} \in \mathbb{R}^N$. Input $\mathbf{x}$, output $y = \mathbf{w}^\top \mathbf{x}$.

Hebbian update:

$\Delta \mathbf{w} = \eta \cdot y \cdot \mathbf{x} = \eta (\mathbf{w}^\top \mathbf{x}) \mathbf{x}$

Expected value (random inputs):

$\langle \Delta \mathbf{w} \rangle = \eta \cdot \langle \mathbf{x} \mathbf{x}^\top \rangle \mathbf{w} = \eta \cdot \mathbf{C} \mathbf{w}$
  • $\mathbf{C}$ — input covariance matrix.

This is the iterative power method → over time, $\mathbf{w}$ converges in direction to the principal eigenvector of $\mathbf{C}$ (its norm keeps growing). So Hebbian learns the first principal component (PCA) of the input.
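
The power-method view is easy to check numerically; a minimal sketch (the data dimensions, noise level, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D inputs: x2 mostly tracks x1
x1 = rng.normal(size=4000)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=4000)
X = np.stack([x1, x2], axis=1)
C = X.T @ X / len(X)                  # input covariance ⟨x xᵀ⟩

w = np.array([0.1, 0.1])
for _ in range(200):
    w = w + 0.01 * C @ w              # expected Hebbian step ⟨Δw⟩ = η·C·w

# direction aligns with the principal eigenvector while the norm grows
v1 = np.linalg.eigh(C)[1][:, -1]      # eigh sorts ascending: last = principal
cosine = abs(w @ v1) / np.linalg.norm(w)
print(cosine)                         # ≈ 1.0
```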

L2 · Full

Oja’s rule (1982) — adds normalization:

$\Delta \mathbf{w} = \eta \cdot y \cdot (\mathbf{x} - y \mathbf{w})$

The extra $-y^2 \mathbf{w}$ term pulls $\|\mathbf{w}\| \to 1$. Result: weights stay bounded, and the first PCA component is learned.

Generalized Oja (Sanger 1989): with $K$ output neurons in parallel, the first $K$ eigenvectors are learned → full PCA. Online, sequential, label-free.
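
A sketch of Sanger's generalized rule on assumed toy data — a diagonal covariance is used so the true eigenvectors are simply the coordinate axes (dimensions, variances, and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, eta = 5, 2, 0.005
# Toy data with diagonal covariance: principal components = coordinate axes
stds = np.array([3.0, 2.0, 1.0, 0.5, 0.2]) ** 0.5
X = rng.normal(size=(5000, N)) * stds

W = rng.normal(scale=0.1, size=(K, N))   # K output neurons, one row each
for _ in range(3):                       # a few online passes over the data
    for x in X:
        y = W @ x
        # Sanger 1989: ΔW = η (y xᵀ − LT(y yᵀ) W), LT = lower-triangular part
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

print(np.round(W, 2))                    # row 0 ≈ ±e1, row 1 ≈ ±e2
```

The lower-triangular term deflates each earlier component from the later neurons, so neuron $k$ converges to the $k$-th eigenvector rather than all rows collapsing onto the first.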

Anti-Hebbian:

$\Delta w = -\eta \cdot x \cdot y$

Weakens correlation. Used for inhibitory synapses or whitening (decorrelating inputs).

L3 · Deep

BCM theory (Bienenstock-Cooper-Munro 1982):

A meaningful evolution of pure Hebbian: a sliding threshold $\theta_M$ that depends on postsynaptic activity:

$\Delta w = \eta \cdot x \cdot y \cdot (y - \theta_M)$
  • $y > \theta_M$ → LTP (weight ↑)
  • $y < \theta_M$ → LTD (weight ↓)
  • $\theta_M$ itself drifts slowly: $\theta_M \propto \langle y^2 \rangle$ (long-run average).

Result:

  • A hyperactive neuron → $\theta_M$ rises → less LTP, more LTD → “self-quenches”.
  • An underactive neuron → $\theta_M$ drops → easier to potentiate.
  • Homeostatic balance. That’s how the brain stays stable.
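
A single-synapse sketch of the sliding threshold (input statistics, rates, and step counts are arbitrary choices; the threshold is given a faster time constant than the weight, as BCM requires):

```python
import numpy as np

rng = np.random.default_rng(2)
eta, eta_theta = 0.0025, 0.02             # weight slow, threshold faster
w, theta = 0.5, 0.25                      # initial weight and threshold

for _ in range(20000):
    x = rng.uniform(0.0, 1.0)             # presynaptic rate
    y = w * x                             # postsynaptic rate (linear neuron)
    w += eta * x * y * (y - theta)        # BCM: LTP if y > θ_M, LTD if y < θ_M
    theta += eta_theta * (y ** 2 - theta) # sliding threshold tracks ⟨y²⟩
    w = max(w, 0.0)                       # rates and weight stay nonnegative

print(w, theta)                           # w settles at a finite value
```

Without any normalization term, the weight stops growing on its own: whenever output activity runs hot, $\theta_M$ rises and converts further LTP into LTD.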

BCM is biologically supported by the NMDA + Ca²⁺ cascade (chapter 3.2). Low Ca²⁺ → LTD, high Ca²⁺ → LTP, threshold = average Ca²⁺ level. Same equation, biological substrate.

Tie to supervised learning:

Hebbian = unsupervised. No labels, just correlation. Supervised learning (backprop, chapter 3.6) = Hebbian-style correlation + a backpropagated error signal. Modern deep learning mostly relies on the supervised side alone; the brain uses both (cortical learning is largely Hebbian, with reward signals carried on top via dopamine).

Hardware feasibility — why SIDRA cares:

The Hebbian update is local: it only needs pre-activity + post-activity + weight. Backprop is global: the gradient chain runs across the whole network. Implementing a local rule in hardware is much easier — especially on an analog crossbar. At a memristor cell:

$\Delta G \propto V_{\text{pre}} \cdot V_{\text{post}} \cdot \Delta t$

Voltage coincidence grows or shrinks the filament → Hebbian emerges naturally. SIDRA’s roadmap targets local Hebbian first (Y3 prototype), local STDP later (Y100).
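
As an illustration of that locality, a toy crossbar model — every number here (conductance window, update gain, voltages) is made up for the sketch and is not a SIDRA spec:

```python
import numpy as np

# Toy model of ΔG ∝ V_pre · V_post on a tiny crossbar. All device numbers
# (conductance window, gain k, voltages) are illustrative, not SIDRA specs.
G_MIN, G_MAX = 1e-6, 1e-4                  # conductance window (siemens)
k = 2e-5                                   # update gain per pulse (S/V²), hypothetical

rng = np.random.default_rng(3)
G = rng.uniform(G_MIN, G_MAX, size=(4, 3)) # 4 input rows × 3 output columns
G0 = G.copy()                              # snapshot for comparison

v_pre = np.array([0.5, 0.0, 0.5, 0.0])     # row voltages: the inputs (V)
i_out = v_pre @ G                          # column currents: the analog MAC
v_post = i_out / np.max(i_out)             # toy post-activation, scaled to [0, 1]

# Hebbian step: coincident row/column activity grows the conductance,
# clipped to the physical window (the filament can only move so far)
G = np.clip(G + k * np.outer(v_pre, v_post), G_MIN, G_MAX)
```

Each cell's update uses only its own row voltage, column voltage, and state — no global gradient signal crosses the array, which is why a local rule maps onto this hardware so directly.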

Experiment: A 2-Input Neuron Learns a Correlation

A single neuron with 2 inputs ($x_1, x_2$). Output $y = w_1 x_1 + w_2 x_2$.

Training data: 1000 samples; $(x_1, x_2)$ pairs drawn from a Normal distribution — but correlated: $x_2 = 0.8 x_1 + 0.2 \xi$ ($\xi$ independent unit-variance noise). The principal axis is then close to $(1, 0.8)$.

Pure Hebbian ($\eta = 0.01$, $\mathbf{w}_0 = (0.1, 0.1)$):

Iteration    w              ‖w‖
0            (0.10, 0.10)   0.14
100          (0.42, 0.34)   0.54
500          (3.8, 3.0)     4.85
1000         (62, 49)       79

Blows up — but the ratio $w_2/w_1 \approx 0.8$ is right; the direction lines up with the principal axis.

Oja’s rule:

Iteration    w              ‖w‖
0            (0.10, 0.10)   0.14
100          (0.65, 0.51)   0.83
500          (0.78, 0.62)   1.00
1000         (0.78, 0.63)   1.00

Converges. $\mathbf{w} \approx (0.78, 0.63)$ — the first eigenvector of the data covariance.

Result: Hebbian + normalization = automatic dimensionality reduction (PCA). The brain does similar things in V1 — neurons converge to edge detectors (Olshausen & Field 1996, sparse coding).

SIDRA parallel: in a crossbar, 2-input/1-output neuron = 2 memristors. Local Hebbian update → learn the principal axis in analog hardware. This is the core of online unsupervised feature learning.
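
The experiment can be re-run in a few lines; a sketch using a small noise amplitude ($0.2\,\xi$) so the principal axis stays close to $(1, 0.8)$ — exact values depend on the random draw and will differ from the tables:

```python
import numpy as np

rng = np.random.default_rng(42)
n, eta = 1000, 0.01
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)   # correlated pair, axis ≈ (1, 0.8)
X = np.stack([x1, x2], axis=1)

w_hebb = np.array([0.1, 0.1])              # pure Hebbian
w_oja = np.array([0.1, 0.1])               # Oja's rule
for x in X:
    y = w_hebb @ x
    w_hebb += eta * y * x                  # norm explodes, direction is right
    y = w_oja @ x
    w_oja += eta * y * (x - y * w_oja)     # norm pulled toward 1

print(np.linalg.norm(w_hebb))              # very large
print(w_oja, np.linalg.norm(w_oja))        # unit-norm vector along the axis
```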

Quick Quiz

1. What is the modern one-line form of Hebb's 1949 hypothesis?

Lab Exercise

Online-learning energy budget on a SIDRA crossbar.

Data:

  • 256 × 256 crossbar, 65,536 memristors total
  • Each cell 256 levels (8 bit)
  • SET energy: ~10 pJ; partial-SET update Δ: ~1 pJ
  • Y1 clock: 100 MHz (10 ns/cycle)
  • TDP: 3 W

Questions:

(a) Energy if all 65K cells get a Hebbian update simultaneously?
(b) Within the 3 W TDP, how many crossbars can be updated simultaneously?
(c) Total updates per second?
(d) Brain LTP/LTD rate: an average synapse updates about once per minute (~17 mHz). What is SIDRA Y1’s update-rate ratio?
(e) How big a feature map can SIDRA Y1 learn unsupervised via local Hebbian?

Solutions

(a) 65,536 × 1 pJ = 65.5 nJ (single clock cycle, parallel).

(b) 3 W = 3 J/s. Updating all cells every 10 ns cycle draws 65.5 nJ / 10 ns = 6.55 W — above the TDP even for a single crossbar. In practice: update ~30K cells per cycle (within TDP), or all cells at half the clock rate.

(c) At half the clock rate (50 MHz = 5×10⁷ update cycles/s), all cells on a single crossbar: 65,536 × 5×10⁷ ≈ 3.3 × 10¹² updates/s.

(d) Brain: 10¹⁴ synapses × 17 × 10⁻³ Hz = 1.7 × 10¹² updates/s. SIDRA Y1 single crossbar: 3.3 × 10¹² → ~2× brain’s synapse-update rate (but on only 65K synapses, 10⁹× less capacity).

(e) Local Hebbian + 65K parallel = a 65K-dimensional feature map. Enough for PCA-level learning on 256×256 image pixels (modest resolution). SIDRA Y1 can practically do unsupervised feature learning on an MNIST-scale dataset (28×28 = 784 features). This is the Y3 prototype goal.
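
The solution arithmetic can be checked directly (the numbers are the ones from the data list above):

```python
cells = 256 * 256                      # memristors per crossbar
e_update = 1e-12                       # ~1 pJ per partial-SET Hebbian update
t_cycle = 10e-9                        # 10 ns cycle at 100 MHz
tdp = 3.0                              # W

e_crossbar = cells * e_update          # (a) ≈ 65.5 nJ per full parallel update
p_full = e_crossbar / t_cycle          # (b) ≈ 6.55 W if updated every cycle
cells_per_cycle = round(tdp * t_cycle / e_update)  # ≈ 30,000 cells within TDP
updates_per_s = cells * 50e6           # (c) all cells at half clock (50 MHz)
brain_rate = 1e14 * 17e-3              # (d) ≈ 1.7e12 synapse updates/s
ratio = updates_per_s / brain_rate     # ≈ 2× the brain's update rate

print(e_crossbar, p_full, cells_per_cycle, updates_per_s, ratio)
```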

Cheat Sheet

  • Hebb 1949: “Cells that fire together wire together” — Δw = η·x·y.
  • Pure Hebbian: unstable (w explodes). Fix: normalization (Oja) or sliding threshold (BCM).
  • Oja: Δw = η·y·(x − y·w). Bounded; converges to first PCA eigenvector.
  • BCM: Δw = η·x·y·(y − θ_M); θ_M ∝ ⟨y²⟩. Homeostatic. Brain’s likely model.
  • PCA tie: Hebbian = iterative power method on input covariance.
  • Locality: only pre + post information needed → fits analog hardware naturally.
  • SIDRA: ΔG ∝ V_pre · V_post · Δt — Hebbian is automatic on a memristor crossbar.

Vision: An Online-Learning SIDRA

Today’s GPUs do batch training — data centers, gigabytes of data, megawatt-hours of energy. The brain learns online — every moment, in tiny steps. That gap is Y100’s real claim:

  • Y1 (today): Inference-focused. Training in external GPUs; weights written to the wafer. Hebbian implementation is technically possible but not prototyped.
  • Y3 (2027): Per-crossbar local Hebbian prototype. Small unsupervised feature learning (MNIST, CIFAR-10).
  • Y10 (2029): Cross-crossbar BCM/Oja coordination. Online classification, edge AI scenarios (smart camera, sensor classification).
  • Y100 (2031+): STDP + reinforcement learning integration. Brain-style continuous learning, lifelong learning. GPT-class systems’ energy drops 1000×.
  • Y1000 (long horizon): Hebbian on bio-compatible organic synapses → brain-coupled training. Implant reads, synthetic neuron learns.

Strategic angle for Türkiye: the batch-training race — data centers, energy, carbon footprint — is one we are behind in. We can compete in online-learning hardware because the game is changing. Local learning + low energy = a different category. SIDRA’s claim of being “Türkiye’s bridge on the AI path” rests exactly here.

Unexpected future: federated SIDRA cluster. 100 SIDRA chips in different cities, each learning locally with Hebbian, weights shared via federated learning. No data center. In-country distributed AI infrastructure. 2030+ horizon.

Further Reading

  • Next chapter: 3.4 — Brain Energy Efficiency
  • Previous: 3.2 — The Synapse
  • Hebb original: D. O. Hebb, The Organization of Behavior (1949).
  • Oja’s rule: Erkki Oja, A simplified neuron model as a principal component analyzer, J. Math. Biol. 1982.
  • BCM: Bienenstock, Cooper, Munro, Theory for the development of neuron selectivity, J. Neurosci. 1982.
  • Sparse coding (V1): Olshausen & Field, Emergence of simple-cell receptive field properties…, Nature 1996.
  • Hebbian hardware: Ielmini & Wong, In-memory computing with resistive switching devices, Nature Electronics 2018.