From Artificial Neuron to Transformer
y = f(Wx+b) — the single atom of 80 years of AI history.
Prerequisites
What you'll learn here
- Trace the chain McCulloch-Pitts → perceptron → MLP → CNN → Transformer in chronological order
- Write the single-neuron equation y = f(Wx+b) and name activation functions (ReLU, sigmoid, GELU)
- Explain why a single perceptron can't solve XOR and what depth buys you
- Write self-attention as softmax(QK^T/√d)V and describe multi-head attention
- Map a Transformer block onto a SIDRA crossbar
Hook: 80 Years, One Equation
In 1943 McCulloch and Pitts, in A Logical Calculus of the Ideas Immanent in Nervous Activity, reduced the biological neuron to a single logical unit. In 2017 Vaswani et al. published Attention is All You Need — the Transformer architecture that made ChatGPT possible.
74 years in between, thousands of papers, millions of engineer-hours. But while architectures shifted, the fundamental mathematical atom stayed the same:
y = f(Wx + b)
Multiply, sum, pass through an activation, repeat. All modern AI — vision, language, protein folding, game play — is this one sentence, stacked.
Why this matters for SIDRA: the core of that formula is Wx — a matrix-vector multiply (MVM). The crossbar is built exactly for that (chapter 1.5: Ohm + Kirchhoff = MVM). So the heart of every AI architecture maps naturally onto SIDRA hardware.
This chapter sweeps those 74 years, shows what problem each architecture solved, and explains why Transformer is ideal for the SIDRA crossbar.
Intuition: Modern AI in 9 Steps
The evolution of AI architectures in nine big steps:
| Year | Model | Contribution |
|---|---|---|
| 1943 | McCulloch-Pitts neuron | First mathematical neuron. Binary threshold. |
| 1949 | Hebbian learning | How weights update (chapter 3.3) |
| 1958 | Rosenblatt perceptron | Single-layer trainable classifier |
| 1969 | Minsky & Papert XOR critique | Perceptron can’t solve XOR → AI winter |
| 1986 | Rumelhart-Hinton-Williams backprop | Multi-layer training becomes possible (3.6) |
| 1989-1998 | LeCun CNN (LeNet) | Convolution + pooling for images |
| 1997 | Hochreiter-Schmidhuber LSTM | Long-range sequence learning |
| 2012 | Krizhevsky AlexNet | GPU + deep CNN → modern AI era |
| 2017 | Vaswani Transformer | Self-attention → GPT, BERT, etc. |
Each step fixes the previous limitation:
- Perceptron: can’t solve XOR → MLP arrives.
- MLP: too many params for 28×28 image → CNN (weight sharing).
- CNN: not great on sequences → RNN/LSTM.
- LSTM: struggles with long sequences (vanishing gradient) → Attention.
- Attention: no sense of position → Transformer (positional encoding).
Every modern giant (GPT, Claude, Gemini, LLaMA) is a Transformer variant. The underlying math: y = f(Wx + b) + self-attention.
Formalism: From a Single Neuron to a Transformer
Single artificial neuron (perceptron):
y = f(Wx + b)
- x — input vector (features)
- W — weight vector (learned)
- b — bias
- f — activation function
Activation functions:
| Name | Formula | Purpose |
|---|---|---|
| Step | 1 if z ≥ 0, else 0 | Original Rosenblatt perceptron |
| Sigmoid | σ(z) = 1 / (1 + e^(−z)) | 1940-2000 standard; vanishing gradient |
| Tanh | tanh(z) | Centered sigmoid |
| ReLU | max(0, z) | Deep-net standard since 2012 |
| GELU | z·Φ(z), Φ = standard normal CDF | Common in Transformers |
A single neuron is a classifier: with f = step, it outputs 0 or 1 — binary classification. Rosenblatt demonstrated this in 1958.
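A minimal NumPy sketch of such a neuron, with the activation swappable (the weights here are chosen by hand, not learned):

```python
import numpy as np

def step(z):    return (z >= 0).astype(float)        # Rosenblatt's original activation
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)
def gelu(z):    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def neuron(x, w, b, f=step):
    """Single artificial neuron: y = f(w·x + b)."""
    return f(np.dot(w, x) + b)

x = np.array([1.0, 1.0])            # two input features
w = np.array([1.0, 1.0])            # weights (normally learned)
b = -1.5                            # bias
print(neuron(x, w, b))              # 1.0 -- this particular (w, b) implements logical AND
```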
Multilayer perceptron (MLP):
y = f_2(W_2 · f_1(W_1 x + b_1) + b_2)
Each layer is one MVM + activation. Universal approximation theorem (Cybenko 1989, Hornik 1991): a wide enough 2-layer MLP approximates any continuous function. Depth is practical (fewer parameters for complex functions); theoretically 2 layers suffice.
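A two-layer forward pass as a sketch (ReLU hidden layer assumed; the sizes are arbitrary):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: each layer is one matrix-vector multiply plus an activation."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: MVM + ReLU
    return W2 @ h + b2                 # output layer: MVM (linear readout)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # 4 input features
W1, b1 = rng.normal(size=(16, 4)), np.zeros(16)  # 16 hidden units
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)   # 1 output
print(mlp_forward(x, W1, b1, W2, b2))
```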
The XOR problem:
A single perceptron cannot implement:
| x_1 | x_2 | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Because no linear separator works. A 2-layer MLP solves it: the hidden layer bends the decision surface.
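A brute-force illustration of the separability claim (the grid of candidate weights is arbitrary):

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def separable(targets, grid=np.linspace(-2, 2, 41)):
    """True if some (w1, w2, b) on the grid reproduces `targets` with a single step unit."""
    for w1, w2, b in product(grid, repeat=3):
        y = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        if np.array_equal(y, targets):
            return True
    return False

print(separable(np.array([0, 0, 0, 1])))   # AND: True
print(separable(np.array([0, 1, 1, 0])))   # XOR: False, no linear separator found
```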
Convolution (CNN):
MLPs have too many parameters for images (784 inputs at 28×28). The CNN fix: weight sharing + local connectivity. A 3×3 filter uses the same weights at every position.
Still an MVM — but with weight reuse. In SIDRA: one patch of the crossbar stores the filter.
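A sketch of 2-D convolution with a shared 3×3 kernel (loop version for clarity, not speed):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in CNNs):
    the same kernel weights are reused at every spatial position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # one small MVM per position
    return out

image = np.random.default_rng(0).normal(size=(28, 28))   # MNIST-sized input
edge  = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])   # vertical edge detector: 9 shared weights
print(conv2d(image, edge).shape)                          # (26, 26) feature map
```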
Recurrent (RNN):
For time-series, the output feeds into the next step:
h_t = f(W_h · h_(t−1) + W_x · x_t + b)
Problem: repeated multiplication by W_h in long sequences → the gradient vanishes (eigenvalues below 1 drive it toward 0) or explodes (eigenvalues above 1). LSTM (1997) partially fixed it with gated memory, but ranges beyond ~100 time steps remain hard.
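A small numerical illustration of the effect (the recurrent matrix here is a scaled orthogonal matrix, chosen only to make the spectral radius explicit):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
h0 = rng.normal(size=d)

for scale in (0.9, 1.1):                                     # spectral radius below vs. above 1
    W_h = scale * np.linalg.qr(rng.normal(size=(d, d)))[0]   # orthogonal matrix times a scale
    h = h0.copy()
    for _ in range(100):                                     # 100 steps of h_t = W_h @ h_(t-1)
        h = W_h @ h
    print(f"scale={scale}: |h_100| = {np.linalg.norm(h):.3g}")
# scale=0.9 -> norm collapses toward 0 (vanishing); scale=1.1 -> norm blows up (exploding)
```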
Self-attention and the Transformer:
Transformers connect every time step to every other time step directly. The mechanism:
From each input vector x_i, derive three vectors:
- Query q_i = W_Q x_i
- Key k_i = W_K x_i
- Value v_i = W_V x_i
Attention score: e_ij = (q_i · k_j) / √d (normalized dot product).
Softmax normalization: α_ij = exp(e_ij) / Σ_k exp(e_ik).
Output: z_i = Σ_j α_ij v_j.
Vectorized:
Attention(Q, K, V) = softmax(QK^T / √d) V
Multi-Head Attention: run this process in h parallel heads, concatenate. Each head learns a different relational pattern (syntax, semantics, position, etc.).
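A compact NumPy sketch of scaled dot-product attention and the multi-head split (the dimensions are toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d)) V: every token attends to every other token."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)   # (seq, seq) pairwise scores
    return softmax(scores) @ V

def multi_head(X, W_Q, W_K, W_V, W_O, n_heads):
    """Split d_model into n_heads slices, run attention per head, concatenate, project."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                        # three MVMs per token
    heads = [attention(Q[:, i*d_head:(i+1)*d_head],
                       K[:, i*d_head:(i+1)*d_head],
                       V[:, i*d_head:(i+1)*d_head]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_O                # one more MVM

rng = np.random.default_rng(0)
seq, d_model, n_heads = 10, 64, 4            # toy sizes
X = rng.normal(size=(seq, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
print(multi_head(X, *Ws, n_heads).shape)     # (10, 64)
```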
Transformer block:
z = LayerNorm(x + MultiHead(x))
out = LayerNorm(z + FFN(z))
Feed-forward network (FFN): a 2-layer MLP, FFN(z) = W_2 f(W_1 z + b_1) + b_2. Each transformer block: attention + FFN + residual + layer-norm.
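A single-head transformer block as a sketch (post-LayerNorm variant, ReLU FFN, toy dimensions; see the multi-head sketch above for the full attention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, W_Q, W_K, W_V, W_O, W1, b1, W2, b2):
    """One block: attention + residual + LayerNorm, then FFN + residual + LayerNorm."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    attn = softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V @ W_O   # single-head attention
    z = layer_norm(x + attn)
    ffn = np.maximum(0.0, z @ W1 + b1) @ W2 + b2               # 2-layer MLP
    return layer_norm(z + ffn)

rng = np.random.default_rng(0)
d, d_ff, seq = 64, 256, 10                                     # toy sizes
p = [rng.normal(size=s) * 0.05 for s in
     [(d, d)] * 4 + [(d, d_ff), (d_ff,), (d_ff, d), (d,)]]
print(transformer_block(rng.normal(size=(seq, d)), *p).shape)  # (10, 64)
```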
GPT-class models: 96-128 transformer blocks stacked, d_model up to 12288, ~175 billion params (GPT-3). Training compute 3.14 × 10²³ FLOP (~1287 MWh).
Why all this matters for SIDRA:
Every Transformer op is either an MVM (y = Wx) or a small softmax/layer-norm. MVMs are 90%+ of the total compute. The SIDRA crossbar exists for MVM → Transformers exist for SIDRA.
GPT-2 inference on SIDRA Y1:
- GPT-2 small (124M params). Each forward pass is ~250 MFLOP (roughly 2 FLOP per parameter).
- Y1: ~30 TOPS analog → 1 inference ~10 µs.
- Batch of 32 GPT-2 inferences: ~300 µs, ~1 mJ of energy.
- Cost of thinking at brain-budget scale.
Note: this is an estimate; calibration + overhead on the real Y1 prototype will add a few ms. The scale is right.
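The arithmetic behind these numbers, spelled out (the ~10 TOPS/W efficiency figure is an assumption added here, not a Y1 spec):

```python
# Back-of-envelope GPT-2-on-Y1 estimate; all inputs are the rough figures quoted above.
params           = 124e6        # GPT-2 small
flop_per_pass    = 2 * params   # ~250 MFLOP per forward pass
y1_ops_per_s     = 30e12        # ~30 TOPS analog (Y1 figure from the text)
efficiency_ops_j = 10e12        # assumed ~10 TOPS/W crossbar efficiency

latency_s = flop_per_pass / y1_ops_per_s
batch32_s = 32 * latency_s
energy_j  = 32 * flop_per_pass / efficiency_ops_j

print(f"1 inference : {latency_s*1e6:.1f} us")   # ~8 us, order of the ~10 us quoted
print(f"batch of 32 : {batch32_s*1e6:.0f} us")   # ~260 us, order of the ~300 us quoted
print(f"energy      : {energy_j*1e3:.1f} mJ")    # ~0.8 mJ, order of the ~1 mJ quoted
```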
Experiment: One Neuron Tries AND, OR, XOR
A single perceptron (f = step: output 1 if w_1 x_1 + w_2 x_2 + b ≥ 0, else 0) attempting three logic gates:
AND (y = x_1 ∧ x_2):
- Table: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1
- Weights (one valid choice): w_1 = w_2 = 1, b = −1.5 → solves ✅
OR (y = x_1 ∨ x_2):
- (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→1
- w_1 = w_2 = 1, b = −0.5 → solves ✅
XOR (y = x_1 ⊕ x_2):
- (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
- Impossible — no linear separator. Minsky-Papert 1969.
A 2-layer MLP solves XOR:
Hidden layer (2 neurons):
- h_1 = step(x_1 + x_2 − 0.5) ≈ OR
- h_2 = step(x_1 + x_2 − 1.5) ≈ AND
Output:
- y = step(h_1 − h_2 − 0.5) = OR − AND = XOR
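A quick check of this construction (weight values as reconstructed above):

```python
import numpy as np

step = lambda z: (z >= 0).astype(int)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

h1 = step(X[:, 0] + X[:, 1] - 0.5)   # hidden neuron 1 ≈ OR
h2 = step(X[:, 0] + X[:, 1] - 1.5)   # hidden neuron 2 ≈ AND
y  = step(h1 - h2 - 0.5)             # output = OR minus AND = XOR

print(y)                             # [0 1 1 0]
```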
SIDRA parallel:
- Single perceptron = one crossbar row. On Y1: 2 memristors + 1 threshold circuit.
- 2-layer MLP = 2 crossbars. Threshold circuits on both.
- GPT-2 = 124M params → 124M memristors. About 30% of Y1 stores GPT-2.
- GPT-3 = 175B params → 417 SIDRA Y1 chips (impractical on Y1); Y100 in a single chip.
Quick Quiz
Lab Exercise
Map a small Transformer block onto a SIDRA crossbar.
Data:
- Modest model: d_model = 512, d_ff = 2048, h = 8 heads.
- One Transformer block: self-attention + FFN.
- SIDRA crossbar: 256×256 memristors, 8-bit weight.
Questions:
(a) How many distinct weight matrices in self-attention? Sizes?
(b) How many 256×256 crossbars to cover each matrix?
(c) How many for the FFN?
(d) Total crossbars per transformer block?
(e) SIDRA Y1 has 419M cells = 419M / (256×256) ≈ 6400 crossbars. How many transformer blocks fit?
Solutions
(a) 4 matrices: W_Q, W_K, W_V, W_O, each 512×512. Plus two for the FFN: W_1 (512×2048), W_2 (2048×512).
(b) 512/256 = 2 tiles per dimension → 2 × 2 = 4 crossbars per attention matrix. 4 matrices × 4 = 16 crossbars for attention.
(c) FFN: 512×2048 → (2 × 8) = 16 crossbars. 2048×512 → 16 crossbars. 32 crossbars for FFN.
(d) 16 + 32 = 48 crossbars / transformer block.
(e) 6400 / 48 ≈ 133 transformer blocks. GPT-3 small (125M) is 12 blocks; GPT-3 175B is 96. SIDRA Y1 can hold a ~GPT-3-small-class model on a single chip; 96-block GPT-3 needs ~Y3 or Y10.
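The same tiling arithmetic as a short script (crossbar size and model dimensions taken from the exercise):

```python
import math

CROSSBAR = 256                                   # 256×256 memristor tile
Y1_CELLS = 419e6

def crossbars(rows, cols, tile=CROSSBAR):
    """Number of tile×tile crossbars needed to store a rows×cols weight matrix."""
    return math.ceil(rows / tile) * math.ceil(cols / tile)

d_model, d_ff = 512, 2048
attn = 4 * crossbars(d_model, d_model)           # W_Q, W_K, W_V, W_O
ffn  = crossbars(d_model, d_ff) + crossbars(d_ff, d_model)
per_block = attn + ffn

total_tiles = int(Y1_CELLS // CROSSBAR**2)
print(per_block, total_tiles, total_tiles // per_block)   # 48 crossbars/block, ~6393 tiles, ~133 blocks
```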
Note: this is only parameter storage. Loading a trained model into Y1 and running inference works. Training still happens on GPU.
Cheat Sheet
- Single neuron: y = f(Wx + b). The atom of 80 years of AI.
- Evolution: McCulloch-Pitts 1943 → Perceptron 1958 → MLP (backprop 1986) → CNN (1989) → LSTM (1997) → AlexNet (2012) → Transformer (2017).
- XOR limit: single perceptron = linear separator; MLP (hidden layer) solves XOR.
- Activations: step, sigmoid, tanh, ReLU (modern default), GELU (Transformers).
- Self-attention: softmax(QK^T/√d)V. Each token links directly to every other token.
- Transformer block: MultiHead(Attention) + FFN + residual + LayerNorm.
- SIDRA fit: MVMs are 90%+ of the compute → crossbar is the natural accelerator.
Vision: Beyond the Transformer, and SIDRA
Transformer is king today, not forever. Next steps:
- Y1 (today): Small Transformer inference (GPT-2, BERT-small) fits Y1. Edge use (smart assistant, translation).
- Y3 (2027): GPT-3-class (175B) multi-chip inference. Low-power laptop/data-center inference.
- Y10 (2029): Transformer + sparse mixture-of-experts + online learning. Brain-budget energy.
- Y100 (2031+): Post-Transformer architectures — State-Space Models (Mamba), linear attention, Mixture of Agents. SIDRA’s MVM focus fits most.
- Y1000 (long horizon): Neuromorphic-Transformer hybrid — spike-based Transformer, 1% activity. Brain-scale continuous learning.
Strategic chance for Türkiye: the US + China lead on Transformer — we caught up but aren’t in front. But post-Transformer architectures are a new category. SIDRA’s analog + online-learning foundation gives early footprint at that transition. Türkiye’s shot at shipping the first “homegrown AI architecture” lies at this junction.
Unexpected future: emergent meaning neurons. In large language models, specific neuron clusters encode specific concepts (love, numbers, Türkiye, the Euphrates). This “conceptual memory” can be mapped explicitly to hardware. SIDRA Y100 → interpretable AI crossbar — you can see which crossbar stores which concept. Critical for security/explainability first, then a new kind of AI architecture.
Further Reading
- Next chapter: 3.6 — Backpropagation
- Previous: 3.4 — Brain Energy Efficiency
- McCulloch-Pitts: A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 1943.
- Rosenblatt perceptron: F. Rosenblatt, The perceptron: a probabilistic model…, Psych. Rev. 1958.
- Minsky-Papert critique: Perceptrons: An Introduction to Computational Geometry, 1969.
- Universal approximation: Cybenko, Approximation by superpositions of a sigmoidal function, 1989.
- Transformer: Vaswani et al., Attention is all you need, NeurIPS 2017.
- GPT-3: Brown et al., Language models are few-shot learners, NeurIPS 2020.
- State Space Models (post-Transformer): Gu & Dao, Mamba: Linear-time sequence modeling…, 2023.