🧠 Module 3 · From Biology to Algorithm · Chapter 3.5 · 15 min read

From Artificial Neuron to Transformer

y = f(Wx+b) — the single atom of 80 years of AI history.

Prerequisites

What you'll learn here

  • Trace the chain McCulloch-Pitts → perceptron → MLP → CNN → Transformer in chronological order
  • Write the single-neuron equation y = f(Wx+b) and name activation functions (ReLU, sigmoid, GELU)
  • Explain why a single perceptron can't solve XOR and what depth buys you
  • Write self-attention as softmax(QK^T/√d)V and describe multi-head attention
  • Map a Transformer block onto a SIDRA crossbar

Hook: 80 Years, One Equation

In 1943 McCulloch and Pitts, in A Logical Calculus of the Ideas Immanent in Nervous Activity, reduced the biological neuron to a single logical unit. In 2017 Vaswani et al. published Attention is All You Need — the Transformer architecture that made ChatGPT possible.

74 years in between, thousands of papers, millions of engineer-hours. But while architectures shifted, the fundamental mathematical atom stayed the same:

y = f(\mathbf{W}\mathbf{x} + \mathbf{b})

Multiply, sum, pass through an activation, repeat. All modern AI — vision, language, protein folding, game play — is this one sentence, stacked.

Why this matters for SIDRA: the core of that formula is \mathbf{W}\mathbf{x} — a matrix-vector multiply (MVM). The crossbar is built exactly for that (chapter 1.5: Ohm + Kirchhoff = MVM). So the heart of every AI architecture maps naturally onto SIDRA hardware.

This chapter sweeps those 74 years, shows what problem each architecture solved, and explains why Transformer is ideal for the SIDRA crossbar.

Intuition: Modern AI in 9 Steps

The evolution of AI architectures in nine big steps:

Year | Model | Contribution
---- | ----- | ------------
1943 | McCulloch-Pitts neuron | First mathematical neuron. Binary threshold.
1949 | Hebbian learning | How weights update (chapter 3.3)
1958 | Rosenblatt perceptron | Single-layer trainable classifier
1969 | Minsky & Papert XOR critique | Perceptron can't solve XOR → AI winter
1986 | Rumelhart-Hinton-Williams backprop | Multi-layer training becomes possible (3.6)
1989-1998 | LeCun CNN (LeNet) | Convolution + pooling for images
1997 | Hochreiter-Schmidhuber LSTM | Long-range sequence learning
2012 | Krizhevsky AlexNet | GPU + deep CNN → modern AI era
2017 | Vaswani Transformer | Self-attention → GPT, BERT, etc.

Each step fixes the previous limitation:

  • Perceptron: can’t solve XOR → MLP arrives.
  • MLP: too many params for 28×28 image → CNN (weight sharing).
  • CNN: not great on sequences → RNN/LSTM.
  • LSTM: struggles with long sequences (vanishing gradient) → Attention.
  • Attention: no sense of position → Transformer (positional encoding).

Every modern giant (GPT, Claude, Gemini, LLaMA) is a Transformer variant. The underlying math: y = f(\mathbf{W}\mathbf{x} + \mathbf{b}) plus self-attention.

Formalism: From a Single Neuron to a Transformer

L1 · Beginner

Single artificial neuron (perceptron):

y = f\left(\sum_i w_i x_i + b\right) = f(\mathbf{w}^\top \mathbf{x} + b)

  • \mathbf{x} — input vector (features)
  • \mathbf{w} — weight vector (learned)
  • b — bias
  • f — activation function
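
In code, the neuron is three lines. A minimal NumPy sketch (the weights here are illustrative, not from any trained model):

```python
import numpy as np

def step(z):
    """Rosenblatt's original activation: 1 if z > 0, else 0."""
    return float(z > 0)

def neuron(x, w, b, f):
    """Single artificial neuron: y = f(w . x + b)."""
    return f(np.dot(w, x) + b)

# Illustrative threshold unit: fires only when x1 + x2 > 1.5
w = np.array([1.0, 1.0])
y0 = neuron(np.array([1.0, 0.0]), w, b=-1.5, f=step)  # sum below threshold -> 0.0
y1 = neuron(np.array([1.0, 1.0]), w, b=-1.5, f=step)  # sum above threshold -> 1.0
```

Swapping `f` for sigmoid, ReLU, or GELU changes nothing structurally: the MVM stays, only the nonlinearity differs.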

Activation functions:

Name | Formula | Purpose
---- | ------- | -------
Step | f(z) = 1 if z > 0, else 0 | Original Rosenblatt perceptron
Sigmoid | f(z) = 1/(1+e^{-z}) | Pre-ReLU standard; vanishing gradient
Tanh | f(z) = (e^z - e^{-z})/(e^z + e^{-z}) | Zero-centered sigmoid
ReLU | f(z) = \max(0, z) | Deep-net standard since 2012
GELU | f(z) = z \cdot \Phi(z) | Common in Transformers

A single neuron is a classifier: with f = step, it outputs 0 or 1 — binary classification. Rosenblatt showed this in 1958.

L2 · Full

Multilayer perceptron (MLP):

\mathbf{h}_1 = f(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \quad \mathbf{h}_2 = f(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2), \quad \ldots, \quad \mathbf{y} = \mathbf{W}_L \mathbf{h}_{L-1} + \mathbf{b}_L

Each layer is one MVM + activation. Universal approximation theorem (Cybenko 1989, Hornik 1991): a wide enough 2-layer MLP approximates any continuous function. Depth is practical (fewer parameters for complex functions); theoretically 2 layers suffice.
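
The layer chain above can be sketched directly (a toy forward pass with random weights; the shapes are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """MLP forward pass: each hidden layer is one MVM + activation,
    the final layer is linear, matching the equation above."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)        # h_k = f(W_k h_{k-1} + b_k)
    W, b = layers[-1]
    return W @ h + b               # y = W_L h_{L-1} + b_L

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 4)), np.zeros(16)),   # 4 -> 16 hidden
          (rng.normal(size=(3, 16)), np.zeros(3))]    # 16 -> 3 output
y = mlp_forward(rng.normal(size=4), layers)
```

Each `(W, b)` pair is exactly one crossbar-sized MVM; depth is just more of the same atom.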

The XOR problem:

A single perceptron cannot implement:

x_1 | x_2 | XOR
--- | --- | ---
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Because no linear separator works. A 2-layer MLP solves it: the hidden layer bends the decision surface.

Convolution (CNN):

MLPs have too many parameters for images (784 inputs at 28×28). The CNN fix: weight sharing + local connectivity. A 3×3 filter uses the same weights at every position.

h_{i,j} = f\left(\sum_{m,n} w_{m,n} \, x_{i+m,\, j+n}\right)

Still an MVM — but with weight reuse. In SIDRA: one patch of the crossbar stores the filter.
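
A minimal sketch of that shared-weight sum (plain loops for clarity; `conv2d` is our own toy helper, not a library call):

```python
import numpy as np

def conv2d(x, w):
    """'Valid' 2D convolution (cross-correlation, the CNN convention):
    the same filter w is applied at every position of x."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw])  # one dot product per patch
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0          # 3x3 averaging filter: just 9 shared weights
y = conv2d(x, w)                   # output shape (3, 3)
```

Note the parameter count: 9 weights cover the whole image, versus 25×9 for a dense layer with the same receptive fields.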

Recurrent (RNN):

For time-series, the output feeds into the next step:

\mathbf{h}_t = f(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{b})

Problem: long sequences mean repeated multiplication by \mathbf{W}_h → the gradient vanishes (eigenvalue magnitudes below 1 shrink it toward 0) or explodes. LSTM (1997) partially fixed this with gated memory, but ranges beyond roughly a hundred time steps are still hard.
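
The vanishing effect is easy to see numerically. A toy sketch (the 0.5·I recurrent matrix is deliberately contrived so the shrinkage is obvious):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One recurrent step: h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

Wx = 0.1 * np.ones((4, 3))
Wh = 0.5 * np.eye(4)               # toy recurrent matrix, all eigenvalues 0.5
h = rnn_step(np.ones(3), np.zeros(4), Wx, Wh, np.zeros(4))

# Vanishing signal: multiplying repeatedly by Wh (eigenvalues < 1)
# shrinks any state/gradient component toward zero.
v = np.ones(4)
for _ in range(50):
    v = Wh @ v
shrunk = np.linalg.norm(v)         # ~ 2 * 0.5**50, numerically ~0
```

With eigenvalues above 1 the same loop explodes instead — the two failure modes Attention was invented to sidestep.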

L3 · Deep

Self-attention and the Transformer:

Transformers connect every time step to every other time step directly. The mechanism:

From each input vector \mathbf{x}_i, derive three vectors:

  • Query \mathbf{q}_i = \mathbf{W}_Q \mathbf{x}_i
  • Key \mathbf{k}_i = \mathbf{W}_K \mathbf{x}_i
  • Value \mathbf{v}_i = \mathbf{W}_V \mathbf{x}_i

Attention score: \text{score}_{ij} = \mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d} (scaled dot product).

Softmax normalization: a_{ij} = \text{softmax}_j(\text{score}_{ij}).

Output: \mathbf{y}_i = \sum_j a_{ij} \mathbf{v}_j.

Vectorized:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V

Multi-Head Attention: run this process in h parallel heads, concatenate the results, and project through \mathbf{W}_O. Each head learns a different relational pattern (syntax, semantics, position, etc.).
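
The three steps collapse into a few lines of NumPy (a single head with random illustrative weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # (n, n): token-to-token weights
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                    # 5 tokens, model dimension 8
X = rng.normal(size=(n, d))
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))
Y = attention(X @ WQ, X @ WK, X @ WV)          # one head; multi-head runs h copies
```

Note that every heavy operation here (`X @ WQ`, `Q @ K.T`, `A @ V`) is a matrix multiply — exactly what a crossbar computes.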

Transformer block:

\mathbf{z} = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x})), \quad \text{Output} = \text{LayerNorm}(\mathbf{z} + \text{FFN}(\mathbf{z}))

Feed-forward network (FFN): 2-layer MLP. Each transformer block: attention + FFN + residual + layer-norm.

GPT-class models: 96-128 transformer blocks stacked, d \approx 12\text{K}, ~175 billion params (GPT-3). Training compute ~3.14 × 10²³ FLOPs (~1287 MWh).

Why all this matters for SIDRA:

Every Transformer op is either an MVM (\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V, \mathbf{W}_O, \text{FFN}_1, \text{FFN}_2) or a small softmax/layer-norm. MVMs are 90%+ of the total compute. The SIDRA crossbar exists for MVM → Transformers exist for SIDRA.

GPT-2 inference on SIDRA Y1:

  • GPT-2 small (124M params). Each forward pass ~250 MFLOPs.
  • Y1: ~30 TOPS analog → 1 inference ~10 µs.
  • Batch of 32 GPT-2 inferences: ~300 µs at ~1 mW.
  • Cost of thinking at brain-budget scale.

Note: this is an estimate; calibration + overhead on the real Y1 prototype will add a few ms. The scale is right.
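
The back-of-envelope arithmetic behind those bullet points (both input figures are the text's estimates, not measurements):

```python
# Inputs taken from the bullet list above (estimates, not measurements):
flops_per_pass = 250e6        # GPT-2 small forward pass, ~2 x 124M params
y1_ops_per_s = 30e12          # claimed Y1 analog throughput (~30 TOPS)

t_single = flops_per_pass / y1_ops_per_s   # ~8e-6 s, i.e. ~10 us per inference
t_batch32 = 32 * t_single                  # ~2.7e-4 s, i.e. ~300 us for 32
```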

Experiment: One Neuron Tries AND, OR, XOR

A single perceptron (f = step: y = 1 if w_1 x_1 + w_2 x_2 + b > 0, else 0) attempting three logic gates:

AND (x_1 \wedge x_2):

  • Table: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1
  • Weights: w_1 = w_2 = 1, b = -1.5 — solves it

OR (x_1 \vee x_2):

  • (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→1
  • w_1 = w_2 = 1, b = -0.5 — solves it

XOR (x_1 \oplus x_2):

  • (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
  • Impossible — no linear separator. Minsky-Papert 1969.

A 2-layer MLP solves XOR:

Hidden layer (2 neurons):

  • h_1 = \text{step}(x_1 + x_2 - 0.5) ≈ OR
  • h_2 = \text{step}(x_1 + x_2 - 1.5) ≈ AND

Output:

  • y = \text{step}(h_1 - h_2 - 0.5) = OR − AND = XOR
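
The construction can be verified exhaustively in a few lines (plain Python, same weights as above):

```python
def step(z):
    return 1.0 if z > 0 else 0.0

def xor_mlp(x1, x2):
    """2-layer XOR: hidden OR and AND neurons, output fires on OR-but-not-AND."""
    h1 = step(x1 + x2 - 0.5)    # ~ OR
    h2 = step(x1 + x2 - 1.5)    # ~ AND
    return step(h1 - h2 - 0.5)  # OR minus AND = XOR

# Check all four input combinations against the XOR truth table.
table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```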

SIDRA parallel:

  • Single perceptron = one crossbar row. On Y1: 2 memristors + 1 threshold circuit.
  • 2-layer MLP = 2 crossbars. Threshold circuits on both.
  • GPT-2 = 124M params → 124M memristors. About 30% of Y1 stores GPT-2.
  • GPT-3 = 175B params → 417 SIDRA Y1 chips (impractical on Y1); Y100 in a single chip.

Quick Quiz

1/6 · What is the mathematical form of a single artificial neuron?

Lab Exercise

Map a small Transformer block onto a SIDRA crossbar.

Data:

  • Modest model: d_{\text{model}} = 512, d_{\text{ff}} = 2048, h = 8 heads.
  • One Transformer block: self-attention + FFN.
  • SIDRA crossbar: 256×256 memristors, 8-bit weight.

Questions:

(a) How many distinct weight matrices in self-attention? Sizes?
(b) How many 256×256 crossbars to cover each matrix?
(c) How many for the FFN?
(d) Total crossbars per transformer block?
(e) SIDRA Y1 has 419M cells = 419M / (256×256) ≈ 6400 crossbars. How many transformer blocks fit?

Solutions

(a) 4 matrices: W_Q, W_K, W_V, W_O, each 512×512. Plus two for the FFN: W_1 (512×2048), W_2 (2048×512).

(b) 512/256 = 2 tiles per dimension → 2 × 2 = 4 crossbars per attention matrix. 4 matrices × 4 = 16 crossbars for attention.

(c) FFN: W_1 is 512×2048 → 2 × 8 = 16 crossbars; W_2 is 2048×512 → 8 × 2 = 16 crossbars. 32 crossbars for the FFN.

(d) 16 + 32 = 48 crossbars / transformer block.

(e) 6400 / 48 ≈ 133 transformer blocks. GPT-3 small (125M) has 12 blocks; GPT-3 175B has 96 — but at d_{\text{model}} \approx 12\text{K} each of its blocks is hundreds of times larger than this 512-dimensional one. SIDRA Y1 can hold a ~GPT-3-small-class model on a single chip; full GPT-3 needs ~Y3 or Y10.
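
The tile counting in (b)-(e) is one ceiling-division per matrix dimension. A sketch (the 419M-cell figure comes from the exercise statement):

```python
from math import ceil

def crossbars(rows, cols, size=256):
    """Tiles of size x size needed to store a rows x cols weight matrix."""
    return ceil(rows / size) * ceil(cols / size)

attn = 4 * crossbars(512, 512)                        # W_Q, W_K, W_V, W_O -> 16
ffn = crossbars(512, 2048) + crossbars(2048, 512)     # 16 + 16 = 32
per_block = attn + ffn                                # 48 per transformer block
blocks_on_y1 = (419_000_000 // (256 * 256)) // per_block   # ~6393 tiles // 48
```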

Note: this is only parameter storage. Loading a trained model into Y1 and running inference works. Training still happens on GPU.

Cheat Sheet

  • Single neuron: y = f(\mathbf{w}^\top \mathbf{x} + b). The atom of 80 years of AI.
  • Evolution: McCulloch-Pitts 1943 → Perceptron 1958 → MLP (backprop 1986) → CNN (1989) → LSTM (1997) → AlexNet (2012) → Transformer (2017).
  • XOR limit: single perceptron = linear separator; MLP (hidden layer) solves XOR.
  • Activations: step, sigmoid, tanh, ReLU (modern default), GELU (Transformers).
  • Self-attention: softmax(QK^T/√d)V. Each token links directly to every other token.
  • Transformer block: MultiHead(Attention) + FFN + residual + LayerNorm.
  • SIDRA fit: MVMs are 90%+ of the compute → crossbar is the natural accelerator.

Vision: Beyond the Transformer, and SIDRA

Transformer is king today, not forever. Next steps:

  • Y1 (today): Small Transformer inference (GPT-2, BERT-small) fits Y1. Edge use (smart assistant, translation).
  • Y3 (2027): GPT-3-class (175B) multi-chip inference. Low-power laptop/data-center inference.
  • Y10 (2029): Transformer + sparse mixture-of-experts + online learning. Brain-budget energy.
  • Y100 (2031+): Post-Transformer architectures — State-Space Models (Mamba), linear attention, Mixture of Agents. SIDRA’s MVM focus fits most.
  • Y1000 (long horizon): Neuromorphic-Transformer hybrid — spike-based Transformer, 1% activity. Brain-scale continuous learning.

Strategic opportunity for Türkiye: the US and China lead on Transformers — we have caught up but are not in front. Post-Transformer architectures, however, are a new category. SIDRA's analog, online-learning foundation gives it an early foothold in that transition. Türkiye's shot at shipping the first "homegrown AI architecture" lies at this junction.

Unexpected future: emergent meaning neurons. In large language models, specific neuron clusters encode specific concepts (love, numbers, Türkiye, the Euphrates). This “conceptual memory” can be mapped explicitly to hardware. SIDRA Y100 → interpretable AI crossbar — you can see which crossbar stores which concept. Critical for security/explainability first, then a new kind of AI architecture.

Further Reading

  • Next chapter: 3.6 — Backpropagation
  • Previous: 3.4 — Brain Energy Efficiency
  • McCulloch-Pitts: A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 1943.
  • Rosenblatt perceptron: F. Rosenblatt, The perceptron: a probabilistic model…, Psych. Rev. 1958.
  • Minsky-Papert critique: Perceptrons: An Introduction to Computational Geometry, 1969.
  • Universal approximation: Cybenko, Approximation by superpositions of a sigmoidal function, 1989.
  • Transformer: Vaswani et al., Attention is all you need, NeurIPS 2017.
  • GPT-3: Brown et al., Language models are few-shot learners, NeurIPS 2020.
  • State Space Models (post-Transformer): Gu & Dao, Mamba: Linear-time sequence modeling…, 2023.