🧠 Module 3 · From Biology to Algorithm · Chapter 3.5 · 15 min read

From Artificial Neuron to Transformer

y = f(Wx+b) — the single atom of 80 years of AI history.

Prerequisites

What you'll learn here

  • Trace the chain McCulloch-Pitts → perceptron → MLP → CNN → Transformer in chronological order
  • Write the single-neuron equation y = f(Wx+b) and name activation functions (ReLU, sigmoid, GELU)
  • Explain why a single perceptron can't solve XOR and what depth buys you
  • Write self-attention as softmax(QK^T/√d)V and describe multi-head attention
  • Map a Transformer block onto a SIDRA crossbar

Hook: 80 Years, One Equation

In 1943 McCulloch and Pitts, in A Logical Calculus of the Ideas Immanent in Nervous Activity, reduced the biological neuron to a single logical unit. In 2017 Vaswani et al. published Attention is All You Need — the Transformer architecture that made ChatGPT possible.

74 years in between, thousands of papers, millions of engineer-hours. But while architectures shifted, the fundamental mathematical atom stayed the same:

y = f(\mathbf{W}\mathbf{x} + \mathbf{b})

Multiply, sum, pass through an activation, repeat. All modern AI — vision, language, protein folding, game play — is this one sentence, stacked.

Why this matters for SIDRA: the core of that formula is \mathbf{W}\mathbf{x} — a matrix-vector multiply (MVM). The crossbar is built exactly for that (chapter 1.5: Ohm + Kirchhoff = MVM). So the heart of every AI architecture maps naturally onto SIDRA hardware.

This chapter sweeps those 74 years, shows what problem each architecture solved, and explains why Transformer is ideal for the SIDRA crossbar.

Intuition: Modern AI in 9 Steps

The evolution of AI architectures in nine big steps:

Year | Model | Contribution
---- | ----- | ------------
1943 | McCulloch-Pitts neuron | First mathematical neuron. Binary threshold.
1949 | Hebbian learning | How weights update (chapter 3.3)
1958 | Rosenblatt perceptron | Single-layer trainable classifier
1969 | Minsky & Papert XOR critique | Perceptron can't solve XOR → AI winter
1986 | Rumelhart-Hinton-Williams backprop | Multi-layer training becomes possible (3.6)
1989-1998 | LeCun CNN (LeNet) | Convolution + pooling for images
1997 | Hochreiter-Schmidhuber LSTM | Long-range sequence learning
2012 | Krizhevsky AlexNet | GPU + deep CNN → modern AI era
2017 | Vaswani Transformer | Self-attention → GPT, BERT, etc.

Each step fixes the previous limitation:

  • Perceptron: can’t solve XOR → MLP arrives.
  • MLP: too many params for 28×28 image → CNN (weight sharing).
  • CNN: not great on sequences → RNN/LSTM.
  • LSTM: struggles with long sequences (vanishing gradient) → Attention.
  • Attention: no sense of position → Transformer (positional encoding).

Every modern giant (GPT, Claude, Gemini, LLaMA) is a Transformer variant. The underlying math: y = f(\mathbf{W}\mathbf{x} + \mathbf{b}) plus self-attention.

Formalism: From a Single Neuron to a Transformer

L1 · Beginner

Single artificial neuron (perceptron):

y = f\left(\sum_i w_i x_i + b\right) = f(\mathbf{w}^\top \mathbf{x} + b)

  • \mathbf{x} — input vector (features)
  • \mathbf{w} — weight vector (learned)
  • b — bias
  • f — activation function
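
In code, the neuron is three lines. A minimal NumPy sketch (the weights here are illustrative, not from any trained model):

```python
import numpy as np

def step(z):
    """Rosenblatt's original activation: 1 if z > 0, else 0."""
    return float(z > 0)

def neuron(x, w, b, f):
    """Single artificial neuron: y = f(w . x + b)."""
    return f(np.dot(w, x) + b)

# Illustrative threshold unit: fires only when x1 + x2 > 1.5
w = np.array([1.0, 1.0])
y0 = neuron(np.array([1.0, 0.0]), w, b=-1.5, f=step)  # sum below threshold -> 0.0
y1 = neuron(np.array([1.0, 1.0]), w, b=-1.5, f=step)  # sum above threshold -> 1.0
```

Swapping `f` for sigmoid, ReLU, or GELU changes nothing structurally: the MVM stays, only the nonlinearity differs.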

Activation functions:

Name | Formula | Purpose
---- | ------- | -------
Step | f(z) = 1 if z > 0, else 0 | Original Rosenblatt perceptron
Sigmoid | f(z) = 1/(1+e^{-z}) | Pre-ReLU standard; vanishing gradient
Tanh | f(z) = (e^z - e^{-z})/(e^z + e^{-z}) | Zero-centered sigmoid
ReLU | f(z) = \max(0, z) | Deep-net standard since 2012
GELU | f(z) = z \cdot \Phi(z) | Common in Transformers

A single neuron is a classifier: with f = step, it outputs 0 or 1 — binary classification. Rosenblatt showed this in 1958.

L2 · Full

Multilayer perceptron (MLP):

\mathbf{h}_1 = f(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1), \quad \mathbf{h}_2 = f(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2), \quad \ldots, \quad \mathbf{y} = \mathbf{W}_L \mathbf{h}_{L-1} + \mathbf{b}_L

Each layer is one MVM + activation. Universal approximation theorem (Cybenko 1989, Hornik 1991): a wide enough 2-layer MLP approximates any continuous function. Depth is practical (fewer parameters for complex functions); theoretically 2 layers suffice.
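
The layer chain above can be sketched directly (a toy forward pass with random weights; the shapes are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """MLP forward pass: each hidden layer is one MVM + activation,
    the final layer is linear, matching the equation above."""
    h = x
    for W, b in layers[:-1]:
        h = relu(W @ h + b)        # h_k = f(W_k h_{k-1} + b_k)
    W, b = layers[-1]
    return W @ h + b               # y = W_L h_{L-1} + b_L

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 4)), np.zeros(16)),   # 4 -> 16 hidden
          (rng.normal(size=(3, 16)), np.zeros(3))]    # 16 -> 3 output
y = mlp_forward(rng.normal(size=4), layers)
```

Each `(W, b)` pair is exactly one crossbar-sized MVM; depth is just more of the same atom.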

The XOR problem:

A single perceptron cannot implement:

x_1 | x_2 | XOR
--- | --- | ---
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Because no linear separator works. A 2-layer MLP solves it: the hidden layer bends the decision surface.

Convolution (CNN):

MLPs have too many parameters for images (784 inputs at 28×28). The CNN fix: weight sharing + local connectivity. A 3×3 filter uses the same weights at every position.

h_{i,j} = f\left(\sum_{m,n} w_{m,n} \, x_{i+m,\, j+n}\right)

Still an MVM — but with weight reuse. In SIDRA: one patch of the crossbar stores the filter.
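
A minimal sketch of that shared-weight sum (plain loops for clarity; `conv2d` is our own toy helper, not a library call):

```python
import numpy as np

def conv2d(x, w):
    """'Valid' 2D convolution (cross-correlation, the CNN convention):
    the same filter w is applied at every position of x."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw])  # one dot product per patch
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0          # 3x3 averaging filter: just 9 shared weights
y = conv2d(x, w)                   # output shape (3, 3)
```

Note the parameter count: 9 weights cover the whole image, versus 25×9 for a dense layer with the same receptive fields.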

Recurrent (RNN):

For time-series, the output feeds into the next step:

\mathbf{h}_t = f(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{b})

Problem: long sequences mean repeated multiplication by \mathbf{W}_h → the gradient vanishes (eigenvalue magnitudes below 1 shrink it toward 0) or explodes. LSTM (1997) partially fixed this with gated memory, but ranges beyond roughly a hundred time steps are still hard.
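
The vanishing effect is easy to see numerically. A toy sketch (the 0.5·I recurrent matrix is deliberately contrived so the shrinkage is obvious):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One recurrent step: h_t = tanh(Wx x_t + Wh h_{t-1} + b)."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

Wx = 0.1 * np.ones((4, 3))
Wh = 0.5 * np.eye(4)               # toy recurrent matrix, all eigenvalues 0.5
h = rnn_step(np.ones(3), np.zeros(4), Wx, Wh, np.zeros(4))

# Vanishing signal: multiplying repeatedly by Wh (eigenvalues < 1)
# shrinks any state/gradient component toward zero.
v = np.ones(4)
for _ in range(50):
    v = Wh @ v
shrunk = np.linalg.norm(v)         # ~ 2 * 0.5**50, numerically ~0
```

With eigenvalues above 1 the same loop explodes instead — the two failure modes Attention was invented to sidestep.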

L3 · Deep

Self-attention and the Transformer:

Transformers connect every time step to every other time step directly. The mechanism:

From each input vector \mathbf{x}_i, derive three vectors:

  • Query \mathbf{q}_i = \mathbf{W}_Q \mathbf{x}_i
  • Key \mathbf{k}_i = \mathbf{W}_K \mathbf{x}_i
  • Value \mathbf{v}_i = \mathbf{W}_V \mathbf{x}_i

Attention score: \text{score}_{ij} = \mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d} (scaled dot product).

Softmax normalization: a_{ij} = \text{softmax}_j(\text{score}_{ij}).

Output: \mathbf{y}_i = \sum_j a_{ij} \mathbf{v}_j.

Vectorized:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V

Multi-Head Attention: run this process in h parallel heads, concatenate the results, and project through \mathbf{W}_O. Each head learns a different relational pattern (syntax, semantics, position, etc.).
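
The three steps collapse into a few lines of NumPy (a single head with random illustrative weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # (n, n): token-to-token weights
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                    # 5 tokens, model dimension 8
X = rng.normal(size=(n, d))
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))
Y = attention(X @ WQ, X @ WK, X @ WV)          # one head; multi-head runs h copies
```

Note that every heavy operation here (`X @ WQ`, `Q @ K.T`, `A @ V`) is a matrix multiply — exactly what a crossbar computes.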

Transformer block:

\mathbf{z} = \text{LayerNorm}(\mathbf{x} + \text{MultiHead}(\mathbf{x})), \quad \text{Output} = \text{LayerNorm}(\mathbf{z} + \text{FFN}(\mathbf{z}))

Feed-forward network (FFN): 2-layer MLP. Each transformer block: attention + FFN + residual + layer-norm.

GPT-class models: 96-128 transformer blocks stacked, d \approx 12\text{K}, ~175 billion params (GPT-3). Training compute ~3.14 × 10²³ FLOPs (~1287 MWh).

Why all this matters for SIDRA:

Every Transformer op is either an MVM (\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V, \mathbf{W}_O, \text{FFN}_1, \text{FFN}_2) or a small softmax/layer-norm. MVMs are 90%+ of the total compute. The SIDRA crossbar exists for MVM → Transformers exist for SIDRA.

GPT-2 inference on SIDRA Y1:

  • GPT-2 small (124M params). Each forward pass ~250 MFLOPs.
  • Y1: ~30 TOPS analog → 1 inference ~10 µs.
  • Batch of 32 GPT-2 inferences: ~300 µs at ~1 mW.
  • Cost of thinking at brain-budget scale.

Note: this is an estimate; calibration + overhead on the real Y1 prototype will add a few ms. The scale is right.
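
The back-of-envelope arithmetic behind those bullet points (both input figures are the text's estimates, not measurements):

```python
# Inputs taken from the bullet list above (estimates, not measurements):
flops_per_pass = 250e6        # GPT-2 small forward pass, ~2 x 124M params
y1_ops_per_s = 30e12          # claimed Y1 analog throughput (~30 TOPS)

t_single = flops_per_pass / y1_ops_per_s   # ~8e-6 s, i.e. ~10 us per inference
t_batch32 = 32 * t_single                  # ~2.7e-4 s, i.e. ~300 us for 32
```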

Experiment: One Neuron Tries AND, OR, XOR

A single perceptron (f = step: y = 1 if w_1 x_1 + w_2 x_2 + b > 0, else 0) attempting three logic gates:

AND (x_1 \wedge x_2):

  • Table: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1
  • Weights: w_1 = w_2 = 1, b = -1.5 — solves it

OR (x_1 \vee x_2):

  • (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→1
  • w_1 = w_2 = 1, b = -0.5 — solves it

XOR (x_1 \oplus x_2):

  • (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
  • Impossible — no linear separator. Minsky-Papert 1969.

A 2-layer MLP solves XOR:

Hidden layer (2 neurons):

  • h_1 = \text{step}(x_1 + x_2 - 0.5) ≈ OR
  • h_2 = \text{step}(x_1 + x_2 - 1.5) ≈ AND

Output:

  • y = \text{step}(h_1 - h_2 - 0.5) = OR − AND = XOR
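
The construction can be verified exhaustively in a few lines (plain Python, same weights as above):

```python
def step(z):
    return 1.0 if z > 0 else 0.0

def xor_mlp(x1, x2):
    """2-layer XOR: hidden OR and AND neurons, output fires on OR-but-not-AND."""
    h1 = step(x1 + x2 - 0.5)    # ~ OR
    h2 = step(x1 + x2 - 1.5)    # ~ AND
    return step(h1 - h2 - 0.5)  # OR minus AND = XOR

# Check all four input combinations against the XOR truth table.
table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```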

SIDRA parallel:

  • Single perceptron = one crossbar row. On Y1: 2 memristors + 1 threshold circuit.
  • 2-layer MLP = 2 crossbars. Threshold circuits on both.
  • GPT-2 = 124M params → 124M memristors. About 30% of Y1 stores GPT-2.
  • GPT-3 = 175B params → 417 SIDRA Y1 chips (impractical on Y1); Y100 in a single chip.

Quick Quiz

1/6 · What is the mathematical form of a single artificial neuron?

Lab Exercise

Map a small Transformer block onto a SIDRA crossbar.

Data:

  • Modest model: d_{\text{model}} = 512, d_{\text{ff}} = 2048, h = 8 heads.
  • One Transformer block: self-attention + FFN.
  • SIDRA crossbar: 256×256 memristors, 8-bit weight.

Questions:

(a) How many distinct weight matrices in self-attention? Sizes?
(b) How many 256×256 crossbars to cover each matrix?
(c) How many for the FFN?
(d) Total crossbars per transformer block?
(e) SIDRA Y1 has 419M cells = 419M / (256×256) ≈ 6400 crossbars. How many transformer blocks fit?

Solutions

(a) 4 matrices: W_Q, W_K, W_V, W_O, each 512×512. Plus two for the FFN: W_1 (512×2048), W_2 (2048×512).

(b) 512/256 = 2 tiles per dimension → 2 × 2 = 4 crossbars per attention matrix. 4 matrices × 4 = 16 crossbars for attention.

(c) FFN: W_1 is 512×2048 → 2 × 8 = 16 crossbars; W_2 is 2048×512 → 8 × 2 = 16 crossbars. 32 crossbars for the FFN.

(d) 16 + 32 = 48 crossbars / transformer block.

(e) 6400 / 48 ≈ 133 transformer blocks. GPT-3 small (125M) has 12 blocks; GPT-3 175B has 96 — but at d_{\text{model}} \approx 12\text{K} each of its blocks is hundreds of times larger than this 512-dimensional one. SIDRA Y1 can hold a ~GPT-3-small-class model on a single chip; full GPT-3 needs ~Y3 or Y10.
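
The tile counting in (b)-(e) is one ceiling-division per matrix dimension. A sketch (the 419M-cell figure comes from the exercise statement):

```python
from math import ceil

def crossbars(rows, cols, size=256):
    """Tiles of size x size needed to store a rows x cols weight matrix."""
    return ceil(rows / size) * ceil(cols / size)

attn = 4 * crossbars(512, 512)                        # W_Q, W_K, W_V, W_O -> 16
ffn = crossbars(512, 2048) + crossbars(2048, 512)     # 16 + 16 = 32
per_block = attn + ffn                                # 48 per transformer block
blocks_on_y1 = (419_000_000 // (256 * 256)) // per_block   # ~6393 tiles // 48
```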

Note: this is only parameter storage. Loading a trained model into Y1 and running inference works. Training still happens on GPU.

Cheat Sheet

  • Single neuron: y = f(\mathbf{w}^\top \mathbf{x} + b). The atom of 80 years of AI.
  • Evolution: McCulloch-Pitts 1943 → Perceptron 1958 → MLP (backprop 1986) → CNN (1989) → LSTM (1997) → AlexNet (2012) → Transformer (2017).
  • XOR limit: single perceptron = linear separator; MLP (hidden layer) solves XOR.
  • Activations: step, sigmoid, tanh, ReLU (modern default), GELU (Transformers).
  • Self-attention: softmax(QK^T/√d)V. Each token links directly to every other token.
  • Transformer block: MultiHead(Attention) + FFN + residual + LayerNorm.
  • SIDRA fit: MVMs are 90%+ of the compute → crossbar is the natural accelerator.

Vision: Beyond the Transformer, and SIDRA

Transformer is king today, not forever. Next steps:

  • Y1 (today): Small Transformer inference (GPT-2, BERT-small) fits Y1. Edge use (smart assistant, translation).
  • Y3 (2027): GPT-3-class (175B) multi-chip inference. Low-power laptop/data-center inference.
  • Y10 (2029): Transformer + sparse mixture-of-experts + online learning. Brain-budget energy.
  • Y100 (2031+): Post-Transformer architectures — State-Space Models (Mamba), linear attention, Mixture of Agents. SIDRA’s MVM focus fits most.
  • Y1000 (long horizon): Neuromorphic-Transformer hybrid — spike-based Transformer, 1% activity. Brain-scale continuous learning.

Strategic opportunity for Türkiye: the US and China lead on Transformers — we have caught up but are not in front. Post-Transformer architectures, however, are a new category. SIDRA's analog, online-learning foundation gives it an early foothold in that transition. Türkiye's shot at shipping the first "homegrown AI architecture" lies at this junction.

Unexpected future: emergent meaning neurons. In large language models, specific neuron clusters encode specific concepts (love, numbers, Türkiye, the Euphrates). This “conceptual memory” can be mapped explicitly to hardware. SIDRA Y100 → interpretable AI crossbar — you can see which crossbar stores which concept. Critical for security/explainability first, then a new kind of AI architecture.

Further Reading

  • Next chapter: 3.6 — Backpropagation
  • Previous: 3.4 — Brain Energy Efficiency
  • McCulloch-Pitts: A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 1943.
  • Rosenblatt perceptron: F. Rosenblatt, The perceptron: a probabilistic model…, Psych. Rev. 1958.
  • Minsky-Papert critique: Perceptrons: An Introduction to Computational Geometry, 1969.
  • Universal approximation: Cybenko, Approximation by superpositions of a sigmoidal function, 1989.
  • Transformer: Vaswani et al., Attention is all you need, NeurIPS 2017.
  • GPT-3: Brown et al., Language models are few-shot learners, NeurIPS 2020.
  • State Space Models (post-Transformer): Gu & Dao, Mamba: Linear-time sequence modeling…, 2023.