๐Ÿ“ Module 4 ยท The Math Arsenal ยท Chapter 4.1 ยท 12 min read

Vector, Matrix, MVM

One algebra operation lives at the heart of the SIDRA crossbar.

What you'll learn here

  • Define vector and matrix as ordered arrays of numbers
  • Show that matrix-vector multiply (MVM) is a collection of dot products
  • Explain MVM's O(N²) complexity and how the crossbar collapses it to O(1) time
  • Map a 256-vector + 256×256 matrix onto a SIDRA crossbar
  • Grasp why modern AI workloads are dominated by MVM

Hook: One Operation Runs All AI

If you boil all modern AI compute down to a single operation, more than 90% turns out to be one thing: matrix-vector multiplication (MVM).

  • One transformer block: 6 MVMs (Q, K, V, O, FFN1, FFN2). Softmax and normalization are a tiny share.
  • A CNN convolution layer: filtering = MVM over a sliding window.
  • An RNN/LSTM: every time step = MVM.
  • GPT-3 inference: ~99% MVM, the rest softmax + LayerNorm.

So AI accelerator = MVM accelerator. The SIDRA crossbar exists for this: it does one MVM in a single analog step.

This chapter builds MVM math from scratch: what is a vector → what is a matrix → how does the multiply work → and why is the SIDRA crossbar so efficient at it.

Intuition: Vector, Matrix, and Multiply

A vector is an ordered list of numbers: both a point (in space) and a direction.

\mathbf{x} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 1 \\ -2 \\ 5 \\ 0 \end{bmatrix}

Two-dimensional, four-dimensional, 256-dimensional: the only difference is the count of entries.

A matrix is a rectangular table of numbers: a stack of vectors.

\mathbf{W} = \begin{bmatrix} 2 & 1 & 0 \\ -1 & 3 & 4 \end{bmatrix}

\mathbf{W} here is a 2×3 matrix (2 rows, 3 columns).

Matrix-vector multiply (MVM):

\mathbf{y} = \mathbf{W} \mathbf{x}

Meaning: each row of \mathbf{W} takes a dot product with \mathbf{x}. Result: a new vector.

Numerical example:

\mathbf{W} = \begin{bmatrix} 2 & 1 \\ -1 & 3 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}

y_1 = 2 \cdot 4 + 1 \cdot 5 = 13, \qquad y_2 = -1 \cdot 4 + 3 \cdot 5 = 11

\mathbf{y} = \begin{bmatrix} 13 \\ 11 \end{bmatrix}.

Each row is one dot product. An N \times N matrix needs N dot products: N^2 multiplies + N(N-1) adds. Complexity O(N^2).
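The worked example above can be checked in a few lines of plain Python (an illustrative loop sketch, not SIDRA tooling):

```python
# Plain-loop MVM: one dot product per row of W, N^2 multiplies in total.
def mvm(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[2, 1],
     [-1, 3]]
x = [4, 5]

print(mvm(W, x))  # [13, 11]
```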

Bottom line: a neural-net layer = MVM + activation. A crossbar = MVM. Natural fit.

Formalism: Vector Spaces and Linear Maps

L1 · Basics

Vector definition (compact):

\mathbf{x} \in \mathbb{R}^N \iff \mathbf{x} = (x_1, x_2, \ldots, x_N)^\top, \quad x_i \in \mathbb{R}

Dot product:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{N} a_i b_i

The dot product measures the similarity of two vectors, via the cosine of the angle between them:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta
  • \theta = angle between the vectors.
  • Same direction → \cos\theta = 1, the maximum.
  • Perpendicular → 0.
  • Opposite → -1.
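The three cases above are easy to verify numerically (a minimal sketch using only the standard library):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cos_angle(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    norm = lambda v: math.sqrt(dot(v, v))
    return dot(a, b) / (norm(a) * norm(b))

print(cos_angle([1, 0], [2, 0]))   # same direction: 1.0
print(cos_angle([1, 0], [0, 3]))   # perpendicular: 0.0
print(cos_angle([1, 0], [-4, 0]))  # opposite: -1.0
```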

Matrix-vector multiply (formal):

\mathbf{W} \in \mathbb{R}^{M \times N}, \mathbf{x} \in \mathbb{R}^N. Output \mathbf{y} \in \mathbb{R}^M:

y_i = \sum_{j=1}^{N} W_{ij} \cdot x_j

For each i, one dot product: row i of \mathbf{W} with \mathbf{x}.

L2 · Full

Matrix as a linear transformation:

A matrix \mathbf{W} is a linear function from \mathbb{R}^N to \mathbb{R}^M:

T(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha T(\mathbf{x}) + \beta T(\mathbf{y})

Geometrically: rotation, scaling, shear, projection are all matrix multiplications.
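Linearity can be checked numerically; here is a small sketch with the 2×3 matrix from earlier (values chosen so the float arithmetic is exact):

```python
def apply(W, x):
    # y_i = sum_j W[i][j] * x[j]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[2, 1, 0],
     [-1, 3, 4]]
x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, -1.0]
a, b = 2.5, -0.5

# T(a*x + b*y) versus a*T(x) + b*T(y)
lhs = apply(W, [a * xi + b * yi for xi, yi in zip(x, y)])
rhs = [a * wx + b * wy for wx, wy in zip(apply(W, x), apply(W, y))]
print(lhs, rhs)  # [6.0, 46.5] [6.0, 46.5]
```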

Typical AI matrix sizes:

Model                    One-layer matrix size
MLP MNIST classifier     784 × 128, 128 × 10
ResNet-50 last layer     2048 × 1000
GPT-2 small attention    768 × 768
GPT-3 attention          12288 × 12288
Llama-2 70B FFN          8192 × 28672

A single GPT-3 attention layer: 12288² ≈ 151M weights. 96 layers → ~14B attention weights. Per inference step, billions of multiply-accumulate operations.

Complexity: an N \times N MVM = N^2 multiplies + N(N-1) adds. Total \approx 2N^2 FLOP.

N               FLOPs   Digital time (1 GHz, 1 op/cycle)   SIDRA analog time
256             ~131K   131 µs                             ~10 ns (one MVM, parallel)
1024            ~2M     2 ms                               ~10 ns
12288 (GPT-3)   ~302M   0.3 s                              a few MVMs, still ~ns

The crossbar's edge is parallelism. All N^2 multiplies happen physically at once (Ohm's law), and all N sums happen at once (Kirchhoff's current law). Chapters 1.5 and 3.7.

L3 · Deep

Matrix-matrix multiply (the generalization of MVM):

\mathbf{C} = \mathbf{A} \mathbf{B}, with \mathbf{A} \in \mathbb{R}^{M \times K}, \mathbf{B} \in \mathbb{R}^{K \times N}:

C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}

Complexity: O(MNK). Typical AI has M = N = K → O(N^3). GPT-3 attention as a matrix-matrix multiply: 12288^3 \approx 1.85 \times 10^{12} FLOP/layer. 96 layers → ~1.8 \times 10^{14} FLOP/inference. A single H100 GPU does this in ~50 ms.
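The O(MNK) count is easy to reproduce; a triple-sum sketch plus the GPT-3-sized multiply count (illustrative, pure Python):

```python
def matmul(A, B):
    # C[i][j] = sum_k A[i][k] * B[k][j]  -> M*N*K multiplies in total
    K, N = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(len(A))]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]

# Multiply count for one GPT-3 attention layer treated as an N x N matmul:
print(12288 ** 3)  # 1855425871872, i.e. ~1.85e12
```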

SIDRA's matrix-matrix approach: a matrix-matrix multiply = N matrix-vector multiplies, which SIDRA runs sequentially. Y1: 256×256 crossbar. A 12288×12288 matrix tiles onto 48 crossbars per dimension, 48 × 48 = 2304 tiles working in parallel. Total matrix-matrix time = N × MVM time = N × 10 ns.
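Tiling a large weight matrix onto fixed 256×256 crossbars works out as follows (a hypothetical helper for the arithmetic, not SIDRA's actual mapper):

```python
import math

def tiles(rows, cols, size=256):
    # A rows x cols matrix needs ceil(rows/size) x ceil(cols/size) crossbar tiles.
    per_row, per_col = math.ceil(rows / size), math.ceil(cols / size)
    return per_row, per_col, per_row * per_col

print(tiles(12288, 12288))  # (48, 48, 2304): 48 tiles per axis, 2304 in total
```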

Sparse vs dense:

  • Dense matrix: 100% non-zero. Typical FFN.
  • Sparse matrix: mostly zero. Modern large models reach 50-90% sparsity.
  • SIDRA is ideal for dense matrices. Sparse ones need extra circuitry (or cells programmed to zero).

Low-rank approximation (LoRA):

In modern fine-tuning, instead of updating a large \mathbf{W} directly: \mathbf{W} + \mathbf{A}\mathbf{B}^\top, with \mathbf{A} \in \mathbb{R}^{N \times r}, \mathbf{B} \in \mathbb{R}^{N \times r}, r \ll N. Memory + compute drop to O(Nr). SIDRA Y10 + LoRA-style incremental learning is a strong candidate.
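The memory arithmetic behind LoRA is worth seeing concretely; the N = 12288 and rank r = 16 below are illustrative assumptions, not SIDRA specifications:

```python
def lora_params(N, r):
    # Full update: N*N entries. LoRA update A (N x r) and B (N x r): 2*N*r entries.
    full, low_rank = N * N, 2 * N * r
    return full, low_rank, full // low_rank

print(lora_params(12288, 16))  # (150994944, 393216, 384): a 384x reduction
```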

Practical SIDRA crossbar arithmetic:

256ร—256 crossbar:

  • 65,536 memristors (each 8-bit weight).
  • One MVM: 256-input vector × 256×256 matrix → 256-output vector.
  • Time: ~10 ns (read setup + Ohmic settling).
  • Energy: 256 × 0.1 pJ ≈ 26 pJ per MVM.
  • Efficiency: 65,536 MAC / 26 pJ ≈ 2.5 × 10¹⁵ MAC/J = 2500 TMAC/J ≈ 2500 TOPS/W (crossbar only; no peripheral circuitry).

Peripheral circuitry (ADC, DAC, control) is typically 50-90% of total energy; that's why real SIDRA Y1 efficiency is ~10 TOPS/W (Lab 3.4).
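Re-deriving the crossbar-only figure makes the gap to the ~10 TOPS/W system number explicit (arithmetic only; the pJ value is the chapter's assumption):

```python
macs = 256 * 256           # multiply-accumulates in one 256x256 crossbar MVM
energy_j = 26e-12          # ~26 pJ per MVM, crossbar only

macs_per_joule = macs / energy_j
print(f"{macs_per_joule:.2e} MAC/J")  # 2.52e+15, i.e. ~2500 TMAC/J crossbar-only
```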

Experiment: Step Through a 4ร—4 MVM

\mathbf{W} = \begin{bmatrix} 1 & 2 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 2 & 1 \\ 0 & 1 & 0 & 2 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 2 \\ 1 \\ 3 \\ 1 \end{bmatrix}

By hand:

y_1 = 1 \cdot 2 + 2 \cdot 1 + 0 \cdot 3 + 1 \cdot 1 = 2 + 2 + 0 + 1 = 5

y_2 = 0 \cdot 2 + 1 \cdot 1 + 1 \cdot 3 + 0 \cdot 1 = 0 + 1 + 3 + 0 = 4

y_3 = 1 \cdot 2 + 0 \cdot 1 + 2 \cdot 3 + 1 \cdot 1 = 2 + 0 + 6 + 1 = 9

y_4 = 0 \cdot 2 + 1 \cdot 1 + 0 \cdot 3 + 2 \cdot 1 = 0 + 1 + 0 + 2 = 3

\mathbf{y} = \begin{bmatrix} 5 \\ 4 \\ 9 \\ 3 \end{bmatrix}.

Op count: 4 × 4 = 16 multiplies + 4 × 3 = 12 adds = 28 operations. Digital CPU: 28 cycles (ideal) ≈ 28 ns @ 1 GHz.
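The hand computation and the op count can be replayed in code (plain Python, for checking only):

```python
def mvm_counted(W, x):
    # Returns y = Wx plus the op count: n*n multiplies and n*(n-1) adds.
    y = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    n = len(x)
    return y, n * n + n * (n - 1)

W = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 2, 1],
     [0, 1, 0, 2]]
x = [2, 1, 3, 1]
print(mvm_counted(W, x))  # ([5, 4, 9, 3], 28)
```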

On a SIDRA crossbar:

  • 4×4 = 16 memristors, with conductances programmed as G_{ij} = W_{ij}.
  • Input voltages V_j = x_j on the 4 rows.
  • Each column sums currents: I_i = \sum_j G_{ij} V_j.
  • 16 multiplies + 12 adds in one Ohm + KCL step. Time: ~10 ns.
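The same mapping, written as a serialized simulation of the analog step (the physical device computes every product and sum simultaneously; this loop only mimics the result):

```python
def crossbar_mvm(G, V):
    # Ohm: each cell passes current G[i][j] * V[j].
    # Kirchhoff: each output line sums its cell currents.
    return [sum(g * v for g, v in zip(row, V)) for row in G]

G = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 2, 1],
     [0, 1, 0, 2]]   # conductances programmed to the weights
V = [2, 1, 3, 1]     # input vector applied as voltages

print(crossbar_mvm(G, V))  # output currents [5, 4, 9, 3]
```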

Comparison: for the same job, a CPU needs ~28 ns, a GPU ~10 ns (spread across parallel cores), and a SIDRA crossbar ~10 ns but at ~100× less energy.

Scale up: a 1024 \times 1024 MVM:

  • CPU: ~2.1M ns ≈ 2 ms.
  • GPU (all cores): ~50 ns (A100 cuBLAS).
  • SIDRA: 16 parallel crossbars (256×256 each) → ~10 ns. Matches GPU at 100× less energy.

Quick Quiz

Question 1 of 6: How much compute is in a matrix-vector multiply (MVM)?

Lab Exercise

Map all the parameters of GPT-2 small onto SIDRA Y1โ€™s crossbars.

GPT-2 small architecture:

  • Embedding: 50257 (vocab) × 768 (d_model)
  • Positional embed: 1024 × 768
  • 12 transformer blocks, each:
    • Attention W_Q, W_K, W_V, W_O: each 768 × 768
    • FFN: W_1 = 768 × 3072, W_2 = 3072 × 768
  • Final layer norm + output: 768 × 50257

SIDRA Y1:

  • 419M memristors.
  • Per crossbar: 256 × 256 = 65,536 cells.
  • Total crossbars: 419M / 65,536 ≈ 6400.

Questions:

(a) One attention matrix (768 × 768) → how many crossbars?
(b) One FFN matrix (768 × 3072) → how many crossbars?
(c) One transformer block total?
(d) 12 blocks + embeddings, all of GPT-2 small → how many crossbars?
(e) What fraction of Y1 is used? What can fill the rest?

Solutions

(a) 768 / 256 = 3 → 3×3 = 9 crossbars per attention matrix. With 4 matrices (Q, K, V, O) per block: 36 crossbars.

(b) FFN: W_1 768 × 3072 → 3 × 12 = 36 crossbars. W_2 3072 × 768 → 36 crossbars. Total 72 crossbars per FFN.

(c) 36 (attention) + 72 (FFN) = 108 crossbars per block.

(d) 12 blocks × 108 = 1296 crossbars. Embedding 50257 × 768 → ~600 crossbars. Output projection ~600. Total ~2500 crossbars.

(e) 2500 / 6400 ≈ 39%. Y1's other 61% is free → other models (BERT-small, MobileNet, small LSTM) or multi-copy GPT-2 (batching).
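The solutions above can be checked mechanically; ceil-based tiling gives 2478 crossbars, which the text rounds to ~2500 (a sketch of the arithmetic, not SIDRA's actual mapper):

```python
import math

def crossbars(rows, cols, size=256):
    return math.ceil(rows / size) * math.ceil(cols / size)

attention = 4 * crossbars(768, 768)                 # W_Q, W_K, W_V, W_O
ffn = crossbars(768, 3072) + crossbars(3072, 768)   # W_1 and W_2
block = attention + ffn
total = 12 * block + 2 * crossbars(50257, 768)      # + embedding, output projection

print(attention, ffn, block, total)  # 36 72 108 2478
print(f"{total / 6400:.0%} of Y1")   # 39% of Y1
```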

Note: this is static storage only. Inference also needs scratch memory (KV cache, activations), provided by surrounding CMOS DRAM. SIDRA Y1 = "ready-model store + MVM engine" for inference.

Cheat Sheet

  • Vector: ordered list of numbers, \mathbf{x} \in \mathbb{R}^N.
  • Matrix: rectangular table of numbers, \mathbf{W} \in \mathbb{R}^{M \times N}.
  • MVM: \mathbf{y} = \mathbf{W}\mathbf{x}, every row a dot product. Complexity O(N^2).
  • Dot product: \mathbf{a} \cdot \mathbf{b} = \sum a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta. Measures similarity.
  • MVM share in AI: ~90%+ (transformer, CNN, RNN).
  • Crossbar advantage: N^2 multiplies in parallel + N sums in parallel = O(1) steps.
  • SIDRA Y1: 256×256 crossbar, 1 MVM ~10 ns, ~26 pJ.

Vision: MVM-Native Compute and SIDRA's Role

MVM has become the central operation, but digital hardware does it badly (2N² FLOPs, sequential). The analog crossbar does it well (one step). So:

  • Y1 (today): 256×256 crossbar, ~10 TOPS/W. GPT-2 small inference on one chip.
  • Y3 (2027): 512×512 crossbar, better ADC, ~50 TOPS/W. GPT-2 medium / BERT-base.
  • Y10 (2029): 1024×1024 crossbar, 16-bit weight, ~150 TOPS/W. GPT-3 small (2.7B) on a single chip.
  • Y100 (2031+): 4096×4096 crossbar, photonic weight transfer, ~1000 TOPS/W. GPT-3 175B in a few chips. Flagship LLM inference at the edge.
  • Y1000 (long horizon): Crossbar fabric + 3D stacking + analog backprop. Training also analog.

Strategic value for Türkiye: the MVM-acceleration category has no single leader yet. NVIDIA is digital, Cerebras wafer-scale, Mythic analog-flash, Rain analog-photonic: the category is open. SIDRA, in the "memristor-based analog MVM" sub-category, is on a path to global leadership. A concrete way out of the classic CPU/GPU race.

Unexpected future: an MVM-native programming language. Today's AI software (PyTorch, JAX) thinks in FLOPs. As SIDRA-class crossbar architectures spread, an MVM-native compiler + language will emerge, fitting the hardware's real abstraction. SIDRA's software stack (Module 6) heads in that direction.

Further Reading

  • Next chapter: 4.2, Ohm + Kirchhoff = Analog MVM
  • Previous: 3.8, STDP
  • Classical reference: Strang, Linear Algebra and Its Applications, the standard textbook.
  • MVM accelerator history: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed.
  • In-memory computing: Sebastian et al., Memory devices and applications for in-memory computing, Nature Nanotech. 2020.
  • Modern AI accelerator comparison: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).