Vector, Matrix, MVM
One algebra operation lives at the heart of the SIDRA crossbar.
What you'll learn here
- Define vector and matrix as ordered arrays of numbers
- Show that matrix-vector multiply (MVM) is a collection of dot products
- Explain MVM's O(N²) complexity and how the crossbar collapses it to O(1) time
- Map a 256-element vector + 256×256 matrix onto a SIDRA crossbar
- Grasp why modern AI workloads are dominated by MVM
Hook: One Operation Runs All AI
If you boil all modern AI compute down to a single operation, more than 90% turns out to be one thing: matrix-vector multiplication (MVM).
- One transformer block: 6 big MVMs (the Q, K, V, O projections plus FFN1, FFN2). Softmax and normalization are tiny shares.
- A CNN convolution layer: filtering = MVM over a sliding window.
- An RNN/LSTM: every time step = MVM.
- GPT-3 inference: ~99% MVM, the rest softmax + LayerNorm.
So AI accelerator = MVM accelerator. The SIDRA crossbar exists for this: it does one MVM in a single analog step.
This chapter builds MVM math from scratch: what a vector is → what a matrix is → how the multiply works → why the SIDRA crossbar is so efficient at it.
Intuition: Vector, Matrix, and Multiply
A vector is an ordered list of numbers: both a point (in space) and a direction.
Two-dimensional, four-dimensional, 256-dimensional: the only difference is the count of entries.
A matrix is a rectangular table of numbers: a stack of vectors.
For example, here is a 2×3 matrix (2 rows, 3 columns):

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$$

Matrix-vector multiply (MVM): $y = Ax$.

Meaning: each row of $A$ takes a dot product with $x$. Result: a new vector $y$.

Numerical example, with the matrix above and $x = (1, 0, 2)^T$:

$$y = Ax = \begin{pmatrix} 1 \cdot 1 + 2 \cdot 0 + 3 \cdot 2 \\ 4 \cdot 1 + 5 \cdot 0 + 6 \cdot 2 \end{pmatrix} = \begin{pmatrix} 7 \\ 16 \end{pmatrix}$$

Each row is one dot product. An $N \times N$ matrix needs $N$ dot products: $N^2$ multiplies + $N(N-1)$ adds. Complexity $O(N^2)$.
Bottom line: a neural-net layer = MVM + activation. A crossbar = MVM. Natural fit.
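The row-by-row dot-product view of MVM can be checked in a few lines. A minimal NumPy sketch (the matrix and vector values are illustrative):

```python
import numpy as np

# A 2x3 matrix and a 3-vector (illustrative values).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.0, 2.0])

# MVM as a collection of dot products: one per row of A.
y_rowwise = np.array([np.dot(row, x) for row in A])

# The same result via the library matmul routine.
y = A @ x

assert np.allclose(y, y_rowwise)
print(y)  # [ 7. 16.]
```

Both paths compute the same thing; the library call just fuses the loop.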
Formalism: Vector Spaces and Linear Maps
Vector definition (compact): $x \in \mathbb{R}^n$, $x = (x_1, x_2, \ldots, x_n)^T$.

Dot product:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$

The dot product measures similarity between two vectors (through the cosine): $a \cdot b = \|a\| \, \|b\| \cos\theta$.

- $\theta$ = angle between the vectors.
- Same direction → $\cos\theta = 1$, maximum.
- Perpendicular → $\cos\theta = 0$.
- Opposite → $\cos\theta = -1$.
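The three cases above reduce to one formula. A quick NumPy sketch of cosine similarity (the test vectors are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
print(cosine(a, np.array([2.0, 0.0])))   # same direction -> 1.0
print(cosine(a, np.array([0.0, 3.0])))   # perpendicular  -> 0.0
print(cosine(a, np.array([-1.0, 0.0])))  # opposite       -> -1.0
```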
Matrix-vector multiply (formal):

$A \in \mathbb{R}^{M \times N}$, $x \in \mathbb{R}^N$. Output $y = Ax \in \mathbb{R}^M$:

$$y_i = \sum_{j=1}^{N} A_{ij} x_j, \qquad i = 1, \ldots, M$$

For each $i$, one dot product: row $i$ of $A$ with $x$.
Matrix as a linear transformation:

An $M \times N$ matrix $A$ is a linear function from $\mathbb{R}^N$ to $\mathbb{R}^M$: $A(\alpha x + \beta z) = \alpha Ax + \beta Az$.

Geometrically, rotation, scaling, shear, and projection are all matrix multiplications.
Typical AI matrix sizes:
| Model | One-layer matrix size |
|---|---|
| MLP MNIST classifier | 784 × 128, 128 × 10 |
| ResNet-50 last layer | 2048 × 1000 |
| GPT-2 small attention | 768 × 768 |
| GPT-3 attention | 12288 × 12288 |
| Llama-2 70B FFN | 8192 × 28672 |
A single GPT-3 attention matrix: 12288² ≈ 151M weights. Across 96 layers → ~14.5B weights. One inference step therefore runs billions of multiply-accumulate operations.
Complexity: an $N \times N$ MVM = $N^2$ multiplies + $N(N-1)$ adds ≈ $2N^2$ FLOPs total.
| $N$ | FLOPs | Digital time (1 GHz, 1 op/cycle) | SIDRA analog time |
|---|---|---|---|
| 256 | ~131K | 131 µs | ~10 ns (one MVM, fully parallel) |
| 1024 | ~2M | 2 ms | ~10 ns |
| 12288 (GPT-3) | ~302M | 0.3 s | a few tiled MVMs, still ~ns |
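The FLOP and digital-time columns follow from ~$2N^2$ operations at 1 ns per op; a back-of-envelope sketch:

```python
# Back-of-envelope: an N x N MVM costs ~2*N^2 FLOPs; a 1 GHz core doing
# 1 op per cycle spends 1 ns per op.
costs = {}
for n in (256, 1024, 12288):
    flops = 2 * n * n
    costs[n] = flops
    print(f"N={n:>5}: {flops:>11,} FLOPs -> {flops / 1e6:.3f} ms digital")
```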
The crossbar's edge is parallelism: all multiplies happen physically at once (Ohm's law), and all sums happen at once (Kirchhoff's current law). See Chapters 1.5 and 3.7.
Matrix-matrix multiply (the generalization of MVM):
$A \in \mathbb{R}^{M \times N}$, $B \in \mathbb{R}^{N \times K}$, $C = AB \in \mathbb{R}^{M \times K}$:

$$C_{ij} = \sum_{k=1}^{N} A_{ik} B_{kj}$$
Complexity: $O(MNK)$. Typical AI has $M \approx N \approx K$ → $O(N^3)$. GPT-3 attention as a matrix-matrix mul (sequence length 2048): roughly $2 \times 12288^2 \times 2048 \approx 6 \times 10^{11}$ FLOPs/layer. 96 layers → ~$6 \times 10^{13}$ FLOPs/inference. A single H100 GPU does this in ~50 ms.
SIDRA matrix-matrix approach: a matrix-matrix multiply = $K$ matrix-vector multiplies (one per column of $B$), which SIDRA runs sequentially. Y1: 256×256 crossbar. A 12288×12288 matrix tiles onto 48 × 48 = 2304 crossbars operating in parallel. Total matrix-matrix time = $K$ × MVM time ≈ $K$ × 10 ns.
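Splitting a large matrix onto fixed-size crossbar blocks can be sketched as follows. The 256×256 tile size is from the text; the loop structure and partial-sum accumulation are an illustrative assumption:

```python
import numpy as np

TILE = 256  # crossbar dimension (SIDRA Y1)

def tiled_mvm(A, x, tile=TILE):
    """MVM computed tile by tile, as an array of crossbars would:
    each (tile x tile) block does one MVM, partial sums are accumulated."""
    m, n = A.shape
    y = np.zeros(m)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # one crossbar holds this block; its partial currents add up
            y[i:i+tile] += A[i:i+tile, j:j+tile] @ x[j:j+tile]
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 1024))
x = rng.standard_normal(1024)
assert np.allclose(tiled_mvm(A, x), A @ x)
print("1024x1024 MVM tiles onto", (1024 // TILE) ** 2, "crossbars")  # 16
```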
Sparse vs dense:
- Dense matrix: 100% non-zero. Typical FFN.
- Sparse matrix: mostly zero. Modern large models reach 50-90% sparsity.
- SIDRA is ideal for dense. Sparse needs extra circuitry (or zero-program the cell).
Low-rank approximation (LoRA):
In modern fine-tuning, instead of updating a large $W \in \mathbb{R}^{d \times k}$ directly: $W' = W + BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$. Memory + compute for the update drop from $O(dk)$ to $O(r(d + k))$. SIDRA Y10 + LoRA-style incremental learning is a strong candidate.
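A LoRA-style update in miniature (the dimensions and rank below are chosen for illustration):

```python
import numpy as np

d, k, r = 768, 768, 8  # r << min(d, k)
rng = np.random.default_rng(1)

W = rng.standard_normal((d, k))   # frozen base weights
B = rng.standard_normal((d, r))   # trainable low-rank factors
A = rng.standard_normal((r, k))

W_prime = W + B @ A               # effective updated weight

full_params = d * k               # updating W directly: 589,824 values
lora_params = r * (d + k)         # updating B and A: 12,288 values (~2%)
print(full_params, lora_params)
```

At rank 8 the trainable-parameter count drops to about 2% of the full matrix, which is the point of the $O(r(d+k))$ figure.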
Practical SIDRA crossbar arithmetic:
256×256 crossbar:
- 65,536 memristors (each an 8-bit weight).
- One MVM: 256-input vector × 256×256 matrix → 256-output vector.
- Time: ~10 ns (read setup + Ohmic settling).
- Energy: 256 × 0.1 pJ ≈ 26 pJ per MVM.
- Efficiency: 65,536 MAC / 26 pJ ≈ 2.5 × 10¹⁵ MAC/J = 2.5 PMAC/J (crossbar only; no peripheral circuitry).
Peripheral circuitry (ADC, DAC, control) is typically 50-90% of total energy; that is why real system-level SIDRA Y1 efficiency is ~10 TOPS/W (lab 3.4).
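The crossbar-only efficiency figure reduces to one division (using the per-column read energy assumed in the text):

```python
# Crossbar-only energy/efficiency estimate for a 256x256 array,
# assuming ~0.1 pJ per column read.
macs_per_mvm = 256 * 256                  # 65,536 MACs per MVM
energy_per_mvm_pj = 256 * 0.1             # ~26 pJ per MVM
macs_per_joule = macs_per_mvm / (energy_per_mvm_pj * 1e-12)
print(f"{macs_per_joule:.2e} MAC/J")      # ~2.5e+15 -> ~2.5 PMAC/J
```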
Experiment: Step Through a 4ร4 MVM
Take, for example:

$$A = \begin{pmatrix} 1 & 2 & 0 & 1 \\ 0 & 1 & 3 & 2 \\ 2 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \qquad x = \begin{pmatrix} 1 \\ 2 \\ 0 \\ 1 \end{pmatrix}$$

By hand:

$$y = Ax = \begin{pmatrix} 1 + 4 + 0 + 1 \\ 0 + 2 + 0 + 2 \\ 2 + 0 + 0 + 1 \\ 1 + 2 + 0 + 0 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \\ 3 \\ 3 \end{pmatrix}$$
Op count: 4 × 4 = 16 multiplies + 4 × 3 = 12 adds = 28 operations. Digital CPU: 28 cycles (ideal) → 28 ns @ 1 GHz.
On a SIDRA crossbar:
- 4×4 = 16 memristors $G_{ij}$ (programmed conductances).
- $V_i$: input voltages on the 4 rows.
- Each column $j$ sums currents: $I_j = \sum_i G_{ij} V_i$.
- 16 multiplies + 12 sums in one Ohmic + KCL step. Time: ~10 ns.
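The Ohm + KCL picture can be simulated directly: conductances hold the weights, row voltages carry the input, and column currents are the output. A sketch with illustrative values; programming the transpose is one common mapping convention, assumed here:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 2.0],
              [2.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 0.0]])   # weight matrix (illustrative)
x = np.array([1.0, 2.0, 0.0, 1.0])     # input vector

G = A.T   # program the transpose so column j collects row j's dot product
V = x     # input voltages on the rows

# Ohm per cell (I_ij = G_ij * V_i), KCL per column (I_j = sum_i G_ij * V_i):
I = G.T @ V
assert np.allclose(I, A @ x)
print(I)  # [6. 4. 3. 3.]
```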
Comparison: for the same job, a CPU needs 28 ns, a GPU reaches ~10 ns (parallel cores), and the SIDRA crossbar takes ~10 ns but with ~100× less energy.
Scale up: a 1024×1024 MVM:
- CPU: ~2.1M ns → ~2 ms.
- GPU (all cores): ~50 ns (A100 cuBLAS).
- SIDRA: 16 parallel crossbars (256×256 each) → ~10 ns. Matches the GPU at ~100× less energy.
Lab Exercise
Map all the parameters of GPT-2 small onto SIDRA Y1โs crossbars.
GPT-2 small architecture:
- Embedding: 50257 (vocab) × 768 (d_model)
- Positional embed: 1024 × 768
- 12 transformer blocks, each:
  - Attention W_Q, W_K, W_V, W_O: each 768 × 768
  - FFN: W_1 = 768 × 3072, W_2 = 3072 × 768
- Final layer norm + output: 768 × 50257
SIDRA Y1:
- 419M memristors.
- Per crossbar: 256 × 256 = 65,536 cells.
- Total crossbars: 419M / 65,536 ≈ 6400.
Questions:
(a) One attention matrix (768 × 768): how many crossbars?
(b) One FFN matrix (768 × 3072): how many crossbars?
(c) One transformer block in total?
(d) 12 blocks + embeddings, all of GPT-2 small: how many crossbars?
(e) What fraction of Y1 is used? What can fill the rest?
Solutions
(a) 768 / 256 = 3 → 3 × 3 = 9 crossbars per attention matrix. With 4 matrices (Q, K, V, O) per block: 36 crossbars.
(b) FFN: 768 × 3072 → 3 × 12 = 36 crossbars. 3072 × 768 → 36 crossbars. Total: 72 crossbars per FFN.
(c) 36 (attention) + 72 (FFN) = 108 crossbars per block.
(d) 12 blocks × 108 = 1296 crossbars. Embedding 50257 × 768 → ~600 crossbars. Output projection: ~600. Total: ~2500 crossbars.
(e) 2500 / 6400 ≈ 39%. Y1's other 61% is free: other models (BERT-small, MobileNet, a small LSTM) or multiple copies of GPT-2 (batching).
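The tile counts in (a)-(d) all come from ceiling division; a quick script (layer shapes from the exercise, 256×256 tile size from the Y1 spec):

```python
from math import ceil

TILE = 256  # crossbar dimension

def tiles(rows, cols, tile=TILE):
    """Crossbars needed to store a rows x cols matrix as tile x tile blocks."""
    return ceil(rows / tile) * ceil(cols / tile)

attn_block = 4 * tiles(768, 768)                  # Q, K, V, O  -> 36
ffn_block = tiles(768, 3072) + tiles(3072, 768)   #             -> 72
per_block = attn_block + ffn_block                #             -> 108
twelve_blocks = 12 * per_block                    #             -> 1296
embedding = tiles(50257, 768)                     #             -> 591 (~600)
print(attn_block, ffn_block, per_block, twelve_blocks, embedding)
```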
Note: this is static storage only. Inference also needs scratch memory (KV cache, activations), provided by surrounding CMOS DRAM. SIDRA Y1 = "ready-model store + MVM engine" for inference.
Cheat Sheet
- Vector: ordered list of numbers, $x \in \mathbb{R}^n$.
- Matrix: rectangular table of numbers, $A \in \mathbb{R}^{M \times N}$.
- MVM: $y = Ax$, every row a dot product. Complexity $O(N^2)$.
- Dot product: $a \cdot b = \sum_i a_i b_i = \|a\| \|b\| \cos\theta$. Measures similarity.
- MVM share in AI: ~90%+ (transformer, CNN, RNN).
- Crossbar advantage: all $N^2$ multiplies in parallel + all sums in parallel = $O(1)$ time steps.
- SIDRA Y1: 256×256 crossbar, 1 MVM ≈ 10 ns, ~26 pJ.
Vision: MVM-Native Compute and SIDRA's Role
MVM has become the central operation, but digital hardware does it badly (2N² FLOPs, sequential). The analog crossbar does it well (one step). So:
- Y1 (today): 256×256 crossbar, ~10 TOPS/W. GPT-2 small inference on one chip.
- Y3 (2027): 512×512 crossbar, better ADC, ~50 TOPS/W. GPT-2 medium / BERT-base.
- Y10 (2029): 1024×1024 crossbar, 16-bit weights, ~150 TOPS/W. GPT-3 small (2.7B) on a single chip.
- Y100 (2031+): 4096×4096 crossbar, photonic weight transfer, ~1000 TOPS/W. GPT-3 175B in a few chips. Best-in-class LLM inference at the edge.
- Y1000 (long horizon): crossbar fabric + 3D stacking + analog backprop. Training also goes analog.
Strategic value for Türkiye: the MVM-acceleration category has no single leader yet. NVIDIA is digital, Cerebras wafer-scale, Mythic analog-flash, Rain analog-photonic: the category is open. SIDRA, in the "memristor-based analog MVM" sub-category, is on a path to global leadership. A concrete way out of the classic CPU/GPU race.
Unexpected future: an MVM-native programming language. Today's AI software (PyTorch, JAX) thinks in FLOPs. As SIDRA-class crossbar architectures spread, an MVM-native compiler + language will emerge, fitting the hardware's real abstraction. SIDRA's software stack (Module 6) heads in that direction.
Further Reading
- Next chapter: 4.2, Ohm + Kirchhoff = Analog MVM
- Previous: 3.8, STDP
- Classical reference: Strang, Linear Algebra and Its Applications, the standard textbook.
- MVM accelerator history: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed.
- In-memory computing: Sebastian et al., Memory devices and applications for in-memory computing, Nature Nanotech. 2020.
- Modern AI accelerator comparison: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).