๐Ÿ“ Module 4 ยท The Math Arsenal ยท Chapter 4.1 ยท 12 min read

Vector, Matrix, MVM

One algebra operation lives at the heart of the SIDRA crossbar.

What you'll learn here

  • Define vector and matrix as ordered arrays of numbers
  • Show that matrix-vector multiply (MVM) is a collection of dot products
  • Explain MVM's O(N²) complexity and how the crossbar collapses it to O(1) time
  • Map a 256-vector + 256×256 matrix onto a SIDRA crossbar
  • Grasp why modern AI workloads are dominated by MVM

Hook: One Operation Runs All AI

If you boil all modern AI compute down to a single operation, more than 90% turns out to be one thing: matrix-vector multiplication (MVM).

  • One transformer block: 6 MVMs (Q, K, V, O, FFN1, FFN2). Softmax and normalization are a tiny share.
  • A CNN convolution layer: filtering = MVM over a sliding window.
  • An RNN/LSTM: every time step = MVM.
  • GPT-3 inference: ~99% MVM, the rest softmax + LayerNorm.

So AI accelerator = MVM accelerator. The SIDRA crossbar exists for this: it does one MVM in a single analog step.

This chapter builds MVM math from scratch: what is a vector → what is a matrix → how does the multiply work → and why is the SIDRA crossbar so efficient at it.

Intuition: Vector, Matrix, and Multiply

A vector is an ordered list of numbers: both a point (in space) and a direction.

\mathbf{x} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}, \quad \mathbf{v} = \begin{bmatrix} 1 \\ -2 \\ 5 \\ 0 \end{bmatrix}

Two-dimensional, four-dimensional, 256-dimensional: the only difference is the count of entries.

A matrix is a rectangular table of numbers: a stack of vectors.

\mathbf{W} = \begin{bmatrix} 2 & 1 & 0 \\ -1 & 3 & 4 \end{bmatrix}

\mathbf{W} here is a 2×3 matrix (2 rows, 3 columns).

Matrix-vector multiply (MVM):

\mathbf{y} = \mathbf{W} \mathbf{x}

Meaning: each row of \mathbf{W} takes a dot product with \mathbf{x}. Result: a new vector.

Numerical example:

\mathbf{W} = \begin{bmatrix} 2 & 1 \\ -1 & 3 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}

y_1 = 2 \cdot 4 + 1 \cdot 5 = 13, \qquad y_2 = -1 \cdot 4 + 3 \cdot 5 = 11

\mathbf{y} = \begin{bmatrix} 13 \\ 11 \end{bmatrix}.

Each row is one dot product. An N \times N matrix needs N dot products: N^2 multiplies + N(N-1) adds. Complexity O(N^2).
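The worked example above can be checked in a few lines of plain Python (an illustrative loop sketch, not SIDRA tooling):

```python
# Plain-loop MVM: one dot product per row of W, N^2 multiplies in total.
def mvm(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[2, 1],
     [-1, 3]]
x = [4, 5]

print(mvm(W, x))  # [13, 11]
```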

Bottom line: a neural-net layer = MVM + activation. A crossbar = MVM. Natural fit.

Formalism: Vector Spaces and Linear Maps

L1 · Basics

Vector definition (compact):

\mathbf{x} \in \mathbb{R}^N \iff \mathbf{x} = (x_1, x_2, \ldots, x_N)^\top, \quad x_i \in \mathbb{R}

Dot product:

\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{N} a_i b_i

The dot product measures the similarity of two vectors, via the cosine of the angle between them:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta
  • \theta = angle between the vectors.
  • Same direction → \cos\theta = 1, the maximum.
  • Perpendicular → 0.
  • Opposite → -1.
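The three cases above are easy to verify numerically (a minimal sketch using only the standard library):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cos_angle(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    norm = lambda v: math.sqrt(dot(v, v))
    return dot(a, b) / (norm(a) * norm(b))

print(cos_angle([1, 0], [2, 0]))   # same direction: 1.0
print(cos_angle([1, 0], [0, 3]))   # perpendicular: 0.0
print(cos_angle([1, 0], [-4, 0]))  # opposite: -1.0
```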

Matrix-vector multiply (formal):

\mathbf{W} \in \mathbb{R}^{M \times N}, \mathbf{x} \in \mathbb{R}^N. Output \mathbf{y} \in \mathbb{R}^M:

y_i = \sum_{j=1}^{N} W_{ij} \cdot x_j

For each i, one dot product: row i of \mathbf{W} with \mathbf{x}.

L2 · Full

Matrix as a linear transformation:

A matrix \mathbf{W} is a linear function from \mathbb{R}^N to \mathbb{R}^M:

T(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha T(\mathbf{x}) + \beta T(\mathbf{y})

Geometrically: rotation, scaling, shear, projection are all matrix multiplications.
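Linearity can be checked numerically; here is a small sketch with the 2×3 matrix from earlier (values chosen so the float arithmetic is exact):

```python
def apply(W, x):
    # y_i = sum_j W[i][j] * x[j]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[2, 1, 0],
     [-1, 3, 4]]
x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, -1.0]
a, b = 2.5, -0.5

# T(a*x + b*y) versus a*T(x) + b*T(y)
lhs = apply(W, [a * xi + b * yi for xi, yi in zip(x, y)])
rhs = [a * wx + b * wy for wx, wy in zip(apply(W, x), apply(W, y))]
print(lhs, rhs)  # [6.0, 46.5] [6.0, 46.5]
```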

Typical AI matrix sizes:

Model                    One-layer matrix size
MLP MNIST classifier     784 × 128, 128 × 10
ResNet-50 last layer     2048 × 1000
GPT-2 small attention    768 × 768
GPT-3 attention          12288 × 12288
Llama-2 70B FFN          8192 × 28672

A single GPT-3 attention layer: 12288² ≈ 151M weights. 96 layers → ~14B attention weights. Per inference step, billions of multiply-accumulate operations.

Complexity: an N \times N MVM = N^2 multiplies + N(N-1) adds. Total \approx 2N^2 FLOP.

N               FLOPs   Digital time (1 GHz, 1 op/cycle)   SIDRA analog time
256             ~131K   131 µs                             ~10 ns (one MVM, parallel)
1024            ~2M     2 ms                               ~10 ns
12288 (GPT-3)   ~302M   0.3 s                              a few MVMs, still ~ns

The crossbar's edge is parallelism. All N^2 multiplies happen physically at once (Ohm's law), and all N sums happen at once (Kirchhoff's current law). Chapters 1.5 and 3.7.

L3 · Deep

Matrix-matrix multiply (the generalization of MVM):

\mathbf{C} = \mathbf{A} \mathbf{B}, with \mathbf{A} \in \mathbb{R}^{M \times K}, \mathbf{B} \in \mathbb{R}^{K \times N}:

C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}

Complexity: O(MNK). Typical AI has M = N = K → O(N^3). GPT-3 attention as a matrix-matrix multiply: 12288^3 \approx 1.85 \times 10^{12} FLOP/layer. 96 layers → ~1.8 \times 10^{14} FLOP/inference. A single H100 GPU does this in ~50 ms.
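The O(MNK) count is easy to reproduce; a triple-sum sketch plus the GPT-3-sized multiply count (illustrative, pure Python):

```python
def matmul(A, B):
    # C[i][j] = sum_k A[i][k] * B[k][j]  -> M*N*K multiplies in total
    K, N = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(len(A))]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]

# Multiply count for one GPT-3 attention layer treated as an N x N matmul:
print(12288 ** 3)  # 1855425871872, i.e. ~1.85e12
```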

SIDRA's matrix-matrix approach: a matrix-matrix multiply = N matrix-vector multiplies, which SIDRA runs sequentially. Y1: 256×256 crossbar. A 12288×12288 matrix tiles onto 48 crossbars per dimension, 48 × 48 = 2304 tiles working in parallel. Total matrix-matrix time = N × MVM time = N × 10 ns.
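Tiling a large weight matrix onto fixed 256×256 crossbars works out as follows (a hypothetical helper for the arithmetic, not SIDRA's actual mapper):

```python
import math

def tiles(rows, cols, size=256):
    # A rows x cols matrix needs ceil(rows/size) x ceil(cols/size) crossbar tiles.
    per_row, per_col = math.ceil(rows / size), math.ceil(cols / size)
    return per_row, per_col, per_row * per_col

print(tiles(12288, 12288))  # (48, 48, 2304): 48 tiles per axis, 2304 in total
```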

Sparse vs dense:

  • Dense matrix: 100% non-zero. Typical FFN.
  • Sparse matrix: mostly zero. Modern large models reach 50-90% sparsity.
  • SIDRA is ideal for dense matrices. Sparse ones need extra circuitry (or cells programmed to zero).

Low-rank approximation (LoRA):

In modern fine-tuning, instead of updating a large \mathbf{W} directly: \mathbf{W} + \mathbf{A}\mathbf{B}^\top, with \mathbf{A} \in \mathbb{R}^{N \times r}, \mathbf{B} \in \mathbb{R}^{N \times r}, r \ll N. Memory + compute drop to O(Nr). SIDRA Y10 + LoRA-style incremental learning is a strong candidate.
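The memory arithmetic behind LoRA is worth seeing concretely; the N = 12288 and rank r = 16 below are illustrative assumptions, not SIDRA specifications:

```python
def lora_params(N, r):
    # Full update: N*N entries. LoRA update A (N x r) and B (N x r): 2*N*r entries.
    full, low_rank = N * N, 2 * N * r
    return full, low_rank, full // low_rank

print(lora_params(12288, 16))  # (150994944, 393216, 384): a 384x reduction
```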

Practical SIDRA crossbar arithmetic:

256ร—256 crossbar:

  • 65,536 memristors (each 8-bit weight).
  • One MVM: 256-input vector × 256×256 matrix → 256-output vector.
  • Time: ~10 ns (read setup + Ohmic settling).
  • Energy: 256 × 0.1 pJ ≈ 26 pJ per MVM.
  • Efficiency: 65,536 MAC / 26 pJ ≈ 2.5 × 10¹⁵ MAC/J = 2500 TMAC/J ≈ 2500 TOPS/W (crossbar only; no peripheral circuitry).

Peripheral circuitry (ADC, DAC, control) is typically 50-90% of total energy; that's why real SIDRA Y1 efficiency is ~10 TOPS/W (Lab 3.4).
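Re-deriving the crossbar-only figure makes the gap to the ~10 TOPS/W system number explicit (arithmetic only; the pJ value is the chapter's assumption):

```python
macs = 256 * 256           # multiply-accumulates in one 256x256 crossbar MVM
energy_j = 26e-12          # ~26 pJ per MVM, crossbar only

macs_per_joule = macs / energy_j
print(f"{macs_per_joule:.2e} MAC/J")  # 2.52e+15, i.e. ~2500 TMAC/J crossbar-only
```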

Experiment: Step Through a 4ร—4 MVM

\mathbf{W} = \begin{bmatrix} 1 & 2 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 2 & 1 \\ 0 & 1 & 0 & 2 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} 2 \\ 1 \\ 3 \\ 1 \end{bmatrix}

By hand:

y_1 = 1 \cdot 2 + 2 \cdot 1 + 0 \cdot 3 + 1 \cdot 1 = 2 + 2 + 0 + 1 = 5

y_2 = 0 \cdot 2 + 1 \cdot 1 + 1 \cdot 3 + 0 \cdot 1 = 0 + 1 + 3 + 0 = 4

y_3 = 1 \cdot 2 + 0 \cdot 1 + 2 \cdot 3 + 1 \cdot 1 = 2 + 0 + 6 + 1 = 9

y_4 = 0 \cdot 2 + 1 \cdot 1 + 0 \cdot 3 + 2 \cdot 1 = 0 + 1 + 0 + 2 = 3

\mathbf{y} = \begin{bmatrix} 5 \\ 4 \\ 9 \\ 3 \end{bmatrix}.

Op count: 4 × 4 = 16 multiplies + 4 × 3 = 12 adds = 28 operations. Digital CPU: 28 cycles (ideal) ≈ 28 ns @ 1 GHz.
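The hand computation and the op count can be replayed in code (plain Python, for checking only):

```python
def mvm_counted(W, x):
    # Returns y = Wx plus the op count: n*n multiplies and n*(n-1) adds.
    y = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    n = len(x)
    return y, n * n + n * (n - 1)

W = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 2, 1],
     [0, 1, 0, 2]]
x = [2, 1, 3, 1]
print(mvm_counted(W, x))  # ([5, 4, 9, 3], 28)
```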

On a SIDRA crossbar:

  • 4×4 = 16 memristors, with conductances programmed as G_{ij} = W_{ij}.
  • Input voltages V_j = x_j on the 4 rows.
  • Each column sums currents: I_i = \sum_j G_{ij} V_j.
  • 16 multiplies + 12 adds in one Ohm + KCL step. Time: ~10 ns.
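The same mapping, written as a serialized simulation of the analog step (the physical device computes every product and sum simultaneously; this loop only mimics the result):

```python
def crossbar_mvm(G, V):
    # Ohm: each cell passes current G[i][j] * V[j].
    # Kirchhoff: each output line sums its cell currents.
    return [sum(g * v for g, v in zip(row, V)) for row in G]

G = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 2, 1],
     [0, 1, 0, 2]]   # conductances programmed to the weights
V = [2, 1, 3, 1]     # input vector applied as voltages

print(crossbar_mvm(G, V))  # output currents [5, 4, 9, 3]
```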

Comparison: for the same job, a CPU needs ~28 ns, a GPU ~10 ns (spread across parallel cores), and a SIDRA crossbar ~10 ns but at ~100× less energy.

Scale up: a 1024 \times 1024 MVM:

  • CPU: ~2.1M ns ≈ 2 ms.
  • GPU (all cores): ~50 ns (A100 cuBLAS).
  • SIDRA: 16 parallel crossbars (256×256 each) → ~10 ns. Matches GPU at 100× less energy.

Quick Quiz

Question 1 of 6: How much compute is in a matrix-vector multiply (MVM)?

Lab Exercise

Map all the parameters of GPT-2 small onto SIDRA Y1โ€™s crossbars.

GPT-2 small architecture:

  • Embedding: 50257 (vocab) × 768 (d_model)
  • Positional embed: 1024 × 768
  • 12 transformer blocks, each:
    • Attention W_Q, W_K, W_V, W_O: each 768 × 768
    • FFN: W_1 = 768 × 3072, W_2 = 3072 × 768
  • Final layer norm + output: 768 × 50257

SIDRA Y1:

  • 419M memristors.
  • Per crossbar: 256 × 256 = 65,536 cells.
  • Total crossbars: 419M / 65,536 ≈ 6400.

Questions:

(a) One attention matrix (768 × 768) → how many crossbars?
(b) One FFN matrix (768 × 3072) → how many crossbars?
(c) One transformer block total?
(d) 12 blocks + embeddings, all of GPT-2 small → how many crossbars?
(e) What fraction of Y1 is used? What can fill the rest?

Solutions

(a) 768 / 256 = 3 → 3×3 = 9 crossbars per attention matrix. With 4 matrices (Q, K, V, O) per block: 36 crossbars.

(b) FFN: W_1 768 × 3072 → 3 × 12 = 36 crossbars. W_2 3072 × 768 → 36 crossbars. Total 72 crossbars per FFN.

(c) 36 (attention) + 72 (FFN) = 108 crossbars per block.

(d) 12 blocks × 108 = 1296 crossbars. Embedding 50257 × 768 → ~600 crossbars. Output projection ~600. Total ~2500 crossbars.

(e) 2500 / 6400 ≈ 39%. Y1's other 61% is free → other models (BERT-small, MobileNet, small LSTM) or multi-copy GPT-2 (batching).
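The solutions above can be checked mechanically; ceil-based tiling gives 2478 crossbars, which the text rounds to ~2500 (a sketch of the arithmetic, not SIDRA's actual mapper):

```python
import math

def crossbars(rows, cols, size=256):
    return math.ceil(rows / size) * math.ceil(cols / size)

attention = 4 * crossbars(768, 768)                 # W_Q, W_K, W_V, W_O
ffn = crossbars(768, 3072) + crossbars(3072, 768)   # W_1 and W_2
block = attention + ffn
total = 12 * block + 2 * crossbars(50257, 768)      # + embedding, output projection

print(attention, ffn, block, total)  # 36 72 108 2478
print(f"{total / 6400:.0%} of Y1")   # 39% of Y1
```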

Note: this is static storage only. Inference also needs scratch memory (KV cache, activations), provided by surrounding CMOS DRAM. SIDRA Y1 = "ready-model store + MVM engine" for inference.

Cheat Sheet

  • Vector: ordered list of numbers, \mathbf{x} \in \mathbb{R}^N.
  • Matrix: rectangular table of numbers, \mathbf{W} \in \mathbb{R}^{M \times N}.
  • MVM: \mathbf{y} = \mathbf{W}\mathbf{x}, every row a dot product. Complexity O(N^2).
  • Dot product: \mathbf{a} \cdot \mathbf{b} = \sum a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta. Measures similarity.
  • MVM share in AI: ~90%+ (transformer, CNN, RNN).
  • Crossbar advantage: N^2 multiplies in parallel + N sums in parallel = O(1) steps.
  • SIDRA Y1: 256×256 crossbar, 1 MVM ~10 ns, ~26 pJ.

Vision: MVM-Native Compute and SIDRA's Role

MVM has become the central operation, but digital hardware does it badly (2N² FLOPs, sequential). The analog crossbar does it well (one step). So:

  • Y1 (today): 256×256 crossbar, ~10 TOPS/W. GPT-2 small inference on one chip.
  • Y3 (2027): 512×512 crossbar, better ADC, ~50 TOPS/W. GPT-2 medium / BERT-base.
  • Y10 (2029): 1024×1024 crossbar, 16-bit weight, ~150 TOPS/W. GPT-3 small (2.7B) on a single chip.
  • Y100 (2031+): 4096×4096 crossbar, photonic weight transfer, ~1000 TOPS/W. GPT-3 175B in a few chips. Flagship LLM inference at the edge.
  • Y1000 (long horizon): Crossbar fabric + 3D stacking + analog backprop. Training also analog.

Strategic value for Türkiye: the MVM-acceleration category has no single leader yet. NVIDIA is digital, Cerebras wafer-scale, Mythic analog-flash, Rain analog-photonic: the category is open. SIDRA, in the "memristor-based analog MVM" sub-category, is on a path to global leadership. A concrete way out of the classic CPU/GPU race.

Unexpected future: an MVM-native programming language. Today's AI software (PyTorch, JAX) thinks in FLOPs. As SIDRA-class crossbar architectures spread, an MVM-native compiler + language will emerge, fitting the hardware's real abstraction. SIDRA's software stack (Module 6) heads in that direction.

Further Reading

  • Next chapter: 4.2, Ohm + Kirchhoff = Analog MVM
  • Previous: 3.8, STDP
  • Classical reference: Strang, Linear Algebra and Its Applications, the standard textbook.
  • MVM accelerator history: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th ed.
  • In-memory computing: Sebastian et al., Memory devices and applications for in-memory computing, Nature Nanotech. 2020.
  • Modern AI accelerator comparison: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).