📐 Module 4 · The Math Arsenal · Chapter 4.8 · 12 min read

Linear Algebra Laboratory

Pull Module 4 together through one end-to-end project — SIDRA on MNIST.

What you'll learn here

  • Combine Module 4's seven concepts in one end-to-end project
  • Apply the math of an MNIST classifier (MVM, gradient, probability, quantization)
  • Trace the mapping of math onto SIDRA hardware step by step
  • Prepare for Module 5 (Chip Hardware)

Hook: Math That Lives on SIDRA

Across Module 4 we covered seven mathematical concepts in detail: vector/MVM (4.1), Ohm+KCL bridge (4.2), derivative/gradient (4.3), probability/noise (4.4), Fourier (4.5), quantization (4.6), information theory (4.7).

Each is its own field. But for SIDRA, they form one story:

AI models are sequences of MVMs. Every MVM lives in hardware via Ohm + KCL. Training needs gradients. Noise is both a problem and a solution. Fourier speeds up large convolutions. Quantization bounds bit depth. Information theory gives the theoretical ceiling.

This chapter shows that story end-to-end through a concrete exercise: build an MNIST classifier and see where each concept enters. You’ll close the chapter ready for Module 5 (Chip Hardware).

Intuition: 7 Concepts, 1 End-to-End Project

MNIST classification is simple: 28×28 grayscale handwritten digits → 10 classes (0-9). Mathematically, it uses every concept in Module 4:

Math (Module 4) → Where it appears in the MNIST pipeline

  • 4.1 Vector/Matrix/MVM: input (784-vector), weight matrices, layer outputs
  • 4.2 Ohm + KCL = MVM: every MVM runs analog on the SIDRA crossbar
  • 4.3 Derivative/Gradient: training (backprop), loss minimization
  • 4.4 Probability/Noise: impact of SIDRA noise on classification
  • 4.5 Fourier: absent in this MLP, but present in convolutional models
  • 4.6 Quantization: FP32 training → INT8 inference (SIDRA 256 levels)
  • 4.7 Information theory: cross-entropy loss, model capacity

Project structure:

  1. Data preprocessing: pixels → vector → normalize.
  2. Architecture: 2-layer MLP (784 → 128 → 10).
  3. Training: SGD + backprop (on the GPU).
  4. Quantization: FP32 → INT8 (for SIDRA).
  5. Deploy: program the weights into the SIDRA crossbar.
  6. Inference: analog MVM + CMOS activation + ADC.
  7. Accuracy: expect 97-98%.

Formalism: End-to-End Math

L1 · Starter

1. Data: MNIST image $\mathbf{x} \in \mathbb{R}^{784}$, label $y \in \{0, 1, \ldots, 9\}$.

Normalize: $\mathbf{x} \leftarrow (\mathbf{x} - \mu)/\sigma$, with $\mu, \sigma$ computed from the training set.

2. Architecture:

First layer: $\mathbf{z}_1 = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1$, $\mathbf{W}_1 \in \mathbb{R}^{128 \times 784}$.

Activation: $\mathbf{h}_1 = \mathrm{ReLU}(\mathbf{z}_1)$.

Second layer: $\mathbf{z}_2 = \mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2$, $\mathbf{W}_2 \in \mathbb{R}^{10 \times 128}$.

Output: $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}_2)$, a 10-class probability distribution.
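The two-layer forward pass can be sketched in NumPy; the random weights below are stand-ins for trained parameters, but the shapes and operations match the architecture above:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Random weights stand in for trained parameters (shapes match the text)
W1 = rng.normal(0.0, 0.05, (128, 784)); b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128));  b2 = np.zeros(10)

def forward(x):
    h1 = relu(W1 @ x + b1)           # layer 1: MVM + bias + ReLU
    y_hat = softmax(W2 @ h1 + b2)    # layer 2: MVM + bias + softmax
    return h1, y_hat

x = rng.normal(0.0, 1.0, 784)        # stand-in for a normalized MNIST image
h1, y_hat = forward(x)
```

On SIDRA, the two `@` products are the analog crossbar operations; ReLU and softmax run in digital CMOS.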

3. Loss (information theory):

Cross-entropy: $L = -\log \hat{y}_{y^*}$, with $y^*$ the true label.

4. Gradient (calculus):

Backprop:

  • $\delta_2 = \hat{\mathbf{y}} - \mathbf{e}_{y^*}$ (one-hot target)
  • $\nabla_{\mathbf{W}_2} L = \delta_2 \mathbf{h}_1^\top$
  • $\delta_1 = \mathbf{W}_2^\top \delta_2 \odot \mathbb{1}[\mathbf{z}_1 > 0]$ (ReLU derivative)
  • $\nabla_{\mathbf{W}_1} L = \delta_1 \mathbf{x}^\top$

Update: $\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla L$.
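The four backprop formulas plus the SGD update, as a single-sample training step in NumPy (random data and weights as stand-ins; repeated steps on one example should drive the loss down):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random stand-ins for weights and one training example
W1 = rng.normal(0.0, 0.05, (128, 784)); b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128));  b2 = np.zeros(10)
x = rng.normal(0.0, 1.0, 784)
y_star = 7

def sgd_step(eta=0.01):
    global W1, b1, W2, b2
    # forward
    z1 = W1 @ x + b1; h1 = relu(z1)
    y_hat = softmax(W2 @ h1 + b2)
    # backward: exactly the four formulas above
    e = np.zeros(10); e[y_star] = 1.0
    delta2 = y_hat - e
    grad_W2 = np.outer(delta2, h1)
    delta1 = (W2.T @ delta2) * (z1 > 0)
    grad_W1 = np.outer(delta1, x)
    # SGD update
    W2 -= eta * grad_W2; b2 -= eta * delta2
    W1 -= eta * grad_W1; b1 -= eta * delta1
    return float(-np.log(y_hat[y_star]))   # cross-entropy before the update

losses = [sgd_step() for _ in range(20)]
```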

L2 · Full

5. Training loop:

  • 60,000 training images.
  • 10 epochs (10 full passes over the data).
  • Batch 64, learning rate 0.01.
  • FP32 arithmetic (on GPU).

Post-training: FP32 accuracy ~98%.
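The epoch/batch structure of this loop can be sketched as follows; the data here is synthetic and shrunk 100× so the sketch runs instantly, and the actual SGD step (forward + backprop + update, per the formulas above) is elided:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for the 60,000-image MNIST training set, shrunk 100x
X = rng.normal(0.0, 1.0, (600, 784))
Y = rng.integers(0, 10, len(X))

EPOCHS, BATCH, LR = 10, 64, 0.01   # hyperparameters from the text

def iterate_batches(X, Y, batch):
    idx = rng.permutation(len(X))   # reshuffle every epoch
    for i in range(0, len(X), batch):
        j = idx[i:i + batch]
        yield X[j], Y[j]

n_steps = 0
for epoch in range(EPOCHS):
    for xb, yb in iterate_batches(X, Y, BATCH):
        # forward + backprop + SGD update would go here
        n_steps += 1
```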

6. Quantization:

Weights FP32 → INT8:

  • Dynamic range: take min/max of $\mathbf{W}$.
  • 256 levels: step size $\Delta = (\max - \min)/255$.
  • Round: $W \to \mathrm{round}\big((W - \min)/\Delta\big)$.

Activations also INT8 quantized (layer-wise calibration).

Post-quantization accuracy: ~97.8% (0.2% loss).
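A minimal min-max INT8 quantizer matching the steps above (note that 256 levels are separated by 255 steps, so the step size divides by 255):

```python
import numpy as np

def quantize_int8(W, levels=256):
    """Uniform min-max quantization to `levels` discrete codes (0..255)."""
    w_min, w_max = float(W.min()), float(W.max())
    delta = (w_max - w_min) / (levels - 1)   # 255 steps span 256 levels
    codes = np.clip(np.round((W - w_min) / delta), 0, levels - 1).astype(np.uint8)
    return codes, w_min, delta

def dequantize(codes, w_min, delta):
    return codes.astype(np.float32) * delta + w_min

rng = np.random.default_rng(2)
W = rng.normal(0.0, 0.05, (128, 784)).astype(np.float32)
codes, w_min, delta = quantize_int8(W)
W_hat = dequantize(codes, w_min, delta)
max_err = float(np.abs(W - W_hat).max())   # bounded by delta / 2
```

The maximum round-trip error is half a step, which is why the accuracy loss after quantization stays small.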

7. Deploy to SIDRA:

Each INT8 weight → memristor conductance level.

  • $W_{ij} \in \{0, 1, \ldots, 255\}$ → $G_{ij} \in [G_{\min}, G_{\max}]$.
  • ISPP programs each cell to its target $G$ (chapter 5.5).
  • Positive weight = $G$; negative weight = separate “negative” crossbar (or offset on a single crossbar).
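One way to sketch the code-to-conductance mapping and the differential positive/negative split; the conductance window values here are illustrative assumptions, not SIDRA specifications:

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # assumed conductance window in siemens (illustrative)

def weight_to_conductance(codes, levels=256):
    """Linearly map INT8 codes {0..255} into [G_MIN, G_MAX]."""
    return G_MIN + codes.astype(np.float64) * (G_MAX - G_MIN) / (levels - 1)

def signed_split(W):
    """Differential scheme: positive part on one crossbar, negative part
    on the other, so that W = W_pos - W_neg with both parts non-negative."""
    return np.maximum(W, 0.0), np.maximum(-W, 0.0)

rng = np.random.default_rng(3)
W = rng.normal(0.0, 0.05, (10, 128))
W_pos, W_neg = signed_split(W)
G = weight_to_conductance(np.array([0, 255], dtype=np.uint8))  # endpoints
```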

8. Inference (SIDRA):

  • Input $\mathbf{x}$ converted to voltages $\mathbf{V}$ by the DAC.
  • Crossbar MVM: $\mathbf{I} = \mathbf{G}^\top \mathbf{V}$ (Ohm + KCL, 10 ns).
  • ADC: $\mathbf{I}$ → INT8 integers.
  • CMOS: apply ReLU, pass to the next layer.
  • Last layer: digital softmax + argmax → class.

Total inference: ~50-100 ns + ADC overhead. 10M+ MNIST inferences per second.

L3 · Deep

9. Noise analysis:

Each MVM output: $\mathbf{I} = \mathbf{G}^\top \mathbf{V} + \boldsymbol{\epsilon}$, $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$, $\sigma \approx 5\%$ relative.

2 layers → noise accumulates: $\sigma_{\text{out}} \approx \sqrt{\sigma_1^2 + \sigma_2^2} \approx 7\%$ relative.

Classification margin (top-1 vs top-2 score gap) is typically 20-50%. Noise < margin → classification still correct.

But: in hard examples, margins shrink (e.g. 4 vs 9, 1 vs 7). Noise pushes past threshold → misclassification. Expected loss: 0.5-1%.

Averaging: 4× re-reads halves the noise. Time and energy 4× → +0.2% accuracy. Trade-off.
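The averaging claim can be checked numerically; this sketch models the noise as multiplicative Gaussian at 5% relative, per the text, and compares single reads against 4-read averaging:

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_mvm(G, v, rel_sigma=0.05, reads=1):
    """Analog MVM with 5% relative (multiplicative) Gaussian noise,
    averaged over `reads` re-reads."""
    ideal = G @ v
    samples = [ideal * (1.0 + rng.normal(0.0, rel_sigma, ideal.shape))
               for _ in range(reads)]
    return np.mean(samples, axis=0)

G = rng.uniform(0.0, 1.0, (10, 128))
v = rng.uniform(0.0, 1.0, 128)
ideal = G @ v

err1 = np.mean([np.abs(noisy_mvm(G, v, reads=1) - ideal).mean()
                for _ in range(500)])
err4 = np.mean([np.abs(noisy_mvm(G, v, reads=4) - ideal).mean()
                for _ in range(500)])
# err4 / err1 should be close to 1/sqrt(4) = 0.5
```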

10. Energy:

  • MVM: 26 pJ + ADC 256 pJ + DAC 128 pJ + control 50 pJ ≈ 460 pJ.
  • 2-layer inference: 2 × 460 pJ = 920 pJ ≈ 1 nJ/inference.
  • SIDRA Y1 at 3 W TDP → 3 × 10¹² pJ/s ÷ ~10³ pJ ≈ 3 × 10⁹ inferences/s as an energy-bound ceiling; in practice, throughput is latency-limited to tens of millions per second.

Compare: an H100 at batch 32 sustains ~100 GOPS on this workload → ~10 µs/inference ≈ 100K/s per stream; with heavy batching, millions/s. SIDRA is very efficient precisely at small batch sizes.

11. Information-theoretic frame:

MNIST label distribution (10 balanced classes): $H(Y) = \log_2 10 \approx 3.32$ bits.

Model cross-entropy: $H(Y \mid X) \approx 0.07$ bits (trained).

Mutual information $I(X; Y) = H(Y) - H(Y \mid X) \approx 3.25$ bits: the model extracts about 3.25 bits of label information from each input.

Weights 100K × 8 bits = 800 kbit total. Most of it is “redundant”: from an information-bottleneck perspective, pruning + quantization can compress far further.
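The entropy and mutual-information numbers check out in a few lines (the 0.07-bit conditional entropy is the trained-model figure quoted above, not computed from an actual model):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H_Y = entropy_bits(np.full(10, 0.1))   # uniform prior over 10 digits: log2(10)
H_Y_given_X = 0.07                     # trained-model figure from the text
I_XY = H_Y - H_Y_given_X               # information extracted per image
```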

12. Fourier (optional):

If we used a CNN: convolution layers. Large filters could use FFT-based acceleration. For MNIST, small 3×3 filters suffice, direct MVM.

Experiment: Inference on One MNIST Image

Test image “7” (784 pixels, normalized):

Layer 1 (784 → 128):

  • Input vector $\mathbf{x}$: 784 values in [-1, 1].
  • Weight matrix $\mathbf{W}_1$: 128×784 ≈ 100K values. SIDRA crossbars are 256×256, so 784 input rows span 4 partial tiles (784 = 3 × 256 + 16) → 3 or 4 crossbars. Simplification: crop 16 border pixels to 768 inputs → 3 crossbars.
  • MVM result $\mathbf{z}_1 \in \mathbb{R}^{128}$.
  • ReLU: zero out negatives, keep positives.
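The tile counts above can be computed generically; the 256×256 tile size follows the text:

```python
import math

TILE = 256   # SIDRA crossbar tile size (256×256), per the text

def tiles_needed(rows, cols, size=TILE):
    """Number of size×size crossbar tiles a rows×cols weight matrix occupies."""
    return math.ceil(rows / size) * math.ceil(cols / size)

full_layer1 = tiles_needed(128, 784)      # 784 inputs → 4 tiles
cropped_layer1 = tiles_needed(128, 768)   # crop to 768 inputs → 3 tiles
layer2 = tiles_needed(10, 128)            # fits in a single tile
```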

Layer 2 (128 → 10):

  • Input $\mathbf{h}_1 \in \mathbb{R}^{128}$.
  • Weights $\mathbf{W}_2 \in \mathbb{R}^{10 \times 128}$: a single 128×128 crossbar is enough (only the first 10 output columns are used).
  • MVM result $\mathbf{z}_2 \in \mathbb{R}^{10}$.
  • Softmax (digital): probability per class.

Output: probability vector, e.g. $(0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.01, 0.85, 0.03, 0.03)$.

Argmax: index 7 → class “7”. Correct!
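The final argmax on the example probability vector, as a one-liner:

```python
import numpy as np

# Probability vector from the worked example above
p = np.array([0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.01, 0.85, 0.03, 0.03])
pred = int(np.argmax(p))   # index of the largest probability
```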

Latency:

  • Layer 1 (3 MVMs in parallel): 10 ns + ADC 5 ns = 15 ns.
  • Layer 2 (1 MVM): 15 ns.
  • CMOS activation + softmax: ~20 ns.
  • Total: ~50 ns/inference. Theoretical 20M inferences/s.

Energy:

  • 4 MVMs × 460 pJ = ~2 nJ.
  • SIDRA Y1 3 W × 50 ns = 150 nJ, but with low activity → real ~2-10 nJ.

Accuracy:

  • FP32 model: 98%.
  • SIDRA INT8 + noise: 97.5-97.8%.
  • On a single “7” image: correct classification probability > 99%.

Comprehensive Quiz

Tests all seven chapters of Module 4. Each question combines 2-3 concepts.

1/8 · An MLP layer $\mathbf{y} = \mathrm{ReLU}(\mathbf{W}\mathbf{x} + \mathbf{b})$. On SIDRA, which steps run on the analog crossbar, and which in digital CMOS?

Integrated Lab: Design Your MNIST Model

Apply Module 4 through the following steps.

Task: deploy a “3 vs 8” binary classifier onto SIDRA Y1.

Parameters:

  • Data: only 3s and 8s from MNIST (~12,000 training images).
  • Target: 2-class accuracy > 99%.

Decisions:

(a) Architecture: MLP 784 → 64 → 2, or a CNN with 3×3 filters + max pool + FC? Which fits SIDRA?

(b) Training: how many epochs? Optimizer (SGD vs Adam)? Learning rate?

(c) Quantization: INT8 post-training, or QAT? Training budget?

(d) SIDRA mapping: crossbars used? Memory budget?

(e) Noise analysis: averaging factor for noisy inference?

(f) Deploy: per-inference latency and energy?

Solutions

(a) MLP 784 → 64 → 2. Simple, fast, ideal for SIDRA. A CNN could be more accurate but 3 vs 8 is simple → MLP suffices. Crossbars: ~3 × 256×256 for the first layer, 1 small one for the second.

(b) Adam optimizer, lr=0.001, 5 epochs. Small dataset → few epochs suffice. Adam is hyperparameter-tolerant. Batch 64.

(c) QAT, 3 FP32 epochs + 2 INT8-simulated epochs. Post-training loses ~0.2% but QAT is safer especially with a >99% target.

(d) 4 crossbars total. 3 for the first layer (784/256 ≈ 3), 1 small for the second. ~260K cells total, ~0.06% of Y1. Plenty of room for other models or ensembles.

(e) The 2-class margin is large (FP32 confidence near 100%) → noise is mostly tolerated. Averaging factor 1 (a single read suffices) unless the application is critical.

(f) Latency: 30 ns/inference (2 MVMs + CMOS). Energy: ~0.5 nJ. Throughput: 6M inferences/s per chip. Power: 3 W (TDP-bound).

Extension: the same 4 crossbars can store 10 different “one-vs-all” classifiers → a full 10-class MNIST classifier. Ensemble MNIST accuracy ~98.5%.

Module 4 Cheat Sheet

At-a-glance gains:

  • ✅ Vector, matrix, MVM — the atomic operation of AI.
  • ✅ Ohm + KCL = analog MVM physics bridge.
  • ✅ Derivative + gradient — the math atom of training.
  • ✅ Probability + noise — SIDRA’s reality + AI’s regularizer.
  • ✅ Fourier — signal processing + some AI architectures (FNO, FNet).
  • ✅ Quantization — bit depth = SIDRA cell level.
  • ✅ Information theory — theoretical capacity bound + AI loss functions.
  • ✅ End-to-end: MNIST pipeline on SIDRA Y1 at ~1 nJ/inference.

Ready for SIDRA: Module 5 turns this math into silicon. ADC, TDC, sense-amplifier, compute engine, DMA — all are circuit implementations of Module 4 concepts.

Vision: Math → Silicon → AI

Module 4 gave math; Module 5 will give silicon. But the bridge is not one-way:

  • Y1 (today): Classical math (FP32) → quantization → SIDRA. Direction: math → hardware.
  • Y3 (2027): Hardware-aware training. Noise, quantization, gradient tuned to SIDRA physics. Bidirectional.
  • Y10 (2029): Hardware-software co-design. Architecture, quantization, model optimized together. The compiler uses Module 4 math to tune circuit parameters.
  • Y100 (2031+): Math-native hardware. Device physics (memristor, photonic) directly implements AI primitives (MVM, gradient, attention). Mathematical abstraction = hardware abstraction.
  • Y1000 (long horizon): New math. Analog AI develops its own math — stochastic, non-linear, brain-like rule systems.

Meaning for Türkiye: as Module 4 showed, classical academic math education maps directly onto SIDRA. Türkiye's strong math + physics + engineering tradition is natural SIDRA infrastructure. Math is among Türkiye's strongest areas of global performance (olympiad medals, academic publications); channeling that stock into SIDRA is a strategic opportunity.

Unexpected future: AI discovering its own math. Today’s AI uses human math. Tomorrow’s AI discovers: new theorems (Wu 2024 examples), new algorithms (AlphaEvolve), new physics (PINN). SIDRA Y100 + symbolic AI → Türkiye’s first “automated science discovery” system is a real possibility.

Further Reading

  • Next module: 🚧 5.1 · The Neuromorphic Computing Paradigm — Coming soon
  • Previous: 4.7 — Information Theory
  • Module 1 summary: 1.10 — Physics Module Review
  • Module 2 summary: 2.10 — Chemistry Module Review
  • Linear algebra (classical): Strang, Introduction to Linear Algebra, 6th ed.
  • Math + AI together: Goodfellow, Bengio, Courville, Deep Learning — Chapters 2-4 are the math basis.
  • Information theory + ML: MacKay, Information Theory, Inference, and Learning Algorithms.
  • Modern AI math: Deisenroth, Faisal, Ong, Mathematics for Machine Learning.