Linear Algebra Laboratory
Pull Module 4 together through one end-to-end project — SIDRA on MNIST.
What you'll learn here
- Combine Module 4's seven concepts in one end-to-end project
- Apply the math of an MNIST classifier (MVM, gradient, probability, quantization)
- Trace the mapping of math onto SIDRA hardware step by step
- Prepare for Module 5 (Chip Hardware)
Hook: Math That Lives on SIDRA
Across Module 4 we covered seven mathematical concepts in detail: vector/MVM (4.1), Ohm+KCL bridge (4.2), derivative/gradient (4.3), probability/noise (4.4), Fourier (4.5), quantization (4.6), information theory (4.7).
Each is its own field. But for SIDRA, they form one story:
AI models are sequences of MVMs. Every MVM lives in hardware via Ohm + KCL. Training needs gradients. Noise is both a problem and a solution. Fourier speeds up large convolutions. Quantization bounds bit depth. Information theory gives the theoretical ceiling.
This chapter shows that story end-to-end through a concrete exercise: build an MNIST classifier and see where each concept enters. You’ll close the chapter ready for Module 5 (Chip Hardware).
Intuition: 7 Concepts, 1 End-to-End Project
MNIST classification is simple: 28×28 grayscale handwritten digits → 10 classes (0-9). Mathematically, it uses every concept in Module 4:
| Math (Module 4) | Where it appears in the MNIST pipeline |
|---|---|
| 4.1 Vector/Matrix/MVM | Input (784-vector), weights (matrices), layer output |
| 4.2 Ohm+KCL=MVM | Every MVM runs analog on the SIDRA crossbar |
| 4.3 Derivative/Gradient | Training (backprop), loss minimization |
| 4.4 Probability/Noise | Impact of SIDRA noise on classification |
| 4.5 Fourier | (Not used in this MLP; appears in convolutional variants) |
| 4.6 Quantization | FP32 training → INT8 inference (SIDRA 256 levels) |
| 4.7 Information theory | Cross-entropy loss, model capacity |
Project structure:
- Data preprocessing: pixels → vector → normalize.
- Architecture: 2-layer MLP (784 → 128 → 10).
- Training: SGD + backprop (on the GPU).
- Quantization: FP32 → INT8 (for SIDRA).
- Deploy: program the weights into the SIDRA crossbar.
- Inference: analog MVM + CMOS activation + ADC.
- Accuracy: expect 97-98%.
Formalism: End-to-End Math
1. Data: MNIST image $x \in \mathbb{R}^{784}$ (flattened 28×28 pixels), label $y \in \{0, 1, \dots, 9\}$.
Normalize: $x' = (x - \mu)/\sigma$, with $\mu$ and $\sigma$ computed from the training set.
2. Architecture:
First layer: $z_1 = W_1 x' + b_1$, with $W_1 \in \mathbb{R}^{128 \times 784}$, $b_1 \in \mathbb{R}^{128}$.
Activation: $a_1 = \mathrm{ReLU}(z_1) = \max(0, z_1)$.
Second layer: $z_2 = W_2 a_1 + b_2$, with $W_2 \in \mathbb{R}^{10 \times 128}$, $b_2 \in \mathbb{R}^{10}$.
Output: $\hat{y} = \mathrm{softmax}(z_2)$, a 10-class probability vector.
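The forward pass above can be sketched in a few lines of NumPy. The weights here are random stand-ins for the trained parameters; only the shapes and operations follow the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the trained parameters of the 784 -> 128 -> 10 MLP.
W1 = rng.normal(0.0, 0.05, (128, 784))
b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128))
b2 = np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def forward(x):
    z1 = W1 @ x + b1               # first MVM: 784 -> 128
    a1 = np.maximum(0.0, z1)       # ReLU
    z2 = W2 @ a1 + b2              # second MVM: 128 -> 10
    return softmax(z2)             # 10-class probability vector

x = rng.normal(0.0, 1.0, 784)      # stand-in for one normalized MNIST image
p = forward(x)
```

On SIDRA, the two `@` lines are exactly the operations that move onto the crossbars; everything else stays in digital CMOS.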
3. Loss (information theory):
Cross-entropy: $L = -\log \hat{y}_y$, with $y$ the true label (equivalently $L = -\sum_i y_i \log \hat{y}_i$ for one-hot $y$).
4. Gradient (calculus):
Backprop:
- $\partial L / \partial z_2 = \hat{y} - y$ (one-hot $y$)
- $\partial L / \partial z_1 = (W_2^\top \, \partial L / \partial z_2) \odot \mathbb{1}[z_1 > 0]$ (ReLU derivative)
Update: $W \leftarrow W - \eta \, \partial L / \partial W$, with learning rate $\eta$.
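The backprop equations above translate to NumPy almost one-to-one. This is a minimal sketch on one random example (random weights, label fixed at 7 for illustration), not a full training implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random parameters and one training example (label y = 7).
W1 = rng.normal(0.0, 0.05, (128, 784)); b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128)); b2 = np.zeros(10)
x = rng.normal(0.0, 1.0, 784)
y = 7

# Forward pass.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)
z2 = W2 @ a1 + b2
e = np.exp(z2 - z2.max())
yhat = e / e.sum()

# Backward pass, following the equations above.
dz2 = yhat.copy()
dz2[y] -= 1.0                      # dL/dz2 = yhat - onehot(y)
dW2 = np.outer(dz2, a1)
db2 = dz2
dz1 = (W2.T @ dz2) * (z1 > 0)      # ReLU derivative masks inactive units
dW1 = np.outer(dz1, x)
db1 = dz1

# SGD update with learning rate eta.
eta = 0.01
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```

Note that $\partial L/\partial z_2$ sums to zero: both $\hat{y}$ and the one-hot label sum to 1.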
5. Training loop:
- 60,000 training images.
- 10 epochs (10 full passes over the training data).
- Batch 64, learning rate 0.01.
- FP32 arithmetic (on GPU).
Post-training: FP32 accuracy ~98%.
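The training loop can be sketched as below. For brevity this uses a single linear layer and synthetic stand-in data instead of real MNIST and the two-layer MLP; the batch size, learning rate, and epoch count match the list above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for MNIST: 640 samples, 784 features, labels 0-9.
X = rng.normal(0.0, 1.0, (640, 784))
Y = rng.integers(0, 10, 640)

W = rng.normal(0.0, 0.05, (10, 784))
b = np.zeros(10)
lr, batch, epochs = 0.01, 64, 10

def mean_loss():
    z = X @ W.T + b
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(Y)), Y].mean()     # mean cross-entropy

loss_before = mean_loss()
for _ in range(epochs):
    perm = rng.permutation(len(X))
    for s in range(0, len(X), batch):
        xb, yb = X[perm[s:s + batch]], Y[perm[s:s + batch]]
        z = xb @ W.T + b
        z = z - z.max(axis=1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
        dz = p
        dz[np.arange(len(yb)), yb] -= 1.0         # softmax + cross-entropy gradient
        dz /= len(yb)
        W -= lr * dz.T @ xb                       # SGD step
        b -= lr * dz.sum(axis=0)
loss_after = mean_loss()
```

The loop structure (shuffle, slice into batches, gradient, update) is the same for the full two-layer model; a framework like PyTorch automates the gradient computation.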
6. Quantization:
Weights FP32 → INT8:
- Dynamic range: min/max of the trained weights $W$.
- 256 levels: step size $s = (W_{\max} - W_{\min}) / 255$.
- Round: $W_{\mathrm{int8}} = \mathrm{round}\big((W - W_{\min}) / s\big)$, giving codes 0-255.
Activations also INT8 quantized (layer-wise calibration).
Post-quantization accuracy: ~97.8% (0.2% loss).
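A minimal sketch of this min/max quantization step, with random weights standing in for the trained FP32 layer. Codes are stored offset-binary as 0..255 (`uint8`), matching SIDRA's 256 cell levels.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend these are the trained FP32 weights of one layer.
W = rng.normal(0.0, 0.1, (128, 784)).astype(np.float32)

# Min/max (asymmetric) quantization onto 256 levels, as described above.
w_min, w_max = float(W.min()), float(W.max())
scale = (w_max - w_min) / 255.0
codes = np.round((W - w_min) / scale).astype(np.uint8)

# Dequantize to measure the worst-case quantization error.
W_deq = codes.astype(np.float32) * scale + w_min
max_err = float(np.abs(W - W_deq).max())
```

Rounding to the nearest level bounds the per-weight error by half a step, which is why the accuracy loss stays small.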
7. Deploy to SIDRA:
Each INT8 weight → memristor conductance level.
- Code 0 → $G_{\min}$, code 255 → $G_{\max}$, linear in between.
- ISPP programs each cell to its target conductance (chapter 5.5).
- Positive weights go on one crossbar ($G^+$); negative weights on a separate "negative" crossbar ($G^-$), the output taken as the difference of the column currents (or an offset scheme on a single crossbar).
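The code-to-conductance mapping and the differential (positive/negative crossbar) scheme can be sketched as follows. The conductance window values here are illustrative assumptions, not published SIDRA device figures.

```python
# Assumed conductance window for illustration (siemens).
G_MIN, G_MAX = 1e-6, 100e-6

def code_to_G(code: int) -> float:
    """Map an 8-bit code (0..255) linearly onto [G_MIN, G_MAX]."""
    return G_MIN + (code / 255.0) * (G_MAX - G_MIN)

def signed_weight_to_pair(w: float, w_absmax: float):
    """Differential scheme: positive weights programmed on the G+ crossbar,
    negative weights on the G- crossbar; effective weight ~ (G+ - G-)."""
    code = round(abs(w) / w_absmax * 255)
    if w >= 0:
        return code_to_G(code), code_to_G(0)
    return code_to_G(0), code_to_G(code)

g_pos, g_neg = signed_weight_to_pair(-0.5, 1.0)   # a negative weight
```

For a negative weight the $G^+$ side is held at $G_{\min}$ and the magnitude lands on $G^-$, so the subtracted column currents reproduce the sign.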
8. Inference (SIDRA):
- Input converted to voltages by DAC.
- Crossbar MVM: $I_j = \sum_i G_{ij} V_i$ (Ohm + KCL, ~10 ns).
- ADC: column current $I_j$ → INT8 code.
- CMOS: apply ReLU, pass to the next layer.
- Last layer: digital softmax + argmax → class.
Total inference: ~50-100 ns + ADC overhead. 10M+ MNIST inferences per second.
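The analog read step can be simulated numerically: Ohm's law gives the per-cell current $G_{ij} V_i$, KCL sums the currents along each column, and an 8-bit ADC maps each column current to an integer code. Conductance and voltage ranges here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy analog crossbar read: Ohm's law gives I_ij = G_ij * V_i per cell,
# KCL sums the currents along each column, yielding one dot product per column.
G = rng.uniform(1e-6, 100e-6, (10, 128))   # conductances (siemens), 10 outputs
V = rng.uniform(0.0, 0.2, 128)             # input voltages (volts)

I = G @ V                                  # the entire MVM in one physical step

# An 8-bit ADC then maps each column current to an integer code.
adc_codes = np.round(I / I.max() * 255).astype(int)
```

The single `G @ V` line is what the crossbar computes in one ~10 ns analog step; in a digital processor it would be 1,280 multiply-accumulates.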
9. Noise analysis:
Each MVM output: $I = I_{\mathrm{ideal}}(1 + \varepsilon)$ with $\varepsilon \sim \mathcal{N}(0, \sigma_{\mathrm{rel}}^2)$, where $\sigma_{\mathrm{rel}}$ is the relative read-noise level.
2 layers → noise accumulates: roughly $\sqrt{2}\,\sigma_{\mathrm{rel}}$ relative, since independent errors add in quadrature.
Classification margin (top-1 vs top-2 score gap) is typically 20-50%. Noise < margin → classification still correct.
But: in hard examples, margins shrink (e.g. 4 vs 9, 1 vs 7). Noise pushes past threshold → misclassification. Expected loss: 0.5-1%.
Averaging: 4× re-reads halve the noise ($\sqrt{4} = 2$), at 4× the time and energy, for roughly +0.2% accuracy: a trade-off.
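The averaging claim is easy to verify by Monte Carlo: averaging $N$ independent reads divides the noise standard deviation by $\sqrt{N}$, so 4 reads halve it. The 1% relative noise used here is an illustrative value, not a measured SIDRA figure.

```python
import numpy as np

rng = np.random.default_rng(5)

sigma_rel = 0.01              # assumed 1% relative read noise (illustrative)
ideal = 1.0
trials, n_avg = 200_000, 4

# Single read vs. the mean of 4 independent reads of the same value.
single = ideal + sigma_rel * rng.standard_normal(trials)
averaged = ideal + sigma_rel * rng.standard_normal((trials, n_avg)).mean(axis=1)

std_single = single.std()     # ~0.01
std_avg = averaged.std()      # ~0.005: halved, at 4x the read cost
```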
10. Energy:
- MVM: 26 pJ + ADC 256 pJ + DAC 128 pJ + control 50 pJ ≈ 460 pJ.
- 2-layer inference: 2 × 460 pJ = 920 pJ ≈ 1 nJ/inference.
- SIDRA Y1 at 3 W TDP → 3 W ÷ 1 nJ ≈ 3 × 10⁹ inferences/second as an energy-limited ceiling (actual throughput is latency-bound, in the tens of millions per second).
Compare: an H100 at an effective 100 GOPS per stream needs ~10 µs/inference ≈ 100K/s; with batching, millions/s. SIDRA's advantage is efficiency at small batch sizes.
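As a check on the arithmetic, re-deriving the budget from the per-stage figures quoted above (all values taken from the text):

```python
# Per-stage energy figures from the text, in joules.
E_MVM, E_ADC, E_DAC, E_CTRL = 26e-12, 256e-12, 128e-12, 50e-12
E_layer = E_MVM + E_ADC + E_DAC + E_CTRL     # 460 pJ per layer
E_inference = 2 * E_layer                    # two layers: 920 pJ ~ 1 nJ

P_TDP = 3.0                                  # watts
rate_energy_limited = P_TDP / E_inference    # ~3.3e9 inferences/s ceiling
```

Note that the 3 W budget permits roughly 3 × 10⁹ inferences/s in principle; the practical rate (~10-20M/s) is set by the ~50-100 ns pipeline latency, not by energy.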
11. Information-theoretic frame:
MNIST label distribution (near-uniform over 10 classes): $H(Y) = \log_2 10 \approx 3.32$ bits.
Model conditional entropy after training: $H(Y \mid X) \approx 0.07$ bits.
Mutual information: $I(X;Y) = H(Y) - H(Y \mid X) \approx 3.32 - 0.07 = 3.25$ bits, i.e. the model extracts about 3.25 bits of label information from the input.
Weights 100K × 8 bits = 800 kbit total. Most of it is “redundant”: from an information-bottleneck perspective, pruning + quantization can compress far further.
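The back-of-envelope accounting above, in code (the 3.25-bit mutual-information figure is the one quoted in the text; the conditional entropy is implied by it):

```python
import math

H_Y = math.log2(10)                  # label entropy: ~3.32 bits
I_XY = 3.25                          # mutual information quoted above
H_Y_given_X = H_Y - I_XY             # implied conditional entropy: ~0.07 bits

weight_bits = 100_000 * 8            # 100K INT8 weights -> 800 kbit
redundancy = weight_bits / I_XY      # parameter bits per extracted label bit
```

The enormous ratio of stored bits to extracted bits is the information-bottleneck argument for aggressive pruning and quantization.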
12. Fourier (optional):
If we used a CNN there would be convolution layers, and large filters could use FFT-based acceleration. For MNIST, small 3×3 filters suffice, so direct MVM is the right choice.
Experiment: Inference on One MNIST Image
Test image “7” (784 pixels, normalized):
Layer 1 (784 → 128):
- Input vector $x'$: 784 values in $[-1, 1]$.
- Weight matrix $W_1$: 128×784 ≈ 100K values. One SIDRA crossbar is 256×256, so the matrix must be tiled: 784 = 256 × 3 + 16 → 3 or 4 crossbar tiles. Simplification: crop 16 border pixels to 128×768 → exactly 3 crossbars.
- MVM result: $z_1 = W_1 x' + b_1 \in \mathbb{R}^{128}$.
- ReLU: zero out negatives, keep positives.
Layer 2 (128 → 10):
- Input: $a_1 \in \mathbb{R}^{128}$, the layer-1 activations.
- Weights $W_2 \in \mathbb{R}^{10 \times 128}$: a single 128×128 crossbar is enough (only the first 10 columns used).
- MVM result: $z_2 = W_2 a_1 + b_2 \in \mathbb{R}^{10}$.
- Softmax (digital): probability per class.
Output: a 10-entry probability vector, e.g. $\hat{y}_7 \approx 0.98$ with the remaining mass spread across the other classes.
Argmax: index 7 → class “7”. Correct!
Latency:
- Layer 1 (3 MVMs in parallel): 10 ns + ADC 5 ns = 15 ns.
- Layer 2 (1 MVM): 15 ns.
- CMOS activation + softmax: ~20 ns.
- Total: ~50 ns/inference. Theoretical 20M inferences/s.
Energy:
- 4 MVMs × 460 pJ = ~2 nJ.
- A naive bound, SIDRA Y1 3 W × 50 ns = 150 nJ, overstates it; with only a few blocks active the realistic figure is ~2-10 nJ.
Accuracy:
- FP32 model: 98%.
- SIDRA INT8 + noise: 97.5-97.8%.
- On a single “7” image: correct classification probability > 99%.
Comprehensive Quiz
Tests all seven chapters of Module 4. Each question combines 2-3 concepts.
Integrated Lab: Design Your MNIST Model
Apply Module 4 through the following steps.
Task: deploy a “3 vs 8” binary classifier onto SIDRA Y1.
Parameters:
- Data: only 3s and 8s from MNIST (~12,000 training images).
- Target: 2-class accuracy > 99%.
Decisions:
(a) Architecture: MLP 784 → 64 → 2, or a CNN with 3×3 filters + max pool + FC? Which fits SIDRA?
(b) Training: how many epochs? Optimizer (SGD vs Adam)? Learning rate?
(c) Quantization: INT8 post-training, or QAT? Training budget?
(d) SIDRA mapping: crossbars used? Memory budget?
(e) Noise analysis: averaging factor for noisy inference?
(f) Deploy: per-inference latency and energy?
Solutions
(a) MLP 784 → 64 → 2. Simple, fast, ideal for SIDRA. A CNN could be more accurate but 3 vs 8 is simple → MLP suffices. Crossbars: ~3 × 256×256 for the first layer, 1 small one for the second.
(b) Adam optimizer, lr=0.001, 5 epochs. Small dataset → few epochs suffice. Adam is hyperparameter-tolerant. Batch 64.
(c) QAT, 3 FP32 epochs + 2 INT8-simulated epochs. Post-training loses ~0.2% but QAT is safer especially with a >99% target.
(d) 4 crossbars total. 3 for the first layer (784/256 ≈ 3), 1 small for the second. ~260K cells total, ~0.06% of Y1. Plenty of room for other models or ensembles.
(e) A 2-class margin is large (FP32 confidence near 100%) → noise mostly tolerated. Averaging 1 (single read suffices) unless critical.
(f) Latency: 30 ns/inference (2 MVMs + CMOS). Energy: ~0.5 nJ. Throughput: latency-bound at ~33M inferences/s per pipeline; the 3 W TDP (which would permit 6 × 10⁹ inferences/s at 0.5 nJ) is not the limit here.
Extension: the same 4 crossbars can store 10 different “one-vs-all” classifiers → a full 10-class MNIST classifier. Ensemble MNIST accuracy ~98.5%.
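The one-vs-all extension can be sketched as below: ten binary heads sharing a single hidden layer, combined by argmax. Weights are random stand-ins for trained ones, and the shared-hidden-layer structure is one plausible reading of the extension, not a prescribed SIDRA design.

```python
import numpy as np

rng = np.random.default_rng(6)

# Ten one-vs-all binary heads on top of one shared 784 -> 64 hidden layer.
W1 = rng.normal(0.0, 0.05, (64, 784))       # shared hidden layer
heads = rng.normal(0.0, 0.05, (10, 64))     # one binary score row per digit

def classify(x):
    a1 = np.maximum(0.0, W1 @ x)            # shared hidden activations
    scores = heads @ a1                     # ten one-vs-all scores
    return int(np.argmax(scores))           # highest score wins

pred = classify(rng.normal(0.0, 1.0, 784))
```

On the crossbar side this reuses the first-layer tiles and adds only one small score matrix, which is why the same 4 crossbars suffice.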
Module 4 Cheat Sheet
At-a-glance gains:
- ✅ Vector, matrix, MVM — the atomic operation of AI.
- ✅ Ohm + KCL = analog MVM physics bridge.
- ✅ Derivative + gradient — the math atom of training.
- ✅ Probability + noise — SIDRA’s reality + AI’s regularizer.
- ✅ Fourier — signal processing + some AI architectures (FNO, FNet).
- ✅ Quantization — bit depth = SIDRA cell level.
- ✅ Information theory — theoretical capacity bound + AI loss functions.
- ✅ End-to-end: MNIST pipeline on SIDRA Y1 at ~1 nJ/inference.
Ready for SIDRA: Module 5 turns this math into silicon. ADC, TDC, sense-amplifier, compute engine, DMA — all are circuit implementations of Module 4 concepts.
Vision: Math → Silicon → AI
Module 4 gave math; Module 5 will give silicon. But the bridge is not one-way:
- Y1 (today): Classical math (FP32) → quantization → SIDRA. Direction: math → hardware.
- Y3 (2027): Hardware-aware training. Noise, quantization, gradient tuned to SIDRA physics. Bidirectional.
- Y10 (2029): Hardware-software co-design. Architecture, quantization, model optimized together. The compiler uses Module 4 math to tune circuit parameters.
- Y100 (2031+): Math-native hardware. Device physics (memristor, photonic) directly implements AI primitives (MVM, gradient, attention). Mathematical abstraction = hardware abstraction.
- Y1000 (long horizon): New math. Analog AI develops its own math — stochastic, non-linear, brain-like rule systems.
Meaning for Türkiye: as Module 4 showed, classical academic math education maps directly onto SIDRA. Türkiye's strong math + physics + engineering tradition is natural SIDRA infrastructure. Mathematics is one of Türkiye's strongest areas of international standing (olympiad medals, academic publications), and channeling that stock into SIDRA is a strategic opportunity.
Unexpected future: AI discovering its own math. Today's AI uses human math; tomorrow's AI discovers new theorems (the Wu 2024 examples), new algorithms (AlphaEvolve), new physics (PINNs). SIDRA Y100 + symbolic AI makes Türkiye's first "automated science discovery" system a real possibility.
Further Reading
- Next module: 🚧 5.1 · The Neuromorphic Computing Paradigm — Coming soon
- Previous: 4.7 — Information Theory
- Module 1 summary: 1.10 — Physics Module Review
- Module 2 summary: 2.10 — Chemistry Module Review
- Linear algebra (classical): Strang, Introduction to Linear Algebra, 6th ed.
- Math + AI together: Goodfellow, Bengio, Courville, Deep Learning — Chapters 2-4 are the math basis.
- Information theory + ML: MacKay, Information Theory, Inference, and Learning Algorithms.
- Modern AI math: Deisenroth, Faisal, Ong, Mathematics for Machine Learning.