📐 Module 4 · The Math Arsenal · Chapter 4.8 · 12 min read

Linear Algebra Laboratory

Pull Module 4 together through one end-to-end project — SIDRA on MNIST.

What you'll learn here

  • Combine Module 4's seven concepts in one end-to-end project
  • Apply the math of an MNIST classifier (MVM, gradient, probability, quantization)
  • Trace the mapping of math onto SIDRA hardware step by step
  • Prepare for Module 5 (Chip Hardware)

Hook: Math That Lives on SIDRA

Across Module 4 we covered seven mathematical concepts in detail: vector/MVM (4.1), Ohm+KCL bridge (4.2), derivative/gradient (4.3), probability/noise (4.4), Fourier (4.5), quantization (4.6), information theory (4.7).

Each is its own field. But for SIDRA, they form one story:

AI models are sequences of MVMs. Every MVM lives in hardware via Ohm + KCL. Training needs gradients. Noise is both a problem and a solution. Fourier speeds up large convolutions. Quantization bounds bit depth. Information theory gives the theoretical ceiling.

This chapter shows that story end-to-end through a concrete exercise: build an MNIST classifier and see where each concept enters. You’ll close the chapter ready for Module 5 (Chip Hardware).

Intuition: 7 Concepts, 1 End-to-End Project

MNIST classification is simple: 28×28 grayscale handwritten digits → 10 classes (0-9). Mathematically, it uses every concept in Module 4:

Math (Module 4) → Where it appears in the MNIST pipeline

  • 4.1 Vector/Matrix/MVM: input (784-vector), weight matrices, layer outputs
  • 4.2 Ohm + KCL = MVM: every MVM runs analog on the SIDRA crossbar
  • 4.3 Derivative/Gradient: training (backprop), loss minimization
  • 4.4 Probability/Noise: impact of SIDRA noise on classification
  • 4.5 Fourier: absent in this MLP, but present in convolutional models
  • 4.6 Quantization: FP32 training → INT8 inference (SIDRA 256 levels)
  • 4.7 Information theory: cross-entropy loss, model capacity

Project structure:

  1. Data preprocessing: pixels → vector → normalize.
  2. Architecture: 2-layer MLP (784 → 128 → 10).
  3. Training: SGD + backprop (on the GPU).
  4. Quantization: FP32 → INT8 (for SIDRA).
  5. Deploy: program the weights into the SIDRA crossbar.
  6. Inference: analog MVM + CMOS activation + ADC.
  7. Accuracy: expect 97-98%.

Formalism: End-to-End Math

L1 · Starter

1. Data: MNIST image $\mathbf{x} \in \mathbb{R}^{784}$, label $y \in \{0, 1, \ldots, 9\}$.

Normalize: $\mathbf{x} \leftarrow (\mathbf{x} - \mu)/\sigma$, with $\mu, \sigma$ computed from the training set.

2. Architecture:

First layer: $\mathbf{z}_1 = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1$, $\mathbf{W}_1 \in \mathbb{R}^{128 \times 784}$.

Activation: $\mathbf{h}_1 = \mathrm{ReLU}(\mathbf{z}_1)$.

Second layer: $\mathbf{z}_2 = \mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2$, $\mathbf{W}_2 \in \mathbb{R}^{10 \times 128}$.

Output: $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}_2)$, a 10-class probability distribution.
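The two-layer forward pass can be sketched in NumPy; the random weights below are stand-ins for trained parameters, but the shapes and operations match the architecture above:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Random weights stand in for trained parameters (shapes match the text)
W1 = rng.normal(0.0, 0.05, (128, 784)); b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128));  b2 = np.zeros(10)

def forward(x):
    h1 = relu(W1 @ x + b1)           # layer 1: MVM + bias + ReLU
    y_hat = softmax(W2 @ h1 + b2)    # layer 2: MVM + bias + softmax
    return h1, y_hat

x = rng.normal(0.0, 1.0, 784)        # stand-in for a normalized MNIST image
h1, y_hat = forward(x)
```

On SIDRA, the two `@` products are the analog crossbar operations; ReLU and softmax run in digital CMOS.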

3. Loss (information theory):

Cross-entropy: $L = -\log \hat{y}_{y^*}$, with $y^*$ the true label.

4. Gradient (calculus):

Backprop:

  • $\delta_2 = \hat{\mathbf{y}} - \mathbf{e}_{y^*}$ (one-hot target)
  • $\nabla_{\mathbf{W}_2} L = \delta_2 \mathbf{h}_1^\top$
  • $\delta_1 = \mathbf{W}_2^\top \delta_2 \odot \mathbb{1}[\mathbf{z}_1 > 0]$ (ReLU derivative)
  • $\nabla_{\mathbf{W}_1} L = \delta_1 \mathbf{x}^\top$

Update: $\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla L$.
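The four backprop formulas plus the SGD update, as a single-sample training step in NumPy (random data and weights as stand-ins; repeated steps on one example should drive the loss down):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Random stand-ins for weights and one training example
W1 = rng.normal(0.0, 0.05, (128, 784)); b1 = np.zeros(128)
W2 = rng.normal(0.0, 0.05, (10, 128));  b2 = np.zeros(10)
x = rng.normal(0.0, 1.0, 784)
y_star = 7

def sgd_step(eta=0.01):
    global W1, b1, W2, b2
    # forward
    z1 = W1 @ x + b1; h1 = relu(z1)
    y_hat = softmax(W2 @ h1 + b2)
    # backward: exactly the four formulas above
    e = np.zeros(10); e[y_star] = 1.0
    delta2 = y_hat - e
    grad_W2 = np.outer(delta2, h1)
    delta1 = (W2.T @ delta2) * (z1 > 0)
    grad_W1 = np.outer(delta1, x)
    # SGD update
    W2 -= eta * grad_W2; b2 -= eta * delta2
    W1 -= eta * grad_W1; b1 -= eta * delta1
    return float(-np.log(y_hat[y_star]))   # cross-entropy before the update

losses = [sgd_step() for _ in range(20)]
```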

L2 · Full

5. Training loop:

  • 60,000 training images.
  • 10 epochs (10 full passes over the data).
  • Batch 64, learning rate 0.01.
  • FP32 arithmetic (on GPU).

Post-training: FP32 accuracy ~98%.
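The epoch/batch structure of this loop can be sketched as follows; the data here is synthetic and shrunk 100× so the sketch runs instantly, and the actual SGD step (forward + backprop + update, per the formulas above) is elided:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in for the 60,000-image MNIST training set, shrunk 100x
X = rng.normal(0.0, 1.0, (600, 784))
Y = rng.integers(0, 10, len(X))

EPOCHS, BATCH, LR = 10, 64, 0.01   # hyperparameters from the text

def iterate_batches(X, Y, batch):
    idx = rng.permutation(len(X))   # reshuffle every epoch
    for i in range(0, len(X), batch):
        j = idx[i:i + batch]
        yield X[j], Y[j]

n_steps = 0
for epoch in range(EPOCHS):
    for xb, yb in iterate_batches(X, Y, BATCH):
        # forward + backprop + SGD update would go here
        n_steps += 1
```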

6. Quantization:

Weights FP32 → INT8:

  • Dynamic range: take min/max of $\mathbf{W}$.
  • 256 levels: step size $\Delta = (\max - \min)/255$.
  • Round: $W \to \mathrm{round}\big((W - \min)/\Delta\big)$.

Activations also INT8 quantized (layer-wise calibration).

Post-quantization accuracy: ~97.8% (0.2% loss).
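A minimal min-max INT8 quantizer matching the steps above (note that 256 levels are separated by 255 steps, so the step size divides by 255):

```python
import numpy as np

def quantize_int8(W, levels=256):
    """Uniform min-max quantization to `levels` discrete codes (0..255)."""
    w_min, w_max = float(W.min()), float(W.max())
    delta = (w_max - w_min) / (levels - 1)   # 255 steps span 256 levels
    codes = np.clip(np.round((W - w_min) / delta), 0, levels - 1).astype(np.uint8)
    return codes, w_min, delta

def dequantize(codes, w_min, delta):
    return codes.astype(np.float32) * delta + w_min

rng = np.random.default_rng(2)
W = rng.normal(0.0, 0.05, (128, 784)).astype(np.float32)
codes, w_min, delta = quantize_int8(W)
W_hat = dequantize(codes, w_min, delta)
max_err = float(np.abs(W - W_hat).max())   # bounded by delta / 2
```

The maximum round-trip error is half a step, which is why the accuracy loss after quantization stays small.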

7. Deploy to SIDRA:

Each INT8 weight → memristor conductance level.

  • $W_{ij} \in \{0, 1, \ldots, 255\}$ → $G_{ij} \in [G_{\min}, G_{\max}]$.
  • ISPP programs each cell to its target $G$ (chapter 5.5).
  • Positive weight = $G$; negative weight = separate “negative” crossbar (or offset on a single crossbar).
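One way to sketch the code-to-conductance mapping and the differential positive/negative split; the conductance window values here are illustrative assumptions, not SIDRA specifications:

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # assumed conductance window in siemens (illustrative)

def weight_to_conductance(codes, levels=256):
    """Linearly map INT8 codes {0..255} into [G_MIN, G_MAX]."""
    return G_MIN + codes.astype(np.float64) * (G_MAX - G_MIN) / (levels - 1)

def signed_split(W):
    """Differential scheme: positive part on one crossbar, negative part
    on the other, so that W = W_pos - W_neg with both parts non-negative."""
    return np.maximum(W, 0.0), np.maximum(-W, 0.0)

rng = np.random.default_rng(3)
W = rng.normal(0.0, 0.05, (10, 128))
W_pos, W_neg = signed_split(W)
G = weight_to_conductance(np.array([0, 255], dtype=np.uint8))  # endpoints
```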

8. Inference (SIDRA):

  • Input $\mathbf{x}$ converted to voltages $\mathbf{V}$ by the DAC.
  • Crossbar MVM: $\mathbf{I} = \mathbf{G}^\top \mathbf{V}$ (Ohm + KCL, 10 ns).
  • ADC: $\mathbf{I}$ → INT8 integers.
  • CMOS: apply ReLU, pass to the next layer.
  • Last layer: digital softmax + argmax → class.

Total inference: ~50-100 ns + ADC overhead. 10M+ MNIST inferences per second.

L3 · Deep

9. Noise analysis:

Each MVM output: $\mathbf{I} = \mathbf{G}^\top \mathbf{V} + \boldsymbol{\epsilon}$, $\epsilon_j \sim \mathcal{N}(0, \sigma^2)$, $\sigma \approx 5\%$ relative.

2 layers → noise accumulates: $\sigma_{\text{out}} \approx \sqrt{\sigma_1^2 + \sigma_2^2} \approx 7\%$ relative.

Classification margin (top-1 vs top-2 score gap) is typically 20-50%. Noise < margin → classification still correct.

But: in hard examples, margins shrink (e.g. 4 vs 9, 1 vs 7). Noise pushes past threshold → misclassification. Expected loss: 0.5-1%.

Averaging: 4× re-reads halves the noise. Time and energy 4× → +0.2% accuracy. Trade-off.
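The averaging claim can be checked numerically; this sketch models the noise as multiplicative Gaussian at 5% relative, per the text, and compares single reads against 4-read averaging:

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_mvm(G, v, rel_sigma=0.05, reads=1):
    """Analog MVM with 5% relative (multiplicative) Gaussian noise,
    averaged over `reads` re-reads."""
    ideal = G @ v
    samples = [ideal * (1.0 + rng.normal(0.0, rel_sigma, ideal.shape))
               for _ in range(reads)]
    return np.mean(samples, axis=0)

G = rng.uniform(0.0, 1.0, (10, 128))
v = rng.uniform(0.0, 1.0, 128)
ideal = G @ v

err1 = np.mean([np.abs(noisy_mvm(G, v, reads=1) - ideal).mean()
                for _ in range(500)])
err4 = np.mean([np.abs(noisy_mvm(G, v, reads=4) - ideal).mean()
                for _ in range(500)])
# err4 / err1 should be close to 1/sqrt(4) = 0.5
```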

10. Energy:

  • MVM: 26 pJ + ADC 256 pJ + DAC 128 pJ + control 50 pJ ≈ 460 pJ.
  • 2-layer inference: 2 × 460 pJ = 920 pJ ≈ 1 nJ/inference.
  • SIDRA Y1 at 3 W TDP → 3 × 10¹² pJ/s ÷ ~10³ pJ ≈ 3 × 10⁹ inferences/s as an energy-bound ceiling; in practice, throughput is latency-limited to tens of millions per second.

Compare: an H100 at batch 32 sustains ~100 GOPS on this workload → ~10 µs/inference ≈ 100K/s per stream; with heavy batching, millions/s. SIDRA is very efficient precisely at small batch sizes.

11. Information-theoretic frame:

MNIST label distribution (10 balanced classes): $H(Y) = \log_2 10 \approx 3.32$ bits.

Model cross-entropy: $H(Y \mid X) \approx 0.07$ bits (trained).

Mutual information $I(X; Y) = H(Y) - H(Y \mid X) \approx 3.25$ bits: the model extracts about 3.25 bits of label information from each input.

Weights 100K × 8 bits = 800 kbit total. Most of it is “redundant”: from an information-bottleneck perspective, pruning + quantization can compress far further.
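The entropy and mutual-information numbers check out in a few lines (the 0.07-bit conditional entropy is the trained-model figure quoted above, not computed from an actual model):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H_Y = entropy_bits(np.full(10, 0.1))   # uniform prior over 10 digits: log2(10)
H_Y_given_X = 0.07                     # trained-model figure from the text
I_XY = H_Y - H_Y_given_X               # information extracted per image
```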

12. Fourier (optional):

If we used a CNN: convolution layers. Large filters could use FFT-based acceleration. For MNIST, small 3×3 filters suffice, direct MVM.

Experiment: Inference on One MNIST Image

Test image “7” (784 pixels, normalized):

Layer 1 (784 → 128):

  • Input vector $\mathbf{x}$: 784 values in [-1, 1].
  • Weight matrix $\mathbf{W}_1$: 128×784 ≈ 100K values. SIDRA crossbars are 256×256, so 784 input rows span 4 partial tiles (784 = 3 × 256 + 16) → 3 or 4 crossbars. Simplification: crop 16 border pixels to 768 inputs → 3 crossbars.
  • MVM result $\mathbf{z}_1 \in \mathbb{R}^{128}$.
  • ReLU: zero out negatives, keep positives.
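The tile counts above can be computed generically; the 256×256 tile size follows the text:

```python
import math

TILE = 256   # SIDRA crossbar tile size (256×256), per the text

def tiles_needed(rows, cols, size=TILE):
    """Number of size×size crossbar tiles a rows×cols weight matrix occupies."""
    return math.ceil(rows / size) * math.ceil(cols / size)

full_layer1 = tiles_needed(128, 784)      # 784 inputs → 4 tiles
cropped_layer1 = tiles_needed(128, 768)   # crop to 768 inputs → 3 tiles
layer2 = tiles_needed(10, 128)            # fits in a single tile
```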

Layer 2 (128 → 10):

  • Input $\mathbf{h}_1 \in \mathbb{R}^{128}$.
  • Weights $\mathbf{W}_2 \in \mathbb{R}^{10 \times 128}$: a single 128×128 crossbar is enough (only the first 10 output columns are used).
  • MVM result $\mathbf{z}_2 \in \mathbb{R}^{10}$.
  • Softmax (digital): probability per class.

Output: probability vector, e.g. $(0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.01, 0.85, 0.03, 0.03)$.

Argmax: index 7 → class “7”. Correct!
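The final argmax on the example probability vector, as a one-liner:

```python
import numpy as np

# Probability vector from the worked example above
p = np.array([0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.01, 0.85, 0.03, 0.03])
pred = int(np.argmax(p))   # index of the largest probability
```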

Latency:

  • Layer 1 (3 MVMs in parallel): 10 ns + ADC 5 ns = 15 ns.
  • Layer 2 (1 MVM): 15 ns.
  • CMOS activation + softmax: ~20 ns.
  • Total: ~50 ns/inference. Theoretical 20M inferences/s.

Energy:

  • 4 MVMs × 460 pJ = ~2 nJ.
  • SIDRA Y1 3 W × 50 ns = 150 nJ, but with low activity → real ~2-10 nJ.

Accuracy:

  • FP32 model: 98%.
  • SIDRA INT8 + noise: 97.5-97.8%.
  • On a single “7” image: correct classification probability > 99%.

Comprehensive Quiz

Tests all seven chapters of Module 4. Each question combines 2-3 concepts.

1/8 · An MLP layer $\mathbf{y} = \mathrm{ReLU}(\mathbf{W}\mathbf{x} + \mathbf{b})$. On SIDRA, which steps run on the analog crossbar, and which in digital CMOS?

Integrated Lab: Design Your MNIST Model

Apply Module 4 through the following steps.

Task: deploy a “3 vs 8” binary classifier onto SIDRA Y1.

Parameters:

  • Data: only 3s and 8s from MNIST (~12,000 training images).
  • Target: 2-class accuracy > 99%.

Decisions:

(a) Architecture: MLP 784 → 64 → 2, or a CNN with 3×3 filters + max pool + FC? Which fits SIDRA?

(b) Training: how many epochs? Optimizer (SGD vs Adam)? Learning rate?

(c) Quantization: INT8 post-training, or QAT? Training budget?

(d) SIDRA mapping: crossbars used? Memory budget?

(e) Noise analysis: averaging factor for noisy inference?

(f) Deploy: per-inference latency and energy?

Solutions

(a) MLP 784 → 64 → 2. Simple, fast, ideal for SIDRA. A CNN could be more accurate but 3 vs 8 is simple → MLP suffices. Crossbars: ~3 × 256×256 for the first layer, 1 small one for the second.

(b) Adam optimizer, lr=0.001, 5 epochs. Small dataset → few epochs suffice. Adam is hyperparameter-tolerant. Batch 64.

(c) QAT, 3 FP32 epochs + 2 INT8-simulated epochs. Post-training loses ~0.2% but QAT is safer especially with a >99% target.

(d) 4 crossbars total. 3 for the first layer (784/256 ≈ 3), 1 small for the second. ~260K cells total, ~0.06% of Y1. Plenty of room for other models or ensembles.

(e) The 2-class margin is large (FP32 confidence near 100%) → noise is mostly tolerated. Averaging factor 1 (a single read suffices) unless the application is critical.

(f) Latency: 30 ns/inference (2 MVMs + CMOS). Energy: ~0.5 nJ. Throughput: 6M inferences/s per chip. Power: 3 W (TDP-bound).

Extension: the same 4 crossbars can store 10 different “one-vs-all” classifiers → a full 10-class MNIST classifier. Ensemble MNIST accuracy ~98.5%.

Module 4 Cheat Sheet

At-a-glance gains:

  • ✅ Vector, matrix, MVM — the atomic operation of AI.
  • ✅ Ohm + KCL = analog MVM physics bridge.
  • ✅ Derivative + gradient — the math atom of training.
  • ✅ Probability + noise — SIDRA’s reality + AI’s regularizer.
  • ✅ Fourier — signal processing + some AI architectures (FNO, FNet).
  • ✅ Quantization — bit depth = SIDRA cell level.
  • ✅ Information theory — theoretical capacity bound + AI loss functions.
  • ✅ End-to-end: MNIST pipeline on SIDRA Y1 at ~1 nJ/inference.

Ready for SIDRA: Module 5 turns this math into silicon. ADC, TDC, sense-amplifier, compute engine, DMA — all are circuit implementations of Module 4 concepts.

Vision: Math → Silicon → AI

Module 4 gave math; Module 5 will give silicon. But the bridge is not one-way:

  • Y1 (today): Classical math (FP32) → quantization → SIDRA. Direction: math → hardware.
  • Y3 (2027): Hardware-aware training. Noise, quantization, gradient tuned to SIDRA physics. Bidirectional.
  • Y10 (2029): Hardware-software co-design. Architecture, quantization, model optimized together. The compiler uses Module 4 math to tune circuit parameters.
  • Y100 (2031+): Math-native hardware. Device physics (memristor, photonic) directly implements AI primitives (MVM, gradient, attention). Mathematical abstraction = hardware abstraction.
  • Y1000 (long horizon): New math. Analog AI develops its own math — stochastic, non-linear, brain-like rule systems.

Meaning for Türkiye: as Module 4 showed, classical academic math education maps directly onto SIDRA. Türkiye's strong math + physics + engineering tradition is natural SIDRA infrastructure. Math is among Türkiye's strongest areas of global performance (olympiad medals, academic publications); channeling that stock into SIDRA is a strategic opportunity.

Unexpected future: AI discovering its own math. Today’s AI uses human math. Tomorrow’s AI discovers: new theorems (Wu 2024 examples), new algorithms (AlphaEvolve), new physics (PINN). SIDRA Y100 + symbolic AI → Türkiye’s first “automated science discovery” system is a real possibility.

Further Reading

  • Next module: 🚧 5.1 · The Neuromorphic Computing Paradigm — Coming soon
  • Previous: 4.7 — Information Theory
  • Module 1 summary: 1.10 — Physics Module Review
  • Module 2 summary: 2.10 — Chemistry Module Review
  • Linear algebra (classical): Strang, Introduction to Linear Algebra, 6th ed.
  • Math + AI together: Goodfellow, Bengio, Courville, Deep Learning — Chapters 2-4 are the math basis.
  • Information theory + ML: MacKay, Information Theory, Inference, and Learning Algorithms.
  • Modern AI math: Deisenroth, Faisal, Ong, Mathematics for Machine Learning.