MUX, Decoder, and Analog ECC
Pick the cell, fix the error — the control backbone of large arrays.
Prerequisites
What you'll learn here
- Explain multiplexer (MUX) and address decoder roles
- Sketch the Y1 WL/BL selection mechanism
- State why ECC (Error Correction Code) is needed and Hamming-code basics
- Distinguish analog ECC strategies (redundancy, averaging, sigma-delta)
- Compute Y1 cell-failure tolerance
Hook: Pick the Right Cell from 419M
SIDRA Y1 has 419M memristors. Which one to read, when? Address decoder + multiplexer.
Also: some cells are broken (manufacturing defects), some reads are noisy (analog error). Detect and correct. ECC.
This chapter covers SIDRA’s “control backbone” and fault-tolerance mechanisms.
Intuition: From Address to Cell to Correct Answer
MUX (Multiplexer):
N inputs, 1 output. Address bits select:
- 256:1 MUX = 8-bit address, picks 1 of 256 cells.
- WL MUX: which row to drive.
- BL MUX: which column to read.
Decoder:
Address → one-hot signal. 8-bit address → 256 outputs, only one high.
In Y1, every CU has a WL decoder + BL decoder at its head. A MUX runs across the 16 crossbars.
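The decoder/MUX pair can be modeled behaviorally in a few lines of Python (a sketch of the logic, not the transistor-level circuit; the names `decode` and `mux_read` are illustrative):

```python
def decode(addr: int, bits: int = 8) -> list[int]:
    """Address -> one-hot: output i is 1 iff addr == i."""
    n = 1 << bits                      # 2^8 = 256 outputs
    return [1 if i == addr else 0 for i in range(n)]

def mux_read(cells: list[float], addr: int) -> float:
    """256:1 MUX: the one-hot line gates exactly one cell onto the output."""
    onehot = decode(addr, bits=len(cells).bit_length() - 1)
    # Sum-of-products model of the transmission gates: all terms but one are 0.
    return sum(g * c for g, c in zip(onehot, cells))

cells = [0.0] * 256
cells[42] = 3.3                        # the cell we expect to read back
assert sum(decode(42)) == 1            # exactly one decoder output is high
assert mux_read(cells, 42) == 3.3
```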
ECC (Error Correction Code):
Bit-level: parity, Hamming, BCH, Reed-Solomon. Detect + fix one-bit errors.
Analog-level: redundancy (3× copies, majority vote), averaging (5× reads), sigma-delta (cancel quantization error).
Formalism: Decoder + MUX + ECC
8-bit address → pick one of 256 cells:
Address a[7:0] (256 combinations).
Decoder: 8-input, 256-output combinational circuit. Output o[i] = 1 ⇔ address = i.
Circuit: 8-input AND gate × 256 (simple but big). Optimized: a tree-based (predecoded) decoder that combines address bits in log₂-depth stages instead of 256 wide AND gates.
256:1 MUX:
Decoder outputs gate transmission gates. Only the selected cell connects to the crossbar.
Circuit: 256 transmission gates + decoder. Area: ~250 µm² (in 28 nm CMOS).
Y1 hierarchy:
- Within a CU: 16 crossbars → 16:1 MUX (4-bit address).
- Within a crossbar: 256×256 → WL decoder (8-bit) + BL decoder (8-bit).
- Cluster: 25 CUs → 25:1 routing matrix.
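The CU-level hierarchy implies a simple address split. A sketch, assuming a flat 20-bit cell address with fields 4-bit crossbar + 8-bit WL + 8-bit BL (the field order is an illustrative choice, not a documented Y1 register layout):

```python
def split_address(addr: int) -> tuple[int, int, int]:
    """Flat 20-bit CU-local address -> (crossbar, wordline, bitline)."""
    bl = addr & 0xFF            # low 8 bits drive the BL decoder
    wl = (addr >> 8) & 0xFF     # next 8 bits drive the WL decoder
    xbar = (addr >> 16) & 0xF   # top 4 bits drive the 16:1 crossbar MUX
    return xbar, wl, bl

# 16 crossbars x 256 x 256 = 1,048,576 addressable cells per CU
assert split_address(0) == (0, 0, 0)
assert split_address((5 << 16) | (200 << 8) | 17) == (5, 200, 17)
```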
Hamming code (classical bit-level ECC):
n data bits + k parity bits, with n + k ≤ 2^k - 1.
Corrects single-bit errors; with one extra overall parity bit (extended Hamming, SECDED) it also detects double-bit errors. Common: (7, 4) Hamming → 4 data + 3 parity.
For SIDRA: byte (8 bit) uses (12, 8) Hamming = 4 parity bits. Each cell read passes ECC check.
In practice:
- Data: 8 bits
- Storage: 12 bits (4 parity)
- After read: compute parity, find any flipped bit, correct.
Overhead: 50% storage. Y1: 419M cells / 12 ≈ 35M codewords, each carrying one data byte → ≈35M data bytes (≈35 MB net).
Analog ECC:
Bit-level ECC is digital. Analog quantization errors differ. Strategies:
1. Redundancy: write each weight to 3 cells, take majority.
- Overhead: 200%.
- Tolerates: 1 cell failure / 3.
2. Averaging: write each weight to 1 cell, read 5 times, average.
- Overhead: 5× read time (storage unchanged).
- Tolerates: random zero-mean noise; its standard deviation shrinks by √5 ≈ 2.2×.
3. Sigma-delta: track weight + error, fold the error into the next cell.
- Overhead: small.
- Tolerates: quantization error nullified long-term.
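The pay-off of the first two strategies is easy to quantify (a sketch; the 1% cell-failure rate is the figure used elsewhere in this chapter, and the noise level is illustrative):

```python
import math
import random

p = 0.01                                  # single-cell failure probability
# Majority-of-3: fails only if at least 2 of the 3 copies fail
p_triple = 3 * p**2 * (1 - p) + p**3
print(f"triple-redundancy error rate: {p_triple:.6f}")   # ~3e-4 instead of 1e-2

# Averaging: the std of the mean of N reads shrinks by sqrt(N)
random.seed(0)
sigma = 0.1                               # per-read noise std (illustrative)
reads = [[random.gauss(0, sigma) for _ in range(5)] for _ in range(20000)]
means = [sum(r) / 5 for r in reads]
est = math.sqrt(sum(m * m for m in means) / len(means))
print(f"noise after 5x averaging: {est:.3f} (theory: {sigma/math.sqrt(5):.3f})")
```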
SIDRA Y1 approach: triple redundancy for critical cells (references, biases); 2× averaging + Hamming bytes for the rest. Total overhead ~30%.
Failure modes and tolerance:
Cell failure types:
- Stuck-at-LRS: cell won’t program, always low R. Rate: ~0.5%.
- Stuck-at-HRS: cell won’t program, always high R. Rate: ~0.3%.
- Read fail: noisy measurement, beyond margin. Rate: ~0.1%.
- Drift fail: value drifts over time. Rate: ~0.1%/year.
Total Y1 production cell failure: ~1%.
419M × 0.01 = 4.2M faulty cells in Y1.
Tolerance strategy:
- Per crossbar, spare rows (256 + 4 redundant). Boot tests map out bad rows.
- Per crossbar, spare columns (256 + 4 redundant). Same.
- Byte-level ECC (Hamming).
Test and remapping:
At boot, every crossbar is tested:
- Program all cells → read → compare.
- Faulty cells go in a table (cell-failure map).
- The compiler uses that table → routes weights to good cells.
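The boot-time flow can be sketched as a tiny remapping simulation (hypothetical data structures; the real compiler's mapping tables are not specified here):

```python
ROWS, SPARES = 256, 4
bad_rows = {17, 203}                      # found by boot program/read/compare

# Identity map, then steer each bad logical row to a spare physical row.
row_map = {r: r for r in range(ROWS)}     # logical row -> physical row
for spare, bad in zip(range(ROWS, ROWS + SPARES), sorted(bad_rows)):
    row_map[bad] = spare                  # compiler reads this table

assert row_map[17] == 256 and row_map[203] == 257   # bad rows remapped
assert row_map[42] == 42                            # good rows untouched
```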
Sigma-delta as decision aid:
Each bit read carries a confidence. Sigma-delta tracks error cumulatively, folds into the next reading. Standard in modern analog ADCs.
Reed-Solomon (block ECC):
Tolerates multi-byte errors (e.g., a whole row corrupt). Y1 uses cluster-level Reed-Solomon. Overhead 10-20%.
Practical Y1 total ECC overhead:
- Cell redundancy (critical): 5%
- Byte Hamming: 50% on the protected (critical) bytes → ≈10% overall; 0% on non-critical bytes
- Cluster Reed-Solomon: 15%
- Practical total: ~30% overhead.
419M cells − 30% ≈ 293M effective weights (~290 MB at one byte per weight). Still big: GPT-2 small (124M parameters) fits comfortably.
MTBF (Mean Time Between Failures):
Y1 production yield 75% → 25% chips scrapped.
Operating MTBF: drift-failure rate ~0.1%/year → 419M cells × 0.001 ≈ 420K new failures/year. The cell-failure map gets a periodic update.
After 10 years: ≈4.2M new failures (~1% of the array). Spare rows/columns + ECC absorb this; a 10-year lifetime is realistic.
Experiment: Hamming Code Step-by-Step
8-bit data: 10110011 (binary).
(12, 8) Hamming: add 4 parity bits.
Positions (1-12):
- p1, p2, d1, p3, d2, d3, d4, p4, d5, d6, d7, d8.
Data d1-d8 = 1, 0, 1, 1, 0, 0, 1, 1.
Parity compute:
p1 = XOR(d1, d2, d4, d5, d7) = 1 ⊕ 0 ⊕ 1 ⊕ 0 ⊕ 1 = 1
p2 = XOR(d1, d3, d4, d6, d7) = 1 ⊕ 1 ⊕ 1 ⊕ 0 ⊕ 1 = 0
p3 = XOR(d2, d3, d4, d8) = 0 ⊕ 1 ⊕ 1 ⊕ 1 = 1
p4 = XOR(d5, d6, d7, d8) = 0 ⊕ 0 ⊕ 1 ⊕ 1 = 0
Final 12-bit: p1 p2 d1 p3 d2 d3 d4 p4 d5 d6 d7 d8 = 1 0 1 1 0 1 1 0 0 0 1 1.
Storage: 12 cells.
Error simulation: d3 (position 6) flipped.
Read: 1 0 1 1 0 0 1 0 0 0 1 1 (d3 changed).
Decoder compute:
Recompute parity, compare with received.
p1’ = 1 ⊕ 0 ⊕ 1 ⊕ 0 ⊕ 1 = 1 ✓
p2’ = 1 ⊕ 0 ⊕ 1 ⊕ 0 ⊕ 1 = 1 ✗ (received 0)
p3’ = 0 ⊕ 0 ⊕ 1 ⊕ 1 = 0 ✗ (received 1)
p4’ = 0 ⊕ 0 ⊕ 1 ⊕ 1 = 0 ✓
Syndrome: (p4 p3 p2 p1) = 0110₂ = 6 → the bit at position 6 is in error. Flip it (0 → 1) to correct.
Result: original d3 = 1 recovered.
Win: a 1-bit error was corrected automatically. Memristor drift over years flipping a cell → ECC catches it.
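The walk-through above can be run end-to-end. A minimal (12, 8) Hamming sketch using the same bit positions as the text (the extended SECDED parity bit is omitted here):

```python
def hamming12_encode(d: list[int]) -> list[int]:
    """d = [d1..d8] -> 12-bit codeword, positions 1..12 as in the text."""
    x = lambda *bits: sum(bits) % 2                     # XOR of any arity
    p1 = x(d[0], d[1], d[3], d[4], d[6])
    p2 = x(d[0], d[2], d[3], d[5], d[6])
    p3 = x(d[1], d[2], d[3], d[7])
    p4 = x(d[4], d[5], d[6], d[7])
    return [p1, p2, d[0], p3, d[1], d[2], d[3], p4, d[4], d[5], d[6], d[7]]

def hamming12_correct(c: list[int]) -> list[int]:
    """Recompute parities over the received word; syndrome = error position."""
    syndrome = 0
    for pbit in (1, 2, 4, 8):                           # parity positions
        # Parity at position pbit covers every position whose index has
        # that bit set; the parity of all covered bits must be even.
        s = sum(c[pos - 1] for pos in range(1, 13) if pos & pbit) % 2
        syndrome |= pbit * s
    if syndrome:                                        # 0 means no error
        c[syndrome - 1] ^= 1                            # flip the bad bit
    return c

data = [1, 0, 1, 1, 0, 0, 1, 1]                         # 10110011
code = hamming12_encode(data)
assert code == [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]     # matches the text
code[5] ^= 1                                            # flip position 6 (d3)
fixed = hamming12_correct(code)
assert fixed[5] == 1                                    # d3 = 1 recovered
```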
Quick Quiz
Lab Exercise
Y1 ECC budget analysis.
Y1:
- 419M cells.
- Manufacturing failure 1% → 4.2M faulty.
- Drift +0.1%/year → 0.4M/year new.
ECC strategies:
- Hamming (12,8): 50% overhead.
- Triple redundancy: 200%.
- Sigma-delta: 0% (sequential).
Questions:
(a) Apply Hamming to all cells: net effective data bytes?
(b) Triple redundancy only for the “critical” 20% of weights (references, biases): net?
(c) Hybrid (Hamming for critical + averaging for non-critical) targeting 30% overhead?
(d) After 10 years (≈4M extra failures): accuracy drop?
(e) Annual periodic refresh requirement?
Solutions
(a) 419M / 12 ≈ 35M codewords × 8 data bits ≈ 279M data bits ≈ 35M data bytes (50% overhead). Corrects 1-bit errors per codeword.
(b) Critical 20% = 84M weights × 3 = 252M cells. Non-critical 80% = 335M weights × 1 = 335M cells. Total 587M > 419M. Doesn’t fit. Drop critical to 5% → 21M × 3 + 398M × 1 = 461M. Still over. In practice: minimal redundancy.
(c) Hybrid: 50% of cells under Hamming (50% overhead), 50% under averaging/sigma-delta (0% storage overhead). Net overhead = 25%. 419M × 0.75 ≈ 314M effective weights.
(d) 10 years → 4M failures / 314M total = 1.3% extra error rate. ECC + redundancy mask most → AI accuracy drop below 0.5%.
(e) Refresh: once a year, reprogram critical cells (per failure map). ~5% of cells ≈ 21M; at 100 µs each that is ≈35 min serially, ≈8 s with row-parallel programming (256 cells per WL), and well under 1 s with cluster-level parallelism. Once a year is easily affordable.
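The budget arithmetic in (a)-(e) is easy to script (same figures as above; the 100 µs program time and 256-cell row-parallelism are the assumptions used in this chapter):

```python
CELLS = 419e6

# (a) Hamming (12,8) everywhere: each 12-cell codeword yields one data byte
data_bytes = CELLS / 12
print(f"(a) {data_bytes/1e6:.0f}M data bytes")

# (b) triple redundancy for the critical 20%: does it fit in 419M cells?
needed = 0.20 * CELLS * 3 + 0.80 * CELLS * 1
print(f"(b) need {needed/1e6:.0f}M cells vs {CELLS/1e6:.0f}M -> fits: {needed <= CELLS}")

# (c)/(d) hybrid at 25% overhead, then 10 years of drift at 0.1%/year
effective = CELLS * 0.75
drift_10y = CELLS * 0.001 * 10
print(f"(c) {effective/1e6:.0f}M effective, (d) {100*drift_10y/effective:.1f}% extra errors")

# (e) yearly refresh of ~5% of cells, 100 us each, 256 cells per row in parallel
t_serial = 0.05 * CELLS * 100e-6
print(f"(e) serial {t_serial/60:.0f} min, row-parallel {t_serial/256:.1f} s")
```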
Cheat Sheet
- MUX: N inputs → 1 output (address selection).
- Decoder: address → one-hot output.
- Y1 hierarchy: Cluster MUX (25:1), CU MUX (16:1), crossbar decoder (256×256).
- ECC: bit-level (Hamming, Reed-Solomon), analog-level (redundancy, averaging, sigma-delta).
- Y1 production failure: ~1%, tolerated via redundant rows/cols + ECC.
- Boot test: 100 ms, builds failure map.
- Total ECC overhead: ~30%. Net: 290M weights.
Vision: Fault-Tolerant AI Hardware
Y10+ targets:
- Y3: smart ECC (compiler-aware), overhead 20%.
- Y10: model-aware redundancy (critical layers protected), overhead 15%.
- Y100: Self-healing crossbar — cells degrade → auto-refresh + reroute. Overhead 5%.
- Y1000: Bio-compatible self-repair (organic synapses).
For Türkiye: fault-tolerant hardware design is critical for space/defense. ASELSAN, TUSAŞ collaboration → SIDRA-based satellite/defense AI products.
Further Reading
- Next chapter: 5.9 — Compute Engine and DMA
- Previous: 5.7 — TIA: Transimpedance Sensing
- ECC classic: Lin & Costello, Error Control Coding, 2nd ed.
- Analog ECC: Akarvardar et al., Analog circuit techniques for error-tolerant memory systems, JSSC 2021.
- Memristor reliability: Govoreanu et al., RRAM endurance and retention, IEEE IEDM 2017.