Test, Calibration, Verification
Before a SIDRA chip ships — the layers of QC.
What you'll learn here
- Name the production test flow (wafer → package → system)
- Detail boot-time calibration steps
- Cover BIST (Built-In Self-Test) and continuous hardware verification
- Walk through failure-mode analysis and the RMA process
- State AI-accuracy verification (model-based QA) standards
Hook: Chip Built — Now What?
419M memristors + billions of transistors. Are they all working? Test, test, test.
Three post-production stages:
- Wafer test: every die tested, 75% yield.
- Package test: packaged chip, 95% pass.
- System test: on motherboard, real workload.
Then the chip goes to the field, where boot-time calibration and BIST run at every power-up and reference checks continue during operation.
Intuition: Multiple Test Layers
Wafer (38 dies)
↓ probe test (each die)
Good dies (~28)
↓ packaging
Chips (~95% pass)
↓ system test
Ready to ship (~67% total yield)
↓
Field (boot-time calibrate + BIST)
Each layer catches defects. Late catches are expensive.
Formalism: Test Stages
Wafer test (probe card):
Probe needles touch each die on the wafer. Tests:
- DC tests: voltage/current limits.
- Memory test: SRAM bit-by-bit.
- Crossbar test: program + read 16 reference cells.
- Interconnect test: BAR mapping.
Time: 30 s/die × 38 dies/wafer = 19 minutes per wafer (roughly 20).
Failed dies marked, not packaged.
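The probe-time arithmetic can be checked with a short sketch. The function names and the `probers` parameter are illustrative assumptions, and the model ignores wafer loading and probe-card indexing time, so real throughput would be lower.

```python
SECONDS_PER_DIE = 30   # from the text
DIES_PER_WAFER = 38    # from the text

def wafer_test_minutes(dies=DIES_PER_WAFER, seconds_per_die=SECONDS_PER_DIE):
    """Pure probing time for one wafer, in minutes."""
    return dies * seconds_per_die / 60

def batch_test_hours(wafers, probers=1):
    """Wall-clock hours for a batch, assuming identical probers working in parallel."""
    return wafers * wafer_test_minutes() / 60 / probers

print(wafer_test_minutes())              # 19.0
print(round(batch_test_hours(1000), 1))  # 316.7
```

Under these assumptions a 1000-wafer batch needs roughly 317 prober-hours, which is why wafer test is usually spread across several probers.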
Package test:
Packaged chip:
- PCIe enumeration (BIOS sees it).
- Full memory test (DDR/SRAM).
- Full crossbar test (program + read every cell).
- Thermal stress (85°C for 1 hour).
Failed chips go through RMA: returned to the supplier.
System test:
Chip on motherboard:
- Full SDK runs.
- MNIST + ResNet inference accuracy.
- 24-hour stress test.
Pass = “accepted”, ships to customer.
Boot-time calibration:
Every chip power-up:
1. Voltage rails stabilize.
2. Thermal sensors active.
3. Bandgap reference calibrated.
4. Each crossbar reads 16 reference cells → DAC scale set.
5. Failure map loaded (from NVRAM).
6. SRAM init.
7. Driver READY signal.
Time: ~100 ms (detail in chapter 6.3).
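Step 4 of the sequence above (reference cells → DAC scale) can be sketched as a simple gain correction. This is a minimal illustration, not SIDRA firmware: the function name, the nominal reference value, the units, and the 20% tolerance are all assumptions made for the example.

```python
from statistics import mean

def dac_scale_from_references(readings, nominal, tolerance=0.2):
    """Return a gain factor that maps measured reference readings back to nominal.

    readings  -- measured values of the reference cells (hypothetical units)
    nominal   -- value the reference cells were programmed to
    tolerance -- largest accepted relative correction (assumed limit)
    """
    if not readings:
        raise ValueError("no reference readings")
    measured = mean(readings)
    if measured <= 0:
        raise ValueError("reference cells read as zero or negative")
    scale = nominal / measured
    if abs(scale - 1.0) > tolerance:
        # Too far off to correct with a gain tweak; the caller should flag the crossbar.
        raise ValueError(f"correction {scale:.3f} outside tolerance")
    return scale

# 16 reference cells that all read 5% low.
scale = dac_scale_from_references([0.95] * 16, nominal=1.0)
```

Averaging across all 16 cells keeps a single noisy reference from skewing the scale; a crossbar whose correction exceeds the tolerance is better handled by the failure map than by calibration.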
BIST (Built-In Self-Test):
Test circuitry inside the chip. Runs automatically at boot:
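    #include <assert.h>
    #include <stdint.h>

    /* sram[], SRAM_SIZE, N_CROSSBARS, N_OPS, test_input, expected[] and the
       helper functions are assumed to be declared elsewhere in the firmware. */
    void bist_run(void) {
        // SRAM walking-bit test: every word must hold each single-bit pattern.
        // This overwrites SRAM contents, which is acceptable at boot.
        for (uint32_t addr = 0; addr < SRAM_SIZE; addr++) {
            for (int bit = 0; bit < 32; bit++) {
                uint32_t pattern = UINT32_C(1) << bit;  // unsigned: 1 << 31 overflows a signed int
                sram[addr] = pattern;
                assert(sram[addr] == pattern);
            }
        }
        // Crossbar reference test: a crossbar that fails is marked, not fatal.
        for (int xb = 0; xb < N_CROSSBARS; xb++) {
            program_reference(xb);
            if (!verify_reference(xb)) {
                mark_crossbar_bad(xb);
            }
        }
        // Compute engine test: each op must reproduce its known-good result.
        for (int op = 0; op < N_OPS; op++) {
            uint32_t result = run_op(op, test_input);  // 32-bit result assumed
            assert(result == expected[op]);
        }
    }

Faulty component → update failure map; ECC routes around it.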
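To see how a walking-bit test feeds a failure map, here is a small simulation. `FaultyMemory` and its stuck-at-0 fault model are hypothetical stand-ins for real SRAM, and unlike the C routine this sketch records failing addresses instead of asserting.

```python
class FaultyMemory:
    """Simulated word-addressable SRAM with optional stuck-at-0 bits."""

    def __init__(self, size, stuck_at_zero=None):
        self.words = [0] * size
        # Maps address -> bit mask of cells that always read 0 (assumed fault model).
        self.stuck_at_zero = stuck_at_zero or {}

    def write(self, addr, value):
        self.words[addr] = value & 0xFFFFFFFF

    def read(self, addr):
        return self.words[addr] & ~self.stuck_at_zero.get(addr, 0) & 0xFFFFFFFF


def walking_bit_test(mem, size, width=32):
    """Return the sorted addresses where any single-bit pattern fails to read back."""
    bad = set()
    for addr in range(size):
        for bit in range(width):
            pattern = 1 << bit
            mem.write(addr, pattern)
            if mem.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


mem = FaultyMemory(size=8, stuck_at_zero={3: 1 << 5})
failure_map = walking_bit_test(mem, size=8)
print(failure_map)  # [3]
```

Collecting failures rather than halting on the first one is what lets the boot sequence continue with a degraded but usable chip.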
Periodic runtime test:
During inference, 1% reference MVMs. If results drift, recalibrate.
    if (inference_count % 100 == 0) {
        actual = run_reference_mvm();
        if (abs(actual - expected) > THRESHOLD) {
            recalibrate_all();
        }
    }

Drift, temperature changes get auto-corrected.
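The same check can be exercised against a simulated drifting device. `DriftMonitor`, the linear drift rate, and the threshold below are assumptions chosen for illustration; real drift is unlikely to be this regular.

```python
class DriftMonitor:
    """Runs a reference check every `period` inferences and requests recalibration."""

    def __init__(self, expected, threshold, period=100):
        self.expected = expected
        self.threshold = threshold
        self.period = period
        self.inference_count = 0
        self.recalibrations = 0

    def after_inference(self, run_reference_mvm, recalibrate):
        """Call once per inference; both arguments are caller-supplied callables."""
        self.inference_count += 1
        if self.inference_count % self.period != 0:
            return False
        actual = run_reference_mvm()
        if abs(actual - self.expected) > self.threshold:
            recalibrate()
            self.recalibrations += 1
            return True
        return False


# Simulated device: the reference output sags linearly with time since calibration.
state = {"age": 0}

def run_reference_mvm():
    return 1.0 - 0.00015 * state["age"]

def recalibrate():
    state["age"] = 0

monitor = DriftMonitor(expected=1.0, threshold=0.02)
for _ in range(1000):
    state["age"] += 1
    monitor.after_inference(run_reference_mvm, recalibrate)
print(monitor.recalibrations)  # 5
```

With this drift rate the error is 0.015 at the first check and 0.03 at the second, so recalibration fires on every second check: five times in 1000 inferences.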
Failure-mode analysis:
Most failures are caught in production. The remainder show up in the field:
| Failure | Frequency | Action |
|---|---|---|
| Cell drift | 1%/year | ECC + auto-refresh |
| Single CU dead | 0.1%/year | Failure map + reroute |
| Cluster fault | 0.01%/year | Performance drop, notify |
| Total chip fail | 0.001%/year | RMA, replace |
MTBF (Mean Time Between Failures) target: 100,000 hours ≈ 11.4 years of continuous operation.
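An MTBF figure is a rate, not a guaranteed lifetime. The sketch below converts the target into years and, assuming a constant failure rate (exponential model), into a per-year failure probability. That model is an assumption; the chapter does not say which failure classes count toward the MTBF target.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days; leap years ignored

def mtbf_years(mtbf_hours):
    """MTBF expressed in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR

def annual_failure_probability(mtbf_hours):
    """P(at least one failure within a year) under a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(round(mtbf_years(100_000), 1))                  # 11.4
print(round(annual_failure_probability(100_000), 3))  # 0.084
```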
RMA process:
Customer reports a fault → ticket → SIDRA sales → spare chip shipped → faulty returned → analyzed → supply chain improvement.
Field data → production improvement. Closed loop.
AI accuracy verification:
After deploying a compiled model on real SIDRA:
    test_acc = chip.benchmark(test_data)
    sim_acc = simulator.benchmark(test_data)
    if abs(test_acc - sim_acc) > 0.01:
        # Sim vs chip differ by > 1%
        investigate()
        update_simulator_model()

Field data feeds the simulator. The sim model stays close to reality.
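A self-contained version of the comparison, using plain prediction lists in place of the `chip` and `simulator` objects, might look like this. The helper names and the toy data are assumptions; only the 0.01 tolerance comes from the snippet above.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("length mismatch")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def sim_chip_gap(chip_preds, sim_preds, labels, tolerance=0.01):
    """Return (gap, needs_investigation) for chip vs simulator accuracy."""
    gap = abs(accuracy(chip_preds, labels) - accuracy(sim_preds, labels))
    return gap, gap > tolerance

labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 10  # 100 toy labels
sim_preds = list(labels)                       # simulator is perfect on this toy set
chip_preds = list(labels)
chip_preds[0] = chip_preds[1] = chip_preds[2] = 9  # chip misclassifies 3 samples

gap, needs_investigation = sim_chip_gap(chip_preds, sim_preds, labels)
```

Here the chip scores 97% against the simulator's 100%, a gap of 3 percentage points, so the check flags the model for investigation.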
Compliance + certification:
SIDRA chip certifications:
- CE (Europe).
- FCC (US).
- TSE (Türkiye).
- AEC-Q100 (automotive for Y10+).
- DO-254 (avionics for Y100+).
Each cert requires testing + documentation + independent audit.
Production-error benchmarks:
Y1 targets:
- Wafer yield: 75%.
- Package yield: 95%.
- System yield: 95%.
- Net: 0.75 × 0.95 × 0.95 ≈ 67.7%. Industry-typical yield for comparison (TSMC 28 nm): ~75-85%.
Y10+ tighter, target 80% net.
Experiment: Y1 Test Flow
Wafer batch (1000 wafers):
- 1000 × 38 dies = 38,000 dies.
- Wafer test 75% yield: 28,500 good dies.
- Wafer test cost: $38K.
Packaging: 28,500 dies packaged.
- Package test 95%: 27,075 good packages.
- Test cost: $135K.
System test: 27,075 packages.
- System test 95%: 25,720 ship-ready chips.
Net production yield: 25,720 / 38,000 = 67.7%.
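The funnel is a running product of stage yields, which a few lines can reproduce. The function name is illustrative. The exact product gives 25,721.25 chips at the last stage, which the text rounds to 25,720.

```python
def yield_funnel(dies, stage_yields):
    """Return the (fractional) unit count remaining after each stage, in order."""
    counts = []
    remaining = float(dies)
    for stage_yield in stage_yields:
        remaining *= stage_yield
        counts.append(remaining)
    return counts

dies = 1000 * 38
wafer, package, system = yield_funnel(dies, [0.75, 0.95, 0.95])
net_yield = system / dies
print(f"{wafer:.0f} {package:.0f} {system:.0f} {net_yield:.1%}")
# 28500 27075 25721 67.7%
```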
Cost/chip:
- Wafer: $38.9.
- Packaging: $52.6.
- Test: ($135K + …) → $12.0.
- Total: ~$103/chip production cost.
Customer-facing target price: $50-200/chip for Y1. Margins are reasonable only above the ~$103 production cost, that is, toward the upper part of the range.
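The per-chip figures can be summed and compared against the price range with a short sketch. The dictionary keys and the `gross_margin` helper are illustrative; the margin ignores R&D, sales, and support costs, which the chapter does not quantify.

```python
# USD per shipped chip, taken from the cost list above.
cost_per_chip = {"wafer": 38.9, "packaging": 52.6, "test": 12.0}
total_cost = sum(cost_per_chip.values())

def gross_margin(price, cost):
    """Gross margin as a fraction of price; negative means selling at a loss."""
    return (price - cost) / price

for price in (50, 200):
    print(f"price ${price}: margin {gross_margin(price, total_cost):.1%}")
```

The break-even price equals `total_cost`, about $103.5, so any price below that in the target range is loss-making on production cost alone.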
Lab Exercise
A customer reports a SIDRA chip issue in the field.
Scenario: “My Y1 chip’s MNIST accuracy dropped from 97% to 94% after 6 months.”
Diagnostic steps:
- Inspect boot logs (temperature, voltage OK?).
- Pull failure-map stats (cell-failure rate normal?).
- Measure reference MVM accuracy.
- Drift analysis (time series).
- Run periodic refresh.
- Re-measure accuracy.
Typical outcome:
Drift accumulated → refresh wasn’t running. SDK auto-refresh feature wasn’t enabled. Setting fixed → accuracy returns to 97%.
Prevention: SDK default auto-refresh on. Customer education materials.
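The drift-analysis step can be sketched as a least-squares trend over logged accuracy. The monthly values below are hypothetical, invented to match the scenario (97% falling to 94% over 6 months); a real log would be noisier.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical monthly accuracy log.
months = [0, 1, 2, 3, 4, 5, 6]
accuracy_log = [0.970, 0.965, 0.960, 0.955, 0.950, 0.945, 0.940]

slope, intercept = linear_trend(months, accuracy_log)
print(f"{slope * 100:.2f} percentage points per month")  # -0.50
```

A steady slope like this points toward accumulated drift with no refresh running, whereas a sudden step in the series would suggest a hardware fault and a look at the failure map instead.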
Cheat Sheet
- Production test: wafer → package → system. Net yield ~67%.
- Boot calibration: 100 ms per power-up.
- BIST: in-chip test, automatic at boot.
- Periodic runtime test: drift/temperature correction.
- Failure map + ECC: runtime tolerance.
- MTBF: 11-year target.
- Production cost: ~$103/Y1 chip.
Vision: Automatic Test and Maintenance
- Y1: classical test + boot calibration.
- Y3: ML-predicted testing (learns defect patterns).
- Y10: self-healing crossbar (auto-refresh).
- Y100: chip self-optimizes (with online learning).
- Y1000: bio-compatible device with years of self-repair.
For Türkiye: test + calibration software is the operational backbone of the SIDRA workshop. Local test equipment (Aselsan, BİLGEM partners).
Further Reading
- Next chapter: 6.10 — End-to-End Production Stack Lab
- Previous: 6.8 — Digital Twin
- Production test: Bushnell & Agrawal, Essentials of Electronic Testing, Springer.
- BIST: Mukhopadhyay et al., IEEE TVLSI BIST tutorial.