💻 Module 6 · Software Stack · Chapter 6.9 · 9 min read

Test, Calibration, Verification

Before a SIDRA chip ships — the layers of QC.

Prerequisites

6.8 — Digital Twin / Simulator

What you'll learn here

Name the production test flow (wafer → package → system)
Detail boot-time calibration steps
Cover BIST (Built-In Self-Test) and continuous hardware verification
Walk through failure-mode analysis and the RMA process
State AI-accuracy verification (model-based QA) standards

Hook: Chip Built — Now What?

419M memristors + billions of transistors. Are they all working? Test, test, test.

Three post-production stages:

Wafer test: every die tested, 75% yield.
Package test: packaged chip, 95% pass.
System test: on motherboard, real workload.

Then the chip goes to the field, where calibration at every boot and BIST keep checking it.

Intuition: Multiple Test Layers

Wafer (38 dies)
    ↓ probe test (each die)
Good dies (~28)
    ↓ packaging
Chips (~95% pass)
    ↓ system test
Ready to ship (~67% total yield)
    ↓
Field (boot-time calibrate + BIST)

Each layer catches defects. Late catches are expensive.
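
The funnel above is a product of per-stage pass rates. A minimal sketch, using the pass rates quoted in this chapter; the function names are illustrative and not part of any SIDRA tool:

```python
# Stage pass rates quoted in this chapter (Y1 targets).
STAGES = [("wafer test", 0.75), ("package test", 0.95), ("system test", 0.95)]


def funnel(dies_in, stages=STAGES):
    """Return (stage name, units surviving) after each test stage."""
    survivors = []
    remaining = float(dies_in)
    for name, pass_rate in stages:
        remaining *= pass_rate
        survivors.append((name, remaining))
    return survivors


def net_yield(stages=STAGES):
    """Overall yield is the product of the per-stage pass rates."""
    total = 1.0
    for _, pass_rate in stages:
        total *= pass_rate
    return total


if __name__ == "__main__":
    for name, units in funnel(38):
        print(f"after {name}: {units:.1f} dies")
    print(f"net yield: {net_yield():.1%}")
```

For one 38-die wafer this gives 28.5 good dies after probe and a net yield a little under 68%.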

Formalism: Test Stages

L1 · Beginner

Wafer test (probe card):

Probe needles touch each die on the wafer. Tests:

DC tests: voltage/current limits.
Memory test: SRAM bit-by-bit.
Crossbar test: program + read 16 reference cells.
Interconnect test: BAR mapping.

Time: 30 s/die × 38 dies/wafer = 1,140 s, about 19 minutes per wafer.

Failed dies marked, not packaged.
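
The probe-time figure scales directly to a batch. A rough sketch, assuming dies are probed one at a time and that wafers split evenly across probers; the prober count is a hypothetical parameter, not a number from this chapter:

```python
SECONDS_PER_DIE = 30   # from the chapter
DIES_PER_WAFER = 38    # from the chapter


def wafer_probe_minutes(dies=DIES_PER_WAFER, seconds_per_die=SECONDS_PER_DIE):
    """Serial probe time for one wafer, in minutes."""
    return dies * seconds_per_die / 60


def batch_probe_hours(wafers, probers=1):
    """Probe time for a batch, assuming wafers split evenly across probers."""
    if probers < 1:
        raise ValueError("need at least one prober")
    return wafers * wafer_probe_minutes() / 60 / probers


if __name__ == "__main__":
    print(f"one wafer: {wafer_probe_minutes():.0f} min")
    print(f"1000 wafers, 1 prober: {batch_probe_hours(1000):.0f} h")
    print(f"1000 wafers, 4 probers: {batch_probe_hours(1000, probers=4):.0f} h")
```

On a single prober a 1000-wafer batch takes roughly two weeks of continuous probing, which is why probe time per die matters.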

Package test:

Packaged chip:

PCIe enumeration (BIOS sees it).
Full memory test (DDR/SRAM).
Full crossbar test (program + read every cell).
Thermal stress (85°C for 1 hour).

A failed chip goes through RMA and is returned to the supplier.

System test:

Chip on motherboard:

Full SDK runs.
MNIST + ResNet inference accuracy.
24-hour stress test.

Pass = “accepted”, ships to customer.

L2 · Full

Boot-time calibration:

Every chip power-up:

1. Voltage rails stabilize.
2. Thermal sensors active.
3. Bandgap reference calibrated.
4. Each crossbar reads 16 reference cells → DAC scale set.
5. Failure map loaded (from NVRAM).
6. SRAM init.
7. Driver READY signal.

Time: ~100 ms (detail in chapter 6.3).
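
Step 4 can be sketched as a gain computation. This assumes a single multiplicative scale per crossbar is enough, which the chapter does not specify; `read_reference_cells` is a hypothetical driver hook, and real calibration may also need offset or per-column terms:

```python
from statistics import mean

N_REFERENCE_CELLS = 16  # per crossbar, from the chapter


def dac_scale(measured, nominal):
    """Gain that maps measured reference readings back onto nominal values.

    Assumption: one multiplicative gain per crossbar is sufficient.
    """
    if len(measured) != N_REFERENCE_CELLS or len(nominal) != N_REFERENCE_CELLS:
        raise ValueError("expected 16 reference readings")
    measured_mean = mean(measured)
    if measured_mean == 0:
        raise ValueError("reference cells read zero; crossbar is likely dead")
    return mean(nominal) / measured_mean


def boot_calibrate(read_reference_cells, nominal, n_crossbars):
    """Return one DAC scale per crossbar.

    read_reference_cells(xb) is a hypothetical hook returning 16 readings.
    """
    return [dac_scale(read_reference_cells(xb), nominal)
            for xb in range(n_crossbars)]


if __name__ == "__main__":
    nominal = [1.0] * N_REFERENCE_CELLS

    def fake_read(xb):
        # Crossbar 1 reads 20% low, for example after drift or a temperature shift.
        return [0.8] * N_REFERENCE_CELLS if xb == 1 else list(nominal)

    print(boot_calibrate(fake_read, nominal, n_crossbars=3))
```

A crossbar that reads 20% low gets a scale of about 1.25; the others stay at 1.0.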

BIST (Built-In Self-Test):

Test circuitry inside the chip. Runs automatically at boot:

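#include <assert.h>
#include <stdint.h>

// sram[], SRAM_SIZE, N_CROSSBARS, N_OPS, test_input, expected[] and the
// helper functions are assumed to come from the firmware headers.
void bist_run(void) {
    // SRAM walking-bit test: write a single 1 to each bit position and read it back
    for (uint32_t addr = 0; addr < SRAM_SIZE; addr++) {
        for (unsigned bit = 0; bit < 32; bit++) {
            // Unsigned shift: 1 << 31 on a signed int is undefined behavior
            uint32_t pattern = UINT32_C(1) << bit;
            sram[addr] = pattern;
            assert(sram[addr] == pattern);
        }
    }

    // Crossbar reference test: a failing crossbar is marked bad, not treated as fatal
    for (int xb = 0; xb < N_CROSSBARS; xb++) {
        program_reference(xb);
        if (!verify_reference(xb)) {
            mark_crossbar_bad(xb);
        }
    }

    // Compute engine test (result type assumed to be a 32-bit word)
    for (int op = 0; op < N_OPS; op++) {
        uint32_t result = run_op(op, test_input);
        assert(result == expected[op]);
    }
}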

A faulty component is recorded in the failure map so that work is routed around it, with ECC covering the remaining errors.
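
To see what the walking-bit test catches, here is a toy software model. `FaultySram` is an illustrative stand-in for real hardware, and returning a list of failing (address, bit) pairs is an assumed failure-map format, not the chip's actual one:

```python
WORD_BITS = 32
WORD_MASK = 0xFFFFFFFF


class FaultySram:
    """Toy SRAM; stuck_at_zero maps address -> mask of bits that always read 0."""

    def __init__(self, size, stuck_at_zero=None):
        self.words = [0] * size
        self.stuck_at_zero = stuck_at_zero or {}

    def write(self, addr, value):
        self.words[addr] = value & WORD_MASK

    def read(self, addr):
        dead_bits = self.stuck_at_zero.get(addr, 0)
        return self.words[addr] & ~dead_bits & WORD_MASK


def walking_bit_test(sram, size):
    """Walk a single 1 through every bit of every word; return failing (addr, bit) pairs."""
    failures = []
    for addr in range(size):
        for bit in range(WORD_BITS):
            pattern = 1 << bit
            sram.write(addr, pattern)
            if sram.read(addr) != pattern:
                failures.append((addr, bit))
    return failures


if __name__ == "__main__":
    sram = FaultySram(size=8, stuck_at_zero={3: 1 << 5})
    print(walking_bit_test(sram, 8))
```

A walking-1 pattern like this finds bits stuck at 0. Catching bits stuck at 1 needs the complementary walking-0 pass, which is omitted here.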

Periodic runtime test:

During inference, about 1% of operations are reference MVMs with known expected results: one after every 100 inferences. If the results drift, the chip recalibrates.

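#include <math.h>  // fabs

// Runs once per inference; every 100th inference triggers a reference MVM.
if (inference_count % 100 == 0) {
    actual = run_reference_mvm();
    // fabs, not abs: abs() takes an int (actual/expected assumed floating-point)
    if (fabs(actual - expected) > THRESHOLD) {
        recalibrate_all();
    }
}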

Drift and temperature changes are corrected automatically.
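
The same check can be exercised off-chip against a simulated drift. A minimal sketch: the class name, the callback style, and the drift numbers in the usage example are all assumptions made for illustration:

```python
class DriftMonitor:
    """Runs a reference MVM every `period` inferences and recalibrates on drift."""

    def __init__(self, expected, threshold, period=100):
        self.expected = expected
        self.threshold = threshold
        self.period = period
        self.inference_count = 0
        self.recalibrations = 0

    def after_inference(self, run_reference_mvm, recalibrate):
        """Call once per inference; returns True if a recalibration was triggered."""
        self.inference_count += 1
        if self.inference_count % self.period != 0:
            return False
        actual = run_reference_mvm()
        if abs(actual - self.expected) > self.threshold:
            recalibrate()
            self.recalibrations += 1
            return True
        return False


if __name__ == "__main__":
    state = {"offset": 0.0}           # simulated analog drift
    monitor = DriftMonitor(expected=1.0, threshold=0.025)

    def reference_mvm():
        return 1.0 + state["offset"]

    def recalibrate():
        state["offset"] = 0.0         # recalibration removes accumulated drift

    for _ in range(1000):
        state["offset"] += 1e-4       # hypothetical drift per inference
        monitor.after_inference(reference_mvm, recalibrate)
    print("recalibrations:", monitor.recalibrations)
```

The threshold and period trade off against each other: a tighter threshold or shorter period catches drift sooner but spends more time on reference MVMs and recalibration.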

L3 · Deep

Failure-mode analysis:

Most failures are caught in production. The remainder show up in the field:

Failure	Frequency	Action
Cell drift	1%/year	ECC + auto-refresh
Single CU dead	0.1%/year	Failure map + reroute
Cluster fault	0.01%/year	Performance drop, notify
Total chip fail	0.001%/year	RMA, replace

MTBF (Mean Time Between Failures) target: 100,000 hours, about 11.4 years of continuous operation.
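
MTBF is a rate, not a guaranteed lifetime. The sketch below converts the 100,000-hour target into years and into a fleet-level expectation. It assumes a constant failure rate (exponential lifetime model) and 24/7 operation, neither of which the chapter states, and it does not say which failure classes in the table count toward MTBF:

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, continuous operation assumed


def mtbf_years(mtbf_hours):
    """MTBF expressed in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours):
    """P(at least one failure within a year) under a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(fleet_size, mtbf_hours):
    """Expected failure events per year across a fleet of identical chips."""
    return fleet_size * HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    target = 100_000  # hours, from the chapter
    print(f"MTBF: {mtbf_years(target):.1f} years")
    print(f"annual failure probability: {annual_failure_probability(target):.1%}")
    print(f"expected events/year in a 1000-chip fleet: "
          f"{expected_failures_per_year(1000, target):.1f}")
```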

RMA process:

Customer reports a fault → ticket → SIDRA sales → spare chip shipped → faulty returned → analyzed → supply chain improvement.

Field data → production improvement. Closed loop.

AI accuracy verification:

After deploying a compiled model on real SIDRA:

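# Accuracy on the physical chip vs. the simulator (assumed to be fractions in [0, 1])
test_acc = chip.benchmark(test_data)
sim_acc = simulator.benchmark(test_data)

# Flag a gap of more than 1 percentage point between simulator and chip
if abs(test_acc - sim_acc) > 0.01:
    investigate()
    update_simulator_model()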

Field data feeds the simulator. The sim model stays close to reality.

Compliance + certification:

SIDRA chip certifications:

CE (Europe).
FCC (US).
TSE (Türkiye).
AEC-Q100 (automotive for Y10+).
DO-254 (avionics for Y100+).

Each cert requires testing + documentation + independent audit.

Production-error benchmarks:

Y1 targets:

Wafer yield: 75%.
Package yield: 95%.
System yield: 95%.
Net: about 67% (0.75 × 0.95 × 0.95 = 67.7%). For reference, typical industry yield (TSMC 28 nm) is ~75-85%.

Y10+ targets are tighter: 80% net.

Experiment: Y1 Test Flow

Wafer batch (1000 wafers):

1000 × 38 dies = 38,000 dies.
Wafer test 75% yield: 28,500 good dies.
Wafer test cost: $1/die = $38K.

Packaging: 28,500 dies packaged.

Package test 95%: 27,075 good packages.
Test cost: $5/package = $135K.

System test: 27,075 packages.

System test 95%: 25,720 ship-ready chips.

Net production yield: 25,720 / 38,000 = 67.7%.

Cost/chip:

Wafer: $1000 × 1000 / 25,720 = $38.9.
Packaging: $50 × 27,075 / 25,720 = $52.6.
Test: ($38K + $135K + $135K) / 25,720 = $12.0.
Total: ~$103/chip production cost.

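Customer-facing target: $50-200/chip for Y1. Margins are reasonable only toward the upper end of that range, since a price below ~$103 would not cover production cost.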
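
The cost arithmetic above can be reproduced in a few lines. This sketch follows the chapter's accounting, which charges packaging and package test on the 27,075 passing packages. The $5 system-test unit cost is inferred from the third $135K term rather than stated, and the `charge_on_inputs` switch is an added assumption that charges those steps on all 28,500 packaged dies instead:

```python
WAFERS = 1000
DIES_PER_WAFER = 38
WAFER_COST = 1000.0        # $ per wafer
PACKAGING_COST = 50.0      # $ per package
WAFER_TEST_COST = 1.0      # $ per die
PACKAGE_TEST_COST = 5.0    # $ per package
SYSTEM_TEST_COST = 5.0     # $ per unit; inferred, not stated in the chapter
WAFER_YIELD, PACKAGE_YIELD, SYSTEM_YIELD = 0.75, 0.95, 0.95


def cost_per_chip(charge_on_inputs=False):
    """Per-shipped-chip cost breakdown in dollars."""
    dies = WAFERS * DIES_PER_WAFER
    good_dies = dies * WAFER_YIELD
    good_packages = good_dies * PACKAGE_YIELD
    shipped = good_packages * SYSTEM_YIELD
    # The chapter charges packaging on passing packages; the alternative
    # charges it on every die that entered packaging.
    packaged_units = good_dies if charge_on_inputs else good_packages
    wafer = WAFERS * WAFER_COST / shipped
    packaging = PACKAGING_COST * packaged_units / shipped
    test = (WAFER_TEST_COST * dies
            + PACKAGE_TEST_COST * packaged_units
            + SYSTEM_TEST_COST * good_packages) / shipped
    return {"wafer": wafer, "packaging": packaging, "test": test,
            "total": wafer + packaging + test}


if __name__ == "__main__":
    for label, flag in (("chapter accounting", False), ("charged on inputs", True)):
        costs = cost_per_chip(flag)
        print(label, {k: round(v, 1) for k, v in costs.items()})
```

Charging packaging and package test on every packaged die raises the total by about $3 per chip, so the choice of accounting matters less than the yields themselves.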

Quick Quiz

What are the 3 production test stages?

1 stage
Wafer test → Package test → System test (each with a different yield)
10 stages
None

Lab Exercise

A customer reports a SIDRA chip issue in the field.

Scenario: “My Y1 chip’s MNIST accuracy dropped from 97% to 94% after 6 months.”

Diagnostic steps:

Inspect boot logs (temperature, voltage OK?).
Pull failure-map stats (cell-failure rate normal?).
Measure reference MVM accuracy.
Drift analysis (time series).
Run periodic refresh.
Re-measure accuracy.

Typical outcome:

Drift accumulated because periodic refresh was not running: the SDK auto-refresh feature was not enabled. With the setting fixed, accuracy returns to 97%.

Prevention: SDK default auto-refresh on. Customer education materials.
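
The drift-analysis step in the diagnostic list can be sketched as a straight-line fit over logged accuracy. The monthly samples below are hypothetical, chosen only to match the scenario's 97% to 94% drop over 6 months, and a linear trend is itself an assumption about how drift accumulates:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until(threshold, slope, intercept):
    """Month at which the fitted line crosses `threshold`; None if not declining."""
    if slope >= 0:
        return None
    return (threshold - intercept) / slope


if __name__ == "__main__":
    months = [0, 1, 2, 3, 4, 5, 6]
    accuracy = [97.0, 96.5, 96.0, 95.5, 95.0, 94.5, 94.0]  # hypothetical log, in %
    slope, intercept = linear_fit(months, accuracy)
    print(f"trend: {slope:+.2f} points/month")
    print(f"reaches 90% at month {months_until(90.0, slope, intercept):.0f}")
```

A steady negative slope with normal boot logs and failure-map stats points to missed refresh rather than a hardware fault.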

Cheat Sheet

Production test: wafer → package → system. Net yield ~67%.
Boot calibration: 100 ms per power-up.
BIST: in-chip test, automatic at boot.
Periodic runtime test: drift/temperature correction.
Failure map + ECC: runtime tolerance.
MTBF: 11-year target.
Production cost: ~$103/Y1 chip.

Vision: Automatic Test and Maintenance

Y1: classical test + boot calibration.
Y3: ML-predicted testing (learns defect patterns).
Y10: self-healing crossbar (auto-refresh).
Y100: chip self-optimizes (with online learning).
Y1000: bio-compatible device with years of self-repair.

For Türkiye: test + calibration software is the operational backbone of the SIDRA workshop. Local test equipment (Aselsan, BİLGEM partners).
