💻 Module 6 · Software Stack · Chapter 6.9 · 9 min read

Test, Calibration, Verification

Before a SIDRA chip ships — the layers of QC.

What you'll learn here

  • Name the production test flow (wafer → package → system)
  • Detail boot-time calibration steps
  • Cover BIST (Built-In Self-Test) and continuous hardware verification
  • Walk through failure-mode analysis and the RMA process
  • State AI-accuracy verification (model-based QA) standards

Hook: Chip Built — Now What?

419M memristors + billions of transistors. Are they all working? Test, test, test.

Three post-production stages:

  1. Wafer test: every die tested, 75% yield.
  2. Package test: packaged chip, 95% pass.
  3. System test: on motherboard, real workload.

Then to the field. In the field: continuous boot-time calibration + BIST.

Intuition: Multiple Test Layers

Wafer (38 dies)
    ↓ probe test (each die)
Good dies (~28)
    ↓ packaging
Chips (~95% pass)
    ↓ system test
Ready to ship (~68% total yield)

Field (boot-time calibrate + BIST)

Each layer catches defects. Late catches are expensive.
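
The funnel compounds multiplicatively. A minimal sketch of that arithmetic, assuming each stage's pass rate is independent of the others; `net_yield` is an illustrative name, not part of any SIDRA tool:

```python
from math import prod

def net_yield(stage_yields):
    """Multiply per-stage pass rates to get the end-to-end yield."""
    return prod(stage_yields)

# Pass rates from the funnel: wafer probe, package test, system test.
stages = {"wafer": 0.75, "package": 0.95, "system": 0.95}

units = 38.0  # dies per wafer, as in the diagram
for name, pass_rate in stages.items():
    units *= pass_rate
    print(f"after {name} test: {units:.1f} good units")

print(f"net yield: {net_yield(stages.values()):.1%}")
```

Because the rates multiply, a one-point loss at any stage costs roughly the same net yield, but a unit rejected late has already absorbed the packaging and test spend of the earlier stages.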

Formalism: Test Stages

L1 · Beginner

Wafer test (probe card):

Probe needles touch each die on the wafer. Tests:

  • DC tests: voltage/current limits.
  • Memory test: SRAM bit-by-bit.
  • Crossbar test: program + read 16 reference cells.
  • Interconnect test: BAR mapping.

Time: 30 s/die × 38 dies/wafer = 19 minutes per wafer.

Failed dies marked, not packaged.

Package test:

Packaged chip:

  • PCIe enumeration (BIOS sees it).
  • Full memory test (DDR/SRAM).
  • Full crossbar test (program + read every cell).
  • Thermal stress (85°C for 1 hour).

Failed chip RMA → returned to supplier.

System test:

Chip on motherboard:

  • Full SDK runs.
  • MNIST + ResNet inference accuracy.
  • 24-hour stress test.

Pass = “accepted”, ships to customer.

L2 · Full

Boot-time calibration:

Every chip power-up:

1. Voltage rails stabilize.
2. Thermal sensors active.
3. Bandgap reference calibrated.
4. Each crossbar reads 16 reference cells → DAC scale set.
5. Failure map loaded (from NVRAM).
6. SRAM init.
7. Driver READY signal.

Time: ~100 ms (detail in chapter 6.3).
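
Step 4 can be pictured as fitting a gain from the reference reads. The sketch below is one plausible way to do it, not the chip's documented method: it assumes the 16 reference cells have known expected read values and that the measured reads differ from them by a single common gain. The name `fit_dac_scale` and the data are hypothetical.

```python
def fit_dac_scale(expected, measured):
    """Least-squares gain s that minimizes sum((s * measured - expected)^2)."""
    num = sum(m * e for m, e in zip(measured, expected))
    den = sum(m * m for m in measured)
    if den == 0:
        raise ValueError("all reference reads are zero; crossbar is likely dead")
    return num / den

# Hypothetical data: 16 reference cells whose reads come back 4% low.
expected = [float(i + 1) for i in range(16)]
measured = [0.96 * e for e in expected]

scale = fit_dac_scale(expected, measured)
print(f"DAC scale: {scale:.4f}")
```

A single least-squares gain averages out per-cell noise across the 16 reads; a crossbar whose fitted scale falls far from 1.0 would be a natural candidate for the failure map.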

BIST (Built-In Self-Test):

Test circuitry inside the chip. Runs automatically at boot:

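#include <assert.h>
#include <stdint.h>

// SRAM_SIZE, N_CROSSBARS, N_OPS, sram[] (assumed volatile uint32_t),
// expected[], test_input and the helper functions are assumed to come
// from the platform headers.
void bist_run(void) {
    // SRAM walking-bit test: write a single 1 to each bit position and read it back.
    for (uint32_t addr = 0; addr < SRAM_SIZE; addr++) {
        for (int bit = 0; bit < 32; bit++) {
            // Unsigned constant: shifting a signed 1 into bit 31 is undefined behavior.
            uint32_t pattern = UINT32_C(1) << bit;
            sram[addr] = pattern;
            assert(sram[addr] == pattern);
        }
    }

    // Crossbar reference test: a crossbar that fails verification is marked bad.
    for (int xb = 0; xb < N_CROSSBARS; xb++) {
        program_reference(xb);
        if (!verify_reference(xb)) {
            mark_crossbar_bad(xb);
        }
    }

    // Compute engine test: every op must reproduce its known-good result.
    for (int op = 0; op < N_OPS; op++) {
        uint32_t result = run_op(op, test_input);  // result type assumed
        assert(result == expected[op]);
    }
}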

Faulty component → update failure map, ECC routes around.
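
The failure-map half of that sentence can be sketched as a logical-to-physical remap. This shows only the "route around bad crossbars" idea, not ECC, and the class and method names are hypothetical rather than taken from the SIDRA SDK:

```python
class FailureMap:
    """Tracks bad crossbars and maps logical indices onto healthy ones."""

    def __init__(self, n_crossbars):
        self.n_crossbars = n_crossbars
        self.bad = set()

    def mark_bad(self, xb):
        """Record a crossbar that failed BIST."""
        self.bad.add(xb)

    def healthy(self):
        """Physical indices of crossbars that are still usable, in order."""
        return [xb for xb in range(self.n_crossbars) if xb not in self.bad]

    def physical(self, logical):
        """Return the physical crossbar backing a logical index."""
        good = self.healthy()
        if not 0 <= logical < len(good):
            raise IndexError("not enough healthy crossbars")
        return good[logical]

fmap = FailureMap(n_crossbars=8)
fmap.mark_bad(2)
fmap.mark_bad(5)
print([fmap.physical(i) for i in range(6)])  # [0, 1, 3, 4, 6, 7]
```

Software above the map keeps addressing logical crossbars 0..5; only the capacity shrinks when a crossbar is marked bad.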

Periodic runtime test:

During inference, 1% of MVMs are reference MVMs with a known expected result. If the results drift, recalibrate.

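// Every 100th inference (1%) runs a reference MVM with a known result.
// Assumes the reference result is a scalar float; needs <math.h> for fabsf.
if (inference_count % 100 == 0) {
    float actual = run_reference_mvm();
    // fabsf, not abs: abs() takes an int and would truncate the error.
    if (fabsf(actual - expected) > THRESHOLD) {
        recalibrate_all();
    }
}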

Drift and temperature changes are corrected automatically.
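
To see the check in action without hardware, here is a toy simulation. Everything in it is an assumption made for illustration: the 2×2 reference matrix, the multiplicative drift model, the threshold, and the `simulate` function itself.

```python
def mvm(matrix, vector):
    """Plain matrix-vector multiply, standing in for a crossbar MVM."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def simulate(n_inferences, reference_every=100, threshold=0.05, decay=0.9999):
    """Count how often the reference check triggers a recalibration."""
    weights = [[1.0, 0.5], [0.25, 2.0]]   # programmed reference weights
    ref_input = [1.0, 1.0]
    expected = mvm(weights, ref_input)    # known-good answer
    drifted = [row[:] for row in weights]
    recalibrations = 0
    for count in range(1, n_inferences + 1):
        # Toy drift model: every inference shrinks each weight slightly.
        drifted = [[w * decay for w in row] for row in drifted]
        if count % reference_every == 0:  # 1 in 100 inferences = 1%
            actual = mvm(drifted, ref_input)
            error = max(abs(a - e) for a, e in zip(actual, expected))
            if error > threshold:
                # Stand-in for recalibrate_all(): restore the programmed values.
                drifted = [row[:] for row in weights]
                recalibrations += 1
    return recalibrations

print(simulate(1000))
```

With these made-up parameters the error crosses the threshold at every third reference check, so the interval between recalibrations reflects how fast the cells drift relative to the tolerance.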

L3 · Deep

Failure-mode analysis:

Most failures are caught in production. The rest surface as field issues:

Failure           Frequency     Action
Cell drift        1%/year       ECC + auto-refresh
Single CU dead    0.1%/year     Failure map + reroute
Cluster fault     0.01%/year    Performance drop, notify
Total chip fail   0.001%/year   RMA, replace
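
To get a feel for what these rates mean operationally, the sketch below converts them into expected events per year for a fleet. It assumes each rate reads as "fraction of deployed chips affected per year", which the table does not state explicitly, and the fleet size of 10,000 is a made-up figure.

```python
# Annual rates from the table, read as the fraction of chips affected per year (assumption).
RATES = {
    "cell drift": 0.01,
    "single CU dead": 0.001,
    "cluster fault": 0.0001,
    "total chip fail": 0.00001,
}

def expected_events(fleet_size, rates=RATES):
    """Expected number of affected chips per year for each failure mode."""
    return {mode: fleet_size * rate for mode, rate in rates.items()}

# Hypothetical fleet of 10,000 deployed chips.
for mode, count in expected_events(10_000).items():
    print(f"{mode:16s} {count:8.1f} chips/year")
```

Under that reading, only the last row leads to an RMA; the other three are absorbed in the field by ECC, the failure map, or a notification.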

MTBF (Mean Time Between Failures) target: 100,000 hours ≈ 11 years.

RMA process:

Customer reports a fault → ticket → SIDRA sales → spare chip shipped → faulty returned → analyzed → supply chain improvement.

Field data → production improvement. Closed loop.

AI accuracy verification:

After deploying a compiled model on real SIDRA:

test_acc = chip.benchmark(test_data)
sim_acc = simulator.benchmark(test_data)

if abs(test_acc - sim_acc) > 0.01:
    # Sim vs chip differ by > 1%
    investigate()
    update_simulator_model()

Field data feeds the simulator. The sim model stays close to reality.

Compliance + certification:

SIDRA chip certifications:

  • CE (Europe).
  • FCC (US).
  • TSE (Türkiye).
  • AEC-Q100 (automotive for Y10+).
  • DO-254 (avionics for Y100+).

Each cert requires testing + documentation + independent audit.

Production-error benchmarks:

Y1 targets:

  • Wafer yield: 75%.
  • Package yield: 95%.
  • System yield: 95%.
  • Net: 67.7% (0.75 × 0.95 × 0.95), versus an industry-typical ~75-85% (TSMC 28 nm).

Y10+ tighter, target 80% net.

Experiment: Y1 Test Flow

Wafer batch (1000 wafers):

  • 1000 × 38 dies = 38,000 dies.
  • Wafer test 75% yield: 28,500 good dies.
  • Wafer test cost: $1/die = $38K.

Packaging: 28,500 dies packaged.

  • Package test 95%: 27,075 good packages.
  • Test cost: $5/package = $135K.

System test: 27,075 packages.

  • System test 95%: 25,720 ship-ready chips.

Net production yield: 25,720 / 38,000 = 67.7%.

Cost/chip:

  • Wafer: $1000 × 1000 / 25,720 = $38.9.
  • Packaging: $50 × 27,075 / 25,720 = $52.6.
  • Test: ($38K + $135K + $135K) / 25,720 = $12.0.
  • Total: ~$103/chip production cost.
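
The breakdown can be re-derived in a few lines. This sketch takes the figures in the list as $1000 per wafer, $50 per package, and test costs of $38K + $135K + $135K; the function name and parameter names are illustrative.

```python
def cost_per_chip(wafers, wafer_cost, packages, package_cost, test_costs, shipped):
    """Spread each cost pool over the chips that actually ship."""
    wafer = wafers * wafer_cost / shipped
    packaging = packages * package_cost / shipped
    test = sum(test_costs) / shipped
    return {
        "wafer": wafer,
        "packaging": packaging,
        "test": test,
        "total": wafer + packaging + test,
    }

costs = cost_per_chip(
    wafers=1000, wafer_cost=1000,          # $1000 per wafer
    packages=27_075, package_cost=50,      # $50 per package, count as used in the list
    test_costs=[38_000, 135_000, 135_000],
    shipped=25_720,
)
for item, dollars in costs.items():
    print(f"{item:10s} ${dollars:7.2f}")
```

Dividing by shipped chips rather than started dies is what makes yield matter: every rejected unit's cost is carried by the ones that pass.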

Customer-facing target: $50-200/chip for Y1. At ~$103 production cost, margins are positive only toward the upper part of that range.

Quick Quiz

What are the 3 production test stages?

Lab Exercise

A customer reports a SIDRA chip issue in the field.

Scenario: “My Y1 chip’s MNIST accuracy dropped from 97% to 94% after 6 months.”

Diagnostic steps:

  1. Inspect boot logs (temperature, voltage OK?).
  2. Pull failure-map stats (cell-failure rate normal?).
  3. Measure reference MVM accuracy.
  4. Drift analysis (time series).
  5. Run periodic refresh.
  6. Re-measure accuracy.
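
Step 4 can be as simple as fitting a straight line to the logged accuracy. The sketch below uses ordinary least squares on synthetic monthly samples invented to match the scenario (97% falling to 94% over 6 months); real data would come from the chip's logs, and `linear_trend` is an illustrative helper.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic example only: accuracy (%) sampled once a month.
months = [0, 1, 2, 3, 4, 5, 6]
accuracy = [97.0, 96.5, 96.0, 95.5, 95.0, 94.5, 94.0]

slope, intercept = linear_trend(months, accuracy)
print(f"trend: {slope:+.2f} points/month, starting near {intercept:.1f}%")
```

A steady negative slope points toward accumulating drift, whereas a sudden step in the series would suggest a discrete fault and a closer look at the failure map.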

Typical outcome:

Drift accumulated → refresh wasn’t running. SDK auto-refresh feature wasn’t enabled. Setting fixed → accuracy returns to 97%.

Prevention: SDK default auto-refresh on. Customer education materials.

Cheat Sheet

  • Production test: wafer → package → system. Net yield ~67%.
  • Boot calibration: 100 ms per power-up.
  • BIST: in-chip test, automatic at boot.
  • Periodic runtime test: drift/temperature correction.
  • Failure map + ECC: runtime tolerance.
  • MTBF: 11-year target.
  • Production cost: ~$103/Y1 chip.

Vision: Automatic Test and Maintenance

  • Y1: classical test + boot calibration.
  • Y3: ML-predicted testing (learns defect patterns).
  • Y10: self-healing crossbar (auto-refresh).
  • Y100: chip self-optimizes (with online learning).
  • Y1000: bio-compatible device with years of self-repair.

For Türkiye: test + calibration software is the operational backbone of the SIDRA workshop. Local test equipment (Aselsan, BİLGEM partners).

Further Reading