The ISPP Algorithm, Step by Step
Iterative programming to reach 256 levels, from the software perspective.
What you'll learn here
- Detail the ISPP algorithm at firmware level with pseudocode
- Explain the adaptive pulse-width strategy
- Describe how ISPP failure (max iterations) is handled
- Summarize the compiler ↔ ISPP API
- Identify Y10's closed-loop ISPP target
Hook: 256 Levels at 1% Precision
Chapter 5.5 covered the ISPP (Incremental Step Pulse Programming) principle. This chapter works through the firmware implementation in detail.
Goal: program 256 discrete G levels at ±1% precision. The algorithm runs in the RISC-V firmware, per cell.
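To make the ±1% goal concrete in the fixed-point units the firmware uses (µS × 100), here is a minimal sketch of how a per-cell tolerance could be derived from the target. The helper names are hypothetical, and the choice to clamp the tolerance to at least one LSB is an assumption, not something this chapter specifies.

```c
#include <assert.h>
#include <stdlib.h>   /* abs() */

/* ±1% of the target, in µS × 100.
 * Assumption: clamp to at least one LSB so very small targets stay reachable. */
static int ispp_tolerance_1pct(int target_g) {
    assert(target_g > 0);
    int tol = target_g / 100;
    return tol > 0 ? tol : 1;
}

/* Strict comparison: the cell passes only if |error| < tolerance. */
static int ispp_within_tolerance(int target_g, int g_actual) {
    return abs(target_g - g_actual) < ispp_tolerance_1pct(target_g);
}
```

For a 20 µS target (2000 in fixed-point) the tolerance comes out as 20, i.e. ±0.20 µS, so a reading of 1990 passes and a reading of 1980 does not.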
Intuition: Iterative Approach
A standard SET pulse has 5%+ error. ISPP feedback brings it to 1%:
Target G = 50 µS.
Pulse 1 → measure 30 µS → error -20 µS.
Pulse 2 → measure 42 µS → error -8 µS.
Pulse 3 → measure 49 µS → error -1 µS. Done!
Programming takes 3-15 iterations, 5-7 on average. The pulse voltage and width are raised on each iteration until the cell reaches the target.
Formalism: Pseudocode + Adaptive Control
Core ISPP firmware code:
#include <stdint.h>
#include <stdlib.h>   // abs()

// INITIAL_VOLTAGE, INITIAL_WIDTH, MAX_VOLTAGE, MAX_WIDTH, ISPP_SUCCESS,
// ISPP_FAIL and the pulse/read helpers are assumed to be defined elsewhere
// in the firmware.
typedef struct {
    uint32_t cell_addr;
    int target_g;        // µS × 100 (fixed-point)
    int tolerance;       // ±µS × 100
    int max_iterations;  // typically 15
} ispp_params_t;

int ispp_program_cell(ispp_params_t *p) {
    int pulse_voltage = INITIAL_VOLTAGE;  // mV (1500 = 1.5V)
    int pulse_width = INITIAL_WIDTH;      // ns (10)
    for (int i = 0; i < p->max_iterations; i++) {
        apply_set_pulse(p->cell_addr, pulse_voltage, pulse_width);
        int g_actual = read_cell(p->cell_addr);
        // Sign convention: positive error means the cell is still under target.
        int error = p->target_g - g_actual;
        if (abs(error) < p->tolerance) {
            return ISPP_SUCCESS;
        }
        if (error > 0) {
            // Under target → stronger SET on the next iteration
            pulse_voltage += 50;                  // +50 mV
            pulse_width = pulse_width * 12 / 10;  // +20% (integer arithmetic)
            // Saturate
            if (pulse_voltage > MAX_VOLTAGE) pulse_voltage = MAX_VOLTAGE;
            if (pulse_width > MAX_WIDTH) pulse_width = MAX_WIDTH;
        } else {
            // Overshoot → small RESET, then restart from the initial pulse
            apply_reset_pulse(p->cell_addr, -500, 5);  // -0.5V, 5ns
            pulse_voltage = INITIAL_VOLTAGE;
            pulse_width = INITIAL_WIDTH;
        }
    }
    return ISPP_FAIL;  // max_iterations exhausted
}
Adaptive pulse strategy:
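The first few iterations are aggressive (large steps). As the cell approaches the target, the steps get smaller, so the iteration count grows roughly logarithmically with the initial error.
// Pulse width (ns) by error magnitude, for the under-target case (error > 0).
// Thresholds are in the same units as error.
if (error > 50) pulse_width = 100;      // big step
else if (error > 20) pulse_width = 50;
else if (error > 5) pulse_width = 20;
else pulse_width = 10;                  // fine tune
Crossbar parallel programming: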
The 256 cells in a row can be programmed in parallel. The same pulse is applied to the whole row, and bit-line (BL) control selects which cells receive it.
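#include <stdbool.h>
#include <stdlib.h>   // abs()

// tolerance: ±µS × 100, same units as in ispp_params_t
void program_row(int row, int targets[256], int tolerance) {
    int voltage = INITIAL_VOLTAGE;  // mV
    int width = INITIAL_WIDTH;      // ns
    int active[256];
    for (int col = 0; col < 256; col++) active[col] = 1;  // all cells active
    for (int iter = 0; iter < 15; iter++) {
        // Pulse to all active cells
        apply_row_pulse(row, active, voltage, width);
        // Read all
        int reads[256];
        read_row(row, reads);
        // Per-cell check
        bool all_done = true;
        for (int col = 0; col < 256; col++) {
            if (active[col] && abs(reads[col] - targets[col]) < tolerance) {
                active[col] = 0;  // this cell done
            } else if (active[col]) {
                all_done = false;
            }
        }
        if (all_done) return;
        // Pulse adjust, saturating as in the per-cell version
        voltage += 50;
        width = width * 12 / 10;
        if (voltage > MAX_VOLTAGE) voltage = MAX_VOLTAGE;
        if (width > MAX_WIDTH) width = MAX_WIDTH;
    }
    // Cells still active here have failed ISPP.
}
Time: 256 cells in parallel → ~16 iterations × 50 ns = 800 ns/row. 256 rows × 800 ns ≈ 200 µs/crossbar.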
ISPP fail handling:
After 15 iterations without success:
- The cell is treated as stuck-LRS or stuck-HRS.
- It is added to the failure map.
- ECC remaps it to a redundant cell.
- The compiler is informed and writes the weight to another cell.
Y1 ISPP fail rate: 0.5%. Manageable.
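The failure-map step above can be sketched as a one-bit-per-cell bitmap for a single 256 × 256 crossbar. This is only an illustration: the type and function names are hypothetical, and the chapter does not say how the real failure map is laid out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XBAR_ROWS 256
#define XBAR_COLS 256

/* Hypothetical layout: one bit per cell, 1 = ISPP failed, do not use. */
typedef struct {
    uint8_t  bits[XBAR_ROWS * XBAR_COLS / 8];
    uint32_t count;   /* number of failed cells recorded */
} failure_map_t;

static void fmap_init(failure_map_t *m) {
    memset(m, 0, sizeof *m);
}

static int fmap_is_failed(const failure_map_t *m, int row, int col) {
    assert(row >= 0 && row < XBAR_ROWS && col >= 0 && col < XBAR_COLS);
    uint32_t idx = (uint32_t)row * XBAR_COLS + (uint32_t)col;
    return (m->bits[idx / 8] >> (idx % 8)) & 1;
}

static void fmap_mark_failed(failure_map_t *m, int row, int col) {
    if (fmap_is_failed(m, row, col)) return;   /* keep count accurate */
    uint32_t idx = (uint32_t)row * XBAR_COLS + (uint32_t)col;
    m->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
    m->count++;
}
```

At the quoted 0.5% fail rate, a 65,536-cell crossbar would be expected to record roughly 330 entries, so an 8 KiB bitmap per crossbar is a simple if not especially compact choice.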
Closed-loop ISPP (Y10 target):
Classic ISPP: pulse → wait → measure. Closed-loop: measure during the pulse, stop instantly.
Pros:
- 30% time reduction.
- Higher precision (less overshoot).
Cons:
- More complex circuitry.
Tested in Y10 prototype.
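The control flow of closed-loop programming (start the pulse, watch the conductance, cut the pulse the moment the target is reached) can be illustrated with a host-side sketch. Everything here is hypothetical: the `cl_hw_t` hooks, the function names, and the fake cell that gains 1.00 µS per sample. A software poll like this only shows the logic; it is not a claim about how the Y10 circuitry works.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware hooks; a real driver would wrap registers here. */
typedef struct {
    void (*pulse_start)(void *ctx, int mv);
    int  (*sample_g)(void *ctx);      /* live conductance, µS × 100 */
    void (*pulse_stop)(void *ctx);
} cl_hw_t;

/* Drive one SET pulse and cut it as soon as the live reading reaches
 * target_g. Returns the last reading, or -1 if the sample budget ran out. */
static int closed_loop_set(const cl_hw_t *hw, void *ctx,
                           int target_g, int mv, int max_samples) {
    assert(hw != NULL && max_samples > 0);
    hw->pulse_start(ctx, mv);
    for (int i = 0; i < max_samples; i++) {
        int g = hw->sample_g(ctx);
        if (g >= target_g) {
            hw->pulse_stop(ctx);
            return g;
        }
    }
    hw->pulse_stop(ctx);
    return -1;
}

/* Fake cell for host-side testing: gains 1.00 µS per sample while pulsed. */
typedef struct { int g; int pulsing; } fake_cell_t;

static void fake_start(void *ctx, int mv) {
    (void)mv;
    ((fake_cell_t *)ctx)->pulsing = 1;
}

static int fake_sample(void *ctx) {
    fake_cell_t *c = (fake_cell_t *)ctx;
    if (c->pulsing) c->g += 100;
    return c->g;
}

static void fake_stop(void *ctx) {
    ((fake_cell_t *)ctx)->pulsing = 0;
}

static const cl_hw_t fake_hw = { fake_start, fake_sample, fake_stop };
```

With the fake cell starting at 0 and a 50 µS target (5000), the pulse is cut on the 50th sample; with a budget of only 10 samples the function stops the pulse and reports -1.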
Temperature compensation:
G(T) follows an Arrhenius law. A per-crossbar thermal sensor feeds the firmware, which applies a correction:
int g_target_corrected = g_target * exp_lookup(Ea / (k * T));
ISPP runs against the corrected target, which reduces temperature-induced drift.
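One way to read the correction is as a ratio between the conductance at the sensor temperature and at a reference temperature. The sketch below rests on assumptions that are not in this chapter: conductance is taken to rise with temperature as exp(-Ea / (k·T)), the target is taken to be specified at a reference temperature `t_ref_k`, and the exponential is a 4th-order Taylor approximation standing in for the firmware's `exp_lookup` table. It uses floating point and is meant for host-side experimentation only.

```c
#include <assert.h>

#define K_BOLTZMANN_EV 8.617333e-5   /* Boltzmann constant, eV/K */

/* 4th-order Taylor approximation of exp(x); reasonable for |x| below ~0.5. */
static double exp_taylor4(double x) {
    return 1.0 + x * (1.0 + x * (0.5 + x * (1.0 / 6.0 + x / 24.0)));
}

/* Scale a target specified at t_ref_k to the value ISPP should aim for
 * when the cell is programmed at t_now_k (assumed Arrhenius behaviour).
 * g_target is in µS × 100, ea_ev is the activation energy in eV. */
static int correct_target_for_temp(int g_target, double ea_ev,
                                   double t_now_k, double t_ref_k) {
    assert(t_now_k > 0.0 && t_ref_k > 0.0);
    double x = (ea_ev / K_BOLTZMANN_EV) * (1.0 / t_ref_k - 1.0 / t_now_k);
    return (int)(g_target * exp_taylor4(x) + 0.5);
}
```

Under these assumptions the target is unchanged when the sensor reads the reference temperature, is raised when the crossbar is hotter, and is lowered when it is cooler.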
Compiler ↔ ISPP API:
Driver IOCTL LOAD_MODEL:
struct model_data {
    weight_t weights[N];            // FP32 weights
    int n_layers;
    layer_info_t layers[N_LAYERS];
};
Firmware:
- Receive the compiler-quantized weights.
- Map to cells (cluster, CU, crossbar, row, col).
- Program each cell with ISPP.
- Update failure map.
- Signal “model loaded” to driver.
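The mapping step in the list above can be sketched as a decomposition of a linear weight index into (cluster, CU, crossbar, row, col). The 256 × 256 crossbar size comes from this chapter. The number of crossbars per CU and CUs per cluster below are placeholder values chosen for illustration, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XBAR_COLS        256
#define XBAR_ROWS        256
#define XBARS_PER_CU      16   /* placeholder, not from the chapter */
#define CUS_PER_CLUSTER    4   /* placeholder, not from the chapter */

typedef struct {
    uint16_t cluster, cu, crossbar, row, col;
} cell_loc_t;

/* Column varies fastest, so consecutive weights fill a row first,
 * which suits row-parallel programming. */
static cell_loc_t locate_cell(uint64_t idx) {
    cell_loc_t loc;
    loc.col      = (uint16_t)(idx % XBAR_COLS);       idx /= XBAR_COLS;
    loc.row      = (uint16_t)(idx % XBAR_ROWS);       idx /= XBAR_ROWS;
    loc.crossbar = (uint16_t)(idx % XBARS_PER_CU);    idx /= XBARS_PER_CU;
    loc.cu       = (uint16_t)(idx % CUS_PER_CLUSTER); idx /= CUS_PER_CLUSTER;
    assert(idx <= UINT16_MAX);
    loc.cluster  = (uint16_t)idx;
    return loc;
}
```

With this ordering, index 255 is the last column of row 0, index 256 starts row 1, and index 65,536 starts the next crossbar.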
Performance:
Y1 model load:
- 419M cells × 800 ns / 256 (parallel) = 1.3 seconds.
- Practical: ~640 ms (some clusters are programmed in parallel).
This adds roughly 1 second of latency before the first inference. It is paid once per model load.
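The serial estimate above is simple enough to check in a few lines. The helper name is hypothetical. It rounds the row count up, since a partially filled row still costs a full programming pass.

```c
#include <assert.h>
#include <stdint.h>

/* Time to program `cells` cells row by row with no cross-crossbar
 * parallelism, given the per-row ISPP time in ns. */
static uint64_t model_load_ns(uint64_t cells, uint32_t cells_per_row,
                              uint64_t row_ns) {
    assert(cells_per_row > 0);
    uint64_t rows = (cells + cells_per_row - 1) / cells_per_row;  /* ceil */
    return rows * row_ns;
}
```

For 419M cells, 256 cells per row and 800 ns per row this gives 1,309,375,200 ns, about 1.3 seconds. The ~640 ms practical figure is then consistent with an effective concurrency of about two across clusters.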
Experiment: ISPP Trace
Cell #12345 target G = 75 µS:
Iter 1: V=1500mV, W=10ns → G=20 µS, err=-55
Iter 2: V=1550mV, W=12ns → G=45 µS, err=-30
Iter 3: V=1600mV, W=14ns → G=62 µS, err=-13
Iter 4: V=1650mV, W=17ns → G=72 µS, err=-3
Iter 5: V=1700mV, W=20ns → G=76 µS, err=+1 → SUCCESS!
Total: 5 iterations × 50 ns = 250 ns.
Failed example (cell #98765, stuck-HRS):
Iter 1-15: G always under 5 µS, target 50.
Iter 15: SUCCESS=false, FAIL signal.
The cell is disabled and ECC remaps it to a redundant cell.
Lab Exercise
ISPP optimization:
(a) Classic ISPP: 5% initial error, 5 iterations → 1%. Total 250 ns/cell.
(b) Closed-loop: 175 ns/cell. 30% faster.
(c) Adaptive pulse: 200 ns/cell. 20% faster.
(d) Both combined (Y10 target): 150 ns/cell. 40% faster.
Y10 model load (10B cells, 50K parallel columns): 10B / 50K × 150 ns = 30 ms. 20× faster than Y1 (640 ms).
Inference downtime during model swaps is critical, so the shorter Y10 load time is a large practical gain.
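The percentages and the Y1 versus Y10 ratio in this exercise can be reproduced with integer arithmetic. The two helpers below are hypothetical names for illustration. The ratio is returned in tenths so that no floating point is needed.

```c
#include <assert.h>

/* Time reduction of new_ns relative to base_ns, in whole percent. */
static int reduction_percent(int base_ns, int new_ns) {
    assert(base_ns > 0);
    return (base_ns - new_ns) * 100 / base_ns;
}

/* Ratio old/new in tenths, e.g. 213 means 21.3×. */
static int speedup_tenths(int old_ms, int new_ms) {
    assert(new_ms > 0);
    return old_ms * 10 / new_ms;
}
```

Against the 250 ns baseline this yields 30%, 20% and 40% for cases (b), (c) and (d). For model load, 640 ms against 30 ms gives 21.3×, which the text rounds to 20×.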
Cheat Sheet
- ISPP: iterative programming, 5% → 1% error.
- 5-15 iterations, average 5-7.
- Adaptive pulse: error-based pulse width.
- Parallel: 256 cells/row simultaneously.
- Fail handling: failure map + ECC.
- Y1 model load: 640 ms.
- Y10 model load (closed-loop + adaptive): 30 ms.
Vision: Write Speed = Online Learning
Faster ISPP enables online learning. Continuous weight updates like STDP:
- Y1: static model (training external).
- Y3: last-layer online ISPP (1-second update).
- Y10: 30 ms ISPP → fine-tuning in seconds.
- Y100: microsecond ISPP → hardware STDP. Bio-compatible plasticity.
Further Reading
- Next chapter: 6.5 — SDK Layers
- Previous: 6.3 — RISC-V Firmware
- Classical ISPP: chapter 5.5.
- Closed-loop: Strachan et al., Closed-loop programming in memristor crossbars, IEEE EDL 2018.