Compiler: Model → Analog Mapping
From FP32 weights to crossbar addresses — the SIDRA compiler's job.
Prerequisites
What you'll learn here
- Identify the compiler's 5 main stages (parse, quantize, map, optimize, emit)
- Explain the strategy for splitting a weight matrix across crossbars
- Describe the operator-fusion, layer-ordering, and memory-planning optimizations
- Understand how the compiler includes noise, IR drop, and thermal models
- Estimate compile time and output size for Y1
Hook: From High-Level to Silicon Addresses
PyTorch model: FP32 tensors, abstract ops. SIDRA hardware: each cell has an exact (cluster, CU, crossbar, row, col) address.
The compiler bridges the gap. Tasks:
- Parse the model (PyTorch → graph).
- Quantize (FP32 → INT8).
- Map (place weights onto cells).
- Optimize (fusion, scheduling, IR drop compensation).
- Emit (binary, driver format).
This chapter details every stage.
Intuition: 5 Stages
[PyTorch Model]
↓ parse
[IR (Intermediate Representation) — computation graph]
↓ quantize
[INT8 weights + activations]
↓ map
[Cell assignments (cluster/CU/crossbar/row/col)]
↓ optimize
[Fused ops, scheduled layers, pre-distorted weights]
↓ emit
[SIDRA binary (.sidra file)]
↓ driver load
[Programmed on-chip via ISPP]

In short: a classical compiler pipeline (like GCC's), but the target is SIDRA hardware.
Formalism: Compiler Stages
Stage 1: Parse.
PyTorch model → FX graph (PyTorch’s IR) or ONNX → SIDRA IR.
SIDRA IR nodes:
- MatMul: weight matrix + input tensor.
- Conv2D: filter + input.
- Activation: ReLU, GELU, etc.
- LayerNorm, Softmax.
Stage 2: Quantize.
FP32 → INT8. Per-tensor min/max analysis (using calibration data):
def calibrate(model, calib_data):
    # stats: per-layer running min/max accumulators (initialization omitted)
    for batch in calib_data:
        model(batch)  # forward pass populates layer.output
        for layer in model.layers:
            stats[layer.name].update(layer.output.min(), layer.output.max())
    return stats

def quantize_weight(w_fp32, stats):
    scale = (stats.max - stats.min) / 255
    w_int8 = round((w_fp32 - stats.min) / scale).clip(0, 255)
    return w_int8, scale, stats.min

Result: each weight stored as 8 bits plus a per-tensor scale factor and zero point (stats.min).
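A round-trip sanity check of the asymmetric scheme above (a sketch using NumPy arrays; the real compiler operates on the IR, and the worst-case reconstruction error is half a quantization step):

```python
import numpy as np

def quantize(w, w_min, w_max):
    # Asymmetric affine quantization to 8 bits, as in quantize_weight above.
    scale = (w_max - w_min) / 255
    q = np.clip(np.round((w - w_min) / scale), 0, 255).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, zero):
    # Reconstruct approximate FP32 values from 8-bit codes.
    return q.astype(np.float32) * scale + zero

w = np.random.uniform(-1.0, 1.0, size=(768, 768)).astype(np.float32)
q, scale, zero = quantize(w, w.min(), w.max())
err = np.abs(dequantize(q, scale, zero) - w).max()
assert err <= scale / 2 + 1e-6  # worst case: half a quantization step
```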
Stage 3: Map.
Weight matrix W is M×N; each crossbar is 256×256. Split W into 256×256 blocks:

M = 768, N = 768 → 3 × 3 = 9 crossbars.

Each block lives on one crossbar. Cell addresses:
for i in range(M_blocks):
    for j in range(N_blocks):
        crossbar_id = allocate_crossbar()
        for r in range(256):
            for c in range(256):
                cell_addr = (cluster, cu, crossbar_id, r, c)
                write_map[(i*256 + r, j*256 + c)] = cell_addr

Stage 4: Optimize.
4.1 Operator fusion:
Linear + BN + ReLU into a single fused op:
y = ReLU(BN(Wx + b)) → y = fused_linear_bn_relu(W', b', x)

The BN parameters are absorbed into W' and b', so the compute engine executes the whole fused op in a single cycle.
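The algebra behind the fold: for BN with mean μ, variance σ², scale γ, and shift β, BN(Wx + b) = γ·(Wx + b − μ)/√(σ² + ε) + β, so W' = (γ/√(σ² + ε))·W and b' = γ·(b − μ)/√(σ² + ε) + β. A NumPy sketch verifying the identity:

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    # Absorb BatchNorm parameters into the linear layer's weights and bias.
    s = gamma / np.sqrt(var + eps)  # per-output-channel scale
    return W * s[:, None], (b - mu) * s + beta

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 8)), rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mu, var = rng.standard_normal(4), rng.random(4) + 0.5
x = rng.standard_normal(8)

y_unfused = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mu, var)
assert np.allclose(Wf @ x + bf, y_unfused)  # fused == unfused
```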
4.2 Layer ordering:
Place layers on crossbars to minimize data movement; consecutive layers go in the same cluster.
4.3 IR drop compensation (chapter 5.12):
Programming correction for cells at the end of a long WL:
for row in range(256):
    for col in range(256):
        ir_drop = estimate_ir_drop(row, col, activity)
        w_compensated = w_target * (1 + ir_drop)
        program_cell(cell_addr, w_compensated)

4.4 Noise-aware (chapter 5.10):
Sensitive layers (near output) → averaging. Rest single read.
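Why averaging helps sensitive layers: N independent reads reduce the read-noise standard deviation by √N. A quick sketch (the Gaussian noise model and σ values are illustrative, not measured SIDRA figures):

```python
import numpy as np

rng = np.random.default_rng(42)
g_true, sigma = 50.0, 2.0  # target conductance (µS), per-read noise std
reads = g_true + sigma * rng.standard_normal((10_000, 16))

single = reads[:, 0]           # one read per cell
averaged = reads.mean(axis=1)  # 16-read average: std drops ~4x (sqrt(16))

assert averaged.std() < single.std() / 3
```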
4.5 Redundancy:
Critical weights are stored 3× (triple modular redundancy, TMR); non-critical weights 1×.
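For analog values, TMR typically reduces to taking the median of the three copies, which discards one arbitrarily corrupted copy per weight. A sketch (the error model is illustrative):

```python
import numpy as np

def tmr_read(copies):
    # Median-of-3: tolerates one arbitrarily corrupted copy per weight.
    return np.median(copies, axis=0)

w = np.array([0.3, -0.7, 1.2])
copies = np.stack([w, w, w.copy()])
copies[2, 1] = 99.0  # one copy of one weight is corrupted

assert np.allclose(tmr_read(copies), w)  # corruption voted out
```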
Stage 5: Emit.
Binary format:
Header: magic, version, model_id, metadata
Section 1: Cell programs (address + target_G each)
Section 2: Layer graph (execution order, deps)
Section 3: Activation LUTs
Section 4: Runtime params

Typical Y1 model binary: ~500 MB (419M cells × 8 bit + overhead).
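A sketch of how the emit stage might serialize the header and one cell-program entry (the field layout here is illustrative, not the actual .sidra specification):

```python
import struct

MAGIC = b"SDRA"

def emit_header(version, model_id, n_cells):
    # Illustrative layout: magic, u16 version, u32 model id, u64 cell count.
    return MAGIC + struct.pack("<HIQ", version, model_id, n_cells)

def emit_cell(cluster, cu, crossbar, row, col, target_g_us):
    # One cell program: 5-part address + target conductance in µS (f32).
    return struct.pack("<BBHBBf", cluster, cu, crossbar, row, col, target_g_us)

blob = emit_header(1, 42, 2) + emit_cell(0, 1, 7, 10, 20, 55.5)
assert blob[:4] == MAGIC
```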
Compiler simulator loop:
The compiler uses the digital twin (chapter 6.8) for pre-deployment model testing:
compiler_output = compile(model)
simulator.load(compiler_output)
accuracy = simulator.benchmark(test_data)
if accuracy < target:
    # Re-compile with stronger QAT, more averaging, etc.
    model = qat_retrain(model)
    compiler_output = compile(model)

Layer-wise quantization:
Not all layers need the same bit depth: sensitive layers get INT16, the rest INT8. The compiler chooses automatically:
for layer in model.layers:
    if layer.sensitivity > threshold:
        layer.bit_depth = 16
    else:
        layer.bit_depth = 8

Multi-chip scheduling:
Big models (e.g. GPT-3) won’t fit a single Y1. The compiler partitions:
Chip 1: Layers 1-4
Chip 2: Layers 5-8
...

Chip-to-chip traffic goes over PCIe, which becomes the latency bottleneck.
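A minimal sketch of the partitioner: greedily pack consecutive layers onto a chip until its cell budget is exhausted (the capacity figure comes from the Y1 cell count above; layer sizes are illustrative):

```python
def partition(layer_sizes, chip_capacity):
    """Greedy contiguous partition: consecutive layers stay on one chip
    until adding the next layer would exceed the chip's cell capacity."""
    chips, current, used = [], [], 0
    for i, size in enumerate(layer_sizes):
        if current and used + size > chip_capacity:
            chips.append(current)  # start a new chip
            current, used = [], 0
        current.append(i)
        used += size
    chips.append(current)
    return chips

# 8 layers of 100M cells each, 419M cells per Y1 chip -> 4 layers per chip
assert partition([100e6] * 8, 419e6) == [[0, 1, 2, 3], [4, 5, 6, 7]]
```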
Compile time:
Y1 model compile:
- Parse: 0.1 s.
- Quantize + calibrate: 10-60 s (depends on calibration data).
- Map + optimize: 5-30 s.
- Emit: 1 s.
- Total: 30-100 s.
Compile done once; output cached.
SIDRA IR:
LLVM-style SSA (Static Single Assignment) form. Optimization passes:
- Dead-code elimination.
- Constant folding.
- Layer merging.
- Dataflow optimization.
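A toy constant-folding pass in the spirit of the passes above, over a flat list-of-nodes IR (the node shape is illustrative, not the actual SIDRA IR):

```python
def constant_fold(nodes):
    """Fold ops whose inputs are all constants into new constant nodes.
    Folded results feed later nodes, so chains of constants cascade."""
    consts = {n["name"]: n["value"] for n in nodes if n["op"] == "const"}
    out = []
    for n in nodes:
        if n["op"] == "add" and all(i in consts for i in n["inputs"]):
            val = sum(consts[i] for i in n["inputs"])
            consts[n["name"]] = val
            out.append({"op": "const", "name": n["name"], "value": val})
        else:
            out.append(n)
    return out

ir = [
    {"op": "const", "name": "a", "value": 2},
    {"op": "const", "name": "b", "value": 3},
    {"op": "add", "name": "c", "inputs": ["a", "b"]},
]
assert constant_fold(ir)[2] == {"op": "const", "name": "c", "value": 5}
```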
Future (Y10+ compiler):
- Auto-quantization: optimal bit depth per layer.
- Training-time compiler: hardware-aware gradients during training.
- Multi-objective optimization: speed, energy, accuracy Pareto.
Experiment: Compile an MNIST Model
import torch
import sidra
# 1. PyTorch model
model = torch.load("mnist_mlp.pth") # 2-layer MLP, 100K params
# 2. Calibration data
calib_data = [img for img, _ in mnist_test[:100]]
# 3. Compile
compiled = sidra.compile(
    model,
    calib_data=calib_data,
    target_device="y1",
    optimization_level=2,
)
print(f"Compiled size: {compiled.size_bytes / 1024 / 1024:.1f} MB")
print(f"Crossbars used: {compiled.crossbar_count}")
print(f"Expected accuracy: {compiled.simulated_accuracy:.2%}")
# Output:
# Compiled size: 0.5 MB
# Crossbars used: 4
# Expected accuracy: 97.5%
# 4. Deploy
chip = sidra.Chip(0)
chip.deploy(compiled)
# 5. Inference
for img, label in mnist_test:
    pred = chip.infer(img)  # 25 µs/inference

Compile time: 30 s. Deploy (ISPP): 1 s. Inference: 25 µs/sample.
Quick Quiz
Lab Exercise
Compiler optimization impact.
Model: ResNet-18 ImageNet.
Comparison:
| Optimization | Speed | Energy | Accuracy |
|---|---|---|---|
| None | 1× | 1× | 75.0% |
| Op fusion | 1.3× | 0.9× | 75.0% |
| Quantize-aware | 1.1× | 1× | 75.2% |
| IR drop comp | 1× | 1× | 75.5% |
| Noise-injection training | 1× | 1.1× | 76.0% |
| All | 1.5× | 0.8× | 76.0% |
Compile time: None 30 s, All 120 s (4× longer, but the runtime gains more than repay the one-time cost).
Cheat Sheet
- 5 stages: Parse → Quantize → Map → Optimize → Emit.
- Main optimizations: op fusion, IR-drop comp, noise-aware, redundancy.
- Compile time: 30-100 s/model (once).
- Binary size: ~500 MB Y1 (full cell program).
- Simulator loop: pre-deploy accuracy guarantee.
Vision: The Compiler's Future
- Y1: offline compile, PyTorch input.
- Y3: auto-quantization, per-layer bit depth.
- Y10: hardware-aware training compiler.
- Y100: multi-objective (speed/energy/accuracy Pareto).
- Y1000: AI-generated compiler (self-improving compiler).
For Türkiye: compiler-engineering depth. Turkish compiler research groups (METU, Boğaziçi) are strong in this area.
Further Reading
- Next chapter: 6.8 — Digital Twin / Simulator
- Previous: 6.6 — PyTorch Backend
- AI compiler: Apache TVM, MLIR (LLVM).
- Compiler for CIM: Ankit et al., PUMA: A programmable ultra-efficient memristor-based accelerator, ASPLOS 2019.