💻 Module 6 · Software Stack · Chapter 6.7 · 10 min read

Compiler: Model → Analog Mapping

From FP32 weights to crossbar addresses — the SIDRA compiler's job.

What you'll learn here

  • Identify the compiler's 5 main stages (parse, quantize, map, optimize, emit)
  • Explain the strategy for splitting a weight matrix across crossbars
  • Describe the operator-fusion, layer-ordering, and memory-planning optimizations
  • Understand how the compiler includes noise, IR drop, and thermal models
  • Estimate compile time and output size for Y1

Hook: From High-Level to Silicon Addresses

PyTorch model: FP32 tensors, abstract ops. SIDRA hardware: each cell has an exact (cluster, CU, crossbar, row, col) address.

The compiler bridges the gap. Tasks:

  1. Parse the model (PyTorch → graph).
  2. Quantize (FP32 → INT8).
  3. Map (place weights onto cells).
  4. Optimize (fusion, scheduling, IR drop compensation).
  5. Emit (binary, driver format).

This chapter details every stage.
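The five tasks chain into a single driver. A minimal sketch with hypothetical stand-in stage functions that only record execution order (the real stages are detailed below):

```python
# Sketch of the five-stage driver. The stage functions are hypothetical
# stand-ins that just record execution order; the real SIDRA stages are
# described in the rest of this chapter.
trace = []

def parse(model):
    trace.append("parse")            # PyTorch -> SIDRA IR
    return {"graph": model}

def quantize(ir, calib_data):
    trace.append("quantize")         # FP32 -> INT8 + scale factors
    return ir

def map_weights(ir):
    trace.append("map")              # weights -> cell addresses
    return ir

def optimize(ir):
    trace.append("optimize")         # fusion, scheduling, IR-drop comp
    return ir

def emit(ir):
    trace.append("emit")             # serialize to .sidra binary
    return b"SIDRA"

def compile_model(model, calib_data):
    ir = parse(model)
    ir = quantize(ir, calib_data)
    ir = map_weights(ir)
    ir = optimize(ir)
    return emit(ir)

binary = compile_model("mnist_mlp", calib_data=[])
print(trace)  # -> ['parse', 'quantize', 'map', 'optimize', 'emit']
```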

Intuition: 5 Stages

[PyTorch Model]
    ↓ parse
[IR (Intermediate Representation) — computation graph]
    ↓ quantize
[INT8 weights + activations]
    ↓ map
[Cell assignments (cluster/CU/crossbar/row/col)]
    ↓ optimize
[Fused ops, scheduled layers, pre-distorted weights]
    ↓ emit
[SIDRA binary (.sidra file)]
    ↓ driver load
[Programmed on-chip via ISPP]

The same pipeline as a classical compiler (like GCC), except the target is SIDRA hardware instead of a CPU.

Formalism: Compiler Stages

L1 · Basics

Stage 1: Parse.

PyTorch model → FX graph (PyTorch’s IR) or ONNX → SIDRA IR.

SIDRA IR nodes:

  • MatMul: weight matrix + input tensor.
  • Conv2D: filter + input.
  • Activation: ReLU, GELU, etc.
  • LayerNorm, Softmax.

Stage 2: Quantize.

FP32 → INT8. Per-tensor min/max analysis (using calibration data):

def calibrate(model, calib_data):
    stats = {}  # layer name -> running (min, max) over all batches
    for batch in calib_data:
        model(batch)  # forward pass fills each layer.output
        for layer in model.layers:
            lo, hi = stats.get(layer.name, (float("inf"), float("-inf")))
            stats[layer.name] = (min(lo, layer.output.min()),
                                 max(hi, layer.output.max()))
    return stats

def quantize_weight(w_fp32, lo, hi):
    # asymmetric uint8: map [lo, hi] onto the 256 levels [0, 255]
    scale = (hi - lo) / 255
    w_int8 = round((w_fp32 - lo) / scale).clip(0, 255)
    return w_int8, scale, lo  # lo doubles as the zero-point offset

Result: each weight as 8-bit + per-tensor scale factor.
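A worked round trip of this scheme on a toy weight vector (numpy; the values are illustrative):

```python
import numpy as np

# Toy weight vector; calibration range is just its own min/max here.
w = np.array([-0.50, 0.00, 0.25, 1.00])
lo, hi = w.min(), w.max()                # range [-0.5, 1.0]
scale = (hi - lo) / 255                  # ~0.00588 per level
w_int8 = np.round((w - lo) / scale).clip(0, 255).astype(np.uint8)
w_back = w_int8 * scale + lo             # dequantize

print(w_int8)                            # -> [  0  85 128 255]
print(np.abs(w_back - w).max() < scale)  # round-trip error under one level
```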

Stage 3: Map.

Weight matrix W ∈ ℝ^(M×N); each crossbar is 256 × 256, so W is tiled into 256 × 256 blocks:

M = 768, N = 768 → 768/256 = 3 blocks per side → 3 × 3 = 9 crossbars.

Each block W_ij lives on one crossbar. Cell addresses:

M_blocks = (M + 255) // 256   # ceil(M / 256)
N_blocks = (N + 255) // 256
for i in range(M_blocks):
    for j in range(N_blocks):
        crossbar_id = allocate_crossbar()   # next free crossbar in this CU
        for r in range(256):
            for c in range(256):
                # global weight coordinate -> physical cell address
                cell_addr = (cluster, cu, crossbar_id, r, c)
                write_map[(i*256 + r, j*256 + c)] = cell_addr

L2 · Full

Stage 4: Optimize.

4.1 Operator fusion:

Linear + BN + ReLU into a single fused op:

y = ReLU(BN(Wx + b))  →  y = fused_linear_bn_relu(W', b', x)

The BN parameters are absorbed into W and b, so the compute engine executes the fused op in a single cycle.
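The BN absorption is plain algebra: BN(Wx + b) = γ(Wx + b − µ)/√(σ² + ε) + β folds into W′ = sW (row-wise, with s = γ/√(σ² + ε)) and b′ = s(b − µ) + β. A minimal numpy sketch, not the SIDRA implementation:

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    s = gamma / np.sqrt(var + eps)   # per-output-channel scale
    return W * s[:, None], (b - mu) * s + beta

# Sanity check: folded linear == BN(linear) on random data.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4)
gamma = rng.normal(size=4); beta = rng.normal(size=4)
mu = rng.normal(size=4); var = rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=3)

y_ref = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mu, var)
assert np.allclose(Wf @ x + bf, y_ref)
```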

4.2 Layer ordering:

Place layers on crossbars so as to minimize data movement: consecutive layers go into the same cluster where possible.
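One greedy strategy for this: fill the current cluster until its crossbar budget runs out. The capacity and per-layer crossbar counts below are illustrative, not Y1 specs:

```python
# Greedy layer placement sketch: consecutive layers share a cluster until
# its crossbar budget is exhausted, then spill to the next cluster.
def place_layers(layer_crossbars, cluster_capacity=16):
    placement, cluster, used = [], 0, 0
    for need in layer_crossbars:
        if used + need > cluster_capacity:   # current cluster full
            cluster, used = cluster + 1, 0
        placement.append(cluster)
        used += need
    return placement

print(place_layers([9, 4, 4, 9, 2]))  # -> [0, 0, 1, 1, 1]
```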

4.3 IR drop compensation (chapter 5.12):

Programming correction for cells far down a long word line (WL):

for row in range(256):
    for col in range(256):
        # fractional conductance loss at this position for the expected
        # activity pattern (chapter 5.12 model)
        ir_drop = estimate_ir_drop(row, col, activity)
        # pre-distort: program slightly high so the read-time value is correct
        w_compensated = w_target * (1 + ir_drop)
        program_cell(cell_addr, w_compensated)

4.4 Noise-aware (chapter 5.10):

Layers most sensitive to noise (typically near the output) use multi-read averaging; the rest use a single read.

4.5 Redundancy:

Critical weights are stored 3× with majority voting (TMR, triple modular redundancy); non-critical weights are stored once.
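For analog cells, the majority vote amounts to taking the median of the three programmed copies, so a single drifted copy cannot corrupt the read. A sketch (the conductance values are illustrative):

```python
# TMR read for analog cells: the median of three copies rejects one outlier.
def tmr_read(copies):
    a, b, c = sorted(copies)
    return b  # median

print(tmr_read([0.51, 0.50, 0.93]))  # drifted third copy ignored -> 0.51
```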

Stage 5: Emit.

Binary format:

Header: magic, version, model_id, metadata
Section 1: Cell programs (address + target_G each)
Section 2: Layer graph (execution order, deps)
Section 3: Activation LUTs
Section 4: Runtime params

Typical Y1 model binary: 500 MB (419M cells × 8 bit + overhead).
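The size estimate checks out as back-of-the-envelope arithmetic:

```python
# 419M cells at 8 bits (1 byte) each, before header, layer-graph,
# and activation-LUT overhead.
cells = 419_000_000
payload_bytes = cells          # one byte per 8-bit cell value
payload_mb = payload_bytes / 1e6
print(f"cell payload ≈ {payload_mb:.0f} MB, + overhead ≈ 500 MB total")
```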

L3 · Deep

Compiler simulator loop:

The compiler uses the digital twin (chapter 6.8) for pre-deployment model testing:

compiler_output = compile(model)
simulator.load(compiler_output)
while simulator.benchmark(test_data) < target:
    # Re-compile with stronger QAT, more averaging, etc.
    model = qat_retrain(model)
    compiler_output = compile(model)
    simulator.load(compiler_output)

Layer-wise quantization:

Not all layers need the same bit depth: sensitive layers run at INT16, the rest at INT8. The compiler chooses automatically:

for layer in model.layers:
    if layer.sensitivity > threshold:
        layer.bit_depth = 16
    else:
        layer.bit_depth = 8

Multi-chip scheduling:

Big models (e.g. GPT-3) won't fit on a single Y1, so the compiler partitions them across chips:

Chip 1: Layers 1-4
Chip 2: Layers 5-8
...

Chip-to-chip traffic goes over PCIe, which becomes the latency bottleneck.
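A minimal sketch of the even split above (a real partitioner would balance crossbar usage, not just layer count):

```python
# Evenly partition n_layers across n_chips: ceil(n_layers / n_chips)
# consecutive layers per chip, matching the "Layers 1-4 / 5-8" split.
def partition_layers(n_layers, n_chips):
    per_chip = -(-n_layers // n_chips)   # ceil division
    return [list(range(i, min(i + per_chip, n_layers)))
            for i in range(0, n_layers, per_chip)]

print(partition_layers(8, 2))  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```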

Compile time:

Y1 model compile:

  • Parse: 0.1 s.
  • Quantize + calibrate: 10-60 s (depends on calibration data).
  • Map + optimize: 5-30 s.
  • Emit: 1 s.
  • Total: 30-100 s.

Compile done once; output cached.
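The cache can key on a content hash of the model weights plus compiler options; the layout below is an assumption for illustration, not the documented SIDRA format:

```python
import hashlib
import pathlib

# Cache compiled binaries under a SHA-256 of (weights, options), so an
# unchanged model skips the 30-100 s compile on subsequent runs.
def cached_compile(model_bytes, opts, compile_fn, cache_dir="~/.sidra/cache"):
    key = hashlib.sha256(model_bytes + repr(opts).encode()).hexdigest()
    path = pathlib.Path(cache_dir).expanduser() / f"{key}.sidra"
    if path.exists():
        return path.read_bytes()          # cache hit
    binary = compile_fn(model_bytes, opts)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(binary)
    return binary
```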

SIDRA IR:

LLVM-style SSA (Static Single Assignment) form. Optimization passes:

  • Dead-code elimination.
  • Constant folding.
  • Layer merging.
  • Dataflow optimization.
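A toy constant-folding pass over such an IR (the node format here is invented for illustration):

```python
# Constant folding on a toy SSA IR: nodes are (name, op, args) triples.
# Known constants propagate into later args; foldable ops are evaluated.
def constant_fold(nodes):
    consts, out = {}, []
    for name, op, args in nodes:
        vals = [consts.get(a, a) for a in args]
        if op == "const":
            consts[name] = vals[0]
        elif op == "add" and all(isinstance(v, (int, float)) for v in vals):
            consts[name] = sum(vals)       # both operands known: fold
        else:
            out.append((name, op, vals))   # keep, with constants substituted
    return out, consts

ir = [("a", "const", [2]), ("b", "const", [3]),
      ("c", "add", ["a", "b"]), ("d", "matmul", ["W", "c"])]
folded, consts = constant_fold(ir)
print(consts["c"])   # -> 5
print(folded)        # -> [('d', 'matmul', ['W', 5])]
```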

Future (Y10+ compiler):

  • Auto-quantization: optimal bit depth per layer.
  • Training-time compiler: hardware-aware gradients during training.
  • Multi-objective optimization: speed, energy, accuracy Pareto.

Experiment: Compile an MNIST Model

import torch
import sidra

# 1. PyTorch model
model = torch.load("mnist_mlp.pth")  # 2-layer MLP, 100K params

# 2. Calibration data
calib_data = [img for img, _ in mnist_test[:100]]

# 3. Compile
compiled = sidra.compile(
    model,
    calib_data=calib_data,
    target_device="y1",
    optimization_level=2
)

print(f"Compiled size: {compiled.size_bytes / 1024 / 1024:.1f} MB")
print(f"Crossbars used: {compiled.crossbar_count}")
print(f"Expected accuracy: {compiled.simulated_accuracy:.2%}")
# Output:
# Compiled size: 0.5 MB
# Crossbars used: 4
# Expected accuracy: 97.5%

# 4. Deploy
chip = sidra.Chip(0)
chip.deploy(compiled)

# 5. Inference
for img, label in mnist_test:
    pred = chip.infer(img)
    # 25 µs/inference

Compile time: 30 seconds. Deploy (ISPP): 1 second. Inference: 25 µs/sample.

Quick Quiz

1/6: What are the 5 compiler stages?

Lab Exercise

Compiler optimization impact.

Model: ResNet-18 ImageNet.

Comparison:

Optimization               Speed   Energy   Accuracy
None (baseline)            -       -        75.0%
Op fusion                  1.3×    0.9×     75.0%
Quantize-aware             1.1×    -        75.2%
IR drop comp               -       -        75.5%
Noise-injection training   1.1×    -        76.0%
All                        1.5×    0.8×     76.0%

Compile time: 30 s with no optimizations vs. 120 s with all of them (4× longer, but the gains at deploy time outweigh it).

Cheat Sheet

  • 5 stages: Parse → Quantize → Map → Optimize → Emit.
  • Main optimizations: op fusion, IR-drop comp, noise-aware, redundancy.
  • Compile time: 30-100 s/model (once).
  • Binary size: ~500 MB Y1 (full cell program).
  • Simulator loop: pre-deploy accuracy guarantee.

Vision: The Compiler's Future

  • Y1: offline compile, PyTorch input.
  • Y3: auto-quantization, per-layer bit depth.
  • Y10: hardware-aware training compiler.
  • Y100: multi-objective (speed/energy/accuracy Pareto).
  • Y1000: AI-generated compiler (self-improving compiler).

For Türkiye: an opportunity to build depth in compiler engineering. Turkish compiler research groups (METU, Boğaziçi) are strong in this area.

Further Reading