Compiler: Model → Analog Mapping
From FP32 weights to crossbar addresses — the SIDRA compiler's job.
Prerequisites
What you'll learn here
- Identify the compiler's 5 main stages (parse, quantize, map, optimize, emit)
- Explain the strategy for splitting a weight matrix across crossbars
- Describe the operator-fusion, layer-ordering, and memory-planning optimizations
- Understand how the compiler includes noise, IR drop, and thermal models
- Estimate compile time and output size for Y1
Hook: From High-Level to Silicon Addresses
PyTorch model: FP32 tensors, abstract ops. SIDRA hardware: each cell has an exact (cluster, CU, crossbar, row, col) address.
The compiler bridges the gap. Tasks:
- Parse the model (PyTorch → graph).
- Quantize (FP32 → INT8).
- Map (place weights onto cells).
- Optimize (fusion, scheduling, IR drop compensation).
- Emit (binary, driver format).
This chapter details every stage.
Intuition: 5 Stages
[PyTorch Model]
↓ parse
[IR (Intermediate Representation) — computation graph]
↓ quantize
[INT8 weights + activations]
↓ map
[Cell assignments (cluster/CU/crossbar/row/col)]
↓ optimize
[Fused ops, scheduled layers, pre-distorted weights]
↓ emit
[SIDRA binary (.sidra file)]
↓ driver load
[Programmed on-chip via ISPP]

In short: a classical compiler pipeline (like GCC's), but the target is SIDRA hardware.
Formalism: Compiler Stages
Stage 1: Parse.
PyTorch model → FX graph (PyTorch’s IR) or ONNX → SIDRA IR.
SIDRA IR nodes:
- MatMul: weight matrix + input tensor.
- Conv2D: filter + input.
- Activation: ReLU, GELU, etc.
- LayerNorm, Softmax.
Stage 2: Quantize.
FP32 → INT8. Per-tensor min/max analysis (using calibration data):
def calibrate(model, calib_data):
    # stats: per-layer running min/max accumulators (initialization omitted)
    for batch in calib_data:
        model(batch)  # forward pass populates layer.output
        for layer in model.layers:
            stats[layer.name].update(layer.output.min(), layer.output.max())
    return stats

def quantize_weight(w_fp32, stats):
    scale = (stats.max - stats.min) / 255
    w_int8 = round((w_fp32 - stats.min) / scale).clip(0, 255)
    return w_int8, scale, stats.min

Result: each weight stored as 8 bits plus a per-tensor scale factor and zero point (stats.min).
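A round-trip sanity check of the asymmetric scheme above (a sketch using NumPy arrays; the real compiler operates on the IR, and the worst-case reconstruction error is half a quantization step):

```python
import numpy as np

def quantize(w, w_min, w_max):
    # Asymmetric affine quantization to 8 bits, as in quantize_weight above.
    scale = (w_max - w_min) / 255
    q = np.clip(np.round((w - w_min) / scale), 0, 255).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, zero):
    # Reconstruct approximate FP32 values from 8-bit codes.
    return q.astype(np.float32) * scale + zero

w = np.random.uniform(-1.0, 1.0, size=(768, 768)).astype(np.float32)
q, scale, zero = quantize(w, w.min(), w.max())
err = np.abs(dequantize(q, scale, zero) - w).max()
assert err <= scale / 2 + 1e-6  # worst case: half a quantization step
```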
Stage 3: Map.
Weight matrix W is M×N; each crossbar is 256×256. Split W into 256×256 blocks:

M = 768, N = 768 → 3 × 3 = 9 crossbars.

Each block lives on one crossbar. Cell addresses:
for i in range(M_blocks):
    for j in range(N_blocks):
        crossbar_id = allocate_crossbar()
        for r in range(256):
            for c in range(256):
                cell_addr = (cluster, cu, crossbar_id, r, c)
                write_map[(i*256 + r, j*256 + c)] = cell_addr

Stage 4: Optimize.
4.1 Operator fusion:
Linear + BN + ReLU into a single fused op:
y = ReLU(BN(Wx + b)) → y = fused_linear_bn_relu(W', b', x)

The BN parameters are absorbed into W' and b', so the compute engine executes the whole fused op in a single cycle.
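The algebra behind the fold: for BN with mean μ, variance σ², scale γ, and shift β, BN(Wx + b) = γ·(Wx + b − μ)/√(σ² + ε) + β, so W' = (γ/√(σ² + ε))·W and b' = γ·(b − μ)/√(σ² + ε) + β. A NumPy sketch verifying the identity:

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    # Absorb BatchNorm parameters into the linear layer's weights and bias.
    s = gamma / np.sqrt(var + eps)  # per-output-channel scale
    return W * s[:, None], (b - mu) * s + beta

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 8)), rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mu, var = rng.standard_normal(4), rng.random(4) + 0.5
x = rng.standard_normal(8)

y_unfused = gamma * (W @ x + b - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn(W, b, gamma, beta, mu, var)
assert np.allclose(Wf @ x + bf, y_unfused)  # fused == unfused
```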
4.2 Layer ordering:
Place layers on crossbars to minimize data movement; consecutive layers go in the same cluster.
4.3 IR drop compensation (chapter 5.12):
Programming correction for cells at the end of a long WL:
for row in range(256):
    for col in range(256):
        ir_drop = estimate_ir_drop(row, col, activity)
        w_compensated = w_target * (1 + ir_drop)
        program_cell(cell_addr, w_compensated)

4.4 Noise-aware (chapter 5.10):
Sensitive layers (near output) → averaging. Rest single read.
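Why averaging helps sensitive layers: N independent reads reduce the read-noise standard deviation by √N. A quick sketch (the Gaussian noise model and σ values are illustrative, not measured SIDRA figures):

```python
import numpy as np

rng = np.random.default_rng(42)
g_true, sigma = 50.0, 2.0  # target conductance (µS), per-read noise std
reads = g_true + sigma * rng.standard_normal((10_000, 16))

single = reads[:, 0]           # one read per cell
averaged = reads.mean(axis=1)  # 16-read average: std drops ~4x (sqrt(16))

assert averaged.std() < single.std() / 3
```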
4.5 Redundancy:
Critical weights are stored 3× (triple modular redundancy, TMR); non-critical weights 1×.
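For analog values, TMR typically reduces to taking the median of the three copies, which discards one arbitrarily corrupted copy per weight. A sketch (the error model is illustrative):

```python
import numpy as np

def tmr_read(copies):
    # Median-of-3: tolerates one arbitrarily corrupted copy per weight.
    return np.median(copies, axis=0)

w = np.array([0.3, -0.7, 1.2])
copies = np.stack([w, w, w.copy()])
copies[2, 1] = 99.0  # one copy of one weight is corrupted

assert np.allclose(tmr_read(copies), w)  # corruption voted out
```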
Stage 5: Emit.
Binary format:
Header: magic, version, model_id, metadata
Section 1: Cell programs (address + target_G each)
Section 2: Layer graph (execution order, deps)
Section 3: Activation LUTs
Section 4: Runtime params

Typical Y1 model binary: ~500 MB (419M cells × 8 bit + overhead).
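A sketch of how the emit stage might serialize the header and one cell-program entry (the field layout here is illustrative, not the actual .sidra specification):

```python
import struct

MAGIC = b"SDRA"

def emit_header(version, model_id, n_cells):
    # Illustrative layout: magic, u16 version, u32 model id, u64 cell count.
    return MAGIC + struct.pack("<HIQ", version, model_id, n_cells)

def emit_cell(cluster, cu, crossbar, row, col, target_g_us):
    # One cell program: 5-part address + target conductance in µS (f32).
    return struct.pack("<BBHBBf", cluster, cu, crossbar, row, col, target_g_us)

blob = emit_header(1, 42, 2) + emit_cell(0, 1, 7, 10, 20, 55.5)
assert blob[:4] == MAGIC
```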
Compiler simulator loop:
The compiler uses the digital twin (chapter 6.8) for pre-deployment model testing:
compiler_output = compile(model)
simulator.load(compiler_output)
accuracy = simulator.benchmark(test_data)
if accuracy < target:
    # Re-compile with stronger QAT, more averaging, etc.
    model = qat_retrain(model)
    compiler_output = compile(model)

Layer-wise quantization:
Not all layers need the same bit depth: sensitive layers get INT16, the rest INT8. The compiler chooses automatically:
for layer in model.layers:
    if layer.sensitivity > threshold:
        layer.bit_depth = 16
    else:
        layer.bit_depth = 8

Multi-chip scheduling:
Big models (e.g. GPT-3) won’t fit a single Y1. The compiler partitions:
Chip 1: Layers 1-4
Chip 2: Layers 5-8
...

Chip-to-chip traffic goes over PCIe, which becomes the latency bottleneck.
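A minimal sketch of the partitioner: greedily pack consecutive layers onto a chip until its cell budget is exhausted (the capacity figure comes from the Y1 cell count above; layer sizes are illustrative):

```python
def partition(layer_sizes, chip_capacity):
    """Greedy contiguous partition: consecutive layers stay on one chip
    until adding the next layer would exceed the chip's cell capacity."""
    chips, current, used = [], [], 0
    for i, size in enumerate(layer_sizes):
        if current and used + size > chip_capacity:
            chips.append(current)  # start a new chip
            current, used = [], 0
        current.append(i)
        used += size
    chips.append(current)
    return chips

# 8 layers of 100M cells each, 419M cells per Y1 chip -> 4 layers per chip
assert partition([100e6] * 8, 419e6) == [[0, 1, 2, 3], [4, 5, 6, 7]]
```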
Compile time:
Y1 model compile:
- Parse: 0.1 s.
- Quantize + calibrate: 10-60 s (depends on calibration data).
- Map + optimize: 5-30 s.
- Emit: 1 s.
- Total: 30-100 s.
Compile done once; output cached.
SIDRA IR:
LLVM-style SSA (Static Single Assignment) form. Optimization passes:
- Dead-code elimination.
- Constant folding.
- Layer merging.
- Dataflow optimization.
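A toy constant-folding pass in the spirit of the passes above, over a flat list-of-nodes IR (the node shape is illustrative, not the actual SIDRA IR):

```python
def constant_fold(nodes):
    """Fold ops whose inputs are all constants into new constant nodes.
    Folded results feed later nodes, so chains of constants cascade."""
    consts = {n["name"]: n["value"] for n in nodes if n["op"] == "const"}
    out = []
    for n in nodes:
        if n["op"] == "add" and all(i in consts for i in n["inputs"]):
            val = sum(consts[i] for i in n["inputs"])
            consts[n["name"]] = val
            out.append({"op": "const", "name": n["name"], "value": val})
        else:
            out.append(n)
    return out

ir = [
    {"op": "const", "name": "a", "value": 2},
    {"op": "const", "name": "b", "value": 3},
    {"op": "add", "name": "c", "inputs": ["a", "b"]},
]
assert constant_fold(ir)[2] == {"op": "const", "name": "c", "value": 5}
```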
Future (Y10+ compiler):
- Auto-quantization: optimal bit depth per layer.
- Training-time compiler: hardware-aware gradients during training.
- Multi-objective optimization: speed, energy, accuracy Pareto.
Experiment: Compile an MNIST Model
import torch
import sidra
# 1. PyTorch model
model = torch.load("mnist_mlp.pth") # 2-layer MLP, 100K params
# 2. Calibration data
calib_data = [img for img, _ in mnist_test[:100]]
# 3. Compile
compiled = sidra.compile(
    model,
    calib_data=calib_data,
    target_device="y1",
    optimization_level=2,
)
print(f"Compiled size: {compiled.size_bytes / 1024 / 1024:.1f} MB")
print(f"Crossbars used: {compiled.crossbar_count}")
print(f"Expected accuracy: {compiled.simulated_accuracy:.2%}")
# Output:
# Compiled size: 0.5 MB
# Crossbars used: 4
# Expected accuracy: 97.5%
# 4. Deploy
chip = sidra.Chip(0)
chip.deploy(compiled)
# 5. Inference
for img, label in mnist_test:
    pred = chip.infer(img)  # 25 µs/inference

Compile time: 30 s. Deploy (ISPP): 1 s. Inference: 25 µs/sample.
Quick Quiz
Lab Exercise
Compiler optimization impact.
Model: ResNet-18 ImageNet.
Comparison:
| Optimization | Speed | Energy | Accuracy |
|---|---|---|---|
| None | 1× | 1× | 75.0% |
| Op fusion | 1.3× | 0.9× | 75.0% |
| Quantize-aware | 1.1× | 1× | 75.2% |
| IR drop comp | 1× | 1× | 75.5% |
| Noise-injection training | 1× | 1.1× | 76.0% |
| All | 1.5× | 0.8× | 76.0% |
Compile time: None 30 s, All 120 s (4× longer, but the runtime gains more than repay the one-time cost).
Cheat Sheet
- 5 stages: Parse → Quantize → Map → Optimize → Emit.
- Main optimizations: op fusion, IR-drop comp, noise-aware, redundancy.
- Compile time: 30-100 s/model (once).
- Binary size: ~500 MB Y1 (full cell program).
- Simulator loop: pre-deploy accuracy guarantee.
Vision: The Compiler's Future
- Y1: offline compile, PyTorch input.
- Y3: auto-quantization, per-layer bit depth.
- Y10: hardware-aware training compiler.
- Y100: multi-objective (speed/energy/accuracy Pareto).
- Y1000: AI-generated compiler (self-improving compiler).
For Türkiye: compiler-engineering depth. Turkish compiler research groups (METU, Boğaziçi) are strong in this area.
Further Reading
- Next chapter: 6.8 — Digital Twin / Simulator
- Previous: 6.6 — PyTorch Backend
- AI compiler: Apache TVM, MLIR (LLVM).
- Compiler for CIM: Ankit et al., PUMA: A programmable ultra-efficient memristor-based accelerator, ASPLOS 2019.