Compute Engine and DMA
Everything outside the crossbar — activation, bias, data movement.
Prerequisites
What you'll learn here
- Identify the compute engine's role outside the crossbar (activation, bias, normalization)
- Explain DMA (Direct Memory Access) and SIDRA's dataflow
- Describe the LUT (Look-Up Table) implementation of activation functions
- Budget compute-engine power and area for Y1
- Describe how inter-layer data flows through the routing matrix
Hook: What Happens After the MVM?
A crossbar completes an MVM in 15 ns, but AI models are not only MVMs. Each layer also needs extra operations:
- Bias add: y = Wx + b. Add vector b to Wx.
- Activation function: ReLU, sigmoid, GELU, softmax. Non-linearity.
- Layer normalization: measure mean/std, normalize.
- Scale factor: for quantization.
- Concat, split, reshape: tensor manipulation.
All of this runs on CMOS in the Compute Engine. The crossbar is the MVM engine; the compute engine is the everything-else engine.
DMA (Direct Memory Access): moves data across clusters/CUs/crossbars without the CPU.
Intuition: The CMOS Sidekick
In each CU:
- 16 crossbars (analog MVM engine).
- 1 compute engine (CMOS digital).
The compute engine has:
- ALU (Arithmetic Logic Unit): 32-bit integer/float add, multiply, bit-shift.
- Activation LUT: 256-entry table (ReLU, sigmoid, GELU). Single-cycle transform.
- Scaler: re-scales intermediate outputs (for INT8).
- DMA controller: data transfers.
- Small SRAM (32 KB): staging buffer.
Clock: 1 GHz. One compute engine per CU → 16 crossbars run in parallel; the compute engine processes outputs in sequence.
DMA:
Y1 memory hierarchy:
- L3 SRAM (chip) 16 MB
- L2 SRAM (cluster) 2 MB × 16 = 32 MB
- L1 SRAM (CU) 128 KB × 400 = 50 MB
DMA moves data across layers without CPU instructions. One DMA controller per cluster.
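The hierarchy's per-level totals can be tallied in a quick sketch (all capacities and counts taken from the figures above):

```python
# Y1 on-chip memory hierarchy: per-level capacity x instance count.
KB, MB = 1024, 1024 * 1024

hierarchy = {
    "L3 (chip)":    {"size": 16 * MB,  "count": 1},    # one shared chip-level SRAM
    "L2 (cluster)": {"size": 2 * MB,   "count": 16},   # one per cluster
    "L1 (CU)":      {"size": 128 * KB, "count": 400},  # one per compute unit
}

# Total capacity contributed by each level.
totals = {name: lvl["size"] * lvl["count"] for name, lvl in hierarchy.items()}
for name, total in totals.items():
    print(f"{name}: {total / MB:.0f} MB total")
```

Note that the aggregate L1 capacity (50 MB) exceeds L3: most of the chip's SRAM sits next to the crossbars, which is exactly why DMA matters.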
Formalism: Compute Engine Operations
Core compute-engine ops:
1. Bias addition:
Crossbar output y_raw = Wx. Add the bias vector: y = y_raw + b.
One add per element. 256 elements × 1 ns = 256 ns (or 256 ALUs in parallel → 1 ns).
2. Activation (ReLU): f(x) = max(0, x).
Hardware: a comparator plus a MUX per element. 256 in parallel → 1 ns.
3. Sigmoid / GELU (LUT):
Complex functions via Look-Up Table:
- 256 pre-computed entries.
- Input 8-bit → table index → output 8-bit.
- 1 clock cycle (1 ns).
Area: 256 × 8 bit = 2 kbit SRAM.
4. Softmax:
Hardware:
- 256 exp LUT lookups.
- One 256-element sum (log_2 256 = 8-level adder tree).
- 256 divisions (via a reciprocal LUT).
- Time: ~20 ns. Frequent in Transformer attention.
5. Layer norm:
- Mean μ = (1/256) Σ x_i: a 256-element sum, then one division.
- Std σ = sqrt((1/256) Σ (x_i - μ)²): sum of squared deviations, divide by 256, square root.
- Normalize: x̂_i = (x_i - μ) / σ.
- Time: ~50 ns.
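The LUT activation from step 3 can be sketched in software. The text fixes only the table size (256 entries, 8-bit in/out); the encoding here (input byte mapped linearly onto [-8, 8), output mapped onto [0, 255]) is an assumption for illustration:

```python
import math

# 256-entry sigmoid LUT, as a single-cycle hardware table would hold it.
# ASSUMED encoding: signed input spans [-8, 8); output scales [0,1] to [0,255].
SIGMOID_LUT = [
    round(255 / (1 + math.exp(-((i - 128) * 16 / 256))))
    for i in range(256)
]

def sigmoid_lut(x_u8: int) -> int:
    """One-cycle activation: the input byte is nothing but a table index."""
    return SIGMOID_LUT[x_u8 & 0xFF]

print(sigmoid_lut(128))  # midpoint input (x = 0) → 128
```

Swapping in a GELU or exp table is just a different pre-computed list; the lookup hardware stays identical, which is the whole appeal of the approach.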
DMA (Direct Memory Access):
Job: copy data from one memory region to another. No CPU — DMA controller does it alone.
Typical Y1 DMA flow:
- CPU tells DMA controller “copy N bytes from A to B.”
- DMA controller handles the SRAM-to-SRAM copy.
- Completion → interrupt or flag → CPU knows.
DMA bandwidth: 10 GB/s per cluster → 160 GB/s total for Y1.
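A quick timing model follows from those bandwidth figures (function and constant names are illustrative):

```python
# DMA timing from the per-cluster bandwidth quoted above.
PER_CLUSTER_BW = 10e9   # bytes/s per cluster DMA engine
N_CLUSTERS = 16         # 16 clusters -> 160 GB/s aggregate

def dma_time_us(n_bytes: int, clusters: int = 1) -> float:
    """Time to move n_bytes using `clusters` DMA engines in parallel."""
    return n_bytes / (PER_CLUSTER_BW * clusters) * 1e6

# Moving a 1 MB activation block within one cluster: ~100 us on one engine.
print(f"{dma_time_us(1 << 20):.1f} us")
```

At ~100 µs for 1 MB, a single DMA engine is slow relative to a 15 ns MVM, which previews the "minimize data movement" design priority below.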
Inference dataflow example (GPT-2 1 token):
1. CPU pushes input token from PCIe to L3 SRAM (4 KB).
2. DMA L3 → L2 (active cluster).
3. DMA L2 → L1 (active CU).
4. CU crossbar MVM (attention Q).
5. Compute engine scale, softmax.
6. Result L1 → L2 → L3 (DMA).
7. Next layer, repeat.
Each layer: ~1 µs (MVM + compute + DMA).
12 layers × 1 µs = 12 µs / token.
Section 5.4 estimated 1.4 µs/token, but that was pure MVM-core time. This figure is more realistic because it includes DMA and compute-engine overhead.
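The ~1 µs/layer figure can be decomposed into the seven steps above. The individual step times here are illustrative round numbers chosen to be consistent with the quoted total; the text itself only fixes the MVM (15 ns) and the ~1 µs sum:

```python
# Toy per-layer latency model for the 7-step dataflow.
# Step splits are ASSUMED round numbers; only the total matches the text.
step_ns = {
    "dma_l3_to_l2": 200,      # stage input into the active cluster
    "dma_l2_to_l1": 100,      # stage into the active CU
    "mvm": 15,                # crossbar matrix-vector multiply
    "compute": 35,            # scale, softmax, etc. in the compute engine
    "dma_result_back": 650,   # L1 -> L2 -> L3 result writeback
}

layer_ns = sum(step_ns.values())   # ~1000 ns per layer
token_us = 12 * layer_ns / 1000    # 12 GPT-2 layers
print(f"{token_us:.0f} us/token")
```

The breakdown makes the point numerically: under these assumptions roughly 95% of the layer time is data movement, not arithmetic.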
Data-movement energy:
DMA transfer: L1 → L2 ~1 pJ/byte. L2 → L3 ~5 pJ/byte. L3 → PCIe ~20 pJ/byte.
1 MB intra-chip DMA = 10^6 bytes × 5 pJ/byte = 5 µJ. A small share of total inference energy.
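The per-hop costs above fold into a one-line energy calculator (dictionary keys are illustrative labels):

```python
# Energy cost of moving data through the hierarchy, in pJ/byte (from the text).
E_PJ_PER_BYTE = {"L1->L2": 1, "L2->L3": 5, "L3->PCIe": 20}

def move_energy_uj(n_bytes: int, hop: str) -> float:
    """Energy in microjoules to move n_bytes across one hop."""
    return n_bytes * E_PJ_PER_BYTE[hop] * 1e-6  # pJ -> uJ

print(f"{move_energy_uj(10**6, 'L2->L3'):.1f} uJ")  # 1 MB, one L2->L3 hop
```

The 20x spread between an on-chip hop and the PCIe hop is the quantitative reason to keep activations on-chip whenever the SRAM hierarchy allows it.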
Compute engine area + power (Y1):
One compute engine per CU:
- 256 parallel ALUs: ~50K transistors.
- LUT SRAM (10 KB): ~100K transistors.
- DMA controller: ~20K transistors.
- Total per CU: ~200K transistors ≈ 0.05 mm² at 28 nm.
Y1 has 400 CUs × 0.05 mm² = 20 mm² for compute engines. ~20% of the die.
Power:
Compute engine activity ~50% during inference. Per CU 750 µW → total Y1 300 mW. 10% of TDP. Efficient.
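The power budget checks out arithmetically (TDP value taken from the 3 W figure used elsewhere in the chapter):

```python
# Compute-engine power budget for Y1, from the figures above.
n_cu = 400           # compute engines (one per CU)
per_cu_w = 750e-6    # 750 uW each at ~50% inference activity
tdp_w = 3.0          # chip TDP assumed from the chapter's 3 W figure

total_w = n_cu * per_cu_w
share = total_w / tdp_w
print(f"{total_w * 1e3:.0f} mW total, {share:.0%} of TDP")
```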
DMA overhead:
Typical layer: MVM 15 ns + compute 10 ns + DMA 50 ns = 75 ns. DMA can dominate. Design priority: minimize data movement.
Fused operations:
Compile-time optimization: merge bias + ReLU → single compute-engine cycle. Standard technique in modern AI compilers.
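In software terms, fusion means one pass over the vector instead of two, which is easy to sketch (function name is illustrative):

```python
# Operator fusion sketch: bias add and ReLU in a single pass, so the
# intermediate (y + b) is never written to SRAM and re-read.
def bias_relu_fused(y, b):
    """One read, one add, one compare per element - a single engine cycle
    per lane instead of two separate passes."""
    return [max(0, yi + bi) for yi, bi in zip(y, b)]

print(bias_relu_fused([-3, 2, 5], [1, -4, 0]))  # [0, 0, 5]
```

The hardware analogue is wiring the adder's output straight into the ReLU comparator, so the fused pair costs the same cycle as the bias add alone.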
On-chip cache hierarchy:
L1 SRAM (128 KB) → active layer output. L2 SRAM (2 MB cluster) → 2-3 layers of history. L3 SRAM (16 MB chip) → large buffer (Transformer KV cache).
KV cache: 12 layers × 768 dims × 2 bytes ≈ 18 KB/token. 1024 tokens ≈ 18 MB. That doesn't fit Y1's 16 MB L3! The overflow spills to a temporary buffer in DRAM.
That’s a Y1 limit on long context. Y10+ will add 1 GB HBM.
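The KV-cache overflow is a two-line arithmetic check using the figures above:

```python
# KV-cache sizing for GPT-2 on Y1, per the figures in the text.
layers, dim, bytes_per_elem = 12, 768, 2
context_tokens = 1024
l3_mb = 16.0

per_token_kb = layers * dim * bytes_per_elem / 1000   # ~18 KB/token
cache_mb = per_token_kb * context_tokens / 1000       # full-context cache
print(f"{per_token_kb:.1f} KB/token, {cache_mb:.1f} MB total, "
      f"fits L3: {cache_mb <= l3_mb}")
```

At a 1024-token context the cache alone exceeds the whole L3, before any activations or staging buffers, which is why long context is a Y1 limit rather than a tuning problem.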
Transformer attention fused:
Attention = softmax(Q · K^T / √d) · V.
Fused: 3 MVMs + 1 softmax + 1 MVM = 5 ops. The compute engine orchestrates.
Typical 1 attention head: ~200 ns.
Experiment: GPT-2 Layer Inference Timing
Single GPT-2 layer (attention + FFN):
Attention (768-dim, 12 heads):
- Q = W_Q · x: 9 crossbars parallel MVM, 15 ns.
- K = W_K · x: 9 crossbars × 15 ns.
- V = W_V · x: 9 crossbars × 15 ns.
- Link: DMA 5 ns.
- Q · K^T: matrix-matrix, ~100 ns (64-dim per head × 12 heads).
- Softmax: compute engine 20 ns.
- · V: 100 ns.
- Project through W_O: 9 crossbars × 15 ns.
Attention total: ~300 ns.
FFN:
- W1 · x: 36 crossbars (768 × 3072), parallel → ~50 ns.
- GELU: compute engine 10 ns.
- W2 · output: 36 crossbars parallel → ~50 ns.
FFN total: ~110 ns.
Layer total: ~410 ns.
12 layers × 410 ns = 4.9 µs / token. Practical GPT-2 inference.
The earlier 1.4 µs estimate from Section 5.4 was a theoretical ideal (MVM only); this figure is realistic, including DMA, compute, and attention overhead.
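The timing budget above can be reproduced as a small model. The listed attention steps sum to ~255 ns; the text rounds attention to ~300 ns, which we take as including crossbar synchronization overhead:

```python
# Per-layer GPT-2 timing on Y1, reproducing the breakdown above (ns).
attention_ns = {
    "qkv_mvm": 15,     # Q, K, V projections run on parallel crossbars
    "dma_link": 5,     # move results between crossbar groups
    "qk_matmul": 100,  # Q . K^T, 64-dim per head x 12 heads
    "softmax": 20,     # compute engine, exp-LUT based
    "attn_v": 100,     # attention weights . V
    "out_proj": 15,    # W_O projection
}
ffn_ns = {"w1_mvm": 50, "gelu": 10, "w2_mvm": 50}

attn_total = sum(attention_ns.values())  # ~255 ns; text rounds to ~300
ffn_total = sum(ffn_ns.values())         # 110 ns
layer_ns = 300 + ffn_total               # use the rounded attention figure
token_us = 12 * layer_ns / 1000
print(f"layer ~{layer_ns} ns, token ~{token_us:.2f} us")
```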
Energy:
- MVM: 12 × (9+9+9+9+36+36) crossbars × 26 pJ = 33 nJ.
- Compute engine: 12 × 200 ns × 300 mW = 720 nJ.
- DMA: ~10 nJ.
- Total: ~760 nJ/token.
GPT-2 1000 tokens: 760 µJ. At 3 W TDP, ~250 µs wall-clock (batch of one).
More aggressive: 16 clusters in parallel → 16 tokens/step → 1000 tokens = 16 ms. Fast + efficient.
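The per-token energy figures above follow from three terms, which a short script can verify:

```python
# Energy per GPT-2 token on Y1, from the figures above.
crossbars_per_layer = 9 + 9 + 9 + 9 + 36 + 36      # Q, K, V, W_O, W1, W2

mvm_nj = 12 * crossbars_per_layer * 26e-3           # 26 pJ per crossbar MVM
compute_nj = 12 * 200e-9 * 0.3 * 1e9                # 200 ns/layer at 300 mW
dma_nj = 10.0                                       # rough DMA figure from text

total_nj = mvm_nj + compute_nj + dma_nj
print(f"MVM {mvm_nj:.0f} + compute {compute_nj:.0f} + DMA {dma_nj:.0f} "
      f"= {total_nj:.0f} nJ/token")
```

Note the split: the digital compute engine, not the analog MVM, dominates the energy budget, because it runs 12 × 200 ns per token at full power while each MVM lasts only 15 ns.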
Quick Quiz
Lab Exercise
Whisper-tiny (39M params, speech recognition) inference on SIDRA Y1.
Model structure:
- Encoder: 4 transformer layers, 384-dim.
- Decoder: 4 transformer layers, 384-dim.
Compute parameters:
- Per-layer MVM: ~400K MAC.
- Bias, LayerNorm, softmax add ~10%.
- KV cache: 4 layers × 384 × 2 byte = 3 kB/token.
Questions:
(a) Per-layer inference time? (b) 1-second audio (100 tokens) total time? (c) KV cache SRAM requirement? (d) Compute engine / crossbar time ratio? (e) Energy estimate for an edge device (3W TDP)?
Solutions
(a) MVM: 400K MAC / 4.4 TOPS per crossbar ≈ 100 ns. Add 50 ns compute + 50 ns DMA → ~200 ns / layer.
(b) 8 layers × 100 tokens × 200 ns = 160 µs. Fast! A second of audio processes in 160 µs.
(c) 100 tokens × 3 kB = 300 kB. Far smaller than Y1's L3 (16 MB); it fits comfortably in a single cluster's L2 (2 MB).
(d) Compute 50 / (100 + 50 + 50) = 25%. Crossbar 50%. DMA 25%. Balanced.
(e) 3 W × 160 µs ≈ 0.5 mJ per second of audio. A full year of continuous speech (~3.2 × 10^7 s) costs only ~16 kJ, so an edge device with a modest battery can run always-on recognition around the clock.
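Parts (a)-(c) of the lab reduce to the following arithmetic (the 4.4 TOPS crossbar throughput is taken from the solution above):

```python
# Whisper-tiny on Y1: worked numbers for lab parts (a)-(c).
macs_per_layer = 400e3
crossbar_macs_per_s = 4.4e12                      # per-crossbar throughput

mvm_ns = macs_per_layer / crossbar_macs_per_s * 1e9   # ~91 ns, rounded to 100
layer_ns = 100 + 50 + 50                              # MVM + compute + DMA
total_us = 8 * 100 * layer_ns / 1000                  # 8 layers x 100 tokens
kv_cache_kb = 100 * 3                                 # 100 tokens x 3 kB/token

print(f"layer ~{layer_ns} ns, 1 s of audio ~{total_us:.0f} us, "
      f"KV cache {kv_cache_kb} kB")
```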
Real product potential: a SIDRA Y1-based smart assistant (smart speaker) running always-on speech recognition on a 24-hour battery. A 700 W H100 cannot serve this role.
Cheat Sheet
- Compute engine: CMOS digital, post-crossbar. Bias, activation, norm, scale.
- ALU, LUT, scaler: main components.
- DMA: moves data between memory regions, CPU-free.
- Y1 compute area: ~20% die. Power 10% TDP.
- Layer time: ~400 ns for GPT-2 (attention ~300 ns + FFN ~110 ns, MVM + compute + DMA included).
- Fused ops: compile-time optimized, speed + energy.
Vision: Future Compute Engine
- Y3: RISC-V core for the compute engine (flexible control flow).
- Y10: “soft” compute engine — FPGA-style programmable logic. Layer-specific.
- Y100: Fully analog post-MVM — even activation is analog. No CMOS needed.
- Y1000: Fully analog + photonic compute. CMOS retired.
For Türkiye: compute-engine design builds on VLSI engineering depth. ASELSAN, Siemens Turkey, etc., have strong experience. SIDRA channels that into neuromorphic AI.
Further Reading
- Next chapter: 5.10 — Noise Models
- Previous: 5.8 — MUX, Decoder, Analog ECC
- Compute engine design: Hennessy & Patterson, Computer Organization and Design, 6th ed.
- DMA: classical Intel 8237 chipset reference.
- Modern AI dataflow: Jouppi et al., TPU ISCA 2017.