🔌 Module 5 · Chip Hardware · Chapter 5.9 · 10 min read

Compute Engine and DMA

Everything outside the crossbar — activation, bias, data movement.

What you'll learn here

  • Identify the compute engine's role outside the crossbar (activation, bias, normalization)
  • Explain DMA (Direct Memory Access) and SIDRA's dataflow
  • Describe the LUT (Look-Up Table) implementation of activation functions
  • Budget compute-engine power and area for Y1
  • Describe how inter-layer data flows through the routing matrix

Hook: What Happens After the MVM?

A crossbar completes an MVM in 15 ns. But AI models are not just MVMs — every layer also needs extra operations:

  • Bias add: y = Wx + b. Add vector b to Wx.
  • Activation function: ReLU, sigmoid, GELU, softmax. Non-linearity.
  • Layer normalization: measure mean/std, normalize.
  • Scale factor: for quantization.
  • Concat, split, reshape: tensor manipulation.

All of this runs on CMOS in the Compute Engine. The crossbar is the MVM engine; the compute engine is the everything-else engine.
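As a rough sketch (plain NumPy, illustrative names only — not Y1's actual datapath), the post-MVM work the compute engine performs on each output vector looks like:

```python
import numpy as np

def post_mvm(y_raw, bias, scale=1.0):
    """The 'everything-else engine': bias add, quantization rescale, ReLU."""
    y = (y_raw + bias) * scale   # bias add + scale factor
    return np.maximum(y, 0.0)    # ReLU non-linearity

y_raw = np.array([-2.0, 0.5, 3.0])   # crossbar MVM output (illustrative)
out = post_mvm(y_raw, bias=np.array([1.0, 1.0, -1.0]))
```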

DMA (Direct Memory Access): moves data across clusters/CUs/crossbars without the CPU.

Intuition: The CMOS Sidekick

In each CU:

  • 16 crossbars (analog MVM engine).
  • 1 compute engine (CMOS digital).

The compute engine has:

  • ALU (Arithmetic Logic Unit): 32-bit integer/float add, multiply, bit-shift.
  • Activation LUT: 256-entry table (ReLU, sigmoid, GELU). Single-cycle transform.
  • Scaler: re-scales intermediate outputs (for INT8).
  • DMA controller: data transfers.
  • Small SRAM (32 KB): staging buffer.

Clock: 1 GHz. One compute engine per CU → 16 crossbars run in parallel; the compute engine processes outputs in sequence.

DMA:

Y1 memory hierarchy:

  • L3 SRAM (chip) 16 MB
  • L2 SRAM (cluster) 2 MB × 16 = 32 MB
  • L1 SRAM (CU) 128 KB × 400 = 50 MB

DMA moves data across layers without CPU instructions. One DMA controller per cluster.

Formalism: Compute Engine Operations

L1 · Beginner

Core compute-engine ops:

1. Bias addition:

Crossbar output: \mathbf{y}_{\text{raw}} = \mathbf{W}\mathbf{x}. Add bias: \mathbf{y} = \mathbf{y}_{\text{raw}} + \mathbf{b}

One add per element. 256 elements × 1 ns = 256 ns (or 256 ALUs in parallel → 1 ns).

2. Activation (ReLU):

\text{ReLU}(x) = \max(0, x)

Hardware: simple comparator + MUX. 256 in parallel → 1 ns.

3. Sigmoid / GELU (LUT):

Complex functions via Look-Up Table:

  • 256 pre-computed entries.
  • Input 8-bit → table index → output 8-bit.
  • 1 clock cycle (1 ns).

Area: 256 × 8 bit = 2 kbit SRAM.
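A minimal model of the LUT approach, assuming signed 8-bit inputs mapped linearly to the range [−8, 8) — the range and encoding are illustrative, not Y1's actual number format:

```python
import numpy as np

# Hypothetical encoding: 8-bit code 0..255 maps linearly to x in [-8, 8).
codes = np.arange(256)
x = (codes - 128) * (16.0 / 256)
SIGMOID_LUT = np.round(255.0 / (1.0 + np.exp(-x))).astype(np.uint8)  # 256 x 8 bit = 2 kbit

def sigmoid_lut(code):
    """One table read per element -> single-cycle activation."""
    return SIGMOID_LUT[code]
```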

4. Softmax:

\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}

Hardware:

  • 256 exp LUT lookups.
  • One 256-input sum (an 8-level adder tree, since log₂ 256 = 8).
  • 256 divisions by the sum (a reciprocal LUT plus a multiply).
  • Time: ~20 ns. Frequent in Transformer attention.
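The same pipeline sketched in NumPy (the max-subtraction is the standard numerical-stability trick; whether Y1's exp LUT applies it is an assumption):

```python
import numpy as np

def hw_softmax(x):
    e = np.exp(x - x.max())  # per-element exp LUT lookups (max-subtract for stability)
    s = e.sum()              # adder tree: 8 levels for 256 inputs
    return e * (1.0 / s)     # reciprocal LUT + multiply, instead of 256 true divisions

p = hw_softmax(np.array([1.0, 2.0, 3.0]))
```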

5. Layer norm:

y = (x - \mu) / \sigma

  • μ: sum the 256 elements, then divide by 256 (a right-shift by 8 in hardware).
  • σ: sum (x − μ)², divide by 256, then take a square root.
  • Time: ~50 ns.
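A floating-point sketch of the two-pass computation (the eps guard against division by zero is a standard addition, assumed here):

```python
import numpy as np

def hw_layernorm(x, eps=1e-6):
    mu = x.sum() / x.size                  # pass 1: mean (divide-by-256 = shift)
    var = ((x - mu) ** 2).sum() / x.size   # pass 2: variance
    return (x - mu) / np.sqrt(var + eps)   # normalize

y = hw_layernorm(np.arange(256, dtype=np.float64))
```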
L2 · Complete

DMA (Direct Memory Access):

Job: copy data from one memory region to another. No CPU — DMA controller does it alone.

Typical Y1 DMA flow:

  1. CPU tells DMA controller “copy N bytes from A to B.”
  2. DMA controller handles the SRAM-to-SRAM copy.
  3. Completion → interrupt or flag → CPU knows.
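The three steps can be modeled in a few lines — `dma_copy` is a hypothetical name standing in for the descriptor write and the controller's work:

```python
def dma_copy(mem, src, dst, n):
    """Descriptor says: copy n bytes from offset src to offset dst.
    The CPU only writes this descriptor; the controller does the copy."""
    mem[dst:dst + n] = mem[src:src + n]
    return True  # completion flag (would raise an interrupt in hardware)

mem = bytearray(b"hello---world---")   # toy SRAM image
done = dma_copy(mem, src=0, dst=8, n=5)
```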

DMA bandwidth: 10 GB/s per cluster → 160 GB/s total for Y1.

Inference dataflow example (GPT-2 1 token):

1. CPU pushes input token from PCIe to L3 SRAM (4 KB).
2. DMA L3 → L2 (active cluster).
3. DMA L2 → L1 (active CU).
4. CU crossbar MVM (attention Q).
5. Compute engine scale, softmax.
6. Result L1 → L2 → L3 (DMA).
7. Next layer, repeat.

Each layer: ~1 µs (MVM + compute + DMA).

12 layers × 1 µs = 12 µs / token.

Section 5.4 quoted 1.4 µs per token — that was the pure MVM-core time. This figure is more realistic because it includes DMA and compute-engine overhead.

Data-movement energy:

DMA transfer: L1 → L2 ~1 pJ/byte. L2 → L3 ~5 pJ/byte. L3 → PCIe ~20 pJ/byte.

1 MB of intra-chip DMA (L2 → L3) = 1M bytes × 5 pJ = 5 µJ — a small share of inference energy.
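These per-byte figures make the arithmetic one line of code (hop names are illustrative):

```python
PJ_PER_BYTE = {"L1_L2": 1, "L2_L3": 5, "L3_PCIe": 20}  # figures from the text

def transfer_energy_uj(n_bytes, hop):
    """Data-movement energy in microjoules for one hop."""
    return n_bytes * PJ_PER_BYTE[hop] * 1e-6  # pJ -> uJ

e = transfer_energy_uj(1_000_000, "L2_L3")  # 1 MB between L2 and L3
```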

L3 · Deep

Compute engine area + power (Y1):

One compute engine per CU:

  • 256 parallel ALUs: ~50K transistors.
  • LUT SRAM (10 KB): ~100K transistors.
  • DMA controller: ~20K transistors.
  • Total per CU: ~200K transistors ≈ 0.05 mm² at 28 nm.

Y1 has 400 CUs × 0.05 mm² = 20 mm² for compute engines. ~20% of the die.

Power:

Compute engine activity ~50% during inference. Per CU 750 µW → total Y1 300 mW. 10% of TDP. Efficient.

DMA overhead:

Typical layer: MVM 15 ns + compute 10 ns + DMA 50 ns = 75 ns. DMA can dominate. Design priority: minimize data movement.

Fused operations:

Compile-time optimization: merge bias + ReLU → single compute-engine cycle. Standard technique in modern AI compilers.
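The fusion is easy to see in code: the unfused version makes two read-modify-write passes over the vector, the fused version one (illustrative NumPy, not compiler output):

```python
import numpy as np

def bias_relu_unfused(y, b):
    t = y + b                   # pass 1: read y, write intermediate back
    return np.maximum(t, 0.0)   # pass 2: read intermediate, write result

def bias_relu_fused(y, b):
    return np.maximum(y + b, 0.0)  # one pass: single read, single write

y = np.array([-3.0, 2.0, 0.5])
b = np.array([1.0, -1.0, 1.0])
out = bias_relu_fused(y, b)
```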

On-chip cache hierarchy:

L1 SRAM (128 KB) → active layer output. L2 SRAM (2 MB cluster) → 2-3 layers of history. L3 SRAM (16 MB chip) → large buffer (Transformer KV cache).

KV cache: 12 layers × 768 dim × 2 bytes ≈ 18 KB/token. 1024 tokens ≈ 18 MB — that doesn't fit in Y1's 16 MB L3, so the overflow spills to external DRAM through a temporary buffer.

That’s a Y1 limit on long context. Y10+ will add 1 GB HBM.
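The sizing arithmetic, following the text's per-token convention (layers × dim × 2 bytes):

```python
def kv_cache_bytes(n_layers, d_model, n_tokens, bytes_per_token_elem=2):
    """Per-token KV footprint following the text: layers x dim x 2 bytes."""
    return n_layers * d_model * bytes_per_token_elem * n_tokens

gpt2_1024 = kv_cache_bytes(n_layers=12, d_model=768, n_tokens=1024)  # ~18 MB
```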

Transformer attention fused:

Attention = softmax(Q · Kᵀ / √d) · V.

Fused: 3 MVMs + 1 softmax + 1 MVM = 5 ops. The compute engine orchestrates.

Typical 1 attention head: ~200 ns.

Experiment: GPT-2 Layer Inference Timing

Single GPT-2 layer (attention + FFN):

Attention (768-dim, 12 heads):

  1. Q = W_Q · x: 9 crossbars parallel MVM, 15 ns.
  2. K = W_K · x: 9 crossbars × 15 ns.
  3. V = W_V · x: 9 crossbars × 15 ns.
  4. Link: DMA 5 ns.
  5. Q · K^T: matrix-matrix, ~100 ns (64-dim per head × 12 heads).
  6. Softmax: compute engine 20 ns.
  7. · V: 100 ns.
  8. Project through W_O: 9 crossbars × 15 ns.

Attention total: ~300 ns.

FFN:

  1. W1 · x: 36 crossbars (768 × 3072), parallel → ~50 ns.
  2. GELU: compute engine 10 ns.
  3. W2 · output: 36 crossbars parallel → ~50 ns.

FFN total: ~110 ns.

Layer total: ~410 ns.

12 layers × 410 ns = 4.9 µs / token. Practical GPT-2 inference.
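Summing the component latencies above reproduces the layer estimate (the model lands slightly under ~410 ns; the remainder is inter-step overhead):

```python
# Component latencies from the walkthrough above, in nanoseconds.
attn = {"Q": 15, "K": 15, "V": 15, "dma": 5, "QKt": 100, "softmax": 20, "AV": 100, "proj": 15}
ffn = {"W1": 50, "GELU": 10, "W2": 50}

layer_ns = sum(attn.values()) + sum(ffn.values())  # per-layer latency
token_us = 12 * layer_ns / 1000                    # 12 layers, ns -> us
```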

Previous 5.4 estimate was 1.4 µs (theoretical ideal, MVM only). This is realistic (DMA + compute + attention overhead).

Energy:

  • MVM: 12 × (9+9+9+9+36+36) crossbars × 26 pJ = 33 nJ.
  • Compute engine: 12 × 200 ns × 300 mW = 720 nJ.
  • DMA: ~10 nJ.
  • Total: ~760 nJ/token.

GPT-2 1000 tokens: 760 µJ. At 3 W TDP, ~250 µs wall-clock (batch of one).

More aggressive: run the 16 clusters in parallel on a batch of 16 sequences — 16 tokens per step, roughly a 16× throughput gain. Fast and efficient.

Quick Quiz

1/6 · Compute engine vs crossbar?

Lab Exercise

Whisper-tiny (39M params, speech recognition) inference on SIDRA Y1.

Model structure:

  • Encoder: 4 transformer layers, 384-dim.
  • Decoder: 4 transformer layers, 384-dim.

Compute parameters:

  • Per-layer MVM: ~400K MAC.
  • Bias, LayerNorm, softmax add ~10%.
  • KV cache: 4 layers × 384 × 2 byte = 3 kB/token.

Questions:

(a) Per-layer inference time? (b) 1-second audio (100 tokens) total time? (c) KV cache SRAM requirement? (d) Compute engine / crossbar time ratio? (e) Energy estimate for an edge device (3W TDP)?

Solutions

(a) MVM: 400K MAC / 4.4 TOPS per crossbar ≈ 100 ns. Add ~50 ns compute engine and ~50 ns DMA → ~200 ns per layer.

(b) 8 layers × 100 tokens × 200 ns = 160 µs. Fast! A second of audio processes in 160 µs.

(c) 100 tokens × 3 kB = 300 kB. Far smaller than Y1 L3 (16 MB). Fits in cluster L1/L2.

(d) Compute 50 / (100 + 50 + 50) = 25%. Crossbar 50%. DMA 25%. Balanced.

(e) 3 W × 160 µs ≈ 0.5 mJ per second of audio — an average power of only ~0.5 mW for real-time transcription. A billion seconds (~32 years) of continuous speech would cost only ~0.5 MJ, so an edge device can run always-on recognition around the clock on a small battery.
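The solution arithmetic in one runnable snippet (all inputs are the figures given in the exercise):

```python
# All inputs are the figures given in the exercise.
mvm_ns, compute_ns, dma_ns = 100, 50, 50
layer_ns = mvm_ns + compute_ns + dma_ns          # (a) per-layer time
total_us = 8 * 100 * layer_ns / 1000             # (b) 8 layers x 100 tokens
kv_kb = 100 * 3                                  # (c) KV cache for 100 tokens
compute_share = compute_ns / layer_ns            # (d) compute-engine share
energy_mj = 3.0 * total_us * 1e-6 * 1000         # (e) 3 W x wall-clock, in mJ
```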

Real product: a SIDRA Y1-based smart assistant (smart speaker) with 24-hour battery for always-on speech. H100 can’t do this (700 W).

Cheat Sheet

  • Compute engine: CMOS digital, post-crossbar. Bias, activation, norm, scale.
  • ALU, LUT, scaler: main components.
  • DMA: moves data between memory regions, CPU-free.
  • Y1 compute area: ~20% die. Power 10% TDP.
  • Layer time: ~400 ns (MVM 50 + compute 50 + DMA 50 + overhead).
  • Fused ops: compile-time optimized, speed + energy.

Vision: Future Compute Engine

  • Y3: RISC-V core for the compute engine (flexible control flow).
  • Y10: “soft” compute engine — FPGA-style programmable logic. Layer-specific.
  • Y100: Fully analog post-MVM — even activation is analog. No CMOS needed.
  • Y1000: Fully analog + photonic compute. CMOS retired.

For Türkiye: compute-engine design builds on VLSI engineering depth. ASELSAN, Siemens Turkey, etc., have strong experience. SIDRA channels that into neuromorphic AI.

Further Reading