🔌 Module 5 · Chip Hardware · Chapter 5.4 · 14 min read

YILDIRIM Chip Architecture

SIDRA's first-generation chip spec — a full architecture map of Y1.

What you'll learn here

  • Sketch YILDIRIM Y1's four-level hierarchy (crossbar → CU → cluster → chip)
  • List the functional components at each level (ADC, DAC, compute engine, DMA)
  • Break down the Y1 chip's power, area, and throughput budgets
  • Explain the CPU-SIDRA hybrid system architecture (PCIe link)
  • Summarize the Y1→Y100 roadmap at a high level

Hook: YILDIRIM at a Glance

YILDIRIM is the first-generation neuromorphic AI chip from SIDRA SEMICONDUCTOR. The Y1 product spec:

  • Die area: ~100 mm² (10 mm × 10 mm).
  • Process: 28 nm CMOS substrate + HfO₂ BEOL memristor.
  • Transistors: ~4 billion (28 nm CMOS).
  • Memristors: 419 million.
  • Weight capacity: 419 MB (8 bits per cell).
  • Throughput: 30 TOPS analog.
  • TDP: 3 W.
  • Interface: PCIe 5.0 × 4 lanes (16 GB/s).

The chip doesn’t do inference alone — it runs hybrid with a CPU. The CPU handles control flow + non-MVM; YILDIRIM handles MVMs. This chapter walks the architecture end-to-end.

Intuition: A 4-Level Hierarchy

YILDIRIM Y1 is physically organized in four hierarchy levels:

Level 1: CROSSBAR (256×256 = 65K cells)
    ↓ × 16
Level 2: COMPUTE UNIT (1M cells + ADC/DAC + local control)
    ↓ × 25
Level 3: CLUSTER (25 CUs = 25M cells + L2 memory + DMA)
    ↓ × 16
Level 4: CHIP (16 Clusters = 419M cells + PCIe + L3 memory)

Top-down:

| Level    | Count        | Total cells | Additional components                                                |
|----------|--------------|-------------|----------------------------------------------------------------------|
| Crossbar | 1            | 65,536      | WL/BL drivers, cell matrix                                           |
| CU       | 16 crossbars | 1,048,576   | 256 ADCs, 256 DACs, compute engine, 128 KB local SRAM                |
| Cluster  | 25 CUs       | 26,214,400  | DMA, routing matrix, 2 MB L2 SRAM                                    |
| Chip     | 16 clusters  | 419,430,400 | PCIe controller, 16 MB L3 SRAM, clock tree, power management         |
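The cell counts multiply straight down the hierarchy; a quick arithmetic check in plain Python, using only the figures from the table above:

```python
# Y1 hierarchy: cells at each level (figures from the table above)
crossbar = 256 * 256          # 65,536 cells per crossbar
cu = crossbar * 16            # 16 crossbars per compute unit
cluster = cu * 25             # 25 CUs per cluster
chip = cluster * 16           # 16 clusters per chip

assert chip == 419_430_400    # the 419M-memristor figure
# 8 bits (1 byte) per cell gives the weight capacity in decimal MB:
print(f"{chip / 1e6:.0f} MB")  # → 419 MB
```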

Why hierarchical?

  1. Local compute — minimize communication distance.
  2. Parallel execution — different layers can run different models.
  3. Power management — turn CUs/clusters on/off.
  4. Scalability — Y10 can scale the same design 10×.

Parallel throughput:

  • Crossbar: 4.4 TOPS (66.7M MVMs/s × 65,536 ops per MVM).
  • CU: 4.4 × 16 = 70 TOPS (16 crossbars in parallel).
  • Cluster: 70 × 25 = 1.76 POPS.
  • Chip: 1.76 × 16 = 28 POPS (theoretical analog peak).

In practice, ADC throughput and data movement are the bottleneck, which is why the real Y1 figure is 30 TOPS.
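The peak-throughput ladder is simple multiplication; a sketch that reproduces it, counting one op per cell per MVM as the 4.4 TOPS crossbar figure implies:

```python
mvm_rate = 1 / 15e-9               # ~66.7M MVMs/s per crossbar (15 ns each)
ops_per_mvm = 256 * 256            # one op per cell per MVM

xbar = mvm_rate * ops_per_mvm      # ~4.4e12 ops/s
chip = xbar * 16 * 25 * 16         # CU → cluster → chip fan-out
print(f"crossbar {xbar / 1e12:.1f} TOPS, chip {chip / 1e15:.0f} POPS")
# → crossbar 4.4 TOPS, chip 28 POPS
```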

Formalism: Y1 Chip Components in Detail

L1 · Basics

Crossbar level (detail in 5.3):

  • 256 × 256 memristors.
  • Local WL/BL drivers.
  • Function: single MVM, 15 ns.

Compute Unit (CU) components:

  1. 16 crossbars (parallel access).
  2. 256 DACs (8-bit, 0.5 V range).
  3. 256 ADCs (8-bit, ~1 pJ/conversion).
  4. Compute engine:
    • Activation functions (ReLU, sigmoid, softmax) via LUT.
    • Bias addition.
    • Scalar multiply (scale factor).
    • Layer-norm (mean/std compute).
  5. Local SRAM: 128 KB (intermediate activations).
  6. Control: state machine, crossbar sequencing.

CU per-MVM time: ~15 ns analog + 5 ns digital post-processing = 20 ns, i.e. 50M MVMs/s per CU.
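The 20 ns figure sets the per-CU rate directly:

```python
t_analog = 15e-9                    # crossbar MVM settling time
t_digital = 5e-9                    # compute-engine post-processing
rate = 1 / (t_analog + t_digital)   # MVMs per second per CU
print(f"{rate / 1e6:.0f}M MVMs/s")  # → 50M MVMs/s
```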

Cluster components:

  1. 25 CUs.
  2. DMA (Direct Memory Access): data in/out of the cluster.
  3. Routing matrix: data flow between the 25 CUs.
  4. L2 SRAM: 2 MB (model-weight cache, intermediate output storage).
  5. Power control: per-CU power gating.

Chip components:

  1. 16 Clusters.
  2. PCIe 5.0 controller: host CPU link.
  3. L3 SRAM: 16 MB (large intermediate outputs).
  4. Power management: voltage regulators, clock tree, DVFS.
  5. Test and calibration: crossbar calibration at every boot.
  6. Thermal sensors: temperature measured per cluster, throttling if needed.

L2 · Full

Y1 power budget (3 W TDP):

| Component      | Power share | Notes                                          |
|----------------|-------------|------------------------------------------------|
| Crossbar MVM   | ~0.5 W      | All 6400 crossbars @ 20% activity (sparsity)   |
| DAC            | ~0.8 W      | 400 CUs × 256 = 102K DACs @ 30% activity       |
| ADC            | ~1.0 W      | 400 CUs × 256 = 102K ADCs, ~1 pJ/conversion    |
| Compute engine | ~0.3 W      | Activation, bias, scale                        |
| SRAM + DMA     | ~0.2 W      | Memory access                                  |
| PCIe + clock   | ~0.2 W      | Interface, clock tree                          |
| Total          | ~3.0 W      | TDP target                                     |
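A one-liner confirms the per-component shares sum to the TDP target:

```python
# Per-component budget in watts, as in the table above
budget_w = {
    "crossbar_mvm": 0.5, "dac": 0.8, "adc": 1.0,
    "compute_engine": 0.3, "sram_dma": 0.2, "pcie_clock": 0.2,
}
total = sum(budget_w.values())
assert abs(total - 3.0) < 1e-6     # sums to the 3 W TDP target
print(f"total {total:.1f} W")
```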

Area budget (100 mm² die):

| Component             | Area     | Fraction                           |
|-----------------------|----------|------------------------------------|
| Crossbar active       | ~4.2 mm² | 4.2% (6400 × 656 µm²)              |
| Crossbar peripheral   | ~20 mm²  | 20% (WL/BL drivers, local control) |
| ADC                   | ~25 mm²  | 25%                                |
| DAC                   | ~10 mm²  | 10%                                |
| Compute engine + SRAM | ~20 mm²  | 20%                                |
| PCIe, I/O             | ~15 mm²  | 15%                                |
| Spacing + routing     | ~5 mm²   | 5%                                 |
| Total                 | 100 mm²  | -                                  |

ADC dominates area — a typical analog AI chip problem. Y10 target: bring ADC area to 10% (TDC technology, chapter 5.6).

Clock speed:

  • CMOS substrate (control, compute engine): 1 GHz.
  • Crossbar analog: asynchronous (no clock, settling-based).
  • PCIe 5.0: 32 GT/s link rate.

DVFS (Dynamic Voltage and Frequency Scaling):

Voltage/frequency tracks activity:

  • Idle: 100 MHz, 0.6 V → 100 mW.
  • Average: 500 MHz, 0.8 V → 1 W.
  • Peak: 1 GHz, 1 V → 3 W.
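Those three operating points are consistent with the simple dynamic-power model P ∝ V² · f. A back-of-envelope sketch, assuming the peak point (1 GHz, 1.0 V) dissipates the full 3 W dynamically and ignoring leakage:

```python
P_PEAK_W = 3.0                     # assumed fully dynamic at 1 GHz / 1.0 V
states = {"idle": (0.1e9, 0.6), "average": (0.5e9, 0.8), "peak": (1.0e9, 1.0)}

for name, (f_hz, v) in states.items():
    # Scale by V^2 * f relative to the peak operating point
    p = P_PEAK_W * (v ** 2 * f_hz) / (1.0 ** 2 * 1.0e9)
    print(f"{name:>7}: {p * 1000:4.0f} mW")
```

This reproduces roughly 108 mW idle and 960 mW average, matching the ~100 mW and ~1 W figures above.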

CPU-SIDRA interface:

A host CPU (e.g. Intel Xeon or AMD EPYC) connects over PCIe 5.0:

  1. CPU loads the model (weights programmed into the crossbars).
  2. CPU sends input data over PCIe.
  3. YILDIRIM runs MVMs.
  4. Output returns over PCIe.
  5. CPU handles non-MVM (softmax, tokenization, post-processing).

PCIe 5.0 bandwidth: 16 GB/s. Is that sufficient? A GPT-2 inference input of 512 tokens × 768 dims × 2 bytes = 0.8 MB transfers in about 50 µs at 16 GB/s, so the link is not the throughput bottleneck.
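The transfer-time estimate, spelled out:

```python
tokens, dim, bytes_per_val = 512, 768, 2
payload = tokens * dim * bytes_per_val      # 786,432 B ≈ 0.8 MB
link_bw = 16e9                              # PCIe 5.0 × 4 lanes, ~16 GB/s
print(f"{payload / 1e6:.2f} MB in {payload / link_bw * 1e6:.0f} us")
# → 0.79 MB in 49 us
```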

L3 · Deep

Dataflow (typical inference):

Host CPU (x86)
    ↓ PCIe 5.0 (16 GB/s)
YILDIRIM Chip:
    L3 SRAM (16 MB) — input buffer
        ↓ DMA
    L2 SRAM (2 MB × 16 clusters) — active-layer weight cache
        ↓
    L1 SRAM (128 KB × 400 CUs = 50 MB) — intermediate activations
        ↓
    Crossbar (419 MB) — persistent weights
        ↓
    Compute Engine — activation, bias
        ↓
    L3 SRAM — output buffer
        ↓ PCIe
Host CPU

Memory hierarchy (theoretical):

  • Memristor: 419 MB (fixed, written at program-time).
  • L3 SRAM: 16 MB (intermediate outputs, big buffers).
  • L2 SRAM: 2 MB × 16 = 32 MB.
  • L1 SRAM: 128 KB × 400 CU = 50 MB.

Total ~520 MB on-chip. Not enormous, but everything on-chip — no external DRAM. That’s SIDRA’s von Neumann-bypass claim.
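Summing the hierarchy gives 517 MB, slightly under the ~520 MB round figure:

```python
memristor = 419                    # MB, fixed weights
l3 = 16                            # MB
l2 = 2 * 16                        # 2 MB × 16 clusters
l1 = 128 / 1024 * 400              # 128 KB × 400 CUs = 50 MB
print(f"{memristor + l3 + l2 + l1:.0f} MB on-chip")  # → 517 MB
```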

Routing matrix:

Within a cluster, “routing” between the 25 CUs:

  • Every CU output can be forwarded to any other CU.
  • 25×25 routing matrix = 625 connection points.
  • Each connection bi-directional, 32-bit wide, 1 GHz → 4 GB/s per connection.
  • Total routing bandwidth: ~2.5 TB/s within a cluster.
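The aggregate figure follows from the per-link numbers:

```python
cus = 25
links = cus * cus                        # 625 connection points
per_link = 32 / 8 * 1e9                  # 32 bits wide @ 1 GHz = 4 GB/s
total = links * per_link                 # aggregate bytes/s in a cluster
print(f"{per_link / 1e9:.0f} GB/s per link, {total / 1e12:.1f} TB/s total")
# → 4 GB/s per link, 2.5 TB/s total
```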

Why this matters: in deep models, intermediate outputs flow from layer to layer, and the routing matrix carries that flow. YILDIRIM Y1 is a “graph-based” dataflow architecture, not a strictly sequential one.

Calibration and test (boot-time):

At chip power-up:

  1. Temperature read: 4 thermal sensors per cluster.
  2. Voltage calibration: DAC reference voltages tuned.
  3. Crossbar health check: 16 reference cells per crossbar read, offset computed.
  4. ECC prep: redundant cells and parity set.

Boot time: ~100 ms. One-time; doesn’t affect inference.

Tolerance and failure:

  • Per-crossbar 1% failed cells tolerated (ECC).
  • Per-cluster 1 CU failure tolerated (redundant mapping).
  • Chip-level 5% cell failure → still 95% accuracy.

Y1 production yield target: 70-80%. Failed chips ship as low-spec (mobile, IoT).

Experiment: Y1 Chip vs H100 GPU — an Inference Scenario

Scenario: BERT-base (110M parameters), 1000-sentence NLU inference.

NVIDIA H100:

  • Model: 110M × 2 byte = 220 MB.
  • DRAM-loaded: 220 MB / 3 TB/s = 73 µs (once).
  • Inference: 0.2 ms/sentence × 1000 = 200 ms.
  • Total: ~275 ms.
  • Energy: 700 W × 0.275 s = 192 J.

SIDRA Y1:

  • Model: loaded into the Y1 chip once (640 ms, pre-inference).
  • Inference: 1 ms/sentence × 1000 = 1 s.
  • Total: ~1 s.
  • Energy: 3 W × 1 s = 3 J.

Comparison:

  • H100 3.6× faster (latency).
  • SIDRA 64× more efficient (energy).
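Both ratios follow directly from the scenario figures:

```python
h100_t, h100_p = 0.275, 700        # seconds, watts (scenario above)
y1_t, y1_p = 1.0, 3

speedup = y1_t / h100_t            # H100 latency advantage
efficiency = (h100_p * h100_t) / (y1_p * y1_t)   # energy ratio
print(f"H100 {speedup:.1f}x faster, Y1 {efficiency:.0f}x more efficient")
# → H100 3.6x faster, Y1 64x more efficient
```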

Batch vs single:

  • H100 with batch 32 gains up to 32× throughput (~8.4 ms per 32-sentence batch).
  • SIDRA Y1 has no batch mode, but 32 sentences can run in parallel across CUs, keeping ~1 ms/sentence.

Conclusion:

  • Datacenter (many requests, batching): H100 fits better.
  • Edge/embedded (single device, energy-critical): SIDRA Y1 wins big.

Y10 target: 30 → 300 TOPS. Matches H100 inference (training excluded). Y100 beats H100 at inference.

Quick Quiz

1/6 · What is YILDIRIM Y1's four-level hierarchy?

Lab Exercise

Map GPT-2 small inference onto SIDRA Y1.

Model: GPT-2 small (124M parameters).

Questions:

(a) What fraction of Y1 cells does GPT-2 use?
(b) How many crossbars per attention block?
(c) How many crossbars per FFN?
(d) Where do all 12 blocks land across clusters?
(e) Per-token inference latency?
(f) 1000-token generation latency and energy?

Solutions

(a) 124M / 419M = 29.6%. About a third of Y1. 70% free for other models.

(b) Attention: 4 matrices (Q, K, V, O) × 768 × 768. Each 3×3 = 9 crossbars → 36 crossbars/attention.

(c) FFN: W1 (768×3072) = 3×12 = 36 crossbars + W2 (3072×768) = 36 crossbars = 72 crossbars/FFN.

(d) 12 blocks × (36 + 72) = 1296 crossbars. 1296 / (16 crossbars/CU) = 81 CUs. 81 / 25 ≈ 3.24 → 3-4 clusters, out of Y1's 16.

(e) Token inference: 12 blocks × (attention + FFN MVMs). Each MVM ~15 ns. Attention 6 MVMs sequential × 15 = 90 ns. FFN 2 MVMs × 15 = 30 ns. Per block ~120 ns. 12 blocks: ~1.4 µs/token.

(f) 1000 tokens × 1.4 µs = 1.4 ms. Energy: 3 W × 1.4 ms = 4 mJ. Laptop GPT-2 becomes possible.
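The whole mapping can be reproduced as a short script. The tiling rule is the one used in (b) and (c): each 256×256 crossbar holds one tile of a weight matrix, and latency follows the sequential-MVM model from (e):

```python
import math

XBAR = 256                                  # 256 × 256 crossbar tile

def tiles(rows, cols):
    """Crossbars needed to tile a rows × cols weight matrix."""
    return math.ceil(rows / XBAR) * math.ceil(cols / XBAR)

attn = 4 * tiles(768, 768)                  # Q, K, V, O → 36 crossbars
ffn = tiles(768, 3072) + tiles(3072, 768)   # W1 + W2 → 72 crossbars
total_xbars = 12 * (attn + ffn)             # 1296 crossbars
cus = math.ceil(total_xbars / 16)           # 81 CUs

t_token = 12 * (6 + 2) * 15e-9              # 8 sequential MVMs per block
print(total_xbars, cus, f"{t_token * 1e6:.2f} us/token")
# → 1296 81 1.44 us/token
```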

Compare: the same 1000-token generation on an H100 takes ~100 ms at 70 J. SIDRA is 70× faster and 17,000× more efficient. But the H100 can batch 32 sequences, so total throughput still favors the H100.

Cheat Sheet

  • 4-level hierarchy: Crossbar → CU → Cluster → Chip.
  • Y1: 6400 crossbars, 419M memristors, 100 mm², 3 W, 30 TOPS, PCIe 5.0 × 4.
  • Components: MVM crossbar, ADC/DAC, compute engine, L1/L2/L3 SRAM, DMA, PCIe.
  • Power budget: ADC ~33%, DAC ~27%, crossbar 17%, compute ~10%, memory/IO 13%.
  • Area: ADC 25%, periphery 20%, compute/SRAM 20%, IO 15%, crossbar 4.2%.
  • CPU hybrid: CPU control + non-MVM, YILDIRIM MVM.
  • Tolerance: 1% cell + 1 CU + 5% chip failure tolerated.

Vision: YILDIRIM Evolution Y1→Y10→Y100

Y1 (2026-2027):

  • 28 nm CMOS, 100 nm cell.
  • 419M memristors, 30 TOPS, 3 W.
  • Edge inference focus.

Y10 (2029-2030):

  • 14 nm CMOS, 70 nm cell.
  • 10B memristors, 300 TOPS, 30 W.
  • 1S1R 3D-stack begins.
  • TDC ADC technology.
  • Hybrid training (last layer).
  • Datacenter deployments.

Y100 (2031-2033):

  • 7 nm CMOS, 28 nm cell.
  • 100B memristors, 3 POPS, 100 W.
  • 1S1R 8-layer 3D.
  • Photonic interconnect (wafer level).
  • On-chip online learning (STDP).
  • GPT-3 inference on a single chip.

Y1000 (2035+):

  • 2D material (MoS₂) cell, 7 nm.
  • 1T memristor, optional superconducting.
  • 100× Y100 performance.
  • Bio-compatible organic generation prototype.

Strategic for Türkiye: a new generation every 2-3 years → by 2030, Türkiye is the third neuromorphic chip producer (after the US and China). The concrete face of semiconductor sovereignty.

Unexpected: different YILDIRIM variants for different markets:

  • YILDIRIM-mobile (low power, battery-device).
  • YILDIRIM-auto (autonomous vehicle, thermal).
  • YILDIRIM-medical (implant, bio-compatible).
  • YILDIRIM-space (radiation-hardened, satellite).

Further Reading

  • Next chapter: 5.5 — DAC (SAR + ISPP)
  • Previous: 5.3 — The Crossbar Array
  • Modern AI chip architectures: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).
  • Cerebras wafer-scale: Lie et al., Cerebras CS-2 Wafer-Scale System, HotChips 2022.
  • Compute-in-memory chips: Ambrogio et al., An analog-AI chip for energy-efficient deep learning inference, Nature 2023.