🔌 Module 5 · Chip Hardware · Chapter 5.4 · 14 min read

YILDIRIM Chip Architecture

SIDRA's first-generation chip spec — a full architecture map of Y1.

What you'll learn here

  • Sketch YILDIRIM Y1's four-level hierarchy (crossbar → CU → cluster → chip)
  • List the functional components at each level (ADC, DAC, compute engine, DMA)
  • Break down the Y1 chip's power, area, and throughput budgets
  • Explain the CPU-SIDRA hybrid system architecture (PCIe link)
  • Summarize the Y1→Y100 roadmap at a high level

Hook: YILDIRIM at a Glance

YILDIRIM is the first-generation neuromorphic AI chip from SIDRA SEMICONDUCTOR. The Y1 product spec:

  • Die area: ~100 mm² (10 mm × 10 mm).
  • Process: 28 nm CMOS substrate + HfO₂ BEOL memristor.
  • Transistors: ~4 billion (28 nm CMOS).
  • Memristors: 419 million.
  • Weight capacity: 419 MB (8 bits per cell).
  • Throughput: 30 TOPS analog.
  • TDP: 3 W.
  • Interface: PCIe 5.0 × 4 lanes (16 GB/s).

The chip doesn’t do inference alone — it runs hybrid with a CPU. The CPU handles control flow + non-MVM; YILDIRIM handles MVMs. This chapter walks the architecture end-to-end.

Intuition: A 4-Level Hierarchy

YILDIRIM Y1 is physically organized in four hierarchy levels:

Level 1: CROSSBAR (256×256 = 65K cells)
    ↓ × 16
Level 2: COMPUTE UNIT (1M cells + ADC/DAC + local control)
    ↓ × 25
Level 3: CLUSTER (25 CUs = 25M cells + L2 memory + DMA)
    ↓ × 16
Level 4: CHIP (16 Clusters = 419M cells + PCIe + L3 memory)

Top-down:

| Level    | Count        | Total cells | Additional components                                                |
|----------|--------------|-------------|----------------------------------------------------------------------|
| Crossbar | 1            | 65,536      | WL/BL drivers, cell matrix                                           |
| CU       | 16 crossbars | 1,048,576   | 256 ADCs, 256 DACs, compute engine, 128 KB local SRAM                |
| Cluster  | 25 CUs       | 26,214,400  | DMA, routing matrix, 2 MB L2 SRAM                                    |
| Chip     | 16 clusters  | 419,430,400 | PCIe controller, 16 MB L3 SRAM, clock tree, power management         |
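The cell counts multiply straight down the hierarchy; a quick arithmetic check in plain Python, using only the figures from the table above:

```python
# Y1 hierarchy: cells at each level (figures from the table above)
crossbar = 256 * 256          # 65,536 cells per crossbar
cu = crossbar * 16            # 16 crossbars per compute unit
cluster = cu * 25             # 25 CUs per cluster
chip = cluster * 16           # 16 clusters per chip

assert chip == 419_430_400    # the 419M-memristor figure
# 8 bits (1 byte) per cell gives the weight capacity in decimal MB:
print(f"{chip / 1e6:.0f} MB")  # → 419 MB
```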

Why hierarchical?

  1. Local compute — minimize communication distance.
  2. Parallel execution — different layers can run different models.
  3. Power management — turn CUs/clusters on/off.
  4. Scalability — Y10 can scale the same design 10×.

Parallel throughput:

  • Crossbar: 4.4 TOPS (66.7M MVMs/s × 65,536 ops per MVM).
  • CU: 4.4 × 16 = 70 TOPS (16 crossbars in parallel).
  • Cluster: 70 × 25 = 1.76 POPS.
  • Chip: 1.76 × 16 = 28 POPS (theoretical analog peak).

In practice, ADC throughput and data movement are the bottleneck, which is why the real Y1 figure is 30 TOPS.
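The peak-throughput ladder is simple multiplication; a sketch that reproduces it, counting one op per cell per MVM as the 4.4 TOPS crossbar figure implies:

```python
mvm_rate = 1 / 15e-9               # ~66.7M MVMs/s per crossbar (15 ns each)
ops_per_mvm = 256 * 256            # one op per cell per MVM

xbar = mvm_rate * ops_per_mvm      # ~4.4e12 ops/s
chip = xbar * 16 * 25 * 16         # CU → cluster → chip fan-out
print(f"crossbar {xbar / 1e12:.1f} TOPS, chip {chip / 1e15:.0f} POPS")
# → crossbar 4.4 TOPS, chip 28 POPS
```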

Formalism: Y1 Chip Components in Detail

L1 · Basics

Crossbar level (detail in 5.3):

  • 256 × 256 memristors.
  • Local WL/BL drivers.
  • Function: single MVM, 15 ns.

Compute Unit (CU) components:

  1. 16 crossbars (parallel access).
  2. 256 DACs (8-bit, 0.5 V range).
  3. 256 ADCs (8-bit, ~1 pJ/conversion).
  4. Compute engine:
    • Activation functions (ReLU, sigmoid, softmax) via LUT.
    • Bias addition.
    • Scalar multiply (scale factor).
    • Layer-norm (mean/std compute).
  5. Local SRAM: 128 KB (intermediate activations).
  6. Control: state machine, crossbar sequencing.

CU per-MVM time: ~15 ns analog + 5 ns digital post-processing = 20 ns, i.e. 50M MVMs/s per CU.
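The 20 ns figure sets the per-CU rate directly:

```python
t_analog = 15e-9                    # crossbar MVM settling time
t_digital = 5e-9                    # compute-engine post-processing
rate = 1 / (t_analog + t_digital)   # MVMs per second per CU
print(f"{rate / 1e6:.0f}M MVMs/s")  # → 50M MVMs/s
```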

Cluster components:

  1. 25 CUs.
  2. DMA (Direct Memory Access): data in/out of the cluster.
  3. Routing matrix: data flow between the 25 CUs.
  4. L2 SRAM: 2 MB (model-weight cache, intermediate output storage).
  5. Power control: per-CU power gating.

Chip components:

  1. 16 Clusters.
  2. PCIe 5.0 controller: host CPU link.
  3. L3 SRAM: 16 MB (large intermediate outputs).
  4. Power management: voltage regulators, clock tree, DVFS.
  5. Test and calibration: crossbar calibration at every boot.
  6. Thermal sensors: temperature measured per cluster, throttling if needed.

L2 · Full

Y1 power budget (3 W TDP):

| Component      | Power share | Notes                                          |
|----------------|-------------|------------------------------------------------|
| Crossbar MVM   | ~0.5 W      | All 6400 crossbars @ 20% activity (sparsity)   |
| DAC            | ~0.8 W      | 400 CUs × 256 = 102K DACs @ 30% activity       |
| ADC            | ~1.0 W      | 400 CUs × 256 = 102K ADCs, ~1 pJ/conversion    |
| Compute engine | ~0.3 W      | Activation, bias, scale                        |
| SRAM + DMA     | ~0.2 W      | Memory access                                  |
| PCIe + clock   | ~0.2 W      | Interface, clock tree                          |
| Total          | ~3.0 W      | TDP target                                     |
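A one-liner confirms the per-component shares sum to the TDP target:

```python
# Per-component budget in watts, as in the table above
budget_w = {
    "crossbar_mvm": 0.5, "dac": 0.8, "adc": 1.0,
    "compute_engine": 0.3, "sram_dma": 0.2, "pcie_clock": 0.2,
}
total = sum(budget_w.values())
assert abs(total - 3.0) < 1e-6     # sums to the 3 W TDP target
print(f"total {total:.1f} W")
```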

Area budget (100 mm² die):

| Component             | Area     | Fraction                           |
|-----------------------|----------|------------------------------------|
| Crossbar active       | ~4.2 mm² | 4.2% (6400 × 656 µm²)              |
| Crossbar peripheral   | ~20 mm²  | 20% (WL/BL drivers, local control) |
| ADC                   | ~25 mm²  | 25%                                |
| DAC                   | ~10 mm²  | 10%                                |
| Compute engine + SRAM | ~20 mm²  | 20%                                |
| PCIe, I/O             | ~15 mm²  | 15%                                |
| Spacing + routing     | ~5 mm²   | 5%                                 |
| Total                 | 100 mm²  | -                                  |

ADC dominates area — a typical analog AI chip problem. Y10 target: bring ADC area to 10% (TDC technology, chapter 5.6).

Clock speed:

  • CMOS substrate (control, compute engine): 1 GHz.
  • Crossbar analog: asynchronous (no clock, settling-based).
  • PCIe 5.0: 32 GT/s link rate.

DVFS (Dynamic Voltage and Frequency Scaling):

Voltage/frequency tracks activity:

  • Idle: 100 MHz, 0.6 V → 100 mW.
  • Average: 500 MHz, 0.8 V → 1 W.
  • Peak: 1 GHz, 1 V → 3 W.
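Those three operating points are consistent with the simple dynamic-power model P ∝ V² · f. A back-of-envelope sketch, assuming the peak point (1 GHz, 1.0 V) dissipates the full 3 W dynamically and ignoring leakage:

```python
P_PEAK_W = 3.0                     # assumed fully dynamic at 1 GHz / 1.0 V
states = {"idle": (0.1e9, 0.6), "average": (0.5e9, 0.8), "peak": (1.0e9, 1.0)}

for name, (f_hz, v) in states.items():
    # Scale by V^2 * f relative to the peak operating point
    p = P_PEAK_W * (v ** 2 * f_hz) / (1.0 ** 2 * 1.0e9)
    print(f"{name:>7}: {p * 1000:4.0f} mW")
```

This reproduces roughly 108 mW idle and 960 mW average, matching the ~100 mW and ~1 W figures above.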

CPU-SIDRA interface:

A host CPU (e.g. Intel Xeon or AMD EPYC) connects over PCIe 5.0:

  1. CPU loads the model (weights programmed into the crossbars).
  2. CPU sends input data over PCIe.
  3. YILDIRIM runs MVMs.
  4. Output returns over PCIe.
  5. CPU handles non-MVM (softmax, tokenization, post-processing).

PCIe 5.0 bandwidth: 16 GB/s. Is that sufficient? A GPT-2 inference input of 512 tokens × 768 dims × 2 bytes = 0.8 MB transfers in about 50 µs at 16 GB/s, so the link is not the throughput bottleneck.
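The transfer-time estimate, spelled out:

```python
tokens, dim, bytes_per_val = 512, 768, 2
payload = tokens * dim * bytes_per_val      # 786,432 B ≈ 0.8 MB
link_bw = 16e9                              # PCIe 5.0 × 4 lanes, ~16 GB/s
print(f"{payload / 1e6:.2f} MB in {payload / link_bw * 1e6:.0f} us")
# → 0.79 MB in 49 us
```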

L3 · Deep

Dataflow (typical inference):

Host CPU (x86)
    ↓ PCIe 5.0 (16 GB/s)
YILDIRIM Chip:
    L3 SRAM (16 MB) — input buffer
        ↓ DMA
    L2 SRAM (2 MB × 16 clusters) — active-layer weight cache
        ↓
    L1 SRAM (128 KB × 400 CUs = 50 MB) — intermediate activations
        ↓
    Crossbar (419 MB) — persistent weights
        ↓
    Compute Engine — activation, bias
        ↓
    L3 SRAM — output buffer
        ↓ PCIe
Host CPU

Memory hierarchy (theoretical):

  • Memristor: 419 MB (fixed, written at program-time).
  • L3 SRAM: 16 MB (intermediate outputs, big buffers).
  • L2 SRAM: 2 MB × 16 = 32 MB.
  • L1 SRAM: 128 KB × 400 CU = 50 MB.

Total ~520 MB on-chip. Not enormous, but everything on-chip — no external DRAM. That’s SIDRA’s von Neumann-bypass claim.
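Summing the hierarchy gives 517 MB, slightly under the ~520 MB round figure:

```python
memristor = 419                    # MB, fixed weights
l3 = 16                            # MB
l2 = 2 * 16                        # 2 MB × 16 clusters
l1 = 128 / 1024 * 400              # 128 KB × 400 CUs = 50 MB
print(f"{memristor + l3 + l2 + l1:.0f} MB on-chip")  # → 517 MB
```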

Routing matrix:

Within a cluster, “routing” between the 25 CUs:

  • Every CU output can be forwarded to any other CU.
  • 25×25 routing matrix = 625 connection points.
  • Each connection bi-directional, 32-bit wide, 1 GHz → 4 GB/s per connection.
  • Total routing bandwidth: ~2.5 TB/s within a cluster.
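The aggregate figure follows from the per-link numbers:

```python
cus = 25
links = cus * cus                        # 625 connection points
per_link = 32 / 8 * 1e9                  # 32 bits wide @ 1 GHz = 4 GB/s
total = links * per_link                 # aggregate bytes/s in a cluster
print(f"{per_link / 1e9:.0f} GB/s per link, {total / 1e12:.1f} TB/s total")
# → 4 GB/s per link, 2.5 TB/s total
```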

Why this matters: in deep models, intermediate outputs flow from layer to layer, and the routing matrix carries that flow. YILDIRIM Y1 is a “graph-based” dataflow architecture, not a strictly sequential one.

Calibration and test (boot-time):

At chip power-up:

  1. Temperature read: 4 thermal sensors per cluster.
  2. Voltage calibration: DAC reference voltages tuned.
  3. Crossbar health check: 16 reference cells per crossbar read, offset computed.
  4. ECC prep: redundant cells and parity set.

Boot time: ~100 ms. One-time; doesn’t affect inference.

Tolerance and failure:

  • Per-crossbar 1% failed cells tolerated (ECC).
  • Per-cluster 1 CU failure tolerated (redundant mapping).
  • Chip-level 5% cell failure → still 95% accuracy.

Y1 production yield target: 70-80%. Failed chips ship as low-spec (mobile, IoT).

Experiment: Y1 Chip vs H100 GPU — an Inference Scenario

Scenario: BERT-base (110M parameters), 1000-sentence NLU inference.

NVIDIA H100:

  • Model: 110M × 2 byte = 220 MB.
  • DRAM-loaded: 220 MB / 3 TB/s = 73 µs (once).
  • Inference: 0.2 ms/sentence × 1000 = 200 ms.
  • Total: ~275 ms.
  • Energy: 700 W × 0.275 s = 192 J.

SIDRA Y1:

  • Model: loaded into the Y1 chip once (640 ms, pre-inference).
  • Inference: 1 ms/sentence × 1000 = 1 s.
  • Total: ~1 s.
  • Energy: 3 W × 1 s = 3 J.

Comparison:

  • H100 3.6× faster (latency).
  • SIDRA 64× more efficient (energy).
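Both ratios follow directly from the scenario figures:

```python
h100_t, h100_p = 0.275, 700        # seconds, watts (scenario above)
y1_t, y1_p = 1.0, 3

speedup = y1_t / h100_t            # H100 latency advantage
efficiency = (h100_p * h100_t) / (y1_p * y1_t)   # energy ratio
print(f"H100 {speedup:.1f}x faster, Y1 {efficiency:.0f}x more efficient")
# → H100 3.6x faster, Y1 64x more efficient
```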

Batch vs single:

  • H100 with batch 32 gains up to 32× throughput (~8.4 ms per 32-sentence batch).
  • SIDRA Y1 has no batch mode, but 32 sentences can run in parallel across CUs, keeping ~1 ms/sentence.

Conclusion:

  • Datacenter (many requests, batching): H100 fits better.
  • Edge/embedded (single device, energy-critical): SIDRA Y1 wins big.

Y10 target: 30 → 300 TOPS. Matches H100 inference (training excluded). Y100 beats H100 at inference.

Quick Quiz

1/6 · What is YILDIRIM Y1's four-level hierarchy?

Lab Exercise

Map GPT-2 small inference onto SIDRA Y1.

Model: GPT-2 small (124M parameters).

Questions:

(a) What fraction of Y1 cells does GPT-2 use?
(b) How many crossbars per attention block?
(c) How many crossbars per FFN?
(d) Where do all 12 blocks land across clusters?
(e) Per-token inference latency?
(f) 1000-token generation latency and energy?

Solutions

(a) 124M / 419M = 29.6%. About a third of Y1. 70% free for other models.

(b) Attention: 4 matrices (Q, K, V, O) × 768 × 768. Each 3×3 = 9 crossbars → 36 crossbars/attention.

(c) FFN: W1 (768×3072) = 3×12 = 36 crossbars + W2 (3072×768) = 36 crossbars = 72 crossbars/FFN.

(d) 12 blocks × (36 + 72) = 1296 crossbars. 1296 / (16 crossbars/CU) = 81 CUs. 81 / 25 ≈ 3.24 → 3-4 clusters, out of Y1's 16.

(e) Token inference: 12 blocks × (attention + FFN MVMs). Each MVM ~15 ns. Attention 6 MVMs sequential × 15 = 90 ns. FFN 2 MVMs × 15 = 30 ns. Per block ~120 ns. 12 blocks: ~1.4 µs/token.

(f) 1000 tokens × 1.4 µs = 1.4 ms. Energy: 3 W × 1.4 ms = 4 mJ. Laptop GPT-2 becomes possible.
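The whole mapping can be reproduced as a short script. The tiling rule is the one used in (b) and (c): each 256×256 crossbar holds one tile of a weight matrix, and latency follows the sequential-MVM model from (e):

```python
import math

XBAR = 256                                  # 256 × 256 crossbar tile

def tiles(rows, cols):
    """Crossbars needed to tile a rows × cols weight matrix."""
    return math.ceil(rows / XBAR) * math.ceil(cols / XBAR)

attn = 4 * tiles(768, 768)                  # Q, K, V, O → 36 crossbars
ffn = tiles(768, 3072) + tiles(3072, 768)   # W1 + W2 → 72 crossbars
total_xbars = 12 * (attn + ffn)             # 1296 crossbars
cus = math.ceil(total_xbars / 16)           # 81 CUs

t_token = 12 * (6 + 2) * 15e-9              # 8 sequential MVMs per block
print(total_xbars, cus, f"{t_token * 1e6:.2f} us/token")
# → 1296 81 1.44 us/token
```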

Compare: the same 1000-token generation on an H100 takes ~100 ms at 70 J. SIDRA is 70× faster and 17,000× more efficient. But the H100 can batch 32 sequences, so total throughput still favors the H100.

Cheat Sheet

  • 4-level hierarchy: Crossbar → CU → Cluster → Chip.
  • Y1: 6400 crossbars, 419M memristors, 100 mm², 3 W, 30 TOPS, PCIe 5.0 × 4.
  • Components: MVM crossbar, ADC/DAC, compute engine, L1/L2/L3 SRAM, DMA, PCIe.
  • Power budget: ADC ~33%, DAC ~27%, crossbar 17%, compute ~10%, memory/IO 13%.
  • Area: ADC 25%, periphery 20%, compute/SRAM 20%, IO 15%, crossbar 4.2%.
  • CPU hybrid: CPU control + non-MVM, YILDIRIM MVM.
  • Tolerance: 1% cell + 1 CU + 5% chip failure tolerated.

Vision: YILDIRIM Evolution Y1→Y10→Y100

Y1 (2026-2027):

  • 28 nm CMOS, 100 nm cell.
  • 419M memristors, 30 TOPS, 3 W.
  • Edge inference focus.

Y10 (2029-2030):

  • 14 nm CMOS, 70 nm cell.
  • 10B memristors, 300 TOPS, 30 W.
  • 1S1R 3D-stack begins.
  • TDC ADC technology.
  • Hybrid training (last layer).
  • Datacenter deployments.

Y100 (2031-2033):

  • 7 nm CMOS, 28 nm cell.
  • 100B memristors, 3 POPS, 100 W.
  • 1S1R 8-layer 3D.
  • Photonic interconnect (wafer level).
  • On-chip online learning (STDP).
  • GPT-3 inference on a single chip.

Y1000 (2035+):

  • 2D material (MoS₂) cell, 7 nm.
  • 1T memristor, optional superconducting.
  • 100× Y100 performance.
  • Bio-compatible organic generation prototype.

Strategic for Türkiye: a new generation every 2-3 years → by 2030, Türkiye is the third neuromorphic chip producer (after the US and China). The concrete face of semiconductor sovereignty.

Unexpected: different YILDIRIM variants for different markets:

  • YILDIRIM-mobile (low power, battery-device).
  • YILDIRIM-auto (autonomous vehicle, thermal).
  • YILDIRIM-medical (implant, bio-compatible).
  • YILDIRIM-space (radiation-hardened, satellite).

Further Reading

  • Next chapter: 5.5 — DAC (SAR + ISPP)
  • Previous: 5.3 — The Crossbar Array
  • Modern AI chip architectures: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).
  • Cerebras wafer-scale: Lie et al., Cerebras CS-2 Wafer-Scale System, HotChips 2022.
  • Compute-in-memory chips: Ambrogio et al., An analog-AI chip for energy-efficient deep learning inference, Nature 2023.