YILDIRIM Chip Architecture
SIDRA's first-generation chip spec — a full architecture map of Y1.
Prerequisites
What you'll learn here
- Sketch YILDIRIM Y1's four-level hierarchy (crossbar → CU → cluster → chip)
- List the functional components at each level (ADC, DAC, compute engine, DMA)
- Break down the Y1 chip's power, area, and throughput budgets
- Explain the CPU-SIDRA hybrid system architecture (PCIe link)
- Summarize the Y1→Y100 roadmap at a high level
Hook: YILDIRIM at a Glance
YILDIRIM is the first-generation neuromorphic AI chip from SIDRA SEMICONDUCTOR. The Y1 product spec:
- Die area: ~100 mm² (10 mm × 10 mm).
- Process: 28 nm CMOS substrate + HfO₂ BEOL memristor.
- Transistors: ~4 billion (28 nm CMOS).
- Memristors: 419 million.
- Weight capacity: 419 MB (8 bits per cell).
- Throughput: 30 TOPS analog.
- TDP: 3 W.
- Interface: PCIe 5.0 × 4 lanes (16 GB/s).
The chip doesn’t do inference alone — it runs hybrid with a CPU. The CPU handles control flow + non-MVM; YILDIRIM handles MVMs. This chapter walks the architecture end-to-end.
Intuition: A 4-Level Hierarchy
YILDIRIM Y1 is physically organized in four hierarchy levels:
Level 1: CROSSBAR (256×256 = 65K cells)
↓ × 16
Level 2: COMPUTE UNIT (1M cells + ADC/DAC + local control)
↓ × 25
Level 3: CLUSTER (25 CUs = 25M cells + L2 memory + DMA)
↓ × 16
Level 4: CHIP (16 Clusters = 419M cells + PCIe + L3 memory)
Top-down:
| Level | Count | Total cells | Additional components |
|---|---|---|---|
| Crossbar | 1 | 65,536 | WL/BL drivers, cell matrix |
| CU | 16 crossbars | 1,048,576 | ADC column (256), DAC column (256), compute engine, local SRAM 128 KB |
| Cluster | 25 CUs | 26,214,400 | DMA, routing matrix, L2 SRAM 2 MB |
| Chip | 16 Clusters | 419,430,400 | PCIe controller, L3 SRAM 16 MB, clock tree, power management |
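The cell counts in the table multiply straight up the hierarchy; a quick sanity check in Python:

```python
# Sanity-check the Y1 hierarchy cell counts from the table above.
CROSSBAR_CELLS = 256 * 256           # 65,536 memristors per crossbar
CU_CELLS = CROSSBAR_CELLS * 16       # 16 crossbars per compute unit
CLUSTER_CELLS = CU_CELLS * 25        # 25 CUs per cluster
CHIP_CELLS = CLUSTER_CELLS * 16      # 16 clusters per chip
weight_mb = CHIP_CELLS / 1e6         # ~419 MB at 8 bits (1 byte) per cell
```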
Why hierarchical?
- Local compute — minimize communication distance.
- Parallel execution — different layers can run different models.
- Power management — turn CUs/clusters on/off.
- Scalability — Y10 can scale the same design 10×.
Parallel throughput:
- Crossbar: 4.4 TOPS (65,536 MACs × 67M MVMs/s, i.e. one MVM every 15 ns).
- CU: 4.4 × 16 = 70 TOPS (16 crossbars in parallel).
- Cluster: 70 × 25 = 1.76 POPS.
- Chip: 1.76 × 16 = 28 POPS (theoretical analog).
Practical: ADC/data movement bottleneck → 30 TOPS real Y1 figure.
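The throughput chain can be reproduced in a few lines (assumption: one op = one MAC per cell per MVM; counting multiply and add separately would double these figures):

```python
# Theoretical analog throughput, bottom of the hierarchy to the top.
MVM_NS = 15
mvm_rate = 1e9 / MVM_NS                    # ~67M MVMs/s per crossbar
crossbar_tops = 65536 * mvm_rate / 1e12    # ~4.4 TOPS per crossbar
cu_tops = crossbar_tops * 16               # ~70 TOPS per CU
cluster_pops = cu_tops * 25 / 1e3          # ~1.76 POPS per cluster
chip_pops = cluster_pops * 16              # ~28 POPS, theoretical chip peak
```

The ~1000× gap between 28 POPS theoretical and 30 TOPS practical is exactly the ADC/data-movement bottleneck named above.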
Formalism: Y1 Chip Components in Detail
Crossbar level (detail in 5.3):
- 256 × 256 memristors.
- Local WL/BL drivers.
- Function: single MVM, 15 ns.
Compute Unit (CU) components:
- 16 crossbars (parallel access).
- 256 DACs (8-bit, 0.5 V range).
- 256 ADCs (8-bit, ~1 pJ/conversion).
- Compute engine:
- Activation functions (ReLU, sigmoid, softmax) via LUT.
- Bias addition.
- Scalar multiply (scale factor).
- Layer-norm (mean/std compute).
- Local SRAM: 128 KB (intermediate activations).
- Control: state machine, crossbar sequencing.
CU per-MVM time: ~15 ns analog + 5 ns digital post-processing = 20 ns → 50M MVMs/s per CU.
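The per-CU rate follows directly from the two pipeline stages:

```python
# CU MVM pipeline: analog crossbar read plus digital post-processing.
ANALOG_NS = 15     # crossbar MVM (settling + sense)
DIGITAL_NS = 5     # activation, bias, scale in the compute engine
per_mvm_ns = ANALOG_NS + DIGITAL_NS   # 20 ns end to end
mvm_rate = 1e9 / per_mvm_ns           # 50M MVMs/s per CU
```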
Cluster components:
- 25 CUs.
- DMA (Direct Memory Access): data in/out of the cluster.
- Routing matrix: data flow between the 25 CUs.
- L2 SRAM: 2 MB (model-weight cache, intermediate output storage).
- Power control: per-CU power gating.
Chip components:
- 16 Clusters.
- PCIe 5.0 controller: host CPU link.
- L3 SRAM: 16 MB (large intermediate outputs).
- Power management: voltage regulators, clock tree, DVFS.
- Test and calibration: crossbar calibration at every boot.
- Thermal sensors: temperature measured per cluster, throttling if needed.
Y1 power budget (3 W TDP):
| Component | Power share | Notes |
|---|---|---|
| Crossbar MVM | ~0.5 W | All 6400 crossbars @ 20% activity (sparsity) |
| DAC | ~0.8 W | 6400 × 256 = 1.6M DACs @ 30% activity |
| ADC | ~1.0 W | 6400 × 256 = 1.6M ADCs, 1 pJ × 50M/s each |
| Compute engine | ~0.3 W | Activation, bias, scale |
| SRAM + DMA | ~0.2 W | Memory access |
| PCIe + clock | ~0.2 W | Interface, clock tree |
| Total | ~3.0 W | TDP target |
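Summing the budget rows confirms the TDP and the ADC's dominant share:

```python
# Y1 power-budget rows (W), as listed in the table above.
power_w = {
    "crossbar_mvm": 0.5, "dac": 0.8, "adc": 1.0,
    "compute_engine": 0.3, "sram_dma": 0.2, "pcie_clock": 0.2,
}
total_w = sum(power_w.values())        # 3.0 W TDP
adc_share = power_w["adc"] / total_w   # ~33%: the ADC is the single biggest consumer
```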
Area budget (100 mm² die):
| Component | Area | Fraction |
|---|---|---|
| Crossbar active | ~4.2 mm² | 4.2% (6400 × 656 µm²) |
| Crossbar peripheral | ~20 mm² | 20% (WL/BL drivers, local control) |
| ADC | ~25 mm² | 25% (1.6M ADCs) |
| DAC | ~10 mm² | 10% |
| Compute engine + SRAM | ~20 mm² | 20% |
| PCIe, I/O | ~15 mm² | 15% |
| Spacing + routing | ~5 mm² | 5% |
| Total | 100 mm² | - |
ADC dominates area — a typical analog AI chip problem. Y10 target: bring ADC area to 10% (TDC technology, chapter 5.6).
Clock speed:
- CMOS substrate (control, compute engine): 1 GHz.
- Crossbar analog: asynchronous (no clock, settling-based).
- PCIe 5.0: 32 GT/s link rate.
DVFS (Dynamic Voltage and Frequency Scaling):
Voltage/frequency tracks activity:
- Idle: 100 MHz, 0.6 V → 100 mW.
- Average: 500 MHz, 0.8 V → 1 W.
- Peak: 1 GHz, 1 V → 3 W.
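The three operating points are roughly consistent with the standard dynamic-power model P ∝ C·V²·f (an approximation; leakage and analog power are ignored here, and C is lumped into the 3 W peak figure):

```python
# DVFS sanity check against P ~ C * V^2 * f, normalized to the 1 V / 1 GHz peak.
def power_w(v, f_ghz, peak_w=3.0):
    return peak_w * (v / 1.0) ** 2 * (f_ghz / 1.0)

peak = power_w(1.0, 1.0)   # 3.0 W
avg = power_w(0.8, 0.5)    # ~0.96 W, close to the quoted 1 W
idle = power_w(0.6, 0.1)   # ~0.11 W, close to the quoted 100 mW
```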
CPU-SIDRA interface:
A host CPU (e.g. Intel Xeon or AMD EPYC) connects over PCIe 5.0:
- CPU loads the model (weights programmed into the crossbars).
- CPU sends input data over PCIe.
- YILDIRIM runs MVMs.
- Output returns over PCIe.
- CPU handles non-MVM (softmax, tokenization, post-processing).
PCIe 5.0 bandwidth: 16 GB/s. Sufficient? A GPT-2 inference input is 512 tokens × 768 dims × 2 bytes ≈ 0.8 MB → ~50 µs @ 16 GB/s. The link is not the bottleneck.
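The transfer-time arithmetic, spelled out:

```python
# Time to push one GPT-2-style input over the PCIe 5.0 x4 link.
payload_bytes = 512 * 768 * 2      # tokens x hidden dim x bytes/value, ~0.8 MB
link_bps = 16e9                    # 16 GB/s effective
transfer_us = payload_bytes / link_bps * 1e6   # ~49 us per input batch
```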
Dataflow (typical inference):
Host CPU (x86)
↓ PCIe 5.0 (16 GB/s)
YILDIRIM Chip:
L3 SRAM (16 MB) — input buffer
↓ DMA
L2 SRAM (2 MB × 16 cluster) — active-layer weight cache
↓
L1 SRAM (128 KB × 25 × 16 = 51.2 MB) — intermediate activations
↓
Crossbar (419 MB) — persistent weights
↓
Compute Engine — activation, bias
↓
L3 SRAM — output buffer
↓ PCIe
Host CPU
Memory hierarchy (theoretical):
- Memristor: 419 MB (fixed, written at program-time).
- L3 SRAM: 16 MB (intermediate outputs, big buffers).
- L2 SRAM: 2 MB × 16 = 32 MB.
- L1 SRAM: 128 KB × 400 CUs = 51.2 MB.
Total ~520 MB on-chip. Not enormous, but everything stays on-chip with no external DRAM. That is the basis of SIDRA's von Neumann-bypass claim.
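Adding up the four tiers:

```python
# On-chip memory totals (MB) from the hierarchy above.
memristor_mb = 419
l3_mb = 16
l2_mb = 2 * 16            # 2 MB per cluster x 16 clusters
l1_mb = 0.128 * 25 * 16   # 128 KB per CU x 400 CUs = 51.2 MB
total_mb = memristor_mb + l3_mb + l2_mb + l1_mb   # ~518 MB, all on-chip
```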
Routing matrix:
Within a cluster, “routing” between the 25 CUs:
- Every CU output can be forwarded to any other CU.
- 25×25 routing matrix = 625 connection points.
- Each connection bi-directional, 32-bit wide, 1 GHz → 4 GB/s per connection.
- Total routing bandwidth: ~2.5 TB/s within a cluster.
Why this matters: deep-model intermediate outputs flow layer-to-layer. The routing matrix supports that flow. YILDIRIM Y1 is a “graph-based” dataflow architecture, not a rote-sequential one.
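The routing-bandwidth figure follows from the link count and per-link width:

```python
# Intra-cluster routing bandwidth from the figures above.
links = 25 * 25                  # 625 CU-to-CU connection points
per_link_gbs = 32 / 8 * 1.0      # 32 bits wide at 1 GHz -> 4 GB/s per link
total_tbs = links * per_link_gbs / 1000   # 2.5 TB/s within one cluster
```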
Calibration and test (boot-time):
At chip power-up:
- Temperature read: 4 thermal sensors per cluster.
- Voltage calibration: DAC reference voltages tuned.
- Crossbar health check: 16 reference cells per crossbar read, offset computed.
- ECC prep: redundant cells and parity set.
Boot time: ~100 ms. One-time; doesn’t affect inference.
Tolerance and failure:
- Per-crossbar 1% failed cells tolerated (ECC).
- Per-cluster 1 CU failure tolerated (redundant mapping).
- Chip-level 5% cell failure → still 95% accuracy.
Y1 production yield target: 70-80%. Chips that miss full spec are binned and sold as lower-spec parts (mobile, IoT).
Experiment: Y1 Chip vs H100 GPU — an Inference Scenario
Scenario: BERT-base (110M parameters), 1000-sentence NLU inference.
NVIDIA H100:
- Model: 110M × 2 byte = 220 MB.
- Weight load from HBM: 220 MB / 3 TB/s = 73 µs (one-time).
- Inference: 0.2 ms/sentence × 1000 = 200 ms.
- Total: ~275 ms (inference plus transfer and host overhead).
- Energy: 700 W × 0.275 s = 192 J.
SIDRA Y1:
- Model: loaded into the Y1 chip once (640 ms, pre-inference).
- Inference: 1 ms/sentence × 1000 = 1 s.
- Total: ~1 s.
- Energy: 3 W × 1 s = 3 J.
Comparison:
- H100 3.6× faster (latency).
- SIDRA 64× more efficient (energy).
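The two headline ratios, reproduced from the scenario figures:

```python
# BERT-base comparison: latency and energy ratios from the numbers above.
h100_time_s, h100_power_w = 0.275, 700
y1_time_s, y1_power_w = 1.0, 3
latency_ratio = y1_time_s / h100_time_s   # H100 is ~3.6x faster
energy_ratio = (h100_time_s * h100_power_w) / (y1_time_s * y1_power_w)  # Y1 ~64x more efficient
```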
Batch vs single:
- H100 relies on batching: batch 32 amortizes weight traffic for up to ~32× the single-sentence throughput.
- SIDRA Y1 runs sentence-at-a-time. It has no batch-32 mode, but 32 sentences can run in parallel across CUs at ~1 ms/sentence each.
Conclusion:
- Datacenter (many requests, batching): H100 fits better.
- Edge/embedded (single device, energy-critical): SIDRA Y1 wins big.
Y10 target: 30 → 300 TOPS. Matches H100 inference (training excluded). Y100 beats H100 at inference.
Quick Quiz
Lab Exercise
Map GPT-2 small inference onto SIDRA Y1.
Model: GPT-2 small (124M parameters).
Questions:
(a) What fraction of Y1's cells does GPT-2 use?
(b) Crossbars per attention block?
(c) Crossbars per FFN?
(d) Where do all 12 blocks land across clusters?
(e) Per-token inference latency?
(f) 1000-token generation latency and energy?
Solutions
(a) 124M / 419M = 29.6%. About a third of Y1. 70% free for other models.
(b) Attention: 4 matrices (Q, K, V, O) × 768 × 768. Each 3×3 = 9 crossbars → 36 crossbars/attention.
(c) FFN: W1 (768×3072) = 3×12 = 36 crossbars + W2 (3072×768) = 36 crossbars = 72 crossbars/FFN.
(d) 12 blocks × (36 + 72) = 1296 crossbars. 1296 / (16 crossbars/CU) = 81 CUs. 81 / 25 = 3.24 → 4 clusters. Uses 4 of Y1's 16.
(e) Token inference: 12 blocks × (attention + FFN MVMs). Each MVM ~15 ns. Attention 6 MVMs sequential × 15 = 90 ns. FFN 2 MVMs × 15 = 30 ns. Per block ~120 ns. 12 blocks: ~1.4 µs/token.
(f) 1000 tokens × 1.4 µs = 1.4 ms. Energy: 3 W × 1.4 ms = 4 mJ. Laptop GPT-2 becomes possible.
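The whole mapping can be scripted with ceil-division tiling onto 256×256 crossbars (ceil packing gives 4 clusters for the 81 CUs):

```python
import math

# Re-derive the GPT-2 small mapping from the lab solution.
def tiles(rows, cols, size=256):
    """Number of 256x256 crossbars needed for a rows x cols weight matrix."""
    return math.ceil(rows / size) * math.ceil(cols / size)

attn_xbars = 4 * tiles(768, 768)                   # Q, K, V, O -> 36 crossbars
ffn_xbars = tiles(768, 3072) + tiles(3072, 768)    # 36 + 36 = 72 crossbars
total_xbars = 12 * (attn_xbars + ffn_xbars)        # 1296 crossbars
cus = math.ceil(total_xbars / 16)                  # 81 CUs
clusters = math.ceil(cus / 25)                     # 4 clusters
token_us = 12 * (6 + 2) * 15 / 1000                # 8 sequential MVMs/block -> ~1.44 us/token
```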
Compare: H100 at the same 1000-token generation takes ~100 ms at 70 J. SIDRA is 70× faster + 17000× more efficient. But H100 can batch 32 tokens → total throughput favors H100.
Cheat Sheet
- 4-level hierarchy: Crossbar → CU → Cluster → Chip.
- Y1: 6400 crossbars, 419M memristors, 100 mm², 3 W, 30 TOPS, PCIe 5.0 × 4.
- Components: MVM crossbar, ADC/DAC, compute engine, L1/L2/L3 SRAM, DMA, PCIe.
- Power budget: ADC ~33%, DAC ~27%, crossbar 17%, compute ~10%, memory/IO 13%.
- Area: ADC 25%, periphery 20%, compute/SRAM 20%, IO 15%, crossbar 4.2%.
- CPU hybrid: CPU control + non-MVM, YILDIRIM MVM.
- Tolerance: 1% cell + 1 CU + 5% chip failure tolerated.
Vision: YILDIRIM Evolution Y1→Y10→Y100
Y1 (2026-2027):
- 28 nm CMOS, 100 nm cell.
- 419M memristors, 30 TOPS, 3 W.
- Edge inference focus.
Y10 (2029-2030):
- 14 nm CMOS, 70 nm cell.
- 10B memristors, 300 TOPS, 30 W.
- 1S1R 3D-stack begins.
- TDC ADC technology.
- Hybrid training (last layer).
- Datacenter deployments.
Y100 (2031-2033):
- 7 nm CMOS, 28 nm cell.
- 100B memristors, 3 POPS, 100 W.
- 1S1R 8-layer 3D.
- Photonic interconnect (wafer level).
- On-chip online learning (STDP).
- GPT-3 inference on a single chip.
Y1000 (2035+):
- 2D material (MoS₂) cell, 7 nm.
- 1T memristor, optional superconducting.
- 100× Y100 performance.
- Bio-compatible organic generation prototype.
Strategic significance for Türkiye: a new generation every 2-3 years means that by 2030 Türkiye can be the third neuromorphic chip producer (after the US and China). That is the concrete face of semiconductor sovereignty.
Unexpected: different YILDIRIM variants for different markets:
- YILDIRIM-mobile (low power, battery-device).
- YILDIRIM-auto (autonomous vehicle, thermal).
- YILDIRIM-medical (implant, bio-compatible).
- YILDIRIM-space (radiation-hardened, satellite).
Further Reading
- Next chapter: 5.5 — DAC (SAR + ISPP)
- Previous: 5.3 — The Crossbar Array
- Modern AI chip architectures: Jouppi et al., In-datacenter performance analysis of a tensor processing unit, ISCA 2017 (Google TPU).
- Cerebras wafer-scale: Lie et al., Cerebras CS-2 Wafer-Scale System, HotChips 2022.
- Compute-in-memory chips: Ambrogio et al., An analog-AI chip for energy-efficient deep learning inference, Nature 2023.