The Crossbar Array
65,536 cells, one MVM engine — SIDRA YILDIRIM's foundation block.
Prerequisites
What you'll learn here
- Identify the geometry and wires of a 256×256 crossbar
- Step through the physical MVM flow on the crossbar
- List the design limits of crossbars (IR drop, sneak path, sourcing impedance)
- Understand the 1T1R crossbar layout and use of metal layers
- Compute Y1 crossbar throughput and energy numbers
Hook: 65,536 Multipliers, One Step
A 256×256 SIDRA crossbar = 65,536 memristor cells. Together they:
- Run in parallel
- In a single 10-ns electrical step
- Compute one matrix-vector multiply (MVM)
A traditional CPU does the same job in 65,536 sequential instructions × 1 ns ≈ 65 µs. The crossbar is ~6,500× faster and uses ~100× less energy.
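The hook arithmetic can be checked in a couple of lines:

```python
# Back-of-envelope check of the hook numbers (figures from the text).
cells = 256 * 256              # 65,536 memristor cells
crossbar_ns = 10               # one parallel analog step
cpu_ns = cells * 1             # 65,536 sequential 1-ns instructions

print(f"CPU     : {cpu_ns / 1000:.1f} us")       # ~65.5 us
print(f"speedup : {cpu_ns / crossbar_ns:.0f}x")  # ~6554x, quoted as ~6500x
```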
This chapter covers the crossbar’s wires, design, and limits. Chapter 5.2 dealt with one cell; here we see how thousands fit together.
Intuition: Intersections of Horizontal and Vertical Wires
A crossbar physically is two perpendicular wire sets:
- Word-line (WL): horizontal wires (rows). 256 of them.
- Bit-line (BL): vertical wires (columns). 256 of them.
- A memristor cell at every intersection (plus a transistor for 1T1R).
BL1 BL2 BL3 ... BL256
| | | |
WL1 ──[R]──[R]──[R]──...──[R]──
| | | |
WL2 ──[R]──[R]──[R]──...──[R]──
| | | |
WL3 ──[R]──[R]──[R]──...──[R]──
| | | |
...
| | | |
WL256──[R]──[R]──[R]──...──[R]──
| | | |
         I1     I2     I3    ...    I256   (column output currents)

Sizes: 256 × 256 = 65,536 intersections. Each cell 100 nm × 100 nm (Y1) → total crossbar area ~25.6 µm × 25.6 µm ≈ 655 µm² ≈ 0.00066 mm². Tiny.
MVM flow:
- The input vector (256-dim) is applied to the WLs as voltages via DACs.
- Current flows from each WL through every cell into the BLs (Ohm's law).
- On each BL the cell currents sum (Kirchhoff's current law).
- 256 BLs → 256 ADCs → the digital output vector.
We covered the math in chapter 4.2. Here it’s the physical-circuit side.
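The four-step flow above can be sketched as an ideal, parasitic-free simulation. The sizes and the 100 µS / 1 µS conductance levels come from this chapter; the random conductance pattern is illustrative only:

```python
import random

# Ideal crossbar MVM: voltages on word-lines, conductances at the
# intersections, currents summed on the bit-lines (no wire parasitics).
random.seed(0)
N = 256
G_LRS, G_HRS = 100e-6, 1e-6                       # siemens (10 kOhm / 1 MOhm)
G = [[random.choice((G_LRS, G_HRS)) for _ in range(N)] for _ in range(N)]
V = [random.uniform(0.0, 0.5) for _ in range(N)]  # DAC outputs, 0-0.5 V

# Kirchhoff: each bit-line current is the sum of Ohm's-law cell currents.
I = [sum(G[i][j] * V[i] for i in range(N)) for j in range(N)]
print(f"I[0] = {I[0] * 1e6:.1f} uA")              # one column output current
```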
Formalism: Crossbar Design Details
Wire parameters (Y1):
- Word-line (WL): Cu metal, 30 nm wide, 50 nm thick, ~26 µm long.
- Bit-line (BL): Cu metal, same dimensions, perpendicular.
- WL and BL on different metal layers (M2 and M3, say) with dielectric between.
- A via at each crossing connects to the memristor.
Resistance calc (Cu, BEOL chapter 2.8):
- ρ ≈ 3.5 µΩ·cm (with size effect)
- A = 30 nm × 50 nm = 1500 nm² = 1.5 × 10⁻¹¹ cm²
- L = 26 µm = 2.6 × 10⁻³ cm
- R = ρL/A ≈ (3.5 × 10⁻⁶ Ω·cm × 2.6 × 10⁻³ cm) / (1.5 × 10⁻¹¹ cm²) ≈ 600 Ω
A single wire is ~600 Ω. With 256 cells drawing through, this resistance creates IR drop.
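The same calculation in code. The resistivity value (~3.5 µΩ·cm, Cu with size effect at 30 nm line width) is an assumption chosen to match the ~600 Ω figure above:

```python
# Word-line resistance estimate: R = rho * L / A.
rho = 3.5e-8            # ohm*m (3.5 uOhm-cm, assumed size-effect value)
width, thick = 30e-9, 50e-9   # wire cross-section, m
length = 26e-6          # wire length, m

R = rho * length / (width * thick)
print(f"R_wire = {R:.0f} Ohm")   # ~607 Ohm, quoted as ~600 Ohm
```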
Cell resistance:
- LRS: 10 kΩ (G = 100 µS).
- HRS: 1 MΩ (G = 1 µS).
The wire (600 Ω) is much smaller than LRS (10 kΩ), so is it negligible? Not quite: the drops of 256 cells accumulate along the line. IR drop is covered in chapter 5.12.
Active-area analysis:
256×256 crossbar = 65,536 cells × 100 nm × 100 nm = 656 µm². In the same footprint the 28 nm CMOS substrate would hold ~330,000 transistors, comfortably covering the 65K access transistors of 1T1R plus supporting circuits. The memristor array itself sits in the BEOL, so it consumes no substrate area beyond the CMOS beneath it.
Physical MVM flow (256×256, parallel):
Step 1 — DAC input prep (5 ns):
- 256 DACs convert input vector to voltages.
- Each DAC: 8-bit input → 0-0.5 V analog out.
- DAC energy: 0.5 pJ × 256 = 128 pJ.
Step 2 — Word-line activation (1 ns):
- DAC outputs to WL drivers.
- WL voltage rises (RC settling).
- Typical settling: τ = R_drv × C_WL ≈ 1 ns.
Step 3 — Crossbar settling (5 ns):
- Currents flow physically per Ohm’s law.
- Each cell: I_ij = G_ij · V_i.
- Each column sums by KCL: I_j = Σ_i G_ij · V_i.
- The physics is effectively instantaneous; the limit is not signal propagation but the RC time to charge the line capacitance.
- 256-cell parallel current upper bound: ~10 mA (worst-case all-LRS).
Step 4 — ADC conversion (5 ns):
- 256 ADCs convert column currents to numbers.
- 8-bit ADC: 256 levels.
- ADC energy: 1 pJ × 256 = 256 pJ.
Step 5 — Output (1 ns):
- ADC outputs go to the compute engine.
Total MVM time: ~15 ns (DAC + settling + ADC; the 1-ns steps overlap with their neighbors). This is the detail behind the "10 ns" figure in chapter 4.2.
Total MVM energy:
- DAC: 128 pJ
- Crossbar (Ohm dissipation): 26 pJ (typical activity)
- ADC: 256 pJ
- Control: 50 pJ
- Total: ~460 pJ/MVM (we saw in 4.2).
Throughput: 1 / 15 ns = 67M MVMs/s/crossbar. 65K MAC × 67M = 4.4 × 10¹² MAC/s = 4.4 TOPS per crossbar.
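The step timings, energies, and throughput above can be rolled up in a few lines (all figures from the text; the ~15 ns end-to-end number assumes the 1 ns steps overlap):

```python
# Roll-up of the five MVM steps (figures from the text).
steps_ns = {"DAC": 5, "WL drive": 1, "settle": 5, "ADC": 5, "output": 1}
energy_pj = {"DAC": 128, "crossbar": 26, "ADC": 256, "control": 50}

mvm_ns = 15                           # end-to-end figure used by the text
macs = 256 * 256
tops = macs * (1e9 / mvm_ns) / 1e12   # tera-MAC/s per crossbar

print(f"raw step sum : {sum(steps_ns.values())} ns (~{mvm_ns} ns with overlap)")
print(f"energy/MVM   : {sum(energy_pj.values())} pJ")
print(f"throughput   : {tops:.1f} TMAC/s per crossbar")
```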
Y1 with 6400 crossbars in parallel → theoretical 28 POPS. Practical 30 TOPS (sequential bottlenecks).
IR drop numbers:
256 cells driven in parallel, each 50 µS @ 0.25 V → each cell 12.5 µA.
Word-line current (worst case): 256 × 12.5 = 3.2 mA.
WL resistance 600 Ω → IR drop: 3.2 mA × 600 Ω = 1.9 V! That swallows the 0.25 V input completely.
Mitigation:
- In reality the current tapers along the WL (each cell taps some off), so the average drop is about half the worst case.
- Double-ended WL drive (both ends powered).
- Wider wire (50 nm × 100 nm) → resistance drops.
- Practical result: ~5-10% IR drop (chapter 5.12 has the full analysis).
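A minimal resistive-ladder model shows why the tapering current roughly halves the naive IR-drop estimate (uniform cell current assumed, values from the text):

```python
# Word-line as a resistive ladder: 256 segments, each cell tapping
# the same current (uniform-load assumption).
N = 256
R_WL = 600.0                 # total word-line resistance, Ohm
r_seg = R_WL / N             # per-segment resistance
i_cell = 12.5e-6             # 50 uS at 0.25 V

# Naive bound: the full 3.2 mA flows through the entire wire.
naive = N * i_cell * R_WL    # 1.92 V

# Tapered: each segment only carries the current of the cells beyond it.
drop, remaining = 0.0, N * i_cell
for _ in range(N):
    drop += remaining * r_seg
    remaining -= i_cell
print(f"naive {naive:.2f} V, tapered {drop:.2f} V")  # tapered ~ half
```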
Sneak path:
In 1T1R the transistor blocks sneak paths — only the selected cell sources current.
In 1R/1S1R, sneak-path strategies:
- Half-select (V/2 schemes): apply half voltage → sneak current shrinks but doesn’t vanish.
- OTS selector (1S1R): below threshold the selector is off (blocks current).
- Negative voltages: current in only one direction.
SIDRA Y1 uses 1T1R → zero sneak path. Y10 plans 1S1R 3D stacks with NbOx OTS (chapter 2.3).
Sourcing impedance:
Word-line driver output impedance (R_drv) is critical: too high and the IR drop grows; too low and the driver transistor must be large, costing area.
SIDRA Y1: R_drv ≈ 50 Ω, transistor width 1 µm.
1T1R layout practice:
A 1T1R cell fits in ~6F² (F = minimum feature):
- 28 nm Y1: 6 × 28² = 4700 nm² ≈ 70 nm × 70 nm. Plus contact + via.
- Practical 100 nm × 100 nm = 10000 nm² (with margin).
256×256 crossbar physical layout:
- Active area (memristor): 25.6 µm × 25.6 µm.
- Periphery: WL/BL drivers, ADC, DAC ~50 µm extra each side.
- Total crossbar block: ~125 µm × 125 µm = 15625 µm².
Y1 die area 1 cm² = 10⁸ µm². 6400 crossbars × 15625 = 10⁸ µm². Fills the Y1 die exactly (CMOS substrate + ADC + interconnect included).
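A quick sanity check of the die-area budget (block size and crossbar count from the text):

```python
# Die-area budget check (figures from the text).
block_um = 125
n_crossbars = 6400
total_um2 = n_crossbars * block_um ** 2   # 6400 x 15,625 um^2
die_um2 = 1e8                             # 1 cm^2 die
print(f"crossbar blocks fill {total_um2 / die_um2:.0%} of the die")
```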
Multiple metal layers:
WL: M3 metal layer. BL: M4 metal layer. Memristor cell: built in BEOL between M3 and M4. 1T1R transistor: in the 28 nm CMOS substrate (below M1). Power/ground: M5+ upper layers.
20-layer BEOL Y1 fits this stack compactly.
Experiment: 256×256 Crossbar MVM Latency vs CPU
Job: 256-vector × 256×256 matrix.
SIDRA Y1 crossbar:
- DAC setup: 5 ns
- Crossbar settling: 5 ns
- ADC: 5 ns
- Total: 15 ns
- 65,536 multiplies + 65,280 adds ≈ 131K ops
- Throughput: 8.7 TOPS per single crossbar
Modern CPU (Intel Xeon, AVX-512):
- 1 GHz, 16 MAC/cycle (AVX-512)
- 16 GMAC/s sustained
- 256×256 MVM: 65K MAC / 16G = 4 µs
- Crossbar 4000 / 15 = 266× faster
GPU (H100, FP16):
- 1 PFLOPS sustained ≈ 500 TMAC/s
- 256×256 MVM: the raw division 65K MAC / 500 TMAC/s gives ~0.13 ns, but a single small MVM cannot occupy the whole chip; realistic latency ~130 ns
- Crossbar 130 / 15 = 8.7× faster (single crossbar vs whole H100)
But H100 runs 1000+ threads in parallel. SIDRA Y1 has 6400 crossbars in parallel.
Total throughput compare:
- Y1: 6400 × 4.4 TOPS = ~30 POPS analog.
- H100: ~1 PFLOPS (FP8 sparse). 30× worse.
- But H100 is dynamic (per-batch), Y1 is static (model fixed).
Energy/MAC:
- CPU: ~1000 pJ (cache + DRAM included).
- GPU H100: ~10 pJ (HBM included).
- SIDRA Y1: ~0.05 pJ (crossbar) + 0.05 pJ overhead = 0.1 pJ.
- SIDRA 100× more efficient.
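The comparison can be tabulated from the section's quoted figures. The H100 latency used here is the text's ~130 ns estimate for a small MVM, not a raw throughput division:

```python
# Latency and energy-per-MAC comparison (quoted figures from this section).
macs = 256 * 256
cpu_ns = macs / 16e9 * 1e9                    # 16 GMAC/s -> ~4096 ns
latency_ns = {"crossbar": 15.0, "CPU": cpu_ns, "H100": 130.0}
energy_pj_per_mac = {"CPU": 1000, "H100": 10, "SIDRA Y1": 0.1}

for name, t in sorted(latency_ns.items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: {t:7.0f} ns ({t / 15:6.1f}x crossbar)")
for name, e in energy_pj_per_mac.items():
    print(f"{name:>8}: {e} pJ/MAC")
```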
Bottom line: for the same MVM, SIDRA crossbar is in nanoseconds — not seconds, not microseconds. Energy/op is 100× less than H100.
Quick Quiz
Lab Exercise
Map a 4-layer CNN onto SIDRA Y1 crossbars.
Model: ResNet-18 (small CNN), ImageNet inference.
- 11M parameters, 1.8 GFLOP/inference.
- 4 main conv layers + FC.
Per-layer dims:
- Conv1: 3×3 × 64 filters, 224×224 image.
- Conv2: 3×3 × 128 filters.
- Conv3: 3×3 × 256 filters.
- Conv4: 3×3 × 512 filters.
- FC: 512 × 1000 (ImageNet classes).
Questions:
(a) Total params 11M; what fraction of SIDRA Y1 (419M cells)? (b) Conv1: 3×3 = 9 inputs × 64 outputs kernel → matrix size? Crossbars? (c) Crossbars for the whole CNN? (d) Inference time (sliding-window conv, average ~28×28 spatial output)? (e) Inference energy?
Solutions
(a) 11M / 419M = 2.6%. Y1 is much larger than ResNet-18 needs. The rest stays free for other models or batching.
(b) Conv1 weight matrix: 9 × 64 = 576 weights. A 9-row × 64-column block occupies only a corner of a 256×256 crossbar; 1 crossbar suffices for Conv1.
(c) Total: ~50-100 crossbars (each conv ~10-20). FC: 512 × 1000 → 2 × 4 = 8 crossbars. Total ~70 crossbars, ~1% of Y1.
(d) Sliding-window Conv1: 224 × 224 = 50,176 positions, each one MVM: 50,176 × 15 ns ≈ 750 µs for Conv1. Deeper layers have smaller spatial maps (28×28, 14×14) → far fewer positions. Total inference: ~5-10 ms. Real-time camera-capable.
(e) Total MAC: 1.8 G. SIDRA 0.1 pJ/MAC → 0.18 mJ. Inference energy ~0.2 mJ. Tiny.
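The solution arithmetic, collected in one place (all inputs from the exercise; the 0.1 pJ/MAC figure is the chapter's energy estimate):

```python
# Lab-exercise roll-up (figures from the exercise and solutions).
params = 11e6                       # ResNet-18 parameters
y1_cells = 419e6                    # SIDRA Y1 cell count
mvm_ns = 15

conv1_positions = 224 * 224         # sliding-window MVMs for Conv1
conv1_us = conv1_positions * mvm_ns / 1e3

ops = 1.8e9                         # per inference
energy_mJ = ops * 0.1e-12 * 1e3     # at 0.1 pJ/MAC

print(f"cell use  : {params / y1_cells:.1%}")   # ~2.6%
print(f"Conv1     : {conv1_us:.0f} us")         # ~753 us
print(f"inference : {energy_mJ:.2f} mJ")        # ~0.18 mJ
```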
Bottom line: real-time ResNet-18 inference (30 fps) on Y1 = 6 mJ/s = 6 mW. Far below 3 W TDP — can run additional models in parallel.
Cheat Sheet
- Crossbar: 2 perpendicular wire sets + memristors at intersections.
- 256×256 = 65K MACs in parallel.
- MVM time: ~10-15 ns (DAC + Ohm settling + ADC).
- Throughput: 4.4 TOPS/crossbar.
- Y1: 6400 crossbars → 30 TOPS practical.
- Cell: 1T1R for Y1 (no sneak path).
- Layout: 25 µm active, 125 µm block (periphery included).
- IR drop: ~5-10% practical on WL, full detail in 5.12.
- Energy/MAC: ~0.1 pJ (CPU 10000×, GPU 100× worse).
Vision: Bigger, Denser, More Three-Dimensional
Crossbar design evolves:
- Y1 (today): 256×256, 1T1R, 100 nm cell, 28 nm CMOS substrate.
- Y3 (2027): 512×512, tighter 1T1R, 70 nm cell, 14 nm CMOS.
- Y10 (2029): 1024×1024, 1S1R 3D-stack 4 layers, 28 nm cell, 7 nm CMOS.
- Y100 (2031+): 4096×4096, photonic-linked, 14 nm cell. 3D stack 16 layers. 100B cells per chip.
- Y1000 (long horizon): Crossbar = the basic computing block, no CPU. All compute on the crossbar.
Meaning for Türkiye: crossbar design is a new sub-category in semiconductors. Outside the classic CPU/GPU lane, less crowded, open territory. Türkiye’s technical capability is sufficient — fab infrastructure is the bottleneck. The SIDRA workshop is the first concrete step past it.
Unexpected future: crossbars in everything. Phones, cars, appliances will hold tiny SIDRA crossbars doing local AI. Cloud dependency drops. The local-AI era. 2030+ horizon.
Further Reading
- Next chapter: 5.4 — YILDIRIM Chip Architecture
- Previous: 5.2 — Deep Dive: The Memristor
- MVM math link: 4.2 — Ohm + Kirchhoff = Analog MVM
- Crossbar history: Borghetti et al., ‘Memristive’ switches enable ‘stateful’ logic operations via material implication, Nature 2010.
- 1T1R design: Sheu et al., A 4Mb embedded SLC resistive-RAM macro with 7.2 ns read-write random-access time…, ISSCC 2011.
- Crossbar review: Yu, Neuro-inspired computing with emerging nonvolatile memory, Proc. IEEE 2018.