🔌 Module 5 · Chip Hardware · Chapter 5.3 · 12 min read

The Crossbar Array

65,536 cells, one MVM engine — SIDRA YILDIRIM's foundation block.

What you'll learn here

  • Identify the geometry and wires of a 256×256 crossbar
  • Step through the physical MVM flow on the crossbar
  • List the design limits of crossbars (IR drop, sneak path, sourcing impedance)
  • Understand the 1T1R crossbar layout and use of metal layers
  • Compute Y1 crossbar throughput and energy numbers

Hook: 65,536 Multipliers, One Step

A 256×256 SIDRA crossbar = 65,536 memristor cells. Together they:

  • Run in parallel
  • In a single 10-ns electrical step
  • Compute one matrix-vector multiply (MVM)

A traditional CPU doing the same job sequentially needs 65,536 instructions × 1 ns ≈ 65 µs. The crossbar is ~6,500× faster and uses ~100× less energy.

This chapter covers the crossbar’s wires, design, and limits. Chapter 5.2 dealt with one cell; here we see how thousands fit together.
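The hook's arithmetic can be checked in a few lines, using only the figures quoted above:

```python
# Checking the hook's arithmetic (figures from the text above):
cpu_time_ns = 65_536 * 1.0       # 65,536 sequential instructions x 1 ns each
crossbar_time_ns = 10.0          # one parallel analog MVM step
speedup = cpu_time_ns / crossbar_time_ns
print(f"CPU: {cpu_time_ns / 1000:.1f} us, speedup: {speedup:.0f}x")
```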

Intuition: Intersections of Horizontal and Vertical Wires

A crossbar physically is two perpendicular wire sets:

  • Word-line (WL): horizontal wires (rows). 256 of them.
  • Bit-line (BL): vertical wires (columns). 256 of them.
  • A memristor cell at every intersection (plus a transistor for 1T1R).
       BL1   BL2   BL3   ...   BL256
        |     |     |           |
WL1 ──[R]──[R]──[R]──...──[R]── 
        |     |     |           |
WL2 ──[R]──[R]──[R]──...──[R]── 
        |     |     |           |
WL3 ──[R]──[R]──[R]──...──[R]── 
        |     |     |           |
  ...  
        |     |     |           |
WL256──[R]──[R]──[R]──...──[R]── 
        |     |     |           |
        I1    I2    I3    ...   I256 (column output currents)

Sizes: 256 × 256 = 65,536 intersections. Each cell is 100 nm × 100 nm (Y1), so the full crossbar is ~25.6 µm × 25.6 µm ≈ 655 µm² (about 0.00066 mm²). Tiny.

MVM flow:

  1. The 256-dimensional input vector x is applied to the WLs as voltages via DACs.
  2. Current flows from each WL through every cell into the BLs (Ohm's law).
  3. On each BL, all cell currents sum (Kirchhoff's current law).
  4. 256 BLs → 256 ADCs → digital output vector y.

We covered the math in chapter 4.2. Here it’s the physical-circuit side.
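The four-step flow can be sketched numerically. A minimal NumPy model, assuming illustrative random conductances and an 8-bit DAC (not SIDRA's actual circuit values):

```python
import numpy as np

# Minimal model of the MVM flow above: conductances G hold the weights,
# the input vector is encoded as word-line voltages, and the column currents
# fall out of one matrix product (Ohm's law per cell + KCL per bit line).
rng = np.random.default_rng(0)
N = 256
G = rng.uniform(1e-6, 100e-6, size=(N, N))  # cells between HRS (1 uS) and LRS (100 uS)

x = rng.uniform(0.0, 1.0, size=N)           # normalized digital input
v_wl = np.round(x * 255) / 255 * 0.5        # DAC: 8-bit quantization onto 0-0.5 V

i_bl = G.T @ v_wl                           # 256 bit-line currents (A)
print(f"max column current: {i_bl.max() * 1e3:.2f} mA")
```

The single `G.T @ v_wl` line is the whole point: what costs 65,536 digital multiplies happens as one physical settling step in the array.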

Formalism: Crossbar Design Details

L1 · Basics

Wire parameters (Y1):

  • Word-line (WL): Cu metal, 30 nm wide, 50 nm thick, ~26 µm long.
  • Bit-line (BL): Cu metal, same dimensions, perpendicular.
  • WL and BL on different metal layers (M2 and M3, say) with dielectric between.
  • A via at each crossing connects to the memristor.

Resistance calc (Cu, BEOL chapter 2.8):

  • ρ_Cu,eff = 3.5 µΩ·cm (with size effect)
  • A = 30 × 50 = 1500 nm² = 1.5 × 10⁻¹¹ cm²
  • L = 26 µm = 2.6 × 10⁻³ cm
  • R_wire = ρL/A = (3.5 × 10⁻⁶ Ω·cm × 2.6 × 10⁻³ cm) / (1.5 × 10⁻¹¹ cm²) ≈ 607 Ω

A single wire is ~600 Ω. With 256 cells drawing through, this resistance creates IR drop.
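The resistance estimate is easy to reproduce; a short sketch using the Y1 wire figures from the text:

```python
# Reproducing the R_wire estimate above (Cu word line, Y1 figures):
rho_cu_eff = 3.5e-6                    # ohm*cm, Cu with size effect
area_cm2 = 30 * 50 * 1e-14             # 30 nm x 50 nm cross-section in cm^2
length_cm = 26e-4                      # 26 um wire length
r_wire = rho_cu_eff * length_cm / area_cm2
print(f"R_wire = {r_wire:.0f} ohm")    # ~607 ohm
```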

Cell resistance:

  • LRS: 10 kΩ (G = 100 µS).
  • HRS: 1 MΩ (G = 1 µS).

The wire (600 Ω) is far below LRS (10 kΩ), so is it negligible? Not when 256 cells draw current through the same wire; chapter 5.12 quantifies the resulting IR drop.

Active-area analysis:

256×256 crossbar = 65,536 cells × (100 nm × 100 nm) ≈ 655 µm². In 28 nm CMOS, the substrate under that footprint holds roughly 330,000 transistors, while 1T1R needs only ~65K access transistors plus supporting circuits, so the CMOS beneath the array has room to spare.

L2 · Full

Physical MVM flow (256×256, parallel):

Step 1 — DAC input prep (5 ns):

  • 256 DACs convert input vector to voltages.
  • Each DAC: 8-bit input → 0-0.5 V analog out.
  • DAC energy: 0.5 pJ × 256 = 128 pJ.

Step 2 — Word-line activation (1 ns):

  • DAC outputs to WL drivers.
  • WL voltage rises (RC settling).
  • Typical settling: τ = R_drv × C_WL ≈ 1 ns.

Step 3 — Crossbar settling (5 ns):

  • Currents flow physically per Ohm’s law.
  • Each cell: I_ij = G_ij · V_i.
  • Each column sums by KCL: I_j = Σ_i I_ij.
  • The physics settles essentially at once; the limit is capacitor charge time, not signal propagation.
  • 256-cell parallel current upper bound: ~10 mA (worst-case all-LRS).

Step 4 — ADC conversion (5 ns):

  • 256 ADCs convert column currents to numbers.
  • 8-bit ADC: 256 levels.
  • ADC energy: 1 pJ × 256 = 256 pJ.

Step 5 — Output (1 ns):

  • ADC outputs go to the compute engine.

Total MVM time: ~10-15 ns (settling + ADC). The detail behind the “10 ns” figure in chapter 4.2.

Total MVM energy:

  • DAC: 128 pJ
  • Crossbar (Ohm dissipation): 26 pJ (typical activity)
  • ADC: 256 pJ
  • Control: 50 pJ
  • Total: ~460 pJ/MVM (we saw in 4.2).

Throughput: 1 / 15 ns = 67M MVMs/s/crossbar. 65K MAC × 67M = 4.4 × 10¹² MAC/s = 4.4 TOPS per crossbar.

Y1 with 6400 crossbars in parallel → theoretical peak ~28 POPS. Practical sustained throughput is ~30 TOPS (sequential bottlenecks).
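The five-step timing and energy budget above reduces to simple bookkeeping, using the per-step figures from the text:

```python
# Energy and throughput bookkeeping for one 256x256 MVM (chapter figures):
e_total_pj = 0.5 * 256 + 26 + 1.0 * 256 + 50   # DAC + crossbar + ADC + control
t_mvm_s = 15e-9                                 # ~15 ns per MVM
macs = 256 * 256
mvms_per_s = 1 / t_mvm_s                        # ~67M MVM/s
tops = macs * mvms_per_s / 1e12                 # MAC throughput in TOPS
print(f"{e_total_pj:.0f} pJ/MVM, {mvms_per_s/1e6:.0f}M MVM/s, {tops:.1f} TOPS")
```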

L3 · Deep

IR drop numbers:

256 cells driven in parallel, each 50 µS @ 0.25 V → each cell 12.5 µA.

Word-line current (worst case): 256 × 12.5 = 3.2 mA.

WL resistance 600 Ω → IR drop: 3.2 mA × 600 Ω = 1.9 V! That swallows the 0.25 V input completely.

Mitigation:

  • Current actually decreases along the WL as each cell taps some off, so the average wire current is about half the worst case.
  • Double-ended WL drive (both ends powered).
  • Wider wire (50 nm × 100 nm) → resistance drops.
  • Practical result: ~5-10% IR drop (chapter 5.12 has the full analysis).
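To see why a 600 Ω word line matters, here is a small nodal-analysis sketch of one WL as a resistive ladder. It assumes the unmitigated worst case from the text (all 256 cells conducting at 50 µS, single-ended 0.25 V drive, ideally grounded bit lines); the mitigations listed above are what bring the drop down to the practical 5-10% range.

```python
import numpy as np

# Distributed IR-drop model: one word line as 256 wire segments (600 ohm total),
# each node feeding a 50 uS cell to an ideally grounded bit line. Single-ended
# 0.25 V drive at node 0 -- the unmitigated worst case.
N = 256
g_seg = N / 600.0            # conductance of one wire segment (S)
g_cell = 50e-6               # cell conductance (S)
v_in = 0.25

# Tridiagonal nodal equations for the N word-line node voltages.
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = g_cell + g_seg + (g_seg if i < N - 1 else 0.0)
    if i < N - 1:
        A[i, i + 1] = A[i + 1, i] = -g_seg
b = np.zeros(N)
b[0] = g_seg * v_in          # driver connects through the first segment

v = np.linalg.solve(A, b)
drop = (v_in - v[-1]) / v_in
print(f"far-end voltage: {v[-1]*1e3:.0f} mV ({drop:.0%} drop)")
```

The far end of the line sees only a small fraction of the input voltage in this worst case, which is exactly why double-ended drive and wider wires are needed.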

Sneak path:

In 1T1R the transistor blocks sneak paths — only the selected cell sources current.

In 1R/1S1R, sneak-path strategies:

  • Half-select (V/2 schemes): apply half voltage → sneak current shrinks but doesn’t vanish.
  • OTS selector (1S1R): below threshold the selector is off and blocks sneak current.
  • Negative voltages: current in only one direction.

SIDRA Y1 uses 1T1R → zero sneak path. Y10 plans 1S1R 3D stacks with NbOx OTS (chapter 2.3).
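A quick estimate, assuming a hypothetical selector-less 1R array (which Y1's 1T1R avoids entirely), shows why half-select alone is not enough:

```python
# Half-select (V/2) leakage in a selector-less 1R array, worst case:
# reading one HRS cell while all half-selected neighbors sit in LRS.
N = 256
v_read = 0.25
g_lrs, g_hrs = 100e-6, 1e-6

i_signal = g_hrs * v_read                     # selected HRS cell: 0.25 uA
i_sneak = 2 * (N - 1) * g_lrs * (v_read / 2)  # 510 half-selected LRS cells
print(f"signal {i_signal*1e6:.2f} uA vs sneak {i_sneak*1e3:.2f} mA")
```

The aggregate sneak current dwarfs the signal by four orders of magnitude, which is why 1R arrays need either a selector or a transistor per cell.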

Sourcing impedance:

Word-line driver output impedance (R_drv) is critical. High → IR drop grows. Low → big transistor → big area.

SIDRA Y1: R_drv ≈ 50 Ω, transistor width 1 µm.

1T1R layout practice:

A 1T1R cell fits in ~6F² (F = minimum feature):

  • 28 nm Y1: 6 × 28² ≈ 4700 nm², roughly 70 nm × 70 nm, plus contact and via.
  • Practical 100 nm × 100 nm = 10000 nm² (with margin).

256×256 crossbar physical layout:

  • Active area (memristor): 25.6 µm × 25.6 µm.
  • Periphery: WL/BL drivers, ADC, DAC ~50 µm extra each side.
  • Total crossbar block: ~125 µm × 125 µm = 15625 µm².

Y1 die area 1 cm² = 10⁸ µm². 6400 crossbars × 15625 = 10⁸ µm². Fills the Y1 die exactly (CMOS substrate + ADC + interconnect included).
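The area budget above checks out in a few lines, using the text's ~125 µm block figure:

```python
# Area budget: do 6400 crossbar blocks fill the 1 cm^2 Y1 die? (text figures)
cell_um = 0.100
active_um = 256 * cell_um            # 25.6 um active array edge
block_um = 125.0                     # active array + ~50 um periphery per side
die_um2 = 1e8                        # 1 cm^2 die in um^2
n_blocks = die_um2 / block_um**2
print(f"active {active_um} um, {n_blocks:.0f} blocks per die")
```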

Multiple metal layers:

WL: M3 metal layer. BL: M4 metal layer. Memristor cell: built in BEOL between M3 and M4. 1T1R transistor: in the 28 nm CMOS substrate (below M1). Power/ground: M5+ upper layers.

20-layer BEOL Y1 fits this stack compactly.

Experiment: 256×256 Crossbar MVM Latency vs CPU

Job: 256-vector × 256×256 matrix.

SIDRA Y1 crossbar:

  • DAC setup: 5 ns
  • Crossbar settling: 5 ns
  • ADC: 5 ns
  • Total: 15 ns
  • 65,536 multiplies + 65,280 adds ≈ 131K ops
  • Throughput: 8.7 TOPS per single crossbar

Modern CPU (Intel Xeon, AVX-512):

  • 1 GHz, 16 MAC/cycle (AVX-512)
  • 16 GMAC/s = 16 GOPS
  • 256×256 MVM: 65K MAC / 16G = 4 µs
  • Crossbar 4000 / 15 = 266× faster

GPU (H100, FP16):

  • 1 PFLOPS sustained ≈ 500 TMAC/s
  • 256×256 MVM: 65K MAC / 500T = 130 ns
  • Crossbar 130 / 15 = 8.7× faster (single crossbar vs whole H100)

But H100 runs 1000+ threads in parallel. SIDRA Y1 has 6400 crossbars in parallel.

Total throughput compare:

  • Y1: 6400 × 4.4 TOPS ≈ 28 POPS theoretical analog peak.
  • H100: ~1 PFLOPS (FP8 sparse), roughly 30× below that peak.
  • But the H100 is dynamic (reprogrammed per batch); Y1 is static (the model is fixed in the arrays).

Energy/MAC:

  • CPU: ~1000 pJ (cache + DRAM included).
  • GPU H100: ~10 pJ (HBM included).
  • SIDRA Y1: ~0.05 pJ (crossbar) + 0.05 pJ overhead = 0.1 pJ.
  • SIDRA is roughly 100× more efficient than the H100.

Bottom line: for the same MVM, the SIDRA crossbar answers in nanoseconds, not microseconds, and its energy per op is ~100× below the H100's.

Quick Quiz

Question 1 of 6: How many memristors are in a 256×256 SIDRA crossbar?

Lab Exercise

Map a 4-layer CNN onto SIDRA Y1 crossbars.

Model: ResNet-18 (small CNN), ImageNet inference.

  • 11M parameters, 1.8 GFLOP/inference.
  • 4 main conv layers + FC.

Per-layer dims:

  • Conv1: 3×3 × 64 filters, 224×224 image.
  • Conv2: 3×3 × 128 filters.
  • Conv3: 3×3 × 256 filters.
  • Conv4: 3×3 × 512 filters.
  • FC: 512 × 1000 (ImageNet classes).

Questions:

(a) Total parameters are 11M; what fraction of SIDRA Y1's 419M cells is that?
(b) Conv1's kernel is 3×3 = 9 inputs × 64 outputs; what matrix size, and how many crossbars?
(c) How many crossbars does the whole CNN need?
(d) What is the inference time (sliding-window convolution, average ~28×28 spatial output)?
(e) What is the inference energy?

Solutions

(a) 11M / 419M = 2.6%. Y1 is much larger than ResNet-18 needs. The rest stays free for other models or batching.

(b) Conv1 weight matrix: 9 inputs × 64 outputs = 576 weights. A 9×64 matrix fits easily inside a single 256×256 crossbar, so one crossbar suffices for Conv1.

(c) Total: ~50-100 crossbars (each conv ~10-20). FC: 512 × 1000 → 2 × 4 = 8 crossbars. Total ~70 crossbars, ~1% of Y1.

(d) Sliding-window Conv1: 224 × 224 = 50,176 positions, each one MVM at 15 ns → ~750 µs for Conv1. Deeper layers have smaller feature maps (28×28, 14×14), hence fewer positions. Total inference: ~5-10 ms. Real-time camera-capable.

(e) Total ops: ~1.8 G (counting each FLOP as one op). At ~0.1 pJ/op that is 0.18 mJ, so inference energy is ~0.2 mJ. Tiny.

Bottom line: real-time ResNet-18 inference (30 fps) on Y1 = 6 mJ/s = 6 mW. Far below 3 W TDP — can run additional models in parallel.
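The lab-exercise arithmetic above, collected in one place (using the chapter's simplified figures, with each FLOP counted as one op):

```python
# Lab-exercise arithmetic, using the chapter's simplified figures:
params, y1_cells = 11e6, 419e6
frac = params / y1_cells                   # (a) fraction of Y1 occupied

positions = 224 * 224                      # (d) Conv1 sliding-window positions
t_conv1_us = positions * 15e-9 * 1e6       # one MVM per position, in us

total_ops = 1.8e9                          # (e) counting each FLOP as one op
e_mj = total_ops * 0.1e-12 * 1e3           # at ~0.1 pJ/op, in mJ
p_mw = 30 * e_mj                           # 30 fps power in mW
print(f"(a) {frac:.1%}  (d) Conv1 {t_conv1_us:.0f} us  (e) {e_mj:.2f} mJ, {p_mw:.1f} mW @30fps")
```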

Cheat Sheet

  • Crossbar: 2 perpendicular wire sets + memristors at intersections.
  • 256×256 = 65K MACs in parallel.
  • MVM time: ~10-15 ns (DAC + Ohm settling + ADC).
  • Throughput: 4.4 TOPS/crossbar.
  • Y1: 6400 crossbars → ~28 POPS peak, ~30 TOPS practical.
  • Cell: 1T1R for Y1 (no sneak path).
  • Layout: 25 µm active, 125 µm block (periphery included).
  • IR drop: ~5-10% practical on WL, full detail in 5.12.
  • Energy/MAC: ~0.1 pJ (CPU 10000×, GPU 100× worse).

Vision: Bigger, Denser, More Three-Dimensional

Crossbar design evolves:

  • Y1 (today): 256×256, 1T1R, 100 nm cell, 28 nm CMOS substrate.
  • Y3 (2027): 512×512, tighter 1T1R, 70 nm cell, 14 nm CMOS.
  • Y10 (2029): 1024×1024, 1S1R 3D-stack 4 layers, 28 nm cell, 7 nm CMOS.
  • Y100 (2031+): 4096×4096, photonic-linked, 14 nm cell. 3D stack 16 layers. 100B cells per chip.
  • Y1000 (long horizon): Crossbar = the basic computing block, no CPU. All compute on the crossbar.

Meaning for Türkiye: crossbar design is a new sub-category in semiconductors. Outside the classic CPU/GPU lane, less crowded, open territory. Türkiye’s technical capability is sufficient — fab infrastructure is the bottleneck. The SIDRA workshop is the first concrete step past it.

Unexpected future: crossbars in everything. Phones, cars, appliances will hold tiny SIDRA crossbars doing local AI. Cloud dependency drops. The local-AI era. 2030+ horizon.

Further Reading