🧭 Module 0 · Welcome · Chapter 0.1 · 12 min read

What is SIDRACHIP?

The chip on the other side of the memory wall.

What you'll learn here

  • Briefly explain why modern AI is running into the 'memory wall'
  • Describe, with a simple picture, the difference between Von Neumann architecture and in-memory computing
  • Say in your own words why SIDRACHIP is a 'memristor-based analog AI chip'
  • Distinguish the YILDIRIM Y1 / Y10 / Y100 product family

Hook: Brain 20 W, NVIDIA 1500 W

The human brain — about the size of a laptop — writes a poem, gets a joke, recognizes a tree, all on 20 watts. Doing the same on NVIDIA H100 takes 700 watts, and a server rack pulls 1500+ watts. Is the 75× gap an engineering curiosity, or is it the architecture itself?

It’s the architecture. And for a very specific reason. In 1945, Von Neumann wrote a famous note describing two separate parts of a computer: memory and compute. Connecting them, a bus. Eighty years later, modern GPUs still have that same bus — except now 3 terabytes per second of data is running up and down it. The bus itself has become an energy sink. We call this the memory wall.

SIDRACHIP is the chip on the other side of that wall.

Intuition: What if Memory and Compute Were the Same Thing?

Picture this:

  • Classic GPU: You’re sitting in a library. For every calculation you have to get up, walk 10 meters to the shelf, grab a book, come back, read, compute, carry the book back. Every single time.
  • Memristor crossbar: The library itself can compute. Ask the question in front of the shelf, the answer appears inside the shelf. You never walk.

“Compute inside the shelf” is not a metaphor — that’s literally what happens. A memristor (memory-resistor) stores a number (a weight) and performs multiplication by that number via Ohm’s Law when voltage is applied. Build a 256-row × 256-column crossbar, apply different input voltages on each row, and the current summed in each column gives you the inner product of that column’s weight vector with the input. All 256 columns run at once, so you get a full 256-dimensional matrix-vector multiply in one clock cycle, in analog.

In the animation above you can watch, on the left, the classical GPU shuffling data between memory and compute; and on the right, SIDRA’s in-memory computing. Press “Start” and watch the energy counters diverge.

Formalism: The Von Neumann Bottleneck and In-Memory Compute

L1 · Intro

One sentence: In classic computers, moving data from memory to the compute unit costs much more energy than the computation itself. SIDRACHIP eliminates that movement.

L2 · Full

Split the cost of one MAC (multiply-accumulate) operation:

$$E_{\text{total}} = E_{\text{compute}} + E_{\text{move}}$$

Typical numbers in 28 nm CMOS:

| Operation | Energy |
|---|---|
| 32-bit FMAC | ~3 pJ |
| 32-bit DRAM read | ~640 pJ |

So on a GPU, ~99.5% of the energy budget is data movement.
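The ~99.5% figure follows directly from the two energy costs quoted above; a quick sanity check (the pJ values are the approximate 28 nm numbers from the table, orders of magnitude rather than exact silicon measurements):

```python
# Approximate 28 nm CMOS energy costs per operation (from the table above).
E_COMPUTE_PJ = 3.0    # one 32-bit FMAC
E_MOVE_PJ = 640.0     # one 32-bit DRAM read

total = E_COMPUTE_PJ + E_MOVE_PJ
move_fraction = E_MOVE_PJ / total
print(f"data movement share: {move_fraction:.1%}")  # → 99.5%
```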

Analog MVM does this: apply input voltages $\mathbf{v}$ to the crossbar rows; the memristors at each intersection already hold conductance $G_{ij}$ (i.e. $W_{ij}$). The current summed in each output column, by Kirchhoff’s current law, equals

$$I_j = \sum_i v_i \cdot G_{ij}$$

So the vector–matrix product is solved physically. One clock cycle, zero walking.
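What the crossbar does physically can be mimicked in a few lines of NumPy. The chip gets the answer in one analog step; the digital version below computes the same column sums explicitly (the 256×256 size matches the subarray dimension used in this chapter; the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(0.0, 1.0, size=256)           # input voltages on the 256 rows
G = rng.uniform(0.0, 1.0, size=(256, 256))    # conductances G_ij (the weights W_ij)

# Kirchhoff's current law per column: I_j = sum_i v_i * G_ij
I = v @ G                                      # all 256 column currents "at once"

# Spot-check one column against the explicit sum from the formula above.
assert np.isclose(I[0], sum(v[i] * G[i, 0] for i in range(256)))
```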

L3 · Deep

At scale: in 1994 Wulf and McKee observed that while CPU speed was improving ~60% per year, DRAM access time was gaining only ~7%. That gap has widened for 30 years, so in modern systems compute waits while transport burns energy; either way, the bottleneck is memory. Modern LLMs have $\mathcal{O}(10^{11})$ parameters, and most of inference time is spent pulling those from DRAM. For transformer decode with FP16 weights:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs/token}}{\text{bytes moved/token}} \approx \frac{2 \cdot N_{\text{param}}}{2 \cdot N_{\text{param}}} = 1$$

Only ~1 FLOP per byte. GPUs can’t reach their theoretical peak — they end up “memory bound”. In-memory computing pushes this ratio to infinity because data doesn’t move.
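The decode-phase tally behind that ratio, assuming one MAC (= 2 FLOPs) per weight and 2 bytes per FP16 weight read once per token:

```python
def arithmetic_intensity(n_params: int) -> float:
    """FLOPs per byte moved during transformer decode (FP16 weights)."""
    flops_per_token = 2 * n_params   # one multiply-accumulate = 2 FLOPs per weight
    bytes_per_token = 2 * n_params   # each FP16 weight (2 bytes) read once per token
    return flops_per_token / bytes_per_token

# The ratio is ~1 regardless of model size — that's the point.
print(arithmetic_intensity(175_000_000_000))  # → 1.0
```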

For YILDIRIM Y10: 20 layers × ~793,000 subarrays/layer × (256×256) cells ≈ 1.04 trillion memristors (the 256 CUs are already aggregated into the subarray count). Peak ~3,400 TOPS, ~97 TOPS/W — about 2.4× better than NVIDIA B300 on the same workload.
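The Y10 cell count quoted above is straightforward to verify (the layer and subarray figures are the ones stated in this chapter):

```python
layers = 20
subarrays_per_layer = 793_000
cells_per_subarray = 256 * 256   # one crossbar subarray

memristors = layers * subarrays_per_layer * cells_per_subarray
print(f"{memristors:.2e}")  # → 1.04e+12, i.e. ~1.04 trillion
```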

Experiment: Watch the Energy Counter

Assuming you ran the animation above — try this:

  1. Let the animation run for 10 seconds.
  2. Note the energy counters on the left (GPU) and right (SIDRA).
  3. Compute GPU_energy / SIDRA_energy. You should see a number around ~100.

This isn’t made up — the constants in the animation match real-world orders of magnitude (DRAM read ~640 pJ vs analog MVM ~5 pJ/op). Real numbers vary with workload; what matters is the order of magnitude gap.
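With the constants just quoted (~640 pJ per DRAM read vs ~5 pJ per analog MVM op), the expected counter ratio is easy to predict:

```python
E_DRAM_READ_PJ = 640.0   # per 32-bit DRAM access
E_ANALOG_OP_PJ = 5.0     # per analog MVM operation (order of magnitude)

ratio = E_DRAM_READ_PJ / E_ANALOG_OP_PJ
print(ratio)  # → 128.0, i.e. "a number around ~100"
```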

Quiz

1/4 · On modern GPUs, roughly what percentage of the total energy of a 32-bit FMAC operation is spent moving data from DRAM?

Lab Task: Compute Your Own Memory Wall

Pen-and-paper (or pen + head).

A ResNet-50 inference does roughly 4 billion MACs and touches 25 million parameters (weights). Assume FP32 weights (4 bytes/weight) and a DRAM read cost of ~640 pJ per 32-bit access.

  1. 25M parameters = how many total bytes?
  2. If each weight is read from DRAM exactly once during inference, how many 32-bit reads happen?
  3. What’s the total transport energy (in millijoules)?
  4. What’s the total compute energy (4 billion MACs × 3 pJ)?
  5. Compute the ratio transport / compute. That’s ResNet-50’s “memory wall coefficient”.
  6. (Think) GPT-3 175B has ~700× more parameters but only ~800× more MACs. How does the coefficient change?
Hints:
  • (1) 25·10⁶ × 4 = 10⁸ bytes ≈ 100 MB.
  • (2) 25 million 32-bit reads.
  • (3) 25·10⁶ × 640 pJ = 1.6 × 10¹⁰ pJ = 16 mJ.
  • (4) 4·10⁹ × 3 pJ = 1.2 × 10¹⁰ pJ = 12 mJ.
  • (5) 16 / 12 ≈ 1.33×. ResNet-50 is mildly memory-bound; ~57% transport, ~43% compute.
  • (6) In LLMs every token reads all weights, and MACs only grow proportionally (~2/weight). Arithmetic intensity collapses to ~1 FLOP/byte and the coefficient climbs to 50-100×. That’s why SIDRA’s advantage on LLM inference is much larger than on ResNet.
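The hint arithmetic can be checked with a short script, using the same constants as the task (FP32 weights, ~640 pJ per 32-bit read, ~3 pJ per MAC):

```python
params = 25_000_000       # ResNet-50 weights
macs = 4_000_000_000      # MACs per inference
PJ_PER_READ = 640.0       # one 32-bit DRAM read
PJ_PER_MAC = 3.0          # one 32-bit FMAC

total_bytes = params * 4                  # (1) FP32 → 4 bytes/weight: 100 MB
reads = params                            # (2) one 32-bit read per weight
transport_mj = reads * PJ_PER_READ / 1e9  # (3) pJ → mJ
compute_mj = macs * PJ_PER_MAC / 1e9      # (4) pJ → mJ
coefficient = transport_mj / compute_mj   # (5) the "memory wall coefficient"

print(total_bytes, transport_mj, compute_mj, round(coefficient, 2))
# → 100000000 16.0 12.0 1.33
```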

Cheat Sheet

  • Memory wall: On modern CPUs/GPUs the bulk of the energy goes to data movement, not compute.
  • In-memory computing: The memory cell itself computes → transport energy ≈ 0.
  • Memristor: Memory-resistor. Tunable conductance; under voltage, Ohm + Kirchhoff solve analog MVM.
  • SIDRACHIP: Memristor crossbar + 28 nm CMOS base → CuM (CMOS-under-Memristor) architecture.
  • Product family:
    • Y1 / SIDRA ZERRE — 4 layers, 16 CUs, 52 TOPS, 2026 Q4 PoC
    • Y10 / SIDRA AZIM — 20 layers, 256 CUs, 3,400 TOPS, 2027 Q3 production
    • Y100 / SIDRA EFLAK — 50-100+ layers, photonic I/O, 10,000+ TOPS, 2029-30 vision
  • Why these numbers matter: Same workload at ~1/25 the energy of NVIDIA. Not a number — a paradigm.

Vision: Beyond SIDRA

SIDRA is one variation of memristor-based analog AI. Competing paradigms in the post-Y100 landscape:

  • Neuromorphic full-stack: Intel Loihi 2 (spike-based, 1M neurons), IBM NorthPole (analog MAC + digital sparsity). SRAM-based, not memristor — different tradeoffs.
  • Quantum AI accelerators: IBM Heron (156 qubits) + classical NN hybrids. Specialized narrow algorithms where quantum has leverage.
  • Photonic AI: Lightmatter, Lightelligence — silicon photonic MZI mesh for MVM. SIDRA Y100 adds photonic input on this path.
  • DNA storage + compute: petabit/cm³, archive-scale. Hours rather than microseconds; wholly different regime.
  • Biological brain-machine: organoid conditioned learning (FinalSpark, 2024). Probably not a serious contender within 10 years, but the research is vibrant.
  • Superconducting NN: Josephson-junction SFQ (single flux quantum) neurons at 4 K — 10³× SIDRA’s energy efficiency.
  • Molecular computing: DNA-origami circuits, enzyme-catalyzed compute — parallel, slow, biocompatible.
  • 3D chiplet ecosystem: UCIe standard, heterogeneous stacks; SIDRA as a chiplet beside a GPU.

Key lesson: SIDRA is one approach, not the only one. Module 3 will map biological neurons/synapses onto memristors; that’s where the richness of this vision becomes clear.

Biggest lever for post-Y10 SIDRA: photonic-electronic hybrid + chiplet. Optical interconnect raises data-movement bandwidth 100×; a chiplet architecture lets each generation update only the “AI tile”. Total system performance up 10× at flat power. 2028–2032 horizon.

Further Reading

  • Next chapter: 0.2 — How to Read This Book
  • Reference: Master Spec v3.0 — docs/specifications/master/PA-MASTER-SPECIFICATION-v3.0.md
  • Academic: W. A. Wulf, S. A. McKee, Hitting the Memory Wall, 1994 — the paper that named it.
  • Academic: L. Chua, Memristor — The Missing Circuit Element, IEEE TCT 1971 — the theoretical birth of the memristor (40 years before HP Labs built it).