The Neuromorphic Computing Paradigm
Breaking the von Neumann wall, and the SIDRA YILDIRIM choice.
Prerequisites
What you'll learn here
- State the von Neumann architecture limit (memory wall) and the neuromorphic fix
- Describe how compute-in-memory is realized in SIDRA YILDIRIM
- Compare digital neuromorphic (Loihi, TrueNorth) with analog (SIDRA)
- Explain YILDIRIM's three core design principles (compute-in-memory, analog precision, hierarchical parallelism)
- Place neuromorphic computing in industry context and SIDRA's slot in the category
Hook: The 1945 Wall, Today
In 1945 John von Neumann described modern computer architecture: CPU on one side, memory on the other, connected by a bus.
That architecture has stood for 80 years. But in the AI era it has hit a wall:
- CPU compute speed: improving ~20% per year.
- Memory speed: improving ~5% per year.
- Memory access is 100-1000× slower than CPU compute.
- About 70% of GPT-3 inference time is waiting for memory bandwidth, not computing.
This is the memory wall or von Neumann bottleneck. The fix? Put compute and memory in the same place → Compute-in-Memory (CIM). This is the core idea of neuromorphic computing.
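The gap compounds year over year; a quick sketch using the rough growth rates quoted above (back-of-the-envelope figures, not measured data):

```python
# Rough compounding of the CPU/memory speed gap.
# Growth rates are the approximate figures quoted above.
cpu_growth, mem_growth = 1.20, 1.05  # ~20% and ~5% per year

def gap_after(years: int) -> float:
    """Relative CPU-vs-memory speed gap after `years` of compounding."""
    return (cpu_growth ** years) / (mem_growth ** years)

for years in (10, 20, 40):
    print(f"{years} years: {gap_after(years):.1f}x gap")
```

After two decades of compounding, a modest annual difference widens into a double-digit gap, which is why the wall only appeared decades after 1945.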
SIDRA YILDIRIM’s choice: analog compute-in-memory. The memristor crossbar both stores weights and performs MVM → memory-compute unification. This module (5) covers the silicon details. This chapter explains the paradigm.
Intuition: Memory and Compute Together
Traditional (von Neumann):
[CPU] ←──bus──→ [DRAM]
  ↑                ↑
MAC unit        Weights

For every MVM: read weights from DRAM → travel the bus → into a CPU register → MAC → write the result back. Data movement = energy + time. Memory access is 100-1000× more expensive than the MAC itself.
Compute-in-Memory (SIDRA YILDIRIM):
[Crossbar]
↑
Weights in place
MVM in place (Ohm + KCL)
Output = analog current

Weights never move. Apply input voltages → collect output currents. Memory = compute. We saw the math in 4.2.
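The in-place MVM can be mimicked numerically. In an ideal crossbar the column currents are I = Gᵀ·V: Ohm's law per cell, Kirchhoff's current law per column. A minimal sketch with made-up conductance values:

```python
import numpy as np

# Ideal memristor crossbar: weights stored as conductances G (siemens),
# inputs applied as row voltages V, outputs read as column currents I.
# I[j] = sum_i G[i, j] * V[i]  (Ohm's law per cell, KCL per column)
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # 4x3 toy crossbar, conductances in S
V = np.array([0.1, 0.2, 0.0, 0.3])        # row voltages in volts

I = G.T @ V   # the "computation": one analog read step, no weight movement
print(I)      # three column currents, in amperes
```

The matrix product here is what the physics performs for free: every cell multiplies and every column sums, all in one read.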
Comparison:
| Metric | Von Neumann (GPU) | CIM (SIDRA) |
|---|---|---|
| MVM energy | ~1-10 pJ/MAC | ~20-50 fJ/MAC |
| Memory access | Per MVM | Once (programming) |
| Digital/analog | Fully digital | Mixed (crossbar analog) |
| Scale | GB-TB models | MB-GB models (Y1) |
| Flexibility | Anything | AI inference focus |
Three principles of neuromorphic computing:
1. Compute-in-Memory: minimize data movement.
2. Spike/Event-driven: compute only when something happens (chapters 3.1-3.8).
3. Parallel/Asynchronous: no global clock; events drive computation.
SIDRA Y1 implements (1) (analog CIM). Y3+ adds (2) (spike-based). Y100 fully implements (3). The roadmap leans toward neuromorphic.
Formalism: CIM Efficiency and Design Principles
Memory wall, formally:
Total energy per MVM: E_total = E_compute + E_memory + E_interconnect.
GPU (von Neumann):
- E_compute: ~10 pJ/MAC (FP16).
- E_memory: ~100 pJ/MAC (DRAM access).
- E_interconnect: ~50 pJ/MAC.
- Total: ~160 pJ/MAC. Compute itself is only 6%!
SIDRA CIM:
- E_compute: ~0.05 pJ/MAC (crossbar).
- E_memory: 0 (weights stay in place).
- E_interconnect: ~0.05 pJ/MAC (ADC, DAC).
- Total: ~0.1 pJ/MAC. 1600× more efficient, but only for MVM.
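The per-MAC budgets above can be summed directly; a quick sketch (all values are this section's rough estimates, not measurements):

```python
# Per-MAC energy budgets (picojoules), rough figures from this section.
gpu = {"compute": 10.0, "memory": 100.0, "interconnect": 50.0}
cim = {"compute": 0.05, "memory": 0.0, "interconnect": 0.05}  # interconnect = ADC/DAC

e_gpu = sum(gpu.values())   # 160 pJ/MAC total
e_cim = sum(cim.values())   # 0.1 pJ/MAC total
print(f"GPU: {e_gpu} pJ/MAC, compute share {gpu['compute']/e_gpu:.0%}")
print(f"CIM: {e_cim} pJ/MAC, ratio {e_gpu/e_cim:.0f}x")
```

The striking part is not the ratio itself but where the GPU energy goes: roughly 94% of it is moving data, not computing.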
When does CIM win?
AI models are 90%+ MVM. As long as that ratio holds, CIM wins. Non-MVM ops (softmax, LayerNorm, control flow) live in digital CMOS.
Wins:
- Inference (MVM-heavy): big win.
- Training: mixed — forward CIM, backward partial (Module 6).
- Tiny model: ADC overhead dominates → small win.
- Big model: natural CIM advantage.
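Whether the per-MVM advantage translates into a system-level win is an Amdahl's-law question: only the MVM share is accelerated. A hedged sketch (the 90% MVM share and the 1600× per-MVM advantage are this section's rough figures):

```python
def system_gain(mvm_fraction: float, mvm_speedup: float) -> float:
    """Amdahl's law: overall gain when only the MVM share is accelerated."""
    return 1.0 / ((1.0 - mvm_fraction) + mvm_fraction / mvm_speedup)

# 90% MVM with a 1600x per-MVM energy advantage:
print(f"{system_gain(0.90, 1600):.1f}x")   # ~10x overall, not 1600x
print(f"{system_gain(0.99, 1600):.1f}x")   # ~94x when the MVM share is 99%
```

This is why the non-MVM ops (softmax, LayerNorm, control flow) in digital CMOS matter so much: they cap the overall gain regardless of how efficient the crossbar is.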
Neuromorphic architecture types:
1. Digital neuromorphic:
Conventional CMOS, but spike/event-driven:
- IBM TrueNorth (2014): 1M neurons, 256M synapses. Digital, 1 kHz clock, 70 mW. Spike-based but weights are fixed (post-training).
- Intel Loihi (2018): 130K neurons/chip, on-chip STDP learning. Digital CMOS. 100 mW/chip.
- SpiNNaker (Manchester): 1M ARM cores, spike emulation.
2. Analog neuromorphic (CIM-based):
Memristor-based, analog MVM:
- Mythic AI (flash-based): embedded-flash analog MVM. 25 TOPS/W. Commercial 2021-2023.
- Photonic approaches (optical MVM): emerging, 2024+.
- SIDRA YILDIRIM (HfO₂ memristor): 10 TOPS/W Y1, target 300 at Y100.
Comparison:
| Property | Loihi (digital) | Mythic (analog) | SIDRA YILDIRIM (analog memristor) |
|---|---|---|---|
| Core device | CMOS | Embedded flash | HfO₂ memristor |
| MVM type | Digital | Analog current | Analog current |
| Bits | 8-bit typical | ~8-bit effective | 8-bit (256 levels) |
| STDP | Yes | No | Y10+ target |
| Efficiency | ~10 TOPS/W | 25 TOPS/W | 10-300 TOPS/W |
| Product year | 2018 | 2022 | 2026+ |
SIDRA’s edge: Memristor non-volatile + 256 levels + CMOS-process compatible. More precise than flash (bit-level control), lower energy. More scalable than photonics at room temperature.
YILDIRIM chip architecture — three design principles:
Principle 1: Compute-in-Memory (CIM).
Each crossbar is both memory and compute. The 256×256 crossbar is the basic building block. The CMOS substrate (28 nm) hosts the crossbar drivers, ADC/DAC, and control logic.
Principle 2: Analog precision.
8-bit (256-level) cell precision. ISPP keeps programming error ~1% (chapter 5.5). Temperature-aware reads (chapter 5.10).
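Chapter 5.5 covers ISPP in detail; as a rough illustration only, a program-and-verify loop in its spirit can be sketched against a toy cell model (all constants below are invented for illustration, not YILDIRIM parameters):

```python
import random

random.seed(0)

class ToyCell:
    """Toy memristor cell: each SET pulse nudges conductance up, with noise."""
    def __init__(self):
        self.g = 10.0  # arbitrary conductance units

    def pulse(self, amplitude):
        self.g += 0.5 * amplitude + random.gauss(0.0, 0.02)

def ispp_program(cell, target, step=0.05, tol=0.01, max_pulses=200):
    """Program-and-verify loop in the spirit of ISPP: apply a pulse,
    read back, step the amplitude up, stop once the target is reached."""
    amplitude = step
    for _ in range(max_pulses):
        if cell.g >= target * (1 - tol):   # verify read: close enough?
            break
        cell.pulse(amplitude)
        amplitude += step                   # the "incremental step"
    return cell.g

cell = ToyCell()
g = ispp_program(cell, target=42.0)
print(f"programmed to {g:.2f} (target 42.0)")
```

The key idea is closed-loop programming: the cell is never trusted open-loop; every pulse is followed by a verify read, which is how the ~1% programming error is held despite device variability.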
Principle 3: Hierarchical parallelism.
Crossbar → Compute Unit (CU, 16 crossbars) → Cluster (16 CUs) → Chip (4 Clusters) → System (multi-chip).
- Within a crossbar: 65K MACs in parallel (Ohm+KCL).
- Within a CU: 16 crossbars in parallel. 16× throughput.
- Within a Cluster: 16 CUs in parallel. 256× throughput.
- Within a Chip: 4 Clusters = 1024 crossbars in parallel (baseline configuration).
Y1 numbers:
- Crossbar: 256×256 = 65,536 (~65K) cells.
- CU: 16 crossbars ≈ 1.05M cells.
- Cluster: 16 CUs ≈ 16.8M cells.
- Chip: 4 Clusters ≈ 67M cells. The stated Y1 budget of 419M cells therefore implies roughly 25 Clusters (25 × 16.8M ≈ 419M), i.e. more Clusters than the 4-Cluster baseline; the exact floorplan is detailed in chapter 5.4 (YILDIRIM Architecture).
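The hierarchy arithmetic can be checked mechanically (the 419M-cell total is from this section; the implied Cluster count is derived here, not an official figure):

```python
# Cell counts at each level of the stated hierarchy.
crossbar = 256 * 256                 # 65,536 cells
cu       = 16 * crossbar             # ~1.05M cells
cluster  = 16 * cu                   # ~16.8M cells

y1_total = 419_000_000               # stated Y1 cell budget
clusters_needed = y1_total / cluster
print(f"cluster = {cluster:,} cells; Y1 needs ~{clusters_needed:.1f} clusters")
# ~25 clusters, not 4; the exact floorplan is specified in chapter 5.4
```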
Breaking von Neumann:
SIDRA is “hybrid”: CPU + SIDRA. The CPU handles control and non-MVM ops; SIDRA handles MVM; a fast bus (PCIe 5.0 in Y1) connects them. Even so, data movement for MVM is minimized, so CIM wins on the MVM-dominated (90%+) share of AI inference workloads.
Counter-arguments and limits:
- Training difficulty: CIM backward pass is hard in hardware. Y1 inference-only.
- Endurance: memristor lifetime is limited (~10⁹ SET/RESET cycles); training would wear cells out quickly.
- Flexibility: changing weights requires reprogramming (microsecond-millisecond).
- Noise: analog → 6-8 effective bits; not enough for high precision.
SIDRA’s answer: inference-focused + 256-level + ISPP + peripheral circuitry + compiler optimization. As a package: 10-300 TOPS/W.
Experiment: GPT-2 Inference Energy Analysis
GPT-2 small, single token inference:
- Parameters: 124M × 2 byte (FP16) = 248 MB.
- FLOPs: ~250 MFLOPs.
- Memory access: all parameters once (from DRAM).
NVIDIA H100 (von Neumann):
- Compute energy: 250 MFLOP × 10 pJ ≈ 2.5 mJ.
- Memory energy: 248 MB × 100 pJ/byte ≈ 25 mJ (DRAM).
- Interconnect: ~5 mJ.
- Total: ~32 mJ. Memory dominates.
SIDRA Y1 (CIM):
- Compute energy: 250 MFLOP × 0.05 pJ ≈ 12.5 µJ.
- Memory energy: 0 (in place).
- ADC/DAC: 0.05 pJ/MAC × 250M ≈ 12.5 µJ.
- Total: ~25 µJ.
Ratio: H100 / SIDRA = 32 mJ / 25 µJ = 1280×. Theoretical. Practically, SIDRA Y1 prototype expects 50-100× efficiency once overheads are counted.
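The estimate can be reproduced as straight arithmetic (constants are this section's rough per-operation figures, not measurements):

```python
PJ = 1e-12  # joules per picojoule

flops    = 250e6   # ~250 MFLOPs per GPT-2-small token
params_b = 248e6   # 124M params x 2 bytes (FP16)

# H100-style von Neumann estimate: compute + DRAM traffic + interconnect.
h100 = flops * 10 * PJ + params_b * 100 * PJ + 5e-3
# SIDRA CIM estimate: weights stay in place, only compute + ADC/DAC.
sidra = flops * 0.05 * PJ + flops * 0.05 * PJ

print(f"H100 ~{h100*1e3:.0f} mJ, SIDRA ~{sidra*1e6:.1f} uJ, "
      f"ratio ~{h100/sidra:.0f}x")
```

Rounding the totals to 32 mJ and 25 µJ gives the ~1280× headline figure; the unrounded arithmetic lands slightly higher, still within the same order.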
Latency:
- H100: ~1 µs/token (batch 1), 0.01 µs/token (batch 32).
- SIDRA Y1: ~100-1000 µs/token (sequential passes through a single crossbar).
But SIDRA Y3+ runs multiple crossbars in parallel, so latency drops; by Y10 it should be comparable to datacenter-class hardware for GPT-3-scale inference.
Bottom line: SIDRA is ideal for low-power edge inference. H100 is ideal for high-throughput datacenter training. They run side by side, not as competitors.
Quick Quiz
Lab Exercise
SIDRA Y1 vs Raspberry Pi edge inference comparison.
Raspberry Pi 5 (typical edge AI):
- CPU: 4-core ARM Cortex-A76, 2.4 GHz.
- AI performance: ~10 GOPS INT8 (with Coral TPU, ~4 TOPS).
- Power: ~5 W total.
- Memory: 8 GB DDR4.
SIDRA Y1 (edge):
- CIM: 30 TOPS analog.
- Power: 3 W.
- Memory (model): 419M × 1 byte = 419 MB on-chip (non-volatile).
Scenario: real-time speech recognition in a smartphone app (Whisper-tiny model, 39M parameters).
Questions:
(a) Does the model fit Raspberry Pi 5 memory? SIDRA Y1? (b) Whisper-tiny inference time on Raspberry Pi 5 (~30 MFLOP/sec for real-time)? (c) Same on SIDRA Y1? (d) Energy for a day’s use (10% activity)? (e) Why is SIDRA advantageous in this scenario?
Solutions
(a) Raspberry Pi: 39M × 2 byte = 78 MB → fits comfortably. SIDRA Y1: 39M < 419M → fits, 9% used. The other 91% open for other models.
(b) Raspberry Pi 5 with Coral TPU: 30 MFLOP / 4 TOPS ≈ 8 µs/inference. Real-time-capable.
(c) SIDRA Y1: 30 MFLOP / 30 TOPS analog → 1 µs/inference. 8× faster.
(d) Per hour (scale ×24 for a full day): 3600 s × 10% activity = 360 s active window; at ~100 inferences/s that is 36,000 inferences/hour.
- Raspberry Pi: 8 µs × 36K × 5 W ≈ 1.44 J active + 5 W idle × ~3240 s ≈ 16.2 kJ. Idle dominates.
- SIDRA Y1: 1 µs × 36K × 3 W ≈ 0.1 J active + 3 W idle × ~3600 s ≈ 10.8 kJ. About a 33% saving in edge use.
(e) SIDRA wins on: (1) low idle power (non-volatile, memristor zero-power asleep), (2) shorter active time (8× speed), (3) persistent model (no cold start). Battery life is the critical edge metric → SIDRA leads.
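The energy arithmetic in (b)-(d) can be checked in one place (throughput and power figures are the exercise's rough assumptions):

```python
# Exercise assumptions (rough figures from the lab setup above).
n_inf = 100 * 0.10 * 3600            # ~36,000 inferences per hour at 10% duty

# Active energy: inference time x count x active power.
pi_active    = 8e-6 * n_inf * 5.0    # ~1.44 J
sidra_active = 1e-6 * n_inf * 3.0    # ~0.11 J

# Idle draw dominates on both platforms (idle windows as in the solution above).
pi_total    = pi_active    + 5.0 * 3240   # ~16.2 kJ
sidra_total = sidra_active + 3.0 * 3600   # ~10.8 kJ

print(f"Pi ~{pi_total/1e3:.1f} kJ/h, SIDRA ~{sidra_total/1e3:.1f} kJ/h, "
      f"saving ~{1 - sidra_total/pi_total:.0%}")
```

Note how small the active-inference terms are: the whole comparison is decided by idle power, which is exactly the point made in (e).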
Real-product estimate: 2027-2028 SIDRA Y3-based smart earbuds / home assistant → continuous listening + speech recognition, 24-hour battery. Today’s solutions: 4-8 hours.
Cheat Sheet
- Von Neumann bottleneck: CPU/memory split → expensive data movement.
- Memory wall: in AI workloads, memory access costs 10-100× the compute.
- CIM (Compute-in-Memory): memory and compute in the same place → memristor crossbar runs MVM in place.
- SIDRA YILDIRIM: analog memristor CIM. HfO₂, 256 levels, 10 TOPS/W Y1.
- Rivals: Loihi (digital spike), Mythic (flash analog), Rain (photonic). Different trade-offs.
- Three design principles: CIM, analog precision, hierarchical parallelism.
- Limit: Y1 inference-only; backward pass hard in hardware (3.6).
Vision: The Post-Von-Neumann Era
The 80-year von Neumann architecture is slowly giving way to message-passing parallel heterogeneous architectures. SIDRA is a concrete example of the transition:
- Y1 (today): Hybrid (CPU + SIDRA). CIM for inference; CPU for control + non-MVM.
- Y3 (2027): Larger SIDRA, smaller CPU. Adds spike-based inference. Datacenter deployments.
- Y10 (2029): SIDRA fully dominant in inference. Minimal CPU. Edge AI widespread.
- Y100 (2031+): Von Neumann largely bypassed. CIM + spike + photonic. Same architecture in the datacenter and at the edge.
- Y1000 (long horizon): Compute-in-sensor. Cameras, microphones, sensors are themselves AI hardware. No data center.
Meaning for Türkiye: leaving von Neumann = leaving the classical CPU/GPU race. Türkiye’s national AI architecture claim lives at this intersection. SIDRA YILDIRIM = Türkiye’s concrete hardware example of “we are in the race”. With academia + workshop + industry combined, 2028-2030 could see Türkiye among the top 10 neuromorphic companies globally.
Unexpected future: a neuromorphic OS. Today’s operating systems assume von Neumann. As SIDRA-class hardware spreads, a new OS paradigm becomes necessary: event-driven, spike-queued, asynchronous. A “neuromorphic core” Linux module. The first sketch appears in Module 6 (the software stack).
Further Reading
- Next chapter: 5.2 — Deep Dive: The Memristor
- Previous module: 4.8 — Linear Algebra Laboratory
- Von Neumann original: J. von Neumann, First Draft of a Report on the EDVAC, 1945.
- Memory wall: Wulf & McKee, Hitting the memory wall, ACM SIGARCH Comput. Archit. News 1995.
- Neuromorphic concept: Carver Mead, Analog VLSI and Neural Systems, 1989.
- IBM TrueNorth: Merolla et al., A million spiking-neuron integrated circuit…, Science 2014.
- Intel Loihi: Davies et al., Loihi: A neuromorphic manycore processor with on-chip learning, IEEE Micro 2018.
- CIM review: Sebastian et al., Memory devices and applications for in-memory computing, Nature Nanotech. 2020.