The Neuromorphic Computing Paradigm
Breaking the von Neumann wall, and the SIDRA YILDIRIM choice.
Prerequisites
What you'll learn here
- State the von Neumann architecture limit (memory wall) and the neuromorphic fix
- Describe how compute-in-memory is realized in SIDRA YILDIRIM
- Compare digital neuromorphic (Loihi, TrueNorth) with analog (SIDRA)
- Explain YILDIRIM's three core design principles (compute-in-memory, analog precision, hierarchical parallelism)
- Place neuromorphic computing in industry context and SIDRA's slot in the category
Hook: The 1945 Wall, Today
In 1945 John von Neumann described modern computer architecture: CPU on one side, memory on the other, connected by a bus.
That architecture has stood for 80 years. But in the AI era it has hit a wall:
- CPU compute speed: improving ~20% per year.
- Memory speed: improving ~5% per year.
- Memory access is 100-1000× slower than CPU compute.
- About 70% of GPT-3 inference time is waiting for memory bandwidth, not computing.
This is the memory wall or von Neumann bottleneck. The fix? Put compute and memory in the same place → Compute-in-Memory (CIM). This is the core idea of neuromorphic computing.
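The gap compounds year over year; a quick sketch using the rough growth rates quoted above (back-of-the-envelope figures, not measured data):

```python
# Rough compounding of the CPU/memory speed gap.
# Growth rates are the approximate figures quoted above.
cpu_growth, mem_growth = 1.20, 1.05  # ~20% and ~5% per year

def gap_after(years: int) -> float:
    """Relative CPU-vs-memory speed gap after `years` of compounding."""
    return (cpu_growth ** years) / (mem_growth ** years)

for years in (10, 20, 40):
    print(f"{years} years: {gap_after(years):.1f}x gap")
```

After two decades of compounding, a modest annual difference widens into a double-digit gap, which is why the wall only appeared decades after 1945.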
SIDRA YILDIRIM’s choice: analog compute-in-memory. The memristor crossbar both stores weights and performs MVM → memory-compute unification. This module (5) covers the silicon details. This chapter explains the paradigm.
Intuition: Memory and Compute Together
Traditional (von Neumann):
[CPU] ←──bus──→ [DRAM]
  ↑                ↑
MAC unit        Weights

For every MVM: read weights from DRAM → travel the bus → into a CPU register → MAC → write the result back. Data movement = energy + time. Memory access is 100-1000× more expensive than the MAC itself.
Compute-in-Memory (SIDRA YILDIRIM):
[Crossbar]
↑
Weights in place
MVM in place (Ohm + KCL)
Output = analog current

Weights never move. Apply input voltages → collect output currents. Memory = compute. We saw the math in 4.2.
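The in-place MVM can be mimicked numerically. In an ideal crossbar the column currents are I = Gᵀ·V: Ohm's law per cell, Kirchhoff's current law per column. A minimal sketch with made-up conductance values:

```python
import numpy as np

# Ideal memristor crossbar: weights stored as conductances G (siemens),
# inputs applied as row voltages V, outputs read as column currents I.
# I[j] = sum_i G[i, j] * V[i]  (Ohm's law per cell, KCL per column)
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # 4x3 toy crossbar, conductances in S
V = np.array([0.1, 0.2, 0.0, 0.3])        # row voltages in volts

I = G.T @ V   # the "computation": one analog read step, no weight movement
print(I)      # three column currents, in amperes
```

The matrix product here is what the physics performs for free: every cell multiplies and every column sums, all in one read.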
Comparison:
| Metric | Von Neumann (GPU) | CIM (SIDRA) |
|---|---|---|
| MVM energy | ~1-10 pJ/MAC | ~20-50 fJ/MAC |
| Memory access | Per MVM | Once (programming) |
| Digital/analog | Fully digital | Mixed (crossbar analog) |
| Scale | GB-TB models | MB-GB models (Y1) |
| Flexibility | Anything | AI inference focus |
Three principles of neuromorphic computing:
1. Compute-in-Memory: minimize data movement.
2. Spike/Event-driven: compute only when something happens (chapters 3.1-3.8).
3. Parallel/Asynchronous: no global clock; events drive computation.
SIDRA Y1 implements (1) (analog CIM). Y3+ adds (2) (spike-based). Y100 fully implements (3). The roadmap leans toward neuromorphic.
Formalism: CIM Efficiency and Design Principles
Memory wall, formally:
Total energy per MVM: E_total = E_compute + E_memory + E_interconnect.
GPU (von Neumann):
- E_compute: ~10 pJ/MAC (FP16).
- E_memory: ~100 pJ/MAC (DRAM access).
- E_interconnect: ~50 pJ/MAC.
- Total: ~160 pJ/MAC. Compute itself is only 6%!
SIDRA CIM:
- E_compute: ~0.05 pJ/MAC (crossbar).
- E_memory: 0 (weights stay in place).
- E_interconnect: ~0.05 pJ/MAC (ADC, DAC).
- Total: ~0.1 pJ/MAC. 1600× more efficient, but only for MVM.
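The per-MAC budgets above can be summed directly; a quick sketch (all values are this section's rough estimates, not measurements):

```python
# Per-MAC energy budgets (picojoules), rough figures from this section.
gpu = {"compute": 10.0, "memory": 100.0, "interconnect": 50.0}
cim = {"compute": 0.05, "memory": 0.0, "interconnect": 0.05}  # interconnect = ADC/DAC

e_gpu = sum(gpu.values())   # 160 pJ/MAC total
e_cim = sum(cim.values())   # 0.1 pJ/MAC total
print(f"GPU: {e_gpu} pJ/MAC, compute share {gpu['compute']/e_gpu:.0%}")
print(f"CIM: {e_cim} pJ/MAC, ratio {e_gpu/e_cim:.0f}x")
```

The striking part is not the ratio itself but where the GPU energy goes: roughly 94% of it is moving data, not computing.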
When does CIM win?
AI models are 90%+ MVM. As long as that ratio holds, CIM wins. Non-MVM ops (softmax, LayerNorm, control flow) live in digital CMOS.
Wins:
- Inference (MVM-heavy): big win.
- Training: mixed — forward CIM, backward partial (Module 6).
- Tiny model: ADC overhead dominates → small win.
- Big model: natural CIM advantage.
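Whether the per-MVM advantage translates into a system-level win is an Amdahl's-law question: only the MVM share is accelerated. A hedged sketch (the 90% MVM share and the 1600× per-MVM advantage are this section's rough figures):

```python
def system_gain(mvm_fraction: float, mvm_speedup: float) -> float:
    """Amdahl's law: overall gain when only the MVM share is accelerated."""
    return 1.0 / ((1.0 - mvm_fraction) + mvm_fraction / mvm_speedup)

# 90% MVM with a 1600x per-MVM energy advantage:
print(f"{system_gain(0.90, 1600):.1f}x")   # ~10x overall, not 1600x
print(f"{system_gain(0.99, 1600):.1f}x")   # ~94x when the MVM share is 99%
```

This is why the non-MVM ops (softmax, LayerNorm, control flow) in digital CMOS matter so much: they cap the overall gain regardless of how efficient the crossbar is.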
Neuromorphic architecture types:
1. Digital neuromorphic:
Conventional CMOS, but spike/event-driven:
- IBM TrueNorth (2014): 1M neurons, 256M synapses. Digital, 1 kHz clock, 70 mW. Spike-based but weights are fixed (post-training).
- Intel Loihi (2018): 130K neurons/chip, on-chip STDP learning. Digital CMOS. 100 mW/chip.
- SpiNNaker (Manchester): 1M ARM cores, spike emulation.
2. Analog neuromorphic (CIM-based):
Memristor-based, analog MVM:
- Mythic AI (flash-based): embedded-flash analog MVM. 25 TOPS/W. Commercial 2021-2023.
- Photonic approaches (optical MVM): emerging, 2024+.
- SIDRA YILDIRIM (HfO₂ memristor): 10 TOPS/W Y1, target 300 at Y100.
Comparison:
| Property | Loihi (digital) | Mythic (analog) | SIDRA YILDIRIM (analog memristor) |
|---|---|---|---|
| Core device | CMOS | Embedded flash | HfO₂ memristor |
| MVM type | Digital | Analog current | Analog current |
| Bits | 8-bit typical | ~8-bit effective | 8-bit (256 levels) |
| STDP | Yes | No | Y10+ target |
| Efficiency | ~10 TOPS/W | 25 TOPS/W | 10-300 TOPS/W |
| Product year | 2018 | 2022 | 2026+ |
SIDRA’s edge: Memristor non-volatile + 256 levels + CMOS-process compatible. More precise than flash (bit-level control), lower energy. More scalable than photonics at room temperature.
YILDIRIM chip architecture — three design principles:
Principle 1: Compute-in-Memory (CIM).
Each crossbar is both memory and compute. The 256×256 crossbar is the basic building block. The CMOS substrate (28 nm) hosts the crossbar drivers, ADC/DAC, and control logic.
Principle 2: Analog precision.
8-bit (256-level) cell precision. ISPP keeps programming error ~1% (chapter 5.5). Temperature-aware reads (chapter 5.10).
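Chapter 5.5 covers ISPP in detail; as a rough illustration only, a program-and-verify loop in its spirit can be sketched against a toy cell model (all constants below are invented for illustration, not YILDIRIM parameters):

```python
import random

random.seed(0)

class ToyCell:
    """Toy memristor cell: each SET pulse nudges conductance up, with noise."""
    def __init__(self):
        self.g = 10.0  # arbitrary conductance units

    def pulse(self, amplitude):
        self.g += 0.5 * amplitude + random.gauss(0.0, 0.02)

def ispp_program(cell, target, step=0.05, tol=0.01, max_pulses=200):
    """Program-and-verify loop in the spirit of ISPP: apply a pulse,
    read back, step the amplitude up, stop once the target is reached."""
    amplitude = step
    for _ in range(max_pulses):
        if cell.g >= target * (1 - tol):   # verify read: close enough?
            break
        cell.pulse(amplitude)
        amplitude += step                   # the "incremental step"
    return cell.g

cell = ToyCell()
g = ispp_program(cell, target=42.0)
print(f"programmed to {g:.2f} (target 42.0)")
```

The key idea is closed-loop programming: the cell is never trusted open-loop; every pulse is followed by a verify read, which is how the ~1% programming error is held despite device variability.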
Principle 3: Hierarchical parallelism.
Crossbar → Compute Unit (CU, 16 crossbars) → Cluster (16 CUs) → Chip (4 Clusters) → System (multi-chip).
- Within a crossbar: 65K MACs in parallel (Ohm+KCL).
- Within a CU: 16 crossbars in parallel. 16× throughput.
- Within a Cluster: 16 CUs in parallel. 256× throughput.
- Within a Chip: 4 Clusters = 1024 crossbars in parallel (baseline configuration).
Y1 numbers:
- Crossbar: 256×256 = 65,536 (~65K) cells.
- CU: 16 crossbars ≈ 1.05M cells.
- Cluster: 16 CUs ≈ 16.8M cells.
- Chip: 4 Clusters ≈ 67M cells. The stated Y1 budget of 419M cells therefore implies roughly 25 Clusters (25 × 16.8M ≈ 419M), i.e. more Clusters than the 4-Cluster baseline; the exact floorplan is detailed in chapter 5.4 (YILDIRIM Architecture).
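The hierarchy arithmetic can be checked mechanically (the 419M-cell total is from this section; the implied Cluster count is derived here, not an official figure):

```python
# Cell counts at each level of the stated hierarchy.
crossbar = 256 * 256                 # 65,536 cells
cu       = 16 * crossbar             # ~1.05M cells
cluster  = 16 * cu                   # ~16.8M cells

y1_total = 419_000_000               # stated Y1 cell budget
clusters_needed = y1_total / cluster
print(f"cluster = {cluster:,} cells; Y1 needs ~{clusters_needed:.1f} clusters")
# ~25 clusters, not 4; the exact floorplan is specified in chapter 5.4
```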
Breaking von Neumann:
SIDRA is “hybrid”: CPU + SIDRA. The CPU handles control and non-MVM ops; SIDRA handles MVM; a fast bus (PCIe 5.0 in Y1) connects them. Even so, data movement for MVM is minimized, so CIM wins on the MVM-dominated (90%+) share of AI inference workloads.
Counter-arguments and limits:
- Training difficulty: CIM backward pass is hard in hardware. Y1 inference-only.
- Endurance: memristor lifetime is limited (~10⁹ SET/RESET cycles); training would wear cells out quickly.
- Flexibility: changing weights requires reprogramming (microsecond-millisecond).
- Noise: analog → 6-8 effective bits; not enough for high precision.
SIDRA’s answer: inference-focused + 256-level + ISPP + peripheral circuitry + compiler optimization. As a package: 10-300 TOPS/W.
Experiment: GPT-2 Inference Energy Analysis
GPT-2 small, single token inference:
- Parameters: 124M × 2 byte (FP16) = 248 MB.
- FLOPs: ~250 MFLOPs.
- Memory access: all parameters once (from DRAM).
NVIDIA H100 (von Neumann):
- Compute energy: 250 MFLOP × 10 pJ ≈ 2.5 mJ.
- Memory energy: 248 MB × 100 pJ/byte ≈ 25 mJ (DRAM).
- Interconnect: ~5 mJ.
- Total: ~32 mJ. Memory dominates.
SIDRA Y1 (CIM):
- Compute energy: 250 MFLOP × 0.05 pJ ≈ 12.5 µJ.
- Memory energy: 0 (in place).
- ADC/DAC: 0.05 pJ/MAC × 250M ≈ 12.5 µJ.
- Total: ~25 µJ.
Ratio: H100 / SIDRA = 32 mJ / 25 µJ = 1280×. Theoretical. Practically, SIDRA Y1 prototype expects 50-100× efficiency once overheads are counted.
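The estimate can be reproduced as straight arithmetic (constants are this section's rough per-operation figures, not measurements):

```python
PJ = 1e-12  # joules per picojoule

flops    = 250e6   # ~250 MFLOPs per GPT-2-small token
params_b = 248e6   # 124M params x 2 bytes (FP16)

# H100-style von Neumann estimate: compute + DRAM traffic + interconnect.
h100 = flops * 10 * PJ + params_b * 100 * PJ + 5e-3
# SIDRA CIM estimate: weights stay in place, only compute + ADC/DAC.
sidra = flops * 0.05 * PJ + flops * 0.05 * PJ

print(f"H100 ~{h100*1e3:.0f} mJ, SIDRA ~{sidra*1e6:.1f} uJ, "
      f"ratio ~{h100/sidra:.0f}x")
```

Rounding the totals to 32 mJ and 25 µJ gives the ~1280× headline figure; the unrounded arithmetic lands slightly higher, still within the same order.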
Latency:
- H100: ~1 µs/token (batch 1), 0.01 µs/token (batch 32).
- SIDRA Y1: ~100-1000 µs/token (sequential passes through a single crossbar).
But SIDRA Y3+ runs multiple crossbars in parallel, so latency drops; by Y10 it should be comparable to datacenter-class hardware for GPT-3-scale inference.
Bottom line: SIDRA is ideal for low-power edge inference. H100 is ideal for high-throughput datacenter training. They run side by side, not as competitors.
Quick Quiz
Lab Exercise
SIDRA Y1 vs Raspberry Pi edge inference comparison.
Raspberry Pi 5 (typical edge AI):
- CPU: 4-core ARM Cortex-A76, 2.4 GHz.
- AI performance: ~10 GOPS INT8 (with Coral TPU, ~4 TOPS).
- Power: ~5 W total.
- Memory: 8 GB DDR4.
SIDRA Y1 (edge):
- CIM: 30 TOPS analog.
- Power: 3 W.
- Memory (model): 419M × 1 byte = 419 MB on-chip (non-volatile).
Scenario: real-time speech recognition in a smartphone app (Whisper-tiny model, 39M parameters).
Questions:
(a) Does the model fit Raspberry Pi 5 memory? SIDRA Y1? (b) Whisper-tiny inference time on Raspberry Pi 5 (~30 MFLOP/sec for real-time)? (c) Same on SIDRA Y1? (d) Energy for a day’s use (10% activity)? (e) Why is SIDRA advantageous in this scenario?
Solutions
(a) Raspberry Pi: 39M × 2 byte = 78 MB → fits comfortably. SIDRA Y1: 39M < 419M → fits, 9% used. The other 91% open for other models.
(b) Raspberry Pi 5 with Coral TPU: 30 MFLOP / 4 TOPS ≈ 8 µs/inference. Real-time-capable.
(c) SIDRA Y1: 30 MFLOP / 30 TOPS analog → 1 µs/inference. 8× faster.
(d) Per hour (scale ×24 for a full day): 3600 s × 10% activity = 360 s active window; at ~100 inferences/s that is 36,000 inferences/hour.
- Raspberry Pi: 8 µs × 36K × 5 W ≈ 1.44 J active + 5 W idle × ~3240 s ≈ 16.2 kJ. Idle dominates.
- SIDRA Y1: 1 µs × 36K × 3 W ≈ 0.1 J active + 3 W idle × ~3600 s ≈ 10.8 kJ. About a 33% saving in edge use.
(e) SIDRA wins on: (1) low idle power (non-volatile, memristor zero-power asleep), (2) shorter active time (8× speed), (3) persistent model (no cold start). Battery life is the critical edge metric → SIDRA leads.
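The energy arithmetic in (b)-(d) can be checked in one place (throughput and power figures are the exercise's rough assumptions):

```python
# Exercise assumptions (rough figures from the lab setup above).
n_inf = 100 * 0.10 * 3600            # ~36,000 inferences per hour at 10% duty

# Active energy: inference time x count x active power.
pi_active    = 8e-6 * n_inf * 5.0    # ~1.44 J
sidra_active = 1e-6 * n_inf * 3.0    # ~0.11 J

# Idle draw dominates on both platforms (idle windows as in the solution above).
pi_total    = pi_active    + 5.0 * 3240   # ~16.2 kJ
sidra_total = sidra_active + 3.0 * 3600   # ~10.8 kJ

print(f"Pi ~{pi_total/1e3:.1f} kJ/h, SIDRA ~{sidra_total/1e3:.1f} kJ/h, "
      f"saving ~{1 - sidra_total/pi_total:.0%}")
```

Note how small the active-inference terms are: the whole comparison is decided by idle power, which is exactly the point made in (e).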
Real-product estimate: 2027-2028 SIDRA Y3-based smart earbuds / home assistant → continuous listening + speech recognition, 24-hour battery. Today’s solutions: 4-8 hours.
Cheat Sheet
- Von Neumann bottleneck: CPU/memory split → expensive data movement.
- Memory wall: in AI workloads, memory access costs 10-100× the compute.
- CIM (Compute-in-Memory): memory and compute in the same place → memristor crossbar runs MVM in place.
- SIDRA YILDIRIM: analog memristor CIM. HfO₂, 256 levels, 10 TOPS/W Y1.
- Rivals: Loihi (digital spike), Mythic (flash analog), Rain (photonic). Different trade-offs.
- Three design principles: CIM, analog precision, hierarchical parallelism.
- Limit: Y1 inference-only; backward pass hard in hardware (3.6).
Vision: The Post-Von-Neumann Era
The 80-year von Neumann architecture is slowly giving way to message-passing parallel heterogeneous architectures. SIDRA is a concrete example of the transition:
- Y1 (today): Hybrid (CPU + SIDRA). CIM for inference; CPU for control + non-MVM.
- Y3 (2027): Larger SIDRA, smaller CPU. Adds spike-based inference. Datacenter deployments.
- Y10 (2029): SIDRA fully dominant in inference. Minimal CPU. Edge AI widespread.
- Y100 (2031+): Von Neumann largely bypassed. CIM + spike + photonic. Same architecture in the datacenter and at the edge.
- Y1000 (long horizon): Compute-in-sensor. Cameras, microphones, sensors are themselves AI hardware. No data center.
Meaning for Türkiye: leaving von Neumann = leaving the classical CPU/GPU race. Türkiye’s national AI architecture claim lives at this intersection. SIDRA YILDIRIM = Türkiye’s concrete hardware example of “we are in the race”. With academia + workshop + industry combined, 2028-2030 could see Türkiye among the top 10 neuromorphic companies globally.
Unexpected future: a neuromorphic OS. Today’s operating systems assume von Neumann. As SIDRA-class hardware spreads, a new OS paradigm becomes necessary: event-driven, spike-queued, asynchronous. A “neuromorphic core” Linux module. The first sketch appears in Module 6 (the software stack).
Further Reading
- Next chapter: 5.2 — Deep Dive: The Memristor
- Previous module: 4.8 — Linear Algebra Laboratory
- Von Neumann original: J. von Neumann, First Draft of a Report on the EDVAC, 1945.
- Memory wall: Wulf & McKee, Hitting the memory wall, ACM SIGARCH Comput. Archit. News 1995.
- Neuromorphic concept: Carver Mead, Analog VLSI and Neural Systems, 1989.
- IBM TrueNorth: Merolla et al., A million spiking-neuron integrated circuit…, Science 2014.
- Intel Loihi: Davies et al., Loihi: A neuromorphic manycore processor with on-chip learning, IEEE Micro 2018.
- CIM review: Sebastian et al., Memory devices and applications for in-memory computing, Nature Nanotech. 2020.