📐 Module 4 · The Math Arsenal · Chapter 4.4 · 12 min read

Probability and Noise

Memristor noise isn't always a bug — sometimes it's a feature.

What you'll learn here

  • Define random variable, expected value (E), variance (Var)
  • State the formulas for Normal (Gaussian), Bernoulli, Poisson distributions and their use cases
  • Explain the physics of thermal (Johnson), shot, and 1/f noise
  • Compute SNR (Signal-to-Noise Ratio) for a SIDRA crossbar
  • Show how noise can be useful in AI (regularizer, dropout)

Hook: Perfection Isn't Possible — and Isn't Needed

An ideal chip: every signal precise, every measurement correct, every computation deterministic. A practical chip: noise on every signal, error in every measurement, estimation in every computation.

A SIDRA Y1 cell stores 8-bit (256-level) conductance. But thermal noise, shot noise, drift, IR drop, and temperature swings drop effective accuracy to ~6 bits. Two bits lost. Is that a problem?

Answer: usually not. Sometimes an advantage.

  • 6 bits is enough for AI inference (INT8 is standard, INT4 is widespread).
  • Noise plays the role of a regularizer in classical AI (dropout, weight noise).
  • The brain synapse is already noisy (vesicles are probabilistic) — feature, not bug.
  • SIDRA’s real position: not “deterministic digital”, but “noisy but efficient analog”.

This chapter covers probability fundamentals, noise sources, how SIDRA measures and tames them, and shows that noise can help AI learning.

Intuition: Probability and Expected Value

A random variable (RV) is a variable whose value is determined by a random process rather than fixed in advance.

  • A die roll: X ∈ {1, 2, 3, 4, 5, 6}, each value with equal probability.
  • A memristor read current: I = µ + ε, where µ is the “true” value and ε is Gaussian noise.

A probability distribution assigns a probability to every possible value.

  • Die: P(X = k) = 1/6 for every k.
  • Memristor: ε ~ N(0, σ²) — a zero-mean Gaussian.

Expected value (E): long-run average.

  • Die: E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
  • Memristor: E[I] = µ (the noise has zero mean).

Variance: how scattered around the mean.

  • Die: Var[X] = E[(X − 3.5)²] ≈ 2.92.
  • Memristor: Var[I] = σ².

Standard deviation: √Var. Same units as the variable; the size of a “typical deviation”.

  • Memristor: σ. Typical SIDRA: σ ≈ 5% of µ.

Intuition: a single measurement is noisy, but the average of many measurements is much sharper. The standard deviation of an N-sample average is σ/√N (and by the central limit theorem the average is approximately Gaussian). 100 measurements → 10× improvement.

SIDRA practical use: if a single MVM is repeated 10× and averaged, read noise drops by √10 ≈ 3.2×, lifting effective accuracy from 6 bits to roughly 7.5 — but throughput drops 10×. Trade-off.
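
The √N effect is easy to see numerically; a minimal sketch with NumPy, using the ~5% read-noise figure quoted above (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 12.5e-6        # "true" read current: 12.5 µA
sigma = 0.05 * mu   # ~5% read noise, as quoted in the text

# 100,000 single reads vs. 100,000 averages of N = 100 reads each
single = rng.normal(mu, sigma, size=100_000)
averaged = rng.normal(mu, sigma, size=(100_000, 100)).mean(axis=1)

print(f"std of a single read : {single.std():.3e} A")   # ~ sigma
print(f"std of a 100-read avg: {averaged.std():.3e} A")  # ~ sigma / 10
```

The second standard deviation comes out ~10× smaller, matching σ/√100.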

Formalism: Distributions, Noise Models, SNR

L1 · Basics

Three core distributions:

Bernoulli: X ∈ {0, 1}, with P(X = 1) = p.

  • Expected value: E[X] = p.
  • Variance: Var[X] = p(1 − p).
  • Use: single-bit event (vesicle release, bit read).

Normal (Gaussian): X ~ N(µ, σ²).

  • Density: f(x) = exp(−(x − µ)² / (2σ²)) / (σ√(2π)).
  • E[X] = µ, Var[X] = σ².
  • Use: thermal noise, measurement error, weight initialization.

Poisson: X ∈ {0, 1, 2, …}, with P(X = k) = λ^k · e^(−λ) / k!.

  • E[X] = Var[X] = λ.
  • Use: spike count, photon count, rare events.

Expected-value rules:

  • Linearity: E[aX + bY] = a·E[X] + b·E[Y].
  • If independent: E[XY] = E[X]·E[Y].

Variance rules:

  • Var[aX] = a²·Var[X].
  • If independent: Var[X + Y] = Var[X] + Var[Y].
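These moments and rules can be sanity-checked by simulation; a sketch with NumPy (the parameter values are arbitrary, chosen only for the check):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

p, mu, sigma, lam = 0.3, 2.0, 0.5, 4.0
bern = rng.binomial(1, p, n)      # Bernoulli(p)
norm = rng.normal(mu, sigma, n)   # N(mu, sigma^2)
pois = rng.poisson(lam, n)        # Poisson(lam)

# Sample moments match the formulas above to within sampling error
assert abs(bern.mean() - p) < 0.01 and abs(bern.var() - p * (1 - p)) < 0.01
assert abs(norm.mean() - mu) < 0.01 and abs(norm.var() - sigma**2) < 0.01
assert abs(pois.mean() - lam) < 0.02 and abs(pois.var() - lam) < 0.05

# Independence rules, with X and Y independent normals
x, y = rng.normal(1, 0.2, n), rng.normal(3, 0.4, n)
assert abs((x * y).mean() - x.mean() * y.mean()) < 0.01  # E[XY] = E[X]E[Y]
assert abs((x + y).var() - (x.var() + y.var())) < 0.01   # Var[X+Y] = Var[X]+Var[Y]
print("all moment checks passed")
```
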

L2 · Full

Three physical noise sources:

1. Thermal noise (Johnson-Nyquist):

σ_I² = 4 k T G Δf
  • kk = Boltzmann (1.38 × 10⁻²³ J/K)
  • TT = temperature (K)
  • GG = conductance (S)
  • Δf = bandwidth (Hz)

Numbers: T = 300 K, G = 100 µS, Δf = 100 MHz → σ_I = √(4 · 1.38×10⁻²³ · 300 · 10⁻⁴ · 10⁸) = √(1.66×10⁻¹⁶) A ≈ 12.9 nA.

Typical MVM output: 1-10 µA → against thermal noise alone, an amplitude SNR of roughly 80-800 (≈38-58 dB); once shot, 1/f, and programming noise are added, the Y1 total lands at ~30-40 dB.

2. Shot noise:

σ_I² = 2 q I Δf
  • qq = electron charge (1.6 × 10⁻¹⁹ C)
  • II = average current

Numbers: I = 1 µA, Δf = 100 MHz → σ_I = √(2 · 1.6×10⁻¹⁹ · 10⁻⁶ · 10⁸) = √(3.2×10⁻¹⁷) ≈ 5.7 nA.

Dominates at low current. Same order as thermal.

3. 1/f (flicker) noise:

S_I(f) = K I² / f
  • KK = material constant (HfO₂ ≈ 10⁻¹¹).

Grows as frequency drops (slow drift source). Dominates over long retention.

Total noise (independent sources):

σ_total² = σ_thermal² + σ_shot² + σ_1/f²

SNR (Signal-to-Noise Ratio):

SNR = µ² / σ_total² = P_signal / P_noise

In dB: SNR_dB = 10 log₁₀(SNR).

  • 30 dB → 1000× signal-to-noise power ratio → ~5 effective bits.
  • 40 dB → 10,000× → ~6.5 bits.
  • 60 dB → 10⁶× → ~10 bits.

SIDRA Y1 target: ~30-40 dB.
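
The noise formulas and the example numbers above, collected into a short script (1/f noise is omitted here — its contribution depends on the integration band and the material constant K):

```python
import math

k_B = 1.38e-23  # Boltzmann constant (J/K)
q = 1.6e-19     # electron charge (C)

def thermal_sigma(T, G, bw):
    """Johnson-Nyquist current noise: sqrt(4 k T G Δf), in amperes."""
    return math.sqrt(4 * k_B * T * G * bw)

def shot_sigma(I, bw):
    """Shot noise: sqrt(2 q I Δf), in amperes."""
    return math.sqrt(2 * q * I * bw)

def snr_db(mu, sigma):
    """Power SNR in dB: 10 log10(mu^2 / sigma^2)."""
    return 10 * math.log10(mu**2 / sigma**2)

s_t = thermal_sigma(300, 100e-6, 100e6)  # ≈ 12.9 nA, as in the text
s_s = shot_sigma(1e-6, 100e6)            # ≈ 5.7 nA, as in the text
s_tot = math.sqrt(s_t**2 + s_s**2)       # independent sources: quadrature sum

print(f"thermal {s_t*1e9:.1f} nA, shot {s_s*1e9:.1f} nA, total {s_tot*1e9:.1f} nA")
print(f"SNR at 1 µA signal: {snr_db(1e-6, s_tot):.1f} dB")
```
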

L3 · Deep

Crossbar noise in detail:

Total noise current in one column of a 256×256 crossbar:

σ_col² = σ₁² + σ₂² + … + σ₂₅₆² ≈ 256 · σ_cell²

So σ_col = √256 · σ_cell = 16 · σ_cell.

The signal also sums: µ_col = Σ µᵢ = 256 · µ̄ (the mean cell current).

SNR: (256 µ̄)² / (256 σ_cell²) = 256 · µ̄² / σ_cell² = 256 · SNR_cell.

Crossbar SNR is N× the per-cell SNR (in power terms, assuming independent per-cell noise and coherently adding signals). Good news.
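
A Monte Carlo check of the N× claim under exactly these assumptions — independent per-cell noise, every cell carrying the same mean signal (parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 256                      # cells summed per column
mu_cell = 12.5e-6            # mean cell current (A), illustrative
sigma_cell = 0.05 * mu_cell  # 5% per-cell noise

# 20,000 noisy readouts of one column; each cell contributes mu + noise
reads = rng.normal(mu_cell, sigma_cell, size=(20_000, N)).sum(axis=1)

snr_cell = mu_cell**2 / sigma_cell**2    # per-cell power SNR
snr_col = reads.mean()**2 / reads.var()  # column power SNR from the samples
print(f"cell SNR {snr_cell:.0f}, column SNR {snr_col:.0f}, "
      f"ratio ≈ {snr_col / snr_cell:.0f}")  # ratio comes out ≈ N
```
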

Programming noise:

You can’t program a memristor exactly. After ISPP: G_actual ~ N(G_target, (ρ·G_target)²), with ρ ≈ 1% (ISPP) or 5% (basic programming).

This noise is persistent — unlike thermal, it doesn’t change per read. In AI it acts as “weight quantization noise”. Modern DL is designed to tolerate it (post-training quantization).

Drift:

Conductance changes slowly: G(t) = G₀ + α·log(t/t₀). Typical: ~5% drift per year.

Fix: periodic refresh (re-program a few cells per month) or drift-aware compiler (predict and compensate).

Is noise bad for AI?

Surprisingly: mostly not, sometimes helpful:

  1. Weight noise = stochastic regularizer: adding small noise to weights reduces overfitting (Hinton et al. 1992).
  2. Dropout: randomly disable neurons during training → more robust model. SIDRA’s natural “sneak path” noise can do something similar.
  3. Stochastic gradient: SGD’s strength is its noise → finds good minima.
  4. Bayesian networks: weights are actually distributions. SIDRA hardware noise produces this naturally.
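
Item 2's mechanism in miniature — dropout as a Bernoulli mask on activations (the standard "inverted dropout" formulation, not SIDRA-specific):

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(x, p_drop=0.5, train=True):
    """Inverted dropout: zero each unit with probability p_drop,
    rescale survivors by 1/(1 - p_drop) so E[output] = E[input]."""
    if not train:
        return x
    mask = rng.binomial(1, 1 - p_drop, size=x.shape)  # Bernoulli keep-mask
    return x * mask / (1 - p_drop)

x = np.ones(100_000)
y = dropout(x, p_drop=0.5)
print(f"mean before {x.mean():.3f}, after {y.mean():.3f}")  # both ≈ 1.0
```

The Bernoulli variance p(1 − p) from L1 is exactly the per-unit noise dropout injects during training.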

SIDRA Y10 target: controlled-stochastic memristor — noise level tunable by design. Optimize per AI workload.

Experiment: Compute the SNR of a Cell

A SIDRA Y1 cell:

  • G = 50 µS (between HRS and LRS)
  • V = 0.25 V (read voltage)
  • T = 300 K
  • Δf = 100 MHz
  • ISPP programming noise: ρ = 2% → σ_G = 1 µS

Signal current: µ_I = G · V = 50×10⁻⁶ · 0.25 = 12.5 µA.

Thermal noise: σ_T = √(4 k T G Δf) = √(4 · 1.38×10⁻²³ · 300 · 5×10⁻⁵ · 10⁸) = √(8.28×10⁻¹⁷) ≈ 9.1 nA.

Shot noise: σ_S = √(2 q I Δf) = √(2 · 1.6×10⁻¹⁹ · 1.25×10⁻⁵ · 10⁸) = √(4×10⁻¹⁶) = 20 nA.

Programming noise (in current units): σ_P = σ_G · V = 10⁻⁶ · 0.25 = 0.25 µA = 250 nA.

Total: σ_total = √(9.1² + 20² + 250²) ≈ √(83 + 400 + 62500) ≈ 251 nA.

Programming noise dominates (much larger than thermal/shot).

SNR: SNR = µ_I / σ_total = 12500 / 251 ≈ 49.8 (amplitude), so SNR_dB = 20 log₁₀(49.8) ≈ 33.9 dB.

Effective bits: log₂(50) ≈ 5.6 bits.
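
The whole chain of this experiment, reproduced in a few lines (constants as given above):

```python
import math

G, V = 50e-6, 0.25       # conductance (S), read voltage (V)
T, bw = 300.0, 100e6     # temperature (K), bandwidth (Hz)
sigma_G = 1e-6           # ISPP programming noise (S), rho = 2%
k_B, q = 1.38e-23, 1.6e-19

mu_I = G * V                                 # 12.5 µA signal
s_thermal = math.sqrt(4 * k_B * T * G * bw)  # ≈ 9.1 nA
s_shot = math.sqrt(2 * q * mu_I * bw)        # ≈ 20 nA
s_prog = sigma_G * V                         # 250 nA — dominates
s_total = math.sqrt(s_thermal**2 + s_shot**2 + s_prog**2)

snr = mu_I / s_total                         # amplitude SNR
print(f"SNR {snr:.1f} → {20 * math.log10(snr):.1f} dB, "
      f"~{math.log2(snr):.1f} effective bits")  # ≈ 33.9 dB, ≈ 5.6 bits
```
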

Crossbar level (256 cells summed per column): under the idealized L3 assumptions the power SNR grows up to 256× (+24 dB); in practice the inputs do not all add coherently, so a realistic Y1 estimate is ~42 dB, ~7 effective bits per column.

Bottom line: SIDRA Y1 has ~7 effective bits per column. Enough for INT8 inference; not for FP32.

Improvement paths:

  1. Tighter ISPP: ρ = 1% → programming noise drops 50%.
  2. Multi-read averaging (4 samples): read noise σ drops 50% (static programming noise does not average out).
  3. Cold operation (T = 250 K): thermal noise power drops ~17%.

Y10 target: ~50 dB SNR, ~9-10 effective bits.
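
A rough estimate of how paths 1 and 2 combine, taking the experiment's numbers as the baseline (note that averaging shrinks only the per-read thermal/shot noise — the static programming noise is why tighter ISPP matters most):

```python
import math

mu_I = 12.5e-6                      # signal current from the experiment
s_thermal, s_shot = 9.1e-9, 20e-9   # per-read noise from the experiment
s_prog = 125e-9                     # path 1: rho = 1% halves 250 nA

n_reads = 4                         # path 2: 4-read averaging
s_read = math.sqrt(s_thermal**2 + s_shot**2) / math.sqrt(n_reads)
s_total = math.sqrt(s_read**2 + s_prog**2)

snr_db = 20 * math.log10(mu_I / s_total)
print(f"improved SNR ≈ {snr_db:.1f} dB")  # ≈ 40 dB: programming noise still dominates
```
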

Quick Quiz

Question 1/6: What is the expected value E[X]?

Lab Exercise

SNR analysis for MNIST classification on SIDRA Y1.

Scenario:

  • MNIST: 28×28 = 784 pixels, 10 classes.
  • 2-layer MLP: 784 × 128 → 128 × 10.
  • Each layer on SIDRA crossbars: first layer 4 crossbars (256×256), second layer 1 crossbar.
  • Each cell: 6-7 effective-bit SNR.
  • Inference: one forward pass.

Data:

  • Typical MNIST classification accuracy (FP32 model): 98%.
  • After INT8 quantization: 97.5% (1% loss).
  • INT4 quantization: 94% (4% loss).
  • 6-bit effective (SIDRA): 96-97% expected.

Questions:

(a) MVMs per inference? Latency in ns? (b) How much noise per MVM (mean current × 5%)? (c) How does total noise accumulate across 2 layers? (d) How many MVM averages to hit 96% classification? (e) How much does averaging extend inference? Practical?

Solutions

(a) Layer 1: a 784×128 weight matrix, tiled over 4 crossbars (256×256). Layer 2: 128×10 → 1 MVM. Total 5 MVMs, executed sequentially here: 5 × 10 ns = 50 ns.

(b) Each MVM output ~10 µA. Programming noise 5% → 0.5 µA; thermal/shot ~50 nA. Quadrature sum ≈ √(500² + 50²) ≈ 502 nA per output → ~5% relative.

(c) Two layers in series → relative noise adds in RMS: √(0.05² + 0.05²) ≈ 0.071 = 7.1%. As long as the classification margin exceeds this, accuracy holds.

(d) 5-bit effective → ~93% accuracy. The 6-bit single-MVM baseline already reaches the 96% target; 4× averaging (~7 bits) pushes it to 97-98%.

(e) 4× averaging: 4 × 50 ns = 200 ns/inference. Still 5M inferences/s. Practical. Compare: an H100 manages ~100M MNIST/s, but at 700 W; SIDRA Y1 at 5M MNIST/s is ~20× slower but uses ~230× less energy.
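
Solutions (a)-(c) as a sanity-check script (10 ns per MVM and ~5% per-MVM noise are the figures assumed in this lab):

```python
import math

# (a) MVM count and latency: 784x128 tiled over 256x256 crossbars, then 128x10
tiles_layer1 = math.ceil(784 / 256) * math.ceil(128 / 256)  # 4 tiles
mvms = tiles_layer1 + 1                                     # + layer 2
latency_ns = mvms * 10                                      # sequential, 10 ns each
assert mvms == 5 and latency_ns == 50

# (b) per-MVM noise on a ~10 µA output: 5% programming + ~50 nA read noise
prog, read = 0.05 * 10e-6, 50e-9
per_mvm = math.sqrt(prog**2 + read**2)      # quadrature sum ≈ 0.50 µA

# (c) two layers in series: independent relative noise adds in RMS
rel_total = math.sqrt(0.05**2 + 0.05**2)    # ≈ 7.1%

print(f"per-MVM noise ≈ {per_mvm * 1e6:.2f} µA, two-layer ≈ {rel_total:.1%}")
```
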

Note: Y1 is overkill for MNIST (419M cells, MNIST model needs ~100K). Real role: parallel batches of small models.

Cheat Sheet

  • Random variable: a variable whose value is set by chance. E[X] = mean, Var[X] = scatter.
  • Three distributions: Normal (noise), Bernoulli (binary), Poisson (count).
  • Three noises: Thermal (4kTG·Δf), Shot (2qI·Δf), 1/f (drift).
  • Total noise: σ_total² = σ_T² + σ_S² + σ_1/f².
  • SNR: signal²/noise². dB = 10 log₁₀ SNR.
  • Crossbar SNR: cell SNR × N (parallelism wins).
  • SIDRA Y1: ~30-40 dB SNR, ~6-7 effective bits.
  • Noise = feature: stochastic regularizer, dropout effect, Bayesian nets.

Vision: Make Noise a Design Tool

Classical engineering: noise is the enemy. Modern AI: noise is a friend. SIDRA brings that paradigm into silicon:

  • Y1 (today): Noise is “tolerated bad” — enough for INT8 inference.
  • Y3 (2027): ISPP improvement + temperature compensation → SNR 50 dB, 9 bits.
  • Y10 (2029): Controlled-stochastic memristor — noise level programmable. For Bayesian nets, dropout-replication, stochastic MAC.
  • Y100 (2031+): Noise-aware compiler — train the model knowing per-cell noise profiles. Hardware-software co-design.
  • Y1000 (long horizon): Noise-energy co-optimization. AI models use noise as a compute resource (sampling, MCMC).

Meaning for Türkiye: the noise-tolerant AI design race has just begun. SIDRA is an early move. Combine academia + workshop + industry (ASELSAN, ASELSAN AI etc.) → Türkiye’s first national “noise-aware AI architecture”.

Unexpected future: the stochastic AI era. Instead of today’s deterministic models, AI that gives probabilistic answers (a distribution of answers, with confidence). That mirrors the brain; SIDRA hardware stochasticity is the natural carrier. The Y100 version of ChatGPT returns not just an “answer” but “answer + confidence interval”.

Further Reading

  • Next chapter: 4.5 — Fourier Transform
  • Previous: 4.3 — Derivative and Gradient
  • Probability foundation: Ross, A First Course in Probability.
  • Stochastic processes: Ross, Introduction to Probability Models.
  • Thermal noise: Nyquist, Thermal agitation of electric charge in conductors, Phys. Rev. 1928.
  • Shot noise: Schottky 1918 (original).
  • Memristor noise: Suri et al., Physical aspects of low power synapses based on phase change memory devices, J. Appl. Phys. 2012.
  • Noise as regularizer in AI: Hinton et al., Improving neural networks by preventing co-adaptation of feature detectors, arXiv 2012 (dropout).
  • Bayesian neural networks: Neal, Bayesian Learning for Neural Networks, Springer 1996.