📐 Module 4 · The Math Arsenal · Chapter 4.4 · 12 min read

Probability and Noise

Memristor noise isn't always a bug — sometimes it's a feature.

What you'll learn here

  • Define random variable, expected value (E), variance (Var)
  • State the formulas for Normal (Gaussian), Bernoulli, Poisson distributions and their use cases
  • Explain the physics of thermal (Johnson), shot, and 1/f noise
  • Compute SNR (Signal-to-Noise Ratio) for a SIDRA crossbar
  • Show how noise can be useful in AI (regularizer, dropout)

Hook: Perfection Isn't Possible — and Isn't Needed

An ideal chip: every signal precise, every measurement correct, every computation deterministic. A practical chip: noise on every signal, error in every measurement, estimation in every computation.

A SIDRA Y1 cell stores 8-bit (256-level) conductance. But thermal noise, shot noise, drift, IR drop, and temperature swings drop effective accuracy to ~6 bits. Two bits lost. Is that a problem?

Answer: usually not. Sometimes an advantage.

  • 6 bits is enough for AI inference (INT8 is standard, INT4 is widespread).
  • Noise plays the role of a regularizer in classical AI (dropout, weight noise).
  • The brain synapse is already noisy (vesicles are probabilistic) — feature, not bug.
  • SIDRA’s real position: not “deterministic digital”, but “noisy but efficient analog”.

This chapter covers probability fundamentals, noise sources, how SIDRA measures and tames them, and shows that noise can help AI learning.

Intuition: Probability and Expected Value

A random variable (RV) is a variable whose value is determined by a random process rather than fixed in advance.

  • A die roll: X ∈ {1, 2, 3, 4, 5, 6}, each value with equal probability.
  • A memristor read current: I = µ + ε, where µ is the “true” value and ε is Gaussian noise.

A probability distribution assigns a probability to every possible value.

  • Die: P(X = k) = 1/6 for every k.
  • Memristor: ε ~ N(0, σ²) — a zero-mean Gaussian.

Expected value (E): long-run average.

  • Die: E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
  • Memristor: E[I] = µ (the noise has zero mean).

Variance: how scattered around the mean.

  • Die: Var[X] = E[(X − 3.5)²] ≈ 2.92.
  • Memristor: Var[I] = σ².

Standard deviation: √Var. Same units as the variable; the size of a “typical deviation”.

  • Memristor: σ. Typical SIDRA: σ ≈ 5% of µ.

Intuition: a single measurement is noisy, but the average of many measurements is much sharper. The standard deviation of an N-sample average is σ/√N (and by the central limit theorem the average is approximately Gaussian). 100 measurements → 10× improvement.

SIDRA practical use: if a single MVM is repeated 10× and averaged, read noise drops by √10 ≈ 3.2×, lifting effective accuracy from 6 bits to roughly 7.5 — but throughput drops 10×. Trade-off.
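
The √N effect is easy to see numerically; a minimal sketch with NumPy, using the ~5% read-noise figure quoted above (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 12.5e-6        # "true" read current: 12.5 µA
sigma = 0.05 * mu   # ~5% read noise, as quoted in the text

# 100,000 single reads vs. 100,000 averages of N = 100 reads each
single = rng.normal(mu, sigma, size=100_000)
averaged = rng.normal(mu, sigma, size=(100_000, 100)).mean(axis=1)

print(f"std of a single read : {single.std():.3e} A")   # ~ sigma
print(f"std of a 100-read avg: {averaged.std():.3e} A")  # ~ sigma / 10
```

The second standard deviation comes out ~10× smaller, matching σ/√100.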

Formalism: Distributions, Noise Models, SNR

L1 · Basics

Three core distributions:

Bernoulli: X ∈ {0, 1}, with P(X = 1) = p.

  • Expected value: E[X] = p.
  • Variance: Var[X] = p(1 − p).
  • Use: single-bit event (vesicle release, bit read).

Normal (Gaussian): X ~ N(µ, σ²).

  • Density: f(x) = exp(−(x − µ)² / (2σ²)) / (σ√(2π)).
  • E[X] = µ, Var[X] = σ².
  • Use: thermal noise, measurement error, weight initialization.

Poisson: X ∈ {0, 1, 2, …}, with P(X = k) = λ^k · e^(−λ) / k!.

  • E[X] = Var[X] = λ.
  • Use: spike count, photon count, rare events.

Expected-value rules:

  • Linearity: E[aX + bY] = a·E[X] + b·E[Y].
  • If independent: E[XY] = E[X]·E[Y].

Variance rules:

  • Var[aX] = a²·Var[X].
  • If independent: Var[X + Y] = Var[X] + Var[Y].
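These moments and rules can be sanity-checked by simulation; a sketch with NumPy (the parameter values are arbitrary, chosen only for the check):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

p, mu, sigma, lam = 0.3, 2.0, 0.5, 4.0
bern = rng.binomial(1, p, n)      # Bernoulli(p)
norm = rng.normal(mu, sigma, n)   # N(mu, sigma^2)
pois = rng.poisson(lam, n)        # Poisson(lam)

# Sample moments match the formulas above to within sampling error
assert abs(bern.mean() - p) < 0.01 and abs(bern.var() - p * (1 - p)) < 0.01
assert abs(norm.mean() - mu) < 0.01 and abs(norm.var() - sigma**2) < 0.01
assert abs(pois.mean() - lam) < 0.02 and abs(pois.var() - lam) < 0.05

# Independence rules, with X and Y independent normals
x, y = rng.normal(1, 0.2, n), rng.normal(3, 0.4, n)
assert abs((x * y).mean() - x.mean() * y.mean()) < 0.01  # E[XY] = E[X]E[Y]
assert abs((x + y).var() - (x.var() + y.var())) < 0.01   # Var[X+Y] = Var[X]+Var[Y]
print("all moment checks passed")
```
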

L2 · Full

Three physical noise sources:

1. Thermal noise (Johnson-Nyquist):

σ_I² = 4 k T G Δf
  • kk = Boltzmann (1.38 × 10⁻²³ J/K)
  • TT = temperature (K)
  • GG = conductance (S)
  • Δf = bandwidth (Hz)

Numbers: T = 300 K, G = 100 µS, Δf = 100 MHz → σ_I = √(4 · 1.38×10⁻²³ · 300 · 10⁻⁴ · 10⁸) = √(1.66×10⁻¹⁶) A ≈ 12.9 nA.

Typical MVM output: 1-10 µA → against thermal noise alone, an amplitude SNR of roughly 80-800 (≈38-58 dB); once shot, 1/f, and programming noise are added, the Y1 total lands at ~30-40 dB.

2. Shot noise:

σ_I² = 2 q I Δf
  • qq = electron charge (1.6 × 10⁻¹⁹ C)
  • II = average current

Numbers: I = 1 µA, Δf = 100 MHz → σ_I = √(2 · 1.6×10⁻¹⁹ · 10⁻⁶ · 10⁸) = √(3.2×10⁻¹⁷) ≈ 5.7 nA.

Dominates at low current. Same order as thermal.

3. 1/f (flicker) noise:

S_I(f) = K I² / f
  • KK = material constant (HfO₂ ≈ 10⁻¹¹).

Grows as frequency drops (slow drift source). Dominates over long retention.

Total noise (independent sources):

σ_total² = σ_thermal² + σ_shot² + σ_1/f²

SNR (Signal-to-Noise Ratio):

SNR = µ² / σ_total² = P_signal / P_noise

In dB: SNR_dB = 10 log₁₀(SNR).

  • 30 dB → 1000× signal-to-noise power ratio → ~5 effective bits.
  • 40 dB → 10,000× → ~6.5 bits.
  • 60 dB → 10⁶× → ~10 bits.

SIDRA Y1 target: ~30-40 dB.
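
The noise formulas and the example numbers above, collected into a short script (1/f noise is omitted here — its contribution depends on the integration band and the material constant K):

```python
import math

k_B = 1.38e-23  # Boltzmann constant (J/K)
q = 1.6e-19     # electron charge (C)

def thermal_sigma(T, G, bw):
    """Johnson-Nyquist current noise: sqrt(4 k T G Δf), in amperes."""
    return math.sqrt(4 * k_B * T * G * bw)

def shot_sigma(I, bw):
    """Shot noise: sqrt(2 q I Δf), in amperes."""
    return math.sqrt(2 * q * I * bw)

def snr_db(mu, sigma):
    """Power SNR in dB: 10 log10(mu^2 / sigma^2)."""
    return 10 * math.log10(mu**2 / sigma**2)

s_t = thermal_sigma(300, 100e-6, 100e6)  # ≈ 12.9 nA, as in the text
s_s = shot_sigma(1e-6, 100e6)            # ≈ 5.7 nA, as in the text
s_tot = math.sqrt(s_t**2 + s_s**2)       # independent sources: quadrature sum

print(f"thermal {s_t*1e9:.1f} nA, shot {s_s*1e9:.1f} nA, total {s_tot*1e9:.1f} nA")
print(f"SNR at 1 µA signal: {snr_db(1e-6, s_tot):.1f} dB")
```
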

L3 · Deep

Crossbar noise in detail:

Total noise current in one column of a 256×256 crossbar:

σ_col² = σ₁² + σ₂² + … + σ₂₅₆² ≈ 256 · σ_cell²

So σ_col = √256 · σ_cell = 16 · σ_cell.

The signal also sums: µ_col = Σ µᵢ = 256 · µ̄ (the mean cell current).

SNR: (256 µ̄)² / (256 σ_cell²) = 256 · µ̄² / σ_cell² = 256 · SNR_cell.

Crossbar SNR is N× the per-cell SNR (in power terms, assuming independent per-cell noise and coherently adding signals). Good news.
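
A Monte Carlo check of the N× claim under exactly these assumptions — independent per-cell noise, every cell carrying the same mean signal (parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 256                      # cells summed per column
mu_cell = 12.5e-6            # mean cell current (A), illustrative
sigma_cell = 0.05 * mu_cell  # 5% per-cell noise

# 20,000 noisy readouts of one column; each cell contributes mu + noise
reads = rng.normal(mu_cell, sigma_cell, size=(20_000, N)).sum(axis=1)

snr_cell = mu_cell**2 / sigma_cell**2    # per-cell power SNR
snr_col = reads.mean()**2 / reads.var()  # column power SNR from the samples
print(f"cell SNR {snr_cell:.0f}, column SNR {snr_col:.0f}, "
      f"ratio ≈ {snr_col / snr_cell:.0f}")  # ratio comes out ≈ N
```
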

Programming noise:

You can’t program a memristor exactly. After ISPP: G_actual ~ N(G_target, (ρ·G_target)²), with ρ ≈ 1% (ISPP) or 5% (basic programming).

This noise is persistent — unlike thermal, it doesn’t change per read. In AI it acts as “weight quantization noise”. Modern DL is designed to tolerate it (post-training quantization).

Drift:

Conductance changes slowly: G(t) = G₀ + α·log(t/t₀). Typical: ~5% drift per year.

Fix: periodic refresh (re-program a few cells per month) or drift-aware compiler (predict and compensate).

Is noise bad for AI?

Surprisingly: mostly not, sometimes helpful:

  1. Weight noise = stochastic regularizer: adding small noise to weights reduces overfitting (Hinton et al. 1992).
  2. Dropout: randomly disable neurons during training → more robust model. SIDRA’s natural “sneak path” noise can do something similar.
  3. Stochastic gradient: SGD’s strength is its noise → finds good minima.
  4. Bayesian networks: weights are actually distributions. SIDRA hardware noise produces this naturally.
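
Item 2's mechanism in miniature — dropout as a Bernoulli mask on activations (the standard "inverted dropout" formulation, not SIDRA-specific):

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(x, p_drop=0.5, train=True):
    """Inverted dropout: zero each unit with probability p_drop,
    rescale survivors by 1/(1 - p_drop) so E[output] = E[input]."""
    if not train:
        return x
    mask = rng.binomial(1, 1 - p_drop, size=x.shape)  # Bernoulli keep-mask
    return x * mask / (1 - p_drop)

x = np.ones(100_000)
y = dropout(x, p_drop=0.5)
print(f"mean before {x.mean():.3f}, after {y.mean():.3f}")  # both ≈ 1.0
```

The Bernoulli variance p(1 − p) from L1 is exactly the per-unit noise dropout injects during training.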

SIDRA Y10 target: controlled-stochastic memristor — noise level tunable by design. Optimize per AI workload.

Experiment: Compute the SNR of a Cell

A SIDRA Y1 cell:

  • G = 50 µS (between HRS and LRS)
  • V = 0.25 V (read voltage)
  • T = 300 K
  • Δf = 100 MHz
  • ISPP programming noise: ρ = 2% → σ_G = 1 µS

Signal current: µ_I = G · V = 50×10⁻⁶ · 0.25 = 12.5 µA.

Thermal noise: σ_T = √(4 k T G Δf) = √(4 · 1.38×10⁻²³ · 300 · 5×10⁻⁵ · 10⁸) = √(8.28×10⁻¹⁷) ≈ 9.1 nA.

Shot noise: σ_S = √(2 q I Δf) = √(2 · 1.6×10⁻¹⁹ · 1.25×10⁻⁵ · 10⁸) = √(4×10⁻¹⁶) = 20 nA.

Programming noise (in current units): σ_P = σ_G · V = 10⁻⁶ · 0.25 = 0.25 µA = 250 nA.

Total: σ_total = √(9.1² + 20² + 250²) ≈ √(83 + 400 + 62500) ≈ 251 nA.

Programming noise dominates (much larger than thermal/shot).

SNR: SNR = µ_I / σ_total = 12500 / 251 ≈ 49.8 (amplitude), so SNR_dB = 20 log₁₀(49.8) ≈ 33.9 dB.

Effective bits: log₂(50) ≈ 5.6 bits.
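
The whole chain of this experiment, reproduced in a few lines (constants as given above):

```python
import math

G, V = 50e-6, 0.25       # conductance (S), read voltage (V)
T, bw = 300.0, 100e6     # temperature (K), bandwidth (Hz)
sigma_G = 1e-6           # ISPP programming noise (S), rho = 2%
k_B, q = 1.38e-23, 1.6e-19

mu_I = G * V                                 # 12.5 µA signal
s_thermal = math.sqrt(4 * k_B * T * G * bw)  # ≈ 9.1 nA
s_shot = math.sqrt(2 * q * mu_I * bw)        # ≈ 20 nA
s_prog = sigma_G * V                         # 250 nA — dominates
s_total = math.sqrt(s_thermal**2 + s_shot**2 + s_prog**2)

snr = mu_I / s_total                         # amplitude SNR
print(f"SNR {snr:.1f} → {20 * math.log10(snr):.1f} dB, "
      f"~{math.log2(snr):.1f} effective bits")  # ≈ 33.9 dB, ≈ 5.6 bits
```
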

Crossbar level (256 cells summed per column): under the idealized L3 assumptions the power SNR grows up to 256× (+24 dB); in practice the inputs do not all add coherently, so a realistic Y1 estimate is ~42 dB, ~7 effective bits per column.

Bottom line: SIDRA Y1 has ~7 effective bits per column. Enough for INT8 inference; not for FP32.

Improvement paths:

  1. Tighter ISPP: ρ = 1% → programming noise drops 50%.
  2. Multi-read averaging (4 samples): read noise σ drops 50% (static programming noise does not average out).
  3. Cold operation (T = 250 K): thermal noise power drops ~17%.

Y10 target: ~50 dB SNR, ~9-10 effective bits.
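
A rough estimate of how paths 1 and 2 combine, taking the experiment's numbers as the baseline (note that averaging shrinks only the per-read thermal/shot noise — the static programming noise is why tighter ISPP matters most):

```python
import math

mu_I = 12.5e-6                      # signal current from the experiment
s_thermal, s_shot = 9.1e-9, 20e-9   # per-read noise from the experiment
s_prog = 125e-9                     # path 1: rho = 1% halves 250 nA

n_reads = 4                         # path 2: 4-read averaging
s_read = math.sqrt(s_thermal**2 + s_shot**2) / math.sqrt(n_reads)
s_total = math.sqrt(s_read**2 + s_prog**2)

snr_db = 20 * math.log10(mu_I / s_total)
print(f"improved SNR ≈ {snr_db:.1f} dB")  # ≈ 40 dB: programming noise still dominates
```
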

Quick Quiz

Question 1/6: What is the expected value E[X]?

Lab Exercise

SNR analysis for MNIST classification on SIDRA Y1.

Scenario:

  • MNIST: 28×28 = 784 pixels, 10 classes.
  • 2-layer MLP: 784 × 128 → 128 × 10.
  • Each layer on SIDRA crossbars: first layer 4 crossbars (256×256), second layer 1 crossbar.
  • Each cell: 6-7 effective-bit SNR.
  • Inference: one forward pass.

Data:

  • Typical MNIST classification accuracy (FP32 model): 98%.
  • After INT8 quantization: 97.5% (1% loss).
  • INT4 quantization: 94% (4% loss).
  • 6-bit effective (SIDRA): 96-97% expected.

Questions:

(a) MVMs per inference? Latency in ns? (b) How much noise per MVM (mean current × 5%)? (c) How does total noise accumulate across 2 layers? (d) How many MVM averages to hit 96% classification? (e) How much does averaging extend inference? Practical?

Solutions

(a) Layer 1: a 784×128 weight matrix, tiled over 4 crossbars (256×256). Layer 2: 128×10 → 1 MVM. Total 5 MVMs, executed sequentially here: 5 × 10 ns = 50 ns.

(b) Each MVM output ~10 µA. Programming noise 5% → 0.5 µA; thermal/shot ~50 nA. Quadrature sum ≈ √(500² + 50²) ≈ 502 nA per output → ~5% relative.

(c) Two layers in series → relative noise adds in RMS: √(0.05² + 0.05²) ≈ 0.071 = 7.1%. As long as the classification margin exceeds this, accuracy holds.

(d) 5-bit effective → ~93% accuracy. The 6-bit single-MVM baseline already reaches the 96% target; 4× averaging (~7 bits) pushes it to 97-98%.

(e) 4× averaging: 4 × 50 ns = 200 ns/inference. Still 5M inferences/s. Practical. Compare: an H100 manages ~100M MNIST/s, but at 700 W; SIDRA Y1 at 5M MNIST/s is ~20× slower but uses ~230× less energy.
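
Solutions (a)-(c) as a sanity-check script (10 ns per MVM and ~5% per-MVM noise are the figures assumed in this lab):

```python
import math

# (a) MVM count and latency: 784x128 tiled over 256x256 crossbars, then 128x10
tiles_layer1 = math.ceil(784 / 256) * math.ceil(128 / 256)  # 4 tiles
mvms = tiles_layer1 + 1                                     # + layer 2
latency_ns = mvms * 10                                      # sequential, 10 ns each
assert mvms == 5 and latency_ns == 50

# (b) per-MVM noise on a ~10 µA output: 5% programming + ~50 nA read noise
prog, read = 0.05 * 10e-6, 50e-9
per_mvm = math.sqrt(prog**2 + read**2)      # quadrature sum ≈ 0.50 µA

# (c) two layers in series: independent relative noise adds in RMS
rel_total = math.sqrt(0.05**2 + 0.05**2)    # ≈ 7.1%

print(f"per-MVM noise ≈ {per_mvm * 1e6:.2f} µA, two-layer ≈ {rel_total:.1%}")
```
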

Note: Y1 is overkill for MNIST (419M cells, MNIST model needs ~100K). Real role: parallel batches of small models.

Cheat Sheet

  • Random variable: a variable whose value is set by chance. E[X] = mean, Var[X] = scatter.
  • Three distributions: Normal (noise), Bernoulli (binary), Poisson (count).
  • Three noises: Thermal (4kTG·Δf), Shot (2qI·Δf), 1/f (drift).
  • Total noise: σ_total² = σ_T² + σ_S² + σ_1/f².
  • SNR: signal²/noise². dB = 10 log₁₀ SNR.
  • Crossbar SNR: cell SNR × N (parallelism wins).
  • SIDRA Y1: ~30-40 dB SNR, ~6-7 effective bits.
  • Noise = feature: stochastic regularizer, dropout effect, Bayesian nets.

Vision: Make Noise a Design Tool

Classical engineering: noise is the enemy. Modern AI: noise is a friend. SIDRA brings that paradigm into silicon:

  • Y1 (today): Noise is “tolerated bad” — enough for INT8 inference.
  • Y3 (2027): ISPP improvement + temperature compensation → SNR 50 dB, 9 bits.
  • Y10 (2029): Controlled-stochastic memristor — noise level programmable. For Bayesian nets, dropout-replication, stochastic MAC.
  • Y100 (2031+): Noise-aware compiler — train the model knowing per-cell noise profiles. Hardware-software co-design.
  • Y1000 (long horizon): Noise-energy co-optimization. AI models use noise as a compute resource (sampling, MCMC).

Meaning for Türkiye: the noise-tolerant AI design race has just begun. SIDRA is an early move. Combine academia + workshop + industry (ASELSAN, ASELSAN AI etc.) → Türkiye’s first national “noise-aware AI architecture”.

Unexpected future: the stochastic AI era. Instead of today’s deterministic models, AI that gives probabilistic answers (a distribution of answers, with confidence). That mirrors the brain; SIDRA hardware stochasticity is the natural carrier. The Y100 version of ChatGPT returns not just an “answer” but “answer + confidence interval”.

Further Reading

  • Next chapter: 4.5 — Fourier Transform
  • Previous: 4.3 — Derivative and Gradient
  • Probability foundation: Ross, A First Course in Probability.
  • Stochastic processes: Ross, Introduction to Probability Models.
  • Thermal noise: Nyquist, Thermal agitation of electric charge in conductors, Phys. Rev. 1928.
  • Shot noise: Schottky 1918 (original).
  • Memristor noise: Suri et al., Physical aspects of low power synapses based on phase change memory devices, J. Appl. Phys. 2012.
  • Noise as regularizer in AI: Hinton et al., Improving neural networks by preventing co-adaptation of feature detectors, arXiv 2012 (dropout).
  • Bayesian neural networks: Neal, Bayesian Learning for Neural Networks, Springer 1996.