📐 Module 4 · The Math Arsenal · Chapter 4.6 · 11 min read

Quantization and Quantization Error

Continuous analog world, discrete digital storage — SIDRA sits in between.

What you'll learn here

  • Define quantization and explain why it's necessary
  • Cover the roles of ADC and DAC and the bit-depth concept
  • Compute the quantization error ($\sigma_q^2 = \Delta^2/12$)
  • State the practical impact of INT8, INT4, FP16 quantized AI models
  • Show why a 256-level SIDRA cell equals 8-bit quantization

Hook: A Numerical Ruler for the Continuous World

Nature is continuous: voltage, current, sound, light — all unbroken. Computers are discrete: 0s and 1s, finite states.

The operation that bridges them is quantization: map a continuous value to one of a finite set of levels.

  • A thermometer reads 23.7°C → display shows “24” → the reading is rounded to the nearest mark.
  • Microphone analog audio → 16-bit ADC → 65,536 discrete levels.
  • An AI model’s weights are trained in FP32 (4 bytes) → quantized to INT8 (1 byte) for inference → 4× compression.

Quantization at the heart of SIDRA:

  • A memristor cell → 256 levels (8 bits). Maps the continuous conductance range onto 256 discrete values.
  • ADC → analog current to 8-bit digital.
  • DAC → 8-bit digital back to analog voltage.

Quantization error is unavoidable. Some information is lost when the continuous signal becomes discrete. This chapter covers the math and practice of that loss, and what it means for SIDRA.

Intuition: Ruler Markings = Precision

The denser a ruler’s tick marks, the more precise the measurement. Same for quantization:

Bits | Levels | Tick density
1 | 2 | only 0 and 1 (no middle)
4 | 16 | ~6% precision per level
8 | 256 | ~0.4% per level
16 | 65,536 | ~0.0015%
32 | ~4 billion | float-like

Rounding error (quantization error): the deviation introduced by snapping a continuous value to the nearest discrete level.

  • Range $[A, B]$, $N$ levels → step $\Delta = (B - A)/N$.
  • Worst-case error: $\pm\Delta/2$.
  • Mean-square error: $\sigma_q^2 = \Delta^2/12$ (for a uniform error distribution).

SIDRA example:

  • Cell conductance range: 1 µS – 100 µS, 256 levels → $\Delta = (100 - 1)/256 \approx 0.39$ µS.
  • $\sigma_q = \Delta/\sqrt{12} \approx 0.11$ µS.
  • For $G = 50$ µS, relative error: $0.11/50 = 0.22\%$ → ~9 effective bits (quantization alone).
  • But noise + drift combined → ~6 effective bits (as we saw in 4.4).
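A quick Python check of these numbers (the 1–100 µS range and the G = 50 µS operating point are taken from the example above):

```python
import math

A, B, N = 1e-6, 100e-6, 256        # conductance range 1-100 µS, 256 levels
delta = (B - A) / N                # quantization step, ~0.39 µS
sigma_q = delta / math.sqrt(12)    # RMS quantization error, ~0.11 µS
rel_err = sigma_q / 50e-6          # relative error at G = 50 µS, ~0.22%

print(f"delta = {delta * 1e6:.2f} uS, sigma_q = {sigma_q * 1e6:.2f} uS, "
      f"relative error = {rel_err * 100:.2f} %")
```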

Bottom line: quantization is a fundamental floor, but in practice noise dominates. For SIDRA, 8-bit quantization is good enough.

Formalism: Quantization Error and Bit Depth

L1 · Basics

Uniform quantization:

Range $[A, B]$, $N = 2^b$ levels ($b$ bits). Step:

$$\Delta = \frac{B - A}{N}$$

Continuous value $x$ → nearest level $\hat{x} = A + \lfloor (x - A)/\Delta + 0.5 \rfloor \cdot \Delta$.

Error: $e = x - \hat{x}$, with $|e| \leq \Delta/2$.

Quantization noise (uniform assumption):

$e \sim \text{Uniform}(-\Delta/2, +\Delta/2)$, so $E[e] = 0$ and $\text{Var}[e] = \Delta^2/12$.
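A minimal Python implementation of this quantizer (a sketch of the formula above, not SIDRA-specific code):

```python
import math

def quantize(x, A, B, bits):
    """Snap x to the nearest of 2**bits uniformly spaced levels in [A, B]."""
    N = 2 ** bits
    delta = (B - A) / N
    k = min(math.floor((x - A) / delta + 0.5), N - 1)  # nearest level index, clamped
    return A + k * delta

x = 0.347
x_hat = quantize(x, -1.0, 1.0, 8)       # 8-bit over [-1, 1]: x_hat = 0.34375
assert abs(x - x_hat) <= (2 / 256) / 2  # the |e| <= delta/2 bound holds
```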

SQNR (Signal-to-Quantization-Noise Ratio):

For a full-range sinusoid:

$$\text{SQNR}_{\text{dB}} = 6.02\,b + 1.76$$

Each extra bit gains ~6 dB. SIDRA Y1 8-bit cell → ~50 dB SQNR.
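The 6 dB/bit rule in one line (full-range sinusoid assumption):

```python
def sqnr_db(bits):
    """SQNR (dB) of a full-range sinusoid after b-bit uniform quantization."""
    return 6.02 * bits + 1.76

print(f"{sqnr_db(8):.2f} dB")  # 49.92 dB: the SIDRA Y1 8-bit cell figure
```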

Bit-depth impact:

Bits | SQNR (dB) | Typical use
1 | ~7.78 | aggressive compression, limited
4 | ~25.84 | INT4 quantized LLMs (experimental)
8 | ~49.92 | INT8 inference (standard)
16 | ~98.08 | FP16 training (half float)
32 | ~194.40 | FP32 training (full float)

L2 · Full

Quantization trends in modern AI:

Training → usually FP32 or FP16 (high precision for gradients).

Inference → INT8 standard, INT4 spreading, INT2-INT3 experimental.

Why is this possible? AI models are noise-tolerant. Non-linear activations absorb small errors. Robustly trained models tolerate dropping from INT8 to INT4 with only a 1-3% accuracy loss.

Quantization-Aware Training (QAT):

Simulate quantization during training → the model adapts to the error. Outperforms post-training quantization. Typical QAT: 0.5% accuracy loss at INT4.
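A minimal numpy sketch of QAT's fake-quantization forward pass. The function name and the fixed [-1, 1] range are illustrative assumptions, not a specific framework's API; real QAT also calibrates the range and uses a straight-through estimator in the backward pass.

```python
import numpy as np

def fake_quantize(w, bits=4, w_min=-1.0, w_max=1.0):
    """Quantize then dequantize, so training sees the rounding error."""
    n = 2 ** bits
    delta = (w_max - w_min) / n
    k = np.clip(np.floor((w - w_min) / delta + 0.5), 0, n - 1)
    return w_min + k * delta

w = np.array([0.347, -0.120, 0.810])
w_q = fake_quantize(w, bits=4)   # snapped to 16 levels: [0.375, -0.125, 0.75]
```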

Two-sided quantization:

  • Weights: uniformly INT8. In hardware, 256-level memristor.
  • Activations: INT8 dynamic range. Per-layer min/max calibration.
  • Accumulator: INT16 or INT32 (avoid overflow on additions).

SIDRA fits this: weight = memristor (8-bit), activation = ADC output (8-bit), accumulator = CMOS sum (16-bit).
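This weight/activation/accumulator split can be sketched as a toy INT8 dot product in numpy (an illustration of why the wide accumulator matters, not SIDRA's actual datapath):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=256, dtype=np.int8)  # INT8 weights (memristor levels)
x = rng.integers(-128, 128, size=256, dtype=np.int8)  # INT8 activations (ADC output)

# One INT8*INT8 product needs up to 16 bits; summing 256 of them adds
# log2(256) = 8 more, so a 32-bit accumulator is comfortably overflow-free.
acc = int(np.dot(w.astype(np.int32), x.astype(np.int32)))
reference = sum(int(a) * int(b) for a, b in zip(w, x))
assert acc == reference
```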

Mixed precision:

Some layers INT8, some INT4, some FP16 — by sensitivity. Modern practice. SIDRA Y10 target: per-layer quantization (compiler decision, 6.7).

L3 · Deep

Logarithmic (log) quantization:

Logarithmic level distribution instead of uniform. Good for wide-dynamic-range signals like sound (perception in dB is logarithmic).

$$\hat{x} = \text{sign}(x) \cdot 2^{\lfloor \log_2 |x| \rfloor}$$

Advantage: precise at small values, coarse at large. Logarithmic Number System (LNS) is a candidate in some AI areas.
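A one-function sketch of this power-of-two quantizer:

```python
import math

def log_quantize(x):
    """Snap |x| down to the nearest power of two, keeping the sign."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** math.floor(math.log2(abs(x))), x)

log_quantize(0.3)    # → 0.25   (fine steps near zero)
log_quantize(700.0)  # → 512.0  (coarse steps for large values)
```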

Stochastic rounding:

Classical rounding is deterministic. Stochastic rounding rounds up or down with probability proportional to the distance:

$$\hat{x} = \begin{cases} \lfloor x/\Delta \rfloor \cdot \Delta & \text{with prob. } 1-p \\ \lceil x/\Delta \rceil \cdot \Delta & \text{with prob. } p \end{cases}, \quad p = (x \bmod \Delta)/\Delta$$

Key property: unbiased on average ($E[\hat{x}] = x$). Useful for training (no gradient bias).

SIDRA Y10+ candidate: controlled noise + stochastic rounding → training accuracy improves.
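The unbiasedness is easy to verify empirically. A plain-Python sketch, using Δ = 0.125 as the step:

```python
import random

def stochastic_round(x, delta):
    """Round x down to a multiple of delta, or up with probability (x mod delta)/delta."""
    lo = (x // delta) * delta
    p = (x - lo) / delta
    return lo + delta if random.random() < p else lo

random.seed(1)
samples = [stochastic_round(0.347, 0.125) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# mean ≈ 0.347: unbiased, while deterministic rounding always gives 0.375
```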

Numerical instability with quantization:

Very small gradients ($< \Delta/2$) get rounded to zero → “underflow”. Modern Transformer training commonly uses bf16 (brain float): an exponent as wide as FP32's, with a shorter mantissa. This solves the underflow problem.

For SIDRA: gradients won’t fit in INT8 → training needs an FP16/bf16 digital coprocessor. For inference, INT8 is enough.

More than 8 bits per cell?

SIDRA Y10 target: multi-cell combinations. Two 16-level (4-bit) cells combined give 16 × 16 = 256 levels = 8 bits (no gain over one 8-bit cell), or 1024 levels = 10 bits with a logarithmic combination; four such cells → 16 bits. But noise compounds, so the effective bit count may stay flat.

Practical decision: for SIDRA, single-cell 8-bit + digital accumulator → effective 12-16 bits. Plenty for modern AI inference.

Experiment: 4-bit vs 8-bit Quantization Impact

A model’s weights lie in $w \in [-1, 1]$. Compare INT8 and INT4 quantization.

INT8: 256 levels, $\Delta = 2/256 \approx 0.0078$.

Example weight $w = 0.347$ → nearest level: $\hat{w} = 0.3437$ (level 44, zero-centered). Error: $0.0033$.

INT4: 16 levels, $\Delta = 2/16 = 0.125$.

Same weight → nearest level: $\hat{w} = 0.3125$ (level 3 with offset). Error: $0.034$, 10× larger.

1000-weight RMS error:

  • INT8 RMS: $0.0078/\sqrt{12} \approx 0.0023$.
  • INT4 RMS: $0.125/\sqrt{12} \approx 0.036$, 16× larger.
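These RMS figures are easy to reproduce with 1000 random weights (nearest-level rounding in numpy; the theoretical value is Δ/√12):

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.uniform(-1, 1, size=1000)  # 1000 random weights in [-1, 1]

for bits in (8, 4):
    delta = 2 / 2 ** bits
    w_hat = np.round(w / delta) * delta            # snap to the nearest level
    rms = np.sqrt(np.mean((w - w_hat) ** 2))
    print(f"INT{bits}: RMS = {rms:.4f} (theory {delta / np.sqrt(12):.4f})")
```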

MNIST inference impact:

  • FP32 baseline: 98.0%
  • INT8: 97.8% (0.2% loss)
  • INT4: 96.5% (1.5% loss)
  • INT2: 88% (10% loss)

SIDRA Y1: 256 levels = INT8 equivalent → expect 97.8% on MNIST. Practically indistinguishable from FP32 to a user.

More complex model (BERT-base):

  • FP32: 88% (GLUE)
  • INT8: 87.5%
  • INT4: 85%
  • SIDRA Y1: expect 87.5%.

Bottom line: SIDRA’s 8-bit quantization is consistent with industry inference standards.

Quick Quiz

Question 1 of 6: What's the variance of the (uniform) quantization error?

Lab Exercise

Quantization performance: SIDRA Y1 vs Y10 vs Y100.

Data:

  • Y1: 256 levels/cell (8-bit), single cell.
  • Y10 (target): multi-cell combinations → effective 10-12 bits.
  • Y100 (vision): mixed-precision + analog FP-equivalent ~16 bits.
  • Test model: ResNet-50 ImageNet (Top-1 accuracy 76.0% FP32 baseline).

Literature:

  • INT8: 75.2% (0.8% loss)
  • INT4: 73.5% (2.5% loss)
  • INT2: 68.0% (8% loss)

Questions:

(a) Expected ResNet-50 inference accuracy on SIDRA Y1 (8-bit)?
(b) On Y10 (12-bit effective)?
(c) On Y100 (16-bit effective)?
(d) Y1 with MobileNet-V2: typical loss?
(e) ResNet-50 inference time: Y1 vs H100?

Solutions

(a) Y1 INT8 → ~75.2%. Production-acceptable (smartphone cameras use INT8).

(b) Y10 at ~12 effective bits approaches FP16 precision → ~75.7-75.9%. Indistinguishable from FP32.

(c) Y100 16-bit → 76.0%. Fully equivalent. Even suitable for training.

(d) MobileNet-V2 is more quantization-sensitive (depthwise conv amplifies small errors). FP32 72.0%, INT8 ~71.0% (1% loss). Y1 holds ~71%. Still practical.

(e) ResNet-50: 4.1B FLOP/inference. Y1: 30 TOPS analog → ~140 µs/inference. H100: 50 TFLOPS sustained → ~80 µs (batch 1; ~3 µs/inference at batch 32). Latency: H100 1.5-50× faster; energy: SIDRA 50× less.
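The latency arithmetic in (e), for reference (throughput figures as assumed in the text):

```python
flops = 4.1e9          # ResNet-50 FLOPs per inference
sidra_ops = 30e12      # SIDRA Y1 analog throughput, ops/s
h100_ops = 50e12       # H100 sustained FLOPS assumed above

t_sidra_us = flops / sidra_ops * 1e6   # ≈ 137 µs
t_h100_us = flops / h100_ops * 1e6     # ≈ 82 µs
```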

Practice: SIDRA wins at the edge with tight power budgets (smart camera, IoT); H100 wins in the data center. SIDRA isn’t trying to replace the H100 — it sits beside it.

Cheat Sheet

  • Quantization: continuous → discrete mapping. Bit depth = number of levels.
  • Step: Δ=(BA)/N\Delta = (B-A)/N, N=2bN = 2^b.
  • Error: uniform σq2=Δ2/12\sigma_q^2 = \Delta^2/12.
  • SQNR: ~6 dB per bit.
  • Modern AI standard: INT8 inference, INT4 spreading.
  • SIDRA Y1: 256 levels = 8-bit equivalent, ideal for INT8 inference.
  • QAT: simulate quantization during training → better outcomes.
  • Stochastic rounding: unbiased, useful for gradients.

Vision: Ultra-Low-Bit AI and SIDRA's Path

AI bit depths keep dropping: FP32 → FP16 → INT8 → INT4 → INT2. Limit: 1-bit (binary networks).

  • Y1 (today): 8-bit suffices. INT8 inference standard.
  • Y3 (2027): Mixed precision (per-layer 4/8/16-bit). 95% of layers INT8, 5% INT4.
  • Y10 (2029): Multi-cell 12-bit + stochastic rounding. FP16-equivalent inference.
  • Y100 (2031+): Analog FP16 (mantissa + exponent in separate cells). Training+inference on the same chip.
  • Y1000 (long horizon): Logarithmic + stochastic + analog FP. Brain-style fully analog.

Meaning for Türkiye: low-bit AI design is mature in digital hardware (NVIDIA TensorRT, Apple CoreML). Open territory on the analog side — SIDRA could be the first major analog INT8 platform. Türkiye’s academia + industry have enough capacity to play this race.

Unexpected future: the 1-bit world. All weights ±1 (BinaryConnect, XNOR-Net). MVM becomes XNOR + popcount. A SIDRA crossbar with 1-bit memristors (HRS/LRS only) → maximum density + minimum energy. 2030+ horizon, but the direction is set.

Further Reading

  • Next chapter: 4.7 — Information Theory: Entropy and Channel
  • Previous: 4.5 — Fourier Transform
  • Classical quantization: Gray & Neuhoff, Quantization, IEEE Trans. Inf. Theory 1998.
  • AI quantization survey: Gholami et al., A Survey of Quantization Methods for Efficient Neural Network Inference, arXiv 2021.
  • QAT: Jacob et al., Quantization and training of neural networks for efficient integer-arithmetic-only inference, CVPR 2018.
  • BinaryConnect: Courbariaux et al., Training deep neural networks with weights and activations constrained to +1 or -1, NeurIPS 2015.
  • Stochastic rounding: Gupta et al., Deep learning with limited numerical precision, ICML 2015.