📐 Module 4 · The Math Arsenal · Chapter 4.6 · 11 min read

Quantization and Quantization Error

Continuous analog world, discrete digital storage — SIDRA sits in between.

What you'll learn here

  • Define quantization and explain why it's necessary
  • Cover the roles of ADC and DAC and the bit-depth concept
  • Compute the quantization error ($\sigma_q^2 = \Delta^2/12$)
  • State the practical impact of INT8, INT4, FP16 quantized AI models
  • Show why a 256-level SIDRA cell equals 8-bit quantization

Hook: A Numerical Ruler for the Continuous World

Nature is continuous: voltage, current, sound, light — all unbroken. Computers are discrete: 0s and 1s, finite states.

The operation that bridges them is quantization: map a continuous value to one of a finite set of levels.

  • A thermometer reads 23.7°C → display shows “24” → the reading is rounded to the nearest mark.
  • Microphone analog audio → 16-bit ADC → 65,536 discrete levels.
  • An AI model’s weights are trained in FP32 (4 bytes) → quantized to INT8 (1 byte) for inference → 4× compression.

Quantization at the heart of SIDRA:

  • A memristor cell → 256 levels (8 bits). Maps the continuous conductance range onto 256 discrete values.
  • ADC → analog current to 8-bit digital.
  • DAC → 8-bit digital back to analog voltage.

Quantization error is unavoidable. Some information is lost when the continuous signal becomes discrete. This chapter covers the math and practice of that loss, and what it means for SIDRA.

Intuition: Ruler Markings = Precision

The denser a ruler’s tick marks, the more precise the measurement. Same for quantization:

Bits | Levels | Tick density
1 | 2 | only 0 and 1 (no middle)
4 | 16 | ~6% precision per level
8 | 256 | ~0.4% per level
16 | 65,536 | ~0.0015%
32 | ~4 billion | float-like

Rounding error (quantization error): the deviation introduced by snapping a continuous value to the nearest discrete level.

  • Range $[A, B]$, $N$ levels → step $\Delta = (B - A)/N$.
  • Worst-case error: $\pm\Delta/2$.
  • Mean-square error: $\sigma_q^2 = \Delta^2/12$ (for a uniform error distribution).

SIDRA example:

  • Cell conductance range: 1 µS – 100 µS, 256 levels → $\Delta = (100 - 1)/256 \approx 0.39$ µS.
  • $\sigma_q = \Delta/\sqrt{12} \approx 0.11$ µS.
  • For $G = 50$ µS, relative error: $0.11/50 = 0.22\%$ → ~9 effective bits (quantization alone).
  • But noise + drift combined → ~6 effective bits (as we saw in 4.4).
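A quick Python check of these numbers (the 1–100 µS range and the G = 50 µS operating point are taken from the example above):

```python
import math

A, B, N = 1e-6, 100e-6, 256        # conductance range 1-100 µS, 256 levels
delta = (B - A) / N                # quantization step, ~0.39 µS
sigma_q = delta / math.sqrt(12)    # RMS quantization error, ~0.11 µS
rel_err = sigma_q / 50e-6          # relative error at G = 50 µS, ~0.22%

print(f"delta = {delta * 1e6:.2f} uS, sigma_q = {sigma_q * 1e6:.2f} uS, "
      f"relative error = {rel_err * 100:.2f} %")
```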

Bottom line: quantization is a fundamental floor, but in practice noise dominates. For SIDRA, 8-bit quantization is good enough.

Formalism: Quantization Error and Bit Depth

L1 · Basics

Uniform quantization:

Range $[A, B]$, $N = 2^b$ levels ($b$ bits). Step:

$$\Delta = \frac{B - A}{N}$$

Continuous value $x$ → nearest level $\hat{x} = A + \lfloor (x - A)/\Delta + 0.5 \rfloor \cdot \Delta$.

Error: $e = x - \hat{x}$, with $|e| \leq \Delta/2$.

Quantization noise (uniform assumption):

$e \sim \text{Uniform}(-\Delta/2, +\Delta/2)$, so $E[e] = 0$ and $\text{Var}[e] = \Delta^2/12$.
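A minimal Python implementation of this quantizer (a sketch of the formula above, not SIDRA-specific code):

```python
import math

def quantize(x, A, B, bits):
    """Snap x to the nearest of 2**bits uniformly spaced levels in [A, B]."""
    N = 2 ** bits
    delta = (B - A) / N
    k = min(math.floor((x - A) / delta + 0.5), N - 1)  # nearest level index, clamped
    return A + k * delta

x = 0.347
x_hat = quantize(x, -1.0, 1.0, 8)       # 8-bit over [-1, 1]: x_hat = 0.34375
assert abs(x - x_hat) <= (2 / 256) / 2  # the |e| <= delta/2 bound holds
```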

SQNR (Signal-to-Quantization-Noise Ratio):

For a full-range sinusoid:

$$\text{SQNR}_{\text{dB}} = 6.02\,b + 1.76$$

Each extra bit gains ~6 dB. SIDRA Y1 8-bit cell → ~50 dB SQNR.
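The 6 dB/bit rule in one line (full-range sinusoid assumption):

```python
def sqnr_db(bits):
    """SQNR (dB) of a full-range sinusoid after b-bit uniform quantization."""
    return 6.02 * bits + 1.76

print(f"{sqnr_db(8):.2f} dB")  # 49.92 dB: the SIDRA Y1 8-bit cell figure
```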

Bit-depth impact:

Bits | SQNR (dB) | Typical use
1 | ~7.78 | aggressive compression, limited
4 | ~25.84 | INT4 quantized LLMs (experimental)
8 | ~49.92 | INT8 inference (standard)
16 | ~98.08 | FP16 training (half float)
32 | ~194.40 | FP32 training (full float)

L2 · Full

Quantization trends in modern AI:

Training → usually FP32 or FP16 (high precision for gradients).

Inference → INT8 standard, INT4 spreading, INT2-INT3 experimental.

Why is this possible? AI models are noise-tolerant. Non-linear activations absorb small errors. Robustly trained models tolerate dropping from INT8 to INT4 with only a 1-3% accuracy loss.

Quantization-Aware Training (QAT):

Simulate quantization during training → the model adapts to the error. Outperforms post-training quantization. Typical QAT: 0.5% accuracy loss at INT4.
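A minimal numpy sketch of QAT's fake-quantization forward pass. The function name and the fixed [-1, 1] range are illustrative assumptions, not a specific framework's API; real QAT also calibrates the range and uses a straight-through estimator in the backward pass.

```python
import numpy as np

def fake_quantize(w, bits=4, w_min=-1.0, w_max=1.0):
    """Quantize then dequantize, so training sees the rounding error."""
    n = 2 ** bits
    delta = (w_max - w_min) / n
    k = np.clip(np.floor((w - w_min) / delta + 0.5), 0, n - 1)
    return w_min + k * delta

w = np.array([0.347, -0.120, 0.810])
w_q = fake_quantize(w, bits=4)   # snapped to 16 levels: [0.375, -0.125, 0.75]
```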

Two-sided quantization:

  • Weights: uniformly INT8. In hardware, 256-level memristor.
  • Activations: INT8 dynamic range. Per-layer min/max calibration.
  • Accumulator: INT16 or INT32 (avoid overflow on additions).

SIDRA fits this: weight = memristor (8-bit), activation = ADC output (8-bit), accumulator = CMOS sum (16-bit).
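This weight/activation/accumulator split can be sketched as a toy INT8 dot product in numpy (an illustration of why the wide accumulator matters, not SIDRA's actual datapath):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=256, dtype=np.int8)  # INT8 weights (memristor levels)
x = rng.integers(-128, 128, size=256, dtype=np.int8)  # INT8 activations (ADC output)

# One INT8*INT8 product needs up to 16 bits; summing 256 of them adds
# log2(256) = 8 more, so a 32-bit accumulator is comfortably overflow-free.
acc = int(np.dot(w.astype(np.int32), x.astype(np.int32)))
reference = sum(int(a) * int(b) for a, b in zip(w, x))
assert acc == reference
```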

Mixed precision:

Some layers INT8, some INT4, some FP16 — by sensitivity. Modern practice. SIDRA Y10 target: per-layer quantization (compiler decision, 6.7).

L3 · Deep

Logarithmic (log) quantization:

Logarithmic level distribution instead of uniform. Good for wide-dynamic-range signals like sound (perception in dB is logarithmic).

$$\hat{x} = \text{sign}(x) \cdot 2^{\lfloor \log_2 |x| \rfloor}$$

Advantage: precise at small values, coarse at large. Logarithmic Number System (LNS) is a candidate in some AI areas.
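A one-function sketch of this power-of-two quantizer:

```python
import math

def log_quantize(x):
    """Snap |x| down to the nearest power of two, keeping the sign."""
    if x == 0:
        return 0.0
    return math.copysign(2.0 ** math.floor(math.log2(abs(x))), x)

log_quantize(0.3)    # → 0.25   (fine steps near zero)
log_quantize(700.0)  # → 512.0  (coarse steps for large values)
```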

Stochastic rounding:

Classical rounding is deterministic. Stochastic rounding rounds up or down with probability proportional to the distance:

$$\hat{x} = \begin{cases} \lfloor x/\Delta \rfloor \cdot \Delta & \text{with prob. } 1-p \\ \lceil x/\Delta \rceil \cdot \Delta & \text{with prob. } p \end{cases}, \quad p = (x \bmod \Delta)/\Delta$$

Key property: unbiased on average ($E[\hat{x}] = x$). Useful for training (no gradient bias).

SIDRA Y10+ candidate: controlled noise + stochastic rounding → training accuracy improves.
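The unbiasedness is easy to verify empirically. A plain-Python sketch, using Δ = 0.125 as the step:

```python
import random

def stochastic_round(x, delta):
    """Round x down to a multiple of delta, or up with probability (x mod delta)/delta."""
    lo = (x // delta) * delta
    p = (x - lo) / delta
    return lo + delta if random.random() < p else lo

random.seed(1)
samples = [stochastic_round(0.347, 0.125) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# mean ≈ 0.347: unbiased, while deterministic rounding always gives 0.375
```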

Numerical instability with quantization:

Very small gradients ($< \Delta/2$) get rounded to zero → “underflow”. Modern Transformer training commonly uses bf16 (brain float): an exponent as wide as FP32's, with a shorter mantissa. This solves the underflow problem.

For SIDRA: gradients won’t fit in INT8 → training needs an FP16/bf16 digital coprocessor. For inference, INT8 is enough.

More than 8 bits per cell?

SIDRA Y10 target: multi-cell combinations. Two 16-level (4-bit) cells combined give 16 × 16 = 256 levels = 8 bits (no gain over one 8-bit cell), or 1024 levels = 10 bits with a logarithmic combination; four such cells → 16 bits. But noise compounds, so the effective bit count may stay flat.

Practical decision: for SIDRA, single-cell 8-bit + digital accumulator → effective 12-16 bits. Plenty for modern AI inference.

Experiment: 4-bit vs 8-bit Quantization Impact

A model’s weights lie in $w \in [-1, 1]$. Compare INT8 and INT4 quantization.

INT8: 256 levels, $\Delta = 2/256 \approx 0.0078$.

Example weight $w = 0.347$ → nearest level: $\hat{w} = 0.3437$ (level 44, zero-centered). Error: $0.0033$.

INT4: 16 levels, $\Delta = 2/16 = 0.125$.

Same weight → nearest level: $\hat{w} = 0.3125$ (level 3 with offset). Error: $0.034$, 10× larger.

1000-weight RMS error:

  • INT8 RMS: $0.0078/\sqrt{12} \approx 0.0023$.
  • INT4 RMS: $0.125/\sqrt{12} \approx 0.036$, 16× larger.
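These RMS figures are easy to reproduce with 1000 random weights (nearest-level rounding in numpy; the theoretical value is Δ/√12):

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.uniform(-1, 1, size=1000)  # 1000 random weights in [-1, 1]

for bits in (8, 4):
    delta = 2 / 2 ** bits
    w_hat = np.round(w / delta) * delta            # snap to the nearest level
    rms = np.sqrt(np.mean((w - w_hat) ** 2))
    print(f"INT{bits}: RMS = {rms:.4f} (theory {delta / np.sqrt(12):.4f})")
```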

MNIST inference impact:

  • FP32 baseline: 98.0%
  • INT8: 97.8% (0.2% loss)
  • INT4: 96.5% (1.5% loss)
  • INT2: 88% (10% loss)

SIDRA Y1: 256 levels = INT8 equivalent → expect 97.8% on MNIST. Practically indistinguishable from FP32 to a user.

More complex model (BERT-base):

  • FP32: 88% (GLUE)
  • INT8: 87.5%
  • INT4: 85%
  • SIDRA Y1: expect 87.5%.

Bottom line: SIDRA’s 8-bit quantization is consistent with industry inference standards.

Quick Quiz

Question 1 of 6: What's the variance of the (uniform) quantization error?

Lab Exercise

Quantization performance: SIDRA Y1 vs Y10 vs Y100.

Data:

  • Y1: 256 levels/cell (8-bit), single cell.
  • Y10 (target): multi-cell combinations → effective 10-12 bits.
  • Y100 (vision): mixed-precision + analog FP-equivalent ~16 bits.
  • Test model: ResNet-50 ImageNet (Top-1 accuracy 76.0% FP32 baseline).

Literature:

  • INT8: 75.2% (0.8% loss)
  • INT4: 73.5% (2.5% loss)
  • INT2: 68.0% (8% loss)

Questions:

(a) Expected ResNet-50 inference accuracy on SIDRA Y1 (8-bit)?
(b) On Y10 (12-bit effective)?
(c) On Y100 (16-bit effective)?
(d) Y1 with MobileNet-V2: typical loss?
(e) ResNet-50 inference time: Y1 vs H100?

Solutions

(a) Y1 INT8 → ~75.2%. Production-acceptable (smartphone cameras use INT8).

(b) Y10 at ~12 effective bits approaches FP16 precision → ~75.7-75.9%. Indistinguishable from FP32.

(c) Y100 16-bit → 76.0%. Fully equivalent. Even suitable for training.

(d) MobileNet-V2 is more quantization-sensitive (depthwise conv amplifies small errors). FP32 72.0%, INT8 ~71.0% (1% loss). Y1 holds ~71%. Still practical.

(e) ResNet-50: 4.1B FLOP/inference. Y1: 30 TOPS analog → ~140 µs/inference. H100: 50 TFLOPS sustained → ~80 µs (batch 1; ~3 µs/inference at batch 32). Latency: H100 1.5-50× faster; energy: SIDRA 50× less.
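The latency arithmetic in (e), for reference (throughput figures as assumed in the text):

```python
flops = 4.1e9          # ResNet-50 FLOPs per inference
sidra_ops = 30e12      # SIDRA Y1 analog throughput, ops/s
h100_ops = 50e12       # H100 sustained FLOPS assumed above

t_sidra_us = flops / sidra_ops * 1e6   # ≈ 137 µs
t_h100_us = flops / h100_ops * 1e6     # ≈ 82 µs
```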

Practice: SIDRA wins at the edge with tight power budgets (smart camera, IoT); H100 wins in the data center. SIDRA isn’t trying to replace the H100 — it sits beside it.

Cheat Sheet

  • Quantization: continuous → discrete mapping. Bit depth = number of levels.
  • Step: Δ=(BA)/N\Delta = (B-A)/N, N=2bN = 2^b.
  • Error: uniform σq2=Δ2/12\sigma_q^2 = \Delta^2/12.
  • SQNR: ~6 dB per bit.
  • Modern AI standard: INT8 inference, INT4 spreading.
  • SIDRA Y1: 256 levels = 8-bit equivalent, ideal for INT8 inference.
  • QAT: simulate quantization during training → better outcomes.
  • Stochastic rounding: unbiased, useful for gradients.

Vision: Ultra-Low-Bit AI and SIDRA's Path

AI bit depths keep dropping: FP32 → FP16 → INT8 → INT4 → INT2. Limit: 1-bit (binary networks).

  • Y1 (today): 8-bit suffices. INT8 inference standard.
  • Y3 (2027): Mixed precision (per-layer 4/8/16-bit). 95% of layers INT8, 5% INT4.
  • Y10 (2029): Multi-cell 12-bit + stochastic rounding. FP16-equivalent inference.
  • Y100 (2031+): Analog FP16 (mantissa + exponent in separate cells). Training+inference on the same chip.
  • Y1000 (long horizon): Logarithmic + stochastic + analog FP. Brain-style fully analog.

Meaning for Türkiye: low-bit AI design is mature in digital hardware (NVIDIA TensorRT, Apple CoreML). Open territory on the analog side — SIDRA could be the first major analog INT8 platform. Türkiye’s academia + industry have enough capacity to play this race.

Unexpected future: the 1-bit world. All weights ±1 (BinaryConnect, XNOR-Net). MVM becomes XNOR + popcount. A SIDRA crossbar with 1-bit memristors (HRS/LRS only) → maximum density + minimum energy. 2030+ horizon, but the direction is set.

Further Reading

  • Next chapter: 4.7 — Information Theory: Entropy and Channel
  • Previous: 4.5 — Fourier Transform
  • Classical quantization: Gray & Neuhoff, Quantization, IEEE Trans. Inf. Theory 1998.
  • AI quantization survey: Gholami et al., A Survey of Quantization Methods for Efficient Neural Network Inference, arXiv 2021.
  • QAT: Jacob et al., Quantization and training of neural networks for efficient integer-arithmetic-only inference, CVPR 2018.
  • BinaryConnect: Courbariaux et al., Training deep neural networks with weights and activations constrained to +1 or -1, NeurIPS 2015.
  • Stochastic rounding: Gupta et al., Deep learning with limited numerical precision, ICML 2015.