Information Theory: Entropy and Channel
How much information does one bit carry — and how much does a SIDRA cell?
What you'll learn here
- Define Shannon entropy ($H = -\sum p \log p$) and state its intuitive meaning
- Explain channel capacity ($C = B \log_2(1 + \text{SNR})$) and the Shannon-Hartley theorem
- Compute the real information capacity of a SIDRA cell
- Summarize information theory's roles in AI (cross-entropy, KL divergence)
- Compare theoretical and practical information limits of the SIDRA crossbar
Hook: What Is Information?
You want to predict a coin flip. The result is told to you → you gained 1 bit of information. Probability 50/50 → full uncertainty → exactly 1 bit per outcome.
A six-sided die: result → log₂(6) ≈ 2.58 bits.
If a coin is biased (heads with prob 0.99) → the result is mostly expected → very little information (~0.08 bits). Information = reduction in uncertainty.
In 1948, Claude Shannon's A Mathematical Theory of Communication made information measurable. He also defined channel capacity (how much information per second a channel can carry). That paper laid the foundation of modern communication, compression, and cryptography.
Interesting question for SIDRA: how much information does a memristor cell actually carry? Theoretically 8 bits, but noise caps it. This chapter does that math.
Intuition: Entropy = Uncertainty
Entropy (H): the average information content of a random event.
- Single possible outcome ($p = 1$): $H = 0$ → no information (you already knew).
- $n$ equally likely outcomes: $H = \log_2 n$ → maximum.
- In general $0 \le H \le \log_2 n$.
Examples:
- Fair coin: $H = 1$ bit.
- Fair die: $H = \log_2 6 \approx 2.58$ bits.
- Biased coin (0.99/0.01): $H \approx 0.08$ bits.
- English letters: ~4.1 bits/letter (idealized 27 letters equally distributed → $\log_2 27 \approx 4.75$ bits).
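These values can be checked with a few lines of Python (a minimal sketch; the helper name `entropy` is ours, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p*log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # fair coin -> 1.0
print(entropy([1/6] * 6))       # fair die  -> ~2.585
print(entropy([0.99, 0.01]))    # biased coin -> ~0.081
print(entropy([1/27] * 27))     # 27 equiprobable symbols -> ~4.755
```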
Information compression:
Entropy sets the smallest possible average representation. Theoretical compression limits:
- Fair coin: 1 bit/flip (already optimal).
- Biased coin: 0.08 bits/flip → 12× compression possible.
- English: 4.1 bits/letter → ~2× more efficient than 8-bit ASCII.
ZIP, JPEG, MP3, Brotli — all push toward the entropy limit.
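A quick demonstration of that limit: compress a stream of biased coin flips with `zlib` and compare against $n \cdot H$. (A sketch; the flip count and random seed are arbitrary choices, not from the text.)

```python
import math
import random
import zlib

random.seed(0)
n = 100_000  # number of simulated coin flips

# Biased coin: H = -0.99*log2(0.99) - 0.01*log2(0.01) ~ 0.081 bits/flip
flips = bytes(1 if random.random() < 0.01 else 0 for _ in range(n))

# Pack 8 flips per byte so zlib sees the raw bitstream, not one byte per flip
packed = bytes(sum(flips[i + j] << j for j in range(8)) for i in range(0, n, 8))
compressed = zlib.compress(packed, 9)

h = -(0.99 * math.log2(0.99) + 0.01 * math.log2(0.01))
print(f"raw: {len(packed)} bytes, zlib: {len(compressed)} bytes, "
      f"entropy limit: {h * n / 8:.0f} bytes")
```

The compressed size lands well below the raw 12.5 kB but cannot beat the ~1 kB entropy floor.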
In SIDRA terms:
A cell stores 256 levels = $\log_2 256 = 8$ theoretical bits. But:
- 5% programming error → some levels indistinguishable → effective level count drops.
- Thermal/shot noise → uncertainty per read.
- Effective entropy: ~6 bits (we saw in 4.4).
So a SIDRA Y1 cell carries ~6 bits in practice, not 8. That loss is fundamental — directly tied to SNR.
Formalism: Entropy, Cross-Entropy, Channel Capacity
Shannon entropy:

$$H(X) = -\sum_i p_i \log_2 p_i \quad \text{(bits)}$$

or in nats (natural log):

$$H(X) = -\sum_i p_i \ln p_i$$
Joint entropy:

Two RVs $X, Y$:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$
Conditional entropy:

$$H(Y \mid X) = H(X, Y) - H(X)$$

“Remaining uncertainty about $Y$ once $X$ is known.”
Mutual information:

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)$$

“Information $X$ and $Y$ share.”
Important property: $I(X; Y) \ge 0$, with equality exactly when $X$ and $Y$ are independent. In AI: $I(X; Y)$ measures information flow from input to output.
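The identities above can be verified numerically; the joint distribution below is a made-up illustrative example:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) of two correlated binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hx, Hy = H(px), H(py)
Hxy = H(list(joint.values()))
I = Hx + Hy - Hxy            # mutual information
H_y_given_x = Hxy - Hx       # conditional entropy H(Y|X)

print(Hx, Hy, Hxy, I, H_y_given_x)
```

Here $H(X) = H(Y) = 1$ bit, $H(X,Y) \approx 1.72$ bits, so $I(X;Y) \approx 0.28$ bits — positive, because the variables are correlated.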
Channel capacity (Shannon-Hartley):
Continuous (analog) channel with bandwidth $B$ and signal-to-noise ratio SNR:

$$C = B \log_2(1 + \text{SNR}) \quad \text{bit/s}$$

- $B$: bandwidth (Hz).
- SNR: signal-to-noise power ratio (linear, not dB).
Practical:
- Telephone line (3 kHz, SNR ~1000): $C \approx 3000 \times \log_2(1001) \approx 30$ kbit/s. (V.34 modem speeds.)
- WiFi 802.11ac (160 MHz, SNR ~30 dB = 1000): $C \approx 1.6$ Gbit/s.
- 5G (100 MHz, SNR ~30 dB): $C \approx 1$ Gbit/s.
SIDRA read channel:
A cell read takes 10 ns ($B$ = 100 MHz). SNR ~30 dB (~1000): $C = 10^8 \times \log_2(1001) \approx 10^9$ bit/s = 1 Gbit/s per cell read.
A single MVM reads 256 cells in parallel → 256 × 1 Gbit/s = 256 Gbit/s crossbar throughput. Practical AI inference uses far less than this; it’s the physical limit.
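The capacity figures above follow directly from the Shannon-Hartley formula (a sketch; the function name is ours):

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley: C = B * log2(1 + SNR), in bit/s. SNR is a linear power ratio."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Figures from the text; 30 dB corresponds to a linear SNR of 1000
print(f"telephone: {capacity(3e3, 1000) / 1e3:.0f} kbit/s")
print(f"WiFi ac  : {capacity(160e6, 1000) / 1e9:.2f} Gbit/s")
print(f"5G       : {capacity(100e6, 1000) / 1e9:.2f} Gbit/s")
print(f"cell read: {capacity(100e6, 1000) / 1e9:.2f} Gbit/s")
```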
Cross-entropy:
Two distributions, $p$ (true) and $q$ (model prediction):

$$H(p, q) = -\sum_i p_i \log q_i$$

The standard AI classification loss is cross-entropy; it drops as the model's $q$ approaches the true distribution $p$.
KL divergence (Kullback-Leibler):

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)$$

A “distance” between two distributions (asymmetric: $D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p)$ in general). In AI: regularization (Bayesian VI, ELBO).
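A minimal sketch of both quantities (in nats; the two distributions are made-up illustrations):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) = sum p_i * log(p_i / q_i) = H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.6, 0.3, 0.1]   # model prediction

print(cross_entropy(p, q))   # > H(p) whenever q != p
print(kl(p, q))              # >= 0, zero iff p == q
print(kl(p, q) - kl(q, p))   # asymmetry: generally nonzero
```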
Effective information capacity of a SIDRA cell:
8-bit programming ($2^8 = 256$ levels). Gaussian read noise with standard deviation $\sigma$. Number of distinguishable levels:

$$N_{\text{eff}} = \frac{G_{\max} - G_{\min}}{4\sigma}$$

(4σ rule: ~2 standard deviations on each side of each level, ~95% distinguishability.)
SIDRA Y1: plugging in the conductance range $G_{\max} - G_{\min}$ and the programming $\sigma$ (both in µS, from 4.4) gives $N_{\text{eff}} \approx 6$.

That's only 6 distinguishable levels = $\log_2 6 \approx 2.6$ effective bits? Seems low.

Refinement: the programming $\sigma$ overestimates the read-time noise. With ISPP the effective $\sigma$ shrinks by roughly an order of magnitude → $N_{\text{eff}} \approx 64$ → $\log_2 64 = 6$ effective bits. More realistic.
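The 4σ calculation can be sketched as follows. The range and σ values here are illustrative placeholders, chosen only to reproduce the ~6-level and ~64-level cases above; they are not measured SIDRA figures:

```python
import math

def effective_bits(g_range_uS, sigma_uS):
    """Distinguishable levels under the 4-sigma rule: N = range / (4*sigma)."""
    n_levels = g_range_uS / (4 * sigma_uS)
    return n_levels, math.log2(n_levels)

# Hypothetical 100 uS conductance range; a sigma giving ~6 levels,
# then a 10x smaller (ISPP-like) sigma giving ~64 levels.
for sigma in (4.0, 0.4):
    n, bits = effective_bits(100.0, sigma)
    print(f"sigma={sigma} uS -> {n:.0f} levels, {bits:.1f} effective bits")
```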
Crossbar level:
Reading 256 columns in parallel → SNR rises by $\sqrt{256} = 16\times$ → per-cell information capacity:

$$C = B \log_2(1 + 16 \cdot \text{SNR}) \approx 1.5 \text{ Gbit/s}$$

A single MVM with 256 columns = 256 × 1.5 Gbit/s ≈ 400 Gbit/s. Again the physical limit; practical AI uses far less.
Information bottleneck (Tishby 1999):
Neural-net training = maximize mutual information between input and output, while compressing intermediate-layer information. Modern deep-learning theory.
Why SIDRA cares: naturally “information-compressing” layers (noisy, bit-limited). The information bottleneck theory naturally supports SIDRA hardware — modern training targets “enough information”, not “lossless”.
Brain information capacity:
- 86B neurons × ~1 Hz average firing × ~1 bit/spike ≈ 10¹¹ bit/s “spike rate code” (rough).
- More accurate: spike timing (ms precision), sparse coding → ~10¹³-10¹⁴ bit/s.
- But the brain uses less than 1% of that as meaningful information (sensory redundancy).
SIDRA Y100 target: ~10¹³ bit/s analog throughput → matches synaptic bandwidth.
Experiment: Compute an Entropy
Approximate English letter probabilities:
| Letter | Probability $p$ | $-p \log_2 p$ |
|---|---|---|
| e | 0.13 | 0.382 |
| t | 0.09 | 0.313 |
| a | 0.08 | 0.292 |
| o | 0.075 | 0.281 |
| i | 0.07 | 0.269 |
| … | … | … |
| z | 0.001 | 0.0099 |
Sum (26 letters): $H \approx 4.1$ bits/letter.
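The sum can be reproduced with standard published English letter frequencies (approximate values in percent; this table is not from the text and differs slightly from the one above):

```python
import math

# Approximate English letter frequencies (percent), standard published values
freq = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07,
}
total = sum(freq.values())  # normalize so the probabilities sum to 1
H = -sum((f / total) * math.log2(f / total) for f in freq.values())
print(f"H = {H:.2f} bits/letter")   # close to the ~4.1 quoted above
```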
Comparison:
- ASCII: 8 bits/letter → ~49% of the bits are wasted.
- Optimal entropy coding (Huffman): ~4.1 bits/letter → at the entropy limit (0% waste).
- Modern language compression (Brotli): ~3.5 bits/letter (adds word + language model).
SIDRA cell’s effective entropy:
8-bit programming but 6-bit effective with noise (from 4.4):
$H_{\text{eff}} \approx 6$ bits.
256-cell crossbar column: $256 \times 6 = 1536$ bits. But dependencies (a single noise source affects all cells) → effectively slightly less.
Practical: SIDRA Y1 419M cells × 6 bits = ~2.5 Gbit total stored information. A typical small AI model (GPT-2: 124M params × 8 bit = 1 Gbit) fits in Y1.
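The storage arithmetic, as a quick back-of-envelope check (numbers from the text):

```python
# SIDRA Y1 stored-information estimate vs. a small AI model
cells = 419e6                  # Y1 cell count
eff_bits = 6                   # effective bits per cell (noise-limited)
capacity_gbit = cells * eff_bits / 1e9

gpt2_gbit = 124e6 * 8 / 1e9    # GPT-2: 124M params at 8 bits each

print(f"Y1 capacity: {capacity_gbit:.2f} Gbit, GPT-2: {gpt2_gbit:.2f} Gbit")
print("fits" if gpt2_gbit < capacity_gbit else "does not fit")
```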
Quick Quiz
Lab Exercise
Information flow in SIDRA Y1 MNIST classification.
Scenario:
- MNIST input: 28×28 = 784 pixels × 8 bits = 6272 bits/image.
- Output: 10 classes → log₂ 10 ≈ 3.32 bits/image.
- Required compression: 6272 / 3.32 ≈ 1900×.
SIDRA Y1 model: 2-layer MLP, 784 → 128 → 10. Each layer in a SIDRA crossbar.
Questions:
(a) Total information processed per inference (inputs × weights × outputs)?
(b) Initial cross-entropy (random model)?
(c) Trained (FP32 model) cross-entropy?
(d) Increase in cross-entropy after SIDRA INT8 quantization?
(e) Information-theoretically, is the weight information (8-bit × 100K params) excessive for MNIST?
Solutions
(a) Input 6272 bits. Weights 100K × 8 = 800 kbit. Output 3.32 bits. Total information flow: input + weights + intermediate activations ≈ 800 kbit/inference (weights dominate).
(b) Random 10-class model: $H = \log_2 10 \approx 3.32$ bits. Initial cross-entropy ≈ 2.30 nats = 3.32 bits (uniform prediction).
(c) Well-trained MNIST: cross-entropy ≈ 0.05-0.10 nats. Very low. The model is highly confident.
(d) INT8 quantized cross-entropy ≈ 0.06-0.12 nats. Tiny rise. Accuracy loss 0.2%.
(e) Optimal model size for MNIST (information-theoretic): ~50K-100K parameters suffice (entropy-based capacity analysis). Y1 100K params = optimal. More would risk overfitting. SIDRA Y1’s size is “just right” for MNIST.
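The baseline in (b) is just the entropy of a uniform 10-class distribution, easy to verify:

```python
import math

n_classes = 10
# A model that predicts uniformly (random guessing) has cross-entropy
# equal to the entropy of the uniform distribution: ln(10) nats = log2(10) bits
ce_nats = math.log(n_classes)
ce_bits = math.log2(n_classes)
print(f"{ce_nats:.2f} nats = {ce_bits:.2f} bits")   # 2.30 nats = 3.32 bits
```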
Note: large modern models (BERT, GPT) use far more parameters because they need more complex distributions. Y1 is undersized for big LLMs; Y10+ needed.
Cheat Sheet
- Entropy: $H = -\sum_i p_i \log_2 p_i$. Uncertainty measure.
- Maximum: $H = \log_2 n$ (uniform); minimum: 0 (certainty).
- Joint, conditional, mutual information: entropy variants.
- Channel capacity: $C = B \log_2(1 + \text{SNR})$ (Shannon-Hartley).
- Cross-entropy: AI classification loss.
- KL divergence: distribution distance, regularization.
- SIDRA cell: ~6 effective bits, ~1 Gbit/s read capacity.
- Information bottleneck: AI nets maximize information flow + compress.
Vision: Information-Aware AI Hardware
Modern AI hardware is usually rated by FLOPs. Information-theoretic metrics are more accurate: “how many meaningful bits per second?”
- Y1 (today): 6 effective bits/cell. Enough for INT8 models.
- Y3 (2027): 8 effective bits/cell (ISPP improvement). Exact reproduction of INT8 models.
- Y10 (2029): Multi-cell 12 bits. FP16 equivalent. More complex models (BERT-large, GPT-2).
- Y100 (2031+): 16 bits + dynamic range. GPT-3-class models at the edge.
- Y1000 (long horizon): 24+ bits + analog FP. Approaching brain-scale capacity.
Meaning for Türkiye: information-aware hardware design is a fresh paradigm. SIDRA + Information Bottleneck Theory + academic research → Türkiye can make a distinctive contribution to AI architecture.
Unexpected future: information-conserving AI. Like thermodynamics: in a closed system, information is preserved. Reversible computing approaches it → no energy. SIDRA Y1000 target: sub-Landauer information processing. Sci-fi today, but a clear direction.
Further Reading
- Next chapter: 4.8 — Linear Algebra Laboratory
- Previous: 4.6 — Quantization and Quantization Error
- Classical reference: Shannon, A Mathematical Theory of Communication, Bell System Tech. J. 1948.
- Modern textbook: Cover & Thomas, Elements of Information Theory, 2nd ed.
- Compression: MacKay, Information Theory, Inference, and Learning Algorithms.
- Information bottleneck: Tishby, Pereira, Bialek, The information bottleneck method, arXiv 2000.
- Deep learning + IB: Tishby & Zaslavsky, Deep learning and the information bottleneck principle, ITW 2015.