Information Theory: Entropy and Channel
How much information does one bit carry — and how much does a SIDRA cell?
What you'll learn here
- Define Shannon entropy ($H = -\sum p \log p$) and state its intuitive meaning
- Explain channel capacity ($C = B \log_2(1 + \text{SNR})$) and the Shannon-Hartley theorem
- Compute the real information capacity of a SIDRA cell
- Summarize information theory's roles in AI (cross-entropy, KL divergence)
- Compare theoretical and practical information limits of the SIDRA crossbar
Hook: What Is Information?
You want to predict a coin flip. The result is told to you → you gained 1 bit of information. Probability 50/50 → full uncertainty → exactly 1 bit per outcome.
A six-sided die: result → log₂(6) ≈ 2.58 bits.
If a coin is biased (heads with prob 0.99) → the result is mostly expected → very little information (~0.08 bits). Information = reduction in uncertainty.
In 1948, Claude Shannon's A Mathematical Theory of Communication made information measurable. He also defined channel capacity (how much information per second a channel can carry). That paper laid the foundation of modern communication, compression, and cryptography.
Interesting question for SIDRA: how much information does a memristor cell actually carry? Theoretically 8 bits, but noise caps it. This chapter does that math.
Intuition: Entropy = Uncertainty
Entropy (H): the average information content of a random event.
- Single possible outcome ($p = 1$): $H = 0$ → no information (you already knew).
- $n$ equally likely outcomes: $H = \log_2 n$ → maximum.
- In general $0 \le H \le \log_2 n$.
Examples:
- Fair coin: $H = 1$ bit.
- Fair die: $H = \log_2 6 \approx 2.58$ bits.
- Biased coin (0.99/0.01): $H \approx 0.08$ bits.
- English letters: ~4.1 bits/letter (idealized 27 letters equally distributed → $\log_2 27 \approx 4.75$ bits).
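These values can be checked with a few lines of Python (a minimal sketch; the helper name `entropy` is ours, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p*log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # fair coin -> 1.0
print(entropy([1/6] * 6))       # fair die  -> ~2.585
print(entropy([0.99, 0.01]))    # biased coin -> ~0.081
print(entropy([1/27] * 27))     # 27 equiprobable symbols -> ~4.755
```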
Information compression:
Entropy sets the smallest possible average representation. Theoretical compression limits:
- Fair coin: 1 bit/flip (already optimal).
- Biased coin: 0.08 bits/flip → 12× compression possible.
- English: 4.1 bits/letter → ~2× more efficient than 8-bit ASCII.
ZIP, JPEG, MP3, Brotli — all push toward the entropy limit.
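A quick demonstration of that limit: compress a stream of biased coin flips with `zlib` and compare against $n \cdot H$. (A sketch; the flip count and random seed are arbitrary choices, not from the text.)

```python
import math
import random
import zlib

random.seed(0)
n = 100_000  # number of simulated coin flips

# Biased coin: H = -0.99*log2(0.99) - 0.01*log2(0.01) ~ 0.081 bits/flip
flips = bytes(1 if random.random() < 0.01 else 0 for _ in range(n))

# Pack 8 flips per byte so zlib sees the raw bitstream, not one byte per flip
packed = bytes(sum(flips[i + j] << j for j in range(8)) for i in range(0, n, 8))
compressed = zlib.compress(packed, 9)

h = -(0.99 * math.log2(0.99) + 0.01 * math.log2(0.01))
print(f"raw: {len(packed)} bytes, zlib: {len(compressed)} bytes, "
      f"entropy limit: {h * n / 8:.0f} bytes")
```

The compressed size lands well below the raw 12.5 kB but cannot beat the ~1 kB entropy floor.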
In SIDRA terms:
A cell stores 256 levels = $\log_2 256 = 8$ theoretical bits. But:
- 5% programming error → some levels indistinguishable → effective level count drops.
- Thermal/shot noise → uncertainty per read.
- Effective entropy: ~6 bits (we saw in 4.4).
So a SIDRA Y1 cell carries ~6 bits in practice, not 8. That loss is fundamental — directly tied to SNR.
Formalism: Entropy, Cross-Entropy, Channel Capacity
Shannon entropy:

$$H(X) = -\sum_i p_i \log_2 p_i \quad \text{(bits)}$$

or in nats (natural log):

$$H(X) = -\sum_i p_i \ln p_i$$
Joint entropy:

Two RVs $X, Y$:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$
Conditional entropy:

$$H(Y \mid X) = H(X, Y) - H(X)$$

“Remaining uncertainty about $Y$ once $X$ is known.”
Mutual information:

$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y \mid X)$$

“Information $X$ and $Y$ share.”
Important property: $I(X; Y) \ge 0$, with equality exactly when $X$ and $Y$ are independent. In AI: $I(X; Y)$ measures information flow from input to output.
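The identities above can be verified numerically; the joint distribution below is a made-up illustrative example:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) of two correlated binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

Hx, Hy = H(px), H(py)
Hxy = H(list(joint.values()))
I = Hx + Hy - Hxy            # mutual information
H_y_given_x = Hxy - Hx       # conditional entropy H(Y|X)

print(Hx, Hy, Hxy, I, H_y_given_x)
```

Here $H(X) = H(Y) = 1$ bit, $H(X,Y) \approx 1.72$ bits, so $I(X;Y) \approx 0.28$ bits — positive, because the variables are correlated.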
Channel capacity (Shannon-Hartley):
Continuous (analog) channel with bandwidth $B$ and signal-to-noise ratio SNR:

$$C = B \log_2(1 + \text{SNR}) \quad \text{bit/s}$$

- $B$: bandwidth (Hz).
- SNR: signal-to-noise power ratio (linear, not dB).
Practical:
- Telephone line (3 kHz, SNR ~1000): $C \approx 3000 \times \log_2(1001) \approx 30$ kbit/s. (V.34 modem speeds.)
- WiFi 802.11ac (160 MHz, SNR ~30 dB = 1000): $C \approx 1.6$ Gbit/s.
- 5G (100 MHz, SNR ~30 dB): $C \approx 1$ Gbit/s.
SIDRA read channel:
A cell read takes 10 ns ($B$ = 100 MHz). SNR ~30 dB (~1000): $C = 10^8 \times \log_2(1001) \approx 10^9$ bit/s = 1 Gbit/s per cell read.
A single MVM reads 256 cells in parallel → 256 × 1 Gbit/s = 256 Gbit/s crossbar throughput. Practical AI inference uses far less than this; it’s the physical limit.
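The capacity figures above follow directly from the Shannon-Hartley formula (a sketch; the function name is ours):

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley: C = B * log2(1 + SNR), in bit/s. SNR is a linear power ratio."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Figures from the text; 30 dB corresponds to a linear SNR of 1000
print(f"telephone: {capacity(3e3, 1000) / 1e3:.0f} kbit/s")
print(f"WiFi ac  : {capacity(160e6, 1000) / 1e9:.2f} Gbit/s")
print(f"5G       : {capacity(100e6, 1000) / 1e9:.2f} Gbit/s")
print(f"cell read: {capacity(100e6, 1000) / 1e9:.2f} Gbit/s")
```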
Cross-entropy:
Two distributions, $p$ (true) and $q$ (model prediction):

$$H(p, q) = -\sum_i p_i \log q_i$$

The standard AI classification loss is cross-entropy; it drops as the model's $q$ approaches the true distribution $p$.
KL divergence (Kullback-Leibler):

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)$$

A “distance” between two distributions (asymmetric: $D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p)$ in general). In AI: regularization (Bayesian VI, ELBO).
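A minimal sketch of both quantities (in nats; the two distributions are made-up illustrations):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) = sum p_i * log(p_i / q_i) = H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.6, 0.3, 0.1]   # model prediction

print(cross_entropy(p, q))   # > H(p) whenever q != p
print(kl(p, q))              # >= 0, zero iff p == q
print(kl(p, q) - kl(q, p))   # asymmetry: generally nonzero
```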
Effective information capacity of a SIDRA cell:
8-bit programming ($2^8 = 256$ levels). Gaussian read noise with standard deviation $\sigma$. Number of distinguishable levels:

$$N_{\text{eff}} = \frac{G_{\max} - G_{\min}}{4\sigma}$$

(4σ rule: ~2 standard deviations on each side of each level, ~95% distinguishability.)
SIDRA Y1: plugging in the conductance range $G_{\max} - G_{\min}$ and the programming $\sigma$ (both in µS, from 4.4) gives $N_{\text{eff}} \approx 6$.

That's only 6 distinguishable levels = $\log_2 6 \approx 2.6$ effective bits? Seems low.

Refinement: the programming $\sigma$ overestimates the read-time noise. With ISPP the effective $\sigma$ shrinks by roughly an order of magnitude → $N_{\text{eff}} \approx 64$ → $\log_2 64 = 6$ effective bits. More realistic.
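The 4σ calculation can be sketched as follows. The range and σ values here are illustrative placeholders, chosen only to reproduce the ~6-level and ~64-level cases above; they are not measured SIDRA figures:

```python
import math

def effective_bits(g_range_uS, sigma_uS):
    """Distinguishable levels under the 4-sigma rule: N = range / (4*sigma)."""
    n_levels = g_range_uS / (4 * sigma_uS)
    return n_levels, math.log2(n_levels)

# Hypothetical 100 uS conductance range; a sigma giving ~6 levels,
# then a 10x smaller (ISPP-like) sigma giving ~64 levels.
for sigma in (4.0, 0.4):
    n, bits = effective_bits(100.0, sigma)
    print(f"sigma={sigma} uS -> {n:.0f} levels, {bits:.1f} effective bits")
```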
Crossbar level:
Reading 256 columns in parallel → SNR rises by $\sqrt{256} = 16\times$ → per-cell information capacity:

$$C = B \log_2(1 + 16 \cdot \text{SNR}) \approx 1.5 \text{ Gbit/s}$$

A single MVM with 256 columns = 256 × 1.5 Gbit/s ≈ 400 Gbit/s. Again the physical limit; practical AI uses far less.
Information bottleneck (Tishby 1999):
Neural-net training = maximize mutual information between input and output, while compressing intermediate-layer information. Modern deep-learning theory.
Why SIDRA cares: naturally “information-compressing” layers (noisy, bit-limited). The information bottleneck theory naturally supports SIDRA hardware — modern training targets “enough information”, not “lossless”.
Brain information capacity:
- 86B neurons × ~1 Hz average firing × ~1 bit/spike ≈ 10¹¹ bit/s “spike rate code” (rough).
- More accurate: spike timing (ms precision), sparse coding → ~10¹³-10¹⁴ bit/s.
- But the brain uses less than 1% of that as meaningful information (sensory redundancy).
SIDRA Y100 target: ~10¹³ bit/s analog throughput → matches synaptic bandwidth.
Experiment: Compute an Entropy
Approximate English letter probabilities:
| Letter | Probability $p$ | $-p \log_2 p$ |
|---|---|---|
| e | 0.13 | 0.382 |
| t | 0.09 | 0.313 |
| a | 0.08 | 0.292 |
| o | 0.075 | 0.281 |
| i | 0.07 | 0.269 |
| … | … | … |
| z | 0.001 | 0.0099 |
Sum (26 letters): $H \approx 4.1$ bits/letter.
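The sum can be reproduced with standard published English letter frequencies (approximate values in percent; this table is not from the text and differs slightly from the one above):

```python
import math

# Approximate English letter frequencies (percent), standard published values
freq = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07,
}
total = sum(freq.values())  # normalize so the probabilities sum to 1
H = -sum((f / total) * math.log2(f / total) for f in freq.values())
print(f"H = {H:.2f} bits/letter")   # close to the ~4.1 quoted above
```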
Comparison:
- ASCII: 8 bits/letter → ~49% of the bits are wasted.
- Optimal entropy coding (Huffman): ~4.1 bits/letter → at the entropy limit (0% waste).
- Modern language compression (Brotli): ~3.5 bits/letter (adds word + language model).
SIDRA cell’s effective entropy:
8-bit programming but 6-bit effective with noise (from 4.4):
$H_{\text{eff}} \approx 6$ bits.
256-cell crossbar column: $256 \times 6 = 1536$ bits. But dependencies (a single noise source affects all cells) → effectively slightly less.
Practical: SIDRA Y1 419M cells × 6 bits = ~2.5 Gbit total stored information. A typical small AI model (GPT-2: 124M params × 8 bit = 1 Gbit) fits in Y1.
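The storage arithmetic, as a quick back-of-envelope check (numbers from the text):

```python
# SIDRA Y1 stored-information estimate vs. a small AI model
cells = 419e6                  # Y1 cell count
eff_bits = 6                   # effective bits per cell (noise-limited)
capacity_gbit = cells * eff_bits / 1e9

gpt2_gbit = 124e6 * 8 / 1e9    # GPT-2: 124M params at 8 bits each

print(f"Y1 capacity: {capacity_gbit:.2f} Gbit, GPT-2: {gpt2_gbit:.2f} Gbit")
print("fits" if gpt2_gbit < capacity_gbit else "does not fit")
```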
Quick Quiz
Lab Exercise
Information flow in SIDRA Y1 MNIST classification.
Scenario:
- MNIST input: 28×28 = 784 pixels × 8 bits = 6272 bits/image.
- Output: 10 classes → log₂ 10 ≈ 3.32 bits/image.
- Required compression: 6272 / 3.32 ≈ 1900×.
SIDRA Y1 model: 2-layer MLP, 784 → 128 → 10. Each layer in a SIDRA crossbar.
Questions:
(a) Total information processed per inference (inputs × weights × outputs)?
(b) Initial cross-entropy (random model)?
(c) Trained (FP32 model) cross-entropy?
(d) Increase in cross-entropy after SIDRA INT8 quantization?
(e) Information-theoretically, is the weight information (8-bit × 100K params) excessive for MNIST?
Solutions
(a) Input 6272 bits. Weights 100K × 8 = 800 kbit. Output 3.32 bits. Total information flow: input + weights + intermediate activations ≈ 800 kbit/inference (weights dominate).
(b) Random 10-class model: $H = \log_2 10 \approx 3.32$ bits. Initial cross-entropy ≈ 2.30 nats = 3.32 bits (uniform prediction).
(c) Well-trained MNIST: cross-entropy ≈ 0.05-0.10 nats. Very low. The model is highly confident.
(d) INT8 quantized cross-entropy ≈ 0.06-0.12 nats. Tiny rise. Accuracy loss 0.2%.
(e) Optimal model size for MNIST (information-theoretic): ~50K-100K parameters suffice (entropy-based capacity analysis). Y1 100K params = optimal. More would risk overfitting. SIDRA Y1’s size is “just right” for MNIST.
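The baseline in (b) is just the entropy of a uniform 10-class distribution, easy to verify:

```python
import math

n_classes = 10
# A model that predicts uniformly (random guessing) has cross-entropy
# equal to the entropy of the uniform distribution: ln(10) nats = log2(10) bits
ce_nats = math.log(n_classes)
ce_bits = math.log2(n_classes)
print(f"{ce_nats:.2f} nats = {ce_bits:.2f} bits")   # 2.30 nats = 3.32 bits
```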
Note: large modern models (BERT, GPT) use far more parameters because they need more complex distributions. Y1 is undersized for big LLMs; Y10+ needed.
Cheat Sheet
- Entropy: $H = -\sum_i p_i \log_2 p_i$. Uncertainty measure.
- Maximum: $H = \log_2 n$ (uniform); minimum: 0 (certainty).
- Joint, conditional, mutual information: entropy variants.
- Channel capacity: $C = B \log_2(1 + \text{SNR})$ (Shannon-Hartley).
- Cross-entropy: AI classification loss.
- KL divergence: distribution distance, regularization.
- SIDRA cell: ~6 effective bits, ~1 Gbit/s read capacity.
- Information bottleneck: AI nets maximize information flow + compress.
Vision: Information-Aware AI Hardware
Modern AI hardware is usually rated by FLOPs. Information-theoretic metrics are more accurate: “how many meaningful bits per second?”
- Y1 (today): 6 effective bits/cell. Enough for INT8 models.
- Y3 (2027): 8 effective bits/cell (ISPP improvement). Exact reproduction of INT8 models.
- Y10 (2029): Multi-cell 12 bits. FP16 equivalent. More complex models (BERT-large, GPT-2).
- Y100 (2031+): 16 bits + dynamic range. GPT-3-class models at the edge.
- Y1000 (long horizon): 24+ bits + analog FP. Approaching brain-scale capacity.
Meaning for Türkiye: information-aware hardware design is a fresh paradigm. SIDRA + Information Bottleneck Theory + academic research → Türkiye can make a distinctive contribution to AI architecture.
Unexpected future: information-conserving AI. Like thermodynamics: in a closed system, information is preserved. Reversible computing approaches it → no energy. SIDRA Y1000 target: sub-Landauer information processing. Sci-fi today, but a clear direction.
Further Reading
- Next chapter: 4.8 — Linear Algebra Laboratory
- Previous: 4.6 — Quantization and Quantization Error
- Classical reference: Shannon, A Mathematical Theory of Communication, Bell System Tech. J. 1948.
- Modern textbook: Cover & Thomas, Elements of Information Theory, 2nd ed.
- Compression: MacKay, Information Theory, Inference, and Learning Algorithms.
- Information bottleneck: Tishby, Pereira, Bialek, The information bottleneck method, arXiv 2000.
- Deep learning + IB: Tishby & Zaslavsky, Deep learning and the information bottleneck principle, ITW 2015.