The Information Bottleneck in AI
How Compression Creates Intelligence
Why the smartest AI systems don’t remember everything; they remember only what matters
Every conversation around AI today sounds the same:
“Bigger models are coming.”
“GPT-5 will change everything.”
“Parameter count equals intelligence.”
The industry worships size.
More data - more epochs - more GPUs - more tokens.
Size is the headline.
Scale is the narrative.
Hardware is the obsession.
But look beneath the hype, and a quieter truth is emerging:
AI doesn’t become intelligent by storing more.
AI becomes intelligent by forgetting more.
This is the world of the Information Bottleneck
and once you understand this principle,
LLMs, embeddings, hallucination, memory, data efficiency,
and even AGI start to make sense in a new way.
This blog explains the entire theory:
how it works, why it works, why it changes model design,
why it explains hallucination,
and why the future of AI isn’t size
it’s compression.
The Counterintuitive Idea: Intelligence Comes From Losing Information
In 1999, Israeli physicist Naftali Tishby, together with Fernando Pereira and William Bialek, proposed a radical idea:
If a learning system wants to make accurate predictions,
it must throw away most of the information it receives.
It sounds absurd.
Schools teach the opposite:
learn more, store more, memorize everything.
But biology disagrees.
Brains forget.
Consciousness compresses.
Memory decays into meaning.
Evolution didn’t produce animals that remember every detail.
Evolution produced animals that remember only the patterns
needed to survive.
AI is now replicating that strategy.
Why Raw Data Is Useless to Intelligence
When a model reads a sentence:
“The little boy kicked the blue ball across the street.”
The raw data contains:
The exact shade of blue
The ordering of letters
The capitalisation
Token IDs
Word shapes
Spacing
Punctuation
Sentence length
A naive system would keep all of it.
A smart system keeps only:
boy - kicks - ball - across - street
(actor - action - object - direction - location)
Meaning survives.
Detail dies.
That death of detail
is the source of intelligence.
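A minimal Python sketch of that idea. The tiny “detail filter” below is invented for illustration - a hand-picked set of function words and decorative adjectives standing in for everything a trained model learns to discard - but the point survives: the predictive core of the sentence remains after massive deletion.
# Toy illustration: strip a sentence down to its predictive core.
# The "details" set is a hand-picked stand-in for what a model learns to throw away.
sentence = "The little boy kicked the blue ball across the street."
details = {"the", "a", "an", "little", "blue"}
tokens = [w.strip(".,").lower() for w in sentence.split()]
core = [w for w in tokens if w not in details]
print(core)  # ['boy', 'kicked', 'ball', 'across', 'street']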
The Exact Math Behind the Theory
The Information Bottleneck tries to maximise:
I(Z;Y) - the information Z keeps about the target
and minimise:
I(Z;X) - the information Z retains about the raw input
Where:
X = input
Y = target
Z = compressed representation
Formally, the two goals combine into a single objective: minimise I(Z;X) − β·I(Z;Y), where β sets how much predictive power is worth per bit of input kept.
In simple words:
Z must keep what predicts Y
and discard everything else from X.
This explains deep learning at scale.
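Here is a tiny numeric sketch of that trade-off using sklearn’s mutual_info_score (values in nats). The setup is invented for illustration: the input X is three random bits, the target Y depends on only one of them, and the compressed Z keeps exactly that bit.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=20000)   # input: three random bits (values 0..7)
Y = X % 2                            # target depends only on the lowest bit
Z = X % 2                            # compressed representation: keep just that bit

print("total info in X,  I(X;X):", round(mutual_info_score(X, X), 2))  # ~2.08 nats
print("kept about input, I(Z;X):", round(mutual_info_score(Z, X), 2))  # ~0.69 nats
print("predictive info,  I(Z;Y):", round(mutual_info_score(Z, Y), 2))  # ~0.69 nats
print("upper bound,      I(X;Y):", round(mutual_info_score(X, Y), 2))  # ~0.69 nats
Z retains roughly a third of the information in X, yet every bit that predicts Y survives - exactly what the objective rewards.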
Bottleneck vs Memorization
A memorizing model drives train loss down by storing specifics; a bottlenecked model drives test loss down by keeping structure.
This is why test loss matters more than train loss.
Why Deep Networks First Memorize, Then Forget
Researchers noticed a striking pattern:
neural networks go through two phases during training.
Phase 1 - Fitting
The model memorizes most data.
Train accuracy climbs.
Loss shrinks.
But test accuracy barely improves.
Phase 2 - Compression
Suddenly the model begins throwing information away:
latent structure emerges
clusters sharpen
semantic categories form
Test accuracy shoots up.
This aligns perfectly with human learning:
we first copy, then internalize, then abstract.
ASCII TRAINING CURVE
Accuracy ↑
│
│ ┌─────────────── Plateau
│ ┌─┘
│ ┌─┘
│ ┌─┘
└──────────────────────────→ Training Time
Fit Phase Compress Phase Generalization
Versions of this pattern have been reported across:
CNNs, RNNs, Transformers, Diffusion models.
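Measuring the compression phase properly means estimating mutual information inside the network, which is beyond a blog snippet. But the fit-then-generalize rhythm is easy to watch with a rough proxy - train versus test accuracy per epoch on a small sklearn MLP (the layer size and epoch count here are arbitrary demo choices).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
classes = np.unique(y)

# One partial_fit call = one pass over the training set.
for epoch in range(1, 41):
    clf.partial_fit(X_train, y_train, classes=classes)
    if epoch % 5 == 0:
        print(f"epoch {epoch:2d}  "
              f"train {clf.score(X_train, y_train):.3f}  "
              f"test {clf.score(X_test, y_test):.3f}")
Train accuracy races ahead while test accuracy lags, then catches up as the representation settles - the generic shape of the curve above, even though this proxy does not measure information directly.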
Why Compression Builds Abstraction
The brain does not store every memory.
It stores compressed templates:
“dogness”
“anger”
“gravity”
“sarcasm”
“triangle”
You never remember every chair you’ve seen.
You remember the concept chair.
Compression - category formation - abstraction.
Neural networks form:
embedding clusters
vector manifolds
semantic gradients
These are compressed abstractions.
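A toy sketch of that template idea, with invented 4-dimensional “feature vectors” standing in for perception. Storing one prototype per concept - the mean of everything seen - is enough to recognise a brand-new chair without remembering a single old one.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every object is a small feature vector.
chair_pattern = np.array([1.0, 0.2, 0.8, 0.1])
table_pattern = np.array([0.1, 1.0, 0.3, 0.9])
chairs_seen = chair_pattern + 0.1 * rng.standard_normal((500, 4))
tables_seen = table_pattern + 0.1 * rng.standard_normal((500, 4))

# Compression: keep one prototype per concept instead of 500 exemplars.
chair_proto = chairs_seen.mean(axis=0)
table_proto = tables_seen.mean(axis=0)

# A never-seen chair is recognised by its nearest prototype.
new_chair = chair_pattern + 0.1 * rng.standard_normal(4)
d_chair = np.linalg.norm(new_chair - chair_proto)
d_table = np.linalg.norm(new_chair - table_proto)
print("chair" if d_chair < d_table else "table")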
Where Compression Operates in AI
At every stage - tokenization, embeddings, attention, pooling - the model discards detail that does not help the next prediction.
Why Better Compression Reduces Hallucination
Hallucination happens when:
the representation loses its meaning structure
because compression has collapsed too far, or kept the wrong details.
When IB is tuned correctly:
models hallucinate less
not because they “try harder”
but because they store better abstractions.
CODE EXAMPLE: The Compression Principle With PCA
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 8x8 digit images: 64 raw input dimensions per sample
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compress 64 dimensions down to 20, fitting the projection on training data only
pca = PCA(n_components=20)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train_reduced, y_train)
print("Accuracy:", clf.score(X_test_reduced, y_test))
Roughly two-thirds of the input dimensions are gone.
Accuracy barely moves.
Generalization holds.
Compression, not raw detail, is doing the work.
Why GPT Models Are Compression Engines
LLMs don’t store language literally.
They store meaning.
That’s why LLMs can:
predict unseen sentences
translate without dictionaries
generate poetry
explain physics
They compress human history
into a conceptual latent universe.
GPT models store
density fields of knowledge - not books.
Latent Space Geometry - Where Intelligence Lives
Latent space is multidimensional meaning geometry.
Concepts like:
cat
tiger
lion
form clusters
because they share meaning vectors.
That clustering is compression.
Without compression - chaos.
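A miniature version of that geometry can be built with classic LSA in scikit-learn. The six sentences below are invented; the point is that once the word counts are squeezed into just two latent dimensions, the big-cat sentences land near each other and far from the vehicle ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat chased the mouse",                  # 0
    "a tiger is a large striped cat",            # 1
    "the lion is the biggest wild cat",          # 2
    "the car drove down the highway",            # 3
    "a truck hauled cargo down the highway",     # 4
    "the bus stopped at the highway exit",       # 5
]

# Sparse word counts -> a 2-dimensional latent space (compression)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

sims = cosine_similarity(latent)
print("cat doc vs tiger doc:", round(sims[0, 1], 2))  # same cluster: high
print("cat doc vs car doc:  ", round(sims[0, 3], 2))  # different cluster: low
Swap in real sentence embeddings and the same geometry appears at scale; the clusters are the compression.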
Latent Dimensionality vs Accuracy Trend
Accuracy ↑
│ *
│ *
│ *
│ *
│ *
└──────────────────────────→ Dimensionality Reduction
As dimensionality shrinks thoughtfully,
accuracy holds and can even improve - up to the point where compression starts deleting signal instead of noise.
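That curve is easy to reproduce on the digits example from earlier by sweeping the number of PCA components (the specific component counts below are arbitrary demo choices).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep how aggressively we compress the 64 input dimensions.
for k in (64, 32, 16, 8, 4, 2):
    model = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=5000))
    model.fit(X_train, y_train)
    print(f"{k:2d} dims -> test accuracy {model.score(X_test, y_test):.3f}")
Typically the score barely moves until the representation gets very small, then drops sharply: compression helps until it starts destroying signal.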
Why New AI Models Are Getting Smaller
Look at the trend:
Qwen 2.5
Meta Llama-3
Gemini Flash
Phi models
Mixtral
Nemotron MoE
These models often punch far above their parameter count.
Why?
Better compression architecture.
The bottleneck beats brute force.
Information Bottleneck Is the Blueprint of Human Thought
Humans fail when overloaded:
too much detail - no thinking.
We compress automatically:
events - stories
people - identities
visuals - symbols
emotions - categories
Without forgetting,
we drown.
Memory = Stored Compression
Humans remember
the abstraction of a trauma,
not the precise light photons
that hit the retina.
AI must do the same.
That means:
future LLM breakthroughs will come
from improved compression layers,
not trillion-token training.
Why the Bottleneck Solves the Cost Crisis
GPU economics are brutal.
Compression unlocks:
distillation
pruning
4-bit models
quantization
sparse mixtures
Soon, well-compressed 4B models may rival today’s 70B models,
because compression beats brute memory.
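Here is a minimal numpy sketch of the idea behind 4-bit quantization - a toy round-trip on random stand-in “weights,” not a production kernel (real schemes use per-group scales and careful packing).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor

# Symmetric 4-bit quantization: 16 integer levels instead of float32
levels = 2 ** 4
scale = np.abs(weights).max() / (levels // 2 - 1)            # map the largest weight to +7
q = np.clip(np.round(weights / scale), -levels // 2, levels // 2 - 1).astype(np.int8)
restored = q.astype(np.float32) * scale

print("bits per weight: 32 -> 4 (plus one shared scale)")
print("mean absolute error:", float(np.abs(weights - restored).mean()))
print("max absolute error: ", float(np.abs(weights - restored).max()))
Each weight now costs 4 bits instead of 32, at the price of a small, bounded rounding error.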
The Coming Shift: From Bottleneck to AGI
AGI requires:
concept space
abstraction
memory
forgetting
hierarchical compression
GPT-4 already shows early signs:
it generalizes beyond training samples.
This is IB at scale.
The Information Bottleneck isn’t just a theory about math.
It’s a theory about reality:
Intelligence isn’t remembering the world
it’s compressing the world into meaning.
Everything intelligent - brains, animals, societies, evolution itself -
survives because it forgets most of its inputs.
The same will be true for AI.
The future belongs to compression-driven models:
smaller
smarter
cheaper
more general
more human
This is not the end of deep learning.
It is the moment deep learning
finally begins to understand itself.
The compression principle explains why smaller, focused implementations often outperform massive general models.
If you're thinking about what this means for your team's AI strategy: https://vivander.substack.com/p/something-shifted-when-i-read-openais
The idea that intelligence is compression, not memorization, completely reframes how we should think about AI success.
Most orgs are still measuring AI like they measure databases: how much can it hold?
The real question is how well it forgets what doesn't matter.