The powerful shift from the Transformer
Image by the author
Remember that time you walked into a room and completely forgot why you went there? That frustrating “brain fart” is your **short-term memory** failing you.
Now, imagine if AI models had that same problem, but every single time they tried to answer a question.
Well, they do. And it has been holding back Artificial Intelligence for years.
Until now.
Researchers at Google — specifically Ali Behrouz, Peilin Zhong, and Vahab Mirrokni — dropped a research paper that might be the biggest leap in AI architecture since the Transformer itself. It’s called Titans, and it does something that sounds almost human:
it learns to remember things at test time, just like we do when we’re actually using our brains in real situations.
Let me break down why this matters and how it works, without the academic jargon that makes most AI papers feel like reading a foreign language.
The Memory Problem That’s Been Haunting AI
Back in 2017, Google published “Attention is All You Need,” introducing the Transformer architecture. That paper changed everything. **Transformers became the backbone of ChatGPT, Claude, Gemini, and pretty much every large language model you’ve heard of.**
But Transformers have a problem:
they’re terrible at remembering things over long stretches.
Think of a Transformer’s memory like a sliding window on a train. You can see what’s immediately around you (the current context), but you can’t see what you passed five minutes ago unless it’s still in your limited view. The technical term is “context window,” but really it’s just how much text the model can look at once.
Here’s where it gets problematic. To make that window bigger, the computational cost explodes. Not doubles. Not triples. It goes up quadratically. If you want to double your context window, you need four times the computing power. Want to triple it? That’s nine times the power.
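To make that concrete, here is a back-of-the-envelope sketch. The window sizes are just examples, not real model limits:

```python
# Self-attention compares every token with every other token,
# so its cost grows with the square of the context length.
def attention_cost(context_len: int) -> int:
    return context_len ** 2  # pairwise token comparisons

base = attention_cost(8_000)
print(attention_cost(16_000) / base)  # 4.0 -> double the window, 4x the work
print(attention_cost(24_000) / base)  # 9.0 -> triple the window, 9x the work
```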
The research community has tried everything to fix this. They built Linear RNNs and State Space Models (SSMs) like Mamba-2. These models are fast because they compress history into a fixed-size state. But that compression is “lossy” — it’s like trying to summarize the entire Harry Potter series into a single tweet. You lose the nuance, the specific dates, and the “needle in the haystack.”
We were stuck choosing between the Transformer (smart but amnesiac and slow) and the RNN (fast but forgetful).
Titans breaks this tradeoff.
How Human Memory Actually Works
Image by the author
Before we dive into how to build a Titan, let’s look at the biological blueprint it copies. Your brain doesn’t treat all information equally. It has three types of memory working together:
- Short-term memory: Your mental scratch pad. It’s what you use to hold a phone number just long enough to dial it. It’s fast, focused on the now.
- Long-term memory: Your brain’s hard drive. It stores experiences, facts, and skills over weeks or years.
- Meta-memory: Your brain knowing what it knows. It’s the “learned expertise” — like knowing how to speak English or how to drive, regardless of where you are driving.
Here’s the critical part: your brain doesn’t try to remember everything. It uses a clever trick: it remembers surprising things.
If you walk down the street and see a cat, your brain ignores it. That’s low surprise. But if you see a cat driving a car, your brain creates a vivid memory instantly. That’s high surprise.
Titans is the first architecture to successfully engineer this biological system into silicon.
Enter Titans: Learning New Context on the Fly
Image by the author
The Titans architecture introduces a radical concept: Learning at Test Time.
In traditional AI, there is a hard wall between “Training” (school) and “Inference” (work). Once a model is trained, its brain is frozen. It cannot learn anything new unless you retrain it from scratch.
Titans smashes this wall. It introduces a Neural Long-Term Memory (NLM) module. This isn’t a database or a simple vector; it is a Deep Neural Network (MLP) that lives inside the model.
When Titans reads your input, it uses the data to update the weights of this internal memory network. It is literally “learning to memorize” on the fly.
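Here is a minimal PyTorch sketch of that idea. To be clear, this is not Google’s code; the module, sizes, and learning rate are all illustrative. The point is that a “write” to memory is literally a gradient step on the weights of an inner MLP, performed at inference time:

```python
import torch
import torch.nn as nn

# Stand-in for the Neural Long-Term Memory (NLM): a small MLP
# whose *weights* are the memory.
class NeuralMemory(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, key: torch.Tensor) -> torch.Tensor:
        return self.net(key)  # "read": map a key to the value it stored

memory = NeuralMemory()
lr = 0.01  # test-time learning rate (illustrative)

def memorize(key: torch.Tensor, value: torch.Tensor) -> None:
    """One "write": nudge the weights so the memory predicts value from key."""
    params = list(memory.parameters())
    loss = (memory(key) - value).pow(2).mean()   # how wrong was the memory?
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g  # the gradient step IS the act of remembering

memorize(torch.randn(1, 64), torch.randn(1, 64))  # learning at inference, no retraining
```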
The Three-Part Memory System
Image by the author
- The Core (Short-Term): This is the standard Attention mechanism. It handles the immediate “now” with high precision.
- Long-Term Memory: This is the deep neural network that updates itself. It holds the “story so far.”
- Persistent Memory: These are fixed, learnable parameters that hold task-specific knowledge (like grammar or coding rules) that doesn’t change during the conversation.
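Here is a rough sketch of how the three branches can be stitched together in one forward step, in the “memory as context” style. All names and sizes are mine, not the paper’s code:

```python
import torch
import torch.nn as nn

dim, n_persist, seg_len = 64, 4, 128
persistent = nn.Parameter(torch.randn(1, n_persist, dim))     # fixed task knowledge
long_term = nn.Linear(dim, dim)                               # stand-in for the NLM
core = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

def titans_step(segment: torch.Tensor) -> torch.Tensor:
    recalled = long_term(segment)                        # read the "story so far"
    context = torch.cat(
        [persistent.expand(segment.size(0), -1, -1),     # persistent memory
         recalled,                                       # long-term memory
         segment],                                       # the immediate "now"
        dim=1,
    )
    out, _ = core(context, context, context)             # short-term attention core
    return out[:, -seg_len:]                             # outputs for the fresh segment

y = titans_step(torch.randn(2, seg_len, dim))            # shape: (2, 128, 64)
```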
The Math of “Surprise”: How It Decides What to Keep
How does the model know if a piece of information is a “cat” (boring) or a “cat driving a car” (surprising)?
It uses Gradients.
In machine learning, a gradient usually tells the model how to learn during training. Titans uses it during inference as a Surprise Metric.
- The model tries to predict the next token.
- If it predicts correctly (Low Error), the gradient is small. The model says, “I knew that,” and the memory barely changes.
- If it predicts incorrectly (High Error), the gradient is large. The model says, “Whoa, I didn’t expect that!” and significantly updates its memory weights.
The “Banana Peel” Example:
If the model is summarizing a serious financial report and suddenly encounters a picture of a banana peel, the “surprise” (gradient) spikes. The model flags this as an anomaly and prioritizes storing it in long-term memory.
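In code, the surprise score is simply the size of that gradient. A toy version (mine, not the paper’s):

```python
import torch
import torch.nn as nn

memory = nn.Linear(16, 16)  # stand-in for the memory network

def surprise(key: torch.Tensor, value: torch.Tensor) -> float:
    loss = (memory(key) - value).pow(2).mean()   # prediction error on this token
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    return sum(g.norm().item() for g in grads)   # tiny = "a cat", huge = "a cat driving a car"
```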
The Math:
It doesn’t just look at the current moment; it uses Momentum to track the flow of surprise.
Memory Update Rule (Simplified)
Memoryₜ = Memoryₜ₋₁ + ΔMemoryₜ
ΔMemoryₜ = α × Past Surprise − η × Current Surprise
Term Definitions
- Memoryₜ: The model’s internal memory after processing the current input step
- Memoryₜ₋₁: The memory state carried over from the previous step
- ΔMemoryₜ: The amount by which memory is updated at the current step
- Past Surprise: Accumulated unexpected information from earlier steps (surprise momentum)
- α (alpha): Forgetting factor that controls how much past surprise is retained
- Current Surprise: The model’s current prediction error signal
- η (eta): Learning rate that controls how strongly new information updates memory
Intuition: Memory strengthens when surprising patterns persist over time and fades when new information contradicts past context.
This “Decay” is crucial. It acts as a Forgetting Gate. If the conversation shifts topics entirely, the decay factor kicks in and wipes the irrelevant parts of memory to make room for new data. This prevents the “brain” from getting full.
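Here is a runnable toy of the simplified rule above, momentum and forgetting included. The constants and the two-layer MLP are illustrative; the actual paper computes these gates per token and in a more refined way:

```python
import torch
import torch.nn as nn

memory = nn.Sequential(nn.Linear(8, 8), nn.SiLU(), nn.Linear(8, 8))
params = list(memory.parameters())
past_surprise = [torch.zeros_like(p) for p in params]  # surprise momentum
alpha, eta = 0.9, 0.1  # forgetting factor, test-time learning rate (illustrative)

def update_memory(key: torch.Tensor, value: torch.Tensor) -> None:
    loss = (memory(key) - value).pow(2).mean()          # current surprise signal
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, s, g in zip(params, past_surprise, grads):
            # small alpha = old surprise decays fast (the forgetting gate)
            s.mul_(alpha).sub_(eta * g)  # ΔMemoryₜ = α·Past Surprise − η·Current Surprise
            p.add_(s)                    # Memoryₜ = Memoryₜ₋₁ + ΔMemoryₜ
```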
MIRAS: The Blueprint for the Post-Transformer Era
While Titans is the specific architecture, Google also introduced MIRAS (Memorization, Information Retrieval, and Associative Systems), which is the theoretical framework (the blueprint).
Why should you care as a builder or researcher?
MIRAS changes the game because it stops us from “guessing” new architectures and starts letting us “engineer” them.
For the last five years, researchers have been randomly adding layers or tweaking attention heads to see what works. MIRAS unifies everything — RNNs, Transformers, SSMs — under one theory: Everything is an Associative Memory.
It defines any sequence model by four knobs you can turn:
- Memory Architecture: (Vector? Matrix? Neural Network?)
- Attentional Bias: (What do we prioritize?)
- Retention Gate: (How fast do we forget?)
- Memory Algorithm: (How do we update?)
This matters because it moves AI design from alchemy to chemistry. It implies that the future of AI isn’t just “bigger Transformers.” It’s designing specific “memory optimizers” for specific tasks. If you need a model that never forgets legal precedents, you tune the Retention Gate. If you need a model that adapts to a user’s slang instantly, you tune the Memory Algorithm.
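One way to internalize this is to imagine the four knobs as a config object. This is purely my illustration of the framing, not an API from the paper:

```python
from dataclasses import dataclass

@dataclass
class MirasDesign:
    memory_architecture: str  # vector? matrix? deep MLP?
    attentional_bias: str     # what the memory's loss prioritizes
    retention_gate: str       # how, and how fast, old memories decay
    memory_algorithm: str     # the update rule that writes to memory

# Existing models fall out as points in this space (rough characterizations):
linear_rnn = MirasDesign("vector", "dot-product similarity", "fixed decay", "linear update")
titans = MirasDesign("deep MLP", "L2 prediction error", "adaptive forgetting", "gradient descent with momentum")
```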
The Results: Beating GPT-4 with a Fraction of the Parameters
The chart shows Titans crushing GPT-4 on memory retention (context length); it does not prove Titans is smarter at reasoning or coding. Source: Google’s blog
Google didn’t just write a theory paper; they proved it works with hard data.
**1. The Power of Deep Memory**
The researchers ran strict ablation studies to see if “depth” really matters. The results were clear: deeper memory modules (more layers in the internal neural net) consistently achieved lower perplexity (confusion) than shallower ones.
- What this means: As the sequence gets longer, a shallow memory fills up and fails. A deep memory keeps scaling. The graphs show that while other models flatline, Titans with deep memory gets smarter as it reads more.
**2. Extreme Long-Context Recall (BABILong)**
This is the ultimate test. The BABILong benchmark requires reasoning across facts scattered throughout documents millions of words long.
- The Result: Titans outperformed GPT-4, despite having vastly fewer parameters. Why? Because GPT-4 eventually runs out of context space or gets “distracted.” Titans just keeps learning.
- Scale: It effectively scales to >2 Million tokens with higher accuracy in “needle-in-a-haystack” tasks compared to all baselines.
**Source:** BABILong benchmark
3. Efficiency & Versatility In standard language modeling, Titans beat state-of-the-art linear models like Mamba-2 and Transformer++. But it didn’t stop at text. When tested on DNA sequencing and Time-Series forecasting, Titans crushed the baselines there too. This proves that “surprise-based memory” is a universal principle of intelligence, not just a language trick.
What This Changes for People Building AI Systems
If you are an engineer or a founder, this isn’t just theory. This architecture fundamentally changes the stack you are building today.
- RAG becomes a patch, not a foundation: You won’t need to chop documents into tiny “chunks” and search them with vector databases if the model can simply read the whole book and remember it.
- Agent memory stops being an external database problem: Currently, we store agent memory in JSON files or SQL databases. Titans moves memory inside the model weights, making retrieval instant and semantic.
- Long-running agents become feasible: An agent could run for weeks, learning your coding style or preferences, without crashing or needing a “context reset.”
- Evaluation must consider memory drift: We currently test for accuracy. Now, we must test for stability — does the model’s memory degrade over time?
- Security now includes memory poisoning: Hackers won’t just try “prompt injection.” They will try to feed the model data that “poisons” its long-term memory, corrupting its future answers.
Who is using it right now?
Currently, this is cutting-edge research from Google Research. It is not yet in public APIs like Gemini Pro. However, the architecture is open, and the implications are immediate for anyone building in:
- Financial Analysis: Reading years of 10-K reports without losing the narrative thread.
- Genomics: Analyzing DNA sequences where a gene at the start affects a trait millions of base pairs later.
- Legal Tech: Reviewing entire case histories where the specific wording of a contract from 100 pages ago matters now.
Critical Lens: The Cracks in the Armor
This all sounds perfect, but if you are an engineer, you know there is no such thing as a free lunch. Titans introduces new, significant risks that we haven’t had to deal with before.
- Memory Contamination: Since the model learns at test time, bad data could theoretically “poison” the memory for the duration of the session. A user could trick the model into “learning” a false fact that persists and corrupts later answers.
- Stability Issues: “Learning at test time” is essentially running an optimizer on a live production model. If the gradients explode (a common issue in deep learning), the model’s brain could effectively “crash” mid-sentence, producing garbage outputs.
- Debugging Difficulty: With a Transformer, if the output is wrong, you can inspect the attention weights. With Titans, the “state” is hidden inside the weights of a neural network that changed since you last looked at it. Debugging “why did it say that?” becomes exponentially harder.
- Reproducibility: Two users could ask the same question to the same model but get different answers depending on the exact sequence of “surprises” the model encountered just before. This makes deterministic testing a nightmare.
Conclusion: Memory Matters
The Transformer revolution in 2017 was about attention. The message was clear: if you attend to the right information, you can solve incredible problems.
Titans adds a crucial complement: attention isn’t enough. You also need to remember.
Not just passive memory where you stuff tokens into a context window and hope the model can find them. Active memory that learns what to store, how to store it, and when to forget it. Memory that adapts during use, not just during training. Memory that mirrors how biological brains actually work.
The results speak for themselves. Titans outperforms Transformers and modern alternatives across the board. It scales to context windows that were previously impossible to use effectively. It enables genuine reasoning over long documents in a way that RAG and other workarounds can’t match.
Is Titans the final answer? No. It has limitations, and there’s much to explore. But it represents a significant step forward on a fundamental problem: how machines remember.