We Didn’t Invent Attention — We Just Rediscovered It
towardsdatascience.com

Every so often, someone claims they’ve invented a revolutionary AI architecture. But when you see the same mathematical pattern (selective amplification + normalization) emerge independently from gradient descent, evolution, and chemical reactions, you realize we didn’t invent the attention mechanism with the Transformer architecture. We rediscovered fundamental optimization principles that govern how any system processes information under energy constraints. Understanding attention as amplification rather than selection suggests specific architectural improvements and explains why current approaches work. Eight minutes here gives you a mental model that could guide better system design for the next decade.
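As a concrete reference for that pattern, here is a minimal NumPy sketch of standard scaled dot-product attention (Vaswani et al., 2017), with the amplification and normalization steps labeled. The function name, shapes, and comments are illustrative, not taken from the article:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, read as amplification + normalization.

    Q, K, V: (n, d) arrays of queries, keys, and values.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # raw similarity scores
    # Amplification: exponentiation stretches gaps between scores,
    # boosting strong matches far more than weak ones.
    amplified = np.exp(scores - scores.max(axis=-1, keepdims=True))
    # Normalization: weights compete for a fixed budget summing to 1,
    # so amplifying one input necessarily suppresses the others.
    weights = amplified / amplified.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted mixture of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)  # self-attention; out.shape == (4, 8)
```

Nothing in the weights is a hard selection: every value contributes, but the exponential step amplifies some contributions and the shared normalization budget damps the rest, which is the amplification-not-selection reading the article argues for.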

When Vaswani and colleagues published “Attention Is All You Need” in 2017, they thoug…
