Credit: Oryon.
In a dark MRI scanner outside Tokyo, a volunteer watches a video of someone hurling themselves off a waterfall. Nearby, a computer digests the brain activity pulsing across millions of neurons. A few moments later, the machine produces a sentence: “A person jumps over a deep water fall on a mountain ridge.”
No one typed those words. No one spoke them. They came directly from the volunteer’s brain activity.
That’s the startling premise of “mind captioning,” a new method developed by Tomoyasu Horikawa and colleagues at NTT Communication Science Laboratories in Japan. Published this week in Science Advances, the system uses a blend of brain imaging and artificial intelligence to generate textual descriptions of what people are seeing — or even visualizing with their mind’s eye — based only on their neural patterns.
As Nature journalist Max Kozlov put it, the technique “generates descriptive sentences of what a person is seeing or picturing in their mind using a read-out of their brain activity, with impressive accuracy.”
This is not the stuff of science fiction anymore. It’s not mind-reading either, at least not yet. But it’s a vivid demonstration of how our brains and modern AI models might be speaking a surprisingly similar language.
Decoding Meaning from the Silent Mind
The researchers trained an AI to link brain scans with video captions, then used it to turn new brain activity — whether from watching or recalling scenes — into sentences through an iterative word-replacement process guided by language models. Credit: Nature, 2025, Horikawa.
To build the system, Horikawa had to bridge two universes: the intricate geometry of human thought and the sprawling semantic web that language models use to understand words. Six volunteers spent nearly seventeen hours each in an MRI scanner, watching 2,180 short, silent video clips. The scenes ranged from playful animals to emotional interactions, abstract animations, and everyday moments. Each clip lasted only a few seconds, but together they provided a massive dataset of how the brain reacts to visual experiences.
For every video, the researchers also gathered twenty captions written by online volunteers: complete sentences describing what was happening in each scene, later cleaned up with the help of ChatGPT. Each sentence was then transformed into a numerical signature, a point in a vast, high-dimensional semantic space, using a language model called DeBERTa.
The team then mapped the brain activity recorded during each video to these semantic signatures. In other words, they trained an AI to recognize what kinds of neural patterns corresponded to particular kinds of meaning. Instead of using deep, opaque neural networks, the researchers relied on a more transparent linear model. This model could reveal which regions of the brain contributed to which kinds of semantic information.
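To make that mapping concrete, here is a minimal sketch in Python, assuming a HuggingFace DeBERTa checkpoint for the caption embeddings and an off-the-shelf ridge regression standing in for the transparent linear decoder. The voxel data, the mean-pooling step, and the regularization strength are placeholders for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of the brain-to-semantics mapping described above (not the authors' code).
# Assumptions: captions are embedded by mean-pooling a DeBERTa layer, and the decoder
# is an L2-regularized (ridge) linear regression from voxel patterns to those embeddings.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-base")

def embed_captions(captions):
    """Turn each caption into one semantic vector (mean-pooled hidden states)."""
    with torch.no_grad():
        batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state            # (n, tokens, dim)
        mask = batch["attention_mask"].unsqueeze(-1)            # ignore padding tokens
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()   # (n, dim)

# Placeholder data: in the study, X would be fMRI response patterns (one row per clip)
# and the targets would come from the twenty human-written captions for each video.
example_captions = [
    "a dog runs across a grassy field",
    "two people shake hands in an office",
    "waves crash against a rocky shore",
]
Y = embed_captions(example_captions)                      # target semantic signatures
rng = np.random.default_rng(0)
X = rng.standard_normal((len(example_captions), 5000))    # stand-in for voxel patterns

decoder = Ridge(alpha=100.0)      # transparent linear model, as described above
decoder.fit(X, Y)
predicted_meaning = decoder.predict(X[:1])   # decoded "meaning vector" for one scan
```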
From Abstract Meaning to Words
Once the system could predict the “meaning vector” of what someone was watching, it faced the next challenge: turning that abstract representation into an actual sentence. To do that, the researchers used another language model, RoBERTa, to generate words step by step. It began with a meaningless placeholder and, over roughly a hundred iterations, filled in blanks, tested alternative wordings, and kept whichever version best matched the decoded meaning.
The process resembled an evolution of language inside the machine’s circuits. Early attempts sounded like nonsense, but with each refinement the sentences grew more accurate, finally converging on a full, coherent description of the scene.
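Here is a toy version of that loop in Python, leaning on the embed_captions helper and decoded vector from the sketch above and on roberta-base through the HuggingFace fill-mask pipeline as the word-proposing model. The placeholder sentence length, the number of candidate words per step, and the greedy accept rule are illustrative guesses, not the paper's settings.

```python
# Toy sketch of iterative word replacement guided by a masked language model.
# Reuses embed_captions() and predicted_meaning from the previous snippet.
import numpy as np
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def similarity(sentence, target_vec):
    """Cosine similarity between a candidate sentence and the decoded meaning vector."""
    v = embed_captions([sentence])[0]
    return float(np.dot(v, target_vec) / (np.linalg.norm(v) * np.linalg.norm(target_vec)))

target = predicted_meaning[0]        # decoded meaning vector for one brain scan
words = ["something"] * 8            # meaningless placeholder sentence to start from
best_score = similarity(" ".join(words), target)

rng = np.random.default_rng(0)
for step in range(100):                           # roughly a hundred refinement passes
    i = int(rng.integers(len(words)))             # pick one slot to rewrite
    masked = " ".join(w if j != i else fill_mask.tokenizer.mask_token
                      for j, w in enumerate(words))
    for suggestion in fill_mask(masked, top_k=5):         # candidate replacement words
        candidate = list(words)
        candidate[i] = suggestion["token_str"].strip()
        score = similarity(" ".join(candidate), target)
        if score > best_score:                    # keep whichever version matches best
            words, best_score = candidate, score

print(" ".join(words))    # the sentence that best matches the decoded meaning
```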
When tested, the system could match the correct video to its generated description about half the time, even when presented with a hundred possibilities. “This is hard to do,” Alex Huth, a neuroscientist at the University of California, Berkeley, who has worked on similar brain-decoding projects, told Nature. “It’s surprising you can get that much detail.”
The researchers also made a surprising discovery when they scrambled the word order of the generated captions. The quality and accuracy dropped sharply, showing that the AI wasn’t just picking up on keywords but grasping something deeper — perhaps the structure of meaning itself, the relationships between objects, actions, and context.
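As a rough illustration of how such a hundred-way test might be scored (an assumption about the evaluation, not the paper's exact procedure), one can embed the generated sentence alongside reference captions for all one hundred candidate videos and check whether the true clip is the closest match; chance level would be one percent.

```python
# Sketch of a 100-way identification check, reusing embed_captions() from the first snippet.
import numpy as np

def identify(generated_sentence, candidate_captions, true_index):
    """Return True if the generated sentence is most similar to the true video's caption."""
    gen = embed_captions([generated_sentence])[0]
    refs = embed_captions(candidate_captions)                               # (100, dim)
    sims = refs @ gen / (np.linalg.norm(refs, axis=1) * np.linalg.norm(gen))
    return int(np.argmax(sims)) == true_index

# Identification accuracy = mean of identify(...) over held-out clips; chance is 1/100.
```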
Our new paper is on bioRxiv. We present a novel generative decoding method, called Mind Captioning, and demonstrate the generation of descriptive text of viewed and imagined content from human brain activity.
The video shows text generated for viewed content during optimization. https://t.co/e0cP6B3CDL pic.twitter.com/mB2CO959tT
— Tomoyasu Horikawa (@HKT52) April 27, 2024
The Language of Thought
One of the most striking experiments came later, when the volunteers were asked to recall the videos rather than watch them. They closed their eyes, imagined the scenes, and rated how vivid their mental replay felt. The same model, trained only on perception data, was used to decode these recollections. Astonishingly, it still worked.
Credit: Nature, 2025, Horikawa.
Even when subjects were only imagining the videos, the AI generated accurate sentences describing them, sometimes identifying the right clip out of a hundred. That result hinted at a powerful idea: the brain uses similar representations for perception and for visual recall, and those representations can be translated into language without ever engaging the traditional “language areas” of the brain.
In fact, when the researchers deliberately excluded regions typically associated with language processing, the system continued to generate coherent text. This suggests that structured meaning — what scientists call “semantic representation” — is distributed widely across the brain, not confined to speech-related zones.
That discovery carries enormous implications for people who can’t speak. Individuals with aphasia or neurodegenerative diseases that affect language could, in principle, use such systems to communicate through their nonverbal brain activity. The paper calls this an “interpretive interface” that could restore communication for those whose words are trapped inside their minds.
Promise and Concerns
Still, the researchers are careful not to overpromise. The technology is far from being a mind-reading device. It depends on hours of personalized data from each participant, massive MRI scanners, and a very narrow set of visual stimuli. The sentences it generates are filtered through the biases of the English-language captions and the models used to train them. Change the language model or the dataset, and the output could shift dramatically.
Horikawa himself insists that the system doesn’t reconstruct thoughts directly. It instead translates them through layers of AI interpretation. “To accurately characterize our primary contribution, it is essential to frame our method as an interpretive interface rather than a literal reconstruction of mental content,” the paper states.
The ethical implications of this technology are hard to ignore. If machines can turn brain activity into words, even imperfectly, who controls that information? Could it be misused in surveillance, law enforcement, or advertising? Both Horikawa and Huth have stressed the importance of consent and privacy. “Nobody has shown you can do that, yet,” Huth told Nature, when asked about reading private thoughts. But that “yet” is doing a lot of work.
For now, mind captioning is confined to the lab: a handful of subjects, a room-sized scanner, and a process that takes hours to calibrate. But the direction of travel is unmistakable.