Measured brain activity was combined with LLMs like ChatGPT to connect the dots between neural patterns and language. (Credit: Teacher Photo on Shutterstock)
In A Nutshell
- The technology works like a translator, not a mind reader – It converts brain scan patterns into coherent sentences by learning which neural patterns correspond to different types of visual content, then building descriptions word by word through 100 rounds of AI-powered optimization.
- It identifies the correct video about 50% of the time from 100 options – Where random guessing would succeed only 1% of the time, the system generates descriptions accurate enough to match videos people watched or recalled from memory, capturing not just objects but relationships and actions.
- The brain stores detailed scene information outside language areas – Even when completely ignoring traditional language regions, the system maintained 50% accuracy, revealing that rich “who does what to whom” representations exist throughout visual and action-processing brain areas.
- The technology could help people who’ve lost the ability to speak – Because it bypasses language networks, it might provide alternative communication for individuals with aphasia, ALS, or other conditions affecting traditional language production while understanding remains intact.
Plenty of people turn the captions on while watching TV, but what about captions for your thoughts? Researchers have built an interpretive interface that turns patterns of brain activity into text. To be clear, the technology is described as translating, rather than directly “reading” mental content.
For example, let’s say a person were to imagine watching a sunset over the ocean, complete with thick clouds drifting across the sky. Scientists would read that brain activity and generate a description like “cloud tops are visible drifting over the sunset ocean.”
The method, called “mind captioning,” works for both viewed and imagined scenes. It unlocks new possibilities for understanding the mind and helping people who’ve lost the ability to communicate through traditional language.
Published in Science Advances, the research shows how the mind captioning system can generate accurate, structured sentences describing what someone is experiencing by first decoding semantic features from fMRI scans and then iteratively optimizing sentences with a language model. Unlike previous attempts that could only identify individual objects or pull from existing databases of descriptions, this approach creates original descriptions capturing not just what’s present in a scene, but how different elements interact and relate to each other.
Turning Brain Patterns Into ‘Mind Captions’
The mind captioning breakthrough relies on combining functional magnetic resonance imaging (fMRI) to measure brain activity with large language models (similar to ChatGPT) to bridge the gap between neural patterns and human language. The researchers trained their system on brain scans from six people watching short video clips, teaching it to recognize which patterns of brain activity corresponded to which types of visual content.
The system generates descriptions by constructing sentences word by word, optimizing each choice to align with what the brain is representing. Starting from scratch, it iteratively refines descriptions through 100 rounds of optimization, progressively building more accurate and detailed sentences.
When tested on new videos the subjects had never seen before, the system could identify the correct video from among 100 options with about 50% accuracy based solely on brain-decoded descriptions, where chance would be just 1%. Example outputs from the paper captured both individual elements and their relationships, with descriptions like people speaking while others hugged, or someone jumping over a waterfall on a mountain.
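To make that 100-way identification benchmark concrete, here is a minimal sketch of how such scoring could be computed, assuming the decoded descriptions and the candidate videos' reference captions have already been turned into semantic feature vectors. The function name and the use of cosine similarity are illustrative, not taken from the paper's code.

```python
import numpy as np

def identification_accuracy(decoded_feats, candidate_feats, true_idx):
    """N-way identification: for each decoded description, pick the candidate
    video whose caption features are most similar (cosine) and check whether
    it is the video the subject actually saw or recalled.

    decoded_feats:   (n_trials, dim) features of brain-decoded descriptions
    candidate_feats: (n_candidates, dim) features of candidate videos' captions
    true_idx:        (n_trials,) index of the correct candidate per trial
    """
    # L2-normalize so the dot product equals cosine similarity
    a = decoded_feats / np.linalg.norm(decoded_feats, axis=1, keepdims=True)
    b = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = a @ b.T                      # (n_trials, n_candidates)
    picks = sims.argmax(axis=1)
    return (picks == true_idx).mean()   # chance is 1 / n_candidates (1% for 100)
```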
It Works for Imagination Too
The same system worked when subjects merely imagined videos they had previously watched, with their eyes closed. By measuring brain activity during these mental imagery sessions, the researchers could generate mind captions of what people were recalling from memory. During recall, performance was above chance, and the researchers could generate understandable descriptions even from single trials in some cases, though results varied between individuals.
You can never really know what’s going on in someone else’s head…or can you? (Credit: PeopleImages on Shutterstock)
This ability to decode both perception and imagination shows that the mind captioning system is tapping into fundamental ways the brain represents meaningful content, regardless of whether that content comes from the outside world or from memory.
The Language Network Isn’t Always Necessary
One of the study’s most surprising discoveries challenges conventional understanding of how the brain processes complex information. Even when excluding the brain’s language network entirely, the system still produced structured descriptions and maintained identification accuracy near 50% among 100 candidates.
Rich, structured information about visual scenes and their relationships exists across broad regions of the brain, particularly in areas involved in visual processing and in understanding actions and interactions. The brain maintains detailed representations of “who does what to whom” without necessarily translating those representations into words.
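As a rough illustration of what “excluding the language network” means computationally, here is a hedged sketch in which voxels flagged as belonging to language regions are simply dropped before refitting a linear feature decoder. The ridge regression, mask format, and variable names are assumptions for illustration, not the study’s actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

def decode_without_language_network(X_train, Y_train, X_test, language_mask):
    """Illustrative re-analysis: drop voxels flagged as language-network
    (boolean array over voxel columns) and refit the decoder on the rest.

    X_*:           (n_samples, n_voxels) fMRI patterns
    Y_train:       (n_samples, dim) target semantic features
    language_mask: (n_voxels,) True where a voxel belongs to language ROIs
    """
    keep = ~np.asarray(language_mask)                  # voxels outside language regions
    decoder = Ridge(alpha=100.0).fit(X_train[:, keep], Y_train)
    return decoder.predict(X_test[:, keep])            # decoded features, language areas excluded
```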
This distinction matters both scientifically and for understanding conditions like aphasia, where language production is impaired but conceptual understanding may remain intact. It supports the idea that nonverbal thought can be remarkably detailed and structured, existing independently of language.
Mind Captioning Opens New Pathways for Communication
For individuals who have lost the ability to speak due to conditions like aphasia, amyotrophic lateral sclerosis, or severe motor impairments, this technology could provide an alternative communication pathway. Because the method doesn’t rely on the language network, it might work even when traditional language areas are damaged.
Training the system required extensive scanning sessions and thousands of videos, establishing the initial decoding models. However, once trained, the system could generate comprehensible descriptions of recalled content from single trials in some cases, showing promise for practical applications.
All six subjects in the study were native Japanese speakers with varying English proficiency. The system produced English descriptions because it decoded brain activity into an English semantic space—the captions and language models used were English-based. This shows the method translates nonverbal mental representations into language output regardless of the subject’s abilities in that particular language.
The Technical Process Behind the Discovery
The process involves two main stages. First, the system learns to predict “semantic features” from brain activity. These features, computed by a deep language model called DeBERTa-large, represent the meaning of text in a mathematical form capturing contextual relationships between words and concepts. These semantic representations preserve information about word order and grammatical structure, not just which objects are present.
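Here is a minimal sketch of what this first stage could look like in code: captions are embedded with a DeBERTa model, and a regularized linear regression maps fMRI voxel patterns onto those features. The specific checkpoint name, mean-pooling step, and ridge penalty are assumptions made for illustration; the paper’s exact feature extraction and regression settings may differ.

```python
import torch
from sklearn.linear_model import Ridge
from transformers import AutoTokenizer, AutoModel

# Caption -> semantic feature vector. Checkpoint, layer choice, and mean pooling
# are illustrative assumptions; the study derives features from DeBERTa-large's
# internal representations.
tok = AutoTokenizer.from_pretrained("microsoft/deberta-large")
lm = AutoModel.from_pretrained("microsoft/deberta-large").eval()

def caption_features(captions):
    with torch.no_grad():
        enc = tok(captions, return_tensors="pt", padding=True, truncation=True)
        hidden = lm(**enc).last_hidden_state                   # (batch, tokens, dim)
        mask = enc["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean over real tokens

# Stage 1: a regularized linear regression maps voxel patterns to those features.
def fit_feature_decoder(X_train, captions_train):
    """X_train: (n_trials, n_voxels) fMRI patterns for the training videos."""
    Y_train = caption_features(captions_train)     # (n_trials, dim)
    decoder = Ridge(alpha=100.0)                   # penalty value is a placeholder
    decoder.fit(X_train, Y_train)
    return decoder                                 # decoder.predict(X_test) -> decoded features
```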
During the second stage of description generation, another language model pretrained for “masked language modeling” guides the process. This model, RoBERTa-large, suggests candidate words to fill gaps in the developing sentence, using the surrounding context to make informed predictions. The system progressively refines descriptions by comparing their semantic features to those decoded from the brain, selecting word sequences that best match the neural patterns.
This iterative optimization approach allows the system to search through countless possible descriptions to find ones that accurately reflect what’s represented in the brain.
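A hedged sketch of that loop is below: starting from an all-mask sentence, a masked language model proposes replacements for one position at a time, and the candidate whose full-sentence features best match the brain-decoded features is kept. The greedy single-position update, fixed sentence length, and cosine scoring are simplifications for illustration, not the paper’s exact optimization procedure; `caption_features` refers to the text-to-feature helper from the decoding sketch above.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

mlm_tok = AutoTokenizer.from_pretrained("roberta-large")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-large").eval()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evolve_description(decoded_feats, caption_features, n_words=8,
                       n_iters=100, top_k=5):
    """Greedy sketch of the iterative optimization: mask one position at a time,
    let RoBERTa propose replacement words, and keep whichever full sentence's
    semantic features best match the brain-decoded features."""
    words = [mlm_tok.mask_token] * n_words            # noninformative starting state
    for it in range(n_iters):
        pos = it % n_words                            # cycle through word positions
        trial = list(words)
        trial[pos] = mlm_tok.mask_token
        enc = mlm_tok(" ".join(trial), return_tensors="pt")
        with torch.no_grad():
            logits = mlm(**enc).logits[0]
        # index of the mask we just inserted (earlier positions are already filled)
        mask_idx = (enc["input_ids"][0] == mlm_tok.mask_token_id).nonzero()[0].item()
        candidates = logits[mask_idx].topk(top_k).indices.tolist()
        best_word, best_score = words[pos], -np.inf
        for cand in candidates:
            trial[pos] = mlm_tok.decode([cand]).strip()
            score = cosine(caption_features([" ".join(trial)])[0], decoded_feats)
            if score > best_score:
                best_word, best_score = trial[pos], score
        words[pos] = best_word
    return " ".join(words)
```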
Understanding How the Brain Organizes Information
The research builds on decades of work in brain decoding but represents a major leap in complexity. Previous studies successfully identified individual objects, faces, or places from brain activity, and some could classify images into broad categories. Recent work even decoded speech or linguistic information from brain activity during language tasks. This method goes further by capturing structured visual semantics and generating full descriptions of complex, dynamic scenes.
This technology may benefit patients with aphasia, ALS, or other conditions affecting speech. (Credit: Sidorov_Ruslan on Shutterstock)
The ability to decode structured information offers important insights into how the brain represents the world. The researchers found that simply shuffling the word order of generated descriptions substantially reduced their accuracy, even when all the same words were present. This shows the descriptions capture genuine relational information about who or what is doing something to whom or what, not just lists of objects that happen to be present.
Brain imaging revealed this structured information is distributed across higher visual areas, parietal cortex, and frontal regions. The fact that shuffling word order hurts performance signals that the brain maintains true relational structure, not merely collections of individual concepts.
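The shuffle control can be pictured with a few lines of code: the same words are kept but their order is randomized before re-scoring identification. This reuses the illustrative `caption_features` and `identification_accuracy` helpers from the sketches above and is only meant to convey the logic of the control analysis, not the paper’s exact procedure.

```python
import random

def shuffle_words(description, seed=0):
    """Control condition: keep the same words but destroy their order."""
    rng = random.Random(seed)
    words = description.split()
    rng.shuffle(words)
    return " ".join(words)

# Illustrative comparison (helpers defined in the earlier sketches):
# acc_intact   = identification_accuracy(caption_features(descriptions), cand_feats, truth)
# acc_shuffled = identification_accuracy(
#     caption_features([shuffle_words(d) for d in descriptions]), cand_feats, truth)
# A large drop from acc_intact to acc_shuffled suggests the descriptions carry
# relational (word-order-dependent) structure, not just a bag of objects.
```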
What Comes Next
Important limitations remain. The descriptions, while accurate, still show room for improvement in capturing the full richness of subjective experience. The temporal resolution of fMRI also means the system captures what people experience over several seconds rather than moment-to-moment changes.
The researchers explicitly warn about mental-privacy risks and call for regulations to protect mental privacy and autonomy, particularly as advances in cross-individual alignment techniques may lower data requirements and make the technology more accessible. While current applications require willing participants and extensive data collection, future developments could reduce these barriers.
The study focused on visual content, but the same principles could potentially extend to other modalities and types of mental content. Future research might decode descriptions of sounds, abstract concepts, or even dreams. Using language models more closely aligned with human brain activity could further improve performance.
This research represents a major step toward understanding and accessing the contents of the human mind. By translating patterns of brain activity into readable text, scientists have created an interpretive interface between neural representations and language.
Disclaimer: This article describes laboratory research on brain-computer interfaces and should not be interpreted as currently available medical technology. The research involved willing participants in controlled settings with extensive informed consent. Brain decoding technology raises important privacy and ethical considerations that require ongoing discussion and regulation. Individuals with communication disorders should consult qualified healthcare providers about available and appropriate assistive technologies for their specific situations.
Paper Summary
Methodology
The study involved six Japanese subjects who participated in fMRI scanning sessions while viewing short video clips and later recalling them from memory. The researchers collected brain activity data during a video presentation experiment with 2,180 unique videos in training sessions and 72 videos repeated five times in test sessions. An imagery experiment had subjects recall 72 videos from memory after viewing verbal cues. Each video had 20 text captions annotated by independent workers describing the visual content. The researchers used linear regression models to decode brain activity into semantic features computed by the DeBERTa-large language model from video captions. These semantic features served as intermediate representations bridging brain activity and text. To generate descriptions from decoded features, they developed an iterative optimization method using the RoBERTa-large model pretrained for masked language modeling. This process involved repeatedly masking words, suggesting alternatives based on context, and selecting candidates whose semantic features best matched brain-decoded features through 100 iterations. The system started from a noninformative initial state and progressively evolved descriptions to align with target brain representations.
Results
The generated descriptions accurately captured viewed content, including dynamic changes and interactions between multiple elements, even when specific objects weren’t correctly identified. Throughout optimization, descriptions evolved from fragmented text to coherent structures, with semantic features showing increasingly stronger correlations with both target brain-decoded features and reference captions. Discriminability was substantially above chance across all evaluation metrics, with approximately 50% accuracy in identifying correct videos from 100 candidates. Word-order shuffling considerably reduced both identification accuracy and discriminability, demonstrating the descriptions captured structured relational information beyond simple word lists. This effect was more pronounced when using features from deeper language model layers, highlighting the importance of contextual semantic representations. The method outperformed approaches based on caption databases or nonlinear image captioning models. Voxelwise encoding analysis revealed semantic features effectively predicted brain activity in language networks and regions involved in recognizing objects, actions, and interactions. Notably, accurate descriptions capturing structured semantics could be generated without relying on the language network, achieving nearly 50% accuracy when excluding these regions. Perception-trained decoders successfully generalized to imagery-induced brain activity, generating descriptions of recalled content with above-chance accuracy. Semantic features demonstrated superior generalizability compared to visual or visuo-semantic features when decoding imagery using perception-trained decoders. The method produced comprehensible descriptions from single-trial fMRI activity during recall in some cases.
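For readers curious what a voxelwise encoding analysis involves, the sketch below fits a regularized linear model that predicts each voxel’s response from the captions’ semantic features and scores it with a per-voxel correlation. The ridge penalty and scoring choice are illustrative assumptions, not the study’s exact settings.

```python
import numpy as np
from sklearn.linear_model import Ridge

def voxelwise_encoding(feat_train, Y_train, feat_test, Y_test):
    """Illustrative encoding analysis: predict each voxel's response from the
    stimuli's semantic features and score prediction accuracy per voxel.

    feat_*: (n_samples, dim) semantic features of the presented videos' captions
    Y_*:    (n_samples, n_voxels) measured fMRI responses
    """
    enc = Ridge(alpha=100.0).fit(feat_train, Y_train)   # one linear map for all voxels
    pred = enc.predict(feat_test)                       # (n_test, n_voxels)
    # Pearson correlation between predicted and measured responses, per voxel
    pz = (pred - pred.mean(0)) / pred.std(0)
    yz = (Y_test - Y_test.mean(0)) / Y_test.std(0)
    return (pz * yz).mean(0)                            # (n_voxels,) encoding accuracy
```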
Limitations
The study used natural videos from the web, which enhanced ecological validity but constrained the ability to precisely identify which relational structures the method captures and assess generalizability to atypical scenes. Without experimental control through systematic manipulation of distinct relational structures, it remains unclear whether success reflects true generalization beyond common patterns or reliance on implicit biases toward typical scene structures potentially introduced through model priors, training data distribution, or stimulus selection. Reference captions came from independent annotators rather than the fMRI subjects themselves, so they may not fully align with each subject’s unique perceptions, though using 20 captions per video likely mitigated some variability. Annotators were instructed to focus on visual content rather than subjective aspects like emotional reactions, so generated descriptions were predominantly concrete and rarely reflected abstract dimensions. The verbal prompts used to cue recalled videos during the imagery experiment may have influenced brain activity during the imagery period due to slow hemodynamic response, making it difficult to fully differentiate activity associated with text reading from mental imagery. The method currently works best with repeated measurements and shows variable performance across individuals. One subject (S1) was exposed to the same stimuli multiple times during preliminary experiments, potentially influencing their brain responses.
Funding and Disclosures
This research was supported by grants from JST PRESTO (grant number JPMJPR185B) and JSPS KAKENHI (grant number JP21H03536). The author declares no competing interests.
Publication Information
Horikawa, T. (2025). “Mind captioning: Evolving descriptive text of mental content from human brain activity,” published in Science Advances, November 25, 2025. DOI:10.1126/sciadv.adw1464
About StudyFinds Analysis
Called “brilliant,” “fantastic,” and “spot on” by scientists and researchers, our acclaimed StudyFinds Analysis articles are created using an exclusive AI-based model with complete human oversight by the StudyFinds Editorial Team. For these articles, we use an unparalleled LLM process across multiple systems to analyze entire journal papers, extract data, and create accurate, accessible content. Our writing and editing team proofreads and polishes each and every article before publishing. With recent studies showing that artificial intelligence can interpret scientific research as well as (or even better) than field experts and specialists, StudyFinds was among the earliest to adopt and test this technology before approving its widespread use on our site. We stand by our practice and continuously update our processes to ensure the very highest level of accuracy. Read our AI Policy (link below) for more information.
Our Editorial Process
StudyFinds publishes digestible, agenda-free, transparent research summaries that are intended to inform the reader as well as stir civil, educated debate. We do not agree nor disagree with any of the studies we post, rather, we encourage our readers to debate the veracity of the findings themselves. All articles published on StudyFinds are vetted by our editors prior to publication and include links back to the source or corresponding journal article, if possible.
Our Editorial Team
Steve Fink
Editor-in-Chief
Sophia Naughton
Associate Editor