Abstract
Animals excel at seamlessly integrating information from different senses, a capability critical for navigating complex environments. Despite recent progress in multisensory research, the absence of stimulus-computable perceptual models fundamentally limits our understanding of how the brain extracts and combines task-relevant cues from the continuous flow of natural multisensory stimuli. Here, we introduce an image- and sound-computable population model for audiovisual perception, based on biologically plausible units that detect spatiotemporal correlations across auditory and visual streams. In a large-scale simulation spanning 69 psychophysical, eye-tracking, and pharmacological experiments, our model replicates human, monkey, and rat behaviour in response to diverse audiovisual stimuli with an average correlation exceeding 0.97. Despite relying on as few as 0 to 4 free parameters, our model provides an end-to-end account of audiovisual integration in mammals, from individual pixels and audio samples to behavioural responses. Remarkably, the population response to natural audiovisual scenes generates saliency maps that predict spontaneous gaze direction, Bayesian causal inference, and a variety of previously reported multisensory illusions. This study demonstrates that the integration of audiovisual stimuli, regardless of their spatiotemporal complexity, can be accounted for in terms of elementary joint analyses of luminance and sound level. Beyond advancing our understanding of the computational principles underlying multisensory integration in mammals, this model provides a bio-inspired, general-purpose solution for multimodal machine perception.
Introduction
Perception in natural environments is inherently multisensory. For example, during speech perception, the human brain integrates audiovisual information to enhance speech intelligibility, often beyond awareness. A compelling demonstration of this is the McGurk illusion (McGurk and MacDonald, 1976), where the auditory perception of a syllable is altered by mismatched lip movements. Likewise, audiovisual integration plays a critical role in spatial localization, as illustrated by the ventriloquist illusion (Stratton, 1897), where perceived sound location shifts toward a synchronous visual stimulus.
Extensive behavioural and neurophysiological findings demonstrate that audiovisual integration occurs when visual and auditory stimuli are presented in close spatiotemporal proximity (i.e. the spatial and temporal determinants of multisensory integration; Stein, 2012, Stein and Stanford, 2008). When redundant multisensory information is integrated, the resulting percept is more reliable (Ernst and Banks, 2002) and salient (Talsma et al., 2010). Various models have successfully described how audiovisual integration unfolds across time and space (Alais and Burr, 2004, Körding et al., 2007, Magnotti et al., 2013, Yarrow et al., 2023), often within a Bayesian Causal Inference framework, where the system determines the probability that visual and auditory stimuli have a common cause and weighs the senses accordingly. This is the case for the detection of spatiotemporal discrepancies across the senses, or susceptibility to phenomena such as the McGurk or Ventriloquist illusions (Körding et al., 2007, Magnotti et al., 2013, Magnotti and Beauchamp, 2017).
Prevailing theoretical models of multisensory integration typically operate at what Marr, 1982 termed the computational level: they describe what the system is trying to achieve (e.g. obtain precise sensory estimates). However, these models are not stimulus-computable. That is, rather than analysing raw auditory and visual input directly, they rely on experimenter-defined, low-dimensional abstractions of the stimuli (Alais and Burr, 2004, Körding et al., 2007, Magnotti et al., 2013, Yarrow et al., 2023, Magnotti and Beauchamp, 2017), such as the asynchrony between sound and image, expressed in seconds (Magnotti et al., 2013, Yarrow et al., 2023), or spatial location (Alais and Burr, 2004, Körding et al., 2007). As a result, they solve a fundamentally different task than real perceptual systems, which must infer such properties from the stimuli themselves, from dynamic patterns of pixels and audio samples, without access to ground-truth parameters. From Marr's perspective, what is missing is an account at the algorithmic level: a concrete description of the stimulus-driven representations and operations that could give rise to the observed computations.
Despite their clear success in accounting for behaviour in simple, controlled conditions, current models remain silent on how perceptual systems extract, process, and combine task-relevant information from the continuous and structured stream of audiovisual signals that real-world perception entails. This omission is critical: audiovisual perception involves the continuous analysis of images and sounds; hence, models that do not operate on the stimuli cannot provide a complete account of perception. Only a few models can process elementary audiovisual stimuli (Parise and Ernst, 2016, Cuppini et al., 2017), and none can tackle the complexity of natural audiovisual input. Currently, there are no stimulus-computable models (Burge, 2020) for multisensory perception that can take as input natural audiovisual data, like movies. This study explores how behaviour consistent with mammalian multisensory perception emerges from low-level analyses of natural auditory and visual signals.
In an image- and sound-computable model, visual and auditory stimuli can be represented as patterns in a three-dimensional space, where x and y are the two spatial dimensions, and t is the temporal dimension. An instance of such a three-dimensional diagram for the case of audiovisual speech is shown in Figure 1B (top): moving lips generate patterns of light that vary in sync with the sound. In such a representation, audiovisual correspondence can be detected by a local correlator (i.e. multiplier) that operates across space, time, and the senses (Parise et al., 2012). In previous studies, we proposed a biologically plausible solution to detect temporal correlation across the senses (Figure 1A; Parise and Ernst, 2016; Pesnot Lerousseau et al., 2022; Parise and Ernst, 2025; Horsfall et al., 2021). Here, we will illustrate how a population of multisensory correlation detectors can take real-life footage as input and provide a comprehensive bottom-up account for multisensory integration in mammals, encompassing its temporal, spatial, and attentional aspects.
The Multisensory Correlation Detector (MCD) population model.
(A) Schematic representation of a single MCD unit. The input visual signal represents the intensity of a pixel (mouth area) over time, while the audio input is the soundtrack (the syllable /ba/). The gray soundtrack represents the experimental manipulation of AV lag, obtained by delaying one sense with respect to the other. BPF and LPF indicate band-pass and low-pass temporal filters, respectively. (C) shows how single-unit responses vary as a function of cross-modal lag. (B) represents the architecture of the MCD population model. Each visual unit (in blue) receives input from a single pixel, while the auditory unit receives as input the intensity envelope of the soundtrack (mono audio; see Figure 4A for a version of the model capable of receiving spatialized auditory input). Sensory evidence is then integrated over time and space for perceptual decision-making, a process in which the two model responses are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion to generate a forced-choice response (D).
The present approach posits the existence of elementary processing units, the Multisensory Correlation Detectors (MCD; Parise and Ernst, 2025), each integrating time-varying input from unimodal transient channels through a set of temporal filters and elementary operations (Figure 1A, see Methods). Each unit returns two outputs, representing the temporal correlation and order of incoming visual and auditory signals (Figure 1C). When arranged in a two-dimensional lattice (Figure 1B), a population of MCD units is naturally suited to take movies (e.g. dynamic images and sounds) as input, and is hence capable of processing any stimulus used in previous studies on audiovisual integration. Given that the aim of this study is to provide an account of multisensory integration in biological systems, the benchmark of our model is to reproduce observers' behaviour in carefully controlled psychophysical and eye-tracking experiments. Emphasis will be given to studies using natural stimuli, which, despite their manifest ecological value, simply cannot be handled by alternative models. Among them, particular attention will be dedicated to experiments involving speech, perhaps the most representative instance of audiovisual perception, and sometimes claimed to be processed via dedicated mechanisms in the human brain.
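To make the single-unit computation concrete, the sketch below implements a toy MCD unit in MATLAB. The filter shapes (first-order low-pass filters, with a band-pass built as a difference of low-passes), all time constants, and the toy stimuli are illustrative assumptions, not the fitted parameters reported in the Methods; see Parise and Ernst, 2016; 2025 for the exact formulation.

```matlab
% Toy single-unit MCD sketch. Filter shapes and time constants are
% illustrative assumptions, not the fitted values reported in the Methods.
fs  = 100;  t = (0:1/fs:2)';                   % 100 Hz sampling, 2 s window
vis = double(t > 1.00 & t < 1.05);             % toy visual transient (brief flash)
aud = double(t > 1.10 & t < 1.15);             % toy auditory transient, 100 ms later

lpf = @(x,tau) filter(1-exp(-1/(fs*tau)), [1 -exp(-1/(fs*tau))], x);  % 1st-order low-pass
bpf = @(x) lpf(x,0.05) - lpf(x,0.20);          % crude band-pass (transient channel)

V  = bpf(vis);  A = bpf(aud);                  % unimodal transient channels
u1 = lpf(V,0.10) .* A;                         % sub-unit 1: low-passed vision x audition
u2 = lpf(A,0.10) .* V;                         % sub-unit 2: low-passed audition x vision

MCDcorr = mean(u1 .* u2);                      % correlation output: largest near zero lag
MCDlag  = mean(u1 - u2);                       % lag output: sign tracks temporal order
```

Sweeping the auditory onset across a range of lags and plotting MCDcorr and MCDlag against lag reproduces, qualitatively, the single-unit tuning curves sketched in Figure 1C.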
Results
We tested the performance of our population model on three main aspects of audiovisual perception. The first concerns the temporal determinants of multisensory integration, primarily investigating how subjective audiovisual synchrony and integration depend on the physical lag across the senses. The second addresses the spatial determinants of audiovisual integration, focusing on the combination of visual and acoustic cues for spatial localization. The third one involves audiovisual attention and examines how gaze behaviour is spontaneously attracted to audiovisual stimuli even in the absence of explicit behavioural tasks. While most of the literature on audiovisual psychophysics involves human participants, in recent years monkeys and rats have also been trained to perform the same behavioural tasks. Therefore, to generalize our approach, whenever possible, we simulated experiments involving all available animal models.
Temporal determinants of audiovisual integration in humans and rats
Classic experiments on the temporal determinants of audiovisual integration usually manipulate the lag between the senses and assess the perception of synchrony, temporal order, and audiovisual speech integration (as measured in humans with the McGurk illusion, see Video 1) through psychophysical forced-choice tasks (Venezia et al., 2016, Vroomen and Keetels, 2010; Parise et al., 2025). Among them, we obtained both the audiovisual footage and the psychophysical data from 43 experiments in humans that used ecological audiovisual stimuli (real-life recordings of, e.g., speech and performing musicians; Figure 2A, Figure 2–figure supplement 1, and Supplementary file 1; for the inclusion criteria, see Methods): 27 experiments were simultaneity judgments (van Wassenhove et al., 2007; Lee and Noppeney, 2011; Vroomen and Stekelenburg, 2011; Magnotti et al., 2013; Roseboom and Arnold, 2011; Yuan et al., 2014; Ikeda and Morishita, 2020; van Laarhoven et al., 2019; Lee and Noppeney, 2014), 10 were temporal order judgments (Vroomen and Stekelenburg, 2011; Freeman et al., 2013), and six assessed the McGurk effect (van Wassenhove et al., 2007; Yuan et al., 2014; Freeman et al., 2013).
Natural audiovisual stimuli and psychophysical responses.
(A) Stimuli (still frame and soundtrack) and psychometric functions for the McGurk illusion (van Wassenhove et al., 2007), synchrony judgments (Lee and Noppeney, 2011), and temporal order judgments (Vroomen and Stekelenburg, 2011). In all panels, dots correspond to empirical data, lines to Multisensory Correlation Detector (MCD) responses; negative lags represent vision first. (B) Stimuli and results of Alais and Carlile, 2005. The left panel displays the envelopes of auditory stimuli (clicks) recorded at different distances in a reverberant environment (the Sydney Opera House). While the reverberant portion of the sound is identical across distances, the intensity of the direct sound (the onset) decreases with depth. As a result, the centre of mass of the envelopes shifts rightward with increasing distance. The central panel shows empirical and predicted psychometric functions for the various distances. The four curves were fitted using the same decision-making parameters, so that the separation between the curves results purely from the operation of the MCD. The lag at which sound and light appear synchronous (point of subjective synchrony) scales with distance at a rate approximately matching the speed of sound (right panel). The dots in the right panel display the point of subjective synchrony (estimated separately for each curve), while the jagged line is the model prediction. (C) shows temporal order judgments for clicks and flashes from both rats and human observers (Mafi et al., 2022). Rats outperform humans at short lags, and vice versa. (D) Rats' temporal order and synchrony judgments for flashes and clicks of varying intensity (Schormans and Allman, 2018). Note that in the synchrony judgment task only the left flank of the psychometric curve (video-lead lags) was sampled. Importantly, the three curves in each task were fitted using the same decision-making parameters, so that the MCD alone accounts for the separation between the curves. (E) Pharmacologically-induced changes in rats' audiovisual time perception. Left: Glutamatergic inhibition (MK-801 injection) leads to asymmetric broadening of the psychometric functions for simultaneity judgments. Right: GABA inhibition (Gabazine injection) abolishes rapid temporal adaptation, so that psychometric curves do not change based on the lag of the previous trials (as they do in controls) (Schormans and Allman, 2023). All pharmacologically-induced changes in audiovisual time perception can be accounted for by changes in the decision-making process, with no need to postulate changes in low-level temporal processing.
For each of the experiments, we can feed the stimuli to the model (Figure 1B and D), and compare the output to the empirical psychometric functions (Equation 10, for details see Methods) (Parise and Ernst, 2016; Pesnot Lerousseau et al., 2022; Parise and Ernst, 2025; Horsfall et al., 2021). Results demonstrate that a population of MCDs can broadly account for audiovisual temporal perception of ecological stimuli, and near-perfectly (rho = 0.97) reproduces the empirical psychometric functions for simultaneity judgments, temporal order judgments, and the McGurk effect (Figure 2A, Figure 2–figure supplement 1). To quantify the impact of the low-level properties of the stimuli on the performance of the model, we ran a permutation test, where psychometric functions were predicted from mismatching stimuli (see Methods). The psychometric curves predicted from the matching stimuli provided a significantly better fit than mismatching stimuli (p<0.001, see Figure 2–figure supplement 1K). This demonstrates that our model captures the subtle effects of how individual features affect observed responses, and it highlights the role of low-level stimulus properties on multisensory perception. All analyses performed so far relied on psychometric functions averaged across observers; individual observer analyses are included in the Figure 2–figure supplements 3–5.
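As a rough illustration of the decision stage sketched in Figure 1D (the two model outputs are weighted, summed, corrupted with additive Gaussian noise, and compared to a criterion), the following MATLAB snippet maps time-averaged MCD outputs onto forced-choice probabilities via a cumulative Gaussian. The parameter names and all numerical values are placeholders, not the fitted values of Equation 10.

```matlab
% Hedged sketch of the decision stage (cf. Figure 1D and Equation 10):
% weighted sum of the two averaged model outputs, additive Gaussian noise,
% and a criterion yield a cumulative-Gaussian choice probability.
% bCorr, bLag, crit, and sigma are placeholder parameter names/values.
pChoice = @(mcdCorr, mcdLag, bCorr, bLag, crit, sigma) ...
    0.5 * (1 + erf((bCorr*mcdCorr + bLag*mcdLag - crit) ./ (sigma*sqrt(2))));

% Toy example: probability of a "synchronous" response across five lag
% conditions, given one pair of time-averaged outputs per condition.
mcdCorrByLag = [0.2 0.6 0.9 0.6 0.2];          % toy correlation outputs
mcdLagByLag  = [-0.4 -0.2 0.0 0.2 0.4];        % toy lag outputs
pSync = pChoice(mcdCorrByLag, mcdLagByLag, 5, 0, 2, 1)
```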
When estimating the perceived timing of audiovisual events, it is important to consider the different propagation speeds of light and sound, which introduce audio lags that are proportional to the observer's distance from the source (Figure 2B, right). Psychophysical temporal order judgments demonstrate that, to compensate for these lags, humans scale subjective audiovisual synchrony with distance (Figure 2B; Alais and Carlile, 2005). This result has been interpreted as evidence that humans exploit auditory spatial cues, such as the direct-to-reverberant energy ratio (Figure 2B, left), to estimate the distance of the sound source and adjust subjective synchrony by scaling distance estimates by the speed of sound (Alais and Carlile, 2005). When presented with the same stimuli, our model also predicts the observed shifts in subjective simultaneity (Figure 2B, centre). However, rather than relying on explicit spatial representations and physics simulations, these shifts emerge from elementary analyses of natural audiovisual signals. Specifically, in reverberant environments, the intensity of the direct portion of a sound increases with source proximity, while the reverberant component remains constant. As a result, the envelopes of sounds originating close to the observer are more front-heavy than those of distant sounds (Figure 2B, left). These are low-level acoustic features that the lag detector of the MCD is especially sensitive to, thereby providing a computational shortcut to explicit physics simulations. A MATLAB implementation of this simulation is included in Source code 1.
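The envelope effect can be illustrated with a toy model (the values below are illustrative and are not the Alais and Carlile, 2005 recordings, nor the released Source code 1): the direct onset scales with proximity while the reverberant tail is fixed, so the envelope's centre of mass, a feature the MCD lag detector is sensitive to, shifts later with increasing distance.

```matlab
% Toy illustration of the distance effect in Figure 2B. The direct onset
% scales inversely with distance while the reverberant tail is constant,
% shifting the envelope's centre of mass later for farther sources.
% All values are illustrative, not the Alais and Carlile (2005) recordings.
fs = 1000;  t = (0:1/fs:1)';
reverb = exp(-t/0.25);                          % reverberant tail, identical across distances
for d = [5 10 20 40]                            % source distance (m), arbitrary values
    env = (10/d) * double(t < 0.01) + reverb;   % direct part scales with proximity
    com = sum(t .* env) / sum(env);             % centre of mass of the envelope
    fprintf('distance %2d m: centre of mass = %.3f s\n', d, com);
end
```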
In recent years, audiovisual timing has been systematically studied also in rats (Mafi et al., 2022, Schormans and Allman, 2018, Al Youzbaki et al., 2023, Schormans and Allman, 2023, Schormans et al., 2016, Paulcan et al., 2023), generally using minimalistic stimuli (such as clicks and flashes), and under a variety of manipulations of the stimuli (e.g. loudness) and pharmacological interventions (e.g. GABA and glutamatergic inhibition). Therefore, to further generalize our model to other species, we assessed whether it can also account for rats' behaviour in synchrony and temporal order judgments. Overall, we could tightly replicate rats' behaviour (rho = 0.981; see Figure 2C-E, Figure 2–figure supplement 2), including the effect of loudness on observed responses (Figure 2D). Interestingly, the unimodal temporal constants for rats were 4 times faster than for humans: such a different temporal tuning is reflected in higher sensitivity in rats for short lags (<0.1 s), and in humans for longer lags (Figure 2C). This fourfold difference in temporal tuning between rats and humans closely mirrors analogous interspecies differences in physiological rhythms, such as heart rate (~4.7 times faster in rats) and breathing rate (~6.3 times faster in rats) (Agoston, 2017).
While tuning the temporal constants of the model was necessary to account for the difference between humans and rats, this was not necessary to reproduce pharmacologically-induced changes in audiovisual time perception in rats (Figure 2E, Figure 2–figure supplement 2F-G), which could be accounted for solely by changes in the decision-making process (Equation 10). This suggests that the observed effects can be explained without altering low-level temporal processing. However, this does not imply that such changes did not occur, only that they were not required to reproduce the behavioural data in our simulations. Future studies using richer temporal stimuli, such as temporally modulated sequences that vary in frequency, rhythm, or phase, will be necessary to disentangle sensory and decisional contributions, as these stimuli can more selectively engage low-level temporal processing and better reveal whether perceptual changes arise from early encoding or later interpretive stages.
An asset of a low-level approach is that it allows one to inspect, at the level of individual pixels and frames, the features of the stimuli that determine the response of the model (i.e. the saliency maps). This is illustrated in Figure 3 and Videos 1 and 2 for the case of audiovisual speech, where model responses cluster mostly around the mouth area and (to a lesser extent) the eyes. These are the regions where pixels' luminance changes in sync with the audio track.
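A minimal sketch of how such pixel-level maps can be computed is given below: one MCD unit per pixel, all driven by the same mono audio envelope, with the time-averaged correlation output serving as the map. The filter shapes, time constants, and toy inputs are assumptions; see the Methods for the exact filters and parameters of the actual model.

```matlab
% Minimal sketch of a pixel-wise map: one MCD unit per pixel, all sharing
% the same mono audio envelope. Filters and constants are illustrative.
fs = 30;  T = 90;  H = 24;  W = 32;             % frame rate and toy movie size
movie = rand(H, W, T);                          % toy luminance movie in [0, 1]
env   = abs(sin(2*pi*3*(1:T)/fs));              % toy audio envelope at frame rate
A     = reshape(env, 1, 1, T);                  % 1x1xT, for implicit expansion (R2016b+)

lpf3 = @(x,tau) filter(1-exp(-1/(fs*tau)), [1 -exp(-1/(fs*tau))], x, [], 3);
bpf3 = @(x) lpf3(x,0.05) - lpf3(x,0.20);        % transient (band-pass) channel

Vt = bpf3(movie);                               % per-pixel visual transients (HxWxT)
At = bpf3(A);                                   % auditory transient (1x1xT)

u1 = lpf3(Vt,0.10) .* At;                       % sub-unit 1, expanded over all pixels
u2 = lpf3(At,0.10) .* Vt;                       % sub-unit 2
salMap = mean(u1 .* u2, 3);                     % HxW map: time-averaged MCDcorr per pixel
```

Applied to real footage (a grayscale movie plus its audio envelope) rather than the toy inputs above, maps of this kind concentrate on the pixels whose luminance co-varies with the soundtrack, such as the mouth region in Figure 3.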
Ecological audiovisual stimuli and model responses.
(A) displays the frames and soundtrack of a dynamic audiovisual stimulus over time (in this example, video and audio tracks are synchronous, and the actress utters the syllable /ta/). (B) shows how the dynamic population responses MCDcorr and MCDlag vary across the frames of Panel A. Note how model responses highlight the pixels whose intensity changed with the soundtrack (i.e. the mouth area). The right side of Panel B represents the population read-out process, as implemented for the simulations in Figure 2: the population responses MCDcorr and MCDlag are integrated over space (i.e. pixels) and time (i.e. frames), scaled and weighted by the gain parameters βcorr and βlag, and summed to obtain a single decision variable that is fed to the decision-making stage (see Figure 1D). (C) represents the time-averaged population responses MCDcorr and MCDlag as a function of cross-modal lag (the central one corresponds to the time-averaged responses shown in B). Note how the time-averaged MCDcorr peaks at around zero lag and decreases with increasing lag (following the same trend shown in Figure 1C, left), while the polarity of the time-averaged MCDlag changes with the sign of the delay. The psychophysical data corresponding to the stimulus in this figure are shown in Figure 2–figure supplement 1B. See Video 2 for a dynamic representation of the content of this figure.
The McGurk Illusion â integration of mismatching audiovisual speech.
The soundtrack is from a recording where the actress utters the syllable /pa/, whereas in the video she utters /ka/. When the video and sound tracks are approximately synchronous, observers often experience the McGurk illusion, and perceive the syllable /ta/. To experience the illusion, try to recognize what the actress utters as we manipulate audiovisual lag. Note how the MCDcorr population response clusters around the mouth area, and how its magnitude scales with the probability of experiencing the illusion. See Video 2 for details.
Population response to audiovisual speech stimuli.
The top left panel displays the stimulus from van Wassenhove et al., 2007, where the actress utters the syllable /ta/. The central and right top panels represent the dynamic MCDcorr(x,y,t) and MCDlag(x,y,t) population responses, respectively (Equations 6 and 7). The lower part of the video displays the temporal profile of the stimuli and model responses (averaged over space). The top two lines represent the stimuli: for the visual stimuli, the line represents the root-mean-squared difference of the pixel values from one frame to the next; the line for audio represents the envelope of the stimulus. MCDVid and MCDAud represent the outputs of the unimodal transient channels (averaged over space) that feed into the MCD (Equation 2). The two lower lines represent the MCDcorr and MCDlag responses and correspond to the averages of the responses displayed in the central and top-right panels. This movie corresponds to the data displayed in Figure 3A–B. Note how the magnitude of MCDcorr increases as the absolute lag decreases, while the polarity of MCDlag changes depending on which modality came first.
Spatial determinants of audiovisual integration in humans and monkeys
Classic experiments on the spatial determinants of audiovisual integration usually require observers to localize the stimuli under systematic manipulations of the discrepancy and reliability (i.e. precision) of the spatial cues (Alais and Burr, 2004; Figure 4). This allows one to assess how unimodal cues are weighted and combined to give rise to phenomena such as the ventriloquist illusion (Stratton, 1897). When the spatial discrepancy across the senses is low, observers' behaviour is well described by Maximum Likelihood Estimation (MLE; Alais and Burr, 2004), where unimodal information is combined in a statistically optimal fashion, so as to maximize the precision (reliability) of the multimodal percept (see Equations 11–14, Methods).
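For reference, the MLE combination rule takes the standard textbook form below (the full formulation used here is given in Equations 11–14 of the Methods): the bimodal estimate is a reliability-weighted average of the unimodal estimates, and its variance is lower than either unimodal variance.

```latex
\hat{S}_{AV} = w_V \hat{S}_V + w_A \hat{S}_A,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_A^2},
\quad
w_A = \frac{1/\sigma_A^2}{1/\sigma_V^2 + 1/\sigma_A^2},
\qquad
\sigma_{AV}^2 = \frac{\sigma_V^2\,\sigma_A^2}{\sigma_V^2 + \sigma_A^2}
```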
Audiovisual integration in space.
(A) Top represents the Multisensory Correlation Detector (MCD) population model for spatialized audio. Visual and auditory input units receive input from corresponding spatial locations and feed into spatially-tuned MCD units. The output of each MCD unit is eventually normalized by the total population output, so as to represent the probability distribution of stimulus location over space. The bottom part of Panel A represents the dynamic unimodal and bimodal population responses over time and space (azimuth) and their marginals. When time is marginalized out, a population of MCDs implements integration as predicted by the maximum likelihood estimation (MLE) model. When space is marginalized out, the output shows the temporal response function of the model. In this example, visual and auditory stimuli were asynchronously presented from discrepant spatial locations (note how the blue and orange distributions are spatiotemporally offset). (B) shows a schematic representation of the stimuli used to test the MLE model by Alais and Burr, 2004. Stimuli were presented from different spatial positions, with a parametric manipulation of audiovisual spatial disparity and blob size (i.e. the standard deviation σ of the blob). (C) shows how the bimodal psychometric functions predicted by the MCD (lines, see Equation 16) and the MLE (dots) models fully overlap. (D) shows how the bimodal bias varies as a function of disparity and visual reliability (see legend on the left). The dots correspond to the empirical data from participant LM, while the lines are the predictions of the MCD model (compare to Figure 2A of Alais and Burr, 2004). (E) shows how the just noticeable differences (JNDs, i.e. the random localization error) vary as a function of blob size. The blue squares represent the visual JNDs, the purple dots the bimodal JNDs, while the dashed orange line represents the auditory JND. The continuous line shows the JNDs predicted by the MCD population model (compare to Figure 2B of Alais and Burr, 2004). (F) represents the breakdown of integration with spatial disparity. The magnitude of the MCD population output (Equation 8, shown as the area under the curve of the bimodal response) decreases with increasing spatial disparity across the senses. This can then be transformed into a probability of a common cause (Equation 19). (G) represents the stimuli and results of the experiment used by Körding et al., 2007 to test the Bayesian Causal Inference (BCI) model. Auditory and visual stimuli originate from one of five spatial locations, spanning a range of 20°. The plots show the perceived locations of visual (blue) and auditory (orange) stimuli for each combination of audiovisual spatial locations. The dots represent human data, while the lines represent the responses of the MCD population model. (H) shows the stimuli and results of the experiment of Mohl et al., 2020. The plots on the right display the probability of a single (vs. double) fixation (top: monkeys, bottom: humans). The dots represent human data, while the lines represent the responses of the MCD population model. The remaining panels show the histograms of the fixated locations in bimodal trials: the jagged histograms are the empirical data, while the smooth ones are the model predictions (zero free parameters). The regions of overlap between empirical and predicted histograms are shown in black.
Given that both the MLE and the MCD operate by multiplying unimodal inputs (see Methods), the time-averaged MCD population response (Equation 16) is equivalent to MLE (Figure 4A). This can be illustrated by simulating the experiment of Alais and Burr, 2004 using both models. In this experiment, observers had to report whether a probe audiovisual stimulus appeared left or right of a standard. To assess the weighting behaviour resulting from multisensory integration, they manipulated the spatial reliability of the visual stimuli and the disparity between the senses (Figure 4B). Figure 4C shows that the integrated percept predicted by the two models is statistically indistinguishable. As such, a population of MCDs (Equation 16) can jointly account for the observed bias and precision of the bimodal percept (Figure 4D–E), with zero free parameters. A MATLAB implementation of this simulation is included as Source code 1.
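The equivalence can be checked with a toy example: multiplying two Gaussian spatial profiles (one per modality) yields a combined profile whose peak sits at the reliability-weighted location predicted by MLE. The values below are illustrative and unrelated to the Alais and Burr, 2004 stimuli; this is a conceptual sketch, not the released Source code 1.

```matlab
% Toy check of the MLE-by-multiplication idea: the product of two Gaussian
% spatial profiles peaks at the reliability-weighted average, as MLE predicts.
x    = linspace(-30, 30, 6001);             % azimuth (deg)
sV   = -5;  sigV = 2;                       % visual location and spread (illustrative)
sA   =  5;  sigA = 8;                       % auditory location and spread (illustrative)
V    = exp(-(x-sV).^2 / (2*sigV^2));        % visual population profile
A    = exp(-(x-sA).^2 / (2*sigA^2));        % auditory population profile
AV   = V .* A;                              % multiplicative combination (MCD-like)
[~, i]  = max(AV);
mcdPeak = x(i);                             % peak of the multiplied profile
mlePred = (sV/sigV^2 + sA/sigA^2) / (1/sigV^2 + 1/sigA^2);   % MLE weighted mean
fprintf('multiplication peak: %.2f deg, MLE prediction: %.2f deg\n', mcdPeak, mlePred);
```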
While fusing audiovisual cues is a sensible solution in the presence of minor spatial discrepancies across the senses, integration eventually breaks down with increasing disparity (Chen and Vroomen, 2013): when the spatial (or temporal) conflict is too large, visual and auditory signals may well be unrelated. To account for the breakdown of multisensory integration in the presence of intersensory conflicts, Körding and colleagues proposed the influential Bayesian Causal Inference (BCI) model (Körding et al., 2007), where uni- and bimodal location estimates are weighted based on the probability that the two modalities share a common cause (Equation 17). The BCI model was originally tested in an experiment in which sound and light were simultaneously presented from one of five random locations, and observers had to report the position of both modalities (Körding et al., 2007; Figure 4G). Results demonstrate that visual and auditory stimuli preferentially bias each other when the discrepancy is low, with the bias progressively declining as the discrepancy increases.
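For reference, in the BCI framework the posterior probability of a common cause and the final, model-averaged estimate take the standard form below (after Körding et al., 2007; this is the textbook formulation, not Equation 17 verbatim).

```latex
p(C{=}1 \mid x_V, x_A) =
\frac{p(x_V, x_A \mid C{=}1)\, p(C{=}1)}
     {p(x_V, x_A \mid C{=}1)\, p(C{=}1) + p(x_V, x_A \mid C{=}2)\,\bigl(1 - p(C{=}1)\bigr)},
\qquad
\hat{S}_A = p(C{=}1 \mid x_V, x_A)\,\hat{S}_{AV,\,C=1} + \bigl(1 - p(C{=}1 \mid x_V, x_A)\bigr)\,\hat{S}_{A,\,C=2}
```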
A population of MCDs can also compute the probability that auditory and visual stimuli share a common cause (Figure 1B and D; Figure 4F, Equation 19); therefore, we can test whether it can also implement BCI. To do so, we simulated the experiment of Körding and colleagues and fed the stimuli to a population of MCDs (Equations 18–20), which near-perfectly replicated the empirical data (rho = 0.99), even slightly outperforming the BCI model. A MATLAB implementation of this simulation is included as Source code 1.
To test the generalizability of these findings across species and behavioural paradigms, we simulated an experiment in which monkeys (Macaca mulatta) and humans directed their gaze toward audiovisual stimuli presented at varying spatial disparities (Figure 4H; Mohl et al., 2020). If observers infer a common cause, they tend to make a single fixation; otherwise, they make two, one for each modality. As expected, the probability of a single fixation decreased with increasing disparity (Figure 4H, right). This pattern was captured by a population of MCDs: MCDcorr values were used to fit the probability of single vs. double saccades as a function of disparity (Equation 21, Figure 4H, right). Critically, using this fit, the model was then able to predict the full distribution of gaze locations (Equation 20, Figure 4H, left) in both species with zero additional free parameters. A MATLAB implementation of this simulation is included as Source code 1.
Taken together, these simulations show that behaviour consistent with BCI and MLE naturally emerges from a population of MCDs. Unlike BCI and MLE, however, the MCD population model is both image- and sound-computable, and it explicitly represents the spatiotemporal dynamics of the process (Figure 4A, bottom; Figure 3B; Figure 5; Figure 6B–C). On one hand, this enables the model to be applied to complex, dynamic audiovisual stimuli, such as real-life videos, that were previously off limits to traditional BCI and MLE frameworks, whose probabilistic, non-stimulus-computable formulations prevent them from operating directly on such inputs. On the other, it permits direct, time-resolved comparisons between model responses and neurophysiological measures (Pesnot Lerousseau et al., 2022).
Multisensory Correlation Detector (MCD) and the Ventriloquist Illusion.
The upper panel represents a still frame of a performing ventriloquist. The central panel represents the MCD population response. The lower plot represents the horizontal profile of the MCD response for the same frame. Note how the population response clusters on the location of the dummy, where more pixels are temporally correlated with the soundtrack.
Audiovisual saliency maps.
(A) represents a still frame of Coutrot and Guyader, 2015 stimuli. The white dots represent gaze direction of the various observers. (B) represents the Multisensory Correlation Detector (MCD) population response for the frame in Panel A. The dots represent observed gaze direction (and correspond to the white dots of Panel A). (C) represents how the MCD response varies over time and azimuth (with elevation marginalized-out). The black solid lines represent the active speaker, while the waveform on the right displays the soundtrack. Note how the MCD response was higher for the active speaker. (D) shows the distribution of model response at gaze direction (see Panel B) across all frames and observers in the database. Model response was normalized for each frame (Z-scores). The y axis represents the number of frames. The vertical gray line represents the mean. See Video 4 for a dynamic representation of the content of this figure.
As a practical demonstration, we applied the model (Equation 6) to a real-life video of a performing ventriloquist (Figure 5). The population response dynamically tracked the active talker, clustering around the dummy's face whenever it produced speech (Video 3).
The ventriloquist illusion.
The top panel represents a video of a performing ventriloquist. The voice of the dummy was edited (pitch-shifted) and added in post-production. The second panel represents the dynamic MCDcorr(x,y,t) population response to a blurred version of the video (Equation 6). The third panel shows the distribution of population responses along the horizontal axis (obtained by averaging the upper panel over the vertical dimension). This represents the dynamic, real-life version of the bimodal population response shown in Figure 4A for the case of minimalistic audiovisual stimuli. The lower panel represents the same information as the panel above, displayed as a rolling timeline. For this video, the population response was temporally aligned to the stimuli to compensate for lags introduced by the temporal filters of the model. Note how the population response spatially follows the active speaker, hence capturing how the sensed location of the audiovisual event is drawn towards correlated visuals.
Spatial orienting and audiovisual saliency maps
Multisensory stimuli are typically salient, and a vast body of literature demonstrates that spatial attention is commonly attracted to audiovisual stimuli (Talsma et al., 2010). This aspect of multisensory perception is naturally captured by a population of MCDs, whose dynamic response explicitly represents the regions in space with the highest audiovisual correspondence at each point in time. Therefore, for a population of MCDs to provide a plausible account of audiovisual integration, such dynamic saliency maps should be able to predict human audiovisual gaze behaviour, in a purely bottom-up fashion and with no free parameters. Figure 6A shows the stimuli and eye-tracking data from the experiment of Coutrot and Guyader, 2015, in which observers passively watched a video of four persons talking. Panel B shows the same eye-tracking data plotted over the corresponding MCD population response: across 20 observers and 15 videos (for a total of over 16,000 frames), gaze was on average directed towards the locations (i.e. pixels) yielding the top 2% of MCD responses (Figure 6D, Equations 22 and 23). The tight correspondence between predicted and empirical salience is illustrated in Figure 6C and Video 4: note how the population response peaks at the location of the active speaker.
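A hedged sketch of this analysis is shown below (cf. Equations 22 and 23; the exact normalization used in the Methods may differ): for each frame, the saliency map is z-scored and the model response is read out at the fixated pixel, then averaged across frames. The saliency maps and gaze coordinates below are toy placeholders standing in for the MCDcorr maps and the eye-tracking data.

```matlab
% Hedged sketch of the gaze analysis: z-score each frame of the saliency
% map and read out the model response at the fixated pixel.
% salMap and the gaze coordinates below are toy placeholders.
H = 24;  W = 32;  T = 100;
salMap  = rand(H, W, T);                               % use the MCDcorr maps in practice
gazeRow = randi(H, T, 1);  gazeCol = randi(W, T, 1);   % one fixation sample per frame

zAtGaze = nan(T, 1);
for f = 1:T
    frame = salMap(:,:,f);
    z = (frame - mean(frame(:))) / std(frame(:));      % per-frame z-scores
    zAtGaze(f) = z(gazeRow(f), gazeCol(f));            % model response at gaze direction
end
meanZ = mean(zAtGaze, 'omitnan')                       % cf. grand average in Figure 6D
```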
Audiovisual saliency maps.
The top panel represents Movie 1 from Coutrot and Guyader, 2015. The central panel represents MCDcorr in grayscale, while the colorful blobs represent observers' gaze direction during passive viewing. The lower panel represents how MCDcorr and gaze direction (co)vary over time and azimuth (with elevation marginalized out). The black solid lines represent the active speaker. For the present simulations, movies were converted to grayscale and the upper and lower sections of the videos (which were mostly static) were cropped. Note how gaze is consistently directed towards the regions of the frames displaying the highest audiovisual correlation.
Discussion
This study demonstrates that elementary audiovisual analyses are sufficient to replicate behaviours consistent with multisensory perception in mammals. The proposed image- and sound-computable model, composed of a population of biologically plausible elementary processing units, provides a stimulus-driven framework for multisensory perception that transforms raw audiovisual input into behavioural predictions. Starting directly from pixels and audio samples, our model closely matched observed behaviour across a wide range of phenomena (including multisensory illusions, spatial orienting, and causal inference), with average correlations above 0.97. This was tested in a large-scale simulation spanning 69 audiovisual experiments, seven behavioural tasks, and data from 534 humans, 110 rats, and two monkeys.
We define a stimulus-computable model as one that receives input directly from the stimulus (such as raw images and sound waveforms) rather than from abstracted descriptors like lag, disparity, or reliability. Framed in Marr's terms, stimulus-computable models operate at the algorithmic level, specifying how sensory information is represented and processed. This contrasts with computational-level models, such as Bayesian ideal observers, which define the goals of perception (e.g. maximizing reliability; Alais and Burr, 2004; Ernst and Banks, 2002) without specifying how those goals are achieved. Rather than competing with such normative accounts, the MCD provides a mechanistic substrate that could plausibly implement them. By operating directly on realistic audiovisual signals, our population model captures the richness of natural sensory input and directly addresses the problem of how biological systems represent and process multisensory information (Burge, 2020). This allows the MCD to generate precise, stimulus-specific predictions across tasks, including subtle differences in behavioural outcomes that arise from the structure of individual stimuli (see Figure 2–figure supplement 1K).
The present approach naturally lends itself to being generalized and tested against a broad range of tasks, stimuli, and responses, as reflected by the breadth of the experiments simulated here. Among the perceptual effects emerging from elementary signal processing, one notable example is the scaling of subjective audiovisual synchrony with sound source distance (Alais and Carlile, 2005). As sound travels slower than light, humans compensate for audio delays by adjusting subjective synchrony based on the source's distance scaled by the speed of sound. Although this phenomenon appears to rely on explicit physics modelling, our simulations demonstrate that auditory cues embedded in the envelope (Figure 2B, left) are sufficient to scale subjective audiovisual synchrony. In a similar fashion, our simulations show that phenomena such as the McGurk illusion, the subjective timing of natural audiovisual stimuli, and saliency detection may emerge from elementary operations performed at the pixel level, bypassing the need for more sophisticated analyses such as image segmentation, lip or face tracking, 3D reconstruction, etc. (Chandrasekaran et al., 2009). Elementary, general-purpose operations on natural stimuli can drive complex behaviour, sometimes even in the absence of advanced perceptual and cognitive contributions. Indeed, it is intriguing that a population of MCDs, a computational architecture originally proposed for motion vision in insects, can predict speech illusions in humans.
The fact that identical low-level analyses can account for all of the 69 experiments simulated here directly addresses several open questions in multisensory research. For instance, psychometric functions for speech and non-speech stimuli often differ significantly (Vatakis et al., 2008). This has been interpreted as evidence that speech may be special and processed via dedicated mechanisms (Tuomainen et al., 2005). However, identical low-level analyses are sufficient to account for all observed responses, regardless of the stimulus type (Figure 2, Figure 2–figure supplements 1 and 2). This suggests that most of the differences in psychometric curves across classes of stimuli (e.g. speech vs. non-speech vs. clicks-&-flashes) are due to the low-level features of the stimuli themselves, not to how the brain processes them. Similarly, experience and expertise also modulate multisensory perception. For example, audiovisual simultaneity judgments differ significantly between musicians and non-musicians (Lee and Noppeney, 2011) (see Figure 2–figure supplement 1C). Likewise, the McGurk illusion (Freeman et al., 2013) and subjective audiovisual timing (Petrini et al., 2009) vary over the lifespan in humans, and following pharmacological interventions in rats (Al Youzbaki et al., 2023; Schormans and Allman, 2023) (see Figure 2–figure supplement 1E and J and Figure 2–figure supplement 2F-G). Our simulations show that adjustments at the decision-making level are sufficient to account for these effects, without requiring structural or parametric changes to low-level perceptual processing across observers or conditions.
Although the same model explains responses to multisensory stimuli in humans, rats, and monkeys, the temporal constants vary across species. For example, the model for rats is tuned to temporal frequencies over four times higher than those for humans. This not only explains the differential sensitivity of humans and rats to long and short audiovisual lags, but it also mirrors analogous interspecies differences in physiological rhythms, such as heart and breathing rates (Agoston, 2017). Previous research has shown that physiological arousal modulates perceptual rhythms within individuals (Legrand et al., 2018). It is an open question whether the same association between multisensory temporal tuning and physiological rhythms persists in other mammalian systems. Conversely, no major differences in the modelâs spatial tuning were found between humans and macaques, possibly reflecting the close phylogenetic link between the two species.
How might these computations be implemented neurally? In a recent study (Pesnot Lerousseau et al., 2022), we identified neural responses in the posterior superior temporal sulcus, superior temporal gyrus, and left superior parietal gyrus that tracked the output of an MCD model during audiovisual temporal tasks. Participants were presented with random sequences of clicks and flashes while performing either a causality judgment or a temporal order judgment task. By applying a time-resolved encoding model to MEG data, we demonstrated that MCD dynamics aligned closely with stimulus-evoked cortical activity. The present study considerably extends the scope of the MCD framework, allowing it to process more naturalistic stimuli and to account for a broader range of behaviours, including cue combination, attentional orienting, and gaze-based decisions. This expansion opens the door to new neurophysiological investigations into the implementation of multisensory integration. For instance, the dynamic, spatially distributed population responses generated by the MCD (see videos) can be directly compared with neural population activity recorded using techniques such as ECoG, Neuropixels, or high-density fMRI, similar to previous efforts that linked the Bayesian Causal Inference model to neural responses during audiovisual spatial integration (Rohe et al., 2019; Aller and Noppeney, 2019; Rohe and Noppeney, 2015). Such comparisons may help bridge algorithmic and implementational levels of analysis, offering concrete hypotheses about how audiovisual correspondence detection and integration are instantiated in the brain.
An informative outcome of our simulations is the model's ability to predict spontaneous gaze direction in response to naturalistic audiovisual stimuli. Saliency, the property by which some elements in a display stand out and attract observers' attention and gaze, is a popular concept in both cognitive and computer sciences (Itti et al., 1998). In computer vision, saliency models are usually complex and rely on advanced signal processing and semantic knowledge, typically with tens of millions of parameters (Chen et al., 2023; Coutrot, 2025). Despite successfully predicting gaze behaviour, current audiovisual saliency models are often computationally expensive, and the resulting maps are hard to interpret and inevitably affected by the datasets used for training (Adebayo et al., 2023). In contrast, our model detects saliency 'out of the box', without any free parameters, and operates purely at the individual pixel level. The elementary nature of the operations performed by a population of MCDs returns saliency maps that are easy to interpret: salient points are those with high audiovisual correlation. By grounding multisensory integration and saliency detection in biologically plausible computations, our study offers a new tool for machine perception and robotics to handle multimodal inputs in a more human-like way, while also improving system accountability.
This framework also provides a solution for self-supervised and unsupervised audiovisual learning in multimodal machine perception. A key challenge when handling raw audiovisual data is solving the causal inference problem: determining whether signals from different modalities are causally related or not (Körding et al., 2007). Models in machine perception often depend on large, labelled datasets for training. In this context, a biomimetic module that handles saliency maps, audiovisual correspondence detection, and multimodal fusion can drive self-supervised learning through simulated observers, thereby reducing the dependency on labelled data (Shahabaz and Sarkar, 2024; Arandjelovic and Zisserman, 2017; Ngiam et al., 2011). Furthermore, the simplicity of our population-based model provides a c