Introduction
The study of the cerebral mechanisms underlying speech and voice processing has gained importance since the early 2000s with the advent of functional magnetic resonance imaging (fMRI) [1]. Voice-sensitive areas, commonly referred to as "temporal voice areas" (TVA) or simply "voice areas", have been highlighted along the superior portion of the temporal cortex [2]. Since then, great efforts have been made to better characterize these TVA, with particular attention to their spatial division into functional subregions [3–5]. A fairly large body of literature points to the critical role of the TVA in voice perception and processing in healthy participants [4, 6–8] as well as in lesioned patients [9]. Subregions of the TVA have also been directly linked to social perception [10], vocal emotion processing [11, 12], voice identity [13, 14], and gender perception [15]. The developmental axis of voice processing has also been studied in infants, demonstrating the existence of the TVA in the human brain as early as 7, but not 4, months of age [16], while the ability to respond specifically to the voice of their parents has been observed in fetuses in utero [17]. With the ongoing development of brain imaging and analysis techniques [18], it is realistic to expect successful, albeit noninvasive, fMRI results on task-related voice perception in utero in the near future. Along the evolutionary axis, evidence for TVA or, more generally, conspecific vocalization-sensitive brain areas has emerged primarily in dogs [19] and monkeys (Macaca mulatta) [20, 21], raising the question of whether such specialized brain areas are species-specific [22] and to what extent human and nonhuman primates share neural mechanisms that enable them to preferentially process conspecific vocalizations [23].
However, less attention has been paid to paradigms in which animal vocalizations are presented to humans, and to the best of our knowledge no study to date has reported selective human TVA activations for processing such auditory material, namely the vocalizations of other animals. Human processing of animal vocalizations has been studied with both monkey and cat material, but no specific cross-species activations have been observed within the TVA with respect to either species [24]. Other studies have focused more specifically on phylogenetic distance and have included nonhuman ape (chimpanzee, Pan troglodytes) and "Old World" monkey (rhesus macaque, Macaca mulatta) vocalizations as stimuli. Such studies failed to identify species-specific brain activations, even though participants correctly discriminated chimpanzee affective vocalizations [25], and reported ambivalent results for macaque affective vocalizations, which human participants discriminated below chance in one study [25] and above chance in another [26]. A recent exception is a study in which functionally homologous anterior TVA activity was observed in both humans and macaques: this region was specific to macaque calls in the anterior TVA of macaques and specific to human voices in the anterior TVA of humans, but no macaque-specific activity was observed in the human TVA [27]. This sparse literature motivated the present study, which aims to investigate cross-species TVA activations in humans asked to categorize vocalizations from phylogenetically and acoustically close and distant species while undergoing fMRI scanning. Acoustic differences between species, and more specifically acoustic distance, particularly through fundamental frequency variations [28, 29], were of great interest.
Acoustic distance, calculated using the Mahalanobis distance with 16 acoustic parameters extracted from the stimuli, was in fact a determining parameter in assessing the recognition of affective cues in nonhuman primate calls by human participants [30]. In that study, affiliative chimpanzee, but not bonobo, calls were acoustically the closest to positive human voice stimuli, suggesting a distinct evolution of bonobo calls [30]. Bonobo vocalizations are of particular interest because this species is thought to have undergone evolutionary changes in its communication, in part due to a neoteny process involving acoustic modifications, even though bonobos are as phylogenetically close to humans as chimpanzees, with an estimated separation from the Homo lineage only 6-8 million years ago [31]. Previous research has shown that bonobos have a shorter larynx (a valid predictor of a species' mean fundamental frequency [32]) than chimpanzees, resulting in a higher fundamental frequency in their calls [28]. Such a difference has been demonstrated in juvenile bonobo calls compared to chimpanzee and human baby calls [33], arguing for a greater acoustic distance between bonobo calls and human or chimpanzee vocalizations. For these reasons, we included vocalizations from both Pan species (chimpanzees, Pan troglodytes; bonobos, Pan paniscus), as well as a phylogenetically more distant species (Cercopithecidae: rhesus monkeys), with an estimated separation from the Homo lineage dating back to 25 million years ago. Indeed, any claim of human "uniqueness" for TVA recruitment remains on hold and should be tested in light of these closely related species.
Using the same stimuli, we previously investigated the specific frontal mechanisms involved in the categorization of nonhuman primate vocalizations independently of a selection of low-level acoustic parameters [34]. However, the possibility that acoustic differences affect, at the auditory level, the ability of human participants to recognize nonhuman primate calls should be thoroughly examined, as we did in the present study. As suggested by the research mentioned above, monkey vocalizations are overall less likely to be identified than ape vocalizations due to both phylogenetic and acoustic differences. Therefore, our mechanistic hypothesis for the difficulty humans have in recognizing bonobo calls is that the frequencies of the human tonotopic map in the auditory cortex, adapted and adjusted to the frequencies of the human voice during evolution, would not be tailored to process the frequencies of bonobo calls. The same would hold for macaque calls, while the frequencies of chimpanzee calls, being closer to the range of human voice fundamental frequency [28, 30], would be better represented in the human auditory cortex and therefore more easily processed and better identified by humans.
According to the literature mentioned so far and to the mechanistic hypothesis underlying the processing of chimpanzee as opposed to bonobo and/or macaque calls by human participants, we therefore predicted: (i) more acoustic proximity between human and chimpanzee vocalizations, whereas more distance would separate those of bonobos and macaques from the human voice; (ii) a recruitment of temporal brain areas within the TVA for the processing of vocalizations from the Pan taxon (chimpanzee, bonobo) but not Cercopithecidae (rhesus monkey) vocalizations, taking into account acoustic features of interest through a discriminant analysis of the parameters that best underlie our stimuli.
Results
Our hypotheses involve a systematic and thorough control of phylogeny through the inclusion of specific primate species as well as the selection of specific acoustic features. We programmed a task in which the vocalizations of each species were presented randomly and in which the participants (N=23) had to specify to which species each stimulus corresponded. We therefore included equal numbers of trials (N=72) with human, chimpanzee, bonobo and macaque vocalizations (N=18 each), as well as trial-level acoustic features of the vocalizations, using three distinct statistical models with specific covariates. These models are sorted from the least to the most sophisticated to uncover the role(s) of acoustic features on TVA activity potentially specific to each or some species (see the Methods; Model 1: mean of vocalization fundamental frequency and energy; Model 2: multi-dimensional Mahalanobis acoustic distance between the human voice and the calls of each nonhuman primate species [35]; Model 3: between-species most discriminant acoustic features of our stimuli, extracted using a general discriminant analysis [30]). The acoustical analyses involved in Model 2 allowed us to validate our first hypothesis, according to which chimpanzee calls are acoustically the closest to the human voice, followed by the calls of bonobos and macaques (Fig.1B): the main effect of Species on acoustic distance was significant, F(3,88)=15.84, p<.001, as were all comparisons (see Fig.1B and Table S2). In this study, we did not intend to focus on behavioral data since these have already been published with these stimuli in dedicated studies [30, 34]. Instead, we were interested in the neural processing associated with the exposure of human participants to primate vocalizations (Fig.1A).
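To make the acoustic distance used in Model 2 concrete, the following is a minimal sketch of a Mahalanobis distance between one stimulus and a reference set of stimuli. It assumes that the reference distribution is the set of human voice stimuli and that each stimulus is a vector of acoustic parameters; the function names and the two-feature toy data are hypothetical, and the sketch is not the pipeline used in the study (see Methods and [35] for the actual index).

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Sample covariance matrix (n - 1 denominator) of the feature vectors."""
    n = len(rows)
    mu = mean_vector(rows)
    p = len(mu)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mahalanobis(point, reference_rows):
    """Distance of `point` from the distribution of `reference_rows`."""
    mu = mean_vector(reference_rows)
    cov = covariance(reference_rows)
    d = [x - m for x, m in zip(point, mu)]
    # d^T * inv(cov) * d, computed without forming the inverse explicitly
    return sqrt(sum(di * zi for di, zi in zip(d, solve(cov, d))))

# Hypothetical toy data: four reference stimuli described by two features.
human_reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
distance = mahalanobis([3.0, 1.0], human_reference)
```

Unlike a Euclidean distance, this measure scales each direction by the variance and covariance of the reference set, which is presumably why it suits acoustic parameters expressed in heterogeneous units.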
Fig. 1:
Timecourse of the species categorization task with stimuli example and acoustic distance data.
(A) Detail of the timecourse of four trials of the species categorization task in non-representative order, including waveform and spectrogram graphs for one example stimulus of each species. (B) Scatter plot and histogram of the acoustic Mahalanobis distance data of each stimulus for each species including mean (numbers represent exact mean value) and violin plots of the standard error of the mean in addition to distribution fit. ITI: inter trial interval; Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque.
Neuroimaging data within the sample-specific temporal voice areas
We aimed to uncover functional changes relative to species categorization and processing within the sample-specific (N=23) TVA, as delineated in our hypotheses. As described above, we used three distinct statistical models including trial-level parametric modulators (Models 1-3). We were particularly interested in human brain activity while processing the vocalizations of our closest relatives, both acoustically and phylogenetically, namely the chimpanzee but also the bonobo. The present study did not aim to uncover wholebrain results underlying the processing of each species' vocalizations but rather focused on human voice-sensitive areas, namely the TVA. The corrected statistics (voxelwise p<.05 False Discovery Rate) presented in this section were nevertheless computed with a wholebrain voxelwise approach, rather than region-of-interest (ROI) analyses, for higher data reproducibility and generalizability. ROI analyses would most probably have artificially amplified the number of voxels in the TVA in this study. Clusters outside the bounds of the sample-specific TVA are therefore visible but in a desaturated hue to better highlight TVA activations. These clusters are even more visible in supplementary figures with the same contrasts as those presented in this section, with the addition of the [human,chimpanzee > bonobo,macaque] contrast, but with an outline of the TVA from an independent, larger sample of participants excluding the 23 participants of this study (N=98; Fig.S2-4). No difference in terms of a potential attentional bias towards any species of the stimuli used in this study was found (independent sample of N=28; see Methods and Fig.S1 for detailed information on this aspect).
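As an illustration of what a voxelwise false discovery rate threshold does, the sketch below implements the Benjamini-Hochberg step-up procedure on a list of p-values. That the correction applied here follows this specific procedure is an assumption (the text only states voxelwise p<.05 FDR), and the function name `fdr_bh` and the example p-values are hypothetical.

```python
def fdr_bh(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding test
    (e.g., a voxel) is declared significant at FDR level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is kept, even if its own
    # p-value exceeded its own rank threshold (the "step-up" property).
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[idx] = True
    return keep

# Hypothetical p-values for five voxels.
significant = fdr_bh([0.001, 0.008, 0.039, 0.041, 0.2], q=0.05)
```

Because the threshold adapts to the full distribution of p-values, computing it over the whole brain rather than within an ROI changes which voxels survive, which is the trade-off discussed in the paragraph above.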
Model 1: Effects of species processing with vocalization mean fundamental frequency and mean energy as covariates of no-interest at the trial level
In this first "simple" model, we wanted to remove from brain activations the part of variance correlating with the basic low-level acoustics reported in the literature [28, 30, 31, 33], namely mean voice fundamental frequency and energy. A total of four contrasts were overlaid in the figure to test our second hypothesis, according to which phylogenetic, and especially acoustic (see Model 2), proximity would trigger enhanced activity in the TVA, just as the human voice does. Brain activity specific to chimpanzee vocalizations ([chimpanzee > human, bonobo, macaque]) led to enhanced activity in a cluster of the left anterior STG (aSTG1, k=91 voxels, Fig.2AD) located within the TVA (Fig.2AC). A homologous cluster of the right anterior STG was found as well in this contrast (aSTG2, Fig.2B). A similar result was observed when directly contrasting chimpanzee to human vocalizations ([chimpanzee > human]; Fig.2EFH) as well as chimpanzee to nonhuman primate calls ([chimpanzee > bonobo, macaque]) in three other clusters of the aSTG, located again within the TVA (Fig.2ABC, Table 1). Enhanced activity for human relative to chimpanzee vocalizations ([human > chimpanzee]) was observed in large parts of the anterior, mid and posterior superior and middle temporal cortex (Fig.2EFG, Table 1). No voxels reached significance at the wholebrain level for the [bonobo > human, chimpanzee, macaque], [bonobo > chimpanzee, macaque], [bonobo > human], [bonobo > chimpanzee], [bonobo > macaque], [macaque > human, chimpanzee, bonobo], [macaque > chimpanzee, bonobo], [macaque > human], [macaque > chimpanzee], [macaque > bonobo] contrasts. This analysis therefore revealed that the human anterior TVA are sensitive to cross-species primate vocalizations, specifically to chimpanzee but not bonobo or macaque calls, using this regression model.
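The bracketed contrasts used throughout this section can be read as weight vectors over the four species conditions. The sketch below shows one conventional way to build such zero-sum weights; the scaling (each side averaged so that the weights sum to zero) and the helper name `contrast_weights` are assumptions for illustration, since the weights actually entered in the models are not given in this section.

```python
from fractions import Fraction

# Condition order is an arbitrary choice made for this illustration.
SPECIES = ["human", "chimpanzee", "bonobo", "macaque"]

def contrast_weights(positive, negative, conditions=SPECIES):
    """Zero-sum weights for a [positive > negative] contrast.

    Each condition on the positive side gets +1/len(positive), each
    condition on the negative side gets -1/len(negative), others get 0.
    """
    if set(positive) & set(negative):
        raise ValueError("a condition cannot be on both sides")
    unknown = (set(positive) | set(negative)) - set(conditions)
    if unknown:
        raise ValueError(f"unknown conditions: {sorted(unknown)}")
    weights = []
    for name in conditions:
        if name in positive:
            weights.append(Fraction(1, len(positive)))
        elif name in negative:
            weights.append(Fraction(-1, len(negative)))
        else:
            weights.append(Fraction(0))
    return weights

# [chimpanzee > human, bonobo, macaque]
chimp_specific = contrast_weights(["chimpanzee"], ["human", "bonobo", "macaque"])
# [chimpanzee > bonobo, macaque]: the human condition is left out (weight 0)
chimp_vs_nhp = contrast_weights(["chimpanzee"], ["bonobo", "macaque"])
```

Exact fractions are used only so that the zero-sum property can be checked without floating-point rounding.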
Fig. 2:
Wholebrain results when contrasting the processing of chimpanzee to other species' vocalizations with mean fundamental frequency and energy as trial-level covariates of no-interest (model 1).
(ABC) Enhanced brain activity on a sagittal view with activity specific to chimpanzee vocalizations (dark blue to green) as well as between chimpanzee calls vs bonobo and macaque calls (chimpanzee > bonobo and macaque: brown to red with light yellow outline). (D) Percentage of signal change for each individual and relevant species according to the contrast in the left anterior superior temporal gyrus (aSTG1). Box plots represent mean value (black line) and the standard error of the mean with distribution fit. (EFG) Direct comparison between human and chimpanzee vocalizations (human > chimpanzee: dark red to yellow; chimpanzee > human: dark green to yellow) on a sagittal render. (H) Percentage of signal change in the anterior superior temporal gyrus (aSTG2) when contrasting chimpanzee to human vocalizations for each individual and relevant species according to the contrast with box plots representing mean value (black line) and the standard error of the mean with distribution fit. Brain activations are independent of low-level acoustic parameters for all species (mean fundamental frequency "F0" and mean energy of vocalizations). Data corrected for multiple comparisons using wholebrain voxelwise false discovery rate (FDR) at a threshold of p<.05. Percentage of signal change extracted at cluster peak including 9 surrounding voxels, selecting among these the ones explaining at least 85% of the variance using singular value decomposition. Circles represent individual values, boxplot represents the mean and its standard error, and half-violin plots show data distribution. Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque. TVA: sample-specific (N=23) temporal voice areas. "a" prefix: anterior; "m" prefix: mid; "p" prefix: posterior; STG: superior temporal gyrus; STS: superior temporal sulcus; L: left hemisphere; R: right hemisphere.
Table 1:
Activations, cluster size and coordinates for each contrast of interest of model 1 (mean of vocalization fundamental frequency and energy as trial-level covariates of no-interest) in the sample-specific temporal voice areas, wholebrain voxelwise p<.05 FDR corrected, k>10.
Model 2: Effects of species processing with vocalization acoustic distance from human voice, per species, as covariate of no-interest at the trial level
In this second model, we wanted to remove from brain activations the part of variance correlating with the acoustic distance between each species and the human voice (see Methods for the detailed index of acoustic distance calculated with the human voice as reference). TVA brain activity specific to primate calls was triggered again for chimpanzee vocalizations ([chimpanzee > human, bonobo, macaque]) in a cluster of the left anterior STG within the TVA (Fig.3ACD). A similar result was observed when directly contrasting chimpanzee to human vocalizations ([chimpanzee > human]; Fig.3EH, Table 2). Enhanced activity for human relative to chimpanzee vocalizations ([human > chimpanzee]) was again observed in large parts of the anterior, mid and posterior superior and middle temporal cortex (Fig.3EFG, Table 2). Chimpanzee compared to other nonhuman primate calls ([chimpanzee > bonobo, macaque]) led to enhanced activity in the bilateral aSTG (aSTG7 and aSTG9, Fig.3ABC). Using this second model of the MRI data, no voxels reached significance at the wholebrain level for the [bonobo > human, chimpanzee, macaque], [bonobo > chimpanzee, macaque], [bonobo > chimpanzee], [bonobo > macaque], [macaque > human, chimpanzee, bonobo], [macaque > chimpanzee, bonobo], [macaque > bonobo] contrasts. The [bonobo > human] and [macaque > human] contrasts yielded enhanced activity outside the TVA, while the [macaque > chimpanzee] comparison activated a very small cluster of the left planum temporale within the TVA (see Fig.S5 for these contrasts). In this second model, again, only the calls of chimpanzees triggered specific activity in the anterior TVA.
Fig. 3:
Wholebrain results when contrasting the processing of chimpanzee to other species' vocalizations with Mahalanobis acoustic distance as trial-level covariate of no-interest (model 2).
(ABC) Enhanced brain activity on a sagittal view with activity specific to chimpanzee vocalizations (chimp > hum,bon,mac; dark blue to green) as well as between chimpanzee calls vs bonobo and macaque calls (chimpanzee > bonobo and macaque: brown to red with light yellow outline). (D) Percentage of signal change for each individual and relevant species according to the contrast in the left anterior superior temporal gyrus (aSTG6). Box plots represent mean value (black line) and the standard error of the mean with distribution fit. (EFG) Direct comparison between human and chimpanzee vocalizations (human > chimpanzee: dark red to yellow; chimpanzee > human: dark green to yellow) on a sagittal render. (H) Percentage of signal change in the anterior superior temporal gyrus (aSTG8) when contrasting chimpanzee to human vocalizations for each individual and relevant species according to the contrast with box plots representing mean value (black line) and the standard error of the mean with distribution fit. Brain activations are independent of the acoustic distance of each stimulus for all species. Data corrected for multiple comparisons using wholebrain voxelwise false discovery rate (FDR) at a threshold of p<.05. Percentage of signal change extracted at cluster peak including 9 surrounding voxels, selecting among these the ones explaining at least 85% of the variance using singular value decomposition. Circles represent individual values, boxplot represents the mean and its standard error, and half-violin plots show data distribution. Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque. TVA: sample-specific (N=23) temporal voice areas. "a" prefix: anterior; "m" prefix: mid; "p" prefix: posterior; STG: superior temporal gyrus; STS: superior temporal sulcus; L: left hemisphere; R: right hemisphere.
Table 2:
Activations, cluster size and coordinates for each contrast of interest of model 2 (inter-species vocalization acoustic distance as trial-level covariate of no-interest) in the sample-specific temporal voice areas, wholebrain voxelwise p<.05 FDR corrected, k>10.
Model 3: Effects of species processing with vocalization most discriminant acoustic parameters (N=6) as covariates of no-interest at the trial level
In this last model, we wanted to elaborate on the discriminant factors that characterize the low-level acoustic parameters of our set of stimuli. This approach is complementary to the inclusion of acoustic distance in model 2 and extends and refines these results. To do so, we used as trial-level covariates of no-interest the acoustic parameters explaining the most variance (r > 0.7 or r < -0.7) in factors 1-3 of a discriminant analysis of these stimuli [30] (see the Methods section for details on this analysis). These parameters include, in this specific order: vocalization loudness, intensity, change in spectrum, bandwidth contour of the second formant (F2), power of the fundamental frequency (F0) and finally the difference in intensity contour. Having these acoustic features as covariates, we ran the same contrasts as in models 1 & 2. As in the previous models of the imaging data, TVA activity was triggered by chimpanzee vocalizations ([chimpanzee > human, bonobo, macaque]) in yet other, larger bilateral clusters of the aSTG within the TVA (aSTG10 and aSTG11, Fig.4ABCD), closely resembling the activations of model 2. A similar left-lateralized cluster was observed when directly contrasting chimpanzee to human vocalizations ([chimpanzee > human]; aSTG12, Fig.4EH, Table 3). Enhanced activity for human relative to chimpanzee vocalizations ([human > chimpanzee]) was distributed as in models 1 & 2 (anterior, mid and posterior superior and middle temporal cortex; Fig.4EFG, Table 3). Chimpanzee compared to other nonhuman primate calls in this model ([chimpanzee > bonobo, macaque]) led to the largest clusters observed in the aSTG across all models, still within the sample-specific TVA. Indeed, we observed a large left-lateralized cluster of the aSTG extending to the mid STG (aSTG13, Fig.4AC) as well as a right-lateralized cluster (aSTG14, Fig.4BC).
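The selection rule described above (parameters whose correlation with one of the first three discriminant factors exceeds 0.7 in absolute value) can be sketched as a simple filter. The loading values, parameter names and function name below are hypothetical and serve only to illustrate the thresholding; they are not the values from the discriminant analysis in [30].

```python
def select_discriminant_parameters(loadings, threshold=0.7):
    """Select parameters with |r| > threshold on any factor.

    `loadings` maps a parameter name to its correlations with each
    factor (factor 1 first). Parameters are returned in the order in
    which they first pass the threshold, going factor by factor.
    """
    n_factors = max(len(r) for r in loadings.values())
    selected = []
    for factor in range(n_factors):
        for name, r in loadings.items():
            passes = factor < len(r) and abs(r[factor]) > threshold
            if passes and name not in selected:
                selected.append(name)
    return selected

# Hypothetical factor-structure correlations for factors 1-3.
example_loadings = {
    "loudness": [0.90, 0.10, 0.00],
    "pitch_jitter": [0.20, 0.30, 0.10],
    "f0_power": [0.10, 0.10, -0.80],
    "intensity": [0.75, -0.72, 0.00],
}
covariates = select_discriminant_parameters(example_loadings)
```

With these made-up values, three of the four parameters pass the threshold and would be entered as trial-level covariates of no-interest.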
Fig. 4:
Wholebrain results when contrasting the processing of chimpanzee to other species' vocalizations with vocalization loudness, intensity, change in spectrum, F2 bandwidth contour, F0 power and intensity contour difference as trial-level covariates of no-interest (model 3).
(ABC) Enhanced brain activity on a sagittal view with activity specific to chimpanzee vocalizations (dark blue to green) as well as between chimpanzee calls vs bonobo and macaque calls (chimpanzee > bonobo and macaque: brown to red with light yellow outline). (D) Percentage of signal change for each individual and relevant species according to the contrast in the left anterior superior temporal gyrus (aSTG10). Box plots represent mean value (black line) and the standard error of the mean with distribution fit. (EFG) Direct comparison between human and chimpanzee vocalizations (human > chimpanzee: dark red to yellow; chimpanzee > human: dark green to yellow) on a sagittal render. (H) Percentage of signal change in the anterior superior temporal gyrus (aSTG12) when contrasting chimpanzee to human vocalizations and when contrasting chimpanzee to bonobo and macaque calls (aSTG13) for each individual and relevant species according to the contrast with box plots representing mean value (black line) and the standard error of the mean with distribution fit. Brain activations are independent of the most discriminant low-level acoustic parameters of the stimuli set [30]. Data corrected for multiple comparisons using wholebrain voxelwise false discovery rate (FDR) at a threshold of p<.05. Percentage of signal change extracted at cluster peak including 9 surrounding voxels, selecting among these the ones explaining at least 85% of the variance using singular value decomposition. Circles represent individual values, boxplot represents the mean and its standard error, and half-violin plots show data distribution. Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque. TVA: sample-specific (N=23) temporal voice areas. "a" prefix: anterior; "m" prefix: mid; "p" prefix: posterior; STG: superior temporal gyrus; STS: superior temporal sulcus; L: left hemisphere; R: right hemisphere.
Table 3:
Activations, cluster size and coordinates for each contrast of interest of model 3 (vocalization loudness, intensity, change in spectrum, F2 bandwidth contour, F0 power and intensity contour difference as trial-level covariate of no-interest) in the sample-specific temporal voice areas, wholebrain voxelwise p<.05 FDR corrected, k>10.
In this last model of the fMRI data, no voxels reached significance either at the wholebrain level or within the TVA for the [bonobo > human, chimpanzee, macaque], [bonobo > chimpanzee, macaque], [bonobo > chimpanzee] or [bonobo > macaque] contrasts. We did, however, find activity specific to the processing of macaque calls, only in the left TVA: in a small cluster of the left mid STS ([macaque > human, chimpanzee, bonobo]) and in a small portion of the planum temporale adjacent to the primary auditory cortex for the [macaque > chimpanzee, bonobo] contrast (see Fig.S6). We also observed significant activations, but outside the TVA, for the [bonobo > human] and [macaque > chimpanzee] contrasts. Within-TVA activations were observed in the left mid STS when contrasting macaque to human vocalizations, as well as in the planum temporale, left mid STS and right mid STG when contrasting macaque to bonobo calls. See Fig.S7 for these results. Using this third model, we again observed chimpanzee-specific activity in the anterior TVA as well as mid STS activity specific to macaque calls, within the TVA.
A synthesis of the sensitivity of the human TVA to nonhuman primate calls
In the previous sections, we described three different models used to analyze our fMRI data. These models, from the simplest to the most sophisticated, highlighted enhanced activity within the sample-specific bilateral anterior TVA of our participants specifically when processing chimpanzee vocalizations, but also, in model 3, when processing macaque calls, in the bilateral mid STG, STS and planum temporale. When processing chimpanzee calls, TVA activity was especially enhanced in the aSTG but also in the anterior STS. We therefore regrouped these fourteen chimpanzee-specific aSTG clusters in Fig.5 (most of them overlap greatly, but we still named them individually according to each contrast and analysis for exhaustivity), overlaid with the sample-specific TVA (Fig.5CD) and with the more general TVA from an independent sample of ninety-eight participants (Fig.5AB). On closer inspection, the area of maximal overlap between these regions (the orange surface) is located within the more general as well as within the sample-specific TVA. Interestingly, left-lateralized, more medial clusters of the aSTG were outside the outline of the sample-specific but not of the general TVA (Fig.5AC), while this was not the case for right-lateralized aSTG activations. When comparing the processing of chimpanzee calls to that of bonobo and macaque calls, this contrast, especially in model 3, yielded distinct clusters of the aSTG. This result is visible when looking at the three "rich blue" outlines in every panel of Fig.5. The results synthesized here highlight the important role of acoustic parameters and emphasize the role of the most discriminant acoustic features on TVA activity relating to nonhuman primate vocalizations, especially those of chimpanzees and macaques.
Fig. 5:
Synthesis of mid and anterior TVA clusters of activity recruited specifically by the processing of chimpanzee and macaque vocalizations (Models 1,2,3).
aSTG and aSTS clusters recruited for the processing of chimpanzee calls as opposed to: human voices (green); bonobo and macaque calls (blue); and human voices, bonobo and macaque calls (turquoise), in the general TVA (AB, N=98) as well as in the sample-specific TVA (CD, N=23). Macaque results are only significant for Model 3 (purple: Macaque vs all other species; lilac: Macaque vs other nonhuman primates). Clusters are represented across all statistical models (Model 1: dotted line; Model 2: dashed line; Model 3: solid line). Model 1: mean of fundamental frequency and energy (covariates of no-interest, N=2); Model 2: acoustic distance (covariate of no-interest, N=1); Model 3: acoustic parameters that characterize low-level acoustics of our stimuli following a discriminant analysis (covariates of no-interest, N=6). Data are all corrected for multiple comparisons using wholebrain voxelwise false discovery rate (FDR) at a threshold of p<.05. Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque. TVA: temporal voice areas. "a" prefix: anterior; STG: superior temporal gyrus.
These results are given even more weight by more fine-tuned comparisons of voice versus non-voice material in the voice-localizer task, namely by splitting the non-vocal blocks as a function of the sounds they contain. In this more specific outline of TVA subregions, we observed that most chimpanzee- and macaque-specific STG and STS regions were still within the bounds of the TVA, especially in the most relevant case, when the outline represented a comparison of human voice signals to animal or nature sounds (Fig.6A-D), while the outlines for human voice versus music or noise excluded most parts of the clusters of activation for these nonhuman primate species' calls in the TVA (Fig.6E-H).
Fig. 6:
Clusters recruited specifically by the processing of chimpanzee and macaque vocalizations (Model 3) in subregions of the TVA, as a function of non-vocal material type.
Enhanced brain activity on sagittal views with activity specific to macaque vocalizations (red to yellow), specific to chimpanzee vocalizations (dark blue to green) as well as between chimpanzee calls vs bonobo and macaque calls (chimpanzee > bonobo and macaque: brown to red with light yellow outline). Brain activations are independent of the most discriminant low-level acoustic parameters of the stimuli set [30]. Data corrected for multiple comparisons using wholebrain voxelwise false discovery rate (FDR) at a threshold of p<.05. Black outline represents: voice compared to non-vocal stimuli of animal sounds (A,B), nature sounds (C,D), music (E,F), artificial noise (G,H). Hum: human; Chimp: chimpanzee; Bon: bonobo; Mac: macaque. TVA: sample-specific (N=23; white outline) temporal voice areas. STG: superior temporal gyrus; STS: superior temporal sulcus; "a" prefix: anterior; "m" prefix: mid; L: left hemisphere; R: right hemisphere.
Discussion
The present study provides evidence of the sensitivity of the human TVA to cross-species vocalizations, especially to chimpanzee calls but also to macaque vocalizations, as illustrated by specific enhanced activity in the bilateral mid and anterior STG and STS, within the sample-specific TVA. These results were obtained through statistical modeling of the MRI data that included as covariates either simple acoustics, the Mahalanobis acoustic distance between species, or the most discriminant acoustic features specific to our stimuli. The two latter analyses converged and yielded greatly overlapping results, especially in the anterior TVA. Therefore, our results suggest that vocalizations from another ape species recruit subregions of the human temporal cortex that process species-specific voices in humans, namely the bilateral, sample-specific TVA. This evidence speaks in favor of cross-species primate vocalization processing in the anterior and mid TVA of humans, for chimpanzee and macaque calls, respectively. While our acoustic data confirmed the hypothesized hierarchy of acoustic distance as a function of phylogenetic distance between our species, we still observed mid STG and STS activity for macaque versus bonobo calls and a small cluster in the left mid STS specific to macaque calls in model 3. This result was unexpected, since macaques are the most distant species from humans both phylogenetically and acoustically in our study. Therefore, while we initially hypothesized that primate calls would recruit the human TVA exclusively as a function of a combination of phylogenetic and acoustic proximity, our data also point toward the greater importance of the most discriminant acoustic features rather than acoustic distance alone. We discuss these aspects below in more detail and interpret their general meaning and subsequent scientific implications, in addition to highlighting the limitations of our study.
The TVA are often specifically associated with the processing of conspecific vocalizations (e.g., in humans [2, 22, 27], macaques [21, 27, 36], and dogs [19]). The present study challenges this common view of the TVA as “species-specific” and illustrates that human voices as well as chimpanzee and macaque calls can enhance activity in the TVA. We think that the distinct locations of the TVA subregions recruited for processing the vocalizations of these primate species matter. In fact, there might be an association between activity in the anterior TVA (specific to processing chimpanzee calls) and the higher recognition performance for chimpanzee calls compared to those of bonobos or macaques in human participants [34]. Anterior TVA activity specific to the processing of chimpanzee calls occurred when these were compared to both human and nonhuman primate species, solely to other nonhuman primate vocalizations, or directly to the human voice. However, homologous specific results were not observed for bonobo calls, and they were scarcer, especially across models, for macaque vocalizations: we found macaque-specific activity in a small area of the planum temporale and in a small cluster of the left mid STS, congruent for instance with locations observed in the general processing of animal sounds, especially in the planum temporale [37, 38]. On the other hand, within-TVA anterior STG activity was also observed when chimpanzee vocalizations were directly compared to the human voice. We think this result highlights the cross-species specificity of this anterior subregion of the TVA for processing species that are phylogenetically close to humans and especially those with human-like acoustics, namely the calls of chimpanzees in the case of our study.
Because of their vocal proximity, the perception of human voices and chimpanzee calls in socio-affective contexts could involve a common “social” core of the brain, which increases activity in brain regions such as the anterior TVA, as reported previously in studies pertaining to social contextual information processing in the anterior STG [39, 40]. Differences in processing complexity between the two types of vocalizations could also explain this observation, although we found no saliency or attention-related effects between our species stimuli. Indeed, previous studies have shown the role of the anterior STG and the anterior STS in the conceptual representation of social context by the human voice [39–42]. Therefore, our data might suggest that the anterior part of the superior temporal cortex could be recruited to process the social context of human and chimpanzee vocal stimuli. However, this processing would be more automated for the perception of the human voice than for chimpanzee calls because of our high exposure to, and expertise with, human vocal signals. These hypotheses should be tested in studies dedicated to this topic. Our results are also complementary to and coherent with a “voice patch” system in the brain of primates, as put forward by Belin and colleagues [43], according to which distinct “patches” or subregions of the temporal lobe, especially its anterior portion, would be interconnected and would allow for the processing of voice information. Such a system would be present in many primate species such as humans, macaques and marmosets, with the most recent evidence suggesting a population of neurons in the anterior STG of the macaque brain selective to the human voice [44], as also anticipated in another study on macaques that likewise pointed to the anterior STG [45].
These converging results mirror our present data (with “chimpanzee-selective” responses in the anterior STG/TVA of our human participants) and strongly emphasize the need for pursuing a comparative approach in order to clarify the cross-species neural bases underlying the processing of human and nonhuman primate vocal signals. As mentioned previously, these interpretations are, considering our results, free of any potential attentional bias towards one species over the others, since no such effect was observed in a behavioral control study involving an independent sample of twenty-eight participants in a species-specific exogenous cueing attentional paradigm (see Methods and Fig. S1).
Importantly, our data also emphasize the influence of acoustic features and especially of acoustic proximity between human and chimpanzee vocalizations: we show that activity in the anterior STG, and more generally in the anterior TVA, partly depends on phylogenetic proximity and, more importantly, on key acoustic features and acoustic proximity. Consistent with previous studies [24, 25], we did not expect TVA activity for the processing of macaque calls because they are both phylogenetically and acoustically more distant from humans than the other species in this study, although, as mentioned above, we found a very small cluster in the left primary auditory cortex and mid STS for model 3. It is interesting to note that in model 2, with acoustic distance as a trial-level covariate, we observed TVA activity only for chimpanzee but not macaque calls, giving further weight to the importance of acoustic distance in this context. Moreover, if only phylogenetic proximity mattered, bonobo calls should also elicit activity in the TVA because bonobos are as phylogenetically close to humans as chimpanzees are. This viewpoint is rather reductive, and our results show that it is not entirely correct: activity in the TVA crucially depends on the acoustic properties of the perceived vocalizations, since we cannot infer phylogeny from vocalizations. This interpretation is strongly supported by the inclusion of the acoustic Mahalanobis distance of each species to the human voice as a trial-level covariate of no interest. Using such modeling, differential neuroimaging results between chimpanzee and bonobo vocalizations were explained by both acoustic and phylogenetic proximity in the TVA. These results are consistent with the recent proposal, and recent findings [30], that there are substantial differences between chimpanzee and bonobo vocalizations. These differences encompass the range and mean of the fundamental frequency, attributed to larynx length [28, 32], despite the evolutionary relatedness of bonobos to chimpanzees [28].
Therefore, the interaction between phylogeny and acoustic distance or proximity would explain the anterior TVA expansion for processing chimpanzee but not bonobo vocalizations specifically. This argument, however, falls short of explaining the recruitment of the TVA by macaque calls in model 3.
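To make the notion of acoustic distance more concrete, the following minimal Python sketch computes a Mahalanobis distance between one stimulus and a reference set of stimuli. It is an illustration only and not the analysis code of this study: the function names, the two-feature layout and the toy values are assumptions introduced here, and the sketch uses only the Python standard library.

```python
import math
from statistics import mean


def mean_and_covariance(rows):
    """Return the feature means and the sample covariance matrix (n - 1 denominator)."""
    n, dims = len(rows), len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(dims)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    return means, cov


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, reference_rows):
    """Distance of feature vector x from the distribution of reference_rows."""
    means, cov = mean_and_covariance(reference_rows)
    diff = [xi - mi for xi, mi in zip(x, means)]
    z = solve(cov, diff)  # z = cov^{-1} diff, without forming the inverse
    return math.sqrt(max(0.0, sum(d * zi for d, zi in zip(diff, z))))


# Toy values only (hypothetical, not study data): each row holds two
# acoustic features of one reference stimulus.
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
distance = mahalanobis((3.0, 1.0), reference)
```

In such a sketch, a larger value indicates a stimulus that lies farther from the reference distribution once the variance and covariance of the features are taken into account, which is what distinguishes this measure from a plain Euclidean distance.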
Overall, it seems reasonable to hypothesize that TVA activity is not per se human-specific [2, 41] but that the TVA are instead sensitive to vocalizations from other primate species, provided that these vocalizations have sufficient acoustic proximity to human vocal signals, which would in itself be related to anatomical and/or behavioral changes throughout phylogenetic evolution. This integrative view is again consistent with the concept of a “voice patch” system in the primate brain [43]. We therefore propose that the mid and anterior TVA, unlike the rest of the TVA, would be heterospecific, that is, sensitive to vocalization acoustics shaped by evolution. This proposition also implies a validation of our mechanistic hypothesis according to which the mean fundamental frequency of chimpanzee but not bonobo calls (the former being much closer to the mean fundamental frequency of the human voice [28]) would allow for a better identification and recognition of chimpanzee calls by humans. This advantage would rely on neurons of the human auditory cortex (in both the primary and more secondary regions) being specialized in the processing of low to mid fundamental frequencies such as those of the human voice and chimpanzee calls. In our third analysis model, we looked further into this aspect and included several acoustic properties of our stimuli as a function of the four species in our stimulus set. A discriminant analysis [30] allowed us to select the specific acoustic features that best discriminate between our species stimuli. We took the six parameters that best explained the differences between our stimuli, namely vocalization loudness, intensity (similar to our “energy” covariate of model 1), change in spectrum, F2 bandwidth contour, F0 power, and intensity contour difference.
Using these more sophisticated acoustic features as covariates of no interest, we still obtained brain imaging results very similar to those of model 1 and even closer to those of model 2 (with acoustic distance as covariate), yet with some subtle differences in anterior STG cluster size and location. The peaks were indeed located more ventrally and the clusters were larger than in models 1 and 2, especially for the processing of chimpanzee-specific and macaque-specific vocalizations compared to all primate species and to nonhuman primates alone. These results suggest that adding spectrum change to the intensity- and frequency-related acoustic parameters of the vocal signals slightly shifted and enlarged activation locations in the anterior STG. This result is again congruent with the proposed existence of “voice patches” in the temporal lobe of primate species [43], with the interconnectivity of these patches highly depending on very fine-grained acoustic aspects of primate vocal signals. The discriminant analysis motivated the inclusion of these parameters as covariates of no interest in neuroimaging model 3, to retain brain activations marginally independent of such acoustics. The congruence between these data should be explored in more detail in the future by combining computational bioacoustics and functional neuroimaging, given the high relevance and sensitivity of this combination for investigating primate social communication [46].
A final but perhaps more secondary interpretation arising from our results regarding bonobo calls also supports the evolutionary divergence of this peculiar species. According to the self-domestication hypothesis, bonobos would have evolved differently from chimpanzees due to selection against aggression [47]. Interestingly, differentiation in the evolutionary path of bonobos has influenced both their behavior [31] and morphology, leading to differences at the level of call production [28, 33]. Considering these documented acoustic differences and putting them in perspective with our neuroimaging data, the calls of our last common ancestor with the Pan species 8 million years ago [48] may have been closer to those uttered by modern chimpanzees than to those of bonobos. Our data indeed show that modern human brains remain more sensitive to the acoustic characteristics of the calls of the former compared to the latter, arguing for more conserved calls between modern chimpanzees and humans. This aspect is also in line with the marked difference between, for instance, the fundamental frequency of human baby cries or babbling (∼250–600 Hz) and that of bonobo calls (∼1000–3500 Hz) [49, 50], whereas the former corresponds more closely to the fundamental frequency of chimpanzee calls (∼500–1000 Hz) [28]. In our study, bonobo calls differ so much from those of the other species in our stimulus set that they presumably fall outside of the phylogenetic and acoustic proximity factors that we have outlined so far. This would also put into perspective the recruitment of the mid TVA for macaque calls.
In a sense, we therefore validate our first hypothesis regarding the existence of acoustic distance between each primate species used in our study. Our second hypothesis, however, is only partially validated. In fact, macaque compared to bonobo or other primate calls in model 3 revealed mid TVA activations, and we think that these activations may depend specifically on the importance of the most discriminant acoustic features. Several TVA subregions or “patches” underlying cross-species primate vocalization processing might therefore exist, and our data highlight at least one of them in the mid and anterior portion of the TVA. We will now discuss in further detail task-related limitations that might account for the partial divergence between our results and hypotheses.
Even though we tried to control for critical acoustic features, species categories and their related evolutionary distance, several limitations should be mentioned. These limitations are both theoretical and methodological. First, we cannot rule out that including more primate species in our set of stimuli would have influenced the results. In fact, even though our species categories were specifically chosen for this task, the inclusion of vocalizations from other great apes, such as gorillas or orangutans, would have broadened the scope of our results. Relatedly, tackling primate phylogeny, which spans millions of years, with only four species restricts the possible inferences based on our results. Second, we observed improved sensitivity of our data through the use of more sophisticated acoustic modeling, namely the inclusion of both between-species acoustic distance and the most discriminant acoustic features in the functional imaging data. However, we did not include as stimuli, or in a control task, the synthesized acoustic parameters of interest, for instance by using species-specific F0 contour or its spectral content in other neutral, comparable auditory stimuli. We therefore cannot completely rule out that such a task would trigger brain activations that overlap with our results, although such data would not be mutually exclusive with our data and interpretation. Future work should therefore address the specific question of acoustics in primate vocalization processing in the greatest possible detail, and should add more stimuli, including synthesized ones, from other great ape species. The origin of these acoustic differences should also be investigated, since we can assume that they originated at least partially from evolutionary processes as well as survival and adaptation mechanisms.
Finally, individual differences in the processing of, and preference for, one species over another or over all the others cannot be ruled out, even though we provide evidence that attentional effects toward the vocalizations of a specific species likely did not exist in our data. Therefore, individual differences should be assessed in more detail in the future, with the inclusion of participant-level covariates such as questionnaire scores assessing the familiarity with primate vocalizations or the hedonic value of these vocalizations for each individual. A more general limitation of nonhuman primate neuroscience is that more inclusive and large-scale collaborations would be needed. Such collaborations and frameworks would lead to a better study and understanding of primate neuroscience, and initiatives have recently been put forward in this direction [51, 52].
Taken together, our data suggest that phylogeny-driven specific acoustic features appear to be necessary to trigger cross-species activity in the human temporal voice areas, especially in subregions of the TVA in which increased activity underlies voice signals compared to animal and nature sounds. We provide evidence for specific an