Main
Infants must lay the foundations for cognition in the first year of life, with visual recognition being an important developmental challenge1. During this time, humans learn to recognize the things they see, grouping them into meaningful categories that later converge with language. Although understanding of visual processing in adults has made substantial progress2,3,4,5, its developmental trajectory remains unclear. Theories of visual development differ in emphasis on sensory-driven, bottom-up statistical learning6, the role of experience-dependent input7,8 or core knowledge systems9, hierarchical development from a primitive cortical organization10 and the degree of continuity between infants and adults11,12. However, brain development is likely a complex interplay between many of these factors13,14, and each element’s relative influence is difficult to distinguish without a characterization of visual function in early life.
The emergence of visual categories has been assessed in infants, with experimental paradigms measuring looking behavior. Young infants have been found to be sensitive to global structure and perceptual features15,16, and 10-month-old infants are sensitive to basic-level categories and conceptual features17. However, knowing when these functions appear in behavior does not reveal the developmental processes that led to their emergence. To distinguish these mechanisms, it would be valuable to characterize the precursor states of each element in the system. This would reveal how things change across time, through learning or maturation, and how brain function culminates in behavior. We aimed to do this by longitudinally measuring brain development, with a focus on the ventral visual stream (VVS). In adults, this region is key to visual recognition and comprises brain regions structured as a processing hierarchy from perceptual features to semantic categories18,19. In the visual system of infants, electroencephalography and magnetoencephalography (EEG and MEG) have found a staggered development of categories20,21 that mirrors the transition in feature complexity found in behavioral studies. However, the limited spatial resolution of these methods did not allow separation of the different parts of the ventral hierarchy. Methodological advances in awake infant fMRI have the potential to overcome this limitation22,23.
To characterize the developmental mechanisms, we related longitudinal fMRI measurements of regions along the VVS to computational models of vision. Deep neural networks (DNNs) are now cemented as effective models of the adult brain in several domains24,25, acting as complex feature detectors that learn the statistical regularities in visual input to encode an increasingly complex range of multidimensional features in their hierarchical layers—similar to adult visual cortex26,27,28. Particular progress has been made using representational similarity analysis (RSA)29, where the distributed codes for diverse stimuli are compared between the brain and the DNN to characterize the features encoded in each visual region. We presented a broader range of stimuli, avoiding those for which humans likely have specialized mechanisms, such as faces. This permits tests of alignment with computational models to determine if DNNs can be effective models in developmental neuroscience and to allow for discrimination of feature processing along the visual hierarchy.
We acquired a large-scale longitudinal MRI dataset of functional brain activity from awake infants, bridging the gap between large studies of sleeping or sedated infants30,31,32 and pioneering studies of awake infants with samples that collapse across different ages22,23,33. We found that a range of features are processed along the ventral hierarchy, including categorical representations across the ventral surface from 2 months of age. These representations corresponded to those in DNNs, revealing that the features encoded are like those that facilitate efficient categorization in machines. High-level vision continues to mature throughout the first year of life as the ventrotemporal cortex becomes more specialized in its stage-like function. We found later development of category representations in the object-selective lateral occipitotemporal cortex (LO), a region upstream of the anterior VVS in high-level visual processing34,35, revealing that human visual categories do not develop in a bottom-up manner along the cortical processing hierarchy.
Infant visual representations measured with awake fMRI at scale
fMRI was acquired longitudinally from 2-month-old (n = 130) and 9-month-old (n = 65) infants as they viewed 12 common visual categories (Fig. 1a–d). These were chosen across animate, small inanimate and large inanimate classes that would typically be seen in the first year of life (in person or through books), corresponding to words with a young age of acquisition36. For each category, we included three representative exemplars from diverse viewpoints to decorrelate perceptual responses from semantic (category-defined) responses. The success of awake fMRI depends upon infants staying content and relatively still in the scanner, so the pictures loomed to captivate attention. Most infants (63%) participated for at least four repetitions of each stimulus, totaling 144 pictures across 10 minutes of scanning. The distribution of head motion was acceptable (runs with a median framewise displacement (FWD) of <1.5 mm at 2 months, 85%; at 9 months, 97%) (Fig. 1d,f) and allowed for rigorous scrubbing during fMRI analysis (Methods). This resulted in a final sample of n = 101 2-month-old infants and n = 44 9-month-old infants for the pictures task. A cohort of adults (n = 17) was acquired for comparison. The blood oxygen level-dependent (BOLD) response to each object exemplar was estimated with a general linear model (GLM), and RSA was used to measure the representational geometry for each region in the ventral visual cortex (VVC). Individual regions of interest (ROIs) were defined using the cytoarchitectonic Julich atlas, which has been validated in its overlap with functionally defined regions34. To illustrate the broad picture, we first show early and late stages of processing in the ventral stream (early visual cortex (EVC): comprising V1, V2dv and V3v; VVC: comprising FG1/VO1, FG3/PHC, FG2 and FG4).
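The motion criteria above can be expressed as a small helper. This is a minimal sketch, assuming FWD has already been computed per volume; the function name `scrub_run` and the example values are illustrative, not taken from the study's pipeline.

```python
import numpy as np

def scrub_run(fwd, threshold=1.5):
    """Flag high-motion frames and apply a run-level inclusion criterion.

    fwd: 1-D array of framewise displacement (mm), one value per volume.
    Returns (keep_mask, run_ok): a boolean mask of frames retained for
    the GLM, and whether the run's median FWD falls below the threshold.
    """
    fwd = np.asarray(fwd, dtype=float)
    keep_mask = fwd <= threshold           # frames entering the GLM
    run_ok = np.median(fwd) < threshold    # run-level inclusion criterion
    return keep_mask, run_ok

# A mostly still run with two motion spikes: four frames survive scrubbing
mask, ok = scrub_run([0.1, 0.2, 2.0, 0.3, 1.8, 0.2])
```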
The similarities of voxelwise response patterns to each pair of conditions were calculated across subjects, giving a representational similarity matrix (RSM) for each ROI and age group (Fig. 2). The group mean RSMs were reliable, as estimated by a split-half noise ceiling, adjusting for test length using the Spearman−Brown correction (2 months of age: ρ(51) = 0.955 in EVC and 0.741 in VVC; 9 months of age: ρ(22) = 0.947 in EVC and 0.877 in VVC; adult: ρ(9) = 0.967 in EVC and 0.907 in VVC). Noise ceilings for all ROIs are reported in Extended Data Table 1, RSMs in Extended Data Fig. 1a–c and ROIs in Extended Data Fig. 1d. Correlations within each across-subject RSM ranged from (−1.0 to 1.0) for 2-month-old infants, from (−0.74 to 0.64) for 9-month-old infants and from (−0.32 to 0.42) for adults. To focus on signals that were highly consistent across participants, we report group mean RSMs across pairs of subjects.
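The across-subject RSM construction and the split-half noise ceiling with Spearman−Brown correction can be sketched as follows. This is a minimal numpy/scipy illustration with illustrative function names, assuming β patterns are arranged as condition × voxel matrices.

```python
import numpy as np
from scipy.stats import spearmanr

def across_subject_rsm(patterns_a, patterns_b):
    """Across-subject RSM: Pearson correlation of condition patterns
    between two subjects (or runs), giving an n_cond x n_cond matrix.

    patterns_*: (n_conditions, n_voxels) beta estimates for one ROI.
    """
    a = patterns_a - patterns_a.mean(axis=1, keepdims=True)
    b = patterns_b - patterns_b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # entry (i, j) = r(pattern_i of A, pattern_j of B)

def spearman_brown_ceiling(rsm_half1, rsm_half2):
    """Split-half noise ceiling, adjusted for test length."""
    iu = np.triu_indices_from(rsm_half1, k=1)   # vectorized upper triangle
    r, _ = spearmanr(rsm_half1[iu], rsm_half2[iu])
    return 2 * r / (1 + r)                      # Spearman-Brown correction
```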
Fig. 1: Characterizing the development of visual representations.
a, Awake longitudinal fMRI was collected in 2-month-old infants and again at 9 months of age. A cohort of adults was collected for comparison. After careful data curation and preprocessing with custom analysis pipelines, response patterns were estimated using a GLM. The final dataset for the pictures task presented here, after motion thresholding, included n = 101 2-month-old infants and n = 44 9-month-old infants. Created in BioRender: O’doherty, C. (2025): https://BioRender.com/fyk6ao9. b, Visual representations were characterized per ROI and age group using multivariate pattern analysis. The pairwise correlations, across subjects and runs, between all n conditions’ β estimates in an m-voxel-dimensional space gave an n × n similarity matrix. c, The equivalent analysis was performed in a DNN. Activations were calculated from each layer, in response to the same images used during scanning. The resulting model similarity matrix could then be compared with the fMRI data using a Spearman correlation. d, Stimuli used in the pictures task. Three exemplars were chosen for each of 12 categories spanning animate, inanimate-small and inanimate-large. e, Frequency of time spent in the awake fMRI pictures task for 2-month-old infants and 9-month-old infants. f, Histogram of FWD across all data collected. Frames with FWD > 1.5 mm were removed when fitting the GLM, and runs with more than half the frames above this threshold were not included in analyses. g, Number of participants collected at each stage of the study. Infants also completed an awake videos task, which will be released with future work. mo, months old.
Fig. 2: Visual representations from infancy to adulthood.
The group average, across-subject pairwise correlation distance between voxelwise patterns from EVC and VVC in response to each object. Aggregated ROIs were defined using individual regions from the Julich atlas that have been validated to overlap with functionally defined regions. Conditions within the RSMs are nested by animacy class (animate, inanimate-small and inanimate-large), then category (four per animacy class) and, finally, exemplar (three per category). To focus on representational content rather than strength, we plotted the z-scored correlation distance. Raw group mean correlation ranges varied with age: 2-month EVC = (−0.027 to 0.030) and VVC = (−0.0082 to 0.0078); 9-month EVC = (−0.054 to 0.073) and VVC = (−0.030 to 0.045); adult EVC = (−0.112 to 0.142) and VVC = (−0.046 to 0.068). Violin plots show the bootstrap distribution of RSM correlation to a model that codes for within-object versus between-object similarity, implemented as an identity matrix. The violin is a symmetric kernel density estimate of the bootstrap distribution across all subject/run pair RSMs (2 months (n = 101): 14,108 unique subject/run pairs, 9 months (n = 44): 1,995 unique subject/run pairs; adults (n = 17): 136 unique subject/run pairs). The white dot is the median; the black bar is the IQR; and the whiskers denote 1.5 × IQR of the lower and upper quantiles. Correlations were normalized by the split-half noise ceiling for each ROI/age. ROIs are plotted from the atlas transformed into age-specific templates. mo, months old.
Basic and global categories are present in the ventral stream from 2 months of age
Representations in both EVC and VVC were surprisingly mature by 2 months of age (correlation to adult group EVC ρ = 0.788, 95% bootstrap confidence interval (CI): 0.782−0.793; VVC ρ = 0.577, 95% CI: 0.558−0.598) with continued maturation throughout the first year of life (9-month correlation to adult EVC ρ = 0.792, 95% CI: 0.783−0.802; VVC ρ = 0.654, 95% CI: 0.634−0.670). All statistics, including the across-group differences and their significance, were calculated using bootstrap resampling across subject pair RSMs to estimate the sampling distribution. Different visual stimuli evoked distinct BOLD patterns in both regions from 2 months of age, assessed by correlation to an RSM that contrasts within-exemplar versus between-exemplar similarities (Extended Data Fig. 2). In EVC, representations were similar in their correlation to this within versus between stimulus model at 2 months of age (Spearman correlation of brain RSM and model RSM ρ = 0.321, 95% bootstrap CI: 0.314−0.328), at 9 months of age (ρ = 0.333, 95% CI: 0.327−0.339) and in adulthood (ρ = 0.322, 95% CI: 0.315−0.330). In VVC, the within versus between image distinction was present at 2 months of age (ρ = 0.169, 95% CI: 0.128−0.206), becoming stronger in its representation by 9 months of age (ρ = 0.229, 95% CI: 0.201−0.255) and again into adulthood (ρ = 0.271, 95% CI: 0.251−0.288). This pattern was preserved when using importance reweighting during bootstrapping to match motion distributions (Extended Data Fig. 3a,b). Our dataset revealed a clear stimulus-specific representational geometry in the infant brain, which shows considerable continuity with adults, while continuing to develop throughout the first year of life. This maturity of individual object representations was evident in EVC and VVC at an age when overt visual categorization is restricted or undetectable37, visual acuity is still developing38 and experience with the world is limited39.
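The bootstrap procedure used throughout — resampling subject/run-pair RSMs with replacement to obtain a 95% CI on the Spearman correlation between the group mean RSM and a model RSM — can be sketched as below. The function name and defaults are ours; this is an illustration of the resampling logic, not the study's exact implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_model_correlation(pair_rsms, model_rsm, n_boot=1000, seed=0):
    """95% bootstrap CI for the Spearman correlation between a group
    mean RSM and a model RSM, resampling pair RSMs with replacement.

    pair_rsms: (n_pairs, n_cond, n_cond) stack of across-subject RSMs.
    """
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(model_rsm.shape[0], k=1)
    model_vec = model_rsm[iu]
    stats = []
    for _ in range(n_boot):
        # resample subject/run pairs with replacement, re-average, correlate
        sample = pair_rsms[rng.integers(0, len(pair_rsms), len(pair_rsms))]
        stats.append(spearmanr(sample.mean(axis=0)[iu], model_vec)[0])
    return np.percentile(stats, [2.5, 97.5])
```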
These results show that distinct representations are present for individual images, but it is unclear if this is due to their perceptual similarity or to high-level properties that facilitate category grouping. To investigate this, we quantified the features comprising this relational geometry through comparison with perceptual and categorical model RSMs (Extended Data Fig. 2). Informed by existing behavioral and EEG evidence21,37, we hypothesized that there would be a transition from low-level perceptual organization in the infants to high-level categorical organization in adults. According to the bottom-up theory of visual development, this perceptual to categorical transition should occur along the object processing hierarchy of the visual stream. The perceptual RSMs chosen were size, elongation, color and compactness—features that have previously been used to model the trajectory of infant looking time behavior from 4 months to 19 months of age37. Semantic models were defined by category membership. These were the within versus between image RSM described above, generalization across different exemplars within the same basic-level category (category RSM), similarity within the global categories of animate versus inanimate objects (animacy RSM) and small versus large inanimate objects. This tripartite animacy distinction is a known organizing principle of adult visual cortex to which infants are not behaviorally sensitive until 10 months of age37,40.
All features were present in each age group (Fig. 3c,d), including distinct categorical representations across visual cortex at 2 months of age, which strengthened throughout the first year of life. Correspondence to the semantic feature RSMs controlled for the four perceptual features through a partial correlation. Perceptual and categorical representations were present in the ventral stream for all the age groups; rather than a transition from the representation being driven by low-level visual features to high-level information, visual cortex begins with features spanning a range of complexities, which is fine-tuned with age. These findings persisted when replicating the tests in a smaller sample with restricted motion distributions (Extended Data Fig. 3a,b). However, we found no effect of longitudinal similarity when comparing the same infants at 2 months and 9 months of age (Extended Data Fig. 3c).
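The partial correlation used to control for the four perceptual features can be sketched as a rank-based regression on the vectorized upper triangles of the RSMs: rank-transform all vectors, regress the covariate ranks out of both brain and model ranks, and correlate the residuals. The implementation below is a minimal illustration (names are ours, and it assumes no tied values).

```python
import numpy as np

def partial_spearman(brain_vec, model_vec, covariate_vecs):
    """Partial Spearman correlation between brain and model RSM vectors,
    controlling for covariate model RSMs (e.g. perceptual features).

    All inputs are vectorized upper triangles of RSMs.
    """
    def ranks(v):
        # rank transform (assumes no ties)
        return np.argsort(np.argsort(v)).astype(float)

    X = np.column_stack([ranks(c) for c in covariate_vecs])
    X = np.column_stack([np.ones(len(brain_vec)), X])   # add intercept

    def residual(v):
        r = ranks(v)
        beta, *_ = np.linalg.lstsq(X, r, rcond=None)    # regress out covariates
        return r - X @ beta

    rb, rm = residual(brain_vec), residual(model_vec)
    return np.corrcoef(rb, rm)[0, 1]
```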
Fig. 3: The development of perceptual and semantic feature representations.
Perceptual (a) and categorical (b) features were calculated for each stimulus to construct model RSMs. Created in BioRender: O’doherty, C. (2025): https://BioRender.com/dka6man. Correspondence of visual representations from EVC and VVC to perceptual RSMs (c) and semantic feature RSMs (d), controlling for the four perceptual features with a partial correlation. Correlations were normalized by the noise ceiling. Boxes are quartiles of the bootstrapped 95% CI, across subject/run pairs (2 months (n = 101): 14,108 unique pairs, 9 months (n = 44): 1,995 unique pairs; adult (n = 17): 136 unique pairs). The lower and upper bounds are the 25th and 75th percentiles; the middle line is the median; and whiskers extend to 1.5 × IQR. Individual points are outliers. mo, months old.
Infants represented categories from 2 months of age, but were they encoding the organizing principles that are present in the adult VVS, such as animacy and real-world size40? We found evidence for this much earlier in the brain than has been previously reported from looking time. The animacy distinction was present in the VVC from 2 months of age to a similar degree as category representations (Fig. 3d). However, this was refined with age and experience (Fig. 4) (partial Spearman correlation, controlling for the four perceptual features in 2-month-old infants, ρ = 0.198, 95% CI: 0.164−0.233; in 9-month-old infants, ρ = 0.552, 95% CI: 0.522−0.581; and in adults, ρ = 0.657, 95% CI: 0.613−0.696, significant change with each age). Representation of real-world size was weaker but followed a similar developmental trajectory (partial correlation to inanimate-large versus inanimate-small model in 2-month-old infants, ρ = 0.089, 95% CI: 0.057−0.118; in 9-month-old infants, ρ = 0.299, 95% CI: 0.278−0.322; and in adults, ρ = 0.263, 95% CI: 0.242−0.282). We found that basic-level categories are present in the ventral stream by 2 months of age, with an initial template of organization by animacy and real-world size. This global organization is further distinguished by the latter half of the first year, as it becomes evident in looking behavior34.
Fig. 4: Global organization by animacy emerges in the ventral stream throughout the first year of life.
MDS plots of the VVC representations in Fig. 2. The embedding space of the pairwise distances between all images is shown in each age group as well as their global category membership. The rubber duck category was grouped with animate objects in all age groups’ visual representations and has been defined as such here. To ensure similar projections in each plot, embedding models were fit across the ROIs VVC and V1, then separately refined for VVC collapsed across ages and, finally, for age.
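An embedding of this kind can be sketched with classical (Torgerson) MDS applied to the correlation distances. This numpy-only version is our illustration of the general technique, not necessarily the embedding algorithm used for the figure.

```python
import numpy as np

def classical_mds(rsm, n_components=2):
    """Classical (Torgerson) MDS on correlation distances derived
    from a similarity matrix; returns (n_items, n_components) coords."""
    d = 1.0 - rsm                          # correlation distance
    d = (d + d.T) / 2                      # enforce symmetry
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    B = -0.5 * J @ (d ** 2) @ J            # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:n_components]
    # coordinates from the top eigenvectors (negative eigenvalues clamped)
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
```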
LO regions are later to mature
The above analyses used an aggregated ROI approach to distinguish trends in two large divisions of the ventral stream, comprising many functionally distinct regions. We expanded this analysis to characterize the maturity of visual representations across finer parcellations of visual cortex. To assess the maturity of representations within each region, the group average infant RSMs were correlated to group average RSMs in the corresponding ROI, bootstrapping across infants to obtain CIs (Fig. 5a). Maturity was highest in early visual regions (correlation to adults in 2-month-old V1, ρ = 0.788, 95% CI: 0.782−0.794; 9-month-old V1, ρ = 0.809, 95% CI: 0.800−0.819) relative to anterior ventral visual regions such as the parahippocampal cortex (PHC) (correlation to adults in 2-month-old infants, ρ = 0.424, 95% CI: 0.390−0.457), with continued maturation throughout the first year of life (9-month-old infant PHC, ρ = 0.575, 95% CI: 0.546−0.601). However, maturity differed most noticeably along the medial to lateral axis, in volumetric space, with LO regions being particularly immature (correlation to adults in 2-month-old infants, ρ = 0.120, 95% CI: 0.076−0.160; in 9-month-old infants, ρ = 0.298, 95% CI: 0.250−0.344).
Fig. 5: Delayed maturation of lateral object-selective cortex.
a, Correlations in each visual region of infant and adult representations. To compare equivalent slices at 2 months and 9 months of age, the 2-month-old atlas was transformed into 9-month space. b, Group average visual representations for LO. z-scored responses are ordered as in Fig. 2. Raw correlation ranges in 2-month-old infants (−0.0053 to 0.0089), 9-month-old infants (−0.0130 to 0.0176) and adults (−0.0712 to 0.0925). c, The split-half reliability in LO was extremely low in the infant cohorts relative to a mature medial region, such as EVC. This was dissociated from the MRI tSNR, which increased with age at similar rates in EVC and LO, revealing that the lack of structure seen in b was due to inconsistent visual representations across infant participants and not poor signal in these regions. Solid lines in the top panel are the Fisher-transformed split-half reliability; solid lines in the bottom panel are the mean tSNR; and error bands are the 95% CI across runs (169 runs in 2-month-old infants, 64 runs in 9-month-old infants and 51 runs in adults).
In stark contrast to adults, RSMs from LO showed no evidence of category or animacy organization in the infant cohorts (Fig. 5b), despite this region being selective for intact versus scrambled objects in adults41,42. Visual representations were not reliable in infant LO, as indicated by the split-half noise ceiling (2-month-old infants: ρ(51) = 0.032; 9-month-old infants: ρ(22) = 0.301; adults: ρ(9) = 0.957). We hypothesized that this was due to some source of region-specific MRI measurement noise. Alternatively, if infant LO was not sensitive to the differences between our stimuli, this would also lead to unreliable signal in the RSM. To distinguish these potential causes, we assessed measurement noise using the temporal signal-to-noise ratio (tSNR; Extended Data Table 2) of the regional BOLD timecourses. These were similar between LO and EVC, suggesting similar data quality (Fig. 5c). Furthermore, a variance partitioning analysis showed similar developmental trajectories for EVC, VVC and LO (Extended Data Fig. 4). We, therefore, interpret the results in infant LO as a true lack of response to visual differences. The developmental transition of feature complexity in LO was not necessary for the category representations to be present in anterior VVC, contrary to a bottom-up model of visual development.
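tSNR is conventionally computed as the temporal mean of a voxel's BOLD timecourse divided by its temporal standard deviation. A minimal sketch (the `ddof` choice here is ours):

```python
import numpy as np

def tsnr(timeseries):
    """Temporal signal-to-noise ratio: temporal mean over temporal std.

    timeseries: (n_timepoints,) or (n_timepoints, n_voxels) BOLD data.
    """
    ts = np.asarray(timeseries, dtype=float)
    return ts.mean(axis=0) / ts.std(axis=0, ddof=1)
```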
DNNs model infant visual cortex
We found the presence of high-level features in infant visual representations, as defined by within versus between category distinctions. However, category representations could be shaped by many visual features with potentially complex multidimensional tuning in feature space. DNNs have shown great success in capturing these complex feature manifolds from the statistics of visual input, learning representations of objects that are simultaneously successful for recognition and relevant for adult visual cortex27. To date, parallels to these models have only been drawn to the ‘fully trained’ adult brain. With our dataset characterizing infant visual representations, we can now compare the ‘learning’ human brain to the ‘learning’ model. Here, we tested if deep-learning-based models of visual recognition can capture infant neural responses (Fig. 6a). Using the 36 images from the fMRI experiment, we calculated activations from a set of models with an AlexNet architecture (Extended Data Figs. 5–7) by testing the extremes of the training process: untrained and fully trained. The architecture of untrained neural networks can provide sufficient inductive biases for capturing object-relevant information43, but training the network weights through visual input to the model confers an important advantage for representation learning. By testing correspondences to the models at these two stages, we can delineate the contribution of features that are learnable from the statistics of visual experience for explaining brain representations.
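The layerwise RSA logic — compute an RSM from each layer's activations to the same images, then correlate its upper triangle with the brain RSM — can be illustrated with a toy random-weight network standing in for an untrained AlexNet. Everything below is a schematic stand-in (random linear+ReLU layers on synthetic image vectors), not the actual model, images or analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def layer_activations(images, widths=(256, 128, 64)):
    """Toy network: random linear+ReLU layers. Returns a list of
    per-layer activation matrices of shape (n_images, width)."""
    acts, x = [], images
    for w in widths:
        W = rng.normal(scale=1 / np.sqrt(x.shape[1]), size=(x.shape[1], w))
        x = np.maximum(x @ W, 0)           # linear + ReLU
        acts.append(x)
    return acts

def rsm(acts):
    """Image-by-image correlation matrix of one layer's activations."""
    return np.corrcoef(acts)

def layerwise_brain_fit(images, brain_rsm):
    """Spearman correlation of each layer's RSM to a brain RSM,
    comparing vectorized upper triangles (as in RSA)."""
    iu = np.triu_indices(brain_rsm.shape[0], k=1)
    return [spearmanr(rsm(a)[iu], brain_rsm[iu])[0]
            for a in layer_activations(images)]
```

In the actual analysis the same comparison would be run per layer of the trained and untrained AlexNet variants, normalized by the noise ceiling for each ROI and age group.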
Fig. 6: DNNs model visual representations from infancy to adulthood.
a,b, Activations were calculated from DNNs trained with two different algorithms: supervised, trained on a fully labeled object recognition task, and self-supervised, trained using instance protocol contrastive learning. c, Layerwise Spearman correlations between the DNN AlexNet and visual cortex. Correlations were normalized by the noise ceiling within each age group and ROI. Solid lines are the mean, and error bands depict the 95% CI, calculated using bootstrap resampling across subject/run pairs (2 months (n = 101): 14,108 unique pairs; 9 months (n = 44): 1,995 unique pairs; adult (n = 17): 136 unique pairs). conv, convolutional; fc, fully connected.
Although infant representations in EVC correlated more with an untrained DNN than adult representations at both 2 months and 9 months of age (Fig. 6c), a fully trained network outperformed the untrained model across all age groups. This demonstrates that, from as early as 2 months of age, features learned from the statistics of visual input—used by machines for object classification—are important for explaining visual representations. Even with less visual experience44, infants’ representations are sophisticated enough to correspond well with those from a fully trained neural network. This DNN model underwent supervised training, where each image had a corresponding label, something pre-verbal infants would not have had access to. Instead, infants are sensitive to comparisons and patterns within the stream of sensory input45,46, aligning better with self-supervised training. To test if the observed effect generalized to DNNs trained without labels, we compared to a DNN trained with a self-supervised algorithm47. Infants showed similar patterns of correlation to the model regardless of learning algorithm, with the expected hierarchical correspondence between brain and model emerging along the VVS26 (Extended Data Fig. 8). Early visual cortex peaked in its correspondence to the DNN in earlier layers. In all age groups, VVC representations were most similar to deep layers that capture higher-level visual properties of the images, facilitating object classification. Notably, some layers of the model trained with self-supervised learning explained infants’ VVC significantly better than adults’ when compared to the supervised model. This demonstrates that different DNNs are better models for different developmental stages, opening the future possibility to identify the learning model that best explains infant brain data.
No evidence for bottom-up development of the VVS
Our results have shown that high-level visual features are present across the ventral stream at 2 months of age, with refinement of the functional distinctions between low-level and high-level regions throughout development. In Fig. 7, we summarize our findings and further highlight this functional specialization along the visual hierarchy. Early visual regions in the 2-month-old brain represent features that span a range of complexities, captured by perceptual and categorical image features and many neural network layers. In the older age groups, EVC representations become specialized toward low-level visual features. In VVC, many features are represented at 2 months of age, with a bias toward categorical features. These high-level responses become functionally distinct in VVC with age and experience, especially for global organization by animacy. The influence of complex features captured by DNN layers on early visual representations becomes less pronounced with age. LO, a mid-level processing region, shows a protracted development relative to other object-selective regions. Thus, we found no evidence for a bottom-up transition from simple to complex features along the visual processing hierarchy. Instead, high-level feature representations are present in the ventral stream from 2 months of age and are fine-tuned with age and experience. When considering the maturation from medial toward lateral visual regions, we found a protracted development of object-selective regions in LO, which lies posterior to the VVC, indicating non-hierarchical development of visual regions involved in object perception.
Fig. 7: Developmental cascades of feature complexity along the ventral stream.
Scatter plots showing correlations to the features used in RSA and each layer of supervised AlexNet versus regions along the visual hierarchy. The size of each dot depicts the Spearman correlation between model and brain RSMs, standardized across all ages and test types. Correlations were not normalized by the noise ceiling due to invalidity in individual regions where there is lack of reliable signal, such as LO. The opacity of each dot ranges from the minimum to maximum value within a subplot. VVS regions become more functionally specialized through development, but high-level visual features captured by categorical features and deep layers of a neural network are present throughout the visual hierarchy from 2 months of age. conv, convolutional; fc, fully connected.
Discussion
Using a broad set of stimuli, we found that infant VVC contained category and animacy information at 2 months of age, which was not explained by low-level visual features. This category structure appears in ventral regions of the visual hierarchy before emerging in lateral regions, revealing that category representations in visual cortex do not develop in a bottom-up manner. In agreement with retinotopic studies48, early medial regions are the most mature at 2 months of age. The features encoded in visual representations are refined with age and experience rather than a developmental transition from simple to complex feature encoding.
Converging evidence now suggests that object processing in LO is later to mature. Previous infant fMRI studies using face and place stimuli found that focal regions of selectivity are present in ventral regions on the fusiform gyrus, such as fusiform face area (FFA) and parahippocampal place area (PPA), with no evidence for selectivity in lateral precursor regions occipital face area (OFA) and occipital place area (OPA)33,49. Shape processing, but not object selectivity, has been demonstrated in LO of 6-month-old infants using functional near-infrared spectroscopy (fNIRS)50, and steady-state visual evoked responses to scrambled versus intact objects are immature at this age51. Moreover, the improved spatial resolution of fMRI highlights that the developmental transition in feature complexity previously reported from behavioral and MEG/EEG studies20,21 may be driven by lateral visual regions rather than immature function of category processing along the ventral stream. LO is an important object-selective region, so why might it show a protracted development? Neural activity in this region is modulated by attentional shifts52, facilitates scene recognition from component objects53 and has closely overlapping representations for object classes and their intended motor function54. The contributions of these cross-modal and higher-order cognitive processes might require more experience; improved motor abilities may be necessary for an object’s intended action to be represented55; or sufficient activation of infants’ attention networks56 may constrain the complete emergence of object recognition. Indeed, the long-range connectivity of future category regions on the ventral surface has been found from early infancy57,58, whereas myelination of white matter tracts occurs later in lateral regions59.