Introduction
Hierarchical spatial organization is ubiquitous in tissue and organ biology. Systematic, high-dimensional phenotypic measurements of this organization, generated through experimental tools such as spatial transcriptomics, multiplex immunofluorescence, and electron microscopy, are also becoming increasingly available as large, open datasets. However, transforming this abundance of data into a useful representation can be difficult, even for fields with a wealth of prior knowledge, such as neuroanatomy.
Datasets such as the Allen Brain Cell Whole Mouse Brain (ABC-WMB) Atlas1,[2](#ref-CR2 “The MICrONS Consortium. Functional …
Introduction
Hierarchical spatial organization is ubiquitous in tissue and organ biology. Systematic, high-dimensional phenotypic measurements of this organization, generated through experimental tools such as spatial transcriptomics, multiplex immunofluorescence, and electron microscopy, are also becoming increasingly available as large, open datasets. However, transforming this abundance of data into a useful representation can be difficult, even for fields with a wealth of prior knowledge, such as neuroanatomy.
Datasets such as the Allen Brain Cell Whole Mouse Brain (ABC-WMB) Atlas1,2,[3](https://www.nature.com/articles/s41467-025-64259-4#ref-CR3 “Chen, X. et al. Whole-cortex in situ sequencing reveals input-dependent area identity. Nature 1–10 https://doi.org/10.1038/s41586-024-07221-6
(2024).“), a multi-million cell single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (MERFISH) atlas, provide unprecedented opportunities to investigate whether computational tools can help biologists understand spatial cellular and molecular organization. However, the size of these datasets presents computational challenges for existing methods. Existing methods for spatial niche or spatial domain detection often operate on the entire dataset at once, for example, a tissue-section-wide cell by gene matrix. This precludes scale-up to large multi-section datasets as most systems do not have the GPU memory required to load multiple sections of data or store intermediary representations such as pairwise distance matrices[4](#ref-CR4 “Haviv, D. et al. The covariance environment defines cellular niches for spatial inference. Nat. Biotechnol. 1–12 https://doi.org/10.1038/s41587-024-02193-4
(2024).“),5,6, particularly as datasets scale into the millions or tens of millions of cells. Some methods rely on Gaussian processes, which feature a costly cubic computational scaling in the number of observations[7](https://www.nature.com/articles/s41467-025-64259-4#ref-CR7 “Townes, F. W. & Engelhardt, B. E. Nonnegative spatial factorization applied to spatial genomics. Nat. Methods 1–10 https://doi.org/10.1038/s41592-022-01687-w
(2022).“). Other more scalable methods are limited in capturing granular structure, integration across tissue sections, or require significant neuroanatomical prior knowledge to manually audit, cluster, and hyperparameter tune for domain discovery workflows[8](#ref-CR8 “Birk, S. et al. Large-scale characterization of cell niches in spatial atlases using bio-inspired graph learning. Preprint at https://doi.org/10.1101/2024.02.21.581428
(2024).“),9,[10](https://www.nature.com/articles/s41467-025-64259-4#ref-CR10 “He, Y. et al. Towards a universal spatial molecular atlas of the mouse brain. 2024.05.27.594872 Preprint at https://doi.org/10.1101/2024.05.27.594872
(2024).“).
Our method, CellTransformer, implements a robust representation learning and clustering workflow to discover spatial niches at scale by representing not tissue sections but subgraphs that represent individual cellular neighborhoods. We describe an innovative strategy to induce the encoder of an encoder-decoder transformer to aggregate useful information into a neighborhood representation token. This occurs by training the model to condition cell-type specific gene expression predictions using this neighborhood context token. The model thus learns to predict the expression of cell types in arbitrary cell neighborhoods. This representation allows for recovery of important anatomically plausible spatial domains while remaining computationally efficient.
We evaluate CellTransformer on using the ABC-WMB dataset (3.9 million cells collected with a 500 gene MERFISH panel)1 demonstrating its effectiveness in producing completely data-driven spatial domains of the mouse brain by comparing the results to the Allen Mouse Brain Common Coordinate Framework version 3 (CCFv3)11. CCF is a consensus hand-drawn 3D reference space compiled from a large multimodal data corpus. Annotations feature labels at three levels of coarseness (from 25 regions at coarse-grain to 670 at fine-grain), which we use to show that CellTransformer excels at identifying spatial domains that are spatially coherent and biologically relevant. CellTransformer domains reproduce known regional architecture observed in targeted studies of the subiculum and in the superior colliculus superficial layers. Beyond the 670 regions currently annotated in ABC-WMB, we show our workflow produces meaningful data-driven domains in regions that currently lack subregion annotation. As examples, we focus on data-driven subdomains we define in superior colliculus and midbrain reticular nucleus.
We also demonstrate CellTransformer’s strength in integrating domains across animals, leveraging a separate whole-brain dataset within ABC-WMB12 comprising 6.5 million cells distributed across four animals and 239 sections and with a separate gene panel with 1129 genes. We find that CellTransformer produces consistent subregions across all 5 animals (1 coronal and 4 sagittal), suggesting a successful integration across animals with heterogeneous measurements. Notably, we also find that identified domains are highly consistent across animals. This work demonstrates that large-scale data-driven discovery of domains at CCF-like resolution can be based on spatial transcriptomics data. Finally, we show that our framework can perform domain detection in a different spatial transcriptomics modality, Slide-seqV2, using a whole-brain dataset of cellularly deconvoluted results13.
Results
The CellTransformer architecture and domain detection workflow
CellTransformer is a graph transformer[14](https://www.nature.com/articles/s41467-025-64259-4#ref-CR14 “Veličković, P. et al. Graph Attention Networks. Preprint at https://doi.org/10.48550/arXiv.1710.10903
(2018).“) neural network that is trained to learn latent representations of cell neighborhoods by conditioning single-cell gene expression predictions on neighborhood spatial context. We define a cellular neighborhood as any cells within a user-specified distance cutoff in microns away from a reference or center cell. As input, our model requires the gene expression profiles and cell type classifications for cells in a neighborhood and outputs a latent variable representation for that neighborhood. One of the principal operations in a transformer is the self-attention operation, which computes a feature update based on pairwise interactions between elements in a sequence, which are referred to as tokens (here, cells). Accordingly, one interpretation of our model is of learning an arbitrary and dynamic pairwise interaction graph among cells.
Restricting this graph to a small neighborhood subgraph of the whole-tissue-section graph has benefits for both computational resource usage and biological interpretability. We interpret the size of the neighborhood as a constraint on the physical distance at which statistical correlations between the observed cells and their gene expression profiles can be directly captured. Truncating neighborhoods using a fixed spatial threshold instead of choosing a fixed number of neighbors also allows the network to account for the varying density of cells in space. Accordingly, our framework incorporates a notion of both cytoarchitecture (relative density and proximity) and molecular variation (cell type and RNA-level variation) in the data.
We designed a self-supervised training scheme requiring only cell-type labels, which many large-scale studies make available via scRNA-seq atlas reference mapping1,12. We model cellular neighborhoods as sets of cell tokens that are within a box of fixed size centered around a reference cell and use them to predict the observed gene expression of the cell at the center of the neighborhood. We refer to this cell as the reference cell (indicated by “cell R” in Fig. 1a). Cell tokens are generated by composing cell-type and gene expression information (Methods). After encoding with a series of transformer layers (where cells are only allowed to attend to each other if they are in the same neighborhood), these tokens are then aggregated using a learned pooling operation to produce a single token representation of the entire tissue context. The model receives a new mask token representing the reference cell’s type, which is used to predict its gene expression following the operation of several transformer decoder layers (Fig. 1b; note that the point cloud displayed is for illustration purposes only). Importantly, during this process, only the mask token and the neighborhood representation can attend to each other. This operation captures a hierarchical encoding and decoding process where low-level information (gene and cell type) is produced at the cell token level and aggregated into a high-level representation. This high-level representation is then used to conduct the reverse decoding process (prediction of gene expression from cell type and tissue context information). This construction resembles that of masked language models[14](https://www.nature.com/articles/s41467-025-64259-4#ref-CR14 “Veličković, P. et al. Graph Attention Networks. Preprint at https://doi.org/10.48550/arXiv.1710.10903
(2018).“) and masked autoencoders15,16, where masked predictions are generated based on a conditioning signal (position encodings). In our model, position-based conditioning is replaced by cell type conditioning, similarly to the NCEM17 model. However, unlike masked prediction models such as NCEM17, we aggregate information across tokens (cells) in a cellular neighborhood using a learned pooling, which strongly bottlenecks the information distributed across the tokens prior to masked cell prediction.
Fig. 1: Overall training and architectural scheme for CellTransformer. a During training, a single cell is drawn (we denote this the reference cell, highlighted in red). We extract the reference cell’s spatial neighbors and partition the group into a masked reference cell and its observed spatial neighbors. b Initially, the model encoder receives information about each cell and projects those features to *d-*dimensional latent variable space. Note that the latent variable point cloud shown is merely for illustration purposes and does not correspond to the actual CellTransformer latent space. Features interact across cells (tokens) through the self-attention mechanism. These per-cell representations and an extra token acting as a register token are then aggregated into a single vector representation, which we refer to as the neighborhood representation. This representation is concatenated to a mask token which is cell type-specific and chosen to represent the type of the reference cell. A shallow transformer decoder (dotted lines) further refines these representations and then a linear projection is used to output parameters of a negative binomial distribution modeling of the MERFISH probe counts for the reference cell. c Once the model is trained, we compute embeddings (one for each neighborhood/reference-cell pairing) and concatenate these embeddings within the tissue section datasets and across tissue sections. Concatenating embeddings across tissue sections produces region discovery at organ level. We then cluster these embeddings using *k-*means to discover tissue domains across sections.
At test time, we extract this neighborhood representation for each cell and use *k-*means clustering to identify discrete spatial domains (Fig. 1c). We will use the term spatial domain to refer to the output of clustering on embeddings and cluster to refer to single-cell clusters transferred from the ABC-WMB single-cell taxonomy. We emphasize that the input embedding matrix for *k-*means is constructed by concatenating all cells across the dataset across tissue sections. Since minibatching is used during training (unlike methods such as STAligner and GraphST), for generating embeddings, and during *k-*means (using cuml for GPU-acceleration), overall computational costs of our algorithm are limited in principle only by the memory required for storage of cellular neighborhoods rather than entire sections or datasets.
Data-driven discovery of fine-grained spatial domains in the mouse brain using ABC-WMB
The ABC-WMB spatial transcriptomics dataset contains data from five mouse brains1,12. One animal was processed by the Allen Institute for Brain Science with a 500 gene MERFISH panel and 53 coronal sections (Yao et al.)1 The remaining four other animals, generated in Zhang et al.11. were collected with a 1129 gene panel. Sections from two of these animals (“Zhuang 1”, 147 sections; and “Zhuang 2”, 66 sections) were sampled coronally. The other two animals in the dataset (“Zhuang 3”, 23 sections; and “Zhuang 4”, 3 sections) were sampled sagittally.
We first trained CellTransformer on the Allen 1 dataset. A visualization of the loss curve (negative log-likelihood) showed a clear plateau (Supplementary Fig. 1). We then clustered these embeddings using *k-*means. We emphasize that to generate spatial domains across the brain, all *k-*means clustering in this paper was performed by concatenating cells in the dataset across tissue sections. All further references to visualizations of domains, including those only visualized for a subset of domains, were fit at a given number of domains across the entire dataset. We observed an isolated case where a small number of spatial clusters only in the striatum displayed a non-homogenous spatial patterning that strongly resembled a predominantly astrocytic, Crym+ population described in Ollivier et al. (2024) and have similar gene expression patterns (particularly Crym expression) and spatial organization (See Supplementary Note 1 for a discussion on biological plausibility of these domains and the effect of smoothing, as well as an extensive series of visualizations of the domains at high resolution with and without smoothing). Despite this plausibility, we introduced an optional smoothing step (see Methods) to remove this population to concord with neuroanatomical convention, which typically analyzes spatially contiguous regions. We also visualized the first three principal components (~19.5% variance explained, Supplementary Fig. 2a–c), demonstrating the CellTransformer embeddings produce neuroanatomically reasonable spatial patterns. For example, PC1 and PC2 appear to have the highest magnitude in different cortical layers in the motor and visual areas, whereas PC3 has the highest magnitude in the lamina of the visceral and gustatory areas.
We generated domains at k = 25, 354, and 670, to match the division, structure, and substructure annotations’ resolution in CCFv3, displaying domains for four consecutive tissue sections (Fig. 2a). We also provide representative images of spatial clusters across the brain (28/53 sections) at different k in Supplementary Figs. 3–5. Low domain numbers such as k = 25 broadly divide the brain into neuroanatomically plausible patterns, with subregions of striatum (dorsal and ventral marked in Fig. 2a) and cortical layers clearly visible. A comparison of cortical layers across these sections shows that CellTransformer domains at k = 25 are well matched to CCF (Supplementary Fig. 6b) and correctly identify major classes of layers (1, 2/3 4, 5, and 6) across somatosensory and somatomotor cortex. In particular, we point out the excellent correspondence of domains across tissue sections at k = 25 across the entire dataset (Supplementary Fig. 3), with nearly perfect consistency across regions. This suggested that our neighborhood representation method was robust enough to enable integration without modeling of batch or tissue-level covariates.
Fig. 2: Representative images of spatial domains discovered using CellTransformer on the Allen 1 dataset (53 coronal sections and 500 gene MERFISH panel1) and comparison to CCF. a Four sequential tissue sections (the inter-section distance is 200 µm) from anterior (first row, corresponding to section 50) to posterior (bottom row, section 47). In the first three columns, each dot is a cell, colored by spatial domain identified by CellTransformer when clustering was conducted with k = 25, 354, and 670 domains (the CCF division, structure, and substructure domain resolutions). Spatial domain labels are depicted with the same colors across sections within the same column. Fourth column shows CCF region registration to the same tissue section. Select regions are annotated with CCF labels. MO motor cortex, SS somatosensory cortex, ACA anterior cingulate, CP caudoputamen, LSX lateral septum, MSX medial septum, VISC visceral cortex, GU gustatory cortex, PIR piriform cortex, OT olfactory tubule, ACB nucleus accumbens, HY hypothalamus. b Single hemisphere images of same tissue sections in (a) domains fit at k = 670, zoomed in on cortical layers of motor cortex (MO) and somatosensory cortex (SS). CCF boundaries are shown in semi-transparent lines, with the boundary between SS and MO outlined in larger black dotted lines. c Spatial homogeneity (see Methods) of domains from different methods including recently published methods CellCharter and SPIRAL. d Proportion of discrete domains by model as compared to CCF at the same resolution. Single-cell level spatial smoothness was averaged over each domain instead of averaged over the entire dataset as in (c). For each CCF annotation level we computed a threshold as the 20th percentile of per-domain spatial smoothness values. We applied this threshold to the distribution of data-driven domains at the same resolution. e Average Pearson correlation (averaging over number of domains and method) of the maximum Pearson correlation between the cell type composition (at subclass level, 338 types) vectors of data-driven regions with CCF ones. f Average Pearson correlation (averaging over number of domains and method) of optimal matched pairs between data-driven and CCF regions, where CCF regions are only allowed to pair with one data-driven region per comparison. Matches fit using linear programming. g Region-by-region Pearson correlation matrix comparing cell type composition vectors from 670 CCF regions (at substructure level) with 670 spatial domains from CellTransformer. The CCF regions are shown on the left with their structure annotations from CCF at division level on the side of the plot. Correlations above 0.9 are shown in bright green to assist in visualization. h Cell type (cluster level) by region matrix for 670 CCF regions at substructure level. i Cell type (cluster level) by region matrix for 670 CellTransformer regions. Rows are normalized to sum to 1 in both (h, i). Colors along x-axis in both (g, h) show cell class annotations from ABC-WMB cell type taxonomy at class level to allow for visualization of composition in terms of known types. Cell types in the “09 CNU-LGE GABA” class are boxed in purple in (h, i) matching their color in the legend. Rows of both (h, i) are grouped using clustering to produce approximately similar structure.
At k = 354, anterior-posterior subdivisions emerge such as the presence of layer 4 in the motor cortex (Fig. 2a, see Supplementary Fig. 6d, e). Historically, the mouse motor cortex was thought to lack a granular layer 418, however recently, MERFISH, transcriptomic and epigenomic studies have confirmed its existence1,18,19. At k = 100 and k = 354, we find a domain corresponding to Layer 4 in the somatosensory cortex which clearly extends to layer 4 in the motor cortex. We also provide a UMAP visualization of the CellTransformer embedding space with cluster labels from k = 354 clustering (Supplementary Fig. 7), demonstrating the observed domain spatial organization can also be recovered in a low-dimensional representation.
At k = 670, the cortical layers identified at lower resolution are further partitioned into superficial, intermediate, and deep strata within several layers. We visualize cortical layers across sections in depth (Fig. 2b), showing CellTransformer not only identifies fine superficial-deep structure within cortical layers but also preserves the boundary between somatosensory and motor cortex (marked in thick black dotted lines in Fig. 2b). Taken together these results showed that CellTransformer robustly describes previously known anatomical structures.
We also examined the caudoputamen at various choices of k. At k = 25, the caudoputamen is one domain, which separates into broad spatially contiguous domains at k = 100. Interestingly, at k = 354 and k = 670, we observe domains that intermingle in a grid-like pattern (Fig. 2a, Supplementary Fig. 8) that strongly resembles the Voronoi parcellation established in Hintiryan et al.20 through systematic projection mapping to caudoputamen. Notably, CellTransformer also captures the quadrant pattern in intermediate caudoputamen (sections 52, 50, and 49 in Supplementary Fig. 8), which Hintiryan et al. attributed to the differences in subnetwork reorganization. The correspondence of our transcriptomic domains to the Hintiryan et al. results, which are exclusively based on projection mapping (non-transcriptomic data), suggests the biological relevance of our representation learning workflow.
We compared CellTransformer to several other workflows to capture spatial coherency and multiresolution neuroanatomical annotations in CCF at the division, structure, and substructure levels. For comparison, we used two recent methods, CellCharter21 and SPIRAL22 that are scalable to millions of cells as benchmarks. CellCharter builds spatially informed embeddings for domain detection by concatenating the embeddings across scales, followed by dimensionality reduction and batch correction, while SPIRAL uses graph neural networks for batch effect correction and integration across scales. Additionally, we implemented two machine learning baselines. Gene-expression based domain detection is employed for spatial transcriptomics data analysis and has been shown to be highly successful for brain parecllation23,[24](#ref-CR24 “Maher, K. et al. Mitigating autocorrelation during spatially resolved transcriptomics data analysis. 2023.06.30.547258 Preprint at https://doi.org/10.1101/2023.06.30.547258
(2023).“),25. Therefore, we conducted *k-*means clustering on the single-cell MERFISH probe counts. We also employed *k-*means clustering on cellular neighborhoods (represented as cell type count vectors). Many other GPU-accelerated methods, such as scENVI[4](https://www.nature.com/articles/s41467-025-64259-4#ref-CR4 “Haviv, D. et al. The covariance environment defines cellular niches for spatial inference. Nat. Biotechnol. 1–12 https://doi.org/10.1038/s41587-024-02193-4
(2024).“), STACI26, spaGCN5, STAligner6, STAGATE27, or GraphST[28](https://www.nature.com/articles/s41467-025-64259-4#ref-CR28 “Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST | Nat. Commun. https://www.nature.com/articles/s41467-023-36796-3
.“), cannot be run on datasets that contain millions of cells due to computational constraints (see Methods). One reason is that several of these methods require the instantiation of a dataset-wide pairwise distance matrix between all cells either on GPU or in RAM, which is a prohibitively large matrix (~60TB for ~4 M cells).
To quantify the spatial coherence of domains, we developed a single-cell level spatial homogeneity metric. For each cell, we identified its nearest 100 spatial neighbors. We then quantified the proportion of neighbor cells within the same spatial domain label as the starting cell (Fig. 2c). Ideally, we would expect a high proportion of neighbor cells to be in the same spatial domain as the starting cell. In this comparison of neighborhood spatial smoothness, CellTransformer outperforms CellCharter (58.2% better spatial coherence at 670 domains) and SPIRAL (4091.2%). CellTransformer also outperforms the machine learning baseline based on *k-*means clustering on cellular neighborhoods (61.9% better spatial coherence at k = 670), as well as the gene expression baseline (419.2% better spatial coherence at k = 670). We also compared versions of CellTransformer domains with and without the optional smoothing step. Both versions perform very similarly, indicating the CellTransformer representation efficiently suppresses high-frequency spatial features and changes relatively little after smoothing with a small bandwidth. For reference, we include the CCF parcellation (dashed purple line) in this comparison to provide an upper bound.
CellTransformer utilizes cell type information in both the encoder and decoder portions of the network. To understand the impact of this choice, we trained two additional variants of CellTransformer, one without cell type information in the decoder and one without cell type information in the encoder and the decoder (only expression). We generated spatial domains from these models’ embeddings with smoothing, performed identically as in the base model. These variants perform competitively with the base CellTransformer version. The base model domains were 3.0% more similar to CCF than the model without cell type decoding, and 5.4% more similar than the model without cell type in either the encoder or decoder (Supplementary Fig. 9a). The smoothed and unsmoothed CellTransformer domains were also very similarly spatially smooth (Supplementary Fig. 9b), even when the number of domains was extended to 2000. CellTransformer variants also produced relatively little decrease in spatial smoothness at greater than 1000 domains, whereas CellCharter smoothness sharply declined. Similarly to the CCF comparison, the model without cell type decoding performed better than the model without cell type in both the encoder and decoder. Taken together, these results indicate that cell type conditioning is an important component of the model that increases similarity to CCF and spatial uniformity. This may be because the cell types, which were fit using whole-transcriptome scRNA-seq profiling, implicitly comprise a whole-transcriptome imputation step. However, we also note that the CellTransformer variant without any cell type information still performs better than CellCharter with respect to similarity to CCF and spatial coherence.
In addition to dataset-wide spatial smoothness, we also quantified the proportion of discrete domains (Fig. 2d). To classify a given domain as spatially discrete, we first averaged the single-cell level smoothness values over domains. We then investigated the distribution of per-domain average smoothness (Supplementary Fig. 10a, b). CellTransformer’s regional smoothness distribution is more similar to CCF than CellCharter, with relatively more highly smooth regions. Additionally, CellTransformer domains with smoothing are more similar to CCF than without. For a given CCF annotation resolution, we then computed a threshold based on the 20th percentile of per-domain averaged CCF smoothness values, which we used to classify a given data-driven domain as discrete or not. This adaptive metric allows us to compare fairly to human annotations at different resolutions. CellTransformer domains fit with and without smoothing both perform well, while for CellCharter, SPIRAL, and the gene expression baselines, discreteness declines significantly at 354 and 670 domains (Fig. 2d). We also computed the proportion of spatially discrete domains for resolutions from 700 to 2000 data-driven domains using the 20th percentile cutoff from the k = 670 CCF annotation level. For CellCharter, the proportion of discrete domains significantly diminishes (Supplementary Fig. 10c), unlike CellTransformer. The unsmoothed CellTransformer embedding workflow was most performant; we reasoned that the isotropic Gaussian smoothing employed may have eroded fine laminar boundaries, despite removing the isolated non-uniform domains discovered in striatum (Supplementary Note 1). We visualized a series of sections comparing domains from the smoothed and unsmoothed CellTransformer embeddings, finding a slight erosion of fine lamina in the cortex (Supplementary Fig. 11) compared with the unsmoothed, consistent with this observation. We also provide visualizations of a larger set of sections in Supplementary Fig. 12. To quantify the similarity of detected domains with CCF annotations, we compared the cell type composition of domains using cell type calls from the ABC-WMB taxonomy. We again chose the subclass cell type level, extracting for each domain and for each method a 338-long cell-type vector. We calculated the Pearson correlation of cell type composition vectors computed using the CCF regional annotations at division (25), structure (354) and substructure (670) levels against those of the various methods at the corresponding number of spatial domains. First, for each data-driven domain, we computed the maximum correlation to any CCF domain at the same CCF annotation resolution averaging these numbers across domains. CellTransformer outperforms other methods at mid-granularity and fine-granularity (Fig. 2e). In this comparison, several data-driven regions can match the same CCF region, which in the worst case could provide an overly optimistic picture of the correspondence between data-driven domains and CCF. To address this, we conducted a second analysis where only one CCF region could be matched to a given data-driven one. We used linear programming to compute an optimal 1:1 pairing of data-driven regions to CCF ones based on their Pearson correlation. We then averaged correlations across regions (Fig. 2f). CellTransformer is highly performant, showing that increase in correlation is not due to redundant matches to a single area in CCF. Visualization of spatial clusters from CellCharter (Supplementary Figs. 13–14) at k = 670 domains across the brain and in midbrain shows lack of spatial coherence in cortical layers and midbrain, with detected domains distributed in a what appear to be non-biological patterns. This is particularly evident in thalamus and across cortical layers. In contrast, CellTransformer identified spatially coherent domains and uncovered plausible neuroanatomical structures.
Next, we directly compared CCF annotation domains with data-driven domains using normalized mutual information (NMI) and the adjusted Rand index (ARI) (Supplementary Fig. 15). CellTransformer (with and without smoothing) perform significantly better than comparator methods (13.4% higher NMI than CellCharter and 46.7% higher NMI than SPIRAL with smoothed embeddings, 13.5% higher NMI than CellCharter and 46.9% higher NMI than SPIRAL at 670 domains). Registering spatial transcriptomics data to CCF’s dense MRIs is significantly challenging. We therefore attribute the overall low magnitude of the ARI and NMI, to potential errors in registration.
To further characterize the similarity of CellTransformer domains with CCF, we plotted the Pearson correlation matrix (Fig. 2g) between cell type composition vectors generated at 670 domains (substructure level in CCF). Block structures with very high correlations (>0.9, shown in bright green) in the matrix clearly show that CellTransformer is able to identify regions that are highly similar with known ones without any labels. We also investigated the correspondence of cell type composition with more granular single-cell annotations, employing the lowest-level single-cell annotations (the “cluster” level, with 5274 cell types, as opposed to the “subclass” level with 338) from ABC-WMB. We observed high similarity between the “substructure” CCF domain set (Fig. 2h) and 670 CellTransformer domains (Fig. 2i) with an average Pearson correlation of CellTransformer to CCF domains of 0.853. This shows that the high correspondence of CCF and CellTransformer is robust to cell type resolution at which comparison occurs. CellTransformer identified an increase in the number of domains containing the 09 CNU-LGE GABA class (striatal/pallidal GABAergic neurons from lateral ganglionic eminence compared with the 670 CCF substructures, shown in light purple box in Fig. 2h, i), potentially suggesting the presence of uncharacterized developmental populations.
The observation of hierarchical grouping of domains at different choices of k (for example, delineation of cortical layers and sublayers with increasing number of domains) prompted us to develop a strategy to evaluate an optimal number of spatial domains based on two metrics. We implemented a previously published strategy29 for Drosophila embryonic spatial gene expression and30 for 3D spatial gene expression in the adult mouse brain to determine the optimal number of domains using a stability criterion. We reasoned that the optimal choice of spatial domain number would feature minimal variability across clustering runs. In brief, we computed 20 clustering instances with different random initializations for a large range k values (100–2000) and quantified their variability over these initializations (see Methods). Interestingly, stability increased with increasing k (Supplementary Fig. 16a, b). To facilitate the choice of a particular resolution for analysis, we also computed the inertia (sum of squared errors in embedding space) for each clustering solution. Low stability at small numbers of domains may partially explain subpar results for CellTransformer in the k = 25 CCF evaluations. We averaged the inertia curve and instability and computed the point of second derivative crossing to identify k = 1300 as our resolution for analysis (crossing point shown with red dot in Supplementary Fig. 16c).
CellTransformer is the only method out of the three deep learning methods we implemented (including six other pipelines, which were unable to cope with the size of the ABC-WMB dataset) to allow discovery of spatially coherent divisions at greater than CCF resolution. This study shows that spatial transcriptomics data can be used to identify brain regions at resolutions finer than previously defined in the CCF. CellTransformer domains are more spatially smooth than comparator deep learning methods, as well as clustering on either gene expression or cell type composition, both at the single-cell level and at the domain level. Additionally, CellTransformer domains are more similar to CCF, both at the single-cell level, using NMI and ARI, and at the regional composition level. CellTransformer can also be used with and without cell type labels, although performance is unsurprisingly better when using cell type information. Crucially, CellTransformer is highly performant at discovering a high number of domains, whereas methods such as CellCharter experience a significant loss of spatial coherence (Supplementary Fig. 14). Encouraged by these findings, we next sought to establish correspondence of particular domains at k = 1300 to known neuroanatomy.
Mapping of spatial domains in the hippocampal formation
We characterized CellTransformer’s ability to capture known anatomical structure in the hippocampal formation, notably the subiculum (SUB) and prosubiculum (PS), in the Allen 1 dataset. We focused on this area because it is well characterized with respect to both connectivity31 and transcriptomic composition32,33. These structures were investigated in Ding et al.34, where the authors performed consensus clustering of glutamatergic neurons and subsequent ISH experiments were used to comprehensively map domains in dorsal subiculum (SUBd) and dorsal and ventral prosubiculum (PSd and PSv). Specifically, this and other recent works have noted the extensive laminar organization (superficial layers to deeper layers), and the dorsal-ventral organization of the subiculum35,36,37. This organization has been attributed to distinct and correlated patterns of gene expression and connectivity.
We qualitatively compared spatial domains discovered by CellTransformer with k = 1300 to the anatomical borders identified in Ding et al. (Fig. 3a). The subiculum features a three-layer organization referred to as the molecular (mo) layer, the pyramidal cell (py) layer, and the polymorphic (po) cell layer. Figure 3a shows a diagram of SUB and PS regions based on Ding et al. with the pyramidal and polymorphic layers of SUB and PS annotated in bold black text. Figure 3b shows discovered spatial domains at k = 1300 across four sequential sections corresponding to those in Ding et al.34. A subset of domains corresponding to SUB and PS are shown in Fig. 3c along with putative regional annotations. CellTransformer identifies a three-layer organization in the dorsal subiculum corresponding to that in Ding et al. labeled SUBd-py (light green), SUBd-po (gold), and SUBd-mo (gray-blue). CellTransformer also correctly splits the SUBd and PSd shown with black dotted lines on the image of section 32. Three-layer strata are also observed in PSd, although notably the pyramidal layer domain extends caudally, consistent with transcriptomic studies31,32,[33](https://www.nature.com/articles/s41467-025-64259-4#ref-CR33 “Bienkowski, M. S. et al. Homologous laminar organization of the mouse and human sub