Main
Accurate cell segmentation is crucial for quantitative analysis and interpretation of various cellular imaging experiments. Modern spatial genomics assays can produce data on the location and abundance of 10² protein species and 10³ RNA species simultaneously in living and fixed tissues1,2,3,4,5. Accurate cell segmentation allows this type of data to be converted into interpretable tissue maps of protein localization and transcript abundances; these maps provide important insights into the biology of healthy and diseased tissues. Similarly, live-cell imaging provides insight into dynamic phenomena in bacterial and mammalian cell biology. Studying live-cell imaging data has provided mechanistic insights into critical phenomena such as the mechanical behavior of the bacterial cell wall6,7, information transmission in cell signaling pathways8,9,10,11,12,13, heterogeneity in immune cell behavior during immunotherapy14 and the morphodynamics of development15. Cell segmentation is also a key challenge for these experiments, as cells must be segmented and tracked to create temporally consistent records of cell behavior that can be queried at scale. These methods have seen use in several systems, including mammalian cells in cell culture13,16 and tissues5, bacterial cells17,18,19,20 and yeast21,22,23.
Considerable progress has been made in recent years on the problem of cell segmentation, driven primarily by advances in deep learning24. Progress in this space has occurred mainly in two distinct directions. The first direction seeks to find deep learning architectures that achieve state-of-the-art performance on cellular imaging tasks. These methods have historically focused on a particular imaging modality (for example, brightfield imaging) or target (for example, mammalian tissue) and have difficulty generalizing beyond their intended domain25,26,27,28,29,30,31. For example, Mesmer’s28 representation for a cell (cell centroid and boundary) enables good performance in tissue images but would be a poor choice for elongated bacterial cells. Similar tradeoffs in representations exist for the current collection of Cellpose models, necessitating the creation of a model zoo26,32. The second direction is to work on improving labeling methodology. Cell segmentation is an application of the instance segmentation problem, which requires pixel-level labels for every object in an image. Creating these labels can be expensive (US$0.01 per label, with hundreds to thousands of labels per image)28,33, which provides an incentive to reduce the marginal cost of labeling. A recent improvement to labeling methodology has been human-in-the-loop labeling, where labelers correct model errors rather than produce labels from scratch26,28,34. Further reductions in labeling costs can increase the amount of labeled imaging data by orders of magnitude.
Recent work in machine learning on foundation models holds promise for providing a complete solution. Foundation models are large deep neural network models (typically transformers35) trained on large amounts of data in a self-supervised fashion with supervised fine-tuning on one or several tasks36. Foundation models include the GPT37,38 family of models, which have proven transformative for natural language processing36. These types of attention-based models have recently been used for processing biological sequences39,40,41,42,43. These successes have inspired similar efforts in computer vision. The Vision Transformer (ViT)44 was introduced in 2020 and has since been used as the basis architecture for a collection of vision foundation models45,46,47,48,49. A key feature of foundation models is the scaling of model performance with model size, dataset size and compute50; these scaling laws have been observed for both language and vision models51. These scaling laws offer a path toward generalist models for cellular image analysis by increasing dataset and model size in exchange for dealing with the increased compute cost of training foundation models. This is in contrast to previous efforts that have focused on model architecture design and representation engineering.
One recent foundation model well suited to cellular image analysis is the Segment Anything Model (SAM)52. This model uses a ViT to extract information-rich features from raw images. These features are then directed to a module that generates instance masks based on user-provided prompts, which can be either spatial (for example, an object centroid or bounding box) or semantic (for example, an object’s visual description). Notably, the promptable nature of SAM enabled scalable dataset construction, as preliminary versions of SAM allowed labelers to generate accurate instance masks with 1–2 clicks. The final version of SAM was trained on a dataset of 11 million images containing over 1 billion masks and demonstrated strong performance on various zero-shot evaluation tasks. Recent work has attempted to apply SAM to problems in biological and medical imaging, including medical image segmentation53,54,55,56, lesion detection in dermatological images57,58, nuclear segmentation in hematoxylin and eosin (H&E) images59,60 and cellular image data for use in the Napari software package61.
Works such as MicroSAM61 or MedSAM56 use SAM’s original workflow to speed up annotation of cells and medical data, label a large dataset and then fine-tune the original SAM model. However, reliable automated segmentation is still missing in these works. Although promising, these studies reported challenges adapting SAM to these new use cases53,61. These challenges include reduced performance and uncertain boundaries when transitioning from natural to medical images. Cellular images contain additional complications: they can involve different imaging modalities (for example, phase microscopy versus fluorescence microscopy), thousands of objects in a field of view (FOV) (as opposed to dozens in a natural image) and uncertain and noisy boundaries (artifacts of projecting three-dimensional objects into a two-dimensional plane)61.
In addition to these challenges, SAM’s default strategy for automatic prompting does not allow for accurate inference on cellular images. SAM’s automated prompting uses a uniform grid of points to generate masks, an approach that is poorly suited to cellular images given the wide variation of cell densities. More precise prompting (for example, a bounding box or mask) requires prior knowledge of cell locations. Because cellular images often contain a large number of cells, it is impractical for users to provide prompts to SAM manually. This limitation makes it challenging for SAM to serve as a foundation model for cell segmentation because it still requires substantial human input for inference. A solution that enables the automatic generation of prompts would enable SAM-like models to serve as foundation models and knowledge engines, as they could accelerate the generation of labeled data, learn from them and make that knowledge accessible to life scientists via inference.
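For reference, SAM's default automated workflow (as exposed by the public segment_anything package) places a uniform grid of point prompts over the image. The sketch below illustrates this mode; the checkpoint path, image file name and grid density are placeholders, not values used in this work.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a pretrained SAM model; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# SAM's automatic mode prompts the model with a uniform grid of points;
# points_per_side sets the grid density. A fixed grid is a poor match for
# cellular images, where object density varies by orders of magnitude.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = cv2.cvtColor(cv2.imread("field_of_view.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "bbox", ...
```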
In this work, we developed CellSAM, a foundation model for cell segmentation (Fig. 1). CellSAM extends the SAM methodology to perform automated cellular instance segmentation. To achieve this, we first assembled a comprehensive dataset for cell segmentation spanning five broad data archetypes: tissue, cell culture, yeast, H&E and bacteria. Critically, we removed data leaks between training and testing data splits to ensure an accurate assessment of model performance. To automate inference with CellSAM, we developed CellFinder, a transformer-based object detector that uses the Anchor DETR framework62. CellSAM and CellFinder share SAM’s ViT backbone for feature extraction; these features are first used by CellFinder to generate bounding boxes around the cells to be used as prompts for SAM. The bounding boxes (prompts) and ViT features are fed into a decoder to generate instance segmentations of the cells in an image. We trained CellSAM on a large, diverse corpus of cellular imaging data, enabling it to achieve state-of-the-art performance across 10 datasets. We also evaluated CellSAM’s zero-shot performance using a held-out dataset, LIVECell63, demonstrating that it substantially outperforms existing methods for zero-shot segmentation. A deployed version of CellSAM is available at https://cellsam.deepcell.org.
Fig. 1: CellSAM: a foundational model for cell segmentation.
CellSAM combines SAM’s mask generation and labeling capabilities with an object detection model to achieve automated inference. Input images are divided into regularly sampled patches and passed through a transformer encoder (that is, a ViT) to generate information-rich image features. These image features are then sent to two downstream modules. The first module, CellFinder, decodes these features into bounding boxes using a transformer-based encoder–decoder pair. The second module combines these image features with prompts to generate masks using SAM’s mask decoder. CellSAM integrates these two modules using the bounding boxes generated by CellFinder as prompts for SAM. CellSAM is trained in two stages, using the pretrained SAM model weights as a starting point. In the first stage, we train the ViT and the CellFinder model together on the object detection task. This yields an accurate CellFinder but results in a distribution shift between the ViT and SAM’s mask decoder. The second stage closes this gap by fixing the ViT and SAM mask decoder weights and fine-tuning the remainder of the SAM model (that is, the model neck) using ground truth bounding boxes and segmentation labels.
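The inference path in Fig. 1 can be summarized in pseudocode. In the sketch below, the CellFinder interface (cellfinder and its outputs) is a hypothetical stand-in for the module described above, while the SAM prompt-encoder and mask-decoder calls follow the public segment_anything code; the sketch is intended only to illustrate how the shared ViT features flow to the two modules, not to reproduce the released implementation.

```python
import torch

@torch.no_grad()
def cellsam_inference(image, vit, cellfinder, prompt_encoder, mask_decoder,
                      score_threshold=0.5):
    """Schematic CellSAM inference path (Fig. 1); component names are illustrative.

    image: preprocessed tensor of shape (1, 3, H, W).
    """
    # 1. Shared backbone: SAM's ViT turns image patches into dense features.
    features = vit(image)                 # (1, C, H/16, W/16)

    # 2. CellFinder decodes the same features into per-cell bounding boxes.
    boxes, scores = cellfinder(features)  # (N, 4) in (x0, y0, x1, y1), (N,)

    # 3. Each confident box is used as a prompt for SAM's mask decoder.
    cell_masks = []
    for box in boxes[scores > score_threshold]:
        sparse, dense = prompt_encoder(points=None, boxes=box[None], masks=None)
        low_res_mask, _ = mask_decoder(
            image_embeddings=features,
            image_pe=prompt_encoder.get_dense_pe(),
            sparse_prompt_embeddings=sparse,
            dense_prompt_embeddings=dense,
            multimask_output=False,
        )
        # Upsampling the low-resolution mask back to image size is omitted here.
        cell_masks.append(low_res_mask > 0.0)
    return cell_masks
```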
Results
Construction of a dataset for a generalist cell segmentation model
A major challenge with existing cellular segmentation methods is their inability to generalize across cellular targets, imaging modalities and cell morphologies. To address this, we curated a dataset from the literature containing two-dimensional images from a diverse range of targets (mammalian cells in tissues and adherent cell culture, yeast cells, bacterial cells and mammalian cell nuclei) and imaging modalities (fluorescence, brightfield, phase contrast and mass cytometry imaging).
Our final dataset consisted of TissueNet28, DeepBacs64, BriFiSeg65, Cellpose25,26, Omnipose66,67, YeastNet68, YeaZ69, the 2018 Kaggle Data Science Bowl (DSB) dataset70, a collection of H&E datasets71,72,73,74,75,76,77 and an internally collected dataset of phase microscopy images across eight mammalian cell lines (Phase400). We group these datasets into five types for evaluation: Tissue, Cell Culture, H&E, Bacteria and Yeast. As the DSB70 comprises cell nuclei that span several of these types, we evaluate it separately and refer to it as Nuclear, making a total of six categories for evaluation. Although our method focuses on whole-cell segmentation, we included DSB70 because cell nuclei are often used as a surrogate when the information necessary for whole-cell segmentation (for example, cell membrane markers) is absent from an image. Figure 2a shows the number of annotations per evaluation type. Finally, we used a held-out dataset, LIVECell63, to evaluate CellSAM’s zero-shot performance. This dataset was curated to remove low-quality images and images that did not contain sufficient information about the boundaries of closely packed cells. A detailed description of data sources and preprocessing steps can be found in Appendix A. Our full, preprocessed dataset is publicly available at https://cellsam.deepcell.org.
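For evaluation, each image is tagged with one of the six categories described above. The grouping below is an illustrative summary of the dataset descriptions in this section; the exact assignments in the released splits may differ.

```python
# Illustrative mapping of source datasets to the six evaluation categories;
# summarized from the descriptions above, not copied from the released splits.
EVAL_CATEGORIES = {
    "Tissue": ["TissueNet"],
    "Cell Culture": ["Cellpose", "BriFiSeg", "Phase400"],
    "H&E": ["H&E collection"],
    "Bacteria": ["DeepBacs", "Omnipose"],
    "Yeast": ["YeastNet", "YeaZ"],
    "Nuclear": ["DSB"],
}
```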
Fig. 2: CellSAM is a strong generalist model for cell segmentation.
a, For training and evaluating CellSAM, we curated a diverse cell segmentation dataset from the literature. The number of annotated cells is given for each data type. Nuclear refers to a heterogeneous dataset (DSB)70 containing nuclear segmentation labels. b, Segmentation performance for CellSAM and Cellpose across different data types. We compared the segmentation error (1 − F1) for models that were trained as generalists (that is, trained on the full dataset). Models were trained for a similar number of steps across all datasets. We observed that CellSAM-generalist had a lower error than Cellpose-generalist on all tested data categories. Furthermore, we validated this finding on a held-out competition dataset from the Weakly Supervised Cell Segmentation in Multi-modality High-Resolution Microscopy Images challenge (that is, the NeurIPS challenge)91. Error bars show the mean and s.e. of the per-image segmentation error. The categories contained the following number of samples: Tissue = 330, Cell Culture = 144, H&E = 51, Bacteria = 260, Yeast = 32 and Nuclear = 56. c, Human versus human and CellSAM-generalist versus human (CellSAM/human) inter-rater performance comparison. A two-sided t-test confirms that no statistical difference exists between CellSAM and human performance. d, Qualitative results of CellSAM segmentations for different data and imaging modalities. Predicted segmentations are outlined in red. NS, not significant.
CellSAM creates masks using box prompts generated from CellFinder
In early experiments, we found that providing ground truth bounding boxes as prompts to SAM (ground truth prompts represent an upper bound on performance) achieved substantially higher zero-shot performance than point prompting (Extended Data Fig. 1). This is in agreement with previous analyses of SAM applied to biological61 and medical53 images. Because the ground truth bounding box prompts yield accurate segmentation masks from SAM across various datasets, we sought to develop an object detector that could generate prompts for SAM in an automated fashion. Given that our zero-shot experiments demonstrated that ViT features can form robust internal representations of cellular images, we reasoned that we could build an object detector using the image features generated by SAM’s ViT. Previous work explored this space and demonstrated that ViT backbones can achieve state-of-the-art performance on natural images78,79. For our object detection module, we use the Anchor DETR62 framework with the same ViT backbone as the SAM module; we call this object detection module CellFinder. Anchor DETR is well suited for object detection in cellular images because it formulates object detection as a set prediction task. This allows it to perform cell segmentation in images with densely packed objects, a common occurrence in cellular imaging data. Alternative bounding box detection methods (for example, the R-CNN family) rely on non-maximum suppression (NMS)80,81, leading to poor performance in this regime. Methods that frame cell segmentation as a dense, pixel-wise prediction task (for example, Mesmer28, Cellpose25 and Hover-net30) assume that each pixel can be uniquely assigned to a single cell and cannot handle overlapping objects.
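For reference, box prompting with the public segment_anything package takes roughly the following form; the checkpoint path, image file name and box coordinates are placeholders, and in CellSAM the boxes are supplied by CellFinder rather than by hand.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("cells.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the ViT runs once; its features are cached

# One box per cell in (x0, y0, x1, y1) pixel coordinates. In CellSAM these
# come from CellFinder; the values below are placeholders.
boxes = np.array([[12, 30, 58, 80], [70, 24, 120, 96]])

cell_masks = []
for box in boxes:
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    cell_masks.append(masks[0])  # (H, W) boolean mask for this cell
```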
The ground truth prompting scheme by itself does not achieve real-world performance standards. Our analysis showed that SAM cannot accurately segment many cell types, likely due to the distribution of images seen during training. To adapt CellSAM from natural images to cellular images, we fine-tune the SAM model neck (the layers connecting SAM’s ViT to its decoder) while leaving other layers frozen to retain generalization ability. Training CellSAM in this manner achieved state-of-the-art accuracy when provided with ground truth bounding box prompts (Supplementary Fig. 1).
We train CellSAM in two stages; the full details can be found in the supplementary materials. In the first stage, we train CellFinder on the object detection task. We convert the ground truth cell masks into bounding boxes and train the ViT backbone and the CellFinder module. Once CellFinder is trained, we freeze the model weights of the ViT and fine-tune the SAM module as described above. This accounts for the distribution shifts in the ViT features that occur during the CellFinder training. Once training is complete, we use CellFinder to prompt SAM’s mask decoder. We refer to the collective method as CellSAM; Fig. 1 outlines an image’s full path through CellSAM during inference.
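A minimal sketch of how the two training stages partition the trainable parameters is shown below; attribute names follow the public segment_anything code layout (for example, sam.image_encoder.neck), and the exact split and hyperparameters used for CellSAM may differ.

```python
import torch

def stage1_parameters(vit, cellfinder):
    # Stage 1: train the ViT backbone and CellFinder jointly on box detection.
    return list(vit.parameters()) + list(cellfinder.parameters())

def stage2_parameters(sam):
    # Stage 2: freeze the ViT blocks and the mask decoder; fine-tune only the
    # neck connecting them. Attribute names follow the public segment_anything
    # layout; the exact split used in CellSAM may differ.
    for p in sam.parameters():
        p.requires_grad = False
    for p in sam.image_encoder.neck.parameters():
        p.requires_grad = True
    return [p for p in sam.parameters() if p.requires_grad]

# Example: optimizer for stage 2 (learning rate is illustrative).
# optimizer = torch.optim.AdamW(stage2_parameters(sam), lr=1e-4)
```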
Benchmarking CellSAM’s performance on numerous biological datasets
We benchmarked CellSAM’s performance using the F1 error (1 − F1) as the metric (Fig. 2b) against Cellpose, a widely used cell segmentation algorithm. Because our work includes both dataset and model development, we chose benchmarks that allow us to measure the contributions of data and model architecture to overall performance. Our benchmarks include comparisons to a pretrained generalist Cellpose model (cyto3), an internally trained generalist Cellpose model and a suite of internally trained specialist (that is, trained on a single dataset) Cellpose models. Internally trained models were trained on the CellSAM dataset or a suitable subset using previously published training recipes, whereas evaluations were performed on a held-out split of the same dataset. We further evaluated CellSAM’s performance on the evaluation split of the NeurIPS Cell Segmentation Challenge82 (Fig. 2b). For this evaluation, we fine-tuned CellSAM with an additional hematology dataset, which was a substantial fraction of the NeurIPS challenge dataset. In almost every comparison, we found that CellSAM outperformed generalist Cellpose models (whether pretrained or internally trained) and was equivalent to specialist Cellpose models trained exclusively on individual datasets. We highlight features of our benchmarking analyses below.
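The benchmark metric is the per-image segmentation error 1 − F1, computed from object-level matches between predicted and ground truth cells. The sketch below assumes a standard IoU matching threshold of 0.5 and a greedy one-to-one matching, neither of which is specified in the text.

```python
import numpy as np

def f1_error(iou_matrix, threshold=0.5):
    """Segmentation error (1 - F1) for one image, from a pairwise IoU matrix
    between predicted objects (rows) and ground truth objects (columns).
    The 0.5 threshold and greedy matching are assumptions, not from the text."""
    n_pred, n_true = iou_matrix.shape
    if n_pred == 0 and n_true == 0:
        return 0.0
    tp, matched_true = 0, set()
    if n_pred and n_true:
        # Greedily match each prediction to its best unmatched ground truth object.
        for i in np.argsort(-iou_matrix.max(axis=1)):
            j = int(iou_matrix[i].argmax())
            if iou_matrix[i, j] >= threshold and j not in matched_true:
                matched_true.add(j)
                tp += 1
    fp, fn = n_pred - tp, n_true - tp
    return 1.0 - 2 * tp / (2 * tp + fp + fn)
```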
CellSAM is a strong generalist model. Generalization across cell morphologies and imaging datasets has been a major challenge for deep-learning-based cell segmentation algorithms. To evaluate CellSAM’s generalization capabilities, we compared the performance of CellSAM and Cellpose models trained as specialists (that is, on a single dataset) to generalists (that is, on all datasets). Consistent with the literature, we observe that Cellpose’s performance degraded when trained as a generalist (Extended Data Fig. 3). By contrast, we found that the performance of CellSAM-generalist was equivalent to or better than CellSAM-specialist across all data categories and datasets (Extended Data Fig. 3). Moreover, CellSAM-generalist outperformed Cellpose-generalist in all data categories (Fig. 2b and Extended Data Figs. 2 and 3). This analysis highlights an essential feature of a foundational model: maintaining performance with increasing data diversity and scale.
CellSAM achieves human-level accuracy for generalized cell segmentation. We use the error (1 − F1) to assess the consistency of segmentation predictions and annotator masks across a series of images. We compared the annotations of three experts with each other (human versus human) and with CellSAM (human versus CellSAM). This comparison explores whether CellSAM’s performance is within the error margin created by annotator preferences (for example, the thickness of a cell boundary). We compared annotations across four data categories: mammalian cells in tissue, mammalian cells in cell culture, bacterial cells and yeast cells. A two-sided t-test revealed no significant differences between these two comparisons, indicating that CellSAM’s outputs are similar to expert human annotators (Fig. 2c). This is demonstrated by non-significant P values between CellSAM-annotator and inter-annotator agreements, specifically for Tissue: P = 0.18, Cell Culture: P = 0.49, Yeast: P = 0.11 and Bacteria: P = 0.90.
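This comparison reduces to a two-sided t-test on per-image agreement scores; a minimal sketch using SciPy is shown below, with the equal-variance form of the test assumed.

```python
from scipy import stats

def compare_raters(human_vs_human_f1, cellsam_vs_human_f1):
    """Two-sided t-test on per-image F1 agreement scores for one data category.

    A non-significant p-value (e.g., P = 0.18 reported for Tissue) indicates that
    CellSAM-versus-human agreement falls within human-versus-human variability.
    The equal-variance form of the test is our simplifying assumption."""
    result = stats.ttest_ind(human_vs_human_f1, cellsam_vs_human_f1)
    return result.statistic, result.pvalue
```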
CellSAM enables fast and accurate labeling. When provided with ground truth bounding boxes, CellSAM achieves high-quality cell masks without any fine-tuning on unseen datasets (Extended Data Fig. 1). Because drawing bounding boxes consumes considerably less time than drawing individual masks, this means that CellSAM can be used to generate highly accurate labels quickly, even for out-of-distribution data.
CellSAM is a strong zero-shot and few-shot learner. We used the LIVECell dataset to explore CellSAM’s performance in zero-shot and few-shot settings. We stratified CellSAM’s zero-shot performance by the cell lines present in LIVECell (Extended Data Figs. 4b and 5). We found that although performance varied by cell line, we could recover adequate performance in the few-shot regime for a number of the cell lines (for example, A172). Extended Data Fig. 4 shows that CellSAM improves its performance with only 10 additional FOVs (10²–10³ cells) for each cell line. We found that fine-tuning could not recover performance for cell lines with morphologies far from the training data distribution (for example, SH-SY5Y). This may reflect a limitation of bounding boxes as a prompting strategy for SAM models.
CellSAM enables diverse bioimage analysis workflows
Cell segmentation is a critical component of many spatial biology analysis pipelines; a single foundation model that generalizes across cell morphologies and imaging methods would fill a crucial gap in modern biological workflows by expanding the scope of the data that can be processed. In this section, we demonstrate how the same CellSAM-generalist model (not fine-tuned to any particular dataset) can be used across biological imaging pipelines by highlighting two use cases: spatial transcriptomics and live-cell imaging (Fig. 3).
Fig. 3: CellSAM enables diverse bioimage analysis workflows.
Because CellSAM-generalist functions across image modalities and cellular targets, it can be immediately applied across bioimaging analysis workflows without requiring task-specific adaptations. a, We schematically depict how CellSAM-generalist fits into the analysis pipeline for live-cell imaging and spatial transcriptomics, eliminating the need for different segmentation tools and expanding the scope of possible assays to which these tools can be applied. b, Segmentations from CellSAM are used to track cells87 and quantify fluorescent live-cell reporter activity in cell culture. c, CellSAM segments cells in multiple frames from a video of budding yeast cells. These cells are tracked across frames using a tracking algorithm87 that ensures consistent identities, enabling accurate lineage construction and cell division quantification. d, CellSAM is used to segment slices of a three-dimensional image, and these segmented slices are fed into u-Segment3D89 to create a three-dimensional segmentation. e, Segmentations generated using CellSAM are integrated with Polaris85, a spatial transcriptomics analysis pipeline. Because of CellSAM’s generalist nature, we can apply this workflow across sample types (for example, tissue and cell culture) and imaging modalities (for example, seqFISH and MERFISH). Datasets of cultured macrophage cells (seqFISH) and mouse ileum tissue (MERFISH)86 were used to generate the data in this example. MERFISH segmentations were generated with CellSAM with an image of a nuclear and membrane stain; seqFISH segmentations were generated with CellSAM with a maximum intensity projection image of all spots. 3D, three-dimensional.
Spatial transcriptomics methods measure single-cell gene expression while retaining the spatial organization of the sample. These experiments (for example, MERFISH83 and seqFISH84) fluorescently label individual mRNA transcripts; the number of