Main
The haematological system is among the most complex physiological systems and is uniquely interconnected with all others. Though often quantified by simple ‘blood counts’ of cell class frequencies, its characteristics are both supremely rich and highly variable within and across individuals1. Characterizing the morphological appearances of individual blood cells as seen on light microscopy is often critical to the management of haematological disorders. The complex modulation of cell morphology by diverse biological, pathological and instrumental factors requires that this task be performed by trained experts. Moreover, labelling cells by their major morphological type, for example, lymphocyte, is only the crudest form of description, on which finer fractionation into subtypes, across a wide spectrum of (ab)normality, is overlaid.
Indeed, the task of morphological characterization is open ended and lacks a definitive ground truth: there may be morphological patterns whose subtlety has concealed great clinical importance, and some morphological classes are purely expert-determined visual phenotypes with no means of objective corroboration. Moreover, pathological appearances may be highly unusual or unique, precluding classification into any class, even at the simplest level of description, and such anomalies should be explicitly identified. The difficulty is commonly compounded by interactions with irrelevant biological features with variable representation across the population, and by instrumental variations of technical origin2,3. The challenge, in short, is one that human experts can only imperfectly meet, inevitably exhibiting marked variation with skill and experience4,5; the training of machine learning (ML)-based models to automate morphological characterization is therefore intrinsically difficult.
Recent work has applied discriminative models, particularly convolutional neural networks, to the morphological assessment of blood cells. These approaches have been used for leukaemia diagnosis6, lymphoblast classification7, identification of genetic acute myeloid leukaemia subtypes8 and assessment of stored red blood cell morphologies9,10. Additionally, convolutional neural networks have been applied to the differentiation of bone marrow cell morphologies when trained on large image datasets11. These studies have demonstrated the potential of ML in morphological assessment, and some models have received FDA approval for clinical practice.
A desirable automated cell characterization model would have the following five key properties. First, it should be robust to domain shift and generalize to different biological, pathological and instrumental contexts and class distributions12,13. Second, the model should achieve high data efficiency, performing well despite sparse ground-truth labels and restricted access to comprehensive datasets, as is common in clinical applications. Third, the model’s decisions should be interpretable where possible, as the reasoning behind a model’s decisions may be as important as the decisions themselves14,15. Fourth, the model should be able to identify rare or previously unseen patterns of features, as such cases fall outside the model’s competence and must be highlighted as such. This is particularly important in clinical applications yet often overlooked in model development and evaluation16. Finally, the model should be able to quantify the uncertainty attached to its decision, a feature often neglected in model assessments17.
Although optimally performing discriminative ML classification models can approximate human performance at classifying cells into predefined classes, they primarily learn a decision boundary based on expert labels. Consequently, they are not inherently designed to capture the full data distribution of cellular appearances. This limitation can make them less capable of delivering some of the desirable properties outlined above, such as intrinsic robustness to domain shifts, natural anomaly detection for unseen cell types or high data efficiency, particularly when dealing with the complexities and variability inherent in clinical haematology data18,19. These open challenges limit the clinical applicability of purely discriminative approaches that only seek to replicate expert labelling.
Therefore, we introduce CytoDiffusion, a modelling approach centred around a diffusion-based generative model. Instead of merely learning a classification boundary, CytoDiffusion aims to model the full distribution of blood cell morphology. By capturing the underlying data distribution within a latent space, generative models offer several potential advantages for addressing the multifaceted challenges in clinical settings. These include facilitating greater robustness to distributional shifts, enabling inherent anomaly detection (as out-of-distribution samples are poorly represented), enhancing data efficiency, allowing interpretability through the generation of counterfactuals, and potentially streamlining the incorporation of new classes or finer subdivisions of existing ones. Classification is then performed based on this learned distributional representation, rather than being the sole objective of the model.
CytoDiffusion is developed and applied to real-world clinical challenges, building on recent work20,21,22,23 using generative models for classification challenges. We specifically chose diffusion models over alternative generative approaches based on their superior performance in modelling complex visual patterns24 and recent studies demonstrating their effectiveness as classifiers20,21,22,23. In the context of blood cell image classification, CytoDiffusion is compelled to learn the complete morphological characteristics of each cell type (by modelling the distribution) rather than focusing only on discriminative features near a decision boundary. Figure 1 illustrates the proposed modelling approach. The contributions of this work are (1) an application of latent diffusion models for blood cell image classification, (2) an evaluation framework that goes beyond accuracy and other standard metrics, incorporating domain shift robustness, anomaly detection capability and performance in low-data regimes, (3) a new dataset of blood cell images that includes artefacts and labeller confidence scores, addressing key limitations in existing datasets, (4) a principled framework for the evaluation of model and human confidence based on established psychometric modelling techniques and (5) a method for generating interpretable heat maps directly from the generative process to explain the model’s decisions.
Fig. 1: Overview of the diffusion-based classification model.
Representation of the diffusion-based classification process. An input image $x_0$ is first encoded into a latent space using an encoder $\mathcal{E}$. Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ is then added to create a noisy latent representation $z_t$. This noisy representation is fed through a diffusion model for each possible class condition $c$. The model predicts the noise $\epsilon_\theta$ for each condition. The classification decision is made by selecting the class that minimizes the error between the predicted noise $\epsilon_\theta$ and the true noise $\epsilon$.
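To make this decision rule concrete, the sketch below shows the core loop in PyTorch, assuming a trained noise-prediction network `eps_model(z_t, t, c)` and averaging the error over a fixed number of random $(t, \epsilon)$ draws. The function names, noise schedule and sample count are illustrative; the actual inference procedure uses the adaptive elimination scheme described in Methods.

```python
# Minimal sketch of diffusion-based classification (illustrative, not the
# exact implementation): pick the class whose conditional noise prediction
# best matches the true noise, averaged over random (t, eps) draws.
import torch

def classify(z0, eps_model, num_classes, n_samples=20, T=1000):
    errors = torch.zeros(num_classes)
    for _ in range(n_samples):
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(z0)
        # Forward diffusion with an assumed cosine schedule for alpha_bar.
        alpha_bar = torch.cos(t / T * torch.pi / 2) ** 2
        z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * eps
        for c in range(num_classes):
            eps_hat = eps_model(z_t, t, c)  # class-conditional noise prediction
            errors[c] += ((eps - eps_hat) ** 2).mean()
    return int(errors.argmin())
```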
Through these contributions, we aim to establish a standard for the development and assessment of blood cell image classification models. Our work addresses several important aspects of clinical applicability, including robustness, interpretability and reliability. We propose that the research community adopt these evaluation tasks and metrics when assessing new models for blood cell image classification. By going beyond simple discriminative statistics and other conventional benchmarks, we can develop models that are not only high performing but also trustworthy and clinically relevant.
Results
We begin by validating the quality of images generated by CytoDiffusion through an authenticity test. Next, we assess the model’s in-domain performance on standard classification tasks across multiple datasets. We then examine CytoDiffusion’s ability to quantify uncertainty, comparing its metacognitive capabilities with those of human experts. Following this, we evaluate the model’s proficiency in anomaly detection, crucial for identifying rare or unseen cell types. We proceed to test the model’s robustness to domain shifts, simulating real-world variability in imaging conditions. Subsequently, we investigate CytoDiffusion’s efficiency in low-data regimes, a critical consideration for medical applications in which large, well-annotated datasets may be scarce. Finally, we demonstrate CytoDiffusion’s explainability through the generation of counterfactual heat maps, providing interpretable insights into its decision-making process.
CytoDiffusion generates images indistinguishable from real images
The clinical adoption of artificial intelligence (AI) systems requires not only high performance but also trustworthiness in the model’s learned representations. To demonstrate that CytoDiffusion learns the distribution of morphological features rather than artefactual shortcuts, we conducted an authenticity test. CytoDiffusion was trained on a dataset comprising 32,619 images. To evaluate its generative performance, we enlisted ten expert haematologists to assess a total of 2,880 images (with each expert evaluating 288 images). The experts achieved an overall accuracy of 0.523 (95% confidence interval: [0.505, 0.542]) in distinguishing between real and synthetic images, with a sensitivity of 0.558 and a specificity of 0.489. This performance is comparable to random guessing, indicating that the synthetic images produced by CytoDiffusion are almost indistinguishable from real blood cell images, even to experienced professionals.
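As a simple consistency check, the reported interval agrees with a normal approximation to a binomial proportion over the 2,880 ratings (a sketch only; the exact interval method used may differ):

```python
# Normal-approximation 95% CI for the experts' real-vs-synthetic accuracy.
import math

p, n = 0.523, 2880
se = math.sqrt(p * (1 - p) / n)
print(f"95% CI: [{p - 1.96 * se:.3f}, {p + 1.96 * se:.3f}]")
# -> approximately [0.505, 0.541], matching the reported [0.505, 0.542]
```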
In addition, the quality of conditional synthesis was evaluated by comparing the experts’ cell type classifications of the synthetic images with the conditioning labels used during generation. The high agreement rate of 0.986 not only validates the quality of generation but also confirms that CytoDiffusion preserves class-defining morphological features. The ability to generate synthetic images indistinguishable from real ones indicates that CytoDiffusion has learnt the morphological distribution of blood cell appearances. Examples of generated images are shown in Supplementary Fig. 2.
CytoDiffusion demonstrates competitive classification performance
Although the primary focus of our work lies in uncertainty quantification, anomaly detection, robustness to domain shifts, efficiency in low-data scenarios and explainability, we first establish CytoDiffusion’s baseline performance on standard classification tasks to ensure that its foundation is robust. We evaluated CytoDiffusion across four datasets: CytoData (our custom dataset), Raabin-WBC25, PBC26 and Bodzas27. As shown in Table 1, CytoDiffusion achieves state-of-the-art performance on CytoData, PBC and Bodzas, demonstrating that our diffusion-based approach can match or exceed the capabilities of traditional discriminative models. Extended Data Fig. 1 shows the classification confusion matrices for CytoDiffusion across all four datasets.
CytoDiffusion outperforms human experts in uncertainty quantification
The biological realm is characterized by constitutional, incompletely reducible uncertainty. In every task, it is valuable to quantify not only the fidelity but also the uncertainty of the agent: human or machine. Metacognitive measures of this kind enable the qualification of predictions, stratification of case difficulty and principled ensembling of agents28,29.
Our dataset uniquely incorporates human expert confidence for all images, providing a rare opportunity to compare model uncertainty with human expert uncertainty. Quantifying uncertainty is complicated by two cardinal aspects of the task. First, uncertainty has no reliable ground truth in real-world settings. Second, uncertainty in the present context contains an aleatoric component (the constitutional discriminability of the classes) and an epistemic component (the agent’s ability to discriminate between them). The former is determined by the domain and the latter, by the characteristics of the agent. In the ideal case, the epistemic uncertainty is zero, and the agent’s uncertainty is wholly aleatoric, determined by how much discriminant signal the data contains. In such a case, the relation between uncertainty and accuracy ought to approximate that of an ideal psychophysical observer detecting a noisy signal, that is, the uncertainty measure should resemble discriminability.
This insight allows us to deploy the mature conceptual apparatus of psychometric function estimation to the task of evaluating an agent’s uncertainty. We use well-established Bayesian psychometric modelling techniques to derive a psychometric function for CytoDiffusion’s performance (Fig. 2a), revealing an excellent fit, with tight posterior distributions on the key threshold and width parameters (Fig. 2a, axes in the inset). Although direct measurement is impossible, this suggests CytoDiffusion’s uncertainty is dominated by the aleatoric component and its behaviour is close to that of an ideal observer.
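For readers wishing to reproduce this style of analysis, the sketch below fits a psychometric function by maximum likelihood; the analysis in this paper is fully Bayesian (yielding posterior distributions over threshold and width), so this is a simplified stand-in. The logistic form, the guess rate of 1/8 (assuming eight classes) and the toy data are illustrative assumptions.

```python
# Maximum-likelihood fit of a psychometric function: accuracy as a function
# of a confidence/discriminability index. Simplified, non-Bayesian sketch.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(params, x, n_correct, n_total, gamma=1/8, lam=0.01):
    thresh, width = params
    # Scale the sigmoid so `width` spans the 5%-95% rise of the curve.
    p = gamma + (1 - gamma - lam) * expit(2 * np.log(19) * (x - thresh) / width)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(n_correct * np.log(p) + (n_total - n_correct) * np.log(1 - p))

# Toy binned data: x = mean confidence per bin, counts per bin.
x = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
n_total = np.array([40, 60, 80, 60, 40])
n_correct = np.array([8, 25, 60, 55, 39])
fit = minimize(neg_log_lik, x0=[0.5, 0.4], args=(x, n_correct, n_total),
               method="Nelder-Mead")
print("threshold, width:", fit.x)
```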
Fig. 2: Bayesian psychometric analysis of model and expert performance.
Performance was evaluated on our custom CytoData test set (n = 1,000 images). a–d, Psychometric functions showing accuracy as a function of a discriminability index. In these panels, data points (black circles) represent the mean accuracy for images binned by confidence, and their size is proportional to the number of trials in each bin. The solid black line is the maximum-likelihood psychometric function fit to the data. The horizontal black error bar on the curve indicates the 95% credibility interval for the function’s threshold, estimated at 80% accuracy (unscaled by lapse and guess rates). The plots in the inset show the joint posterior probability density for the psychometric function’s parameters, width and threshold. a, Psychometric function for CytoDiffusion, with its own confidence score as the discriminability index. b, Psychometric function for a representative human expert (Expert 5), using CytoDiffusion’s confidence score as the discriminability index. c, Psychometric function for the same expert (Expert 5), using expert confidence as the discriminability index. d, Psychometric function for the ViT-B/16 model, with its own confidence score as the discriminability index. e,f, Comparison of psychometric function parameters (width and threshold) across the six human experts. The coloured circles represent the posterior mean of the parameter estimates, and the error bars represent the 95% credibility intervals. Parameters were estimated using either CytoDiffusion confidence (e) or mean expert confidence (f) as the index of signal strength.
This conclusion is reinforced by evaluating individual human expert performance, judged against expert consensus, with CytoDiffusion’s confidence as the measure of discriminability. The resultant function, illustrated for Expert 5 (Fig. 2b), not only exhibits a good fit but also describes the relationship better than consensus human expert confidence does (Fig. 2c), suggesting that CytoDiffusion’s metacognitive abilities are superior to those of human experts here. We also applied the same psychometric analysis to the vision transformer (ViT)-B/16 model (Fig. 2d). Although the psychometric function shows a good fit, the data points for ViT-B/16 exhibit non-monotonic behaviour at higher confidence values. This suggests that the discriminative model’s confidence estimates are less reliable precisely when high certainty would be most clinically valuable, unlike CytoDiffusion, which maintains a consistent relationship between confidence and accuracy. Examination of the estimated threshold and width parameters for each human expert, with CytoDiffusion (Fig. 2e) or human expert (Fig. 2f) confidence, shows that CytoDiffusion’s measure can distinguish between the varying abilities of human experts better than they themselves can.
CytoDiffusion excels at detecting anomalous cell types
For each dataset, we evaluate our model’s performance in detecting clinically important anomalous cell types. The detection of blast cells is crucial in screening for various haematological malignancies, particularly leukaemia and myelodysplastic syndromes, where high sensitivity is essential to minimize false negatives that could lead to missed diagnoses. As shown in Fig. 3a, for the Bodzas dataset, with blasts as the abnormal class, CytoDiffusion achieved both high sensitivity (0.905) and specificity (0.962). However, the ViT suffered from extremely poor sensitivity (0.281), making it inadequate for clinical applications.
Fig. 3: Anomaly detection and low-data performance comparison.
a, Kernel density estimate plots comparing the anomaly detection performance of ViT-B/16 (top row) with CytoDiffusion (bottom row) for erythroblasts (left and right columns) and blasts (middle column). The horizontal axis represents the normality score, normalized to [0, 1]. The sensitivity (Sens) and specificity (Spec) values show each model’s performance in detecting anomalous cells and correctly classifying normal samples. b, Model performance comparison under low-data conditions across four cytology datasets. The data points represent the mean balanced accuracy, and the shaded areas represent the standard deviation. Statistics were calculated from five independent training sessions. AUC, area under the curve.
For PBC and CytoData, both with erythroblasts as abnormal, CytoDiffusion achieved a higher sensitivity compared with ViT and maintained a high specificity. These results demonstrate our model’s ability to distinguish between normal cells it was trained on and abnormal cell types not present in the training data, as well as maintaining the high sensitivity required for clinical applications.
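One plausible instantiation of such anomaly scoring is sketched below: a cell whose best class-conditional diffusion error remains high is poorly explained by every trained class and can be flagged as anomalous. The exact normality score used in this work is not detailed here, so the mapping to [0, 1] is an illustrative assumption.

```python
# Hedged sketch of anomaly detection from per-class diffusion errors.
import numpy as np

def normality_score(per_class_errors):
    """per_class_errors: mean noise-prediction error for each trained class."""
    best = per_class_errors.min()       # error of the best-fitting class
    return 1.0 / (1.0 + best)           # higher score = more 'normal'

def is_anomalous(score, threshold):
    # Threshold chosen on validation data to balance sensitivity/specificity.
    return score < threshold
```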
CytoDiffusion shows robustness to domain shifts
To assess generalizability, we evaluated models across datasets with varying domain shifts. Models trained on Raabin-WBC were tested on Test-B (different microscopes and cameras) and LISC (different microscopes, cameras and staining). Models trained on CytoData were tested on PBC and Bodzas, where the PBC dataset was created using a different generation of CellaVision technology (DM9600 for CytoData and DM96 for PBC). The Bodzas dataset introduces another domain shift as it was manually stained, rather than using automated staining procedures. As shown in Extended Data Table 1, CytoDiffusion achieves state-of-the-art accuracy on all four datasets. These consistent performance advantages across varying degrees of domain shift demonstrate CytoDiffusion’s robustness to dataset variations, suggesting good generalization capabilities for real-world clinical applications.
CytoDiffusion outperforms discriminative models in low-data scenarios
To evaluate performance under limited-data conditions, we conducted experiments across the four previously described cytology datasets. For each dataset, we conducted training with limited subsets of 10, 20 and 50 images per class, simulating conditions of sparse data availability. Figure 3b demonstrates that CytoDiffusion consistently outperforms the discriminative models EfficientNetV2-M and ViT-B/16 across all four datasets. The advantage is particularly pronounced in the most data-scarce conditions, where traditional discriminative approaches struggle to generalize effectively.
CytoDiffusion provides visual explanations through counterfactual heat maps
Counterfactual heat maps highlight the regions of an image that would need to change for it to be classified as a different cell type. In Fig. 4a, we used an eosinophil as an example and prompted the model to consider what alterations would be necessary for this cell to be classified as a neutrophil, generating a heat map ($H_{\mathrm{neutrophil}}$) that highlights regions in which there are large errors in the latent space between the two classes. The overlay of this heat map on the original image reveals that the model focuses primarily on the granularity that distinguishes neutrophils from eosinophils, with areas of large colour deviation from the background indicating the most critical regions of difference.
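A hedged sketch of how such a heat map could be computed from latent-space errors is given below; it compares noise-prediction errors under the source and target class conditions and upsamples the discrepancy to image resolution. This is one plausible reading of the procedure, and all names are illustrative.

```python
# Illustrative counterfactual heat map H_c from class-conditional errors.
import torch
import torch.nn.functional as F

def counterfactual_heatmap(z_t, t, eps, eps_model, c_source, c_target, image_hw):
    err_src = (eps - eps_model(z_t, t, c_source)) ** 2
    err_tgt = (eps - eps_model(z_t, t, c_target)) ** 2
    # Regions that the target condition explains much worse than the source
    # indicate features that would need to change.
    h = (err_tgt - err_src).mean(dim=1, keepdim=True)  # average latent channels
    return F.interpolate(h, size=image_hw, mode="bilinear", align_corners=False)
```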
Fig. 4: Counterfactual visualizations for model explainability.
a, An example of generating a counterfactual explanation. Left: original image of an eosinophil. Centre right: counterfactual heat map ($H_{\mathrm{neutrophil}}$), which highlights areas that would need to change for the model to classify the image as a neutrophil. Far right: an overlay of the thresholded heat map on the original image, localizing the most critical features. b, Matrix of counterfactual heat maps for various cell-type transitions. The diagonal displays original images of each cell type, which serve as the source image for their respective columns. Each off-diagonal element in the same column represents a counterfactual heat map ($H_c$) showing the transition from the diagonal element (source) to the cell type of that row (target). Areas in the heat map with colours that deviate most from the background indicate regions in which there are large errors in the latent space between the two classes.
To provide a comprehensive view of CytoDiffusion’s abilities across all cell types, Fig. 4b shows the generated counterfactual heat maps for each possible class transition in the PBC dataset. This visualization offers insights into the model’s decision-making process for each cell type. For instance, when considering the transition from neutrophil to eosinophil (row 2, column 7), the model highlights regions in the cytoplasm (darker areas) where features should be added, largely maintaining the nuclear shape.
In particular, the heat maps also reveal the model’s understanding of subtle differences between similar cell types. In the transition from monocyte to immature granulocyte (Fig. 4b, row 4, column 6), the model indicates the difference between the more acidophilic cytoplasm of immature granulocytes and the greyish-blue cytoplasm of monocytes. Intriguingly, the model also suggests the filling of the monocytic vacuoles (appearing as dark spots in the heat map). This captures one of the typical morphological findings distinguishing monocytes from other normal blood cells and demonstrates the model’s ability to focus on nuanced information. These visualizations also serve as a validation tool, enabling the identification of potential model biases by revealing whether the model is focusing on clinically irrelevant areas during classification. This transparency in the decision-making process makes the model more trustworthy for clinical applications, as practitioners can verify that classifications are based on legitimate morphological features rather than artefacts or spurious correlations.
Discussion
This study introduces not only CytoDiffusion for haematological cell image classification but also a principled evaluative framework. Our approach is motivated by the desire to achieve a model with superhuman fidelity, flexibility and metacognitive awareness that can capture the distribution of all possible morphological appearances. These ambitions are attainable only by eliciting structure in the data beyond that which can be obtained simply from the expert labelling of images. Both our methodology and evaluative framework are grounded in the recognition of the inherent complexity of the target system along with the inherent constraints for AI modelling with medical imaging data, for example, relatively small dataset regimes and instrumental variation reflected in images. In addition, we propose that a comprehensive evaluation of medical imaging models should include multiple complementary assessment criteria. Although standard performance metrics are essential, we have deliberately evaluated our approach across several key dimensions: (1) robustness to domain shift, (2) ability to detect anomalies, (3) effectiveness with limited training data, (4) reliability in uncertainty quantification and (5) interpretability of outputs. By assessing all these aspects within a single study, we aim to provide a more complete picture of model capabilities and limitations relevant to clinical deployment.
Robustness of a classification model to domain shift, that is, its ability to generalize across different imaging conditions, is crucial for its practical application in clinical settings. This generalization capability is particularly important in haematology, where variations in microscope types, camera systems and staining techniques are common across different laboratories and hospitals. A model that performs well only under specific imaging conditions would have limited utility in real-world clinical practice, potentially leading to inconsistent or unreliable diagnoses when deployed in new environments. Additionally, the ability to identify rare or unexpected cell types is crucial in clinical scenarios, where the detection of abnormal cells can have important diagnostic implications. Furthermore, our assessment of model performance with reduced training examples examines the relationship between data availability and classification efficacy, illustrating the learning efficiency of the system. This feature is particularly beneficial for effectively handling rare or minority cell types, thereby influencing model selection and deployment decisions. Such data efficiency will be crucial when dividing classes into more granular subclasses encountered in clinical haematological assessments, many of which may only be sparsely represented.
Additionally, in any classification task, whether performed by a human or an ML algorithm, it is highly informative to understand the uncertainty in the final decision, a surrogate for how difficult the sample is to classify. This is particularly true for clinical data in which the classification result can inform an intervention or treatment decision. Currently, a method for the optimal evaluation of model uncertainty is not established in the ML literature. We introduce a framework for evaluating an agent’s uncertainty—human or machine—based on the expected structure of the purely aleatoric uncertainty of an ideal psychophysical observer. This approach allows us to quantify model uncertainty as departure from the ideal psychometric function relating purely aleatoric uncertainty to fidelity. We demonstrate that CytoDiffusion produces superior uncertainty estimates compared with human experts, with two key clinical implications. First, our approach enables efficient triage: cases with high certainty can be processed automatically, whereas uncertain cases can be flagged for human review. Second, the transparent quantification of model uncertainty may help build essential trust amongst clinical practitioners. Additionally, it provides a principled mechanism for weighting ensembles of models based on their confidence and for detecting domain shifts or equipment malfunctions through changes in the uncertainty distribution.
Moreover, interpretability remains essential for the clinical deployment of ML models. Although discriminative approaches can generate post hoc explanations through techniques like Grad-CAM or LIME30,31, CytoDiffusion generates counterfactual heat maps as direct outputs of the model’s generative process. These heat maps highlight regions that would need to change for an image to be classified differently, providing immediate insights into the morphological distinctions the model identifies between cell types. A key advantage of our approach is that these visualizations emerge naturally from the model’s operation without requiring additional interpretive layers, preserving the integrity of the decision-making rationale.
A key strength of our generative approach is its ability to learn a comprehensive representation of the data distribution, a capability that underpins its strong performance and holds potential for future discovery. This deep representational learning is the plausible mechanism for the model’s demonstrated success in tasks such as anomaly detection and robustness to domain shift. Effectively identifying an unseen cell type as anomalous relies on the model having learned a high-fidelity distribution of normal morphologies, going beyond the minimal features needed for simple classification. This is crucial because morphologically defined cell types are not natural but rather phenotypes defined by visual distinguishability combined with clinical utility. Crucially, some functionally distinct cell types may be visually indistinguishable, whereas others differ markedly in morphology despite functional relatedness. Thus, physiological or pathological relevance cannot be constrained to human intuition alone. Although the current study demonstrates the existence of this rich representational space, the explicit exploration of this space to identify novel, clinically important subclasses remains a promising direction for future work. For example, the learned representations could be used to characterize heterogeneities within existing classes, facilitating the identification of new morphological signals that can subsequently be evaluated for clinical relevance.
Our results suggest that CytoDiffusion can build on existing discriminative approaches by addressing key challenges in clinical deployment, from domain shift and uncertainty quantification to interpretable visual explanations and data efficiency, while maintaining competitive performance on standard metrics and without requiring extensive hyperparameter tuning or data-specific architectural modifications.
Our approach is not without limitations. The inference process of CytoDiffusion is computationally expensive, scaling with the number of classes. However, this is less problematic in the medical domain, where datasets typically have far fewer classes than general image classification tasks like ImageNet32. Moreover, CytoDiffusion’s adaptive allocation of computing resources, dedicating more effort to challenging images, improves its practical efficiency. For future applications with more granular division of blood cell types (for example, subdividing blasts into myeloblasts, lymphoblasts and monoblasts), recent work33 on hierarchical diffusion classifiers suggests a promising direction for maintaining computational efficiency through progressive category pruning. CytoDiffusion required an average of 42 iterations per image (0.043 seconds per iteration), resulting in a mean classification time of 1.8 seconds per image. Several optimizations could further improve efficiency: code optimization and model distillation should reduce computational requirements, parallelization across multiple computing resources would enable the simultaneous processing of images, and advancing hardware capabilities will naturally decrease relative computational costs over time.
There are many possible extensions to this study. First, we have not fully exploited the representation learning capabilities of generative models, where there is potential for further advances, for example, through the use of generative modelling architectures with compact latents. Moreover, generative models enable the detection and assurance of equity in medical diagnostics. Although not implemented in this study, conditioning on minority characteristics would make differences in the model’s perception across diverse demographic groups legible34, and could enable augmentation with counterfactually transformed images to mitigate the impact of class imbalance on equity. This capability is crucial for both detecting potential biases and enabling targeted remedial actions, ensuring fair and equitable application of AI in healthcare.
In conclusion, our generative approach with CytoDiffusion, combined with a comprehensive evaluation framework, offers a promising step towards more robust, interpretable and trustworthy AI systems in healthcare. Future work should not only refine these methods and assess their applicability to other medical imaging domains but also explicitly test their ability to promote fairness and mitigate bias.
Methods
Diffusion classifiers
Recent studies have shown that diffusion models can also be utilized as classifiers20,21,22,23. We use a latent diffusion model (Supplementary Section 1). Given an input image $\mathbf{x}$, we want to predict the most probable class $\hat{c}$. This can be formalized as finding the class $\hat{c}$ that maximizes the posterior probability $p(c=c_k\mid\mathbf{x})$. Using Bayes’ theorem, this is equivalent to
$$\hat{c}=\mathop{\mathrm{argmax}}\limits_{c_k}\,p(c=c_k\mid\mathbf{x})=\mathop{\mathrm{argmax}}\limits_{c_k}\,p(\mathbf{x}\mid c=c_k)\cdot p(c=c_k)=\mathop{\mathrm{argmax}}\limits_{c_k}\,\log p(\mathbf{x}\mid c=c_k),$$

(1)
assuming a uniform prior over the classes, $p(c=c_k)=\frac{1}{K}$, where $K$ is the number of classes.
As we do not have direct access to the (negative) log likelihood, we use our loss function to approximate it instead and, therefore, take
$$\hat{c}=\mathop{\mathrm{argmin}}\limits_{c}\,\mathbb{E}_{\epsilon,t}\left[w_t\left\Vert\epsilon-\epsilon_\theta(z_t,t,c)\right\Vert_2^2\right],$$
(2)
with the weight $w_t$ discussed in Supplementary Section 1. To achieve this, we randomly pick a time step $t$ and noise $\epsilon$ and calculate the error as

$$\left\Vert\epsilon-\epsilon_\theta(z_t,t,c)\right\Vert_2^2,$$

(3)

for all class labels $c$. These error values are then normalized using $w_t$ and the results are stored.
The process then repeats with newly sampled values of t and ϵ. Supplementary Fig. 4 provides an intuitive explanation of how our model makes its predictions.
Following the methodology in ref. 22, we repeatedly gather new sets of errors for each candidate class. Classes that are less likely to have the lowest error are progressively eliminated. This process can be interpreted as a successive elimination algorithm for best-arm identification in multi-armed bandit settings35,36. The elimination is achieved using a paired Student’s t-test.
Given that the errors do not perfectly follow the standard assumptions of a Student’s t-test, we use the same safeguards as those in ref. 22: a conservative P value of 2 × 10−3 and a requirement that each class must be scored a minimum of 20 times before elimination to minimize the chance of incorrectly pruning the correct class. This iterative procedure continues until there is only one class left or a maximum of 2,000 iterations is reached. Supplementary Fig. 1 provides an analysis of different P values, as well as various minimum and maximum numbers of iterations.
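The sketch below illustrates this elimination loop with the stated safeguards; `paired_scores` is an assumed helper that draws one $(t, \epsilon)$ pair and returns the weighted error of equation (3) for every surviving class, so that the t-test is genuinely paired.

```python
# Successive elimination over candidate classes (sketch; names illustrative).
import numpy as np
from scipy.stats import ttest_rel

def classify_by_elimination(classes, paired_scores, p_thresh=2e-3,
                            min_scores=20, max_iters=2000):
    history = {c: [] for c in classes}
    alive = set(classes)
    for _ in range(max_iters):
        batch = paired_scores(alive)        # one (t, eps) draw, scored per class
        for c in alive:
            history[c].append(batch[c])
        if len(history[next(iter(alive))]) < min_scores:
            continue                        # score each class >= 20 times first
        best = min(alive, key=lambda c: np.mean(history[c]))
        for c in list(alive - {best}):
            # One-sided paired t-test: is class c's error reliably higher?
            _, p = ttest_rel(history[c], history[best], alternative="greater")
            if p < p_thresh:
                alive.remove(c)
        if len(alive) == 1:
            break
    return min(alive, key=lambda c: np.mean(history[c]))
```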
General training setup
Unless otherwise specified, we used the following training configuration for all experiments. We used Stable Diffusion 1.5 (ref. 24) as our base model. For class conditioning, we bypassed the tokenizer and text encoder, directly feeding the model with one-hot-encoded vectors for each class, replicated vertically and padded horizontally to match the expected matrix of 77 × 768 dimensions. We utilized a batch size of 10, a learning rate of 10−5 with linear warm-up over 1,000 steps and trained on an A100-80GB GPU. Details of the training and inference parameters are provided in Supplementary Section 2.
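A minimal sketch of this conditioning scheme is shown below: a one-hot class vector is zero-padded to 768 dimensions and replicated across the 77 token positions that Stable Diffusion expects from its text encoder. The function name and tensor layout are illustrative.

```python
# Build a 77 x 768 class-conditioning matrix in place of text embeddings.
import torch

def class_embedding(class_idx, num_classes, seq_len=77, dim=768):
    assert class_idx < num_classes <= dim
    onehot = torch.zeros(dim)
    onehot[class_idx] = 1.0                        # one-hot, zero-padded to 768
    return onehot.unsqueeze(0).repeat(seq_len, 1)  # replicated to 77 rows

cond = class_embedding(class_idx=3, num_classes=8)  # e.g. eight cell classes
```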
Datasets
We utilized multiple datasets (described in Extended Data Table 2 and Supplementary Table 2), including four that are publicly available and one custom dataset, CytoData, to develop and evaluate our diffusion classifier for haematological cell image classification. CytoData, available at https://www.ebi.ac.uk/biostudies/studies/S-BSST2156, is an anonymized dataset consisting of 559,808 single-cell images.