Contents of this blog
Introduction
Working with Classifiers: The First Classifier
Understanding Classifiers: Unreliable Sources
Actual Research: The Study
Results
Introductory warning: If you’d rather begin directly with the study, start reading from Actual Research and then Results.
Key Terms
Binary Referral: A yes/no clinical decision about whether a patient should be referred to another service, specialist, or level of care.
Exotropia: A form of strabismus (eye misalignment) in which one or both eyes turn outward, away from the nose. It can be constant or intermittent, and may cause issues such as double vision, eye strain, or reduced depth perception.
Esotropia: A type of strabismus in which one or both eyes turn inward toward the nose. It can be constant or intermittent and is common in children but occurs at all ages.
Resolution: A measure of how well forecasts separate situations with different observed outcome frequencies. Higher resolution means the model assigns different probabilities to cases that genuinely differ in event likelihood.
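For readers who want the formula behind that definition, resolution is one term in the standard Murphy decomposition of the Brier score (which comes up again in the evaluation). One common way to write it, binning forecasts into K probability bins, is:

```latex
% Murphy decomposition of the Brier score (BS) over K probability bins:
% n_k = forecasts in bin k, f_k = mean forecast in bin k,
% \bar{o}_k = observed event frequency in bin k, \bar{o} = overall base rate.
\mathrm{BS}
  = \underbrace{\tfrac{1}{N}\sum_{k=1}^{K} n_k (f_k - \bar{o}_k)^2}_{\text{reliability}}
  - \underbrace{\tfrac{1}{N}\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}}
  + \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```

A larger resolution term means the forecasts separate high-risk cases from low-risk cases more strongly.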
Introduction
The most common ways to improve classifier performance are: using more data, using pretrained architectures, or employing augmentations. Previously, I’ve written extensively on classifier training, from common pitfalls to augmentation techniques like AutoAugment, RandAugment, and TrivialAugment, with Cutout, Cutmix, and Mixup also in progress.
Across those posts, I often guided newcomers toward TrivialAugment or suggested that they explore Generative Adversarial Networks (GANs). Within the medical domain, StyleGAN2-ADA stood out to me: it performs well with limited data, is relatively intuitive once you grasp GAN fundamentals, and holds up strongly against predecessors like StyleGAN and StyleGAN2.
However, my recent research made me rethink some of those assumptions.
June 2025: The First Classifier
In June, I had just started contributing to an open-source project, studying chatbots, and polishing a few independent research projects. Around that time, I built my first classifier, not for research, but for a small hackathon I decided to join. The classifier was central to the project, as it had to provide exercise recommendations based on predictions. Accuracy was crucial.
The project was completed and submitted successfully. Did I win? No, but not because of the classifier. The issues were instead due to dependency updates and package incompatibilities (I discuss them in detail here).
Still, the experience, with its frustrations, limitations, and small victories, sparked a six-month deep dive into classifiers.
July 2025 – August 2025: Unreliable Sources
A few weeks later, I revisited that original classifier and began experimenting with LLMs to refine it. My goal: learn the best strategy for building an effective classifier with just 500 images across 5 classes.
Initially, everything worked smoothly; the LLM's suggestions improved the model. But then the infamous decline began: output quality dropped, changes became less meaningful, and eventually the classifier's performance worsened.
Despite studying “programs and algorithms”, I found myself repeatedly pressing Ctrl+C and Ctrl+V. At some point, fed up with the irony, I asked myself:
“How hard can studying classifiers actually be?”
TLDR: Extremely hard if you’re new.
I refreshed my understanding of CNNs (a topic I’d studied long ago and also blogged about in Juggling Multiple Interests). Then I moved on to augmentations.
With my trust in LLMs diminishing due to contradictions and backtracking, I still used them for basic definitions, but I could clearly tell when the information was unreliable. Eventually, I decided:
“What better way to learn something than from the source?”
That decision came with challenges: AutoAugment requires substantial foundational knowledge. But it was ultimately worth it.
During this period, I learned about:
How AutoAugment works
Computational demands and constraints
Performance across datasets like ImageNet, CIFAR-10, SVHN
Architectural, optimization, GPU, and CPU considerations
This naturally led me to RandAugment, AutoAugment's computationally cheaper successor. Around this time, I also started reading from a medical/clinical perspective, and one particular question stuck with me:
“Which of these would be preferable in a clinical setting?”
This single question became the motivation behind the study I pursued for months.
June 2025 – November 2025: The Study
In late July, I began an independent research study to benchmark augmentation techniques for a specific task: Binary referral.
My goal was to determine whether augmentations truly help in clinical settings where accuracy matters but conditions are suboptimal.
At this point I was already deep into dataset-specific augmentations (AA, RA, TA). To compare them with more general, robustness-focused augmentations, I included the Mix family: Cutout, Cutmix, and Mixup.
Thus the final set was: AutoAugment, RandAugment, TrivialAugment, Cutout, Cutmix, Mixup, and a Baseline.
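The exact training pipeline lives in the repository, but to make the setup concrete, here is a minimal sketch of how these arms could be wired up with torchvision. This assumes a recent torchvision with the transforms.v2 API (roughly 0.16+), and uses RandomErasing as the Cutout-style stand-in:

```python
from torch.utils.data import default_collate
from torchvision import transforms
from torchvision.transforms import v2  # CutMix/MixUp live here in recent torchvision

# Per-image policy augmentations (pick one per experiment arm).
autoaugment = transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET)
randaugment = transforms.RandAugment(num_ops=2, magnitude=9)
trivialaugment = transforms.TrivialAugmentWide()

# Cutout-style occlusion: torchvision ships RandomErasing, a close relative of Cutout.
cutout_like = transforms.RandomErasing(p=1.0, scale=(0.1, 0.2))

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    trivialaugment,                 # swap in autoaugment / randaugment here
    transforms.ToTensor(),
    cutout_like,                    # RandomErasing expects a tensor, so it goes after ToTensor
])

# CutMix and MixUp operate on whole batches of (images, integer labels).
cutmix = v2.CutMix(num_classes=2)
mixup = v2.MixUp(num_classes=2)

def collate_with_cutmix(batch):
    """Apply CutMix after default collation, the usual pattern for batch-level transforms."""
    images, labels = default_collate(batch)
    return cutmix(images, labels)
```

CutMix and Mixup differ from the policy augmentations in that they mix pairs of samples, which is why they sit in the collate step rather than in the per-image transform.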
To mimic suboptimal conditions:
I used only one system (CPU only)
The dataset had ~100 images per class: essentially a stress test
Pretrained models were used to mitigate data scarcity: EfficientNet-B0, MobileNet-V2, and MobileNet-V3 (ImageNet-trained); see the sketch after this list
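As an illustration (not the exact code from the study), loading one of these backbones and adapting its head to the two-class referral task looks roughly like this:

```python
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B0 with its 1000-class head swapped
# for a 2-class binary-referral head.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

# The MobileNets are analogous: the final Linear layer is
# model.classifier[1] for MobileNet-V2 and model.classifier[3] for MobileNet-V3.
```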
However, a practical issue emerged: these models require 224×224 inputs. Cropping was not viable since it removed spatially important medical features. I solved this by padding images into a square, producing a proper 224×224 input while preserving structure. Grad-CAM confirmed that models still localized the correct regions.
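A minimal sketch of that pad-to-square idea, using PIL and a hypothetical file name, might look like this:

```python
from PIL import Image

def pad_to_square(img: Image.Image, fill=0) -> Image.Image:
    """Place the image on a square canvas so nothing is cropped away."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new(img.mode, (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))  # center the original image
    return canvas

# Hypothetical example: pad first, then resize to the 224x224 input the pretrained models expect.
img = Image.open("patient_scan.jpg")
model_input = pad_to_square(img).resize((224, 224), Image.BILINEAR)
```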
For evaluation, I used:
Statistical analysis
Brier score decomposition
Odds ratios
AUC/DeLong comparisons
This gave rise to several interesting findings.
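To make the Brier score decomposition concrete, here is a minimal, illustrative implementation of the standard binned (Murphy) decomposition; it is a sketch, not the exact analysis code used in the study:

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Binned Murphy decomposition: Brier score ~ reliability - resolution + uncertainty.

    probs    : predicted probabilities of referral (floats in [0, 1])
    outcomes : observed 0/1 labels
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()

    # Assign each forecast to a probability bin.
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)

    reliability = 0.0  # penalizes calibration error within bins
    resolution = 0.0   # rewards separating cases with different outcome rates
    for b in range(n_bins):
        mask = bins == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        f_b = probs[mask].mean()      # mean forecast in the bin
        o_b = outcomes[mask].mean()   # observed event frequency in the bin
        reliability += n_b * (f_b - o_b) ** 2
        resolution += n_b * (o_b - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty

# Toy usage with made-up predictions.
rel, res, unc = brier_decomposition([0.9, 0.8, 0.2, 0.1, 0.7, 0.3], [1, 1, 0, 0, 1, 0], n_bins=5)
print(f"reliability={rel:.3f}  resolution={res:.3f}  uncertainty={unc:.3f}")
```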
Results
I tested using esotropia and exotropia datasets due to their distinct characteristics.
Esotropia
AutoAugment gave the most consistent results.
I haven’t yet done a full qualitative analysis, but AA likely learned policies that emphasized key esotropia features.
Exotropia
TrivialAugment performed most consistently.
This suggests that simple random transformations can help stabilize performance.
Underperformer: Cutmix
Cutmix consistently underperformed across nearly all seeds and models.
DeLong’s test (on AUCs) repeatedly indicated worse performance for Cutmix compared to others.
Brier Decomposition
Mixup had the most frequent issues with low reliability, followed closely by AA.
For resolution, AutoAugment was the most consistent, showing strong ability to differentiate cases.
The Biggest Takeaway
Across multiple seeds, pretrained models, and both datasets, the baseline performed similarly to the augmented versions on nearly all metrics.
This leads me to a simple conclusion: you do not need augmentations to train your classifiers.
With a high-quality dataset, proper preprocessing, and the right pretrained model, even small datasets can reach strong baselines (e.g., ~0.93 across varied metrics).
If you do choose to use an augmentation, my recommendation is AutoAugment.
Read the complete study: Introducing UBAEF (a slight warning: the full paper with appendices is 131 pages long!). All detailed results, including confidence intervals, training times, and more, can be found in the GitHub repository: GitHub Repo.
Until next time, with another project. And remember, sometimes, the baseline is already great.