Abstract: Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been benchmarked predominantly on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components, such as isotropic convolutions with higher spatial resolution and channel-wise attention, promote broader generalization, while Transformer-based blocks yield little additional benefit despite incurring higher parameter overhead. This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.
| Comments: | 10 pages, 5 tables, 1 figure, 3 equations, 11 mobile models, 7 datasets | 
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) | 
| Cite as: | arXiv:2511.00335 [cs.CV] | 
| | (or arXiv:2511.00335v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2511.00335 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Weidong Zhang
[v1] Sat, 1 Nov 2025 00:40:06 UTC (31 KB)