Evaluating Anomaly Detectors for Simulated Highly Imbalanced Industrial Classification Problems

View PDF HTML (experimental)

Abstract:Machine learning offers potential solutions to current issues in industrial systems in areas such as quality control and predictive maintenance, but also faces unique barriers in industrial applications. An ongoing challenge is extreme class imbalance, primarily due to the limited availability of faulty data during training. This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset that reflects real-world engineering constraints. Using a synthetic dataset with a hyper-spherical based anomaly distribution in 2D and 10D, we benchmark 14 detectors across training datasets with anomaly rates between 0.05% and …

View PDF HTML (experimental)

Abstract:Machine learning offers potential solutions to current issues in industrial systems in areas such as quality control and predictive maintenance, but also faces unique barriers in industrial applications. An ongoing challenge is extreme class imbalance, primarily due to the limited availability of faulty data during training. This paper presents a comprehensive evaluation of anomaly detection algorithms using a problem-agnostic simulated dataset that reflects real-world engineering constraints. Using a synthetic dataset with a hyper-spherical based anomaly distribution in 2D and 10D, we benchmark 14 detectors across training datasets with anomaly rates between 0.05% and 20% and training sizes between 1 000 and 10 000 (with a testing dataset size of 40 000) to assess performance and generalization error. Our findings reveal that the best detector is highly dependant on the total number of faulty examples in the training dataset, with additional healthy examples offering insignificant benefits in most cases. With less than 20 faulty examples, unsupervised methods (kNN/LOF) dominate; but around 30-50 faulty examples, semi-supervised (XGBOD) and supervised (SVM/CatBoost) detectors, we see large performance increases. While semi-supervised methods do not show significant benefits with only two features, the improvements are evident at ten features. The study highlights the performance drop on generalization of anomaly detection methods on smaller datasets, and provides practical insights for deploying anomaly detection in industrial environments.


Comments:	21 pages, 14 figures, 11 tables
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2601.00005 [cs.LG]
	(or arXiv:2601.00005v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2601.00005 arXiv-issued DOI via DataCite

Submission history

From: Lesley Wheat [view email] [v1] Sun, 7 Dec 2025 03:49:54 UTC (1,383 KB)

Submission history

Similar Posts