Abstract:This study addresses the problem of anomaly detection and root cause tracing in microservice architectures and proposes a unified framework that combines graph neural networks with temporal modeling. The microservice call chain is abstracted as a directed graph, where multidimensional features of nodes and edges are used to construct a service topology representation, and graph convolution is applied to aggregate features across nodes and model dependencies, capturing complex structural relationships among services. On this basis, gated recurrent units are introduced to model the temporal evolution of call chains, and multi-layer stacking and concatenation operations are used to jointly obtain structural and temporal representation…
Abstract:This study addresses the problem of anomaly detection and root cause tracing in microservice architectures and proposes a unified framework that combines graph neural networks with temporal modeling. The microservice call chain is abstracted as a directed graph, where multidimensional features of nodes and edges are used to construct a service topology representation, and graph convolution is applied to aggregate features across nodes and model dependencies, capturing complex structural relationships among services. On this basis, gated recurrent units are introduced to model the temporal evolution of call chains, and multi-layer stacking and concatenation operations are used to jointly obtain structural and temporal representations, improving the ability to identify anomaly patterns. Furthermore, anomaly scoring functions at both the node and path levels are defined to achieve unified modeling from local anomaly detection to global call chain tracing, which enables the identification of abnormal service nodes and the reconstruction of potential anomaly propagation paths. Sensitivity experiments are then designed from multiple dimensions, including hyperparameters, environmental disturbances, and data distribution, to evaluate the framework, and results show that it outperforms baseline methods in key metrics such as AUC, ACC, Recall, and F1-Score, maintaining high accuracy and stability under dynamic topologies and complex environments. This research not only provides a new technical path for anomaly detection in microservices but also lays a methodological foundation for intelligent operations in distributed systems.
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2511.03285 [cs.LG] |
| (or arXiv:2511.03285v1 [cs.LG] for this version) | |
| https://doi.org/10.48550/arXiv.2511.03285 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Qingyuan Zhang [view email] [v1] Wed, 5 Nov 2025 08:28:41 UTC (822 KB)