Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

Computer Science > Machine Learning

arXiv:2512.10147 (cs)

COVID-19 e-print

Important: e-prints posted on arXiv are not peer-reviewed by arXiv; they should not be relied upon without context to guide clinical practice or health-related behavior and should not be reported in news media as established information without consulting multiple experts in the field.

Abstract: Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today’s multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method’s potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
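
The abstract does not spell out the hashing scheme, so the following is only an illustrative sketch of what a hashing-based, alignment-free embedding could look like, not the authors' implementation. It assumes overlapping k-mers are extracted from a spike sequence and each k-mer is mapped to a bucket of a fixed-length count vector. MurmurHash3 (32-bit) is used as a stand-in hash; the function names `murmur3_32` and `hash_embed`, and the parameters `k`, `dim`, and `seed`, are hypothetical choices made for this example.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit). Stand-in hash for this sketch."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    rounded = n - (n % 4)

    def mix_k(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    # Body: process the input in 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        h ^= mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Finalization (avalanche).
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def hash_embed(seq: str, k: int = 3, dim: int = 64, seed: int = 0) -> list:
    """Map a sequence to a fixed-length vector of hashed k-mer counts.

    Assumption for illustration: k-mer counting plus modulo bucketing.
    No alignment is needed, and the output length is `dim` for any input.
    """
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bucket = murmur3_32(kmer.encode("ascii"), seed) % dim
        vec[bucket] += 1.0
    return vec


# Example: embed a short amino-acid fragment into 16 dimensions.
embedding = hash_embed("MFVFLVLLPLVSSQ", k=3, dim=16)
```

Because every sequence is reduced to a vector of the same length in a single pass, vectors of this kind can be fed directly to standard classifiers. Different k-mers may collide in the same bucket, which is the usual trade-off of hashing-based representations.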


Subjects:	Machine Learning (cs.LG); Genomics (q-bio.GN)
Cite as:	arXiv:2512.10147 [cs.LG]
	(or arXiv:2512.10147v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2512.10147 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sarwan Ali
[v1] Wed, 10 Dec 2025 23:03:10 UTC (157 KB)
