Optimizing Text Search: A Novel Pattern Matching Algorithm Based on Ukkonen's Approach

Abstract: Efficient text-search algorithms are crucial for processing large volumes of data in areas such as natural language processing and bioinformatics. Traditional methods such as Naive Search, KMP, and Boyer-Moore, while foundational, often fall short on the scale and complexity of modern datasets, such as the Reuters corpus and human genomic sequences. This study investigates text-search algorithms with a focus on optimizing Suffix Trees through methods such as Splitting and Ukkonen’s Algorithm, evaluated on the Reuters corpus and human genome data. It introduces a novel optimization that combines Ukkonen’s Algorithm with a new search technique; the resulting method shows linear time and space efficiency and outperforms Naive Search, KMP, and Boyer-Moore. Empirical tests confirm the theoretical advantages and highlight the optimized Suffix Tree’s effectiveness in tasks such as pattern recognition in genomic sequences, where it achieves 100% accuracy. The research both advances academic understanding of text-search algorithms and demonstrates practical utility in natural language processing and bioinformatics, owing to its superior resource efficiency and reliability.
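The abstract does not spell out the optimized algorithm, so the following is only a minimal sketch of the general idea it builds on: index the text once by its suffixes, then answer each query in time proportional to the pattern length. It is not the authors' method and not Ukkonen's Algorithm. For brevity it builds an uncompressed suffix trie in quadratic time and space, whereas Ukkonen's Algorithm builds a compressed suffix tree in linear time. The names `SuffixTrie`, `_Node`, `find`, and `naive_search` are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def naive_search(text: str, pattern: str) -> List[int]:
    """Baseline: test every alignment, O(n * m) in the worst case."""
    if not pattern:
        return []
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


class _Node:
    """One trie node: outgoing edges plus the suffix starts passing through."""

    __slots__ = ("children", "starts")

    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.starts: List[int] = []


class SuffixTrie:
    """Uncompressed suffix trie, built naively in O(n^2) time and space.

    Assumption: this stands in for a real suffix tree only to show the
    lookup step. A production index would use a linear-time construction
    such as Ukkonen's Algorithm and compress single-child paths.
    """

    def __init__(self, text: str) -> None:
        self.root = _Node()
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.children.setdefault(ch, _Node())
                # Every prefix of this suffix occurs in the text at `start`.
                node.starts.append(start)

    def find(self, pattern: str) -> List[int]:
        """Return all start positions of `pattern`, walking O(len(pattern)) edges."""
        if not pattern:
            return []
        node = self.root
        for ch in pattern:
            nxt = node.children.get(ch)
            if nxt is None:
                return []  # pattern does not occur in the text
            node = nxt
        return list(node.starts)  # already in increasing order


if __name__ == "__main__":
    genome = "GATTACAGATTACA"
    index = SuffixTrie(genome)
    print(index.find("TTA"), naive_search(genome, "TTA"))
```

Once the index is built, each query walks at most one edge per pattern character, regardless of the text length. That is the property that makes suffix-based indexes attractive when a large corpus or genome is searched many times.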


Comments:	5 pages, 13 figures
Subjects:	Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2512.16927 [cs.DS]
	(or arXiv:2512.16927v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2512.16927 arXiv-issued DOI via DataCite

Submission history

From: Xinyu Guan [v1] Sat, 29 Nov 2025 16:05:13 UTC (1,026 KB)
