Abstract: Efficient text search is crucial for processing large volumes of data in areas such as natural language processing and bioinformatics. Traditional methods such as Naive Search, KMP, and Boyer-Moore, while foundational, often fall short on the complexity and scale of modern datasets, such as the Reuters corpus and human genomic sequences. This study investigates text-search algorithms, focusing on optimizing Suffix Trees through methods such as Splitting and Ukkonen’s Algorithm, evaluated on datasets including the Reuters corpus and human genomes. It introduces a novel optimization that combines Ukkonen’s Algorithm with a new search technique, which shows linear time and space efficiency and outperforms Naive Search, KMP, and Boyer-Moore. Empirical tests confirm the theoretical advantages and highlight the optimized Suffix Tree’s effectiveness in tasks such as pattern recognition in genomic sequences, where it achieves 100% accuracy. Beyond advancing academic knowledge of text-search algorithms, the research demonstrates practical utility in natural language processing and bioinformatics, owing to its resource efficiency and reliability.
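For orientation, two of the baselines named in the abstract, Naive Search and KMP, can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`naive_search`, `build_failure`, `kmp_search`) and the sample strings are hypothetical, and the sketch does not cover the Suffix Tree optimization the study proposes.

```python
def naive_search(text, pattern):
    """Return every start index of pattern in text by testing each offset.

    Worst-case cost is O(n * m) character comparisons.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def build_failure(pattern):
    """Compute the KMP failure table.

    fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return every start index of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```

Both functions return the same match positions; they differ only in how much work they repeat after a mismatch, which is the kind of gap that index structures such as Suffix Trees aim to close for repeated queries over one text.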
| Comments: | 5 pages, 13 figures |
| Subjects: | Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) |
| Cite as: | arXiv:2512.16927 [cs.DS] |
| | (or arXiv:2512.16927v1 [cs.DS] for this version) |
| | https://doi.org/10.48550/arXiv.2512.16927 (arXiv-issued DOI via DataCite) |
Submission history
From: Xinyu Guan [v1] Sat, 29 Nov 2025 16:05:13 UTC (1,026 KB)