LOCALE: Local-Alignment Embeddings for Noise-Robust DNA Search at SRA Scale
Searching petabase-scale repositories of raw sequencing data such as the NIH Sequence Read Archive (SRA) could transform biological discovery, but existing methods either do not scale well or rely on exact k-mer matching that is brittle to sequencing errors and biological divergence. We recast sequence search as dense retrieval: we learn vector embeddings whose inner-product similarity ranks locally aligned sequences above unaligned ones. Our key observation is that effective retrieval does n...
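As a rough illustration of the dense-retrieval framing described above, the following minimal sketch ranks stored sequences by the inner product between their embedding and a query embedding. It is not the LOCALE model: the vectors are hand-made toy values rather than learned embeddings, and the names `inner_product`, `rank_by_inner_product`, `index`, and the `read_*` labels are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple


def inner_product(u: List[float], v: List[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_inner_product(
    query: List[float],
    index: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every indexed vector against the query and return the top_k
    (name, score) pairs, highest inner product first."""
    scored = [(name, inner_product(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-made embeddings (assumption: a real system would produce
# these with a trained encoder; these values are purely illustrative).
index = {
    "read_A": [0.9, 0.1, 0.0],
    "read_B": [0.1, 0.8, 0.1],
    "read_C": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]

hits = rank_by_inner_product(query, index, top_k=2)
for name, score in hits:
    print(f"{name}\t{score:.2f}")
```

In this framing, retrieval quality depends entirely on the embedding: the training objective has to place locally aligned sequences at a higher inner product than unaligned ones, so that a simple top-k scan (or an approximate nearest-neighbor index at larger scale) returns the aligned candidates first.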