December 09, 2025 • Adrien Grand (Engineer), Morgan Gallant (Engineer)
FTS v2, the newest version of turbopuffer's homegrown text search engine, is up to 20x faster than v1 thanks to (a) an improved storage layout and (b) a better search algorithm. This post is about the better search algorithm.
turbopuffer is often queried by agents, who tend to craft longer queries than humans. A key characteristic of the FTS v2 search algorithm is that it performs very well on such long queries (tens of terms). In particular, it is up to several times faster than block-max WAND, the most popular algorithm for lexical search.
Below are some representative benchmarks (full results) of FTS v1 versus FTS v2 on a 5M-document Wikipedia export sample dataset borrowed from Quickwit's Search Benchmark Game. The results support the claim that FTS v2 works extremely well not only on simple queries, but also on harder queries with many terms or those with only stopwords.
turbopuffer FTS latency (k=100, dataset=English Wikipedia export @ ~5M documents):

| Query | v1 | v2 |
|---|---|---|
| san francisco | 8ms | 3ms |
| the who | 57ms | 7ms |
| united states constitution | 20ms | 5ms |
| lord of the rings | 75ms | 6ms |
| pop singer songwriter born 1989 won best country song time person of the year | 174ms | 20ms |
Search algorithms: MAXSCORE vs WAND
Consider the query "new york" with a top-k of 10. It has two terms: "new" and "york", contributing max scores of 2.0 and 5.0, respectively. "york" has a higher score because the word is less common than "new". The query score is the sum of the per-term scores; higher is better.
We can deduce that any document may get scores up to the following values depending on what terms it contains:
| Document | Max possible score |
|---|---|
| only "new" | 2.0 |
| only "york" | 5.0 |
| "new" and "york" | 7.0 |
During evaluation, we may find ourselves in a scenario where the 10th best-scoring document so far has a score of 4.0. As k=10, any document containing only "new" can never enter the results! At this point, we can skip all documents that only contain "new". This simple optimization can lead to dramatic speedups.
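The pruning check above can be sketched in a few lines of Python (illustrative only; the names are hypothetical and turbopuffer's engine is not written in Python):

```python
# Per-term max score contributions for the query "new york".
MAX_SCORE = {"new": 2.0, "york": 5.0}

def can_qualify(terms_in_doc, min_heap_score):
    """Upper-bound a document's score by summing the max contributions of
    the query terms it contains; the doc is skippable when that bound
    cannot beat the current k-th best score."""
    upper_bound = sum(MAX_SCORE[t] for t in terms_in_doc)
    return upper_bound > min_heap_score

# With the 10th-best score at 4.0, a doc matching only "new" is skippable:
assert can_qualify({"new"}, 4.0) is False
assert can_qualify({"york"}, 4.0) is True
assert can_qualify({"new", "york"}, 4.0) is True
```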
There are two main algorithms that take advantage of this idea to speed up query evaluation without trading recall: MAXSCORE and WAND.
Let's look at how they evaluate a query on terms "new", "york", and "population", which respectively contribute up to 2.0, 5.0, and 4.0 to the score (top_k=3 for brevity).
MAXSCORE: term-centric
The first family of search algorithms is MAXSCORE. It starts by sorting the query terms by their score contribution: [new(2.0), population(4.0), york(5.0)].
Then, it starts iterating through the documents that match each term, keeping track of the highest scoring documents in the top-k heap.
Let's add a proverbial breakpoint a few iterations in:
----------------------------------------------------------
| Term | Max score | Postings (matching doc IDs) |
|--------------|-----------|-----------------------------|
| new | 2.0 | 0, 1, 3, 5, 6, 7, 9, 10 |
| | | ^ |
| population | 4.0 | 1, 2, 3, 8, 9 |
| | | ^ |
| york | 5.0 | 2, 3, 7, 10 |
| | | ^ |
----------------------------------------------------------
# (score, doc_id) -- bigger score is better
TOPK3_HEAP = [(6.0, 1), (9.0, 2), (11.0, 3)]
At this point, the minimum score in the heap is 6.0, so documents with "new" alone can no longer qualify, and "new" has become a non-essential term. We can simply ignore it when looking for the next document to evaluate, speeding up query performance.
In a few iterations, the minimum score will be 7.0, at which point "population" also becomes non-essential, speeding up the remainder of the query execution even further!
MAXSCORE: use essential terms to find candidates, use all terms to compute scores
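The essential/non-essential split can be sketched as follows (a minimal Python illustration with made-up names, assuming ties with the heap minimum do not qualify, so a term prefix stays non-essential until its summed max scores reach the heap minimum):

```python
def split_essential(terms, min_heap_score):
    """Partition query terms into (non_essential, essential) lists.

    terms: list of (term, max_score) pairs. After sorting by max score
    ascending, a prefix of low-impact terms is non-essential when even
    the sum of all their max scores cannot reach the current k-th best
    score: a document matching only those terms can never enter the heap.
    """
    terms = sorted(terms, key=lambda t: t[1])
    prefix_sum = 0.0
    for i, (_, max_score) in enumerate(terms):
        prefix_sum += max_score
        if prefix_sum >= min_heap_score:
            return terms[:i], terms[i:]  # non-essential, essential
    return terms, []  # no combination of terms can qualify anymore

query = [("new", 2.0), ("population", 4.0), ("york", 5.0)]
# Heap minimum 6.0: only "new" is non-essential.
# Heap minimum 7.0: "new" and "population" are both non-essential.
```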
WAND: document-centric
The second family of skipping algorithms is WAND ("weak and"). We consider it document-centric, because instead of continuously determining which terms no longer qualify, it attempts to find the next doc ID that could potentially qualify for the top-k heap.
Let's breakpoint at the same moment as with MAXSCORE:
----------------------------------------------------------
| Term | Max score | Postings (matching doc IDs) |
|--------------|-----------|-----------------------------|
| new | 2.0 | 0, 1, 3, 5, 6, 7, 9, 10 |
| | | ^ |
| population | 4.0 | 1, 2, 3, 8, 9 |
| | | ^ |
| york | 5.0 | 2, 3, 7, 10 |
| | | ^ |
----------------------------------------------------------
# (score, doc_id) -- bigger score is better
TOPK3_HEAP = [(6.0, 1), (9.0, 2), (11.0, 3)]
WAND sorts doc IDs in ascending order and calculates the score upper bound for each ID. A term contributes to a document's upper bound if its iterator is positioned on that doc ID (a confirmed match), or hasn't yet reached it (WAND must assume the doc ID could still be in the posting list).
For example:
# (doc ID, score upper bound)
[
(5, 2.0), # matches "new", will not match "york" or "population"
(7, 7.0), # matches "york", might match "new"
(8, 11.0) # matches "population", might match "new" and "york"
]
Then, WAND finds the first document whose best possible score exceeds the minimum heap score (6.0). Here, that's doc ID 7, so WAND evaluates it next. If the minimum heap score were greater than 7.0, WAND would skip directly to 8.
WAND: use all terms, continuously skipping doc IDs that cannot possibly qualify.
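This pivot-finding step can be sketched in Python (an illustrative toy, not turbopuffer's implementation; real iterators advance lazily rather than materializing pairs):

```python
def find_pivot(iterators, min_heap_score):
    """Find the pivot doc ID: walk iterators in ascending order of their
    current doc ID, accumulating term max scores. The first doc ID at
    which the accumulated upper bound beats the heap minimum is the next
    candidate worth evaluating; earlier doc IDs can be skipped outright.

    iterators: list of (current_doc_id, term_max_score) pairs.
    Returns None if no remaining document can qualify.
    """
    upper_bound = 0.0
    for doc_id, max_score in sorted(iterators):
        upper_bound += max_score
        if upper_bound > min_heap_score:
            return doc_id
    return None

# Iterator positions from the example: "new"@5, "york"@7, "population"@8.
iters = [(5, 2.0), (7, 5.0), (8, 4.0)]
# With the heap minimum at 6.0, the pivot is doc 7; above 7.0, it is doc 8.
```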
Block-max MAXSCORE/WAND
In their basic forms, these algorithms use global max scores: each term has a single max score (like 2.0 for "new") that applies to its entire posting list containing all doc IDs. With a global max score, we must evaluate each document individually to determine if it can qualify for the heap. But what if we could skip entire groups of documents at once?
This is what the block-max variants of MAXSCORE and WAND achieve. Block-max divides each posting list into fixed-size blocks containing a subset of doc IDs, storing a local max score for each block (the maximum score any document in that block can get from that term). Documents within a block can have different scores because term frequency (TF) varies per document, even though inverse document frequency (IDF) is constant for the term. If a blockβs local max score doesnβt exceed the minimum heap score, the algorithm can skip the entire block of documents. Block-max MAXSCORE and block-max WAND are currently considered the state of the art for lexical search.
There are really no disadvantages to block-max MAXSCORE/WAND versus plain MAXSCORE/WAND. For brevity, I'll be using "MAXSCORE" and "WAND" to describe the algorithms for the remainder of this post, but know I'm referring to their block-max counterparts.
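A single term's block-max skip can be sketched like this (hypothetical metadata layout for illustration; real implementations combine block bounds across all query terms before skipping):

```python
# Hypothetical per-block metadata for one term's posting list: each block
# covers a contiguous range of doc IDs and records the best ("local max")
# score any document inside it can get from this term.
blocks = [
    {"last_doc": 127, "block_max": 1.3},
    {"last_doc": 255, "block_max": 3.9},
    {"last_doc": 383, "block_max": 2.1},
]

def next_competitive_block(blocks, from_doc, min_heap_score):
    """Skip whole blocks whose local max score cannot beat the heap minimum."""
    for block in blocks:
        if block["last_doc"] < from_doc:
            continue  # the iterator has already moved past this block
        if block["block_max"] > min_heap_score:
            return block
    return None  # this term alone can no longer produce a competitive doc
```

With a heap minimum of 2.5, the first block (local max 1.3) is skipped wholesale and evaluation resumes at the second block; once past doc 255, no block remains competitive for this term.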
Comparison
WAND is currently the dominant algorithm for query evaluation, as it takes more information into account (the current positions of the postings list iterators) and can thus skip more documents than MAXSCORE.
However, skipping more documents doesn't necessarily mean faster. In order to skip more documents, WAND must spend more cycles computing the next doc ID that can qualify for the top-k heap. This introduces more work per evaluated document, giving WAND a lower throughput than MAXSCORE. When WAND's skipping power is outweighed by its lower throughput, MAXSCORE becomes more performant.
The Apache Lucene project initially started with WAND. While it worked well for most queries, users reported certain query types where it was slower than even exhaustive evaluation due to poor skipping decisions compounded by poor throughput. This led Lucene to switch to MAXSCORE and optimize it for throughput by improving memory locality, reducing unpredictable branches, and taking advantage of SIMD instructions.
| Algorithm | Skipping | Evaluation throughput |
|---|---|---|
| Exhaustive | None | Very good |
| WAND | Very good | Average |
| MAXSCORE | Good | Good |
| Luceneβs vectorized MAXSCORE | Good | Very good |
Lucene's vectorized MAXSCORE throughput is so good that it's extremely competitive with WAND even on simple queries where WAND is expected to shine, as confirmed by the Tantivy vs. Lucene benchmarks (Tantivy implements WAND). And because its throughput is very good, it doesn't become slower than exhaustive search in cases where skipping is hard.
Good throughput is particularly important on long queries (tens of terms), where skipping is harder, especially for WAND-based algorithms whose overhead scales with the number of terms. Our evaluations confirmed Apache Lucene's findings: this algorithm often performs several times faster than WAND on long queries.
Side note: the WAND vs. MAXSCORE trade-off has similarities with HNSW vs. SPFresh, the system turbopuffer uses for ANN search. WAND and HNSW optimize for skipping first and throughput second, whereas Lucene's MAXSCORE and SPFresh optimize for throughput first and skipping second.
turbopuffer's MAXSCORE variant
Our variant builds on the same skipping logic as MAXSCORE, but borrows from Lucene's vectorized implementation and differs significantly from the textbook description in order to improve throughput.
The main novelty is that postings iterators are no longer advanced in an alternating fashion. In traditional MAXSCORE implementations, iterators advance alternately: if iterator A is at doc ID 5 and iterator B is at doc ID 3, the algorithm advances B to catch up, perhaps advancing B beyond A, causing A to then catch up, and so on.
Our implementation works differently. After computing the next range of doc IDs to evaluate based on local max scores, it processes all matching doc IDs and scores needed from each postings iterator in one batch before moving to the next iterator. This means the same iterator advances many times in a row (often tens of times or more) before switching to another iterator.
This batching approach achieves specific CPU-level throughput optimizations:
- Better memory locality: Accessing consecutive memory locations from the same iterator enables the CPU's cache prefetcher to predict and load upcoming data before it's needed, achieving higher cache hit rates.
- Better branch prediction: When the same iterator advances many times in a row, the CPU's branch predictor learns a consistent pattern, keeping the instruction pipeline full more often.
- SIMD vectorization: Modern CPUs use SIMD instructions to process multiple data elements in parallel, but they require sequential data. Processing many consecutive elements from the same iterator enables SIMD optimizations that wouldn't be possible with frequently alternating iterators.
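The batched evaluation pattern described above can be sketched as follows (a simplified scalar Python illustration; the real implementation operates on compressed postings with vectorized kernels, and the window bounds come from block-level max scores):

```python
def score_window(postings_lists, window_start, window_end):
    """Batch-evaluate one window of doc IDs: drain each postings iterator
    fully over the window before touching the next one, accumulating
    partial scores in a dense array. Each inner loop walks a single list
    sequentially, which is cache-, branch-, and SIMD-friendly, instead of
    alternating between iterators on every document.

    postings_lists: per term, a sorted list of (doc_id, score) pairs.
    """
    accumulator = [0.0] * (window_end - window_start)
    for postings in postings_lists:
        for doc_id, score in postings:  # many advances of the SAME iterator
            if window_start <= doc_id < window_end:
                accumulator[doc_id - window_start] += score
    return accumulator  # accumulator[i] = score of doc window_start + i

new = [(0, 1.0), (1, 1.5), (3, 2.0)]
york = [(1, 5.0), (3, 4.0)]
# score_window([new, york], 0, 4) yields per-doc scores for docs 0..3.
```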
Conclusion
Designing fast search engines is a subtle optimization problem. On the one hand, CPUs reach peak efficiency on serial workloads with predictable branches. On the other hand, the smartest algorithms often have a random access pattern and lots of conditionals. So it takes some effort to select the right algorithm and then tune it to get the best out of the CPU.
Plus, context keeps changing. AVX-512 has now been widely available on server-side CPUs for several years, further moving the cursor in the direction of serial, branchless workloads. "Serial and dumb" can often beat "smart and random" on modern CPUs. Furthermore, agents write longer queries than humans, so itβs becoming increasingly important for text search to scale well with the number of terms.
These factors force us to periodically revisit our choices of algorithms and their implementations. For text search, this means that the cursor has shifted more and more from WAND to MAXSCORE, which scales better with the number of terms and can be tuned to be more CPU-friendly.
Thanks for reading through this blog. I hope you enjoyed the deep dive, as well as the performance that you're now getting on text search with turbopuffer thanks to FTS v2.
Appendix: turbopuffer FTS v2 benchmark results
Below you'll find results of turbopuffer's nightly benchmarks comparing FTS v1 vs. FTS v2. They reuse infrastructure from Search Benchmark, the Game: a 5M-document export of English Wikipedia, plus a mix of AOL queries and handcrafted queries that exhibit characteristics similar to those we see in production.
| Query | top_k | FTS v1 latency (ms) | FTS v2 latency (ms) | Speedup (Multiplier) |
|---|---|---|---|---|
| the | 10 | 26.0 | 3.9 | 6.7x |
| the | 100 | 25.9 | 5.2 | 5.0x |
| the | 1000 | 28.8 | 12.8 | 2.2x |
| search | 10 | 3.1 | 1.4 | 2.2x |
| search | 100 | 3.5 | 2.0 | 1.8x |
| search | 1000 | 5.9 | 4.6 | 1.3x |
| new york | 10 | 17.5 | 4.0 | 4.4x |
| new york | 100 | 17.7 | 6.0 | 2.9x |
| new york | 1000 | 20.2 | 9.7 | 2.1x |
| san francisco | 10 | 7.1 | 2.2 | 3.2x |
| san francisco | 100 | 7.5 | 2.8 | 2.7x |
| san francisco | 1000 | 9.9 | 5.7 | 1.7x |
| the who | 10 | 57.4 | 4.8 | 12.0x |
| the who | 100 | 57.3 | 6.7 | 8.6x |
| the who | 1000 | 60.1 | 16.4 | 3.7x |
| the incredibles | 10 | 37.4 | 3.3 | 11.3x |
| the incredibles | 100 | 37.4 | 5.2 | 7.2x |
| the incredibles | 1000 | 40.9 | 13.7 | 3.0x |
| new york population | 10 | 31.0 | 6.5 | 4.8x |
| new york population | 100 | 31.3 | 8.3 | 3.8x |
| new york population | 1000 | 33.8 | 14.0 | 2.4x |
| world bank president | 10 | 20.9 | 4.6 | 4.5x |
| world bank president | 100 | 21.1 | 6.2 | 3.4x |
| world bank president | 1000 | 23.9 | 10.4 | 2.3x |
| united states constitution | 10 | 19.8 | 3.8 | 5.2x |
| united states constitution | 100 | 19.8 | 5.1 | 3.9x |
| united states constitution | 1000 | 23.0 | 9.4 | 2.4x |
| law school rankings | 10 | 14.9 | 2.7 | 5.5x |
| law school rankings | 100 | 15.3 | 4.3 | 3.6x |
| law school rankings | 1000 | 18.0 | 9.9 | 1.8x |
| lord of the rings | 10 | 74.5 | 4.2 | 17.7x |
| lord of the rings | 100 | 75.4 | 6.3 | 12.0x |
| lord of the rings | 1000 | 79.8 | 15.6 | 5.1x |
| story of a girl | 10 | 84.2 | 5.0 | 16.8x |
| story of a girl | 100 | 84.3 | 7.1 | 11.9x |
| story of a girl | 1000 | 86.5 | 16.4 | 5.3x |
| to be or not to be | 10 | 99.7 | 18.4 | 5.4x |
| to be or not to be | 100 | 100.6 | 22.3 | 4.5x |
| to be or not to be | 1000 | 103.8 | 31.1 | 3.3x |
| arsenal player 1998 forward goal non flying dutch | 10 | 32.7 | 7.2 | 4.5x |
| arsenal player 1998 forward goal non flying dutch | 100 | 33.1 | 8.6 | 3.8x |
| arsenal player 1998 forward goal non flying dutch | 1000 | 36.2 | 13.6 | 2.7x |
| france president world war resistance london appeal | 10 | 53.7 | 10.6 | 5.1x |
| france president world war resistance london appeal | 100 | 54.4 | 12.0 | 4.5x |
| france president world war resistance london appeal | 1000 | 57.6 | 18.9 | 3.0x |
| pop singer songwriter born 1989 won best country song time person of year | 10 | 172.9 | 15.9 | 10.9x |
| pop singer songwriter born 1989 won best country song time person of year | 100 | 174.4 | 20.1 | 8.7x |
| pop singer songwriter born 1989 won best country song time person of year | 1000 | 178.0 | 32.4 | 5.5x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 10 | 143.5 | 12.0 | 12.0x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 100 | 143.2 | 17.2 | 8.3x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 1000 | 147.3 | 30.4 | 4.8x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 10 | 350.2 | 18.3 | 19.1x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 100 | 350.0 | 26.7 | 13.1x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 1000 | 357.7 | 47.7 | 7.5x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 10 | 570.9 | 30.5 | 18.7x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 100 | 575.1 | 43.3 | 13.3x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 1000 | 577.7 | 80.7 | 7.2x |
turbopuffer
We host 1T+ documents, handle 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.