pg_textsearch
Modern ranked text search for Postgres.
- Simple syntax:
ORDER BY content <@> 'search terms' - BM25 ranking with configurable parameters (k1, b)
- Works with Postgres text search configurations (english, french, german, etc.)
- Supports partitioned tables
- Goal: state-of-the-art performance and scalability
🎉 Now Open Source! We’re excited to share pg_textsearch with the community.
🚀 Status: v0.1.1-dev (prerelease) - Feature-complete but not yet optimized. Not recommended for production use. See ROADMAP.md for what’s next.
Historical note
The original name of the project was Tapir - Textual Analysis for Postgres Information Retrieval. We still use th…
pg_textsearch
Modern ranked text search for Postgres.
- Simple syntax:
ORDER BY content <@> 'search terms' - BM25 ranking with configurable parameters (k1, b)
- Works with Postgres text search configurations (english, french, german, etc.)
- Supports partitioned tables
- Goal: state-of-the-art performance and scalability
🎉 Now Open Source! We’re excited to share pg_textsearch with the community.
🚀 Status: v0.1.1-dev (prerelease) - Feature-complete but not yet optimized. Not recommended for production use. See ROADMAP.md for what’s next.
Historical note
The original name of the project was Tapir - Textual Analysis for Postgres Information Retrieval. We still use the tapir as our mascot and the name occurs in various places in the source code.
PostgreSQL Version Compatibility
pg_textsearch supports:
- PostgreSQL 17
- PostgreSQL 18
New in PostgreSQL 18 Support
- Embedded index name syntax: Use
index_name:queryformat in cast expressions for better compatibility with PG18’s query planner - Improved ORDER BY optimization: Full support for PG18’s consistent ordering semantics
- Query planner compatibility: Works correctly with PG18’s more eager expression evaluation
Installation
Pre-built Binaries
Download pre-built binaries from the Releases page. Available for Linux and macOS (amd64 and arm64), PostgreSQL 17 and 18.
Build from Source
cd /tmp
git clone https://github.com/timescale/pg_textsearch
cd pg_textsearch
make
make install # may need sudo
Getting Started
Enable the extension (do this once in each database where you want to use it)
CREATE EXTENSION pg_textsearch;
Create a table with text content
CREATE TABLE documents (id bigserial PRIMARY KEY, content text);
INSERT INTO documents (content) VALUES
('PostgreSQL is a powerful database system'),
('BM25 is an effective ranking function'),
('Full text search with custom scoring');
Create a pg_textsearch index on the text column
CREATE INDEX docs_idx ON documents USING bm25(content) WITH (text_config='english');
Querying
Get the most relevant documents using the <@> operator
SELECT * FROM documents
ORDER BY content <@> 'database system'
LIMIT 5;
Note: <@> returns the negative BM25 score since Postgres only supports ASC order index scans on operators. Lower scores indicate better matches.
The index is automatically detected from the column. For explicit index specification:
SELECT * FROM documents
WHERE content <@> to_bm25query('database system', 'docs_idx') < -1.0;
Supported operations:
text <@> 'query'- Score text against a query (index auto-detected)text <@> bm25query- Score text with explicit index specification
Verifying Index Usage
Check query plan with EXPLAIN:
EXPLAIN SELECT * FROM documents
ORDER BY content <@> 'database system'
LIMIT 5;
For small datasets, PostgreSQL may prefer sequential scans. Force index usage:
SET enable_seqscan = off;
Note: Even if EXPLAIN shows a sequential scan, <@> and to_bm25query always use the index for corpus statistics (document counts, average length) required for BM25 scoring.
Indexing
Create a BM25 index on your text columns:
CREATE INDEX ON documents USING bm25(content) WITH (text_config='english');
Index Options
text_config- PostgreSQL text search configuration to use (required)k1- term frequency saturation parameter (1.2 by default)b- length normalization parameter (0.75 by default)
CREATE INDEX ON documents USING bm25(content) WITH (text_config='english', k1=1.5, b=0.8);
Also supports different text search configurations:
-- English documents with stemming
CREATE INDEX docs_en_idx ON documents USING bm25(content) WITH (text_config='english');
-- Simple text processing without stemming
CREATE INDEX docs_simple_idx ON documents USING bm25(content) WITH (text_config='simple');
-- Language-specific configurations
CREATE INDEX docs_fr_idx ON french_docs USING bm25(content) WITH (text_config='french');
CREATE INDEX docs_de_idx ON german_docs USING bm25(content) WITH (text_config='german');
Data Types
bm25query
The bm25query type represents queries for BM25 scoring with optional index context:
-- Create a bm25query with index name (required for WHERE clause and standalone scoring)
SELECT to_bm25query('search query text', 'docs_idx');
-- Returns: docs_idx:search query text
-- Embedded index name syntax (alternative form using cast)
SELECT 'docs_idx:search query text'::bm25query;
-- Returns: docs_idx:search query text
-- Create a bm25query without index name (only works in ORDER BY with index scan)
SELECT to_bm25query('search query text');
-- Returns: search query text
Note: In PostgreSQL 18, the embedded index name syntax using single colon (:) allows the query planner to determine the index name even when evaluating SELECT clause expressions early. This ensures compatibility across different query evaluation strategies.
bm25query Functions
| Function | Description |
|---|---|
| to_bm25query(text) → bm25query | Create bm25query without index name (for ORDER BY only) |
| to_bm25query(text, text) → bm25query | Create bm25query with query text and index name |
| text <@> bm25query → double precision | BM25 scoring operator (returns negative scores) |
| bm25query = bm25query → boolean | Equality comparison |
Performance
pg_textsearch indexes use a memtable architecture for efficient writes. Like other index types, it’s faster to create an index after loading your data.
-- Load data first
INSERT INTO documents (content) VALUES (...);
-- Then create index
CREATE INDEX docs_idx ON documents USING bm25(content) WITH (text_config='english');
Monitoring
-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelid::regclass::text ~ 'pg_textsearch';
Configuration
Optional settings in postgresql.conf:
# Query limit when no LIMIT clause detected
pg_textsearch.default_limit = 1000 # default 1000
# Auto-spill thresholds (set to 0 to disable)
pg_textsearch.bulk_load_threshold = 100000 # terms per transaction
pg_textsearch.memtable_spill_threshold = 800000 # posting entries (~8MB segments)
The memtable_spill_threshold controls when the in-memory index spills to disk segments. When the memtable reaches this many posting entries, it automatically flushes to a segment at transaction commit. This keeps memory usage bounded while maintaining good query performance.
Crash recovery: The memtable is rebuilt from the heap on startup, so no data is lost if Postgres crashes before spilling to disk.
Examples
Basic Search
CREATE TABLE articles (id serial PRIMARY KEY, title text, content text);
CREATE INDEX articles_idx ON articles USING bm25(content) WITH (text_config='english');
INSERT INTO articles (title, content) VALUES
('Database Systems', 'PostgreSQL is a powerful relational database system'),
('Search Technology', 'Full text search enables finding relevant documents quickly'),
('Information Retrieval', 'BM25 is a ranking function used in search engines');
-- Find relevant documents
SELECT title, content <@> 'database search' as score
FROM articles
ORDER BY score;
Also supports different languages and custom parameters:
-- Different languages
CREATE INDEX fr_idx ON french_articles USING bm25(content) WITH (text_config='french');
CREATE INDEX de_idx ON german_articles USING bm25(content) WITH (text_config='german');
-- Custom parameters
CREATE INDEX custom_idx ON documents USING bm25(content)
WITH (text_config='english', k1=2.0, b=0.9);
Limitations
Partitioned Tables
BM25 indexes on partitioned tables use partition-local statistics. Each partition maintains its own:
- Document count (
total_docs) - Average document length (
avg_doc_len) - Per-term document frequencies for IDF calculation
This means:
- Queries targeting a single partition compute accurate BM25 scores using that partition’s statistics
- Queries spanning multiple partitions return scores computed independently per partition, which may not be directly comparable across partitions
Example: If partition A has 1000 documents and partition B has 10 documents, the term "database" would have different IDF values in each partition. Results from both partitions would have scores on different scales.
Recommendations:
- For time-partitioned data, query individual partitions when score comparability matters
- Use partitioning schemes where queries naturally target single partitions
- Consider this behavior when designing partition strategies for search workloads
-- Query single partition (scores are accurate within partition)
SELECT * FROM docs
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
ORDER BY content <@> 'search terms'
LIMIT 10;
-- Cross-partition query (scores computed per-partition)
SELECT * FROM docs
ORDER BY content <@> 'search terms'
LIMIT 10;
Word Length Limit
pg_textsearch inherits PostgreSQL’s tsvector word length limit of 2047 characters. Words exceeding this limit are ignored during tokenization (with an INFO message). This is defined by MAXSTRLEN in PostgreSQL’s text search implementation.
For typical natural language text, this limit is never encountered. It may affect documents containing very long tokens such as base64-encoded data, long URLs, or concatenated identifiers.
This behavior is similar to other search engines:
- Elasticsearch: Truncates tokens (configurable via
truncatefilter, default 10 chars) - Tantivy: Truncates to 255 bytes by default
Troubleshooting
-- List available text search configurations
SELECT cfgname FROM pg_ts_config;
-- List BM25 indexes
SELECT indexname FROM pg_indexes WHERE indexdef LIKE '%USING bm25%';
Installation Notes
If your machine has multiple Postgres installations, specify the path to pg_config:
export PG_CONFIG=/Library/PostgreSQL/18/bin/pg_config # or 17
make clean && make && make install
If you get compilation errors, install Postgres development files:
# Ubuntu/Debian
sudo apt install postgresql-server-dev-17 # for PostgreSQL 17
sudo apt install postgresql-server-dev-18 # for PostgreSQL 18
Reference
Index Options
| Option | Type | Default | Description |
|---|---|---|---|
| text_config | string | required | PostgreSQL text search configuration to use |
| k1 | real | 1.2 | Term frequency saturation parameter (0.1 to 10.0) |
| b | real | 0.75 | Length normalization parameter (0.0 to 1.0) |
Text Search Configurations
Available configurations depend on your Postgres installation:
# SELECT cfgname FROM pg_ts_config;
cfgname
------------
simple
arabic
armenian
basque
catalan
danish
dutch
english
finnish
french
german
greek
hindi
hungarian
indonesian
irish
italian
lithuanian
nepali
norwegian
portuguese
romanian
russian
serbian
spanish
swedish
tamil
turkish
yiddish
(29 rows)
Further language support is available via extensions such as zhparser.
Development Functions
These functions are for debugging and development use only. Their interface may change in future releases without notice.
| Function | Description |
|---|---|
| bm25_dump_index(index_name) → text | Dump internal index structure (truncated) |
| bm25_dump_index(index_name, file_path) → text | Dump full index structure to file |
| bm25_summarize_index(index_name) → text | Show index statistics without content |
| bm25_spill_index(index_name) → int4 | Force memtable spill to disk segment |
-- Quick overview of index statistics
SELECT bm25_summarize_index('docs_idx');
-- Detailed dump for debugging (truncated output)
SELECT bm25_dump_index('docs_idx');
-- Full dump to file (includes hex data)
SELECT bm25_dump_index('docs_idx', '/tmp/docs_idx_dump.txt');
-- Force spill to disk (returns number of entries spilled)
SELECT bm25_spill_index('docs_idx');
Contributing
See CONTRIBUTING.md for development setup, code style, and how to submit pull requests.
- Bug Reports: Create an issue
- Feature Requests: Request a feature
- General Discussion: Start a discussion