16 Dec, 2025
I recently added domain exclusion lists and paywalled content filtering to Scour. This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don’t know what that means, I’ll explain it below).
Scour searches noisy sources for content related to users’ interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time.
The Ranking SQL Query
The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring.
Scour uses brute force search rather than a vector database, which works well enough for now because of three factors:
- Scour uses SQLite, so the data is colocated with the application code.
- It uses binary-quantized vector embeddings with Hamming distance comparisons, which only take ~5 nanoseconds each.
- We care most about recent posts, so we can significantly narrow the search set by publish date.
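To make the second point concrete, here is a minimal Python sketch of binary quantization plus Hamming distance (the 4-dimensional example, `binary_quantize`, and the bit-packing scheme are my own illustration, not Scour's actual code):

```python
# Binary quantization keeps one sign bit per dimension; comparing two
# embeddings is then a XOR plus a popcount, which is why each comparison
# costs only a few nanoseconds.

def binary_quantize(vector):
    """Pack a float vector into an int, one sign bit per dimension."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming_distance(a, b):
    """Number of differing bits between two bit-packed embeddings."""
    return bin(a ^ b).count("1")

a = binary_quantize([0.3, -1.2, 0.8, 0.1])   # -> 0b1101
b = binary_quantize([-0.5, -0.9, 0.7, 0.2])  # -> 0b1100
print(hamming_distance(a, b))  # 1
```

Real embeddings have hundreds of dimensions, but the cost model is the same: one XOR and one popcount per candidate item.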
A simplified version of the query looks something like:
SELECT *
FROM items i
WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
AND i.published BETWEEN ?2 AND ?3
AND ...(more filters)...
The query plan shows that this makes good use of indexes:
QUERY PLAN
|--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
`--LIST SUBQUERY 1
`--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
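If you want to inspect plans like this from application code, prefixing a statement with EXPLAIN QUERY PLAN works through any SQLite driver. Here is a hedged sketch using Python's sqlite3 module against a toy schema (the table and index definitions are my approximation, not Scour's real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, lang TEXT, published INTEGER);
    CREATE INDEX idx_items_lang_published ON items(lang, published);
    CREATE TABLE user_languages (user_id INTEGER, lang TEXT,
                                 PRIMARY KEY (user_id, lang));
""")

# EXPLAIN QUERY PLAN returns plan rows instead of data; the last column
# of each row is the human-readable detail string.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM items i
    WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
      AND i.published BETWEEN ?2 AND ?3
""", (1, 0, 10)).fetchall()

for row in plan:
    print(row[-1])  # e.g. SEARCH i USING INDEX idx_items_lang_published (...)
```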
Domain Filters Using Correlated Subqueries
To add user-specified domain blocklists, I created the user_excluded_domains table and added this filter clause to the main ranking query:
AND NOT EXISTS (
SELECT 1
FROM user_excluded_domains ued
WHERE user_id = ?1
AND ued.domain = i.domain
)
The domain exclusion table uses (user_id, domain) as a primary key, so the lookup is efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery:
QUERY PLAN
|--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
|--LIST SUBQUERY 1
| `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
`--CORRELATED SCALAR SUBQUERY 2
`--SEARCH ued USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=? AND domain=?)
The problem with this approach is that most users don't exclude any domains, yet we've added a check that runs for every row anyway.
To speed up queries for users who aren't using the feature, we could check the user's settings first and build the query dynamically. But we don't have to, because we can get the same effect within a single static query.
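For reference, the dynamic-build alternative would look something like this Python sketch (the query strings and the `build_ranking_query` helper are my own illustration):

```python
# Branch in application code: only append the correlated domain filter
# when the user actually has exclusions. It works, but leaves two query
# shapes to maintain instead of one.

BASE_QUERY = """
    SELECT * FROM items i
    WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
      AND i.published BETWEEN ?2 AND ?3
"""

DOMAIN_FILTER = """
      AND NOT EXISTS (
        SELECT 1 FROM user_excluded_domains ued
        WHERE ued.user_id = ?1 AND ued.domain = i.domain
      )
"""

def build_ranking_query(user_has_exclusions):
    """Return the ranking SQL, with the domain filter only when needed."""
    return BASE_QUERY + (DOMAIN_FILTER if user_has_exclusions else "")
```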
We can change our domain exclusion filter to first check whether the user has any excluded domains:
AND (
NOT EXISTS (
SELECT 1
FROM user_excluded_domains
WHERE user_id = ?1
)
OR NOT EXISTS (
SELECT 1
FROM user_excluded_domains ued
WHERE user_id = ?1
AND ued.domain = i.domain
)
)
Since the OR short-circuits, if the first NOT EXISTS returns true (when the user has no excluded domains), SQLite never evaluates the correlated subquery at all.
The first NOT EXISTS clause does not reference any column in items, so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when it returns true, lets us short-circuit and skip the more expensive correlated subquery that checks each item’s domain against the exclusion list.
Here is the query plan for the updated query. Note how the second subquery is a SCALAR SUBQUERY, whereas the third is a CORRELATED SCALAR SUBQUERY. The latter is the per-row check, and it is skipped entirely whenever the second one returns true.
QUERY PLAN
|--SEARCH i USING INDEX idx_items_lang_published (lang=? AND published>? AND published<?)
|--LIST SUBQUERY 1
| `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
|--SCALAR SUBQUERY 2
| `--SEARCH user_excluded_domains USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=?)
`--CORRELATED SCALAR SUBQUERY 3
`--SEARCH ued USING COVERING INDEX sqlite_autoindex_user_excluded_domains_1 (user_id=? AND domain=?)
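As a sanity check that the rewrite doesn't change results, here is a toy equivalence test in Python (the miniature schema and data are my own; only the two filter shapes come from the queries above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT);
    CREATE TABLE user_excluded_domains (user_id INTEGER, domain TEXT,
                                        PRIMARY KEY (user_id, domain));
    INSERT INTO items VALUES (1, 'a.com'), (2, 'b.com'), (3, 'c.com');
    INSERT INTO user_excluded_domains VALUES (2, 'b.com');
""")

CORRELATED = """
    SELECT id FROM items i
    WHERE NOT EXISTS (
      SELECT 1 FROM user_excluded_domains ued
      WHERE ued.user_id = ?1 AND ued.domain = i.domain)
    ORDER BY id
"""

SHORT_CIRCUIT = """
    SELECT id FROM items i
    WHERE (
      NOT EXISTS (SELECT 1 FROM user_excluded_domains WHERE user_id = ?1)
      OR NOT EXISTS (
        SELECT 1 FROM user_excluded_domains ued
        WHERE ued.user_id = ?1 AND ued.domain = i.domain)
    )
    ORDER BY id
"""

# User 1 has no exclusions; user 2 excludes b.com. Both filter shapes
# must return the same rows for both users.
for user_id in (1, 2):
    plain = conn.execute(CORRELATED, (user_id,)).fetchall()
    short = conn.execute(SHORT_CIRCUIT, (user_id,)).fetchall()
    assert plain == short
    print(user_id, [row[0] for row in plain])  # 1 -> [1, 2, 3]; 2 -> [1, 3]
```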
Benchmarking
To test the performance of each of these queries, I replaced the SELECT * with SELECT COUNT(*) and used a simple bash script to invoke the sqlite3 binary 100 times for each query on my laptop. Starting up the sqlite3 process each time adds overhead, but we’re comparing relative differences.
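The bash harness isn't shown here, but an equivalent in-process timing loop might look like this Python sketch (the `bench` helper is my own; running in-process avoids the per-invocation startup overhead, and the relative comparison is the same):

```python
import sqlite3
import statistics
import time

def bench(conn, query, params, runs=100):
    """Run a query repeatedly and return (min, max, avg, stddev) in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(query, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    return (min(samples), max(samples),
            statistics.mean(samples), statistics.stdev(samples))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(10000)])
print(bench(conn, "SELECT COUNT(*) FROM items", ()))
```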
At the time of this benchmark, the last week contained 235,975 items, 144,229 of which were in English. Both example users tested below only follow English content.
User Without Excluded Domains
This test represents most users, who have not configured any excluded domains:
| Approach | Min (ms) | Max (ms) | Avg (ms) | Stddev (ms) | Diff (ms) | Diff (%) |
|---|---|---|---|---|---|---|
| Baseline (no filter) | 67 | 91 | 72.7 | 4.7 | — | — |
| Correlated Subquery | 80 | 108 | 85.2 | 5.5 | +12.5 | +17.1% |
| With Short-Circuit | 69 | 91 | 72.7 | 3.8 | +0 | +0% |
This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users.
User with Excluded Domains
This test uses an example user that has excluded content from 2 domains:
| Approach | Min (ms) | Max (ms) | Avg (ms) | Stddev (ms) | Diff (ms) | Diff (%) |
|---|---|---|---|---|---|---|
| Baseline (no filter) | 68 | 99 | 76.2 | 7.6 | — | — |
| Correlated Subquery | 84 | 112 | 90.5 | 6.8 | +14.3 | +18.7% |
| With Short-Circuit | 82 | 109 | 88.5 | 8.1 | +12.3 | +16.1% |
In this case, we do need to check each row against the domain filter. Still, this shows that the extra uncorrelated check adds no measurable overhead on top of the correlated subquery (88.5 ms vs 90.5 ms is within the noise).
Conclusion
When using SQL subqueries to filter down result sets, it’s worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won’t help. However if the per-row check isn’t always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead.
This matters even more because the slow-down from each additional subquery compounds. In this blog post, I described and benchmarked a single additional filter, but it is only one of several subquery filters.
Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to the excluded-domains filter. Some users opt in to hiding paywalled content. For those users, we check whether each item is paywalled and, if so, whether it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach: first check whether the feature is enabled for the user, and only then does SQLite need to check each row.
Concretely, the paywalled content filter subquery looks like:
AND (
(
SELECT COALESCE(hide_paywalled_content, 0) = 0
FROM users
WHERE user_id = ?1
) -- note these parentheses are needed so SQLite doesn't mistakenly think this query is correlated with `items`
OR COALESCE(i.is_paywalled, 0) = 0
OR i.domain IN (
SELECT domain
FROM user_paywall_allowed_domains
WHERE user_id = ?1
)
)
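Here is a toy check of that filter's logic in Python (the miniature schema and data are my own; the predicate is the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY,
                        hide_paywalled_content INTEGER);
    CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT,
                        is_paywalled INTEGER);
    CREATE TABLE user_paywall_allowed_domains (user_id INTEGER, domain TEXT);
    INSERT INTO users VALUES (1, 0), (2, 1);
    INSERT INTO items VALUES
        (1, 'free.com', 0), (2, 'paid.com', 1), (3, 'nyt.com', 1);
    INSERT INTO user_paywall_allowed_domains VALUES (2, 'nyt.com');
""")

# User 1 has the feature off; user 2 hides paywalled content but has
# allowed nyt.com (e.g. they have a subscription there).
QUERY = """
    SELECT id FROM items i
    WHERE (
      (SELECT COALESCE(hide_paywalled_content, 0) = 0
       FROM users WHERE user_id = ?1)
      OR COALESCE(i.is_paywalled, 0) = 0
      OR i.domain IN (
        SELECT domain FROM user_paywall_allowed_domains WHERE user_id = ?1)
    )
    ORDER BY id
"""

print([r[0] for r in conn.execute(QUERY, (1,))])  # [1, 2, 3] -- filter off
print([r[0] for r in conn.execute(QUERY, (2,))])  # [1, 3] -- paid.com hidden
```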
In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don’t need it.
Appendix: NOT EXISTS vs NOT IN vs LEFT JOIN
There are multiple ways to exclude rows from an SQL query.
Here are the results from the same benchmark I ran above, but with two other ways of checking whether an item comes from an excluded domain.
The NULL-safe NOT IN version of the query uses this filter clause:
...
AND (
i.domain IS NULL
OR i.domain NOT IN (
SELECT domain
FROM user_excluded_domains
WHERE user_id = ?1
)
)
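The `i.domain IS NULL` guard is what makes this version NULL-safe. Without it, a NULL domain makes the whole `NOT IN` predicate evaluate to NULL, and the row is silently dropped even though nothing excludes it. A small Python demonstration (toy schema and data are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT);
    CREATE TABLE user_excluded_domains (user_id INTEGER, domain TEXT);
    INSERT INTO items VALUES (1, 'a.com'), (2, NULL);
    INSERT INTO user_excluded_domains VALUES (1, 'b.com');
""")

# Plain NOT IN: NULL NOT IN ('b.com') is NULL, so item 2 disappears.
unsafe = conn.execute("""
    SELECT id FROM items i
    WHERE i.domain NOT IN (
      SELECT domain FROM user_excluded_domains WHERE user_id = 1)
    ORDER BY id
""").fetchall()

# NULL-safe version: the IS NULL guard keeps the NULL-domain row.
safe = conn.execute("""
    SELECT id FROM items i
    WHERE i.domain IS NULL OR i.domain NOT IN (
      SELECT domain FROM user_excluded_domains WHERE user_id = 1)
    ORDER BY id
""").fetchall()

print(unsafe)  # [(1,)]
print(safe)    # [(1,), (2,)]
```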
The LEFT JOIN variation joins items with user_excluded_domains and then checks for NULL:
SELECT *
FROM items i
LEFT JOIN user_excluded_domains ued ON ued.user_id = ?1 AND ued.domain = i.domain
WHERE i.lang IN (SELECT lang FROM user_languages WHERE user_id = ?1)
AND i.published BETWEEN ?2 AND ?3
AND ued.domain IS NULL
And here are the full benchmarks:
User Without Excluded Domains
| Approach | Min (ms) | Max (ms) | Avg (ms) | Stddev (ms) | Diff (ms) | Diff (%) |
|---|---|---|---|---|---|---|
| Baseline (no filter) | 67 | 91 | 72.7 | 4.7 | — | — |
| NOT EXISTS (no short-circuit) | 80 | 108 | 85.2 | 5.5 | +12.5 | +17.1% |
| NOT EXISTS + short-circuit | 69 | 91 | 72.7 | 3.8 | +0 | +0% |
| NULL-safe NOT IN (no short-circuit) | 75 | 111 | 79.5 | 7.1 | +6.8 | +9.3% |
| NULL-safe NOT IN + short-circuit | 69 | 103 | 74.8 | 6.6 | +2.1 | +2.8% |
| LEFT JOIN (no short-circuit) | 74 | 100 | 79.1 | 5.1 | +6.4 | +8.8% |
| LEFT JOIN + short-circuit | 76 | 103 | 84.4 | 7.4 | +11.7 | +16.0% |
For users without excluded domains, we can see that the NOT EXISTS query using the short-circuit wins and adds no overhead.
User With Excluded Domains
| Approach | Min (ms) | Max (ms) | Avg (ms) | Stddev (ms) | Diff (ms) | Diff (%) |
|---|---|---|---|---|---|---|
| Baseline (no filter) | 68 | 99 | 76.2 | 7.6 | — | — |
| NOT EXISTS (no short-circuit) | 84 | 112 | 90.5 | 6.8 | +14.3 | +18.7% |
| NOT EXISTS + short-circuit | 82 | 109 | 88.5 | 8.1 | +12.3 | +16.1% |
| NULL-safe NOT IN (no short-circuit) | 83 | 112 | 89.7 | 8.4 | +13.5 | +17.7% |
| NULL-safe NOT IN + short-circuit | 84 | 112 | 91.3 | 8.2 | +15.1 | +19.8% |
| LEFT JOIN (no short-circuit) | 81 | 107 | 86.3 | 6.7 | +10.1 | +13.2% |
| LEFT JOIN + short-circuit | 82 | 126 | 89.8 | 7.7 | +13.6 | +17.8% |
For users who do have excluded domains, the LEFT JOIN is faster than the NOT EXISTS version. However, this version reintroduces the exact problem this blog post set out to address: the join happens no matter what, so we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, that is why I've gone with the NOT EXISTS subquery plus the short-circuit.