Building a content scanner for keywords
I needed to monitor keywords across multiple social platforms and notify users within 15 minutes of new mentions. Each platform has different APIs, rate limits, and reliability characteristics. This post covers the architecture decisions I made and what I learned.
The Problem
Monitoring content across platforms means dealing with:
- Different authentication schemes (OAuth, API keys, public endpoints)
- Rate limits from 300/hour to unlimited
- Varying data freshness (some APIs lag by minutes)
- Inconsistent reliability (some platforms have 99.9% uptime, others significantly less)
I scan Hacker News, Lobsters, Bluesky, Mastodon, and GitHub Discussions. Each scan runs every 10 minutes via a cron-triggered edge function on Supabase.
Scanner Architecture
Each platform gets its own scanner class implementing a simple interface:
abstract class BaseScanner {
  protected abstract platformName: string;

  abstract scan(lastScanTime: Date): Promise<RawContentItem[]>;

  protected async fetchWithTimeout(
    url: string,
    options?: RequestInit,
    timeoutMs = 30000,
  ): Promise<Response> {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await fetch(url, {
        ...options,
        signal: controller.signal,
      });
      clearTimeout(timeout);
      return response;
    } catch (error) {
      clearTimeout(timeout);
      throw error;
    }
  }
}
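To make the interface concrete, here's a minimal sketch of a scanner built on it, using Algolia's Hacker News search_by_date endpoint (the RawContentItem shape here is illustrative; my actual scanner normalizes more metadata):

class HackerNewsScanner extends BaseScanner {
  protected platformName = "hackernews";

  async scan(lastScanTime: Date): Promise<RawContentItem[]> {
    // numericFilters lets Algolia return only items created after the last scan.
    const since = Math.floor(lastScanTime.getTime() / 1000);
    const url =
      "https://hn.algolia.com/api/v1/search_by_date" +
      `?tags=(story,comment)&numericFilters=created_at_i>${since}&hitsPerPage=100`;
    const response = await this.fetchWithTimeout(url);
    if (!response.ok) {
      throw new Error(`[${this.platformName}] HTTP ${response.status}`);
    }
    const { hits } = await response.json();
    return hits.map((hit: any) => ({
      url: `https://news.ycombinator.com/item?id=${hit.objectID}`,
      title: hit.title ?? "",
      content: hit.story_text ?? hit.comment_text ?? "",
      platform: this.platformName,
      publishedAt: new Date(hit.created_at_i * 1000),
    }));
  }
}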
The orchestrator runs scanners in parallel using Promise.allSettled(). When one platform’s API is down, others continue unaffected. I track last_polled_at per platform in PostgreSQL, allowing each scanner to request only new content since the last successful scan.
async runAllScans(): Promise<ScanResult[]> {
  const platforms = ['hackernews', 'lobsters', 'bluesky', 'mastodon', 'github_discussions'];
  const results = await Promise.allSettled(
    platforms.map(async (platform) => {
      const scanner = this.scanners.get(platform);
      if (!scanner) throw new Error(`No scanner registered for ${platform}`);
      const lastScanTime = await this.getLastScanTime(platform);
      const items = await scanner.scan(lastScanTime);
      await this.processContent(items, platform);
      return { platform, itemCount: items.length };
    })
  );
  return results.map((r, i) => ({
    platform: platforms[i],
    success: r.status === 'fulfilled',
    itemCount: r.status === 'fulfilled' ? r.value.itemCount : 0,
  }));
}
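getLastScanTime is a one-row lookup. Here's a sketch, assuming a scan_state table keyed by platform (the table name is mine; last_polled_at is the column mentioned above):

private async getLastScanTime(platform: string): Promise<Date> {
  const { data } = await this.supabase
    .from("scan_state")
    .select("last_polled_at")
    .eq("platform", platform)
    .maybeSingle();
  // First run (or missing state): fall back to a 24-hour backfill window.
  const fallback = new Date(Date.now() - 24 * 60 * 60 * 1000);
  return data?.last_polled_at ? new Date(data.last_polled_at) : fallback;
}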
This design made adding new platforms straightforward. Adding GitHub Discussions took 2 hours including testing. The tradeoff is more code—five scanners instead of one generic implementation—but the isolation is worth it.
Rate Limiting Reality
I hit rate limits during testing. Hard.
Bluesky’s public API allows 3,000 requests per 5 minutes. Sounds generous until you’re scanning 200 keywords and growing at 20% per day. I implemented request counting per scanner with exponential backoff:
private requestCount = 0;
private windowStart = Date.now();
private readonly maxRequests = 2400; // 80% of Bluesky's 3,000/5min limit
private readonly windowMs = 5 * 60 * 1000;

async rateLimit(): Promise<void> {
  // Reset the counter when a new window starts.
  if (Date.now() - this.windowStart > this.windowMs) {
    this.requestCount = 0;
    this.windowStart = Date.now();
  }
  // At the soft limit, sleep out the remainder of the window.
  if (this.requestCount >= this.maxRequests) {
    const waitTime = this.windowMs - (Date.now() - this.windowStart);
    console.log(`[${this.platformName}] Rate limit approached, waiting ${waitTime}ms`);
    await new Promise(resolve => setTimeout(resolve, waitTime));
    this.requestCount = 0;
    this.windowStart = Date.now();
  }
  this.requestCount++;
}
The snippet above is the simplified version. In practice, when I detect I'm at 80% of the limit, I slow down; at 90%, I stop and resume in the next cycle. Here's an interesting site that does this even better: https://firesky.tv/
Hacker News via Algolia has no documented rate limit, but I saw 429s after ~100 requests in quick succession. Now I space requests by 100ms minimum using a simple queue.
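The spacing queue is nothing fancy. A sketch of the idea: chain every request through a single promise so each one starts only after the previous one finishes plus a minimum gap:

// Minimal dispatch queue: each request starts at least minGapMs after the
// previous one completes, which keeps bursts under the undocumented limit.
class SpacedQueue {
  private last: Promise<unknown> = Promise.resolve();

  constructor(private readonly minGapMs = 100) {}

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const result = this.last
      .then(() => new Promise((resolve) => setTimeout(resolve, this.minGapMs)))
      .then(task);
    // Keep the chain alive even when a request fails.
    this.last = result.catch(() => undefined);
    return result;
  }
}

// const queue = new SpacedQueue(100);
// const res = await queue.enqueue(() => fetchWithTimeout(url));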
GitHub’s GraphQL API uses a point system (5,000 points/hour). Each query costs roughly 50 points. I batch repo queries and cache DID lookups for 24 hours in a Map<string, string> in memory.
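A plain Map<string, string> has no expiry, so the 24-hour window is enforced by storing a timestamp alongside each value. Roughly like this (names are illustrative; since edge function instances can be recycled, this is best-effort and a cold start simply repopulates it):

const DAY_MS = 24 * 60 * 60 * 1000;
const cache = new Map<string, { value: string; cachedAt: number }>();

function getCached(key: string): string | undefined {
  const entry = cache.get(key);
  if (!entry) return undefined;
  // Evict entries older than 24 hours so stale lookups get refreshed.
  if (Date.now() - entry.cachedAt > DAY_MS) {
    cache.delete(key);
    return undefined;
  }
  return entry.value;
}

function setCached(key: string, value: string): void {
  cache.set(key, { value, cachedAt: Date.now() });
}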
The lesson: read the docs, then test anyway. Documented limits and real limits differ.
Keyword Matching Performance
Initial implementation checked each keyword against each content item: O(N×M). With 500 keywords and 1,000 posts per scan cycle, that’s 500,000 comparisons.
I moved matching into PostgreSQL using a single function call:
CREATE FUNCTION match_keywords_for_content(
  p_content_id INT,
  p_content_text TEXT,
  p_platform TEXT
) RETURNS TABLE(keyword_id INT, user_ids TEXT[]) AS $$
BEGIN
  RETURN QUERY
  WITH matched_keywords AS (
    SELECT
      gk.id AS kw_id,
      array_agg(DISTINCT uks.user_id) AS uids
    FROM global_keywords gk
    JOIN user_keyword_subscriptions uks ON gk.id = uks.keyword_id
    WHERE
      p_content_text ILIKE '%' || gk.keyword || '%'
      AND (uks.platforms ? p_platform OR uks.platforms IS NULL)
      AND uks.notify = true
    GROUP BY gk.id
  )
  INSERT INTO keyword_matches (keyword_id, content_id, detected_at)
  SELECT mk.kw_id, p_content_id, NOW()
  FROM matched_keywords mk
  ON CONFLICT (keyword_id, content_id) DO NOTHING
  RETURNING keyword_matches.keyword_id, (
    SELECT mk2.uids FROM matched_keywords mk2
    WHERE mk2.kw_id = keyword_matches.keyword_id
  );
END;
$$ LANGUAGE plpgsql;
The function does ILIKE matching, filters by platform preference using JSONB’s ? operator, and inserts matches in one transaction. Query time dropped from ~2,000ms to 15ms for 100 keywords. Composite indexes on (keyword_id, detected_at) and (content_id, keyword_id) made timeline queries fast.
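The edge function invokes this through supabase-js's rpc() call. A minimal sketch, assuming the content row was already inserted (queueNotifications is a hypothetical helper):

// Runs the ILIKE matching inside Postgres; returns one row per matched keyword.
const { data: matches, error } = await supabase.rpc(
  "match_keywords_for_content",
  {
    p_content_id: contentId,
    p_content_text: `${item.title} ${item.content}`,
    p_platform: platform,
  }
);
if (error) throw error;

// Each row carries the keyword and every subscribed user to notify.
for (const match of matches ?? []) {
  queueNotifications(match.keyword_id, match.user_ids); // hypothetical helper
}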
I considered PostgreSQL’s full-text search but stuck with ILIKE for simplicity. Users expect "nextjs" to match "Next.js"—full-text search requires more configuration for case insensitivity and punctuation handling. The performance is acceptable for now.
Platform-Specific Challenges
Bluesky (AT Protocol): No timestamp filtering on the search API. I fetch 100 posts per keyword and check timestamps client-side. Wasteful but unavoidable. Cursor-based pagination helped reduce duplicate fetches. The API returns posts in a nested structure:
interface BskySearchResponse {
  posts: Array<{
    uri: string;
    cid: string;
    author: { handle: string; displayName: string };
    record: {
      text: string;
      createdAt: string;
    };
  }>;
  cursor?: string;
}
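The client-side filtering looks roughly like this. A sketch against the public AppView, sorted by latest so I can stop paging once I cross the time boundary (endpoint and parameters as I understand the public API):

async function fetchNewPosts(
  keyword: string,
  lastScanTime: Date,
): Promise<BskySearchResponse["posts"]> {
  const collected: BskySearchResponse["posts"] = [];
  let cursor: string | undefined;
  do {
    const params = new URLSearchParams({ q: keyword, limit: "100", sort: "latest" });
    if (cursor) params.set("cursor", cursor);
    const res = await fetch(
      `https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts?${params}`,
    );
    const page: BskySearchResponse = await res.json();
    const fresh = page.posts.filter(
      (p) => new Date(p.record.createdAt) > lastScanTime,
    );
    collected.push(...fresh);
    // As soon as a page contains anything older than the boundary, stop paging.
    if (fresh.length < page.posts.length) break;
    cursor = page.cursor;
  } while (cursor);
  return collected;
}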
Mastodon: I only scan mastodon.social, Fosstodon, and Techhub.social. Federation means I miss content from other instances, but monitoring all instances isn’t feasible. The public timeline API returns HTML content—I strip tags with a regex:
function stripHtml(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<[^>]*>/g, "")
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&") // decode &amp; last to avoid double-unescaping
    .trim();
}
Not perfect but works for keyword matching.
GitHub Discussions: GraphQL query costs vary by complexity. Initially my queries cost 200 points each (repos × discussions × comments). I simplified to fetch only discussion bodies, reducing cost to 50 points:
query ($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    discussions(
      first: 50
      after: $cursor
      orderBy: { field: CREATED_AT, direction: DESC }
    ) {
      nodes {
        title
        body
        url
        createdAt
        author {
          login
        }
      }
      pageInfo {
        hasNextPage
        endCursor
      }
    }
  }
}
I monitor 15 trending repos (including Next.js, React, VS Code, Supabase, and Node.js), chosen from GitHub's trending section. Users can't customize this list yet.
Hacker News: Algolia’s API is the most reliable. Consistent latency (50-100ms), no auth required, and clean JSON responses. The 1,000 result limit per query hasn’t been an issue. I use their search_by_date endpoint sorted by timestamp descending.
Lobsters: Small community means fewer posts but higher signal. API is simple and fast. No rate limiting observed even with aggressive polling. Returns clean JSON with created_at timestamps for efficient filtering.
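That scanner is barely more than a fetch and a filter. A sketch against the newest.json endpoint (field names are my assumption about their public JSON):

interface LobstersStory {
  short_id: string;
  title: string;
  url: string;
  created_at: string; // ISO timestamp, so filtering is a simple date comparison
}

async function scanLobsters(lastScanTime: Date): Promise<LobstersStory[]> {
  const res = await fetch("https://lobste.rs/newest.json");
  const stories: LobstersStory[] = await res.json();
  return stories.filter((s) => new Date(s.created_at) > lastScanTime);
}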
RSS: Highly targeted—tech blogs and news sites. It's a straightforward, well-trodden path. I use a simple RSS parser and store feed URLs in the database. I recently added the ability to add your own feeds—I added my Reddit subscriptions to mine to pull in Reddit posts.
Database Schema Decisions
I deduplicate keywords globally. When two users track "typescript", I store one row in global_keywords and link both users via user_keyword_subscriptions:
CREATE TABLE global_keywords (
  id SERIAL PRIMARY KEY,
  keyword TEXT UNIQUE NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE user_keyword_subscriptions (
  id SERIAL PRIMARY KEY,
  user_id TEXT NOT NULL,
  keyword_id INT REFERENCES global_keywords(id) ON DELETE CASCADE,
  platforms JSONB DEFAULT '[]',
  notify BOOLEAN DEFAULT true,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(user_id, keyword_id)
);

CREATE INDEX idx_uks_user_notify ON user_keyword_subscriptions(user_id, notify);
CREATE INDEX idx_uks_keyword ON user_keyword_subscriptions(keyword_id) WHERE notify = true;
This reduced the keywords table from 5,000 rows to 800 rows for current users.
The join adds complexity but the reduction in keyword matching work is significant. I run one ILIKE comparison per unique keyword, not per user subscription.
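Subscribing a user is a two-step upsert. A sketch with supabase-js (error handling elided; lowercasing the keyword is an assumption about normalization):

async function subscribe(userId: string, keyword: string, platforms: string[]) {
  // Reuse the existing global keyword row if another user already tracks it.
  const { data: kw } = await supabase
    .from("global_keywords")
    .upsert({ keyword: keyword.toLowerCase() }, { onConflict: "keyword" })
    .select("id")
    .single();

  await supabase.from("user_keyword_subscriptions").upsert(
    { user_id: userId, keyword_id: kw!.id, platforms },
    { onConflict: "user_id,keyword_id" },
  );
}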
Content deduplication uses a UNIQUE constraint on (url, platform). The database handles this during INSERT, returning conflicts:
const { data, error } = await supabase
  .from("content_items")
  .insert({ url, title, content, platform, published_at })
  .select("id")
  .single();

if (error?.code === "23505") {
  // Duplicate, skip silently
  return null;
}
I log skipped items but don’t treat them as errors.
What I Got Wrong
Initial keyword matching in application code: Moving to SQL cut latency by 99% and simplified the code. I should have started with database-side matching.
Scanning everything every time: Before tracking last_polled_at, I scanned all historical content. Database grew to 2GB in a week of testing. Now I fetch only new content with a 24hr backfill for reliability.
Individual email notifications: I used to send an email per keyword match. Users with popular keywords got dozens of emails per hour. I implemented 15-minute batching, grouping all matches per user into one email. Email costs dropped 85%, from ~$150/month projected to ~$22/month actual.
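The batching itself is just a group-by over pending matches. A sketch of the shape, with hypothetical types and helpers:

interface PendingMatch {
  userId: string;
  keyword: string;
  title: string;
  url: string;
}

// Collapse all matches accumulated in the last 15 minutes into one email per user.
function batchByUser(pending: PendingMatch[]): Map<string, PendingMatch[]> {
  const byUser = new Map<string, PendingMatch[]>();
  for (const match of pending) {
    const batch = byUser.get(match.userId) ?? [];
    batch.push(match);
    byUser.set(match.userId, batch);
  }
  return byUser;
}

// One Resend call per user instead of one per match:
// for (const [userId, matches] of batchByUser(pending)) {
//   await sendDigestEmail(userId, matches); // hypothetical helper
// }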
Current Performance
- Scan cycle: 10 minutes (pg_cron trigger)
- Processing time per cycle: 30-45 seconds for ~200 keywords across 5 platforms
- Keyword matching: 5-15ms per content item (100 keywords)
- Database queries: Analytics dashboard <200ms, timeline charts <100ms
- Email delivery: 15-minute batches, ~10 emails per cycle
Limitations
I don't do real-time monitoring, but it's close. The 10-minute polling interval means users see mentions with a delay. WebSocket connections would improve latency but increase cost significantly (an estimated 3x). For tracking technical keywords—where discussions move slowly—10 minutes is acceptable.
ILIKE matching produces false positives. "react" matches "reaction". I’m actively working on this and plan to add negative keywords to help. Full-text search would help but adds operational complexity (managing tsvector columns, reindexing, etc.).
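One cheap mitigation is a word-boundary post-filter on top of the ILIKE results. A sketch of the idea (isWholeWordMatch is a hypothetical helper that would run in the edge function after the SQL match):

// Reject substring hits like "react" inside "reaction" by requiring word
// boundaries around the keyword; regex metacharacters are escaped first.
function isWholeWordMatch(keyword: string, text: string): boolean {
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\b`, "i");
  return pattern.test(text);
}

isWholeWordMatch("react", "a reaction to the news"); // false
isWholeWordMatch("react", "I migrated to React 19"); // true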
I don't handle platform API changes proactively. If a scanner starts failing, I get notified and fix it then.
What’s Next
Thinking about adding sentiment analysis to filter low-quality matches. But running inference on every post is expensive—I’m testing sampling strategies (analyze 10% of posts, flag keywords that consistently have low-quality matches).
Webhook support would let users push matches to Slack, Discord, or custom endpoints. The challenge is retry logic and handling slow subscriber endpoints without blocking the pipeline. Considering using a separate queue worker for this.
The system processes ~50,000 posts per day on average across 5 platforms, with rare spikes approaching 10,000 posts in a single scan. I optimized for developer time over theoretical performance, so the architecture and tech stack are focused on doing things the dumb/simple way until something breaks.
My tech stack is just Supabase for both DB and edge functions, Fly.io for hosting the Next.js app, and Resend for emails. Total infrastructure cost: roughly $55/month (Supabase Pro $25, Fly.io ~$10, Resend $20). The entire codebase is ~5,000 lines of TypeScript and is maintained by just me.