Scour Bot
Scour is a personalized content feed service. It is effectively an RSS reader where users specify topics they are interested in and feeds to follow. Scour uses semantic search to rank items based on user interests.
For website owners, Scour can help direct traffic to your site by surfacing your content to users who are interested in related topics.
Scour does not use website content to train AI models.
Opt-Out
To opt out of having your content featured on Scour, email opt-out@scour.ing, or block the bot's IP addresses or User-Agent string.
Identifying the Scour Bot
User Agent
The bot sends the User-Agent header: ScourRSSBot/1.0 (+https://scour.ing/bot).
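For server operators who want to detect or block the bot, requests can be matched on this User-Agent. A minimal Python sketch; the version-tolerant pattern is an assumption, since only ScourRSSBot/1.0 is documented:

```python
import re

# Matches the documented User-Agent, tolerating future version bumps
# (e.g. ScourRSSBot/1.1). The exact future format is an assumption.
SCOUR_BOT_RE = re.compile(r"^ScourRSSBot/\d+\.\d+ \(\+https://scour\.ing/bot\)$")

def is_scour_bot(user_agent: str) -> bool:
    """Return True if the request's User-Agent identifies the Scour bot."""
    return bool(SCOUR_BOT_RE.match(user_agent))
```

For example, `is_scour_bot("ScourRSSBot/1.0 (+https://scour.ing/bot)")` returns `True`, while an ordinary browser User-Agent returns `False`.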
IP Addresses
The bot's outgoing IP addresses can be found in the IP list.
Feed Polling Behavior
Users Submit Feed URLs to Poll
Scour does not crawl or scrape every URL on a website.
Users submit feed URLs to check for updates. On submission, Scour checks whether the URL is a supported feed type (see below). If the URL is not a supported feed type, Scour automatically checks common feed paths (such as /rss.xml, /atom.xml, /feed.json, and /feed).
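The fallback behavior can be sketched as a function that lists the candidate feed URLs to try in order. The exact path list and ordering Scour uses are assumptions beyond the examples given above:

```python
from urllib.parse import urljoin

# Common feed locations to probe when the submitted URL is not itself a
# feed. The precise list Scour checks is an assumption.
COMMON_FEED_PATHS = ["/rss.xml", "/atom.xml", "/feed.json", "/feed"]

def candidate_feed_urls(url: str) -> list[str]:
    """List feed URLs to try: the submitted URL first, then the common
    feed paths resolved against the same site."""
    return [url] + [urljoin(url, path) for path in COMMON_FEED_PATHS]
```

For a submission like `https://example.com/blog`, this yields the blog URL itself followed by `https://example.com/rss.xml`, `https://example.com/atom.xml`, and so on.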
Supported Feed Types
Scour supports the following feed types:
- RSS
- Atom
- JSON Feed
Scour additionally supports some blogs that do not have RSS feeds but have a page that lists posts. For example, Mixedbread's blog does not have an RSS feed but the blog page lists posts by title and date. Scour checks pages such as these for new posts as if they were RSS feeds.
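Extracting posts from a listing page without a feed can be sketched with Python's standard html.parser. The markup below is hypothetical, and a real implementation would need site-specific handling of titles and dates:

```python
from html.parser import HTMLParser

class PostListParser(HTMLParser):
    """Collect (title, href) pairs from anchor tags on a post-listing
    page. A sketch: real listing pages need site-specific selectors."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.posts.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical listing markup for illustration only.
parser = PostListParser()
parser.feed('<ul><li><a href="/post-1">First Post</a></li>'
            '<li><a href="/post-2">Second Post</a></li></ul>')
# parser.posts now holds [("First Post", "/post-1"), ("Second Post", "/post-2")]
```

New posts would then be detected by comparing the extracted links against those seen on a previous poll, much like comparing feed items.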
Checking for Feed Updates
- Scour checks feeds for updates every 15 minutes (900 seconds).
- The bot sends 1 request per feed, independent of how many users are subscribed to the feed through Scour.
- The bot sends the If-Modified-Since header with the exact contents of the Last-Modified header of the feed or, if that header was not present, the date that the most recent feed item was published. If the feed returns a 304 Not Modified response, Scour assumes the feed has not been updated and skips the rest of the process.
- If the feed contains new content, Scour will parse the feed content. It will also make a GET request to the post URL to get the full content of the post.
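The conditional-request flow above can be sketched with Python's urllib. Function names are illustrative, and persistence of the Last-Modified value between polls is omitted:

```python
import urllib.error
import urllib.request

def build_request(url, last_modified, newest_item_date):
    """Prefer the Last-Modified value saved from the previous response;
    otherwise fall back to the newest item's publication date."""
    since = last_modified or newest_item_date
    return urllib.request.Request(url, headers={"If-Modified-Since": since})

def poll_feed(url, last_modified, newest_item_date):
    """Fetch a feed conditionally: return the body if it changed, or
    None on 304 Not Modified. A sketch; storing the Last-Modified value
    between polls and other error handling are omitted."""
    req = build_request(url, last_modified, newest_item_date)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # feed unchanged; skip parsing
        raise
```

On a 304 response, nothing is downloaded or parsed, which keeps each poll to a single cheap request when nothing has changed.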
Robots.txt
Scour does not follow the robots.txt exclusion list.
My understanding of the Robots Exclusion Protocol is that it is not intended to apply to software that acts as an agent on behalf of a human user (like a web browser). Every feed that Scour polls was manually subscribed to by a human user, and Scour merely checks it for updates.
Sadly, I have also found that many websites with RSS feeds have robots.txt files that, intentionally or not, block access to their feed URLs. If RSS readers like Scour honored these exclusions, it would defeat the purpose of having an RSS feed.
How Scour Uses Website Content
Scour does not use website content to train AI models.
Scour generates an embedding for each post and user interest. It uses these embeddings to determine which posts match a user's interests.
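Matching posts to interests by embedding typically means comparing vectors with cosine similarity. A minimal sketch; the similarity measure and ranking below are assumptions, as the document only states that embeddings determine matches:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_posts(interest_embedding, posts):
    """Order (post_id, embedding) pairs by similarity to an interest,
    most similar first. The data shapes here are illustrative."""
    return sorted(
        posts,
        key=lambda p: cosine_similarity(interest_embedding, p[1]),
        reverse=True,
    )
```

With toy 2-dimensional embeddings, a post whose vector points in nearly the same direction as the interest vector ranks ahead of an orthogonal one.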
Scour primarily surfaces links for users to click on, directing traffic to the original source.
Complaints, Questions, and Feedback
Scour is run by a single developer, Evan Schwartz.
If you have any complaints, questions, or feedback, please reach out to me at bot@scour.ing.