Stop writing for loops to crawl websites. Seriously. It’s 2026, and we just got the upgrade XPath was missing for 20 years.
If you have ever built a web scraper, you know the Drill of Pain™:
- Write a Python script.
- Import `requests` and `BeautifulSoup`.
- Fetch a URL.
- Parse the HTML.
- Write a loop to find links.
- Recursively call the function (and accidentally DDoS the site because you forgot a `time.sleep`).
It’s imperative, it’s messy, and it’s hard to read 6 months later.
But a new open-source tool just dropped on GitHub, and it is effectively "SQL for the Web."
Meet wxpath.
🤯 The Paradigm Shift: Crawling Inside the Query
Usually, XPath is just a selector language. You download the page first, then you use XPath to pick the title.
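That two-step flow, in miniature (a stdlib sketch; `ElementTree` supports only a small subset of XPath, and the HTML here is a hard-coded stand-in for a page you already downloaded):

```python
import xml.etree.ElementTree as ET

# Stand-in for a response body you fetched separately.
page = "<html><body><h1>Hello, XPath</h1></body></html>"

# Step 1: parse the document you already have.
doc = ET.fromstring(page)

# Step 2: use an XPath-style selector to pick the title.
title = doc.find(".//h1").text
print(title)  # Hello, XPath
```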
wxpath changes the rules. It adds a custom url(...) operator directly into the XPath language. This means your selector doesn’t just find data—it travels to new pages to get it.
The Old Way (Python + Requests + Parsel)
```python
# 🤮 The Boilerplate Nightmare
import requests
from parsel import Selector

def crawl(url):
    response = requests.get(url)
    sel = Selector(text=response.text)

    # Extract data
    title = sel.xpath("//h1/text()").get()

    # Find next link and repeat...
    next_page = sel.xpath("//a[@class='next']/@href").get()
    if next_page:
        crawl(next_page)  # Hope you handle recursion depth!
```
The wxpath Way
```python
# 😍 The "One-Liner" (Declarative)
import wxpath

# Fetch the URL, find the link, FETCH THAT LINK, and return the title
query = "url('https://site.com')/url(//a[@class='next']/@href)//h1/text()"
results = wxpath.query(query)
```
Do you see that? The logic of navigation is embedded in the query.
⚡ Feature Spotlight: The "Deep Crawl" Operator (///)
The killer feature isn’t just fetching one page. It’s the /// operator.
In standard XPath, // means "search anywhere in this document."
In wxpath, /// means "crawl anywhere in this network."
If you want to crawl all links on a Wikipedia page, and then for each of those pages, extract the title and the first paragraph, you don’t need a loop. You need a map.
Scraping a Knowledge Graph in one expression:
```
url('https://en.wikipedia.org/wiki/Web_scraping')
  ///url(//div[@id='bodyContent']//a/@href)
  /map {
    'title': //h1/text(),
    'summary': (//p)[1]/text(),
    'source_url': base-uri(.)
  }
```
This single expression:
- Fetches the seed URL.
- Finds all links in the body.
- Concurrently crawls those links.
- Returns a clean JSON-like object (Map) for each result.
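If the maps come back as plain Python dicts (an assumption; the snippet above doesn't show wxpath's exact return types), turning them into JSON Lines is one step:

```python
import json

# Hypothetical results standing in for what the deep-crawl query might yield.
results = [
    {"title": "Web scraping", "summary": "Web scraping is...",
     "source_url": "https://en.wikipedia.org/wiki/Web_scraping"},
    {"title": "Data mining", "summary": "Data mining is...",
     "source_url": "https://en.wikipedia.org/wiki/Data_mining"},
]

# One JSON object per line, ready for jq, pandas, or a data pipeline.
jsonl = "\n".join(json.dumps(r, sort_keys=True) for r in results)
print(jsonl)
```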
🛠️ Why This is "Viral" Tech
We are moving away from Imperative Coding (telling the computer how to do it) to Declarative Coding (telling the computer what you want).
We saw this with React (UI). We saw this with GraphQL (APIs). Now we are seeing it with wxpath (Scraping).
Key Features that make it production-ready:
- XPath 3.1 Support: It supports modern Maps and Arrays (`/map{...}`), so your output is ready for JSON serialization immediately.
- Async Under the Hood: It uses `aiohttp` to fetch pages concurrently, even though the query looks linear.
- Politeness Built-in: It respects `robots.txt` and handles rate limiting automatically, so you don’t become "that guy" who crashes a server.
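For context, the `robots.txt` handling wxpath reportedly automates is the kind of check you'd otherwise write yourself with Python's standard library (the rules below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would fetch it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("mybot", "https://example.com/articles/1"))  # True
print(rp.can_fetch("mybot", "https://example.com/private/x"))   # False
print(rp.crawl_delay("mybot"))  # 2 -- seconds to wait between requests
```

Forgetting either check is exactly how the accidental-DDoS scenario from the intro happens.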
🚀 How to Try It
It’s a Python library, so installation is standard:
```
pip install wxpath
```
Then run the CLI to test a query instantly:
```
wxpath "url('https://news.ycombinator.com')//a[@class='storylink']/text()"
```
🔮 The Verdict
Is this the end of Scrapy? Probably not for massive, enterprise-scale crawls. But for the 90% of use cases where you just need to "grab data from these linked pages," wxpath is a cheat code.
It turns a 50-line script into a 3-line query. That is the kind of efficiency we live for.
Star the repo while it’s hot: 👉 github.com/rodricios/wxpath