duck_tails; crawler; urlpattern
I was going to make some space between DuckDB-centric Drops, but I literally had to use one of the new Community extensions I discovered yesterday to solve a self-inflicted problem (not my doing, but I am in charge, so it’s technically my problem, now). And, if I’m going to show off one extension, I might as well pick two more!
TL;DR
(This is an LLM/GPT-generated summary of today’s Drop. Ollama and MiniMax M2.1.)
- The duck_tails extension adds git-aware capabilities to DuckDB, enabling version-controlled data workflows by allowing SQL queries against files in git repositories at any commit, branch, or tag (https://duckdb.org/community_extensions/extensions/duck_tails)
- The DuckDB crawler extension provides web scraping functionality with rate limiting, robots.txt compliance, HTML extraction via CSS selectors, article extraction, and MERGE syntax for upsert operations (https://duckdb.org/community_extensions/extensions/crawler)
- The urlpattern extension implements the WHATWG URL Pattern Standard for parsing, matching, and extracting components from URLs using pattern-based matching (https://duckdb.org/community_extensions/extensions/urlpattern)
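All three are DuckDB Community extensions, so they install via DuckDB's standard community-repository mechanism (the INSTALL … FROM community / LOAD syntax below is DuckDB's documented flow; extension names are taken from the links above):

```sql
-- One-time install of each extension from the DuckDB community repository
INSTALL duck_tails FROM community;
INSTALL crawler FROM community;
INSTALL urlpattern FROM community;

-- Load them into the current session before use
LOAD duck_tails;
LOAD crawler;
LOAD urlpattern;
```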
duck_tails
Photo by mauro carlo stella on Pexels.com
In the early days of $WORK (years before I joined), the folks doing systems engineering, network engineering, and software engineering/app dev were not "data" people. For the most part, that's still true, which sometimes causes a bit of pain for us data folk who need to perform research tasks.
One of those pain points is more of an annoyance than the others: how, where, and when the API and the app use a tag's name, id, and slug for input to, or output from, the caller.