Extracting a clean introduction from an article page sounds simple, but in practice it is messy. Every site has a different DOM structure, and most pages are filled with navigation, ads, cookie banners, and unrelated UI. This post describes a pragmatic, production-ready approach built with Elixir, conventional HTML parsing, and an LLM that is tightly constrained to extract rather than generate.

The guiding principle is simple: let deterministic code do as much work as possible, and only use an LLM where structural understanding is genuinely needed.

The overall approach

At a high level, the pipeline looks like this:

  1. Fetch the raw HTML of an article
  2. Strip out obvious non-content elements
  3. Send the cleaned HTML to an LLM with very strict instructions
  4. Receive c…
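
The steps above can be sketched in Elixir. This is a minimal illustration, assuming Req for HTTP and Floki for HTML parsing; the module name, the prompt wording, and the `MyLLM.complete/1` client call are all placeholders for whatever integration you actually use.

```elixir
defmodule IntroExtractor do
  # Tags that are almost never article content and can be stripped
  # deterministically before the LLM ever sees the page.
  @noise ~w(script style nav header footer aside form iframe)

  def extract(url) do
    url
    |> fetch_html()
    |> strip_noise()
    |> extract_with_llm()
  end

  # 1. Fetch the raw HTML of the article.
  defp fetch_html(url), do: Req.get!(url).body

  # 2. Strip obvious non-content elements with plain parsing.
  defp strip_noise(html) do
    {:ok, doc} = Floki.parse_document(html)

    @noise
    |> Enum.reduce(doc, fn tag, acc -> Floki.filter_out(acc, tag) end)
    |> Floki.raw_html()
  end

  # 3. Send the cleaned HTML to an LLM with strict, extract-only
  # instructions. `MyLLM.complete/1` is hypothetical — swap in your
  # actual LLM client.
  defp extract_with_llm(cleaned_html) do
    prompt = """
    Return only the introductory paragraphs of this article, verbatim.
    Do not summarize, rephrase, or add any text of your own.

    #{cleaned_html}
    """

    MyLLM.complete(prompt)
  end
end
```

The deterministic `strip_noise/1` step is doing the heavy lifting here: the smaller and cleaner the HTML that reaches the model, the cheaper the call and the less room the LLM has to hallucinate.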
