Building RAG (Retrieval Augmented Generation) apps usually starts with PDFs. 📄 But let’s be honest: users really want to chat with live URLs—documentation, wikis, and blogs. 🌐
I spent this weekend adding a Web Scraper to my RAG Starter Kit. Here is the technical breakdown of how I built it, so you can do it too. 👇
🛑 The Problem with Scraping for LLMs
You can’t just fetch(url) and pass the HTML to GPT-4.
- Too much noise: Navbars, footers, and ads waste tokens. 💸
- Context Window: Raw HTML is huge and confuses the model.
- Headless Browsers: Tools like Puppeteer are heavy and often time out on serverless functions (like Vercel). ⏳
🛠 The Stack
- Framework: Next.js 14
- Scraper: Cheerio (via LangChain). It parses HTML like jQuery, making it lightweight and fast. ⚡️
- Vector DB: Pinecone (Serverless)
Step 1: The Scraper Logic 🕷️
We use CheerioWebBaseLoader from LangChain. It grabs the raw HTML and lets us select only the body or specific content tags (like <article>), ignoring the junk.
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

export async function scrapeUrl(url) {
  // 1. Load the URL
  const loader = new CheerioWebBaseLoader(url, {
    selector: "p, h1, h2, h3, article", // 🎯 Only grab text content
  });

  const docs = await loader.load();
  return docs;
}
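To sanity-check it, call it on any public page (the URL below is just a placeholder). Each returned doc is a LangChain Document with the cleaned text in pageContent and the source URL in metadata:

// Sketch — hypothetical URL
const docs = await scrapeUrl("https://example.com/docs/getting-started");

console.log(docs[0].pageContent.slice(0, 200)); // first 200 chars of extracted text
console.log(docs[0].metadata.source); // the URL the content came from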
Step 2: The Cleaning (Smart Chunking) 🧹
LLMs need manageable chunks of text. If you cut a sentence in half, you lose context. We use RecursiveCharacterTextSplitter.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // 📏 Characters per chunk
  chunkOverlap: 200, // 🔗 Overlap to preserve context across chunks
});

const splitDocs = await splitter.splitDocuments(docs);
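A nice side effect: splitDocuments keeps the metadata from the loader, so every chunk still knows which URL it came from — handy for citing sources in the chat UI. A quick check:

console.log(splitDocs.length); // one long docs page becomes many ~1000-character chunks
console.log(splitDocs[0].metadata.source); // the original URL is preserved on each chunk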
Step 3: The Cost Hack (1024 Dimensions) 💡
This is the most important part! 💰
By default, OpenAI’s embedding models output 1536 dimensions. But Pinecone’s serverless pricing scales with how much vector data you store.
OpenAI’s new text-embedding-3-small allows you to "shorten" the dimensions with minimal accuracy loss.
I configured my implementation to force 1024 dimensions:
import { OpenAIEmbeddings } from "@langchain/openai"; // "langchain/embeddings/openai" on older LangChain versions

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024, // 📉 Saves ~33% on storage costs
});
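A quick sanity check (just a sketch): the vectors coming back should now have 1024 entries, which has to match the dimension you created your Pinecone index with.

const vector = await embeddings.embedQuery("What is FastRAG?");
console.log(vector.length); // 1024 — must match your Pinecone index dimension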
✅ The Result
We now have a clean pipeline: URL ➡️ Clean Text ➡️ Chunks ➡️ Vectors ➡️ Chat.
This allows users to point the app at their documentation and ask questions immediately.
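For reference, here is a rough end-to-end sketch of how the pieces above glue together with LangChain’s Pinecone integration. The index name, env var, and import paths are assumptions — adjust them to your own setup.

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone"; // older LangChain versions export this from "langchain/vectorstores/pinecone"

// Hypothetical index name — use whatever you created in the Pinecone console
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const pineconeIndex = pinecone.Index("fastrag-demo");

// URL ➡️ Clean Text ➡️ Chunks ➡️ Vectors
const docs = await scrapeUrl("https://example.com/docs");
const chunks = await splitter.splitDocuments(docs);
await PineconeStore.fromDocuments(chunks, embeddings, { pineconeIndex });

// ...and at chat time: pull the most relevant chunks for the user's question
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, { pineconeIndex });
const relevantChunks = await vectorStore.similaritySearch("How do I configure the scraper?", 4);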
🎁 Want the Full Source Code?
I cleaned up this entire logic (plus Multi-File PDF support, Mobile UI, and Streaming responses) and packaged it into a production-ready Starter Kit called FastRAG.
It saves you ~40 hours of boilerplate setup so you can focus on building your AI SaaS over the holiday break. 🎅
🏁 I’m running a "Holiday Build" race:
🥇 First 69 devs: Get 69% OFF (~$9). Code: FAST69
🥈 Everyone else: Get 40% OFF. Code: HOLIDAY40
Check out the Live Demo & Repo here: 👉 FastRAG
Happy coding and happy holidays! 🎄🚀