Building RAG (Retrieval Augmented Generation) apps usually starts with PDFs. But let's be honest: users really want to chat with live URLs such as documentation, wikis, and blogs.
I spent this weekend adding a Web Scraper to my RAG Starter Kit. Here is the technical breakdown of how I built it, so you can do it too.
The Problem with Scraping for LLMs
You can't just fetch(url) and pass the raw HTML to GPT-4.
- Too much noise: Navbars, footers, and ads waste tokens.
- Context window: Raw HTML is huge and confuses the model.
- Headless browsers: Tools like Puppeteer are heavy and often time out on serverless functions (like Vercel's).
The Stack
- Framework: Next.js 14
- Scraper: Cheerio (via LangChain). It parses HTML like jQuery, making it lightweight and fast.
- Vector DB: Pinecone (Serverless)
Step 1: The Scraper Logic
We use CheerioWebBaseLoader from LangChain. It grabs the raw HTML and lets us select only the body or specific content tags (like <article>), ignoring the junk.
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

export async function scrapeUrl(url) {
  // 1. Load the URL
  const loader = new CheerioWebBaseLoader(url, {
    selector: "p, h1, h2, h3, article", // Only grab text content
  });
  const docs = await loader.load();
  return docs;
}
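For context, here's a minimal sketch of how this could be called from a Next.js 14 route handler. The /api/scrape path, the request shape, and the @/lib/scraper import are my own placeholders, not necessarily how the kit is organized:

// app/api/scrape/route.js (hypothetical path)
import { NextResponse } from "next/server";
import { scrapeUrl } from "@/lib/scraper"; // wherever scrapeUrl lives in your project

export async function POST(req) {
  const { url } = await req.json();
  if (!url) {
    return NextResponse.json({ error: "Missing url" }, { status: 400 });
  }
  const docs = await scrapeUrl(url);
  // Each doc exposes pageContent (the scraped text) and metadata (e.g. the source URL)
  return NextResponse.json({ pages: docs.length });
}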
Step 2: The Cleaning (Smart Chunking)
LLMs need manageable chunks of text, and if you cut a sentence in half you lose context. We use RecursiveCharacterTextSplitter, which tries to split on paragraph and line breaks before falling back to individual characters.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // Characters per chunk (not tokens)
  chunkOverlap: 200, // Overlap to preserve context across chunk boundaries
});

const splitDocs = await splitter.splitDocuments(docs);
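Worth knowing: splitDocuments carries the loader's metadata onto every chunk, so each chunk still remembers which URL it came from. That's handy later if you want the chat to cite its sources:

// Each chunk is a Document: pageContent plus the metadata inherited from the loader
console.log(splitDocs.length);             // number of chunks
console.log(splitDocs[0].metadata.source); // the original URL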
Step 3: The Cost Hack (1024 Dimensions)
This is the most important part!
By default, OpenAI's embedding models output 1536 dimensions, and Pinecone charges based on how much vector data you store.
OpenAI's newer text-embedding-3-small lets you "shorten" the embeddings to fewer dimensions with minimal accuracy loss.
I configured my implementation to force 1024 dimensions:
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024, // Saves ~33% on storage costs vs. the default 1536
});
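One gotcha: the Pinecone index has to be created with the same dimension, or upserts will fail. Here's a rough sketch using the @pinecone-database/pinecone client; the index name, cloud, and region are placeholders:

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

// Dimension must match the embeddings config above (1024, not the default 1536)
await pc.createIndex({
  name: "fastrag-docs", // placeholder index name
  dimension: 1024,
  metric: "cosine",
  spec: { serverless: { cloud: "aws", region: "us-east-1" } },
});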
The Result
We now have a clean pipeline: URL → Clean Text → Chunks → Vectors → Chat.
This allows users to point the app at their documentation and ask questions immediately.
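The chat side isn't shown above, so here's a rough sketch of the retrieval half using PineconeStore from @langchain/pinecone. The index name and the example question are placeholders, and the actual kit layers the LLM call and streaming on top of this:

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";

// Must use the same model + dimensions the chunks were embedded with
const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024,
});

const pineconeIndex = new Pinecone({ apiKey: process.env.PINECONE_API_KEY }).index("fastrag-docs");
const store = await PineconeStore.fromExistingIndex(embeddings, { pineconeIndex });

// Pull the 4 chunks most relevant to the user's question,
// then hand them to the LLM as context for the answer
const results = await store.similaritySearch("How do I deploy this to Vercel?", 4);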
Want the Full Source Code?
I cleaned up this entire logic (plus multi-file PDF support, a mobile UI, and streaming responses) and packaged it into a production-ready Starter Kit called FastRAG.
It saves you the ~40 hours of boilerplate setup so you can focus on building your AI SaaS over the holiday break.
I'm running a "Holiday Build" race:
- First 69 devs: Get 69% OFF (~$9). Code: FAST69
- Everyone else: Get 40% OFF. Code: HOLIDAY40
Check out the Live Demo & Repo here: FastRAG
Happy coding and happy holidays!