Building RAG (Retrieval Augmented Generation) apps usually starts with PDFs. 📄 But let’s be honest: users really want to chat with live URLs—documentation, wikis, and blogs. 🌐
I spent this weekend adding a Web Scraper to my RAG Starter Kit. Here is the technical breakdown of how I built it, so you can do it too. 👇
🛑 The Problem with Scraping for LLMs
You can’t just fetch(url) and pass the HTML to GPT-4.
- Too much noise: Navbars, footers, and ads waste tokens. 💸
- Context Window: Raw HTML is huge and confuses the model.
- Headless Browsers: Tools like Puppeteer are heavy and often time out on serverless functions (like Vercel). ⏳
🛠 The Stack
- Framework: Next.js 14
- Scraper: Cheerio (via LangChain). It parses HTML like jQuery, making it lightweight and fast. ⚡️
- Vector DB: Pinecone (Serverless)
Step 1: The Scraper Logic 🕷️
We use CheerioWebBaseLoader from LangChain. It grabs the raw HTML and lets us select only the body or specific content tags (like <article>), ignoring the junk.
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

export async function scrapeUrl(url) {
  // 1. Load the URL
  const loader = new CheerioWebBaseLoader(url, {
    selector: "p, h1, h2, h3, article", // 🎯 Only grab text content
  });

  const docs = await loader.load();
  return docs;
}
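To sanity-check it, call it on any public page (the URL below is just a placeholder). Each returned doc is a LangChain Document with the cleaned text in pageContent and the source URL in metadata:

// Sketch — hypothetical URL
const docs = await scrapeUrl("https://example.com/docs/getting-started");

console.log(docs[0].pageContent.slice(0, 200)); // first 200 chars of extracted text
console.log(docs[0].metadata.source); // the URL the content came from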
Step 2: The Cleaning (Smart Chunking) 🧹
LLMs need manageable chunks of text. If you cut a sentence in half, you lose context. We use RecursiveCharacterTextSplitter.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // 📏 Characters per chunk
  chunkOverlap: 200, // 🔗 Overlap to preserve context across chunks
});

const splitDocs = await splitter.splitDocuments(docs);
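A nice side effect: splitDocuments keeps the metadata from the loader, so every chunk still knows which URL it came from — handy for citing sources in the chat UI. A quick check:

console.log(splitDocs.length); // one long docs page becomes many ~1000-character chunks
console.log(splitDocs[0].metadata.source); // the original URL is preserved on each chunk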
Step 3: The Cost Hack (1024 Dimensions) 💡
This is the most important part! 💰
By default, OpenAI’s embedding models output 1536 dimensions. But Pinecone’s serverless pricing scales with how much vector data you store.
OpenAI’s new text-embedding-3-small allows you to "shorten" the dimensions with minimal accuracy loss.
I configured my implementation to force 1024 dimensions:
import { OpenAIEmbeddings } from "@langchain/openai"; // "langchain/embeddings/openai" on older LangChain versions

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024, // 📉 Saves ~33% on storage costs
});
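A quick sanity check (just a sketch): the vectors coming back should now have 1024 entries, which has to match the dimension you created your Pinecone index with.

const vector = await embeddings.embedQuery("What is FastRAG?");
console.log(vector.length); // 1024 — must match your Pinecone index dimension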
✅ The Result
We now have a clean pipeline: URL ➡️ Clean Text ➡️ Chunks ➡️ Vectors ➡️ Chat.
This allows users to point the app at their documentation and ask questions immediately.
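For reference, here is a rough end-to-end sketch of how the pieces above glue together with LangChain’s Pinecone integration. The index name, env var, and import paths are assumptions — adjust them to your own setup.

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone"; // older LangChain versions export this from "langchain/vectorstores/pinecone"

// Hypothetical index name — use whatever you created in the Pinecone console
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const pineconeIndex = pinecone.Index("fastrag-demo");

// URL ➡️ Clean Text ➡️ Chunks ➡️ Vectors
const docs = await scrapeUrl("https://example.com/docs");
const chunks = await splitter.splitDocuments(docs);
await PineconeStore.fromDocuments(chunks, embeddings, { pineconeIndex });

// ...and at chat time: pull the most relevant chunks for the user's question
const vectorStore = await PineconeStore.fromExistingIndex(embeddings, { pineconeIndex });
const relevantChunks = await vectorStore.similaritySearch("How do I configure the scraper?", 4);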
🎁 Want the Full Source Code?
I cleaned up this entire logic (plus Multi-File PDF support, Mobile UI, and Streaming responses) and packaged it into a production-ready Starter Kit called FastRAG.
It saves you ~40 hours of boilerplate setup so you can focus on building your AI SaaS over the holiday break. 🎅
🏁 I’m running a "Holiday Build" race:
🥇 First 69 devs: Get 69% OFF (~$9). Code: FAST69
🥈 Everyone else: Get 40% OFF. Code: HOLIDAY40
Check out the Live Demo & Repo here: 👉 FastRAG
Happy coding and happy holidays! 🎄🚀