Hi! Welcome to the next part of the series on LLM-based application development, this time dedicated to Retrieval-Augmented Generation, or simply RAG.
RAG is a pattern that very quickly became the foundation of many LLM-based applications. Why? Because it solves one of the biggest weaknesses of language models: limited knowledge.
Imagine a model that can write code, translate text, and answer questions beautifully — but it has no access to your internal company documentation and doesn’t know the latest information from the internet. The model’s knowledge ends at the moment it was trained.
With RAG, we don’t need to train a new model from scratch to “teach it” new things. We simply connect it to an external knowledge source.
How does RAG work?
Very simply — the pipeline consists of two main steps:
- Retrieval: we search for fragments of knowledge that match the user’s question.
- Generation: the LLM receives the question together with the retrieved context and generates an answer.
This mirrors how humans work: if I don’t know something, I don’t invent it — I look it up in sources, and then write a sensible answer based on what I found.
Where RAG works well
RAG is useful in many scenarios, for example:
- a chatbot answering employee questions based on internal documents,
- a legal assistant responding to questions about regulations,
- a recommendation system for e-commerce,
- analysis of financial reports or scientific data.
The key building blocks of a RAG pipeline
A critical aspect of RAG is the data source. We can load PDF documents, text files, web pages, or data from a SQL database. LangChain provides ready-made loaders such as PyPDFLoader, TextLoader, or WebBaseLoader.
1) Load
First, we load the documents.
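For example, a loading step with LangChain's loaders could look like the sketch below (the file name and URL are placeholders, not real sources):

from langchain_community.document_loaders import TextLoader, WebBaseLoader

# hypothetical sources - replace with your own file path and URL
text_docs = TextLoader("internal_handbook.txt").load()
web_docs = WebBaseLoader("https://example.com/faq").load()
docs = text_docs + web_docs
print(f"Loaded {len(docs)} documents")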
2) Split
Large documents are too long for the model to handle effectively. That’s why we split them into smaller chunks — for example 500–1000 characters — with a small overlap between chunks. The overlap helps keep chunks coherent and improves semantic matching.
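Continuing the sketch, the split step might look like this (the chunk size and overlap are illustrative values within the range mentioned above, not a universal recommendation):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(docs)  # docs loaded in the previous sketch
print(f"{len(docs)} documents -> {len(chunks)} chunks")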
3) Embed
Next come embeddings. Each chunk is converted into a vector of numbers that represents its meaning.
You can use OpenAIEmbeddings, CohereEmbeddings, or free Hugging Face models.
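As a quick illustration, embedding a single piece of text with OpenAIEmbeddings looks roughly like this (a Hugging Face model would be used the same way, only the embeddings class changes):

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("FAISS is a library for storing and searching embeddings.")
print(len(vector))   # dimensionality of the embedding vector (e.g. 1536 for the default OpenAI model)
print(vector[:5])    # the first few numbers of the vector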
4) Store
Then we store all embeddings in a vector store — a vector database. This can be:
- a local FAISS index,
- Qdrant running in Docker,
- or managed cloud services like Pinecone or Weaviate.
The vector store lets us find the most similar chunks for a given user question.
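A minimal local sketch with FAISS, continuing the example above (Qdrant or a managed service would only change the vector store class; the query is a made-up example):

from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embedding=embeddings)
# find the 2 chunks most similar to the query
results = vectorstore.similarity_search("How do I request vacation days?", k=2)
for doc in results:
    print(doc.page_content)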
5) Retrieve
Next is the retriever. This module takes the user question, turns it into an embedding, and searches the vector store for the nearest chunks. Those retrieved chunks become context for the model.
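Continuing the FAISS sketch, turning the vector store into a retriever is a single call (k=2 is just an example value):

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
context_docs = retriever.invoke("How do I request vacation days?")  # the 2 nearest chunks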
6) Generate
Finally, we use the LLM. Here we build the prompt: the user’s question, the context returned by the retriever, and an additional instruction like:
“Answer only based on the provided context. If the answer isn’t there — say you don’t know.”
Only now does the model generate the final response.
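A sketch of such a prompt, using the instruction quoted above (the exact wording and model are up to you; context_docs comes from the retriever sketch earlier):

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer only based on the provided context. If the answer isn't there - say you don't know.\n\nCONTEXT:\n{context}"),
    ("user", "{question}")
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
answer = (prompt | llm).invoke({"context": context_docs, "question": "How do I request vacation days?"})
print(answer.content)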
The whole pipeline in six words
You can describe the full RAG pipeline in six words:
load → split → embed → store → retrieve → generate
Alright — let’s move to the notebook.
Install libraries
!pip install langchain_text_splitters langchain_community langchain_openai faiss-cpu python-dotenv
Building a vector store (FAISS) and retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()

docs = [
    "LangChain is a framework for working with LLM.",
    "RAG combines context retrieval with answer generation.",
    "FAISS is a library for storing and searching embeddings.",
    "Retriever is used to find the most similar documents to the user's queries. The retriever can return a variable number of matching documents, specified in the k parameter. The retriever uses various text similarity algorithms, e.g., cosine matching, Euclidean distance, MMR."
]

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
split = splitter.create_documents(docs)
print(f"Number of chunks: {len(split)}")

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(split, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

query = "Why use a retriever?"
context = retriever.invoke(query)

print("Retrieved chunk:")
for i, c in enumerate(context, 1):
    print(f"{i}.", c.page_content)
output:
Number of chunks: 7
Retrieved chunk:
1. Retriever is used to find the most similar documents to the user's queries. The retriever can return
2. The retriever uses various text similarity algorithms, e.g., cosine matching, Euclidean distance,
Simple RAG chain (prompt + context + LLM)
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Give precise answers based solely on CONTEXT. If there is no data, say you don't know."),
    ("system", "CONTEXT:\n{context}"),
    ("user", "{question}")
])

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is FAISS and what is it for?"))
output:
FAISS is a library for storing and searching embeddings.
Example RAG — full program
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Model
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Source documents
docs = [
    "LangChain is a framework for working with LLM.",
    "RAG combines context matching with answer generation.",
    "FAISS is a library for storing and retrieving embeddings."
]

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
splits = splitter.create_documents(docs)

# Embeddings + vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# RAG prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "Respond only based on the context:\n{context}"),
    ("user", "{question}")
])

# Pipeline
rag_chain = (
    {
        "context": lambda x: retriever.invoke(x["question"]),
        "question": lambda x: x["question"]
    }
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke({"question": "What is FAISS?"}))
output:
FAISS is a library for storing and retrieving embeddings, which are numerical representations of data, often used in machine learning and information retrieval tasks.
RAG with loop and evaluation
from langchain_core.prompts import ChatPromptTemplate

eval_prompt = ChatPromptTemplate.from_messages([
    ("system", "Evaluate answer."),
    ("user", "Question: {question}\nAnswer: {answer}\nIs the answer correct? Respond with only 'yes' or 'no'.")
])

def rag_with_eval(question, max_retries):
    for attempt in range(max_retries):
        context = retriever.invoke(question)
        answer = (prompt | llm | StrOutputParser()).invoke({"context": context, "question": question})
        eval_result = (eval_prompt | llm | StrOutputParser()).invoke({"question": question, "answer": answer})
        print(f"Evaluation result {eval_result}")
        if "yes" in eval_result.lower():
            return f"✅ Answer approved:\n{answer}"
        print(f"❌ Answer: {answer}\n rejected, retrying...")
    return "Could not get the correct answer."

print(rag_with_eval("What is RAG?", max_retries=3))
output:
Evaluation result Yes.
✅ Answer approved:
RAG stands for Retrieval-Augmented Generation. It combines context matching with answer generation, allowing for more accurate and contextually relevant responses by retrieving information from a knowledge base before generating an answer.
That’s all for this part dedicated to Retrieval-Augmented Generation (RAG). In the next article we will build intuition for how vector databases, embeddings, and semantic search work.
see the next chapter
see the previous chapter
see the full code from this article in the GitHub repository