LLM & AI Agent Applications with LangChain and LangGraph — Part 3: Model capacity, context windows, and what actually makes an LLM “large”
10 min read · Dec 7, 2025
Welcome to the next chapter in the series on LLM-based application development.
By this point we already have some basic intuition about how large language models work. Now I want to go one level deeper and talk about the parameters that make LLMs different from smaller text models, and about the components that appear in architectures such as GPT, the Generative Pre-trained Transformer.
The goal of this article is simple: when you see a model description like “X billion parameters, Y tokens of context”, I want you to immediately feel what this means in practice for your application.
Model capacity
The first parameter worth understanding is capacity. You can think of model capacity as a very rough proxy for its “intelligence”, or more precisely, its ability to learn and represent complex patterns.
A shallow neural network, with just a single hidden layer, behaves like a first-year student. It can learn the basics and solve simple problems, but it hits its limits fairly quickly. Our small example of a shallow network might have only fourteen parameters, representing connection weights and neuron biases. There is simply not much room in this structure to store sophisticated relationships.
A deep neural network, with many hidden layers and far more connections, is closer to a seasoned researcher with years of experience. It can represent far more complex mappings between inputs and outputs. In the toy example from the video course, the deep version of the network has around 229 parameters. That is already an order of magnitude more capacity than the shallow one, even though it is still tiny compared with real models.
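If you want to verify this kind of number yourself, counting the parameters of a fully connected network is a simple exercise. The sketch below uses illustrative layer sizes, not the exact architecture from the course: for each pair of consecutive layers we count one weight per connection plus one bias per neuron in the next layer.
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron in the next layer
    return total

# Illustrative sizes: 3 inputs, one hidden layer of 2 neurons, 2 outputs -> 14 parameters
print(count_parameters([3, 2, 2]))
# A deeper variant with three hidden layers already reaches around 200 parameters
print(count_parameters([3, 8, 8, 8, 2]))
The same bookkeeping, applied to transformer layers with thousands of units each, is how real models end up with billions of parameters.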
Large model parameters
When we move to modern language models such as GPT, those numbers jump dramatically. GPT-3, with 175 billion parameters, is closer to an entire university library combined with a team of experts than to a single person. It has space to encode an enormous variety of linguistic and factual patterns.
Newer generations, like GPT-5, stretch this idea even further. Some of the cutting-edge models are reported or estimated to have on the order of trillions of parameters. That scale translates directly into capacity. It does not automatically guarantee quality, because training data and procedure also matter, but it sets an upper bound on how much knowledge and structure the model can potentially hold.
Specialized parts of neural network
There is another key idea that I want you to remember, because it has strong consequences for application design: an LLM behaves like a multitool.
A large language model is not a single narrow specialist. It is closer to a Swiss army knife filled with mental tools. Inside this single model you can think of many “micro specialists”: one that is decent at English grammar, another that handles code, another that remembers medical patterns, another that has seen a lot of legal text, and so on. Of course they are not literally separate modules, but the model has internal regions and representations that behave a bit like that.
That is why the same model can answer questions about history, analyse a piece of Python code, draft an email and summarise a research paper. It is not a world class expert in every narrow niche, but it usually has enough general knowledge to be genuinely useful across many domains. You will still hit limits with very specialised or highly technical questions, especially in fast moving fields, but for a broad range of tasks the multitool nature is a huge advantage.
After capacity, another crucial concept appears: context.
Context is the working memory of the model, everything it “sees” and “remembers” during a single interaction. For LLMs this breaks down into three related ideas: input context, context window and semantic context.
Model context
Input context is simply the information you pass to the model in a request. This can include your current question, earlier messages in the conversation, system instructions, and sometimes external data attached through tools or RAG. You can think of it as a briefing before a meeting. The better and more relevant the briefing, the better the answer you can expect.
The context window is the technical limit of that memory. A model does not see your entire life story at once. It has a fixed size “desk” in front of it, and you can place only a certain number of tokens on that desk at the same time. If you try to place more, some of the earlier material will fall off the edge and the model will simply not have access to it during that call.
On top of that sits semantic context. This is the model’s ability to understand relationships between words and concepts. It knows that “king” and “queen” are related, it understands that “doctor” is likely connected with “hospital”, and it can follow a theme across a paragraph. This semantic structure is what lets the model stay on topic and reference earlier parts of the input in a meaningful way.
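To make the context window concrete in code, here is a minimal sketch of keeping a conversation inside a fixed token budget. It assumes the tiktoken library and a simplified per-message count; real applications usually also reserve room for the model's answer.
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer family used by recent OpenAI models

def count_tokens(messages):
    # Rough estimate: only the text content of each message is counted
    return sum(len(encoding.encode(m["content"])) for m in messages)

def trim_to_budget(messages, max_tokens=4000):
    # Keep the system prompt, drop the oldest turns until everything fits on the model's "desk"
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    while history and count_tokens(system + history) > max_tokens:
        history.pop(0)  # the earliest material falls off the edge first
    return system + history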
From a practical point of view, context size often reduces to a very concrete question: how much text can I send to the model in one go, and how much will it cost. To answer that, we have to understand the difference between words and tokens, because all context limits and pricing are expressed in tokens, not words.
Words are what we see in natural language: “house”, “is”, “beautiful”. Tokens are technical units created by a tokenizer. A token can be a whole word, part of a word, a punctuation mark, or even a single character. For example, in the sentence:
“Ada has a cat, but John has a dragon.”
we have nine words but eleven tokens, because the comma and the period are tokens of their own and some words may be split by a given tokenizer. This matters because when you call an API like OpenAI's, you are billed for the tokens you send in and the tokens that are generated, not for words.
For English text a common rough rule is that the number of words is about three quarters of the number of tokens for the same passage. For languages like Polish, which use diacritics and have more inflection, you often get more tokens for the same number of words. This means that a Polish prompt of a given length will eat a bit more of your context window and budget than an English one.
Word count vs token count
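You can check token counts yourself with the tiktoken library. The sketch below is only an approximation, since the exact numbers depend on which tokenizer a given model uses.
import tiktoken

sentence = "Ada has a cat, but John has a dragon."
encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models
tokens = encoding.encode(sentence)

print(len(sentence.split()))                   # word count: 9
print(len(tokens))                             # token count: depends on the tokenizer
print([encoding.decode([t]) for t in tokens])  # how the sentence was split into tokens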
Once you are comfortable thinking in tokens and context windows, comparing models becomes more concrete.
LLMs comparison
Consider a few example LLMs and their public parameters. One of the early open-source stars was Mixtral from Mistral AI. It has around 46.7 billion parameters and a context window of 32 thousand tokens. In practice that is somewhere around one hundred pages of A4 text in English; for Polish you would fit slightly fewer pages because of the token effect mentioned earlier. Mixtral is exposed through an API, but because it is open source, you can also download it and run it locally using tools such as Ollama.
GPT-5 from OpenAI plays in a different league. It is the flagship model in that ecosystem and offers a context window of about 400 thousand tokens. That is enough to hold an entire novel and still leave room to generate a new, substantial chapter in a single call. The exact number of parameters is not published, but estimates put it in the trillions. Access is provided exclusively through the OpenAI API.
Llama 4 Maverick is another example that pushes context size even further. It has roughly 400 billion parameters and supports context windows of up to a million tokens. You can imagine this as a gigantic notebook that the model keeps open all at once. It is available both through hosted APIs and as open weights that you can run on your own infrastructure if you have the hardware.
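A quick back-of-the-envelope calculation turns these context sizes into approximate page counts. The assumptions below, about 0.75 English words per token and roughly 250 words per A4 page, are rough rules of thumb rather than exact figures.
def tokens_to_pages(context_tokens, words_per_token=0.75, words_per_page=250):
    # Very rough conversion for English text; other languages usually fit fewer words per token
    return context_tokens * words_per_token / words_per_page

print(tokens_to_pages(32_000))     # Mixtral: roughly 96 pages
print(tokens_to_pages(400_000))    # GPT-5: roughly 1200 pages
print(tokens_to_pages(1_000_000))  # Llama 4 Maverick: roughly 3000 pages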
All of this power comes with a price on the hardware side.
Running a full scale LLM locally is not something you do casually on an old laptop. Large models can require many tens of gigabytes of RAM just to load the weights, before you even send a single token. To process them efficiently you usually need powerful GPUs, Graphics Processing Units, that are optimised for parallel numerical operations, or TPUs, specialised tensor units originally designed by Google. A server capable of training or serving the very largest models can cost hundreds of thousands of dollars when you include the whole setup.
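A similar back-of-the-envelope estimate shows where those gigabytes come from. The sketch below counts only the memory needed to hold the weights themselves at an assumed numerical precision, and ignores activations, the KV cache and other runtime overhead.
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    # 2 bytes per parameter corresponds to fp16/bf16, 4 bytes to fp32,
    # and 4-bit quantized formats need roughly 0.5 bytes per parameter
    return num_parameters * bytes_per_parameter / 1e9

print(weight_memory_gb(7e9))       # ~14 GB just for the weights of a 7B model in half precision
print(weight_memory_gb(175e9))     # ~350 GB for a GPT-3 sized model
print(weight_memory_gb(7e9, 0.5))  # ~3.5 GB for the same 7B model quantized to 4 bits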
That is why, in practice, most developers use hosted APIs. Instead of building your own concert hall, you rent one for the evenings when you need it. You pay per use, in tokens, and you let the provider deal with hardware, scaling, redundancy and maintenance.
Model specification
In some cases it still makes sense to run smaller or quantized models locally, especially for privacy sensitive applications or when you want a model embedded directly in an offline tool. But for this course and for many real world projects, we will treat LLMs as remote components accessible through an API and focus our energy on building the surrounding application logic.
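If you do want to try the local route, one convenient option is to talk to a model served by Ollama through its OpenAI-compatible endpoint, so the client code stays almost identical to the hosted case. The sketch below assumes Ollama is installed, running on its default port, and that the model has already been pulled (for example with ollama pull mixtral); the model name and URL may differ in your setup.
from openai import OpenAI

local_client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # the client requires a key; Ollama ignores its value
)

response = local_client.chat.completions.create(
    model="mixtral",
    messages=[{"role": "user", "content": "Explain in one sentence what a context window is."}],
)
print(response.choices[0].message.content)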
To make these trade-offs even more tangible, in the next paragraphs I will use a small code example to show how different models behave when we send them the same prompt, how context limits are reached and how pricing is affected.
For this experiment I created a PDF document inspired by a real document I encountered in my work on an LLM-based document analyzer. It contains a shareholder list with five rows, but three of them represent old data and are crossed out.
Example shareholder list document
After analyzing this document, we should extract the two current shareholders:
1. Jan Kowalski, 20K EUR, 80% of shares
2. Zdzislaw Malinowski, 5K EUR, 20% of shares
Now let's create a Python example that uses the OpenAI API to compare how the gpt-4o-mini and gpt-5-mini models analyze this document. First we import the libraries and prepare the prompt with instructions on how to parse the attached shareholder list. The PDF document is encoded as base64 and attached alongside the text.
import base64
from openai import OpenAI
from dotenv import load_dotenv
import pandas as pd
import json
import time

load_dotenv()
client = OpenAI()

with open("../../data/document.pdf", "rb") as f:
    data = f.read()
base64_string = base64.b64encode(data).decode("utf-8")

messages = [
    {
        "role": "system",
        "content": """You are an intelligent assistant analyzing company shareholder information.
You will be provided with a PDF containing shareholder data for the company.
Respond with only JSON code without any additional text or formatting. Avoid also adding markdown format.
Example output:
"shareholders": [
    {
        "shareholder_name": "Example company",
        "trade_register_info": "No 12345 Metropolis",
        "address": "Some street 10",
        "birthdate": "null",
        "share_amount": 11250,
        "share_percentage": 45.0
    },
    {
        "shareholder_name": "John Doe",
        "trade_register_info": null,
        "address": "Other street 11",
        "birthdate": "1965-04-12",
        "share_amount": 11250,
        "share_percentage": 45.0
    }
]"""
    },
    {
        "role": "user",
        "content": [
            {
                "type": "file",
                "file": {
                    "filename": "document.pdf",
                    "file_data": f"data:application/pdf;base64,{base64_string}",
                }
            },
            {
                "type": "text",
                "text": "What are shareholders of this company?",
            }
        ],
    },
]
Next, a request to gpt-4o-mini is made.
start = time.time()
completion4o = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)
time4o = time.time() - start
print(completion4o.usage)
Afterwards a similar request is sent to the newer reasoning model, gpt-5-mini.
start = time.time()
completion5mini = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages
)
time5mini = time.time() - start
print(completion5mini.usage)
The responses contain not only the generated text but also information about the consumed tokens, which we can compare between the two models.
usage4omini = completion4o.usage
usage5mini = completion5mini.usage

df_compare_models = pd.DataFrame(
    [
        {
            'time': time4o,
            'completion_tokens': usage4omini.completion_tokens,
            'prompt_tokens': usage4omini.prompt_tokens,
            'total_tokens': usage4omini.total_tokens,
            'reasoning_tokens': usage4omini.completion_tokens_details.reasoning_tokens,
        },
        {
            'time': time5mini,
            'completion_tokens': usage5mini.completion_tokens,
            'prompt_tokens': usage5mini.prompt_tokens,
            'total_tokens': usage5mini.total_tokens,
            'reasoning_tokens': usage5mini.completion_tokens_details.reasoning_tokens,
        },
    ],
    index=['gpt-4o-mini', 'gpt-5-mini'],
)
df_compare_models
token count comparison
df_compare_models[['completion_tokens', 'prompt_tokens', 'total_tokens', 'reasoning_tokens']].plot.bar(rot=0)
token count comparison
Here we can see that gpt-5-mini used far more prompt tokens and completion tokens, and also consumed an additional category of internal processing tokens: reasoning tokens.
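Token usage translates directly into cost. The sketch below multiplies the usage numbers from both calls by per-million-token prices; the prices are illustrative placeholders, so always check the current OpenAI pricing page before relying on them. Note that reasoning tokens are billed as part of the completion tokens.
# Illustrative (input, output) prices per million tokens in USD; verify against current pricing
PRICES_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-5-mini": (0.25, 2.00),
}

def estimate_cost(model, usage):
    input_price, output_price = PRICES_PER_1M[model]
    return (usage.prompt_tokens * input_price
            + usage.completion_tokens * output_price) / 1_000_000

print(estimate_cost("gpt-4o-mini", usage4omini))
print(estimate_cost("gpt-5-mini", usage5mini))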
Processing time was comparable in both API calls.
df_compare_models[['time']].plot.bar(rot=0)
response time comparison
Let's check the quality of the responses and how the models deal with analyzing the shareholder list. First gpt-4o-mini:
data = json.loads(completion4o.choices[0].message.content)['shareholders']
df4omini = pd.DataFrame(data)
df4omini
gpt-4o-mini response
and gpt-5-mini:
data = json.loads(completion5mini.choices[0].message.content)['shareholders']
df5mini = pd.DataFrame(data)
df5mini
gpt-5-mini response
We can see here that gpt-5-mini was 100% correct in recognizing the shareholders, while gpt-4o-mini responded with wrong values and even hallucinated an additional shareholder, "Anna Kowalska", who does not appear in the document. This shows that the newer generation of reasoning models can indeed give more precise answers with fewer hallucinations.
That is all for this episode. In the next one we will start bringing these concepts into code and you will see how model choice, context and tokens influence actual LangChain and LangGraph pipelines.
see the previous chapter
see the next chapter
see the GitHub repository with code examples: https://github.com/mzarnecki/course_llm_agent_apps_with_langchain_and_langgraph