applications of LLMs are the ones that I like to call the “wow effect LLMs.” There are plenty of viral LinkedIn posts about them, and they all sound like this:
“I built [x] that does [y] in [z] minutes using AI.”
Where:
- [x] is usually something like a web app/platform
- [y] is a somewhat impressive main feature of [x]
- [z] is usually an integer between 5 and 10.
- “AI” is really, most of the time, an LLM wrapper (Cursor, Codex, or similar)
If you look carefully, the focus of the sentence is not really the quality of the analysis but the amount of time you save. This is to say that, when dealing with a task, people are not excited about the LLM’s *output quality* in tackling the problem; they are thrilled that the LLM is spitting out something quick that might sound like a solution to their problem.
This is why I refer to them as wow-effect LLMs. As impressive as they sound and look, these wow-effect LLMs have multiple issues that prevent them from actually being deployed in a production environment. Here are some of them:
- The prompt is usually not optimized: you don’t have time to test all the different versions of the prompts, evaluate them, and provide examples in 5-10 minutes.
- They are not meant to be sustainable: in that short a time, you can develop a nice-looking plug-and-play wrapper. By default, you are throwing all the cost, latency, maintainability, and privacy considerations out the window.
- They usually lack context: LLMs are powerful when they are plugged into a big infrastructure, they have decisional power over the tools that they use, and they have contextual data to augment their answers. No chance of implementing that in 10 minutes.
Now, don’t get me wrong: LLMs are designed to be intuitive and easy to use. This means that evolving LLMs from the wow effect to production-level is not rocket science. However, it requires a specific methodology that needs to be implemented.
The goal of this blog post is to provide this methodology. The points we will cover to move from wow-effect LLMs to production-level LLMs are the following:
- LLM System Requirements. When this beast goes into production, we need to know how to maintain it. This is done in stage zero, through adequate system requirements analysis.
- Prompt Engineering. We are going to optimize the prompt structure and provide some best-practice prompt strategies.
- Force structure with schemas and structured output. We are going to move from free text to structured objects, so the format of your response is fixed and reliable.
- Use tools so the LLM does not work in isolation. We are going to let the model connect to data and call functions. This provides richer answers and reduces hallucinations.
- Add guardrails and validation around the model. Check inputs and outputs, enforce business rules, and define what happens when the model fails or goes out of bounds.
- Combine everything into a simple, testable pipeline. Orchestrate prompts, tools, structured outputs, and guardrails into a single flow that you can log, monitor, and improve over time.
We are going to use a very simple case: **we are going to make an LLM judge data scientists’ tests.** This is just a concrete case to avoid a totally abstract and confusing article. The procedure is general enough to be adapted to other LLM applications, typically with very minor adjustments.
Looks like we’ve got a lot of ground to cover. Let’s get started!
Image generated by author using Excalidraw Whiteboard
The whole code and data can be found here.
Tough choices: cost, latency, privacy
Before writing any code, there are a few important questions to ask:
- How complex is your task? Do you really need the latest and most expensive model, or can you use a smaller one or an older family?
- How often do you run this, and at what latency? Is this a web app that must respond on demand, or a batch job that runs once and stores results? Do users expect an immediate answer, or is “we’ll email you later” acceptable?
- What is your budget? You should have a rough idea of what is “ok to spend”. Is it $1k, $10k, $100k? And, compared to that, would it make sense to train and host your own model, or is that clearly overkill?
- What are your privacy constraints? Is it ok to send this data through an external API? Is the LLM seeing sensitive data? Has this been approved by whoever owns legal and compliance?
Let me throw some examples at you. If we consider OpenAI, this is the table to look at for prices:
Image from https://platform.openai.com/docs/pricing
For simple tasks, where you have a low budget and need low latency, the smaller models (for example the GPT-4.x mini family or GPT-5 nano) are usually your best bet. They are optimized for speed and price, and for many basic use cases like classification, tagging, light transformations, or simple assistants, you will barely notice the quality difference while paying a fraction of the cost.
For more complex tasks, such as complex code generation, long-context analysis, or high-stakes evaluations, it can be worth using a stronger model in the GPT-5.x family, even at a higher per-token cost. In those cases, you are explicitly trading money and latency for better decision quality.
If you are running large offline workloads, for example re-scoring or re-evaluating thousands of items overnight, batch endpoints can significantly reduce costs compared to real-time calls. This often changes which model fits your budget, because you can afford a “bigger” model when latency is not a constraint.
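To make the budget question concrete, here is a small back-of-envelope calculation. The request volumes and per-token prices below are hypothetical placeholders, not current OpenAI list prices; plug in the numbers from the pricing table above.
# Back-of-envelope cost estimate -- all numbers are hypothetical placeholders
REQUESTS_PER_DAY = 2_000
INPUT_TOKENS_PER_REQUEST = 1_500
OUTPUT_TOKENS_PER_REQUEST = 400

# Hypothetical prices in USD per 1M tokens; replace with the values from the pricing page
PRICE_INPUT_PER_1M = 0.40
PRICE_OUTPUT_PER_1M = 1.60

daily_cost = (
    REQUESTS_PER_DAY * INPUT_TOKENS_PER_REQUEST / 1_000_000 * PRICE_INPUT_PER_1M
    + REQUESTS_PER_DAY * OUTPUT_TOKENS_PER_REQUEST / 1_000_000 * PRICE_OUTPUT_PER_1M
)
print(f"Estimated cost: ${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")
Running this kind of estimate for two or three candidate models makes the “which model can I actually afford” question much less abstract.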
From a privacy standpoint, it is also good practice to only send non-sensitive or “sensitive-cleared” data to your provider, meaning data that has been cleaned to remove anything confidential or personal. If you need even more control, you can consider running local LLMs.
Image made by author using Excalidraw Whiteboard
The specific use case
For this article, we’re building an **automated grading system for Data Science exams**. Students take a test that requires them to analyze actual datasets and answer questions based on their findings. The LLM’s job is to grade these submissions by:
- Understanding what each question asks
- Accessing the correct answers and grading criteria
- Verifying student calculations against the actual data
- Providing detailed feedback on what went wrong
This is a perfect example of why LLMs need tools and context. You could indeed take a plug-and-play approach and grade everything through a single prompt and API call: it would have the wow effect, but it would not work well in production. Without access to the datasets and grading rubrics, the LLM cannot grade accurately. It needs to retrieve the actual data to verify whether a student’s answer is correct.
Our exam is stored in **test.json** and contains 10 questions across three sections. Students must analyze three different datasets: e-commerce sales, customer demographics, and A/B test results. Let’s look at a few example questions:
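The full file is in the repository; just to give you a flavor, a single question entry could look roughly like this (the structure and field names here are illustrative assumptions, not necessarily the ones used in the actual file):
# Illustrative structure of one entry in test.json (field names are assumptions)
example_question = {
    "question_number": 4,
    "section": "E-commerce sales",
    "question": (
        "Calculate the average order value (AOV) for customers who used the "
        "discount code 'SAVE20'. What percentage of total orders used this code?"
    ),
    "dataset": "ecommerce_sales",
    "points": 10,
}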
As you can see, the questions are data-related, so the LLM will need a tool to analyze these questions. We will go back to this.
Image made by author using Excalidraw Whiteboard
Building the prompt
When I use ChatGPT for small daily questions, I am terribly lazy, and I don’t pay attention to prompt quality at all, and that is ok. Imagine if, just to check the current state of the housing market in your city, you had to sit down at your laptop and write thousands of lines of Python code. Not very appealing, right?
However, to truly get the best prompt for your production-level LLM application, there are some key components to follow:
- Clear Role Definition. WHO the LLM is and WHAT expertise it has.
- System vs User Messages. The system message carries the task-level instructions for the LLM. The user message carries the specific prompt to run, i.e., the current request from the user.
- Explicit Rules with Chain-of-Thought. This is the list of steps that the LLM has to follow to perform the task. This step-by-step reasoning triggers the Chain-of-Thought, which improves performance and reduces hallucinations.
- Few-Shot Examples. This is a list of examples that explicitly shows the LLM how to perform the task: in our case, correct grading examples.
It is usually a good idea to have a prompt.py file, with SYSTEM_PROMPT, USER_PROMPT_TEMPLATE, and FEW_SHOT_EXAMPLES. This is the example for our use-case:
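What follows is a simplified sketch of what such a file could contain; the exact wording of the prompts in the repository may differ.
# prompt.py -- simplified sketch, not the exact prompts from the repository

SYSTEM_PROMPT = """You are an expert Data Science grader.
You grade student exam submissions by verifying their answers against the actual datasets.
Follow these steps for every question:
1. Read the question and the student's answer.
2. Retrieve the ground truth answer and the grading rubric.
3. Use the available tools to verify any numbers against the data.
4. Assign a score between 0 and 10 and explain your reasoning."""

USER_PROMPT_TEMPLATE = """Question {question_number}: {question_text}

Student answer:
{student_answer}

Grade this answer following the rubric and the steps above."""

FEW_SHOT_EXAMPLES = """Example:
Question: What is the average order value?
Student answer: The AOV is 54.3.
Verified value from the data: 54.27.
Score: 10/10. The student's value matches the data within rounding."""


def get_grading_prompt(question_number: int, question_text: str, student_answer: str) -> str:
    """Build the user prompt for a single question from the reusable template."""
    return USER_PROMPT_TEMPLATE.format(
        question_number=question_number,
        question_text=question_text,
        student_answer=student_answer,
    )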
So the prompts that we will reuse are stored as constants, while the ones that change based on the student answer are obtained from get_grading_prompt.
Image made by author using Excalidraw Whiteboard
Output Formatting
If you look closely, the output in the few-shot example already has a sort of “structure”. At the end of the day, the score should be “packaged” in a production-ready format: a free-text string is not an acceptable output.
In order to do that, we are going to use the magic of Pydantic. Pydantic allows us to easily create a schema that can then be passed to the LLM, which will build the output based on that schema.
This is our schemas.py file:
If you focus on GradingResult, you can see fields like these:
question_number: int = Field(..., ge=1, le=10, description="Question number (1-10)")
points_earned: float = Field(..., ge=0, le=10, description="Points earned out of 10")
points_possible: int = Field(default=10, description="Maximum points for this question")
Now, imagine that we want to add a new feature (e.g. completeness_of_the_answer), this would be very easy to do: you just add it to the schema. However, keep in mind that the prompt should reflect the way your output will look.
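Putting it together, a rough sketch of the model could look like this (the feedback field and the hypothetical completeness_of_the_answer field are illustrative additions, not necessarily part of the actual schemas.py):
# schemas.py -- simplified sketch of the structured output model
from pydantic import BaseModel, Field


class GradingResult(BaseModel):
    question_number: int = Field(..., ge=1, le=10, description="Question number (1-10)")
    points_earned: float = Field(..., ge=0, le=10, description="Points earned out of 10")
    points_possible: int = Field(default=10, description="Maximum points for this question")
    feedback: str = Field(..., description="Detailed feedback on what went wrong")
    # Hypothetical extra field: adding it here (and reflecting it in the prompt) is all it takes
    completeness_of_the_answer: float = Field(default=1.0, ge=0, le=1, description="How complete the answer is (0-1)")
Because the response is parsed into this typed object, malformed or out-of-range values can be caught immediately instead of silently slipping through.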
Image made by author using Excalidraw Whiteboard
Tools Description
The /data folder has:
- A list of datasets, which will be the topic of our questions (e.g., Calculate the average order value (AOV) for customers who used the discount code “SAVE20”. What percentage of total orders used this discount code?). This folder contains a set of tables, which represent the data that the students should analyze when taking the test.
- The grading rubric dataset, which will describe how we are going to evaluate each question.
- The ground truth dataset, which will describe the ground truth answer for every question.
We are going to give the LLM free rein over these datasets; we are letting it explore each file based on the specific question.
For example, get_ground_truth_answer() allows the LLM to pull the ground truth for a given question, and query_dataset() lets it run simple operations on the datasets, like computing the mean, max, and count.
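How exactly these get registered as tools depends on the framework, but as plain Python the two helpers could look something like this (file names and column names are assumptions, not necessarily the ones in the repository):
# tools.py -- simplified sketch of two data-access tools; paths and columns are illustrative
import json
import pandas as pd

DATA_DIR = "data"


def get_ground_truth_answer(question_number: int) -> str:
    """Return the ground truth answer for a given question number."""
    with open(f"{DATA_DIR}/ground_truth.json") as f:
        ground_truth = json.load(f)
    return ground_truth[str(question_number)]


def query_dataset(dataset_name: str, column: str, operation: str) -> float:
    """Run a simple aggregation (mean, max, count) on a column of one of the datasets."""
    df = pd.read_csv(f"{DATA_DIR}/{dataset_name}.csv")
    operations = {
        "mean": lambda s: s.mean(),
        "max": lambda s: s.max(),
        "count": lambda s: s.count(),
    }
    return float(operations[operation](df[column]))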
Even in this case, it is worth noticing that tools, schema, and prompt are completely customizable. If your LLM has access to 10 tools, and you need to add one more functionality, there is no need to do any structural change to the code: the only thing to do is to add the functionality in terms of prompt, schema, and tool.
Image made by author using Excalidraw Whiteboard
Guardrails Description
In Software Engineering, you recognize a good system by how gracefully it fails. This shows the amount of work that has been put into the task.
In this case, some “graceful failures” are the following:
- The input should be sanitized: the question ID should exist, the student’s answer text should exist, and not be too long
- The output should be sanitized: the question ID should exist, the score should be between 0 and 10, and the output should be in the correct format defined by the Pydantic schema.
- The output should “make sense”: you cannot give a perfect score if the feedback lists errors, or a 0 if there are none.
- A rate limit should be implemented: in production, you don’t want to accidentally run thousands of threads at once for no reason. It is best to implement a RateLimit check.
This part is slightly boring, but very necessary. Because it is necessary, it is included in my GitHub folder; because it is boring, I won’t copy-paste it all here. You’re welcome! 🙂
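That said, just to give you a feel for it, here is a minimal sketch of the kind of checks involved (simplified, assuming the feedback field from the schema sketch above, and not the exact code from the repository):
# guardrails.py -- simplified sketch of input/output validation
MAX_ANSWER_LENGTH = 5_000


def validate_input(question_number: int, student_answer: str) -> None:
    """Sanity-check the input before it ever reaches the LLM."""
    if not 1 <= question_number <= 10:
        raise ValueError(f"Unknown question number: {question_number}")
    if not student_answer or not student_answer.strip():
        raise ValueError("Empty student answer")
    if len(student_answer) > MAX_ANSWER_LENGTH:
        raise ValueError("Student answer is suspiciously long")


def validate_output(result: "GradingResult") -> None:
    """Sanity-check the structured output before it is stored or returned."""
    if not 0 <= result.points_earned <= result.points_possible:
        raise ValueError("Score out of range")
    # The score must be consistent with the feedback: no perfect score with errors listed
    if result.points_earned == result.points_possible and "error" in result.feedback.lower():
        raise ValueError("Perfect score but the feedback mentions errors")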
Image made by author using Excalidraw Whiteboard
Whole pipeline
The whole pipeline is implemented through CrewAI, which is built on top of LangChain. The logic is simple:
- The crew is the main object that is used to generate the output for a given input with a single command (crew.kickoff()).
- An agent is defined: this wraps the tools, the prompts, and the specific LLM (e.g., GPT-4 with a given temperature). This is connected to the crew.
- The task is defined: this is the specific task that we want the LLM to perform. This is also connected to the crew.
Now, the magic is that the task is connected to the tools, the prompts, and the Pydantic schema. This means that all the dirty work is done in the backend. The pseudo-code looks like this:
from crewai import Agent, Task, Crew, Process

# SYSTEM_PROMPT, tools_list, llm, description, expected_output, and GradingResult
# come from the prompt, tools, and schema files described above.

# The agent wraps the system prompt, the tools, and the LLM configuration
agent = Agent(
    role="Expert Data Science Grader",
    goal="Grade student data science exam submissions accurately and fairly by verifying answers against actual datasets",
    backstory=SYSTEM_PROMPT,
    tools=tools_list,
    llm=llm,
    verbose=True,
    allow_delegation=False,
    max_iter=15,
)

# The task describes what we want the agent to do and enforces the Pydantic schema
task = Task(
    description=description,
    expected_output=expected_output,
    agent=agent,
    output_json=GradingResult,  # Enforce structured output
)

# The crew ties agent and task together and runs them with a single call
crew = Crew(
    agents=[agent],
    tasks=[task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff()
Now, let’s say we have the following JSON file with the student’s work:
We can use the following main.py file to process this:
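The real main.py lives in the repository; a stripped-down sketch of what it does could look like this (grade_submission and the pipeline module are hypothetical stand-ins for the CrewAI logic above):
# main.py -- stripped-down sketch; grade_submission is a stand-in for the crew.kickoff() pipeline
import argparse
import json

from pipeline import grade_submission  # hypothetical module wrapping the CrewAI logic above


def main() -> None:
    parser = argparse.ArgumentParser(description="Grade student submissions with the LLM pipeline")
    parser.add_argument("--submission", required=True, help="Path to the submission JSON file")
    parser.add_argument("--limit", type=int, default=None, help="Only grade the first N answers")
    parser.add_argument("--output", required=True, help="Where to write the structured grading results")
    args = parser.parse_args()

    with open(args.submission) as f:
        answers = json.load(f)
    if args.limit is not None:
        answers = answers[: args.limit]

    # Each answer goes through the full pipeline: prompt, tools, schema, guardrails
    results = [grade_submission(answer) for answer in answers]

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()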
And run it through:
python main.py --submission ../data/test.json \
--limit 1 \
--output ../results/test_llm_output.json
This kind of setup is exactly how production-level code works: the output is passed through an API as a structured piece of information, and the code needs to run on that piece of data.
This is what the terminal will display:
Image made by author
As you can see from the screenshot above, the input is processed through the LLM, but before the output is produced, the CoT is triggered, the tools are called, and the tables are read.
And this is what the output looks like (test_llm_output.json):
This is a good example of how LLMs can be used to their full potential. At the end of the day, the main advantage of LLMs is their ability to read context efficiently. The more context we provide (tools, rule-based prompting, few-shot prompting, output formatting), the less the LLM will have to “fill the gaps” (usually by hallucinating) and the better job it will eventually do.
Image generated by author using Excalidraw Whiteboard
Conclusions
Thank you for sticking with me throughout this long, but hopefully not too painful, blog post. 🙂
We covered a lot of fun stuff. More specifically, we started from the wow-effect LLMs, the ones that look great in a LinkedIn post but fall apart as soon as you ask them to run every day, within a budget, and under real constraints.
Instead of stopping at the demo, we walked through what it actually takes to turn an LLM into a system:
- We defined the system requirements first, thinking in terms of cost, latency, and privacy, instead of just picking the biggest model available.
- We framed a concrete use case: an automated grader for Data Science exams that has to read questions, look at real datasets, and give structured feedback to students.
- We designed the prompt as a specification, with a clear role, explicit rules, and few-shot examples to guide the model toward consistent behavior.
- We enforced structured output using Pydantic, so the LLM returns typed objects instead of free text that needs to be parsed and fixed every time.
- We plugged in tools to give the model access to the datasets, grading rubrics, and ground truth answers, so it can check the student work instead of hallucinating results.
- We added guardrails and validation around the model, making sure inputs and outputs are sane, scores make sense, and the system fails gracefully when something goes wrong.
- Finally, we put everything together into a simple pipeline, where prompts, tools, schemas, and guardrails work as one unit that you can reuse, test, and monitor.
The main idea is simple. LLMs are not magical oracles. They are powerful components that need context, structure, and constraints. The more you control the prompt, the output format, the tools, and the failure modes, the less the model has to fill the gaps on its own, and the fewer hallucinations you get.
Before you head out
Thank you again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:
Image made by author
I’m originally from Italy, hold a Ph.D. from the University of Cincinnati, and work as a Data Scientist at The Trade Desk in New York City. I write about AI, Machine Learning, and the evolving role of data scientists both here on TDS and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email