Introduction
We’ve pivoted from an era of “train your own model” to one of “call an API.” On the surface, this shift is incredibly empowering. The barrier to entry has never been lower. However, while it is easier than ever to build a prototype, the complexity of shipping a reliable, production-ready system is higher than it’s ever been.
After spending the last two years building GenAI applications, I’ve seen the good, the bad, and the buggy. I’m sharing these practical lessons not as an expert with all the answers, but as an engineer passing along my notes. Let’s get into the lessons I’ve learned so far.
This is by no means a complete recipe. You might not agree with some of the points, and that is totally fine. These are the lessons I’ve learned, and I hope they’ll be helpful to you in one way or another.
1. Understand the Task
Before writing a single line of code, you must deeply understand what you are actually trying to solve. This might sound like “Software Engineering 101,” but in the world of LLM-driven apps, it is where most projects fail.
Misunderstanding the task is like having a heavy-duty foundation for a commercial warehouse when the client actually needed a residential cottage. You might complete the build on time and under budget, but the end result is useless to the person living there.
Defining the task isn’t just about transcribing a client’s wish list. It’s about user empathy. You have to put yourself in the end-user’s shoes:
What do they actually expect when they click “Submit”?
If they ask for a “summarizer,” are they looking for a three-sentence executive blurb or a technical list of action items?
What information is “signal” to them, and what is just “noise” they want excluded?
You likely won’t get every detail right on the first try, and that’s okay. However, taking the time to truly deconstruct the task helps you plan your architecture effectively and, more importantly, manage the client’s expectations from day one.
2. Know the Data/Domain
Once the task is defined, your next challenge is mastering the data and the domain. In my experience, this is often where the most significant project friction occurs.
There is a natural “communication gap” here: clients are usually domain experts who are non-technical, while engineers are technical experts who are rarely familiar with the nuances of healthcare, finance, or law. If you don’t bridge this gap early, you risk heading down a “one-way road” that is incredibly expensive to return from.
In a healthcare project I worked on, the medical data was so specialized that we initially struggled to make progress. Our solution? We scheduled daily syncs with the client just to decode their terminology and workflows. While this felt like a poor use of time at first, it ended up being our smartest investment. It ensured we built on a foundation of facts rather than assumptions, saving us months of potential rework.
Strategies for mastering the domain:
Build a Glossary: Learn industry-specific terms and exactly how the client defines them.
Audit Data Quality Early: The “Garbage In, Garbage Out” rule is true for LLMs as well. If your data is messy, your model’s reasoning will be too.
Identify Sensitive Content: Flag PII (Personally Identifiable Information) and regulated data (like HIPAA or GDPR) before a single token is sent to an API. A minimal pre-check is sketched after this list.
Identify Edge Cases: Ask the client: “What is a scenario that happens only 1% of the time, but would be a disaster if the AI got it wrong?”
Reverse-Engineer the Human Workflow: How do humans currently solve this problem? If you can’t map the human process, you’ll struggle to automate it reliably.
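To make the PII point above concrete, here is a minimal sketch of a regex-based pre-check that runs before any text leaves your system. The patterns, the `flag_pii` helper, and the sample document are illustrative assumptions; in a real project you would likely reach for a dedicated PII-detection library or service.

```python
import re

# Illustrative patterns only; real PII detection usually needs a dedicated
# library or service. These regexes are assumptions for the sketch.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}

def flag_pii(text: str) -> dict:
    """Return suspected PII found in the text, grouped by pattern name."""
    return {
        name: matches
        for name, pattern in PII_PATTERNS.items()
        if (matches := pattern.findall(text))
    }

doc = "Contact John at john.doe@example.com or 555-123-4567."
hits = flag_pii(doc)
if hits:
    # Block, redact, or route to a human before any token reaches an external API.
    print("PII detected:", hits)
```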
3. Define Evals Early
“You can’t improve what you don’t measure.”
This is arguably the most critical pillar of GenAI development. Without a robust evaluation framework, you are essentially operating without a clear path to improvement. You might write a prompt, test it against five cases, and conclude that it “works.” But in production, LLMs fail in silent, creative ways that a handful of manual checks will never catch.
Evaluation is the foundation of the GenAI lifecycle:
Build “Eval Sets” before Production Code: Treat your evaluation dataset as your “Ground Truth.” It should contain diverse examples, from the standard “happy path” to the weirdest edge cases you identified in Section 2.
The 2025 Metric Mix: Don’t rely solely on old-school metrics like BLEU or ROUGE; they cannot capture meaning. Use a hybrid approach:
Deterministic Checks: For formatting (e.g., “Is the output valid JSON?”). A minimal sketch combining this with a semantic check follows this list.
Semantic Similarity: Using embedding distance to check if the meaning matches your target, even if the words differ.
LLM-as-a-Judge: Use a “frontier” model (like GPT-5 or Claude 4.5 Sonnet) to grade your system’s performance based on specific rubrics like faithfulness or helpfulness.
Task-Specific Metrics: Never forget task-specific metrics. Even if you use an LLM for classification, an F1 score is almost always a better measure than LLM-as-a-Judge. The only downside is that you need to prepare ground-truth labels.
Human-in-the-Loop (HITL): This remains the gold standard. Use human experts to periodically audit the automated scores. If the automated eval says “Pass” but a human says “Fail,” your eval criteria need tuning.
Make Evals Fast and Versioned: You will run these hundreds of times. If your eval takes an hour, you won’t run it. Aim for a “smoke test” suite that runs in minutes and track how your scores change as you update prompts.
Relentless Error Analysis: Don’t just settle for a score (e.g., “80% accuracy”). Dive into the failures. Is the model hallucinating? Is it a formatting issue? Is the retrieved data wrong? It’s where you discover exactly what to optimize next.
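Here is a minimal sketch of a single eval case that pairs a deterministic format check with a semantic similarity check, as described above. It assumes the sentence-transformers package for embeddings and a hypothetical “summary” JSON schema; swap in whatever embedding provider and schema your project actually uses.

```python
import json
from sentence_transformers import SentenceTransformer, util

# A small local embedding model; any embedding provider works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def eval_case(output: str, reference: str, threshold: float = 0.8) -> dict:
    """One eval case: a deterministic format check plus a semantic similarity check."""
    # 1. Deterministic: is the output valid JSON with the expected key?
    try:
        valid_json = "summary" in json.loads(output)  # the "summary" schema is an assumption
    except (json.JSONDecodeError, TypeError):
        valid_json = False

    # 2. Semantic: does the meaning match the reference, even if the wording differs?
    embeddings = model.encode([output, reference], convert_to_tensor=True)
    similarity = float(util.cos_sim(embeddings[0], embeddings[1]))

    return {
        "valid_json": valid_json,
        "similarity": round(similarity, 3),
        "passed": valid_json and similarity >= threshold,
    }

print(eval_case('{"summary": "Q3 revenue grew 12%."}',
                "Revenue increased by 12% in the third quarter."))
```

In practice you would run this over the entire eval set and track the aggregate scores for every prompt version.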
4. Embrace Observability
Once you move from a local notebook to a production environment, monitoring is no longer a “nice-to-have”—it is your lifeline. Traditional software monitoring focuses on “uptime” (is the server alive?), but GenAI monitoring must track intelligence (is the answer correct, fast, and affordable?).
Without proper observability, LLMs are a black box. You won’t know if a sudden spike in user complaints is due to a “silent” model update from OpenAI, a “data drift” in your vector database, or an inefficient prompt chain. LLM Observability tools like Langfuse (great for open-source flexibility) and LangSmith (if you are already in the LangChain ecosystem) are essential for peering inside this box.
Your observability strategy should focus on these three pillars:
1. Deep Request Tracing: In a complex app (like a RAG pipeline), a single user question triggers multiple events: a search, a context retrieval, and finally an LLM call. Tracing allows you to map this entire lifecycle. If the final answer is wrong, you can pinpoint exactly where the “logic chain” broke—did the retriever find the wrong document, or did the LLM just ignore the right one?
2. Performance & Unit Economics: Beyond total latency, you need to track TTFT (Time to First Token). In 2025, users care more about how quickly the AI starts responding than when it finishes. Simultaneously, you must monitor per-user token consumption; a small bug in a loop can “burn” thousands of dollars in hours if you aren’t accounting for your token spend in real-time.
3. Quality & Drift Detection: Models aren’t static. Providers update them “under the hood,” which can lead to silent failures—where the API returns a 200 OK but the response is gibberish or non-compliant. Continuous oversight helps you catch these “personality shifts” and hallucinations before they erode user trust.
5. Take Clients Onboard
While you are busy understanding the task and auditing the data, you must ensure you are communicating clearly with your stakeholders. It is your job to manage the “AI hype.” You need to be transparent about what the system can do, what its limitations are, and how you plan to improve it over time—especially if the task is complex and cannot be fully solved in the initial phases.
Key strategies for client alignment:
Be explicit about capabilities: Clearly define what the AI can and cannot do. Don’t let them assume it has “human-level” reasoning if it’s just a classification tool.
Show failure cases upfront: Don’t just demo the “happy path.” Demonstrate limitations and edge cases early so there are no surprises during production.
Define responsibility boundaries: Establish where the AI’s job ends and where human oversight must begin. This is crucial for maintaining quality and trust.
Educate on the probabilistic nature: Explain that LLMs are not deterministic. Unlike traditional software, the outputs won’t always be perfect or 100% consistent, and the client needs to be prepared for that variability.
Agree on iteration cycles: Make it clear that GenAI development is experimental by nature. Success requires a cycle of testing, feedback, and refinement rather than a single “final” release.
6. Cost Considerations from day 1
There is a high probability that costs will catch you by surprise if they aren’t factored into your architecture from day one. It’s easy to build a high-performing PoC (Proof of Concept) using a five-page document, but production reality is different. When that same system suddenly has to process hundreds of 50-page documents, your servers won’t crash—but your LLM bill certainly might.
As an engineer, it is your responsibility to set clear boundaries regarding the nature of the data and the expected costs. If a client is fascinated by a complex multi-chain agent that you built for a demo, you must help them understand the “unit economics” of that solution.
Tips for cost-effective scaling:
Estimate Early and Often: Don’t wait for the first invoice. Calculate the expected cost per user and cost per request during the design phase (see the sketch after this list).
The “Smaller Model” First Rule: If a smaller, faster model (like GPT-4o-mini or Claude Haiku) can achieve 90% of the result, use it. It optimizes both latency and cost simultaneously.
Manage Multi-Chain Workflows: Every step in an autonomous chain adds to the bill. Be transparent with stakeholders about the price of complexity.
Define Data Boundaries: Set limits on document sizes or the number of documents a user can upload to prevent runaway costs.
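Here is a back-of-the-envelope cost estimator in the spirit of the first tip above. The per-million-token prices, model names, and usage numbers are placeholders I made up for illustration, not real quotes; plug in your provider’s current pricing and your own traffic assumptions.

```python
# Illustrative per-million-token prices; placeholders, not real price quotes.
PRICES = {
    "small-model": {"input": 0.15, "output": 0.60},
    "frontier-model": {"input": 5.00, "output": 15.00},
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def monthly_cost(model: str, users: int, requests_per_user_per_day: int,
                 avg_input_tokens: int, avg_output_tokens: int) -> float:
    per_request = cost_per_request(model, avg_input_tokens, avg_output_tokens)
    return per_request * users * requests_per_user_per_day * 30

# A ~50-page document (~30k tokens) processed twice a day by 500 users:
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500, 2, 30_000, 800):,.2f}/month")
```

Even this rough arithmetic makes the gap between a small model and a frontier model obvious long before the first invoice arrives.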
7. Prioritize the Feedback Loop
Once you release the system to UAT (User Acceptance Testing) or production and real users start interacting with it, your most valuable asset is their feedback. Many teams treat feedback as an afterthought, but implementing a mechanism to capture user sentiment in the early stages is one of the smartest moves you can make.
This feedback doesn’t just tell you whether the system “works”; it builds a unique, high-value dataset that reveals exactly where the model’s responses fall short of user expectations. It is the ultimate tool for pressure-testing your initial assumptions.
How do you build an effective feedback loop?
Make it Frictionless: Use a simple one-click “Thumbs Up/Down” interface. If the barrier to giving feedback is too high, users simply won’t do it.
Capture Structured Data: When a user gives negative feedback, ask a quick follow-up question: “Was the answer inaccurate, too long, or poorly formatted?” This helps you categorize issues without manual review (a sketch of such a record follows this list).
Analyze the Patterns: Look for clusters. Are users consistently complaining about the same feature? This is your signal to revisit your implementation.
Turn Feedback into Evals: Take the cases where users were dissatisfied and turn them into permanent test cases in your evaluation suite. This ensures you never regress on the same issue twice.
Close the Loop: Show your users that their input matters. Improving the system based on their direct feedback is the fastest way to build trust in an AI product.
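As a minimal sketch of the structured-capture idea above, here is one possible shape for a feedback record. The field names (`request_id`, `prompt_version`, the issue categories) are assumptions for illustration; the important part is that a thumbs-down carries enough context to become an eval case later.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class IssueType(str, Enum):
    INACCURATE = "inaccurate"
    TOO_LONG = "too_long"
    POOR_FORMAT = "poor_format"
    OTHER = "other"

@dataclass
class FeedbackRecord:
    request_id: str                    # links back to the trace in your observability tool
    prompt_version: str                # which prompt release produced this answer
    thumbs_up: bool
    issue: Optional[IssueType] = None  # only asked on a thumbs-down
    comment: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# A thumbs-down becomes a candidate test case for the eval suite.
fb = FeedbackRecord(request_id="req_123", prompt_version="v14",
                    thumbs_up=False, issue=IssueType.INACCURATE)
if not fb.thumbs_up:
    print("Add to eval backlog:", asdict(fb))
```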
8. Embrace the Experimental Nature
Stakeholders are often “wowed” by the initial PoC, only to become frustrated when they realize how many ways the system can fail in the wild. As an engineer, your job is to set the expectation that GenAI development is not a “one-and-done” task. It is a continuous, iterative process.
A production system is a moving target: prompt performance shifts, model providers push updates, the nature of incoming data evolves, and user feedback will constantly pull the project in new directions. All of these factors force you into a state of perpetual experimentation. To make your case to leadership, don’t just ask for more time; present a structured plan for iteration. One of the most effective habits is to align your experimentation with your feedback loops.
Example: The “Prompt Release” Strategy
In one project, we realized we needed to update our prompts constantly based on evolving client needs. To manage this without causing chaos, we treated prompt updates with the same discipline as code deployments. We established a dedicated “Prompt Release Day.”
Our weekly cycle looked like this:
Friday: Deadline for receiving the “First Draft” of requirements or feedback from the client.
Monday – Wednesday: Dedicated time for experimentation, testing new prompt versions, and running them through our Eval sets.
Thursday: Release Day. The new, tested prompts are pushed to production.
This process did two things: it gave the engineers a predictable window to experiment, and it gave the clients a reliable release cycle for when they would see improvements. It turned “unpredictable AI magic” into a standard, manageable engineering workflow.
9. Treat Prompts as Code
Building on the idea of a release cycle, the biggest mindset shift you need is this: Treat your prompts with the same rigor as your application code. If your prompts are scattered across random Jupyter notebooks or buried as hard-coded strings deep inside your functions, you are setting yourself up for a maintenance nightmare. Treating prompts as code means applying the same engineering disciplines you use for any other part of your technical stack.
What “Prompts as Code” looks like in practice:
Version Control: Store your prompts in your repository, or even better, use a dedicated prompt management tool like Langfuse. This allows you to track exactly who changed what and, more importantly, enables you to roll back to a “known-good” version immediately if a new tweak causes the model’s performance to degrade.
Prompt Reviews: Never push a major prompt change without a peer review. In GenAI, a single word change like swapping “must” for “should” can drastically alter the model’s output logic. A second pair of eyes is essential to catch these subtle shifts in behavior.
Templating: Don’t mix logic with data. Use structured formats to keep your system instructions separate from user inputs. While the “how-to” of prompt engineering is outside the scope of this article, keeping your prompts clean and modular is a fundamental part of building a manageable system (a minimal sketch follows this list).
Automated Testing: Just as you wouldn’t ship a function without a unit test, don’t ship a prompt without running it against your Eval set. This testing should be multi-dimensional, covering everything from expected output quality to security checks (like prompt injection).
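Here is a minimal sketch of what the templating point above can look like inside a repository. The module name, the summarizer persona, and the expected JSON schema are all hypothetical; the pattern to copy is simply keeping instructions (system) separate from data (user) so version control sees every change.

```python
# prompts/summarizer.py (hypothetical module): prompts live in the repo so
# every change is peer-reviewed and versioned like any other code.
SYSTEM_PROMPT = """You are a clinical-notes summarizer.
Respond with JSON: {"summary": "...", "action_items": ["..."]}.
Never include patient identifiers in the output."""

USER_TEMPLATE = "Summarize the following note for a {audience}:\n\n{note}"

def build_messages(note: str, audience: str = "physician") -> list:
    """Keep instructions (system) separate from data (user)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(audience=audience, note=note)},
    ]

messages = build_messages("Pt reports mild dizziness after new dosage.")
```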
10. Frameworks Might Be Overkill
I have seen many teams rush to adopt frameworks like LangChain or LlamaIndex the moment they start a project. In the early prototyping stages, these tools feel like magic—they make complex tasks look easy with just a few lines of code and standard implementations.
However, as your system grows and your LLM chains become more complex, you may realize that these frameworks add unwanted complexities to the project.
By design, they abstract away much of the underlying logic. While this saves time initially, it eventually becomes a bottleneck. You often end up “blind” to the internal prompts being used, and simple customizations can become incredibly tricky. I’ve seen engineers spend more time fighting a framework’s codebase or trying to override its defaults than they would have spent writing the actual business logic from scratch.
The “Simple-First” Approach: It is often better to start with simple, direct API calls and focus on structuring your own codebase properly from day one. This is where your traditional software engineering skills really come to the rescue. Building your own thin wrappers around LLM calls (sketched at the end of this section) gives you:
Total Transparency: You know exactly what prompt is being sent and why.
Easier Debugging: There are no hidden “black-box” layers between your code and the model.
Long-term Maintainability: You aren’t at the mercy of a framework’s breaking changes or opinionated updates.
Don’t let the hype around “agentic frameworks” distract you. Sometimes, a well-organized Python script beats a complex library.
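For reference, here is a minimal sketch of such a thin wrapper. It is written against the OpenAI Python SDK purely as an example; any provider SDK works, and the return shape is just an illustrative choice, not a prescription.

```python
import time
from openai import OpenAI  # any provider SDK works; the wrapper pattern is the point

client = OpenAI()

def complete(system: str, user: str, model: str = "gpt-4o-mini", **kwargs) -> dict:
    """A thin, transparent wrapper: you always see the exact prompt, parameters,
    and metadata, with no hidden framework layers in between."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        **kwargs,
    )
    return {
        "text": response.choices[0].message.content,
        "model": model,
        "latency_s": round(time.perf_counter() - start, 3),
        "total_tokens": response.usage.total_tokens if response.usage else None,
    }

# result = complete("You are a concise assistant.", "Summarize our Q3 report in one line.")
```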
11. Error Analysis is Precious
Once your evals are running, they will give you a performance score, but a score alone doesn’t tell you how to improve. You need to look under the hood to understand exactly what kind of errors are occurring, where they are happening, and how to fix them. This is where Error Analysis becomes your most valuable tool.
GenAI applications demand the same rigorous error analysis used in traditional Machine Learning. Instead of just seeing that the system failed, you need to categorize why it failed.
To make error analysis effective, categorize your failures into buckets. This sounds simple, but in practice you’ll discover failure modes you never imagined. Here are some common ones I’ve run into (a sketch of how to tally them follows the list):
Retrieval Failures: Did the system fail because the RAG pipeline didn’t find the right information? (Fix: Optimize your embeddings or chunking).
Reasoning Failures: Did the AI have the right information but failed to connect the dots? (Fix: Improve the prompt or move to a more capable model).
Formatting/Constraint Failures: Did the AI follow the instructions but fail to output the correct JSON or XML format? (Fix: Use few-shot examples or constrained decoding).
Hallucinations: Did the AI invent facts that weren’t in the source? (Fix: Tighten your system prompt or add “grounding” checks).
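Here is a minimal sketch of turning those buckets into a tally so you know which fix buys the most accuracy. The bucket names and the labeled-failure format are assumptions; the labels themselves would come from a human (or LLM-assisted) review pass over your failed eval cases.

```python
from collections import Counter

# Failure buckets; extend these as you discover new failure modes.
BUCKETS = {"retrieval", "reasoning", "formatting", "hallucination", "other"}

def tally_failures(labeled_failures: list) -> Counter:
    """Count failures per bucket so you know which fix buys the most accuracy."""
    return Counter(
        case["bucket"] if case.get("bucket") in BUCKETS else "other"
        for case in labeled_failures
    )

# Labels would come from a review pass over your failed eval cases.
failures = [
    {"id": "case_03", "bucket": "retrieval"},
    {"id": "case_17", "bucket": "retrieval"},
    {"id": "case_42", "bucket": "hallucination"},
]
print(tally_failures(failures).most_common())
# e.g. [('retrieval', 2), ('hallucination', 1)]: fix chunking/embeddings first
```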
12. Expectation vs Reality Trap
In the beginning, everyone tries their best to build a “perfect” system. But as the project nears the finish line, both you and your stakeholders will likely face a harsh Expectation vs. Reality moment.
A demo is a controlled environment. It works because you are using curated data, “happy path” queries, and perhaps a model that hasn’t been hit with rate limits or latency spikes yet. Production, however, is messy and unpredictable.
Why the gap exists:
The “Golden Example” Fallacy: Demos usually showcase the 5% of cases where the AI is brilliant. Production forces the AI to handle the other 95%—the ambiguous, the messy, and the nonsensical inputs.
The Creativity of Users: You cannot predict how a real user will interact with your system. They will ask questions you never tested and use the tool in ways you never intended.
Hidden Complexity: As you add guardrails, PII filters, and cost-management logic, the “clean” performance of the initial PoC can start to degrade.
Don’t aim for a “perfect” launch. Aim for a resilient one. Prepare your stakeholders for the fact that the first version in the wild will reveal gaps you couldn’t see in the lab. This is why the feedback loops and experimentation cycles we discussed earlier are so critical.
Conclusion
Building GenAI applications is part science and part art. It is a unique discipline that encompasses everything from the nuances of your data and the capabilities of the model to the way you orchestrate chains and the rigor you apply to your prompts. Most importantly, it is defined by how you plan for continuous improvement and iteration.
Don’t view these lessons as a set of strict, unbreakable rules. This field is evolving so rapidly that what holds true today might shift by tomorrow. You will undoubtedly encounter your own unique challenges and learn your own lessons along the journey.
I will try to keep this post updated. In the meantime, please leave a comment with any suggestions or changes you’d like to see, or simply to share your opinion. BTW, feel free to open a PR as well. I’ll see you in the next one. Bye!