In less than a year, MCP has become the go-to standard for exposing external functions and data to LLMs. It looked like AI systems were about to change our daily lives, but many still struggle with real-world tasks beyond simple demos (even when equipped with multiple agents, MCP servers, and tools).
In this post, we’ll take a closer look at what kinds of problems “traditional AI” systems handle well, why they often fail with more complex tasks, and what we can do to solve them.
It’s a longer read, but if you’re building anything beyond proof-of-concept AI, it’s worth your time.
Table of Contents
- Where Agentic Systems Shine
- Issue #1: Working with Real-World Data
- Issue #2: Back-and-Forth Tool Call Loops
- Issue #3: Tool Quantity and Quality
- Issue #4: Autonomous Work is Risky
- Solution
- Ready-to-Use Library
- Summary
Where Agentic Systems Shine
With tool calling, agents learned how to communicate with the outside world. That was a huge step forward — LLMs could now go beyond generating text in isolation and actually take action. They could read and modify documents, fetch information, and even understand what date it is today.
MCP took this even further — today, we have thousands of published servers that AI models can connect to directly.
Here are a few examples where AI agents are already being used successfully:
- Event monitoring and alerts. Example: detecting an angry email from a customer and notifying management.
- Data extraction. Example: extracting order details from an email and automatically creating a draft order. Or identifying key issues, actions taken, and whether those actions were successful.
- Unstructured text analysis. Example: analyzing legal contracts to detect risks or weak clauses.
- Data-driven insights and recommendations. Example: collecting supplier data (pricing, delivery terms, etc.) and selecting the most optimal one.
- In-app assistance for complex language-based features. Example: helping users with formulas or formatting in Excel.
- Coding assistance. Example: GitHub Copilot, Cursor, Windsurf, and other AI coding tools.
Across nearly all successful agent implementations, a few common traits can be observed:
- Task specialization and structured workflows. Each agent is fine-tuned for a specific purpose with clear logic (e.g., receive data → perform a specific action), a setup sometimes called vertical integration.
- Relatively small working context per iteration. The LLM doesn’t process all available data at once — it selectively extracts and uses only the relevant parts.
- Short, independent work cycles. While research into fully autonomous agents continues, in practice, most use cases still require human interaction at key points.
When developers try to expand an agent’s capabilities beyond these boundaries, they quickly encounter the challenges described in the next four sections.
Issue #1: Working with Real-World Data
GPT-1 could handle just 512 tokens, whereas GPT-5 can now work with up to 400k. Context windows keep growing with each new generation of LLMs, but even the largest models will never be able to effectively process an entire database — or even a single real-world table.
This limitation is crucial because even simple tasks require finding relevant entries in a database. Consider these example commands:
- Find the top 3 customers by total spending in Q3 2024
- Show items that were added to cart but not purchased in the last 7 days
- Build a chart of revenue for products in the PC category rated below 3 stars
All of these tasks require an agent to extract the relevant data from a database. But doing so efficiently is harder than it seems. Let’s look at a few approaches:
- Feed the entire table to the LLM. Technically, it could work, but it’s impractical for large real-world databases. Even if the LLM’s context window could handle it, this approach is expensive, inefficient, and increases the likelihood of errors (see How Language Models Use Long Contexts).
- Create task-specific tools (methods). For example:

```csharp
GetTopNCustomersByTotalSpending(DateTime start, DateTime end)
GetCurrentCartItemsAddedBeforeDate(DateTime addingDate)
GetProductsInCategoryWithRating(string category, short rating)
```

This works for predefined queries, but if the user slightly changes their request, these methods may no longer fit. Creating a method for every possible request quickly leads to hundreds or thousands of methods — and LLMs struggle with even a hundred.
- Use a specialized MCP server (e.g., MSSQL MCP Server). It dynamically builds SQL queries to fetch only the relevant data. However, two major issues remain:
- Security. Allowing AI-generated SQL queries is risky. Even in read-only mode, queries might access information users shouldn’t see. For example, if a table contains user data, a query could unintentionally return other users’ records.
- Data volume. A query can still return thousands of rows, which the LLM must then process. This is especially problematic for analytical tasks. As noted in the “Tool-space interference in the MCP era” research:
> the top tool returned an average of 557,766 tokens, which is enough to swamp the context windows of many popular models like GPT-5. Further down the list, we find that 16 tools produce more than 128,000 tokens, swamping GPT-4o and other popular models.
None of these options scale well for real-world applications. And this is just the first challenge — we’ll explore more issues in the next sections.
Issue #2: Back-and-Forth Tool Call Loops
Many people assume that LLMs can just magically call functions on their own. In reality, every tool/function call goes through a very specific workflow:
- The client sends a request to the LLM along with a JSON schema that describes all available tools (function names, parameters, and text descriptions).
- The LLM decides whether a tool call is needed. If yes, it responds with a JSON payload containing the function name and arguments.
- The client then executes the call. It parses the JSON, matches the function name to a real method in code, runs it, and collects the result.
- The result is sent back to the LLM, along with the original user request and the full list of tools again. The model may then choose to call another function — and the whole loop repeats.
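To make the cost of this loop concrete, here is a minimal client-side sketch of that workflow. It is not tied to any particular SDK: ILlmClient, ToolDefinition, ToolCall, and LlmResponse are hypothetical types, and real SDKs differ in details, but the shape of the loop is the same.

```csharp
using System;
using System.Collections.Generic;

public record ToolDefinition(string Name, string Description, string ParametersJsonSchema);
public record ToolCall(string Name, string ArgumentsJson);
public record LlmResponse(string Text, ToolCall ToolCall);

public interface ILlmClient
{
    // Receives the conversation so far plus the FULL tool list on every call.
    LlmResponse Complete(IReadOnlyList<string> messages, IReadOnlyList<ToolDefinition> tools);
}

public static class ToolCallLoop
{
    public static string Run(
        ILlmClient llm,
        IReadOnlyList<ToolDefinition> tools,
        Func<ToolCall, string> executeTool,
        string userRequest)
    {
        var messages = new List<string> { userRequest };

        while (true)
        {
            // Steps 1-2: send the request + all tool schemas; the model decides whether to call a tool.
            LlmResponse response = llm.Complete(messages, tools);

            // No tool call requested: the model produced the final answer.
            if (response.ToolCall == null)
                return response.Text;

            // Step 3: the client matches the name to a real method and executes it.
            string result = executeTool(response.ToolCall);

            // Step 4: send the result back (together with everything else) and repeat.
            messages.Add($"Tool {response.ToolCall.Name} returned: {result}");
        }
    }
}
```

Every pass through the while loop is another sequential round trip that re-sends the conversation and the full tool list, which is exactly where the latency and token costs come from.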
If your task requires multiple back-and-forth calls, this process quickly becomes expensive:
- It adds latency (because each step is sequential).
- It burns through tokens fast (because every loop contains the full tool list + user request).
Modern LLMs support parallel tool calls, but in many real-world cases you can’t use them, because the next function call depends on the previous result, so you’re stuck with an iterative loop.
Issue #3: Tool Quantity and Quality
We might naively assume that we can just pass all our available tools to the LLM, and it will easily figure out which ones to use and when.
However, there are multiple factors that make tool calling harder for AI models:
Number of tools. Each time the model is prompted, it receives the full list of available tools (their names, parameters, and descriptions). The more tools you add, the heavier and more confusing the prompt becomes. Even though newer models are getting better at handling large toolsets, OpenAI still officially recommends keeping the list under 20 tools:
> Aim for fewer than 20 functions at any one time, though this is just a soft suggestion.
Meanwhile, GitHub’s MCP server already ships with around 40 tools. There are emerging techniques like dynamic tool discovery, but they come with a trade-off: the model can’t see all tools upfront, which makes long-term planning harder for agents.
Naming collisions. Things get even messier when you connect multiple MCP servers. MCP doesn’t enforce namespaces, so different servers often ship tools with the same name. Microsoft researchers analyzed 7,000+ MCP servers and found 775 tools with name collisions — the most common one being simply: search.
Parameter depth. In normal APIs, we’re used to passing complex types (e.g., calling EditCustomer(Customer customer)). MCP technically supports this, but research from Composio showed that flattening parameters (turning nested objects into simple fields) improved tool-calling performance by 47%. Flattening brings its own inconvenience and maintenance cost: add one new property to Customer and you now need to update the flattened method signature in MCP too, which makes large real-world schemas harder to evolve (see the sketch below).
Call order / state dependency. Some tools should only be callable in specific application states. For example, the FilterCustomers tool (which filters the Customers list) should only run after OpenCustomersPage (which opens a page with the list). You can encode rules like this in the tool descriptions, but every extra condition increases complexity for the LLM and raises the failure rate.
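To illustrate the parameter-depth trade-off, here is a hypothetical sketch contrasting a nested signature with a flattened one. The Customer and Address types and both method names are made up for the example.

```csharp
// Hypothetical illustration of the parameter-depth trade-off.
public record Address(string City, string Street);
public record Customer(string FirstName, string LastName, string Email, Address Address);

public interface ICustomerTools
{
    // Nested: the model must emit a complete JSON object for Customer (and Address).
    void EditCustomer(Customer customer);

    // Flattened: easier for the model to fill in correctly, but the signature has to
    // change every time Customer gains a new property.
    void EditCustomerFlattened(
        string firstName, string lastName, string email, string city, string street);
}
```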
Issue #4: Autonomous Work is Risky
AI vendors keep announcing higher and higher success rates for autonomous agents — and the numbers do look impressive. But before we assume AI can fully replace humans, there are a few realities worth keeping in mind:
- Benchmarks are not the real world. Most benchmark datasets don’t come close to real-world conditions. They simplify context, remove ambiguity, and avoid messy edge cases — exactly the things that make real software and workflows hard.
- Even a 1% failure rate can eliminate automation benefits. A model that is “99% accurate” sounds great, but in production, that remaining 1% can be enough to break data integrity, trigger financial loss, or damage user trust — wiping out all the efficiency gains.
- Communication between agents is fragile. Just like humans, AI agents can misunderstand each other. One agent may create a plan that another interprets differently, leading to unexpected or completely wrong actions.
- Small errors compound in multi-step systems. In complex agent pipelines, tiny mistakes don’t stay tiny — they stack, amplify, and eventually cause failures that are hard to debug. And because AI systems are non-deterministic, reproducing the same bug is difficult.
Solution
With MCP/tool calling, we pass method descriptions to an LLM, which replies with JSON telling the client which function to call next. This forces the model to act as a data-processing and decision engine: it has to analyze business data, pick tools, participate in back-and-forth loops, and resolve ambiguities.
But what we actually need from the model is not the execution but the algorithm: which tools should run, in what order, and how their inputs and outputs connect. So what is the most precise way to express that algorithm? Text? JSON? It’s code. Code is explicit, deterministic, and executable — and modern LLMs are already very good at generating it.
The solution is to change the tool calling paradigm: Instead of asking the LLM “Which function do I call now?”, ask it to generate code that solves the task using the available tools, then run that code client-side.
With this approach, the AI doesn’t have to process the actual data — which immediately solves two major problems: Issue #1: Working with Real-World Data and Issue #2: Back-and-Forth Tool Call Loops.
But how will the LLM know which methods it’s allowed to call in its generated script? We need to expose a clear API to the model — a list of functions it can safely use when writing code. The API should expose two main things:
- Functions the AI is allowed to call to accomplish the task.
- Domain model classes (e.g., Customer, Order) so the script can work with real data types.
For example:
```csharp
public class Employee {
    public string FirstName;
    public string LastName;
    public string Email;
}

public class EmployeesViewOperator {
    private EmployeesViewOperator();
    public List<Employee> GetEmployees();
    public void UpdateEmployee(Employee updatedEmployee);
}

public class MainAppOperator {
    private MainAppOperator();
    // Navigates to the Employees view
    public EmployeesViewOperator GetEmployeesViewOperator();
}
```
The API we expose to the LLM doesn’t need to contain real implementation — it’s just a declaration layer the model can reference when generating its script. When we later execute that script, our code runs the real methods behind the scenes.
This approach also solves Issue #3: Tool Quantity and Quality because:
- Compact – The API takes fewer tokens than a JSON tool description.
- Single source of truth – Models are defined once as classes, then reused as parameter/return types.
- Clear structure – Each group of actions lives in its own class, reducing ambiguity.
- Built-in call ordering – To access a method, the LLM must first get an instance of its parent class. Example: to call UpdateEmployee, the model must first obtain an EmployeesViewOperator instance — which implicitly enforces “open the Employees page first”.
So now our flow looks as follows: send the API + the user task → LLM generates code → client executes the code. No more tool-call loops.
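For instance, given the API above and a task like “lowercase every employee’s email address”, the generated script might look roughly like this. This is only a sketch: it assumes the host injects a MainAppOperator instance into the script as a global named App, and the exact shape depends on how the host exposes the root operator.

```csharp
// Hypothetical script generated by the LLM for the task
// "Lowercase every employee's email address".
// Assumes the host provides a MainAppOperator instance as a script global named App.
EmployeesViewOperator employeesView = App.GetEmployeesViewOperator();

foreach (Employee employee in employeesView.GetEmployees())
{
    if (employee.Email == null)
        continue;

    employee.Email = employee.Email.ToLowerInvariant();
    employeesView.UpdateEmployee(employee);
}
```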
Sounds great — but what about security? What if the model generates something harmful and we just run it?
Here are two ways to stay safe:
- Sandbox execution – Run scripts in an isolated process or inside a Docker container.
- Restrict what code can access – e.g., in C# Roslyn scripts, you can whitelist allowed assemblies and namespaces so the script can only call methods you’ve exposed.
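For the second option, here is a minimal sketch using the Roslyn scripting API from the Microsoft.CodeAnalysis.CSharp.Scripting package. The ScriptGlobals shape and the MyApp.Operators namespace are assumptions made for illustration, not part of any specific library.

```csharp
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

// The only members the generated script can reach directly are exposed through this globals type.
public class ScriptGlobals
{
    public MainAppOperator App { get; init; }
}

public static class ScriptRunner
{
    public static async Task RunAsync(string generatedCode, MainAppOperator app)
    {
        // Whitelist: besides the core runtime, the script can only use the assemblies
        // and namespaces explicitly added here.
        ScriptOptions options = ScriptOptions.Default
            .AddReferences(typeof(MainAppOperator).Assembly) // our operator API
            .AddImports("MyApp.Operators");                  // assumed namespace of the operators

        await CSharpScript.RunAsync(generatedCode, options, globals: new ScriptGlobals { App = app });
    }
}
```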
Ready-to-Use Library
The approach above isn’t just theory — I’ve implemented it in a free open-source library called ASON (Agent Script Operation). You can try the online demo to see how it handles dynamic user requests in real time:
ASON Online Demo (runs on the free Azure tier, so you might hit rate limits)
The demo also addresses Issue #4 (Autonomous Work is Risky). For example, run this command in the demo: “Set the position of all employees hired in 2025 to Intern”. You’ll notice something important: the AI agent doesn’t directly modify the data. Instead, it opens an edit form with the updated values but leaves saving up to the user (keeping the human in the loop).
Agents can handle safe operations automatically, but when it comes to editing data, they pause and let the user take control. This is the same UX pattern used by GitHub Copilot and other AI coding tools.
ASON is built for C# developers, but its approach can be applied to any development stack. To try ASON, you can use CLI project templates:
```
dotnet new install Ason.ProjectTemplates
dotnet new ason.blaz.srv -n MyAsonProject
```
Templates are available for Blazor, MAUI, WPF, WinForms, and Console apps. Full docs and source code are available on GitHub.
Summary
Agentic systems using MCP and tool calling are designed to handle real tasks — but in practice, they often struggle outside demo scenarios. One of the main reasons is that current AI systems treat the LLM as a universal data processing engine, when what we really need is a flexible plan: a clear algorithm for which tools to use and how to connect them.
Code is the most precise and natural way for an AI model to express that plan. We just need to provide the LLM with what we want to achieve (the user task) and what we can do (the API). The LLM then generates code, and after that, the AI is no longer needed.
This approach lets us:
- Build flexible logic with client-side data processing.
- Avoid back-and-forth loops to save tokens and improve performance.
- Work with more tools and define the tool-calling order.
This isn’t just theory — it’s implemented in the ASON library.
P.S. As the library author, I’m happy to answer any questions.