07 Nov, 2025
We have simple agents everywhere—and reliable ones nowhere.
You can build a workflow agent in an afternoon with Zapier. It triggers on email, writes to a spreadsheet, posts to Slack. Deterministic flows are commoditized. Every startup ships one. Every team runs dozens.
The promise, though, was different. We wanted agents that think. That adapt. That autonomously research your market, debug your codebase, manage your CRM without predetermined paths.
We’re starting to get there. But only in narrow corridors where the rails are thick enough.

The Reliability Gradient
Workflow agents—the ones with fixed, branching logic—work because they don’t think. They follow scripts. Simple ones launch with no-code tools. Complex ones require architecture, but once built, they’re reliable. When the use case is common and valuable enough, the workflow itself becomes a product.
Then there’s the next tier: dynamic agents that plan and adapt. Research agents work well because the task loop is clean—search, summarize, search again. Coding agents work because the environment is deterministic: write code, run tests, see failures, iterate. The feedback is binary.
But ask an autonomous agent to touch multiple APIs with custom schemas, or to execute tasks across a CRM with unique fields and undocumented business logic, and reliability crumbles.
The difference is simple. Research and coding agents use universal tools that models already understand: search, code execution, tests. Your CRM, your database, your company’s tools? Those require learning.

The Four Pillars Problem
Every reliable agent needs four things to work:
1. Task Planning
LLMs can create simple task lists for web research—a series of searches and summaries. That’s tractable. But once tasks multiply or require private context, the wheels fall off.
Exhibit A: Research agents handle “find 10 competitors and their pricing” because it’s a flat list. Ask them to research 100 entities with nested attributes, and task management degrades. You can see this being solved in spreadsheet-based AI tools—they offload task tracking to rows and columns because passing long task lists between prompts doesn’t scale.
Exhibit B: Coding agents work for small problems or greenfield projects. Ask them to refactor a complex, undocumented codebase and they thrash. Developers have learned to compensate by writing .md files that explain how their code is organized. The better the documentation, the better the agent plans. Complex codebases require dynamically retrieving only the relevant context from those documents.
What’s missing: Most businesses have strong, undocumented opinions on the correct approach to any project. We need lightweight ways to capture this upfront—and continuously.
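To make Exhibit A concrete, here’s a minimal sketch (in Python, with hypothetical names like TaskStore) of offloading task state to a structured store so each prompt sees only a small slice of the plan rather than the whole list:

```python
# A minimal sketch of offloading task tracking to structured storage instead of
# re-sending the full plan in every prompt. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Task:
    id: int
    description: str
    status: str = "pending"   # pending | done | failed
    notes: str = ""           # learnings captured during execution

@dataclass
class TaskStore:
    tasks: list[Task] = field(default_factory=list)

    def add(self, description: str) -> Task:
        task = Task(id=len(self.tasks) + 1, description=description)
        self.tasks.append(task)
        return task

    def next_batch(self, limit: int = 5) -> list[Task]:
        """Only the next few pending tasks go into the prompt, not the whole plan."""
        return [t for t in self.tasks if t.status == "pending"][:limit]

    def complete(self, task_id: int, notes: str = "") -> None:
        for t in self.tasks:
            if t.id == task_id:
                t.status, t.notes = "done", notes

# The agent loop reads a small slice, executes it, and writes results back,
# so the plan lives in the store (the "spreadsheet"), not in the context window.
store = TaskStore()
for entity in ["Acme Corp", "Globex", "Initech"]:
    store.add(f"Research pricing for {entity}")

for task in store.next_batch(limit=2):
    # ... call the model / tools for this one task ...
    store.complete(task.id, notes="pricing found")
```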

2. Task Execution
Tasks are usually API calls, LLM reasoning, or both. Sometimes they’re even workflow agents chained together.
Authentication is solved. MCP servers and similar tools handle basic API access. But tool-specific context—how your CRM is actually used, how your database is actually structured, which scenarios matter—that’s where agents fail.
Your tools aren’t generic. Salesforce at your company looks different than it does at mine. The same API serves different purposes across organizations. Agents need to understand:
- How you’ve structured your data
- What projects use which tools
- Why certain fields exist
- How errors should be interpreted
Right now, when an agent hits an error, it guesses. Or asks you. It should learn—capturing feedback as organized context that surfaces when relevant.
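A hedged sketch of what that could look like: human feedback stored as tool-specific notes, keyed by failure mode, and surfaced only when a matching error recurs. The ToolContext structure, the Salesforce field, and the matching rule are all illustrative assumptions, not a real API:

```python
# Capturing feedback as tool-specific context that resurfaces when a similar
# error recurs. The structure and names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ToolNote:
    tool: str           # e.g. "salesforce"
    error_pattern: str  # substring that identifies the failure mode
    guidance: str       # what a human taught the agent last time

@dataclass
class ToolContext:
    notes: list[ToolNote] = field(default_factory=list)

    def learn(self, tool: str, error_pattern: str, guidance: str) -> None:
        """Store human feedback, organized by tool and failure mode."""
        self.notes.append(ToolNote(tool, error_pattern, guidance))

    def relevant(self, tool: str, error_message: str) -> list[str]:
        """Surface only the notes that match this tool and this error."""
        return [
            n.guidance
            for n in self.notes
            if n.tool == tool and n.error_pattern.lower() in error_message.lower()
        ]

ctx = ToolContext()
ctx.learn(
    tool="salesforce",
    error_pattern="REQUIRED_FIELD_MISSING",
    guidance="Deals in our org always need the custom 'Region__c' field set.",
)

# On the next failure, the matching guidance can be pulled and prepended to the
# retry prompt instead of the agent guessing or asking again.
error = "REQUIRED_FIELD_MISSING: [Region__c]"
retry_hints = ctx.relevant("salesforce", error)
```
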
3. Context Management
Imagine being a new hire. You get onboarding docs. Then you learn on the job—both from the org’s collective knowledge (“this is how we do things here”) and from your own mistakes.
Agents need the same layers. Meta context about you and your company. Project-specific context. Task-specific context. Tool-specific context. We’ve evolved from simple system prompts to hybrid RAG (vector, keyword, graph), but we’re still figuring out when to retrieve what.
As projects get complex, the challenge isn’t storing context—it’s pruning it. You want only relevant information in each prompt, nothing extraneous.
The unlock: Not just having context, but knowing when and how to retrieve it. We’re building the storage layer. The retrieval strategy layer is next.
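One way to picture that retrieval layer, as a rough sketch rather than a recipe: rank chunks across the meta, project, task, and tool layers, then prune to a budget, with naive keyword overlap standing in for hybrid RAG. Names like ContextChunk and the character budget are assumptions for illustration:

```python
# Several context layers, a relevance score, and a budget so only what fits
# (and matters) reaches the prompt. Keyword overlap stands in for real retrieval.
from dataclasses import dataclass

@dataclass
class ContextChunk:
    layer: str   # "meta" | "project" | "task" | "tool"
    text: str

def relevance(chunk: ContextChunk, query: str) -> float:
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.text.lower().split())
    return len(query_terms & chunk_terms) / max(len(query_terms), 1)

def assemble_prompt_context(chunks: list[ContextChunk], query: str, budget_chars: int = 2000) -> str:
    # Rank all layers together, then prune to the budget rather than sending everything.
    ranked = sorted(chunks, key=lambda c: relevance(c, query), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if used + len(chunk.text) > budget_chars:
            continue
        selected.append(f"[{chunk.layer}] {chunk.text}")
        used += len(chunk.text)
    return "\n".join(selected)

chunks = [
    ContextChunk("meta", "We are a B2B fintech company selling to mid-market banks."),
    ContextChunk("tool", "In our CRM, the 'Stage' field must move sequentially."),
    ContextChunk("project", "Q4 project: migrate all closed-lost deals to the new pipeline."),
]
print(assemble_prompt_context(chunks, query="update CRM stage for closed-lost deals"))
```
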
4. Reflection
We have monitoring tools for costs and logs. We’re terrible at measuring success.
Coding agents have an advantage: tests fail or pass. That’s deterministic feedback. For most agentic tasks, success is ambiguous. You need human judgment.
Right now, reflection is human-in-the-loop. You give feedback. A developer reads it and tweaks prompts or code. The agent doesn’t learn directly.
The unlock comes when reflection becomes self-improvement. When failed tasks generate learnings that get stored as context and surface automatically next time—only when relevant. This leads to fine-tuning. Then to agentic RL environments. We’re early here.
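Here’s a small sketch of that loop under those assumptions: a failed task produces a learning, the learning is stored with tags, and it’s prepended only to future tasks that match. The execute and evaluate hooks are placeholders for the model call and for human judgment (or a test); none of this is a real library API:

```python
# Reflection as self-improvement: store learnings from failures, surface them
# only when relevant. Evaluation and distillation are placeholders.
from dataclasses import dataclass, field

@dataclass
class Learning:
    tags: set[str]
    lesson: str

@dataclass
class Memory:
    learnings: list[Learning] = field(default_factory=list)

    def record(self, tags: set[str], lesson: str) -> None:
        self.learnings.append(Learning(tags, lesson))

    def recall(self, task_tags: set[str]) -> list[str]:
        return [entry.lesson for entry in self.learnings if entry.tags & task_tags]

memory = Memory()

def run_with_reflection(task: str, tags: set[str], execute, evaluate) -> bool:
    hints = memory.recall(tags)        # surface only relevant past learnings
    result = execute(task, hints)
    ok, feedback = evaluate(result)    # human judgment or a test
    if not ok:
        memory.record(tags, feedback)  # the next run starts smarter
    return ok

# Example wiring with stub functions standing in for the model and the reviewer.
ok = run_with_reflection(
    task="Draft outreach email for enterprise leads",
    tags={"email", "enterprise"},
    execute=lambda task, hints: f"draft ({len(hints)} hints applied)",
    evaluate=lambda result: (False, "Enterprise emails must mention the security review."),
)
```
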
Why Orchestration Exists
Here’s a thought experiment: If we had infinite context windows (no degradation), infinite compute, infinite storage, browser access, and a universal payment method, a single LLM loop could probably handle most work.
Nothing is infinite.
Agent orchestration exists to manage limitations. You offload work from the LLM through structure and deterministic code. You split complex agents into specialized sub-agents with focused tools and context. You hand subtasks to workflow agents when the path is known.
Some people are building natural language API endpoints to make agent-to-agent collaboration easier. Multi-agent systems work when the coordination cost stays below the specialization benefit.
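A minimal sketch of that split, with plain functions standing in for sub-agents that would each wrap their own model calls and focused toolsets; the router and agent names are hypothetical:

```python
# A router hands each subtask to a specialized sub-agent with its own narrow
# toolset and context, keeping coordination cheap and explicit.
from typing import Callable

def research_agent(subtask: str) -> str:
    # Would use search and summarization tools only.
    return f"research result for: {subtask}"

def crm_agent(subtask: str) -> str:
    # Would use CRM tools plus the tool-specific context discussed earlier.
    return f"CRM update for: {subtask}"

SUB_AGENTS: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "crm": crm_agent,
}

def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Each plan step is (agent_name, subtask)."""
    return [SUB_AGENTS[agent](subtask) for agent, subtask in plan]

results = orchestrate([
    ("research", "find pricing pages for 3 competitors"),
    ("crm", "log competitor pricing on the relevant accounts"),
])
```
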
Patterns in Production
Agents ship in three flavors:
Internal tools – Your team uses them to automate workflows or research tasks. Usually chat or voice triggered.
Standalone products – An agent that combines multiple tools to solve one job, like a research assistant or coding copilot. The product is the orchestration.
Embedded features – Agents baked into existing tools, like “draft this email” in your CRM or “summarize this thread” in Slack.
Background agents are emerging—ones that run without prompting, monitoring for triggers or opportunities. Chat and voice interfaces dominate for explicit triggers.
What’s Next
Workflow agents work. Dynamic agents work in constrained environments with deterministic feedback. Broader autonomy requires better:
- Task planning that learns from company-specific approaches
- Execution that understands tool-specific context
- Context management that retrieves strategically, not exhaustively
- Reflection that generates learnings automatically
We’re close on read-only agents. Write permissions are scary because errors compound. One interesting idea: let agents experiment in forked environments—like a CRM mirror—where you can review changes before merging them to production.
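As a rough sketch of that forked-environment idea (with a dict standing in for the CRM mirror, which is a big simplification): the agent writes to a copy, its edits are collected as a changeset, and nothing touches production until a reviewer approves the diff:

```python
# Staged writes: fork the data, let the agent edit the fork, compute a
# changeset, and only merge after review.
import copy

production = {"deal-42": {"stage": "Prospecting", "amount": 10_000}}

def fork(db: dict) -> dict:
    return copy.deepcopy(db)

def diff(original: dict, forked: dict) -> dict:
    """Field-level changes between the fork and production."""
    changes = {}
    for key, record in forked.items():
        before = original.get(key, {})
        delta = {f: v for f, v in record.items() if before.get(f) != v}
        if delta:
            changes[key] = delta
    return changes

# The agent works against the fork.
sandbox = fork(production)
sandbox["deal-42"]["stage"] = "Negotiation"

# A reviewer would inspect `pending` here; applying it is the "merge" step.
pending = diff(production, sandbox)
for key, delta in pending.items():
    production[key].update(delta)
```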
Agents work. They’re just not yet agents.
They’re becoming them.
—– end —–
Ashish
Co-Founder
