Two months ago, our internal knowledge base chatbot confidently told a support rep that our refund policy was “14 days, no questions asked.” Our real policy is 30 days with approval for larger amounts.
A $2,000 refund was processed based on that hallucination.
That was the moment we stopped treating LLM features like “smart text boxes” and started treating them like unreliable distributed systems that require real engineering.
This article is not about demos. It’s about what you have to build after the demo works.
The Reality of AI Backends
Traditional backends are deterministic.
Same input → same output.
AI backends are probabilistic.
Same input → slightly different output depending on context, model variance, and prompt structure.
This means:
- You cannot trust outputs
- You cannot trust retrieval
- You cannot trust prompts
- You cannot trust tool calls
- You must observe everything
A production AI backend ends up looking like this:
API
  │
AI Orchestrator
  ├─ Guardrails
  ├─ Router
  ├─ Rate limits
  │
  ├─ RAG pipeline
  ├─ Function execution
  └─ Direct generation
  │
Observability + Evals
If you skip any of these layers, you will eventually ship a hallucination that costs money.
RAG Is Not “Chunk, Embed, Query”
The tutorial version of RAG is:
split text → embed → vector search → pass to LLM
That works in a notebook. It fails in production.
Real RAG needs:
- Proper ingestion pipeline
- Semantic chunking
- Change detection
- Hybrid retrieval (vector + keyword)
- Reranking
- Ongoing evaluation of retrieval quality
Ingestion in Laravel
Ingestion is a queued job, not a script.
You re-scan documents constantly. You re-embed only what changed.
class IngestDocuments implements ShouldQueue
{
    public function handle(SourceInterface $source)
    {
        $documents = $source->fetch();

        foreach ($documents as $doc) {
            // Change detection: skip documents whose content hash is unchanged.
            $hash = sha1($doc->content);

            if (Cache::get("doc_hash_{$doc->id}") === $hash) {
                continue;
            }

            $chunks = (new SemanticChunker())->chunk($doc->content);
            $embeddings = app(EmbeddingService::class)->embed($chunks);

            app(VectorStore::class)->upsert($doc->id, $chunks, $embeddings);

            Cache::put("doc_hash_{$doc->id}", $hash, now()->addDay());
        }
    }
}
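To keep re-ingestion running, a minimal sketch of the scheduling side, assuming the job implements ShouldQueue and SourceInterface is bound in the container:

// routes/console.php — re-run ingestion hourly; the hash check above
// makes repeated runs cheap because unchanged documents are skipped.
Schedule::job(new IngestDocuments())->hourly();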
The biggest quality improvement you will see is semantic chunking instead of fixed token splits.
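There is no single correct chunker; here is a minimal sketch of a paragraph-based SemanticChunker (the class name comes from the ingestion job above, the size heuristic is an assumption to tune):

class SemanticChunker
{
    public function __construct(private int $maxChars = 2000) {}

    /** @return string[] */
    public function chunk(string $content): array
    {
        $chunks = [];
        $current = '';

        // Split on paragraph boundaries instead of fixed token windows,
        // so a chunk never cuts a sentence or heading in half.
        foreach (preg_split("/\n{2,}/", $content) as $paragraph) {
            if ($current !== '' && strlen($current) + strlen($paragraph) > $this->maxChars) {
                $chunks[] = trim($current);
                $current = '';
            }

            $current .= $paragraph . "\n\n";
        }

        if (trim($current) !== '') {
            $chunks[] = trim($current);
        }

        return $chunks;
    }
}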
Hybrid Retrieval Is Mandatory
Vector search misses exact matches like order IDs, SKUs, emails.
Keyword search misses meaning.
You need both.
class HybridRetriever
{
    public function search(string $query, int $limit = 8)
    {
        // Over-fetch from both stores, then fuse and trim to the final limit.
        $vector = app(VectorStore::class)->search($query, $limit * 2);
        $keyword = app(KeywordSearch::class)->search($query, $limit * 2);

        return $this->mergeAndRank($vector, $keyword, $limit);
    }
}
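The interesting part is the merge. A minimal sketch of mergeAndRank using reciprocal rank fusion, assuming both stores return arrays of results keyed by an id field (the constant 60 is the conventional RRF damping value):

private function mergeAndRank(array $vector, array $keyword, int $limit): array
{
    $scores = [];
    $byId = [];

    // Reciprocal rank fusion: a result ranked highly by either store scores well,
    // and one ranked highly by both scores best.
    foreach ([$vector, $keyword] as $results) {
        foreach (array_values($results) as $rank => $result) {
            $scores[$result['id']] = ($scores[$result['id']] ?? 0) + 1 / (60 + $rank + 1);
            $byId[$result['id']] = $result;
        }
    }

    arsort($scores);

    return array_map(
        fn ($id) => $byId[$id],
        array_keys(array_slice($scores, 0, $limit, true))
    );
}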
Most hallucinations in RAG systems are actually retrieval failures, not model failures.
Generation With Grounded Context
What you pass to the model matters more than the model.
class RagResponder
{
    public function answer(string $question, array $chunks)
    {
        // Only the retrieved chunks make it into the context window.
        $context = collect($chunks)
            ->pluck('content')
            ->join("\n\n");

        $prompt = PromptManager::load('rag-answer', 'v3');

        $response = app(LLM::class)->chat([
            ['role' => 'system', 'content' => $prompt->system],
            ['role' => 'user', 'content' => $prompt->fill([
                'context' => $context,
                'question' => $question,
            ])],
        ], temperature: 0.2, json: true);

        return $response;
    }
}
Low temperature. Structured output. Explicit context.
You are trying to reduce creativity, not increase it.
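Structured output is only useful if you refuse to accept anything else. A minimal sketch of validating the JSON before using it, assuming the prompt asks for an answer plus the chunk ids it relied on (that schema is an assumption, not something the model enforces):

class GroundedAnswer
{
    public function __construct(
        public readonly string $answer,
        public readonly array $sourceChunkIds,
    ) {}

    public static function fromResponse(string $json): self
    {
        $data = json_decode($json, true);

        // Malformed or incomplete JSON is a failure, not an answer.
        if (!is_array($data) || !isset($data['answer'], $data['source_chunk_ids'])) {
            throw new \RuntimeException('Model returned an unparseable or ungrounded response.');
        }

        return new self($data['answer'], $data['source_chunk_ids']);
    }
}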
Function Calling Without Guardrails Will Burn You
Letting an LLM trigger backend actions without controls is equivalent to letting users hit internal APIs directly.
Every tool call must go through:
- Authorization
- Rate limiting
- Audit logging
- Optional approval
class ToolExecutor
{
    public function execute(string $tool, array $args, User $user)
    {
        $definition = ToolRegistry::get($tool);

        // Authorization comes first: the model never decides who may do what.
        Gate::authorize($definition->ability, $user);

        if ($definition->needsApproval && !$user->isAdmin()) {
            throw new AuthorizationException();
        }

        // Enforce the rate limit rather than only recording the hit
        // (at most 60 calls per tool per minute).
        if (RateLimiter::tooManyAttempts("tool:{$tool}", 60)) {
            throw new ThrottleRequestsException();
        }
        RateLimiter::hit("tool:{$tool}", 60);

        $result = call_user_func($definition->handler, $args);

        AuditLog::create([
            'user_id' => $user->id,
            'tool' => $tool,
            'args' => $args,
            'result' => $result,
        ]);

        return $result;
    }
}
Refunds, account changes, billing operations — these must never be “just a function call.”
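For completeness, a sketch of what a registry entry might look like, matching the fields the executor reads; ToolRegistry::register, ToolDefinition, and RefundService are illustrative names, not a real API:

ToolRegistry::register('issue_refund', new ToolDefinition(
    ability: 'refunds.create',   // checked by Gate::authorize() in the executor
    needsApproval: true,         // non-admins are blocked before execution
    handler: fn (array $args) => app(RefundService::class)->issue(
        orderId: $args['order_id'],
        amount: $args['amount'],
    ),
));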
Prompts Are Code
Prompts change behavior more than code does.
They must be:
- Versioned
- Stored
- Reviewed
- Rolled out gradually
class Prompt extends Model
{
    protected $casts = ['variables' => 'array'];
}

class PromptManager
{
    public static function load(string $name, string $version): Prompt
    {
        return Prompt::where(compact('name', 'version'))->firstOrFail();
    }
}
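The fill() call used in RagResponder is plain placeholder substitution. A minimal sketch of the method, assuming templates live in a template column with {{variable}} markers:

// Added to the Prompt model above.
public function fill(array $values): string
{
    return preg_replace_callback(
        '/\{\{(\w+)\}\}/',
        fn ($match) => $values[$match[1]] ?? $match[0],
        $this->template
    );
}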
Never hardcode prompts in PHP files.
You will want to change them without redeploying.
Observability Is Not Optional
You need to log, trace, and evaluate:
- The user query
- Retrieved chunks
- Final prompt sent
- Model output
- Tokens and latency
Without this, you cannot debug hallucinations.
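A minimal sketch of a tracing wrapper, assuming an LlmTrace model whose columns are illustrative (token counts depend on whatever your client returns, so they are left as a comment):

class TracedLLM
{
    public function chat(array $messages, float $temperature = 0.2, bool $json = true)
    {
        $start = microtime(true);

        $response = app(LLM::class)->chat($messages, temperature: $temperature, json: $json);

        // One row per generation: enough to replay the exact call later.
        LlmTrace::create([
            'messages'   => json_encode($messages),   // retrieved chunks + final prompt live in here
            'output'     => json_encode($response),
            'latency_ms' => (int) ((microtime(true) - $start) * 1000),
            // add prompt/completion token counts from your client's usage data
        ]);

        return $response;
    }
}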
You also need automated evaluations that periodically ask:
“Is this answer actually grounded in the provided context?”
That’s how you catch issues before users do.
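A minimal sketch of such an eval as a scheduled job, assuming the LlmTrace rows from the previous sketch and a client that returns a plain string when json is false; the sample size and verdict prompt are starting points, not magic numbers:

class EvaluateGroundedness
{
    public function handle(): void
    {
        // Evaluate a random sample of yesterday's traces, not every call.
        $traces = LlmTrace::whereDate('created_at', today()->subDay())
            ->inRandomOrder()
            ->limit(50)
            ->get();

        foreach ($traces as $trace) {
            $verdict = app(LLM::class)->chat([
                ['role' => 'system', 'content' => 'Reply with only "yes" or "no".'],
                ['role' => 'user', 'content' =>
                    "Context:\n{$trace->messages}\n\nAnswer:\n{$trace->output}\n\n" .
                    'Is every claim in the answer supported by the context?'],
            ], temperature: 0.0, json: false);

            $trace->update(['grounded' => str_contains(strtolower($verdict), 'yes')]);
        }
    }
}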
Caching and Cost Control
LLM calls are expensive and slow.
Cache deterministic calls by hashing inputs.
class CachedLLM
{
    public function chat(array $payload)
    {
        // Identical payloads (same messages, model, temperature) share one cached response for an hour.
        $key = hash('sha256', json_encode($payload));

        return Cache::remember($key, 3600, fn () =>
            app(LLM::class)->chat($payload)
        );
    }
}
Track cost daily and hard-stop if you exceed budget.
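A minimal sketch of that hard stop, assuming you can estimate a per-call cost from token counts before recording it; the daily limit and cache key scheme are placeholders:

class BudgetGuard
{
    private const DAILY_LIMIT_USD = 50.0;   // placeholder budget

    public function record(float $costUsd): void
    {
        $key = 'llm_spend_' . now()->toDateString();

        // Accumulate today's spend; the counter expires at midnight.
        $spend = (float) Cache::get($key, 0) + $costUsd;
        Cache::put($key, $spend, now()->endOfDay());

        if ($spend >= self::DAILY_LIMIT_USD) {
            throw new \RuntimeException('Daily LLM budget exceeded; refusing further calls.');
        }
    }
}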
What Actually Prevents Incidents
After enough production incidents, you realize the real safeguards are:
- Hybrid retrieval
- Strict prompts
- Guarded tool execution
- Full tracing
- Automated evals
- Aggressive caching
Not model choice. Not fancy agents. Not frameworks.
Just engineering discipline applied to a probabilistic system.
Final Takeaway
A demo AI feature looks like magic.
A production AI system looks like a paranoid, over-engineered backend.
And that’s exactly what it needs to be.