Photo by Aerps.com on Unsplash
Last week, I spent three days trying to solve what seemed like a simple problem. The solution that finally worked? Using a “dumber” model for most of the work and saving the smart one for the hard part.
Let me tell you what happened.
The Problem: Are These the Same Product?
I was building a system to organize retail product information. Given UPC codes, I needed to figure out: are these different sizes of the same product, or completely different products?
- Coca-Cola 250ml and Coca-Cola 1L → Same product, different sizes
- Coca-Cola and Sprite → Different products entirely
Sounds simple, right? Just throw it at Claude and let the smart model figure it out.
1st Attempt: The “Smart Model for Everything” Disaster
My first approach was straightforward:
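Roughly, it was one giant prompt with everything delegated to the smart model. A reconstruction for illustration (the exact prompt and model id differed):

import anthropic

client = anthropic.Anthropic()

def compare_upcs(upc_a: str, upc_b: str) -> str:
    # One prompt that asks Claude to do everything at once:
    # identify the products, extract attributes, and compare them
    prompt = f"""
    UPC 1: {upc_a}
    UPC 2: {upc_b}

    Figure out what each product is, then tell me: are these
    size variants of the same product, or different products?
    """
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model id
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text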
But here’s what happened:
- Claude would hallucinate product names when UPC data was unclear
- Sometimes it would get confused and say Pepsi 500ml and Pepsi 1L were different products
- Processing 500 UPCs took forever
- Accuracy was around 75%
I was frustrated. Claude is supposed to be smart! Why was it making such basic mistakes?
Then I realized: I was asking Claude to do too much. In a single prompt, it had to:
- Figure out what product each UPC actually was
- Extract the brand
- Extract the size
- Normalize inconsistent naming
- Group related products together
- Compare products within each group
- Decide which were size variants and which were different products
That's not one task; that's seven different tasks crammed into one prompt.
The Experiment: Decomposition
I stepped back and asked: “What does Claude actually need to be good at here?”
Answer: Comparing products to decide if they’re the same. That’s it.
Everything else — getting product names, extracting brands, parsing sizes — that’s just data preparation. I don’t need genius-level intelligence for data preparation.
So I redesigned the pipeline:
- Step 1: UPCItemDB API → Get actual product labels
- Step 2: Ollama (Llama 3.2) → Extract brand and size from labels
- Step 3: Python → Normalize and group by brand
- Step 4: Claude → For each brand group, determine variants
The Architecture That Actually Worked
Let me show you the code:
class ProductAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.lookup_tool = lookup_upc_tool  # External API
        self.extractor = dspy.ChainOfThought(ExtractAttributes)

    def forward(self, upc):
        # Step 1: Get real product data
        upc_data = self.lookup_tool(upc=upc)
        title = upc_data.get("title")

        # Step 2: Extract with Ollama (local, free, fast)
        preds = self.extractor(
            title=title,
            description=upc_data.get("description"),
        )

        # Step 3: Normalize with Python (zero cost, zero errors)
        brand = normalize_brand(clean_text(preds.brand))
        size = normalize_size(clean_text(preds.size))

        return {"brand": brand, "size": size, "label": title}
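A couple of names in that code are defined elsewhere. Here's a minimal sketch of the ExtractAttributes signature and the Ollama wiring (the field names and configuration call are assumptions, not the exact originals):

import dspy

# Sketch of the extraction signature (field names assumed)
class ExtractAttributes(dspy.Signature):
    """Extract brand and size from a retail product listing."""
    title = dspy.InputField()
    description = dspy.InputField()
    brand = dspy.OutputField(desc="brand name only, e.g. 'Humble Brands'")
    size = dspy.OutputField(desc="package size, e.g. '2.5oz'")

# Route DSPy's extraction calls to the local Llama 3.2 served by Ollama
dspy.configure(lm=dspy.LM("ollama_chat/llama3.2",
                          api_base="http://localhost:11434"))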
Then for the actual intelligence part:
# Group all products by brand
brand_groups = df.groupby('brand')

# For each brand, ask Claude the hard question
for brand, products in brand_groups:
    prompt = f"""
    These products are all from {brand}:
    {products[['label', 'size']].to_string()}

    Which ones are size variants of the same product type/flavor?
    Group them accordingly.
    """
    variants = claude_sonnet(prompt)
Why This Combo Worked So Well
1. UPCItemDB Did the Heavy Lifting
Instead of having Claude imagine what “858514006595” might be, I got the actual product name: “Humble Brands Natural Deodorant Moroccan Rose 2.5oz”
No hallucination. No guessing. Just facts.
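The lookup tool itself is just an HTTP call. A minimal version against UPCItemDB's public trial endpoint could look like this (the response handling is a sketch; check their docs for the exact schema and rate limits):

import requests

def lookup_upc_tool(upc: str) -> dict:
    # UPCItemDB's free trial endpoint (rate-limited)
    resp = requests.get(
        "https://api.upcitemdb.com/prod/trial/lookup",
        params={"upc": upc},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0] if items else {}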
2. Ollama Was Perfect for the Boring Stuff
Running locally on my laptop, Llama 3.2 extracted brands and sizes:
- “Humble Brands Natural Deodorant Moroccan Rose 2.5oz” → Brand: “Humble Brands”, Size: “2.5oz”
- Cost: ₹0 (running locally)
- Speed: 2 seconds per UPC
- Accuracy: 95% (good enough for this step)
When it made mistakes, they were small — like “Humble Brands” vs “Humble-Brands”. Easy to fix with simple normalization functions.
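Those fixes are plain string cleanup. A minimal normalize_brand sketch (illustrative; real data will need a few more cases):

def normalize_brand(brand: str) -> str:
    # Collapse hyphen and whitespace variants:
    # "Humble-Brands" and "humble  brands" both become "Humble Brands"
    return " ".join(brand.replace("-", " ").split()).title()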
3. Python Handled the Obvious
Normalizing sizes? That’s just string manipulation:
import re

def normalize_size(size_str):
    # "12 Fl Oz" → "12oz"
    # "1 L"      → "1l"
    # "500ml"    → "500ml"
    match = re.match(r"([\d.,]+)\s*(?:fl\.?\s*)?(ml|l|oz)", size_str, re.IGNORECASE)
    if match:
        number, unit = match.groups()
        return f"{number}{unit.lower()}"
    return size_str  # fall back to the raw string if no unit matched
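A few quick sanity checks on the version above:

print(normalize_size("12 Fl Oz"))  # 12oz
print(normalize_size("1 L"))       # 1l
print(normalize_size("500ml"))     # 500ml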
No AI needed. No cost. No possibility of hallucination.
4. Claude Only Did What It’s Actually Good At
By the time Claude saw the data, it looked like this:
Brand: Humble Brands
Products:
1. Humble Brands Natural Deodorant Moroccan Rose 2.5oz
2. Humble Brands Natural Deodorant Moroccan Rose 3.75oz
3. Humble Brands Natural Deodorant Moroccan Rose 7.1oz
4. Humble Brands Natural Deodorant Lavender 2.5oz

(Note: these products don't represent the brand's actual size variants; they're for illustration only.)
Question: Which are variants of the same product?
This is where Claude shines. It needs to understand:
- Moroccan Rose vs. Lavender = different variants
- Different sizes of Moroccan Rose = the same product
- Lavender = a different product entirely
This requires real reasoning. And Claude nailed it nearly every time.
The Results
The multi-agent approach wasn’t just cheaper — it was better.
Why This Pattern Works
1. Use the Right Tool for Each Job
- API calls for data retrieval (that’s what APIs are for)
- Local models for simple extraction (it’s just pattern matching)
- Python for deterministic logic (why use AI for string manipulation?)
- Smart models for actual reasoning (the hard stuff)
2. Errors Are Isolated
When something went wrong, I knew exactly where:
- Wrong product name? → API issue
- Wrong brand extraction? → Ollama prompt needs tweaking
- Wrong grouping? → Claude prompt needs improvement
With the monolithic approach, everything was tangled. One error, one inscrutable output.
3. Cost Follows Complexity
I only pay for Claude when the task actually needed intelligence.
Processing 500 UPCs meant:
- 500 free API calls (just HTTP requests)
- 500 free Ollama extractions (running locally)
- 50 Claude calls (only one per brand group, not per product)
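Back-of-the-envelope, with an assumed per-call price (illustrative, not a real rate):

# Illustrative cost math; the price and prompt-size factor are assumptions
PRICE_PER_CLAUDE_CALL = 0.03  # USD, assumed average

old_cost = 500 * PRICE_PER_CLAUDE_CALL        # Claude on every UPC
new_cost = 50 * PRICE_PER_CLAUDE_CALL * 0.8   # fewer calls, shorter prompts

print(f"savings: {1 - new_cost / old_cost:.0%}")  # savings: 92%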
That’s why costs dropped 92%. This taught me a general pattern that works everywhere.
Examples I’ve seen in other projects since:
Customer support tickets:
- GPT-4o-mini extracts category and urgency → Claude Sonnet drafts response
Legal document review:
- Llama finds relevant clauses → GPT-4 analyzes implications
Code review:
- Local model identifies changed functions → Claude Opus reviews logic
Research synthesis:
- Fast model gathers sources → Smart model synthesizes insights
The pattern is universal: use cheap intelligence to set up the problem, then use expensive intelligence where it actually matters.
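In code, the shape is always the same. A generic sketch (the callables are placeholders you swap per project):

from collections import defaultdict

def two_tier_pipeline(items, cheap_extract, group_key, smart_reason):
    # Tier 1: cheap intelligence (local model, regex, API) runs per item
    groups = defaultdict(list)
    for item in items:
        groups[group_key(cheap_extract(item))].append(item)
    # Tier 2: expensive intelligence runs once per group, not per item
    return {key: smart_reason(group) for key, group in groups.items()}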
The Mental Shift
The hardest part wasn’t the code — it was changing how I thought about the problem.
Old thinking: “I need a model smart enough to solve this entire problem.”
New thinking: “What’s the minimum intelligence needed for each step?”
It’s like cooking. You don’t need a Michelin-star chef to chop vegetables. You need them for the sauce. Use prep cooks for prep work.
What I’d Tell Someone Starting Today
If you’re building anything that processes data at scale:
- Break down your task. What are the actual steps?
- Ask: Which steps need real intelligence vs. simple pattern matching?
- Use the cheapest tool that works for each step
- Save your smart model for the part that actually requires reasoning
- Test each step independently. Makes debugging 10x easier.
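That last point is cheap to act on, because every non-Claude stage is a plain function. For example, a unit test for the normalize_brand sketch from earlier:

def test_normalize_brand():
    # Extraction output drifts; normalization should absorb it
    assert normalize_brand("Humble-Brands") == "Humble Brands"
    assert normalize_brand("humble  brands") == "Humble Brands"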
I tried to be smart by using the smartest model; I succeeded by being strategic about when to use what. Don’t fall into the trap I did: throwing your most expensive tool at every problem because it’s “the best.”
Sometimes the best solution is knowing when not to use the best model.