Microsoft: Don't let AI agents near your credit card yet

Ready to have your agent talk to my agent and arrange a sale? Microsoft has published a simulated marketplace to put AI agents through their paces and answer a question for the new age: Would you trust AI with your credit card?

Customer-facing assistants are all the rage these days. OpenAI and Anthropic, for example, have helpers that will navigate websites and complete purchases. Then there are assistants that will aid sellers with customer engagement and operations.

It all points to a future where, like rich people with personal shoppers, the average user will have “people” to do all the work for them.

To simulate what might happen, Microsoft’s researchers built the Magentic Marketplace, an open-source simulation upon which agents can be unleashed and the results studied.

And the conclusion? “Agents should assist, not replace, human decision-making.”

The marketplace simulation manages catalogs of goods and services, and facilitates agent-to-agent communication. It also handles simulated payments. The researchers simulated transactions such as ordering food or engaging with home improvement services. Agents represented customers and businesses at each end of the transactions.

Each experiment was run using 100 virtual customers and 300 virtual businesses, and included both proprietary models (such as GPT-4o and Gemini-2.5-Flash) and open-source models. The team had agents building queries, navigating results, and negotiating transactions.

The results were interesting. Although agents can help (the thinking is that an AI agent should be able to consider far more possibilities than a human could), loading them with more options and search results led to a decline in the number of comparisons. With some exceptions (notably Gemini-2.5-Flash and GPT-5), researchers found the models tended to accept the initial “good enough” options rather than dig deeper.
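To make the "good enough" behavior concrete, here is a minimal Python sketch contrasting an agent that takes the first acceptable offer with one that compares every option. It is an illustration only: the function names, the business names, and the prices are hypothetical and do not come from the Magentic Marketplace code or its data.

```python
from typing import Optional

# Hypothetical offer: (business name, price). Illustrative data only,
# not taken from the Magentic Marketplace study.
Offer = tuple[str, float]


def first_good_enough(offers: list[Offer], budget: float) -> Optional[Offer]:
    """Accept the first offer at or under budget and ignore the rest."""
    for offer in offers:
        if offer[1] <= budget:
            return offer
    return None


def best_of_all(offers: list[Offer], budget: float) -> Optional[Offer]:
    """Compare every offer and return the cheapest one within budget."""
    affordable = [o for o in offers if o[1] <= budget]
    return min(affordable, key=lambda o: o[1]) if affordable else None


offers = [("Taqueria A", 14.0), ("Taqueria B", 11.5), ("Taqueria C", 9.0)]

print(first_good_enough(offers, budget=15.0))  # ('Taqueria A', 14.0)
print(best_of_all(offers, budget=15.0))        # ('Taqueria C', 9.0)
```

In this toy case, the first strategy stops at the top result even though a cheaper option sits two places down the list.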

Researchers also tried manipulation strategies, which ranged from fake award credentials and fake reviews to prompt injections. Again, the models varied. Gemini-2.5-Flash proved generally resistant, while others could be tricked. Prompt injection techniques succeeded in directing payments to manipulative agents, and more basic persuasion techniques were also effective.

The researchers noted: “These findings highlight a critical security concern for agentic marketplaces.”

It all suggests that the current state of the art in AI models still has some way to go. The agents were shown to struggle when presented with too many options and were vulnerable to manipulation. Researchers also found some models showed biases, including selecting a business based on its position in the results rather than on merit.

And then there is the design and implementation of the marketplace. The researchers said: “Our current study focused on static markets, but real-world environments are dynamic, with agents and users learning over time.

“Oversight is critical for high-stakes transactions.”

“A simulation environment like Magentic Marketplace is crucial for understanding the interplay between market components and agents before deploying them at scale.”

So, perhaps reconsider handing over authority to an agent at this point. The results might not be quite what you were expecting. ®
