One year ago, Sam Altman, the C.E.O. of OpenAI, made a bold prediction: “We believe that, in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies.” A couple of weeks later, the company’s chief product officer, Kevin Weil, said at the World Economic Forum conference at Davos in January, “I think 2025 is the year that we go from ChatGPT being this super smart thing . . . to ChatGPT doing things in the real world for you.” He gave examples of artificial intelligence filling out online forms and booking restaurant reservations. He later promised, “We’re going to be able to do that, no question.” (OpenAI has a corporate partnership with Condé Nast, the owner of The New Yorker.)
This was no small boast. Chatbots can respond directly to a text-based prompt—by answering a question, say, or writing a rough draft of an e-mail. But an agent, in theory, would be able to navigate the digital world on its own, and complete tasks that require multiple steps and the use of other software, such as web browsers. Consider everything that goes into making a hotel reservation: deciding on the right nights, filtering based on one’s preferences, reading reviews, searching various websites to compare rates and amenities. An agent could conceivably automate all of these activities. The implications of such a technology would be immense. Chatbots are convenient for human employees to use; effective A.I. agents might replace the employees altogether. The C.E.O. of Salesforce, Marc Benioff, who has claimed that half the work at his company is done by A.I., predicted that agents will help unleash a “digital labor revolution,” worth trillions of dollars.
2025 was heralded as the Year of the A.I. Agent in part because, by the end of 2024, these tools had become undeniably adept at computer programming. A demo of OpenAI’s Codex agent, from May, showed a user asking the tool to modify his personal website. “Add another tab next to investment/tools that is called ‘food I like.’ In the doc put—tacos,” the user wrote. The chatbot quickly carried out a sequence of interconnected actions: it reviewed the files in the website’s directory, examined the contents of a promising file, then used a search command to find the right location to insert a new line of code. After the agent learned how the site was structured, it used this information to successfully add a new page that featured tacos. As a computer scientist myself, I had to admit that Codex was tackling the task more or less as I would. Silicon Valley grew convinced that other difficult tasks would soon be conquered.
As 2025 winds down, however, the era of general-purpose A.I. agents has failed to emerge. This fall, Andrej Karpathy, a co-founder of OpenAI, who left the company and started an A.I.-education project, described agents as “cognitively lacking” and said, “It’s just not working.” Gary Marcus, a longtime critic of tech-industry hype, recently wrote on his Substack that “AI Agents have, so far, mostly been a dud.” This gap between prediction and reality matters. Fluent chatbots and reality-bending video generators are impressive, but they cannot, on their own, usher in a world in which machines take over many of our activities. If the major A.I. companies cannot deliver broadly useful agents, then they may be unable to deliver on their promises of an A.I.-powered future.
The term “A.I. agents” evokes ideas of supercharged new technology reminiscent of “The Matrix” or “Mission: Impossible—The Final Reckoning.” In truth, agents are not some kind of customized digital brain; instead, they are powered by the same type of large language model that chatbots use. When you ask an agent to tackle a chore, a control program—a straightforward application that coördinates the agent’s actions—turns your request into a prompt for an L.L.M. Here’s what I want to accomplish, here are the tools available, what should I do first? The control program then attempts any actions that the language model suggests, tells it about the outcome, and asks, Now what should I do? This loop continues until the L.L.M. deems the task complete.
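To make this concrete, here is a minimal sketch, in Python, of what such a control program might look like; the `ask_llm` and `run_tool` functions are hypothetical stand-ins for a real language model and for whatever tools the agent has at its disposal.

```python
# A hypothetical sketch of an agent's control loop. `ask_llm` and `run_tool`
# are stand-ins for a real language-model interface and for whatever tools
# (a browser, a terminal, a booking site) the agent can use.

def run_agent(task, tools, ask_llm, run_tool, max_steps=20):
    history = [f"Task: {task}", f"Available tools: {', '.join(tools)}"]
    for _ in range(max_steps):
        # Ask the model what to do next, given everything that has happened so far.
        suggestion = ask_llm("\n".join(history) + "\nWhat should I do next?")
        if suggestion.strip().upper() == "DONE":
            break  # the model deems the task complete
        # Attempt the suggested action and report the outcome back to the model.
        outcome = run_tool(suggestion)
        history.append(f"Action: {suggestion}")
        history.append(f"Outcome: {outcome}")
    return history
```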
This setup turns out to excel at automating software development. Most of the actions required to create or modify a computer program can be implemented by entering a limited set of commands into a text-based terminal. These commands tell a computer to navigate a file system, add or update text in source files, and, if needed, compile human-readable code into machine-readable bits. This is an ideal setting for L.L.M.s. “The terminal interface is text-based, and that is the domain that language models are based on,” Alex Shaw, the co-creator of Terminal-Bench, a popular tool used to evaluate coding agents, told me.
More generalized assistants, of the sort envisioned by Altman, would require agents to leave the comfortable constraints of the terminal. Since most of us complete computer tasks by pointing and clicking, an A.I. that can “join the workforce” probably needs to know how to use a mouse—a surprisingly difficult goal. The Times recently reported on a string of new startups that have been building “shadow sites”—replicas of popular webpages, like those of United Airlines and Gmail, on which A.I. can analyze how humans use a cursor. In July, OpenAI released ChatGPT Agent, an early version of a bot that can use a web browser to complete tasks, but one review noted that “even simple actions like clicking, selecting elements, and searching can take the agent several seconds—or even minutes.” At one point, the tool got stuck for nearly a quarter of an hour trying to select a price from a real-estate site’s drop-down menu.
There’s another option to improve the capability of agents: make existing tools easier for the A.I. to master. One open-source effort aims to develop what’s known as Model Context Protocol, a standardized interface that allows agents to access software using text-based requests. Another is the Agent2Agent protocol, launched by Google last spring, which proposes a world in which agents interact directly with each other. My personal A.I. doesn’t have to use a hotel-reservation site if it can instead ask a dedicated A.I.—perhaps trained by the hotel company itself—to navigate the site on its behalf. Of course, it will take time to rebuild the infrastructure of the internet with bots in mind. (For years, developers have actively tried to prevent bots from messing around with websites.) And even if technologists can complete this project, or successfully master the mouse, they will face another challenge: the weaknesses of the L.L.M.s that underlie their agents’ decisions.
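To give a sense of what these text-based requests might look like, here is a rough, hypothetical sketch of the kind of message an agent could send to a hotel-search tool over an M.C.P.-style interface; the tool name and its arguments are invented for illustration.

```python
import json

# A hypothetical tool invocation in the spirit of the Model Context Protocol,
# which frames requests as JSON-RPC messages. The "search_rooms" tool and its
# arguments are invented for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_rooms",
        "arguments": {
            "city": "Chicago",
            "check_in": "2026-03-06",
            "check_out": "2026-03-08",
            "max_price_per_night": 250,
        },
    },
}

print(json.dumps(request, indent=2))
```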
In a video that announced the début of ChatGPT Agent, Altman and a group of OpenAI engineers demoed several of its features. At one point, it generated a map, supposedly displaying an itinerary for visiting all thirty Major League Baseball stadiums in North America. Curiously, it included a stop in the middle of the Gulf of Mexico. One could dismiss this flub as an outlier, but for Marcus, the Silicon Valley critic, this type of mistake underscores a more fundamental issue. He told me that L.L.M.s lack sufficient understanding of “how things work in the world” to reliably tackle open-ended tasks. Even in straightforward scenarios, such as planning a trip, he said, “you still have to reason about time, and you still have to reason about location”—basic human abilities that language models struggle with. “They’re building clumsy tools on top of clumsy tools,” he said.
Other commentators warn that agents will amplify errors. As chatbot users quickly learn, L.L.M.s have a tendency to make things up; one popular benchmark reveals that various versions of GPT-5, OpenAI’s cutting-edge model, have a hallucination rate of around ten per cent. For an agent tackling a multi-step task, these semi-regular lapses might prove catastrophic: it only takes one misstep for the entire effort to veer off track. “Don’t get too excited about AI agents yet,” a Business Insider headline warned in the spring. “They make a lot of mistakes.”
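The arithmetic of compounding errors helps explain the worry. Assuming, for the sake of illustration, that each step of a task succeeds independently about ninety per cent of the time, the chance of completing a long plan without a single misstep collapses quickly:

```python
# Back-of-the-envelope illustration: if each step succeeds independently
# with probability 0.9, an eighteen-step plan finishes cleanly only about
# fifteen per cent of the time.
per_step_success = 0.9
steps = 18
print(per_step_success ** steps)  # roughly 0.15
```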
To better understand how an L.L.M. brain could go astray, I asked ChatGPT to walk through the plan it would follow if it were powering a hotel-booking agent. It described a sequence of eighteen steps and sub-steps: selecting the booking website, applying filters to the search results, entering credit-card information, sending me a summary of the reservation, and so on. I was impressed by how thoroughly the model could break down the activity. (Until you see them listed out, it’s easy to underestimate just how many small actions go into such a common task.) But I could also see places where our hypothetical agent might stumble.
Sub-step 4.4, for example, has the agent rank rooms using a formula: α*(location score) + β*(rating score) − γ*(price penalty) + δ*(loyalty bonus). This is the right type of thing to do in this situation, but the L.L.M. left the details worrisomely underspecified. How would it calculate these penalty and bonus values, and how would it select the weights (represented by Greek symbols) to balance them? Humans would presumably hand-tune such details using trial-and-error and common sense, but who knows what an L.L.M. might do on its own. And little mistakes will matter: overemphasize something like the price penalty and you might end up in one of the seediest hotels in the city.
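For illustration, here is a sketch of how that ranking formula might be implemented, with invented, hand-picked weights; the point is how sensitive the result is to them. Nudge the weight on the price penalty too high and the cheapest room wins, whatever its quality.

```python
# A hypothetical implementation of the agent's ranking formula:
# score = alpha*location + beta*rating - gamma*price_penalty + delta*loyalty_bonus.
# The weights and the two example rooms are invented for illustration.

def score(room, alpha=0.3, beta=0.4, gamma=0.2, delta=0.1):
    return (
        alpha * room["location_score"]
        + beta * room["rating_score"]
        - gamma * room["price_penalty"]
        + delta * room["loyalty_bonus"]
    )

rooms = [
    {"name": "Well-reviewed downtown hotel", "location_score": 9,
     "rating_score": 9, "price_penalty": 8, "loyalty_bonus": 2},
    {"name": "Cheap motel by the highway", "location_score": 4,
     "rating_score": 3, "price_penalty": 1, "loyalty_bonus": 0},
]

# With these weights the downtown hotel ranks first; raise gamma to 0.6
# and the motel overtakes it.
rooms.sort(key=score, reverse=True)
print([r["name"] for r in rooms])
```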
A few weeks ago, Altman announced in an internal memo that the development of A.I. agents was one project, among others, that OpenAI would deëmphasize, because it wanted to focus on improving its core chatbot product. This time last year, leaders like Altman were making it sound as though we’d raced over a technological cliff and were tumbling chaotically toward an automated workforce. Such breathlessness now seems rash. Lately, in an effort to calibrate my expectations about artificial intelligence, I’ve been thinking about a podcast interview with Karpathy, the OpenAI co-founder, from October. Dwarkesh Patel, the interviewer, asked him why the Year of the Agent had failed to materialize. “I feel like there’s some overpredictions going on in the industry,” Karpathy replied. “In my mind, this is really a lot more accurately described as the Decade of the Agent.” ♦