Bas Franken

Why Most Agentic AI Still Fails in Production

I have built dozens of AI agent projects over the past year. Most of them went nowhere. Not because the technology failed: the agents ran fine, the LLM calls returned reasonable responses, the tool integrations worked. The problem was that I was building solutions to problems nobody actually had. I would get excited about a concept, build the agent around it, get it running, and then realize that the underlying problem either did not exist at the scale I imagined or could be solved with a shell script and a cron job. The AI was a hammer, and I kept finding nails that did not justify using it.

That, I think, is the trap many of the 88% of failed AI pilots fall into. It is not a technical failure. It is not bad engineering. It is building something impressive that nobody needs, because the technology is seductive and it feels like it should be able to do everything. "Can do" and "should do" are different questions, and the second one is harder to answer than the first. The few projects of mine that actually work share exactly one thing: they started from a real problem I personally had, not from a technology I wanted to use. My Azure cost agent exists because I was tired of manually checking for idle resources. The agent does something I was already doing by hand, just faster and more consistently. That is the whole reason it is still running.

I want to walk through why so many agentic AI projects die between pilot and production, because the data is out there and it tells a consistent story. Your CEO just came back from a keynote asking why you do not have an agent yet. Here is what the keynote did not say.

The Numbers

Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027. Their exact words: “due to escalating costs, unclear business value or inadequate risk controls.”

Research from IDC in partnership with Lenovo found that 88% of AI proofs of concept never make it to production. For every 33 AI POCs a company launched, only four graduated to production. IDC’s own conclusion: “the high number of AI POCs but low conversion to production indicates the low level of organizational readiness in terms of data, processes and IT infrastructure.”

These are not blog estimates. These are IDC and Gartner.

The Math That Kills Agents

This is the part that should be on every slide deck but never is.

When an AI agent works through a multi-step task, the overall probability of success is the product of the per-step success probabilities: every additional step multiplies in another chance to fail. This principle is called Lusser's Law, named after German engineer Robert Lusser, who formalized it in the 1950s while studying rocket failures.

The math is simple:

  • 85% accuracy per step, 10 steps: 0.85^10 = 20% overall success rate
  • 90% accuracy per step, 10 steps: 0.90^10 = 35% overall success rate
  • 95% accuracy per step, 20 steps: 0.95^20 = 36% overall success rate

At 85% per-step accuracy, four out of five runs of your 10-step agent will contain at least one error somewhere in the chain. That's not a bug. That's arithmetic.
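The figures above are easy to reproduce. A minimal sketch, assuming independent steps with uniform per-step accuracy (`chain_success_rate` is a name I made up for illustration):

```python
# Lusser's Law: a chain of independent steps succeeds only if every
# step succeeds, so the overall success rate is the product of the
# per-step rates. For uniform accuracy p over n steps: p ** n.

def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that an n-step agent run completes with no errors."""
    return per_step_accuracy ** steps

for p, n in [(0.85, 10), (0.90, 10), (0.95, 20)]:
    print(f"{p:.0%} per step over {n} steps -> "
          f"{chain_success_rate(p, n):.0%} overall")
# 85% per step over 10 steps -> 20% overall
# 90% per step over 10 steps -> 35% overall
# 95% per step over 20 steps -> 36% overall
```

Real agent runs violate the independence assumption (an early error often makes later errors more likely), so this is an optimistic bound, not a pessimistic one.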

A Towards Data Science analysis from March 2026 documented two real incidents that illustrate this. Jason Lemkin spent nine days building a contact database with an AI coding agent. The agent interpreted “freeze the code” as an invitation to delete the production database, then generated 4,000 fake records to fill the gap. Separately, OpenAI’s Operator agent completed a $31.43 grocery purchase without user confirmation, bypassing its own safety guardrails.

Neither was a rare bug. Both were the predictable outcome of compound error probability in multi-step workflows.

Agent Washing

Of the thousands of vendors selling “AI agents” in 2026, Gartner estimates only about 130 are real. The rest are what Gartner calls “agent washing”: rebranding existing products like AI assistants, RPA scripts, and chatbots with “AI Agent” on the marketing page, without adding actual agentic capabilities.

Companies buy solutions expecting autonomous execution and get glorified chatbots that still require constant human intervention. Then the project gets cancelled for “unclear business value” and ends up in Gartner’s 40% statistic.

How to spot it: ask the vendor what happens when the agent encounters something outside its training data. A real agent adapts its approach. A rebranded chatbot returns a fallback message.

Why Pilots Work But Production Doesn’t

The demo always works. The pilot usually works. Production breaks.

IDC’s research identified three consistent reasons:

Unclear ROI. The pilot proved the technology works. Nobody proved it generates more value than it costs.

Insufficient AI-ready data. The pilot ran on clean, curated data. Production data is scattered across dozens of systems in incompatible formats.

Lack of in-house expertise. The pilot was built by a vendor or a specialized team. Production requires your own engineers to maintain and debug the agent.

CIO.com’s analysis adds a fourth factor: overzealous greenlighting. CEO and board pressure is causing companies to approve AI POCs far more easily than other technology projects. More pilots doesn’t mean more success. It means more failures, faster.

Cascading Failures in Multi-Agent Systems

Single-agent systems fail predictably. Multi-agent systems fail catastrophically.

When one agent makes a mistake, it passes that wrong answer to the next agent as valid input. OWASP classified this as a distinct security threat (ASI08) in their 2026 Agentic Security framework.

The specific problem: semantic errors in agent-to-agent communication pass validation checks. The message format is correct. The data types are correct. The content is wrong. No monitoring catches it because it looks like a normal response.
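A toy sketch makes the failure mode concrete. The message format and field names below are hypothetical, not from any real framework, but the shape of the problem is exactly this: structural validation checks types and schema, not meaning.

```python
# Schema validation verifies structure and types, not semantics.
# A hallucinated value sails through every check.
# (Hypothetical message format, invented for illustration.)

def validate_message(msg: dict) -> bool:
    """Structural check an orchestrator might run between agents."""
    return (
        isinstance(msg.get("task_id"), str)
        and isinstance(msg.get("status"), str)
        and msg["status"] in {"ok", "error"}
        and isinstance(msg.get("result"), dict)
    )

# Agent A hallucinated the region, but the handoff is well-formed:
handoff = {
    "task_id": "deploy-142",
    "status": "ok",
    "result": {"region": "westeurope"},  # correct answer: "northeurope"
}

assert validate_message(handoff)  # passes; Agent B now works from bad data
```

Catching the wrong region requires a semantic check against ground truth, which is precisely what generic pipeline monitoring does not have.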

What Actually Works Right Now

Not everything is failing. Some agents are generating real, measurable results. The pattern is consistent: narrow scope, verifiable output, and human oversight at critical points.

Coding agents are the clearest success story. Cursor passed $2 billion in annualized revenue in early 2026, doubling in three months. Cognition acquired Windsurf in July 2025 to integrate it into the Devin platform. GitHub’s coding agent spins up ephemeral VMs, clones repos, and submits pull requests for human review. The key detail across all of them: each agent operates with repo-scoped tokens that grant CI read access but gate write and deploy permissions behind review policies. Narrow scope, verifiable output (code compiles or it does not), human in the loop.

DevOps incident response is working at scale. Azure SRE Agent went GA in March 2026, and Microsoft has been running it across its own services with over 1,300 agents deployed, handling more than 35,000 incidents in the nine months before GA. AWS DevOps Agent went GA in March 2026 too, building topology maps and correlating logs, metrics, and deployment history to surface root causes. Both run primarily in “suggest and human approve” mode for anything beyond low-risk actions.

Logistics optimization has concrete numbers. Uber Freight has cut empty miles by 10 to 15 percent since 2023 through AI-driven route matching, removing an estimated 4 million empty miles from its network. The gains come from matching loads to trucks that would otherwise return empty, which is the kind of narrow, measurable, verifiable problem where this generation of agents actually works.

The pattern across all of these: the agent does one specific thing, the output is measurable, and a human reviews anything with real consequences.

Why These Succeed Where Others Fail

The successful deployments share three properties that align with the math:

Fewer steps. A coding agent that reads a file, makes a change, and submits a PR has 3 steps. At 90% accuracy per step, that's 73% overall success. Compare that to a 15-step "autonomous business process agent" at 21% overall success. I ran into the same math when I built an Azure FinOps dashboard on Microsoft Agent Framework 1.0: triage routes once, one specialist answers, three steps total, and the whole thing stays in the reliable band because of it.

Each step is verifiable. Code compiles or it doesn’t. An incident correlation matches the logs or it doesn’t. A route optimization reduces miles or it doesn’t. When the output is measurable, errors get caught before they cascade.

Failures don’t cause damage. A wrong PR suggestion wastes a reviewer’s time. A wrong incident root cause suggestion wastes an engineer’s time. Neither deletes a database or makes an unauthorized purchase. The blast radius is contained.

This Will Change

This is a snapshot of April 2026, not a permanent verdict.

The compound error problem is being actively researched. Cognizant published MAKER, a framework that achieved zero errors over one million LLM steps by decomposing tasks into micro-agents with error correction at each step. That’s still a research result, not a production tool, but it shows the math problem is solvable.

Model accuracy is improving every quarter. The jump from 85% to 95% per-step accuracy doesn’t sound dramatic, but over 10 steps it’s the difference between 20% and 60% overall success. We’re on that trajectory.

Frameworks are maturing. LangGraph, Microsoft Agent Framework, and CrewAI are all shipping better observability, checkpointing, and error handling with every release. The tooling gap that kills production deployments today is closing.

Gartner themselves predict that 33% of enterprise software will include agentic AI by 2028, up from less than 1% in 2024. They’re not saying agents don’t work. They’re saying most current implementations are premature.

The question isn’t whether agentic AI will work. It’s whether your specific implementation, with your specific data quality, your specific integration complexity, and your specific scope, is ready for production today. For most organizations in April 2026, the honest answer is: not yet. But “not yet” is very different from “never.”

The Bottom Line

If you’re evaluating agentic AI for your organization, run the compound error math before you write the business case. Check whether your vendor is one of Gartner’s 130 real ones or part of the agent-washing majority. Scope your first agent to something narrow, verifiable, and low-risk. And have a real conversation about data readiness before you start building.

The technology is real. The results are real, for the companies that got the fundamentals right. But the gap between a working demo and a working production system is where 88% of projects go to die. Understanding why is the first step to being in the 12%.