AI capability is no longer the constraint. The bottleneck is organizational: knowing what to ask AI to do, what can be safely delegated, and having the systems required to verify and operationalize what AI produces.
That scarcity explains the paradox many organizations face: AI pilots "work," yet there is no clear path to production or to measurable ROI.
This pattern mirrors Futurum Research's 2026 findings. Enterprises are past AI experimentation. What now limits scale is not model quality or ambition, but gaps in governance, data hygiene, and operational readiness.
What Determines AI Value
AI deployment value is a function of three variables:
| Variable | Definition |
|---|---|
| Human Baseline | How long a task takes a human to complete, at what success rate, without AI |
| Probability of Success | How reliably the AI completes the task to the required standard, including the human review time needed to verify it |
| AI Process Cost | The full overhead of running the AI: prompting, waiting, evaluating output, correcting errors, handing off |
The math only favors AI when all three move in the right direction simultaneously. A fast AI that produces unreliable output doesn't win. A reliable AI that takes longer end-to-end than the human baseline doesn't win. An AI that reduces human task time but creates a verification burden that eats the savings doesn't win either.
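To make the interaction between the three variables concrete, here is a minimal sketch of the calculation. The function, the retry-on-failure assumption, and all numbers are illustrative, not figures from this article.

```python
def net_value_per_task(
    human_baseline_min: float,   # minutes a human needs without AI
    p_success: float,            # probability the AI output passes review
    ai_process_cost_min: float,  # prompting, waiting, review, handoff (minutes)
) -> float:
    """Expected minutes saved per task by delegating to AI.

    Assumes (illustratively) that a failed AI attempt still incurs the
    full AI process overhead and then falls back to the human baseline.
    """
    expected_ai_time = ai_process_cost_min + (1 - p_success) * human_baseline_min
    return human_baseline_min - expected_ai_time

print(net_value_per_task(10, 0.95, 2))   # reliable and cheap: about +7.5 min saved
print(net_value_per_task(10, 0.95, 12))  # reliable but slower end-to-end: about -2.5
print(net_value_per_task(10, 0.30, 3))   # fast but unreliable: about 0, no gain
```

The three calls mirror the scenarios above: only the configuration where reliability and low process cost hold simultaneously comes out ahead.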
As AI handles longer and more complex tasks, the human verification burden often grows. A one-paragraph AI summary is easy to check. A forty-page AI-generated contract analysis is not. METR's research on AI task horizons shows autonomous capability doubling every four to seven months, but the organizational capacity to verify longer-horizon outputs isn't keeping pace.
What Pilots Can and Can't Tell You
A pilot is the right tool for a specific job: validating that an AI system can perform a task, under reasonably favorable conditions, with a motivated group of users.
What a pilot can't tell you is whether that value holds at production scale. And that's the question that determines whether AI delivers returns.
The Feasibility-to-Value Gap
An MIT NANDA report found that only 5% of custom enterprise AI tools reached production.
"Despite $30-40 billion in enterprise investment into GenAI, the report uncovers a surprising result in that 95% of organizations are getting zero return. Just 5% of integrated AI pilots are extracting millions in value, while the vast majority remain stuck with no measurable P&L impact. This divide does not seem to be driven by model quality or regulation, but seems to be determined by approach."
MIT NANDA Report
Most pilots share common factors that help explain these failures:
- Inputs are curated around cases the team expects the system to handle well
- Users are early adopters, so more motivated and technically capable than the average production user
- Data is cleaner than the real environment
- Governance and access controls are minimal
- Success is measured in simple terms and under controlled conditions
None of these are wrong choices. They make pilots fast and manageable. But they mean the pilot operates in conditions that won't exist in production, so its results don't directly answer the value question.
The Evaluation Problem
The metrics used to declare a pilot successful are poorly suited to predicting production performance. Three gaps explain most of the failure.
Benchmark Bias
This is a classic case of Goodhart's Law. When a measure becomes a target, it stops being a good measure. When you evaluate an AI system in a pilot, you create the test inputs based on what you expect the system to handle, drawing on cases you've already seen it manage well.
So pilot performance is measured under conditions that won't exist in production: the benchmark was optimized for the system by the very act of creating it.
Production will always surface messier cases: edge cases, unusual formats, ambiguous requests, and users who interact with the system in ways the pilot team didn't anticipate.
Accuracy Without Error Cost
90% accuracy sounds like a strong result. But whether it's acceptable depends entirely on what the 10% error rate costs at production volume.
An example: if you're processing 10,000 invoices against contracts monthly, 90% accuracy means 1,000 errors per month. What does each error cost in staff correction time, vendor disputes, audit exposure, and delayed resolution?
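Run as a sketch, the volume math looks like this; the per-error cost is a placeholder assumption, not a figure from this article.

```python
# Hypothetical error-cost math for the invoice example above.
monthly_volume = 10_000
accuracy = 0.90
errors_per_month = monthly_volume * (1 - accuracy)  # about 1,000 errors

cost_per_error = 15.0  # assumed: correction time, disputes, audit exposure ($)
monthly_error_cost = errors_per_month * cost_per_error

print(f"{errors_per_month:,.0f} errors/month -> ${monthly_error_cost:,.0f}/month")
# 1,000 errors/month -> $15,000/month
```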
Both numbers, the accuracy rate and the cost of errors at volume, are needed before a production decision can be made responsibly.
Output Quality vs. Action Quality
This is a gap that becomes important as AI moves from generating outputs to executing workflows. Evaluating output quality and evaluating action quality are different problems.
A chatbot produces a recommendation or a summary; a human reads it and decides what to do. An agent takes actions: filing, triggering downstream processes, handing off to the next step in a chain.
A system that produces good outputs might still take the wrong action if it lacks context. Evaluating these two systems requires fundamentally different frameworks, and most teams only have one.
What Production Demands
Moving to production reveals obstacles pilots don't encounter.
In a pilot, the people running the system are also the people evaluating it. They know the system's quirks and failure modes, and they're invested in its success.
In production, the AI is acting on behalf of people who didn't design it, in contexts the pilot team didn't fully anticipate, with accountability structures the pilot never had to address. That's a different trust environment.
| Obstacle | What it means in production |
|---|---|
| Role-based access control | Different users have different permissions. The AI needs to know what it can access and on whose behalf it can act |
| Data constraints | Datasets available in the pilot environment can carry restrictions in production. The data environment differs from what the pilot assumed |
| Legacy system integration | Pilots often use clean APIs or simplified data pipelines that don't reflect the actual production architecture |
| Volume and latency | The system architecture may need to change to handle scale |
| Brand and compliance guardrails | The organization has standards for what the AI can say and do on its behalf. In a pilot these are often informal or unenforced. In production they must be built into the system and enforced |
| Organizational ownership | Pilots are usually owned by IT teams. Moving to production means a business unit taking accountability for the AI's outputs as part of their core workflow |
Each of these obstacles adds to the AI Process Cost or reduces the Probability of Success in the production environment. The gap between pilot performance and production performance on the value formula is largely determined by how well these constraints were anticipated during the pilot stage.
The Governance Gap
Good governance is designed in during the pilot stage rather than bolted on later.
| Governance Pillar | The question it answers |
|---|---|
| Output validation | Does the AI's response meet defined quality standards before it's acted on or passed to the next workflow step? |
| Action constraints | What can the AI do autonomously vs. what requires human decisions? |
| Role-based permissions | What can the AI access, and on behalf of which users can it act? |
| Escalation logic | When does the system route to a human rather than proceed? |
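As a sketch of how these four pillars could gate a single proposed action before it executes: every name, threshold, and rule below is an illustrative assumption, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    user_role: str
    action: str
    quality_score: float  # from an upstream output-validation step, 0..1

# Action constraints: what the AI may do autonomously (assumed set)
ALLOWED_AUTONOMOUS = {"classify_document", "draft_reply"}

# Role-based permissions: what the AI may do on each user's behalf (assumed)
ROLE_PERMISSIONS = {
    "analyst": {"classify_document"},
    "manager": {"classify_document", "draft_reply"},
}

def gate(a: ProposedAction) -> str:
    if a.quality_score < 0.8:                      # output validation
        return "escalate: output below quality threshold"
    if a.action not in ALLOWED_AUTONOMOUS:         # action constraints
        return "escalate: action requires a human decision"
    if a.action not in ROLE_PERMISSIONS.get(a.user_role, set()):
        return "escalate: user lacks permission"   # role-based permissions
    return "execute"

print(gate(ProposedAction("analyst", "draft_reply", 0.95)))
# escalate: action requires... no: user lacks permission
```

Any check that fails routes to a human rather than failing silently, which is the escalation logic the last row describes.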
Designing Pilots for Production
The standard question at the end of a pilot is: does this work?
A better question is: have we answered everything we need to know to operate this reliably and accountably in production? If the answer is no, the pilot should continue, with the unanswered questions as the explicit success criteria for the next phase.
| Design Principle | What it requires |
|---|---|
| Test against production conditions | What will the real data environment look like? What is the full range of user types? What are the volume and latency requirements? The pilot should test against these, or explicitly scope what it isn't testing and plan to address it before production |
| Run the error cost calculation upfront | Before the pilot begins, define what accuracy threshold is acceptable and run the volume math. If 90% is the threshold, calculate what the error rate costs at production volume. If that cost is acceptable, the threshold is right. If it isn't, the threshold needs to change |
| Define what the AI is not permitted to do | Action constraints should be defined and tested during the pilot |
| Establish business unit ownership before the pilot ends | If there is no business unit owner prepared to take accountability for the workflow in production, there is no production path |
| Scope governance as a pilot deliverable | Access controls, output validation, escalation logic. These should be designed during the pilot, even if fully built afterward |
| Define the rollback plan before going live | If production performance degrades, what is the process for reverting or pausing the system? This is much easier to decide before any incidents than during one |
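For the rollback principle in particular, one pattern worth sketching is a configuration-level kill switch, so pausing the system is a settings change rather than an emergency deploy. The flag and function names below are placeholders.

```python
import os

def ai_workflow(task: str) -> str:
    return f"AI handled: {task}"              # stand-in for the real AI path

def manual_fallback(task: str) -> str:
    return f"routed to manual queue: {task}"  # pre-agreed rollback path

def handle(task: str) -> str:
    # Flipping AI_WORKFLOW_ENABLED reverts every new task to the manual path.
    if os.environ.get("AI_WORKFLOW_ENABLED", "true") != "true":
        return manual_fallback(task)
    return ai_workflow(task)
```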
What and How to Measure
As AI moves through three broad stages of organizational use, what you measure has to change with it.
| Phase | What it is | Goal | What to Measure | Gate to Next Stage |
|---|---|---|---|---|
| Pilot | Experimentation stage | Validate technical feasibility and user value | Task success rate | Clear, repeatable criteria for what is "production-ready" (accuracy, cost, risk factors) |
| Production | AI tools operating in everyday workflows | Deliver consistent value in real work | Adoption, reliability, latency, cost per task | Stable performance + clear ownership and monitoring |
| Scale | Workflows redesigned with AI-first thinking | Multiply impact across teams and functions | P&L impact | AI becomes a default design assumption |
For agentic systems specifically, value is created across four stages of a loop:
| Stage | What happens | What leaders must define |
|---|---|---|
| READ | Ingest unstructured information | What are we trying to do and why? What qualifies input to enter the loop? |
| THINK | Apply domain rules and judgment | What does success look like? Where does AI judgment end and human judgment begin? |
| WRITE | Produce structured output | What qualifies output to move to the next stage? |
| VERIFY | Check against standards and constraints | What are the limits of AI for this specific workflow? |
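To make the stage boundaries concrete, here is a toy version of the loop for a single document. Every rule and threshold is a placeholder to be defined per workflow, following the questions in the table.

```python
def read(raw: str) -> str | None:
    """READ: qualify input before it enters the loop."""
    text = raw.strip()
    return text if text else None

def think(text: str) -> str:
    """THINK: apply a (trivial) domain rule; a real system calls a model here."""
    return "invoice" if "invoice" in text.lower() else "other"

def write(label: str) -> dict:
    """WRITE: produce structured output for the next stage."""
    return {"label": label, "confidence": 0.9 if label == "invoice" else 0.4}

def verify(out: dict) -> bool:
    """VERIFY: check output against standards; a failure routes to a human."""
    return out["confidence"] >= 0.8

def run_loop(raw: str):
    text = read(raw)
    if text is None:
        return "rejected at intake"
    out = write(think(text))
    return out if verify(out) else "escalated to human review"

print(run_loop("Invoice #1234 from Acme"))
# {'label': 'invoice', 'confidence': 0.9}
```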
How We Think About Metrics at Tatras Data
There are no universal metrics for agentic AI. Here's how we approach it.
| Use Case | Metrics | Why they were chosen |
|---|---|---|
| Invoice-to-contract reconciliation | Match rate, error cost per document, processing time vs. manual baseline, human review rate | These directly map to the three value variables: Human Baseline, Probability of Success, and AI Process Cost. Match rate alone is insufficient without error cost and volume context |
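As an illustration of how those metrics might be computed from a reviewed batch, here is a sketch; the field names, baseline, and cost figures are assumptions, not client data.

```python
docs = [
    {"matched": True,  "needed_review": False, "minutes_spent": 1.0},
    {"matched": False, "needed_review": True,  "minutes_spent": 6.0},
    {"matched": True,  "needed_review": True,  "minutes_spent": 2.5},
]
manual_baseline_min = 8.0  # assumed average manual processing time per document
cost_per_error = 15.0      # assumed downstream cost of one mismatch ($)

n = len(docs)
match_rate = sum(d["matched"] for d in docs) / n
review_rate = sum(d["needed_review"] for d in docs) / n
avg_minutes = sum(d["minutes_spent"] for d in docs) / n
expected_error_cost = (1 - match_rate) * cost_per_error

print(f"match rate {match_rate:.0%}, human review rate {review_rate:.0%}")
print(f"{avg_minutes:.1f} min/doc vs {manual_baseline_min} min manual baseline")
print(f"expected error cost ${expected_error_cost:.2f}/doc")
```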
Closing Thoughts
Successful pilots fail to become production systems because value wasn't designed into the system early enough.
The organizations that treat pilots as rehearsals for production, using them to understand what the real environment will demand, are the ones that will see strong ROI.