Why 95% of AI Pilots Fail to Reach Production

The real gap isn't models or data, but the operational knowledge nobody writes down.

March 5, 2026

Emily Lu, Founder & CEO

The MIT study that found 95% of enterprise AI pilots fail to reach production got a lot of attention when it came out. Most of the commentary focused on the usual explanations: bad data, insufficient talent, lack of executive support.

But when you look at the projects that actually made it to production, a different pattern emerges. The 5% that succeeded didn't have better models or bigger budgets. They were methodical about something most teams rush past: defining what the AI actually needs to know.

What Defines AI Pilot Failures and Successes

The AI failures that make headlines tend to be obvious in hindsight: Klarna replaced 700 customer service agents with AI, watched quality deteriorate, and had to hire humans back. Hertz deployed an AI damage scanner that billed customers for damage with no way to dispute the charges. These are cautionary tales, but they're also edge cases: companies that moved too fast on tasks AI wasn't suited for, or removed human oversight entirely. They're not representative of how most AI projects actually fail.

The more common failure is quieter and feels productive: the AI pilot that performs well enough to stay funded but never well enough to go live. It produces impressive demos, gets enthusiastic executive buy-in, stretches over six months of engineering, and never quite makes it to production. It consumes time, budget, and credibility without producing a clear result in either direction.

The default explanations for why these pilots stall tend to fall into two camps:

  1. The AI isn't smart enough yet; wait for the next model.
  2. We just need to give it more data, or better data.

The first answer buys time: there's always a new model release around the corner. The second launches a data cleanup initiative that takes months and may not change the outcome.

Both seem plausible, but neither addresses the fundamental issue: the agent has the data but doesn't know what to do with it. It has access to the documentation, the systems, the records, but no understanding of which information matters for which task, under what conditions, and what "correct" looks like.

The Fundamental Issue: AI Agents Are Like a Brilliant New Hire on Their First Day, Every Single Time

Most people think of AI as something that gets better over time, like an employee who learns the role. In reality, an agent is more like a contractor who shows up each morning with amnesia: extraordinarily talented, but with no institutional memory, no sense of what matters most, and no ability to tell the difference between a routine case and one that's about to go wrong.

Every time it runs, it starts from zero and has to reconstruct its understanding from whatever you hand it. In other words, it's entirely dependent on the briefing you give it, and most organizations aren't giving a very good briefing.

Think about what it would look like to hand a complex operational workflow to someone brilliant but completely new, with no context about your clients, your systems, or the unofficial rules your team has built up over years. That's roughly what an AI agent is working with, and it creates a set of predictable problems:

Failure Mode #1: Agents don't know what they don't know.

A new hire would eventually learn where the landmines are. The agent never does, unless you map them explicitly.

When someone does a job for months, they build a mental map of where things get tricky. They learn that a particular client uses terminology differently from everyone else, or that data from one system is reliable for some purposes but not others, or that a specific type of request requires checking with a second team before proceeding.

An AI agent doesn't build that map. If you don't explicitly tell it that the same term means different things in different contexts, it will pick one interpretation and run with it. And critically, it won't flag that it's uncertain. It produces a wrong answer with exactly the same confidence as a right one. There's no "this feels off, let me double-check" instinct.

Failure Mode #2: Agents treat all knowledge as equally important.

An experienced employee knows which mistakes will get them fired and which ones get shrugged off. The agent has to be told every time.

A person who's done a workflow dozens of times knows instinctively which parts are high-stakes and which are cosmetic. They know which fields in a report a client will actually scrutinize, which numbers will trigger follow-up questions if they're off, and which sections nobody reads closely.

An AI agent has no hierarchy of importance unless you build one. It gives the same attention to every element of the task. So it can produce output that's well-formatted, structurally complete, and superficially polished, but wrong in the one field that actually matters. The output looks right, which makes the error harder to catch and more dangerous when it slips through.

Failure Mode #3: Small errors compound across steps.

A new hire who makes a mistake in step two will often catch it when something doesn't add up in step four. The agent doesn't look back.

Most real workflows aren't one task, but a sequence of tasks where each step depends on the one before it. If an agent is 90% accurate at each individual step, that sounds good. But across a five-step process, end-to-end accuracy drops to 59%. Across ten steps, it's 35%. And the errors aren't isolated: step three operates on step two's wrong output, and the mistake compounds.

This math is what makes agents unreliable on multi-step operational workflows even when they perform well on any single task in isolation. The demo shows one step. Production requires all of them in sequence.
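The compounding arithmetic above is easy to check directly. A minimal sketch (the 90% per-step accuracy is the article's illustrative figure, not a measured value):

```python
# End-to-end accuracy of a multi-step workflow, assuming each step
# succeeds independently with the same per-step accuracy.
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

for n in (1, 5, 10):
    print(f"{n:2d} steps: {end_to_end_accuracy(0.9, n):.0%}")
# ->  1 steps: 90%
#     5 steps: 59%
#    10 steps: 35%
```

The independence assumption is generous: in practice, a wrong output at step two often makes step three harder, not just equally likely to fail, so the real curve can fall faster than this.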

Failure Mode #4: Agents can't tell when the rules have changed.

A new hire asks "has anything changed since my onboarding?" The agent never asks.

Business processes aren't static. Clients update their requirements. Internal policies shift. Systems get upgraded and field names change. A new product line gets added and the existing workflow needs to handle a case it never handled before.

People absorb these changes organically, through team meetings, Slack messages, a quick heads-up from a colleague. They update their mental model without consciously thinking about it. An AI agent is frozen in whatever understanding it was given at setup. When the rules change and nobody updates the agent's context, it keeps applying the old logic with full confidence. The output looks the same as it always did: same format, same structure. But the numbers or decisions are now wrong. And because it looks normal, nobody catches it until the error has downstream consequences.

What connects all four patterns is that they're not failures of intelligence, but of preparation. The agent can reason, synthesize, and produce output, but what it can't do is reconstruct the operational knowledge that your team has built up over years: the priorities, the exceptions, the judgment calls that live in people's heads and nowhere else. That knowledge has to be extracted and structured before the agent ever touches the workflow.

Building the Agent's Briefing

Your team isn't wrong that data matters. But access and usability aren't the same thing. The agent can reach your data, but that doesn't mean the data is organized, filtered, or structured in a way that makes it actionable for a specific task. This difference shows up when we look at what agents actually need, layer by layer. For an AI agent to reliably perform a workflow, three layers of knowledge need to be in place:

Layer 1: Access

Can the agent reach the data and systems it needs? This is where most teams start, and most teams solve it. Connect Salesforce, Asana, Google Drive, grant permissions, index the documents. It's necessary but nowhere near sufficient. (This is what companies like Glean, Cohere, and Microsoft Copilot do.)

Layer 2: Context

For a given task, does the agent know what to do with what it has access to? This is the layer that makes or breaks the pilot, and it's the one that almost always gets skipped. Context isn't just relevant data; it's also the implicit operational knowledge and decision logic that your team carries around:

Which sources to trust & when

Not all data is equally reliable. Your team knows that for certain types of work, the numbers in one system are the source of truth, while other tasks require pulling from somewhere else.

What the exceptions are

Every workflow has a standard path and a dozen variations. Your team knows that when a certain condition shows up, the normal rules don't apply. Those exceptions are rarely documented.

What matters most

Not every element of the output carries equal weight. Your team knows which fields will cause real problems if they're wrong, which numbers will trigger follow-up questions, and which ones nobody looks at closely.

What 'right' looks like

Your team can look at a finished output and know whether it's correct: not just structurally complete, but substantively accurate. They're checking it against a mental model built from experience.

All of this has to be extracted from the people who do the work and structured so the agent receives the right knowledge for the right task at the right time. That extraction work is unglamorous and time-consuming, which is exactly why it gets skipped.
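What a structured context layer might look like once that extraction is done: a sketch only, where every field name, system, and rule is a hypothetical example rather than a prescribed schema.

```python
# A hypothetical, hand-authored context package for one task.
# All names and rules below are illustrative examples.
TASK_CONTEXT = {
    "task": "quarterly_client_report",
    "source_of_truth": {
        "positions": "portfolio_system",  # trust this system for holdings
        "fees": "billing_system",         # never the CRM's copy of fees
    },
    "exceptions": [
        "Client ACME uses fiscal quarters, not calendar quarters",
        "Accounts opened mid-quarter: prorate performance figures",
    ],
    # Which output fields carry real risk if wrong, per Failure Mode #2.
    "high_stakes_fields": ["net_return", "fee_total"],
    "low_stakes_fields": ["benchmark_footnote"],
}
```

The point isn't this particular shape; it's that the trust rules, exceptions, and priorities become explicit artifacts the agent receives per task, instead of living in people's heads.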

Layer 3: Evaluation

Can you verify that the agent's output is correct, automatically, at scale? Not "does the output look reasonable" but "is this actually right, checked against what your best person would have produced?" Without this layer, you have no way to know whether the agent is ready for production. You're relying on gut feel, or worse, you're finding out from clients.
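One way to make this layer concrete is to check an agent's structured output field by field against a reference produced by your best person, weighting fields by stakes. A minimal sketch; the field names, values, and critical set are assumptions for illustration:

```python
# Compare an agent's structured output against a ground-truth reference,
# distinguishing cosmetic mismatches from failures on critical fields.
def evaluate(agent_output: dict, reference: dict, critical_fields: set) -> dict:
    errors = {
        field: (agent_output.get(field), expected)
        for field, expected in reference.items()
        if agent_output.get(field) != expected
    }
    return {
        "errors": errors,
        "critical_failure": any(f in critical_fields for f in errors),
    }

report = evaluate(
    agent_output={"net_return": 0.081, "fee_total": 1200, "footnote": "v2"},
    reference={"net_return": 0.082, "fee_total": 1200, "footnote": "v1"},
    critical_fields={"net_return", "fee_total"},
)
# Flags net_return (a critical failure) and footnote (cosmetic);
# fee_total matches and is not reported.
```

Run over a ground-truth set of historical outputs, a check like this turns "does it look reasonable" into a measurable pass rate.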

How to Assess a Workflow's AI-Readiness

The workflows most likely to succeed with AI automation are the ones where these three layers are closest to being in place. Here are a few questions that can help you diagnose that quickly:

Question 1: Context Explicitness
If your most knowledgeable person quit tomorrow, how much of what they know could be reconstructed from existing documentation? Is it consolidated in one place, or scattered across process docs, email threads, Slack messages, and the memories of whoever's been around longest?
Mostly documented
The rules, decision logic, and exceptions are already captured in process guides or system configurations. A new hire could get up to speed from the documentation alone without much shadowing. This workflow is close to automation-ready, and the preparation work is mostly done.
Partially documented
A documented process exists, but only covers the standard path and mostly relies on people's institutional memory and judgment. It doesn't include which clients need special handling, which data sources to trust in which situations, or what to do when two systems disagree. This is the most common profile we see, and it's exactly where automation projects stall.
Mostly tribal knowledge
The workflow runs on institutional memory, with different team members handling the same situation differently and no single "source of truth." When someone leaves, critical knowledge goes with them. An agent dropped into this environment will produce output that looks plausible but reflects none of the judgment that actually makes the workflow function. There's significant documentation work to do before starting to build out the AI automation.
Question 2: Output Verifiability
Is there a correct answer for this workflow: something you could check the agent's output against objectively? Or does evaluating the output require someone senior to look at it and make a judgment call?
Verifiable against clear criteria
A junior person with a simple rubric or checklist could flag most errors. The final output has fields, values, or decisions that can be checked against a defined source of truth: number matching, category-specific rules, formatting checks, etc. This means the agent's accuracy can be measured automatically.
Verifiable by comparison
There's no simple checklist, but correct outputs produced by experienced people exist as reference points. An agent's output can be placed side by side with what the team would have produced and the divergences identified. This is enough to build evaluations, but it requires upfront investment: assembling a strong "ground-truth" dataset from real historical work.
Only verifiable by expert judgment
The only way to know if the output is right is to have a senior person review it and make a call. No checklist, no reference output, no clear criteria, just experienced pattern recognition. Without a way to evaluate the agent's output at scale, there's no path from pilot to production. This is how projects end up in an indefinite loop of "it's getting better, we just need more time."
Question 3: Task Decomposability
Does this workflow have natural breakpoints, places where one task ends and another begins with a clear handoff? Or does it flow as one continuous process from start to finish?
Fully decomposable
The workflow is a sequence of discrete steps with clear handoffs between them. Each step has a defined input and a defined output, and getting step one right doesn't require knowing what happens in step four. Individual steps can be automated, tested, and improved independently. This is the easiest profile to automate incrementally: start with the most straightforward step, validate it, then expand.
Partially decomposable
Some steps are self-contained, but others depend on context from earlier in the workflow or require judgment about how to proceed based on the full picture. Certain steps can be isolated and automated, but the workflow as a whole still needs a person guiding it through the more complex handoffs. The strategy here is to identify which steps are separable and automate those first, while keeping a person on the steps that require cross-step judgment.
Monolithic
The workflow is one continuous process where every decision depends on everything that came before it. There's no natural breakpoint where an intermediate output can be checked or handed off. Automating pieces in isolation doesn't work because the pieces don't exist independently. This is the hardest profile to automate, and where the compounding error problem is most severe, because there's no place to catch and correct mistakes mid-workflow.
Question 4: Error Consequence
When an error does slip through, what happens? Does it get caught internally before anyone outside the team sees it? Does it reach a client or partner? Does it trigger a financial, legal, or regulatory consequence? The answer determines how much room there is to iterate, and how much verification infrastructure needs to be in place before the agent goes live.
Contained internally
The output goes through an internal review step before reaching anyone external (a draft report, a pre-processed dataset, a recommendation that a person approves before acting on). Errors are cheap to fix and low-risk to the business. This is the best environment for early automation: there's room to experiment, learn from failures, and improve the agent's performance without real consequences.
Externally visible but correctable
The output reaches a client, partner, or another team, and errors cause real friction: a wrong number in a report, a misclassified request, a delayed deliverable. Recoverable, but damaging to credibility over time. A human review gate needs to be designed into the workflow from the start, not added after the first complaint.
High consequence
Errors trigger financial exposure, regulatory risk, or legal liability. A wrong trade, an incorrect compliance filing, a miscalculated bill sent to a client. There's very little tolerance for iteration in production. The evaluation layer needs to be airtight before the agent touches anything live, and even then, human oversight should remain a permanent part of the workflow, not a temporary safety net.
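The four questions above can be folded into a rough self-assessment. The scoring and thresholds here are illustrative assumptions, not a calibrated instrument:

```python
# Score each diagnostic question 0-2 (0 = hardest profile for automation,
# 2 = most automation-ready) and map the total to a readiness verdict.
QUESTIONS = ["context_explicitness", "output_verifiability",
             "task_decomposability", "error_consequence"]

def readiness(scores: dict) -> str:
    total = sum(scores[q] for q in QUESTIONS)
    if total >= 7:
        return "ready for automation"
    if total >= 4:
        return "needs preparation"
    return "not ready"

# Example: partially documented (1), verifiable by comparison (1),
# partially decomposable (1), externally visible errors (1) -> total 4.
profile = {q: 1 for q in QUESTIONS}
print(readiness(profile))  # -> needs preparation
```

The verdict matters less than the per-question scores: a single 0 (say, output only verifiable by expert judgment) usually identifies the specific preparation work to do first.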

What This Looks Like in Practice

To make this concrete, here's how three real workflows score against these questions, and what that means for whether they're ready for AI.

Invoice Data Entry
A finance team processes hundreds of invoices monthly, pulling the same fields into the same system. The rules are clear, the output is either right or wrong, and errors get caught in reconciliation. This doesn't need AI: standard automation tools can handle it.
Ready for automation
Client Reporting in Financial Services
A wealth management team produces quarterly performance reports, pulling from multiple systems. Which system to trust depends on the account type. Certain clients have custom formats. The team knows which numbers to double-check and which discrepancies to ignore. Most of this lives in people's heads.
Needs preparation
Complex Underwriting Decisions
A senior underwriter evaluates commercial insurance applications by weighing financial statements, loss history, market conditions, and broker relationships, then makes a pricing call. Almost entirely judgment. Only another experienced underwriter can evaluate whether the decision was right.
Not ready

Conclusion: The Real Work of AI Automation

What makes AI automation hard isn't the AI, but the knowledge. Every operational workflow runs on a layer of understanding that's so embedded in how people work that they barely notice it: which sources to trust, which details matter, when the standard process doesn't apply. People absorb this over months and years, but an agent starts cold every time.

For automation to work, that tacit knowledge has to be surfaced, documented, and structured in a way the agent can actually use. Not dumped into a database, but organized precisely for each task and each sub-task: the right context, in the right form, at the right moment. That's the gap most AI projects never close, and it's what the 5% that succeed invest in before they build anything.

This is part of Enmesh's ongoing writing on enterprise AI infrastructure. Read about our approach or explore decision architecture.