AI agents 2026: three autonomy levels and why most SMEs stop too early at level one
Anyone running AI agents in production quickly runs into the autonomy question. Three levels help allocate risk and benefit cleanly. Few SMEs scale past level one, yet level one is exactly where the largest lever sits, and most give up on it too early.
Why the autonomy question is the most important in the AI stack in 2026
Anyone running an AI agent in production eventually faces a decision that does not show up in any architecture diagram: how much is the agent allowed to decide on its own, without a human looking? The answer to that question separates pilot projects that survive scaling from those that are quietly shelved after the first misstep.
In practice, a three-level logic has taken hold that works regardless of the specific framework choice. It comes from the enterprise world, but applies equally to small and mid-sized businesses, because it addresses the actual point of friction: the transfer of responsibility between human and machine.
Level one: suggest, human decides
The agent prepares decisions. A concrete example: a support agent drafts a response to an incoming ticket, the case worker reviews it and sends it out. The agent can also propose several variants, sorted by confidence score, with the reasoning for why it prefers one over another.
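A minimal Python sketch of what such a ranked suggestion list could look like. The class `DraftVariant`, its fields and the sample scores are hypothetical and only illustrate the idea; the agent ranks, the case worker still decides and sends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftVariant:
    """One proposed reply (hypothetical structure, not a real API)."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0
    reasoning: str     # why the agent prefers this variant


def rank_variants(variants: list[DraftVariant]) -> list[DraftVariant]:
    """Sort proposals by confidence, highest first. Nothing is sent here."""
    return sorted(variants, key=lambda v: v.confidence, reverse=True)


# Made-up sample data for illustration only.
variants = [
    DraftVariant("Offer a refund.", 0.62, "Customer mentions a defect."),
    DraftVariant("Ask for the order number.", 0.81, "Ticket lacks an order reference."),
]
ranked = rank_variants(variants)
for v in ranked:
    print(f"{v.confidence:.2f}  {v.text}  ({v.reasoning})")
```

Storing the score and the reasoning next to each draft is also what later makes evaluation sets and audit logs possible, because the human decision can be compared against the agent's ranking.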
At this level, the human remains in the decision loop. That is slower than full automation, but significantly more robust from a regulatory and operational perspective. Teams in industries with personal data, financial obligations or reputation-critical communication should set level one as the default.
In our project experience, this is exactly where the largest lever sits, and it is one most SMEs miss. Level one is often treated as a stopgap, a compromise between ambition and reality. In truth, level one is the point at which the entire operational excellence of an AI workflow is decided. Teams that set this up well create the foundation for level two. Teams that cut corners here will fail at level three.
The reason: every operational discipline that works at level one is also the foundation for the next level: confidence scoring, prompt versioning, audit logs, evaluation sets, approval workflows. All of these investments pay off a second time once the agent is actually allowed to act more autonomously. Teams that skip them at level one, because "a human is looking after all", simply push the technical debt into the heads of the case workers and do not collect the data they would need for the next level.
Level two: decide within defined boundaries
The agent decides on its own, but only within clearly drawn boundaries. A classic example: an agent categorises incoming invoices, assigns them to the correct GL account and proposes the posting. Within defined chart-of-accounts rules and within defined amount thresholds, the agent acts autonomously. As soon as an invoice crosses a threshold, the agent automatically escalates to a human.
Level two only works when the boundaries are genuinely clear. "Invoices under 5,000 Euro are posted automatically, above that they go to approval" is a clear rule. "Invoices that look uncritical" is not. Defining the boundaries is the central design task, not implementing the agent itself.
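The clear rule above can be written down in a few lines, which is a useful test of whether a boundary really is clear. This is a sketch under assumptions: the names `Route` and `route_invoice` are made up, and two edge cases the sentence leaves open (an invoice of exactly 5,000 Euro, and zero or negative amounts such as credit notes) are routed to approval here.

```python
from decimal import Decimal
from enum import Enum


class Route(Enum):
    AUTO_POST = "auto_post"
    APPROVAL = "approval"


# Threshold taken from the example rule in the text.
AUTO_POST_LIMIT_EUR = Decimal("5000")


def route_invoice(amount_eur: Decimal) -> Route:
    """Decide whether an invoice may be posted without a human."""
    # Assumption: non-positive amounts are unusual and go to a human.
    if amount_eur <= 0:
        return Route.APPROVAL
    # Assumption: "under 5,000" is strict, so exactly 5,000 is escalated.
    if amount_eur < AUTO_POST_LIMIT_EUR:
        return Route.AUTO_POST
    return Route.APPROVAL


print(route_invoice(Decimal("1200.00")))
print(route_invoice(Decimal("5000.00")))
```

A rule like "invoices that look uncritical" cannot be expressed this way without inventing criteria, which is exactly why it does not work as a level-two boundary.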
In our work we see three typical mistakes during the transition from level one to level two.
Mistake one: boundaries are drawn too tightly. Teams that route every decision through approval have effectively kept level one and just added a second approval tool. That slows processing without creating autonomy. The consequence: the team does not accept the new complexity and falls back to level zero, that is, manual processing without an agent.
Mistake two: boundaries are drawn too widely. Teams that let the agent operate within "logical" amounts, without knowing the distribution of real cases, miss the long tail. In practice, 95 percent of all invoices are under 2,000 Euro and 0.5 percent are above 50,000 Euro. Optimising the agent for "the majority" leaves the rare, high-value cases exposed precisely where no human is looking any more.
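One way to check a candidate boundary against real data before switching it on is to compare the share of cases it covers with the share of total value it covers. The sketch below uses a made-up sample of 100 invoice amounts; the function name `coverage` is hypothetical.

```python
def coverage(amounts: list[float], limit: float) -> tuple[float, float]:
    """Return (share of cases, share of total value) strictly below limit."""
    total = sum(amounts)
    if not amounts or total <= 0:
        raise ValueError("need at least one invoice with a positive total")
    below = [a for a in amounts if a < limit]
    return len(below) / len(amounts), sum(below) / total


# Made-up sample: many small invoices, a few large ones.
amounts = [500.0] * 95 + [10_000.0] * 4 + [60_000.0]
case_share, value_share = coverage(amounts, limit=2_000.0)
print(f"cases below limit: {case_share:.0%}, value below limit: {value_share:.0%}")
```

In this invented sample the limit covers 95 percent of the cases but only about a third of the invoiced value, which illustrates why a boundary chosen by case count alone says little about where the money sits.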
Mistake three: escalation is not specified. What happens when a boundary is crossed? Who is notified? Within what deadline? What happens if no one reacts? Without defined escalation paths, level two effectively becomes level three, with the risk that no one notices. centerbit consistently uses Placet as the escalation inbox: every level-two boundary crossing produces an approval request with a deadline. Teams that do not react within the deadline get a second wave of reminders. If the response still does not come, the action is written into a secure audit log and the agent pauses.
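The escalation path described above (deadline, reminder, then audit log and pause) can be sketched as a small decision function. This is not Placet's API; all names (`ApprovalRequest`, `next_step`, `reminder_grace`) are hypothetical, and the sketch assumes a single reminder followed by a grace period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Step(Enum):
    WAIT = "wait"
    REMIND = "remind"
    LOG_AND_PAUSE = "log_and_pause"
    DONE = "done"


@dataclass
class ApprovalRequest:
    created_at: datetime
    deadline: timedelta          # time allowed for the first response
    reminder_grace: timedelta    # time allowed after the reminder
    answered: bool = False
    reminded_at: Optional[datetime] = None


def next_step(req: ApprovalRequest, now: datetime) -> Step:
    """Answer the questions: who has to act next, and by when?"""
    if req.answered:
        return Step.DONE
    if req.reminded_at is None:
        # First deadline not yet passed: keep waiting, otherwise remind.
        return Step.REMIND if now >= req.created_at + req.deadline else Step.WAIT
    # Reminder already sent: pause the agent once the grace period is over.
    if now >= req.reminded_at + req.reminder_grace:
        return Step.LOG_AND_PAUSE
    return Step.WAIT


t0 = datetime(2026, 1, 5, 9, 0)
req = ApprovalRequest(t0, timedelta(hours=24), timedelta(hours=24))
print(next_step(req, t0 + timedelta(hours=25)))
```

The important property is that every branch ends in a defined state. There is no path on which an unanswered request silently turns into an executed action.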
Level three: fully autonomous, with governance
The agent runs an end-to-end task without human intervention, within a defined strategy and with clear fallback scenarios. An example: a server-infrastructure monitoring agent detects anomalies, prioritises them, opens tickets, notifies the on-call rotation and escalates on SLA violation. As long as the agent acts within its strategy, the process runs without human involvement.
Level three is what the marketing materials of the big platform vendors mean when they talk about "autonomous agents". In practice, level three only makes sense when two conditions are met at the same time.
Condition one: the domain is high-frequency and low-exception. An agent that processes 10,000 invoices per month and decides correctly in 99.5 percent of cases has a clear lever. An agent that prepares three strategic decisions per quarter does not belong at level three, because the case volume does not justify the governance investment.
Condition two: the consequences of an error are reversible or limited. An agent that sends a wrong notification is annoying. An agent that triggers a wrong bank transfer is an incident. Level three requires either that the maximum damage per action is clearly bounded, or that every action is written into an audit log that allows a quick reversal.
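The two conditions can be combined into a rough back-of-the-envelope calculation: volume and accuracy give the expected number of wrong decisions, and a bounded damage per action gives a ceiling on what those errors can cost. The volume and accuracy figures come from the example above; the 200 Euro damage cap is a purely hypothetical value.

```python
def expected_wrong_decisions(cases_per_month: int, accuracy: float) -> float:
    """Expected number of wrong decisions per month."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return cases_per_month * (1.0 - accuracy)


def expected_damage_ceiling_eur(
    cases_per_month: int, accuracy: float, max_damage_per_action_eur: float
) -> float:
    """Upper bound on expected monthly damage if every error hits the cap."""
    return expected_wrong_decisions(cases_per_month, accuracy) * max_damage_per_action_eur


errors = expected_wrong_decisions(10_000, 0.995)
ceiling = expected_damage_ceiling_eur(10_000, 0.995, max_damage_per_action_eur=200.0)
print(f"about {errors:.0f} wrong decisions per month, at most about {ceiling:,.0f} EUR")
```

Even at 99.5 percent accuracy, 10,000 cases still produce roughly 50 wrong decisions per month, so the calculation only looks acceptable if the damage per action is genuinely capped or the action can be reversed.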
In practice, most productive level-three setups combine both: the actions are small and reversible, and the audit log allows immediate reversal in case of doubt. centerbit customers running level three in production typically have a "big red button" that immediately pauses the agent and puts all open actions into a hold status.
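What a "big red button" means in code can be sketched as follows. This is an illustration under assumptions, not a description of any centerbit product: the classes `Agent` and `Action` and the method `emergency_stop` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ActionStatus(Enum):
    OPEN = "open"
    ON_HOLD = "on_hold"
    DONE = "done"


@dataclass
class Action:
    name: str
    status: ActionStatus = ActionStatus.OPEN


@dataclass
class Agent:
    paused: bool = False
    actions: list[Action] = field(default_factory=list)

    def execute_next(self) -> Optional[Action]:
        """Run the next open action, unless the agent is paused."""
        if self.paused:
            return None
        for action in self.actions:
            if action.status is ActionStatus.OPEN:
                action.status = ActionStatus.DONE
                return action
        return None

    def emergency_stop(self) -> int:
        """Pause the agent and put all open actions on hold."""
        self.paused = True
        held = 0
        for action in self.actions:
            if action.status is ActionStatus.OPEN:
                action.status = ActionStatus.ON_HOLD
                held += 1
        return held


agent = Agent(actions=[Action("open ticket"), Action("notify on-call")])
agent.execute_next()
print(agent.emergency_stop(), "action(s) put on hold")
```

The design point is that the stop does not delete anything. Open actions move to a hold status, so a human can review and release or cancel them afterwards.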
Why most SMEs should stay at level one for longer (and still give up too early)
The temptation is to jump straight to level two or three, because the ROI looks especially attractive in marketing slides. In practice, the ROI at level one is consistently the most reliable, because level one is where teams rehearse the operational excellence that levels two and three require anyway.
One question that comes up regularly in on-site workshops: what percentage of tasks can we automate? The honest answer varies, but the pattern is consistent. Teams that start cleanly at level one typically reach, within six to twelve weeks, a rate of 50 to 70 percent of tasks completed entirely by the agent. The remaining 30 to 50 percent require human intervention, but with substantially reduced effort, because the agent delivers the preparatory work in a structured way.
Teams that aim directly at level three fail more often. The complexity of boundary definition, the requirements for reversibility and the effort for audit logs are already demanding at level two. At level three, continuous performance monitoring, drift detection and incident response are added. Teams that have not learned these disciplines at level one are fighting fires at level three.
Practical recommendation: maintain a maturity report per workflow
We recommend that SMEs maintain a maturity report per workflow that makes the current level and the prerequisites for the next level transparent. Four dimensions are typically sufficient.
Operational discipline. Are audit logs, evaluation sets and approval workflows established? Teams that show red lights here should not move to the next level.
Domain clarity. Are the business rules documented well enough that a human can explain them to a colleague in 30 minutes? If not, an agent cannot execute them autonomously.
Volume and frequency. How many cases per month? How many exceptions? Teams that handle fewer than 100 cases per month should not aim at level three, because the effort for governance is out of proportion to the lever.
Reversibility and damage limitation. What happens if the agent decides wrongly? Is the action reversible? Is the maximum damage per action clearly bounded? Teams that cannot answer these questions belong operationally at level one.
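The four dimensions can be kept as a small structured record per workflow. The sketch below is one possible reading of the rules above, not a fixed scoring model: the field names, the reduction of each dimension to a yes/no answer and the function `next_level` are assumptions. Meeting the criteria is treated as necessary for the next level, not as sufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaturityReport:
    workflow: str
    current_level: int            # 1, 2 or 3
    operational_discipline: bool  # audit logs, evaluation sets, approvals in place
    domain_clarity: bool          # rules explainable to a colleague in 30 minutes
    cases_per_month: int
    reversible_or_bounded: bool   # reversible action or bounded damage


def next_level(report: MaturityReport) -> int:
    """Highest level the workflow should move to next (at most one step up)."""
    ceiling = 3
    if report.cases_per_month < 100:
        ceiling = 2  # volume too low to justify level-three governance
    if not report.domain_clarity or not report.reversible_or_bounded:
        ceiling = 1  # rules or damage limits unclear: stay with human decisions
    if not report.operational_discipline:
        ceiling = min(ceiling, report.current_level)  # red light: no step up
    return min(report.current_level + 1, ceiling)


report = MaturityReport("invoice posting", 1, True, True, 500, True)
print(report.workflow, "-> next level:", next_level(report))
```

Limiting the result to one step above the current level reflects the argument that each level has to be earned before the next one is attempted.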
What centerbit brings to the autonomy levels
Our architecture is built so that the choice of level is not a one-way street. Facio agents can be operated with the same setup at level one, two or three, depending on the maturity of the respective workflow. Placet handles approval logic and audit trails, so each level has its own compliance trail. Runtime data flows are fully traceable through the audit logs, which is the regulatory prerequisite for level three.
In practice, the teams that start at level one and scale into level two with discipline realise the largest lever in the first twelve months. The temptation to jump ahead to level three is humanly understandable, but operationally risky. The architecture allows the jump. The operational excellence has to be earned.
centerbit