
When to Trust an Agent and When to Step In

The four-level autonomy ladder, the five signals an agent is going off the rails, and a real example of catching one before it shipped a quietly broken auth flow.

The hardest part of working with agentic AI in 2026 isn't getting the agent to do the work. It's knowing when to override it.

An agent that produces useful code 90% of the time and confidently broken code 10% of the time is not a 90% solution. It's a system you have to pay attention to all the time, because the 10% of broken output looks indistinguishable from the 90% of correct output if you skim. Trust calibration is the actual engineering skill of working with agentic AI, and most teams don't yet have a framework for it.

What follows is the framework I use, in two parts. First: a four-level autonomy ladder for deciding how much trust to extend to an agent on any given task. Second: the five signals that an agent is currently going off the rails, even when its output looks fine.

The four-level autonomy ladder

Not all tasks are equal. The same agent can be entirely trustworthy on one task and dangerous on another. The level of supervision should be set by the consequences of the agent being wrong.

Level 1 — Read-only, always trust

Tasks where the agent observes but doesn't change anything. Code analysis, documentation generation from existing code, summarization, search. The worst-case outcome of the agent being wrong is that I get bad information that I have to discard.

I let agents work autonomously at this level all day. The downside is bounded.

Level 2 — Bounded write, mostly trust, verify

Tasks where the agent writes code in a clearly scoped area. Adding tests for an existing function. Implementing a small utility from a clear spec. Refactoring a single file. The blast radius is small, the work is reviewable, and the agent has a high probability of getting it right.

I review the diff before merging, but I don't read every line carefully. I'm looking for obvious smells — duplicate logic, weird naming, missed edge cases. If the diff looks clean and the tests pass, I merge. The downside if I miss something is one bad commit that's easy to revert.

Level 3 — Real-money, auth, or state-changing, verify line by line

Tasks that touch payments, authorization, user data, or system state in ways that matter. The agent can draft these, but every line has to be reviewed by a human before it lands.

This is the level where most teams lose discipline first. The agent produces a plausible-looking auth migration, the diff isn't huge, the tests pass — but the migration silently introduces a privilege escalation. I've seen this happen. I've nearly let it happen, which I'll talk about in a moment.

The discipline at this level: read every line, run the change against your own threat model, ask the agent why it made each non-obvious choice. Treat the agent's output as a junior engineer's work that needs senior review before merge.

Level 4 — Public-facing or irreversible, do not delegate

Tasks where the cost of getting it wrong is unrecoverable. Schema migrations on production data without rollback. Sending email to your customer base. Posting to a social media account. Public-facing legal text. Press statements.

Agents do not produce these autonomously. They can draft them — which can be useful — but the work of actually committing to the output is human-only. The asymmetry is too sharp; even a 99% reliability rate produces an unacceptable error rate over many actions.
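
If it helps to make the ladder concrete, here's a minimal sketch of how it could be encoded as a policy, in Python. The task tags, names, and mapping are all illustrative assumptions, not a prescribed taxonomy; the one design choice worth copying is defaulting unclassified work to the strictest level.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    READ_ONLY = 1          # observe only: analysis, summarization, search
    BOUNDED_WRITE = 2      # scoped changes: tests, small utilities, one file
    VERIFY_EVERY_LINE = 3  # payments, auth, user data, system state
    DO_NOT_DELEGATE = 4    # irreversible or public-facing work

# Illustrative mapping from task tags to levels; your team's taxonomy
# will differ. Unknown tags deliberately fall through to the strictest level.
TASK_POLICY = {
    "code-analysis": AutonomyLevel.READ_ONLY,
    "add-tests": AutonomyLevel.BOUNDED_WRITE,
    "auth-change": AutonomyLevel.VERIFY_EVERY_LINE,
    "prod-schema-migration": AutonomyLevel.DO_NOT_DELEGATE,
}

REVIEW_REQUIREMENT = {
    AutonomyLevel.READ_ONLY: "no review; discard bad output",
    AutonomyLevel.BOUNDED_WRITE: "skim the diff, run the tests",
    AutonomyLevel.VERIFY_EVERY_LINE: "read every line before merge",
    AutonomyLevel.DO_NOT_DELEGATE: "agent may draft; a human commits",
}

def review_requirement(task_tag: str) -> str:
    """Return the review requirement, strictest-by-default for unknown tasks."""
    level = TASK_POLICY.get(task_tag, AutonomyLevel.DO_NOT_DELEGATE)
    return REVIEW_REQUIREMENT[level]
```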

Five signals an agent is going off the rails

Even within the right autonomy level, individual sessions can drift. The signals that something is off, in approximate order of how often I see them:

1. Confident answers without specifics

"This will work because of how the framework handles state." Without naming the function, the file, or the documented behavior. The agent is filling in plausible reasoning rather than checking. Push back: "Show me the line in the codebase that demonstrates this."

If the agent can't, the answer is suspect. About half the time the agent then says "actually, on closer inspection..." and revises. The other half it doubles down on a wrong claim, which tells you the entire reasoning chain is hallucinated.

2. Multiple files changed for "one fix"

You ask for a fix to a single bug. The diff comes back touching seven files. Sometimes this is correct — the bug genuinely was scattered. More often, the agent has decided the codebase needs "consistency" or "improvement" while it was in there.

The discipline: ask why each file was changed. If the answer for any file is anything other than "this was necessary for the fix," revert that file's changes. Scope drift in agent diffs accumulates fast.
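
The file-count half of this check is mechanical enough to script. Here's a sketch of a pre-review guard that flags diffs touching more files than a one-bug fix should. It assumes a git checkout with a `main` branch, and the threshold is a judgment call, not a rule.

```python
import subprocess

def changed_files(base: str = "main") -> list[str]:
    """List files changed relative to a base branch (assumes a git repo)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]

def flag_scope_drift(base: str = "main", expected_max: int = 2) -> None:
    """Warn when a 'one fix' diff touches more files than expected."""
    files = changed_files(base)
    if len(files) > expected_max:
        print(f"scope check: {len(files)} files changed for one fix")
        for f in files:
            print(f"  {f}  <- ask the agent why this file was necessary")

if __name__ == "__main__":
    flag_scope_drift()
```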

3. "Cleaning up" code unrelated to the task

A subtype of the above. The agent removes a comment it didn't understand. Renames a variable. Reformats a function it thought was ugly. None of these were asked for. All of them produce noise in the diff.

This is a hard category to police because the changes look harmless one at a time. The cumulative effect over a quarter is a codebase whose history is incomprehensible because every fix touches twenty unrelated lines.

4. Confidence that contradicts evidence

Tests are failing. The agent says "the implementation is correct, the tests must be wrong." This is occasionally true. It is usually wrong. The signal is the agent privileging its own reasoning over the failing test.

The fix: never let the agent dismiss a failing test without proof. "Show me which assertion in the test is incorrect and why" is the right pushback. Most of the time the agent then realizes the implementation is wrong.

5. Speed too high for the complexity

This one is the hardest to articulate but the most reliable in retrospect. A complex problem is solved in twelve seconds with one paragraph of explanation. Be suspicious. Real engineering problems usually have layers; an instant answer often skips them.

The discipline: when the answer comes back faster than seems reasonable, ask the agent to explicitly enumerate three alternatives and argue for the chosen one. The "argue" step surfaces whether the agent has actually thought about the problem or pattern-matched to a familiar shape.

A concrete example: the auth refactor

Earlier this year I asked an agent to refactor a small piece of auth code. The original function had grown to 200 lines and could reasonably be split into four. Routine work, but a level-3 task by my own framework — auth is in the "verify line by line" category.

The agent produced a clean four-function refactor in about 90 seconds. The tests passed. The diff was the right size. On a quick read it looked correct. I almost merged it.

Two of the five signals fired before I clicked merge. Speed too high (signal 5): the refactor was elegantly factored, which is suspicious for a 200-line function with the gnarly history this one had. And confidence without specifics (signal 1): when I asked the agent why it had moved a particular permission check from one branch to another, the answer was "it's cleaner this way" rather than "the original ordering was incorrect because X."

I dug in. The "cleaner" reordering had introduced a small but real privilege escalation: in one specific code path, an authorization check that previously ran before a sensitive operation now ran after. In the test suite, no test exercised that exact path, so the tests passed. In production, the bug would have allowed certain users to perform an action they shouldn't have been able to.
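
The shape of the bug is easiest to see in miniature. The snippet below is not the actual code; it's a hypothetical reconstruction with invented names, showing how a reordering can keep a permission-denied test green while still letting the state change happen.

```python
class User:
    def __init__(self, allowed: bool):
        self.allowed = allowed

    def can_archive(self, project: "Project") -> bool:
        return self.allowed

class Project:
    def __init__(self):
        self.archived = False

    def archive(self) -> None:
        self.archived = True  # the sensitive, state-changing operation

# Before: the authorization check guards the sensitive operation.
def archive_project(user: User, project: Project) -> None:
    if not user.can_archive(project):
        raise PermissionError("not allowed")
    project.archive()

# After the "cleaner" refactor: the operation runs before the check,
# so an unauthorized user still mutates state before the error is raised.
def archive_project_refactored(user: User, project: Project) -> None:
    project.archive()  # now runs unconditionally
    if not user.can_archive(project):
        raise PermissionError("not allowed")  # raised, but too late

# A test that only asserts PermissionError is raised still passes:
project = Project()
try:
    archive_project_refactored(User(allowed=False), project)
except PermissionError:
    pass
assert project.archived  # the unauthorized state change already happened
```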

The discipline that caught it was the framework above. Without it, I'd have merged a clean-looking refactor and shipped a real security bug. Multiply that by every team using AI for code generation, and you can see why the slop problem is real and why trust calibration is the actual engineering work of 2026.

The meta-discipline

The framing that pulls all of this together: agents are tools, not teammates. They don't have stakes. They don't get yelled at when production breaks at 3am. They have no embodied sense of what's risky. They will confidently produce code that's 99% right, and the 1% wrong will sometimes be catastrophic, and they will not know.

The job of the engineer working with agents is to supply that missing sense of stakes. The autonomy ladder is how you decide how much supervision a task needs before the work starts. The five signals are how you stay alert once the work is underway.

None of this is hard. It's just disciplined. The teams that get this right are using agents at the limit of what's possible without producing slop. The teams that don't are accumulating a debt that becomes obvious only after a major incident.

Calibrate trust. Verify the consequential changes. Override when the signals fire. The framework is the discipline.

Team-level adoption of these patterns

Individual discipline isn't enough at team scale. If only one engineer on a four-person team is calibrating agent trust carefully, the other three's slop ends up in the codebase anyway. The patterns above need to be team-level practice.

The lightest-weight version of this that's worked for me: a single shared document — call it AGENTS.md — that lists the team's autonomy levels, the signals to watch for, and the kinds of work that always require human review. The doc is short, written by the team together, and reviewed quarterly.

Pair it with one tactical practice: every PR description includes a single line at the top stating which autonomy level applied to the work. "Level 2 — bounded write, agent-assisted." "Level 3 — auth code, full human review." This makes the trust calibration legible during code review and surfaces drift before it ships.
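
If you want the PR-line convention enforced rather than remembered, it fits in a small CI gate. The sketch below assumes the PR description arrives on stdin (most CI systems can pipe it from the event payload), and the pattern simply encodes the convention above.

```python
import re
import sys

# The convention from the text: a line like "Level 3 — auth code, full
# human review". This pattern only checks for "Level <1-4>" at the start
# of some line; tighten it to the first line if your team prefers.
LEVEL_LINE = re.compile(r"^Level\s+[1-4]\b", re.MULTILINE)

def declares_autonomy_level(description: str) -> bool:
    """Return True if the PR description states an autonomy level."""
    return bool(LEVEL_LINE.search(description))

if __name__ == "__main__":
    body = sys.stdin.read()  # e.g. piped from the CI event payload
    if not declares_autonomy_level(body):
        print("PR description must state the autonomy level, "
              "e.g. 'Level 2 — bounded write, agent-assisted'")
        sys.exit(1)
```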

The combination of the doc and the PR-line discipline takes about a week to introduce and a quarter to internalize. The teams that do this end up with a meaningfully better signal-to-noise ratio in their AI-assisted work than the teams that don't.