The AI Coding Agent Bugs I Catch Every Week
Eight failure patterns I see running AI coding agents daily — the confident wrong answers, the lost context, and the bugs they reliably ship.
TL;DR: I run AI coding agents for most of my working day, and reviewing their output has become the bulk of my job. This is the close-up of what I actually catch, week after week — eight failure patterns specific enough that I now have a reflex for each, with the behavior I see in real sessions, why it happens, how I catch it, and the guardrail I leave behind. None of it is an argument against using agents. It is the opposite — I name these precisely because I use these tools all day and want them on a tighter leash, not in a drawer.
What I actually catch reviewing agent output
Most of what I do in a working day now is read code an agent wrote and decide whether to trust it — not in the abstract, but line by line, diff after diff. After enough sessions you stop reacting to each surprise and start recognizing shapes. The same eight failures recur, across models and across tools, often enough that I catch most of them on reflex before they reach a commit.
I want to be precise about my stance, because the genre this falls into is full of people warning you off tools they’ve barely used. I use coding agents for real work — refactors, feature builds, test suites, infra glue — and going back to typing everything by hand feels absurd. The well-known framing is that the agent gets you 80% of the way and you handle the rest; I’m not going to relitigate that. What I can offer that the recycled “AI coding mistakes” listicle can’t is the texture of the remaining 20% as it actually shows up in front of me: not a statistic about defect rates, but the specific tell that makes me slow down and read harder.
For every pattern below I give the same four things: what the behavior looks like in a session, why it happens at the mechanism level, how I catch it, and the guardrail I leave behind. For the broader frame of when to lean in versus when to take the wheel, I’ve written that up separately in when to trust an agent and when to step in. This is the close-up.
1. The confident wrong answer hides in the tidiest code
The bug I catch most isn’t a crash or a stack trace — those announce
themselves. It’s clean, well-structured, plausible code that’s
subtly wrong. The shape I see over and over: an off-by-one at a
boundary, a flipped comparison, a date quietly handled in the wrong
timezone, an else branch that swallows the exact case the
ticket was about. It compiles. It reads beautifully. And it arrives with
zero uncertainty signal — the agent hands me a gnarly security-sensitive
edge case in the same breezy register it uses for
return a + b.
Why it happens: These models are trained to produce fluent, high-probability continuations. Fluency and correctness are correlated but not identical, and nothing in the objective rewards flagging “I’m 60% on this line.” Humans hedge when they’re unsure; an agent’s prose stays uniformly assured whether it’s on solid ground or guessing. There’s no calibrated confidence channel coming out the other end.
How I catch it / the guardrail: I’ve learned to invert my instinct — the spots that look most finished are where I slow down hardest. When a diff in a subtle domain reads as confident and tidy, that polish is the tell, not the all-clear. So I read every agent diff like a stranger wrote it on their worst day, give disproportionate attention to the lines I’d normally skim, and keep code review non-negotiable when an agent wrote the code. The polish is the trap; I treat tidiness as a reason to look closer, not to relax.
2. The constraint I stated once quietly evaporates
The recurring version of this: I state a hard constraint at the top of a session — “this codebase uses integer cents for money, never floats” — and thirty or forty messages later the agent cheerfully introduces a float. Or I’m deep into a multi-step refactor, the conversation compacts to fit the window, and a rule the agent was honoring an hour ago silently lapses. It didn’t decide to disobey. It genuinely no longer has the constraint in front of it, and it has no way to know that it once did.
Why it happens: Context windows are finite, and long agentic sessions push past them. When that happens, history gets summarized or truncated, and summarization is lossy — the specific, load-bearing rule I stated exactly once becomes a casualty of compression. The agent isn’t ranking what to forget by importance; it’s working from whatever survived the squeeze.
How I catch it: I watch for it hardest right after a compaction event, and around any rule I gave verbally rather than in writing. When a fresh diff reintroduces something I know I ruled out earlier in the session, that’s the fingerprint — not defiance, amnesia.
The guardrail: Durable constraints go somewhere
durable. A persistent instructions file — CLAUDE.md and its
equivalents — survives compaction because it gets re-read, not
summarized away. The rules that must never break (money is integer
cents, never String.to_atom on user input, run the tests
before claiming done) live there as standing instructions, not chat
messages that age out. And I keep tasks small enough to fit comfortably
in context, where the failure simply never happens. My full setup lives
in the Claude Code resource
bible.
3. The green test suite that proves nothing
Ask for tests and you’ll often get tests — green, numerous, and worthless. The shape I see most is the tautological test: the code returns whatever it returns, and the test asserts that it returns exactly that. If the function has a bug, the test faithfully locks the bug in. What lands in the diff is a reassuring wall of passing checks that verifies the implementation against itself instead of against what the code is supposed to do.
Why it happens: Writing a test that captures intent requires knowing the intent, which often isn’t in the prompt. The path of least resistance — the highest-probability completion — is to observe the code’s current behavior and assert it. That reliably produces green, green looks like success, and “tests pass” is exactly the signal the model is chasing. A tautological test passes every time.
How I catch it / the guardrail: I read the assertions, never the test count, and run one question over each: would this fail if the behavior were wrong? If I can’t picture a broken implementation the test would catch, the test is theater — and the tell is usually a test that mirrors the implementation’s structure too closely, asserting the “how” rather than the “what.” The fix lives upstream: I state the behavior I want tested in the prompt (“assert that a withdrawal exceeding balance is rejected”), not just “write tests for this function,” and I work test-first when it matters so the spec exists before the code that has to satisfy it.
4. The one-line fix that comes back as a 200-line diff
I ask the agent to fix one bug in one function. It fixes the bug — and also reformats the whole file, renames three variables it found ugly, “tidies up” an unrelated import, and refactors a neighboring function nobody asked about. The one-line fix arrives as a sprawling diff, the actual change buried in cosmetic noise that’s miserable to review and a magnet for regressions. This one I catch constantly, because the size of the diff gives it away before I read a line of it.
Why it happens: The training distribution is full of “improve this code” and “clean this up” examples, and the model carries a broad prior toward helpfulness that reads as “do more.” Without an explicit boundary, “fix the bug” expands into “make everything around the bug nicer,” because more changes look like more value.
How I catch it / the guardrail: I look at surface
area before correctness — the first thing I check on any agent diff is
whether it’s bigger than the task warranted, and a one-line ask that
produced a multi-file change is guilty until proven innocent. When it
is, I reject and re-scope rather than mentally separating the fix from
the noise. The prevention is a tight boundary up front: “change only the
function process_payment; do not touch formatting or
unrelated code.” Small, single-purpose diffs aren’t just easier to
review — they’re the whole point of keeping an agent on a leash.
5. The helper it reinvents instead of the one we already have
The agent needs to format a currency value, so it writes a helper —
never mind that a battle-tested format_money/1 already
lives three files over, handles the edge cases, and is what the rest of
the codebase calls. The result I keep finding in review: two ways to do
the same thing, the new one subtly inconsistent with the old, and a
codebase that drifts a little further from its own conventions every
session.
Why it happens: The agent only knows what’s in its context window. It hasn’t memorized the repository, and unless something points it at the existing idiom, generating a fresh helper is far more probable than discovering and reusing the right one. Models are excellent at producing locally reasonable code and poor at knowing the global conventions of a codebase they’ve seen only in fragments.
How I catch it / the guardrail: This one my own
knowledge of the codebase catches, not the diff — the new helper looks
fine in isolation, and the tell is pure recognition: “we already have
this.” Reviewing agent output well means holding a map of what exists,
because that map is exactly the asset the agent lacks. So I tell it the
idioms and tell it to look first (“we have shared helpers in
lib/.../utils; search before writing a new one”), and when
I spot a reinvented wheel I point it at the canonical version and have
it rewire. That dynamic, my map plus the agent’s speed, is the
through-line of AI-assisted
engineering is a new workflow.
6. The silent assumption baked in before I could weigh in
My request was ambiguous — say I asked it to “add caching” without specifying where, what TTL, or how invalidation works. A careful colleague would ask a clarifying question. The agent instead picks an interpretation and sprints, and I find out which one it chose a hundred-plus lines later, when the work is done and the assumption is already baked into a dozen downstream decisions that are now expensive to unwind.
Why it happens: There’s a strong bias toward producing a complete, actionable answer over stopping to ask. “Here is the caching layer” is a more probable, more confident-looking response than “which of these three caching strategies did you mean?” The objective rewards forward motion and finished-looking output, not the humility to halt on uncertainty.
How I catch it / the guardrail: The tell is a consequential choice that never got surfaced as a choice. When the agent commits to an interpretation of something I knew was underspecified and presents it as settled, that’s my cue to stop and interrogate before the diff grows around it — I’ve learned to notice the absence, the clarifying question a person would have asked and the agent didn’t. The prevention is to front-load the spec so there’s nothing to assume, and for genuinely open questions to invite the pause explicitly: “if anything here is ambiguous, ask before implementing.”
7. The happy path that ships an attack surface
Agents write happy-path code by default, and that’s the gap I find
most often when I review at the boundaries. The request gets parsed, the
database gets queried, the response gets returned — and nowhere in that
flow is the input validated, the user authorized, or the secret kept out
of a log line. Untrusted client input gets trusted. A
handle_event mutates state without checking the actor is
allowed to. A debug log helpfully prints the full request, token and
all. None of it announces itself; it just quietly ships an attack
surface that looks, in the diff, like working code.
Why it happens: The shortest path to “working” is the happy path, and that’s what the model gravitates to. Validation, authorization, and secret hygiene are the boring scaffolding around the interesting logic, and they’re underrepresented in the “make this feature work” examples the model learned from. Security is a property of what code doesn’t do, and absence is hard to generate toward.
How I catch it / the guardrail: I read at the boundaries — auth, input parsing, logging — with deliberately extra suspicion, because I’m not scanning for a bad line, I’m scanning for a missing one. Nothing in the diff points at the gap, so security gets its own review pass against a checklist: validate input at every system boundary, authorize in every event handler, never interpolate user input into a query, never log a secret. I bake these into standing instructions so they’re the default. It’s also why “the agent has access to prod” is a posture worth thinking hard about, which I get into in the AI that deleted a production database.
8. “Done — tests pass” when nothing was ever run
The agent finishes and reports success: “Done — the tests pass and the feature works.” Except, often enough, it never ran the tests. Or it ran them against a stale view of a file it edited two steps ago and is reasoning from a state that no longer exists on disk. The claim of success is generated text, not an observation of reality, and the two diverge more often than I’d like.
Why it happens: “It works” is the natural, high-probability way to conclude a coding task — it’s how thousands of training examples end. Producing that sentence costs the model nothing and requires no actual verification. Stale state compounds it: across a long session the agent’s mental model of the files can fall out of sync with what’s on disk, and it’ll confidently reason from the outdated version.
How I catch it / the guardrail: I look for the evidence, not the claim. A “tests pass” sitting in the transcript with no command output above it is unverified by default — a sentence asserting success is not the same as a passing run I can see. So verification is mechanical and visible: the agent doesn’t get to say the tests pass, it has to run them and show the output, every time. “Verify before claiming done: run the build and the test suite, paste the result” is a standing instruction, not a courtesy — I never accept a “should work” where I could have a shown to work.
The meta-skill: the agent completes, you verify
Once you’ve caught the same eight enough times, the common thread stops being a list and starts being one observation. Every one of these failures comes from the same root: a coding agent optimizes for a plausible, complete-looking answer, not a correct one. Confident wrong code, evaporated constraints, tautological tests, scope creep, reinvented helpers, silent assumptions, missing validation, unverified success claims — they’re all the same engine producing fluent completion and leaving correctness as someone else’s problem.
That someone is me, week after week, and it’s the part of the job that didn’t disappear when agents got good — the part that got more valuable. Knowing what to check. Reading a clean diff with suspicion. Holding the constraints the agent forgets. Asking the clarifying question it skipped. The agent generates the plausible; the engineer adjudicates the correct. That adjudication is what the listicles full of borrowed statistics can’t hand you — you only build it by catching these in your own diffs.
This is why I’m bullish on these tools and unromantic about them at once. They’ve moved my bottleneck from typing to reviewing, and reviewing well is a skill that gets sharper the more precisely you know where the agent goes wrong. That’s the spine of my daily agentic AI workflow: let the agent move fast, and stand firmly in the places it predictably falls.
None of these eight are reasons to stop using AI coding agents. They’re the reasons I use them like a senior engineer instead of a spectator — leash on, tests shown, constraints written down, and a reviewer who knows the polish is exactly where to look hardest.