Amazon Let the AI Drive. It Hit a Tree.

Jared Smith · June 30, 2026 · 9 min read · AI agents, engineering leadership

Amazon mandated AI coding, let it touch infrastructure unwatched, and lost millions of orders. The fix wasn't less AI — it was more humans per deploy.

TL;DR: Amazon told its engineers to use AI, set a quota for it, and in at least one case let an AI tool change production infrastructure without a human watching. The tool decided the fix was to delete and recreate the environment. The bill, across a string of incidents, runs into millions of lost orders and a 13-hour outage. Amazon’s remediation is the part worth reading twice: a 90-day reset that puts two people on every deploy to its most critical systems. The company that pushed hardest to take humans out of the loop responded to disaster by putting more of them back in. That’s not an indictment of AI. It’s the whole thesis — AI is a faster driver, and a faster driver with no one in the passenger seat doesn’t get you there sooner. It hits the tree sooner.

A faster driver still needs a navigator

There’s a seductive version of the AI-coding pitch where the headcount line on the spreadsheet only goes down. The agent writes the code, the agent reviews the code, the agent ships the code, and you — the expensive human — get to go do something else. Fewer people, more output. The car drives itself.

The problem with a self-driving car isn’t that it’s slow. It’s that when it’s wrong, it’s wrong at speed. A junior engineer who doesn’t understand the blast radius of a change types slowly enough that someone notices. An agent that doesn’t understand the blast radius executes in nine seconds. I’ve watched an agent delete a production database and then explain, fluently, why it shouldn’t have — the articulateness is the trap, because it reads like judgment right up until the moment it isn’t. Speed without a navigator isn’t progress. It’s just a higher-velocity way to arrive at the wrong place.

Amazon just gave us the cleanest case study yet.

The receipts

According to reporting from the Financial Times (summarized by Digital Trends), Amazon’s e-commerce business hit “a trend of incidents” starting in the third quarter of 2025 — serious enough to trigger a company-wide meeting led by SVP Dave Treadwell. The specifics are bracing:

A 13-hour outage in December 2025, after Amazon’s Kiro AI coding tool was allowed to update infrastructure without human oversight. Kiro’s chosen solution: delete and recreate the environment.
March 2, 2026 — AI coding tools contributed to an incident that cost roughly 120,000 lost orders and produced 1.6 million website errors.
Three days later — a separate outage caused a 99% drop in orders across North American marketplaces, totaling about 6.3 million lost orders.

Amazon’s official line is that these were user errors, not AI failures — but the company concedes the scale of AI-generated code amplified the damage. Read that sentence again, because it’s the entire point. “Not the AI’s fault, but the AI made it enormous” is a confession that the tool removed a brake, not that the tool was blameless. A mistake a human would have made on one server, the system made across the fleet, instantly.

And here’s the context that turns this from an anecdote into a pattern: Amazon had been pushing hard for adoption, reportedly requiring at least 80% of developers to use AI for coding tasks at least once a week. A quota. You can feel the org chart logic in that number — we bought the tool, now use the tool — and you can feel exactly how it produces a culture where letting the agent touch prod unsupervised reads as compliance rather than recklessness.

The AI didn’t fail. It did its job, at the wrong altitude.

It’s worth being precise about what went wrong, because “AI bad” is the lazy reading and it’s also wrong. Kiro did something a competent-but-junior operator might do: faced with a broken environment, it reached for the biggest hammer — tear it down, build it fresh. In a dev sandbox that’s a reasonable instinct. In production it’s a catastrophe. The model didn’t lack capability. It lacked the one thing the old apprenticeship beats into you over a decade: a felt sense of what this particular mistake costs here.

That’s not a coding skill. It’s a judgment skill, and judgment is sediment — it settles out of work, slowly, from a thousand small encounters with how systems actually behave when you connect them under load. An agent has read about blast radius. It has never been paged at 3am because it got the blast radius wrong. The gap between knowing and having-earned-it is exactly the gap that bit Amazon, and it’s the gap that doesn’t close by buying more inference.

This is why the confident-wrong failure mode is the dangerous one. A tool that’s hesitantly wrong gets caught. A tool that deletes your environment with the same calm fluency it uses to format a CSV sails right past anyone who isn’t equipped to overrule it. Knowing when to trust the agent and when to step in is the load-bearing skill of this whole era — and you cannot staff that skill with the same headcount cut you justified by buying the agent.

“Do more with AI” quietly meant “fewer eyes on the road”

The 80% quota and the unsupervised infra change aren’t two stories. They’re the same story. When you frame AI as a way to do the same work with fewer people, the natural next move is to thin out the review, the approvals, the second pair of eyes — those feel like the human overhead the AI was supposed to eliminate. The brakes look like the cost you’re cutting.

But that’s the inversion at the heart of all of this. AI doesn’t shrink the work. It exposes how much work you were leaving on the table, and it raises the stakes on every action because each one now executes at machine scale and machine speed. More leverage means each decision matters more, not less. Satya Nadella’s framing is that AI is “token capital” that amplifies human judgment, and amplification cuts both ways. Multiply good judgment and you get more good outcomes, faster. Multiply absent judgment and you get 6.3 million lost orders in a single outage. The amplifier doesn’t supply the signal. You still have to.

The org that internalized “amplify” as “automate, then reduce headcount” learned the difference in production.

The fix is the thesis, in Amazon’s own handwriting

Here’s the part I’d tattoo on the inside of every “AI replaces engineers” deck. Amazon’s remediation — its 90-day safety reset across roughly 335 critical systems — is not “better AI.” It’s:

Two-person code review before deployment. Humans. Plural.
Formal documentation and approval processes.
Stricter automated checks.

A company at the absolute frontier of AI adoption, staring at the wreckage of letting the agent drive solo, did not conclude “we need a smarter agent.” It concluded “we need more humans in the loop, with more structure around them.” The remediation for too little human oversight was, precisely, more human oversight. The fix for “the AI drove into a tree” was to put a navigator back in the passenger seat: two of them, actually, with a checklist.

That is the case for more people, not fewer, written by the company that most wanted the opposite to be true. You don’t get to wave it away as old-economy caution. This is Amazon. If anyone had the AI sophistication to safely remove the humans, it was them, and they looked at the data and added humans back.

What to actually do with this

You don’t need a 13-hour outage to learn the lesson on someone else’s invoice. Four things that follow directly:

Never let the agent be the only thing between a change and production. The agent can write it, draft it, even propose the deploy. A human approves the deploy. This isn’t distrust of AI; it’s the same reason you don’t let one engineer push to prod unreviewed, scaled to a contributor that works a thousand times faster and has zero scar tissue. A professional owns the whole outcome — the cost, the failure, the 3am page — and ownership can’t be delegated to something that can’t be paged.
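Here is roughly what that rule looks like as a deploy gate. This is a minimal sketch in Python, not anything Amazon or Kiro actually exposes: the `Change` type, the `can_deploy` function, and the `human:` / `agent:` identity prefixes are all hypothetical names I made up for illustration. The only idea it encodes is that an approval counts when it comes from a human who is not the author.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed production change. All fields are illustrative."""
    author: str                                   # e.g. "agent:kiro" or "human:alice"
    approvers: set[str] = field(default_factory=set)


def is_human(identity: str) -> bool:
    # Assumption: identities carry a "human:" or "agent:" prefix.
    return identity.startswith("human:")


def can_deploy(change: Change, required_humans: int = 1) -> bool:
    """Allow a deploy only if enough humans other than the author approved it."""
    human_approvers = {
        a for a in change.approvers
        if is_human(a) and a != change.author     # no agent approvals, no self-approval
    }
    return len(human_approvers) >= required_humans


# An agent-authored change approved only by another agent stays blocked.
change = Change(author="agent:kiro", approvers={"agent:reviewer"})
print(can_deploy(change))                         # False

# One human approval clears the default gate.
change.approvers.add("human:alice")
print(can_deploy(change))                         # True
```

Setting `required_humans=2` would approximate the two-person review described above. In a real pipeline the identities would have to come from your auth system, not from a string prefix.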

Staff the review, don’t cut it. If your AI rollout plan has headcount going down and deploy velocity going up with no one added to the review side, you’ve built Amazon’s December. The leverage AI gives you is real — spend some of it on more skilled eyes, not fewer. The reviewers are the navigators, and they’re cheaper than the outage.

Make “should we” a required step, not an emergent one. Kiro’s failure wasn’t “can we delete and recreate?” It could. It was “should we, here, now, at this blast radius?” That question has to live in the process, as a gate a human passes, because the model will answer “can we” with cheerful competence every single time.
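One way to make that gate concrete is to have the pipeline refuse destructive actions in production unless a named human has signed off. The sketch below is an assumption-heavy illustration in Python: `Action`, `DESTRUCTIVE_VERBS`, `needs_should_we_gate`, and `authorize` are hypothetical names, and the verb list is a placeholder you would replace with whatever counts as high blast radius in your own systems.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a hand-maintained list of verbs that count as high blast radius.
DESTRUCTIVE_VERBS = {"delete", "recreate", "drop", "terminate"}


@dataclass(frozen=True)
class Action:
    """An operation someone (or something) wants to run. Fields are illustrative."""
    verb: str
    environment: str        # e.g. "dev", "staging", "production"
    requested_by: str


def needs_should_we_gate(action: Action) -> bool:
    """The 'should we' question applies to destructive actions in production."""
    return action.environment == "production" and action.verb in DESTRUCTIVE_VERBS


def authorize(action: Action, human_signoff: Optional[str] = None) -> bool:
    """'Can we' is assumed true; 'should we' needs a named human when the gate applies."""
    if not needs_should_we_gate(action):
        return True
    return bool(human_signoff)


wipe = Action(verb="recreate", environment="production", requested_by="agent:kiro")
print(authorize(wipe))                                 # False: nobody answered "should we"
print(authorize(wipe, human_signoff="human:alice"))    # True: a human owns the decision
```

The same action in a dev sandbox passes without sign-off, which matches the point above: tearing down and rebuilding is a reasonable instinct there and a catastrophe in production. This gate sits on top of normal review; it does not replace it.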

Treat the AI quota as a smell. “80% of devs must use AI weekly” optimizes for adoption metrics, not outcomes. Prove the return or don’t spend the time: measure whether the work got better and safer, not whether the tool got touched. A quota tells your engineers that using the agent is the goal. Shipping correct, survivable systems is the goal. Those are not the same KPI, and Amazon just paid millions of orders to learn which one matters.

The car is fast. Hire the navigator.

The mistake isn’t using AI. Amazon should use AI; so should you; I run agents every day and they make me genuinely faster. The mistake is reading “the AI can drive” as “I can take my hands off the wheel and reduce the crew.” The AI can drive — and it will drive into a tree faster than you ever could, with more confidence, across more of your fleet at once, narrating its reasoning the whole way down.

The faster the car, the more the navigator matters. That’s not nostalgia for human labor. It’s the operating manual, and Amazon just published the field-tested edition: two pairs of eyes per deploy, structure around every change, humans owning the outcome. More people to use AI well, not fewer. The company that bet the other way wrote you the receipt.
