Prompt Injection Defense for a 5-Person AI Startup

Prompt injection defense a five-person team can ship in a week — trust boundaries, least-privilege tools, approval gates, and what not to build yet.

TL;DR: Almost everything written about prompt injection is either an academic paper or an enterprise vendor pitch, and neither is addressed to you: a five-person team shipping an LLM product with no security hire. Here’s the operator version. You cannot filter your way out of prompt injection — there is no patch, because the attack input and the legitimate input arrive in the same channel. What you can do is make injection unprofitable: keep Simon Willison’s “lethal trifecta” (private data + untrusted content + external communication) from ever coexisting in one agent context, shrink each tool to the least privilege it needs, validate what comes out of the model instead of trusting what goes in, and put a human gate in front of anything state-changing. That’s roughly a week of work, most of it deletion. And skip the guardrail-model arms race, the red-team retainer, and the WAF-for-prompts SaaS — at your stage they’re theater.

The advice is written for teams you don’t have

Search for prompt injection defense and page one splits into two camps. Camp one is research — taxonomy papers, benchmark suites, formal definitions of indirect injection. Genuinely useful if you’re building a guardrail model; useless if you’re trying to decide what to ship Thursday. Camp two is enterprise vendors selling AI firewalls, red-team platforms, and posture dashboards to companies with a CISO, a security engineering team, and a procurement process.

Nobody is writing for the team that actually ships most new AI products: five people, one of whom is “the security person” the way someone is “the DevOps person” — by default, on top of a full product job. That’s the team I work with as a fractional security lead, so that’s who this post is for.

The good news: at five people you have an advantage the enterprise doesn’t. Your attack surface is small enough to actually reason about, and nobody has to file a ticket to delete a tool from the agent. Most of what follows is subtraction.

What you’re actually defending against

One paragraph of theory, because the mental model matters more than the vocabulary. Prompt injection is not SQL injection with different syntax. SQL injection had a real fix — parameterized queries separate code from data, done. LLMs have no such separation: the system prompt, the user’s request, and the contents of whatever document or webpage or email the model just read all arrive as the same kind of thing — tokens in a context window. Any text the model reads is text that can instruct it. That’s why OWASP’s Top 10 for LLM applications has ranked prompt injection as LLM01 — the number one risk — since the list existed, and why it hasn’t moved: it isn’t a bug that gets patched, it’s a property of the architecture.

The cleanest way to think about when this property becomes an incident is Simon Willison’s lethal trifecta: an agent with access to private data, exposure to untrusted content, and the ability to communicate externally. Any two of those are survivable. All three in one context means a single poisoned input can read your secrets and mail them out, with no exploit code, no CVE in your stack, and nothing for a scanner to find.

This is not hypothetical. EchoLeak (CVE-2025-32711) was a zero-click exfiltration chain in Microsoft 365 Copilot: a crafted email, invisible to the human reading it, instructed Copilot to pull sensitive data from the victim’s context and leak it through an image URL. Microsoft rated it CVSS 9.3. That happened to a company with one of the largest security organizations on the planet, because their agent held all three legs of the trifecta at once. Your copy of the same architecture doesn’t get a pass because you’re small — it gets less attention, which is not the same thing.

Day one: map your trifecta exposure

Before you build anything, spend an afternoon on inventory. For every place your product calls a model, write down three columns: what private data can reach the context (docs, tickets, emails, DB rows, other users’ content), what untrusted content can reach the context (anything a user or the open internet can influence — including your own RAG corpus if users can write to it), and what the model’s output can cause (tool calls, links rendered to users, emails sent, rows written).
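If a spreadsheet feels too loose, the same inventory fits in a few lines of code. This is a minimal sketch, not a prescribed format: the class, field names, and the two example call sites are all hypothetical. It is deliberately conservative, treating any non-empty "what the output can cause" column as a possible external channel until you have ruled it out.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCallSite:
    """One place in the product where a model is called (all names illustrative)."""
    name: str
    private_data: list[str] = field(default_factory=list)       # what private data can reach the context
    untrusted_content: list[str] = field(default_factory=list)  # what outsiders can influence
    output_effects: list[str] = field(default_factory=list)     # what the model's output can cause

    def has_all_three_legs(self) -> bool:
        # Flag the site only when every column is non-empty.
        return bool(self.private_data and self.untrusted_content and self.output_effects)


inventory = [
    ModelCallSite(
        name="support_chat",
        private_data=["tickets", "other users' uploads"],
        untrusted_content=["user-uploaded PDFs in the RAG corpus"],
        output_effects=["markdown links rendered to users", "send_email tool"],
    ),
    ModelCallSite(
        name="release_note_summarizer",
        private_data=[],
        untrusted_content=["public changelog pages"],
        output_effects=["plain text shown in the admin UI"],
    ),
]

# The day-one deliverable: call sites where all three legs coexist.
design_bugs = [site.name for site in inventory if site.has_all_three_legs()]
print(design_bugs)
```

Only the call sites that land in `design_bugs` need architectural work; the rest can wait.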

Most five-person teams find the same two surprises. First, the RAG pipeline is an injection channel — if users can upload documents that other users’ sessions later retrieve, user A is writing into user B’s context, and your “input sanitization” on the chat box never sees it. Second, rendered output is an exfiltration channel — if the model can emit markdown images or links pointing at attacker-controlled domains, it can smuggle data out one URL parameter at a time. That’s the EchoLeak pattern, and it applies to any chat UI that renders model output as rich content.

The deliverable from day one is not a document. It’s a list of places where all three trifecta legs coexist. Each one is a design bug, and the fix is architectural: split the agent, drop a capability, or gate the action. Which brings us to the actual work.

Cut the tool surface before you filter the prompts

The instinct is to add a filter in front of the model. Resist it — the highest-leverage work is on the other side: making the model incapable of doing damage even when it’s fully compromised. Assume the model is an enthusiastic intern who believes everything they read. You don’t fix that by screening the intern’s mail. You fix it by not giving the intern prod credentials.

Concretely, for a week-sized effort:

Scope every tool to the session’s user. The retrieval tool takes the authenticated user’s ID from your code, not from a model-supplied argument. If the model can pass user_id as a parameter, an injected prompt can pass someone else’s. The single most common injection-adjacent bug I see in small-team codebases isn’t exotic — it’s a tool signature that trusts the model to say whose data to fetch.
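Here is roughly what that looks like, as a sketch under assumptions: `Session`, `DOCS`, `make_retrieval_tool`, and `dispatch` are hypothetical names, and the in-memory dictionary stands in for whatever store you actually use. The point is only that the user ID is bound in your code when the tool is built, and the model-facing signature has no slot for it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Session:
    user_id: str  # set by your auth layer, never by the model


# Stand-in for a real document store, keyed by owner.
DOCS = {
    "u1": ["u1 onboarding notes"],
    "u2": ["u2 payroll export"],
}


def make_retrieval_tool(session: Session) -> Callable[[str], list[str]]:
    """Build a retrieval tool bound to one authenticated user."""
    def retrieve_docs(query: str) -> list[str]:
        owned = DOCS.get(session.user_id, [])
        return [doc for doc in owned if query.lower() in doc.lower()]
    return retrieve_docs


def dispatch(tool: Callable[..., list[str]], model_args: dict) -> list[str]:
    """Run a model-requested call, rejecting any argument the tool does not declare."""
    unexpected = set(model_args) - {"query"}
    if unexpected:
        raise ValueError(f"unexpected tool arguments: {sorted(unexpected)}")
    return tool(**model_args)
```

With this shape, an injected `{"query": "payroll", "user_id": "u2"}` fails at dispatch instead of reaching the store, and a plain `{"query": "payroll"}` from user u1’s session simply finds nothing.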

Give tools the narrowest verb that does the job. A query_orders(user_id) tool that your backend curries down to the current user beats a run_sql(query) tool by such a margin that it’s barely the same category of software. If any tool in your agent accepts raw SQL, shell strings, or arbitrary URLs, that tool is the incident report.
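To make the contrast concrete, here is one way the narrow tool might look, using the standard library’s `sqlite3` and `functools.partial`. The table layout, the status values, and the in-memory database are assumptions for illustration. The query text is fixed and parameterized, so the model only ever chooses a status from a short list.

```python
import sqlite3
from functools import partial

ALLOWED_STATUSES = {"open", "shipped", "cancelled"}  # illustrative


def query_orders(conn: sqlite3.Connection, user_id: str, status: str = "open") -> list[dict]:
    """Fixed, parameterized query: the model never supplies SQL text."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status!r}")
    rows = conn.execute(
        "SELECT id, status FROM orders WHERE user_id = ? AND status = ? ORDER BY id",
        (user_id, status),
    ).fetchall()
    return [{"id": row[0], "status": row[1]} for row in rows]


# Demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders (user_id, status) VALUES (?, ?)",
    [("u1", "open"), ("u2", "open"), ("u1", "shipped")],
)

# The backend curries the tool down to the current user before exposing it.
orders_tool_for_u1 = partial(query_orders, conn, "u1")
print(orders_tool_for_u1(status="open"))
```

Because `user_id` is already bound positionally, a call that tries to pass `user_id="u2"` raises a `TypeError` rather than quietly switching accounts.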

Separate read contexts from write contexts. If the assistant that summarizes untrusted documents is the same agent instance that can send emails or hit webhooks, you’ve built the trifecta on purpose. Run untrusted-content processing in a context whose only output is text returned to your code — no tools, or read-only tools — and let a separate, clean context own anything that acts. The orchestration cost is an extra model call. The security property it buys is the whole ballgame.
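A sketch of the split, assuming a hypothetical `ModelFn` interface (prompt in, text out) in place of whichever model client you use. The reader context sees the untrusted document and has no tools; the acting context sees only the authenticated user’s request.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, text out


def summarize_untrusted(document: str, reader_model: ModelFn, max_chars: int = 500) -> str:
    """Quarantined context: reads untrusted text, has no tools, returns text only."""
    summary = reader_model("Summarize this document for the user:\n\n" + document)
    return summary[:max_chars]


def handle_request(user_request: str, document: str,
                   reader_model: ModelFn, actor_model: ModelFn) -> dict:
    """Keep untrusted content and action-taking in separate model contexts."""
    summary = summarize_untrusted(document, reader_model)
    # Clean context: the acting model never sees the document or its summary.
    proposed_action = actor_model(user_request)
    return {"summary_for_display": summary, "proposed_action": proposed_action}
```

One caveat on the design: the summary can still carry injected text, so in this sketch it goes to the UI as display data and never into the acting context. If your product genuinely needs the actor to consume it, that path is where the output validation and approval gates discussed elsewhere in this post have to carry the load.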

Validate outputs, not just inputs

Input filtering is where everyone starts and where attackers are already waiting — obfuscation, encoding tricks, translation, and payloads split across documents all sail past regexes and phrase blocklists. Output validation is less fashionable and much more durable, because it checks the thing you actually care about: what’s about to happen.

Three checks, each about a day:

Validate tool calls structurally. Every tool argument gets schema-checked and policy-checked in your code before execution — is this argument type-valid, is it within this user’s scope, is this action allowed in this state? Boring, deterministic, and it works exactly as well on day 1,000 as day one.
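As an illustration, here are the three questions (type-valid, in scope, allowed in this state) applied to a hypothetical `refund_order` tool. The schema table, the refund cap, and the "delivered" rule are made-up policy; the shape is what matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


# Hypothetical schema and policy for a single tool.
SCHEMAS = {"refund_order": {"order_id": int, "amount_cents": int}}
MAX_REFUND_CENTS = 5_000


def validate_tool_call(call: ToolCall, user_order_ids: set[int],
                       order_state: dict[int, str]) -> list[str]:
    """Return a list of violations; an empty list means the call may run."""
    schema = SCHEMAS.get(call.name)
    if schema is None:
        return [f"unknown tool: {call.name}"]

    errors = []
    # 1. Structure: exact argument names, exact types (bool is not accepted as int).
    if set(call.args) != set(schema):
        errors.append("argument names do not match the schema")
    for key, expected_type in schema.items():
        if key in call.args and type(call.args[key]) is not expected_type:
            errors.append(f"{key} must be {expected_type.__name__}")
    if errors:
        return errors

    order_id = call.args["order_id"]
    amount = call.args["amount_cents"]
    # 2. Scope: the order must belong to the session's user.
    if order_id not in user_order_ids:
        errors.append("order is outside this user's scope")
    # 3. State: the action must be allowed right now.
    elif order_state.get(order_id) != "delivered":
        errors.append("refunds are only allowed for delivered orders")
    if not 0 < amount <= MAX_REFUND_CENTS:
        errors.append("amount is outside the refund policy")
    return errors
```

The executor runs the call only when the list comes back empty, and logs the violations otherwise.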

Sanitize rendered output. Strip or proxy images in model output; allowlist link domains; never render raw HTML from a model. This single control kills the most practical exfiltration path a chat product has.
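A rough sketch of the image and link half of that control, assuming the chat UI renders markdown and that raw HTML is already disabled in the renderer. The allowlisted host is hypothetical. Regexes are not a markdown parser (reference-style links and bare URLs are not covered here), so if your renderer exposes hooks for images and links, enforcing the policy there is the sturdier option.

```python
import re
from urllib.parse import urlsplit

ALLOWED_LINK_HOSTS = {"docs.example.com"}  # hypothetical allowlist

IMAGE_RE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")          # ![alt](url)
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]*)[^)]*\)")   # [text](url "title")


def _link_is_allowed(url: str) -> bool:
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL: treat as untrusted
    return parts.scheme == "https" and parts.hostname in ALLOWED_LINK_HOSTS


def sanitize_markdown(text: str) -> str:
    """Drop inline images entirely; keep only links to allowlisted HTTPS hosts."""
    # Images fetch automatically on render, so they are removed, not allowlisted.
    text = IMAGE_RE.sub("(image removed)", text)

    def _rewrite_link(match: re.Match) -> str:
        label, url = match.group(1), match.group(2)
        return match.group(0) if _link_is_allowed(url) else label

    return LINK_RE.sub(_rewrite_link, text)
```

Disallowed links collapse to their visible label, so the user still sees the sentence but nothing clickable or auto-fetched points at an unknown domain.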

Detect data that shouldn’t leave. A dumb, deterministic scan of outbound model responses for things shaped like secrets — API keys, connection strings, your own internal hostnames — catches the embarrassing failures cheaply. It’s not sophisticated. Neither are most incidents.
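The scan really can be this dumb. In the sketch below, the pattern list is illustrative rather than complete, and `internal.example.com` is a placeholder for your own internal domain; a real deployment would add the token formats of the providers you actually use.

```python
import re

# Illustrative patterns only; extend with the secret formats you actually handle.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    "connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?|redis)://[^\s:@/]+:[^\s@/]+@"
    ),
    # Placeholder domain: substitute your own internal hostname suffix.
    "internal_hostname": re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.internal\.example\.com\b"),
}


def find_secret_like(text: str) -> list[str]:
    """Return the labels of every pattern that matches the text."""
    return [label for label, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def release_or_block(response: str) -> tuple[str, list[str]]:
    """Withhold a model response that contains anything shaped like a secret."""
    hits = find_secret_like(response)
    if hits:
        return "[response withheld: possible secret detected]", hits
    return response, []
```

Blocking outright is the simplest policy; redacting the match and logging the hit is a reasonable alternative once you know your false-positive rate.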

Put a human gate on anything state-changing

I’ve written before about the autonomy ladder (when to trust an agent and when to step in), and prompt injection is the sharpest version of the argument. An agent that can be talked into things by its own inputs does not get to take irreversible actions unsupervised. Sending money, deleting records, emailing third parties, changing permissions: those actions get a confirmation step where a human sees what is about to happen in plain language. Not “the agent wants to proceed,” but “this will email invoice.pdf to billing@vendor-you-dont-recognize.example.”

Founders push back that this breaks the magic-agent demo. It does, slightly. But the demo where your agent wires data to an attacker because a PDF told it to is worse, and at five people you do not have the incident-response capacity to treat that as a learning experience. Approval gates are also the control that lets you ship more agent capability later: every gate you can eventually remove is a rung of autonomy you earned with logs instead of hope.
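For a sense of scale, a gate can be about this small. This is a sketch with hypothetical names (`PendingAction`, `STATE_CHANGING`, `run_with_gate`); `ask_human` stands in for whatever confirmation UI you have, and `execute` for your real tool executor.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PendingAction:
    kind: str
    params: dict


# Hypothetical list of actions that must never run unsupervised.
STATE_CHANGING = {"send_email", "delete_record"}


def describe(action: PendingAction) -> str:
    """Say what is about to happen in plain language, with the concrete values."""
    if action.kind == "send_email":
        return f"This will email {action.params['attachment']} to {action.params['to']}."
    if action.kind == "delete_record":
        return f"This will permanently delete record {action.params['record_id']}."
    return f"This will run {action.kind} with {action.params!r}."


def run_with_gate(action: PendingAction,
                  execute: Callable[[PendingAction], None],
                  ask_human: Callable[[str], bool]) -> str:
    """Execute state-changing actions only after a human approves the description."""
    if action.kind in STATE_CHANGING and not ask_human(describe(action)):
        return "declined"
    execute(action)
    return "executed"
```

The description is built by your code from the validated parameters, not written by the model, so an injected prompt cannot also talk its way into a reassuring confirmation message.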

Why the LLM-judge guardrail can’t be your primary defense

The tempting product on the shelf is the guardrail model, a second LLM that inspects inputs or outputs for injection. Use one if you like, but understand what you’re buying: a probabilistic filter in front of a probabilistic system. It fails a few percent of the time on a good day, novel attack phrasings appear faster than vendors can retrain, and (the part that should bother you) the judge model reads the same untrusted text and is itself injectable. EchoLeak walked straight past Microsoft’s own cross-prompt injection classifier on the way to the data.

A detector that’s 97% effective sounds great until you notice attackers get unlimited retries at near-zero cost. Against a motivated adversary, “usually catches it” is a speed bump. The deterministic controls above — scoped tools, output validation, approval gates — don’t have a bypass rate. They’re either enforced or they aren’t, and that’s a property a five-person team can actually maintain. Layer a detector on top after the deterministic floor exists, as signal and rate-limiting, not as the wall.
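The retry arithmetic is worth running once. The sketch below assumes each attempt is caught independently with the same probability, which is a simplification: real attackers adapt between tries, and that usually makes things worse for the detector, not better.

```python
def bypass_probability(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets through.

    Assumes every attempt is caught with the same, independent probability.
    """
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - detection_rate ** attempts


for n in (1, 10, 100):
    print(n, round(bypass_probability(0.97, n), 3))
```

Under that assumption, a 97% detector lets at least one payload through roughly 26% of the time after 10 attempts and about 95% of the time after 100.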

What to deliberately not build

Half of security at your stage is declining work confidently. Skip, for now: the prompt-WAF subscription (it’s a detector with a dashboard — see above); a quarterly red-team retainer (a competent one costs more than your entire security budget and will tell you to do the things in this post); fine-tuning for adversarial robustness (real research area, not a startup Tuesday); and building your own injection benchmark suite (run one of the open ones twice a year and move on).

None of these are bad ideas at 50 people with a security hire and something to lose that’s worth the spend. At five, every one of them displaces the week of unglamorous work that actually changes your exposure.

Where this fits in the bigger picture

Prompt injection defense is one layer of the four-layer stack I laid out in how I’d run security at an AI-native company in 2026 — alongside agent credentials, secrets handling, and audit logging — and it’s the layer enterprise security reviews now probe first, because buyers have read the same incident reports you just did. Get the week of work done and you’re not just safer; you have a concrete, honest answer for the security questionnaire that would otherwise stall your first enterprise deal.

If you want a second pair of eyes on where your product sits on the trifecta map — or you’re staring down a security review and need the whole stack, not just this layer — the vCISO math post covers what fractional security leadership costs at your stage, and this is the engagement I run.