Prompt Injection Defense for a 5-Person AI Startup
Prompt injection defense a five-person team can ship in a week — trust boundaries, least-privilege tools, approval gates, and what not to build yet.
TL;DR: Almost everything written about prompt injection is either an academic paper or an enterprise vendor pitch, and neither is addressed to you: a five-person team shipping an LLM product with no security hire. Here’s the operator version. You cannot filter your way out of prompt injection — there is no patch, because the attack input and the legitimate input arrive in the same channel. What you can do is make injection unprofitable: keep Simon Willison’s “lethal trifecta” (private data + untrusted content + external communication) from ever coexisting in one agent context, shrink each tool to the least privilege it needs, validate what comes out of the model instead of trusting what goes in, and put a human gate in front of anything state-changing. That’s roughly a week of work, most of it deletion. And skip the guardrail-model arms race, the red-team retainer, and the WAF-for-prompts SaaS — at your stage they’re theater.
The advice is written for teams you don’t have
Search for prompt injection defense and page one splits into two camps. Camp one is research — taxonomy papers, benchmark suites, formal definitions of indirect injection. Genuinely useful if you’re building a guardrail model; useless if you’re trying to decide what to ship Thursday. Camp two is enterprise vendors selling AI firewalls, red-team platforms, and posture dashboards to companies with a CISO, a security engineering team, and a procurement process.
Nobody is writing for the team that actually ships most new AI products: five people, one of whom is “the security person” the way someone is “the DevOps person” — by default, on top of a full product job. That’s the team I work with as a fractional security lead, so that’s who this post is for.
The good news: at five people you have an advantage the enterprise doesn’t. Your attack surface is small enough to actually reason about, and nobody has to file a ticket to delete a tool from the agent. Most of what follows is subtraction.
What you’re actually defending against
One paragraph of theory, because the mental model matters more than the vocabulary. Prompt injection is not SQL injection with different syntax. SQL injection had a real fix: parameterized queries separate code from data, done. LLMs have no such separation: the system prompt, the user’s request, and the contents of whatever document or webpage or email the model just read all arrive as the same kind of thing, tokens in a context window. Any text the model reads is text that can instruct it. That’s why OWASP’s Top 10 for LLM applications has ranked prompt injection as LLM01, the number one risk, in every edition since the list was first published, and why it hasn’t moved: it isn’t a bug that gets patched, it’s a property of the architecture.
The cleanest way to think about when this property becomes an incident is Simon Willison’s lethal trifecta: an agent with access to private data, exposure to untrusted content, and the ability to communicate externally. Any two of those is survivable. All three in one context means a single poisoned input can read your secrets and mail them out — no exploit code, no CVE in your stack, nothing for a scanner to find.
This is not hypothetical. EchoLeak (CVE-2025-32711) was a zero-click exfiltration chain in Microsoft 365 Copilot: a crafted email, invisible to the human reading it, instructed Copilot to pull sensitive data from the victim’s context and leak it through an image URL. Microsoft rated it CVSS 9.3. That happened to a company with one of the largest security organizations on the planet, because their agent held all three legs of the trifecta at once. Your copy of the same architecture doesn’t get a pass because you’re small — it gets less attention, which is not the same thing.
Day one: map your trifecta exposure
Before you build anything, spend an afternoon on inventory. For every place your product calls a model, write down three columns: what private data can reach the context (docs, tickets, emails, DB rows, other users’ content), what untrusted content can reach the context (anything a user or the open internet can influence — including your own RAG corpus if users can write to it), and what the model’s output can cause (tool calls, links rendered to users, emails sent, rows written).
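One way to keep that inventory honest is to hold it as data rather than prose. The sketch below is a minimal illustration, not a prescribed format: the `CallSite` class, its field names, and the two example entries are all hypothetical, and it conservatively treats any output effect as the "external communication" leg.

```python
from dataclasses import dataclass, field


@dataclass
class CallSite:
    """One place the product calls a model. All names here are illustrative."""
    name: str
    private_data: list[str] = field(default_factory=list)       # what private data can reach the context
    untrusted_content: list[str] = field(default_factory=list)  # what outsiders can influence
    output_effects: list[str] = field(default_factory=list)     # what the model's output can cause

    def has_full_trifecta(self) -> bool:
        # Conservative assumption: any output effect counts as the third leg.
        return bool(self.private_data and self.untrusted_content and self.output_effects)


# Hypothetical inventory for an imaginary product.
inventory = [
    CallSite(
        name="support_chat",
        private_data=["tickets", "customer emails"],
        untrusted_content=["user messages", "user-uploaded docs via RAG"],
        output_effects=["markdown rendered to user", "send_email tool"],
    ),
    CallSite(
        name="changelog_summarizer",
        untrusted_content=["public release notes"],
        output_effects=["text shown to staff"],
    ),
]

# The day-one deliverable: every call site where all three legs coexist.
trifecta_sites = [site.name for site in inventory if site.has_full_trifecta()]
print(trifecta_sites)  # ['support_chat']
```

A spreadsheet does the same job. What matters is that the third column gets filled in for every call site, including the ones that only render text.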
Most five-person teams find the same two surprises. First, the RAG pipeline is an injection channel — if users can upload documents that other users’ sessions later retrieve, user A is writing into user B’s context, and your “input sanitization” on the chat box never sees it. Second, rendered output is an exfiltration channel — if the model can emit markdown images or links pointing at attacker-controlled domains, it can smuggle data out one URL parameter at a time. That’s the EchoLeak pattern, and it applies to any chat UI that renders model output as rich content.
The deliverable from day one is not a document. It’s a list of places where all three trifecta legs coexist. Each one is a design bug, and the fix is architectural: split the agent, drop a capability, or gate the action. Which brings us to the actual work.
Cut the tool surface before you filter the prompts
The instinct is to add a filter in front of the model. Resist it — the highest-leverage work is on the other side: making the model incapable of doing damage even when it’s fully compromised. Assume the model is an enthusiastic intern who believes everything they read. You don’t fix that by screening the intern’s mail. You fix it by not giving the intern prod credentials.
Concretely, for a week-sized effort:
Scope every tool to the session’s user. The retrieval tool takes the authenticated user’s ID from your code, not from a model-supplied argument. If the model can pass user_id as a parameter, an injected prompt can pass someone else’s. The single most common injection-adjacent bug I see in small-team codebases isn’t exotic: it’s a tool signature that trusts the model to say whose data to fetch.
Give tools the narrowest verb that does the job. A query_orders(user_id) tool that your backend curries down to the current user beats a run_sql(query) tool by such a margin that it’s barely the same category of software. If any tool in your agent accepts raw SQL, shell strings, or arbitrary URLs, that tool is the incident report.
Separate read contexts from write contexts. If the assistant that summarizes untrusted documents is the same agent instance that can send emails or hit webhooks, you’ve built the trifecta on purpose. Run untrusted-content processing in a context whose only output is text returned to your code — no tools, or read-only tools — and let a separate, clean context own anything that acts. The orchestration cost is an extra model call. The security property it buys is the whole ballgame.
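The three items above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `ORDERS`, `make_tools`, `dispatch`, and `summarize_untrusted` are hypothetical names, the in-memory dict stands in for a real database, and `call_model` is a placeholder for whatever model client you use.

```python
from typing import Callable

# Hypothetical in-memory stand-in for an orders table.
ORDERS = {
    "user_a": [{"id": 1, "total": 40}],
    "user_b": [{"id": 2, "total": 95}],
}


def make_tools(session_user_id: str) -> dict[str, Callable[..., object]]:
    """Build the tool set for one authenticated session.

    session_user_id comes from the auth layer, never from the model.
    """

    def query_orders(min_total: int = 0) -> list[dict]:
        # The model may choose min_total. It cannot choose whose orders to read,
        # because user_id is not a parameter of the exposed tool.
        return [o for o in ORDERS.get(session_user_id, []) if o["total"] >= min_total]

    return {"query_orders": query_orders}


def dispatch(tools: dict[str, Callable[..., object]], name: str, args: dict) -> object:
    """Run a model-requested tool call against this session's tool set only."""
    if name not in tools:
        raise PermissionError(f"tool not available in this context: {name}")
    return tools[name](**args)


def summarize_untrusted(document: str, call_model: Callable[[str], str]) -> str:
    """Read context: the model sees untrusted text but is given no tools.

    The return value is plain text for the calling code to handle.
    """
    return call_model("Summarize the following document:\n\n" + document)


tools = make_tools("user_a")
print(dispatch(tools, "query_orders", {"min_total": 10}))  # [{'id': 1, 'total': 40}]
```

With this shape, an injected call that passes `user_id="user_b"` fails with a `TypeError`, because the exposed function has no such parameter, and a request for `run_sql` fails because that tool was never registered.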
Validate outputs, not just inputs
Input filtering is where everyone starts and where attackers are already waiting — obfuscation, encoding tricks, translation, and payloads split across documents all sail past regexes and phrase blocklists. Output validation is less fashionable and much more durable, because it checks the thing you actually care about: what’s about to happen.
Three checks, each about a day:
Validate tool calls structurally. Every tool argument gets schema-checked and policy-checked in your code before execution — is this argument type-valid, is it within this user’s scope, is this action allowed in this state? Boring, deterministic, and it works exactly as well on day 1,000 as day one.
Sanitize rendered output. Strip or proxy images in model output; allowlist link domains; never render raw HTML from a model. This single control kills the most practical exfiltration path a chat product has.
Detect data that shouldn’t leave. A dumb, deterministic scan of outbound model responses for things shaped like secrets — API keys, connection strings, your own internal hostnames — catches the embarrassing failures cheaply. It’s not sophisticated. Neither are most incidents.
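A rough sketch of the three checks, to show how little machinery they need. Everything named here is an assumption made for illustration: the allowlisted host, the `TOOL_SCHEMAS` table, and the secret patterns are hypothetical and would need tuning to your own stack. The tool-call check covers only the schema half; the per-user scope and state checks depend on your data model and are left out. The regex-based markdown handling is deliberately crude, and a real renderer hook would be more robust.

```python
import re
from urllib.parse import urlparse

ALLOWED_LINK_HOSTS = {"docs.example.com"}  # hypothetical allowlist

# Hypothetical policy table: tool name -> {argument name: required type}
TOOL_SCHEMAS = {"query_orders": {"min_total": int}}

# Illustrative patterns for secret-shaped strings.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),                 # API-key-like token
    re.compile(r"\b\w+://[^\s:/@]+:[^\s@]+@[^\s/]+"),       # connection string with credentials
    re.compile(r"\b[\w.-]+\.internal\.example\.com\b"),     # internal hostname
]

MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]*)[^)]*\)")


def validate_tool_call(name: str, args: dict) -> None:
    """Schema check before execution; raises ValueError on anything unexpected."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"unknown tool: {name}")
    for key, value in args.items():
        if key not in schema:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, schema[key]):
            raise ValueError(f"bad type for {key}")


def sanitize_markdown(text: str) -> str:
    """Drop images entirely and unlink any link whose host is not allowlisted."""
    text = MD_IMAGE.sub("(image removed)", text)

    def keep_or_unlink(match: re.Match) -> str:
        label, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in ALLOWED_LINK_HOSTS:
            return match.group(0)
        return label  # keep the visible text, drop the destination

    return MD_LINK.sub(keep_or_unlink, text)


def contains_secret_like(text: str) -> bool:
    """Deterministic scan of an outbound response for secret-shaped strings."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```

Each function runs in your code between the model and the outside world, so none of them depends on the model behaving.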
Put a human gate on anything state-changing
I’ve written before about the autonomy ladder (when to trust an agent and when to step in), and prompt injection is the sharpest version of the argument. An agent that can be talked into things by its own inputs does not get to take irreversible actions unsupervised. Sending money, deleting records, emailing third parties, changing permissions: those actions get a confirmation step where a human sees what is about to happen in plain language. Not “the agent wants to proceed,” but “this will email invoice.pdf to billing@vendor-you-dont-recognize.example.”
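A confirmation step like that can be quite small. The sketch below is one possible shape, with hypothetical names throughout (`PendingAction`, `run_with_approval`, and the `ask_human` and `execute` callbacks). The point it illustrates is that the description shown to the human is built by your code from the action's real parameters, not from anything the model wrote.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingAction:
    """A state-changing action the agent has proposed but not yet executed."""
    kind: str
    params: dict

    def describe(self) -> str:
        # Plain-language summary built from the actual parameters.
        if self.kind == "send_email":
            return f"This will email {self.params['attachment']} to {self.params['to']}."
        if self.kind == "delete_record":
            return f"This will permanently delete record {self.params['record_id']}."
        return f"This will run {self.kind} with {self.params}."


def run_with_approval(
    action: PendingAction,
    execute: Callable[[PendingAction], str],
    ask_human: Callable[[str], bool],
) -> str:
    """Execute only if a human approves the described action."""
    if not ask_human(action.describe()):
        return "rejected"
    return execute(action)


# Hypothetical usage: the reviewer declines, so nothing is sent.
action = PendingAction("send_email", {"attachment": "invoice.pdf", "to": "billing@vendor.example"})
result = run_with_approval(action, execute=lambda a: "sent", ask_human=lambda text: False)
print(result)  # rejected
```

In a real product `ask_human` would be a UI prompt or a queued review rather than a lambda, and every decision would be logged.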
Founders push back that this breaks the magic-agent demo. It does, slightly. But the demo where your agent wires data to an attacker because a PDF told it to is worse, and at five people you do not have the incident-response capacity to treat that as a learning experience. Approval gates are also the control that lets you ship more agent capability later: every gate you can eventually remove is a rung of autonomy you earned with logs instead of hope.
Why the LLM-judge guardrail can’t be your primary defense
The tempting product on the shelf is the guardrail model: a second LLM that inspects inputs or outputs for injection. Use one if you like, but understand what you’re buying: a probabilistic filter in front of a probabilistic system. It fails a few percent of the time on a good day, novel attack phrasings appear faster than vendors retrain, and (the part that should bother you) the judge model reads the same untrusted text and is itself injectable. EchoLeak walked straight past Microsoft’s own cross-prompt injection classifier on the way to the data.
A detector that’s 97% effective sounds great until you notice attackers get unlimited retries at near-zero cost. Against a motivated adversary, “usually catches it” is a speed bump. The deterministic controls above — scoped tools, output validation, approval gates — don’t have a bypass rate. They’re either enforced or they aren’t, and that’s a property a five-person team can actually maintain. Layer a detector on top after the deterministic floor exists, as signal and rate-limiting, not as the wall.
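The retry arithmetic is worth seeing once. The sketch below assumes each attempt is independent and caught with the same probability, which is a simplification (real attackers adapt, and real detectors have correlated blind spots). The function names are made up for this example.

```python
import math


def bypass_probability(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets through."""
    return 1.0 - detection_rate ** attempts


def attempts_for(detection_rate: float, target: float) -> int:
    """Smallest number of independent tries whose bypass chance reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(detection_rate))


print(round(bypass_probability(0.97, 100), 3))
print(attempts_for(0.97, 0.5))
```

Under those assumptions, a 97% detector is beaten at least once in about 95% of 100-attempt campaigns, and an attacker reaches even odds of a bypass after roughly 23 tries.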
What to deliberately not build
Half of security at your stage is declining work confidently. Skip, for now: the prompt-WAF subscription (it’s a detector with a dashboard — see above); a quarterly red-team retainer (a competent one costs more than your entire security budget and will tell you to do the things in this post); fine-tuning for adversarial robustness (real research area, not a startup Tuesday); and building your own injection benchmark suite (run one of the open ones twice a year and move on).
None of these are bad ideas at 50 people with a security hire and something to lose that’s worth the spend. At five, every one of them displaces the week of unglamorous work that actually changes your exposure.
Where this fits in the bigger picture
Prompt injection defense is one layer of the four-layer stack I laid out in how I’d run security at an AI-native company in 2026 — alongside agent credentials, secrets handling, and audit logging — and it’s the layer enterprise security reviews now probe first, because buyers have read the same incident reports you just did. Get the week of work done and you’re not just safer; you have a concrete, honest answer for the security questionnaire that would otherwise stall your first enterprise deal.
If you want a second pair of eyes on where your product sits on the trifecta map — or you’re staring down a security review and need the whole stack, not just this layer — the vCISO math post covers what fractional security leadership costs at your stage, and this is the engagement I run.