
Your AI Product Needs a Telemetry Layer Before It Needs a Better Model

Most AI startups try to fix the model when the real problem is that they can't see what the model is doing. The four-layer AI telemetry stack, the tooling to reach for, and how proper instrumentation cut Lavender's hallucination rate by 40% without touching the model itself.

I've watched three AI startups burn months trying to "improve the model" when they couldn't even tell which prompts produced which outputs at scale.

Every team had the same instinct: hallucination rate too high, response quality inconsistent, costs creeping up — must be a model problem, let's tune the prompts, let's swap to GPT-5, let's fine-tune. None of them stopped to ask the more useful question first: what's actually happening inside the model calls we're already making?

The answer, almost always: nobody really knew. There was no production logging of prompts. No structured capture of model outputs. No correlation between which user did what and what the model returned. The team was making decisions about model improvement based on cherry-picked screenshots and vibes.

That's not a model problem. That's an instrumentation problem. And it's solvable in two weeks of disciplined engineering, which buys you the visibility to know whether the model problem is even real.

What AI telemetry actually means

Classic application telemetry — request rate, latency, error rate — does not tell you anything useful about an AI feature. A successful 200 response from your LLM endpoint tells you nothing about whether the response was correct, helpful, or hallucinated. You need a different layer of observability that's specific to how AI features fail.

The four things you must capture for every model call:

  • The full prompt. Every variable interpolation. Every system prompt. Every retrieval-augmented context. Stored as structured data, not a stringified blob.
  • The full response. Including any tool calls, function calls, or structured outputs. Stored verbatim.
  • The cost and latency. Tokens in, tokens out, dollar cost, wall-clock time. These compose into your unit economics.
  • The user context. Who triggered this call, in what feature, against what state. Anonymized if you must, but linkable to the user session.

Without those four, you cannot reason about model performance at any scale beyond "let me copy this prompt into the playground and see what happens." That's not engineering, it's gambling.
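
To make that concrete, here's a minimal sketch of one captured record. The field names and the log_model_call helper are illustrative, not a prescribed schema; the point is that every piece of the four-tuple is structured and queryable.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class ModelCallRecord:
    # The four-tuple, plus a request ID to join everything else against.
    request_id: str
    feature: str        # e.g. "draft_email", "summarize"
    user_id: str        # or an anonymized session ID, as long as it's linkable
    prompt: dict        # system prompt, user message, retrieved context -- structured, not a blob
    response: dict      # text, tool calls, structured outputs, verbatim
    tokens_in: int
    tokens_out: int
    cost_usd: float
    latency_ms: float
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def log_model_call(record: ModelCallRecord) -> None:
    # Swap this for a DB insert or a queue; stdout is just the simplest sink.
    print(json.dumps(asdict(record)))


log_model_call(ModelCallRecord(
    request_id=str(uuid.uuid4()),
    feature="draft_email",
    user_id="user_123",
    prompt={"system": "You write concise sales emails.", "user": "Draft a follow-up to Dana.", "context": []},
    response={"text": "Hi Dana, following up on last week's call...", "tool_calls": []},
    tokens_in=412,
    tokens_out=138,
    cost_usd=0.0031,
    latency_ms=1840.0,
))
```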

The four-layer telemetry stack

Once the basics are captured, the actual decisions you make benefit from layered aggregation.

Layer 1: Request-level telemetry

Every model call gets logged with the four-tuple above plus a request ID. This is the source of truth. Every other layer aggregates from this layer.

Storage decisions matter here. The volume can be large — for a product making 100k model calls a day, this is 100k structured rows daily. We chose Postgres with JSONB columns at Lavender, with a 90-day retention policy. Worked fine for our scale; would not scale to 10M calls/day. Use what fits.
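
As a rough sketch of that setup (hypothetical column names, not the exact Lavender schema), a JSONB-backed table plus an insert helper might look like this:

```python
import psycopg2
from psycopg2.extras import Json

DDL = """
CREATE TABLE IF NOT EXISTS model_calls (
    request_id  UUID PRIMARY KEY,
    feature     TEXT NOT NULL,
    user_id     TEXT NOT NULL,
    prompt      JSONB NOT NULL,       -- structured prompt: system, user, retrieved context
    response    JSONB NOT NULL,       -- verbatim output, including tool calls
    tokens_in   INTEGER NOT NULL,
    tokens_out  INTEGER NOT NULL,
    cost_usd    NUMERIC(10, 6) NOT NULL,
    latency_ms  DOUBLE PRECISION NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS model_calls_feature_time ON model_calls (feature, created_at);
CREATE INDEX IF NOT EXISTS model_calls_user_time ON model_calls (user_id, created_at);
"""


def insert_call(conn, record: dict) -> None:
    # record carries the captured four-tuple plus request_id and feature.
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO model_calls
                (request_id, feature, user_id, prompt, response,
                 tokens_in, tokens_out, cost_usd, latency_ms)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            """,
            (
                record["request_id"], record["feature"], record["user_id"],
                Json(record["prompt"]), Json(record["response"]),
                record["tokens_in"], record["tokens_out"],
                record["cost_usd"], record["latency_ms"],
            ),
        )
    conn.commit()


conn = psycopg2.connect("dbname=telemetry")  # swap in your real DSN
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
```

Retention is then a scheduled delete of rows older than 90 days, or table partitioning by month if the volume justifies it.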

Layer 2: Feature-level aggregation

Each model call belongs to a feature: "summarize," "draft email," "suggest reply," etc. Aggregate the request-level data by feature to answer questions like:

  • What's the median response time of the "draft email" feature this week?
  • What's the daily cost of "summarize" over the past 30 days?
  • Which feature has seen the biggest cost spike since the last release?

This is the layer where you start making product decisions: "the suggest-reply feature costs 4x what summarize does and gets used 1/10 as much — we should kill it or rebuild it."
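
Against the hypothetical model_calls table sketched above, those questions are single queries; Postgres's percentile_cont does the median and p95 work. For example:

```python
FEATURE_WEEKLY_STATS = """
SELECT
    feature,
    percentile_cont(0.5)  WITHIN GROUP (ORDER BY latency_ms) AS median_latency_ms,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency_ms,
    SUM(cost_usd)                                             AS total_cost_usd,
    COUNT(*)                                                  AS calls
FROM model_calls
WHERE created_at >= now() - interval '7 days'
GROUP BY feature
ORDER BY total_cost_usd DESC;
"""


def feature_weekly_stats(conn):
    with conn.cursor() as cur:
        cur.execute(FEATURE_WEEKLY_STATS)
        return cur.fetchall()
```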

Layer 3: User-level signal

Each user has interactions across multiple features. Aggregate at the user level to answer:

  • Are heavy users seeing more or fewer hallucinations than light users?
  • Is there a cohort of users for whom the feature consistently fails?
  • What's our cost per active user per week?

The user-level layer is where you discover that your model is fine for 90% of users but catastrophically bad for the specific use case 10% of users have. Without this layer, that 10% is invisible.
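
The cost-per-active-user question, again against the hypothetical table above, is a short query you can sort to surface the expensive cohort:

```python
COST_PER_ACTIVE_USER = """
SELECT
    user_id,
    SUM(cost_usd)            AS weekly_cost_usd,
    COUNT(*)                 AS calls,
    COUNT(DISTINCT feature)  AS features_touched
FROM model_calls
WHERE created_at >= now() - interval '7 days'
GROUP BY user_id
ORDER BY weekly_cost_usd DESC;
"""
```

Join that against whatever else you know about those users (plan, segment, template in use) and the catastrophic 10% stops being invisible.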

Layer 4: Product-level trends

Daily / weekly rollups across the whole product. The metrics that go on a dashboard the founder reads every Monday morning:

  • Total cost trend
  • Cost per active user trend
  • P95 latency trend
  • Hallucination signal trend (more on this below)
  • Feature-level usage distribution

The point of layer 4 is regression detection. When something breaks, you want to know within 24 hours, not 21 days into the quarter when finance asks why the OpenAI bill tripled.
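
A sketch of the daily rollup feeding that dashboard (the hallucination signal is omitted here because it depends on the event data described in the next section):

```python
DAILY_ROLLUP = """
SELECT
    date_trunc('day', created_at)                              AS day,
    SUM(cost_usd)                                              AS total_cost_usd,
    COUNT(DISTINCT user_id)                                    AS active_users,
    SUM(cost_usd) / NULLIF(COUNT(DISTINCT user_id), 0)         AS cost_per_active_user,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms)   AS p95_latency_ms
FROM model_calls
WHERE created_at >= now() - interval '30 days'
GROUP BY 1
ORDER BY 1;
"""
```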

The hallucination signal

Hallucination is the hardest thing to measure because there's no ground truth label at runtime. Real-world signals that approximate it:

  • User regenerates the response. One of the strongest negative signals. If a user immediately clicks "regenerate," they didn't like what they got.
  • User edits the response heavily before using it. If you have a copy-and-edit flow, measure the edit distance.
  • User abandons the feature mid-flow. Strong signal something went wrong.
  • Explicit thumbs-up / thumbs-down. Lowest-volume signal but the cleanest. Add it everywhere it's not annoying.
  • Response contains markers of uncertainty. "I don't have information about" or "I cannot determine" — sometimes useful, sometimes a euphemism for hallucination.

None of these is a clean ground-truth label. Combined, they give you a directional indicator that's good enough for relative comparisons over time. The goal isn't "what's our true hallucination rate" — that's unanswerable. The goal is "is hallucination getting better or worse this week, and which features are driving the change."
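
One way to fold those signals into a single directional number is a weighted score per feature per day. The weights below are arbitrary starting points, not anything Lavender shipped; the only requirement is that you compute the score the same way every week so the trend stays comparable.

```python
def hallucination_signal(events: list[dict]) -> float:
    """
    events: one dict per model call for a given feature and day, e.g.
      {"regenerated": True, "edit_ratio": 0.6, "abandoned": False, "thumbs": -1}
    Returns a score where higher means "probably worse". The weights are
    arbitrary starting points; consistency matters more than the exact values.
    """
    if not events:
        return 0.0
    score = 0.0
    for e in events:
        score += 0.4 * float(e.get("regenerated", False))
        score += 0.3 * min(e.get("edit_ratio", 0.0), 1.0)    # edit distance / response length
        score += 0.2 * float(e.get("abandoned", False))
        score += 0.1 * (1.0 if e.get("thumbs", 0) < 0 else 0.0)
    return score / len(events)
```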

Tooling I'd reach for

The build-vs-buy decision for AI telemetry has shifted in the last 18 months. There are now real options.

  • LangSmith — strong if you're already using LangChain. Decent if you're not. Captures request/response/cost out of the box.
  • Helicone — proxy-based capture. Lowest integration cost — point your LLM SDK at Helicone's URL, get telemetry for free. Best for early-stage teams that want zero-config.
  • Langfuse — open source, self-hostable. Good for teams with security/data residency concerns.
  • Custom OpenTelemetry instrumentation. If you already have a strong observability stack (Datadog, Honeycomb, etc.), wrapping your model calls in OTel spans is sometimes the right answer because it integrates with existing dashboards.

For a pre-Series-A AI startup, I'd start with Helicone and graduate later. The integration cost is one afternoon. The telemetry you get back is enough to make the next dozen product decisions correctly.
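
For a sense of how small that afternoon of integration is: with the OpenAI Python SDK, the proxy-based setup is roughly a base-URL swap plus an auth header. The URL and header names below are from memory, so confirm them against Helicone's current docs.

```python
import os
from openai import OpenAI

# Point the SDK at Helicone's proxy instead of api.openai.com.
# Base URL and header names are from memory -- confirm against Helicone's docs.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Custom properties (if available on your plan) let you slice telemetry by feature later.
        "Helicone-Property-Feature": "draft_email",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Draft a two-line follow-up email."}],
)
print(response.choices[0].message.content)
```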

Model problem or instrumentation problem?

The most useful framing I've found, when an AI feature is underperforming:

Can you, right now, answer these five questions in under five minutes?

  1. What was the prompt and response of the last 10 calls to this feature?
  2. What's the median latency for this feature over the past 7 days?
  3. What's the daily cost for this feature, broken out by model?
  4. Which users had the worst experiences this week, by hallucination signal?
  5. How does any of this compare to two weeks ago?

If the answer to any of these is "I don't know" or "let me write a query," you have an instrumentation problem, not a model problem. Fix instrumentation first. Then look at the data, and the model problem either becomes obvious — or evaporates because what looked like a model problem was actually a prompt regression in last week's deploy.

A concrete example

At Lavender, we shipped a new prompt template for one of our AI features early in 2025. The hallucination signal — measured via the regenerate-rate — climbed about 60% over the next two weeks. The instinct was "the new prompt is worse, let's rewrite it."

Telemetry told a different story. The regenerate-rate climbed for users on a specific email template that one of our customer-success team members had recommended internally. The new prompt was fine. The customer template was triggering an edge case we hadn't anticipated, and the regenerate-rate spike was an artifact of that template being used 4x more than usual.

The fix was a 20-line guardrail in the prompt that handled the edge case. Hallucination signal dropped by 40% within 72 hours. We didn't tune the model. We didn't change LLMs. We did not run a single eval. We instrumented, looked at the data, found the actual cause, fixed it.

That story is impossible to tell without telemetry. Without it, the team would have spent two weeks rewriting the prompt, regressed on something else, and ended up worse off than where they started. With it, the cause was obvious within 90 minutes of looking at the data.

The takeaway

Most AI startups will eventually need to think hard about the model. None of them should think about the model first. The order is:

  1. Instrument. Capture every model call, structured and queryable.
  2. Aggregate. Build feature-, user-, and trend-level views.
  3. Look. Stare at the data for a week. Most "model problems" reveal themselves as something else.
  4. Then, if needed, tune the model. But you'll be tuning against actual data, not vibes.

The two weeks of disciplined engineering this requires is the highest-leverage AI work most startups aren't doing. It's also boring. Which is exactly why doing it is an edge over teams that go straight to fine-tuning.

The team discipline this requires

Telemetry is a code problem for half a sprint and an organizational problem forever after. The engineering team has to keep instrumentation current as new features ship, or the system rots within a quarter.

The disciplines that worked at Lavender:

  • No model call ships without telemetry. Code review checklist item, enforced. New AI feature PRs get rejected if they don't wire up the four-tuple capture.
  • One engineer owns the telemetry layer. Not full-time, but they're the named point of contact. Schema evolution, dashboard updates, retention policies — they own it. Without an owner, the layer drifts.
  • Weekly review of the dashboards. 15 minutes at the top of an engineering meeting. Just looking at the trends. Catches regressions while they're small and trains the team to think in terms of these metrics.
  • Cost alerts before user complaints. If the daily AI spend deviates from the rolling 7-day median by more than 30%, it pages the on-call. Most product issues show up here before they show up in support tickets.
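
A sketch of that spend alert as a scheduled job, assuming a hypothetical page_oncall hook and daily spend figures pulled from the rollup earlier:

```python
from statistics import median


def page_oncall(message: str) -> None:
    # Placeholder: wire this to PagerDuty, Opsgenie, Slack -- whatever actually pages someone.
    print(f"PAGE: {message}")


def check_daily_spend(last_7_days: list[float], today: float, threshold: float = 0.30) -> None:
    """Alert if today's AI spend deviates from the rolling 7-day median by more than threshold."""
    baseline = median(last_7_days)
    if baseline <= 0:
        return
    deviation = abs(today - baseline) / baseline
    if deviation > threshold:
        page_oncall(
            f"Daily AI spend ${today:,.2f} is {deviation:.0%} off the "
            f"7-day median of ${baseline:,.2f}"
        )


check_daily_spend(last_7_days=[212.0, 198.5, 220.1, 205.7, 199.9, 231.4, 210.2], today=305.0)
```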

The instrumentation work is one or two weeks. The discipline of keeping it useful is forever. Build the muscle early — adding it later, against an existing AI product with no telemetry, is meaningfully harder.