Monitoring LLMs and Agents in Production - Online Eval, Drift, Cost-per-Correct
Offline evaluation (before you ship) is the gating step. Online monitoring (after you ship) is the ongoing practice that catches regressions, drift, and unexpected failures that offline eval missed. Most teams have some offline eval. Very few have systematic online monitoring. This guide covers both and explains why you need both.
Offline Eval vs Online Monitoring
Offline Evaluation
Run against a static golden dataset before shipping. Tells you: is this version better than the previous version on the tasks I care about? Gates releases.
- Fixed, versioned dataset
- Deterministic (run twice, same result)
- Cheap to run (controlled environment)
- Does not capture production distribution shift
Online Monitoring
Run continuously against a sample of live production traffic. Tells you: is quality holding in production, with real user queries, after deployment?
- Real user queries (unpredictable)
- Sampled (1-5% is typical)
- Catches distribution drift
- Catches model provider updates
What to Monitor
Quality (LLM-as-judge on sampled traces)
Tool: Arize, Langfuse, LangSmith, HoneyHive. Sample 1-5% of production calls. Run your faithfulness + relevancy judge on the sampled responses. Track the distribution over time. A drop in mean faithfulness of 5%+ relative to the previous 7 days is a regression signal worth investigating.
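The 7-day comparison can be sketched as a simple threshold check. A minimal sketch, assuming you already collect per-response judge scores; the function name and scores are illustrative:

```python
# Sketch: flag a quality regression when the mean judge score drops 5%+
# relative to the trailing 7-day baseline. Names are illustrative.

def quality_regression(current_scores, baseline_scores, rel_drop=0.05):
    """True if the current mean judge score is rel_drop (5%) below baseline."""
    cur = sum(current_scores) / len(current_scores)
    base = sum(baseline_scores) / len(baseline_scores)
    return cur < base * (1 - rel_drop)

baseline = [0.90, 0.88, 0.92, 0.91]  # trailing 7-day sampled judge scores
today = [0.80, 0.82, 0.78, 0.81]     # today's sample
print(quality_regression(today, baseline))  # mean 0.8025 vs 0.9025 -> True
```

In practice the baseline window would come from your observability store; the comparison itself stays this simple.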
Latency (p50, p95, p99)
Tool: Any APM tool + model provider metrics. p50 latency matters for user experience. p95 and p99 matter for reliability - tail latency is where models fail users. Track TTFT (time to first token, for streaming) and total response time separately. An LLM that averages 800ms but hits 10s at p99 is worse than it appears at the median.
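A minimal nearest-rank percentile over raw latency samples shows why the median hides the tail (toy numbers; real systems would use their APM's percentile aggregation):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: p in (0, 100] over raw latency samples."""
    s = sorted(values)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

# 98 fast calls plus a heavy tail: p50 and p95 look fine, p99 does not.
latencies_ms = [800] * 98 + [9500, 10200]
print(percentile(latencies_ms, 50))  # 800
print(percentile(latencies_ms, 95))  # 800
print(percentile(latencies_ms, 99))  # 9500
```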
Cost (per request, per user, per feature)
Tool: Model provider dashboards + FinOps tool. Token usage translates directly to cost. Track cost per call by feature, user segment, and model. Unexpected cost spikes often indicate prompt injection, infinite loops in agent workflows, or a change in user query distribution (users asking longer questions).
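The token-to-dollar conversion is a few lines. A sketch with placeholder rates - the model names and prices here are illustrative, not a real price sheet; substitute your provider's current per-million-token rates:

```python
# Illustrative prices in $ per 1M tokens: (input, output). Not real rates.
PRICES = {
    "small-model": (0.15, 0.60),
    "large-model": (2.50, 10.00),
}

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call from its token counts."""
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# A 1K-input / 500-output call on the cheap model:
print(round(call_cost("small-model", 1000, 500), 6))  # 0.00045
```

Tagging each call with feature and user segment at logging time is what makes the per-feature and per-segment rollups possible later.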
Drift (input distribution + output distribution)
Tool: Arize, Langfuse (embedding features). Input drift: are users asking different types of questions than before? Track query embedding centroids over time. A shift in the mean embedding indicates the user population or task distribution is changing. Output drift: are response lengths, tones, or refusal rates changing? These can indicate provider-side model updates.
User signals (thumbs-down, retry, abandon)
Tool: Your product analytics tool. The ground truth of user satisfaction. A 5% thumbs-down rate on a feature that had 2% thumbs-down last week is a strong signal. Track these as leading indicators - they move before quality metrics because users notice problems before your automated evaluation does.
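Whether a jump from 2% to 5% thumbs-down is noise or signal depends on sample size; a standard two-proportion z-test settles it. A sketch with illustrative counts (1,000 rated responses per week):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for a change in rate: x events out of n, pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Last week: 20 thumbs-down in 1,000 rated responses (2%).
# This week: 50 in 1,000 (5%).
z = two_proportion_z(20, 1000, 50, 1000)
print(round(z, 2))  # 3.65 - well past 1.96, significant at the 5% level
```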
Sampling Strategy
You cannot LLM-as-judge every production request. At 100,000 daily calls, judging every call at $0.002 per judge call costs $200/day ($73,000/year) to monitor a single feature. Sample instead. Standard starting point: 1-5% random sample.
Random sampling is a starting point, not an optimum. Weight sampling toward high-signal events:
- Premium users or high-value accounts: failures here have higher business impact. Sample at 10-20% for this segment.
- Error signals: calls that triggered a retry, an error code, or a response length below a threshold. Sample at 100% for these.
- High-cost calls: calls with token counts in the top 10%. These are expensive and often involve complex prompts where quality is harder to maintain.
- New features or recently changed prompts: sample at higher rate (5-10%) for the first 7 days after a change, then drop to steady-state.
A well-weighted 2% sample often provides better signal quality than a random 5% sample because it concentrates evaluation budget on high-risk, high-value events.
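The weighting rules above can be sketched as a per-call sample-rate lookup. The field names and exact rates are illustrative, drawn from the guidance above; adapt them to your traffic:

```python
import random

def sample_rate(call):
    """Per-call sampling rate; first matching rule wins. Rates illustrative."""
    if call.get("error") or call.get("retried"):
        return 1.00   # error signals: keep everything
    if call.get("premium_user"):
        return 0.15   # high-value accounts: 10-20%
    if call.get("top_decile_tokens"):
        return 0.10   # expensive calls, top 10% by token count
    if call.get("new_feature"):
        return 0.07   # first 7 days after a change: 5-10%
    return 0.02       # steady-state baseline

def should_sample(call, rng=random.random):
    return rng() < sample_rate(call)

print(sample_rate({"retried": True}))       # 1.0
print(sample_rate({"premium_user": True}))  # 0.15
print(sample_rate({}))                      # 0.02
```

Injecting the random source (`rng`) keeps the decision deterministic in tests.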
Cost-per-Correct-Answer
This is the most underused metric in LLM evaluation. A model that is 10% more accurate but costs 3x more per call is usually not worth it. The formula: cost per correct = (cost per call) / (accuracy). Lower is better. Compare across model options for your specific task.
| Model | Task accuracy | Cost / 1K calls | Cost/correct |
|---|---|---|---|
| GPT-4o mini | 74% | $0.30 | $0.41 |
| GPT-4o | 84% | $2.50 | $2.98 |
| Claude 4 Haiku | 78% | $0.25 | $0.32 |
| Claude 4 Sonnet | 88% | $3.00 | $3.41 |
Illustrative example. Accuracy is task-specific. Cost per 1K calls assumes an average of 1K input + 500 output tokens per call at April 2026 pricing. Best cost/correct ratio: Claude 4 Haiku. Always measure on your specific task.
In this example, Claude 4 Haiku provides the best cost-per-correct-answer despite having lower accuracy than Claude 4 Sonnet. The 10-point accuracy gap does not justify an 11x cost difference for most use cases. For high-stakes tasks (medical, legal), accuracy may justify the cost; for high-volume routine tasks, cost-per-correct should drive the choice.
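The table's arithmetic in code, for comparing your own candidates. The figures are the article's illustrative numbers, not live pricing:

```python
# (task accuracy, $ per 1K calls) - illustrative figures from the table above.
models = {
    "GPT-4o mini":     (0.74, 0.30),
    "GPT-4o":          (0.84, 2.50),
    "Claude 4 Haiku":  (0.78, 0.25),
    "Claude 4 Sonnet": (0.88, 3.00),
}

def cost_per_correct(accuracy, cost_per_1k_calls):
    """Dollars per 1K correct answers: cost divided by accuracy."""
    return cost_per_1k_calls / accuracy

for name, (acc, cost) in sorted(models.items(),
                                key=lambda kv: cost_per_correct(*kv[1])):
    print(f"{name:16s} ${cost_per_correct(acc, cost):.2f} per 1K correct")
# Claude 4 Haiku ranks first at ~$0.32 despite lower accuracy than Sonnet.
```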
For current API pricing: claudeapipricing.com
Drift Detection
Input Drift
Are users asking different types of questions? Track query embedding centroids over time. A shift in embedding space indicates the task distribution is changing. Your offline eval golden dataset may no longer represent real traffic.
Detection: compute the centroid of query embeddings for each rolling 7-day window. Alert when cosine distance between consecutive centroids exceeds a threshold (0.05-0.10 for most applications).
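The centroid check reduces to a mean vector and a cosine distance. A sketch in which toy 3-d vectors stand in for real query embeddings, using the 0.05 threshold suggested above:

```python
import math

def centroid(embeddings):
    """Mean vector of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(v[i] for v in embeddings) / n for i in range(len(embeddings[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def drift_alert(window_prev, window_cur, threshold=0.05):
    """True when consecutive 7-day centroids diverge past the threshold."""
    return cosine_distance(centroid(window_prev), centroid(window_cur)) > threshold

prev = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
cur = [[0.2, 0.9, 0.1], [0.1, 1.0, 0.0]]  # users shifted topic
print(drift_alert(prev, cur))  # True
```

Real embeddings have hundreds of dimensions, but the window-over-window comparison is identical.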
Output Drift
Is the model producing different outputs for the same or similar inputs? This can indicate provider-side updates: even without a version bump, model behavior can change with infrastructure updates.
Detection: maintain a set of fixed “canary” prompts (20-50 examples from your golden dataset) and run them weekly. Track response length distribution, refusal rate, and LLM-as-judge score. Alert on statistically significant changes.
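A weekly canary run reduces to summarizing each run and diffing the summaries. A sketch in which the refusal markers, the 10% change threshold, and last week's numbers are all illustrative; the judge scores are assumed to come from your own scorer:

```python
# Illustrative refusal prefixes; tune to your model's actual phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def summarize(responses, scores):
    """Distill one canary run into the three tracked metrics."""
    n = len(responses)
    return {
        "mean_len": sum(len(r) for r in responses) / n,
        "refusal_rate": sum(r.lower().startswith(REFUSAL_MARKERS)
                            for r in responses) / n,
        "mean_judge": sum(scores) / n,
    }

def changed(cur, prev, rel=0.10):
    """Metrics that moved more than rel (10%) versus last week's run."""
    return [k for k in cur if prev[k] and abs(cur[k] - prev[k]) / prev[k] > rel]

run = summarize(["I can't help with that.", "Here is the answer."], [0.2, 0.9])
print(run["refusal_rate"])  # 0.5

last_week = {"mean_len": 420.0, "refusal_rate": 0.02, "mean_judge": 0.90}
this_week = {"mean_len": 250.0, "refusal_rate": 0.02, "mean_judge": 0.88}
print(changed(this_week, last_week))  # ['mean_len'] - responses got shorter
```

A fixed relative threshold is the simplest alert rule; with 20-50 canaries per run, a proper significance test on the judge scores is a reasonable upgrade.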