Monitoring LLMs and Agents in Production - Online Eval, Drift, Cost-per-Correct
Offline evaluation (before you ship) is the gating step. Online monitoring (after you ship) is the ongoing practice that catches regressions, drift, and unexpected failures that offline eval missed. Most teams have some offline eval. Very few have systematic online monitoring. This guide covers both and explains why you need both.
Offline Eval vs Online Monitoring
Offline Evaluation
Run against a static golden dataset before shipping. Tells you: is this version better than the previous version on the tasks I care about? Gates releases.
- Fixed, versioned dataset
- Deterministic (run twice, same result)
- Cheap to run (controlled environment)
- Does not capture production distribution shift
Online Monitoring
Run continuously against a sample of live production traffic. Tells you: is quality holding in production, with real user queries, after deployment?
- Real user queries (unpredictable)
- Sampled (1-5% is typical)
- Catches distribution drift
- Catches model provider updates
What to Monitor
Quality (LLM-as-judge on sampled traces)
Tool: Arize, Langfuse, LangSmith, HoneyHive. Sample 1-5% of production calls. Run your faithfulness + relevancy judge on the sampled responses. Track the distribution over time. A drop in mean faithfulness of 5%+ relative to the previous 7 days is a regression signal worth investigating.
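The 7-day comparison can be sketched as a simple threshold check. A minimal sketch, assuming you already collect per-response judge scores; the function name and scores are illustrative:

```python
# Sketch: flag a quality regression when the mean judge score drops 5%+
# relative to the trailing 7-day baseline. Names are illustrative.

def quality_regression(current_scores, baseline_scores, rel_drop=0.05):
    """True if the current mean judge score is rel_drop (5%) below baseline."""
    cur = sum(current_scores) / len(current_scores)
    base = sum(baseline_scores) / len(baseline_scores)
    return cur < base * (1 - rel_drop)

baseline = [0.90, 0.88, 0.92, 0.91]  # trailing 7-day sampled judge scores
today = [0.80, 0.82, 0.78, 0.81]     # today's sample
print(quality_regression(today, baseline))  # mean 0.8025 vs 0.9025 -> True
```

In practice the baseline window would come from your observability store; the comparison itself stays this simple.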
Latency (p50, p95, p99)
Tool: Any APM tool + model provider metrics. p50 latency matters for user experience. p95 and p99 matter for reliability - tail latency is where models fail users. Track TTFT (time to first token, for streaming) and total response time separately. An LLM that averages 800ms but hits 10s at p99 is worse than it appears at the median.
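A minimal nearest-rank percentile over raw latency samples shows why the median hides the tail (toy numbers; real systems would use their APM's percentile aggregation):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: p in (0, 100] over raw latency samples."""
    s = sorted(values)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

# 98 fast calls plus a heavy tail: p50 and p95 look fine, p99 does not.
latencies_ms = [800] * 98 + [9500, 10200]
print(percentile(latencies_ms, 50))  # 800
print(percentile(latencies_ms, 95))  # 800
print(percentile(latencies_ms, 99))  # 9500
```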
Cost (per request, per user, per feature)
Tool: Model provider dashboards + FinOps tool. Token usage translates directly to cost. Track cost per call by feature, user segment, and model. Unexpected cost spikes often indicate prompt injection, infinite loops in agent workflows, or a change in user query distribution (users asking longer questions).
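The token-to-dollar conversion is a few lines. A sketch with placeholder rates - the model names and prices here are illustrative, not a real price sheet; substitute your provider's current per-million-token rates:

```python
# Illustrative prices in $ per 1M tokens: (input, output). Not real rates.
PRICES = {
    "small-model": (0.15, 0.60),
    "large-model": (2.50, 10.00),
}

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call from its token counts."""
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# A 1K-input / 500-output call on the cheap model:
print(round(call_cost("small-model", 1000, 500), 6))  # 0.00045
```

Tagging each call with feature and user segment at logging time is what makes the per-feature and per-segment rollups possible later.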
Drift (input distribution + output distribution)
Tool: Arize, Langfuse (embedding features). Input drift: are users asking different types of questions than before? Track query embedding centroids over time. A shift in the mean embedding indicates the user population or task distribution is changing. Output drift: are response lengths, tones, or refusal rates changing? These can indicate provider-side model updates.
User signals (thumbs-down, retry, abandon)
Tool: Your product analytics tool. The ground truth of user satisfaction. A 5% thumbs-down rate on a feature that had 2% thumbs-down last week is a strong signal. Track these as leading indicators - they move before quality metrics because users notice problems before your automated evaluation does.
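Whether a jump from 2% to 5% thumbs-down is noise or signal depends on sample size; a standard two-proportion z-test settles it. A sketch with illustrative counts (1,000 rated responses per week):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for a change in rate: x events out of n, pooled SE."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Last week: 20 thumbs-down in 1,000 rated responses (2%).
# This week: 50 in 1,000 (5%).
z = two_proportion_z(20, 1000, 50, 1000)
print(round(z, 2))  # 3.65 - well past 1.96, significant at the 5% level
```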
Sampling Strategy
You cannot LLM-as-judge every production request. At 100,000 daily calls, judging every call at $0.002 per judge call costs $200/day ($73,000/year) to monitor a single feature. Sample instead. Standard starting point: 1-5% random sample.
Random sampling is a starting point, not an optimum. Weight sampling toward high-signal events:
- Premium users or high-value accounts: failures here have higher business impact. Sample at 10-20% for this segment.
- Error signals: calls that triggered a retry, an error code, or a response length below a threshold. Sample at 100% for these.
- High-cost calls: calls with token counts in the top 10%. These are expensive and often involve complex prompts where quality is harder to maintain.
- New features or recently changed prompts: sample at higher rate (5-10%) for the first 7 days after a change, then drop to steady-state.
A well-weighted 2% sample often provides better signal quality than a random 5% sample because it concentrates evaluation budget on high-risk, high-value events.
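The weighting rules above can be sketched as a per-call sample-rate lookup. The field names and exact rates are illustrative, drawn from the guidance above; adapt them to your traffic:

```python
import random

def sample_rate(call):
    """Per-call sampling rate; first matching rule wins. Rates illustrative."""
    if call.get("error") or call.get("retried"):
        return 1.00   # error signals: keep everything
    if call.get("premium_user"):
        return 0.15   # high-value accounts: 10-20%
    if call.get("top_decile_tokens"):
        return 0.10   # expensive calls, top 10% by token count
    if call.get("new_feature"):
        return 0.07   # first 7 days after a change: 5-10%
    return 0.02       # steady-state baseline

def should_sample(call, rng=random.random):
    return rng() < sample_rate(call)

print(sample_rate({"retried": True}))       # 1.0
print(sample_rate({"premium_user": True}))  # 0.15
print(sample_rate({}))                      # 0.02
```

Injecting the random source (`rng`) keeps the decision deterministic in tests.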
Cost-per-Correct-Answer
This is the most underused metric in LLM evaluation. A model that is 10% more accurate but costs 3x more per call is usually not worth it. The formula: cost per correct = (cost per call) / (accuracy). Lower is better. Compare across model options for your specific task.
| Model | Task accuracy | Cost / 1K calls | Cost/correct |
|---|---|---|---|
| GPT-4o mini | 74% | $0.30 | $0.41 |
| GPT-4o | 84% | $2.50 | $2.98 |
| Claude 4 Haiku | 78% | $0.25 | $0.32 |
| Claude 4 Sonnet | 88% | $3.00 | $3.41 |
Illustrative example. Accuracy is task-specific. Cost per 1K calls assumes an average of 1K input + 500 output tokens per call at April 2026 pricing. Best cost/correct ratio: Claude 4 Haiku. Always measure on your specific task.
In this example, Claude 4 Haiku provides the best cost-per-correct-answer despite having lower accuracy than Claude 4 Sonnet. The 10-point accuracy gap does not justify an 11x cost difference for most use cases. For high-stakes tasks (medical, legal), accuracy may justify the cost; for high-volume routine tasks, cost-per-correct should drive the choice.
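The table's arithmetic in code, for comparing your own candidates. The figures are the article's illustrative numbers, not live pricing:

```python
# (task accuracy, $ per 1K calls) - illustrative figures from the table above.
models = {
    "GPT-4o mini":     (0.74, 0.30),
    "GPT-4o":          (0.84, 2.50),
    "Claude 4 Haiku":  (0.78, 0.25),
    "Claude 4 Sonnet": (0.88, 3.00),
}

def cost_per_correct(accuracy, cost_per_1k_calls):
    """Dollars per 1K correct answers: cost divided by accuracy."""
    return cost_per_1k_calls / accuracy

for name, (acc, cost) in sorted(models.items(),
                                key=lambda kv: cost_per_correct(*kv[1])):
    print(f"{name:16s} ${cost_per_correct(acc, cost):.2f} per 1K correct")
# Claude 4 Haiku ranks first at ~$0.32 despite lower accuracy than Sonnet.
```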
For current API pricing: claudeapipricing.com
Drift Detection
Input Drift
Are users asking different types of questions? Track query embedding centroids over time. A shift in embedding space indicates the task distribution is changing. Your offline eval golden dataset may no longer represent real traffic.
Detection: compute the centroid of query embeddings for each rolling 7-day window. Alert when cosine distance between consecutive centroids exceeds a threshold (0.05-0.10 for most applications).
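The centroid check reduces to a mean vector and a cosine distance. A sketch in which toy 3-d vectors stand in for real query embeddings, using the 0.05 threshold suggested above:

```python
import math

def centroid(embeddings):
    """Mean vector of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(v[i] for v in embeddings) / n for i in range(len(embeddings[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def drift_alert(window_prev, window_cur, threshold=0.05):
    """True when consecutive 7-day centroids diverge past the threshold."""
    return cosine_distance(centroid(window_prev), centroid(window_cur)) > threshold

prev = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
cur = [[0.2, 0.9, 0.1], [0.1, 1.0, 0.0]]  # users shifted topic
print(drift_alert(prev, cur))  # True
```

Real embeddings have hundreds of dimensions, but the window-over-window comparison is identical.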
Output Drift
Is the model producing different outputs for the same or similar inputs? This can indicate provider-side updates: even without a version bump, model behavior can change with infrastructure updates.
Detection: maintain a set of fixed “canary” prompts (20-50 examples from your golden dataset) and run them weekly. Track response length distribution, refusal rate, and LLM-as-judge score. Alert on statistically significant changes.
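A weekly canary run reduces to summarizing each run and diffing the summaries. A sketch in which the refusal markers, the 10% change threshold, and last week's numbers are all illustrative; the judge scores are assumed to come from your own scorer:

```python
# Illustrative refusal prefixes; tune to your model's actual phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def summarize(responses, scores):
    """Distill one canary run into the three tracked metrics."""
    n = len(responses)
    return {
        "mean_len": sum(len(r) for r in responses) / n,
        "refusal_rate": sum(r.lower().startswith(REFUSAL_MARKERS)
                            for r in responses) / n,
        "mean_judge": sum(scores) / n,
    }

def changed(cur, prev, rel=0.10):
    """Metrics that moved more than rel (10%) versus last week's run."""
    return [k for k in cur if prev[k] and abs(cur[k] - prev[k]) / prev[k] > rel]

run = summarize(["I can't help with that.", "Here is the answer."], [0.2, 0.9])
print(run["refusal_rate"])  # 0.5

last_week = {"mean_len": 420.0, "refusal_rate": 0.02, "mean_judge": 0.90}
this_week = {"mean_len": 250.0, "refusal_rate": 0.02, "mean_judge": 0.88}
print(changed(this_week, last_week))  # ['mean_len'] - responses got shorter
```

A fixed relative threshold is the simplest alert rule; with 20-50 canaries per run, a proper significance test on the judge scores is a reasonable upgrade.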