LLM Observability in production: monitoring quality, cost, and model behavior in 2026
Executive summary
Language models require specific metrics: inference latency, cost per token, response quality and behavioral drift.
Last updated: 3/12/2026
Introduction: The AI black box
Deploying a language model in production is just the beginning. Monitoring its behavior, ensuring consistent quality, controlling costs, and detecting behavioral drift are far more complex challenges. Unlike traditional APIs, LLMs are inherently non-deterministic, and this requires a completely new approach to observability.
In 2026, mature companies don't just "trust" their models — they measure, iterate, and adjust continuously. LLM observability has become its own discipline, with specific metrics, specialized tools, and distinct operational practices.
What to monitor in an LLM system
Three pillars of observability
```
┌────────────────────────────────────────────────────────────┐
│                     LLM OBSERVABILITY                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
│ │    QUALITY     │  │      COST      │  │    BEHAVIOR    │ │
│ │                │  │                │  │                │ │
│ │ - Accuracy     │  │ - Tokens/s     │  │ - Latency      │ │
│ │ - Relevance    │  │ - Cost/$       │  │ - Tokens       │ │
│ │ - Satisfaction │  │ - Throughput   │  │ - Errors       │ │
│ │ - Utility      │  │ - Cache hit    │  │ - Failures     │ │
│ └────────────────┘  └────────────────┘  └────────────────┘ │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

Quality metrics
1. Relevance and accuracy
```typescript
// Relevance assessment with LLM-as-a-judge
interface QualityAssessment {
  query: string;
  response: string;
  relevanceScore: number;     // 0-1
  precisionScore: number;     // 0-1
  hallucinationScore: number; // 0-1 (lower is better)
}

async function assessQuality(
  query: string,
  response: string,
  context?: string
): Promise<QualityAssessment> {
  const assessmentPrompt = `
You are a quality evaluator for LLM responses.
Query: ${query}
Response: ${response}
${context ? `Context: ${context}` : ''}
Evaluate the response on three dimensions:
1. Relevance (0-1): Does the response directly address the query?
2. Accuracy (0-1): Is the information factually correct?
3. Hallucination (0-1): Does the response assert unsupported information?
Return only JSON with format:
{
  "relevanceScore": 0.9,
  "precisionScore": 0.8,
  "hallucinationScore": 0.1
}
`;

  // Named "judgement" to avoid shadowing the "response" parameter above
  const judgement = await llmClient.complete({
    messages: [{ role: 'user', content: assessmentPrompt }],
    model: 'claude-sonnet-4-6-20250214',
    temperature: 0.1 // Low temperature for consistency
  });
  return JSON.parse(judgement.content);
}
```

2. User satisfaction
```typescript
// Collecting explicit and implicit feedback
interface UserFeedback {
  requestId: string;
  userId: string;
  explicitRating?: number; // 1-5 stars
  implicitMetrics: {
    copiedToClipboard?: boolean;
    acceptedSuggestion?: boolean;
    timeToAccept?: number; // ms
    followUpQuery?: boolean;
    rephrasedQuery?: boolean;
  };
}

async function logFeedback(feedback: UserFeedback) {
  await analytics.track('llm_feedback', {
    ...feedback,
    timestamp: Date.now()
  });

  // Calculate combined satisfaction score
  const satisfactionScore = calculateSatisfactionScore(feedback);
  await metrics.gauge('llm.satisfaction.score', satisfactionScore, {
    tags: {
      userId: feedback.userId,
      requestId: feedback.requestId
    }
  });
}

function calculateSatisfactionScore(feedback: UserFeedback): number {
  let score = 0;
  let factors = 0;
  if (feedback.explicitRating !== undefined) {
    score += feedback.explicitRating / 5; // Normalize to 0-1
    factors += 1;
  }
  if (feedback.implicitMetrics.copiedToClipboard) {
    score += 0.8; // Copying indicates high satisfaction
    factors += 1;
  }
  if (feedback.implicitMetrics.acceptedSuggestion) {
    score += 0.7;
    factors += 1;
  }
  return factors > 0 ? score / factors : 0;
}
```

3. Automatic evaluation with RAGAS
```typescript
// Using RAGAS-style automatic evaluation
// (RAGAS itself is a Python library; this TypeScript client is illustrative)
import { RagasEvaluator } from 'ragas';

const evaluator = new RagasEvaluator({
  model: 'claude-sonnet-4-6-20250214',
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function evaluateRAGPipeline(
  query: string,
  retrievedDocs: string[],
  generatedResponse: string,
  groundTruth?: string
) {
  const metrics = await evaluator.evaluate({
    question: query,
    context: retrievedDocs,
    answer: generatedResponse,
    ground_truth: groundTruth
  });

  // Automatically calculated metrics
  return {
    faithfulness: metrics.faithfulness,                 // Is response faithful to context?
    answerRelevancy: metrics.answer_relevancy,          // Is response relevant?
    contextPrecision: metrics.context_precision,        // Is context precise?
    contextRecall: metrics.context_recall,              // Is context complete?
    contextEntityRecall: metrics.context_entity_recall, // Retrieved entities?
    answerSimilarity: groundTruth ? metrics.answer_similarity : null // Similarity to correct answer
  };
}
```

Cost metrics
1. Cost per token
```typescript
// Detailed cost tracking
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cost: number;
}

interface ModelPricing {
  model: string;
  promptCostPer1K: number;
  completionCostPer1K: number;
}

const MODEL_PRICING: Record<string, ModelPricing> = {
  'claude-sonnet-4-6-20250214': {
    model: 'claude-sonnet-4-6-20250214',
    promptCostPer1K: 0.003,    // $0.003 per 1K prompt tokens
    completionCostPer1K: 0.015 // $0.015 per 1K completion tokens
  },
  'claude-opus-4-6-20250214': {
    model: 'claude-opus-4-6-20250214',
    promptCostPer1K: 0.015,
    completionCostPer1K: 0.075
  }
};

function calculateTokenCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): TokenUsage {
  const pricing = MODEL_PRICING[model];
  if (!pricing) {
    throw new Error(`Unknown model: ${model}`);
  }
  const promptCost = (promptTokens / 1000) * pricing.promptCostPer1K;
  const completionCost = (completionTokens / 1000) * pricing.completionCostPer1K;
  const totalCost = promptCost + completionCost;
  return {
    promptTokens,
    completionTokens,
    totalTokens: promptTokens + completionTokens,
    cost: totalCost
  };
}

// Middleware for automatic tracking
export async function trackLLMRequest<T>(
  model: string,
  request: () => Promise<{ promptTokens: number; completionTokens: number; result: T }>
): Promise<{ result: T; usage: TokenUsage }> {
  const startTime = Date.now();
  const { promptTokens, completionTokens, result } = await request();
  const latency = Date.now() - startTime;
  const usage = calculateTokenCost(model, promptTokens, completionTokens);

  // Register metrics
  await metrics.histogram('llm.request.duration', latency, {
    tags: {
      model,
      operation: 'inference'
    }
  });
  await metrics.gauge('llm.request.tokens.prompt', promptTokens, { tags: { model } });
  await metrics.gauge('llm.request.tokens.completion', completionTokens, { tags: { model } });
  await metrics.gauge('llm.request.cost', usage.cost, { tags: { model } });

  return { result, usage };
}

// Usage
const { result, usage } = await trackLLMRequest(
  'claude-sonnet-4-6-20250214',
  async () => {
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-6-20250214',
      max_tokens: 1024,
      messages: [{ role: 'user', content: '...' }]
    });
    return {
      promptTokens: response.usage.input_tokens,
      completionTokens: response.usage.output_tokens,
      result: response.content
    };
  }
);
console.log(`Cost: $${usage.cost.toFixed(4)}`);
```

2. Cost optimization with semantic cache
```typescript
// Semantic cache for cost reduction
import { embeddingModel } from './embeddings';

interface SemanticCacheEntry {
  query: string;
  queryEmbedding: number[];
  response: string;
  cachedAt: number;
  hits: number;
}

const semanticCache = new Map<string, SemanticCacheEntry>();
const SIMILARITY_THRESHOLD = 0.95;

async function getCachedResponse(query: string): Promise<string | null> {
  const queryEmbedding = await embeddingModel.embed(query);

  // Search for similar entries in cache
  for (const entry of semanticCache.values()) {
    const similarity = cosineSimilarity(queryEmbedding, entry.queryEmbedding);
    if (similarity > SIMILARITY_THRESHOLD) {
      // Cache hit
      entry.hits++;
      await metrics.increment('llm.cache.hit', {
        tags: { type: 'semantic' }
      });
      // Log cost savings
      const estimatedSavings = estimateCostSavings(entry);
      await metrics.gauge('llm.cache.savings', estimatedSavings);
      return entry.response;
    }
  }

  await metrics.increment('llm.cache.miss', {
    tags: { type: 'semantic' }
  });
  return null;
}

async function cacheResponse(query: string, response: string): Promise<void> {
  const queryEmbedding = await embeddingModel.embed(query);
  const entry: SemanticCacheEntry = {
    query,
    queryEmbedding,
    response,
    cachedAt: Date.now(),
    hits: 0
  };
  semanticCache.set(query, entry);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

// Integration with LLM request
async function getLLMResponseWithCache(query: string): Promise<string> {
  // Try cache first
  const cached = await getCachedResponse(query);
  if (cached) {
    return cached;
  }

  // Cache miss: make request
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6-20250214',
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }]
  });
  const result = response.content[0].text;

  // Cache the response
  await cacheResponse(query, result);
  return result;
}
```

Behavior metrics
1. Behavioral drift detection
```typescript
// Drift detection using embedding distributions
import { embeddingModel } from './embeddings';

interface ResponseDistribution {
  model: string;
  date: string;
  embeddings: number[][];
  mean: number[];
  covariance: number[][];
}

async function detectBehavioralDrift(
  currentResponses: string[],
  historicalDistributions: ResponseDistribution[]
): Promise<{ drifted: boolean; score: number }> {
  const currentEmbeddings = await Promise.all(
    currentResponses.map(r => embeddingModel.embed(r))
  );
  const currentDistribution = calculateDistribution(currentEmbeddings);

  let maxDriftScore = 0;
  for (const historical of historicalDistributions) {
    const driftScore = calculateDrift(currentDistribution, historical);
    maxDriftScore = Math.max(maxDriftScore, driftScore);
  }

  return {
    drifted: maxDriftScore > 0.7, // Adjustable threshold
    score: maxDriftScore
  };
}

function calculateDistribution(embeddings: number[][]): {
  mean: number[];
  covariance: number[][];
} {
  const dimensions = embeddings[0].length;
  const mean = new Array(dimensions).fill(0);

  // Calculate mean
  for (const emb of embeddings) {
    for (let i = 0; i < dimensions; i++) {
      mean[i] += emb[i] / embeddings.length;
    }
  }

  // Calculate covariance
  const covariance: number[][] = [];
  for (let i = 0; i < dimensions; i++) {
    covariance[i] = [];
    for (let j = 0; j < dimensions; j++) {
      let cov = 0;
      for (const emb of embeddings) {
        cov += (emb[i] - mean[i]) * (emb[j] - mean[j]);
      }
      covariance[i][j] = cov / embeddings.length;
    }
  }

  return { mean, covariance };
}

function calculateDrift(
  current: { mean: number[]; covariance: number[][] },
  historical: { mean: number[]; covariance: number[][] }
): number {
  // Euclidean distance between distribution means: a simplified stand-in
  // for the Wasserstein distance that ignores the covariance terms
  let distance = 0;
  for (let i = 0; i < current.mean.length; i++) {
    distance += Math.pow(current.mean[i] - historical.mean[i], 2);
  }
  return Math.sqrt(distance);
}
```

2. Hallucination detection
```typescript
// Hallucination detection using citations
interface CitationCheck {
  response: string;
  citedSources: string[];
  uncitedClaims: string[];
  hallucinationProbability: number;
}

async function detectHallucinations(
  response: string,
  retrievedContext: string[]
): Promise<CitationCheck> {
  // Extract claims from response
  const claims = await extractClaims(response);
  const citedSources: string[] = [];
  const uncitedClaims: string[] = [];

  for (const claim of claims) {
    const isSupported = await checkClaimSupport(claim, retrievedContext);
    if (isSupported.supported) {
      citedSources.push(...isSupported.sources);
    } else {
      uncitedClaims.push(claim);
    }
  }

  // Guard against division by zero when no claims are extracted
  const hallucinationProbability =
    claims.length > 0 ? uncitedClaims.length / claims.length : 0;

  return {
    response,
    citedSources,
    uncitedClaims,
    hallucinationProbability
  };
}

async function checkClaimSupport(
  claim: string,
  context: string[]
): Promise<{ supported: boolean; sources: string[] }> {
  const prompt = `
Determine if the following claim is supported by the provided context.
Claim: ${claim}
Context:
${context.map((c, i) => `[${i + 1}] ${c}`).join('\n')}
Respond with JSON:
{
  "supported": true/false,
  "sources": [1, 2, ...], // indices of supporting sources
  "confidence": 0-1
}
`;
  const response = await llmClient.complete({
    messages: [{ role: 'user', content: prompt }],
    model: 'claude-sonnet-4-6-20250214',
    temperature: 0
  });

  // The model returns 1-based indices; map them back to the context strings
  const parsed = JSON.parse(response.content);
  return {
    supported: parsed.supported,
    sources: (parsed.sources ?? []).map((i: number) => context[i - 1])
  };
}
```

Observability architecture
Complete monitoring pipeline
```typescript
// Centralized LLM observability system
class LLMObservabilitySystem {
  private qualityMetrics: Map<string, QualityAssessment> = new Map();
  private costMetrics: Map<string, TokenUsage> = new Map();
  private behaviorMetrics: Map<string, ResponseDistribution> = new Map();

  async trackRequest(requestId: string, config: {
    query: string;
    response: string;
    model: string;
    promptTokens: number;
    completionTokens: number;
    latency: number;
    context?: string[];
    userId?: string;
  }) {
    // Track cost
    const costUsage = calculateTokenCost(
      config.model,
      config.promptTokens,
      config.completionTokens
    );
    this.costMetrics.set(requestId, costUsage);

    // Assess quality
    const quality = await assessQuality(
      config.query,
      config.response,
      config.context?.join('\n')
    );
    this.qualityMetrics.set(requestId, quality);

    // Track behavior metrics
    await this.trackBehavior(requestId, config);

    // Calculate composite metrics
    const overallScore = this.calculateOverallScore(
      quality,
      costUsage,
      config.latency
    );

    // Alert if score falls below threshold
    if (overallScore < 0.7) {
      await this.alertPoorPerformance(requestId, overallScore);
    }

    return { quality, costUsage, overallScore };
  }

  private async trackBehavior(
    requestId: string,
    config: any
  ) {
    const embedding = await embeddingModel.embed(config.response);

    // Check for drift
    const historical = Array.from(this.behaviorMetrics.values());
    const { drifted, score } = await detectBehavioralDrift(
      [config.response],
      historical
    );
    if (drifted) {
      await this.alertDrift(requestId, score);
    }

    // Store current distribution
    this.behaviorMetrics.set(requestId, {
      model: config.model,
      date: new Date().toISOString(),
      embeddings: [embedding],
      mean: embedding,
      covariance: []
    });
  }

  private calculateOverallScore(
    quality: QualityAssessment,
    costUsage: TokenUsage,
    latency: number
  ): number {
    // Metric weighting
    const qualityWeight = 0.5;
    const costWeight = 0.3;
    const latencyWeight = 0.2;

    // Normalize latency (score decays linearly, reaching 0 at 10 s)
    const normalizedLatency = Math.max(0, 1 - latency / 10000);
    // Normalize cost (score decays linearly, reaching 0 at $0.10)
    const normalizedCost = Math.max(0, 1 - costUsage.cost / 0.1);

    // Weighted average
    const qualityScore = (
      quality.relevanceScore +
      quality.precisionScore +
      (1 - quality.hallucinationScore)
    ) / 3;

    return (
      qualityScore * qualityWeight +
      normalizedCost * costWeight +
      normalizedLatency * latencyWeight
    );
  }

  private async alertPoorPerformance(requestId: string, score: number) {
    await alerts.send({
      severity: 'warning',
      title: 'LLM Performance Alert',
      message: `Request ${requestId} has poor overall score: ${score.toFixed(2)}`,
      metadata: {
        requestId,
        score
      }
    });
  }

  private async alertDrift(requestId: string, score: number) {
    await alerts.send({
      severity: 'critical',
      title: 'LLM Behavioral Drift Detected',
      message: `Request ${requestId} shows behavioral drift: ${score.toFixed(2)}`,
      metadata: {
        requestId,
        driftScore: score
      }
    });
  }
}
```

Dashboards and visualization
Real-time metrics dashboard
```typescript
// Integration with Grafana/Prometheus
import { Registry, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// Metrics
const llmRequestDuration = new Histogram({
  name: 'llm_request_duration_seconds',
  help: 'Duration of LLM requests',
  labelNames: ['model', 'operation'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});
const llmRequestCost = new Gauge({
  name: 'llm_request_cost_dollars',
  help: 'Cost of LLM requests in dollars',
  labelNames: ['model']
});
const llmCacheHitRate = new Gauge({
  name: 'llm_cache_hit_rate',
  help: 'Cache hit rate for LLM requests',
  labelNames: ['cache_type']
});
const llmQualityScore = new Gauge({
  name: 'llm_quality_score',
  help: 'Quality score of LLM responses',
  labelNames: ['metric_type']
});

registry.registerMetric(llmRequestDuration);
registry.registerMetric(llmRequestCost);
registry.registerMetric(llmCacheHitRate);
registry.registerMetric(llmQualityScore);

// Metrics endpoint for Prometheus
export async function metricsHandler(req: Request): Promise<Response> {
  return new Response(await registry.metrics(), {
    headers: { 'Content-Type': 'text/plain' }
  });
}
```

Success metrics
To validate that your observability system is working:
- Time to detect drift: Target <24 hours after behavioral change
- Relevant alert rate: >80% of alerts should trigger corrective action
- Observability cost: <10% of total LLM inference cost
- Monitored request coverage: >95% of production requests
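As a rough sketch, these targets can be rolled into a single automated health check over aggregate counters. The `ObservabilityStats` shape, field names, and thresholds below are illustrative assumptions, not part of any particular tool:

```typescript
// Hypothetical aggregate counters collected over a reporting window
interface ObservabilityStats {
  totalRequests: number;        // all production LLM requests
  monitoredRequests: number;    // requests with full telemetry
  alertsFired: number;          // alerts emitted in the window
  alertsActedOn: number;        // alerts that led to corrective action
  inferenceCostUsd: number;     // total model inference spend
  observabilityCostUsd: number; // spend on evals, judges, storage
  hoursToDetectDrift?: number;  // measured from known drift incidents
}

interface HealthReport {
  coverage: number;       // monitored / total
  alertPrecision: number; // acted-on / fired
  costRatio: number;      // observability / inference spend
  healthy: boolean;       // all targets met
}

function assessObservabilityHealth(s: ObservabilityStats): HealthReport {
  const coverage = s.totalRequests > 0 ? s.monitoredRequests / s.totalRequests : 0;
  const alertPrecision = s.alertsFired > 0 ? s.alertsActedOn / s.alertsFired : 1;
  const costRatio = s.inferenceCostUsd > 0 ? s.observabilityCostUsd / s.inferenceCostUsd : 0;
  const driftOk = s.hoursToDetectDrift === undefined || s.hoursToDetectDrift < 24;
  return {
    coverage,
    alertPrecision,
    costRatio,
    healthy: coverage > 0.95 && alertPrecision > 0.8 && costRatio < 0.1 && driftOk
  };
}
```

Feeding these counters from the same pipeline that populates the Prometheus metrics lets one set of thresholds drive both dashboard panels and alerts.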
Is your production LLM system suffering from unpredictable costs, inconsistent quality, or undetected behavioral drift? Talk to Imperialis specialists about LLM observability, from quality metrics to drift detection, and operate your models in production with confidence.
Sources
- Anthropic documentation — Official Anthropic documentation
- RAGAS library — Automatic RAG evaluation
- LangSmith — LLM observability platform
- OpenTelemetry for AI — AI telemetry standards