Applied AI

LLM Observability in production: monitoring quality, cost and model behavior in 2026

Language models require specific metrics: inference latency, cost per token, response quality and behavioral drift.

3/12/2026 · 10 min read · AI

Executive summary

Language models require specific metrics: inference latency, cost per token, response quality and behavioral drift.

Last updated: 3/12/2026

Introduction: The AI black box

Deploying a language model in production is just the beginning. Monitoring its behavior, ensuring consistent quality, controlling costs, and detecting behavioral drift are far more complex challenges. Unlike traditional APIs, LLMs are inherently non-deterministic, and this requires a completely new approach to observability.

In 2026, mature companies don't just "trust" their models — they measure, iterate, and adjust continuously. LLM observability has become its own discipline, with specific metrics, specialized tools, and distinct operational practices.

What to monitor in an LLM system

Three pillars of observability

┌──────────────────────────────────────────────────────────────┐
│                      LLM OBSERVABILITY                       │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │    QUALITY     │  │      COST      │  │    BEHAVIOR    │  │
│  │                │  │                │  │                │  │
│  │ - Accuracy     │  │ - Tokens/s     │  │ - Latency      │  │
│  │ - Relevance    │  │ - Cost/$       │  │ - Tokens       │  │
│  │ - Satisfaction │  │ - Throughput   │  │ - Errors       │  │
│  │ - Utility      │  │ - Cache hit    │  │ - Failures     │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Quality metrics

1. Relevance and accuracy

// Relevance assessment with LLM-as-a-judge
interface QualityAssessment {
  query: string;
  response: string;
  relevanceScore: number; // 0-1
  precisionScore: number; // 0-1
  hallucinationScore: number; // 0-1 (lower is better)
}

async function assessQuality(
  query: string,
  response: string,
  context?: string
): Promise<QualityAssessment> {
  const assessmentPrompt = `
You are a quality evaluator for LLM responses.

Query: ${query}

Response: ${response}

${context ? `Context: ${context}` : ''}

Evaluate the response on three dimensions:
1. Relevance (0-1): Does the response directly address the query?
2. Accuracy (0-1): Is the information factually correct?
3. Hallucination (0-1): Does the response assert unsupported information?

Return only JSON with format:
{
  "relevanceScore": 0.9,
  "precisionScore": 0.8,
  "hallucinationScore": 0.1
}
  `;

  // Note: a distinct name is needed here; reusing `response` would
  // shadow the function parameter and fail to compile
  const judgment = await llmClient.complete({
    messages: [{ role: 'user', content: assessmentPrompt }],
    model: 'claude-sonnet-4-6-20250214',
    temperature: 0.1 // Low temperature for consistency
  });

  return JSON.parse(judgment.content);
}

2. User satisfaction

// Collecting explicit and implicit feedback
interface UserFeedback {
  requestId: string;
  userId: string;
  explicitRating?: number; // 1-5 stars
  implicitMetrics: {
    copiedToClipboard?: boolean;
    acceptedSuggestion?: boolean;
    timeToAccept?: number; // ms
    followUpQuery?: boolean;
    rephrasedQuery?: boolean;
  };
}

async function logFeedback(feedback: UserFeedback) {
  await analytics.track('llm_feedback', {
    ...feedback,
    timestamp: Date.now()
  });

  // Calculate combined satisfaction score
  const satisfactionScore = calculateSatisfactionScore(feedback);

  await metrics.gauge('llm.satisfaction.score', satisfactionScore, {
    tags: {
      userId: feedback.userId,
      requestId: feedback.requestId
    }
  });
}

function calculateSatisfactionScore(feedback: UserFeedback): number {
  let score = 0;
  let factors = 0;

  if (feedback.explicitRating !== undefined) {
    score += feedback.explicitRating / 5; // Normalize to 0-1
    factors += 1;
  }

  if (feedback.implicitMetrics.copiedToClipboard) {
    score += 0.8; // Copying indicates high satisfaction
    factors += 1;
  }

  if (feedback.implicitMetrics.acceptedSuggestion) {
    score += 0.7;
    factors += 1;
  }

  return factors > 0 ? score / factors : 0;
}

3. Automatic evaluation with RAGAS

// Using RAGAS for automatic evaluation
import { RagasEvaluator } from 'ragas';

const evaluator = new RagasEvaluator({
  model: 'claude-sonnet-4-6-20250214',
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function evaluateRAGPipeline(
  query: string,
  retrievedDocs: string[],
  generatedResponse: string,
  groundTruth?: string
) {
  const metrics = await evaluator.evaluate({
    question: query,
    context: retrievedDocs,
    answer: generatedResponse,
    ground_truth: groundTruth
  });

  // Automatically calculated metrics
  return {
    faithfulness: metrics.faithfulness, // Is response faithful to context?
    answerRelevancy: metrics.answer_relevancy, // Is response relevant?
    contextPrecision: metrics.context_precision, // Is context precise?
    contextRecall: metrics.context_recall, // Is context complete?
    contextEntityRecall: metrics.context_entity_recall, // Retrieved entities?
    answerSimilarity: groundTruth ? metrics.answer_similarity : null // Similarity to correct answer
  };
}

Cost metrics

1. Cost per token

// Detailed cost tracking
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cost: number;
}

interface ModelPricing {
  model: string;
  promptCostPer1K: number;
  completionCostPer1K: number;
}

const MODEL_PRICING: Record<string, ModelPricing> = {
  'claude-sonnet-4-6-20250214': {
    model: 'claude-sonnet-4-6-20250214',
    promptCostPer1K: 0.003, // $0.003 per 1K prompt tokens
    completionCostPer1K: 0.015 // $0.015 per 1K completion tokens
  },
  'claude-opus-4-6-20250214': {
    model: 'claude-opus-4-6-20250214',
    promptCostPer1K: 0.015,
    completionCostPer1K: 0.075
  }
};

function calculateTokenCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): TokenUsage {
  const pricing = MODEL_PRICING[model];

  if (!pricing) {
    throw new Error(`Unknown model: ${model}`);
  }

  const promptCost = (promptTokens / 1000) * pricing.promptCostPer1K;
  const completionCost = (completionTokens / 1000) * pricing.completionCostPer1K;
  const totalCost = promptCost + completionCost;

  return {
    promptTokens,
    completionTokens,
    totalTokens: promptTokens + completionTokens,
    cost: totalCost
  };
}

// Middleware for automatic tracking
export async function trackLLMRequest<T>(
  model: string,
  request: () => Promise<{ promptTokens: number; completionTokens: number; result: T }>
): Promise<{ result: T; usage: TokenUsage }> {
  const startTime = Date.now();

  const { promptTokens, completionTokens, result } = await request();

  const latency = Date.now() - startTime;
  const usage = calculateTokenCost(model, promptTokens, completionTokens);

  // Register metrics
  await metrics.histogram('llm.request.duration', latency, {
    tags: {
      model,
      operation: 'inference'
    }
  });

  await metrics.gauge('llm.request.tokens.prompt', promptTokens, { tags: { model } });
  await metrics.gauge('llm.request.tokens.completion', completionTokens, { tags: { model } });
  await metrics.gauge('llm.request.cost', usage.cost, { tags: { model } });

  return { result, usage };
}

// Usage
const { result, usage } = await trackLLMRequest(
  'claude-sonnet-4-6-20250214',
  async () => {
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-6-20250214',
      max_tokens: 1024,
      messages: [{ role: 'user', content: '...' }]
    });

    return {
      promptTokens: response.usage.input_tokens,
      completionTokens: response.usage.output_tokens,
      result: response.content
    };
  }
);

console.log(`Cost: $${usage.cost.toFixed(4)}`);

2. Cost optimization with semantic cache

// Semantic cache for cost reduction
import { embeddingModel } from './embeddings';
import { vectorStore } from './vector-store';

interface SemanticCacheEntry {
  query: string;
  queryEmbedding: number[];
  response: string;
  cachedAt: number;
  hits: number;
}

const semanticCache = new Map<string, SemanticCacheEntry>();
const SIMILARITY_THRESHOLD = 0.95;

async function getCachedResponse(query: string): Promise<string | null> {
  const queryEmbedding = await embeddingModel.embed(query);

  // Search for similar entries in cache
  for (const [key, entry] of semanticCache) {
    const similarity = cosineSimilarity(queryEmbedding, entry.queryEmbedding);

    if (similarity > SIMILARITY_THRESHOLD) {
      // Cache hit
      entry.hits++;

      await metrics.increment('llm.cache.hit', {
        tags: { type: 'semantic' }
      });

      // Log cost savings
      const estimatedSavings = estimateCostSavings(entry);
      await metrics.gauge('llm.cache.savings', estimatedSavings);

      return entry.response;
    }
  }

  await metrics.increment('llm.cache.miss', {
    tags: { type: 'semantic' }
  });

  return null;
}

async function cacheResponse(query: string, response: string): Promise<void> {
  const queryEmbedding = await embeddingModel.embed(query);

  const entry: SemanticCacheEntry = {
    query,
    queryEmbedding,
    response,
    cachedAt: Date.now(),
    hits: 0
  };

  semanticCache.set(query, entry);
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));

  return dotProduct / (magnitudeA * magnitudeB);
}

// Integration with LLM request
async function getLLMResponseWithCache(query: string): Promise<string> {
  // Try cache first
  const cached = await getCachedResponse(query);
  if (cached) {
    return cached;
  }

  // Cache miss: make request
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6-20250214',
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }]
  });

  const result = response.content[0].text;

  // Cache the response
  await cacheResponse(query, result);

  return result;
}
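One caveat with the in-memory cache above: it grows without bound, and every lookup scans all entries. At minimum it needs an eviction policy. The sketch below (TTL expiry plus least-hit trimming) is illustrative; the constants and function name are assumptions, not part of the article's stack.

```typescript
// Illustrative eviction for the in-memory cache above: expire entries
// older than a TTL, then trim the least-hit entries until the cache
// fits under maxEntries. Constants are assumptions, not measured values.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours
const MAX_ENTRIES = 10_000;

function evictStaleEntries(
  cache: Map<string, { cachedAt: number; hits: number }>,
  now: number = Date.now(),
  ttlMs: number = CACHE_TTL_MS,
  maxEntries: number = MAX_ENTRIES
): void {
  // 1. Expire by age (deleting during Map iteration is safe in JS)
  for (const [key, entry] of cache) {
    if (now - entry.cachedAt > ttlMs) {
      cache.delete(key);
    }
  }

  // 2. If still over capacity, drop the least-hit entries first
  if (cache.size > maxEntries) {
    const byHits = [...cache.entries()].sort((a, b) => a[1].hits - b[1].hits);
    for (const [key] of byHits.slice(0, cache.size - maxEntries)) {
      cache.delete(key);
    }
  }
}
```

Running a sweep like this periodically (for example on a timer, or every N cache writes) keeps both memory usage and lookup cost bounded.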

Behavior metrics

1. Behavioral drift detection

// Drift detection using embedding distributions
import { EmbeddingModel } from './embeddings';

interface ResponseDistribution {
  model: string;
  date: string;
  embeddings: number[][];
  mean: number[];
  covariance: number[][];
}

async function detectBehavioralDrift(
  currentResponses: string[],
  historicalDistributions: ResponseDistribution[]
): Promise<{ drifted: boolean; score: number }> {
  const currentEmbeddings = await Promise.all(
    currentResponses.map(r => embeddingModel.embed(r))
  );

  const currentDistribution = calculateDistribution(currentEmbeddings);

  let maxDriftScore = 0;

  for (const historical of historicalDistributions) {
    const driftScore = calculateDrift(currentDistribution, historical);
    maxDriftScore = Math.max(maxDriftScore, driftScore);
  }

  return {
    drifted: maxDriftScore > 0.7, // Adjustable threshold
    score: maxDriftScore
  };
}

function calculateDistribution(embeddings: number[][]): {
  mean: number[];
  covariance: number[][];
} {
  const dimensions = embeddings[0].length;
  const mean = new Array(dimensions).fill(0);

  // Calculate mean
  for (const emb of embeddings) {
    for (let i = 0; i < dimensions; i++) {
      mean[i] += emb[i] / embeddings.length;
    }
  }

  // Calculate covariance
  const covariance: number[][] = [];
  for (let i = 0; i < dimensions; i++) {
    covariance[i] = [];
    for (let j = 0; j < dimensions; j++) {
      let cov = 0;
      for (const emb of embeddings) {
        cov += (emb[i] - mean[i]) * (emb[j] - mean[j]);
      }
      covariance[i][j] = cov / embeddings.length;
    }
  }

  return { mean, covariance };
}

function calculateDrift(
  current: { mean: number[]; covariance: number[][] },
  historical: { mean: number[]; covariance: number[][] }
): number {
  // Simplified Wasserstein distance: only the mean term is kept;
  // the covariance contribution is ignored
  let distance = 0;

  for (let i = 0; i < current.mean.length; i++) {
    distance += Math.pow(current.mean[i] - historical.mean[i], 2);
  }

  return Math.sqrt(distance);
}

2. Hallucination detection

// Hallucination detection using citations
interface CitationCheck {
  response: string;
  citedSources: string[];
  uncitedClaims: string[];
  hallucinationProbability: number;
}

async function detectHallucinations(
  response: string,
  retrievedContext: string[]
): Promise<CitationCheck> {
  // Extract claims from response
  const claims = await extractClaims(response);

  const citedSources: string[] = [];
  const uncitedClaims: string[] = [];

  for (const claim of claims) {
    const isSupported = await checkClaimSupport(claim, retrievedContext);

    if (isSupported.supported) {
      citedSources.push(...isSupported.sources);
    } else {
      uncitedClaims.push(claim);
    }
  }

  const hallucinationProbability =
    claims.length > 0 ? uncitedClaims.length / claims.length : 0;

  return {
    response,
    citedSources,
    uncitedClaims,
    hallucinationProbability
  };
}

async function checkClaimSupport(
  claim: string,
  context: string[]
): Promise<{ supported: boolean; sources: string[] }> {
  const prompt = `
Determine if the following claim is supported by the provided context.

Claim: ${claim}

Context:
${context.map((c, i) => `[${i + 1}] ${c}`).join('\n')}

Respond with JSON:
{
  "supported": true/false,
  "sources": [1, 2, ...], // indices of supporting sources
  "confidence": 0-1
}
  `;

  const response = await llmClient.complete({
    messages: [{ role: 'user', content: prompt }],
    model: 'claude-sonnet-4-6-20250214',
    temperature: 0
  });

  return JSON.parse(response.content);
}
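The `extractClaims` helper is referenced above but not shown; in practice it would likely be another LLM call. As a purely heuristic stand-in, a naive sentence-splitting sketch (hypothetical, for illustration only) might be:

```typescript
// Naive stand-in for the extractClaims helper referenced above:
// treat each declarative sentence as one claim. A production system
// would typically use an LLM call instead of this heuristic.
function extractClaimsNaive(response: string): string[] {
  return response
    .split(/(?<=[.!?])\s+/)            // split on sentence boundaries
    .map(s => s.trim())
    .filter(s => s.length > 0 && !s.endsWith('?')); // drop questions
}

const claims = extractClaimsNaive(
  'The Eiffel Tower is 330 m tall. It opened in 1889. Is it in Paris?'
);
// → ['The Eiffel Tower is 330 m tall.', 'It opened in 1889.']
```

The trade-off: a heuristic splitter is cheap and deterministic, but it treats multi-fact sentences as single claims, so the resulting hallucination probability is coarser than one based on LLM-extracted atomic claims.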

Observability architecture

Complete monitoring pipeline

// Centralized LLM observability system
class LLMObservabilitySystem {
  private qualityMetrics: Map<string, QualityAssessment> = new Map();
  private costMetrics: Map<string, TokenUsage> = new Map();
  private behaviorMetrics: Map<string, ResponseDistribution> = new Map();

  async trackRequest(requestId: string, config: {
    query: string;
    response: string;
    model: string;
    promptTokens: number;
    completionTokens: number;
    latency: number;
    context?: string[];
    userId?: string;
  }) {
    // Track cost
    const costUsage = calculateTokenCost(
      config.model,
      config.promptTokens,
      config.completionTokens
    );
    this.costMetrics.set(requestId, costUsage);

    // Assess quality
    const quality = await assessQuality(
      config.query,
      config.response,
      config.context?.join('\n')
    );
    this.qualityMetrics.set(requestId, quality);

    // Track behavior metrics
    await this.trackBehavior(requestId, config);

    // Calculate composite metrics
    const overallScore = this.calculateOverallScore(
      quality,
      costUsage,
      config.latency
    );

    // Alert if score falls below threshold
    if (overallScore < 0.7) {
      await this.alertPoorPerformance(requestId, overallScore);
    }

    return { quality, costUsage, overallScore };
  }

  private async trackBehavior(
    requestId: string,
    config: any
  ) {
    const embedding = await embeddingModel.embed(config.response);

    // Check for drift
    const historical = Array.from(this.behaviorMetrics.values());
    const { drifted, score } = await detectBehavioralDrift(
      [config.response],
      historical
    );

    if (drifted) {
      await this.alertDrift(requestId, score);
    }

    // Store current distribution
    this.behaviorMetrics.set(requestId, {
      model: config.model,
      date: new Date().toISOString(),
      embeddings: [embedding],
      mean: embedding,
      covariance: []
    });
  }

  private calculateOverallScore(
    quality: QualityAssessment,
    costUsage: TokenUsage,
    latency: number
  ): number {
    // Metric weighting
    const qualityWeight = 0.5;
    const costWeight = 0.3;
    const latencyWeight = 0.2;

    // Normalize latency (linear penalty; 10s or more scores 0)
    const normalizedLatency = Math.max(0, 1 - latency / 10000);

    // Normalize cost (linear penalty; $0.10 or more scores 0)
    const normalizedCost = Math.max(0, 1 - costUsage.cost / 0.1);

    // Weighted average
    const qualityScore = (
      quality.relevanceScore +
      quality.precisionScore +
      (1 - quality.hallucinationScore)
    ) / 3;

    return (
      qualityScore * qualityWeight +
      normalizedCost * costWeight +
      normalizedLatency * latencyWeight
    );
  }

  private async alertPoorPerformance(requestId: string, score: number) {
    await alerts.send({
      severity: 'warning',
      title: 'LLM Performance Alert',
      message: `Request ${requestId} has poor overall score: ${score.toFixed(2)}`,
      metadata: {
        requestId,
        score
      }
    });
  }

  private async alertDrift(requestId: string, score: number) {
    await alerts.send({
      severity: 'critical',
      title: 'LLM Behavioral Drift Detected',
      message: `Request ${requestId} shows behavioral drift: ${score.toFixed(2)}`,
      metadata: {
        requestId,
        driftScore: score
      }
    });
  }
}

Dashboards and visualization

Real-time metrics dashboard

// Integration with Grafana/Prometheus
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// Metrics
const llmRequestDuration = new Histogram({
  name: 'llm_request_duration_seconds',
  help: 'Duration of LLM requests',
  labelNames: ['model', 'operation'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

const llmRequestCost = new Gauge({
  name: 'llm_request_cost_dollars',
  help: 'Cost of LLM requests in dollars',
  labelNames: ['model']
});

const llmCacheHitRate = new Gauge({
  name: 'llm_cache_hit_rate',
  help: 'Cache hit rate for LLM requests',
  labelNames: ['cache_type']
});

const llmQualityScore = new Gauge({
  name: 'llm_quality_score',
  help: 'Quality score of LLM responses',
  labelNames: ['metric_type']
});

registry.registerMetric(llmRequestDuration);
registry.registerMetric(llmRequestCost);
registry.registerMetric(llmCacheHitRate);
registry.registerMetric(llmQualityScore);

// Metrics endpoint for Prometheus
export async function metricsHandler(req: Request): Promise<Response> {
  return new Response(await registry.metrics(), {
    headers: { 'Content-Type': 'text/plain' }
  });
}
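For context, what `registry.metrics()` returns is Prometheus's plain-text exposition format. prom-client generates it for you; the hand-rolled sketch below only illustrates what that text looks like for a single gauge (using one of the metric names registered above).

```typescript
// Illustrative only: the text exposition format that registry.metrics()
// produces and Prometheus scrapes. prom-client generates this for you;
// this hand-rolled version just shows the shape for one gauge.
function renderGauge(
  name: string,
  help: string,
  samples: Array<{ labels: Record<string, string>; value: number }>
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`];
  for (const { labels, value } of samples) {
    const labelStr = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(`${name}{${labelStr}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(renderGauge(
  'llm_request_cost_dollars',
  'Cost of LLM requests in dollars',
  [{ labels: { model: 'claude-sonnet-4-6-20250214' }, value: 0.0042 }]
));
// # HELP llm_request_cost_dollars Cost of LLM requests in dollars
// # TYPE llm_request_cost_dollars gauge
// llm_request_cost_dollars{model="claude-sonnet-4-6-20250214"} 0.0042
```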

Success metrics

To validate that your observability system is working:

  • Time to detect drift: Target <24 hours after behavioral change
  • Relevant alert rate: >80% of alerts should trigger corrective action
  • Observability cost: <10% of total LLM inference cost
  • Monitored request coverage: >95% of production requests
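These targets can all be derived from a handful of raw counters. A minimal sketch (the interface and field names here are illustrative, not from any specific tool):

```typescript
// Sketch: deriving the validation targets above from raw counters.
// The interface and field names are illustrative assumptions.
interface ObservabilityStats {
  alertsFired: number;
  alertsActioned: number;       // alerts that led to corrective action
  inferenceCostUSD: number;     // total LLM inference spend
  observabilityCostUSD: number; // judge calls, embeddings, metric storage
  totalRequests: number;
  monitoredRequests: number;
}

function computeKPIs(s: ObservabilityStats) {
  return {
    relevantAlertRate:
      s.alertsFired > 0 ? s.alertsActioned / s.alertsFired : 1,
    observabilityCostRatio:
      s.inferenceCostUSD > 0 ? s.observabilityCostUSD / s.inferenceCostUSD : 0,
    coverage:
      s.totalRequests > 0 ? s.monitoredRequests / s.totalRequests : 0
  };
}

const kpis = computeKPIs({
  alertsFired: 20, alertsActioned: 17,
  inferenceCostUSD: 1000, observabilityCostUSD: 80,
  totalRequests: 10000, monitoredRequests: 9700
});
// relevantAlertRate 0.85 (>0.8), observabilityCostRatio 0.08 (<0.1),
// coverage 0.97 (>0.95): all three targets met
```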

Does your production LLM system suffer from unpredictable costs, inconsistent quality, or undetected behavioral drift? Talk to Imperialis specialists about LLM observability, from quality metrics to drift detection, and operate your models in production with confidence.
