
Error Handling Patterns for Distributed Systems in Production

How to structure error handling in microservices architectures to transform inevitable failures into operational resilience.

3/17/2026 · 9 min read · Security

The inevitability of failure in distributed systems

In monolithic architectures, when something breaks, it's usually one specific thing and you can debug locally. In distributed systems with dozens of microservices, databases, queues, and caches, things break all the time.

Network failures. Server crashes. Overloaded databases. Third-party APIs returning 500s. Simultaneous deploys across multiple services. DNS issues. Expired certificates.

The question isn't whether your system will fail, but how it fails. A system that fails well — predictably, observably, and recoverably — is much more valuable than a system that "never fails" until it fails catastrophically.

Structured error handling is what separates fragile systems from resilient ones.

The spectrum of failures you need to handle

Transient failures

  • Intermittent network timeout
  • Momentarily exhausted database connection pool
  • API gateway briefly returning 503

Pattern: Retry with exponential backoff.

Permanent failures

  • Decommissioned service
  • Removed API endpoint
  • Incompatible database schema

Pattern: Immediate fallback + alert.

Partial failures

  • One of three database replicas goes down
  • One availability zone (AZ) degrades
  • Cache partially available

Pattern: Circuit breaker + graceful degradation.

Cascading failures

  • One service overloads downstream
  • Retry storm takes down database
  • Circuit breakers tripping open across multiple dependencies

Pattern: Bulkhead + hierarchical timeout.

Pattern 1: Retry with intelligent backoff

Naive retries transform small problems into disasters. Retry storms — when all clients retry simultaneously — can take down systems that normally work well.

Anti-pattern: Aggressive linear retry

// BAD: Fixed retry without jitter
async function fetchWithRetry(url: string, maxRetries: number = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await delay(100); // 100ms fixed
    }
  }
}

Problem: 5 clients × 5 retries = 25 near-simultaneous requests for a single original failure. This crushes downstream resources.

Pattern: Exponential backoff with jitter

interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableErrors: (error: Error) => boolean;
}

async function fetchWithExponentialBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig
): Promise<T> {
  let attempt = 0;

  while (attempt < config.maxAttempts) {
    try {
      return await fn();
    } catch (error) {
      attempt++;

      if (attempt >= config.maxAttempts || !config.retryableErrors(error as Error)) {
        throw error;
      }

      // Exponential backoff: base delay doubles each attempt, capped at maxDelayMs
      const baseDelay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs
      );
      // "Equal jitter": wait between 50% and 100% of the base delay,
      // so simultaneous clients spread out instead of retrying in lockstep
      const delayMs = Math.floor(baseDelay * (0.5 + Math.random() * 0.5));

      metrics.increment('retry.attempt', {
        error_type: (error as Error).constructor.name,
        attempt: attempt.toString()
      });

      await delay(delayMs);
    }
  }

  throw new Error('Max retries exceeded');
}

// Usage
const result = await fetchWithExponentialBackoff(
  () => fetch('https://api.example.com/data').then(r => r.json()),
  {
    maxAttempts: 4,
    baseDelayMs: 100,    // Delays grow exponentially from this base, with jitter
    maxDelayMs: 5000,
    retryableErrors: (error) =>
      error instanceof TypeError ||    // Network error
      (error as any)?.status >= 500    // Server error (assumes the caller throws on non-2xx)
  }
);

Benefits:

  • Jitter prevents thundering herd
  • Exponential backoff gives downstream recovery space
  • Conditional retry avoids retrying errors that won't go away
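The `delay` helper used in the snippets above isn't defined; here is a minimal version, together with the jittered-delay computation pulled out into a standalone function (the names are ours, for illustration):

```typescript
// Minimal sleep helper assumed by the retry snippets above
function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// The backoff computation on its own: exponential growth, capped,
// then "equal jitter" — wait between 50% and 100% of the base delay
function backoffDelayMs(
  attempt: number,      // 1-based attempt number
  baseDelayMs: number,
  maxDelayMs: number
): number {
  const base = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
  return Math.floor(base * (0.5 + Math.random() * 0.5));
}
```

With baseDelayMs = 100 and maxDelayMs = 5000, attempts 1–4 draw from roughly [50, 100], [100, 200], [200, 400], and [400, 800] milliseconds respectively, and the cap guarantees no wait ever exceeds 5 seconds.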

Pattern 2: Circuit breaker

A circuit breaker stops calls to an operation that keeps failing, giving the downstream dependency room to recover and saving resources on both sides.

Circuit breaker implementation

enum CircuitState {
  CLOSED = 'CLOSED',    // Normal operation
  OPEN = 'OPEN',         // Blocks calls
  HALF_OPEN = 'HALF_OPEN' // Tests recovery
}

interface CircuitBreakerConfig {
  failureThreshold: number;      // Failures before opening
  successThreshold: number;      // Successes to close (half-open)
  timeoutMs: number;            // Time before trying half-open
  windowMs: number;            // Failure count window
}

class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime = 0;
  private failures: number[] = [];

  constructor(
    private fn: () => Promise<T>,
    private config: CircuitBreakerConfig
  ) {
    // Clear old failure counts
    setInterval(() => {
      const now = Date.now();
      this.failures = this.failures.filter(f => now - f < this.config.windowMs);
      this.failureCount = this.failures.length;
    }, this.config.windowMs / 2);
  }

  async execute(): Promise<T> {
    // If circuit is open and timeout passed, try half-open
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.config.timeoutMs) {
        this.state = CircuitState.HALF_OPEN;
        metrics.increment('circuit.half_open');
      } else {
        throw new CircuitBreakerOpenError('Circuit is OPEN');
      }
    }

    try {
      const result = await this.fn();

      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;

      if (this.successCount >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
        metrics.increment('circuit.closed');
      }
    } else {
      // In CLOSED state, drop failures that fell outside the rolling window
      this.failures = this.failures.filter(f => Date.now() - f < this.config.windowMs);
      this.failureCount = this.failures.length;
    }
  }

  private onFailure() {
    this.failures.push(Date.now());
    this.failureCount = this.failures.length;
    this.lastFailureTime = Date.now();

    // A single failure in HALF_OPEN reopens the circuit immediately
    if (
      this.state === CircuitState.HALF_OPEN ||
      this.failureCount >= this.config.failureThreshold
    ) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      metrics.increment('circuit.open');
    }
  }

  getState(): CircuitState {
    return this.state;
  }
}

class CircuitBreakerOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitBreakerOpenError';
  }
}

// Usage
const paymentCircuitBreaker = new CircuitBreaker(
  () => fetch('https://payments.api/charge').then(r => r.json()),
  {
    failureThreshold: 5,    // Opens after 5 failures
    successThreshold: 2,    // Closes after 2 successes (half-open)
    timeoutMs: 60000,       // 1 minute before trying recovery
    windowMs: 30000         // 30s window for failure count
  }
);

try {
  const result = await paymentCircuitBreaker.execute();
} catch (error) {
  if (error instanceof CircuitBreakerOpenError) {
    // Use fallback
    return await fallbackPaymentFlow();
  }
  throw error;
}
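The transition logic inside the class can be distilled into a pure function over (state, event), which is easy to unit-test in isolation. A sketch of the conventional transitions (note that in HALF_OPEN, a single failure conventionally reopens the circuit; the counters are assumed to already include the current event):

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type BreakerEvent = 'success' | 'failure' | 'timeout_elapsed';

interface Counters { failures: number; successes: number }

// Next state given the current state, an event, and the running counters.
// failureThreshold / successThreshold play the same roles as in the class above.
function nextState(
  state: State,
  event: BreakerEvent,
  counters: Counters,
  failureThreshold: number,
  successThreshold: number
): State {
  switch (state) {
    case 'CLOSED':
      // Open once the failure count in the window reaches the threshold
      return event === 'failure' && counters.failures >= failureThreshold
        ? 'OPEN' : 'CLOSED';
    case 'OPEN':
      // Only the recovery timeout moves an open circuit forward
      return event === 'timeout_elapsed' ? 'HALF_OPEN' : 'OPEN';
    case 'HALF_OPEN':
      if (event === 'failure') return 'OPEN'; // One failure reopens immediately
      return counters.successes >= successThreshold ? 'CLOSED' : 'HALF_OPEN';
  }
}
```

Keeping the transitions pure like this makes the state machine trivially testable without timers or network stubs; the class then only has to manage counters and timestamps.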

Pattern 3: Fallback and graceful degradation

Not everything needs to work perfectly. Resilient systems degrade gracefully — non-critical features fail silently while core functionality remains available.

Degradation by criticality

interface ServiceConfig {
  name: string;
  critical: boolean;           // Core functionality vs. nice-to-have
  fallback?: () => Promise<any>;
  timeoutMs: number;
}

class GracefulDegradation {
  private services: Map<string, ServiceConfig> = new Map();

  register(config: ServiceConfig) {
    this.services.set(config.name, config);
  }

  async execute<T>(serviceName: string, fn: () => Promise<T>): Promise<T | null> {
    const config = this.services.get(serviceName);
    if (!config) throw new Error(`Unknown service: ${serviceName}`);

    try {
      return await Promise.race([
        fn(),
        timeout(config.timeoutMs) // Helper: a promise that rejects after timeoutMs
      ]);
    } catch (error) {
      metrics.increment('service.error', {
        service: serviceName,
        critical: config.critical.toString()
      });

      if (config.critical) {
        // Critical services: immediate alert
        alerting.sendCritical(
          `Critical service ${serviceName} failed`,
          { error }
        );
        throw error;
      }

      // Non-critical services: silent fallback
      if (config.fallback) {
        try {
          const fallbackResult = await config.fallback();
          metrics.increment('service.fallback_success', { service: serviceName });
          return fallbackResult;
        } catch (fallbackError) {
          metrics.increment('service.fallback_failed', { service: serviceName });
          return null; // Complete graceful failure
        }
      }

      return null;
    }
  }
}

// Configuration
const degradation = new GracefulDegradation();

degradation.register({
  name: 'payments',
  critical: true,
  timeoutMs: 5000
});

degradation.register({
  name: 'recommendations',
  critical: false,
  timeoutMs: 1000,
  fallback: () => Promise.resolve([]) // Returns empty list
});

degradation.register({
  name: 'analytics',
  critical: false,
  timeoutMs: 2000
  // No fallback = complete silent failure
});

// Usage
async function handleUserRequest(userId: string) {
  try {
    const payment = await degradation.execute('payments', () => createPayment(userId));
    // Payment is critical: error here triggers alert
  } catch (error) {
    // Handle payment error
  }

  const recommendations = await degradation.execute(
    'recommendations',
    () => fetchRecommendations(userId)
  );
  // If fails, recommendations = null, but request doesn't fail

  // Analytics: fire-and-forget; a failure returns null and is only recorded in metrics
  degradation.execute('analytics', () => trackEvent('user_view'));
}
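The `timeout` helper raced against `fn()` in the class above isn't defined in the snippet; a minimal sketch — a promise that never resolves and rejects after the given delay (the error class name is ours; Pattern 4 below defines an equivalent `TimeoutError`):

```typescript
class ServiceTimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'ServiceTimeoutError';
  }
}

// Never resolves; rejects after ms — the losing branch of a Promise.race
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new ServiceTimeoutError(ms)), ms)
  );
}
```

Because `Promise.race` subscribes to every input promise, the late rejection from the timer is still considered handled even when `fn()` wins the race.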

Pattern 4: Hierarchical timeout

Timeouts are your last line of defense. Without timeouts, a stuck request can hang threads indefinitely, exhausting resources.

Timeout hierarchy

Client timeout (10s) > Gateway timeout (5s) > Service timeout (3s)

Each layer's timeout must be shorter than the one in the layer above it, so the deadline shrinks as the request travels deeper into the stack. If the relationship is inverted and the client gives up before the service does, the service can still finish processing the operation, but the client has already disconnected — wasted resources.

interface TimeoutConfig {
  client: number;     // Timeout at the edge caller (maximum)
  gateway: number;    // Timeout at the gateway
  service: number;    // Timeout at the innermost service (minimum)
}

function validateTimeouts(config: TimeoutConfig) {
  if (config.service >= config.gateway) {
    throw new Error('Service timeout must be less than gateway timeout');
  }
  if (config.gateway >= config.client) {
    throw new Error('Gateway timeout must be less than client timeout');
  }
}

// Consistent configuration: deadlines shrink toward the innermost layer
const TIMEOUTS: Record<string, TimeoutConfig> = {
  payments: { client: 10000, gateway: 5000, service: 3000 },
  analytics: { client: 2000, gateway: 1000, service: 500 },
  recommendations: { client: 5000, gateway: 2000, service: 1000 }
};

async function fetchWithTimeout<T>(
  fn: () => Promise<T>,
  timeoutMs: number
): Promise<T> {
  return Promise.race([
    fn(),
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new TimeoutError()), timeoutMs)
    )
  ]);
}

class TimeoutError extends Error {
  constructor() {
    super('Operation timed out');
    this.name = 'TimeoutError';
  }
}

// Consistent usage
async function callPaymentsAPI() {
  const config = TIMEOUTS.payments;

  return await fetchWithTimeout(
    () => fetch('https://payments.api/charge').then(r => r.json()),
    config.client // Use client timeout
  );
}
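A complementary technique is deadline propagation: instead of each layer configuring timeouts independently, the caller computes its remaining time budget (minus a safety margin for network and serialization overhead) and forwards it downstream, so inner calls can never outlive the outer request. A sketch — the function names, margin, and the `x-deadline-ms` header convention mentioned in the comment are our assumptions:

```typescript
// Remaining budget to pass downstream, from an absolute deadline (ms since epoch).
// The margin reserves time for serialization and network overhead.
function remainingBudgetMs(
  deadlineEpochMs: number,
  nowMs: number = Date.now(),
  marginMs: number = 50
): number {
  return deadlineEpochMs - nowMs - marginMs;
}

// Hypothetical caller: fail fast if the budget is gone, otherwise forward it
function downstreamBudgetOrFail(deadlineEpochMs: number): number {
  const budget = remainingBudgetMs(deadlineEpochMs);
  if (budget <= 0) {
    throw new Error('Deadline exceeded; failing fast instead of calling downstream');
  }
  return budget; // e.g. sent as an `x-deadline-ms` header the callee honors
}
```

Failing fast when the budget is exhausted is the point: it stops a layer from starting work whose result nobody upstream is still waiting for.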

Pattern 5: Bulkhead for failure isolation

The bulkhead pattern partitions resources by operation type, so a failure or overload in one type cannot starve the others.

class Bulkhead<T> {
  private queue: Array<{ fn: () => Promise<T>; resolve: (value: T) => void; reject: (error: Error) => void }> = [];
  private running = 0;

  constructor(
    private maxConcurrent: number,
    private queueLimit: number = 100
  ) {}

  async execute(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      if (this.queue.length >= this.queueLimit) {
        reject(new Error('Bulkhead queue full'));
        metrics.increment('bulkhead.rejected');
        return;
      }

      this.queue.push({ fn, resolve, reject });
      this.processQueue();
    });
  }

  private processQueue() {
    while (this.running < this.maxConcurrent && this.queue.length > 0) {
      const task = this.queue.shift();
      if (!task) break;

      this.running++;
      metrics.gauge('bulkhead.running', this.running);

      // Launch without awaiting so the loop can fill every free slot
      task.fn()
        .then(result => task.resolve(result))
        .catch(error => task.reject(error as Error))
        .finally(() => {
          this.running--;
          metrics.gauge('bulkhead.running', this.running);
          this.processQueue(); // Pull the next queued task
        });
    }
  }
}

// Separate bulkheads by operation type
const bulkheads = {
  read: new Bulkhead(50, 100),    // Up to 50 simultaneous read operations
  write: new Bulkhead(10, 50),    // Up to 10 simultaneous write operations
  analytics: new Bulkhead(5, 20)   // Up to 5 simultaneous analytics requests
};

// Usage
async function readUserData(userId: string) {
  return await bulkheads.read.execute(() =>
    db.users.findById(userId)
  );
}

async function writeUserData(userId: string, data: any) {
  return await bulkheads.write.execute(() =>
    db.users.update(userId, data)
  );
}

// Analytics doesn't take down write if it fails
async function trackAnalytics() {
  try {
    return await bulkheads.analytics.execute(() =>
      analyticsApi.track('event')
    );
  } catch (error) {
    // Analytics failed, but main operation continues
    metrics.increment('analytics.dropped');
  }
}
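When a bulkhead rejects a request, the edge should surface that as explicit load shedding rather than a generic failure. A sketch of the mapping — 503 with a `Retry-After` hint is a common convention, and identifying rejections by message matches the `Bulkhead` class above (a production version would use a dedicated error class):

```typescript
interface HttpResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Translate a bulkhead rejection into an explicit load-shedding response
function toHttpResponse(error: Error): HttpResponse {
  if (error.message === 'Bulkhead queue full') {
    return {
      status: 503,                      // Shedding load, not broken
      headers: { 'Retry-After': '1' },  // Hint well-behaved clients to back off
      body: JSON.stringify({ error: 'overloaded, retry later' })
    };
  }
  return { status: 500, headers: {}, body: JSON.stringify({ error: 'internal error' }) };
}
```

Distinguishing 503 (shed) from 500 (broken) also feeds back into the retry logic of Pattern 1: a 503 with `Retry-After` is safely retryable after backoff, while repeated 500s may warrant opening a circuit instead.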

Error observability

Without observability, you're blind. Error handling without logging and metrics is just hiding problems.

Structured error logging

interface ErrorContext {
  service: string;
  operation: string;
  userId?: string;
  requestId?: string;
  upstream?: string;
  metadata?: Record<string, any>;
}

function logError(error: Error, context: ErrorContext) {
  logger.error({
    event: 'error_occurred',
    error_type: error.constructor.name,
    error_message: error.message,
    error_stack: process.env.NODE_ENV === 'development' ? error.stack : undefined,
    ...context,
    timestamp: new Date().toISOString()
  });
}

// Usage
try {
  await someOperation();
} catch (error) {
  logError(error as Error, {
    service: 'payment-service',
    operation: 'process_payment',
    userId: request.userId,
    requestId: request.id,
    upstream: 'https://payments.api/charge',
    metadata: { amount: 10000, currency: 'BRL' }
  });
  throw error;
}

Error metrics

interface ErrorMetrics {
  total: number;
  byType: Map<string, number>;
  byOperation: Map<string, number>;
  byUpstream: Map<string, number>;
}

class ErrorTracker {
  private metrics: ErrorMetrics = {
    total: 0,
    byType: new Map(),
    byOperation: new Map(),
    byUpstream: new Map()
  };

  track(error: Error, context: ErrorContext) {
    this.metrics.total++;

    this.incrementMap(this.metrics.byType, error.constructor.name);
    this.incrementMap(this.metrics.byOperation, context.operation);
    if (context.upstream) {
      this.incrementMap(this.metrics.byUpstream, context.upstream);
    }

    // Export to metrics system
    this.exportToMetricsSystem();
  }

  private incrementMap(map: Map<string, number>, key: string) {
    map.set(key, (map.get(key) || 0) + 1);
  }

  private exportToMetricsSystem() {
    // Export to Prometheus/DataDog/etc
    for (const [type, count] of this.metrics.byType) {
      metrics.gauge('errors.total', count, { error_type: type });
    }
  }

  getErrorRate(): number {
    const totalRequests = metrics.get('requests.total');
    return totalRequests > 0 ? this.metrics.total / totalRequests : 0;
  }
}

Conclusion

Error handling in distributed systems isn't a list of patterns to implement once and forget. It's a continuous discipline that evolves with your system.

Start with what has the most impact: retry with backoff, timeouts, and circuit breaker. Add graceful degradation when you understand which services are critical and which are nice-to-have. Implement bulkheads when you have failure isolation problems. Always accompany with observability — structured logs and error metrics are your eyes and ears.

The goal isn't to eliminate errors — that's impossible in distributed systems. The goal is to make errors predictable, observable, and recoverable. When an error happens, you want to know what happened, why it happened, and what the impact was. And you want the system to keep operating, even if in degraded mode.

Systems that handle errors well can fail all the time and still appear reliable to users. Systems that handle errors poorly can work 99.9% of the time and appear broken when one thing fails catastrophically.


Does your microservices architecture need structured error handling? Talk to Imperialis resilient systems experts to design error handling patterns that transform failures into operational resilience.
