
Circuit Breakers and Resilience Patterns: Designing Distributed Systems That Survive Failure

How circuit breakers, retries with exponential backoff, and other resilience patterns prevent cascading failures and ensure system reliability in distributed architectures.

3/10/2026 · 6 min read · Security


The reality of distributed failure

In monolithic applications, failure modes are relatively predictable: the server either responds or it doesn't. In distributed architectures built on microservices, APIs, third-party services, and cloud infrastructure, failure becomes the default state. Network partitions occur, dependencies become slow, databases timeout, and services experience unexpected load spikes.

The fundamental challenge: when one component degrades, how do you prevent that degradation from cascading through your entire system and causing a complete outage?

Resilience patterns address this by designing systems that gracefully handle, isolate, and recover from failures. Circuit breakers act as the foundational pattern, but they're most effective when combined with retries with exponential backoff, timeouts, bulkheads, and fallbacks. For engineering teams operating distributed systems at scale, implementing these patterns isn't optional—it's essential for survival.

Circuit Breakers: Preventing cascading failures

A circuit breaker monitors calls to external services and opens the circuit when failure rates exceed a threshold. Once open, subsequent calls fail immediately without attempting the remote service, allowing it time to recover while preventing your system from wasting resources on doomed requests.

The three states of a circuit breaker

Closed State (Normal Operation)

  • All requests pass through to the service
  • Failures are tracked against thresholds
  • When failure threshold is exceeded, circuit transitions to open

Open State (Fail-Fast Mode)

  • All requests fail immediately without reaching the service
  • A timeout elapses before attempting to close the circuit
  • Prevents overwhelming a struggling service

Half-Open State (Testing)

  • A single request is allowed through to test if service recovered
  • If successful, circuit closes; if it fails, it reopens
  • Provides controlled recovery mechanism
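The three states and their transitions can be captured as a small state table. This is a sketch for illustration only; the event names are assumptions, and a real breaker also tracks failure counters and timers:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';
type BreakerEvent =
  | 'failure-threshold-exceeded'
  | 'open-timeout-elapsed'
  | 'probe-succeeded'
  | 'probe-failed';

// Legal transitions of the circuit-breaker state machine described above
const transitions: Record<BreakerState, Partial<Record<BreakerEvent, BreakerState>>> = {
  'closed':    { 'failure-threshold-exceeded': 'open' },
  'open':      { 'open-timeout-elapsed': 'half-open' },
  'half-open': { 'probe-succeeded': 'closed', 'probe-failed': 'open' },
};

function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  // Events with no entry for the current state leave it unchanged
  return transitions[state][event] ?? state;
}
```

Encoding the transitions as data makes the legal state changes easy to audit and to assert against in tests.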

Implementation patterns

class CircuitBreaker {
  state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private lastFailureTime = 0;
  private successCount = 0;

  constructor(private options: CircuitBreakerOptions) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.shouldAttemptReset()) {
        this.state = 'half-open';
      } else {
        throw new CircuitBreakerOpenError('Circuit is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    if (this.state === 'half-open') {
      this.state = 'closed';
      this.successCount = 0;
    } else {
      this.failures = 0;
    }
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();

    // A failed probe in half-open reopens the circuit immediately
    if (this.state === 'half-open' || this.failures >= this.options.failureThreshold) {
      this.state = 'open';
      this.successCount = 0;
    }
  }

  private shouldAttemptReset(): boolean {
    const cooldownPeriod = this.options.openTimeoutMs;
    return Date.now() - this.lastFailureTime > cooldownPeriod;
  }
}

Critical configuration parameters

| Parameter | Purpose | Risk of Misconfiguration |
| --- | --- | --- |
| Failure Threshold | Number of failures before opening circuit | Too low: opens on transient failures. Too high: allows cascading failures. |
| Timeout Window | Time window for counting failures | Too short: sensitive to normal variance. Too long: slow to detect degradation. |
| Open Timeout | How long circuit stays open before testing | Too short: floods recovering service. Too long: unnecessarily prolonged outages. |
| Half-Open Max Requests | Requests allowed during half-open state | Too many: floods service during recovery. Too few: might miss successful recovery. |
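The CircuitBreakerOptions type consumed by the implementation above is not shown in the snippet. A minimal shape consistent with the table might look like the following; only failureThreshold and openTimeoutMs are read by the code above, and the other fields (and the example values) are assumptions:

```typescript
interface CircuitBreakerOptions {
  failureThreshold: number;      // Failures before the circuit opens
  openTimeoutMs: number;         // How long the circuit stays open before half-open
  failureWindowMs?: number;      // Optional: time window for counting failures
  halfOpenMaxRequests?: number;  // Optional: probes allowed while half-open
}

// Example configuration for a payment-service dependency (illustrative values)
const paymentBreakerOptions: CircuitBreakerOptions = {
  failureThreshold: 5,
  openTimeoutMs: 30_000,
  failureWindowMs: 60_000,
  halfOpenMaxRequests: 1,
};
```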

Retries with Exponential Backoff: Handling transient failures

Not all failures warrant opening a circuit. Transient failures—network hiccups, brief database timeouts, temporary service unavailability—often resolve on their own. The retry pattern attempts failed operations with carefully calculated delays.

Why naive retries are dangerous

A naive retry strategy that immediately retries failed requests can exacerbate the problem:

// DANGEROUS: Immediate retries create a thundering herd
async function naiveRetry(fn: () => Promise<any>, maxRetries: number) {
  let attempts = 0;
  while (attempts < maxRetries) {
    try {
      return await fn();
    } catch (error) {
      attempts++;
      if (attempts >= maxRetries) throw error;
      // No delay: immediate retry floods service
    }
  }
}

When multiple clients retry simultaneously, they create a thundering herd that overwhelms the already-struggling service, turning a transient issue into a sustained outage.

Exponential backoff with jitter

// Helper assumed by the loop below: resolve after the given delay
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let attempt = 0;
  let lastError: Error;

  while (attempt < options.maxRetries) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      attempt++;

      if (attempt >= options.maxRetries) {
        throw lastError;
      }

      // Exponential backoff: delay grows exponentially
      const baseDelay = options.initialDelayMs;
      const exponentialDelay = baseDelay * Math.pow(2, attempt - 1);

      // Add jitter to prevent synchronized retries
      const jitter = exponentialDelay * options.jitterFactor;
      const randomizedDelay = exponentialDelay + (Math.random() * jitter);

      await sleep(randomizedDelay);
    }
  }

  throw lastError!;
}

interface RetryOptions {
  maxRetries: number;
  initialDelayMs: number;
  jitterFactor: number; // Typically 0.1 to 0.5
}

Why jitter matters: Without jitter, all clients experiencing the same failure will retry at approximately the same intervals after exponential backoff, creating synchronized waves of load. Jitter randomizes these delays slightly, spreading retries more evenly.
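To make the effect concrete, this small sketch (the helper name is illustrative) computes the delay schedule for the first few attempts; a jitterFactor of 0 yields the raw exponential schedule every client would share:

```typescript
// Compute exponential backoff delays; jitterFactor = 0 gives the raw schedule
function backoffSchedule(
  initialDelayMs: number,
  attempts: number,
  jitterFactor: number
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const exponential = initialDelayMs * Math.pow(2, attempt - 1);
    const jitter = exponential * jitterFactor * Math.random();
    delays.push(exponential + jitter);
  }
  return delays;
}

// Without jitter, every client waits exactly 100, 200, 400, 800 ms
const raw = backoffSchedule(100, 4, 0);
// With jitter, each client's schedule is perturbed by up to 20%,
// desynchronizing the retry waves
const jittered = backoffSchedule(100, 4, 0.2);
```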

When to retry vs. when to fail fast

| Failure Type | Retry Strategy | Rationale |
| --- | --- | --- |
| Network timeout | Retry with backoff | Likely transient; network conditions fluctuate |
| 5xx server errors | Retry with backoff | Service might be temporarily overloaded |
| 429 rate limited | Retry with exponential backoff | Wait for the rate-limit window to reset |
| 404 not found | Do not retry | Resource doesn't exist; retrying won't help |
| 4xx client errors (400, 401, 403) | Do not retry | Client-side error; retrying won't fix it |
| 500 errors from deterministic bugs | Limited retries, then fail | A logic bug reproduces on every attempt; repeating the same request won't succeed |
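The HTTP rows of the decision table can be encoded as a small helper. This is a sketch: it covers status codes only (network timeouts surface as exceptions, not statuses), and real policies often add per-route exceptions:

```typescript
// Decide whether an HTTP status code warrants a retry, per the table above.
// Note: a 5xx caused by a deterministic bug will fail on every attempt,
// which is why retries should always be bounded.
function isRetryable(status: number): boolean {
  if (status === 429) return true;  // Rate limited: back off and retry
  if (status >= 500) return true;   // Server-side: possibly transient
  return false;                     // 4xx client errors: retrying won't help
}
```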

Bulkheads: Isolating failure impact

The bulkhead pattern partitions resources so that failures in one domain don't consume all available resources. Named after the watertight compartments in ships, bulkheads prevent a single point of failure from sinking the entire system.

Thread pool isolation

// Illustrative sketch: WorkerPool is assumed to expose availableWorkers, acquire(), and release()
class BulkheadExecutor {
  private threadPool: WorkerPool;

  constructor(threadPoolSize: number) {
    this.threadPool = new WorkerPool(threadPoolSize);
  }

  async execute<T>(task: () => Promise<T>): Promise<T> {
    if (this.threadPool.availableWorkers === 0) {
      throw new BulkheadExhaustedError('Bulkhead capacity exhausted');
    }

    const worker = this.threadPool.acquire();
    try {
      return await task();
    } finally {
      this.threadPool.release(worker);
    }
  }
}

// Separate bulkheads for different domains
const paymentBulkhead = new BulkheadExecutor(10);
const inventoryBulkhead = new BulkheadExecutor(20);
const notificationBulkhead = new BulkheadExecutor(5);

Operational benefit: If the payment service becomes slow and exhausts its thread pool, inventory and notification services continue operating because they have isolated pools.

Semaphore-based rate limiting

class SemaphoreBulkhead {
  private semaphore: Semaphore;

  constructor(maxConcurrent: number) {
    this.semaphore = new Semaphore(maxConcurrent);
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const permit = await this.semaphore.acquire();
    try {
      return await fn();
    } finally {
      this.semaphore.release(permit);
    }
  }
}
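The Semaphore used above is not defined in the snippet. A minimal promise-based version consistent with that usage might look like this (a sketch without fairness guarantees or timeouts; the opaque permit token is an assumption):

```typescript
// Minimal counting semaphore: acquire() resolves with a permit once capacity is free
class Semaphore {
  private available: number;
  private waiters: Array<(permit: symbol) => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  acquire(): Promise<symbol> {
    if (this.available > 0) {
      this.available--;
      return Promise.resolve(Symbol('permit'));
    }
    // No capacity: queue the caller until a permit is released
    return new Promise(resolve => this.waiters.push(resolve));
  }

  release(_permit: symbol): void {
    const next = this.waiters.shift();
    if (next) {
      next(Symbol('permit')); // Hand capacity directly to the next waiter
    } else {
      this.available++;
    }
  }
}
```

Unlike the thread-pool bulkhead, which rejects excess work outright, this semaphore queues callers; which behavior you want depends on whether back-pressure or fail-fast is more appropriate for the dependency.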

Timeout Strategies: Preventing resource exhaustion

Every external call must have a timeout. Without timeouts, slow services cause your system to accumulate connections, exhaust thread pools, and eventually fail completely.

Per-operation timeouts

async function callWithTimeout<T>(
  fn: () => Promise<T>,
  timeoutMs: number,
  operationName: string
): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(new TimeoutError(`Operation ${operationName} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });

  try {
    return await Promise.race([fn(), timeoutPromise]);
  } catch (error) {
    if (error instanceof TimeoutError) {
      // Log timeout with context
      metrics.recordTimeout(operationName, timeoutMs);
    }
    throw error;
  } finally {
    // Clear the pending timer so it doesn't keep the process alive
    clearTimeout(timer!);
  }
}

Tiered timeout strategies

Different operations warrant different timeout durations based on their complexity and importance:

const timeoutConfig = {
  // Fast operations: strict timeouts
  healthCheck: 500,
  cacheLookup: 100,
  apiValidation: 200,

  // Medium operations: balanced timeouts
  apiCall: 3000,
  databaseQuery: 2000,
  messageQueuePublish: 1000,

  // Heavy operations: generous timeouts with circuit breaker
  dataProcessingJob: 30000,
  reportGeneration: 60000,
  bulkImport: 120000,
};

Fallbacks: Graceful degradation when services fail

Circuit breakers prevent failures from cascading, but fallbacks provide alternative behavior when services are unavailable. Fallbacks maintain functionality even if features are degraded.

Fallback strategies by service criticality

class FallbackHandler {
  async executeWithFallback<T>(
    primaryFn: () => Promise<T>,
    fallbackFn: () => Promise<T>,
    service: string
  ): Promise<T> {
    try {
      return await primaryFn();
    } catch (error) {
      metrics.recordFallback(service, error);
      return await fallbackFn();
    }
  }
}

// Fallback examples
const fallbacks = {
  // Cache fallback: serve stale data
  productCatalog: async (productId: string) => {
    return await cache.get(`product:${productId}`) ||
           await database.getProduct(productId);
  },

  // Feature toggle: disable non-critical features
  recommendations: async (userId: string) => {
    try {
      return await recommendationsService.getForUser(userId);
    } catch {
      return { items: [], source: 'fallback-disabled' };
    }
  },

  // Alternative service: switch to backup provider
  emailDelivery: async (email: Email) => {
    try {
      return await primaryEmailProvider.send(email);
    } catch (error) {
      return await backupEmailProvider.send(email);
    }
  },

  // Graceful error message: inform user of temporary issue
  paymentProcessing: async (payment: Payment) => {
    try {
      return await paymentProcessor.charge(payment);
    } catch (error) {
      return {
        status: 'temporarily_unavailable',
        message: 'Payment processing is temporarily unavailable. Please try again in a few minutes.',
        retryAfter: 300 // 5 minutes
      };
    }
  }
};

Operational monitoring: Observing resilience in action

Resilience patterns are only effective if you can observe when they trigger. Without monitoring, you won't know if your circuit breakers are opening too frequently or if retries are masking deeper issues.

Key metrics to track

interface ResilienceMetrics {
  // Circuit breaker metrics
  circuitBreakerStateChanges: {
    service: string;
    fromState: 'closed' | 'open' | 'half-open';
    toState: 'closed' | 'open' | 'half-open';
    timestamp: Date;
  }[];

  // Retry metrics
  retryAttempts: {
    service: string;
    attempt: number;
    totalAttempts: number;
    success: boolean;
    delayMs: number;
  }[];

  // Timeout metrics
  timeoutOccurrences: {
    operation: string;
    timeoutMs: number;
    actualDurationMs: number;
  }[];

  // Fallback metrics
  fallbackInvocations: {
    service: string;
    fallbackType: 'cache' | 'alternative' | 'disabled';
    latencyMs: number;
  }[];

  // Bulkhead metrics
  bulkheadRejections: {
    bulkhead: string;
    rejectedRequests: number;
    availableCapacity: number;
  }[];
}
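As a minimal sketch of how one of these metric families might be recorded and queried, the following in-memory log supports the kind of question the alerting rules below ask ("how many times did a service's circuit open within a window?"). Production systems would export to a metrics backend instead; all names here are illustrative:

```typescript
interface StateChange {
  service: string;
  fromState: 'closed' | 'open' | 'half-open';
  toState: 'closed' | 'open' | 'half-open';
  timestamp: Date;
}

// In-memory sink for circuit-breaker state changes, with a simple
// windowed query over transitions into the open state
class StateChangeLog {
  private changes: StateChange[] = [];

  record(change: StateChange): void {
    this.changes.push(change);
  }

  opensWithin(service: string, windowMs: number, now: Date = new Date()): number {
    return this.changes.filter(c =>
      c.service === service &&
      c.toState === 'open' &&
      now.getTime() - c.timestamp.getTime() <= windowMs
    ).length;
  }
}
```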

Alerting strategy

resilience_alerts:
  circuit_breaker_open:
    condition: "CircuitBreakerOpen > 3 within 5m for same service"
    severity: critical
    action: "Immediate investigation: service might be down or degraded"

  high_retry_rate:
    condition: "RetryRate > 50% for operation"
    severity: warning
    action: "Review service performance: might indicate instability"

  timeout_spike:
    condition: "TimeoutRate > 10% increase from baseline"
    severity: warning
    action: "Check for performance regression or network issues"

  fallback_activation:
    condition: "FallbackRate > 20% for critical service"
    severity: warning
    action: "Service degraded: primary dependency unavailable"

Implementation anti-patterns

Anti-pattern 1: Circuit breakers without monitoring

Setting up circuit breakers but not tracking when they open means you won't detect degrading services until users report issues.

Solution: Implement comprehensive metrics and alerting for all resilience patterns.

Anti-pattern 2: Excessive retries

Retrying non-idempotent operations (like charging a credit card) can cause duplicate transactions and data corruption.

Solution: Classify operations as idempotent or non-idempotent. Only retry idempotent operations automatically.
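One way to enforce this classification is to make the retry layer refuse to repeat non-idempotent operations. The Operation wrapper below is a sketch of that idea, not an established API:

```typescript
// Tag each operation with its idempotency so the retry layer can refuse
// to automatically repeat unsafe calls such as payment charges
interface Operation<T> {
  name: string;
  idempotent: boolean;
  run: () => Promise<T>;
}

async function retrySafely<T>(op: Operation<T>, maxAttempts: number): Promise<T> {
  // Non-idempotent operations get exactly one attempt
  const attempts = op.idempotent ? maxAttempts : 1;
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await op.run();
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

A production version would also apply backoff between attempts; the point here is that idempotency is declared per operation, not assumed by the retry helper.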

Anti-pattern 3: One-size-fits-all configuration

Using the same retry and timeout settings for all services ignores their distinct performance characteristics.

Solution: Configure resilience parameters per service based on observed SLAs and failure modes.

Anti-pattern 4: Fallbacks that hide problems

Fallbacks that always succeed mask underlying issues, preventing root cause analysis.

Solution: Design fallbacks to degrade functionality noticeably, and alert when they activate.

Chaos Engineering: Proactively testing resilience

Building resilience patterns is the first step. Validating they work requires proactively inducing failures in production-like environments.

Testing circuit breakers

async function testCircuitBreaker() {
  const circuitBreaker = new CircuitBreaker({
    failureThreshold: 3,
    openTimeoutMs: 10000
  });

  // Simulate failures to trigger circuit breaker
  const failingService = async () => {
    throw new Error('Service unavailable');
  };

  for (let i = 0; i < 5; i++) {
    try {
      await circuitBreaker.execute(failingService);
    } catch (error) {
      console.log(`Attempt ${i + 1}: ${error.message}`);
    }
  }

  // Circuit should be open after 3 failures
  assert(circuitBreaker.state === 'open', 'Circuit should be open');
}

Testing retry strategies

async function testExponentialBackoff() {
  const attempts: number[] = [];

  const flakyService = async () => {
    attempts.push(attempts.length + 1);
    if (attempts.length < 3) {
      throw new Error('Temporary failure');
    }
    return 'success';
  };

  const result = await retryWithBackoff(flakyService, {
    maxRetries: 5,
    initialDelayMs: 100,
    jitterFactor: 0.1
  });

  assert(attempts.length === 3, 'Should have retried twice');
  assert(result === 'success', 'Should eventually succeed');
}

Conclusion

Resilience patterns transform distributed systems from fragile networks of dependencies into robust architectures that survive failures gracefully. Circuit breakers prevent cascading failures, retries with exponential backoff handle transient issues, bulkheads isolate failures, timeouts prevent resource exhaustion, and fallbacks maintain degraded functionality.

The key insight: failure isn't a question of if, but when. Designing for failure isn't pessimism—it's pragmatism. By implementing these patterns and observing them through comprehensive monitoring, engineering teams can build systems that not only withstand failures but recover from them automatically.

The next step isn't implementing every resilience pattern simultaneously. Start with the highest-impact patterns for your architecture: circuit breakers for critical dependencies, sensible timeouts for all external calls, and exponential backoff for idempotent operations. Monitor these implementations, validate they work through chaos engineering, and expand your resilience strategy iteratively.


Is your distributed architecture experiencing cascading failures and unexpected outages? Talk to Imperialis engineering specialists to design and implement a comprehensive resilience strategy that prevents failures from impacting your customers.
