Error Handling Patterns for Distributed Systems in Production
How to structure error handling in microservices architectures to transform inevitable failures into operational resilience.
Last updated: 3/17/2026
The inevitability of failure in distributed systems
In monolithic architectures, when something breaks, it's usually one specific thing and you can debug locally. In distributed systems with dozens of microservices, databases, queues, and caches, things break all the time.
Network failures. Servers stopping. Databases getting overloaded. Third-party APIs returning 500. Simultaneous deploys across multiple services. DNS issues. Expired certificates.
The question isn't whether your system will fail, but how it fails. A system that fails well — predictably, observably, and recoverably — is much more valuable than a system that "never fails" until it fails catastrophically.
Structured error handling is what separates fragile systems from resilient ones.
The spectrum of failures you need to handle
Transient failures
- Intermittent network timeout
- Momentarily exhausted database connection pool
- API gateway briefly returning 503
Pattern: Retry with exponential backoff.
Permanent failures
- Decommissioned service
- Removed API endpoint
- Incompatible database schema
Pattern: Immediate fallback + alert.
Partial failures
- One of three database replicas goes down
- One availability zone (AZ) degrades
- Cache partially available
Pattern: Circuit breaker + graceful degradation.
Cascading failures
- One service overloads downstream
- Retry storm takes down database
- Circuit breakers tripping across multiple dependencies at once
Pattern: Bulkhead + hierarchical timeout.
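One way to make this taxonomy operational is to classify each failure before deciding whether a retry can help. This is a sketch with illustrative status-code rules, not a standard:

```typescript
type FailureClass = 'transient' | 'permanent';

// Illustrative classification: network errors, 429, and 5xx responses are
// treated as transient (retryable); other 4xx responses are permanent.
function classifyHttpFailure(status: number | null): FailureClass {
  if (status === null) return 'transient';                 // network error, no response
  if (status === 429 || status >= 500) return 'transient'; // overload / server fault
  return 'permanent';                                      // client error: retry won't help
}

console.log(classifyHttpFailure(503)); // → "transient"
console.log(classifyHttpFailure(404)); // → "permanent"
```

The retry and circuit-breaker patterns below only make sense for the transient class; permanent failures should go straight to a fallback and an alert.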
Pattern 1: Retry with intelligent backoff
Naive retries transform small problems into disasters. Retry storms — when all clients retry simultaneously — can take down systems that normally work well.
Anti-pattern: Aggressive linear retry
```typescript
// Small promise-based sleep helper, used throughout this article
const delay = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// BAD: fixed-interval retry without jitter
async function fetchWithRetry(url: string, maxRetries: number = 5) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetch(url);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await delay(100); // fixed 100ms — every client retries in lockstep
    }
  }
}
```

Problem: 5 clients × 5 attempts ≈ 25 near-simultaneous requests for each original failure. This crushes downstream resources.
Pattern: Exponential backoff with jitter
```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableErrors: (error: Error) => boolean;
}

async function fetchWithExponentialBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig
): Promise<T> {
  let attempt = 0;
  while (attempt < config.maxAttempts) {
    try {
      return await fn();
    } catch (error) {
      attempt++;
      if (attempt >= config.maxAttempts || !config.retryableErrors(error as Error)) {
        throw error;
      }
      // Exponential backoff capped at maxDelayMs...
      const cappedDelay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs
      );
      // ...with "equal jitter": sleep between 50% and 100% of the capped delay
      const delayMs = Math.floor(cappedDelay * (0.5 + Math.random() * 0.5));
      metrics.increment('retry.attempt', {
        error_type: (error as Error).constructor.name,
        attempt: attempt.toString()
      });
      await delay(delayMs);
    }
  }
  throw new Error('Max retries exceeded');
}

// Usage
const result = await fetchWithExponentialBackoff(
  () => fetch('https://api.example.com/data').then(r => r.json()),
  {
    maxAttempts: 4,
    baseDelayMs: 100, // pre-jitter delays: 100ms, 200ms, 400ms
    maxDelayMs: 5000,
    retryableErrors: (error) =>
      error instanceof TypeError || // Network error
      (error as any)?.status >= 500 // Server error
  }
);
```

Benefits:
- Jitter prevents thundering herd
- Exponential backoff gives downstream recovery space
- Conditional retry avoids retrying errors that won't go away
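The backoff-and-jitter arithmetic above can be pulled out into a pure function. Passing the random sample as a parameter — a choice made here purely for testability, not part of any library API — makes the delay schedule easy to verify:

```typescript
// "Equal jitter" backoff: delay lands in [cap/2, cap], where
// cap = min(baseDelayMs * 2^(attempt-1), maxDelayMs).
// `rand` is a sample from [0, 1), injectable for deterministic tests.
function backoffDelayMs(
  attempt: number,       // 1-based index of the attempt that just failed
  baseDelayMs: number,
  maxDelayMs: number,
  rand: number = Math.random()
): number {
  const cap = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
  return Math.floor(cap * (0.5 + rand * 0.5));
}

// With base 100ms and max 5s, the pre-jitter caps grow 100, 200, 400, 800, ...
// until the 5000ms ceiling kicks in from attempt 7 onward.
console.log(backoffDelayMs(1, 100, 5000, 0));   // → 50 (lower bound of first delay)
console.log(backoffDelayMs(3, 100, 5000, 0.5)); // → 300 (midpoint of [200, 400])
```

Keeping the randomness at the lower half rather than the full range trades a slightly longer average wait for a guaranteed minimum spacing between attempts.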
Pattern 2: Circuit breaker
A circuit breaker stops calling an operation that is repeatedly failing, which frees resources on both sides and gives the downstream system room to recover.
Circuit breaker implementation
```typescript
enum CircuitState {
  CLOSED = 'CLOSED',      // Normal operation
  OPEN = 'OPEN',          // Blocks calls
  HALF_OPEN = 'HALF_OPEN' // Tests recovery
}

interface CircuitBreakerConfig {
  failureThreshold: number; // Failures before opening
  successThreshold: number; // Successes to close (from half-open)
  timeoutMs: number;        // Cool-down before trying half-open
  windowMs: number;         // Sliding window for the failure count
}

class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED;
  private successCount = 0;
  private lastFailureTime = 0;
  private failures: number[] = []; // Timestamps of recent failures

  constructor(
    private fn: () => Promise<T>,
    private config: CircuitBreakerConfig
  ) {}

  async execute(): Promise<T> {
    // If the circuit is open and the cool-down has passed, probe with half-open
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.config.timeoutMs) {
        this.state = CircuitState.HALF_OPEN;
        metrics.increment('circuit.half_open');
      } else {
        throw new CircuitBreakerOpenError('Circuit is OPEN');
      }
    }
    try {
      const result = await this.fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private pruneOldFailures() {
    const cutoff = Date.now() - this.config.windowMs;
    this.failures = this.failures.filter(t => t > cutoff);
  }

  private onSuccess() {
    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
        this.failures = [];
        metrics.increment('circuit.closed');
      }
    } else {
      this.pruneOldFailures();
    }
  }

  private onFailure() {
    this.failures.push(Date.now());
    this.lastFailureTime = Date.now();
    this.pruneOldFailures();
    if (this.state === CircuitState.HALF_OPEN) {
      // A single failure during the probe reopens the circuit
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      metrics.increment('circuit.open');
    } else if (this.failures.length >= this.config.failureThreshold) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      metrics.increment('circuit.open');
    }
  }

  getState(): CircuitState {
    return this.state;
  }
}

class CircuitBreakerOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitBreakerOpenError';
  }
}

// Usage
const paymentCircuitBreaker = new CircuitBreaker(
  () => fetch('https://payments.api/charge').then(r => r.json()),
  {
    failureThreshold: 5, // Opens after 5 failures within the window
    successThreshold: 2, // Closes after 2 successes (half-open)
    timeoutMs: 60000,    // 1 minute before probing for recovery
    windowMs: 30000      // 30s sliding window for the failure count
  }
);

try {
  const result = await paymentCircuitBreaker.execute();
} catch (error) {
  if (error instanceof CircuitBreakerOpenError) {
    // Use fallback
    return await fallbackPaymentFlow();
  }
  throw error;
}
```

Pattern 3: Fallback and graceful degradation
Not everything needs to work perfectly. Resilient systems degrade gracefully — non-critical features fail silently while core functionality remains available.
Degradation by criticality
```typescript
interface ServiceConfig {
  name: string;
  critical: boolean; // Core functionality vs nice-to-have
  fallback?: () => Promise<any>;
  timeoutMs: number;
}

// Rejects after ms — bounds every service call below
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
  );
}

class GracefulDegradation {
  private services: Map<string, ServiceConfig> = new Map();

  register(config: ServiceConfig) {
    this.services.set(config.name, config);
  }

  async execute<T>(serviceName: string, fn: () => Promise<T>): Promise<T | null> {
    const config = this.services.get(serviceName);
    if (!config) throw new Error(`Unknown service: ${serviceName}`);
    try {
      return await Promise.race([
        fn(),
        timeout(config.timeoutMs)
      ]);
    } catch (error) {
      metrics.increment('service.error', {
        service: serviceName,
        critical: config.critical.toString()
      });
      if (config.critical) {
        // Critical services: immediate alert, and the error propagates
        alerting.sendCritical(
          `Critical service ${serviceName} failed`,
          { error }
        );
        throw error;
      }
      // Non-critical services: silent fallback
      if (config.fallback) {
        try {
          const fallbackResult = await config.fallback();
          metrics.increment('service.fallback_success', { service: serviceName });
          return fallbackResult;
        } catch (fallbackError) {
          metrics.increment('service.fallback_failed', { service: serviceName });
          return null; // Fallback also failed: degrade completely
        }
      }
      return null;
    }
  }
}

// Configuration
const degradation = new GracefulDegradation();

degradation.register({
  name: 'payments',
  critical: true,
  timeoutMs: 5000
});

degradation.register({
  name: 'recommendations',
  critical: false,
  timeoutMs: 1000,
  fallback: () => Promise.resolve([]) // Degrade to an empty list
});

degradation.register({
  name: 'analytics',
  critical: false,
  timeoutMs: 2000
  // No fallback = silent failure, returns null
});

// Usage
async function handleUserRequest(userId: string) {
  try {
    const payment = await degradation.execute('payments', () => createPayment(userId));
    // Payments is critical: an error here triggers an alert and rethrows
  } catch (error) {
    // Handle payment error
  }

  const recommendations = await degradation.execute(
    'recommendations',
    () => fetchRecommendations(userId)
  );
  // On failure, recommendations degrades to the empty list and the request doesn't fail

  // Analytics: no fallback, so a failure is swallowed and returns null
  degradation.execute('analytics', () => trackEvent('user_view'));
}
```

Pattern 4: Hierarchical timeout
Timeouts are your last line of defense. Without timeouts, a stuck request can hang threads indefinitely, exhausting resources.
Timeout hierarchy
Service timeout (3s) < Gateway timeout (5s) < Client timeout (10s)

Each inner layer needs a timeout shorter than the layer that calls it, so the deepest dependency fails first and the error propagates upward. If instead the client gives up before the service does, the service keeps processing an operation whose result nobody will receive — wasted resources.

```typescript
interface TimeoutConfig {
  service: number; // Timeout at the service (shortest)
  gateway: number; // Timeout at the gateway
  client: number;  // Timeout at the client (longest)
}

function validateTimeouts(config: TimeoutConfig) {
  if (config.service >= config.gateway) {
    throw new Error('Service timeout must be less than gateway timeout');
  }
  if (config.gateway >= config.client) {
    throw new Error('Gateway timeout must be less than client timeout');
  }
}

// Consistent configuration
const TIMEOUTS: Record<string, TimeoutConfig> = {
  payments: { service: 3000, gateway: 5000, client: 10000 },
  analytics: { service: 500, gateway: 1000, client: 2000 },
  recommendations: { service: 1000, gateway: 2000, client: 5000 }
};

async function fetchWithTimeout<T>(
  fn: () => Promise<T>,
  timeoutMs: number
): Promise<T> {
  return Promise.race([
    fn(),
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new TimeoutError()), timeoutMs)
    )
  ]);
}

class TimeoutError extends Error {
  constructor() {
    super('Operation timed out');
    this.name = 'TimeoutError';
  }
}

// Consistent usage
async function callPaymentsAPI() {
  const config = TIMEOUTS.payments;
  return await fetchWithTimeout(
    () => fetch('https://payments.api/charge').then(r => r.json()),
    config.client // This caller is the outermost layer, so it waits the longest
  );
}
```

Pattern 5: Bulkhead for failure isolation
Bulkhead separates resources by operation type, preventing failure in one type from affecting others.
```typescript
class Bulkhead<T> {
  private queue: Array<{
    fn: () => Promise<T>;
    resolve: (value: T) => void;
    reject: (error: Error) => void;
  }> = [];
  private running = 0;

  constructor(
    private maxConcurrent: number,
    private queueLimit: number = 100
  ) {}

  async execute(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      if (this.queue.length >= this.queueLimit) {
        reject(new Error('Bulkhead queue full'));
        metrics.increment('bulkhead.rejected');
        return;
      }
      this.queue.push({ fn, resolve, reject });
      this.processQueue();
    });
  }

  private processQueue() {
    // Launch tasks without awaiting them, so up to maxConcurrent run in parallel
    while (this.running < this.maxConcurrent && this.queue.length > 0) {
      const task = this.queue.shift();
      if (!task) break;
      this.running++;
      metrics.gauge('bulkhead.running', this.running);
      task.fn()
        .then(task.resolve, (error) => task.reject(error as Error))
        .finally(() => {
          this.running--;
          metrics.gauge('bulkhead.running', this.running);
          this.processQueue(); // Pick up the next queued task
        });
    }
  }
}

// Separate bulkheads by operation type
const bulkheads = {
  read: new Bulkhead(50, 100),   // Up to 50 concurrent read operations
  write: new Bulkhead(10, 50),   // Up to 10 concurrent write operations
  analytics: new Bulkhead(5, 20) // Up to 5 concurrent analytics requests
};

// Usage
async function readUserData(userId: string) {
  return await bulkheads.read.execute(() =>
    db.users.findById(userId)
  );
}

async function writeUserData(userId: string, data: any) {
  return await bulkheads.write.execute(() =>
    db.users.update(userId, data)
  );
}

// A failing analytics backend can't starve the write pool
async function trackAnalytics() {
  try {
    return await bulkheads.analytics.execute(() =>
      analyticsApi.track('event')
    );
  } catch (error) {
    // Analytics failed, but the main operation continues
    metrics.increment('analytics.dropped');
  }
}
```

Error observability
Without observability, you're blind. Error handling without logging and metrics is just hiding problems.
Structured error logging
```typescript
interface ErrorContext {
  service: string;
  operation: string;
  userId?: string;
  requestId?: string;
  upstream?: string;
  metadata?: Record<string, any>;
}

function logError(error: Error, context: ErrorContext) {
  logger.error({
    event: 'error_occurred',
    error_type: error.constructor.name,
    error_message: error.message,
    error_stack: process.env.NODE_ENV === 'development' ? error.stack : undefined,
    ...context,
    timestamp: new Date().toISOString()
  });
}

// Usage
try {
  await someOperation();
} catch (error) {
  logError(error as Error, {
    service: 'payment-service',
    operation: 'process_payment',
    userId: request.userId,
    requestId: request.id,
    upstream: 'https://payments.api/charge',
    metadata: { amount: 10000, currency: 'BRL' }
  });
  throw error;
}
```

Error metrics
```typescript
interface ErrorMetrics {
  total: number;
  byType: Map<string, number>;
  byOperation: Map<string, number>;
  byUpstream: Map<string, number>;
}

class ErrorTracker {
  private metrics: ErrorMetrics = {
    total: 0,
    byType: new Map(),
    byOperation: new Map(),
    byUpstream: new Map()
  };

  track(error: Error, context: ErrorContext) {
    this.metrics.total++;
    this.incrementMap(this.metrics.byType, error.constructor.name);
    this.incrementMap(this.metrics.byOperation, context.operation);
    if (context.upstream) {
      this.incrementMap(this.metrics.byUpstream, context.upstream);
    }
    // Export to the metrics system
    this.exportToMetricsSystem();
  }

  private incrementMap(map: Map<string, number>, key: string) {
    map.set(key, (map.get(key) || 0) + 1);
  }

  private exportToMetricsSystem() {
    // Export to Prometheus/DataDog/etc.
    for (const [type, count] of this.metrics.byType) {
      metrics.gauge('errors.total', count, { error_type: type });
    }
  }

  getErrorRate(): number {
    const totalRequests = metrics.get('requests.total');
    return totalRequests > 0 ? this.metrics.total / totalRequests : 0;
  }
}
```

Conclusion
Error handling in distributed systems isn't a list of patterns to implement once and forget. It's a continuous discipline that evolves with your system.
Start with what has the most impact: retry with backoff, timeouts, and circuit breaker. Add graceful degradation when you understand which services are critical and which are nice-to-have. Implement bulkheads when you have failure isolation problems. Always accompany with observability — structured logs and error metrics are your eyes and ears.
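As a rough, self-contained sketch of how the core patterns compose — names here are illustrative, not from a specific library — the timeout bounds each individual attempt, and the retry with backoff wraps the bounded call:

```typescript
// Minimal composition of per-attempt timeout + retry with jittered backoff
const delay = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    fn(),
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms)
    )
  ]);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxAttempts) throw error;
      // Full jitter: sleep anywhere in [0, base * 2^(attempt-1))
      await delay(baseDelayMs * Math.pow(2, attempt - 1) * Math.random());
    }
  }
}

// Demo: a dependency that fails twice, then succeeds
let calls = 0;
async function flakyDependency(): Promise<string> {
  calls++;
  if (calls < 3) throw new Error('boom');
  return 'ok';
}

withRetry(() => withTimeout(flakyDependency, 1000), 5, 10)
  .then(result => console.log(result, 'after', calls, 'calls')); // → "ok after 3 calls"
```

In production, a circuit breaker like the one in Pattern 2 would typically sit between the retry layer and the timeout layer, and breaker-open errors would be excluded from the retryable set so the fallback path runs instead.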
The goal isn't to eliminate errors — that's impossible in distributed systems. The goal is to make errors predictable, observable, and recoverable. When an error happens, you want to know what happened, why it happened, and what the impact was. And you want the system to keep operating, even if in degraded mode.
Systems that handle errors well can fail all the time and still appear reliable to users. Systems that handle errors poorly can work 99.9% of the time and appear broken when one thing fails catastrophically.
Does your microservices architecture need structured error handling? Talk to Imperialis resilient systems experts to design error handling patterns that transform failures into operational resilience.