
Event-driven architecture patterns in production: dead letters, sagas, and when to event source

Event-driven systems promise loose coupling and scalability. In production, they deliver dead letter queues, saga orchestration, and the difficult decision of when event sourcing is worth the complexity.

3/19/2026 · 8 min read · Cloud

Last updated: 3/19/2026

Executive summary

The transition from synchronous request/response to asynchronous event-driven architecture solves real problems: services are loosely coupled, systems can scale independently, and business processes can execute over hours or days rather than within a single HTTP request timeout.

But production event-driven systems introduce new failure modes that don't exist in synchronous systems. What happens when a message consumer crashes mid-processing? How do you compensate for a payment that succeeded when the subsequent inventory reservation failed? When is event sourcing's promise of perfect audit trails worth its implementation complexity?

This post covers the patterns that distinguish proof-of-concept event systems from production-grade event-driven architectures.

Dead letter queues: the pattern you need before you go live

A dead letter queue (DLQ) is a destination for messages that cannot be processed successfully after a configured number of attempts. It is the most under-appreciated pattern in event-driven systems, and the first one you'll wish you had when you hit production.

DLQ design patterns

Per-consumer DLQ vs shared DLQ:

For most systems, create a separate DLQ per consumer (or per consumer group). This allows targeted investigation and reprocessing without unrelated messages creating noise. A shared DLQ is acceptable only for very small teams where operational simplicity outweighs isolation benefits.

Message enrichment before DLQ:

When moving a message to the DLQ, enrich it with operational metadata so the failure context travels with the original payload:

{
  "originalMessage": { /* original payload */ },
  "failureMetadata": {
    "consumer": "order-fulfillment-service",
    "failureReason": "INVENTORY_SYSTEM_TIMEOUT",
    "retryCount": 3,
    "firstFailureAt": "2026-03-19T10:23:45Z",
    "lastFailureAt": "2026-03-19T10:26:12Z",
    "processingAttemptDurationMs": [1250, 1800, 2100]
  }
}

This metadata is invaluable when investigating production issues and deciding whether to reprocess, modify, or permanently discard messages.
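A small sketch of the enrichment step, assuming a hypothetical `to_dlq_envelope` helper called just before publishing to the DLQ (the field names follow the payload shown above; the `attempts` structure is an assumption):

```python
def to_dlq_envelope(original, consumer, reason, attempts):
    """Wrap a failed message with failure metadata before publishing to the DLQ.

    `attempts` is a list of (started_at, duration_ms) tuples, one per
    processing attempt (a hypothetical shape for this sketch).
    """
    return {
        "originalMessage": original,
        "failureMetadata": {
            "consumer": consumer,
            "failureReason": reason,
            "retryCount": len(attempts),
            "firstFailureAt": attempts[0][0],
            "lastFailureAt": attempts[-1][0],
            "processingAttemptDurationMs": [d for _, d in attempts],
        },
    }
```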

Retry strategies: exponential backoff with jitter

When message processing fails, an immediate retry often creates a thundering herd that worsens the underlying issue. Exponential backoff with jitter is the standard approach:

  • Initial delay: 1 second
  • Exponential base: 2 (delay doubles each retry)
  • Maximum delay: 5 minutes
  • Jitter: ±25% random variation to prevent synchronization

For transient failures (network timeouts, temporary service unavailability), this approach allows the system to recover without overwhelming dependencies. For permanent failures (invalid message schema, missing data), the message quickly reaches the DLQ rather than consuming retry capacity indefinitely.
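The backoff schedule above can be sketched in a few lines; the parameter defaults match the bullets (1 s initial delay, base 2, 5-minute cap, ±25% jitter):

```python
import random

def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=300.0, jitter=0.25):
    """Delay in seconds before retry number `attempt` (0-indexed).

    Exponential growth capped at max_delay, with +/-25% random jitter
    so that retries from many consumers do not synchronize.
    """
    delay = min(base * (factor ** attempt), max_delay)
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

Once `attempt` exceeds the configured maximum, the consumer stops retrying and routes the message to the DLQ instead.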

DLQ operations: visibility and reprocessing

A DLQ is operationally useful only when you have tooling to inspect and reprocess messages. Minimum requirements:

  1. Search and filter: Find messages by error type, original topic, or time range
  2. Message inspection: View the original payload and failure metadata
  3. Replay mechanism: Reprocess messages either individually or in bulk
  4. Edit before replay: Fix malformed messages without redeploying code

Tooling options:

  • Kafka: UI tools like Kafka Explorer, or custom consumer with admin APIs
  • RabbitMQ: Dead letter exchange with the Shovel plugin for message movement
  • AWS SQS: Built-in DLQ support with redrive capability
  • Cloud Pub/Sub: Built-in dead lettering with acknowledgment deadline extension

Saga pattern: managing distributed transactions without two-phase commit

In a synchronous system, you might wrap multi-service operations in a database transaction using two-phase commit. In event-driven systems, this is neither practical nor desirable. The saga pattern coordinates a business process across multiple services through a sequence of local transactions, each with a compensating transaction for rollback.

Choreography vs orchestration

Choreography: Each service emits events and reacts to events from other services. There is no central coordinator; the saga emerges from the interaction of services.

Best for: Simple workflows with 2-3 participants, stable team boundaries where each service is owned by a different team.

Orchestration: A central saga orchestrator service maintains the state of the saga and sends commands to participant services. Participants respond with events that the orchestrator consumes.

Best for: Complex workflows with 4+ participants, workflows that change frequently, workflows requiring visibility into current state.

Implementing orchestrator sagas

Orchestrator sagas require persistent state management. Each saga instance should be stored in a database with:

  • Saga ID
  • Current state (which step has completed)
  • Payload data (business identifiers, intermediate results)
  • History of completed steps (for audit and compensation)

State machine example: Order fulfillment

Happy path:
  [PENDING] → [PAYMENT_INITIATED] → [PAYMENT_CONFIRMED] → [INVENTORY_RESERVED] → [SHIPPING_SCHEDULED] → [COMPLETED]

Failure paths:
  [PAYMENT_INITIATED] → [PAYMENT_FAILED] → [COMPENSATED] → [FAILED]
  [INVENTORY_RESERVED] → [INVENTORY_FAILED] → [COMPENSATED] → [FAILED]

Each state transition is triggered by an event from a participant service. The orchestrator initiates compensation by sending compensating commands to all completed participants in reverse order.
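The orchestration loop can be sketched as follows. This is a deliberately minimal, in-memory version (real orchestrators persist state between steps, as described above); the step/compensation tuple shape is an assumption of this sketch:

```python
class Saga:
    """Minimal saga orchestrator: run steps in order, compensate in reverse."""

    def __init__(self, saga_id):
        self.saga_id = saga_id
        self.completed = []  # history of completed steps, kept for compensation

    def run(self, steps):
        """steps: list of (name, action, compensation) tuples."""
        for name, action, compensate in steps:
            try:
                action()
                self.completed.append((name, compensate))
            except Exception:
                self._compensate()
                return "FAILED"
        return "COMPLETED"

    def _compensate(self):
        # Send compensating commands to completed participants in reverse order.
        for name, compensate in reversed(self.completed):
            compensate()
```

In a production orchestrator, each transition would also be written to the saga's database row before the next command is sent, so a crashed orchestrator can resume or compensate on restart.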

Compensation design: not all operations can be undone

Compensating transactions are not always simple inverses of the original transaction. Consider:

  • Payment: Capture → Refund (straightforward)
  • Inventory reservation: Reserve → Release (straightforward)
  • Shipping: Schedule shipment → Cannot "unschedule" a shipment once dispatched (compensation: process return, issue refund)
  • Email notification: Send welcome email → Cannot "unsend" an email (compensation: send correction email, log the issue)

Design your saga with the understanding that some operations are compensable, some are not, and plan business processes accordingly.

Event sourcing: when the complexity is worth it

Event sourcing stores all changes to application state as a sequence of events rather than just the current state. To reconstruct the current state, you replay all events from the beginning of time.

When event sourcing makes sense

Use event sourcing when:

  1. Audit is a first-class requirement: Financial systems, healthcare applications, and regulated industries where the complete history of state changes is a regulatory requirement
  2. Complex business logic: Systems where the current state is insufficient to understand why the system is in its current state (insurance underwriting, credit decisioning)
  3. Temporal queries: Systems that need to answer "what was the state at time T?" queries (subscription systems, billing)
  4. Event replay is valuable: Systems where recomputing state from events is operationally useful (fixing bugs in business logic, backfilling derived data)

Avoid event sourcing when:

  1. Your domain is simple CRUD: Most traditional web applications don't benefit from event sourcing
  2. Your team lacks event sourcing experience: The learning curve is steep; the first event-sourced system will take longer than expected
  3. You don't have resources for tooling: Event sourcing requires tooling for snapshot management, event store queries, and replay infrastructure

Event sourcing implementation patterns

Event store schema:

Each event requires:

  • Event ID (UUID)
  • Aggregate ID (the entity the event belongs to)
  • Event type (e.g., OrderPlaced, PaymentCaptured)
  • Event data (business-specific payload)
  • Metadata (timestamp, causation ID, correlation ID, user ID)
  • Version number (for optimistic concurrency)
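The fields above map naturally onto a record type. A sketch, assuming Python dataclasses (the class and field names are illustrative, not a standard schema):

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class StoredEvent:
    aggregate_id: str   # the entity the event belongs to
    event_type: str     # e.g. "OrderPlaced", "PaymentCaptured"
    data: dict          # business-specific payload
    version: int        # per-aggregate sequence, for optimistic concurrency
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)  # causation/correlation/user IDs
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

On append, the store rejects an event whose `version` is not exactly one greater than the aggregate's last stored version, which is how concurrent writers are detected.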

Snapshot strategy:

Replaying millions of events to compute current state is prohibitively expensive. Implement snapshots:

  • Frequency: Every N events (e.g., every 100 events) or time-based (every 24 hours)
  • Storage: Same event store or separate snapshot store
  • Snapshot format: Complete aggregate state at the point of snapshot
  • Query pattern: Load latest snapshot, then replay events since snapshot
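The query pattern in the last bullet reduces to a small fold. A sketch, where `apply` is the aggregate's own event handler and the snapshot/event shapes are assumptions of this example:

```python
def rebuild_state(snapshot, events_since, apply, initial=None):
    """Reconstruct aggregate state from a snapshot plus subsequent events.

    snapshot: (state, version) tuple from the snapshot store, or None.
    events_since: events recorded after the snapshot's version.
    apply: pure function (state, event) -> new state.
    """
    state = snapshot[0] if snapshot else (initial or {})
    for event in events_since:
        state = apply(state, event)
    return state
```

With a snapshot every 100 events, a read replays at most 99 events instead of the aggregate's full history.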

Event versioning:

Event schemas evolve. When an event type changes, you need to handle both old and new versions:

  • Upcast: Read old events and transform to new schema on replay
  • Versioned event types: OrderPlacedV1, OrderPlacedV2
  • Separate read models: Project events into multiple read models optimized for queries
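An upcaster is typically a small registry of (event type, version) → transform functions applied on read. A sketch with an invented `OrderPlaced` v1→v2 migration (the field names and version layout are assumptions for illustration):

```python
# Registry of upcasters keyed by (event_type, schema_version).
# Assumed example: OrderPlaced v1 had a flat "amount"; v2 nests it with a currency.
UPCASTERS = {
    ("OrderPlaced", 1): lambda e: {
        **e,
        "version": 2,
        "data": {"total": {"amount": e["data"]["amount"], "currency": "USD"}},
    },
}

def upcast(event):
    """Apply upcasters repeatedly until the event reaches the latest version."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event
```

Because upcasting happens on replay, the events on disk are never rewritten; consumers only ever see the latest schema.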

Event sourcing operational concerns

Event sourcing introduces operational complexity beyond traditional state persistence:

  1. Event store performance: All reads require event replay; optimize with snapshots and read model projections
  2. Schema migration: Changing event schemas requires handling historical events with old schemas
  3. Event deletions: Regulatory "right to be forgotten" requires special handling in event logs
  4. Debugging complexity: Understanding current state requires understanding event history, not just database rows

Message broker selection: Kafka vs RabbitMQ vs cloud-native

The choice of message broker shapes your event-driven architecture patterns.

Kafka: streams, not queues

Kafka is a log-based messaging system optimized for streaming high-throughput, durable event streams. It is not a traditional message queue.

When to choose Kafka:

  • High-throughput event streaming (millions of events per second)
  • Multiple independent consumers reading the same events (stream processing, analytics)
  • Event sourcing or event replay requirements
  • Long event retention (days to weeks)

Kafka operational considerations:

  • Consumer group management for horizontal scaling
  • Partition sizing strategy (too few partitions limit parallelism; too many increase overhead)
  • Broker and ZooKeeper/KRaft management
  • Topic compaction for latest-state semantics

RabbitMQ: flexible routing, lower complexity

RabbitMQ is a traditional message broker with sophisticated routing capabilities through exchanges and bindings.

When to choose RabbitMQ:

  • Workflows requiring complex routing (content-based routing, multicast, request-reply)
  • Lower operational complexity than Kafka
  • Workloads that don't require Kafka's scale or durability guarantees
  • Teams with existing RabbitMQ expertise

RabbitMQ operational considerations:

  • Queue durability and mirroring for high availability
  • Connection and channel management
  • Memory and disk alarms
  • Plugin ecosystem (Shovel for message movement, Federation for multi-region)

Cloud-native options: managed infrastructure

AWS SQS/SNS/Kinesis:

  • SQS: Simple queue service, per-message pricing, no infrastructure management
  • SNS: Pub/sub messaging, fanout to multiple subscribers
  • Kinesis: Real-time streaming, integrates with Lambda and analytics services

Google Cloud Pub/Sub:

  • At-least-once delivery, integrates with GCP ecosystem
  • Built-in dead lettering and acknowledgment deadline extension

Azure Service Bus:

  • Enterprise messaging features (sessions, scheduled delivery, message deferral)
  • Protocol diversity (AMQP, MQTT, HTTP)

Cloud-native options reduce operational overhead but introduce vendor lock-in and potential cost at scale.

Testing event-driven systems: beyond unit tests

Event-driven systems require testing beyond traditional unit tests:

  1. Consumer contract tests: Verify that consumers can handle all message variants they subscribe to
  2. Producer contract tests: Verify that producers emit messages that match expected schemas
  3. Integration tests with embedded broker: Test producer and consumer together with a real message broker (Testcontainers for Kafka/RabbitMQ)
  4. Chaos tests: Simulate broker unavailability, consumer crashes, and network partitions to verify resilience
  5. Saga tests: Verify that saga orchestrators correctly compensate failures at each step
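A consumer contract test (item 1) can be as simple as validating every fixture message against the fields the consumer actually reads. A minimal sketch; the event types and required fields here are invented examples, not a schema standard:

```python
# Fields this consumer reads from each message type it subscribes to
# (illustrative example, not a real schema registry).
REQUIRED_FIELDS = {
    "OrderPlaced": {"orderId", "total"},
    "PaymentCaptured": {"orderId", "paymentId"},
}

def validate_message(message):
    """Raise if a message is missing a field the consumer depends on."""
    missing = REQUIRED_FIELDS[message["type"]] - message["payload"].keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True
```

In practice the fixture messages come from the producer's own test suite (or a shared schema registry), so a producer-side schema change breaks the consumer's contract tests before it breaks production.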

Decision prompts for architecture teams

  • When a message fails processing, can your operations team inspect the failure, understand the root cause, and replay the message without deploying new code?
  • For your most critical multi-service workflows, is the coordination logic (saga) visible and debuggable, or is it hidden in implicit choreography?
  • Have you explicitly designed compensating transactions for each step in your distributed transactions, or are you assuming success?
  • If you're event sourcing, do you have tooling for snapshot management and event replay, or will your first event replay be a manual production emergency?

Designing a production event-driven architecture with proper failure handling and resilience? Talk to Imperialis about saga orchestration, event sourcing evaluation, and message broker selection.
