
Event-driven architecture patterns in production: dead letters, sagas, and when to event source

Event-driven systems promise loose coupling and scalability. In production, they deliver dead letter queues, saga orchestration, and the difficult decision of when event sourcing is worth the complexity.

3/19/2026 · 8 min read · Cloud

Last updated: 3/19/2026

Executive summary

The transition from synchronous request/response to asynchronous event-driven architecture solves real problems: services are loosely coupled, systems can scale independently, and business processes can execute over hours or days rather than within a single HTTP request timeout.

But production event-driven systems introduce new failure modes that don't exist in synchronous systems. What happens when a message consumer crashes mid-processing? How do you compensate for a payment that succeeded when the subsequent inventory reservation failed? When is event sourcing's promise of perfect audit trails worth its implementation complexity?

This post covers the patterns that distinguish proof-of-concept event systems from production-grade event-driven architectures.

Dead letter queues: the pattern you need before you go live

A dead letter queue (DLQ) is a destination for messages that cannot be processed successfully after a configured number of attempts. It is the most under-appreciated pattern in event-driven systems, and the first one you'll wish you had when you hit production.

DLQ design patterns

Per-consumer DLQ vs shared DLQ:

For most systems, create a separate DLQ per consumer (or per consumer group). This allows targeted investigation and reprocessing without unrelated messages creating noise. A shared DLQ is acceptable only for very small teams where operational simplicity outweighs isolation benefits.

Message enrichment before DLQ:

When moving a message to the DLQ, enrich it with operational metadata so the failure context travels with the original payload:

{
  "originalMessage": { /* original payload */ },
  "failureMetadata": {
    "consumer": "order-fulfillment-service",
    "failureReason": "INVENTORY_SYSTEM_TIMEOUT",
    "retryCount": 3,
    "firstFailureAt": "2026-03-19T10:23:45Z",
    "lastFailureAt": "2026-03-19T10:26:12Z",
    "processingAttemptDurationMs": [1250, 1800, 2100]
  }
}

This metadata is invaluable when investigating production issues and deciding whether to reprocess, modify, or permanently discard messages.
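A small sketch of the enrichment step, assuming a hypothetical `to_dlq_envelope` helper called just before publishing to the DLQ (the field names follow the payload shown above; the `attempts` structure is an assumption):

```python
def to_dlq_envelope(original, consumer, reason, attempts):
    """Wrap a failed message with failure metadata before publishing to the DLQ.

    `attempts` is a list of (started_at, duration_ms) tuples, one per
    processing attempt (a hypothetical shape for this sketch).
    """
    return {
        "originalMessage": original,
        "failureMetadata": {
            "consumer": consumer,
            "failureReason": reason,
            "retryCount": len(attempts),
            "firstFailureAt": attempts[0][0],
            "lastFailureAt": attempts[-1][0],
            "processingAttemptDurationMs": [d for _, d in attempts],
        },
    }
```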

Retry strategies: exponential backoff with jitter

When message processing fails, an immediate retry often creates a thundering herd that worsens the underlying issue. Exponential backoff with jitter is the standard approach:

  • Initial delay: 1 second
  • Exponential base: 2 (delay doubles each retry)
  • Maximum delay: 5 minutes
  • Jitter: ±25% random variation to prevent synchronization

For transient failures (network timeouts, temporary service unavailability), this approach allows the system to recover without overwhelming dependencies. For permanent failures (invalid message schema, missing data), the message quickly reaches the DLQ rather than consuming retry capacity indefinitely.
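The backoff schedule above can be sketched in a few lines; the parameter defaults match the bullets (1 s initial delay, base 2, 5-minute cap, ±25% jitter):

```python
import random

def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=300.0, jitter=0.25):
    """Delay in seconds before retry number `attempt` (0-indexed).

    Exponential growth capped at max_delay, with +/-25% random jitter
    so that retries from many consumers do not synchronize.
    """
    delay = min(base * (factor ** attempt), max_delay)
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

Once `attempt` exceeds the configured maximum, the consumer stops retrying and routes the message to the DLQ instead.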

DLQ operations: visibility and reprocessing

A DLQ is operationally useful only when you have tooling to inspect and reprocess messages. Minimum requirements:

  1. Search and filter: Find messages by error type, original topic, or time range
  2. Message inspection: View the original payload and failure metadata
  3. Replay mechanism: Reprocess messages either individually or in bulk
  4. Edit before replay: Fix malformed messages without redeploying code

Tooling options:

  • Kafka: UI tools like Kafka Explorer, or custom consumer with admin APIs
  • RabbitMQ: Dead letter exchange with the Shovel plugin for message movement
  • AWS SQS: Built-in DLQ support with redrive capability
  • Cloud Pub/Sub: Built-in dead lettering with acknowledgment deadline extension

Saga pattern: managing distributed transactions without two-phase commit

In a synchronous system, you might wrap multi-service operations in a database transaction using two-phase commit. In event-driven systems, this is neither practical nor desirable. The saga pattern coordinates a business process across multiple services through a sequence of local transactions, each with a compensating transaction for rollback.

Choreography vs orchestration

Choreography: Each service emits events and reacts to events from other services. There is no central coordinator; the saga emerges from the interaction of services.

Best for: Simple workflows with 2-3 participants, stable team boundaries where each service is owned by a different team.

Orchestration: A central saga orchestrator service maintains the state of the saga and sends commands to participant services. Participants respond with events that the orchestrator consumes.

Best for: Complex workflows with 4+ participants, workflows that change frequently, workflows requiring visibility into current state.

Implementing orchestrator sagas

Orchestrator sagas require persistent state management. Each saga instance should be stored in a database with:

  • Saga ID
  • Current state (which step has completed)
  • Payload data (business identifiers, intermediate results)
  • History of completed steps (for audit and compensation)

State machine example: Order fulfillment

Happy path:
  [PENDING] → [PAYMENT_INITIATED] → [PAYMENT_CONFIRMED] → [INVENTORY_RESERVED] → [SHIPPING_SCHEDULED] → [COMPLETED]

Failure paths:
  [PAYMENT_INITIATED] → [PAYMENT_FAILED] → [COMPENSATED] → [FAILED]
  [INVENTORY_RESERVED] → [INVENTORY_FAILED] → [COMPENSATED] → [FAILED]

Each state transition is triggered by an event from a participant service. The orchestrator initiates compensation by sending compensating commands to all completed participants in reverse order.
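The orchestration loop can be sketched as follows. This is a deliberately minimal, in-memory version (real orchestrators persist state between steps, as described above); the step/compensation tuple shape is an assumption of this sketch:

```python
class Saga:
    """Minimal saga orchestrator: run steps in order, compensate in reverse."""

    def __init__(self, saga_id):
        self.saga_id = saga_id
        self.completed = []  # history of completed steps, kept for compensation

    def run(self, steps):
        """steps: list of (name, action, compensation) tuples."""
        for name, action, compensate in steps:
            try:
                action()
                self.completed.append((name, compensate))
            except Exception:
                self._compensate()
                return "FAILED"
        return "COMPLETED"

    def _compensate(self):
        # Send compensating commands to completed participants in reverse order.
        for name, compensate in reversed(self.completed):
            compensate()
```

In a production orchestrator, each transition would also be written to the saga's database row before the next command is sent, so a crashed orchestrator can resume or compensate on restart.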

Compensation design: not all operations can be undone

Compensating transactions are not always simple inverses of the original transaction. Consider:

  • Payment: Capture → Refund (straightforward)
  • Inventory reservation: Reserve → Release (straightforward)
  • Shipping: Schedule shipment → Cannot "unschedule" a shipment once dispatched (compensation: process return, issue refund)
  • Email notification: Send welcome email → Cannot "unsend" an email (compensation: send correction email, log the issue)

Design your saga with the understanding that some operations are compensable, some are not, and plan business processes accordingly.

Event sourcing: when the complexity is worth it

Event sourcing stores all changes to application state as a sequence of events rather than just the current state. To reconstruct the current state, you replay all events from the beginning of time.

When event sourcing makes sense

Use event sourcing when:

  1. Audit is a first-class requirement: Financial systems, healthcare applications, and regulated industries where the complete history of state changes is a regulatory requirement
  2. Complex business logic: Systems where the current state is insufficient to understand why the system is in its current state (insurance underwriting, credit decisioning)
  3. Temporal queries: Systems that need to answer "what was the state at time T?" queries (subscription systems, billing)
  4. Event replay is valuable: Systems where recomputing state from events is operationally useful (fixing bugs in business logic, backfilling derived data)

Avoid event sourcing when:

  1. Your domain is simple CRUD: Most traditional web applications don't benefit from event sourcing
  2. Your team lacks event sourcing experience: The learning curve is steep; the first event-sourced system will take longer than expected
  3. You don't have resources for tooling: Event sourcing requires tooling for snapshot management, event store queries, and replay infrastructure

Event sourcing implementation patterns

Event store schema:

Each event requires:

  • Event ID (UUID)
  • Aggregate ID (the entity the event belongs to)
  • Event type (e.g., OrderPlaced, PaymentCaptured)
  • Event data (business-specific payload)
  • Metadata (timestamp, causation ID, correlation ID, user ID)
  • Version number (for optimistic concurrency)
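The fields above map naturally onto a record type. A sketch, assuming Python dataclasses (the class and field names are illustrative, not a standard schema):

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class StoredEvent:
    aggregate_id: str   # the entity the event belongs to
    event_type: str     # e.g. "OrderPlaced", "PaymentCaptured"
    data: dict          # business-specific payload
    version: int        # per-aggregate sequence, for optimistic concurrency
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: dict = field(default_factory=dict)  # causation/correlation/user IDs
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

On append, the store rejects an event whose `version` is not exactly one greater than the aggregate's last stored version, which is how concurrent writers are detected.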

Snapshot strategy:

Replaying millions of events to compute current state is prohibitively expensive. Implement snapshots:

  • Frequency: Every N events (e.g., every 100 events) or time-based (every 24 hours)
  • Storage: Same event store or separate snapshot store
  • Snapshot format: Complete aggregate state at the point of snapshot
  • Query pattern: Load latest snapshot, then replay events since snapshot
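The query pattern in the last bullet reduces to a small fold. A sketch, where `apply` is the aggregate's own event handler and the snapshot/event shapes are assumptions of this example:

```python
def rebuild_state(snapshot, events_since, apply, initial=None):
    """Reconstruct aggregate state from a snapshot plus subsequent events.

    snapshot: (state, version) tuple from the snapshot store, or None.
    events_since: events recorded after the snapshot's version.
    apply: pure function (state, event) -> new state.
    """
    state = snapshot[0] if snapshot else (initial or {})
    for event in events_since:
        state = apply(state, event)
    return state
```

With a snapshot every 100 events, a read replays at most 99 events instead of the aggregate's full history.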

Event versioning:

Event schemas evolve. When an event type changes, you need to handle both old and new versions:

  • Upcast: Read old events and transform to new schema on replay
  • Versioned event types: OrderPlacedV1, OrderPlacedV2
  • Separate read models: Project events into multiple read models optimized for queries
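An upcaster is typically a small registry of (event type, version) → transform functions applied on read. A sketch with an invented `OrderPlaced` v1→v2 migration (the field names and version layout are assumptions for illustration):

```python
# Registry of upcasters keyed by (event_type, schema_version).
# Assumed example: OrderPlaced v1 had a flat "amount"; v2 nests it with a currency.
UPCASTERS = {
    ("OrderPlaced", 1): lambda e: {
        **e,
        "version": 2,
        "data": {"total": {"amount": e["data"]["amount"], "currency": "USD"}},
    },
}

def upcast(event):
    """Apply upcasters repeatedly until the event reaches the latest version."""
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event
```

Because upcasting happens on replay, the events on disk are never rewritten; consumers only ever see the latest schema.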

Event sourcing operational concerns

Event sourcing introduces operational complexity beyond traditional state persistence:

  1. Event store performance: All reads require event replay; optimize with snapshots and read model projections
  2. Schema migration: Changing event schemas requires handling historical events with old schemas
  3. Event deletions: Regulatory "right to be forgotten" requires special handling in event logs
  4. Debugging complexity: Understanding current state requires understanding event history, not just database rows

Message broker selection: Kafka vs RabbitMQ vs cloud-native

The choice of message broker shapes your event-driven architecture patterns.

Kafka: streams, not queues

Kafka is a log-based messaging system optimized for streaming high-throughput, durable event streams. It is not a traditional message queue.

When to choose Kafka:

  • High-throughput event streaming (millions of events per second)
  • Multiple independent consumers reading the same events (stream processing, analytics)
  • Event sourcing or event replay requirements
  • Long event retention (days to weeks)

Kafka operational considerations:

  • Consumer group management for horizontal scaling
  • Partition sizing strategy (too few partitions limit parallelism; too many increase overhead)
  • Broker and ZooKeeper/KRaft management
  • Topic compaction for latest-state semantics

RabbitMQ: flexible routing, lower complexity

RabbitMQ is a traditional message broker with sophisticated routing capabilities through exchanges and bindings.

When to choose RabbitMQ:

  • Workflows requiring complex routing (content-based routing, multicast, request-reply)
  • Lower operational complexity than Kafka
  • Workloads that don't require Kafka's scale or durability guarantees
  • Teams with existing RabbitMQ expertise

RabbitMQ operational considerations:

  • Queue durability and mirroring for high availability
  • Connection and channel management
  • Memory and disk alarms
  • Plugin ecosystem (Shovel for message movement, Federation for multi-region)

Cloud-native options: managed infrastructure

AWS SQS/SNS/Kinesis:

  • SQS: Simple queue service, per-message pricing, no infrastructure management
  • SNS: Pub/sub messaging, fanout to multiple subscribers
  • Kinesis: Real-time streaming, integrates with Lambda and analytics services

Google Cloud Pub/Sub:

  • At-least-once delivery, integrates with GCP ecosystem
  • Built-in dead lettering and acknowledgment deadline extension

Azure Service Bus:

  • Enterprise messaging features (sessions, scheduled delivery, message deferral)
  • Protocol diversity (AMQP, MQTT, HTTP)

Cloud-native options reduce operational overhead but introduce vendor lock-in and potential cost at scale.

Testing event-driven systems: beyond unit tests

Event-driven systems require testing beyond traditional unit tests:

  1. Consumer contract tests: Verify that consumers can handle all message variants they subscribe to
  2. Producer contract tests: Verify that producers emit messages that match expected schemas
  3. Integration tests with embedded broker: Test producer and consumer together with a real message broker (Testcontainers for Kafka/RabbitMQ)
  4. Chaos tests: Simulate broker unavailability, consumer crashes, and network partitions to verify resilience
  5. Saga tests: Verify that saga orchestrators correctly compensate failures at each step
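A consumer contract test (item 1) can be as simple as validating every fixture message against the fields the consumer actually reads. A minimal sketch; the event types and required fields here are invented examples, not a schema standard:

```python
# Fields this consumer reads from each message type it subscribes to
# (illustrative example, not a real schema registry).
REQUIRED_FIELDS = {
    "OrderPlaced": {"orderId", "total"},
    "PaymentCaptured": {"orderId", "paymentId"},
}

def validate_message(message):
    """Raise if a message is missing a field the consumer depends on."""
    missing = REQUIRED_FIELDS[message["type"]] - message["payload"].keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True
```

In practice the fixture messages come from the producer's own test suite (or a shared schema registry), so a producer-side schema change breaks the consumer's contract tests before it breaks production.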

Decision prompts for architecture teams

  • When a message fails processing, can your operations team inspect the failure, understand the root cause, and replay the message without deploying new code?
  • For your most critical multi-service workflows, is the coordination logic (saga) visible and debuggable, or is it hidden in implicit choreography?
  • Have you explicitly designed compensating transactions for each step in your distributed transactions, or are you assuming success?
  • If you're event sourcing, do you have tooling for snapshot management and event replay, or will your first event replay be a manual production emergency?

Designing a production event-driven architecture with proper failure handling and resilience? Talk to Imperialis about saga orchestration, event sourcing evaluation, and message broker selection.
