Event-Driven Architecture with Kafka: Beyond Message Queues

How Apache Kafka enables true event-driven architectures with persistent event streams, consumer groups, and exactly-once semantics.

Last updated: 3/12/2026 · 7 min read

Executive summary

Traditional message queues (RabbitMQ, SQS) are designed for task distribution: here's a job, please process it. Apache Kafka is designed for something fundamentally different: event streaming.

The distinction is profound. A queue holds messages until they're consumed, then they're gone. Kafka retains events for a configurable retention period, allowing multiple consumers to replay the event stream at their own pace. This decoupling enables new use cases that are impossible with traditional queues:

  • Late-joining consumers: New services can consume historical events to rebuild state.
  • Event replay: Bugs can be fixed and consumers can "rewind" to reprocess events.
  • Multiple downstream systems: Analytics, search indexing, and business logic can all consume the same event stream independently.

In 2026, Kafka has become the default choice for event-driven architectures that require durable, replayable event streams.

Core concepts: Why Kafka is different

Topics, partitions, and consumer groups

Kafka organizes events into topics, which are split into partitions. Each partition is an ordered, immutable log of events:

Topic: orders
Partition 0: [event1] → [event2] → [event3] → [event4]
Partition 1: [event5] → [event6] → [event7] → [event8]

Events within a partition maintain strict ordering. Across partitions, ordering is not guaranteed.

Consumer groups provide the scaling mechanism. Each partition is consumed by exactly one consumer within a group. If you have 4 partitions and 4 consumers in a group, each consumer handles one partition. If you add a 5th consumer, it sits idle: there is no partition left to assign to it, so a topic's partition count caps a group's useful parallelism.

This design enables both parallel processing (multiple partitions) and stateful consumers (each consumer knows exactly which events it has processed).
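Per-entity ordering follows from partitioning by key: Kafka's default partitioner hashes the message key (murmur2 in the Java client) and takes it modulo the partition count, so every event for one key lands in the same partition. A minimal sketch of the idea, using Go's FNV-1a hash instead of murmur2 purely to keep it dependency-free:

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// partitionForKey maps a key to a partition the way Kafka's default
// partitioner does in spirit: hash the key, then take it modulo the
// number of partitions. (Kafka itself uses murmur2; FNV-1a is used
// here only for illustration.)
func partitionForKey(key string, numPartitions int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32() % uint32(numPartitions))
}

func main() {
    // Every event for user-42 maps to the same partition, so its
    // events keep their relative order.
    p1 := partitionForKey("user-42", 4)
    p2 := partitionForKey("user-42", 4)
    fmt.Println(p1 == p2) // true: same key, same partition
}
```

This is why choosing a good key (e.g. user ID, order ID) is a design decision, not an implementation detail: the key defines the ordering boundary.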

Retention vs TTL

Message queues typically delete messages after consumption. Kafka retains events based on time or size:

```properties
# Retain events for 7 days
log.retention.hours=168

# Or retain events up to 10GB
log.retention.bytes=10737418240
```

This retention window is what enables replayability. If a consumer crashes and needs to reprocess the last hour of events, Kafka can provide them.
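The retention model can be pictured as an append-only log that trims from the head once it exceeds its budget, while consumers read from any retained offset. A toy model (the type and method names are illustrative, not Kafka APIs; retention is counted in events instead of bytes or hours):

```go
package main

import "fmt"

// retainedLog models one partition: an append-only slice that drops
// its oldest events once retention is exceeded. startOffset tracks
// the offset of the first event still retained.
type retainedLog struct {
    events      []string
    startOffset int
    maxEvents   int // toy stand-in for log.retention.bytes / hours
}

func (l *retainedLog) append(e string) {
    l.events = append(l.events, e)
    for len(l.events) > l.maxEvents { // enforce retention
        l.events = l.events[1:]
        l.startOffset++
    }
}

// readFrom replays all retained events at or after the given offset:
// this is what lets a crashed consumer reprocess recent history.
func (l *retainedLog) readFrom(offset int) []string {
    if offset < l.startOffset {
        offset = l.startOffset // older events have been retired
    }
    return l.events[offset-l.startOffset:]
}

func main() {
    log := &retainedLog{maxEvents: 3}
    for _, e := range []string{"e1", "e2", "e3", "e4", "e5"} {
        log.append(e)
    }
    fmt.Println(log.readFrom(0)) // [e3 e4 e5]: only retained events replay
}
```

The point of the sketch: replay is bounded by retention, so the retention window must be sized for your worst-case recovery scenario, not just your steady state.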

When Kafka accelerates delivery

Kafka provides measurable velocity gains in specific scenarios:

  • State propagation: When services need to stay synchronized through events rather than polling databases.
  • Multi-consumer patterns: When multiple downstream systems need the same events independently.
  • Event sourcing: When the event log itself is the source of truth for application state.
  • Analytics pipelines: When business intelligence and real-time monitoring need access to the same event stream.

Decision prompts for your context:

  • Do you need to replay events for new consumers or recovery scenarios?
  • Do you have multiple independent systems that need the same events?
  • Is event ordering within a business entity important (e.g., all events for a single user)?

Consumer patterns and anti-patterns

Pattern 1: At-least-once delivery

Kafka guarantees at-least-once delivery. If a consumer crashes after processing an event but before committing its offset, it will reprocess that event on restart.

```go
// Consumer with at-least-once semantics
for {
    records := consumer.Poll(100 * time.Millisecond)

    for _, record := range records {
        // Process event
        processEvent(record)

        // Commit offset AFTER successful processing
        consumer.Commit(record)
    }
}
```

Operational implication: Consumers must be idempotent. Reprocessing the same event twice must have the same result as processing it once.
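Idempotency is typically achieved by remembering already-processed event IDs, or by making the write itself naturally idempotent (e.g. an upsert keyed by event ID). A hedged sketch of the dedup-set approach; in production the seen-set would live in the same datastore as the state, updated in one transaction:

```go
package main

import "fmt"

type event struct {
    ID    string
    Value int
}

// idempotentConsumer applies each event at most once by remembering
// processed event IDs, so a redelivered event is a no-op.
type idempotentConsumer struct {
    seen  map[string]bool
    total int // example state: a running sum
}

func (c *idempotentConsumer) process(e event) {
    if c.seen[e.ID] { // redelivery after a crash: skip
        return
    }
    c.total += e.Value
    c.seen[e.ID] = true
}

func main() {
    c := &idempotentConsumer{seen: map[string]bool{}}
    e := event{ID: "order-1", Value: 10}
    c.process(e)
    c.process(e) // at-least-once delivery may replay the same event
    fmt.Println(c.total) // 10, not 20
}
```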

Pattern 2: Consumer groups for horizontal scaling

Consumer groups enable horizontal scaling without coordination:

```
# Three instances of the same service
orders-consumer-1: partitions [0, 1]
orders-consumer-2: partitions [2, 3]
orders-consumer-3: partitions [4, 5]
```

When a new instance joins or leaves, Kafka automatically rebalances partitions across the group.
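The assignment shown above is what a range-style assignor produces: partitions divided into contiguous chunks, one chunk per consumer. A simplified version of that logic (Kafka's real assignors also handle multiple topics and sticky reassignment):

```go
package main

import "fmt"

// rangeAssign splits partitions into contiguous ranges, one per
// consumer, mimicking Kafka's range assignor for a single topic.
// When the count doesn't divide evenly, the first consumers get
// one extra partition each.
func rangeAssign(numPartitions int, consumers []string) map[string][]int {
    out := make(map[string][]int)
    per := numPartitions / len(consumers)
    extra := numPartitions % len(consumers)
    p := 0
    for i, c := range consumers {
        n := per
        if i < extra {
            n++
        }
        for j := 0; j < n; j++ {
            out[c] = append(out[c], p)
            p++
        }
    }
    return out
}

func main() {
    a := rangeAssign(6, []string{"orders-consumer-1", "orders-consumer-2", "orders-consumer-3"})
    fmt.Println(a["orders-consumer-1"]) // [0 1]
    fmt.Println(a["orders-consumer-2"]) // [2 3]
    fmt.Println(a["orders-consumer-3"]) // [4 5]
}
```

Running the same function with 4 consumers and 6 partitions shows why partition counts matter: two consumers get two partitions and two get one, creating uneven load.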

Anti-pattern: Synchronous consumers in event pipelines

A common mistake is making consumers synchronous, chaining them together:

Producer → [Kafka] → Consumer A → Consumer B → Consumer C

This defeats the purpose of Kafka. Each consumer should process independently:

Producer → [Kafka] → Consumer A
                   → Consumer B
                   → Consumer C
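Independence falls out of per-group offsets: each consumer group keeps its own position in the log, so a slow analytics consumer never blocks business logic. A toy model of that bookkeeping:

```go
package main

import "fmt"

// group models a consumer group: it tracks its own offset into a
// shared log, so groups progress independently of one another.
type group struct {
    name   string
    offset int
}

// consume reads up to n events from the group's current position
// and advances its private offset.
func (g *group) consume(log []string, n int) []string {
    end := g.offset + n
    if end > len(log) {
        end = len(log)
    }
    batch := log[g.offset:end]
    g.offset = end
    return batch
}

func main() {
    log := []string{"e1", "e2", "e3", "e4"}
    analytics := &group{name: "analytics"}
    billing := &group{name: "billing"}

    billing.consume(log, 4)   // billing has processed everything...
    analytics.consume(log, 1) // ...while analytics is still on e1

    fmt.Println(billing.offset, analytics.offset) // 4 1
}
```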

Schema evolution strategies

As your event schemas evolve, you need strategies to maintain compatibility:

Backward-compatible changes

  • Adding optional fields
  • Renaming fields with aliases
  • Adding default values

Forward-compatible changes

  • Removing optional fields
  • Changing field types with converters
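Adding an optional field with a default is backward compatible because old events simply take the default when decoded. A sketch using JSON encoding in Go (the field names and the `"EUR"` default are illustrative):

```go
package main

import (
    "encoding/json"
    "fmt"
)

// OrderEvent v2 adds an optional Currency field. Old v1 events that
// lack the field still decode; the consumer fills in a default.
type OrderEvent struct {
    OrderID  string `json:"order_id"`
    Amount   int    `json:"amount"`
    Currency string `json:"currency,omitempty"` // new in v2
}

func decode(data []byte) (OrderEvent, error) {
    var e OrderEvent
    if err := json.Unmarshal(data, &e); err != nil {
        return e, err
    }
    if e.Currency == "" {
        e.Currency = "EUR" // default for events written before v2
    }
    return e, nil
}

func main() {
    oldEvent := []byte(`{"order_id":"o-1","amount":100}`) // v1 payload
    e, _ := decode(oldEvent)
    fmt.Println(e.Currency) // EUR: the old event decodes with the default
}
```

With Avro and a schema registry, the default lives in the schema itself rather than in consumer code, which is exactly the value the registry adds.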

Schema registry integration

Confluent Schema Registry provides schema management and compatibility enforcement:

```java
// Producer with schema validation
Producer<String, OrderEvent> producer = new KafkaProducer<>(
    props,
    new StringSerializer(),
    new KafkaAvroSerializer(schemaRegistry)
);
```

The registry ensures that producers only send valid schemas and consumers can deserialize them correctly.

Operational considerations

Kafka introduces operational complexity that must be managed:

Broker management

  • Broker failures: Kafka is designed to tolerate broker failures with replication. Configure replication factors appropriate for your durability requirements (typically 3 for production).
  • Rebalancing: When consumer groups change, partitions are rebalanced. Configure appropriate rebalancing timeouts to prevent stuck consumers.
  • Resource allocation: Kafka is memory and I/O intensive. Allocate dedicated resources to avoid contention with other services.

Consumer lag monitoring

Consumer lag—the difference between the latest event in Kafka and the last event processed by a consumer—is the critical health metric:

```bash
# Monitor lag using the Kafka CLI
kafka-consumer-groups --bootstrap-server localhost:9092 \
    --group orders-consumer --describe
```

Alerting on increasing lag enables proactive intervention before consumers fall permanently behind.
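Under the hood, lag per partition is just the log-end offset minus the group's committed offset; the CLI above reports exactly this. A sketch of the arithmetic (in practice both offset maps come from the broker):

```go
package main

import "fmt"

// lag computes consumer lag per partition: how far the committed
// offset trails the log-end offset. A steadily growing value means
// the consumer cannot keep up with the producers.
func lag(logEnd, committed map[int]int64) map[int]int64 {
    out := make(map[int]int64)
    for p, end := range logEnd {
        out[p] = end - committed[p]
    }
    return out
}

func main() {
    logEnd := map[int]int64{0: 1500, 1: 2000}
    committed := map[int]int64{0: 1450, 1: 2000}
    // Partition 0 lags by 50 events; partition 1 is fully caught up.
    fmt.Println(lag(logEnd, committed))
}
```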

Dead letter queues

When a consumer repeatedly fails to process an event, it should send the event to a dead letter queue (DLQ) for manual inspection rather than blocking the entire pipeline:

```go
if err := processEvent(record); err != nil {
    // Send to DLQ for manual investigation
    dlqProducer.Send(record)
    // Commit offset to continue processing other events
    consumer.Commit(record)
}
```

Performance optimization strategies

Partition sizing

More partitions enable more parallelism but increase overhead:

  • Too few partitions: Limits consumer parallelism and creates hotspots.
  • Too many partitions: Increases broker metadata overhead and can cause rebalancing storms.

A practical starting point: Number of partitions = Number of consumers × Target parallelism factor (typically 2-3x).

Batching and compression

Kafka supports message batching and compression to reduce network overhead:

```properties
# Batch messages before sending
batch.size=16384
linger.ms=5

# Use compression (gzip, snappy, lz4, zstd)
compression.type=lz4
```

Compression typically provides 3-5x size reduction with minimal CPU overhead.

Exactly-once semantics

Kafka supports exactly-once processing using transactions and idempotent producers:

```properties
# Producer: enable idempotent writes
enable.idempotence=true

# Producer: a transactional.id enables transactions across topics
transactional.id=orders-processor-1

# Consumer: read only committed transactional events
isolation.level=read_committed
```

This eliminates duplicate processing but requires careful schema design and transaction boundary management.
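The idempotent-producer half of this can be sketched as broker-side deduplication by producer ID and sequence number: a retried batch whose sequence has already been appended is acknowledged but not written again. A toy model of that bookkeeping (the types are illustrative, not broker internals):

```go
package main

import "fmt"

// brokerPartition models the broker-side dedup used by idempotent
// producers: each producer ID carries a monotonically increasing
// sequence number, and a retried (duplicate) sequence is dropped.
type brokerPartition struct {
    log     []string
    lastSeq map[string]int // producerID -> highest sequence appended
}

// append writes the value unless this producer's sequence was
// already seen; it returns whether the value was actually appended.
func (b *brokerPartition) append(producerID string, seq int, value string) bool {
    last, ok := b.lastSeq[producerID]
    if ok && seq <= last {
        return false // duplicate from a producer retry: ack, don't re-append
    }
    b.log = append(b.log, value)
    b.lastSeq[producerID] = seq
    return true
}

func main() {
    b := &brokerPartition{lastSeq: map[string]int{}}
    b.append("producer-1", 1, "order-created")
    b.append("producer-1", 1, "order-created") // network retry of same batch
    fmt.Println(len(b.log)) // 1: the duplicate was not appended
}
```

Transactions extend this guarantee across multiple topic-partitions, which is where `read_committed` consumers come in.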

30-day implementation plan

  1. Identify event sources: Map which business events are natural candidates for streaming.
  2. Design event schemas: Define initial schemas with backward compatibility in mind.
  3. Select deployment model: Choose between self-hosted Kafka, Confluent Cloud, or managed services.
  4. Implement first consumers: Start with low-risk consumers (analytics, logging) to build operational maturity.
  5. Add monitoring: Deploy consumer lag monitoring and alerting before production rollout.
  6. Document failure modes: Create runbooks for common failure scenarios (consumer crashes, broker failures).

Production validation checklist

Indicators to track:

  • Consumer lag across all consumer groups (should remain stable).
  • End-to-end event latency (time from event creation to processing).
  • Rebalance frequency and duration (frequent rebalances indicate instability).
  • Dead letter queue rate (high rates indicate systemic issues).

Platform decisions for the next cycle

  • Define event schema governance: who approves schema changes and how are they communicated?
  • Establish retention policies: how long should events be retained for replay scenarios?
  • Configure replication factors: balance between durability and resource cost.

Need help designing an event-driven architecture that scales without operational nightmares? Talk to Imperialis about custom software to design and implement this evolution.
