This document outlines the standards, patterns, and best practices for implementing event-driven architectures at Bayat.
- Introduction
- Core Concepts
- Event Patterns
- Communication Patterns
- Event Schema Design
- Event Storage and Routing
- Error Handling
- Monitoring and Observability
- Testing Strategies
- Implementation Guidelines
- Tooling and Infrastructure
- Governance
- Migration Strategies
- Framework-Specific Implementation
- Case Studies
Event-driven architecture (EDA) is an architectural pattern that promotes the production, detection, consumption of, and reaction to events. An event can be defined as a significant change in state or an occurrence that is of interest to the system or its users.
- Loose Coupling: Services communicate without direct dependencies
- Scalability: Components can scale independently
- Resilience: Failure in one component doesn't affect others directly
- Flexibility: Easier to add new components that react to existing events
- Real-time Processing: Enables immediate reaction to events as they occur
Consider EDA when:
- Building systems with asynchronous workflows
- Implementing real-time features
- Designing microservices that need to coordinate without tight coupling
- Managing complex business processes across multiple services
- Creating systems that need audit trails of all state changes
An event represents a fact that occurred at a specific point in time. Events are immutable and should be expressed in the past tense (for example, OrderCreated rather than CreateOrder).
- Domain Events: Represent business-significant occurrences within a domain
- Integration Events: Used for communication between different bounded contexts
- Command Events: Represent instructions to perform actions
- Query Events: Request information without changing state
- Notification Events: Inform subscribed parties about system occurrences
- Event Producers: Components that detect, capture, and publish events to an event broker
- Event Consumers: Components that subscribe to events and react to them
- Event Broker: Middleware that routes events from producers to consumers
- Event Store: Persistent storage of events, serving as the source of truth
Event Notification: The simplest pattern, in which a service publishes an event to notify other services about a change.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Producer │───>│ Broker │───>│ Consumer │
└──────────┘ └──────────┘ └──────────┘
When to use: For simple coordination between services where consumers don't need complete information.
Event-Carried State Transfer: Events contain the complete relevant state, reducing the need for consumers to query the producer for additional data.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Producer │───>│ Broker │───>│ Consumer │
└──────────┘ └──────────┘ └──────────┘
│ │
│ │
v v
┌──────────┐ ┌──────────┐
│ Producer │ │ Consumer │
│ Database │ │ Database │
└──────────┘ └──────────┘
When to use: When consumers need comprehensive information and reducing network calls is important.
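To make the contrast with event notification concrete, the sketch below shows a thin notification event next to a state-carrying event. The record names and fields are illustrative assumptions, not a prescribed schema.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;

// Event notification: carries only identifiers; consumers must query the
// producer (or a shared store) if they need more detail.
record OrderCreatedNotification(String eventId, String orderId, Instant occurredAt) {}

// Event-carried state transfer: embeds the state most consumers need,
// so no follow-up query is required.
record OrderCreatedWithState(
        String eventId,
        String orderId,
        String customerId,
        BigDecimal totalAmount,
        List<String> lineItemSkus,
        Instant occurredAt) {}
```

The trade-off is payload size and schema coupling: state-carrying events are larger and expose more of the producer's model to consumers.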
Event Sourcing: All changes to application state are stored as a sequence of events, which can be replayed to reconstruct the current state.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ User │───>│ Commands │───>│ Domain │
│ Input │ │ │ │ Logic │
└──────────┘ └──────────┘ └──────────┘
│
v
┌──────────┐
│ Events │
│ │
└──────────┘
│
v
┌──────────┐
│ Event │
│ Store │
└──────────┘
│
v
┌──────────┐
│ Read │
│ Model │
└──────────┘
When to use: When you need a complete audit trail, time-travel debugging, or complex event processing.
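As a minimal sketch of the replay idea (the aggregate, event types, and method names are illustrative assumptions): the current state is never stored directly; it is rebuilt by applying the stored events in order.

```java
import java.math.BigDecimal;
import java.util.List;

// Illustrative events for an Account aggregate.
sealed interface AccountEvent permits Deposited, Withdrawn {}
record Deposited(BigDecimal amount) implements AccountEvent {}
record Withdrawn(BigDecimal amount) implements AccountEvent {}

class Account {
    private BigDecimal balance = BigDecimal.ZERO;

    // Rebuild current state by replaying the event stream from the event store.
    static Account replay(List<AccountEvent> history) {
        Account account = new Account();
        history.forEach(account::apply);
        return account;
    }

    private void apply(AccountEvent event) {
        if (event instanceof Deposited d) {
            balance = balance.add(d.amount());
        } else if (event instanceof Withdrawn w) {
            balance = balance.subtract(w.amount());
        }
    }

    BigDecimal balance() {
        return balance;
    }
}
```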
Command Query Responsibility Segregation (CQRS): Separates read and write operations into different models, often with different data stores.
┌──────────┐ ┌──────────┐
│ Write │ │ Read │
│ Requests │ │ Requests │
└──────────┘ └──────────┘
│ │
v v
┌──────────┐ ┌──────────┐
│ Write │ │ Read │
│ Model │ │ Model │
└──────────┘ └──────────┘
│ ▲
v │
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Events │───>│ Event │───>│ Projector│
│ │ │ Store │ │ │
└──────────┘ └──────────┘ └──────────┘
When to use: When read and write workloads have significantly different requirements or scaling needs.
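A minimal projector sketch (class and field names are illustrative, and the read model is kept in memory only for brevity): the write side emits events, and the projector folds them into a denormalized view optimized for queries.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Read model: a denormalized view keyed by customer; in practice this would
// live in a dedicated read store rather than in memory.
class CustomerOrderTotalsProjector {
    private final Map<String, BigDecimal> totalsByCustomer = new ConcurrentHashMap<>();

    // Called for every OrderCreated event delivered by the broker or event store.
    void onOrderCreated(String customerId, BigDecimal amount) {
        totalsByCustomer.merge(customerId, amount, BigDecimal::add);
    }

    BigDecimal totalFor(String customerId) {
        return totalsByCustomer.getOrDefault(customerId, BigDecimal.ZERO);
    }
}
```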
Saga: Coordinates a sequence of local transactions, where each transaction updates data within a single service.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Step 1 │───>│ Step 2 │───>│ Step 3 │
└──────────┘ └──────────┘ └──────────┘
│ │ │
v v v
┌──────────┐ ┌──────────┐ ┌──────────┐
│Compensate│ │Compensate│ │Compensate│
│ Step 1 │<───│ Step 2 │<───│ Step 3 │
└──────────┘ └──────────┘ └──────────┘
When to use: For managing distributed transactions across multiple services.
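A minimal orchestration sketch (the step interface and orchestrator are illustrative assumptions): each step runs a local transaction, and when a later step fails, the steps already completed are compensated in reverse order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

interface SagaStep {
    void execute();     // local transaction in one service
    void compensate();  // undo action for that transaction
}

class SagaOrchestrator {
    void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException failure) {
                // Roll back completed steps in reverse order, then surface the failure.
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                throw failure;
            }
        }
    }
}
```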
Publish-Subscribe: Multiple consumers subscribe to event types, and each subscriber receives all events of those types.
Best practices:
- Use topic-based routing for logical organization
- Implement message filtering at the broker when possible
- Design with multiple consumers in mind
Point-to-Point (Competing Consumers): Events are delivered to exactly one consumer from a pool of potential consumers.
Best practices:
- Use for work distribution and load balancing
- Ensure idempotent consumers for reliability
- Consider message expiration policies
Request-Reply: A synchronous-style interaction pattern implemented with asynchronous events (a correlation-ID sketch follows the best practices below).
Best practices:
- Include correlation IDs to match replies with requests
- Set appropriate timeouts
- Have fallback mechanisms for missed replies
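A minimal sketch of the correlation-ID bookkeeping on the requesting side (the broker interaction is abstracted behind a sendRequest callback, which is an illustrative assumption): each request registers a pending future keyed by correlation ID, the reply consumer completes it, and a timeout guards against missed replies.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

class RequestReplyClient {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final BiConsumer<String, String> sendRequest; // (correlationId, payload) -> publish

    RequestReplyClient(BiConsumer<String, String> sendRequest) {
        this.sendRequest = sendRequest;
    }

    // Publish a request event and wait (bounded) for the matching reply.
    CompletableFuture<String> request(String payload, long timeoutSeconds) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(correlationId, reply);
        sendRequest.accept(correlationId, payload);
        return reply
                .orTimeout(timeoutSeconds, TimeUnit.SECONDS)
                .whenComplete((result, error) -> pending.remove(correlationId));
    }

    // Invoked by the reply-topic consumer; matches replies to requests by correlation ID.
    void onReply(String correlationId, String replyPayload) {
        CompletableFuture<String> reply = pending.get(correlationId);
        if (reply != null) {
            reply.complete(replyPayload);
        }
    }
}
```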
All events should follow a consistent structure:
{
"eventId": "uuid-string",
"eventType": "domain.entity.action.occurred",
"eventTime": "ISO-8601 timestamp",
"version": "schema-version",
"producer": "service-name",
"data": {
// Event-specific payload
},
"metadata": {
// Additional contextual information
"correlationId": "uuid-string",
"causationId": "uuid-string",
"traceId": "uuid-string"
}
}
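The same envelope can be modeled in code; a minimal sketch, with the record name and the use of plain maps for the payload and metadata as illustrative assumptions:

```java
import java.time.Instant;
import java.util.Map;

// Mirrors the envelope above: stable metadata wraps an event-specific payload.
record EventEnvelope(
        String eventId,                // UUID string
        String eventType,              // e.g. "domain.entity.action.occurred"
        Instant eventTime,             // ISO-8601 timestamp
        String version,                // schema version
        String producer,               // publishing service name
        Map<String, Object> data,      // event-specific payload
        Map<String, String> metadata   // correlationId, causationId, traceId, ...
) {}
```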
Follow these principles for evolving event schemas (a small example follows the list):
- Backward Compatibility: New schema versions must accept data produced with older schemas
- Forward Compatibility: Old consumers should be able to process events from newer producers
- Additive Changes: Only add fields, don't remove or change existing ones
- Default Values: Provide sensible defaults for new fields
- Versioning: Include schema version in the event metadata
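A small illustration of an additive, backward-compatible change (the event and field names are hypothetical): version 2 adds a field and supplies a default when upgrading events produced against version 1.

```java
import java.math.BigDecimal;

// v1 of the event payload.
record OrderCreatedV1(String orderId, BigDecimal amount) {}

// v2 adds a field without removing or retyping existing ones; old events
// that lack the field are mapped onto a sensible default.
record OrderCreatedV2(String orderId, BigDecimal amount, String currency) {
    static OrderCreatedV2 fromV1(OrderCreatedV1 v1) {
        return new OrderCreatedV2(v1.orderId(), v1.amount(), "USD"); // default for the new field
    }
}
```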
Use a schema registry to:
- Centrally manage event schemas
- Validate event conformance
- Enable schema evolution governance
- Generate client code for various languages
Choose the appropriate broker based on requirements:
- Apache Kafka: For high-throughput, persistent event streaming
- RabbitMQ: For complex routing requirements and traditional message patterns
- AWS SNS/SQS: For AWS-native applications with moderate throughput
- Google Pub/Sub: For GCP-native applications
- Azure Event Hubs/Service Bus: For Azure-native applications
When implementing an event store (a minimal append sketch follows this list):
- Append-Only: The event store should be an immutable, append-only log
- Optimistic Concurrency: Use optimistic concurrency control for event streams
- Snapshots: Implement snapshotting for performance with long event streams
- Partitioning: Partition events by aggregate ID or other logical boundaries
- Compression: Consider compression for storage efficiency
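A minimal append sketch with optimistic concurrency (in-memory and untyped for brevity; the class and method names are illustrative): the caller states the stream version it last read, and the append is rejected if the stream has moved on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InMemoryEventStore {
    private final Map<String, List<Object>> streams = new ConcurrentHashMap<>();

    // Append-only write guarded by an expected-version check.
    void append(String streamId, long expectedVersion, List<Object> newEvents) {
        streams.compute(streamId, (id, existing) -> {
            List<Object> stream = existing == null ? new ArrayList<>() : existing;
            if (stream.size() != expectedVersion) {
                throw new IllegalStateException(
                        "Concurrent modification: expected version " + expectedVersion
                                + " but stream is at " + stream.size());
            }
            stream.addAll(newEvents);
            return stream;
        });
    }

    List<Object> read(String streamId) {
        return List.copyOf(streams.getOrDefault(streamId, List.of()));
    }
}
```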
- Dead Letter Queues (DLQ): Route unprocessable messages to a DLQ for later analysis
- Retry Policies: Implement exponential backoff for transient failures (see the retry sketch after this list)
- Circuit Breakers: Prevent cascading failures when downstream services fail
- Poison Message Handling: Identify and isolate messages that consistently cause failures
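A minimal retry-then-DLQ sketch (the handler and dead-letter publisher are passed in as plain functions, an illustrative simplification): transient failures are retried with exponential backoff, and messages that keep failing are routed to the dead letter queue.

```java
import java.util.function.Consumer;

class RetryingConsumer {
    private final Consumer<String> handler;        // actual event processing
    private final Consumer<String> deadLetterSink; // publishes to the DLQ
    private final int maxAttempts;

    RetryingConsumer(Consumer<String> handler, Consumer<String> deadLetterSink, int maxAttempts) {
        this.handler = handler;
        this.deadLetterSink = deadLetterSink;
        this.maxAttempts = maxAttempts;
    }

    void process(String message) throws InterruptedException {
        long backoffMillis = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return;
            } catch (RuntimeException failure) {
                if (attempt == maxAttempts) {
                    // Poison or persistently failing message: park it for later analysis.
                    deadLetterSink.accept(message);
                    return;
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2; // exponential backoff
            }
        }
    }
}
```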
Choose the appropriate guarantee level:
- At-most-once: Events may be lost but never processed twice (lowest overhead)
- At-least-once: Events are never lost but may be processed multiple times (requires idempotency)
- Exactly-once: Events are processed exactly once (highest overhead, not always possible)
Design consumers to be idempotent (a minimal sketch follows this list) by:
- Using natural idempotency when possible (e.g., setting vs. incrementing)
- Tracking processed event IDs to detect duplicates
- Using database transactions to make updates atomic
- Designing compensating actions for non-idempotent operations
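A minimal duplicate-tracking sketch (the in-memory set is for brevity only; in production the processed-ID check and the state update should share a transaction or durable store):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class IdempotentConsumer {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    // Skip events whose IDs have already been seen; at-least-once delivery
    // then behaves like exactly-once from the consumer's point of view.
    void onEvent(String eventId, String payload) {
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery, already handled
        }
        handler.accept(payload);
    }
}
```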
- Latency: Time from event production to consumption
- Throughput: Events processed per time unit
- Error Rates: Failed events vs. successful events
- Queue Depths: Number of unprocessed events
- Processing Time: Time taken to process each event
- Dead Letter Counts: Number of events in DLQs
Implement distributed tracing (a propagation sketch follows this list) by:
- Propagating trace IDs through events
- Correlating events in the same business transaction
- Visualizing event flows across services
- Measuring time spent in each processing stage
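A minimal propagation sketch (the metadata keys match the envelope defined earlier; the class name is illustrative): outgoing events copy the trace and correlation IDs from the event that caused them and record that event's ID as the causation ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class TracingMetadata {

    // Build metadata for an outgoing event from the event that caused it:
    // trace and correlation IDs are propagated, and the causation ID points
    // at the incoming event so flows can be reconstructed end-to-end.
    static Map<String, String> propagate(String incomingEventId, Map<String, String> incomingMetadata) {
        Map<String, String> outgoing = new HashMap<>();
        outgoing.put("traceId",
                incomingMetadata.getOrDefault("traceId", UUID.randomUUID().toString()));
        outgoing.put("correlationId",
                incomingMetadata.getOrDefault("correlationId", UUID.randomUUID().toString()));
        outgoing.put("causationId", incomingEventId);
        return outgoing;
    }
}
```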
Log the following information:
- Event metadata (ID, type, timestamp)
- Processing milestones (received, processed, forwarded)
- Errors with context
- Business-significant event details (sanitized of sensitive data)
Test event handlers in isolation (an example test follows this list) by:
- Mocking event sources
- Verifying correct state changes
- Ensuring idempotency with duplicate events
- Testing error handling paths
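An example unit test for the duplicate-delivery case, using JUnit 5 and a trivial counting handler (both illustrative): the same event delivered twice must change state only once.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;

class DuplicateEventTest {

    // Trivial handler that counts the distinct events it has processed.
    static class CountingHandler {
        private final Set<String> seen = new HashSet<>();
        private int processed;

        void handle(String eventId) {
            if (seen.add(eventId)) {
                processed++;
            }
        }

        int processedCount() {
            return processed;
        }
    }

    @Test
    void duplicateDeliveriesAreProcessedOnce() {
        CountingHandler handler = new CountingHandler();
        handler.handle("event-1");
        handler.handle("event-1"); // duplicate delivery must be a no-op
        assertEquals(1, handler.processedCount());
    }
}
```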
Test the interaction between components by:
- Using embedded event brokers for testing
- Verifying event publication
- Confirming subscription and processing
- Testing schema compatibility
Validate complete event flows by:
- Tracing events through the entire system
- Verifying eventual consistency
- Testing failure recovery
- Measuring performance characteristics
Use event storming workshops to:
- Identify domain events collaboratively
- Understand event flows and dependencies
- Define service boundaries around events
- Discover missing events or processes
Align event-driven architecture with DDD by:
- Modeling events as domain concepts
- Using bounded contexts to define event ownership
- Designing aggregates that emit domain events
- Implementing context maps to show relationships between bounded contexts
Optimize for:
- Batching: Group events for efficient processing
- Partitioning: Distribute events for parallel processing
- Backpressure: Implement mechanisms to handle overload
- Caching: Cache reference data needed for event processing
Based on project context, consider:
- Apache Kafka: For high-throughput event streaming
- RabbitMQ: For traditional messaging patterns
- NATS: For lightweight, high-performance messaging
- Cloud Provider Services:
- AWS: EventBridge, SNS, SQS, Kinesis
- Azure: Event Hubs, Service Bus, Event Grid
- GCP: Pub/Sub, Dataflow
- EventStoreDB: Purpose-built event sourcing database
- Apache Kafka: For event sourcing and stream processing
- PostgreSQL/MySQL: With append-only tables for simpler implementations
- MongoDB/DynamoDB: For document-based event storage
- Apache Kafka Streams: For JVM-based stream processing
- Apache Flink: For complex event processing with exactly-once semantics
- Apache Spark Streaming: For batch and stream processing
- AWS Lambda / Azure Functions / GCP Cloud Functions: For serverless event processing
- Prometheus & Grafana: For metrics collection and visualization
- ELK Stack: For log aggregation and analysis
- Jaeger/Zipkin: For distributed tracing
- Datadog/New Relic: For integrated observability platforms
Define clear ownership by:
- Assigning each event type to a specific team/service
- Documenting ownership in a central catalog
- Establishing change management processes for events
- Creating feedback channels for event consumers
Implement event discovery mechanisms:
- Event Catalog: Document all events, their schemas, producers, and consumers
- Self-Registration: Enable services to register their events automatically
- Runtime Discovery: Allow services to discover available events at runtime
- Documentation: Maintain comprehensive, up-to-date event documentation
- Versioning Strategy: Follow semantic versioning for event schemas
- Deprecation Policy: Define timeline and process for deprecating events
- Consumer Impact Analysis: Assess impact of changes on existing consumers
- Backwards Compatibility Period: Maintain compatibility for an appropriate period
- Strangler Fig Pattern: Gradually replace synchronous calls with events
- Dual-Writing: Write to both old and new systems during transition
- Event Backbone: Introduce an event backbone alongside existing communication
- Decomposition: Break monoliths into services connected by events
- Start Small: Begin with non-critical, bounded areas
- Prove Value: Demonstrate benefits with metrics
- Expand Gradually: Incrementally apply to more domains
- Refine Practices: Continuously improve based on experiences
- Event Overload: Too many fine-grained events creating noise
- Missing Events: Forgetting to model important domain events
- Tight Coupling: Using events in ways that create hidden dependencies
- Inconsistent Schemas: Lacking standardization across events
- Insufficient Monitoring: Not tracking event flows end-to-end
// Event definition
@Value
public class OrderCreatedEvent {
private final String orderId;
private final String customerId;
private final BigDecimal amount;
private final LocalDateTime createdAt;
}
// Publishing events
@Service
public class OrderService {
private final ApplicationEventPublisher eventPublisher;
@Autowired
public OrderService(ApplicationEventPublisher eventPublisher) {
this.eventPublisher = eventPublisher;
}
@Transactional
public Order createOrder(OrderRequest request) {
        // Business logic to create and persist the order (elided for brevity)
        Order order = /* ... */ null;
// Publish domain event
eventPublisher.publishEvent(new OrderCreatedEvent(
order.getId(),
order.getCustomerId(),
order.getTotalAmount(),
order.getCreatedAt()
));
return order;
}
}
// Consuming events
@Service
public class NotificationService {
@EventListener
public void handleOrderCreatedEvent(OrderCreatedEvent event) {
// Handle the event
sendOrderConfirmation(event.getOrderId(), event.getCustomerId());
}
}
const EventEmitter = require('events');

// Event-driven service
class OrderService extends EventEmitter {
constructor(orderRepository) {
super();
this.orderRepository = orderRepository;
}
async createOrder(orderData) {
// Business logic
const order = await this.orderRepository.save(orderData);
// Emit domain event
this.emit('orderCreated', {
orderId: order.id,
customerId: order.customerId,
amount: order.totalAmount,
createdAt: new Date().toISOString()
});
return order;
}
}
// Event consumer
const orderService = new OrderService(orderRepository);
// Register event handler
orderService.on('orderCreated', async (event) => {
await notificationService.sendOrderConfirmation(
event.orderId,
event.customerId
);
});
@Slf4j
@Service
public class OrderEventConsumer {
private final NotificationService notificationService;
@Autowired
public OrderEventConsumer(NotificationService notificationService) {
this.notificationService = notificationService;
}
@KafkaListener(topics = "order-events", groupId = "notification-service")
public void consume(ConsumerRecord<String, String> record) {
try {
OrderEvent event = parseEvent(record.value());
if ("ORDER_CREATED".equals(event.getType())) {
notificationService.sendOrderConfirmation(
event.getData().get("orderId"),
event.getData().get("customerId")
);
}
} catch (Exception e) {
// Error handling
log.error("Error processing order event", e);
}
}
    private OrderEvent parseEvent(String json) throws Exception {
        // Parse JSON into an OrderEvent, e.g. with Jackson's ObjectMapper
        return new ObjectMapper().readValue(json, OrderEvent.class);
    }
}
Challenge: Building a scalable order processing system that handles variable load.
Solution:
- Implemented event-driven architecture with the following events:
  - OrderCreated
  - PaymentProcessed
  - InventoryReserved
  - OrderFulfilled
  - ShipmentCreated
  - OrderDelivered
- Used Apache Kafka as the event backbone with:
  - Topic-per-event-type structure
  - Consistent event schema with versioning
  - Dead letter queues for failed events
- Implemented the Saga pattern for transaction management:
  - Coordinated steps across multiple services
  - Implemented compensating transactions
  - Handled timeout and failure scenarios
Results:
- Scaled to handle 10x order volume during peak periods
- Reduced order processing time by 40%
- Achieved 99.99% reliability in order processing
- Enabled new services to be added without modifying existing ones
Challenge: Building a real-time analytics system processing millions of user interactions.
Solution:
- Implemented event sourcing for user interaction data:
  - Captured all user events in an immutable log
  - Used the event-carried state transfer pattern
  - Applied CQRS for specialized read models
- Used stream processing to:
  - Calculate real-time metrics
  - Detect anomalies and patterns
  - Update materialized views
- Implemented multiple projections for different analytical needs:
  - Time-series analysis
  - User behavior aggregations
  - Funnel analysis
Results:
- Achieved sub-second latency for analytics updates
- Reduced infrastructure costs by 35%
- Enabled complex analytics without impacting user-facing services
- Simplified addition of new analytics dimensions