This document outlines the standards, patterns, and best practices for implementing event-driven architectures at Bayat.
- Introduction
- Core Concepts
- Event Patterns
- Communication Patterns
- Event Schema Design
- Event Storage and Routing
- Error Handling
- Monitoring and Observability
- Testing Strategies
- Implementation Guidelines
- Tooling and Infrastructure
- Governance
- Migration Strategies
- Framework-Specific Implementation
- Case Studies
Event-driven architecture (EDA) is an architectural pattern that promotes the production, detection, consumption of, and reaction to events. An event can be defined as a significant change in state or an occurrence that is of interest to the system or its users.
- Loose Coupling: Services communicate without direct dependencies
- Scalability: Components can scale independently
- Resilience: Failure in one component doesn't affect others directly
- Flexibility: Easier to add new components that react to existing events
- Real-time Processing: Enables immediate reaction to events as they occur
Consider EDA when:
- Building systems with asynchronous workflows
- Implementing real-time features
- Designing microservices that need to coordinate without tight coupling
- Managing complex business processes across multiple services
- Creating systems that need audit trails of all state changes
An event represents a fact that occurred at a specific point in time. Events are immutable and should be expressed in the past tense (for example, OrderCreated rather than CreateOrder).
- Domain Events: Represent business-significant occurrences within a domain
- Integration Events: Used for communication between different bounded contexts
- Command Events: Represent instructions to perform actions
- Query Events: Request information without changing state
- Notification Events: Inform subscribed parties about system occurrences
- Event Producers: Components that detect, capture, and publish events to an event broker
- Event Consumers: Components that subscribe to events and react to them
- Event Broker: Middleware that routes events from producers to consumers
- Event Store: Persistent storage of events, serving as the source of truth
Event Notification: The simplest pattern, in which a service publishes an event to notify other services about a change.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Producer │───>│ Broker │───>│ Consumer │
└──────────┘ └──────────┘ └──────────┘
When to use: For simple coordination between services where consumers don't need complete information.
Event-Carried State Transfer: Events contain the complete relevant state, reducing the need for consumers to query the producer for additional data.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Producer │───>│ Broker │───>│ Consumer │
└──────────┘ └──────────┘ └──────────┘
│ │
│ │
v v
┌──────────┐ ┌──────────┐
│ Producer │ │ Consumer │
│ Database │ │ Database │
└──────────┘ └──────────┘
When to use: When consumers need comprehensive information and reducing network calls is important.
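To make the contrast with event notification concrete, the sketch below shows a thin notification event next to a state-carrying event. The record names and fields are illustrative assumptions, not a prescribed schema.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;

// Event notification: carries only identifiers; consumers must query the
// producer (or a shared store) if they need more detail.
record OrderCreatedNotification(String eventId, String orderId, Instant occurredAt) {}

// Event-carried state transfer: embeds the state most consumers need,
// so no follow-up query is required.
record OrderCreatedWithState(
        String eventId,
        String orderId,
        String customerId,
        BigDecimal totalAmount,
        List<String> lineItemSkus,
        Instant occurredAt) {}
```

The trade-off is payload size and schema coupling: state-carrying events are larger and expose more of the producer's model to consumers.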
Event Sourcing: All changes to application state are stored as a sequence of events, which can be replayed to reconstruct the current state.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ User │───>│ Commands │───>│ Domain │
│ Input │ │ │ │ Logic │
└──────────┘ └──────────┘ └──────────┘
│
v
┌──────────┐
│ Events │
│ │
└──────────┘
│
v
┌──────────┐
│ Event │
│ Store │
└──────────┘
│
v
┌──────────┐
│ Read │
│ Model │
└──────────┘
When to use: When you need a complete audit trail, time-travel debugging, or complex event processing.
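As a minimal sketch of the replay idea (the aggregate, event types, and method names are illustrative assumptions): the current state is never stored directly; it is rebuilt by applying the stored events in order.

```java
import java.math.BigDecimal;
import java.util.List;

// Illustrative events for an Account aggregate.
sealed interface AccountEvent permits Deposited, Withdrawn {}
record Deposited(BigDecimal amount) implements AccountEvent {}
record Withdrawn(BigDecimal amount) implements AccountEvent {}

class Account {
    private BigDecimal balance = BigDecimal.ZERO;

    // Rebuild current state by replaying the event stream from the event store.
    static Account replay(List<AccountEvent> history) {
        Account account = new Account();
        history.forEach(account::apply);
        return account;
    }

    private void apply(AccountEvent event) {
        if (event instanceof Deposited d) {
            balance = balance.add(d.amount());
        } else if (event instanceof Withdrawn w) {
            balance = balance.subtract(w.amount());
        }
    }

    BigDecimal balance() {
        return balance;
    }
}
```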
Command Query Responsibility Segregation (CQRS): Separates read and write operations into different models, often with different data stores.
┌──────────┐ ┌──────────┐
│ Write │ │ Read │
│ Requests │ │ Requests │
└──────────┘ └──────────┘
│ │
v v
┌──────────┐ ┌──────────┐
│ Write │ │ Read │
│ Model │ │ Model │
└──────────┘ └──────────┘
│ ▲
v │
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Events │───>│ Event │───>│ Projector│
│ │ │ Store │ │ │
└──────────┘ └──────────┘ └──────────┘
When to use: When read and write workloads have significantly different requirements or scaling needs.
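A minimal projector sketch (class and field names are illustrative, and the read model is kept in memory only for brevity): the write side emits events, and the projector folds them into a denormalized view optimized for queries.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Read model: a denormalized view keyed by customer; in practice this would
// live in a dedicated read store rather than in memory.
class CustomerOrderTotalsProjector {
    private final Map<String, BigDecimal> totalsByCustomer = new ConcurrentHashMap<>();

    // Called for every OrderCreated event delivered by the broker or event store.
    void onOrderCreated(String customerId, BigDecimal amount) {
        totalsByCustomer.merge(customerId, amount, BigDecimal::add);
    }

    BigDecimal totalFor(String customerId) {
        return totalsByCustomer.getOrDefault(customerId, BigDecimal.ZERO);
    }
}
```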
Saga: Coordinates a sequence of local transactions, where each transaction updates data within a single service.
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Step 1 │───>│ Step 2 │───>│ Step 3 │
└──────────┘ └──────────┘ └──────────┘
│ │ │
v v v
┌──────────┐ ┌──────────┐ ┌──────────┐
│Compensate│ │Compensate│ │Compensate│
│ Step 1 │<───│ Step 2 │<───│ Step 3 │
└──────────┘ └──────────┘ └──────────┘
When to use: For managing distributed transactions across multiple services.
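A minimal orchestration sketch (the step interface and orchestrator are illustrative assumptions): each step runs a local transaction, and when a later step fails, the steps already completed are compensated in reverse order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

interface SagaStep {
    void execute();     // local transaction in one service
    void compensate();  // undo action for that transaction
}

class SagaOrchestrator {
    void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException failure) {
                // Roll back completed steps in reverse order, then surface the failure.
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                throw failure;
            }
        }
    }
}
```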
Publish-Subscribe: Multiple consumers subscribe to event types, and each subscriber receives all events of those types.
Best practices:
- Use topic-based routing for logical organization
- Implement message filtering at the broker when possible
- Design with multiple consumers in mind
Point-to-Point (Competing Consumers): Events are delivered to exactly one consumer from a pool of potential consumers.
Best practices:
- Use for work distribution and load balancing
- Ensure idempotent consumers for reliability
- Consider message expiration policies
Request-Reply: A synchronous-style interaction pattern implemented with asynchronous events (a correlation-ID sketch follows the best practices below).
Best practices:
- Include correlation IDs to match replies with requests
- Set appropriate timeouts
- Have fallback mechanisms for missed replies
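A minimal sketch of the correlation-ID bookkeeping on the requesting side (the broker interaction is abstracted behind a sendRequest callback, which is an illustrative assumption): each request registers a pending future keyed by correlation ID, the reply consumer completes it, and a timeout guards against missed replies.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

class RequestReplyClient {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final BiConsumer<String, String> sendRequest; // (correlationId, payload) -> publish

    RequestReplyClient(BiConsumer<String, String> sendRequest) {
        this.sendRequest = sendRequest;
    }

    // Publish a request event and wait (bounded) for the matching reply.
    CompletableFuture<String> request(String payload, long timeoutSeconds) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(correlationId, reply);
        sendRequest.accept(correlationId, payload);
        return reply
                .orTimeout(timeoutSeconds, TimeUnit.SECONDS)
                .whenComplete((result, error) -> pending.remove(correlationId));
    }

    // Invoked by the reply-topic consumer; matches replies to requests by correlation ID.
    void onReply(String correlationId, String replyPayload) {
        CompletableFuture<String> reply = pending.get(correlationId);
        if (reply != null) {
            reply.complete(replyPayload);
        }
    }
}
```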
All events should follow a consistent structure:
{
"eventId": "uuid-string",
"eventType": "domain.entity.action.occurred",
"eventTime": "ISO-8601 timestamp",
"version": "schema-version",
"producer": "service-name",
"data": {
// Event-specific payload
},
"metadata": {
// Additional contextual information
"correlationId": "uuid-string",
"causationId": "uuid-string",
"traceId": "uuid-string"
}
}
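The same envelope can be modeled in code; a minimal sketch, with the record name and the use of plain maps for the payload and metadata as illustrative assumptions:

```java
import java.time.Instant;
import java.util.Map;

// Mirrors the envelope above: stable metadata wraps an event-specific payload.
record EventEnvelope(
        String eventId,                // UUID string
        String eventType,              // e.g. "domain.entity.action.occurred"
        Instant eventTime,             // ISO-8601 timestamp
        String version,                // schema version
        String producer,               // publishing service name
        Map<String, Object> data,      // event-specific payload
        Map<String, String> metadata   // correlationId, causationId, traceId, ...
) {}
```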
Follow these principles for evolving event schemas (a small example follows the list):
- Backward Compatibility: New schema versions must accept data produced with older schemas
- Forward Compatibility: Old consumers should be able to process events from newer producers
- Additive Changes: Only add fields, don't remove or change existing ones
- Default Values: Provide sensible defaults for new fields
- Versioning: Include schema version in the event metadata
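A small illustration of an additive, backward-compatible change (the event and field names are hypothetical): version 2 adds a field and supplies a default when upgrading events produced against version 1.

```java
import java.math.BigDecimal;

// v1 of the event payload.
record OrderCreatedV1(String orderId, BigDecimal amount) {}

// v2 adds a field without removing or retyping existing ones; old events
// that lack the field are mapped onto a sensible default.
record OrderCreatedV2(String orderId, BigDecimal amount, String currency) {
    static OrderCreatedV2 fromV1(OrderCreatedV1 v1) {
        return new OrderCreatedV2(v1.orderId(), v1.amount(), "USD"); // default for the new field
    }
}
```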
Use a schema registry to:
- Centrally manage event schemas
- Validate event conformance
- Enable schema evolution governance
- Generate client code for various languages
Choose the appropriate broker based on requirements:
- Apache Kafka: For high-throughput, persistent event streaming
- RabbitMQ: For complex routing requirements and traditional message patterns
- AWS SNS/SQS: For AWS-native applications with moderate throughput
- Google Pub/Sub: For GCP-native applications
- Azure Event Hubs/Service Bus: For Azure-native applications
When implementing an event store (a minimal append sketch follows this list):
- Append-Only: The event store should be an immutable, append-only log
- Optimistic Concurrency: Use optimistic concurrency control for event streams
- Snapshots: Implement snapshotting for performance with long event streams
- Partitioning: Partition events by aggregate ID or other logical boundaries
- Compression: Consider compression for storage efficiency
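A minimal append sketch with optimistic concurrency (in-memory and untyped for brevity; the class and method names are illustrative): the caller states the stream version it last read, and the append is rejected if the stream has moved on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InMemoryEventStore {
    private final Map<String, List<Object>> streams = new ConcurrentHashMap<>();

    // Append-only write guarded by an expected-version check.
    void append(String streamId, long expectedVersion, List<Object> newEvents) {
        streams.compute(streamId, (id, existing) -> {
            List<Object> stream = existing == null ? new ArrayList<>() : existing;
            if (stream.size() != expectedVersion) {
                throw new IllegalStateException(
                        "Concurrent modification: expected version " + expectedVersion
                                + " but stream is at " + stream.size());
            }
            stream.addAll(newEvents);
            return stream;
        });
    }

    List<Object> read(String streamId) {
        return List.copyOf(streams.getOrDefault(streamId, List.of()));
    }
}
```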
- Dead Letter Queues (DLQ): Route unprocessable messages to a DLQ for later analysis
- Retry Policies: Implement exponential backoff for transient failures (see the retry sketch after this list)
- Circuit Breakers: Prevent cascading failures when downstream services fail
- Poison Message Handling: Identify and isolate messages that consistently cause failures
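A minimal retry-then-DLQ sketch (the handler and dead-letter publisher are passed in as plain functions, an illustrative simplification): transient failures are retried with exponential backoff, and messages that keep failing are routed to the dead letter queue.

```java
import java.util.function.Consumer;

class RetryingConsumer {
    private final Consumer<String> handler;        // actual event processing
    private final Consumer<String> deadLetterSink; // publishes to the DLQ
    private final int maxAttempts;

    RetryingConsumer(Consumer<String> handler, Consumer<String> deadLetterSink, int maxAttempts) {
        this.handler = handler;
        this.deadLetterSink = deadLetterSink;
        this.maxAttempts = maxAttempts;
    }

    void process(String message) throws InterruptedException {
        long backoffMillis = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return;
            } catch (RuntimeException failure) {
                if (attempt == maxAttempts) {
                    // Poison or persistently failing message: park it for later analysis.
                    deadLetterSink.accept(message);
                    return;
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2; // exponential backoff
            }
        }
    }
}
```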
Choose the appropriate guarantee level:
- At-most-once: Events may be lost but never processed twice (lowest overhead)
- At-least-once: Events are never lost but may be processed multiple times (requires idempotency)
- Exactly-once: Events are processed exactly once (highest overhead, not always possible)
Design consumers to be idempotent (a minimal sketch follows this list) by:
- Using natural idempotency when possible (e.g., setting vs. incrementing)
- Tracking processed event IDs to detect duplicates
- Using database transactions to make updates atomic
- Designing compensating actions for non-idempotent operations
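A minimal duplicate-tracking sketch (the in-memory set is for brevity only; in production the processed-ID check and the state update should share a transaction or durable store):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class IdempotentConsumer {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    // Skip events whose IDs have already been seen; at-least-once delivery
    // then behaves like exactly-once from the consumer's point of view.
    void onEvent(String eventId, String payload) {
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery, already handled
        }
        handler.accept(payload);
    }
}
```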
- Latency: Time from event production to consumption
- Throughput: Events processed per time unit
- Error Rates: Failed events vs. successful events
- Queue Depths: Number of unprocessed events
- Processing Time: Time taken to process each event
- Dead Letter Counts: Number of events in DLQs
Implement distributed tracing (a propagation sketch follows this list) by:
- Propagating trace IDs through events
- Correlating events in the same business transaction
- Visualizing event flows across services
- Measuring time spent in each processing stage
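A minimal propagation sketch (the metadata keys match the envelope defined earlier; the class name is illustrative): outgoing events copy the trace and correlation IDs from the event that caused them and record that event's ID as the causation ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class TracingMetadata {

    // Build metadata for an outgoing event from the event that caused it:
    // trace and correlation IDs are propagated, and the causation ID points
    // at the incoming event so flows can be reconstructed end-to-end.
    static Map<String, String> propagate(String incomingEventId, Map<String, String> incomingMetadata) {
        Map<String, String> outgoing = new HashMap<>();
        outgoing.put("traceId",
                incomingMetadata.getOrDefault("traceId", UUID.randomUUID().toString()));
        outgoing.put("correlationId",
                incomingMetadata.getOrDefault("correlationId", UUID.randomUUID().toString()));
        outgoing.put("causationId", incomingEventId);
        return outgoing;
    }
}
```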
Log the following information:
- Event metadata (ID, type, timestamp)
- Processing milestones (received, processed, forwarded)
- Errors with context
- Business-significant event details (sanitized of sensitive data)
Test event handlers in isolation (an example test follows this list) by:
- Mocking event sources
- Verifying correct state changes
- Ensuring idempotency with duplicate events
- Testing error handling paths
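An example unit test for the duplicate-delivery case, using JUnit 5 and a trivial counting handler (both illustrative): the same event delivered twice must change state only once.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;

class DuplicateEventTest {

    // Trivial handler that counts the distinct events it has processed.
    static class CountingHandler {
        private final Set<String> seen = new HashSet<>();
        private int processed;

        void handle(String eventId) {
            if (seen.add(eventId)) {
                processed++;
            }
        }

        int processedCount() {
            return processed;
        }
    }

    @Test
    void duplicateDeliveriesAreProcessedOnce() {
        CountingHandler handler = new CountingHandler();
        handler.handle("event-1");
        handler.handle("event-1"); // duplicate delivery must be a no-op
        assertEquals(1, handler.processedCount());
    }
}
```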
Test the interaction between components by:
- Using embedded event brokers for testing
- Verifying event publication
- Confirming subscription and processing
- Testing schema compatibility
Validate complete event flows by:
- Tracing events through the entire system
- Verifying eventual consistency
- Testing failure recovery
- Measuring performance characteristics
Use event storming workshops to:
- Identify domain events collaboratively
- Understand event flows and dependencies
- Define service boundaries around events
- Discover missing events or processes
Align event-driven architecture with DDD by:
- Modeling events as domain concepts
- Using bounded contexts to define event ownership
- Designing aggregates that emit domain events
- Implementing context maps to show relationships between bounded contexts
Optimize for:
- Batching: Group events for efficient processing
- Partitioning: Distribute events for parallel processing
- Backpressure: Implement mechanisms to handle overload
- Caching: Cache reference data needed for event processing
Based on project context, consider:
- Apache Kafka: For high-throughput event streaming
- RabbitMQ: For traditional messaging patterns
- NATS: For lightweight, high-performance messaging
- Cloud Provider Services:
- AWS: EventBridge, SNS, SQS, Kinesis
- Azure: Event Hubs, Service Bus, Event Grid
- GCP: Pub/Sub, Dataflow
- EventStoreDB: Purpose-built event sourcing database
- Apache Kafka: For event sourcing and stream processing
- PostgreSQL/MySQL: With append-only tables for simpler implementations
- MongoDB/DynamoDB: For document-based event storage
- Apache Kafka Streams: For JVM-based stream processing
- Apache Flink: For complex event processing with exactly-once semantics
- Apache Spark Streaming: For batch and stream processing
- AWS Lambda / Azure Functions / GCP Cloud Functions: For serverless event processing
- Prometheus & Grafana: For metrics collection and visualization
- ELK Stack: For log aggregation and analysis
- Jaeger/Zipkin: For distributed tracing
- Datadog/New Relic: For integrated observability platforms
Define clear ownership by:
- Assigning each event type to a specific team/service
- Documenting ownership in a central catalog
- Establishing change management processes for events
- Creating feedback channels for event consumers
Implement event discovery mechanisms:
- Event Catalog: Document all events, their schemas, producers, and consumers
- Self-Registration: Enable services to register their events automatically
- Runtime Discovery: Allow services to discover available events at runtime
- Documentation: Maintain comprehensive, up-to-date event documentation
- Versioning Strategy: Follow semantic versioning for event schemas
- Deprecation Policy: Define timeline and process for deprecating events
- Consumer Impact Analysis: Assess impact of changes on existing consumers
- Backwards Compatibility Period: Maintain compatibility for an appropriate period
- Strangler Fig Pattern: Gradually replace synchronous calls with events
- Dual-Writing: Write to both old and new systems during transition
- Event Backbone: Introduce an event backbone alongside existing communication
- Decomposition: Break monoliths into services connected by events
- Start Small: Begin with non-critical, bounded areas
- Prove Value: Demonstrate benefits with metrics
- Expand Gradually: Incrementally apply to more domains
- Refine Practices: Continuously improve based on experiences
- Event Overload: Too many fine-grained events creating noise
- Missing Events: Forgetting to model important domain events
- Tight Coupling: Using events in ways that create hidden dependencies
- Inconsistent Schemas: Lacking standardization across events
- Insufficient Monitoring: Not tracking event flows end-to-end
// Event definition
@Value
public class OrderCreatedEvent {
private final String orderId;
private final String customerId;
private final BigDecimal amount;
private final LocalDateTime createdAt;
}
// Publishing events
@Service
public class OrderService {
private final ApplicationEventPublisher eventPublisher;
@Autowired
public OrderService(ApplicationEventPublisher eventPublisher) {
this.eventPublisher = eventPublisher;
}
@Transactional
public Order createOrder(OrderRequest request) {
        // Business logic to create and persist the order (elided for brevity)
        Order order = /* ... */ null;
// Publish domain event
eventPublisher.publishEvent(new OrderCreatedEvent(
order.getId(),
order.getCustomerId(),
order.getTotalAmount(),
order.getCreatedAt()
));
return order;
}
}
// Consuming events
@Service
public class NotificationService {
@EventListener
public void handleOrderCreatedEvent(OrderCreatedEvent event) {
// Handle the event
sendOrderConfirmation(event.getOrderId(), event.getCustomerId());
}
}
const EventEmitter = require('events');

// Event-driven service
class OrderService extends EventEmitter {
constructor(orderRepository) {
super();
this.orderRepository = orderRepository;
}
async createOrder(orderData) {
// Business logic
const order = await this.orderRepository.save(orderData);
// Emit domain event
this.emit('orderCreated', {
orderId: order.id,
customerId: order.customerId,
amount: order.totalAmount,
createdAt: new Date().toISOString()
});
return order;
}
}
// Event consumer
const orderService = new OrderService(orderRepository);
// Register event handler
orderService.on('orderCreated', async (event) => {
await notificationService.sendOrderConfirmation(
event.orderId,
event.customerId
);
});
@Slf4j
@Service
public class OrderEventConsumer {
private final NotificationService notificationService;
@Autowired
public OrderEventConsumer(NotificationService notificationService) {
this.notificationService = notificationService;
}
@KafkaListener(topics = "order-events", groupId = "notification-service")
public void consume(ConsumerRecord<String, String> record) {
try {
OrderEvent event = parseEvent(record.value());
if ("ORDER_CREATED".equals(event.getType())) {
notificationService.sendOrderConfirmation(
event.getData().get("orderId"),
event.getData().get("customerId")
);
}
} catch (Exception e) {
// Error handling
log.error("Error processing order event", e);
}
}
    private OrderEvent parseEvent(String json) throws Exception {
        // Parse JSON into an OrderEvent, e.g. with Jackson's ObjectMapper
        return new ObjectMapper().readValue(json, OrderEvent.class);
    }
}
Challenge: Building a scalable order processing system that handles variable load.
Solution:
- Implemented event-driven architecture with the following events:
  - OrderCreated
  - PaymentProcessed
  - InventoryReserved
  - OrderFulfilled
  - ShipmentCreated
  - OrderDelivered
- Used Apache Kafka as the event backbone with:
  - Topic-per-event-type structure
  - Consistent event schema with versioning
  - Dead letter queues for failed events
- Implemented the Saga pattern for transaction management:
  - Coordinated steps across multiple services
  - Implemented compensating transactions
  - Handled timeout and failure scenarios
Results:
- Scaled to handle 10x order volume during peak periods
- Reduced order processing time by 40%
- Achieved 99.99% reliability in order processing
- Enabled new services to be added without modifying existing ones
Challenge: Building a real-time analytics system processing millions of user interactions.
Solution:
- Implemented event sourcing for user interaction data:
  - Captured all user events in an immutable log
  - Used the event-carried state transfer pattern
  - Applied CQRS for specialized read models
- Used stream processing to:
  - Calculate real-time metrics
  - Detect anomalies and patterns
  - Update materialized views
- Implemented multiple projections for different analytical needs:
  - Time-series analysis
  - User behavior aggregations
  - Funnel analysis
Results:
- Achieved sub-second latency for analytics updates
- Reduced infrastructure costs by 35%
- Enabled complex analytics without impacting user-facing services
- Simplified addition of new analytics dimensions