# Senior BI Engineer Take-Home Exercise - Hospitality Domain

## 🎯 Overview

Welcome to the Mews Senior BI Engineer take-home exercise! This exercise evaluates your **real-time data engineering skills** through a realistic hospitality data scenario focusing on **streaming ingestion** and **incremental processing strategies**.

**Time Estimate**: 4-6 hours
**Domain**: Hospitality (Hotel Property Management)
**Focus Areas**: Real-Time Data Ingestion, Incremental Processing, Stream Processing, Change Data Capture (CDC)

---

## 📚 Documentation Pages

This exercise consists of several pages:

1. [**Dataset Documentation**](./Task/Dataset/Readme.md) - Historical baseline data
2. [**Streaming Events Guide**](./Task/Streaming/Readme.md) - Event structures and simulation
3. [**Hospitality Metrics Reference**](./Task/Hospitality%20Metrics%20Quick%20Reference.md) - SQL examples and KPI formulas

---

## 🚀 What You'll Build

Design and implement a **real-time data pipeline** that handles streaming hotel booking data with incremental processing strategies to support live dashboards and operational analytics.

### Core Components

1. **Real-Time Data Ingestion** - Stream processing from multiple sources
2. **Incremental Loading Strategy** - Efficient CDC and delta processing
3. **State Management** - Handle late-arriving data and out-of-order events
4. **Near Real-Time Analytics** - Sub-minute latency for operational metrics
5. **Scalable Architecture** - Design for high-throughput streaming scenarios

---

## 📊 Scenario Context

You're building a data platform for Mews that processes **live booking events** from hotels worldwide. The system needs to:

* Process bookings as they happen (real-time)
* Handle updates to existing bookings (modifications, cancellations)
* Support incremental batch processing for historical data
* Maintain accurate metrics with minimal latency
* Scale from 1 hotel processing 10 bookings/day to 1,000+ hotels processing 10,000+ bookings/hour

### Data Sources

**Source 1: Booking Events Stream (Real-Time)**

* New bookings created
* Booking modifications (room changes, date changes)
* Cancellations and no-shows
* Check-ins and check-outs
* Format: JSON events with timestamps

**Source 2: Guest Updates Stream (Real-Time)**

* Guest profile creations
* Loyalty tier updates
* Contact information changes
* Format: CDC events (INSERT, UPDATE, DELETE)

**Source 3: Room Inventory (Batch + Incremental)**

* Static configuration (updates are infrequent)
* Incremental updates when rooms added/removed
* Price changes (dynamic pricing updates)
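
For orientation, the two streaming sources above might emit payloads shaped roughly like the Python dicts below. The field names are illustrative assumptions; the authoritative event structures are defined in the [Streaming Events Guide](./Task/Streaming/Readme.md).

```python
# Illustrative payloads only -- field names are assumptions, not the real schema.
booking_event = {
    "event_id": "evt-000123",            # unique per event (handy for deduplication)
    "event_type": "booking_created",     # created / modified / cancelled / checked_in / checked_out
    "event_timestamp": "2025-01-15T10:00:00Z",
    "booking_id": "bkg-1001",
    "property_id": "hotel-042",
    "room_type": "Deluxe King",
    "check_in_date": "2025-02-01",
    "check_out_date": "2025-02-04",
    "total_amount": 540.00,
}

guest_cdc_event = {
    "op": "UPDATE",                      # INSERT / UPDATE / DELETE
    "source_table": "guests",
    "commit_timestamp": "2025-01-15T10:05:00Z",
    "before": {"guest_id": "gst-77", "loyalty_tier": "Silver"},
    "after": {"guest_id": "gst-77", "loyalty_tier": "Gold"},
}
```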

**Initial Dataset**: Use the provided [historical data](./Task/Dataset/Readme.md) as the baseline, then simulate streaming events using the [Streaming Events Guide](./Task/Streaming/Readme.md).

---

## 🎯 Requirements

### Part 1: Real-Time Ingestion Architecture (40%)

**1.1 Streaming Data Ingestion**

Design and implement a streaming ingestion pipeline that handles:

* **Event-driven ingestion** from multiple sources
* **Schema evolution** (new fields added over time)
* **Event ordering** and late-arriving data
* **Duplicate detection** and idempotency
* **Backpressure handling** for high-volume scenarios

**Technical Requirements:**

* Process events with <1 minute end-to-end latency
* Handle out-of-order events (up to 1 hour late)
* Support at least 100 events/second throughput
* Implement watermarking for late data
* Ensure exactly-once semantics for critical metrics
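
To make these requirements concrete, here is one possible shape for such a pipeline, sketched with Spark Structured Streaming reading from Kafka. It is not a prescribed solution; the broker address, topic name, event schema, and storage paths are all assumptions.

```python
# Sketch only: one way to satisfy the watermarking, dedup, and resumability
# requirements with Spark Structured Streaming. Broker, topic, schema, and
# paths are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StringType, TimestampType,
                               DateType, DoubleType)

spark = SparkSession.builder.appName("booking-ingest").getOrCreate()

event_schema = (
    StructType()
    .add("booking_id", StringType())
    .add("event_type", StringType())
    .add("event_timestamp", TimestampType())
    .add("property_id", StringType())
    .add("check_in_date", DateType())
    .add("check_out_date", DateType())
    .add("total_amount", DoubleType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")    # placeholder broker
    .option("subscribe", "booking-events")                # hypothetical topic
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    # Accept events arriving up to 1 hour late, as required above.
    .withWatermark("event_timestamp", "1 hour")
    # Idempotency key from Scenario 3: booking_id + event_timestamp.
    .dropDuplicates(["booking_id", "event_timestamp"])
)

query = (
    events.writeStream.format("delta")                    # assumes a Delta Lake bronze table
    .option("checkpointLocation", "/chk/booking_events")  # resumability
    .outputMode("append")
    .start("/lake/bronze/booking_events")
)
```

The watermark bounds how much deduplication and aggregation state must be retained for late events, while the checkpoint location is what makes the query resumable; combined with a transactional sink, that is the usual basis for exactly-once delivery.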

**1.2 Incremental Loading Strategy**

Implement an incremental processing approach that includes:

* **Change Data Capture (CDC)** pattern for guest updates
* **Merge/Upsert logic** for handling updates to existing bookings (see the sketch after the scenarios below)
* **Soft deletes** for cancelled bookings (maintain history)
* **Partitioning strategy** by date and property for efficiency
* **Checkpoint management** for resumability

**Key Scenarios to Handle:**

* Guest books a room → Creates new booking
* Guest modifies dates → Updates existing booking (preserve history)
* Guest cancels → Soft delete (booking_status = 'Cancelled')
* Guest checks in → Status update (booking_status = 'Checked-In')
* Late-arriving event → Process correctly despite delay
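
A minimal sketch of the merge/upsert and soft-delete behaviour described above, assuming the bookings table lives in Delta Lake and continuing from the `events` stream and `spark` session of the previous sketch; the table path, column names, and event-type values are assumptions.

```python
# Sketch only: idempotent upsert of booking events into a silver bookings
# table, assuming Delta Lake. Paths, columns, and event types are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def upsert_booking_events(micro_batch_df, batch_id):
    # Keep only the newest event per booking in this micro-batch, and map the
    # event type to a status (cancellation stays in the table as a soft delete).
    latest = (
        micro_batch_df
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("booking_id").orderBy(F.col("event_timestamp").desc())))
        .filter("rn = 1")
        .drop("rn")
        .withColumn("booking_status",
            F.when(F.col("event_type") == "booking_cancelled", "Cancelled")
             .when(F.col("event_type") == "booking_checked_in", "Checked-In")
             .when(F.col("event_type") == "booking_checked_out", "Checked-Out")
             .otherwise("Confirmed"))
    )
    bookings = DeltaTable.forPath(spark, "/lake/silver/bookings")
    (
        bookings.alias("t")
        # property_id in the join condition allows partition pruning when the
        # table is partitioned by property and date.
        .merge(latest.alias("s"),
               "t.booking_id = s.booking_id AND t.property_id = s.property_id")
        # Apply only events newer than what the table already holds.
        .whenMatchedUpdate(
            condition="s.event_timestamp > t.updated_at",
            set={
                "check_in_date": "s.check_in_date",
                "check_out_date": "s.check_out_date",
                "total_amount": "s.total_amount",
                "booking_status": "s.booking_status",
                "updated_at": "s.event_timestamp",
            })
        .whenNotMatchedInsert(values={
            "booking_id": "s.booking_id",
            "property_id": "s.property_id",
            "check_in_date": "s.check_in_date",
            "check_out_date": "s.check_out_date",
            "total_amount": "s.total_amount",
            "booking_status": "s.booking_status",
            "updated_at": "s.event_timestamp",
        })
        .execute()
    )

# Drive the merge from the `events` stream; the checkpoint provides resumability.
(
    events.writeStream.foreachBatch(upsert_booking_events)
    .option("checkpointLocation", "/chk/bookings_merge")
    .start()
)
```

Preserving the full modification history (rather than updating in place) would typically add an append-only history table next to this merge; that design choice is left to your write-up.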

**1.3 State Management**

Handle stateful operations:

* Maintain running aggregates (daily revenue, occupancy)
* Track booking lifecycle (created → modified → confirmed → checked-in → checked-out)
* Manage session windows for multi-event bookings
* Implement exactly-once processing guarantees
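
For the running-aggregate requirement, a sketch of a watermarked daily-revenue aggregate, again continuing from the `events` stream above (the console sink is only a placeholder):

```python
# Sketch only: stateful daily revenue per property. The 1-hour watermark
# declared at ingestion bounds how long window state is retained; the
# checkpoint lets that state survive restarts.
daily_revenue = (
    events
    .filter(F.col("event_type") == "booking_created")
    .groupBy(F.window("event_timestamp", "1 day").alias("day"), "property_id")
    .agg(F.sum("total_amount").alias("gross_revenue"),
         F.count("booking_id").alias("bookings"))
)

(
    daily_revenue.writeStream
    .outputMode("update")                                # emit only changed windows
    .option("checkpointLocation", "/chk/daily_revenue")
    .format("console")                                   # placeholder sink for the sketch
    .start()
)
```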

### Part 2: Near Real-Time Analytics (20%)

Build analytics that update in near real-time:

**Streaming Metrics** (Update every 1-5 minutes):

* Current occupancy rate (who's checked in now)
* Today's revenue (running total)
* Live booking funnel (new bookings last hour/day)
* Cancellation alerts (immediate notifications)
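
As one illustration, the current-occupancy tile could be fed by a small snapshot query refreshed every few minutes against the merged bookings table; the `silver` table names are assumptions carried over from the earlier sketches, and the `as_of` column feeds the freshness indicator required below.

```python
# Sketch only: point-in-time occupancy per property, including an `as_of`
# column for the dashboard's freshness indicator. Table names are assumptions.
occupancy_snapshot = spark.sql("""
    SELECT b.property_id,
           COUNT(*)                                  AS rooms_occupied,
           r.total_rooms,
           ROUND(COUNT(*) / r.total_rooms * 100, 1)  AS occupancy_rate_pct,
           current_timestamp()                       AS as_of
    FROM   silver.bookings b
    JOIN  (SELECT property_id, COUNT(*) AS total_rooms
           FROM silver.room_inventory
           GROUP BY property_id) r
           ON r.property_id = b.property_id
    WHERE  b.booking_status = 'Checked-In'
    GROUP BY b.property_id, r.total_rooms
""")
```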

**Micro-Batch Metrics** (Update every 15-30 minutes):

* Rolling 7-day ADR and RevPAR
* Guest segmentation updates
* Trending room types
* Lead time distribution
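
For the rolling KPIs, a sketch of a query a 15-30 minute micro-batch job could run. The revenue attribution is deliberately simplified (a stay's full value is counted from its check-in date), and the authoritative formulas are in the [Hospitality Metrics Reference](./Task/Hospitality%20Metrics%20Quick%20Reference.md).

```python
# Sketch only: rolling 7-day ADR and RevPAR per property. Table and column
# names are assumptions; revenue attribution is simplified.
rolling_kpis = spark.sql("""
    WITH last7 AS (
        SELECT property_id,
               SUM(total_amount)                            AS room_revenue,
               SUM(DATEDIFF(check_out_date, check_in_date)) AS room_nights_sold
        FROM   silver.bookings
        WHERE  booking_status NOT IN ('Cancelled', 'No-Show')
          AND  check_in_date >= DATE_SUB(current_date(), 7)
        GROUP BY property_id
    ),
    capacity AS (
        SELECT property_id, COUNT(*) * 7 AS available_room_nights
        FROM   silver.room_inventory
        GROUP BY property_id
    )
    SELECT l.property_id,
           ROUND(l.room_revenue / l.room_nights_sold, 2)      AS adr,    -- revenue per room sold
           ROUND(l.room_revenue / c.available_room_nights, 2) AS revpar  -- revenue per available room
    FROM   last7 l
    JOIN   capacity c ON c.property_id = l.property_id
""")
```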

**Dashboard Requirements:**

* Show "as of [timestamp]" freshness indicator
* Display both real-time and batch-computed metrics
* Handle eventual consistency gracefully

### Part 3: Architecture & Scalability (30%)

Design for production-scale streaming:

**Architecture Diagram Must Show:**

1. **Ingestion Layer**

   * Event sources (Kafka, Kinesis, Event Hub, or similar)
   * Stream processing engine (Spark Streaming, Flink, Kafka Streams)
   * Dead letter queues for failed events

2. **Processing Layer**

   * Stream processing jobs
   * Batch jobs for historical data
   * State stores and checkpointing
   * Data quality validation

3. **Storage Layer**

   * Hot storage (recent data, fast queries)
   * Warm storage (last 90 days)
   * Cold storage (historical archive)
   * Partitioning and indexing strategy

4. **Serving Layer**

   * Query patterns for real-time dashboards
   * Caching strategy
   * API endpoints for live data

**Scalability Considerations:**

* How to handle a 10x traffic increase
* Multi-region deployment strategy
* Failure recovery and replay mechanisms
* Cost optimization (compute vs storage)

**Document:**

* Technology choices and trade-offs
* Latency vs throughput trade-offs
* Consistency guarantees
* Monitoring and alerting strategy
* Incremental vs full refresh decisions

### Part 4: Implementation & Communication (10%)

**Code/Pseudocode:**

* Streaming job implementation (working or detailed pseudocode)
* Incremental merge logic
* State management code
* Data quality checks
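
For the data quality item above, a minimal sketch of the kind of per-micro-batch validation that could run before the merge; the rules and the dead-letter routing are illustrative assumptions.

```python
# Sketch only: split a micro-batch into valid rows and rejects before merging.
# Rules are illustrative; rejects would go to a dead-letter table for review.
from pyspark.sql import functions as F

def split_valid_invalid(df):
    rules = (
        F.col("booking_id").isNotNull()
        & F.col("event_timestamp").isNotNull()
        & (F.col("total_amount") >= 0)
        & (F.col("check_out_date") > F.col("check_in_date"))
    )
    # Treat NULL comparisons as failures rather than silently dropping rows.
    passes = F.when(rules, F.lit(True)).otherwise(F.lit(False))
    return df.filter(passes), df.filter(~passes)

# Would typically be called inside the foreachBatch function shown earlier:
# valid_df, rejected_df = split_valid_invalid(micro_batch_df)
```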

**Documentation:**

* Clear explanation of ingestion strategy
* Incremental loading algorithm
* How you handle edge cases
* Testing approach for streaming systems

---

## 🌊 Streaming Scenarios to Address

Your solution should handle these real-world scenarios:

### Scenario 1: High-Volume Booking Window

* 1,000 bookings arrive in 10 minutes (Black Friday sale)
* System must process without dropping events
* Metrics must stay accurate under load
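
One way to absorb such a spike, assuming a Kafka source and Spark Structured Streaming as in the earlier sketches, is to cap how much each micro-batch pulls so the backlog queues in the broker rather than overwhelming downstream jobs:

```python
# Sketch only: bound each micro-batch so a burst of bookings queues in Kafka
# instead of destabilising the pipeline. Broker and topic are placeholders.
bounded_raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "booking-events")
    .option("maxOffsetsPerTrigger", 10_000)   # at most 10k records per micro-batch
    .load()
)
```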

### Scenario 2: Late-Arriving Cancellation

* Booking created at 10:00 AM
* Cancellation event arrives at 2:00 PM (for 10:05 AM cancellation)
* System must retroactively correct metrics

### Scenario 3: Duplicate Events

* Network retry causes duplicate booking event
* System must deduplicate correctly
* Idempotency key: booking_id + event_timestamp
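
Stream-side `dropDuplicates` (as in the ingestion sketch) catches retries that arrive within the watermark. As a second line of defence, the sink itself can be made idempotent on the stated key; the table name below is an assumption.

```python
# Sketch only: sink-side idempotency on (booking_id, event_timestamp), so a
# replayed micro-batch cannot append the same event twice.
def append_new_events(micro_batch_df, batch_id):
    seen = spark.table("silver.booking_events").select("booking_id", "event_timestamp")
    fresh = micro_batch_df.join(seen, ["booking_id", "event_timestamp"], "left_anti")
    fresh.write.mode("append").saveAsTable("silver.booking_events")
```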

### Scenario 4: Schema Evolution

* New field added: `booking_source` (web, mobile, phone)
* Existing pipelines must continue working
* Historical data lacks this field
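
One common way to handle this, sketched below, is to treat the new field as nullable and backfill a default so old events and historical rows keep flowing through the same code path (the `unknown` default is an assumption):

```python
# Sketch only: make the new optional field safe for both old and new events.
from pyspark.sql import functions as F

def with_booking_source(df):
    if "booking_source" not in df.columns:   # old schema: add the column as null
        df = df.withColumn("booking_source", F.lit(None).cast("string"))
    # New schema but missing value: fall back to a default.
    return df.withColumn("booking_source",
                         F.coalesce(F.col("booking_source"), F.lit("unknown")))
```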

### Scenario 5: Multi-Property Update

* 100 hotels update their room rates simultaneously
* Incremental processing must be efficient
* Avoid full table scans

---

## 📋 Deliverables

1. **Streaming Architecture Design**

   * End-to-end architecture diagram
   * Data flow from ingestion to serving
   * Technology stack justification

2. **Incremental Processing Implementation**

   * CDC/Upsert logic (code or detailed pseudocode)
   * Partitioning strategy
   * Checkpoint/state management approach

3. **Code/Pseudocode**

   * Streaming job implementation
   * Incremental merge logic
   * Late data handling

4. **Scaling Strategy Document**

   * How to scale from 10 to 10,000 events/sec
   * Failure recovery approach
   * Monitoring and observability

5. **Presentation (5-10 slides)**

   * Architectural approach
   * Key design decisions
   * Trade-offs made
   * Performance characteristics

---

## 💡 Key Questions to Answer

Your solution should address:

1. **Latency vs Completeness**: How do you balance fast processing with waiting for late data?
2. **Exactly-Once Processing**: How do you guarantee no double-counting in metrics?
3. **Failure Recovery**: What happens if your stream processor crashes?
4. **Backfill Strategy**: How do you reprocess historical data without disrupting live streams?
5. **Schema Evolution**: How do you handle new fields without breaking pipelines?
6. **Cost Optimization**: When should you use micro-batching vs true streaming?

---

## 🎁 Bonus Challenges (Optional)

* Implement a **Lambda architecture** (streaming + batch) with consistency guarantees
* Design a **Kappa architecture** (streaming-only) alternative
* Add **multi-region active-active** replication
* Implement **complex event processing** (CEP) for fraud detection
* Build a **backpressure handling** mechanism
* Design a **blue-green deployment** strategy for zero-downtime updates

---

## 📥 Download Datasets

**Baseline historical data in the [Dataset Documentation](./Task/Dataset/Readme.md):**

* [bookings.csv](./Task/Dataset/bookings.csv) - Use as initial state
* [guests.json](./Task/Dataset/guests.json) - Use as initial state
* [room_inventory.sql](./Task/Dataset/room_inventory.sql) - Static reference data

**Streaming event examples in the [Streaming Events Guide](./Task/Streaming/Readme.md):**

* Sample event structures (JSON)
* Event simulation scripts (Python)
* CDC patterns and examples

---
**Good luck! We're excited to see how you approach real-time data engineering! 🚀**
_Focus on demonstrating your understanding of streaming concepts and incremental processing strategies. Working code is great, but clear architectural thinking is more important._
