|
| 1 | +# Senior BI Engineer Take-Home Exercise - Hospitality Domain |
| 2 | + |
| 3 | +## 🎯 Overview |
| 4 | + |
| 5 | +Welcome to the Mews Senior Data Engineer take-home exercise! This exercise evaluates your **real-time data engineering skills** through a realistic hospitality data scenario focusing on **streaming ingestion** and **incremental processing strategies**. |
| 6 | + |
| 7 | +**Time Estimate**: 4-6 hours |
| 8 | +**Domain**: Hospitality (Hotel Property Management) |
| 9 | +**Focus Areas**: Real-Time Data Ingestion, Incremental Processing, Stream Processing, Change Data Capture (CDC) |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 📚 Documentation Pages |
| 14 | + |
| 15 | +This exercise consists of several pages: |
| 16 | + |
| 17 | +1. [**Dataset Documentation**](./Task/Dataset/Readme.md) - Historical baseline data |
| 18 | +2. [**Streaming Events Guide**](./Task/Streaming/Readme.md) - Event structures and simulation |
| 19 | +3. [**Hospitality Metrics Reference**](./Task/Hospitality%20Metrics%20Quick%20Reference.md) - SQL examples and KPI formulas |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## 🚀 What You'll Build |
| 24 | + |
| 25 | +Design and implement a **real-time data pipeline** that handles streaming hotel booking data with incremental processing strategies to support live dashboards and operational analytics. |
| 26 | + |
| 27 | +### Core Components |
| 28 | + |
| 29 | +1. **Real-Time Data Ingestion** - Stream processing from multiple sources |
| 30 | +2. **Incremental Loading Strategy** - Efficient CDC and delta processing |
| 31 | +3. **State Management** - Handle late-arriving data and out-of-order events |
| 32 | +4. **Near Real-Time Analytics** - Sub-minute latency for operational metrics |
| 33 | +5. **Scalable Architecture** - Design for high-throughput streaming scenarios |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## 📊 Scenario Context |
| 38 | + |
| 39 | +You're building a data platform for Mews that processes **live booking events** from hotels worldwide. The system needs to: |
| 40 | + |
| 41 | +* Process bookings as they happen (real-time) |
| 42 | +* Handle updates to existing bookings (modifications, cancellations) |
| 43 | +* Support incremental batch processing for historical data |
| 44 | +* Maintain accurate metrics with minimal latency |
| 45 | +* Scale from 1 hotel processing 10 bookings/day to 1,000+ hotels processing 10,000+ bookings/hour |
| 46 | + |
| 47 | +### Data Sources |
| 48 | + |
| 49 | +**Source 1: Booking Events Stream (Real-Time)** |
| 50 | + |
| 51 | +* New bookings created |
| 52 | +* Booking modifications (room changes, date changes) |
| 53 | +* Cancellations and no-shows |
| 54 | +* Check-ins and check-outs |
| 55 | +* Format: JSON events with timestamps |
| 56 | + |
| 57 | +**Source 2: Guest Updates Stream (Real-Time)** |
| 58 | + |
| 59 | +* Guest profile creations |
| 60 | +* Loyalty tier updates |
| 61 | +* Contact information changes |
| 62 | +* Format: CDC events (INSERT, UPDATE, DELETE) |
| 63 | + |
| 64 | +**Source 3: Room Inventory (Batch + Incremental)** |
| 65 | + |
| 66 | +* Static configuration (updates infrequent) |
| 67 | +* Incremental updates when rooms added/removed |
| 68 | +* Price changes (dynamic pricing updates) |
| 69 | + |
| 70 | +**Initial Dataset**: Use the provided [historical data](./Task/Dataset/Readme.md) as the baseline, then simulate streaming events using the [Streaming Events Guide](./Task/Streaming/Readme.md). |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## 🎯 Requirements |
| 75 | + |
| 76 | +### Part 1: Real-Time Ingestion Architecture (40%) |
| 77 | + |
| 78 | +**1.1 Streaming Data Ingestion** |
| 79 | + |
| 80 | +Design and implement a streaming ingestion pipeline that handles: |
| 81 | + |
| 82 | +* **Event-driven ingestion** from multiple sources |
| 83 | +* **Schema evolution** (new fields added over time) |
| 84 | +* **Event ordering** and late-arriving data |
| 85 | +* **Duplicate detection** and idempotency |
| 86 | +* **Backpressure handling** for high-volume scenarios |
| 87 | + |
| 88 | +**Technical Requirements:** |
| 89 | + |
| 90 | +* Process events with <1 minute end-to-end latency |
| 91 | +* Handle out-of-order events (up to 1 hour late) |
| 92 | +* Support at least 100 events/second throughput |
| 93 | +* Implement watermarking for late data |
| 94 | +* Ensure exactly-once semantics for critical metrics |
| 95 | + |
| 96 | +**1.2 Incremental Loading Strategy** |
| 97 | + |
| 98 | +Implement an incremental processing approach that includes: |
| 99 | + |
| 100 | +* **Change Data Capture (CDC)** pattern for guest updates |
| 101 | +* **Merge/Upsert logic** for handling updates to existing bookings |
| 102 | +* **Soft deletes** for cancelled bookings (maintain history) |
| 103 | +* **Partitioning strategy** by date and property for efficiency |
| 104 | +* **Checkpoint management** for resumability |
| 105 | + |
| 106 | +**Key Scenarios to Handle:** |
| 107 | + |
| 108 | +* Guest books a room → Creates new booking |
| 109 | +* Guest modifies dates → Updates existing booking (preserve history) |
| 110 | +* Guest cancels → Soft delete (booking_status = 'Cancelled') |
| 111 | +* Guest checks in → Status update (booking_status = 'Checked-In') |
| 112 | +* Late-arriving event → Process correctly despite delay |
| 113 | + |
| 114 | +**1.3 State Management** |
| 115 | + |
| 116 | +Handle stateful operations: |
| 117 | + |
| 118 | +* Maintain running aggregates (daily revenue, occupancy) |
| 119 | +* Track booking lifecycle (created → modified → confirmed → checked-in → checked-out) |
| 120 | +* Manage session windows for multi-event bookings |
| 121 | +* Implement exactly-once processing guarantees |
| 122 | + |
| 123 | +### Part 2: Near Real-Time Analytics (20%) |
| 124 | + |
| 125 | +Build analytics that update in near real-time: |
| 126 | + |
| 127 | +**Streaming Metrics** (Update every 1-5 minutes): |
| 128 | + |
| 129 | +* Current occupancy rate (who's checked in now) |
| 130 | +* Today's revenue (running total) |
| 131 | +* Live booking funnel (new bookings last hour/day) |
| 132 | +* Cancellation alerts (immediate notifications) |
| 133 | + |
| 134 | +**Micro-Batch Metrics** (Update every 15-30 minutes): |
| 135 | + |
| 136 | +* Rolling 7-day ADR and RevPAR |
| 137 | +* Guest segmentation updates |
| 138 | +* Trending room types |
| 139 | +* Lead time distribution |
| 140 | + |
| 141 | +**Dashboard Requirements:** |
| 142 | + |
| 143 | +* Show "as of [timestamp]" freshness indicator |
| 144 | +* Display both real-time and batch-computed metrics |
| 145 | +* Handle eventual consistency gracefully |
| 146 | + |
| 147 | +### Part 3: Architecture & Scalability (30%) |
| 148 | + |
| 149 | +Design for production-scale streaming: |
| 150 | + |
| 151 | +**Architecture Diagram Must Show:** |
| 152 | + |
| 153 | +1. **Ingestion Layer** |
| 154 | + |
| 155 | + * Event sources (Kafka, Kinesis, Event Hub, or similar) |
| 156 | + * Stream processing engine (Spark Streaming, Flink, Kafka Streams) |
| 157 | + * Dead letter queues for failed events |
| 158 | + |
| 159 | +2. **Processing Layer** |
| 160 | + |
| 161 | + * Stream processing jobs |
| 162 | + * Batch jobs for historical data |
| 163 | + * State stores and checkpointing |
| 164 | + * Data quality validation |
| 165 | + |
| 166 | +3. **Storage Layer** |
| 167 | + |
| 168 | + * Hot storage (recent data, fast queries) |
| 169 | + * Warm storage (last 90 days) |
| 170 | + * Cold storage (historical archive) |
| 171 | + * Partitioning and indexing strategy |
| 172 | + |
| 173 | +4. **Serving Layer** |
| 174 | + |
| 175 | + * Query patterns for real-time dashboards |
| 176 | + * Caching strategy |
| 177 | + * API endpoints for live data |
| 178 | + |
| 179 | + |
| 180 | +**Scalability Considerations:** |
| 181 | + |
| 182 | +* How to handle 10x traffic increase |
| 183 | +* Multi-region deployment strategy |
| 184 | +* Failure recovery and replay mechanisms |
| 185 | +* Cost optimization (compute vs storage) |
| 186 | + |
| 187 | +**Document:** |
| 188 | + |
| 189 | +* Technology choices and trade-offs |
| 190 | +* Latency vs throughput trade-offs |
| 191 | +* Consistency guarantees |
| 192 | +* Monitoring and alerting strategy |
| 193 | +* Incremental vs full refresh decisions |
| 194 | + |
| 195 | +### Part 4: Implementation & Communication (10%) |
| 196 | + |
| 197 | +**Code/Pseudocode:** |
| 198 | + |
| 199 | +* Streaming job implementation (working or detailed pseudocode) |
| 200 | +* Incremental merge logic |
| 201 | +* State management code |
| 202 | +* Data quality checks |
| 203 | + |
| 204 | +**Documentation:** |
| 205 | + |
| 206 | +* Clear explanation of ingestion strategy |
| 207 | +* Incremental loading algorithm |
| 208 | +* How you handle edge cases |
| 209 | +* Testing approach for streaming systems |
| 210 | + |
| 211 | +--- |
| 212 | + |
| 213 | +## 🌊 Streaming Scenarios to Address |
| 214 | + |
| 215 | +Your solution should handle these real-world scenarios: |
| 216 | + |
| 217 | +### Scenario 1: High-Volume Booking Window |
| 218 | + |
| 219 | +* 1,000 bookings arrive in 10 minutes (Black Friday sale) |
| 220 | +* System must process without dropping events |
| 221 | +* Metrics must stay accurate under load |
| 222 | + |
| 223 | +### Scenario 2: Late-Arriving Cancellation |
| 224 | + |
| 225 | +* Booking created at 10:00 AM |
| 226 | +* Cancellation event arrives at 2:00 PM (for 10:05 AM cancellation) |
| 227 | +* System must retroactively correct metrics |
| 228 | + |
| 229 | +### Scenario 3: Duplicate Events |
| 230 | + |
| 231 | +* Network retry causes duplicate booking event |
| 232 | +* System must deduplicate correctly |
| 233 | +* Idempotency key: booking_id + event_timestamp |
| 234 | + |
| 235 | +### Scenario 4: Schema Evolution |
| 236 | + |
| 237 | +* New field added: `booking_source` (web, mobile, phone) |
| 238 | +* Existing pipelines must continue working |
| 239 | +* Historical data lacks this field |
| 240 | + |
| 241 | +### Scenario 5: Multi-Property Update |
| 242 | + |
| 243 | +* 100 hotels update their room rates simultaneously |
| 244 | +* Incremental processing must be efficient |
| 245 | +* Avoid full table scans |
| 246 | + |
| 247 | +--- |
| 248 | + |
| 249 | +## 📋 Deliverables |
| 250 | + |
| 251 | +1. **Streaming Architecture Design** |
| 252 | + |
| 253 | + * End-to-end architecture diagram |
| 254 | + * Data flow from ingestion to serving |
| 255 | + * Technology stack justification |
| 256 | + |
| 257 | +2. **Incremental Processing Implementation** |
| 258 | + |
| 259 | + * CDC/Upsert logic (code or detailed pseudocode) |
| 260 | + * Partitioning strategy |
| 261 | + * Checkpoint/state management approach |
| 262 | + |
| 263 | +3. **Code/Pseudocode** |
| 264 | + |
| 265 | + * Streaming job implementation |
| 266 | + * Incremental merge logic |
| 267 | + * Late data handling |
| 268 | + |
| 269 | +4. **Scaling Strategy Document** |
| 270 | + |
| 271 | + * How to scale from 10 to 10,000 events/sec |
| 272 | + * Failure recovery approach |
| 273 | + * Monitoring and observability |
| 274 | + |
| 275 | +5. **Presentation (5-10 slides)** |
| 276 | + |
| 277 | + * Architectural approach |
| 278 | + * Key design decisions |
| 279 | + * Trade-offs made |
| 280 | + * Performance characteristics |
| 281 | + |
| 282 | +--- |
| 283 | + |
| 284 | +## 💡 Key Questions to Answer |
| 285 | + |
| 286 | +Your solution should address: |
| 287 | + |
| 288 | +1. **Latency vs Completeness**: How do you balance fast processing with waiting for late data? |
| 289 | +2. **Exactly-Once Processing**: How do you guarantee no double-counting in metrics? |
| 290 | +3. **Failure Recovery**: What happens if your stream processor crashes? |
| 291 | +4. **Backfill Strategy**: How do you reprocess historical data without disrupting live streams? |
| 292 | +5. **Schema Evolution**: How do you handle new fields without breaking pipelines? |
| 293 | +6. **Cost Optimization**: When to use micro-batching vs true streaming? |
| 294 | + |
| 295 | +--- |
| 296 | + |
| 297 | +## 🎁 Bonus Challenges (Optional) |
| 298 | + |
| 299 | +* Implement a **lambda architecture** (streaming + batch) with consistency guarantees |
| 300 | +* Design a **Kappa architecture** (streaming-only) alternative |
| 301 | +* Add **multi-region active-active** replication |
| 302 | +* Implement **complex event processing** (CEP) for fraud detection |
| 303 | +* Build a **backpressure handling** mechanism |
| 304 | +* Design **blue-green deployment** strategy for zero-downtime updates |
| 305 | + |
| 306 | +--- |
| 307 | + |
| 308 | +## 📥 Download Datasets |
| 309 | + |
| 310 | +**Baseline historical data in [Dataset Documentation](./Dataset%20Documentation.md):** |
| 311 | +* [bookings.csv](./Task/Dataset/bookings.csv) - Use as initial state |
| 312 | +* [guests.json](./Task/Dataset/guests.json) - Use as initial state |
| 313 | +* [room_inventory.sql](./Task/Dataset/room_inventory.sql) - Static reference data |
| 314 | + |
| 315 | +**Streaming event examples in [Streaming Events Guide](./Submission%20Guidelines%20&%20Tips.md):** |
| 316 | +* Sample event structures (JSON) |
| 317 | +* Event simulation scripts (Python) |
| 318 | +* CDC patterns and examples |
| 319 | + |
| 320 | +--- |
| 321 | + |
| 322 | +**Good luck! We're excited to see how you approach real-time data engineering! 🚀** |
| 323 | + |
| 324 | +_Focus on demonstrating your understanding of streaming concepts and incremental processing strategies. Working code is great, but clear architectural thinking is more important._ |
0 commit comments