Generate realistic food delivery order data for testing batch and streaming workflows.
The test data generator creates order lifecycle events for a scenario inspired by Casper's Kitchens. It simulates a ghost kitchen food delivery platform, with realistic event timing, configurable chaos injection, and adjustable scale.
## Quick Start

```bash
# Generate 7 days of data
./lakehouse testdata generate --days 7

# Load into Iceberg tables
./lakehouse testdata load

# Stream to Kafka at 60x speed
./lakehouse testdata stream --speed 60
```

## Commands

### `generate`

Create order events as Parquet files.
```bash
./lakehouse testdata generate            # 90 days (~7GB)
./lakehouse testdata generate --days 7   # 7 days (~500MB)
./lakehouse testdata generate --days 1   # 1 day (~70MB)
./lakehouse testdata generate --seed 42  # Reproducible
```

Output:

- `data/dimensions/` - Dimension tables (brands, items, etc.)
- `data/events/orders_Nd.parquet` - Event data
- `data/load_to_iceberg.py` - Generated load script
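To sanity-check a generated file, you can open it with pyarrow. This is a minimal sketch; the `orders_7d.parquet` filename assumes a 7-day run (the `N` in `orders_Nd.parquet` is the day count):

```python
# Minimal sketch: inspect a generated event file with pyarrow.
# Assumes a 7-day run produced data/events/orders_7d.parquet.
import pyarrow.parquet as pq

table = pq.read_table("data/events/orders_7d.parquet")
print(table.num_rows)                  # total events
print(table.schema)                    # column names and types
print(table.slice(0, 5).to_pylist())   # first five events as dicts
```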
### `load`

Load generated data into Iceberg tables.

```bash
./lakehouse testdata load
```

Creates tables in the `iceberg.bronze.*` namespace.
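To confirm the load, you can list the bronze tables from PySpark. A sketch, assuming the stack's Spark is already configured with the `iceberg` catalog (as in its Spark containers):

```python
from pyspark.sql import SparkSession

# Sketch: verify the loaded tables from PySpark. Assumes spark-defaults
# already wire up the `iceberg` catalog, as in the stack's containers.
spark = SparkSession.builder.appName("verify-load").getOrCreate()
spark.sql("SHOW TABLES IN iceberg.bronze").show()
spark.sql("SELECT COUNT(*) AS events FROM iceberg.bronze.orders").show()
```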
### `stream`

Stream events to Kafka in simulated real-time.

```bash
./lakehouse testdata stream               # Real-time
./lakehouse testdata stream --speed 60    # 60x (1 min = 1 hour)
./lakehouse testdata stream --speed 3600  # 3600x (1 sec = 1 hour)
```
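The speed multiplier simply divides the real gap between consecutive event timestamps. A minimal sketch of the idea (the actual logic lives in `producer.py`; `events` here is an assumed iterable of dicts with an ISO 8601 `ts` field):

```python
import time
from datetime import datetime

# Sketch of time-compressed replay: sleep the gap between event
# timestamps divided by the speed factor, then hand the event off.
# The real implementation is in producer.py.
def replay(events, speed=60):
    prev = None
    for event in events:
        ts = datetime.fromisoformat(event["ts"])
        if prev is not None:
            # max() guards against late/out-of-order timestamps
            time.sleep(max((ts - prev).total_seconds() / speed, 0))
        prev = ts
        yield event  # e.g. publish to Kafka here
```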
### `stats`

Show dataset statistics.

```bash
./lakehouse testdata stats
```

### `clean`

Remove all generated data.

```bash
./lakehouse testdata clean
```

## Dimension Tables

| Table | Records | Description |
|---|---|---|
| `dim_brands` | 20 | Ghost kitchen brands |
| `dim_items` | 160 | Menu items (8 per brand) |
| `dim_categories` | 10 | Food categories |
| `dim_locations` | 4 | Delivery cities |
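The counts are linked: 20 brands x 8 menu items per brand gives the 160 rows in `dim_items`. A quick check (the file name and Parquet format are assumptions about the generator's dimension output):

```python
import pyarrow.parquet as pq

# Quick referential check: 20 brands x 8 menu items each = 160 items.
# The dim_items.parquet name and format are assumptions.
items = pq.read_table("data/dimensions/dim_items.parquet")
assert items.num_rows == 160
```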
## Event Schema

| Column | Type | Description |
|---|---|---|
| `event_id` | string | UUID |
| `event_type` | string | Lifecycle event type |
| `ts` | string | ISO 8601 timestamp |
| `order_id` | string | UUID |
| `location_id` | int | FK to `dim_locations` |
| `sequence` | int | Event order in lifecycle |
| `body` | string | JSON payload |
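A single event might look like this (all values are illustrative, including the `body` fields; only the columns above are guaranteed by the schema):

```python
# Illustrative event record matching the schema above.
# Every value here is made up, not actual generator output.
event = {
    "event_id": "8f1c2d34-5a6b-4c7d-8e9f-0a1b2c3d4e5f",
    "event_type": "order_created",
    "ts": "2024-06-01T12:34:56",
    "order_id": "1a2b3c4d-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
    "location_id": 2,
    "sequence": 1,
    "body": '{"brand_id": 7, "items": [{"item_id": 42, "qty": 2}], "total": 23.50}',
}
```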
## Order Lifecycle

Order lifecycle progression:

```
order_created       # Customer places order
      ↓
kitchen_started     # Kitchen begins preparation
      ↓
kitchen_finished    # Food ready for pickup
      ↓
order_ready         # Packaged for driver
      ↓
driver_arrived      # Driver at restaurant
      ↓
driver_picked_up    # Driver has order
      ↓
driver_ping (1-5x)  # Driver location updates
      ↓
delivered           # Order complete
```
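A per-order event sequence could be produced along these lines. This is a minimal sketch, not the code in `events.py`; stage names come from the diagram above, and the durations are invented:

```python
import uuid
from datetime import timedelta

# Sketch of per-order lifecycle generation; the real generator lives in
# events.py. Stage names match the diagram; durations are invented.
STAGES = ["order_created", "kitchen_started", "kitchen_finished",
          "order_ready", "driver_arrived", "driver_picked_up"]

def order_events(order_id, start, rng):
    ts, seq = start, 1
    for stage in STAGES:
        yield {"event_id": str(uuid.uuid4()), "event_type": stage,
               "ts": ts.isoformat(), "order_id": order_id, "sequence": seq}
        ts += timedelta(minutes=rng.uniform(1, 10))
        seq += 1
    for _ in range(rng.randint(1, 5)):  # 1-5 driver location updates
        yield {"event_id": str(uuid.uuid4()), "event_type": "driver_ping",
               "ts": ts.isoformat(), "order_id": order_id, "sequence": seq}
        ts += timedelta(minutes=rng.uniform(0.5, 2))
        seq += 1
    yield {"event_id": str(uuid.uuid4()), "event_type": "delivered",
           "ts": ts.isoformat(), "order_id": order_id, "sequence": seq}
```

Passing a seeded `rng` (e.g. `random.Random(42)`) is what makes `--seed` runs reproducible.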
## Chaos Injection

The generator includes configurable data quality issues for testing resilience:
| Issue | Description |
|---|---|
| Null values | Random nulls in non-key fields |
| Late events | Events arriving out of order |
| Duplicates | Duplicate event IDs |
| Malformed JSON | Invalid JSON in body field |
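Injection can be as simple as a post-processing pass over the event stream. A sketch of the idea only; the real knobs live in `chaos.py` and `config.py`, and the rates below are invented:

```python
# Sketch of chaos injection as a post-processing pass; actual logic and
# configuration live in chaos.py / config.py. All rates are invented.
# Late/out-of-order events (the fourth issue) would be simulated by
# perturbing timestamps and are omitted here for brevity.
def inject_chaos(events, rng, null_rate=0.01, dup_rate=0.005, bad_json_rate=0.002):
    for event in events:
        if rng.random() < null_rate:
            event["location_id"] = None   # null in a non-key field
        if rng.random() < bad_json_rate:
            event["body"] = '{"oops": '   # truncated, invalid JSON
        yield event
        if rng.random() < dup_rate:
            yield dict(event)             # duplicate with the same event_id
```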
## Data Sizes

Approximate dataset sizes by number of days generated:
| Days | Orders | Events | File Size |
|---|---|---|---|
| 1 | ~38K | ~1M | ~70MB |
| 7 | ~266K | ~7M | ~500MB |
| 30 | ~1.1M | ~30M | ~2GB |
| 90 | ~3.4M | ~92M | ~7GB |
## Example: Batch Workflow

```bash
# Generate data
./lakehouse testdata generate --days 30

# Load to Iceberg
./lakehouse testdata load

# Query with Spark
docker exec -it spark-master-41 /opt/spark/bin/spark-sql \
  -e "SELECT event_type, COUNT(*) FROM iceberg.bronze.orders GROUP BY 1"
```

## Example: Streaming Workflow

Terminal 1:

```bash
./lakehouse testdata stream --speed 60
```

Terminal 2:
```bash
./lakehouse consumer
```

## Reproducibility

```bash
# Same seed = same data
./lakehouse testdata generate --days 7 --seed 12345
./lakehouse testdata clean
./lakehouse testdata generate --days 7 --seed 12345  # Identical output
```

## Module Layout

```
scripts/testdata/
├── __init__.py    # Main entry point
├── __main__.py    # CLI handler
├── config.py      # Configuration dataclasses
├── dimensions.py  # Dimension table generators
├── events.py      # Order lifecycle event generator
├── chaos.py       # Data quality issue injection
├── exporter.py    # Parquet batch export
└── producer.py    # Kafka streaming producer
```
When streaming, events are published to:

- Topic: `orders` (configurable)
- Key: `order_id`
- Value: JSON event
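Keying by `order_id` keeps all of an order's events in the same partition, preserving per-order ordering for consumers. A minimal sketch of that publishing pattern using `kafka-python` (the library choice and broker address are assumptions; the real code is in `producer.py`):

```python
import json
from kafka import KafkaProducer

# Sketch of the publishing pattern: JSON value, keyed by order_id so all
# events for an order land in one partition. Library choice and broker
# address are assumptions; see producer.py for the actual implementation.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(event, topic="orders"):
    producer.send(topic, key=event["order_id"], value=event)
```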