A fully open-source, self-hostable data lakehouse for local development and testing of modern data workflows. Run production-grade infrastructure on your laptop with Apache Spark, Iceberg, and Kafka - no cloud account required. Includes a realistic data generation framework to test batch and streaming pipelines.
Why Lakehouse at Home?
- Learn data engineering with real tools, not toy examples
- Develop and test Spark jobs locally before deploying to production
- Experiment with Iceberg table formats, streaming pipelines, and medallion architecture
- Optionally deploy to your cloud provider when ready, using the included Terraform templates
| Component | Version | Purpose |
|---|---|---|
| Apache Spark | 4.0 / 4.1 | Distributed compute |
| Apache Iceberg | 1.10 | ACID table format |
| Apache Kafka | 3.6 | Event streaming |
| Apache Airflow | 3.1 | Workflow orchestration |
| PostgreSQL | 16 | Catalog metadata |
| SeaweedFS | - | S3-compatible storage |
| Unity Catalog | 0.3.1 | REST catalog (optional) |
| Resource | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Disk | 20 GB | 50 GB |
| CPU | 4 cores | 8 cores |
Software: Docker, Java 17+ (21 for Spark 4.1), Python 3.10+, Poetry
Using Claude Code, Cursor, or another AI coding assistant? Point it at this repo:
Clone https://github.com/lisancao/lakehouse-at-home and follow CLAUDE.md to set up locally.
The agent will use CLAUDE.md for context and ./lakehouse setup to validate prerequisites.
# Clone
git clone https://github.com/lisancao/lakehouse-at-home.git
cd lakehouse-at-home
# Setup (validates prereqs, downloads JARs, installs deps)
./lakehouse setup
# Configure credentials
nano .env
nano config/spark/spark-defaults.conf
# Start
./lakehouse start all
# Verify
./lakehouse test
See Installation Guide for detailed OS-specific setup.
# Setup & validation
./lakehouse setup # Validate prereqs, download JARs, install deps
./lakehouse check-config # Validate credential consistency
./lakehouse preflight # Test service connectivity
# Service management
./lakehouse start all # Start Spark + Kafka
./lakehouse stop all # Stop all services
./lakehouse status # Check service health
./lakehouse status --json # Machine-readable status
./lakehouse test # Run connectivity tests
./lakehouse logs spark-master # View logs
# Unity Catalog (optional)
./lakehouse start unity-catalog # Start Unity Catalog REST server
./lakehouse stop unity-catalog # Stop Unity Catalog
# Airflow (optional)
./lakehouse start airflow # Start Airflow scheduler + webserver
./lakehouse stop airflow # Stop Airflow
./lakehouse logs airflow-webserver # View Airflow logs
See CLI Reference for all commands.
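The `--json` flag makes the stack easy to script against. A minimal sketch of a health filter, assuming (hypothetically) that `status --json` emits an object mapping service names to a state string — check the real output shape before relying on this:

```python
import json

def unhealthy_services(status_json: str) -> list[str]:
    """Return names of services not reporting 'running'.

    The payload shape here is an assumption, e.g. {"spark-master": "running", ...};
    adjust keys/values to match actual `./lakehouse status --json` output.
    """
    status = json.loads(status_json)
    return [name for name, state in status.items() if state != "running"]

# Fabricated payload for illustration:
sample = '{"spark-master": "running", "kafka": "exited"}'
print(unhealthy_services(sample))  # ['kafka']
```

A wrapper like this is handy in CI: exit non-zero when the returned list is non-empty.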
Generate realistic order data for testing:
./lakehouse testdata generate --days 7 # Generate 7 days
./lakehouse testdata load # Load to Iceberg
./lakehouse testdata stream --speed 60 # Stream to Kafka
See Test Data Guide for details.
| Guide | Description |
|---|---|
| Quickstart | 5-minute setup |
| Installation | macOS, Ubuntu, Windows guides |
| Configuration | Environment and Spark config |
| CLI Reference | All commands |
| Streaming | Kafka + Spark streaming |
| Airflow | Workflow orchestration |
| Multi-Version Spark | Run 4.0 and 4.1 together |
| Unity Catalog | REST catalog setup & migration |
| Architecture | System design |
| AWS Deployment | Cloud production setup |
| Databricks Deployment | Managed Spark platform |
| Troubleshooting | Common issues |
| Security | Security guidelines for contributors |
| AI Skills | AI assistant references (SDP, etc.) |
┌───────────────────────────────────────────────────────────────────────────────────┐
│ QUERIES │
│ Spark SQL • Time Travel • Dashboards • Reports │
└───────────────────────────────────────────────────────────────────────────────────┘
▲
│
┌──────────────────────┐ ┌───────────────────────────────────────────────────────┐
│ │ │ COMPUTE: Spark 4.x │
│ STREAMING │ │ │
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ Kafka (:9092) │───▶│ │ BRONZE │─▶│ SILVER │─▶│ GOLD │ │
│ └─ events topic │ │ │ (raw ingest)│ │ (cleaned) │ │ (aggregated)│ │
│ └─ orders topic │ │ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │
│ Zookeeper (:2181) │ │ Spark 4.0 (:7077, UI :8080) │
│ │ │ Spark 4.1 (:7078, UI :8082) │
│ (direct to Spark, │ └───────────────────────────────────────────┬──────────┘
│ not via catalog) │ ▲ │
└──────────────────────┘ │ │ Iceberg API
│ docker exec │
┌──────────────────────┐ │ spark-submit │
│ ORCHESTRATION │────────────────────────┘ │
│ │ │
│ Airflow (:8085) │ (Airflow schedules Spark jobs; │
│ └─ DAGs │ Spark talks to Iceberg) │
│ └─ Sensors │ │
└──────────────────────┘ │
▼
┌───────────────────────────────────────────────────────┐
│ CATALOG: Iceberg Metadata │
│ │
│ PostgreSQL (:5432) Unity Catalog (:8080) │
│ └─ JDBC catalog └─ REST catalog │
│ └─ table schemas └─ multi-engine access │
│ └─ snapshots, partitions │
└──────────────────────────┬────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ STORAGE: SeaweedFS (S3 API) │
│ │
│ s3://lakehouse/warehouse/ │
│ ├── bronze/*.parquet │
│ ├── silver/*.parquet │
│ └── gold/*.parquet │
│ │
│ :8333 (S3-compatible object storage) │
└───────────────────────────────────────────────────────┘
How It Works:
- Streaming → Kafka feeds events directly to Spark (not via catalog)
- Compute → Spark transforms data through Bronze → Silver → Gold layers
- Orchestration → Airflow schedules Spark jobs via docker exec spark-submit
- Catalog → PostgreSQL or Unity Catalog manages Iceberg table metadata
- Storage → SeaweedFS stores Parquet files accessed via Iceberg
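The Bronze → Silver → Gold contract can be sketched without Spark at all — plain Python over hypothetical order events — which is useful for reasoning about what each layer owns before writing the real jobs:

```python
from collections import defaultdict

# Bronze: raw ingest, possibly duplicated or malformed (hypothetical rows)
bronze = [
    {"order_id": "a1", "amount": "19.99", "day": "2024-06-01"},
    {"order_id": "a1", "amount": "19.99", "day": "2024-06-01"},  # duplicate
    {"order_id": "b2", "amount": "bad", "day": "2024-06-01"},    # malformed
    {"order_id": "c3", "amount": "5.00", "day": "2024-06-02"},
]

# Silver: deduplicated, typed, invalid rows dropped
seen: set[str] = set()
silver = []
for row in bronze:
    if row["order_id"] in seen:
        continue
    try:
        amount = float(row["amount"])  # enforce the schema
    except ValueError:
        continue
    seen.add(row["order_id"])
    silver.append({**row, "amount": amount})

# Gold: aggregated revenue per day
gold: dict[str, float] = defaultdict(float)
for row in silver:
    gold[row["day"]] += row["amount"]

print(dict(gold))  # {'2024-06-01': 19.99, '2024-06-02': 5.0}
```

In the actual pipeline each layer is an Iceberg table and the loops become Spark transformations, but the layer responsibilities — ingest as-is, then clean, then aggregate — are the same.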
Catalog Options:
| Feature | PostgreSQL JDBC | Unity Catalog |
|---|---|---|
| Protocol | Direct SQL | REST API |
| Clients | Spark only | Spark, DuckDB, Trino, Dremio |
| Auth | Database credentials | OAuth / Token |
| Setup | Simpler | More flexible |
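As an illustration, the two options map to Iceberg catalog settings in `spark-defaults.conf` roughly as below. Catalog name, credentials, ports, and the REST endpoint path are placeholders — match them to your `.env` and the Configuration and Unity Catalog guides:

```properties
# Option A: PostgreSQL JDBC catalog (placeholder values)
spark.sql.catalog.lakehouse              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.lakehouse.uri          jdbc:postgresql://localhost:5432/lakehouse
spark.sql.catalog.lakehouse.warehouse    s3://lakehouse/warehouse/

# Option B: Unity Catalog REST catalog (placeholder values; endpoint path is
# an assumption — verify against the Unity Catalog guide)
# spark.sql.catalog.lakehouse            org.apache.iceberg.spark.SparkCatalog
# spark.sql.catalog.lakehouse.type       rest
# spark.sql.catalog.lakehouse.uri        http://localhost:8080/api/2.1/unity-catalog/iceberg
```

Keeping the same catalog name for both options means table references like `lakehouse.silver.orders` don't change when you switch backends.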
| Service | Port | UI |
|---|---|---|
| PostgreSQL | 5432 | - |
| SeaweedFS | 8333 | - |
| Spark 4.0 | 7077 | http://localhost:8080 |
| Spark 4.1 | 7078 | http://localhost:8082 |
| Kafka | 9092 | - |
| Airflow | 8085 | http://localhost:8085 |
| Unity Catalog | 8080 | REST API |
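A quick stdlib way to check that the ports in the table above are reachable — similar in spirit to `./lakehouse preflight`, with host and port values taken from the table (adjust if you remap them):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in [("PostgreSQL", 5432), ("SeaweedFS", 8333),
                   ("Spark 4.0", 7077), ("Kafka", 9092), ("Airflow", 8085)]:
    state = "up" if port_open("localhost", port) else "down"
    print(f"{name:12} :{port} {state}")
```

This only proves the port accepts TCP connections; `./lakehouse test` goes further and exercises the actual protocols.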
cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply
See AWS Deployment Guide | Estimated: $50-500/month
cd terraform-databricks
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply
See Databricks Deployment Guide | Estimated: $100-800/month
# Install test dependencies
poetry install --with dev,test
# Run all tests
poetry run pytest tests/ -v
# Run specific test categories
poetry run pytest tests/ --ignore=tests/integration/ # Unit tests
poetry run pytest tests/integration/ -v # Integration tests
poetry run pytest -m security -v # Security tests
# Multi-version Spark testing
./scripts/connectivity/test-spark-versions.sh -v 4.0 -v 4.1 -t all
See SECURITY.md for security guidelines.
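The security-marked tests guard against problems like credentials landing in tracked files. A minimal stdlib sketch of that kind of check — the patterns below are hypothetical, not the project's actual rules:

```python
import re

# Hypothetical patterns for obvious secret material
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),        # AWS access-key-id shape
    re.compile(r"(?i)password\s*=\s*\S+"),  # inline password assignment
]

def find_secrets(text: str) -> list[str]:
    """Return substrings that look like committed secrets."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

print(find_secrets("password=hunter2"))  # ['password=hunter2']
print(find_secrets("nothing to see"))    # []
```

In the real suite a check like this would be a `@pytest.mark.security` test run over the repository's config files, which is what `pytest -m security` selects.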
# Install pre-commit hooks
pre-commit install
# Run security checks
pre-commit run --all-files
poetry run pytest -m security -v
Contributions welcome! Please:
- Open an issue first to discuss changes
- Install pre-commit hooks: pre-commit install
- Run tests before submitting: poetry run pytest tests/
- See SECURITY.md for security requirements
MIT