Lakehouse at Home

A fully open-source, self-hostable data lakehouse for local development and testing of modern data workflows. Run production-grade infrastructure on your laptop with Apache Spark, Iceberg, and Kafka - no cloud account required. Includes a realistic data generation framework to test batch and streaming pipelines.


Why Lakehouse at Home?

  • Learn data engineering with real tools, not toy examples
  • Develop and test Spark jobs locally before deploying to production
  • Experiment with Iceberg table formats, streaming pipelines, and medallion architecture
  • Optionally deploy to your cloud provider when ready, using the included Terraform templates

Stack

Component        Version     Purpose
Apache Spark     4.0 / 4.1   Distributed compute
Apache Iceberg   1.10        ACID table format
Apache Kafka     3.6         Event streaming
Apache Airflow   3.1         Workflow orchestration
PostgreSQL       16          Catalog metadata
SeaweedFS        -           S3-compatible storage
Unity Catalog    0.3.1       REST catalog (optional)

Requirements

Resource   Minimum   Recommended
RAM        8 GB      16 GB
Disk       20 GB     50 GB
CPU        4 cores   8 cores

Software: Docker, Java 17+ (21 for Spark 4.1), Python 3.10+, Poetry
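The version floors above are what `./lakehouse setup` validates for you. As an illustration of that kind of check, here is a minimal stdlib-only sketch; the helper names (`parse_version`, `meets_minimum`, `tool_on_path`) are ours for this example, not part of the repo's CLI:

```python
import re
import shutil

def parse_version(text):
    """Extract (major, minor) from output like 'openjdk 21.0.2' or 'Python 3.12.1'."""
    m = re.search(r"(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def meets_minimum(found, minimum):
    """Compare (major, minor) tuples, e.g. (3, 12) satisfies a (3, 10) floor."""
    return found is not None and found >= minimum

def tool_on_path(name):
    """Presence check via PATH lookup (docker, java, poetry, ...)."""
    return shutil.which(name) is not None
```

In practice you would feed `parse_version` the output of `java -version` or `python3 --version` and compare against the floors in the table; the real setup command does the equivalent for every prerequisite.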

Getting Started

AI-Assisted Setup

Using Claude Code, Cursor, or another AI coding assistant? Point it at this repo:

Clone https://github.com/lisancao/lakehouse-at-home and follow CLAUDE.md to set up locally.

The agent will use CLAUDE.md for context and ./lakehouse setup to validate prerequisites.

Manual Setup

# Clone
git clone https://github.com/lisancao/lakehouse-at-home.git
cd lakehouse-at-home

# Setup (validates prereqs, downloads JARs, installs deps)
./lakehouse setup

# Configure credentials
nano .env
nano config/spark/spark-defaults.conf

# Start
./lakehouse start all

# Verify
./lakehouse test

See Installation Guide for detailed OS-specific setup.
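For orientation, a JDBC-catalog setup in config/spark/spark-defaults.conf typically uses Iceberg's standard Spark catalog properties. The fragment below is a sketch only: the catalog name (`lakehouse`), hostnames, and paths are placeholders, so defer to the template the repo ships:

```properties
# Iceberg catalog backed by the bundled PostgreSQL (JDBC catalog)
spark.sql.extensions                      org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.lakehouse               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type          jdbc
spark.sql.catalog.lakehouse.uri           jdbc:postgresql://localhost:5432/catalog
spark.sql.catalog.lakehouse.warehouse     s3://lakehouse/warehouse/
# SeaweedFS serves the S3 API on :8333
spark.sql.catalog.lakehouse.io-impl       org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.lakehouse.s3.endpoint   http://localhost:8333
```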

CLI

# Setup & validation
./lakehouse setup                # Validate prereqs, download JARs, install deps
./lakehouse check-config         # Validate credential consistency
./lakehouse preflight            # Test service connectivity

# Service management
./lakehouse start all            # Start Spark + Kafka
./lakehouse stop all             # Stop all services
./lakehouse status               # Check service health
./lakehouse status --json        # Machine-readable status
./lakehouse test                 # Run connectivity tests
./lakehouse logs spark-master    # View logs

# Unity Catalog (optional)
./lakehouse start unity-catalog  # Start Unity Catalog REST server
./lakehouse stop unity-catalog   # Stop Unity Catalog

# Airflow (optional)
./lakehouse start airflow        # Start Airflow scheduler + webserver
./lakehouse stop airflow         # Stop Airflow
./lakehouse logs airflow-webserver  # View Airflow logs

See CLI Reference for all commands.

Test Data

Generate realistic order data for testing:

./lakehouse testdata generate --days 7    # Generate 7 days
./lakehouse testdata load                 # Load to Iceberg
./lakehouse testdata stream --speed 60    # Stream to Kafka

See Test Data Guide for details.

Documentation

Guide                   Description
Quickstart              5-minute setup
Installation            macOS, Ubuntu, Windows guides
Configuration           Environment and Spark config
CLI Reference           All commands
Streaming               Kafka + Spark streaming
Airflow                 Workflow orchestration
Multi-Version Spark     Run 4.0 and 4.1 together
Unity Catalog           REST catalog setup & migration
Architecture            System design
AWS Deployment          Cloud production setup
Databricks Deployment   Managed Spark platform
Troubleshooting         Common issues
Security                Security guidelines for contributors
AI Skills               AI assistant references (SDP, etc.)

Architecture

┌───────────────────────────────────────────────────────────────────────────────────┐
│                                    QUERIES                                        │
│            Spark SQL  •  Time Travel  •  Dashboards  •  Reports                   │
└───────────────────────────────────────────────────────────────────────────────────┘
                                        ▲
                                        │
┌──────────────────────┐    ┌───────────────────────────────────────────────────────┐
│                      │    │                    COMPUTE: Spark 4.x                 │
│  STREAMING           │    │                                                       │
│                      │    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐   │
│  Kafka (:9092)       │───▶│  │   BRONZE    │─▶│   SILVER    │─▶│    GOLD     │   │
│  └─ events topic     │    │  │ (raw ingest)│  │  (cleaned)  │  │ (aggregated)│   │
│  └─ orders topic     │    │  └─────────────┘  └─────────────┘  └─────────────┘   │
│                      │    │                                                       │
│  Zookeeper (:2181)   │    │  Spark 4.0 (:7077, UI :8080)                         │
│                      │    │  Spark 4.1 (:7078, UI :8082)                         │
│  (direct to Spark,   │    └───────────────────────────────────────────┬──────────┘
│   not via catalog)   │                        ▲                       │
└──────────────────────┘                        │                       │ Iceberg API
                                                │ docker exec           │
┌──────────────────────┐                        │ spark-submit          │
│  ORCHESTRATION       │────────────────────────┘                       │
│                      │                                                │
│  Airflow (:8085)     │  (Airflow schedules Spark jobs;                │
│  └─ DAGs             │   Spark talks to Iceberg)                      │
│  └─ Sensors          │                                                │
└──────────────────────┘                                                │
                                                                        ▼
                            ┌───────────────────────────────────────────────────────┐
                            │                 CATALOG: Iceberg Metadata             │
                            │                                                       │
                            │  PostgreSQL (:5432)          Unity Catalog (:8080)    │
                            │  └─ JDBC catalog             └─ REST catalog          │
                            │  └─ table schemas            └─ multi-engine access   │
                            │  └─ snapshots, partitions                             │
                            └──────────────────────────┬────────────────────────────┘
                                                       │
                                                       ▼
                            ┌───────────────────────────────────────────────────────┐
                            │               STORAGE: SeaweedFS (S3 API)             │
                            │                                                       │
                            │  s3://lakehouse/warehouse/                            │
                            │  ├── bronze/*.parquet                                 │
                            │  ├── silver/*.parquet                                 │
                            │  └── gold/*.parquet                                   │
                            │                                                       │
                            │  :8333 (S3-compatible object storage)                 │
                            └───────────────────────────────────────────────────────┘

How It Works:

  1. Streaming → Kafka feeds events directly to Spark (not via catalog)
  2. Compute → Spark transforms data through Bronze → Silver → Gold layers
  3. Orchestration → Airflow schedules Spark jobs via docker exec spark-submit
  4. Catalog → PostgreSQL or Unity Catalog manages Iceberg table metadata
  5. Storage → SeaweedFS stores Parquet files accessed via Iceberg
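The Bronze → Silver → Gold flow in step 2 can be sketched in plain Python, independent of Spark. A toy illustration of the medallion pattern (field names and cleaning rules are hypothetical):

```python
# Bronze: raw events as ingested, possibly duplicated or malformed
bronze = [
    {"order_id": "o1", "amount": "19.99", "region": "us-west"},
    {"order_id": "o1", "amount": "19.99", "region": "us-west"},  # duplicate
    {"order_id": "o2", "amount": "bad",   "region": "us-east"},  # malformed
    {"order_id": "o3", "amount": "5.00",  "region": "us-west"},
]

# Silver: cleaned and deduplicated, with types enforced
seen, silver = set(), []
for row in bronze:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue                 # drop rows whose amount fails to parse
    if row["order_id"] in seen:
        continue                 # drop duplicate order ids
    seen.add(row["order_id"])
    silver.append({**row, "amount": amount})

# Gold: aggregated for consumption, e.g. revenue per region
gold = {}
for row in silver:
    gold[row["region"]] = round(gold.get(row["region"], 0.0) + row["amount"], 2)

print(gold)  # {'us-west': 24.99}
```

In the actual stack, each layer is an Iceberg table and the transformations are Spark jobs, but the shape of the pipeline is the same: ingest raw, clean and conform, then aggregate.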

Catalog Options:

Feature    PostgreSQL JDBC        Unity Catalog
Protocol   Direct SQL             REST API
Clients    Spark only             Spark, DuckDB, Trino, Dremio
Auth       Database credentials   OAuth / Token
Setup      Simpler                More flexible

Ports

Service         Port   UI
PostgreSQL      5432   -
SeaweedFS       8333   -
Spark 4.0       7077   http://localhost:8080
Spark 4.1       7078   http://localhost:8082
Kafka           9092   -
Airflow         8085   http://localhost:8085
Unity Catalog   8080   REST API

Cloud Deployment

AWS (EMR + S3 + RDS)

cd terraform
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply

See AWS Deployment Guide | Estimated: $50-500/month

Databricks

cd terraform-databricks
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply

See Databricks Deployment Guide | Estimated: $100-800/month

Testing

# Install test dependencies
poetry install --with dev,test

# Run all tests
poetry run pytest tests/ -v

# Run specific test categories
poetry run pytest tests/ --ignore=tests/integration/  # Unit tests
poetry run pytest tests/integration/ -v               # Integration tests
poetry run pytest -m security -v                      # Security tests

# Multi-version Spark testing
./scripts/connectivity/test-spark-versions.sh -v 4.0 -v 4.1 -t all

Security

See SECURITY.md for security guidelines.

# Install pre-commit hooks
pre-commit install

# Run security checks
pre-commit run --all-files
poetry run pytest -m security -v

Contributing

Contributions welcome! Please:

  1. Open an issue first to discuss changes
  2. Install pre-commit hooks: pre-commit install
  3. Run tests before submitting: poetry run pytest tests/
  4. See SECURITY.md for security requirements

License

MIT
