Self-hostable data lakehouse: Spark 4.x + Iceberg 1.10 + Kafka 3.6 + PostgreSQL + SeaweedFS.
| Document | Purpose |
|---|---|
| `docs/getting-started/` | Installation, quickstart, configuration |
| `docs/guides/` | CLI reference, streaming, test data, multi-version Spark, Airflow orchestration |
| `docs/guides/unity-catalog.md` | Unity Catalog OSS setup and migration |
| `docs/guides/pipelines.md` | Data pipelines (imperative vs declarative) |
| `docs/guides/airflow.md` | Apache Airflow orchestration |
| `docs/deployment/` | Local and AWS deployment |
| `docs/architecture.md` | System design |
| `docs/troubleshooting.md` | Common issues |
| `docs/DEV_WORKFLOW.md` | Development workflow: always work from develop, test locally |
| `SECURITY.md` | Security guidelines for contributors |
| `.claude/skills/` | AI assistant skill files (see below) |
```bash
# Setup and validation
./lakehouse setup              # Validate prereqs, download JARs, create DB
./lakehouse check-config       # Validate credential consistency
./lakehouse preflight          # Test service connectivity

# Service management
./lakehouse start all          # Start Spark + Kafka
./lakehouse stop all           # Stop all services
./lakehouse status             # Human-readable status
./lakehouse status --json      # Machine-readable status
./lakehouse test               # Connectivity tests (returns exit code)
./lakehouse logs <service>     # View logs (spark-master, kafka, etc.)

# Unity Catalog (optional)
./lakehouse start unity-catalog   # Start Unity Catalog REST server
./lakehouse stop unity-catalog    # Stop Unity Catalog
./lakehouse logs unity-catalog    # View Unity Catalog logs

# Airflow (optional - requires Airflow 3.x)
./lakehouse start airflow      # Start Airflow scheduler and API server
./lakehouse stop airflow       # Stop Airflow
./lakehouse logs airflow       # View Airflow logs

# Database migrations
./lakehouse migrate            # Apply schema migrations
./lakehouse migrate --dry-run  # Preview migrations
```

```bash
# Install test dependencies
poetry install --with dev,test

# Run all tests
poetry run pytest tests/ -v

# Run by category
poetry run pytest tests/ --ignore=tests/integration/  # Unit only
poetry run pytest tests/integration/ -v               # Integration only
poetry run pytest -m security -v                      # Security only
poetry run pytest -m spark41 -v                       # Spark 4.1 only

# Multi-version Spark testing
./scripts/connectivity/test-spark-versions.sh                  # Default (Spark 4.1)
./scripts/connectivity/test-spark-versions.sh -v 4.0 -v 4.1    # Both versions
./scripts/connectivity/test-spark-versions.sh -t integration   # Integration tests
```

```bash
./lakehouse testdata generate --days 7   # Generate 7 days of order data
./lakehouse testdata load                # Load to Iceberg tables
./lakehouse testdata stream --speed 60   # Stream to Kafka at 60x speed
```

| Path | Purpose |
|---|---|
| `lakehouse` | CLI script (bash) |
| `.env` | Credentials (from `.env.example`) - NOT in git |
| `config/spark/spark-defaults.conf` | Spark config - NOT in git |
| `config/spark/spark-defaults-uc.conf` | Spark config for Unity Catalog - NOT in git |
| `docker-compose-spark41.yml` | Spark 4.1 cluster (default) |
| `docker-compose.yml` | Spark 4.0 cluster |
| `docker-compose-kafka.yml` | Kafka + Zookeeper |
| `docker-compose-unity-catalog.yml` | Unity Catalog OSS server |
| `docker-compose-airflow.yml` | Apache Airflow orchestration |
| `dags/` | Airflow DAG definitions |
| `jars/` | Required JARs (~860MB) |
| `scripts/quickstarts/` | Tutorials (01-04) and demos |
| `scripts/connectivity/` | Integration test scripts (run via CLI) |
| `scripts/pipelines/` | Spark pipeline scripts (SDP, Spark 4.0/4.1) |
| `scripts/demos/` | Interactive demo scripts |
| `scripts/tools/` | Utility scripts (download-jars, etc.) |
| `scripts/testdata/` | Test data generator |
| `tests/` | Test suite |
| `schemas/` | Database migrations |
| `terraform/` | AWS infrastructure |
| `.pre-commit-config.yaml` | Security hooks |
```
Spark 4.x  →  Iceberg 1.10  →  PostgreSQL (metadata) + SeaweedFS (data)
    ↑              ↑
Kafka 3.6     Unity Catalog (optional REST catalog)
    ↑
Airflow (optional orchestration)
```
Catalog Options:
- PostgreSQL JDBC (default) - Direct SQL, Spark-only
- Unity Catalog OSS (optional) - REST API, multi-client (DuckDB, Trino, etc.)

Namespaces (Medallion):
- `iceberg.bronze.*` - Raw data
- `iceberg.silver.*` - Cleaned
- `iceberg.gold.*` - Aggregated
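The three-layer convention above can be captured in a tiny helper. This is an illustrative sketch, not part of the repo's CLI or scripts; only the `iceberg.{bronze,silver,gold}` naming comes from this document.

```python
# Illustrative helper for the medallion namespace convention.
# The layer names mirror this repo's Iceberg namespaces; the helper is a sketch.
LAYERS = ("bronze", "silver", "gold")


def qualified_name(layer: str, table: str, catalog: str = "iceberg") -> str:
    """Return a fully qualified table name like 'iceberg.bronze.orders'."""
    if layer not in LAYERS:
        raise ValueError(f"unknown medallion layer: {layer!r} (expected one of {LAYERS})")
    return f"{catalog}.{layer}.{table}"


print(qualified_name("bronze", "orders"))  # iceberg.bronze.orders
```

A helper like this keeps pipeline code from hardcoding namespace strings and rejects typos in the layer name early.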
| Service | Port |
|---|---|
| PostgreSQL | 5432 |
| SeaweedFS | 8333 |
| Spark 4.0 | 7077 (UI: 8080) |
| Spark 4.1 | 7078 (UI: 8082) |
| Kafka | 9092 |
| Zookeeper | 2181 |
| Unity Catalog | 8081 (when running with Spark) |
| Airflow | 8085 |
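A quick way to sanity-check the ports above from a script is a plain TCP probe, in the spirit of `./lakehouse preflight`. This is a hedged sketch: the port numbers come from the table, but the hostnames and the choice of services to probe are assumptions.

```python
# TCP reachability probe for the service ports listed above (illustrative).
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed subset of services, all probed on localhost:
SERVICES = {"postgresql": 5432, "seaweedfs": 8333, "spark-4.1": 7078, "kafka": 9092}

for name, port in SERVICES.items():
    state = "up" if port_open("localhost", port, timeout=0.3) else "down"
    print(f"{name:12} :{port}  {state}")
```

Prefer `./lakehouse preflight` for real checks; this only tells you a port accepts connections, not that the service behind it is healthy.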
- Python: 3.10+, Black (88 chars), Ruff
- PySpark: `from pyspark.sql import functions as f`
- Shell: ShellCheck compliant
Do not change without testing:
- AWS SDK v2: 2.24.6 (exact version required for Hadoop 3.4.1)
- Iceberg: 1.10.0
- Spark: 4.0.1 or 4.1.0 (Scala 2.13)
- Airflow: 3.1.6 (breaking changes from 2.x - see `docs/guides/airflow.md`)
- Poetry: 2.1.0

Java versions (don't change - these are set by official images):
- Spark 4.0 container: Java 17 (from `apache/spark:4.0.1-scala2.13-java17-*`)
- Spark 4.1 container: Java 21 (from `apache/spark:4.1.0-scala2.13-java21-*`)
- Airflow container: Java 17 (sufficient for scheduling; Spark jobs run in Spark containers)
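Because these pins are exact, a small drift check can catch accidental upgrades before they break the stack. The pins below mirror the list above; the checker itself is an illustration, not part of this repo.

```python
# Illustrative guard that flags drift from the pinned versions above.
PINS = {
    "aws-sdk-v2": "2.24.6",
    "iceberg": "1.10.0",
    "airflow": "3.1.6",
    "poetry": "2.1.0",
}


def check_pins(installed: dict[str, str]) -> list[str]:
    """Return a human-readable line for each component that differs from its pin."""
    return [
        f"{name}: pinned {pin}, found {installed[name]}"
        for name, pin in PINS.items()
        if name in installed and installed[name] != pin
    ]


# Example: iceberg matches, airflow has drifted to a 2.x release.
print(check_pins({"iceberg": "1.10.0", "airflow": "2.10.2"}))
```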
NEVER commit credentials. See SECURITY.md for full guidelines.
```bash
# Install pre-commit hooks
pre-commit install

# Run security checks manually
pre-commit run --all-files

# Run security tests
poetry run pytest -m security -v
```

Pre-commit hooks enforce:
- No hardcoded secrets (detect-secrets)
- No private keys
- Python security (bandit)
- Shell security (shellcheck)
GitHub Actions workflow (`.github/workflows/ci.yml`):
| Stage | Description |
|---|---|
| Lint & Validate | Ruff, Black, shell syntax, compose validation |
| Unit Tests | pytest (non-integration) |
| Container Startup | Verify Spark 4.0/4.1 images |
| Integration Tests | PostgreSQL, Kafka connectivity |
| Spark Matrix | Parallel Spark 4.0 + 4.1 tests |
| Airflow Validation | DAG validation and container tests |
| E2E Pipeline | Full stack test (master only) |
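Since `./lakehouse test` is documented to return a meaningful exit code, CI-style stages can gate on it. A minimal Python sketch of that pattern (the wrapper is illustrative; only the exit-code contract comes from this document):

```python
# Gate a pipeline step on a command's exit code (illustrative).
import subprocess
import sys


def gate(cmd: list[str]) -> bool:
    """Run cmd; return True iff it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


# In CI you would call e.g. gate(["./lakehouse", "test"]);
# here a stand-in command demonstrates the exit-code check.
print(gate([sys.executable, "-c", "raise SystemExit(0)"]))  # True
```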
```bash
# Submit Spark job
docker exec spark-master-41 /opt/spark/bin/spark-submit /scripts/quickstarts/01-basics.py

# Download JARs (with retry)
./scripts/tools/download-jars.sh
./scripts/tools/download-jars.sh --verify-only   # Check existing JARs

# Format and lint
poetry run black scripts/ tests/
poetry run ruff check scripts/ tests/ --fix

# Create PR
gh pr create --base develop --title "feat: description"
```

See `docs/troubleshooting.md` for the full guide.
```bash
./lakehouse test                # Test all services
./lakehouse status --json       # Check health
./lakehouse check-config        # Validate credentials
docker logs spark-master-41     # Spark logs
docker logs kafka               # Kafka logs
docker logs airflow-webserver   # Airflow logs
```

Golden rule: Always work from develop. Always test locally before pushing.
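For scripted health checks, the `./lakehouse status --json` output can be consumed from Python. The schema assumed below (a top-level `"services"` map of name to state) is a guess for illustration; inspect the real CLI output before relying on it.

```python
# Parse machine-readable status output (assumed schema, for illustration).
import json


def unhealthy(status_json: str) -> list[str]:
    """Return the names of services whose state is not 'running'."""
    status = json.loads(status_json)
    return [
        name
        for name, state in status.get("services", {}).items()
        if state != "running"
    ]


sample = '{"services": {"spark-master": "running", "kafka": "stopped"}}'
print(unhealthy(sample))  # ['kafka']
```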
```bash
git pull origin develop        # Start here
# ... make changes ...
poetry run pytest tests/ -v    # Test locally
git push origin develop        # Only after tests pass
```

See `docs/DEV_WORKFLOW.md` for full workflow details.
The `.claude/skills/` directory contains detailed reference guides for AI assistants:
| Skill File | Topic |
|---|---|
| `SDP.md` | Spark Declarative Pipelines - complete reference from basics to production |
When to use skills files:
- When writing Spark pipelines, read `.claude/skills/SDP.md` first
- Skills files contain patterns, common errors, and lakehouse-specific examples
- They are tested with AI assistants to ensure clarity and completeness