| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +SQLMesh is a next-generation data transformation framework that enables: |
| 8 | +- Virtual data environments for isolated development without warehouse costs |
| 9 | +- Plan/apply workflow (like Terraform) for safe deployments |
| 10 | +- Multi-dialect SQL support with automatic transpilation |
| 11 | +- Incremental processing to run only necessary transformations |
| 12 | +- Built-in testing and CI/CD integration |
| 13 | + |
| 14 | +**Requirements**: Python >= 3.9 (Note: Python 3.13+ is not yet supported) |
| 15 | + |
| 16 | +## Essential Commands |
| 17 | + |
| 18 | +### Environment Setup |
| 19 | +```bash |
| 20 | +# Create and activate a Python virtual environment (Python >= 3.9, < 3.13) |
| 21 | +python -m venv venv |
| 22 | +source venv/bin/activate # On Windows: venv\Scripts\activate |
| 23 | + |
| 24 | +# Install development dependencies |
| 25 | +make install-dev |
| 26 | + |
| 27 | +# Set up pre-commit hooks (important for code quality) |
| 28 | +make install-pre-commit |
| 29 | +``` |
| 30 | + |
| 31 | +### Common Development Tasks |
| 32 | +```bash |
| 33 | +# Run linters and formatters (ALWAYS run before committing) |
| 34 | +make style |
| 35 | + |
| 36 | +# Fast tests for quick feedback during development |
| 37 | +make fast-test |
| 38 | + |
| 39 | +# Slow tests for comprehensive coverage |
| 40 | +make slow-test |
| 41 | + |
| 42 | +# Run specific test file |
| 43 | +pytest tests/core/test_context.py -v |
| 44 | + |
| 45 | +# Run tests with specific marker |
| 46 | +pytest -m "not slow and not docker" -v |
| 47 | + |
| 48 | +# Build package |
| 49 | +make package |
| 50 | + |
| 51 | +# Serve documentation locally |
| 52 | +make docs-serve |
| 53 | +``` |
| 54 | + |
| 55 | +### Engine-Specific Testing |
| 56 | +```bash |
| 57 | +# DuckDB (default, no setup required) |
| 58 | +make duckdb-test |
| 59 | + |
| 60 | +# Other engines require credentials/Docker |
| 61 | +make snowflake-test # Needs SNOWFLAKE_* env vars |
| 62 | +make bigquery-test # Needs BIGQUERY_KEYFILE or GOOGLE_APPLICATION_CREDENTIALS |
| 63 | +make databricks-test # Needs DATABRICKS_* env vars |
| 64 | +``` |
| 65 | + |
| 66 | +### UI Development |
| 67 | +```bash |
| 68 | +# In web/client directory |
| 69 | +pnpm run dev # Start development server |
| 70 | +pnpm run build # Production build |
| 71 | +pnpm run test # Run tests |
| 72 | + |
| 73 | +# Docker-based UI |
| 74 | +make ui-up # Start UI in Docker |
| 75 | +make ui-down # Stop UI |
| 76 | +``` |
| 77 | + |
| 78 | +## Architecture Overview |
| 79 | + |
| 80 | +### Core Components |
| 81 | + |
| 82 | +**sqlmesh/core/context.py**: The main Context class orchestrates all SQLMesh operations. This is the entry point for understanding how models are loaded, plans are created, and executions happen. |
| 83 | + |
| 84 | +**sqlmesh/core/model/**: Model definitions and kinds (FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, etc.). Each model kind has specific behaviors for how data is processed. |
| 85 | + |
| 86 | +**sqlmesh/core/snapshot/**: The versioning system. Snapshots are immutable versions of models identified by fingerprints. Understanding snapshots is crucial for how SQLMesh tracks changes. |
| 87 | + |
| 88 | +**sqlmesh/core/plan/**: Plan building and evaluation logic. Plans determine what changes need to be applied and in what order. |
| 89 | + |
| 90 | +**sqlmesh/core/engine_adapter/**: Database engine adapters provide a unified interface across 16+ SQL engines. Each adapter handles engine-specific SQL generation and execution. |
| 91 | + |
| 92 | +### Key Concepts |
| 93 | + |
| 94 | +1. **Virtual Environments**: Lightweight branches that share unchanged data between environments, reducing storage costs and deployment time. |
| 95 | + |
| 96 | +2. **Fingerprinting**: Models are versioned using content-based fingerprints. Any change to a model's logic creates a new version. |
| 97 | + |
| 98 | +3. **State Sync**: Manages SQLMesh's metadata, which can be stored in the data warehouse or in an external database. |
| 99 | + |
| 100 | +4. **Intervals**: Time-based partitioning system for incremental models, tracking what data has been processed. |
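To make the fingerprinting concept above concrete, here is a minimal sketch of content-based versioning. It is not SQLMesh's actual fingerprint algorithm: the `fingerprint` helper and its whitespace/case normalization are assumptions chosen purely for illustration, and they use only the standard library's `hashlib`.

```python
import hashlib


def fingerprint(query: str) -> str:
    """Hypothetical helper: hash a query after collapsing whitespace and case.

    Lowercasing is a simplification for this sketch; it would also alter
    string literals, which a real implementation would need to preserve.
    """
    normalized = " ".join(query.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Same logic, different formatting: the fingerprints match.
v1 = fingerprint("SELECT id, amount FROM orders")
v1_reformatted = fingerprint("select id,\n       amount\nfrom orders")

# Changed logic: the fingerprint differs, which would imply a new version.
v2 = fingerprint("SELECT id, amount * 2 AS amount FROM orders")
```

The point of the sketch is only the property described above: a version identifier derived from content changes exactly when the (normalized) content changes.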
| 101 | + |
| 102 | +## Testing Philosophy |
| 103 | + |
| 104 | +- Tests are marked with pytest markers: |
| 105 | +  - **Type markers**: `fast`, `slow`, `docker`, `remote`, `cicdonly`, `isolated`, `registry_isolation` |
| 106 | +  - **Domain markers**: `cli`, `dbt`, `github`, `jupyter`, `web` |
| 107 | +  - **Engine markers**: `engine`, `athena`, `bigquery`, `clickhouse`, `databricks`, `duckdb`, `motherduck`, `mssql`, `mysql`, `postgres`, `redshift`, `snowflake`, `spark`, `trino`, `risingwave` |
| 108 | +- Default to `fast` tests during development |
| 109 | +- Engine tests use real connections when available, mocks otherwise |
| 110 | +- The `sushi` example project is used extensively in tests |
| 111 | +- Use `DuckDBMetadata` helper for validating table metadata in tests |
| 112 | +- Tests run in parallel by default (`pytest -n auto`) |
| 113 | + |
| 114 | +## Code Style Guidelines |
| 115 | + |
| 116 | +- Python: Black formatting, isort for imports, mypy for type checking, Ruff for linting |
| 117 | +- TypeScript/React: ESLint + Prettier configuration |
| 118 | +- SQL: SQLGlot handles parsing/formatting |
| 119 | +- All style checks run via `make style` |
| 120 | +- Pre-commit hooks enforce all style rules automatically |
| 121 | +- Important: Module-level imports of `duckdb`, `numpy`, and `pandas` are banned (enforced by Ruff) to avoid long load times when those libraries are not used |
| 122 | + |
| 123 | +## Important Files |
| 124 | + |
| 125 | +- `sqlmesh/core/context.py`: Main orchestration class |
| 126 | +- `examples/sushi/`: Reference implementation used in tests |
| 127 | +- `web/server/main.py`: Web UI backend entry point |
| 128 | +- `web/client/src/App.tsx`: Web UI frontend entry point |
| 129 | +- `vscode/extension/src/extension.ts`: VSCode extension entry point |
| 130 | + |
| 131 | +## Common Pitfalls |
| 132 | + |
| 133 | +1. **Engine Tests**: Many tests require specific database credentials or Docker. Check test markers before running. |
| 134 | + |
| 135 | +2. **Path Handling**: Be careful with Windows paths - use `pathlib.Path` for cross-platform compatibility. |
| 136 | + |
| 137 | +3. **State Management**: Understanding the state sync mechanism is crucial for debugging environment issues. |
| 138 | + |
| 139 | +4. **Snapshot Versioning**: Changes to model logic create new versions - this is by design for safe deployments. |
| 140 | + |
| 141 | +5. **Module Imports**: Avoid importing duckdb, numpy, or pandas at module level - these are banned by Ruff to prevent long load times in cases where the libraries aren't used. |
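Pitfalls 2 and 5 above can be sketched in a few lines. The helper names (`model_path`, `to_posix`, `row_count`) are hypothetical and do not come from the SQLMesh codebase; they only show `pathlib` in place of hand-built path strings, and a heavy import deferred to the function that needs it.

```python
from pathlib import Path, PureWindowsPath


def model_path(project_root: Path, *parts: str) -> Path:
    """Join path segments without hard-coding a separator."""
    return project_root.joinpath(*parts)


def to_posix(raw: str) -> str:
    """Normalize a Windows-style relative path to forward slashes."""
    return PureWindowsPath(raw).as_posix()


def row_count(records: list) -> int:
    """Count records via pandas, importing it only when this function runs."""
    # Deferred import: nothing is loaded at module import time, so code
    # paths that never call row_count() do not pay for loading pandas.
    import pandas as pd

    return len(pd.DataFrame(records))
```

Because the `pandas` import sits inside `row_count`, importing this module succeeds even in an environment where pandas is not installed; only calling the function requires it.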
| 142 | + |
| 143 | +## GitHub CI/CD Bot Architecture |
| 144 | + |
| 145 | +SQLMesh includes a GitHub CI/CD bot integration that automates data transformation workflows. Most of the implementation lives in `sqlmesh/integrations/github/`, with the CLI entry point in `sqlmesh/cicd/bot.py`; it is organized in layers following the Command Pattern. |
| 146 | + |
| 147 | +### Code Organization |
| 148 | + |
| 149 | +**Core Integration Files:** |
| 150 | +- `sqlmesh/cicd/bot.py`: Main CLI entry point (`sqlmesh_cicd` command) |
| 151 | +- `sqlmesh/integrations/github/cicd/controller.py`: Core bot orchestration logic |
| 152 | +- `sqlmesh/integrations/github/cicd/command.py`: Individual command implementations |
| 153 | +- `sqlmesh/integrations/github/cicd/config.py`: Configuration classes and validation |
| 154 | + |
| 155 | +### Architecture Pattern |
| 156 | + |
| 157 | +The bot follows a **Command Pattern** architecture: |
| 158 | + |
| 159 | +1. **CLI Layer** (`bot.py`): Handles argument parsing and delegates to controllers |
| 160 | +2. **Controller Layer** (`controller.py`): Orchestrates workflow execution and manages state |
| 161 | +3. **Command Layer** (`command.py`): Implements individual operations (test, deploy, plan, etc.) |
| 162 | +4. **Configuration Layer** (`config.py`): Manages bot configuration and validation |
| 163 | + |
| 164 | +### Key Components |
| 165 | + |
| 166 | +**GitHubCICDController**: Main orchestrator that: |
| 167 | +- Manages GitHub API interactions via PyGithub |
| 168 | +- Coordinates workflow execution across different commands |
| 169 | +- Handles error reporting through GitHub Check Runs |
| 170 | +- Manages PR comment interactions and status updates |
| 171 | + |
| 172 | +**Command Implementations**: |
| 173 | +- `run_tests()`: Executes unit tests with detailed reporting |
| 174 | +- `update_pr_environment()`: Creates/updates virtual PR environments |
| 175 | +- `gen_prod_plan()`: Generates production deployment plans |
| 176 | +- `deploy_production()`: Handles production deployments |
| 177 | +- `check_required_approvers()`: Validates approval requirements |
| 178 | + |
| 179 | +**Configuration Management**: |
| 180 | +- Uses Pydantic models for type-safe configuration |
| 181 | +- Supports both YAML config files and environment variables |
| 182 | +- Validates bot settings and user permissions |
| 183 | +- Handles approval workflows and deployment triggers |
| 184 | + |
| 185 | +### Integration with Core SQLMesh |
| 186 | + |
| 187 | +The bot leverages core SQLMesh components: |
| 188 | +- **Context**: Uses SQLMesh Context for project operations |
| 189 | +- **Plan/Apply**: Integrates with SQLMesh's plan generation and application |
| 190 | +- **Virtual Environments**: Creates isolated PR environments using SQLMesh's virtual data environments |
| 191 | +- **State Sync**: Manages metadata synchronization across environments |
| 192 | +- **Testing Framework**: Executes SQLMesh unit tests and reports results |
| 193 | + |
| 194 | +### Error Handling and Reporting |
| 195 | + |
| 196 | +- **GitHub Check Runs**: Creates detailed status reports for each workflow step |
| 197 | +- **PR Comments**: Provides user-friendly feedback on failures and successes |
| 198 | +- **Structured Logging**: Uses SQLMesh's logging framework for debugging |
| 199 | +- **Exception Handling**: Graceful handling of GitHub API failures and SQLMesh errors |
| 200 | + |
| 201 | +## Environment Variables for Engine Testing |
| 202 | + |
| 203 | +When running engine-specific tests, these environment variables are required: |
| 204 | + |
| 205 | +- **Snowflake**: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_WAREHOUSE`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD` |
| 206 | +- **BigQuery**: `BIGQUERY_KEYFILE` or `GOOGLE_APPLICATION_CREDENTIALS` |
| 207 | +- **Databricks**: `DATABRICKS_CATALOG`, `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH`, `DATABRICKS_ACCESS_TOKEN`, `DATABRICKS_CONNECT_VERSION` |
| 208 | +- **Redshift**: `REDSHIFT_HOST`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, `REDSHIFT_DATABASE` |
| 209 | +- **Athena**: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `ATHENA_S3_WAREHOUSE_LOCATION` |
| 210 | +- **ClickHouse Cloud**: `CLICKHOUSE_CLOUD_HOST`, `CLICKHOUSE_CLOUD_USERNAME`, `CLICKHOUSE_CLOUD_PASSWORD` |
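Before starting an engine test run, it can help to check which of the variables listed above are missing. The sketch below is a hypothetical pre-flight helper, not part of the repository: the variable names are copied from the list above, while the `REQUIRED_ENV` mapping and `missing_env` function are assumptions. BigQuery is left out because it accepts either of two variables.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the list above; the dict layout is illustrative.
REQUIRED_ENV = {
    "snowflake": [
        "SNOWFLAKE_ACCOUNT",
        "SNOWFLAKE_WAREHOUSE",
        "SNOWFLAKE_DATABASE",
        "SNOWFLAKE_USER",
        "SNOWFLAKE_PASSWORD",
    ],
    "redshift": [
        "REDSHIFT_HOST",
        "REDSHIFT_USER",
        "REDSHIFT_PASSWORD",
        "REDSHIFT_DATABASE",
    ],
    "athena": [
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "ATHENA_S3_WAREHOUSE_LOCATION",
    ],
}


def missing_env(engine: str, environ: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variables that are unset or empty for an engine."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[engine] if not env.get(name)]
```

For example, `missing_env("redshift")` reads from the real environment, while passing a plain dict as `environ` makes the helper easy to exercise in isolation.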
| 211 | + |
| 212 | +## Migrations System |
| 213 | + |
| 214 | +SQLMesh uses a migration system to evolve its internal state database schema and metadata format. The migrations handle changes to SQLMesh's internal structure, not user data transformations. |
| 215 | + |
| 216 | +### Migration Structure |
| 217 | + |
| 218 | +**Location**: `sqlmesh/migrations/` - Contains 80+ migration files from v0001 to v0083+ |
| 219 | + |
| 220 | +**Naming Convention**: `v{XXXX}_{descriptive_name}.py` (e.g., `v0001_init.py`, `v0083_use_sql_for_scd_time_data_type_data_hash.py`) |
| 221 | + |
| 222 | +**Core Infrastructure**: |
| 223 | +- `sqlmesh/core/state_sync/db/migrator.py`: Main migration orchestrator |
| 224 | +- `sqlmesh/utils/migration.py`: Cross-database compatibility utilities |
| 225 | +- `sqlmesh/core/state_sync/base.py`: Auto-discovery and loading logic |
| 226 | + |
| 227 | +### Migration Categories |
| 228 | + |
| 229 | +**Schema Evolution**: |
| 230 | +- State table creation/modification (snapshots, environments, intervals) |
| 231 | +- Column additions/removals and index management |
| 232 | +- Database engine compatibility fixes (MySQL/MSSQL field size limits) |
| 233 | + |
| 234 | +**Data Format Migrations**: |
| 235 | +- JSON metadata structure updates (snapshot serialization changes) |
| 236 | +- Path normalization (Windows compatibility) |
| 237 | +- Fingerprint recalculation when SQLGlot parsing changes |
| 238 | + |
| 239 | +**Cleanup Operations**: |
| 240 | +- Removing obsolete tables and unused data |
| 241 | +- Metadata optimization and attribute cleanup |
| 242 | + |
| 243 | +### Key Migration Patterns |
| 244 | + |
| 245 | +```python |
| 246 | +# Standard migration function signature |
| 247 | +def migrate(state_sync, **kwargs):  # type: ignore |
| 248 | +    engine_adapter = state_sync.engine_adapter |
| 249 | +    schema = state_sync.schema |
| 250 | +    # Migration logic here |
| 251 | + |
| 252 | +# Common operations |
| 253 | +engine_adapter.create_state_table(table_name, columns_dict) |
| 254 | +engine_adapter.alter_table(alter_expression) |
| 255 | +engine_adapter.drop_table(table_name) |
| 256 | +``` |
| 257 | + |
| 258 | +### State Management Integration |
| 259 | + |
| 260 | +**Core State Tables**: |
| 261 | +- `_snapshots`: Model version metadata (most frequently migrated) |
| 262 | +- `_environments`: Environment definitions |
| 263 | +- `_versions`: Schema/SQLGlot/SQLMesh version tracking |
| 264 | +- `_intervals`: Incremental processing metadata |
| 265 | + |
| 266 | +**Migration Safety**: |
| 267 | +- Automatic backups before migration (unless `skip_backup=True`) |
| 268 | +- Atomic database transactions for consistency |
| 269 | +- Snapshot count validation before/after migrations |
| 270 | +- Automatic rollback on failures |
| 271 | + |
| 272 | +### Migration Execution |
| 273 | + |
| 274 | +**Auto-Discovery**: Migrations are automatically loaded using `pkgutil.iter_modules()` |
| 275 | + |
| 276 | +**Triggers**: Migrations run when: |
| 277 | +- A schema version mismatch is detected |
| 278 | +- A SQLGlot version change requires fingerprint recalculation |
| 279 | +- The `sqlmesh migrate` command is run manually |
| 280 | + |
| 281 | +**Execution Flow**: |
| 282 | +1. Version comparison (local vs remote schema) |
| 283 | +2. Backup creation of state tables |
| 284 | +3. Sequential migration execution (numerical order) |
| 285 | +4. Snapshot fingerprint recalculation if needed |
| 286 | +5. Environment updates with new snapshot references |
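The naming convention and the "numerical order" step above can be illustrated with a small sketch. This is not the logic in `migrator.py`: `parse_version` and `pending_migrations` are hypothetical helpers, and tracking the current state as an integer version is an assumption. In the real package the module names would come from `pkgutil.iter_modules()`; here they are passed in as a plain list.

```python
import re
from typing import Iterable, List

# Matches the documented convention v{XXXX}_{descriptive_name}.
MIGRATION_NAME = re.compile(r"^v(\d{4})_(\w+)$")


def parse_version(module_name: str) -> int:
    """Extract the numeric version from a migration module name."""
    match = MIGRATION_NAME.match(module_name)
    if match is None:
        raise ValueError(f"Not a migration module name: {module_name!r}")
    return int(match.group(1))


def pending_migrations(module_names: Iterable[str], current_version: int) -> List[str]:
    """Return migrations newer than current_version, in numerical order."""
    versioned = sorted((parse_version(name), name) for name in module_names)
    return [name for version, name in versioned if version > current_version]


names = [
    "v0083_use_sql_for_scd_time_data_type_data_hash",
    "v0001_init",
    "v0002_example_change",  # hypothetical name, for illustration only
]
todo = pending_migrations(names, current_version=1)
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if the zero-padding convention were ever relaxed.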
| 287 | + |
| 288 | +## dbt Integration |
| 289 | + |
| 290 | +SQLMesh provides native support for dbt projects, allowing users to run existing dbt projects while gaining access to SQLMesh's advanced features like virtual environments and plan/apply workflows. |
| 291 | + |
| 292 | +### Core dbt Integration |
| 293 | + |
| 294 | +**Location**: `sqlmesh/dbt/` - Complete dbt integration architecture |
| 295 | + |
| 296 | +**Key Components**: |
| 297 | +- `sqlmesh/dbt/loader.py`: Main dbt project loader extending SQLMesh's base loader |
| 298 | +- `sqlmesh/dbt/manifest.py`: dbt manifest parsing and project discovery |
| 299 | +- `sqlmesh/dbt/adapter.py`: dbt adapter system for SQL execution and schema operations |
| 300 | +- `sqlmesh/dbt/model.py`: dbt model configurations and materialization mapping |
| 301 | +- `sqlmesh/dbt/context.py`: dbt project context and environment management |
| 302 | + |
| 303 | +### Project Conversion |
| 304 | + |
| 305 | +**dbt Converter**: `sqlmesh/dbt/converter/` - Tools for migrating dbt projects to SQLMesh |
| 306 | + |
| 307 | +**Key Features**: |
| 308 | +- `convert.py`: Main conversion orchestration |
| 309 | +- `jinja.py` & `jinja_transforms.py`: Jinja template and macro conversion |
| 310 | +- Full support for dbt assets (models, seeds, sources, tests, snapshots, macros) |
| 311 | + |
| 312 | +**CLI Commands**: |
| 313 | +```bash |
| 314 | +# Initialize SQLMesh in existing dbt project |
| 315 | +sqlmesh init -t dbt |
| 316 | + |
| 317 | +# Convert dbt project to SQLMesh format |
| 318 | +sqlmesh dbt convert |
| 319 | +``` |
| 320 | + |
| 321 | +### Supported dbt Features |
| 322 | + |
| 323 | +**Project Structure**: |
| 324 | +- Full dbt project support (models, seeds, sources, tests, snapshots, macros) |
| 325 | +- dbt package dependencies and version management |
| 326 | +- Profile integration using existing `profiles.yml` for connections |
| 327 | + |
| 328 | +**Materializations**: |
| 329 | +- All standard dbt materializations (table, view, incremental, ephemeral) |
| 330 | +- Incremental model strategies (delete+insert, merge, insert_overwrite) |
| 331 | +- SCD Type 2 support and snapshot strategies |
| 332 | + |
| 333 | +**Advanced Features**: |
| 334 | +- Jinja templating with full macro support |
| 335 | +- Runtime variable passing and configuration |
| 336 | +- dbt test integration and execution |
| 337 | +- Cross-database compatibility with SQLMesh's multi-dialect support |
| 338 | + |
| 339 | +### Example Projects |
| 340 | + |
| 341 | +- **sushi_dbt**: `examples/sushi_dbt/` - Complete dbt project running with SQLMesh |
| 342 | +- **Test Fixtures**: `tests/fixtures/dbt/sushi_test/` - Comprehensive test dbt project with all asset types |
| 343 | + |
| 344 | +### Integration Benefits |
| 345 | + |
| 346 | +When using dbt with SQLMesh, you gain: |
| 347 | +- **Virtual Environments**: Isolated development without warehouse costs |
| 348 | +- **Plan/Apply Workflow**: Safe deployments with change previews |
| 349 | +- **Multi-Dialect Support**: Run the same dbt project across different SQL engines |
| 350 | +- **Advanced Testing**: Enhanced testing capabilities beyond standard dbt tests |
| 351 | +- **State Management**: Sophisticated metadata and versioning system |