Commit af2fb95: chore: add CLAUDE.md (#4817)
Parent: 3b4128a

1 file changed: CLAUDE.md (+351 lines, -0 lines)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SQLMesh is a next-generation data transformation framework that enables:

- Virtual data environments for isolated development without warehouse costs
- Plan/apply workflow (like Terraform) for safe deployments
- Multi-dialect SQL support with automatic transpilation
- Incremental processing to run only necessary transformations
- Built-in testing and CI/CD integration

**Requirements**: Python >= 3.9 (Note: Python 3.13+ is not yet supported)
## Essential Commands

### Environment Setup
```bash
# Create and activate a Python virtual environment (Python >= 3.9, < 3.13)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
make install-dev

# Set up pre-commit hooks (important for code quality)
make install-pre-commit
```

### Common Development Tasks
```bash
# Run linters and formatters (ALWAYS run before committing)
make style

# Fast tests for quick feedback during development
make fast-test

# Slow tests for comprehensive coverage
make slow-test

# Run a specific test file
pytest tests/core/test_context.py -v

# Run tests with a specific marker
pytest -m "not slow and not docker" -v

# Build package
make package

# Serve documentation locally
make docs-serve
```

### Engine-Specific Testing
```bash
# DuckDB (default, no setup required)
make duckdb-test

# Other engines require credentials/Docker
make snowflake-test   # Needs SNOWFLAKE_* env vars
make bigquery-test    # Needs GOOGLE_APPLICATION_CREDENTIALS
make databricks-test  # Needs DATABRICKS_* env vars
```

### UI Development
```bash
# In web/client directory
pnpm run dev    # Start development server
pnpm run build  # Production build
pnpm run test   # Run tests

# Docker-based UI
make ui-up    # Start UI in Docker
make ui-down  # Stop UI
```
## Architecture Overview

### Core Components

**sqlmesh/core/context.py**: The main Context class orchestrates all SQLMesh operations. This is the entry point for understanding how models are loaded, plans are created, and executions happen.

**sqlmesh/core/model/**: Model definitions and kinds (FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, etc.). Each model kind has specific behaviors for how data is processed.

**sqlmesh/core/snapshot/**: The versioning system. Snapshots are immutable versions of models identified by fingerprints. Understanding snapshots is crucial for how SQLMesh tracks changes.

**sqlmesh/core/plan/**: Plan building and evaluation logic. Plans determine what changes need to be applied and in what order.

**sqlmesh/core/engine_adapter/**: Database engine adapters provide a unified interface across 16+ SQL engines. Each adapter handles engine-specific SQL generation and execution.
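
As a hedged sketch of how these pieces fit together from the Python API (the sushi example ships in this repo; the exact keyword arguments of `Context` may differ between versions):

```python
from sqlmesh.core.context import Context

# The Context wires together models, snapshots, the state sync, and the
# engine adapter described above.
context = Context(paths="examples/sushi")

# Build a plan against a development environment and apply it; this is the
# plan/apply workflow implemented by the plan/ and snapshot/ packages.
plan = context.plan("dev")
context.apply(plan)
```
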
### Key Concepts

1. **Virtual Environments**: Lightweight branches that share unchanged data between environments, reducing storage costs and deployment time.

2. **Fingerprinting**: Models are versioned using content-based fingerprints. Any change to a model's logic creates a new version.

3. **State Sync**: Manages metadata across different backends (can be stored in the data warehouse or external databases).

4. **Intervals**: Time-based partitioning system for incremental models, tracking what data has been processed.
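
As a purely conceptual illustration of content-based fingerprinting (this is not SQLMesh's actual algorithm, which also accounts for dependencies, macros, and model metadata):

```python
import hashlib


def toy_fingerprint(model_sql: str) -> str:
    # Hash the normalized model definition: any change to the logic yields a
    # new fingerprint and therefore a new snapshot version.
    normalized = " ".join(model_sql.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


v1 = toy_fingerprint("SELECT id, name FROM raw.customers")
v2 = toy_fingerprint("SELECT id, name, created_at FROM raw.customers")
assert v1 != v2  # changed logic -> new version
```
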
## Testing Philosophy

- Tests are marked with pytest markers:
  - **Type markers**: `fast`, `slow`, `docker`, `remote`, `cicdonly`, `isolated`, `registry_isolation`
  - **Domain markers**: `cli`, `dbt`, `github`, `jupyter`, `web`
  - **Engine markers**: `engine`, `athena`, `bigquery`, `clickhouse`, `databricks`, `duckdb`, `motherduck`, `mssql`, `mysql`, `postgres`, `redshift`, `snowflake`, `spark`, `trino`, `risingwave`
- Default to `fast` tests during development
- Engine tests use real connections when available, mocks otherwise
- The `sushi` example project is used extensively in tests
- Use the `DuckDBMetadata` helper for validating table metadata in tests
- Tests run in parallel by default (`pytest -n auto`)
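
As a hedged illustration of marker usage (the test names and bodies are hypothetical; the marker names come from the list above):

```python
import pytest


@pytest.mark.fast
def test_quick_unit_behavior():
    # Picked up during normal development runs (make fast-test).
    assert "duckdb".upper() == "DUCKDB"


@pytest.mark.slow
@pytest.mark.snowflake
def test_snowflake_specific_behavior():
    # Excluded by `pytest -m "not slow and not docker"`; requires SNOWFLAKE_* env vars.
    pytest.skip("illustration only")
```
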
## Code Style Guidelines

- Python: Black formatting, isort for imports, mypy for type checking, Ruff for linting
- TypeScript/React: ESLint + Prettier configuration
- SQL: SQLGlot handles parsing/formatting
- All style checks run via `make style`
- Pre-commit hooks enforce all style rules automatically
- Important: Some modules (duckdb, numpy, pandas) are banned at module level to prevent import-time side effects
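
A minimal sketch of the deferred-import pattern this rule encourages (the function is hypothetical; the point is that `pandas` is imported inside the function rather than at module level):

```python
import typing as t


def to_dataframe(rows: t.List[t.Dict[str, t.Any]]):
    # Importing here means that merely importing this module does not pay the
    # pandas import cost, which is what the module-level ban protects against.
    import pandas as pd

    return pd.DataFrame(rows)
```
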
## Important Files

- `sqlmesh/core/context.py`: Main orchestration class
- `examples/sushi/`: Reference implementation used in tests
- `web/server/main.py`: Web UI backend entry point
- `web/client/src/App.tsx`: Web UI frontend entry point
- `vscode/extension/src/extension.ts`: VSCode extension entry point
## Common Pitfalls

1. **Engine Tests**: Many tests require specific database credentials or Docker. Check test markers before running.

2. **Path Handling**: Be careful with Windows paths - use `pathlib.Path` for cross-platform compatibility (see the sketch after this list).

3. **State Management**: Understanding the state sync mechanism is crucial for debugging environment issues.

4. **Snapshot Versioning**: Changes to model logic create new versions - this is by design for safe deployments.

5. **Module Imports**: Avoid importing duckdb, numpy, or pandas at module level - these are banned by Ruff to prevent long load times in cases where the libraries aren't used.
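
A small sketch of the cross-platform path handling recommended in pitfall 2 (the paths are illustrative):

```python
from pathlib import Path

# Build paths with Path instead of concatenating strings with hard-coded
# separators; the same code works on POSIX and Windows.
project_root = Path("examples") / "sushi"
model_file = project_root / "models" / "customers.sql"

if model_file.exists():
    sql = model_file.read_text(encoding="utf-8")
```
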
## GitHub CI/CD Bot Architecture

SQLMesh includes a GitHub CI/CD bot integration that automates data transformation workflows. The implementation is located in `sqlmesh/integrations/github/` and follows a layered, command-based design.

### Code Organization

**Core Integration Files:**
- `sqlmesh/cicd/bot.py`: Main CLI entry point (`sqlmesh_cicd` command)
- `sqlmesh/integrations/github/cicd/controller.py`: Core bot orchestration logic
- `sqlmesh/integrations/github/cicd/command.py`: Individual command implementations
- `sqlmesh/integrations/github/cicd/config.py`: Configuration classes and validation

### Architecture Pattern

The bot follows a **Command Pattern** architecture:

1. **CLI Layer** (`bot.py`): Handles argument parsing and delegates to controllers
2. **Controller Layer** (`controller.py`): Orchestrates workflow execution and manages state
3. **Command Layer** (`command.py`): Implements individual operations (test, deploy, plan, etc.)
4. **Configuration Layer** (`config.py`): Manages bot configuration and validation

### Key Components

**GitHubCICDController**: Main orchestrator that:
- Manages GitHub API interactions via PyGithub
- Coordinates workflow execution across different commands
- Handles error reporting through GitHub Check Runs
- Manages PR comment interactions and status updates

**Command Implementations**:
- `run_tests()`: Executes unit tests with detailed reporting
- `update_pr_environment()`: Creates/updates virtual PR environments
- `gen_prod_plan()`: Generates production deployment plans
- `deploy_production()`: Handles production deployments
- `check_required_approvers()`: Validates approval requirements

**Configuration Management**:
- Uses Pydantic models for type-safe configuration
- Supports both YAML config files and environment variables
- Validates bot settings and user permissions
- Handles approval workflows and deployment triggers
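
A hedged sketch of what enabling the bot from a project's `config.py` might look like (the class names come from `sqlmesh/integrations/github/cicd/config.py`, but the specific fields shown are illustrative and may differ between versions):

```python
from sqlmesh.core.config import Config
from sqlmesh.integrations.github.cicd.config import GithubCICDBotConfig, MergeMethod

config = Config(
    cicd_bot=GithubCICDBotConfig(
        # The config is a Pydantic model, so unknown fields or wrong types
        # fail validation as soon as the project loads.
        invalidate_environment_after_deploy=True,
        merge_method=MergeMethod.SQUASH,
    ),
)
```
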
### Integration with Core SQLMesh

The bot leverages core SQLMesh components:
- **Context**: Uses SQLMesh Context for project operations
- **Plan/Apply**: Integrates with SQLMesh's plan generation and application
- **Virtual Environments**: Creates isolated PR environments using SQLMesh's virtual data environments
- **State Sync**: Manages metadata synchronization across environments
- **Testing Framework**: Executes SQLMesh unit tests and reports results

### Error Handling and Reporting

- **GitHub Check Runs**: Creates detailed status reports for each workflow step
- **PR Comments**: Provides user-friendly feedback on failures and successes
- **Structured Logging**: Uses SQLMesh's logging framework for debugging
- **Exception Handling**: Graceful handling of GitHub API failures and SQLMesh errors
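
As a hedged sketch of the Check Runs mechanism using PyGithub directly (not the controller's actual code; the token, repository, SHA, and check-run name are placeholders):

```python
from github import Github

gh = Github("<token>")
repo = gh.get_repo("TobikoData/sqlmesh")

# The controller reports each workflow step as a check run on the PR's head SHA.
repo.create_check_run(
    name="SQLMesh - Run Unit Tests",
    head_sha="<pr-head-sha>",
    status="completed",
    conclusion="success",
    output={"title": "Tests passed", "summary": "All unit tests succeeded."},
)
```
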
## Environment Variables for Engine Testing

When running engine-specific tests, these environment variables are required:

- **Snowflake**: `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_WAREHOUSE`, `SNOWFLAKE_DATABASE`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`
- **BigQuery**: `BIGQUERY_KEYFILE` or `GOOGLE_APPLICATION_CREDENTIALS`
- **Databricks**: `DATABRICKS_CATALOG`, `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH`, `DATABRICKS_ACCESS_TOKEN`, `DATABRICKS_CONNECT_VERSION`
- **Redshift**: `REDSHIFT_HOST`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, `REDSHIFT_DATABASE`
- **Athena**: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `ATHENA_S3_WAREHOUSE_LOCATION`
- **ClickHouse Cloud**: `CLICKHOUSE_CLOUD_HOST`, `CLICKHOUSE_CLOUD_USERNAME`, `CLICKHOUSE_CLOUD_PASSWORD`
## Migrations System

SQLMesh uses a migration system to evolve its internal state database schema and metadata format. The migrations handle changes to SQLMesh's internal structure, not user data transformations.

### Migration Structure

**Location**: `sqlmesh/migrations/` - Contains 80+ migration files from v0001 to v0083+

**Naming Convention**: `v{XXXX}_{descriptive_name}.py` (e.g., `v0001_init.py`, `v0083_use_sql_for_scd_time_data_type_data_hash.py`)

**Core Infrastructure**:
- `sqlmesh/core/state_sync/db/migrator.py`: Main migration orchestrator
- `sqlmesh/utils/migration.py`: Cross-database compatibility utilities
- `sqlmesh/core/state_sync/base.py`: Auto-discovery and loading logic

### Migration Categories

**Schema Evolution**:
- State table creation/modification (snapshots, environments, intervals)
- Column additions/removals and index management
- Database engine compatibility fixes (MySQL/MSSQL field size limits)

**Data Format Migrations**:
- JSON metadata structure updates (snapshot serialization changes)
- Path normalization (Windows compatibility)
- Fingerprint recalculation when SQLGlot parsing changes

**Cleanup Operations**:
- Removing obsolete tables and unused data
- Metadata optimization and attribute cleanup
### Key Migration Patterns

```python
# Standard migration function signature
def migrate(state_sync, **kwargs):  # type: ignore
    engine_adapter = state_sync.engine_adapter
    schema = state_sync.schema
    # Migration logic here

# Common operations
engine_adapter.create_state_table(table_name, columns_dict)
engine_adapter.alter_table(alter_expression)
engine_adapter.drop_table(table_name)
```
### State Management Integration

**Core State Tables**:
- `_snapshots`: Model version metadata (most frequently migrated)
- `_environments`: Environment definitions
- `_versions`: Schema/SQLGlot/SQLMesh version tracking
- `_intervals`: Incremental processing metadata

**Migration Safety**:
- Automatic backups before migration (unless `skip_backup=True`)
- Atomic database transactions for consistency
- Snapshot count validation before/after migrations
- Automatic rollback on failures
### Migration Execution

**Auto-Discovery**: Migrations are automatically loaded using `pkgutil.iter_modules()`

**Triggers**: Migrations run when:
- A schema version mismatch is detected
- SQLGlot version changes require fingerprint recalculation
- The `sqlmesh migrate` command is run manually

**Execution Flow**:
1. Version comparison (local vs remote schema)
2. Backup creation of state tables
3. Sequential migration execution (numerical order)
4. Snapshot fingerprint recalculation if needed
5. Environment updates with new snapshot references
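
A hedged sketch of the auto-discovery idea (this mirrors the description above rather than the migrator's exact code, and assumes `sqlmesh.migrations` is importable as a package; the ordering works because of the zero-padded version prefix in the file names):

```python
import pkgutil

from sqlmesh import migrations

# Discover the migration modules shipped in sqlmesh/migrations and order them
# numerically (zero-padding makes lexicographic order equal numeric order).
migration_names = sorted(
    module.name
    for module in pkgutil.iter_modules(migrations.__path__)
    if module.name.startswith("v")
)

for name in migration_names:
    print(name)  # v0001_init ... v0083_use_sql_for_scd_time_data_type_data_hash
```
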
## dbt Integration

SQLMesh provides native support for dbt projects, allowing users to run existing dbt projects while gaining access to SQLMesh's advanced features like virtual environments and plan/apply workflows.

### Core dbt Integration

**Location**: `sqlmesh/dbt/` - Complete dbt integration architecture

**Key Components**:
- `sqlmesh/dbt/loader.py`: Main dbt project loader extending SQLMesh's base loader
- `sqlmesh/dbt/manifest.py`: dbt manifest parsing and project discovery
- `sqlmesh/dbt/adapter.py`: dbt adapter system for SQL execution and schema operations
- `sqlmesh/dbt/model.py`: dbt model configurations and materialization mapping
- `sqlmesh/dbt/context.py`: dbt project context and environment management
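
As a hedged sketch of pointing SQLMesh at an existing dbt project (this mirrors what `sqlmesh init -t dbt` generates; `sqlmesh_config` is assumed to live in `sqlmesh/dbt/loader.py`, and its exact signature may differ between versions):

```python
from pathlib import Path

from sqlmesh.dbt.loader import sqlmesh_config

# config.py placed at the root of the dbt project: connections come from the
# existing profiles.yml, while SQLMesh adds plan/apply and virtual environments.
config = sqlmesh_config(Path(__file__).parent)
```
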
### Project Conversion

**dbt Converter**: `sqlmesh/dbt/converter/` - Tools for migrating dbt projects to SQLMesh

**Key Features**:
- `convert.py`: Main conversion orchestration
- `jinja.py` & `jinja_transforms.py`: Jinja template and macro conversion
- Full support for dbt assets (models, seeds, sources, tests, snapshots, macros)

**CLI Commands**:
```bash
# Initialize SQLMesh in an existing dbt project
sqlmesh init -t dbt

# Convert a dbt project to SQLMesh format
sqlmesh dbt convert
```

### Supported dbt Features

**Project Structure**:
- Full dbt project support (models, seeds, sources, tests, snapshots, macros)
- dbt package dependencies and version management
- Profile integration using existing `profiles.yml` for connections

**Materializations**:
- All standard dbt materializations (table, view, incremental, ephemeral)
- Incremental model strategies (delete+insert, merge, insert_overwrite)
- SCD Type 2 support and snapshot strategies

**Advanced Features**:
- Jinja templating with full macro support
- Runtime variable passing and configuration
- dbt test integration and execution
- Cross-database compatibility with SQLMesh's multi-dialect support

### Example Projects

**sushi_dbt**: `examples/sushi_dbt/` - Complete dbt project running with SQLMesh

**Test Fixtures**: `tests/fixtures/dbt/sushi_test/` - Comprehensive test dbt project with all asset types

### Integration Benefits

When using dbt with SQLMesh, you gain:

- **Virtual Environments**: Isolated development without warehouse costs
- **Plan/Apply Workflow**: Safe deployments with change previews
- **Multi-Dialect Support**: Run the same dbt project across different SQL engines
- **Advanced Testing**: Enhanced testing capabilities beyond standard dbt tests
- **State Management**: Sophisticated metadata and versioning system
