# DLO

Modern, AI-consumable, context-optimized ELT orchestration.
DLO is a declarative data orchestration platform that compiles SQL-based transformations into executable pipelines with automatic dependency resolution, scheduling, and comprehensive lineage tracking. Designed for data engineers who want the simplicity of dbt-style workflows with the power of programmatic orchestration and AI integration.
## Features

- **Declarative Configuration** - Define data models, sources, and transformations in YAML + SQL
- **Automatic Dependency Resolution** - SQL parsing extracts table references to build execution DAGs
- **Multiple Model Types** - Materialized tables, views, and ephemeral CTEs
- **Smart Scheduling** - Cron-based scheduling with dependency-aware execution
- **Column-Level Metadata** - Rich metadata including descriptions, tags, profiling metrics, and sample data
- **Semantic Layer** - Define reusable metrics and relationships
- **Interactive Web UI** - Browse models, explore lineage graphs, and search resources
- **AI Integration** - Model Context Protocol (MCP) server for AI assistant integration
- **Platform Adapters** - Extensible architecture with Databricks support (more coming soon)
- **Graph Visualization** - Automatic DAG generation with matplotlib
## Lineage Tracking

DLO automatically tracks and visualizes data lineage across your entire pipeline.

*Example: Healthcare data pipeline showing source tables (encounters, allergies, supplies, patients) flowing through views into materialized tables, with automatic dependency tracking and relationship visualization.*
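To make the dependency-resolution idea concrete, here is a toy sketch of extracting table references from a model's SQL. DLO actually uses sqlglot for this; the regex below is illustrative only and misses subqueries, quoting, and CTEs:

```python
import re

def extract_refs(sql: str) -> set[str]:
    # Naive: grab identifiers that follow FROM or JOIN.
    # A real SQL parser (like sqlglot, which DLO uses) handles
    # subqueries, quoting, CTEs, and aliases correctly.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, re.IGNORECASE))

sql = """
SELECT p.patient_id, COUNT(e.encounter_id) AS total_encounters
FROM patients p
LEFT JOIN patient_encounters_view e ON p.patient_id = e.patient_id
GROUP BY p.patient_id
"""
print(sorted(extract_refs(sql)))  # ['patient_encounters_view', 'patients']
```

These extracted references become edges in the execution DAG, which determines run order and lineage.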
## Installation

Prerequisites:

- Python 3.10 or higher
- A supported data platform (currently Databricks)
```bash
# Clone the repository
git clone https://github.com/yourusername/dlo.git
cd dlo

# Install using uv (recommended)
pip install uv
uv sync

# Or install with pip in development mode
pip install -e .
```

Install additional features based on your needs:
```bash
# Databricks adapter
uv sync --group databricks
```

## Quick Start

Create a new directory for your DLO project:
```bash
mkdir my_data_project
cd my_data_project
```

`config.yaml` - Project configuration:
```yaml
name: healthcare_analytics
version: 1.0.0
profile: databricks_prod
```

`profile.yaml` - Connection profile (or place in `~/.config/dlo/profile.yaml`):
```yaml
databricks_prod:
  type: databricks
  host: https://your-workspace.cloud.databricks.com
  http_path: /sql/1.0/warehouses/your-warehouse-id
  catalog: main
  schema: healthcare

  # Authentication (choose one)
  token: your-personal-access-token
  # OR
  client_id: your-client-id
  client_secret: your-client-secret
```

`sources/encounters.yaml`:
```yaml
name: encounters
description: Patient encounters from EHR system
materialization: source
target:
  catalog: raw
  schema: healthcare
  table: encounters
columns:
  - name: encounter_id
    type: string
    description: Unique encounter identifier
    primary_key: true
  - name: patient_id
    type: string
    description: Patient identifier
  - name: encounter_date
    type: timestamp
    description: Date and time of encounter
  - name: encounter_type
    type: string
    description: Type of encounter (inpatient, outpatient, etc.)
  - name: provider_id
    type: string
    description: Attending provider identifier
```

`sources/patients.yaml`:
```yaml
name: patients
description: Patient demographic information
materialization: source
target:
  catalog: raw
  schema: healthcare
  table: patients
columns:
  - name: patient_id
    type: string
    description: Unique patient identifier
    primary_key: true
  - name: first_name
    type: string
    description: Patient first name
  - name: last_name
    type: string
    description: Patient last name
  - name: date_of_birth
    type: date
    description: Patient date of birth
  - name: status
    type: string
    description: Patient status (active, inactive, deceased)
```

`sql/patient_encounters_view.sql`:
```sql
SELECT
    e.encounter_id,
    e.patient_id,
    e.encounter_date,
    e.encounter_type,
    e.provider_id,
    p.first_name,
    p.last_name,
    p.date_of_birth
FROM encounters e
JOIN patients p ON e.patient_id = p.patient_id
WHERE e.encounter_date >= CURRENT_DATE - INTERVAL 2 YEAR
```

`sql/patient_360.sql`:
```sql
SELECT
    p.patient_id,
    p.first_name,
    p.last_name,
    p.date_of_birth,
    COUNT(DISTINCT e.encounter_id) AS total_encounters,
    MAX(e.encounter_date) AS last_encounter_date,
    COUNT(DISTINCT a.allergy_id) AS total_allergies
FROM patients p
LEFT JOIN patient_encounters_view e ON p.patient_id = e.patient_id
LEFT JOIN allergies a ON p.patient_id = a.patient_id
GROUP BY p.patient_id, p.first_name, p.last_name, p.date_of_birth
```

`models/patient_encounters_view.yaml`:
```yaml
name: patient_encounters_view
description: Denormalized view of patient encounters with demographics
materialization: view
target:
  catalog: analytics
  schema: healthcare
  table: patient_encounters_view
sql_path: sql/patient_encounters_view.sql
columns:
  - name: encounter_id
    type: string
    description: Unique encounter identifier
  - name: patient_id
    type: string
    description: Patient identifier
  - name: encounter_date
    type: timestamp
    description: Encounter date and time
```

`models/patient_360.yaml`:
```yaml
name: patient_360
description: Comprehensive patient profile with aggregated metrics
materialization: materialized
target:
  catalog: analytics
  schema: healthcare
  table: patient_360
sql_path: sql/patient_360.sql
schedule: "0 2 * * *"  # Daily at 2 AM
columns:
  - name: patient_id
    type: string
    description: Unique patient identifier
    primary_key: true
  - name: total_encounters
    type: integer
    description: Total number of patient encounters
    profiling:
      min: 0
      max: 150
      mean: 8.5
```

### Compile and Run

```bash
# Compile the project - generates manifest and dependency graph
dlo compile

# Run all models in dependency order
dlo run

# Run specific models
dlo run --select patient_360

# Schedule models as jobs on your data platform
dlo schedule
```

### Explore in the Web UI

```bash
# Start the web server
dlo serve

# Or start in development mode with hot reload
dlo serve --dev

# Access at http://localhost:6364
```

## Architecture

DLO consists of several integrated components:
- **CLI** - Command-line interface for all operations (`dlo compile`, `dlo run`, etc.)
- **Compiler** - Parses YAML/SQL files, builds dependency graphs, generates compiled SQL
- **Parser** - Validates configurations and extracts SQL table dependencies using sqlglot
- **Runtime** - Orchestrates compilation, execution, and scheduling
- **Graph Builder** - Creates NetworkX DAGs and generates visualizations
- **Adapters** - Platform-specific implementations (Databricks, with more coming)
- **API Server** - FastAPI backend serving manifest data
- **Web UI** - React + TypeScript frontend for exploration and visualization
- **MCP Server** - Model Context Protocol integration for AI assistants
- **Vector Store** - Optional embeddings for semantic search
### Core Concepts

- **Manifest** - Central registry of all project resources
- **Models** - Data transformations (materialized, view, ephemeral)
- **Sources** - External tables and datasets
- **Metrics** - Reusable business calculations
- **Relationships** - Documented connections between models
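For intuition, the dependency-ordered execution that `dlo run` performs can be sketched with Python's standard-library `graphlib` (DLO builds NetworkX DAGs internally; the model names below come from the example project):

```python
from graphlib import TopologicalSorter

# Map each model to its upstream dependencies
# (edges extracted from the compiled SQL).
deps = {
    "patient_encounters_view": {"encounters", "patients"},
    "patient_360": {"patients", "patient_encounters_view", "allergies"},
}

# static_order() yields sources first, then each model only after
# all of its dependencies have been emitted.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any model selected with `--select` simply restricts execution to that node and, where required, its upstream subgraph.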
## CLI Commands

### dlo compile

```bash
# Compile and generate manifest
dlo compile

# Compile with specific project/profile
dlo compile --project-dir ./my_project --profile prod
```

### dlo run

```bash
# Run all models
dlo run

# Run specific models
dlo run --select model_name

# Run with specific adapter profile
dlo run --profile databricks_prod
```

### dlo schedule

```bash
# Create scheduled jobs for all models with schedules
dlo schedule

# Schedule specific models
dlo schedule --select patient_360
```

### dlo serve

```bash
# Production mode (serves built React app)
dlo serve

# Development mode (Vite dev server with HMR)
dlo serve --dev

# Custom host and port
dlo serve --host 0.0.0.0 --port 8080

# With multiple workers
dlo serve --workers 4
```

### dlo mcp

```bash
# Start MCP server for AI integration
dlo mcp
```

### dlo query

```bash
# Run ad-hoc SQL query
dlo query "SELECT * FROM patients LIMIT 10"
```

### dlo version

```bash
# Show DLO version
dlo version
```

## Model Materialization

DLO supports three model materialization types:
### Materialized

Physical tables created in your data warehouse. Best for frequently accessed datasets.

```yaml
name: patient_360
materialization: materialized
target:
  catalog: analytics
  schema: healthcare
  table: patient_360
```

### View

Virtual tables that execute the query on each access. Best for simple transformations.
```yaml
name: patient_encounters_view
materialization: view
target:
  catalog: analytics
  schema: healthcare
  table: patient_encounters_view
```

### Ephemeral

Never materialized - compiled as CTEs in downstream queries. Best for intermediate transformations.
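As a rough illustration (not DLO's actual compiler output), inlining an ephemeral model into a downstream query amounts to wrapping its SQL in a `WITH` clause:

```python
def inline_ephemeral(name: str, model_sql: str, downstream_sql: str) -> str:
    # Prepend the ephemeral model's SQL as a named CTE so the
    # downstream query can reference it without a physical table.
    return f"WITH {name} AS (\n  {model_sql}\n)\n{downstream_sql}"

compiled = inline_ephemeral(
    "active_patients",
    "SELECT * FROM patients WHERE status = 'active'",
    "SELECT COUNT(*) AS n FROM active_patients",
)
print(compiled)
```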
```yaml
name: active_patients
materialization: ephemeral
```

## Scheduling

Schedule models using cron expressions:
```yaml
schedule: "0 2 * * *"  # Daily at 2 AM
```

Common patterns:

- `0 * * * *` - Every hour
- `0 0 * * *` - Daily at midnight
- `0 0 * * 0` - Weekly on Sunday
- `0 0 1 * *` - Monthly on the 1st
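To read these expressions: each of the five fields (minute, hour, day-of-month, month, day-of-week) either matches any value (`*`) or a specific one. A minimal matcher, supporting only that toy subset (real cron also allows ranges, steps, and lists):

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    # Fields: minute, hour, day-of-month, month, day-of-week (0 = Sunday).
    # Only '*' and plain integers are supported in this sketch.
    fields = expr.split()
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

print(cron_matches("0 2 * * *", datetime(2024, 5, 1, 2, 0)))  # True  (02:00 daily)
print(cron_matches("0 0 * * 0", datetime(2024, 5, 5, 0, 0)))  # True  (Sunday midnight)
print(cron_matches("0 2 * * *", datetime(2024, 5, 1, 3, 0)))  # False (wrong hour)
```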
## Relationships

Document connections between models:

`relationships/patient_encounters.yaml`:

```yaml
name: patient_to_encounters
description: One patient can have many encounters
from_model: patients
to_model: patient_encounters_view
relationship_type: one_to_many
join_columns:
  - from: patient_id
    to: patient_id
```

## Metrics

Define reusable business metrics:
`metrics/patient_metrics.yaml`:

```yaml
name: active_patient_count
description: Count of active patients
model: patients
type: count
filters:
  - status = 'active'
dimensions:
  - status
  - date_of_birth
```

## Project Structure

A typical DLO project follows this structure:
```text
my_project/
├── config.yaml                        # Project configuration
├── profile.yaml                       # Connection profiles (optional)
│
├── sources/                           # Source definitions
│   ├── encounters.yaml
│   ├── allergies.yaml
│   ├── patients.yaml
│   └── supplies.yaml
│
├── models/                            # Model definitions
│   ├── patient_encounters_view.yaml
│   ├── patient_allergies_view.yaml
│   ├── patient_360.yaml
│   └── patient_encounter_cleaned.yaml
│
├── sql/                               # SQL transformation queries
│   ├── patient_encounters_view.sql
│   ├── patient_allergies_view.sql
│   ├── patient_360.sql
│   └── patient_encounter_cleaned.sql
│
├── relationships/                     # Relationship definitions
│   └── patient_relationships.yaml
│
├── metrics/                           # Metric definitions
│   └── patient_metrics.yaml
│
└── target/                            # Generated files (gitignored)
    ├── manifest.json                  # Compiled manifest
    ├── compiled/                      # Compiled SQL with CTEs
    └── graphs/                        # DAG visualizations
```
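Because `target/manifest.json` is plain JSON, it can be consumed programmatically, for example by AI tooling or CI checks. A sketch (the manifest schema shown here is illustrative only, not DLO's actual format):

```python
import json

# Hypothetical manifest fragment - DLO's real schema may differ.
manifest = {
    "models": [
        {"name": "patient_encounters_view", "materialization": "view"},
        {"name": "patient_360", "materialization": "materialized"},
    ],
    "sources": [{"name": "patients"}, {"name": "encounters"}],
}

text = json.dumps(manifest, indent=2)          # what a manifest file might contain
loaded = json.loads(text)                      # what a consumer would read back
materialized = [m["name"] for m in loaded["models"]
                if m["materialization"] == "materialized"]
print(materialized)  # ['patient_360']
```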
## Web UI

The DLO web interface provides an intuitive way to explore your data pipeline:

- **Dashboard** - Overview of models, sources, metrics, and relationships
- **Catalog Browser** - Navigate all resources with search and filtering
- **Lineage Graph** - Interactive visualization of data flow and dependencies
- **Detail Views** - Comprehensive information for each resource
- **Column Metadata** - View column descriptions, types, profiling stats, and sample data
- **Search** - Full-text search across all resources
- **Dark/Light Mode** - Comfortable viewing in any environment
Start the server and navigate to http://localhost:6364:

```bash
dlo serve
```

For development with hot module replacement:

```bash
dlo serve --dev
```

## Platform Adapters

### Databricks

Full support for Databricks with Unity Catalog:
Features:
- SQL Warehouse execution
- Unity Catalog (catalog.schema.table)
- Jobs API for scheduling
- OAuth2 and PAT authentication
- Task dependency management
- Workspace file operations
Configuration:
```yaml
databricks_prod:
  type: databricks
  host: https://your-workspace.cloud.databricks.com
  http_path: /sql/1.0/warehouses/warehouse-id
  catalog: main
  schema: analytics

  # Authentication
  token: your-pat-token
  # OR
  client_id: your-client-id
  client_secret: your-client-secret

  # Optional: Runtime configuration
  runtime:
    workspace_path: /Workspace/dlo
    job_cluster_key: dlo-cluster
```

## Development

```bash
# Clone and install
git clone https://github.com/yourusername/dlo.git
cd dlo

# Install with all development dependencies
uv sync --all-groups

# Install git hooks
lefthook install
```

Repository layout:

- `src/dlo/` - Main source code
  - `core/` - CLI, compiler, parser, models
  - `adapters/` - Platform-specific implementations
  - `api/` - FastAPI server
  - `ui/` - React frontend
  - `mcp/` - MCP server
  - `vector_store/` - Embeddings support
  - `common/` - Shared utilities
- `tests/` - Test suite
  - `unit/` - Unit tests
  - `integration/` - Integration tests (if any)
- `docs/` - Documentation
## License

DLO is licensed under the Apache License 2.0.

