Commit da9941f

Merge pull request #34 from mingjerli/feat/readme-and-subpipeline
feat: README revamp and add build_subpipeline() method
2 parents 84a01ee + e189bc3 commit da9941f

3 files changed

Lines changed: 70 additions & 48 deletions

File tree

README.md

Lines changed: 43 additions & 46 deletions
@@ -1,6 +1,6 @@
 # clgraph
 
-A Python library for SQL column lineage analysis. No database required. No infrastructure to maintain. Just Python.
+A Python library that turns SQL queries into lineage graphs. No database required. No infrastructure to maintain. Just your queries and Python.
 
 ![clgraph illustration](./clgraph-illustration.svg)
 
@@ -11,46 +11,42 @@ A Python library for SQL column lineage analysis. No database required. No infra
 - **Auto-propagate Metadata** — PII flags, ownership, and descriptions flow automatically through lineage
 - **Context for AI Agents** — Provide LLMs with structured lineage data for smarter data assistance
 - **CI/CD Change Detection** — Detect lineage changes between pipeline versions for automated testing
+- **Automatic DAG Construction** — Execute pipelines in Python (async or sequential) with topological ordering, or generate Airflow DAGs
 
 ## Why We Built This
 
-Column lineage is notoriously difficult. Traditional tools reverse-engineer lineage from query logs and execution metadata, requiring expensive platform integration and complex infrastructure. Most open-source alternatives focus only on table-level lineage or single-query column analysis.
+**Your SQL already contains everything.** Tables, columns, transformations, joins—it's all there in your code.
 
-**Our insight**: When SQL is written with explicit column names and clear transformations (what we call "[lineage-friendly SQL](https://clgraph.dev/blog/writing-lineage-friendly-sql/)"), static analysis can provide *perfect* column lineage—without database access, without runtime integration, and without query logs.
+Traditional tools reverse-engineer lineage from query logs and database metadata, requiring expensive infrastructure. But when SQL is written with explicit column names and clear transformations (what we call "[lineage-friendly SQL](https://clgraph.dev/blog/writing-lineage-friendly-sql/)"), static analysis can build a *complete* lineage graph—without database access, without runtime integration, and without query logs.
 
-We built clgraph to prove this approach works. By combining lineage-friendly SQL with perfect static analysis, we solve 90% of column lineage needs with 10% of the complexity of enterprise tools. No database required. No infrastructure to maintain. Just pure Python analyzing your SQL files.
+**We parse it once. You get the complete graph.** It's a Python object you can traverse, query, and integrate however you want — powering tracing, impact analysis, metadata propagation, DAG construction, and more.
 
 **Read more**:
 - [Why We Built This (Full Story)](https://clgraph.dev/concepts/why-we-built-this/)
 - [How to Write Lineage-Friendly SQL](https://clgraph.dev/blog/writing-lineage-friendly-sql/)
 
 ## Features
 
-### Column Lineage Analysis
-- **Perfect column lineage** for any single SQL query, no matter how complex
-- **Recursive query parsing** - handles arbitrary nesting of CTEs and subqueries
-- **Bottom-up lineage building** - dependency-ordered processing
-- **Star notation preservation** - no forced expansion, with EXCEPT/REPLACE support
-- **Forward and backward lineage tracing** - impact analysis and source tracing
-
-### Multi-Query Pipeline Analysis
-- **Cross-query lineage** - trace columns through multiple dependent queries
-- **Table dependency graphs** - understand pipeline structure
-- **Template variable support** - handle parameterized SQL with {{variable}} syntax
-- **Pipeline-level impact analysis** - see how changes propagate through your data pipeline
-
-### Metadata Management
-- **Column metadata** - track descriptions, ownership, PII flags, and custom tags
-- **Metadata propagation** - automatically inherit metadata through lineage
-- **Inline comment parsing** - extract metadata from SQL comments (`-- description [pii: true]`)
-- **LLM integration** - generate natural language descriptions using Ollama, OpenAI, etc.
-- **Diff tracking** - detect changes between pipeline versions
-
-### Export Functionality
-- **JSON export** - machine-readable format for system integration
-- **JSON round-trip** - save pipelines to JSON and reload them with `Pipeline.from_json()`
-- **CSV export** - column and table metadata for spreadsheets
-- **GraphViz export** - DOT format for visualization tools
+### Lineage Tracing
+- **Trace column origins** — Find where any column comes from, through complex CTEs and subqueries
+- **Impact analysis** — See what downstream columns are affected by changes
+- **Cross-query lineage** — Track columns through entire pipelines, not just single queries
+
+### Metadata & Governance
+- **Auto-propagate metadata** — PII flags, ownership, and descriptions flow through lineage
+- **Inline comment parsing** — Extract metadata from SQL comments (`-- description [pii: true]`)
+- **LLM descriptions** — Generate natural language column descriptions with OpenAI, Ollama, etc.
+- **Diff tracking** — Detect lineage changes between pipeline versions
+
+### Pipeline Execution
+- **Run pipelines** — Execute queries in dependency order (async or sequential)
+- **Airflow integration** — Generate Airflow DAGs from your pipeline
+- **Template variables** — Handle parameterized SQL with `{{variable}}` syntax
+
+### Export
+- **JSON** — Machine-readable format with round-trip support
+- **CSV** — Column and table metadata for spreadsheets
+- **GraphViz** — DOT format for visualization
 
 ## Installation
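
Reviewer note on the "dependency order" and "topological ordering" bullets added in this hunk: the idea can be pictured with a minimal sketch that uses only Python's standard-library `graphlib`. This is not clgraph's implementation. The `DEPENDENCIES` mapping and the `execution_order` helper are hypothetical names, and the edges between tables are assumed for illustration (only the table names echo the README output).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical table-level dependencies: each table maps to the tables it reads.
# Table names echo the README output; the edges themselves are assumed.
DEPENDENCIES = {
    "raw_events": {"source_events"},
    "daily_active_users": {"raw_events"},
    "user_summary": {"daily_active_users", "users"},
}


def execution_order(dependencies):
    """Return every table so that each one appears after the tables it reads."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, table in enumerate(execution_order(DEPENDENCIES), start=1):
        print(f"{position}. {table}")
```

Tables that appear only as dependencies (here `source_events` and `users`) are still included in the ordering. Independent tables may come out in either relative order, which is what leaves room for running them concurrently.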

@@ -192,11 +188,11 @@ for impact in impacts:
 **Output:**
 ```
 Pipeline(
-raw_events: CREATE TABLE raw_events AS SELECT user_id, eve...
+raw_events: CREATE TABLE raw_events AS SELECT user_id, event_...
 main
-daily_active_users: CREATE TABLE daily_active_users AS SELECT use...
+daily_active_users: CREATE TABLE daily_active_users AS SELECT user_id...
 main
-user_summary: CREATE TABLE user_summary AS SELECT u.name, u.em...
+user_summary: CREATE TABLE user_summary AS SELECT u.name, u.ema...
 main
 )
 ------------------------------------------------------------
@@ -207,11 +203,15 @@ Execution order (5 tables):
 4. daily_active_users
 5. user_summary
 ------------------------------------------------------------
-Backward lineage for user_summary.event_count (1 sources):
-ColumnNode('daily_active_users:raw_events.*')
+Backward lineage for user_summary.event_count (4 sources):
+ColumnNode('source_events.user_id')
+ColumnNode('source_events.event_type')
+ColumnNode('source_events.event_timestamp')
+ColumnNode('source_events.session_id')
 ------------------------------------------------------------
-Forward lineage for source_events.event_timestamp (1 impacts):
-ColumnNode('user_summary:user_summary.activity_date')
+Forward lineage for source_events.event_timestamp (2 impacts):
+ColumnNode('user_summary.activity_date')
+ColumnNode('user_summary.event_count')
 ```
 
 ### Metadata from SQL Comments
@@ -257,10 +257,10 @@ for col in pipeline.columns.values():
 
 **Output:**
 ```
-Total columns: 6
+Total columns: 5
 ------------------------------------------------------------
 PII columns (1):
-select:select.email
+select_result.email
 Owner: data-team
 ------------------------------------------------------------
 ```
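
Reviewer note on the "Metadata from SQL Comments" output above: the README describes comments shaped like `-- description [pii: true]`. The following is a rough sketch of how such a comment could be split into a description and key/value pairs. It is not clgraph's parser; `COMMENT_RE` and `parse_inline_comment` are hypothetical names, and the exact grammar (comma-separated `key: value` pairs inside square brackets) is an assumption based only on that one example.

```python
import re

# Assumed comment shape: "-- free text [key: value, key2: value2]".
# The bracketed part is optional.
COMMENT_RE = re.compile(
    r"--\s*(?P<desc>[^\[\n]*?)\s*(?:\[(?P<tags>[^\]]*)\])?\s*$"
)


def parse_inline_comment(line):
    """Return a metadata dict for a trailing SQL comment, or None if absent."""
    match = COMMENT_RE.search(line)
    if match is None:
        return None
    metadata = {"description": match.group("desc") or ""}
    # Split "pii: true, owner: data-team" into individual key/value pairs.
    for pair in (match.group("tags") or "").split(","):
        if ":" in pair:
            key, value = pair.split(":", 1)
            metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    print(parse_inline_comment("email,  -- User email address [pii: true, owner: data-team]"))
```

Values stay as strings (`"true"`, not `True`) in this sketch, and a `--` inside a SQL string literal would be misread as a comment; a real parser would need to work from tokenized SQL instead of raw lines.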
@@ -320,16 +320,13 @@ print("✓ Exported to lineage.json, columns.csv, lineage.dot")
 
 **Output:**
 ```
-📊 Propagating metadata for 8 columns...
-✅ Done! Propagated metadata for 8 columns
-Found 3 PII columns:
-ColumnNode('raw.orders:raw.orders.user_email')
-Owner: data-team
-Tags: contact, sensitive
-ColumnNode('analytics.revenue:analytics.revenue.user_email')
+📊 Propagating metadata for 6 columns...
+✅ Done! Propagated metadata for 6 columns
+Found 2 PII columns:
+ColumnNode('raw.orders.user_email')
 Owner: data-team
 Tags: contact, sensitive
-ColumnNode('analytics.revenue:raw.orders.user_email')
+ColumnNode('analytics.revenue.user_email')
 Owner: data-team
 Tags: contact, sensitive
 ------------------------------------------------------------

clgraph-illustration.svg

Lines changed: 2 additions & 2 deletions

src/clgraph/pipeline.py

Lines changed: 25 additions & 0 deletions
@@ -1954,6 +1954,31 @@ def __repr__(self):
         queries_display = "\n ".join(query_strs)
         return f"Pipeline(\n {queries_display}\n)"
 
+    def build_subpipeline(self, target_table: str) -> "Pipeline":
+        """
+        Build a subpipeline containing only queries needed to build a specific table.
+
+        This is a convenience wrapper around split() for building a single target.
+
+        Args:
+            target_table: The table to build (e.g., "analytics.revenue")
+
+        Returns:
+            A new Pipeline containing only the queries needed to build target_table
+
+        Example:
+            # Build only what's needed for analytics.revenue
+            subpipeline = pipeline.build_subpipeline("analytics.revenue")
+
+            print(f"Full pipeline: {len(pipeline.table_graph.queries)} queries")
+            print(f"Subpipeline: {len(subpipeline.table_graph.queries)} queries")
+
+            # Run just the subpipeline
+            result = subpipeline.run(executor=execute_sql)
+        """
+        subpipelines = self.split([target_table])
+        return subpipelines[0]
+
     def split(self, sinks: List) -> List["Pipeline"]:
         """
         Split pipeline into non-overlapping subpipelines based on target tables.
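
Reviewer note: the body of `split()` is not part of this diff, so how it selects queries is not shown here. Conceptually, "only the queries needed to build a specific table" amounts to the target plus everything it transitively depends on. The sketch below illustrates that idea on a plain dependency dict; it is not clgraph's code. `upstream_closure` and the `deps` mapping are hypothetical, and `staging.orders` and `analytics.signups` are made-up tables added for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def upstream_closure(deps, target):
    """Return target and all of its transitive dependencies in buildable order."""
    needed = set()
    stack = [target]
    while stack:
        table = stack.pop()
        if table in needed:
            continue  # already visited via another path
        needed.add(table)
        stack.extend(deps.get(table, ()))
    # Keep only the needed tables, then order them so dependencies come first.
    subgraph = {table: set(deps.get(table, ())) for table in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    # Hypothetical pipeline: table -> tables it reads from.
    deps = {
        "staging.orders": {"raw.orders"},
        "analytics.revenue": {"staging.orders"},
        "analytics.signups": {"raw.users"},  # unrelated branch, should be dropped
    }
    print(upstream_closure(deps, "analytics.revenue"))
```

The unrelated `analytics.signups` branch is left out of the result, which is the saving a single-target subpipeline is meant to provide.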
