fix: Handle sqlglot 28.x breaking change in EXCEPT/REPLACE key names #32

Merged
mingjerli merged 9 commits into main from fix/sqlglot-28-except-key-compatibility on Dec 21, 2025
mingjerli (Owner) commented Dec 21, 2025

Summary

This PR bundles four bug fixes, four features, and one refactor that improve column lineage analysis and visualization.

Commits

  • 4d400c1 fix: Handle sqlglot 28.x breaking change in EXCEPT/REPLACE key names
  • ea2f42b fix: Exclude star nodes from simplified lineage view
  • ae5794c fix: Treat SELECT queries without destination as virtual result tables
  • f1e3137 fix: Sanitize Graphviz node IDs to avoid colon port syntax issues
  • 1fb7f19 feat: Add logging for validation issues at library level
  • 29ed8ec feat: Enhance validation for unqualified columns in JOINs
  • c2b851a feat: Resolve COUNT(*) to individual columns when schema is known
  • 3f806bd feat: Add to_simplified() method for input/output only lineage graph
  • daabb59 refactor: Unify column naming for cross-pipeline lineage

Key Changes

Bug Fixes

  • sqlglot 28.x compatibility: Handle the breaking change where args["except"] and args["replace"] became args["except_"] and args["replace_"], which had caused EXCEPT clauses to be silently ignored
  • Graphviz node IDs: Sanitize colons in node IDs to prevent port syntax interpretation issues
  • Simplified view: Exclude star nodes from simplified lineage view to show only explicit columns
  • Virtual result tables: Treat SELECT queries without destination as virtual {query_id}_result tables

New Features

  • Validation logging: Log validation issues at library level for better debugging
  • JOIN validation: Enhanced validation for unqualified columns in JOIN operations
  • COUNT(*) resolution: Resolve COUNT(*) to individual columns when schema is known
  • Simplified graph: Add to_simplified() method for input/output only lineage visualization

Refactoring

  • Column naming: Unified column naming convention for cross-pipeline lineage consistency

Test plan

  • All existing tests pass
  • Verified sqlglot 27.x and 28.x compatibility
  • Tested EXCEPT/REPLACE clause handling in Docker and local environments

🤖 Generated with Claude Code

mingjerli and others added 9 commits December 20, 2025 19:04

refactor: Unify column naming for cross-pipeline lineage

- Physical table columns now use `table_name.column_name` (no query_id prefix)
- CTEs and subqueries remain query-scoped to avoid name collisions
- Intermediate tables appear only once in the graph instead of duplicated
- Edges flow naturally through shared column nodes
- Simplified cross-query edge creation to only handle * node connections

This fixes the visualization issue where intermediate tables appeared
twice - once as query output and again as downstream query input.
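
A minimal sketch of the naming rule described above. The `query_id:unit.column` form for query-scoped names is an assumption inferred from the `{query_id}:unknown` naming mentioned elsewhere in this PR, and `column_node_name` is a hypothetical helper, not the library's API.

```python
from typing import Optional


def column_node_name(table: str, column: str, query_id: Optional[str] = None) -> str:
    """Build a node name; pass query_id only for CTEs and subqueries."""
    if query_id is None:
        # Physical table: shared across queries, so no query prefix.
        return f"{table}.{column}"
    # Query-scoped unit (assumed format): the prefix keeps names from colliding.
    return f"{query_id}:{table}.{column}"


# Query q1 writes staging.events and query q2 reads it: both produce one node.
written_by_q1 = column_node_name("staging.events", "user_id")
read_by_q2 = column_node_name("staging.events", "user_id")
nodes = {written_by_q1, read_by_q2}

# CTEs that share a name in different queries stay distinct.
cte_q1 = column_node_name("recent", "user_id", query_id="q1")
cte_q2 = column_node_name("recent", "user_id", query_id="q2")
```

Because both queries produce the same string for the physical column, a graph keyed by node name holds the intermediate table once, which is the de-duplication this commit describes.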

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

feat: Add to_simplified() method for input/output only lineage graph

- Add to_simplified() method to ColumnLineageGraph that collapses
  intermediate layers (CTEs, subqueries) into direct input→output edges
- Traces backward from each output node to find all contributing input nodes
- Creates direct edges between input and output nodes
- Preserves warnings and issues from original graph
- Add TestSimplifiedGraph test class with 4 test cases
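
The backward trace described above could be sketched as follows. This is an illustration over plain `(source, target)` tuples, not the actual `ColumnLineageGraph` implementation; the function name `simplify` and its parameters are hypothetical.

```python
from collections import defaultdict


def simplify(edges, inputs, outputs):
    """Collapse intermediate nodes into direct input -> output edges."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    simplified = set()
    for out in outputs:
        # Walk backward from the output until input nodes are reached.
        stack, seen = [out], {out}
        while stack:
            node = stack.pop()
            for parent in parents[node]:
                if parent in seen:
                    continue
                seen.add(parent)
                if parent in inputs:
                    simplified.add((parent, out))
                else:
                    stack.append(parent)
    return simplified


edges = [
    ("raw.orders.amount", "q1:cte.amount"),
    ("q1:cte.amount", "q1:sub.total"),
    ("q1:sub.total", "mart.sales.total"),
    ("raw.orders.id", "q1:cte.id"),  # never reaches the output
]
result = simplify(
    edges,
    inputs={"raw.orders.amount", "raw.orders.id"},
    outputs={"mart.sales.total"},
)
```

The `seen` set keeps the walk from revisiting nodes when several paths converge on the same intermediate column.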

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

feat: Resolve COUNT(*) to individual columns when schema is known

- When upstream query defines table schema, aggregate functions with *
  (COUNT(*), SUM(*), etc.) now create edges to individual columns
  instead of a * node
- Added _resolve_external_table_name() to match short table names
  (e.g., 'events') to fully qualified names (e.g., 'staging.events')
- Falls back to * node when schema is unknown (single query case)
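
As a rough illustration of the resolution and fallback described above, assuming known schemas are held in a dict from fully qualified table name to column list. Both function names are hypothetical, and returning `None` when a short name is ambiguous is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, List, Optional


def resolve_table_name(name: str, schemas: Dict[str, List[str]]) -> Optional[str]:
    """Match 'events' to 'staging.events' when exactly one known table fits."""
    if name in schemas:
        return name
    matches = [full for full in schemas if full.split(".")[-1] == name]
    return matches[0] if len(matches) == 1 else None


def star_sources(table: str, schemas: Dict[str, List[str]]) -> List[str]:
    """Return the source nodes that COUNT(*) over `table` should link to."""
    full_name = resolve_table_name(table, schemas)
    if full_name is None:
        # Schema unknown: fall back to a single star node.
        return [f"{table}.*"]
    return [f"{full_name}.{col}" for col in schemas[full_name]]


schemas = {"staging.events": ["user_id", "event_type"]}
known = star_sources("events", schemas)
unknown = star_sources("orders", schemas)
```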

Also adds to_simplified() method for PipelineLineageGraph:
- Creates simplified view with only physical table columns
- Removes CTE/subquery internal columns
- Creates direct edges between table columns

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

feat: Enhance validation for unqualified columns in JOINs

- Updated _validate_qualified_columns_in_joins() to walk ALL expressions
  (not just top-level columns) to find unqualified column references
- Now detects columns inside function calls like DATE_TRUNC(order_date, MONTH)
- Added proper ValidationIssue objects with:
  - severity: WARNING
  - category: UNQUALIFIED_COLUMN
  - suggestion with example fix
  - context with available tables and expression

This helps catch ambiguous column references that could lead to
incorrect lineage when multiple tables are JOINed.
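
To show why walking the whole expression tree matters, here is a small sketch using a toy expression tree as a stand-in for the sqlglot AST. The `Column` and `Func` classes and both helper functions are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Column:
    name: str
    table: Optional[str] = None  # None means the reference is unqualified


@dataclass
class Func:
    name: str
    args: List[object] = field(default_factory=list)


def walk_columns(expr: object) -> Iterator[Column]:
    """Yield every Column in the tree, including ones nested in functions."""
    if isinstance(expr, Column):
        yield expr
    elif isinstance(expr, Func):
        for arg in expr.args:
            yield from walk_columns(arg)


def unqualified_columns(select_exprs: List[object]) -> List[str]:
    """Collect names of columns that carry no table qualifier."""
    return [c.name for e in select_exprs for c in walk_columns(e) if c.table is None]


# SELECT o.id, DATE_TRUNC(order_date, MONTH) FROM orders o JOIN customers c ...
exprs = [Column("id", table="o"), Func("DATE_TRUNC", [Column("order_date"), "MONTH"])]
found = unqualified_columns(exprs)
```

A check that only inspected top-level `Column` entries would skip the `Func` node entirely and miss `order_date`.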

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

feat: Add logging for validation issues at library level

- Added logging to `add_issue()` methods in models.py and column.py
- Issues are logged at appropriate levels (ERROR, WARNING, INFO)
- Logger name: clgraph.validation
- Log format: "Query 'query_id': [category] message"

This makes validation issues transparent to library users without
requiring them to explicitly check for issues.
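
A minimal sketch of this logging behavior using the standard `logging` module. The logger name and message format come from the commit message; the `log_issue` helper and the severity-to-level mapping are assumptions for illustration.

```python
import logging

logger = logging.getLogger("clgraph.validation")

# Assumed mapping from issue severity to logging level.
_LEVELS = {"ERROR": logging.ERROR, "WARNING": logging.WARNING, "INFO": logging.INFO}


def log_issue(query_id: str, severity: str, category: str, message: str) -> None:
    """Emit one validation issue in the format described above."""
    level = _LEVELS.get(severity.upper(), logging.WARNING)
    logger.log(level, "Query '%s': [%s] %s", query_id, category, message)


# Library users can tune this logger like any other, e.g. hide INFO-level issues:
logging.getLogger("clgraph.validation").setLevel(logging.WARNING)
```

Since the logger is namespaced, an application can silence or redirect validation output without touching the rest of its logging configuration.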

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix: Sanitize Graphviz node IDs to avoid colon port syntax issues

Graphviz interprets colons in node IDs as node:port syntax, causing
visualization bugs where phantom nodes appeared and edges connected
incorrectly.

- Added _sanitize_graphviz_id() helper to replace ':' with '__' and '.' with '_'
- Updated visualize_query_units() to sanitize unit_id and table_id
- Updated visualize_column_lineage() to sanitize node.full_name and cluster names
- Updated visualize_column_lineage_simple() to sanitize node.full_name
- Updated visualize_column_path() to sanitize node.full_name

Labels and tooltips preserve original names for readability.
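
The replacement rule above is small enough to sketch directly. The two substitutions follow the commit message; the function names here are illustrative, and `dot_node` is a hypothetical helper that only shows how the sanitized ID and the original label can coexist.

```python
def sanitize_graphviz_id(node_id: str) -> str:
    """Replace ':' with '__' and '.' with '_' so Graphviz sees a plain ID."""
    return node_id.replace(":", "__").replace(".", "_")


def dot_node(full_name: str) -> str:
    """Render a DOT node line: sanitized ID, original name as the label."""
    return f'{sanitize_graphviz_id(full_name)} [label="{full_name}"];'


line = dot_node("q1:cte.amount")
```

One trade-off of this scheme is that distinct names can map to the same ID (for example `a.b` and `a_b`), so it relies on such pairs not occurring in the same graph.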

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix: Treat SELECT queries without destination as virtual result tables

For plain SELECT statements (without CREATE TABLE/INSERT), output columns
are now assigned to a virtual table named '{query_id}_result' instead of
'{query_id}:unknown'.

This ensures output columns from such queries appear in the simplified
lineage view, which filters out columns containing colons (internal
structures like CTEs and subqueries).
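
A short sketch of why the rename matters for the simplified view, assuming the colon check is a plain substring test. Both helper names are hypothetical.

```python
from typing import Optional


def output_table_name(query_id: str, destination: Optional[str]) -> str:
    """Name the table that owns a query's output columns."""
    if destination:
        return destination  # CREATE TABLE / INSERT target
    return f"{query_id}_result"  # plain SELECT: virtual result table


def is_internal(column_name: str) -> bool:
    """A colon marks query-scoped structures, which the simplified view drops."""
    return ":" in column_name


new_name = f"{output_table_name('q3', None)}.total"
old_name = "q3:unknown.total"
```

Under the old naming the output column contained a colon and was filtered out along with CTE and subquery columns; the `_result` name passes the same filter.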

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix: Exclude star nodes from simplified lineage view

The to_simplified() method now filters out star nodes (table.*) in addition
to CTE and subquery columns. This prevents misleading edges in the
simplified view where unexpanded star nodes would show all columns
(including EXCEPT'd ones) flowing through.

Before: staging.raw_data.internal_notes -> staging.raw_data.* was visible
After: Only explicit column edges are shown, respecting EXCEPT clauses
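
The star-node filter can be pictured as an edge filter like the one below. This is a sketch over plain tuples, with hypothetical function names, and it assumes star nodes are recognizable by a trailing `.*`.

```python
def is_star(node: str) -> bool:
    """Star nodes are assumed to end in '.*' (e.g. 'staging.raw_data.*')."""
    return node.endswith(".*")


def drop_star_edges(edges):
    """Keep only edges whose endpoints are both explicit columns."""
    return [(src, dst) for src, dst in edges if not is_star(src) and not is_star(dst)]


edges = [
    ("staging.raw_data.internal_notes", "staging.raw_data.*"),
    ("staging.raw_data.user_id", "mart.users.user_id"),
]
kept = drop_star_edges(edges)
```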

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

fix: Handle sqlglot 28.x breaking change in EXCEPT/REPLACE key names

sqlglot 28.x changed the AST key names for SELECT * EXCEPT/REPLACE:
- 'except' -> 'except_' (with underscore)
- 'replace' -> 'replace_' (with underscore)

This caused EXCEPT clauses to be silently ignored, resulting in
columns that should be excluded (e.g., internal_notes) still appearing
in the output.

The fix checks for both old and new key names to maintain compatibility
with sqlglot 27.x and 28.x.
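
A minimal sketch of the dual-key lookup, written against a plain mapping so it runs without sqlglot installed. The helper name `star_modifier` is hypothetical, and this is not necessarily how the fix is written in the repository.

```python
from typing import Any, Mapping


def star_modifier(args: Mapping[str, Any], name: str) -> Any:
    """Look up a star modifier under its 28.x key, then its 27.x key."""
    return args.get(f"{name}_") or args.get(name)


old_style = {"except": ["internal_notes"]}   # key name in sqlglot 27.x
new_style = {"except_": ["internal_notes"]}  # key name in sqlglot 28.x
```

With sqlglot, the mapping passed in would be the star expression's `args` dictionary, for example `star_modifier(star.args, "except")` and `star_modifier(star.args, "replace")`.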

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@mingjerli mingjerli merged commit d8ec159 into main Dec 21, 2025
3 checks passed