fix: Handle sqlglot 28.x breaking change in EXCEPT/REPLACE key names #32
Merged
- Physical table columns now use `table_name.column_name` (no query_id prefix)
- CTEs and subqueries remain query-scoped to avoid name collisions
- Intermediate tables appear only once in the graph instead of duplicated
- Edges flow naturally through shared column nodes
- Simplified cross-query edge creation to only handle * node connections

This fixes the visualization issue where intermediate tables appeared twice - once as query output and again as downstream query input.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add to_simplified() method to ColumnLineageGraph that collapses intermediate layers (CTEs, subqueries) into direct input→output edges
- Traces backward from each output node to find all contributing input nodes
- Creates direct edges between input and output nodes
- Preserves warnings and issues from original graph
- Add TestSimplifiedGraph test class with 4 test cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
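The backward trace described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: `simplify_edges` and its plain tuple/set inputs are hypothetical stand-ins for whatever `ColumnLineageGraph` stores internally, and the node names in the usage note are made up.

```python
from collections import defaultdict


def simplify_edges(edges, inputs, outputs):
    """Collapse intermediate nodes into direct (input, output) edges.

    edges   -- iterable of (source, target) node-name pairs
    inputs  -- set of node names treated as graph inputs
    outputs -- set of node names treated as graph outputs
    (Hypothetical signature, for illustration only.)
    """
    # Reverse adjacency: for each node, the set of nodes that feed it.
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    simplified = set()
    for out in outputs:
        # Depth-first walk backward from the output node.
        stack, seen = [out], {out}
        while stack:
            node = stack.pop()
            for parent in parents[node]:
                if parent in seen:
                    continue
                seen.add(parent)
                if parent in inputs:
                    # Reached a contributing input: record a direct edge.
                    simplified.add((parent, out))
                else:
                    # Intermediate layer (CTE, subquery): keep walking.
                    stack.append(parent)
    return sorted(simplified)
```

For example, with edges `raw.orders.amount -> q1:cte:totals.amount -> report.total`, calling `simplify_edges(edges, {"raw.orders.amount"}, {"report.total"})` yields the single direct edge from `raw.orders.amount` to `report.total`.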
- When upstream query defines table schema, aggregate functions with * (COUNT(*), SUM(*), etc.) now create edges to individual columns instead of a * node
- Added _resolve_external_table_name() to match short table names (e.g., 'events') to full qualified names (e.g., 'staging.events')
- Falls back to * node when schema is unknown (single query case)

Also adds to_simplified() method for PipelineLineageGraph:
- Creates simplified view with only physical table columns
- Removes CTE/subquery internal columns
- Creates direct edges between table columns

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
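One way the short-name matching could work is sketched below. This is an assumption about the approach, not the PR's `_resolve_external_table_name()`: the function name, signature, and the choice to return `None` for ambiguous or unknown names are all hypothetical.

```python
def resolve_external_table_name(short_name, known_tables):
    """Map a short table name to a fully qualified name, if unambiguous.

    Returns None when there is no match or more than one candidate
    (an assumed policy; the real helper may behave differently).
    """
    # An exact match needs no resolution.
    if short_name in known_tables:
        return short_name

    # Otherwise look for qualified names whose last segment matches.
    suffix = "." + short_name
    matches = [name for name in known_tables if name.endswith(suffix)]
    return matches[0] if len(matches) == 1 else None
```

With `known_tables = ["staging.events", "staging.users"]`, the short name `events` resolves to `staging.events`. A caller would fall back to the `*` node when the result is `None`, mirroring the unknown-schema case described in the commit.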
- Updated _validate_qualified_columns_in_joins() to walk ALL expressions (not just top-level columns) to find unqualified column references
- Now detects columns inside function calls like DATE_TRUNC(order_date, MONTH)
- Added proper ValidationIssue objects with:
  - severity: WARNING
  - category: UNQUALIFIED_COLUMN
  - suggestion with example fix
  - context with available tables and expression

This helps catch ambiguous column references that could lead to incorrect lineage when multiple tables are JOINed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Added logging to `add_issue()` methods in models.py and column.py
- Issues are logged at appropriate levels (ERROR, WARNING, INFO)
- Logger name: clgraph.validation
- Log format: "Query 'query_id': [category] message"

This makes validation issues transparent to library users without requiring them to explicitly check for issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
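A rough sketch of the logging behavior described above, using only the standard `logging` module. The logger name and message format come from the commit message; the `Issue` dataclass, the severity strings, and the `log_issue` helper are hypothetical stand-ins for the library's own types.

```python
import logging
from dataclasses import dataclass

# Logger name taken from the commit message.
logger = logging.getLogger("clgraph.validation")

# Assumed mapping from issue severity to a logging level.
_LEVELS = {
    "error": logging.ERROR,
    "warning": logging.WARNING,
    "info": logging.INFO,
}


@dataclass
class Issue:
    """Hypothetical stand-in for the library's validation issue type."""
    severity: str
    category: str
    message: str
    query_id: str


def format_issue(issue):
    # Format from the commit message: Query 'query_id': [category] message
    return f"Query '{issue.query_id}': [{issue.category}] {issue.message}"


def log_issue(issue):
    # Unknown severities default to WARNING (an assumption of this sketch).
    level = _LEVELS.get(issue.severity.lower(), logging.WARNING)
    logger.log(level, format_issue(issue))
```

Library users who want to see these messages would configure logging as usual, for example with `logging.basicConfig(level=logging.INFO)`, or attach a handler to the `clgraph.validation` logger.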
Graphviz interprets colons in node IDs as node:port syntax, causing visualization bugs where phantom nodes appeared and edges connected incorrectly.

- Added _sanitize_graphviz_id() helper to replace ':' with '__' and '.' with '_'
- Updated visualize_query_units() to sanitize unit_id and table_id
- Updated visualize_column_lineage() to sanitize node.full_name and cluster names
- Updated visualize_column_lineage_simple() to sanitize node.full_name
- Updated visualize_column_path() to sanitize node.full_name

Labels and tooltips preserve original names for readability.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
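The replacement rule stated above (':' becomes '__', '.' becomes '_') is small enough to sketch directly. The function below is an illustration written from that description, not a copy of the PR's `_sanitize_graphviz_id()`.

```python
def sanitize_graphviz_id(name: str) -> str:
    """Return a node ID that Graphviz will not parse as node:port syntax."""
    # Colons are replaced first, then dots, per the commit description.
    return name.replace(":", "__").replace(".", "_")
```

Only the ID passed to Graphviz would be sanitized; the original name would still be supplied as the label or tooltip. Note that distinct names can collapse to the same ID under this rule (for example `a.b` and `a_b`), which this sketch does not guard against.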
For plain SELECT statements (without CREATE TABLE/INSERT), output columns
are now assigned to a virtual table named '{query_id}_result' instead of
'{query_id}:unknown'.
This ensures output columns from such queries appear in the simplified
lineage view, which filters out columns containing colons (internal
structures like CTEs and subqueries).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The to_simplified() method now filters out star nodes (table.*) in addition to CTE and subquery columns. This prevents misleading edges in the simplified view where unexpanded star nodes would show all columns (including EXCEPT'd ones) flowing through.

Before: staging.raw_data.internal_notes -> staging.raw_data.* was visible
After: Only explicit column edges are shown, respecting EXCEPT clauses

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
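The filtering rule implied by this commit and the previous one can be sketched as a simple predicate on node names. It assumes, as the commit messages suggest, that internal structures carry a colon in their full name and that star nodes end in `.*`; the function name is hypothetical.

```python
def keep_in_simplified_view(full_name: str) -> bool:
    """Decide whether a column node belongs in the simplified lineage view."""
    # Query-scoped names (CTEs, subqueries) contain a colon and are dropped.
    if ":" in full_name:
        return False
    # Unexpanded star nodes such as 'staging.raw_data.*' are dropped too,
    # so EXCEPT'd columns do not appear to flow through them.
    if full_name.endswith(".*"):
        return False
    return True
```

Under this rule a virtual result column such as `q1_result.total` is kept, while `q1:unknown.total` would have been filtered out, which is why the earlier commit renamed the virtual table.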
sqlglot 28.x changed the AST key names for SELECT * EXCEPT/REPLACE:
- 'except' -> 'except_' (with underscore)
- 'replace' -> 'replace_' (with underscore)

This caused EXCEPT clauses to be silently ignored, resulting in columns that should be excluded (e.g., internal_notes) still appearing in the output.

The fix checks for both old and new key names to maintain compatibility with sqlglot 27.x and 28.x.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
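A version-tolerant lookup along the lines described above might look like the following sketch. It operates on a plain mapping standing in for a star expression's `args` dictionary, so it runs without sqlglot installed; the helper name `get_star_modifier` is hypothetical and not part of the PR or of sqlglot.

```python
def get_star_modifier(args, name):
    """Fetch the EXCEPT or REPLACE payload under either key spelling.

    args -- mapping of AST argument names to values (stand-in for the
            star expression's args dict)
    name -- 'except' or 'replace'
    """
    # Try the 28.x spelling (trailing underscore) first, then the 27.x one.
    for key in (f"{name}_", name):
        value = args.get(key)
        if value:
            return value
    # Neither key present or both empty: no modifier on this star.
    return []
```

Checking both spellings avoids pinning the library to one sqlglot major version, at the cost of one extra dictionary lookup per star expression.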
Summary
This PR includes multiple fixes and features for improved lineage analysis and visualization.
Commits
- 4d400c1 fix: Handle sqlglot 28.x breaking change in EXCEPT/REPLACE key names
- ea2f42b fix: Exclude star nodes from simplified lineage view
- ae5794c fix: Treat SELECT queries without destination as virtual result tables
- f1e3137 fix: Sanitize Graphviz node IDs to avoid colon port syntax issues
- 1fb7f19 feat: Add logging for validation issues at library level
- 29ed8ec feat: Enhance validation for unqualified columns in JOINs
- c2b851a feat: Resolve COUNT(*) to individual columns when schema is known
- 3f806bd feat: Add to_simplified() method for input/output only lineage graph
- daabb59 refactor: Unify column naming for cross-pipeline lineage

Key Changes
Bug Fixes
- `args["except"]` became `args["except_"]`, fixing EXCEPT clause handling
- SELECT queries without a destination are treated as virtual `{query_id}_result` tables

New Features
- `to_simplified()` method for input/output only lineage visualization

Refactoring
Test plan
🤖 Generated with Claude Code