Detect wether query just filters rows or is more complex with sqlglot #2619

anuunchin · 2025-05-09T15:09:39Z

Description

This is a research PR that implements a utility function which analyzes a SELECT query and detects whether it only filters rows or is more complex.

Related Issues

Resolves Detect wether query just filters rows or is more complex with sqlglot #2557

Additional Notes:

A UNION ALL statement is parsed as a sqlglot Union expression rather than a Select expression.
Queries with CTEs are flagged as complex regardless of their underlying structure.

netlify · 2025-05-09T15:09:43Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`48640ad`
🔍 Latest deploy log	https://app.netlify.com/projects/dlt-hub-docs/deploys/683ec3dc864c9900080bba3e
😎 Deploy Preview	https://deploy-preview-2619--dlt-hub-docs.netlify.app

anuunchin · 2025-05-09T15:16:27Z

dlt/common/utils.py

+
+def query_is_complex(
+    parsed_select: Union[sqlglot.exp.Select, sqlglot.exp.Union],
+    columns: Set[str],


This will probably need to be TTableSchemaColumns 👀

sh-rp · 2025-05-26T07:17:41Z

I'm not quite sure; maybe this will be an SQLGlotSchema. Let's keep a list of known columns for now, and we can change it later when we use this code in the transformations work.

anuunchin · 2025-05-09T15:31:51Z

dlt/common/utils.py

+    if non_literal_cols == columns:
+        return False
+
+    return True


Another thing I tried here is sqlglot's diff function, which outputs the edits needed to turn a base query into a given query. In our case, the base query could be SELECT * FROM my_table or SELECT <all_columns explicitly> FROM my_table, and we could check whether the required edits only add or remove columns, plus a few other allowed actions. I realized, though, that this might not be as straightforward as simply going through the parsed query.

zilto · 2025-05-09T21:52:25Z

dlt/common/utils.py

+        bool: Whether a query is considered complex.
+    """
+    # 1. If more than one table is referenced -> complex
+    tables = {table for table in parsed_select.find_all(sqlglot.exp.Table)}


I remembered reading in the SQLGlot AST primer that find(), find_all() and walk() are not always reliable:

Here's a common pitfall of the walk methods:

ast.find_all(exp.Table)

At first glance, this seems like a great way to find all tables in a query. However, Table instances are not always tables in your database. Here's an example where this fails:

    ast = parse_one("""
    WITH x AS (
      SELECT a FROM y
    )
    SELECT a FROM x
    """)

    # This is NOT a good way to find all tables in the query!
    for table in ast.find_all(exp.Table):
        print(table)

    # x  -- this is a common table expression, NOT an actual table
    # y

The post follows with how to use the Scope object to traverse the AST. At each "node" of the scope (think of traversing a graph), you can directly query the .stars, .tables, etc. properties to retrieve what's in the scope.

@zilto Do we take into account only physical tables? I just thought if the query already has multiple table expressions, no matter whether each of them corresponds to an actual table or not, we still consider the query complex - because it's complex in terms of syntax if not in terms of semantics 👀

Added the sqlglot example just in case: 🫡

    {
        "query": "WITH temp_table AS (SELECT * FROM my_table) SELECT * FROM temp_table;",
        "complex": True,
        "description": "cte + two table refs",
    },
    {
        "query": "WITH x AS (SELECT a FROM y) SELECT a FROM x",
        "complex": True,
        "description": "cte + two table refs + sqlglot example",
    },

I meant to say that the resource I linked proposes an approach that could simplify our code. Manipulating instances of the Scope class might be more convenient than manipulating raw Expression objects. Scope has many useful properties.

Probably something like this:

    from sqlglot import parse_one
    from sqlglot.optimizer.scope import build_scope

    ast = parse_one("""
    WITH x AS (
      SELECT a FROM y
    )
    SELECT a FROM x
    """)

    def is_complex(ast):
        root = build_scope(ast)
        # `.traverse()` recursively walks the graph
        for scope in root.traverse():
            # 1- Scope<SELECT a FROM y>
            # 2- Scope<WITH x AS (SELECT a FROM y) SELECT a FROM x>
            if scope.is_cte:
                return True
            if scope.is_derived_table:
                return True
            if scope.is_subquery:
                return True
            if scope.is_union:
                return True
            if scope.pivots:
                return True
            if scope.join_hints:
                return True
            # and more
        return False

sh-rp

very nice, thanks :)

sh-rp

This is very good. I think you can keep the scope approach and fix the few things I have highlighted.

anuunchin requested a review from sh-rp May 9, 2025 15:09

anuunchin commented May 9, 2025

zilto reviewed May 9, 2025

anuunchin self-assigned this May 11, 2025

anuunchin force-pushed the research/2557-query-complexity-analyzer branch from a0ba1f9 to d2969ef Compare May 13, 2025 07:15

sh-rp previously approved these changes May 19, 2025

anuunchin dismissed sh-rp’s stale review via 42f056f May 22, 2025 09:45

sh-rp requested changes May 26, 2025

anuunchin force-pushed the research/2557-query-complexity-analyzer branch 2 times, most recently from d38f5ec to df105c3 Compare May 26, 2025 12:15

anuunchin changed the base branch from devel to feat/2527-transformations May 26, 2025 13:30

anuunchin requested a review from sh-rp May 26, 2025 13:33

sh-rp force-pushed the feat/2527-transformations branch 2 times, most recently from e4235a9 to bb32193 Compare May 30, 2025 12:12

sh-rp changed the base branch from feat/2527-transformations to devel May 30, 2025 15:19

anuunchin closed this Jun 3, 2025

anuunchin force-pushed the research/2557-query-complexity-analyzer branch from 581e638 to fd88bb0 Compare June 3, 2025 09:40

Query complexity analyzer

48640ad

anuunchin reopened this Jun 3, 2025
