Skip to content

Detect wether query just filters rows or is more complex with sqlglot #2619

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

anuunchin
Copy link
Contributor

Description

This is a research PR with an implementation of a utility function that analyzes a select query and detects whether it's complex or not.

Related Issues

Additional Notes:

  • A union all statement is parsed as a Union expression rather than a select one.
  • Queries with CTEs are flagged as complex no matter the underlying structure.

@anuunchin anuunchin requested a review from sh-rp May 9, 2025 15:09
Copy link

netlify bot commented May 9, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 48640ad
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/683ec3dc864c9900080bba3e
😎 Deploy Preview https://deploy-preview-2619--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.


def query_is_complex(
parsed_select: Union[sqlglot.exp.Select, sqlglot.exp.Union],
columns: Set[str],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably need to be TTableSchemaColumns 👀

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure, maybe this will be an SQLGlotSchema, but let's keep a list of known columns for now and we can later change it when we use this code in the transformations work.

if non_literal_cols == columns:
return False

return True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing I tried here is using sqlglot's diff function which outputs what needs to be done to create a given query from the base query. In our case, the base query could be SELECT * FROM my_table, or SELECT <all_columns explicitly> FROM my_table and we can check if the required actions involve removing or adding a column and some other allowed actions - but this I realized might not be as straightforward as simply going through the parsed query

bool: Whether a query is considered complex.
"""
# 1. If more than one table is referenced -> complex
tables = {table for table in parsed_select.find_all(sqlglot.exp.Table)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remembered reading in the SQGlot AST primer that find(), find_all() and walk() are not always reliable

Here's a common pitfall of the walk methods:

ast.find_all(exp.Table)

At first glance, this seems like a great way to find all tables in a query. However, Table instances are not always tables in your database. Here's an example where this fails:

ast = parse_one("""
WITH x AS (
  SELECT a FROM y
)
SELECT a FROM x
""")

# This is NOT a good way to find all tables in the query!
for table in ast.find_all(exp.Table):
    print(table)

# x  -- this is a common table expression, NOT an actual table
# y

The post follows with how to use the Scope object to traverse the AST. At each "node" of the scope (think of traversing a graph), you can directly query the .stars, .tables, etc. properties to retrieve what's in the scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zilto Do we take into account only physical tables? I just thought if the query already has multiple table expressions, no matter whether each of them corresponds to an actual table or not, we still consider the query complex - because it's complex in terms of syntax if not in terms of semantics 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the sqlglot example just in case: 🫡

    {
        "query": "WITH temp_table AS (SELECT * FROM my_table) SELECT * FROM temp_table;",
        "complex": True,
        "description": "cte + two table refs",
    },
    {
        "query": "WITH x AS (SELECT a FROM y) SELECT a FROM x",
        "complex": True,
        "description": "cte + two table refs + sqlglot example",
    },

Copy link
Collaborator

@zilto zilto May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to say that the resource I linked proposes an approach that could simplify our code. Manipulating instances of the Scope class might be more convenient than manipulating raw Expression objects. Scope has many useful properties

Probably something like this:

from sqlglot.optimizer.scope import build_scope

ast = parse_one("""
WITH x AS (
  SELECT a FROM y
)
SELECT a FROM x
""")

root = build_scope(ast)
# `.traverse()` recursively walks the graph
for scope in root.traverse():
   # 1- Scope<SELECT a FROM y>
   # 2- Scope<WITH x AS (SELECT a FROM y) SELECT a FROM x>

    if scope.is_cte:
       return True
       
    if scope.is_derived_table:
       return True
       
    if scope.is_subquery:
       return True
       
    if scope.is_union:
       return True
       
    if scope.pivots:
       return True
       
    if scope.joint_hints:
       return True

   # and more

@anuunchin anuunchin self-assigned this May 11, 2025
@anuunchin anuunchin force-pushed the research/2557-query-complexity-analyzer branch from a0ba1f9 to d2969ef Compare May 13, 2025 07:15
sh-rp
sh-rp previously approved these changes May 19, 2025
Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice, thanks :)

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good. I think you can keep the scope approach and fix the few things I have highlighted.


def query_is_complex(
parsed_select: Union[sqlglot.exp.Select, sqlglot.exp.Union],
columns: Set[str],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure, maybe this will be an SQLGlotSchema, but let's keep a list of known columns for now and we can later change it when we use this code in the transformations work.

@anuunchin anuunchin force-pushed the research/2557-query-complexity-analyzer branch 2 times, most recently from d38f5ec to df105c3 Compare May 26, 2025 12:15
@anuunchin anuunchin changed the base branch from devel to feat/2527-transformations May 26, 2025 13:30
@anuunchin anuunchin requested a review from sh-rp May 26, 2025 13:33
@sh-rp sh-rp force-pushed the feat/2527-transformations branch 2 times, most recently from e4235a9 to bb32193 Compare May 30, 2025 12:12
@sh-rp sh-rp changed the base branch from feat/2527-transformations to devel May 30, 2025 15:19
@anuunchin anuunchin closed this Jun 3, 2025
@anuunchin anuunchin force-pushed the research/2557-query-complexity-analyzer branch from 581e638 to fd88bb0 Compare June 3, 2025 09:40
@anuunchin anuunchin reopened this Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Detect wether query just filters rows or is more complex with sqlglot
3 participants