-
Notifications
You must be signed in to change notification settings - Fork 287
Detect wether query just filters rows or is more complex with sqlglot #2619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
dlt/common/utils.py
Outdated
|
||
def query_is_complex( | ||
parsed_select: Union[sqlglot.exp.Select, sqlglot.exp.Union], | ||
columns: Set[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably need to be TTableSchemaColumns 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure, maybe this will be an SQLGlotSchema, but let's keep a list of known columns for now and we can later change it when we use this code in the transformations work.
dlt/common/utils.py
Outdated
if non_literal_cols == columns: | ||
return False | ||
|
||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing I tried here is using sqlglot's diff function which outputs what needs to be done to create a given query from the base query. In our case, the base query could be SELECT * FROM my_table
, or SELECT <all_columns explicitly> FROM my_table
and we can check if the required actions involve removing or adding a column and some other allowed actions - but this I realized might not be as straightforward as simply going through the parsed query
dlt/common/utils.py
Outdated
bool: Whether a query is considered complex. | ||
""" | ||
# 1. If more than one table is referenced -> complex | ||
tables = {table for table in parsed_select.find_all(sqlglot.exp.Table)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remembered reading in the SQGlot AST primer that find()
, find_all()
and walk()
are not always reliable
Here's a common pitfall of the walk methods:
ast.find_all(exp.Table)At first glance, this seems like a great way to find all tables in a query. However, Table instances are not always tables in your database. Here's an example where this fails:
ast = parse_one(""" WITH x AS ( SELECT a FROM y ) SELECT a FROM x """) # This is NOT a good way to find all tables in the query! for table in ast.find_all(exp.Table): print(table) # x -- this is a common table expression, NOT an actual table # y
The post follows with how to use the Scope
object to traverse the AST. At each "node" of the scope (think of traversing a graph), you can directly query the .stars
, .tables
, etc. properties to retrieve what's in the scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zilto Do we take into account only physical tables? I just thought if the query already has multiple table expressions, no matter whether each of them corresponds to an actual table or not, we still consider the query complex - because it's complex in terms of syntax if not in terms of semantics 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the sqlglot example just in case: 🫡
{
"query": "WITH temp_table AS (SELECT * FROM my_table) SELECT * FROM temp_table;",
"complex": True,
"description": "cte + two table refs",
},
{
"query": "WITH x AS (SELECT a FROM y) SELECT a FROM x",
"complex": True,
"description": "cte + two table refs + sqlglot example",
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to say that the resource I linked proposes an approach that could simplify our code. Manipulating instances of the Scope class might be more convenient than manipulating raw Expression
objects. Scope
has many useful properties
Probably something like this:
from sqlglot.optimizer.scope import build_scope
ast = parse_one("""
WITH x AS (
SELECT a FROM y
)
SELECT a FROM x
""")
root = build_scope(ast)
# `.traverse()` recursively walks the graph
for scope in root.traverse():
# 1- Scope<SELECT a FROM y>
# 2- Scope<WITH x AS (SELECT a FROM y) SELECT a FROM x>
if scope.is_cte:
return True
if scope.is_derived_table:
return True
if scope.is_subquery:
return True
if scope.is_union:
return True
if scope.pivots:
return True
if scope.joint_hints:
return True
# and more
a0ba1f9
to
d2969ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice, thanks :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very good. I think you can keep the scope approach and fix the few things I have highlighted.
dlt/common/utils.py
Outdated
|
||
def query_is_complex( | ||
parsed_select: Union[sqlglot.exp.Select, sqlglot.exp.Union], | ||
columns: Set[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure, maybe this will be an SQLGlotSchema, but let's keep a list of known columns for now and we can later change it when we use this code in the transformations work.
d38f5ec
to
df105c3
Compare
e4235a9
to
bb32193
Compare
581e638
to
fd88bb0
Compare
Description
This is a research PR with an implementation of a utility function that analyzes a select query and detects whether it's complex or not.
Related Issues
Additional Notes: