Replace def _limit_clause_sql with sqlglot #2689
base: devel
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
Just a note: things that could additionally be replaced are the SQL code in "row_counts" and where we get state and schema from the destination.
Force-pushed from b341c18 to 3fe5923, then from 3fe5923 to dd22801.
Thanks for the work on this, it's a good first start :) There has been new work on the transformations that makes this much easier. I also think the ultimate goal of this work is to remove as much handwritten SQL from our codebase as possible (we have a lot of it in the sql_jobs file, which handles merging, replacing, and appending), but that is for future tickets.
First, I think you have to wait just a little longer until the upcoming transformations PR is merged, as there are some more changes in there. After that you can do the following:
- Have a look at the normalize_query function that @rudolfix wrote. It converts a query with identifiers in the "dlt schema" space into a query with the correct fully qualified identifiers of the destination. It needs the sql_client and the sqlglot schema for this, which you have in all the places where you are writing these queries. You can also look at this on the transformations decoupling branch, which is more up to date.
- Update your queries to not add catalogs, databases, or any aliases, but instead run them through this normalize function. They should become much simpler, and all of the destination-specific code (clickhouse) should go away.
- I think we could add a dedicated file at dlt/destinations/queries.py where all these queries live and can be unit tested independently of any actual loading happening! The normalize_query function can go there too, imho.
- Add unit tests for the queries. We can talk about what exactly needs to be tested here.
@@ -187,6 +187,3 @@ def _make_database_exception(cls, ex: Exception) -> Exception:
     @staticmethod
     def is_dbapi_exception(ex: Exception) -> bool:
         return isinstance(ex, pyodbc.Error)
-
-    def _limit_clause_sql(self, limit: int) -> Tuple[str, str]:
Good riddance :)
dataset_name = self.sql_client.dataset_name
catalog_name = self.sql_client.catalog_name(escape=False)

if self.config.destination_type == "clickhouse":
Just a general note: destination-specific implementations should never leak into the base class. The right way to do it would be to put the correct implementation in the subclass. But with the changes I propose below, this problem should be gone in any case.
🫡
table_expr = sqlglot.exp.Table(
    this=sqlglot.exp.to_identifier(table_name, quoted=True),
    db=sqlglot.exp.to_identifier(dataset_name, quoted=True),
If we use this new function to normalize queries (explained above), we will not have to add db and catalog everywhere.
-table_name = self.sql_client.make_qualified_table_name(
-    dataset_schema.naming.normalize_path(self._table_name)
+table_name = dataset_schema.naming.normalize_tables_path(self._table_name)
There is a change now on devel that runs the result of the query method through a qualifying and normalization step. Have a look at what is there now, and you will see that you will not have to do any destination-specific work or add database and catalog information here.
Description
This PR replaces the _limit_clause_sql function with sqlglot. It also aims to adjust the following functions that involve manual query building:
- def row_counts() in dlt/destinations/dataset/dataset.py
- def query() in dlt/destinations/dataset/relation.py
- def get_stored_schema() in dlt/destinations/job_client_impl.py
- def get_stored_schema_by_hash() in dlt/destinations/job_client_impl.py
- def get_stored_state() in dlt/destinations/job_client_impl.py