Skip to content

Replace def _limit_clause_sql with sqlglot #2689

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: devel
Choose a base branch
from

Conversation

anuunchin
Copy link
Contributor

@anuunchin anuunchin commented May 27, 2025

Description

This PR replaces the _limit_clause_sql function with sqlglot. Also aims to adjust the following functions involving manual query building:

  • def row_counts() in dlt/destinations/dataset/dataset.py
  • def query() in dlt/destinations/dataset/relation.py
  • def get_stored_schema() in dlt/destinations/job_client_impl.py
  • def get_stored_schema_by_hash() in dlt/destinations/job_client_impl.py
  • def get_stored_state() in dlt/destinations/job_client_impl.py

Copy link

netlify bot commented May 27, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit dd22801
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/683ec5202af7380008366869
😎 Deploy Preview https://deploy-preview-2689--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@anuunchin anuunchin self-assigned this May 28, 2025
@sh-rp
Copy link
Collaborator

sh-rp commented May 28, 2025

Just a note: things that could additionally be replaced is the sql code in "row_counts" and where we get state and schema from the destination.

@anuunchin anuunchin force-pushed the feat/sqlglot-query-readablerelation branch 5 times, most recently from b341c18 to 3fe5923 Compare June 3, 2025 09:16
@anuunchin anuunchin marked this pull request as ready for review June 3, 2025 09:20
@anuunchin anuunchin force-pushed the feat/sqlglot-query-readablerelation branch from 3fe5923 to dd22801 Compare June 3, 2025 09:49
@anuunchin anuunchin requested review from sh-rp and removed request for sh-rp June 3, 2025 09:50
Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this, it's a good first start :) There has been new work on the transformations that makes this much easier. I also think the ultimate goal of this work is to remove as much handwritten sql from our codebase as possible (we have a lot of it in the sql_jobs file which handles merging, replacing and appending), but this is for future tickets.

Firstly I think you have to wait just a little longer until we have the upcoming transformations PR merged, as there are some more change in there. After that you can do the following:

  • Have a look at the def normalize_query function that @rudolfix wrote. This will create a query with identifiers in the "dlt schema" space into a query with the correct fully qualified identifiers of the destination. It needs the sql_client and the sql_glot schema for this which you have in all the places where you are writing these queries. You can also look at this on the transformations decoupling branch which is more up to date.
  • Update your queries to not add catalogs, databases or any aliases but rather run them through this normalize function. They should become much easier and all of the destination specific code should go away (clickhouse)
  • I think we could add a dedicated file at dlt/destinations/queries.py where all these queries live and can be unit tested independently of any actual loading happening! The "normalize_query" function can go there too imho.
  • Add unit tests for the queries. We can talk about what exactly needs to be tested here.

@@ -187,6 +187,3 @@ def _make_database_exception(cls, ex: Exception) -> Exception:
@staticmethod
def is_dbapi_exception(ex: Exception) -> bool:
return isinstance(ex, pyodbc.Error)

def _limit_clause_sql(self, limit: int) -> Tuple[str, str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good riddance :)

dataset_name = self.sql_client.dataset_name
catalog_name = self.sql_client.catalog_name(escape=False)

if self.config.destination_type == "clickhouse":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just on a general note, the destination specific implementations should never leak into the base class. The right way to do it would be to have the correct implementation in the subclass. But with the changes I propose below, this problem should be gone in any case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫡


table_expr = sqlglot.exp.Table(
this=sqlglot.exp.to_identifier(table_name, quoted=True),
db=sqlglot.exp.to_identifier(dataset_name, quoted=True),
Copy link
Collaborator

@sh-rp sh-rp Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use this new function to normalize queries (explained above), we will not have to add db and catalog everywhere.

Comment on lines 209 to +210

table_name = self.sql_client.make_qualified_table_name(
dataset_schema.naming.normalize_path(self._table_name)
table_name = dataset_schema.naming.normalize_tables_path(self._table_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a change now on devel that runs the result of the query method through a qualifying and normalization step, have a look at what is there now, and you will see that you will not have to do any destination specific stuff or add database and catalog information here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants