feat: Implement option 'delete_rows' of argument 'if_exists' in 'DataFrame.to_sql' API. #60376

Merged
merged 1 commit into from
Feb 19, 2025

Conversation

@gmcrocetti
Contributor Author

@WillAyd I chose the name delete_rows instead of delete_replace because replace currently recreates the table, as you mentioned, whereas delete_rows describes exactly what happens behind the scenes.

@erfannariman tagging you due to your help/interest during the lifecycle of this issue.
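
For readers skimming the thread, the difference between recreating a table and deleting its rows can be sketched with the standard-library sqlite3 module. This is only an illustration under stated assumptions, not the code in this PR: the table layout, the index name idx_label, and the helper index_names are made up for the example, and the recreated table's schema is an arbitrary stand-in for whatever a recreate would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_frame (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_label ON temp_frame (label)")
conn.execute("INSERT INTO temp_frame (label) VALUES ('a'), ('b')")


def index_names(connection: sqlite3.Connection) -> list:
    """Return the names of user-created indexes (hypothetical helper)."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return [name for (name,) in rows]


# "delete_rows" style: the rows go away, the table definition and index stay.
conn.execute("DELETE FROM temp_frame")
rows_after_delete = conn.execute("SELECT COUNT(*) FROM temp_frame").fetchone()[0]
indexes_after_delete = index_names(conn)

# "replace" style: the table is dropped and created again, so the index is
# gone unless the caller recreates it.
conn.execute("DROP TABLE temp_frame")
conn.execute("CREATE TABLE temp_frame (id INTEGER, label TEXT)")
indexes_after_recreate = index_names(conn)
conn.close()
```

The point of the sketch is only that DELETE leaves constraints and indexes in place, which is what the linked issue asks for.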

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from e443d8d to 33cd0d6 Compare November 20, 2024 14:04
@gmcrocetti gmcrocetti marked this pull request as draft November 20, 2024 14:04
@WillAyd WillAyd added the IO SQL to_sql, read_sql, read_sql_query label Nov 20, 2024
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 3 times, most recently from 3c33249 to 1ef5a87 Compare November 22, 2024 01:10
@gmcrocetti gmcrocetti requested a review from WillAyd November 22, 2024 12:26
@gmcrocetti gmcrocetti marked this pull request as ready for review November 22, 2024 12:26
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 3 times, most recently from b71c0d9 to 1843040 Compare December 17, 2024 13:45
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 15bda94 to 2eb19e7 Compare December 27, 2024 18:03
@gmcrocetti gmcrocetti requested a review from WillAyd December 27, 2024 18:04
Member

@WillAyd WillAyd left a comment

I'm not sure that the test failures are related. Restarted so let's see...

My remaining feedback is rather minor; overall I think the implementation looks good.

@mroeschke care to take a look?

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 5f6ab41 to d1b01d2 Compare January 3, 2025 14:42
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from d1b01d2 to 3e8813f Compare January 3, 2025 14:45
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 77dc01c to 4c8fcda Compare January 3, 2025 17:43
@gmcrocetti
Contributor Author

gmcrocetti commented Jan 23, 2025

blocked by #60748. Changing to draft.

@gmcrocetti
Contributor Author

Hello @WillAyd and @mroeschke,
I believe the merge of #60748 has unblocked this one. Would you mind taking a look? I'm resolving all conversations so we can start fresh on this.

Member

@WillAyd WillAyd left a comment

Nice work - this is looking a lot cleaner after the precursor

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch 4 times, most recently from e67c0a2 to 2186e34 Compare February 17, 2025 18:04
@gmcrocetti gmcrocetti marked this pull request as ready for review February 17, 2025 19:00
@gmcrocetti gmcrocetti requested a review from WillAyd February 17, 2025 19:00
    def delete_rows(self, name: str, schema: str | None = None) -> None:
        table_name = f"{schema}.{name}" if schema else name
        if self.has_table(name, schema):
            self.execute(f"DELETE FROM {table_name}").close()
Member

I'm still a bit unclear why we are calling .close() in this implementation but not in the others

Contributor Author

@gmcrocetti gmcrocetti Feb 18, 2025

It is good / recommended practice to close a cursor after use because that releases its resources. I believe we agree on that?

OK, but putting that aside for a moment, the answer is that the ADBC driver raises an error when a cursor is not explicitly closed, which causes some tests to fail. There's no such check in sqlalchemy / sqlite3, which is why a missing close is "overlooked" there. Example: [image not preserved]

Member

So do sqlalchemy and sqlite3 just leak the cursor then? Or should they be adding this?

I am just confused as to why we are offering potentially different cursor lifetime management across the implementations

Contributor Author

In this case pandas is leaking the cursor because sqlalchemy and sqlite3 do not provide a cool and friendly message alerting the developer. On the other hand it is standard practice to always close it.

I'm in favor of adding .close (so we guarantee there's no leak) to all calls but we had this discussion previously.

Please let me know if I can help with more context

Member

> In this case pandas is leaking the cursor because sqlalchemy and sqlite3

Oh OK - I thought that previously you tried to .close on those as well but they would cause other test failures, as the lifecycle of the cursor was tied to the class.

We shouldn't be leaking resources across any of these implementations, but the challenge is that we also need to stay consistent in how we are managing lifecycles. If it is as simple as calling .close (or preferably using a context manager) for all implementations, then let's do that. If calling .close on everything but ADBC is causing an error, then we need to align the cursor lifecycle management of the ADBC subclass with the others

Contributor Author

There you go @WillAyd :). Sorry for not sending it as a separate commit. I'm keeping a close eye on CI.

Contributor Author

All good. CI passed.

Member

OK great. And none of these offer a context manager, right? And we don't want to be using self.pd_sql.execute either?

I'm still a little unclear as to the difference in usage of self.execute versus self.pd_sql.execute within this PR, but I also don't want to bikeshed if that's the path it takes us down

Contributor Author

> OK great. And none of these offer a context manager, right? And we don't want to be using self.pd_sql.execute either?

The sqlite3 cursor implementation does not provide a context manager. This is one of the reasons I decided to standardize on .close calls. Of course there's a workaround for that (contextlib.closing), and we can discuss it if you find it worthwhile.
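
For reference, the contextlib.closing workaround mentioned above could look roughly like the sketch below. It uses only the standard library; the table name and data are placeholders, and this is not what the PR ended up doing (the PR standardizes on explicit .close calls).

```python
import sqlite3
from contextlib import closing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_frame (id INTEGER)")
conn.execute("INSERT INTO temp_frame VALUES (1), (2)")

# sqlite3.Cursor cannot be used directly in a `with` statement, so
# contextlib.closing supplies the context manager and calls .close() on exit.
cursor = conn.execute("DELETE FROM temp_frame")
with closing(cursor):
    pass  # a DELETE produces no result rows to fetch

try:
    cursor.fetchall()
    cursor_is_closed = False
except sqlite3.ProgrammingError:
    # sqlite3 refuses to operate on a cursor that has been closed.
    cursor_is_closed = True

remaining = conn.execute("SELECT COUNT(*) FROM temp_frame").fetchone()[0]
conn.close()
```

The trade-off is an extra import and wrapper at every call site versus a one-line .close(), which is probably why the plain call won out here.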

> I'm still a little unclear as to the difference in usage of self.execute versus self.pd_sql.execute within this PR, but I also don't want to bikeshed if that's the path it takes us down

No worries, I don't think you're bikeshedding. It is a bit confusing for sure. Let me try to break it down:

  • self.pd_sql.execute should be used only in the SQLiteTable and SQLTable classes. Unlike SQLiteDatabase and SQLDatabase, respectively, these classes don't implement an execute method, which is why we have to go through self.pd_sql.
  • Remember we wanted to stop using con or cursor objects directly...
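
The split described in the two points above can be sketched with a pair of toy classes. MiniDatabase and MiniTable are hypothetical names invented for this illustration; they only mirror the shape of the relationship (a table object holding a pd_sql reference to the database object) and are not the pandas classes themselves.

```python
import sqlite3


class MiniDatabase:
    """Hypothetical stand-in for a database wrapper that owns the connection."""

    def __init__(self, con: sqlite3.Connection) -> None:
        self.con = con

    def execute(self, sql: str) -> sqlite3.Cursor:
        # The single place where statements reach the connection.
        return self.con.execute(sql)

    def delete_rows(self, name: str) -> None:
        # Database-level code can call its own execute method.
        self.execute(f"DELETE FROM {name}").close()


class MiniTable:
    """Hypothetical stand-in for a table wrapper with no execute method."""

    def __init__(self, name: str, pd_sql: MiniDatabase) -> None:
        self.name = name
        self.pd_sql = pd_sql

    def count(self) -> int:
        # Table-level code goes through the owning database object instead
        # of touching the connection or a cursor directly.
        cursor = self.pd_sql.execute(f"SELECT COUNT(*) FROM {self.name}")
        try:
            return cursor.fetchone()[0]
        finally:
            cursor.close()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temp_frame (id INTEGER)")
con.execute("INSERT INTO temp_frame VALUES (1), (2), (3)")
db = MiniDatabase(con)
table = MiniTable("temp_frame", db)
count_before = table.count()
db.delete_rows("temp_frame")
count_after = table.count()
con.close()
```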

Member

OK that's helpful. Sounds like self.pd_sql.execute is just poorly named, but that's not a problem for this PR to fix

@@ -2069,6 +2080,16 @@ def drop_table(self, table_name: str, schema: str | None = None) -> None:
        self.get_table(table_name, schema).drop(bind=self.con)
        self.meta.clear()

    def delete_rows(self, table_name: str, schema: str | None = None) -> None:
        schema = schema or self.meta.schema
        if self.has_table(table_name, schema):
Member

So in the case the table does not exist we are just ignoring any instruction to perform a DELETE? I'm somewhat wary of assuming a user may not want an error here

Contributor Author

Yeah, I'm wary of that too.
We can remove this check and let the driver error out, which would eventually raise a DatabaseError.
What do you think?

Member

@WillAyd WillAyd Feb 18, 2025

On second thought I think this is OK. It follows the same pattern as replace, which will create the table if it does not exist
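
The behavior settled on here (skip the DELETE quietly when the table is missing, so that the subsequent write can create it) can be sketched with sqlite3 as follows. The helpers has_table and delete_rows below are simplified stand-ins written for this example, assuming a plain sqlite3 connection; they are not the methods from the diff.

```python
import sqlite3


def has_table(con: sqlite3.Connection, name: str) -> bool:
    """Check sqlite_master for a table with the given name."""
    cursor = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    )
    try:
        return cursor.fetchone() is not None
    finally:
        cursor.close()


def delete_rows(con: sqlite3.Connection, name: str) -> bool:
    """Delete all rows if the table exists; report whether a DELETE ran."""
    if not has_table(con, name):
        # Nothing to delete: no error, the caller may go on to create the table.
        return False
    con.execute(f'DELETE FROM "{name}"').close()
    return True


con = sqlite3.connect(":memory:")
ran_on_missing = delete_rows(con, "temp_frame")

con.execute("CREATE TABLE temp_frame (id INTEGER)")
con.execute("INSERT INTO temp_frame VALUES (1), (2)")
ran_on_existing = delete_rows(con, "temp_frame")
rows_left = con.execute("SELECT COUNT(*) FROM temp_frame").fetchone()[0]
con.close()
```

Dropping the has_table check, as floated earlier in this thread, would instead surface the driver's own error for a missing table.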

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 2186e34 to f5bc6ff Compare February 18, 2025 17:24
Member

@WillAyd WillAyd left a comment

lgtm @mroeschke care to review?

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 9b015c0 to 908bd2f Compare February 19, 2025 01:27
@@ -2698,6 +2700,58 @@ def test_drop_table(conn, request):
    assert not insp.has_table("temp_frame")


@pytest.mark.parametrize("conn_name", all_connectable)
def test_delete_rows_success(conn_name, test_frame1, request):
    table_name = "temp_frame"
Member

@mroeschke mroeschke Feb 19, 2025

Nit: Could you make table_name a bit more unique between these two tests?

We have an existing issue about eventually having tests consistently clean up the tables they create, but in the meantime unique names will ensure your two added tests won't clobber each other if one fails.
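
One possible way to get distinct names, shown purely as an illustration: the helper unique_table_name is hypothetical, and simply hard-coding a different literal name in each test satisfies the request just as well.

```python
import sqlite3
import uuid


def unique_table_name(prefix: str) -> str:
    """Build a table name that will not collide with other tests (hypothetical helper)."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


con = sqlite3.connect(":memory:")

# Two "tests" sharing one database: each creates and fills its own table.
first = unique_table_name("test_delete_rows_success")
second = unique_table_name("test_delete_rows_is_atomic")
for name in (first, second):
    con.execute(f'CREATE TABLE "{name}" (id INTEGER)')
    con.execute(f'INSERT INTO "{name}" VALUES (1)')

# Clearing one table leaves the other test's data untouched.
con.execute(f'DELETE FROM "{first}"').close()
first_count = con.execute(f'SELECT COUNT(*) FROM "{first}"').fetchone()[0]
second_count = con.execute(f'SELECT COUNT(*) FROM "{second}"').fetchone()[0]
con.close()
```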

Contributor Author

Alright 👍 . Done !

@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 908bd2f to 7034f68 Compare February 19, 2025 12:11
@gmcrocetti gmcrocetti force-pushed the issue-37210-to-sql-truncate branch from 7034f68 to a52d2e2 Compare February 19, 2025 12:11
@gmcrocetti gmcrocetti requested a review from mroeschke February 19, 2025 12:26
@mroeschke mroeschke added this to the 3.0 milestone Feb 19, 2025
@mroeschke mroeschke merged commit 4c3b573 into pandas-dev:main Feb 19, 2025
42 checks passed
@mroeschke
Member

Thanks @gmcrocetti

@gmcrocetti
Contributor Author

Thanks a lot for reviewing it folks. A special thanks to @WillAyd who reviewed the two PRs multiple times 🙇 .
Appreciated !

Labels
IO SQL to_sql, read_sql, read_sql_query
Development

Successfully merging this pull request may close these issues.

ENH: DataFrame.to_sql with if_exists='replace' should do truncate table instead of drop table
4 participants