Fix overflow case and clean up some logic #18734

Merged
merged 1 commit into from
May 12, 2025

vyasr
Contributor

@vyasr vyasr commented May 9, 2025

Description

The size in bytes computed for the CUDA Array Interface (CAI) of pylibcudf Column objects produced from a column_view and an arbitrary owner could previously overflow. The arithmetic was performed on int32 types, but the int32 limit is the maximum size of a column in number of elements, not in bytes, so the byte count can exceed it. Since the CAI is a Python object, we can do the arithmetic with pure Python (arbitrary-precision) integers to avoid this problem. While fixing this bug, this PR also does some minor cleanup of the various cases handled in the size calculation.
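
For illustration only, here is a minimal pure-Python sketch of the failure mode described above. It is not the pylibcudf code: `wrap_int32`, `nbytes_int32`, and `nbytes_python` are hypothetical helpers, and the 32-bit wraparound is simulated rather than taken from the actual implementation.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def nbytes_int32(num_elements: int, itemsize: int) -> int:
    """Hypothetical size calculation whose result is confined to int32."""
    return wrap_int32(num_elements * itemsize)


def nbytes_python(num_elements: int, itemsize: int) -> int:
    """The same calculation with Python's arbitrary-precision integers."""
    return num_elements * itemsize


num_elements = 300_000_000  # a valid element count: it fits in int32
itemsize = 8                # e.g. an 8-byte type such as int64 or float64
assert num_elements <= INT32_MAX

print(nbytes_int32(num_elements, itemsize))   # -1894967296 (wrapped around)
print(nbytes_python(num_elements, itemsize))  # 2400000000 (correct)
```

The element count is within the int32 limit, but the byte count is not, which is why the multiplication has to happen in a wider (here, unbounded) integer type.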

Resolves #18598

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr self-assigned this May 9, 2025
@vyasr vyasr requested a review from a team as a code owner May 9, 2025 13:49
@vyasr vyasr requested review from bdice and rjzamora May 9, 2025 13:49
@vyasr vyasr added bug Something isn't working non-breaking Non-breaking change labels May 9, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels May 9, 2025
Contributor

@mroeschke mroeschke left a comment

I suppose it's onerous to create a large enough column to test that we don't overflow?

Contributor

@Matt711 Matt711 left a comment

Should we add the reproducer (or similar) in the issue to our test suite?

Contributor Author

vyasr commented May 12, 2025

Yeah I didn't include a repro because the test in #18598 takes 1000 s to run. The big issue is that the assertion takes forever since it requires converting to a pandas object and checking on the CPU.

Contributor Author

vyasr commented May 12, 2025

/merge

@rapids-bot rapids-bot bot merged commit 67a5975 into rapidsai:branch-25.06 May 12, 2025
125 checks passed
@vyasr vyasr deleted the fix/cai_overflow branch May 12, 2025 14:57
Development

Successfully merging this pull request may close these issues.

[BUG] cudf.from_pandas breaks for large dataframes with list[..] column in 25.06+