
Align PandasCursor with PolarsCursor and optimize DataFrame operations #639

Merged: laughingman7743 merged 8 commits into master from feature/pandas-cursor-alignment on Jan 4, 2026

Conversation

laughingman7743 commented Jan 4, 2026

Summary

This PR aligns the Pandas and Polars cursor implementations for consistency and adds performance optimizations for DataFrame operations.

API Alignment

  • Add as_pandas() method to PandasDataFrameIterator for collecting all chunks into a single DataFrame (mirrors as_polars())
  • Add iter_chunks() method to AthenaPandasResultSet for explicit iterator access
  • Refactor PandasCursor.iter_chunks() to delegate to ResultSet while preserving gc.collect() optimization
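The as_pandas() idea in the list above can be sketched roughly as follows. `ChunkIterator` is a hypothetical stand-in written for illustration, not PyAthena's PandasDataFrameIterator, and combining chunks with `pd.concat(..., ignore_index=True)` is an assumption about how the collection step might work.

```python
from typing import Iterable, Iterator

import pandas as pd


class ChunkIterator:
    """Hypothetical stand-in for a chunked DataFrame iterator (not PyAthena code)."""

    def __init__(self, chunks: Iterable[pd.DataFrame]) -> None:
        self._chunks = iter(chunks)

    def __iter__(self) -> Iterator[pd.DataFrame]:
        return self

    def __next__(self) -> pd.DataFrame:
        return next(self._chunks)

    def as_pandas(self) -> pd.DataFrame:
        # Drain the remaining chunks and combine them into one DataFrame.
        frames = list(self)
        if not frames:
            return pd.DataFrame()
        # Assumption: renumber rows so the result has one continuous index.
        return pd.concat(frames, ignore_index=True)
```

Iterating chunk by chunk keeps memory bounded, while as_pandas() is the convenience path for results that are known to fit in memory.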

Bug Fixes

  • Fix iterrows() to maintain continuous row indices across chunks (was resetting to 0 for each chunk)
  • Fix column names cache initialization order for unload queries (was using stale metadata)
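To make the first fix above concrete, here is a minimal sketch of row iteration that keeps one counter running across chunks instead of restarting at 0 for each chunk. The function name `iterrows_continuous` and its signature are hypothetical; only `DataFrame.itertuples()` is a real pandas call.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

import pandas as pd


def iterrows_continuous(
    chunks: Iterable[pd.DataFrame],
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (row_number, row_dict) pairs with numbering shared by all chunks."""
    row_num = 0  # lives outside the chunk loop, so it never resets
    for chunk in chunks:
        columns = list(chunk.columns)
        # itertuples(index=False, name=None) yields plain tuples one row at a time.
        for values in chunk.itertuples(index=False, name=None):
            yield row_num, dict(zip(columns, values))
            row_num += 1
```

With two chunks of two rows and one row, the yielded row numbers are 0, 1, 2, so the caller does not need to know where the chunk boundaries are.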

Refactoring

  • Rename DataFrameIterator to PandasDataFrameIterator and PolarsDataFrameIterator for clarity
  • Remove unused _closed flag from PolarsDataFrameIterator
  • Update close() to properly close generator resources
  • Update documentation (docs/pandas.rst, docs/polars.rst, docs/api/pandas.rst, docs/api/polars.rst)
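The close() item above relies on a general Python behavior: calling `close()` on a started generator runs its pending `finally` blocks, which releases whatever the generator was holding. The sketch below shows that pattern with hypothetical names (`read_chunks`, `ReaderHolder`); it is not the PyAthena implementation.

```python
import types
from typing import Any, Iterator, List


def read_chunks(log: List[str]) -> Iterator[int]:
    """Hypothetical chunk reader that records when its cleanup runs."""
    try:
        for i in range(3):
            yield i
    finally:
        # Runs on exhaustion or when close() is called on a started generator.
        log.append("released")


class ReaderHolder:
    """Hypothetical iterator wrapper whose reader may or may not be a generator."""

    def __init__(self, reader: Any) -> None:
        self._reader = reader

    def close(self) -> None:
        # Only generators expose this close(); other readers are left untouched.
        if isinstance(self._reader, types.GeneratorType):
            self._reader.close()
```

Calling close() a second time is harmless, since closing an already closed generator does nothing.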

Performance Optimizations

  • Pandas iterrows(): Use itertuples() instead of to_dict("records") to avoid loading all rows into memory at once
  • Pandas _trunc_date(): Cache time column names in __init__ to avoid repeated list comprehension on each chunk
  • Polars iterrows(): Replace inline lambda x: x with module-level _identity() function to avoid creating new function objects in hot path
  • Polars fetchone(): Cache column names in __init__ to avoid repeated _get_column_names() calls

Closes #638

🤖 Generated with Claude Code

laughingman7743 and others added 4 commits January 4, 2026 16:00
- Add as_pandas() method to DataFrameIterator for collecting all chunks
  into a single DataFrame (mirrors PolarsCursor's as_polars() method)
- Add iter_chunks() method to AthenaPandasResultSet for explicit
  iterator access
- Refactor PandasCursor.iter_chunks() to delegate to ResultSet while
  preserving gc.collect() optimization for memory management
- Add comprehensive docstrings with Google-style documentation
- Update docs/pandas.rst with DataFrameIterator.as_pandas() examples

This aligns the Pandas and Polars cursor implementations for consistency,
making it easier for users to switch between them.

Closes #638

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, row indices reset to 0 for each chunk when iterating.
Now row indices are continuous across all chunks, consistent with
PolarsCursor's DataFrameIterator behavior.

This is the expected behavior since chunking is an optimization detail
that should be transparent to the caller.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…meIterator

Rename the DataFrameIterator classes to include their respective module
prefix for clarity and to avoid confusion when importing from both modules.

- pyathena.pandas.result_set.DataFrameIterator → PandasDataFrameIterator
- pyathena.polars.result_set.DataFrameIterator → PolarsDataFrameIterator

Also updates all documentation and test references.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Pandas iterrows(): Use itertuples() instead of to_dict("records")
  to avoid loading all rows into memory at once
- Pandas _trunc_date(): Cache time column names in __init__ to avoid
  repeated list comprehension on each DataFrame chunk
- Polars iterrows(): Replace inline lambda with module-level _identity
  function to avoid creating new function objects in hot path
- Polars fetchone(): Cache column names in __init__ to avoid repeated
  _get_column_names() calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

laughingman7743 changed the title from "Align PandasCursor chunk processing with PolarsCursor implementation" to "Align PandasCursor with PolarsCursor and optimize DataFrame operations" on Jan 4, 2026
laughingman7743 and others added 3 commits January 4, 2026 18:54
The _column_names_cache was being set before _create_dataframe_iterator()
was called, but for unload queries, _as_polars() updates _metadata with
the Parquet schema. This caused fetchone() to use stale column names
that didn't match the actual DataFrame columns.

Fix by moving cache initialization after _create_dataframe_iterator(),
and using _get_column_names() directly in methods called during init.
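The ordering problem described above can be sketched as follows. `ResultSetSketch` and its attributes are hypothetical and only mimic the shape of the issue: iterator creation may replace the metadata, so the column-name cache has to be filled afterwards.

```python
from typing import List, Optional


class ResultSetSketch:
    """Hypothetical result set illustrating cache initialization order."""

    def __init__(
        self, api_columns: List[str], parquet_columns: Optional[List[str]] = None
    ) -> None:
        self._metadata = list(api_columns)
        self._parquet_columns = parquet_columns
        # Step 1: create the iterator. For an unload-style query this step
        # may replace the metadata with the schema read from the data files.
        self._create_iterator()
        # Step 2: only now snapshot the column names, so the cache reflects
        # the final metadata rather than the initial one.
        self.column_names = self._get_column_names()

    def _create_iterator(self) -> None:
        if self._parquet_columns is not None:
            self._metadata = list(self._parquet_columns)

    def _get_column_names(self) -> List[str]:
        return list(self._metadata)
```

If the two steps in `__init__` were swapped, `column_names` would keep the initial names even after the metadata changed, which is the stale-cache symptom the commit message describes.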

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove _closed flag that was not providing any real functionality
- Update close() to properly close generator if reader is a generator
- Align with PandasDataFrameIterator which doesn't use _closed flag

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Document the PolarsDataFrameIterator class and its as_polars() method
for consistency with pandas.rst documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

laughingman7743 marked this pull request as ready for review January 4, 2026 10:14

Added missing Polars entry to the installation extra packages table,
documenting the pip install command and version requirement (>=1.0.0).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

laughingman7743 merged commit a2d9c20 into master Jan 4, 2026 (5 checks passed)
laughingman7743 deleted the feature/pandas-cursor-alignment branch January 4, 2026 10:47

Development

Successfully merging this pull request may close these issues.

Align PandasCursor chunk processing with PolarsCursor implementation

1 participant