Align PandasCursor with PolarsCursor and optimize DataFrame operations #639
Merged
laughingman7743 merged 8 commits into master on Jan 4, 2026
- Add as_pandas() method to DataFrameIterator for collecting all chunks into a single DataFrame (mirrors PolarsCursor's as_polars() method)
- Add iter_chunks() method to AthenaPandasResultSet for explicit iterator access
- Refactor PandasCursor.iter_chunks() to delegate to ResultSet while preserving gc.collect() optimization for memory management
- Add comprehensive docstrings with Google-style documentation
- Update docs/pandas.rst with DataFrameIterator.as_pandas() examples
This aligns the Pandas and Polars cursor implementations for consistency, making it easier for users to switch between them.
Closes #638
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
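To illustrate what "collecting all chunks into a single DataFrame" means, here is a minimal sketch. `ChunkIteratorSketch` is a hypothetical stand-in, not PyAthena's iterator class, and the use of `pd.concat` with `ignore_index=True` is an assumption about how such a method could be written rather than a copy of the merged code.

```python
from typing import Iterable, Iterator

import pandas as pd


class ChunkIteratorSketch:
    """Hypothetical chunked iterator; not PyAthena's actual class."""

    def __init__(self, chunks: Iterable[pd.DataFrame]) -> None:
        self._chunks = iter(chunks)

    def __iter__(self) -> Iterator[pd.DataFrame]:
        return self

    def __next__(self) -> pd.DataFrame:
        return next(self._chunks)

    def as_pandas(self) -> pd.DataFrame:
        # Drain whatever chunks remain and concatenate them into one frame.
        frames = list(self)
        if not frames:
            return pd.DataFrame()
        # ignore_index=True renumbers rows 0..n-1 across all chunks.
        return pd.concat(frames, ignore_index=True)


chunks = [pd.DataFrame({"id": [1, 2]}), pd.DataFrame({"id": [3, 4]})]
combined = ChunkIteratorSketch(chunks).as_pandas()
```

Note that a method like this materializes every chunk at once, so it gives up the memory benefit of chunked iteration in exchange for convenience.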
Previously, row indices reset to 0 for each chunk when iterating. Now row indices are continuous across all chunks, consistent with PolarsCursor's DataFrameIterator behavior. This is the expected behavior since chunking is an optimization detail that should be transparent to the caller.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
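The index behavior described in this commit can be sketched without pandas. In the illustration below, chunks are plain lists of dicts and `iterrows_continuous` is a hypothetical helper name; the running offset is one way to keep indices continuous, assumed here for clarity rather than taken from the merged code.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]


def iterrows_continuous(chunks: Iterable[List[Row]]) -> Iterator[Tuple[int, Row]]:
    """Yield (index, row) pairs whose index keeps counting across chunks."""
    offset = 0
    for chunk in chunks:
        for local_index, row in enumerate(chunk):
            # Without the offset, local_index would restart at 0 per chunk.
            yield offset + local_index, row
        offset += len(chunk)


chunks = [[{"id": "a"}, {"id": "b"}], [{"id": "c"}]]
indices = [index for index, _ in iterrows_continuous(chunks)]
```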
…meIterator
Rename the DataFrameIterator classes to include their respective module prefix for clarity and to avoid confusion when importing from both modules.
- pyathena.pandas.result_set.DataFrameIterator → PandasDataFrameIterator
- pyathena.polars.result_set.DataFrameIterator → PolarsDataFrameIterator
Also updates all documentation and test references.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Pandas iterrows(): Use itertuples() instead of to_dict("records")
to avoid loading all rows into memory at once
- Pandas _trunc_date(): Cache time column names in __init__ to avoid
repeated list comprehension on each DataFrame chunk
- Polars iterrows(): Replace inline lambda with module-level _identity
function to avoid creating new function objects in hot path
- Polars fetchone(): Cache column names in __init__ to avoid repeated
_get_column_names() calls
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The _column_names_cache was being set before _create_dataframe_iterator() was called, but for unload queries, _as_polars() updates _metadata with the Parquet schema. This caused fetchone() to use stale column names that didn't match the actual DataFrame columns. Fix by moving cache initialization after _create_dataframe_iterator(), and using _get_column_names() directly in methods called during init.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
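The ordering problem in this commit can be reproduced in a few lines. `ResultSetSketch` is a hypothetical class, and the column names `_col0` and `user_id` are made up; only the order of "cache" versus "load" mirrors the description above.

```python
from typing import List


class ResultSetSketch:
    """Hypothetical result set showing why cache order matters."""

    def __init__(self, cache_before_load: bool) -> None:
        self._metadata: List[str] = ["_col0"]  # names known before loading
        if cache_before_load:
            # Buggy order: the cache captures names that _load() will replace.
            self._column_names_cache = self._get_column_names()
        self._load()
        if not cache_before_load:
            # Fixed order: cache only after metadata has its final value.
            self._column_names_cache = self._get_column_names()

    def _get_column_names(self) -> List[str]:
        return list(self._metadata)

    def _load(self) -> None:
        # Stand-in for a load step that overwrites metadata with a file schema.
        self._metadata = ["user_id"]


stale = ResultSetSketch(cache_before_load=True)._column_names_cache
fresh = ResultSetSketch(cache_before_load=False)._column_names_cache
```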
- Remove _closed flag that was not providing any real functionality
- Update close() to properly close generator if reader is a generator
- Align with PandasDataFrameIterator which doesn't use _closed flag
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
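As a rough illustration of closing a generator-backed reader, the sketch below checks for `types.GeneratorType` and calls the generator's `close()`, which raises `GeneratorExit` inside it so that `finally` blocks run. `IteratorSketch` and `read_chunks` are hypothetical names, and the `isinstance` check is an assumed way to detect a generator, not necessarily what the merged code does.

```python
import types
from typing import Any, Iterator, List

events: List[str] = []


def read_chunks() -> Iterator[int]:
    try:
        yield 1
        yield 2
    finally:
        # Runs when the generator is exhausted or closed after it has started.
        events.append("released")


class IteratorSketch:
    """Hypothetical iterator wrapper; not PyAthena's class."""

    def __init__(self, reader: Iterator[Any]) -> None:
        self._reader = reader

    def __iter__(self) -> "IteratorSketch":
        return self

    def __next__(self) -> Any:
        return next(self._reader)

    def close(self) -> None:
        # Only generators expose close(); other iterators are left alone.
        if isinstance(self._reader, types.GeneratorType):
            self._reader.close()


iterator = IteratorSketch(read_chunks())
first = next(iterator)
iterator.close()
```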
Document the PolarsDataFrameIterator class and its as_polars() method for consistency with pandas.rst documentation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added missing Polars entry to the installation extra packages table, documenting the pip install command and version requirement (>=1.0.0).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR aligns the Pandas and Polars cursor implementations for consistency and adds performance optimizations for DataFrame operations.
API Alignment
- Add as_pandas() method to PandasDataFrameIterator for collecting all chunks into a single DataFrame (mirrors as_polars())
- Add iter_chunks() method to AthenaPandasResultSet for explicit iterator access
- Refactor PandasCursor.iter_chunks() to delegate to ResultSet while preserving gc.collect() optimization
Bug Fixes
- Fix iterrows() to maintain continuous row indices across chunks (was resetting to 0 for each chunk)
Refactoring
- Rename DataFrameIterator to PandasDataFrameIterator and PolarsDataFrameIterator for clarity
- Remove _closed flag from PolarsDataFrameIterator
- Update close() to properly close generator resources
- Update documentation (docs/pandas.rst, docs/polars.rst, docs/api/pandas.rst, docs/api/polars.rst)
Performance Optimizations
- Pandas iterrows(): Use itertuples() instead of to_dict("records") to avoid loading all rows into memory at once
- Pandas _trunc_date(): Cache time column names in __init__ to avoid repeated list comprehension on each chunk
- Polars iterrows(): Replace inline lambda x: x with module-level _identity() function to avoid creating new function objects in hot path
- Polars fetchone(): Cache column names in __init__ to avoid repeated _get_column_names() calls
Closes #638
🤖 Generated with Claude Code