ThrawnCA commented on Mar 2, 2026
- Copy chunked processing from upstream
- Gracefully handle the extra blank cells generated by Excel
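
The PR text does not show how the extra blank cells are handled. As a rough illustration only, trailing empty cells could be trimmed from each row before loading. The helper name `strip_trailing_empty_cells` is hypothetical and is not taken from the xloader code:

```python
def strip_trailing_empty_cells(rows):
    """Yield each row with trailing empty cells removed.

    Hypothetical helper for illustration; not the xloader implementation.
    Empty cells in the middle of a row are kept so columns stay aligned.
    """
    for row in rows:
        trimmed = list(row)
        # Pop cells from the right while they are None or whitespace-only
        while trimmed and (trimmed[-1] is None or str(trimmed[-1]).strip() == ""):
            trimmed.pop()
        yield trimmed
```

Only cells at the end of a row are dropped, which matches the padding Excel tends to add when a sheet has stray formatting to the right of the data.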
…files

- Add split_copy_by_size() function to process files in configurable chunks
- Extract copy_file() helper for PostgreSQL COPY operations
- Add ckanext.xloader.copy_chunk_size config (default: 1GB)
- Process large files (>2GB) without memory exhaustion or system freezing
- Maintain progress logging for each chunk processed
- Prevent system unresponsiveness during large file uploads

Fixes ckan#259
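
The body of `split_copy_by_size()` is not shown in this conversation. The general idea of size-based chunking can be sketched as below, assuming a line-oriented CSV with a single header row. The function name `iter_chunks_by_size` and its parameters are hypothetical:

```python
import io


def iter_chunks_by_size(text_stream, max_bytes, encoding="utf-8"):
    """Yield CSV text chunks of roughly max_bytes, each starting with the header.

    Illustrative sketch only; not the actual split_copy_by_size() code.
    Assumes one header row and no quoted fields containing newlines.
    """
    header = text_stream.readline()
    if not header:
        return
    lines, size = [], 0
    for line in text_stream:
        line_size = len(line.encode(encoding))
        # Start a new chunk when adding this line would exceed the limit
        if lines and size + line_size > max_bytes:
            yield header + "".join(lines)
            lines, size = [], 0
        lines.append(line)
        size += line_size
    if lines:
        yield header + "".join(lines)
```

Because only one chunk is held in memory at a time, peak memory stays close to the chunk size rather than the file size. A file smaller than `max_bytes` comes back as a single chunk, so small files are not split.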
- Change chunk processing logs from INFO to DEBUG level per PR feedback
- Add test_chunks.py with comprehensive chunked processing tests
- Tests verify chunking behavior and data integrity for large files
- Ensures small files are not unnecessarily chunked

Addresses PR review feedback on logging levels.
Fix test failure by sorting records by ID before validation to ensure consistent results regardless of database return order.
- Replace .format() with %s placeholders in logger calls for better performance
- Remove unnecessary int() conversion in debug log message
- Improve logging consistency across chunked processing functions
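
For context on the placeholder change: when arguments are passed to the logger separately, the standard `logging` module interpolates the message only if a handler actually emits the record. A minimal sketch follows. The logger name and message text are made up for illustration and do not come from the xloader source:

```python
import logging

logger = logging.getLogger("xloader_example")  # hypothetical logger name


def log_chunk(chunk_number, chunk_bytes):
    # Arguments are passed separately, so the string is only built
    # when DEBUG logging is enabled for this logger.
    logger.debug("Copying chunk %s (%s bytes)", chunk_number, chunk_bytes)
```

With `.format()`, the string would be built on every call, even when DEBUG output is switched off.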
- Add encoding parameter to split_copy_by_size with utf-8 default
- Add ckanext.xloader.copy_chunk_size config option to config_declaration.yaml file
- Replace hardcoded 'utf-8' with configurable variable
Enables custom encoding support for CSV processing.
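
As a rough sketch of how a chunk size option with a 1GB default might be read, the example below uses a plain dict in place of CKAN's config object. The helper name `get_copy_chunk_size` is hypothetical. Only the option name and the default value come from this PR:

```python
DEFAULT_COPY_CHUNK_SIZE = 1024 ** 3  # 1GB, the default stated in this PR


def get_copy_chunk_size(config):
    """Return the chunk size in bytes from a dict-like config.

    Hypothetical helper; a plain dict stands in for CKAN's config object.
    """
    raw = config.get("ckanext.xloader.copy_chunk_size", DEFAULT_COPY_CHUNK_SIZE)
    chunk_size = int(raw)  # config values usually arrive as strings
    if chunk_size <= 0:
        raise ValueError("copy_chunk_size must be a positive number of bytes")
    return chunk_size
```

Rejecting non-positive values early is a design choice of this sketch, made so that a misconfiguration fails at startup instead of partway through a load.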
…ked file processing

## Description

Implements chunked file processing to resolve system freezing when loading very large files (>2GB) into DataStore. Fixes ckan#259

## Problem

- XLoader would freeze/hang when processing files >2GB
- Entire file loaded into memory, causing system unresponsiveness
- No progress feedback during large file processing
- Memory exhaustion on very large datasets

## Solution

- **Chunked Processing**: Split large files into configurable chunks (default: 1GB)
- **Progress Logging**: Log each chunk as it's processed
- **Memory Efficiency**: Consistent memory usage regardless of file size
- **Configurable**: New `ckanext.xloader.copy_chunk_size` setting

## Changes

- Add `split_copy_by_size()` function for chunked file processing
- Extract `copy_file()` helper for PostgreSQL COPY operations
- Add configuration option `ckanext.xloader.copy_chunk_size` (default: 1GB)
- Update tests to use smaller chunk size for testing
- Maintain existing functionality for smaller files

## Configuration

```ini
# Optional: Set chunk size (default: 1073741824 = 1GB)
# 100MB chunks
ckanext.xloader.copy_chunk_size = 104857600
```

Fixes ckan#259
Sync with upstream and add test for trailing empty cells
Codecov Report

❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| | master | #139 | +/- |
|---|---|---|---|
| Coverage | 75.93% | 77.13% | +1.20% |
| Files | 23 | 24 | +1 |
| Lines | 2526 | 2659 | +133 |
| Hits | 1918 | 2051 | +133 |
| Misses | 608 | 608 | |
duttonw approved these changes on Mar 2, 2026