Develop to master - merge features from upstream, fix handling of extra cells #139

Merged
ThrawnCA merged 18 commits into master from develop
Mar 2, 2026

@ThrawnCA ThrawnCA commented Mar 2, 2026

  • Copy chunked processing from upstream
  • Gracefully handle the extra blank cells generated by Excel
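
The page does not show how the extra cells are handled, so the following is only a minimal sketch of one possible approach: drop blank cells that extend past the header width and leave everything else alone. The function name `strip_trailing_empty_cells` and its `expected_width` parameter are hypothetical and are not taken from the XLoader code.

```python
def _is_blank(cell):
    """Treat None and whitespace-only strings as blank."""
    return cell is None or (isinstance(cell, str) and cell.strip() == "")


def strip_trailing_empty_cells(row, expected_width):
    """Return a copy of row without blank cells beyond expected_width.

    Hypothetical helper: non-blank extra cells are kept, so genuinely
    malformed rows can still be detected by the caller.
    """
    trimmed = list(row)
    while len(trimmed) > expected_width and _is_blank(trimmed[-1]):
        trimmed.pop()
    return trimmed
```

Blank cells inside the expected width are preserved, since they may be legitimate empty values.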

Chava Goldshtein and others added 18 commits September 25, 2025 09:57
…files

- Add split_copy_by_size() function to process files in configurable chunks
- Extract copy_file() helper for PostgreSQL COPY operations
- Add ckanext.xloader.copy_chunk_size config (default: 1GB)
- Process large files (>2GB) without memory exhaustion or system freezing
- Maintain progress logging for each chunk processed
- Prevent system unresponsiveness during large file uploads
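
As an illustration of the chunking idea described in this commit, here is a minimal sketch that splits a text stream into chunks on line boundaries, each at most `max_bytes` when encoded. It is not the actual `split_copy_by_size()` implementation; the name `iter_chunks_by_size` and its signature are assumptions, and header handling and the PostgreSQL COPY step are left out.

```python
import io


def iter_chunks_by_size(text_stream, max_bytes, encoding="utf-8"):
    """Yield strings of whole lines whose encoded size is <= max_bytes.

    Hypothetical sketch: a single line larger than max_bytes is still
    yielded on its own rather than being split mid-row.
    """
    chunk = []
    chunk_bytes = 0
    for line in text_stream:
        line_bytes = len(line.encode(encoding))
        # Flush the current chunk before it would exceed the limit.
        if chunk and chunk_bytes + line_bytes > max_bytes:
            yield "".join(chunk)
            chunk, chunk_bytes = [], 0
        chunk.append(line)
        chunk_bytes += line_bytes
    if chunk:
        yield "".join(chunk)


if __name__ == "__main__":
    data = io.StringIO("a,1\nb,2\nc,3\n")
    for number, chunk in enumerate(iter_chunks_by_size(data, max_bytes=8), 1):
        print(number, repr(chunk))
```

Only one chunk is held in memory at a time, which is the property that keeps memory use roughly constant regardless of file size.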

Fixes ckan#259
- Change chunk processing logs from INFO to DEBUG level per PR feedback
- Add test_chunks.py with comprehensive chunked processing tests
- Tests verify chunking behavior and data integrity for large files
- Ensures small files are not unnecessarily chunked

Addresses PR review feedback on logging levels.
Fix test failure by sorting records by ID before validation to ensure
consistent results regardless of database return order.
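
The fix above can be sketched as follows. Without an explicit `ORDER BY`, a database may return rows in any order, so a test that compares against an ordered list should sort first. The helper name `sort_records_by_id` is hypothetical, and using `_id` as the key is an assumption based on the DataStore's row identifier column.

```python
from operator import itemgetter


def sort_records_by_id(records):
    """Return a new list of records ordered by their '_id' field."""
    return sorted(records, key=itemgetter("_id"))


if __name__ == "__main__":
    returned = [{"_id": 3, "v": "c"}, {"_id": 1, "v": "a"}, {"_id": 2, "v": "b"}]
    print(sort_records_by_id(returned))
```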
- Replace .format() with %s placeholders in logger calls for better performance
- Remove unnecessary int() conversion in debug log message
- Improve logging consistency across chunked processing functions
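
To illustrate the logging change: passing arguments to the logger with `%s` placeholders defers string formatting until a handler actually emits the record, whereas `.format()` builds the string even when the DEBUG level is disabled. The function `log_chunk` and its message text are made up for this sketch and do not come from the XLoader source.

```python
import logging


def log_chunk(logger, chunk_number, size_bytes):
    """Log chunk progress at DEBUG level using lazy %s formatting."""
    # Eager alternative (formats even if DEBUG is off):
    #   logger.debug("Copying chunk {} ({} bytes)".format(chunk_number, size_bytes))
    logger.debug("Copying chunk %s (%s bytes)", chunk_number, size_bytes)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log_chunk(logging.getLogger("xloader_sketch"), 3, 1024)
```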
- Add encoding parameter to split_copy_by_size with utf-8 default
- Add ckanext.xloader.copy_chunk_size config option to config_declaration.yaml file
- Replace hardcoded 'utf-8' with configurable variable

Enables custom encoding support for CSV processing.
…ked file processing

## Description
Implements chunked file processing to resolve system freezing when loading very large files (>2GB) into DataStore.

Fixes ckan#259

## Problem
- XLoader would freeze/hang when processing files >2GB
- Entire file loaded into memory causing system unresponsiveness  
- No progress feedback during large file processing
- Memory exhaustion on very large datasets

## Solution
- **Chunked Processing**: Split large files into configurable chunks (default: 1GB)
- **Progress Logging**: Log each chunk as it's processed
- **Memory Efficiency**: Consistent memory usage regardless of file size
- **Configurable**: New `ckanext.xloader.copy_chunk_size` setting

## Changes
- Add `split_copy_by_size()` function for chunked file processing
- Extract `copy_file()` helper for PostgreSQL COPY operations
- Add configuration option `ckanext.xloader.copy_chunk_size` (default: 1GB)
- Update tests to use smaller chunk size for testing
- Maintain existing functionality for smaller files
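
A minimal sketch of how a setting such as `ckanext.xloader.copy_chunk_size` might be read with the 1GB default described above. A plain dict stands in for the CKAN config object, and `get_copy_chunk_size` is a hypothetical helper, not the extension's actual code.

```python
DEFAULT_CHUNK_SIZE = 1024 ** 3  # 1GB = 1073741824 bytes


def get_copy_chunk_size(config):
    """Return the chunk size in bytes, falling back to the 1GB default."""
    raw = config.get("ckanext.xloader.copy_chunk_size")
    if raw is None or str(raw).strip() == "":
        return DEFAULT_CHUNK_SIZE
    size = int(raw)  # raises ValueError on non-numeric values
    if size <= 0:
        raise ValueError("copy_chunk_size must be a positive number of bytes")
    return size


if __name__ == "__main__":
    print(get_copy_chunk_size({}))
    print(get_copy_chunk_size({"ckanext.xloader.copy_chunk_size": "104857600"}))
```

Rejecting zero and negative values up front avoids a chunking loop that could never make progress.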

## Configuration
```ini
# Optional: set the chunk size in bytes (default: 1073741824 = 1GB)
# Example: 100MB chunks
ckanext.xloader.copy_chunk_size = 104857600
```

Fixes ckan#259
Sync with upstream and add test for trailing empty cells
@ThrawnCA ThrawnCA requested a review from a team March 2, 2026 05:49

github-actions bot commented Mar 2, 2026

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 2659 | 2031 | 76% | 0% | 🟢 |

New Files

| File | Coverage | Status |
|------|----------|--------|
| ckanext/xloader/tests/test_chunks.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| ckanext/xloader/loader.py | 90% | 🟢 |
| ckanext/xloader/parser.py | 98% | 🟢 |
| ckanext/xloader/tests/test_jobs.py | 99% | 🟢 |
| ckanext/xloader/tests/test_loader.py | 91% | 🟢 |
| ckanext/xloader/tests/test_parser.py | 100% | 🟢 |
| TOTAL | 96% | 🟢 |

updated for commit: 0b86a6e by action🐍


codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 97.24138% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.13%. Comparing base (fc5ec35) to head (0b86a6e).
⚠️ Report is 21 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ckanext/xloader/loader.py | 91.48% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #139      +/-   ##
==========================================
+ Coverage   75.93%   77.13%   +1.20%     
==========================================
  Files          23       24       +1     
  Lines        2526     2659     +133     
==========================================
+ Hits         1918     2051     +133     
  Misses        608      608              


@ThrawnCA ThrawnCA merged commit 1ea57eb into master Mar 2, 2026
11 checks passed