Conversation
This new script downloads all unprocessed GISAID tar files from S3, extracts and transforms them to NDJSON format, and outputs a manifest of successfully processed files. This simplifies the manual upload workflow by eliminating the need for the manual-upload Snakefile. Users can now simply run:

    aws s3 cp gisaid_*.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/

Key features:
- Supports both S3 and local file paths (for testing)
- Processes tars in sorted order to ensure deduplication keeps newest records
- Handles errors gracefully, continuing with other tars if one fails
- Outputs manifest for atomic S3 moves after successful upload

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace complex checkpoint-based workflow with single-rule processing:
- Remove checkpoint fetch_unprocessed_files and related rules
- Remove decompress_unprocessed_files and link_gisaid_metadata_and_fasta
- Remove aggregate_gisaid_ndjsons function
- Add process_all_unprocessed_tars rule that processes all tars at once
- Simplify concatenate_gisaid_ndjsons to just combine cache + new records
- Replace per-file moving with batch mv_all_processed_tars

This eliminates 70 lines of code and removes the need for Snakemake to dynamically discover individual tar files. All processing happens in a single rule via the process-unprocessed-tars script.

Benefits:
- No more complex wildcard/checkpoint logic
- Atomic S3 moves only after successful upload
- Manifest-based moving ensures only processed files are moved

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improvements:
- List all tar files found before processing
- Show final line count written to NDJSON
- Display success summary with checkmarks for processed tars
- Show failed count warning if any tars fail to process
Example output:

```
Found 2 tar files to process:
- file1.tar
- file2.tar
Processing file1.tar...
Processed 361 records
Wrote 722 lines to output.ndjson
Successfully processed 2/2 tar files
Processed tars:
✓ file1.tar
✓ file2.tar
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Pull request overview
This PR simplifies GISAID data ingestion by replacing a multi-step manual process with a single tar file upload to S3. Instead of downloading, renaming, and processing files through a local Snakemake workflow, users now download a single tar file from GISAID and upload it directly to S3.
Key changes:
- Introduced a new Python script (`bin/process-unprocessed-tars`) that downloads and processes all tar files from S3 in a single operation
- Replaced Snakemake checkpoint-based workflow with a simpler bulk processing approach
- Removed the entire `manual-upload/` directory and its associated workflow
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `bin/process-unprocessed-tars` | New script that downloads all unprocessed tar files from S3, extracts and processes them through the augur/transform pipeline, and outputs combined NDJSON plus a manifest of successfully processed files |
| `workflow/snakemake_rules/fetch_sequences.smk` | Replaced checkpoint-based individual file processing with a single rule that processes all tars in bulk; simplified the concatenation rule to work with the new workflow |
| `workflow/snakemake_rules/upload.smk` | Replaced per-file moving logic with a single rule that moves all processed tars based on the manifest file; simplified trial run detection |
| `README.md` | Updated documentation to reflect the new simplified upload process (single tar download and S3 upload) |
| `manual-upload/Snakefile` | Removed the entire manual upload Snakefile that handled individual file pairs |
| `manual-upload/README.md` | Removed detailed manual upload documentation that is no longer needed |
| `manual-upload/config.yaml` | Removed manual upload configuration file |
Comments suppressed due to low confidence (1)
workflow/snakemake_rules/fetch_sequences.smk:79

- When both inputs are empty lists (when `config.get('s3_src')` is falsy), the `cat` command will be invoked with no arguments (`cat \ | ./bin/dedup-by-gisaid-id ...`). This will cause `cat` to wait for stdin indefinitely, hanging the pipeline. The rule should either have proper input dependencies or handle the case where `config.get('s3_src')` is not set by not invoking `cat` with empty inputs.

```
r"""
(cat {input.gisaid_cache:q} {input.unprocessed:q} \
    | ./bin/dedup-by-gisaid-id \
    --id-field {params.gisaid_id_field:q} \
    > {output.ndjson:q}) 2> {log:q}
"""
```
workflow/snakemake_rules/upload.smk (outdated)

```
# Only run if s3_src == s3_dst (not a trial run)
if [[ "{params.s3_src}" != "{params.s3_dst}" ]]; then
    echo "Skipping move (trial run: s3_src != s3_dst)"
    exit 0
fi
```
The condition checks if params.s3_src != params.s3_dst, but these params include different subdirectories (/unprocessed/ vs /processed/), so they will ALWAYS be unequal. This means the tar files will never be moved to processed/. The condition should compare the base config values instead: use shell variables like s3_src_base="{config[s3_src]}" and s3_dst_base="{config[s3_dst]}" and compare those, or rely solely on the Snakemake-level check at line 176 and remove this shell-level check entirely.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
joverlee521 left a comment
I'm fine with moving to the .tar files if that's easier for manual processing, but I'd still push to keep our own standard file naming to ensure sorting order, and also to compress the .tar files before uploading to S3.
README.md context:

> ```
> . \
> --configfile config/local_gisaid.yaml
> ```
> 1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
There was a problem hiding this comment.
I'd push to rename the file during download to our own standardized name, e.g. `2026-01-07-01.tar`, so that we are not at the mercy of the default name that GISAID produces. This ensures that the downloaded `*.tar` files are sorted as expected for the deduplication process.
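The naming scheme the reviewer suggests could be sketched as follows (the `standardized_name` helper is hypothetical, not code from the PR): ISO dates plus a zero-padded counter sort lexicographically in chronological order, which GISAID's default names don't guarantee.

```python
from datetime import date

def standardized_name(download_date: date, sequence: int) -> str:
    # e.g. 2026-01-07-01.tar: ISO date plus a two-digit per-day counter
    return f"{download_date.isoformat()}-{sequence:02d}.tar"

names = [
    standardized_name(date(2026, 1, 7), 2),
    standardized_name(date(2025, 12, 30), 1),
    standardized_name(date(2026, 1, 7), 1),
]
# Lexicographic sort of ISO-dated names matches chronological order,
# so the dedup step sees tars in a predictable sequence.
print(sorted(names))
```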
README.md (outdated)

> ```
> --configfile config/local_gisaid.yaml
> ```
> 1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
> 2. Upload this file to S3: `aws s3 cp /path/to/downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/`
Even if we are using the .tar, I'd still want to compress these before/during upload so that they don't bloat our S3 bucket.
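The compression step could look like the following sketch. The PR ultimately adopts zstd for this; gzip is used here only because it is in the Python standard library, and the `compress_tar` helper is hypothetical.

```python
import gzip
import shutil
from pathlib import Path

def compress_tar(tar_path: Path) -> Path:
    # Compress the .tar before uploading so archived downloads don't bloat
    # the S3 bucket. (The real workflow uses zstd; gzip is a stdlib
    # stand-in with the same shape.)
    compressed = tar_path.with_suffix(tar_path.suffix + ".gz")
    with open(tar_path, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return compressed
```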
bin/process-unprocessed-tars (outdated)

```python
metadata_file = metadata_files[0]
sequence_file = sequence_files[0]
```
non-blocking
Might be nice to output warning if there are multiple files.
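One way to act on this suggestion (the `pick_single` helper is hypothetical, not from the PR): warn when the tar unexpectedly contains more than one metadata or sequence file instead of silently taking the first.

```python
import sys

def pick_single(files, kind):
    # The tar should contain exactly one metadata and one sequence file;
    # warn rather than silently pick if there are more.
    if len(files) > 1:
        print(f"Warning: found {len(files)} {kind} files; using {files[0]}",
              file=sys.stderr)
    return files[0]
```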
bin/process-unprocessed-tars (outdated)

```python
# Parse records
import io
records = list(load_ndjson(io.BytesIO(stdout)))
all_records.extend(records)
```
I'm worried we could run into out of memory errors if we hold all records in memory here. I'd write out the records per tarfile in case we ever need to do a full reprocess.
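The streaming approach the reviewer asks for could be sketched like this (the `write_records` helper is illustrative, not the PR's final code): append each tar's records to the combined NDJSON as they are parsed, so memory use stays bounded by one record at a time instead of growing with the full dataset.

```python
import json
from pathlib import Path

def write_records(records, out_path: Path) -> int:
    # Append records to the combined NDJSON as they stream in, instead of
    # extending an in-memory all_records list across every tar.
    count = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count
```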
bin/process-unprocessed-tars (outdated)

```python
# Download all tar files from S3
tmp_dir = Path("tmp/unprocessed-tars")
tmp_dir.mkdir(parents=True, exist_ok=True)
```
non-blocking
Should this tmp_dir get cleaned up at the end of the script?
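One standard way to get the cleanup for free, sketched here rather than taken from the PR, is `tempfile.TemporaryDirectory`, which removes the directory (and the downloaded tars inside it) when the context exits, even if processing raises partway through.

```python
import tempfile
from pathlib import Path

# The context manager deletes the directory on exit, so no explicit
# cleanup code is needed at the end of the script.
with tempfile.TemporaryDirectory(prefix="unprocessed-tars-") as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "example.tar").write_bytes(b"")  # stand-in for a download
    assert tmp_dir.exists()

assert not tmp_dir.exists()  # cleaned up automatically
```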
bin/process-unprocessed-tars (outdated)

```python
if failed_count > 0:
    print(f"\nWarning: {failed_count} tar files failed to process", file=sys.stderr)
```
non-blocking
Could be nice to have a list of failed tar files for debugging.
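Collecting the names rather than just a count could look like this sketch (the `process` stub and file names are made up for illustration):

```python
failed_tars = []

def process(tar_name: str) -> bool:
    # Stand-in for the real per-tar processing, which can fail.
    return not tar_name.endswith("-corrupt.tar")

for name in ["a.tar", "b-corrupt.tar", "c.tar"]:
    if not process(name):
        failed_tars.append(name)

if failed_tars:
    # Naming the failures makes debugging easier than a bare count.
    print(f"Warning: {len(failed_tars)} tar files failed to process:")
    for name in failed_tars:
        print(f"  - {name}")
```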
```python
gisaid_cache="data/gisaid/gisaid_cache.ndjson" if config.get('s3_src') else [],
unprocessed="data/gisaid/unprocessed-combined.ndjson" if config.get('s3_src') else [],
```
If both of these inputs are only available when s3_src is set, then the cat command below will hang if s3_src is not set. This also means there's no way to run the workflow on a local set of tar files.
bin/process-unprocessed-tars (outdated)

```python
# Sort by filename to ensure consistent ordering
# Process oldest first so newer records overwrite during deduplication
for tar_path in sorted(tar_files):
```
The dedup-by-gisaid-id script keeps the first record, so this needs to be sorted in reverse order to keep the newer records.
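A small demonstration of the keep-first semantics behind this comment (the `dedup_keep_first` function mirrors the described behavior of `dedup-by-gisaid-id` but is written here for illustration): with oldest-first input the stale record wins, so the loop needs `sorted(tar_files, reverse=True)` to keep the newer record.

```python
def dedup_keep_first(records, id_field="gisaid_epi_isl"):
    # Mirrors the described dedup-by-gisaid-id behavior: the FIRST
    # occurrence of each ID wins, so newer records must come first.
    seen = set()
    for record in records:
        if record[id_field] not in seen:
            seen.add(record[id_field])
            yield record

old = {"gisaid_epi_isl": "EPI_ISL_1", "version": "old"}
new = {"gisaid_epi_isl": "EPI_ISL_1", "version": "new"}

# Oldest-first input keeps the stale record...
assert list(dedup_keep_first([old, new]))[0]["version"] == "old"
# ...so tars must be iterated newest first to keep the fresh one.
assert list(dedup_keep_first([new, old]))[0]["version"] == "new"
```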
```diff
 shell:
     r"""
-    (cat {input.ndjsons:q} \
+    (cat {input.gisaid_cache:q} {input.unprocessed:q} \
```
The unprocessed file needs to be before the gisaid cache since the dedup-by-gisaid-id script keeps the first record
Suggested change:

```diff
-(cat {input.gisaid_cache:q} {input.unprocessed:q} \
+(cat {input.unprocessed:q} {input.gisaid_cache:q} \
```
This commit refactors the GISAID tar file processing to fix critical bugs, reduce memory usage, improve code readability, and reorganize S3 storage.

Critical bug fixes:
- Fix deduplication order: process tar files newest-first (reverse sorted) so dedup-by-gisaid-id keeps newest records (it keeps first occurrence)
- Fix concatenation order: put tar records before cache in pipeline so newer records from tars are kept during deduplication
- Fix tar extraction deprecation warning by using filter='data' parameter (Python 3.12+) with fallback for older versions

Memory optimization:
- Stream NDJSON output directly to file instead of accumulating all records in memory, preventing OOM errors on large tar sets

Code improvements:
- Refactor tar extraction logic into process_tar_file() function for better readability and testability
- Simplify subprocess handling by using a single shell command with a pipe instead of manually managing multiple Popen processes with shlex.quote()
- Add warning when multiple metadata/sequence files are found in a tar
- Track and list failed tar files by name, not just count

S3 reorganization:
- Move from gisaid-downloads/unprocessed to gisaid-tars/unprocessed and gisaid-tars/processed for cleaner separation of the tar-based workflow
- Compress processed tars with zstd before archiving to reduce storage costs
- Move compression and archival to end of workflow (upload.smk) after all processing succeeds, preventing premature cleanup on failure
- Update local paths: tmp/gisaid-tars-unprocessed, data/gisaid/tar-combined.ndjson, data/gisaid/tar-processed-manifest.txt

Documentation:
- Update README with new S3 paths and note about zstd compression

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed character validation that rejected filenames containing spaces, parentheses, and other special characters. Fixed shell quoting to properly handle these filenames by quoting the entire S3 path as a single unit. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Use tee in process_all_unprocessed_tars to show output in both logs and console
- Add single-line log messages showing input/output line counts in concatenate_gisaid_ndjsons
- Makes debugging easier by showing tar processing progress and deduplication statistics in interleaved logs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR simplifies the process of adding new GISAID sequences.
It's no longer required to a) download multiple files, b) rename them, and c) run a local Snakemake workflow. Instead one can simply download a single file (a .tar containing both sequences and metadata) from GISAID and push it with a single `aws s3 cp` command:

```
aws s3 cp /Users/cr/Downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/
```

Rather than using checkpointing in the Snakemake workflow, there's now a single Python script that downloads all tars and turns them into NDJSON.
Test run: https://github.com/nextstrain/ncov-ingest/actions/runs/20780469229/job/59676511113