
Simplify GISAID data update #515

Open
corneliusroemer wants to merge 11 commits into master from simplify-gisaid-manual-upload

Conversation

@corneliusroemer (Member) commented Jan 7, 2026

This PR simplifies the process of adding new GISAID sequences.

It's no longer required to a) download multiple files, b) rename them, and c) run a local Snakemake workflow. Instead, one can simply download a single file (a .tar containing both sequences and metadata) from GISAID and push it with a single aws s3 cp command: `aws s3 cp /Users/cr/Downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/`

Rather than using checkpointing in the Snakemake workflow, there is now a single Python script that downloads all tars and turns them into NDJSON.

Test run: https://github.com/nextstrain/ncov-ingest/actions/runs/20780469229/job/59676511113
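A minimal sketch of that single-script approach, using local paths and a hypothetical function name (the real bin/process-unprocessed-tars also downloads the tars from S3 and pipes records through the GISAID transform pipeline, both omitted here):

```python
import tarfile
from pathlib import Path

def process_unprocessed_tars(tar_dir: Path, out_ndjson: Path, manifest: Path) -> int:
    """Walk every *.tar in tar_dir, append the records from each tar's
    .ndjson member(s) to one combined output file, and record successfully
    processed tars in a manifest.  Returns the number of lines written."""
    processed: list[str] = []
    n_lines = 0
    with out_ndjson.open("w") as out:
        for tar_path in sorted(tar_dir.glob("*.tar")):
            try:
                with tarfile.open(tar_path) as tar:
                    for member in tar.getmembers():
                        if not member.name.endswith(".ndjson"):
                            continue
                        for raw in tar.extractfile(member):
                            out.write(raw.decode())
                            n_lines += 1
                processed.append(tar_path.name)
            except tarfile.TarError:
                # Handle errors gracefully: skip a corrupt tar, keep the rest
                continue
    manifest.write_text("".join(name + "\n" for name in processed))
    return n_lines
```

The manifest of successes is what later allows S3 moves to be restricted to files that were actually processed.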

corneliusroemer and others added 5 commits January 7, 2026 11:53
This new script downloads all unprocessed GISAID tar files from S3,
extracts and transforms them to NDJSON format, and outputs a manifest
of successfully processed files.

This simplifies the manual upload workflow by eliminating the need for
the manual-upload Snakefile. Users can now simply run:
  aws s3 cp gisaid_*.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/

Key features:
- Supports both S3 and local file paths (for testing)
- Processes tars in sorted order to ensure deduplication keeps newest records
- Handles errors gracefully, continuing with other tars if one fails
- Outputs manifest for atomic S3 moves after successful upload

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace complex checkpoint-based workflow with single-rule processing:
- Remove checkpoint fetch_unprocessed_files and related rules
- Remove decompress_unprocessed_files and link_gisaid_metadata_and_fasta
- Remove aggregate_gisaid_ndjsons function
- Add process_all_unprocessed_tars rule that processes all tars at once
- Simplify concatenate_gisaid_ndjsons to just combine cache + new records
- Replace per-file moving with batch mv_all_processed_tars

This eliminates 70 lines of code and removes the need for Snakemake
to dynamically discover individual tar files. All processing happens
in a single rule via the process-unprocessed-tars script.

Benefits:
- No more complex wildcard/checkpoint logic
- Atomic S3 moves only after successful upload
- Manifest-based moving ensures only processed files are moved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improvements:
- List all tar files found before processing
- Show final line count written to NDJSON
- Display success summary with checkmarks for processed tars
- Show failed count warning if any tars fail to process

Example output:
  Found 2 tar files to process:
    - file1.tar
    - file2.tar

  Processing file1.tar...
    Processed 361 records

  Wrote 722 lines to output.ndjson

  Successfully processed 2/2 tar files
  Processed tars:
    ✓ file1.tar
    ✓ file2.tar

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@corneliusroemer corneliusroemer marked this pull request as ready for review January 7, 2026 12:20
Copilot AI review requested due to automatic review settings January 7, 2026 12:20
Copilot AI left a comment

Pull request overview

This PR simplifies GISAID data ingestion by replacing a multi-step manual process with a single tar file upload to S3. Instead of downloading, renaming, and processing files through a local Snakemake workflow, users now download a single tar file from GISAID and upload it directly to S3.

Key changes:

  • Introduced a new Python script (bin/process-unprocessed-tars) that downloads and processes all tar files from S3 in a single operation
  • Replaced Snakemake checkpoint-based workflow with a simpler bulk processing approach
  • Removed the entire manual-upload/ directory and its associated workflow

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Summary of changes per file:

  • `bin/process-unprocessed-tars`: New script that downloads all unprocessed tar files from S3, extracts and processes them through the augur/transform pipeline, and outputs combined NDJSON plus a manifest of successfully processed files
  • `workflow/snakemake_rules/fetch_sequences.smk`: Replaced checkpoint-based individual file processing with a single rule that processes all tars in bulk; simplified the concatenation rule to work with the new workflow
  • `workflow/snakemake_rules/upload.smk`: Replaced per-file moving logic with a single rule that moves all processed tars based on the manifest file; simplified trial run detection
  • `README.md`: Updated documentation to reflect the new simplified upload process (single tar download and S3 upload)
  • `manual-upload/Snakefile`: Removed the entire manual upload Snakefile that handled individual file pairs
  • `manual-upload/README.md`: Removed detailed manual upload documentation that is no longer needed
  • `manual-upload/config.yaml`: Removed manual upload configuration file
Comments suppressed due to low confidence (1)

workflow/snakemake_rules/fetch_sequences.smk:79

  • When both inputs are empty lists (when config.get('s3_src') is falsy), the cat command will be invoked with no arguments ('cat \ | ./bin/dedup-by-gisaid-id ...'). This will cause cat to wait for stdin indefinitely, hanging the pipeline. The rule should either have proper input dependencies or handle the case where config.get('s3_src') is not set by not invoking cat with empty inputs.
        r"""
        (cat {input.gisaid_cache:q} {input.unprocessed:q} \
            | ./bin/dedup-by-gisaid-id \
                --id-field {params.gisaid_id_field:q} \
            > {output.ndjson:q}) 2> {log:q}
        """


Comment on lines +143 to +147
# Only run if s3_src == s3_dst (not a trial run)
if [[ "{params.s3_src}" != "{params.s3_dst}" ]]; then
echo "Skipping move (trial run: s3_src != s3_dst)"
exit 0
fi
Copilot AI commented Jan 7, 2026

The condition checks if params.s3_src != params.s3_dst, but these params include different subdirectories (/unprocessed/ vs /processed/), so they will ALWAYS be unequal. This means the tar files will never be moved to processed/. The condition should compare the base config values instead: use shell variables like s3_src_base="{config[s3_src]}" and s3_dst_base="{config[s3_dst]}" and compare those, or rely solely on the Snakemake-level check at line 176 and remove this shell-level check entirely.

corneliusroemer and others added 2 commits January 7, 2026 13:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@joverlee521 (Contributor) left a comment

I'm fine with moving to the .tar files if that's easier for manual processing, but I'd still push to keep our own standard file naming to ensure sorting order, and also to compress the .tar files before uploading to S3.

1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
Contributor

I'd push to rename the file during download to our own standardized name, e.g. 2026-01-07-01.tar, so that we are not at the mercy of the default name that GISAID produces. This ensures that the downloaded *.tar are sorted as expected for the deduplication process.

README.md Outdated
1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
2. Upload this file to s3: `aws s3 cp /path/to/downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/`
Contributor

Even if we are using the .tar, I'd still want to compress these before/during upload so that they don't bloat our S3 bucket.

Comment on lines +81 to +82
metadata_file = metadata_files[0]
sequence_file = sequence_files[0]
Contributor

non-blocking

Might be nice to output warning if there are multiple files.

# Parse records
import io
records = list(load_ndjson(io.BytesIO(stdout)))
all_records.extend(records)
Contributor

I'm worried we could run into out of memory errors if we hold all records in memory here. I'd write out the records per tarfile in case we ever need to do a full reprocess.
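One way to act on this suggestion (names hypothetical; a sketch, not the PR's code) is to stream each record line to disk as it is produced, once to a per-tar NDJSON file and once to the combined output, so memory stays flat regardless of how many tars are processed:

```python
from pathlib import Path

def stream_tar_records(ndjson_lines, per_tar_path: Path, combined_file) -> int:
    """Write each NDJSON line twice as it arrives: to a per-tar file (kept
    around for a later full reprocess) and to the combined output, instead
    of accumulating every record in a Python list.  Returns the line count."""
    count = 0
    with per_tar_path.open("w") as per_tar:
        for line in ndjson_lines:  # any iterable of NDJSON lines
            per_tar.write(line)
            combined_file.write(line)
            count += 1
    return count
```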

Comment on lines +21 to +23
# Download all tar files from S3
tmp_dir = Path("tmp/unprocessed-tars")
tmp_dir.mkdir(parents=True, exist_ok=True)
Contributor

non-blocking

Should this tmp_dir get cleaned up at the end of the script?

Comment on lines +140 to +141
if failed_count > 0:
print(f"\nWarning: {failed_count} tar files failed to process", file=sys.stderr)
Contributor

non-blocking

Could be nice to have a list of failed tar files for debugging.

Comment on lines +66 to +67
gisaid_cache="data/gisaid/gisaid_cache.ndjson" if config.get('s3_src') else [],
unprocessed="data/gisaid/unprocessed-combined.ndjson" if config.get('s3_src') else [],
Contributor

If both of these inputs are only available when s3_src is set, then the cat command below will hang if s3_src is not set. This also means there's no way to run the workflow on a local set of tar files.

Comment on lines +55 to +57
# Sort by filename to ensure consistent ordering
# Process oldest first so newer records overwrite during deduplication
for tar_path in sorted(tar_files):
Contributor

The dedup-by-gisaid-id script keeps the first record, so this needs to be sorted in reverse order to keep the newer records.
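The ordering issue is easy to see with a toy model of a keep-first deduplicator (this models the keep-first behavior the reviewer describes; the id field name is an assumption):

```python
import json

def dedup_keep_first(ndjson_lines, id_field="gisaid_epi_isl"):
    """Toy keep-first dedup: the FIRST record seen for each ID survives.
    Tars must therefore be fed newest-first (reverse-sorted) so that the
    newest record wins."""
    seen = set()
    for line in ndjson_lines:
        key = json.loads(line)[id_field]
        if key not in seen:
            seen.add(key)
            yield line
```

Feeding the same duplicate ID oldest-first keeps the stale record; reverse-sorting the tar files before processing fixes this.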

shell:
r"""
(cat {input.ndjsons:q} \
(cat {input.gisaid_cache:q} {input.unprocessed:q} \
Contributor

The unprocessed file needs to be before the gisaid cache since the dedup-by-gisaid-id script keeps the first record

Suggested change
(cat {input.gisaid_cache:q} {input.unprocessed:q} \
(cat {input.unprocessed:q} {input.gisaid_cache:q} \

corneliusroemer and others added 3 commits January 21, 2026 18:01
This commit refactors the GISAID tar file processing to fix critical bugs,
reduce memory usage, improve code readability, and reorganize S3 storage.

Critical bug fixes:
- Fix deduplication order: Process tar files newest-first (reverse sorted)
  so dedup-by-gisaid-id keeps newest records (it keeps first occurrence)
- Fix concatenation order: Put tar records before cache in pipeline so
  newer records from tars are kept during deduplication
- Fix tar extraction deprecation warning by using filter='data' parameter
  (Python 3.12+) with fallback for older versions

Memory optimization:
- Stream NDJSON output directly to file instead of accumulating all
  records in memory, preventing OOM errors on large tar sets

Code improvements:
- Refactor tar extraction logic into process_tar_file() function for
  better readability and testability
- Simplify subprocess handling by using single shell command with pipe
  instead of manually managing multiple Popen processes with shlex.quote()
- Add warning when multiple metadata/sequence files found in tar
- Track and list failed tar files by name, not just count

S3 reorganization:
- Move from gisaid-downloads/unprocessed to gisaid-tars/unprocessed and
  gisaid-tars/processed for cleaner separation of tar-based workflow
- Compress processed tars with zstd before archiving to reduce storage costs
- Move compression and archival to end of workflow (upload.smk) after all
  processing succeeds, preventing premature cleanup on failure
- Update local paths: tmp/gisaid-tars-unprocessed, data/gisaid/tar-combined.ndjson,
  data/gisaid/tar-processed-manifest.txt

Documentation:
- Update README with new S3 paths and note about zstd compression

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
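The extraction-filter fallback described in this commit can be sketched as follows (Python's tarfile gained the filter parameter in 3.12; older interpreters raise TypeError when it is passed):

```python
import tarfile

def safe_extractall(tar: tarfile.TarFile, dest: str) -> None:
    """Extract with filter='data' where supported, silencing the Python
    3.12+ deprecation warning about unfiltered extraction, and fall back
    to a plain extractall() on older interpreters."""
    try:
        tar.extractall(dest, filter="data")
    except TypeError:  # Python < 3.12: extractall() has no filter parameter
        tar.extractall(dest)
```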
Removed character validation that rejected filenames containing spaces,
parentheses, and other special characters. Fixed shell quoting to properly
handle these filenames by quoting the entire S3 path as a single unit.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Use tee in process_all_unprocessed_tars to show output in both logs and console
- Add single-line log messages showing input/output line counts in concatenate_gisaid_ndjsons
- Makes debugging easier by showing tar processing progress and deduplication statistics in interleaved logs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>