
Simplify GISAID data update #515

Open
corneliusroemer wants to merge 11 commits into master from simplify-gisaid-manual-upload

Conversation

@corneliusroemer (Member) commented Jan 7, 2026

This PR simplifies the process of adding new GISAID sequences.

It's no longer required to a) download multiple files, b) rename them, and c) run a local Snakemake workflow. Instead, one can simply download a single file (a .tar containing both sequences and metadata) from GISAID and push it with a single aws s3 cp command: `aws s3 cp /Users/cr/Downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/`

Rather than using checkpointing in the Snakemake workflow, there is now a single Python script that downloads all tars and turns them into NDJSON.

Test run: https://github.com/nextstrain/ncov-ingest/actions/runs/20780469229/job/59676511113
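A minimal sketch of that single-script approach, using local paths and a hypothetical function name (the real bin/process-unprocessed-tars also downloads the tars from S3 and pipes records through the GISAID transform pipeline, both omitted here):

```python
import tarfile
from pathlib import Path

def process_unprocessed_tars(tar_dir: Path, out_ndjson: Path, manifest: Path) -> int:
    """Walk every *.tar in tar_dir, append the records from each tar's
    .ndjson member(s) to one combined output file, and record successfully
    processed tars in a manifest.  Returns the number of lines written."""
    processed: list[str] = []
    n_lines = 0
    with out_ndjson.open("w") as out:
        for tar_path in sorted(tar_dir.glob("*.tar")):
            try:
                with tarfile.open(tar_path) as tar:
                    for member in tar.getmembers():
                        if not member.name.endswith(".ndjson"):
                            continue
                        for raw in tar.extractfile(member):
                            out.write(raw.decode())
                            n_lines += 1
                processed.append(tar_path.name)
            except tarfile.TarError:
                # Handle errors gracefully: skip a corrupt tar, keep the rest
                continue
    manifest.write_text("".join(name + "\n" for name in processed))
    return n_lines
```

The manifest of successes is what later allows S3 moves to be restricted to files that were actually processed.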

corneliusroemer and others added 5 commits January 7, 2026 11:53
This new script downloads all unprocessed GISAID tar files from S3,
extracts and transforms them to NDJSON format, and outputs a manifest
of successfully processed files.

This simplifies the manual upload workflow by eliminating the need for
the manual-upload Snakefile. Users can now simply run:
  aws s3 cp gisaid_*.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/

Key features:
- Supports both S3 and local file paths (for testing)
- Processes tars in sorted order to ensure deduplication keeps newest records
- Handles errors gracefully, continuing with other tars if one fails
- Outputs manifest for atomic S3 moves after successful upload

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace complex checkpoint-based workflow with single-rule processing:
- Remove checkpoint fetch_unprocessed_files and related rules
- Remove decompress_unprocessed_files and link_gisaid_metadata_and_fasta
- Remove aggregate_gisaid_ndjsons function
- Add process_all_unprocessed_tars rule that processes all tars at once
- Simplify concatenate_gisaid_ndjsons to just combine cache + new records
- Replace per-file moving with batch mv_all_processed_tars

This eliminates 70 lines of code and removes the need for Snakemake
to dynamically discover individual tar files. All processing happens
in a single rule via the process-unprocessed-tars script.

Benefits:
- No more complex wildcard/checkpoint logic
- Atomic S3 moves only after successful upload
- Manifest-based moving ensures only processed files are moved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Improvements:
- List all tar files found before processing
- Show final line count written to NDJSON
- Display success summary with checkmarks for processed tars
- Show failed count warning if any tars fail to process

Example output:
  Found 2 tar files to process:
    - file1.tar
    - file2.tar

  Processing file1.tar...
    Processed 361 records

  Wrote 722 lines to output.ndjson

  Successfully processed 2/2 tar files
  Processed tars:
    ✓ file1.tar
    ✓ file2.tar

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@corneliusroemer corneliusroemer marked this pull request as ready for review January 7, 2026 12:20
Copilot AI review requested due to automatic review settings January 7, 2026 12:20
Copilot AI left a comment

Pull request overview

This PR simplifies GISAID data ingestion by replacing a multi-step manual process with a single tar file upload to S3. Instead of downloading, renaming, and processing files through a local Snakemake workflow, users now download a single tar file from GISAID and upload it directly to S3.

Key changes:

  • Introduced a new Python script (bin/process-unprocessed-tars) that downloads and processes all tar files from S3 in a single operation
  • Replaced Snakemake checkpoint-based workflow with a simpler bulk processing approach
  • Removed the entire manual-upload/ directory and its associated workflow

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Summary of changes per file:

  • `bin/process-unprocessed-tars`: New script that downloads all unprocessed tar files from S3, extracts and processes them through the augur/transform pipeline, and outputs combined NDJSON plus a manifest of successfully processed files
  • `workflow/snakemake_rules/fetch_sequences.smk`: Replaced checkpoint-based individual file processing with a single rule that processes all tars in bulk; simplified the concatenation rule to work with the new workflow
  • `workflow/snakemake_rules/upload.smk`: Replaced per-file moving logic with a single rule that moves all processed tars based on the manifest file; simplified trial run detection
  • `README.md`: Updated documentation to reflect the new simplified upload process (single tar download and S3 upload)
  • `manual-upload/Snakefile`: Removed the entire manual upload Snakefile that handled individual file pairs
  • `manual-upload/README.md`: Removed detailed manual upload documentation that is no longer needed
  • `manual-upload/config.yaml`: Removed manual upload configuration file
Comments suppressed due to low confidence (1)

workflow/snakemake_rules/fetch_sequences.smk:79

  • When both inputs are empty lists (when config.get('s3_src') is falsy), the cat command will be invoked with no arguments ('cat \ | ./bin/dedup-by-gisaid-id ...'). This will cause cat to wait for stdin indefinitely, hanging the pipeline. The rule should either have proper input dependencies or handle the case where config.get('s3_src') is not set by not invoking cat with empty inputs.
        r"""
        (cat {input.gisaid_cache:q} {input.unprocessed:q} \
            | ./bin/dedup-by-gisaid-id \
                --id-field {params.gisaid_id_field:q} \
            > {output.ndjson:q}) 2> {log:q}
        """


Comment on lines +143 to +147
# Only run if s3_src == s3_dst (not a trial run)
if [[ "{params.s3_src}" != "{params.s3_dst}" ]]; then
echo "Skipping move (trial run: s3_src != s3_dst)"
exit 0
fi
Copilot AI commented Jan 7, 2026

The condition checks if params.s3_src != params.s3_dst, but these params include different subdirectories (/unprocessed/ vs /processed/), so they will ALWAYS be unequal. This means the tar files will never be moved to processed/. The condition should compare the base config values instead: use shell variables like s3_src_base="{config[s3_src]}" and s3_dst_base="{config[s3_dst]}" and compare those, or rely solely on the Snakemake-level check at line 176 and remove this shell-level check entirely.

corneliusroemer and others added 2 commits January 7, 2026 13:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@joverlee521 (Contributor) left a comment

I'm fine with moving to the .tar files if that's easier for manual processing, but I'd still push to keep our own standard file naming to ensure sorting order, and also to compress the .tar files before uploading to S3.

1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
Contributor

I'd push to rename the file during download to our own standardized name, e.g. 2026-01-07-01.tar, so that we are not at the mercy of the default name that GISAID produces. This ensures that the downloaded *.tar are sorted as expected for the deduplication process.

README.md Outdated
1. Download them from GISAID using the "Input for the Augur pipeline" option. This results in a file of the form `gisaid_auspice_input_hcov-19_2026_01_07_11.tar` (exact date will vary).
2. Upload this file to s3: `aws s3 cp /path/to/downloads/gisaid_auspice_input_hcov-19_2026_01_07_11.tar s3://nextstrain-ncov-private/gisaid-downloads/unprocessed/`
Contributor

Even if we are using the .tar, I'd still want to compress these before/during upload so that they don't bloat our S3 bucket.

Comment on lines +81 to +82
metadata_file = metadata_files[0]
sequence_file = sequence_files[0]
Contributor

non-blocking

Might be nice to output warning if there are multiple files.

# Parse records
import io
records = list(load_ndjson(io.BytesIO(stdout)))
all_records.extend(records)
Contributor

I'm worried we could run into out of memory errors if we hold all records in memory here. I'd write out the records per tarfile in case we ever need to do a full reprocess.
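One way to act on this suggestion (names hypothetical; a sketch, not the PR's code) is to stream each record line to disk as it is produced, once to a per-tar NDJSON file and once to the combined output, so memory stays flat regardless of how many tars are processed:

```python
from pathlib import Path

def stream_tar_records(ndjson_lines, per_tar_path: Path, combined_file) -> int:
    """Write each NDJSON line twice as it arrives: to a per-tar file (kept
    around for a later full reprocess) and to the combined output, instead
    of accumulating every record in a Python list.  Returns the line count."""
    count = 0
    with per_tar_path.open("w") as per_tar:
        for line in ndjson_lines:  # any iterable of NDJSON lines
            per_tar.write(line)
            combined_file.write(line)
            count += 1
    return count
```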

Comment on lines +21 to +23
# Download all tar files from S3
tmp_dir = Path("tmp/unprocessed-tars")
tmp_dir.mkdir(parents=True, exist_ok=True)
Contributor

non-blocking

Should this tmp_dir get cleaned up at the end of the script?

Comment on lines +140 to +141
if failed_count > 0:
print(f"\nWarning: {failed_count} tar files failed to process", file=sys.stderr)
Contributor

non-blocking

Could be nice to have a list of failed tar files for debugging.

Comment on lines +66 to +67
gisaid_cache="data/gisaid/gisaid_cache.ndjson" if config.get('s3_src') else [],
unprocessed="data/gisaid/unprocessed-combined.ndjson" if config.get('s3_src') else [],
Contributor

If both of these inputs are only available when s3_src is set, then the cat command below will hang if s3_src is not set. This also means there's no way to run the workflow on a local set of tar files.

Comment on lines +55 to +57
# Sort by filename to ensure consistent ordering
# Process oldest first so newer records overwrite during deduplication
for tar_path in sorted(tar_files):
Contributor

The dedup-by-gisaid-id script keeps the first record, so this needs to be sorted in reverse order to keep the newer records.
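The ordering issue is easy to see with a toy model of a keep-first deduplicator (this models the keep-first behavior the reviewer describes; the id field name is an assumption):

```python
import json

def dedup_keep_first(ndjson_lines, id_field="gisaid_epi_isl"):
    """Toy keep-first dedup: the FIRST record seen for each ID survives.
    Tars must therefore be fed newest-first (reverse-sorted) so that the
    newest record wins."""
    seen = set()
    for line in ndjson_lines:
        key = json.loads(line)[id_field]
        if key not in seen:
            seen.add(key)
            yield line
```

Feeding the same duplicate ID oldest-first keeps the stale record; reverse-sorting the tar files before processing fixes this.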

shell:
r"""
(cat {input.ndjsons:q} \
(cat {input.gisaid_cache:q} {input.unprocessed:q} \
Contributor

The unprocessed file needs to be before the gisaid cache since the dedup-by-gisaid-id script keeps the first record

Suggested change
(cat {input.gisaid_cache:q} {input.unprocessed:q} \
(cat {input.unprocessed:q} {input.gisaid_cache:q} \

corneliusroemer and others added 3 commits January 21, 2026 18:01
This commit refactors the GISAID tar file processing to fix critical bugs,
reduce memory usage, improve code readability, and reorganize S3 storage.

Critical bug fixes:
- Fix deduplication order: Process tar files newest-first (reverse sorted)
  so dedup-by-gisaid-id keeps newest records (it keeps first occurrence)
- Fix concatenation order: Put tar records before cache in pipeline so
  newer records from tars are kept during deduplication
- Fix tar extraction deprecation warning by using filter='data' parameter
  (Python 3.12+) with fallback for older versions

Memory optimization:
- Stream NDJSON output directly to file instead of accumulating all
  records in memory, preventing OOM errors on large tar sets

Code improvements:
- Refactor tar extraction logic into process_tar_file() function for
  better readability and testability
- Simplify subprocess handling by using single shell command with pipe
  instead of manually managing multiple Popen processes with shlex.quote()
- Add warning when multiple metadata/sequence files found in tar
- Track and list failed tar files by name, not just count

S3 reorganization:
- Move from gisaid-downloads/unprocessed to gisaid-tars/unprocessed and
  gisaid-tars/processed for cleaner separation of tar-based workflow
- Compress processed tars with zstd before archiving to reduce storage costs
- Move compression and archival to end of workflow (upload.smk) after all
  processing succeeds, preventing premature cleanup on failure
- Update local paths: tmp/gisaid-tars-unprocessed, data/gisaid/tar-combined.ndjson,
  data/gisaid/tar-processed-manifest.txt

Documentation:
- Update README with new S3 paths and note about zstd compression

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
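The extraction-filter fallback described in this commit can be sketched as follows (Python's tarfile gained the filter parameter in 3.12; older interpreters raise TypeError when it is passed):

```python
import tarfile

def safe_extractall(tar: tarfile.TarFile, dest: str) -> None:
    """Extract with filter='data' where supported, silencing the Python
    3.12+ deprecation warning about unfiltered extraction, and fall back
    to a plain extractall() on older interpreters."""
    try:
        tar.extractall(dest, filter="data")
    except TypeError:  # Python < 3.12: extractall() has no filter parameter
        tar.extractall(dest)
```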
Removed character validation that rejected filenames containing spaces,
parentheses, and other special characters. Fixed shell quoting to properly
handle these filenames by quoting the entire S3 path as a single unit.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Use tee in process_all_unprocessed_tars to show output in both logs and console
- Add single-line log messages showing input/output line counts in concatenate_gisaid_ndjsons
- Makes debugging easier by showing tar processing progress and deduplication statistics in interleaved logs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>