Add species to ingest/nextstrain-automation #56

Merged
joverlee521 merged 6 commits into main from fix-species-ingest on May 19, 2026

@joverlee521 joverlee521 commented May 18, 2026

Contributor

Description of proposed changes

  • Add species to ingest/nextstrain-automation uploads
  • Update phylo/all-outbreaks to start from ebov files

Related issue(s)

Follow up to #53
Resolves #55

Checklist

Contributor Author

Both the ingest and phylo workflows completed successfully in the GH Action workflow. I'll merge this before tomorrow's automated run if there's no feedback.

@jameshadfield jameshadfield left a comment

Member

Thanks Jover!

Comment on lines +26 to +33
bdbv/metadata.tsv.zst: results/bdbv/metadata.tsv
bdbv/metadata_open.tsv.zst: results/bdbv/metadata_open.tsv
bdbv/sequences.fasta.zst: results/bdbv/sequences.fasta
bdbv/sequences_open.fasta.zst: results/bdbv/sequences_open.fasta
sudv/metadata.tsv.zst: results/sudv/metadata.tsv
sudv/metadata_open.tsv.zst: results/sudv/metadata_open.tsv
sudv/sequences.fasta.zst: results/sudv/sequences.fasta
sudv/sequences_open.fasta.zst: results/sudv/sequences_open.fasta
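
The upload entries quoted above follow the same pattern for each species. As a minimal sketch (not the repository's actual code), the same mapping could be generated from a species list. The helper name `build_upload_map` and its `results_dir` parameter are hypothetical and only serve the illustration.

```python
def build_upload_map(species, results_dir="results"):
    """Map remote object names to local ingest outputs for each species.

    Hypothetical helper: mirrors the pattern of the quoted config lines only.
    """
    # Each species contributes the full and OPEN variants of metadata and sequences.
    stems = [
        "metadata.tsv",
        "metadata_open.tsv",
        "sequences.fasta",
        "sequences_open.fasta",
    ]
    return {
        f"{sp}/{stem}.zst": f"{results_dir}/{sp}/{stem}"
        for sp in species
        for stem in stems
    }


upload_map = build_upload_map(["bdbv", "sudv"])
for remote, local in upload_map.items():
    print(f"{remote}: {local}")
```

Passing a shorter species list produces a correspondingly smaller set of uploads.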

Member

Tangentially (i.e. not-blocking), the ingest for bdbv & sudv is so quick it makes me wonder if there are situations / pathogens where we'd skip intermediate files and just re-ingest from PPX each time. Has this come up for other pathogens?

Contributor Author

Oh, hmm, I don't think that's been considered elsewhere.

If we were using the standard inputs config that points to the local ingest outputs, I'd think this would be pretty easy to do without any custom workflow handling:

# phylogenetic/defaults/bdbv/config.yaml
inputs:
    - name: bdbv
      metadata: ../ingest/results/bdbv/metadata.tsv
      sequences: ../ingest/results/bdbv/sequences.fasta

nextstrain build ingest && nextstrain build phylogenetic --configfile defaults/bdbv/config.yaml

Member

The BDBV/SUDV builds in #54 are hardcoded to only use local ingest files at the moment

Contributor Author

Ah, the ingest workflow should be updated to be able to ingest a single species then. We can update the hardcoded SPECIES to

SPECIES = config["species"]

Then you should be able to run

nextstrain build ingest --config species="['bdbv']" && nextstrain build phylogenetic -s species-workflows/bdbv.snakefile
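
To illustrate the suggestion above, here is a rough sketch of how a workflow might read `config["species"]` defensively, assuming the value can arrive either as a list or as a single string. The function name `resolve_species` and the default list are assumptions for illustration, not taken from the repository.

```python
# Assumed default for illustration only; the repository's hardcoded list may differ.
DEFAULT_SPECIES = ["ebov", "bdbv", "sudv"]


def resolve_species(config, default=DEFAULT_SPECIES):
    """Return the species list from a config dict, falling back to a default."""
    value = config.get("species", default)
    # A bare string (e.g. species=bdbv) is wrapped so callers always get a list.
    if isinstance(value, str):
        value = [value]
    species = list(value)
    if not species:
        raise ValueError("config['species'] must name at least one species")
    return species


SPECIES = resolve_species({"species": ["bdbv"]})
print(SPECIES)
```

Normalizing to a list keeps downstream rules that iterate over `SPECIES` unchanged whether one or several species are requested.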

Contributor Author

Add ingest config for species in c97ed1a.

Member

But also, upon reflection, let's drop the S3 uploads for files we ourselves are not (yet) using

Contributor Author

Okay, only running ebov in the automated workflow with f077ccb.

Member

f077ccb

Contributor Author

Also deleted the bdbv/sudv objects on S3 so we don't get confused.

Member

LGTM

joverlee521 added a commit that referenced this pull request May 18, 2026
Add ability to configure species so that users can run ingest for a
subset of species. Motivated by discussion in
<#56 (comment)>

Note that only ebov has Nextclade outputs for now. This still follows
the old pattern of all data files and separate OPEN files for PPX data
because the phylo workflow does not support multiple inputs yet.

Follow up to <#53>

Run the phylo job regardless of cache check for manual runs via
`workflow_dispatch` since manual runs are expected to be used for
testing and for forcing the full workflow run.
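
The manual-run rule in the `workflow_dispatch` commit message above could be sketched as a small decision function. This is only an illustration: it assumes the cache check reduces to a boolean saying whether ingest produced new data, and the names `should_run_phylo` and `ingest_changed` are hypothetical.

```python
def should_run_phylo(event_name, ingest_changed):
    """Decide whether the phylo job runs after ingest.

    `ingest_changed` is an assumed boolean standing in for the cache check result.
    """
    # Manual runs always proceed: they are used for testing and forcing a full run.
    if event_name == "workflow_dispatch":
        return True
    # Other runs (e.g. scheduled) proceed only when the cache check reports new data.
    return ingest_changed


print(should_run_phylo("workflow_dispatch", ingest_changed=False))
print(should_run_phylo("schedule", ingest_changed=False))
```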
@joverlee521 joverlee521 force-pushed the fix-species-ingest branch from ba334ef to c97ed1a May 18, 2026 23:38
bdbv/sudv workflows do not pull data from S3 (yet), so only run the
ebov ingest for now.
@joverlee521 joverlee521 merged commit 1ab9e80 into main May 19, 2026
5 checks passed
@joverlee521 joverlee521 deleted the fix-species-ingest branch May 19, 2026 00:28

Development

Successfully merging this pull request may close these issues.

ingest/nextstrain-automation error

2 participants