Replace fauna with curation pipeline #161

Merged
jameshadfield merged 6 commits into master from james/replace-fauna-with-curation-pipeline
Feb 1, 2026

@jameshadfield jameshadfield commented Jan 28, 2026

Member

See commit messages for all the details.

This PR removes our fauna-ingest pipeline, replacing it with a small ingest workflow which adds GenoFLU annotations to the metadata/sequences from our seasonal-flu ingest workflow. This layer uploads metadata/sequences to the canonical S3 locations, and thus the phylo workflows should run as normal.

After merging, I'll update the seasonal-flu upload workflow as described there

We're about to switch from fauna data (for GISAID builds) to our
all-influenza curation pipeline. Some strain names have changed between
these two approaches as we now apply one version of the curation code to
all (raw) records.

The files added here were automatically generated following development
scripts added to seasonal-flu (where the curation pipeline lives) in
[1], but with avian-flu specific modifications [2]. The approach used
was to compare the strain names in txt files against those in our
curated metadata. If a match wasn't found, we cross-reference with the
most recent fauna metadata, using HA_accession as the key to link fauna
to our newly curated data, as fauna didn't keep track of EPI_ISL.
Finally, we attempt a fuzzy match. The next commit will clean up the
changes introduced here.
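
The three-stage lookup described above (exact match, then a link via the HA accession, then a fuzzy match) can be sketched roughly as follows. This is only an illustration, not the script used in seasonal-flu: the function name `resolve_strain`, its arguments, and the `cutoff` value are all hypothetical.

```python
import difflib


def resolve_strain(name, curated_names, fauna_ha_by_strain, curated_strain_by_ha, cutoff=0.9):
    """Return (new_name, method) for an old strain name, or (None, "unmatched").

    All argument names and the cutoff are illustrative assumptions:
      curated_names        -- set of strain names in the newly curated metadata
      fauna_ha_by_strain   -- old fauna strain name -> HA accession
      curated_strain_by_ha -- HA accession -> newly curated strain name
    """
    # 1. Exact match against the curated metadata.
    if name in curated_names:
        return name, "exact"
    # 2. Link via the HA accession recorded in the old fauna metadata.
    ha_accession = fauna_ha_by_strain.get(name)
    if ha_accession in curated_strain_by_ha:
        return curated_strain_by_ha[ha_accession], "ha_accession"
    # 3. Fall back to the single closest fuzzy match, if it is close enough.
    close = difflib.get_close_matches(name, sorted(curated_names), n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unmatched"
```

Returning the method alongside the name makes the later manual check easier, since fuzzy matches deserve more scrutiny than accession-linked ones.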

The commands run (in seasonal-flu repo) were:

```
cd ingest

./scripts/diff-avian-flu.py --truth ../../avian-flu/ingest/fauna/data/metadata_ha.tsv --query data/avian-flu/curated_gisaid.ndjson.zst

snakemake --cores 4 --snakefile devel/strain-name-updates.snakefile -pf
```

[1] <nextstrain/seasonal-flu#291>
[2] <nextstrain/seasonal-flu@dd88e9b>

Manually check the automated changes from the parent commit. This
approach is the same as the one we recently took in seasonal-flu
<nextstrain/seasonal-flu#291>

When the number of requested cores (n) was higher than the number of machine cores (m), the script would partition the data into n parts but then only run m of them at a time.

This patch should be added upstream in due course.
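
The commit message does not say how the patch resolves this. One way such a mismatch could be avoided is to cap the number of partitions at what the machine can actually run concurrently. The sketch below assumes that approach; `effective_workers` and `partition` are hypothetical helpers, not functions from the vendored script.

```python
import os


def effective_workers(requested, available=None):
    """Number of partitions to create: never more than the machine can run at once."""
    if available is None:
        # os.cpu_count() may return None, so fall back to a single worker.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def partition(items, n_parts):
    """Split items into n_parts round-robin chunks (some may be empty)."""
    return [items[i::n_parts] for i in range(n_parts)]


# Example: 8 cores requested on a 4-core machine gives 4 partitions, not 8.
parts = partition(list(range(10)), effective_workers(8, available=4))
```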
@jameshadfield jameshadfield force-pushed the james/replace-fauna-with-curation-pipeline branch from d087b04 to 37d566a Compare January 28, 2026 01:57

@joverlee521 joverlee521 left a comment

Contributor

Noted a couple fixes for still supporting the separate analysis directory, which made me realize we never updated nextstrain-pathogen.yaml to indicate compatibility with nextstrain run.

Comment thread ingest/vendored-GenoFLU-multi/bin/genoflu-multi.py
Comment thread ingest/gisaid/Snakefile Outdated
Comment thread ingest/gisaid/Snakefile Outdated
Comment thread rules/main.smk Outdated
Comment thread rules/main.smk Outdated
Comment thread rules/main.smk Outdated
Comment thread scripts/apply-hotfixes.py
Comment on lines +117 to +121
# and overwrite the updated hotfix lines, excluding those in unnecessary_fixes
with open(args.hotfixes, 'w') as fh:
    for idx, line in enumerate(lines):
        if idx not in unnecessary_fixes:
            print(line, file=fh)

Contributor

> Any unnecessary hotfixes will be removed from the TSV at the point it's applied, ensuring obsolete fixes (i.e. which have been applied upstream) don't linger.

Is the idea that someone will run this script manually to remove the unnecessary hotfixes in the TSV file? Currently, it will remove it as part of the build, but those changes are not pushed up to GitHub.

@jameshadfield jameshadfield Jan 29, 2026

Member Author

Yeah, that's the idea, because Louise / Moncla lab runs things locally as far as I'm aware. If we do run automated GitHub builds then yes you're right that they might linger for a longer time than is ideal, but eventually we'll run a local build and will get this clean-up done.

Member Author

This hotfix approach is actually a bit useless in retrospect - if you add fixes then (almost) the entire workflow re-runs and so you'll probably subsample out your fixes. In this respect, a better place would be after rule filter but that'd mean hotfix TSVs for every {subtype}/{segment}/{time} combination. Not going to make changes at the moment, but something we can keep improving on.

Comment thread ingest/Snakefile
This takes the (S3) output from our all-influenza curation pipeline
(pre-filtered to avian-flu subtypes) and runs GenoFLU on it. It's a
little strange to have most of the ingest steps in one location and then
the GenoFLU step here; one day we may wish to unify them but that's
quite a big task given that this (avian-flu) ingest pipeline already
exists and is being used on other data sources.

Our all-influenza curation pipeline is located in the seasonal-flu repo,
which makes it cumbersome to apply curation changes when we notice
something's amiss. Our phylo pipelines thus allow metadata to be patched
via a config-defined hotfix file. Rather than letting hotfixes
accumulate over time, when we make curation fixes to the upstream
(seasonal-flu) workflow we can add these hotfixes. Any unnecessary
hotfixes will be removed from the TSV at the point it's applied,
ensuring obsolete fixes (i.e. which have been applied upstream) don't
linger.
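
As a rough illustration of the apply-and-prune behaviour described above: the layout of the real hotfix TSV and the logic in scripts/apply-hotfixes.py are not shown in this conversation, so the sketch below assumes a headerless file with three tab-separated columns (strain, field, value) and treats a hotfix as obsolete when the metadata already holds that value. The function name `apply_hotfixes` is hypothetical.

```python
def apply_hotfixes(metadata, hotfix_lines):
    """Patch metadata in place and return the hotfix lines still worth keeping.

    metadata     -- dict mapping strain -> dict of field -> value
    hotfix_lines -- TSV lines with (assumed) columns: strain, field, value
    """
    kept = []
    for line in hotfix_lines:
        strain, field, value = line.split("\t")
        record = metadata.get(strain)
        if record is not None and record.get(field) == value:
            # The upstream curation already has this value: the hotfix is obsolete.
            continue
        if record is not None:
            record[field] = value
        # Hotfixes for strains absent from the metadata are kept, since we
        # cannot tell whether they have been fixed upstream.
        kept.append(line)
    return kept
```

Writing `kept` back over the TSV gives the pruning step; as noted in the review thread, that only reaches the repository if someone commits the result of a local run.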
The preceding commits have switched us away from fauna as the canonical
private data source and towards flat files on S3, provisioned via our
new all-influenza curation pipeline (in the seasonal-flu repo).

To avoid any confusion going forward this commit removes the fauna-
related ingest workflows.

There's more clean-up work we can do in the future around the structure
of the ingest directory and the NDJSON fields which we provision, but
there are enough changes happening at the moment to defer this until later!
@jameshadfield jameshadfield force-pushed the james/replace-fauna-with-curation-pipeline branch from 63bdb11 to 58bb3ca Compare January 29, 2026 03:02
@jameshadfield jameshadfield merged commit edd7bf5 into master Feb 1, 2026
10 of 12 checks passed
@jameshadfield jameshadfield deleted the james/replace-fauna-with-curation-pipeline branch February 1, 2026 21:56
jameshadfield added a commit that referenced this pull request Feb 1, 2026
Bug introduced in <#161>
@jameshadfield jameshadfield mentioned this pull request Feb 1, 2026
jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Feb 1, 2026
Avian-flu now sources the curated data from (seasonal-flu) ingest
for GenoFLU annotations and phylo builds. Relevant PR:
<nextstrain/avian-flu#161>