Replace fauna with curation pipeline by jameshadfield · Pull Request #161 · nextstrain/avian-flu

jameshadfield · 2026-01-28T01:52:18Z

See commit messages for all the details.

This PR removes our fauna-ingest pipeline, replacing it with a small ingest workflow which adds GenoFLU annotations to the metadata/sequences from our seasonal-flu ingest workflow. This layer uploads metadata/sequences to the canonical S3 locations, and thus the phylo workflows should run as normal.

After merging, I'll update the seasonal-flu upload workflow as described there.

We're about to switch from fauna data (for GISAID builds) to our all-influenza curation pipeline. Some strain names have changed between these two approaches as we now apply one version of the curation code to all (raw) records. The files added here were automatically generated following development scripts added to seasonal-flu (where the curation pipeline lives) in [1], but with avian-flu specific modifications [2].

The approach used was to compare the strain names in txt files against those in our curated metadata. If a match wasn't found, we cross-reference with the most recent fauna metadata, using HA_accession as the key to link fauna to our newly curated data, as fauna didn't keep track of EPI_ISL. Finally, we attempt a fuzzy match. The next commit will clean up the changes introduced here.

The commands run (in seasonal-flu repo) were:

```
cd ingest
./scripts/diff-avian-flu.py --truth ../../avian-flu/ingest/fauna/data/metadata_ha.tsv --query data/avian-flu/curated_gisaid.ndjson.zst
snakemake --cores 4 --snakefile devel/strain-name-updates.snakefile -pf
```

[1] <nextstrain/seasonal-flu#291>
[2] <nextstrain/seasonal-flu@dd88e9b>
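For readers following along, the three-step lookup described in this commit message (exact match, then HA_accession cross-reference, then fuzzy match) could look roughly like the sketch below. This is not the code in `diff-avian-flu.py` or the devel snakefile. The function name, the argument shapes and the `cutoff` value are all assumptions made for illustration, and the fuzzy step uses Python's standard `difflib`, which may differ from what the real scripts use.

```python
import difflib


def resolve_strain(name, curated_names, fauna_ha_by_strain, curated_strain_by_ha, cutoff=0.9):
    """Map a hardcoded strain name onto the newly curated names.

    All arguments are hypothetical shapes chosen for this sketch:
      curated_names        -- set of strain names in the curated metadata
      fauna_ha_by_strain   -- dict: fauna strain name -> HA accession
      curated_strain_by_ha -- dict: HA accession -> curated strain name
    Returns (new_name, how), or (None, "unmatched") if nothing was found.
    """
    # 1. The name is unchanged between fauna and the curated metadata.
    if name in curated_names:
        return name, "exact"

    # 2. Link fauna to curated data via the HA accession
    #    (fauna did not track EPI_ISL, so that cannot be the key).
    ha_accession = fauna_ha_by_strain.get(name)
    if ha_accession is not None and ha_accession in curated_strain_by_ha:
        return curated_strain_by_ha[ha_accession], "ha_accession"

    # 3. Last resort: closest curated name above a similarity cutoff.
    close = difflib.get_close_matches(name, sorted(curated_names), n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"

    return None, "unmatched"
```

Returning the matching method alongside the new name makes it easier to prioritise the manual review of fuzzy matches, which are the most likely to be wrong.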

Manually check the automated changes from the parent commit. This approach is the same as the one we recently took in seasonal-flu <nextstrain/seasonal-flu#291>

In the situation where the requested cores (n) was higher than the machine cores (m) the script would partition the data by the requested cores (n) but then only run m of them at a time. This patch should be added upstream in due course.
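One plausible reading of the fix, sketched below, is to cap the number of partitions at the number of cores the machine actually has, so that every partition can run concurrently. This is an illustration of the idea only, not the patch itself. Both function names are hypothetical, and the real script may split its input differently.

```python
import os


def effective_workers(requested, available=None):
    """Cap the requested core count (n) at the machine's core count (m)."""
    if available is None:
        # os.cpu_count() can return None, so fall back to a single core.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def partition(records, n_parts):
    """Split records into at most n_parts contiguous, near-equal chunks."""
    if not records:
        return []
    n_parts = max(1, min(n_parts, len(records)))
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks
```

With this shape, a caller would compute `effective_workers(requested_cores)` once and use the same number both for partitioning and for the size of the worker pool, so the two cannot disagree.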

joverlee521

Noted a couple fixes for still supporting the separate analysis directory, which made me realize we never updated nextstrain-pathogen.yaml to indicate compatibility with nextstrain run.

joverlee521 · 2026-01-28T21:39:16Z

+    # and overwrite the updated hotfix lines, exluding those in unnecessary_fixes
+    with open(args.hotfixes, 'w') as fh:
+        for idx, line in enumerate(lines):
+            if idx not in unnecessary_fixes:
+                print(line, file=fh)


Any unnecessary hotfixes will be removed from the TSV at the point it's applied, ensuring obsolete fixes (i.e. which have been applied upstream) don't linger.

Is the idea that someone will run this script manually to remove the unnecessary hotfixes in the TSV file? Currently, it will remove it as part of the build, but those changes are not pushed up to GitHub.

Yeah, that's the idea, because Louise / Moncla lab runs things locally as far as I'm aware. If we do run automated GitHub builds then yes you're right that they might linger for a longer time than is ideal, but eventually we'll run a local build and will get this clean-up done.

This hotfix approach is actually a bit useless in retrospect - if you add fixes then (almost) the entire workflow re-runs and so you'll probably subsample out your fixes. In this respect, a better place would be after rule filter but that'd mean hotfix TSVs for every {subtype}/{segment}/{time} combination. Not going to make changes at the moment, but something we can keep improving on.

This takes the (S3) output from our all-influenza curation pipeline (pre-filtered to avian-flu subtypes) and runs GenoFLU on it. It's a little strange to have most of the ingest steps in one location and then the GenoFLU step here; one day we may wish to unify them but that's quite a big task given that this (avian-flu) ingest pipeline already exists and is being used on other data sources.

Our all-influenza curation pipeline is located in the seasonal-flu repo, which makes it cumbersome to apply curation changes when we notice something's amiss. Our phylo pipelines thus allow metadata to be patched via a config-defined hotfix file. Rather than letting hotfixes accumulate over time, when we make curation fixes to the upstream (seasonal-flu) workflow we can add these hotfixes. Any unnecessary hotfixes will be removed from the TSV at the point it's applied, ensuring obsolete fixes (i.e. which have been applied upstream) don't linger.
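As a rough illustration of the "apply, then prune" behaviour described in this commit message, the sketch below applies each hotfix to the metadata and drops any hotfix whose value is already present (i.e. it has been fixed upstream). The three-column `strain / field / value` layout, the function name, and the decision to keep fixes for strains missing from the metadata are all assumptions for this example. The real hotfix TSV and script may differ.

```python
def apply_hotfixes(metadata, hotfix_lines):
    """Apply hotfixes in place and return the hotfix lines worth keeping.

    metadata     -- dict: strain -> dict of field -> value (mutated in place)
    hotfix_lines -- TSV lines "strain<TAB>field<TAB>value", header line first
                    (a hypothetical layout, assumed for this sketch)
    """
    if not hotfix_lines:
        return []

    kept = [hotfix_lines[0]]  # always keep the header
    for line in hotfix_lines[1:]:
        strain, field, value = line.split("\t")
        record = metadata.get(strain)
        if record is None:
            # Strain not in this dataset: we cannot tell whether the fix is
            # obsolete, so (by assumption) keep it.
            kept.append(line)
            continue
        if record.get(field) == value:
            # Already correct upstream: the hotfix is unnecessary, drop it.
            continue
        record[field] = value
        kept.append(line)
    return kept
```

Writing the returned lines back to the hotfix file would then correspond to the overwrite step shown in the diff excerpt under review.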

The preceding commits have switched us away from fauna as the canonical private data source and towards flat files on S3, provisioned via our new all-influenza curation pipeline (in the seasonal-flu repo). To avoid any confusion going forward this commit removes the fauna-related ingest workflows. There's more clean-up work we can do in the future around the structure of the ingest directory and the NDJSON fields which we provision, but there are enough changes happening at the moment to defer this until later!

Bug introduced in <#161>

Avian-flu now sources the curated data from (seasonal-flu) ingest for GenoFLU annotations and phylo builds. Relevant PR: <nextstrain/avian-flu#161>

jameshadfield added 3 commits January 28, 2026 12:55

Update hardcoded strain names part II

89bd804

[ingest] improve parallalisation of GenoFLU

b660b1a

jameshadfield force-pushed the james/replace-fauna-with-curation-pipeline branch from d087b04 to 37d566a Compare January 28, 2026 01:57

jameshadfield mentioned this pull request Jan 28, 2026

Run GenoFlu on all-influenza curated data #158

Closed

joverlee521 reviewed Jan 28, 2026

This was referenced Jan 29, 2026

Snakemake improvements mk2 #149

Open

GenoFlu annotations for incomplete genomes #162

Open

jameshadfield added 3 commits January 29, 2026 16:01

jameshadfield force-pushed the james/replace-fauna-with-curation-pipeline branch from 63bdb11 to 58bb3ca Compare January 29, 2026 03:02

jameshadfield merged commit edd7bf5 into master Feb 1, 2026
10 of 12 checks passed

jameshadfield deleted the james/replace-fauna-with-curation-pipeline branch February 1, 2026 21:56

jameshadfield added a commit that referenced this pull request Feb 1, 2026

[bugfix] fix typo

340fc29

jameshadfield mentioned this pull request Feb 1, 2026

[bugfix] fix typo #163

Merged

jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Feb 1, 2026

Update avian-flu action trigger

0549fb8

jameshadfield mentioned this pull request Feb 1, 2026

Update avian-flu action trigger nextstrain/seasonal-flu#300

Merged

2 tasks

jameshadfield merged 6 commits into master from james/replace-fauna-with-curation-pipeline
