Replace fauna with curation pipeline #161
We're about to switch from fauna data (for GISAID builds) to our all-influenza curation pipeline. Some strain names have changed between the two approaches, as we now apply one version of the curation code to all (raw) records.

The files added here were automatically generated using development scripts added to seasonal-flu (where the curation pipeline lives) in [1], with avian-flu specific modifications [2]. The approach was to compare the strain names in txt files against those in our curated metadata. If a match wasn't found, we cross-referenced with the most recent fauna metadata, using HA_accession as the key to link fauna to our newly curated data, as fauna didn't keep track of EPI_ISL. Finally, we attempted a fuzzy match.

The next commit will clean up the changes introduced here.

The commands run (in the seasonal-flu repo) were:

```
cd ingest
./scripts/diff-avian-flu.py --truth ../../avian-flu/ingest/fauna/data/metadata_ha.tsv --query data/avian-flu/curated_gisaid.ndjson.zst
snakemake --cores 4 --snakefile devel/strain-name-updates.snakefile -pf
```

[1] <nextstrain/seasonal-flu#291>
[2] <nextstrain/seasonal-flu@dd88e9b>
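For readers who want a feel for the three-step name matching described in this commit message (exact match, then HA_accession cross-reference, then fuzzy match), here is a minimal sketch. It is not the code from the seasonal-flu development scripts. The function name, the dictionary shapes, the 0.9 similarity cutoff, and the use of `difflib` for the fuzzy step are all assumptions made for illustration.

```python
import difflib


def resolve_strain(name, curated_names, fauna_ha_by_strain,
                   curated_strain_by_ha, cutoff=0.9):
    """Map an old strain name onto a curated one (hypothetical helper).

    curated_names:        set of strain names in the curated metadata
    fauna_ha_by_strain:   fauna strain name -> HA accession
    curated_strain_by_ha: HA accession -> curated strain name
    Returns (new_name, method); new_name is None if nothing matched.
    """
    # 1. Exact match against the curated metadata.
    if name in curated_names:
        return name, "exact"

    # 2. Cross-reference via fauna, using the HA accession as the link
    #    (fauna did not record EPI_ISL, so it cannot be the key).
    ha_accession = fauna_ha_by_strain.get(name)
    if ha_accession is not None and ha_accession in curated_strain_by_ha:
        return curated_strain_by_ha[ha_accession], "ha_accession"

    # 3. Last resort: fuzzy match. The cutoff is an assumed value.
    close = difflib.get_close_matches(name, sorted(curated_names),
                                      n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"

    return None, "unresolved"


# Made-up names and accessions, purely for illustration.
curated = {"A/chicken/Exampleland/1/2021", "A/goose/Exampleland/1/1996"}
fauna_ha = {"A/Ck/Exampleland/1/2021": "HA_0001"}
curated_by_ha = {"HA_0001": "A/chicken/Exampleland/1/2021"}

for old_name in ["A/goose/Exampleland/1/1996",
                 "A/Ck/Exampleland/1/2021",
                 "A/goose/Exampleland/1/96"]:
    print(old_name, "->", resolve_strain(old_name, curated, fauna_ha, curated_by_ha))
```

Returning the method alongside the new name makes it easy to review the fuzzy matches by hand, which is what the follow-up commit does for the automated changes.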
Manually check the automated changes from the parent commit. This approach is the same as the one we recently took in seasonal-flu <nextstrain/seasonal-flu#291>
When the number of requested cores (n) was higher than the number of machine cores (m), the script would partition the data into n parts but then only run m of them at a time. This patch should be added upstream in due course.
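One plausible way to avoid the mismatch described above is to cap the number of partitions at the number of cores the machine actually has. The sketch below illustrates that idea only. It is not the patch itself, and the function names are hypothetical.

```python
import os


def effective_partitions(requested_cores, machine_cores=None):
    """Cap the partition count at what the machine can run at once."""
    if machine_cores is None:
        # os.cpu_count() may return None, so fall back to a single core.
        machine_cores = os.cpu_count() or 1
    return max(1, min(requested_cores, machine_cores))


def partition(records, n_parts):
    """Split records into n_parts contiguous, near-equal chunks."""
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional record each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


# Requesting 8 cores on a (pretend) 4-core machine gives 4 partitions.
n_parts = effective_partitions(requested_cores=8, machine_cores=4)
print(partition(list(range(10)), n_parts))
```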
d087b04 to 37d566a
joverlee521
left a comment
Noted a couple fixes for still supporting the separate analysis directory, which made me realize we never updated nextstrain-pathogen.yaml to indicate compatibility with nextstrain run.
    # and overwrite the updated hotfix lines, excluding those in unnecessary_fixes
    with open(args.hotfixes, 'w') as fh:
        for idx, line in enumerate(lines):
            if idx not in unnecessary_fixes:
                print(line, file=fh)
Any unnecessary hotfixes will be removed from the TSV at the point it's applied, ensuring obsolete fixes (i.e. which have been applied upstream) don't linger.
Is the idea that someone will run this script manually to remove the unnecessary hotfixes in the TSV file? Currently, it will remove them as part of the build, but those changes are not pushed up to GitHub.
Yeah, that's the idea, because Louise / Moncla lab runs things locally as far as I'm aware. If we do run automated GitHub builds then yes you're right that they might linger for a longer time than is ideal, but eventually we'll run a local build and will get this clean-up done.
This hotfix approach is actually a bit useless in retrospect - if you add fixes then (almost) the entire workflow re-runs and so you'll probably subsample out your fixes. In this respect, a better place would be after rule filter but that'd mean hotfix TSVs for every {subtype}/{segment}/{time} combination. Not going to make changes at the moment, but something we can keep improving on.
This takes the (S3) output from our all-influenza curation pipeline (pre-filtered to avian-flu subtypes) and runs GenoFLU on it. It's a little strange to have most of the ingest steps in one location and then the GenoFLU step here; one day we may wish to unify them but that's quite a big task given that this (avian-flu) ingest pipeline already exists and is being used on other data sources.
Our all-influenza curation pipeline is located in the seasonal-flu repo, which makes it cumbersome to apply curation changes when we notice something's amiss. Our phylo pipelines thus allow metadata to be patched via a config-defined hotfix file. Rather than letting hotfixes accumulate over time, when we make curation fixes to the upstream (seasonal-flu) workflow we can add these hotfixes. Any unnecessary hotfixes will be removed from the TSV at the point it's applied, ensuring obsolete fixes (i.e. which have been applied upstream) don't linger.
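To make the "apply, then prune" behaviour concrete, here is a minimal sketch of how a hotfix pass could both patch metadata and flag fixes that upstream curation has already made redundant. The three-column TSV layout (`strain`, `field`, `value`), the function name, and the rule for strains missing from the metadata are assumptions for illustration. The actual hotfix file format is not shown in this PR.

```python
def apply_hotfixes(metadata, hotfix_lines):
    """Patch metadata in place and report which hotfix lines are obsolete.

    metadata:     dict of strain -> dict of field values
    hotfix_lines: list of 'strain<TAB>field<TAB>value' lines
                  (an assumed layout, not the real TSV schema)
    Returns the set of line indexes that are no longer needed.
    """
    unnecessary_fixes = set()
    for idx, line in enumerate(hotfix_lines):
        if not line.strip() or line.startswith("#"):
            continue  # keep blank lines and comments untouched
        strain, field, value = line.split("\t")
        record = metadata.get(strain)
        if record is None:
            continue  # assumption: keep fixes for strains not in this dataset
        if record.get(field) == value:
            # Upstream curation already yields this value: obsolete fix.
            unnecessary_fixes.add(idx)
        else:
            record[field] = value
    return unnecessary_fixes


# Made-up records, purely for illustration.
metadata = {
    "strain_a": {"country": "Exampleland"},
    "strain_b": {"country": "?"},
}
lines = [
    "# strain\tfield\tvalue",
    "strain_a\tcountry\tExampleland",  # already correct upstream
    "strain_b\tcountry\tOtherland",    # still needed
]
obsolete = apply_hotfixes(metadata, lines)
kept = [line for idx, line in enumerate(lines) if idx not in obsolete]
print(kept)
```

The `kept` list is what would be written back to the hotfix file, mirroring the `idx not in unnecessary_fixes` filter in the reviewed snippet.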
The preceding commits have switched us away from fauna as the canonical private data source and towards flat files on S3, provisioned via our new all-influenza curation pipeline (in the seasonal-flu repo). To avoid any confusion going forward, this commit removes the fauna-related ingest workflows. There's more clean-up work we can do in the future around the structure of the ingest directory and the NDJSON fields which we provision, but there are enough changes happening at the moment to defer this until later!
63bdb11 to 58bb3ca
Avian-flu now sources the curated data from (seasonal-flu) ingest for GenoFLU annotations and phylo builds. Relevant PR: <nextstrain/avian-flu#161>
See commit messages for all the details.
This PR removes our fauna-ingest pipeline, replacing it with a small ingest workflow which adds GenoFLU annotations to the metadata/sequences from our seasonal-flu ingest workflow. This layer uploads metadata/sequences to the canonical S3 locations, and thus the phylo workflows should run as normal.
After merging, I'll update the seasonal-flu upload workflow as described there.