Commit 882ec49

WIP [ingest] GenoFLU workflow for all-influenza ingest
This takes the output from our all-influenza curation pipeline (pre-filtered to avian-flu subtypes) and runs GenoFLU on it. It's a little strange to have most of the ingest steps in one location and then the GenoFLU step here; one day we may wish to unify them but that's quite a big task given that this (avian-flu) ingest pipeline already exists and is being used on other data sources.
1 parent 6db3c06 commit 882ec49

4 files changed

Lines changed: 166 additions & 3 deletions

Lines changed: 74 additions & 3 deletions
@@ -1,7 +1,78 @@
-# this workflow is a stub action to allow testing from a branch
-
 name: Run GenoFLU on curated GISAID data
 
+defaults:
+  run:
+    # This is the same as GitHub Action's `bash` keyword as of 20 June 2023:
+    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
+    #
+    # Completely spelling it out here so that GitHub can't change it out from under us
+    # and we don't have to refer to the docs to know the expected behavior.
+    shell: bash --noprofile --norc -eo pipefail {0}
+
 on:
+  workflow_call:
+    inputs:
+      image:
+        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
+        required: false
+        type: string
+
   workflow_dispatch:
-
+    inputs:
+      image:
+        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
+        required: false
+        type: string
+      trial-name:
+        description: |
+          Trial name for outputs.
+          If not set, outputs will be uploaded to the `s3_dst` configured in ingest/gisaid/config.yaml
+          If set, outputs will be uploaded to s3://nextstrain-data-private/files/workflows/avian-flu/trial/<trial_name>/
+        required: false
+        type: string
+
+  # Expose a repository dispatch so that we can trigger this workflow when the all-influenza
+  # curation pipeline has finished (currently via the seasonal-flu repo)
+  repository_dispatch:
+    types:
+      - genoflu-gisaid
+
+jobs:
+  ingest:
+    permissions:
+      id-token: write
+    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
+    secrets: inherit
+    with:
+      # Starting with the default docker runtime
+      # We can migrate to AWS Batch when/if we need to for more resources or if
+      # the job runs longer than the GH Action limit of 6 hours.
+      runtime: docker
+      run: |
+        declare -a config;
+
+        if [[ "$TRIAL_NAME" ]]; then
+          # Create JSON string for the nested upload config
+          S3_DST="s3://nextstrain-data-private/files/workflows/avian-flu/trial/$TRIAL_NAME"
+          config+=(
+            s3_dst=$(jq -cn --arg S3_DST "$S3_DST" '{"gisaid": $S3_DST}')
+          )
+        fi;
+
+        nextstrain build \
+          ingest \
+            --snakefile gisaid/Snakefile \
+            upload_all \
+            --config "${config[@]}"
+      env: |
+        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.image }}
+        TRIAL_NAME: ${{ inputs.trial-name }}
+      # Explicitly excluding `ingest/gisaid/results` and `ingest/gisaid/data`
+      # since this is private data and should not be available through the public artifacts
+      artifact-name: genoflu-gisaid
+      artifact-paths: |
+        ingest/.snakemake/log/
+        ingest/gisaid/logs/
+        ingest/gisaid/benchmarks/
+        !ingest/gisaid/results
+        !ingest/gisaid/data
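
Reviewer's note: the `run` block above only passes a `s3_dst` override when `TRIAL_NAME` is non-empty; otherwise `--config` receives nothing and the defaults in `config.yaml` apply. The following is a minimal Python sketch of that branching, not part of the commit. The helper name `build_config_args` is hypothetical, and the prefix is copied from the `S3_DST` line in the workflow.

```python
import json

# Copied from the S3_DST assignment in the workflow's `run` block.
TRIAL_PREFIX = "s3://nextstrain-data-private/files/workflows/avian-flu/trial"


def build_config_args(trial_name: str) -> list[str]:
    """Return the key=value strings that would follow `--config` (hypothetical helper)."""
    if not trial_name:
        # Mirrors `[[ "$TRIAL_NAME" ]]`: an empty value leaves the array empty,
        # so the defaults from config.yaml are used unchanged.
        return []
    s3_dst = {"gisaid": f"{TRIAL_PREFIX}/{trial_name}"}
    # Compact separators match the single-line output of `jq -c`.
    return ["s3_dst=" + json.dumps(s3_dst, separators=(",", ":"))]


if __name__ == "__main__":
    print(build_config_args("my-test"))
```

Passing the nested mapping as a JSON string keeps the whole override in a single shell word.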

ingest/gisaid/README.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
This directory represents the workflow which takes curated data (TSV + FASTAs) from our all-influenza curation pipeline (part of the seasonal-flu repo) and adds in GenoFLU metadata, and then uploads the (new) metadata and (unchanged) sequences to S3.
2+
3+
See the sister `config.yaml` for the S3 addresses.
4+
5+
**GitHub action**
6+
7+
The `genoflu-gisaid` GitHub action runs this workflow.
8+
The intention is for the seasonal-flu repo to trigger it when newly curated data are available.
9+
10+
**Maual usage**
11+
12+
Working directory: `avian-flu/ingest`
13+
14+
Command: `snakemake --cores 1 --snakefile gisaid/Snakefile -npf`
15+
16+
Add `upload_all` to the end of that rule if you also want to upload files.
17+
18+
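
Reviewer's note: the README says the seasonal-flu repo is meant to trigger this workflow, and the workflow listens for a `repository_dispatch` event of type `genoflu-gisaid`. The commit does not show the triggering side, so the sketch below is only an assumption about what such a call could look like with GitHub's REST endpoint for repository dispatch events. The repository slug `nextstrain/avian-flu` and the token are placeholders, and the request is built but never sent.

```python
import json
import urllib.request

# Assumption: placeholder repository slug, not stated in the commit.
REPO = "nextstrain/avian-flu"


def build_dispatch_request(token: str, repo: str = REPO) -> urllib.request.Request:
    """Build (but do not send) a repository_dispatch request for `genoflu-gisaid`."""
    # The event_type must match an entry under `repository_dispatch.types`.
    payload = json.dumps({"event_type": "genoflu-gisaid"}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/dispatches",
        data=payload,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    request = build_dispatch_request(token="<token>")
    print(request.get_method(), request.full_url)
```

Sending the request would require `urllib.request.urlopen(request)` with a token that has permission to dispatch events on the target repository.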

ingest/gisaid/Snakefile

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
2+
import os
3+
configfile: os.path.join(workflow.basedir, "config.yaml")
4+
5+
include: "../../shared/vendored/snakemake/remote_files.smk"
6+
include: "../rules/genoflu.smk"
7+
include: "../rules/upload_to_s3.smk"
8+
9+
10+
# The Genoflu workflow will create "gisaid/results/metadata.tsv" with GenoFLU information
11+
# So make that the default workflow target. This will force provisioning of upstream
12+
# metadata & sequences
13+
rule all:
14+
input:
15+
metadata="gisaid/results/metadata.tsv",
16+
17+
rule upload_all:
18+
input:
19+
metadata="gisaid/s3/metadata.done",
20+
sequences=expand("gisaid/s3/sequences_{segment}.done", segment=config["segments"]),
21+
22+
rule get_sequence:
23+
"""
24+
Provisions the curated sequences (ultimately from the seasonal-flu ingest)
25+
into the location where both the GenoFlu workflow and the upload rules can access them.
26+
(Note: We could use a different location and skip `provision_genoflu_sequences` but
27+
we want to upload the sequences at the end of the workflow in order to keep metadata
28+
& sequences in-sync.)
29+
"""
30+
input:
31+
path_or_url(config['sequences'])
32+
output:
33+
"gisaid/results/sequences_{segment}.fasta"
34+
shell:
35+
"""
36+
if [[ {input[0]} == *.xz ]]; then
37+
xz -dc {input[0]} > {output[0]}
38+
elif [[ {input[0]} == *.zst ]]; then
39+
zstd -dc {input[0]} > {output[0]}
40+
else
41+
cp {input[0]} {output[0]}
42+
fi
43+
"""
44+
45+
rule get_metadata:
46+
"""
47+
Provisions the metadata in the location the genoflu workflow expects it.
48+
"""
49+
input:
50+
path_or_url(config['metadata'])
51+
output:
52+
"gisaid/data/metadata_combined.tsv"
53+
shell:
54+
"""
55+
cp {input[0]} {output[0]}
56+
"""
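
Reviewer's note: `get_sequence` picks a decompressor from the input's file extension, while `get_metadata` always uses a plain `cp`. As a rough illustration of the extension dispatch, here is a standalone Python sketch. The function name `provision` is hypothetical, and the `.zst` branch shells out to the `zstd` CLI (assumed to be installed) just as the rule does.

```python
import lzma
import shutil
import subprocess
from pathlib import Path


def provision(src: str, dst: str) -> None:
    """Copy `src` to `dst`, decompressing first when the suffix calls for it."""
    suffix = Path(src).suffix
    if suffix == ".xz":
        # Equivalent of `xz -dc src > dst`, streamed so large FASTAs stay out of memory.
        with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    elif suffix == ".zst":
        # Defer to the same CLI the rule uses; requires `zstd` on PATH.
        with open(dst, "wb") as fout:
            subprocess.run(["zstd", "-dc", src], stdout=fout, check=True)
    else:
        shutil.copyfile(src, dst)
```

Applied to the configured `metadata.tsv.xz`, this dispatch would yield an uncompressed TSV, whereas a plain copy keeps the file xz-compressed under a `.tsv` name; whether `path_or_url` or the downstream GenoFLU rules already account for that is not visible in this commit.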

ingest/gisaid/config.yaml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
2+
# Download files from:
3+
# TODO XXX - change when we take the seasonal-flu outputs out of "trials" mode
4+
sequences: s3://nextstrain-data-private/files/workflows/seasonal-flu/trials/ingest/avian-flu/{segment}/sequences.fasta.xz
5+
metadata: s3://nextstrain-data-private/files/workflows/seasonal-flu/trials/ingest/avian-flu/metadata.tsv.xz
6+
# Note: replace the above with (e.g.) "../../seasonal-flu/ingest/results/avian-flu/{segment}.fasta" for local usage, as needed
7+
8+
# Upload data to:
9+
s3_dst:
10+
# NOTE: the intention is to overwrite this for testing purposes via the API call
11+
# In case that doesn't work, I set a conservative trial prefix to ensure we don't overwrite canonical files
12+
# TODO XXX - change to files/workflows/avian-flu
13+
gisaid: s3://nextstrain-data-private/files/workflows/avian-flu/trial/genoflu-gisaid-should-not-be-used
14+
15+
segments: ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]
16+
17+
genoflu:
18+
gisaid: true
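
Reviewer's note: the `sequences` address contains a `{segment}` placeholder that is filled once per entry in `segments`. The sketch below approximates that expansion with plain `str.format`. It is an illustration only: how `path_or_url` and Snakemake's wildcard resolution actually handle the template is not shown in this commit, and the function name `sequence_sources` is hypothetical.

```python
# Values copied from ingest/gisaid/config.yaml in this commit.
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]
SEQUENCES = (
    "s3://nextstrain-data-private/files/workflows/seasonal-flu/"
    "trials/ingest/avian-flu/{segment}/sequences.fasta.xz"
)


def sequence_sources(
    template: str = SEQUENCES, segments: list[str] = SEGMENTS
) -> dict[str, str]:
    """Map each segment to the address its `{segment}` placeholder resolves to."""
    return {segment: template.format(segment=segment) for segment in segments}


if __name__ == "__main__":
    for segment, url in sequence_sources().items():
        print(segment, url)
```

The same function can be used to check a local override, for example by passing the `../../seasonal-flu/ingest/results/avian-flu/{segment}.fasta` template suggested in the config comment.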
