This is the ingest pipeline for dengue virus sequences.
Follow the standard installation instructions for Nextstrain's suite of software tools.
All workflows are expected to be run from the top-level pathogen repo directory. The default ingest workflow is run with the commands below.
Fetch sequences with
nextstrain build ingest data/sequences.ndjson
Run the complete ingest pipeline with
nextstrain build ingest
This will produce 10 files (within the ingest directory):
A pair of files with all the dengue sequences:
ingest/results/metadata_all.tsv
ingest/results/sequences_all.fasta
A pair of files for each dengue serotype (denv1 - denv4)
ingest/results/metadata_denv1.tsv
ingest/results/sequences_denv1.fasta
ingest/results/metadata_denv2.tsv
ingest/results/sequences_denv2.fasta
ingest/results/metadata_denv3.tsv
ingest/results/sequences_denv3.fasta
ingest/results/metadata_denv4.tsv
ingest/results/sequences_denv4.fasta
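As a quick sanity check after a run, the ten result files listed above can be enumerated and tested for presence. The following is a minimal sketch, not part of the pipeline: the function names `expected_ingest_files` and `missing_ingest_files` are hypothetical, and treating an empty file as missing is an assumption.

```shell
# Hypothetical helpers (not part of the ingest pipeline).

# Print the ten result paths described above, relative to the repo root.
expected_ingest_files() {
  for name in all denv1 denv2 denv3 denv4; do
    printf '%s\n' "ingest/results/metadata_${name}.tsv"
    printf '%s\n' "ingest/results/sequences_${name}.fasta"
  done
}

# Print every expected path that is absent or empty under the given
# repo root (default: current directory). Assumption: an empty file
# counts as missing. Returns non-zero if anything is missing.
missing_ingest_files() {
  root="${1:-.}"
  status=0
  for path in $(expected_ingest_files); do
    if [ ! -s "${root}/${path}" ]; then
      printf '%s\n' "${path}"
      status=1
    fi
  done
  return "${status}"
}
```

From the top-level repo directory, `missing_ingest_files || echo "ingest output incomplete"` would report any gaps after `nextstrain build ingest` finishes.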
Run the complete ingest pipeline and upload results to AWS S3 with
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
ingest \
upload_all \
--configfile build-configs/nextstrain-automation/config.yaml
Do the following to include sequences from static FASTA files.
- Convert the FASTA files to NDJSON files with:
  ./ingest/scripts/fasta-to-ndjson \
    --fasta {path-to-fasta-file} \
    --fields {fasta-header-field-names} \
    --separator {field-separator-in-header} \
    --exclude {fields-to-exclude-in-output} \
    > ingest/data/{file-name}.ndjson
- Add the following to the .gitignore to allow the file to be included in the repo:
  !ingest/data/{file-name}.ndjson
- Add the file-name (without the .ndjson extension) as a source to ingest/defaults/config.yaml. This will tell the ingest pipeline to concatenate the records to the GenBank sequences and run them through the same transform pipeline.
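To illustrate what the field names and separator in the conversion step mean, here is a rough stand-in written with awk. It is not the repository's `fasta-to-ndjson` script and may differ from its actual output: the function name `fasta_to_ndjson_sketch` and the `sequence` key are assumptions, and the `--exclude` option is not modeled.

```shell
# Hypothetical sketch, NOT the real ingest/scripts/fasta-to-ndjson.
# Reads FASTA on stdin and prints one JSON object per record.
#   $1 = header separator (a single character, e.g. "|")
#   $2 = space-separated names for the header fields, in order
# Assumption: the sequence is stored under the key "sequence".
fasta_to_ndjson_sketch() {
  awk -v sep="$1" -v names="$2" '
    # Escape backslashes and double quotes (control characters are
    # assumed not to occur in FASTA headers).
    function json_escape(s,   out, i, c) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "\"") out = out "\\"
        out = out c
      }
      return out
    }
    # Emit the record collected so far, if any.
    function flush_record(   i, line) {
      if (!have_record) return
      line = "{"
      for (i = 1; i <= n_fields; i++)
        line = line "\"" json_escape(field[i]) "\": \"" json_escape(value[i]) "\", "
      print line "\"sequence\": \"" seq "\"}"
    }
    BEGIN { n_fields = split(names, field, " ") }
    /^>/ {
      flush_record()
      split(substr($0, 2), value, sep)  # header text without ">"
      seq = ""
      have_record = 1
      next
    }
    { seq = seq $0 }                    # join wrapped sequence lines
    END { flush_record() }
  '
}
```

For example, `fasta_to_ndjson_sketch '|' 'accession serotype' < example.fasta` (with `example.fasta` a placeholder) would turn a header such as `>OQ1|denv2` into a record with `accession` and `serotype` keys plus the joined sequence.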
Configuration takes place in defaults/config.yaml by default.
Optional configs for uploading files are in build-configs/nextstrain-automation/config.yaml.
The complete ingest pipeline with AWS S3 uploads uses the following environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
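Before starting an upload run, it can help to confirm that both variables are set in the current shell. The following is a small sketch under that assumption; the function name `missing_aws_env` is hypothetical and not part of the pipeline.

```shell
# Hypothetical pre-flight check (not part of the ingest pipeline).
# Prints the name of each required variable that is unset or empty
# and returns non-zero if any is missing. Values are never printed.
missing_aws_env() {
  status=0
  for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "value=\${${var}:-}"
    if [ -z "${value}" ]; then
      printf '%s\n' "${var}"
      status=1
    fi
  done
  unset value
  return "${status}"
}
```

One possible use is to guard the upload command, for example `missing_aws_env && nextstrain build --env AWS_ACCESS_KEY_ID --env AWS_SECRET_ACCESS_KEY ingest upload_all --configfile build-configs/nextstrain-automation/config.yaml`. Note that this sketch only checks that the variables are non-empty, not that they are exported or valid.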
These are optional environment variables used in our automated pipeline.
GITHUB_RUN_ID - provided via github.run_id in a GitHub Action workflow
AWS_BATCH_JOB_ID - provided via AWS Batch Job environment variables
GenBank sequences and metadata are fetched via NCBI datasets.