Owned by the Global Tuberculosis Programme (GTB), Geneva, Switzerland. Reference: Carl-Michael Nathanson.
This repository holds both the Terraform configuration files and the Python application code that handle synchronization between the local database and the INSDC.
This repository must be deployed after the main infrastructure repository.
You can use a local backend for deployment, or the same S3 and DynamoDB backend you may have set up for the main infrastructure repository. Be careful to set a new key for the Terraform state object file. We use a backend file with empty values for the resources that we supply via the command line and secret variables during CI/CD.
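As a sketch, such a backend file might look like the following (the key shown is only an example; the empty values are the ones supplied via `-backend-config` arguments or secret variables during CI/CD):

```hcl
# backend.hcl -- example only; empty values are filled in at `terraform init` time
bucket         = ""                              # supplied during CI/CD
dynamodb_table = ""                              # supplied during CI/CD
region         = ""                              # supplied during CI/CD
key            = "insdc-sync/terraform.tfstate"  # a NEW key, distinct from the main repository's state
```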
The repository will deploy the following:
- AWS Batch resources for handling jobs
- EventBridge rules to schedule synchronization (disabled by default)
- Step Function workflow for orchestrating the jobs
For deployment, there are no dependencies other than having the main infrastructure repository deployed.
There are only three Terraform variables to set up:
- project_name
- environment
- aws_region
They must match the values used in the main repository.
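These can be supplied via a `terraform.tfvars` file; the values below are placeholders and must be replaced with the ones used in the main repository:

```hcl
# terraform.tfvars -- example values only
project_name = "my-project"
environment  = "dev"
aws_region   = "eu-central-1"
```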
For CI/CD, the following variables are necessary:
| Variable/Secret for CI/CD | Description |
|---|---|
| AWS_ACCOUNT_ID | |
| AWS_REGION | |
| PROJECT_NAME | As defined in the infrastructure repository |
One of the Step Functions workflows will run daily (if enabled) and synchronize our database with new sample data available at the INSDC. All synchronization is performed using the tools of the NCBI Entrez API (esearch, efetch, elink). The logic is as follows:
- Search for newly submitted sequencing data with the term “MYCOBACTERIUM” in the organism attribute value.
- Insert the new records into the sequencingdata table.
- Using elink, search for all BioSamples associated with the newly inserted sequencing data and insert the new records into our sample and samplealias tables.
- Using elink, search for all BioProjects associated with the newly inserted samples and insert the new records into our package table.
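The esearch/elink steps above can be sketched as follows. The endpoints are the standard NCBI E-utilities ones, but the helper names, the search term form, and the example IDs are illustrative, not the repository's actual implementation:

```python
from urllib.parse import urlencode

# Base URL of the NCBI Entrez E-utilities
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 500) -> str:
    """Build an esearch request URL, e.g. to find newly submitted SRA records."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})

def elink_url(dbfrom: str, db: str, ids: list[str]) -> str:
    """Build an elink request URL to walk from one Entrez database to another."""
    return f"{EUTILS}/elink.fcgi?" + urlencode({"dbfrom": dbfrom, "db": db, "id": ",".join(ids)})

# Step 1: search new sequencing data matching MYCOBACTERIUM in the organism field
search = esearch_url("sra", "MYCOBACTERIUM[Organism]")
# Step 3: find the BioSamples linked to the returned SRA records (IDs are placeholders)
links = elink_url("sra", "biosample", ["123", "456"])
```

The same `elink_url` helper covers the BioSample-to-BioProject step by changing `dbfrom` and `db`.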
Our database is also synchronized with the NCBI taxonomy, via the tables that we imported from the BioSQL schema. We use Perl scripts provided by the BioSQL community to synchronize our taxonomy, which we do every 3 months.
You will need to build and deploy a Docker image containing the Python application code that handles synchronization. An AWS ECR repository has been deployed by the main repository and will receive the Docker image. Refer to our action file and reusable workflow for building and pushing the image.
For the synchronization logic to succeed, you will need to:
- Deploy the backend repository
- Run the migration in ECS
- Fill in the NCBI authentication secrets that were created in AWS Secrets Manager
- Run the taxonomy synchronization first, then the INSDC synchronization
All entries in Entrez have an ID called an "accession" (similar to an accession key). It is a numeric ID with a prefix that identifies the type of object it provides access to:
- experiment_accession (SRX)
- study_accession (SRP)
- run_alias (GSM_r)
- sample_alias (GSM_)
- experiment_alias (GSM)
- study_alias (GSE)
- STUDY with accessions in the form of SRP#, ERP#, or DRP#
- SAMPLE with accessions in the form of SRS#, ERS#, or DRS#
- EXPERIMENT with accessions in the form of SRX#, ERX#, or DRX#
- RUN with accessions in the form of SRR#, ERR#, or DRR#

The first letter of the accession indicates the source database: S for SRA, E for EBI, and D for DDBJ, respectively.
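The prefix rules above can be captured in a small helper. This is a sketch (the function name is ours, not from the repository); the mapping tables mirror the accession list above:

```python
# First letter -> source database, per the list above
SOURCE_DBS = {"S": "SRA", "E": "EBI", "D": "DDBJ"}
# Third letter -> record type (P = STUDY, S = SAMPLE, X = EXPERIMENT, R = RUN)
RECORD_TYPES = {"P": "STUDY", "S": "SAMPLE", "X": "EXPERIMENT", "R": "RUN"}

def classify_accession(accession: str) -> tuple[str, str]:
    """Return (source database, record type) for an INSDC accession like 'SRX123'."""
    if len(accession) < 3 or accession[1] != "R":
        raise ValueError(f"Not a recognized INSDC accession: {accession!r}")
    source = SOURCE_DBS.get(accession[0])
    record = RECORD_TYPES.get(accession[2])
    if source is None or record is None:
        raise ValueError(f"Not a recognized INSDC accession: {accession!r}")
    return source, record

# classify_accession("SRX100") -> ("SRA", "EXPERIMENT")
# classify_accession("DRR42")  -> ("DDBJ", "RUN")
```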