Owned by the Global Tuberculosis Programme (GTB), Geneva, Switzerland. Reference: Carl-Michael Nathanson.
This repository holds both Terraform configuration files and Python application code for the bioinformatic processing of samples, as well as the ETL for running the association algorithm.
This repository must be deployed after the main infrastructure and ncbi-sync repositories.
The repository holds definitions for three different components of the bioinformatic processing:
- The infrastructure terraform code (devops/envs/)
- Docker image definitions used for sequencing data processing (containers/)
- PySpark ETL jobs for post-processing (cfn/glue-jobs)
You can check our GitHub Actions workflows in this repository for deploying each component.
You can use a local backend for deploying, or the same S3 and DynamoDB backend you may have set up for the main infrastructure repository. Be careful to set a new key for the Terraform state object. We use GitHub Actions secrets and command line arguments to set up the Terraform backend for CI/CD (see the terraform-plan and terraform-apply workflows).
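As an illustration only, the sketch below mimics what the CI workflows do: passing the backend settings to `terraform init` as command line arguments. The bucket, lock table, state key and working directory are hypothetical placeholders, not the values actually used by the project.

```python
"""Minimal sketch of initialising the Terraform S3/DynamoDB backend from a script.

All resource names below (bucket, DynamoDB table, state key) are hypothetical
placeholders; substitute the values created by the main infrastructure repository.
"""
import subprocess

backend_settings = {
    "bucket": "my-terraform-state-bucket",      # placeholder S3 bucket
    "key": "bioinformatics/terraform.tfstate",  # use a NEW key, distinct from the other repositories
    "region": "eu-central-1",                   # placeholder region
    "dynamodb_table": "my-terraform-locks",     # placeholder lock table
}

cmd = ["terraform", "init"] + [
    f"-backend-config={name}={value}" for name, value in backend_settings.items()
]

# Adjust cwd to the target environment directory under devops/envs/.
subprocess.run(cmd, cwd="devops/envs", check=True)
```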
Infrastructure includes:
- Eight Step Functions state machines that together enable the bioinformatic processing of the WGS Illumina data:
- Master, which orchestrates all the other operations. Runs every day.
- Creation of all the temporary AWS resources necessary to run a batch of bioinformatic analysis
- Downloading of the references from NCBI (reference TB genome) after resources have been created
- Per sample bioinformatic processing
- Deletion of temporary AWS resources necessary for batched analysis
- Data insertion, which will insert all newly created data (stored in S3) into our RDS database, after all samples have been processed
- Variant Annotation, which will create and insert into the RDS database the annotation for the newly identified variants
- Statistics calculation, which updates the tbsequencing web views and assigns drug resistance predictions (see below)
- Glue job definitions
- EventBridge rule to schedule daily execution of the bioinformatic workflow
- AWS Batch resources for running specific jobs
The Master state machine (see the execution sketch after this list):
- Checks whether there are new samples to be processed (if not, stops there)
- Creates all the temporary infrastructure necessary to process the samples
- Downloads all necessary reference files from the NCBI (child pipeline)
- Starts and handles processing of the per sample child pipeline for all queued samples (maximum concurrency is 40 samples)
- Copies all created files from the temporary infrastructure to S3 once processing of all samples is finished
- Deletes the temporary infrastructure
- Runs the data insertion, variant annotation, and statistics calculation state machines
- Updates the bioinformatic status of the processed samples
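Outside of the daily EventBridge schedule, an execution of such a state machine can also be started manually with boto3, as in the minimal sketch below. The state machine ARN and the empty input are assumptions, not the values deployed by the Terraform code.

```python
"""Hypothetical example: manually start an execution of the Master state machine.

The ARN below is a placeholder; look up the real one in the AWS console or in
the Terraform outputs under devops/envs/.
"""
import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-central-1:123456789012:stateMachine:master",  # placeholder
    input=json.dumps({}),  # assumption: the scheduled rule passes a default/empty input
)
print(response["executionArn"])
```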
The resource creation state machine (see the FSx sketch after this list):
- Creates an FSx volume (shared storage for intermediate files during bioinformatic processing)
- Creates a launch template for EC2 instances so that the newly created FSx volume is mounted at start up
- Creates a new Batch Compute Environment, which starts EC2 Spot Instances using the newly created launch template
- Creates a new Batch queue associated with the newly created Compute Environment
- Waits for the FSx volume to be ready (around 15 minutes)
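As an illustration only (the actual resources are created by the Step Functions states, and the capacity, subnet and security group below are placeholders), the FSx for Lustre create-and-wait step could look like this with boto3:

```python
"""Hypothetical sketch of the FSx for Lustre create-and-wait step.

Storage capacity, subnet and security group IDs are placeholders.
"""
import time
import boto3

fsx = boto3.client("fsx")

fs = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                       # GiB, placeholder size
    SubnetIds=["subnet-0123456789abcdef0"],     # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
)
fs_id = fs["FileSystem"]["FileSystemId"]

# Poll until the volume is usable; in practice this takes roughly 15 minutes.
while True:
    status = fsx.describe_file_systems(FileSystemIds=[fs_id])["FileSystems"][0]["Lifecycle"]
    if status == "AVAILABLE":
        break
    time.sleep(60)
```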
The reference download state machine (see the download sketch after this list):
- Downloads the reference TB genome from the NCBI
- Prepares all necessary indexes of the downloaded genome
- Extracts, compresses and indexes all known theoretical variants from the RDS database
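To make the reference download step concrete, here is a minimal, hypothetical sketch that fetches the H37Rv reference through the NCBI E-utilities and builds the usual indexes. The accession NC_000962.3 and the presence of bwa, samtools and gatk on the PATH are assumptions; in the deployed pipeline the equivalent runs inside AWS Batch jobs using the containers/ images.

```python
"""Hypothetical sketch of the reference download and indexing step.

Assumes the H37Rv reference accession NC_000962.3 and that bwa, samtools and
gatk are available on the PATH.
"""
import subprocess
import urllib.request

EFETCH = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=nuccore&id=NC_000962.3&rettype=fasta&retmode=text"
)

# Download the reference genome as FASTA.
urllib.request.urlretrieve(EFETCH, "NC_000962.3.fasta")

# Build the indexes used downstream by the aligner and the variant callers.
subprocess.run(["bwa", "index", "NC_000962.3.fasta"], check=True)
subprocess.run(["samtools", "faidx", "NC_000962.3.fasta"], check=True)
subprocess.run(
    ["gatk", "CreateSequenceDictionary", "-R", "NC_000962.3.fasta"], check=True
)
```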
The per sample processing (child) pipeline (see the command sketch after this list):
- Downloads the raw sequencing data (either from NCBI or from our S3 bucket where contributors upload their data to the tbsequencing portal)
- Aligns to the reference (bwa) and sorts the alignment (samtools)
- Performs taxonomy analysis
- Identifies genetic variants (gatk HaplotypeCaller, bcftools, freebayes)
- Calculates per gene and global sequencing QC stats
- Identifies deletions (delly)
- Formats all output files for RDS insertion via AWS Glue
- Updates the sample's bioinformatic status after successful or failed processing
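For orientation, the core alignment and variant calling steps can be approximated with the command lines below. This is a simplified, hypothetical sketch: file names, the read group string and option sets are placeholders and do not reflect the exact parameters used in production, where the jobs run in AWS Batch with the containers/ images.

```python
"""Hypothetical, simplified sketch of the per sample alignment and variant calling.

Sample and reference file names are placeholders; the reference indexes from the
reference download step are assumed to be present.
"""
import subprocess

ref = "NC_000962.3.fasta"
fq1, fq2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# Align with bwa mem (minimal read group so that GATK accepts the BAM) and sort with samtools.
with open("sample.sorted.bam", "wb") as bam:
    bwa = subprocess.Popen(
        ["bwa", "mem", "-R", "@RG\\tID:sample\\tSM:sample", ref, fq1, fq2],
        stdout=subprocess.PIPE,
    )
    subprocess.run(["samtools", "sort", "-"], stdin=bwa.stdout, stdout=bam, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem failed")
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)

# Call variants with one of the callers used by the pipeline (gatk HaplotypeCaller here).
subprocess.run(
    [
        "gatk", "HaplotypeCaller",
        "-R", ref,
        "-I", "sample.sorted.bam",
        "-O", "sample.g.vcf.gz",
        "-ERC", "GVCF",
    ],
    check=True,
)

# Call large deletions with delly.
subprocess.run(
    ["delly", "call", "-g", ref, "-o", "sample.delly.bcf", "sample.sorted.bam"],
    check=True,
)
```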
The data insertion state machine uses AWS Glue to insert the S3 files into the RDS database (see the PySpark sketch after this list):
- genotype (including deletion) calls
- per gene sequencing stats (median coverage etc)
- global summary sequencing stats
- taxonomy analysis stats
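A minimal sketch of that kind of Glue/PySpark insertion step might look as follows. Bucket, prefix, JDBC URL, credentials and table name are placeholders, and the real jobs use the Glue job framework (plus a JDBC driver on the classpath) rather than a bare SparkSession.

```python
"""Hypothetical sketch of an S3-to-RDS insertion step in PySpark.

Bucket, prefix, JDBC URL, credentials and table name are placeholders.
"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("genotype-insertion-sketch").getOrCreate()

# Read the formatted output files that the per sample pipeline wrote to S3.
genotypes = spark.read.parquet("s3://example-bucket/bioinfo-output/genotypes/")

# Append them into the target RDS (PostgreSQL) table over JDBC.
(
    genotypes.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://example-rds-host:5432/tbsequencing")
    .option("dbtable", "example_schema.genotype")  # placeholder schema/table
    .option("user", "etl_user")                    # placeholder; use Secrets Manager in practice
    .option("password", "change-me")
    .mode("append")
    .save()
)
```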
The variant annotation state machine (see the SnpEff sketch after this list):
- Creates temporary resources
- Requests from the database the new variants only (i.e. unannotated)
- Downloads references from the NCBI (GFF format)
- Processes the references and creates SnpEff configuration files
- Annotates the new variants and transforms them for loading into the database
- Normalizes the newly inserted data
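The annotation itself relies on SnpEff; a stripped-down, hypothetical invocation of a custom-built SnpEff database could look like this (the database name, configuration file and VCF paths are placeholders, and in the deployed pipeline SnpEff runs inside its dedicated container):

```python
"""Hypothetical sketch of the SnpEff annotation step.

Database name, configuration file and VCF paths are placeholders.
"""
import subprocess

with open("new_variants.annotated.vcf", "w") as annotated:
    subprocess.run(
        [
            "java", "-jar", "snpEff.jar",
            "-c", "snpeff.config",                # configuration generated by the previous step
            "Mycobacterium_tuberculosis_h37rv",   # placeholder database name
            "new_variants.vcf",                   # unannotated variants extracted from RDS
        ],
        stdout=annotated,
        check=True,
    )
```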
The statistics calculation state machine updates all tbsequencing web views and runs the AWS Glue jobs that assign drug resistance predictions from genotype data, for the new samples only.
Specific open source bioinformatic tools are needed for the sequencing data analysis. Their Docker images must be pushed to their respective AWS ECR repositories, which were created by the main infrastructure repository.
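Building and pushing one of those images follows the standard ECR flow, which the GitHub Actions workflows automate. The sketch below is only an illustration: the account ID, region, repository name and Dockerfile directory are placeholders.

```python
"""Hypothetical sketch of building one tool image and pushing it to ECR.

Account ID, region, repository name and build directory are placeholders.
"""
import subprocess

account = "123456789012"
region = "eu-central-1"
registry = f"{account}.dkr.ecr.{region}.amazonaws.com"
image = f"{registry}/bwa:latest"  # placeholder repository name

# Authenticate Docker against ECR.
login = subprocess.run(
    ["aws", "ecr", "get-login-password", "--region", region],
    capture_output=True, text=True, check=True,
)
subprocess.run(
    ["docker", "login", "--username", "AWS", "--password-stdin", registry],
    input=login.stdout, text=True, check=True,
)

# Build from the tool's directory under containers/ (placeholder path) and push.
subprocess.run(["docker", "build", "-t", image, "containers/bwa"], check=True)
subprocess.run(["docker", "push", image], check=True)
```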
We use Apache PySpark for most of our ETL logic. Some jobs simply prepare the data extracted from the bioinformatic analysis for insertion into the database, while others prepare the input files for the association algorithm. The most important Glue ETL jobs are described briefly below.
Simple transforms related to gene and protein names associated with genomic features. Never run as a main process; other scripts import its functions.
Binarizes MIC range values into categorical R/S results based on the epidemiological cut-off values table, and transforms all inserted phenotypic lab results (either binary R/S or MIC range values) into categorical results according to the expert rules. When run as the main process, this script writes an Excel report to an S3 bucket showing the lab result counts for each drug/medium pair and their associated classification. Also imported by other scripts.
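As a toy illustration of this kind of binarization (column names, the cut-off table and the drug/medium values are placeholders, and the real job also handles censored MIC ranges and the expert rules):

```python
"""Hypothetical, simplified sketch of MIC binarization against epidemiological cut-offs.

Column names and the cut-off table are placeholders.
"""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mic-binarization-sketch").getOrCreate()

mic = spark.createDataFrame(
    [("S1", "Isoniazid", "7H10", 0.5), ("S2", "Isoniazid", "7H10", 0.03)],
    ["sample_id", "drug", "medium", "mic_value"],
)
ecoff = spark.createDataFrame(
    [("Isoniazid", "7H10", 0.125)],  # illustrative cut-off, not an authoritative value
    ["drug", "medium", "ecoff"],
)

# A MIC above the epidemiological cut-off is classified resistant, otherwise susceptible.
binarized = (
    mic.join(ecoff, ["drug", "medium"], "left")
    .withColumn("result", F.when(F.col("mic_value") > F.col("ecoff"), "R").otherwise("S"))
)
binarized.show()
```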
Transforms the raw per variant annotation from the database into the final variant nomenclature used by the association algorithm (for instance rpoB_p.Ser450Leu). When run as the main process, this script writes a very large Excel file to an S3 bucket containing the mapping between variant coordinates on the reference genome sequence and the nomenclature, for any variant on any gene potentially linked to resistance to any drug (based on the data in the gene drug resistance association table). Imported by other scripts.
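Conceptually, the nomenclature joins the gene name to the HGVS-style consequence; a toy sketch of such a transform (column names and the example row are illustrative placeholders) could be:

```python
"""Hypothetical sketch of building the variant nomenclature (e.g. rpoB_p.Ser450Leu).

Column names and the example row are illustrative; the real job derives the HGVS
notation from the raw annotation stored in the database.
"""
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("variant-nomenclature-sketch").getOrCreate()

annotations = spark.createDataFrame(
    [(761155, "C", "T", "rpoB", "p.Ser450Leu")],  # illustrative coordinates
    ["position", "ref", "alt", "gene_name", "hgvs"],
)

# Final nomenclature used by the association algorithm: <gene>_<hgvs>.
nomenclature = annotations.withColumn(
    "variant_name", F.concat_ws("_", F.col("gene_name"), F.col("hgvs"))
)
nomenclature.show(truncate=False)
```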
Extracts tabular files ready to be used as input by the association algorithm. It builds on most of the ETL logic implemented in the other files and writes the tabular data to S3. Not imported by other scripts.