mcdr-mtb-utilities provides scripts to generate VCF files from Mycobacterium tuberculosis (MTB) WGS data. These VCF files can be used as input for the mcdr-mtb webserver, which predicts drug resistance from Variant Call Format (VCF) files.
There are 2 different scripts:
- variant_call.sh - Generate VCF file from MTB WGS data (from .fastq to .vcf)
- merge_vcf.sh - Merge multiple VCF files into a single merged.vcf
The scripts depend on the following packages/tools:
- trim-galore (version 0.6.7) - quality check and trimming of read sequences
- bwa (0.7.17-r1188) - reference based alignment
- samtools (1.13) - processing the BAM files
- freebayes (v1.3.6) - variant calling
- libvcflib-tools (1.0.7) - processing the VCF files
- libvcflib-dev (1.0.7) - processing the VCF files
- bgzip (1.13+ds) - zipping files
Step 1: Install dependent packages/tools
For Ubuntu:
sudo apt-get install trim-galore bwa samtools freebayes libvcflib-tools libvcflib-dev bgzip
The installation steps for the individual packages/tools are given at the following links:
- trim-galore - https://github.com/FelixKrueger/TrimGalore
- samtools, bcftools, bgzip (HTSlib) - http://www.htslib.org/download/
- freebayes - https://github.com/freebayes/freebayes
- vcflib - https://github.com/vcflib/vcflib
R must also be installed on the user's system. R installation steps are given at https://cran.r-project.org/.
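After installing the tools from Step 1, it can help to confirm that they are actually found on the PATH. The following is a minimal sketch, not part of mcdr-mtb-utilities: `missing_tools` is a hypothetical helper, and the executable names passed to it are assumptions (package names and executable names can differ, for example the trim-galore package provides an executable called trim_galore).

```shell
# Hypothetical helper: print every named tool that is not found on PATH.
# Returns 1 if at least one tool is missing, 0 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call (executable names are assumptions):
# missing_tools trim_galore bwa samtools freebayes bgzip bcftools
```

An empty output means every listed executable was found.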
Step 2: Install mcdr-mtb-utilities
I. Download the software from GitHub repository
Create a clone of the repository
git clone https://github.com/AbhirupaGhosh/mcdr-mtb-utilities
Note: Creating a clone of the repository requires git to be installed.
git can be installed using
sudo apt-get install git
OR
Download using wget
wget https://github.com/AbhirupaGhosh/mcdr-mtb-utilities/archive/refs/heads/main.zip
unzip main.zip
Note: wget can be installed using
sudo apt-get install wget
II. Make the shell scripts executable
cd INSTALLATION_DIR/mcdr-mtb-utilities
chmod +x config.sh variant_call.sh merge_vcf.sh
INSTALLATION_DIR = directory where mcdr-mtb-utilities is installed
III. Update the paths in config.sh (optional)
The config.sh file looks like
freebayes_path=/usr/bin/freebayes
samtools_path=/usr/bin/samtools
bwa_path=/usr/bin/bwa
trim_galore_path=/usr/bin/trim_galore
vcflib_path=/usr/bin/vcflib
bgzip_path=/usr/bin/bgzip
bcftools_path=/usr/bin/bcftools
trim_galore_cores=4
bwa_mem_cores=4
samtools_cores=4
Note: These are the default paths of the executable files for freebayes, samtools, bwa, Trim Galore!, vcflib, bgzip and bcftools. Users need to update these paths if the tools were installed by any means other than the apt-get install command.
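Before running the scripts, the paths in config.sh can be sanity-checked. The sketch below is an illustration only and not part of mcdr-mtb-utilities: `check_config_paths` is a hypothetical helper that assumes config.sh holds plain KEY=VALUE lines, as in the listing above, and treats every key ending in `_path` as a tool location.

```shell
# Hypothetical helper: read KEY=VALUE lines from a config file and report
# every *_path entry whose value does not pass the shell's -x test.
# Returns 1 if any entry fails, 0 otherwise.
check_config_paths() {
    config_file=$1
    status=0
    # "|| [ -n "$key" ]" also handles a last line without a trailing newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            *_path)
                if [ ! -x "$value" ]; then
                    printf 'not found or not executable: %s (%s)\n' "$key" "$value"
                    status=1
                fi
                ;;
        esac
    done < "$config_file"
    return "$status"
}

# Example call:
# check_config_paths INSTALLATION_DIR/mcdr-mtb-utilities/config.sh
```

Entries such as trim_galore_cores are ignored because their keys do not end in `_path`.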
First, change to the directory where mcdr-mtb-utilities is installed:
cd INSTALLATION_DIR/mcdr-mtb-utilities
Different operations can be performed by calling the appropriate scripts with two command-line arguments: INPUT_DIR and OUTPUT_DIR.
INPUT_DIR = the path (absolute or relative) of the folder containing the input files.
OUTPUT_DIR = the path (absolute or relative) of the folder in which mcdr-mtb-utilities will store the outputs.
The script to run and the contents of INPUT_DIR and OUTPUT_DIR depend on the chosen operation. The different operations are explained below.
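The two-argument convention can be checked before a script is launched. This is a minimal sketch under stated assumptions: `validate_dirs` is a hypothetical helper, and whether the scripts themselves create a missing OUTPUT_DIR is not stated here, so the sketch creates it defensively.

```shell
# Hypothetical helper: validate the INPUT_DIR / OUTPUT_DIR argument pair.
# Returns 2 on a wrong argument count, 1 if INPUT_DIR is not a directory.
validate_dirs() {
    if [ "$#" -ne 2 ]; then
        echo "usage: SCRIPT INPUT_DIR OUTPUT_DIR" >&2
        return 2
    fi
    if [ ! -d "$1" ]; then
        echo "input directory not found: $1" >&2
        return 1
    fi
    # Assumption: creating OUTPUT_DIR up front is harmless if it already exists.
    mkdir -p "$2"
}

# Example call:
# validate_dirs INPUT_DIR OUTPUT_DIR && ./variant_call.sh INPUT_DIR OUTPUT_DIR
```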
./variant_call.sh INPUT_DIR OUTPUT_DIR
INPUT_DIR must contain the paired-end FASTQ files (ISOLATE1_1.fastq.gz & ISOLATE1_2.fastq.gz) of one isolate.
OUTPUT_DIR will contain a folder for each ISOLATE ID (ISOLATE_DIR).
Each folder will contain
- the VCF file (ISOLATE1.vcf)
- the intermediate BAM files (ISOLATE1.bam, ISOLATE1_sorted.bam)
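The naming convention for paired-end input (ISOLATE1_1.fastq.gz and ISOLATE1_2.fastq.gz) can be verified before starting a run. The sketch below is an illustration, not the logic used by variant_call.sh: `list_paired_isolates` is a hypothetical helper that derives the isolate ID from each `_1.fastq.gz` file name and checks that the matching `_2.fastq.gz` file is present.

```shell
# Hypothetical helper: print the isolate ID of every *_1.fastq.gz file in a
# directory that has a matching *_2.fastq.gz mate. Unpaired files are
# reported on stderr.
list_paired_isolates() {
    input_dir=$1
    for r1 in "$input_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        isolate=$(basename "$r1" _1.fastq.gz)
        if [ -e "$input_dir/${isolate}_2.fastq.gz" ]; then
            printf '%s\n' "$isolate"
        else
            printf 'missing mate for %s\n' "$isolate" >&2
        fi
    done
}

# Example call:
# list_paired_isolates INPUT_DIR
```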
./merge_vcf.sh INPUT_DIR OUTPUT_DIR
INPUT_DIR must contain one or more VCF files (ISOLATE1.vcf, ISOLATE2.vcf) of MTB isolates.
OUTPUT_DIR will contain the merged.vcf file along with the compressed VCF files and their index files.
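To make the compress, index and merge steps concrete, the sketch below prints (without running) one plausible command sequence. It is an assumption about how such a step could be done with bgzip and bcftools, both of which appear in config.sh; the commands actually used by merge_vcf.sh may differ. `plan_merge` is a hypothetical helper.

```shell
# Hypothetical helper: print, but do not execute, commands that would
# compress and index each VCF in INPUT_DIR and merge them into merged.vcf.
# Assumption: bgzip and bcftools are the tools used for these steps.
plan_merge() {
    input_dir=$1
    output_dir=$2
    gz_list=""
    for vcf in "$input_dir"/*.vcf; do
        [ -e "$vcf" ] || continue    # the glob matched nothing
        gz="$output_dir/$(basename "$vcf").gz"
        echo "bgzip -c $vcf > $gz"
        echo "bcftools index $gz"
        gz_list="$gz_list $gz"
    done
    if [ -z "$gz_list" ]; then
        echo "no VCF files in $input_dir" >&2
        return 1
    fi
    echo "bcftools merge -O v -o $output_dir/merged.vcf$gz_list"
}

# Example call:
# plan_merge INPUT_DIR OUTPUT_DIR
```

Printing the plan first makes it easy to inspect file names before any file is written.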
- Create an input directory
mkdir /home/ss-uac-3/Input_Dir1
- Get data
Download the whole genome sequencing FASTQ files of an MTB isolate run, ERR137249 (ERR137249_1.fastq & ERR137249_2.fastq), from https://www.ebi.ac.uk/ena/browser/view/ERR137249
- Store these files in Input_Dir1
- Create an output directory
mkdir /home/ss-uac-3/Output_Dir1
- Go to the mcdr-mtb-utilities installation directory
cd /home/ss-uac-3/Documents/mcdr-mtb-utilities/
- Run variant_call.sh
./variant_call.sh /home/ss-uac-3/Input_Dir1/ /home/ss-uac-3/Output_Dir1/
Input_Dir1 contains ERR137249_1.fastq, ERR137249_2.fastq
Output_Dir1 contains -
- Folder - ERR137249
- ERR137249.tsv
The ERR137249 folder contains -
- reference folder - reference genome and index files
- Trim galore outputs - ERR137249_1_val_1.fq.gz, ERR137249_2_val_2.fq.gz, ERR137249_1_trimming_report.txt, ERR137249_2_trimming_report.txt
- Bwa-mem output - ERR137249.bam
- Intermediate BAM files - ERR137249_fix.bam, ERR137249_namesort.bam, ERR137249_positionsort.bam, ERR137249_markdup.bam
- BAM index - ERR137249.bam.bai
- Freebayes output - ERR137249.vcf
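After the run, the presence of the main outputs listed above can be checked quickly. This is a minimal sketch: `check_isolate_outputs` is a hypothetical helper, and the choice to check only the BAM file, its index and the VCF file is an assumption about which outputs matter most.

```shell
# Hypothetical helper: report expected output files that are missing or
# empty for one isolate. Returns 1 if any file fails the check.
check_isolate_outputs() {
    output_dir=$1
    isolate=$2
    status=0
    for f in "$isolate.bam" "$isolate.bam.bai" "$isolate.vcf"; do
        if [ ! -s "$output_dir/$isolate/$f" ]; then
            printf 'missing or empty: %s\n' "$output_dir/$isolate/$f"
            status=1
        fi
    done
    return "$status"
}

# Example call for the run above:
# check_isolate_outputs /home/ss-uac-3/Output_Dir1 ERR137249
```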
- Create an input directory
mkdir /home/ss-uac-3/Input_Dir2
- Get data
Download the whole genome sequencing FASTQ files of two MTB isolate runs, ERR137249 (ERR137249_1.fastq & ERR137249_2.fastq) and SRR1103491 (SRR1103491_1.fastq & SRR1103491_2.fastq), from https://www.ebi.ac.uk/ena/browser/view/ERR137249 and https://www.ebi.ac.uk/ena/browser/view/SRR1103491
- Store these files in Input_Dir2
- Create an output directory
mkdir /home/ss-uac-3/Output_Dir2
- Go to the mcdr-mtb-utilities installation directory
cd /home/ss-uac-3/Documents/mcdr-mtb-utilities/
- Run merge_vcf.sh
./merge_vcf.sh /home/ss-uac-3/Input_Dir2/ /home/ss-uac-3/Output_Dir2/
Input_Dir2 contains ERR137249_1.fastq, ERR137249_2.fastq, SRR1103491_1.fastq, SRR1103491_2.fastq
Output_Dir2 contains -
- Two folders - ERR137249, SRR1103491
- merged.vcf
Each of the ERR137249 and SRR1103491 folders contains -
- reference folder - reference genome and index files
- Trim galore outputs - ISOLATENAME_1_val_1.fq.gz, ISOLATENAME_2_val_2.fq.gz, ISOLATENAME_1_trimming_report.txt, ISOLATENAME_2_trimming_report.txt
- Bwa-mem output - ISOLATENAME.bam
- Intermediate BAM files - ISOLATENAME_fix.bam, ISOLATENAME_namesort.bam, ISOLATENAME_positionsort.bam, ISOLATENAME_markdup.bam
- BAM index - ISOLATENAME.bam.bai
- Freebayes output - ISOLATENAME.vcf
Abhirupa Ghosh, Sudipto Bhattacharjee and Sudipto Saha
The scripts were developed and tested on the Ubuntu Operating system.