Home

Welcome to the MonopogenLite wiki!

MonopogenLite Germline SNV calling and phasing from single-cell sequencing data (for macOS Sequoia and Linux Rocky8).

This is a fork of the original MonopogenLite which works with python (3.9+), and samtools and bcftools (v1.21), vcftools (v0.1.16), and tabix (htslib, v1.21), as well as in the context of Rocky8 (Linux) and macOS Sequoia with brew.

MonopogenLite was forked from Monopogen and edited as such to accommodate the work in MetaPlaq. In MetaPlaq we meta-analyzed 140+ samples with single-cell RNA and ATAC sequencing data which have varying degrees of sequencing quality and depth. Many of these datasets are of limited use for reliable de novo genotype calling. The main focus is on genetic ancestry inference and cis-acting expression quantitative trait loci (eQTL).

MonopogenLite is a light-version of Monopogen and only includes the germline calling and phasing of single-nucleotide variants (SNVs) from single-cell sequencing data.

It has a few improvements:

✅ Works cross-platform on macOS Sequoia and Linux Rocky8.
✅ Works with the newest versions of bcftools, samtools, vcftools, tabix, and htslib.
✅ Works with the version 4.1 of BEAGLE which still includes the gl option.
✅ Works with python 3.9+.
✅ Reproducible workflow to include a genome reference through refgenie.
✅ Reproducible workflow to include 1000G phase 3 high-coverage b38 data including 3,202 individuals.
✅ Streamline code.
✅ Added --debug, --logFile, --version flags.
✅ Removed hard-coding of --platform-library; now works with smartseq2 and celseq2, aside of 10x data.
✅ Removed hard-coding of --nthreads for the phasing step executed through BEAGLE.
✅ Improved handling of per-sample or per-cell SNV calling and phasing.
✅ Added script to prepare outputs for TOPMed imputation.
✅ Code improved to only calls bi-allelic SNVs; no structural, INDELs or multi-allelic variants are called.
✅ Added handling of chromosome X - Phased haploid male samples in the phased imputation panel are now converted from "0" or "1" to "0|0" or "1|1", respectively.
✅ Refactored code to improve readability and maintainability, and cross-platform compatibility.
✅ Utilities work with regular cpu or gpu.
✅ Various utility-scrips:
- ✅ compare_vcf2ref.py -- to compare a given VCF-file to a common reference VCF-file and list the non-overlapping variants.
- ✅ variant_ref_create.sh -- create a reference VCF-file from a given set of BAM-files.
- ✅ variant_ref_checker.py -- apply bcftools stat to calculate statistics on the newly generated reference file (including plotting). Including variant_ref_checker_submit.sh to submit the variant_ref_checker.py script to a job scheduler.
- ✅ makediploidmalesX.py -- make haploid males diploid, or reverse male diploid genotypes to haploid on chromosome X in a given VCF-file. Including makediploidmalesX_submit.sh to submit the makediploidmalesX.py script to a job scheduler.
- ✅ cellsnp_vcf_merger.py -- merge per-sample VCF files into 1 VCF per study. Including cellsnp_vcf_merger_submit.sh to submit the cellsnp_vcf_merger.py script to a job scheduler.

That said, MonopogenLite is just a light-weight version of Monopogen, originally developed by the Ken Chen Lab at the MD Anderson Cancer Center. All credit goes to the original authors and contributors of the concepts underlying Monopogen; we simply adapted and improved it in some functional and coding areas.

Installation

macOS

When you use macOS you can use brew to install missing packages.

Here, we make sure to have gzip through brew.

brew install gzip

Rocky8

You probably have no admin rights on a HPC, so nothing to report here.

Step 1: install R

On Rocky8

cd /path_to_software
mkdir -v R-4.4.1 # (currently the latest version)
mkdir -v tmp
cd /tmp
wget http://cran-mirror.cs.uu.nl/src/base/R-3/R-4.4.1.tar.gz
tar -zxvf R-4.4.1.tar.gz
cd R-4.4.1
./configure --prefix=/path_to_software/R-4.4.1 --enable-java=no
make
make install

On macOS Sequoia with brew

Make sure to get the binary version of r instead of building it from sources - these prevents a lot of issues down the road.

brew install --cask r

Installation R packages

Next, install some R packages:

install.packages("e1071")
install.packages("ggplot2")
install.packages("data.table")

Step 2: create monopogen environment

We have mambaforge3 installed on our cluster (and macOS Sequoia), so:

First, we create an environment.

mamba create --name monopogen python=3.9

Next, we activate it.

mamba activate monopogen

Now, we're ready to install some packages.

mamba install openjdk pandas numpy scipy pillow polars

Note: if your system doesn't support multithreading such as a HPC does, install polars-lts-cpu instead of polars.

We need pysam too to handle VCF-files:

pip install pysam

And for some plots we'll need cmcrameri:

pip install cmcrameri

As well as tectonic to create PDF-files:

mamba install -c conda-forge tectonic

We also want samtools and bcftools.

Note: Monopogen was written based on these specific versions of samtools==1.2 and bcftools==1.8. Changing these to newer versions causes problems downstream. Disadvantage of using these older versions:

index is not implemented in samtools v1.2 (samtools-htslib-API: bam_index_build2() not yet implemented)

other functions and speed improvements

MonopogenLite was adapted to work with these specific versions: vcftools (v0.1.16), samtools and bcftools at v1.21, and tabix and htslib at v1.20.

mamba install -c bioconda samtools=1.21 bcftools=1.21 vcftools=0.1.16

It could be that bcftools --version is not working for you; you'll get bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory. You can fix this by doing the following:

mamba install -c bioconda openssl
mamba remove bcftools
mamba install -c bioconda bcftools=1.21

This should fix the error.

Lastly, we make sure to get tabix and bgzip which are part of htslib.

mamba install htslib=1.21

For the imputation and phasing we will require BEAGLE v4.1 which can be downloaded here. The reason for using v4.1 is that this version can handle genotype likelihood using the gl for the phasing step in MonopogenLite.

Here we create a folder to put BEAGLE into, and download version 4.1.

mkdir -v path/to/beagle
cd path/to/beagle
wget https://faculty.washington.edu/browning/beagle/beagle.27Jan18.7e1.jar
cd ../
chmod -Rv a+rwx beagle

Next, we symlink it to the apps directory.

cd path/to/MonopogenLite
ln -sv path/to/beagle/beagle.27Jan18.7e1.jar apps/beagle.jar

Step 3: references

Lastly, MonopogenLite uses the human reference to map the sequence to and 1000G to infer known variants.

Step 3a: refgenie

To make the management of the reference more generic we make use of refgenie.

Let's install refgenie as per instructions detailed here.

pip install refgenie

See if your install worked by calling refgenie -h on the command line. If the refgenie executable in not in your $PATH, append this to your .bashrc or .profile on Linux (or .bash_profile on macOS):

export PATH=~/.local/bin:$PATH

Initial configuration

If you're using refgenie for the first time you'll need to initialize your genome_folder and configuration file. Just select a folder where you want your genome assets to live, and then try:

refgenie init -c /path/to/genome_folder/genome_config.yaml

The refgenie commands all require knowing where this genome config file (genome_config.yaml) is. You can pass it on the command line all the time (using the -c parameter), but this gets old. An alternative is to set up the $REFGENIE environment variable like so:

export REFGENIE=/path/to/genome_config.yaml

Refgenie will automatically use the config file in this environmental variable if it exists. Add this to your .bashrc or .profile if you want it to persist for future command-line sessions.

export REFGENIE=/path/to/genome_config.yaml

You can always specify -c if you want to override the value in the $REFGENIE variable on an ad-hoc basis.

Listing assets Now you can use the refgenie list command to show local assets (which will be empty at first) or the listr command to show available remote assets:

refgenie list
refgenie listr

Populate some assets

After installing refgenie, we have to make sure to install the human b38 fasta-data. So, we are going to populate your genome folder with a few assets. You can either pull existing assets or build your own. Refgenie will manage them the same way.

refgenie pull hg38/fasta

Step 3b: 1000G

To infer known variants, we make use of 1000 Genome data re-called and mapped to b38. This directory contains 1000 Genomes Project (1kGP) high-coverage Illumina integrated phased panel which includes filtered SNV, INDEL, and SV (large deletions (DEL), insertions (INS), duplications (DUP), and inversions (INV)) variant calls across all 3,202 1kGP samples, i.e. including samples from 'ALL' populations.

To properly download and parse the reference, use the following command:

cd path/to/MonopogenLite
bash src/variant_ref_create.sh --resource-dir $(pwd)/resources --verbose

These 1000G reference data are phased, but the males are all haploid on chromosome X; meaning genotype 0 (or 1) instead of 0|0 (or 1|1). So, you will have to fix the haploid males on chromosome X to trick BEAGLE into also phasing chromosome X in MonopogenLite.

bash src/submit_makediploidmalesX.sh --input $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz --output $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz --mpg-dir $(pwd)

You can use --verbose and --debug to get extensive output for each step, including which variants are processed. However, this produces a lot of data - I'd advise against it. If you want to receive an email when it's FAIL you should set --user. Some systems require you provide the --account - the default is dhl_ec. Sometimes the system is very slow - it may have lower tier cpus. You could use --partition gpu to set it to gpu. But I found that setting a different --chunk-size is more effective. The default is --chunk-size 100. Also, the default uses --cpus 4, but it may be overkill and you can set it to a lower number.

After this script has run, you'll need to clean up a bit. I did not add an automatic mv (moving) of files, nor did I add an automatic removal (rm) of intermediate files. You should check yourself what is produced, and decide whether you want to keep them - they could be handy in other scenarios not related to MonopogenLite.

In either case, prior to deleting anything you must do this:

mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.haploid.chrX.vcf.gz
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz.tbi resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.haploid.chrX.vcf.gz.tbi
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz.tbi resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz.tbi

This way, you keep the phased haploid data too.

And afterwards, should you decide to want to delete intermediate files, you can delete these. Just make sure to keep all these files:

1kGP_high_coverage_Illumina.SNVonly_poly.filtered_${AF_FIELD}_${AF_SCI}.norm.fixvariantid.chr*

Note: Originally, Monopogen was developed to infer variants on the autosomal chromosomes. MonopogenLite includes inference of X chromosome too.

Optionally, you can check the reference files you just created by providing just one VCF-file of a (randomly chosen) chromosome. To do so, use this:

bash src/submit_variant_ref_checker.sh --input $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr1.vcf.gz --mpg-dir $(pwd) --verbose

This will automatically create stats and plots for each chromosome. You can inspect these to make sure the reference was created properly.

CellSNP

You may want to compare with cellsnp. The great thing is that variant_ref_create.sh automatically creates the necessary input for cellsnp. A big difference is that cellsnp does not use BEAGLE to phase the data (which may be perceived as a 'con'), and it only requires a VCF-file with a list of variants; no (phased) genotypes are needed.

You'll note these files 1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr#.cellsnp. and 1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr1_22X.cellsnp.vcf.gz in the resources directory. The former are reference files formatted according to cellsnp-specifications per chromosome, the latter is the combined list of variants as a reference file to be used by cellsnp.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Installation

macOS

Rocky8

Step 1: install R

Step 2: create monopogen environment

Step 3: references

Step 3a: refgenie

Step 3b: 1000G

CellSNP

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally