-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the MonopogenLite wiki!
MonopogenLite Germline SNV calling and phasing from single-cell sequencing data (for macOS Sequoia and Linux Rocky8).
This is a fork of the original MonopogenLite which works with python (3.9+), and samtools and bcftools (v1.21), vcftools (v0.1.16), and tabix (htslib, v1.21), as well as in the context of Rocky8 (Linux) and macOS Sequoia with brew.
MonopogenLite was forked from Monopogen and edited as such to accommodate the work in MetaPlaq. In MetaPlaq we meta-analyzed 140+ samples with single-cell RNA and ATAC sequencing data which have varying degrees of sequencing quality and depth. Many of these datasets are of limited use for reliable de novo genotype calling. The main focus is on genetic ancestry inference and cis-acting expression quantitative trait loci (eQTL).
MonopogenLite is a light-version of Monopogen and only includes the germline calling and phasing of single-nucleotide variants (SNVs) from single-cell sequencing data.
It has a few improvements:
- ✅ Works cross-platform on macOS Sequoia and Linux Rocky8.
- ✅ Works with the newest versions of
bcftools,samtools,vcftools,tabix, andhtslib. - ✅ Works with the version 4.1 of
BEAGLEwhich still includes thegloption. - ✅ Works with
python 3.9+. - ✅ Reproducible workflow to include a genome reference through
refgenie. - ✅ Reproducible workflow to include 1000G phase 3 high-coverage b38 data including 3,202 individuals.
- ✅ Streamline code.
- ✅ Added
--debug,--logFile,--versionflags. - ✅ Removed hard-coding of
--platform-library; now works with smartseq2 and celseq2, aside of 10x data. - ✅ Removed hard-coding of
--nthreadsfor the phasing step executed throughBEAGLE. - ✅ Improved handling of per-sample or per-cell SNV calling and phasing.
- ✅ Added script to prepare outputs for TOPMed imputation.
- ✅ Code improved to only calls bi-allelic SNVs; no structural, INDELs or multi-allelic variants are called.
- ✅ Added handling of chromosome X - Phased haploid male samples in the phased imputation panel are now converted from "0" or "1" to "0|0" or "1|1", respectively.
- ✅ Refactored code to improve readability and maintainability, and cross-platform compatibility.
- ✅ Utilities work with regular
cpuorgpu. - ✅ Various utility-scrips:
- ✅
compare_vcf2ref.py-- to compare a given VCF-file to a common reference VCF-file and list the non-overlapping variants. - ✅
variant_ref_create.sh-- create a reference VCF-file from a given set of BAM-files. - ✅
variant_ref_checker.py-- applybcftools statto calculate statistics on the newly generated reference file (including plotting). Includingvariant_ref_checker_submit.shto submit thevariant_ref_checker.pyscript to a job scheduler. - ✅
makediploidmalesX.py-- make haploid males diploid, or reverse male diploid genotypes to haploid on chromosome X in a given VCF-file. IncludingmakediploidmalesX_submit.shto submit themakediploidmalesX.pyscript to a job scheduler. - ✅
cellsnp_vcf_merger.py-- merge per-sample VCF files into 1 VCF per study. Includingcellsnp_vcf_merger_submit.shto submit thecellsnp_vcf_merger.pyscript to a job scheduler.
- ✅
That said, MonopogenLite is just a light-weight version of Monopogen, originally developed by the Ken Chen Lab at the MD Anderson Cancer Center. All credit goes to the original authors and contributors of the concepts underlying Monopogen; we simply adapted and improved it in some functional and coding areas.
When you use macOS you can use brew to install missing packages.
Here, we make sure to have gzip through brew.
brew install gzip
You probably have no admin rights on a HPC, so nothing to report here.
On Rocky8
cd /path_to_software
mkdir -v R-4.4.1 # (currently the latest version)
mkdir -v tmp
cd /tmp
wget http://cran-mirror.cs.uu.nl/src/base/R-3/R-4.4.1.tar.gz
tar -zxvf R-4.4.1.tar.gz
cd R-4.4.1
./configure --prefix=/path_to_software/R-4.4.1 --enable-java=no
make
make install
On macOS Sequoia with brew
Make sure to get the binary version of r instead of building it from sources - these prevents a lot of issues down the road.
brew install --cask r
Installation R packages
Next, install some R packages:
install.packages("e1071")
install.packages("ggplot2")
install.packages("data.table")
We have mambaforge3 installed on our cluster (and macOS Sequoia), so:
First, we create an environment.
mamba create --name monopogen python=3.9
Next, we activate it.
mamba activate monopogen
Now, we're ready to install some packages.
mamba install openjdk pandas numpy scipy pillow polars
Note: if your system doesn't support multithreading such as a HPC does, install
polars-lts-cpuinstead ofpolars.
We need pysam too to handle VCF-files:
pip install pysam
And for some plots we'll need cmcrameri:
pip install cmcrameri
As well as tectonic to create PDF-files:
mamba install -c conda-forge tectonic
We also want samtools and bcftools.
Note:
Monopogenwas written based on these specific versions ofsamtools==1.2andbcftools==1.8. Changing these to newer versions causes problems downstream. Disadvantage of using these older versions:
indexis not implemented insamtools v1.2(samtools-htslib-API: bam_index_build2() not yet implemented)- other functions and speed improvements
MonopogenLitewas adapted to work with these specific versions:vcftools(v0.1.16),samtoolsandbcftoolsatv1.21, andtabixandhtslibatv1.20.
mamba install -c bioconda samtools=1.21 bcftools=1.21 vcftools=0.1.16
It could be that bcftools --version is not working for you; you'll get bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory. You can fix this by doing the following:
mamba install -c bioconda openssl
mamba remove bcftools
mamba install -c bioconda bcftools=1.21
This should fix the error.
Lastly, we make sure to get tabix and bgzip which are part of htslib.
mamba install htslib=1.21
For the imputation and phasing we will require BEAGLE v4.1 which can be downloaded here. The reason for using v4.1 is that this version can handle genotype likelihood using the gl for the phasing step in MonopogenLite.
Here we create a folder to put BEAGLE into, and download version 4.1.
mkdir -v path/to/beagle
cd path/to/beagle
wget https://faculty.washington.edu/browning/beagle/beagle.27Jan18.7e1.jar
cd ../
chmod -Rv a+rwx beagle
Next, we symlink it to the apps directory.
cd path/to/MonopogenLite
ln -sv path/to/beagle/beagle.27Jan18.7e1.jar apps/beagle.jar
Lastly, MonopogenLite uses the human reference to map the sequence to and 1000G to infer known variants.
To make the management of the reference more generic we make use of refgenie.
Let's install refgenie as per instructions detailed here.
pip install refgenie
See if your install worked by calling refgenie -h on the command line. If the refgenie executable in not in your $PATH, append this to your .bashrc or .profile on Linux (or .bash_profile on macOS):
export PATH=~/.local/bin:$PATH
Initial configuration
If you're using refgenie for the first time you'll need to initialize your genome_folder and configuration file. Just select a folder where you want your genome assets to live, and then try:
refgenie init -c /path/to/genome_folder/genome_config.yaml
The refgenie commands all require knowing where this genome config file (genome_config.yaml) is. You can pass it on the command line all the time (using the -c parameter), but this gets old. An alternative is to set up the $REFGENIE environment variable like so:
export REFGENIE=/path/to/genome_config.yaml
Refgenie will automatically use the config file in this environmental variable if it exists. Add this to your .bashrc or .profile if you want it to persist for future command-line sessions.
export REFGENIE=/path/to/genome_config.yaml
You can always specify -c if you want to override the value in the $REFGENIE variable on an ad-hoc basis.
Listing assets
Now you can use the refgenie list command to show local assets (which will be empty at first) or the listr command to show available remote assets:
refgenie list
refgenie listr
Populate some assets
After installing refgenie, we have to make sure to install the human b38 fasta-data. So, we are going to populate your genome folder with a few assets. You can either pull existing assets or build your own. Refgenie will manage them the same way.
refgenie pull hg38/fasta
To infer known variants, we make use of 1000 Genome data re-called and mapped to b38. This directory contains 1000 Genomes Project (1kGP) high-coverage Illumina integrated phased panel which includes filtered SNV, INDEL, and SV (large deletions (DEL), insertions (INS), duplications (DUP), and inversions (INV)) variant calls across all 3,202 1kGP samples, i.e. including samples from 'ALL' populations.
To properly download and parse the reference, use the following command:
cd path/to/MonopogenLite
bash src/variant_ref_create.sh --resource-dir $(pwd)/resources --verbose
These 1000G reference data are phased, but the males are all haploid on chromosome X; meaning genotype 0 (or 1) instead of 0|0 (or 1|1). So, you will have to fix the haploid males on chromosome X to trick BEAGLE into also phasing chromosome X in MonopogenLite.
bash src/submit_makediploidmalesX.sh --input $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz --output $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz --mpg-dir $(pwd)
You can use --verbose and --debug to get extensive output for each step, including which variants are processed. However, this produces a lot of data - I'd advise against it.
If you want to receive an email when it's FAIL you should set --user. Some systems require you provide the --account - the default is dhl_ec.
Sometimes the system is very slow - it may have lower tier cpus. You could use --partition gpu to set it to gpu. But I found that setting a different --chunk-size is more effective. The default is --chunk-size 100. Also, the default uses --cpus 4, but it may be overkill and you can set it to a lower number.
After this script has run, you'll need to clean up a bit. I did not add an automatic mv (moving) of files, nor did I add an automatic removal (rm) of intermediate files. You should check yourself what is produced, and decide whether you want to keep them - they could be handy in other scenarios not related to MonopogenLite.
In either case, prior to deleting anything you must do this:
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.haploid.chrX.vcf.gz
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz.tbi resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.haploid.chrX.vcf.gz.tbi
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz
mv -v resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.makediploid.chrX.vcf.gz.tbi resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chrX.vcf.gz.tbi
This way, you keep the phased haploid data too.
And afterwards, should you decide to want to delete intermediate files, you can delete these. Just make sure to keep all these files:
1kGP_high_coverage_Illumina.SNVonly_poly.filtered_${AF_FIELD}_${AF_SCI}.norm.fixvariantid.chr*
Note: Originally,
Monopogenwas developed to infer variants on the autosomal chromosomes.MonopogenLiteincludes inference of X chromosome too.
Optionally, you can check the reference files you just created by providing just one VCF-file of a (randomly chosen) chromosome. To do so, use this:
bash src/submit_variant_ref_checker.sh --input $(pwd)/resources/1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr1.vcf.gz --mpg-dir $(pwd) --verbose
This will automatically create stats and plots for each chromosome. You can inspect these to make sure the reference was created properly.
You may want to compare with cellsnp. The great thing is that variant_ref_create.sh automatically creates the necessary input for cellsnp. A big difference is that cellsnp does not use BEAGLE to phase the data (which may be perceived as a 'con'), and it only requires a VCF-file with a list of variants; no (phased) genotypes are needed.
You'll note these files 1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr#.cellsnp. and 1kGP_high_coverage_Illumina.SNVonly_poly.filtered_AF_5e-04.norm.fixvariantid.chr1_22X.cellsnp.vcf.gz in the resources directory. The former are reference files formatted according to cellsnp-specifications per chromosome, the latter is the combined list of variants as a reference file to be used by cellsnp.
The MIT License | Copyright (c) 1979-2025 Sander W. van der Laan | s.w.vanderlaan [at] gmail [dot] com.