
swapfinder is a tool for the fast identification of sample swaps. It achieves this by calculating and comparing SNP barcodes of samples.
It is inspired by CCLHunter, which is a web-based tool for identifying cancer cell lines.
swapfinder uses the SNP barcode from the CCLHunter project but provides a command-line tool for users to identify any sample swaps.
- Precisely distinguish samples using hundreds of highly heterogeneous SNP sites.
- Quickly detect sample swaps by comparing SNP barcodes.
- Standalone program with no dependencies or installation required.
- Supports BAM and CRAM formats.
- Supports multiple distance calculation methods.
- Supports reading BAM files from URLs.
- Supports custom SNP barcode lists.
- Ensure you have Rust and Cargo installed. Then, you can clone and build the project with the following commands:
git clone https://github.com/zhengxinchang/swapfinder.git
cd swapfinder
cargo build --release
- Download the pre-built binaries from the releases page
You can calculate the SNP barcode for a given sample with the following command:
swapfinder barcode -b <barcode_file> -i <input_bam_or_cram> -o <output_file> [-r <reference_file>]
Parameters:
-b, --barcode <barcode_file>
: Barcode file
-i, --bam <input_bam_or_cram>
: Input BAM or CRAM file, MUST BE SORTED
-o, --output <output_file>
: Output file
-r <reference_file>
: Reference file (only for CRAM format)
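When many samples need barcodes, it can help to assemble the command line programmatically. The following is a minimal sketch in Python that uses only the flags documented above; the helper name `build_barcode_cmd` is hypothetical and not part of swapfinder, and treating a file as CRAM based on its `.cram` extension is an assumption.

```python
from pathlib import Path
from typing import List, Optional


def build_barcode_cmd(barcode_file: str, alignment: str, output: str,
                      reference: Optional[str] = None) -> List[str]:
    """Assemble the argv for `swapfinder barcode` (hypothetical helper).

    Only the documented flags -b, -i, -o and -r are used.
    """
    cmd = ["swapfinder", "barcode", "-b", barcode_file, "-i", alignment, "-o", output]
    # ASSUMPTION: CRAM input is recognized by its file extension.
    is_cram = Path(alignment).suffix.lower() == ".cram"
    # -r is documented as relevant only for CRAM, so pass it only in that case.
    if is_cram and reference is not None:
        cmd += ["-r", reference]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually execute swapfinder.
    print(" ".join(build_barcode_cmd("barcodes.txt", "sample1.cram",
                                     "sample1_barcode.txt", "reference.fa")))
```

Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.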
You can compare SNP barcodes of multiple samples with the following command:
swapfinder compare -i <barcode_file1> -i <barcode_file2> -o <output_file>
swapfinder compare -I <barcode_files> -o <output_file>
Parameters:
-i, --barcode <barcode_file>
: Barcode file, can specify multiple
-I, --barcode_files <barcode_files>
: File containing a list of barcode files, mutually exclusive with -i
-o, --output <output_file>
: Output file
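The two invocation styles can be sketched in Python as follows. The helper names are hypothetical, and the layout of the list file (one barcode file path per line) is an assumption, since the format is not spelled out here; check it against your swapfinder version before relying on it.

```python
from typing import List, Optional, Sequence


def barcode_list_text(paths: Sequence[str]) -> str:
    """Render the content of a list file for -I.

    ASSUMPTION: one barcode file path per line.
    """
    return "".join(f"{path}\n" for path in paths)


def build_compare_cmd(output: str, barcode_files: Sequence[str] = (),
                      list_file: Optional[str] = None) -> List[str]:
    """Assemble the argv for `swapfinder compare` (hypothetical helper)."""
    # -i and -I are mutually exclusive, so exactly one source must be given.
    if bool(barcode_files) == (list_file is not None):
        raise ValueError("give exactly one of barcode_files (-i) or list_file (-I)")
    cmd = ["swapfinder", "compare"]
    if list_file is not None:
        cmd += ["-I", list_file]
    else:
        for path in barcode_files:
            cmd += ["-i", path]  # -i may be repeated, once per barcode file
    cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    print(barcode_list_text(["sample1_barcode.txt", "sample2_barcode.txt"]), end="")
    print(" ".join(build_compare_cmd("comparison.txt", list_file="barcode_files.txt")))
```

The -I form is likely the more convenient one for large cohorts, where repeating -i for every sample makes the command line unwieldy.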
You can find the pre-built barcode files in the barcodes directory. There are two versions of the barcode files, based on the GRCh37 and GRCh38 reference genomes.
You can also build your own barcode files. A barcode file is a tab-separated file with the following mandatory columns:
Chromosome
: Chromosome name
Position
: Position of the SNP
- Other metadata columns, separated by tabs (optional)
Any lines in the header that start with # will be ignored.
Note that the position should be 0-based.
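Because many SNP sources (VCF files, dbSNP listings) report 1-based positions, a custom barcode file usually needs a coordinate shift. The sketch below is a hypothetical Python helper that turns 1-based sites into barcode lines in the format described above; the example coordinates are made up for illustration and are not real barcode sites.

```python
from typing import Iterable, List, Tuple


def barcode_lines(sites_1based: Iterable[Tuple[str, int]]) -> List[str]:
    """Convert (chromosome, 1-based position) pairs into barcode file lines."""
    # Header line starts with '#', so it is ignored when the file is read.
    lines = ["#Chromosome\tPosition"]
    for chrom, pos in sites_1based:
        if pos < 1:
            raise ValueError(f"1-based position must be >= 1, got {pos}")
        # The barcode file expects 0-based positions, so subtract one.
        lines.append(f"{chrom}\t{pos - 1}")
    return lines


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example_sites = [("chr1", 1000), ("chr2", 2500)]
    with open("custom_barcodes.txt", "w") as handle:
        handle.write("\n".join(barcode_lines(example_sites)) + "\n")
```

Additional metadata columns, if wanted, can be appended to each line after another tab.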
Calculate SNP Barcode
# barcodes.txt can be found in the barcodes/ directory; choose the file that matches your reference version.
# -r is only needed for CRAM input.
swapfinder barcode -b barcodes.txt -i sample1.bam -o sample1_barcode.txt -r reference.fa
Compare SNP Barcodes
# use -i option to specify multiple barcode files
swapfinder compare -i sample1_barcode.txt -i sample2_barcode.txt -o comparison.txt
# use -I option to specify a file containing a list of barcode files
swapfinder compare -I barcode_files.txt -o comparison.txt
The screening criteria aim to select SNPs that are inherited as stably as possible while minimizing the accuracy problems caused by sequence complexity. In total, 436 SNP sites and their corresponding genes were selected to build the SNP barcode.
- Each allele should be located within the CDS regions of recognized coding genes.
- Each allele should be recognized as a biallelic variant, meaning that only two variants (including its reference) could appear in this location, as determined by the dbSNP ALFA project's statistics (build id 20201027095038), and its variation type should be a transversion.
- The allele frequency of each locus in dbSNP should fall within the range of 0.4–0.6.
- The allele frequency of each locus in our curated CCLs should also fall within the range of 0.4–0.6.
- Each locus should not be within the tandem repeat regions identified by Tandem Repeats Finder with default parameters.
- Each SNP should not be located within the linkage disequilibrium regions.
- Only the one farthest from the CDS boundary will be retained if two or more alleles are located on the same coding gene.
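Two of the criteria above (the transversion requirement and the 0.4–0.6 allele frequency windows) are simple enough to express directly. The following Python sketch illustrates them; it is not the pipeline used to build the published barcode, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref: str, alt: str) -> bool:
    """True if the substitution swaps a purine for a pyrimidine or vice versa."""
    ref, alt = ref.upper(), alt.upper()
    return (ref in PURINES and alt in PYRIMIDINES) or \
           (ref in PYRIMIDINES and alt in PURINES)


def in_frequency_window(af: float, low: float = 0.4, high: float = 0.6) -> bool:
    """True if the allele frequency lies within the inclusive range [low, high]."""
    return low <= af <= high


def passes_basic_filters(ref: str, alt: str, dbsnp_af: float, ccl_af: float) -> bool:
    """Apply the transversion and both allele-frequency criteria to one SNP."""
    return (is_transversion(ref, alt)
            and in_frequency_window(dbsnp_af)
            and in_frequency_window(ccl_af))


if __name__ == "__main__":
    print(passes_basic_filters("A", "T", dbsnp_af=0.5, ccl_af=0.45))
    print(passes_basic_filters("A", "G", dbsnp_af=0.5, ccl_af=0.45))
```

The remaining criteria (CDS location, tandem repeats, linkage disequilibrium, one SNP per gene) depend on external annotations and are not covered by this sketch.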
1. If you encounter the following error when processing a BAM/CRAM file from a URL:
[E::easy_errno] Libcurl reported error 60 (SSL peer certificate or SSH remote key was not OK)
This is because libcurl needs an SSL certificate bundle to verify the server. You can fix it by exporting the following environment variable:
# make sure the path is correct
export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
This issue was also mentioned in a rust-htslib issue.
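The location of the certificate bundle differs between systems. As a rough sketch, the Python snippet below looks for the bundle in a few commonly used locations and prints the matching export line. The candidate list is an assumption and may not cover your system, and the helper name `find_ca_bundle` is hypothetical.

```python
import os
from typing import Callable, Optional, Sequence

# ASSUMPTION: typical bundle locations; adjust for your distribution.
CANDIDATE_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",  # Debian/Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # RHEL/CentOS/Fedora
    "/etc/ssl/cert.pem",                   # macOS, Alpine
)


def find_ca_bundle(candidates: Sequence[str] = CANDIDATE_BUNDLES,
                   exists: Callable[[str], bool] = os.path.isfile) -> Optional[str]:
    """Return the first candidate path that exists, or None if none does."""
    for path in candidates:
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    bundle = find_ca_bundle()
    if bundle is None:
        print("No CA bundle found in the candidate locations")
    else:
        print(f"export CURL_CA_BUNDLE={bundle}")
```

The `exists` parameter is injectable only so the lookup can be checked without touching the real filesystem.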
Contributions are welcome! Please fork this repository and submit a pull request.
This project is open-source under the MIT license. For more details, please refer to the LICENSE file.
If you use swapfinder in your research, please cite the following paper:
Congfan Bu, Xinchang Zheng, Jialin Mai, Zhi Nie, Jingyao Zeng, Qiheng Qian, Tianyi Xu, Yanling Sun, Yiming Bao, Jingfa Xiao, CCLHunter: An efficient toolkit for cancer cell line authentication, Computational and Structural Biotechnology Journal, 2023, https://doi.org/10.1016/j.csbj.2023.09.040.