A lightweight library that generates and handles Hi-C contact maps in either cooler-compatible 2Dbedgraph or instaGRAAL format. It is essentially a merge of the yahcp pipeline, the hicstuff library and extra features illustrated in the 3C tutorial and the DADE pipeline, all packaged together for extra convenience.
The goal is to make generation and manipulation of Hi-C matrices as simple as possible and work for any organism.
To install a stable version:
pip install hicstuffor, for the latest development version:
git clone https://github.com/koszullab/hicstuff.git
cd hicstuff
uv sync --extra devIf hicstuff is installed via pip, additional system dependencies such as
bowtie2, bwa, minimap2 and samtools will be needed. You can install
them as follows on Debian/Ubuntu:
apt-get install bowtie2 bwa minimap2 samtools libbz2-dev liblzma-devhicstuff is also available on bioconda, which will automatically handle all dependencies:
conda install bioconda::hicstuffDocker images are automatically built and published to GitHub Container Registry (GHCR) when releases are tagged:
docker pull ghcr.io/baudrly/hicstuff:latest
docker pull ghcr.io/baudrly/hicstuff:<version>The hicstuff command line interface is composed of multiple subcommands. You can always get a summary of all available commands by running:
hicstuff --help
Usage: hicstuff [OPTIONS] COMMAND [ARGS]...
Simple Hi-C pipeline for generating and manipulating contact matrices.
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --version Show the version and exit. โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Alignment โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ iteralign Iteratively align reads to a reference genome. โ
โ cutsite Preprocess FASTQ files by cutting reads at religation sites. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Processing โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ digest Digest a genome FASTA into restriction fragments. โ
โ filter Filter spurious Hi-C events (loops and uncuts) from a pairs file. โ
โ pipeline Run the full Hi-C pipeline from FASTQ to contact matrix. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Matrix operations โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ rebin Rebin a Hi-C matrix to a coarser resolution. โ
โ subsample Subsample contacts from a Hi-C matrix. โ
โ convert Convert between Hi-C matrix formats (graal, bg2, cool). โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Analysis & Visualization โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ view Visualize a Hi-C matrix as a heatmap. โ
โ scalogram Generate a scalogram from a Hi-C contact matrix. โ
โ distancelaw Analyse and plot the Hi-C distance law (P(s) curve). โ
โ missview Preview unmappable Hi-C bins for a given read length. โ
โ stats Extract mapping statistics from a hicstuff pipeline log file. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏAll components of the pipeline can be run at once using the hicstuff pipeline command.
This allows to generate a contact matrix from reads in a single command.
By default, the output is a cool file.
hicstuff pipeline --help
Usage: hicstuff pipeline [OPTIONS] INPUT1 [INPUT2]
Run the full Hi-C pipeline from FASTQ to contact matrix.
Example โ generate a multi-resolution .mcool for Arima Hi-C:
hicstuff pipeline --enzyme "DpnII,HinfI" --binning 1000 --threads 8 \
--genome ref.fa R1.fq.gz R2.fq.gz
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * INPUT1 TEXT [required] โ
โ INPUT2 TEXT โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * --genome -g FILE Reference genome or aligner index. [required] โ
โ --aligner -a STR Aligner: bowtie2, minimap2 or bwa. [default: bowtie2] โ
โ --balancing-args -B STR Extra arguments passed to `cooler balance`. โ
โ --binning -b INT Bin the cool matrix to this resolution (bp). 0 means no โ
โ binning. [default: 0] โ
โ --centromeres -c FILE Centromere positions file. โ
โ --circular -C Genome is circular. โ
โ --distance-law -d Generate a distance law output file. โ
โ --duplicates -D Filter PCR duplicates. โ
โ --enzyme -e {STR|INT} Restriction enzyme name, 'mnase'/'dnase', or chunk size โ
โ in bp. [default: 5000] โ
โ --exclude -E STR Comma-separated chromosomes to exclude (e.g. chrM,2u). โ
โ --filter -f Filter spurious 3C events (loops and uncuts). โ
โ --force -F Overwrite existing output files. โ
โ --mapping -m STR Mapping mode. [default: normal] โ
โ --matfmt -M STR Output matrix format. [default: cool] โ
โ --no-cleanup -n Keep intermediary files. โ
โ --outdir -o DIR Output directory (default: current directory). โ
โ --plot -p Generate plots at pipeline steps. โ
โ --prefix -P STR Prefix for all output files. โ
โ --quality-min -q INT Minimum mapping quality. [default: 30] โ
โ --remove-centromeres -r INT kb to remove around centromere positions. [default: 0] โ
โ --read-len -R INT Maximum read length (estimated from first read if โ
โ omitted). โ
โ --size -s INT Minimum contig size threshold. [default: 0] โ
โ --start-stage -S STR Pipeline start stage. [default: fastq] โ
โ --threads -t INT Number of parallel threads. [default: 1] โ
โ --tmpdir -T DIR Temporary directory. โ
โ --zoomify -z BOOL Generate multi-resolution .mcool from the binned cool โ
โ matrix. [default: True] โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏFor example, to run the pipeline with minimap2 using 8 threads and save output
files to the directory out:
hicstuff pipeline -t 8 -a minimap2 -e DpnII -o out/ -g genome.fa reads_for.fq reads_rev.fqIf you have already aligned your reads, hicstuff pipeline can also take bam files as input. For example, to generate a matrix in mcool file with multiple resolutions starting from 5kb, you can run:
# Note the bam files have to be name-sorted, this can be done using samtools
samtools sort -n aligned_for.bam -o namesorted_for.bam
samtools sort -n aligned_rev.bam -o namesorted_rev.bam
hicstuff pipeline -S bam -b 5000 -o out/ -g genome.fa namesorted_for.bam namesorted_rev.bamThe general steps of the pipeline are as follows:
For more advanced usage, different scripts can be used independently on the command line to perform individual parts of the pipeline. This readme contains quick descriptions and example usages. To obtain detailed instructions on any subcommand, one can use hicstuff <subcommand> --help.
Generates new gzipped fastq files from original fastq. The function will cut the reads at their religation sites and creates new pairs of reads with the different fragments obtained after cutting at the digestion sites.
hicstuff cutsite --help
Usage: hicstuff cutsite [OPTIONS]
Preprocess FASTQ files by cutting reads at religation sites.
Generates gzipped FASTQ files with reads cut at ligation junctions, creating new fragment pairs
for mapping.
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * --forward -F FILE Forward reads FASTQ file. [required] โ
โ * --reverse -R FILE Reverse reads FASTQ file. [required] โ
โ * --prefix -p STR Output prefix (suffixed with _R1.fq.gz / _R2.fq.gz). [required] โ
โ * --enzyme -e STR Comma-separated restriction enzyme(s). [required] โ
โ --mode -m STR Fragment pairing mode. [default: for_vs_rev] โ
โ --seed-size -s INT Minimum read size after cutting. [default: 20] โ
โ --threads -t INT Number of parallel threads. [default: 1] โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
Truncate reads from a fastq file to 20 basepairs and iteratively extend and re-align the unmapped reads to optimize the proportion of uniquely aligned reads in a 3C library.
hicstuff iteralign --help
Usage: hicstuff iteralign [OPTIONS] READS_FQ
Iteratively align reads to a reference genome.
Truncates reads to 20 bp then iteratively extends and re-aligns unmapped reads to maximise the
proportion of uniquely aligned reads in a 3C library.
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * READS_FQ TEXT [required] โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * --genome -g FILE Reference genome or aligner index path. [required] โ
โ * --out-bam -o FILE Output BAM file path. [required] โ
โ --threads -t INT Number of parallel threads. [default: 1] โ
โ --tempdir -T DIR Temporary directory (default: current directory). โ
โ --aligner -a STR Aligner: bowtie2, minimap2 or bwa. [default: bowtie2] โ
โ --min-len -l INT Minimum truncated read length. [default: 20] โ
โ --read-len -R INT Maximum read length (estimated from first read if omitted). โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
Digests a fasta file into fragments based on a restriction enzyme or a fixed chunk size. Generates two output files into the target directory named "info_contigs.txt" and "fragments_list.txt"
hicstuff digest --help
Usage: hicstuff digest [OPTIONS] FASTA
Digest a genome FASTA into restriction fragments.
Writes ``fragments_list.txt`` and ``info_contigs.txt`` to the output directory.
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * FASTA TEXT [required] โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * --enzyme -e ENZ[,ENZ2,...] Restriction enzyme name(s) or fixed chunk size in bp. โ
โ [required] โ
โ --outdir -o DIR Output directory (default: current directory). โ
โ --size -s INT Minimum fragment size to keep. [default: 0] โ
โ --circular -c Genome is circular. โ
โ --plot -p Show fragment length distribution histogram. โ
โ --figdir -f DIR Directory to save the distribution figure. โ
โ --force -F Overwrite existing output directory. โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
For example, to digest the yeast genome with MaeII and HinfI and show histogram of fragment lengths:
hicstuff digest --plot --outdir output_dir --enzyme MaeII,HinfI Sc_ref.faFilters spurious 3C events such as loops and uncuts from the library based on a minimum distance threshold automatically estimated from the library by default. Can also plot 3C library statistics. This module takes a pairs file with 9 columns as input (readID, chr1, pos1, chr2, pos2, strand1, strand2, frag1, frag2) and filters it.
hicstuff filter --help
Usage: hicstuff filter [OPTIONS] INPUT_PAIRS OUTPUT_PAIRS
Filter spurious Hi-C events (loops and uncuts) from a pairs file.
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * INPUT_PAIRS TEXT [required] โ
โ * OUTPUT_PAIRS TEXT [required] โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --figdir -f DIR Directory for output figures. โ
โ --interactive -i Ask for thresholds interactively after showing plots. โ
โ --plot -p Show library composition and 3C event abundance plots. โ
โ --prefix -P STR Library name displayed on figures. โ
โ --thresholds -t INT-INT Manual thresholds as UNCUT-LOOP (e.g. 4-5). โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
Visualize a Hi-C matrix file as a heatmap of contact frequencies. Allows to tune visualisation by binning and normalizing the matrix, and to save the output image to disk. If no output is specified, the output is displayed.
hicstuff view --help
Usage: hicstuff view [OPTIONS] CONTACT_MAP [CONTACT_MAP2]
Visualize a Hi-C matrix as a heatmap.
โญโ Arguments โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ * CONTACT_MAP TEXT [required] โ
โ CONTACT_MAP2 TEXT โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --binning -b INT[bp|kb|Mb] Merge bins by factor or generate fixed-size bins. [default: 1] โ
โ --cmap -c STR Matplotlib colormap name. [default: Reds] โ
โ --circular -C Genome is circular. โ
โ --despeckle -d Remove speckle artefacts. โ
โ --dpi -D INT Output image DPI. [default: 300] โ
โ --frags -f FILE fragments_list.txt (required for bp binning and --lines). โ
โ --transform -T STR Pixel transform: log2, log10, ln, sqrt, exp<val>. โ
โ --lines -l Draw chromosome separator lines (requires --frags). โ
โ --max -M NUM[%] Colorscale maximum (percentile with %). [default: 99%] โ
โ --min -m NUM[%] Colorscale minimum. [default: 0] โ
โ --n-mad -N FLOAT MAD threshold for ICE normalization bin filtering. [default: โ
โ 3.0] โ
โ --normalize -n Perform ICE normalization before rendering. โ
โ --output -o FILE Output image path (display interactively if omitted). โ
โ --region -r STR[;STR] UCSC region to zoom into (e.g. chr1:1000-12000). โ
โ --trim -t FLOAT Trim bins deviating by more than this many MADs. โ
โ --help -h Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
For example, to view a 1Mb region of chromosome 1 from a full genome Hi-C matrix rebinned at 10kb:
hicstuff view --normalize --binning 10kb --region chr1:10,000,000-11,000,000 --frags fragments_list.txt contact_map.tsvpairsfiles: This format is used for all intermediate files in the pipeline and is also used byhicstuff filter. It is a tab-separated format holding informations about Hi-C pairs. It has an official specification defined by the 4D Nucleome data coordination and integration center.bedgraph2bedgraph: This is an optional output format ofhicstuff pipelinefor the sparse matrix. It has two fragment per line, and the number of times they are found together. It has the following fields: chr1, start1, end1, chr2, start2, end2, occurences- Those files can be loaded by cooler using
cooler load -f bg2 <chrom.sizes>:<binsize> in.bg2.gz out.coolwherechrom.sizesis a tab delimited file with chromosome names and length on each line, and binsize is the size of bins in the matrix.
- Those files can be loaded by cooler using
GRAALsparse matrix: This is a simple tab-separated file with 3 columns: frag1, frag2, contacts. The id columns correspond to the absolute id of the restriction fragments (0-indexed). The first row is a header containing the number of rows, number of columns and number of nonzero entries in the matrix. Example:
564 564 6978
0 0 3
1 2 4
1 3 3
fragments_list.txt: This tab separated file provides information about restriction fragments positions, size and GC content. Note the coordinates are 0 point basepairs, unlike the pairs format, which has 1 point basepairs. Example:- id: 1 based restriction fragment index within chromosome.
- chrom: Chromosome identifier. Order should be the same as in info_contigs.txt or pairs files.
- start_pos: 0-based start of fragment, in base pairs.
- end_pos: 0-based end of fragment, in base pairs.
- size: Size of fragment, in base pairs.
- gc_content: Proportion of G and C nucleotide in the fragment.
id chrom start_pos end_pos size gc_content
1 seq1 0 21 21 0.5238095238095238
2 seq1 21 80 59 0.576271186440678
3 seq1 80 328 248 0.5201612903225806
info_contigs.txt: This tab separated file gives information on contigs, such as number of restriction fragments and size. Example:- contig: Chromosome identified. Order should be the same in pairs files or fragments_list.txt.
- length: Chromosome length, in base pairs.
- n_frags: Number of restriction fragments in chromosome.
- cumul_length: Cumulative length of previous chromosome, in base pairs.
contig length n_frags cumul_length
seq1 60000 409 0
seq2 20000 155 409
All contributions are welcome, in the form of bug reports, suggestions, documentation or pull requests. We use the numpy standard for docstrings when documenting functions.
The code formatting standard we use is ruff (line-length 100). We use pytest as our testing framework. Ideally, new functions should have associated unit tests, placed in the tests folder.
To test the code, you can run:
uv run pytest tests/To lint and format:
uv run ruff check src/
uv run ruff format src/Please cite hicstuff using the official DOI as follows:
Cyril Matthey-Doret, Lyam Baudry, Amaury Bignaud, Axel Cournac, Remi-Montagne, Nadรจge Guiglielmoni, Thรฉo Foutel Rodier and Vittore F. Scolari. 2020. hicstuff: Simple library/pipeline to generate and handle Hi-C data . Zenodo. http://doi.org/10.5281/zenodo.4066363
Bibtex entry:
@software{cyril_matthey_doret_2020_4066351,
author = {Cyril Matthey-Doret and
Lyam Baudry and
Amaury Bignaud and
Axel Cournac and
Remi-Montagne and
Nadรจge Guiglielmoni and
Thรฉo Foutel-Rodier and
Vittore F. Scolari},
title = {hicstuff: Simple library/pipeline to generate and handle Hi-C data },
month = oct,
year = 2020,
publisher = {Zenodo},
version = {v2.3.1},
doi = {10.5281/zenodo.4066351},
url = {http://doi.org/10.5281/zenodo.4066363}
}