This repository contains code for benchmarking and analyzing Topologically Associating Domains (TADs) in genomic data. It includes implementations of various metrics for evaluating TAD callers and analyzing their characteristics.
TADs (Topologically Associating Domains) are self-interacting genomic regions that play important roles in gene regulation and 3D genome organization. This project provides a comprehensive benchmarking framework for evaluating different TAD calling algorithms and methods.
The analysis focuses on several key aspects of TAD performance:
- Histone Modification Enrichment (Fig1C) - Evaluating TAD with histone modifications (H3K27me3, H3K36me3)
- Structural Protein Enrichment (Fig1D) - Analyzing enrichment of structural proteins (e.g., CTCF, RAD21, SMC3) at TAD boundaries
- CTCF Orientation Analysis (Fig1E) - Examining CTCF orientation at TAD boundaries
- Contact Enrichment (Fig1F) - Measuring interaction frequencies within TADs
- Boundary Insulation (Fig1G) - Quantifying the insulation strength of TAD boundaries
- Robustness (Fig1H) - Assessing the consistency of TAD calls across different Hi-C sequencing depths
Data/- Contains some example dataFig1C_HistMod_FDR/- Histone modification analysisFig1D_StructProt_EnrichBoundaries/- Structural protein enrichment at TAD boundariesFig1E_CTCF_orientation/- CTCF orientation analysisFig1F_Contact_Enrichment/- Contact enrichment analysisFig1G_Boundary_Insulation/- Boundary insulation analysisFig1H_Robustness/- Robustness analysisutils/- Common utility functions used across analyses
The code is written in R and requires the following packages:
- magrittr
- dplyr
- tibble
- readr
- tidyr
- purrr
- data.table
- GenomicRanges
- Biostrings
- BSgenome.Hsapiens.UCSC.hg19
- TFBSTools
- plyr
- stringr
# Install CRAN packages
install.packages(c("magrittr", "dplyr", "tibble", "readr", "tidyr", "purrr",
"data.table", "plyr", "stringr"))
# Install Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("GenomicRanges", "Biostrings",
"BSgenome.Hsapiens.UCSC.hg19", "TFBSTools"))Each analysis component can be run independently by executing the main R script in the corresponding directory.
source("Fig1C_HistMod_FDR/Fig1C_HistMod_FDR.R")source("Fig1E_CTCF_orientation/Fig1E_CTCF_orientation.R")For Contact Enrichment (Fig1F) and Boundary Insulation (Fig1G) analyses, Hi-C contact matrix files are required but not included in this repository due to their large size.
Please download the Hi-C matrix data from: TADShop: https://tadshop.unil.ch/
The expected file format is: HiC_matrix_{chr}.txt (e.g., HiC_matrix_chr6.txt)
The TAD data is stored in TSV format with the following columns:
chr- Chromosome name (e.g., chr1, chr2, ..., chrX)start- Start position of the TAD (in base pairs, 1-based)end- End position of the TAD (in base pairs)meta.tool- Name of the TAD caller tool usedmeta.resol- Resolution of the TAD calls (in base pairs)meta.sub- Subsampling identifier (used in robustness analysis; "." indicates full dataset)
Contributions to improve the benchmarking framework are welcome. Please feel free to submit pull requests or open issues for discussion.
If you use this benchmarking framework in your research, please cite our work.
This project is licensed under the terms of the included LICENSE file.
For questions or issues, please open an issue on GitHub or visit TADShop.