Measure higher-order genome architectures (global folding and checkerboard) from Hi-C experiments.
Citation:
We provide a one-click Bash script that computes the strength of two large-scale genome architectures: global folding and checkerboard.
Our algorithm comprises three key modules: normalization, global folding, and checkerboard.
- NormDis & CorrectMap: Raw Hi-C maps are scaled to comparable sizes and normalized to remove distance-dependent biases. The resulting maps are then used to calculate the global folding and checkerboard scores. You can also inspect the normalized maps manually and remove poorly assembled chromosomes, as detailed in CorrectMap.
- Global folding: Based on the normalized maps, the global folding score is computed by two sub-modules: detecting center anchors (GF_S1_get_center) and calculating the global folding scores (GF_S2_get_score). You can choose alternative center anchors, as detailed in GF_S1_get_center.
- Checkerboard: Checkerboard scores are calculated based on normalized maps.
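To make "removing distance-dependent biases" concrete, here is a minimal sketch of one common approach, observed-over-expected scaling, in which every diagonal of a symmetric contact matrix is divided by its mean. This is an illustration only and is not taken from NormDis; the function name `observed_over_expected` is hypothetical, and the module itself may use a different method.

```python
import numpy as np

def observed_over_expected(mat):
    """Divide each diagonal of a symmetric contact matrix by its mean.

    Illustrative only: this is a generic distance normalization, not the
    repository's NormDis code.
    """
    mat = np.asarray(mat, dtype=float)
    n = mat.shape[0]
    out = np.zeros_like(mat)
    for d in range(n):
        diag = np.diagonal(mat, offset=d)   # contacts at genomic distance d
        mean = diag.mean()
        if mean > 0:
            idx = np.arange(n - d)
            out[idx, idx + d] = diag / mean
            out[idx + d, idx] = diag / mean  # mirror to keep the map symmetric
    return out

# Toy map whose signal depends only on distance; after scaling it is flat.
i, j = np.indices((5, 5))
toy = 1.0 / (1.0 + np.abs(i - j))
print(np.allclose(observed_over_expected(toy), 1.0))
```

After this kind of scaling, values above 1 mark contacts enriched relative to other pairs at the same genomic distance.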
Requirements: Python 3.7+, Seaborn, SciPy, PyTorch (torch), scikit-learn
We provide a one-click pipeline script (one_click_pipeline.sh) for automated Hi-C data analysis. The core requirement for its execution is to properly organize input files within the designated base_path directory.
The following directory tree must be created under your base_path:
base_path/
├── [species_name_1]/
│ └── sps_mtx/
├── [species_name_2]/
│ └── sps_mtx/
└── parameters.txt
Steps:
- Create your main base_path directory.
- Inside base_path, create a sub-directory for each species you wish to analyze (e.g., human/, mouse/).
- Inside each species directory, create a sub-sub-directory named sps_mtx/. This directory will contain all the input files for the samples belonging to that species.
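The steps above can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `create_layout` is hypothetical and not part of the pipeline. It does not create parameters.txt, because the format of that file is not described here.

```python
from pathlib import Path
import tempfile

def create_layout(base_path, species):
    """Create base_path/<species>/sps_mtx/ for every species name given."""
    base = Path(base_path)
    created = []
    for name in species:
        sps_dir = base / name / "sps_mtx"
        sps_dir.mkdir(parents=True, exist_ok=True)  # also creates base_path
        created.append(sps_dir)
    return created

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for d in create_layout(Path(tmp) / "base_path", ["human", "mouse"]):
        print(d.relative_to(tmp))
```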
Place the following two types of files for each sample inside the corresponding sps_mtx/ directory.
- Purpose: Contains the Hi-C contact data.
- File Naming: <sample>_normalized.mtx, where <sample> is a unique identifier for the biological sample (e.g., sample1_normalized.mtx, rep2_normalized.mtx).
- File Format: A three-column, whitespace-separated text file.
- Column 1: Row index (integer). Must be consistent with the index file (0-based or 1-based).
- Column 2: Column index (integer).
- Column 3: Contact value (float).
- Note: We recommend using pre-normalized contact matrices as input (e.g., normalized with ICE or Knight-Ruiz (KR) balancing).
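As an illustration of the three-column format, the sketch below reads such a file into a SciPy sparse matrix. The function name `load_contacts` and the `index_base` argument are assumptions made for this example; the pipeline's own reader may behave differently, for instance in how it treats files that list only one triangle of the matrix.

```python
import io
import numpy as np
from scipy.sparse import coo_matrix

def load_contacts(source, index_base=0):
    """Read (row, column, value) triplets into a square sparse matrix.

    index_base is 0 or 1, matching the indexing used in the index file.
    """
    data = np.loadtxt(source, ndmin=2)
    rows = data[:, 0].astype(int) - index_base
    cols = data[:, 1].astype(int) - index_base
    n = int(max(rows.max(), cols.max())) + 1
    return coo_matrix((data[:, 2], (rows, cols)), shape=(n, n)).tocsr()

# A tiny in-memory example standing in for <sample>_normalized.mtx.
example = io.StringIO("0 0 5.0\n0 1 2.5\n1 2 1.0\n")
m = load_contacts(example)
print(m.shape)
print(m[0, 1])
```

Passing a file path instead of the `StringIO` object works the same way, since `np.loadtxt` accepts both.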
- Purpose: Provides the genomic coordinates for each bin (row/column) in the .mtx file.
- File Naming: <sample>.window.bed
- The <sample> prefix must match the corresponding .mtx file but without the _normalized suffix.
- Example: For sample1_normalized.mtx, the index file must be named sample1.window.bed.
- File Format: A four-column, whitespace-separated file in standard BED format.
- Column 1: Chromosome name (string).
- Column 2: Region start position (integer, 0-based).
- Column 3: Region end position (integer).
- Column 4: Index (integer). This number corresponds to the row/column index in the associated .mtx file.
- Generation: This file can be created using tools like bedtools makewindows.
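For readers who want to see the four columns spelled out, here is a pure-Python sketch that builds fixed-size windows with a running index. It does not call bedtools; the function name `make_windows`, the toy chromosome sizes, and the choice to number bins continuously across chromosomes are assumptions for illustration. Whatever numbering you use must match the indices in the .mtx file.

```python
def make_windows(chrom_sizes, bin_size, index_base=0):
    """Yield (chrom, start, end, index) for fixed-size bins.

    Starts are 0-based and the last bin of each chromosome is truncated
    at the chromosome end.
    """
    index = index_base
    for chrom, size in chrom_sizes:
        for start in range(0, size, bin_size):
            yield chrom, start, min(start + bin_size, size), index
            index += 1

sizes = [("chr1", 250), ("chr2", 120)]  # toy lengths, not real chromosomes
for row in make_windows(sizes, bin_size=100):
    print("\t".join(map(str, row)))
```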
A correctly organized base_path directory will look like this:
base_path/
├── human/
│ └── sps_mtx/
│ ├── sample1_normalized.mtx
│ ├── sample1.window.bed
│ ├── sample2_normalized.mtx
│ └── sample2.window.bed
├── mouse/
│ └── sps_mtx/
│ ├── mouse_sample1_normalized.mtx
│ └── mouse_sample1.window.bed
└── parameters.txt (parameter file, placed directly in base_path)
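Before running the pipeline, it can help to verify that the layout is complete. The sketch below checks the naming rules described above: every <sample>_normalized.mtx needs a matching <sample>.window.bed in the same sps_mtx/ directory, and parameters.txt must sit directly in base_path. The helper name `find_problems` is hypothetical and not part of the pipeline.

```python
from pathlib import Path
import tempfile

MTX_SUFFIX = "_normalized.mtx"
BED_SUFFIX = ".window.bed"

def find_problems(base_path):
    """Return a list of human-readable layout problems (empty if none)."""
    base = Path(base_path)
    problems = []
    if not (base / "parameters.txt").is_file():
        problems.append("missing parameters.txt in base_path")
    for sps_dir in sorted(base.glob("*/sps_mtx")):
        mtx = {p.name[: -len(MTX_SUFFIX)] for p in sps_dir.glob("*" + MTX_SUFFIX)}
        bed = {p.name[: -len(BED_SUFFIX)] for p in sps_dir.glob("*" + BED_SUFFIX)}
        species = sps_dir.parent.name
        for sample in sorted(mtx - bed):
            problems.append(f"{species}: missing {sample}{BED_SUFFIX}")
        for sample in sorted(bed - mtx):
            problems.append(f"{species}: missing {sample}{MTX_SUFFIX}")
    return problems

# Demonstration: sample2 has no index file and parameters.txt is absent.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp) / "human" / "sps_mtx"
    d.mkdir(parents=True)
    for name in ["sample1_normalized.mtx", "sample1.window.bed",
                 "sample2_normalized.mtx"]:
        (d / name).touch()
    for line in find_problems(tmp):
        print(line)
```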