Machine Learning Reveals 3D Regulatory Mechanisms for Height-Associated Haplotypes

This repository contains the complete computational pipeline and analysis code for studying how genetic variants associated with human height affect 3D chromatin organization and gene regulation. The project combines machine learning approaches with genomics data to understand the regulatory mechanisms underlying height-associated genetic variants.

Abstract

Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human height, but the mechanisms by which these variants affect gene regulation through 3D chromatin structure remain poorly understood. This study applies machine learning models to predict how height-associated variants alter 3D genome organization and gene expression patterns.

Project Structure

📊 LDSC Enrichment Analysis (`LDSC Enrichment/`)

Linkage Disequilibrium Score Regression (LDSC) analysis to test for enrichment of height-associated variants in specific genomic regions.

Key Scripts:

make_ldsc_data/make_annot.R: Generates LDSC-compatible annotation files from variant data
make_ldsc_data/munge_gwas.R: Processes and formats GWAS summary statistics for multiple ancestries
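
For orientation, the sketch below shows one way a binary LDSC annotation could be built in Python. It is not the logic of make_annot.R, which is not reproduced here. The function name `make_annot`, the input column names `chrom` and `pos`, the label `AKITA_DIVERGENT`, and the choice to match on exact position are all assumptions made for illustration. Only the leading CHR, BP, SNP, CM columns follow the usual LDSC `.annot` layout.

```python
import pandas as pd


def make_annot(ref_snps, variants, name="AKITA_DIVERGENT"):
    """Flag reference SNPs whose (chromosome, position) appears in `variants`.

    ref_snps : DataFrame with columns CHR, BP, SNP, CM (one row per reference SNP)
    variants : DataFrame with hypothetical columns `chrom` and `pos`
    name     : hypothetical annotation label, not taken from the repository
    """
    # Normalise chromosome labels so "chr1" and "1" compare equal.
    chroms = variants["chrom"].astype(str).str.replace("chr", "", regex=False)
    keys = set(zip(chroms, variants["pos"].astype(int)))

    annot = ref_snps[["CHR", "BP", "SNP", "CM"]].copy()
    # 1 if the SNP is in the variant set, 0 otherwise (assumption: exact-position match).
    annot[name] = [
        int((str(c), int(b)) in keys) for c, b in zip(annot["CHR"], annot["BP"])
    ]
    return annot


# Toy inputs, invented purely to exercise the function.
ref = pd.DataFrame(
    {"CHR": [1, 1, 2], "BP": [100, 200, 100], "SNP": ["rs1", "rs2", "rs3"], "CM": [0.0, 0.0, 0.0]}
)
hits = pd.DataFrame({"chrom": ["chr1", "2"], "pos": [200, 100]})
annot = make_annot(ref, hits)
```

A window-based annotation (flagging every SNP inside a divergent window rather than only the listed positions) would replace the set lookup with an interval test.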

Data Resources:

akita_variants/variants_in_divergent_windows_10AF.txt: Height-associated variants identified from Akita model predictions
- Format: chromosome, position, reference allele, alternative allele, window position
- Contains variants showing divergent effects on chromatin organization
gwas_raw/: Population-specific GWAS summary statistics for human height
- afr_hg38.txt: African ancestry populations
- amr_hg38.txt: Admixed American populations
- eas_hg38.txt: East Asian populations
- eur_hg38.txt: European ancestry populations
- sas_hg38.txt: South Asian populations
- all_hg38.txt: Combined multi-ancestry analysis
- All files use hg38 genomic coordinates with standard GWAS format (SNPID, RSID, CHR, POS, etc.)
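
A minimal sketch of reading the variant file into a table follows. The five fields come from the format description above, but the column names, the tab delimiter, and the absence of a header row are assumptions; adjust them after inspecting the actual file. The helper name `load_variants` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical column names for the five documented fields.
COLUMNS = ["chrom", "pos", "ref", "alt", "window"]


def load_variants(path_or_buffer, sep="\t"):
    """Read the variant table (assumption: delimited text with no header row)."""
    return pd.read_csv(
        path_or_buffer,
        sep=sep,
        header=None,
        names=COLUMNS,
        dtype={"chrom": str, "ref": str, "alt": str},
    )


# Invented rows standing in for the real file, for demonstration only.
example = io.StringIO("chr1\t1000\tA\tG\t5000\nchr1\t2000\tC\tT\t5000\nchr2\t300\tG\tA\t900\n")
variants = load_variants(example)

# Number of variants per chromosome.
per_chrom = variants.groupby("chrom").size()
```

With the real data, the call would be `load_variants("akita_variants/variants_in_divergent_windows_10AF.txt")`, run from the `LDSC Enrichment/` directory.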

🧬 Hi-C Data Visualization (`Visualize HiC/`)

Specialized tools for creating publication-ready visualizations of chromatin contact maps with genomic annotations.

Main Notebook:

visualization_hic.ipynb: Comprehensive Jupyter notebook containing:
- Custom functions for Hi-C data processing and visualization
- Diamond-rotated contact map plotting
- Integration of multiple genomic annotation tracks
- Side-by-side comparison tools for different conditions
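
The diamond-rotated layout mentioned above can be sketched as a coordinate transform: each bin corner (i, j) is rotated by 45 degrees and rescaled so that the matrix diagonal lies along the x axis. The function names `diamond_coords` and `plot_diamond` are hypothetical and are not the notebook's own helpers; `ax` is assumed to be a matplotlib Axes.

```python
import numpy as np


def diamond_coords(n_bins):
    """Return corner coordinates for an n_bins x n_bins map rotated by 45 degrees.

    The diagonal (i == j) lands on y = 0; the upper triangle (j > i) has y > 0.
    """
    idx = np.arange(n_bins + 1)  # n_bins cells need n_bins + 1 corners per axis
    i, j = np.meshgrid(idx, idx, indexing="ij")
    x = (i + j) / 2.0  # position along the genome, in bin units
    y = (j - i) / 2.0  # half the genomic separation, in bin units
    return x, y


def plot_diamond(ax, contact_map, **kwargs):
    """Draw the upper triangle of a square contact map on a matplotlib Axes."""
    n = contact_map.shape[0]
    x, y = diamond_coords(n)
    mesh = ax.pcolormesh(x, y, contact_map, **kwargs)
    ax.set_ylim(0, n / 2.0)  # hide the mirrored lower triangle
    ax.set_aspect("equal")
    return mesh


x, y = diamond_coords(3)
```

In a notebook, `fig, ax = plt.subplots()` followed by `plot_diamond(ax, contact_map, cmap="RdBu_r")` would render the map; annotation tracks can then share the same x axis because x is expressed in bin units.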

Key Functions:

from_upper_triu(): Converts upper triangular vector format to symmetric contact matrices
load_individual_map(): Loads and processes Hi-C predictions from model outputs
annotations(): Adds genomic annotation tracks (genes, CTCF sites, conservation)
map_comparison_no_delta_with_annotations(): Creates comparative visualizations with genomic context
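
To make the first of these concrete, here is a minimal sketch of an upper-triangular-vector to symmetric-matrix conversion. It illustrates the technique rather than reproducing the notebook's implementation: the argument names `matrix_len` and `num_diags`, and the choice to mask the near-diagonal band with NaN, are assumptions.

```python
import numpy as np


def from_upper_triu(vector_repr, matrix_len, num_diags):
    """Rebuild a symmetric contact matrix from its flattened upper triangle.

    vector_repr : 1D values for entries (i, j) with j - i >= num_diags, row-major
    matrix_len  : side length of the square output matrix
    num_diags   : number of diagonals omitted from the vector (assumed semantics)
    """
    rows, cols = np.triu_indices(matrix_len, k=num_diags)
    if len(vector_repr) != len(rows):
        raise ValueError(
            f"expected {len(rows)} values for matrix_len={matrix_len}, "
            f"num_diags={num_diags}; got {len(vector_repr)}"
        )
    mat = np.zeros((matrix_len, matrix_len), dtype=float)
    mat[rows, cols] = vector_repr
    # Mirror the strict upper triangle so the main diagonal is never doubled.
    mat = mat + np.triu(mat, 1).T
    # Entries closer to the diagonal than num_diags were never predicted: mark as NaN.
    i, j = np.indices(mat.shape)
    mat[np.abs(i - j) < num_diags] = np.nan
    return mat


# A 4 x 4 map with the two innermost diagonals skipped has three stored values.
m = from_upper_triu(np.array([1.0, 2.0, 3.0]), matrix_len=4, num_diags=2)
```

Masking with NaN keeps the unpredicted band from being mistaken for zero contact when the map is plotted.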

Required Annotation Files (not included):

grch38_gene_annotations.bed: Gene annotation data
phastConsElements100way_hg38.bed: Evolutionary conservation tracks
ctcf_full_merged_hg38.bed: CTCF binding sites

📄 Documentation and Results

Figures/: Publication-quality figures and supplementary materials
- figure1.png: Main results overview
- figure2.png: LDSC enrichment analysis results
- figure3.png: Hi-C visualization examples
- figure4.png: Mechanistic model
- figure s1.png: Supplementary analysis
ASHG Abstract/: Conference materials
- Abstract submission for the American Society of Human Genetics meeting
ASHG Presentation/: Conference presentation materials
- PowerPoint presentation with key findings and visualizations
Height 3D Genome MS.pdf: Full manuscript (if available)

Getting Started

Prerequisites

R (≥4.0) with packages for statistical analysis
Python (≥3.8) with Jupyter notebook support
Required Python packages: matplotlib, numpy, pandas, scipy
Access to the LDSC software suite (for enrichment analysis)

Running the Analysis

1. LDSC Enrichment Analysis

cd "LDSC Enrichment/make_ldsc_data/"

# Generate annotation files
Rscript make_annot.R

# Process GWAS summary statistics
Rscript munge_gwas.R

2. Hi-C Visualizations

# Launch Jupyter notebook
jupyter notebook "Visualize HiC/visualization_hic.ipynb"

# Follow the notebook cells to:
# - Load Hi-C prediction data
# - Add genomic annotations
# - Generate publication figures

3. Quick Git Synchronization

# Use provided auto-sync scripts for version control
./auto_sync.sh      # Unix/Mac
# OR
auto_sync.bat       # Windows

Data Sources

Height GWAS: Multi-ancestry genome-wide association study data
3D Genome Predictions: Machine learning models (Akita) for chromatin organization
Genomic Annotations: Gene locations, regulatory elements, conservation scores
Population Genetics: Linkage disequilibrium patterns across human populations

Key Findings

This analysis reveals:

Height-associated variants are enriched in regions with altered 3D chromatin structure
Population-specific differences in regulatory mechanisms
Novel regulatory targets for height-associated loci
Mechanistic insights into how genetic variants affect gene regulation through chromatin organization

Citation

Manuscript in preparation

Preprint available soon

Contact

For questions about the analysis or code, please open an issue in this repository.

License

MIT License

BaranziniLab/ML3DgenomeHeight
