This repository contains the complete computational pipeline and analysis code for studying how genetic variants associated with human height affect 3D chromatin organization and gene regulation. The project combines machine learning approaches with genomics data to understand the regulatory mechanisms underlying height-associated genetic variants.
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human height, but the mechanisms by which these variants affect gene regulation through 3D chromatin structure remain poorly understood. This study applies machine learning models to predict how height-associated variants alter 3D genome organization and gene expression patterns.
Linkage Disequilibrium Score Regression (LDSC) analysis to test for enrichment of height-associated variants in specific genomic regions.
- make_ldsc_data/make_annot.R: Generates LDSC-compatible annotation files from variant data
- make_ldsc_data/munge_gwas.R: Processes and formats GWAS summary statistics for multiple ancestries
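As a rough illustration of what an LDSC-compatible annotation file looks like, the sketch below writes a single binary annotation column next to the usual `CHR`, `BP`, `SNP`, `CM` columns. It is not taken from `make_annot.R`: the function name `write_annot`, the column name `DIVERGENT`, and the rule "flag a SNP when its (chromosome, position) is in the variant set" are all assumptions made for the example.

```python
import io


def write_annot(bim_rows, flagged_positions, out, annot_name="DIVERGENT"):
    """Write one binary annotation column in an LDSC-style .annot layout.

    bim_rows: iterable of (chrom, snp_id, cm, bp) tuples, in PLINK .bim order.
    flagged_positions: set of (chrom, bp) pairs to mark with 1 (hypothetical rule).
    out: any writable text handle.
    """
    out.write("\t".join(["CHR", "BP", "SNP", "CM", annot_name]) + "\n")
    for chrom, snp_id, cm, bp in bim_rows:
        flag = 1 if (chrom, bp) in flagged_positions else 0
        out.write(f"{chrom}\t{bp}\t{snp_id}\t{cm}\t{flag}\n")


# Toy input: two reference-panel SNPs, one of which is in the variant set.
bim_rows = [("1", "rs1", "0", 1000), ("1", "rs2", "0", 2000)]
flagged = {("1", 1000)}

buffer = io.StringIO()
write_annot(bim_rows, flagged, buffer)
annot_lines = buffer.getvalue().splitlines()
```

Rows are written in the same order as the input so the annotation stays aligned with the reference panel it was built from.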
- akita_variants/variants_in_divergent_windows_10AF.txt: Height-associated variants identified from Akita model predictions
  - Format: chromosome, position, reference allele, alternative allele, window position
  - Contains variants showing divergent effects on chromatin organization
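A minimal sketch of reading a file with the five fields listed above. It assumes a tab-delimited file with no header row, which this README does not confirm; the `Variant` type and `read_variants` function are hypothetical names, and the window position is kept as text because its encoding is not specified here.

```python
import csv
import io
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    window: str  # kept as text; the encoding of this field is an unknown


def read_variants(handle, delimiter="\t") -> List[Variant]:
    """Parse rows of: chromosome, position, ref allele, alt allele, window position."""
    variants = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        chrom, pos, ref, alt, window = row[:5]
        variants.append(Variant(chrom, int(pos), ref, alt, window))
    return variants


# Toy data standing in for the real file.
example = "chr1\t1000\tA\tG\t524288\nchr2\t2000\tC\tT\t1048576\n"
variants = read_variants(io.StringIO(example))
```

For the real file, pass an open file handle instead of the `io.StringIO` object, and adjust `delimiter` if the file is space- or comma-separated.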
- gwas_raw/: Population-specific GWAS summary statistics for human height
  - afr_hg38.txt: African ancestry populations
  - amr_hg38.txt: Admixed American populations
  - eas_hg38.txt: East Asian populations
  - eur_hg38.txt: European ancestry populations
  - sas_hg38.txt: South Asian populations
  - all_hg38.txt: Combined multi-ancestry analysis
  - All files use hg38 genomic coordinates with standard GWAS format (SNPID, RSID, CHR, POS, etc.)
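The summary statistics can be indexed by genomic position for quick lookups. The sketch below uses only the four column names listed above (SNPID, RSID, CHR, POS) and assumes a tab-delimited file with a header row; the function name `index_by_position` and the toy rows are illustrative, not part of this repository.

```python
import csv
import io


def index_by_position(handle, delimiter="\t"):
    """Map (CHR, POS) to the full summary-statistics row for that variant."""
    index = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        index[(row["CHR"], int(row["POS"]))] = row
    return index


# Toy rows standing in for one of the *_hg38.txt files.
example = (
    "SNPID\tRSID\tCHR\tPOS\n"
    "1:1000:A:G\trs1\t1\t1000\n"
    "2:2000:C:T\trs2\t2\t2000\n"
)
gwas = index_by_position(io.StringIO(example))
hit = gwas.get(("1", 1000))
```

If two rows share a position (for example, multi-allelic sites), the later row overwrites the earlier one in this sketch; a fuller version would key on alleles as well.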
Specialized tools for creating publication-ready visualizations of chromatin contact maps with genomic annotations.
visualization_hic.ipynb: Comprehensive Jupyter notebook containing:
- Custom functions for Hi-C data processing and visualization
- Diamond-rotated contact map plotting
- Integration of multiple genomic annotation tracks
- Side-by-side comparison tools for different conditions
- from_upper_triu(): Converts upper triangular vector format to symmetric contact matrices
- load_individual_map(): Loads and processes Hi-C predictions from model outputs
- annotations(): Adds genomic annotation tracks (genes, CTCF sites, conservation)
- map_comparison_no_delta_with_annotations(): Creates comparative visualizations with genomic context
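To make the upper-triangular conversion concrete, here is a minimal sketch of how such a vector can be unpacked into a symmetric matrix with NumPy. It is not the notebook's code: the name `from_upper_triu_sketch`, its parameters, and the choice to mask the skipped near-diagonal band with NaN are assumptions for illustration.

```python
import numpy as np


def from_upper_triu_sketch(vector, matrix_len, num_diags):
    """Unpack a flattened upper triangle into a symmetric contact matrix.

    vector: values for the entries with column - row >= num_diags, in row-major order.
    matrix_len: side length of the square output matrix.
    num_diags: number of diagonals (starting at the main one) absent from the
               vector; assumed to be >= 1 in this sketch.
    """
    rows, cols = np.triu_indices(matrix_len, k=num_diags)
    vector = np.asarray(vector, dtype=float)
    if vector.shape != rows.shape:
        raise ValueError(f"expected {rows.size} values, got {vector.size}")

    matrix = np.zeros((matrix_len, matrix_len))
    matrix[rows, cols] = vector
    matrix = matrix + matrix.T  # mirror the upper triangle into the lower one

    # Mark the band that the vector does not cover as missing.
    for offset in range(-num_diags + 1, num_diags):
        idx = np.arange(matrix_len - abs(offset))
        if offset >= 0:
            matrix[idx, idx + offset] = np.nan
        else:
            matrix[idx - offset, idx] = np.nan
    return matrix


# A 4x4 map with the main diagonal and first off-diagonals skipped
# leaves three upper-triangle entries: (0, 2), (0, 3), (1, 3).
contact = from_upper_triu_sketch([1.0, 2.0, 3.0], matrix_len=4, num_diags=2)
```

Masking the near-diagonal band with NaN keeps plotting code from treating uncovered bins as zero contact.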
- grch38_gene_annotations.bed: Gene annotation data
- phastConsElements100way_hg38.bed: Evolutionary conservation tracks
- ctcf_full_merged_hg38.bed: CTCF binding sites
- Figures/: Publication-quality figures and supplementary materials
  - figure1.png: Main results overview
  - figure2.png: LDSC enrichment analysis results
  - figure3.png: Hi-C visualization examples
  - figure4.png: Mechanistic model
  - figure s1.png: Supplementary analysis
- ASHG Abstract/: Conference materials
  - Abstract submission for the American Society of Human Genetics meeting
- ASHG Presentation/: Conference presentation materials
  - PowerPoint presentation with key findings and visualizations
- Height 3D Genome MS.pdf: Full manuscript (if available)
- R (≥4.0) with packages for statistical analysis
- Python (≥3.8) with Jupyter notebook support
- Required Python packages: matplotlib, numpy, pandas, scipy
- Access to LDSC software suite (for enrichment analysis)
cd "LDSC Enrichment/make_ldsc_data/"
# Generate annotation files
Rscript make_annot.R
# Process GWAS summary statistics
Rscript munge_gwas.R

# Launch Jupyter notebook
jupyter notebook "Visualize HiC/visualization_hic.ipynb"
# Follow the notebook cells to:
# - Load Hi-C prediction data
# - Add genomic annotations
# - Generate publication figures

# Use provided auto-sync scripts for version control
./auto_sync.sh # Unix/Mac
# OR
auto_sync.bat # Windows

- Height GWAS: Multi-ancestry genome-wide association study data
- 3D Genome Predictions: Machine learning models (Akita) for chromatin organization
- Genomic Annotations: Gene locations, regulatory elements, conservation scores
- Population Genetics: Linkage disequilibrium patterns across human populations
This analysis reveals:
- Height-associated variants are enriched in regions with altered 3D chromatin structure
- Population-specific differences in regulatory mechanisms
- Novel regulatory targets for height-associated loci
- Mechanistic insights into how genetic variants affect gene regulation through chromatin organization
Manuscript in preparation
Preprint available soon
For questions about the analysis or code, please open an issue in this repository.
MIT License