This repository provides an implementation of the graphREML method described in:
Hui Li, Tushar Kamath, Rahul Mazumder, Xihong Lin, and Luke J. O'Connor (2024). Improved heritability partitioning and enrichment analyses using summary statistics with graphREML. medRxiv, 2024-11. DOI: 10.1101/2024.11.04.24316716
and a Python API for computationally efficient linkage disequilibrium (LD) matrix operations with LD graphical models (LDGMs), described in:
Pouria Salehi Nowbandegani, Anthony Wilder Wohns, Jenna L. Ballard, Eric S. Lander, Alex Bloemendal, Benjamin M. Neale, and Luke J. O’Connor (2023) Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Nat Genet. DOI: 10.1038/s41588-023-01487-8
It also provides very fast utilities for simulating GWAS summary statistics and computing polygenic risk scores.
Required system dependencies:
- SuiteSparse (for CHOLMOD): On Mac, install with `brew install suitesparse`. SuiteSparse is wrapped in scikit-sparse.
- Intel MKL (for Intel chips): The performance of SuiteSparse is significantly improved by using Intel MKL instead of OpenBLAS, which will likely be the default. See Giulio Genovese's documentation here.
If you are only using the heritability enrichment score test, nothing needs to be installed. Clone the package and run the source file as a standalone script: uv run graphld/score_test.py --help.
In the repo directory:

```
uv venv
source .venv/bin/activate
uv sync
```

For development installation:

```
uv sync --dev --extra dev  # editable with pytest dependencies
uv run pytest  # tests will fail if you haven't run `make download` (see below)
```

Although we recommend moving away from conda, it does have the advantage that you can `conda install` SuiteSparse. The example commands below are based on the O2 cluster in the Harvard Medical School computing system.
- Create a conda env for SuiteSparse and activate it:

```
module load miniconda3/4.10.3
conda create -n suitesparse conda-forge::suitesparse python=3.11.0
conda activate suitesparse
```

- You may need to revert or reinstall some Python packages, if prompted. For example, to install the correct version of numpy, use:

```
pip install numpy==1.26.4
```

- Install scikit-sparse; you may need to add some conda channels:

```
conda config --add channels conda-forge
conda config --set channel_priority strict
conda install scikit-sparse
```

- Install graphLD:

```
cd graphld && pip install .
```

- Test that it works:

```
graphld -h
```

Pre-computed LDGMs for the 1000 Genomes Project data are available at Zenodo. You can download them using the provided Makefile in the data/ directory:
```
cd data && make download
```

The download should take 5-30 minutes. By default, this will download and extract LDGM precision matrices, baselineLD annotation files for heritability partitioning with graphREML, and precomputed score test statistics for the heritability enrichment score test. If you only want precision matrices, use `make download_precision`. If you only want enrichment score test statistics, use `make download_scorestats`. If you want to download GWAS summary statistics from Li et al. 2025, additionally run `make download_sumstats`.
The CLI has commands for:

- `blup`, which computes best linear unbiased predictor weights
- `clump`, which performs p-value thresholding and LD-based pruning
- `simulate`, which simulates GWAS summary statistics
- `reml`, which runs graphREML
The commands largely follow a common interface. Use uv run graphld <command> -h to see usage for a specific command. If you downloaded LDGMs using the Makefile, they will be located automatically.
There is a separate command for the enrichment score test: uv run estest -h.
To run graphREML:

```
uv run graphld reml \
    /path/to/sumstats/file.sumstats \
    output_files_prefix \
    --annot-dir /directory/containing/annotation/files/
```

The summary statistics can be in VCF (.vcf) or LDSC (.sumstats) format. The annotation directory should contain per-chromosome annotation files in LDSC (.annot) format. There can be multiple .annot files per chromosome, including some in the thin-annot format (i.e., without variant IDs). It can additionally contain UCSC .bed files, not stratified per-chromosome; for each .bed file, a new binary annotation will be created with a 1 for variants whose GRCh38 coordinates match the .bed file.
There will be two output files: output_files_prefix.tall.csv, which contains heritability, enrichment, and coefficient estimates for each annotation; and output_files_prefix.convergence.csv, which contains information about the optimization process. If you specify the --alt-output flag, the tall.csv file will be replaced with three files, .heritability.csv, .enrichment.csv, and .parameters.csv, containing heritability, enrichment, and coefficient estimates for each annotation, respectively; these files have one line per model run and three columns per annotation, so that you can store the results of multiple runs or traits in one file. (Use it with --name to keep track of which line is which run.)
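For example, the main output can be loaded with polars for inspection (a minimal sketch; the exact columns depend on your annotations):

```python
import polars as pl

# Load the per-annotation estimates written by graphREML
results = pl.read_csv("output_files_prefix.tall.csv")
print(results.head())
```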
An important flag is --intercept, which specifies the expected inflation of the summary statistics under the null (analogous to the LD score regression intercept). Omitting it leads to upward bias in the heritability estimates and downward bias (i.e., toward 1) in the enrichment estimates. For example, UK Biobank height has an intercept of around 1.5, and specifying this value changes the coding variant enrichment estimate from around 7 to around 11. Most traits will have smaller intercepts. You can skip this flag if you are OK with some downward bias.
Non-convergence can be an issue in this kind of algorithm. We've tried to ensure that the trust region procedure is robust, but you should inspect the convergence.csv file and check the log-likelihood changes. If these abruptly go from some large value (e.g., 10) to nearly zero (e.g., 1e-4), it indicates a problem with convergence. Try (1) increasing the number of samples in the stochastic trace estimator with --xtrace-num-samples 200, (2) resetting the trust region at each iteration with --reset-trust-region, and (3) as a last resort, discarding LD blocks containing large-effect variants using --max-chisq-threshold, e.g., --max-chisq-threshold 300.
If some variants are missing from your GWAS summary statistics, graphREML automatically assigns 'surrogate markers' in high LD with those variants. A majority of variants will be missing if your summary statistics include only HapMap3 SNPs (~1.1M SNPs). This step can be very slow, but you can use precomputed surrogates to avoid re-doing it each time. To precompute surrogates, run `graphld surrogates` on your summary statistics. This creates a file with cached surrogate markers, which can be passed to graphREML with the --surrogates flag. The surrogates should match your summary statistics approximately but do not need to match exactly; you can use the same surrogates across different sumstats files with slightly different SNPs.
The enrichment score test is a fast way to test a large number of genomic annotations for heritability enrichment conditional upon some null model. It produces Z statistics, but not point estimates. It requires precomputed derivatives (first and second derivatives of the log-likelihood for each variant). Run graphREML as described above with whatever annotations you wish to include in the null model (i.e., not the ones being tested), and supply the --score-test-filename flag to create a file containing pre-computed derivatives. The file also includes all the annotations in the null model, which will make it somewhat large (>1GB). Data for multiple traits can be added to the same file, but they must share the same annotations.
To perform the score test:

```
uv run estest \
    path/to/precomputed/derivatives.h5 \
    path/to/output/file/prefix \
    -a /directory/containing/annotation/files/to/test
```

Alternatively, use `uv run graphld/score_test.py` with the same arguments. All of the dependencies for this script are included as script dependencies, so there should be no installation needed.
A file will be saved called output_prefix.txt, containing one row per annotation and one column per trait, plus a column named meta_analysis. Each entry is a Z score, where a positive value indicates that an annotation is enriched for heritability conditional upon the null model (i.e., that it has a positive value of tau).
Optional arguments:

- `-a, --annotations_dir`: Directory containing the new annotation files (in `.annot` format; see above) that you want to test for enrichment. Optionally, include UCSC `.bed` files (one for the whole genome). Default: "data/annotations/", which is populated by the Makefile.
- `--annotations`: Optional comma-separated list with a subset of annotation names to be tested (header names in the `.annot` files).
- `--trait_name`: Optional name of a single trait to be tested (specified with `--name` in the `graphld reml` run). If unspecified, all traits are tested and a meta-analysis is performed.
- `--add-random`: Optional comma-separated list of numbers between 0 and 1; for each entry, a random annotation will be created and tested, with the proportion of variants equal to the number specified. You can use this to validate that the test produces correctly calibrated Z scores.
graphREML can also be run from Python:

```python
import graphld as gld
import polars as pl

sumstats: pl.DataFrame = gld.read_ldsc_sumstats("data/test/example.sumstats")
annotations: pl.DataFrame = gld.load_annotations("data/test/", chromosome=1)

default_model_options = gld.ModelOptions()
default_method_options = gld.MethodOptions()

reml_results: dict = gld.run_graphREML(
    model_options=default_model_options,
    method_options=default_method_options,
    summary_stats=sumstats,
    annotation_data=annotations,
    ldgm_metadata_path="data/test/metadata.csv",
    populations="EUR"
)
```

The estimator returns a dictionary containing heritability, enrichment, and coefficient estimates for each annotation, together with standard errors and two-tailed log10 p-values.
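Since the exact contents depend on your annotations, a quick way to see what is returned is to list the dictionary entries (a minimal sketch):

```python
# List the contents of the graphREML results dictionary
for key, value in reml_results.items():
    print(key, type(value))
```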
LD matrix operations can be performed using the PrecisionOperator, which subclasses the SciPy LinearOperator. It represents an LDGM precision matrix or its Schur complement. To compute correlation_matrix[indices, indices] @ vector, use ldgm[indices].solve(vector). To compute inv(correlation_matrix[indices, indices]) @ vector, use ldgm[indices] @ vector. Importantly, you cannot do this indexing manually: the submatrix of an inverse is not the inverse of a submatrix. See Section 5 of the supplementary material of our paper.
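For instance, a minimal sketch using the test data shipped with the repository (the index set is arbitrary, and we assume integer-array indexing as described above):

```python
import numpy as np
import graphld as gld

ldgm = gld.load_ldgm(
    filepath="data/test/1kg_chr1_16103_2888443.EAS.edgelist",
    snplist_path="data/test/1kg_chr1_16103_2888443.snplist",
)

indices = np.array([0, 10, 100])  # arbitrary subset of rows/columns
v = np.random.randn(len(indices))

# correlation_matrix[indices, indices] @ v: index first, then solve
correlation_times_v = ldgm[indices].solve(v)

# inv(correlation_matrix[indices, indices]) @ v: index first, then multiply
inverse_correlation_times_v = ldgm[indices] @ v
```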
A complete example, loading an LDGM and operating on the full matrix:

```python
import numpy as np

ldgm: PrecisionOperator = gld.load_ldgm(
    filepath="data/test/1kg_chr1_16103_2888443.EAS.edgelist",
    snplist_path="data/test/1kg_chr1_16103_2888443.snplist"
)

vector = np.random.randn(ldgm.shape[0])
precision_times_vector = ldgm @ vector
correlation_times_vector = ldgm.solve(precision_times_vector)
assert np.allclose(correlation_times_vector, vector)
```

We provide convenient functions to load LDGMs and merge them with data (probably summary statistics). You'll likely wish to make use of a metadata file that contains information about LDGMs, with columns `chrom, chromStart, chromEnd, name, snplistName, population, numVariants, numIndices, numEntries, info`:
```python
ldgm_metadata: pl.DataFrame = gld.read_ldgm_metadata("data/test/metadata.csv", populations=["EUR"])
```

Then, load some summary statistics and partition their rows across LD blocks:
```python
sumstats: pl.DataFrame = gld.read_ldsc_sumstats("data/test/example.sumstats")
partitioned_sumstats: list[pl.DataFrame] = gld.partition_variants(ldgm_metadata, sumstats)
```

You can now load LDGMs and merge them with summary statistics:
```python
merged_ldgms = []
for row, df in zip(ldgm_metadata.iter_rows(named=True), partitioned_sumstats):
    ldgm: PrecisionOperator = gld.load_ldgm(
        filepath="data/test/" + row['name'],
        snplist_path="data/test/" + row['snplistName']
    )
    ldgm, _ = gld.merge_snplists(ldgm, df)
    merged_ldgms.append(ldgm)
```

After doing this, each entry of merged_ldgms will contain in its variant_info dataframe all of the columns from your summary statistics, and you can perform operations like:
```python
for ldgm in merged_ldgms:
    # Retain only the first SNP that matches each row/col of the LDGM
    z_scores = ldgm.variant_info.group_by('index', maintain_order=True) \
        .agg(pl.col('Z').first()).select('Z').to_numpy()
    solution = ldgm.solve(z_scores)
```

The likelihood of GWAS summary statistics under an infinitesimal model is:

$$z \sim \mathcal{N}(0,\; nRDR + R),$$

where $z$ is the vector of GWAS Z scores, $R$ is the LD correlation matrix, $D$ is a diagonal matrix of per-variant effect-size variances, and $n$ is the GWAS sample size.
The following functions are available:

- `gaussian_likelihood(pz, M)`: Computes the log-likelihood.
- `gaussian_likelihood_gradient(pz, M, del_M_del_a=None)`: Computes the gradient of the log-likelihood, either with respect to the diagonal elements of `M` (equivalently `D`), or with respect to parameters `a` whose partial derivatives are provided in `del_M_del_a`.
- `gaussian_likelihood_hessian(pz, M, del_M_del_a)`: Computes an approximation to the Hessian of the log-likelihood with respect to `a`. This is minus the average of the Fisher information matrix and the observed information matrix, and it is a good approximation when the gradient is close to zero.
LD clumping identifies independent index variants by iteratively selecting the variant with the highest chi-squared statistic and pruning away variants in LD with it (r-squared above a threshold):
```python
sumstats_clumped: pl.DataFrame = gld.run_clump(
    sumstats=sumstats_dataframe_with_z_scores,
    z_col='Z',
    ldgm_metadata_path="data/test/metadata.csv",
    populations='EUR',
    rsq_threshold=0.1,
    chisq_threshold=30.0,
).filter(pl.col('is_index'))
```

Summary statistics can be simulated from their asymptotic distribution without individual-level genotype data. Effect sizes are drawn from a flexible mixture distribution, with support for annotation-dependent and frequency-dependent architectures. Unlike the MATLAB implementation, the Python implementation does not support multiple ancestry groups.
```python
# Define a custom link function at module level
def annot_to_h2(x: np.ndarray) -> np.ndarray:
    return x[:, 0] + 9 * x[:, 1]  # 10x more h2 for coding vs. noncoding

sumstats: pl.DataFrame = gld.run_simulate(
    sample_size=10000,
    heritability=0.5,
    component_variance=[1, 10],
    component_weight=[.01, .001],
    alpha_param=-1,
    populations='EUR',
    annotations=annot_df,
    link_fn=annot_to_h2,
)
```

The effect size distribution is a mixture of normal distributions with variances specified by component_variance and weights specified by component_weight. If the weights sum to less than one, the remaining variants have no effect. The alpha_param parameter controls the frequency-dependent architecture and should normally range between -1 and 1.
Annotations can be included; if so, the link_fn should be specified, mapping annotations to relative per-variant heritability. For example, if we have the all-ones annotation and a coding annotation, then the link function could map to 10x more per-variant heritability for coding vs. noncoding variants.
Custom link_fn functions must be defined at module level (not as lambda functions or nested functions) to work with multiprocessing.
Under the infinitesimal model, with per-s.d. effect sizes $\beta \sim \mathcal{N}(0, \sigma^2_\beta I)$, the best linear unbiased predictor (BLUP) is

$$E[\beta \mid z] = \sqrt{n}\,\sigma^2_\beta\,(n \sigma^2_\beta R + I)^{-1} z,$$

where we approximate the LD correlation matrix as the inverse of the LDGM precision matrix, $R \approx P^{-1}$:
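As a toy sanity check of this formula (pure NumPy with a random positive-definite stand-in for $R$; a sketch, not part of the package API):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, h2 = 50, 10_000, 0.1
sigma2 = h2 / m  # per-variant effect-size variance, heritability spread equally

# Toy correlation matrix: identity plus shared structure, rescaled to unit diagonal
A = rng.standard_normal((m, m)) / np.sqrt(m)
R = np.eye(m) + A @ A.T
d = np.sqrt(np.diag(R))
R = R / d[:, None] / d[None, :]

z = rng.standard_normal(m)  # placeholder Z scores

# BLUP of per-s.d. effect sizes: E[beta | z] = sqrt(n) sigma2 (n sigma2 R + I)^{-1} z
beta_blup = np.sqrt(n) * sigma2 * np.linalg.solve(n * sigma2 * R + np.eye(m), z)
print(beta_blup[:5])
```

The `run_blup` function performs the analogous computation using LDGMs: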
```python
sumstats_with_weights: pl.DataFrame = gld.run_blup(
    ldgm_metadata_path="data/metadata.csv",
    sumstats=sumstats_dataframe_with_z_scores,
    heritability=0.1
)
```

This function assumes that heritability is equally distributed among all variants with Z scores provided, i.e., with $\sigma^2_\beta = h^2 / M$, where $M$ is the number of such variants.
ParallelProcessor is a base class which can be used to implement parallel algorithms with LDGMs, wrapping Python's multiprocessing module. It splits work among processes, each of which loads a subset of LD blocks. The advantage of using it is that it handles the loading of LDGMs within worker processes for you. An example can be found in tests/test_multiprocessing.py.
See LDSC summary statistics file format. Read with read_ldsc_sumstats.
The GWAS-VCF specification is supported via the read_gwas_vcf function. It reads a VCF file with the following mandatory FORMAT fields:

- `ES`: Effect size estimate
- `SE`: Standard error of the effect size
- `LP`: -log10 p-value
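For example (a minimal sketch; the file path is hypothetical):

```python
import graphld as gld

# Read GWAS-VCF summary statistics into a DataFrame,
# analogous to read_ldsc_sumstats
sumstats = gld.read_gwas_vcf("path/to/sumstats.vcf")
```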
You can download BaselineLD model annotation files with GRCh38 coordinates from the Price lab Google Cloud bucket: https://console.cloud.google.com/storage/browser/broad-alkesgroup-public-requester-pays/LDSCORE/GRCh38
Read annotation files with load_annotations.
You can also read UCSC .bed annotation files with load_annotations, and they will be added to the annotation dataframe with one column per file.
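For example (a sketch; assumes annotation files have been downloaded to data/annotations/ via the Makefile):

```python
import graphld as gld

# Load per-chromosome .annot files, plus any whole-genome .bed files,
# from the annotations directory for chromosome 22
annotations = gld.load_annotations("data/annotations/", chromosome=22)
```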
- Main LDGM repository, including a MATLAB API: https://github.com/awohns/ldgm
- Original graphREML repository, with a MATLAB implementation: https://github.com/huilisabrina/graphREML (we recommend using the Python implementation, which is much faster)
- LD score regression repository: https://github.com/bulik/ldsc
- Giulio Genovese has implemented an LDGM-VCF file format specification and a bcftools plugin written in C with partially overlapping features, available here.
- All of these rely heavily on sparse matrix operations implemented in SuiteSparse.