Skip to content

Commit 3fa37fe

Browse files
jamesnemeshtracyyuan123claude
authored
Ty eqtl squash (#72)
* Squash ty_eqtl before rebase * Moved main test script into inst/scripts. ROxygen docs were not checked in. * Moved main test script into inst/scripts. ROxygen docs were not checked in. * Moved python eqtl pipeline script to scripts dir. * Modified pipeline to use already generated data if it's available. * Modified pipeline to use already generated data if it's available. * Fixed some unqualified calls to data.table functions * Need a minimal package readme for install. * Refactoring of the python code to make the pipeline script minimally accessible. * Refactoring of the python code to make the pipeline script minimally accessible. * updated .gitignore for python * Trying to replicate the eQTL results plot that's in the slide deck * Trying to replicate the eQTL results plot that's in the slide deck * Fix step ordering, update K=9/seed=119, remove TODOs - Fix step 8/9 order: combine_expression first, then median_expression - Remove Jim's TODO comments - Update cluster_order to c(2,1,7,4,8,5,6,3,0) for K=9 - Update Python pipelines: K=9, random_state=119, desired_order=[2,1,7,4,8,5,6,3,0] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Sort cluster assignments output for reproducibility Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Deduplicate egene pairs by lowest q-value instead of first occurrence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Sort slope matrix by gene before k-means for reproducible clustering K-means results depend on input row order. Sort by phenotype_id when writing the index-SNP slope matrix (R) and when reading it back in the Python k-means pipeline. Also improve handling of zero-variance rows in gene expression ordering by adding a tiny deterministic trend instead of dropping them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update plotting outputs to SVG, fix cluster order, and align KMeans params - Switch plot_gene_snp, plot_cell_type_pairwise_cor, and kmeans_heatmap output from PNG to SVG for publication-quality figures - Update cluster_order to [5,0,6,2,7,8,10,1,9,4,3] across all scripts - Align KMeans params (random_state=42, n_init=200, max_iter=20) - Add human-readable cell type labels to pairwise correlation heatmap - Flip Fisher exact plot to horizontal orientation (clusters on y-axis) - Add variant_id column to cluster assignments output - Clean up stale PNG references in docstrings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Remove median imputation option, keep only zero imputation Median imputation of missing eQTL slopes is not methodologically sound. Remove the imputation_method/min_non_na parameters and .impute_slope_dt helper, hardcoding zero imputation (NA -> 0). Update all references across R/Python scripts, runner scripts, test pipelines, docs, and README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Pin KMeans algorithm to elkan for reproducibility across sklearn versions The default KMeans algorithm changed from elkan (sklearn <1.3) to lloyd (sklearn >=1.3), which can produce different cluster assignments. Explicitly setting algorithm="elkan" ensures identical results regardless of sklearn version. Also relaxed version constraints in pyproject.toml (scikit-learn>=1.0, numpy>=1.22, python>=3.9) to match the environment that produced the validated clustering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix cluster_order in test_eqtl_pipeline.R to K=11 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Sync test_eqtl_pipeline.py with K=11 params and infer plot format from extension Update K (9->11), random_state (119->42), desired_order, and heatmap output extension (.png->.svg) to match the K=11 clustering. Remove hardcoded format="svg" from savefig so output format is inferred from the file extension. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Sync cli/test_eqtl_pipeline.py with K=11 params Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Switch all plot outputs to SVG Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add distance-to-TSS scripts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add new end-to-end pipeline (0308) and update boxplot script New pipeline runs R Part 1 → Python K-means → R Part 2 via system2(), no manual intervention needed. Boxplot improvements: horizontal orientation option, smaller beeswarm points, Neuron (6) label fix, format inferred from output path extension. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Convert standalone scripts to package functions, remove hardcoded paths - Move get_index_snp_start_distance and plot_eqtl_distance_to_tss_boxplot from standalone scripts (inst/scripts/) to exported package functions (R/) - Pipeline now uses bican.mccarroll.eqtl:: calls instead of system2(Rscript) - Inline Python K-means call, remove separate test_eqtl_pipeline_0308.py - Remove hardcoded script_dir and local machine paths - Update NAMESPACE with new exports and imports - Use /broad/ server paths instead of local mount paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Restore separate Python K-means script, fix base_dir to /broad/ Revert Step 10 to call the Python script via system2() instead of inlining Python code. Fix base_dir in Python script from local mount path to /broad/ server path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * All of the scripts that shouldn't be run are commented out. The command line interface around the k-means plot has been expanded to allow for different K settings, cluster orderings, and input/output directory. * add more validation for cluster reorder * updated 0308 code to call k-means from the install * Updated the defaults. R script should be parameterized. * Polishing eQTL pipeline workflow + documentation. * Deleted old no longer needed scripts. * Updated k-means plot to allow sequential cluster labels * Finished fixing up eQTL plots. * Finished fixing up eQTL plots. * Some more eQTL plot cleanup. * Some more eQTL plot cleanup. --------- Co-authored-by: tracyyuan123 <tyuan@college.harvard.edu> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e7cf37b commit 3fa37fe

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

46 files changed

+4129
-1
lines changed

.gitignore

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,13 @@
66
**/venv
77
**/*.Rcheck
88
R/*.tar.gz
9+
10+
# Python
11+
__pycache__/
12+
*.py[cod]
13+
*$py.class
14+
*.so
15+
.env
16+
.venv
17+
env/
18+
venv/

R/bican.mccarroll.eqtl/DESCRIPTION

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,14 @@ Package: bican.mccarroll.eqtl
22
Type: Package
33
Title: Eqtl analysis for the bican manuscript
44
Version: 0.1.0
5-
Imports: data.table, ggplot2, ggvenn, cowplot,digest,logger, rlang
5+
Imports: data.table, ggplot2, ggvenn, cowplot, digest, logger, rlang,
6+
ComplexHeatmap, circlize, grid, grDevices, stats, utils,
7+
VariantAnnotation, GenomicRanges, IRanges, ggbeeswarm
8+
Remotes:
9+
bioc::ComplexHeatmap,
10+
bioc::VariantAnnotation,
11+
bioc::GenomicRanges,
12+
bioc::IRanges
613
Authors@R: c(
714
person(
815
"James", "Nemesh",

R/bican.mccarroll.eqtl/NAMESPACE

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
# Generated by roxygen2: do not edit by hand
22

33
export(build_pair_effect_table)
4+
export(combine_expression_across_cell_types)
45
export(compare_all_eQTL_runs)
56
export(compare_eqtl_runs)
67
export(compare_eqtl_runs_ctr)
@@ -9,39 +10,97 @@ export(eqtl_cluster_filtering_trajectories)
910
export(filter_covariates_to_subpopulation)
1011
export(filter_metacells_to_subpopulation)
1112
export(filter_significant_index)
13+
export(get_cell_type_pairwise_cor_matrix)
14+
export(get_egene_union_pairs)
15+
export(get_heatmap_index_snp_median_expression)
16+
export(get_index_snp_slope_matrix_with_impute)
17+
export(get_index_snp_start_distance)
1218
export(get_or_build_augmented_indices_for_sign)
19+
export(get_pval_nominal_matrix)
20+
export(get_pval_nominal_threshold_matrix)
21+
export(get_sig_coloc)
22+
export(get_slope_matrix)
23+
export(plot_cell_type_pairwise_cor)
1324
export(plot_effect_sizes)
25+
export(plot_eqtl_distance_to_tss_boxplot)
26+
export(plot_fisher_exact)
27+
export(plot_gene_snp)
1428
export(plot_pair_effects)
1529
export(plot_pair_pvals)
1630
export(read_all_pairs_file)
1731
export(read_index_file)
32+
export(run_eqtl_manuscript_pipeline)
33+
export(run_eqtl_manuscript_pipeline_defaults)
1834
export(select_reference_pairs)
35+
importFrom(ComplexHeatmap,Heatmap)
36+
importFrom(ComplexHeatmap,draw)
37+
importFrom(GenomicRanges,GRanges)
38+
importFrom(IRanges,IRanges)
39+
importFrom(VariantAnnotation,geno)
40+
importFrom(VariantAnnotation,readVcf)
41+
importFrom(circlize,colorRamp2)
1942
importFrom(cowplot,draw_label)
2043
importFrom(cowplot,ggdraw)
2144
importFrom(cowplot,plot_grid)
2245
importFrom(data.table,":=")
2346
importFrom(data.table,as.data.table)
47+
importFrom(data.table,data.table)
48+
importFrom(data.table,fifelse)
49+
importFrom(data.table,frank)
2450
importFrom(data.table,fread)
51+
importFrom(data.table,fwrite)
2552
importFrom(data.table,is.data.table)
53+
importFrom(data.table,melt)
54+
importFrom(data.table,rbindlist)
2655
importFrom(data.table,setDF)
56+
importFrom(data.table,setnames)
57+
importFrom(data.table,tstrsplit)
58+
importFrom(data.table,uniqueN)
2759
importFrom(digest,digest)
60+
importFrom(ggbeeswarm,geom_beeswarm)
61+
importFrom(ggbeeswarm,geom_quasirandom)
2862
importFrom(ggplot2,aes)
2963
importFrom(ggplot2,annotate)
64+
importFrom(ggplot2,coord_cartesian)
65+
importFrom(ggplot2,element_blank)
3066
importFrom(ggplot2,element_text)
67+
importFrom(ggplot2,expansion)
68+
importFrom(ggplot2,facet_wrap)
3169
importFrom(ggplot2,geom_abline)
70+
importFrom(ggplot2,geom_boxplot)
71+
importFrom(ggplot2,geom_errorbarh)
3272
importFrom(ggplot2,geom_histogram)
3373
importFrom(ggplot2,geom_point)
3474
importFrom(ggplot2,geom_segment)
75+
importFrom(ggplot2,geom_text)
76+
importFrom(ggplot2,geom_violin)
77+
importFrom(ggplot2,geom_vline)
3578
importFrom(ggplot2,ggplot)
79+
importFrom(ggplot2,ggsave)
3680
importFrom(ggplot2,ggtitle)
3781
importFrom(ggplot2,labs)
82+
importFrom(ggplot2,margin)
3883
importFrom(ggplot2,scale_color_manual)
84+
importFrom(ggplot2,scale_x_continuous)
85+
importFrom(ggplot2,scale_y_continuous)
3986
importFrom(ggplot2,theme)
4087
importFrom(ggplot2,theme_bw)
88+
importFrom(ggplot2,theme_classic)
89+
importFrom(ggplot2,unit)
4190
importFrom(ggvenn,ggvenn)
4291
importFrom(grDevices,dev.off)
4392
importFrom(grDevices,pdf)
93+
importFrom(grDevices,svg)
94+
importFrom(grid,gpar)
95+
importFrom(grid,grid.text)
96+
importFrom(grid,unit)
4497
importFrom(logger,log_info)
98+
importFrom(logger,log_warn)
4599
importFrom(rlang,.data)
100+
importFrom(stats,fisher.test)
101+
importFrom(stats,lm)
102+
importFrom(stats,median)
103+
importFrom(stats,p.adjust)
46104
importFrom(stats,qchisq)
105+
importFrom(stats,sd)
47106
importFrom(utils,write.table)
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# slope_matrix_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/slope_matrix_qval_0.01.tsv"
2+
# pval_nominal_matrix_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/pval_nominal_matrix_qval_0.01.tsv"
3+
# pval_nominal_threshold_matrix_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/pval_nominal_threshold_matrix_qval_0.01.tsv"
4+
# egene_union_pairs_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/egene_union_pairs_qval_0.01.tsv"
5+
# region_cell_type_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/results/region_cell_type.tsv"
6+
# output_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/cell_type_pairwise_r_squared.tsv"
7+
# bican.mccarroll.eqtl::get_cell_type_pairwise_cor_matrix(slope_matrix_path, pval_nominal_matrix_path, pval_nominal_threshold_matrix_path, egene_union_pairs_path, region_cell_type_path, output_path)
8+
9+
10+
#' Compute pairwise Spearman correlation matrix of eQTL effect sizes across cell types
11+
#'
12+
#' For each pair of cell type / region groups, identifies eGene-variant pairs
13+
#' that are nominally significant in at least one of the two groups (using
14+
#' \code{pval_nominal < pval_nominal_threshold}), then computes the Spearman
15+
#' correlation of slopes for those pairs. Returns the R-squared matrix.
16+
#'
17+
#' @param slope_matrix_path Character scalar. Path to the slope matrix TSV
18+
#' (output of \code{\link{get_slope_matrix}}).
19+
#' @param pval_nominal_matrix_path Character scalar. Path to the pval_nominal matrix TSV
20+
#' (output of \code{\link{get_pval_nominal_matrix}}).
21+
#' @param pval_nominal_threshold_matrix_path Character scalar. Path to the pval_nominal_threshold
22+
#' matrix TSV (output of \code{\link{get_pval_nominal_threshold_matrix}}).
23+
#' @param egene_union_pairs_path Character scalar. Path to the eGene union pairs TSV
24+
#' (output of \code{\link{get_egene_union_pairs}}).
25+
#' @param region_cell_type_path Character scalar. Path to a tab-delimited file
26+
#' with columns \code{cell_type} and \code{region}.
27+
#' @param output_path Character scalar or \code{NULL}. If non-NULL, the
28+
#' R-squared matrix is written to this path as a tab-delimited file.
29+
#'
30+
#' @return A matrix of pairwise Spearman R-squared values (cell types x cell types).
31+
#'
32+
#' @export
33+
#' @importFrom data.table fread fwrite
34+
#' @importFrom logger log_info
35+
get_cell_type_pairwise_cor_matrix <- function(slope_matrix_path,
36+
pval_nominal_matrix_path,
37+
pval_nominal_threshold_matrix_path,
38+
egene_union_pairs_path,
39+
region_cell_type_path,
40+
output_path = NULL) {
41+
42+
slope_dt <- data.table::fread(slope_matrix_path)
43+
pval_dt <- data.table::fread(pval_nominal_matrix_path)
44+
pval_threshold_dt <- data.table::fread(pval_nominal_threshold_matrix_path)
45+
egene_dt <- data.table::fread(egene_union_pairs_path, select = c("phenotype_id", "variant_id"))
46+
47+
# Filter to eGene union pairs
48+
slope_dt <- merge(slope_dt, egene_dt, by = c("phenotype_id", "variant_id"))
49+
pval_dt <- merge(pval_dt, egene_dt, by = c("phenotype_id", "variant_id"))
50+
pval_threshold_dt <- merge(pval_threshold_dt, egene_dt, by = c("phenotype_id", "variant_id"))
51+
52+
# Filter to cell type/region columns
53+
region_cell_type_dt <- data.table::fread(region_cell_type_path)
54+
ct_cols <- paste0(region_cell_type_dt$cell_type, "__", region_cell_type_dt$region)
55+
ct_cols <- intersect(ct_cols, names(slope_dt))
56+
ct_cols <- intersect(ct_cols, names(pval_dt))
57+
ct_cols <- intersect(ct_cols, names(pval_threshold_dt))
58+
59+
slope_m <- as.matrix(slope_dt[, ct_cols, with = FALSE])
60+
pval_m <- as.matrix(pval_dt[, ct_cols, with = FALSE])
61+
pval_thresh_m <- as.matrix(pval_threshold_dt[, ct_cols, with = FALSE])
62+
63+
n <- length(ct_cols)
64+
logger::log_info("Computing pairwise Spearman correlations for {n} cell type/regions")
65+
66+
cor_matrix <- matrix(NA_real_, nrow = n, ncol = n,
67+
dimnames = list(ct_cols, ct_cols))
68+
69+
for (i in seq_len(n)) {
70+
for (j in seq_len(n)) {
71+
slope1 <- slope_m[, i]
72+
slope2 <- slope_m[, j]
73+
pval1 <- pval_m[, i]
74+
pval2 <- pval_m[, j]
75+
thresh1 <- pval_thresh_m[, i]
76+
thresh2 <- pval_thresh_m[, j]
77+
78+
sig_idx <- which(
79+
(!is.na(thresh1) & !is.na(pval1) & pval1 < thresh1) |
80+
(!is.na(thresh2) & !is.na(pval2) & pval2 < thresh2)
81+
)
82+
83+
if (length(sig_idx) == 0) {
84+
cor_matrix[i, j] <- 0
85+
next
86+
}
87+
88+
cor_matrix[i, j] <- stats::cor(
89+
slope1[sig_idx], slope2[sig_idx],
90+
use = "pairwise.complete.obs", method = "spearman"
91+
)
92+
}
93+
}
94+
95+
r_squared <- cor_matrix^2
96+
97+
logger::log_info("Correlation matrix complete")
98+
99+
if (!is.null(output_path)) {
100+
out_dt <- data.table::as.data.table(r_squared, keep.rownames = "cell_type")
101+
data.table::fwrite(out_dt, output_path, sep = "\t")
102+
logger::log_info("Written to: {output_path}")
103+
}
104+
105+
return(r_squared)
106+
}
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
# eqtl_dir <- "/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/results/LEVEL_3"
2+
# region_cell_type_path <- "/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/manuscript_data/region_cell_type.tsv"
3+
# output_path <- "/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/manuscript_data/combined_tpm_expression_across_cell_types.tsv"
4+
# bican.mccarroll.eqtl::combine_expression_across_cell_types(eqtl_dir, region_cell_type_path, output_path)
5+
6+
7+
#' Combine gene expression TPM across cell types into a single matrix
8+
#'
9+
#' For each cell type / region group, reads the per-sample gene expression
10+
#' TPM BED file, appends the cell type name to sample column headers, and
11+
#' joins all cell types into one wide matrix keyed by gene ID (\code{pid}).
12+
#'
13+
#' @param eqtl_dir Character scalar. Path to the eQTL results directory
14+
#' (e.g. \code{.../LEVEL_3}). Each cell type subdirectory should contain
15+
#' a file named \code{<ct>__<reg>.gene_expression_tpm.bed.gz}.
16+
#' @param region_cell_type_path Character scalar. Path to a tab-delimited
17+
#' file with columns \code{cell_type} and \code{region}.
18+
#' @param output_path Character scalar or \code{NULL}. If non-NULL, the
19+
#' combined matrix is written to this path as a tab-delimited file.
20+
#'
21+
#' @return A \code{data.table} with column \code{pid} (gene ID) and one
22+
#' column per sample per cell type, named \code{<sample>_<ct>__<reg>}.
23+
#'
24+
#' @export
25+
#' @importFrom data.table fread fwrite
26+
#' @importFrom logger log_info
27+
combine_expression_across_cell_types <- function(eqtl_dir,
28+
region_cell_type_path,
29+
output_path = NULL) {
30+
31+
region_cell_type_dt <- data.table::fread(region_cell_type_path)
32+
ct_keys <- paste0(region_cell_type_dt$cell_type, "__", region_cell_type_dt$region)
33+
logger::log_info("Combining expression for {length(ct_keys)} cell type/region groups")
34+
35+
read_and_format <- function(ct_key) {
36+
bed_path <- file.path(eqtl_dir, ct_key,
37+
paste0(ct_key, ".gene_expression_tpm.bed.gz"))
38+
logger::log_info("Reading: {ct_key}")
39+
dt <- data.table::fread(bed_path)
40+
41+
# Remove chr, start, end columns (first 3)
42+
dt[, c(1, 2, 3) := NULL]
43+
44+
# Rename sample columns: append _<ct_key> (skip first col which is pid)
45+
old_names <- names(dt)[-1]
46+
new_names <- paste0(old_names, "_", ct_key)
47+
data.table::setnames(dt, old_names, new_names)
48+
49+
return(dt)
50+
}
51+
52+
expression_list <- lapply(ct_keys, read_and_format)
53+
54+
# Merge all cell types by pid (full outer join)
55+
combined_dt <- expression_list[[1]]
56+
for (i in seq(2, length(expression_list))) {
57+
combined_dt <- merge(combined_dt, expression_list[[i]],
58+
by = "pid", all = TRUE)
59+
}
60+
61+
logger::log_info("Combined: {nrow(combined_dt)} genes x {ncol(combined_dt) - 1} sample columns")
62+
63+
if (!is.null(output_path)) {
64+
data.table::fwrite(combined_dt, output_path, sep = "\t")
65+
logger::log_info("Written to: {output_path}")
66+
}
67+
68+
return(combined_dt)
69+
}
Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,74 @@
1+
# eqtl_dir="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/results/LEVEL_3"
2+
# region_cell_type_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/results/region_cell_type.tsv"
3+
# qval_threshold=0.05
4+
# output_path="/broad/bican_um1_mccarroll/RNAseq/analysis/CAP_freeze_3_analysis/eqtls/script_output/LEVEL_3/egene_union_pairs_qval_0.05.tsv"
5+
# bican.mccarroll.eqtl::get_egene_union_pairs(eqtl_dir, region_cell_type_path, qval_threshold, output_path)
6+
7+
8+
#' Get union of eGene-variant pairs across cell types and regions
9+
#'
10+
#' For each cell type / region combination in the region-cell-type table,
11+
#' reads the tensorQTL cis-eQTL index file, filters to eGenes at
12+
#' \code{qval < qval_threshold}, and returns the union of unique
13+
#' (phenotype_id, variant_id) pairs across all groups.
14+
#'
15+
#' Each subdirectory under \code{eqtl_dir} is expected to follow the naming
16+
#' convention \code{<cell_type>__<region>/}, containing a file named
17+
#' \code{<cell_type>__<region>.cis_qtl.txt.gz}.
18+
#'
19+
#' @param eqtl_dir Character scalar. Base directory containing per-cell-type
20+
#' eQTL result subdirectories.
21+
#' @param region_cell_type_path Character scalar. Path to a tab-delimited file
22+
#' with columns \code{cell_type} and \code{region}.
23+
#' @param qval_threshold Numeric scalar. q-value threshold for calling an
24+
#' eGene significant. Default \code{0.05}.
25+
#' @param output_path Character scalar or \code{NULL}. If non-NULL, the result
26+
#' table is written to this path as a tab-delimited file.
27+
#'
28+
#' @return A \code{data.table} with columns \code{phenotype_id}, \code{variant_id},
29+
#' and \code{qval}, containing one row per unique eGene-variant pair.
30+
#'
31+
#' @export
32+
#' @importFrom data.table fread fwrite rbindlist
33+
#' @importFrom logger log_info
34+
get_egene_union_pairs <- function(eqtl_dir,
35+
region_cell_type_path,
36+
qval_threshold = 0.05,
37+
output_path = NULL) {
38+
39+
region_cell_type_dt <- data.table::fread(region_cell_type_path)
40+
41+
results <- vector("list", nrow(region_cell_type_dt))
42+
43+
for (i in seq_len(nrow(region_cell_type_dt))) {
44+
cell_type <- region_cell_type_dt$cell_type[i]
45+
region <- region_cell_type_dt$region[i]
46+
subdir <- paste0(cell_type, "__", region)
47+
filename <- paste0(subdir, ".cis_qtl.txt.gz")
48+
eqtl_file <- file.path(eqtl_dir, subdir, filename)
49+
50+
logger::log_info("Reading eQTL index: {eqtl_file}")
51+
52+
dt <- data.table::fread(eqtl_file, select = c("phenotype_id", "variant_id", "qval"))
53+
dt <- dt[dt$qval < qval_threshold, ]
54+
55+
logger::log_info(" {cell_type} / {region}: {nrow(dt)} eGenes at qval < {qval_threshold}")
56+
results[[i]] <- dt
57+
}
58+
59+
combined <- data.table::rbindlist(results)
60+
61+
# Deduplicate to unique (phenotype_id, variant_id) pairs, keeping lowest qval
62+
data.table::setorder(combined, phenotype_id, variant_id, qval)
63+
result <- combined[!duplicated(combined, by = c("phenotype_id", "variant_id")),
64+
.(phenotype_id, variant_id, qval)]
65+
66+
logger::log_info("Total unique eGene-variant pairs: {nrow(result)}")
67+
68+
if (!is.null(output_path)) {
69+
data.table::fwrite(result, output_path, sep = "\t")
70+
logger::log_info("Written to: {output_path}")
71+
}
72+
73+
return(result)
74+
}

0 commit comments

Comments
 (0)