Skip to content

Commit 6abf7df

Browse files
committed
Add speed note, better ambiguous peaks in figure, feature intuition in body text
1 parent 8b3a6cc commit 6abf7df

File tree

2 files changed

+51
-53
lines changed

2 files changed

+51
-53
lines changed
32.1 KB
Loading

paper/paper.md

Lines changed: 51 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -39,8 +39,8 @@ matrix — covering BigWig signal summaries, GC content, PhyloP conservation sco
3939
distribution moments, and ATAC specificity rankings — that can be used for
4040
downstream statistical analysis, ranking, or machine learning applications. By
4141
separating deterministic feature generation from user defined ranking logic,
42-
PyPeakRankR promotes transparency, modularity, and reproducibility in regulatory
43-
element analysis.
42+
PyPeakRankR promotes transparency, modularity, and reproducibility. It runs
43+
in minutes on thousands of peaks, making it a practical first-pass before downstream modelling.
4444

4545
# Statement of need
4646

@@ -50,10 +50,9 @@ higher-order signal statistics. These features are typically computed using
5050
custom per project scripts that vary across laboratories, complicating
5151
benchmarking and cross-study integration.
5252

53-
The target audience is computational biologists and genomics researchers who
54-
work with scATAC-seq or bulk ATAC-seq data and need to systematically prioritize
55-
peaks for experimental follow-up — particularly for enhancer discovery, AAV tool
56-
design, or regulatory element ranking across cell types or species.
53+
The target audience is computational biologists who work with scATAC-seq or
54+
bulk ATAC-seq data and need to systematically prioritize peaks for experimental
55+
follow-up, particularly for enhancer discovery or AAV tool design.
5756

5857
Existing tools address related but distinct problems: peak callers such as
5958
MACS2 [@Zhang2008] identify open chromatin regions but rank peaks only by
@@ -77,10 +76,9 @@ but does not produce portable, tool agnostic feature tables. `pyfaidx`
7776
[@Shirley2015] enables FASTA sequence access but provides no genomics feature
7877
pipeline.
7978

80-
PyPeakRankR fills this gap by composing these libraries — `pyBigWig` for
81-
signal extraction, `pyfaidx` for sequence access, `scipy` [@Virtanen2020]
82-
for distribution metrics — into a composable CLI pipeline that assembles
83-
heterogeneous features into a single reproducible TSV table.
79+
PyPeakRankR fills this gap by composing `pyBigWig`, `pyfaidx`, and
80+
`scipy` [@Virtanen2020] into a CLI pipeline that assembles heterogeneous
81+
features into a single reproducible TSV table.
8482

8583
# Software design
8684

@@ -89,8 +87,9 @@ PyPeakRankR is built around three core design decisions:
8987
**1. Table-first, composable pipeline.** Every subcommand reads an existing TSV
9088
and appends columns without modifying the peak coordinates. This means a user
9189
can run any subset of steps, add custom columns from other tools, and the table
92-
remains valid throughout. The alternative . A monolithic script requires users to rerun everything when adding
93-
a new feature type. The table-first design enables incremental extension.
90+
remains valid throughout. A monolithic script that computes all features at once requires users to rerun
91+
everything when adding a new feature type. The table-first design enables
92+
incremental extension.
9493

9594
**2. Separation of feature extraction from ranking.** Feature extraction is
9695
deterministic: given the same inputs and the same BigWig files, the same table
@@ -105,27 +104,26 @@ public Python function (`init_table`, `add_signal`, `add_gc`, `add_phylop`,
105104
in shell pipelines and in Python notebooks or workflows, without reimplementing
106105
logic.
107106

108-
The specificity ranking formula is: for each peak, compute the ratio of the
109-
target group's signal to the mean signal across all background groups, then
110-
min-max normalise to [0, 1] across all peaks. This definition matches the CERP
111-
pipeline used in [@Wirthlin2026] and is consistent with the ATAC-specificity
112-
metric validated in the BICCN challenge [@Johansen2025].
113-
114-
The table-first design is directly extensible: future columns could include
115-
sequence model importance scores (e.g., from Borzoi or Enformer) or spatially
116-
resolved specificity scores from MERFISH or other spatial transcriptomics
117-
data, integrating epigenomic and spatial context in one reproducible matrix.
118-
119-
Figure 1 illustrates the features collected by PyPeakRankR for each candidate
120-
peak.
121-
122-
![Features collected by PyPeakRankR. GC content is lower in active enhancers
123-
than in promoters or bulk DNA, reflecting differences in nucleosome occupancy.
124-
PhyloP scores [@Pollard2010] quantify cross-species constraint. ATAC specificity
125-
captures differential accessibility between cell groups. Kurtosis, skewness,
126-
and bimodality (Sarle's coefficient) describe signal shape — inspired by Lu et
127-
al. [@Lu2015], who showed these moments distinguish enhancers from promoters
128-
in ChIP-seq data. Figure adapted from Wirthlin et al.
107+
The specificity ranking formula computes the ratio of target group signal to
108+
mean background signal, then min-max normalises to [0, 1]. This matches the
109+
CERP pipeline [@Wirthlin2026] and the ATAC-specificity metric validated in
110+
the BICCN challenge [@Johansen2025].
111+
112+
Each feature has a distinct biological rationale. GC content is lower in active
113+
enhancers than in promoters or bulk genomic DNA, reflecting differences in
114+
nucleosome occupancy. PhyloP conservation [@Pollard2010] identifies peaks under
115+
cross-species purifying selection. Signal distribution moments — kurtosis
116+
(sharpness), skewness (asymmetry), and bimodality (Sarle's coefficient) — are
117+
motivated by Lu et al. [@Lu2015], who showed these shape features distinguish
118+
enhancers from promoters in ChIP-seq data more reliably than signal intensity
119+
alone. The table-first design is directly extensible: future columns could
120+
include sequence model importance scores (e.g., from Borzoi or Enformer) or
121+
spatially resolved scores from MERFISH, integrating epigenomic and spatial
122+
context in one reproducible matrix.
123+
124+
![Features collected by PyPeakRankR for each candidate peak: GC content,
125+
PhyloP conservation, ATAC specificity, and signal distribution moments
126+
(kurtosis, skewness, bimodality). Figure adapted from Wirthlin et al.
129127
(2026) [@Wirthlin2026].](figure6_panelA.png)
130128

131129
# Ranking vs. MACS2 fold change
@@ -141,22 +139,23 @@ liver (ENCFF160VHY, ENCSR802GEV), heart left ventricle
141139
(ENCFF455AFI, ENCSR117PYB), and lung (ENCFF210HIS, ENCSR647AOY).
142140
Signal was extracted using PyPeakRankR's `add-signal` command.
143141
Specificity scores (colon / mean across tissues, normalised [0, 1])
144-
diverge substantially from MACS2 fold change ranks. P6 (chrX,
145-
FC=14.7, rank #7) achieves the highest specificity (1.000) with colon
146-
mean=320 versus liver=7.6 and heart=7.1. P4 (chr2, FC=16.6, rank #3)
147-
scores lowest (0.000) with signal spread across all tissues. P1
148-
(chr4) is ambiguous: high colon signal but substantial heart
149-
signal (mean=83.8). Table 1 lists all ten peaks.
150-
151-
![ATAC-seq signal tracks for four peaks across colonic mucosa (orange),
152-
liver (purple), heart left ventricle (red), and lung (green). Signal
153-
is the MACS2 p-value BigWig extracted by PyPeakRankR from four ENCODE
154-
GRCh38 tracks. Mean signal per tissue is shown in the coloured badge
155-
(right). Dashed lines mark MACS2 summits. The four peaks illustrate
156-
distinct accessibility patterns: P6 (colon-specific, Spec=1.00), P1
157-
(ambiguous — colon and heart both active, Spec=0.11), P9
158-
(colon and lung, Spec=0.13), and P4 (broadly accessible across all
159-
tissues, Spec=0.00, MACS2 rank #3).](browser_tracks_comparison.png)
142+
diverge substantially from MACS2 fold change ranks (Figure 2, Table 1).
143+
The four peaks shown span the full spectrum: P6 (rank #7 by FC) is the
144+
most colon-specific (Spec=1.00); P1 (rank #10 by FC, lowest in the set)
145+
is ambiguous with both colon and heart active; P9 shows a colon-plus-lung
146+
pattern; and P4 (rank #3 by FC) is the broadest, with signal spread
147+
across all four tissues and the lowest specificity score (0.000).
148+
149+
![ATAC-seq signal tracks across four human tissues for peaks spanning the
150+
full specificity spectrum. Mean MACS2 p-value signal per tissue is shown
151+
in the coloured badge (right); dashed lines mark MACS2 summits. P6
152+
(Spec=1.00) is colon-specific with liver and heart near background. P1
153+
(Spec=0.11) is ambiguous — both colon (mean=445) and heart (mean=84)
154+
are active despite P1 having the lowest MACS2 fold change in the set.
155+
P9 (Spec=0.13) shows a distinct colon-plus-lung pattern (lung mean=70).
156+
P4 (Spec=0.00, MACS2 rank #3) is broadly accessible across all four
157+
tissues, illustrating that high fold change does not imply tissue
158+
specificity.](browser_tracks_comparison.png)
160159

161160

162161
| Peak | Coordinates (GRCh38) | FC | FC rank | Colon | Liver | Heart | Lung | Spec | Spec rank |
@@ -199,17 +198,16 @@ outperformed conventional fold-change approaches, and the resulting
199198
enhancer-AAV tools achieved >70% on-target specificity across cell types,
200199
with exemplary enhancers exceeding 90%.
201200

202-
These applications demonstrate that PyPeakRankR's feature extraction produces
203-
rankings with direct experimental utility across species and brain regions.
201+
These results establish direct experimental utility.
204202

205203
# Implementation
206204

207205
PyPeakRankR is implemented in Python (>=3.9) with the following dependencies:
208206
`pandas` [@Reback2020] for tabular data handling, `numpy` [@Harris2020] for
209207
numerical computation, `pyBigWig` [@Ramirez2020pyBigWig] for BigWig signal extraction,
210208
`pyfaidx` [@Shirley2015] for FASTA sequence access, and `scipy` [@Virtanen2020]
211-
for statistical distribution metrics. The package is installable via pip from GitHub, includes a `pypeakranker`
212-
CLI, unit tests, and example data (`tests/test.bed`). Source code:
209+
for statistical distribution metrics. Installable via pip from GitHub; includes a `pypeakranker` CLI, unit tests,
210+
and example data (`tests/test.bed`). Source:
213211
<https://github.com/AllenInstitute/PeakRankR/tree/python-package> (MIT).
214212

215213
# AI usage disclosure

0 commit comments

Comments
 (0)