@@ -39,8 +39,8 @@ matrix — covering BigWig signal summaries, GC content, PhyloP conservation sco
3939distribution moments, and ATAC specificity rankings — that can be used for
4040downstream statistical analysis, ranking, or machine learning applications. By
4141separating deterministic feature generation from user defined ranking logic,
42- PyPeakRankR promotes transparency, modularity, and reproducibility in regulatory
43- element analysis .
42+ PyPeakRankR promotes transparency, modularity, and reproducibility. It runs
43+ in minutes on thousands of peaks, making it a practical first-pass before downstream modelling .
4444
4545# Statement of need
4646
@@ -50,10 +50,9 @@ higher-order signal statistics. These features are typically computed using
5050custom per project scripts that vary across laboratories, complicating
5151benchmarking and cross-study integration.
5252
53- The target audience is computational biologists and genomics researchers who
54- work with scATAC-seq or bulk ATAC-seq data and need to systematically prioritize
55- peaks for experimental follow-up — particularly for enhancer discovery, AAV tool
56- design, or regulatory element ranking across cell types or species.
53+ The target audience is computational biologists who work with scATAC-seq or
54+ bulk ATAC-seq data and need to systematically prioritize peaks for experimental
55+ follow-up, particularly for enhancer discovery or AAV tool design.
5756
5857Existing tools address related but distinct problems: peak callers such as
5958MACS2 [ @Zhang2008 ] identify open chromatin regions but rank peaks only by
@@ -77,10 +76,9 @@ but does not produce portable, tool agnostic feature tables. `pyfaidx`
7776[ @Shirley2015 ] enables FASTA sequence access but provides no genomics feature
7877pipeline.
7978
80- PyPeakRankR fills this gap by composing these libraries — ` pyBigWig ` for
81- signal extraction, ` pyfaidx ` for sequence access, ` scipy ` [ @Virtanen2020 ]
82- for distribution metrics — into a composable CLI pipeline that assembles
83- heterogeneous features into a single reproducible TSV table.
79+ PyPeakRankR fills this gap by composing ` pyBigWig ` , ` pyfaidx ` , and
80+ ` scipy ` [ @Virtanen2020 ] into a CLI pipeline that assembles heterogeneous
81+ features into a single reproducible TSV table.
8482
8583# Software design
8684
@@ -89,8 +87,9 @@ PyPeakRankR is built around three core design decisions:
8987** 1. Table-first, composable pipeline.** Every subcommand reads an existing TSV
9088and appends columns without modifying the peak coordinates. This means a user
9189can run any subset of steps, add custom columns from other tools, and the table
92- remains valid throughout. The alternative . A monolithic script requires users to rerun everything when adding
93- a new feature type. The table-first design enables incremental extension.
90+ remains valid throughout. A monolithic script that computes all features at once requires users to rerun
91+ everything when adding a new feature type. The table-first design enables
92+ incremental extension.
9493
9594** 2. Separation of feature extraction from ranking.** Feature extraction is
9695deterministic: given the same inputs and the same BigWig files, the same table
@@ -105,27 +104,26 @@ public Python function (`init_table`, `add_signal`, `add_gc`, `add_phylop`,
105104in shell pipelines and in Python notebooks or workflows, without reimplementing
106105logic.
107106
108- The specificity ranking formula is: for each peak, compute the ratio of the
109- target group's signal to the mean signal across all background groups, then
110- min-max normalise to [ 0, 1] across all peaks. This definition matches the CERP
111- pipeline used in [ @Wirthlin2026 ] and is consistent with the ATAC-specificity
112- metric validated in the BICCN challenge [ @Johansen2025 ] .
113-
114- The table-first design is directly extensible: future columns could include
115- sequence model importance scores (e.g., from Borzoi or Enformer) or spatially
116- resolved specificity scores from MERFISH or other spatial transcriptomics
117- data, integrating epigenomic and spatial context in one reproducible matrix.
118-
119- Figure 1 illustrates the features collected by PyPeakRankR for each candidate
120- peak.
121-
122- ![ Features collected by PyPeakRankR. GC content is lower in active enhancers
123- than in promoters or bulk DNA, reflecting differences in nucleosome occupancy.
124- PhyloP scores [ @Pollard2010 ] quantify cross-species constraint. ATAC specificity
125- captures differential accessibility between cell groups. Kurtosis, skewness,
126- and bimodality (Sarle's coefficient) describe signal shape — inspired by Lu et
127- al. [ @Lu2015 ] , who showed these moments distinguish enhancers from promoters
128- in ChIP-seq data. Figure adapted from Wirthlin et al.
107+ The specificity ranking formula computes the ratio of target group signal to
108+ mean background signal, then min-max normalises to [ 0, 1] . This matches the
109+ CERP pipeline [ @Wirthlin2026 ] and the ATAC-specificity metric validated in
110+ the BICCN challenge [ @Johansen2025 ] .
111+
112+ Each feature has a distinct biological rationale. GC content is lower in active
113+ enhancers than in promoters or bulk genomic DNA, reflecting differences in
114+ nucleosome occupancy. PhyloP conservation [ @Pollard2010 ] identifies peaks under
115+ cross-species purifying selection. Signal distribution moments — kurtosis
116+ (sharpness), skewness (asymmetry), and bimodality (Sarle's coefficient) — are
117+ motivated by Lu et al. [ @Lu2015 ] , who showed these shape features distinguish
118+ enhancers from promoters in ChIP-seq data more reliably than signal intensity
119+ alone. The table-first design is directly extensible: future columns could
120+ include sequence model importance scores (e.g., from Borzoi or Enformer) or
121+ spatially resolved scores from MERFISH, integrating epigenomic and spatial
122+ context in one reproducible matrix.
123+
124+ ![ Features collected by PyPeakRankR for each candidate peak: GC content,
125+ PhyloP conservation, ATAC specificity, and signal distribution moments
126+ (kurtosis, skewness, bimodality). Figure adapted from Wirthlin et al.
129127(2026) [ @Wirthlin2026 ] .] ( figure6_panelA.png )
130128
131129# Ranking vs. MACS2 fold change
@@ -141,22 +139,23 @@ liver (ENCFF160VHY, ENCSR802GEV), heart left ventricle
141139(ENCFF455AFI, ENCSR117PYB), and lung (ENCFF210HIS, ENCSR647AOY).
142140Signal was extracted using PyPeakRankR's ` add-signal ` command.
143141Specificity scores (colon / mean across tissues, normalised [ 0, 1] )
144- diverge substantially from MACS2 fold change ranks. P6 (chrX,
145- FC=14.7, rank #7 ) achieves the highest specificity (1.000) with colon
146- mean=320 versus liver=7.6 and heart=7.1. P4 (chr2, FC=16.6, rank #3 )
147- scores lowest (0.000) with signal spread across all tissues. P1
148- (chr4) is ambiguous: high colon signal but substantial heart
149- signal (mean=83.8). Table 1 lists all ten peaks.
150-
151- ![ ATAC-seq signal tracks for four peaks across colonic mucosa (orange),
152- liver (purple), heart left ventricle (red), and lung (green). Signal
153- is the MACS2 p-value BigWig extracted by PyPeakRankR from four ENCODE
154- GRCh38 tracks. Mean signal per tissue is shown in the coloured badge
155- (right). Dashed lines mark MACS2 summits. The four peaks illustrate
156- distinct accessibility patterns: P6 (colon-specific, Spec=1.00), P1
157- (ambiguous — colon and heart both active, Spec=0.11), P9
158- (colon and lung, Spec=0.13), and P4 (broadly accessible across all
159- tissues, Spec=0.00, MACS2 rank #3 ).] ( browser_tracks_comparison.png )
142+ diverge substantially from MACS2 fold change ranks (Figure 2, Table 1).
143+ The four peaks shown span the full spectrum: P6 (rank #7 by FC) is the
144+ most colon-specific (Spec=1.00); P1 (rank #10 by FC, lowest in the set)
145+ is ambiguous with both colon and heart active; P9 shows a colon-plus-lung
146+ pattern; and P4 (rank #3 by FC) is the broadest, with signal spread
147+ across all four tissues and the lowest specificity score (0.000).
148+
149+ ![ ATAC-seq signal tracks across four human tissues for peaks spanning the
150+ full specificity spectrum. Mean MACS2 p-value signal per tissue is shown
151+ in the coloured badge (right); dashed lines mark MACS2 summits. P6
152+ (Spec=1.00) is colon-specific with liver and heart near background. P1
153+ (Spec=0.11) is ambiguous — both colon (mean=445) and heart (mean=84)
154+ are active despite P1 having the lowest MACS2 fold change in the set.
155+ P9 (Spec=0.13) shows a distinct colon-plus-lung pattern (lung mean=70).
156+ P4 (Spec=0.00, MACS2 rank #3 ) is broadly accessible across all four
157+ tissues, illustrating that high fold change does not imply tissue
158+ specificity.] ( browser_tracks_comparison.png )
160159
161160
162161| Peak | Coordinates (GRCh38) | FC | FC rank | Colon | Liver | Heart | Lung | Spec | Spec rank |
@@ -199,17 +198,16 @@ outperformed conventional fold-change approaches, and the resulting
199198enhancer-AAV tools achieved >70% on-target specificity across cell types,
200199with exemplary enhancers exceeding 90%.
201200
202- These applications demonstrate that PyPeakRankR's feature extraction produces
203- rankings with direct experimental utility across species and brain regions.
201+ These results establish direct experimental utility.
204202
205203# Implementation
206204
207205PyPeakRankR is implemented in Python (>=3.9) with the following dependencies:
208206` pandas ` [ @Reback2020 ] for tabular data handling, ` numpy ` [ @Harris2020 ] for
209207numerical computation, ` pyBigWig ` [ @Ramirez2020pyBigWig ] for BigWig signal extraction,
210208` pyfaidx ` [ @Shirley2015 ] for FASTA sequence access, and ` scipy ` [ @Virtanen2020 ]
211- for statistical distribution metrics. The package is installable via pip from GitHub, includes a ` pypeakranker `
212- CLI, unit tests, and example data (` tests/test.bed ` ). Source code :
209+ for statistical distribution metrics. Installable via pip from GitHub; includes a ` pypeakranker ` CLI, unit tests,
210+ and example data (` tests/test.bed ` ). Source:
213211< https://github.com/AllenInstitute/PeakRankR/tree/python-package > (MIT).
214212
215213# AI usage disclosure
0 commit comments