@@ -79,13 +79,10 @@ but does not produce portable, tool-agnostic feature tables. `pyfaidx`
 [@Shirley2015] enables FASTA sequence access but provides no genomics feature
 pipeline.

-PyPeakRankR fills the gap between these low-level libraries and higher-level
-analysis frameworks. Rather than competing with or duplicating existing tools,
-it composes them: `pyBigWig` for signal extraction, `pyfaidx` for sequence
-access, `scipy` [@Virtanen2020] for distribution metrics. The key design
-contribution is the composable CLI pipeline that assembles heterogeneous features
-into a single, reproducible TSV table that any downstream tool or statistical
-model can consume.
+PyPeakRankR fills this gap by composing these libraries (`pyBigWig` for
+signal extraction, `pyfaidx` for sequence access, `scipy` [@Virtanen2020]
+for distribution metrics) into a CLI pipeline that assembles heterogeneous
+features into a single, reproducible TSV table.
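The table-first idea can be sketched in plain Python. This is an illustrative sketch only: the column names, signal values, and the `signal_mean` feature below are hypothetical, not PyPeakRankR's actual schema.

```python
import csv
import io

# Illustrative sketch of the table-first pipeline: each step appends feature
# columns to one shared table keyed by peak coordinates, and the result is
# serialised as TSV so any downstream tool can consume it.
peaks = [
    {"chrom": "chr4", "start": 49149084, "end": 49149919},
    {"chrom": "chr18", "start": 12947298, "end": 12948641},
]

# A "signal" step would add a per-peak signal column (e.g. extracted from a
# BigWig track); here the values are stubbed for illustration.
for peak, signal in zip(peaks, [444.8, 327.3]):
    peak["signal_mean"] = signal

# Serialise the assembled feature table to TSV.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["chrom", "start", "end", "signal_mean"], delimiter="\t"
)
writer.writeheader()
writer.writerows(peaks)
tsv = buf.getvalue()
print(tsv)
```

In the real pipeline, each CLI subcommand would play the role of one such step, reading and re-emitting the shared TSV.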

 # Software design

@@ -132,33 +129,56 @@ the per-base signal profile within the peak. Figure adapted from Wirthlin et al.

 # Ranking vs. MACS2 fold-change

-A common approach to peak prioritization is to sort peaks by MACS2
-[@Zhang2008] fold-change or significance score. While this ranks peaks by
-signal strength, it does not capture cell-type specificity. A peak with high
-fold-change may be broadly accessible across many cell types — a housekeeping
-element — and therefore a poor candidate for cell-type targeted experiments.
-
-Figure 2 illustrates this distinction using ten MACS2 peaks from a real
-ATAC-seq experiment. When ranked by fold-change (left), peak P1 (chr4) is
-placed last because it has the lowest fold-change (11.7) of the set. When
-ranked by PyPeakRankR specificity score (right), P1 is ranked first because
-its signal is concentrated in the target cell type relative to background.
-Conversely, peaks P3 (chr1) and P10 (chr10) rank at the top by fold-change
-but near the bottom by specificity, consistent with broad chromatin
-accessibility across cell types. This divergence illustrates why fold-change
-alone is insufficient for selecting cell-type specific regulatory elements.
-
-![Comparison of MACS2 fold-change ranking (left) versus PyPeakRankR
-specificity ranking (right) for ten MACS2 narrowPeak calls from a real
-ATAC-seq experiment (test.bed). Peaks with the highest MACS2 fold-change
-are not necessarily the most cell-type specific. P1 (chr4, green border)
-ranks last by fold-change (FC = 11.7) but first by specificity. P3 (chr1)
-and P10 (chr10) rank first and second by fold-change (FC = 17.0 and 17.4)
-but near the bottom by specificity, consistent with broad chromatin
-accessibility across cell types. Specificity scores are the ratio of target
-to mean background ATAC signal, min-max normalised to [0, 1]; rank 1 is
-the peak most exclusively active in the target
-group.](browser_tracks_comparison.png)
+Sorting peaks by MACS2 [@Zhang2008] fold-change ranks them by signal
+strength but not by cell-type specificity: a broadly accessible peak is a
+poor candidate for cell-type targeted experiments.
+
+Figure 2 illustrates this using ten MACS2 narrowPeak calls (Table 1)
+scored against four ENCODE human tissue ATAC-seq BigWig tracks
+(GRCh38): colonic mucosa [@ENCODE2020] (ENCFF557AZH, ENCSR970UNF),
+liver (ENCFF160VHY, ENCSR802GEV), heart left ventricle
+(ENCFF455AFI, ENCSR117PYB), and lung (ENCFF210HIS, ENCSR647AOY).
+Signal was extracted using PyPeakRankR's `add-signal` command.
+Specificity scores (colon / mean across tissues, normalised to [0, 1])
+diverge substantially from MACS2 fold-change ranks. P6 (chrX,
+FC=14.7, rank #7) achieves the highest specificity (1.000), with colon
+mean=320 versus liver=7.6 and heart=7.1. P4 (chr2, FC=16.6, rank #3)
+scores lowest (0.000), with signal spread across all tissues. P1
+(chr4) is ambiguous: high colon signal but substantial heart
+signal (mean=83.8). Table 1 lists all ten peaks.
+
+![ATAC-seq signal tracks across four human tissues (colonic mucosa, liver,
+heart left ventricle, lung) for selected peaks from Table 1. Left panel:
+top peaks by MACS2 fold-change, showing broadly accessible signal across
+all tissues. Right panel: top peaks by PyPeakRankR specificity score,
+showing concentrated signal in the target tissue. Signal values are MACS2
+p-value BigWig tracks from ENCODE (GRCh38); dashed lines mark MACS2
+summits. Specificity scores are the ratio of target to mean background
+signal across all four tissues, min-max normalised to
+[0, 1].](browser_tracks_comparison.png)
+
+| Peak | Coordinates (GRCh38) | FC | FC rank | Colon | Liver | Heart | Lung | Spec | Spec rank |
+|------|----------------------|----|---------|-------|-------|-------|------|------|-----------|
+| P1 | chr4:49,149,084–49,149,919 | 11.66 | 10 | 444.8 | 4.4 | 83.8 | 49.4 | 0.105 | 8 |
+| P2 | chr18:12,947,298–12,948,641 | 15.87 | 5 | 327.3 | 14.8 | 19.5 | 55.2 | 0.383 | 6 |
+| P3 | chr1:244,451,043–244,452,405 | 17.03 | 2 | 339.2 | 9.7 | 21.4 | 50.3 | 0.650 | 2 |
+| P4 | chr2:178,450,461–178,452,049 | 16.59 | 3 | 242.1 | 18.2 | 16.4 | 43.8 | 0.000 | 10 |
+| P5 | chr5:119,267,962–119,269,594 | 16.40 | 4 | 212.0 | 14.3 | 18.3 | 35.7 | 0.015 | 9 |
+| P6 | chrX:1,391,993–1,393,130 | 14.75 | 7 | 319.8 | 7.6 | 7.1 | 49.0 | 1.000 | 1 |
+| P7 | chr2:197,498,901–197,500,709 | 13.91 | 9 | 241.8 | 13.2 | 8.7 | 39.8 | 0.526 | 5 |
+| P8 | chr8:144,826,667–144,827,949 | 15.51 | 6 | 296.6 | 6.1 | 18.6 | 47.3 | 0.628 | 3 |
+| P9 | chr10:132,536,759–132,537,934 | 14.17 | 8 | 357.5 | 11.6 | 27.6 | 70.4 | 0.127 | 7 |
+| P10 | chr10:69,123,341–69,124,698 | 17.35 | 1 | 248.7 | 10.0 | 17.9 | 35.3 | 0.533 | 4 |
+
+: Ten MACS2 narrowPeak calls used in Figure 2 (GRCh38, colonic mucosa
+ENCSR970UNF [@ENCODE2020]). Specificity scores are min-max normalised to
+0–1 within this set; FC and specificity ranks frequently diverge,
+confirming that signal strength alone does not predict tissue-specific
+accessibility. Specificity was computed using `pypeakranker
+rank-specificity` with four ENCODE tissue BigWig tracks. {#tbl:peaks}
+
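The specificity scores in Table 1 can be recomputed from its signal columns. A minimal sketch, assuming (per the figure caption) that the raw score is the colon mean divided by the mean across all four tissues, then min-max normalised; because the table values are rounded, the recomputed scores match the published ones only to within about 0.01.

```python
# Recompute Table 1 specificity scores: target (colon) signal divided by the
# mean across all four tissues, then min-max normalised to [0, 1].
# Signal means are copied from Table 1 (rounded to one decimal place).
signals = {  # peak: (colon, liver, heart, lung)
    "P1": (444.8, 4.4, 83.8, 49.4),
    "P2": (327.3, 14.8, 19.5, 55.2),
    "P3": (339.2, 9.7, 21.4, 50.3),
    "P4": (242.1, 18.2, 16.4, 43.8),
    "P5": (212.0, 14.3, 18.3, 35.7),
    "P6": (319.8, 7.6, 7.1, 49.0),
    "P7": (241.8, 13.2, 8.7, 39.8),
    "P8": (296.6, 6.1, 18.6, 47.3),
    "P9": (357.5, 11.6, 27.6, 70.4),
    "P10": (248.7, 10.0, 17.9, 35.3),
}

# Raw score: colon signal over the mean of all four tissue signals.
raw = {p: v[0] / (sum(v) / len(v)) for p, v in signals.items()}

# Min-max normalise so the least specific peak scores 0 and the most scores 1.
lo, hi = min(raw.values()), max(raw.values())
spec = {p: (r - lo) / (hi - lo) for p, r in raw.items()}

ranking = sorted(spec, key=spec.get, reverse=True)
print(ranking[0], round(spec["P1"], 3))  # → P6 0.106
```

This reproduces the extremes reported in the text: P6 is the most specific peak and P4 the least.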

 # Research impact statement

@@ -180,37 +200,29 @@ enhancer-AAV tools achieved >70% on-target specificity across cell types,
 with exemplary enhancers exceeding 90%.

 These applications demonstrate that PyPeakRankR's feature extraction produces
-rankings with direct experimental utility, spanning multiple species and brain
-regions. The software is openly available and documented for community reuse.
+rankings with direct experimental utility across species and brain regions.

 # Implementation

 PyPeakRankR is implemented in Python (>=3.9) with the following dependencies:
 `pandas` [@Reback2020] for tabular data handling, `numpy` [@Harris2020] for
 numerical computation, `pyBigWig` [@Ramirez2020pyBigWig] for BigWig signal extraction,
 `pyfaidx` [@Shirley2015] for FASTA sequence access, and `scipy` [@Virtanen2020]
-for statistical distribution metrics. The package is installable via pip from
-GitHub, provides a `pypeakranker` CLI entry point, and includes unit tests
-covering all core functions. Source code is available at
-<https://github.com/AllenInstitute/PeakRankR/tree/python-package> under the
-MIT license.
+for statistical distribution metrics. The package is installable via pip
+from GitHub and includes a `pypeakranker` CLI, unit tests, and example data
+(`tests/test.bed`). Source code:
+<https://github.com/AllenInstitute/PeakRankR/tree/python-package> (MIT).

 # AI usage disclosure

-Generative AI tools (Claude, Anthropic) were used to assist with: code
-scaffolding and refactoring of module structure, drafting sections of this
-paper and the README, and test scaffolding. All AI-assisted outputs were
-reviewed, edited, and validated by the authors. All core design decisions —
-the table-first pipeline architecture, the composable CLI structure, the
-specificity ranking formula, and the feature set — were made by the human
-authors. The authors take full responsibility for the accuracy and content of
-all submitted materials.
+Generative AI tools (Claude, Anthropic) assisted with code scaffolding,
+paper drafting, and test scaffolding. All outputs were reviewed and validated
+by the authors, who take full responsibility for all submitted materials.
+All core design decisions were made by the human authors.

 # Acknowledgements

-Development was supported by the Allen Institute for Brain Science and informed
-by regulatory genomics workflows developed in the Human Cell Types program.
-The authors thank the Allen Institute bioinformatics and enhancer AAV teams for
-feedback on feature definitions and pipeline design.
+Development was supported by the Allen Institute for Brain Science.
+The authors thank the bioinformatics and enhancer AAV teams for feedback
+on feature definitions and pipeline design.

 # References