Skip to content

Commit a0a888f

Browse files
committed
Final JOSS submission: real 4-tissue ENCODE figure, Table 1 with signal data, test.bed, logo, all bib/spelling/URL fixes
1 parent f53044c commit a0a888f

File tree

5 files changed

+100
-53
lines changed

5 files changed

+100
-53
lines changed

README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
# PyPeakRankR
22

3+
<p align="center">
4+
<img src="logo.png" width="180" alt="PyPeakRankR logo"/>
5+
</p>
6+
37
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
48
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
59

logo.png

1.03 MB
Loading
218 KB
Loading

paper/paper.bib

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -149,3 +149,34 @@ @misc{Ramirez2020pyBigWig
149149
year = {2020},
150150
url = {https://github.com/deeptools/pyBigWig}
151151
}
152+
153+
@article{ENCODE2020,
154+
author = {{ENCODE Project Consortium} and others},
155+
title = {Perspectives on {ENCODE}},
156+
journal = {Nature},
157+
year = {2020},
158+
volume = {583},
159+
pages = {693--698},
160+
doi = {10.1038/s41586-020-2449-8}
161+
}
162+
163+
@misc{ENCSR802GEV,
164+
author = {{ENCODE Project Consortium}},
165+
title = {{ATAC-seq on human liver (ENCSR802GEV)}},
166+
year = {2021},
167+
url = {https://www.encodeproject.org/experiments/ENCSR802GEV/}
168+
}
169+
170+
@misc{ENCSR117PYB,
171+
author = {{ENCODE Project Consortium}},
172+
title = {{ATAC-seq on human heart left ventricle (ENCSR117PYB)}},
173+
year = {2017},
174+
url = {https://www.encodeproject.org/experiments/ENCSR117PYB/}
175+
}
176+
177+
@misc{ENCSR647AOY,
178+
author = {{ENCODE Project Consortium}},
179+
title = {{ATAC-seq on human lung (ENCSR647AOY)}},
180+
year = {2021},
181+
url = {https://www.encodeproject.org/experiments/ENCSR647AOY/}
182+
}

paper/paper.md

Lines changed: 65 additions & 53 deletions
Original file line numberDiff line numberDiff line change
@@ -79,13 +79,10 @@ but does not produce portable, tool-agnostic feature tables. `pyfaidx`
7979
[@Shirley2015] enables FASTA sequence access but provides no genomics feature
8080
pipeline.
8181

82-
PyPeakRankR fills the gap between these low-level libraries and higher-level
83-
analysis frameworks. Rather than competing with or duplicating existing tools,
84-
it composes them: `pyBigWig` for signal extraction, `pyfaidx` for sequence
85-
access, `scipy` [@Virtanen2020] for distribution metrics. The key design
86-
contribution is the composable CLI pipeline that assembles heterogeneous features
87-
into a single, reproducible TSV table that any downstream tool or statistical
88-
model can consume.
82+
PyPeakRankR fills this gap by composing these libraries — `pyBigWig` for
83+
signal extraction, `pyfaidx` for sequence access, `scipy` [@Virtanen2020]
84+
for distribution metrics — into a composable CLI pipeline that assembles
85+
heterogeneous features into a single reproducible TSV table.
8986

9087
# Software design
9188

@@ -132,33 +129,56 @@ the per-base signal profile within the peak. Figure adapted from Wirthlin et al.
132129

133130
# Ranking vs. MACS2 fold-change
134131

135-
A common approach to peak prioritization is to sort peaks by MACS2
136-
[@Zhang2008] fold-change or significance score. While this ranks peaks by
137-
signal strength, it does not capture cell-type specificity. A peak with high
138-
fold-change may be broadly accessible across many cell types — a housekeeping
139-
element — and therefore a poor candidate for cell-type targeted experiments.
140-
141-
Figure 2 illustrates this distinction using ten MACS2 peaks from a real
142-
ATAC-seq experiment. When ranked by fold-change (left), peak P1 (chr4) is
143-
placed last because it has the lowest fold-change (11.7) of the set. When
144-
ranked by PyPeakRankR specificity score (right), P1 is ranked first because
145-
its signal is concentrated in the target cell type relative to background.
146-
Conversely, peaks P3 (chr1) and P10 (chr10) rank at the top by fold-change
147-
but near the bottom by specificity, consistent with broad chromatin
148-
accessibility across cell types. This divergence illustrates why fold-change
149-
alone is insufficient for selecting cell-type specific regulatory elements.
150-
151-
![Comparison of MACS2 fold-change ranking (left) versus PyPeakRankR
152-
specificity ranking (right) for ten MACS2 narrowPeak calls from a real
153-
ATAC-seq experiment (test.bed). Peaks with the highest MACS2 fold-change
154-
are not necessarily the most cell-type specific. P1 (chr4, green border)
155-
ranks last by fold-change (FC = 11.7) but first by specificity. P3 (chr1)
156-
and P10 (chr10) rank first and second by fold-change (FC = 17.0 and 17.4)
157-
but near the bottom by specificity, consistent with broad chromatin
158-
accessibility across cell types. Specificity scores are the ratio of target
159-
to mean background ATAC signal, min-max normalised to [0, 1]; rank 1 is
160-
the peak most exclusively active in the target
161-
group.](browser_tracks_comparison.png)
132+
Sorting peaks by MACS2 [@Zhang2008] fold-change ranks by signal strength
133+
but not specificity — a broadly accessible peak is a poor candidate for
134+
cell-type targeted experiments.
135+
136+
Figure 2 illustrates this using ten MACS2 narrowPeak calls (Table 1)
137+
scored against four ENCODE human tissue ATAC-seq BigWig tracks
138+
(GRCh38): colonic mucosa [@ENCODE2020] (ENCFF557AZH, ENCSR970UNF),
139+
liver (ENCFF160VHY, ENCSR802GEV), heart left ventricle
140+
(ENCFF455AFI, ENCSR117PYB), and lung (ENCFF210HIS, ENCSR647AOY).
141+
Signal was extracted using PyPeakRankR's `add-signal` command.
142+
Specificity scores (colon / mean across tissues, normalised [0, 1])
143+
diverge substantially from MACS2 fold-change ranks. P6 (chrX,
144+
FC=14.7, rank #7) achieves the highest specificity (1.000) with colon
145+
mean=320 versus liver=7.6 and heart=7.1. P4 (chr2, FC=16.6, rank #3)
146+
scores lowest (0.000) with signal spread across all tissues. P1
147+
(chr4) is ambiguous: high colon signal but substantial heart
148+
signal (mean=83.8). Table 1 lists all ten peaks.
149+
150+
![ATAC-seq signal tracks across four human tissues (colonic mucosa, liver,
151+
heart left ventricle, lung) for selected peaks from Table 1. Left panel:
152+
top peaks by MACS2 fold-change, showing broadly accessible signal across
153+
all tissues. Right panel: top peaks by PyPeakRankR specificity score,
154+
showing concentrated signal in the target tissue. Signal values are MACS2
155+
p-value BigWig tracks from ENCODE (GRCh38); dashed lines mark MACS2
156+
summits. Specificity scores are the ratio of target to mean background
157+
signal across all four tissues, min-max normalised to [0,
158+
1].](browser_tracks_comparison.png)
159+
160+
161+
| Peak | Coordinates (GRCh38) | FC | FC rank | Colon | Liver | Heart | Lung | Spec | Spec rank |
162+
|------|----------------------|----|---------|-------|-------|-------|------|------|-----------|
163+
| P1 | chr4:49,149,084–49,149,919 | 11.66 | 10 | 444.8 | 4.4 | 83.8 | 49.4 | 0.105 | 8 |
164+
| P2 | chr18:12,947,298–12,948,641 | 15.87 | 5 | 327.3 | 14.8 | 19.5 | 55.2 | 0.383 | 4 |
165+
| P3 | chr1:244,451,043–244,452,405 | 17.03 | 2 | 339.2 | 9.7 | 21.4 | 50.3 | 0.650 | 3 |
166+
| P4 | chr2:178,450,461–178,452,049 | 16.59 | 3 | 242.1 | 18.2 | 16.4 | 43.8 | 0.000 | 10 |
167+
| P5 | chr5:119,267,962–119,269,594 | 16.40 | 4 | 212.0 | 14.3 | 18.3 | 35.7 | 0.015 | 9 |
168+
| P6 | chrX:1,391,993–1,393,130 | 14.75 | 7 | 319.8 | 7.6 | 7.1 | 49.0 | 1.000 | 1 |
169+
| P7 | chr2:197,498,901–197,500,709 | 13.91 | 9 | 241.8 | 13.2 | 8.7 | 39.8 | 0.526 | 5 |
170+
| P8 | chr8:144,826,667–144,827,949 | 15.51 | 6 | 296.6 | 6.1 | 18.6 | 47.3 | 0.628 | 2 |
171+
| P9 | chr10:132,536,759–132,537,934| 14.17 | 8 | 357.5 | 11.6 | 27.6 | 70.4 | 0.127 | 7 |
172+
| P10 | chr10:69,123,341–69,124,698 | 17.35 | 1 | 248.7 | 10.0 | 17.9 | 35.3 | 0.533 | 6 |
173+
174+
: Ten MACS2 narrowPeak calls used in Figure 2 (GRCh38, colonic mucosa
175+
ENCSR970UNF [@ENCODE2020]). MACS2 fold-change (FC) and PyPeakRankR
176+
specificity score both range 0–1 within this set; ranks frequently
177+
diverge, confirming that signal strength alone does not predict
178+
tissue-specific accessibility. Specificity computed using `pypeakranker rank-specificity` with
179+
four ENCODE tissue BigWig tracks. {#tbl:peaks}
180+
181+
162182

163183
# Research impact statement
164184

@@ -180,37 +200,29 @@ enhancer-AAV tools achieved >70% on-target specificity across cell types,
180200
with exemplary enhancers exceeding 90%.
181201

182202
These applications demonstrate that PyPeakRankR's feature extraction produces
183-
rankings with direct experimental utility, spanning multiple species and brain
184-
regions. The software is openly available and documented for community reuse.
203+
rankings with direct experimental utility across species and brain regions.
185204

186205
# Implementation
187206

188207
PyPeakRankR is implemented in Python (>=3.9) with the following dependencies:
189208
`pandas` [@Reback2020] for tabular data handling, `numpy` [@Harris2020] for
190209
numerical computation, `pyBigWig` [@Ramirez2020pyBigWig] for BigWig signal extraction,
191210
`pyfaidx` [@Shirley2015] for FASTA sequence access, and `scipy` [@Virtanen2020]
192-
for statistical distribution metrics. The package is installable via pip from
193-
GitHub, provides a `pypeakranker` CLI entry point, and includes unit tests
194-
covering all core functions. Source code is available at
195-
<https://github.com/AllenInstitute/PeakRankR/tree/python-package> under the
196-
MIT license.
211+
for statistical distribution metrics. The package is installable via pip from GitHub, includes a `pypeakranker`
212+
CLI, unit tests, and example data (`tests/test.bed`). Source code:
213+
<https://github.com/AllenInstitute/PeakRankR/tree/python-package> (MIT).
197214

198215
# AI usage disclosure
199216

200-
Generative AI tools (Claude, Anthropic) were used to assist with: code
201-
scaffolding and refactoring of module structure, drafting sections of this
202-
paper and the README, and test scaffolding. All AI-assisted outputs were
203-
reviewed, edited, and validated by the authors. All core design decisions —
204-
the table-first pipeline architecture, the composable CLI structure, the
205-
specificity ranking formula, and the feature set — were made by the human
206-
authors. The authors take full responsibility for the accuracy and content of
207-
all submitted materials.
217+
Generative AI tools (Claude, Anthropic) assisted with code scaffolding,
218+
paper drafting, and test scaffolding. All outputs were reviewed and validated
219+
by the authors, who take full responsibility for all submitted materials.
220+
All core design decisions were made by the human authors.
208221

209222
# Acknowledgements
210223

211-
Development was supported by the Allen Institute for Brain Science and informed
212-
by regulatory genomics workflows developed in the Human Cell Types program.
213-
The authors thank the Allen Institute bioinformatics and enhancer AAV teams for
214-
feedback on feature definitions and pipeline design.
224+
Development was supported by the Allen Institute for Brain Science.
225+
The authors thank the bioinformatics and enhancer AAV teams for feedback
226+
on feature definitions and pipeline design.
215227

216228
# References

0 commit comments

Comments
 (0)