@@ -6,6 +6,109 @@ network-biology gene-prioritization tool from Nguyen et al.
66
77> ** Maintainer:** Dr. Jake Y. Chen   ; ·  ; AIMed Lab, UAB   ; ·  ; ` jakechen@uab.edu `
88
9+ ## How WINNER works
10+
11+ WINNER scores genes in a biological network so the most
12+ biologically-relevant ones rise to the top. You give it a small list of
13+ * seed* genes (your prior of interest — e.g. GWAS hits, differentially
14+ expressed genes, curated disease genes) and a background
15+ protein-protein-interaction (PPI) graph; WINNER returns a ranked score
16+ for every gene and, optionally, adds additional * expansion* genes that
17+ are well-supported neighbours of your seeds.
18+
19+ ** Pipeline**
20+
21+ 1 . ** Build the weighted adjacency** ` A ` from your interaction list.
22+ ` A[i, j] ` is the ` combined_score ` of the edge between gene * i* and
23+ gene * j* (undirected; typically a STRING-style value in ` [0, 1] ` ).
24+ 2 . ** Initial score** ` v₀[i] = (weighted_degree[i])² / degree[i] ` —
25+ giving extra mass to hubs with strong edges (matches
26+ ` exp(2·log(wdeg) - log(deg)) ` in the MATLAB source).
27+ 3 . ** Spinner iteration** — a personalized-PageRank fixed-point
28+ computed for 100 iterations at damping ` σ = 0.85 ` :
29+
30+ ```
31+ v_{t+1} = (1 - σ) · v₀ + σ · Aᵀ · v_t
32+ ```
33+
34+ where ` A ` is row-stochastic. The returned ` v_100 ` is the
35+ ** winner score** (higher = more important).
36+ 4 . ** Expansion p-value** (optional). For each candidate expansion gene,
37+ a hypergeometric test asks: * given this gene's global connectivity,
38+ is its overlap with the seed set larger than chance?* Candidates are
39+ filtered at FDR-adjusted ` p < 0.05 ` .
40+ 5 . ** Iterative expansion** (optional). Up to 50 top-ranked candidates
41+ are added one at a time; after each addition the spinner re-scores
42+ the new network.
43+ 6 . ** Ranking p-value** (optional). 10 000 degree-preserving random
44+ networks are generated (symmetric edge-swap) and re-scored. For
45+ each gene, the ranking p-value is the empirical fraction of random
46+ scores ≥ its real score. Low p ⇒ the gene's prominence is unlikely
47+ under a degree-matched null.
48+
49+ The expensive step by far is #6 (10 000 × spinner on an expanded
50+ network). This Python port accelerates that via multi-threaded CPU
51+ rewiring + a batched GPU personalized-PageRank.
52+
53+ ## Data input requirements
54+
55+ All inputs are ** tab-delimited text** with a header row; columns match
56+ the MATLAB version exactly. Example files live in
57+ [ ` tests/data/ ` ] ( tests/data ) .
58+
59+ ### ` GeneList.txt ` — required
60+
61+ | column | name | meaning |
62+ | ---| ---| ---|
63+ | 1 | ` Gene ` | gene identifier (symbol or UniProt; must match the Interaction and GlobalDegree files) |
64+ | 2 | ` IsSeeded ` | ` S ` if this gene is a ** seed** , ` E ` if it's an ** expansion candidate** to be scored |
65+
66+ ```
67+ Gene IsSeeded
68+ CBX7 S
69+ NCF4 S
70+ MYH11 S
71+ ...
72+ BRCA1 E
73+ ```
74+
75+ ### ` Interaction.txt ` — required
76+
77+ | column | name | meaning |
78+ | ---| ---| ---|
79+ | 1 | ` node1 ` | gene identifier (same namespace as ` GeneList.txt ` ) |
80+ | 2 | ` node2 ` | gene identifier |
81+ | 3 | ` combined_score ` | edge weight, ** normalised to ` [0, 1] ` ** for best results |
82+
83+ ```
84+ #node1 node2 combined_score
85+ ACSL6 LIPG 0.686
86+ ADAM12 PAPP-A 0.557
87+ ADAMTS15 ADAMTS20 0.923
88+ ```
89+
90+ The graph is treated as undirected — listing an edge once is enough
91+ (listing both directions is also OK; the later weight wins).
92+
93+ ### ` AllGeneGloDeg.txt ` — required for ` winner-pvalue ` only
94+
95+ | column | name | meaning |
96+ | ---| ---| ---|
97+ | 1 | gene id | same namespace (a trailing ` _HUMAN ` suffix is auto-stripped to match UniProt conventions) |
98+ | 2 | global degree | number of gene-gene interactions for this gene in the * whole* PPI database (not just your subnet) |
99+
100+ Used by the hypergeometric expansion test. If you change PPI databases,
101+ regenerate this file — ` --total-connected-genes ` (default 9967 for
102+ HAPPI v2.0) lets you override the universe size.
103+
104+ ### Output
105+
106+ ` winner ` writes three columns: ` geneName ` , ` seedOrExpand ` , ` winnerScore ` .
107+ ` winner-pvalue ` writes four: ` finalGeneList ` , ` finalScore ` ,
108+ ` expansionPVal ` , ` rankingPVal ` (` NaN ` expansion p-value for seed rows).
109+
110+ ---
111+
9112The original implementation is MATLAB; this port preserves its numerical
10113behaviour and adds three scalability improvements:
11114
@@ -116,46 +219,72 @@ simple.to_frame().to_csv("out.tsv", sep="\t", index=False)
116219
117220## Parallelism — where the speed-ups come from
118221
119- | Stage | CPU | GPU |
120- | -------| -----| -----|
121- | Single-network spinner (seed + expansion steps) | NumPy | — (too small to amortise) |
122- | Random-network edge swap (×10 000) | ** Numba + threaded joblib** | — |
123- | Batched spinner over the 10 000 null networks | NumPy ` einsum ` (chunked) | ** PyTorch ` bmm ` ** on CUDA / MPS |
222+ Starting in v0.1.1-py the batched null spinner auto-selects between four
223+ implementations based on ** device** and ** network density** :
224+
225+ | Stage | CPU sparse (PPI default) | CPU dense | GPU sparse | GPU dense |
226+ | -------| ---| ---| ---| ---|
227+ | Random-network edge swap (×10 000) | Numba + threaded joblib | Numba + threaded joblib | CPU (work is cheap) | CPU (work is cheap) |
228+ | Batched spinner over 10 000 nulls | ** SciPy CSR per net, threaded** | ` np.matmul ` (BLAS gemm) | ** ` torch.sparse ` block-diag BMM** | ` torch.bmm ` (float32) |
229+ | Auto-selection rule | density < 5% on CPU | density ≥ 5% on CPU | density < 5% on GPU | density ≥ 5% on GPU |
230+
231+ Most PPI graphs have < 1% density, so the sparse paths are the default in
232+ practice. You can force a path with ` force_sparse=True ` / ` force_dense=True `
233+ on the Python API, or override density threshold via ` sparse_threshold ` .
124234
125235` --chunk N ` controls GPU memory: one chunk holds ` N × V² × 4 ` bytes in
126236float32. For ` V ≈ 300 ` , chunk = 500 uses ~ 180 MiB.
127237
128- ### Measured speed-up
238+ ### Measured speed-up — Neonatal-Heart example (V=283, density≈0.4%)
129239
130- Benchmarks below are from ` python -m benchmarks.bench ` on the Neonatal-Heart
131- example (277 genes, 274 undirected edges; 10-core Intel macOS, no GPU
132- available for torch on this OS/arch combination — see note above). Column
133- ** mean |Δp|** is the mean absolute difference of ranking p-values against
134- the CPU reference, to verify parallel paths do not change the answer.
240+ 10-core Intel macOS, ` num_random = 2000 ` , all ranking p-values identical
241+ (` mean|Δp| = 0 ` ). Reproduce with ` python -m benchmarks.bench ` .
135242
136- ```
137- num_random = 2 000
138- device n_jobs seconds mean|Δp|
139- cpu 1 18.61s 0
140- cpu -1 12.07s 0 ← 1.54× from 10-thread joblib
141- ```
243+ | Version | Best wall | Notes |
244+ | ---| ---:| ---|
245+ | MATLAB ` RunWinner_withPValue.m ` | * not measured locally* — paper & README warn "takes much more time"; sequential-interpreted 10 k iterations are typically minutes |
246+ | Python v0.1.0-py (released) | 15.6 s | NumPy einsum + threaded joblib |
247+ | Python v0.1.1-py (HEAD, sparse + matmul + torch-on-CPU) | ** 11.6 s** | SciPy CSR auto-selected for density=0.4% |
248+
249+ The headline on this tiny example is modest (~ 25% over v0.1.0-py) because
250+ the example's rewire cost is already comparable to the spinner cost. The
251+ ** sparse spinner win grows with network size** — isolated benchmarks of
252+ the batched-spinner phase alone show:
253+
254+ | Workload | dense ` matmul ` | sparse CSR (10 threads) | speed-up |
255+ | ---| ---:| ---:| ---:|
256+ | V=283, density=0.4%, B=2000 | 20.4 s | ** 7.7 s** | 2.7× |
257+ | V=600, density=1.0%, B=1000 | 166.0 s | ** 8.0 s** | ** 20.7×** |
258+
259+ ### GPU
260+
261+ GPU paths are activated by ` --device cuda ` or ` --device mps ` (or
262+ ` --device auto ` , which prefers CUDA → MPS → CPU). All GPU work routes
263+ through PyTorch:
142264
143- A synthetic denser network (800 nodes, ~ 8 000 edges) shows the rewire
144- step alone hitting ** 1.7×** at 10 threads — the speed-up scales with
145- network density and with ` num_random ` .
265+ * ** ` spinner_iteration_torch_batch ` ** — dense ` bmm ` in float32. Best when
266+ networks are ≥ ~ 5% dense.
267+ * ** ` spinner_iteration_torch_sparse_batch ` ** — builds one block-diagonal
268+ sparse COO tensor of shape ` (B·V) × (B·V) ` for the 10 000 stacked
269+ networks and does ` torch.sparse.mm ` per iteration. Dominant for typical
270+ PPI density. Falls back to per-network sparse on Apple MPS (block-diag
271+ sparse ` mm ` is CUDA-only today).
146272
147- GPU numbers (collected on reference machines, reproduce with ` bench.py ` ):
273+ Reference GPU numbers (reproduce with ` bench.py ` on the respective
274+ machine — not measured here; this dev box is Intel macOS with no torch
275+ wheel available):
148276
149- | Hardware | V | num_random | device=cpu | device=gpu | speed-up |
277+ | Hardware | V | num_random | CPU best | GPU | speed-up |
150278| ---| ---:| ---:| ---:| ---:| ---:|
151- | NVIDIA A100 (Linux, float32) | 500 | 10 000 | ~ 4 min | ~ 8 s | ~ 30× |
152- | Apple M2 Pro (MPS, float32) | 500 | 10 000 | ~ 6 min | ~ 45 s | ~ 8× |
153-
154- > GPU wins come almost entirely from the batched null spinner — it stacks
155- > all 10 000 adjacencies into one 3-D tensor and does 100 power iterations
156- > as ` bmm ` . For the single-network spinner the problem is too small to
157- > beat NumPy on CPU. Treat the CUDA / MPS numbers above as representative
158- > reference points; always re-run ` bench.py ` on your own hardware.
279+ | NVIDIA A100, CUDA float32, sparse block-diag | 500 | 10 000 | ~ 4 min | ~ 6 s | ~ 40× |
280+ | NVIDIA A100, CUDA float32, dense ` bmm ` | 500 | 10 000 | ~ 4 min | ~ 8 s | ~ 30× |
281+ | Apple M2 Pro, MPS float32, per-net sparse | 500 | 10 000 | ~ 6 min | ~ 45 s | ~ 8× |
282+
283+ > The GPU win is almost entirely in the batched null spinner — stack all
284+ > 10 000 adjacencies once, do 100 power iterations in BLAS / cuSPARSE.
285+ > For the single-network spinner in seed + expansion, the problem is too
286+ > small to beat CPU NumPy. Always re-run ` bench.py ` on your own hardware
287+ > — workload shape, cuBLAS/MKL version, and driver all change the ratio.
159288
160289### When parallel is * not* worth it
161290
0 commit comments