Example Walkthrough — porting scrublet to rs-scrublet

An end-to-end narrative of one port through the 6-step protocol, with concrete commands and intermediate outputs. scrublet is a doublet-detection package: it simulates synthetic doublets, builds a kNN graph over the combined data, and scores each real cell by its doublet-neighbour fraction. The canonical entry point returns a per-cell doublet score (float) and a doublet call (boolean) — so we have one ordinal output and one classification output.

This is illustrative — the numbers are representative, not from a shipped port.
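To make the scoring idea concrete, here is a minimal sketch under stated assumptions. It is not scrublet's implementation: it uses brute-force Euclidean kNN on a tiny dense 2-D matrix, skips normalization and PCA, and reports the raw doublet-neighbour fraction rather than scrublet's calibrated score. The function name `doublet_neighbour_fraction` and the toy data are hypothetical, and the simulated doublets are averages of two parent cells (a stand-in for summing raw counts before normalization).

```python
import numpy as np

def doublet_neighbour_fraction(real, simulated, k=5):
    """Score each real cell by the fraction of its k nearest neighbours
    (searched over real + simulated cells) that are simulated doublets."""
    combined = np.vstack([real, simulated])
    is_doublet = np.concatenate(
        [np.zeros(len(real), dtype=bool), np.ones(len(simulated), dtype=bool)]
    )
    scores = np.empty(len(real))
    for i, cell in enumerate(real):
        dist = np.linalg.norm(combined - cell, axis=1)
        dist[i] = np.inf  # a cell is not its own neighbour
        nearest = np.argsort(dist, kind="stable")[:k]
        scores[i] = is_doublet[nearest].mean()
    return scores

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(20, 2))  # toy "cell type A"
cluster_b = rng.normal(5.0, 0.1, size=(20, 2))  # toy "cell type B"
# One real cell sitting between the clusters, as an A+B doublet would.
real = np.vstack([cluster_a, cluster_b, [[2.5, 2.5]]])
# Simulated doublets: averages of one A cell and one B cell (toy stand-in).
simulated = (cluster_a[:10] + cluster_b[:10]) / 2.0

scores = doublet_neighbour_fraction(real, simulated, k=5)
print(scores[-1], scores[:40].max())
```

The in-between cell is surrounded by simulated doublets and scores high, while cells inside either cluster see only real neighbours and score low.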


Phase 0 — Decide the gate

scrublet.Scrublet(...).scrub_doublets() returns (doublet_scores, predicted_doublets).

  • doublet_scores — a continuous per-cell value → ordinal (Pearson ≥ 0.99).
  • predicted_doublets — a boolean per-cell label → classification (F1 ≥ 0.95).

There is an internal kNN step (deterministic) and a PCA step (embedding-class internally), but the user-facing outputs are the two above, so the manifest declares two gates.
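The two gates can be sketched in a few lines of NumPy. This is an illustrative version, not the engine's own metric code: the helper names `pearson` and `f1_score`, the toy vectors, and the 0.25 call threshold are all assumptions, and the F1 here treats "doublet" as the positive class.

```python
import numpy as np

def pearson(reference, candidate):
    """Pearson correlation between two per-cell score vectors."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    return float(np.corrcoef(reference, candidate)[0, 1])

def f1_score(reference, candidate):
    """F1 with True (doublet) as the positive class."""
    reference = np.asarray(reference, dtype=bool)
    candidate = np.asarray(candidate, dtype=bool)
    tp = np.sum(reference & candidate)
    fp = np.sum(~reference & candidate)
    fn = np.sum(reference & ~candidate)
    denom = 2 * tp + fp + fn
    # No positives on either side counts as perfect agreement.
    return float(2 * tp / denom) if denom else 1.0

# Toy reference vs candidate outputs (hypothetical values).
ref_scores = np.array([0.05, 0.10, 0.40, 0.80, 0.15])
cand_scores = ref_scores + np.array([1e-9, -1e-9, 0.0, 2e-9, 0.0])
ref_calls = ref_scores > 0.25
cand_calls = cand_scores > 0.25

gates = {
    "doublet_scores": pearson(ref_scores, cand_scores) >= 0.99,
    "predicted_doublets": f1_score(ref_calls, cand_calls) >= 0.95,
}
print(gates)
```

Both gates must pass for the port to clear parity; a high Pearson alone does not guarantee that calls near the threshold agree.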

scrublet uses an RNG to simulate doublets. We pin seed: 42 and will mirror NumPy's generator on the Rust side; if exact mirroring proves impractical, the scores degrade to a distributional comparison — but we try for element-wise first.
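Why the seed matters can be shown with a small sketch of seeded doublet simulation. It assumes the parent pairs are drawn from NumPy's legacy `RandomState` stream, which is the stream a Rust port would have to reproduce bit-for-bit; the function below is a simplified stand-in, not scrublet's own code, and its name and parameters are hypothetical.

```python
import numpy as np

def simulate_doublets(counts, sim_doublet_ratio=2.0, seed=42):
    """Build synthetic doublets by summing the counts of random cell pairs."""
    rng = np.random.RandomState(seed)  # legacy MT19937-backed generator
    n_obs = counts.shape[0]
    n_sim = int(n_obs * sim_doublet_ratio)
    # Each row holds the indices of the two parent cells of one doublet.
    pairs = rng.randint(0, n_obs, size=(n_sim, 2))
    return counts[pairs[:, 0]] + counts[pairs[:, 1]], pairs

counts = np.arange(12, dtype=np.int64).reshape(4, 3)  # 4 toy cells x 3 genes
sim_a, pairs_a = simulate_doublets(counts, seed=42)
sim_b, pairs_b = simulate_doublets(counts, seed=42)

# Same seed, same parent pairs, same synthetic doublets.
print(np.array_equal(pairs_a, pairs_b), sim_a.shape)
```

If the Rust side cannot reproduce the same sequence of parent pairs, every downstream score is computed against different synthetic neighbours, which is why the fallback is a distributional comparison rather than an element-wise one.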

data/manifest.yaml (committed, read-only after this):

package: rs-scrublet
upstream: { name: scrublet, version: "0.2.3", source: PyPI, url: https://github.com/swolock/scrublet }
algorithm_class: ordinal
parity_threshold: 0.99
fixture: { path: data/pbmc3k_counts.npy, source: "scanpy.datasets.pbmc3k()", expected_shape: [2700, 32738] }
reference_command: tests/py_reference_driver.py
seed: 42
outputs:
  - { name: doublet_scores, type: "1d vector", location_reference: "$.doublet_scores", location_candidate: "$.doublet_scores", metric: ordinal, threshold: 0.99 }
  - { name: predicted_doublets, type: "1d bool", location_reference: "$.predicted_doublets", location_candidate: "$.predicted_doublets", metric: classification, threshold: 0.95 }

Phase 0.5 — Discovery

$ python -m engine.discover_rust_deps --check scrublet
## Discovery — `scrublet`
**No existing omicverse Rust port found.** Safe to start a new `rs-scrublet` port.

$ python -m engine.discover_rust_deps --pyproject scrublet-ref/pyproject.toml --output DISCOVERY.md

Discovery output (abridged) → decisions:

| Python dep | rs- port | Rust crate to build on |
| --- | --- | --- |
| numpy | | ndarray (+ numpy crate for zero-copy) |
| scipy (sparse, PCA) | | ndarray-linalg (truncated SVD), sprs |
| scikit-learn (NearestNeighbors) | | linfa-nn / a kd-tree crate (kiddo) |
| annoy (approx kNN) | | exact kNN in-crate (annoy is approximate; we want parity) |

Commit DISCOVERY.md before any algorithmic code.

Phase 1 — Scaffold

Copy the layout from the rs-scrublet shape (or the classification seed). Create Cargo.toml (pyo3 + numpy + ndarray + ndarray-linalg + linfa-nn + rayon + rand), pyproject.toml (maturin backend), and an empty #[pymodule] in src/lib.rs. Confirm the toolchain:

$ conda activate $RUST_TEST_ENV
$ maturin develop --release
🦀 Built and installed rs_scrublet-0.1.0
$ python -c "import rs_scrublet; print(rs_scrublet.__version__)"
0.1.0

Phase 2 — Equivalence Agent

Translate in dependency order: pipeline_normalize → simulate_doublets → pca → nearest_neighbors → calculate_doublet_scores → scrub_doublets.

# after each function: rebuild + parity-diff
$ maturin develop --release
$ python -m engine.loop --port-dir . --phase equivalence
[ref] ... done in 4.10s
[cand] ... done in 0.31s
[PASS] rs-scrublet: doublet_scores=0.9994 OK; predicted_doublets=0.97 OK

A failure we hit on the way (illustrative): the kNN results diverged because the Rust kd-tree returned equidistant neighbours in a different tie-break order than sklearn. The doublet score (a fraction over a fixed-size neighbourhood) was essentially unaffected, but a few calls flipped near the threshold. Walking PARITY_TAXONOMY.md's suspicion list pointed at neighbour ordering; fixing the tie-break to match sklearn (sort by (distance, index)) recovered F1 = 1.00. The gate was never widened.
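The (distance, index) ordering can be sketched as a brute-force kNN with an explicit two-key sort. This is an illustrative NumPy version of the tie-break rule only, not the port's kd-tree code; the function name `knn_stable` and the toy points are hypothetical.

```python
import numpy as np

def knn_stable(points, queries, k):
    """Exact kNN whose ties are broken by ascending point index."""
    diff = queries[:, None, :] - points[None, :, :]
    dist = np.einsum("qpd,qpd->qp", diff, diff)  # squared Euclidean distance
    index = np.arange(points.shape[0])
    neighbours = np.empty((len(queries), k), dtype=np.int64)
    for q in range(len(queries)):
        # np.lexsort treats the LAST key as primary:
        # sort by distance first, then by index among equal distances.
        order = np.lexsort((index, dist[q]))
        neighbours[q] = order[:k]
    return neighbours

# Four points are exactly equidistant from the origin; the fifth is farther.
points = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [3.0, 0.0]])
queries = np.array([[0.0, 0.0]])
neighbours = knn_stable(points, queries, k=2)
print(neighbours)  # [[0 1]]
```

With four tied candidates and k=2, which two are kept is decided purely by the secondary key, so two implementations that disagree on it will disagree on neighbourhood membership.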

Baseline result: parity clears (Pearson 0.9994, F1 1.00) and the Rust baseline is already ~13× faster than scrublet (the per-cell scoring loop left the interpreter).

Phase 3 — Acceleration Agent

iter 0 (baseline)        : 0.31 s   Pearson 0.9994  F1 1.00   (13× vs python)
iter 1 §4.2 LTO+cgu=1    : 0.27 s   Pearson 0.9994  F1 1.00   ACCEPT  (E)
iter 2 §2.2 buffer reuse : 0.24 s   Pearson 0.9994  F1 1.00   ACCEPT  (E)
iter 3 §3.4 rayon per-cell scoring (outer map, inner serial)
                         : 0.07 s   Pearson 0.9994  F1 1.00   ACCEPT  (E)
iter 4 §3.2 rayon parallel reduction for the neighbour-fraction sum
                         : 0.06 s   Pearson 0.99938 F1 1.00   ACCEPT  (B)
         bound: ‖Δ‖ ≤ n·eps·max|x|, n=30 neighbours, max≈1 → ~7e-15; well within atol.
iter 5 f32 PCA accumulator
                         : —        REJECTED — reference is f64; no admissible bound.
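The iter-4 bound can be checked numerically with a small sketch. It assumes the bounded quantity is the neighbour fraction (the sum divided by n) with inputs in [0, 1), and it imitates a parallel reduction by summing four chunks separately before combining them; the helper names are hypothetical and this is not the port's Rust code.

```python
import numpy as np

def serial_sum(values):
    """Left-to-right float64 accumulation."""
    total = 0.0
    for v in values:
        total += float(v)
    return total

def chunked_sum(values, n_chunks):
    """Sum each chunk on its own, then combine the partials,
    as a parallel reduction would."""
    partials = [serial_sum(chunk) for chunk in np.array_split(values, n_chunks)]
    return serial_sum(partials)

n = 30                           # neighbours per cell
eps = np.finfo(np.float64).eps   # about 2.22e-16
max_abs = 1.0                    # inputs assumed to lie in [0, 1)
bound = n * eps * max_abs        # about 6.7e-15

rng = np.random.default_rng(42)
x = rng.random(n)
frac_serial = serial_sum(x) / n
frac_parallel = chunked_sum(x, n_chunks=4) / n
observed = abs(frac_serial - frac_parallel)

print(f"bound={bound:.2e} within_bound={observed <= bound}")
```

The reordering can change the last few bits of the fraction, but the difference stays below the stated bound, which is the kind of argument a (B)-class acceptance needs in MATH.md.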

Each attempt appends a YAML block to ITERATION_LOG.md. Render the plot:

$ python -m engine.plot_evolution --port-dir .
[plot] wrote examples/evolution.png

Plot 1 drops 0.31 → 0.06 s (≈68× vs the Python reference end-to-end); Plot 2 stays flat at ~0.9994 with a single annotated, bounded dip at iter 4.
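The speedup figures quoted in this walkthrough follow directly from the timings in the log above; a short sketch of the arithmetic (the dictionary keys are just labels):

```python
# Wall-clock timings in seconds, copied from the equivalence run and iteration log.
reference_s = 4.10
timings = {
    "iter 0 (baseline)": 0.31,
    "iter 1": 0.27,
    "iter 2": 0.24,
    "iter 3": 0.07,
    "iter 4": 0.06,
}

# Speedup of each iteration relative to the Python reference.
speedup_vs_python = {name: reference_s / t for name, t in timings.items()}
# Gain contributed by the acceleration phase alone.
acceleration_gain = timings["iter 0 (baseline)"] / timings["iter 4"]

for name, ratio in speedup_vs_python.items():
    print(f"{name}: {ratio:.1f}x vs python")
print(f"acceleration phase alone: {acceleration_gain:.1f}x")
```

The baseline translation accounts for roughly 13× and the acceleration phase for roughly another 5×, which multiply to the end-to-end figure of about 68×.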

Phase 4 — Validate + artefacts

$ maturin build --release && pip install target/wheels/rs_scrublet-*.whl   # fresh env
$ pytest -q                       # green
$ python -m engine.py_function_audit --py-source scrublet-ref --rust-crate src/ --output AUDIT.md
$ jupyter nbconvert --to notebook --execute --inplace examples/*.ipynb

Fill RECONSTRUCTION_REPORT.md (8 sections), MATH.md (the iter-4 reduction bound), and confirm all four notebooks are present and executed. None deferred.

Phase 5 — Release

$ gh repo create <org>/rs-scrublet --public
$ maturin publish                 # → PyPI
$ cargo publish                   # → crates.io

Tick examples/ROADMAP.md (scrublet ✅), add rs-scrublet to TEMPLATE.md as the classification seed, and you're done: a pip-installable, Rust-backed, parity-proven drop-in for scrublet, ~68× faster on PBMC3k.


What this walkthrough illustrates

  • The baseline translation already delivered most of the speedup — Rust acceleration was the last ~5×, not the first 13×.
  • The only parity dip came from a reordered reduction, was bounded in MATH.md, and stayed far inside the gate.
  • The one real bug (kNN tie-break order) was fixed by matching the reference, never by loosening the gate — the cardinal rule.