An end-to-end narrative of one port through the 6-step protocol, with concrete commands and intermediate outputs.
scrublet is a doublet-detection package: it simulates synthetic doublets, builds a kNN graph over the combined data, and scores each real cell by its doublet-neighbour fraction. The canonical entry point returns a per-cell doublet score (float) and a doublet call (boolean), so we have one ordinal output and one classification output.
This is illustrative — the numbers are representative, not from a shipped port.
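To make the scoring idea concrete, here is a minimal sketch of a doublet-neighbour fraction on toy 2-D points. The function name `doublet_neighbour_fraction`, the brute-force distance search, and the toy data are assumptions made for illustration; the sketch uses the raw fraction as described above and makes no attempt to reproduce any normalisation the upstream package applies.

```python
import numpy as np

def doublet_neighbour_fraction(real, simulated, k=2):
    """Fraction of each real cell's k nearest neighbours that are simulated doublets.

    Hypothetical helper: brute-force Euclidean kNN over real + simulated cells.
    """
    combined = np.vstack([real, simulated])
    is_doublet = np.r_[np.zeros(len(real), dtype=bool),
                       np.ones(len(simulated), dtype=bool)]
    scores = np.empty(len(real))
    for i, cell in enumerate(real):
        dist = np.linalg.norm(combined - cell, axis=1)
        dist[i] = np.inf  # a cell is not its own neighbour
        nearest = np.argsort(dist, kind="stable")[:k]
        scores[i] = is_doublet[nearest].mean()
    return scores

# Three real cells near the origin, one real cell sitting among the doublets.
real = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
simulated = np.array([[5.1, 5.0], [4.9, 5.0]])
scores = doublet_neighbour_fraction(real, simulated, k=2)
print(scores)  # [0. 0. 0. 1.]
```

The last cell scores 1.0 because both of its nearest neighbours are simulated doublets; the other three only see each other.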
scrublet.Scrublet(...).scrub_doublets() returns (doublet_scores, predicted_doublets).
- doublet_scores: a continuous per-cell value → ordinal (Pearson ≥ 0.99).
- predicted_doublets: a boolean per-cell label → classification (F1 ≥ 0.95).
There is an internal kNN step (deterministic) and a PCA step (embedding-class internally), but the user-facing outputs are the two above, so the manifest declares two gates.
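The two gates can be sketched as follows. This is not the engine's own implementation: `pearson`, `f1`, `GATES`, and `check_gates` are hypothetical names, and the toy reference and candidate values are made up. Only the metric choice and the thresholds (0.99 and 0.95) come from the text above.

```python
import numpy as np

def pearson(ref, cand):
    """Pearson correlation between two per-cell score vectors."""
    return float(np.corrcoef(np.asarray(ref, float), np.asarray(cand, float))[0, 1])

def f1(ref, cand):
    """F1 of the candidate's boolean calls against the reference calls."""
    ref, cand = np.asarray(ref, bool), np.asarray(cand, bool)
    tp = int(np.sum(ref & cand))
    fp = int(np.sum(~ref & cand))
    fn = int(np.sum(ref & ~cand))
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom

# One gate per user-facing output: (metric, threshold).
GATES = {
    "doublet_scores": (pearson, 0.99),
    "predicted_doublets": (f1, 0.95),
}

def check_gates(reference, candidate):
    """Return {output_name: (value, passed)} for every declared gate."""
    results = {}
    for name, (metric, threshold) in GATES.items():
        value = metric(reference[name], candidate[name])
        results[name] = (value, value >= threshold)
    return results

# Toy data, purely illustrative.
reference = {"doublet_scores": [0.10, 0.20, 0.80, 0.05],
             "predicted_doublets": [False, False, True, False]}
candidate = {"doublet_scores": [0.10, 0.21, 0.79, 0.05],
             "predicted_doublets": [False, False, True, False]}
results = check_gates(reference, candidate)
for name, (value, passed) in results.items():
    print(f"{name}: {value:.4f} {'OK' if passed else 'FAIL'}")
```

Keeping the thresholds in one table next to the metric functions mirrors the idea that a gate is declared once and never adjusted to make a failing port pass.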
scrublet uses an RNG to simulate doublets. We pin seed: 42 and will mirror NumPy's generator on the Rust side; if exact mirroring proves impractical, the scores degrade to a distributional comparison — but we try for element-wise first.
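As a rough sketch of why the seed matters, the snippet below simulates doublets by summing the counts of randomly paired cells under a pinned seed. The function `simulate_doublets_sketch` and the use of NumPy's legacy `RandomState` are assumptions for illustration; the Rust side would have to mirror whichever generator and draw order the reference actually uses.

```python
import numpy as np

def simulate_doublets_sketch(counts, n_sim, seed=42):
    """Sum the counts of randomly chosen cell pairs (hypothetical helper).

    Assumption: a seeded legacy RandomState draws the pair indices.
    """
    rng = np.random.RandomState(seed)
    pairs = rng.randint(0, counts.shape[0], size=(n_sim, 2))
    doublets = counts[pairs[:, 0]] + counts[pairs[:, 1]]
    return doublets, pairs

counts = np.arange(12).reshape(4, 3)  # 4 toy cells x 3 genes
first, first_pairs = simulate_doublets_sketch(counts, n_sim=5)
second, second_pairs = simulate_doublets_sketch(counts, n_sim=5)

# Same seed, same pairs, same doublets: the precondition for element-wise parity.
print(np.array_equal(first, second))
```

If the Rust port cannot reproduce the same index stream, the pairs differ and only a distributional comparison of the scores remains possible.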
data/manifest.yaml (committed, read-only after this):
package: rs-scrublet
upstream: { name: scrublet, version: "0.2.3", source: PyPI, url: https://github.com/swolock/scrublet }
algorithm_class: ordinal
parity_threshold: 0.99
fixture: { path: data/pbmc3k_counts.npy, source: "scanpy.datasets.pbmc3k()", expected_shape: [2700, 32738] }
reference_command: tests/py_reference_driver.py
seed: 42
outputs:
- { name: doublet_scores, type: "1d vector", location_reference: "$.doublet_scores", location_candidate: "$.doublet_scores", metric: ordinal, threshold: 0.99 }
- { name: predicted_doublets, type: "1d bool", location_reference: "$.predicted_doublets", location_candidate: "$.predicted_doublets", metric: classification, threshold: 0.95 }

$ python -m engine.discover_rust_deps --check scrublet
## Discovery — `scrublet`
**No existing omicverse Rust port found.** Safe to start a new `rs-scrublet` port.
$ python -m engine.discover_rust_deps --pyproject scrublet-ref/pyproject.toml --output DISCOVERY.md

Discovery output (abridged) → decisions:
| Python dep | rs- port | Rust crate to build on |
|---|---|---|
| numpy | — | ndarray (+ numpy crate for zero-copy) |
| scipy (sparse, PCA) | — | ndarray-linalg (truncated SVD), sprs |
| scikit-learn (NearestNeighbors) | — | linfa-nn / a kd-tree crate (kiddo) |
| annoy (approx kNN) | — | exact kNN in-crate (annoy is approximate; we want parity) |
Commit DISCOVERY.md before any algorithmic code.
Copy the layout from the rs-scrublet shape (or the classification seed). Create Cargo.toml (pyo3 + numpy + ndarray + ndarray-linalg + linfa-nn + rayon + rand), pyproject.toml (maturin backend), and an empty #[pymodule] in src/lib.rs. Confirm the toolchain:
$ conda activate $RUST_TEST_ENV
$ maturin develop --release
🦀 Built and installed rs_scrublet-0.1.0
$ python -c "import rs_scrublet; print(rs_scrublet.__version__)"
0.1.0

Translate in dependency order: pipeline_normalize → simulate_doublets → pca → nearest_neighbors → calculate_doublet_scores → scrub_doublets.
# after each function: rebuild + parity-diff
$ maturin develop --release
$ python -m engine.loop --port-dir . --phase equivalence
[ref] ... done in 4.10s
[cand] ... done in 0.31s
[PASS] rs-scrublet: doublet_scores=0.9994 OK; predicted_doublets=0.97 OK

A failure we hit on the way (illustrative): the kNN distances diverged because the Rust kd-tree returned neighbours in a different tie-break order than sklearn. The doublet score (a fraction over a fixed-size neighbourhood) was unaffected, but a few calls flipped near the threshold. Walking PARITY_TAXONOMY.md's suspicion list pointed at neighbour ordering; fixing the tie-break to match sklearn (sort by (distance, index)) recovered F1 = 1.00. The gate was never widened.
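The (distance, index) tie-break described above can be sketched with `np.lexsort`, whose last key is the primary sort key. The helper name `k_nearest` and the toy distances are assumptions for illustration; the second call shows how a different tie-break changes which neighbour survives the cut at k.

```python
import numpy as np

def k_nearest(dist, k):
    """Indices of the k smallest distances, ties broken by ascending index."""
    idx = np.arange(dist.shape[0])
    order = np.lexsort((idx, dist))  # primary key: dist, secondary key: idx
    return order[:k]

dist = np.array([0.5, 0.2, 0.5, 0.2, 0.9])  # two tied pairs

stable = k_nearest(dist, 3)
print(stable)  # [1 3 0]

# A different tie-break (descending index) keeps neighbour 2 instead of 0.
other = np.lexsort((-np.arange(dist.shape[0]), dist))[:3]
print(other)  # [3 1 2]
```

Both orderings are valid nearest-neighbour answers, but only one matches the reference element for element, which is what the parity gate measures.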
Baseline result: parity clears (Pearson 0.9994, F1 1.00) and the Rust baseline is already ~13× faster than scrublet (the per-cell scoring loop left the interpreter).
iter 0 (baseline) : 0.31 s Pearson 0.9994 F1 1.00 (13× vs python)
iter 1 §4.2 LTO+cgu=1 : 0.27 s Pearson 0.9994 F1 1.00 ACCEPT (E)
iter 2 §2.2 buffer reuse : 0.24 s Pearson 0.9994 F1 1.00 ACCEPT (E)
iter 3 §3.4 rayon per-cell scoring (outer map, inner serial)
: 0.07 s Pearson 0.9994 F1 1.00 ACCEPT (E)
iter 4 §3.2 rayon parallel reduction for the neighbour-fraction sum
: 0.06 s Pearson 0.99938 F1 1.00 ACCEPT (B)
bound: ‖Δ‖ ≤ n·eps·max|x|, n=30 neighbours, max≈1 → ~7e-15; well within atol.
iter 5 f32 PCA accumulator
: — REJECTED — reference is f64; no admissible bound.
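The iter-4 bound can be checked numerically on a toy case. The sketch below compares a serial accumulation with a chunked one (the kind of order a parallel reduction might produce) and tests the difference against n·eps·max|x|. It reads the bound as applying to the normalised fraction (the sum divided by n); the helper names, the chunk size of 8, and the random per-neighbour terms are all assumptions.

```python
import numpy as np

def serial_mean(values):
    """Left-to-right accumulation, as in a plain serial loop."""
    total = 0.0
    for v in values:
        total += float(v)
    return total / len(values)

def chunked_mean(values, chunk=8):
    """Sum fixed-size chunks first, then combine the partial sums."""
    partials = []
    for start in range(0, len(values), chunk):
        part = 0.0
        for v in values[start:start + chunk]:
            part += float(v)
        partials.append(part)
    total = 0.0
    for p in partials:
        total += p
    return total / len(values)

n = 30                              # neighbours, as in the log entry
rng = np.random.default_rng(0)
x = rng.random(n)                   # hypothetical terms in [0, 1), so max|x| < 1

delta = abs(serial_mean(x) - chunked_mean(x))
bound = n * np.finfo(np.float64).eps * 1.0
print(f"delta={delta:.3e}  bound={bound:.3e}")
```

The same reasoning explains the iter-5 rejection: with an f64 reference, an f32 accumulator has an eps roughly nine orders of magnitude larger, so no bound of this form stays near the f64 rounding level.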
Each attempt appends a YAML block to ITERATION_LOG.md. Render the plot:
$ python -m engine.plot_evolution --port-dir .
[plot] wrote examples/evolution.png

Plot 1 drops 0.31 → 0.06 s (≈68× vs the Python reference end-to-end); Plot 2 stays flat at ~0.9994 with a single annotated, bounded dip at iter 4.
$ maturin build --release && pip install target/wheels/rs_scrublet-*.whl # fresh env
$ pytest -q # green
$ python -m engine.py_function_audit --py-source scrublet-ref --rust-crate src/ --output AUDIT.md
$ jupyter nbconvert --to notebook --execute --inplace examples/*.ipynb

Fill RECONSTRUCTION_REPORT.md (8 sections), MATH.md (the iter-4 reduction bound), and confirm all four notebooks are present and executed. None deferred.
$ gh repo create <org>/rs-scrublet --public
$ maturin publish # → PyPI
$ cargo publish # → crates.io

Tick examples/ROADMAP.md (scrublet ✅), add rs-scrublet to TEMPLATE.md as the classification seed, and you're done: a pip-installable, Rust-backed, parity-proven drop-in for scrublet, ~68× faster on PBMC3k.
- The baseline translation already delivered most of the speedup — Rust acceleration was the last ~5×, not the first 13×.
- The only parity dip came from a reordered reduction, was bounded in MATH.md, and stayed far inside the gate.
- The one real bug (kNN tie-break order) was fixed by matching the reference, never by loosening the gate, which is the cardinal rule.
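The speedup figures quoted in this walkthrough follow directly from the three timings in the logs (4.10 s reference, 0.31 s baseline, 0.06 s after iter 4). A quick arithmetic check, using those illustrative numbers only:

```python
ref_s, baseline_s, final_s = 4.10, 0.31, 0.06  # seconds, from the logs above

baseline_speedup = ref_s / baseline_s   # translation alone
total_speedup = ref_s / final_s         # end-to-end vs the Python reference
tuning_speedup = baseline_s / final_s   # gained by the optimisation iterations

print(round(baseline_speedup, 1))  # 13.2
print(round(total_speedup, 1))     # 68.3
print(round(tuning_speedup, 1))    # 5.2
```

The two stages multiply: roughly 13× from the translation times roughly 5× from tuning gives the overall ~68×.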