ChrisMao0325/rebuildpy

rebuildpy

A fixed, reproducible protocol for porting Python packages to Rust crates (rs-<pkg>, exposed to Python via PyO3/maturin), with class-aware numerical parity against the original Python reference, checked by a pre-registered gate.

This is the Python→Rust sibling of omicverse-rebuildr (which ports R→Python). Same engineering loop, same parity philosophy — the source language is Python and the target language is Rust.


What this is

Single-cell genomics, statistical genetics, and adjacent numerical fields have hundreds of canonical algorithms whose only reference implementation is pure Python (often NumPy/SciPy/Numba/Cython): leidenalg, scrublet, harmonypy, fa2, palantir, MAGIC, scanpy kernels, …

When those algorithms become a bottleneck, the options today are bad:

  1. Numba / Cython the hot loop — helps, but ships a fragile build, doesn't give true multi-core scaling without the GIL dance, and the optimised path silently diverges from the readable reference.
  2. Rewrite in Rust by hand — fast and safe, but the rewrite usually diverges from the Python reference and the divergence is never measured.
  3. Use an "approximate" Rust crate — silently a different algorithm with different numerical behaviour.

rebuildpy is the engineering recipe that takes a port from "I want this in Rust" to "the wheel is on PyPI, the crate is on crates.io, and it provably matches the Python reference on the canonical fixture" — in a small number of agent-driven iterations, with the proof of parity shipped alongside the wheel.

Three core ideas:

  1. The Python source is the executable spec. No reverse-engineering from papers. The agent runs the Python reference on a fixed input and compares its own Rust draft to that output, every iteration.
  2. Parity is class-aware. "Same output" means different things for an embedding (rotation-invariant), a clustering (label-permutation-invariant), or a pseudotime (correlation-invariant). The protocol pre-registers which numerical metric applies to which output and locks the threshold before any agent code is written.
  3. Reconstruction is not metric optimization. We never tune the algorithm to "look better" — we tune it to be identical to the Python reference, then search for speed under provably-equivalent rewrites. In Rust the dominant subtlety is that f64 addition is non-associative, so any reordered parallel/SIMD reduction must carry a derived error bound.
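The third idea can be seen in a few lines of plain Python, whose float is the same IEEE-754 binary64 type as Rust's f64. This is a minimal sketch, not kit code: serial_sum and chunked_sum are illustrative names, the chunked sum only imitates the shape of a parallel or SIMD reduction, and the bound checked is the textbook first-order bound (n − 1)·eps·Σ|x_i|, used here as an assumption in place of whatever a given port derives in MATH.md.

```python
import math
import random
import sys


def serial_sum(xs):
    """Left-to-right accumulation, the order a plain loop uses."""
    total = 0.0
    for x in xs:
        total += x
    return total


def chunked_sum(xs, n_chunks):
    """Sum each chunk on its own, then combine the partials in chunk order.

    Same terms as serial_sum, different association: this imitates the
    shape of a parallel or SIMD reduction (illustrative helper only).
    """
    size = math.ceil(len(xs) / n_chunks)
    partials = [serial_sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    return serial_sum(partials)


# Three terms are enough to show that regrouping changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("regrouping changes the result:", left != right)

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

exact = math.fsum(xs)              # accurately rounded reference sum
s_serial = serial_sum(xs)
s_chunked = chunked_sum(xs, 8)

eps = sys.float_info.epsilon       # 2**-52 for binary64
n = len(xs)
abs_sum = math.fsum(abs(x) for x in xs)
first_order_bound = (n - 1) * eps * abs_sum   # assumed textbook bound

print("serial  vs exact  :", abs(s_serial - exact))
print("chunked vs exact  :", abs(s_chunked - exact))
print("serial  vs chunked:", abs(s_serial - s_chunked))
print("first-order bound :", first_order_bound)
```

On typical data the observed differences sit many orders of magnitude below the bound, which is why a strict-tier gate can pass by luck on one fixture and fail on another; the bound, not the observation, is what a (B) rewrite has to declare.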

What ships at the end of every port:

  • A pip-installable wheel on PyPI (maturin-built Rust extension) + optionally the crate on crates.io.
  • A RECONSTRUCTION_REPORT.md with full Python-API coverage audit, per-output parity values, two-panel time-vs-accuracy plot, and ecosystem-reuse accounting.
  • Four pre-executed notebooks: pipeline parity, Python tutorial, Python⇄Rust function dictionary, per-iteration evolution.
  • A reproducible parity gate as a pytest test.

Quick start

# 1. Clone the kit
git clone <your-repo-url> rebuildpy
cd rebuildpy

# 2. Provision the Python reference env (see SETUP.md for full instructions)
conda create -n rebuild-pyref python=3.10 -y
conda activate rebuild-pyref
pip install -r requirements.txt
pip install <the-original-python-package>      # the executable spec

# 3. Install the Rust toolchain (NOT a conda package)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo --version && rustc --version

# 4. Provision the Rust target env (maturin builds the extension into it)
conda create -n rebuild-rust python=3.10 -y
conda activate rebuild-rust
pip install -r requirements.txt           # includes maturin

# 5. Export the two paths the kit needs
export PYTHON_REF_ENV=$(conda info --envs | awk '/rebuild-pyref/ {print $NF}')
export RUST_TEST_ENV=$(conda info --envs | awk '/rebuild-rust/ {print $NF}')

# 6. Authenticate GitHub CLI (needed for Discovery step)
gh auth login

# 7. Verify the kit installs cleanly (30 seconds)
python -m engine.smoke_test
# Expected: [smoke] OK -- 5/5 checks passed.

# 8. Check if your target Python package is already ported
python -m engine.discover_rust_deps --check <YourPyPackage>

If the smoke test passes and discovery says "no existing port", you're ready to start a port — follow PROTOCOL.md.

📖 Full setup walkthrough: SETUP.md (~30 minutes including conda + rustup provisioning).


How to invoke the protocol in a session

Point an agent (Claude Code, Cursor, etc.) at this folder and say:

Port Python package X to Rust. Follow rebuildpy/README.md.

The agent will execute the 6-step protocol end-to-end and produce, at the end:

  • an rs-X repository (under your $REBUILDPY_ORG) with an installable maturin wheel,
  • the pre-registered numerical parity gate clearing on the canonical fixture,
  • a structured RECONSTRUCTION_REPORT.md,
  • four mandatory pre-executed notebooks,
  • a PyPI release (and optional crates.io release).

The protocol — 6 steps

┌─ 0.5 Discovery ─────┐
│ • Is target already │ ← if YES: stop, reuse existing rs- repo
│   ported to Rust?   │
│ • Which py deps     │ ← matches added as pyproject deps;
│   have rs-mirrors   │   others mapped to ndarray/polars/petgraph/linfa
│   or a rust crate?  │
└─────────────────────┘
         ↓
┌─ 1 Shape template ──┐
│ Copy layout from a  │
│ prior port matching │
│ the algorithm class │
└─────────────────────┘
         ↓
┌─ 2 Dual envs ───────┐
│ Python reference env│
│ Rust target env     │  (maturin develop --release)
│ Both see same data  │
└─────────────────────┘
         ↓
┌─ 3 Two-agent inner loop ─────────────────────────────────────────────┐
│                                                                       │
│  ┌─ Equivalence Agent ────┐    ┌─ Acceleration Agent ──────────────┐ │
│  │ Translate Python → Rust│ →  │ Search rewrites for speed; each   │ │
│  │ Iterate until parity   │    │ requires admissibility proof:     │ │
│  │ gate clears (Pearson,  │    │ exact / bounded-ε (reduction      │ │
│  │ ARI, Procrustes, etc.) │    │ order!) / class-containment.      │ │
│  │                        │    │ Reject if it breaks parity.       │ │
│  └────────────────────────┘    └───────────────────────────────────┘ │
│                                                                       │
└──────────────────────────────────────────────────────────────────────┘
         ↓
┌─ 4 Validate ────────┐
│ Re-confirm gate.    │
│ Threshold is read-  │
│ only; never widened │
└─────────────────────┘
         ↓
┌─ 5 Release ─────────┐
│ Publish to PyPI +   │
│ crates.io + GitHub. │
│ Become a seed       │
│ template.           │
└─────────────────────┘

Each step is documented in detail:

| Step | What happens | Document |
|---|---|---|
| 0.5 Discovery | Check whether the target is already a Rust port; check whether each Python dep already has an rs- mirror or a standard Rust crate. STOP if duplicate; reuse deps if found. | DISCOVERY.md |
| 1 Shape template | Copy directory layout + test scaffold from a prior port. Do NOT copy algorithmic code. | TEMPLATE.md |
| 2 Dual environments | Provision a Python reference env (original package) and a Rust target env (maturin + built extension; cargo from rustup). Both see the same fixture files. | SETUP.md |
| 3 Two-agent inner loop | (a) Equivalence Agent: translate Python → Rust, iterate until the pre-registered class-aware parity gate clears. (b) Acceleration Agent: verifier-guided search over Rust rewrites for speed, each requiring one of three admissibility proofs. | PROTOCOL.md, PARITY_TAXONOMY.md, ACCELERATION_PLAYBOOK.md |
| 4 Validate | Re-confirm the gate. The threshold is committed before agent work begins and is never tightened or loosened afterwards. | PARITY_TAXONOMY.md |
| 5 Release | Ship the maturin wheel to PyPI + crate to crates.io, publish <org>/rs-X, complete the RECONSTRUCTION_REPORT.md + four mandatory notebooks. | NOTEBOOKS.md |

The 8 algorithm classes (parity taxonomy)

Different algorithms have different invariance structures, so "same output" needs different metrics. The protocol pre-registers one class per port output:

| # | Class | Parity criterion | Default threshold | Example Python packages |
|---|---|---|---|---|
| 1 | Deterministic numerical (3 sub-tiers; see PARITY_TAXONOMY.md) | element-wise max_abs_err < tol, optional rtol-scaled | standard 1e-8 / strict 1e-13 / bounded 1e-6; hard ceiling 1e-6 | BBKNN distances, MAGIC operator |
| 2 | Stochastic numerical | Kolmogorov–Smirnov ≤ τ or Wasserstein-1 ≤ τ | KS-p ≥ 0.05 | dropout simulations, MCMC draws |
| 3 | Combinatorial clustering | label-invariant: ARI / NMI / Fowlkes–Mallows | ARI ≥ 0.95 | leidenalg, louvain |
| 4 | Continuous embedding | rotation-invariant: Procrustes similarity | Procrustes ≥ 0.95 | UMAP, t-SNE, PCA, harmonypy |
| 5 | Ranked output | top-K Jaccard / Spearman correlation | top-50 Jaccard ≥ 0.8 | HVG selection, DE rankings |
| 6 | Ordinal output (pseudotime) | Pearson / Spearman correlation | Pearson ≥ 0.99 (≥ 1 − 1e-12 treated as exact) | DPT, palantir |
| 7 | Classification | label agreement / F1 | F1 ≥ 0.95 | scrublet doublet calls |
| 8 | Statistical inference | rank corr on −log10 p + top-K Jaccard | Spearman ≥ 0.90 | diffxpy, scanpy DE |

If the Python function returns multiple outputs of different classes, the manifest declares one gate per output and ALL must pass.
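As an illustration of what "class-aware" and "ALL must pass" mean in practice, the sketch below checks three outputs of different classes against the default thresholds from the table. Everything here is hypothetical scaffolding: the gate dictionaries are not the real manifest schema, and the metric functions are stand-ins written for this example, whereas a real port would import the kit's own implementations.

```python
from collections import Counter
from math import comb, sqrt


def adjusted_rand_index(a, b):
    """ARI of two labelings; renaming labels on either side leaves it unchanged."""
    pairs = Counter(zip(a, b))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:          # both partitions are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def pearson(x, y):
    """Pearson correlation; invariant to positive affine rescaling."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / sqrt(vx * vy)


def max_abs_err(x, y):
    """Element-wise maximum absolute error (no invariance at all)."""
    return max(abs(p - q) for p, q in zip(x, y))


# metric name -> (function, True if larger values are better)
METRICS = {
    "ari": (adjusted_rand_index, True),
    "pearson": (pearson, True),
    "max_abs_err": (max_abs_err, False),
}


def check_gates(gates, reference, candidate):
    """Evaluate one gate per output; the port passes only if every gate passes."""
    results = {}
    for gate in gates:
        fn, higher_is_better = METRICS[gate["metric"]]
        value = fn(reference[gate["output"]], candidate[gate["output"]])
        ok = value >= gate["threshold"] if higher_is_better else value < gate["threshold"]
        results[gate["output"]] = (value, ok)
    return all(ok for _, ok in results.values()), results


gates = [  # hypothetical per-output gate blocks
    {"output": "clusters", "metric": "ari", "threshold": 0.95},
    {"output": "pseudotime", "metric": "pearson", "threshold": 0.99},
    {"output": "distances", "metric": "max_abs_err", "threshold": 1e-8},
]
reference = {"clusters": [0, 0, 1, 1, 2, 2],
             "pseudotime": [0.0, 0.25, 0.5, 0.75, 1.0],
             "distances": [1.0, 2.0]}
candidate = {"clusters": ["b", "b", "a", "a", "c", "c"],   # same partition, renamed
             "pseudotime": [1.0, 1.5, 2.0, 2.5, 3.0],      # rescaled, same ordering
             "distances": [1.0, 2.0 + 1e-10]}

passed, results = check_gates(gates, reference, candidate)
print(passed, results)
```

The candidate's cluster labels and pseudotime scale differ from the reference, yet both gates clear because their metrics ignore exactly those differences; the distance output gets no such slack and must match element-wise.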

The 8 metric implementations live in engine/parity_metrics.py — import from there rather than redefining. (They are language-agnostic; this is the same module shape omicverse-rebuildr uses.)

📖 Full taxonomy: PARITY_TAXONOMY.md — includes the Python→Rust "when the gate fails: ordered suspicion list" (row-major-vs-transpose, f32-vs-f64, integer wrapping, reduction order, NaN handling, …).
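Several entries on that suspicion list can be reproduced without Rust at all. The sketch below emulates them in stdlib Python so the symptoms are recognisable when a gate fails; the helper names are illustrative, and the Rust behaviour described in the comments (f32 rounding, two's-complement wrapping as produced by wrapping arithmetic) is emulated rather than executed.

```python
import math
import struct


def as_f32(x):
    """Round a Python float (f64) to the nearest f32 and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def wrap_i32(n):
    """Two's-complement wrap of an unbounded Python int to 32 bits."""
    return (n + 2**31) % 2**32 - 2**31


def at_row_major(buf, n_cols, i, j):
    """Element (i, j) of a flat buffer interpreted in row-major order."""
    return buf[i * n_cols + j]


def at_col_major(buf, n_rows, i, j):
    """Element (i, j) of the same buffer interpreted in column-major order."""
    return buf[j * n_rows + i]


# f32 vs f64: one accidental f32 intermediate is far outside a strict 1e-13 tier.
f32_error = abs(as_f32(0.1) - 0.1)
print("f32 round-trip error on 0.1:", f32_error)

# Integer wrapping: Python ints never overflow, a wrapped i32 does.
i32_max = 2**31 - 1
wrapped = wrap_i32(i32_max + 1)
print("i32::MAX + 1 wrapped:", wrapped)

# Layout: the same 2x3 buffer read with the wrong stride yields the transpose.
buf = [1, 2, 3, 4, 5, 6]
row_major_01 = at_row_major(buf, 3, 0, 1)
col_major_01 = at_col_major(buf, 2, 0, 1)
print("(0, 1) row-major vs column-major:", row_major_01, col_major_01)

# NaN handling: a naive max() keeps or drops NaN depending on its position,
# so a max-abs-error metric can silently ignore a NaN in the candidate.
nan = float("nan")
nan_first = max([nan, 1.0])
nan_last = max([1.0, nan])
print("max with NaN first / last:", nan_first, nan_last)
```

A practical consequence of the last case is that a parity metric should test for NaN explicitly (for example with math.isnan) before reducing, instead of relying on comparison operators.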


Acceleration: 3 admissibility proof classes

Every rewrite the Acceleration Agent commits must carry one of these proofs:

| Proof class | Meaning | Examples (Rust) |
|---|---|---|
| (E) Exact identity | Bit-equivalent output by mathematical identity or by not touching the arithmetic. | Xᵀ X hoisted out of a loop; Woodbury; zero-copy ArrayView; buffer reuse; fixed-order reduction; LTO + codegen-units=1. |
| (B) Bounded ε-approximation | Error bounded by a closed-form expression; derived in MATH.md, not handwaved. | Reordered rayon parallel sum / SIMD horizontal-add (‖Δ‖ ≤ n·eps·max\|x\|). |
| (C) Class-containment theorem | A known theorem guarantees the same output for the relevant input class. | Euclidean MST ⊆ Delaunay (Preparata–Shamos 1985), via spade + petgraph. |

📖 Full catalog: ACCELERATION_PLAYBOOK.md. The headline rule: f64 + is non-associative — reordering any float reduction turns an (E) rewrite into a (B) one that needs a bound. This is the single most common way a Rust port silently breaks deterministic-strict parity.
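For orientation, the kind of statement a (B) proof needs for a reordered sum can be written down from the standard rounding-error analysis of floating-point summation. The block below is a sketch under stated assumptions (IEEE-754 binary64, round-to-nearest, no overflow or underflow) and uses symbols that are not defined elsewhere in this README: u is the unit roundoff, eps = 2u is machine epsilon, and π, σ denote two summation orders.

```latex
% Assumptions: binary64, round-to-nearest, no overflow/underflow.
% u = 2^{-53} (unit roundoff), eps = 2u = 2^{-52} (machine epsilon).
\[
  s = \sum_{i=1}^{n} x_i ,
  \qquad
  \hat{s}_{\pi} = \text{value computed by adding the same } n \text{ terms in order } \pi .
\]
% Any summation order (serial, chunked, tree-shaped) obeys the same bound:
\[
  \lvert \hat{s}_{\pi} - s \rvert \;\le\; \gamma_{n-1} \sum_{i=1}^{n} \lvert x_i \rvert ,
  \qquad
  \gamma_{k} = \frac{k\,u}{1 - k\,u} .
\]
% Two different orders can therefore disagree by at most
\[
  \lvert \hat{s}_{\pi} - \hat{s}_{\sigma} \rvert
  \;\le\; 2\,\gamma_{n-1} \sum_{i=1}^{n} \lvert x_i \rvert
  \;\approx\; (n-1)\,\mathrm{eps} \sum_{i=1}^{n} \lvert x_i \rvert .
\]
```

This README abbreviates the bound as n·eps·max|x|, while the textbook form above is expressed through Σ|x_i|. Which form, and which constant, applies to a particular reduction depends on the shape of the reduction and the data, so that is something each port's MATH.md has to establish.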


Evaluation: two plots, not one

Traditional evolutionary search plots iteration vs metric because the policy searches for better metric. That's the wrong model here — reconstruction's goal is identical output to the Python reference, not "better" output.

So every port produces two plots against the same iteration axis:

 wall-clock (s)
  │
  │  ●─┐                ← Iteration 0 (straight Rust translation) already
  │    │  ●─┐             a big drop vs the Python reference
  │       │    ●──●
  │ python-ref → iter 0 → iter 1 → iter 2 → iter 3
  │
  └────────────────────────────────────────────────→ iteration

 parity metric (e.g. Procrustes)
  │ ●──●──●──●─┐
  │              \
  │               ●──●   ← annotated: "rayon parallel reduction, n·eps·max|x|"
  │  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ threshold (red dashed line)
  │
  └────────────────────────────────────────────────→ iteration
  • Plot 1 (top, log scale): wall-clock should monotonically decrease as rewrites land. Error bars = stddev over 3 warmup-excluded runs. Always a --release build.
  • Plot 2 (bottom): parity metric should stay flat at the ceiling. Every dip must be annotated with the math approximation that caused it (almost always a reordered reduction).

Wall-clock measurement rules:

  • Warmup: discard the first run (BLAS thread spin-up, extension load, page cache).
  • 3 measured runs; report mean ± stddev.
  • CV > 10% → auto-extend to 5 runs, report median + IQR.
  • Fix BLAS + rayon threads via OMP_NUM_THREADS=8 / RAYON_NUM_THREADS=8 before any imports.
  • Never time a debug build — it is an invalid measurement.
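The measurement rules above can be sketched as one small function. This is an illustrative stand-in, not the kit's own benchmark helper: the function name measure, its parameters, and the returned dictionary keys are all assumptions, and the injectable timer exists only so the logic can be checked deterministically.

```python
import os
import statistics
import time

# Pin thread pools before importing anything that starts them.
# The value 8 is the one quoted in the rules above.
os.environ.setdefault("OMP_NUM_THREADS", "8")
os.environ.setdefault("RAYON_NUM_THREADS", "8")


def measure(fn, timer=time.perf_counter, base_runs=3, extended_runs=5, cv_limit=0.10):
    """Time fn() with one discarded warmup, then 3 runs, extending to 5 if noisy."""

    def one_run():
        start = timer()
        fn()
        return timer() - start

    one_run()  # warmup: executed but never reported

    samples = [one_run() for _ in range(base_runs)]
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples)
    cv = stddev / mean if mean > 0 else 0.0

    if cv <= cv_limit:
        # Stable enough: report mean and standard deviation of the 3 runs.
        return {"runs": len(samples), "report": "mean+stddev",
                "center": mean, "spread": stddev, "cv": cv}

    # Too noisy: extend to 5 runs and switch to robust statistics.
    samples += [one_run() for _ in range(extended_runs - base_runs)]
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return {"runs": len(samples), "report": "median+IQR",
            "center": statistics.median(samples), "spread": q3 - q1, "cv": cv}


result = measure(lambda: sum(i * i for i in range(200_000)))
print(result)
```

In a real port the callable would invoke the release-built extension on the canonical fixture; timing a pure-Python lambda here only demonstrates the control flow.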

📖 Full spec + iteration-log schema: EVALUATION.md.


Four mandatory notebooks per release

A finished port serves five audiences, each with a different need:

| Audience | What they need | Where they look |
|---|---|---|
| Reviewer / scientist evaluating whether to trust the port | Pipeline-level proof Rust ≡ Python numerically | compare_Python_vs_Rust.ipynb |
| End user of the package | A copy-pastable Python tour of every public function (now Rust-backed) | tutorial_<dataset>.ipynb |
| Python user porting their existing code | A function-level dictionary: every Python parameter ↔ Rust parameter, side-by-side calls on identical input | function_by_function_Python_parity.ipynb |
| Auditor of the engineering process asking "did the agent really iterate?" | A per-iteration narrative log with one named subplot per iteration | evolution.ipynb |
| CI / automation | The pre-registered parity gate as a pytest assertion | tests/test_exact_match.py |

All four notebooks ship pre-executed so GitHub renders them without re-running. Phase 4 blocks the port from being released if any one is missing.

evolution.ipynb is a forcing function. It is structured as ## Iteration N — <title> headers, one per iteration, each with a markdown narrative of what changed AND a code cell that produces a subplot for that iteration. If the agent skipped the acceleration loop, the notebook still has the baseline block (## Iteration 0 — Baseline translation) — but the protocol then audits whether obvious acceleration opportunities were missed. The summary 2-panel examples/evolution.png (auto-generated by engine.plot_evolution) supplements but does not replace this notebook.
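One way such a structural requirement could be checked mechanically is sketched below. It assumes the standard nbformat v4 JSON layout (a top-level "cells" list whose entries carry "cell_type", "source" and, for code cells, "outputs"); the function name and the exact checks are illustrative and not taken from the kit.

```python
import re

# Matches "## Iteration N" at the start of a line; the title after N is ignored.
ITER_HEADER = re.compile(r"^##\s+Iteration\s+(\d+)\b", re.MULTILINE)


def _source(cell):
    """nbformat stores source either as one string or as a list of lines."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def audit_evolution_notebook(nb):
    """Return (iteration_numbers, problems) for a notebook given as a dict."""
    cells = nb.get("cells", [])
    found, problems = [], []
    for idx, cell in enumerate(cells):
        if cell.get("cell_type") != "markdown":
            continue
        match = ITER_HEADER.search(_source(cell))
        if match is None:
            continue
        number = int(match.group(1))
        found.append(number)

        # Scan forward until the next iteration header for an executed code cell.
        executed = False
        for later in cells[idx + 1:]:
            if later.get("cell_type") == "markdown" and ITER_HEADER.search(_source(later)):
                break
            if later.get("cell_type") == "code" and later.get("outputs"):
                executed = True
                break
        if not executed:
            problems.append(f"Iteration {number}: no pre-executed code cell")

    if 0 not in found:
        problems.append("missing '## Iteration 0' baseline block")
    return found, problems


example = {
    "cells": [
        {"cell_type": "markdown", "source": ["## Iteration 0 \u2014 Baseline translation\n", "Straight port."]},
        {"cell_type": "code", "source": "plot(0)", "outputs": [{"output_type": "display_data"}]},
        {"cell_type": "markdown", "source": "## Iteration 1 \u2014 Zero-copy view\nNarrative only."},
    ]
}
iterations, problems = audit_evolution_notebook(example)
print(iterations, problems)
```

A check like this can confirm that each iteration block exists and was executed, but not that its subplot is meaningful; that part still needs a reader.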

📖 Schemas + section-by-section requirements: NOTEBOOKS.md.


Kit contents

Top-level documents

| File | What it does |
|---|---|
| SETUP.md | First-time install: prerequisites, dual-env provisioning, rustup, env vars, gh auth, smoke test, troubleshooting. |
| PROTOCOL.md | The 6-step protocol + the two-agent inner loop. Read this in a session before starting a port. |
| DISCOVERY.md | Phase 0.5: reuse before rebuild. Find existing rs- ports for the target and its Python deps. |
| PARITY_TAXONOMY.md | 8-class algorithm taxonomy → which numerical-parity metric applies (+ the reduction-order rule). |
| ACCELERATION_PLAYBOOK.md | Catalog of Rust rewrites with the 3 admissibility proof types. |
| EVALUATION.md | Two-plot evaluation (time vs iter + accuracy vs iter), warmup excluded, accuracy dips annotated. |
| NOTEBOOKS.md | Four mandatory pre-executed notebooks per release. Non-skippable in Phase 4. |
| TEMPLATE.md | Standard rs-<pkg> repo layout + naming conventions + license decision matrix. |
| CHECKLIST.md | Per-port checklist to tick through, Phase 0–5. |

Engine (runnable code) — engine/

| File | What it does | Typical invocation |
|---|---|---|
| smoke_test.py | 30-second sanity check: verifies the kit installs and all 8 parity metrics + audit / plot / benchmark / loop helpers work. | python -m engine.smoke_test |
| discover_rust_deps.py | Lists existing org repos via gh repo list <org> (default omicverse, override with REBUILDPY_ORG); parses the package's pyproject.toml; reports which deps already have rs- mirrors and the Rust-ecosystem crate for the rest. Cached 24h. | python -m engine.discover_rust_deps --check <PyPkg> |
| parity_metrics.py | The 8 parity-class metric functions (Pearson, ARI, Procrustes, KS, top-K Jaccard, …) + class dispatcher. | from parity_metrics import compute_parity, is_pass |
| benchmark.py | Wall-clock timer with warmup-exclusion + 3-run averaging; pins BLAS + rayon threads; auto-extends to 5 runs + median when CV > 10%. | from benchmark import time_callable |
| py_function_audit.py | Parses the Python package's public API (__all__ / def / class) via ast, audits Rust-crate coverage (#[pyfunction] / #[pymethods] / pub fn), produces AUDIT.md. | python -m engine.py_function_audit --py-source <pkg>-ref --rust-crate src/ |
| plot_evolution.py | Renders the two-panel evolution PNG from ITERATION_LOG.md, annotates accuracy dips with their math reason. | python -m engine.plot_evolution --port-dir <path> |
| loop.py | The rebuildpy loop as runnable code: equivalence + acceleration phases as Python callables. | python -m engine.loop --port-dir <path> --phase equivalence |
| manifest.template.yaml | Pre-registered parity gate spec; copy into each new port's data/manifest.yaml. | (file template) |
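To give a feel for the coverage audit described in the table, here is a rough sketch of the idea using only the standard library: read the public names from Python source with ast, scan Rust source for functions marked #[pyfunction], and report which names are covered. It is not the kit's audit script; the function names are hypothetical, the Rust scan is a simple regular expression instead of a parser, and #[pymethods] and plain pub fn are deliberately left out.

```python
import ast
import re


def public_api(source):
    """Public names of a module: __all__ if declared, else top-level def/class."""
    tree = ast.parse(source)
    declared = None
    discovered = []
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            if "__all__" in targets and isinstance(node.value, (ast.List, ast.Tuple)):
                declared = [e.value for e in node.value.elts
                            if isinstance(e, ast.Constant) and isinstance(e.value, str)]
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                discovered.append(node.name)
    return declared if declared is not None else discovered


# "#[pyfunction]" optionally followed by further attributes, then "fn name".
PYFUNCTION = re.compile(
    r"#\[pyfunction[^\]]*\](?:\s*#\[[^\]]*\])*\s*(?:pub(?:\([^)]*\))?\s+)?fn\s+(\w+)"
)


def rust_pyfunctions(rust_source):
    """Names of functions annotated with #[pyfunction] (regex-based, approximate)."""
    return PYFUNCTION.findall(rust_source)


def coverage(py_names, rust_names):
    """Map every public Python name to 'ported' or 'missing'."""
    exported = set(rust_names)
    return {name: ("ported" if name in exported else "missing") for name in py_names}


PY_SRC = '''
__all__ = ["run", "Model"]

def run(x): ...
def helper(x): ...
class Model: ...
'''

RUST_SRC = '''
#[pyfunction]
#[pyo3(signature = (x))]
pub fn run(x: f64) -> f64 { x }

fn private_helper() {}
'''

report = coverage(public_api(PY_SRC), rust_pyfunctions(RUST_SRC))
print(report)
```

The useful property is that the Python side, not the Rust side, defines the list of names: anything public in the reference that has no counterpart shows up as missing and has to be either ported or explicitly skipped with a reason.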

File-level templates — templates/

Every new port copies these as starting scaffolding; nothing is generated from scratch.

| Template | Becomes |
|---|---|
| pyproject.template.toml | The port's pyproject.toml (maturin build-backend + metadata) |
| Cargo.template.toml | The port's Cargo.toml (crate deps + release profile) |
| lib.template.rs | The port's src/lib.rs (PyO3 module skeleton) |
| README.template.md | The port's user-facing README.md |
| py_reference_driver.template.py | tests/py_reference_driver.py: invokes the Python reference, dumps JSON |
| _run_candidate.template.py | tests/_run_candidate.py: invokes the Rust extension, dumps JSON |
| test_exact_match.template.py | tests/test_exact_match.py: pytest test that asserts the gate |
| DISCOVERY.template.md | The port's DISCOVERY.md artefact (Phase 0.5) |
| ITERATION_LOG.template.md | The port's ITERATION_LOG.md (Phase 3 acceleration log) |
| RECONSTRUCTION_REPORT.template.md | The port's RECONSTRUCTION_REPORT.md (8-section final report) |
| MATH.template.md | The port's MATH.md (perturbation bounds for (B) rewrites) |
| compare_Python_vs_Rust.template.ipynb | Notebook 1: pipeline parity |
| tutorial.template.ipynb | Notebook 2: Python tutorial |
| function_by_function_Python_parity.template.ipynb | Notebook 3: Python⇄Rust function dictionary |
| evolution.template.ipynb | Notebook 4: per-iteration evolution |
| py_per_function_dump.template.py | Python driver feeding Notebook 3 |
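The two drivers and the pytest gate exchange results as JSON, which is safe for a strict deterministic tier only because Python's json module round-trips finite floats exactly. The sketch below illustrates that hand-off with in-memory strings instead of files; the output name "distances", the helper names, and the choice to reject NaN at dump time are assumptions made for the example, not the templates' actual contents.

```python
import json


def dump_outputs(outputs):
    """Serialise {name: [floats]}; refuse NaN/inf instead of emitting non-standard JSON."""
    return json.dumps(outputs, allow_nan=False, sort_keys=True)


def max_abs_err(reference, candidate):
    """Element-wise maximum absolute error between two equally long sequences."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch between reference and candidate")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


# Reference driver side (hypothetical values).
reference = {"distances": [0.1, 1 / 3, 2.0 ** -40]}
reference_json = dump_outputs(reference)

# Candidate side: identical except for a one-ulp difference in the second entry.
candidate = {"distances": [0.1, 1 / 3 + 2.0 ** -54, 2.0 ** -40]}
candidate_json = dump_outputs(candidate)

# Gate side: parse both dumps and compare against the strict tier from the taxonomy.
ref_loaded = json.loads(reference_json)["distances"]
cand_loaded = json.loads(candidate_json)["distances"]
err = max_abs_err(ref_loaded, cand_loaded)

STRICT_TOL = 1e-13
print("round trip exact:", ref_loaded == reference["distances"])
print("max_abs_err:", err, "passes strict tier:", err < STRICT_TOL)
```

Rejecting NaN at dump time is a design choice for this sketch: Python would otherwise write the bare token NaN, which is not valid JSON and which a strict parser on the other side of the boundary may refuse.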

Examples & roadmaps — examples/

| File | What it does |
|---|---|
| ROADMAP.md | Ranked Python packages awaiting Rust ports, with the rust-ecosystem crates each would lean on. |
| EXAMPLE_WALKTHROUGH.md | End-to-end Phase 0 → Phase 5 narrative for one port, with concrete commands and intermediate outputs. |

What the agent does in a session

A typical agent session opens with:

Port Python package X to Rust. Follow rebuildpy/README.md.

The agent then executes:

  1. (Phase 0.5 — Discovery) Run engine/discover_rust_deps.py to check:
    • Is <org>/rs-X already published? → if yes, STOP, report the existing repo.
    • Which of X's Python deps already have rs- mirrors or a standard Rust crate? → record in DISCOVERY.md.
  2. (Phase 0) Look up X's algorithm class in PARITY_TAXONOMY.md. Write and commit data/manifest.yaml with the algorithm class, threshold, canonical fixture path, seed, and per-output gate blocks. The gate is read-only after this.
  3. (Phase 1) Copy the layout from TEMPLATE.md (seed shape chosen by algorithm class).
  4. (Phase 2 — Equivalence Agent) Translate each Python function in dependency order into Rust; maturin develop --release; run the per-function parity diff. Iterate until the gate clears at the pre-registered threshold.
  5. (Phase 3 — Acceleration Agent) For each candidate rewrite from ACCELERATION_PLAYBOOK.md:
    • Check precondition + produce admissibility proof (E / B / C). For any reordered reduction, derive the n·eps·max|x| bound in MATH.md.
    • Apply on a working branch; rebuild; re-run parity test (gate still clearing?); re-benchmark.
    • Accept if speedup > 1.05× and gate clears; else roll back.
    • Append one YAML block to ITERATION_LOG.md per attempt.
  6. (Phase 4 — release artefacts) Tick CHECKLIST.md end-to-end; produce all mandatory deliverables:
    • RECONSTRUCTION_REPORT.md (8 sections)
    • MATH.md (perturbation bounds for any (B) rewrites)
    • AUDIT.md (Python-API coverage, auto-generated by engine.py_function_audit)
    • examples/evolution.png (two-panel plot, auto-generated by engine.plot_evolution)
    • examples/compare_Python_vs_Rust.ipynb — pipeline parity
    • examples/tutorial_<dataset>.ipynb — Python tutorial
    • examples/function_by_function_Python_parity.ipynb — Python⇄Rust dictionary
    • examples/evolution.ipynb — per-iteration narrative + subplot
  7. (Phase 5 — release) maturin build --release → wheel to PyPI; cargo publish → crates.io; create GitHub repo + release; add the port as a seed template for future ports.

Always-first invariant: Phase 0.5 (Discovery) is non-skippable. If discovery is skipped, the protocol fails — we risk re-implementing a crate that already exists.
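The decision logic of the Discovery phase is simple enough to sketch. The version below takes the list of organisation repositories and the dependency strings as plain inputs, so it runs without gh or network access. All names are hypothetical, the dependency-to-crate mapping is an assumption made for illustration (the README names the crates but not which Python package maps to which), and name normalisation follows the usual Python packaging convention of lowercasing and collapsing runs of "-", "_" and ".".

```python
import re

# Assumed mapping for illustration only; a real port would curate its own.
CRATE_MAP = {
    "numpy": "ndarray",
    "pandas": "polars",
    "networkx": "petgraph",
    "scikit-learn": "linfa",
}


def normalize(name):
    """Lowercase and collapse runs of '-', '_', '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(spec):
    """Distribution name from a requirement string such as 'numpy>=1.20'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
    if match is None:
        raise ValueError(f"cannot parse requirement: {spec!r}")
    return normalize(match.group(1))


def discover(target, requirements, org_repos, crate_map=CRATE_MAP):
    """Report whether rs-<target> exists and how each dependency can be reused."""
    repos = {normalize(repo) for repo in org_repos}
    report = {"target_already_ported": f"rs-{normalize(target)}" in repos, "deps": {}}
    for spec in requirements:
        dep = requirement_name(spec)
        if f"rs-{dep}" in repos:
            report["deps"][dep] = f"rs-mirror: rs-{dep}"
        elif dep in crate_map:
            report["deps"][dep] = f"crate: {crate_map[dep]}"
        else:
            report["deps"][dep] = "unmapped"
    return report


report = discover(
    target="leidenalg",
    requirements=["numpy>=1.20", "python-igraph", "Scrublet"],
    org_repos=["rs-scrublet", "omicverse"],
)
print(report)
```

An "unmapped" dependency is not a failure; it marks a routine that has to be ported into the crate or kept in Python behind a documented seam, as the FAQ describes.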

No deferred items in Phase 4: every artefact above is mandatory.


When to use this kit (and when not to)

Use this kit when:

  • ✅ The target is a Python package with a clear numerical output (vector, matrix, table, cluster IDs, p-values) that you want faster and memory-safe.
  • ✅ You can construct a canonical input fixture small enough for fast iteration (< 1 minute end-to-end for the Python reference).
  • ✅ The upstream Python package is open-source under a license you can match.
  • ✅ You're prepared to commit time on the order of 1–5 working days for a clean port.

Don't use it when:

  • ❌ The "Python package" you want is closed-source or only described in a paper without runnable code — no executable spec means no parity oracle.
  • ❌ The algorithm is dominated by calls into another compiled library (the Python is a thin wrapper) — there's little to gain from a Rust rewrite.
  • ❌ You want a Rust algorithm that's better than the Python one, not identical. This kit refuses to widen the gate; fork after the port lands.
  • ❌ The hot path is already a well-tuned C/Fortran extension (e.g. pure BLAS) — Rust won't beat it and the parity oracle's ceiling is that extension.

The evolutionary-RL analogy (in one paragraph)

The acceleration loop is verifier-guided test-time search, not weight-update RL — and importantly not metric optimization:

| Component | Mapping |
|---|---|
| Policy | The LLM in-context (no fine-tune, no weight updates). |
| Action | One rewrite drawn from ACCELERATION_PLAYBOOK.md (rayon outer-axis map, zero-copy view, Woodbury, target-cpu=native, MST ⊆ Delaunay, …). |
| Environment | The parity test + a 3-run-mean stopwatch on a --release build over the canonical fixture (see EVALUATION.md). |
| Reward | r_t = φ(a_t) · speedup(a_t): the gate must still clear (φ = 1), then wall-clock speedup ranks admissible candidates. |
| Best-so-far register | The last commit on the in-progress port. Roll back if a later rewrite breaks parity. |

What we don't do: improve the algorithm's metric. Reconstruction's goal is identical outputs to the Python reference, not "better" ones. Two evaluation plots come out of every port: time vs iteration (monotonically decreasing) and accuracy vs iteration (flat at the maximum; every dip annotated with the math approximation — almost always a reordered reduction).

No model weights change. Search occurs inside one coding-agent session, with the parity test as oracle and the wall-clock as cost function.
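The reward and the accept-or-roll-back rule can be written out in a few lines. This sketch combines the reward from the table with the 1.05× acceptance threshold and the requirement that a (B) rewrite carries a derived bound; the Attempt record and its field names are hypothetical and do not reflect the iteration-log schema.

```python
from dataclasses import dataclass

MIN_SPEEDUP = 1.05  # acceptance threshold quoted in this README


@dataclass(frozen=True)
class Attempt:
    name: str
    proof: str            # "E", "B" or "C"
    gate_clears: bool     # parity gate still passes after the rewrite
    bound_derived: bool   # only relevant for proof == "B"
    baseline_s: float     # wall-clock of the best-so-far commit
    candidate_s: float    # wall-clock with the rewrite applied


def admissible(attempt):
    """phi = 1 only with a valid proof class, a derived bound for (B), and a clearing gate."""
    if attempt.proof not in {"E", "B", "C"}:
        return False
    if attempt.proof == "B" and not attempt.bound_derived:
        return False
    return attempt.gate_clears


def reward(attempt):
    """r = phi * speedup; inadmissible rewrites score zero however fast they are."""
    phi = 1.0 if admissible(attempt) else 0.0
    return phi * (attempt.baseline_s / attempt.candidate_s)


def decide(attempt):
    return "accept" if reward(attempt) > MIN_SPEEDUP else "rollback"


attempts = [
    Attempt("zero-copy view", "E", True, False, 10.0, 8.0),
    Attempt("rayon reduction, no bound", "B", True, False, 10.0, 8.0),
    Attempt("rayon reduction, bound in MATH.md", "B", True, True, 10.0, 8.0),
    Attempt("fast but breaks parity", "E", False, False, 10.0, 2.0),
    Attempt("marginal gain", "E", True, False, 10.0, 9.8),
]
decisions = {a.name: decide(a) for a in attempts}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

The second and third attempts have identical timings and identical parity results; only the presence of the derived bound separates a rollback from an accepted rewrite.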


Final artefact — reconstruction report

After the parity gate clears and the Acceleration loop terminates, the agent fills out RECONSTRUCTION_REPORT.md. The 8 sections:

  1. Identity — package, upstream version, algorithm class, threshold, final parity value, audit class A/B/C, LOC, speedup vs Python.
  2. Python API coverage audit — every public name from the package is in the table (ported / skipped with reason). Auto-populated by engine.py_function_audit. Also lists dependencies reused (ecosystem audit — which Rust crates / rs- mirrors were reused vs re-implemented).
  3. Parity evidence — per-output metric values, per-fixture wall-clock + parity, reproducible reference command.
  4. Acceleration evidence — two-panel evolution figure embedded, accepted-vs-rejected rewrites with admissibility proofs.
  5. Code quality audit: maturin build --release + pip install + pytest green + four mandatory notebooks executed + license compatible + version pinned. All non-skippable.
  6. Known limitations — honest list of what the port doesn't do; never used as an excuse to widen the gate.
  7. Integration — crate/wheel location, public-API exposure, tutorial slot.
  8. Sign-off — author, date, active time spent, final audit class.

This is what we present as "the port is done".


Evolution — how the protocol got here

The protocol is a faithful adaptation of omicverse-rebuildr (R→Python), re-pointed at the Python→Rust direction. The changes that the new direction forces:

| Area | What changed vs the R→Python kit | Why |
|---|---|---|
| Reference / target | Python is the reference (executable spec); Rust is the target. Envs become PYTHON_REF_ENV / RUST_TEST_ENV. | The fast language is now the target, not the source. |
| Deterministic error sources | "cross-BLAS rounding" is joined by parallel/SIMD float-reduction reordering as the dominant (B) source. | f64 + is non-associative; rayon/SIMD reorder sums. This is the new central admissibility concern. |
| Acceleration playbook | R→Python algebraic rewrites are kept; added Rust-specific §2 memory/ownership, §3 parallelism/SIMD, §4 compiler flags, §5 interpreter-overhead removal. | Rust's speed comes from ownership + parallelism + codegen, not just algebra. |
| Coverage audit | NAMESPACE parsing → Python __all__/ast parsing; coverage checked against #[pyfunction]/pub fn. | The source is Python; the target is a Rust crate. |
| Discovery | R DESCRIPTION deps → Python pyproject.toml deps; deps mapped to rs- ports and standard crates (ndarray/polars/petgraph/linfa). | Reuse the Rust ecosystem, not just org mirrors. |
| Build / release | wheel-on-PyPI → maturin wheel on PyPI + crate on crates.io; debug-build timings declared invalid. | Rust has two distribution channels and a release/debug split. |

Ports shipped under this protocol

See examples/ROADMAP.md for the full ranked list.

Status Port Date Audit Speedup Notes
⬜ next rs-leidenalg TBD TBD Community detection; petgraph + ARI gate; highest reuse density
rs-scrublet TBD TBD Doublet detection; classification/F1 gate; rayon per-cell map
rs-harmonypy TBD TBD Batch integration; embedding/Procrustes gate; ndarray-linalg
rs-fa2 TBD TBD ForceAtlas2 layout; deterministic-bounded; SIMD axpy
rs-palantir TBD TBD Pseudotime; ordinal/Pearson gate; Woodbury

FAQ

Q: How long does a typical port take? A: Translation-only (class A): 1–3 days. With minor acceleration (class B): 2–5 days. Heavy acceleration with proofs (class C): 1–2 weeks. The Rust baseline translation usually already delivers most of the speedup; acceleration is about the last 2–5×.

Q: Why is parity so much more fragile than R→Python? A: Because Rust makes parallelism and SIMD easy to reach for, and f64 addition is non-associative. A rayon parallel sum is not guaranteed to be bit-identical to a serial sum, and because work-stealing can change how the input is split, it is not guaranteed to be bit-identical from run to run either. The protocol handles this with the reduction-order rule: fixed-order reduction = (E) exact; reordered = (B) bounded by n·eps·max|x|, declared in MATH.md. See PARITY_TAXONOMY.md.

Q: maturin/PyO3 or a standalone Rust binary for the candidate? A: Default to maturin/PyO3. The deliverable is a pip-installable Rust-backed Python package, and the candidate runner just imports rs_<pkg>. A cargo run binary that dumps JSON is an acceptable fallback when the package is CLI-shaped, but you lose the "drop-in faster replacement" property.

Q: What if my target's deps have no Rust crate (e.g. statsmodels)? A: Either (a) port the specific routine you need into the crate, or (b) keep that step in Python and call back across the boundary, documenting the seam in MATH.md and RECONSTRUCTION_REPORT.md §6. Don't pretend a different routine is equivalent.

Q: My port gets a 1.2× speedup from a rayon reduction but Procrustes drops from 1.0000 to 0.9990 (still above threshold). Accept? A: Only if it's a (B) rewrite with the perturbation bound derived in MATH.md. A "small" empirical drop with no closed-form bound is a bug, not an optimisation. Reject and either fix the reduction order ((E)) or derive the bound.

Q: Can I publish to a different GitHub org? A: Yes. Export REBUILDPY_ORG=<your-org> before running engine.discover_rust_deps. The kit pushes nothing automatically — Phase 5's gh repo create, maturin publish, and cargo publish are explicit and you control them.

Q: Does this work on Windows? A: Tested on Linux; macOS should work (set the Accelerate BLAS backend for ndarray-linalg). Windows requires WSL2 because the kit shells out to bash for some pipe operations.


License

The kit itself is MIT. Each individual port matches its upstream Python package's license (GPL-3 if upstream is GPL ≥ 2; MIT/Apache-2.0 dual — the Rust convention — if upstream is permissive). See TEMPLATE.md §License decision matrix.


Provenance

This protocol is a direct adaptation of the omicverse-rebuildr recipe (reference-driven cross-language library synthesis via LLM agents), re-pointed from R→Python to Python→Rust. The reference-driven parity-gate methodology, the 8-class taxonomy, the two-plot evaluation, and the verifier-guided acceleration search are inherited wholesale; the Python→Rust direction adds the reduction-order admissibility rule and the Rust-specific acceleration playbook. Case-study ports live under github.com/<org>/rs-*.
