Skip to content

Latest commit

 

History

History
114 lines (81 loc) · 11.1 KB

File metadata and controls

114 lines (81 loc) · 11.1 KB

Parity Taxonomy — 8 Algorithm Classes

Pick the class before writing any agent code. The class fixes the parity metric, which fixes the pass/fail threshold in manifest.yaml. The threshold is committed before agent work begins and is read-only during the agent loop.

The metric only ever sees numbers, so this taxonomy is language-agnostic — it is the same taxonomy omicverse-rebuildr uses for R→Python. What changes for Python→Rust is which error sources the deterministic tolerance must absorb (see below).

The table

# Algorithm class Parity criterion Default threshold Example Python packages
1 Deterministic numerical (3 sub-tiers — see below) element-wise: max abs err < tol (optionally rtol-scaled) standard: 1e-8; strict: 1e-13; bounded: 1e-6 numpy/scipy kernels, BBKNN distances, MAGIC operator
2 Stochastic numerical distributional: Kolmogorov–Smirnov ≤ τ or Wasserstein-1 ≤ τ KS-p ≥ 0.05 or W₁ ≤ 1% of scale MCMC draws, dropout-simulation posteriors
3 Combinatorial clustering label-invariant: ARI / NMI / Fowlkes–Mallows ARI ≥ 0.95 leidenalg, louvain, KMeans labels
4 Continuous embedding rotation-invariant: Procrustes similarity (1 − min‖.‖² after best rotation/translation/scaling) Procrustes ≥ 0.95 UMAP, t-SNE, PCA, harmonypy
5 Ranked output top-K Jaccard / Spearman correlation of the ranking top-50 Jaccard ≥ 0.8; Spearman ≥ 0.9 rank_genes_groups, HVG selection
6 Ordinal output (pseudotime) Pearson / Spearman correlation of per-element values Pearson ≥ 0.99 (treats ≥ 1 − 1e-12 as exact — pearsonr is itself f64-noisy) DPT, palantir pseudotime
7 Classification (binary / multi) label agreement / F1 Agreement ≥ 0.95 scrublet doublet calls, cell-type labels
8 Statistical inference rank correlation on −log10 p + top-k overlap Spearman ≥ 0.90 on −log10 p; top-50 Jaccard ≥ 0.7 diffxpy, scanpy DE, GSEA

Deterministic sub-tiers (class 1)

The class-1 threshold needs to absorb two independent error sources. This is where Python→Rust differs from R→Python — the error sources are different:

Source Typical magnitude
f64 rounding + BLAS divergence (numpy links one BLAS; Rust ndarray-linalg links another — OpenBLAS / MKL / Accelerate) eps · √N1e-14 to 1e-10
Parallel / SIMD float-reduction reorderingrayon par_iter().sum() and autovectorised SIMD reduce in a different order than a serial loop, and f64 + is non-associative `≈ n · eps · max
Any (B) bounded ε-approximation introduced by an Acceleration rewrite per-rewrite, typically 1e-9 to 1e-6 (sum of admitted rewrites, derived in MATH.md)

So a single fixed threshold is the wrong abstraction. Pick a sub-tier per port:

Sub-tier (manifest.yaml::algorithm_class) Default atol Default rtol Hard ceiling When to use
deterministic-strict 1e-13 1e-15 1e-13 element-wise / single-pass / fixed-order (serial) reduction. The Rust port must not reorder any sum.
deterministic (alias) / deterministic-standard 1e-8 1e-8 Default. One or two matmul / PCA; numpy-BLAS ↔ Rust-BLAS divergence is fine.
deterministic-bounded 1e-6 1e-6 Contains (B) ε-approximation rewrites including reordered parallel/SIMD reductions. MATH.md must derive Σ bound ≤ atol.

Hard ceiling rule: any deterministic-* threshold above 1e-6 is rejected by is_pass() — at that scale "deterministic" has lost meaning and the port should switch to ordinal (Pearson) or embedding (Procrustes) instead. This is non-negotiable; widening the gate is the cardinal sin the protocol forbids.

The reduction-order rule (Rust-specific): a serial, fixed-order reduction in Rust is (E) exact with respect to the Python reference if Python also reduced serially in that order. The moment you switch to rayon parallel reduction or rely on SIMD autovectorisation, the summation order changes and the result is no longer bit-identical — it becomes a (B) bounded rewrite with bound ≈ n · eps · max|xᵢ|, which must be derived in MATH.md and which forces the deterministic-bounded tier. Do not silently parallelise a deterministic-strict port.

Relative-tolerance mode: when output values span many orders of magnitude (p-values, abundance counts, eigenvalues), set parity_rtol in the manifest:

algorithm_class: deterministic-standard
parity_threshold: 1.0          # required when rtol > 0; pass iff returned-value < 1
parity_rtol: 1e-8              # error scaled by rtol·|reference|
parity_atol: 1e-10             # small absolute floor for values near zero

parity_deterministic(..., rtol=...) then returns max(|ref - cand| / (rtol·|ref| + tiny)), and parity_threshold caps that normalised quantity (typically 1.0).

Why ordinal treats 1.0 as ≥ 1 − 1e-12: scipy.stats.pearsonr accumulates f64 rounding internally. Even on bit-identical inputs the returned r is typically 0.9999999999999999, not literal 1.0. The is_pass helper subtracts 1e-12 from the ordinal threshold before the comparison so a perfect port doesn't fail by one ulp.

How to pick the class

Ask one question: what does the Python function return?

  • A vector / matrix of floats meant to be compared bit-by-bit → (1) Deterministic.
  • Samples from a posterior or any RNG-driven simulation → (2) Stochastic.
  • A vector of cluster IDs (integer labels with no canonical ordering) → (3) Clustering.
  • A coordinate matrix where the absolute basis is meaningless (PCA, UMAP, t-SNE) → (4) Embedding.
  • A ranked list of features (gene names, top-K) → (5) Ranked.
  • A continuous monotonic per-element ordering (pseudotime, latent time, z) → (6) Ordinal.
  • A binary or multi-class label per element (where the labels HAVE canonical meaning, e.g., "Doublet"/"Singlet") → (7) Classification.
  • A vector of p-values or effect sizes from a statistical test → (8) Inference.

If the Python function returns multiple outputs with different classes (e.g., embedding + clustering + scores), give each its own gate in manifest.yaml::outputs: and require all to pass.

Why class-aware (not uniform)

A naive "element-wise equality with tolerance" works only for class 1. For clustering, [0,1,1,0] and [1,0,0,1] encode the same partition — needs ARI. For embeddings, two coordinate matrices differing by an orthogonal rotation are the same up to basis choice — needs Procrustes. Worse in the Rust setting: any algorithm with internal parallelism produces a reproducible-but-different float result, so a uniform 1e-13 gate would false-fail every rayon-accelerated port. The class-aware gate (plus the reduction-order rule) keeps the gate meaningful.

The metrics, with the canonical implementation to use

Class Use this Notes
1 Deterministic np.allclose(a, b, rtol=0, atol=tol) + scipy.stats.pearsonr(a.ravel(), b.ravel()) Demand both: element-wise + Pearson
2 Stochastic scipy.stats.ks_2samp(a, b) or scipy.stats.wasserstein_distance(a, b) Seed-locked Python reference + seed-locked Rust RNG
3 Clustering sklearn.metrics.adjusted_rand_score(labels_py, labels_rs) Always ARI; NMI as secondary
4 Embedding scipy.spatial.procrustes(M_py, M_rs) returns (_, _, disparity); similarity = 1 - disparity Sign-flip ambiguity is absorbed
5 Ranked len(set(top_k_py) & set(top_k_rs)) / len(set(top_k_py) | set(top_k_rs)) + scipy.stats.spearmanr top-50 is the conventional K
6 Ordinal scipy.stats.pearsonr(pt_py, pt_rs) + spearmanr Pearson is primary; demand ≥0.99
7 Classification sklearn.metrics.f1_score(y_py, y_rs, average='binary' or 'macro') + (y_py == y_rs).mean() If labels are "Doublet" / "Singlet", keep that — don't relabel to 0/1
8 Inference scipy.stats.spearmanr(-np.log10(p_py), -np.log10(p_rs)) + top-K Jaccard Use −log10 p to weight small p-values more

All of these are implemented in engine/parity_metrics.py — import from there to avoid re-deriving thresholds across ports.

What about Bonferroni-on-thresholds?

Don't. Each port has one algorithm-class gate at one threshold, fixed before agent work. No multiple-comparison correction; no per-fixture adjustment. The gate either clears or it doesn't.

Stochastic packages — how to make them deterministic-enough to test

For class 2 / 3 / 7 / 8 algorithms with internal RNG:

  1. Pin the seed in manifest.yaml::seed: (default 42).
  2. In both py_reference_driver.py and _run_candidate.py, set the seed immediately before calling the algorithm. Python uses np.random.seed(42) and random.seed(42); the Rust side seeds its rand generator (StdRng::seed_from_u64(42)) with the same value.
  3. The Python RNG (NumPy PCG64 / Mersenne Twister) and the Rust rand RNG produce different streams — they will diverge. The comparison then degrades from element-wise to distributional (class 2) or label-invariant (class 3) — that's the point of the taxonomy. For a faithful element-wise port of an RNG-heavy algorithm, reimplement the exact same generator in Rust (e.g. mirror NumPy's PCG64) and document it in MATH.md.

When the gate fails in Step 3

Order of suspicion (Python→Rust edition, in decreasing frequency on real ports):

  1. Row-major vs the wrong axis: NumPy is row-major (C order); ndarray defaults to row-major too, but a .t() view or a Array::from_shape_vec with the wrong order silently transposes. Check shapes AND strides.
  2. f32 vs f64: a Rust literal 0.5 is f64, but 1.0f32 or an Array<f32> quietly halves your precision. Match the reference dtype exactly.
  3. Integer overflow / wrapping: Rust u32/i32 arithmetic wraps in release builds (panics in debug). Python ints are arbitrary precision. Use u64/i64 or checked arithmetic where the reference can exceed 2³¹.
  4. Reduction order: a rayon parallel sum or SIMD reduction reorders f64 addition → not bit-identical. Either fix the order (serial / deterministic fold) for (E), or declare (B) with the n·eps·max|x| bound.
  5. 0-based both, but off-by-one in ranges: Python range(1, 9) is [1..8]; Rust 1..9 is also [1..8] — but 1..=9 is inclusive. Watch the inclusive-range operator.
  6. BLAS backend default tolerance: ndarray-linalg's eig / svd / qr may order eigenvalues or sign eigenvectors differently from LAPACK-via-scipy. Sort + sign-fix before comparing (or use the embedding class).
  7. NaN / Inf propagation: NumPy silently produces nan; Rust f64 also propagates nan but comparisons differ (nan != nan), and partial_cmp returns None. Decide NaN handling explicitly.
  8. Sparse format: scipy CSR vs a hand-rolled Rust sparse representation — verify index ordering (sorted indices), duplicate handling, and implicit-zero semantics.

Ablate each in order. Do not loosen the gate.