
Minor Performance optimizations #461

Open
Yagth wants to merge 13 commits into dev from ref/MinorOptimizations


@Yagth Yagth commented Jun 11, 2026

Member

Description

Stack of small, incremental optimizations to the MOSES hot path on PeTTaV1.
Each commit is a single isolated change with a measured rationale:

  • compareAndSwap rewritten from O(N²) (List.append) to O(N) (cons + reverse)
  • findAndReplace rewritten as a direct cons-recursive walk (drops the subtraction-atom + map-atom double pass and per-element reduce/2 dispatch)
  • expToMMap collapsed to a single union-atom call (was quadratic MultiMap.insert accumulation)
  • Native Prolog stubs (loaded via consult + register_fun) for getLiterals, getChildrenExp, replaceVarsWithTruth, isConsistentExp
  • Prolog-specific memoization (tabling) on removeEmptyAND in the benchmark entry point
  • Prolog (cut) in hillClimbing/11 to release choice points before the tail recursion
  • .gitignore: stop excluding *.pl so the new stubs file is tracked

The four Prolog stubs (getLiterals, getChildrenExp, replaceVarsWithTruth, isConsistentExp) together save ~27 % of wall time on mux6, with replaceVarsWithTruth the largest single contribution (~13.5 %). They win by replacing the per-element reduce/2 dispatch of filter-atom/map-atom/foldl with native single-pass Prolog walks; the isConsistentExp stub also sidesteps an improper-list crash in the MeTTa version. Combined effect on parity3 in a CI-equivalent config: ~44 s → ~26 s wall.

Motivation and Context

parity3.log and a fresh mux6 --profile flagged a handful of predicates accounting for 40-plus % of run time: per-element reduce/2 dispatch in filter-atom / map-atom, quadratic MultiMap.insert in expToMMap, and a k-fold-over-vars in replaceVarsWithTruth. The stubs and rewrites target those directly without changing call semantics anywhere.

How Has This Been Tested?

  • Full unit-test sweep.
  • Boolean-reduct regression tests (rte{1,2,3}, cut-unnecessary-{and,or}, delete-inconsistent, zero-constraint-subsumption, helper-functions, reduce-to-elegance) green.
  • optimization/hillclimbing/test/cross-top-one-test.metta green (covers the rewritten compareAndSwap).
  • bscore-test.metta green (covers replaceVarsWithTruth semantics).
  • Benchmark: moses/tests/demo-problems-benchmark-test.metta on parity3 drops from ~44 s to ~26 s wall (single-run, same machine).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Yagth added 11 commits June 11, 2026 10:18
The accumulator was being built with List.append per step, which walks
the growing acc on every iteration. Switched to cons-atom + a single
reverse at the end.

Profile data on parity3 before the change:
  List.append/3 in mergeInstance: 110,048 calls / 1.86 s (2.8%)
After: down to ~1k calls / negligible time.
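
For readers less familiar with why the append-based accumulator is quadratic, here is a minimal Python sketch of the same idea on immutable cons lists. It is an analogue, not the MeTTa code from this commit: `append_end`, `build_with_append` and `build_with_cons_reverse` are hypothetical names, and the step counter only counts spine cells visited while building (the final reverse adds one more linear pass).

```python
# Cons-list model: None is the empty list, (head, tail) is a cons cell.

def to_py(cell):
    """Convert a cons list to a Python list for inspection."""
    out = []
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out


def append_end(cell, item, counter):
    """Append at the tail by rebuilding the whole spine: O(len) per call."""
    counter[0] += 1  # one step per cell visited (including the empty tail)
    if cell is None:
        return (item, None)
    head, tail = cell
    return (head, append_end(tail, item, counter))


def reverse(cell):
    """Reverse a cons list in a single O(n) pass."""
    out = None
    while cell is not None:
        head, cell = cell
        out = (head, out)
    return out


def build_with_append(items):
    """Accumulate by appending each item at the end: O(n^2) steps overall."""
    steps = [0]
    acc = None
    for x in items:
        acc = append_end(acc, x, steps)
    return acc, steps[0]


def build_with_cons_reverse(items):
    """Accumulate by consing onto the front, then reverse once: O(n)."""
    steps = 0
    acc = None
    for x in items:
        acc = (x, acc)  # O(1) cons onto the front
        steps += 1
    return reverse(acc), steps
```

Both builders return the same list; only the number of cells touched while accumulating differs.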
The commented-out (cut) on line 480 was previously a hand-rolled
choice-point prune around the until/applyReduce fixpoint. Measured A/B
on parity3 after the other RTE-level optimisations landed showed no
measurable wall-time delta, so it stays commented. Comment refreshed to
record that A/B result so the next reader doesn't re-try blindly.

…curse

Mirrors the existing runMoses cut. hillClimbing/11 is the inner
hill-climbing loop driver; on parity3 it's called 10 times at the outer
scope but recurses many times internally, each iteration accumulating
WAM choice points across multi-clause MeTTa functions in its body. A cut
at the end of the let* (line 354), placed after the score-comparison
binding and before the trailing if cascade that either returns or
tail-recurses, commits the choice points so they aren't kept alive
across iterations.

Measured A/B on parity3:
  pre-cut baseline:   61.16 s
  with this cut:      57.15 s   (-4.0 s, -6.6 %)

Two Tier-2 cut candidates (loopCreateDeme, mergeDemes) were A/B'd and
dropped because they hurt or were noise — see
plan/keen-finding-hellman.md for the predictive rule (the function must
accumulate enough choice points across deep iteration for cut at the top
to release a meaningful chain).

The old body did (subtraction-atom $old $list) and then (map-atom $list
$a (replace $old $new $a)) — two list walks where the second one
dispatches a small `replace` function through reduce/2 per element.
Switched the second pass to an explicit cons-atom recursion
(farReplaceWalk) that does the equality check inline, avoiding the
per-element MeTTa dispatch.

A/B on mux6 (with the rest of the optimisations in place) showed this
matches a native Prolog stub for the same predicate within noise (±1 %
wall), so MeTTa stays for readability.

The old implementation walked the input pair list and called
MultiMap.insert at every step, growing an accumulator. Each insert is
O(n) on the accumulator size, so the whole loop is O(n²).

The only comparator passed in production is discSpec<
(knob-representation.metta:165), which is hardcoded to False — so
MultiMap.insert always lands at the tail and the function is
operationally a list append. Replaced with (union-atom $map $pairs),
which is a Prolog builtin (append/3 underneath), making it O(n).

Profile data (mux6, before this change): expToMMap/4 was 12.1 % of
profiled time (17.9 s / 2,487 calls). The rewrite reduced that to <1 %.
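
The equivalence argument above (a comparator that is always False makes every ordered insert land at the tail) can be checked with a small Python sketch. This is an illustration under assumptions: `multimap_insert` is a hypothetical stand-in for an ordered insert, not the real MultiMap.insert, and `never_less` stands in for a comparator hardcoded to False.

```python
def multimap_insert(acc, pair, less):
    """Insert pair before the first element it is 'less' than, else at the tail."""
    for i, existing in enumerate(acc):
        if less(pair, existing):
            return acc[:i] + [pair] + acc[i:]
    return acc + [pair]


def never_less(_a, _b):
    """Stand-in for a comparator that always answers False."""
    return False


def build_by_insert(initial, pairs, less):
    """One ordered insert per pair: each insert scans the accumulator."""
    acc = list(initial)
    for p in pairs:
        acc = multimap_insert(acc, p, less)
    return acc


def build_by_concat(initial, pairs):
    """Single concatenation, valid only when every insert lands at the tail."""
    return list(initial) + list(pairs)
```

With `never_less`, the two builders agree on any input, which is the property the rewrite relies on. If a real ordering comparator were ever passed, the concatenation shortcut would no longer be equivalent.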
We're about to land a Prolog shim file (rte-helpers-fast.pl) alongside
the MeTTa source, loaded via consult from the benchmark entry point.
Removing the catch-all *.pl exclusion so it's tracked. *.html stays in
place.

Adds moses/tests/rte-helpers-fast.pl with stubs for two RTE accessors
and the consult call in the benchmark entry point that loads them at
runtime.

The MeTTa source definitions in reduct/boolean-reduct/rte-helpers.metta
use filter-atom + isLiterals/isChildren, which dispatches through
reduce/2 once per child. The Prolog stubs do a single direct list walk
with no reduce/2 round-trip. retractall plus the new clauses replace the
MeTTa-asserted versions once the imports complete (the consult fires
after the MeTTa import block).

Both helpers (collect_lits/2, collect_kids/2) are pure Prolog, not
registered with register_fun — they're internal recursion bodies only.

mux6 A/B vs the MeTTa filter-atom originals (wall-time improvement from
the Prolog stub):
  - getLiterals/2:    6.1 % wall
  - getChildrenExp/2: 7.3 % wall

Two related accessor cases were tested and dropped from this stub file
since A/B showed they were neutral or even slightly faster in MeTTa:
  - getLiteralChildren (the fused pair): MeTTa equal-or-better
  - findAndReplace via direct recursion (committed earlier)

Bool returns from these stubs are lowercase Prolog atoms (true/false)
because PeTTaV1's parser converts MeTTa True/False to lowercase at parse
time (parser.pl:47-49).

This is the single biggest individual win in the stub set: a 13.5 % wall
improvement on mux6 vs an equivalent single-pass MeTTa walker. The MeTTa
source in scoring/bscore.metta:140 does

   (let $blList (List.zip $bList $lList)
        (List.foldl replaceVarWithTruth $boolExpr $blList))

— a foldl over the variable list where each step walks the entire tree
replacing one variable. For k variables the tree is walked k times. On
mux6 that totals 16.6 million elementary replaceVarWithTruth/3 calls and
~44% of profiled time.

The stub builds an assoc once and walks the tree once with a per-leaf
get_assoc lookup. Semantics matched against
scoring/test/bscore-test.metta:
  - Multi-element expression: head is equality-checked against the assoc
(no AND/OR wrap), tail recurses fully.
  - Single-element expression [X]: unwrap and recurse on X.
  - Bare AND/OR atom: wrap in a 1-element list (preserving the MeTTa
original's quirky branch).
  - Other atoms: substitute via assoc or preserve.

Helper predicates rvw_full/3 and rvw_head/3 are pure Prolog, not
registered as MeTTa funs.
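
As a rough illustration of the k-pass versus single-pass difference described above, the following Python sketch substitutes truth values into a nested expression both ways. It is deliberately simplified and is an assumption-laden analogue: it uses a dict where the stub uses an assoc, the function names are hypothetical, and it does not reproduce the head, single-element, or bare AND/OR special cases listed above.

```python
def replace_one(tree, var, value):
    """Walk the whole tree, replacing a single variable."""
    if isinstance(tree, list):
        return [replace_one(t, var, value) for t in tree]
    return value if tree == var else tree


def replace_k_pass(tree, bindings):
    """Fold over the bindings: one full tree walk per variable."""
    for var, value in bindings:
        tree = replace_one(tree, var, value)
    return tree


def replace_single_pass(tree, table):
    """Build the lookup table once, then walk the tree once."""
    if isinstance(tree, list):
        return [replace_single_pass(t, table) for t in tree]
    return table.get(tree, tree)  # substitute if bound, else keep the atom


expr = ["AND", "A", ["OR", ["NOT", "B"], "C"], "D"]
bindings = [("A", True), ("B", False), ("C", True)]
result = replace_single_pass(expr, dict(bindings))
```

The two versions agree here because the substituted values are truth values and never variable names themselves; if a replacement could be another bound variable, sequential passes and a single lookup pass would differ.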
A/B on mux6 showed an optimised MeTTa short-circuit walker for
isConsistentExp was within noise of the Prolog stub (~1 % wall), so the
performance argument by itself is weak. The correctness argument isn't:
the MeTTa walker errors on the improper-list case that the codebase
actually exercises.

deleteInconsistentHandle calls (concatTuple $dominantSet $guardSet)
where $guardSet can be a bare Symbol when $current is one. concatTuple
is union-atom, which is append/3 in Prolog; appending [B,C] and A yields
the improper list [B,C|A]. PeTTaV1's get-metatype matches Expression via
is_list/1, which only succeeds on PROPER lists, so the MeTTa walker has
no matching clause for the improper-list tail and silently fails the
deleteInconsistentHandle test case at line 13 of its test file.

The Prolog stub handles this trivially with is_list/1 checks plus a
catch-all that returns true (no negation pair possible for a bare atom
or an improper list).

Helper predicates has_neg_pair/1, has_neg_with/2, is_neg_pair/2 are pure
Prolog, not registered as MeTTa funs.
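
The improper-list case is easy to reproduce outside Prolog. The sketch below models cons cells as Python tuples to show how an append whose second argument is a bare atom produces a structure that a proper-list check rejects. The names (`from_py`, `append`, `is_proper_list`) are illustrative only and mirror, rather than call, append/3 and is_list/1.

```python
NIL = None  # the empty list


def from_py(items, tail=NIL):
    """Build a cons list (head, tail) from a Python list."""
    cell = tail
    for x in reversed(items):
        cell = (x, cell)
    return cell


def append(xs, ys):
    """Copy xs and attach ys as the tail without checking that ys is a list."""
    if xs is NIL:
        return ys
    head, tail = xs
    return (head, append(tail, ys))


def is_proper_list(term):
    """True only if following the tails ends in NIL."""
    while isinstance(term, tuple):
        _, term = term
    return term is NIL


improper = append(from_py(["B", "C"]), "A")        # analogue of [B,C|A]
proper = append(from_py(["B", "C"]), from_py(["A"]))  # analogue of [B,C,A]
```

Here `is_proper_list(improper)` is False while `is_proper_list(proper)` is True, so any walker that only has a clause for proper lists has nothing to match on the first value.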
removeEmptyAND/2 (reduct/boolean-reduct/cut-unnecessary-and.metta:102)
is a recursive pure tree rewrite. Profile data on parity3 before tabling
shows 4,350,433 calls totalling 16.2 s (24 % of profiled time, the
largest single self-time hotspot at the time).

Almost all of those 4.35M calls are deep intra-call recursion over the
same small set of subtrees (literals, NOT-expressions, AND/OR shells),
so SLG tabling collapses the call count by ~4.5× — measured down to
977,135 calls and 1.9 s of total time on the same problem.

Loading lib_tabling via the (library …) import form and tabling the
predicate at the benchmark entry point keeps the directive co-located
with the consult of the Prolog stubs, so the perf scaffolding is in one
place.
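
To show why caching helps when the same small subtrees recur, here is a Python sketch that memoizes a recursive tree rewrite and counts calls. It is an analogue under assumptions: `functools.lru_cache` is plain memoization and much simpler than SLG tabling, and the rewrite rule (drop empty `("AND",)` children) is a hypothetical simplification, not the actual removeEmptyAND definition.

```python
from functools import lru_cache

calls = {"plain": 0, "memo": 0}


def remove_empty_and(tree):
    """Recursively drop empty ("AND",) children; no caching."""
    calls["plain"] += 1
    if not isinstance(tree, tuple):
        return tree
    kids = tuple(remove_empty_and(k) for k in tree[1:])
    kids = tuple(k for k in kids if k != ("AND",))
    return (tree[0],) + kids


@lru_cache(maxsize=None)
def remove_empty_and_memo(tree):
    """Same rewrite, but each distinct subtree is computed only once."""
    calls["memo"] += 1  # counted only on a cache miss
    if not isinstance(tree, tuple):
        return tree
    kids = tuple(remove_empty_and_memo(k) for k in tree[1:])
    kids = tuple(k for k in kids if k != ("AND",))
    return (tree[0],) + kids


# A tree in which the same subtree appears three times.
SUB = ("OR", ("NOT", "A"), ("AND",), "B")
TREE = ("AND", SUB, SUB, ("OR", SUB, "C"))
```

On `TREE` the uncached version visits every occurrence of `SUB`, while the cached version computes it once and reuses the result. The trade-off is table memory and lookup cost, which is consistent with tabling helping some predicates and hurting others.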
@Yagth Yagth changed the title Ref/minor optimizations Minor Performance optimizations Jun 11, 2026
@Yagth Yagth marked this pull request as ready for review June 11, 2026 07:39
Yagth added 2 commits June 12, 2026 10:10
… sortDeme, any

Tier 1+2 from the post-stack mux6 profile. Each stub displaces the MeTTa
definition at consult time via retractall + register_fun.

- setDifference/3: subtract via native memberchk (was hand-rolled O(nm)
  with reduce/2 dispatch). Was 4.2 % wall on mux6, now 0.1 %.
- getGuardSet/2: head test in Prolog bypasses the if-decons-expr-custom
  primitive (5.7 % wall by itself). Was 7.2 %, now 0.6 %.
- sortDeme/2: keysort over (-penalizedScore, cpxy) replaces selectionSort
  + sInstComparator chain. Was 5.7 %, now ~0 %. NaN penalizedScore sorted
  to the end to match cScoreExpr< semantics.
- any/2: memberchk(true) replaces (once (is-member True ...)). Was 2.2 %,
  now 0.3 %.

Combined mux6 wall: 52.96 s -> ~48 s (-10 %). Output bit-identical to
baseline; helper-functions/RTE/instance regression tests pass.

Also tried SLG tabling on reduceToElegance/propagateNot/addAND on top of
this; it regressed mux6 to ~70 s, so it was dropped.
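
The sortDeme key described above can be sketched in Python as follows. This is an illustration under assumptions: instances are modeled as hypothetical `(penalized_score, complexity)` tuples, higher scores sort first with ties broken by lower complexity, and NaN scores are pushed to the end via an extra leading key component rather than whatever the Prolog stub actually does.

```python
import math


def sort_deme(instances):
    """Sort (penalized_score, complexity) tuples: best score first, NaN last."""
    def key(inst):
        score, cpxy = inst
        if math.isnan(score):
            return (1, 0.0, cpxy)   # NaN bucket sorts after every real score
        return (0, -score, cpxy)    # negate so that higher scores come first
    return sorted(instances, key=key)


deme = [(0.5, 3), (float("nan"), 1), (0.9, 7), (0.5, 2)]
ordered = sort_deme(deme)
```

A single keyed sort like this replaces a comparator-driven selection sort, taking the cost from O(n²) comparisons to O(n log n) key comparisons.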