Minor Performance optimizations by Yagth · Pull Request #461 · iCog-Labs-Dev/metta-moses

Yagth · 2026-06-11T07:21:50Z

Description

Stack of small, incremental optimizations to the MOSES hot path on PeTTaV1.
Each commit is a single isolated change with a measured rationale:

compareAndSwap rewritten from O(N²) (List.append) to O(N) (cons + reverse)
findAndReplace rewritten as a direct cons-recursive walk (drops the subtraction-atom + map-atom double pass and per-element reduce/2 dispatch)
expToMMap collapsed to a single union-atom call (was quadratic MultiMap.insert accumulation)
Native Prolog stubs (loaded via consult + register_fun) for getLiterals, getChildrenExp, replaceVarsWithTruth, isConsistentExp
Prolog specific memoization tabling on removeEmptyAND in the benchmark entry point
Prolog (cut) in hillClimbing/11 to release choice points before the tail-recursion
.gitignore: stop excluding *.pl so the new stubs file is tracked

The four Prolog stubs (getLiterals, getChildrenExp, replaceVarsWithTruth, isConsistentExp) collectively account for ~27 % wall on mux6, with replaceVarsWithTruth the single largest contribution (~13.5 %); they win by collapsing per-element reduce/2 dispatch from filter-atom/map-atom/foldl into native single-pass Prolog walks (and, for isConsistentExp, sidestep an improper-list crash in the MeTTa version). Combined effect on parity3 in CI-equivalent config: ~44 s → ~26 s wall.

Motivation and Context

parity3.log and a fresh mux6 --profile flagged a handful of predicates accounting for 40-plus % of run time: per-element reduce/2 dispatch in filter-atom / map-atom, quadratic MultiMap.insert in expToMMap, and a k-fold-over-vars in replaceVarsWithTruth. The stubs and rewrites target those directly without changing call semantics anywhere.

How Has This Been Tested?

Full unit-test sweep.
Boolean-reduct regression tests (rte{1,2,3}, cut-unnecessary-{and,or}, delete-inconsistent, zero-constraint-subsumption, helper-functions, reduce-to-elegance) green.
optimization/hillclimbing/test/cross-top-one-test.metta green (covers the rewritten compareAndSwap).
bscore-test.metta green (covers replaceVarsWithTruth semantics).
Benchmark: moses/tests/demo-problems-benchmark-test.metta on parity3 drops from ~44 s to ~26 s wall (single-run, same machine).

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
I have added tests to cover my changes.
All new and existing tests passed.

The accumulator was being built with List.append per step, which walks the growing acc on every iteration. Switched to cons-atom + a single reverse at the end. Profile data on parity3 before the change: List.append/3 in mergeInstance: 110,048 calls / 1.86 s (2.8%) After: down to ~1k calls / negligible time.

The commented-out (cut) on line 480 was previously a hand-rolled choice-point prune around the until/applyReduce fixpoint. Measured A/B on parity3 after the other RTE-level optimisations landed showed no measurable wall-time delta, so it stays commented. Comment refreshed to record that A/B result so the next reader doesn't re-try blindly.

…curse Mirrors the existing runMoses cut. hillClimbing/11 is the inner hill-climbing loop driver; on parity3 it's called 10 times at the outer scope but recurses many times internally, each iteration accumulating WAM choice points across multi-clause MeTTa functions in its body. A cut at the end of the let* (line 354), placed after the score-comparison binding and before the trailing if cascade that either returns or tail-recurses, commits the choice points so they aren't kept alive across iterations. Measured A/B on parity3: pre-cut baseline: 61.16 s with this cut: 57.15 s (-4.0 s, -6.6 %) Two Tier-2 cut candidates (loopCreateDeme, mergeDemes) were A/B'd and dropped because they hurt or were noise — see plan/keen-finding-hellman.md for the predictive rule (the function must accumulate enough choice points across deep iteration for cut at the top to release a meaningful chain).

The old body did (subtraction-atom $old $list) and then (map-atom $list $a (replace $old $new $a)) — two list walks where the second one dispatches a small `replace` function through reduce/2 per element. Switched the second pass to an explicit cons-atom recursion (farReplaceWalk) that does the equality check inline, avoiding the per-element MeTTa dispatch. A/B on mux6 (with the rest of the optimisations in place) showed this matches a native Prolog stub for the same predicate within noise (+1 % wall each way), so MeTTa stays for readability.

The old implementation walked the input pair list and called MultiMap.insert at every step, growing an accumulator. Each insert is O(n) on the accumulator size, so the whole loop is O(n²). The only comparator passed in production is discSpec< (knob-representation.metta:165), which is hardcoded to False — so MultiMap.insert always lands at the tail and the function is operationally a list append. Replaced with (union-atom $map $pairs), which is a Prolog builtin (append/3 underneath), making it O(n). Profile data (mux6, before this change): expToMMap/4 was 12.1 % of profiled time (17.9 s / 2,487 calls). The rewrite reduced that to <1 %.

We're about to land a Prolog shim file (rte-helpers-fast.pl) alongside the MeTTa source, loaded via consult from the benchmark entry point. Removing the catch-all *.pl exclusion so it's tracked. *.html stays in place.

Adds moses/tests/rte-helpers-fast.pl with stubs for two RTE accessors and the consult call in the benchmark entry point that loads them at runtime. The MeTTa source definitions in reduct/boolean-reduct/rte-helpers.metta use filter-atom + isLiterals/isChildren, which dispatches through reduce/2 once per child. The Prolog stubs do a single direct list walk with no reduce/2 round-trip. retractall + new clauses replaces the MeTTa-asserted versions after the imports complete (the consult fires AFTER the MeTTa import block). Both helpers (collect_lits/2, collect_kids/2) are pure Prolog, not registered with register_fun — they're internal recursion bodies only. mux6 A/B vs the MeTTa filter-atom originals: - getLiterals/2: +6.1 % wall - getChildrenExp/2: +7.3 % wall Two related accessor cases were tested and dropped from this stub file since A/B showed they were neutral or even slightly faster in MeTTa: - getLiteralChildren (the fused pair): MeTTa equal-or-better - findAndReplace via direct recursion (committed earlier) Bool returns from these stubs are lowercase Prolog atoms (true/false) because PeTTaV1's parser converts MeTTa True/False to lowercase at parse time (parser.pl:47-49).

This is the single biggest individual win in the stub set: +13.5 % wall on mux6 vs an equivalent single-pass MeTTa walker. The MeTTa source in scoring/bscore.metta:140 does (let $blList (List.zip $bList $lList) (List.foldl replaceVarWithTruth $boolExpr $blList)) — a foldl over the variable list where each step walks the entire tree replacing one variable. For k variables the tree is walked k times. On mux6 that totals 16.6 million elementary replaceVarWithTruth/3 calls and ~44% of profiled time. The stub builds an assoc once and walks the tree once with a per-leaf get_assoc lookup. Semantics matched against scoring/test/bscore-test.metta: - Multi-element expression: head is equality-checked against the assoc (no AND/OR wrap), tail recurses fully. - Single-element expression [X]: unwrap and recurse on X. - Bare AND/OR atom: wrap in a 1-element list (preserving the MeTTa original's quirky branch). - Other atoms: substitute via assoc or preserve. Helper predicates rvw_full/3 and rvw_head/3 are pure Prolog, not registered as MeTTa funs.

A/B on mux6 showed an optimised MeTTa short-circuit walker for isConsistentExp was within noise of the Prolog stub (~1 % wall), so the performance argument by itself is weak. The correctness argument isn't: the MeTTa walker errors on the improper-list case that the codebase actually exercises. deleteInconsistentHandle calls (concatTuple $dominantSet $guardSet) where $guardSet can be a bare Symbol when $current is one. concatTuple is union-atom, which is append/3 in Prolog; appending [B,C] and A yields the improper list [B,C|A]. PeTTaV1's get-metatype matches Expression via is_list/1, which only succeeds on PROPER lists, so the MeTTa walker has no matching clause for the improper-list tail and silently fails the deleteInconsistentHandle test case at line 13 of its test file. The Prolog stub handles this trivially with is_list/1 checks plus a catch-all that returns true (no negation pair possible for a bare atom or an improper list). Helper predicates has_neg_pair/1, has_neg_with/2, is_neg_pair/2 are pure Prolog, not registered as MeTTa funs.

removeEmptyAND/2 (reduct/boolean-reduct/cut-unnecessary-and.metta:102) is a recursive pure tree rewrite. Profile data on parity3 before tabling shows 4,350,433 calls totalling 16.2 s (24 % of profiled time, the largest single self-time hotspot at the time). Almost all of those 4.35M calls are deep intra-call recursion over the same small set of subtrees (literals, NOT-expressions, AND/OR shells), so SLG tabling collapses the call count by ~4.5× — measured down to 977,135 calls and 1.9 s of total time on the same problem. Loading lib_tabling via the (library …) import form and tabling the predicate at the benchmark entry point keeps the directive co-located with the consult of the Prolog stubs, so the perf scaffolding is in one place.

… sortDeme, any Tier 1+2 from the post-stack mux6 profile. Each stub displaces the MeTTa definition at consult time via retractall + register_fun. - setDifference/3: subtract via native memberchk (was hand-rolled O(nm) with reduce/2 dispatch). Was 4.2 % wall on mux6, now 0.1 %. - getGuardSet/2: head test in Prolog bypasses the if-decons-expr-custom primitive (5.7 % wall by itself). Was 7.2 %, now 0.6 %. - sortDeme/2: keysort over (-penalizedScore, cpxy) replaces selectionSort + sInstComparator chain. Was 5.7 %, now ~0 %. NaN penalizedScore sorted to the end to match cScoreExpr< semantics. - any/2: memberchk(true) replaces (once (is-member True ...)). Was 2.2 %, now 0.3 %. Combined mux6 wall: 52.96 s -> ~48 s (-10 %). Output bit-identical to baseline; helper-functions/RTE/instance regression tests pass. Also tried SLG tabling reduceToElegance/propagateNot/addAND on top of this — regressed mux6 to ~70 s, so dropped.

Yagth added 11 commits June 11, 2026 10:18

chore(deme): drop trailing whitespace in loopCreateDeme

8c318f9

chore: stop ignoring *.pl

21b3c07

We're about to land a Prolog shim file (rte-helpers-fast.pl) alongside the MeTTa source, loaded via consult from the benchmark entry point. Removing the catch-all *.pl exclusion so it's tracked. *.html stays in place.

Yagth changed the title ~~Ref/minor optimizations~~ Minor Performance optimizations Jun 11, 2026

Yagth marked this pull request as ready for review June 11, 2026 07:39

Yagth added 2 commits June 12, 2026 10:10

chore: Cleanup comments and unnecessary code blocks

d5e32c1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor Performance optimizations#461

Minor Performance optimizations#461
Yagth wants to merge 13 commits into
devfrom
ref/MinorOptimizations

Yagth commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yagth commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Yagth commented Jun 11, 2026 •

edited

Loading