Minor Performance optimizations#461
Open
Yagth wants to merge 13 commits into
Open
Conversation
The accumulator was being built with List.append per step, which walks the growing acc on every iteration. Switched to cons-atom + a single reverse at the end. Profile data on parity3 before the change: List.append/3 in mergeInstance: 110,048 calls / 1.86 s (2.8%) After: down to ~1k calls / negligible time.
The commented-out (cut) on line 480 was previously a hand-rolled choice-point prune around the until/applyReduce fixpoint. Measured A/B on parity3 after the other RTE-level optimisations landed showed no measurable wall-time delta, so it stays commented. Comment refreshed to record that A/B result so the next reader doesn't re-try blindly.
…curse Mirrors the existing runMoses cut. hillClimbing/11 is the inner hill-climbing loop driver; on parity3 it's called 10 times at the outer scope but recurses many times internally, each iteration accumulating WAM choice points across multi-clause MeTTa functions in its body. A cut at the end of the let* (line 354), placed after the score-comparison binding and before the trailing if cascade that either returns or tail-recurses, commits the choice points so they aren't kept alive across iterations. Measured A/B on parity3: pre-cut baseline: 61.16 s with this cut: 57.15 s (-4.0 s, -6.6 %) Two Tier-2 cut candidates (loopCreateDeme, mergeDemes) were A/B'd and dropped because they hurt or were noise — see plan/keen-finding-hellman.md for the predictive rule (the function must accumulate enough choice points across deep iteration for cut at the top to release a meaningful chain).
The old body did (subtraction-atom $old $list) and then (map-atom $list $a (replace $old $new $a)) — two list walks where the second one dispatches a small `replace` function through reduce/2 per element. Switched the second pass to an explicit cons-atom recursion (farReplaceWalk) that does the equality check inline, avoiding the per-element MeTTa dispatch. A/B on mux6 (with the rest of the optimisations in place) showed this matches a native Prolog stub for the same predicate within noise (+1 % wall each way), so MeTTa stays for readability.
The old implementation walked the input pair list and called MultiMap.insert at every step, growing an accumulator. Each insert is O(n) on the accumulator size, so the whole loop is O(n²). The only comparator passed in production is discSpec< (knob-representation.metta:165), which is hardcoded to False — so MultiMap.insert always lands at the tail and the function is operationally a list append. Replaced with (union-atom $map $pairs), which is a Prolog builtin (append/3 underneath), making it O(n). Profile data (mux6, before this change): expToMMap/4 was 12.1 % of profiled time (17.9 s / 2,487 calls). The rewrite reduced that to <1 %.
We're about to land a Prolog shim file (rte-helpers-fast.pl) alongside the MeTTa source, loaded via consult from the benchmark entry point. Removing the catch-all *.pl exclusion so it's tracked. *.html stays in place.
Adds moses/tests/rte-helpers-fast.pl with stubs for two RTE accessors and the consult call in the benchmark entry point that loads them at runtime. The MeTTa source definitions in reduct/boolean-reduct/rte-helpers.metta use filter-atom + isLiterals/isChildren, which dispatches through reduce/2 once per child. The Prolog stubs do a single direct list walk with no reduce/2 round-trip. retractall + new clauses replaces the MeTTa-asserted versions after the imports complete (the consult fires AFTER the MeTTa import block). Both helpers (collect_lits/2, collect_kids/2) are pure Prolog, not registered with register_fun — they're internal recursion bodies only. mux6 A/B vs the MeTTa filter-atom originals: - getLiterals/2: +6.1 % wall - getChildrenExp/2: +7.3 % wall Two related accessor cases were tested and dropped from this stub file since A/B showed they were neutral or even slightly faster in MeTTa: - getLiteralChildren (the fused pair): MeTTa equal-or-better - findAndReplace via direct recursion (committed earlier) Bool returns from these stubs are lowercase Prolog atoms (true/false) because PeTTaV1's parser converts MeTTa True/False to lowercase at parse time (parser.pl:47-49).
This is the single biggest individual win in the stub set: +13.5 % wall
on mux6 vs an equivalent single-pass MeTTa walker. The MeTTa source in
scoring/bscore.metta:140 does
(let $blList (List.zip $bList $lList)
(List.foldl replaceVarWithTruth $boolExpr $blList))
— a foldl over the variable list where each step walks the entire tree
replacing one variable. For k variables the tree is walked k times. On
mux6 that totals 16.6 million elementary replaceVarWithTruth/3 calls and
~44% of profiled time.
The stub builds an assoc once and walks the tree once with a per-leaf
get_assoc lookup. Semantics matched against
scoring/test/bscore-test.metta:
- Multi-element expression: head is equality-checked against the assoc
(no AND/OR wrap), tail recurses fully.
- Single-element expression [X]: unwrap and recurse on X.
- Bare AND/OR atom: wrap in a 1-element list (preserving the MeTTa
original's quirky branch).
- Other atoms: substitute via assoc or preserve.
Helper predicates rvw_full/3 and rvw_head/3 are pure Prolog, not
registered as MeTTa funs.
A/B on mux6 showed an optimised MeTTa short-circuit walker for isConsistentExp was within noise of the Prolog stub (~1 % wall), so the performance argument by itself is weak. The correctness argument isn't: the MeTTa walker errors on the improper-list case that the codebase actually exercises. deleteInconsistentHandle calls (concatTuple $dominantSet $guardSet) where $guardSet can be a bare Symbol when $current is one. concatTuple is union-atom, which is append/3 in Prolog; appending [B,C] and A yields the improper list [B,C|A]. PeTTaV1's get-metatype matches Expression via is_list/1, which only succeeds on PROPER lists, so the MeTTa walker has no matching clause for the improper-list tail and silently fails the deleteInconsistentHandle test case at line 13 of its test file. The Prolog stub handles this trivially with is_list/1 checks plus a catch-all that returns true (no negation pair possible for a bare atom or an improper list). Helper predicates has_neg_pair/1, has_neg_with/2, is_neg_pair/2 are pure Prolog, not registered as MeTTa funs.
removeEmptyAND/2 (reduct/boolean-reduct/cut-unnecessary-and.metta:102) is a recursive pure tree rewrite. Profile data on parity3 before tabling shows 4,350,433 calls totalling 16.2 s (24 % of profiled time, the largest single self-time hotspot at the time). Almost all of those 4.35M calls are deep intra-call recursion over the same small set of subtrees (literals, NOT-expressions, AND/OR shells), so SLG tabling collapses the call count by ~4.5× — measured down to 977,135 calls and 1.9 s of total time on the same problem. Loading lib_tabling via the (library …) import form and tabling the predicate at the benchmark entry point keeps the directive co-located with the consult of the Prolog stubs, so the perf scaffolding is in one place.
… sortDeme, any Tier 1+2 from the post-stack mux6 profile. Each stub displaces the MeTTa definition at consult time via retractall + register_fun. - setDifference/3: subtract via native memberchk (was hand-rolled O(nm) with reduce/2 dispatch). Was 4.2 % wall on mux6, now 0.1 %. - getGuardSet/2: head test in Prolog bypasses the if-decons-expr-custom primitive (5.7 % wall by itself). Was 7.2 %, now 0.6 %. - sortDeme/2: keysort over (-penalizedScore, cpxy) replaces selectionSort + sInstComparator chain. Was 5.7 %, now ~0 %. NaN penalizedScore sorted to the end to match cScoreExpr< semantics. - any/2: memberchk(true) replaces (once (is-member True ...)). Was 2.2 %, now 0.3 %. Combined mux6 wall: 52.96 s -> ~48 s (-10 %). Output bit-identical to baseline; helper-functions/RTE/instance regression tests pass. Also tried SLG tabling reduceToElegance/propagateNot/addAND on top of this — regressed mux6 to ~70 s, so dropped.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Stack of small, incremental optimizations to the MOSES hot path on PeTTaV1.
Each commit is a single isolated change with a measured rationale:
compareAndSwaprewritten from O(N²) (List.append) to O(N) (cons + reverse)findAndReplacerewritten as a direct cons-recursive walk (drops thesubtraction-atom+map-atomdouble pass and per-elementreduce/2dispatch)expToMMapcollapsed to a singleunion-atomcall (was quadraticMultiMap.insertaccumulation)consult+register_fun) forgetLiterals,getChildrenExp,replaceVarsWithTruth,isConsistentExpremoveEmptyANDin the benchmark entry point(cut)inhillClimbing/11to release choice points before the tail-recursion.gitignore: stop excluding*.plso the new stubs file is trackedThe four Prolog stubs (
getLiterals,getChildrenExp,replaceVarsWithTruth,isConsistentExp) collectively account for ~27 % wall on mux6, withreplaceVarsWithTruththe single largest contribution (~13.5 %); they win by collapsing per-elementreduce/2dispatch fromfilter-atom/map-atom/foldlinto native single-pass Prolog walks (and, forisConsistentExp, sidestep an improper-list crash in the MeTTa version). Combined effect on parity3 in CI-equivalent config: ~44 s → ~26 s wall.Motivation and Context
parity3.logand a fresh mux6--profileflagged a handful of predicates accounting for 40-plus % of run time: per-elementreduce/2dispatch infilter-atom/map-atom, quadraticMultiMap.insertinexpToMMap, and a k-fold-over-vars inreplaceVarsWithTruth. The stubs and rewrites target those directly without changing call semantics anywhere.How Has This Been Tested?
rte{1,2,3},cut-unnecessary-{and,or},delete-inconsistent,zero-constraint-subsumption,helper-functions,reduce-to-elegance) green.optimization/hillclimbing/test/cross-top-one-test.mettagreen (covers the rewrittencompareAndSwap).bscore-test.mettagreen (coversreplaceVarsWithTruthsemantics).moses/tests/demo-problems-benchmark-test.mettaon parity3 drops from ~44 s to ~26 s wall (single-run, same machine).Types of changes
Checklist: