docs(design): add maintainer decision layer roadmap#35
Draft
frankekn wants to merge 6 commits into pwrdrvr:main
Summary
Add a design roadmap for a maintainer decision layer above `ghcrawl`'s existing search and clustering pipeline.

This PR is still docs-only. It does not change runtime behavior.
What has changed since the original draft is that the roadmap is now backed by a concrete local benchmark. That benchmark matters because it tests the exact architectural claim this PR makes: clustering and semantic neighbors are useful as the recall layer, but maintainer-facing decision quality improves when a second-stage scoring layer sits above retrieval.
Why
`ghcrawl` already does repository-wide semantic grouping well. The remaining gap is maintainer-facing decision support: helping answer which nearby PR is the strongest base, which variant is likely superseded, and which semantic neighbors should stay excluded.

This PR does not introduce code or change current runtime behavior. It documents a clean additive direction so future feature work can converge on one architecture instead of growing as disconnected heuristics.
Changes
- `docs/designs/maintainer-decision-layer.md`
- `docs/PLAN.md` with a dedicated maintainer decision-analysis phase

Design Intent
The proposed direction is additive:
- One decision core reused by `analyze-pr`, `triage`, API, and future UI surfaces

Architecture
```mermaid
flowchart LR
  A[Existing ghcrawl pipeline] --> B[Neighbors / Clusters]
  B --> C[Candidate retrieval]
  C --> D[Decision analysis]
  D --> E[Maintainer outputs]
  E --> E1[analyze-pr]
  E --> E2[triage]
  E --> E3[API]
  style D fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
  style E fill:#3a2812,stroke:#fbbf24,stroke-width:2px,color:#ffffff
```

Initial Scoring Direction
One thing I wanted to make explicit in this roadmap is that the future decision layer should not be “more clustering”. It should be a second-stage score model that combines the strongest maintainer-facing signals:
In other words:
- `best_base`, `superseded_candidate`, and `excluded_neighbor` are the maintainer-facing result

That distinction is the main reason this proposal exists. The goal is not to treat cluster membership itself as the final answer. The goal is to reuse the current retrieval and clustering pipeline, then add a maintainer-oriented decision pass on top of it.
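As a sketch only (the field names, weights, and thresholds below are hypothetical illustrations, not part of the roadmap or any existing `ghcrawl` schema), the second-stage pass could combine the retrieval signals linearly into one decision score:

```python
from dataclasses import dataclass

# Hypothetical signal bundle for one candidate PR; field names are
# illustrative, not a proposed ghcrawl schema.
@dataclass
class CandidateSignals:
    semantic_similarity: float  # 0..1, from the existing embedding layer
    issue_overlap: float        # 0..1, shared linked issues
    path_relevance: float       # 0..1, touched-file overlap
    test_relevance: float       # 0..1, companion test coverage
    state_recency: float        # 0..1, open/recent beats closed/stale
    noise_penalty: float        # 0..1, higher for bot churn and noise

# Hypothetical starting weights; the roadmap leaves real values to
# fixture-driven tuning (e.g. starting from the claw-maintainer-tui profile).
WEIGHTS = {
    "semantic_similarity": 0.35,
    "issue_overlap": 0.20,
    "path_relevance": 0.15,
    "test_relevance": 0.10,
    "state_recency": 0.10,
    "noise_penalty": -0.20,  # negative weight: noise pushes the score down
}

def decision_score(s: CandidateSignals) -> float:
    """Second-stage score layered above retrieval: clustering only recalls
    candidates; this pass ranks them for maintainer-facing labels."""
    return sum(WEIGHTS[name] * getattr(s, name) for name in WEIGHTS)

def label(score: float, best_score: float) -> str:
    # Illustrative thresholds for the maintainer-facing buckets.
    if score == best_score and score > 0.5:
        return "best_base"
    if score > 0.3:
        return "superseded_candidate"
    return "excluded_neighbor"
```

The point of the sketch is the layering, not the numbers: the score stays internal, and only the labels surface to maintainers.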
```mermaid
flowchart LR
  A[semantic similarity]
  B[linked issue overlap]
  C[path relevance]
  D[companion test relevance]
  E[state and recency]
  F[noise penalty]
  A --> G[decision score]
  B --> G
  C --> G
  D --> G
  E --> G
  F --> G
  G --> H[best_base]
  G --> I[same_cluster_candidate]
  G --> J[superseded_candidate]
  G --> K[excluded_neighbor]
  style G fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
```

The updated design doc keeps the score model concrete at the level that matters for a roadmap:
Exact score weights are intentionally left to implementation and fixture-driven tuning rather than being treated as roadmap truth. The weighting approach itself is not hypothetical: `claw-maintainer-tui` already runs a weighted maintainer decision model today, and the first `ghcrawl` implementation can start from that profile before retuning against local fixtures.

For clarity, this is not a "stale local data only" proposal. V1 should analyze the latest local repository snapshot produced by the existing explicit `refresh` or `sync -> embed -> cluster` pipeline. Freshness remains an explicit operational step, while the scoring pass itself stays free of hidden live GitHub or OpenAI fetches.

Benchmark Report: Cluster Decision Effectiveness
Executive Summary
I ran a local benchmark to test the architectural claim behind this roadmap.
The benchmark compared:
- `clawlens` using local embeddings and a maintainer-oriented decision layer
- `ghcrawl` using its native OpenAI-based embedding path

Both systems were run against the same fixed `openclaw/openclaw` snapshot and the same exact recent PR/issue slice.

On that benchmark, the maintainer-oriented approach materially outperformed the upstream baseline on the metric that matters most for this roadmap: cluster-family decision effectiveness.
The headline result is cluster-family quality:
- `clawlens` mean F1: 86.5%
- `ghcrawl` mean F1: 25.1%

That is not a marginal improvement. It is a large gap in the maintainer task this roadmap is trying to support: given one PR, identify the other PRs that truly belong in the same family and rank them usefully.
Question This Benchmark Answers
This benchmark was designed to answer the following question:
Can a maintainer-oriented decision layer above retrieval and clustering produce better maintainer decisions than the current upstream baseline?
On this benchmark, the answer is yes.
Question This Benchmark Does Not Answer
This benchmark does not prove durable cluster identity stability across reruns over time.
In other words, it does not answer the separate question: will the same families keep stable identities across repeated reruns over time?
That is a different acceptance test and still needs its own explicit rerun-over-time validation.
Benchmark Setup
Repository and snapshot:
- `openclaw/openclaw`
- `origin/main` at `21c00165efd3d7f4a3e7f68aed9e6363da8cb7ea`

Scope:
- 1000 PRs
- 1000 issues

Systems compared:
- `clawlens`: local embeddings
- `ghcrawl`: native OpenAI embedding path

Command used:
Judged evaluation set:
- 6 judged queries total
- 1 semantic retrieval query
- 5 `related-pr` cluster-family queries compatible with the indexed slice

Why These Metrics Matter
1. Cluster-family mean precision
This measures how often the returned cluster members are actually correct.
Why it matters:
2. Cluster-family mean recall
This measures how many of the true family members the system successfully finds.
Why it matters:
3. Cluster-family mean F1
F1 combines precision and recall into a single quality score.
Why it matters:
4. Top-3 hit rate
This measures whether a correct answer appears near the top even if it is not ranked first.
Why it matters:
5. MRR
MRR measures how high the first correct result appears.
Why it matters:
6. Recall@10
This measures how much of the relevant set appears in the first 10 results.
Why it matters:
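The quality metrics above follow their standard definitions; as a reference sketch, each can be scored per query in a few lines (helper names are mine, not part of `ghcrawl` or the benchmark harness):

```python
def precision_recall_f1(returned: list[str], truth: set[str]) -> tuple[float, float, float]:
    """Cluster-family quality: share of returned members that are correct,
    share of true members found, and their harmonic mean (F1)."""
    hits = sum(1 for r in returned if r in truth)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def top_k_hit(ranked: list[str], truth: set[str], k: int = 3) -> bool:
    """Top-k hit rate: does any correct answer appear in the first k results?"""
    return any(r in truth for r in ranked[:k])

def mrr_contribution(ranked: list[str], truth: set[str]) -> float:
    """Per-query reciprocal rank: 1 / rank of the first correct result,
    0 if none is returned. MRR is the mean of this over all queries."""
    for rank, r in enumerate(ranked, start=1):
        if r in truth:
            return 1.0 / rank
    return 0.0

def recall_at_10(ranked: list[str], truth: set[str]) -> float:
    """Share of the relevant set that appears in the first 10 results."""
    return sum(1 for r in ranked[:10] if r in truth) / len(truth) if truth else 0.0
```

The benchmark's headline numbers are means of these per-query values over the 6 judged queries.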
7. Cold index duration
This measures how long it takes to build the initial searchable corpus.
Why it matters:
8. Warm refresh duration
This measures incremental update cost after the initial index exists.
Why it matters:
9. Peak RSS
This measures memory usage.
Why it matters:
10. Time to first semantic success
This measures how long it takes before the system can answer its first semantic query successfully.
Why it matters:
Results
Cluster-family quality
| System | Mean precision | Mean recall | Mean F1 |
| --- | --- | --- | --- |
| `clawlens` | 76.3% | 100.0% | 86.5% |
| `ghcrawl` | 40.0% | 20.0% | 25.1% |
| System | Top-1 hit rate | Top-3 hit rate | MRR | Recall@10 |
| --- | --- | --- | --- | --- |
| `clawlens` | 33.3% | 83.3% | 0.583 | 83.3% |
| `ghcrawl` | 33.3% | 33.3% | 0.333 | 16.7% |
Cold index:
- `clawlens`: 385,469 ms, 306 MB peak RSS
- `ghcrawl`: 741,542 ms, 454 MB peak RSS

Warm refresh:
- `clawlens`: 19,087 ms, 125 MB peak RSS
- `ghcrawl`: 733,668 ms, 410 MB peak RSS

Time to first semantic success:
- `clawlens`: 1,129,293 ms
- `ghcrawl`: 1,242,947 ms

Interpretation
If the evaluation criterion is cluster decision effectiveness, `clawlens` wins decisively on this benchmark.

The most important evidence is the cluster-family result:

- `clawlens` mean F1: 86.5%
- `ghcrawl` mean F1: 25.1%

That gap means the maintainer-oriented approach is substantially better at answering the maintainer question that actually matters.
This is not just a speed win or a resource win. It is a quality win on the decision surface itself.
The retrieval metrics reinforce the same conclusion:
- `clawlens` was much better at surfacing the right family within the top few results and across the first page of results

Operationally, the benchmark also shows that the local approach is practical:
Why This Matters For This Roadmap
This benchmark does not turn this docs PR into an implementation PR.
What it does do is provide concrete evidence that the architectural split proposed here is the right one: clustering and semantic neighbors as the recall layer, with a maintainer-oriented decision pass above retrieval for decision quality.
That is the central claim of this roadmap, and the benchmark supports it.
Next Implementation Direction
The benchmark in this PR supports the decision-layer direction, but it also clarifies an adjacent durability requirement for later implementation work.
The key follow-up is to separate durable family identity from run-local cluster rows:
- Keep `clusters.id` as a run-local storage id if useful
- Introduce `family_id` for maintainer-facing identity
- Preserve `family_id` across reruns when a family matches a previous snapshot
- Carry `family_id` through snapshot matching

This PR does not attempt to implement that contract. It only makes the intended layering explicit so that future implementation work can add durable family identity without collapsing retrieval, identity, and maintainer decision logic into one step.
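To make "preserve `family_id` through snapshot matching" concrete, here is one possible shape for that contract (the Jaccard-overlap rule and the 0.5 threshold are illustrative assumptions, not a committed design):

```python
import uuid

def match_family_ids(
    previous: dict[str, set[int]],      # family_id -> PR numbers, last snapshot
    current_families: list[set[int]],   # run-local clusters from this rerun
    threshold: float = 0.5,             # hypothetical overlap cutoff
) -> dict[str, set[int]]:
    """Assign durable family_ids: reuse a previous id when a new cluster
    sufficiently overlaps an old family, otherwise mint a fresh one.
    clusters.id stays run-local; only family_id is maintainer-facing."""
    def jaccard(a: set[int], b: set[int]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    assigned: dict[str, set[int]] = {}
    unused = dict(previous)
    for members in current_families:
        best_id, best_sim = None, 0.0
        for fid, old_members in unused.items():
            sim = jaccard(members, old_members)
            if sim > best_sim:
                best_id, best_sim = fid, sim
        if best_id is not None and best_sim >= threshold:
            assigned[best_id] = members   # durable identity survives the rerun
            del unused[best_id]           # each old family matches at most once
        else:
            assigned[str(uuid.uuid4())] = members  # genuinely new family
    return assigned
```

Whatever the final matching rule is, the layering is the point: retrieval produces run-local clusters, and a separate identity step decides which durable `family_id` each cluster carries.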
Delivery Path
```mermaid
flowchart TD
  P0[Phase 0<br/>fix current contracts]
  P1[Phase 1<br/>reusable decision core]
  P2[Phase 2<br/>analyze-pr]
  P3[Phase 3<br/>triage and API reuse]
  P4[Phase 4<br/>decision artifacts on run state]
  P0 --> P1 --> P2 --> P3 --> P4
```

Non-Goals
Testing
- `claw-maintainer-tui`