
docs(design): add maintainer decision layer roadmap#35

Draft
frankekn wants to merge 6 commits into pwrdrvr:main from frankekn:frank/docs-maintainer-decision-roadmap

Conversation


@frankekn frankekn commented Mar 27, 2026

Summary

Add a design roadmap for a maintainer decision layer above ghcrawl's existing search and clustering pipeline.

This PR is still docs-only. It does not change runtime behavior.

Since the original draft, the roadmap is now backed by a concrete local benchmark. That benchmark matters because it tests the exact architectural claim this PR makes: clustering and semantic neighbors work well as the recall layer, but maintainer-facing decision quality improves when a second-stage scoring layer sits above retrieval.

Why

ghcrawl already does repository-wide semantic grouping well. The remaining gap is maintainer-facing decision support: helping answer which nearby PR is the strongest base, which variant is likely superseded, and which semantic neighbors should stay excluded.

This PR does not introduce code or change current runtime behavior. It documents a clean additive direction so future feature work can converge on one architecture instead of growing as disconnected heuristics.

Changes

  • add docs/designs/maintainer-decision-layer.md
  • extend docs/PLAN.md with a dedicated maintainer decision-analysis phase

Design Intent

The proposed direction is additive:

  • keep the current search and cluster pipeline
  • reuse clusters and semantic neighbors as candidate retrieval
  • add a reusable decision-analysis layer above retrieval
  • expose that layer through analyze-pr, triage, API, and future UI surfaces
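To make the intended layering concrete, here is a minimal sketch of what a decision-analysis layer above retrieval could look like. All names (`Candidate`, `Decision`, `analyzeCandidates`) are hypothetical illustrations, not existing ghcrawl APIs, and the placeholder scoring stands in for the real fixture-tuned model:

```typescript
// Hypothetical sketch of the proposed split: retrieval produces candidates,
// and a separate decision layer scores them and assigns roles.

interface Candidate {
  prNumber: number;
  // Retrieval provenance stays separate from the decision role
  // assigned later, as the design doc proposes.
  source: "cluster" | "semantic_neighbor";
  similarity: number; // 0..1
}

interface Decision {
  prNumber: number;
  score: number;
  role: "best_base" | "same_cluster_candidate" | "excluded_neighbor";
}

// The decision layer only consumes retrieval output; it performs no
// hidden live GitHub or OpenAI fetches.
function analyzeCandidates(candidates: Candidate[]): Decision[] {
  const scored = candidates
    .map((c) => ({
      prNumber: c.prNumber,
      // Placeholder score: the real model combines several weighted signals.
      score: c.similarity,
    }))
    .sort((a, b) => b.score - a.score);

  return scored.map((s, i): Decision => ({
    ...s,
    role:
      i === 0
        ? "best_base"
        : s.score >= 0.5
          ? "same_cluster_candidate"
          : "excluded_neighbor",
  }));
}
```

The point of the sketch is the boundary: `analyzeCandidates` could be reused unchanged by analyze-pr, triage, the API, or a future UI, because it never reaches back into retrieval internals.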

Architecture

```mermaid
flowchart LR
    A[Existing ghcrawl pipeline] --> B[Neighbors / Clusters]
    B --> C[Candidate retrieval]
    C --> D[Decision analysis]
    D --> E[Maintainer outputs]

    E --> E1[analyze-pr]
    E --> E2[triage]
    E --> E3[API]

    style D fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
    style E fill:#3a2812,stroke:#fbbf24,stroke-width:2px,color:#ffffff
```

Initial Scoring Direction

One thing I wanted to make explicit in this roadmap is that the future decision layer should not be “more clustering”. It should be a second-stage score model that combines the strongest maintainer-facing signals:

  • semantic similarity
  • linked issue overlap
  • path relevance
  • companion test relevance
  • state / recency
  • unrelated churn penalty

In other words:

  • clusters and semantic neighbors are the recall layer
  • the weighted score model is the decision layer
  • explicit roles such as best_base, superseded_candidate, and excluded_neighbor are the maintainer-facing result

That distinction is the main reason this proposal exists. The goal is not to treat cluster membership itself as the final answer. The goal is to reuse the current retrieval and clustering pipeline, then add a maintainer-oriented decision pass on top of it.
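A weighted second-stage model over those signal families could be sketched as below. The feature names and weights here are illustrative placeholders only; the roadmap deliberately leaves real weights to fixture-driven tuning, and nothing here is taken from claw-maintainer-tui:

```typescript
// Hypothetical second-stage scoring model over the six signal families
// listed above. Weights are illustrative, not roadmap truth.

interface Signals {
  semanticSimilarity: number; // 0..1
  linkedIssueOverlap: number; // 0..1
  pathRelevance: number;      // 0..1
  testRelevance: number;      // 0..1, companion test relevance
  recency: number;            // 0..1, higher = fresher state
  unrelatedChurn: number;     // 0..1, higher = more noise
}

// Example weights only; real values come from fixture-driven tuning.
const WEIGHTS = {
  semanticSimilarity: 0.35,
  linkedIssueOverlap: 0.25,
  pathRelevance: 0.15,
  testRelevance: 0.1,
  recency: 0.1,
  unrelatedChurn: -0.15, // penalty term
};

function decisionScore(s: Signals): number {
  return (
    WEIGHTS.semanticSimilarity * s.semanticSimilarity +
    WEIGHTS.linkedIssueOverlap * s.linkedIssueOverlap +
    WEIGHTS.pathRelevance * s.pathRelevance +
    WEIGHTS.testRelevance * s.testRelevance +
    WEIGHTS.recency * s.recency +
    WEIGHTS.unrelatedChurn * s.unrelatedChurn
  );
}
```

The design point is that this function is deterministic over already-retrieved signals: cluster membership feeds the score but never is the score.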

```mermaid
flowchart LR
    A[semantic similarity]
    B[linked issue overlap]
    C[path relevance]
    D[companion test relevance]
    E[state and recency]
    F[noise penalty]

    A --> G[decision score]
    B --> G
    C --> G
    D --> G
    E --> G
    F --> G

    G --> H[best_base]
    G --> I[same_cluster_candidate]
    G --> J[superseded_candidate]
    G --> K[excluded_neighbor]

    style G fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
```

The updated design doc keeps the score model concrete at the level that matters for a roadmap:

  • explicit feature families
  • retrieval provenance separated from decision role
  • a deterministic latest-snapshot boundary for v1
  • adjacent decision tables later instead of coupling decisions into cluster snapshots
  • early fixture-based evaluation before tuning widens

Exact score weights are intentionally left to implementation and fixture-driven tuning rather than being treated as roadmap truth. The weighting approach itself is not hypothetical: claw-maintainer-tui already runs a weighted maintainer decision model today, and the first ghcrawl implementation can start from that profile before retuning against local fixtures.

For clarity, this is not a “stale local data only” proposal. v1 should analyze the latest local repository snapshot produced by the existing explicit refresh or sync -> embed -> cluster pipeline. Freshness remains an explicit operational step, while the scoring pass itself stays free of hidden live GitHub or OpenAI fetches.

Benchmark Report: Cluster Decision Effectiveness

Executive Summary

I ran a local benchmark to test the architectural claim behind this roadmap.

The benchmark compared:

  • clawlens using local embeddings and a maintainer-oriented decision layer
  • ghcrawl using its native OpenAI-based embedding path

Both systems were run against the same fixed openclaw/openclaw snapshot and the exact same recent PR/issue slice.

On that benchmark, the maintainer-oriented approach materially outperformed the upstream baseline on the metric that matters most for this roadmap: cluster-family decision effectiveness.

The headline result is cluster-family quality:

  • clawlens mean F1: 86.5%
  • ghcrawl mean F1: 25.1%

That is not a marginal improvement. It is a large gap in the maintainer task this roadmap is trying to support: given one PR, identify the other PRs that truly belong in the same family and rank them usefully.

Question This Benchmark Answers

This benchmark was designed to answer the following question:

Can a maintainer-oriented decision layer above retrieval and clustering produce better maintainer decisions than the current upstream baseline?

On this benchmark, the answer is yes.

Question This Benchmark Does Not Answer

This benchmark does not prove durable cluster identity stability across reruns over time.

In other words, it does not answer the separate question:

  • if clustering runs now, then new PRs merge, then clustering runs again later, do the visible cluster identities remain stable?

That is a different acceptance test and still needs its own explicit rerun-over-time validation.

Benchmark Setup

Repository and snapshot:

  • repo: openclaw/openclaw
  • snapshot ref: origin/main
  • snapshot sha: 21c00165efd3d7f4a3e7f68aed9e6363da8cb7ea

Scope:

  • recent slice of 1000 PRs
  • up to 1000 issues
  • both systems were forced onto the exact same scope

Systems compared:

  • clawlens: local embeddings
  • ghcrawl: native OpenAI embedding path

Command used:

```shell
pnpm clawlens benchmark compare \
  --repo openclaw/openclaw \
  --db /tmp/clawlens-benchmark-openclaw-slice-1000.sqlite \
  --ghcrawl-db /tmp/ghcrawl-benchmark-openclaw-slice-1000.sqlite \
  --ghcrawl-dir /Users/termtek/Documents/GitHub/oss/ghcrawl \
  --ghcrawl-embedding-baseline native \
  --snapshot-ref origin/main \
  --snapshot-sha 21c00165efd3d7f4a3e7f68aed9e6363da8cb7ea \
  --cold-runs 1 \
  --warm-runs 1 \
  --max-prs 1000 \
  --max-issues 1000
```

Judged evaluation set:

  • 6 judged queries total
  • 1 semantic retrieval query
  • 5 related-pr cluster-family queries compatible with the indexed slice

Why These Metrics Matter

1. Cluster-family mean precision

This measures how often the returned cluster members are actually correct.

Why it matters:

  • high precision means the system is not polluting a cluster with unrelated PRs
  • low precision means maintainers waste time inspecting false neighbors

2. Cluster-family mean recall

This measures how many of the true family members the system successfully finds.

Why it matters:

  • high recall means a maintainer can trust the system to surface the full family
  • low recall means real siblings are missed, which weakens triage and consolidation

3. Cluster-family mean F1

F1 combines precision and recall into a single quality score.

Why it matters:

  • clustering is only useful if it is both accurate and complete
  • F1 is the best single-number summary of actual cluster decision quality

4. Top-3 hit rate

This measures whether a correct answer appears near the top even if it is not ranked first.

Why it matters:

  • maintainer workflows are rarely “take only rank 1”
  • if the right siblings consistently appear in the top few results, the system is practically useful

5. MRR

MRR measures how high the first correct result appears.

Why it matters:

  • better rank placement means less scanning and faster confidence
  • this is a direct proxy for maintainer usability

6. Recall@10

This measures how much of the relevant set appears in the first 10 results.

Why it matters:

  • cluster decisions often depend on seeing the broader family, not just one match
  • a system with high recall@10 is better for “show me the cluster around this PR”

7. Cold index duration

This measures how long it takes to build the initial searchable corpus.

Why it matters:

  • if the system is too slow to build, it is harder to use in real local maintainer workflows

8. Warm refresh duration

This measures incremental update cost after the initial index exists.

Why it matters:

  • maintainers care about repeated reruns, not just first build
  • warm refresh is closer to day-to-day operational reality

9. Peak RSS

This measures memory usage.

Why it matters:

  • lower memory usage makes the system easier to run locally and easier to automate reliably

10. Time to first semantic success

This measures how long it takes before the system can answer its first semantic query successfully.

Why it matters:

  • this is the practical “when does semantic clustering become usable?” metric
  • especially important when local embedding work is deferred until first semantic use
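For concreteness, the retrieval-quality metrics above (precision, recall, F1, and the reciprocal-rank term behind MRR) can be illustrated on a toy ranked list. This is a generic worked example, not benchmark code; the PR numbers and relevant set are made up:

```typescript
// Standard definitions of the retrieval metrics used in this report,
// applied to one ranked result list against a known relevant family.

function precisionRecallF1(returned: number[], relevant: Set<number>) {
  const hits = returned.filter((id) => relevant.has(id)).length;
  const precision = returned.length ? hits / returned.length : 0;
  const recall = relevant.size ? hits / relevant.size : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Reciprocal rank of the first relevant result (0 if none appears).
// MRR is this value averaged over all judged queries.
function reciprocalRank(ranked: number[], relevant: Set<number>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Toy query: the true family is {101, 102, 103}; the system returned
// four PRs with one false neighbor ranked first and one sibling missed.
const relevant = new Set([101, 102, 103]);
const ranked = [250, 101, 102, 400];
```

On this toy list, precision is 2/4, recall is 2/3, so F1 is about 0.571, and the first correct result at rank 2 contributes 0.5 to MRR. Note that a mean F1 over many queries is the average of per-query F1 values, not the F1 of averaged precision and recall.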

Results

Cluster-family quality

  • clawlens
    • mean precision: 76.3%
    • mean recall: 100.0%
    • mean F1: 86.5%
  • ghcrawl
    • mean precision: 40.0%
    • mean recall: 20.0%
    • mean F1: 25.1%

Retrieval quality

  • clawlens
    • top-1 hit rate: 33.3%
    • top-3 hit rate: 83.3%
    • MRR: 0.583
    • recall@10: 83.3%
  • ghcrawl
    • top-1 hit rate: 33.3%
    • top-3 hit rate: 33.3%
    • MRR: 0.333
    • recall@10: 16.7%

Operational performance

  • cold index
    • clawlens: 385,469 ms, 306 MB
    • ghcrawl: 741,542 ms, 454 MB
  • warm refresh
    • clawlens: 19,087 ms, 125 MB
    • ghcrawl: 733,668 ms, 410 MB
  • time to first semantic success
    • clawlens: 1,129,293 ms
    • ghcrawl: 1,242,947 ms

Interpretation

If the evaluation criterion is cluster decision effectiveness, clawlens wins decisively on this benchmark.

The most important evidence is the cluster-family result:

  • clawlens mean F1: 86.5%
  • ghcrawl mean F1: 25.1%

That gap means the maintainer-oriented approach is substantially better at answering the maintainer question that actually matters:

  • which PRs belong together in the same fix family or cluster?

This is not just a speed win or a resource win. It is a quality win on the decision surface itself.

The retrieval metrics reinforce the same conclusion:

  • top-1 was tied
  • but clawlens was much better at surfacing the right family within the top few results and across the first page of results

Operationally, the benchmark also shows that the local approach is practical:

  • faster cold build
  • dramatically faster warm refresh
  • lower RAM usage
  • earlier first semantic success

Why This Matters For This Roadmap

This benchmark does not turn this docs PR into an implementation PR.

What it does do is provide concrete evidence that the architectural split proposed here is the right one:

  • clustering and neighbors are the recall layer
  • decision scoring is the maintainer-facing answer layer

That is the central claim of this roadmap, and the benchmark supports it.

Next Implementation Direction

The benchmark in this PR supports the decision-layer direction, but it also clarifies an adjacent durability requirement for later implementation work.

The key follow-up is to separate durable family identity from run-local cluster rows:

  • keep clusters.id as a run-local storage id if useful
  • introduce a visible durable family_id for maintainer-facing identity
  • inherit family_id across reruns when a family matches a previous snapshot
  • give linked-issue families a deterministic canonical identity early
  • let semantic-only families inherit prior family_id through snapshot matching
  • keep maintainer decision scoring above retrieval and identity rather than coupling it into storage

This PR does not attempt to implement that contract. It only makes the intended layering explicit so that future implementation work can add durable family identity without collapsing retrieval, identity, and maintainer decision logic into one step.
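One plausible shape for the "inherit family_id across reruns" step is overlap-based snapshot matching. The sketch below is an assumption for illustration only: the `Family` shape, the `inheritFamilyId` helper, and the Jaccard threshold are hypothetical, not part of this PR or the design doc:

```typescript
// Hypothetical sketch: a rerun family inherits a prior family_id when
// its membership overlaps a previous-snapshot family strongly enough;
// otherwise it gets a freshly minted id.

interface Family {
  familyId: string;
  members: Set<number>; // PR numbers
}

// Jaccard overlap between two member sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<number>, b: Set<number>): number {
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Pick the best-matching prior family above a threshold (0.5 is an
// arbitrary illustrative choice), else mint a new durable id.
function inheritFamilyId(
  members: Set<number>,
  prior: Family[],
  mintId: () => string,
  threshold = 0.5,
): string {
  let best: Family | null = null;
  let bestScore = 0;
  for (const f of prior) {
    const s = jaccard(members, f.members);
    if (s > bestScore) {
      bestScore = s;
      best = f;
    }
  }
  return best !== null && bestScore >= threshold ? best.familyId : mintId();
}
```

Under this sketch, adding one new PR to an existing three-member family keeps its durable id, while a disjoint family gets a new one — exactly the run-over-run stability property the follow-up acceptance test would need to validate.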

Delivery Path

```mermaid
flowchart TD
    P0[Phase 0<br/>fix current contracts]
    P1[Phase 1<br/>reusable decision core]
    P2[Phase 2<br/>analyze-pr]
    P3[Phase 3<br/>triage and API reuse]
    P4[Phase 4<br/>decision artifacts on run state]

    P0 --> P1 --> P2 --> P3 --> P4
```

Non-Goals

  • no cluster-model replacement
  • no storage redesign in the first iteration
  • no embedding backend changes
  • no runtime behavior change in this PR

Testing

  • docs-only change
  • manually reviewed markdown structure and Mermaid diagrams for GitHub rendering
  • benchmark evidence gathered separately from runtime implementation in claw-maintainer-tui
