
docs(design): add maintainer decision layer roadmap#35

Draft
frankekn wants to merge 6 commits into pwrdrvr:main from frankekn:frank/docs-maintainer-decision-roadmap

Conversation


@frankekn frankekn commented Mar 27, 2026

Summary

Add a design roadmap for a maintainer decision layer above ghcrawl's existing search and clustering pipeline.

This PR is still docs-only. It does not change runtime behavior.

Since the original draft, the roadmap is now backed by a concrete local benchmark. That benchmark matters because it tests the exact architectural claim this PR makes: clustering and semantic neighbors work well as the recall layer, but maintainer-facing decision quality improves when a second-stage scoring layer sits above retrieval.

Why

ghcrawl already does repository-wide semantic grouping well. The remaining gap is maintainer-facing decision support: helping answer which nearby PR is the strongest base, which variant is likely superseded, and which semantic neighbors should stay excluded.

This PR does not introduce code or change current runtime behavior. It documents a clean additive direction so future feature work can converge on one architecture instead of growing as disconnected heuristics.

Changes

  • add docs/designs/maintainer-decision-layer.md
  • extend docs/PLAN.md with a dedicated maintainer decision-analysis phase

Design Intent

The proposed direction is additive:

  • keep the current search and cluster pipeline
  • reuse clusters and semantic neighbors as candidate retrieval
  • add a reusable decision-analysis layer above retrieval
  • expose that layer through analyze-pr, triage, API, and future UI surfaces
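To make the intended layering concrete, here is a minimal sketch of what a decision-analysis layer above retrieval could look like. All names (`Candidate`, `Decision`, `analyzeCandidates`) are hypothetical illustrations, not existing ghcrawl APIs, and the placeholder scoring stands in for the real fixture-tuned model:

```typescript
// Hypothetical sketch of the proposed split: retrieval produces candidates,
// and a separate decision layer scores them and assigns roles.

interface Candidate {
  prNumber: number;
  // Retrieval provenance stays separate from the decision role
  // assigned later, as the design doc proposes.
  source: "cluster" | "semantic_neighbor";
  similarity: number; // 0..1
}

interface Decision {
  prNumber: number;
  score: number;
  role: "best_base" | "same_cluster_candidate" | "excluded_neighbor";
}

// The decision layer only consumes retrieval output; it performs no
// hidden live GitHub or OpenAI fetches.
function analyzeCandidates(candidates: Candidate[]): Decision[] {
  const scored = candidates
    .map((c) => ({
      prNumber: c.prNumber,
      // Placeholder score: the real model combines several weighted signals.
      score: c.similarity,
    }))
    .sort((a, b) => b.score - a.score);

  return scored.map((s, i): Decision => ({
    ...s,
    role:
      i === 0
        ? "best_base"
        : s.score >= 0.5
          ? "same_cluster_candidate"
          : "excluded_neighbor",
  }));
}
```

The point of the sketch is the boundary: `analyzeCandidates` could be reused unchanged by analyze-pr, triage, the API, or a future UI, because it never reaches back into retrieval internals.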

Architecture

```mermaid
flowchart LR
    A[Existing ghcrawl pipeline] --> B[Neighbors / Clusters]
    B --> C[Candidate retrieval]
    C --> D[Decision analysis]
    D --> E[Maintainer outputs]

    E --> E1[analyze-pr]
    E --> E2[triage]
    E --> E3[API]

    style D fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
    style E fill:#3a2812,stroke:#fbbf24,stroke-width:2px,color:#ffffff
```

Initial Scoring Direction

One thing I wanted to make explicit in this roadmap is that the future decision layer should not be “more clustering”. It should be a second-stage score model that combines the strongest maintainer-facing signals:

  • semantic similarity
  • linked issue overlap
  • path relevance
  • companion test relevance
  • state / recency
  • unrelated churn penalty

In other words:

  • clusters and semantic neighbors are the recall layer
  • the weighted score model is the decision layer
  • explicit roles such as best_base, superseded_candidate, and excluded_neighbor are the maintainer-facing result

That distinction is the main reason this proposal exists. The goal is not to treat cluster membership itself as the final answer. The goal is to reuse the current retrieval and clustering pipeline, then add a maintainer-oriented decision pass on top of it.
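A weighted second-stage model over those signal families could be sketched as below. The feature names and weights here are illustrative placeholders only; the roadmap deliberately leaves real weights to fixture-driven tuning, and nothing here is taken from claw-maintainer-tui:

```typescript
// Hypothetical second-stage scoring model over the six signal families
// listed above. Weights are illustrative, not roadmap truth.

interface Signals {
  semanticSimilarity: number; // 0..1
  linkedIssueOverlap: number; // 0..1
  pathRelevance: number;      // 0..1
  testRelevance: number;      // 0..1, companion test relevance
  recency: number;            // 0..1, higher = fresher state
  unrelatedChurn: number;     // 0..1, higher = more noise
}

// Example weights only; real values come from fixture-driven tuning.
const WEIGHTS = {
  semanticSimilarity: 0.35,
  linkedIssueOverlap: 0.25,
  pathRelevance: 0.15,
  testRelevance: 0.1,
  recency: 0.1,
  unrelatedChurn: -0.15, // penalty term
};

function decisionScore(s: Signals): number {
  return (
    WEIGHTS.semanticSimilarity * s.semanticSimilarity +
    WEIGHTS.linkedIssueOverlap * s.linkedIssueOverlap +
    WEIGHTS.pathRelevance * s.pathRelevance +
    WEIGHTS.testRelevance * s.testRelevance +
    WEIGHTS.recency * s.recency +
    WEIGHTS.unrelatedChurn * s.unrelatedChurn
  );
}
```

The design point is that this function is deterministic over already-retrieved signals: cluster membership feeds the score but never is the score.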

```mermaid
flowchart LR
    A[semantic similarity]
    B[linked issue overlap]
    C[path relevance]
    D[companion test relevance]
    E[state and recency]
    F[noise penalty]

    A --> G[decision score]
    B --> G
    C --> G
    D --> G
    E --> G
    F --> G

    G --> H[best_base]
    G --> I[same_cluster_candidate]
    G --> J[superseded_candidate]
    G --> K[excluded_neighbor]

    style G fill:#173042,stroke:#7dd3fc,stroke-width:2px,color:#ffffff
```

The updated design doc keeps the score model concrete at the level that matters for a roadmap:

  • explicit feature families
  • retrieval provenance separated from decision role
  • a deterministic latest-snapshot boundary for v1
  • adjacent decision tables later instead of coupling decisions into cluster snapshots
  • early fixture-based evaluation before tuning widens

Exact score weights are intentionally left to implementation and fixture-driven tuning rather than being treated as roadmap truth. The weighting approach itself is not hypothetical: claw-maintainer-tui already runs a weighted maintainer decision model today, and the first ghcrawl implementation can start from that profile before retuning against local fixtures.

For clarity, this is not a “stale local data only” proposal. v1 should analyze the latest local repository snapshot produced by the existing explicit refresh or sync -> embed -> cluster pipeline. Freshness remains an explicit operational step, while the scoring pass itself stays free of hidden live GitHub or OpenAI fetches.

Benchmark Report: Cluster Decision Effectiveness

Executive Summary

I ran a local benchmark to test the architectural claim behind this roadmap.

The benchmark compared:

  • clawlens using local embeddings and a maintainer-oriented decision layer
  • ghcrawl using its native OpenAI-based embedding path

Both systems were run against the same fixed openclaw/openclaw snapshot and the exact same recent PR/issue slice.

On that benchmark, the maintainer-oriented approach materially outperformed the upstream baseline on the metric that matters most for this roadmap: cluster-family decision effectiveness.

The headline result is cluster-family quality:

  • clawlens mean F1: 86.5%
  • ghcrawl mean F1: 25.1%

That is not a marginal improvement. It is a large gap in the maintainer task this roadmap is trying to support: given one PR, identify the other PRs that truly belong in the same family and rank them usefully.

Question This Benchmark Answers

This benchmark was designed to answer the following question:

Can a maintainer-oriented decision layer above retrieval and clustering produce better maintainer decisions than the current upstream baseline?

On this benchmark, the answer is yes.

Question This Benchmark Does Not Answer

This benchmark does not prove durable cluster identity stability across reruns over time.

In other words, it does not answer the separate question:

  • if clustering runs now, then new PRs merge, then clustering runs again later, do the visible cluster identities remain stable?

That is a different acceptance test and still needs its own explicit rerun-over-time validation.

Benchmark Setup

Repository and snapshot:

  • repo: openclaw/openclaw
  • snapshot ref: origin/main
  • snapshot sha: 21c00165efd3d7f4a3e7f68aed9e6363da8cb7ea

Scope:

  • recent slice of 1000 PRs
  • up to 1000 issues
  • both systems were forced onto the exact same scope

Systems compared:

  • clawlens: local embeddings
  • ghcrawl: native OpenAI embedding path

Command used:

```shell
pnpm clawlens benchmark compare \
  --repo openclaw/openclaw \
  --db /tmp/clawlens-benchmark-openclaw-slice-1000.sqlite \
  --ghcrawl-db /tmp/ghcrawl-benchmark-openclaw-slice-1000.sqlite \
  --ghcrawl-dir /Users/termtek/Documents/GitHub/oss/ghcrawl \
  --ghcrawl-embedding-baseline native \
  --snapshot-ref origin/main \
  --snapshot-sha 21c00165efd3d7f4a3e7f68aed9e6363da8cb7ea \
  --cold-runs 1 \
  --warm-runs 1 \
  --max-prs 1000 \
  --max-issues 1000
```

Judged evaluation set:

  • 6 judged queries total
  • 1 semantic retrieval query
  • 5 related-pr cluster-family queries compatible with the indexed slice

Why These Metrics Matter

1. Cluster-family mean precision

This measures how often the returned cluster members are actually correct.

Why it matters:

  • high precision means the system is not polluting a cluster with unrelated PRs
  • low precision means maintainers waste time inspecting false neighbors

2. Cluster-family mean recall

This measures how many of the true family members the system successfully finds.

Why it matters:

  • high recall means a maintainer can trust the system to surface the full family
  • low recall means real siblings are missed, which weakens triage and consolidation

3. Cluster-family mean F1

F1 combines precision and recall into a single quality score.

Why it matters:

  • clustering is only useful if it is both accurate and complete
  • F1 is the best single-number summary of actual cluster decision quality

4. Top-3 hit rate

This measures whether a correct answer appears near the top even if it is not ranked first.

Why it matters:

  • maintainer workflows are rarely “take only rank 1”
  • if the right siblings consistently appear in the top few results, the system is practically useful

5. MRR

MRR measures how high the first correct result appears.

Why it matters:

  • better rank placement means less scanning and faster confidence
  • this is a direct proxy for maintainer usability

6. Recall@10

This measures how much of the relevant set appears in the first 10 results.

Why it matters:

  • cluster decisions often depend on seeing the broader family, not just one match
  • a system with high recall@10 is better for “show me the cluster around this PR”

7. Cold index duration

This measures how long it takes to build the initial searchable corpus.

Why it matters:

  • if the system is too slow to build, it is harder to use in real local maintainer workflows

8. Warm refresh duration

This measures incremental update cost after the initial index exists.

Why it matters:

  • maintainers care about repeated reruns, not just first build
  • warm refresh is closer to day-to-day operational reality

9. Peak RSS

This measures memory usage.

Why it matters:

  • lower memory usage makes the system easier to run locally and easier to automate reliably

10. Time to first semantic success

This measures how long it takes before the system can answer its first semantic query successfully.

Why it matters:

  • this is the practical “when does semantic clustering become usable?” metric
  • especially important when local embedding work is deferred until first semantic use
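For concreteness, the retrieval-quality metrics above (precision, recall, F1, and the reciprocal-rank term behind MRR) can be illustrated on a toy ranked list. This is a generic worked example, not benchmark code; the PR numbers and relevant set are made up:

```typescript
// Standard definitions of the retrieval metrics used in this report,
// applied to one ranked result list against a known relevant family.

function precisionRecallF1(returned: number[], relevant: Set<number>) {
  const hits = returned.filter((id) => relevant.has(id)).length;
  const precision = returned.length ? hits / returned.length : 0;
  const recall = relevant.size ? hits / relevant.size : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Reciprocal rank of the first relevant result (0 if none appears).
// MRR is this value averaged over all judged queries.
function reciprocalRank(ranked: number[], relevant: Set<number>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Toy query: the true family is {101, 102, 103}; the system returned
// four PRs with one false neighbor ranked first and one sibling missed.
const relevant = new Set([101, 102, 103]);
const ranked = [250, 101, 102, 400];
```

On this toy list, precision is 2/4, recall is 2/3, so F1 is about 0.571, and the first correct result at rank 2 contributes 0.5 to MRR. Note that a mean F1 over many queries is the average of per-query F1 values, not the F1 of averaged precision and recall.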

Results

Cluster-family quality

  • clawlens
    • mean precision: 76.3%
    • mean recall: 100.0%
    • mean F1: 86.5%
  • ghcrawl
    • mean precision: 40.0%
    • mean recall: 20.0%
    • mean F1: 25.1%

Retrieval quality

  • clawlens
    • top-1 hit rate: 33.3%
    • top-3 hit rate: 83.3%
    • MRR: 0.583
    • recall@10: 83.3%
  • ghcrawl
    • top-1 hit rate: 33.3%
    • top-3 hit rate: 33.3%
    • MRR: 0.333
    • recall@10: 16.7%

Operational performance

  • cold index
    • clawlens: 385,469 ms, 306 MB
    • ghcrawl: 741,542 ms, 454 MB
  • warm refresh
    • clawlens: 19,087 ms, 125 MB
    • ghcrawl: 733,668 ms, 410 MB
  • time to first semantic success
    • clawlens: 1,129,293 ms
    • ghcrawl: 1,242,947 ms

Interpretation

If the evaluation criterion is cluster decision effectiveness, clawlens wins decisively on this benchmark.

The most important evidence is the cluster-family result:

  • clawlens mean F1: 86.5%
  • ghcrawl mean F1: 25.1%

That gap means the maintainer-oriented approach is substantially better at answering the maintainer question that actually matters:

  • which PRs belong together in the same fix family or cluster?

This is not just a speed win or a resource win. It is a quality win on the decision surface itself.

The retrieval metrics reinforce the same conclusion:

  • top-1 was tied
  • but clawlens was much better at surfacing the right family within the top few results and across the first page of results

Operationally, the benchmark also shows that the local approach is practical:

  • faster cold build
  • dramatically faster warm refresh
  • lower RAM usage
  • earlier first semantic success

Why This Matters For This Roadmap

This benchmark does not turn this docs PR into an implementation PR.

What it does do is provide concrete evidence that the architectural split proposed here is the right one:

  • clustering and neighbors are the recall layer
  • decision scoring is the maintainer-facing answer layer

That is the central claim of this roadmap, and the benchmark supports it.

Next Implementation Direction

The benchmark in this PR supports the decision-layer direction, but it also clarifies an adjacent durability requirement for later implementation work.

The key follow-up is to separate durable family identity from run-local cluster rows:

  • keep clusters.id as a run-local storage id if useful
  • introduce a visible durable family_id for maintainer-facing identity
  • inherit family_id across reruns when a family matches a previous snapshot
  • give linked-issue families a deterministic canonical identity early
  • let semantic-only families inherit prior family_id through snapshot matching
  • keep maintainer decision scoring above retrieval and identity rather than coupling it into storage

This PR does not attempt to implement that contract. It only makes the intended layering explicit so that future implementation work can add durable family identity without collapsing retrieval, identity, and maintainer decision logic into one step.
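One plausible shape for the "inherit family_id across reruns" step is overlap-based snapshot matching. The sketch below is an assumption for illustration only: the `Family` shape, the `inheritFamilyId` helper, and the Jaccard threshold are hypothetical, not part of this PR or the design doc:

```typescript
// Hypothetical sketch: a rerun family inherits a prior family_id when
// its membership overlaps a previous-snapshot family strongly enough;
// otherwise it gets a freshly minted id.

interface Family {
  familyId: string;
  members: Set<number>; // PR numbers
}

// Jaccard overlap between two member sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<number>, b: Set<number>): number {
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Pick the best-matching prior family above a threshold (0.5 is an
// arbitrary illustrative choice), else mint a new durable id.
function inheritFamilyId(
  members: Set<number>,
  prior: Family[],
  mintId: () => string,
  threshold = 0.5,
): string {
  let best: Family | null = null;
  let bestScore = 0;
  for (const f of prior) {
    const s = jaccard(members, f.members);
    if (s > bestScore) {
      bestScore = s;
      best = f;
    }
  }
  return best !== null && bestScore >= threshold ? best.familyId : mintId();
}
```

Under this sketch, adding one new PR to an existing three-member family keeps its durable id, while a disjoint family gets a new one — exactly the run-over-run stability property the follow-up acceptance test would need to validate.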

Delivery Path

```mermaid
flowchart TD
    P0[Phase 0<br/>fix current contracts]
    P1[Phase 1<br/>reusable decision core]
    P2[Phase 2<br/>analyze-pr]
    P3[Phase 3<br/>triage and API reuse]
    P4[Phase 4<br/>decision artifacts on run state]

    P0 --> P1 --> P2 --> P3 --> P4
```

Non-Goals

  • no cluster-model replacement
  • no storage redesign in the first iteration
  • no embedding backend changes
  • no runtime behavior change in this PR

Testing

  • docs-only change
  • manually reviewed markdown structure and Mermaid diagrams for GitHub rendering
  • benchmark evidence gathered separately from runtime implementation in claw-maintainer-tui
