Commit 6d906d6

ZviBaratz and claude authored
feat: add automated field test pipeline (#34)
## Summary

- Add two-layer automated field test pipeline for batch ego-lint regression testing across 10 real-world GNOME extensions
- Layer 1 (bash): `scripts/field-test-runner.sh` orchestrator with manifest-driven fetching, JSON output parsing, baseline diffing, and annotation-aware filtering
- Layer 2 (skill): `skills/ego-field-test/SKILL.md` for selective ego-review, finding classification, regression reports, and issue creation
- Seed 10 annotation files from existing field test reports (670 lines of institutional TP/FP knowledge)

## Test plan

- [x] `parse-manifest.py` correctly parses all 10 extensions from manifest
- [x] `parse-lint-results.py` matches expected counts (204/0/9/23 for hara-hachi-bu)
- [x] `field-test-runner.sh --extension hara-hachi-bu` produces valid JSON
- [x] Baseline creation and diffing works (`changed: false` on re-run)
- [x] `field-tests/cache/` and `field-tests/results/` properly gitignored
- [x] Existing 527 test assertions still pass (2 pre-existing failures in `on-destroy-cleanup`)

Closes #33

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7cea626 commit 6d906d6

36 files changed (+7886 -3 lines changed)
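The test plan's `changed: false` on re-run check can be pictured with a minimal sketch. This is not the code in `scripts/diff-baselines.py`: the snapshot shape (a `findings` list of objects with `id` and `status`) and the function name `diff_against_baseline` are assumptions made for illustration.

```python
import json


def diff_against_baseline(baseline: dict, current: dict) -> dict:
    """Compare two result snapshots keyed by finding id.

    Assumed (hypothetical) snapshot shape:
        {"findings": [{"id": "R-DEPR-09", "status": "WARN"}, ...]}
    """
    old = {f["id"]: f["status"] for f in baseline.get("findings", [])}
    new = {f["id"]: f["status"] for f in current.get("findings", [])}

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Findings present in both runs whose status moved (e.g. WARN -> FAIL).
    status_changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])

    return {
        "changed": bool(added or removed or status_changed),
        "added": added,
        "removed": removed,
        "status_changed": status_changed,
    }


if __name__ == "__main__":
    snapshot = {"findings": [{"id": "R-DEPR-09", "status": "WARN"}]}
    # Re-running against an identical snapshot should report no change.
    print(json.dumps(diff_against_baseline(snapshot, snapshot)))
```

Keying on the finding id keeps the comparison independent of output order, so an unchanged extension yields `"changed": false` however the lint output is sorted.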

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -6,3 +6,5 @@ node_modules/
 .claude/
 tests/fixtures/*/node_modules/
 tests/assertions/local-regression.sh
+field-tests/cache/
+field-tests/results/

CLAUDE.md

Lines changed: 55 additions & 2 deletions
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Project Overview
 
-Claude Code plugin for GNOME Shell extension EGO (extensions.gnome.org) review compliance. It provides five skills (`ego-lint`, `ego-review`, `ego-scaffold`, `ego-simulate`, `ego-submit`). This is **not** a GNOME extension itself — it's a set of tools that validate GNOME extensions against EGO submission requirements. Load it with `claude --plugin-dir <path-to-this-repo>`.
+Claude Code plugin for GNOME Shell extension EGO (extensions.gnome.org) review compliance. It provides six skills (`ego-lint`, `ego-review`, `ego-scaffold`, `ego-simulate`, `ego-submit`, `ego-field-test`). This is **not** a GNOME extension itself — it's a set of tools that validate GNOME extensions against EGO submission requirements. Load it with `claude --plugin-dir <path-to-this-repo>`.
 
 ## Running ego-lint
 
@@ -77,7 +77,7 @@ Additional tooling:
 ### Three-tier rule system
 
 - **Tier 1 (patterns.yaml)**: 124 regex rules in YAML, processed by `apply-patterns.py`. Covers web APIs, deprecated APIs, security (telemetry, curl/gsettings spawn, base64), logging, import segregation, AI slop signals, subprocess safety, i18n, GSettings bind flags, GNOME 44-50 migration, code quality advisories. Add new rules by editing `rules/patterns.yaml`. Advanced fields: `min-version`/`max-version` (version-gating), `guard-pattern` + `guard-window` (line-level suppression with configurable lookback) + `guard-window-forward` (forward peeking), `replacement-pattern` (file-level suppression), `exclude-dirs`, `skip-comments` (comment-aware matching).
-- **Tier 2 (scripts)**: 17 structural heuristic check scripts in Python/bash. `check-quality.py` (AI slop heuristics), `check-init.py` (init-time safety), `check-lifecycle.py` (enable/disable symmetry + timeout verification), `check-resources.py` + `build-resource-graph.py` (cross-file resource tracking), `check-disclosures.py` (clipboard/network disclosure), `check-polkit.py` (polkit policy validation), `check-schema-usage.py` (unused/undefined schema keys), `check-accessibility.py` (a11y checks), plus metadata, prefs, GObject, async, CSS, imports, schema, and package checks. `ego-lint.sh` also has an inline minified JS check, code metrics, and a provenance-gated post-filter that suppresses R-SLOP-01/02 JSDoc warnings when `quality/code-provenance` score >= 4.
+- **Tier 2 (scripts)**: 17 structural heuristic check scripts in Python/bash. `check-quality.py` (AI slop heuristics), `check-init.py` (init-time safety), `check-lifecycle.py` (enable/disable symmetry + timeout verification), `check-resources.py` + `build-resource-graph.py` (cross-file resource tracking), `check-disclosures.py` (clipboard/network disclosure), `check-polkit.py` (polkit policy validation), `check-schema-usage.py` (unused/undefined schema keys), `check-accessibility.py` (a11y checks), plus metadata, prefs, GObject, async, CSS, imports, schema, and package checks. `ego-lint.sh` also has an inline minified JS check, code metrics, and a provenance-gated post-filter that suppresses R-SLOP-01/02 JSDoc warnings when `quality/code-provenance` score >= 3.
 - **Tier 3 (checklists)**: 6 semantic review checklists in `skills/ego-review/references/`: lifecycle, security, code-quality (with 10 additional quality items), ai-slop (46-item scoring model), licensing, accessibility (7 items). Applied by Claude during `ego-review` phases.
 
 ### ego-review internals
@@ -154,6 +154,59 @@ test(ego-lint): add fixture for deprecated ByteArray usage
 - **PR closes issue**: Include `Closes #N` in the PR description to auto-close the issue on merge
 - **Tests before PR**: Run `bash tests/run-tests.sh` and verify all assertions pass before pushing
 
+## Field Testing
+
+Batch ego-lint runner for regression testing across 10 real-world GNOME extensions.
+
+### Running field tests
+
+```bash
+bash scripts/field-test-runner.sh # lint all extensions
+bash scripts/field-test-runner.sh --extension NAME # lint single extension
+bash scripts/field-test-runner.sh --update-baselines # save current as golden
+bash scripts/field-test-runner.sh --no-fetch # skip git clones, use cache
+bash scripts/field-test-runner.sh --review --no-fetch # lint + review all
+bash scripts/field-test-runner.sh --review --review-exclude X # review all except X
+bash scripts/field-test-runner.sh --review-dry-run # print prompts only
+```
+
+### Pipeline structure
+
+- `field-tests/manifest.yaml` — Extension source manifest (local paths, GitHub repos)
+- `field-tests/baselines/` — Golden JSON snapshots (committed)
+- `field-tests/annotations/` — Per-extension finding classifications: tp, fp, borderline, expected (committed)
+- `field-tests/history.jsonl` — Append-only trend data (committed)
+- `field-tests/cache/` — Downloaded extensions (gitignored)
+- `field-tests/results/` — Timestamped run output (gitignored), includes `.review.md` reports
+- `field-tests/reports/` — Regression/synthesis reports (committed)
+- `scripts/field-test-runner.sh` — Bash orchestrator (lint + optional review phase)
+- `scripts/parse-manifest.py` — Manifest YAML → JSON (inline parser, no PyYAML)
+- `scripts/parse-lint-results.py` — ego-lint stdout → structured JSON
+- `scripts/diff-baselines.py` — Baseline comparison + annotation-aware filtering
+- `scripts/review-prompt.md` — Review prompt template (incremental Write strategy)
+- `scripts/hydrate-review-prompt.py` — Template hydration with lint/diff/annotation data
+- `skills/ego-field-test/SKILL.md` — Claude Code skill for full pipeline (classification, synthesis, issue creation)
+
+### Calibration cycle
+
+1. Make a code change (guard pattern, threshold tweak, new rule)
+2. Run `bash scripts/field-test-runner.sh --no-fetch` — see impact across all extensions
+3. Classify new unannotated findings in `field-tests/annotations/`
+4. If FPs found, create issues and fix
+5. Run with `--update-baselines` to snapshot improved state
+
+### Review phase
+
+The `--review` flag runs headless `claude -p` sessions after lint. Each session uses `scripts/review-prompt.md` (hydrated with lint results, diff, and annotations). Key flags:
+
+- `--review` — review all extensions; `--review-changed` — only changed ones
+- `--review-exclude NAME` — skip specific extensions from review (repeatable); `--exclude` skips from both lint and review
+- `--budget AMOUNT` — max USD per review session (default: 4.00)
+- `--parallel N` — max concurrent sessions (default: 3)
+- `--review-dry-run` — write hydrated prompts without invoking claude
+
+Reports are written incrementally (section-by-section) to survive budget exhaustion. Review findings use `review/` prefix in annotation files to distinguish from lint findings.
+
 ## Releasing
 
 release-please automates versioning, CHANGELOG updates, git tags, and GitHub Releases:
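Step 3 of the calibration cycle in the CLAUDE.md change above depends on knowing which findings have no classification yet. A minimal sketch of that lookup follows. It assumes annotation ids use the `check-id::detail` form seen in the annotation files and that matching happens on the part before `::`; both the matching rule and the names `check_id` and `unannotated_findings` are hypothetical, and the behavior of `scripts/diff-baselines.py` may differ.

```python
def check_id(finding_id: str) -> str:
    """Return the part before '::' (e.g. 'R-SEC-06' from 'R-SEC-06::run_dispose')."""
    return finding_id.split("::", 1)[0]


def unannotated_findings(lint_ids, annotations):
    """List lint check ids that have no classification in the annotations.

    lint_ids:    iterable of check ids reported by a lint run (hypothetical input).
    annotations: list of {"id": ..., "classification": ...} dicts.
    """
    known = {check_id(a["id"]) for a in annotations}
    # Preserve first-seen order while dropping duplicates.
    seen, result = set(), []
    for cid in lint_ids:
        if cid not in known and cid not in seen:
            seen.add(cid)
            result.append(cid)
    return result


if __name__ == "__main__":
    annotations = [
        {"id": "R-SEC-06::run_dispose", "classification": "tp"},
        {"id": "R-SLOP-35::Object.freeze enum", "classification": "fp"},
    ]
    print(unannotated_findings(["R-SEC-06", "R-LOG-03", "R-LOG-03"], annotations))
```

Anything this returns would be a candidate for a new entry in `field-tests/annotations/` before baselines are updated.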

field-tests/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+cache/
+results/
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+# Classified findings for AppIndicator/KStatusNotifierItem Support
+# Source: docs/internal/field-test-appindicator.md
+findings:
+  # FAILs — True Positives
+  - id: "R-DEPR-04::legacy imports.gi syntax"
+    classification: tp
+    notes: "3 instances in indicatorStatusIcon.js, interfaces.js, appIndicator.js — GNOME 45+ should use ESM"
+  - id: "R-VER44-01::Meta.later_add dead code"
+    classification: tp
+    notes: "promiseUtils.js — API removed in GNOME 44, extension targets 45+"
+  - id: "R-VER44-02::Meta.later_remove dead code"
+    classification: tp
+    notes: "promiseUtils.js — same dead code issue"
+  - id: "metadata/future-shell-version::GNOME 50"
+    classification: tp
+    notes: "shell-version includes 50 which is newer than current stable"
+  - id: "no-deprecated-modules::imports.byteArray"
+    classification: tp
+    notes: "interfaces.js uses deprecated imports.byteArray"
+  - id: "non-gjs-scripts::ksni.py"
+    classification: borderline
+    notes: "indicator-test-tool/ksni.py is a developer test tool, not part of extension proper"
+  - id: "metadata/shell-version-range::6 versions"
+    classification: tp
+    notes: "45-50 exceeds max 4 allowed"
+  # FAILs — False Positives
+  - id: "R-SLOP-16::GLib.file_get_contents"
+    classification: fp
+    notes: "Rule claims API doesn't exist in GJS, but it does — valid GI binding for g_file_get_contents()"
+  - id: "R-VER46-01::add_actor runtime-guarded"
+    classification: fp
+    notes: "Code has if (obj.add_actor) guard — runtime feature detection. Fixed: guard-pattern."
+  - id: "R-VER46-02::remove_actor runtime-guarded"
+    classification: fp
+    notes: "Same runtime guard pattern as add_actor. Fixed: guard-pattern."
+  - id: "init/shell-modification::non-Extension constructors"
+    classification: fp
+    notes: "3 FPs — GLib.Error, Gio.Cancellable in constructors of runtime-only classes. Fixed: scoped to extension.js."
+  # WARNs — True Positives
+  - id: "R-SEC-06::run_dispose"
+    classification: tp
+    notes: "statusNotifierWatcher.js — needs justification"
+  - id: "R-LOG-03::print/printerr in dev tool"
+    classification: tp
+    notes: "11 instances in indicator-test-tool/"
+  - id: "R-QUAL-26::custom Logger class"
+    classification: tp
+    notes: "Logger wraps GLib.log_structured; console.debug preferred"
+  - id: "R-QUAL-33::Gio._promisify module-scope"
+    classification: tp
+    notes: "4 files — correct advisory"
+  - id: "quality/module-state::mutable module-level let"
+    classification: tp
+    notes: "settingsManager.js — intentional singleton but valid concern"
+  - id: "quality/mock-in-production::test files"
+    classification: tp
+    notes: "indicator-test-tool/testTool.js shouldn't ship"
+  - id: "gobject/missing-gtypename::collision risk"
+    classification: tp
+    notes: "5 instances"
+  - id: "async/missing-cancellable::async without cancellable"
+    classification: tp
+    notes: "dbusMenu.js, appIndicator.js"
+  - id: "disclosure/private-api::undisclosed"
+    classification: tp
+    notes: "Main.layoutManager access not disclosed"
+  - id: "disclosure/file-io::undisclosed"
+    classification: tp
+    notes: "File I/O not disclosed in metadata"
+  # WARNs — False Positives
+  - id: "R-SLOP-13::this instanceof in factory"
+    classification: fp
+    notes: "3 FPs — methods in MenuItemFactory bound to different shellItem types via connectSmart. Fixed: guard-pattern."
+  - id: "R-SLOP-35::Object.freeze enum"
+    classification: fp
+    notes: "3 FPs — standard JS enum pattern (SNICategory, SNIStatus, SNIconType). Fixed: guard-pattern."
+  - id: "R-SLOP-38::domain-specific identifiers"
+    classification: fp
+    notes: "4 FPs — brightnessContrastEffect and similar are standard Clutter API names. Fixed: threshold raised."
+  - id: "R-QUAL-31::_onDestroy signal handler"
+    classification: fp
+    notes: "7 FPs — _onDestroy is PanelMenu.Button signal handler convention. Fixed: guard-pattern."
+  - id: "lifecycle/connectObject-migration::connectSmart equivalent"
+    classification: fp
+    notes: "6 FPs — connectSmart provides equivalent auto-cleanup. Fixed: recognized in check-lifecycle.py."
+  - id: "lifecycle/signal-balance::connectSmart not counted"
+    classification: fp
+    notes: "66 connects vs 18 disconnects — doesn't account for connectSmart auto-disconnect. Fixed."
+  # WARNs — Mixed
+  - id: "quality/constructor-resources::runtime-only constructors"
+    classification: borderline
+    notes: "8 hits — extension.js:36 is TP (Extension ctor), others are FP (runtime-only, cleaned via destroy/connectSmart)"
+  - id: "lifecycle/untracked-timeout::GSource-based promise"
+    classification: borderline
+    notes: "promiseUtils.js is FP (GSource-based promise with _cleanup); indicator-test-tool entries are TP but irrelevant"
+  - id: "quality/redundant-cleanup::verbose destroy guards"
+    classification: borderline
+    notes: "4 instances — if (x) x.destroy() vs x?.destroy() is style preference"
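Several entries above are marked "Fixed: guard-pattern", and CLAUDE.md describes `guard-pattern` + `guard-window` as line-level suppression with a configurable lookback. The sketch below shows one way such a lookback could work. It is not the logic in `apply-patterns.py`: the function name `find_unguarded_hits`, the default window of 3, and the choice to include the hit line itself in the window are all assumptions.

```python
import re


def find_unguarded_hits(lines, pattern, guard_pattern, guard_window=3):
    """Return 1-based line numbers where `pattern` matches and no line in the
    lookback window (the hit line plus `guard_window` lines before it)
    matches `guard_pattern`."""
    rule = re.compile(pattern)
    guard = re.compile(guard_pattern)
    hits = []
    for i, line in enumerate(lines):
        if not rule.search(line):
            continue
        window = lines[max(0, i - guard_window): i + 1]
        if any(guard.search(w) for w in window):
            continue  # suppressed: a runtime guard sits close enough above
        hits.append(i + 1)
    return hits


if __name__ == "__main__":
    source = [
        "if (obj.add_actor)",
        "    obj.add_actor(child);",
        "else",
        "    obj.add_child(child);",
        "",
        "",
        "",
        "",
        "other.add_actor(child);",
    ]
    # Only the call far away from the feature-detection guard is reported.
    print(find_unguarded_hits(source, r"\.add_actor\(", r"if\s*\(\s*\w+\.add_actor\s*\)"))
```

A short window keeps the suppression local: the guarded call on line 2 is skipped, while the unguarded call on line 9 is still reported.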
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+# Classified findings for Blur my Shell
+# Source: docs/internal/field-test-blur-my-shell.md
+findings:
+  # FAILs — Fixed False Positives
+  - id: "init/shell-modification::GObject.registerClass at module scope"
+    classification: fp
+    notes: "13 FPs — GObject.registerClass returns a class constructor, not an instance. Fixed: exemption in check-init.py."
+  - id: "file-structure/extension.js::src/ layout"
+    classification: fp
+    notes: "Extension uses src/ subdirectory layout. Fixed: src/ fallback in ego-lint.sh."
+  # FAILs — True Positives
+  - id: "R-DEPR-06::Tweener usage"
+    classification: tp
+    notes: "appfolders.js uses imports.tweener.tweener — would crash on GNOME 46+. 4 line hits."
+  - id: "R-VER47-01::Clutter.Color"
+    classification: borderline
+    notes: "appfolders.js has ternary runtime guard (Clutter.Color ? ... : Cogl.Color), but file crashes before reaching this code due to Tweener import"
+  # WARNs — False Positives
+  - id: "resource-tracking/destroy-not-called::disable() not recognized"
+    classification: fp
+    notes: "63 hits — components use .disable() not .destroy() as cleanup. Resource graph only recognizes destroy()."
+  - id: "quality/constructor-resources::pipeline instances"
+    classification: fp
+    notes: "17 hits — mostly FP, pipeline instances managed via parent lifecycle"
+  - id: "resource-tracking/no-destroy-method::utility classes"
+    classification: fp
+    notes: "10 hits — utility classes use disconnect_all(), remove() instead"
+  - id: "R-SLOP-38::descriptive identifier"
+    classification: fp
+    notes: "dash_not_already_destroyed is descriptive, not AI verbosity. Fixed: guard-pattern."
+  - id: "R-SLOP-24::non-extension schema"
+    classification: fp
+    notes: "new Gio.Settings({schema: 'org.gnome.mutter'}) correctly accesses system schema. Fixed: guard-pattern."
+  # WARNs — True Positives
+  - id: "lifecycle/prototype-override::UnlockDialog overrides"
+    classification: tp
+    notes: "6 instances — correctly flags lockscreen UnlockDialog overrides"
+  - id: "R-I18N-01::template literals in _()"
+    classification: tp
+    notes: "4 instances — breaks xgettext extraction"
+  - id: "R-SLOP-16::GLib.file_get_contents synchronous"
+    classification: tp
+    notes: "Synchronous file read advisory"
+  - id: "R-SLOP-03::version field deprecated"
+    classification: tp
+    notes: "Deprecated for GNOME 45+"
+  - id: "R-SEC-09::Main.extensionManager access"
+    classification: tp
+    notes: "Extension system interference for Dash to Panel compat"
+  - id: "R-DEPR-09::var usage"
+    classification: tp
+    notes: "var x, y; should use let"
+  - id: "quality/private-api::Main.overview._overview"
+    classification: tp
+    notes: "Private API access — correct advisory"
+  - id: "quality/module-state::module vars not reset"
+    classification: tp
+    notes: "sigma and brightness not reset in disable"
+  - id: "quality/empty-catch::empty catch block"
+    classification: tp
+    notes: "paint_signals.js — empty catch"
+  - id: "lifecycle/signal-balance::125 vs 28"
+    classification: tp
+    notes: "By design — Connections class auto-cleans, but signal-balance heuristic can't verify"
+  - id: "lifecycle/async-destroyed-guard::await import"
+    classification: tp
+    notes: "Low risk — await import() in utils.js without guard"
+  # WARNs — Mixed
+  - id: "lifecycle/untracked-timeout::prefs auto-cleanup"
+    classification: borderline
+    notes: "4 hits — 2 TP, 2 FP (prefs.js timeouts auto-cleanup on window close)"
+  # ego-review advisory findings (not caught by ego-lint)
+  - id: "lifecycle::actor.destroy missing parentheses"
+    classification: tp
+    notes: "L-3: coverflow_alt_tab.js:69 — actor.destroy is property access, not function call. Undetectable by pattern matching."
+  - id: "lifecycle::splice wrong argument type"
+    classification: tp
+    notes: "L-10: window_list.js:111 — passes object instead of index to splice()"
+  - id: "lifecycle::setTimeout source ID not stored"
+    classification: tp
+    notes: "L-1: panel.js:70 — can fire after disable"
+  - id: "lifecycle::GLib.idle_add source ID not stored"
+    classification: tp
+    notes: "L-2: panel.js:91 — can fire after disable"
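Annotation files like the one above share a flat shape: a `findings:` list whose entries carry `id`, `classification`, and `notes`. The sketch below reads that shape without PyYAML and tallies classifications, in the spirit of the inline parser CLAUDE.md mentions for `parse-manifest.py`. It is not the repository's parser: `parse_annotations` and `tally` are hypothetical names, and the code handles only this exact shape with double-quoted values.

```python
import re
from collections import Counter

# Matches `- id: ...`, `classification: ...`, or `notes: ...` at any indent.
_FIELD = re.compile(r'^\s*(?:-\s+)?(id|classification|notes):\s*(.*)$')


def parse_annotations(text):
    """Parse the flat `findings:` list shape into a list of dicts.

    `- id:` starts an entry; `classification:` and `notes:` attach to the
    most recent entry. Comments and blank lines are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        m = _FIELD.match(raw)
        if not m:
            continue  # e.g. the top-level `findings:` key
        key, value = m.group(1), m.group(2).strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]  # drop surrounding double quotes
        if key == "id":
            entries.append({"id": value})
        elif entries:
            entries[-1][key] = value
    return entries


def tally(entries):
    """Count entries per classification (tp, fp, borderline, ...)."""
    return Counter(e.get("classification", "unclassified") for e in entries)


if __name__ == "__main__":
    sample = """
findings:
  # ego-review advisory findings (not caught by ego-lint)
  - id: "R-SEC-09::Main.extensionManager access"
    classification: tp
    notes: "Extension system interference for Dash to Panel compat"
  - id: "R-SLOP-24::non-extension schema"
    classification: fp
    notes: "new Gio.Settings({schema: 'org.gnome.mutter'}) correctly accesses system schema. Fixed: guard-pattern."
"""
    print(dict(tally(parse_annotations(sample))))
```

A per-file tally of this kind is one way to watch the TP/FP balance of an extension shift as rules are recalibrated.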
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+# Classified findings for Clipboard Indicator
+# Source: docs/internal/field-test-clipboard-indicator.md
+findings:
+  # FAILs — Fixed False Positives
+  - id: "R-WEB-01::setTimeout"
+    classification: fp
+    notes: "GJS added native setTimeout in GNOME 45. Rule was unconditionally blocking. Fixed: max-version 44."
+  - id: "R-WEB-02::setInterval"
+    classification: fp
+    notes: "Same as R-WEB-01. Fixed: max-version 44."
+  - id: "R-WEB-10::clearTimeout"
+    classification: fp
+    notes: "Same as R-WEB-01. Fixed: max-version 44."
+  - id: "R-WEB-11::clearInterval"
+    classification: fp
+    notes: "Same as R-WEB-01. Fixed: max-version 44."
+  - id: "license::LICENSE.rst not recognized"
+    classification: fp
+    notes: "License check only recognized LICENSE/COPYING, not .rst/.md/.txt variants. Fixed."
+  - id: "metadata/uuid-matches-dir::cloned repo"
+    classification: fp
+    notes: "FAIL for cloned repos where directory != UUID. Fixed: downgraded to WARN."
+  # FAILs — True Positives
+  - id: "css/shell-class-override::.popup-menu-item"
+    classification: tp
+    notes: "Overrides Shell theme class without scoping — genuine issue"
+  - id: "R-DEPR-11::Shell.KeyBindingMode"
+    classification: tp
+    notes: "Dead code, removed before GNOME 40"
+  # WARNs — True Positives
+  - id: "css/important::!important usage"
+    classification: tp
+    notes: "Correct advisory"
+  - id: "R-DEPR-09::var declarations"
+    classification: tp
+    notes: "3 instances — should be const/let"
+  - id: "R-SEC-06::run_dispose"
+    classification: tp
+    notes: "run_dispose() on virtual keyboard device"
+  - id: "R-PREFS-04c::GTK layout widget"
+    classification: tp
+    notes: "Correct advisory"
+  - id: "R-VER48-04b::vertical property deprecated"
+    classification: tp
+    notes: "3 instances — correct advisory"
+  - id: "quality/module-state::module-level let"
+    classification: tp
+    notes: "5 module-level let variables — real enable/disable lifecycle concern"
+  - id: "quality/constructor-resources::connect in prefs"
+    classification: tp
+    notes: "2 instances — .connect() in prefs constructor"
+  - id: "quality/private-api::Main.panel"
+    classification: tp
+    notes: "Private API access — correct advisory"
+  - id: "lifecycle/signal-balance::17 vs 6"
+    classification: tp
+    notes: "Genuine signal balance concern"
+  - id: "lifecycle/async-destroyed-guard::no _destroyed guard"
+    classification: tp
+    notes: "No destroyed guard on async code"
+  - id: "lifecycle/clipboard-keybinding::security pattern"
+    classification: tp
+    notes: "Clipboard + keybinding pattern detected"
+  - id: "gobject/missing-gtypename::ConfirmDialog"
+    classification: tp
+    notes: "Missing GTypeName — collision risk"
+  - id: "async/no-cancellable::Gio async without cancellable"
+    classification: tp
+    notes: "Correct finding"
+  - id: "async/missing-cancellable::_async without cancellable"
+    classification: tp
+    notes: "Correct finding"
+  - id: "resource-tracking/no-destroy-method::registry.js"
+    classification: tp
+    notes: "Registry has no destroy() method"
+  - id: "resource-tracking/ownership::orphan detected"
+    classification: tp
+    notes: "1 orphan detected — correct"
+  # ego-review blocking issues
+  - id: "lifecycle::_historyLabel on global.stage never removed"
+    classification: tp
+    notes: "B1: Actor leak on every enable/disable cycle — ego-review finding"
+  - id: "lifecycle::_notifSource never destroyed in disable()"
+    classification: tp
+    notes: "B2: Notification source persists — ego-review finding"
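The `R-WEB-*` entries above were fixed with `max-version 44`, because GJS gained native `setTimeout` and friends in GNOME 45. The sketch below illustrates how a `min-version`/`max-version` gate could decide whether a rule applies to an extension. The semantics here are an assumption (a rule applies when at least one declared shell version falls inside the inclusive range), and `rule_applies` is a hypothetical name; `apply-patterns.py` may evaluate the gate differently.

```python
def rule_applies(shell_versions, min_version=None, max_version=None):
    """Return True if any declared shell version falls inside the rule's range.

    shell_versions: versions from metadata.json, e.g. ["45", "46"].
    min_version / max_version: inclusive major-version bounds; None means unbounded.
    """
    for raw in shell_versions:
        try:
            major = int(str(raw).split(".", 1)[0])
        except ValueError:
            continue  # ignore entries that are not numeric
        if min_version is not None and major < min_version:
            continue
        if max_version is not None and major > max_version:
            continue
        return True
    return False


if __name__ == "__main__":
    # A setTimeout rule capped at GNOME 44 no longer fires for a 45+ extension.
    print(rule_applies(["45", "46", "47"], max_version=44))  # False
    # It still fires when the extension also declares support for 44 or older.
    print(rule_applies(["43", "44", "45"], max_version=44))  # True
```

Under this reading, an extension that still declares GNOME 44 keeps seeing the warning, which matches the intent of flagging code that would break on the older runtime.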
