
Staged DSL mining pipeline with candidate tests and reviewable cleanup generation #1111


@carstenartur

Problem

The current mining pipeline mixes several concerns:

  • existing-rule scanning
  • LLM-based cleanup discovery
  • automatic hint-file updates
  • GitHub Pages reporting
  • issue/PR generation

This coupling produces too many low-quality or weakly validated cleanup candidates and makes the process difficult to review and debug.

In particular:

  • GREEN findings can directly end up in productive .sandbox-hint files
  • duplicate or weak rules are difficult to detect
  • Gemini free-tier responses are often too noisy or too opaque
  • generated cleanup candidates usually have no accompanying JUnit tests
  • GitHub Pages already contains review/navigation concepts, but the pipeline is still too tightly coupled

Goal

Introduce a staged and reviewable mining pipeline:

Commit history
  ↓
LLM discovers cleanup candidate
  ↓
Candidate DSL rule
  ↓
Auto-generated lightweight JUnit test
  ↓
GitHub Pages review state
  ↓
Issue / PR creation
  ↓
Promotion into productive bundled .sandbox-hint files

The main idea is:

  • discovery remains cheap and experimental
  • productive cleanup rules remain stable and reviewable
  • every cleanup candidate becomes reproducible and testable

Proposed architecture

1. Introduce candidate staging

Do NOT write newly mined GREEN rules directly into:

sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/internal

Instead introduce:

mining-candidates/

or:

sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/candidates/

Candidate rules should remain isolated until explicitly promoted.


2. Introduce MiningCandidate model

Example:

{
  "dslRule": "...",
  "beforeExample": "...",
  "afterExample": "...",
  "negativeExample": "...",
  "targetHintFile": "performance.sandbox-hint",
  "sourceCommit": "...",
  "status": "DISCOVERED"
}

Suggested states:

  • DISCOVERED
  • DSL_VALID
  • TEST_GENERATED
  • TEST_PASSED
  • READY_FOR_PR
  • PROMOTED
  • REJECTED
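
A minimal sketch of how the model and its states could look in Java. The field names mirror the JSON example and the states mirror the list above. The transition policy (advance one step at a time, or reject from any non-terminal state) and the `withStatus` helper are assumptions for illustration, not something the existing code base defines.

```java
import java.util.EnumSet;
import java.util.Set;

/** Lifecycle states from the list above. */
enum CandidateStatus {
    DISCOVERED, DSL_VALID, TEST_GENERATED, TEST_PASSED, READY_FOR_PR, PROMOTED, REJECTED;

    /** Assumed policy: move forward one step, or reject; PROMOTED and REJECTED are terminal. */
    Set<CandidateStatus> allowedNext() {
        return switch (this) {
            case DISCOVERED -> EnumSet.of(DSL_VALID, REJECTED);
            case DSL_VALID -> EnumSet.of(TEST_GENERATED, REJECTED);
            case TEST_GENERATED -> EnumSet.of(TEST_PASSED, REJECTED);
            case TEST_PASSED -> EnumSet.of(READY_FOR_PR, REJECTED);
            case READY_FOR_PR -> EnumSet.of(PROMOTED, REJECTED);
            case PROMOTED, REJECTED -> EnumSet.noneOf(CandidateStatus.class);
        };
    }
}

/** Immutable candidate; fields follow the JSON example. */
record MiningCandidate(
        String dslRule,
        String beforeExample,
        String afterExample,
        String negativeExample,
        String targetHintFile,
        String sourceCommit,
        CandidateStatus status) {

    /** Hypothetical helper: returns a copy in the next state, or fails on an illegal jump. */
    MiningCandidate withStatus(CandidateStatus next) {
        if (!status.allowedNext().contains(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        return new MiningCandidate(dslRule, beforeExample, afterExample,
                negativeExample, targetHintFile, sourceCommit, next);
    }
}
```

With this policy a candidate cannot jump from DISCOVERED straight to PROMOTED, which is the property the staging step is meant to enforce.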

3. Auto-generate lightweight JUnit tests

The current verbose Eclipse cleanup tests are too expensive for early-stage mining validation.

Instead automatically generate lightweight core tests using:

  • HintFileParser
  • BatchTransformationProcessor
  • BatchTransformationProcessorTest

as the execution/test foundation.

The generated tests should verify:

  1. DSL parses successfully
  2. beforeExample matches
  3. replacement equals afterExample
  4. negativeExample does NOT match

Example target location:

sandbox_common_core/src/test/java/org/sandbox/jdt/triggerpattern/generated/

This allows mined cleanup rules to become reproducible without immediately requiring full Eclipse cleanup bridge tests.
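
The four checks could be expressed roughly as below. This is only a sketch: `RuleRunner` is a hypothetical seam that a real implementation would back with HintFileParser and BatchTransformationProcessor, whose actual signatures are not shown in this issue. `LiteralRuleRunner` is a toy stand-in that treats a rule as `literal => replacement`, so the example runs on its own. It does not reflect the real DSL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical seam; real code would delegate to the existing parser/processor. */
interface RuleRunner {
    /** True if the DSL text parses. */
    boolean parses(String dslRule);

    /** Rewritten source if the rule matches, otherwise empty. */
    Optional<String> apply(String dslRule, String source);
}

final class CandidateChecks {
    /** Runs the four checks and returns the failures (empty list means all passed). */
    static List<String> verify(RuleRunner runner, String dslRule,
                               String before, String after, String negative) {
        List<String> failures = new ArrayList<>();
        if (!runner.parses(dslRule)) {
            failures.add("DSL does not parse");
            return failures; // the remaining checks need a parsable rule
        }
        Optional<String> rewritten = runner.apply(dslRule, before);
        if (rewritten.isEmpty()) {
            failures.add("beforeExample does not match");
        } else if (!rewritten.get().equals(after)) {
            failures.add("replacement differs from afterExample");
        }
        if (runner.apply(dslRule, negative).isPresent()) {
            failures.add("negativeExample matches");
        }
        return failures;
    }
}

/** Toy stand-in for demonstration only: "literal => replacement". */
final class LiteralRuleRunner implements RuleRunner {
    @Override
    public boolean parses(String dslRule) {
        return dslRule.split(" => ", -1).length == 2;
    }

    @Override
    public Optional<String> apply(String dslRule, String source) {
        String[] parts = dslRule.split(" => ", -1);
        return source.contains(parts[0])
                ? Optional.of(source.replace(parts[0], parts[1]))
                : Optional.empty();
    }
}
```

A generated JUnit test would then contain a single assertion that `verify(...)` returns an empty list for the candidate's rule and examples.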


4. Improve Gemini prompt strategy

The current prompt is likely too broad for Gemini free-tier models.

Suggested changes:

Use discovery-first prompting

Instead of asking for all of the following in one prompt:

  • categorization
  • full cleanup evaluation
  • plugin replacement decisions
  • broad architectural analysis

focus the prompt on a single task:

Discover exactly one reusable cleanup idea from the commit.

Force structured output

Require:

  • one DSL rule
  • one beforeExample
  • one afterExample
  • one negativeExample
  • confidence
  • noCleanup=true if uncertain

Strong safety rules

Reject:

  • import-only transformations
  • type-changing replacements
  • architecture refactorings
  • broad semantic rewrites
  • duplicates of existing rules
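
Some of these rules can be enforced in Java after the response is parsed, rather than trusted to the model. The sketch below assumes the structured output has already been deserialized into a `DiscoveryResponse`. The record, the numeric confidence scale, the `MIN_CONFIDENCE` threshold, and the import-only heuristic are all assumptions for illustration. Type-changing and architectural rewrites would need real analysis and are not covered here.

```java
import java.util.Optional;
import java.util.stream.Collectors;

/** Hypothetical shape of the structured model output. */
record DiscoveryResponse(
        boolean noCleanup,
        String dslRule,
        String beforeExample,
        String afterExample,
        String negativeExample,
        double confidence) {}

final class ResponseGate {
    /** Assumed threshold on an assumed 0..1 confidence scale. */
    static final double MIN_CONFIDENCE = 0.7;

    /** Returns a rejection reason, or empty if the response may become a candidate. */
    static Optional<String> rejectionReason(DiscoveryResponse r) {
        if (r.noCleanup()) {
            return Optional.of("model reported noCleanup");
        }
        if (isBlank(r.dslRule()) || isBlank(r.beforeExample())
                || isBlank(r.afterExample()) || isBlank(r.negativeExample())) {
            return Optional.of("missing required field");
        }
        if (r.confidence() < MIN_CONFIDENCE) {
            return Optional.of("confidence below threshold");
        }
        // Heuristic: if the examples are identical once import lines are removed,
        // the change only touched imports.
        if (withoutImports(r.beforeExample()).equals(withoutImports(r.afterExample()))) {
            return Optional.of("import-only transformation");
        }
        return Optional.empty();
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }

    private static String withoutImports(String code) {
        return code.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty() && !line.startsWith("import "))
                .collect(Collectors.joining("\n"));
    }
}
```

Returning a reason string instead of a boolean lets the Pages report show why a response was dropped.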

5. Improve duplicate handling

Do not send all existing rules to Gemini.

Instead:

  • keep deterministic duplicate detection in Java
  • provide Gemini only with:
    • rule ids
    • summaries
    • normalized patterns
    • nearest similar rules

This should reduce copying/rehashing of old cleanup ideas.
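
One way to make the Java-side duplicate check deterministic is to compare normalized patterns. The sketch below assumes that placeholders are written as `$name`. That is an assumption about the DSL, and the normalization rules are illustrative. It collapses whitespace and renames placeholders by order of first appearance, so two patterns that differ only in spacing or placeholder names compare equal.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternNormalizer {
    /** Assumed placeholder syntax: a dollar sign followed by word characters. */
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\w+");

    /** Collapses whitespace and renames placeholders to $1, $2, ... in order of first use. */
    static String normalize(String pattern) {
        String collapsed = pattern.strip().replaceAll("\\s+", " ");
        Map<String, String> renamed = new LinkedHashMap<>();
        Matcher m = PLACEHOLDER.matcher(collapsed);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group();
            String canonical = renamed.get(name);
            if (canonical == null) {
                canonical = "$" + (renamed.size() + 1);
                renamed.put(name, canonical);
            }
            // quoteReplacement keeps "$1" literal instead of a group reference
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        return out.toString();
    }
}

/** Hypothetical detector keyed on normalized patterns. */
final class DuplicateDetector {
    private final Set<String> known = new HashSet<>();

    /** Registers a pattern; returns false if an equivalent one is already known. */
    boolean addIfNew(String pattern) {
        return known.add(PatternNormalizer.normalize(pattern));
    }
}
```

The normalized strings could also serve as the "normalized patterns" passed to the prompt in place of full rule text.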


6. Improve GitHub Pages transparency

GitHub Pages mining reports should expose:

  • API calls attempted
  • successful responses
  • parse failures
  • truncated responses
  • deferred commits
  • duplicate candidates skipped
  • noCleanup decisions
  • candidate states
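
A small counter object could collect these numbers during a mining run before the report is rendered. The sketch below is illustrative only: the `MiningEvent` names are derived from the list above, and neither type exists in the current code base.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical event kinds, one per report counter listed above. */
enum MiningEvent {
    API_CALL_ATTEMPTED, RESPONSE_OK, PARSE_FAILURE, RESPONSE_TRUNCATED,
    COMMIT_DEFERRED, DUPLICATE_SKIPPED, NO_CLEANUP
}

final class MiningReportStats {
    private final Map<MiningEvent, Integer> counts = new EnumMap<>(MiningEvent.class);

    /** Adds one occurrence of the given event. */
    void increment(MiningEvent event) {
        counts.merge(event, 1, Integer::sum);
    }

    /** Number of occurrences recorded so far (0 if none). */
    int count(MiningEvent event) {
        return counts.getOrDefault(event, 0);
    }

    /** One-line summary in enum declaration order, e.g. for a report header. */
    String summary() {
        return counts.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", "));
    }
}
```

Candidate states are left out here because they can be counted directly from the stored candidates.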

Add direct actions:

  • create issue
  • create PR
  • promote candidate
  • reject candidate

7. Separate scanning from discovery

Clarify the distinction between:

Existing-rule scanning

"Where do existing rules already match?"

vs.

New-cleanup discovery

"Can a new reusable cleanup rule be derived from this commit?"

These are currently too tightly coupled.


Suggested implementation steps

  1. Add MiningCandidate model
  2. Add candidate output directory
  3. Add candidate status model
  4. Add generated lightweight JUnit tests
  5. Add Pages candidate state visualization
  6. Add staged PR generation
  7. Only later allow promotion into bundled rules

Relevant existing components

  • MiningCli
  • HintFileUpdater
  • KnownRulesStore
  • GithubPagesGenerator
  • PromptBuilder
  • GeminiClient
  • BatchTransformationProcessor
  • BatchTransformationProcessorTest
  • HintFileCleanUpBridgeTest

These already provide most required building blocks.
