Staged DSL mining pipeline with candidate tests and reviewable cleanup generation

## Problem

The current mining pipeline mixes several concerns:

- existing-rule scanning
- LLM-based cleanup discovery
- automatic hint-file updates
- GitHub Pages reporting
- issue/PR generation

As a result, the pipeline produces too many low-quality or weakly validated cleanup candidates, and the process is difficult to review and debug.

In particular:

- GREEN findings can end up directly in production `.sandbox-hint` files
- duplicate or weak rules are difficult to detect
- Gemini free-tier responses are often too noisy or too opaque
- generated cleanup candidates usually have no accompanying JUnit tests
- the GitHub Pages report already contains review/navigation concepts, but the pipeline stages are too tightly coupled to build on them

## Goal

Introduce a staged and reviewable mining pipeline:

```text
Commit history
  ↓
LLM discovers cleanup candidate
  ↓
Candidate DSL rule
  ↓
Auto-generated lightweight JUnit test
  ↓
GitHub Pages review state
  ↓
Issue / PR creation
  ↓
Promotion into the bundled production .sandbox-hint files
```

The main idea is:

- discovery remains cheap and experimental
- production cleanup rules remain stable and reviewable
- every cleanup candidate becomes reproducible and testable

---

## Proposed architecture

The following seven changes together make up the staged pipeline.

## 1. Introduce candidate staging

Do not write newly mined GREEN rules directly into:

```text
sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/internal
```

Instead, introduce a separate staging directory such as:

```text
mining-candidates/
```

or:

```text
sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/candidates/
```

Candidate rules remain isolated there until they are explicitly promoted.

---

## 2. Introduce MiningCandidate model

Example:

```json
{
  "dslRule": "...",
  "beforeExample": "...",
  "afterExample": "...",
  "negativeExample": "...",
  "targetHintFile": "performance.sandbox-hint",
  "sourceCommit": "...",
  "status": "DISCOVERED"
}
```

Suggested states:

- DISCOVERED
- DSL_VALID
- TEST_GENERATED
- TEST_PASSED
- READY_FOR_PR
- PROMOTED
- REJECTED
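
A minimal Java sketch of the model and its lifecycle, assuming Java 16+ for records. The transition table (a linear progression in which every non-terminal state may also move to `REJECTED`) and the names `CandidateStatus`, `nextStates`, and `advanceTo` are illustrative assumptions, not existing project code.

```java
import java.util.EnumSet;
import java.util.Set;

/** Lifecycle states from the list above; the allowed transitions are an assumption. */
enum CandidateStatus {
    DISCOVERED, DSL_VALID, TEST_GENERATED, TEST_PASSED, READY_FOR_PR, PROMOTED, REJECTED;

    /** States a candidate may move to from this state. */
    Set<CandidateStatus> nextStates() {
        switch (this) {
            case DISCOVERED:     return EnumSet.of(DSL_VALID, REJECTED);
            case DSL_VALID:      return EnumSet.of(TEST_GENERATED, REJECTED);
            case TEST_GENERATED: return EnumSet.of(TEST_PASSED, REJECTED);
            case TEST_PASSED:    return EnumSet.of(READY_FOR_PR, REJECTED);
            case READY_FOR_PR:   return EnumSet.of(PROMOTED, REJECTED);
            default:             return EnumSet.noneOf(CandidateStatus.class); // PROMOTED and REJECTED are terminal
        }
    }
}

/** Immutable candidate mirroring the JSON fields shown above. */
record MiningCandidate(String dslRule, String beforeExample, String afterExample,
        String negativeExample, String targetHintFile, String sourceCommit,
        CandidateStatus status) {

    /** Returns a copy in the new state, or throws if the transition is not allowed. */
    MiningCandidate advanceTo(CandidateStatus next) {
        if (!status.nextStates().contains(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        return new MiningCandidate(dslRule, beforeExample, afterExample,
                negativeExample, targetHintFile, sourceCommit, next);
    }
}
```

Keeping the candidate immutable means every state change produces a new value, which makes it straightforward to persist each step for later review.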

---

## 3. Auto-generate lightweight JUnit tests

The current Eclipse cleanup tests are verbose and too expensive for early-stage mining validation.

Instead, automatically generate lightweight core tests that use the following as their execution and test foundation:

- `HintFileParser`
- `BatchTransformationProcessor`
- `BatchTransformationProcessorTest`

The generated tests should verify:

1. the DSL rule parses successfully
2. the rule matches `beforeExample`
3. applying the rule to `beforeExample` yields exactly `afterExample`
4. the rule does NOT match `negativeExample`

Example target location:

```text
sandbox_common_core/src/test/java/org/sandbox/jdt/triggerpattern/generated/
```

This allows mined cleanup rules to become reproducible without immediately requiring full Eclipse cleanup bridge tests.
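
The four checks a generated test should perform can be sketched as follows. The `RuleEngine` interface is a hypothetical stand-in for whatever `HintFileParser` and `BatchTransformationProcessor` actually expose. Their real signatures are not shown in this issue, so nothing here should be read as the project's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the real parser/processor; NOT the project's API. */
interface RuleEngine {
    /** Returns true if the DSL text parses. */
    boolean parses(String dslRule);

    /** Returns the rewritten source if the rule matches, otherwise empty. */
    Optional<String> apply(String dslRule, String source);
}

/** Runs the four checks that a generated lightweight test would assert. */
final class CandidateChecks {
    private CandidateChecks() {}

    /** Returns a description of every failed check; an empty list means all checks passed. */
    static List<String> failures(RuleEngine engine, String dslRule,
            String before, String after, String negative) {
        List<String> failures = new ArrayList<>();
        if (!engine.parses(dslRule)) {
            failures.add("DSL does not parse");
            return failures; // remaining checks are meaningless without a parsed rule
        }
        Optional<String> rewritten = engine.apply(dslRule, before);
        if (rewritten.isEmpty()) {
            failures.add("beforeExample does not match");
        } else if (!rewritten.get().equals(after)) {
            failures.add("replacement differs from afterExample");
        }
        if (engine.apply(dslRule, negative).isPresent()) {
            failures.add("negativeExample matches");
        }
        return failures;
    }
}
```

A generated JUnit test could then reduce to a single assertion that `failures(...)` is empty, with the returned messages serving as the failure description.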

---

## 4. Improve Gemini prompt strategy

The current prompt is likely too broad for Gemini free-tier models.

Suggested changes:

### Use discovery-first prompting

Instead of asking for all of the following in one prompt:

- categorization
- full cleanup evaluation
- plugin replacement decisions
- broad architectural analysis

focus the prompt on a single task:

```text
Discover exactly one reusable cleanup idea from the commit.
```

### Force structured output

Require exactly these fields in the response:

- one DSL rule
- one `beforeExample`
- one `afterExample`
- one `negativeExample`
- a `confidence` value
- `noCleanup=true` if the model is uncertain
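
A sketch of how such a response could be validated before it enters the pipeline, assuming Java 16+ and assuming the response has already been deserialized. The record name `DiscoveryResponse` and the `[0, 1]` range for `confidence` are assumptions; the issue does not define the scale.

```java
/** Structured reply requested from the model; field names follow the list above. */
record DiscoveryResponse(String dslRule, String beforeExample, String afterExample,
        String negativeExample, double confidence, boolean noCleanup) {

    /** True if the reply is usable: either an explicit noCleanup or a complete candidate. */
    boolean isWellFormed() {
        if (noCleanup) {
            return true; // an explicit "nothing found" is a valid answer
        }
        return notBlank(dslRule) && notBlank(beforeExample)
                && notBlank(afterExample) && notBlank(negativeExample)
                && confidence >= 0.0 && confidence <= 1.0; // assumed range, see lead-in
    }

    private static boolean notBlank(String s) {
        return s != null && !s.isBlank();
    }
}
```

Responses that fail this check can be counted as parse failures rather than silently dropped.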

### Strong safety rules

Reject:

- import-only transformations
- type-changing replacements
- architecture refactorings
- broad semantic rewrites
- duplicates of existing rules
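
Some of these rejections can be enforced deterministically instead of relying on the prompt alone. As one example, the sketch below flags import-only transformations by comparing the two examples with import lines removed. The class and method names are hypothetical, and the line-based comparison is a simplification that does not parse Java.

```java
import java.util.stream.Collectors;

/** Hypothetical deterministic pre-filter applied to a candidate's examples. */
final class SafetyFilters {
    private SafetyFilters() {}

    /**
     * True if the examples differ, but only in import lines, blank lines,
     * or leading/trailing whitespace. Such candidates carry no reusable rewrite.
     */
    static boolean isImportOnlyChange(String before, String after) {
        return !before.equals(after)
                && withoutImports(before).equals(withoutImports(after));
    }

    /** Drops blank lines and import lines, and strips whitespace around each line. */
    private static String withoutImports(String source) {
        return source.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty() && !line.startsWith("import "))
                .collect(Collectors.joining("\n"));
    }
}
```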

---

## 5. Improve duplicate handling

Do not send all existing rules to Gemini.

Instead:

- keep deterministic duplicate detection in Java
- provide Gemini only with:
  - rule IDs
  - summaries
  - normalized patterns
  - nearest similar rules

This should reduce the model's tendency to copy or rehash existing cleanup ideas.
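
A minimal sketch of deterministic exact-duplicate detection via pattern normalization. It assumes placeholders in the DSL look like `$name`. That syntax, like the class name `RuleNormalizer`, is an assumption to adjust to the real grammar. Finding the nearest similar rules would additionally need a similarity measure, which is not covered here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical normalizer used to compare a candidate pattern against known patterns. */
final class RuleNormalizer {
    // Assumption: placeholders look like $name; adjust to the real DSL syntax.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\w+");

    private RuleNormalizer() {}

    /** Collapses whitespace and renames placeholders to $1, $2, ... in order of first use. */
    static String normalize(String pattern) {
        String collapsed = pattern.strip().replaceAll("\\s+", " ");
        Map<String, String> names = new HashMap<>();
        Matcher m = PLACEHOLDER.matcher(collapsed);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String canonical = names.get(m.group());
            if (canonical == null) {
                canonical = "$" + (names.size() + 1);
                names.put(m.group(), canonical);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** True if the candidate equals any existing pattern after normalization. */
    static boolean isDuplicate(String candidate, List<String> existingPatterns) {
        String normalized = normalize(candidate);
        return existingPatterns.stream()
                .map(RuleNormalizer::normalize)
                .anyMatch(normalized::equals);
    }
}
```

With this normalization, two patterns that differ only in placeholder names or spacing compare as equal, so such duplicates can be skipped before any API call is made.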

---

## 6. Improve GitHub Pages transparency

GitHub Pages mining reports should expose:

- API calls attempted
- successful responses
- parse failures
- truncated responses
- deferred commits
- duplicate candidates skipped
- noCleanup decisions
- candidate states
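
One way to collect these numbers during a run is a small counter keyed by event type, sketched below. The enum constants and the class name `MiningRunStats` are hypothetical. Candidate states are not counted here because they would come from the candidates themselves.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical event types, one per metric listed above (except candidate states). */
enum MiningEvent {
    API_CALL_ATTEMPTED, RESPONSE_OK, PARSE_FAILURE, RESPONSE_TRUNCATED,
    COMMIT_DEFERRED, DUPLICATE_SKIPPED, NO_CLEANUP
}

/** Counts events during one mining run and renders them for the report. */
final class MiningRunStats {
    private final Map<MiningEvent, Integer> counts = new EnumMap<>(MiningEvent.class);

    /** Adds one occurrence of the given event. */
    void increment(MiningEvent event) {
        counts.merge(event, 1, Integer::sum);
    }

    /** Returns how often the event occurred, or 0 if it never did. */
    int count(MiningEvent event) {
        return counts.getOrDefault(event, 0);
    }

    /** Renders one Markdown table row per event, including events with a zero count. */
    String toMarkdownTable() {
        StringBuilder sb = new StringBuilder("| Event | Count |\n| --- | --- |\n");
        for (MiningEvent e : MiningEvent.values()) {
            sb.append("| ").append(e).append(" | ").append(count(e)).append(" |\n");
        }
        return sb.toString();
    }
}
```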

Add direct actions for each candidate:

- create issue
- create PR
- promote candidate
- reject candidate

---

## 7. Separate scanning from discovery

Clarify the distinction between:

### Existing-rule scanning

"Where do existing rules already match?"

vs.

### New-cleanup discovery

"Can a new reusable cleanup rule be derived from this commit?"

These are currently too tightly coupled.
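
At the code level, the separation could look like two independent interfaces that an orchestrator calls side by side. All names below are hypothetical (they are not existing project types), and representing a commit as a plain diff string is a simplification.

```java
import java.util.List;
import java.util.Optional;

/** Answers "where do existing rules already match?"; needs no LLM. */
interface ExistingRuleScanner {
    List<String> matchingRuleIds(String commitDiff);
}

/** Answers "can a new reusable cleanup rule be derived from this commit?". */
interface CleanupDiscoverer {
    Optional<String> discoverRule(String commitDiff);
}

/** Result of processing one commit; the two parts are reported independently. */
record CommitResult(List<String> matchedRuleIds, Optional<String> discoveredRule) {}

/** Orchestrates both concerns without either one knowing about the other. */
final class MiningRun {
    private final ExistingRuleScanner scanner;
    private final CleanupDiscoverer discoverer;

    MiningRun(ExistingRuleScanner scanner, CleanupDiscoverer discoverer) {
        this.scanner = scanner;
        this.discoverer = discoverer;
    }

    CommitResult process(String commitDiff) {
        return new CommitResult(scanner.matchingRuleIds(commitDiff),
                discoverer.discoverRule(commitDiff));
    }
}
```

With this split, scanning can be tested and run without any API access, and discovery can be swapped out or disabled without affecting scan results.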

---

## Suggested implementation steps

1. Add the `MiningCandidate` model
2. Add the candidate output directory
3. Add the candidate status model
4. Add generated lightweight JUnit tests
5. Add candidate state visualization to GitHub Pages
6. Add staged PR generation
7. Allow promotion into bundled rules only after the steps above are in place

---

## Relevant existing components

- `MiningCli`
- `HintFileUpdater`
- `KnownRulesStore`
- `GithubPagesGenerator`
- `PromptBuilder`
- `GeminiClient`
- `BatchTransformationProcessor`
- `BatchTransformationProcessorTest`
- `HintFileCleanUpBridgeTest`

These already provide most of the required building blocks.

