Problem
The current mining pipeline mixes several concerns:
- existing-rule scanning
- LLM-based cleanup discovery
- automatic hint-file updates
- GitHub Pages reporting
- issue/PR generation
This currently produces too many low-quality or weakly validated cleanup candidates and makes the process difficult to review and debug.
In particular:
- GREEN findings can directly end up in productive .sandbox-hint files
- duplicate or weak rules are difficult to detect
- Gemini free-tier responses are often too noisy or too opaque
- generated cleanup candidates usually have no accompanying JUnit tests
- GitHub Pages already contains review/navigation concepts, but the pipeline is still too tightly coupled
Goal
Introduce a staged and reviewable mining pipeline:
Commit history
↓
LLM discovers cleanup candidate
↓
Candidate DSL rule
↓
Auto-generated lightweight JUnit test
↓
GitHub Pages review state
↓
Issue / PR creation
↓
Promotion into productive bundled .sandbox-hint files
The main idea is:
- discovery remains cheap and experimental
- productive cleanup rules remain stable and reviewable
- every cleanup candidate becomes reproducible and testable
Proposed architecture
1. Introduce candidate staging
Do NOT write newly mined GREEN rules directly into:
sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/internal
Instead introduce:
or:
sandbox_common_core/src/main/resources/org/sandbox/jdt/triggerpattern/candidates/
Candidate rules should remain isolated until explicitly promoted.
2. Introduce MiningCandidate model
Example:
{
"dslRule": "...",
"beforeExample": "...",
"afterExample": "...",
"negativeExample": "...",
"targetHintFile": "performance.sandbox-hint",
"sourceCommit": "...",
"status": "DISCOVERED"
}
Suggested states:
- DISCOVERED
- DSL_VALID
- TEST_GENERATED
- TEST_PASSED
- READY_FOR_PR
- PROMOTED
- REJECTED
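A minimal sketch of how the model and its states could look in Java. The field names and state names come from the example above. The transition policy (advance one step at a time, reject from any non-terminal state) and the `withStatus` helper are assumptions for illustration, not existing code.

```java
/** Lifecycle states from the list above. The transition policy is an assumption. */
enum CandidateStatus {
    DISCOVERED, DSL_VALID, TEST_GENERATED, TEST_PASSED, READY_FOR_PR, PROMOTED, REJECTED;

    /** Assumed policy: move forward one step at a time, or reject from any non-terminal state. */
    boolean canMoveTo(CandidateStatus next) {
        if (this == PROMOTED || this == REJECTED) {
            return false; // terminal states
        }
        if (next == REJECTED) {
            return true;
        }
        return next.ordinal() == this.ordinal() + 1;
    }
}

/** Immutable candidate mirroring the JSON example; every status change yields a new instance. */
record MiningCandidate(
        String dslRule,
        String beforeExample,
        String afterExample,
        String negativeExample,
        String targetHintFile,
        String sourceCommit,
        CandidateStatus status) {

    /** Hypothetical helper: returns a copy in the next state or fails on an illegal jump. */
    MiningCandidate withStatus(CandidateStatus next) {
        if (!status.canMoveTo(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        return new MiningCandidate(dslRule, beforeExample, afterExample,
                negativeExample, targetHintFile, sourceCommit, next);
    }
}
```

With a guard like this, a candidate cannot jump from DISCOVERED straight to PROMOTED, which is the failure mode described in the Problem section.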
3. Auto-generate lightweight JUnit tests
Current verbose Eclipse cleanup tests are too expensive for early-stage mining validation.
Instead automatically generate lightweight core tests using:
HintFileParser
BatchTransformationProcessor
BatchTransformationProcessorTest
as the execution/test foundation.
The generated tests should verify:
- DSL parses successfully
- beforeExample matches
- replacement equals afterExample
- negativeExample does NOT match
Example target location:
sandbox_common_core/src/test/java/org/sandbox/jdt/triggerpattern/generated/
This allows mined cleanup rules to become reproducible without immediately requiring full Eclipse cleanup bridge tests.
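The four checks listed above could be expressed roughly as follows. `RuleEngine` is a hypothetical seam standing in for whatever HintFileParser and BatchTransformationProcessor actually expose. Their real signatures are not shown in this issue, so a generated test would call them through an adapter of this shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical seam; the real HintFileParser / BatchTransformationProcessor APIs may differ. */
interface RuleEngine {
    boolean parses(String dslRule);

    /** Returns the rewritten source, or empty when the rule does not match. */
    Optional<String> apply(String dslRule, String source);
}

final class CandidateChecks {
    private CandidateChecks() {
    }

    /** Runs the four checks and returns human-readable failures; an empty list means all passed. */
    static List<String> verify(RuleEngine engine, String dslRule,
            String before, String after, String negative) {
        List<String> failures = new ArrayList<>();
        if (!engine.parses(dslRule)) {
            failures.add("DSL does not parse");
            return failures; // the remaining checks need a parsable rule
        }
        Optional<String> rewritten = engine.apply(dslRule, before);
        if (rewritten.isEmpty()) {
            failures.add("beforeExample does not match");
        } else if (!rewritten.get().strip().equals(after.strip())) {
            failures.add("replacement differs from afterExample");
        }
        if (engine.apply(dslRule, negative).isPresent()) {
            failures.add("negativeExample unexpectedly matches");
        }
        return failures;
    }
}
```

A generated JUnit test would then reduce to a single assertion that `verify(...)` returns an empty list for the candidate's rule and examples.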
4. Improve Gemini prompt strategy
The current prompt is likely too broad for Gemini free-tier models.
Suggested changes:
Use discovery-first prompting
Instead of:
- categorization
- full cleanup evaluation
- plugin replacement decisions
- broad architectural analysis
focus on:
Discover exactly one reusable cleanup idea from the commit.
Force structured output
Require:
- one DSL rule
- one beforeExample
- one afterExample
- one negativeExample
- confidence
- noCleanup=true if uncertain
Strong safety rules
Reject:
- import-only transformations
- type-changing replacements
- architecture refactorings
- broad semantic rewrites
- duplicates of existing rules
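Some of the structured-output and safety rules can be enforced deterministically after the model responds, instead of trusting the prompt alone. The sketch below is one possible gate. The `DiscoveryResponse` record, the 0.0 to 1.0 confidence scale, the 0.7 threshold, and the import-only heuristic are all assumptions. Type-changing and architectural rewrites would need real analysis and are not covered here.

```java
import java.util.Optional;

/** Hypothetical shape of the parsed model response; field names follow the list above. */
record DiscoveryResponse(String dslRule, String beforeExample, String afterExample,
        String negativeExample, double confidence, boolean noCleanup) {
}

final class ResponseGate {
    /** Assumed threshold on an assumed 0.0..1.0 confidence scale. */
    static final double MIN_CONFIDENCE = 0.7;

    private ResponseGate() {
    }

    /** Returns the reason to drop the response, or empty if it may become a candidate. */
    static Optional<String> rejectionReason(DiscoveryResponse r) {
        if (r.noCleanup()) {
            return Optional.of("model reported noCleanup");
        }
        if (isBlank(r.dslRule()) || isBlank(r.beforeExample())
                || isBlank(r.afterExample()) || isBlank(r.negativeExample())) {
            return Optional.of("missing required field");
        }
        if (r.confidence() < MIN_CONFIDENCE) {
            return Optional.of("confidence below threshold");
        }
        if (r.beforeExample().strip().equals(r.afterExample().strip())) {
            return Optional.of("before and after are identical");
        }
        if (importOnly(r.beforeExample()) && importOnly(r.afterExample())) {
            return Optional.of("import-only transformation");
        }
        return Optional.empty();
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }

    /** Heuristic: every non-empty line is an import statement. */
    private static boolean importOnly(String code) {
        return code.lines().map(String::strip).filter(l -> !l.isEmpty())
                .allMatch(l -> l.startsWith("import "));
    }
}
```

The returned reason string could feed the noCleanup and rejection counters in the Pages report.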
5. Improve duplicate handling
Do not send all existing rules to Gemini.
Instead:
- keep deterministic duplicate detection in Java
- provide only:
- rule ids
- summaries
- normalized patterns
- nearest similar rules
This should reduce copying/rehashing of old cleanup ideas.
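One way the deterministic Java-side check could work is to compare normalized pattern fingerprints. The sketch assumes placeholders are written as `$name`, which this issue does not confirm, and `PatternNormalizer` is a hypothetical helper, not an existing class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that turns a pattern into a comparison key. */
final class PatternNormalizer {
    // Assumption: DSL placeholders look like $name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$[A-Za-z_][A-Za-z0-9_]*");

    private PatternNormalizer() {
    }

    /** Renames placeholders to $1, $2, ... in order of first use and removes all whitespace. */
    static String normalize(String pattern) {
        Map<String, String> names = new HashMap<>();
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String canonical = names.get(m.group());
            if (canonical == null) {
                canonical = "$" + (names.size() + 1);
                names.put(m.group(), canonical);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        // Deliberately coarse: a fingerprint for comparison, not valid DSL.
        return out.toString().replaceAll("\\s+", "");
    }

    /** True when the candidate's fingerprint is already among the known fingerprints. */
    static boolean isDuplicate(Set<String> knownNormalized, String candidatePattern) {
        return knownNormalized.contains(normalize(candidatePattern));
    }
}
```

Two rules that differ only in placeholder names or spacing get the same key, so they can be skipped before any API call is made. Only the keys of the nearest existing rules would then need to go into the prompt.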
6. Improve GitHub Pages transparency
GitHub Pages mining reports should expose:
- API calls attempted
- successful responses
- parse failures
- truncated responses
- deferred commits
- duplicate candidates skipped
- noCleanup decisions
- candidate states
Add direct actions:
- create issue
- create PR
- promote candidate
- reject candidate
7. Separate scanning from discovery
Clarify the distinction between:
Existing-rule scanning
"Where do existing rules already match?"
vs.
New-cleanup discovery
"Can a new reusable cleanup rule be derived from this commit?"
These are currently too tightly coupled.
Suggested implementation steps
- Add MiningCandidate model
- Add candidate output directory
- Add candidate status model
- Add generated lightweight JUnit tests
- Add Pages candidate state visualization
- Add staged PR generation
- Only later allow promotion into bundled rules
Relevant existing components
MiningCli
HintFileUpdater
KnownRulesStore
GithubPagesGenerator
PromptBuilder
GeminiClient
BatchTransformationProcessor
BatchTransformationProcessorTest
HintFileCleanUpBridgeTest
These already provide most of the required building blocks.