Context
General-purpose LLMs lack deep knowledge of specific compiler internals (e.g., "How to fix an integer fusion bug in Linalg"). To bridge this gap, we must mine the historical wisdom contained in the llvm-project git history. By extracting atomic "Code Change + Regression Test" pairs and converting them into structured "Recipes," we create a high-quality few-shot dataset that teaches the Agent how verified engineers solve specific compiler problems.
Objective
Implement an end-to-end data pipeline (src/mlirAgent/mining/) that:
- Mines high-quality commits (Logic + Test) from the repository.
- Enriches them with semantic metadata (GitHub PR labels).
- Synthesizes structured YAML "Cookbook Entries" using an LLM.
- Ingests these entries into LanceDB for semantic retrieval.
Scope of Work
- Commit Miner (mine_commits.py):
  - Iterate llvm-project history using pydriller.
  - Heuristic: Keep only commits that modify both Source Code (.cpp, .td) AND Tests (.mlir, .ll).
  - Filter out noise (merges, formatting, reverts).
- Metadata Enricher (enrich_metadata.py):
  - Parse PR numbers from commit messages.
  - Fetch labels from GitHub API (e.g., missed-optimization, crash, vectorizers) to categorize the recipe.
- Recipe Synthesizer (synthesize_recipes.py):
  - Prompt Tier 1 Model: "Analyze this Diff + Test. Summarize the Problem, the Solution Pattern, and the Verification Logic."
  - Output: Structured YAML (Schema: id, problem, solution, verification).
- Cookbook Ingestion (scripts/ingest_cookbook.py):
  - Load YAML recipes.
  - Generate embeddings for the problem.description.
  - Store in data/lancedb for RAG retrieval.
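The mining heuristic and the PR-number parsing step could be sketched roughly as below. This is a minimal sketch, not the final mine_commits.py: the helper names, the noise patterns (revert titles, clang-format commits), the trailing "(#NNNNN)" title convention, and the choice to store file paths in the changes and tests fields are all assumptions made for illustration. Only the file extensions and the use of pydriller come from the scope above.

```python
import re
from typing import Iterable, Iterator, Optional

SOURCE_EXTS = (".cpp", ".td")   # "Source Code" per the heuristic
TEST_EXTS = (".mlir", ".ll")    # "Tests" per the heuristic

# Assumed noise markers (not from the spec): revert titles, clang-format commits.
NOISE_RE = re.compile(r"^\s*revert\b|clang-format", re.IGNORECASE)
# Assumed PR reference style: a trailing "(#12345)" on the commit title.
PR_RE = re.compile(r"\(#(\d+)\)\s*$")


def _title(message: str) -> str:
    """First line of a commit message, or '' for an empty message."""
    lines = message.splitlines()
    return lines[0] if lines else ""


def touches_source_and_tests(paths: Iterable[str]) -> bool:
    """True only if the commit modifies at least one source file AND one test."""
    paths = [p for p in paths if p]
    has_source = any(p.endswith(SOURCE_EXTS) for p in paths)
    has_test = any(p.endswith(TEST_EXTS) for p in paths)
    return has_source and has_test


def is_noise(message: str) -> bool:
    """Heuristic noise check on the commit title (assumed patterns)."""
    return bool(NOISE_RE.search(_title(message)))


def extract_pr_number(message: str) -> Optional[int]:
    """Return the PR number from a '(#NNNNN)' title suffix, if present."""
    match = PR_RE.search(_title(message))
    return int(match.group(1)) if match else None


def iter_candidates(repo_path: str) -> Iterator[dict]:
    """Yield one raw record per commit that passes the filters."""
    from pydriller import Repository  # third-party; imported lazily

    # only_no_merge=True skips merge commits.
    for commit in Repository(repo_path, only_no_merge=True).traverse_commits():
        paths = [mf.new_path or mf.old_path for mf in commit.modified_files]
        paths = [p for p in paths if p]
        if is_noise(commit.msg) or not touches_source_and_tests(paths):
            continue
        yield {
            "hash": commit.hash,
            "pr_number": extract_pr_number(commit.msg),
            "changes": [p for p in paths if p.endswith(SOURCE_EXTS)],
            "tests": [p for p in paths if p.endswith(TEST_EXTS)],
        }
```

Keeping the filters as pure functions over paths and messages means they can be unit-tested without cloning llvm-project; only iter_candidates needs a real checkout.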
Acceptance Criteria
- Test 1 (Mining): Run mine_commits.py on the last 1000 commits of LLVM.
  - Success: Output raw_recipes.jsonl contains >50 entries, all having both changes and tests fields.
- Test 2 (Enrichment): Run enrich_metadata.py.
  - Success: Recipes are updated with github_labels (e.g., ["mlir:linalg"]) fetched from the API.
- Test 3 (Synthesis): Run synthesize_recipes.py on a specific "Integer Fusion" commit.
  - Success: Generates a valid YAML file matching the CookbookEntry schema, correctly identifying the C++ fix and the .mlir check.
- Test 4 (Retrieval):
  - Input: Query "How do I fix a linalg fusion crash?".
  - Condition: Query the LanceDB instance.
  - Success: The system returns the "Integer Fusion" recipe as the top result.
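The structural parts of Test 1 and Test 3 could be checked automatically along these lines. This is a sketch under assumptions: the function names are hypothetical, "having" a field is read as "present and non-empty", and the only nested key checked is problem.description, since that is the only sub-field the scope names. Loading the YAML into a dict (e.g. with a YAML parser) is left out to keep the example stdlib-only.

```python
import json
from typing import Iterable, List

REQUIRED_RECIPE_FIELDS = ("changes", "tests")            # from Test 1
COOKBOOK_KEYS = ("id", "problem", "solution", "verification")  # from the schema


def check_raw_recipes(lines: Iterable[str], more_than: int = 50) -> bool:
    """Test 1: more than `more_than` records, each with changes and tests."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Assumption: an empty list or missing key both count as "not having" it.
    complete = all(rec.get(f) for rec in records for f in REQUIRED_RECIPE_FIELDS)
    return len(records) > more_than and complete


def missing_cookbook_keys(entry: dict) -> List[str]:
    """Test 3 (structure only): list schema keys absent from a parsed entry."""
    missing = [k for k in COOKBOOK_KEYS if k not in entry]
    problem = entry.get("problem")
    # problem.description is what gets embedded, so it must exist.
    if "problem" in entry and not (
        isinstance(problem, dict) and "description" in problem
    ):
        missing.append("problem.description")
    return missing
```

For example, `check_raw_recipes(open("raw_recipes.jsonl"))` would cover the mining criterion, and an empty list from `missing_cookbook_keys` indicates the entry is structurally complete. Whether the recipe correctly identifies the C++ fix and the .mlir check still needs a human or model-based review.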