[Infra][Knowledge] Compiler Recipe Mining & Cookbook Pipeline #13

### Context
General-purpose LLMs lack deep knowledge of specific compiler internals (e.g., "How to fix an integer fusion bug in Linalg"). To bridge this gap, we must mine the engineering knowledge recorded in the `llvm-project` git history. By extracting atomic "Code Change + Regression Test" pairs and converting them into structured "Recipes," we create a high-quality few-shot dataset that teaches the Agent *how* experienced engineers solve specific compiler problems.

### Objective
Implement an end-to-end data pipeline (`src/mlirAgent/mining/`) that:
1.  **Mines** high-quality commits (Logic + Test) from the repository.
2.  **Enriches** them with semantic metadata (GitHub PR labels).
3.  **Synthesizes** structured YAML "Cookbook Entries" using an LLM.
4.  **Ingests** these entries into LanceDB for semantic retrieval.

### Scope of Work
1.  **Commit Miner (`mine_commits.py`):**
    * Iterate `llvm-project` history using `pydriller`.
    * **Heuristic:** Keep only commits that modify *both* Source Code (`.cpp`, `.td`) AND Tests (`.mlir`, `.ll`).
    * Filter out noise (merges, formatting, reverts).
2.  **Metadata Enricher (`enrich_metadata.py`):**
    * Parse PR numbers from commit messages.
    * Fetch labels from GitHub API (e.g., `missed-optimization`, `crash`, `vectorizers`) to categorize the recipe.
3.  **Recipe Synthesizer (`synthesize_recipes.py`):**
    * Prompt Tier 1 Model: "Analyze this Diff + Test. Summarize the Problem, the Solution Pattern, and the Verification Logic."
    * Output: Structured YAML (Schema: `id`, `problem`, `solution`, `verification`).
4.  **Cookbook Ingestion (`scripts/ingest_cookbook.py`):**
    * Load YAML recipes.
    * Generate embeddings for the `problem.description`.
    * Store in `data/lancedb` for RAG retrieval.
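The mining heuristic and the PR-number parsing from steps 1 and 2 can be sketched as two pure functions. This is a minimal sketch, not the final `mine_commits.py`: the function names, the noise pattern, and the assumption that squash-merged PRs end the commit subject with `(#NNNNN)` are illustrative choices that should be checked against the real history.

```python
import re
from pathlib import PurePosixPath

# Extensions taken from the heuristic in the Scope of Work.
SOURCE_EXTS = {".cpp", ".td"}
TEST_EXTS = {".mlir", ".ll"}

# Assumption: the PR number appears as "(#12345)" at the end of the subject line.
PR_NUMBER_RE = re.compile(r"\(#(\d+)\)\s*$")

# Assumption: a first-cut noise filter on the subject line (reverts,
# formatting-only changes, and commits tagged NFC, "no functional change").
NOISE_RE = re.compile(r"^\s*revert\b|clang-format|\bNFC\b", re.IGNORECASE)


def _subject(message):
    """Return the first line of a commit message, or '' if it is empty."""
    lines = message.splitlines()
    return lines[0] if lines else ""


def is_recipe_candidate(paths, message, is_merge=False):
    """Keep a commit only if it touches both source and test files and is not noise."""
    if is_merge or NOISE_RE.search(_subject(message)):
        return False
    exts = {PurePosixPath(p).suffix for p in paths}
    return bool(exts & SOURCE_EXTS) and bool(exts & TEST_EXTS)


def extract_pr_number(message):
    """Return the PR number parsed from the subject line, or None if absent."""
    match = PR_NUMBER_RE.search(_subject(message))
    return int(match.group(1)) if match else None


# Hypothetical commit used only to illustrate the two helpers.
example_paths = [
    "mlir/lib/Dialect/Linalg/Transforms/Fusion.cpp",
    "mlir/test/Dialect/Linalg/fusion.mlir",
]
example_message = "[mlir][linalg] Fix fusion of integer ops (#12345)"

keep = is_recipe_candidate(example_paths, example_message)
pr_number = extract_pr_number(example_message)
```

In a `pydriller`-based miner, the path list would presumably be built from each commit's modified files and `is_merge` from the commit's merge flag. Keeping the filter free of repository access makes it easy to unit-test before running it over the full history.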

### Acceptance Criteria
* **Test 1 (Mining):** Run `mine_commits.py` on the most recent 1,000 commits of `llvm-project`.
    * *Success:* Output `raw_recipes.jsonl` contains more than 50 entries, each with both `changes` and `tests` fields.
* **Test 2 (Enrichment):** Run `enrich_metadata.py`.
    * *Success:* Recipes are updated with `github_labels` (e.g., `["mlir:linalg"]`) fetched from the API.
* **Test 3 (Synthesis):** Run `synthesize_recipes.py` on a specific "Integer Fusion" commit.
    * *Success:* Generates a valid YAML file matching the `CookbookEntry` schema, correctly identifying the C++ fix and the `.mlir` check.
* **Test 4 (Retrieval):**
    * *Input:* Query "How do I fix a linalg fusion crash?".
    * *Condition:* Query the LanceDB instance.
    * *Success:* The system returns the "Integer Fusion" recipe as the top result.
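The checks behind Tests 3 and 4 can be sketched without any external service. The snippet below is an illustration under stated assumptions: `validate_entry` only checks the four top-level keys named in the schema plus `problem.description`, the bag-of-words `embed` function is a stand-in for a real embedding model, and the in-memory ranking stands in for a LanceDB query. The two recipes and their ids are made up.

```python
import math
import re
from collections import Counter

# Top-level keys from the schema in the Scope of Work.
REQUIRED_KEYS = ("id", "problem", "solution", "verification")


def validate_entry(entry):
    """Return a list of schema problems; an empty list means the entry passes."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if "problem" in entry:
        problem = entry["problem"]
        if not isinstance(problem, dict):
            errors.append("problem must be a mapping")
        elif not str(problem.get("description", "")).strip():
            errors.append("problem.description is empty")
    return errors


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_recipe(query, entries):
    """Return the entry whose problem description is most similar to the query."""
    q = embed(query)
    return max(entries, key=lambda e: cosine(q, embed(e["problem"]["description"])))


# Hypothetical recipes, shaped like parsed YAML Cookbook Entries.
recipes = [
    {
        "id": "vectorizer-missed-opt",
        "problem": {"description": "Loop vectorizer misses an optimization for strided loads"},
        "solution": "...",
        "verification": "...",
    },
    {
        "id": "linalg-integer-fusion",
        "problem": {"description": "Crash when fusing integer elementwise ops in linalg fusion"},
        "solution": "...",
        "verification": "...",
    },
]

best = top_recipe("How do I fix a linalg fusion crash?", recipes)
```

With these two made-up recipes, `best["id"]` is `"linalg-integer-fusion"`, which mirrors the top-1 condition in Test 4. The real test would run the same assertion against results returned from the LanceDB table rather than this in-memory ranking.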
