[Infra][Knowledge] Compiler Recipe Mining & Cookbook Pipeline #13

@copparihollmann

Context

General-purpose LLMs lack deep knowledge of specific compiler internals (e.g., "How to fix an integer fusion bug in Linalg"). To bridge this gap, we mine the engineering knowledge recorded in the llvm-project git history. By extracting atomic "Code Change + Regression Test" pairs and converting them into structured "Recipes," we create a high-quality few-shot dataset that teaches the Agent how engineers solved specific compiler problems, with each fix backed by a regression test.

Objective

Implement an end-to-end data pipeline (src/mlirAgent/mining/) that:

  1. Mines high-quality commits (Logic + Test) from the repository.
  2. Enriches them with semantic metadata (GitHub PR labels).
  3. Synthesizes structured YAML "Cookbook Entries" using an LLM.
  4. Ingests these entries into LanceDB for semantic retrieval.

Scope of Work

  1. Commit Miner (mine_commits.py):
    • Iterate llvm-project history using pydriller.
    • Heuristic: Keep only commits that modify both Source Code (.cpp, .td) AND Tests (.mlir, .ll).
    • Filter out noise (merges, formatting, reverts).
  2. Metadata Enricher (enrich_metadata.py):
    • Parse PR numbers from commit messages.
    • Fetch labels from GitHub API (e.g., missed-optimization, crash, vectorizers) to categorize the recipe.
  3. Recipe Synthesizer (synthesize_recipes.py):
    • Prompt Tier 1 Model: "Analyze this Diff + Test. Summarize the Problem, the Solution Pattern, and the Verification Logic."
    • Output: Structured YAML (Schema: id, problem, solution, verification).
  4. Cookbook Ingestion (scripts/ingest_cookbook.py):
    • Load YAML recipes.
    • Generate embeddings for the problem.description.
    • Store in data/lancedb for RAG retrieval.
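
The mining heuristic and PR-number parsing described above can be sketched in plain Python, independent of how commits are read. This is a minimal sketch, not the intended implementation: the function names, the noise markers (`Revert` prefix, `[NFC]` tag, `clang-format` mention), and the assumption that the PR number appears as a trailing `(#NNNN)` in the commit subject are all illustrative choices that would need tuning against the real history.

```python
import re

# Extensions taken from the heuristic above.
SOURCE_EXTS = (".cpp", ".td")
TEST_EXTS = (".mlir", ".ll")

# Hypothetical noise markers (assumption): subject prefixes that usually
# indicate reverts or non-functional changes.
NOISE_PREFIXES = ("revert", "[nfc]")

# Assumption: squash-merged PRs end the subject line with "(#12345)".
PR_SUFFIX_RE = re.compile(r"\(#(\d+)\)\s*$")


def first_line(message):
    """Return the commit subject (first line), or "" for an empty message."""
    lines = message.splitlines()
    return lines[0].strip() if lines else ""


def is_recipe_candidate(paths, message, is_merge=False):
    """True if the commit touches both source and test files and is not noise."""
    if is_merge:
        return False
    subject = first_line(message).lower()
    if subject.startswith(NOISE_PREFIXES) or "clang-format" in subject:
        return False
    touches_source = any(p.endswith(SOURCE_EXTS) for p in paths)
    touches_tests = any(p.endswith(TEST_EXTS) for p in paths)
    return touches_source and touches_tests


def parse_pr_number(message):
    """Extract the trailing PR number from the subject line, if present."""
    match = PR_SUFFIX_RE.search(first_line(message))
    return int(match.group(1)) if match else None
```

When wired to pydriller, `paths`, `message`, and `is_merge` would presumably come from each commit's modified files, message, and merge flag. Keeping the filter as pure functions makes it easy to unit-test without cloning llvm-project.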

Acceptance Criteria

  • Test 1 (Mining): Run mine_commits.py on the last 1000 commits of LLVM.
    • Success: Output raw_recipes.jsonl contains >50 entries, all having both changes and tests fields.
  • Test 2 (Enrichment): Run enrich_metadata.py.
    • Success: Recipes are updated with github_labels (e.g., ["mlir:linalg"]) fetched from the API.
  • Test 3 (Synthesis): Run synthesize_recipes.py on a specific "Integer Fusion" commit.
    • Success: Generates a valid YAML file matching the CookbookEntry schema, correctly identifying the C++ fix and the .mlir check.
  • Test 4 (Retrieval):
    • Input: Query "How do I fix a linalg fusion crash?".
    • Condition: Query the LanceDB instance.
    • Success: The system returns the "Integer Fusion" recipe as the top result.
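
The success conditions for Test 1 and Test 3 can be turned into small automated checks. The sketch below assumes each line of raw_recipes.jsonl is a JSON object with `changes` and `tests` fields, and that a synthesized recipe has already been parsed from YAML into a dict. The helper names are hypothetical, and only the four top-level schema keys listed in the scope are checked.

```python
import json

# Top-level keys from the schema named in the Scope of Work.
REQUIRED_RECIPE_KEYS = ("id", "problem", "solution", "verification")


def check_raw_recipes(jsonl_lines, min_entries=51):
    """Check Test 1: more than 50 entries, each with non-empty changes and tests.

    Returns (passed, incomplete), where `incomplete` lists the indices of
    entries missing either field.
    """
    entries = [json.loads(line) for line in jsonl_lines if line.strip()]
    incomplete = [
        index
        for index, entry in enumerate(entries)
        if not entry.get("changes") or not entry.get("tests")
    ]
    passed = len(entries) >= min_entries and not incomplete
    return passed, incomplete


def missing_recipe_keys(entry):
    """Check Test 3 (partially): list required schema keys absent from a recipe."""
    return [key for key in REQUIRED_RECIPE_KEYS if key not in entry]
```

A test harness could pass `open("raw_recipes.jsonl")` directly as `jsonl_lines`, since a file object iterates line by line. Checking that the recipe correctly identifies the C++ fix and the .mlir check would still need a manual or model-assisted review.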
