Import magpie

danielholanda · danielholanda · commit d6392dc3f79a · 2026-05-28T13:19:23.000-07:00
diff --git a/skills/magpie/.federated.json b/skills/magpie/.federated.json
@@ -0,0 +1,9 @@
+{
+  "source": "amd-agi-magpie",
+  "repo": "AMD-AGI/Magpie",
+  "ref": "main",
+  "commit": "3820dccb9fc7eb06de92f6388886ad672e299ce5",
+  "path": "skills/magpie",
+  "license": "MIT",
+  "imported_at": "2026-05-28T20:04:55Z"
+}
diff --git a/skills/magpie/SKILL.md b/skills/magpie/SKILL.md
@@ -0,0 +1,139 @@
+---
+name: magpie
+description: Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
+---
+
+# Magpie
+
+Magpie is a GPU kernel evaluation and LLM benchmarking framework. Use this skill when performing analyze, compare, benchmark, gap-analysis, or when creating kernel configs or discovering kernels without MCP.
+
+**When describing Magpie's capabilities:** Describe only what is in this skill. Do not add project-specific, pipeline-specific, or other product/org names (e.g. do not mention any parent repo name).
+
+## Entry point
+
+- **CLI:** `magpie` or `python -m Magpie`. Run from the Magpie repo root (or with `PYTHONPATH` including the Magpie package).
+- **Setup:** From repo root, `pip install -e .` (or `make install`).
+
+## Analyze (single or multi-kernel)
+
+Analyze kernel(s) for correctness and performance.
+
+**With kernel config (recommended):**
+
+```bash
+magpie analyze --kernel-config path/to/kernel.yaml
+```
+
+**Inline (single kernel):**
+
+```bash
+magpie analyze path/to/kernel.hip --testcase "./run_test.sh"
+```
+
+- `-k`, `--kernel-config`: YAML with `kernel` or `kernels` (see template below).
+- `-t`, `--testcase`: Command to run the test (required if not in config).
+- `--type`: `hip` | `cuda` | `pytorch` (default: hip).
+- `--compile-cmd`: Custom compile command.
+- `--no-perf`: Skip performance profiling.
+- `-o`, `--output-dir`: Output directory (default: `./results`).
+
+**Config template (single kernel):** Use `kernel:` with `id`, `type`, `source_files`, `working_dir`, `testcase_command`, optional `compile_command`, `env`. See [Magpie/kernel_config.yaml.example](Magpie/kernel_config.yaml.example) and [examples/ck_gemm_add.yaml](examples/ck_gemm_add.yaml).
+
+## Compare (multiple kernels)
+
+Compare and rank multiple kernel implementations.
+
+**With config:**
+
+```bash
+magpie compare --kernel-config path/to/compare.yaml
+```
+
+**Inline:**
+
+```bash
+magpie compare kernel1.hip kernel2.hip --testcase "./run_test.sh"
+```
+
+- `-k`, `--kernel-config`: YAML with `kernels:` list.
+- `--baseline`: Index of baseline kernel (default: 0).
+- `--no-perf`, `-o`: Same as analyze.
+
+Example: [examples/ck_grouped_gemm_compare.yaml](examples/ck_grouped_gemm_compare.yaml).
+
+## Benchmark (vLLM / SGLang)
+
+Run framework-level LLM inference benchmarks with optional profiling and gap analysis.
+
+**With config (recommended):**
+
+```bash
+magpie benchmark --benchmark-config examples/benchmark_vllm.yaml
+```
+
+**CLI overrides:** `magpie benchmark [vllm|sglang] -m <model> --benchmark-config <yaml>` with optional:
+
+- `-m`, `--model`: Model name or path.
+- `-p`, `--precision`: fp8 | fp16 | bf16 | fp4 (default: fp8).
+- `--tp`: Tensor parallel size (default: 1).
+- `--concurrency`, `--input-len`, `--output-len`: Request and sequence settings.
+- `--torch-profiler`, `--system-profiler`: Enable profilers.
+- `--run-mode`: `docker` (default) or `local`.
+- `--docker-image`, `--timeout`, `-o`: Override image, timeout (seconds), output dir.
+
+Example configs: [examples/benchmark_vllm.yaml](examples/benchmark_vllm.yaml), [docs/benchmark.md](docs/benchmark.md).
+
+## Gap analysis (standalone)
+
+Run gap analysis on existing torch trace directories.
+
+```bash
+magpie benchmark gap-analysis --trace-dir path/to/torch_trace
+```
+
+- `--trace-dir`: Path to `torch_trace` dir or benchmark workspace (required).
+- `--start-pct`, `--end-pct`: Analysis window 0–100 (default: 0, 100).
+- `--top-k`: Top bottleneck kernels (default: 20).
+- `--min-duration-us`: Minimum event duration (µs).
+- `--categories`, `--ignore-categories`: Include/exclude event categories.
+
+## GPU info
+
+```bash
+magpie --gpu-info
+```
+
+Shows vendor, architecture, compiler, profiler. No mode required.
+
+## Create kernel config (no CLI)
+
+When the user needs a kernel config file:
+
+1. Emit YAML matching the structure in [Magpie/kernel_config.yaml.example](Magpie/kernel_config.yaml.example): `kernel:` with `id`, `type` (hip|cuda|pytorch), `source_files`, `working_dir`, `testcase_command`, and optionally `compile_command`, `env`.
+2. Write the file to the user's requested path (e.g. `kernel_config.yaml`).
+3. Run: `magpie analyze --kernel-config <that_file>`.
+
+For **compare**, use `kernels:` as a list of kernel entries (each with `id`, `type`, `source_files`, etc.).
+
+## Discover kernels (no CLI)
+
+1. Scan the project for `.hip`, `.cu`, or PyTorch kernel files.
+2. For each candidate, build a kernel config entry (id, type, source_files, working_dir, testcase_command if inferrable; otherwise prompt user).
+3. Optionally write a combined config and run `magpie analyze -k <file>` or `magpie compare -k <file>`.
+
+## Suggest optimizations (no CLI)
+
+1. Read analyze or compare JSON output (from `-o` results or last run).
+2. Use `performance_state`, `performance_result.summary`, and per-kernel stats (dispatch count, duration, utilization).
+3. Suggest improvements (e.g. memory bandwidth, occupancy, kernel fusion) based on the metrics.
+
+## List / get benchmark results (no CLI)
+
+- **List:** Results live under the benchmark `--output-dir` (default: `./results`); each run has a timestamped workspace (e.g. `results/benchmark_vllm_<timestamp>/`).
+- **Get result:** Open `benchmark_report.json` or `inferencex_result.json` in that workspace.
+- **Compare runs:** Diff two workspace reports or run two benchmarks and compare; for TraceLens comparison use TraceLens tooling if available.
+
+## Additional resources
+
+- Full CLI reference: [reference.md](reference.md)
+- Copy-paste command examples: [examples.md](examples.md)
diff --git a/skills/magpie/examples.md b/skills/magpie/examples.md
@@ -0,0 +1,66 @@
+# Magpie command examples
+
+Copy-paste examples. Run from the Magpie repo root (or set `PYTHONPATH`).
+
+## GPU info
+
+```bash
+magpie --gpu-info
+# or
+python -m Magpie --gpu-info
+```
+
+## Analyze (config file)
+
+```bash
+magpie analyze --kernel-config examples/ck_gemm_add.yaml
+magpie analyze --kernel-config my_kernel.yaml -o ./out
+```
+
+## Analyze (inline)
+
+```bash
+magpie analyze ./kernels/matmul.hip --testcase "./run_test.sh"
+magpie analyze ./kernel.hip -t "./run_test.sh" --type hip --no-perf
+```
+
+## Compare (config file)
+
+```bash
+magpie compare --kernel-config examples/ck_grouped_gemm_compare.yaml
+```
+
+## Compare (inline)
+
+```bash
+magpie compare kernel_v1.hip kernel_v2.hip --testcase "./run_test.sh"
+magpie compare k1.hip k2.hip k3.hip -t "./test.sh" --baseline 0
+```
+
+## Benchmark (config file)
+
+```bash
+magpie benchmark --benchmark-config examples/benchmark_vllm.yaml
+magpie benchmark --benchmark-config examples/benchmark_sglang.yaml -o ./results
+```
+
+## Benchmark (CLI overrides)
+
+```bash
+magpie benchmark vllm -m deepseek-ai/DeepSeek-R1-0528 -p fp8 --tp 8
+magpie benchmark --benchmark-config examples/benchmark_vllm.yaml --run-mode local --timeout 1800
+```
+
+## Gap analysis (standalone)
+
+```bash
+magpie benchmark gap-analysis --trace-dir ./results/benchmark_vllm_20260101_120000/torch_trace
+magpie benchmark gap-analysis --trace-dir ./results/run_xyz/torch_trace --top-k 30 --start-pct 20 --end-pct 80
+```
+
+## Verbose and config
+
+```bash
+magpie -v analyze --kernel-config my.yaml
+magpie --config path/to/config.yaml benchmark --benchmark-config examples/benchmark_vllm.yaml
+```
diff --git a/skills/magpie/reference.md b/skills/magpie/reference.md
@@ -0,0 +1,103 @@
+# Magpie CLI reference
+
+Full flag reference for `magpie` / `python -m Magpie`. Global options apply before the mode.
+
+## Global options
+
+| Option | Description |
+|--------|-------------|
+| `--config`, `-c` | Framework configuration file (default: package config.yaml) |
+| `--verbose`, `-v` | Verbose output |
+| `--gpu-info` | Show detected GPU info and exit (no mode required) |
+| `--environment`, `-e` | `local` \| `container` |
+| `--workers`, `-w` | Number of concurrent workers |
+| `--docker-image` | Docker image for container environment |
+
+## analyze
+
+| Option | Description |
+|--------|-------------|
+| `kernels` | Positional: kernel file(s) to analyze |
+| `--kernel-config`, `-k` | Kernel configuration YAML |
+| `--testcase`, `-t` | Testcase command |
+| `--type` | `hip` \| `cuda` \| `pytorch` (default: hip) |
+| `--compile-cmd` | Custom compile command |
+| `--no-perf` | Skip performance profiling |
+| `--output-dir`, `-o` | Output directory (default: ./results) |
+
+## compare
+
+| Option | Description |
+|--------|-------------|
+| `kernels` | Positional: kernel files to compare |
+| `--kernel-config`, `-k` | Kernel configuration YAML (kernels list) |
+| `--testcase`, `-t` | Testcase command (optional) |
+| `--type` | `hip` \| `cuda` \| `pytorch` (default: hip) |
+| `--baseline` | Baseline kernel index (default: 0) |
+| `--no-perf` | Skip performance profiling |
+| `--output-dir`, `-o` | Output directory (default: ./results) |
+
+## benchmark
+
+| Option | Description |
+|--------|-------------|
+| `framework` | Optional positional: `vllm` \| `sglang` |
+| `--benchmark-config`, `-b` | Benchmark configuration YAML |
+| `--model`, `-m` | Model name or path |
+| `--precision`, `-p` | fp8 \| fp16 \| bf16 \| fp4 (default: fp8) |
+| `--tp` | Tensor parallel size (default: 1) |
+| `--concurrency` | Request concurrency (default: 32) |
+| `--input-len` | Input sequence length (default: 1024) |
+| `--output-len` | Output sequence length (default: 512) |
+| `--torch-profiler` | Enable torch profiler |
+| `--system-profiler` | Enable system profiler (rocprof/ncu) |
+| `--run-mode` | `docker` \| `local` |
+| `--docker-image` | Override Docker image |
+| `--inferencex-path` | Path to InferenceX (auto-cloned if empty) |
+| `--benchmark-script` | Override benchmark script name |
+| `--timeout` | Timeout in seconds (default: 3600) |
+| `--output-dir`, `-o` | Output directory (default: ./results) |
+
+## benchmark gap-analysis
+
+| Option | Description |
+|--------|-------------|
+| `--trace-dir` | Path to torch_trace dir or benchmark workspace (required) |
+| `--start-pct` | Start of analysis window 0–100 (default: 0) |
+| `--end-pct` | End of analysis window 0–100 (default: 100) |
+| `--top-k` | Number of top bottleneck kernels (default: 20) |
+| `--min-duration-us` | Minimum event duration in µs (default: 0) |
+| `--categories` | Event categories to include (space-separated) |
+| `--ignore-categories` | Event categories to exclude |
+
+## Kernel config YAML (analyze)
+
+Single kernel:
+
+```yaml
+kernel:
+  id: "my_kernel"
+  type: hip
+  source_files: ["./kernel.hip"]
+  working_dir: "."
+  testcase_command: "./run_test.sh"
+  # optional: compile_command, env
+```
+
+Multiple kernels (compare):
+
+```yaml
+kernels:
+  - id: "v1"
+    type: hip
+    source_files: ["./v1.hip"]
+    testcase_command: "./run_test.sh"
+  - id: "v2"
+    type: hip
+    source_files: ["./v2.hip"]
+    testcase_command: "./run_test.sh"
+```
+
+## Benchmark config YAML
+
+Top-level key `benchmark:` with `framework`, `model`, `precision`, `envs`, `profiler`, `gap_analysis`, `timeout_seconds`, etc. See `examples/benchmark_vllm.yaml` and `docs/benchmark.md`.