| 1 | +--- |
| 2 | +name: magpie |
| 3 | +description: Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels. |
| 4 | +--- |
| 5 | + |
| 6 | +# Magpie |
| 7 | + |
| 8 | +Magpie is a GPU kernel evaluation and LLM benchmarking framework. Use this skill to run analyze, compare, benchmark, or gap-analysis commands, or to create kernel configs or discover kernels without MCP. |
| 9 | + |
| 10 | +**When describing Magpie's capabilities:** Describe only what is in this skill. Do not add project-specific, pipeline-specific, or other product/org names (e.g. do not mention any parent repo name). |
| 11 | + |
| 12 | +## Entry point |
| 13 | + |
| 14 | +- **CLI:** `magpie` or `python -m Magpie`. Run from the Magpie repo root (or with `PYTHONPATH` including the Magpie package). |
| 15 | +- **Setup:** From repo root, `pip install -e .` (or `make install`). |
| 16 | + |
| 17 | +## Analyze (single or multi-kernel) |
| 18 | + |
| 19 | +Analyze kernel(s) for correctness and performance. |
| 20 | + |
| 21 | +**With kernel config (recommended):** |
| 22 | + |
| 23 | +```bash |
| 24 | +magpie analyze --kernel-config path/to/kernel.yaml |
| 25 | +``` |
| 26 | + |
| 27 | +**Inline (single kernel):** |
| 28 | + |
| 29 | +```bash |
| 30 | +magpie analyze path/to/kernel.hip --testcase "./run_test.sh" |
| 31 | +``` |
| 32 | + |
| 33 | +- `-k`, `--kernel-config`: YAML with `kernel` or `kernels` (see template below). |
| 34 | +- `-t`, `--testcase`: Command to run the test (required if not in config). |
| 35 | +- `--type`: `hip` | `cuda` | `pytorch` (default: hip). |
| 36 | +- `--compile-cmd`: Custom compile command. |
| 37 | +- `--no-perf`: Skip performance profiling. |
| 38 | +- `-o`, `--output-dir`: Output directory (default: `./results`). |
| 39 | + |
| 40 | +**Config template (single kernel):** Use `kernel:` with `id`, `type`, `source_files`, `working_dir`, `testcase_command`, optional `compile_command`, `env`. See [Magpie/kernel_config.yaml.example](Magpie/kernel_config.yaml.example) and [examples/ck_gemm_add.yaml](examples/ck_gemm_add.yaml). |
| 41 | + |
| 42 | +## Compare (multiple kernels) |
| 43 | + |
| 44 | +Compare and rank multiple kernel implementations. |
| 45 | + |
| 46 | +**With config:** |
| 47 | + |
| 48 | +```bash |
| 49 | +magpie compare --kernel-config path/to/compare.yaml |
| 50 | +``` |
| 51 | + |
| 52 | +**Inline:** |
| 53 | + |
| 54 | +```bash |
| 55 | +magpie compare kernel1.hip kernel2.hip --testcase "./run_test.sh" |
| 56 | +``` |
| 57 | + |
| 58 | +- `-k`, `--kernel-config`: YAML with `kernels:` list. |
| 59 | +- `--baseline`: Index of baseline kernel (default: 0). |
| 60 | +- `--no-perf`, `-o`: Same as analyze. |
| 61 | + |
| 62 | +Example: [examples/ck_grouped_gemm_compare.yaml](examples/ck_grouped_gemm_compare.yaml). |
| 63 | + |
| 64 | +## Benchmark (vLLM / SGLang) |
| 65 | + |
| 66 | +Run framework-level LLM inference benchmarks with optional profiling and gap analysis. |
| 67 | + |
| 68 | +**With config (recommended):** |
| 69 | + |
| 70 | +```bash |
| 71 | +magpie benchmark --benchmark-config examples/benchmark_vllm.yaml |
| 72 | +``` |
| 73 | + |
| 74 | +**CLI overrides:** `magpie benchmark [vllm|sglang] -m <model> --benchmark-config <yaml>` with optional: |
| 75 | + |
| 76 | +- `-m`, `--model`: Model name or path. |
| 77 | +- `-p`, `--precision`: fp8 | fp16 | bf16 | fp4 (default: fp8). |
| 78 | +- `--tp`: Tensor parallel size (default: 1). |
| 79 | +- `--concurrency`, `--input-len`, `--output-len`: Request and sequence settings. |
| 80 | +- `--torch-profiler`, `--system-profiler`: Enable profilers. |
| 81 | +- `--run-mode`: `docker` (default) or `local`. |
| 82 | +- `--docker-image`, `--timeout`, `-o`: Override image, timeout (seconds), output dir. |
| 83 | + |
| 84 | +Example configs: [examples/benchmark_vllm.yaml](examples/benchmark_vllm.yaml), [docs/benchmark.md](docs/benchmark.md). |
| 85 | + |
| 86 | +## Gap analysis (standalone) |
| 87 | + |
| 88 | +Run gap analysis on existing torch trace directories. |
| 89 | + |
| 90 | +```bash |
| 91 | +magpie benchmark gap-analysis --trace-dir path/to/torch_trace |
| 92 | +``` |
| 93 | + |
| 94 | +- `--trace-dir`: Path to `torch_trace` dir or benchmark workspace (required). |
| 95 | +- `--start-pct`, `--end-pct`: Start and end of the analysis window, as percentages from 0 to 100 (defaults: 0 and 100). |
| 96 | +- `--top-k`: Top bottleneck kernels (default: 20). |
| 97 | +- `--min-duration-us`: Minimum event duration (µs). |
| 98 | +- `--categories`, `--ignore-categories`: Include/exclude event categories. |
| 99 | + |
| 100 | +## GPU info |
| 101 | + |
| 102 | +```bash |
| 103 | +magpie --gpu-info |
| 104 | +``` |
| 105 | + |
| 106 | +Shows vendor, architecture, compiler, profiler. No mode required. |
| 107 | + |
| 108 | +## Create kernel config (no CLI) |
| 109 | + |
| 110 | +When the user needs a kernel config file: |
| 111 | + |
| 112 | +1. Emit YAML matching the structure in [Magpie/kernel_config.yaml.example](Magpie/kernel_config.yaml.example): `kernel:` with `id`, `type` (hip|cuda|pytorch), `source_files`, `working_dir`, `testcase_command`, and optionally `compile_command`, `env`. |
| 113 | +2. Write the file to the user's requested path (e.g. `kernel_config.yaml`). |
| 114 | +3. Run: `magpie analyze --kernel-config <that_file>`. |
| 115 | + |
| 116 | +For **compare**, use `kernels:` as a list of kernel entries (each with `id`, `type`, `source_files`, etc.). |
| 117 | + |
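
The steps above can be sketched as a small script that writes a single-kernel config. The field names are the ones listed in step 1; every value (the kernel id, the paths, the `hipcc` command line, the environment variable) and the exact YAML shapes are illustrative assumptions that should be checked against `Magpie/kernel_config.yaml.example`.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical output path; pass a different one as the first argument.
config_path="${1:-kernel_config.yaml}"

# Field names follow the list in step 1. The values and the YAML shapes
# (source_files as a list, env as a mapping) are assumptions, not taken
# from the Magpie example file.
cat > "$config_path" <<'EOF'
kernel:
  id: vector_add
  type: hip
  source_files:
    - src/vector_add.hip
  working_dir: .
  testcase_command: ./run_test.sh
  compile_command: hipcc -O3 src/vector_add.hip -o vector_add
  env:
    HIP_VISIBLE_DEVICES: "0"
EOF

echo "Wrote $config_path"
echo "Next: magpie analyze --kernel-config $config_path"
```

The script only writes the file and prints the follow-up command, so it can be reviewed before Magpie is run.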
| 118 | +## Discover kernels (no CLI) |
| 119 | + |
| 120 | +1. Scan the project for `.hip`, `.cu`, or PyTorch kernel files. |
| 121 | +2. For each candidate, build a kernel config entry (`id`, `type`, `source_files`, `working_dir`); infer `testcase_command` where possible, otherwise ask the user for it. |
| 122 | +3. Optionally write a combined config and run `magpie analyze -k <file>` or `magpie compare -k <file>`. |
| 123 | + |
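
A minimal sketch of step 1, assuming a conventional layout where `.git` and `build` directories can be skipped. The function name `discover_kernels`, the skipped directories, and the id-from-file-name scheme are hypothetical choices. PyTorch kernels have no dedicated file extension, so they are not covered here and need a project-specific search.

```bash
#!/usr/bin/env bash
set -u

# Print one tab-separated line (id, type, path) per HIP or CUDA source file.
discover_kernels() {
  local root="${1:-.}" src id ktype
  # Skip VCS and build directories (an assumption about project layout).
  find "$root" \( -name .git -o -name build \) -prune -o \
    -type f \( -name '*.hip' -o -name '*.cu' \) -print | sort |
  while IFS= read -r src; do
    case "$src" in
      *.hip) ktype="hip" ;;
      *)     ktype="cuda" ;;
    esac
    # Derive the id from the file name; this naming scheme is hypothetical.
    id="$(basename "${src%.*}")"
    printf '%s\t%s\t%s\n' "$id" "$ktype" "$src"
  done
}

discover_kernels "${1:-.}"
```

Each output line supplies the `id`, `type`, and `source_files` values for one config entry; `working_dir` and `testcase_command` still have to be filled in per kernel.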
| 124 | +## Suggest optimizations (no CLI) |
| 125 | + |
| 126 | +1. Read the analyze or compare JSON output (from the `-o` output directory or the most recent run). |
| 127 | +2. Use `performance_state`, `performance_result.summary`, and per-kernel stats (dispatch count, duration, utilization). |
| 128 | +3. Suggest improvements (e.g. memory bandwidth, occupancy, kernel fusion) based on the metrics. |
| 129 | + |
| 130 | +## List / get benchmark results (no CLI) |
| 131 | + |
| 132 | +- **List:** Results live under the benchmark `--output-dir` (default: `./results`); each run has a timestamped workspace (e.g. `results/benchmark_vllm_<timestamp>/`). |
| 133 | +- **Get result:** Open `benchmark_report.json` or `inferencex_result.json` in that workspace. |
| 134 | +- **Compare runs:** Diff two workspace reports or run two benchmarks and compare; for TraceLens comparison use TraceLens tooling if available. |
| 135 | + |
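
As a rough sketch of the list and get steps above, the helper below prints the report files in the most recently modified workspace. The function name `latest_benchmark_reports` and the `benchmark_*` directory glob are assumptions generalized from the `benchmark_vllm_<timestamp>` example; the selection uses directory modification time and does not parse the timestamp in the name.

```bash
#!/usr/bin/env bash
set -u

# Print the report files found in the most recent benchmark workspace.
latest_benchmark_reports() {
  local results_dir="${1:-./results}" latest report
  # Newest benchmark_* directory by modification time (assumed naming).
  latest="$(ls -1dt "$results_dir"/benchmark_*/ 2>/dev/null | head -n 1)"
  if [ -z "$latest" ]; then
    echo "No benchmark workspaces under $results_dir" >&2
    return 1
  fi
  # Report file names are the two mentioned in the list above.
  for report in benchmark_report.json inferencex_result.json; do
    if [ -f "$latest$report" ]; then
      echo "$latest$report"
    fi
  done
  return 0
}

latest_benchmark_reports "${1:-./results}" || echo "Run a benchmark first."
```

To compare two runs, the printed paths from two workspaces can be passed to a diff tool.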
| 136 | +## Additional resources |
| 137 | + |
| 138 | +- Full CLI reference: [reference.md](reference.md) |
| 139 | +- Copy-paste command examples: [examples.md](examples.md) |