
Commit 1a20687

Check analysis-orchestrator import
1 parent 22af8dc commit 1a20687

18 files changed

Lines changed: 3084 additions & 0 deletions

.claude-plugin/marketplace.json

Lines changed: 5 additions & 0 deletions
@@ -9,6 +9,11 @@
99
"version": "0.1.0"
1010
},
1111
"plugins": [
12+
{
13+
"name": "analysis-orchestrator",
14+
"source": "./skills/analysis-orchestrator",
15+
"description": "Orchestrates modular PyTorch profiler trace analysis with TraceLens: generates perf reports, prepares category data, runs system-level and compute-kernel subagents in parallel, validates outputs, and writes a prioritized stakeholder report (analysis.md)."
16+
},
1217
{
1318
"name": "apu-memory-tuner",
1419
"source": "./skills/apu-memory-tuner",
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
{
2+
"source": "amd-agi-tracelens",
3+
"repo": "AMD-AGI/TraceLens",
4+
"ref": "feat/gw_rename_directories",
5+
"commit": "9b461bb25192ce73cb70de912ce27df515b56b44",
6+
"path": "TraceLens/Agent/Analysis/skills/analysis-orchestrator",
7+
"license": "MIT",
8+
"imported_at": "2026-06-18T21:23:39Z"
9+
}
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
---
2+
name: analysis-orchestrator
3+
description: >-
4+
Orchestrates modular PyTorch profiler trace analysis with TraceLens: generates perf
5+
reports, prepares category data, runs system-level and compute-kernel subagents in
6+
parallel, validates outputs, and writes a prioritized stakeholder report (analysis.md).
7+
Use when the user asks to follow the analysis orchestrator, run the agentic analysis
8+
workflow, analyze a trace, compare two traces, or mentions standalone or comparative
9+
TraceLens analysis.
10+
---
11+
12+
<!--
13+
Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.
14+
15+
See LICENSE for license information.
16+
-->
17+
18+
# Analysis orchestrator
19+
20+
Coordinate **system-level** analysis (CPU/idle, kernel fusion, multi-kernel / comm / memcpy) and **compute-kernel** analysis (GEMM, SDPA, elementwise, etc.): one trace load, shared prep, parallel subagents, then aggregation into `analysis.md`.
21+
22+
## Full procedure
23+
24+
Follow **[reference.md](reference.md)** for every step (user prompts, `<prefix>` / `{CMD}` usage, CLI commands, subagent launch text, validation, report `tee` order, plot embedding, and trace diagnostics).
25+
26+
## Workflow index
27+
28+
```
29+
0. Query User Inputs (Platform, Trace Path(s), Analysis Mode, Environment Setup)
30+
1. Generate Performance Report (branches on analysis mode: training vs inference, then comparison scope)
31+
2-5. Prepare Category Data (GPU Util, Top Ops, Tree Data, Multi-Kernel Data, Category Filtering)
32+
6. System-Level Analysis (PARALLEL) → system_findings/
33+
7. Compute Kernel Subagents (PARALLEL) → category_findings/
34+
7.5. Aggregate → priority_data.json::findings[]
35+
8. Validate Subagent Outputs
36+
9. load_findings + Model Identification (subagent) → metadata/model_info.json
37+
10. Render performance PNG if agent_extension.py is absent
38+
11. Generate analysis.md (orchestrator writes via <prefix> tee), optional extension, embed PNG
39+
```
40+
41+
## Rules
42+
43+
- **Subagents:** Use the Task tool **only** where reference.md says “subagent” (Steps **6**, **7**, **9**). The orchestrator runs everything else, including Step 7.5, using the command prefix from `<output_dir>/cache/cmd_prefix.txt` (`{CMD}` substitution).
44+
- **Language:** Prefer vendor-agnostic terms (GPU kernels, collective communication, vendor GEMM library, DNN primitives, GPU graph). When quoting trace data, real kernel names are fine.
45+
- **Subagent prompts:** Point each subagent at the checked-in agent file under `TraceLens/Agent/Analysis/skills/analysis-orchestrator/agents/<name>.md` (see reference.md for exact paths and prompt shells).
46+
47+
## Primary outputs
48+
49+
- **Deliverable:** `<output_dir>/analysis.md`
50+
- **Internals:** `system_findings/`, `category_findings/`, `category_data/`, `metadata/`, `perf_report*.xlsx`, CSV folders — see package README for layout.
51+
52+
## Agent layout
53+
54+
Project subagents ship with this skill: `TraceLens/Agent/Analysis/skills/analysis-orchestrator/agents/*.md`.
Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
1+
<!--
2+
Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.
3+
4+
See LICENSE for license information.
5+
-->
6+
7+
---
8+
name: convolution-analyzer
9+
description: Analyze Convolution operations for compute efficiency and layout optimization. Use when orchestrator needs Convolution category analysis.
10+
model: claude-opus-4-7-high
11+
---
12+
13+
# Convolution Analysis Subagent
14+
15+
Analyze Convolution operations for compute efficiency and memory-layout optimization, then render P-items from the per-category findings that the analyzer script has already grouped and gated.
16+
17+
---
18+
19+
## Context Passing
20+
21+
When invoked by the orchestrator, you will receive the following context:
22+
23+
**Required context provided by orchestrator:**
24+
- `output_dir`: Base analysis output directory
25+
- `prefix`: Command prefix read from `<output_dir>/cache/cmd_prefix.txt`. It is a template containing a `{CMD}` placeholder; substitute `{CMD}` with the actual command.
26+
- `cat`: `conv_fwd` or `conv_bwd`
27+
- `comparison_scope`: `standalone` (default) or `comparative`
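
The `{CMD}` substitution described for `prefix` can be sketched as below. This is a minimal illustration, not the orchestrator's own code: the helper name `build_command` and the sample template string are hypothetical, and the real template is whatever `<output_dir>/cache/cmd_prefix.txt` contains.

```python
def build_command(prefix_template: str, command: str) -> str:
    """Substitute the {CMD} placeholder in a prefix template with a command.

    Hypothetical helper for illustration; raises if the placeholder is missing
    so that a malformed template is caught before anything is executed.
    """
    if "{CMD}" not in prefix_template:
        raise ValueError("prefix template has no {CMD} placeholder")
    return prefix_template.replace("{CMD}", command)


# Assumed example template; the real one is read from cmd_prefix.txt.
template = "env PYTHONUNBUFFERED=1 {CMD}"
print(build_command(template, "python3 --version"))
```

Plain string replacement is used instead of `str.format` so that other braces in the template (for example in a shell snippet) are left untouched.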
28+
29+
**Input files (pre-computed by orchestrator):**
30+
1. `<output_dir>/category_data/<cat>_ops.csv` - Filtered Convolution operations (includes `call_stack` column for architecture context)
31+
2. `<output_dir>/metadata/<cat>_metadata.json` - Hardware specs
32+
33+
**Output file you must write:**
34+
- `<output_dir>/category_findings/<cat>_findings.md`
35+
36+
---
37+
38+
## Error Handling
39+
40+
**If category data files are missing:**
41+
1. Write a findings file noting: "No Convolution operations found in trace"
42+
2. Return gracefully
43+
44+
**If analysis script fails:**
45+
1. Write a findings file with Status: ERROR
46+
2. **CRITICAL: Do NOT manually analyze the raw CSV data**
47+
3. **CRITICAL: Do NOT provide any bottleneck findings**
48+
49+
---
50+
51+
## Language Guidelines
52+
53+
Use vendor-agnostic terminology:
54+
- "GPU kernels" not "CUDA kernels"
55+
- "DNN library" not vendor-specific names
56+
- Focus on operation semantics, not vendor implementation details
57+
58+
---
59+
60+
## Analysis Workflow
61+
62+
### Step 1: Run Analysis Script
63+
64+
```bash
65+
<prefix> python3 \
66+
TraceLens/Agent/Analysis/category_analyses/convolution_analysis.py \
67+
--output-dir <output_dir> \
68+
--category <cat> \
69+
--comparison_scope <comparison_scope>
70+
```
71+
72+
### Step 2: Read metrics
73+
74+
```bash
75+
cat <output_dir>/category_data/<cat>_metrics.json
76+
```
77+
78+
`category_specific.transpose_overhead_percent` flags memory-layout mismatch (NCHW vs NHWC); reference it in **Identification** for any memory-bound finding when it exceeds ~10%.
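
As a rough sketch of reading that field, the snippet below pulls `category_specific.transpose_overhead_percent` out of a metrics dictionary. The nesting is taken from the dotted name above; the inline sample JSON, the helper name, and the `0.0` default for a missing field are assumptions for illustration, and the strict `> 10.0` comparison is one reading of the approximate "~10%" threshold.

```python
import json


def read_transpose_overhead(metrics: dict) -> float:
    """Return category_specific.transpose_overhead_percent.

    Falls back to 0.0 when the field is absent (an assumed default).
    """
    category_specific = metrics.get("category_specific", {})
    return float(category_specific.get("transpose_overhead_percent", 0.0))


# Inline sample standing in for <output_dir>/category_data/<cat>_metrics.json.
sample = json.loads('{"category_specific": {"transpose_overhead_percent": 12.5}}')
overhead = read_transpose_overhead(sample)
print(f"transpose overhead: {overhead:.1f}%")
print(f"reference in Identification: {overhead > 10.0}")
```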
79+
80+
### Step 3: Classify members by name
81+
82+
Each `category_findings[i].members[j].operation` carries a torch op name (e.g. `aten::conv2d`, `aten::conv_transpose2d`). Classify each member semantically when describing the finding:
83+
84+
- **Standard 2D**: `conv2d` operations (most common in CNNs).
85+
- **1D**: `conv1d` operations (sequence/audio models).
86+
- **3D**: `conv3d` operations (video/volumetric models).
87+
- **Depthwise**: depthwise / channel-wise convolutions (low parallelism, expect lower efficiency).
88+
- **Transpose / Deconv**: transpose convolutions, deconvolutions (also signals potential layout mismatch — cross-reference with `category_specific.transpose_overhead_percent`).
89+
- **Other**: anything not matching the above.
90+
91+
These are guidelines; if a member doesn't fit neatly, classify it semantically.
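
A name-based first pass over the classes above might look like the following sketch. The function name and substring rules are assumptions, not part of TraceLens. A depthwise convolution dispatched through a generic op such as `aten::conv2d` cannot be recognized from the name alone, so the result is only a starting point for the semantic classification.

```python
def classify_conv_op(op_name: str) -> str:
    """Map a torch op name to one of the convolution classes listed above.

    Hypothetical helper: relies only on substrings of the op name.
    """
    name = op_name.lower()
    # Check transpose/deconv first, since e.g. conv_transpose2d also ends in "2d".
    if "transpose" in name or "deconv" in name:
        return "Transpose / Deconv"
    if "depthwise" in name:
        return "Depthwise"
    if "conv1d" in name:
        return "1D"
    if "conv2d" in name:
        return "Standard 2D"
    if "conv3d" in name:
        return "3D"
    return "Other"


for op in ("aten::conv2d", "aten::conv_transpose2d", "aten::convolution"):
    print(op, "->", classify_conv_op(op))
```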
92+
93+
### Step 4: Render P-items from `category_findings`
94+
95+
**efficiency_percent semantics:**
96+
- **Standalone:** Treat `efficiency_percent` as **% of roofline**.
97+
- **Comparative:** Treat `efficiency_percent` as **100 × (trace2 kernel time) / (trace1 kernel time)**.
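
The two readings of `efficiency_percent` can be made concrete with a small sketch. Both helper names are hypothetical. The ratio mirrors the comparative formula above, and the output wording is illustrative rather than mandated by the spec.

```python
def comparative_efficiency(trace1_kernel_ms: float, trace2_kernel_ms: float) -> float:
    """Comparative scope: 100 x (trace2 kernel time) / (trace1 kernel time)."""
    if trace1_kernel_ms <= 0:
        raise ValueError("trace1 kernel time must be positive")
    return 100.0 * trace2_kernel_ms / trace1_kernel_ms


def describe_efficiency(efficiency_percent: float, comparison_scope: str = "standalone") -> str:
    """Phrase efficiency_percent according to the comparison scope."""
    if comparison_scope == "standalone":
        return f"{efficiency_percent:.1f}% of roofline"
    if comparison_scope == "comparative":
        return f"trace2 kernel time is {efficiency_percent:.1f}% of trace1 kernel time"
    raise ValueError(f"unknown comparison_scope: {comparison_scope!r}")


ratio = comparative_efficiency(trace1_kernel_ms=4.0, trace2_kernel_ms=3.0)
print(describe_efficiency(ratio, "comparative"))
```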
98+
99+
Per [`utils/templates/sub_agent_spec.md`](../utils/templates/sub_agent_spec.md), emit one P-item per entry in ascending `rank` order; ground **Insight** / **Action** / **Reasoning for Slowdown** in the `members[]` rows (their `operation`, `efficiency_pct`, `time_ms`, `library`) using the Action Prose Guidance, Expected Efficiency, and Common Patterns below. If `category_findings[]` is empty, emit empty `## Recommendations` and `## Detailed Analysis` sections.
100+
101+
**Markers required:** wrap every `**Impact**` line in `<!-- impact-begin kind=p_item ... --> ... <!-- impact-end -->` and every Detailed Analysis `**Impact estimate:**` two-bullet block in `kind=detail_estimate` markers per spec § Impact markers (REQUIRED), with `low` / `mid` / `high` taken verbatim from `category_findings[i].impact_score{,_low,_high}`.
102+
103+
**Trace observability:** ground every claim in **Reasoning for Slowdown** / **Resolution** in the spec § Trace observability (compute tier) **CAN Infer** rows; for any property in the universal **CANNOT Infer** rows or the category-specific rows in [§ Trace observability (category-specific)](#trace-observability-category-specific) below, use the listed fallback prose instead of speculating.
104+
105+
---
106+
107+
## Action Prose Guidance
108+
109+
Vendor/library/framework-agnostic. Pick the row matching `category_findings[i].bound_type`:
110+
111+
| `bound_type` | Action template |
112+
|---|---|
113+
| `compute` | Profile the dominant member kernels for tile-size and wave-occupancy tuning. Depthwise members will naturally show lower efficiency due to limited parallelism — call that out in **Identification** before recommending tuning. |
114+
| `memory` | If `transpose_overhead_percent` > 10%, recommend converting to channels-last layout (`model.to(memory_format=torch.channels_last)`) to eliminate transpose overhead. Otherwise optimize memory access patterns of the dominant member kernels. |
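
The row selection in the table above reduces to a small decision function, sketched here. The function name and the shortened action strings are illustrative paraphrases of the templates; the rendered findings should use the full template text.

```python
def pick_action(bound_type: str, transpose_overhead_percent: float) -> str:
    """Choose an action template from bound_type and transpose overhead."""
    if bound_type == "compute":
        return "Profile dominant member kernels for tile-size and wave-occupancy tuning."
    if bound_type == "memory":
        if transpose_overhead_percent > 10.0:
            return "Convert to channels-last layout to eliminate transpose overhead."
        return "Optimize memory access patterns of the dominant member kernels."
    raise ValueError(f"unexpected bound_type: {bound_type!r}")


print(pick_action("memory", 25.0))
```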
115+
116+
---
117+
118+
## Expected efficiency per operation type
119+
120+
| Convolution type | Expected efficiency | Bound type |
121+
|------------------|---------------------|------------|
122+
| Large kernels (5×5+) | >70% of peak TFLOPS | compute-bound |
123+
| Standard 3×3 | >70% of peak TFLOPS | compute-bound |
124+
| 1×1 (pointwise) | >60% of peak HBM BW | memory-bound |
125+
| Depthwise | >50% (low parallelism) | varies |
126+
127+
**Transpose overhead bands:**
128+
- `>20%`: high — strongly recommend channels-last.
129+
- `10–20%`: moderate — consider channels-last.
130+
- `<10%`: acceptable.
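
The bands above can be expressed as a threshold check, shown in this minimal sketch. The function name is hypothetical, and the boundary values of exactly 10% and 20% are assigned to the moderate band, following the `10–20%` range as written.

```python
def transpose_overhead_band(percent: float) -> str:
    """Classify transpose overhead into the bands listed above."""
    if percent > 20.0:
        return "high"        # strongly recommend channels-last
    if percent >= 10.0:
        return "moderate"    # consider channels-last
    return "acceptable"


for value in (5.0, 15.0, 35.0):
    print(f"{value:.0f}% -> {transpose_overhead_band(value)}")
```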
131+
132+
---
133+
134+
## Common Patterns
135+
136+
### Transpose overhead (layout mismatch)
137+
- **Symptoms:** Many `batched_transpose` kernels, accounting for roughly 30–45% of convolution time.
138+
- **Cause:** PyTorch defaults to NCHW; vendor DNN libraries prefer NHWC.
139+
- **Algorithmic (primary):** `model.to(memory_format=torch.channels_last)`.
140+
141+
### Large-kernel convolutions
142+
- **Symptoms:** Kernel size > 3×3, compute-bound.
143+
- **Algorithmic:** Limited — these are typically well-optimized.
144+
- **Kernel:** Profile if efficiency below expected band.
145+
146+
### Small-kernel convolutions (1×1, 3×3)
147+
- **Symptoms:** Common in modern architectures.
148+
- **Algorithmic:** Fusion opportunities → defer to kernel fusion analysis.
149+
- **Kernel:** Optimize memory access patterns.
150+
151+
### Depthwise convolutions
152+
- **Symptoms:** Low efficiency due to limited parallelism.
153+
- **Algorithmic:** Limited optimization potential.
154+
- **Kernel:** Specialized depthwise kernels.
155+
156+
---
157+
158+
## Trace observability (category-specific)
159+
160+
The universal CANNOT Infer rows in [`sub_agent_spec.md`](../utils/templates/sub_agent_spec.md) always apply. In addition, Convolution analysis cannot observe:
161+
162+
| NOT observable | Why | Fallback prose |
163+
|----------------|-----|----------------|
164+
| Per-op layout (NCHW vs. NHWC) | Only the aggregate `category_specific.transpose_overhead_percent` is exposed, not per-op layout | "Per-op layout not visible — refer to aggregate `transpose_overhead_percent`." |
165+
166+
---
167+
168+
## Validate findings
169+
170+
Per [`sub_agent_spec.md`](../utils/templates/sub_agent_spec.md) § Validate findings, run:
171+
172+
```bash
173+
<prefix> python3 -c "
174+
import sys
175+
from TraceLens.Agent.Analysis.utils.validation_utils import validate_findings_file
176+
passed, errors = validate_findings_file(sys.argv[1], sys.argv[2], sys.argv[3])
177+
if not passed:
178+
print('FAIL:')
179+
for e in errors:
180+
print(' - ' + e)
181+
sys.exit(1)
182+
print('PASS: Findings file is valid')
183+
" '<output_dir>/category_findings/<cat>_findings.md' 'compute' '<comparison_scope>'
184+
```
185+
186+
If validation fails, fix the findings file and re-run the validation, up to a maximum of 2 retries.
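
The fix-and-revalidate loop can be sketched as follows. This is an assumption-laden illustration: `validate` and `fix` are hypothetical callables standing in for the validation command above and for the edit to the findings file, and only the `(passed, errors)` return shape is taken from the script.

```python
from typing import Callable, List, Tuple


def validate_with_retries(
    validate: Callable[[], Tuple[bool, List[str]]],
    fix: Callable[[List[str]], None],
    max_retries: int = 2,
) -> bool:
    """Run validation, then fix and re-validate up to max_retries times."""
    passed, errors = validate()
    retries = 0
    while not passed and retries < max_retries:
        fix(errors)          # repair the findings file based on reported errors
        retries += 1
        passed, errors = validate()
    return passed


# Toy stand-ins: the validator fails once, then passes after one fix.
state = {"fixed": False}
result = validate_with_retries(
    validate=lambda: (state["fixed"], [] if state["fixed"] else ["missing marker"]),
    fix=lambda errs: state.update(fixed=True),
)
print("PASS" if result else "FAIL")
```

With `max_retries=2` the validator runs at most three times: the initial attempt plus two retries.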
