
Commit d828a54

yaooqinn and Copilot committed
Bundle spark-advisor skill for diagnosis and comparison
- Add spark-advisor skill with SKILL.md and references (diagnostics.md, comparison.md) for TPC-DS benchmark analysis and Gluten-aware tuning
- Update install-skill to install both spark-history-cli and spark-advisor
- Update setup.py package_data to include spark-advisor files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 65586ea commit d828a54

7 files changed

Lines changed: 546 additions & 28 deletions


setup.py

Lines changed: 5 additions & 1 deletion
@@ -15,7 +15,11 @@
     },
     packages=find_packages(),
     package_data={
-        "spark_history_cli": ["skills/*.md"],
+        "spark_history_cli": [
+            "skills/*.md",
+            "skills/spark-advisor/*.md",
+            "skills/spark-advisor/references/*.md",
+        ],
     },
     install_requires=[
         "click>=8.0.0",

spark_history_cli/cli.py

Lines changed: 12 additions & 9 deletions
@@ -18,7 +18,7 @@
 from spark_history_cli.core.client import SparkHistoryClient, HistoryServerError
 from spark_history_cli.core.session import Session
 from spark_history_cli.core import formatters as fmt
-from spark_history_cli.utils.skill_install import default_skill_target, install_copilot_skill
+from spark_history_cli.utils.skill_install import default_skill_target, install_all_skills
 
 
 # ── Shared state via Click context ────────────────────────────────────
@@ -1100,27 +1100,30 @@ def cmd_install_skill(
     target_dir: Path | None,
     force: bool,
 ):
-    """Install the bundled Copilot skill."""
-    destination = target_dir or default_skill_target(scope)
+    """Install the bundled Copilot skills (spark-history-cli + spark-advisor)."""
+    base = target_dir or default_skill_target(scope)
     try:
-        installed_path = install_copilot_skill(destination, force=force)
+        installed = install_all_skills(base, force=force)
     except FileExistsError as exc:
         raise click.ClickException(str(exc)) from exc
 
+    names = [p.name for p in installed]
     result = {
-        "name": "spark-history-cli",
-        "installed_to": str(installed_path),
+        "skills": names,
+        "installed_to": str(base),
         "scope": scope,
         "next_steps": [
             "Run /skills reload in Copilot CLI if it is already open.",
-            "Verify with /skills list or /skills info spark-history-cli.",
-            "Use it with prompts like 'Use /spark-history-cli to inspect my latest SHS app'.",
+            "Verify with /skills list.",
+            "Use spark-history-cli skill for SHS queries.",
+            "Use spark-advisor skill for diagnosis and comparison.",
         ],
     }
     if state.json_mode:
         output_json(result)
     else:
-        click.echo(f"Installed Copilot skill to {installed_path}")
+        for path in installed:
+            click.echo(f"Installed skill: {path.name} -> {path}")
         click.echo("Next steps:")
         for step in result["next_steps"]:
             click.echo(f" - {step}")
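Reviewer note: the body of `install_all_skills` is not part of the hunks shown here, so the following is only a minimal sketch of the contract the caller appears to rely on: it returns a list of installed paths and raises `FileExistsError` when a target exists and `force` is false. The function name `install_all_skills_sketch`, the `source_root` parameter, and the `SKILL_NAMES` tuple are hypothetical and exist only for illustration.

```python
import shutil
from pathlib import Path

# Hypothetical: the commit message names these two skills, but the real
# helper may discover them differently.
SKILL_NAMES = ("spark-history-cli", "spark-advisor")


def install_all_skills_sketch(source_root: Path, base: Path, force: bool = False) -> list[Path]:
    """Copy each skill directory from source_root into base; return installed paths."""
    targets = [base / name for name in SKILL_NAMES]
    # Check every target up front so a conflict does not leave a partial install.
    if not force:
        for target in targets:
            if target.exists():
                raise FileExistsError(f"{target} already exists; use force to overwrite")
    installed = []
    for name, target in zip(SKILL_NAMES, targets):
        if target.exists():
            shutil.rmtree(target)  # only reachable when force=True
        shutil.copytree(source_root / name, target)
        installed.append(target)
    return installed
```

Checking all targets before copying anything is a design choice of this sketch, not something the diff confirms.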
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
+---
+name: "spark-advisor"
+description: "Diagnose, compare, and optimize Apache Spark applications and SQL queries using Spark History Server data. Use this skill whenever the user wants to understand why a Spark app is slow, compare two benchmark runs or TPC-DS results, find performance bottlenecks (skew, GC pressure, shuffle spill, straggler tasks), get tuning recommendations, or optimize Spark/Gluten configurations. Also trigger when the user mentions 'diagnose', 'compare runs', 'why is this query slow', 'tune my Spark job', 'benchmark comparison', 'performance regression', or asks about executor skew, shuffle overhead, AQE effectiveness, or Gluten offloading issues."
+---
+
+# Spark Advisor
+
+You are a Spark performance engineer. Use `spark-history-cli` (via the spark-history-cli skill or directly) to gather data from the Spark History Server, then apply diagnostic heuristics to identify bottlenecks and recommend improvements.
+
+## When to use this skill
+
+- User asks why a Spark application or SQL query is slow
+- User wants to compare two benchmark runs (especially TPC-DS)
+- User asks for tuning advice based on actual execution data
+- User mentions performance regressions between runs
+- User wants to understand executor skew, GC pressure, shuffle overhead, or spill
+- User asks about Gluten/Velox offloading effectiveness
+
+## Prerequisites
+
+- A running Spark History Server accessible via `spark-history-cli`
+- If the CLI is not installed: `pip install spark-history-cli`
+- Default server: `http://localhost:18080` (override with `--server`)
+
+## Core Workflow
+
+### 1. Gather Context
+
+Always start by understanding what the user has and what they want to know:
+- Which application(s)? Get app IDs.
+- Single app diagnosis or comparison between two apps?
+- Specific query concern or overall app performance?
+- What changed between runs (config, data, Spark version, Gluten version)?
+
+### 2. Collect Data
+
+Use `--json` for all data collection so you can reason over structured data.
+
+**For single-app diagnosis**, collect in this order:
+```bash
+# Overview first
+spark-history-cli --json -a <app> summary
+spark-history-cli --json -a <app> env
+
+# Then drill into workload
+spark-history-cli --json -a <app> sql # all SQL executions
+spark-history-cli --json -a <app> stages # all stages
+spark-history-cli --json -a <app> executors --all # executor metrics
+```
+
+**For app comparison**, collect the same data for both apps.
+
+**For specific query diagnosis**, also fetch:
+```bash
+spark-history-cli --json -a <app> sql <exec-id> # SQL detail with nodes/edges
+spark-history-cli -a <app> sql-plan <exec-id> --view final # post-AQE plan
+spark-history-cli -a <app> sql-plan <exec-id> --view initial # pre-AQE plan
+spark-history-cli --json -a <app> sql-jobs <exec-id> # linked jobs
+spark-history-cli --json -a <app> stage-summary <stage> # task quantiles for slow stages
+spark-history-cli --json -a <app> stage-tasks <stage> --sort-by -runtime --length 10 # stragglers
+```
+
+### 3. Analyze
+
+Apply the diagnostic rules from `references/diagnostics.md` to identify issues.
+Key areas to check:
+- **Duration breakdown**: Where is time spent? (stages, tasks, shuffle, GC)
+- **Skew detection**: Compare p50 vs p95 in stage-summary; a p95/p50 ratio above 3x suggests skew
+- **GC pressure**: Total GC time vs executor run time; GC above 10% of run time is concerning
+- **Shuffle overhead**: Large shuffle read/write relative to input size
+- **Spill**: Any memory or disk spill indicates memory pressure
+- **Straggler tasks**: Tasks much slower than peers (check stage-tasks sorted by runtime)
+- **Config issues**: Suboptimal shuffle partitions, executor sizing, serializer choice
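Reviewer note: the thresholds in the list above (p95/p50 above 3x, GC above 10% of run time, any spill) could be applied mechanically along these lines. This is a minimal sketch, assuming the numbers have already been pulled out of the CLI's JSON output; the function and parameter names are hypothetical and no JSON field names are implied.

```python
def analyze_stage(p50_ms: float, p95_ms: float, gc_ms: float,
                  run_ms: float, spill_bytes: int) -> list[str]:
    """Return the heuristic flags from the SKILL.md checklist that fire."""
    findings = []
    # Skew: p95 more than 3x the median task duration.
    if p50_ms > 0 and p95_ms / p50_ms > 3:
        findings.append("skew")
    # GC pressure: GC time above 10% of executor run time.
    if run_ms > 0 and gc_ms / run_ms > 0.10:
        findings.append("gc_pressure")
    # Spill: any spilled bytes at all point to memory pressure.
    if spill_bytes > 0:
        findings.append("spill")
    return findings


flags = analyze_stage(p50_ms=100, p95_ms=450, gc_ms=20_000, run_ms=100_000, spill_bytes=0)
```

The guards on zero denominators simply skip the check; whether a stage with a zero median should count as skewed is left open here.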
+
+### 4. Compare (when applicable)
+
+For TPC-DS benchmark comparisons, see `references/comparison.md` for the structured approach:
+- Match queries by name (q1, q2, ..., q99)
+- Calculate speedup/regression per query
+- Identify top-N improved and regressed queries
+- Drill into regressed queries to find root cause
+- Compare configurations side-by-side
+
+### 5. Report
+
+Produce two outputs:
+1. **Conversation summary**: Key findings and top recommendations (concise, actionable)
+2. **Detailed report file**: Full analysis saved to disk as Markdown
+
+Report structure:
+```markdown
+# Spark Performance Report
+
+## Executive Summary
+<2-3 sentence overview of findings>
+
+## Application Overview
+<summary data for each app>
+
+## Findings
+### Finding 1: <title>
+- **Severity**: High/Medium/Low
+- **Evidence**: <specific metrics>
+- **Recommendation**: <what to change>
+
+## Configuration Comparison (if comparing)
+<side-by-side diff of key Spark properties>
+
+## Query-Level Analysis (if TPC-DS)
+<table of query durations with speedup/regression>
+
+## Recommendations
+<prioritized list of actionable changes>
+```
+
+## Diagnostic Quick Reference
+
+These are the most impactful things to check. For the full diagnostic ruleset, see `references/diagnostics.md`.
+
+| Symptom | What to Check | CLI Command |
+|---------|--------------|-------------|
+| Slow overall | Duration breakdown by stage | `summary`, `stages` |
+| Task skew | p50 vs p95 duration | `stage-summary <id>` |
+| GC pressure | GC time vs run time per executor | `executors --all` |
+| Shuffle heavy | Shuffle bytes vs input bytes | `stages`, `stage <id>` |
+| Memory spill | Spill bytes > 0 | `stage <id>`, `stage-summary <id>` |
+| Straggler tasks | Top tasks by runtime | `stage-tasks <id> --sort-by -runtime` |
+| Bad config | Partition count, executor sizing | `env`, `summary` |
+| AQE ineffective | Initial vs final plan difference | `sql-plan <id> --view initial/final` |
+| Gluten fallback | Non-Transformer nodes in final plan | `sql-plan <id> --view final` |
+
+## Gluten/Velox Awareness
+
+When analyzing Gluten-accelerated applications:
+- **Plan nodes**: `*Transformer` and `*ExecTransformer` nodes indicate Gluten-offloaded operators
+- **Fallback detection**: Non-Transformer nodes in the final plan (e.g., `SortMergeJoin` instead of `ShuffledHashJoinExecTransformer`) indicate Gluten fallback; these are performance-critical to investigate
+- **Columnar exchanges**: `ColumnarExchange` and `ColumnarBroadcastExchange` are Gluten's native shuffle; look for `VeloxColumnarToRow` transitions, which indicate fallback boundaries
+- **Native metrics**: Gluten stages may show different metric patterns (lower GC, different memory profiles) than vanilla Spark stages
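Reviewer note: the naming conventions in the list above suggest a simple first-pass classifier for operator names taken from a final plan. The sketch below assumes the names have already been extracted from `sql-plan` output (the extraction step is not shown and its format is not assumed), and it treats anything unrecognized as a fallback candidate only, since ordinary wrapper nodes would also land in that bucket. The function name and return labels are hypothetical.

```python
def classify_operator(name: str) -> str:
    """Bucket a plan operator name using the naming rules from SKILL.md."""
    # Transitions back to row format mark a fallback boundary.
    if "ColumnarToRow" in name:
        return "transition"
    # *Transformer / *ExecTransformer nodes and Columnar* exchanges are offloaded.
    if name.endswith("Transformer") or name.startswith("Columnar"):
        return "offloaded"
    # Everything else deserves a manual look; it is not proof of fallback.
    return "fallback_candidate"


operators = ["ShuffledHashJoinExecTransformer", "ColumnarExchange",
             "VeloxColumnarToRow", "SortMergeJoin"]
buckets = {op: classify_operator(op) for op in operators}
```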
+
+## References
+
+- `references/diagnostics.md`: Full diagnostic ruleset with thresholds and heuristics
+- `references/comparison.md`: TPC-DS benchmark comparison methodology
Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+# TPC-DS Benchmark Comparison
+
+Methodology for comparing two TPC-DS benchmark runs using `spark-history-cli`.
+
+## Step 1: Identify the Two Runs
+
+Ask the user for two application IDs, or find them automatically:
+```bash
+spark-history-cli --json apps --status completed --limit 10
+```
+
+Label them as **baseline** (older/reference run) and **candidate** (newer/test run).
+
+## Step 2: Collect Summaries
+
+For each app:
+```bash
+spark-history-cli --json -a <app> summary
+spark-history-cli --json -a <app> env
+spark-history-cli --json -a <app> sql
+```
+
+## Step 3: Match Queries
+
+TPC-DS queries are identified by their SQL execution `description` field, which typically contains the query name (e.g., "Query - q1", "q1 [i:1]").
+
+Parse the description to extract query names and match them across the two runs.
+
+**Matching rules**:
+- Strip prefixes like "Query - ", "Delta: Query - ", etc.
+- Match on the base query name (q1, q2, ..., q99, q14a, q14b, q23a, q23b, q24a, q24b, q39a, q39b)
+- TPC-DS has 99 queries but some are split (q14a/b, q23a/b, q24a/b, q39a/b) = ~103 total
+- If a query ran multiple iterations, use the first successful run or the fastest
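Reviewer note: the matching rules above could be implemented roughly as follows. This is a sketch under two assumptions: that the query name appears as a standalone `q<number>` token with an optional `a`/`b` suffix, and that executions arrive as `(description, duration_ms)` pairs. Both helper names are hypothetical, and of the two iteration policies the text allows, this sketch picks the fastest run.

```python
import re
from typing import Iterable, Optional

# q1 .. q99 with an optional a/b suffix, as a standalone token.
_QUERY_NAME = re.compile(r"\bq(\d{1,2})([ab])?\b", re.IGNORECASE)


def extract_query_name(description: str) -> Optional[str]:
    """Return a normalized name such as 'q14a', or None if no query token is found."""
    match = _QUERY_NAME.search(description)
    if match is None:
        return None
    return f"q{int(match.group(1))}{(match.group(2) or '').lower()}"


def fastest_by_query(executions: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Map each query name to its fastest duration in milliseconds."""
    best: dict[str, int] = {}
    for description, duration_ms in executions:
        name = extract_query_name(description)
        if name is None:
            continue  # not a TPC-DS query (setup, warmup, ...)
        if name not in best or duration_ms < best[name]:
            best[name] = duration_ms
    return best


names = [extract_query_name(d) for d in ("Query - q1", "q1 [i:1]", "Delta: Query - q14a")]
```

Searching for the token rather than stripping a fixed list of prefixes avoids having to enumerate every prefix variant, at the cost of matching any stray `qNN` token in a description.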
+
+## Step 4: Calculate Metrics
+
+For each matched query pair:
+```
+duration_baseline = baseline SQL execution duration (ms)
+duration_candidate = candidate SQL execution duration (ms)
+speedup = duration_baseline / duration_candidate
+regression = duration_candidate / duration_baseline (if > 1.0)
+delta_seconds = (duration_candidate - duration_baseline) / 1000
+```
+
+Aggregate metrics:
+```
+total_baseline = sum of all baseline query durations
+total_candidate = sum of all candidate query durations
+overall_speedup = total_baseline / total_candidate
+geomean_speedup = geometric mean of per-query speedups
+```
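Reviewer note: the per-query and aggregate formulas above translate directly into code. The sketch below assumes two dictionaries mapping query name to duration in milliseconds; the function name is hypothetical, and summing only over queries present in both runs is a choice of this sketch rather than something the text specifies.

```python
import math


def compare_runs(baseline: dict[str, int], candidate: dict[str, int]) -> dict:
    """Compute per-query and aggregate speedups for queries present in both runs."""
    rows = []
    for query in sorted(set(baseline) & set(candidate)):
        b, c = baseline[query], candidate[query]
        if b <= 0 or c <= 0:
            continue  # skip degenerate durations to avoid division by zero
        rows.append({
            "query": query,
            "baseline_ms": b,
            "candidate_ms": c,
            "speedup": b / c,
            "delta_seconds": (c - b) / 1000,
        })
    if not rows:
        return {"rows": [], "overall_speedup": None, "geomean_speedup": None}
    total_baseline = sum(r["baseline_ms"] for r in rows)
    total_candidate = sum(r["candidate_ms"] for r in rows)
    # Geometric mean via logs: exp(mean(log(speedup))).
    geomean = math.exp(sum(math.log(r["speedup"]) for r in rows) / len(rows))
    return {
        "rows": rows,
        "overall_speedup": total_baseline / total_candidate,
        "geomean_speedup": geomean,
    }


# Durations taken from the example table in this file (72s -> 85s, 61s -> 45s).
result = compare_runs({"q67": 72_000, "q1": 61_000}, {"q67": 85_000, "q1": 45_000})
```

The two aggregates answer different questions: the overall speedup is dominated by long-running queries, while the geometric mean weights every query equally.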
+
+## Step 5: Produce Comparison Table
+
+Sort queries by time delta in descending order (largest regression first):
+
+```markdown
+| Query | Baseline | Candidate | Delta | Speedup | Status |
+|-------|----------|-----------|-------|---------|--------|
+| q67 | 72s | 85s | +13s | 0.85x | ⚠ REGRESSED |
+| q1 | 61s | 45s | -16s | 1.36x | ✓ IMPROVED |
+| ... | ... | ... | ... | ... | ... |
+```
+
+**Status labels**:
+- `✓ IMPROVED`: speedup > 1.05x (>5% faster)
+- `≈ NEUTRAL`: speedup between 0.95x and 1.05x
+- `⚠ REGRESSED`: speedup < 0.95x (>5% slower)
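Reviewer note: the status thresholds and the row ordering can be expressed in a few lines. In this sketch the boundary values 0.95 and 1.05 fall into the neutral band, which is how the "between" wording is read here, and rows are sorted by signed delta in descending order so the largest regression comes first, matching the example table. The helper names are hypothetical.

```python
def status_label(speedup: float) -> str:
    """Map a speedup factor to the status labels used in the comparison table."""
    if speedup > 1.05:
        return "✓ IMPROVED"
    if speedup < 0.95:
        return "⚠ REGRESSED"
    return "≈ NEUTRAL"


def sort_for_table(rows: list[dict]) -> list[dict]:
    """Order rows by signed delta, largest regression first."""
    return sorted(rows, key=lambda row: row["delta_seconds"], reverse=True)


rows = [
    {"query": "q1", "delta_seconds": -16.0, "speedup": 61 / 45},
    {"query": "q67", "delta_seconds": 13.0, "speedup": 72 / 85},
]
table = [(r["query"], status_label(r["speedup"])) for r in sort_for_table(rows)]
```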
+
+## Step 6: Drill into Regressions
+
+For the top-3 regressed queries, investigate root cause:
+
+1. **Compare plans**: Fetch `sql-plan --view final` for both apps
+   - Did the plan change? (different join strategies, missing Gluten offloading)
+   - Did AQE make different decisions?
+
+2. **Compare stage metrics**: For the slowest stages in each
+   - Check task skew (`stage-summary`)
+   - Check shuffle size changes
+   - Check GC time differences
+
+3. **Compare configurations**: Diff the `env` output
+   - Focus on: shuffle partitions, memory, broadcast threshold, AQE settings
+   - For Gluten: check native engine version, offload settings
+
+## Step 7: Configuration Diff
+
+Extract key Spark properties from both apps' `env` and show differences:
+
+```markdown
+| Property | Baseline | Candidate |
+|----------|----------|-----------|
+| spark.executor.memory | 56g | 64g |
+| spark.sql.shuffle.partitions | 200 | 512 |
+| spark.sql.adaptive.enabled | true | true |
+```
+
+Focus on properties that differ and are performance-relevant:
+- `spark.executor.memory`, `spark.executor.cores`, `spark.executor.instances`
+- `spark.sql.shuffle.partitions`
+- `spark.sql.adaptive.*`
+- `spark.sql.autoBroadcastJoinThreshold`
+- `spark.serializer`
+- Any `spark.gluten.*` or `spark.plugins` changes
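Reviewer note: filtering the two property sets down to the performance-relevant keys listed above and keeping only those that differ might look like this. The sketch assumes each app's Spark properties have already been flattened into a plain `dict[str, str]`; the shape of the `env` JSON is not assumed, the function name is hypothetical, and the sample values are illustrative only.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the list of performance-relevant properties.
KEY_PATTERNS = (
    "spark.executor.memory", "spark.executor.cores", "spark.executor.instances",
    "spark.sql.shuffle.partitions", "spark.sql.adaptive.*",
    "spark.sql.autoBroadcastJoinThreshold", "spark.serializer",
    "spark.gluten.*", "spark.plugins",
)


def diff_properties(baseline: dict[str, str],
                    candidate: dict[str, str]) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Return (key, baseline_value, candidate_value) for relevant keys that differ."""
    rows = []
    for key in sorted(set(baseline) | set(candidate)):
        if not any(fnmatchcase(key, pattern) for pattern in KEY_PATTERNS):
            continue  # not performance-relevant under the list above
        b, c = baseline.get(key), candidate.get(key)
        if b != c:
            rows.append((key, b, c))  # None marks a property missing on one side
    return rows


# Illustrative values only.
base_props = {"spark.executor.memory": "56g", "spark.sql.shuffle.partitions": "200",
              "spark.sql.adaptive.enabled": "true", "spark.app.name": "run-a"}
cand_props = {"spark.executor.memory": "64g", "spark.sql.shuffle.partitions": "512",
              "spark.sql.adaptive.enabled": "true", "spark.app.name": "run-b",
              "spark.gluten.enabled": "true"}
config_diff = diff_properties(base_props, cand_props)
```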
+
+## Step 8: Recommendations
+
+Based on the comparison findings, prioritize recommendations:
+
+1. **If overall regression**: Focus on the top regressed queries and their root causes
+2. **If overall improvement but some regressions**: Note the wins, investigate the regressions
+3. **If config change caused issues**: Recommend reverting specific settings
+4. **If plan changes caused regression**: Recommend query hints or optimizer settings
+
+## Report Template
+
+```markdown
+# TPC-DS Benchmark Comparison
+
+## Overview
+- **Baseline**: <app-name> (<app-id>), <duration>
+- **Candidate**: <app-name> (<app-id>), <duration>
+- **Overall**: <speedup>x (total <baseline-total>s → <candidate-total>s)
+- **Improved**: N queries | **Regressed**: N queries | **Neutral**: N queries
+
+## Configuration Changes
+<config diff table>
+
+## Query Results
+<full comparison table sorted by delta>
+
+## Top Regressions
+### q67: +13s (0.85x)
+- **Root cause**: <explanation>
+- **Evidence**: <specific metrics>
+
+## Top Improvements
+### q1: -16s (1.36x)
+- **Likely cause**: <explanation>
+
+## Recommendations
+1. <highest impact recommendation>
+2. <second recommendation>
+3. ...
+```
