| 1 | +--- |
| 2 | +name: "spark-advisor" |
| 3 | +description: "Diagnose, compare, and optimize Apache Spark applications and SQL queries using Spark History Server data. Use this skill whenever the user wants to understand why a Spark app is slow, compare two benchmark runs or TPC-DS results, find performance bottlenecks (skew, GC pressure, shuffle spill, straggler tasks), get tuning recommendations, or optimize Spark/Gluten configurations. Also trigger when the user mentions 'diagnose', 'compare runs', 'why is this query slow', 'tune my Spark job', 'benchmark comparison', 'performance regression', or asks about executor skew, shuffle overhead, AQE effectiveness, or Gluten offloading issues." |
| 4 | +--- |
| 5 | + |
| 6 | +# Spark Advisor |
| 7 | + |
| 8 | +You are a Spark performance engineer. Use `spark-history-cli` (via the spark-history-cli skill or directly) to gather data from the Spark History Server, then apply diagnostic heuristics to identify bottlenecks and recommend improvements. |
| 9 | + |
| 10 | +## When to use this skill |
| 11 | + |
| 12 | +- User asks why a Spark application or SQL query is slow |
| 13 | +- User wants to compare two benchmark runs (especially TPC-DS) |
| 14 | +- User asks for tuning advice based on actual execution data |
| 15 | +- User mentions performance regressions between runs |
| 16 | +- User wants to understand executor skew, GC pressure, shuffle overhead, or spill |
| 17 | +- User asks about Gluten/Velox offloading effectiveness |
| 18 | + |
| 19 | +## Prerequisites |
| 20 | + |
| 21 | +- A running Spark History Server accessible via `spark-history-cli` |
| 22 | +- If the CLI is not installed: `pip install spark-history-cli` |
| 23 | +- Default server: `http://localhost:18080` (override with `--server`) |
| 24 | + |
| 25 | +## Core Workflow |
| 26 | + |
| 27 | +### 1. Gather Context |
| 28 | + |
| 29 | +Always start by understanding what the user has and what they want to know: |
| 30 | +- Which application(s)? Get app IDs. |
| 31 | +- Single app diagnosis or comparison between two apps? |
| 32 | +- Specific query concern or overall app performance? |
| 33 | +- What changed between runs (config, data, Spark version, Gluten version)? |
| 34 | + |
| 35 | +### 2. Collect Data |
| 36 | + |
| 37 | +Use `--json` for all data collection so you can reason over structured data. |
| 38 | + |
| 39 | +**For single-app diagnosis**, collect in this order: |
| 40 | +```bash |
| 41 | +# Overview first |
| 42 | +spark-history-cli --json -a <app> summary |
| 43 | +spark-history-cli --json -a <app> env |
| 44 | + |
| 45 | +# Then drill into workload |
| 46 | +spark-history-cli --json -a <app> sql # all SQL executions |
| 47 | +spark-history-cli --json -a <app> stages # all stages |
| 48 | +spark-history-cli --json -a <app> executors --all # executor metrics |
| 49 | +``` |
| 50 | + |
| 51 | +**For app comparison**, collect the same data for both apps. |
| 52 | + |
| 53 | +**For specific query diagnosis**, also fetch: |
| 54 | +```bash |
| 55 | +spark-history-cli --json -a <app> sql <exec-id> # SQL detail with nodes/edges |
| 56 | +spark-history-cli -a <app> sql-plan <exec-id> --view final # post-AQE plan |
| 57 | +spark-history-cli -a <app> sql-plan <exec-id> --view initial # pre-AQE plan |
| 58 | +spark-history-cli --json -a <app> sql-jobs <exec-id> # linked jobs |
| 59 | +spark-history-cli --json -a <app> stage-summary <stage> # task quantiles for slow stages |
| 60 | +spark-history-cli --json -a <app> stage-tasks <stage> --sort-by -runtime --length 10 # stragglers |
| 61 | +``` |
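The collection commands above can be scripted. The following is a minimal sketch, assuming `spark-history-cli` is on `PATH`, that `--json` output is a single JSON document on stdout, and that `--server` is accepted before the subcommand. The helper names `build_command` and `collect` are hypothetical and not part of the CLI.

```python
import json
import subprocess
from typing import Any, List, Optional, Sequence


def build_command(app_id: str, subcommand: Sequence[str],
                  server: Optional[str] = None) -> List[str]:
    """Assemble the argv for one spark-history-cli call with --json enabled."""
    cmd = ["spark-history-cli"]
    if server:
        # Assumption: --server is a global option placed before the subcommand.
        cmd += ["--server", server]
    cmd += ["--json", "-a", app_id]
    cmd += list(subcommand)
    return cmd


def collect(app_id: str, subcommand: Sequence[str],
            server: Optional[str] = None) -> Any:
    """Run the CLI and parse stdout as JSON (assumes one JSON document)."""
    result = subprocess.run(build_command(app_id, subcommand, server),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Subcommands listed in the single-app collection order above.
SINGLE_APP_STEPS = [
    ["summary"],
    ["env"],
    ["sql"],
    ["stages"],
    ["executors", "--all"],
]
```

For example, `[collect("<app>", step) for step in SINGLE_APP_STEPS]` would gather the overview and workload data in the documented order; for a comparison, run it once per app ID.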
| 62 | + |
| 63 | +### 3. Analyze |
| 64 | + |
| 65 | +Apply the diagnostic rules from `references/diagnostics.md` to identify issues. |
| 66 | +Key areas to check: |
| 67 | +- **Duration breakdown**: Where is time spent? (stages, tasks, shuffle, GC) |
| 68 | +- **Skew detection**: Compare p50 vs p95 task duration in stage-summary; a p95/p50 ratio above 3x suggests skew |
| 69 | +- **GC pressure**: Total GC time vs executor run time; GC above 10% of run time is concerning |
| 70 | +- **Shuffle overhead**: Large shuffle read/write relative to input size |
| 71 | +- **Spill**: Any memory or disk spill indicates memory pressure |
| 72 | +- **Straggler tasks**: Tasks much slower than peers (check stage-tasks sorted by runtime) |
| 73 | +- **Config issues**: Suboptimal shuffle partitions, executor sizing, serializer choice |
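The numeric checks in the list above can be written as small predicates. This is a sketch only: the thresholds (3x for skew, 10% for GC) come from the list, while the function names and the choice of plain numeric arguments are assumptions. Mapping them onto the actual JSON fields returned by the CLI is left to the caller.

```python
SKEW_RATIO_THRESHOLD = 3.0    # p95 / p50 above this suggests skew
GC_FRACTION_THRESHOLD = 0.10  # GC time / run time above this is concerning


def has_skew(p50_ms: float, p95_ms: float) -> bool:
    """True when the p95 task duration exceeds 3x the p50 duration."""
    if p50_ms <= 0:
        # A zero median with a non-zero tail is treated as skewed.
        return p95_ms > 0
    return p95_ms / p50_ms > SKEW_RATIO_THRESHOLD


def has_gc_pressure(gc_ms: float, run_ms: float) -> bool:
    """True when GC time is more than 10% of executor run time."""
    if run_ms <= 0:
        return False
    return gc_ms / run_ms > GC_FRACTION_THRESHOLD


def has_spill(memory_spill_bytes: int, disk_spill_bytes: int) -> bool:
    """True when any memory or disk spill was recorded."""
    return memory_spill_bytes > 0 or disk_spill_bytes > 0
```

Keeping each rule as its own predicate makes it easy to cite the exact metric values as evidence when a rule fires.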
| 74 | + |
| 75 | +### 4. Compare (when applicable) |
| 76 | + |
| 77 | +For TPC-DS benchmark comparisons, see `references/comparison.md` for the structured approach: |
| 78 | +- Match queries by name (q1, q2, ..., q99) |
| 79 | +- Calculate speedup/regression per query |
| 80 | +- Identify top-N improved and regressed queries |
| 81 | +- Drill into regressed queries to find root cause |
| 82 | +- Compare configurations side-by-side |
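The matching and ranking steps above can be sketched as follows, assuming each run has already been reduced to a mapping from query name to duration. The helpers `compare_runs` and `top_regressions` are hypothetical, and speedup is defined here as baseline duration divided by candidate duration, so values below 1.0 are regressions.

```python
from typing import Dict, List, Tuple

# (query name, baseline duration, candidate duration, speedup)
Row = Tuple[str, float, float, float]


def compare_runs(baseline: Dict[str, float],
                 candidate: Dict[str, float]) -> List[Row]:
    """Match queries by name and compute speedup = baseline / candidate."""
    rows: List[Row] = []
    for name in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[name], candidate[name]
        if base <= 0 or cand <= 0:
            continue  # skip queries without a usable duration
        rows.append((name, base, cand, base / cand))
    return rows


def top_regressions(rows: List[Row], n: int = 5) -> List[Row]:
    """Return up to n regressed queries (speedup < 1.0), worst first."""
    regressed = [row for row in rows if row[3] < 1.0]
    return sorted(regressed, key=lambda row: row[3])[:n]


def top_improvements(rows: List[Row], n: int = 5) -> List[Row]:
    """Return up to n improved queries (speedup > 1.0), best first."""
    improved = [row for row in rows if row[3] > 1.0]
    return sorted(improved, key=lambda row: row[3], reverse=True)[:n]
```

Queries present in only one run are dropped from the comparison; it is worth listing them separately in the report, since a missing query can itself signal a failure.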
| 83 | + |
| 84 | +### 5. Report |
| 85 | + |
| 86 | +Produce two outputs: |
| 87 | +1. **Conversation summary**: Key findings and top recommendations (concise, actionable) |
| 88 | +2. **Detailed report file**: Full analysis saved to disk as Markdown |
| 89 | + |
| 90 | +Report structure: |
| 91 | +```markdown |
| 92 | +# Spark Performance Report |
| 93 | + |
| 94 | +## Executive Summary |
| 95 | +<2-3 sentence overview of findings> |
| 96 | + |
| 97 | +## Application Overview |
| 98 | +<summary data for each app> |
| 99 | + |
| 100 | +## Findings |
| 101 | +### Finding 1: <title> |
| 102 | +- **Severity**: High/Medium/Low |
| 103 | +- **Evidence**: <specific metrics> |
| 104 | +- **Recommendation**: <what to change> |
| 105 | + |
| 106 | +## Configuration Comparison (if comparing) |
| 107 | +<side-by-side diff of key Spark properties> |
| 108 | + |
| 109 | +## Query-Level Analysis (if TPC-DS) |
| 110 | +<table of query durations with speedup/regression> |
| 111 | + |
| 112 | +## Recommendations |
| 113 | +<prioritized list of actionable changes> |
| 114 | +``` |
| 115 | + |
| 116 | +## Diagnostic Quick Reference |
| 117 | + |
| 118 | +These are the most impactful things to check. For the full diagnostic ruleset, see `references/diagnostics.md`. |
| 119 | + |
| 120 | +| Symptom | What to Check | CLI Command | |
| 121 | +|---------|--------------|-------------| |
| 122 | +| Slow overall | Duration breakdown by stage | `summary`, `stages` | |
| 123 | +| Task skew | p50 vs p95 duration | `stage-summary <id>` | |
| 124 | +| GC pressure | GC time vs run time per executor | `executors --all` | |
| 125 | +| Shuffle heavy | Shuffle bytes vs input bytes | `stages`, `stage <id>` | |
| 126 | +| Memory spill | Spill bytes > 0 | `stage <id>`, `stage-summary <id>` | |
| 127 | +| Straggler tasks | Top tasks by runtime | `stage-tasks <id> --sort-by -runtime` | |
| 128 | +| Bad config | Partition count, executor sizing | `env`, `summary` | |
| 129 | +| AQE ineffective | Initial vs final plan difference | `sql-plan <id> --view initial/final` | |
| 130 | +| Gluten fallback | Non-Transformer nodes in final plan | `sql-plan <id> --view final` | |
| 131 | + |
| 132 | +## Gluten/Velox Awareness |
| 133 | + |
| 134 | +When analyzing Gluten-accelerated applications: |
| 135 | +- **Plan nodes**: `*Transformer` and `*ExecTransformer` nodes indicate Gluten-offloaded operators |
| 136 | +- **Fallback detection**: Non-Transformer nodes in the final plan (e.g., `SortMergeJoin` instead of `ShuffledHashJoinExecTransformer`) indicate Gluten fallback; investigate these first, since they run outside the native engine |
| 137 | +- **Columnar exchanges**: `ColumnarExchange` and `ColumnarBroadcastExchange` are Gluten's native shuffle and broadcast exchanges; look for `VeloxColumnarToRow` transitions, which mark fallback boundaries |
| 138 | +- **Native metrics**: Gluten stages may show different metric patterns (lower GC, different memory profiles) than vanilla Spark stages |
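The naming rules above lend themselves to a rough first-pass classifier. The sketch below assumes the operator names have already been extracted from the final plan into a list; how to extract them depends on the `sql-plan` output format, which is not covered here. The category labels and function names are hypothetical, and the heuristic will also flag wrapper nodes such as `AdaptiveSparkPlan`, so treat the result as candidates for manual review.

```python
from typing import List


def classify_node(name: str) -> str:
    """Classify one plan operator name using the naming heuristics above."""
    if name.endswith("Transformer"):
        return "offloaded"            # *Transformer / *ExecTransformer
    if name.startswith("Columnar"):
        return "native_exchange"      # ColumnarExchange, ColumnarBroadcastExchange
    if name == "VeloxColumnarToRow":
        return "transition"           # marks a fallback boundary
    return "fallback_candidate"       # needs manual review


def fallback_candidates(node_names: List[str]) -> List[str]:
    """Return operator names that are not recognised as Gluten-native."""
    return [n for n in node_names if classify_node(n) == "fallback_candidate"]
```

A high count of `transition` nodes next to fallback candidates is a hint that data is moving between columnar and row formats repeatedly.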
| 139 | + |
| 140 | +## References |
| 141 | + |
| 142 | +- `references/diagnostics.md` — Full diagnostic ruleset with thresholds and heuristics |
| 143 | +- `references/comparison.md` — TPC-DS benchmark comparison methodology |