Commit 50d4a01

Add profiling skill and Cargo profiling profile (#655)
## Summary

- Add `[profile.profiling]` to `rust/Cargo.toml`: inherits release optimizations (LTO, opt-level 3, single codegen unit) with debug symbols enabled for readable flamegraphs
- Add `.claude/skills/profiling/SKILL.md`: project-specific skill for profiling the Rubydex indexer

## What the skill covers

- **samply** for interactive CPU flamegraphs (Firefox Profiler in browser)
- **macOS `sample`** for text-based call trees (non-interactive/agent use)
- Phase isolation with `--stop-after` and `--stats`
- How to read profiles (self-time, concentration vs. spread, allocation pressure)
- Memory profiling with `utils/mem-use`
- Before/after comparison workflow with delta tables
- Troubleshooting (permissions, missing debug symbols, run variance)

## Eval results

Ran the skill in a fresh session. It successfully guided profiling end-to-end and identified the dominant bottleneck:

```
Profiling Results Summary
┌────────────┬─────────┬───────┐
│ Phase      │ Time    │ %     │
├────────────┼─────────┼───────┤
│ Listing    │ 0.67s   │ 1.3%  │
├────────────┼─────────┼───────┤
│ Indexing   │ 10.0s   │ 19.0% │
├────────────┼─────────┼───────┤
│ Resolution │ 41.3s   │ 78.4% │
├────────────┼─────────┼───────┤
│ Querying   │ 0.70s   │ 1.3%  │
├────────────┼─────────┼───────┤
│ Total      │ 52.7s   │       │
├────────────┼─────────┼───────┤
│ Memory     │ 4756 MB │       │
└────────────┴─────────┴───────┘

The Bottleneck: name_depth in sorting (100% of sampled resolution time)
┌───────────────────────────────────┬──────────────┬───────┐
│ Function                          │ Self Samples │ %     │
├───────────────────────────────────┼──────────────┼───────┤
│ name_depth (nesting closure)      │ 12,536       │ 59.6% │
├───────────────────────────────────┼──────────────┼───────┤
│ name_depth (parent_scope closure) │ 5,984        │ 28.5% │
├───────────────────────────────────┼──────────────┼───────┤
│ quicksort internals               │ 1,447        │ 6.9%  │
├───────────────────────────────────┼──────────────┼───────┤
│ memcmp                            │ 884          │ 4.2%  │
└───────────────────────────────────┴──────────────┴───────┘

88% of all sampled time is spent in name_depth, called from sort_by in prepare_units. The function recursively walks parent_scope and nesting chains with zero memoization — and the sort calls it O(n log n) times per comparison.
```

This finding led directly to #654 (3x resolution speedup via memoized depth computation).
1 parent 0f06d12 commit 50d4a01

2 files changed: +265 -0 lines changed

.claude/skills/profiling/SKILL.md

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
---
name: profiling
description: >
  Profile Rubydex indexer performance: CPU flamegraphs, memory usage, phase-level timing.
  Use this skill whenever the user mentions profiling, performance, flamegraphs, benchmarking,
  "why is X slow", bottlenecks, hot paths, memory usage, or wants to understand where time
  is spent during indexing/resolution. Also trigger when comparing performance before/after
  a change.
---

# Profiling Rubydex

This skill helps you profile the Rubydex indexer to find CPU and memory bottlenecks.
The indexer has a multi-phase pipeline (listing → indexing → resolution → querying).
Use `--stats` to see which phase dominates, then profile to find what's expensive inside it.

## Profiling tool: samply

Use **samply**, a sampling profiler that opens results in Firefox Profiler (in the browser).
It captures call stacks at high frequency and produces interactive flamegraphs with filtering,
timeline views, and per-function cost breakdowns.

Install if needed:

```bash
cargo install samply
```

## Build profile

Profiling needs optimized code *with* debug symbols so you get real function names in the
flamegraph instead of raw addresses. The workspace Cargo.toml has a custom profile for this:

```toml
# rust/Cargo.toml
[profile.profiling]
inherits = "release"
debug = true      # Full debug symbols for readable flamegraphs
strip = false     # Keep symbols in the binary
```

If this profile doesn't exist yet, **add it** to `rust/Cargo.toml` before profiling. The
release profile uses `lto = true`, `opt-level = 3`, `codegen-units = 1`; the profiling
profile inherits all of that and just adds debug info.

Build with:

```bash
cargo build --profile profiling
```

The binary lands at `rust/target/profiling/rubydex_cli` (not `target/release/`).

The first build is slow (LTO + single codegen unit must recompile everything). Subsequent
builds after small changes are faster since Cargo caches intermediate artifacts in
`target/profiling/`. Don't delete that directory between runs.

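The "add it if missing" step above can be checked before building. The following is a minimal sketch, assuming the manifest lives at `rust/Cargo.toml` as described; the helper name `has_profiling_profile` is hypothetical and not part of the repository.

```bash
# Sketch: report whether a Cargo manifest defines [profile.profiling].
# `has_profiling_profile` and the default manifest path are illustrative assumptions.
has_profiling_profile() {
  local manifest="${1:-rust/Cargo.toml}"
  # Match the table header on its own line, allowing surrounding whitespace.
  grep -Eq '^[[:space:]]*\[profile\.profiling\][[:space:]]*$' "$manifest"
}

if has_profiling_profile rust/Cargo.toml 2>/dev/null; then
  echo "profiling profile found"
else
  echo "profiling profile missing: add it before building"
fi
```

The check only looks for the table header, so it does not verify that `debug = true` or `strip = false` are set inside it.
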
## Running a profile

### Full pipeline

```bash
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

The `--stats` flag prints the timing breakdown and memory stats to stderr after completion,
so you get both the samply profile AND the summary stats in one run.

Useful samply flags:
- `--no-open`: don't auto-open the browser (useful for scripted runs)
- `--save-only`: save the profile to disk without starting the local server; load later
  with `samply load <profile.json>`

### Isolating a phase

Use `--stop-after` to profile only up to a specific stage. This is useful when you want
a cleaner flamegraph focused on one phase without the noise of later stages:

```bash
# Profile only listing + indexing (skip resolution)
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats --stop-after indexing

# Profile through resolution (skip querying)
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats --stop-after resolution
```

Valid `--stop-after` values: `listing`, `indexing`, `resolution`.

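Because a full run can take a long time, a scripted run may want to reject a mistyped stage before launching the profiler. This is a small sketch based only on the three values listed above; `is_valid_stop_after` is a hypothetical helper, and the script prints the command instead of running it.

```bash
# Sketch: validate a --stop-after value before launching a long profiling run.
# `is_valid_stop_after` is a hypothetical helper, not part of rubydex_cli.
is_valid_stop_after() {
  case "$1" in
    listing|indexing|resolution) return 0 ;;
    *) return 1 ;;
  esac
}

stage="${1:-indexing}"
if is_valid_stop_after "$stage"; then
  # Print the command for review; <TARGET_PATH> stays a placeholder.
  echo "samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats --stop-after $stage"
else
  echo "invalid --stop-after value: $stage" >&2
fi
```
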
### Common target paths

The user should have a `DEFAULT_BENCH_WORKSPACE` configured pointing to a target codebase.

For synthetic corpora, use `utils/bench` with size arguments (tiny/small/medium/large/huge),
which auto-generates corpora at `../rubydex_corpora/<size>/`.

## Reading the results

When samply finishes, it automatically opens Firefox Profiler in the browser. Key things
to guide the user through:

### Firefox Profiler tips

1. **Call Tree tab**: shows cumulative time per function, sorted by total cost. Start here
   to find the most expensive call paths.

2. **Flame Graph tab**: visual representation where width = time. Look for wide bars; those
   are the hot functions. Click to zoom in.

3. **Timeline**: shows activity over time. Useful for spotting if one phase is unexpectedly
   long or if there are idle gaps.

4. **Filtering**: type a function name in the filter box to isolate it.

5. **Transform > Focus on subtree**: right-click a function to see only its callees. Perfect
   for drilling into a specific phase.

6. **Transform > Merge function**: collapse recursive calls to see aggregate cost.

### Text-based profiling with `sample` (macOS)

When you can't interact with a browser (e.g., running from a script or agent), use macOS's
built-in `sample` command for a text-based call tree:

```bash
# Start the indexer in the background, then sample it for sample's default duration
rust/target/profiling/rubydex_cli <TARGET_PATH> --stats &
PID=$!
sample $PID -f /tmp/rubydex-sample.txt
wait $PID
```

To target a later phase, wait briefly before attaching and pass an explicit duration in seconds:

```bash
rust/target/profiling/rubydex_cli <TARGET_PATH> --stats &
PID=$!
sleep 2  # let it get past the early phases; adjust the delay to the corpus size
sample $PID 30 -f /tmp/rubydex-sample.txt  # sample for 30 seconds
wait $PID
```

The output is a text call tree with sample counts. Sort by "self" samples to find hot functions.

### How to read the profile

Don't assume which functions are hot; let the data tell you. Hot paths change as the
codebase evolves.

1. **Sort by self-time** (time spent in the function itself, not its callees). This reveals
   the actual hot spots. High total-time but low self-time means the function is just a
   caller, so drill into its children.

2. **Look for concentration vs. spread.** A single function dominating self-time suggests
   an algorithmic fix (memoization, better data structure). Time spread across many functions
   suggests the workload itself is large and optimization requires a different approach.

3. **Check for allocation pressure.** If `alloc` / `malloc` / `realloc` show up prominently
   in self-time, the bottleneck is memory allocation, not computation.

## Memory profiling

For memory, the `--stats` flag already reports Maximum RSS at the end. For deeper memory
analysis:

### Quick check with utils/mem-use

```bash
utils/mem-use rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

This wraps the command with `/usr/bin/time -l` and reports:
- Maximum Resident Set Size (RSS)
- Peak Memory Footprint
- Execution Time

## Before/after comparison workflow

When the user has made a change and wants to measure impact:

1. **Get baseline**: run on the current main/branch before changes:
   ```bash
   samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats 2>&1 | tee /tmp/rubydex-baseline.txt
   ```
   To keep the baseline profile for later comparison, upload it from Firefox Profiler to get
   a shareable permalink.

2. **Apply changes** and rebuild:
   ```bash
   cargo build --profile profiling
   ```

3. **Get new measurement**:
   ```bash
   samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats 2>&1 | tee /tmp/rubydex-after.txt
   ```

4. **Compare**: parse both output files and show a side-by-side delta of:
   - Total time and per-phase breakdown (listing, indexing, resolution, querying)
   - Memory (RSS)
   - Declaration/definition counts (sanity check that output is equivalent)

Present the comparison as a formatted table showing absolute values and % change.

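The "% change" column can be computed with a small helper once the numbers have been extracted from the two output files. This is a sketch only: `pct_change` is a hypothetical function, the values in the table are made-up placeholders, and extracting real numbers depends on the exact `--stats` output format, which is not assumed here.

```bash
# Sketch: percent change between a baseline and a new measurement.
# `pct_change` and the sample numbers below are illustrative, not real results.
pct_change() {
  # Usage: pct_change <baseline> <after>; prints a signed percentage with one decimal.
  awk -v b="$1" -v a="$2" 'BEGIN {
    if (b == 0) { print "n/a"; exit }
    printf "%+.1f%%\n", (a - b) / b * 100
  }'
}

printf '%-12s %10s %10s %8s\n' "Metric" "Baseline" "After" "Change"
printf '%-12s %10s %10s %8s\n' "Total (s)" "50.0" "40.0" "$(pct_change 50.0 40.0)"
printf '%-12s %10s %10s %8s\n' "RSS (MB)" "4000" "4100" "$(pct_change 4000 4100)"
```

A negative value means the new measurement is lower than the baseline, which is an improvement for both time and memory.
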
### Quick benchmark (no flamegraph)

When the user just wants timing/memory numbers without the full profiler overhead:

```bash
# Release build (faster to build than the profiling profile since it skips debug symbols)
cargo build --release
utils/bench                    # uses DEFAULT_BENCH_WORKSPACE
utils/bench medium             # synthetic corpus
utils/bench /path/to/project   # specific directory
```

## Timing phases (--stats output)

The `--stats` flag on rubydex_cli prints a timing breakdown using the internal timer system.
The phases are:

| Phase | What it measures |
|-------|-----------------|
| Initialization | Setup and configuration |
| Listing | File discovery (walking directories, filtering .rb files) |
| Indexing | Parsing Ruby files and extracting definitions/references |
| Resolution | Computing fully qualified names, resolving constants, linearizing ancestors |
| Integrity check | Validating graph consistency (optional) |
| Querying | Building query indices |
| Cleanup | Time not accounted for by other phases |

It also prints:
- Maximum RSS in bytes and MB
- Declaration/definition counts and breakdown by kind
- Orphan rate (definitions not linked to declarations)

## Troubleshooting

### samply permission errors on macOS

samply may need elevated permissions on macOS to sample the target process. If you get
permission errors:

```bash
sudo samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

Or grant Terminal/iTerm the "Developer Tools" permission in System Settings > Privacy & Security.

### Empty or unhelpful flamegraphs

If the flamegraph shows mostly `[unknown]` frames:
- Make sure you built with `--profile profiling` (not `--release`)
- Verify debug symbols: `dsymutil -s rust/target/profiling/rubydex_cli | head -20` should
  show symbol entries.
- On macOS, ensure `strip = false` in the profiling profile

### Comparing runs with variance

Indexer performance can vary by about ±5% between runs due to OS scheduling, file system
caching, etc. For reliable comparisons, run 3 times and take the median, or at minimum run
twice and check consistency before drawing conclusions.
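
Taking the median of three runs can be scripted once the total times are collected. The sketch below assumes the totals have already been read from the `--stats` output by hand or by another script; `median` is a hypothetical helper and the input values are illustrative.

```bash
# Sketch: median of an odd number of timing samples (for example, three runs).
# `median` is a hypothetical helper; the inputs are total times in seconds.
median() {
  # Sort numerically, then print the middle value.
  # For an even count this returns the lower of the two middle values.
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '{ v[NR] = $1 } END { print v[int((NR + 1) / 2)] }'
}

median 52.7 51.9 53.4   # prints 52.7
```

Using the median rather than the mean keeps a single slow outlier (for example, a cold file system cache on the first run) from skewing the comparison.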

rust/Cargo.toml

Lines changed: 5 additions & 0 deletions
@@ -12,6 +12,11 @@ lto = true
 opt-level = 3
 codegen-units = 1
 
+[profile.profiling]
+inherits = "release"
+debug = true
+strip = false
+
 [workspace.lints.clippy]
 # See: https://doc.rust-lang.org/clippy/lints.html
 correctness = "deny"
