Skip to content

Commit 443260d

Browse files
harp-intelclaude
andauthored
docs: add signal handling documentation (#674)
Describes Linux signal handling across all commands (root, reporting, metrics) and scenarios (local/remote targets, Ctrl-C vs external signals). Linked from ARCHITECTURE.md. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5207d18 commit 443260d

File tree

2 files changed

+296
-0
lines changed

2 files changed

+296
-0
lines changed

ARCHITECTURE.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -240,6 +240,8 @@ SIGINT received
240240
→ Print partial results
241241
```
242242

243+
For a detailed description of signal handling across all commands and scenarios (local/remote targets, Ctrl-C vs external signals), see [Signal Handling](docs/signal-handling.md).
244+
243245
## Unit Testing
244246

245247
```bash

docs/signal-handling.md

Lines changed: 294 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,294 @@
1+
# Signal Handling
2+
3+
This document describes how PerfSpect handles Linux signals across all commands and scenarios. It is intended for developers working on the codebase.
4+
5+
## Overview
6+
7+
PerfSpect registers signal handlers for `SIGINT` and `SIGTERM` at multiple levels. The handling strategy differs depending on which command is running and whether the target is local or remote. There are three independent signal-handling layers:
8+
9+
1. **Root-level handler** -- active during the pre-command update check (`cmd/root.go`)
10+
2. **Reporting-command handler** -- used by `report`, `benchmark`, `telemetry`, `flamegraph`, and `lock` (`internal/workflow/signals.go`)
11+
3. **Metrics-command handler** -- used by `metrics` (`cmd/metrics/metrics.go`)
12+
13+
Each layer replaces the previous one via `signal.Notify`, so only one Go-level handler is active at a time for a given signal.
14+
15+
## Background: Process Groups and Terminal Signals
16+
17+
A key design decision throughout PerfSpect is the use of **new process groups** (`Setpgid: true`) for child processes. This is critical for understanding signal propagation:
18+
19+
- When the user presses Ctrl-C in a terminal, the kernel sends `SIGINT` to **every process in the foreground process group**.
20+
- PerfSpect runs the controller script (and `perf stat` for metrics) in a **separate process group** so that the terminal's `SIGINT` is **not** automatically delivered to them.
21+
- This gives PerfSpect's signal handler the opportunity to orchestrate a graceful shutdown rather than having children killed out from under it.
22+
23+
For remote targets, the "child" from PerfSpect's perspective is the local SSH process. The controller script runs on the remote host in its own process group as well.
24+
25+
## 1. Root-Level Handler (`cmd/root.go`)
26+
27+
### When Active
28+
29+
During the `PersistentPreRunE` phase, while PerfSpect checks for available updates (Intel network only). This handler is short-lived and is deregistered via `defer signal.Stop(sigChannel)` before the subcommand begins.
30+
31+
### What It Does
32+
33+
```
34+
SIGINT or SIGTERM received
35+
-> Log the signal
36+
-> Call terminateApplication() to clean up (close log file, remove temp directory)
37+
-> os.Exit(1)
38+
```
39+
40+
### Scenarios
41+
42+
This handler is identical for all scenarios (local/remote, Ctrl-C/external signal) because it runs before any target connection or data collection begins. It simply ensures a clean exit if the user interrupts during the update check.
43+
44+
**Source:** `cmd/root.go:226-239`
45+
46+
## 2. Reporting-Command Handler (`internal/workflow/`)
47+
48+
Used by: `report`, `benchmark`, `telemetry`, `flamegraph`, `lock`
49+
50+
### Architecture
51+
52+
```text
53+
perfspect (Go)
54+
┌─────────────────────────────────────────────┐
55+
│ signal handler goroutine │
56+
│ (configureSignalHandler) │
57+
│ │
58+
│ Listens for SIGINT / SIGTERM │
59+
└──────────┬──────────────────────────────────┘
60+
61+
┌───────────────────┼───────────────────┐
62+
▼ ▼ ▼
63+
Target A Target B Target N
64+
┌──────────┐ ┌──────────┐ ┌──────────┐
65+
│controller│ │controller│ │controller│
66+
│ .sh │ │ .sh │ │ .sh │
67+
│(own PGID)│ │(own PGID)│ │(own PGID)│
68+
└────┬─────┘ └────┬─────┘ └────┬─────┘
69+
│ │ │
70+
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
71+
│ scripts │ │ scripts │ │ scripts │
72+
│(setsid) │ │(setsid) │ │(setsid) │
73+
└─────────┘ └─────────┘ └─────────┘
74+
```
75+
76+
Each target's controller script runs in its own process group. Each individual script within the controller also runs in its own session (`setsid`).
77+
78+
### Signal Flow
79+
80+
The handler is installed in `configureSignalHandler()` (`internal/workflow/signals.go:64-174`). When a signal arrives:
81+
82+
1. **Read each target's controller PID** from `controller.pid` on the target (via `cat` over SSH for remote, locally for local).
83+
2. **Send SIGINT to each controller** using `kill -SIGINT <pid>` on the target.
84+
3. **Spawn per-target goroutines** that poll (`ps -p <pid>`) until the controller exits, with a **20-second timeout**.
85+
4. If the timeout expires, **send SIGKILL** to the controller.
86+
5. After all controllers have exited, **sleep 500ms** to allow local SSH processes to finish cleanly.
87+
6. **Send SIGINT to perfspect's remaining children** (e.g., SSH processes that haven't exited yet) via `util.SignalChildren()`.
88+
89+
### Controller Script Signal Handling
90+
91+
The controller script (`internal/script/script.go:258-402`) has its own `trap handle_sigint SIGINT` handler:
92+
93+
1. For each running concurrent script, call `kill_script()`:
94+
- Send `SIGTERM` to the script's **process group** (`kill -SIGTERM -$pid`). Bash background jobs ignore SIGINT but respond to SIGTERM.
95+
- Poll for up to **60 seconds** for graceful exit.
96+
- If still alive, send `SIGKILL` to the process group.
97+
- Set exit code to 143 (SIGTERM) for killed scripts.
98+
2. Kill the current sequential script if one is running.
99+
3. Call `print_summary` to emit whatever output was collected (partial results).
100+
4. Remove `controller.pid` and exit 0.
101+
102+
### Individual Script Signal Handling
103+
104+
Some long-running scripts have their own signal traps:
105+
106+
| Script | Trap | Behavior |
107+
|--------|------|----------|
108+
| Flamegraph (`flamegraph`) | `trap on_signal INT TERM EXIT` | Stops perf, collapses stacks, prints partial results, restores kernel settings |
109+
| Lock contention (`lock`) | `trap on_signal INT TERM EXIT` | Similar to flamegraph -- stops perf, processes data, prints results |
110+
| Instruction telemetry | `trap finalize INT TERM EXIT` | Kills processwatch, cleans up |
111+
| Kernel telemetry | `trap cleanup SIGINT SIGTERM` | Exits cleanly |
112+
| Syscall telemetry | `trap cleanup INT TERM` | Kills perf, exits cleanly |
113+
114+
### Scenario: Local Target, Ctrl-C in Shell
115+
116+
```text
117+
User presses Ctrl-C
118+
119+
├─→ Kernel sends SIGINT to foreground process group
120+
│ ├─→ perfspect receives SIGINT
121+
│ └─→ controller.sh does NOT receive it (different PGID)
122+
123+
└─→ perfspect's signal handler activates
124+
├─→ Reads controller.pid from local filesystem
125+
├─→ Runs: kill -SIGINT <controller_pid>
126+
├─→ Controller's handle_sigint trap fires
127+
│ ├─→ Sends SIGTERM to each script's process group
128+
│ ├─→ Waits for scripts to exit (up to 60s each)
129+
│ ├─→ Prints collected output
130+
│ └─→ Exits 0
131+
├─→ Signal handler detects controller has exited
132+
├─→ Sends SIGINT to any remaining perfspect children
133+
└─→ Perfspect processes partial results and exits
134+
```
135+
136+
### Scenario: Local Target, SIGINT from Another Process
137+
138+
The flow is identical to Ctrl-C except:
139+
- Only perfspect receives the signal (the kernel does not broadcast to the foreground process group since the signal comes from `kill`, not the terminal).
140+
- The controller and its scripts are unaffected until perfspect's handler explicitly signals them.
141+
- This is the scenario the separate-process-group design specifically addresses -- it ensures the same graceful shutdown path regardless of signal source.
142+
143+
### Scenario: Remote Target, Ctrl-C in Shell
144+
145+
```text
146+
User presses Ctrl-C
147+
148+
├─→ Kernel sends SIGINT to foreground process group
149+
│ ├─→ perfspect receives SIGINT
150+
│ └─→ local SSH process does NOT receive it (different PGID)
151+
152+
└─→ perfspect's signal handler activates
153+
├─→ Reads controller.pid from remote target (via SSH, reusing connection)
154+
├─→ Runs: kill -SIGINT <controller_pid> on remote target (via SSH)
155+
├─→ Remote controller's handle_sigint trap fires
156+
│ ├─→ Sends SIGTERM to each remote script's process group
157+
│ ├─→ Waits for scripts to exit (up to 60s each)
158+
│ ├─→ Prints collected output
159+
│ └─→ Exits 0
160+
├─→ SSH process carrying controller output exits naturally
161+
│ (500ms grace period to avoid interrupting output transfer)
162+
├─→ Signal handler detects controller has exited (via SSH ps -p)
163+
├─→ Sends SIGINT to any remaining local children (SSH processes)
164+
└─→ Perfspect processes partial results and exits
165+
```
166+
167+
**Key detail:** The `signalProcessOnTarget()` function (`internal/workflow/signals.go:26-48`) uses `t.RunCommandEx()` to run `kill` on the target. For remote targets, this opens a new SSH session. The `waitTime` of 15 seconds accounts for the controller's 5-second grace period plus network latency.
168+
169+
### Scenario: Remote Target, SIGINT from Another Process
170+
171+
Same as Ctrl-C except:
172+
- Only perfspect receives the signal; neither the local SSH process nor the remote controller are affected by the initial signal.
173+
- The handler follows the same path: signal the remote controller via SSH, wait for exit, then clean up local SSH children.
174+
175+
### Edge Cases and Known Issues
176+
177+
- **Race condition on PID file removal:** The controller deletes `controller.pid` before fully exiting. The signal handler adds a 500ms sleep after detecting PID file removal to let SSH finish transferring output. See comment at `internal/workflow/signals.go:150-155`.
178+
- **SSH exit code 255:** When SSH is interrupted, it returns exit code 255 even if the remote controller exited cleanly. The output parser checks for the `<---------------------->` delimiter to determine if usable output exists and processes it despite the non-zero exit code (`internal/script/script.go:140-149`).
179+
180+
## 3. Metrics-Command Handler (`cmd/metrics/`)
181+
182+
The `metrics` command has its own signal handling because it does not use the controller script. Instead, it streams `perf stat` output directly.
183+
184+
### Architecture
185+
186+
```text
187+
perfspect (Go)
188+
┌─────────────────────────────────────────────┐
189+
│ signalManager │
190+
│ - ctx/cancel (context.Context) │
191+
│ - sigChannel (SIGINT, SIGTERM) │
192+
│ - handleSignals() goroutine │
193+
└──────────┬──────────────────────────────────┘
194+
195+
┌───────────────────┼───────────────────┐
196+
▼ ▼ ▼
197+
Target A Target B Target N
198+
┌──────────┐ ┌──────────┐ ┌──────────┐
199+
│ perf stat│ │ perf stat│ │ perf stat│
200+
│(streamed)│ │(streamed)│ │(streamed)│
201+
└──────────┘ └──────────┘ └──────────┘
202+
```
203+
204+
### Signal Flow
205+
206+
The `signalManager` (`cmd/metrics/metrics.go:70-132`) uses a `context.Context` for coordinated shutdown:
207+
208+
1. **On signal receipt** (or internal error triggering `triggerShutdown()`):
209+
- Cancel the context (`sm.cancel()`).
210+
- Send `SIGKILL` to all child processes (`util.SignalChildren(syscall.SIGKILL)`).
211+
212+
2. **Collection loop** (`collectOnTarget`, `cmd/metrics/metrics.go:1332`):
213+
- Checks `signalMgr.shouldStop()` on each iteration. When true, exits the loop.
214+
215+
3. **Processing pipeline** (`processPerfOutput`, `cmd/metrics/metrics.go:1539-1606`):
216+
- Also monitors the context for cancellation.
217+
- Can itself trigger shutdown via `signalMgr.triggerShutdown()` on repeated processing errors or cgroup refresh timeout.
218+
219+
**Important difference from reporting commands:** The metrics handler sends `SIGKILL` immediately rather than attempting graceful shutdown with `SIGINT` then escalating to `SIGKILL`. This is because `perf stat` is a streaming process and partial output is already captured in real-time -- there is no risk of losing already-collected data.
220+
221+
### Scenario: Local Target, Ctrl-C in Shell
222+
223+
```text
224+
User presses Ctrl-C
225+
226+
├─→ Kernel sends SIGINT to foreground process group
227+
│ ├─→ perfspect receives SIGINT
228+
│ └─→ perf stat does NOT receive it (running via RunScriptStream,
229+
│ but the script command is not in perfspect's process group
230+
│ because metrics uses the target layer which isolates children)
231+
232+
└─→ signalManager.handleSignals() activates
233+
├─→ Cancels context
234+
├─→ Sends SIGKILL to all child processes
235+
├─→ collectOnTarget loop sees shouldStop() == true, exits
236+
├─→ processPerfOutput sees context cancellation, drains remaining data
237+
└─→ Perfspect outputs whatever metrics were collected and exits
238+
```
239+
240+
### Scenario: Local Target, SIGINT from Another Process
241+
242+
Identical to Ctrl-C. Only perfspect receives the signal; child processes are killed via `SIGKILL` by the handler.
243+
244+
### Scenario: Remote Target, Ctrl-C in Shell
245+
246+
```text
247+
User presses Ctrl-C
248+
249+
├─→ Kernel sends SIGINT to foreground process group
250+
│ ├─→ perfspect receives SIGINT
251+
│ └─→ local SSH process does NOT receive it (different PGID)
252+
253+
└─→ signalManager.handleSignals() activates
254+
├─→ Cancels context
255+
├─→ Sends SIGKILL to all child processes (including local SSH)
256+
├─→ SSH connection drops, remote perf stat is orphaned
257+
│ (remote perf stat will eventually be cleaned up by the OS
258+
│ or by the target's temp directory cleanup on next run)
259+
├─→ collectOnTarget loop exits
260+
└─→ Perfspect outputs whatever metrics were streamed so far
261+
```
262+
263+
**Note:** Unlike the reporting-command handler, the metrics handler does not attempt to gracefully stop the remote `perf stat` process. The remote process becomes orphaned when SSH is killed. This is acceptable because metrics data is streamed and processed incrementally.
264+
265+
### Scenario: Remote Target, SIGINT from Another Process
266+
267+
Same as Ctrl-C. Only perfspect receives the signal; the local SSH process and remote `perf stat` are killed/orphaned as described above.
268+
269+
## Summary Table
270+
271+
| Aspect | Root | Reporting Commands | Metrics Command |
272+
|--------|------|--------------------|-----------------|
273+
| **Source file** | `cmd/root.go` | `internal/workflow/signals.go` | `cmd/metrics/metrics.go` |
274+
| **Signals caught** | SIGINT, SIGTERM | SIGINT, SIGTERM | SIGINT, SIGTERM |
275+
| **Active during** | Update check | Data collection | Data collection |
276+
| **Child isolation** | N/A | `Setpgid: true` | `Setpgid: true` (via target layer) |
277+
| **Shutdown strategy** | Immediate exit | Graceful: SIGINT -> wait -> SIGKILL | Immediate: SIGKILL |
278+
| **Partial results** | No | Yes (controller prints summary) | Yes (already streamed) |
279+
| **Remote cleanup** | N/A | Explicit signal via SSH | None (orphaned) |
280+
| **Timeout** | None | 20s per target | None |
281+
282+
## Key Source Files
283+
284+
| File | Role |
285+
|------|------|
286+
| `cmd/root.go:226-239` | Root-level signal handler during update check |
287+
| `internal/workflow/signals.go` | Reporting-command signal handler and target signaling |
288+
| `internal/workflow/workflow.go:149-150` | Where the reporting handler is installed |
289+
| `cmd/metrics/metrics.go:70-132` | Metrics signalManager definition |
290+
| `cmd/metrics/metrics.go:850-851` | Where the metrics handler is installed |
291+
| `internal/script/script.go:128-134` | Controller process group isolation |
292+
| `internal/script/script.go:258-402` | Controller script template (including `handle_sigint` trap) |
293+
| `internal/target/helpers.go:128-130` | `Setpgid` process group isolation |
294+
| `internal/util/util.go:561-640` | Signal utility functions (`SignalProcess`, `SignalChildren`, `SignalSelf`) |

0 commit comments

Comments
 (0)