|
| 1 | +# Signal Handling |
| 2 | + |
| 3 | +This document describes how PerfSpect handles Linux signals across all commands and scenarios. It is intended for developers working on the codebase. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +PerfSpect registers signal handlers for `SIGINT` and `SIGTERM` at multiple levels. The handling strategy differs depending on which command is running and whether the target is local or remote. There are three independent signal-handling layers: |
| 8 | + |
| 9 | +1. **Root-level handler** -- active during the pre-command update check (`cmd/root.go`) |
| 10 | +2. **Reporting-command handler** -- used by `report`, `benchmark`, `telemetry`, `flamegraph`, and `lock` (`internal/workflow/signals.go`) |
| 11 | +3. **Metrics-command handler** -- used by `metrics` (`cmd/metrics/metrics.go`) |
| 12 | + |
| 13 | +Each layer replaces the previous one via `signal.Notify`, so only one Go-level handler is active at a time for a given signal. |
| 14 | + |
| 15 | +## Background: Process Groups and Terminal Signals |
| 16 | + |
| 17 | +A key design decision throughout PerfSpect is the use of **new process groups** (`Setpgid: true`) for child processes. This is critical for understanding signal propagation: |
| 18 | + |
| 19 | +- When the user presses Ctrl-C in a terminal, the kernel sends `SIGINT` to **every process in the foreground process group**. |
| 20 | +- PerfSpect runs the controller script (and `perf stat` for metrics) in a **separate process group** so that the terminal's `SIGINT` is **not** automatically delivered to them. |
| 21 | +- This gives PerfSpect's signal handler the opportunity to orchestrate a graceful shutdown rather than having children killed out from under it. |
| 22 | + |
| 23 | +For remote targets, the "child" from PerfSpect's perspective is the local SSH process. The controller script runs on the remote host in its own process group as well. |
| 24 | + |
| 25 | +## 1. Root-Level Handler (`cmd/root.go`) |
| 26 | + |
| 27 | +### When Active |
| 28 | + |
| 29 | +During the `PersistentPreRunE` phase, while PerfSpect checks for available updates (Intel network only). This handler is short-lived and is deregistered via `defer signal.Stop(sigChannel)` before the subcommand begins. |
| 30 | + |
| 31 | +### What It Does |
| 32 | + |
| 33 | +``` |
| 34 | +SIGINT or SIGTERM received |
| 35 | + -> Log the signal |
| 36 | + -> Call terminateApplication() to clean up (close log file, remove temp directory) |
| 37 | + -> os.Exit(1) |
| 38 | +``` |
| 39 | + |
| 40 | +### Scenarios |
| 41 | + |
| 42 | +This handler is identical for all scenarios (local/remote, Ctrl-C/external signal) because it runs before any target connection or data collection begins. It simply ensures a clean exit if the user interrupts during the update check. |
| 43 | + |
| 44 | +**Source:** `cmd/root.go:226-239` |
| 45 | + |
| 46 | +## 2. Reporting-Command Handler (`internal/workflow/`) |
| 47 | + |
| 48 | +Used by: `report`, `benchmark`, `telemetry`, `flamegraph`, `lock` |
| 49 | + |
| 50 | +### Architecture |
| 51 | + |
| 52 | +```text |
| 53 | + perfspect (Go) |
| 54 | + ┌─────────────────────────────────────────────┐ |
| 55 | + │ signal handler goroutine │ |
| 56 | + │ (configureSignalHandler) │ |
| 57 | + │ │ |
| 58 | + │ Listens for SIGINT / SIGTERM │ |
| 59 | + └──────────┬──────────────────────────────────┘ |
| 60 | + │ |
| 61 | + ┌───────────────────┼───────────────────┐ |
| 62 | + ▼ ▼ ▼ |
| 63 | + Target A Target B Target N |
| 64 | + ┌──────────┐ ┌──────────┐ ┌──────────┐ |
| 65 | + │controller│ │controller│ │controller│ |
| 66 | + │ .sh │ │ .sh │ │ .sh │ |
| 67 | + │(own PGID)│ │(own PGID)│ │(own PGID)│ |
| 68 | + └────┬─────┘ └────┬─────┘ └────┬─────┘ |
| 69 | + │ │ │ |
| 70 | + ┌────┴────┐ ┌────┴────┐ ┌────┴────┐ |
| 71 | + │ scripts │ │ scripts │ │ scripts │ |
| 72 | + │(setsid) │ │(setsid) │ │(setsid) │ |
| 73 | + └─────────┘ └─────────┘ └─────────┘ |
| 74 | +``` |
| 75 | + |
| 76 | +Each target's controller script runs in its own process group. Each individual script within the controller also runs in its own session (`setsid`). |
| 77 | + |
| 78 | +### Signal Flow |
| 79 | + |
| 80 | +The handler is installed in `configureSignalHandler()` (`internal/workflow/signals.go:64-174`). When a signal arrives: |
| 81 | + |
| 82 | +1. **Read each target's controller PID** from `controller.pid` on the target (via `cat` over SSH for remote, locally for local). |
| 83 | +2. **Send SIGINT to each controller** using `kill -SIGINT <pid>` on the target. |
| 84 | +3. **Spawn per-target goroutines** that poll (`ps -p <pid>`) until the controller exits, with a **20-second timeout**. |
| 85 | +4. If the timeout expires, **send SIGKILL** to the controller. |
| 86 | +5. After all controllers have exited, **sleep 500ms** to allow local SSH processes to finish cleanly. |
| 87 | +6. **Send SIGINT to perfspect's remaining children** (e.g., SSH processes that haven't exited yet) via `util.SignalChildren()`. |
| 88 | + |
| 89 | +### Controller Script Signal Handling |
| 90 | + |
| 91 | +The controller script (`internal/script/script.go:258-402`) has its own `trap handle_sigint SIGINT` handler: |
| 92 | + |
| 93 | +1. For each running concurrent script, call `kill_script()`: |
| 94 | + - Send `SIGTERM` to the script's **process group** (`kill -SIGTERM -$pid`). Bash background jobs ignore SIGINT but respond to SIGTERM. |
| 95 | + - Poll for up to **60 seconds** for graceful exit. |
| 96 | + - If still alive, send `SIGKILL` to the process group. |
| 97 | + - Set exit code to 143 (SIGTERM) for killed scripts. |
| 98 | +2. Kill the current sequential script if one is running. |
| 99 | +3. Call `print_summary` to emit whatever output was collected (partial results). |
| 100 | +4. Remove `controller.pid` and exit 0. |
| 101 | + |
| 102 | +### Individual Script Signal Handling |
| 103 | + |
| 104 | +Some long-running scripts have their own signal traps: |
| 105 | + |
| 106 | +| Script | Trap | Behavior | |
| 107 | +|--------|------|----------| |
| 108 | +| Flamegraph (`flamegraph`) | `trap on_signal INT TERM EXIT` | Stops perf, collapses stacks, prints partial results, restores kernel settings | |
| 109 | +| Lock contention (`lock`) | `trap on_signal INT TERM EXIT` | Similar to flamegraph -- stops perf, processes data, prints results | |
| 110 | +| Instruction telemetry | `trap finalize INT TERM EXIT` | Kills processwatch, cleans up | |
| 111 | +| Kernel telemetry | `trap cleanup SIGINT SIGTERM` | Exits cleanly | |
| 112 | +| Syscall telemetry | `trap cleanup INT TERM` | Kills perf, exits cleanly | |
| 113 | + |
| 114 | +### Scenario: Local Target, Ctrl-C in Shell |
| 115 | + |
| 116 | +```text |
| 117 | +User presses Ctrl-C |
| 118 | + │ |
| 119 | + ├─→ Kernel sends SIGINT to foreground process group |
| 120 | + │ ├─→ perfspect receives SIGINT |
| 121 | + │ └─→ controller.sh does NOT receive it (different PGID) |
| 122 | + │ |
| 123 | + └─→ perfspect's signal handler activates |
| 124 | + ├─→ Reads controller.pid from local filesystem |
| 125 | + ├─→ Runs: kill -SIGINT <controller_pid> |
| 126 | + ├─→ Controller's handle_sigint trap fires |
| 127 | + │ ├─→ Sends SIGTERM to each script's process group |
| 128 | + │ ├─→ Waits for scripts to exit (up to 60s each) |
| 129 | + │ ├─→ Prints collected output |
| 130 | + │ └─→ Exits 0 |
| 131 | + ├─→ Signal handler detects controller has exited |
| 132 | + ├─→ Sends SIGINT to any remaining perfspect children |
| 133 | + └─→ Perfspect processes partial results and exits |
| 134 | +``` |
| 135 | + |
| 136 | +### Scenario: Local Target, SIGINT from Another Process |
| 137 | + |
| 138 | +The flow is identical to Ctrl-C except: |
| 139 | +- Only perfspect receives the signal (the kernel does not broadcast to the foreground process group since the signal comes from `kill`, not the terminal). |
| 140 | +- The controller and its scripts are unaffected until perfspect's handler explicitly signals them. |
| 141 | +- This is the scenario the separate-process-group design specifically addresses -- it ensures the same graceful shutdown path regardless of signal source. |
| 142 | + |
| 143 | +### Scenario: Remote Target, Ctrl-C in Shell |
| 144 | + |
| 145 | +```text |
| 146 | +User presses Ctrl-C |
| 147 | + │ |
| 148 | + ├─→ Kernel sends SIGINT to foreground process group |
| 149 | + │ ├─→ perfspect receives SIGINT |
| 150 | + │ └─→ local SSH process does NOT receive it (different PGID) |
| 151 | + │ |
| 152 | + └─→ perfspect's signal handler activates |
| 153 | + ├─→ Reads controller.pid from remote target (via SSH, reusing connection) |
| 154 | + ├─→ Runs: kill -SIGINT <controller_pid> on remote target (via SSH) |
| 155 | + ├─→ Remote controller's handle_sigint trap fires |
| 156 | + │ ├─→ Sends SIGTERM to each remote script's process group |
| 157 | + │ ├─→ Waits for scripts to exit (up to 60s each) |
| 158 | + │ ├─→ Prints collected output |
| 159 | + │ └─→ Exits 0 |
| 160 | + ├─→ SSH process carrying controller output exits naturally |
| 161 | + │ (500ms grace period to avoid interrupting output transfer) |
| 162 | + ├─→ Signal handler detects controller has exited (via SSH ps -p) |
| 163 | + ├─→ Sends SIGINT to any remaining local children (SSH processes) |
| 164 | + └─→ Perfspect processes partial results and exits |
| 165 | +``` |
| 166 | + |
| 167 | +**Key detail:** The `signalProcessOnTarget()` function (`internal/workflow/signals.go:26-48`) uses `t.RunCommandEx()` to run `kill` on the target. For remote targets, this opens a new SSH session. The `waitTime` of 15 seconds accounts for the controller's 5-second grace period plus network latency. |
| 168 | + |
| 169 | +### Scenario: Remote Target, SIGINT from Another Process |
| 170 | + |
| 171 | +Same as Ctrl-C except: |
| 172 | +- Only perfspect receives the signal; neither the local SSH process nor the remote controller are affected by the initial signal. |
| 173 | +- The handler follows the same path: signal the remote controller via SSH, wait for exit, then clean up local SSH children. |
| 174 | + |
| 175 | +### Edge Cases and Known Issues |
| 176 | + |
| 177 | +- **Race condition on PID file removal:** The controller deletes `controller.pid` before fully exiting. The signal handler adds a 500ms sleep after detecting PID file removal to let SSH finish transferring output. See comment at `internal/workflow/signals.go:150-155`. |
| 178 | +- **SSH exit code 255:** When SSH is interrupted, it returns exit code 255 even if the remote controller exited cleanly. The output parser checks for the `<---------------------->` delimiter to determine if usable output exists and processes it despite the non-zero exit code (`internal/script/script.go:140-149`). |
| 179 | + |
| 180 | +## 3. Metrics-Command Handler (`cmd/metrics/`) |
| 181 | + |
| 182 | +The `metrics` command has its own signal handling because it does not use the controller script. Instead, it streams `perf stat` output directly. |
| 183 | + |
| 184 | +### Architecture |
| 185 | + |
| 186 | +```text |
| 187 | + perfspect (Go) |
| 188 | + ┌─────────────────────────────────────────────┐ |
| 189 | + │ signalManager │ |
| 190 | + │ - ctx/cancel (context.Context) │ |
| 191 | + │ - sigChannel (SIGINT, SIGTERM) │ |
| 192 | + │ - handleSignals() goroutine │ |
| 193 | + └──────────┬──────────────────────────────────┘ |
| 194 | + │ |
| 195 | + ┌───────────────────┼───────────────────┐ |
| 196 | + ▼ ▼ ▼ |
| 197 | + Target A Target B Target N |
| 198 | + ┌──────────┐ ┌──────────┐ ┌──────────┐ |
| 199 | + │ perf stat│ │ perf stat│ │ perf stat│ |
| 200 | + │(streamed)│ │(streamed)│ │(streamed)│ |
| 201 | + └──────────┘ └──────────┘ └──────────┘ |
| 202 | +``` |
| 203 | + |
| 204 | +### Signal Flow |
| 205 | + |
| 206 | +The `signalManager` (`cmd/metrics/metrics.go:70-132`) uses a `context.Context` for coordinated shutdown: |
| 207 | + |
| 208 | +1. **On signal receipt** (or internal error triggering `triggerShutdown()`): |
| 209 | + - Cancel the context (`sm.cancel()`). |
| 210 | + - Send `SIGKILL` to all child processes (`util.SignalChildren(syscall.SIGKILL)`). |
| 211 | + |
| 212 | +2. **Collection loop** (`collectOnTarget`, `cmd/metrics/metrics.go:1332`): |
| 213 | + - Checks `signalMgr.shouldStop()` on each iteration. When true, exits the loop. |
| 214 | + |
| 215 | +3. **Processing pipeline** (`processPerfOutput`, `cmd/metrics/metrics.go:1539-1606`): |
| 216 | + - Also monitors the context for cancellation. |
| 217 | + - Can itself trigger shutdown via `signalMgr.triggerShutdown()` on repeated processing errors or cgroup refresh timeout. |
| 218 | + |
| 219 | +**Important difference from reporting commands:** The metrics handler sends `SIGKILL` immediately rather than attempting graceful shutdown with `SIGINT` then escalating to `SIGKILL`. This is because `perf stat` is a streaming process and partial output is already captured in real-time -- there is no risk of losing already-collected data. |
| 220 | + |
| 221 | +### Scenario: Local Target, Ctrl-C in Shell |
| 222 | + |
| 223 | +```text |
| 224 | +User presses Ctrl-C |
| 225 | + │ |
| 226 | + ├─→ Kernel sends SIGINT to foreground process group |
| 227 | + │ ├─→ perfspect receives SIGINT |
| 228 | + │ └─→ perf stat does NOT receive it (running via RunScriptStream, |
| 229 | + │ but the script command is not in perfspect's process group |
| 230 | + │ because metrics uses the target layer which isolates children) |
| 231 | + │ |
| 232 | + └─→ signalManager.handleSignals() activates |
| 233 | + ├─→ Cancels context |
| 234 | + ├─→ Sends SIGKILL to all child processes |
| 235 | + ├─→ collectOnTarget loop sees shouldStop() == true, exits |
| 236 | + ├─→ processPerfOutput sees context cancellation, drains remaining data |
| 237 | + └─→ Perfspect outputs whatever metrics were collected and exits |
| 238 | +``` |
| 239 | + |
| 240 | +### Scenario: Local Target, SIGINT from Another Process |
| 241 | + |
| 242 | +Identical to Ctrl-C. Only perfspect receives the signal; child processes are killed via `SIGKILL` by the handler. |
| 243 | + |
| 244 | +### Scenario: Remote Target, Ctrl-C in Shell |
| 245 | + |
| 246 | +```text |
| 247 | +User presses Ctrl-C |
| 248 | + │ |
| 249 | + ├─→ Kernel sends SIGINT to foreground process group |
| 250 | + │ ├─→ perfspect receives SIGINT |
| 251 | + │ └─→ local SSH process does NOT receive it (different PGID) |
| 252 | + │ |
| 253 | + └─→ signalManager.handleSignals() activates |
| 254 | + ├─→ Cancels context |
| 255 | + ├─→ Sends SIGKILL to all child processes (including local SSH) |
| 256 | + ├─→ SSH connection drops, remote perf stat is orphaned |
| 257 | + │ (remote perf stat will eventually be cleaned up by the OS |
| 258 | + │ or by the target's temp directory cleanup on next run) |
| 259 | + ├─→ collectOnTarget loop exits |
| 260 | + └─→ Perfspect outputs whatever metrics were streamed so far |
| 261 | +``` |
| 262 | + |
| 263 | +**Note:** Unlike the reporting-command handler, the metrics handler does not attempt to gracefully stop the remote `perf stat` process. The remote process becomes orphaned when SSH is killed. This is acceptable because metrics data is streamed and processed incrementally. |
| 264 | + |
| 265 | +### Scenario: Remote Target, SIGINT from Another Process |
| 266 | + |
| 267 | +Same as Ctrl-C. Only perfspect receives the signal; the local SSH process and remote `perf stat` are killed/orphaned as described above. |
| 268 | + |
| 269 | +## Summary Table |
| 270 | + |
| 271 | +| Aspect | Root | Reporting Commands | Metrics Command | |
| 272 | +|--------|------|--------------------|-----------------| |
| 273 | +| **Source file** | `cmd/root.go` | `internal/workflow/signals.go` | `cmd/metrics/metrics.go` | |
| 274 | +| **Signals caught** | SIGINT, SIGTERM | SIGINT, SIGTERM | SIGINT, SIGTERM | |
| 275 | +| **Active during** | Update check | Data collection | Data collection | |
| 276 | +| **Child isolation** | N/A | `Setpgid: true` | `Setpgid: true` (via target layer) | |
| 277 | +| **Shutdown strategy** | Immediate exit | Graceful: SIGINT -> wait -> SIGKILL | Immediate: SIGKILL | |
| 278 | +| **Partial results** | No | Yes (controller prints summary) | Yes (already streamed) | |
| 279 | +| **Remote cleanup** | N/A | Explicit signal via SSH | None (orphaned) | |
| 280 | +| **Timeout** | None | 20s per target | None | |
| 281 | + |
| 282 | +## Key Source Files |
| 283 | + |
| 284 | +| File | Role | |
| 285 | +|------|------| |
| 286 | +| `cmd/root.go:226-239` | Root-level signal handler during update check | |
| 287 | +| `internal/workflow/signals.go` | Reporting-command signal handler and target signaling | |
| 288 | +| `internal/workflow/workflow.go:149-150` | Where the reporting handler is installed | |
| 289 | +| `cmd/metrics/metrics.go:70-132` | Metrics signalManager definition | |
| 290 | +| `cmd/metrics/metrics.go:850-851` | Where the metrics handler is installed | |
| 291 | +| `internal/script/script.go:128-134` | Controller process group isolation | |
| 292 | +| `internal/script/script.go:258-402` | Controller script template (including `handle_sigint` trap) | |
| 293 | +| `internal/target/helpers.go:128-130` | `Setpgid` process group isolation | |
| 294 | +| `internal/util/util.go:561-640` | Signal utility functions (`SignalProcess`, `SignalChildren`, `SignalSelf`) | |
0 commit comments