Commit 0566ddb
[Jobs] Always run jobs controllers with debug logging
The jobs controller process inherits SKYPILOT_DEBUG from whichever process
starts it, so its log level depends on that process and is rarely debug:
- Consolidation mode: the controller subprocess inherits the os.environ of
the process that starts it. In steady state that is the managed-job refresh
daemon (a thread in the main API server process), so it inherits the API
server's level; when a `sky jobs launch` request worker starts it directly,
it inherits the client's level.
- Remote mode: the level comes from the controller env file, which is built
from the launching client's level.
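The consolidation-mode case is ordinary process-environment inheritance. The minimal standard-library sketch below does not use SkyPilot, and `spawn_and_report` is a hypothetical helper. It shows that a child started with the default `env=None` sees whatever SKYPILOT_DEBUG its parent had at spawn time, or nothing at all:

```python
import os
import subprocess
import sys

# The child only reports the SKYPILOT_DEBUG value it inherited.
CHILD_CODE = "import os; print(os.environ.get('SKYPILOT_DEBUG', '<unset>'))"


def spawn_and_report(parent_value):
    """Set (or clear) SKYPILOT_DEBUG in this process, then spawn a child.

    Hypothetical helper for illustration. subprocess.run is called without
    env=, so the child inherits this process's os.environ as-is.
    """
    if parent_value is None:
        os.environ.pop('SKYPILOT_DEBUG', None)
    else:
        os.environ['SKYPILOT_DEBUG'] = parent_value
    result = subprocess.run([sys.executable, '-c', CHILD_CODE],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == '__main__':
    # e.g. a non-debug API server, a debug client, a client that set nothing.
    for value in ('0', '1', None):
        print(repr(value), '->', spawn_and_report(value))
```

The child never chooses its own level in this sketch. It gets whatever the parent had, which is the unreliable behavior described above.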
None of these sources is guaranteed to be debug, so the controller's own
("controller_system") logs are emitted at INFO level and without timestamps
unless the starting process happened to have debug logging enabled. This
makes it hard to reconstruct incidents from controller logs (e.g. determining
when controllers reclaim jobs during a failover or rolling upgrade).
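To illustrate the symptom, here is a hedged sketch of an env-driven logger setup that uses only the standard `logging` module. The formats, the `configure_logger` helper, and the strict `== '1'` check are assumptions for illustration, not SkyPilot's actual logging code. The level and format are read once at setup, so an inherited non-debug value yields INFO output without timestamps, and changing the env var afterwards has no effect on its own:

```python
import io
import logging
import os

# Hypothetical formats for illustration; not SkyPilot's real ones.
DEBUG_FORMAT = '%(asctime)s %(levelname)s %(message)s'
INFO_FORMAT = '%(levelname)s %(message)s'


def configure_logger(name, stream):
    """Read SKYPILOT_DEBUG once, at setup time, and configure `name`.

    Simplification: only the exact value '1' counts as debug here.
    """
    debug = os.environ.get('SKYPILOT_DEBUG') == '1'
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(DEBUG_FORMAT if debug else INFO_FORMAT))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.propagate = False  # keep output in `stream` only
    return logger


if __name__ == '__main__':
    os.environ['SKYPILOT_DEBUG'] = '0'  # inherited from a non-debug parent
    buf = io.StringIO()
    log = configure_logger('controller_system_demo', buf)
    log.debug('reclaiming job 7')  # dropped: effective level is INFO
    log.info('controller started')  # kept, but with no timestamp
    os.environ['SKYPILOT_DEBUG'] = '1'  # too late: setup already ran
    log.debug('still dropped')  # dropped: nothing re-read the env var
    print(buf.getvalue(), end='')
```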
Always run the controller with debug logging enabled, independent of the
inherited level, in both consolidation and remote modes:
- At the controller process entry point (__main__), force SKYPILOT_DEBUG=1
and reload the logger so the controller-process ("controller_system") logs
get debug-level detail and timestamps. Setting the env var also propagates
to subprocesses spawned by the controller.
- In the per-job loop, override SKYPILOT_DEBUG to '1' after applying the
job's env (which carries the launching client's value), so per-job
controller logs are also at debug level.
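The two steps above might look roughly like the following standard-library sketch. `reload_logger`, `force_debug_at_entry`, and `build_job_env` are hypothetical names, not SkyPilot's functions, and the per-job step is modeled as a pure dict merge rather than whatever the controller loop actually mutates. The key detail is ordering in the per-job step: the override is applied after the job's env, so the client's value cannot undo it.

```python
import io
import logging
import os

# Hypothetical formats for illustration; not SkyPilot's real ones.
DEBUG_FORMAT = '%(asctime)s %(levelname)s %(message)s'
INFO_FORMAT = '%(levelname)s %(message)s'


def reload_logger(name, stream):
    """Hypothetical stand-in for a logger reload: re-read SKYPILOT_DEBUG and
    rebuild the handler so the new level and format take effect."""
    debug = os.environ.get('SKYPILOT_DEBUG') == '1'
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(DEBUG_FORMAT if debug else INFO_FORMAT))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.propagate = False
    return logger


def force_debug_at_entry(name, stream):
    """Entry-point step: force debug regardless of the inherited value.

    Writing to os.environ also means subprocesses spawned later inherit it.
    """
    os.environ['SKYPILOT_DEBUG'] = '1'
    return reload_logger(name, stream)


def build_job_env(base_env, job_env):
    """Per-job step: apply the job's env first, then override SKYPILOT_DEBUG.

    job_env carries the launching client's value (often '0'), so setting
    the override before the merge would be undone by the merge.
    """
    env = dict(base_env)
    env.update(job_env)
    env['SKYPILOT_DEBUG'] = '1'
    return env


if __name__ == '__main__':
    os.environ['SKYPILOT_DEBUG'] = '0'  # inherited, non-debug
    buf = io.StringIO()
    log = force_debug_at_entry('controller_system_demo', buf)
    log.debug('reclaiming job 7')  # now emitted, with a timestamp
    print(buf.getvalue(), end='')
    print(build_job_env({'PATH': '/usr/bin'},
                        {'SKYPILOT_DEBUG': '0', 'FOO': 'bar'}))
```

Swapping the last two statements in `build_job_env` would let a client's `SKYPILOT_DEBUG='0'` win again, which is why the override comes after the job's env is applied.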
Both entry points run the same `python -m sky.jobs.controller` module, so
the fix covers consolidation-mode and remote controllers alike.
Tested:
- Reproduced in consolidation mode: the controller_system log had no
timestamps and no debug lines when the client did not set SKYPILOT_DEBUG.
- After fix, verified end-to-end on Kubernetes in both consolidation mode and
remote-controller mode: both the controller_system log and per-job logs are
emitted at debug level with timestamps even when the launching client did
not set SKYPILOT_DEBUG (env file still records SKYPILOT_DEBUG='0').
- Added unit tests for the per-job env override helper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FmmoQDXj9gwbXvS7PhXuLF1
Parent: 719c50f
2 files changed: 95 additions, 5 deletions