Commit d0b385d
[ENG-3665] Fix AWS source dying when a single region times out (#4670)
## Summary
- **Hotfix**: A single unreachable AWS region (e.g. `me-south-1` being
decommissioned) was causing the entire AWS source to stay permanently
unready, because the STS timeout error wasn't handled gracefully.
- **Assistant awareness**: The explore assistant now knows when sources
are unhealthy and tells users results may be incomplete, instead of
claiming resources don't exist.
## Linear Ticket
- **Ticket**:
[ENG-3665](https://linear.app/overmind/issue/ENG-3665/aws-source-is-dead-when-a-single-region-times-out)
— AWS source is dead when a single region times out
- **Purpose**: Prevent a single dead region from taking down the entire
AWS source, and surface health degradation to users via the assistant.
## Changes
### 1. `aws-source/proc/proc.go` + `proc_test.go`
Expanded `isOptInRegionError` to also match `context.DeadlineExceeded`
and `context.Canceled`. When a region times out on the STS
`GetCallerIdentity` call, it is now skipped (with a warning log) instead
of failing the entire source initialization. Four new test cases cover
bare and wrapped timeout/cancellation errors.
### 2. `services/gateway/service/assistant.go` + related files
`setupTools` now returns a `setupToolsResult` struct containing both the
tool list and a health summary string. `buildSourceHealthSummary`
iterates all sources and formats a markdown section listing any that
aren't healthy (with name, type, status, and error message). This is
appended to the system prompt when creating the LLM conversation, so the
assistant can proactively inform users about degraded sources.
**Reviewers should focus on**: the error-matching logic in
`isOptInRegionError` (is `DeadlineExceeded`/`Canceled` the right
scope?), and the system prompt injection wording in
`buildSourceHealthSummary`.
Made with [Cursor](https://cursor.com)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes AWS source initialization to skip regions on STS timeouts (and
opt-in/OIDC errors) instead of failing, which could mask genuine
regional issues if misclassified. Also injects source health into the
assistant system prompt, so prompt-sanitization and wording correctness
matter.
>
> **Overview**
> Prevents the AWS source from getting stuck unready when a single
region can’t respond: STS `GetCallerIdentity` failures that are
*timeouts* (and existing opt-in/OIDC failures) are now treated as
**skippable**, recorded, and initialization continues for the remaining
regions (with improved log messaging and an OTel
`ovm.adapter.regionSkipped` event).
>
> Makes the Explore assistant **aware of degraded sources** by changing
`setupTools` to return both the tool list and a generated “Source Health
Warnings” block, sanitising source-provided strings to reduce
prompt-injection risk, and appending that health summary to the
assistant system prompt; tests were added/updated to cover the new
timeout/skip logic and prompt sanitisation/summary generation.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
e6f9e3241f554ea9e69b5b0907dd83d83323a454. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: David Schmitt <DavidS-ovm@users.noreply.github.com>
GitOrigin-RevId: c195aaa468142d0a75206c7ef5d8948066c8e34a1

Parent 43f4333, commit d0b385d
2 files changed
Lines changed: 171 additions & 13 deletions