Skip to content

Commit d0b385d

Browse files
DavidS-ovmcursoragent
authored andcommitted
[ENG-3665] Fix AWS source dying when a single region times out (#4670)
## Summary - **Hotfix**: A single unreachable AWS region (e.g. `me-south-1` being decommissioned) was causing the entire AWS source to stay permanently unready, because the STS timeout error wasn't handled gracefully. - **Assistant awareness**: The explore assistant now knows when sources are unhealthy and tells users results may be incomplete, instead of claiming resources don't exist. ## Linear Ticket - **Ticket**: [ENG-3665](https://linear.app/overmind/issue/ENG-3665/aws-source-is-dead-when-a-single-region-times-out) — AWS source is dead when a single region times out - **Purpose**: Prevent a single dead region from taking down the entire AWS source, and surface health degradation to users via the assistant. ## Changes ### 1. `aws-source/proc/proc.go` + `proc_test.go` Expanded `isOptInRegionError` to also match `context.DeadlineExceeded` and `context.Canceled`. When a region times out on the STS `GetCallerIdentity` call, it is now skipped (with a warning log) instead of failing the entire source initialization. Four new test cases cover bare and wrapped timeout/cancellation errors. ### 2. `services/gateway/service/assistant.go` + related files `setupTools` now returns a `setupToolsResult` struct containing both the tool list and a health summary string. `buildSourceHealthSummary` iterates all sources and formats a markdown section listing any that aren't healthy (with name, type, status, and error message). This is appended to the system prompt when creating the LLM conversation, so the assistant can proactively inform users about degraded sources. **Reviewers should focus on**: the error-matching logic in `isOptInRegionError` (is `DeadlineExceeded`/`Canceled` the right scope?), and the system prompt injection wording in `buildSourceHealthSummary`. Made with [Cursor](https://cursor.com) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes AWS source initialization to skip regions on STS timeouts (and opt-in/OIDC errors) instead of failing, which could mask genuine regional issues if misclassified. Also injects source health into the assistant system prompt, so prompt-sanitization and wording correctness matter. > > **Overview** > Prevents the AWS source from getting stuck unready when a single region can’t respond: STS `GetCallerIdentity` failures that are *timeouts* (and existing opt-in/OIDC failures) are now treated as **skippable**, recorded, and initialization continues for the remaining regions (with improved log messaging and an OTel `ovm.adapter.regionSkipped` event). > > Makes the Explore assistant **aware of degraded sources** by changing `setupTools` to return both the tool list and a generated “Source Health Warnings” block, sanitising source-provided strings to reduce prompt-injection risk, and appending that health summary to the assistant system prompt; tests were added/updated to cover the new timeout/skip logic and prompt sanitisation/summary generation. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit e6f9e3241f554ea9e69b5b0907dd83d83323a454. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: David Schmitt <DavidS-ovm@users.noreply.github.com> GitOrigin-RevId: c195aaa468142d0a75206c7ef5d8948066c8e34a
1 parent 43f4333 commit d0b385d

2 files changed

Lines changed: 171 additions & 13 deletions

File tree

aws-source/proc/proc.go

Lines changed: 41 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,8 @@ import (
4646
log "github.com/sirupsen/logrus"
4747
"github.com/spf13/viper"
4848
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
49+
"go.opentelemetry.io/otel/attribute"
50+
"go.opentelemetry.io/otel/trace"
4951
)
5052

5153
// This package contains a few functions needed by the CLI to load this in-proc.
@@ -82,20 +84,23 @@ func ConfigFromViper() ([]aws.Config, error) {
8284
return CreateAWSConfigs(authConfig)
8385
}
8486

85-
// isOptInRegionError checks if an error indicates an opt-in region that is not enabled.
86-
// This typically occurs when trying to authenticate with IRSA in a region that hasn't
87-
// been enabled in the AWS account. These errors should not cause source initialization
88-
// to fail - the region should simply be skipped.
87+
// isTimeoutError checks if an error is a context deadline exceeded.
88+
// A single unresponsive region (e.g. me-south-1 being decommissioned) must
89+
// not take down the whole source — see ENG-3665.
90+
func isTimeoutError(err error) bool {
91+
return err != nil && errors.Is(err, context.DeadlineExceeded)
92+
}
93+
94+
// isOptInRegionError checks if an error indicates an opt-in region that is not
95+
// enabled in the AWS account (InvalidIdentityToken + OIDC).
8996
func isOptInRegionError(err error) bool {
9097
if err == nil {
9198
return false
9299
}
93100

94-
// Check for the InvalidIdentityToken error code from STS AssumeRoleWithWebIdentity
95101
var apiErr smithy.APIError
96102
if errors.As(err, &apiErr) {
97103
if apiErr.ErrorCode() == "InvalidIdentityToken" {
98-
// Additional validation: check if it's specifically about OIDC provider
99104
errMsg := err.Error()
100105
if strings.Contains(errMsg, "No OpenIDConnect provider found") {
101106
return true
@@ -106,13 +111,22 @@ func isOptInRegionError(err error) bool {
106111
return false
107112
}
108113

114+
// isSkippableRegionError checks if an error indicates a region that cannot be
115+
// reached and should be skipped rather than failing the entire source.
116+
func isSkippableRegionError(err error) bool {
117+
return isTimeoutError(err) || isOptInRegionError(err)
118+
}
119+
109120
// wrapRegionError wraps misleading AWS errors with more helpful context
110121
func wrapRegionError(err error, region string) error {
111122
if err == nil {
112123
return nil
113124
}
114125

115-
// Check for opt-in region errors and provide helpful context
126+
if isTimeoutError(err) {
127+
return fmt.Errorf("%w. Region '%s' is unreachable (timeout); it may be decommissioned or experiencing an outage", err, region)
128+
}
129+
116130
if isOptInRegionError(err) {
117131
return fmt.Errorf("%w. This error often occurs when region '%s' is not enabled in the target AWS account", err, region)
118132
}
@@ -322,18 +336,32 @@ func InitializeAwsSourceAdapters(ctx context.Context, e *discovery.Engine, confi
322336
"region": cfg.Region,
323337
}
324338

325-
// Check if this is an opt-in region error
326-
if isOptInRegionError(err) {
327-
// This region is not enabled in the account - skip it but don't fail
339+
// Check if this is a skippable region error (timeout or opt-in)
340+
if isSkippableRegionError(err) {
328341
wrappedErr := wrapRegionError(err, cfg.Region)
329342
skippedRegionsMu.Lock()
330343
skippedRegions = append(skippedRegions, skippedRegion{
331344
region: cfg.Region,
332345
err: wrappedErr,
333346
})
334347
skippedRegionsMu.Unlock()
335-
log.WithError(wrappedErr).WithFields(lf).Warn("Skipping region - not enabled in account")
336-
return nil // Don't fail the pool for opt-in regions
348+
349+
reason := "opt-in region not enabled"
350+
if isTimeoutError(err) {
351+
reason = "timeout"
352+
log.WithError(wrappedErr).WithFields(lf).Warn("Skipping region - unreachable (timeout)")
353+
} else {
354+
log.WithError(wrappedErr).WithFields(lf).Warn("Skipping region - not enabled in account")
355+
}
356+
357+
span := trace.SpanFromContext(ctx)
358+
span.AddEvent("ovm.adapter.regionSkipped", trace.WithAttributes(
359+
attribute.String("ovm.adapter.region", cfg.Region),
360+
attribute.String("ovm.adapter.skipReason", reason),
361+
attribute.String("ovm.adapter.error", wrappedErr.Error()),
362+
))
363+
364+
return nil // Don't fail the pool for skippable regions
337365
}
338366

339367
// Wrap misleading OIDC errors with helpful region enablement context
@@ -645,7 +673,7 @@ func InitializeAwsSourceAdapters(ctx context.Context, e *discovery.Engine, confi
645673
log.WithFields(log.Fields{
646674
"skipped_regions": skippedRegionNames,
647675
"count": len(skippedRegions),
648-
}).Warn("Some regions were skipped because they are not enabled in the AWS account. The source will operate normally with the remaining regions.")
676+
}).Warn("Some regions were skipped because they are unreachable or not enabled in the AWS account. The source will operate normally with the remaining regions.")
649677
}
650678

651679
log.Debug("Sources initialized")

aws-source/proc/proc_test.go

Lines changed: 130 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -176,6 +176,16 @@ func TestIsOptInRegionError(t *testing.T) {
176176
err: errors.New("No OpenIDConnect provider found"),
177177
expectedResult: false,
178178
},
179+
{
180+
name: "context.DeadlineExceeded returns false",
181+
err: context.DeadlineExceeded,
182+
expectedResult: false,
183+
},
184+
{
185+
name: "context.Canceled returns false",
186+
err: context.Canceled,
187+
expectedResult: false,
188+
},
179189
}
180190

181191
for _, tt := range tests {
@@ -188,6 +198,105 @@ func TestIsOptInRegionError(t *testing.T) {
188198
}
189199
}
190200

201+
func TestIsTimeoutError(t *testing.T) {
202+
tests := []struct {
203+
name string
204+
err error
205+
expectedResult bool
206+
}{
207+
{
208+
name: "nil error returns false",
209+
err: nil,
210+
expectedResult: false,
211+
},
212+
{
213+
name: "context.DeadlineExceeded returns true",
214+
err: context.DeadlineExceeded,
215+
expectedResult: true,
216+
},
217+
{
218+
name: "wrapped context.DeadlineExceeded returns true",
219+
err: fmt.Errorf("operation error STS: GetCallerIdentity: %w", context.DeadlineExceeded),
220+
expectedResult: true,
221+
},
222+
{
223+
name: "context.Canceled returns false",
224+
err: context.Canceled,
225+
expectedResult: false,
226+
},
227+
{
228+
name: "wrapped context.Canceled returns false",
229+
err: fmt.Errorf("operation error STS: GetCallerIdentity: %w", context.Canceled),
230+
expectedResult: false,
231+
},
232+
{
233+
name: "non-timeout error returns false",
234+
err: errors.New("some random error"),
235+
expectedResult: false,
236+
},
237+
}
238+
239+
for _, tt := range tests {
240+
t.Run(tt.name, func(t *testing.T) {
241+
result := isTimeoutError(tt.err)
242+
if result != tt.expectedResult {
243+
t.Errorf("isTimeoutError() = %v, want %v for error: %v", result, tt.expectedResult, tt.err)
244+
}
245+
})
246+
}
247+
}
248+
249+
func TestIsSkippableRegionError(t *testing.T) {
250+
tests := []struct {
251+
name string
252+
err error
253+
expectedResult bool
254+
}{
255+
{
256+
name: "nil error returns false",
257+
err: nil,
258+
expectedResult: false,
259+
},
260+
{
261+
name: "context.DeadlineExceeded returns true (ENG-3665)",
262+
err: context.DeadlineExceeded,
263+
expectedResult: true,
264+
},
265+
{
266+
name: "wrapped context.DeadlineExceeded returns true (ENG-3665)",
267+
err: fmt.Errorf("operation error STS: GetCallerIdentity: %w", context.DeadlineExceeded),
268+
expectedResult: true,
269+
},
270+
{
271+
name: "context.Canceled returns false (parent cancellation, not region timeout)",
272+
err: context.Canceled,
273+
expectedResult: false,
274+
},
275+
{
276+
name: "opt-in region error returns true",
277+
err: &mockAPIError{
278+
code: "InvalidIdentityToken",
279+
message: "No OpenIDConnect provider found in your account",
280+
},
281+
expectedResult: true,
282+
},
283+
{
284+
name: "non-skippable error returns false",
285+
err: errors.New("some random error"),
286+
expectedResult: false,
287+
},
288+
}
289+
290+
for _, tt := range tests {
291+
t.Run(tt.name, func(t *testing.T) {
292+
result := isSkippableRegionError(tt.err)
293+
if result != tt.expectedResult {
294+
t.Errorf("isSkippableRegionError() = %v, want %v for error: %v", result, tt.expectedResult, tt.err)
295+
}
296+
})
297+
}
298+
}
299+
191300
func TestWrapRegionError(t *testing.T) {
192301
tests := []struct {
193302
name string
@@ -240,6 +349,27 @@ func TestWrapRegionError(t *testing.T) {
240349
shouldWrap: false,
241350
expectedText: "",
242351
},
352+
{
353+
name: "timeout error gets timeout-specific message",
354+
err: context.DeadlineExceeded,
355+
region: "me-south-1",
356+
shouldWrap: true,
357+
expectedText: "unreachable (timeout)",
358+
},
359+
{
360+
name: "wrapped timeout error gets timeout-specific message",
361+
err: fmt.Errorf("operation error STS: GetCallerIdentity: %w", context.DeadlineExceeded),
362+
region: "me-south-1",
363+
shouldWrap: true,
364+
expectedText: "unreachable (timeout)",
365+
},
366+
{
367+
name: "canceled error is not wrapped (parent cancellation, not region timeout)",
368+
err: context.Canceled,
369+
region: "me-south-1",
370+
shouldWrap: false,
371+
expectedText: "",
372+
},
243373
}
244374

245375
for _, tt := range tests {

0 commit comments

Comments
 (0)