See Slack thread for initial context.
Summary
The AWS MSK collector iterates through all regions sequentially under a shared collector timeout (default 1 minute). When any region's MSK endpoint is temporarily slow to respond, the sequential loop exceeds the timeout budget and the entire scrape fails with context deadline exceeded.
Observed behaviour
Intermittent errors in the aws_msk collector:
level=ERROR msg="error listing MSK clusters" provider=aws logger=msk region=eu-north-1 error="operation error Kafka: ListClustersV2, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
level=ERROR msg="could not collect metrics" provider=aws collector=aws_msk message="context deadline exceeded"
The failing region rotates across different regions on different days (eu-north-1, ap-northeast-3, ap-east-2, ca-west-1, ap-southeast-4, etc.), confirming this is not a region-specific issue but a transient network slowness problem that can affect any region.
Root cause
In pkg/aws/msk/msk.go, Collect() iterates regions sequentially and passes the parent context directly to each ListClustersV2 call:
for _, region := range c.regions {
    select {
    case <-ctx.Done():
        return ctx.Err() // fails entire collector if deadline exceeded
    default:
    }
    clusters, err := regionClient.ListMSKClusters(ctx) // parent ctx passed directly
    ...
}
If one region's endpoint hangs (e.g. a TCP connection that never receives a response), it consumes the entire shared timeout. The next iteration's ctx.Done() check then fires, and the whole collector returns an error.
By contrast, the other AWS collectors (aws_ec2, aws_rds, NATGATEWAY, etc.) handle each region independently and run them in parallel, so a single slow region doesn't affect the others.
Suggested fix
Wrap each ListClustersV2 call in its own short sub-context (e.g. 10–15 seconds per region), independent of the outer collector timeout:
regionCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
clusters, err := regionClient.ListMSKClusters(regionCtx)
cancel()
if err != nil {
    c.logger.Error("error listing MSK clusters", "region", regionName, "error", err)
    continue // slow/unavailable region no longer blocks the rest
}
This matches the intent of the existing continue on error — the goal is clearly to skip bad regions, but the shared context means a hanging region can still kill the whole scrape before the error is even returned.
Impact
Intermittent gaps in MSK cost metrics. The collector works correctly most of the time (~14s average scrape across all regions) but fails periodically when any region is temporarily slow.