AWS MSK collector: sequential region iteration with shared timeout causes intermittent scrape failures #895

@simaarfania

Description

See Slack thread for initial context.

Summary

The AWS MSK collector iterates through all regions sequentially under a shared collector timeout (default 1 minute). When any region's MSK endpoint is temporarily slow to respond, the sequential loop exceeds the timeout budget and the entire scrape fails with context deadline exceeded.

Observed behaviour

Intermittent errors in the aws_msk collector:

level=ERROR msg="error listing MSK clusters" provider=aws logger=msk region=eu-north-1 error="operation error Kafka: ListClustersV2, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
level=ERROR msg="could not collect metrics" provider=aws collector=aws_msk message="context deadline exceeded"

The failing region rotates across different regions on different days (eu-north-1, ap-northeast-3, ap-east-2, ca-west-1, ap-southeast-4, etc.), confirming the failures are not region-specific but are caused by transient network slowness that can affect any region.

Root cause

In pkg/aws/msk/msk.go, Collect() iterates regions sequentially and passes the parent context directly to each ListClustersV2 call:

for _, region := range c.regions {
    select {
    case <-ctx.Done():
        return ctx.Err()  // fails entire collector if deadline exceeded
    default:
    }
    clusters, err := regionClient.ListMSKClusters(ctx)  // parent ctx passed directly
    ...
}

If one region's endpoint hangs (e.g. a TCP connection that never receives a response), it consumes the entire shared timeout budget. The next iteration's ctx.Done() check then fires, returning an error for the whole collector.

By contrast, the other AWS collectors (aws_ec2, aws_rds, NATGATEWAY, etc.) scope their work per region and run regions in parallel, so a single slow region doesn't affect the others.

Suggested fix

Wrap each ListClustersV2 call in its own short sub-context (e.g. 10–15 seconds per region), independent of the outer collector timeout:

regionCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
clusters, err := regionClient.ListMSKClusters(regionCtx)
cancel()
if err != nil {
    c.logger.Error("error listing MSK clusters", "region", regionName, "error", err)
    continue  // slow/unavailable region no longer blocks the rest
}

This matches the intent of the existing continue on error — the goal is clearly to skip bad regions, but the shared context means a hanging region can still kill the whole scrape before the error is even returned.

Impact

Intermittent gaps in MSK cost metrics. The collector works correctly most of the time (~14s average scrape across all regions) but fails periodically when any region is temporarily slow.
