AWS MSK collector: sequential region iteration with shared timeout causes intermittent scrape failures #895

@simaarfania

Description

See Slack thread for initial context.

Summary

The AWS MSK collector iterates through all regions sequentially under a shared collector timeout (default 1 minute). When any region's MSK endpoint is temporarily slow to respond, the sequential loop exceeds the timeout budget and the entire scrape fails with context deadline exceeded.

Observed behaviour

Intermittent errors in the aws_msk collector:

level=ERROR msg="error listing MSK clusters" provider=aws logger=msk region=eu-north-1 error="operation error Kafka: ListClustersV2, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded"
level=ERROR msg="could not collect metrics" provider=aws collector=aws_msk message="context deadline exceeded"

The failing region rotates across different regions on different days (eu-north-1, ap-northeast-3, ap-east-2, ca-west-1, ap-southeast-4, etc.), confirming the failures are not region-specific but are caused by transient network slowness that can affect any region.

Root cause

In pkg/aws/msk/msk.go, Collect() iterates regions sequentially and passes the parent context directly to each ListClustersV2 call:

for _, region := range c.regions {
    select {
    case <-ctx.Done():
        return ctx.Err()  // fails entire collector if deadline exceeded
    default:
    }
    clusters, err := regionClient.ListMSKClusters(ctx)  // parent ctx passed directly
    ...
}

If one region's endpoint hangs (e.g. a TCP connection that never receives a response), it consumes the entire shared timeout budget. The next iteration's ctx.Done() check then fires, returning an error for the whole collector.

By contrast, the other AWS collectors (aws_ec2, aws_rds, NATGATEWAY, etc.) scope their work per region and run regions in parallel, so a single slow region doesn't affect the others.

Suggested fix

Wrap each ListClustersV2 call in its own short sub-context (e.g. 10–15 seconds per region), independent of the outer collector timeout:

regionCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
clusters, err := regionClient.ListMSKClusters(regionCtx)
cancel()
if err != nil {
    c.logger.Error("error listing MSK clusters", "region", regionName, "error", err)
    continue  // slow/unavailable region no longer blocks the rest
}

This matches the intent of the existing continue on error — the goal is clearly to skip bad regions, but the shared context means a hanging region can still kill the whole scrape before the error is even returned.

Impact

Intermittent gaps in MSK cost metrics. The collector works correctly most of the time (~14s average scrape across all regions) but fails periodically when any region is temporarily slow.
