
fix(aws): skip failing RDS and VPC collectors instead of aborting provider #863

Draft
stephan-rayner wants to merge 1 commit into main from stephan-rayner/fix/aws-collector-init

Conversation

@stephan-rayner
Contributor

@stephan-rayner stephan-rayner commented Mar 27, 2026

Summary

  • RDS and VPC collector init failures previously returned nil, err, killing the entire AWS provider and dropping all metrics from every other service
  • Aligns RDS and VPC error handling with S3 and EC2: log the error at ERROR level and continue so remaining collectors stay intact

Fixes #716

Test plan

  • make test passes
  • Verify that a pricing config error for RDS/VPC no longer prevents S3/EC2 metrics from being exported

@stephan-rayner stephan-rayner requested a review from a team March 27, 2026 04:50
@stephan-rayner stephan-rayner self-assigned this Mar 27, 2026
@stephan-rayner stephan-rayner added bug Something isn't working area/monitoring labels Mar 27, 2026
fix(aws): skip failing RDS and VPC collectors instead of aborting provider

  createAWSConfig failures during RDS and VPC collector init returned
  nil, err, killing the entire AWS provider. S3 and EC2 already log
  and continue on init failure. Align RDS and VPC with the same pattern
  so a pricing config error for one collector does not drop metrics from
  all others.

  Also makes createAWSConfig a package-level var to allow injection in
  tests. Adds unit tests asserting that a createAWSConfig failure for
  RDS or VPC skips that collector while leaving the remaining collectors
  intact.

  Fixes #716
@stephan-rayner stephan-rayner force-pushed the stephan-rayner/fix/aws-collector-init branch from 5e41c46 to 36aeb94 on March 27, 2026 04:56
@leonorfmartins
Contributor

hm, I'm more unsure about this one. I agree that we should not block other collectors from progressing if one of the collectors fails to be initialised. But just logging an error and continuing will not be enough, because then our error rate SLOs will never fire and we won't know a collector broke in the first place.
What if, on top of the change you are proposing, we return an error from each collector's initialisation? That way we will still know when a collector fails to be initialised, but healthy collectors will keep doing their thing.

@stephan-rayner
Contributor Author

> hm, I'm more unsure about this one. I agree that we should not block other collectors from progressing if one of the collectors fails to be initialised. But just logging an error and continuing will not be enough, because then our error rate SLOs will never fire and we won't know a collector broke in the first place. What if, on top of the change you are proposing, we return an error from each collector's initialisation? That way we will still know when a collector fails to be initialised, but healthy collectors will keep doing their thing.

I like that @leonorfmartins! It was kind of a coin flip for this change. A little more than half of the resources we look at do it the way in the PR, and a little less than half do it the way you are suggesting.

Upon further reflection, and after having slept on it, I agree with you!

Development

Successfully merging this pull request may close these issues.

AWS: Handle collector errors gracefully