add feature to configure promhttp error handling #411

Open
nnikitos95 wants to merge 1 commit into master from feat/add_promhttp_error_handling_flag

Conversation

nnikitos95

Motivation

The current implementation uses PromHTTP's default error handling, which may not suit all deployment scenarios. By allowing customization, users can:

  • Opt for more lenient error handling (e.g., ContinueOnError, as sketched below) in high-availability environments.
  • Ensure better alignment with their operational requirements.

Default Behavior

  • Maintains backward compatibility by retaining the current default error handling behavior unless explicitly configured.
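
For context, promhttp exposes three error-handling modes through promhttp.HandlerOpts: HTTPErrorOnError (the current default), ContinueOnError, and PanicOnError. The sketch below shows one way a configuration value could be mapped onto those modes while keeping today's default; the accepted strings and the function name are illustrative and not taken from this PR's diff.

```go
package web

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// errorHandlingFromString maps a configuration value to one of promhttp's
// error-handling modes. An empty or unknown value falls back to
// HTTPErrorOnError, promhttp's default, so unconfigured deployments keep
// the existing behavior. The accepted strings here are illustrative.
func errorHandlingFromString(mode string) (promhttp.HandlerErrorHandling, error) {
	switch mode {
	case "", "http-error":
		return promhttp.HTTPErrorOnError, nil
	case "continue":
		return promhttp.ContinueOnError, nil
	case "panic":
		return promhttp.PanicOnError, nil
	default:
		return promhttp.HTTPErrorOnError, fmt.Errorf("unknown promhttp error handling mode %q", mode)
	}
}
```

The returned value would then be set as the ErrorHandling field of the promhttp.HandlerOpts used to build the /metrics handler.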

@nnikitos95 force-pushed the feat/add_promhttp_error_handling_flag branch from b5a9fc6 to 9bfc60d on February 10, 2025 12:44
@nnikitos95 (Author)

cc: @SuperQ @kgeckhart

@SuperQ (Contributor) commented Mar 10, 2025

I'm not sure we should do this. If there are errors, we don't want to silently ignore them; they should be fixed.

@nnikitos95 (Author)

First of all, promlib has that option, and it looks convenient to have it here as well. The user is then able to define how they want to handle errors.

The Google API has many errors that can't be fixed easily, since their support is not known for being fast or customer-focused. We have seen plenty of cases where they rolled out descriptors in a BETA stage and the exporter started to fail. So this setting is a good way to keep collecting the metrics that can still be collected.

The problem we are trying to solve is similar to this

Also, this exporter exposes metrics that identify such errors, so even if the provided setting is ContinueOnError, users are still notified about them (see the sketch at the end of this comment).

As I mentioned, the default behaviour is still the same.

P.S. We are currently running a forked version with this setting and everything works fine: we haven't lost any metrics we expected, the exporter returns its own metrics in any case, and we are notified if any problems happen.
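
On the observability point above: promhttp can itself expose errors that occur during a scrape. If a registry is passed via the Registry field of promhttp.HandlerOpts, the handler registers promhttp_metric_handler_errors_total, partitioned by cause, so gathering and encoding errors stay visible and alertable even with ContinueOnError. A minimal sketch assuming a plain client_golang setup rather than this exporter's actual wiring (the listen address is illustrative):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// ... the exporter's collectors would be registered on reg here ...

	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		// Serve whatever could be gathered instead of failing the whole scrape.
		ErrorHandling: promhttp.ContinueOnError,
		// With a non-nil Registry, promhttp registers
		// promhttp_metric_handler_errors_total{cause="gathering"|"encoding"},
		// which can back an alert on partial collection failures.
		Registry: reg,
	})

	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9255", nil))
}
```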

@SuperQ (Contributor) commented Mar 10, 2025

The problem is that the linked error is an actual problem. By ignoring errors like that you're not going to get the data you expect.

This is a correctness problem that I don't think we want to paper over.

we haven't lost any metrics we expected

Are you sure about that? Because the linked issue is very likely to be a real problem.

@SuperQ (Contributor) commented Mar 10, 2025

So this setting is a good way to keep collecting the metrics that can still be collected.

This is a very bad idea. Partial data is very difficult to track and debug; this is why the default behavior is the way it is. Similarly, when Prometheus polls targets, it treats any non-200 status code or timeout as "no data".

This violates the "fail fast" principle and is not something we are likely to want to put in an official exporter without some serious discussion.

@nnikitos95 (Author)

Are you sure about that? Because the linked issue is very likely to be a real problem.

Yes. With the fail-fast approach we lose all the metrics anyway: either /metrics returns 500, or the metrics go stale because we haven't collected them in time and haven't reacted to the alerts that rely on them.

Obviously we could split the type prefixes of broken metrics onto separate exporter instances so they don't affect the reliable prefixes. But that takes time in any case, even with fail-fast and a quick reaction, and it usually happens at night, so we can still lose data.

Similarly, when Prometheus polls targets, it treats any non-200 status code or timeout as "no data".

I completely agree with that. But as I said above, for us it's not okay to have "no data" when we could have "some non-broken data".

By ignoring errors like that you're not going to get the data you expect.

Usually those metrics are broken on Google's side and are in some ALPHA/BETA stage that can't be filtered out, so there is nothing we can do about it.
