Skip to content

proposal: Support secret providers #47

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

hsmatulis
Copy link

This PR adds a proposal to support specifying secrets through secret providers.

TL;DR from doc

TL;DR: This document proposes adding a new way for Prometheus to discover and use secrets from various secret providers, similar to how service discovery works. It introduces a new configuration section where users can specify different secret providers and their configurations. It also defines interfaces and methods for secret providers to implement, allowing for flexibility in how secrets are fetched and managed.

Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress thanks!

Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
@rapphil
Copy link

rapphil commented Mar 31, 2025

LGTM. can you just update the description of the pull request for posteriority.

Copy link

@rajagopalanand rajagopalanand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally! Just last nit noted by @rajagopalanand and good to go IMO!

Approving from my side, but I will ask around for a second Prometheus maintainer to have another look

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty solid :) I just had some small questions

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing feedback and discussion in https://github.com/prometheus/proposals/pull/47/files#r2041335383

Should we update the proposal with the mentioned alternatives considered?

What do you think is the reasonable solution from this discussion? I see the potential for changing proposal slightly to support the following (for consistency with SD):

password: "<inlined secret>"
password:
  kubernetes:
    namespace: "<ns>"
    name: "<secret name>"
    key: "<data's key for secret name>"
password:
  file:
    path: "<path to secret file>"

@rajagopalanand
Copy link

rajagopalanand commented Apr 23, 2025

I vote for changing the proposal to be consistent with SD

Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
@hsmatulis hsmatulis force-pushed the secret-providers branch 4 times, most recently from 535a36e to 9891e9c Compare April 23, 2025 08:38
@hsmatulis hsmatulis force-pushed the secret-providers branch 2 times, most recently from d726925 to b918d59 Compare April 23, 2025 08:42
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Excited to see this in action. :)

@hsmatulis
Copy link
Author

I changed the config style to be more consistent with SD in the proposal. Thanks everyone for all the feedback, and feel free to reach out on the prometheus slack if you have any feedback or ideas. I am working on an implementation for the proposal:)

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! @dgl @machine424 -- are you ok to merge? 🤗

## Remote Secrets (Secret Providers)

* **Owners:**
* Henrique Matulis (@hsmatulisgoogle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: username doesn't match (anymore? I think it was changed)

Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one last question :)


## Action Plan

* [ ] Create action plan after doc is stable!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to work on this before or after merging? :)

Comment on lines +78 to +86
For example when specifying a password fetched from the kubernetes provider with an id of `pass2` in namespace `ns1` for the HTTP passsword field it would look like this:

```
password:
kubernetes:
namespace: <ns>
name: <secret name>
key: <data's key for secret name>
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the example is not matching the statement.

Did you mean:

Suggested change
For example when specifying a password fetched from the kubernetes provider with an id of `pass2` in namespace `ns1` for the HTTP passsword field it would look like this:
```
password:
kubernetes:
namespace: <ns>
name: <secret name>
key: <data's key for secret name>
```
For example when specifying a password fetched from the kubernetes provider with an id of `pass2` in namespace `ns1` for the HTTP passsword field it would look like this:
    password:
      kubernetes:
        namespace: ns1
        name: pass2
        key: http_password


The second case is that we already have a secret value, but refreshing it has resulted in an error. In this case we should keep the component that uses this secret running with the potentially stale secret, and schedule a retry.

If we are starting up prometheus and do not get any secret values, startup will continue and prometheus will continue to run and retry finding secrets. If there are components that are fully specified, they will run during this time. The idea is that it is better to send partial metrics than no metrics, and the emitted metrics for secrets can alert users if there is a problem with their service providers.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If we are starting up prometheus and do not get any secret values, startup will continue and prometheus will continue to run and retry finding secrets. If there are components that are fully specified, they will run during this time. The idea is that it is better to send partial metrics than no metrics, and the emitted metrics for secrets can alert users if there is a problem with their service providers.
If we are starting up Prometheus and do not get any secret values, startup will continue and Prometheus will continue to run and retry finding secrets. If there are components that are fully specified, they will run during this time. The idea is that it is better to send partial metrics than no metrics, and the emitted metrics for secrets can alert users if there is a problem with their service providers.

# ... other config ...
```
* **Caching Behavior on Failure:** If a scheduled background refresh attempt fails (e.g., due to network issues, temporary provider unavailability, invalid credentials after rotation), Prometheus will continue to use the last successfully fetched secret value. This ensures components relying on the secret can continue operating with the last known good credential, preventing outages due to transient refresh problems. Failed refresh attempts will be logged, and the `prometheus_remote_secret_state` metric will reflect the 'stale' state.
* **Permission Errors:** If a component using a cached secret reports a specific permission or authentication error, this might indicate the cached secret is no longer valid. Hence, an immediate refresh attempt for that specific secret is triggered, bypassing the regular refresh interval.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about the implications of this statement. Isn't this requirement for secrets providers leaking into error handling strategy of all components that use a secrets provider?

We only know that a credential is expired/revoked after we try to use it. My point is that all components that use secrets provider will need to implement error handling in a specific way or will the secret provider provide some functionality that components should use?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take PagerDuty or Slack notifier as an example, is the expectation those integrations/receiver implementations should interpret AuthN errors returned and somehow notify the secret provider to refresh before retrying?

prometheus_remote_secret_last_successful_fetch_seconds{provider="kubernetes", secret_id="pass1"} 15
```

A state enum describing in which error condition the secret is in:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the state, maybe it would be useful to also expose some counters to show events. Why? because that allow us to identify precisely what is happening across multiple scrapes, network failures etc. Something like:

prometheus_remote_secret_fetch_failures_total
prometheus_remote_secret_fetch_success_total

* **Not Querying Per-Scrape:** It is explicitly **not** the default behavior to query the secret provider on every use.


### Secret rotation
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any example of backend that would benefit from the behaviours defined in this section.

Any example of backend that does not support secret overlapping?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants