proposal: Support secret providers #47

Open · wants to merge 7 commits into main
Conversation

hsmatulisgoogle

This PR adds a proposal to support specifying secrets through secret providers.

TL;DR from doc

TL;DR: This document proposes adding a new way for Prometheus to discover and use secrets from various secret providers, similar to how service discovery works. It introduces a new configuration section where users can specify different secret providers and their configurations. It also defines interfaces and methods for secret providers to implement, allowing for flexibility in how secrets are fetched and managed.
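As a rough illustration (reusing the Kubernetes example that appears later in this thread; the exact field names are not final), referencing a secret from a provider could look like:

```yaml
# Illustrative only; the exact fields are discussed below.
password:
  provider: kubernetes
  namespace: ns1
  secret_id: pass2
```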

Signed-off-by: Henrique Spanoudis Matulis <[email protected]>
@bwplotka (Member) left a comment

Great progress, thanks!

@rapphil commented Mar 31, 2025

LGTM. Can you just update the description of the pull request for posterity?

@rajagopalanand left a comment

LGTM


### Where will the code live

Both the Alertmanager and Prometheus repos will be able to use secret providers. The code will eventually live in a separate repository specifically created for it.


Could this reside in https://github.com/prometheus/common?

Member

That would be one option, but only if it's a separate Go module. I think it's fine to skip this detail in the proposal, and just decide later.

@bwplotka (Member) left a comment

LGTM generally! Just the last nit noted by @rajagopalanand and it's good to go IMO!

Approving from my side, but I will ask around for a second Prometheus maintainer to have another look.


### Nested secrets

Secret providers might themselves require secrets to be configured. We will allow secrets to be passed to secret providers.
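As a hypothetical sketch (the `vault` provider name and `auth_token` field are illustrative, not part of the proposal), a provider whose own credentials are themselves a secret could bottom out at the file provider:

```yaml
# Hypothetical: the provider's own auth token is itself a secret,
# supplied here by the file provider so the chain terminates.
password:
  provider: vault
  secret_id: pass2
  auth_token:
    provider: file
    path: /etc/secrets/vault-token
```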
@machine424 (Member) commented Apr 7, 2025

IIUC, in the Prometheus-on-Kubernetes setup, if the secret provider requires auth (which seems likely; I wouldn't see the point of not having auth, as it'd go against the goals of the proposal), then Prometheus will need to store the credentials for talking to the secret provider using the same mechanism (Kubernetes Secrets) that it currently uses to store the credentials it retrieves, and is thus prone to the same "vulnerabilities". Am I missing something?

@bwplotka (Member) commented Apr 8, 2025

Nice, thanks for double-checking this.

This proposal is not about vulnerabilities (although you will have more choice of providers, based on security levels), but about allowing custom providers, like we allow custom service discoveries.

The main outcome of the "recursive/nested" sort of approach here is that a secret provider's own secret can't use the same secret provider, and cyclic references are not allowed. Eventually a secret provider has to use file (recommended), inline (with environment variable eval, for example), or any other provider that does not require secrets.

To avoid a tricky implementation we could immediately limit the design to exactly that - a provider's secret can only use inline or file based providers - but it might be cleaner not to limit the choice here (e.g. there might be other providers that do not require a secret, e.g. Kubernetes secrets).

Member

Thanks for the details.

I was confused by the security and vulnerabilities framing in "The motivation behind this design document is to enhance the security and flexibility of secret management in Prometheus. [...] can lead to security vulnerabilities. [...]". I'd reformulate that: the proposal is mainly about adding an interface for how to fetch secrets, and is not directly enhancing security per se, because the same storing mechanisms will continue to be used anyway to avoid a chicken-and-egg situation (inline and file will continue to be used for secret providers' secrets, as you mentioned...).

In order to avoid falling back into the same challenges we have with the SD mechanisms, maybe we should be more explicit about ownership/maintenance, conditions for adding a new provider, etc.
Also, did you consider only having a generic spec, like http_sd_config, and delegating the adapters' addition/maintenance to the community/users?

@hsmatulisgoogle (Author)

Thanks for the detailed comments!

To clarify a little about what is meant by the security issues: the main use case for our team in proposing this relates to the Kubernetes environment, where secrets stored on disk can be more easily exposed than those in memory.

> because the same storing mechanisms will continue to be used anyway to avoid a chicken-and-egg situation (inline and file will continue to be used for secret providers' secrets, as you mentioned...)

This chicken-and-egg issue wouldn't apply to Kubernetes specifically, since the container Prometheus is running in will naturally have access to the Kubernetes secret API as configured by Kubernetes. This is also common for some cloud providers, where there is an identity associated with the machine you are running on that will naturally give you access to the related secret APIs (assuming you have configured it with the correct access rights).
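(For example, with the standard client-go packages, a sketch of in-cluster access where the credentials come from the pod's service account rather than from user-supplied config:)

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newInClusterClient builds a Kubernetes client from the pod's own
// service account; no secret needs to appear in the Prometheus config.
func newInClusterClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig() // reads the token mounted by Kubernetes
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```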

> Also, did you consider only having a generic spec, like http_sd_config, and delegating the adapters' addition/maintenance to the community/users?

I feel like only having a generic spec would be too restrictive and would slow down adoption, as I imagine it would require running sidecars in certain environments. Additionally, designing and maintaining such a spec can be quite burdensome, as there are many secret providers, each with its own nuances. However, in theory a generic spec could also be added as a secret provider.

> In order to avoid falling back into the same challenges we have with the SD mechanisms, maybe we should be more explicit about ownership/maintenance, conditions for adding a new provider, etc.

Yes, I understand the concern about avoiding the issues with ownership/maintenance. I believe we could take a multi-pronged approach like the following:

1. For common secret providers, support them directly.
2. For less common secret providers, either:
   i. use a generic spec requiring a sidecar, as mentioned above, or
   ii. create a plugin system requiring custom compilation of Prometheus.

I am not sure what process we could use to draw the line between less common and common secret providers. Do you have any ideas about an approach like this one?

Member

Yea, we have the same challenges as with the SD mechanism. I don't think it's a blocker to this though, as the same solution and path for SD could be used for SP (secret providers). I think it's fine to delegate this and solve the problem together with SD, not trying to come up with some unique solutions for SP that won't fit SD later on.

However, in some way we are trying something new, because we want to potentially maintain those in a different repo (e.g. so AM doesn't need to import Prometheus), but it's TBC and perhaps doesn't need to be part of this proposal's decision.

> For less common secret providers, either:
> i. use a generic spec requiring a sidecar, as mentioned above

Good idea, but we tried that in the past and it's a bit too troublesome for users in practice: too much overhead and not as powerful as an OOTB experience.

> ii. create a plugin system requiring custom compilation of Prometheus

Yea, this is what we do with SD now and we should do the same for SP: https://github.com/prometheus/prometheus/blob/main/plugins.yml
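(If SP plugins mirror the SD plugins.yml format, which as far as I can tell is a flat YAML list of Go import paths compiled in at build time, a hypothetical SP equivalent could be:)

```yaml
# Hypothetical sp-plugins.yml, mirroring the SD plugins.yml format;
# the import paths below are illustrative.
- github.com/prometheus/prometheus/secrets/kubernetes
- github.com/example/prometheus-secret-provider-x
```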

Member

Thanks for the details.

> This chicken-and-egg issue wouldn't apply to Kubernetes specifically, since the container Prometheus is running in will naturally have access to the Kubernetes secret API as configured by Kubernetes.

And AFAIK, those credentials live in the container's FS...
As long as we (should) use inline and file for the secret providers' own secrets, I still don't think this proposal should mention "security enhancement"; it will indeed help avoid storing (or having to provide) the final secrets as inline or files, but users will still need to provide the meta secrets as inline or files.

Yes, I don't think the "ownership/maintenance" question is a blocker; we can also add that later, I just wanted to draw attention to it. Maybe we can mention in the proposal that we'll work on that alongside SD.

I like the idea of build-time plugins as for SD in https://github.com/prometheus/prometheus/blob/main/plugins.yml
Not sure how that works, but if we could have that also address #47 (comment) to control/limit the extra dependencies, that'd be great.

A last one: I see a lot of focus on the Kubernetes secret provider (which would be a good candidate for an OOTB/enabled-by-default plugin). Do you know what the other providers could be? Could we list some of them? It'd help with thinking about the API and with implementation.


### Pitfalls of the current solution

Storing secrets in the filesystem poses risks, especially in environments like Kubernetes, where any pod on a node can access files mounted on that node. This could expose secrets to attackers. Additionally, configuring secrets through the filesystem often requires extra setup steps in some environments, which can be cumbersome for users.
Member

> where any pod on a node can access files mounted on that node

This seems like a "[citation needed]". It's possible, with a privileged pod or hostPath for example, but you should configure Kubernetes to not allow that, with something like the restricted (or even baseline) Pod Security Standard.

If the argument is instead that pods share a node, and therefore an escape from a pod means the files are accessible, then at that point the security of the container runtime or Kubernetes has been compromised, and it is possible that the memory of the process would also be accessible. If that's the concern, then this proposal should also cover whether there are mitigations for that (such as setting the process non-dumpable, etc.), but I don't think that pitfall is actually something this proposal is intending to address, so either don't mention it, or do mention it and declare it out of scope.

Overall the proposal makes perfect sense, but this pitfall doesn't seem that useful to include; it just doesn't sell the proposal. Should the effort instead be spent on securing whichever Kubernetes setup ends up not being secure?

Member

Yes, related/same observation I had here: #47 (comment). Even with this proposal we'll continue passing (meta) secrets inline or via file; I don't think the proposal should claim that the changes will improve security.

```yaml
secret_field:
  provider: <type of the provider>
  <property1>: <value1>
```
Member

Using a map is easy but makes validation more difficult. Generally we are strict with configurations, preferring that they error out if they are bad. One way to implement this would be something like:

```go
type SecretProvider struct {
	Provider   string                    `yaml:"provider"`
	Kubernetes *kubernetes_secret.Config `yaml:"kubernetes,omitempty"`
	// ...other providers...
}
```

(Where kubernetes_secret.Config comes from the Kubernetes secret provider, but see later for how discovery does this. It could also make sense to omit the provider string at this point and just test for one of the config structs being set, i.e. like oneof in proto.)

Then the config would be like:

```yaml
secret_field:
  provider: kubernetes
  kubernetes:
    field_for_kubernetes: <value1>
```

Then this can use the YAML parser's validation logic (I don't want to get tied up in implementation details here; that may need some kind of dynamic registration, as done with discovery in discovery/registry.go, rather than embedding types in the struct definition as above).

If using a map approach rather than this nested struct approach, it would be possible to have a function to verify that the map only contains known keys (and check the values?), although combining that with auth_token as mentioned later seems somewhat tricky, as the map would have to be declared as map[string]any.

I agree with the desire to defer the auth_token part of this work, and I don't think we even need as much detail as I've put above in the actual proposal. However, the design of this affects how the configuration should be validated and is an important part of the API surface of this proposal that should be considered upfront.
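(To make the oneof idea concrete, a minimal validation sketch against the SecretProvider struct above; this is an illustration assuming fmt from the standard library, not a proposed API:)

```go
// Hypothetical oneof-style validation for the SecretProvider struct
// above: exactly one provider-specific section must be set.
func (s *SecretProvider) Validate() error {
	set := 0
	if s.Kubernetes != nil {
		set++
	}
	// ...count the other provider fields the same way...
	if set != 1 {
		return fmt.Errorf("expected exactly one provider config, got %d", set)
	}
	return nil
}
```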

Member

I agree with David here; different secret providers might need different information to fetch a field.

```go
type SecretField struct {
	Provider  any // I know it's not really any
	property1 any
	property2 any
}
```

It's forcing all providers to have the same configuration options.

@bwplotka (Member) commented Apr 16, 2025

Yea, that's why @TheSpiritXIII (initial author) opted initially for:

```yaml
password:
password_file:
password_ref: <ref to actual provider, e.g. kubernetes with fields_for_kubernetes>
```

I think what @dgl proposes and the challenges @ArthurSens mentions make sense - it at least allows consistency with e.g. scrape config discovery, which follows the same pattern.

I DO however vote for at least attempting in practice to implement the variance of:

```yaml
password: <just "string" OR full complex provider(s) struct>
```

which avoids the odd _ref or _complex duplicates everywhere.

So... how do we find out what's possible? Can we implement, right now, example Go code that shows the yaml unmarshalling for the 2 core questions we have here:

1. Dynamic provider fields VS per-provider sections.
2. Field with the dynamic "inline OR providers" type VS separate fields for inline, file and providers.

We could then easily compare all the options. WDYT @hsmatulisgoogle?

Essentially, what will help us find the best solution and consensus here (:

Member

I couldn't stop myself from experimenting here...

Here is the showdown for question (1): https://gist.github.com/bwplotka/baf5344837d98719dd126ed6bd143015 (the same in playground).

Check the "A" code (what this proposal proposes now). I think it does not look too magic or bad. Totally doable and trivial. I also like option "C" (somewhat what @dgl and @ArthurSens propose). A tiny bit more explicit, more consistent with scrape job + SD... we could do that too!

Member

And to answer (2), I added secret2_test to the gist: https://gist.github.com/bwplotka/baf5344837d98719dd126ed6bd143015 (playground)

Member

To sum up: the learning for me is to NOT worry about the yaml unmarshal implementation. Generally everything discussed here is similarly safe and relatively trivial to implement. Let's focus on what we want the user to be able to do.

> 1. Dynamic provider fields VS per-provider sections.

Dynamic is trivial to implement:

```go
func (s *SecretA) UnmarshalYAML(_ context.Context, unmarshalFn func(any) error) error {
	// Peek at the provider discriminator first...
	provider := struct {
		Provider string `yaml:"provider"`
	}{}
	if err := unmarshalFn(&provider); err != nil {
		return err
	}

	// ...then unmarshal into the matching provider-specific config.
	switch provider.Provider {
	case "kubernetes":
		s.Provider = provider.Provider
		return unmarshalFn(&s.Kubernetes)
	case "file":
		s.Provider = provider.Provider
		return unmarshalFn(&s.File)
	default:
		return fmt.Errorf("unknown provider %q", provider.Provider)
	}
}
```

...but I like the consistency of the @dgl and @ArthurSens case:

```go
cExample = `
password:
  kubernetes:
    namespace: "<ns>"
    name: "<secret name>"
    key: "<data's key for secret name>"
`
cExampleFile = `
password:
  file:
    path: "<path to secret file>"
`
```

> 2. Field with the dynamic "inline OR providers" type VS separate fields for inline, file and providers

Dynamic inline is trivial to implement as well; I vote for doing this.

```go
func (s *SecretA) UnmarshalYAML(_ context.Context, unmarshalFn func(any) error) error {
	// Try inlined form first, for compatibility and ease of use.
	var inlinedForm string
	if err := unmarshalFn(&inlinedForm); err == nil {
		s.Inline = InlineSP{Password: inlinedForm}
		return nil
	}
	// Fallback to complex struct.
	// NOTE: "plain" casting is needed to avoid recursive call to UnmarshalYAML.
	// This is what the current Prometheus code is doing too.
	type plain SecretA
	return unmarshalFn((*plain)(s))
}
```

Member

👍

The only thing the experiment isn't addressing is that, for example, SecretA brings in the Kubernetes struct, which in a real implementation could end up bringing in the Kubernetes client libraries. Some kind of dynamic registration, like discovery does, could potentially break this dep.
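(A rough sketch of what that registration could look like, loosely modeled on discovery/registry.go; all names here are illustrative:)

```go
package secrets

// ProviderConfig is a stand-in for whatever interface provider
// configs end up implementing.
type ProviderConfig interface {
	Name() string
}

// registry maps a provider name to a factory for its empty config.
var registry = map[string]func() ProviderConfig{}

// Register is called from each provider package's init(), so the
// core never has to import provider dependencies directly.
func Register(name string, newConfig func() ProviderConfig) {
	registry[name] = newConfig
}

// configFor returns a fresh config for the named provider, ready to
// be unmarshalled from YAML.
func configFor(name string) (ProviderConfig, bool) {
	newConfig, ok := registry[name]
	if !ok {
		return nil, false
	}
	return newConfig(), true
}
```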

However, I definitely agree with:

> Let's focus on what we want the user to be able to do

I think the style you've outlined is good and this is enough for the proposal.

Member

> The only thing the experiment isn't addressing is that, for example, SecretA brings in the Kubernetes struct, which in a real implementation could end up bringing in the Kubernetes client libraries. Some kind of dynamic registration, like discovery does, could potentially break this dep.

Yea, true, then we might need some dynamic "any" struct anyway for that reason, but I wanted to focus on the user side first.

@ArthurSens (Member) left a comment

Looks pretty solid :) I just had some small questions


We can classify all errors stemming from secret provider failures into 2 cases.

The first case is that we have not gotten any secret value since startup. In this case we should not initialize any component that mentions this faulty secret; we should log the failure and schedule a retry to get the secret value and initialize the component.
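(A minimal sketch of that first case; the fetch/start hooks and the fixed retry delay are assumptions, and a real implementation would likely use exponential backoff. Assumes context, log, and time from the standard library.)

```go
// Hypothetical sketch: the component referencing the secret is only
// started once a first fetch succeeds; failures are logged and retried.
func initWhenAvailable(ctx context.Context, fetch func(context.Context) (string, error), start func(secret string)) {
	for {
		secret, err := fetch(ctx)
		if err == nil {
			start(secret)
			return
		}
		log.Printf("secret fetch failed, will retry: %v", err)
		select {
		case <-ctx.Done():
			return
		case <-time.After(30 * time.Second): // fixed delay for illustration
		}
	}
}
```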
Member

In case we never get a response, should we fail Prometheus's start-up or start with a missing secret?

Or even if we're in the process of retrying, should we delay Prometheus's start-up or start things anyway?

A state enum describing which error condition the secret is in:
* error: there has been no successful request and no secret has been retrieved
* stale: a request has succeeded but the latest request failed
* none: the last request was successful
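(Sketched as a Go enum, using the names from the list above:)

```go
// Sketch of the state enum; names follow the proposal text.
type SecretState int

const (
	// StateError: no request has succeeded and no secret has been retrieved.
	StateError SecretState = iota
	// StateStale: a request has succeeded before, but the latest one failed.
	StateStale
	// StateNone: the last request was successful.
	StateNone
)
```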
Member

What do you think about "success" to represent the success state? Was there any particular reason to call it "none"?

Comment on lines +82 to +84:

```yaml
provider: kubernetes
namespace: ns1
secret_id: pass2
```
@bwplotka (Member) commented Apr 16, 2025

What's a secret ID? Maybe better to make it obvious what we would practically see, e.g.:

Suggested change:

```diff
-provider: kubernetes
-namespace: ns1
-secret_id: pass2
+provider: kubernetes
+namespace: <ns>
+name: <secret name>
+key: <data's key for secret name>
```

@bwplotka (Member) left a comment

Amazing feedback and discussion in https://github.com/prometheus/proposals/pull/47/files#r2041335383

Should we update the proposal with the mentioned alternatives considered?

What do you think is the reasonable solution from this discussion? I see the potential for changing the proposal slightly to support the following (for consistency with SD):

```yaml
password: "<inlined secret>"
# or
password:
  kubernetes:
    namespace: "<ns>"
    name: "<secret name>"
    key: "<data's key for secret name>"
# or
password:
  file:
    path: "<path to secret file>"
```
