proposal: Support secret providers #47
## Remote Secrets (Secret Providers)

* **Owners:**
  * Henrique Matulis (@hsmatulisgoogle)

* **Implementation Status:** Not implemented

* **Related Issues and PRs:**
  * https://github.com/prometheus/prometheus/issues/8551
  * https://github.com/prometheus/prometheus/issues/11477
  * https://github.com/prometheus/alertmanager/issues/3108
  * https://github.com/prometheus/prometheus/pull/13955
  * https://github.com/prometheus/exporter-toolkit/pull/141
  * https://github.com/prometheus/prometheus/issues/5795

* **Other docs or links:**
  * [Prometheus Remote Secrets Doc](https://docs.google.com/document/d/1EqHd2EwQxf9SYD8-gl3sgkwaU6A10GhiN7aw-2kx7NU/edit?tab=t.0)
    * Previous proposal by @TheSpiritXIII
  * https://stackoverflow.com/questions/43609144/can-prometheus-store-basic-auth-passwords-in-any-format-other-then-plain-text
  * https://groups.google.com/g/prometheus-users/c/yWLE9qoG5GU/m/ke8ewxjIAQAJ

> TL;DR: This proposal introduces "secret providers" for Prometheus and Alertmanager, enabling them to fetch secrets from external systems. The config format will allow specifying a provider and its parameters for `<secret>` fields.
## Why

The motivation behind this design document is to enhance the security and flexibility of secret management in Prometheus. Currently, Prometheus only supports reading secrets from the filesystem or directly from the configuration file, which can lead to security vulnerabilities and limitations when working with certain service providers.

This proposal introduces secret discovery, similar to service discovery, where different secret providers can contribute code to read secrets from their respective APIs. This would allow for more secure and dynamic secret retrieval, eliminating the need to store secrets in the filesystem and reducing the potential for unauthorized access.

### Pitfalls of the current solution

Storing secrets in the filesystem poses risks, especially in environments like Kubernetes, where any pod on a node can access files mounted on that node. This could expose secrets to attackers. Additionally, configuring secrets through the filesystem often requires extra setup steps in some environments, which can be cumbersome for users.

Storing secrets inline can also pose risks, as the configuration file may still be accessible through the filesystem. Additionally, it can lead to configuration files becoming cluttered and difficult to manage.
## Goals

Goals and use cases for the solution as proposed in [How](#how):

* Allow Prometheus to read secrets remotely from secret providers.
* Anywhere in the configs ([1](https://prometheus.io/docs/prometheus/latest/configuration/configuration/), [2](https://prometheus.io/docs/alerting/latest/configuration/#configuration-file-introduction)) with the `<secret>` type, it should be possible to fetch from secret providers.
  * This will include Alertmanager.
* Introduce secret discovery, similar to service discovery, where different secret providers can contribute code to read secrets from their respective APIs.
* Backwards compatibility for secrets in config.

### Audience

* Prometheus maintainers
* Alertmanager maintainers
* Secret providers interested in contributing code
* Users looking to use secret providers

## Non-Goals

* Support for non-string secret values
* Secret transformations/processing
  * Things like string concatenation, base64 encoding/decoding
* Default values
* De-duplication and re-using common config values for secret provider configs
  * This will be left to a follow-up proposal if needed
## How

### Configuration

Wherever a `<secret>` type is present in the configuration files, we will allow users to specify a map describing how to fetch the secret from a secret provider. The generic format would be:
```
secret_field:
  provider: <type of the provider>
  <property1>: <value1>
  ...
  <propertyN>: <valueN>
```

> **Review comment (@dgl):** Using a map is easy but makes validation more difficult. Generally we are strict with configurations, preferring they error out if they are bad. One way to implement this would be something like:
>
> ```go
> type SecretProvider struct {
> 	Provider   string                    `yaml:"provider"`
> 	Kubernetes *kubernetes_secret.Config `yaml:"kubernetes,omitempty"`
> 	// [...other providers...]
> }
> ```
>
> (Where `kubernetes_secret.Config` comes from the Kubernetes secret provider, but see later for how discovery does this. It could also make sense to omit the provider string at this point and just test for one of the config structs being set, i.e. like `oneof` in proto.) Then the config would be like:
>
> ```yaml
> secret_field:
>   provider: kubernetes
>   kubernetes:
>     field_for_kubernetes: <value1>
> ```
>
> Then this can use the YAML parser's validation logic (I don't want to get tied up in implementation details here; that may need some kind of dynamic registration as done with discovery in `discovery/registry.go`, rather than embedding types in the struct definition as above). If using a map approach rather than this nested struct approach, it would be possible to have a function to verify the map only contains known keys (and checking the values?), although combining that with […]. I agree with the desire to defer the […].

> **Review comment (@ArthurSens):** I agree with David here, different secret providers might need different information to fetch a field. Something like
>
> ```go
> type SecretField struct {
> 	Provider  any // I know it's not really any
> 	property1 any
> 	property2 any
> }
> ```
>
> is forcing all providers to have the same configuration options.

> **Review comment (@bwplotka):** Yea, that's why @TheSpiritXIII (initial author) opted initially for […]. I think what @dgl proposes and the challenges @ArthurSens mentions make sense: it allows at least consistency with e.g. scrape config discovery, which follows the same pattern. I DO however vote for at least attempting in practice to implement the variance of […], which avoids us having this odd […]. So... how to find out what's possible? Can we right now implement example Go code that shows YAML unmarshalling for the two core questions we have here? We could then compare all easily. WDYT, @hsmatulisgoogle? Essentially, what will help us find the best solution and consensus here (:

> **Review comment (@bwplotka):** I couldn't stop myself from experimenting here. Here is the showdown for question (1): https://gist.github.com/bwplotka/baf5344837d98719dd126ed6bd143015 (the same in playground). Check the "A" code (what this proposal proposes now). I think it does not look too magic or bad; totally doable and trivial. I also like option "C" (somewhat what @dgl and @ArthurSens propose): a tiny bit more explicit, more consistent with scrape job + SD. We could do that too!

> **Review comment (@bwplotka):** And to answer (2), added […].

> **Review comment (@bwplotka):** To sum up, the learning for me is to NOT worry about the YAML unmarshal implementation. Generally everything discussed here is similarly safe and relatively trivial to implement. Let's focus on what we want the user to be able to do. Dynamic is trivial to implement:
>
> ```go
> func (s *SecretA) UnmarshalYAML(_ context.Context, unmarshalFn func(any) error) error {
> 	provider := struct {
> 		Provider string `yaml:"provider"`
> 	}{}
> 	if err := unmarshalFn(&provider); err != nil {
> 		return err
> 	}
> 	switch provider.Provider {
> 	case "kubernetes":
> 		s.Provider = provider.Provider
> 		return unmarshalFn(&s.Kubernetes)
> 	case "file":
> 		s.Provider = provider.Provider
> 		return unmarshalFn(&s.File)
> 	default:
> 		return fmt.Errorf("unknown provider %q", provider.Provider)
> 	}
> }
> ```
>
> ...but I like the consistency of the @dgl and @ArthurSens case:
>
> ```go
> cExample = `
> password:
>   kubernetes:
>     namespace: "<ns>"
>     name: "<secret name>"
>     key: "<data's key for secret name>"
> `
> cExampleFile = `
> password:
>   file:
>     path: "<path to secret file>"
> `
> ```
>
> Dynamic inline is trivial to implement as well; I vote for doing this:
>
> ```go
> func (s *SecretA) UnmarshalYAML(_ context.Context, unmarshalFn func(any) error) error {
> 	// Try inlined form first, for compatibility and ease of use.
> 	var inlinedForm string
> 	if err := unmarshalFn(&inlinedForm); err == nil {
> 		s.Inline = InlineSP{Password: inlinedForm}
> 		return nil
> 	}
> 	// Fallback to complex struct.
> 	// NOTE: "plain" casting is needed to avoid a recursive call to UnmarshalYAML.
> 	// This is what the current Prometheus code is doing too.
> 	type plain SecretA
> 	return unmarshalFn((*plain)(s))
> }
> ```

> **Review comment (@dgl):** 👍 The only thing the experiment isn't addressing is that, for example, […]. However, I definitely agree: I think the style you've outlined is good and this is enough for the proposal.

> **Review comment (@bwplotka):** Yea, true; then we might need some dynamic "any" struct anyway for that reason, but I wanted to focus on the user side first.
For example, when specifying a password fetched from the kubernetes provider with an id of `pass2` in namespace `ns1` for the HTTP password field, it would look like this:

```
password:
  provider: kubernetes
  namespace: ns1
  secret_id: pass2
```

> **Review comment (on lines +82 to +84):** What's a secret ID? Maybe better to make it obvious what we would practically see, e.g. […]. Suggested change: […].
### Inline secrets

When specifying secrets inline in the config file, a string can be passed in as usual, for backwards compatibility.

```
password: my_important_secret
```
### Error handling

We can classify all errors stemming from secret provider failures into two cases.

The first case is that we have not gotten any secret value since startup. In this case we should not initialize any component that mentions this faulty secret, log the failure, and schedule a retry to get the secret value and initialize the component.

> **Review comment:** In case we never get a response, should we fail Prometheus's start-up or start with a missing secret? Or even while we're in the process of retrying, should we delay Prometheus's start-up or start things anyway?

The second case is that we already have a secret value, but refreshing it has resulted in an error. In this case we should keep the component that uses this secret running with the potentially stale secret, and schedule a retry.
### Secret rotation | ||||||||||||||||
|
||||||||||||||||
When a secret is rotated, it is possible for this to happen out of sync. For example, one of the following could happen with secrets used for target scraping: | ||||||||||||||||
|
||||||||||||||||
1. The scrape target's secret updated but the currently stored secret hasn't updated yet. | ||||||||||||||||
2. The secret provider's secret changes but the scrape target hasn't updated to the new secret yet. | ||||||||||||||||
|
||||||||||||||||
Therefore to help alleviate 1. we can ignore caching and query the secret provider again if permission errors are reported by the component. | ||||||||||||||||
|
||||||||||||||||
To alleviate 2. we can store the previous secret value before rotation for a period of time and use it as a fallback in case of permission errors. | ||||||||||||||||
|
||||||||||||||||
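The case-2 mitigation amounts to keeping the pre-rotation value around as a fallback. A minimal sketch, with illustrative names (the real implementation would also expire the fallback after the retention period):

```go
package main

import "fmt"

// rotatingSecret keeps the current value plus the previous one as a
// fallback, per the rotation mitigation described above.
type rotatingSecret struct {
	current, previous string
}

// Update records a freshly fetched value, retaining the old one as fallback.
func (r *rotatingSecret) Update(v string) {
	if v != r.current {
		r.previous, r.current = r.current, v
	}
}

// Candidates returns the values to try when authenticating: the current
// value first, then the pre-rotation fallback (case 2). On permission
// errors with all candidates, the caller would re-query the provider (case 1).
func (r *rotatingSecret) Candidates() []string {
	if r.previous == "" || r.previous == r.current {
		return []string{r.current}
	}
	return []string{r.current, r.previous}
}

func main() {
	var s rotatingSecret
	s.Update("token-v1")
	s.Update("token-v2")        // rotation happened at the provider
	fmt.Println(s.Candidates()) // [token-v2 token-v1]
}
```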
### Metrics

Metrics should be present to help users identify the errors mentioned above (in addition to logs). Therefore the following metrics should be reported per secret present in the config file.

The time since the last successful secret fetch:

```
prometheus_remote_secret_last_successful_fetch_seconds{provider="kubernetes", secret_id="pass1"} 15
```

A state enum describing which error condition the secret is in:
* error: there has been no successful request and no secret has been retrieved
* stale: a request has succeeded previously, but the latest request failed
* none: the last request was successful

> **Review comment:** What do you think about "success" to represent success? Was there any particular reason to call it none?

```
# HELP prometheus_remote_secret_state Describes the current state of a remotely fetched secret.
# TYPE prometheus_remote_secret_state gauge
prometheus_remote_secret_state{provider="kubernetes", secret_id="pass1", state="none"} 0
prometheus_remote_secret_state{provider="kubernetes", secret_id="auth_token", state="stale"} 1
prometheus_remote_secret_state{provider="kubernetes", secret_id="pass2", state="error"} 2
```
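In a real implementation the state gauge would likely be a client_golang `GaugeVec`; as a dependency-free illustration, the mapping from the state enum to the gauge values shown above is:

```go
package main

import "fmt"

// secretStateValue maps the proposal's state enum to the gauge value
// used in the example metrics (none=0, stale=1, error=2).
func secretStateValue(state string) (float64, error) {
	switch state {
	case "none":
		return 0, nil
	case "stale":
		return 1, nil
	case "error":
		return 2, nil
	}
	return 0, fmt.Errorf("unknown secret state %q", state)
}

func main() {
	for _, s := range []string{"none", "stale", "error"} {
		v, _ := secretStateValue(s)
		fmt.Printf("prometheus_remote_secret_state{state=%q} %g\n", s, v)
	}
}
```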
### Nested secrets

Secret providers might require secrets to be configured themselves. We will allow secrets to be passed in to secret providers.

> **Review comment:** IIUC, in the Prometheus-on-Kubernetes setup, if the secret provider requires auth (which seems likely; I wouldn't see the point of not having auth, it'd go against the goals of the proposal), then Prometheus will need to store the credentials for talking to the secret provider using the same mechanism (Kubernetes Secrets) that it currently uses to store the credentials it retrieves, thus prone to the same "vulnerabilities". Am I missing something?

> **Review comment (@bwplotka):** Nice, thanks for double checking this. This proposal is not about vulnerabilities (although you will have more choices on what providers to choose, based on security levels), but about allowing custom providers like we allow custom service discoveries. The main outcome of the "recursive/nested" approach here is that a secret provider's secret can't use the same secret provider, and cyclic connections are not allowed. Eventually a secret provider has to use file (recommended) or inline (with environment variable eval, for example) or any other that does not require secrets. To avoid a tricky implementation we could immediately limit the design to exactly that: a provider's secret can only use inline or file based. But it might be cleaner not to limit the choice here (e.g. there might be other providers that do not require a secret, e.g. Kubernetes secrets).

> **Review comment:** Thanks for the details. I was confused by […]. In order to avoid falling back into the same challenges we have with the SD mechanisms, maybe we should be more explicit about ownership/maintenance, conditions for adding a new provider, etc.

> **Review comment (@hsmatulisgoogle):** Thanks for the detailed comments! To clarify a little bit what is meant when referring to the security issues: the main use case for our team in proposing this relates to the Kubernetes environment, where secrets stored on disk can be more easily exposed than those in memory. This chicken-and-egg issue wouldn't apply to Kubernetes specifically, since the container Prometheus is running on will naturally have access to the Kubernetes secret API as configured by Kubernetes. This is also common for some cloud providers, where there is an identity associated with the machine you are running on that will naturally give you access to the related secret APIs (assuming you have configured it with the correct access rights). I feel like only having a generic spec would be too restrictive and would slow down adoption, as I imagine it requiring running sidecars in certain environments. Additionally, designing and maintaining such a spec can be quite burdensome, as there are many secret providers, each with their own nuances. However, in theory a generic spec could also be added as a secret provider. Yes, I understand the concern about avoiding the issues with ownership/maintenance. I believe we could take a multi-pronged approach like the following: […]. I am not sure what processes we could use to draw the line between […].

> **Review comment (@bwplotka):** Yea, we have the same challenges as with the SD mechanism. I don't think it's a blocker to this, though, as the same solution and path for SD could be used for SP (secret providers). I think it's fine to delegate this and solve the problem together with SD, not trying to come up with unique solutions for SP that won't fit SD later on. However, in some way we are trying something new because we want to potentially maintain those in a different repo (e.g. so AM doesn't need to import Prometheus), but it's TBC and doesn't need to be part of this proposal decision perhaps. Good idea, but we tried that in the past and it's a bit too troublesome for users in practice: too much overhead and not as powerful as the OOTB experience. Yea, this is what we do with SD now and we should do the same in SP: https://github.com/prometheus/prometheus/blob/main/plugins.yml

> **Review comment:** Thanks for the details. And AFAIK, those credentials live in the container's FS... Yes, I don't think the "ownership/maintenance" question is a blocker; we can also add that later, I just wanted to draw attention to it. Maybe we can mention in the proposal that we'll work on it alongside SD. I like the idea of build-time plugins as for SD in https://github.com/prometheus/prometheus/blob/main/plugins.yml. A last one: I see a lot of focus on the kubernetes secret provider (which would be a good candidate for an OOTB/enabled-by-default plugin). Do you know what the other providers could be? Could we list some of them? It'd help with thinking about the API and the implementation.

```
...
password:
  provider: bootstrapped
  secret_id: pass1
  auth_token:
    provider: kubernetes
    secret_id: auth_token
```

However, an initial implementation might only allow inline secrets for secret providers. This might limit the usefulness of certain providers that require sensitive information for their own configuration.
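The review thread notes that nested provider configs must not form cycles: every chain has to bottom out in a provider that needs no secrets (file, inline, or similar). A stdlib-only sketch of that validation, with a hypothetical dependency map standing in for the parsed config:

```go
package main

import "fmt"

// hasCycle reports whether resolving the given provider's own secrets
// would loop. deps maps a provider to the providers its config's secrets
// come from (a hypothetical structure for illustration).
func hasCycle(deps map[string][]string, start string) bool {
	const (
		visiting = 1
		done     = 2
	)
	state := map[string]int{}
	var visit func(p string) bool
	visit = func(p string) bool {
		switch state[p] {
		case visiting:
			return true // back-edge found: cycle
		case done:
			return false
		}
		state[p] = visiting
		for _, d := range deps[p] {
			if visit(d) {
				return true
			}
		}
		state[p] = done
		return false
	}
	return visit(start)
}

func main() {
	deps := map[string][]string{
		"bootstrapped": {"kubernetes"}, // its auth_token comes from kubernetes
		"kubernetes":   {},             // needs no secret of its own
	}
	fmt.Println(hasCycle(deps, "bootstrapped")) // false

	deps["kubernetes"] = []string{"bootstrapped"} // mutual dependency
	fmt.Println(hasCycle(deps, "bootstrapped"))   // true
}
```

Rejecting such cycles at config-load time would give the strict, fail-early behavior the rest of the configuration already follows.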
### Where will the code live

Both the Alertmanager and Prometheus repos will be able to use secret providers. The code will eventually live in a separate repository specifically created for it.

> **Review comment:** Could this reside in https://github.com/prometheus/common?

> **Review comment:** That would be one option, but only if it's a separate Go module. I think it's fine to skip this detail in the proposal and just decide later.
## Action Plan

* [ ] Create action plan after doc is stable!

> **Review comment:** "where any pod on a node can access files mounted on that node" seems like a "[citation needed]". It's possible, with a `privileged` pod or `hostPath` for example, but you should configure Kubernetes to not allow that with something like the restricted (or even baseline) Pod Security Standard. If the argument is instead that because pods share a node, an escape from a pod means the files are accessible, then at that point the security of the container runtime or Kubernetes has been compromised and it is possible that the memory of the process would also be accessible. If that's the concern, then this proposal should also cover whether there are mitigations for that (such as setting the process non-dumpable, etc.), but I don't think that pitfall is actually something this proposal is intending to address, so either just don't mention it, or do mention it and declare it out of scope. Overall the proposal makes perfect sense, but this pitfall doesn't seem that useful to include; it just doesn't sell the proposal. Should effort instead be spent on securing whichever Kubernetes setup ends up not being secure?

> **Review comment:** Yes, related/same observation I had here, #47 (comment): even with this proposal we'll continue passing (meta) secrets inline or via file. I don't think the proposal should claim that the changes will improve on security.