Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4377: Secured Prometheus metrics endpoints #4404

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kannon92
Copy link
Contributor

@kannon92 kannon92 commented Feb 25, 2025

What type of PR is this?

/kind documentation

What this PR does / why we need it:

KEP for securing metrics with TLS.

Which issue(s) this PR fixes:

KEP for #4377

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 25, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 25, 2025
Copy link

netlify bot commented Feb 25, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 570e57b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67bfd260b89d4c0008307f68

@kannon92
Copy link
Contributor Author

/cc @tenzen-y @mimowo

@mimowo
Copy link
Contributor

mimowo commented Feb 26, 2025

/assign @gabesaba
for the first pass

@gabesaba
Copy link
Contributor

/unassign

As I won't be able to take a look until next week

@mimowo
Copy link
Contributor

mimowo commented Feb 26, 2025

/assign @PBundyra
Can you make the first pass here?

@PBundyra
Copy link
Contributor

/assign @PBundyra Can you make the first pass here?

Sure, on it

@PBundyra
Copy link
Contributor

/lgtm
Thanks @kannon92!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 72f730f0bc8f1f6ebf62e4c0c154b538fc6770f0


### Controller Changes

To keep configurations minimal, we will require that the certificates be mounted to "/tmp/k8s-metrics-server/metrics-certs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why tmp? Mounts rarely happen in that place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly took the inspiration from the path we use for internal cert rotations.

https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/cert/cert.go#L31

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the internal certs rotations are temp and localized in principle. The external certs are expected to be "permanent", so indeed the "tmp" folder name is a bit misleading. Wondering if there are some best practices for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see anything in cert docs.

We could put in /etc ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed an update to put in /etc rather than /tmp. I think that is a better location for these certs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/etc/kueue/certs sgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/etc/metrics/certs
or
/etc/kueue/metrics/certs?

We should not combine all the certificates into the same directory.

Copy link
Contributor

@mimowo mimowo Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/etc/metrics/certs or /etc/kueue/metrics/certs?

Probably /etc/kueue/metrics/certs, because in /etc the files are typically namespaced by the app name. Sure, we have only single app here, so everything works, but it feels more natural. OTOH, this is not a strong opinion.

We should not combine all the certificates into the same directory.

Sure, if users want to use the same certs for webhooks and metrics they would need to create soft link between the two folders or just copy them to both places.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@kannon92 kannon92 requested a review from mwielgus February 26, 2025 15:53
// False means that we will allow access to metrics to whoever has access to the ServiceAccount
// Default will be false
// +optional
UseTLS bool `json:"useTLS,omitempty"`
Copy link
Contributor

@mimowo mimowo Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the API should de-couple two things. TLS could be used based on internally generated and self-signed certicicates or external certificates.

This knob seems to do two things at the same time - enable TLS at the monitor level and requires using external cert manager.

I would suggest to rename it to UseCertManager. And use TLS regardless if the certs are internally generated or external.

EDIT: alternative names coming to my mind: UseExternalCerts, or just ExternalCerts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And use TLS regardless if the certs are internally generated or external.

That would be a breaking change. This would effectively require Cert Manager for metrics. If Secure option is set and controller runtime generates certificates it doesn't not provide the ability to check TLS.

Controller runtime requires the use of an external certificate solution for TLS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently store the certs generated by OPA in /tmp/k8s-webhook-server/serving-certs.
Couldn't we use the certs for metrics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the KEP, there are 2 types of internal certs, 1 is generated by OPA, 2 is generated by controller-runtime.
Which internal do you indicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see about using internal certs but should we create a new secret for metrics?

I think reusing the same secret for multiple endpoints is not a good idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see about using internal certs but should we create a new secret for metrics?

I'm ok with either way.

I think reusing the same secret for multiple endpoints is not a good idea.

What point of view do you think it is not a good idea? Security or controller-runtime specification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why this improves security? It is not clear to me, and from the user perspective it requires a bit more overhead, and the setup is more complex.

However, since we already use different (internal) certs for webhooks and metrics, so I'm ok to keep them separate also for the external certs, but to me the argument is more about "consistency with what we currently do".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All examples seem to require creation of a certificate for webhooks and then a certificate for metrics.

If a certificate is compromised then you would effectively give access to all APIs. If you at least have separate certificates per API then it would most likely be more secure.

Copy link
Contributor

@mimowo mimowo Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you at least have separate certificates per API then it would most likely be more secure.

Yeah, OTOH, the more certificates you have the higher the chance some would leak. Getting any part of your app compromised increases the risk the other parts will fall soon.

So, I believe there are pros and cons to security for both approaches. By analogy: is it safer to have a single complex master password, or remembering 100 weak passwords or using stickers for them :)

I just imagine being an admin and managing certificates per endpoint is a lot of devops work in principle, and more points of failure, and of leakage, but I see company-wide strategies may differ.

Just would like to make sure the API is flexible to do both ways depending on user preference.

Comment on lines 126 to 136
```golang
// ControllerMetrics defines the metrics configs.
type ControllerMetrics struct {
...
// UseTLS, if true, will provide tls validation for the prometheus endpoint
// False means that we will allow access to metrics to whoever has access to the ServiceAccount
// Default will be false
// +optional
UseTLS bool `json:"useTLS,omitempty"`
}
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we do not need to expand any APIs.
So, shouldn't metrics server watch /tmp/k8s-webhook-server/serving-certs directory?
In that case, we can avoid focusing on CertManger cases. They can mount arbitrary certifications in the /tmp/k8s-webhook-server/serving-certs.

This indicates the metrics endpoints always use HTTPS regardless of certificate manager types.

cc: @mimowo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not say that we should use /tmp/k8s-webhook-server/serving-certs. I'm ok with any directory. But we can have only one certificate directory for metrics and webhooks regardless of certificate types.

Copy link
Member

@tenzen-y tenzen-y Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo Do you see any problems in your use cases when we enable HTTPS for metrics endpoints by internal-certmanager?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indicates the metrics endpoints always use HTTPS regardless of certificate manager types.

Yeah, I think we only support HTTPS / TLS, so I wouldn't call it UseTLS. My reading of your work is the ability to supply external certs (as mentioned in another comment).

I'm wondering if we do not need to expand any APIs.

Basically there are 3 things which could be configured: CertDir, CertName and KeyName : https://github.com/kubernetes-sigs/kueue/blob/main/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go#L85-L102

I can see why they are configurable at the controller-runtime level, but I don't see a need to configure them in Kueue. Having clear defaults will make the docs easier if less parametrized IMO. We can introduce always the parameters when users have good use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo Do you see any problems in your use cases when we enable HTTPS for metrics endpoints by internal-certmanager?

Nope, I thought we already do it at the endpoint level, looking at this test:

"curl -s -k -H \"Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" https://%s.%s.svc.cluster.local:8443/metrics ",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but we do not append certs for metrics as they are handled by controller runtime feature.

If we maintain certs for metrics by OPA, what happen? In that case, will metrics server ignores the certs as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make the helm configuration easier actually. Right now I think I need to add special code to append volumes in case of using external certs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make the helm configuration easier actually. Right now I think I need to add special code to append volumes in case of using external certs.

I do not think that we respect Helm codes instead of API. API should always be respected; it is used widely regardless of tools.

We are currently working on Helm Chart auto-generation instead of the current shell. I believe it could mitigate Helm generation complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding item:

  • Internal Cert Manage with webhooks, controller runtime metrics with https but servicemonitor does not validate tls.
  • Supported or not?

@mimowo Proposal:

  • Leave this one as is and keep tls no verify on for ServiceMonitor and call this out of scope for this KEP.
  • We will support this existing solution as is and not require changes here.

@tenzen-y Proposal:

  • Internal Cert Manager for webhooks and metrics, use tls verify for ServiceMonitor
  • Drop the middle state where we skip TLS in service monitor
  • Internal cert manage would generate and watch secrets for both metrics and webhooks.

We could enhance this so that we no longer rely on controller runtime to generate this certificate and we use the similar opa code to watch/rotate certificates for metrics. We would need to create a secret for that.

We could keep the code as is and assume that if someone actually cares about this, the best approach would be to use an external solution for rotating certificates. This is the recommendation for Kubebuilder that one should use an external certificate solution for production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We followed up via slack.

We will not provide an API for this but will assume that if one wants to use external certificates then they also will want to provide certificates for prometheus.

I updated the design with these findings.

@tenzen-y
Copy link
Member

/retitle KEP-4377: Secured Prometheus metrics endpoints

@k8s-ci-robot k8s-ci-robot changed the title add a kep for metrics tls KEP-4377: Secured Prometheus metrics endpoints Feb 26, 2025
@kannon92
Copy link
Contributor Author

cc @camilamacedo86

If you have time, i'd appreciate your insight best practices for metrics security.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants