@dipti-pai dipti-pai commented Apr 29, 2025

Depends on: fluxcd/pkg#917

Part of: fluxcd/flux2#5022

Fixes: #1047

Changes include:

  • If no authentication token is specified in the provider, attempt to get a token using workload identity.
  • Add a new field, .spec.serviceAccountName, to support multi-tenant workload identity as defined in RFC-0010, so that a provider can use an identity tied to a service account other than the notification-controller's.
  • Use the proxy to get the token if one is specified in the provider spec.
  • Cache the tokens if caching is enabled in the notification-controller options.
  • If the address contains a SAS connection string, use it for authentication. This takes priority over token authentication.
  • If a static JWT is specified in the secret reference, use it for authentication. This takes priority over a token acquired through workload identity.
  • Update RBAC for notification-controller so that it can create service account token requests.
  • Add unit tests for the 3 authentication mechanisms (SAS, JWT, managed identity).
  • Add documentation for the single-tenant and multi-tenant approaches to workload identity with the azureeventhub provider.
  • Enable the token cache by default.
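
For readers skimming the list above, the priority order between the three mechanisms can be sketched as follows. This is only an illustration, not the code in this PR: the type, the function names, and the heuristic of detecting a SAS connection string by looking for a SharedAccessKey= field are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// AuthMethod names the mechanism picked for a provider (hypothetical type).
type AuthMethod string

const (
	AuthSAS              AuthMethod = "sas"
	AuthStaticJWT        AuthMethod = "static-jwt"
	AuthWorkloadIdentity AuthMethod = "workload-identity"
)

// isSASConnectionString is an assumed heuristic: Event Hubs SAS connection
// strings carry a SharedAccessKey field.
func isSASConnectionString(address string) bool {
	return strings.Contains(address, "SharedAccessKey=")
}

// selectAuthMethod applies the priority described in the PR description:
// SAS connection string first, then a static JWT from the secret,
// and workload identity only when neither is present.
func selectAuthMethod(address, secretToken string) AuthMethod {
	switch {
	case isSASConnectionString(address):
		return AuthSAS
	case secretToken != "":
		return AuthStaticJWT
	default:
		return AuthWorkloadIdentity
	}
}

func main() {
	sas := "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=name;SharedAccessKey=key;EntityPath=hub"
	fmt.Println(selectAuthMethod(sas, "some-jwt"))           // prints "sas"
	fmt.Println(selectAuthMethod("example-hub", "some-jwt")) // prints "static-jwt"
	fmt.Println(selectAuthMethod("example-hub", ""))         // prints "workload-identity"
}
```

Keeping the decision in one small function makes the precedence easy to cover with table-driven unit tests, one case per mechanism.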

Tested the feature with notification-controller service account (single tenant) and standalone service account with proxy and token cache. Also tested with existing auth mechanisms (JWT token in secret/SAS string).

Sharing test results below:

Notification controller logs showing events being dispatched to the event hub:

{"level":"info","ts":"2025-04-29T21:20:59.268Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32119512"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:26:00.024Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32121054"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:31:00.932Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32122584"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:36:01.805Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32124121"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}
{"level":"info","ts":"2025-04-29T21:41:02.685Z","logger":"event-server","msg":"dispatching event","eventInvolvedObject":{"kind":"Kustomization","namespace":"default","name":"testconfig-kustomization-1","uid":"9917e19b-fdc9-466a-8a73-6fb819681f7f","apiVersion":"kustomize.toolkit.fluxcd.io/v1","resourceVersion":"32125656"},"message":"ConfigMap/game-demo namespace not specified: the server could not find the requested resource\n"}

Cache metrics:

curl localhost:5000/metrics | grep gotk_token_cache
# HELP gotk_token_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_token_cache_events_total counter
gotk_token_cache_events_total{event_type="cache_hit",kind="Provider",name="azure",namespace="default"} 272
gotk_token_cache_events_total{event_type="cache_miss",kind="Provider",name="azure",namespace="default"} 25
# HELP gotk_token_cache_evictions_total Total number of cache evictions.
# TYPE gotk_token_cache_evictions_total counter
gotk_token_cache_evictions_total 0
# HELP gotk_token_cache_requests_total Total number of cache requests partioned by success or failure.
# TYPE gotk_token_cache_requests_total counter
gotk_token_cache_requests_total{status="success"} 297
# HELP gotk_token_cached_items Total number of items in the cache.
# TYPE gotk_token_cached_items gauge
gotk_token_cached_items 1
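
As a quick way to read these numbers, the hit ratio can be derived from the two gotk_token_cache_events_total samples (272 hits and 25 misses, which add up to the 297 successful requests). The snippet below is a rough sketch that parses the text exposition format by hand; the function name and the parsing approach are my own and not part of the controller.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample holds the two event counters copied from the metrics output above.
const sample = `gotk_token_cache_events_total{event_type="cache_hit",kind="Provider",name="azure",namespace="default"} 272
gotk_token_cache_events_total{event_type="cache_miss",kind="Provider",name="azure",namespace="default"} 25`

// hitRatio sums the cache_hit and cache_miss samples found in the given
// metrics text and returns hits / (hits + misses).
func hitRatio(metrics string) (float64, error) {
	var hits, misses float64
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "gotk_token_cache_events_total{") {
			continue // skip HELP/TYPE comments and unrelated metrics
		}
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[idx+1:], 64)
		if err != nil {
			return 0, fmt.Errorf("parsing sample value: %w", err)
		}
		switch {
		case strings.Contains(line, `event_type="cache_hit"`):
			hits += v
		case strings.Contains(line, `event_type="cache_miss"`):
			misses += v
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if hits+misses == 0 {
		return 0, fmt.Errorf("no cache events found")
	}
	return hits / (hits + misses), nil
}

func main() {
	ratio, err := hitRatio(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("hit ratio: %.1f%%\n", ratio*100) // prints "hit ratio: 91.6%"
}
```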

Proxy logs showing a token being fetched roughly once per hour, which matches how long the token is valid:

kubectl logs proxy-server-f4c5b56db-bqwsw
2025/04/28 17:56:43 [001] INFO: Running 0 CONNECT handlers
2025/04/28 17:56:43 [001] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 18:56:54 [002] INFO: Running 0 CONNECT handlers
2025/04/28 18:56:54 [002] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 19:57:05 [003] INFO: Running 0 CONNECT handlers
2025/04/28 19:57:05 [003] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 20:55:16 [005] INFO: Running 0 CONNECT handlers
2025/04/28 20:55:16 [005] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 21:55:28 [006] INFO: Running 0 CONNECT handlers
2025/04/28 21:55:28 [006] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 22:55:40 [007] INFO: Running 0 CONNECT handlers
2025/04/28 22:55:40 [007] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/28 23:55:52 [008] INFO: Running 0 CONNECT handlers
2025/04/28 23:55:52 [008] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 00:56:02 [009] INFO: Running 0 CONNECT handlers
2025/04/29 00:56:02 [009] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 01:56:13 [010] INFO: Running 0 CONNECT handlers
2025/04/29 01:56:13 [010] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 03:00:25 [011] INFO: Running 0 CONNECT handlers
2025/04/29 03:00:25 [011] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 04:00:41 [012] INFO: Running 0 CONNECT handlers
2025/04/29 04:00:41 [012] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 05:00:53 [013] INFO: Running 0 CONNECT handlers
2025/04/29 05:00:53 [013] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 06:01:05 [014] INFO: Running 0 CONNECT handlers
2025/04/29 06:01:05 [014] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 07:05:17 [015] INFO: Running 0 CONNECT handlers
2025/04/29 07:05:17 [015] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 08:05:29 [016] INFO: Running 0 CONNECT handlers
2025/04/29 08:05:29 [016] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 09:05:41 [017] INFO: Running 0 CONNECT handlers
2025/04/29 09:05:41 [017] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 10:05:52 [018] INFO: Running 0 CONNECT handlers
2025/04/29 10:05:52 [018] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 11:06:02 [019] INFO: Running 0 CONNECT handlers
2025/04/29 11:06:02 [019] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 12:06:13 [020] INFO: Running 0 CONNECT handlers
2025/04/29 12:06:13 [020] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 13:10:25 [021] INFO: Running 0 CONNECT handlers
2025/04/29 13:10:25 [021] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 14:10:36 [022] INFO: Running 0 CONNECT handlers
2025/04/29 14:10:36 [022] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 15:10:48 [023] INFO: Running 0 CONNECT handlers
2025/04/29 15:10:48 [023] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 16:10:58 [024] INFO: Running 0 CONNECT handlers
2025/04/29 16:10:58 [024] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 17:11:10 [025] INFO: Running 0 CONNECT handlers
2025/04/29 17:11:10 [025] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 18:15:22 [026] INFO: Running 0 CONNECT handlers
2025/04/29 18:15:22 [026] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 19:15:34 [027] INFO: Running 0 CONNECT handlers
2025/04/29 19:15:34 [027] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 20:15:46 [028] INFO: Running 0 CONNECT handlers
2025/04/29 20:15:46 [028] INFO: Accepting CONNECT to login.microsoftonline.com:443
2025/04/29 21:15:57 [029] INFO: Running 0 CONNECT handlers
2025/04/29 21:15:57 [029] INFO: Accepting CONNECT to login.microsoftonline.com:443

@dipti-pai dipti-pai marked this pull request as draft April 29, 2025 22:31
@matheuscscp matheuscscp changed the title Implement managed identity support for Azure Event Hub provider [RFC-0010] Implement managed identity support for Azure Event Hub provider Apr 30, 2025
@stefanprodan stefanprodan added the area/alerting Alerting related issues and PRs label Apr 30, 2025
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch from 2bbc7be to dc2fd41 Compare April 30, 2025 19:21
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch 2 times, most recently from d30e98a to 54ceab7 Compare May 2, 2025 21:52
@dipti-pai dipti-pai marked this pull request as ready for review May 2, 2025 22:04
@matheuscscp matheuscscp left a comment

Almost there 👌

I'd like to ask you to please investigate something. Here we are introducing a new operation performed by notification-controller against the Kubernetes API: we are now calling the TokenRequest API to issue a Kubernetes ServiceAccount token for the cloud provider STS exchange. This requires the create verb for the (sub)resource serviceaccounts/token, granted through a ClusterRole bound by a ClusterRoleBinding (i.e. for all namespaces). Please check what RBAC permissions notification-controller has in order to be able to perform this operation. I suspect the obvious: it has cluster-admin, like kustomize-controller.

@dipti-pai

> Here we are introducing a new operation performed by notification-controller in the Kubernetes API, we are now calling the TokenRequest API to issue a Kubernetes ServiceAccount token for the cloud provider STS exchange. This requires the create verb for the (sub)resource serviceaccounts/token in a ClusterRoleBinding (i.e. for all namespaces). Please check what RBAC permissions notification-controller has in order to be able to perform this operation. I suspect the obvious, it has cluster-admin like kustomize-controller

I meant to include a comment about this. In my install, I had to add a new rule to notification-controller's ClusterRole for this permission. I used the ARC extension for Flux to test this end-to-end and had to extend the permissions there. To work with Flux bootstrap, do we need to extend the RBAC permissions somewhere, perhaps here? Thanks.

- apiGroups:
  - ""
  resources:
  - serviceaccounts/token
  verbs:
  - create

@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch from 54ceab7 to 6d514d6 Compare May 2, 2025 23:20
matheuscscp commented May 2, 2025

Yes, that looks like the right place, but I don't see how it binds to the notification-controller ServiceAccount 🤔 I think it's through the ClusterRoleBinding here, with a bit of magic, but I'm not sure. @stefanprodan, can you please confirm?

Edit: I believe config/rbac/role.yaml is indeed the right place but we need to use controller-gen to add it, just add this line in the provider controller and run make manifests:

diff --git a/internal/controller/provider_controller.go b/internal/controller/provider_controller.go
index 1f7d0f9..5bca247 100644
--- a/internal/controller/provider_controller.go
+++ b/internal/controller/provider_controller.go
@@ -35,6 +35,7 @@ import (
 // +kubebuilder:rbac:groups=notification.toolkit.fluxcd.io,resources=providers,verbs=get;list;watch;create;update;patch;delete
 // +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch
 // +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
+// +kubebuilder:rbac:groups="",resources=serviceaccounts/token,verbs=create
 
 // ProviderReconciler reconciles a Provider object to migrate it to static
 // Provider.

Edit 2: Actually, I don't see the rules from role.yaml in the output of flux install --export. @stefanprodan, how does this work?

Edit 3: Looking at the output of flux install --export there are only two ClusterRoleBinding objects: one that gives cluster-admin to kustomize-controller and helm-controller, and one that gives crd-controller-flux-system to all controllers. I opened a PR to add the required permission here, but for consistency I think we should also add the +kubebuilder directive above and run make manifests anyway.

matheuscscp commented May 3, 2025

Hi Dipti 👋

We have finally released fluxcd/pkg/[email protected] and fluxcd/pkg/[email protected]:

https://github.com/fluxcd/kustomize-controller/compare/c413d479c373425ed46ea8704cf38b0afd42c066..f4c2d12eb3e3ea6986257f6f03b59b540a3baf7e

Please update this PR accordingly 🙏

Please also enable the token cache by default, see this comment from Stefan:

fluxcd/kustomize-controller#1426 (comment)

- If authentication token is not specified in provider, attempt to get the token using workload identity.
- Add new field .spec.serviceAccountName to support multi-tenant workload identity as defined in RFC-0010 to use an identity with a service account other than the notification-controller.
- Use proxy to get the token if specified in provider spec.
- Cache the tokens if enabled in the notification controller options.
- If address has SAS connection string, use that for authentication, this takes priority over token-authentication
- If static JWT token is specified in the secret reference, use it for authentication, this takes priority over workload identity-acquired token.
- Update RBAC for notification-controller to be able to create service token requests.
- Add unit tests for the 3 authentication mechanisms (SAS, JWT, managed identity).
- Add documentation for using single-tenant and multi-tenant approaches of workload identity with azureeventhub provider.
- Add operation post to github helpers and provider controller for cache event metrics
- Enable token cache by default.

Signed-off-by: Dipti Pai <[email protected]>

review comments

Signed-off-by: Dipti Pai <[email protected]>

enable cache by default

Signed-off-by: Dipti Pai <[email protected]>
@dipti-pai dipti-pai force-pushed the azeventhub-mi-support branch from 6d514d6 to 0beb3d0 Compare May 5, 2025 19:30
@matheuscscp matheuscscp left a comment

Thanks @dipti-pai ❤️

@matheuscscp matheuscscp merged commit f5ddc97 into fluxcd:main May 5, 2025
5 checks passed
Development

Successfully merging this pull request may close these issues.

Add managed identity support of Azure Event Hub provider in notification-controller

3 participants