Description
Bug Report
This is likely not a JOSDK bug, but after an offline discussion with @csviri I am opening it here for tracking.
In our current setup the operator is deployed in namespace x and is watching namespace y. The access to namespace y is controlled by roles and rolebindings (created in namespace y).
If there are CRs present in y and the namespace is deleted before the CRs are individually deleted, we get the following exception during cleanup:
ERROR][flink/basic-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-example', namespace='flink'}, version: 1791281} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:546)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:566)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:369)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:712)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:172)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:177)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:88)
at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:39)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.updateResource(ReconciliationDispatcher.java:387)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.conflictRetryingUpdate(ReconciliationDispatcher.java:343)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:297)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:87)
at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:701)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:681)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:628)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:591)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$5(StandardHttpClient.java:120)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:135)
... 3 more
Furthermore, the namespace deletion gets stuck because the finalizer on the CR is never removed. The root problem seems to be that when the namespace deletion is initiated, the role and rolebinding are deleted immediately, so the operator no longer has permission to update the resource and cannot remove the finalizer.
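To make the suspected ordering problem concrete, here is a minimal toy model in plain Java. It is not Kubernetes or JOSDK code: the class `NamespaceDeletionModel`, its methods, and the finalizer name `example.org/finalizer` are all hypothetical and exist only for illustration. It assumes, as described above, that the rolebinding disappears before the operator gets a chance to run its cleanup.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the deletion ordering; all names here are hypothetical. */
final class NamespaceDeletionModel {

    /** Hypothetical finalizer name, not the one used by any real operator. */
    static final String FINALIZER = "example.org/finalizer";

    /** Stand-in for a namespace holding one CR and the operator's rolebinding. */
    static final class Namespace {
        boolean roleBindingPresent = true;
        boolean crPresent = true;
        final Set<String> crFinalizers = new HashSet<>();
    }

    /**
     * Stand-in for the operator's cleanup step. Returns true if the finalizer
     * was removed, false if the update was rejected (the 403 in the report).
     */
    static boolean operatorCleanup(Namespace ns) {
        if (!ns.roleBindingPresent) {
            return false; // no permission left: the update is forbidden
        }
        ns.crFinalizers.remove(FINALIZER);
        return true;
    }

    /**
     * Simulates deleting the namespace. Returns true if the namespace can be
     * removed completely, false if it stays stuck on the CR's finalizer.
     */
    static boolean deleteNamespace(boolean rbacRemovedFirst) {
        Namespace ns = new Namespace();
        ns.crFinalizers.add(FINALIZER);

        if (rbacRemovedFirst) {
            ns.roleBindingPresent = false; // assumption: RBAC objects go first
        }
        operatorCleanup(ns);

        // A resource can only be removed once it has no finalizers left.
        if (ns.crFinalizers.isEmpty()) {
            ns.crPresent = false;
        }
        ns.roleBindingPresent = false; // RBAC objects are gone by the end either way
        return !ns.crPresent;
    }

    public static void main(String[] args) {
        System.out.println("RBAC removed first, namespace deleted: " + deleteNamespace(true));
        System.out.println("CR cleaned up first, namespace deleted: " + deleteNamespace(false));
    }
}
```

In this model the namespace can only be removed when the operator's permission outlives the CR, which matches the observation that deleting the CRs individually before the namespace avoids the problem.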
Environment
Kubernetes cluster type: kind
JOSDK version: 4.3.0
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-20T03:36:50Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/arm64"}