-
Notifications
You must be signed in to change notification settings - Fork 28
Fix two error handling errors causing existing resources to be removed #246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
We encountered an issue where the database was unavailable while running the Terraform Zitadel provider with a `project_role` resource. After some debugging we found that Zitadel itself returned an error (FATAL: the database system is shutting down (SQLSTATE 57P03)), but the Zitadel provider ignored this error and removed the project role from the state. This, of course, caused issues when we tried to execute `terraform apply` again when the database was restarted. As now the Zitadel provider thought the roles didn't exist yet and tried to create them resulting in like: ``` Error: failed to create project role: rpc error: code = AlreadyExists desc = Role already exists (V3-DKcYh) ``` By returning any received errors this problem should no longer occur. Note that resetting/clearning the ID is still done at the end of this function, unless there is exactly 1 result. So functionally that didn't change.
We encountered an issue where the database was unavailable while running the Terraform Zitadel provider with an `action` resource. After some debugging we found that Zitadel itself returned an error, but it looks like the middleware (potentially as obfuscation strategy) converts the error into a `NotFound` error: https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111 So considering this behavior in the Zitadel API it seem to be a bug to depend on the error code to determine if the resource should be removed from the state. In our case it caused the resource to be removed while it actually did exist. So trying to run another `terraform apply` when the database was running again, caused errors like: ``` Error: failed to create action: rpc error: code = AlreadyExists desc = Errors.Action.AlreadyExists (V3-DKcYh) ``` Note that resetting/clearning the ID is still done at the end of this function, unless there is exactly 1 result. So functionally that didn't change.
472e85b to
b03c642
Compare
|
Any update on this one? Anything I can do to move this forward? |
|
@eliobischof can you pls have a look? |
|
Zitadel can return an error "NotFound" if a resource is not existing, correct behavior, which is in our case with terraform not an error but just a state which needs to be handled properly. What you now would have is that if a "NotFound" error is returned, the Terraform provider would not register it as a missing resource, but as a general error with the request. |
|
Hi @stebenz,
Returning an error "NotFound" if the resource actually doesn't exists is of course correct behavior. But if it's possible that other types of errors are also returned as a "NotFound" error, it's no longer a good idea to wipe the resource from the Terraform state based on the received "NotFound" error. As I mentioned, in our case the DB wasn't available for a brief moment and instead of returning an error indicating that the API failed to fetch the resource, it returned a "NotFound" error which then caused the resource to be removed from the Terraform state while it actually still existed within Zitadel. See this function in the middleware which seems to return a "NotFound" error regardless of the actual error that occurred: https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111
I guess not... But if the middleware is using "NotFound" errors as a kind of obfuscation mechanism to hide the actual error, I don't see how we can use the error to determine if something actually doesn't exist anymore (and so can safely be removed from the Terraform state). But looking at the code... In both places where I now removed the check to determine if its an "NotFound" error or not, we actually do a LIST call and not a GET of a specific resource (and getting a "NotFound" error for a list call feels a bit strange IMHO anyway). But next to that, in the code directly below where I removed the error checks there is logic to see if the resource in question is returned by the list call and if not the resource is removed from the Terraform state. So in my opinion this is correct and safe behavior. If the LIST call fails for whatever reason we just return the error. If the LIST call succeeds, but the resource is not present in the response we remove the resource from the Terraform state. |
|
Any updates @stebenz? |
|
Hi @stebenz, anything I can do to move this forward? |
|
Heya @svanharmelen sorry for the delay, Stefan is currently holidaying for 2 weeks. @eliobischof can you pls have a look? |
|
@stebenz heya, can you pls have a look at this again |
|
Hi @svanharmelen |
|
While I appreciate your point, I don't think (hope) it should be a blocker for this PR. This PR fixes the cases I came across so it's at least a step in the right direction. Going through the full provider seems like a much bigger task which can (should?) maybe be picked up your team? While I would like to, I don't have the time and resources to take on that task. Yet I would love to be able to stop using our fork and start using the official provider again. So why not move this one forward (merge it) and put the task to go through the whole provider on Zitadels backlog? |
|
@stebenz any comments/thoughts on my question from yesterday? |
|
@stebenz any updates? |
|
The year is almost over, would be nice to finally get some movement on this one @stebenz |
|
The year is almost over, would be nice to finally get some movement on this one @mridang |
|
@svanharmelen I will have a look at this. The pattern for Unfortunately, this may turn out to be a complex change and may require time. I'll do my best to see this through soon but I don't have an ETA unfortunately. |
We encountered two issues where the database was unavailable while running
the Terraform Zitadel provider with an
actionorproject_roleresource.action resource
After some debugging we found that Zitadel itself returned an error, but
it looks like the middleware (potentially as obfuscation strategy)
converts the error into a
NotFounderror:https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111
So considering this behavior in the Zitadel API it seem to be a bug to
depend on the error code to determine if the resource should be removed
from the state.
In our case it caused the resource to be removed while it actually did
exist. So trying to run another
terraform applywhen the database wasrunning again, caused errors like:
project_role resource
After some debugging we found that Zitadel itself returned an error
(FATAL: the database system is shutting down (SQLSTATE 57P03)), but the
Zitadel provider ignored this error and removed the project role from
the state.
This, of course, caused issues when we tried to execute
terraform applyagain when the database was restarted. As now the Zitadelprovider thought the roles didn't exist yet and tried to create them
resulting in like: