@svanharmelen svanharmelen commented May 28, 2025

We encountered two issues where the database was unavailable while running
the Terraform Zitadel provider with an action or project_role resource.

action resource

After some debugging we found that Zitadel itself returned an error, but
it looks like the middleware (potentially as an obfuscation strategy)
converts the error into a NotFound error:
https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111

So considering this behavior in the Zitadel API it seems to be a bug to
depend on the error code to determine if the resource should be removed
from the state.

In our case it caused the resource to be removed while it actually did
exist. So running another terraform apply once the database was
running again caused errors like:

Error: failed to create action: rpc error: code = AlreadyExists desc =
Errors.Action.AlreadyExists (V3-DKcYh)

project_role resource

After some debugging we found that Zitadel itself returned an error
(FATAL: the database system is shutting down (SQLSTATE 57P03)), but the
Zitadel provider ignored this error and removed the project role from
the state.

This, of course, caused issues when we tried to execute terraform apply
again after the database was restarted. The Zitadel provider now
thought the roles didn't exist yet and tried to create them,
resulting in errors like:

Error: failed to create project role: rpc error: code = AlreadyExists
desc = Role already exists (V3-DKcYh)

We encountered an issue where the database was unavailable while running
the Terraform Zitadel provider with a `project_role` resource.

After some debugging we found that Zitadel itself returned an error
(FATAL: the database system is shutting down (SQLSTATE 57P03)), but the
Zitadel provider ignored this error and removed the project role from
the state.

This, of course, caused issues when we tried to execute `terraform
apply` again after the database was restarted. The Zitadel provider now
thought the roles didn't exist yet and tried to create them,
resulting in errors like:

```
Error: failed to create project role: rpc error: code = AlreadyExists
desc = Role already exists (V3-DKcYh)
```

By returning any received errors this problem should no longer occur.

Note that resetting/clearing the ID is still done at the end of this
function, unless there is exactly 1 result. So functionally that didn't
change.

We encountered an issue where the database was unavailable while running
the Terraform Zitadel provider with an `action` resource.

After some debugging we found that Zitadel itself returned an error, but
it looks like the middleware (potentially as an obfuscation strategy)
converts the error into a `NotFound` error:
https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111

So considering this behavior in the Zitadel API it seems to be a bug to
depend on the error code to determine if the resource should be removed
from the state.

In our case it caused the resource to be removed while it actually did
exist. So running another `terraform apply` once the database was
running again caused errors like:

```
Error: failed to create action: rpc error: code = AlreadyExists desc =
Errors.Action.AlreadyExists (V3-DKcYh)
```

Note that resetting/clearing the ID is still done at the end of this
function, unless there is exactly 1 result. So functionally that didn't
change.
@hifabienne hifabienne moved this to 📋 Sprint Backlog in Product Management Jun 2, 2025
@muhlemmer muhlemmer requested a review from eliobischof June 5, 2025 09:33
@svanharmelen
Author

Any update on this one? Anything I can do to move this forward?

@elinashoko

@eliobischof can you pls have a look?

@elinashoko elinashoko moved this from 📋 Sprint Backlog to 🏗 In progress in Product Management Jul 10, 2025
@muhlemmer muhlemmer moved this from 🏗 In progress to 👀 In review in Product Management Jul 17, 2025
@elinashoko elinashoko moved this from 👀 In review to 📋 Sprint Backlog in Product Management Jul 31, 2025
@stebenz
Contributor

stebenz commented Aug 18, 2025

Zitadel can return an error "NotFound" if a resource is not existing, correct behavior, which is in our case with terraform not an error but just a state which needs to be handled properly.

What you now would have is that if a "NotFound" error is returned, the Terraform provider would not register it as a missing resource, but as a general error with the request.
@svanharmelen Would that be your desired outcome?

@svanharmelen
Author

svanharmelen commented Aug 27, 2025

Hi @stebenz,

> Zitadel can return an error "NotFound" if a resource is not existing, correct behavior, which is in our case with terraform not an error but just a state which needs to be handled properly.

Returning an error "NotFound" if the resource actually doesn't exist is of course correct behavior. But if it's possible that other types of errors are also returned as a "NotFound" error, it's no longer a good idea to wipe the resource from the Terraform state based on the received "NotFound" error.

As I mentioned, in our case the DB wasn't available for a brief moment and instead of returning an error indicating that the API failed to fetch the resource, it returned a "NotFound" error which then caused the resource to be removed from the Terraform state while it actually still existed within Zitadel.

See this function in the middleware which seems to return a "NotFound" error regardless of the actual error that occurred: https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111

> What you now would have is that if a "NotFound" error is returned, the Terraform provider would not register it as a missing resource, but as a general error with the request.
> @svanharmelen Would that be your desired outcome?

I guess not... But if the middleware is using "NotFound" errors as a kind of obfuscation mechanism to hide the actual error, I don't see how we can use the error to determine if something actually doesn't exist anymore (and so can safely be removed from the Terraform state).

But looking at the code... In both places where I now removed the check to determine if it's a "NotFound" error or not, we actually do a LIST call and not a GET of a specific resource (and getting a "NotFound" error for a list call feels a bit strange IMHO anyway). Next to that, the code directly below where I removed the error checks contains logic to see if the resource in question is returned by the list call, and if not, the resource is removed from the Terraform state.

So in my opinion this is correct and safe behavior. If the LIST call fails for whatever reason we just return the error. If the LIST call succeeds, but the resource is not present in the response we remove the resource from the Terraform state.
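A minimal Go sketch of the difference described above, not the provider's actual code. The names `errNotFound`, `lister`, `readOld`, and `readNew` are hypothetical stand-ins, and the "exactly 1 result" rule is taken from the commit message. It only illustrates why a masked "NotFound" error is harmful with the old check and harmless once every LIST error is returned:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a gRPC NotFound status.
var errNotFound = errors.New("not found")

// lister is a hypothetical stand-in for the provider's LIST call,
// filtered by the resource ID.
type lister func(id string) ([]string, error)

// readOld mimics the old behavior: a NotFound error is swallowed and
// the ID is cleared, which drops the resource from the Terraform state.
func readOld(id string, list lister) (string, error) {
	results, err := list(id)
	if errors.Is(err, errNotFound) {
		return "", nil
	}
	if err != nil {
		return id, err
	}
	return keepIfSingle(id, results), nil
}

// readNew mimics the behavior proposed in this PR: every LIST error is
// returned, and the ID is only cleared after a successful LIST call.
func readNew(id string, list lister) (string, error) {
	results, err := list(id)
	if err != nil {
		return id, fmt.Errorf("failed to list resources: %w", err)
	}
	return keepIfSingle(id, results), nil
}

// keepIfSingle keeps the ID only when there is exactly 1 result.
func keepIfSingle(id string, results []string) string {
	if len(results) == 1 {
		return id
	}
	return ""
}

func main() {
	// Simulates a database outage that the API reports as NotFound.
	outage := func(string) ([]string, error) { return nil, errNotFound }

	id, err := readOld("role-1", outage)
	fmt.Printf("old: id=%q err=%v\n", id, err)

	id, err = readNew("role-1", outage)
	fmt.Printf("new: id=%q err=%v\n", id, err)
}
```

With the simulated outage, the old variant silently clears the ID while the new variant keeps it and surfaces the error, so the next `terraform apply` does not try to recreate an existing resource.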

@svanharmelen
Author

Any updates @stebenz?

@svanharmelen
Author

Hi @stebenz, anything I can do to move this forward?

@elinashoko

Heya @svanharmelen sorry for the delay, Stefan is currently holidaying for 2 weeks. @eliobischof can you pls have a look?

@elinashoko elinashoko requested review from stebenz and removed request for eliobischof October 23, 2025 15:56
@elinashoko

@stebenz heya, can you pls have a look at this again

@stebenz
Contributor

stebenz commented Nov 6, 2025

Hi @svanharmelen
I think we would need to extend this PR beyond the resources you were using, so we would need to check whether we can remove all related error checks in the provider.

@svanharmelen
Author

While I appreciate your point, I don't think (hope) it should be a blocker for this PR. This PR fixes the cases I came across so it's at least a step in the right direction. Going through the full provider seems like a much bigger task which can (should?) maybe be picked up by your team?

While I would like to, I don't have the time and resources to take on that task. Yet I would love to be able to stop using our fork and start using the official provider again. So why not move this one forward (merge it) and put the task to go through the whole provider on Zitadel's backlog?

@svanharmelen
Author

@stebenz any comments/thoughts on my question from yesterday?

@svanharmelen
Author

@stebenz any updates?

@svanharmelen
Author

The year is almost over, would be nice to finally get some movement on this one @stebenz

@IAM-marco IAM-marco requested review from mridang and removed request for stebenz December 16, 2025 08:43
@svanharmelen
Author

The year is almost over, would be nice to finally get some movement on this one @mridang

@mridang mridang self-assigned this Jan 2, 2026
@mridang
Collaborator

mridang commented Jan 2, 2026

@svanharmelen I will have a look at this. The pattern for IgnoreIfNotFoundError seems to be rather inconsistent across the codebase and we would prefer to make a cross-cutting fix. Another issue is that the API shouldn't be converting all errors to 404. If this truly is the case, then I think we will need to make a clean fix on the Zitadel API. There was a similar PR in the zitadel-go repo as well recently.

Unfortunately, this may turn out to be a complex change and may require time. I'll do my best to see this through soon but I don't have an ETA unfortunately.

Projects

Status: 📋 Sprint Backlog
