Fix two error handling errors causing existing resources to be removed #246

svanharmelen · 2025-05-28T11:58:56Z

We encountered two issues where the database was unavailable while running
the Terraform Zitadel provider with an action or project_role resource.

action resource

After some debugging we found that Zitadel itself returned an error, but
it looks like the middleware (potentially as an obfuscation strategy)
converts the error into a NotFound error:
https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111

So considering this behavior in the Zitadel API, it seems to be a bug to
depend on the error code to determine whether the resource should be
removed from the state.

In our case it caused the resource to be removed from the state while it
actually still existed. Running another terraform apply once the database
was up again therefore caused errors like:

Error: failed to create action: rpc error: code = AlreadyExists desc =
Errors.Action.AlreadyExists (V3-DKcYh)
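
The failure mode described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the provider's or Zitadel's actual code: `ErrNotFound`, `ErrDatabaseDown`, `maskError`, `listActions` and `readIgnoringNotFound` are all hypothetical names, and a plain Go error stands in for the gRPC NotFound status. It assumes the middleware replaces any underlying failure with NotFound, as the linked interceptor appears to do.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for a gRPC NotFound status.
var ErrNotFound = errors.New("not found")

// ErrDatabaseDown is a hypothetical stand-in for the real backend failure.
var ErrDatabaseDown = errors.New("the database system is shutting down")

// maskError mimics (as an assumption) a middleware that hides the cause of
// any failure behind NotFound.
func maskError(err error) error {
	if err == nil {
		return nil
	}
	return ErrNotFound
}

// listActions is a hypothetical API call. When the database is down, the
// caller only ever sees the masked error.
func listActions(dbUp bool) ([]string, error) {
	if !dbUp {
		return nil, maskError(ErrDatabaseDown)
	}
	return []string{"action-1"}, nil
}

// readIgnoringNotFound treats NotFound as "the resource is gone". It returns
// the ID to keep in the state; an empty ID means the resource is dropped.
func readIgnoringNotFound(id string, dbUp bool) (string, error) {
	_, err := listActions(dbUp)
	if errors.Is(err, ErrNotFound) {
		return "", nil // state entry is removed although the action still exists
	}
	if err != nil {
		return id, err
	}
	return id, nil
}

func main() {
	id, err := readIgnoringNotFound("action-1", false)
	fmt.Printf("id=%q err=%v\n", id, err) // prints: id="" err=<nil>
}
```

With the database down, the read reports success and an empty ID, so the next apply tries to create an action that already exists.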

project_role resource

After some debugging we found that Zitadel itself returned an error
(FATAL: the database system is shutting down (SQLSTATE 57P03)), but the
Zitadel provider ignored this error and removed the project role from
the state.

This, of course, caused issues when we tried to execute terraform apply
again after the database was restarted: the Zitadel provider now thought
the roles didn't exist yet and tried to create them, resulting in errors
like:

Error: failed to create project role: rpc error: code = AlreadyExists
desc = Role already exists (V3-DKcYh)

By returning any received errors this problem should no longer occur.
Note that resetting/clearing the ID is still done at the end of the
function, unless there is exactly 1 result. So functionally that didn't
change.

svanharmelen · 2025-06-19T11:37:27Z

Any update on this one? Anything I can do to move this forward?

elinashoko · 2025-07-10T08:36:48Z

@eliobischof can you pls have a look?

stebenz · 2025-08-18T11:52:31Z

Zitadel can return a "NotFound" error if a resource does not exist. That is correct behavior, and in our case with Terraform it is not an error but just a state which needs to be handled properly.

With this change, if a "NotFound" error is returned, the Terraform provider would no longer register it as a missing resource, but as a general error with the request.
@svanharmelen Would that be your desired outcome?

svanharmelen · 2025-08-27T11:59:59Z

Hi @stebenz,

> Zitadel can return a "NotFound" error if a resource does not exist. That is correct behavior, and in our case with Terraform it is not an error but just a state which needs to be handled properly.

Returning a "NotFound" error if the resource actually doesn't exist is of course correct behavior. But if it's possible that other types of errors are also returned as a "NotFound" error, it's no longer a good idea to wipe the resource from the Terraform state based on the received "NotFound" error.

As I mentioned, in our case the DB wasn't available for a brief moment and instead of returning an error indicating that the API failed to fetch the resource, it returned a "NotFound" error which then caused the resource to be removed from the Terraform state while it actually still existed within Zitadel.

See this function in the middleware which seems to return a "NotFound" error regardless of the actual error that occurred: https://github.com/zitadel/zitadel/blob/main/internal/api/grpc/server/middleware/instance_interceptor.go#L92-L111

> With this change, if a "NotFound" error is returned, the Terraform provider would no longer register it as a missing resource, but as a general error with the request.
> @svanharmelen Would that be your desired outcome?

I guess not... But if the middleware is using "NotFound" errors as a kind of obfuscation mechanism to hide the actual error, I don't see how we can use the error to determine if something actually doesn't exist anymore (and so can safely be removed from the Terraform state).

But looking at the code... In both places where I removed the check that determines whether it's a "NotFound" error, we actually do a LIST call and not a GET of a specific resource (and getting a "NotFound" error for a list call feels a bit strange IMHO anyway). Next to that, the code directly below the removed error checks contains logic to see whether the resource in question is returned by the list call, and if it is not, the resource is removed from the Terraform state.

So in my opinion this is correct and safe behavior. If the LIST call fails for whatever reason we just return the error. If the LIST call succeeds, but the resource is not present in the response we remove the resource from the Terraform state.
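
To make that concrete, here is a rough sketch of the read logic as I describe it above. It is not the provider's actual code: `Role`, `lister`, `okList`, `emptyList`, `failingList` and `readRole` are hypothetical names, and counting matches in the list response is my assumption for what "exactly 1 result" means.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is a hypothetical, trimmed-down project role.
type Role struct{ Key string }

// lister is a hypothetical stand-in for the LIST call against the API.
type lister func() ([]Role, error)

var errUnavailable = errors.New("database unavailable")

// Three hypothetical LIST outcomes used for the demonstration below.
func okList() ([]Role, error)      { return []Role{{Key: "admin"}}, nil }
func emptyList() ([]Role, error)   { return nil, nil }
func failingList() ([]Role, error) { return nil, errUnavailable }

// readRole returns the ID to keep in the Terraform state. An empty ID means
// the resource should be removed from the state.
func readRole(id string, list lister) (string, error) {
	roles, err := list()
	if err != nil {
		// Any failure, whatever its code, leaves the state untouched.
		return id, fmt.Errorf("failed to list project roles: %w", err)
	}
	matches := 0
	for _, r := range roles {
		if r.Key == id {
			matches++
		}
	}
	if matches == 1 {
		return id, nil // resource still exists, keep it
	}
	return "", nil // LIST succeeded but the resource is absent: clear the ID
}

func main() {
	for name, l := range map[string]lister{"ok": okList, "empty": emptyList, "failing": failingList} {
		id, err := readRole("admin", l)
		fmt.Printf("%s: id=%q err=%v\n", name, id, err)
	}
}
```

Only a successful LIST that does not contain the resource clears the ID. A failed LIST surfaces the error and keeps the state as it was.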

svanharmelen · 2025-09-10T10:12:31Z

Any updates @stebenz?

svanharmelen · 2025-09-23T13:13:51Z

Hi @stebenz, anything I can do to move this forward?

elinashoko · 2025-09-25T09:03:15Z

Heya @svanharmelen sorry for the delay, Stefan is currently holidaying for 2 weeks. @eliobischof can you pls have a look?

elinashoko · 2025-10-23T15:57:34Z

@stebenz heya, can you pls have a look at this again

stebenz · 2025-11-06T10:15:32Z

Hi @svanharmelen
I think we would need to extend this PR beyond the resources you were using, so we would need to check whether we can remove all related error checks in the provider.

svanharmelen · 2025-11-06T10:31:10Z

While I appreciate your point, I don't think (hope) it should be a blocker for this PR. This PR fixes the cases I came across, so it's at least a step in the right direction. Going through the full provider seems like a much bigger task which can (should?) maybe be picked up by your team?

While I would like to, I don't have the time and resources to take on that task. Yet I would love to be able to stop using our fork and start using the official provider again. So why not move this one forward (merge it) and put the task of going through the whole provider on Zitadel's backlog?

svanharmelen · 2025-11-07T08:46:08Z

@stebenz any comments/thoughts on my question from yesterday?

svanharmelen · 2025-11-20T12:19:02Z

@stebenz any updates?

svanharmelen · 2025-12-15T13:54:44Z

The year is almost over, would be nice to finally get some movement on this one @stebenz

svanharmelen · 2025-12-29T09:34:16Z

The year is almost over, would be nice to finally get some movement on this one @mridang

mridang · 2026-01-02T02:46:52Z

@svanharmelen I will have a look at this. The pattern for IgnoreIfNotFoundError seems to be rather inconsistent across the codebase and we would prefer to make a cross-cutting fix. Another issue is that the API shouldn't be converting all errors to 404. If this truly is the case, then I think we will need to make a clean fix in the Zitadel API. There was a similar PR in the zitadel-go repo recently as well.

Unfortunately, this may turn out to be a complex change and may require time. I'll do my best to see this through soon but I don't have an ETA unfortunately.

hifabienne added this to Product Management May 28, 2025

hifabienne added os-contribution resources labels May 28, 2025

svanharmelen force-pushed the fix/error-handling branch from 472e85b to b03c642 May 28, 2025 12:01

hifabienne moved this to 📋 Sprint Backlog in Product Management Jun 2, 2025

muhlemmer requested a review from eliobischof June 5, 2025 09:33

elinashoko moved this from 📋 Sprint Backlog to 🏗 In progress in Product Management Jul 10, 2025

muhlemmer moved this from 🏗 In progress to 👀 In review in Product Management Jul 17, 2025

elinashoko moved this from 👀 In review to 📋 Sprint Backlog in Product Management Jul 31, 2025

elinashoko added the waiting label Aug 27, 2025

elinashoko requested review from stebenz and removed request for eliobischof October 23, 2025 15:56

elinashoko removed the waiting label Oct 23, 2025

IAM-marco requested review from mridang and removed request for stebenz December 16, 2025 08:43

mridang self-assigned this Jan 2, 2026

5 participants
