
google_container_cluster unable to create default node pool when workload_identity_config is set #22168

@phardy

Description


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.11.3
on darwin_arm64
+ provider registry.terraform.io/hashicorp/google v6.27.0

Affected Resource(s)

google_container_cluster

Terraform Configuration

This is based on the example usage in the resource documentation, with some necessary changes for my test environment. It uses a small shared subnet, and node_locations and default_max_pods_per_node are tuned down to match.

data "google_compute_subnetwork" "cna_subnet" {
  name    = "gke-cluster-confluence-us-east4-05"
  project = data.google_compute_network.cna_network.project
}

data "google_project" "project" {}

resource "google_service_account" "workload_identity_test" {
  account_id   = "workload-identity-test"
  display_name = "Workload Identity Test Service Account"
}

resource "google_container_cluster" "workload_identity_test" {
  name     = "workload-identity-test"
  location = "us-east4"
  node_locations = [
    "us-east4-b",
    "us-east4-c",
  ]

  initial_node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  default_max_pods_per_node = 56

  network    = data.google_compute_network.cna_network.id
  subnetwork = data.google_compute_subnetwork.cna_subnet.id

  ip_allocation_policy {
    cluster_secondary_range_name = data.google_compute_subnetwork.cna_subnet.secondary_ip_range[0].range_name
  }

  enterprise_config {
    desired_tier = "ENTERPRISE"
  }

  workload_identity_config {
    workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
  }

  timeouts {
    create = "120m"
    update = "120m"
    read   = "120m"
  }
}
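
The configuration above also references a data.google_compute_network.cna_network data source that I haven't pasted; it's just a lookup of the shared VPC network, roughly along these lines (the network name and project here are placeholders):

data "google_compute_network" "cna_network" {
  # Placeholder values for the shared VPC network in my test environment.
  name    = "cna-shared-network"
  project = "cna-host-project"
}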

Debug Output

https://gist.github.com/phardy/d1ee5b1cdf4ae5d8ec17f2c3b5aa147b

Expected Behavior

I expect a new cluster to be created, with the default node pool still attached. I expect this to take some time; experience testing similar configurations shows at least 30 minutes to create the cluster, and longer again to add a workload pool afterwards.

Actual Behavior

The terraform apply fails after ~40 minutes, despite the 120m timeouts specified in the resource, with this error:

Error: Error waiting for creating GKE cluster: All cluster resources were brought up, but: 2 nodes out of 2 are unhealthy.

Inspecting the cluster in the Google Cloud console shows the same error. Inspecting the default node pool in the console shows all nodes reporting OK. Running a terraform plan at this point indicates the existing cluster is tainted, and proposes destroying it and creating a new cluster.

Steps to reproduce

  1. terraform apply

Important Factoids

As mentioned alongside the Terraform configuration, I'm using a small shared subnet. The primary IPv4 range for this subnet is a /28, and it has a single secondary IPv4 /24 range.
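
For context on the sizing: as I understand the GKE pod range rules, with default_max_pods_per_node = 56 each node is allocated a /25 pod range (2 x 56 = 112 addresses, rounded up to 128), so the /24 secondary range (256 addresses) fits exactly the two nodes created across the two node_locations. That matches the "2 nodes out of 2" in the error above.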

I've tested an identical configuration without the workload_identity_config block, which created successfully. I then added the workload_identity_config block, making the final config identical to what I've pasted here. Terraform successfully modifies the cluster to enable Workload Identity, although that update takes approximately 30 minutes to apply.
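
That is, creating the cluster first and then adding the following block in a second apply works, while including it in the initial create does not:

workload_identity_config {
  workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
}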

My initial attempts at creating this used a separately managed node pool. That also fails: my understanding, from reading the documentation and experimenting, is that the provider still creates the cluster with a default node pool and then deletes it, and it is this initial default node pool creation that is unsuccessful.
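
For reference, the separately managed node pool variant I first tried followed the usual pattern from the provider documentation, roughly like this (the node pool name and node_count are representative rather than my exact config):

resource "google_container_cluster" "workload_identity_test" {
  # Same settings as the configuration above, plus:
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "workload_identity_test_nodes" {
  # Placeholder node pool; name and node_count are illustrative.
  name       = "workload-identity-test-nodes"
  location   = "us-east4"
  cluster    = google_container_cluster.workload_identity_test.name
  node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}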

I've also attempted creating the cluster with the default service account (omitting service_account and oauth_scopes from the node_config block). This makes no difference to the behavior described above.

References

No response

b/409662502
