google_container_cluster unable to create default node pool when workload_identity_config is set #22168

Open
@phardy

Description

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.11.3
on darwin_arm64
+ provider registry.terraform.io/hashicorp/google v6.27.0

Affected Resource(s)

google_container_cluster

Terraform Configuration

This is based on the example usage in the resource documentation, with some necessary changes for my test environment. It uses a small shared subnet, and node_locations and default_max_pods_per_node are tuned down to match.

data "google_compute_subnetwork" "cna_subnet" {
  name    = "gke-cluster-confluence-us-east4-05"
  project = data.google_compute_network.cna_network.project
}

data "google_project" "project" {}

resource "google_service_account" "workload_identity_test" {
  account_id   = "workload-identity-test"
  display_name = "Workload Identity Test Service Account"
}

resource "google_container_cluster" "workload_identity_test" {
  name     = "workload-identity-test"
  location = "us-east4"
  node_locations = [
    "us-east4-b",
    "us-east4-c",
  ]

  initial_node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  default_max_pods_per_node = 56

  network    = data.google_compute_network.cna_network.id
  subnetwork = data.google_compute_subnetwork.cna_subnet.id

  ip_allocation_policy {
    cluster_secondary_range_name = data.google_compute_subnetwork.cna_subnet.secondary_ip_range[0].range_name
  }

  enterprise_config {
    desired_tier = "ENTERPRISE"
  }

  workload_identity_config {
    workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
  }

  timeouts {
    create = "120m"
    update = "120m"
    read   = "120m"
  }
}

Debug Output

https://gist.github.com/phardy/d1ee5b1cdf4ae5d8ec17f2c3b5aa147b

Expected Behavior

I expect a new cluster to be created, with the default node pool still attached. I expect this to take some time: experience testing similar configurations shows cluster creation takes at least 30 minutes, with additional time to add a workload identity pool afterwards.

Actual Behavior

The terraform apply fails after ~40 minutes, despite the 120m timeouts specified in the resource, with this error:

Error: Error waiting for creating GKE cluster: All cluster resources were brought up, but: 2 nodes out of 2 are unhealthy.

Inspecting the cluster in the Google Cloud console shows the same error that Terraform reported. Inspecting the default node pool, however, shows all nodes reporting as OK. Running a terraform plan at this point indicates the existing cluster is tainted, and proposes destroying it and creating a new cluster.

Steps to reproduce

  1. terraform apply

Important Factoids

As mentioned in the Terraform configuration above, I'm using a small shared subnet. The primary IPv4 range for this subnet is a /28, and it has a single secondary IPv4 /24 range.
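
For context, an equivalent definition of that subnet would look roughly like the following. This is an illustrative sketch only: the host project, network name, and exact CIDRs are placeholders, not my actual shared-VPC values.

resource "google_compute_subnetwork" "cna_subnet" {
  # Illustrative sketch: project, network and CIDRs are placeholders.
  name    = "gke-cluster-confluence-us-east4-05"
  project = "shared-vpc-host-project" # placeholder host project
  region  = "us-east4"
  network = "shared-vpc-network"      # placeholder network name

  # Primary range is a /28, so only a handful of node IPs are available.
  ip_cidr_range = "10.0.0.0/28"

  # Single secondary /24 range, referenced as the cluster secondary range
  # in ip_allocation_policy above.
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.1.0.0/24"
  }
}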

I've tested an identical configuration without the workload_identity_config block, which created successfully. I then added the workload_identity_config block, making the final configuration identical to what I've pasted here. Terraform successfully modifies the cluster to enable workload identity, although this takes approximately 30 minutes to apply.

My initial attempts at creating this used a separately managed node pool; a sketch of that variant is below. However, this also fails. My understanding, from reading the documentation and experimenting, is that the provider still creates the cluster with a default node pool and then deletes it, and it is this initial default node pool creation that is unsuccessful.
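
For reference, the separately managed node pool variant looked roughly like this. It is a sketch assuming the documented remove_default_node_pool pattern; the omitted blocks are the same as in the configuration above.

resource "google_container_cluster" "workload_identity_test" {
  name     = "workload-identity-test"
  location = "us-east4"

  # Create the cluster with a minimal default pool, which the provider
  # removes once the cluster is up.
  remove_default_node_pool = true
  initial_node_count       = 1

  # ... network, subnetwork, ip_allocation_policy, enterprise_config and
  # workload_identity_config as in the configuration above ...
}

resource "google_container_node_pool" "workload_identity_test" {
  name       = "workload-identity-test"
  location   = "us-east4"
  cluster    = google_container_cluster.workload_identity_test.name
  node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}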

I've also attempted creating the cluster with the default service account (omitting service_account and oauth_scopes from the node_config block). This makes no difference to the Actual Behavior described above.

References

No response
