
google_container_cluster unable to create default node pool when workload_identity_config is set #22168

@phardy

Description


Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.11.3
on darwin_arm64
+ provider registry.terraform.io/hashicorp/google v6.27.0

Affected Resource(s)

google_container_cluster

Terraform Configuration

This is based on the example usage in the resource documentation, with some necessary changes for my test environment. It uses a small shared subnet, and node_locations and default_max_pods_per_node are tuned down to match.

data "google_compute_subnetwork" "cna_subnet" {
  name    = "gke-cluster-confluence-us-east4-05"
  project = data.google_compute_network.cna_network.project
}

data "google_project" "project" {}

resource "google_service_account" "workload_identity_test" {
  account_id   = "workload-identity-test"
  display_name = "Workload Identity Test Service Account"
}

resource "google_container_cluster" "workload_identity_test" {
  name     = "workload-identity-test"
  location = "us-east4"
  node_locations = [
    "us-east4-b",
    "us-east4-c",
  ]

  initial_node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  default_max_pods_per_node = 56

  network    = data.google_compute_network.cna_network.id
  subnetwork = data.google_compute_subnetwork.cna_subnet.id

  ip_allocation_policy {
    cluster_secondary_range_name = data.google_compute_subnetwork.cna_subnet.secondary_ip_range[0].range_name
  }

  enterprise_config {
    desired_tier = "ENTERPRISE"
  }

  workload_identity_config {
    workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
  }

  timeouts {
    create = "120m"
    update = "120m"
    read   = "120m"
  }
}
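
The configuration above also references a data.google_compute_network.cna_network data source that I haven't pasted; it's just a lookup of the shared VPC network, roughly along these lines (the network name and project here are placeholders):

data "google_compute_network" "cna_network" {
  # Placeholder values for the shared VPC network in my test environment.
  name    = "cna-shared-network"
  project = "cna-host-project"
}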

Debug Output

https://gist.github.com/phardy/d1ee5b1cdf4ae5d8ec17f2c3b5aa147b

Expected Behavior

I expect a new cluster to be created, with the default node pool still attached. I expect this to take some time; experience testing similar configurations shows at least 30 minutes to create the cluster, and longer again to add a workload pool afterwards.

Actual Behavior

The terraform apply fails after ~40 minutes, despite the 120m timeouts specified in the resource, with this error:

Error: Error waiting for creating GKE cluster: All cluster resources were brought up, but: 2 nodes out of 2 are unhealthy.

Inspecting the cluster in the Google Cloud console shows the same error. Inspecting the default node pool in the console shows all nodes reporting OK. Running a terraform plan at this point indicates the existing cluster is tainted, and proposes destroying it and creating a new cluster.

Steps to reproduce

  1. terraform apply

Important Factoids

As mentioned alongside the Terraform configuration, I'm using a small shared subnet. The primary IPv4 range for this subnet is a /28, and it has a single secondary IPv4 /24 range.
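
For context on the sizing: as I understand the GKE pod range rules, with default_max_pods_per_node = 56 each node is allocated a /25 pod range (2 x 56 = 112 addresses, rounded up to 128), so the /24 secondary range (256 addresses) fits exactly the two nodes created across the two node_locations. That matches the "2 nodes out of 2" in the error above.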

I've tested an identical configuration without the workload_identity_config block, which created successfully. I then added the workload_identity_config block, making the final config identical to what I've pasted here. Terraform successfully modifies the cluster to enable Workload Identity, although that update takes approximately 30 minutes to apply.
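
That is, creating the cluster first and then adding the following block in a second apply works, while including it in the initial create does not:

workload_identity_config {
  workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
}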

My initial attempts at creating this used a separately managed node pool. That also fails: my understanding, from reading the documentation and experimenting, is that the provider still creates the cluster with a default node pool and then deletes it, and it is this initial default node pool creation that is unsuccessful.
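
For reference, the separately managed node pool variant I first tried followed the usual pattern from the provider documentation, roughly like this (the node pool name and node_count are representative rather than my exact config):

resource "google_container_cluster" "workload_identity_test" {
  # Same settings as the configuration above, plus:
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "workload_identity_test_nodes" {
  # Placeholder node pool; name and node_count are illustrative.
  name       = "workload-identity-test-nodes"
  location   = "us-east4"
  cluster    = google_container_cluster.workload_identity_test.name
  node_count = 1

  node_config {
    service_account = google_service_account.workload_identity_test.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}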

I've also attempted creating the cluster with the default service account (omitting service_account and oauth_scopes from the node_config block). This makes no difference to the behavior described above.

References

No response

b/409662502
