DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan by jms200 · Pull Request #29 · comet-ml/terraform-aws-comet-stsaas

jms200 · 2026-06-12T21:16:41Z

User description

Summary

Follow-up to #28 — landing the v21 schema/behavior issues that turned up when running the first dry-run plan against stsaasuat. The original PR's edits were correct in spirit but missed a few v21 schema renames + a v6 quirk and one default flip we should not absorb on a brownfield migration.

Closes / addresses DND-1304.

What's in this PR

File	Change
`modules/comet_eks/main.tf`	v21 dropped `eks_managed_node_group_defaults` → move to `local.eks_managed_node_group_defaults`, merge into each NG entry
`modules/comet_eks/main.tf`	Pin `metadata_options.http_put_response_hop_limit = 2` (v21 default flipped 2→1; would break sidecar IMDS-through-proxy patterns)
`modules/comet_eks/main.tf`	Karpenter NG `taints` list → map (v21 schema)
`modules/comet_eks/variables.tf` + root `variables.tf`	`eks_clickhouse_taints` list → map (v21 schema)
`modules/comet_elasticache/main.tf`	Gate `auth_token_update_strategy` on `auth_token` being non-null (v6 errors otherwise)

Plan against stsaasuat

Clean — Plan: 2 to add, 14 to change, 4 to destroy. All v21-driven:

✅ aws_eks_cluster.this[0].deletion_protection: false → true — the actual DND-1304 ask
✅ No node group replacements (in-place updates of launch templates + nodegroups)
✅ No data-plane disruption
4 destroys: aws_iam_policy.custom + 2 attachments (v21 removed the enable_security_groups_for_pods legacy wiring; stsaasuat doesn't use it) + time_sleep.this replacement (internal timing primitive)
14 changes: cluster getting deletion_protection=true, KMS/IAM tags shed terraform-aws-modules=eks tag, OIDC provider thumbprint refresh, IAM role tag updates

Plan log: see attached evidence in support-agent investigation folder.

Test plan

First clean plan against stsaasuat — no destructive changes to EKS cluster, nodegroups, or ElastiCache ✅
Cut v1.20.0 (MINOR — new family per [feedback_terraform_aws_comet_stsaas_versioning])
Bump comet-devops/terraform/stsaas/stsaasuat/main.tf ref → Atlantis apply
Fleet rollout per the v1.14.x/v1.15.x/v1.17.x/v1.18.x precedent

🤖 Generated with Claude Code

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Align EKS node group defaults, metadata options, and ClickHouse taint handling with the v21 schema while sustaining the prior autoscaler tag propagation and IMDS hop-limit behavior for node groups. Guard the ElastiCache replication group auth token update strategy on the presence of an auth token so provider v6 calls succeed when the token is unset.

Topic Details

EKS nodegroups

Align local.eks_managed_node_group_defaults, eks_clickhouse_taints, and taint handling with the v21 schema while preserving previous metadata hop-limit and autoscaler tag behavior.

Modified files (3)

modules/comet_eks/main.tf
modules/comet_eks/variables.tf
variables.tf

Latest Contributors(2)

User	Commit	Date
jms200	DND-1304: handle v21 s...	June 12, 2026
darenjacobs@msn.com	feat(elasticache): exp...	April 22, 2026

ElastiCache auth

Guard auth_token_update_strategy on a non-null auth_token so provider v6 avoids errors when the token is unset.

Modified files (1)

modules/comet_elasticache/main.tf

Latest Contributors(2)

User	Commit	Date
jms200	DND-1304: handle v21 s...	June 12, 2026
darenjacobs@msn.com	feat(elasticache): exp...	April 22, 2026

_{Review this PR on Baz | Customize your next review}

- comet_eks/main.tf: v21 dropped eks_managed_node_group_defaults — move to local.eks_managed_node_group_defaults, merge into each NG entry - comet_eks/main.tf: pin metadata_options.http_put_response_hop_limit=2 in node group defaults (v21 default flipped 2→1; would break any sidecar IMDS-through-proxy pattern) - comet_eks/main.tf: karpenter NG taints list → map (v21 schema) - comet_eks + root variables.tf: eks_clickhouse_taints list → map (v21 schema) - comet_elasticache/main.tf: gate auth_token_update_strategy on auth_token being non-null (v6 errors otherwise) Plan against stsaasuat now clean (2 add / 14 change / 4 destroy, all v21- driven; deletion_protection false→true confirmed). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

baz-reviewer · 2026-06-12T21:26:56Z

+  # AWS provider v6 errors if auth_token_update_strategy is set without auth_token.
+  auth_token_update_strategy  = var.elasticache_auth_token != null ? var.elasticache_auth_token_update_strategy : null


This guard nulls auth_token_update_strategy whenever elasticache_auth_token is null, so an explicit DELETE removal request never reaches AWS — should we preserve DELETE instead of overriding it to null?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In modules/comet_elasticache/main.tf around lines 10-11 (inside the ElastiCache replication group/resource arguments), the current guard sets `auth_token_update_strategy` to `null` whenever `var.elasticache_auth_token` is `null`, which blocks the explicit removal case `auth_token = null` with `auth_token_update_strategy = "DELETE"`. Refactor the expression so that when `var.elasticache_auth_token` is null you still pass through the strategy value if it is exactly `DELETE`, but continue to set it to null for any other strategy that would be invalid without an auth token. Ensure the logic matches the AWS provider v6.31.0 documented behavior and add/adjust a variable contract test or validation if one exists for this module.

baz-reviewer · 2026-06-12T21:26:56Z

+      # common_tags on the nodegroup propagate to instances; CA discovery
+      # tags must be on the ASG so the autoscaler can match them.
+      tags = merge(var.common_tags, local.cluster_autoscaler_asg_tags)


eks_enable_cluster_autoscaler=true doesn't make the managed node group ASGs discoverable: cluster_autoscaler_asg_tags is merged into tags and then overwritten by var.common_tags, so should we move it to autoscaling_group_tags?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In modules/comet_eks/main.tf around lines 81-83 where `cluster_autoscaler_asg_tags` is merged into `eks_managed_node_group_defaults.tags`, fix the logic so the `k8s.io/cluster-autoscaler/${var.eks_cluster_name}` discovery tags are applied to the managed node group ASG tag field (the upstream EKS module uses `autoscaling_group_tags`), not the instance/nodegroup `tags` map. Then, in the built-in node group definitions later (e.g., karpenter/admin/comet/druid/airflow/clickhouse around their `tags = var.common_tags` and their merges with `local.eks_managed_node_group_defaults`), ensure those blocks do not overwrite/remove the ASG discovery tags—merge both `var.common_tags` and `cluster_autoscaler_asg_tags` into the correct ASG tag attribute. After the change, verify that when `var.eks_enable_cluster_autoscaler = true`, the created managed node group ASGs carry the CA discovery tags so autoscaler can discover and scale them.

baz-reviewer · 2026-06-12T21:26:56Z

    } : {},
    # Additional custom node groups
-    var.additional_node_groups
+    { for k, v in var.additional_node_groups : k => merge(local.eks_managed_node_group_defaults, v) }


additional_node_groups now wraps each entry in merge(local.eks_managed_node_group_defaults, v), but merge() is shallow so caller tags/metadata_options replace the defaults; should we deep-merge these fields so var.common_tags and the Cluster Autoscaler discovery tags stay intact?

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In modules/comet_eks/main.tf around lines 377-377 where `eks_managed_node_groups` builds the “Additional custom node groups” with `{ for k, v in var.additional_node_groups : k => merge(local.eks_managed_node_group_defaults, v) }`, fix the shallow merge behavior that overwrites nested maps. Terraform `merge()` is shallow, so if `v` includes `tags` (or `metadata_options`), it replaces the defaults (including `local.eks_managed_node_group_defaults.tags` and the Cluster Autoscaler discovery tags) rather than extending them. Refactor this mapping so `tags` are merged key-by-key with defaults preserved (defaults first, then caller keys win), and do the same for `metadata_options` (and any other nested default maps you intend to be additive), while keeping the rest of `v` overriding top-level defaults.

baz-reviewer Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan#29

DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan#29
jms200 wants to merge 1 commit into
mainfrom
jms200/DND-1304/eks-v21-aws-v6

jms200 commented Jun 12, 2026 •

edited by atlassian Bot

Loading

Uh oh!

baz-reviewer Bot Jun 12, 2026

Uh oh!

baz-reviewer Bot Jun 12, 2026

Uh oh!

baz-reviewer Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		# AWS provider v6 errors if auth_token_update_strategy is set without auth_token.
		auth_token_update_strategy = var.elasticache_auth_token != null ? var.elasticache_auth_token_update_strategy : null

Conversation

jms200 commented Jun 12, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Summary

What's in this PR

Plan against stsaasuat

Test plan

Generated description

Uh oh!

baz-reviewer Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jms200 commented Jun 12, 2026 •

edited by atlassian Bot

Loading