DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan#29

Open
jms200 wants to merge 1 commit into main from jms200/DND-1304/eks-v21-aws-v6
Conversation

jms200 (Collaborator) commented Jun 12, 2026

User description

Summary

Follow-up to #28: this lands fixes for the v21 schema and behavior issues that turned up when running the first dry-run plan against stsaasuat. The original PR's edits were correct in spirit but missed a few EKS module v21 schema renames, an AWS provider v6 quirk, and one default flip that we should not absorb on a brownfield migration.

Closes / addresses DND-1304.

What's in this PR

| File | Change |
| --- | --- |
| `modules/comet_eks/main.tf` | v21 dropped `eks_managed_node_group_defaults` → move to `local.eks_managed_node_group_defaults`, merge into each NG entry |
| `modules/comet_eks/main.tf` | Pin `metadata_options.http_put_response_hop_limit = 2` (v21 default flipped 2→1; would break sidecar IMDS-through-proxy patterns) |
| `modules/comet_eks/main.tf` | Karpenter NG taints list → map (v21 schema) |
| `modules/comet_eks/variables.tf` + root `variables.tf` | `eks_clickhouse_taints` list → map (v21 schema) |
| `modules/comet_elasticache/main.tf` | Gate `auth_token_update_strategy` on `auth_token` being non-null (v6 errors otherwise) |
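
The first two changes listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the module's actual code: the `admin` node group, its `instance_types` value, and the `builtin_node_groups` local are hypothetical stand-ins, and the sketch assumes the v21 module fills in any `metadata_options` attributes left unset.

```hcl
variable "common_tags" {
  type    = map(string)
  default = {}
}

locals {
  # Defaults that were previously passed through the module's
  # eks_managed_node_group_defaults input, which v21 dropped.
  eks_managed_node_group_defaults = {
    metadata_options = {
      # Pin the pre-v21 value explicitly; the v21 default is 1.
      http_put_response_hop_limit = 2
    }
    tags = var.common_tags
  }

  # Hypothetical node group entry: defaults first, so keys set on the
  # entry override the defaults at the top level.
  builtin_node_groups = {
    admin = merge(local.eks_managed_node_group_defaults, {
      instance_types = ["m5.large"] # illustrative value only
    })
  }
}
```

Because `merge()` is shallow, an entry that sets its own `metadata_options` or `tags` replaces the default map wholesale rather than extending it.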

Plan against stsaasuat

Clean — Plan: 2 to add, 14 to change, 4 to destroy. All v21-driven:

  • aws_eks_cluster.this[0].deletion_protection: false → true — the actual DND-1304 ask
  • ✅ No node group replacements (in-place updates of launch templates + nodegroups)
  • ✅ No data-plane disruption
  • 4 destroys: aws_iam_policy.custom and 2 attachments (v21 removed the legacy enable_security_groups_for_pods wiring; stsaasuat doesn't use it), plus the replacement of time_sleep.this (internal timing primitive)
  • 14 changes: the cluster gets deletion_protection=true, KMS/IAM resources shed the terraform-aws-modules=eks tag, the OIDC provider thumbprint is refreshed, and IAM role tags are updated

Plan log: see attached evidence in support-agent investigation folder.

Test plan

  • First clean plan against stsaasuat — no destructive changes to EKS cluster, nodegroups, or ElastiCache ✅
  • Cut v1.20.0 (MINOR — new family per [feedback_terraform_aws_comet_stsaas_versioning])
  • Bump comet-devops/terraform/stsaas/stsaasuat/main.tf ref → Atlantis apply
  • Fleet rollout per the v1.14.x/v1.15.x/v1.17.x/v1.18.x precedent

🤖 Generated with Claude Code


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Align EKS node group defaults, metadata options, and ClickHouse taint handling with the v21 schema while sustaining the prior autoscaler tag propagation and IMDS hop-limit behavior for node groups. Guard the ElastiCache replication group auth token update strategy on the presence of an auth token so provider v6 calls succeed when the token is unset.
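
As a rough illustration of the list → map taint change, the variable might be reshaped along the lines below. This is a sketch, not the PR's actual definition: the map key `dedicated`, the taint values, and the `legacy_taints`/`taints_as_map` locals are hypothetical, and `optional()` assumes Terraform 1.3 or later.

```hcl
# Previous shape (list of objects), shown for comparison:
#   eks_clickhouse_taints = [
#     { key = "dedicated", value = "clickhouse", effect = "NO_SCHEDULE" }
#   ]

variable "eks_clickhouse_taints" {
  description = "Taints for the ClickHouse node group, keyed by an arbitrary name."
  type = map(object({
    key    = string
    value  = optional(string)
    effect = string
  }))
  default = {}
}

locals {
  # Hypothetical legacy value, still in list form.
  legacy_taints = [
    { key = "dedicated", value = "clickhouse", effect = "NO_SCHEDULE" }
  ]

  # One way to convert a list into the new map shape: key each taint
  # by its own taint key (assumes taint keys are unique in the list).
  taints_as_map = { for t in local.legacy_taints : t.key => t }
}
```

Callers that already pass a map only need to pick stable key names; the keys themselves are labels and do not reach the node.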

Topic: EKS nodegroups
Details: Align local.eks_managed_node_group_defaults, eks_clickhouse_taints, and taint handling with the v21 schema while preserving previous metadata hop-limit and autoscaler tag behavior.
Modified files (3)
  • modules/comet_eks/main.tf
  • modules/comet_eks/variables.tf
  • variables.tf
Latest contributors (2)
User | Commit | Date
jms200 | DND-1304: handle v21 s... | June 12, 2026
darenjacobs@msn.com | feat(elasticache): exp... | April 22, 2026

Topic: ElastiCache auth
Details: Guard auth_token_update_strategy on a non-null auth_token so provider v6 avoids errors when the token is unset.
Modified files (1)
  • modules/comet_elasticache/main.tf
Latest contributors (2)
User | Commit | Date
jms200 | DND-1304: handle v21 s... | June 12, 2026
darenjacobs@msn.com | feat(elasticache): exp... | April 22, 2026

- comet_eks/main.tf: v21 dropped eks_managed_node_group_defaults — move to
  local.eks_managed_node_group_defaults, merge into each NG entry
- comet_eks/main.tf: pin metadata_options.http_put_response_hop_limit=2 in
  node group defaults (v21 default flipped 2→1; would break any sidecar
  IMDS-through-proxy pattern)
- comet_eks/main.tf: karpenter NG taints list → map (v21 schema)
- comet_eks + root variables.tf: eks_clickhouse_taints list → map (v21 schema)
- comet_elasticache/main.tf: gate auth_token_update_strategy on auth_token
  being non-null (v6 errors otherwise)

Plan against stsaasuat now clean (2 add / 14 change / 4 destroy, all v21-
driven; deletion_protection false→true confirmed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Comment thread modules/comet_elasticache/main.tf
Comment on lines +10 to +11
# AWS provider v6 errors if auth_token_update_strategy is set without auth_token.
auth_token_update_strategy = var.elasticache_auth_token != null ? var.elasticache_auth_token_update_strategy : null

This guard nulls auth_token_update_strategy whenever elasticache_auth_token is null, so an explicit DELETE removal request never reaches AWS — should we preserve DELETE instead of overriding it to null?

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_elasticache/main.tf around lines 10-11 (inside the ElastiCache replication
group/resource arguments), the current guard sets `auth_token_update_strategy` to `null`
whenever `var.elasticache_auth_token` is `null`, which blocks the explicit removal case
`auth_token = null` with `auth_token_update_strategy = "DELETE"`. Refactor the
expression so that when `var.elasticache_auth_token` is null you still pass through the
strategy value if it is exactly `DELETE`, but continue to set it to null for any other
strategy that would be invalid without an auth token. Ensure the logic matches the AWS
provider v6.31.0 documented behavior and add/adjust a variable contract test or
validation if one exists for this module.
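
One possible shape for the refactor described above is sketched here. It assumes, as the review comment does, that `DELETE` is accepted without an auth token; that behavior should be checked against the AWS provider documentation before adopting it. The variable declarations are minimal stand-ins for the module's real ones.

```hcl
variable "elasticache_auth_token" {
  type    = string
  default = null
}

variable "elasticache_auth_token_update_strategy" {
  type    = string
  default = null
}

locals {
  # With a token set, pass the strategy through unchanged.
  # Without a token, keep only an explicit "DELETE" and null out
  # every other value (including an unset strategy).
  auth_token_update_strategy = (
    var.elasticache_auth_token != null
    ? var.elasticache_auth_token_update_strategy
    : (var.elasticache_auth_token_update_strategy == "DELETE" ? "DELETE" : null)
  )
}
```

The resource argument would then read `auth_token_update_strategy = local.auth_token_update_strategy`. A null strategy compares unequal to `"DELETE"`, so the unset case still resolves to null.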

Comment thread modules/comet_eks/main.tf
Comment on lines +81 to +83
# common_tags on the nodegroup propagate to instances; CA discovery
# tags must be on the ASG so the autoscaler can match them.
tags = merge(var.common_tags, local.cluster_autoscaler_asg_tags)

eks_enable_cluster_autoscaler=true doesn't make the managed node group ASGs discoverable: cluster_autoscaler_asg_tags is merged into tags and then overwritten by var.common_tags, so should we move it to autoscaling_group_tags?

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_eks/main.tf around lines 81-83 where `cluster_autoscaler_asg_tags` is
merged into `eks_managed_node_group_defaults.tags`, fix the logic so the
`k8s.io/cluster-autoscaler/${var.eks_cluster_name}` discovery tags are applied to the
managed node group ASG tag field (the upstream EKS module uses
`autoscaling_group_tags`), not the instance/nodegroup `tags` map. Then, in the built-in
node group definitions later (e.g., karpenter/admin/comet/druid/airflow/clickhouse
around their `tags = var.common_tags` and their merges with
`local.eks_managed_node_group_defaults`), ensure those blocks do not overwrite/remove
the ASG discovery tags—merge both `var.common_tags` and `cluster_autoscaler_asg_tags`
into the correct ASG tag attribute. After the change, verify that when
`var.eks_enable_cluster_autoscaler = true`, the created managed node group ASGs carry
the CA discovery tags so autoscaler can discover and scale them.

Comment thread modules/comet_eks/main.tf
  } : {},
  # Additional custom node groups
- var.additional_node_groups
+ { for k, v in var.additional_node_groups : k => merge(local.eks_managed_node_group_defaults, v) }

additional_node_groups now wraps each entry in merge(local.eks_managed_node_group_defaults, v), but merge() is shallow so caller tags/metadata_options replace the defaults; should we deep-merge these fields so var.common_tags and the Cluster Autoscaler discovery tags stay intact?

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_eks/main.tf around line 377 where `eks_managed_node_groups` builds
the “Additional custom node groups” with `{ for k, v in var.additional_node_groups :
k => merge(local.eks_managed_node_group_defaults, v) }`, fix the shallow merge behavior
that overwrites nested maps. Terraform `merge()` is shallow, so if `v` includes `tags`
(or `metadata_options`), it replaces the defaults (including
`local.eks_managed_node_group_defaults.tags` and the Cluster Autoscaler discovery tags)
rather than extending them. Refactor this mapping so `tags` are merged key-by-key with
defaults preserved (defaults first, then caller keys win), and do the same for
`metadata_options` (and any other nested default maps you intend to be additive), while
keeping the rest of `v` overriding top-level defaults.
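
A minimal sketch of the key-by-key merge requested above follows. It assumes `var.additional_node_groups` is loosely typed (`any`) and that only `tags` and `metadata_options` need to be additive; the defaults local and its example tag are simplified stand-ins, not the module's real values.

```hcl
variable "additional_node_groups" {
  type    = any
  default = {}
}

locals {
  # Simplified stand-in for the module's real defaults.
  eks_managed_node_group_defaults = {
    metadata_options = { http_put_response_hop_limit = 2 }
    tags             = { ManagedBy = "terraform" } # illustrative tag
  }

  additional_node_groups = {
    for k, v in var.additional_node_groups : k => merge(
      # Top level: defaults first, then the caller's entry wins.
      local.eks_managed_node_group_defaults,
      v,
      # Nested maps: rebuild them so defaults survive and caller keys win.
      {
        tags = merge(
          local.eks_managed_node_group_defaults.tags,
          try(v.tags, {}),
        )
        metadata_options = merge(
          local.eks_managed_node_group_defaults.metadata_options,
          try(v.metadata_options, {}),
        )
      },
    )
  }
}
```

The third argument to the outer `merge()` overrides whatever `tags` and `metadata_options` the first two produced, so a caller that sets a single tag no longer drops the defaults. `try(..., {})` covers entries that omit those attributes.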
