DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan#29
DND-1304 (follow-up): handle v21 schema breaks found by stsaasuat plan#29jms200 wants to merge 1 commit into
Conversation
- comet_eks/main.tf: v21 dropped eks_managed_node_group_defaults — move to local.eks_managed_node_group_defaults, merge into each NG entry - comet_eks/main.tf: pin metadata_options.http_put_response_hop_limit=2 in node group defaults (v21 default flipped 2→1; would break any sidecar IMDS-through-proxy pattern) - comet_eks/main.tf: karpenter NG taints list → map (v21 schema) - comet_eks + root variables.tf: eks_clickhouse_taints list → map (v21 schema) - comet_elasticache/main.tf: gate auth_token_update_strategy on auth_token being non-null (v6 errors otherwise) Plan against stsaasuat now clean (2 add / 14 change / 4 destroy, all v21- driven; deletion_protection false→true confirmed). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| # AWS provider v6 errors if auth_token_update_strategy is set without auth_token. | ||
| auth_token_update_strategy = var.elasticache_auth_token != null ? var.elasticache_auth_token_update_strategy : null |
There was a problem hiding this comment.
This guard nulls auth_token_update_strategy whenever elasticache_auth_token is null, so an explicit DELETE removal request never reaches AWS — should we preserve DELETE instead of overriding it to null?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_elasticache/main.tf around lines 10-11 (inside the ElastiCache replication
group/resource arguments), the current guard sets `auth_token_update_strategy` to `null`
whenever `var.elasticache_auth_token` is `null`, which blocks the explicit removal case
`auth_token = null` with `auth_token_update_strategy = "DELETE"`. Refactor the
expression so that when `var.elasticache_auth_token` is null you still pass through the
strategy value if it is exactly `DELETE`, but continue to set it to null for any other
strategy that would be invalid without an auth token. Ensure the logic matches the AWS
provider v6.31.0 documented behavior and add/adjust a variable contract test or
validation if one exists for this module.
| # common_tags on the nodegroup propagate to instances; CA discovery | ||
| # tags must be on the ASG so the autoscaler can match them. | ||
| tags = merge(var.common_tags, local.cluster_autoscaler_asg_tags) |
There was a problem hiding this comment.
eks_enable_cluster_autoscaler=true doesn't make the managed node group ASGs discoverable: cluster_autoscaler_asg_tags is merged into tags and then overwritten by var.common_tags, so should we move it to autoscaling_group_tags?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_eks/main.tf around lines 81-83 where `cluster_autoscaler_asg_tags` is
merged into `eks_managed_node_group_defaults.tags`, fix the logic so the
`k8s.io/cluster-autoscaler/${var.eks_cluster_name}` discovery tags are applied to the
managed node group ASG tag field (the upstream EKS module uses
`autoscaling_group_tags`), not the instance/nodegroup `tags` map. Then, in the built-in
node group definitions later (e.g., karpenter/admin/comet/druid/airflow/clickhouse
around their `tags = var.common_tags` and their merges with
`local.eks_managed_node_group_defaults`), ensure those blocks do not overwrite/remove
the ASG discovery tags—merge both `var.common_tags` and `cluster_autoscaler_asg_tags`
into the correct ASG tag attribute. After the change, verify that when
`var.eks_enable_cluster_autoscaler = true`, the created managed node group ASGs carry
the CA discovery tags so autoscaler can discover and scale them.
| } : {}, | ||
| # Additional custom node groups | ||
| var.additional_node_groups | ||
| { for k, v in var.additional_node_groups : k => merge(local.eks_managed_node_group_defaults, v) } |
There was a problem hiding this comment.
additional_node_groups now wraps each entry in merge(local.eks_managed_node_group_defaults, v), but merge() is shallow so caller tags/metadata_options replace the defaults; should we deep-merge these fields so var.common_tags and the Cluster Autoscaler discovery tags stay intact?
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
modules/comet_eks/main.tf around lines 377-377 where `eks_managed_node_groups` builds
the “Additional custom node groups” with `{ for k, v in var.additional_node_groups :
k => merge(local.eks_managed_node_group_defaults, v) }`, fix the shallow merge behavior
that overwrites nested maps. Terraform `merge()` is shallow, so if `v` includes `tags`
(or `metadata_options`), it replaces the defaults (including
`local.eks_managed_node_group_defaults.tags` and the Cluster Autoscaler discovery tags)
rather than extending them. Refactor this mapping so `tags` are merged key-by-key with
defaults preserved (defaults first, then caller keys win), and do the same for
`metadata_options` (and any other nested default maps you intend to be additive), while
keeping the rest of `v` overriding top-level defaults.
User description
Summary
Follow-up to #28 — landing the v21 schema/behavior issues that turned up when running the first dry-run plan against stsaasuat. The original PR's edits were correct in spirit but missed a few v21 schema renames + a v6 quirk and one default flip we should not absorb on a brownfield migration.
Closes / addresses DND-1304.
What's in this PR
modules/comet_eks/main.tfeks_managed_node_group_defaults→ move tolocal.eks_managed_node_group_defaults, merge into each NG entrymodules/comet_eks/main.tfmetadata_options.http_put_response_hop_limit = 2(v21 default flipped 2→1; would break sidecar IMDS-through-proxy patterns)modules/comet_eks/main.tftaintslist → map (v21 schema)modules/comet_eks/variables.tf+ rootvariables.tfeks_clickhouse_taintslist → map (v21 schema)modules/comet_elasticache/main.tfauth_token_update_strategyonauth_tokenbeing non-null (v6 errors otherwise)Plan against stsaasuat
Clean —
Plan: 2 to add, 14 to change, 4 to destroy. All v21-driven:aws_eks_cluster.this[0].deletion_protection: false → true— the actual DND-1304 askaws_iam_policy.custom+ 2 attachments (v21 removed theenable_security_groups_for_podslegacy wiring; stsaasuat doesn't use it) +time_sleep.thisreplacement (internal timing primitive)deletion_protection=true, KMS/IAM tags shedterraform-aws-modules=ekstag, OIDC provider thumbprint refresh, IAM role tag updatesPlan log: see attached evidence in support-agent investigation folder.
Test plan
v1.20.0(MINOR — new family per [feedback_terraform_aws_comet_stsaas_versioning])comet-devops/terraform/stsaas/stsaasuat/main.tfref → Atlantis apply🤖 Generated with Claude Code
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Align EKS node group defaults, metadata options, and ClickHouse taint handling with the v21 schema while sustaining the prior autoscaler tag propagation and IMDS hop-limit behavior for node groups. Guard the ElastiCache replication group auth token update strategy on the presence of an auth token so provider v6 calls succeed when the token is unset.
local.eks_managed_node_group_defaults,eks_clickhouse_taints, and taint handling with the v21 schema while preserving previous metadata hop-limit and autoscaler tag behavior.Modified files (3)
Latest Contributors(2)
auth_token_update_strategyon a non-nullauth_tokenso provider v6 avoids errors when the token is unset.Modified files (1)
Latest Contributors(2)