WIF External Cluster Scanning: Private Endpoint Connectivity #1442

@slntopp

Description

Problem

When scanning an external EKS cluster via WIF (IRSA), the operator's init container runs aws eks update-kubeconfig to generate a kubeconfig for the target cluster. The resulting kubeconfig contains the cluster's API server endpoint as returned by the EKS API.
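
For reference, the generated command looks roughly like the following; the exact flags are an operator internal, and the values here are taken from the CRD example further down (the kubeconfig path is illustrative):

aws eks update-kubeconfig \
  --name my-target-cluster \
  --region eu-central-1 \
  --role-arn arn:aws:iam::123456789:role/scanner-role \
  --kubeconfig /tmp/kubeconfig   # illustrative path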

EKS clusters with both public and private endpoints enabled use split-horizon DNS:

  • Queries from outside the VPC resolve to the public endpoint IP
  • Queries from within the VPC resolve to the private endpoint IP

This means scanner pods running inside a VPC will always resolve the target cluster's API server hostname to its private IP, even when a public endpoint exists. If the scanner and target clusters are in the same VPC but have separate security groups, or if they're in different VPCs without peering/transit gateway, the scanner pods get i/o timeout errors trying to reach the private IP.
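
One way to see the split-horizon behavior directly (the hostname placeholder matches the error below; assumes dig is available locally and busybox can run in the scanner cluster):

# From outside the VPC, the endpoint resolves to public IPs
dig +short <cluster-id>.gr7.eu-central-1.eks.amazonaws.com

# From a pod inside the VPC, the same name resolves to private ENI IPs
# (e.g. 10.0.2.136), even when the public endpoint is enabled
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup <cluster-id>.gr7.eu-central-1.eks.amazonaws.com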

What we observed

x unable to create runtime for asset error="Get \"https://<cluster-id>.gr7.eu-central-1.eks.amazonaws.com/version?timeout=32s\": dial tcp 10.0.2.136:443: i/o timeout"

The hostname resolved to 10.0.2.136 (a private IP on a subnet behind the target cluster's primary security group), even though the cluster has a public endpoint available.

Workaround applied in e2e tests

Added an explicit security group rule allowing ingress from the scanner cluster's node security group to the target cluster's primary security group on port 443:

resource "aws_security_group_rule" "scanner_to_target_api" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = module.eks_target[0].cluster_primary_security_group_id
  source_security_group_id = module.eks.node_security_group_id
}

Key detail: the correct security groups to use are:

  • Destination: the EKS-managed primary security group (attached to the API server ENIs), NOT the module-managed cluster security group
  • Source: the node security group (where pod traffic originates), NOT the cluster security group
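
For reference, the EKS-managed primary security group can be confirmed via the EKS API (cluster name as in the CRD example below):

# Returns the EKS-managed primary SG attached to the API server ENIs
aws eks describe-cluster \
  --name my-target-cluster \
  --region eu-central-1 \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text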

Scope

This affects all cloud providers where the operator generates a kubeconfig via CLI (aws eks update-kubeconfig, gcloud container clusters get-credentials), not just EKS. Any scenario where DNS resolves to an unreachable private IP will fail.

Scenarios:

  1. Same VPC, different security groups (our e2e case) — fixable with SG rules
  2. Different VPCs, no peering — requires VPC peering, transit gateway, or PrivateLink (see the sketch after this list)
  3. Private-only clusters — scanner must be in the same network or have a route to the private endpoint
  4. Cross-region — private endpoints are not reachable cross-region
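
For scenario 2, a minimal sketch of the peering approach (VPC IDs are placeholders; route table entries and the security group rule above are still required):

# Request and accept a peering connection between the two VPCs
aws ec2 create-vpc-peering-connection \
  --vpc-id <scanner-vpc-id> \
  --peer-vpc-id <target-vpc-id>
aws ec2 accept-vpc-peering-connection \
  --vpc-peering-connection-id <pcx-id>
# Each VPC's route tables then need a route to the other VPC's CIDR via the
# peering connection (aws ec2 create-route --vpc-peering-connection-id ...)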

Proposed improvements

1. Document networking requirements

Add documentation explaining:

  • The split-horizon DNS behavior and its impact on cross-cluster scanning
  • Required security group rules when scanner and target are in the same VPC
  • Network topology requirements for different VPC / cross-region scenarios

2. Support endpoint override in MondooAuditConfig CRD (optional)

Currently the EKS section of the WIF external cluster spec only accepts region, clusterName, and roleArn — there is no way to override the endpoint:

externalClusters:
  - name: target-cluster
    workloadIdentity:
      provider: eks
      eks:
        region: eu-central-1
        clusterName: my-target-cluster
        roleArn: arn:aws:iam::123456789:role/scanner-role

Consider adding an optional endpoint field that lets users override the API server address. This would allow users to specify a reachable endpoint (e.g., a public IP, a VPC endpoint, or a load balancer) when the default DNS resolution doesn't work:

        endpoint: https://public-ip-or-custom-endpoint:443

The init container would pass this to aws eks update-kubeconfig --endpoint <url> (supported by the AWS CLI) or patch the kubeconfig after generation.
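
A sketch of the patch-after-generation variant, assuming update-kubeconfig's default naming (it names the cluster entry after the cluster ARN; the endpoint and path are illustrative):

# Rewrite the server URL of the generated cluster entry in place
kubectl config set-cluster \
  arn:aws:eks:eu-central-1:123456789:cluster/my-target-cluster \
  --server=https://public-ip-or-custom-endpoint:443 \
  --kubeconfig /tmp/kubeconfig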

3. Consider public endpoint preference (optional)

When both public and private endpoints are available, the operator could query the EKS API for endpoint configuration and prefer the public endpoint when running from outside the target cluster's node network. This is complex to detect reliably and may not be desirable in all cases.
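
The endpoint configuration itself is easy to query; the hard part is deciding, from inside a pod, whether the private endpoint is actually reachable. A sketch of the query half:

# Shows whether the target cluster has public/private endpoints enabled
aws eks describe-cluster \
  --name my-target-cluster \
  --region eu-central-1 \
  --query 'cluster.resourcesVpcConfig.[endpointPublicAccess,endpointPrivateAccess]' \
  --output text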

Labels: bug, documentation, enhancement