
Iris: run controller & workers under explicit service account, not default compute SA #3473

@rjpower

Description

Problem

When Iris creates GCE instances (controller VM) and TPU VMs via gcloud compute instances create / gcloud compute tpus tpu-vm create, it passes no --service-account flag. All resources therefore run under the Compute Engine default service account (748532799086-compute@developer.gserviceaccount.com), which typically holds roles/editor, far more permissions than Iris needs.

Proposed Change

Pass an explicit --service-account=<sa> flag when creating controller VMs and TPU worker slices in lib/iris/src/iris/cluster/platform/gcp.py. The SA should be configurable in the cluster config YAML (e.g., platform.gcp.service_account). If the field is unset, Iris should omit the flag and keep using the default compute SA for backward compatibility.

This enables:

  • Least privilege: controller and workers only get the permissions they actually need (e.g., pull container images, write logs)
  • Audit clarity: resource actions in Cloud Audit Logs are attributed to a purpose-specific SA
  • CI isolation: the CI smoke test SA only needs roles/iam.serviceAccountUser on a narrowly scoped runtime SA, not on the powerful default compute SA
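
As a rough illustration of the proposed change, the flag could be appended conditionally wherever the gcloud argument list is built. This is only a sketch: the helper name with_service_account, the instance names, the zone, and the SA email are all hypothetical, and the real command construction in gcp.py may look quite different.

```python
from typing import Optional


def with_service_account(cmd: list[str], service_account: Optional[str]) -> list[str]:
    """Return a copy of cmd with --service-account appended when one is configured.

    Hypothetical helper, not part of Iris today.
    """
    if not service_account:
        # Unset: omit the flag so gcloud keeps using the default compute SA.
        return list(cmd)
    return [*cmd, f"--service-account={service_account}"]


# Hypothetical base commands; the real argument lists in gcp.py will differ.
controller_cmd = [
    "gcloud", "compute", "instances", "create", "iris-controller",
    "--zone=us-central2-b",
]
tpu_cmd = [
    "gcloud", "compute", "tpus", "tpu-vm", "create", "iris-worker-0",
    "--zone=us-central2-b",
]

# Placeholder SA email, for illustration only.
runtime_sa = "iris-runtime@my-project.iam.gserviceaccount.com"

print(with_service_account(controller_cmd, runtime_sa))
print(with_service_account(tpu_cmd, None))
```

Returning a new list rather than mutating the input keeps the unset case byte-for-byte identical to the current behavior, which is what the backward-compatibility fallback needs.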

Files to Change

  • lib/iris/src/iris/cluster/platform/gcp.py — add --service-account to create_slice, create_vm_slice, and controller VM creation
  • lib/iris/protos/config.proto (or equivalent) — add service_account field to GcpPlatformConfig
  • lib/iris/examples/smoke.yaml, coreweave.yaml, etc. — document the new field
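
For the config side, one possible shape for reading the new field is sketched below. It assumes the cluster YAML has already been parsed into a plain dict; Iris actually uses a proto-backed config, so resolve_service_account and the dict layout are stand-ins for illustration, not the real API.

```python
from typing import Any, Optional


def resolve_service_account(config: dict[str, Any]) -> Optional[str]:
    """Look up platform.gcp.service_account, returning None when it is unset.

    Hypothetical helper operating on a plain dict; the real config is a proto.
    """
    gcp = (config.get("platform") or {}).get("gcp") or {}
    sa = gcp.get("service_account")
    if sa is None or sa == "":
        # Unset or empty: caller falls back to the default compute SA.
        return None
    if not isinstance(sa, str) or "@" not in sa:
        # Cheap sanity check only; gcloud performs the real validation.
        raise ValueError(f"service_account does not look like an SA email: {sa!r}")
    return sa


# Equivalent of a cluster YAML with the new field set (placeholder values).
configured = {
    "platform": {
        "gcp": {"service_account": "iris-runtime@my-project.iam.gserviceaccount.com"}
    }
}
# Equivalent of an existing cluster YAML that predates the field.
legacy = {"platform": {"gcp": {}}}

print(resolve_service_account(configured))
print(resolve_service_account(legacy))
```

Treating an empty string the same as a missing key matches proto3 semantics, where an unset string field reads back as "".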

🤖 Generated with Claude Code
