iris: add Kind-based integration tests for K8s scheduling correctness #3940

@rjpower

Summary

Add integration tests using Kind (Kubernetes in Docker) to validate K8s scheduling correctness — topology constraints, affinity rules, taints, RBAC, resource quotas.

Motivation

The current InMemoryK8sService fake is good for unit-testing K8sTaskProvider logic (manifest construction, state mapping, log fetching) but it implements a simplified scheduler that doesn't handle:

  • podAffinity / podAntiAffinity with topologyKey
  • Real resource quota enforcement
  • RBAC validation
  • Topology spread constraints

This means configuration errors like setting topologyKey: "coreweave.cloud/spiine" (typo) instead of "coreweave.cloud/spine" are not caught by tests. Rather than reimplementing K8s scheduling in our fake, we should use Kind for integration tests that validate scheduling correctness.
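To illustrate why a manifest-level test cannot catch this, here is a minimal sketch. The helper names and the `iris.job` label key are hypothetical and not taken from the Iris codebase. It builds a required `podAffinity` stanza twice, once with the correct key and once with the typo, and shows that the two differ only when compared against actual node labels.

```python
# Sketch only: helper names and the "iris.job" label key are hypothetical.


def colocation_affinity(job_name: str, topology_key: str) -> dict:
    """Build a required podAffinity stanza that colocates pods of one job."""
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"iris.job": job_name}},
                    "topologyKey": topology_key,
                }
            ]
        }
    }


def nodes_with_topology_key(nodes: list[dict], topology_key: str) -> list[str]:
    """Return the names of nodes that carry the given topology label."""
    return [n["name"] for n in nodes if topology_key in n["labels"]]


nodes = [
    {"name": "node-a", "labels": {"coreweave.cloud/spine": "s1"}},
    {"name": "node-b", "labels": {"coreweave.cloud/spine": "s1"}},
]

good = colocation_affinity("train", "coreweave.cloud/spine")
typo = colocation_affinity("train", "coreweave.cloud/spiine")

# Both stanzas have the same shape, so a structural check passes for both.
assert good.keys() == typo.keys()

# Only a comparison against node labels tells them apart.
print(nodes_with_topology_key(nodes, "coreweave.cloud/spine"))   # ['node-a', 'node-b']
print(nodes_with_topology_key(nodes, "coreweave.cloud/spiine"))  # []
```

Matching against node labels is what a real scheduler does as part of placement, which is the behavior the Kind tests would exercise.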

Proposed approach

Use both InMemoryK8sService and Kind at different layers:

| Layer | Tool | Speed | What it validates |
|---|---|---|---|
| Unit tests | InMemoryK8sService | Instant | Our code: manifest building, state transitions, log fetching |
| Integration tests | Kind cluster | ~10-30s startup | The config: scheduling, topology, affinity, RBAC |

Implementation:

  1. Add a conftest.py fixture that:

    • Spins up a Kind cluster with configurable node pools (labels, taints, resources)
    • Yields a CloudK8sService pointed at the Kind cluster
    • Tears down the cluster after tests
  2. Mark tests with @pytest.mark.kind (these require Docker and should be skipped in CI environments without Docker)

  3. Write tests for:

    • Multi-task job with correct colocation topology key → all pods scheduled
    • Typo in topology key → pods stay Pending/Unschedulable
    • GPU pod on CPU-only nodepool → Unschedulable
    • Resource exhaustion → Pending
    • Taint without toleration → Unschedulable
    • RBAC: service account without pod creation permission → rejected
  4. Stop extending InMemoryK8sService._schedule_pod() with K8s scheduler semantics — let Kind handle scheduling correctness.
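The cluster lifecycle from step 1 could be sketched roughly as follows. This is an assumption-laden outline, not the proposed implementation: `render_kind_config`, `kind_cluster`, and `kind_available` are hypothetical names, node pools are modeled as one worker node per label set, and taints and resource limits are left out. It relies only on the `kind create cluster` and `kind delete cluster` commands with their `--name`, `--config`, `--kubeconfig`, and `--wait` flags.

```python
import contextlib
import os
import shutil
import subprocess
import tempfile
from typing import Iterator


def kind_available() -> bool:
    """True if both the kind and docker binaries are on PATH."""
    return shutil.which("kind") is not None and shutil.which("docker") is not None


def render_kind_config(node_pools: list[dict[str, str]]) -> str:
    """Render a Kind cluster config with one worker node per label set."""
    lines = [
        "kind: Cluster",
        "apiVersion: kind.x-k8s.io/v1alpha4",
        "nodes:",
        "- role: control-plane",
    ]
    for labels in node_pools:
        lines.append("- role: worker")
        if labels:
            lines.append("  labels:")
            for key, value in sorted(labels.items()):
                lines.append(f'    "{key}": "{value}"')
    return "\n".join(lines) + "\n"


@contextlib.contextmanager
def kind_cluster(name: str, node_pools: list[dict[str, str]]) -> Iterator[str]:
    """Create a Kind cluster, yield its kubeconfig path, delete it on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = os.path.join(tmp, "kind-config.yaml")
        kubeconfig = os.path.join(tmp, "kubeconfig")
        with open(config_path, "w") as f:
            f.write(render_kind_config(node_pools))
        try:
            subprocess.run(
                ["kind", "create", "cluster", "--name", name,
                 "--config", config_path, "--kubeconfig", kubeconfig,
                 "--wait", "120s"],
                check=True,
            )
            yield kubeconfig
        finally:
            # Best-effort teardown, also after a failed or partial create.
            subprocess.run(
                ["kind", "delete", "cluster", "--name", name,
                 "--kubeconfig", kubeconfig],
                check=False,
            )
```

In conftest.py this context manager could be wrapped in a session-scoped fixture that builds a CloudK8sService from the yielded kubeconfig path (its constructor is not shown in this issue, so that part is omitted), and `kind_available()` could back the skip condition for tests marked `@pytest.mark.kind`.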

Context

This came out of the provider refactoring in #3900. The fake K8s service handles nodeSelector, tolerations, and resource capacity but not affinity rules. The right answer is to use the real scheduler (Kind) rather than reimplement it.
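Because the real scheduler reports its decision in the pod's `PodScheduled` condition, the Pending/Unschedulable outcomes in the test list could be asserted with a small check like the one below. This is a sketch: `is_unschedulable` is a hypothetical helper, and the pod is assumed to be a plain dict in the shape returned by `kubectl get pod -o json`.

```python
def is_unschedulable(pod: dict) -> bool:
    """True if the pod is Pending and the scheduler marked it Unschedulable."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return True
    return False


# Example pod statuses (hand-written for illustration, not captured output).
stuck_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ],
    }
}
running_pod = {
    "status": {
        "phase": "Running",
        "conditions": [{"type": "PodScheduled", "status": "True"}],
    }
}

print(is_unschedulable(stuck_pod))    # True
print(is_unschedulable(running_pod))  # False
```

A Kind test would typically poll for this condition with a timeout, since the scheduler sets it asynchronously after the pod is created.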

🤖 Generated with Claude Code
