iris: add Kind-based integration tests for K8s scheduling correctness

## Summary

Add integration tests using [Kind](https://kind.sigs.k8s.io/) (Kubernetes in Docker) to validate K8s scheduling correctness: topology constraints, affinity rules, taints, RBAC, and resource quotas.

## Motivation

The current `InMemoryK8sService` fake works well for unit-testing `K8sTaskProvider` logic (manifest construction, state mapping, log fetching), but it implements a simplified scheduler that doesn't handle:

- `podAffinity` / `podAntiAffinity` with `topologyKey`
- Real resource quota enforcement
- RBAC validation
- Topology spread constraints

This means configuration errors like setting `topologyKey: "coreweave.cloud/spiine"` (typo) instead of `"coreweave.cloud/spine"` are not caught by tests. Rather than reimplementing K8s scheduling in our fake, we should use Kind for integration tests that validate scheduling correctness.
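To make the failure mode concrete, here is a minimal sketch of where such a typo lives. The helper names and the `job` label are hypothetical, not taken from the codebase, and the label check only illustrates that the mistyped key is a well-formed string that no node carries. It is not a model of how the real scheduler evaluates affinity.

```python
def colocation_affinity(topology_key: str, job_label: str) -> dict:
    """Build a required podAffinity stanza (the `job` label is a hypothetical example)."""
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"job": job_label}},
                    "topologyKey": topology_key,
                }
            ]
        }
    }


def unknown_topology_keys(affinity: dict, node_labels: list[dict]) -> set[str]:
    """Return the topology keys that no node carries as a label."""
    known = {key for labels in node_labels for key in labels}
    terms = affinity["podAffinity"]["requiredDuringSchedulingIgnoredDuringExecution"]
    return {term["topologyKey"] for term in terms} - known


# Two nodes labeled with the correct key; the manifest uses the mistyped one.
nodes = [{"coreweave.cloud/spine": "s1"}, {"coreweave.cloud/spine": "s2"}]
typo = colocation_affinity("coreweave.cloud/spiine", "train")
print(unknown_topology_keys(typo, nodes))
```

Both manifests have the same shape, so a fake that ignores affinity treats them identically. Only something that compares the key against actual node labels, as the real scheduler does, can tell them apart.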

## Proposed approach

**Use both InMemoryK8sService and Kind at different layers:**

| Layer | Tool | Speed | What it validates |
|-------|------|-------|-------------------|
| Unit tests | `InMemoryK8sService` | Instant | Our code: manifest building, state transitions, log fetching |
| Integration tests | Kind cluster | ~10-30s startup | The config: scheduling, topology, affinity, RBAC |

**Implementation:**

1. Add a `conftest.py` fixture that:
   - Spins up a Kind cluster with configurable node pools (labels, taints, resources)
   - Yields a `CloudK8sService` pointed at the Kind cluster
   - Tears down the cluster after tests

2. Mark tests with `@pytest.mark.kind` (they require Docker and should be skipped in CI environments without it)

3. Write tests for:
   - Multi-task job with correct colocation topology key → all pods scheduled
   - Typo in topology key → pods stay Pending/Unschedulable
   - GPU pod on CPU-only nodepool → Unschedulable
   - Resource exhaustion → Pending
   - Taint without toleration → Unschedulable
   - RBAC: service account without pod creation permission → rejected

4. Stop extending `InMemoryK8sService._schedule_pod()` with K8s scheduler semantics and let Kind handle scheduling correctness.
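The cluster lifecycle behind steps 1 and 2 could be sketched roughly as follows. This assumes the `kind` and `docker` CLIs are installed; the function names and the 120s wait are illustrative choices, not part of the existing code. The sketch covers node labels only. Taints would need to be applied separately, and how `CloudK8sService` is pointed at a kubeconfig is not shown here because the issue does not specify its constructor.

```python
import contextlib
import json
import os
import shutil
import subprocess
import tempfile


def kind_available() -> bool:
    """True when both the docker and kind CLIs are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("docker", "kind"))


def kind_config(worker_labels: list[dict]) -> dict:
    """Build a Kind config: one control-plane node plus one worker per label set."""
    nodes = [{"role": "control-plane"}]
    nodes += [{"role": "worker", "labels": dict(labels)} for labels in worker_labels]
    return {"kind": "Cluster", "apiVersion": "kind.x-k8s.io/v1alpha4", "nodes": nodes}


@contextlib.contextmanager
def kind_cluster(name: str, worker_labels: list[dict]):
    """Create a Kind cluster, yield the path to its kubeconfig, delete it on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = os.path.join(tmp, "kind-config.yaml")
        kubeconfig = os.path.join(tmp, "kubeconfig")
        with open(config_path, "w") as f:
            json.dump(kind_config(worker_labels), f)  # JSON is also valid YAML
        try:
            subprocess.run(
                ["kind", "create", "cluster", "--name", name,
                 "--config", config_path, "--kubeconfig", kubeconfig,
                 "--wait", "120s"],
                check=True,
            )
            yield kubeconfig
        finally:
            # Best-effort cleanup, also after a failed or partial create.
            subprocess.run(["kind", "delete", "cluster", "--name", name], check=False)
```

A `conftest.py` fixture could wrap `kind_cluster(...)` in a `with` block and yield a service built from the kubeconfig path, and tests marked `kind` could be skipped when `kind_available()` returns `False`.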

## Context

This came out of the provider refactoring in #3900. The fake K8s service handles nodeSelector, tolerations, and resource capacity but not affinity rules. The right answer is to use the real scheduler (Kind) rather than reimplement it.
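For tests that expect a pod to stay Pending, one way to read the real scheduler's verdict is the pod's `PodScheduled` condition. The helper below is a hypothetical sketch that works on the `status` section of a pod already fetched as a plain dict; how the pod is fetched is left out.

```python
def is_unschedulable(pod_status: dict) -> bool:
    """True when the PodScheduled condition is False with reason Unschedulable."""
    for cond in pod_status.get("conditions", []):
        if cond.get("type") == "PodScheduled":
            return cond.get("status") == "False" and cond.get("reason") == "Unschedulable"
    # No PodScheduled condition yet: the scheduler has not reported a verdict.
    return False
```

Because the scheduler may take a moment to report, a test would likely poll this check with a timeout rather than assert on it immediately after creating the pod.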

🤖 Generated with [Claude Code](https://claude.com/claude-code)

iris: add Kind-based integration tests for K8s scheduling correctness #3940
