Scale tests validate KAI scheduler performance and correctness at large cluster sizes (hundreds to thousands of nodes). They simulate realistic workloads to confirm that scheduling stays fast and correct as the cluster grows.
Scale tests verify:
- Scheduling performance: Time to schedule large numbers of pods across many nodes (see the sketch after this list)
- Topology-aware scheduling: Time to allocate for distributed jobs with topology constraints
- Resource allocation: Proper GPU allocation and queue quota enforcement at scale
- Reclaim behavior: Preemption and resource reclamation with background workloads
- Distributed job scheduling: Multi-pod job allocation across nodes
- System stability: Scheduler behavior under concurrent job creation and high load
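As referenced above, scheduling performance boils down to measuring time-to-schedule. Below is a minimal client-go sketch of that kind of check; the helper name, polling interval, and bound-pod criterion are assumptions, not the suite's actual code:

```go
package scale_test

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitAllScheduled polls until every pod in the namespace is bound to a
// node and returns how long that took. A hypothetical helper illustrating
// the time-to-schedule measurement; the real suite's helpers may differ.
func waitAllScheduled(ctx context.Context, c kubernetes.Interface, ns string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	err := wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, p := range pods.Items {
				if p.Spec.NodeName == "" { // not yet bound by the scheduler
					return false, nil
				}
			}
			return len(pods.Items) > 0, nil
		})
	return time.Since(start), err
}
```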
Tests use Ginkgo for test organization and execution. The test suite (`scale_suite_test.go`) defines test contexts and scenarios.
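A rough sketch of that layout, with hypothetical context and spec names rather than the ones in the actual suite:

```go
package scale_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// TestScale is the Ginkgo entry point, mirroring the role of
// scale_suite_test.go in the real suite.
func TestScale(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Scale Suite")
}

var _ = Describe("Big cluster", func() {
	Context("cluster fill", func() {
		It("schedules all pods within the time budget", func() {
			// Submit jobs across the KWOK nodes, then assert that a
			// helper like waitAllScheduled (sketched earlier) returns
			// within the budget.
		})
	})
})
```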
Tests use KWOK (Kubernetes WithOut Kubelet) to simulate large clusters without requiring real nodes:
- KWOK nodes: Virtual nodes created via the kwok-operator `NodePool` CRD. Each `NodePool` defines the desired node count and a node template (labels, capacity, allocatable resources). The operator reconciles the pool by creating/deleting KWOK-backed virtual nodes to match the spec. See `test/e2e/scale/base_kwok_managed_nodepool.yaml` for the base pool definition.
- Default scale: 500 nodes (configurable via the `NODE_COUNT` environment variable)
- GPU simulation: Fake GPU operator provides GPU resource reporting
- Pod lifecycle: KWOK stages simulate pod completion and status transitions
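To make the node simulation concrete, the sketch below builds the kind of virtual node object a pool reconciles into existence. The `kwok.x-k8s.io/node: fake` annotation is KWOK's standard marker for managed nodes; the pool label, capacities, and GPU count are illustrative assumptions, with the real template living in `base_kwok_managed_nodepool.yaml`:

```go
package scale_test

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fakeGPUNode builds a KWOK-managed virtual node resembling what the
// kwok-operator creates from a NodePool template. Labels, capacities,
// and the GPU count are placeholder values, not the pool's actual spec.
func fakeGPUNode(name string) *corev1.Node {
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("64"),
		corev1.ResourceMemory: resource.MustParse("256Gi"),
		corev1.ResourcePods:   resource.MustParse("110"),
		"nvidia.com/gpu":      resource.MustParse("8"), // reported via the fake GPU operator
	}
	return &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:        name,
			Annotations: map[string]string{"kwok.x-k8s.io/node": "fake"},
			Labels:      map[string]string{"type": "kwok"}, // hypothetical pool label
		},
		Status: corev1.NodeStatus{
			Capacity:    res,
			Allocatable: res,
		},
	}
}
```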
Tests are organized into contexts:
- Topology tests: Validate topology-aware scheduling with hierarchical constraints
- Big cluster tests: Performance tests with large node counts, covering:
  - Cluster fill scenarios (scheduler enabled/disabled during job creation)
  - Whole GPU allocation tests
  - Distributed job scheduling
  - Reclaim scenarios
Run from the repo root on a cluster with KAI scheduler already installed:

```bash
./hack/setup-scale-test-env.sh
```

This installs:
- KWOK + KWOK operator for simulated nodes
- Fake GPU operator for GPU resource reporting on KWOK nodes
- Prometheus + Grafana + Pyroscope for metrics and profiling
- ServiceMonitors for scheduler and binder metrics
- Tuned scheduler/binder config for scale (consolidation disabled, high binder concurrency)
Then run the tests:

```bash
ginkgo -v ./test/e2e/scale/
```

Node count defaults to 500; override it with the `NODE_COUNT` environment variable.
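Inside the suite, resolving the node count presumably looks something like the helper below; only the `NODE_COUNT` variable and the 500-node default are documented, the rest is illustrative:

```go
package scale_test

import (
	"os"
	"strconv"
)

// nodeCount is a hypothetical helper: it returns NODE_COUNT when set to a
// valid integer and falls back to the documented default of 500 otherwise.
func nodeCount() int {
	if v := os.Getenv("NODE_COUNT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return 500
}
```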
Scale tests should run from a runner pod inside the target cluster, not from an external machine. This minimizes API server latency during test execution and metric collection.
The target cluster should be a real cluster with real GPU nodes: KWOK only simulates node presence, while the scheduler, binder, and control plane run on actual hardware. Because these tests measure KAI scheduler performance in realistic scenarios rather than just exercising logic, they must run on real infrastructure.
Minimal cluster requirements:
- Dedicated control plane nodes (not shared with test workloads)
- KAI scheduler installed via Helm
- `kubectl` access from the runner pod (via ServiceAccount or kubeconfig)
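For the last requirement, a runner pod typically builds its Kubernetes client the standard client-go way: in-cluster ServiceAccount credentials first, an explicit kubeconfig as a fallback. A minimal sketch, not the suite's actual wiring:

```go
package scale_test

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// newClient prefers the pod's ServiceAccount token (in-cluster config)
// and falls back to an explicit kubeconfig path for out-of-cluster runs.
func newClient(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		cfg, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
		if err != nil {
			return nil, err
		}
	}
	return kubernetes.NewForConfig(cfg)
}
```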
Automated runs:
- Tests run on dedicated infrastructure every 24 hours
- Test results are stored in S3 and displayed on a public dashboard
- Dashboard URL: KAI Scheduler Scale Tests
The scale tests dashboard displays historical test results fetched from S3. The dashboard shows:
- Test execution times and performance metrics
- Pass/fail status for each test
- Detailed failure messages and logs
- Historical trends (30 days)
- Search and filter capabilities
Test results are stored in an S3 bucket (configured via repository secret) with the following structure:
```
Public/
  manifest.json        # Index of all test runs
  <run-id>/
    report.json        # Ginkgo JSON report for that run
```
The `manifest.json` file lists all available test runs:

```json
{
  "runs": [
    {
      "timestamp": "2024-01-15T10:00:00Z",
      "path": "Public/<run-id>/report.json"
    }
  ]
}
```

The dashboard is automatically deployed to GitHub Pages when changes are pushed to the `docs/scale-tests/` directory. The S3 bucket URL is configured via the `SCALE_TESTS_S3_URL` repository variable.
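For reference, consuming the manifest from that URL takes only a few lines of code. This sketch assumes the manifest lives at `Public/manifest.json` directly under the bucket URL, matching the layout shown above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// manifest mirrors the documented manifest.json structure.
type manifest struct {
	Runs []struct {
		Timestamp string `json:"timestamp"`
		Path      string `json:"path"`
	} `json:"runs"`
}

func main() {
	// Assumes the bucket URL is exported in the environment; the
	// repository only guarantees it as a repository variable.
	base := os.Getenv("SCALE_TESTS_S3_URL")
	resp, err := http.Get(base + "/Public/manifest.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var m manifest
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		log.Fatal(err)
	}
	for _, r := range m.Runs {
		fmt.Printf("%s  %s\n", r.Timestamp, r.Path)
	}
}
```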