A Kubernetes Operator that simulates a distributed job scheduling system using Custom Resource Definitions (CRDs) to manage compute nodes and jobs.
A construct like ComputeNode makes sense for a heterogeneous compute environment where nodes have varying capabilities and are spread across disjoint infrastructure providers. In the Kubernetes world, virtual-kubelet and similar constructs can be used to represent such executors as k8s nodes.
The ComputeNode abstraction further allows scheduling decisions that aren't restricted to the attributes of k8s Node resources, and lets the scheduler manage tasks across different kinds of compute resources.
This operator allows you to define and manage these compute resources, schedule jobs across them, and simulate task execution.
The ComputeJob abstraction allows you to define a job that can be executed across multiple compute nodes, taking advantage of their combined resources based on scheduling criteria like node selectors and resource requirements.
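The authoritative API samples live in `testdata/fixtures`; purely as a rough illustration (the group, version, and field names below are assumptions, not the operator's actual schema), a ComputeNode and a ComputeJob might look like:

```yaml
# Hypothetical sketch -- see testdata/fixtures for the operator's real API samples.
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: gpu-node-1
  labels:
    accelerator: gpu
spec:
  capacity: # assumed field: resources this node offers
    cpu: "8"
    memory: 32Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: training-job
spec:
  nodeSelector: # schedule only on ComputeNodes with matching labels
    accelerator: gpu
  resources: # assumed field: per-task resource requirements
    cpu: "4"
    memory: 16Gi
```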
This operator manages the lifecycle of compute nodes and distributed jobs within a Kubernetes cluster. It provides sophisticated resource-aware scheduling, state management, and observability for distributed computing workloads. It has been bootstrapped using kubebuilder v4 and follows standard Kubernetes operator patterns.
The test suite aims to serve as living documentation of the operator's capabilities, and the test fixtures (`testdata/fixtures`) serve as samples for the operator's API resources.
- Custom Resource Management: Three CRDs (ComputeNode, ComputeJob, ComputeTask) for complete lifecycle management
- Intelligent Scheduling: Resource-aware scheduling with pluggable filter and scoring algorithms
- Task Simulation: Configurable execution simulation for testing distributed scheduling logic
- Observability: Rich status tracking, events, and metrics integration
- Testing: Comprehensive test suite that serves as living documentation of features and behaviors
- Go
- Docker
- Make
- kind (can be installed on macOS and Linux via `make kind`)

Go tools like `kustomize`, `kind`, `kubectl`, `envtest`, and `golangci-lint` will be installed automatically via the `Makefile` when running e2e scenarios. This is scaffolded by `kubebuilder`. We HAVE NOT used `helm`, only `kustomize`, to deploy the operator.
The e2e suite has comprehensive scenarios that run both simulations and actual workloads on real k8s clusters. It is the best place to start for anyone who wants to test-drive the project and understand its function and capabilities. The e2e suite deploys a 3-node kind cluster, deploys the operator to it, and runs various scheduling scenarios.
```sh
make test-e2e
```

| Scenario | Description | Key Features Tested |
|---|---|---|
| Resource Contention | Tests job scheduling with resource constraints | Resource-aware scheduling, job queueing, resource allocation/deallocation |
| Parallel Distribution | Distributes parallel jobs across multiple ComputeNodes | Multi-node scheduling, parallelism, resource distribution |
| Node Selector | Tests node selection based on labels | Label-based scheduling, GPU vs CPU nodes, nodeSelector compliance |
| Ownership & Cleanup | Verifies cascading deletion of resources | Kubernetes ownership references, automatic resource cleanup |
| Node Auto-Discovery | Tests automatic ComputeNode creation from K8s nodes | Node discovery, resource syncing, real workload execution |
Resource Contention Scenario (sketched below):
- Creates a single ComputeNode with limited resources
- Schedules a resource-intensive job
- Attempts to schedule a competing job
- Verifies second job waits in Pending state
- Confirms scheduling after first job completes
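A minimal sketch of the contention setup, with assumed group/version and field names (the e2e fixtures define the real ones): two jobs whose requests each consume the single node's capacity, so the second must wait.

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: small-node
spec:
  capacity: # assumed field: only enough for one job at a time
    cpu: "2"
    memory: 4Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: job-a # scheduled first, consumes the node's capacity
spec:
  resources:
    cpu: "2"
    memory: 3Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: job-b # stays Pending until job-a completes
spec:
  resources:
    cpu: "2"
    memory: 3Gi
```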
Parallel Distribution Scenario (sketched below):
- Sets up 3 worker ComputeNodes
- Deploys a parallel job with `parallelism=3`
- Verifies tasks are distributed across all nodes
- Confirms resource allocation on each node
- Tests proper completion and cleanup
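A hedged sketch of such a parallel job (group/version and field names are assumptions; the fixtures are authoritative):

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeJob
metadata:
  name: parallel-job
spec:
  parallelism: 3 # one task per worker ComputeNode
  resources: # assumed field: per-task requirements
    cpu: "1"
    memory: 1Gi
```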
Node Selector Scenario (sketched below):
- Creates specialized nodes (GPU, CPU) with different labels
- Tests jobs with specific nodeSelector requirements
- Verifies jobs only schedule on matching nodes
- Confirms isolation between different node types
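A sketch of the label-based setup (labels and fields below are illustrative assumptions):

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: gpu-node
  labels:
    node-type: gpu # illustrative label
spec:
  capacity:
    cpu: "4"
    memory: 8Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: gpu-only-job
spec:
  nodeSelector:
    node-type: gpu # only ComputeNodes carrying this label are eligible
```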
Auto-Discovery Integration (sketched below):
- Tests automatic ComputeNode creation from Kubernetes nodes
- Verifies resource synchronization between K8s and ComputeNodes
- Executes real workloads (Pods) on auto-discovered nodes
- Validates end-to-end integration with actual Kubernetes infrastructure
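For intuition, an auto-discovered ComputeNode might mirror its backing Kubernetes node roughly as below; the shape is an assumption, and the discovery controller defines the actual mapping.

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: madhukar-assignment-test-e2e-worker # assumed: named after the backing kind node
spec:
  capacity: # assumed: synced from the K8s node's allocatable resources
    cpu: "4"
    memory: 8Gi
```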
Builds the controller manager Docker image:

```sh
make build

# Pushing the image isn't required for local testing with `kind`
make push

# For loading the image into kind outside of the e2e tests, which do it on their own
kind load docker-image $IMAGE --name $CLUSTER_NAME
```

Deploys a kind cluster with 3 worker nodes for testing. This can be used to fully demonstrate the capabilities of the operator:
```sh
make setup-test-e2e

# Export the kubeconfig to make it the default for kubectl and other k8s clients
kind get kubeconfig --name madhukar-assignment-test-e2e > kubeconfig
export KUBECONFIG=kubeconfig
```

The `deploy-auto-discovery` target sets up the controllers to automatically discover Kubernetes nodes, create ComputeNode resources based on them, and use real Pods as the default executor instead of simulation mode:
```sh
make deploy-auto-discovery
```

This runs the controller functional tests, which mock the k8s API server and other components using envtest:
```sh
make test
```

A Mermaid-capable Markdown reader is recommended for viewing the docs:
- System Architecture - High-level system overview, components, and dataflows
- API Design - Detailed API field analysis and rationale
- Controller Design - Controller patterns, state machines, and best practices
- ADR-001: Task Simulation Approach - Task execution simulation strategy
- ADR-002: ComputeTask Abstraction - Task representation and lifecycle
- ADR-003: Scheduler Architecture - Pluggable scheduling framework
- ADR-004: ComputeNode Design - Node resource modeling
- ADR-006: ComputeJob Design - Job specification and execution modes
- ADR-007: Node Discovery & In-Cluster Execution - Real workload execution patterns