A Kubernetes Operator that simulates a distributed job scheduling system using Custom Resource Definitions (CRDs) to manage compute nodes and jobs.
A construct like ComputeNode makes sense for a heterogeneous compute environment where nodes have varying capabilities and are spread across disjoint infrastructure providers. In the Kubernetes world, virtual-kubelet and similar constructs can be used to represent such executors as k8s nodes.
The ComputeNode abstraction further allows scheduling decisions that aren't restricted to the attributes of k8s Node resources, and lets the scheduler manage tasks across different kinds of compute resources.
This operator allows you to define and manage these compute resources, schedule jobs across them, and simulate task execution.
The ComputeJob abstraction allows you to define a job that can be executed across multiple compute nodes, taking advantage of their combined resources based on scheduling criteria like node selectors and resource requirements.
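The authoritative API samples live in `testdata/fixtures`; purely as a rough illustration (the group, version, and field names below are assumptions, not the operator's actual schema), a ComputeNode and a ComputeJob might look like:

```yaml
# Hypothetical sketch -- see testdata/fixtures for the operator's real API samples.
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: gpu-node-1
  labels:
    accelerator: gpu
spec:
  capacity: # assumed field: resources this node offers
    cpu: "8"
    memory: 32Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: training-job
spec:
  nodeSelector: # schedule only on ComputeNodes with matching labels
    accelerator: gpu
  resources: # assumed field: per-task resource requirements
    cpu: "4"
    memory: 16Gi
```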
This operator manages the lifecycle of compute nodes and distributed jobs within a Kubernetes cluster. It provides sophisticated resource-aware scheduling, state management, and observability for distributed computing workloads. It has been bootstrapped using kubebuilder v4 and follows standard Kubernetes operator patterns.
The test suite aims to serve as living documentation of the operator's capabilities, and the test fixtures (`testdata/fixtures`) serve as samples for the operator's API resources.
- Custom Resource Management: Three CRDs (ComputeNode, ComputeJob, ComputeTask) for complete lifecycle management
- Intelligent Scheduling: Resource-aware scheduling with pluggable filter and scoring algorithms
- Task Simulation: Configurable execution simulation for testing distributed scheduling logic
- Observability: Rich status tracking, events, and metrics integration
- Testing: Comprehensive test suite that serves as living documentation of features and behaviors
- Go
- Docker
- Make
- kind (can be installed on macOS and Linux via `make kind`)

Go tools like `kustomize`, `kind`, `kubectl`, `envtest`, and `golangci-lint` will be installed automatically via the `Makefile` when running e2e scenarios. This is scaffolded by `kubebuilder`. We HAVE NOT used `helm`, only `kustomize`, to deploy the operator.
The e2e suite has comprehensive scenarios that run both simulations and actual workloads on real k8s clusters. It is the best place to start for anyone who wants to test-drive the project and understand its function and capabilities. The e2e suite deploys a 3-node kind cluster, deploys the operator to it, and runs various scheduling scenarios.
```sh
make test-e2e
```

| Scenario | Description | Key Features Tested |
|---|---|---|
| Resource Contention | Tests job scheduling with resource constraints | Resource-aware scheduling, job queueing, resource allocation/deallocation |
| Parallel Distribution | Distributes parallel jobs across multiple ComputeNodes | Multi-node scheduling, parallelism, resource distribution |
| Node Selector | Tests node selection based on labels | Label-based scheduling, GPU vs CPU nodes, nodeSelector compliance |
| Ownership & Cleanup | Verifies cascading deletion of resources | Kubernetes ownership references, automatic resource cleanup |
| Node Auto-Discovery | Tests automatic ComputeNode creation from K8s nodes | Node discovery, resource syncing, real workload execution |
Resource Contention Scenario (sketched below):
- Creates a single ComputeNode with limited resources
- Schedules a resource-intensive job
- Attempts to schedule a competing job
- Verifies second job waits in Pending state
- Confirms scheduling after first job completes
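A minimal sketch of the contention setup, with assumed group/version and field names (the e2e fixtures define the real ones): two jobs whose requests each consume the single node's capacity, so the second must wait.

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: small-node
spec:
  capacity: # assumed field: only enough for one job at a time
    cpu: "2"
    memory: 4Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: job-a # scheduled first, consumes the node's capacity
spec:
  resources:
    cpu: "2"
    memory: 3Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: job-b # stays Pending until job-a completes
spec:
  resources:
    cpu: "2"
    memory: 3Gi
```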
Parallel Distribution Scenario (sketched below):
- Sets up 3 worker ComputeNodes
- Deploys a parallel job with `parallelism=3`
- Verifies tasks are distributed across all nodes
- Confirms resource allocation on each node
- Tests proper completion and cleanup
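A hedged sketch of such a parallel job (group/version and field names are assumptions; the fixtures are authoritative):

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeJob
metadata:
  name: parallel-job
spec:
  parallelism: 3 # one task per worker ComputeNode
  resources: # assumed field: per-task requirements
    cpu: "1"
    memory: 1Gi
```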
Node Selector Scenario (sketched below):
- Creates specialized nodes (GPU, CPU) with different labels
- Tests jobs with specific nodeSelector requirements
- Verifies jobs only schedule on matching nodes
- Confirms isolation between different node types
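A sketch of the label-based setup (labels and fields below are illustrative assumptions):

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: gpu-node
  labels:
    node-type: gpu # illustrative label
spec:
  capacity:
    cpu: "4"
    memory: 8Gi
---
apiVersion: compute.example.com/v1alpha1
kind: ComputeJob
metadata:
  name: gpu-only-job
spec:
  nodeSelector:
    node-type: gpu # only ComputeNodes carrying this label are eligible
```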
Auto-Discovery Integration (sketched below):
- Tests automatic ComputeNode creation from Kubernetes nodes
- Verifies resource synchronization between K8s and ComputeNodes
- Executes real workloads (Pods) on auto-discovered nodes
- Validates end-to-end integration with actual Kubernetes infrastructure
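For intuition, an auto-discovered ComputeNode might mirror its backing Kubernetes node roughly as below; the shape is an assumption, and the discovery controller defines the actual mapping.

```yaml
apiVersion: compute.example.com/v1alpha1 # assumed group/version
kind: ComputeNode
metadata:
  name: madhukar-assignment-test-e2e-worker # assumed: named after the backing kind node
spec:
  capacity: # assumed: synced from the K8s node's allocatable resources
    cpu: "4"
    memory: 8Gi
```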
Builds the controller manager Docker image:

```sh
make build

# Pushing the image isn't required for local testing with `kind`
make push

# For loading the image into kind outside of the e2e tests, which do it on their own
kind load docker-image $IMAGE --name $CLUSTER_NAME
```

Deploys a kind cluster with 3 worker nodes for testing. This can be used to fully demonstrate the capabilities of the operator:
```sh
make setup-test-e2e

# Export the kubeconfig to make it the default for kubectl and other k8s clients
kind get kubeconfig --name madhukar-assignment-test-e2e > kubeconfig
export KUBECONFIG=kubeconfig
```

The `deploy-auto-discovery` target sets up the controllers to automatically discover Kubernetes nodes, create ComputeNode resources based on them, and use real Pods as the default executor instead of simulation mode:
```sh
make deploy-auto-discovery
```

This runs the controller functional tests, which mock the k8s API server and other components using envtest:
```sh
make test
```

A Mermaid-capable Markdown reader is recommended for viewing the docs:
- System Architecture - High-level system overview, components, and dataflows
- API Design - Detailed API field analysis and rationale
- Controller Design - Controller patterns, state machines, and best practices
- ADR-001: Task Simulation Approach - Task execution simulation strategy
- ADR-002: ComputeTask Abstraction - Task representation and lifecycle
- ADR-003: Scheduler Architecture - Pluggable scheduling framework
- ADR-004: ComputeNode Design - Node resource modeling
- ADR-006: ComputeJob Design - Job specification and execution modes
- ADR-007: Node Discovery & In-Cluster Execution - Real workload execution patterns