madhukar93/distributed-scheduling-simulation

Distributed Job Scheduler Operator

A Kubernetes Operator that simulates a distributed job scheduling system using Custom Resource Definitions (CRDs) to manage compute nodes and jobs.

A construct like ComputeNode makes sense for a heterogeneous compute environment where nodes have varying capabilities and are spread across disjoint infrastructure providers. In the Kubernetes world, virtual-kubelet and similar constructs can be used to represent such executors as k8s nodes.

The ComputeNode abstraction also lets scheduling decisions go beyond the attributes of k8s node resources, so the scheduler can manage tasks across different kinds of compute resources.

This operator allows you to define and manage these compute resources, schedule jobs across them, and simulate task execution. The ComputeJob abstraction allows you to define a job that can be executed across multiple compute nodes, taking advantage of their combined resources based on scheduling criteria like node selectors and resource requirements.

Overview

This operator manages the lifecycle of compute nodes and distributed jobs within a Kubernetes cluster. It provides sophisticated resource-aware scheduling, state management, and observability for distributed computing workloads. It has been bootstrapped using kubebuilder v4 and follows standard Kubernetes operator patterns.

The test suite aims to serve as living documentation of the operator's capabilities, and the test fixtures (testdata/fixtures) serve as samples for the operator's API resources.

Key Features

  • Custom Resource Management: Three CRDs (ComputeNode, ComputeJob, ComputeTask) for complete lifecycle management
  • Intelligent Scheduling: Resource-aware scheduling with pluggable filter and scoring algorithms
  • Task Simulation: Configurable execution simulation for testing distributed scheduling logic
  • Observability: Rich status tracking, events, and metrics integration
  • Testing: Comprehensive test suite that serves as living documentation of features and behaviors
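
To make the "pluggable filter and scoring" idea concrete, here is a minimal, self-contained sketch of a filter-then-score pipeline. It is not the operator's actual code: the types `Node` and `Task`, the functions `fitsResources`, `mostFreeCPU`, and `pick`, and the units (millicores, MiB) are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Node and Task are hypothetical, simplified stand-ins for ComputeNode and
// ComputeTask; the real CRD fields may differ.
type Node struct {
	Name      string
	FreeCPU   int // millicores (assumed unit)
	FreeMemMi int // MiB (assumed unit)
}

type Task struct {
	CPU   int
	MemMi int
}

// Filter reports whether a node is eligible for a task.
type Filter func(n Node, t Task) bool

// Score ranks an eligible node; higher is better.
type Score func(n Node, t Task) int

// fitsResources rejects nodes without enough free CPU or memory.
func fitsResources(n Node, t Task) bool {
	return n.FreeCPU >= t.CPU && n.FreeMemMi >= t.MemMi
}

// mostFreeCPU prefers the node with the most CPU left after placement.
func mostFreeCPU(n Node, t Task) int {
	return n.FreeCPU - t.CPU
}

// pick applies every filter, sums every score, and returns the name of the
// best eligible node. The boolean is false when no node passes the filters.
func pick(nodes []Node, t Task, filters []Filter, scores []Score) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		eligible := true
		for _, f := range filters {
			if !f(n, t) {
				eligible = false
				break
			}
		}
		if !eligible {
			continue
		}
		total := 0
		for _, s := range scores {
			total += s(n, t)
		}
		if !found || total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500, FreeMemMi: 1024},
		{Name: "node-b", FreeCPU: 2000, FreeMemMi: 4096},
		{Name: "node-c", FreeCPU: 4000, FreeMemMi: 512},
	}
	name, ok := pick(nodes, Task{CPU: 1000, MemMi: 1024},
		[]Filter{fitsResources}, []Score{mostFreeCPU})
	fmt.Println(name, ok) // node-b true
}
```

Keeping filters and scores as plain function values means a new policy can be added by appending to a slice, which is one common way to get pluggability without changing the selection loop.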

Pre-requisites

  • Go
  • Docker
  • Make
  • kind (can be installed on macOS and Linux via make kind)

Go tools such as kustomize, kind, kubectl, envtest, and golangci-lint are installed automatically by the Makefile when running the e2e scenarios. This setup is scaffolded by kubebuilder. The operator is deployed with kustomize only; Helm is not used.

Demo

The e2e suite has comprehensive scenarios that run both simulations and actual workloads on real k8s clusters. It is the best place to start for anyone who wants to test drive the project and understand its function and capabilities. The e2e suite deploys a 3-node kind cluster, deploys the operator to it, and runs various scheduling scenarios.

make test-e2e

E2E Test Scenarios

| Scenario | Description | Key Features Tested |
|---|---|---|
| Resource Contention | Tests job scheduling with resource constraints | Resource-aware scheduling, job queueing, resource allocation/deallocation |
| Parallel Distribution | Distributes parallel jobs across multiple ComputeNodes | Multi-node scheduling, parallelism, resource distribution |
| Node Selector | Tests node selection based on labels | Label-based scheduling, GPU vs CPU nodes, nodeSelector compliance |
| Ownership & Cleanup | Verifies cascading deletion of resources | Kubernetes ownership references, automatic resource cleanup |
| Node Auto-Discovery | Tests automatic ComputeNode creation from K8s nodes | Node discovery, resource syncing, real workload execution |

Detailed Test Coverage

Resource Contention Scenario:

  • Creates a single ComputeNode with limited resources
  • Schedules a resource-intensive job
  • Attempts to schedule a competing job
  • Verifies second job waits in Pending state
  • Confirms scheduling after first job completes
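
The contention steps above can be illustrated with a small in-memory simulation. This is only a sketch of the behavior the scenario checks, not the operator's controller logic; the `node` and `job` types, the phase strings, and the CPU-only accounting are assumptions made for brevity.

```go
package main

import "fmt"

// node tracks the free CPU of a single hypothetical ComputeNode.
type node struct{ freeCPU int }

// job is a hypothetical, CPU-only job with a simplified phase string.
type job struct {
	name  string
	cpu   int
	phase string // "Pending", "Running", or "Succeeded"
}

// schedule starts every pending job that fits, in submission order.
// Jobs that do not fit stay Pending until capacity is released.
func schedule(n *node, jobs []*job) {
	for _, j := range jobs {
		if j.phase == "Pending" && j.cpu <= n.freeCPU {
			n.freeCPU -= j.cpu
			j.phase = "Running"
		}
	}
}

// complete marks a running job as finished and returns its CPU to the node.
func complete(n *node, j *job) {
	if j.phase != "Running" {
		return
	}
	n.freeCPU += j.cpu
	j.phase = "Succeeded"
}

func main() {
	n := &node{freeCPU: 1000}
	first := &job{name: "first", cpu: 800, phase: "Pending"}
	second := &job{name: "second", cpu: 500, phase: "Pending"}
	jobs := []*job{first, second}

	schedule(n, jobs)
	fmt.Println(first.phase, second.phase) // Running Pending

	complete(n, first)
	schedule(n, jobs)
	fmt.Println(first.phase, second.phase) // Succeeded Running
}
```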

Parallel Distribution Scenario:

  • Sets up 3 worker ComputeNodes
  • Deploys a parallel job with parallelism=3
  • Verifies tasks are distributed across all nodes
  • Confirms resource allocation on each node
  • Tests proper completion and cleanup
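
One simple way to get the even distribution this scenario verifies is least-loaded placement. The sketch below assumes that policy purely for illustration; the operator's real scoring may differ, and the function name `spread` and the node names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// spread assigns `parallelism` tasks one at a time, always to the node that
// currently holds the fewest tasks. Ties are broken by node name so the
// result is deterministic. It returns the task count per node.
func spread(nodes []string, parallelism int) map[string]int {
	counts := make(map[string]int, len(nodes))
	if len(nodes) == 0 {
		return counts
	}
	// Copy before sorting so the caller's slice is not reordered.
	sorted := append([]string(nil), nodes...)
	sort.Strings(sorted)
	for _, n := range sorted {
		counts[n] = 0
	}
	for i := 0; i < parallelism; i++ {
		target := sorted[0]
		for _, n := range sorted[1:] {
			if counts[n] < counts[target] {
				target = n
			}
		}
		counts[target]++
	}
	return counts
}

func main() {
	counts := spread([]string{"worker-1", "worker-2", "worker-3"}, 3)
	fmt.Println(counts["worker-1"], counts["worker-2"], counts["worker-3"]) // 1 1 1
}
```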

Node Selector Scenario:

  • Creates specialized nodes (GPU, CPU) with different labels
  • Tests jobs with specific nodeSelector requirements
  • Verifies jobs only schedule on matching nodes
  • Confirms isolation between different node types
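
Label-based selection of this kind usually follows the same rule as a Kubernetes nodeSelector: every key/value pair in the selector must be present on the node with exactly the same value. The sketch below shows that rule in isolation; the function name `matches` and the label keys are hypothetical, and the operator's own matching code may be structured differently.

```go
package main

import "fmt"

// matches reports whether nodeLabels satisfies selector: every selector key
// must exist on the node with an identical value. An empty selector matches
// every node.
func matches(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	gpuNode := map[string]string{"accelerator": "gpu", "zone": "a"}
	cpuNode := map[string]string{"accelerator": "none", "zone": "a"}
	selector := map[string]string{"accelerator": "gpu"}

	fmt.Println(matches(gpuNode, selector)) // true
	fmt.Println(matches(cpuNode, selector)) // false
}
```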

Auto-Discovery Integration:

  • Tests automatic ComputeNode creation from Kubernetes nodes
  • Verifies resource synchronization between K8s and ComputeNodes
  • Executes real workloads (Pods) on auto-discovered nodes
  • Validates end-to-end integration with actual Kubernetes infrastructure
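
Conceptually, auto-discovery is a reconciliation between the set of Kubernetes nodes and the set of ComputeNodes. The sketch below shows that diff step on plain maps of node name to CPU capacity. It is an assumption-laden illustration: the function name `diff` is hypothetical, only CPU is compared, and whether the operator removes ComputeNodes for vanished K8s nodes is assumed here rather than confirmed by the README.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares observed K8s node capacities with existing ComputeNode
// capacities (both keyed by node name, values in millicores) and returns the
// names to create, update, and remove. Results are sorted for determinism.
func diff(k8sNodes, computeNodes map[string]int) (create, update, remove []string) {
	for name, cpu := range k8sNodes {
		existing, ok := computeNodes[name]
		switch {
		case !ok:
			create = append(create, name) // no ComputeNode for this K8s node yet
		case existing != cpu:
			update = append(update, name) // capacity drifted, resync it
		}
	}
	for name := range computeNodes {
		if _, ok := k8sNodes[name]; !ok {
			remove = append(remove, name) // assumed behavior: drop stale entries
		}
	}
	sort.Strings(create)
	sort.Strings(update)
	sort.Strings(remove)
	return create, update, remove
}

func main() {
	k8s := map[string]int{"worker-1": 4000, "worker-2": 2000}
	existing := map[string]int{"worker-2": 1000, "worker-3": 500}
	create, update, remove := diff(k8s, existing)
	fmt.Println(create, update, remove) // [worker-1] [worker-2] [worker-3]
}
```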

Developing

Build

Builds the controller manager Docker image.

make build
# Pushing the image isn't required for local testing with `kind`
make push
# for loading image into kind outside of e2e tests which do it on their own
kind load docker-image $IMAGE --name $CLUSTER_NAME

Deploy kind cluster

Deploys a kind cluster with 3 worker nodes for testing. This can be used to fully demonstrate the capabilities of the operator.

make setup-test-e2e
# export the kubeconfig to make it the default for kubectl and other k8s clients
kind get kubeconfig --name madhukar-assignment-test-e2e > kubeconfig
export KUBECONFIG=kubeconfig

Deploy operator

The deploy-auto-discovery make target sets up the controllers to automatically discover Kubernetes nodes and create ComputeNode resources from them, and to use real pods as the default executor instead of simulation mode.

make deploy-auto-discovery

Testing: Unit tests and Controller functional tests

This runs the unit tests and the controller functional tests. The functional tests use envtest, which starts a local control plane (API server and etcd) instead of requiring a full cluster.

make test

Documentation

A Mermaid-capable Markdown reader is recommended for viewing the docs.

Core Documentation

Architecture Decision Records (ADRs)

References
