bug/feat: BACKEND scheduler_type enum hardcoded to 'kai' — no fallback to K8s default-scheduler in 6.2.10 #936

@narenandu

Describe the bug.

The OSMO BACKEND config field scheduler_settings.scheduler_type is a Pydantic Enum with a single permitted value: 'kai'. There is no way to use Kubernetes' default scheduler, no way to use Volcano, and no way to disable PodGroup creation entirely. Submitting 'default' as the value returns:

HTTP/2 422 Unprocessable Entity

{
  "detail": [{
    "loc": ["body", "configs", "scheduler_settings", "scheduler_type"],
    "msg": "value is not a valid enumeration member; permitted: 'kai'",
    "type": "type_error.enum",
    "ctx": {"enum_values": ["kai"]}
  }]
}

This effectively makes kai-scheduler a hard runtime dependency for every OSMO deployment. Operators on clusters that already have a different batch scheduler (or no batch scheduler at all) must install kai-scheduler purely to satisfy OSMO's PodGroup CR creation step, even when their workflows don't need gang scheduling.
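
The rejection follows directly from validating against a single-member enum. Below is a minimal sketch of that behavior using only the standard library's enum module. The names SchedulerType and parse_scheduler_type are assumptions for illustration; OSMO's actual Pydantic model may be defined differently.

```python
from enum import Enum


class SchedulerType(Enum):
    """Hypothetical stand-in for OSMO's scheduler_type enum."""

    KAI = "kai"


def parse_scheduler_type(value: str) -> SchedulerType:
    """Look up an enum member by value, listing permitted values on failure."""
    try:
        return SchedulerType(value)
    except ValueError:
        # Build the permitted list from the enum itself, so it tracks new members.
        permitted = ", ".join(repr(member.value) for member in SchedulerType)
        raise ValueError(
            f"value is not a valid enumeration member; permitted: {permitted}"
        ) from None
```

With only KAI defined, parse_scheduler_type("default") raises, which mirrors the 422 above. Any new member added to the enum would be accepted without further changes to the lookup.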

Affected versions. 6.2.10. (Likely earlier — present since BACKEND config introduction.)


Reproduction

Backend default config out of the box:

$ osmo config show BACKEND | jq '.backends[0].scheduler_settings'
{
  "scheduler_type": "kai",
  "scheduler_name": "kai-scheduler",
  "scheduler_timeout": 30
}

Attempt to switch to K8s default-scheduler:

$ curl -X POST "$GATEWAY/api/configs/backend/default" \
    -H "Authorization: Bearer $JWT" \
    -d '{
      "configs": {
        "scheduler_settings": {
          "scheduler_type": "default",
          "scheduler_name": "default-scheduler",
          "scheduler_timeout": 30
        }
      }
    }'
{"detail":[{"loc":["body","configs","scheduler_settings","scheduler_type"],
            "msg":"value is not a valid enumeration member; permitted: 'kai'",
            "type":"type_error.enum","ctx":{"enum_values":["kai"]}}]}

Without kai-scheduler installed, every workflow submission fails:

2026-XX-XX backend-worker [ERROR] backend_worker: wf_uuid=...
  Fatal exception of type ResourceNotFoundError: No matches found for
  {'api_version': 'scheduling.run.ai/v2alpha2', 'kind': 'PodGroup'}

Workflow status: FAILED_SERVER_ERROR. The error message is opaque from the user's perspective: nothing in it indicates that kai-scheduler needs to be installed.
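
One way to make this failure actionable would be to check for the PodGroup resource before creating it and raise an error that names the missing dependency. This is a rough sketch only. REQUIRED_RESOURCE, MissingSchedulerError, and check_scheduler_installed are hypothetical names, and in practice the set of available resources would come from the cluster's API discovery, not from a literal set.

```python
from typing import Set, Tuple

# (api_version, kind) the backend worker needs, taken from the log above.
REQUIRED_RESOURCE: Tuple[str, str] = ("scheduling.run.ai/v2alpha2", "PodGroup")


class MissingSchedulerError(RuntimeError):
    """Hypothetical error type carrying an operator-facing message."""


def check_scheduler_installed(available: Set[Tuple[str, str]]) -> None:
    """Raise a descriptive error if the cluster does not serve PodGroup."""
    if REQUIRED_RESOURCE not in available:
        api_version, kind = REQUIRED_RESOURCE
        raise MissingSchedulerError(
            f"{kind} ({api_version}) is not available on this cluster; "
            "install kai-scheduler before submitting workflows."
        )
```

Surfacing this message in the workflow status, instead of the raw ResourceNotFoundError, would tell the operator what to install.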


Why this matters

  1. Forces a heavyweight dependency. kai-scheduler installs 7 Deployments, 6 cluster-scoped CRDs (scheduling.run.ai/{v2,v2alpha2}/{PodGroup, Queue, BindRequest}, kai.scheduler/v1alpha1/{Topology, SchedulingShard, Config}), and an admission webhook. For a CKS / EKS / GKE deployment that already runs Volcano or relies on the K8s default scheduler, this is a non-trivial second scheduler with operational overhead and potential conflicts.

  2. Conflicts with existing scheduler stacks. On clusters where the platform team has standardized on a different scheduler (default, Volcano, Kueue), adding kai-scheduler creates ambiguity over which scheduler binds which pods, and may break invariants those platforms enforce (e.g., per-node GPU verification jobs).

  3. No graceful degradation. Even simple echo workflows that don't require gang scheduling cannot run without kai-scheduler installed. There is no scheduler_type: 'none' to bypass PodGroup creation entirely.


Suggested fix

Three concrete options, ordered by impact:

Option A — Add 'default' to the enum. Map scheduler_type: 'default' → no PodGroup CR; set pod.spec.schedulerName = scheduler_name (defaulting to default-scheduler); skip the scheduling.run.ai/v2alpha2/PodGroup resource creation in the backend-worker's CreateGroup job. Single-pod workflows would work out of the box.

Option B — Add 'volcano' to the enum. Map scheduler_type: 'volcano' → use Volcano's scheduling.volcano.sh/v1beta1/PodGroup. Volcano is widely deployed in HPC/ML platforms and has feature parity with kai's basic gang scheduling.

Option C — Add 'none'. Bypass any PodGroup CR creation and rely on the K8s default scheduler. Loses gang scheduling but unblocks every other use case.

The enum location is in OSMO's Pydantic models — likely src/utils/connectors/postgres.py (where LogConfig/DataConfig live per #749) or the related BackendConfig model. The actual CreateGroup / CleanupGroup job logic is in src/utils/job/backend_jobs.py (the file that emits the ResourceNotFoundError above).
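
The three options could share one dispatch point. The sketch below is an assumption-laden illustration, not OSMO's code: SchedulerType, PODGROUP_API_VERSION, and plan_group are hypothetical names, and only the two PodGroup API versions are taken from this report.

```python
from enum import Enum
from typing import Any, Dict, Optional


class SchedulerType(str, Enum):
    KAI = "kai"          # existing value
    DEFAULT = "default"  # Option A (proposed)
    VOLCANO = "volcano"  # Option B (proposed)
    NONE = "none"        # Option C (proposed)


# PodGroup apiVersion per scheduler type; None means "create no PodGroup".
PODGROUP_API_VERSION: Dict[SchedulerType, Optional[str]] = {
    SchedulerType.KAI: "scheduling.run.ai/v2alpha2",
    SchedulerType.VOLCANO: "scheduling.volcano.sh/v1beta1",
    SchedulerType.DEFAULT: None,
    SchedulerType.NONE: None,
}


def plan_group(
    scheduler_type: str, scheduler_name: str = "default-scheduler"
) -> Dict[str, Any]:
    """Decide which PodGroup (if any) to create and which pod spec fields to set."""
    kind = SchedulerType(scheduler_type)  # raises ValueError on unknown values
    api_version = PODGROUP_API_VERSION[kind]
    podgroup = (
        None
        if api_version is None
        else {"apiVersion": api_version, "kind": "PodGroup"}
    )
    # For 'none', leave schedulerName unset so Kubernetes applies its default.
    pod_spec = {} if kind is SchedulerType.NONE else {"schedulerName": scheduler_name}
    return {"podgroup": podgroup, "pod_spec": pod_spec}
```

For example, plan_group("default") yields no PodGroup and a pod spec with schedulerName set to default-scheduler, while plan_group("kai", "kai-scheduler") keeps the current behavior. The CreateGroup job would then skip resource creation whenever podgroup is None.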


Documentation gap (separate, smaller fix)

If hardcoded-to-kai is the intended design, deployment_guide/appendix/deploy_minimal.rst and the chart-level prerequisites sections should explicitly state:

kai-scheduler is a required runtime dependency. Install it from oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler before submitting any workflow.

Today this is implicit: the backend-operator chart's priorityClasses.enabled flag hints that kai is expected, but there is no top-level callout that submitting a workflow without it produces a FAILED_SERVER_ERROR.


Environment

  • OSMO 6.2.10 (charts 1.2.1 / app 6.2.10.fa46f7f09)
  • Kubernetes v1.35.4
  • Reproduced on a fresh install before kai-scheduler was added.
