bug/feat: BACKEND scheduler_type enum hardcoded to 'kai' — no fallback to K8s default-scheduler in 6.2.10

### Describe the bug.

The OSMO `BACKEND` config field `scheduler_settings.scheduler_type` is a Pydantic `Enum` with **a single permitted value: `'kai'`**. There is no way to use Kubernetes' default scheduler, no way to use Volcano, and no way to disable PodGroup creation entirely. Submitting `default` returns:

```http
HTTP/2 422 Unprocessable Entity

{
  "detail": [{
    "loc": ["body", "configs", "scheduler_settings", "scheduler_type"],
    "msg": "value is not a valid enumeration member; permitted: 'kai'",
    "type": "type_error.enum",
    "ctx": {"enum_values": ["kai"]}
  }]
}
```

This effectively makes `kai-scheduler` a hard runtime dependency for every OSMO deployment. Operators on clusters that already have a different batch scheduler (or no batch scheduler at all) must install kai-scheduler purely to satisfy OSMO's PodGroup CR creation step, even when their workflows don't need gang scheduling.
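
For readers unfamiliar with the failure mode, it can be approximated with a plain standard-library enum. This is a minimal sketch, not OSMO's actual Pydantic model: `SchedulerType` and `parse_scheduler_type` are hypothetical names, and the error text only imitates the message in the response above.

```python
from enum import Enum


class SchedulerType(str, Enum):
    # Hypothetical stand-in for OSMO's enum: a single permitted member.
    KAI = "kai"


def parse_scheduler_type(raw: str) -> SchedulerType:
    """Return the enum member for `raw`, or raise ValueError listing permitted values."""
    try:
        return SchedulerType(raw)
    except ValueError:
        permitted = ", ".join(repr(member.value) for member in SchedulerType)
        raise ValueError(
            f"{raw!r} is not a valid scheduler_type; permitted: {permitted}"
        ) from None
```

With only one member defined, every value other than `'kai'` is rejected at validation time, before any backend logic runs.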

**Affected versions.** `6.2.10`. (Likely earlier — present since `BACKEND` config introduction.)

---

### Reproduction

Backend `default` config out of the box:

```bash
$ osmo config show BACKEND | jq '.backends[0].scheduler_settings'
{
  "scheduler_type": "kai",
  "scheduler_name": "kai-scheduler",
  "scheduler_timeout": 30
}
```

Attempt to switch to K8s default-scheduler:

```bash
$ curl -X POST "$GATEWAY/api/configs/backend/default" \
    -H "Authorization: Bearer $JWT" \
    -H "Content-Type: application/json" \
    -d '{
      "configs": {
        "scheduler_settings": {
          "scheduler_type": "default",
          "scheduler_name": "default-scheduler",
          "scheduler_timeout": 30
        }
      }
    }'
{"detail":[{"loc":["body","configs","scheduler_settings","scheduler_type"],
            "msg":"value is not a valid enumeration member; permitted: 'kai'",
            "type":"type_error.enum","ctx":{"enum_values":["kai"]}}]}
```

Without kai-scheduler installed, every workflow submission fails:

```
2026-XX-XX backend-worker [ERROR] backend_worker: wf_uuid=...
  Fatal exception of type ResourceNotFoundError: No matches found for
  {'api_version': 'scheduling.run.ai/v2alpha2', 'kind': 'PodGroup'}
```

Workflow status: `FAILED_SERVER_ERROR`. The error message is opaque from the user's perspective: nothing in it indicates that kai-scheduler must be installed.
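
One way the error could be made actionable is to translate the missing-CRD failure into a hint before it reaches the user. The sketch below is illustrative only: `ResourceNotFoundError` is a local stand-in for the exception in the log above, and `explain_missing_podgroup` is a hypothetical helper, not an existing OSMO function.

```python
class ResourceNotFoundError(Exception):
    """Local stand-in for the exception type seen in the backend-worker log."""


def explain_missing_podgroup(exc: Exception) -> str:
    """Return a user-facing message, adding a hint when the PodGroup CRD is missing."""
    text = str(exc)
    # Assumption: the missing-CRD case can be recognised from the message text.
    if "PodGroup" in text and "scheduling.run.ai" in text:
        return (
            "The scheduling.run.ai PodGroup CRD was not found on this cluster. "
            "OSMO currently requires kai-scheduler; install it before submitting "
            f"workflows. Original error: {text}"
        )
    # Any other error is passed through unchanged.
    return text
```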

---

### Why this matters

1. **Forces a heavyweight dependency.** kai-scheduler installs 7 Deployments, 6 cluster-scoped CRDs (`scheduling.run.ai/{v2,v2alpha2}/{PodGroup, Queue, BindRequest}`, `kai.scheduler/v1alpha1/{Topology, SchedulingShard, Config}`), and an admission webhook. For a CKS / EKS / GKE deployment that already runs Volcano or relies on the K8s default scheduler, this is a non-trivial second scheduler with operational overhead and potential conflicts.

2. **Conflicts with existing scheduler stacks.** On clusters where the platform team has standardized on a different scheduler (default, Volcano, Kueue), adding kai-scheduler creates ambiguity over which scheduler binds which pods, and may break invariants those platforms enforce (e.g., per-node GPU verification jobs).

3. **No graceful degradation.** Even simple `echo` workflows that don't require gang scheduling cannot run without kai-scheduler installed. There is no `scheduler_type: 'none'` to bypass PodGroup creation entirely.

---

### Suggested fix

Three concrete options, ordered by impact:

**Option A — Add `'default'` to the enum.** Map `scheduler_type: 'default'` → no PodGroup CR; set `pod.spec.schedulerName = scheduler_name` (defaulting to `default-scheduler`); skip the `scheduling.run.ai/v2alpha2/PodGroup` resource creation in the backend-worker's `CreateGroup` job. Single-pod workflows would work out of the box.

**Option B — Add `'volcano'` to the enum.** Map `scheduler_type: 'volcano'` → use Volcano's `scheduling.volcano.sh/v1beta1/PodGroup`. Volcano is widely deployed in HPC/ML platforms and has feature parity with kai's basic gang scheduling.

**Option C — Add `'none'`.** Bypass any PodGroup CR creation and rely on the K8s default scheduler. Loses gang scheduling but unblocks every other use case.

The enum location is in OSMO's Pydantic models — likely [`src/utils/connectors/postgres.py`](https://github.com/NVIDIA/OSMO/blob/6.2.10/src/utils/connectors/postgres.py) (where `LogConfig`/`DataConfig` live per #749) or the related `BackendConfig` model. The actual `CreateGroup` / `CleanupGroup` job logic is in [`src/utils/job/backend_jobs.py`](https://github.com/NVIDIA/OSMO/blob/6.2.10/src/utils/job/backend_jobs.py) (the file that emits the `ResourceNotFoundError` above).
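
To make the three options concrete, here is a rough sketch of how the extended enum could drive PodGroup creation. It assumes nothing about OSMO's real code structure: `SchedulerType`, `SchedulingPlan`, and `plan_scheduling` are hypothetical names, and only the two PodGroup API versions are taken from this issue.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SchedulerType(str, Enum):
    KAI = "kai"          # current behaviour
    DEFAULT = "default"  # Option A (proposed)
    VOLCANO = "volcano"  # Option B (proposed)
    NONE = "none"        # Option C (proposed)


# PodGroup apiVersion per scheduler type; None means "create no PodGroup CR".
PODGROUP_API_VERSION = {
    SchedulerType.KAI: "scheduling.run.ai/v2alpha2",
    SchedulerType.VOLCANO: "scheduling.volcano.sh/v1beta1",
    SchedulerType.DEFAULT: None,
    SchedulerType.NONE: None,
}


@dataclass(frozen=True)
class SchedulingPlan:
    podgroup_api_version: Optional[str]  # None: skip PodGroup creation
    scheduler_name: Optional[str]        # None: leave pod.spec.schedulerName unset


def plan_scheduling(scheduler_type: str, scheduler_name: Optional[str] = None) -> SchedulingPlan:
    """Decide which PodGroup CR (if any) to create and which schedulerName to set."""
    stype = SchedulerType(scheduler_type)  # raises ValueError on unknown values
    if stype is SchedulerType.NONE:
        # Option C: no PodGroup; Kubernetes falls back to its default scheduler.
        return SchedulingPlan(None, None)
    if stype is SchedulerType.DEFAULT:
        # Option A: no PodGroup; schedulerName defaults to "default-scheduler".
        return SchedulingPlan(None, scheduler_name or "default-scheduler")
    # kai and volcano: create the matching PodGroup and pass the name through.
    return SchedulingPlan(PODGROUP_API_VERSION[stype], scheduler_name)
```

A `CreateGroup`-style job could then skip the PodGroup step whenever `podgroup_api_version` is `None`, which would cover both Option A and Option C without touching the existing kai path.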

---

### Documentation gap (separate, smaller fix)

If hardcoded-to-kai is the intended design, [`deployment_guide/appendix/deploy_minimal.rst`](https://github.com/NVIDIA/OSMO/blob/6.2.10/docs/deployment_guide/appendix/deploy_minimal.rst) and the chart-level `prerequisites` sections should explicitly state:

> kai-scheduler is a **required** runtime dependency. Install it from `oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler` before submitting any workflow.

Today this is implicit: the backend-operator chart's `priorityClasses.enabled` flag hints that kai is expected, but there is no top-level callout that submitting a workflow without it produces a `FAILED_SERVER_ERROR`.

---

### Environment

- OSMO `6.2.10` (charts `1.2.1` / app `6.2.10.fa46f7f09`)
- Kubernetes `v1.35.4`
- Reproduced on a fresh install before kai-scheduler was added.

