Expected Runtime Plugin for Soft Eviction via Requeue Action #904

@rich7420

What you would like to be added?

Overview

We propose a new Expected Runtime plugin that nominates running jobs as requeue candidates once they exceed a configurable expected runtime. The plugin works in conjunction with a Requeue action (tracked separately) that performs a transactional "virtual eviction": checkpoint → virtually evict → try to schedule higher-priority workloads → commit or roll back.

Unlike strict "max runtime" eviction, this plugin implements soft eviction eligibility: jobs exceeding the expected runtime become eligible to be requeued, but are only actually evicted if doing so allows higher-priority workloads to run.

Proposed Plugin: expectedruntime

Naming (TBD)

Options:

  • expectedruntime (@itsomri)
  • softdeadline
  • softmaxruntime

User Interface (MVP: PodGroup annotations)

Annotation keys (prefix TBD for upstream discussion):

  • volcano.sh/expected-runtime (TBD): duration string (e.g. 2h, 30m)
    • Meaning: After this duration, the job becomes eligible for requeue nomination
  • volcano.sh/requeue-delay (TBD): duration string (e.g. 10m)
    • Meaning: Cooldown period after a successful requeue commit
  • volcano.sh/requeue-not-before (TBD): RFC3339 timestamp (system-managed)
    • Meaning: Not-before gate timestamp; written by the system after committed requeue

Note: requeue-not-before is system-managed; users should not set it manually.
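A minimal sketch of how the three annotations could be parsed, assuming Go's standard duration and RFC3339 parsers are acceptable for the value formats. The type requeueConfig, the function parseRequeueConfig, and the sentinel errNotOptedIn are hypothetical names for illustration; the annotation keys are copied from this proposal and their prefix is still TBD.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Annotation keys as written in this proposal; the prefix is still TBD.
const (
	expectedRuntimeKey = "volcano.sh/expected-runtime"
	requeueDelayKey    = "volcano.sh/requeue-delay"
	notBeforeKey       = "volcano.sh/requeue-not-before"
)

// errNotOptedIn (hypothetical) signals that no expected-runtime annotation is set.
var errNotOptedIn = errors.New("expected-runtime annotation not set")

// requeueConfig is a hypothetical holder for the parsed annotation values.
type requeueConfig struct {
	ExpectedRuntime time.Duration
	RequeueDelay    time.Duration // zero when the annotation is absent
	NotBefore       *time.Time    // nil when no cooldown gate has been written
}

// parseRequeueConfig reads the three annotations from a PodGroup annotation map.
func parseRequeueConfig(ann map[string]string) (requeueConfig, error) {
	var cfg requeueConfig
	raw, ok := ann[expectedRuntimeKey]
	if !ok {
		return cfg, errNotOptedIn
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return cfg, fmt.Errorf("invalid %s %q", expectedRuntimeKey, raw)
	}
	cfg.ExpectedRuntime = d
	if raw, ok := ann[requeueDelayKey]; ok {
		d, err := time.ParseDuration(raw)
		if err != nil || d < 0 {
			return cfg, fmt.Errorf("invalid %s %q", requeueDelayKey, raw)
		}
		cfg.RequeueDelay = d
	}
	if raw, ok := ann[notBeforeKey]; ok {
		t, err := time.Parse(time.RFC3339, raw)
		if err != nil {
			return cfg, fmt.Errorf("invalid %s %q", notBeforeKey, raw)
		}
		cfg.NotBefore = &t
	}
	return cfg, nil
}

func main() {
	cfg, err := parseRequeueConfig(map[string]string{
		expectedRuntimeKey: "2h",
		requeueDelayKey:    "10m",
	})
	fmt.Println(cfg.ExpectedRuntime, cfg.RequeueDelay, cfg.NotBefore == nil, err)
}
```

Note that time.ParseDuration accepts units only up to hours, so a value such as "1d" would be reported as invalid under this sketch and would need to be written as "24h".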

Plugin Behavior

The plugin nominates running jobs as requeue candidates based on the following eligibility checks (all must pass):

  1. Running check: Job must have active allocated/running tasks
  2. Preemptible check: Job must be marked as Preemptible (Phase 1 requirement for consistency)
  3. Configuration check: expected-runtime annotation must exist and be valid (opt-in)
  4. Time check: with runtime = now - LastStartTimestamp, the job qualifies when runtime >= expectedRuntime
  5. Cooldown gate: If requeue-not-before exists, must satisfy now >= not-before
  6. MinRuntime interaction (recommended): Respect minruntime protection if a hook is available; otherwise rely on the Requeue action's filters
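Checks 1 to 5 above can be sketched as a single pure function. This is an illustration under stated assumptions, not the plugin's implementation: the job struct, its field names, and the helpers demoStart and hours are hypothetical and not taken from the KAI Scheduler code base. Skip reasons reuse the label values proposed under Observability, and check 6 (minruntime) is left out because its hook is not defined yet.

```go
package main

import (
	"fmt"
	"time"
)

// job is a hypothetical, simplified view of a scheduler job.
type job struct {
	RunningTasks    int
	Preemptible     bool
	ExpectedRuntime time.Duration // zero means not opted in or invalid
	LastStart       time.Time     // zero means the start time is unknown
	NotBefore       time.Time     // zero means no cooldown gate is set
}

// demoStart and hours are small helpers used only by the example below.
var demoStart = time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

func hours(n int) time.Duration { return time.Duration(n) * time.Hour }

// nominate reports whether the job passes checks 1-5. When it does not, the
// second value is a skip reason, or "" if the job is simply not due yet
// (the proposal lists no skip reason for that case).
func nominate(j job, now time.Time) (bool, string) {
	switch {
	case j.RunningTasks == 0:
		return false, "not_running"
	case !j.Preemptible:
		return false, "not_preemptible"
	case j.ExpectedRuntime <= 0:
		return false, "invalid_duration"
	case j.LastStart.IsZero():
		return false, "missing_start"
	case now.Before(j.LastStart):
		return false, "clock_skew"
	case now.Sub(j.LastStart) < j.ExpectedRuntime:
		return false, ""
	case !j.NotBefore.IsZero() && now.Before(j.NotBefore):
		return false, "cooldown"
	}
	return true, ""
}

func main() {
	j := job{RunningTasks: 4, Preemptible: true, ExpectedRuntime: hours(2), LastStart: demoStart}
	for _, h := range []int{1, 3} {
		ok, reason := nominate(j, demoStart.Add(hours(h)))
		fmt.Printf("after %dh: nominated=%v reason=%q\n", h, ok, reason)
	}
}
```

Because the function takes now as a parameter and has no side effects, the boundary cases in the test plan (exactly at the expected runtime, exactly at not-before) can be exercised without a fake clock.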

Contract with Requeue Action

  • Plugin is side-effect free: only produces nominations, never performs eviction
  • Requeue action responsibilities:
    • Deduplicates candidates (multiple plugins may nominate the same job)
    • Performs transactional try/commit/rollback
    • Writes requeue-not-before upon committed requeue
    • Records union of nominators/reasons for debugging (e.g., nominated_by="expectedruntime,proportion")
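The deduplication and "union of nominators" responsibilities could look roughly like the following. The nomination type and dedupe function are hypothetical; the sketch only assumes that each plugin emits (job, plugin name) pairs and that the action wants one entry per job with a stable, sorted nominated_by string.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nomination is a hypothetical record emitted by a nominating plugin.
type nomination struct {
	JobID  string
	Plugin string
}

// dedupe groups nominations by job and returns, per job, the sorted,
// comma-joined union of the plugins that nominated it.
func dedupe(noms []nomination) map[string]string {
	seen := map[string]map[string]bool{}
	for _, n := range noms {
		if seen[n.JobID] == nil {
			seen[n.JobID] = map[string]bool{}
		}
		seen[n.JobID][n.Plugin] = true
	}
	out := make(map[string]string, len(seen))
	for id, plugins := range seen {
		names := make([]string, 0, len(plugins))
		for p := range plugins {
			names = append(names, p)
		}
		sort.Strings(names) // stable order keeps the label set finite
		out[id] = strings.Join(names, ",")
	}
	return out
}

func main() {
	merged := dedupe([]nomination{
		{"job-a", "proportion"},
		{"job-a", "expectedruntime"},
		{"job-a", "expectedruntime"},
		{"job-b", "expectedruntime"},
	})
	fmt.Println(merged["job-a"]) // expectedruntime,proportion
	fmt.Println(merged["job-b"]) // expectedruntime
}
```

Sorting the plugin names means the same set of nominators always produces the same string, which matters if nominated_by is later used as a metric label.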

Observability

Metrics (examples; naming to match project conventions):

  • *_requeue_nominations_total{plugin="expectedruntime"}
  • *_requeue_nomination_skipped_total{reason="cooldown|missing_start|invalid_duration|not_running|not_preemptible|minruntime_protected|clock_skew"}

Cardinality note: reason / plugin / nominated_by must be a finite set (no job names, timestamps, etc. in labels).
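One way to keep the reason label finite is to validate it against a fixed set before counting. The sketch below uses a plain map instead of a real metrics client to stay self-contained; skipCounter and the "other" fallback bucket are assumptions for illustration, while the allowed reasons are the ones listed above.

```go
package main

import "fmt"

// skipReasons is the closed set of reason label values from this proposal.
var skipReasons = map[string]bool{
	"cooldown":             true,
	"missing_start":        true,
	"invalid_duration":     true,
	"not_running":          true,
	"not_preemptible":      true,
	"minruntime_protected": true,
	"clock_skew":           true,
}

// skipCounter is a hypothetical stand-in for a labelled counter metric.
type skipCounter map[string]int

// inc counts a skip; anything outside the closed set is folded into "other"
// (an assumed fallback) so free-form text can never become a label value.
func (c skipCounter) inc(reason string) {
	if !skipReasons[reason] {
		reason = "other"
	}
	c[reason]++
}

func main() {
	c := skipCounter{}
	c.inc("cooldown")
	c.inc("cooldown")
	c.inc("job-1234 timed out at 12:00")
	fmt.Println(c["cooldown"], c["other"], len(c)) // 2 1 2
}
```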

Events/Logs (MVP acceptable as logs only):

  • Nominated: includes runtime/expected, nominated_by
  • Skipped: includes reason (especially cooldown, invalid config)

Example Usage

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-job
  annotations:
    volcano.sh/expected-runtime: "2h"
    volcano.sh/requeue-delay: "10m"
spec:
  # ... other PodGroup spec

Expected behavior:

  • At ~2h runtime, the job becomes eligible for requeue nomination
  • If no higher-priority contender can schedule → rollback, job keeps running
  • If a contender can schedule → commit, job is evicted, requeue-not-before is set by system
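The commit-or-rollback behavior described above can be illustrated with a toy resource model. This is only a sketch of the transactional idea: the cluster and workload types, the single GPU pool, and tryRequeue are hypothetical simplifications, and the actual Requeue action is tracked separately.

```go
package main

import "fmt"

// cluster is a toy resource model with a single pool of free GPUs.
type cluster struct{ freeGPUs int }

// workload is a hypothetical job with a fixed GPU request.
type workload struct {
	Name string
	GPUs int
}

// tryRequeue virtually evicts the victim and checks whether the contender
// fits. It commits and returns true if so; otherwise it restores the
// checkpoint and returns false, leaving the victim running.
func tryRequeue(c *cluster, victim, contender workload) bool {
	checkpoint := *c          // checkpoint current state
	c.freeGPUs += victim.GPUs // virtual eviction frees the victim's GPUs
	if c.freeGPUs >= contender.GPUs {
		c.freeGPUs -= contender.GPUs // commit: contender takes the resources
		return true
	}
	*c = checkpoint // rollback: nothing changed for the victim
	return false
}

func main() {
	c := &cluster{freeGPUs: 1}
	victim := workload{"long-job", 2}

	// Contender too large even after eviction: rollback, victim keeps running.
	fmt.Println(tryRequeue(c, victim, workload{"big", 4}), c.freeGPUs) // false 1

	// Contender fits once the victim is evicted: commit.
	fmt.Println(tryRequeue(c, victim, workload{"urgent", 3}), c.freeGPUs) // true 0
}
```

In the committed case the system would also write requeue-not-before; that step is omitted here because the exact cooldown computation is not spelled out in this proposal.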

Test Plan

Unit tests (plugin):

  • Config parsing: valid/invalid/missing
  • Time logic: before/after expected runtime boundary
  • Cooldown gate: before/after not-before
  • Preemptible filter
  • MinRuntime interaction (when hook available)

Integration tests (plugin + requeue action):

  • No contention → rollback, job keeps running
  • Contention exists → commit, job is evicted, not-before is set
  • Cooldown blocks repeated attempts
  • Multiple plugins nominating same job → dedup with reasons union

Rollout Plan

  • Phase 1: Annotations-only, opt-in (no impact on workloads without annotations)
  • Phase 2: Add queue defaults (reduce per-workload config burden)
  • Phase 3: Promote to spec fields; keep annotation overrides for compatibility

Non-goals

  • Strict max runtime enforcement (hard kill at deadline)
  • Performing eviction inside the plugin (handled by Requeue action)
  • Introducing new CRD fields in Phase 1 (start with annotations)

Why is this needed?

Problem Statement

Currently, there is no built-in mechanism in KAI Scheduler to handle jobs that run longer than expected while still allowing them to continue if there's no resource contention. External solutions (like strict "max runtime" controllers) have a significant downside: they forcefully terminate jobs even when there's no competition for resources, which is wasteful.

Use Cases

  1. Time-aware fairness scenarios: When a queue has exhausted its time-based fair share but no other queue is "strong" enough to reclaim from it, jobs can remain running indefinitely. An expected runtime mechanism allows these jobs to become eligible for requeue when higher-priority workloads need resources.

  2. Resource efficiency: Jobs that exceed their expected runtime should yield resources to more deserving workloads, but only when there's actual contention. If no one needs the resources, the job should continue running.

  3. Predictable workload behavior: Users can set expected runtimes for their workloads, knowing that the scheduler will attempt to reclaim resources when appropriate, without hard-killing jobs unnecessarily.

Benefits

  • Soft eviction: Jobs are only evicted when there's real contention, not just because time expired
  • Stability: Cooldown/not-before gate prevents thrash under contention
  • Consistency: Aligns with existing preemptibility and minruntime semantics
  • Observability: Provides metrics and events for debugging and monitoring
  • Flexibility: Opt-in via annotations, no impact on existing workloads

Relationship to Requeue Action

This plugin is designed to work with a Requeue action (tracked separately) that handles the actual eviction logic:

  • Plugin: "Which jobs should be considered for requeue?" (nomination)
  • Action: "Should we actually evict this job?" (transactional decision)

This separation allows:

  • Multiple plugins to nominate candidates (expectedruntime, proportion, future use cases)
  • Clean separation of concerns
  • Independent evolution of nomination logic and eviction execution

Alternatives Considered

  1. Strict max runtime eviction: Rejected because it wastes resources when there's no contention; better suited to external controllers
  2. Cooldown only in plugin: Rejected because the action must also enforce the cooldown to avoid races and to ensure commit-only semantics
  3. Hard-coded time limits: Rejected because different workloads have different expected runtimes; needs per-workload configuration

Labels: enhancement
