
Releases: volcano-sh/volcano-global

v0.3.0

29 Jan 13:26
2ae7715


Summary

Volcano Global v0.3.0 introduces two major features: HyperJob for Multi-cluster Job Splitting and the Data Dependency Scheduling Framework for data-aware workload placement. This release significantly expands Volcano Global's capabilities for AI/ML and Big Data workloads by enabling intelligent scheduling based on both compute resources and data locality.

What's New

Key Features Overview

  • HyperJob for Multi-cluster Job Splitting: Enable large-scale training across multiple clusters with automatic job splitting, distribution, and unified status aggregation
  • Data Dependency Scheduling: Schedule workloads to clusters based on data locality, with pluggable integration for external data systems like Amoro

Key Feature Details

HyperJob for Multi-cluster Job Splitting

Background and Motivation:

As AI training workloads grow in scale and complexity, organizations increasingly face the challenge of managing large-scale training jobs across multiple heterogeneous clusters. Several key problems have emerged:

  • Scale Limitations: Training jobs for large LLMs and foundation models require hundreds or thousands of GPUs. Single-cluster scheduling becomes a bottleneck when job requirements exceed cluster capacity.
  • Heterogeneous Infrastructure: Different clusters may have different types of accelerators (A100, H100, Ascend910B, Ascend910C, etc.), as well as different hardware configurations, network topologies, and geographic locations. Current solutions lack the ability to split and coordinate training jobs across these heterogeneous environments.
  • Operational Complexity: Managing distributed training across clusters manually is error-prone and requires deep expertise in both the training framework and cluster orchestration.

HyperJob is a higher-level abstraction built on top of Volcano Job. It composes multiple Volcano Job templates and extends training capabilities beyond single cluster boundaries, while preserving the full capabilities of existing Volcano Jobs within each cluster. The HyperJob Controller automatically splits and distributes jobs across clusters, enabling users to run multi-cluster training with the same ease as single-cluster Volcano Jobs.

Key Capabilities:

  • Automatic Resource Generation: Creates VCJobs and PropagationPolicies for each ReplicatedJob definition
  • Status Aggregation: Aggregates status from all child VCJobs into a unified HyperJob status
  • Change Detection: Uses SHA256 hashing of specs to detect and apply changes efficiently
  • Karmada Integration: Generates PropagationPolicies with proper cluster affinity and replica scheduling settings (see the sketch after this list)
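
For illustration, below is a minimal sketch of the kind of PropagationPolicy the controller might generate for one child VCJob. The resource and cluster names are hypothetical, and the exact affinity and replica scheduling settings depend on the controller's implementation:

# Hypothetical generated PropagationPolicy; names and placement settings are illustrative only.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: llm-training-trainer-0          # assumed name derived from the HyperJob and ReplicatedJob
spec:
  resourceSelectors:
  - apiVersion: batch.volcano.sh/v1alpha1
    kind: Job                            # the child VCJob created for this ReplicatedJob replica
    name: llm-training-trainer-0
  placement:
    clusterAffinity:
      clusterNames:
      - cluster1                         # hypothetical member cluster
    replicaScheduling:
      replicaSchedulingType: Duplicated  # assumed: keep the whole VCJob on one cluster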

Configuration:

To enable the HyperJob controller, add reconciler to the --controllers list and enable the hyperjob reconciler via --reconcilers:

args:
  - --controllers=dispatcher,reconciler
  - --reconcilers=hyperjob

Example HyperJob resource (a large-scale training job split across 2 clusters, 256 GPUs in total):

apiVersion: training.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: llm-training
spec:
  replicatedJobs:
  - name: trainer
    replicas: 2
    templateSpec:
      tasks:
      - name: worker
        replicas: 128
        template:
          spec:
            containers:
            - name: trainer
              image: training-image:v1
              resources:
                requests:
                  nvidia.com/gpu: 1
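
In this example, the 256-GPU total comes from 2 replicated jobs × 128 worker tasks × 1 GPU per worker, so the training job can be split evenly across two clusters.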


Data Dependency Scheduling

Background and Motivation:

In High-Performance Computing scenarios such as AI training and Big Data analysis, task execution depends heavily on data resources, not just compute resources. In multi-cluster environments, the scheduler might dispatch tasks to clusters physically distant from their data sources, resulting in prohibitive cross-region bandwidth costs and high I/O latency.

The Data Dependency Scheduling framework introduces a dedicated DataDependencyController that bridges the gap between logical data requirements and physical cluster placement. By utilizing external dependency detection plugins (such as Amoro), the controller queries real-time physical data distribution and translates this information into scheduling constraints. This achieves a fully automated "Compute-to-Data" (Data Gravity) workflow without manual intervention.

Key Capabilities:

  • DataSourceClaim/DataSource CRDs: Declarative API for data dependency management with a "Declaration - Cache" pattern
  • Plugin Architecture: Extensible framework supporting multiple data systems (Amoro, Hive, S3)
  • Two-Phase Resolution: Dynamic API resolution combined with static location-to-cluster mapping
  • Automatic Affinity Injection: Injects ClusterAffinity constraints into Karmada ResourceBindings (illustrated after this list)
  • Feature Gate Control: Enable/disable via DataDependencyAwareness feature gate
  • Cache Optimization: DataSource objects serve as metadata cache to minimize external API calls
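
To illustrate the affinity injection, here is a hypothetical fragment of a Karmada ResourceBinding after the controller has applied the resolved data locality. The object name is illustrative, and the exact fields the controller writes depend on the implementation; the cluster names come from the resolved DataSource:

# Hypothetical ResourceBinding fragment; shows where injected cluster affinity could appear.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: my-app-deployment         # illustrative name; created by Karmada for the workload
spec:
  resource:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  placement:
    clusterAffinity:
      clusterNames:               # injected from the DataSource locality (cluster1, cluster2)
      - cluster1
      - cluster2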

Architecture:

The framework uses a two-level CRD abstraction:

  • DataSourceClaim (DSC): Namespace-scoped, represents user's data requirement
  • DataSource (DS): Cluster-scoped, acts as a system-level cache of API resolution results

The controller operates with a state machine based on DSC phase (Pending/Bound) to handle discovery, binding, injection, and self-healing operations.
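
As a rough illustration of the phase transitions, a claim that has finished discovery and binding might look like the sketch below. The status fields shown here are assumptions for illustration, not the authoritative schema:

# Illustrative only: the status layout is assumed, not the published schema.
apiVersion: datadependency.volcano.sh/v1alpha1
kind: DataSourceClaim
metadata:
  name: example
spec:
  system: "amoro"
  dataSourceType: "table"
  dataSourceName: "catalog.db.table"
status:
  phase: Bound                            # assumed field: Pending -> Bound once resolution completes
  dataSourceRef: amoro-catalog-db-table   # assumed reference to the cached DataSource object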

Configuration:

To enable Data Dependency Scheduling:

args:
  - --controllers=dispatcher,datadependency-controller
  - --feature-gates=DataDependencyAwareness=true
env:
  - name: PLUGIN_CONFIG_PATH
    value: "/etc/volcano-global/plugins"
volumeMounts:
  - name: plugin-config
    mountPath: /etc/volcano-global/plugins

Plugin configuration example (ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-global-datasource-plugins
  namespace: volcano-global
data:
  amoro: |
    {
      "system": "amoro",
      "endpoint": {
        "url": "http://amoro-server.example.com",
        "port": 1630
      },
      "locationMapping": {
        "s3://warehouse-dc1/": ["cluster1", "cluster2"],
        "s3://warehouse-dc2/": ["cluster3"]
      }
    }
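
The plugin-config volume mounted in the controller configuration above is expected to be backed by this ConfigMap. A minimal sketch of the corresponding volumes entry, using the names from the examples above:

volumes:
  - name: plugin-config
    configMap:
      name: volcano-global-datasource-plugins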


Other Notable Changes

API Changes

  • DataSourceClaim CRD: New namespace-scoped CRD for declaring workload data dependencies
apiVersion: datadependency.volcano.sh/v1alpha1
kind: DataSourceClaim
metadata:
  name: example
spec:
  system: "amoro"
  dataSourceType: "table"
  dataSourceName: "catalog.db.table"
  workload:
    apiVersion: "apps/v1"
    kind: "Deployment"
    name: "my-app"

Related: #27, @FanXu

  • DataSource CRD: New cluster-scoped CRD for caching resolved data location metadata
apiVersion: datadependency.volcano.sh/v1alpha1
kind: DataSource
metadata:
  name: amoro-catalog-db-table
spec:
  system: "amoro"
  dataSourceType: "table"
  dataSourceName: "catalog.db.table"
  locality:
    clusterNames: ["cluster1", "cluster2"]

Related: #27, @FanXu

  • HyperJob CRD: New CRD for multi-cluster job orchestration (defined in volcano-sh/apis repository)

Related: #25, @JesseStutler

Features & Enhancements

  • Reconciler Framework: Added controller-runtime based reconciler framework for building controllers with shared Manager instance (#25, @JesseStutler)
  • Feature Gate Infrastructure: Added feature gate support for controlling optional features (#27, @FanXu)
  • Amoro Plugin: Implemented Amoro data lake integration plugin for data discovery (#27, @FanXu)
  • Automated Image Release: Added GitHub Actions workflow for automated Docker image releases to DockerHub (#41)
  • Unit Test CI: Added GitHub Actions workflow for automated unit testing (#34)
  • License Lint CI: Added GitHub Actions workflow for license compliance checking (#39, @FAUST.)

Bug Fixes

  • Queue Allocation Error: Fixed queue resource allocation calculation to use correct dispatch status check (#22, @Monokaix)
  • Cache Update Concurrency: Fixed cache updates to use atomic locks for event handling (#17, #18, @tanberBro)
  • DataSource Lifecycle: Fixed DataSource lifecycle handli...

v0.2.1

25 Jun 01:30
4b984c4


What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0

30 May 16:42
c96f3a7


What's Changed

  • feature: queue capacity management by @tanberBro in #16
  • bugfix: cache updates from events all use atomic locks and the calculation of allocated use rbi.DispatchStatus != api.Suspended. by @tanberBro in #18

v0.1.0

24 Jan 10:49
f6c8d78


What's New

Welcome to the first release of Volcano Global! 🚀 🎉 📣

With the rapid growth of enterprise business, a single Kubernetes cluster often cannot meet the demands of large-scale AI training and inference tasks. Users typically need to manage multiple Kubernetes clusters to achieve unified workload distribution, deployment, and management. Many users already run Volcano across multiple clusters managed by Karmada. To better support AI jobs in multi-cluster environments, including global queue management, job priority, and fair scheduling, the Volcano community has incubated the Volcano Global sub-project. This project extends Volcano's powerful single-cluster scheduling capabilities to provide a unified scheduling platform for multi-cluster AI jobs, supporting cross-cluster job distribution, resource management, and priority control.

Volcano Global provides the following enhancements on top of Karmada to meet the complex demands of multi-cluster AI job scheduling:

  1. Supports Cross-Cluster Scheduling of Volcano Jobs
    Users can deploy and schedule Volcano Jobs across multiple clusters, fully utilizing the resources of multiple clusters to improve task execution efficiency.
  2. Queue Priority Scheduling
    Supports cross-cluster queue priority management, ensuring high-priority queue tasks can obtain resources first.
  3. Job Priority Scheduling and Queuing
    Supports job-level priority scheduling and queuing mechanisms in multi-cluster environments, ensuring critical tasks are executed promptly.
  4. Multi-Tenant Fair Scheduling
    Provides cross-cluster multi-tenant fair scheduling capabilities, ensuring fair resource allocation among tenants and avoiding resource contention.

For detailed introduction and user guide, please refer to: Multi-cluster Scheduling | Volcano.

Changes

  • Bump karmada version to support resourceBinding suspension. (#10 @Monokaix)
  • chore: remove the pod group (#9 @Vacant2333)
  • Update design img (#8 @Monokaix)
  • Change the traversal method of queue to round-robin (#6 @MondayCha)
  • fix: go mod confliction (#5 @Vacant2333)
  • [Proposal] Queue capacity management proposal (#2 @Vacant2333)
  • [Init] Add volcano-global dispatcher, controller manager, webhook manager and deploy guide (#1 @Vacant2333)