141 changes: 141 additions & 0 deletions content/en/docs/jobflow.md
@@ -0,0 +1,141 @@
+++
title = "JobFlow"

date = 2025-06-25
lastmod = 2025-06-25

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
toc_depth = 5
type = "docs" # Do not modify.

# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 8
+++

# Background

In today's cloud computing environment, the complexity and scale of batch jobs continue to grow, especially in the fields of artificial intelligence, big data, and high-performance computing (HPC). These jobs often require long running times (days or weeks) and have complex dependencies among them. Traditional job management approaches require users to manually orchestrate multiple VCJobs or rely on third-party job orchestration platforms, which not only increases management complexity but also reduces resource utilization efficiency.

Existing workflow engines, while capable of handling general workflows, are not specifically designed for batch job workloads and cannot fully understand or optimize the characteristics of VCJobs. Users often struggle to obtain detailed information about job running status, execution progress, and resource utilization, which presents challenges for managing complex workloads.

To address these issues, we propose JobFlow, a cloud-native orchestration solution specifically designed for VCJobs. JobFlow introduces two core concepts: JobTemplate and JobFlow, allowing users to define jobs in a declarative manner and orchestrate them through rich control primitives (such as sequential execution, parallel execution, conditional branching, loops, etc.). This approach not only simplifies the management of complex jobs but also improves resource utilization and accelerates workload execution.

Unlike general-purpose workflow engines, JobFlow deeply understands the internal mechanisms of VCJobs and can provide more detailed job insights, including running status, timestamps, next execution plans, and key metrics such as failure rates. This enables users to better monitor and optimize their workloads, ensuring critical tasks execute as expected.

# Features

JobFlow extends VCJob with sequential startup and dependency management. Users can configure VCJobs to start in a specific order, or make a VCJob wait for other VCJobs to complete before it runs, which gives finer control over the workflow.

The newly added JobFlow and JobTemplate Custom Resource Definitions (CRDs) provide more advanced job management capabilities. Users can create reusable job templates and complex job workflows through these resources, and can view the operational status of JobFlows in real-time.

# Usage

To use JobFlow's functionality, we first need to understand two key concepts: JobTemplate and JobFlow. An example of how they work together can be found at https://github.com/volcano-sh/volcano/tree/master/example/jobflow.

## JobTemplate

JobTemplate (jt) is a template definition for Volcano jobs (vcjob). It won't be directly processed or executed, but instead waits to be referenced by a JobFlow.

Here's a simple example of a JobTemplate:

```
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14.2
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```

This JobTemplate describes a job that is submitted to the `default` queue and scheduled by `volcano`. It contains a single task named `default-nginx` with one replica, which uses the `nginx:1.14.2` image and executes the command `sh -c "sleep 10s"`.

## JobFlow

JobFlow (jf) defines the workflow and dependencies for a group of jobs. It can reference multiple JobTemplates and create and execute Volcano jobs according to specified dependencies. JobFlow supports various dependency types (such as HTTP, TCP, task status, etc.), and can modify referenced JobTemplates through a patch mechanism.

Here's a simple example of a JobFlow:

```
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test
  namespace: default
spec:
  jobRetainPolicy: delete   # After jobflow runs, delete the generated jobs.
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']
```

This YAML file defines a JobFlow resource named `test` created in the `default` namespace in Kubernetes. This JobFlow orchestrates the execution of five jobs (a, b, c, d, e) and defines the dependencies between them.

`jobRetainPolicy: delete` indicates that when the JobFlow execution is complete, all generated Volcano jobs will be deleted and not retained in the system.

The dependencies represented in this YAML are as follows:

- Job `a` has no dependencies and will execute first
- Job `b` depends on `a` and will only start after `a` successfully completes
- Job `c` depends on `b` and needs to wait for `b` to complete
- Job `d` also depends on `b` and needs to wait for `b` to complete
- Job `e` depends on both `c` and `d` and will only execute after both `c` and `d` are complete
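
The dependency list above forms a directed acyclic graph. As a rough illustration (not the controller's actual implementation), the following Python sketch groups the five flows into "waves" in which every flow's targets belong to earlier waves. The `FLOWS` dictionary and the `execution_waves` helper are hypothetical names introduced for this example; only the flow names and their `dependsOn.targets` come from the YAML.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Flow name -> dependsOn.targets, copied from the JobFlow example.
FLOWS = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["c", "d"],
}


def execution_waves(flows):
    """Group flows into waves; each flow's targets all sit in earlier waves."""
    # Reject targets that are not themselves defined as flows.
    unknown = {t for targets in flows.values() for t in targets} - set(flows)
    if unknown:
        raise ValueError(f"dependsOn targets not defined as flows: {sorted(unknown)}")

    sorter = TopologicalSorter(flows)  # dict maps node -> its predecessors
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # flows whose targets are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    print(execution_waves(FLOWS))  # [['a'], ['b'], ['c', 'd'], ['e']]
```

In this model `c` and `d` fall into the same wave, which reflects that neither depends on the other and both only wait for `b`.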

# Architecture

JobFlow follows the Kubernetes Operator pattern and is implemented with CRDs and controllers, as shown in the diagram.

In the diagram, the blue components are part of Kubernetes itself, the orange components are existing Volcano definitions, and the red components are new JobFlow definitions.

![img](https://github.com/volcano-sh/volcano/raw/master/docs/design/images/jobflow-2.png)

## Workflow

The interaction between JobFlow and JobTemplate Controllers and resources is shown in the diagram.

![img](https://github.com/volcano-sh/volcano/raw/master/docs/design/images/jobflow-3.png)

The workflow creation process can be described as follows:

1. Users create JobFlow and JobTemplate resources.
2. JobFlowController creates corresponding VcJobs based on the JobFlow configuration, using JobTemplates as templates and following the flow dependency rules.
3. After VcJobs are created, VcJobController creates corresponding Pods and PodGroups based on the VcJob configuration.
4. When Pods and PodGroups are created, vc-scheduler retrieves Pod/PodGroup and node information from kube-apiserver.
5. After obtaining this information, vc-scheduler selects appropriate nodes for each Pod according to its configured scheduling policies.
6. After nodes are assigned to Pods, kubelet retrieves Pod configurations from kube-apiserver and starts the corresponding containers.
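
Step 2 of the list above can be pictured as a small decision made on each reconcile pass: create a VcJob for every flow that has not been created yet and whose targets have all completed. The Python sketch below illustrates that idea under simplifying assumptions; it is not the logic in `jobflow_controller.go`. The function name `next_jobs_to_create` and the `created` / `completed` sets are hypothetical, and probes, patches, and failure handling are ignored.

```python
def next_jobs_to_create(flows, created, completed):
    """Return the flow names whose VcJob should be created on this pass.

    flows:     dict of flow name -> list of dependsOn targets
    created:   set of flow names whose VcJob already exists
    completed: set of flow names whose VcJob has completed successfully
    """
    return sorted(
        name
        for name, targets in flows.items()
        if name not in created and all(t in completed for t in targets)
    )


if __name__ == "__main__":
    # Dependencies from the JobFlow example in the Usage section.
    flows = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["c", "d"]}

    created, completed = set(), set()
    while len(completed) < len(flows):
        batch = next_jobs_to_create(flows, created, completed)
        if not batch:  # nothing is runnable; avoid looping forever
            break
        print("create VcJobs for:", batch)
        created.update(batch)
        completed.update(batch)  # pretend every created job finishes successfully
```

Because the function only looks at the current state, it can be called repeatedly as job statuses change, which is the usual shape of a controller reconcile loop.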

For more details, see the controller logic in `/volcano/pkg/controllers/jobflow/jobflow_controller.go`.
141 changes: 141 additions & 0 deletions content/zh/docs/jobflow.md
@@ -0,0 +1,141 @@
+++
title = "JobFlow"

date = 2025-06-25
lastmod = 2025-06-25

draft = false # Is this a draft? true/false
toc = true # Show table of contents? true/false
toc_depth = 5
type = "docs" # Do not modify.

# Add menu entry to sidebar.
[menu.docs]
parent = "features"
weight = 8
+++

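# Background

In today's cloud computing environment, batch jobs keep growing in complexity and scale, especially in artificial intelligence, big data, and high-performance computing (HPC). These jobs often run for a long time (days or weeks) and have complex dependencies on one another. Traditional job management requires users to orchestrate multiple VCJobs by hand or to rely on a third-party job orchestration platform, which increases management complexity and lowers resource utilization.

Existing workflow engines can handle general workflows, but they are not designed for batch job workloads and cannot fully understand or optimize the characteristics of VCJobs. Users often find it hard to obtain a job's detailed running status, execution progress, and resource usage, which makes complex workloads difficult to manage.

To address these issues, we propose JobFlow, a cloud-native orchestration solution designed specifically for VCJobs. JobFlow introduces two core concepts, JobTemplate and JobFlow, which let users define jobs declaratively and orchestrate them with rich control primitives (such as sequential execution, parallel execution, conditional branching, and loops). This approach simplifies the management of complex jobs, improves resource utilization, and speeds up workload execution.

Unlike general-purpose workflow engines, JobFlow understands the internal mechanisms of VCJobs and can provide more detailed job insights, including running status, timestamps, the next execution plan, and key metrics such as failure rate. This lets users monitor and optimize their workloads more effectively and ensures that critical tasks run as expected.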

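# Features

This feature extends the support for VCJobs with sequential startup and dependency management. Users can configure VCJobs to start in a specific order, or require a VCJob to wait until other VCJobs have completed before it runs, which makes workflow control more flexible.

The new JobFlow and JobTemplate Custom Resource Definitions (CRDs) provide more advanced job management. With these resources, users can create reusable job templates and complex job workflows, and can view the running status of a JobFlow in real time.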

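# Usage

To use JobFlow, we first need to understand two key concepts: JobTemplate and JobFlow. An example of how they work together can be found at https://github.com/volcano-sh/volcano/tree/master/example/jobflow.

## JobTemplate

JobTemplate (jt) is a template definition for Volcano jobs (vcjob). It is not processed or executed directly; instead, it waits to be referenced by a JobFlow.

A simple JobTemplate example follows: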

```
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14.2
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
```

This JobTemplate defines a job template that uses the `default` queue and contains a task named `default-nginx`, which uses the `nginx:1.14.2` image and executes the command `sh -c "sleep 10s"`.

## JobFlow

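JobFlow (jf) defines the execution flow and dependencies for a group of jobs. It can reference multiple JobTemplates, and it creates and executes Volcano jobs in order according to the specified dependencies. JobFlow supports several dependency types (such as HTTP, TCP, and task status), and can modify a JobTemplate through a patch mechanism when referencing it.

A simple JobFlow example follows: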

```
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test
  namespace: default
spec:
  jobRetainPolicy: delete   # After jobflow runs, keep the generated job. Otherwise, delete it.
  # Review comment (priority: high): The comment for jobRetainPolicy is in English and is
  # misleading. It should be translated to Chinese and corrected to accurately describe
  # what the delete value does. The description on line 108 is correct.
  # Suggested change:
  #   jobRetainPolicy: delete   # JobFlow运行结束后,生成的job将被删除。
  #   (in English: after the JobFlow finishes running, the generated jobs are deleted.)
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']
```

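This YAML file defines a JobFlow resource named `test`, created in the Kubernetes `default` namespace. The JobFlow orchestrates the execution of five jobs (a, b, c, d, e) and defines the dependencies between them.

`jobRetainPolicy: delete` means that once the JobFlow has finished, all generated Volcano jobs are deleted rather than kept in the system.

The dependencies expressed by this YAML are as follows:

- Job `a` has no dependencies and executes first
- Job `b` depends on `a` and starts only after `a` completes successfully
- Job `c` depends on `b` and waits for `b` to complete
- Job `d` also depends on `b` and waits for `b` to complete
- Job `e` depends on `c` and `d` and executes only after both `c` and `d` have completed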

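# Architecture

Architecturally, JobFlow is implemented in the Kubernetes Operator style with CRDs and Controllers, as shown in the diagram.

In the diagram, the blue parts are components of Kubernetes itself, the orange parts are existing Volcano definitions, and the red parts are the new JobFlow definitions.

![jobflow-2.png](https://github.com/volcano-sh/volcano/raw/master/docs/design/images/jobflow-2.png)

## Workflow

The interaction between the JobFlow and JobTemplate Controllers and their resources is shown in the diagram.

![jobflow-3.png](https://github.com/volcano-sh/volcano/raw/master/docs/design/images/jobflow-3.png)

The creation workflow can be described as follows:

1. The user creates JobFlow and JobTemplate resources.
2. JobFlowController creates the corresponding VcJobs based on the JobFlow configuration, using JobTemplates as templates and following the flow dependency rules.
3. After a VcJob is created, VcJobController creates the corresponding Pods and PodGroups based on the VcJob configuration.
4. Once the Pods and PodGroups are created, vc-scheduler retrieves Pod/PodGroup and node information from kube-apiserver.
5. With this information, vc-scheduler selects a suitable node for each Pod according to its configured scheduling policies.
6. After nodes are assigned to Pods, kubelet retrieves the Pod configuration from kube-apiserver and starts the corresponding containers.

For further information, see the controller's detailed logic in `/volcano/pkg/controllers/jobflow/jobflow_controller.go`.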