Commit c2e25cd

Kubeflow Training Operator 1.8.1 (#169)
1 parent 5701bb9 commit c2e25cd

20 files changed, +46705 -0 lines changed
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Kubeflow Training Operator

[![Build Status](https://github.com/kubeflow/training-operator/actions/workflows/test-go.yaml/badge.svg?branch=master)](https://github.com/kubeflow/training-operator/actions/workflows/test-go.yaml?branch=master)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/training-operator/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/training-operator?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/training-operator)](https://goreportcard.com/report/github.com/kubeflow/training-operator)

## Overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and
scalable distributed training of machine learning (ML) models created with various ML frameworks
such as PyTorch, TensorFlow, XGBoost, MPI, PaddlePaddle, and others.

Training Operator allows you to use Kubernetes workloads to effectively train your large models
via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
or the Training Operator Python SDK.

> Note: Before the v1.2 release, Kubeflow Training Operator supported only TFJob on Kubernetes.

- For a complete reference of the custom resource definitions, please refer to the API definitions:
  - [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
  - [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
  - [Apache MXNet API Definition](pkg/apis/kubeflow.org/v1/mxnet_types.go)
  - [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
  - [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
  - [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
- For details of the all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6) design document.
- For details on its observability, please refer to the [monitoring design doc](docs/monitoring/README.md).

## Prerequisites

- A Kubernetes cluster and `kubectl`, both version 1.25 or later

## Installation

### Master Branch

```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```

### Stable Release

```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
```

### TensorFlow Release Only

For users who prefer to use the original TensorFlow controllers, please check out `v1.2-branch`. Bug-fix patches are still accepted on this branch.

```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
```

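
The three install commands above differ only in the optional `?ref=` suffix on the kustomize target. As a rough sketch of that pattern, the Python helper below assembles the target string and the `kubectl` argument list. The function names `kustomize_target` and `install_command` are hypothetical and are not part of the project; only the base path and the `ref` values come from the commands above.

```python
# Base kustomize target, copied from the install commands above.
BASE_TARGET = "github.com/kubeflow/training-operator/manifests/overlays/standalone"


def kustomize_target(ref=None):
    """Return the kustomize target; with no ref it tracks the default branch."""
    if ref is None:
        return BASE_TARGET
    return f"{BASE_TARGET}?ref={ref}"


def install_command(ref=None):
    """Build the kubectl argument list (hypothetical helper for illustration)."""
    return ["kubectl", "apply", "-k", kustomize_target(ref)]


if __name__ == "__main__":
    # Print the command for a pinned release instead of running it.
    print(" ".join(install_command("v1.7.0")))
```

Passing the resulting list to `subprocess.run` would apply the manifests; printing it first is a cheap way to confirm which ref is being pinned.
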
### Python SDK for Kubeflow Training Operator

Training Operator provides a Python SDK for the custom resources. To learn more about the available
SDK APIs, see [the `TrainingClient`](sdk/python/kubeflow/training/api/training_client.py).

Use the `pip install` command to install the latest release of the SDK:

```bash
pip install kubeflow-training
```

The Training Operator controller and the Python SDK share the same release versions.

## Quickstart

Please refer to the [getting started guide](https://www.kubeflow.org/docs/components/training/overview/#getting-started)
to quickly create your first Training Operator Job using the Python SDK.

If you want to work directly with the Kubernetes Custom Resources provided by Training Operator,
follow [the PyTorchJob MNIST guide](https://www.kubeflow.org/docs/components/training/pytorch/#creating-a-pytorch-training-job).

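
To give a feel for what working directly with the custom resources involves, here is a minimal sketch that builds a `PyTorchJob` manifest as a plain Python dictionary and prints it as JSON. The helper name `pytorch_job_manifest`, the job name, the image `example.com/mnist:latest`, and the command are all placeholders chosen for illustration. The field layout (`pytorchReplicaSpecs` with `Master` and `Worker` replicas) is assumed from the `kubeflow.org/v1` API and should be checked against the PyTorch API definition linked in the Overview.

```python
import json


def pytorch_job_manifest(name, image, command, workers=1, namespace="default"):
    """Build a PyTorchJob custom resource as a plain dict (illustrative helper)."""

    def replica(count):
        # Each replica type gets its own pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        # Assumption: the training container is named "pytorch".
                        {"name": "pytorch", "image": image, "command": list(command)}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    job = pytorch_job_manifest(
        name="mnist-example",              # placeholder job name
        image="example.com/mnist:latest",  # placeholder image
        command=["python", "train.py"],    # placeholder entrypoint
        workers=2,
    )
    print(json.dumps(job, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output can be saved to a file and submitted with `kubectl apply -f`.
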
## API Documentation

Please refer to the following API documentation:

- [Kubeflow.org v1 API Documentation](docs/api/kubeflow.org_v1_generated.asciidoc)

## Community

The following links provide information about getting involved in the community:

- Attend [the AutoML and Training Working Group](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit) community meeting.
- Join our [Slack](https://www.kubeflow.org/docs/about/community/#kubeflow-slack) channel.
- Check out [who is using the Training Operator](./docs/adopters.md).

This project is part of Kubeflow, so please see the [README in kubeflow/kubeflow](https://github.com/kubeflow/kubeflow#get-involved) to get in touch with the community.

## Contributing

Please refer to the [DEVELOPMENT](docs/development/developer_guide.md) guide.

## Change Log

Please refer to the [CHANGELOG](CHANGELOG.md).

## Version Matrix

The following table lists the most recent few versions of the operator.

| Operator Version       | API Version | Kubernetes Version |
| ---------------------- | ----------- | ------------------ |
| `v1.0.x`               | `v1`        | 1.16+              |
| `v1.1.x`               | `v1`        | 1.16+              |
| `v1.2.x`               | `v1`        | 1.16+              |
| `v1.3.x`               | `v1`        | 1.18+              |
| `v1.4.x`               | `v1`        | 1.23+              |
| `v1.5.x`               | `v1`        | 1.23+              |
| `v1.6.x`               | `v1`        | 1.23+              |
| `v1.7.x`               | `v1`        | 1.25+              |
| `latest` (master HEAD) | `v1`        | 1.25+              |

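
The matrix above can be read as a lookup from an operator release line to the minimum Kubernetes version it expects. The sketch below encodes the numbered rows as data and checks a cluster version against them. The names `MIN_KUBERNETES`, `parse_version`, and `is_supported` are hypothetical, and release lines that are not in the table raise `KeyError` rather than guessing.

```python
# Minimum Kubernetes (major, minor) per operator release line,
# copied from the Version Matrix table.
MIN_KUBERNETES = {
    "v1.0": (1, 16),
    "v1.1": (1, 16),
    "v1.2": (1, 16),
    "v1.3": (1, 18),
    "v1.4": (1, 23),
    "v1.5": (1, 23),
    "v1.6": (1, 23),
    "v1.7": (1, 25),
}


def parse_version(text):
    """Turn 'v1.25.3' or '1.25' into the tuple (1, 25)."""
    parts = text.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(operator_version, kubernetes_version):
    """Return True if the cluster meets the minimum listed for the operator."""
    major, minor = parse_version(operator_version)
    required = MIN_KUBERNETES[f"v{major}.{minor}"]
    return parse_version(kubernetes_version) >= required


if __name__ == "__main__":
    print(is_supported("v1.7.0", "v1.27.3"))
    print(is_supported("v1.7.0", "v1.24.9"))
```

Comparing `(major, minor)` tuples avoids the usual string-comparison pitfall, where `"1.9"` would sort after `"1.25"`.
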
## Acknowledgement

This project originally started as a distributed training operator for TensorFlow. We later merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We would also like to thank everyone who contributed to and maintained the original operators.

- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
- MXNet Operator: [list of contributors](https://github.com/kubeflow/mxnet-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mxnet-operator/blob/master/OWNERS).
- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).
Binary file not shown.
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
apiVersion: v2
name: training-operator
description: Installs Kubeflow Training Operator
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 1.8.1

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.8.1"