
Commit 40191fb

doc edits for the new KFTO section (#106)
* doc edits for the new KFTO section
* doc-update-kfto-dt-module peer review comments
* doc-update-kfto-dt-module fix typo
* doc-update-kfto-dt-module address review comments
* doc-update-kfto-dt-module clarify previous edit
* doc-update-kfto-dt-module one clarification
1 parent 36d216f commit 40191fb

File tree

4 files changed: 41 additions & 34 deletions


9_distributed_training_kfto.ipynb

Lines changed: 32 additions & 23 deletions
@@ -4,23 +4,23 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "# Training the Fraud Detection model with Kubeflow Training Operator"
+   "# Training the Fraud Detection model with the Kubeflow Training Operator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "The example fraud detection model is very small and quickly trained. However, for many large models, training requires multiple GPUs and often multiple machines. In this notebook, you learn how to train a model by using Kubeflow Training Operator on OpenShift AI to scale out the model training. You use the Training Operator SDK to create a PyTorchJob executing the provided training script."
+   "The example fraud detection model is small and quickly trained. For many large models, training requires multiple GPUs and often multiple machines. In this notebook, you learn how to train a model by using the Kubeflow Training Operator on OpenShift AI to scale out model training. You use the Training Operator SDK to create a PyTorchJob that executes the provided model training script."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Preparing Training Operator SDK\n",
+   "### Install the Training Operator SDK\n",
    "\n",
-   "Training operator SDK is not available by default on Tensorflow notebooks.Therefore it needs to be installed first."
+   "The Training Operator SDK is not available by default with the Tensorflow workbench image. Run the following command to install it:"
   ]
  },
  {
@@ -36,9 +36,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Preparing the data\n",
+   "### Prepare the data\n",
    "\n",
-   "Normally, the training data for your model would be available in a shared location. For this example, the data is local. You must upload it to your object storage so that you can see how data loading from a shared data source works. Training data is downloaded via the training script and distributed among workers by DistributedSampler."
+   "Typically, the training data for your model is available in a shared location. For this example, the data is local. You upload it to your object storage so that you can learn how to load data from a shared data source. The provided model training script downloads the training data. The PyTorch DistributedSampler utility distributes the tasks among worker nodes."
   ]
  },
  {
@@ -63,12 +63,12 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Authenticate to the cluster by using the OpenShift console login\n",
+   "### Authenticate the Training Operator SDK to the OpenShift cluster\n",
    "\n",
-   "Training Operator SDK requires authenticated access to the OpenShift cluster to create PyTorchJobs. The easiest way to get access details is through the OpenShift web console. \n",
+   "The Training Operator SDK requires authenticated access to the OpenShift cluster so that it can create PyTorchJobs. The easiest way to get access details is by using the OpenShift web console. \n",
    " \n",
    "\n",
-   "1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the web console.\n",
+   "1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the OpenShift web console.\n",
    "\n",
    " <figure>\n",
    " <img src=\"./assets/copy-login.png\" alt=\"copy login\" >\n",
@@ -84,7 +84,7 @@
    " - token: `sha256~LongString`\n",
    " - server: `https://api.your-cluster.domain.com:6443`\n",
    " \n",
-   "4. In the following code cell replace the token and server values with the values that you noted in Step 3.\n",
+   "4. In the following code cell, replace the token and server values with the values that you noted in Step 3.\n",
    " For example:\n",
    " ```\n",
    " api_server = \"https://api.your-cluster.domain.com:6443\"\n",
@@ -125,9 +125,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Initialize Training client\n",
+   "### Initialize the Training client\n",
    "\n",
-   "Initialize Training client using provided user credentials."
+   "Initialize the Training client by using the provided user credentials."
   ]
  },
  {
@@ -145,15 +145,25 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "### Create PyTorchJob\n",
+   "### Create a PyTorchJob\n",
    "\n",
-   "Submit PyTorchJob using Training Operator SDK client.\n",
+   "Use the Training Operator SDK client to submit a PyTorchJob.\n",
    "\n",
-   "Training script is imported from `kfto-scripts` folder.\n",
+   "The model training script is imported from the `kfto-scripts` folder.\n",
    "\n",
-   "Training script loads and distributes training dataset among nodes, performs distributed training, evaluation using test dataset, exports the trained model to onnx format and uploads it to the S3 bucket specified in provided connection.\n",
+   "The model training script loads and distributes the training data set among nodes, performs distributed training, evaluates by using the test data set, and exports the trained model to ONNX format and uploads it to the S3 bucket that is specified in the provided connection.\n",
    "\n",
-   "Important note - If Kueue component is enabled in RHOAI then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name in the script below, also uncomment label declaration in create_job function."
+   "*Important note:* If one of the following is true:\n",
+   "\n",
+   "* You are not using the Red Hat Sandbox test environment. \n",
+   "\n",
+   "* The Kueue component is enabled for OpenShift AI and you have created all Kueue related resources (`ResourceFlavor`, `ClusterQueue`, and `LocalQueue`) and set the `local_queue_name` to \"local-queue\", as described in the _Setting up Kueue resources_ section of this Fraud Detection workshop/tutorial.\n",
+   "\n",
+   "You must make the following edits to the script in the next Python cell:\n",
+   "\n",
+   "* Provide the `LocalQueue` name.\n",
+   "\n",
+   "* Uncomment the label declaration in the `create_job` function."
   ]
  },
  {
@@ -174,11 +184,10 @@
    " V1SecretKeySelector\n",
    ")\n",
    "\n",
-   "# Job name serves as unique identifier to retrieve job related informations using SDK\n",
+   "# Job name serves as a unique identifier to retrieve job-related information by using the SDK\n",
    "job_name = \"fraud-detection\"\n",
    "\n",
-   "# Specifies Kueue LocalQueue name.\n",
-   "# If Kueue component is enabled then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name here.\n",
+   "# If the Kueue component is enabled, and you have created the Kueue-related resources (ResourceFlavor, ClusterQueue and LocalQueue), provide the LocalQueue name on the following line:\n",
    "local_queue_name = \"local-queue\"\n",
    "\n",
    "client.create_job(\n",
@@ -192,8 +201,8 @@
    " \"cpu\": 1,\n",
    " },\n",
    " base_image=\"quay.io/modh/training:py311-cuda124-torch251\",\n",
-   " # Uncomment the following line to add the queue-name label if Kueue component is enabled in RHOAI and all Kueue related resources are created. Replace `local_queue_name` with the name of your LocalQueue\n",
-   "# labels={\"kueue.x-k8s.io/queue-name\": local_queue_name},\n",
+   " # If the Kueue component is enabled and you have created the Kueue-related resources (ResourceFlavor, ClusterQueue and LocalQueue), then uncomment the following line to add the queue-name label:\n",
+   " # labels={\"kueue.x-k8s.io/queue-name\": local-queue},\n",
    " env_vars=[\n",
    " V1EnvVar(name=\"AWS_ACCESS_KEY_ID\", value=os.environ.get(\"AWS_ACCESS_KEY_ID\")),\n",
    " V1EnvVar(name=\"AWS_S3_BUCKET\", value=os.environ.get(\"AWS_S3_BUCKET\")),\n",
@@ -271,7 +280,7 @@
   "source": [
    "### Delete jobs\n",
    "\n",
-   "When finished you can delete the PyTorchJob."
+   "After the PyTorchJob is finished, you can delete it."
   ]
  },
  {
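
Taken together, the notebook cells that this diff edits follow a flow like the sketch below. This is a hedged reconstruction, not the notebook's actual code: the token, server, worker count, and the stub `train_func` are placeholders, and only the `create_job` arguments visible in the diff (`base_image`, the CPU request, the Kueue label, `env_vars`) are confirmed by the source; the rest are commonly used parameters of the Kubeflow Training SDK (typically installed with `pip install kubeflow-training`).

```python
# Hedged sketch of the notebook's flow: authenticate, initialize the Training
# client, submit a PyTorchJob, then inspect and delete it. Placeholder values
# are marked; this is not the workshop's exact code.
import os

from kubernetes import client as k8s_client
from kubernetes.client import V1EnvVar
from kubeflow.training import TrainingClient

# 1. Credentials copied from the OpenShift web console ("Copy login command").
api_server = "https://api.your-cluster.domain.com:6443"  # replace
token = "sha256~LongString"                              # replace

configuration = k8s_client.Configuration()
configuration.host = api_server
configuration.api_key = {"authorization": f"Bearer {token}"}
# configuration.verify_ssl = False  # only for clusters with self-signed certificates

# 2. Initialize the Training client with those credentials.
client = TrainingClient(client_configuration=configuration)

# 3. Placeholder training function; the workshop imports the real one from
#    the kfto-scripts folder (train_pytorch_cpu.py).
def train_func():
    print("distributed training happens here")

job_name = "fraud-detection"
local_queue_name = "local-queue"  # only relevant when Kueue is enabled

client.create_job(
    name=job_name,
    train_func=train_func,
    num_workers=2,                      # assumption: two worker replicas
    resources_per_worker={"cpu": 1},
    base_image="quay.io/modh/training:py311-cuda124-torch251",
    # Uncomment when Kueue and its ResourceFlavor/ClusterQueue/LocalQueue exist:
    # labels={"kueue.x-k8s.io/queue-name": local_queue_name},
    env_vars=[
        V1EnvVar(name="AWS_ACCESS_KEY_ID", value=os.environ.get("AWS_ACCESS_KEY_ID")),
        V1EnvVar(name="AWS_S3_BUCKET", value=os.environ.get("AWS_S3_BUCKET")),
    ],
)

# 4. Check the job, then clean up once it has finished.
print(client.get_job_conditions(name=job_name))
# client.delete_job(name=job_name)
```

Because the training function runs inside the `base_image` container on each worker, any libraries it needs must already be available in that image.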

workshop/docs/modules/ROOT/nav.adoc

Lines changed: 1 addition & 1 deletion
@@ -28,4 +28,4 @@
  
  * 5. Distributed Training
  ** xref:distributed-jobs-with-ray.adoc[1. Distributed Jobs with Ray]
- ** xref:distributed-jobs-with-kfto.adoc[2. Distributed Jobs with Training operator]
+ ** xref:distributed-jobs-with-kfto.adoc[2. Distributed Jobs with the Training Operator]
workshop/docs/modules/ROOT/pages/distributed-jobs-with-kfto.adoc

Lines changed: 5 additions & 5 deletions
@@ -1,15 +1,15 @@
  [id='distributed-jobs-with-kfto']
- = Distributing training jobs with Training operator
+ = Distributing training jobs with the Training Operator
  
- In previous sections of this {deliverable}, you trained the fraud model directly in a notebook and then in a pipeline. In this section, you learn how to train the model by using Training operator. Training operator is a tool used for scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch.
+ In previous sections of this {deliverable}, you trained the fraud detection model directly in a notebook and then in a pipeline. In this section, you learn how to train the model by using the Training Operator. The Training Operator is a tool for scalable distributed training of machine learning (ML) models created with various ML frameworks, such as PyTorch.
  
- This section demonstrates how you can use Training operator to distribute the training of a machine learning model across multiple CPUs. While distributing training is not necessary for a simple model, applying it to the example fraud model is a good way for you to learn how to use distributed training for more complex models that require more compute power, such as multiple GPUs across multiple machines.
+ You can use the Training Operator to distribute the training of a machine learning model across many hardware resources. While distributed training is not necessary for a simple model, applying it to the example fraud detection model is a good way for you to learn how to use distributed training for more complex models that require more compute power, such as many GPUs across many machines.
  
- In your notebook environment, open the `9_distributed_training_kfto.ipynb` file and follow the instructions directly in the notebook. The instructions guide you through setting authentication, initializing Training operator client and submitting PyTorchJob.
+ In your notebook environment, open the `9_distributed_training_kfto.ipynb` file and follow the instructions directly in the notebook. The instructions guide you through setting authentication, initializing the Training Operator client, and submitting a PyTorchJob.
  
  Optionally, if you want to view the Python code for this section, you can find it in the `kfto-scripts/train_pytorch_cpu.py` file.
  
  image::distributed/kfto-jupyter-notebook.png[Jupyter Notebook]
  
- For more information about PyTorchJob training on Training operator, see the https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/pytorch[Training operator PyTorchJob guide].
+ For more information about PyTorchJob training with the Training Operator, see the https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/pytorch[Training operator PyTorchJob guide].
  
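The page above defers the actual distributed training logic to `kfto-scripts/train_pytorch_cpu.py`, and the notebook diff notes that PyTorch's `DistributedSampler` shards the data across workers. Purely as a hedged illustration of that pattern (synthetic data and an arbitrary tiny model, not the workshop script), a CPU-based training function could look like this:

```python
# Generic sketch of CPU-based distributed training with DistributedSampler and
# DistributedDataParallel; data, model, and hyperparameters are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def train_func():
    # A PyTorchJob injects RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
    # into every replica; init_process_group reads them.
    dist.init_process_group(backend="gloo")  # "gloo" for CPU-only workers

    # Placeholder data; the real script loads the fraud detection data set
    # from shared storage.
    features = torch.randn(1000, 5)
    labels = torch.randint(0, 2, (1000, 1)).float()
    dataset = TensorDataset(features, labels)

    # DistributedSampler gives each worker a disjoint shard of the data set.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DistributedDataParallel(
        torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```

Gradient averaging across workers is handled by `DistributedDataParallel` during `backward()`, so every worker ends an epoch with the same model weights.
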
workshop/docs/modules/ROOT/pages/setting-up-kueue-resources.adoc

Lines changed: 3 additions & 5 deletions
@@ -1,13 +1,11 @@
  [id='setting-up-kueue-resources']
  = Setting up Kueue resources
  
- NOTE: If you are using the Red Hat Developer Sandbox you can skip this step and move on to the next section, xref:creating-a-workbench.adoc[Create a Workbench].
+ In this section, you prepare your {deliverable} environment so that you can use Kueue for distributing training with the Training Operator.
  
- NOTE: If you do not intend to complete the Distributing training jobs with Training operator section of this {deliverable} you can skip this step and move on to the next section, xref:creating-a-workbench.adoc[Create a Workbench].
+ In the _Distributing training jobs with the Training Operator_ section of this {deliverable}, you implement a distributed training job by using Kueue for managing job resources. With Kueue, you can manage cluster resource quotas and how different workloads consume them.
  
- In this section, you prepare your {deliverable} environment so that you can use Distributing training with Training operator.
- 
- Later in this {deliverable}, you implement a Distributed training job using Kueue for managing job resources. With Kueue, you can manage cluster resource quotas and how different workloads consume them.
+ NOTE: If you are using the Red Hat Developer Sandbox, or if you do not intend to use Kueue to schedule your training jobs in the _Distributing training jobs with the Training Operator_ section of this {deliverable}, skip this section and continue to the next section, xref:creating-a-workbench.adoc[Creating a workbench and selecting a workbench image].
  
  .Procedure
  
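The Kueue objects this page asks you to create (`ResourceFlavor`, `ClusterQueue`, `LocalQueue`) are ordinary Kubernetes custom resources; the procedure itself is not part of this diff. Purely as a hypothetical illustration, with made-up names (`local-queue`, `cluster-queue`, `my-project`) and the upstream `kueue.x-k8s.io/v1beta1` API, a `LocalQueue` could be created from Python like this:

```python
# Hypothetical illustration only: creating a Kueue LocalQueue with the
# Kubernetes Python client's dynamic API. All names are placeholders and the
# ResourceFlavor and ClusterQueue are assumed to exist already.
from kubernetes import config, dynamic
from kubernetes.client import api_client

config.load_kube_config()  # assumes you are already logged in to the cluster
dyn = dynamic.DynamicClient(api_client.ApiClient())

local_queue_api = dyn.resources.get(
    api_version="kueue.x-k8s.io/v1beta1", kind="LocalQueue"
)

local_queue = {
    "apiVersion": "kueue.x-k8s.io/v1beta1",
    "kind": "LocalQueue",
    "metadata": {"name": "local-queue", "namespace": "my-project"},
    # A LocalQueue points at a ClusterQueue, which in turn references a ResourceFlavor.
    "spec": {"clusterQueue": "cluster-queue"},
}

local_queue_api.create(body=local_queue, namespace="my-project")
```

A PyTorchJob is then admitted through that queue when it carries the `kueue.x-k8s.io/queue-name` label, which is the label the notebook's `create_job` cell adds when Kueue is enabled.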