|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# Training the Fraud Detection model with Kubeflow Training Operator" |
| 7 | + "# Training the Fraud Detection model with the Kubeflow Training Operator" |
8 | 8 | ] |
9 | 9 | }, |
10 | 10 | { |
11 | 11 | "cell_type": "markdown", |
12 | 12 | "metadata": {}, |
13 | 13 | "source": [ |
14 | | - "The example fraud detection model is very small and quickly trained. However, for many large models, training requires multiple GPUs and often multiple machines. In this notebook, you learn how to train a model by using Kubeflow Training Operator on OpenShift AI to scale out the model training. You use the Training Operator SDK to create a PyTorchJob executing the provided training script." |
| 14 | + "The example fraud detection model is small and quickly trained. For many large models, training requires multiple GPUs and often multiple machines. In this notebook, you learn how to train a model by using the Kubeflow Training Operator on OpenShift AI to scale out model training. You use the Training Operator SDK to create a PyTorchJob that executes the provided model training script." |
15 | 15 | ] |
16 | 16 | }, |
17 | 17 | { |
18 | 18 | "cell_type": "markdown", |
19 | 19 | "metadata": {}, |
20 | 20 | "source": [ |
21 | | - "### Preparing Training Operator SDK\n", |
| 21 | + "### Install the Training Operator SDK\n", |
22 | 22 | "\n", |
23 | | - "Training operator SDK is not available by default on Tensorflow notebooks.Therefore it needs to be installed first." |
| 23 | + "The Training Operator SDK is not available by default with the Tensorflow workbench image. Run the following command to install it:" |
24 | 24 | ] |
25 | 25 | }, |
26 | 26 | { |
|
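For reference, a minimal install cell, assuming the SDK is published on PyPI as `kubeflow-training`, could look like this:

```python
# Install the Kubeflow Training Operator SDK into the workbench environment
# (the package name is an assumption; adjust it to match the workshop's requirements)
!pip install kubeflow-training
```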
36 | 36 | "cell_type": "markdown", |
37 | 37 | "metadata": {}, |
38 | 38 | "source": [ |
39 | | - "### Preparing the data\n", |
| 39 | + "### Prepare the data\n", |
40 | 40 | "\n", |
41 | | - "Normally, the training data for your model would be available in a shared location. For this example, the data is local. You must upload it to your object storage so that you can see how data loading from a shared data source works. Training data is downloaded via the training script and distributed among workers by DistributedSampler." |
| 41 | + "Typically, the training data for your model is available in a shared location. For this example, the data is local. You upload it to your object storage so that you can learn how to load data from a shared data source. The provided model training script downloads the training data. The PyTorch DistributedSampler utility distributes the tasks among worker nodes." |
42 | 42 | ] |
43 | 43 | }, |
44 | 44 | { |
|
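To illustrate how `DistributedSampler` shards a data set across workers, here is a generic PyTorch sketch (not the exact code from the provided training script; the tensors, feature width, and batch size are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder tensors standing in for the fraud detection training data
dataset = TensorDataset(torch.randn(1000, 5), torch.randint(0, 2, (1000,)))

# In a PyTorchJob, num_replicas and rank come from the initialized process group;
# they are passed explicitly here so that the sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # vary the shuffling per epoch, consistently across workers
    for features, labels in loader:
        pass  # each worker trains only on its own shard of the data
```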
63 | 63 | "cell_type": "markdown", |
64 | 64 | "metadata": {}, |
65 | 65 | "source": [ |
66 | | - "### Authenticate to the cluster by using the OpenShift console login\n", |
| 66 | + "### Authenticate the Training Operator SDK to the OpenShift cluster\n", |
67 | 67 | "\n", |
68 | | - "Training Operator SDK requires authenticated access to the OpenShift cluster to create PyTorchJobs. The easiest way to get access details is through the OpenShift web console. \n", |
| 68 | + "The Training Operator SDK requires authenticated access to the OpenShift cluster so that it can create PyTorchJobs. The easiest way to get access details is by using the OpenShift web console. \n", |
69 | 69 | " \n", |
70 | 70 | "\n", |
71 | | - "1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the web console.\n", |
| 71 | + "1. To generate the command, select **Copy login command** from the username drop-down menu at the top right of the OpenShift web console.\n", |
72 | 72 | "\n", |
73 | 73 | " <figure>\n", |
74 | 74 | " <img src=\"./assets/copy-login.png\" alt=\"copy login\" >\n", |
|
84 | 84 | " - token: `sha256~LongString`\n", |
85 | 85 | " - server: `https://api.your-cluster.domain.com:6443`\n", |
86 | 86 | " \n", |
87 | | - "4. In the following code cell replace the token and server values with the values that you noted in Step 3.\n", |
| 87 | + "4. In the following code cell, replace the token and server values with the values that you noted in Step 3.\n", |
88 | 88 | " For example:\n", |
89 | 89 | " ```\n", |
90 | 90 | " api_server = \"https://api.your-cluster.domain.com:6443\"\n", |
|
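A sketch of how the token and server values are typically wrapped into a Kubernetes client configuration for the SDK (the placeholder values mirror the example above; the `verify_ssl` setting is an assumption for clusters that use self-signed certificates):

```python
from kubernetes import client as k8s_client

api_server = "https://api.your-cluster.domain.com:6443"
token = "sha256~LongString"

configuration = k8s_client.Configuration()
configuration.host = api_server
configuration.api_key = {"authorization": f"Bearer {token}"}
# Assumption: disable TLS verification only for clusters with self-signed certificates
configuration.verify_ssl = False
```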
125 | 125 | "cell_type": "markdown", |
126 | 126 | "metadata": {}, |
127 | 127 | "source": [ |
128 | | - "### Initialize Training client\n", |
| 128 | + "### Initialize the Training client\n", |
129 | 129 | "\n", |
130 | | - "Initialize Training client using provided user credentials." |
| 130 | + "Initialize the Training client by using the provided user credentials." |
131 | 131 | ] |
132 | 132 | }, |
133 | 133 | { |
|
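A sketch of the client initialization, assuming the `configuration` object built from the token and server values in the previous step (recent versions of the SDK accept a `client_configuration` argument):

```python
from kubeflow.training import TrainingClient

# Create the Training client against the authenticated cluster
client = TrainingClient(client_configuration=configuration)

# Optional sanity check: list any existing training jobs in the current namespace
print(client.list_jobs())
```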
145 | 145 | "cell_type": "markdown", |
146 | 146 | "metadata": {}, |
147 | 147 | "source": [ |
148 | | - "### Create PyTorchJob\n", |
| 148 | + "### Create a PyTorchJob\n", |
149 | 149 | "\n", |
150 | | - "Submit PyTorchJob using Training Operator SDK client.\n", |
| 150 | + "Use the Training Operator SDK client to submit a PyTorchJob.\n", |
151 | 151 | "\n", |
152 | | - "Training script is imported from `kfto-scripts` folder.\n", |
| 152 | + "The model training script is imported from the `kfto-scripts` folder.\n", |
153 | 153 | "\n", |
154 | | - "Training script loads and distributes training dataset among nodes, performs distributed training, evaluation using test dataset, exports the trained model to onnx format and uploads it to the S3 bucket specified in provided connection.\n", |
| 154 | + "The model training script loads and distributes the training data set among nodes, performs distributed training, evaluates by using the test data set, and exports the trained model to ONNX format and uploads it to the S3 bucket that is specified in the provided connection.\n", |
155 | 155 | "\n", |
156 | | - "Important note - If Kueue component is enabled in RHOAI then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name in the script below, also uncomment label declaration in create_job function." |
| 156 | + "*Important note:* If one of the following is true:\n", |
| 157 | + "\n", |
| 158 | + "* You are not using the Red Hat Sandbox test environment. \n", |
| 159 | + "\n", |
| 160 | + "* The Kueue component is enabled for OpenShift AI and you have created all Kueue related resources (`ResourceFlavor`, `ClusterQueue`, and `LocalQueue`) and set the `local_queue_name` to \"local-queue\", as described in the _Setting up Kueue resources_ section of this Fraud Detection workshop/tutorial.\n", |
| 161 | + "\n", |
| 162 | + "You must make the following edits to the script in the next Python cell:\n", |
| 163 | + "\n", |
| 164 | + "* Provide the `LocalQueue` name.\n", |
| 165 | + "\n", |
| 166 | + "* Uncomment the label declaration in the `create_job` function." |
157 | 167 | ] |
158 | 168 | }, |
159 | 169 | { |
|
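For context on the last steps that the markdown above describes (ONNX export and S3 upload), here is a simplified sketch of what such logic can look like inside a training script. The model, input width, object key, use of `boto3`, and the `AWS_S3_ENDPOINT` and `AWS_SECRET_ACCESS_KEY` variables are illustrative assumptions, not the exact contents of the provided `kfto-scripts` script:

```python
import os

import boto3
import torch


def export_and_upload(model: torch.nn.Module) -> None:
    # Export the trained model to ONNX format (a feature width of 5 is a placeholder)
    dummy_input = torch.randn(1, 5)
    torch.onnx.export(model, dummy_input, "model.onnx",
                      input_names=["inputs"], output_names=["outputs"])

    # Upload the exported file to the S3 bucket defined by the connection's environment variables
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["AWS_S3_ENDPOINT"],
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    s3.upload_file("model.onnx", os.environ["AWS_S3_BUCKET"], "models/fraud/1/model.onnx")
```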
174 | 184 | " V1SecretKeySelector\n", |
175 | 185 | ")\n", |
176 | 186 | "\n", |
177 | | - "# Job name serves as unique identifier to retrieve job related informations using SDK\n", |
| 187 | + "# Job name serves as a unique identifier to retrieve job-related information by using the SDK\n", |
178 | 188 | "job_name = \"fraud-detection\"\n", |
179 | 189 | "\n", |
180 | | - "# Specifies Kueue LocalQueue name.\n", |
181 | | - "# If Kueue component is enabled then you must create all Kueue related resources (ResourceFlavor, ClusterQueue and LocalQueue) and provide LocalQueue name here.\n", |
| 190 | + "# If the Kueue component is enabled, and you have created the Kueue-related resources (ResourceFlavor, ClusterQueue and LocalQueue), provide the LocalQueue name on the following line:\n", |
182 | 191 | "local_queue_name = \"local-queue\"\n", |
183 | 192 | "\n", |
184 | 193 | "client.create_job(\n", |
|
192 | 201 | " \"cpu\": 1,\n", |
193 | 202 | " },\n", |
194 | 203 | " base_image=\"quay.io/modh/training:py311-cuda124-torch251\",\n", |
195 | | - " # Uncomment the following line to add the queue-name label if Kueue component is enabled in RHOAI and all Kueue related resources are created. Replace `local_queue_name` with the name of your LocalQueue\n", |
196 | | - "# labels={\"kueue.x-k8s.io/queue-name\": local_queue_name},\n", |
| 204 | + " # If the Kueue component is enabled and you have created the Kueue-related resources (ResourceFlavor, ClusterQueue and LocalQueue), then uncomment the following line to add the queue-name label:\n", |
| 205 | + " # labels={\"kueue.x-k8s.io/queue-name\": local-queue},\n", |
197 | 206 | " env_vars=[\n", |
198 | 207 | " V1EnvVar(name=\"AWS_ACCESS_KEY_ID\", value=os.environ.get(\"AWS_ACCESS_KEY_ID\")),\n", |
199 | 208 | " V1EnvVar(name=\"AWS_S3_BUCKET\", value=os.environ.get(\"AWS_S3_BUCKET\")),\n", |
|
271 | 280 | "source": [ |
272 | 281 | "### Delete jobs\n", |
273 | 282 | "\n", |
274 | | - "When finished you can delete the PyTorchJob." |
| 283 | + "After the PyTorchJob is finished, you can delete it." |
275 | 284 | ] |
276 | 285 | }, |
277 | 286 | { |
|
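A sketch of the cleanup call, assuming the `client` and `job_name` defined earlier; checking the job status first is optional:

```python
# Optionally confirm that the job has completed before removing it
print(client.get_job(job_name).status)

# Delete the PyTorchJob and its worker pods from the cluster
client.delete_job(job_name)
```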