Commit deed1ce
feat(trainer): add dataset and model initializer support to container backend (#188)
* feat(trainer): add dataset and model initializer support to container backend
Add support for dataset and model initializers in the container backend
to bring it to feature parity with the Kubernetes backend.
Changes:
- Add utility functions for building initializer commands and environment variables
- Implement _run_initializers() and _run_single_initializer() methods in ContainerBackend
- Run initializers sequentially before training containers start
- Download datasets to /workspace/dataset and models to /workspace/model
- Track initializer containers as separate steps in TrainJob
- Support all initializer types: HuggingFace, S3, and DataCache
- Add comprehensive unit tests for all initializer configurations
- Handle initializer failures with proper cleanup and error messages
Fixes #171
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
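The sequencing described in the list above (initializers run one at a time before training, each becomes its own step, and a failure triggers cleanup) might look roughly like the sketch below. It is not the code from this commit: `run_initializers`, `run_one`, and `cleanup` are hypothetical names, and only the two target paths come from the message.

```python
from typing import Callable

# Target paths taken from the commit message.
DATASET_PATH = "/workspace/dataset"
MODEL_PATH = "/workspace/model"


def run_initializers(
    run_one: Callable[[str, str], int],  # hypothetical: runs one container, returns its exit code
    cleanup: Callable[[], None],         # hypothetical: removes containers/network on failure
    dataset: bool = False,
    model: bool = False,
) -> list[str]:
    """Run the requested initializers sequentially and return the step names that ran."""
    plan = []
    if dataset:
        plan.append(("dataset-initializer", DATASET_PATH))
    if model:
        plan.append(("model-initializer", MODEL_PATH))

    steps = []
    for name, target in plan:
        exit_code = run_one(name, target)
        if exit_code != 0:
            # Fail fast: clean up before surfacing the error, so no
            # training container is started on partial data.
            cleanup()
            raise RuntimeError(f"{name} failed with exit code {exit_code}")
        steps.append(name)
    return steps
```

Only after this function returns would the training containers be started.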
* feat(trainer): address reviewer feedback for initializer support
- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)
All tests passing (173/173). Addresses all feedback from PR #188.
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
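As a rough illustration of "wait API in adapters instead of polling" with a configurable timeout, the sketch below blocks on a single adapter call rather than looping over status checks. The `ContainerAdapter` protocol, its method signatures, and `wait_for_initializer` are assumptions made for the example, not the adapters' actual interface. Only the 600-second default comes from the message.

```python
from typing import Protocol

DEFAULT_INITIALIZER_TIMEOUT = 600  # seconds; default stated in the commit message


class ContainerAdapter(Protocol):
    """Hypothetical adapter surface, assumed for this sketch."""

    def wait(self, container_id: str, timeout: int) -> int:
        """Block until the container exits; return its exit code or raise TimeoutError."""
        ...

    def remove(self, container_id: str) -> None:
        ...


def wait_for_initializer(
    adapter: ContainerAdapter,
    container_id: str,
    timeout: int = DEFAULT_INITIALIZER_TIMEOUT,
) -> None:
    """Wait once for an initializer container and remove it on success."""
    try:
        exit_code = adapter.wait(container_id, timeout)
    except TimeoutError as exc:
        raise RuntimeError(
            f"initializer {container_id} did not finish within {timeout}s"
        ) from exc
    if exit_code != 0:
        raise RuntimeError(f"initializer {container_id} exited with code {exit_code}")
    # Successful initializer containers are cleaned up after completion.
    adapter.remove(container_id)
```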
* chore(trainer): add cleanup helper to reduce duplication
Add _cleanup_container_resources() helper method to consolidate
duplicated cleanup logic for stopping/removing containers and
deleting networks. Refactor 5 locations across train(), initializer
handlers, and delete_job() to use this helper.
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
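A consolidated cleanup helper of the kind described above could be sketched as follows. This is an illustration only: the adapter methods `stop`, `remove`, and `delete_network` are assumed names, and the real `_cleanup_container_resources()` signature is not shown in this message.

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger(__name__)


def cleanup_container_resources(
    adapter,                      # hypothetical adapter exposing stop/remove/delete_network
    container_ids: Iterable[str],
    network_id: Optional[str] = None,
) -> None:
    """Best-effort cleanup: stop and remove containers, then delete the network."""
    for container_id in container_ids:
        # Each action gets its own try-except so a failed stop does not
        # prevent the remove, and one bad container does not block the rest.
        for action in (adapter.stop, adapter.remove):
            try:
                action(container_id)
            except Exception as exc:
                logger.warning("cleanup of %s failed: %s", container_id, exc)
    if network_id is not None:
        try:
            adapter.delete_network(network_id)
        except Exception as exc:
            logger.warning("failed to delete network %s: %s", network_id, exc)
```

Call sites such as a failed initializer or a job deletion can then share this one function instead of repeating the stop/remove/delete sequence.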
* fix(trainer): use correct initializer images and working directory
Address feedback for initializer support in container backend:
- Use separate images for dataset/model initializers:
  - kubeflow/dataset-initializer:latest for datasets
  - kubeflow/model-initializer:latest for models
  (instead of kubeflow/training-operator:latest)
- Update python commands to use pkg.initializers module:
  - python -m pkg.initializers.dataset (for dataset)
  - python -m pkg.initializers.model (for model)
- Change initializer working_dir from /workspace to /app
  per Dockerfile convention
Refs: https://github.com/kubeflow/trainer/tree/master/cmd/initializers
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
* fix(container): address PR review comments for initializer support
- Use GHCR images as default for dataset/model initializers
- Replace suppress with try-except blocks
- Refactor initializer utils with ContainerInitializer dataclass
- Add get_dataset_initializer and get_model_initializer functions
- Remove DataCache support (unsupported in container backend)
- Merge initializer tests into test_train() and test_get_job_logs()
- Remove duplicate test functions
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
* fix(container): add name field to ContainerInitializer and remove init_type
- Add name field to ContainerInitializer dataclass
- Set name='dataset-initializer' and name='model-initializer' in utils
- Remove init_type parameter from _run_single_initializer()
- Use container_init.name for labels and log messages
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
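To make the resulting shape concrete, here is a minimal sketch of a `ContainerInitializer` dataclass with a `name` field and the two factory functions. The field set, the function signatures, and the label keys are assumptions for illustration. The names, module commands, and `/app` working directory are the values stated in this message. The image is passed in by the caller because the default GHCR image paths are not given here.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerInitializer:
    # Field set is assumed; only `name` is confirmed by the commit message.
    name: str
    image: str
    command: list[str]
    env: dict[str, str] = field(default_factory=dict)
    working_dir: str = "/app"  # per the Dockerfile convention noted above


def get_dataset_initializer(image: str, env: dict[str, str]) -> ContainerInitializer:
    """Hypothetical signature: describe the dataset initializer container."""
    return ContainerInitializer(
        name="dataset-initializer",
        image=image,
        command=["python", "-m", "pkg.initializers.dataset"],
        env=dict(env),
    )


def get_model_initializer(image: str, env: dict[str, str]) -> ContainerInitializer:
    """Hypothetical signature: describe the model initializer container."""
    return ContainerInitializer(
        name="model-initializer",
        image=image,
        command=["python", "-m", "pkg.initializers.model"],
        env=dict(env),
    )


def initializer_labels(init: ContainerInitializer, job_name: str) -> dict[str, str]:
    # Label keys are invented for this example; the point is that the
    # step label derives from `init.name` rather than a separate init_type.
    return {"example.io/job-name": job_name, "example.io/step": init.name}
```

With the name carried on the dataclass, a single `_run_single_initializer()` style function needs no extra `init_type` argument to build labels or log messages.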
---------
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
1 parent 36e2282 commit deed1ce
7 files changed, +688 -48 lines changed
- kubeflow/trainer/backends/container
  - adapters