
feat(trainer): add dataset and model initializer support to container backend #188

Merged
google-oss-prow[bot] merged 6 commits into kubeflow:main from
HKanoje:feat/171-add-initializer-support-container-backend
Feb 12, 2026

@HKanoje
Contributor

@HKanoje HKanoje commented Dec 4, 2025

What this PR does / why we need it

This PR implements dataset and model initializer support in the container backend, bringing it to feature parity with the Kubernetes backend. This addresses issue #171 by enabling users to automatically download and prepare datasets and models before training starts.

Solution Overview

This implementation adds full initializer support to the container backend by:

  1. Running initializer containers before training - Initializers execute sequentially (dataset first, then model) before any training containers start
  2. Using shared volumes - All containers (initializers and training nodes) share the same workspace directory on the host
  3. Proper lifecycle management - Initializers must complete successfully before training begins, with automatic cleanup on failures
  4. Comprehensive error handling - Clear error messages and proper resource cleanup when initialization fails

Detailed Changes

1. New Utility Functions (kubeflow/trainer/backends/container/utils.py)

build_initializer_command(initializer, init_type)

Builds the appropriate container command based on initializer type:

  • HuggingFace: Uses kubeflow.storage_initializer.hugging_face module
  • S3: Uses kubeflow.storage_initializer.s3 module
  • DataCache: Uses kubeflow.storage_initializer.datacache module

build_initializer_env(initializer, init_type)

Constructs environment variables from initializer configuration:

  • Sets STORAGE_URI from the initializer config
  • Sets OUTPUT_PATH to /workspace/dataset or /workspace/model based on type
  • Adds optional fields like ACCESS_TOKEN, ENDPOINT, REGION, etc.
  • Handles DataCache-specific variables like CLUSTER_SIZE, METADATA_LOC

get_initializer_image()

Returns the initializer container image (kubeflow/training-operator:latest).
This can be made configurable via backend config in future iterations.

2. Enhanced ContainerBackend (kubeflow/trainer/backends/container/backend.py)

_run_initializers(job_name, initializer, workdir, network_id)

Orchestrates the initialization phase:

  • Pulls the initializer image if needed (respects pull_policy)
  • Runs dataset initializer if configured
  • Runs model initializer if configured
  • Ensures proper sequencing and error propagation

_run_single_initializer(job_name, initializer_config, init_type, image, workdir, network_id)

Executes a single initializer container:

  • Creates container with proper labels for tracking
  • Mounts shared volume to /workspace
  • Monitors container status with configurable timeout (default 10 minutes)
  • Waits for successful completion (exit code 0)
  • Captures and reports logs on failure
  • Cleans up failed containers automatically

Updated train() method

  • Creates network and working directory first
  • Runs initializers before generating training script
  • Only proceeds to training if initialization succeeds
  • Maintains backward compatibility (initializers are optional)

Updated __get_trainjob_from_containers()

  • Correctly counts only training nodes for num_nodes (excludes initializers)
  • Includes initializer containers in the steps list
  • Properly tracks initializer status
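
The counting rule above can be sketched as follows. This is a minimal illustration, not the backend's code: the label key `example.org/step` and the dictionary shape of a container record are assumptions made for the example.

```python
# Step names used for initializer containers in this PR.
INITIALIZER_STEPS = {"dataset-initializer", "model-initializer"}

# Hypothetical label key; the real backend builds it from its own label prefix.
STEP_LABEL = "example.org/step"


def summarize_steps(containers: list[dict]) -> tuple[int, list[str]]:
    """Return (num_nodes, step names).

    Initializers appear in the step list but are not counted as training nodes.
    """
    steps = [c["labels"][STEP_LABEL] for c in containers]
    num_nodes = sum(1 for step in steps if step not in INITIALIZER_STEPS)
    return num_nodes, steps
```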

Updated get_job_logs()

  • Supports fetching logs from specific initializer steps
  • When requesting node-0 logs (default), only shows training container logs
  • Can explicitly request initializer logs with step="dataset-initializer" or step="model-initializer"

3. Comprehensive Test Coverage (kubeflow/trainer/backends/container/backend_test.py)

Added 11 new test cases covering:

Initialization Success Scenarios

  • HuggingFace dataset initializer - Tests storage_uri parsing and access_token handling
  • S3 dataset initializer - Tests endpoint, region, and credential configuration
  • HuggingFace model initializer - Tests model downloads with ignore_patterns
  • S3 model initializer - Tests S3-compatible storage with authentication
  • Both dataset and model - Tests sequential execution of both initializers
  • DataCache initializer - Tests distributed cache configuration with metadata_loc and cluster size

Log Retrieval

  • Tests getting logs from dataset-initializer step
  • Tests getting logs from model-initializer step
  • Tests that default node logs exclude initializer logs

Error Handling

  • Non-zero exit code - Verifies proper error reporting when initializer fails
  • Timeout handling - Ensures timeout errors are caught and reported
  • Resource cleanup - Confirms containers and networks are cleaned up on failure

Implementation Details

Initialization Flow

  1. User calls train() with initializer parameter

  2. ContainerBackend creates:
   ├── Working directory: ~/.kubeflow/trainer/containers/<job-name>/
   └── Network: <job-name>-net

  3. If initializer.dataset is set:
   ├── Create dataset-initializer container
   ├── Mount workdir to /workspace
   ├── Run initialization (downloads to /workspace/dataset)
   └── Wait for completion (exit code 0)

  4. If initializer.model is set:
   ├── Create model-initializer container
   ├── Mount workdir to /workspace
   ├── Run initialization (downloads to /workspace/model)
   └── Wait for completion (exit code 0)

  5. Create training containers:
   ├── Mount same workdir to /workspace
   └── Access data at /workspace/dataset and /workspace/model
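
The ordering in this flow can be sketched as a small driver. This is an illustration under stated assumptions: `run_job` and its callable parameters are hypothetical names, and each initializer callable is assumed to return the container's exit code.

```python
from typing import Callable, Optional


def run_job(
    run_dataset_init: Optional[Callable[[], int]],
    run_model_init: Optional[Callable[[], int]],
    start_training: Callable[[], None],
) -> None:
    """Run the configured initializers in order, then start training."""
    ordered = (
        ("dataset-initializer", run_dataset_init),
        ("model-initializer", run_model_init),
    )
    for name, run in ordered:
        if run is None:
            continue  # Initializers are optional.
        exit_code = run()
        if exit_code != 0:
            # Training never starts if any initializer fails.
            raise RuntimeError(f"{name} failed with exit code {exit_code}")
    start_training()
```

Raising before `start_training()` is what keeps a failed download from producing a training run against a half-populated workspace.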

Volume Layout

Host: ~/.kubeflow/trainer/containers/<job-name>/

├── dataset/ (from dataset-initializer, if configured)
│   └── <downloaded dataset files>
├── model/ (from model-initializer, if configured)
│   └── <downloaded model files>
└── outputs/ (accessible to all training nodes)
    └── <training outputs>

Container Mount: /workspace/

├── dataset/ (read by training code)
├── model/ (read by training code)
└── outputs/ (written by training code)
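
From inside a training container, the layout above could be inspected with a helper like the one below. It is a sketch only: `list_inputs` is a hypothetical name, and it assumes the initializers write plain files directly under `dataset/` and `model/`.

```python
from pathlib import Path


def list_inputs(workspace: str = "/workspace") -> dict[str, list[str]]:
    """List the entries the initializers placed under the shared mount."""
    found: dict[str, list[str]] = {}
    for name in ("dataset", "model"):
        directory = Path(workspace) / name
        # The directory is absent when the matching initializer was not configured.
        if directory.is_dir():
            found[name] = sorted(p.name for p in directory.iterdir())
        else:
            found[name] = []
    return found
```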

Testing Results

All tests pass with no regressions:

  • 43/43 container backend tests passed
  • 173/173 trainer module tests passed
  • make verify (ruff lint + format) passed
  • pre-commit hooks passed

Usage Example

from kubeflow.trainer import TrainerClient
from kubeflow.trainer.types import (
    CustomTrainer,
    Initializer,
    HuggingFaceDatasetInitializer,
    HuggingFaceModelInitializer,
)

# Define initializers
initializer = Initializer(
    dataset=HuggingFaceDatasetInitializer(
        storage_uri="hf://username/my-dataset",
        access_token="hf_xxxxx",
    ),
    model=HuggingFaceModelInitializer(
        storage_uri="hf://username/my-model",
        access_token="hf_xxxxx",
    ),
)

# Define trainer
trainer = CustomTrainer(
    func=train_func,
    num_nodes=2,
)

# Train with initializers
client = TrainerClient()
job_name = client.train(
    trainer=trainer,
    initializer=initializer,  # <-- Now works with container backend!
)

# Data is automatically available at:
# - /workspace/dataset (in training containers)
# - /workspace/model (in training containers)

@github-actions
Contributor

github-actions bot commented Dec 4, 2025

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

@Fiona-Waters Fiona-Waters left a comment

Thanks @HKanoje for this PR! I've left some comments, ptal! Thanks

"""
# Use the training-operator image which contains initializer scripts
# This can be made configurable via backend config in the future
return "kubeflow/training-operator:latest"
Contributor

Should we make this configurable rather than hardcoding it?

Contributor Author

Yes, definitely. I've made this configurable via ContainerBackendConfig.initializer_image (default: kubeflow/training-operator:latest). Users can now customize it when creating the backend.

try:
import time

timeout = 600 # 10 minutes timeout for initialization
Contributor

Should this be configurable, or is 10 minutes always going to be enough time?

Contributor Author

Added ContainerBackendConfig.initializer_timeout (default: 600 seconds / 10 minutes). This gives users flexibility for large datasets/models that may take longer to download.

# Clean up the failed container
from contextlib import suppress

with suppress(Exception):
Contributor

As well as cleaning up when a failure occurs, should we clean up the initializer containers when they have been successful also?

Contributor Author

Implemented wait_for_container() in the adapter interface and both Docker/Podman adapters. This replaces the polling loop with a single blocking wait call - much more efficient.
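
A minimal sketch of a blocking wait replacing a polling loop. The method name `wait_for_container` comes from the comment above, but its signature, its return value (the exit code), and the use of `TimeoutError` are assumptions made for this example.

```python
from typing import Protocol


class ContainerAdapter(Protocol):
    """Assumed adapter shape; the real interface may differ."""

    def wait_for_container(self, container_id: str, timeout: int) -> int:
        """Block until the container exits and return its exit code."""
        ...


def wait_for_initializer(
    adapter: ContainerAdapter, container_id: str, timeout: int = 600
) -> None:
    """Wait once for the container instead of polling its status in a loop."""
    try:
        exit_code = adapter.wait_for_container(container_id, timeout=timeout)
    except TimeoutError as e:
        raise RuntimeError(
            f"Initializer {container_id} did not finish within {timeout}s"
        ) from e
    if exit_code != 0:
        raise RuntimeError(
            f"Initializer {container_id} failed with exit code {exit_code}"
        )
```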

logger.debug(f"Created network: {network_id}")

# Run initializers if configured
if initializer:
Contributor

If the initializer fails should we clean up the network we have created?

Contributor Author

Absolutely! Added cleanup for successful initializer containers after completion to prevent accumulation. Also added cleanup for timed-out containers.

Contributor

Great! Would it make sense to add a helper function for the cleanup logic to reduce duplication?

Contributor Author

Done! Added _cleanup_container_resources() helper method in commit 0b7a952 to consolidate the duplicated cleanup logic across exception handlers and delete_job().
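
A cleanup helper of this kind might look like the sketch below. It is illustrative only: the adapter method names (`stop_container`, `remove_container`, `delete_network`) are assumptions, and the real helper is a method on the backend rather than a free function.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def cleanup_container_resources(
    adapter,
    container_ids: Optional[list[str]] = None,
    network_id: Optional[str] = None,
) -> None:
    """Best-effort cleanup: a failure on one resource does not block the rest."""
    for container_id in container_ids or []:
        try:
            adapter.stop_container(container_id)
            adapter.remove_container(container_id)
        except Exception as e:
            # Log and continue with the remaining containers.
            logger.warning("Failed to clean up container %s: %s", container_id, e)
    if network_id:
        try:
            adapter.delete_network(network_id)
        except Exception as e:
            logger.warning("Failed to delete network %s: %s", network_id, e)
```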

if isinstance(
initializer, (types.HuggingFaceDatasetInitializer, types.HuggingFaceModelInitializer)
)
else "python -m kubeflow.storage_initializer.datacache "
Contributor

Here we are setting datacache as the default/fallback; do we want to do this? In the Kubernetes backend we offer 2 options and raise a ValueError if the type is invalid.

Contributor Author

You're right - that was inconsistent with the kubernetes backend. Changed to raise ValueError with a clear message listing all supported types instead of defaulting to datacache.
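
The "no silent fallback" behaviour can be sketched with type-based dispatch. The two dataclasses here are simplified stand-ins for the SDK types, and the module paths are the ones adopted later in this review; treat the exact command string as an assumption.

```python
from dataclasses import dataclass


@dataclass
class HuggingFaceDatasetInitializer:  # stand-in for the SDK type
    storage_uri: str


@dataclass
class HuggingFaceModelInitializer:  # stand-in for the SDK type
    storage_uri: str


DATASET_TYPES = (HuggingFaceDatasetInitializer,)
MODEL_TYPES = (HuggingFaceModelInitializer,)


def build_initializer_command(initializer) -> list[str]:
    """Pick the initializer module by type; reject anything unsupported."""
    if isinstance(initializer, DATASET_TYPES):
        module = "pkg.initializers.dataset"
    elif isinstance(initializer, MODEL_TYPES):
        module = "pkg.initializers.model"
    else:
        supported = ", ".join(t.__name__ for t in DATASET_TYPES + MODEL_TYPES)
        raise ValueError(
            f"Unsupported initializer type {type(initializer).__name__}. "
            f"Supported types: {supported}"
        )
    return ["bash", "-c", f"python -m {module}"]
```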

import time

timeout = 600 # 10 minutes timeout for initialization
polling_interval = 2
Contributor

Could the wait API be used here instead of polling?

@kramaranya
Contributor

/ok-to-test

HKanoje added a commit to HKanoje/sdk that referenced this pull request Jan 5, 2026
- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)

All tests passing (173/173). Addresses all feedback from PR kubeflow#188.
@kramaranya
Contributor

Hey @HKanoje, could you please sign your commits?

@coveralls

coveralls commented Jan 5, 2026

Pull Request Test Coverage Report for Build 21811391977

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 160 of 195 (82.05%) changed or added relevant lines in 7 files are covered.
  • 21 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 68.424%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| kubeflow/trainer/backends/container/adapters/base.py | 2 | 3 | 66.67% |
| kubeflow/trainer/backends/container/backend_test.py | 59 | 61 | 96.72% |
| kubeflow/trainer/backends/container/utils.py | 28 | 30 | 93.33% |
| kubeflow/trainer/backends/container/adapters/podman.py | 1 | 9 | 11.11% |
| kubeflow/trainer/backends/container/adapters/docker.py | 1 | 11 | 9.09% |
| kubeflow/trainer/backends/container/backend.py | 66 | 78 | 84.62% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| kubeflow/trainer/backends/kubernetes/utils.py | 21 | 77.64% |

Totals
  • Change from base Build 21637340746: 0.9%
  • Covered Lines: 2921
  • Relevant Lines: 4269

💛 - Coveralls

HKanoje added a commit to HKanoje/sdk that referenced this pull request Jan 5, 2026
- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)

All tests passing (173/173). Addresses all feedback from PR kubeflow#188.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
@HKanoje HKanoje force-pushed the feat/171-add-initializer-support-container-backend branch from 768a6a9 to 0dbb6b6 Compare January 5, 2026 16:49
@HKanoje
Contributor Author

HKanoje commented Jan 5, 2026

@kramaranya Done! All commits are now signed.

@HKanoje
Contributor Author

HKanoje commented Feb 1, 2026

@kramaranya @szaher Please review the changes whenever you get a chance! Thanks!

Contributor

@Fiona-Waters Fiona-Waters left a comment

I ran this locally using docker/colima and have left some comments that should be addressed.
I have also created this quick PR in the trainer repo, which moves a Kubernetes import to method level because it was causing an error locally when looking for kubeconfig. This PR should be merged after that one. Hope that makes sense. Thanks for your work on this.

description="Configuration for training runtime sources",
)
initializer_image: str = Field(
default="kubeflow/training-operator:latest",
Contributor

I tested the initializer functionality with the local Docker/Podman container backend and found some issues relating to the use of this image that need to be addressed in this PR. This image is a controller image and does not contain initializer code. Instead, we should use kubeflow/dataset-initializer:latest and kubeflow/model-initializer:latest. The functionality should be updated to select whichever one is required for each initializer.

Contributor Author

Thanks for catching this! You're right - I've updated the implementation to use the correct images.

elif isinstance(
initializer, (types.HuggingFaceDatasetInitializer, types.HuggingFaceModelInitializer)
):
python_cmd = "python -m kubeflow.storage_initializer.hugging_face "
Contributor

Suggested change
- python_cmd = "python -m kubeflow.storage_initializer.hugging_face "
+ python_cmd = "python -m pkg.initializers.dataset"

Contributor Author

Done!

environment=env,
labels=labels,
volumes=volumes,
working_dir=constants.WORKSPACE_PATH,
Contributor

This should be /app - https://github.com/kubeflow/trainer/blob/master/cmd/initializers/dataset/Dockerfile#L3

Suggested change
- working_dir=constants.WORKSPACE_PATH,
+ working_dir="/app",

Contributor Author

Fixed! I've changed the working_dir from /workspace to /app to match the Dockerfile convention. Added a comment referencing the Dockerfile for future maintainability.

… backend

Add support for dataset and model initializers in the container backend
to bring it to feature parity with the Kubernetes backend.

Changes:
- Add utility functions for building initializer commands and environment variables
- Implement _run_initializers() and _run_single_initializer() methods in ContainerBackend
- Run initializers sequentially before training containers start
- Download datasets to /workspace/dataset and models to /workspace/model
- Track initializer containers as separate steps in TrainJob
- Support all initializer types: HuggingFace, S3, and DataCache
- Add comprehensive unit tests for all initializer configurations
- Handle initializer failures with proper cleanup and error messages

Fixes kubeflow#171

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)

All tests passing (173/173). Addresses all feedback from PR kubeflow#188.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Add _cleanup_container_resources() helper method to consolidate
duplicated cleanup logic for stopping/removing containers and
deleting networks. Refactor 5 locations across train(), initializer
handlers, and delete_job() to use this helper.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Address feedback for initializer support in container backend:

- Use separate images for dataset/model initializers:
  - kubeflow/dataset-initializer:latest for datasets
  - kubeflow/model-initializer:latest for models
  (instead of kubeflow/training-operator:latest)

- Update python commands to use pkg.initializers module:
  - python -m pkg.initializers.dataset (for dataset)
  - python -m pkg.initializers.model (for model)

- Change initializer working_dir from /workspace to /app
  per Dockerfile convention

Refs: https://github.com/kubeflow/trainer/tree/master/cmd/initializers
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
@HKanoje HKanoje force-pushed the feat/171-add-initializer-support-container-backend branch from 2bf8662 to 6241fae Compare February 4, 2026 07:45
@Fiona-Waters
Contributor

Thanks for this @HKanoje I've re-run with your latest changes (and the trainer changes) and it works as expected.
/lgtm

Member

@andreyvelich andreyvelich left a comment

This looks great @HKanoje!
I left a few thoughts.

# Stop and remove containers
if container_ids:
for container_id in container_ids:
with suppress(Exception):
Member

Why do we use suppress here? We don't do that in other parts of the SDK.

Comment on lines +891 to +892
# Tests for Initializer Support
@pytest.mark.parametrize(
Member

Can you add these test cases to the test_train() API:

def test_train(container_backend, test_case):

),
],
)
def test_get_logs_with_initializers(container_backend, test_case):
Member

Same here: please add it to test_get_job_logs().

),
],
)
def test_initializer_failures(container_backend, test_case):
Member

This can be added to test_train()

Comment on lines +69 to +73
default="kubeflow/dataset-initializer:latest",
description="Container image for dataset initializers",
)
model_initializer_image: str = Field(
default="kubeflow/model-initializer:latest",
Member

@andreyvelich andreyvelich Feb 5, 2026

Comment on lines +286 to +312
elif isinstance(initializer, (types.S3DatasetInitializer, types.S3ModelInitializer)):
if initializer.endpoint:
env["ENDPOINT"] = initializer.endpoint
if initializer.access_key_id:
env["ACCESS_KEY_ID"] = initializer.access_key_id
if initializer.secret_access_key:
env["SECRET_ACCESS_KEY"] = initializer.secret_access_key
if initializer.region:
env["REGION"] = initializer.region
if initializer.role_arn:
env["ROLE_ARN"] = initializer.role_arn
if hasattr(initializer, "ignore_patterns") and initializer.ignore_patterns:
env["IGNORE_PATTERNS"] = ",".join(initializer.ignore_patterns)

elif isinstance(initializer, types.DataCacheInitializer):
env["CLUSTER_SIZE"] = str(initializer.num_data_nodes + 1)
env["METADATA_LOC"] = initializer.metadata_loc
if initializer.head_cpu:
env["HEAD_CPU"] = initializer.head_cpu
if initializer.head_mem:
env["HEAD_MEM"] = initializer.head_mem
if initializer.worker_cpu:
env["WORKER_CPU"] = initializer.worker_cpu
if initializer.worker_mem:
env["WORKER_MEM"] = initializer.worker_mem
if initializer.iam_role:
env["IAM_ROLE"] = initializer.iam_role
Member

Can we simplify this to something like we do here:

def get_optional_initializer_envs(
    initializer: types.BaseInitializer, required_fields: set
) -> list[models.IoK8sApiCoreV1EnvVar]:
    """Get the optional envs from the initializer config"""
    envs = []
    for f in fields(initializer):
        if f.name not in required_fields:
            value = getattr(initializer, f.name)
            if value is not None:
                # Convert list values (like ignore_patterns) to comma-separated strings
                if isinstance(value, list):
                    value = ",".join(str(item) for item in value)
                envs.append(models.IoK8sApiCoreV1EnvVar(name=f.name.upper(), value=value))
    return envs

Env variables always have the same names as the fields, just upper-cased.
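
For the container backend, the same idea could return a plain dict instead of Kubernetes env var models. The sketch below is an assumption about how that adaptation might look: `S3DatasetInitializer` is a simplified stand-in for the SDK type, and `get_optional_initializer_env` is a hypothetical name.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class S3DatasetInitializer:  # simplified stand-in for the SDK type
    storage_uri: str
    endpoint: Optional[str] = None
    region: Optional[str] = None
    ignore_patterns: Optional[list[str]] = None


def get_optional_initializer_env(initializer, required_fields: set[str]) -> dict[str, str]:
    """Map every set, non-required field to an upper-cased env variable."""
    env: dict[str, str] = {}
    for f in fields(initializer):
        if f.name in required_fields:
            continue
        value = getattr(initializer, f.name)
        if value is None:
            continue
        # Lists (such as ignore_patterns) become comma-separated strings.
        if isinstance(value, list):
            value = ",".join(str(item) for item in value)
        env[f.name.upper()] = str(value)
    return env
```

Deriving the variable names from the field names removes the per-type `if` chains quoted above, at the cost of requiring that field names and env variable names stay in sync.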

return aggregate_status_from_containers(statuses)


def build_initializer_command(initializer: types.BaseInitializer, init_type: str) -> list[str]:
Member

You don't need the init_type variable, since you can tell whether it is a dataset or model initializer by checking the type of the initializer field.

Suggested change
- def build_initializer_command(initializer: types.BaseInitializer, init_type: str) -> list[str]:
+ def build_initializer_command(initializer: types.BaseInitializer) -> list[str]:

types.S3ModelInitializer,
types.HuggingFaceDatasetInitializer,
types.HuggingFaceModelInitializer,
types.DataCacheInitializer,
Member

Let's remove it for now; I am not sure how data cache can be supported in the Container backend at the moment.
cc @akshaychitneni

Suggested change
- types.DataCacheInitializer,

return ["bash", "-c", python_cmd]


def build_initializer_env(initializer: types.BaseInitializer, init_type: str) -> dict[str, str]:
Member

Same point

Suggested change
- def build_initializer_env(initializer: types.BaseInitializer, init_type: str) -> dict[str, str]:
+ def build_initializer_env(initializer: types.BaseInitializer) -> dict[str, str]:

# Run dataset initializer if configured
if initializer.dataset:
# Get and pull dataset initializer image
dataset_image = container_utils.get_initializer_image(self.cfg, "dataset")
Member

Can you simplify the logic here to be consistent with what we do in the Kubernetes backend:

dataset=utils.get_dataset_initializer(initializer.dataset)

Simply define two util functions in the container utils:

container_utils.get_dataset_initializer()
container_utils.get_model_initializer()

These return an internal type that you can use in the _adapter.create_and_start_container() API:

@dataclass
class ContainerInitializer:
    image: str
    command: str
    env: dict

WDYT @HKanoje @Fiona-Waters ?

@google-oss-prow google-oss-prow bot removed the lgtm label Feb 5, 2026
- Use GHCR images as default for dataset/model initializers
- Replace suppress with try-except blocks
- Refactor initializer utils with ContainerInitializer dataclass
- Add get_dataset_initializer and get_model_initializer functions
- Remove DataCache support (unsupported in container backend)
- Merge initializer tests into test_train() and test_get_job_logs()
- Remove duplicate test functions

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
@HKanoje HKanoje force-pushed the feat/171-add-initializer-support-container-backend branch from e214292 to a1c9ce0 Compare February 5, 2026 08:06
@HKanoje
Contributor Author

HKanoje commented Feb 5, 2026

Hi @andreyvelich @Fiona-Waters,

I've addressed all the review comments in the latest commit:

  • Updated default images to GHCR (ghcr.io/kubeflow/trainer/dataset-initializer:latest and ghcr.io/kubeflow/trainer/model-initializer:latest)
  • Replaced suppress with try-except blocks (added # noqa: SIM105 to satisfy linter)
  • Refactored initializer utils with ContainerInitializer dataclass
  • Added get_dataset_initializer() and get_model_initializer() functions following kubernetes backend pattern
  • Simplified env building using get_optional_initializer_envs() helper
  • Removed init_type parameter (now determined from initializer type)
  • Removed DataCache support (not supported in container backend)
  • Merged initializer tests into test_train() and test_get_job_logs()
  • Removed duplicate test functions

self,
job_name: str,
container_init: container_utils.ContainerInitializer,
init_type: str,
Member

Instead of passing init_type separately, you can simply add a name field to the ContainerInitializer type, which can be:

name = dataset-initializer
name = model-initializer

Then, just use this name in the f"{self.label_prefix}/step"

Contributor Author

Done! @andreyvelich

Changes made:

  • Added name field to ContainerInitializer dataclass
  • Set name="dataset-initializer" in get_dataset_initializer()
  • Set name="model-initializer" in get_model_initializer()
  • Removed init_type parameter from _run_single_initializer()
  • Now using container_init.name for labels and log messages
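
A sketch of how the named initializer type could drive the step label. The field set beyond `name` and `image`, the function signature taking a bare `storage_uri`, and the `step_label` helper are assumptions for illustration; the default image, module path, and output path are the values mentioned in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerInitializer:
    """Everything the backend needs to start one initializer container."""

    name: str
    image: str
    command: list[str]
    env: dict[str, str] = field(default_factory=dict)


def get_dataset_initializer(
    storage_uri: str,
    image: str = "ghcr.io/kubeflow/trainer/dataset-initializer:latest",
) -> ContainerInitializer:
    # Hypothetical signature: takes a URI instead of an SDK initializer object.
    return ContainerInitializer(
        name="dataset-initializer",
        image=image,
        command=["bash", "-c", "python -m pkg.initializers.dataset"],
        env={"STORAGE_URI": storage_uri, "OUTPUT_PATH": "/workspace/dataset"},
    )


def step_label(label_prefix: str, init: ContainerInitializer) -> dict[str, str]:
    """Build the step label from the initializer's own name (no init_type needed)."""
    return {f"{label_prefix}/step": init.name}
```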

…t_type

- Add name field to ContainerInitializer dataclass
- Set name='dataset-initializer' and name='model-initializer' in utils
- Remove init_type parameter from _run_single_initializer()
- Use container_init.name for labels and log messages

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Member

@andreyvelich andreyvelich left a comment

Thanks for this @HKanoje!
Can you please create a tracking issue for this: #188 (comment)
/lgtm
/assign @Fiona-Waters @kramaranya

Contributor

@Fiona-Waters Fiona-Waters left a comment

/lgtm
Thanks @HKanoje !

@HKanoje
Contributor Author

HKanoje commented Feb 12, 2026

@andreyvelich Done, I have created the issue #290

@andreyvelich
Member

Thanks! /approve

@andreyvelich
Member

/approve

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


@google-oss-prow google-oss-prow bot merged commit deed1ce into kubeflow:main Feb 12, 2026
17 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.4 milestone Feb 12, 2026
@HKanoje HKanoje deleted the feat/171-add-initializer-support-container-backend branch February 13, 2026 05:26
openshift-merge-bot bot pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Mar 11, 2026
* chore!: upgrade to Python 3.10 (kubeflow#282)

This upgrades the minimum Python version for the project from 3.9 to
3.10. Python 3.9 is past end-of-life and dependencies will likely
require a supported version soon.

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* chore: Confirm that a public ConfigMap exists to check version (kubeflow#250)

* Confirm that a public ConfigMap exists to check version

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* python 3.9 fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Better exception handling

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Addressing comments

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Update kubeflow/trainer/backends/kubernetes/backend.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Refactored tests into a single function and followed agents.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* CI friendly edit

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* pre-commit format checked

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Modified according to new updates

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Ran pre-commit locally to fix formatting

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* unix2dos CLAUDE.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Revert CLAUDE.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

---------

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* chore: added sdk docs website to readme (kubeflow#284)

* docs: added sdk docs website to readme

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* format: order of sdk docs

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

---------

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* feat(trainer): add dataset and model initializer support to container backend (kubeflow#188)

* feat(trainer): add dataset and model initializer support to container backend

Add support for dataset and model initializers in the container backend
to bring it to feature parity with the Kubernetes backend.

Changes:
- Add utility functions for building initializer commands and environment variables
- Implement _run_initializers() and _run_single_initializer() methods in ContainerBackend
- Run initializers sequentially before training containers start
- Download datasets to /workspace/dataset and models to /workspace/model
- Track initializer containers as separate steps in TrainJob
- Support all initializer types: HuggingFace, S3, and DataCache
- Add comprehensive unit tests for all initializer configurations
- Handle initializer failures with proper cleanup and error messages

Fixes kubeflow#171

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* feat(trainer): address reviewer feedback for initializer support

- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)

All tests passing (173/173). Addresses all feedback from PR kubeflow#188.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* chore(trainer): add cleanup helper to reduce duplication

Add _cleanup_container_resources() helper method to consolidate
duplicated cleanup logic for stopping/removing containers and
deleting networks. Refactor 5 locations across train(), initializer
handlers, and delete_job() to use this helper.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
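
A consolidated cleanup helper of the kind this commit describes might look like the sketch below. It is an illustration under assumptions, not the PR's implementation: the adapter protocol and its method names (`stop_container`, `remove_container`, `delete_network`) are hypothetical, and the helper is written as a free function rather than a method of `ContainerBackend`.

```python
import logging
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


class ContainerAdapter(Protocol):
    # Hypothetical adapter interface; the real adapter methods may differ.
    def stop_container(self, container_id: str) -> None: ...
    def remove_container(self, container_id: str) -> None: ...
    def delete_network(self, network_id: str) -> None: ...


def cleanup_container_resources(
    adapter: ContainerAdapter,
    container_ids: list[str],
    network_id: Optional[str] = None,
) -> None:
    """Best-effort cleanup: one failing step never blocks the remaining steps."""
    for container_id in container_ids:
        for action in (adapter.stop_container, adapter.remove_container):
            try:
                action(container_id)
            except Exception as exc:  # log and keep going
                logger.warning("Cleanup step failed for %s: %s", container_id, exc)
    if network_id is not None:
        try:
            adapter.delete_network(network_id)
        except Exception as exc:
            logger.warning("Failed to delete network %s: %s", network_id, exc)
```

Each step sits in its own `try`/`except`, so a container that is already stopped does not prevent its removal or the network deletion that follows.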

* fix(trainer): use correct initializer images and working directory

Address feedback for initializer support in container backend:

- Use separate images for dataset/model initializers:
  - kubeflow/dataset-initializer:latest for datasets
  - kubeflow/model-initializer:latest for models
  (instead of kubeflow/training-operator:latest)

- Update python commands to use pkg.initializers module:
  - python -m pkg.initializers.dataset (for dataset)
  - python -m pkg.initializers.model (for model)

- Change initializer working_dir from /workspace to /app
  per Dockerfile convention

Refs: https://github.com/kubeflow/trainer/tree/master/cmd/initializers
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* fix(container): address PR review comments for initializer support

- Use GHCR images as default for dataset/model initializers
- Replace suppress with try-except blocks
- Refactor initializer utils with ContainerInitializer dataclass
- Add get_dataset_initializer and get_model_initializer functions
- Remove DataCache support (unsupported in container backend)
- Merge initializer tests into test_train() and test_get_job_logs()
- Remove duplicate test functions

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* fix(container): add name field to ContainerInitializer and remove init_type

- Add name field to ContainerInitializer dataclass
- Set name='dataset-initializer' and name='model-initializer' in utils
- Remove init_type parameter from _run_single_initializer()
- Use container_init.name for labels and log messages

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

---------

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* feat: add SparkClient API for SparkConnect session management (kubeflow#225)

* feat(spark): add core types, dataclasses, and constants

- Add SparkConnectInfo, SparkConnectState, Driver, Executor types
- Add type tests for validation
- Add Kubernetes backend constants (CRD group, version, defaults)

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add backend base class and options pattern

- Add RuntimeBackend abstract base class with session lifecycle methods
- Add options pattern (Name, Image, Timeout, etc.) aligned with trainer SDK
- Add validation utilities for connect parameters
- Add comprehensive option tests

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add KubernetesBackend for SparkConnect CRD operations

- Implement KubernetesBackend with create/get/list/delete session methods
- Add port-forward support for out-of-cluster connections
- Add CRD builder utilities and URL validation
- Add comprehensive backend and utils tests with parametrized patterns

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add SparkClient API with KEP-107 compliant connect method

- Implement SparkClient as main user interface for SparkConnect sessions
- Support connect to existing server (base_url) or auto-create new session
- Add public exports for SparkClient, Driver, Executor, options
- Add SparkClient unit tests

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* chore(spark): add test infrastructure and package init files

- Add test common utilities and fixtures
- Add package __init__ files for test directories
- Setup test/e2e/spark structure

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add example scripts demonstrating SparkClient usage

- Add spark_connect_simple.py with 3 usage levels (minimal, simple, advanced)
- Add spark_advanced_options.py with full configuration examples
- Add connect_existing_session.py for connecting to existing servers
- Add demo and test scripts for local development

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* docs(spark): add documentation for SparkClient and E2E testing

- Add examples/spark/README.md with usage guide
- Add local Spark Connect testing documentation
- Add E2E test README with CI/CD integration guide
- Update KEP-107 proposal documentation

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* test(spark): add E2E test framework with cluster watcher

- Add test_spark_examples.py with example validation tests
- Add cluster_watcher.py for monitoring SparkConnect and pods during tests
- Add run_in_cluster.py for executing examples as K8s Jobs

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* ci(spark): add GitHub Actions workflow and E2E cluster setup

- Add test-spark-examples.yaml workflow for E2E validation
- Add e2e-setup-cluster.sh for Kind cluster with Spark Operator
- Add SparkConnect CRD, Kind config, and E2E runner Dockerfile
- Update Makefile with E2E setup target
- Update PR title check for spark prefix

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* chore(spark): add pyspark[connect] dependency and update lock file

- Add spark extra with pyspark[connect]==3.4.1 for grpcio, pandas, pyarrow
- Update uv.lock with resolved dependencies
- Update .gitignore for spark-related files

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* Update kubeflow/spark/backends/base.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>

* refactor(spark): rename backend.connect_session() to connect()

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* refactor: move session creation flow from SparkClient to backend.create_and_connect()

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

---------

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* chore: bump minimum model-registry version to 0.3.6 (kubeflow#289)

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* fix: Improve CVE workflow (kubeflow#267)

* fix: Improve CVE workflow

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: fix issue with bash compare

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* feat: Add workflow to cleanup overrides in pyproject.toml

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: address review comments

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore: refactor to reduce size of cve related workflows

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

---------

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore: upgrade code style for python3.10 (kubeflow#288)

* chore: update code style for Python 3.10

This disables a couple ruff rules in pyproject.toml:
```
"UP007", # Use X | Y instead of Union[X, Y] (requires Python 3.10+)
"UP045", # Use X | None instead of Optional[X] (requires Python 3.10+)
```

Then the code changes are made with:
```
uv run ruff check --fix
uv run ruff format
```

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* fix: handle unions, bools in convert_value

The convert_value function didn't seem to handle union types properly,
and it also needs to handle `T | None` the same way as `Optional[T]`
after the upgrade to Python 3.10. This fixes union types and an issue
with bool conversion, and adds tests for this function.

Signed-off-by: Jon Burdo <jon@jonburdo.com>

---------

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* chore(ci): bump astral-sh/setup-uv from 5 to 7 (kubeflow#276)

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 5 to 7.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](astral-sh/setup-uv@v5...v7)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the python-minor group across 1 directory with 4 updates (kubeflow#291)

Bumps the python-minor group with 4 updates in the / directory: [coverage](https://github.com/coveragepy/coveragepy), [ruff](https://github.com/astral-sh/ruff), [pre-commit](https://github.com/pre-commit/pre-commit) and [ty](https://github.com/astral-sh/ty).


Updates `coverage` from 7.10.7 to 7.13.4
- [Release notes](https://github.com/coveragepy/coveragepy/releases)
- [Changelog](https://github.com/coveragepy/coveragepy/blob/main/CHANGES.rst)
- [Commits](coveragepy/coveragepy@7.10.7...7.13.4)

Updates `ruff` from 0.14.14 to 0.15.0
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.14.14...0.15.0)

Updates `pre-commit` from 4.3.0 to 4.5.1
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](pre-commit/pre-commit@v4.3.0...v4.5.1)

Updates `ty` from 0.0.14 to 0.0.16
- [Release notes](https://github.com/astral-sh/ty/releases)
- [Changelog](https://github.com/astral-sh/ty/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ty@0.0.14...0.0.16)

---
updated-dependencies:
- dependency-name: coverage
  dependency-version: 7.13.4
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: ruff
  dependency-version: 0.15.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: pre-commit
  dependency-version: 4.5.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: ty
  dependency-version: 0.0.16
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Added examples to the documentation demonstrating different ways to handle ports (kubeflow#243)

* update docs and add test cases.

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>

* pre-commit error solved

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>

* Update kubeflow/hub/api/model_registry_client.py

Co-authored-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

* readme updated

Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

* Refactor model registry client test cases for clarity

Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

---------

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>

* chore(ci): bump peter-evans/create-pull-request from 6 to 8 (kubeflow#277)

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 8.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v8)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(ci): bump actions/checkout from 4 to 6 (kubeflow#278)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Adds a GitHub Actions workflow to check kubeflow/hub/OWNERS. (kubeflow#280)

* Add OWNERS validation

Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>

* Update .github/workflows/check-owners.yaml

Co-authored-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

* Update OWNERS file check in workflow

Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

* Update paths in check-owners workflow

Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

---------

Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>

* fix: nightly security dependency updates (kubeflow#296)

Co-authored-by: google-oss-prow <92114575+google-oss-prow@users.noreply.github.com>

* chore(ci): bump aquasecurity/trivy-action from 0.33.1 to 0.34.0 in the actions group (kubeflow#297)

Bumps the actions group with 1 update: [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action).


Updates `aquasecurity/trivy-action` from 0.33.1 to 0.34.0
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.33.1...0.34.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.34.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump pytest from 8.4.2 to 9.0.2 (kubeflow#301)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.4.2 to 9.0.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@8.4.2...9.0.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(trainer): Support namespaced TrainingRuntime in the SDK (kubeflow#130)

* feat(backend): Support namespaced TrainingRuntime in the SDK

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed bugs and validated current test cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed pre-commit test failure

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Addressed comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed no attribute 'DEFAULT_TIMEOUT' error

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Added namespace-scoped runtime to test cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Addressed fallback logic bugs

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Added scope field to Runtime

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Improved code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed copilot's comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Shadow duplicate runtimes, priority to ns

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed bug

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed copilot comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Improved test cases to validate all possible cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* small fix

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* lint fix

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* improved error message

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>

* refactored code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* improve code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Removed RuntimeScope

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* removed scope references and improved error handling as per kubeflow standards

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

---------

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: Fix runtime lookup fallback and test local SDK in E2E (kubeflow#307)

* fix: Install SDK locally in E2E workflow and improve error handling for runtime fetching in Kubernetes backend.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* refactor: Explicitly return errors from  and refine exception handling in .

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* docs: update comment to clarify Kubeflow SDK installation from source in e2e workflow.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* feat: Enhance runtime retrieval tests to cover Kubernetes API 404/403 errors and partial success for list operations on timeout.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* refactor: Update runtime listing to immediately raise exceptions on failure instead of collecting partial results.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

---------

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* chore(ci): bump actions/setup-python from 5 to 6 (kubeflow#298)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the python-minor group with 2 updates (kubeflow#299)

Bumps the python-minor group with 2 updates: [ruff](https://github.com/astral-sh/ruff) and [ty](https://github.com/astral-sh/ty).


Updates `ruff` from 0.15.0 to 0.15.1
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.15.0...0.15.1)

Updates `ty` from 0.0.16 to 0.0.17
- [Release notes](https://github.com/astral-sh/ty/releases)
- [Changelog](https://github.com/astral-sh/ty/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ty@0.0.16...0.0.17)

---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.15.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
- dependency-name: ty
  dependency-version: 0.0.17
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: improve logging around packages_to_install (kubeflow#269)

* improve logging around packages_to_install

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* exit when pip install fails, append errors from both attempts

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* Add shlex to address command injection vulnerabilities. Write pip install logfile to cwd

Signed-off-by: Brian Gallagher <briangal@gmail.com>

---------

Signed-off-by: Brian Gallagher <briangal@gmail.com>
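
The `shlex` hardening mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the SDK's code: the function name, the log file name, and the exact shell command are hypothetical. The only library call used is `shlex.quote`, which escapes a string so that a POSIX shell treats it as a single token.

```python
import shlex


def build_pip_install_command(packages: list[str], log_file: str = "pip_install.log") -> str:
    """Build a shell command that installs packages and exits on failure."""
    # Quote every user-supplied value so shell metacharacters are inert.
    quoted = " ".join(shlex.quote(pkg) for pkg in packages)
    # The log file is written relative to the current working directory.
    return f"python -m pip install {quoted} > {shlex.quote(log_file)} 2>&1 || exit 1"
```

Without quoting, a package string such as `evil; rm -rf /` would be split by the shell into two commands. With quoting, it reaches pip as one (invalid) requirement and the install simply fails.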

* feat: Add validate lockfile workflow to complement CVE scanning (kubeflow#306)

* feat: Add validate lockfile workflow to complement CVE scanning

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: make cve fix pr branch static

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

---------

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix(trainer): handle falsy values in get_args_from_peft_config (kubeflow#328)

* fix(trainer): handle falsy values in get_args_from_peft_config

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: apply pre-commit formatting

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: also handle falsy train_on_input in dataset_preprocess_config

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: add missing newline at end of utils_test.py

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: pre-commit formatting

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

---------

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
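
The falsy-value problem named in this fix is a common Python pitfall: a truthiness check such as `if value:` silently drops legitimate settings like `0.0` or `False`. The sketch below shows the general pattern with an `is not None` check. The dataclass, its field names, and the `key=value` argument format are hypothetical and are not taken from `get_args_from_peft_config`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExamplePeftConfig:
    # Hypothetical fields, chosen only to demonstrate falsy values.
    lora_rank: Optional[int] = None
    lora_dropout: Optional[float] = None
    use_dora: Optional[bool] = None


def args_from_config(config: ExamplePeftConfig) -> list[str]:
    args = []
    for name in ("lora_rank", "lora_dropout", "use_dora"):
        value = getattr(config, name)
        # `if value:` would skip 0.0 and False; only skip values left unset.
        if value is not None:
            args.append(f"{name}={value}")
    return args
```

With this check, a dropout of `0.0` and a flag explicitly set to `False` both reach the generated arguments, while unset fields are still omitted.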

* fix(optimizer): prevent input mutation in optimize() (kubeflow#322)

* fix(optimizer): prevent input mutation in optimize()

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* remove unnecessary things

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* rename test

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

---------

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
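
The commit log does not say which argument `optimize()` was mutating, so the following is only a generic sketch of the usual remedy: copy the caller's object before modifying it. The function name, the dictionary layout, and the `overrides` parameter are hypothetical.

```python
import copy


def optimize_sketch(trial_config: dict, overrides: dict) -> dict:
    """Return an updated config without touching the caller's dictionary."""
    # deepcopy also protects nested dictionaries, which dict.copy() would share.
    working = copy.deepcopy(trial_config)
    working.setdefault("parameters", {}).update(overrides)
    return working
```

A shallow copy would not be enough here, because the nested `parameters` dictionary would still be shared with the caller.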

* feat: add TrainerClient examples for local PyTorch distributed training (kubeflow#312)

* docs: add TrainerClient examples for local PyTorch distributed training

- Add examples/trainer/pytorch_distributed_simple.py
- Add examples/trainer/README.md
- Demonstrates LocalProcessBackend usage without Kubernetes
- Fixes kubeflow#218

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* docs: add training examples table to SDK website

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* docs: expand examples table with PyTorch, MLX, DeepSpeed, and TorchTune examples grouped by framework

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

---------

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* chore: fix docstrings in TrainerClient (kubeflow#333)

Signed-off-by: Transcendental-Programmer <priyena.programming@gmail.com>

* feat(spark): Refactor unit tests to sdk coding standards (kubeflow#293)

* Refactored unit test

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Changes made

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Version

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Restructured client_test

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* reformatted backend_test.py

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* revert pyproject.toml and uv.lock changes

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Standardized spark backend tests

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* backend_tests

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

---------

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* fix(optimizer): add missing get_job_events() to RuntimeBackend base c… (kubeflow#325)

* fix(optimizer): add missing get_job_events() to RuntimeBackend base class

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* Update kubeflow/optimizer/backends/base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* Update kubeflow/optimizer/backends/base.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* fix: add abstractmethod, remove docstrings

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* make get_job_events abstract in RuntimeBackend

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* Update kubeflow/trainer/backends/localprocess/backend.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* fix

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

---------

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* chore(spark): migrate SDK to kubeflow_spark_api Pydantic models (kubeflow#295)

* chore(spark): add kubeflow-spark-api dependency

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate options to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate utils to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate backend to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): refactor tests to use typed models and cleanup

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): rename build_spark_connect_crd to build_spark_connect_cr

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* fix(spark): use typed model helpers in mock handlers

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): bump kubeflow-spark-api to 2.4.0

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

---------

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* feat(docs): Update README with Spark Support (kubeflow#349)

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* fix(trainer): return TRAINJOB_COMPLETE when all steps are done (kubeflow#340)

* fix(local): return TRAINJOB_COMPLETE when all steps are done (kubeflow#338)

Signed-off-by: priyank <priyank8445@gmail.com>

* test(trainer): add test case for __get_job_status

Signed-off-by: priyank <priyank8445@gmail.com>

* fix(trainer): early return TRAINJOB_CREATED when job has no steps

Signed-off-by: priyank <priyank8445@gmail.com>

* test(trainer): refactor test_get_job_status with TestCase fixture

Signed-off-by: priyank <priyank8445@gmail.com>

---------

Signed-off-by: priyank <priyank8445@gmail.com>
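
The two status fixes in this entry (complete when all steps are done, created when there are no steps) can be sketched as a small aggregation function. This is a hedged illustration, not the backend's `__get_job_status`: the constant values, the failed and running states, and the precedence between them are assumptions.

```python
# Assumed status strings; the SDK's actual constant values may differ.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"


def job_status_sketch(step_statuses: list[str]) -> str:
    """Derive a job-level status from the statuses of its steps."""
    # all([]) is True in Python, so an empty job must be handled first;
    # otherwise a job with no steps would be reported as complete.
    if not step_statuses:
        return TRAINJOB_CREATED
    if any(status == TRAINJOB_FAILED for status in step_statuses):
        return TRAINJOB_FAILED
    if all(status == TRAINJOB_COMPLETE for status in step_statuses):
        return TRAINJOB_COMPLETE
    if any(status == TRAINJOB_RUNNING for status in step_statuses):
        return TRAINJOB_RUNNING
    return TRAINJOB_CREATED
```

The early return matters because `all()` over an empty list is vacuously true.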

* fix(trainer): adapt SDK to removal of numProcPerNode from TorchMLPolicySource (kubeflow#360)

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* fix: Make validate-lockfile action non-blocking (kubeflow#361)

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore(spark): change pyspark[connect] dependency (kubeflow#357)

Change pyspark[connect] 3.4.1 dependency to pyspark-connect 4.0.1.

This matches the version of Spark in the spark-operator container image
(https://github.com/kubeflow/spark-operator/blob/master/Dockerfile#L17).

Signed-off-by: Ali Maredia <amaredia@redhat.com>

* chore(spark): remove SDK-side validation from SparkClient (kubeflow#345)

Remove all SDK-side input validation from the spark module.
Validation will be handled server-side by the Spark Operator
admission webhooks (spark-operator#2862).

- Remove validation.py and validation_test.py
- Remove isinstance checks from _create_session()
- Remove ValidationError from public API

Closes: kubeflow#272

Signed-off-by: Yassin Nouh <yassinnouh21@gmail.com>
Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>

* chore: Merge upstream/main (preserving downstream config)

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* update workflow to skip requirements generation on merge conflict

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* remove compatibility with python 3.9 and updated tests

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* fix tests

Signed-off-by: Brian Gallagher <briangal@gmail.com>

---------

Signed-off-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: Brian Gallagher <briangal@gmail.com>
Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>
Signed-off-by: Transcendental-Programmer <priyena.programming@gmail.com>
Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: priyank <priyank8445@gmail.com>
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Signed-off-by: Yassin Nouh <yassinnouh21@gmail.com>
Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>
Co-authored-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Co-authored-by: Hrithik Kanoje <128607033+HKanoje@users.noreply.github.com>
Co-authored-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Co-authored-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: google-oss-prow <92114575+google-oss-prow@users.noreply.github.com>
Co-authored-by: Moeed <shaikmoeed@gmail.com>
Co-authored-by: Yash Agarwal <2004agarwalyash@gmail.com>
Co-authored-by: krishdef7 <157892833+krishdef7@users.noreply.github.com>
Co-authored-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Co-authored-by: Mansi Singh <mansimaanu8627@gmail.com>
Co-authored-by: Priyansh Saxena <130545865+priyansh-saxena1@users.noreply.github.com>
Co-authored-by: DIGVIJAY <144053736+digvijay-y@users.noreply.github.com>
Co-authored-by: Tariq Hasan <mmtariquehsn@gmail.com>
Co-authored-by: Priyank Patel <147739348+priyank766@users.noreply.github.com>
Co-authored-by: Ali Maredia <amaredia@redhat.com>
Co-authored-by: Yassin Nouh <70436855+YassinNouh21@users.noreply.github.com>
1Ayush-Petwal added a commit to 1Ayush-Petwal/sdk that referenced this pull request Mar 21, 2026
Add docs/source/train/initializers.rst documenting the dataset and
model initializer types (HuggingFaceDatasetInitializer, S3DatasetInitializer,
DataCacheInitializer, HuggingFaceModelInitializer, S3ModelInitializer)
that were added to the container backend in PRs kubeflow#188 and kubeflow#313.

The guide covers: concept overview, per-initializer code examples,
combined usage, ContainerBackendConfig options (images, timeout),
log-based debugging, and backend limitations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1Ayush-Petwal added a commit to 1Ayush-Petwal/sdk that referenced this pull request Mar 21, 2026
Add docs/source/train/initializers.rst covering dataset and model
initializers for the container backend (added in kubeflow#188, parallelised
in kubeflow#313). Includes per-type code examples, combined usage, ContainerBackendConfig
options, and debugging via get_job_logs().
1Ayush-Petwal added a commit to 1Ayush-Petwal/sdk that referenced this pull request Mar 21, 2026
Add docs/source/train/initializers.rst covering dataset and model
initializers for the container backend (added in kubeflow#188, parallelised
in kubeflow#313). Includes per-type code examples, combined usage, ContainerBackendConfig
options, and debugging via get_job_logs().

Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>
1Ayush-Petwal added a commit to 1Ayush-Petwal/sdk that referenced this pull request Mar 28, 2026
Add docs/source/train/initializers.rst covering dataset and model
initializers for the container backend (added in kubeflow#188, parallelised
in kubeflow#313). Includes per-type code examples, combined usage, ContainerBackendConfig
options, and debugging via get_job_logs().

Signed-off-by: Ayush Petwal <ayushpetwal.0105@gmail.com>