
Local Development with SDK Backends

This guide explains how to run Kubeflow training jobs locally using the SDK's different backends, helping you iterate faster before deploying to a Kubernetes cluster.

Overview

The Kubeflow Trainer SDK provides three backends for running training jobs:

Backend Comparison

| Backend       | Best For                               | Requirements                      |
| ------------- | -------------------------------------- | --------------------------------- |
| Local Process | Quick prototyping, single-node testing | Python 3.9+                       |
| Container     | Multi-node training, reproducibility   | Docker or Podman installed        |
| Kubernetes    | Production deployments                 | K8s cluster with Trainer operator |

All backends use the same TrainerClient interface - only the configuration changes. This means you can develop locally and deploy to production with minimal code changes.

Local Process Backend

The fastest option for quick testing. Runs training directly as Python processes.

When to use:

  • Rapid prototyping and debugging
  • Testing training logic without container overhead
  • Environments without Docker/Podman

Example:

from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure local process backend
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

# Define your training function
def train_model():
    import torch
    print(f"Training on device: {torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'}")
    # Your training logic here

# Create trainer and run
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

# View logs (get_job_logs returns an iterable of log lines)
for log_line in client.get_job_logs(name=job_name, follow=True):
    print(log_line)

Limitations:

  • Single-node only (no distributed training)
  • No container isolation

Container Backend (Docker/Podman)

Run training in isolated containers with full multi-node distributed training support.

When to use:

  • Distributed training with multiple workers
  • Reproducible containerized environments
  • Testing production-like setups locally

Example with Docker:

from kubeflow.trainer import TrainerClient, ContainerBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure Docker backend
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # or "podman"
)
client = TrainerClient(backend_config=backend_config)

# Same trainer works - now with multi-node support!
trainer = CustomTrainer(
    func=train_model,
    num_nodes=4,  # Distributed across 4 containers
)
job_name = client.train(trainer=trainer)

Container Host Configuration

When using the Container backend on macOS, you may need to configure the container_host parameter to point to your Docker or Podman socket, because the default socket path differs across operating systems.

| OS      | Default container_host                                                                     |
| ------- | ------------------------------------------------------------------------------------------ |
| Linux   | unix:///var/run/docker.sock (Docker) or unix:///run/user/<UID>/podman/podman.sock (Podman) |
| macOS   | unix://$HOME/.docker/run/docker.sock (Docker Desktop); check podman machine inspect for Podman |
| Windows | npipe:////./pipe/docker_engine (Docker Desktop)                                            |

Example for macOS:

import os

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    # macOS Docker Desktop socket path
    container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
)
client = TrainerClient(backend_config=backend_config)

Note

If you encounter Cannot connect to Docker daemon errors on macOS, verify the socket path by running docker context inspect and check the Host value in the output.
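One way to confirm the path from Python is to probe the well-known socket locations from the table above and take the first one that exists. This is a minimal sketch; find_container_host is a hypothetical helper for illustration, not part of the SDK:

```python
import os

def find_container_host():
    """Return a unix:// container_host URL for the first container
    socket found at a well-known location, or None if none exist."""
    candidates = [
        "/var/run/docker.sock",                           # Linux Docker
        os.path.expanduser("~/.docker/run/docker.sock"),  # macOS Docker Desktop
        f"/run/user/{os.getuid()}/podman/podman.sock",    # Linux rootless Podman
    ]
    for path in candidates:
        if os.path.exists(path):
            return f"unix://{path}"
    return None

host = find_container_host()
print(host or "no container socket found; is Docker/Podman running?")
```

If this finds a socket, pass the returned string as container_host to ContainerBackendConfig; if it returns None, fall back to docker context inspect as described in the note above.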

Choosing Docker vs Podman:

| Runtime | Recommended For                                      |
| ------- | ---------------------------------------------------- |
| Docker  | General use, especially on macOS/Windows             |
| Podman  | Linux servers, rootless/security-focused environments |

Switching Between Backends

The key benefit of the SDK is seamless backend switching. Your training code stays the same - only the backend configuration changes:

# Development: Use local process for fast iteration
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig()

# Testing: Switch to Docker for distributed testing
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(container_runtime="docker")

# Production: Deploy to Kubernetes
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")

# Same client and trainer code works with all backends!
client = TrainerClient(backend_config=backend_config)
job_name = client.train(trainer=trainer)

Common Operations

These operations work identically across all backends:

List Jobs:

jobs = client.list_jobs()
for job in jobs:
    print(f"{job.name}: {job.status}")

View Logs:

# Follow logs in real-time
for log_line in client.get_job_logs(name=job_name, follow=True):
    print(log_line)

Wait for Completion:

job = client.wait_for_job_status(
    name=job_name,
    timeout=3600,  # 1 hour timeout
)

Delete Jobs:

client.delete_job(name=job_name)

Troubleshooting

Local Process Backend:

  • ModuleNotFoundError: Ensure dependencies are installed in the current Python environment
  • Training hangs: Check for infinite loops in your training function
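A quick way to diagnose a ModuleNotFoundError is to check which interpreter is running and whether the missing module resolves from it. A standard-library-only sketch; torch here is just an example module name:

```python
import sys
from importlib.util import find_spec

# Which Python is actually running the training function?
print("interpreter:", sys.executable)

# Can the dependency be resolved from this environment?
for module in ("torch",):
    spec = find_spec(module)
    print(f"{module}: {'found' if spec else 'NOT installed in this environment'}")
```

If a module reports as not installed, install it with the pip belonging to the printed interpreter rather than whichever pip happens to be first on your PATH.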

Container Backend:

  • Cannot connect to Docker daemon: Start the Docker/Podman service. On macOS, verify the socket path (see Container Host Configuration above).
  • Image pull errors: Check network connectivity and image registry access
  • Permission denied: For Podman, ensure rootless mode is configured
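For rootless Podman, the API socket lives under the user's runtime directory and can be enabled with systemctl --user enable --now podman.socket. The matching container_host can then be derived from the UID, as in this sketch (assumes a Linux host):

```python
import os

# Rootless Podman serves its Docker-compatible API on a per-user socket.
uid = os.getuid()
podman_socket = f"/run/user/{uid}/podman/podman.sock"
container_host = f"unix://{podman_socket}"

print(container_host)
if not os.path.exists(podman_socket):
    print("socket not found; run: systemctl --user enable --now podman.socket")
```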

Next Steps