This guide explains how to run Kubeflow TrainJobs locally using the SDK's different backends, helping you iterate faster before deploying to a Kubernetes cluster.
The Kubeflow Trainer SDK provides three backends for running TrainJobs:
| Backend | Best For | Requirements |
|---|---|---|
| Local Process | Quick prototyping, single-node testing | Python 3.9+ |
| Container | Multi-node training, reproducibility | Docker or Podman installed |
| Kubernetes | Production deployments | K8s cluster with Trainer operator |
All backends use the same TrainerClient interface - only the configuration
changes. This means you can develop locally and deploy to production with
minimal code changes.
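As an illustration of that config-swap pattern only, the backend choice can be driven by a single name. The factory registry below uses plain dicts as stand-ins; `make_backend_config` and the dict contents are hypothetical, not part of the SDK (in real code the factories would return `LocalProcessBackendConfig()`, `ContainerBackendConfig()`, or `KubernetesBackendConfig()`):

```python
# Sketch of the config-swap pattern with hypothetical stand-ins.
# The real SDK config classes would replace the dicts returned here.
BACKEND_FACTORIES = {
    "local": lambda: {"backend": "local_process"},
    "container": lambda: {"backend": "container", "container_runtime": "docker"},
    "kubernetes": lambda: {"backend": "kubernetes", "namespace": "kubeflow"},
}

def make_backend_config(name: str):
    """Pick a backend config by name; the training code itself never changes."""
    try:
        return BACKEND_FACTORIES[name]()
    except KeyError:
        raise ValueError(f"Unknown backend: {name!r}") from None
```

The training function and trainer definition stay identical; only the value passed to `make_backend_config` changes between development, testing, and production.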
The Local Process backend is the fastest option for quick testing. It runs training directly as Python processes on your machine.
When to use:
- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman
Example:

```python
from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient

# Configure the local process backend
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

# Define your training function
def train_model():
    import torch

    device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
    print(f"Training on device: {device}")
    # Your training logic here

# Create the trainer and run the job
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

# View logs
client.get_job_logs(name=job_name, follow=True)
```

Limitations:
- Single-node only (no distributed training)
- No container isolation
The Container backend runs training in isolated containers, with full support for multi-node distributed training.
When to use:
- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally
Example with Docker:

```python
from kubeflow.trainer import ContainerBackendConfig, CustomTrainer, TrainerClient

# Configure the container backend
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # or "podman"
)
client = TrainerClient(backend_config=backend_config)

# The same trainer works here - now with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=4,  # distributed across 4 containers
)
job_name = client.train(trainer=trainer)
```

When using the Container backend on macOS, you may need to set the
container_host parameter to point to your Docker or Podman socket, because
the default socket path differs across operating systems.
| OS | Default container_host |
|---|---|
| Linux | unix:///var/run/docker.sock (Docker) or unix:///run/user/<UID>/podman/podman.sock (Podman) |
| macOS | unix://$HOME/.docker/run/docker.sock (Docker Desktop) or check podman machine inspect for Podman |
| Windows | npipe:////./pipe/docker_engine (Docker Desktop) |
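The defaults in this table can be captured in a small helper that guesses the socket URL for a given OS. This is a sketch only: `default_container_host` is a hypothetical helper, not part of the SDK, and for Podman on macOS you still need to check `podman machine inspect` yourself:

```python
import os
import platform

def default_container_host(system=None, runtime="docker"):
    """Best-guess container_host per the table above (hypothetical helper)."""
    system = system or platform.system()
    if system == "Linux":
        if runtime == "podman":
            # Rootless Podman socket lives under the user's runtime dir
            return f"unix:///run/user/{os.getuid()}/podman/podman.sock"
        return "unix:///var/run/docker.sock"
    if system == "Darwin":
        # macOS with Docker Desktop; Podman paths vary per machine
        return f"unix://{os.environ['HOME']}/.docker/run/docker.sock"
    if system == "Windows":
        # Docker Desktop named pipe
        return "npipe:////./pipe/docker_engine"
    raise ValueError(f"Unsupported OS: {system}")
```

Usage would be `ContainerBackendConfig(container_host=default_container_host())`, falling back to manual configuration when the guess is wrong.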
Example for macOS:

```python
import os

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    # macOS Docker Desktop socket path
    container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
)
client = TrainerClient(backend_config=backend_config)
```

Note: If you encounter "Cannot connect to Docker daemon" errors on macOS,
verify the socket path by running docker context inspect and checking the
Host value in the output.
The key benefit of the SDK is seamless backend switching. Your training code stays the same - only the backend configuration changes:
```python
# Development: use the local process backend for fast iteration
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig()

# Testing: switch to containers for distributed testing
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(container_runtime="docker")

# Production: deploy to Kubernetes
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")

# The same client and trainer code works with all backends
client = TrainerClient(backend_config=backend_config)
job_name = client.train(trainer=trainer)
```

These operations work identically across all backends:
List Jobs:

```python
jobs = client.list_jobs()
for job in jobs:
    print(f"{job.name}: {job.status}")
```

View Logs:

```python
# Follow logs in real time
for log_line in client.get_job_logs(name=job_name, follow=True):
    print(log_line)
```

Wait for Completion:

```python
job = client.wait_for_job_status(
    name=job_name,
    timeout=3600,  # 1-hour timeout
)
```

Delete Jobs:

```python
client.delete_job(name=job_name)
```

Local Process Backend:
- ModuleNotFoundError: ensure your dependencies are installed in the current Python environment
- Training hangs: check for infinite loops in your training function
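For the hang case, one quick local check is to probe whether the training function returns within a wall-clock budget. This is a standard-library sketch, not an SDK feature, and `finishes_within` is a hypothetical helper; note that the daemon thread is abandoned rather than killed if the function truly hangs:

```python
import threading

def finishes_within(func, timeout_seconds):
    """Return True if func() returns within timeout_seconds (hypothetical helper)."""
    done = threading.Event()

    def _probe():
        func()
        done.set()

    # Daemon thread: abandoned (not killed) if func hangs past the timeout
    threading.Thread(target=_probe, daemon=True).start()
    return done.wait(timeout_seconds)
```

Calling `finishes_within(train_model, 60)` during prototyping quickly distinguishes slow-but-progressing training from a genuine hang.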
Container Backend:
- Cannot connect to Docker daemon: start the Docker/Podman service. On macOS, verify the socket path against the container_host table above
- Image pull errors: check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured
- Custom Training - Define your trainers
- Distributed Training - Scale across nodes
- Kubeflow Trainer Docs - Full documentation