This guide explains how to run Kubeflow TrainJobs locally using the SDK's different backends, helping you iterate faster before deploying to a Kubernetes cluster.
The Kubeflow Trainer SDK provides three backends for running TrainJobs:
| Backend | Best For | Requirements |
|---|---|---|
| Local Process | Quick prototyping, single-node testing | Python 3.9+ |
| Container | Multi-node training, reproducibility | Docker or Podman installed |
| Kubernetes | Production deployments | K8s cluster with Trainer operator |
All backends use the same TrainerClient interface - only the configuration
changes. This means you can develop locally and deploy to production with
minimal code changes.
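As an illustration of that config-swap pattern only, the backend choice can be driven by a single name. The factory registry below uses plain dicts as stand-ins; `make_backend_config` and the dict contents are hypothetical, not part of the SDK (in real code the factories would return `LocalProcessBackendConfig()`, `ContainerBackendConfig()`, or `KubernetesBackendConfig()`):

```python
# Sketch of the config-swap pattern with hypothetical stand-ins.
# The real SDK config classes would replace the dicts returned here.
BACKEND_FACTORIES = {
    "local": lambda: {"backend": "local_process"},
    "container": lambda: {"backend": "container", "container_runtime": "docker"},
    "kubernetes": lambda: {"backend": "kubernetes", "namespace": "kubeflow"},
}

def make_backend_config(name: str):
    """Pick a backend config by name; the training code itself never changes."""
    try:
        return BACKEND_FACTORIES[name]()
    except KeyError:
        raise ValueError(f"Unknown backend: {name!r}") from None
```

The training function and trainer definition stay identical; only the value passed to `make_backend_config` changes between development, testing, and production.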
The Local Process backend is the fastest option for quick testing. It runs training directly as Python processes on your machine.
When to use:
- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman
Example:

```python
from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient

# Configure the local process backend
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

# Define your training function
def train_model():
    import torch

    device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
    print(f"Training on device: {device}")
    # Your training logic here

# Create the trainer and run the job
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

# View logs
client.get_job_logs(name=job_name, follow=True)
```

Limitations:
- Single-node only (no distributed training)
- No container isolation
The Container backend runs training in isolated containers, with full support for multi-node distributed training.
When to use:
- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally
Example with Docker:

```python
from kubeflow.trainer import ContainerBackendConfig, CustomTrainer, TrainerClient

# Configure the container backend
backend_config = ContainerBackendConfig(
    container_runtime="docker",  # or "podman"
)
client = TrainerClient(backend_config=backend_config)

# The same trainer works here - now with multi-node support
trainer = CustomTrainer(
    func=train_model,
    num_nodes=4,  # distributed across 4 containers
)
job_name = client.train(trainer=trainer)
```

When using the Container backend on macOS, you may need to set the
container_host parameter to point to your Docker or Podman socket, because
the default socket path differs across operating systems.
| OS | Default container_host |
|---|---|
| Linux | unix:///var/run/docker.sock (Docker) or unix:///run/user/<UID>/podman/podman.sock (Podman) |
| macOS | unix://$HOME/.docker/run/docker.sock (Docker Desktop) or check podman machine inspect for Podman |
| Windows | npipe:////./pipe/docker_engine (Docker Desktop) |
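The defaults in this table can be captured in a small helper that guesses the socket URL for a given OS. This is a sketch only: `default_container_host` is a hypothetical helper, not part of the SDK, and for Podman on macOS you still need to check `podman machine inspect` yourself:

```python
import os
import platform

def default_container_host(system=None, runtime="docker"):
    """Best-guess container_host per the table above (hypothetical helper)."""
    system = system or platform.system()
    if system == "Linux":
        if runtime == "podman":
            # Rootless Podman socket lives under the user's runtime dir
            return f"unix:///run/user/{os.getuid()}/podman/podman.sock"
        return "unix:///var/run/docker.sock"
    if system == "Darwin":
        # macOS with Docker Desktop; Podman paths vary per machine
        return f"unix://{os.environ['HOME']}/.docker/run/docker.sock"
    if system == "Windows":
        # Docker Desktop named pipe
        return "npipe:////./pipe/docker_engine"
    raise ValueError(f"Unsupported OS: {system}")
```

Usage would be `ContainerBackendConfig(container_host=default_container_host())`, falling back to manual configuration when the guess is wrong.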
Example for macOS:

```python
import os

backend_config = ContainerBackendConfig(
    container_runtime="docker",
    # macOS Docker Desktop socket path
    container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
)
client = TrainerClient(backend_config=backend_config)
```

Note: If you encounter "Cannot connect to Docker daemon" errors on macOS,
verify the socket path by running docker context inspect and checking the
Host value in the output.
The key benefit of the SDK is seamless backend switching. Your training code stays the same - only the backend configuration changes:
```python
# Development: use the local process backend for fast iteration
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig()

# Testing: switch to containers for distributed testing
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(container_runtime="docker")

# Production: deploy to Kubernetes
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")

# The same client and trainer code works with all backends
client = TrainerClient(backend_config=backend_config)
job_name = client.train(trainer=trainer)
```

These operations work identically across all backends:
List Jobs:

```python
jobs = client.list_jobs()
for job in jobs:
    print(f"{job.name}: {job.status}")
```

View Logs:

```python
# Follow logs in real time
for log_line in client.get_job_logs(name=job_name, follow=True):
    print(log_line)
```

Wait for Completion:

```python
job = client.wait_for_job_status(
    name=job_name,
    timeout=3600,  # 1-hour timeout
)
```

Delete Jobs:

```python
client.delete_job(name=job_name)
```

Local Process Backend:
- ModuleNotFoundError: ensure your dependencies are installed in the current Python environment
- Training hangs: check for infinite loops in your training function
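For the hang case, one quick local check is to probe whether the training function returns within a wall-clock budget. This is a standard-library sketch, not an SDK feature, and `finishes_within` is a hypothetical helper; note that the daemon thread is abandoned rather than killed if the function truly hangs:

```python
import threading

def finishes_within(func, timeout_seconds):
    """Return True if func() returns within timeout_seconds (hypothetical helper)."""
    done = threading.Event()

    def _probe():
        func()
        done.set()

    # Daemon thread: abandoned (not killed) if func hangs past the timeout
    threading.Thread(target=_probe, daemon=True).start()
    return done.wait(timeout_seconds)
```

Calling `finishes_within(train_model, 60)` during prototyping quickly distinguishes slow-but-progressing training from a genuine hang.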
Container Backend:
- Cannot connect to Docker daemon: start the Docker/Podman service. On macOS, verify the socket path against the container_host table above
- Image pull errors: check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured
- Custom Training - Define your trainers
- Distributed Training - Scale across nodes
- Kubeflow Trainer Docs - Full documentation