8 changes: 7 additions & 1 deletion docs/source/getting-started/index.rst
@@ -55,7 +55,7 @@ Here's how simple it is to train a model:
Next Steps
----------

- .. grid:: 2
+ .. grid:: 3
:gutter: 3

.. grid-item-card:: Installation
@@ -69,3 +69,9 @@ Next Steps
:link-type: doc

Train your first model step-by-step.

.. grid-item-card:: Local Development
:link: local-development
:link-type: doc

Run training jobs locally using different SDK backends.
227 changes: 227 additions & 0 deletions docs/source/getting-started/local-development.rst
@@ -0,0 +1,227 @@
Local Development
==================

This guide explains how to run Kubeflow TrainJobs locally using the SDK's
different backends, helping you iterate faster before deploying to a Kubernetes
cluster.

Overview
--------

The Kubeflow Trainer SDK provides three backends for running TrainJobs:

.. list-table:: Backend Comparison
:header-rows: 1
:widths: 20 35 45

* - Backend
- Best For
- Requirements
* - **Local Process**
- Quick prototyping, single-node testing
- Python 3.9+
* - **Container**
- Multi-node training, reproducibility
- Docker or Podman installed
* - **Kubernetes**
- Production deployments
- K8s cluster with Trainer operator

All backends use the same ``TrainerClient`` interface - only the configuration
changes. This means you can develop locally and deploy to production with
minimal code changes (see `Switching Between Backends`_ below).

Local Process Backend
---------------------

The fastest option for quick testing: it runs your training function directly as local Python processes.

**When to use:**

- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman

**Example:**

.. code-block:: python

from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure local process backend
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

# Define your training function
def train_model():
import torch
print(f"Training on device: {torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'}")
# Your training logic here

# Create trainer and run
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

# View logs
client.get_job_logs(name=job_name, follow=True)

**Limitations:**

- Single-node only (no distributed training)
- No container isolation
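
Despite these limitations, this backend is ideal for fast iteration. The
sketch below is a more complete, self-contained example you might run through
it; the model and data are illustrative assumptions (any PyTorch code works),
not part of the SDK.

.. code-block:: python

    from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
    from kubeflow.trainer import CustomTrainer

    def train_tiny_regressor():
        # Import inside the function: the function body is what the backend
        # executes, so it must be self-contained.
        import torch

        model = torch.nn.Linear(8, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(64, 8), torch.randn(64, 1)

        for epoch in range(5):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            print(f"epoch={epoch} loss={loss.item():.4f}")

    client = TrainerClient(backend_config=LocalProcessBackendConfig())
    job_name = client.train(trainer=CustomTrainer(func=train_tiny_regressor))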

Container Backend (Docker/Podman)
---------------------------------

Run training in isolated containers with full multi-node distributed training support.

**When to use:**

- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally

**Example with Docker:**

.. code-block:: python

from kubeflow.trainer import TrainerClient, ContainerBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure Docker backend
backend_config = ContainerBackendConfig(
container_runtime="docker", # or "podman"
)
client = TrainerClient(backend_config=backend_config)

# Same trainer works - now with multi-node support!
trainer = CustomTrainer(
func=train_model,
num_nodes=4, # Distributed across 4 containers
)
job_name = client.train(trainer=trainer)
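
After submission, the same client calls described under Common Operations
below work for the distributed job. A minimal sketch:

.. code-block:: python

    # Stream logs from all workers while the job runs; follow=True blocks
    # until the log stream ends.
    for line in client.get_job_logs(name=job_name, follow=True):
        print(line)

    # Then confirm the job reached a terminal state (30-minute timeout).
    client.wait_for_job_status(name=job_name, timeout=1800)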

.. _container-host-configuration:

Container Host Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using the Container backend on **macOS**, you may need to configure the
``container_host`` parameter to point to your Docker or Podman socket. This is
because the default socket path differs across operating systems.

.. list-table::
:header-rows: 1
:widths: 20 80

* - OS
- Default ``container_host``
* - Linux
- ``unix:///var/run/docker.sock`` (Docker) or ``unix:///run/user/<UID>/podman/podman.sock`` (Podman)
* - macOS
- ``unix://$HOME/.docker/run/docker.sock`` (Docker Desktop) or check ``podman machine inspect`` for Podman
* - Windows
- ``npipe:////./pipe/docker_engine`` (Docker Desktop)

**Example for macOS:**

.. code-block:: python

import os

backend_config = ContainerBackendConfig(
container_runtime="docker",
# macOS Docker Desktop socket path
container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
)
client = TrainerClient(backend_config=backend_config)

.. note::

If you encounter ``Cannot connect to Docker daemon`` errors on macOS,
verify the socket path by running ``docker context inspect`` and check
the ``Host`` value in the output.
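
If you prefer not to hard-code the socket path, you can ask the Docker CLI
for it. The sketch below shells out to ``docker context inspect``; the JSON
field path (``Endpoints.docker.Host``) matches current Docker CLI output but
should be verified against your Docker version.

.. code-block:: python

    import json
    import subprocess

    from kubeflow.trainer import TrainerClient, ContainerBackendConfig

    # `docker context inspect` prints a JSON array describing the active
    # context; the endpoint URL lives under Endpoints -> docker -> Host.
    raw = subprocess.check_output(["docker", "context", "inspect"], text=True)
    docker_host = json.loads(raw)[0]["Endpoints"]["docker"]["Host"]

    backend_config = ContainerBackendConfig(
        container_runtime="docker",
        container_host=docker_host,
    )
    client = TrainerClient(backend_config=backend_config)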


Switching Between Backends
--------------------------

The key benefit of the SDK is seamless backend switching. Your training code
stays the same - only the backend configuration changes:

.. code-block:: python

# Development: Use local process for fast iteration
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig()

# Testing: Switch to Docker for distributed testing
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(container_runtime="docker")

# Production: Deploy to Kubernetes
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")

# Same client and trainer code works with all backends!
client = TrainerClient(backend_config=backend_config)
job_name = client.train(trainer=trainer)
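
One way to exploit this in practice is a small factory that selects the
backend from an environment variable, so the rest of the script never
changes. The ``TRAINER_BACKEND`` variable here is a hypothetical convention,
not something the SDK defines:

.. code-block:: python

    import os

    from kubeflow.trainer import (
        TrainerClient,
        LocalProcessBackendConfig,
        ContainerBackendConfig,
        KubernetesBackendConfig,
    )

    def make_client() -> TrainerClient:
        # TRAINER_BACKEND is our own convention: local | container | kubernetes
        target = os.getenv("TRAINER_BACKEND", "local")
        if target == "local":
            config = LocalProcessBackendConfig()
        elif target == "container":
            config = ContainerBackendConfig(container_runtime="docker")
        elif target == "kubernetes":
            config = KubernetesBackendConfig(namespace="kubeflow")
        else:
            raise ValueError(f"Unknown backend: {target}")
        return TrainerClient(backend_config=config)

    client = make_client()  # e.g. TRAINER_BACKEND=container python train.py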

Common Operations
-----------------

These operations work identically across all backends:

**List Jobs:**

.. code-block:: python

jobs = client.list_jobs()
for job in jobs:
print(f"{job.name}: {job.status}")

**View Logs:**

.. code-block:: python

# Follow logs in real-time
for log_line in client.get_job_logs(name=job_name, follow=True):
print(log_line)

**Wait for Completion:**

.. code-block:: python

job = client.wait_for_job_status(
name=job_name,
timeout=3600, # 1 hour timeout
)

**Delete Jobs:**

.. code-block:: python

client.delete_job(name=job_name)
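
Together these calls support simple housekeeping, such as cleaning up
finished jobs. The terminal status strings below are an assumption; print
``job.status`` first to see the values your SDK version reports:

.. code-block:: python

    # Delete every job that has reached a terminal state.
    for job in client.list_jobs():
        if job.status in ("Complete", "Failed"):  # assumed status values
            print(f"Deleting {job.name} ({job.status})")
            client.delete_job(name=job.name)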

Troubleshooting
---------------

**Local Process Backend:**

- ``ModuleNotFoundError``: Ensure dependencies are installed in the current environment (see the diagnostic sketch below)
- Training hangs: Check for infinite loops in your training function
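
A quick way to diagnose the first issue is to run a tiny function through the
same client and print which interpreter (and therefore which installed
packages) the backend is using:

.. code-block:: python

    def show_environment():
        # The interpreter path and import roots usually explain a
        # ModuleNotFoundError with the local process backend.
        import sys
        print("interpreter:", sys.executable)
        for path in sys.path[:5]:
            print("search path:", path)

    client.train(trainer=CustomTrainer(func=show_environment))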

**Container Backend:**

- ``Cannot connect to Docker daemon``: Start Docker/Podman service. On macOS,
verify the socket path — see :ref:`container-host-configuration`.
- Image pull errors: Check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured

Next Steps
----------

- `Custom Training <../train/custom-training.html>`_ - Define your trainers
- `Distributed Training <../train/distributed.html>`_ - Scale across nodes
- `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -136,6 +136,7 @@ Getting Involved

getting-started/installation
getting-started/quickstart
getting-started/local-development

.. toctree::
:maxdepth: 2