8 changes: 7 additions & 1 deletion docs/source/getting-started/index.rst
@@ -55,7 +55,7 @@ Here's how simple it is to train a model:
Next Steps
----------

- .. grid:: 2
+ .. grid:: 3
:gutter: 3

.. grid-item-card:: Installation
@@ -69,3 +69,9 @@ Next Steps
:link-type: doc

Train your first model step-by-step.

.. grid-item-card:: Local Development
:link: local-development
:link-type: doc

Run training jobs locally using different SDK backends.
227 changes: 227 additions & 0 deletions docs/source/getting-started/local-development.rst
@@ -0,0 +1,227 @@
Local Development
==================

This guide explains how to run Kubeflow TrainJobs locally using the SDK's
different backends, helping you iterate faster before deploying to a Kubernetes
cluster.

Overview
--------

The Kubeflow Trainer SDK provides three backends for running TrainJobs:

.. list-table:: Backend Comparison
:header-rows: 1
:widths: 20 35 45

* - Backend
- Best For
- Requirements
* - **Local Process**
- Quick prototyping, single-node testing
- Python 3.9+
* - **Container**
- Multi-node training, reproducibility
- Docker or Podman installed
* - **Kubernetes**
- Production deployments
- K8s cluster with Trainer operator

All backends use the same ``TrainerClient`` interface - only the configuration
changes. This means you can develop locally and deploy to production with
minimal code changes (see `Switching Between Backends`_ below).

Local Process Backend
---------------------

The fastest option for quick testing: it runs your training function directly as local Python processes.

**When to use:**

- Rapid prototyping and debugging
- Testing training logic without container overhead
- Environments without Docker/Podman

**Example:**

.. code-block:: python

from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure local process backend
backend_config = LocalProcessBackendConfig()
client = TrainerClient(backend_config=backend_config)

# Define your training function
def train_model():
import torch
print(f"Training on device: {torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'}")
# Your training logic here

# Create trainer and run
trainer = CustomTrainer(func=train_model)
job_name = client.train(trainer=trainer)

# View logs
client.get_job_logs(name=job_name, follow=True)

**Limitations:**

- Single-node only (no distributed training)
- No container isolation
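
Despite these limitations, this backend is ideal for fast iteration. The
sketch below is a more complete, self-contained example you might run through
it; the model and data are illustrative assumptions (any PyTorch code works),
not part of the SDK.

.. code-block:: python

    from kubeflow.trainer import TrainerClient, LocalProcessBackendConfig
    from kubeflow.trainer import CustomTrainer

    def train_tiny_regressor():
        # Import inside the function: the function body is what the backend
        # executes, so it must be self-contained.
        import torch

        model = torch.nn.Linear(8, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(64, 8), torch.randn(64, 1)

        for epoch in range(5):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
            print(f"epoch={epoch} loss={loss.item():.4f}")

    client = TrainerClient(backend_config=LocalProcessBackendConfig())
    job_name = client.train(trainer=CustomTrainer(func=train_tiny_regressor))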

Container Backend (Docker/Podman)
---------------------------------

Run training in isolated containers with full multi-node distributed training support.

**When to use:**

- Distributed training with multiple workers
- Reproducible containerized environments
- Testing production-like setups locally

**Example with Docker:**

.. code-block:: python

from kubeflow.trainer import TrainerClient, ContainerBackendConfig
from kubeflow.trainer import CustomTrainer

# Configure Docker backend
backend_config = ContainerBackendConfig(
container_runtime="docker", # or "podman"
)
client = TrainerClient(backend_config=backend_config)

# Same trainer works - now with multi-node support!
trainer = CustomTrainer(
func=train_model,
num_nodes=4, # Distributed across 4 containers
)
job_name = client.train(trainer=trainer)
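
After submission, the same client calls described under Common Operations
below work for the distributed job. A minimal sketch:

.. code-block:: python

    # Stream logs from all workers while the job runs; follow=True blocks
    # until the log stream ends.
    for line in client.get_job_logs(name=job_name, follow=True):
        print(line)

    # Then confirm the job reached a terminal state (30-minute timeout).
    client.wait_for_job_status(name=job_name, timeout=1800)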

.. _container-host-configuration:

Container Host Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using the Container backend on **macOS**, you may need to configure the
``container_host`` parameter to point to your Docker or Podman socket. This is
because the default socket path differs across operating systems.

.. list-table::
:header-rows: 1
:widths: 20 80

* - OS
- Default ``container_host``
* - Linux
- ``unix:///var/run/docker.sock`` (Docker) or ``unix:///run/user/<UID>/podman/podman.sock`` (Podman)
* - macOS
- ``unix://$HOME/.docker/run/docker.sock`` (Docker Desktop) or check ``podman machine inspect`` for Podman
* - Windows
- ``npipe:////./pipe/docker_engine`` (Docker Desktop)

**Example for macOS:**

.. code-block:: python

import os

backend_config = ContainerBackendConfig(
container_runtime="docker",
# macOS Docker Desktop socket path
container_host=f"unix://{os.environ['HOME']}/.docker/run/docker.sock",
)
client = TrainerClient(backend_config=backend_config)

.. note::

If you encounter ``Cannot connect to Docker daemon`` errors on macOS,
verify the socket path by running ``docker context inspect`` and check
the ``Host`` value in the output.
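
If you prefer not to hard-code the socket path, you can ask the Docker CLI
for it. The sketch below shells out to ``docker context inspect``; the JSON
field path (``Endpoints.docker.Host``) matches current Docker CLI output but
should be verified against your Docker version.

.. code-block:: python

    import json
    import subprocess

    from kubeflow.trainer import TrainerClient, ContainerBackendConfig

    # `docker context inspect` prints a JSON array describing the active
    # context; the endpoint URL lives under Endpoints -> docker -> Host.
    raw = subprocess.check_output(["docker", "context", "inspect"], text=True)
    docker_host = json.loads(raw)[0]["Endpoints"]["docker"]["Host"]

    backend_config = ContainerBackendConfig(
        container_runtime="docker",
        container_host=docker_host,
    )
    client = TrainerClient(backend_config=backend_config)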


Switching Between Backends
--------------------------

The key benefit of the SDK is seamless backend switching. Your training code
stays the same - only the backend configuration changes:

.. code-block:: python

# Development: Use local process for fast iteration
from kubeflow.trainer import LocalProcessBackendConfig
backend_config = LocalProcessBackendConfig()

# Testing: Switch to Docker for distributed testing
from kubeflow.trainer import ContainerBackendConfig
backend_config = ContainerBackendConfig(container_runtime="docker")

# Production: Deploy to Kubernetes
from kubeflow.trainer import KubernetesBackendConfig
backend_config = KubernetesBackendConfig(namespace="kubeflow")

# Same client and trainer code works with all backends!
client = TrainerClient(backend_config=backend_config)
job_name = client.train(trainer=trainer)
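
One way to exploit this in practice is a small factory that selects the
backend from an environment variable, so the rest of the script never
changes. The ``TRAINER_BACKEND`` variable here is a hypothetical convention,
not something the SDK defines:

.. code-block:: python

    import os

    from kubeflow.trainer import (
        TrainerClient,
        LocalProcessBackendConfig,
        ContainerBackendConfig,
        KubernetesBackendConfig,
    )

    def make_client() -> TrainerClient:
        # TRAINER_BACKEND is our own convention: local | container | kubernetes
        target = os.getenv("TRAINER_BACKEND", "local")
        if target == "local":
            config = LocalProcessBackendConfig()
        elif target == "container":
            config = ContainerBackendConfig(container_runtime="docker")
        elif target == "kubernetes":
            config = KubernetesBackendConfig(namespace="kubeflow")
        else:
            raise ValueError(f"Unknown backend: {target}")
        return TrainerClient(backend_config=config)

    client = make_client()  # e.g. TRAINER_BACKEND=container python train.py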

Common Operations
-----------------

These operations work identically across all backends:

**List Jobs:**

.. code-block:: python

jobs = client.list_jobs()
for job in jobs:
print(f"{job.name}: {job.status}")

**View Logs:**

.. code-block:: python

# Follow logs in real-time
for log_line in client.get_job_logs(name=job_name, follow=True):
print(log_line)

**Wait for Completion:**

.. code-block:: python

job = client.wait_for_job_status(
name=job_name,
timeout=3600, # 1 hour timeout
)

**Delete Jobs:**

.. code-block:: python

client.delete_job(name=job_name)
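
Together these calls support simple housekeeping, such as cleaning up
finished jobs. The terminal status strings below are an assumption; print
``job.status`` first to see the values your SDK version reports:

.. code-block:: python

    # Delete every job that has reached a terminal state.
    for job in client.list_jobs():
        if job.status in ("Complete", "Failed"):  # assumed status values
            print(f"Deleting {job.name} ({job.status})")
            client.delete_job(name=job.name)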

Troubleshooting
---------------

**Local Process Backend:**

- ``ModuleNotFoundError``: Ensure dependencies are installed in the current environment (see the diagnostic sketch below)
- Training hangs: Check for infinite loops in your training function
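
A quick way to diagnose the first issue is to run a tiny function through the
same client and print which interpreter (and therefore which installed
packages) the backend is using:

.. code-block:: python

    def show_environment():
        # The interpreter path and import roots usually explain a
        # ModuleNotFoundError with the local process backend.
        import sys
        print("interpreter:", sys.executable)
        for path in sys.path[:5]:
            print("search path:", path)

    client.train(trainer=CustomTrainer(func=show_environment))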

**Container Backend:**

- ``Cannot connect to Docker daemon``: Start Docker/Podman service. On macOS,
verify the socket path — see :ref:`container-host-configuration`.
- Image pull errors: Check network connectivity and image registry access
- Permission denied: For Podman, ensure rootless mode is configured

Next Steps
----------

- `Custom Training <../train/custom-training.html>`_ - Define your trainers
- `Distributed Training <../train/distributed.html>`_ - Scale across nodes
- `Kubeflow Trainer Docs <https://www.kubeflow.org/docs/components/trainer/>`_ - Full documentation
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -136,6 +136,7 @@ Getting Involved

getting-started/installation
getting-started/quickstart
getting-started/local-development

.. toctree::
:maxdepth: 2