Releases: run-house/kubetorch
v0.3.0
This release introduces the Kubetorch Data Store, a unified data transfer system for moving data between your local machine and Kubernetes pods, and between pods within the cluster.
Kubetorch Data Store
The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It solves two critical gaps in Kubernetes for machine learning:
- Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
- In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss
(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)
Highlights
- Unified API: Single `kt.put()`/`kt.get()` interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
- No kubeconfig required: The Python client communicates through a centralized controller endpoint
- Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
- GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
- Broadcast coordination: `BroadcastWindow` enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation
Example Usage

```python
import kubetorch as kt

# Filesystem data
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast)
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")
```

See the docs for more info.
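The highlights above also mention configurable quorum, fanout, and tree-based propagation for `BroadcastWindow`. The sketch below is hypothetical: `world_size` appears in the example above, but the `quorum` and `fanout` parameter names are assumptions inferred from the highlights, not confirmed API; check the docs for the actual signature.

```python
# Hypothetical sketch of a wider coordinated broadcast across 8 pods.
# `world_size` is shown in the example above; `quorum` and `fanout` are
# assumed parameter names based on the feature description, not confirmed API.
window = kt.BroadcastWindow(world_size=8, quorum=8, fanout=2)
kt.put("checkpoint", data=model.state_dict(), broadcast=window)
```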
Improvements
- Remove queue and scheduler from Compute configuration options (#1968)
- Added `kt teardown` support for training jobs (#1986)
- Updates to metrics streaming output (#1984)
- Remove OTEL as a Helm dependency (#2016, #2022)
- Allow custom annotations for Kubetorch service account configuration (#2009)
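For the service account annotations, a hypothetical values.yaml sketch is shown below. The `serviceAccount.annotations` key follows the common Helm chart convention, but the exact key names and the example annotation are assumptions, not confirmed values from the Kubetorch chart.

```yaml
# Hypothetical values.yaml sketch: key names follow common Helm conventions
# and are assumptions, not confirmed keys from the Kubetorch chart.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/example-role
```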
Bug Fixes
- Use correct container name when querying logs for Kubetorch services (#1972)
- Prevent events and logs from printing on same line (#2008)
- Async lifecycle management and cleanup (#2028)
- Start Ray on the head node if no distributed config is provided with a BYO manifest (#2046)
- Handle image pull errors when checking for Knative service readiness (#2050)
- Control over autoscaler pod eviction behavior for distributed jobs (#2052)
v0.2.9
v0.2.8
Improvements
- Support for loading all pod IPs for distributed workloads (#1937)
- Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
- Improve pdb debugging (#1950)
- Remove resource limits for workloads (#1949)
- Set allowed serialization methods using local environment variable (#1951)
v0.2.7
v0.2.6
v0.2.5
New Features
Notebook Integration
- Added `kt notebook` CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
- You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster, with no extra setup needed (#1892)
Bug Fixes
- Module and submodule reimporting on the cluster (#1902)
v0.2.4
New Features
Metrics Streaming
Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:
- CPU utilization (per service or pod)
- Memory consumption (MiB)
- GPU metrics (DCGM-based utilization and memory usage, where relevant)
This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.
Related PRs: #1856, #1867, #1881, #1887
Note: To disable metrics collection, set `metrics.enabled` to `false` in the Helm chart's values.yaml.
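For example, in values.yaml (a minimal sketch, assuming the dotted path maps to the conventional nested YAML layout):

```yaml
# Disables Kubetorch metrics collection; assumes the conventional nested
# layout for the documented `metrics.enabled` value.
metrics:
  enabled: false
```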
Improvements
- Helm chart cleanup of deprecated kubetorch config values (#1865)
- Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
- Logging: updating callable name for clarity (#1876)
v0.2.3
v0.2.2
v0.2.1
Introducing Kubetorch
A Python interface for running ML workloads on Kubernetes
Components
- Python Client: Run any Python function or class directly on Kubernetes using a simple, intuitive API (see the sketch after this list).
- Helm Chart: Deploy Kubetorch to any existing Kubernetes cluster with one command.
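The sketch below illustrates the general shape of that workflow. It is hypothetical: the release notes confirm a Compute configuration object exists (see the v0.3.0 improvements above), but the exact names and signatures (`kt.fn`, `kt.Compute(cpus=...)`) are assumptions; the Getting Started guide has the authoritative API.

```python
import kubetorch as kt

def preprocess(path: str) -> str:
    return f"processed {path}"

# Hypothetical sketch: dispatch a local function to run on the cluster.
# `kt.fn` and the `kt.Compute(cpus=...)` signature are assumed names,
# not confirmed by these release notes; see the Getting Started guide.
remote_preprocess = kt.fn(preprocess).to(kt.Compute(cpus="1"))
print(remote_preprocess("s3://bucket/data.csv"))
```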
Highlights
- Fast Iteration: Hot reload and caching enable 1–3 second iteration loops for complex ML workloads.
- Resilient Execution: Built-in fault tolerance, preemption handling, and automatic restarts.
- Observability: Integrated logging and tracing for distributed workloads.
For more info see the Getting Started guide.