Releases: run-house/kubetorch
v0.3.0
This release introduces the Kubetorch Data Store, a unified data transfer system for moving data between your local machine and Kubernetes pods, and between pods within the cluster.
Kubetorch Data Store
The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It solves two critical gaps in Kubernetes for machine learning:
- Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
- In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss
(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)
Highlights
- Unified API: Single `kt.put()`/`kt.get()` interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
- No kubeconfig required: The Python client communicates through a centralized controller endpoint
- Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
- GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
- Broadcast coordination: `BroadcastWindow` enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation
Example Usage

```python
import kubetorch as kt

# Filesystem data
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast)
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")
```

See the docs for more info.
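The highlights above also mention configurable quorum, fanout, and tree-based propagation for `BroadcastWindow`. The sketch below is hypothetical: `world_size` appears in the example above, but the `quorum` and `fanout` parameter names are assumptions inferred from the highlights, not confirmed API; check the docs for the actual signature.

```python
# Hypothetical sketch of a wider coordinated broadcast across 8 pods.
# `world_size` is shown in the example above; `quorum` and `fanout` are
# assumed parameter names based on the feature description, not confirmed API.
window = kt.BroadcastWindow(world_size=8, quorum=8, fanout=2)
kt.put("checkpoint", data=model.state_dict(), broadcast=window)
```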
Improvements
- Remove queue and scheduler from Compute configuration options (#1968)
- Added `kt teardown` support for training jobs (#1986)
- Updates to metrics streaming output (#1984)
- Remove OTEL as a Helm dependency (#2016, #2022)
- Allow custom annotations for Kubetorch service account configuration (#2009)
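For the service account annotations, a hypothetical values.yaml sketch is shown below. The `serviceAccount.annotations` key follows the common Helm chart convention, but the exact key names and the example annotation are assumptions, not confirmed values from the Kubetorch chart.

```yaml
# Hypothetical values.yaml sketch: key names follow common Helm conventions
# and are assumptions, not confirmed keys from the Kubetorch chart.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/example-role
```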
Bug Fixes
- Use correct container name when querying logs for Kubetorch services (#1972)
- Prevent events and logs from printing on same line (#2008)
- Async lifecycle management and cleanup (#2028)
- Start Ray on the head node if no distributed config is provided with a BYO manifest (#2046)
- Handle image pull errors when checking for Knative service readiness (#2050)
- Control over autoscaler pod eviction behavior for distributed jobs (#2052)
v0.2.9
v0.2.8
Improvements
- Support for loading all pod IPs for distributed workloads (#1937)
- Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
- Improve pdb debugging (#1950)
- Remove resource limits for workloads (#1949)
- Set allowed serialization methods using local environment variable (#1951)
v0.2.7
v0.2.6
v0.2.5
New Features
Notebook Integration
- Added `kt notebook` CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
- You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster, with no extra setup needed (#1892)
Bug Fixes
- Module and submodule reimporting on the cluster (#1902)
v0.2.4
New Features
Metrics Streaming
Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:
- CPU utilization (per service or pod)
- Memory consumption (MiB)
- GPU metrics (DCGM-based utilization and memory usage, where relevant)
This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.
Related PRs: #1856, #1867, #1881, #1887
Note: To disable metrics collection, set `metrics.enabled` to `false` in the Helm chart's values.yaml.
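For example, in values.yaml (a minimal sketch, assuming the dotted path maps to the conventional nested YAML layout):

```yaml
# Disables Kubetorch metrics collection; assumes the conventional nested
# layout for the documented `metrics.enabled` value.
metrics:
  enabled: false
```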
Improvements
- Helm chart cleanup of deprecated kubetorch config values (#1865)
- Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
- Logging: updating callable name for clarity (#1876)
v0.2.3
v0.2.2
v0.2.1
Introducing Kubetorch
A Python interface for running ML workloads on Kubernetes
Components
- Python Client: Run any Python function or class directly on Kubernetes using a simple, intuitive API (see the sketch after this list).
- Helm Chart: Deploy Kubetorch to any existing Kubernetes cluster with one command.
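The sketch below illustrates the general shape of that workflow. It is hypothetical: the release notes confirm a Compute configuration object exists (see the v0.3.0 improvements above), but the exact names and signatures (`kt.fn`, `kt.Compute(cpus=...)`) are assumptions; the Getting Started guide has the authoritative API.

```python
import kubetorch as kt

def preprocess(path: str) -> str:
    return f"processed {path}"

# Hypothetical sketch: dispatch a local function to run on the cluster.
# `kt.fn` and the `kt.Compute(cpus=...)` signature are assumed names,
# not confirmed by these release notes; see the Getting Started guide.
remote_preprocess = kt.fn(preprocess).to(kt.Compute(cpus="1"))
print(remote_preprocess("s3://bucket/data.csv"))
```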
Highlights
- Fast Iteration: Hot reload and caching enable 1–3 second iteration loops for complex ML workloads.
- Resilient Execution: Built-in fault tolerance, preemption handling, and automatic restarts.
- Observability: Integrated logging and tracing for distributed workloads.
For more info see the Getting Started guide.