Releases: run-house/kubetorch

v0.3.0

30 Dec 16:44
5ece2cc

This release introduces the Kubetorch Data Store, a unified data transfer system for moving data seamlessly between your local machine and Kubernetes pods.

Kubetorch Data Store

The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It addresses two critical gaps in Kubernetes for machine learning:

  1. Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
  2. In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss

(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)

Highlights

  • Unified API: Single kt.put()/kt.get() interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
  • No kubeconfig required: The Python client communicates through a centralized controller endpoint
  • Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
  • GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
  • Broadcast coordination: BroadcastWindow enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation

Example Usage

import kubetorch as kt

# Filesystem data: sync a local directory to the cluster and pull it back
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast): `model` is an existing torch.nn.Module, and
# `dest_state_dict` is a matching state dict that receives the tensors
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")

See the docs for more info.

Improvements

  • Remove queue and scheduler from Compute configuration options (#1968)
  • Added kt teardown support for training jobs (#1986)
  • Updates to metrics streaming output (#1984)
  • Remove OTEL as a Helm dependency (#2016, #2022)
  • Allow custom annotations for Kubetorch service account configuration (#2009)

Bug Fixes

  • Use correct container name when querying logs for Kubetorch services (#1972)
  • Prevent events and logs from printing on same line (#2008)
  • Async lifecycle management and cleanup (#2028)
  • Start Ray on head node if no distributed config provided with BYO manifest (#2046)
  • Handle image pull errors when checking for knative service readiness (#2050)
  • Control over autoscaler pod eviction behavior for distributed jobs (#2052)

v0.2.9

09 Dec 00:37
452213e

Improvements

  • Introduce global LoggingConfig for easier control of logging behavior with Kubetorch services (#1959)
  • Simplify compute spec requirements in factory and constructor (#1963)

Bug Fixes

  • Prevent log recursion errors (#1957)
  • Metrics streaming for async service calls (#1958)
  • Specify limits in pod template when requesting GPUs (#1965)

v0.2.8

05 Dec 13:56
a2d81b9

Improvements

  • Support for loading all pod IPs for distributed workloads (#1937)
  • Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
  • Improve pdb debugging (#1950)
  • Remove resource limits for workloads (#1949)
  • Set allowed serialization methods using local environment variable (#1951)

Bug Fixes

  • Fix pdb support and other query params for callables (#1939)
  • Setting Kubetorch volume mount paths on creation & reload (#1944)
  • Fix to_async() when get_if_exists is set to True (#1945)

v0.2.7

26 Nov 23:00
85402c8

Improvements

  • Added kt logs CLI command to stream and follow Kubetorch deployment logs (including distributed deployments) (#1928)
  • Add affinity and tolerations for rsync and nginx proxy deployments (#1923)
  • Filter out metric service calls from kubetorch pods (#1918)

Bug Fixes

  • Always SSH into head node for Ray deployments (#1931)
  • Metrics streaming for async calls (#1924)

v0.2.6

20 Nov 20:43
67adf58

Improvements

  • Expanded python version support (#1907)
  • Support helm installation in any namespace for better isolation (#1913)
  • Suppress metrics log checks in kubetorch pod logs (#1918)
  • Relax cluster readiness checks (#1919)

Bug Fixes

  • Template label parsing (#1908)
  • Deploy headless service for distributed use cases only (#1920)

v0.2.5

13 Nov 22:08
1b2d670

New Features

Notebook Integration

  • Added kt notebook CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
  • You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster — no extra setup needed (#1892)

Improvements

  • Simplified kubetorch[client] dependencies (#1896)
  • Faster kt list CLI loading (#1905)

Bug Fixes

  • Module and submodule reimporting on the cluster (#1902)

v0.2.4

12 Nov 09:52
8f86a29

New Features

Metrics Streaming

Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:

  • CPU utilization (per service or pod)
  • Memory consumption (MiB)
  • GPU metrics (DCGM-based utilization and memory usage, where relevant)

This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.

Related PRs: #1856, #1867, #1881, #1887

Note: To disable metrics collection, set metrics.enabled to false in the Helm chart's values.yaml.
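
A minimal values.yaml override reflecting that note (only the metrics.enabled key comes from the text above; the surrounding layout is the standard Helm values format):

# Disable metrics collection in the Kubetorch Helm chart
metrics:
  enabled: false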

Improvements

  • Helm chart cleanup of deprecated kubetorch config values (#1865)
  • Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
  • Logging: updating callable name for clarity (#1876)

Bug Fixes

  • Fix Dockerfile sync when running kt app (#1863)
  • kt config set/unset to only update specific config keys (#1882)
  • Reload cached submodules when reimporting kubetorch module on the server (#1883)

v0.2.3

04 Nov 20:03
619077a

Improvements

  • Allow teardown without username prefix (#1849)
  • Skip working directory sync outside packages or repos (#1850)
  • Error handling for missing or invalid kube config (#1854)
  • Update compute tolerations for GPU workloads (#1839)

v0.2.2

22 Oct 01:13
186fe0a

Bug Fixes

  • Rsync for non-local mode (#1833)

Improvements

  • Helm chart cleanup (#1832)

v0.2.1

22 Oct 00:50
12ef5df

Introducing Kubetorch

A Python interface for running ML workloads on Kubernetes

Components

  • Python Client: Run any Python function or class directly on Kubernetes using a simple, intuitive API.
  • Helm Chart: Deploy Kubetorch to any existing Kubernetes cluster with one command.

Highlights

  • Fast Iteration: Hot reload and caching enable 1–3 second iteration loops for complex ML workloads.
  • Resilient Execution: Built-in fault tolerance, preemption handling, and automatic restarts.
  • Observability: Integrated logging and tracing for distributed workloads.
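
As a quick illustration of the Python client, here is a minimal sketch of dispatching a function to the cluster. It assumes the kt.fn(...).to(kt.Compute(...)) pattern from the Kubetorch client; the exact factory arguments are assumptions, not taken from these notes:

import kubetorch as kt

def add(a, b):
    return a + b

# Wrap a local function and send it to compute on the cluster; the
# kt.fn / kt.Compute signatures shown here are assumed
remote_add = kt.fn(add).to(kt.Compute(cpus="1"))
print(remote_add(1, 2))  # executes on Kubernetes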

For more info see the Getting Started guide.