bsaptarshi/ladon-ml

Ladon-ML 🐉

The Hundred-Headed Guardian of Distributed Compute.

ladon-ml is a lightweight, infrastructure-level Python library designed to identify and reclaim Zombie Utilization in high-performance ML clusters. It is purpose-built for teams managing large-scale accelerator fleets (GPUs or specialized silicon like Trainium), where stalled kernels and silent hangs can keep expensive hardware busy while the training job makes no progress.


The Problem: Zombie Utilization

In distributed training, a "Zombie" is a node that reports 100% Compute Utilization and high power draw but provides 0% Functional Throughput. This usually happens due to:

  • Stalled Kernels: An infinite loop in a custom op pinning the chip.
  • Collective Deadlocks: A node waiting for a synchronization signal (Gloo/NCCL) that never arrives.
  • Silent HBM Failures: Memory errors that stop the training loop but keep the hardware active.
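
All three failure modes share one observable symptom: the training loop stops completing steps while the hardware still looks busy. As a minimal sketch of how that software-side progress signal could be tracked, the hypothetical `ProgressWatchdog` below (not part of the ladon-ml API; the class name, its methods, and the `stall_seconds` parameter are assumptions made for illustration) records the last time the step counter advanced and reports a stall once that timestamp is too old.

```python
import time
from typing import Callable, Optional


class ProgressWatchdog:
    """Flags a training loop that has stopped completing steps.

    Hypothetical helper for illustration; not part of the ladon-ml API.
    """

    def __init__(
        self,
        stall_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.stall_seconds = stall_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_step: Optional[int] = None
        self._last_progress_at = clock()

    def record_step(self, step: int) -> None:
        """Call from the training loop after each completed step."""
        # Only a strictly larger step number counts as progress.
        if self._last_step is None or step > self._last_step:
            self._last_step = step
            self._last_progress_at = self._clock()

    def is_stalled(self) -> bool:
        """True when no new step has been recorded for stall_seconds."""
        return self._clock() - self._last_progress_at >= self.stall_seconds
```

A stall reported here becomes a zombie candidate only when the hardware metrics for the same node still show high utilization; an idle node with no progress is simply idle.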

Core Features

  • Zero-Trust Telemetry: Correlates hardware-level metrics (Power, Thermal, Compute) with software-level progress (Network IO, Memory Delta, Gradient Updates).
  • Inverse Correlation Detection: Identifies the signature of a zombie: high power/utilization + near-zero network/collective-communication traffic.
  • Modular Reapers: Pluggable actions to "drain" a node, kill a specific process group, or signal orchestrators (Kubernetes/Slurm).
  • Low Overhead: Written to run as a sidecar or a lightweight daemon with minimal impact on training latency.
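
The inverse-correlation signature can be sketched as a check over a window of samples. This is an illustration under stated assumptions, not the library's detector: the `Sample` type and `matches_zombie_signature` function are hypothetical names, and the default thresholds are borrowed from the `compute_min` and `net_max` values in this README's configuration example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One telemetry reading (hypothetical shape, for illustration only)."""
    compute_util_pct: float  # accelerator utilization, 0-100
    net_mb_per_s: float      # network / collective-communication throughput


def matches_zombie_signature(
    window: Sequence[Sample],
    compute_min: float = 90.0,
    net_max: float = 0.01,
) -> bool:
    """Return True when the whole window is pinned AND silent."""
    if not window:
        return False  # no evidence, no verdict
    # "Pinned": every sample in the window is at or above the utilization floor.
    pinned = all(s.compute_util_pct >= compute_min for s in window)
    # "Silent": average traffic over the window is at or below the ceiling.
    silent = mean(s.net_mb_per_s for s in window) <= net_max
    return pinned and silent
```

Requiring every sample to be pinned, rather than only the average, is a deliberately conservative choice for this sketch: one dip in utilization resets the verdict, which trades detection speed for fewer false positives. A power-draw check could be added the same way.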

Architecture

Ladon follows a Collector → Brain → Reaper pattern:

  1. Collectors: Interface with hardware drivers (e.g., NVML, Neuron SDK) to pull raw telemetry.
  2. The Brain: Applies statistical analysis to detect anomalies and "Zombie" patterns over a sliding time window.
  3. The Reaper: Executes the reclamation policy (Log, Alert, or Kill).
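
The three stages can be sketched as small interchangeable parts. The code below is an assumption-laden illustration of the pattern, not ladon-ml's actual classes: `TelemetryCollector`, `Reaper`, `WindowBrain`, `StaticCollector`, `LogReaper`, and `run_once` are all hypothetical names, and the telemetry is modeled as a plain dictionary.

```python
from collections import deque
from typing import Deque, Dict, List, Protocol

Telemetry = Dict[str, float]


class TelemetryCollector(Protocol):
    """Stage 1: anything that can produce one telemetry sample."""
    def collect(self) -> Telemetry: ...


class Reaper(Protocol):
    """Stage 3: anything that can act on a detection."""
    def reap(self, reason: str) -> None: ...


class WindowBrain:
    """Stage 2: decides over a fixed-size sliding window of samples."""

    def __init__(self, window_size: int, compute_min: float = 90.0,
                 net_max: float = 0.01) -> None:
        self._window: Deque[Telemetry] = deque(maxlen=window_size)
        self._window_size = window_size
        self._compute_min = compute_min
        self._net_max = net_max

    def observe(self, sample: Telemetry) -> bool:
        """Add a sample; return True once the full window looks like a zombie."""
        self._window.append(sample)
        if len(self._window) < self._window_size:
            return False  # not enough history yet
        return all(
            s["compute_util_pct"] >= self._compute_min
            and s["net_mb_per_s"] <= self._net_max
            for s in self._window
        )


class StaticCollector:
    """Fake collector that always returns the same sample (for demos and tests)."""
    def __init__(self, sample: Telemetry) -> None:
        self._sample = sample

    def collect(self) -> Telemetry:
        return dict(self._sample)


class LogReaper:
    """Least invasive policy: record the event and do nothing else."""
    def __init__(self) -> None:
        self.events: List[str] = []

    def reap(self, reason: str) -> None:
        self.events.append(reason)


def run_once(collector: TelemetryCollector, brain: WindowBrain,
             reaper: Reaper) -> bool:
    """One pipeline tick: collect, analyze, and reap on a detection."""
    sample = collector.collect()
    if brain.observe(sample):
        reaper.reap(f"zombie pattern detected: {sample}")
        return True
    return False
```

Because each stage depends only on a small interface, a hardware-backed collector or a more aggressive reaper can replace the fakes without changing the loop.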

Quick Start

Installation

```shell
pip install ladon-ml
```

Basic Usage

Run Ladon as a daemon to monitor the local node and detect stalled processes.

```python
import time
from ladon import LadonBrain, Collector

# Initialize with a 10-second sliding window
guardian = LadonBrain(window_size=10, sensitivity=0.05)

while True:
    # Get telemetry from GPU/Trainium chips
    stats = Collector.get_chip_telemetry()
    status = guardian.analyze(stats)

    if status == "ZOMBIE_DETECTED":
        print("Alert: Stalled compute detected on Device 0!")
        # Trigger your reclamation logic here (e.g., os.kill or k8s delete)

    time.sleep(1)
```
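
For the `os.kill` route mentioned in the comment above, a reclamation hook might look like the following sketch. It uses only the Python standard library; the function name `request_termination` is hypothetical, and how the PID of the stalled process is obtained is left open, since the example above does not show it.

```python
import os
import signal


def request_termination(pid: int) -> bool:
    """Ask a process to exit by sending SIGTERM.

    Returns True if the signal was delivered and False if no such
    process exists. A PermissionError (the process belongs to another
    user) is deliberately not caught, so the caller sees it.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    return True
```

SIGTERM gives the training process a chance to flush checkpoints and release its devices. A kernel that is truly stuck may ignore it, so a production policy would likely need a timeout followed by a stronger action.
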
Configuration

Ladon can be configured via ladon_conf.yaml:

```yaml
thresholds:
  compute_min: 90.0      # Min utilization to trigger "Pinned" state
  net_max: 0.01          # Max network throughput (MB/s) for "Silent" state
  window_seconds: 30     # How long to wait before declaring a Zombie
reclamation:
  action: "SIGTERM"      # Options: LOG, SIGTERM, DRAIN, CUSTOM
```
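
To make the expected shape of these settings concrete, here is a sketch of validating them once they have been parsed into a dictionary (for example by a YAML library). `LadonConfig` and `parse_config` are hypothetical names for this illustration, and the range checks are assumptions inferred from the comments in the configuration file, not documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Action names taken from the "Options" comment in ladon_conf.yaml.
VALID_ACTIONS = {"LOG", "SIGTERM", "DRAIN", "CUSTOM"}


@dataclass(frozen=True)
class LadonConfig:
    """Typed view of ladon_conf.yaml (hypothetical, for illustration)."""
    compute_min: float    # percent utilization that counts as "Pinned"
    net_max: float        # MB/s at or below which the node is "Silent"
    window_seconds: int   # how long both must hold before declaring a Zombie
    action: str           # reclamation action to run


def parse_config(raw: Mapping[str, Any]) -> LadonConfig:
    """Validate an already-parsed config mapping and return a LadonConfig."""
    thresholds = raw["thresholds"]
    action = str(raw["reclamation"]["action"]).upper()
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown reclamation action: {action!r}")

    cfg = LadonConfig(
        compute_min=float(thresholds["compute_min"]),
        net_max=float(thresholds["net_max"]),
        window_seconds=int(thresholds["window_seconds"]),
        action=action,
    )
    # Assumed sanity limits: utilization is a percentage, the rest are positive.
    if not 0.0 <= cfg.compute_min <= 100.0:
        raise ValueError("compute_min must be between 0 and 100")
    if cfg.net_max < 0.0:
        raise ValueError("net_max must be non-negative")
    if cfg.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return cfg
```

Failing fast on a misspelled action or an out-of-range threshold matters more than usual here, because a bad value could either disable detection or terminate healthy jobs.
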
Why "Ladon"?

In Greek mythology, Ladon was the dragon with one hundred heads that guarded the golden apples. He never slept and each head watched a different part of the garden. Ladon-ML serves as that unblinking guardian for your compute garden, ensuring no single chip goes rogue and wastes your "golden" training time.

License

Apache 2.0
