The Hundred-Headed Guardian of Distributed Compute.
ladon-ml is a lightweight, infrastructure-level Python library designed to identify and reclaim Zombie Utilization in high-performance ML clusters. It is purpose-built for teams managing large-scale accelerator fleets (GPUs or specialized silicon like Trainium) where stalled kernels and silent hangs can lead to massive resource waste.
In distributed training, a "Zombie" is a node that reports 100% Compute Utilization and high power draw but provides 0% Functional Throughput. This usually happens due to:
- Stalled Kernels: An infinite loop in a custom op pinning the chip.
- Collective Deadlocks: A node waiting for a synchronization signal (Gloo/NCCL) that never arrives.
- Silent HBM Failures: Memory errors that stop the training loop but keep the hardware active.
Key Features
- Zero-Trust Telemetry: Correlates hardware-level metrics (Power, Thermal, Compute) with software-level progress (Network IO, Memory Delta, Gradient Updates).
- Inverse Correlation Detection: Identifies the signature of a zombie: high power/utilization + near-zero network/collective-communication traffic.
- Modular Reapers: Pluggable actions to "drain" a node, kill a specific process group, or signal orchestrators (Kubernetes/Slurm).
- Low Overhead: Written to run as a sidecar or a lightweight daemon with minimal impact on training latency.
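The zombie signature described above (pinned compute with near-zero network traffic, sustained over a window) can be sketched in a few lines. This is an illustrative sketch only, not ladon-ml's actual detector: the names `Sample` and `ZombieWindow` are hypothetical, and the default thresholds are borrowed from the sample configuration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    """One telemetry reading. Field names are illustrative, not ladon-ml's schema."""
    compute_util: float   # percent, 0-100
    net_mb_per_s: float   # network / collective-communication throughput


class ZombieWindow:
    """Flags a zombie only when every sample in a full window is pinned and silent."""

    def __init__(self, window_size: int, compute_min: float = 90.0, net_max: float = 0.01):
        self.samples = deque(maxlen=window_size)  # oldest samples fall off automatically
        self.compute_min = compute_min
        self.net_max = net_max

    def add(self, sample: Sample) -> bool:
        self.samples.append(sample)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to judge yet
        return all(
            s.compute_util >= self.compute_min and s.net_mb_per_s <= self.net_max
            for s in self.samples
        )
```

Requiring the whole window to match, rather than a single reading, keeps a briefly quiet but healthy node (for example, one in a long compute-only phase shorter than the window) from being flagged.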
Ladon follows a Collector → Brain → Reaper pattern:
- Collectors: Interface with hardware drivers (e.g., NVML, Neuron SDK) to pull raw telemetry.
- The Brain: Applies statistical analysis to detect anomalies and "Zombie" patterns over a sliding time window.
- The Reaper: Executes the reclamation policy (Log, Alert, or Kill).
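One way to picture the three stages is as small interfaces wired together in a single pass. The sketch below is an assumption about the shape of such a pipeline, not the library's real classes: `CollectorLike`, `BrainLike`, `ReaperLike`, `run_once`, and the `"OK"` verdict are all hypothetical, and the stub brain checks a single sample rather than a sliding window to keep the example short.

```python
from typing import Dict, List, Protocol

Telemetry = Dict[str, float]


class CollectorLike(Protocol):
    def collect(self) -> Telemetry: ...


class BrainLike(Protocol):
    def analyze(self, telemetry: Telemetry) -> str: ...


class ReaperLike(Protocol):
    def reap(self, verdict: str) -> None: ...


def run_once(collector: CollectorLike, brain: BrainLike, reaper: ReaperLike) -> str:
    """One pass through the pipeline: collect, analyze, and reap if needed."""
    verdict = brain.analyze(collector.collect())
    if verdict != "OK":
        reaper.reap(verdict)
    return verdict


class FakeCollector:
    """Stub that returns a fixed 'pinned and silent' reading."""
    def collect(self) -> Telemetry:
        return {"compute_util": 100.0, "net_mb_per_s": 0.0}


class ThresholdBrain:
    """Stub brain: single-sample threshold check, no sliding window."""
    def analyze(self, telemetry: Telemetry) -> str:
        pinned = telemetry["compute_util"] >= 90.0
        silent = telemetry["net_mb_per_s"] <= 0.01
        return "ZOMBIE_DETECTED" if pinned and silent else "OK"


class LogReaper:
    """Stub reaper implementing the 'Log' policy by recording verdicts."""
    def __init__(self) -> None:
        self.events: List[str] = []

    def reap(self, verdict: str) -> None:
        self.events.append(verdict)
```

Keeping the stages behind narrow interfaces is what makes the reaper pluggable: swapping `LogReaper` for one that drains a node would not touch the collector or the brain.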
Installation
pip install ladon-ml
Basic Usage
Run Ladon as a daemon to monitor the local node and detect stalled processes.
Python
import time

from ladon import LadonBrain, Collector

# Initialize with a 10-second sliding window
guardian = LadonBrain(window_size=10, sensitivity=0.05)

while True:
    # Get telemetry from GPU/Trainium chips
    stats = Collector.get_chip_telemetry()
    status = guardian.analyze(stats)

    if status == "ZOMBIE_DETECTED":
        print("Alert: Stalled compute detected on Device 0!")
        # Trigger your reclamation logic here (e.g., os.kill or k8s delete)

    time.sleep(1)
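The reclamation hook in the loop above is left to the caller. As one possible shape for it, the sketch below sends SIGTERM to a process, waits for a grace period, and escalates to SIGKILL if the process is still present. It assumes a POSIX system and that you already know the PID of the stalled training process. `terminate_process` and its return values are hypothetical helpers, not part of ladon-ml.

```python
import os
import signal
import time


def _is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_process(pid: int, grace_seconds: float = 5.0, poll_interval: float = 0.1) -> str:
    """Send SIGTERM, wait up to grace_seconds, then escalate to SIGKILL.

    Returns "already_gone", "terminated", or "killed".
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "already_gone"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not _is_alive(pid):
            return "terminated"
        time.sleep(poll_interval)

    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"  # exited just after the last poll
    return "killed"
```

A child of the calling process that has exited but not yet been waited on still answers the signal-0 probe, so this helper is best suited to a daemon reaping processes it did not spawn itself.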
Configuration
Ladon can be configured via ladon_conf.yaml:
YAML
thresholds:
  compute_min: 90.0     # Min utilization to trigger "Pinned" state
  net_max: 0.01         # Max network throughput (MB/s) for "Silent" state
  window_seconds: 30    # How long to wait before declaring a Zombie
reclamation:
  action: "SIGTERM"     # Options: LOG, SIGTERM, DRAIN, CUSTOM
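Before acting on a configuration like this, it may be worth validating it, since a typo in `action` or a negative threshold could otherwise cause a healthy process to be killed. The sketch below assumes the file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`) and that `window_seconds` sits under `thresholds`. `ZombiePolicy` and `parse_config` are hypothetical names, and the fallback defaults simply mirror the sample values above rather than any documented ladon-ml defaults.

```python
from dataclasses import dataclass

# Action names taken from the "Options" comment in the sample config.
VALID_ACTIONS = {"LOG", "SIGTERM", "DRAIN", "CUSTOM"}


@dataclass(frozen=True)
class ZombiePolicy:
    compute_min: float    # percent utilization at or above which a chip counts as pinned
    net_max: float        # MB/s at or below which a chip counts as silent
    window_seconds: int   # how long both conditions must hold
    action: str


def parse_config(raw: dict) -> ZombiePolicy:
    """Build a validated policy from an already-parsed config dictionary."""
    thresholds = raw.get("thresholds", {})
    reclamation = raw.get("reclamation", {})

    policy = ZombiePolicy(
        compute_min=float(thresholds.get("compute_min", 90.0)),   # assumed default
        net_max=float(thresholds.get("net_max", 0.01)),           # assumed default
        window_seconds=int(thresholds.get("window_seconds", 30)), # assumed default
        action=str(reclamation.get("action", "LOG")).upper(),     # assumed safe default
    )

    if not 0.0 <= policy.compute_min <= 100.0:
        raise ValueError("compute_min must be a percentage between 0 and 100")
    if policy.net_max < 0.0:
        raise ValueError("net_max must be non-negative")
    if policy.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if policy.action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {policy.action!r}")
    return policy
```

Falling back to `LOG` when no action is given is a deliberate choice in this sketch: a missing setting then results in an alert rather than a terminated job.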
Why "Ladon"?
In Greek mythology, Ladon was the hundred-headed dragon that guarded the golden apples. He never slept, and each head watched a different part of the garden. ladon-ml serves as that unblinking guardian for your compute garden, ensuring no single chip goes rogue and wastes your "golden" training time.
License
Apache 2.0