Highlights of Release v26.01.0-beta1
This release introduces comprehensive GPU metrics support via NVIDIA DCGM Exporter integration, modernized Python tooling, and UI/UX improvements on dark mode support.
New Features
GPU Metrics Support
- NVIDIA DCGM Integration: Full support for GPU metrics collection via NVIDIA DCGM Exporter
- GPU Compute & Memory Views: Separate monitoring views for GPU utilization and memory usage
- GPU Compute: Shows GPU utilization percentage per pod
- GPU Memory: Shows GPU memory usage per pod
- GPU Node Heatmap: Heatmap visualization for GPU nodes showing compute and memory utilization
- Per-pod GPU Usage Charts: Breakdown of GPU usage by pod within each node
- Conditional GPU Charts: GPU charts display only when DCGM endpoint is configured
- Helm Chart Support: Added DCGM Exporter configuration options in Helm chart
Theme & UI Improvements
- Theme-aware Legend Colors: Legend text colors adapt to the selected theme
- Enhanced Heatmap Labels: Improved label visibility and positioning
- Better Tooltip Positioning: Tooltips now position correctly across all views
Node Heatmap
- Resource Utilization View: Visual representation of CPU, Memory, and GPU (new) usage across nodes
Infrastructure & Build
Python Tooling Modernization
- pyproject.toml: Migrated from setup.py to modern pyproject.toml-based configuration
- uv Package Manager: Adopted uv for faster, more reliable dependency management
- Ruff Linter: Integrated Ruff for code formatting and linting
Container & CI/CD
- Ubuntu 24.04 Base Image: Upgraded container base image for improved security and performance
- Docker Workflow Improvements: Enhanced CI/CD pipeline for main branch pushes and PR builds
- Variable-based Docker Registry: Flexible Docker repository naming via CI variables
- GitHub Actions Updates: Bumped astral-sh/setup-uv and github/codeql-action versions
Bug Fixes
- Fixed regression in GPU metrics processing
- Fixed Y-axis rounding to 2 decimal places in stacked bar charts
- Fixed typo in pyproject.toml that was breaking Docker builds
- Removed logging of sensitive billing hourly rate
- Cleaned up unused variables and missing declarations
- Removed deprecated Azure and GCP pricing cost code
Documentation
- Updated README with accurate tech stack and installation instructions
- Added CLAUDE.md with development guidelines and versioning information
- Documented minimum version requirements for GPU support
New configuration variables
- Environment variable
KOA_DCGM_EXPORTER_ENDPOINTrequired for GPU monitoring
Breaking Changes
N/A
Acknowledgements
Big kudos to @ccamel for his huge contributions throughout this milestone. 💯 :
For deployment instructions, see the README.