|
| 1 | +--- |
| 2 | +slug: release-ltp-v1.6 |
| 3 | +title: Releasing Lucia Training Platform v1.6 |
| 4 | +author: Lucia Training Platform Team |
| 5 | +tags: [ltp, announcement, release] |
| 6 | +--- |
| 7 | + |
| 8 | +We are pleased to announce the official release of **Lucia Training Platform v1.6.0**! |
| 9 | + |
| 10 | +## Lucia Training Platform v1.6.0 Release Notes |
| 11 | + |
| 12 | +This release focuses on security hardening, Docker image optimization, infrastructure upgrades, and bug fixes across the platform. |
| 13 | + |
| 14 | +## Platform Features & Bug Fixes |
| 15 | +- Upgraded webportal to Node.js 24 and removed the separate webportal-dind service — webportal now runs directly without Docker-in-Docker, simplifying deployment and reducing image size |
| 16 | +- Fixed job-detail page error handling for permission-denied errors; the page now shows a clear message instead of loading indefinitely |
| 17 | +- Fixed job YAML and output log display issues on the webportal |
| 18 | +- Added support for tagging different types of GPUs |
| 19 | +- Skipped validation job submission for CPU nodes |
| 20 | +- Made Prometheus retention size configurable per service to prevent disk-full issues |
| 21 | +- Added a tool to preserve application tokens when revoking all tokens |
| 22 | +- Removed the abnormal-detector cronjob when the service is stopped |
| 23 | +- Fixed exception when no name exists in filter |
| 24 | + |
| 25 | +## Docker Image Optimization |
| 26 | +- Reduced Docker image sizes for cluster-local-storage, copilot-chat, dashboard-data-backup, utilization-reporter, abnormal-detector, cert-expiration-checker, cluster-utilization, reverse-proxy, and model-proxy |
| 27 | +- Upgraded metrics-cleaner base image from Python 3.7 to 3.12-slim |
| 28 | +- Cleaned up job-exporter Docker image |
| 29 | + |
| 30 | +## Infrastructure & Networking |
| 31 | +- Updated Cilium from 1.18.6 to 1.18.9 |
| 32 | +- Updated Go version to 1.25 across all Go-based components |
| 33 | +- Added homebrew (in-house) builds for the kube-scheduler and Grafana container images |
| 34 | +- Downgraded kube-scheduler version to match service Kubernetes version |
| 35 | +- Added IPoIB subnet route in init.sh to fix InfiniBand TCP connectivity on NetworkManager-managed nodes |
| 36 | +- Fixed DNS problem for cluster-local-storage |
| 37 | +- Fixed the missing zlib 1.3.1 issue for pylon |
| 38 | +- Added Managed Identity support for build scripts |
| 39 | +- Made imagePullSecrets conditional to eliminate FailedToRetrieveImagePullSecret warnings |
| 40 | +- Removed secret deployment for image pull in favor of ACR credentials |
| 41 | + |
| 42 | +## Alert Manager & Node Management |
| 43 | +- Fixed KeyError when alert-parser processes validating nodes with no alerts |
| 44 | +- Downgraded hardware issues without an Azure FaultCode to triaged_unknown to avoid breaking the OFR pipeline |
| 45 | +- Prevented node-recycler from submitting duplicate OFR tickets for the same node |
| 46 | +- Skipped classification for cordoned nodes with an empty NodeId to prevent the OFR pipeline from stalling |
| 47 | + |
| 48 | +## Security |
| 49 | +- Updated Go toolchain and packages across all Go-based services |
| 50 | +- Updated Node.js packages for rest-server, alert-handler, job-status-change-notification, database-controller, and webportal |
| 51 | +- Updated Python packages for copilot-chat |
| 52 | +- Fixed S360 vulnerabilities across 13 container images, affecting packages including openssl, axios, follow-redirects, lodash, nodemailer, and minimatch |
| 53 | +- Updated go-ntlmssp to 0.1.1 for reverse-proxy |
| 54 | +- Updated k8s-rdma-shared-dev-plugin to work with the latest gRPC package |
| 55 | + |
| 56 | +## CI/CD |
| 57 | +- Updated the CI workflow to exclude dev-box from changed-service detection |
| 58 | +- Changed cleanup to remove all existing statefulsets in the system rather than only config-defined ones |