Commit c25887f
Ruigao/update for 1.6 (#177)
* update golang toolchain to latest version
* apply the package updates suggested by Dependabot
* update webportal to Node.js 24 with the necessary package updates
* update webportal docker file to use node slim for output
* fix the Go version in some of the Dockerfiles
* reduce the size of cluster-local-storage docker image
* reduce the size of copilot-chat docker image
* reduce the size of dashboard-data-backup docker image
* reduce the docker image size of utilization-reporter
* reduce the size of abnormal-detector docker image
* reduce the docker image size of cert-expiration-checker
* reduce the docker image size of cluster-utilization
* reduce the docker image size of reverse proxy
* reduce the docker image size of model-proxy
* downgrade the kube-scheduler version to the same one as the service k8s version
* fix the display problem of job's YAML and output log
* add cilium docker build to fix Azure security warnings
* fix the security warnings found by ai fix tool
* fix the build problem
* fix the dockerfile errors
* update k8s-rdma-shared-dev-plugin version to adapt latest grpc package
* update cilium to version 1.18.8
* update all the binaries with go version to 1.25
* security update with GO package update and NPM package update
* remove npm related packages for webportal service
* make imagePullSecrets conditional to eliminate FailedToRetrieveImagePullSecret warnings
When secret-name is not configured (or empty), deployment templates no longer
render imagePullSecrets, and the cluster-configuration scripts skip secret
creation/deletion. The rest-server also handles empty secret-name gracefully.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
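The conditional rendering described above can be sketched in Python. This is only an illustration: `build_pod_spec` and the dict layout are hypothetical stand-ins for the real deployment templates, and only the rule itself (omit `imagePullSecrets` when secret-name is missing or empty) comes from the commit message.

```python
from typing import Any, Dict, Optional


def build_pod_spec(image: str, secret_name: Optional[str]) -> Dict[str, Any]:
    """Hypothetical helper: add imagePullSecrets only when a secret name is set."""
    spec: Dict[str, Any] = {"containers": [{"name": "app", "image": image}]}
    # Treat None, "" and whitespace-only values as "not configured".
    if secret_name and secret_name.strip():
        spec["imagePullSecrets"] = [{"name": secret_name.strip()}]
    return spec
```

With an empty name the key is absent altogether, so the kubelet has no secret to look up and has no reason to emit FailedToRetrieveImagePullSecret.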
* skip the validation job submission for CPU nodes
* fix the problem of the missing pip update for copilot chat
* fix NPM packages for docker images including rest server, alert manager and database controller
* update reverseproxy
* fix the grpc version for kube-scheduler
* fix S360 vulnerabilities for alert-handler (nodemailer) and job-status-change-notification (minimatch)
- alert-handler: add nodemailer resolution to force >=7.0.11, fixing vulnerable 6.10.1 pulled by email-templates/preview-email
- job-status-change-notification: switch to yarn workspaces focus --production to avoid installing devDependencies (eslint plugins with vulnerable minimatch), matching database-controller pattern
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* update package version for reverseproxy
* upgrade Cilium v1.18.8 to v1.18.9 to fix S360 grpc vulnerability (google.golang.org/grpc v1.74.2 -> v1.79.3)
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix job-detail page error handling for permission denied errors
When a user without permission opens another user's job page, fetchJobInfo
returns 403 but the error was silently ignored, causing the page to show
"Loading..." forever with a vague empty alert. Now fetchJobInfo checks HTTP
status, shows a clear permission error, and skips subsequent requests.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
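A minimal sketch of the status handling described above, written in Python for illustration even though the webportal itself is JavaScript. `FetchResult`, `interpret_job_info_status`, and the message strings are hypothetical; only the behavior (check the status, show a clear permission error on 403, skip later requests) is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    ok: bool
    error_message: Optional[str] = None
    continue_loading: bool = True  # whether follow-up requests should still run


def interpret_job_info_status(status: int) -> FetchResult:
    """Hypothetical mapping from an HTTP status to what the page should do."""
    if 200 <= status < 300:
        return FetchResult(ok=True)
    if status == 403:
        # Permission denied: show a clear message and stop issuing requests.
        return FetchResult(False, "You do not have permission to view this job.", False)
    return FetchResult(False, f"Failed to fetch job info (HTTP {status}).", False)
```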
* Add IPoIB subnet route in init.sh to fix IB TCP connectivity on NM-managed nodes
On VMSS nodes where NetworkManager manages IB interfaces, ifconfig sets
the IP with noprefixroute flag, preventing automatic subnet route creation.
This causes IPoIB TCP (rsync/bcast) to fail between nodes while RDMA works.
Add explicit route check and creation after ifconfig to ensure connectivity.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
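The route check can be sketched as follows. The real change lives in init.sh (shell); this Python version only illustrates the logic, and the function names, the interface name `ib0`, and the addresses are assumptions made for the example.

```python
import ipaddress
from typing import List


def subnet_of(ip_with_prefix: str) -> str:
    """Return the network CIDR for an interface address such as '10.0.1.5/24'."""
    return str(ipaddress.ip_interface(ip_with_prefix).network)


def route_commands(ip_with_prefix: str, dev: str, ip_route_output: str) -> List[str]:
    """Return the `ip route add` command to run, or [] if the subnet route exists."""
    subnet = subnet_of(ip_with_prefix)
    for line in ip_route_output.splitlines():
        fields = line.split()
        # A connected route is listed as "<subnet> dev <dev> ...".
        if fields[:3] == [subnet, "dev", dev]:
            return []
    return [f"ip route add {subnet} dev {dev}"]
```

Here `ip_route_output` stands for the text printed by `ip route show`; when the subnet route is missing, as happens with `noprefixroute`, the function returns the single command needed to add it.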
* Skip classification for cordoned nodes with empty NodeId to prevent OFR pipeline from stalling
Nodes with empty NodeId would transition to triaged_hardware but OFR cannot
create IcM tickets without a valid NodeId, causing the pipeline to stall.
Now these nodes stay in cordoned status so the classifier retries on the next cycle.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
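A rough sketch of the guard described above, assuming a simple string-based state model. `next_status` and its parameters are hypothetical; the status names `cordoned` and `triaged_hardware` come from the commit message.

```python
from typing import Optional


def next_status(current: str, node_id: Optional[str], classified_as: str) -> str:
    """Hypothetical transition rule for the classifier."""
    if current != "cordoned":
        return current
    # Without a NodeId no IcM ticket can be created, so stay cordoned
    # and let the classifier retry on the next cycle.
    if not node_id:
        return "cordoned"
    return classified_as
```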
* Fix nodemailer S360 vulnerability by upgrading yarn resolution from 7.x to 8.0.5
The resolutions field pinned nodemailer to ^7.0.11 which overrode the
dependencies entry of ^8.0.5, causing yarn to install 7.0.13 in the image.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Fix S360 vulnerabilities across 13 container images
npm upgrades:
- alert-handler: axios 1.13.5->1.15.2, follow-redirects 1.15.11->1.16.0
- database-controller: lodash 4.17.23->4.18.1 (added yarn resolution)
- rest-server, job-status-change-notification, webportal: follow-redirects 1.15.11->1.16.0
Dockerfile updates (add tdnf update for Azure Linux openssl 3.3.5-4->3.3.5-5):
- alert-parser, node-recycler, node-issue-classifier, job-data-recorder
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Downgrade hardware issues without Azure FaultCode to triaged_unknown
Hardware issues like FrontendNetworkIssue and DiskError have no matching
Azure OFR fault code. Submitting OFR for these results in unresolvable
tickets and, combined with the lack of dedup in node-recycler, causes
repeated OFR submissions (as seen with openpai-00000s). By downgrading
to triaged_unknown the node stays visible for manual investigation
while avoiding the broken OFR pipeline.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
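The downgrade rule might look like the sketch below. The lookup table is hypothetical: `ExampleGpuIssue` and its fault code are placeholders, not real Azure values, while `FrontendNetworkIssue` and `DiskError` are the two issues named above as having no fault code.

```python
from typing import Mapping, Optional

# Hypothetical table: issue name -> Azure OFR fault code (None when none exists).
FAULT_CODES: Mapping[str, Optional[str]] = {
    "ExampleGpuIssue": "EXAMPLE_FAULT_CODE",  # placeholder entry
    "FrontendNetworkIssue": None,
    "DiskError": None,
}


def triage_state(issue: str) -> str:
    """Return triaged_hardware only when an OFR fault code is available."""
    if FAULT_CODES.get(issue):
        return "triaged_hardware"
    # No matching fault code: keep the node visible for manual investigation.
    return "triaged_unknown"
```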
* Prevent node-recycler from submitting duplicate OFR tickets for the same node
Check the latest action before creating a new IcM OFR ticket — if
triaged_hardware-ua already exists, skip ticket creation and reuse the
existing ticket ID for polling. This fixes the bug where every pipeline
loop could spawn a new OFR request for the same node because
get_latest_action_by_state (endswith query) never matches the
triaged_hardware-ua action.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
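The deduplication check can be sketched like this. The record shape (a dict with `action` and `ticket_id` keys) and the `ensure_ticket` helper are assumptions for illustration; the actual node-recycler code is not shown on this page.

```python
from typing import Callable, Dict, Optional

SUBMITTED = "triaged_hardware-ua"  # action name taken from the commit message


def ensure_ticket(
    latest_action: Optional[Dict[str, str]],
    create_ticket: Callable[[], str],
) -> str:
    """Reuse the existing ticket ID if one was already submitted, else create one."""
    if latest_action and latest_action.get("action") == SUBMITTED:
        return latest_action["ticket_id"]
    return create_ticket()
```

As a side note on the original bug, a suffix test for `triaged_hardware` does not match the string `triaged_hardware-ua` (`"triaged_hardware-ua".endswith("triaged_hardware")` is `False`), which is consistent with the mismatch the commit message describes.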
* Make Prometheus retention size configurable per service
The hardcoded 8TB retention filled the disk on the we cluster (16T disk).
Now each service can override retention_size in services-configuration.yaml.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
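A minimal sketch of a per-service override, assuming the service configuration is available as a mapping with an optional `retention_size` key. Keeping 8TB as the fallback is an assumption; `--storage.tsdb.retention.size` is the standard Prometheus flag for size-based retention.

```python
from typing import Any, Mapping

# Assumed fallback: the previously hardcoded value mentioned above.
DEFAULT_RETENTION_SIZE = "8TB"


def retention_flag(service_config: Mapping[str, Any]) -> str:
    """Build the Prometheus retention flag, preferring the per-service override."""
    size = service_config.get("retention_size") or DEFAULT_RETENTION_SIZE
    return f"--storage.tsdb.retention.size={size}"
```

For example, a service configured with `retention_size: 4TB` would yield `--storage.tsdb.retention.size=4TB`.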
* Fix KeyError when alert-parser processes validating node with no alerts
When a validating/available_nodata node has zero alert records in Kusto
(e.g. due to Prometheus data gap), find_node_alerts returns an empty
DataFrame without columns. Accessing period_alerts['alertname'] then
raises KeyError, causing the node to be stuck in validating indefinitely.
Add an empty check before accessing DataFrame columns.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
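The failure mode and the guard can be reproduced with a small pandas sketch. `alert_names` is a hypothetical helper, not the alert-parser function; only the `alertname` column and the empty-DataFrame condition come from the description above.

```python
import pandas as pd


def alert_names(period_alerts: pd.DataFrame) -> list:
    """Return the distinct alert names, tolerating a frame with no rows or columns."""
    # A DataFrame built with no data has no 'alertname' column, so
    # period_alerts['alertname'] would raise KeyError without this check.
    if period_alerts.empty or "alertname" not in period_alerts.columns:
        return []
    return period_alerts["alertname"].unique().tolist()
```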
* update go-ntlmssp version to 0.1.1 for reverse proxy
* Remove webportal-dind replacement logic from CI build workflow
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Filter out deleted webportal-dind from changed services detection
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Filter out dev-box from changed services detection in CI
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Upgrade metrics-cleaner base image from Python 3.7 to 3.12-slim
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
---------
Co-authored-by: Rui Gao <ruigao@microsoft.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
1 parent f219241 commit c25887f
350 files changed
Lines changed: 8670 additions & 22733 deletions
File tree
- .github/workflows
- build/model
- contrib
- aks/k8s-deploy
- chat-plugin
- src
- cluster-local-storage-plugin
- src
- App
- utils
- copilot-plugin
- src
- app
- kubespray/script
- submit-job-v2
- src
- App
- utils
- src
- alert-manager
- build
- deploy
- src
- alert-handler
- alert-parser
- tests
- cert-expiration-checker
- cluster-utilization
- job-status-change-notification
- node-issue-classifier
- tests
- node-recycler
- tests
- cilium/build
- cluster-configuration/deploy
- cluster-local-storage-worker/deploy
- cluster-local-storage
- bin
- build
- deploy
- src/kusto-sdk
- cluster/config
- copilot-chat
- build
- deploy
- dashboard-data-backup/build
- database-controller
- deploy
- sdk
- src
- dev-box/build
- device-plugin
- build
- deploy
- dshuttle-master/deploy
- dshuttle-worker/deploy
- fluentd/deploy
- frameworkcontroller
- build
- deploy
- src
- grafana
- build
- deploy
- hivedscheduler
- build
- deploy
- src
- internal-storage/deploy
- job-exporter
- build
- deploy
- k8s-dashboard/deploy
- log-manager/deploy
- ltp-storage-common/ltp_storage/data_schema
- marketplace-db/deploy
- marketplace-restserver/deploy
- marketplace-webportal/deploy
- model-proxy
- build
- deploy
- node-exporter/deploy
- openpai-js-sdk
- openpai-runtime
- build
- src/go
- postgresql-sdk/deploy
- postgresql
- build
- deploy
- prometheus-pushgateway
- build
- deploy
- src/metrics-cleaner
- prometheus
- config
- deploy
- pylon
- build
- deploy
- rest-server
- deploy
- src
- config
- models/v2/job
- utilization-reporter/build
- watchdog
- build
- deploy
- src
- webportal-dind
- build
- deploy
- webportal
- build
- config
- deploy
- server/config
- src
- app
- cluster-view
- hardware
- k8s
- services
- components
- dashboard
- home
- home
- index
- job-submission-demo
- components
- controls
- data
- job-information
- sidebar
- task-role
- tools
- topbar
- elements
- models
- utils
- job-submission
- components
- controls
- data
- sidebar
- task-role
- tools
- topbar
- yamledit-topbar
- models
- utils
- job
- breadcrumb
- job-docs
- job-submit-v1
- job-view/fabric
- JobList
- job-detail
- components
- job-event
- job-transfer
- task-attempt
- layout
- components
- plugin
- user
- fabric
- batchRegister
- components
- user-profile
- userView
- user-auth
- user-logout
- utils
- vc
- assets/img