Commit c25887f

hippogr, Rui Gao, and claude authored
Ruigao/update for 1.6 (#177)
* update golang toolchain to latest version
* fix the package updates suggested by Dependabot
* update webportal to Node.js 24 with the necessary package updates
* update webportal docker file to use node slim for output
* fix the Go version for some of the Dockerfiles
* reduce the size of cluster-local-storage docker image
* reduce the size of copilot-chat docker image
* reduce the size of dashboard-data-backup docker image
* reduce the docker image size of utilization-reporter
* reduce the size of abnormal-detector docker image
* reduce the docker image size of cert-expiration-checker
* reduce the docker image size of cluster-utilization
* reduce the docker image size of reverse proxy
* reduce the docker image size of model-proxy
* downgrade the kube-scheduler version to the same one as the service k8s version
* fix the display problem of the job's YAML and output log
* add cilium docker build to fix Azure security warnings
* fix the security warnings found by the AI fix tool
* fix the build problem
* fix the dockerfile errors
* update k8s-rdma-shared-dev-plugin version to adapt to the latest grpc package
* update cilium to version 1.18.8
* update the Go version of all binaries to 1.25
* security update with Go and npm package updates
* remove npm related packages for webportal service
* make imagePullSecrets conditional to eliminate FailedToRetrieveImagePullSecret warnings

  When secret-name is not configured (or empty), deployment templates no longer render imagePullSecrets, and the cluster-configuration scripts skip secret creation/deletion. The rest-server also handles empty secret-name gracefully.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* skip the validation job submission for CPU nodes
* fix the problem of no pip update for copilot chat
* fix NPM packages for docker images including rest server, alert manager and database controller
* update reverseproxy
* fix the grpc version for kube-scheduler
* fix S360 vulnerabilities for alert-handler (nodemailer) and job-status-change-notification (minimatch)

  - alert-handler: add nodemailer resolution to force >=7.0.11, fixing vulnerable 6.10.1 pulled by email-templates/preview-email
  - job-status-change-notification: switch to yarn workspaces focus --production to avoid installing devDependencies (eslint plugins with vulnerable minimatch), matching database-controller pattern

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* update package version for reverseproxy
* upgrade Cilium v1.18.8 to v1.18.9 to fix S360 grpc vulnerability (google.golang.org/grpc v1.74.2 -> v1.79.3)

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix job-detail page error handling for permission denied errors

  When a user without permission opens another user's job page, fetchJobInfo returns 403 but the error was silently ignored, causing the page to show "Loading..." forever with a vague empty alert. Now fetchJobInfo checks HTTP status, shows a clear permission error, and skips subsequent requests.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Add IPoIB subnet route in init.sh to fix IB TCP connectivity on NM-managed nodes

  On VMSS nodes where NetworkManager manages IB interfaces, ifconfig sets the IP with noprefixroute flag, preventing automatic subnet route creation. This causes IPoIB TCP (rsync/bcast) to fail between nodes while RDMA works. Add explicit route check and creation after ifconfig to ensure connectivity.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Skip classification for cordoned nodes with empty NodeId to prevent OFR pipeline from stalling

  Nodes with empty NodeId would transition to triaged_hardware but OFR cannot create IcM tickets without a valid NodeId, causing the pipeline to stall. Now these nodes stay in cordoned status so the classifier retries on the next cycle.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Fix nodemailer S360 vulnerability by upgrading yarn resolution from 7.x to 8.0.5

  The resolutions field pinned nodemailer to ^7.0.11 which overrode the dependencies entry of ^8.0.5, causing yarn to install 7.0.13 in the image.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Fix S360 vulnerabilities across 13 container images

  npm upgrades:
  - alert-handler: axios 1.13.5->1.15.2, follow-redirects 1.15.11->1.16.0
  - database-controller: lodash 4.17.23->4.18.1 (added yarn resolution)
  - rest-server, job-status-change-notification, webportal: follow-redirects 1.15.11->1.16.0

  Dockerfile updates (add tdnf update for Azure Linux openssl 3.3.5-4->3.3.5-5):
  - alert-parser, node-recycler, node-issue-classifier, job-data-recorder

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Downgrade hardware issues without Azure FaultCode to triaged_unknown

  Hardware issues like FrontendNetworkIssue and DiskError have no matching Azure OFR fault code. Submitting OFR for these results in unresolvable tickets and, combined with the lack of dedup in node-recycler, causes repeated OFR submissions (as seen with openpai-00000s). By downgrading to triaged_unknown the node stays visible for manual investigation while avoiding the broken OFR pipeline.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Prevent node-recycler from submitting duplicate OFR tickets for the same node

  Check the latest action before creating a new IcM OFR ticket. If triaged_hardware-ua already exists, skip ticket creation and reuse the existing ticket ID for polling. This fixes the bug where every pipeline loop could spawn a new OFR request for the same node because get_latest_action_by_state (endswith query) never matches the triaged_hardware-ua action.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Make Prometheus retention size configurable per service

  Hardcoded 8TB retention caused disk full on the we cluster (16T disk). Now each service can override retention_size in services-configuration.yaml.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Fix KeyError when alert-parser processes validating node with no alerts

  When a validating/available_nodata node has zero alert records in Kusto (e.g. due to Prometheus data gap), find_node_alerts returns an empty DataFrame without columns. Accessing period_alerts['alertname'] then raises KeyError, causing the node to be stuck in validating indefinitely. Add an empty check before accessing DataFrame columns.

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* update go-ntlmssp version to 0.1.1 for reverse proxy
* Remove webportal-dind replacement logic from CI build workflow

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Filter out deleted webportal-dind from changed services detection

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Filter out dev-box from changed services detection in CI

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Upgrade metrics-cleaner base image from Python 3.7 to 3.12-slim

  Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Co-authored-by: Rui Gao <ruigao@microsoft.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
1 parent f219241 commit c25887f
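The duplicate-ticket guard described in the commit message (check the latest action, reuse the ticket ID if `triaged_hardware-ua` already exists) can be sketched roughly as follows. This is an illustration under stated assumptions, not the node-recycler code: the `Action` record, the in-memory `history` list, and the `create_ticket` callback are hypothetical stand-ins. Only the action name `triaged_hardware-ua` is taken from the message.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical record of one pipeline action taken for a node."""
    node: str
    name: str
    ticket_id: Optional[str] = None


def ensure_ofr_ticket(node: str, history: List[Action],
                      create_ticket: Callable[[str], str]) -> str:
    """Return the OFR ticket ID for `node`, creating one only if none exists."""
    # Assumption: `history` is ordered oldest first, so scan from the end
    # to find the most recent action recorded for this node.
    latest = next((a for a in reversed(history) if a.node == node), None)
    if latest is not None and latest.name == "triaged_hardware-ua" and latest.ticket_id:
        # A ticket was already filed: reuse its ID for polling.
        return latest.ticket_id
    ticket_id = create_ticket(node)
    history.append(Action(node, "triaged_hardware-ua", ticket_id))
    return ticket_id
```

With a guard of this shape, repeated pipeline loops for the same node return the stored ticket ID and do not call `create_ticket` again.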

350 files changed

Lines changed: 8670 additions & 22733 deletions


.github/workflows/build-deploy-changes.yaml

Lines changed: 1 addition & 12 deletions
@@ -63,7 +63,7 @@ jobs:
 # extract service folders under src/
 folders=$(echo "$changed_files" | grep '^src/' \
   | awk -F'/' '{print $2}' \
-  | sort -u | tr '\n' ' ')
+  | sort -u | grep -v -E '^(webportal-dind|dev-box)$' | tr '\n' ' ')
 echo "Changed folders: $folders"
 
 # export as output for next steps
@@ -182,18 +182,7 @@ jobs:
   --overwrite-existing
 kubelogin convert-kubeconfig -l azurecli
 kubectl config use-context ${{ secrets.KUBERNETES_CLUSTER }}
-# Replace "webportal" with "webportal-dind" if "webportal" is changed
 services_to_deploy="${{ steps.changes.outputs.folders }}"
-if echo " $services_to_deploy " | grep -q " webportal "; then
-  tmp=""
-  for s in $services_to_deploy; do
-    [ "$s" = "webportal" ] && continue
-    [ "$s" = "webportal-dind" ] && continue
-    tmp="$tmp $s"
-  done
-  services_to_deploy="$tmp webportal-dind"
-  services_to_deploy=$(echo "$services_to_deploy" | xargs)
-fi
 echo "Final services to deploy: $services_to_deploy"
 if echo " $services_to_deploy " | grep -q " cluster-local-storage-worker "; then
   sed -i '42s/value: "8"/value: "0"/' $GITHUB_WORKSPACE/src/cluster-local-storage-worker/deploy/cluster-local-storage-worker.yaml.template
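To make the effect of the new `grep -v -E '^(webportal-dind|dev-box)$'` stage concrete, here is a small Python model of the folder-extraction pipeline. It is only a sketch: the function name `changed_service_folders` and the sample paths are made up for illustration, and Python's `sorted` may order names slightly differently from `sort -u` under some locales.

```python
from typing import Iterable, List

# Folders that the updated workflow drops from the changed-services list.
EXCLUDED = {"webportal-dind", "dev-box"}


def changed_service_folders(changed_files: Iterable[str]) -> List[str]:
    """Model of: grep '^src/' | awk -F'/' '{print $2}' | sort -u | grep -v ..."""
    folders = set()
    for path in changed_files:
        if not path.startswith("src/"):
            continue  # grep '^src/'
        folder = path.split("/")[1]  # awk -F'/' '{print $2}'
        # The anchored pattern excludes exact names only, not prefixes.
        if folder and folder not in EXCLUDED:
            folders.add(folder)
    return sorted(folders)  # sort -u
```

Because the pattern is anchored with `^` and `$`, only folders named exactly `webportal-dind` or `dev-box` are dropped, and `webportal` itself still passes through.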

build/model/config_model.py

Lines changed: 4 additions & 2 deletions
@@ -49,7 +49,8 @@ def build_config_parse(self):
             self.buildConfigDict["dockerRegistryInfo"]["dockerTag"] = \
                 buildConfigContent["cluster"]["docker-registry-info"]["docker-tag"]
             self.buildConfigDict["dockerRegistryInfo"]["secretName"] = \
-                buildConfigContent["cluster"]["docker-registry-info"]["secret-name"]
+                buildConfigContent["cluster"]["docker-registry-info"]["secret-name"] \
+                if "secret-name" in buildConfigContent["cluster"]["docker-registry-info"] else ""
 
         else:
             self.buildConfigDict["dockerRegistryInfo"] = buildConfigContent["cluster"]["docker-registry"]
@@ -66,6 +67,7 @@ def build_config_parse(self):
             self.buildConfigDict["dockerRegistryInfo"]["dockerTag"] = \
                 buildConfigContent["cluster"]["docker-registry"]["tag"]
             self.buildConfigDict["dockerRegistryInfo"]["secretName"] = \
-                buildConfigContent["cluster"]["docker-registry"]["secret-name"]
+                buildConfigContent["cluster"]["docker-registry"]["secret-name"] \
+                if "secret-name" in buildConfigContent["cluster"]["docker-registry"] else ""
 
         return self.buildConfigDict
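The conditional expression added in both branches falls back to an empty string when `secret-name` is missing. A minimal sketch of the same lookup using `dict.get`, assuming the registry section is a plain dict. The helper names `read_secret_name` and `should_render_pull_secrets` are hypothetical and do not appear in the commit.

```python
from typing import Any, Dict


def read_secret_name(registry: Dict[str, Any]) -> str:
    """Return the configured pull-secret name, or "" when the key is absent."""
    # Same result as:
    #   registry["secret-name"] if "secret-name" in registry else ""
    return registry.get("secret-name", "")


def should_render_pull_secrets(secret_name: str) -> bool:
    """Hypothetical check: render imagePullSecrets only for a non-empty name."""
    return bool(secret_name)
```

For example, `read_secret_name({"tag": "v1"})` yields `""`, which is the case the commit message describes as "not configured (or empty)".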

contrib/aks/k8s-deploy/cilium.yaml

Lines changed: 17 additions & 17 deletions
Large diffs are not rendered by default.
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+declare module '*.css' {
+  const content: { [className: string]: string };
+  export default content;
+}
+
+declare module '*.scss' {
+  const content: { [className: string]: string };
+  export default content;
+}
+
+declare module '*.sass' {
+  const content: { [className: string]: string };
+  export default content;
+}
+
+declare module '*.svg' {
+  import * as React from 'react';
+  export const ReactComponent: React.FunctionComponent<React.SVGProps<SVGSVGElement> & { title?: string }>;
+  const src: string;
+  export default src;
+}
+
+declare module '*.png' {
+  const src: string;
+  export default src;
+}
+
+declare module '*.jpg' {
+  const src: string;
+  export default src;
+}
+
+declare module '*.jpeg' {
+  const src: string;
+  export default src;
+}
+
+declare module '*.gif' {
+  const src: string;
+  export default src;
+}
+
+declare module '*.bmp' {
+  const src: string;
+  export default src;
+}
+
+declare module '*.webp' {
+  const src: string;
+  export default src;
+}

contrib/chat-plugin/tsconfig.json

Lines changed: 16 additions & 16 deletions
@@ -1,26 +1,26 @@
 {
   "compilerOptions": {
-    "lib": ["dom", "dom.iterable", "esnext"],
+    "target": "es5",
+    "lib": [
+      "dom",
+      "dom.iterable",
+      "esnext"
+    ],
     "allowJs": true,
     "skipLibCheck": true,
-    "strict": true,
-    "noEmit": true,
     "esModuleInterop": true,
+    "allowSyntheticDefaultImports": true,
+    "strict": true,
+    "forceConsistentCasingInFileNames": true,
+    "noFallthroughCasesInSwitch": true,
     "module": "esnext",
-    "moduleResolution": "bundler",
+    "moduleResolution": "node",
     "resolveJsonModule": true,
     "isolatedModules": true,
-    "jsx": "preserve",
-    "incremental": true,
-    "plugins": [
-      {
-        "name": "next"
-      }
-    ],
-    "paths": {
-      "@/*": ["./src/*"]
-    }
+    "noEmit": true,
+    "jsx": "react-jsx"
   },
-  "include": ["next-env.d.ts", "**/*.ts", "**/*.tsx", ".next/types/**/*.ts", "src/index.tsx"],
-  "exclude": ["node_modules"]
+  "include": [
+    "src"
+  ]
 }

contrib/cluster-local-storage-plugin/package.json

Lines changed: 14 additions & 8 deletions
@@ -20,24 +20,30 @@
   "homepage": "https://github.com/Microsoft/pai#readme",
   "license": "MIT",
   "dependencies": {
-    "@microsoft/openpai-js-sdk": "^0.1.5",
+    "@microsoft/openpai-js-sdk": "file:../../src/openpai-js-sdk/microsoft-openpai-js-sdk-0.2.0.tgz",
+    "ajv": "^8.18.0",
+    "buffer": "^6.0.3",
     "core-js": "^3.0.1",
     "office-ui-fabric-react": "^6.162.1",
+    "process": "^0.11.10",
     "react": "^16.8.4",
     "react-dom": "^16.8.4",
+    "stream-browserify": "^3.0.0",
+    "util": "^0.12.5",
     "whatwg-fetch": "^3.0.0"
   },
   "devDependencies": {
     "@types/react": "^16.8.7",
     "@types/react-dom": "^16.8.2",
-    "node-sass": "^4.14.1",
-    "ts-loader": "^6.2.1",
-    "ts-node": "^8.0.3",
+    "sass": "^1.69.0",
+    "sass-loader": "^13.0.0",
+    "ts-loader": "^9.0.0",
+    "ts-node": "^10.0.0",
     "tslint": "^5.20.1",
     "tslint-react": "^3.6.0",
-    "typescript": "^3.3.3333",
-    "webpack": "^4.29.6",
-    "webpack-cli": "^3.2.3",
-    "webpack-dev-server": "^3.2.1"
+    "typescript": "^4.9.0",
+    "webpack": "^5.88.0",
+    "webpack-cli": "^5.1.0",
+    "webpack-dev-server": "^4.15.0"
   }
 }

contrib/cluster-local-storage-plugin/src/App/form.tsx

Lines changed: 5 additions & 3 deletions
@@ -5,10 +5,12 @@ import React from "react";
 import {
   ChoiceGroup, DefaultPalette, Dropdown, DropdownMenuItemType, IDropdownOption,
   Fabric, IChoiceGroupOption, PrimaryButton, Stack, Spinner, SpinnerSize, Text,
-  TextField, Toggle, initializeIcons, mergeStyleSets, getTheme,
+  TextField, Toggle, mergeStyleSets, getTheme,
 } from "office-ui-fabric-react";
 import { PAIV2 } from "@microsoft/openpai-js-sdk";
 
+import { initializeIconsOnce } from "../utils/icon-initializer";
+
 const theme = getTheme();
 const styles = mergeStyleSets({
   form: {
@@ -80,7 +82,7 @@ const styles = mergeStyleSets({
   },
 });
 
-initializeIcons();
+initializeIconsOnce();
 
 interface IFormProps {
   api: string;
@@ -269,7 +271,7 @@ export default class Form extends React.Component<IFormProps, IFormState> {
     try {
       clusterlist = Object.keys(await this.client.virtualCluster.listVirtualClusters());
     } catch (err) {
-      alert(`Failed to get virtual clusters: ${err.message}`);
+      alert(`Failed to get virtual clusters: ${err instanceof Error ? err.message : String(err)}`);
     }
 
     const endpoint = new URL("cluster-local-storage", window.location.href).href;

contrib/cluster-local-storage-plugin/src/App/index.tsx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 // Licensed under the MIT License.
 
 import React from "react";
-import Form from "./Form";
+import Form from "./form";
 
 interface IProps {
   api: string;
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+import { initializeIcons as fluentInitializeIcons } from "office-ui-fabric-react";
+
+// Global flag to ensure icons are only initialized once
+const ICONS_INITIALIZED_KEY = "__FLUENT_ICONS_INITIALIZED__";
+
+/**
+ * Initialize Fluent UI icons once globally
+ * This wrapper ensures initializeIcons is only called once across the entire application
+ * to prevent duplicate icon registration warnings
+ */
+export function initializeIconsOnce() {
+  // Check if icons have already been initialized
+  if (typeof window !== "undefined" && (window as any)[ICONS_INITIALIZED_KEY]) {
+    return;
+  }
+
+  // Mark as initialized before calling to prevent race conditions
+  if (typeof window !== "undefined") {
+    (window as any)[ICONS_INITIALIZED_KEY] = true;
+  }
+
+  // Initialize all Fluent UI MDL2 icons
+  fluentInitializeIcons();
+}
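The wrapper in this file follows a general run-once pattern: set the flag first, then do the work, so that a repeated or nested call returns early. A language-neutral sketch of that pattern in Python, for illustration only. The names `_initialized`, `initialize_once`, and `register_icons` are hypothetical and are not part of the plugin.

```python
from typing import Callable

# Module-level flag, playing the role of the key stored on `window`.
_initialized = False


def initialize_once(register_icons: Callable[[], None]) -> bool:
    """Run `register_icons` on the first call only; return True if it ran."""
    global _initialized
    if _initialized:
        return False
    # Set the flag before doing the work so a re-entrant call exits early.
    _initialized = True
    register_icons()
    return True
```

Storing the flag in a single shared location (here the module, in the plugin the `window` object) is what lets separately loaded callers agree that initialization has already happened.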

contrib/cluster-local-storage-plugin/webpack.config.js

Lines changed: 25 additions & 10 deletions
@@ -29,26 +29,41 @@ const configuration = {
   },
   resolve: {
     extensions: [".tsx", ".ts", ".js", ".json"],
+    alias: {
+      'process/browser': require.resolve('process/browser.js'),
+    },
+    fallback: {
+      fs: false,
+      net: false,
+      tls: false,
+      process: require.resolve('process/browser'),
+      buffer: require.resolve('buffer/'),
+      util: require.resolve("util/"),
+      stream: require.resolve("stream-browserify"),
+      http: false,
+      https: false,
+      zlib: false,
+      path: false,
+      crypto: false,
+      url: false,
+      querystring: false,
+      assert: false,
+    }
   },
   plugins: [
     new webpack.IgnorePlugin({
       resourceRegExp: /^esprima$/,
       contextRegExp: /js-yaml/,
     }),
+    new webpack.ProvidePlugin({
+      process: 'process/browser',
+      Buffer: ['buffer', 'Buffer'],
+    }),
   ],
   devServer: {
     host: "0.0.0.0",
     port: 9290,
-    contentBase: false,
-    watchOptions: {
-      ignored: /node_modules/,
-    },
-    disableHostCheck: true,
-  },
-  node: {
-    fs: 'empty',
-    net: 'empty',
-    tls: 'empty',
+    static: false,
   }
 };
 
