What's Changed
- ansible to create k8s cluster on equinix by @rootfs in #1
- auto shutdown k8s clusters by @rootfs in #2
- add k8s workflow by @rootfs in #3
- fix env by @rootfs in #4
- fix ssh key path by @rootfs in #5
- fix metal download by @rootfs in #6
- sudo to chmod by @rootfs in #7
- prepare metal config by @rootfs in #8
- use pre-created metal config by @rootfs in #9
- allow user to set termination time by @rootfs in #10
- Adding Prom by @jtaleric in #12
- Adding First Perf test by @jtaleric in #13
- Wait for monitoring pods to be running by @jtaleric in #20
- Need to wait on statefulset by @jtaleric in #21
- integrate with kepler action as BM self host runner provider by @SamYuan1990 in #23
- set the kepler service unit file so as to expose idle power by @rootfs in #25
- k8s action: monitor kepler cpu utilization by @rootfs in #28
- kepler stress test results by @rootfs in #29
- fix path by @rootfs in #30
- refactor by @rootfs in #31
- set git push by @rootfs in #32
- use checkout action by @rootfs in #33
- fix relative path by @rootfs in #34
- reformat output by @rootfs in #35
- add markdown files to create github pages by @rootfs in #36
- add nightly run schedule by @rootfs in #38
- don't overwrite validator queries by @rootfs in #39
- publish validation results by @rootfs in #40
- fix path by @rootfs in #41
- don't set global git config on self hosted runner by @rootfs in #42
- fix bug for kepler-action integration by @SamYuan1990 in #27
- fix schedule by @rootfs in #43
- support schedule trigger by @rootfs in #47
- switch to kepler action main by @SamYuan1990 in #44
- turn on logging by @rootfs in #48
- update commit message by @rootfs in #49
- add vm cpu pinning by @rootfs in #50
- try to integrate with kepler action by @SamYuan1990 in #46
- cancel schedule for action by @SamYuan1990 in #51
- fix(validator): adapt to changes to validator by @sthaha in #52
- update with latest validation config by @rootfs in #53
- chore: Fix typo in validation by @dave-tucker in #54
- make the vm use the entire bare metal instance by @rootfs in #55
- use default stressor by @rootfs in #56
- disable component power and bpf metrics in stress tests by @rootfs in #57
- Integrate Model Trainer by @KaiyiLiu1234 in #58
- feat(trainer): Update Trainer by @KaiyiLiu1234 in #64
- as kepler action supports Tekton, update code accordingly by @SamYuan1990 in #61
- fix(model_server_test): Add placeholder for model server test to main by @KaiyiLiu1234 in #68
- fix: 0 metrics found error due to large scrape-config by @sthaha in #71
- rename job=node to job=metal by @sthaha in #73
- chore: track cpu usage of both kepler and kube-apiserver in stress tests by @rootfs in #74
- fix markdown table rendering issue by @rootfs in #77
- slim down markdown table by @rootfs in #78
- fix broken table by @rootfs in #79
- try to fix table rendering issue again by @rootfs in #80
- try to fix table rendering issue again by removing offending characters by @rootfs in #81
- add stress test result regression checker bot by @rootfs in #82
- don't update report in kepler-action by @rootfs in #84
- tune the stress detection prompt by @rootfs in #85
- add validation regression checker by @rootfs in #86
- add validation regression checker by @rootfs in #87
- tune prompt and switch to gpt-4-turbo by @rootfs in #88
- bot: ignore idle mode validation regression by @rootfs in #91
- chore(metal): show verbose output by @sthaha in #106
- fix(metal): adapt to change to report path by @sthaha in #107
- feat(model_server): Incorporate Model Server by @KaiyiLiu1234 in #110
- feat(model_server): Add Separate Workflow for Model Server Implementation by @KaiyiLiu1234 in #112
- feat(model_server): Add Model Server to Metal CI by @KaiyiLiu1234 in #111
- chore(model_server): Remove Schedule and Git Push for Testing by @KaiyiLiu1234 in #118
- chore: add a k8s provisioning flow by @rootfs in #121
- chore: create kepler namespace first before setting prometheus role by @rootfs in #122
- fix(model_server): change workflow time to 11pm UTC by @KaiyiLiu1234 in #124
- chore(model_server): Change workflow name by @KaiyiLiu1234 in #126
- chore: add kepler validation diagrams by @rootfs in #142
- chore: don't re-setup python by @rootfs in #143
- chore: when rendering the comparison chart, reset the timestamp so that charts are not stretched by @rootfs in #146
- chore: decouple kepler and open metrics by @rootfs in #147
- chore: fine tune the prompt in regression detection by @rootfs in #148
- feat(model_trainer): Add Trainer Ansible Playbook and Workflow by @KaiyiLiu1234 in #131
- fix(model_server): Properly Activated new models by @KaiyiLiu1234 in #151
- fix(model_server): add model_server ansible vars by @KaiyiLiu1234 in #152
- chore(model_server): Update Model Server Version to 0.7.11-2 by @KaiyiLiu1234 in #156
- feat(model_e2e): Added Model e2e by @KaiyiLiu1234 in #163
- chore: add metal sweep action by @rootfs in #171
- WIP added node-density-churn by @shashank-boyapally in #170
- Node density churn new action name by @shashank-boyapally in #172
- chore: migrate to official equinix metal runner action by @vprashar2929 in #165
- Node density churn by @shashank-boyapally in #177
- chore: Reduce the vm resource allocation by @vprashar2929 in #178
- chore(rapl): Add a step to capture RAPL power domain availability by @vimalk78 in #179
- feat(model_e2e): Train and Validate Models on Equinix Server by @KaiyiLiu1234 in #182
- Update e2eSeparateServers with Changes from Main by @KaiyiLiu1234 in #185
- feat(model_e2e): Add Separate Server Workflow e2e by @KaiyiLiu1234 in #186
- chore(equinix_runner): Replaced Create-runner and Cleanup with reusable workflows by @KaiyiLiu1234 in #190
- chore(metal/flow): keep reports and artifacts in their own dir by @sthaha in #198
- chore(ci): Update workflows for consistent artifact handling by @vprashar2929 in #199
- chore(push_validations): Push Validation Results to Github Repo by @KaiyiLiu1234 in #206
- chore(rename_workflows): Renamed workflows to be more descriptive by @KaiyiLiu1234 in #209
- chore(push_validations): Move all Validation Data to Repo by @KaiyiLiu1234 in #210
- feat(pyproject): for analytics by @sthaha in #202
- chore: Add playbook to install node-exporter by @vimalk78 in #217
- chore: enable selected collectors in node-exporter by @vimalk78 in #224
- changed commands by @shashank-boyapally in #203
- chore(ci): update metal delete action to use sustainable-computing-io fork by @vprashar2929 in #222
- fix: resolve node-exporter permission error for energy_uj access by @vprashar2929 in #229
- Merge main with overallErrorMetrics by @KaiyiLiu1234 in #234
- Add node config to model server validations workflow by @KaiyiLiu1234 in #235
- chore: selectively enable sweep by @rootfs in #236
- chore: use isolated cpu for the vm by @rootfs in #240
- Add Node Exporter to e2e by @KaiyiLiu1234 in #241
- indexing metrics by @shashank-boyapally in #237
- fix(node_config): Add hatch run to trainer by @KaiyiLiu1234 in #246
- fix(daily_validations): Enable selection of subset of error metrics by @KaiyiLiu1234 in #249
- chore: fix sweep workflow error by @rootfs in #250
- Update equinix_metal_sweep.yml by @rootfs in #252
- update actions by @shashank-boyapally in #255
- chore(ansible): refactor variable structure by moving host-specific and shared vars by @vprashar2929 in #256
- chore(ansible): refactor kvm_vm role by @vprashar2929 in #258
- feat: Add support for capturing Prometheus snapshot by @vprashar2929 in #266
- docs: add steps for using Prometheus Snapshot by @vprashar2929 in #277
- chore: add poly model in analytics by @rootfs in #280
- fix(setup_runner): Added setup runner action in Validation Scripts by @KaiyiLiu1234 in #276
- update analytics vm-cpu-time-predict-vm-power by @sunya-ch in #281
- feat(ansible): Introduce role for Kepler installation by @vprashar2929 in #270
- [CI] add date validation before create new github issue to avoid dupl… by @SamYuan1990 in #265
- fix(model_server): Resolve git error by @KaiyiLiu1234 in #282
- [CI] Hot fix for actions/upload-artifact and actions/download-artifact by @SamYuan1990 in #285
- [CI] add Dependency Bot settings by @SamYuan1990 in #284
- chore(ci): use latest equinix metal-runner-action by @vprashar2929 in #283
- add error handle for action steps by @SamYuan1990 in #288
- feat: add timestamp to validator directory by @vprashar2929 in #289
- feat(train_logs): Upload Training Logs by @KaiyiLiu1234 in #294
- increasing timeout so that churn can continue by @shashank-boyapally in #291
- chore: not expose estimated idle power by @rootfs in #295
- chore(ci): ensure workflow fails on errors by @vprashar2929 in #296
- feat: Add support for deploying Kepler using Compose by @vprashar2929 in #290
- feat(vm_metrics): Add vm metrics training process by @KaiyiLiu1234 in #297
- fix(switch_to_main): Switched model server branch to main repo by @KaiyiLiu1234 in #301
- fix(disable_daily_vals): Disable daily validations by @KaiyiLiu1234 in #302
- fix(vm_metrics): Add --vm-train tag by @KaiyiLiu1234 in #303
- chore: use vm-train option to train models by @rootfs in #305
- chore: use the latest model server image tag by @rootfs in #306
- chore: ignore model server start error by @rootfs in #307
- Update equinix_k8s_flow_churncheck.yml by @rootfs in #308
- chore(model-server): Use hatch to deploy Model Server by @KaiyiLiu1234 in #309
- chore: make the model server image tag a param by @rootfs in #310
- chore: add model train and validate report by @rootfs in #311
- chore(deps): bump the github-actions group across 1 directory with 2 updates by @dependabot in #287
- chore(validate_model_defaults): Added Polynomial and Xgboost to default list of target models by @KaiyiLiu1234 in #312
- chore: add aws metal e2e by @rootfs in #314
- fix(exclude-swapper-process): Exclude swapper process on metal kepler by @KaiyiLiu1234 in #316
- chore: add GHA to create aws ec2 ami with centos stream 9 and nvidia driver by @rootfs in #317
- Creating AMI with NVIDIA driver by @rootfs in #319
- chore: install nvidia dcgm in ami by @rootfs in #321
- chore: use the right dcgm output by @rootfs in #322
- feat: add GPU operation by @rootfs in #320
- [Feat][Chore] Pr level protect by @SamYuan1990 in #325
- don't re-install crio if it is already there by @rootfs in #327
- chore: pre install crio and pytorch images to save test image by @rootfs in #328
- fix(enable_ssh_file_access): Remove enforce by @KaiyiLiu1234 in #324
- [Nit] code refactor by @SamYuan1990 in #313
- feat(aws): Incorporate AWS fully into Metal CI by @KaiyiLiu1234 in #315
- feat(prometheus_snapshot): Upload prometheus snapshot for trainer workflows by @KaiyiLiu1234 in #330
- feat(upload-train-logs): Upload train logs to repo by @KaiyiLiu1234 in #333
- fix(add-train-logs): Add .log extension to train logs by @KaiyiLiu1234 in #334
- chore: set validator runtime by @rootfs in #332
- fix(upload-train-logs): Add correct destination filepath for log upload by @KaiyiLiu1234 in #336
- fix(duplicate-env): Remove duplicate env keyword by @KaiyiLiu1234 in #337
- chore: add support for kubeburner in PR check by @SamYuan1990 in #329
- chore(deps): Bump actions/checkout from 3 to 4 in the github-actions group by @dependabot in #338
- [CI]: using artifacts to keep result, a part of #339 by @SamYuan1990 in #341
- [fix]: fix checkout code after config kube error by @SamYuan1990 in #342
- [fix]: add checkout action in steps and add more log when kepler not … by @SamYuan1990 in #343
- [fix]: attempt to use config-dir to fix config folder check issue by @SamYuan1990 in #344
- feat(ansible): add Kepler config file support by @vprashar2929 in #345
- [fix]: try with kubeconfig with try best by @SamYuan1990 in #346
- [fix]: ref latest pip model document try to set up httpx client for o… by @SamYuan1990 in #347
- [fix]: add missing env for estimator service by @SamYuan1990 in #349
New Contributors
- @jtaleric made their first contribution in #12
- @SamYuan1990 made their first contribution in #23
- @sthaha made their first contribution in #52
- @dave-tucker made their first contribution in #54
- @shashank-boyapally made their first contribution in #170
- @vprashar2929 made their first contribution in #165
- @sunya-ch made their first contribution in #281
- @dependabot made their first contribution in #287
Full Changelog: v0.1...v0.2