- Support the dynamic MIG feature; for details, please refer to the dynamic MIG document
- Reinstalling HAMi will NOT crash GPU tasks
- Put all configurations into a ConfigMap; you can customize the HAMi installation by modifying its content (see details)
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core getting stuck on images with a high glibc version, like 'tf-serving:latest'
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by (@dependabot) in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by (@dependabot) in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by (@dependabot) in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by (@dependabot) in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by (@dependabot) in #792
- Fix Kubernetes version string handling by stripping metadata by (@Nimbus318) in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by (@archlitchi) in #624
- feat: support device plugin daemonset update strategy by (@devenami) in #628
- add ut about schedule policy by (@yt-huang) in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by (@haitwang-cloud) in #626
- add ut for the scheduler by (@shijinye) in #645
- docs(issue-tmpl): add FAQ link to issue templates by (@Nimbus318) in #647
- fix: filter device registry to node by (@lengrongfu) in #639
- Add self-hosted runner by (@archlitchi) in #659
- fix-example-yaml by (@WQL782795) in #667
- update docs by (@yangshiqi) in #668
- add ut for ascend by (@shijinye) in #664
- optimization map init in test by (@lengrongfu) in #678
- Optimize monitor by (@for800000) in #683
- fix code lint failure by (@lengrongfu) in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by (@Nimbus318) in #687
- fix vGPUmonitor deviceidx is always 0 by (@lengrongfu) in #684
- add ut for pkg/scheduler/event.go by (@Penguin-zlh) in #688
- add ut for nodes by (@shijinye) in #695
- add license for pkg/scheduler/event_test.go by (@Penguin-zlh) in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by (@lijm87) in #575
- add ut for device/nvidia by (@shijinye) in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by (@yt-huang) in #670
- Enable Dynamic-mig feature for HAMi by (@archlitchi) in #708
- Fix chart can not be deployed properly by (@archlitchi) in #711
- Fix NodeLock issue by (@archlitchi) in #714
- fix example yaml by (@lixd) in #709
- add ut for device/cambricon by (@shijinye) in #712
- Update dynamic mig documents and examples by (@archlitchi) in #718
- random time may be zero by (@shijinye) in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by (@jiangsanyin) in #543
- doc(README): add examples for GPU sharing and update-examples by (@xiaoyao) in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by (@yt-huang) in #673
- Add design document to 'dynamic-mig' feature by (@archlitchi) in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by (@elrondwong) in #724
- add ut for pkg/util/nodelock/nodelock.go by (@learner0810) in #719
- test: add ut for pkg/version/version.go by (@Penguin-zlh) in #677
- Update on mig mode by (@archlitchi) in #726
- Update documents for config & config_cn by (@archlitchi) in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by (@jingzhe6414) in #690
- fix device-plugin-version by (@learner0810) in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by (@chaunceyjiang) in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by (@elrondwong) in #735
- support configuration resources limits and requests by (@flpanbin) in #739
- feat(test): add TestMarshalNodeDevices scenarios by (@elrondwong) in #747
- print flags for device-plugin and scheduler by (@flpanbin) in #756
- Fix typos, add more contributors and maintainers. by (@yangshiqi) in #765
- Add a mind map (Chinese and English) to help understand this project by (@oceanweave) in #764
- [Docs] update config pages by (@windsonsea) in #760
- add ut for device-map by (@KubeKyrie) in #762
- refactor(ci): use go.mod file for Go version in workflows by (@yxxhero) in #766
- support set log level for device plugin by (@flpanbin) in #771
- feat: Restart/Upgrade device-plugin will not affect services. by (@chaunceyjiang) in #767
- add ut nvml devices by (@KubeKyrie) in #773
- add ut for device-map by (@KubeKyrie) in #772
- Optimize the time format layout by (@learner0810) in #741
- fix: nvidia-device-plugin no version info by (@chaunceyjiang) in #779
- HAMi supports e2e by (@Rei1010) in #775
- Proposal: enable E2E test by (@Rei1010) in #633
- add ut for device/iluvatar by (@shijinye) in #795
- add ut for device/hygon by (@shijinye) in #787
- add ut for pkg/monitor/nvidia/v1 by (@shijinye) in #780
- refactor(logging): enhance log messages for device resource counting by (@haitwang-cloud) in #778
- Enrich pod health check by (@Rei1010) in #801
- docs: fix broken link by (@lixd) in #802
- Optimize the E2E execution logic by (@Rei1010) in #803
- optimize MetricsBindAddress to MetricsBindPort by (@phoenixwu0229) in #796
- fix: handle the node nil issue & E2E test failure by (@haitwang-cloud) in #804
- add ut for device/mthreads by (@shijinye) in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by (@lixd) in #814
- [docs] Update ascend910b-support.md by (@windsonsea) in #816
- Refine metrics logs by (@haitwang-cloud) in #817
- Update mig-related logics and refine logs by (@archlitchi) in #833
- Add 910B4 config to device-configmap for ascend by (@lijm87) in #828
- [docs] fix: glibc version requirement in README by (@chinaran) in #826
- Update HAMi-core for v2.5.0 by (@archlitchi) in #834
- Fix multi-process device memory count issue by (@archlitchi) in #835
- bump version to v2.5.0 by (@wawa0210) in #836
- Fix CI by (@archlitchi) in #838
- Fix CI release by (@archlitchi) in #840
- Fix release ci by (@archlitchi) in #841
- Fix Dockerfile to make CI pass by (@archlitchi) in #846
- Fix E2E failure with pod status check by (@Rei1010) in #847
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU by (@archlitchi) in #848
- yt-huang (@yt-huang)
- shijinye (@shijinye)
- WQL782795 (@WQL782795)
- yangshiqi (@yangshiqi)
- for800000 (@for800000)
- Penguin-zlh (@Penguin-zlh)
- lixd (@lixd)
- jiangsanyin (@jiangsanyin)
- xiaoyao (@xiaoyao)
- elrondwong (@elrondwong)
- learner0810 (@learner0810)
- jingzhe6414 (@jingzhe6414)
- flpanbin (@flpanbin)
- oceanweave (@oceanweave)
- windsonsea (@windsonsea)
- KubeKyrie (@KubeKyrie)
- yxxhero (@yxxhero)
- Rei1010 (@Rei1010)
- phoenixwu0229 (@phoenixwu0229)
- chinaran (@chinaran)
Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0