GPUd release notes (2025-02-07T10:34:19Z)
Welcome to this new release!
What's Changed
- feat(server): vacuum sqlite only once a week by @gyuho in #301
- feat(nvidia/sxid-xid-state): dedup in memory by minute level to minimize contending db inserts by @gyuho in #302
- chore(build): support arm64 build for linux by @photoszzt in #303
- feat(info): track gpud process self resource usage (file descriptors, RSS, start time, db size) by @gyuho in #296
- test(nvidia/xid): add more unit tests for extracting device uuid by @gyuho in #305
- fix(*): add missing close on process that runs commands as bash, rename abort to close by @gyuho in #294
- fix(join): use local context for join bash by @cardyok in #309
- feat(operation): add codecov coverage workflow by @leoshi01 in #311
- nit(gitignore): add .DS_Store, remove test binary by @gyuho in #316
- fix(diagnose): fix "gpud scan" (add missing temp db creation) by @gyuho in #314
- chore(ci): merge codecov workflow files by @leoshi01 in #313
- feat(sqlite): disable vacuum for now, bump up retention period, track latency, qps, expose via /info component by @gyuho in #307
- feat(pkg/dmesg): simpler watcher by @gyuho in #319
- fix(dmesg): add fallback "dmesg" command if lower-case "-w" fails by @gyuho in #320
- feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing by @gyuho in #310
- fix(nvidia/hw-slowdown): evaluate state based on per-minute frequency of clock events over the last 10 minutes by @gyuho in #304
- fix(tailscale): remove unused version checks by @gyuho in #299
- feat(nvidia): move event type to common, define xid/sxid event level in the list by @gyuho in #297
- feat(nvidia/ib): remove "gpud run --expected-port-states-nvidia-infiniband" flag, only keep the default detection for backward compatibility by @gyuho in #308
- feat(*): remove redundant utc call by @gyuho in #325
- chore(e2e): refactor E2E and mock nvml/nvidia-smi/lspci by @FillZpp in #326
- feat(os/component): indicate os component is healthy via /states by @gyuho in #329
- feat(hw_slowdown): update suggested action by @cardyok in #330
- fix(hw_slowdown): fix description by @cardyok in #331
- feat(internal/session): set ib ports/rates dynamically by @gyuho in #328
- feat(components/db): add common events store (similar to components.Event) by @gyuho in #332
- feat(os/events): use read-only for reboot events, use common db pkg by @gyuho in #322
- feat(nvidia/xid,sxid/dmesg): add dmesg log line matcher, xid/sxid extractor by @gyuho in #333
- feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component by @gyuho in #324
- feat(nvidia/infiniband): log dynamic ports/rates config updates by @gyuho in #335
- feat(xid): simplify xid component by @cardyok in #321
- fix(diagnose): scan to use "dmesg" with no time limit by @gyuho in #338
- feat(fd): improve reasons, human-readable by @gyuho in #344
- feat(nvidia): set shared nvidia poller once in server.go by @gyuho in #345
- test(client/v1, nvidia/query): increase unit test coverage by @gyuho in #347
- feat(nvidia/query): pass events store to shared poller by @gyuho in #346
- feat(sxid): simplify sxid component by @cardyok in #337
- feat(pci): use common db pkg for events by @gyuho in #340
- feat(fuse): use common events store pkg by @gyuho in #339
- fix(info): correctly compute GPUd sqlite metrics delta by @gyuho in #349
- test(config): increase unit test coverage by @gyuho in #354
- test(pkg/sqlite): increase test coverage by @gyuho in #352
- feat(nvidia/xid): use common events DB for NVML-based xid watcher, disable NVML Xid event watcher in favor of "dmesg" watcher, deprecate redundant "error-xid-sxid" component by @gyuho in #343
- feat(nvidia/hw-slowdown): use common db pkg for events by @gyuho in #342
- fix(components/state): clean up update queries, increase unit test coverage, read last API version using SQLite rowid column by @gyuho in #361
- feat(file-descriptor): clean up fd check, only keep proc-based fd checking by @cardyok in #364
- fix(nvidia/remapped-rows): do not check row remapping for 4090 and other unsupported GPUs by @gyuho in #351
- releaser(ci, go): use go 1.23.6, disable arm releaser for now by @gyuho in #366
- fix(build): fix arm64 build by @photoszzt in #367
- fix(infiniband): make default value -1, allow future override by @cardyok in #369
Full Changelog: v0.3.9...v0.4.0