gpud-v0.4.0

github-actions released this 07 Feb 10:10
ab3785a

GPUd release notes (2025-02-07T10:34:19Z)

Welcome to this new release!

What's Changed

  • feat(server): vacuum sqlite only once a week by @gyuho in #301
  • feat(nvidia/sxid-xid-state): dedup in memory by minute level to minimize contending db inserts by @gyuho in #302
  • chore(build): support arm64 build for linux by @photoszzt in #303
  • feat(info): track gpud process self resource usage (file descriptors, RSS, start time, db size) by @gyuho in #296
  • test(nvidia/xid): add more unit tests for extracting device uuid by @gyuho in #305
  • fix(*): add missing close on process that runs commands as bash, rename abort to close by @gyuho in #294
  • fix(join): use local context for join bash by @cardyok in #309
  • feat(operation): add codecov coverage workflow by @leoshi01 in #311
  • nit(gitignore): add .DS_Store, remove test binary by @gyuho in #316
  • fix(diagnose): fix "gpud scan" (add missing temp db creation) by @gyuho in #314
  • chore(ci): merge codecov workflow files by @leoshi01 in #313
  • feat(sqlite): disable vacuum for now, bump up retention period, track latency, qps, expose via /info component by @gyuho in #307
  • feat(pkg/dmesg): simpler watcher by @gyuho in #319
  • fix(dmesg): add fallback "dmesg" command if lower-case "-w" fails by @gyuho in #320
  • feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing by @gyuho in #310
  • fix(nvidia/hw-slowdown): evaluate state based on the per-minute frequency of clock events over the last 10 minutes by @gyuho in #304
  • fix(tailscale): remove unused version checks by @gyuho in #299
  • feat(nvidia): move event type to common, define xid/sxid event level in the list by @gyuho in #297
  • feat(nvidia/ib): remove "gpud run --expected-port-states-nvidia-infiniband" flag, only keep the default detection for backward compatibility by @gyuho in #308
  • feat(*): remove redundant utc call by @gyuho in #325
  • chore(e2e): refactor E2E and mock nvml/nvidia-smi/lspci by @FillZpp in #326
  • feat(os/component): indicate os component is healthy via /states by @gyuho in #329
  • feat(hw_slowdown): update suggested action by @cardyok in #330
  • fix(hw_slowdown): fix description by @cardyok in #331
  • feat(internal/session): set ib ports/rates dynamically by @gyuho in #328
  • feat(components/db): add common events store (similar to components.Event) by @gyuho in #332
  • feat(os/events): use read-only for reboot events, use common db pkg by @gyuho in #322
  • feat(nvidia/xid,sxid/dmesg): add dmesg log line matcher, xid/sxid extractor by @gyuho in #333
  • feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component by @gyuho in #324
  • feat(nvidia/infiniband): log dynamic ports/rates config updates by @gyuho in #335
  • feat(xid): simplify xid component by @cardyok in #321
  • fix(diagnose): scan to use "dmesg" with no time limit by @gyuho in #338
  • feat(fd): improve reasons, human-readable by @gyuho in #344
  • feat(nvidia): set shared nvidia poller once in server.go by @gyuho in #345
  • test(client/v1, nvidia/query): increase unit test coverage by @gyuho in #347
  • feat(nvidia/query): pass events store to shared poller by @gyuho in #346
  • feat(sxid): simplify sxid component by @cardyok in #337
  • feat(pci): use common db pkg for events by @gyuho in #340
  • feat(fuse): use common events store pkg by @gyuho in #339
  • fix(info): correctly compute GPUd sqlite metrics delta by @gyuho in #349
  • test(config): increase unit test coverage by @gyuho in #354
  • test(pkg/sqlite): increase test coverage by @gyuho in #352
  • feat(nvidia/xid): use common events DB for NVML-based xid watcher, disable NVML Xid event watcher in favor of "dmesg" watcher, deprecate redundant "error-xid-sxid" component by @gyuho in #343
  • feat(nvidia/hw-slowdown): use common db pkg for events by @gyuho in #342
  • fix(components/state): clean up update queries, increase unit test coverage, read last API version using SQLite rowid column by @gyuho in #361
  • feat(file-descriptor): clean up fd check, only keep proc-based fd checking by @cardyok in #364
  • fix(nvidia/remapped-rows): do not check row remapping for 4090 and other unsupported GPUs by @gyuho in #351
  • releaser(ci, go): use go 1.23.6, disable arm releaser for now by @gyuho in #366
  • fix(build): fix arm64 build by @photoszzt in #367
  • fix(infiniband): make default value -1, allow future override by @cardyok in #369

New Contributors

Full Changelog: v0.3.9...v0.4.0