Releases: MoonshotAI/checkpoint-engine
v0.4.0
What's Changed
- refactor: use a shared TCPStore in ParameterServer and create ProcessGroup using PrefixStore by @HubertZhang in #82
- feat: add StatelessProcessGroup to extend collective library by @kip-cxj in #66
- feat: release ipc buffers before calling update_weights_from_ipc's post_hook by @HubertZhang in #84
Full Changelog: v0.3.4...v0.4.0
v0.4.0-rc0
What's Changed
- refactor: use a shared TCPStore in ParameterServer and create ProcessGroup using PrefixStore by @HubertZhang in #82
- feat: add StatelessProcessGroup to extend collective library by @kip-cxj in #66
- feat: release ipc buffers before calling update_weights_from_ipc's post_hook by @HubertZhang in #84
Full Changelog: v0.3.4...v0.4.0-rc0
v0.3.4
What's Changed
- support mtp by @youzhedian in #81
New Contributors
- @youzhedian made their first contribution in #81
Full Changelog: v0.3.3...v0.3.4
v0.3.3
What's Changed
- fix: npu free host cache by @kip-cxj in #78
- bugfix: skip empty safetensors file in inplace_pin_memory by @HubertZhang in #79
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- fix p2p update error when disable_h2d_buffer is true by @ruizhang1230 in #76
- fix: set current CUDA device in _inplace_pin_memory function by @SongXiaoXi in #77
Full Changelog: v0.3.1...v0.3.2
v0.3.1
v0.3.1-rc0
What's Changed
- Update use of environment variable in ps.py by @HubertZhang in #73
- misc: split ps.py file into multiple files by @specture724 in #64
- feat: cache device uuid in VllmWorkerExtension by @kip-cxj in #74
Full Changelog: v0.3.0-rc1...v0.3.1-rc0
v0.3.0
What's Changed
- feat: docs added for auto_pg, and auto_pg default set to True by @specture724 in #65
- hotfix: add a switch to disable inplace pinning of tensors by @specture724 in #68
- hotfix: inplace pin memory caused cudaErrorHostMemoryAlreadyRegistered by @specture724 in #69
- fix: CUDA OOM encountered with store based barrier by @specture724 in #70
Full Changelog: v0.3.0-rc0...v0.3.0-rc1
v0.2.3
v0.3.0-rc0
What's Changed
- fix: use tcp store_based_barrier to control p2p update synchronization by @specture724 in #51
Full Changelog: v0.2.2...v0.3.0-rc0