Compressed Tensors v0.14.0

@dhuangnm released this 27 Feb 22:29 · 28 commits to main since this release · d96634b

What's Changed

  • [decompression] Added qparam decompression by @shanjiaz in #537
  • Use standard E8M0 scale format for MXFP4 by @mgoin in #538
  • Add W4AFP8 preset scheme by @Etelis in #542
  • [MXFP4] Fix scale offset by @dsikka in #541
  • [CI] Add mergify and stale PR rules by @dsikka in #543
  • [Offload] Fix delete_offload_parameter, add clear_quantization by @kylesayrs in #539
  • [Deprecation] Add deprecation warning for marlin24 format by @Etelis in #544
  • [Offload] Add offloading logic by @kylesayrs in #529
  • Pin torch by @dsikka in #546
  • [Testing] Pin transformers by @kylesayrs in #549
  • limit transformers to <5.0.0 by @dhuangnm in #550
  • Modernize python310 type hints in quantization forward by @LudovicoYIN in #548
  • [Offload] Remove Accelerate by @kylesayrs in #530
  • KV Cache Quantization support deepseek v3 by @zkl-ai in #533
  • [Bugfix] Remove assert when dispatched to device by @kylesayrs in #554
  • Modernize python310 type hints in quantization by @LudovicoYIN in #553
  • [Observers] Change default weight observer to "memoryless_minmax" by @kylesayrs in #540
  • Update pytest command to include report options by @dsikka in #557
  • Change runner from IBM to GCP for Python tests by @dsikka in #561
  • Modernize python310 type hints in utils by @LudovicoYIN in #560
  • Update quantization strategy validation for actorder by @dsikka in #556
  • [Offload] DistributedCPUCache by @kylesayrs in #534
  • [MXFP4][GPTQ] Extend rounding to support FP32 by @dsikka in #551
  • [Tests] Fix typo, prepare for meta offload tensors by @kylesayrs in #562
  • Modernize python310 type hints in compressors by @LudovicoYIN in #563
  • Modernize python310 type hints in transform/offload/registry by @LudovicoYIN in #565
  • [Offload] [Bugfix] Fix distributed cpu tensor reconstruction by @kylesayrs in #567
  • [Offload] [Bugfix] Reserve extra dispatch memory for fragmentation by @kylesayrs in #566
  • Remove Neural Magic copyright by @Etelis in #559
  • [Transforms] Support loading transforms in transformers by @kylesayrs in #528
  • [Offload] DistributedDeviceCache by @kylesayrs in #568
  • Revert "[Transforms] Support loading transforms in transformers" by @HDCharles in #578
  • [Offload] DiskCache, DistributedDiskCache by @kylesayrs in #535
  • [Offload] Make update_offload_parameter more async and direct (2) by @kylesayrs in #576
  • [Copyright] Add vLLM copyright enforcement by @kylesayrs in #575
  • [Bugfix] Handle updating tensors with gradients by @kylesayrs in #580
  • [bugfix] get_device_memory rank>0 fix by @HDCharles in #582
  • Remove upper limit for torch dependency to support 2.10 by @dsikka in #583
  • Set seed to fix flaky test by @dsikka in #584
  • [Offload] Convert accelerate for loading/saving by @kylesayrs in #572
  • [Bugfix] Allow parameter overwrite if shapes do not match by @kylesayrs in #586
  • [Bugfix] [Offloading] Even more reserved memory, scaling with model size by @kylesayrs in #587
  • Implement init_dist for distributed setup by @HDCharles in #589
  • FP8 Block Quantization: Non-Divisible Shape Support by @Etelis in #547
  • [Bugfix]: Reduce memory usage when load device does not match dispatch device by @kylesayrs in #592
  • [bugfix] load_offloaded_model qwen3vl8b by @HDCharles in #591
  • [Offload] clean up deprecation warnings, which can accumulate to 100k+ warnings by @brian-dellabetta in #593
  • [Offload] Deprecate update_parameter_data by @kylesayrs in #588
  • [Bugfix] Fix clear_quantization by @kylesayrs in #596
  • Allow broadcasting fp8 by @HDCharles in #603
  • fix ruff for release by @HDCharles in #604
  • [Offload] Fully invertible conversion functions by @kylesayrs in #601
  • [Offload] Better device/cpu memory estimates when loading with load_offloaded_model by @kylesayrs in #605

New Contributors

Full Changelog: 0.13.0...0.14.0