Releases: NVIDIA/cub
CUB 1.8.0
Summary
CUB 1.8.0 introduces changes to the cub::Shuffle* interfaces.
Breaking Changes
- The interfaces of cub::ShuffleIndex, cub::ShuffleUp, and cub::ShuffleDown have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.
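To make the "SHFL control constant" concrete, here is a minimal host-side sketch of how the `c` operand of an index shuffle is packed and how it confines the source lane to a logical warp, following the PTX description of the `shfl` instruction. The helper names `shfl_idx_control` and `shfl_idx_source_lane` are hypothetical and are not part of CUB; the packing shown is the one commonly used for power-of-two widths and may differ from what CUB computes internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not CUB's API): packs the SHFL "c" operand for an
// index shuffle confined to logical warps of `width` lanes.
// Bits 12:8 hold the segment mask, bits 4:0 hold the clamp value.
inline std::uint32_t shfl_idx_control(int width) {
    // The packing below assumes a power-of-two width between 1 and 32.
    assert(width > 0 && width <= 32 && (width & (width - 1)) == 0);
    return (static_cast<std::uint32_t>(32 - width) << 8) | 0x1fu;
}

// Host-side emulation of which physical lane an index shuffle reads from:
// the upper lane bits selected by the segment mask come from the calling
// lane, the remaining lower bits come from the requested source lane.
inline int shfl_idx_source_lane(int lane, int src_lane, std::uint32_t c) {
    const std::uint32_t segmask  = (c >> 8) & 0x1fu;
    const std::uint32_t min_lane = static_cast<std::uint32_t>(lane) & segmask;
    const std::uint32_t offset   = static_cast<std::uint32_t>(src_lane) & ~segmask & 0x1fu;
    return static_cast<int>(min_lane | offset);
}
```

With a width of 8, physical lane 13 asking for logical lane 2 reads from physical lane 10, because lane 13 belongs to the logical warp covering lanes 8 through 15. With a width of 32 the segment mask is zero and the same request reads from lane 2.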
Bug Fixes
- #112: Fix cub::WarpScan's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.
CUB 1.7.5
Summary
CUB 1.7.5 adds support for radix sorting __half keys and improved sorting performance for 1 byte keys. It was incorporated into Thrust 1.9.2.
Enhancements
- Radix sort support for __half keys.
- Radix sort tuning policy updates to improve 1 byte key performance.
Bug Fixes
CUB 1.7.4
CUB 1.7.3
CUB 1.7.2
CUB 1.7.1
Summary
CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
It is compatible with independent thread scheduling.
Breaking Changes
- Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.
Other Enhancements
- Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.
Bug Fixes
- #86: Incorrect results with reduce-by-key.
CUB 1.7.0
Summary
CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.2.
Breaking Changes
- Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.
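As a rough illustration of why a lane mask matters, the following host-side model treats a warp vote as a bitwise test over only the participating lanes. This is a sketch, not CUB or CUDA device code: `emulated_all` and `emulated_any` are hypothetical names chosen here, and they only mirror the argument order of CUDA 9's `__all_sync(mask, predicate)` and `__any_sync(mask, predicate)` intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a masked warp vote (illustration only).
// Bit i of `predicates` is lane i's predicate; bit i of `member_mask`
// says whether lane i participates in the collective.
inline bool emulated_all(std::uint32_t member_mask, std::uint32_t predicates) {
    // True only if every participating lane has its predicate set.
    return (predicates & member_mask) == member_mask;
}

inline bool emulated_any(std::uint32_t member_mask, std::uint32_t predicates) {
    // True if at least one participating lane has its predicate set.
    return (predicates & member_mask) != 0;
}
```

The same predicate bits give different answers depending on which lanes are named as participants, which is the information a mask-free interface such as the removed functions cannot express.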
Other Enhancements
- Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.
Bug Fixes
- #86: Incorrect results with reduce-by-key.
CUB 1.6.4
Summary
CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.
Enhancements
- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.
Bug Fixes
- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
- #65: cub::DeviceSegmentedRadixSort should allow inputs to have pointer-to-const type.
- Mollify Clang device-side warnings.
- Remove out-dated MSVC project files.
CUB 1.6.3
Summary
CUB 1.6.3 improves support for Windows, changes cub::BlockLoad/cub::BlockStore interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.
Breaking Changes
- cub::BlockLoad and cub::BlockStore are now templated by the local data type, instead of the Iterator type. This allows for output iterators having void as their value_type (e.g. discard iterators).
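The motivation can be sketched in plain host C++. If a store helper derived its item type from the iterator's value_type, an iterator whose value_type is void would force it to declare an array of void, which does not compile. Naming the data type explicitly avoids that. `DiscardIterator` and `store_items` below are hypothetical stand-ins written for this illustration; they are not CUB's classes or signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical discard-style output iterator whose value_type is void.
struct DiscardIterator {
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = std::ptrdiff_t;
    using pointer           = void;
    using reference         = void;

    // Proxy that accepts and ignores any assigned value.
    struct Sink {
        template <typename U>
        Sink& operator=(const U&) { return *this; }
    };
    Sink operator[](std::ptrdiff_t) const { return Sink{}; }
};

// Hypothetical store helper templated on the local data type T.
// T is given explicitly, so nothing is derived from the iterator's value_type.
template <typename T, int ITEMS, typename OutputIteratorT>
void store_items(OutputIteratorT out, const T (&items)[ITEMS]) {
    for (int i = 0; i < ITEMS; ++i) {
        out[i] = items[i];
    }
}
```

Calling `store_items<int, 3>(dst, src)` with an `int*` destination copies the items, and calling it with `DiscardIterator{}` compiles and simply drops them.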
Other Enhancements
- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte keys/s on GP100.
- Improved support for Windows (warnings, alignment, etc).
Bug Fixes
- #74: cub::WarpReduce executes reduction operator for out-of-bounds items.
- #72: cub::InequalityWrapper::operator should be non-const.
- #71: cub::KeyValuePair won't work if Key has non-trivial constructor.
- #69: cub::BlockStore::Store doesn't compile if OutputIteratorT::value_type isn't T.
- #68: cub::TilePrefixCallbackOp::WarpReduce doesn't permit PTX arch specialization.
CUB 1.6.2 (previously 1.5.5)
Summary
CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.
Enhancements
- Radix sort tuning policies updated for SM6x (Pascal) GPUs.
Bug Fixes
- Fix AArch64 compilation of cub::CachingDeviceAllocator.