This repository was archived by the owner on Mar 21, 2024. It is now read-only.
CUB 2.0.0 #551
alliepiper announced in Announcements
Summary
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime `debug_synchronous` debugging flags.

This release also includes several new features. `DeviceHistogram` now supports `__half` and better handles various edge cases. `WarpReduce` now performs correctly when restricted to a single-thread “warp”, and will use the `__reduce_add_sync` accelerated intrinsic (introduced with Ampere) when appropriate. `DeviceRadixSort` learned to handle the case where `begin_bit == end_bit`.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
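The accumulator-type change mentioned in the summary can be illustrated on the host. The following is a minimal sketch, assuming the rule is "accumulate in the decayed result type of the operator applied to the initial value and an input element", in the spirit of P2322R6. The names `accumulator_t`, `reduce_in_op_result_type`, `reduce_in_input_type`, and `accumulator_demo` are hypothetical helpers for illustration and are not CUB APIs; the input-type variant is shown only for contrast and is not a claim about earlier CUB behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper, not a CUB API: the accumulator type is the decayed
// result type of calling the operator on (initial value, input element).
template <typename Op, typename Init, typename Input>
using accumulator_t = typename std::decay<
    decltype(std::declval<Op&>()(std::declval<Init>(), std::declval<Input>()))>::type;

// Host-only reduction that accumulates in the operator's result type.
template <typename Input, typename Init, typename Op>
accumulator_t<Op, Init, Input>
reduce_in_op_result_type(const std::vector<Input>& in, Init init, Op op)
{
  accumulator_t<Op, Init, Input> acc = init;
  for (const Input& x : in) { acc = op(acc, x); }
  return acc;
}

// For contrast only: accumulate in the input element type, which can wrap.
template <typename Input, typename Op>
Input reduce_in_input_type(const std::vector<Input>& in, Input init, Op op)
{
  Input acc = init;
  for (const Input& x : in) { acc = static_cast<Input>(op(acc, x)); }
  return acc;
}

// Sums four bytes of value 100 both ways; the asserts document the results.
inline void accumulator_demo()
{
  const std::vector<std::uint8_t> bytes(4, std::uint8_t{100});
  auto add = [](int a, int b) { return a + b; };
  static_assert(
      std::is_same<accumulator_t<decltype(add), int, std::uint8_t>, int>::value,
      "the operator's result type drives the accumulator type");
  assert(reduce_in_op_result_type(bytes, 0, add) == 400);            // accumulates as int
  assert(reduce_in_input_type(bytes, std::uint8_t{0}, add) == 144);  // 400 mod 256
}
```

Calling `accumulator_demo()` exercises both paths. The point of the sketch is that the operator's signature, rather than the input or output element type, decides how wide the intermediate results are.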
Breaking Changes
- #448: Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They can be re-enabled by defining `CUB_PROVIDE_LEGACY_ARCH_MACROS`. These will be completely removed in a future release.
  - `CUB_IS_HOST_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_IS_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_INCLUDE_HOST_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_INCLUDE_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
- [...] `NV_IF_TARGET`. The macros below are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
  - `CUB_RUNTIME_FUNCTION`: Execution space annotations for functions that invoke CUDA Runtime APIs.
    - [...] `__host__ __device__`
    - [...] `__host__ __device__`
    - [...] `__host__`
    - [...] `__host__ __device__`
    - [...] `__host__`
  - `CUB_RUNTIME_ENABLED`: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior: [...]
  - `CUB_RDC_ENABLED`: New macro, may be combined with `NV_IF_TARGET` to replace most usages of `CUB_RUNTIME_ENABLED`. Behavior: [...]
- A diagnostic is now emitted when a `__device__`-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
  - [...] `__device__`-only implementation of `operator()`.
  - [...] `__host__ __device__` lambda.
  - [...] `cuda::proclaim_return_type` (added in libcu++ 1.9.0).
- Accumulator types are now determined by an updated method in the `DeviceReduce` algorithm, following guidance from http://wg21.link/P2322R6.
  - [...] `Dispatch*Reduce` layer have changed:
    - `DispatchReduce`: [...] `init` as initial type instead of output iterator value type.
    - `DispatchSegmentedReduce`: [...]
  - [...] `Equality`, `Inequality`, `InequalityWrapper`, `Sum`, `Difference`, `Division`, `Max`, `ArgMax`, `Min`, `ArgMin`.
  - `ThreadReduce` now accepts accumulator type and uses a different type for `prefix`.
- Accumulator types are now determined by an updated method in the `DeviceScan`, `DeviceScanByKey`, and `DeviceReduceByKey` algorithms, following guidance from http://wg21.link/P2322R6.
  - [...] `Dispatch` layer have changed:
    - `DispatchScan` now accepts accumulator type as a template parameter.
    - `DispatchScanByKey` now accepts accumulator type as a template parameter.
    - `DispatchReduceByKey` now accepts accumulator type as the last template parameter.
- [...] `debug_synchronous` flags on device algorithms.
  - Define `CUB_DEBUG_SYNC` during compilation to enable these checks.
New Features
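- Support `__half` in `DeviceHistogram`.
- `WarpReduce` now performs correctly when restricted to a single-thread “warp”.
- Use `__reduce_add_sync` hardware acceleration for `WarpReduce` on supported architectures.
Bug Fixes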
- `DeviceRadixSort` now handles the case where `begin_bit == end_bit`.
- Fix `DeviceHistogram::Even` for a variety of edge cases:
  - [...] `SampleT` and `LevelT`.
  - [...] `LevelT` is an integral type and the number of levels does not evenly divide the level range.
- `temp_storage_bytes` is properly set in the `AdjacentDifferenceCopy` device algorithms.
- [...] `AdjacentDifferenceCopy` device algorithms.
Other Enhancements
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of `__CUDA_ARCH__` with `<nv/target>` to handle host/device code divergence.
- #448: Mark unused PTX arch parameters as legacy.
- Define `CUB_DISABLE_BF16_SUPPORT` to avoid including the `cuda_bf16.h` header or using the `__nv_bfloat16` type.
- [...] `DeviceScan` algorithms.
- [...] `DeviceHistogram` algorithms.
- [...] `DevicePartition` algorithms.
- [...] `Device*Sort` algorithms.
- [...] `DeviceReduce` algorithms.
- [...] `DeviceRunLengthEncode` algorithms.
- [...] `DeviceSelect` algorithms.
- [...] `WarpMergeSort` documentation.

This discussion was created from the release CUB 2.0.0.
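The `begin_bit == end_bit` case listed under Bug Fixes can be reasoned about with a small host-only model. This sketch assumes the natural semantics of an empty bit range: no key bits take part in the comparison, all keys compare equal, and a stable sort therefore returns the input order unchanged. The release notes do not spell out what CUB does internally, and `sort_by_bit_range` is a hypothetical helper, not a CUB API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-only, stable LSD radix sort over the key bits [begin_bit, end_bit),
// one bit per pass. Hypothetical helper for illustration; not a CUB API.
inline std::vector<std::uint32_t> sort_by_bit_range(std::vector<std::uint32_t> keys,
                                                    int begin_bit,
                                                    int end_bit)
{
  std::vector<std::uint32_t> zeros;
  std::vector<std::uint32_t> ones;
  // With begin_bit == end_bit this loop runs zero times, so the keys are
  // returned in their original order.
  for (int bit = begin_bit; bit < end_bit; ++bit) {
    zeros.clear();
    ones.clear();
    for (std::uint32_t k : keys) {
      if ((k >> bit) & 1u) {
        ones.push_back(k);
      } else {
        zeros.push_back(k);
      }
    }
    // Stable partition: keys with a 0 bit first, then keys with a 1 bit.
    keys = zeros;
    keys.insert(keys.end(), ones.begin(), ones.end());
  }
  return keys;
}
```

For example, `sort_by_bit_range({5, 2, 7, 0}, 0, 3)` sorts on the low three bits, while `sort_by_bit_range({5, 2, 7, 0}, 0, 0)` performs no passes and hands back the keys as given.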