Releases: ROCm/rocPRIM
rocPRIM 4.0.1 for ROCm 7.0.2
rocPRIM code for ROCm 7.0.2 did not change. The library was rebuilt for the updated ROCm 7.0.2 stack.
rocPRIM 4.0.0 for ROCm 7.0.1
rocPRIM code for ROCm 7.0.1 did not change. The library was rebuilt for the updated ROCm 7.0.1 stack.
rocPRIM 4.0.0 for ROCm 7.0.0
Added
- Added `rocprim::accumulator_t` to ensure parity with CCCL.
- Added test for `rocprim::accumulator_t`.
- Added `rocprim::invoke_result_r` to ensure parity with CCCL.
- Added function `is_build_in` into `rocprim::traits::get`.
- Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
- Added initial value support to device level inclusive scans.
- Added new optimization to the backend for `device_transform` when the input and output are pointers.
- Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers.
- Added `rocprim::device_transform` for n-ary transform operations. The API takes as input `n` iterators inside a `rocprim::tuple`.
- Added gfx950 support.
- Added `rocprim::key_value_pair::operator==`.
- Added the `rocprim::unrolled_copy` thread function to copy multiple items inside a thread.
- Added the `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`.
- Added `rocprim::int128_t` and `rocprim::uint128_t` to benchmarks for improved performance evaluation on 128-bit integers.
- Added `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers.
- Added the `rocprim::merge_inplace` function for merging in-place.
- Added initial value support for warp- and block-level inclusive scan.
- Added support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag.
- Added tests and documentation to `lookback_scan_state`. It is still in the `detail` namespace.
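The initial value support added to the inclusive scans can be pictured with a host-side reference. The following is a minimal C++17 sketch of the semantics only, assuming the initial value is combined in front of the first input element (as `std::inclusive_scan` does with its `init` argument). `inclusive_scan_with_init` is a hypothetical helper written for illustration; it is not a rocPRIM function, and the actual rocPRIM signatures are not shown here.

```cpp
#include <array>
#include <cstddef>
#include <functional>

// Hypothetical host-side reference, not part of rocPRIM:
// output[i] = op(...op(op(initial_value, input[0]), input[1])..., input[i]).
template <typename T, std::size_t N, typename BinaryOp>
constexpr std::array<T, N> inclusive_scan_with_init(const std::array<T, N>& input,
                                                    T                       initial_value,
                                                    BinaryOp                op)
{
    std::array<T, N> output{};
    T                running = initial_value;
    for(std::size_t i = 0; i < N; ++i)
    {
        running   = op(running, input[i]); // fold the next element into the running value
        output[i] = running;
    }
    return output;
}

constexpr std::array<int, 4> scan_input{1, 2, 3, 4};
constexpr std::array<int, 4> scanned
    = inclusive_scan_with_init(scan_input, 10, std::plus<int>{});

// With an initial value of 10 the outputs are 11, 13, 16, 20.
static_assert(scanned[0] == 11 && scanned[3] == 20, "initial value is folded into every output");
```

Without an initial value the first output would simply be the first input; with one, every output is offset by it under the chosen operator.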
Optimizations
- Improved performance of `rocprim::device_select` and `rocprim::device_partition` when using multiple streams on the MI3XX architecture.
Changed
- Changed the parameters `long_radix_bits` and `LongRadixBits` from `segmented_radix_sort` to `radix_bits` and `RadixBits` respectively.
- Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` explicit, use `rocprim::make_reverse_iterator`.
- Merged `radix_key_codec` into type_traits system.
- Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`.
- The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type.
  - This is a complete list of affected functions and how their default accumulator types are changing:
    - `rocprim::inclusive_scan`
      - Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, typename std::iterator_traits<InputIterator>::value_type>`
    - `rocprim::deterministic_inclusive_scan`
      - Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, typename std::iterator_traits<InputIterator>::value_type>`
    - `rocprim::exclusive_scan`
      - Previous default: `class AccType = detail::input_type_t<InitValueType>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
    - `rocprim::deterministic_exclusive_scan`
      - Previous default: `class AccType = detail::input_type_t<InitValueType>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
- Undeprecated internal `detail::raw_storage`.
- New versions of `rocprim::thread_load` and `rocprim::thread_store` replace the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The new versions avoid inline assembly where possible and, as a result, hinder the optimizer less.
- Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better.
- All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocprim::symbol`, letting the user link multiple libraries built with different versions of rocPRIM.
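The versioned inline namespace mechanism mentioned in the last item is a standard C++ feature. The following sketch shows it on a hypothetical library called `mylib`; the namespace names and the `encoded_version` function are invented for illustration and are not rocPRIM symbols.

```cpp
// Hypothetical library "mylib"; all names here are illustrative only.
namespace mylib
{
// An older release, reachable only through its versioned name.
namespace MYLIB_300300_NS
{
constexpr int encoded_version() { return 300300; }
} // namespace MYLIB_300300_NS

// The current release: `inline` makes its members visible as mylib::encoded_version(),
// while the fully qualified (and mangled) name still contains MYLIB_300400_NS.
inline namespace MYLIB_300400_NS
{
constexpr int encoded_version() { return 300400; }
} // namespace MYLIB_300400_NS
} // namespace mylib

static_assert(mylib::encoded_version() == 300400,
              "lookup through the outer namespace finds the inline namespace");
static_assert(mylib::MYLIB_300300_NS::encoded_version() == 300300,
              "the older version stays addressable by its versioned name");
```

Because the version is part of each symbol's qualified name, two binaries built against different versions export differently named symbols, which is what allows them to be linked together without clashing.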
Upcoming changes
- `rocprim::invoke_result_binary_op` and `rocprim::invoke_result_binary_op_t` are deprecated. Use `rocprim::accumulator_t` now.
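To see why deriving the accumulator type from the scan operator matters, consider the overflow case described for the scan defaults: narrow inputs combined by an operator that returns a wider type. The sketch below is plain C++17 and runs on the host. `accumulator_sketch_t` is a hypothetical stand-in that assumes the accumulator is the decayed result type of the operator; it is not rocPRIM's definition of `rocprim::accumulator_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in (assumption): the decayed result type of op(T, T).
template <typename BinaryFunction, typename T>
using accumulator_sketch_t = std::decay_t<std::invoke_result_t<BinaryFunction, T, T>>;

// A scan operator that accepts narrow inputs but returns a wider result.
struct widening_plus
{
    constexpr int operator()(int lhs, int rhs) const { return lhs + rhs; }
};

// Returns the last value of an inclusive scan, accumulating in AccType.
template <typename AccType, std::size_t N, typename BinaryFunction>
constexpr AccType last_inclusive_scan_value(const std::uint8_t (&input)[N], BinaryFunction op)
{
    AccType accumulator = static_cast<AccType>(input[0]);
    for(std::size_t i = 1; i < N; ++i)
    {
        // Every intermediate result is narrowed to AccType before the next step.
        accumulator = static_cast<AccType>(op(accumulator, input[i]));
    }
    return accumulator;
}

constexpr std::uint8_t narrow_input[] = {200, 100};

using input_based_acc = std::uint8_t; // accumulator taken from the input value type
using op_based_acc    = accumulator_sketch_t<widening_plus, std::uint8_t>; // int

static_assert(std::is_same_v<op_based_acc, int>, "operator result type is used");
static_assert(last_inclusive_scan_value<input_based_acc>(narrow_input, widening_plus{}) == 44,
              "200 + 100 wraps modulo 256 in an 8-bit accumulator");
static_assert(last_inclusive_scan_value<op_based_acc>(narrow_input, widening_plus{}) == 300,
              "the wider accumulator keeps the full sum");
```

Code that relied on the previous defaults can keep the old behaviour by passing the accumulator type explicitly.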
Removed
- Removed `rocprim::detail::float_bit_mask` and related tests, use `rocprim::traits::float_bit_mask` instead.
- Removed `rocprim::traits::is_fundamental`, please use `rocprim::traits::get<T>::is_fundamental()` directly.
- Removed the deprecated parameters `short_radix_bits` and `ShortRadixBits` from the `segmented_radix_sort` config. They were unused, so this is only an API change.
- Removed the deprecated `operator<<` from the iterators.
- Removed the deprecated `TwiddleIn` and `TwiddleOut`. Use `radix_key_codec` instead.
- Removed the deprecated flags API of `block_adjacent_difference`. Use `subtract_left()` or `block_discontinuity::flag_heads()` instead.
- Removed the deprecated `to_exclusive` functions in the warp scans.
- Removed the `rocprim::load_cs` from the `cache_load_modifier` enum. Use `rocprim::load_nontemporal` instead.
- Removed the `rocprim::store_cs` from the `cache_store_modifier` enum. Use `rocprim::store_nontemporal` instead.
- Removed the deprecated header file `rocprim/detail/match_result_type.hpp`. Include `rocprim/type_traits.hpp` instead.
  - This header included `rocprim::detail::invoke_result`. Use `rocprim::invoke_result` instead.
  - This header included `rocprim::detail::invoke_result_binary_op`. Use `rocprim::invoke_result_binary_op` instead.
  - This header included `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead.
- Removed the deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead.
- Removed `rocprim/detail/radix_sort.hpp`, functionality can now be found in `rocprim/thread/radix_key_codec.hpp`.
- Removed C++14 support, only C++17 is supported.
- Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed:
  - `rocprim::device_warp_size()`
    - For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
    - For run-time constants, this is replaced with `rocprim::arch::wavefront::size()`.
  - `rocprim::warp_size()`
    - Use `rocprim::host_warp_size()`, `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead.
  - `ROCPRIM_WAVEFRONT_SIZE`
    - Use `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead.
  - `__AMDGCN_WAVEFRONT_SIZE`
    - This was a fallback define for the compiler's removed symbol, having the same name.
- This release removes support for custom builds on gfx940 and gfx941.
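The split between compile-time bounds and a run-time wavefront size changes how storage is sized. The sketch below illustrates the idea in host-only C++17 and does not use rocPRIM: the functions in the `sketch` namespace are hypothetical stand-ins for the minimum and maximum bounds, and the values 32 and 64 are assumptions made for the example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for compile-time wavefront bounds; not the rocPRIM API.
// 32 and 64 are assumed here as the smallest and largest wavefront sizes.
namespace sketch
{
constexpr unsigned int wavefront_min_size() { return 32; }
constexpr unsigned int wavefront_max_size() { return 64; }
} // namespace sketch

// Per-lane scratch space must be sized for the largest wavefront that may run.
template <typename T>
struct per_lane_scratch
{
    std::array<T, sketch::wavefront_max_size()> lanes{};
};

// A block holds the most wavefronts when wavefronts are smallest,
// so per-wavefront storage is bounded using the minimum size.
constexpr unsigned int max_wavefronts_per_block(unsigned int block_size)
{
    return (block_size + sketch::wavefront_min_size() - 1) / sketch::wavefront_min_size();
}

// The size known only at run time decides how much of the scratch space is used.
template <typename T>
constexpr std::size_t lanes_in_use(const per_lane_scratch<T>& scratch,
                                   unsigned int               runtime_wavefront_size)
{
    return runtime_wavefront_size <= scratch.lanes.size() ? runtime_wavefront_size
                                                          : scratch.lanes.size();
}

static_assert(per_lane_scratch<int>{}.lanes.size() == 64, "sized for the largest wavefront");
static_assert(max_wavefronts_per_block(256) == 8, "256 threads / 32 lanes");
static_assert(lanes_in_use(per_lane_scratch<int>{}, 32) == 32, "half the scratch is used");
```

Allocating for the worst case at compile time and selecting the active portion at run time avoids depending on a single compile-time wavefront size.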
Resolved issues
- Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality.
- Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality.
- Fixed device radix sort not returning the correct required temporary storage when a double buffer contains `nullptr`.
- Fixed constness of equality operators (`==` and `!=`) in `rocprim::key_value_pair`.
- Fixed an issue for the comparison operators in `arg_index_iterator` and `texture_cache_iterator`, where `<` and `>` comparators were swapped.
- Fixed an issue for the `rocprim::thread_reduce` not working correctly with a prefix value.
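The constness fix for the equality operators concerns a general C++ rule: a member operator without a trailing `const` cannot be called on a const object. The sketch below shows const-correct operators on a hypothetical `key_value_pair_sketch` type; it is an illustration of the rule and not rocPRIM's `rocprim::key_value_pair`.

```cpp
// Hypothetical pair type for illustration; not rocPRIM's definition.
template <typename Key, typename Value>
struct key_value_pair_sketch
{
    Key   key;
    Value value;

    // The trailing `const` lets the operators be called on const objects.
    constexpr bool operator==(const key_value_pair_sketch& other) const
    {
        return key == other.key && value == other.value;
    }

    constexpr bool operator!=(const key_value_pair_sketch& other) const
    {
        return !(*this == other);
    }
};

constexpr key_value_pair_sketch<int, int> first_pair{1, 7};
constexpr key_value_pair_sketch<int, int> same_pair{1, 7};
constexpr key_value_pair_sketch<int, int> other_pair{2, 7};

// Both operands are const here; this compiles only because the operators are const.
static_assert(first_pair == same_pair, "equal keys and values compare equal");
static_assert(first_pair != other_pair, "different keys compare unequal");
```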
Known issues
- When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key` the intermediate values can change order on Navi3x.
  - However, if a commutative scan operator is used then the final scan value (output array) will still always be consistent between runs.
rocPRIM 3.4.1 for ROCm 6.4.4
rocPRIM code for ROCm 6.4.4 did not change. The library was rebuilt for the updated ROCm 6.4.4 stack.
rocPRIM 3.4.1 for ROCm 6.4.3
rocPRIM code for ROCm 6.4.3 did not change. The library was rebuilt for the updated ROCm 6.4.3 stack.
rocPRIM 3.4.1 for ROCm 6.4.2
Upcoming changes
- Changes to the template parameters of warp and block algorithms will be made in an upcoming release.
Deprecations
- Due to an upcoming compiler change the following warp size-related symbols will be removed in the next major release and are thus marked as deprecated:
  - `rocprim::device_warp_size()`
    - For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
    - For run-time constants, this is replaced with `rocprim::arch::wavefront::size()`.
  - `rocprim::warp_size()`
  - `ROCPRIM_WAVEFRONT_SIZE`
rocPRIM 3.4.0 for ROCm 6.4.1
rocPRIM code for ROCm 6.4.1 did not change. The library was rebuilt for the updated ROCm 6.4.1 stack.
rocPRIM 3.4.0 for ROCm 6.4.0
Added
- Added extended tests to `rtest.py`. These are extra tests that did not fit the criteria of smoke and regression tests, and they take much longer to run than smoke and regression tests.
- Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests.
- Added regression tests to `rtest.py`. Regression tests are a subset of tests that caused hardware problems for past emulation environments.
  - Can be run with `python rtest.py [--emulation|-e|--test|-t]=regression`
- Added the parallel `find_first_of` device function with autotuned configurations. This function is similar to `std::find_first_of`: it searches for the first occurrence of any of the provided elements.
- Added the `--emulation` option for `rtest.py`.
  - Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
- Added tuned configurations for segmented radix sort for gfx942 to improve performance on this architecture.
- Added a parallel device-level function, `rocprim::adjacent_find`, similar to the C++ Standard Library `std::adjacent_find` algorithm.
- Added configuration autotuning to device adjacent find (`rocprim::adjacent_find`) for improved performance on selected architectures.
- Added `rocprim::numeric_limits` which is an extension of `std::numeric_limits`, which includes support for 128-bit integers.
- Added `rocprim::int128_t` and `rocprim::uint128_t` which are the `__int128_t` and `__uint128_t` types.
- Added the parallel `search` and `find_end` device functions, similar to `std::search` and `std::find_end`. These functions search for the first and last occurrence of the sequence respectively.
- Added a parallel device-level function, `rocprim::search_n`, similar to the C++ Standard Library `std::search_n` algorithm.
- Added new constructors and a `base` function, and added `constexpr` specifier to all functions in `rocprim::reverse_iterator` to improve parity with the C++17 `std::reverse_iterator`.
- Added hipGraph support to device run-length-encode for non trivial runs (`rocprim::run_length_encode_non_trivial_runs`).
- Added configuration autotuning to device run-length-encode for non trivial runs (`rocprim::run_length_encode_non_trivial_runs`) for improved performance on selected architectures.
- Added configuration autotuning to device run-length-encode for trivial runs (`rocprim::run_length_encode`) for improved performance on selected architectures.
- Added a new type traits interface to enable users to provide additional type trait information to rocPRIM, facilitating better compatibility with custom types.
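Several of the device functions in this list are described as similar to standard algorithms. As a reference for the semantics they are compared to, the sketch below runs the standard library counterparts on the host. It does not call rocPRIM; `search_positions` and `locate` are hypothetical helpers written for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Positions reported by the standard algorithms; input.size() means "not found".
struct search_positions
{
    std::size_t find_first_of; // first element equal to any of the keys
    std::size_t adjacent_find; // first element equal to its successor
    std::size_t search;        // first occurrence of the pattern
    std::size_t find_end;      // last occurrence of the pattern
    std::size_t search_n;      // first run of run_length copies of run_value
};

inline search_positions locate(const std::vector<int>& input,
                               const std::vector<int>& keys,
                               const std::vector<int>& pattern,
                               std::size_t             run_length,
                               int                     run_value)
{
    // Keep the sketch to non-degenerate queries.
    assert(!pattern.empty() && run_length > 0);

    const auto index = [&](std::vector<int>::const_iterator it) {
        return static_cast<std::size_t>(std::distance(input.begin(), it));
    };

    return {
        index(std::find_first_of(input.begin(), input.end(), keys.begin(), keys.end())),
        index(std::adjacent_find(input.begin(), input.end())),
        index(std::search(input.begin(), input.end(), pattern.begin(), pattern.end())),
        index(std::find_end(input.begin(), input.end(), pattern.begin(), pattern.end())),
        index(std::search_n(input.begin(), input.end(), run_length, run_value)),
    };
}
```

For the input `{1, 2, 3, 3, 1, 2, 3, 3, 3}` with keys `{3, 9}`, pattern `{1, 2}`, and a run of three `3`s, `locate` reports the positions 2, 2, 0, 4, and 6 in the order of the struct members.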
Changed
- Changed the subset of tests that are run for smoke tests so that the smoke test completes faster and never exceeds 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
- The `rtest.py` options have changed. `rtest.py` must now be run with either `--test|-t` or `--emulation|-e`, but not both options.
- Changed the internal algorithm of block radix sort to use rank match to improve performance of various radix sort related algorithms.
- Disabled padding in various cases where higher occupancy resulted in better performance despite more bank conflicts.
- Removed HIP-CPU support. HIP-CPU support was experimental and broken.
- Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.
- You can use CMake HIP language support with CMake 3.18 and later. To use HIP language support, run `cmake` with `-DUSE_HIPCXX=ON` instead of setting the `CXX` variable to the path to a HIP-aware compiler.
Resolved issues
- Fixed an issue where `rmake.py` would generate wrong CMake commands in a Linux environment.
- Fixed an issue where `rocprim::partial_sort_copy` would yield a compile error if the input iterator is const.
- Fixed incorrect type traits for 128-bit signed and unsigned integers.
- Fixed compilation issue when `rocprim::radix_key_codec<...>` is specialized with a 128-bit integer.
- Fixed the warp-level reduction `rocprim::warp_reduce.reduce` DPP implementation to avoid undefined intermediate values during the reduction.
- Fixed an issue that caused a segmentation fault when `hipStreamLegacy` was passed to some API functions.
Upcoming changes
- Using the initialisation constructor of `rocprim::reverse_iterator` will throw a deprecation warning. It will be marked as explicit in the next major release.
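What an explicit initialisation constructor means for calling code can be shown with `std::reverse_iterator`, whose constructor is already explicit and which the release notes name as the parity target. The sketch below uses only the standard library, on the assumption that an explicit rocPRIM constructor behaves analogously; `forward_it`, `reverse_it`, `values`, and `last_value` are names made up for the example.

```cpp
#include <iterator>
#include <type_traits>

using forward_it = const int*;
using reverse_it = std::reverse_iterator<forward_it>;

// Explicit constructor: direct construction works, implicit conversion does not.
static_assert(std::is_constructible_v<reverse_it, forward_it>,
              "reverse_it{it} is allowed");
static_assert(!std::is_convertible_v<forward_it, reverse_it>,
              "reverse_it r = it; is rejected");

constexpr int values[] = {1, 2, 3};

// The make_ helper deduces the iterator type and performs the explicit construction.
constexpr int last_value()
{
    return *std::make_reverse_iterator(std::end(values));
}

static_assert(last_value() == 3, "a reverse iterator built from end() yields the last element");
```

Code that passes a plain iterator where a reverse iterator is expected needs to construct it explicitly or go through the make helper once the constructor is explicit.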
rocPRIM 3.3.0 for ROCm 6.3.3
rocPRIM code for ROCm 6.3.3 did not change. The library was rebuilt for the updated ROCm 6.3.3 stack.
rocPRIM 3.3.0 for ROCm 6.3.2
rocPRIM code for ROCm 6.3.2 did not change. The library was rebuilt for the updated ROCm 6.3.2 stack.