Vectorize output store in ublkcp DeviceTransform kernel#9481
Vectorize output store in ublkcp DeviceTransform kernel#9481nanan-nvidia wants to merge 11 commits into
Conversation
12991ad to
c6559fb
Compare
|
/ok to test c6559fb |
|
/ok to test 365bf3a |
365bf3a to
c2a253d
Compare
|
/ok to test c2a253d |
|
@nanan-nvidia can you please post the output of |
This comment was marked as outdated.
This comment was marked as outdated.
| if (full_tile || idx < valid_items) | ||
| using output_t = it_value_t<RandomAccessIteratorOut>; | ||
| constexpr int out_size = int{size_of<output_t>}; | ||
| constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1; |
There was a problem hiding this comment.
Q: Why can't we vectorize to 32 byte loads/stores here?
| // count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can | ||
| // raw copy, 4. size is power-of-2 and <= 16 bytes we can do vectorized store (STG.128) |
There was a problem hiding this comment.
Suggestion: Please format the list of requirements as a list:
| // count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can | |
| // raw copy, 4. size is power-of-2 and <= 16 bytes we can do vectorized store (STG.128) | |
| // count is a multiple of N, if | |
| // 1. there are no predicates, | |
| // 2. memory layout is contiguous, | |
| // 3. semantically we can raw copy, | |
| // 4. size is power-of-2 and <= 16 bytes | |
| // we can do vectorized store (STG.128) |
There was a problem hiding this comment.
And we should be able to emit STG.256 as well
| if (full_tile || idx < valid_items) | ||
| using output_t = it_value_t<RandomAccessIteratorOut>; | ||
| constexpr int out_size = int{size_of<output_t>}; | ||
| constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1; |
There was a problem hiding this comment.
Q: Why can't we vectorize types larger than 16 bytes?
| @@ -0,0 +1,103 @@ | |||
| // SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | |||
There was a problem hiding this comment.
Important: Please update copyright year to today.
| C2H_TEST("DeviceTransform::Transform aligned_size_t<16> same-width", | ||
| "[device][transform]", | ||
| c2h::type_list<std::uint8_t, std::uint16_t, std::uint32_t, std::uint64_t, float, double, uchar3>) |
There was a problem hiding this comment.
Q: Why do we need to test float and double as well here? They have the same size and alignment as two other integer types already covered.
There was a problem hiding this comment.
Important: Once we approve the general mechanics, we should update the per-function documentation.
| OutputAlign >= 16 && vec_size > 1 && ::cuda::std::is_same_v<Predicate, ::cuda::always_true> | ||
| && THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorOut> | ||
| && THRUST_NS_QUALIFIER::is_trivially_relocatable_v<output_t> && ::cuda::is_power_of_two(out_size) | ||
| && (true && ... && ::cuda::is_power_of_two(int{sizeof(InTs)})); |
There was a problem hiding this comment.
Important: fold expressions on bools handle empty lists fine, so we can use an unary fold:
| && (true && ... && ::cuda::is_power_of_two(int{sizeof(InTs)})); | |
| && (... && ::cuda::is_power_of_two(int{sizeof(InTs)})); |
| // didn't merge the changes. The problem was mostly a 25% increase in integer instructions, as shown by ncu. | ||
| template <int threads_per_block, | ||
| int UnrollFactor, | ||
| int OutputAlign, |
There was a problem hiding this comment.
Important: cuda::aligned_size_t does not only promise output alignment, it promises the alignment for all inputs and outputs, as well as the size multiple in bytes. So maybe:
| int OutputAlign, | |
| int PointerAlignment, |
| // When the caller guarantees aligned_size_t<N> num_items, i.e. the output pointer is N-byte aligned and the element | ||
| // count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can |
There was a problem hiding this comment.
Critical:
the element count is a multiple of N
This is only true for elements with power of two size. Think of aligned_size_t<16>{n} and int3*, which is valid for every n that is a multiple of 4.
There was a problem hiding this comment.
In vectorize_store we require both input element size and output element size to be pow2, so this should not be a bug.
From the definition,
explicit constexpr aligned_size_t(size_t __s) : value(__s) {
_CCCL_ASSERT(value % align == 0,
"aligned_size_t must be constructed with a size that is a multiple of the alignment");
}
It seems aligned_size_t<16>(n) just mean n % 16 == 0? Does it actually mean (sizeof(T) * n) % 16 == 0 semantically?
There was a problem hiding this comment.
aligned_size_t refers to a size in bytes, not in elements, at least according to what we currently document.
| // When the caller guarantees aligned_size_t<N> num_items, i.e. the output pointer is N-byte aligned and the element | ||
| // count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can | ||
| // raw copy, 4. size is power-of-2 and <= 16 bytes we can do vectorized store (STG.128) | ||
| constexpr bool vectorize_store = |
There was a problem hiding this comment.
Suggestion: this is not only about store vectorization it seems, so please name the variable to reflect its purpose.
|
@nanan-nvidia the results look promising. I think the implementation needs a bit more work though. Please check out the notes I have from a discussion with DevTech working on PyTorch: https://github.com/NVIDIA-dev/cccl_private/issues/598. |
This comment has been minimized.
This comment has been minimized.
🥳 CI Workflow Results🟩 Finished in 8h 21m: Pass: 100%/291 | Total: 3d 23h | Max: 1h 41m | Hits: 64%/327929See results here. |
|
/ok to test 6b64983 |
ublkcp kernel for deviceTransform when user promises cuda::aligned_size_t<16>…type-erased iterator)
…default-constructible outputs)
…ar3 is_aligned assert)
a1d2946 to
c9a53d7
Compare
|
/ok to test c9a53d7 |
As per #9210, we noticed that the
ublkcpkernel forDeviceTransformhas unvectorized stores, leaving some performance on the table. This PR vectorizes the output store (STG.128) whenever it's safe. There is no change to existing call sites.The store width is controlled by a tunable policy knob
store_vec(output elements per store):0= auto (a 16-byte STG.128, the default),1= disable vectorization (the branch is compiled out, useful for register-heavy functors), andN= useNelements, silently capped to 16 bytes.Benchmarks (B200)
Ref = stock CUB, Cmp = this PR. FAST = this PR faster.
For heavy and complex_cmp, we use the opt-out path (
store_vec=1), since in those cases we should disable the vectorized store. We also show the regressions at the end if vectorization is enabled for those cases.babelstream: before vs this PR (auto)
pytorch: before vs this PR (auto)
fill: before vs this PR (auto)
fib: before vs this PR (auto)
grayscale: before vs this PR (auto)
heavy: before vs this PR (
store_vec=1)complex_cmp — before vs this PR (store_vec=1 opt-out)