Skip to content

Implement parallel cuda::std::iota#7930

Open
miscco wants to merge 1 commit intoNVIDIA:mainfrom
miscco:parallel_iota
Open

Implement parallel cuda::std::iota#7930
miscco wants to merge 1 commit intoNVIDIA:mainfrom
miscco:parallel_iota

Conversation

@miscco
Copy link
Contributor

@miscco miscco commented Mar 9, 2026

This implements the iota algorithm for the cuda backend.

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes #7927

@miscco miscco requested review from a team as code owners March 9, 2026 10:17
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 9, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 9, 2026
@miscco
Copy link
Contributor Author

miscco commented Mar 9, 2026

Performance looks good:

['thrust_iota.json', 'pstl_iota.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   I8    |    2^16    |   6.560 us |       7.31% |   6.623 us |      15.80% |     0.063 us |   0.96% |   SAME   |
|   I8    |    2^20    |   7.258 us |       4.65% |   7.307 us |      11.57% |     0.048 us |   0.67% |   SAME   |
|   I8    |    2^24    |  30.275 us |       1.59% |  30.323 us |       2.90% |     0.048 us |   0.16% |   SAME   |
|   I8    |    2^28    | 382.385 us |       0.14% | 382.336 us |       0.15% |    -0.049 us |  -0.01% |   SAME   |
|   I16   |    2^16    |   6.677 us |       7.26% |   6.555 us |       7.22% |    -0.122 us |  -1.83% |   SAME   |
|   I16   |    2^20    |   9.349 us |       4.37% |   9.370 us |       7.59% |     0.021 us |   0.22% |   SAME   |
|   I16   |    2^24    |  53.793 us |       1.96% |  53.768 us |       1.77% |    -0.025 us |  -0.05% |   SAME   |
|   I16   |    2^28    | 758.099 us |       0.09% | 757.924 us |       0.07% |    -0.175 us |  -0.02% |   SAME   |
|   I32   |    2^16    |   6.757 us |       7.48% |   6.688 us |       9.00% |    -0.069 us |  -1.02% |   SAME   |
|   I32   |    2^20    |  12.650 us |       5.41% |  12.644 us |       6.09% |    -0.006 us |  -0.05% |   SAME   |
|   I32   |    2^24    | 100.713 us |       0.50% | 100.687 us |       0.52% |    -0.026 us |  -0.03% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.05% |   1.509 ms |       0.05% |    -0.104 us |  -0.01% |   SAME   |
|   I64   |    2^16    |   7.441 us |      10.97% |   7.360 us |      11.94% |    -0.081 us |  -1.09% |   SAME   |
|   I64   |    2^20    |  18.860 us |       5.49% |  18.942 us |       2.83% |     0.082 us |   0.44% |   SAME   |
|   I64   |    2^24    | 194.946 us |       0.31% | 195.016 us |       0.32% |     0.069 us |   0.04% |   SAME   |
|   I64   |    2^28    |   3.014 ms |       0.04% |   3.015 ms |       0.04% |     1.409 us |   0.05% |   SLOW   |
|  I128   |    2^16    |  13.761 us |       7.28% |  13.567 us |       3.25% |    -0.194 us |  -1.41% |   SAME   |
|  I128   |    2^20    | 125.779 us |       2.82% | 123.812 us |       2.43% |    -1.966 us |  -1.56% |   SAME   |
|  I128   |    2^24    |   1.849 ms |       0.61% |   1.849 ms |       0.53% |    -0.425 us |  -0.02% |   SAME   |
|  I128   |    2^28    |  29.237 ms |       0.21% |  25.548 ms |       0.74% | -3689.171 us | -12.62% |   FAST   |
|   F32   |    2^16    |   7.274 us |       4.55% |   7.241 us |       5.15% |    -0.033 us |  -0.45% |   SAME   |
|   F32   |    2^20    |  12.699 us |       3.90% |  12.752 us |       7.50% |     0.053 us |   0.41% |   SAME   |
|   F32   |    2^24    | 106.859 us |       0.47% | 103.027 us |       2.73% |    -3.831 us |  -3.59% |   FAST   |
|   F32   |    2^28    |   1.510 ms |       0.04% |   1.510 ms |       0.04% |    -0.005 us |  -0.00% |   SAME   |
|   F64   |    2^16    |   8.023 us |       5.59% |   7.946 us |       6.03% |    -0.077 us |  -0.96% |   SAME   |
|   F64   |    2^20    |  19.398 us |       3.00% |  19.401 us |       2.73% |     0.003 us |   0.01% |   SAME   |
|   F64   |    2^24    | 196.175 us |       0.41% | 196.120 us |       0.36% |    -0.056 us |  -0.03% |   SAME   |
|   F64   |    2^28    |   3.035 ms |       0.04% |   3.037 ms |       0.05% |     1.782 us |   0.06% |   SLOW   |

# stepped

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   I8    |    2^16    |   6.694 us |       7.37% |   6.676 us |      13.31% |    -0.018 us |  -0.27% |   SAME   |
|   I8    |    2^20    |   7.235 us |      10.50% |   7.193 us |      10.22% |    -0.042 us |  -0.59% |   SAME   |
|   I8    |    2^24    |  30.269 us |       2.65% |  30.438 us |       2.31% |     0.170 us |   0.56% |   SAME   |
|   I8    |    2^28    | 382.063 us |       0.12% | 382.290 us |       0.15% |     0.227 us |   0.06% |   SAME   |
|   I16   |    2^16    |   6.601 us |       8.56% |   6.641 us |       8.26% |     0.040 us |   0.61% |   SAME   |
|   I16   |    2^20    |   9.389 us |       4.78% |   9.448 us |       5.15% |     0.060 us |   0.63% |   SAME   |
|   I16   |    2^24    |  53.761 us |       1.43% |  53.692 us |       1.03% |    -0.069 us |  -0.13% |   SAME   |
|   I16   |    2^28    | 757.933 us |       0.07% | 758.168 us |       0.09% |     0.235 us |   0.03% |   SAME   |
|   I32   |    2^16    |   6.620 us |       7.54% |   6.758 us |       8.20% |     0.138 us |   2.08% |   SAME   |
|   I32   |    2^20    |  12.761 us |       7.71% |  12.564 us |       6.24% |    -0.197 us |  -1.54% |   SAME   |
|   I32   |    2^24    | 100.626 us |       0.44% | 100.644 us |       0.47% |     0.017 us |   0.02% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.04% |   1.509 ms |       0.05% |     0.037 us |   0.00% |   SAME   |
|   I64   |    2^16    |   7.490 us |       7.48% |   7.423 us |      11.27% |    -0.068 us |  -0.90% |   SAME   |
|   I64   |    2^20    |  18.892 us |       4.93% |  18.947 us |       4.73% |     0.054 us |   0.29% |   SAME   |
|   I64   |    2^24    | 194.983 us |       0.29% | 195.050 us |       0.35% |     0.068 us |   0.03% |   SAME   |
|   I64   |    2^28    |   3.013 ms |       0.03% |   3.015 ms |       0.04% |     1.737 us |   0.06% |   SLOW   |
|  I128   |    2^16    |  13.867 us |       4.27% |  13.578 us |       3.33% |    -0.289 us |  -2.08% |   SAME   |
|  I128   |    2^20    | 126.155 us |       2.24% | 123.868 us |       3.07% |    -2.288 us |  -1.81% |   SAME   |
|  I128   |    2^24    |   1.841 ms |       0.71% |   1.840 ms |       0.54% |    -1.222 us |  -0.07% |   SAME   |
|  I128   |    2^28    |  29.121 ms |       0.11% |  25.510 ms |       0.68% | -3610.468 us | -12.40% |   FAST   |
|   F32   |    2^16    |   7.379 us |       5.74% |   7.269 us |       4.72% |    -0.110 us |  -1.49% |   SAME   |
|   F32   |    2^20    |  12.728 us |       4.23% |  12.721 us |       4.98% |    -0.007 us |  -0.06% |   SAME   |
|   F32   |    2^24    | 103.599 us |       2.81% | 103.653 us |       2.84% |     0.053 us |   0.05% |   SAME   |
|   F32   |    2^28    |   1.510 ms |       0.05% |   1.510 ms |       0.05% |    -0.062 us |  -0.00% |   SAME   |
|   F64   |    2^16    |   8.150 us |       6.81% |   8.082 us |      10.29% |    -0.068 us |  -0.84% |   SAME   |
|   F64   |    2^20    |  19.382 us |       3.03% |  19.420 us |       4.03% |     0.038 us |   0.20% |   SAME   |
|   F64   |    2^24    | 196.095 us |       0.41% | 196.382 us |       0.57% |     0.287 us |   0.15% |   SAME   |
|   F64   |    2^28    |   3.035 ms |       0.04% |   3.036 ms |       0.05% |     1.671 us |   0.06% |   SLOW   |

@miscco
Copy link
Contributor Author

miscco commented Mar 9, 2026

I have absolutely no idea why __iota_init_step_fn is faster for larger integer types than __iota_init_fn it uses more registers and more computation 🤷

@github-actions

This comment has been minimized.

This implements the `iota` algorithm for the cuda backend.

* `std::iota` see https://en.cppreference.com/w/cpp/algorithm/iota.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes NVIDIA#7927
@github-actions
Copy link
Contributor

github-actions bot commented Mar 9, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 19m: Pass: 100%/99 | Total: 1d 01h | Max: 53m 50s | Hits: 98%/255735

See results here.

Comment on lines +73 to +98
# if _LIBCUDACXX_HAS_NVFP16()
// We cannot rely on operator+ and constructors from integers to be available for the extended fp types
if constexpr (is_same_v<_Tp, __half>)
{
return ::__hadd(__init_, ::__ull2half_rn(__index));
}
else
# endif // _LIBCUDACXX_HAS_NVFP16()
# if _LIBCUDACXX_HAS_NVBF16()
if constexpr (is_same_v<_Tp, __nv_bfloat16>)
{
return ::__hadd(__init_, ::__ull2bfloat16_rn(__index));
}
else
# endif // _LIBCUDACXX_HAS_NVBF16()
if constexpr (is_arithmetic_v<_Tp>)
{ // avoid warnings about integer conversions
return static_cast<_Tp>(__init_ + static_cast<_Tp>(__index));
}
else if constexpr (__can_operator_plus_integral<_Tp>)
{
return __init_ + __index;
}
else if constexpr (__can_operator_plus_conversion<_Tp>)
{
return __init_ + static_cast<_Tp>(__index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Shouldn't cuda::std::plus already handle all these details? Can we just use that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it does not but does the plain return __x + __y;

That means we can get integer promotion / sign conversion warnings

_InputIterator __first,
_InputIterator __last,
const _Tp& __init,
const _Tp& __step)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: Adding a parallel iota is one thing, but adding an iota with a step feels too extreme to still be called cuda::std::iota.

I'm leaning more towards a cuda::sequence algorithm here, otherwise we're starting to twist our promise that everything in cuda/std is conforming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::iota

4 participants