Implement parallel `cuda::std::iota` by miscco · Pull Request #7930 · NVIDIA/cccl

miscco · 2026-03-09T10:17:42Z

This implements the iota algorithm for the cuda backend.

std::iota see https://en.cppreference.com/w/cpp/algorithm/iota.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

miscco · 2026-03-09T10:18:13Z

Performance looks good:

['thrust_iota.json', 'pstl_iota.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   I8    |    2^16    |   6.560 us |       7.31% |   6.623 us |      15.80% |     0.063 us |   0.96% |   SAME   |
|   I8    |    2^20    |   7.258 us |       4.65% |   7.307 us |      11.57% |     0.048 us |   0.67% |   SAME   |
|   I8    |    2^24    |  30.275 us |       1.59% |  30.323 us |       2.90% |     0.048 us |   0.16% |   SAME   |
|   I8    |    2^28    | 382.385 us |       0.14% | 382.336 us |       0.15% |    -0.049 us |  -0.01% |   SAME   |
|   I16   |    2^16    |   6.677 us |       7.26% |   6.555 us |       7.22% |    -0.122 us |  -1.83% |   SAME   |
|   I16   |    2^20    |   9.349 us |       4.37% |   9.370 us |       7.59% |     0.021 us |   0.22% |   SAME   |
|   I16   |    2^24    |  53.793 us |       1.96% |  53.768 us |       1.77% |    -0.025 us |  -0.05% |   SAME   |
|   I16   |    2^28    | 758.099 us |       0.09% | 757.924 us |       0.07% |    -0.175 us |  -0.02% |   SAME   |
|   I32   |    2^16    |   6.757 us |       7.48% |   6.688 us |       9.00% |    -0.069 us |  -1.02% |   SAME   |
|   I32   |    2^20    |  12.650 us |       5.41% |  12.644 us |       6.09% |    -0.006 us |  -0.05% |   SAME   |
|   I32   |    2^24    | 100.713 us |       0.50% | 100.687 us |       0.52% |    -0.026 us |  -0.03% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.05% |   1.509 ms |       0.05% |    -0.104 us |  -0.01% |   SAME   |
|   I64   |    2^16    |   7.441 us |      10.97% |   7.360 us |      11.94% |    -0.081 us |  -1.09% |   SAME   |
|   I64   |    2^20    |  18.860 us |       5.49% |  18.942 us |       2.83% |     0.082 us |   0.44% |   SAME   |
|   I64   |    2^24    | 194.946 us |       0.31% | 195.016 us |       0.32% |     0.069 us |   0.04% |   SAME   |
|   I64   |    2^28    |   3.014 ms |       0.04% |   3.015 ms |       0.04% |     1.409 us |   0.05% |   SLOW   |
|  I128   |    2^16    |  13.761 us |       7.28% |  13.567 us |       3.25% |    -0.194 us |  -1.41% |   SAME   |
|  I128   |    2^20    | 125.779 us |       2.82% | 123.812 us |       2.43% |    -1.966 us |  -1.56% |   SAME   |
|  I128   |    2^24    |   1.849 ms |       0.61% |   1.849 ms |       0.53% |    -0.425 us |  -0.02% |   SAME   |
|  I128   |    2^28    |  29.237 ms |       0.21% |  25.548 ms |       0.74% | -3689.171 us | -12.62% |   FAST   |
|   F32   |    2^16    |   7.274 us |       4.55% |   7.241 us |       5.15% |    -0.033 us |  -0.45% |   SAME   |
|   F32   |    2^20    |  12.699 us |       3.90% |  12.752 us |       7.50% |     0.053 us |   0.41% |   SAME   |
|   F32   |    2^24    | 106.859 us |       0.47% | 103.027 us |       2.73% |    -3.831 us |  -3.59% |   FAST   |
|   F32   |    2^28    |   1.510 ms |       0.04% |   1.510 ms |       0.04% |    -0.005 us |  -0.00% |   SAME   |
|   F64   |    2^16    |   8.023 us |       5.59% |   7.946 us |       6.03% |    -0.077 us |  -0.96% |   SAME   |
|   F64   |    2^20    |  19.398 us |       3.00% |  19.401 us |       2.73% |     0.003 us |   0.01% |   SAME   |
|   F64   |    2^24    | 196.175 us |       0.41% | 196.120 us |       0.36% |    -0.056 us |  -0.03% |   SAME   |
|   F64   |    2^28    |   3.035 ms |       0.04% |   3.037 ms |       0.05% |     1.782 us |   0.06% |   SLOW   |

# stepped

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   I8    |    2^16    |   6.694 us |       7.37% |   6.676 us |      13.31% |    -0.018 us |  -0.27% |   SAME   |
|   I8    |    2^20    |   7.235 us |      10.50% |   7.193 us |      10.22% |    -0.042 us |  -0.59% |   SAME   |
|   I8    |    2^24    |  30.269 us |       2.65% |  30.438 us |       2.31% |     0.170 us |   0.56% |   SAME   |
|   I8    |    2^28    | 382.063 us |       0.12% | 382.290 us |       0.15% |     0.227 us |   0.06% |   SAME   |
|   I16   |    2^16    |   6.601 us |       8.56% |   6.641 us |       8.26% |     0.040 us |   0.61% |   SAME   |
|   I16   |    2^20    |   9.389 us |       4.78% |   9.448 us |       5.15% |     0.060 us |   0.63% |   SAME   |
|   I16   |    2^24    |  53.761 us |       1.43% |  53.692 us |       1.03% |    -0.069 us |  -0.13% |   SAME   |
|   I16   |    2^28    | 757.933 us |       0.07% | 758.168 us |       0.09% |     0.235 us |   0.03% |   SAME   |
|   I32   |    2^16    |   6.620 us |       7.54% |   6.758 us |       8.20% |     0.138 us |   2.08% |   SAME   |
|   I32   |    2^20    |  12.761 us |       7.71% |  12.564 us |       6.24% |    -0.197 us |  -1.54% |   SAME   |
|   I32   |    2^24    | 100.626 us |       0.44% | 100.644 us |       0.47% |     0.017 us |   0.02% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.04% |   1.509 ms |       0.05% |     0.037 us |   0.00% |   SAME   |
|   I64   |    2^16    |   7.490 us |       7.48% |   7.423 us |      11.27% |    -0.068 us |  -0.90% |   SAME   |
|   I64   |    2^20    |  18.892 us |       4.93% |  18.947 us |       4.73% |     0.054 us |   0.29% |   SAME   |
|   I64   |    2^24    | 194.983 us |       0.29% | 195.050 us |       0.35% |     0.068 us |   0.03% |   SAME   |
|   I64   |    2^28    |   3.013 ms |       0.03% |   3.015 ms |       0.04% |     1.737 us |   0.06% |   SLOW   |
|  I128   |    2^16    |  13.867 us |       4.27% |  13.578 us |       3.33% |    -0.289 us |  -2.08% |   SAME   |
|  I128   |    2^20    | 126.155 us |       2.24% | 123.868 us |       3.07% |    -2.288 us |  -1.81% |   SAME   |
|  I128   |    2^24    |   1.841 ms |       0.71% |   1.840 ms |       0.54% |    -1.222 us |  -0.07% |   SAME   |
|  I128   |    2^28    |  29.121 ms |       0.11% |  25.510 ms |       0.68% | -3610.468 us | -12.40% |   FAST   |
|   F32   |    2^16    |   7.379 us |       5.74% |   7.269 us |       4.72% |    -0.110 us |  -1.49% |   SAME   |
|   F32   |    2^20    |  12.728 us |       4.23% |  12.721 us |       4.98% |    -0.007 us |  -0.06% |   SAME   |
|   F32   |    2^24    | 103.599 us |       2.81% | 103.653 us |       2.84% |     0.053 us |   0.05% |   SAME   |
|   F32   |    2^28    |   1.510 ms |       0.05% |   1.510 ms |       0.05% |    -0.062 us |  -0.00% |   SAME   |
|   F64   |    2^16    |   8.150 us |       6.81% |   8.082 us |      10.29% |    -0.068 us |  -0.84% |   SAME   |
|   F64   |    2^20    |  19.382 us |       3.03% |  19.420 us |       4.03% |     0.038 us |   0.20% |   SAME   |
|   F64   |    2^24    | 196.095 us |       0.41% | 196.382 us |       0.57% |     0.287 us |   0.15% |   SAME   |
|   F64   |    2^28    |   3.035 ms |       0.04% |   3.036 ms |       0.05% |     1.671 us |   0.06% |   SLOW   |

miscco · 2026-03-09T10:18:54Z

I have absolutely no idea why __iota_init_step_fn is faster for larger integer types than __iota_init_fn it uses more registers and more computation 🤷

This implements the `iota` algorithm for the cuda backend. * `std::iota` see https://en.cppreference.com/w/cpp/algorithm/iota.html It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++ The functionality is publicly available yet and implemented in a private internal header Fixes NVIDIA#7927

github-actions · 2026-03-09T14:00:19Z

🥳 CI Workflow Results

🟩 Finished in 1h 19m: Pass: 100%/99 | Total: 1d 01h | Max: 53m 50s | Hits: 98%/255735

See results here.

jrhemstad · 2026-03-10T17:30:54Z

libcudacxx/include/cuda/std/__pstl/iota.h

+#  if _LIBCUDACXX_HAS_NVFP16()
+    // We cannot rely on operator+ and constructors from integers to be available for the extended fp types
+    if constexpr (is_same_v<_Tp, __half>)
+    {
+      return ::__hadd(__init_, ::__ull2half_rn(__index));
+    }
+    else
+#  endif // _LIBCUDACXX_HAS_NVFP16()
+#  if _LIBCUDACXX_HAS_NVBF16()
+      if constexpr (is_same_v<_Tp, __nv_bfloat16>)
+    {
+      return ::__hadd(__init_, ::__ull2bfloat16_rn(__index));
+    }
+    else
+#  endif // _LIBCUDACXX_HAS_NVBF16()
+      if constexpr (is_arithmetic_v<_Tp>)
+      { // avoid warnings about integer conversions
+        return static_cast<_Tp>(__init_ + static_cast<_Tp>(__index));
+      }
+      else if constexpr (__can_operator_plus_integral<_Tp>)
+      {
+        return __init_ + __index;
+      }
+      else if constexpr (__can_operator_plus_conversion<_Tp>)
+      {
+        return __init_ + static_cast<_Tp>(__index);


question: Shouldn't cuda::std::plus already handle all these details? Can we just use that here?

Unfortunately, it does not but does the plain return __x + __y;

That means we can get integer promotion / sign conversion warnings

jrhemstad · 2026-03-10T17:47:09Z

libcudacxx/include/cuda/std/__pstl/iota.h

+     _InputIterator __first,
+     _InputIterator __last,
+     const _Tp& __init,
+     const _Tp& __step)


comment: Adding a parallel iota is one thing, but adding an iota with a step feels too extreme to still be called cuda::std::iota.

I'm leaning more towards a cuda::sequence algorithm here, otherwise we're starting to twist our promise that everything in cuda/std is conforming.

miscco requested review from a team as code owners March 9, 2026 10:17

miscco requested review from oleksandr-pavlyk and pciolkosz March 9, 2026 10:17

github-project-automation bot added this to CCCL Mar 9, 2026

github-project-automation bot moved this to Todo in CCCL Mar 9, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 9, 2026

miscco force-pushed the parallel_iota branch from e05f0f0 to 3a08273 Compare March 9, 2026 10:21

This comment has been minimized.

Sign in to view

miscco force-pushed the parallel_iota branch from 3a08273 to caf77a1 Compare March 9, 2026 12:38

miscco force-pushed the parallel_iota branch from caf77a1 to 57cae7b Compare March 9, 2026 12:39

davebayer approved these changes Mar 9, 2026

View reviewed changes

oleksandr-pavlyk approved these changes Mar 9, 2026

View reviewed changes

jrhemstad reviewed Mar 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement parallel `cuda::std::iota`#7930

Implement parallel `cuda::std::iota`#7930
miscco wants to merge 1 commit intoNVIDIA:mainfrom
miscco:parallel_iota

miscco commented Mar 9, 2026

Uh oh!

miscco commented Mar 9, 2026

Uh oh!

miscco commented Mar 9, 2026

Uh oh!

This comment has been minimized.

github-actions bot commented Mar 9, 2026

Uh oh!

jrhemstad Mar 10, 2026

Uh oh!

miscco Mar 11, 2026

Uh oh!

jrhemstad Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

miscco commented Mar 9, 2026

Uh oh!

miscco commented Mar 9, 2026

Uh oh!

miscco commented Mar 9, 2026

Uh oh!

This comment has been minimized.

github-actions bot commented Mar 9, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 19m: Pass: 100%/99 | Total: 1d 01h | Max: 53m 50s | Hits: 98%/255735

Uh oh!

jrhemstad Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

miscco Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

jrhemstad Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants