@srinivasyadav18 srinivasyadav18 commented Dec 11, 2025

Description

closes #6898

Checklist

Status

The current version of the PR shows good speedups for I32/F32 (reaching up to 70%) with Sum, but only modest improvements (up to 10% SOL, from < 1% SOL) with more complex operators like ArgMax or with larger input types (> 4 bytes).

Some initial benchmarks:

Sum, T{ct}=F32: [heatmap image: opt_sum_F32_I32_speedup_heatmap]
ArgMax, T{ct}=F64: [heatmap image: opt_argmax_F64_I32_speedup_heatmap]
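
For readers unfamiliar with the operation being benchmarked, here is a minimal host-side sketch of what a segmented Sum and a segmented ArgMax compute over variable-size segments. It is a plain C++ reference, not the PR's kernel and not the library's API: the function names, the `int32_t` offsets, the `offsets[i]..offsets[i + 1]` segment layout, and the handling of empty segments are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical reference helpers (names are not taken from the PR).
// Assumed layout: segment s covers input[offsets[s]] .. input[offsets[s + 1] - 1],
// so offsets has (number of segments + 1) entries and segments may differ in size.

// Sum of each segment; an empty segment yields 0.
std::vector<float> segmented_sum(const std::vector<float>& input,
                                 const std::vector<std::int32_t>& offsets)
{
  std::vector<float> out;
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg)
  {
    float acc = 0.0f;
    for (std::int32_t i = offsets[seg]; i < offsets[seg + 1]; ++i)
    {
      acc += input[static_cast<std::size_t>(i)];
    }
    out.push_back(acc);
  }
  return out;
}

// ArgMax of each segment as (index relative to segment start, value).
// The accumulator is an index-value pair rather than a single value.
// Assumption of this sketch: an empty segment yields index -1, and ties
// keep the first occurrence.
std::vector<std::pair<std::int32_t, double>> segmented_argmax(
  const std::vector<double>& input, const std::vector<std::int32_t>& offsets)
{
  std::vector<std::pair<std::int32_t, double>> out;
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg)
  {
    std::pair<std::int32_t, double> best{-1, 0.0};
    for (std::int32_t i = offsets[seg]; i < offsets[seg + 1]; ++i)
    {
      const double value = input[static_cast<std::size_t>(i)];
      if (best.first < 0 || value > best.second)
      {
        best = {i - offsets[seg], value};
      }
    }
    out.push_back(best);
  }
  return out;
}
```

Note that the ArgMax accumulator carries both an index and a value, so each partial result is wider than the input element, whereas Sum carries a single value of the input type.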

@srinivasyadav18 srinivasyadav18 requested review from a team as code owners December 11, 2025 01:45
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 11, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 11, 2025
@github-actions

😬 CI Workflow Results

🟥 Finished in 3h 53m: Pass: 80%/136 | Total: 5d 12h | Max: 3h 25m | Hits: 84%/212452

See results here.

Development

Successfully merging this pull request may close these issues.

Optimize device_segment_reduce for small and medium variable segment sizes
