
feat: implement configurable tie_break for filtered topk #3095

Merged
kahyunnam merged 10 commits into flashinfer-ai:main from zianglih:topk-tie on Apr 22, 2026

Conversation

@zianglih (Contributor) commented Apr 17, 2026

📌 Description


Implement a configurable tie_break option for filtered top-k.

Selecting a tie-break mode automatically makes the selection deterministic.
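To make the determinism claim concrete, here is a minimal pure-Python reference model (not the FlashInfer implementation; topk_with_tie_break and the "small"/"large" strings are illustrative names): a tie-break rule extends the value ordering to a total order over indices, so the selected index set is unique.

```python
# Reference model: a tie-break rule turns top-k selection into a total
# order, so the chosen indices are deterministic even when values tie.

def topk_with_tie_break(values, k, prefer="small"):
    """Return indices of the k largest values; ties go to the smaller
    ("small") or larger ("large") index."""
    if prefer == "small":
        key = lambda i: (-values[i], i)    # higher value first, then lower index
    else:
        key = lambda i: (-values[i], -i)   # higher value first, then higher index
    return sorted(range(len(values)), key=key)[:k]

vals = [3.0, 1.0, 3.0, 2.0, 3.0]           # three-way tie at value 3.0
print(topk_with_tie_break(vals, 2, "small"))  # [0, 2]
print(topk_with_tie_break(vals, 2, "large"))  # [4, 2]
```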

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

python3 -m pytest tests/utils/test_topk.py -v passed on B200.

Summary by CodeRabbit

  • New Features

  • New Features

    • Added TopKTieBreak modes (NONE, SMALL, LARGE) and a tie_break option across top-k APIs; non-NONE modes force deterministic behavior and TopKTieBreak is re-exported at package level.
  • CLI

    • New --tie-break flag (auto-enables deterministic, disallows compare-algorithms) and adds tie-break labels/columns to output tables.
  • Benchmarks

    • Benchmarks report deterministic tie-break variants and per-variant slowdowns vs baseline.
  • Tests

    • Added tests covering tie-break modes for top-k and transform APIs.

@coderabbitai (Bot) commented Apr 17, 2026

📝 Walkthrough


Adds a TopKTieBreak enum and threads a tie-break parameter through Python APIs, benchmarks/CLI, FFI bindings, C++ dispatch, and CUDA kernels; selecting a tie-break forces deterministic execution. Tests and benchmarks were extended to exercise SMALL/LARGE tie-break modes and reporting.

Changes

  • Python API & package — flashinfer/topk.py, flashinfer/__init__.py: Add TopKTieBreak IntEnum; add a tie_break param to public APIs and re-export the enum; force deterministic=True when tie_break != NONE; thread tie_break into custom-op calls.
  • Benchmarks & CLI — benchmarks/bench_topk.py: Add tie-break benchmarking helpers and a --tie-break CLI flag (forces deterministic), propagate compare_tie_break through benchmark entry points, and extend CLI table headers/rows with tie-break columns and slowdown/error labels.
  • FFI bindings — csrc/flashinfer_topk_binding.cu: Extend the exported radix_topk, radix_topk_page_table_transform, and radix_topk_ragged_transform signatures to accept a trailing int64_t tie_break.
  • C++ dispatch / bindings — csrc/topk.cu: Parse the integer tie_break into the enum, force deterministic when tie_break != None, and pass tie_break into the sampling dispatch calls.
  • CUDA kernels / headers — include/flashinfer/topk.cuh: Add the TopKTieBreak enum; introduce a deterministic contiguous collection routine; add a tie_break template/runtime parameter across the FilteredTopK pipeline; force the filtered path when tie_break != None and enforce its constraints.
  • Tests — tests/utils/test_topk.py: Parametrize existing top-k tests over tie-break modes (NONE, SMALL, LARGE); add tests validating tie-break behavior for the filtered/top-k transform APIs; update smoke-test invocations.
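A rough Python sketch of the enum and the "tie-break implies deterministic" rule described above (resolve_flags is a hypothetical helper for illustration; the real logic lives in flashinfer/topk.py):

```python
from enum import IntEnum

class TopKTieBreak(IntEnum):
    # Values 0/1/2 match the int64 passed across the FFI boundary.
    NONE = 0
    SMALL = 1
    LARGE = 2

    def __str__(self):
        # Lowercase name for human-readable output, per the PR's description.
        return self.name.lower()

def resolve_flags(deterministic: bool, tie_break: TopKTieBreak):
    """Hypothetical helper: any non-NONE tie-break forces determinism."""
    if tie_break != TopKTieBreak.NONE:
        deterministic = True
    return deterministic, int(tie_break)

print(resolve_flags(False, TopKTieBreak.SMALL))  # (True, 1)
```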

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Benchmarks
    participant PythonAPI
    participant FFI
    participant CppDispatch
    participant CUDA
    rect rgba(200,200,255,0.5)
    User->>CLI: run benchmark with --tie-break
    CLI->>Benchmarks: enable compare_tie_break / force deterministic
    end
    Benchmarks->>PythonAPI: call top_k / transforms (tie_break)
    PythonAPI->>FFI: call radix_topk*(..., deterministic, tie_break)
    FFI->>CppDispatch: invoke radix_topk entry (tie_break)
    CppDispatch->>CUDA: select kernel/collection path (FilteredTopK + tie_break)
    CUDA-->>CppDispatch: return results
    CppDispatch-->>FFI: return outputs
    FFI-->>PythonAPI: return tensors/timings
    PythonAPI-->>Benchmarks: return metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested labels

benchmark

Suggested reviewers

  • yzh119
  • aleozlx
  • sricketts
  • cyx-6
  • samuellees
  • saltyminty
  • jimmyzho
  • bkryu

Poem

"I hop through kernels, tidy and bright,
Small or Large — I settle tie-framed light.
Deterministic steps make indices clear,
Benchmarks applaud with an eager cheer.
🐇"

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4)

  • Title check — the title clearly and specifically describes the main change: implementing a configurable tie_break feature for filtered top-k operations.
  • Description check — the description covers the key change and includes completed pre-commit checks and tests, though it lacks detail on related issues, implementation rationale, and a fuller explanation of the tie_break feature.
  • Linked Issues check — skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — skipped because no linked issues were found for this pull request.



@zianglih (Contributor, Author)

Hi @aleozlx , could you help review? Thank you!

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces tie-breaking support for Top-K operations in FlashInfer, enabling deterministic selection of smaller or larger indices during value ties. The changes include a new TopKTieBreak enum, a DeterministicContiguousCollect CUDA helper, and updates to the Python API, C++ bindings, and benchmarking tools. Reviewer feedback highlights opportunities to reduce code duplication in the benchmark script and dispatch logic, and suggests restructuring macros for improved readability.

Resolved comment threads (outdated): benchmarks/bench_topk.py; include/flashinfer/topk.cuh (×2)
@coderabbitai (Bot) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
benchmarks/bench_topk.py (2)

685-699: Minor: "tie" column shows integer code instead of mode name.

tie_break is a TopKTieBreak (IntEnum). With the current :>7 format spec, Python formats it via the mixed-in int.__format__, so table rows render 0/1/2, even though the banner lines (','.join(map(str, tie_break_modes))) render none,small,large via the overridden __str__. This inconsistency makes it harder to eyeball results in a multi-mode sweep. Consider formatting the column as str(tie_break) (or tie_break.name.lower()) for readability. Applies to all four print loops (top_k, dsa_topk, page_table, ragged).

♻️ Example tweak
-            f"{result['batch_size']:>6} {result['seq_len']:>10} {result['k']:>6} {tie_break:>7} | "
+            f"{result['batch_size']:>6} {result['seq_len']:>10} {result['k']:>6} {str(tie_break):>7} | "

Also applies to: 731-741
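The reviewer's suggested fix can be demonstrated with a standalone snippet (TieBreak mirrors the PR's enum but is defined here for illustration): wrapping the member in str() before applying the width spec forces the overridden __str__, so the column always shows the name rather than the integer code.

```python
from enum import IntEnum

class TieBreak(IntEnum):
    NONE = 0
    SMALL = 1
    LARGE = 2

    def __str__(self):
        # Human-readable lowercase name, as in the PR's enum.
        return self.name.lower()

tb = TieBreak.SMALL
# Explicit str() conversion happens before the alignment spec is applied,
# so the row renders "small" regardless of how the enum's __format__ resolves:
row = f"{str(tb):>7}"
print(repr(row))  # '  small'
```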


518-529: Argparse choices and CLI input format are inconsistent and may confuse users.

TopKTieBreak explicitly overrides __str__() to return the lowercase name ("none"/"small"/"large"), and __format__() delegates to it. When argparse displays choices=list(TopKTieBreak) in help output, it invokes str() on each choice, rendering them as {none,small,large} — but parse_tie_break_mode only accepts integer strings 0/1/2 (anything else raises ArgumentTypeError). Users following the help text verbatim (--tie-break-modes none) will hit an error.

Two options:

  • Drop choices=list(TopKTieBreak) (validation already lives in parse_tie_break_mode, whose error message is explicit), or
  • Accept both integer and name inputs in parse_tie_break_mode so the help text and accepted inputs agree.
♻️ Proposed fix (accept names too, drop redundant choices)
 def parse_tie_break_mode(mode_str: str) -> TopKTieBreak:
     """Parse CLI tie-break mode into TopKTieBreak enum."""
     try:
-        return TopKTieBreak(int(mode_str))
-    except ValueError as e:
+        return TopKTieBreak(int(mode_str))
+    except ValueError:
+        pass
+    try:
+        return TopKTieBreak[mode_str.upper()]
+    except KeyError as e:
         raise argparse.ArgumentTypeError(
-            "tie_break mode must be one of: 0 (none), 1 (small), 2 (large)"
+            "tie_break mode must be one of: 0/none, 1/small, 2/large"
         ) from e
@@
     parser.add_argument(
         "--tie-break-modes",
         type=parse_tie_break_mode,
         nargs="+",
-        choices=list(TopKTieBreak),
         default=[TopKTieBreak.NONE],
         help=(
             "Tie-break modes to benchmark. "
-            "Use one or more of: 0 (none), 1 (prefer small indices), 2 (prefer large indices). "
+            "Use one or more of: 0/none, 1/small (prefer small indices), 2/large (prefer large indices). "
             "Default: 0"
         ),
     )
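As a standalone sketch of the proposed parser (TieBreak stands in for TopKTieBreak; the function body follows the suggested diff), accepting both integer codes and case-insensitive names:

```python
import argparse
from enum import IntEnum

class TieBreak(IntEnum):   # stand-in for TopKTieBreak
    NONE = 0
    SMALL = 1
    LARGE = 2

def parse_tie_break_mode(mode_str: str) -> TieBreak:
    """Accept the integer code ("0"/"1"/"2") or the name
    ("none"/"small"/"large", case-insensitive)."""
    try:
        return TieBreak(int(mode_str))
    except ValueError:
        pass
    try:
        return TieBreak[mode_str.upper()]
    except KeyError as e:
        raise argparse.ArgumentTypeError(
            "tie_break mode must be one of: 0/none, 1/small, 2/large"
        ) from e

print(parse_tie_break_mode("1"), parse_tie_break_mode("large"))
```

With validation fully inside the type callable, the redundant choices= list can be dropped so help text and accepted inputs agree.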
Inline comments:
In `@include/flashinfer/topk.cuh`:
- Around line 3333-3342: The code forces any non-None TopKTieBreak onto the
filtered-topk path and returns cudaErrorNotSupported when FILTERED_TOPK_MAX_K is
exceeded or CanImplementFilteredTopK() is false, causing valid existing
top_k(...) calls to fail; update the logic around tie_break,
deterministic_effective, and use_filtered so we do not unconditionally force
filtered mode: (1) if the radix path can support the requested tie_break, allow
the radix path by not setting use_filtered=true; (2) if radix cannot support
tie_break and filtered topk is unavailable or k>FILTERED_TOPK_MAX_K, return a
clear API error (not a silent cudaErrorNotSupported) indicating tie_break is
unsupported for this shape/GPU and suggesting removing tie_break or using a
supported device/shape; and (3) centralize this check where use_filtered is
computed (the block that calls ShouldUseFilteredTopK<DType>, references
TopKTieBreak, CanImplementFilteredTopK(), FILTERED_TOPK_MAX_K, and
deterministic_effective) so the decision honors radix capability and emits a
descriptive error message instead of breaking existing callers.
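The suggested decision can be modeled in Python (all names here — choose_path, the capability flags, and the FILTERED_TOPK_MAX_K value — are illustrative, not the actual C++ code):

```python
FILTERED_TOPK_MAX_K = 4096   # hypothetical limit, mirroring the constant name

def choose_path(tie_break_is_none, k, radix_supports_tie_break, filtered_available):
    """Model of the reviewer's suggested dispatch: prefer the radix path
    when it can honor the request; fail loudly only when no path can."""
    if tie_break_is_none:
        return "radix"
    if radix_supports_tie_break:
        return "radix"                      # (1) don't force filtered mode
    if filtered_available and k <= FILTERED_TOPK_MAX_K:
        return "filtered"                   # (3) capability-aware selection
    raise ValueError(                       # (2) descriptive API error instead
        "tie_break unsupported for this shape/GPU; "  # of a silent
        "drop tie_break or use a supported device/shape"  # cudaErrorNotSupported
    )

print(choose_path(False, 2048, False, True))  # filtered
```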

In `@tests/utils/test_topk.py`:
- Around line 106-114: The direct-run "__main__" smoke invocation calls
test_top_k with an outdated signature and raises TypeError because test_top_k
now requires the tie_break parameter; update the "__main__" block so it calls
test_top_k with a valid tie_break (for example iterate over
flashinfer.TopKTieBreak options or pass a default like
flashinfer.TopKTieBreak.NONE) and mirror this change for the other direct-run
calls that invoke test_top_k (the same pattern at the other two locations),
ensuring the call sites match the new test_top_k(batch_size, vocab_size, k,
dtype, tie_break) signature.

ℹ️ Review info

Configuration: defaults | Review profile: CHILL | Plan: Pro
Run ID: 6af13619-8a53-408e-bda4-818cceb77d46

📥 Commits

Reviewing files that changed from the base of the PR and between a99ee72 and 1714c33.

📒 Files selected for processing (7)
  • benchmarks/bench_topk.py
  • csrc/flashinfer_topk_binding.cu
  • csrc/topk.cu
  • flashinfer/__init__.py
  • flashinfer/topk.py
  • include/flashinfer/topk.cuh
  • tests/utils/test_topk.py

Resolved comment threads: include/flashinfer/topk.cuh (outdated); tests/utils/test_topk.py
@zianglih (Contributor, Author) commented Apr 17, 2026

python3 benchmarks/bench_topk.py --op dsa_topk --dtype fp32 --dsa-case all --dsa-input-pattern dsa_relu --dsa-topk 2048 --deterministic --tie-break-modes 0 1 2 --compare-sglang
====================================================================================================
dsa_topk: DeepSeek DSA-like indexer top-k workload (dtype=FP32, deterministic=True, dsa_pattern=dsa_relu, k=2048, tie_break_modes=none,small,large)
NOTE: deterministic mode also benchmarks FlashInfer(non-det) for direct comparison when tie_break=0
NOTE: torch column uses torch.topk with torch.use_deterministic_algorithms(True)
====================================================================================================
                    case     rows    seq_len      k     tie |   FlashInfer FlashInfer(det) DetSlowdown    torch.det    Speedup
------------------------------------------------------------------------------------------------------------------------------
/root/flashinfer-src/flashinfer/testing/utils.py:1620: UserWarning: CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events for benchmarking.
  return bench_gpu_time_with_cupti(
      decode_b1_q1_l128k        1     131072   2048    none |      48.21us        46.05us       0.96x     101.55us      2.21x
       decode_b8_q1_l64k        8      65536   2048    none |      41.92us        45.06us       1.07x     104.50us      2.32x
     decode_b32_q1_l128k       32     131072   2048    none |      61.44us       104.45us       1.70x     143.36us      1.37x
   prefill_b1_q128_l128k      128     131072   2048    none |      68.45us       111.52us       1.63x     252.96us      2.27x
      decode_b1_q1_l128k        1     131072   2048   small |          n/a        89.02us         n/a     101.41us      1.14x
       decode_b8_q1_l64k        8      65536   2048   small |          n/a        68.64us         n/a     104.48us      1.52x
     decode_b32_q1_l128k       32     131072   2048   small |          n/a       106.56us         n/a     143.23us      1.34x
   prefill_b1_q128_l128k      128     131072   2048   small |          n/a       112.64us         n/a     252.96us      2.25x
      decode_b1_q1_l128k        1     131072   2048   large |          n/a        97.28us         n/a     101.39us      1.04x
       decode_b8_q1_l64k        8      65536   2048   large |          n/a        70.66us         n/a     104.48us      1.48x
     decode_b32_q1_l128k       32     131072   2048   large |          n/a       109.54us         n/a     143.36us      1.31x
   prefill_b1_q128_l128k      128     131072   2048   large |          n/a       115.68us         n/a     252.96us      2.19x
python3 benchmarks/bench_topk.py --op page_table --dtype fp32 --dsa-case all --dsa-input-pattern dsa_relu --dsa-topk 2048 --deterministic --tie-break-modes 0 1 2 --compare-sglang

====================================================================================================
top_k_page_table_transform: Fused top-k + page table gather (dtype=FP32, deterministic=True, pattern=random, tie_break_modes=none,small,large)
NOTE: SGLang only supports k=2048 and float32
NOTE: deterministic mode also benchmarks FlashInfer(non-det) for direct comparison when tie_break=0
====================================================================================================
 batch    seq_len      k     tie |   FlashInfer FlashInfer(det) DetSlowdown       SGLang    Speedup
---------------------------------------------------------------------------------------------------
/root/flashinfer-src/flashinfer/testing/utils.py:1620: UserWarning: CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events for benchmarking.
  return bench_gpu_time_with_cupti(
     1        256    256    none |       6.18us         6.21us       1.01x
     1        512    256    none |      10.24us        15.28us       1.49x
     1        512    512    none |       6.21us         6.21us       1.00x
     1       1024    256    none |      12.29us        19.39us       1.58x
     1       1024    512    none |      10.27us        15.26us       1.49x
     1       1024   1024    none |       6.22us         8.06us       1.30x
     1       2048    256    none |      12.29us        19.46us       1.58x
     1       2048    512    none |      14.34us        21.47us       1.50x
     1       2048   1024    none |      10.24us        15.33us       1.50x
     1       2048   2048    none |       8.21us         8.19us       1.00x       8.16us      1.00x
     1       4096    256    none |      14.37us        23.52us       1.64x
     1       4096    512    none |      14.34us        23.52us       1.64x
     1       4096   1024    none |      14.34us        21.47us       1.50x
     1       4096   2048    none |      10.24us        17.41us       1.70x      10.24us      0.59x
     1       4096   4096    none |       8.19us         8.22us       1.00x
     1      16384    256    none |      18.43us        26.66us       1.45x
     1      16384    512    none |      18.43us        27.60us       1.50x
     1      16384   1024    none |      20.48us        27.62us       1.35x
     1      16384   2048    none |      20.51us        31.58us       1.54x      18.46us      0.58x
     1      16384   4096    none |      24.45us        24.61us       1.01x
     1      65536    256    none |      36.86us        36.98us       1.00x
     1      65536    512    none |      36.90us        38.88us       1.05x
     1      65536   1024    none |      36.99us        38.91us       1.05x
     1      65536   2048    none |      38.91us        38.94us       1.00x      40.99us      1.05x
     1      65536   4096    none |      40.96us        40.96us       1.00x
     1     131072    256    none |      40.99us        43.01us       1.05x
     1     131072    512    none |      40.99us        43.01us       1.05x
     1     131072   1024    none |      43.01us        43.01us       1.00x
     1     131072   2048    none |      43.04us        45.06us       1.05x      67.55us      1.50x
     1     131072   4096    none |      47.14us        46.96us       1.00x
     1     262144    256    none |      47.10us        47.14us       1.00x
     1     262144    512    none |      47.10us        47.26us       1.00x
     1     262144   1024    none |      47.10us        49.15us       1.04x
     1     262144   2048    none |      47.23us        49.15us       1.04x     118.82us      2.42x
     1     262144   4096    none |      51.20us        51.20us       1.00x
     1     524288    256    none |      47.14us        49.15us       1.04x
     1     524288    512    none |      47.20us        49.15us       1.04x
     1     524288   1024    none |      47.17us        49.15us       1.04x
     1     524288   2048    none |      49.15us        49.20us       1.00x     196.61us      4.00x
     1     524288   4096    none |      49.15us        51.20us       1.04x
    16        256    256    none |       6.18us         6.21us       1.01x
    16        512    256    none |      12.32us        17.41us       1.41x
    16        512    512    none |       6.18us         6.19us       1.00x
    16       1024    256    none |      14.34us        19.52us       1.36x
    16       1024    512    none |      12.29us        17.39us       1.42x
    16       1024   1024    none |       6.27us         8.10us       1.29x
    16       2048    256    none |      14.34us        22.56us       1.57x
    16       2048    512    none |      14.34us        21.47us       1.50x
    16       2048   1024    none |      12.29us        17.41us       1.42x
    16       2048   2048    none |       8.19us         8.19us       1.00x       8.19us      1.00x
    16       4096    256    none |      14.34us        23.52us       1.64x
    16       4096    512    none |      14.37us        23.49us       1.63x
    16       4096   1024    none |      14.34us        22.53us       1.57x
    16       4096   2048    none |      14.34us        20.51us       1.43x      12.29us      0.60x
    16       4096   4096    none |       8.19us         8.19us       1.00x
    16      16384    256    none |      18.43us        25.63us       1.39x
    16      16384    512    none |      18.43us        27.62us       1.50x
    16      16384   1024    none |      20.48us        28.67us       1.40x
    16      16384   2048    none |      20.48us        31.68us       1.55x      18.43us      0.58x
    16      16384   4096    none |      24.58us        24.58us       1.00x
    16      65536    256    none |      32.77us        39.06us       1.19x
    16      65536    512    none |      36.90us        46.05us       1.25x
    16      65536   1024    none |      36.90us        46.02us       1.25x
    16      65536   2048    none |      40.96us        48.13us       1.18x      43.01us      0.89x
    16      65536   4096    none |      44.93us        43.01us       0.96x
    16     131072    256    none |      49.18us        45.09us       0.92x
    16     131072    512    none |      51.20us        46.98us       0.92x
    16     131072   1024    none |      59.39us        47.10us       0.79x
    16     131072   2048    none |      61.44us        47.10us       0.77x      67.62us      1.44x
    16     131072   4096    none |      49.18us        49.15us       1.00x
    16     262144    256    none |      49.18us        51.20us       1.04x
    16     262144    512    none |      49.34us        51.20us       1.04x
    16     262144   1024    none |      51.20us        51.20us       1.00x
    16     262144   2048    none |      51.23us        53.25us       1.04x     118.82us      2.23x
    16     262144   4096    none |      53.28us        53.28us       1.00x
    16     524288    256    none |      90.69us        94.24us       1.04x
    16     524288    512    none |      92.10us        94.24us       1.02x
    16     524288   1024    none |      92.16us        94.24us       1.02x
    16     524288   2048    none |      94.21us        96.26us       1.02x     198.66us      2.06x
    16     524288   4096    none |      96.26us        98.30us       1.02x
    64        256    256    none |       6.18us         6.37us       1.03x
    64        512    256    none |      14.18us        17.41us       1.23x
    64        512    512    none |       6.24us         8.03us       1.29x
    64       1024    256    none |      14.34us        20.45us       1.43x
    64       1024    512    none |      14.27us        17.41us       1.22x
    64       1024   1024    none |       6.35us         8.16us       1.28x
    64       2048    256    none |      14.34us        23.55us       1.64x
    64       2048    512    none |      14.34us        21.47us       1.50x
    64       2048   1024    none |      14.34us        18.46us       1.29x
    64       2048   2048    none |       8.19us         8.19us       1.00x       8.19us      1.00x
    64       4096    256    none |      14.34us        23.52us       1.64x
    64       4096    512    none |      16.38us        23.55us       1.44x
    64       4096   1024    none |      14.34us        23.55us       1.64x
    64       4096   2048    none |      14.34us        21.47us       1.50x      12.29us      0.57x
    64       4096   4096    none |       8.19us         8.19us       1.00x
    64      16384    256    none |      20.48us        27.65us       1.35x
    64      16384    512    none |      20.48us        28.67us       1.40x
    64      16384   1024    none |      22.42us        31.71us       1.41x
    64      16384   2048    none |      22.53us        31.74us       1.41x      20.48us      0.65x
    64      16384   4096    none |      24.58us        24.58us       1.00x
    64      65536    256    none |      32.77us        41.95us       1.28x
    64      65536    512    none |      38.91us        46.11us       1.19x
    64      65536   1024    none |      38.91us        46.11us       1.19x
    64      65536   2048    none |      43.01us        52.03us       1.21x      43.04us      0.83x
    64      65536   4096    none |      45.09us        45.06us       1.00x
    64     131072    256    none |      53.22us        58.30us       1.10x
    64     131072    512    none |      53.25us        58.37us       1.10x
    64     131072   1024    none |      63.46us        67.58us       1.07x
    64     131072   2048    none |      63.65us        69.71us       1.10x      69.63us      1.00x
    64     131072   4096    none |      94.24us        94.21us       1.00x
    64     262144    256    none |      89.92us        88.06us       0.98x
    64     262144    512    none |      96.22us        95.14us       0.99x
    64     262144   1024    none |      96.26us        94.34us       0.98x
    64     262144   2048    none |     112.67us       113.50us       1.01x     137.22us      1.21x
    64     262144   4096    none |     151.58us       151.58us       1.00x
    64     524288    256    none |     188.42us       176.16us       0.93x
    64     524288    512    none |     188.45us       176.13us       0.93x
    64     524288   1024    none |     200.74us       187.54us       0.93x
    64     524288   2048    none |     202.72us       190.50us       0.94x     341.98us      1.80x
    64     524288   4096    none |     235.58us       239.58us       1.02x
   256        256    256    none |       8.19us         8.19us       1.00x
   256        512    256    none |      18.46us        21.50us       1.16x
   256        512    512    none |       8.19us         8.19us       1.00x
   256       1024    256    none |      22.53us        26.64us       1.18x
   256       1024    512    none |      20.48us        25.54us       1.25x
   256       1024   1024    none |       8.19us         8.19us       1.00x
   256       2048    256    none |      22.53us        29.70us       1.32x
   256       2048    512    none |      24.58us        28.67us       1.17x
   256       2048   1024    none |      20.48us        25.60us       1.25x
   256       2048   2048    none |      10.24us        10.24us       1.00x       8.19us      0.80x
   256       4096    256    none |      24.58us        29.79us       1.21x
   256       4096    512    none |      24.61us        31.71us       1.29x
   256       4096   1024    none |      24.59us        31.71us       1.29x
   256       4096   2048    none |      22.53us        29.63us       1.32x      14.34us      0.48x
   256       4096   4096    none |      12.29us        12.29us       1.00x
   256      16384    256    none |      34.82us        39.97us       1.15x
   256      16384    512    none |      36.83us        41.98us       1.14x
   256      16384   1024    none |      38.91us        44.11us       1.13x
   256      16384   2048    none |      40.96us        48.13us       1.18x      28.67us      0.60x
   256      16384   4096    none |      43.01us        45.06us       1.05x
   256      65536    256    none |      65.54us        66.59us       1.02x
   256      65536    512    none |      75.78us        78.82us       1.04x
   256      65536   1024    none |      77.82us        80.90us       1.04x
   256      65536   2048    none |      85.86us        89.25us       1.04x      65.60us      0.74x
   256      65536   4096    none |     159.60us       155.65us       0.98x
   256     131072    256    none |     106.50us       105.41us       0.99x
   256     131072    512    none |     106.59us       105.41us       0.99x
   256     131072   1024    none |     125.78us       148.48us       1.18x
   256     131072   2048    none |     130.10us       132.06us       1.02x     127.07us      0.96x
   256     131072   4096    none |     267.31us       264.08us       0.99x
   256     262144    256    none |     202.93us       191.34us       0.94x
   256     262144    512    none |     217.09us       203.71us       0.94x
   256     262144   1024    none |     219.14us       205.82us       0.94x
   256     262144   2048    none |     256.00us       244.74us       0.96x     231.49us      0.95x
   256     262144   4096    none |     428.03us       427.04us       1.00x
   256     524288    256    none |     370.85us       340.94us       0.92x
   256     524288    512    none |     372.77us       340.99us       0.91x
   256     524288   1024    none |     399.36us       367.78us       0.92x
   256     524288   2048    none |     403.49us       375.74us       0.93x     401.44us      1.07x
   256     524288   4096    none |     847.87us       857.18us       1.01x
     1        256    256   small |          n/a         6.27us         n/a
     1        512    256   small |          n/a        17.41us         n/a
     1        512    512   small |          n/a         6.21us         n/a
     1       1024    256   small |          n/a        20.66us         n/a
     1       1024    512   small |          n/a        15.23us         n/a
     1       1024   1024   small |          n/a         8.10us         n/a
     1       2048    256   small |          n/a        21.47us         n/a
     1       2048    512   small |          n/a        21.46us         n/a
     1       2048   1024   small |          n/a        17.41us         n/a
     1       2048   2048   small |          n/a         8.19us         n/a       8.19us      1.00x
     1       4096    256   small |          n/a        23.52us         n/a
     1       4096    512   small |          n/a        23.54us         n/a
     1       4096   1024   small |          n/a        21.47us         n/a
     1       4096   2048   small |          n/a        17.38us         n/a      10.24us      0.59x
     1       4096   4096   small |          n/a         8.19us         n/a
     1      16384    256   small |          n/a        25.68us         n/a
     1      16384    512   small |          n/a        25.63us         n/a
     1      16384   1024   small |          n/a        27.65us         n/a
     1      16384   2048   small |          n/a        31.23us         n/a      18.43us      0.59x
     1      16384   4096   small | UNSUPPORTED
     1      65536    256   small |          n/a        38.02us         n/a
     1      65536    512   small |          n/a        43.95us         n/a
     1      65536   1024   small |          n/a        45.92us         n/a
     1      65536   2048   small |          n/a        48.06us         n/a      43.01us      0.89x
     1      65536   4096   small | UNSUPPORTED
     1     131072    256   small |          n/a        54.27us         n/a
     1     131072    512   small |          n/a        54.30us         n/a
     1     131072   1024   small |          n/a        65.50us         n/a
     1     131072   2048   small |          n/a        65.71us         n/a      67.58us      1.03x
     1     131072   4096   small | UNSUPPORTED
     1     262144    256   small |          n/a        81.92us         n/a
     1     262144    512   small |          n/a        88.16us         n/a
     1     262144   1024   small |          n/a        89.09us         n/a
     1     262144   2048   small |          n/a       105.31us         n/a     118.78us      1.13x
     1     262144   4096   small | UNSUPPORTED
     1     524288    256   small |          n/a       140.21us         n/a
     1     524288    512   small |          n/a       140.26us         n/a
     1     524288   1024   small |          n/a       150.56us         n/a
     1     524288   2048   small |          n/a       150.51us         n/a     196.58us      1.31x
     1     524288   4096   small | UNSUPPORTED
    16        256    256   small |          n/a         6.18us         n/a
    16        512    256   small |          n/a        17.41us         n/a
    16        512    512   small |          n/a         6.21us         n/a
    16       1024    256   small |          n/a        19.49us         n/a
    16       1024    512   small |          n/a        17.44us         n/a
    16       1024   1024   small |          n/a         8.13us         n/a
    16       2048    256   small |          n/a        21.50us         n/a
    16       2048    512   small |          n/a        20.51us         n/a
    16       2048   1024   small |          n/a        17.41us         n/a
    16       2048   2048   small |          n/a         8.19us         n/a       8.19us      1.00x
    16       4096    256   small |          n/a        23.46us         n/a
    16       4096    512   small |          n/a        23.55us         n/a
    16       4096   1024   small |          n/a        21.47us         n/a
    16       4096   2048   small |          n/a        19.46us         n/a      12.29us      0.63x
    16       4096   4096   small |          n/a         8.19us         n/a
    16      16384    256   small |          n/a        27.58us         n/a
    16      16384    512   small |          n/a        27.65us         n/a
    16      16384   1024   small |          n/a        27.65us         n/a
    16      16384   2048   small |          n/a        31.71us         n/a      20.48us      0.65x
    16      16384   4096   small | UNSUPPORTED
    16      65536    256   small |          n/a        39.87us         n/a
    16      65536    512   small |          n/a        46.05us         n/a
    16      65536   1024   small |          n/a        45.09us         n/a
    16      65536   2048   small |          n/a        48.14us         n/a      43.01us      0.89x
    16      65536   4096   small | UNSUPPORTED
    16     131072    256   small |          n/a        56.35us         n/a
    16     131072    512   small |          n/a        56.32us         n/a
    16     131072   1024   small |          n/a        66.51us         n/a
    16     131072   2048   small |          n/a        66.58us         n/a      67.58us      1.02x
    16     131072   4096   small | UNSUPPORTED
    16     262144    256   small |          n/a        82.94us         n/a
    16     262144    512   small |          n/a        89.01us         n/a
    16     262144   1024   small |          n/a        89.09us         n/a
    16     262144   2048   small |          n/a       105.47us         n/a     118.82us      1.13x
    16     262144   4096   small | UNSUPPORTED
    16     524288    256   small |          n/a       140.29us         n/a
    16     524288    512   small |          n/a       140.21us         n/a
    16     524288   1024   small |          n/a       150.54us         n/a
    16     524288   2048   small |          n/a       154.40us         n/a     198.66us      1.29x
    16     524288   4096   small | UNSUPPORTED
    64        256    256   small |          n/a         6.29us         n/a
    64        512    256   small |          n/a        17.44us         n/a
    64        512    512   small |          n/a         6.37us         n/a
    64       1024    256   small |          n/a        21.47us         n/a
    64       1024    512   small |          n/a        17.44us         n/a
    64       1024   1024   small |          n/a         8.16us         n/a
    64       2048    256   small |          n/a        23.49us         n/a
    64       2048    512   small |          n/a        21.47us         n/a
    64       2048   1024   small |          n/a        19.42us         n/a
    64       2048   2048   small |          n/a         8.19us         n/a       8.19us      1.00x
    64       4096    256   small |          n/a        23.62us         n/a
    64       4096    512   small |          n/a        23.49us         n/a
    64       4096   1024   small |          n/a        23.42us         n/a
    64       4096   2048   small |          n/a        21.44us         n/a      12.29us      0.57x
    64       4096   4096   small |          n/a         8.19us         n/a
    64      16384    256   small |          n/a        27.50us         n/a
    64      16384    512   small |          n/a        27.70us         n/a
    64      16384   1024   small |          n/a        31.62us         n/a
    64      16384   2048   small |          n/a        31.78us         n/a      20.48us      0.64x
    64      16384   4096   small | UNSUPPORTED
    64      65536    256   small |          n/a        41.92us         n/a
    64      65536    512   small |          n/a        46.08us         n/a
    64      65536   1024   small |          n/a        46.08us         n/a
    64      65536   2048   small |          n/a        50.21us         n/a      43.01us      0.86x
    64      65536   4096   small | UNSUPPORTED
    64     131072    256   small |          n/a        58.30us         n/a
    64     131072    512   small |          n/a        58.30us         n/a
    64     131072   1024   small |          n/a        66.58us         n/a
    64     131072   2048   small |          n/a        69.63us         n/a      69.63us      1.00x
    64     131072   4096   small | UNSUPPORTED
    64     262144    256   small |          n/a        88.05us         n/a
    64     262144    512   small |          n/a        94.24us         n/a
    64     262144   1024   small |          n/a        95.23us         n/a
    64     262144   2048   small |          n/a       113.68us         n/a     136.21us      1.20x
    64     262144   4096   small | UNSUPPORTED
    64     524288    256   small |          n/a       176.16us         n/a
    64     524288    512   small |          n/a       176.13us         n/a
    64     524288   1024   small |          n/a       187.52us         n/a
    64     524288   2048   small |          n/a       190.46us         n/a     342.08us      1.80x
    64     524288   4096   small | UNSUPPORTED
   256        256    256   small |          n/a         8.19us         n/a
   256        512    256   small |          n/a        22.69us         n/a
   256        512    512   small |          n/a         8.19us         n/a
   256       1024    256   small |          n/a        26.80us         n/a
   256       1024    512   small |          n/a        25.50us         n/a
   256       1024   1024   small |          n/a         8.19us         n/a
   256       2048    256   small |          n/a        29.70us         n/a
   256       2048    512   small |          n/a        27.60us         n/a
   256       2048   1024   small |          n/a        25.60us         n/a
   256       2048   2048   small |          n/a        10.24us         n/a       8.19us      0.80x
   256       4096    256   small |          n/a        29.74us         n/a
   256       4096    512   small |          n/a        31.71us         n/a
   256       4096   1024   small |          n/a        31.71us         n/a
   256       4096   2048   small |          n/a        28.94us         n/a      14.34us      0.50x
   256       4096   4096   small |          n/a        12.29us         n/a
   256      16384    256   small |          n/a        41.94us         n/a
   256      16384    512   small |          n/a        41.98us         n/a
   256      16384   1024   small |          n/a        45.06us         n/a
   256      16384   2048   small |          n/a        48.13us         n/a      28.67us      0.60x
   256      16384   4096   small | UNSUPPORTED
   256      65536    256   small |          n/a        66.64us         n/a
   256      65536    512   small |          n/a        78.86us         n/a
   256      65536   1024   small |          n/a        80.90us         n/a
   256      65536   2048   small |          n/a        91.02us         n/a      69.66us      0.77x
   256      65536   4096   small | UNSUPPORTED
   256     131072    256   small |          n/a       105.41us         n/a
   256     131072    512   small |          n/a       105.44us         n/a
   256     131072   1024   small |          n/a       125.98us         n/a
   256     131072   2048   small |          n/a       140.42us         n/a     127.01us      0.90x
   256     131072   4096   small | UNSUPPORTED
   256     262144    256   small |          n/a       190.66us         n/a
   256     262144    512   small |          n/a       203.66us         n/a
   256     262144   1024   small |          n/a       205.82us         n/a
   256     262144   2048   small |          n/a       244.74us         n/a     232.53us      0.95x
   256     262144   4096   small | UNSUPPORTED
   256     524288    256   small |          n/a       340.96us         n/a
   256     524288    512   small |          n/a       340.99us         n/a
   256     524288   1024   small |          n/a       368.62us         n/a
   256     524288   2048   small |          n/a       375.78us         n/a     401.57us      1.07x
   256     524288   4096   small | UNSUPPORTED
     1        256    256   large |          n/a         6.21us         n/a
     1        512    256   large |          n/a        14.90us         n/a
     1        512    512   large |          n/a         6.18us         n/a
     1       1024    256   large |          n/a        20.56us         n/a
     1       1024    512   large |          n/a        15.22us         n/a
     1       1024   1024   large |          n/a         8.08us         n/a
     1       2048    256   large |          n/a        21.66us         n/a
     1       2048    512   large |          n/a        21.42us         n/a
     1       2048   1024   large |          n/a        15.31us         n/a
     1       2048   2048   large |          n/a         8.19us         n/a       8.16us      1.00x
     1       4096    256   large |          n/a        21.50us         n/a
     1       4096    512   large |          n/a        23.49us         n/a
     1       4096   1024   large |          n/a        21.44us         n/a
     1       4096   2048   large |          n/a        17.41us         n/a      10.24us      0.59x
     1       4096   4096   large |          n/a         8.19us         n/a
     1      16384    256   large |          n/a        25.63us         n/a
     1      16384    512   large |          n/a        27.62us         n/a
     1      16384   1024   large |          n/a        27.62us         n/a
     1      16384   2048   large |          n/a        31.68us         n/a      18.43us      0.58x
     1      16384   4096   large | UNSUPPORTED
     1      65536    256   large |          n/a        37.95us         n/a
     1      65536    512   large |          n/a        44.00us         n/a
     1      65536   1024   large |          n/a        44.06us         n/a
     1      65536   2048   large |          n/a        48.05us         n/a      40.99us      0.85x
     1      65536   4096   large | UNSUPPORTED
     1     131072    256   large |          n/a        53.25us         n/a
     1     131072    512   large |          n/a        54.27us         n/a
     1     131072   1024   large |          n/a        65.57us         n/a
     1     131072   2048   large |          n/a        64.51us         n/a      67.47us      1.05x
     1     131072   4096   large | UNSUPPORTED
     1     262144    256   large |          n/a        82.11us         n/a
     1     262144    512   large |          n/a        88.10us         n/a
     1     262144   1024   large |          n/a        89.06us         n/a
     1     262144   2048   large |          n/a       105.54us         n/a     116.80us      1.11x
     1     262144   4096   large | UNSUPPORTED
     1     524288    256   large |          n/a       138.29us         n/a
     1     524288    512   large |          n/a       138.24us         n/a
     1     524288   1024   large |          n/a       150.56us         n/a
     1     524288   2048   large |          n/a       150.56us         n/a     197.62us      1.31x
     1     524288   4096   large | UNSUPPORTED
    16        256    256   large |          n/a         6.26us         n/a
    16        512    256   large |          n/a        17.41us         n/a
    16        512    512   large |          n/a         6.27us         n/a
    16       1024    256   large |          n/a        19.47us         n/a
    16       1024    512   large |          n/a        17.38us         n/a
    16       1024   1024   large |          n/a         7.98us         n/a
    16       2048    256   large |          n/a        22.53us         n/a
    16       2048    512   large |          n/a        20.51us         n/a
    16       2048   1024   large |          n/a        17.38us         n/a
    16       2048   2048   large |          n/a         8.19us         n/a       8.19us      1.00x
    16       4096    256   large |          n/a        23.58us         n/a
    16       4096    512   large |          n/a        23.58us         n/a
    16       4096   1024   large |          n/a        21.44us         n/a
    16       4096   2048   large |          n/a        20.51us         n/a      12.29us      0.60x
    16       4096   4096   large |          n/a         8.19us         n/a
    16      16384    256   large |          n/a        25.65us         n/a
    16      16384    512   large |          n/a        27.52us         n/a
    16      16384   1024   large |          n/a        28.72us         n/a
    16      16384   2048   large |          n/a        31.71us         n/a      18.43us      0.58x
    16      16384   4096   large | UNSUPPORTED
    16      65536    256   large |          n/a        39.86us         n/a
    16      65536    512   large |          n/a        45.22us         n/a
    16      65536   1024   large |          n/a        46.05us         n/a
    16      65536   2048   large |          n/a        48.13us         n/a      43.01us      0.89x
    16      65536   4096   large | UNSUPPORTED
    16     131072    256   large |          n/a        57.86us         n/a
    16     131072    512   large |          n/a        56.32us         n/a
    16     131072   1024   large |          n/a        66.45us         n/a
    16     131072   2048   large |          n/a        66.54us         n/a      69.47us      1.04x
    16     131072   4096   large | UNSUPPORTED
    16     262144    256   large |          n/a        82.94us         n/a
    16     262144    512   large |          n/a        89.09us         n/a
    16     262144   1024   large |          n/a        89.09us         n/a
    16     262144   2048   large |          n/a       105.54us         n/a     120.83us      1.14x
    16     262144   4096   large | UNSUPPORTED
    16     524288    256   large |          n/a       140.29us         n/a
    16     524288    512   large |          n/a       140.26us         n/a
    16     524288   1024   large |          n/a       150.62us         n/a
    16     524288   2048   large |          n/a       153.66us         n/a     198.69us      1.29x
    16     524288   4096   large | UNSUPPORTED
    64        256    256   large |          n/a         6.21us         n/a
    64        512    256   large |          n/a        17.44us         n/a
    64        512    512   large |          n/a         8.03us         n/a
    64       1024    256   large |          n/a        19.55us         n/a
    64       1024    512   large |          n/a        17.44us         n/a
    64       1024   1024   large |          n/a         8.11us         n/a
    64       2048    256   large |          n/a        23.55us         n/a
    64       2048    512   large |          n/a        21.47us         n/a
    64       2048   1024   large |          n/a        19.38us         n/a
    64       2048   2048   large |          n/a         8.19us         n/a       8.19us      1.00x
    64       4096    256   large |          n/a        23.47us         n/a
    64       4096    512   large |          n/a        23.58us         n/a
    64       4096   1024   large |          n/a        23.55us         n/a
    64       4096   2048   large |          n/a        21.44us         n/a      12.29us      0.57x
    64       4096   4096   large |          n/a         8.22us         n/a
    64      16384    256   large |          n/a        27.52us         n/a
    64      16384    512   large |          n/a        28.64us         n/a
    64      16384   1024   large |          n/a        33.70us         n/a
    64      16384   2048   large |          n/a        31.74us         n/a      20.48us      0.65x
    64      16384   4096   large | UNSUPPORTED
    64      65536    256   large |          n/a        41.98us         n/a
    64      65536    512   large |          n/a        46.08us         n/a
    64      65536   1024   large |          n/a        46.11us         n/a
    64      65536   2048   large |          n/a        50.22us         n/a      43.04us      0.86x
    64      65536   4096   large | UNSUPPORTED
    64     131072    256   large |          n/a        58.30us         n/a
    64     131072    512   large |          n/a        58.34us         n/a
    64     131072   1024   large |          n/a        67.55us         n/a
    64     131072   2048   large |          n/a        86.91us         n/a      69.63us      0.80x
    64     131072   4096   large | UNSUPPORTED
    64     262144    256   large |          n/a        90.08us         n/a
    64     262144    512   large |          n/a        95.23us         n/a
    64     262144   1024   large |          n/a        96.22us         n/a
    64     262144   2048   large |          n/a       114.85us         n/a     143.38us      1.25x
    64     262144   4096   large | UNSUPPORTED
    64     524288    256   large |          n/a       176.30us         n/a
    64     524288    512   large |          n/a       177.06us         n/a
    64     524288   1024   large |          n/a       188.45us         n/a
    64     524288   2048   large |          n/a       190.53us         n/a     342.05us      1.80x
    64     524288   4096   large | UNSUPPORTED
   256        256    256   large |          n/a         8.19us         n/a
   256        512    256   large |          n/a        21.54us         n/a
   256        512    512   large |          n/a         8.19us         n/a
   256       1024    256   large |          n/a        27.44us         n/a
   256       1024    512   large |          n/a        25.04us         n/a
   256       1024   1024   large |          n/a         8.19us         n/a
   256       2048    256   large |          n/a        29.70us         n/a
   256       2048    512   large |          n/a        27.62us         n/a
   256       2048   1024   large |          n/a        25.60us         n/a
   256       2048   2048   large |          n/a        10.24us         n/a       8.18us      0.80x
   256       4096    256   large |          n/a        29.82us         n/a
   256       4096    512   large |          n/a        31.74us         n/a
   256       4096   1024   large |          n/a        30.80us         n/a
   256       4096   2048   large |          n/a        28.69us         n/a      14.34us      0.50x
   256       4096   4096   large |          n/a        12.29us         n/a
   256      16384    256   large |          n/a        41.92us         n/a
   256      16384    512   large |          n/a        42.02us         n/a
   256      16384   1024   large |          n/a        45.12us         n/a
   256      16384   2048   large |          n/a        48.18us         n/a      28.67us      0.60x
   256      16384   4096   large | UNSUPPORTED
   256      65536    256   large |          n/a        67.58us         n/a
   256      65536    512   large |          n/a        78.85us         n/a
   256      65536   1024   large |          n/a        80.98us         n/a
   256      65536   2048   large |          n/a        90.98us         n/a      69.65us      0.77x
   256      65536   4096   large | UNSUPPORTED
   256     131072    256   large |          n/a       105.44us         n/a
   256     131072    512   large |          n/a       105.41us         n/a
   256     131072   1024   large |          n/a       125.98us         n/a
   256     131072   2048   large |          n/a       132.06us         n/a     127.01us      0.96x
   256     131072   4096   large | UNSUPPORTED
   256     262144    256   large |          n/a       191.39us         n/a
   256     262144    512   large |          n/a       203.74us         n/a
   256     262144   1024   large |          n/a       205.82us         n/a
   256     262144   2048   large |          n/a       244.74us         n/a     232.45us      0.95x
   256     262144   4096   large | UNSUPPORTED
   256     524288    256   large |          n/a       340.99us         n/a
   256     524288    512   large |          n/a       340.99us         n/a
   256     524288   1024   large |          n/a       369.46us         n/a
   256     524288   2048   large |          n/a       375.84us         n/a     401.44us      1.07x
   256     524288   4096   large | UNSUPPORTED
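The `small`/`large` tie_break modes compared in the table above can be illustrated with a minimal CPU sketch. This is a hypothetical helper for intuition only, not the FlashInfer kernel; it assumes SMALL breaks score ties toward smaller indices and LARGE toward larger ones, which is what makes the selection deterministic:

```python
# Minimal CPU sketch of deterministic top-k tie-breaking.
# Assumption (hypothetical, not the FlashInfer kernel): "small" breaks
# ties toward smaller indices, "large" toward larger indices.
def topk_tie_break(scores, k, mode="small"):
    if mode == "small":
        # Higher score first; on equal scores, the smaller index wins.
        key = lambda i: (-scores[i], i)
    elif mode == "large":
        # Higher score first; on equal scores, the larger index wins.
        key = lambda i: (-scores[i], -i)
    else:
        raise ValueError(f"unknown tie_break mode: {mode}")
    return sorted(range(len(scores)), key=key)[:k]

scores = [0.5, 0.9, 0.5, 0.9]
print(topk_tie_break(scores, 2, "small"))  # [1, 3]
print(topk_tie_break(scores, 2, "large"))  # [3, 1]
```

Because every tie is resolved by index rather than by arrival order, repeated runs on the same input always return the same index set, which is why the non-`none` modes force deterministic behavior at the cost of the slowdowns shown above.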
python3 benchmarks/bench_topk.py --op ragged --dtype fp32 --dsa-case all --dsa-input-pattern dsa_relu --dsa-topk 2048 --deterministic --tie-break-modes 0 1 2 --compare-sglang

====================================================================================================
top_k_ragged_transform: Fused top-k + ragged index transform (dtype=FP32, deterministic=True, pattern=random, tie_break_modes=none,small,large)
NOTE: SGLang only supports k=2048 and float32
NOTE: deterministic mode also benchmarks FlashInfer(non-det) for direct comparison when tie_break=0
====================================================================================================
 batch    seq_len      k     tie |   FlashInfer FlashInfer(det) DetSlowdown       SGLang    Speedup
---------------------------------------------------------------------------------------------------
/root/flashinfer-src/flashinfer/testing/utils.py:1620: UserWarning: CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events for benchmarking.
  return bench_gpu_time_with_cupti(
     1        256    256    none |       6.18us         6.18us       1.00x
     1        512    256    none |      10.24us        15.30us       1.49x
     1        512    512    none |       6.14us         6.14us       1.00x
     1       1024    256    none |      10.24us        16.38us       1.60x
     1       1024    512    none |       8.19us        14.34us       1.75x
     1       1024   1024    none |       6.18us         6.18us       1.00x
     1       2048    256    none |      12.29us        18.45us       1.50x
     1       2048    512    none |      12.29us        18.43us       1.50x
     1       2048   1024    none |       8.19us        15.34us       1.87x
     1       2048   2048    none |       6.35us         6.21us       0.98x       6.14us      0.99x
     1       4096    256    none |       8.19us        15.33us       1.87x
     1       4096    512    none |      12.29us        17.41us       1.42x
     1       4096   1024    none |      12.29us        17.44us       1.42x
     1       4096   2048    none |      12.29us        21.47us       1.75x      12.29us      0.57x
     1       4096   4096    none |       8.11us         8.16us       1.01x
     1      16384    256    none |      14.34us        21.47us       1.50x
     1      16384    512    none |      16.38us        23.47us       1.43x
     1      16384   1024    none |      16.42us        23.55us       1.43x
     1      16384   2048    none |      18.43us        27.62us       1.50x      18.43us      0.67x
     1      16384   4096    none |      20.48us        20.51us       1.00x
     1      65536    256    none |      36.86us        36.86us       1.00x
     1      65536    512    none |      36.86us        34.85us       0.95x
     1      65536   1024    none |      36.86us        36.69us       1.00x
     1      65536   2048    none |      38.91us        36.86us       0.95x      45.04us      1.22x
     1      65536   4096    none |      38.91us        36.86us       0.95x
     1     131072    256    none |      42.91us        40.96us       0.95x
     1     131072    512    none |      43.01us        40.96us       0.95x
     1     131072   1024    none |      43.01us        40.96us       0.95x
     1     131072   2048    none |      45.06us        40.96us       0.91x      71.71us      1.75x
     1     131072   4096    none |      45.09us        42.88us       0.95x
     1     262144    256    none |      47.10us        45.06us       0.96x
     1     262144    512    none |      47.10us        45.06us       0.96x
     1     262144   1024    none |      47.14us        45.06us       0.96x
     1     262144   2048    none |      47.25us        47.10us       1.00x     131.07us      2.78x
     1     262144   4096    none |      49.15us        46.96us       0.96x
     1     524288    256    none |      47.10us        47.07us       1.00x
     1     524288    512    none |      47.10us        47.10us       1.00x
     1     524288   1024    none |      47.17us        47.10us       1.00x
     1     524288   2048    none |      49.15us        47.10us       0.96x     223.23us      4.74x
     1     524288   4096    none |      49.15us        47.10us       0.96x
    16        256    256    none |       6.14us         6.14us       1.00x
    16        512    256    none |      10.24us        15.30us       1.49x
    16        512    512    none |       6.14us         6.14us       1.00x
    16       1024    256    none |      12.29us        16.35us       1.33x
    16       1024    512    none |      10.24us        15.36us       1.50x
    16       1024   1024    none |       6.14us         6.14us       1.00x
    16       2048    256    none |      12.29us        16.54us       1.35x
    16       2048    512    none |      12.29us        17.30us       1.41x
    16       2048   1024    none |      10.24us        17.25us       1.68x
    16       2048   2048    none |       6.18us         6.14us       0.99x       6.14us      1.00x
    16       4096    256    none |      12.29us        19.42us       1.58x
    16       4096    512    none |      14.34us        19.49us       1.36x
    16       4096   1024    none |      12.29us        19.42us       1.58x
    16       4096   2048    none |      10.27us        19.39us       1.89x      10.24us      0.53x
    16       4096   4096    none |       6.14us         6.18us       1.01x
    16      16384    256    none |      16.38us        21.58us       1.32x
    16      16384    512    none |      16.42us        23.54us       1.43x
    16      16384   1024    none |      18.43us        25.54us       1.39x
    16      16384   2048    none |      18.51us        27.65us       1.49x      20.48us      0.74x
    16      16384   4096    none |      20.48us        20.48us       1.00x
    16      65536    256    none |      28.67us        33.81us       1.18x
    16      65536    512    none |      34.21us        39.94us       1.17x
    16      65536   1024    none |      32.77us        39.90us       1.22x
    16      65536   2048    none |      36.86us        44.03us       1.19x      45.06us      1.02x
    16      65536   4096    none |      41.06us        38.91us       0.95x
    16     131072    256    none |      45.09us        43.01us       0.95x
    16     131072    512    none |      45.06us        43.01us       0.95x
    16     131072   1024    none |      53.25us        43.04us       0.81x
    16     131072   2048    none |      53.31us        45.06us       0.85x      73.73us      1.64x
    16     131072   4096    none |      49.15us        45.06us       0.92x
    16     262144    256    none |      49.15us        48.94us       1.00x
    16     262144    512    none |      49.15us        47.18us       0.96x
    16     262144   1024    none |      49.36us        49.15us       1.00x
    16     262144   2048    none |      51.20us        49.15us       0.96x     131.10us      2.67x
    16     262144   4096    none |      51.20us        49.15us       0.96x
    16     524288    256    none |      90.11us        88.06us       0.98x
    16     524288    512    none |      90.11us        88.10us       0.98x
    16     524288   1024    none |      91.23us        88.06us       0.97x
    16     524288   2048    none |      92.16us        88.10us       0.96x     223.23us      2.53x
    16     524288   4096    none |      94.18us        90.11us       0.96x
    64        256    256    none |       6.14us         6.14us       1.00x
    64        512    256    none |      12.29us        15.41us       1.25x
    64        512    512    none |       6.14us         6.14us       1.00x
    64       1024    256    none |      12.29us        16.38us       1.33x
    64       1024    512    none |      12.29us        16.38us       1.33x
    64       1024   1024    none |       6.14us         6.14us       1.00x
    64       2048    256    none |      12.29us        17.41us       1.42x
    64       2048    512    none |      12.29us        17.38us       1.41x
    64       2048   1024    none |      12.29us        17.47us       1.42x
    64       2048   2048    none |       6.14us         6.14us       1.00x       6.14us      1.00x
    64       4096    256    none |      14.34us        19.49us       1.36x
    64       4096    512    none |      14.34us        19.49us       1.36x
    64       4096   1024    none |      12.29us        19.33us       1.57x
    64       4096   2048    none |      12.29us        21.42us       1.74x      12.29us      0.57x
    64       4096   4096    none |       6.18us         6.14us       0.99x
    64      16384    256    none |      16.38us        23.49us       1.43x
    64      16384    512    none |      16.54us        23.55us       1.42x
    64      16384   1024    none |      18.43us        23.58us       1.28x
    64      16384   2048    none |      20.48us        27.65us       1.35x      20.48us      0.74x
    64      16384   4096    none |      21.89us        22.53us       1.03x
    64      65536    256    none |      28.67us        35.81us       1.25x
    64      65536    512    none |      34.82us        39.97us       1.15x
    64      65536   1024    none |      34.82us        39.94us       1.15x
    64      65536   2048    none |      36.86us        45.06us       1.22x      45.06us      1.00x
    64      65536   4096    none |      43.01us        40.96us       0.95x
    64     131072    256    none |      45.09us        52.21us       1.16x
    64     131072    512    none |      45.06us        52.19us       1.16x
    64     131072   1024    none |      55.30us        60.45us       1.09x
    64     131072   2048    none |      55.30us        63.46us       1.15x      74.77us      1.18x
    64     131072   4096    none |      90.11us        83.97us       0.93x
    64     262144    256    none |      75.78us        81.92us       1.08x
    64     262144    512    none |      81.92us        88.16us       1.08x
    64     262144   1024    none |      81.92us        88.10us       1.08x
    64     262144   2048    none |      98.30us       105.52us       1.07x     147.49us      1.40x
    64     262144   4096    none |     143.36us       135.17us       0.94x
    64     524288    256    none |     163.84us       169.95us       1.04x
    64     524288    512    none |     163.87us       169.98us       1.04x
    64     524288   1024    none |     176.13us       182.27us       1.03x
    64     524288   2048    none |     176.16us       184.48us       1.05x     369.58us      2.00x
    64     524288   4096    none |     229.38us       217.09us       0.95x
   256        256    256    none |       8.13us         8.06us       0.99x
   256        512    256    none |      14.37us        21.34us       1.49x
   256        512    512    none |       8.19us         8.10us       0.99x
   256       1024    256    none |      18.43us        24.58us       1.33x
   256       1024    512    none |      16.38us        21.54us       1.31x
   256       1024   1024    none |       8.19us         8.19us       1.00x
   256       2048    256    none |      18.43us        24.58us       1.33x
   256       2048    512    none |      18.43us        25.54us       1.39x
   256       2048   1024    none |      16.42us        23.52us       1.43x
   256       2048   2048    none |       8.19us         8.19us       1.00x       6.14us      0.75x
   256       4096    256    none |      20.48us        27.58us       1.35x
   256       4096    512    none |      20.48us        27.62us       1.35x
   256       4096   1024    none |      20.48us        28.67us       1.40x
   256       4096   2048    none |      18.43us        28.88us       1.57x      14.34us      0.50x
   256       4096   4096    none |       8.19us         8.19us       1.00x
   256      16384    256    none |      28.67us        35.74us       1.25x
   256      16384    512    none |      28.67us        35.84us       1.25x
   256      16384   1024    none |      30.72us        38.94us       1.27x
   256      16384   2048    none |      34.82us        44.05us       1.27x      28.64us      0.65x
   256      16384   4096    none |      38.85us        38.91us       1.00x
   256      65536    256    none |      53.34us        60.42us       1.13x
   256      65536    512    none |      63.65us        70.66us       1.11x
   256      65536   1024    none |      63.52us        72.70us       1.14x
   256      65536   2048    none |      69.63us        80.03us       1.15x      63.49us      0.79x
   256      65536   4096    none |     149.50us       137.22us       0.92x
   256     131072    256    none |      92.19us        99.30us       1.08x
   256     131072    512    none |      92.19us        99.33us       1.08x
   256     131072   1024    none |     110.59us       118.78us       1.07x
   256     131072   2048    none |     110.59us       123.73us       1.12x     128.88us      1.04x
   256     131072   4096    none |     256.00us       235.52us       0.92x
   256     262144    256    none |     178.30us       184.83us       1.04x
   256     262144    512    none |     191.47us       197.60us       1.03x
   256     262144   1024    none |     192.51us       199.76us       1.04x
   256     262144   2048    none |     225.31us       237.60us       1.05x     235.52us      0.99x
   256     262144   4096    none |     407.70us       383.01us       0.94x
   256     524288    256    none |     329.76us       334.85us       1.02x
   256     524288    512    none |     329.76us       334.85us       1.02x
   256     524288   1024    none |     355.26us       362.50us       1.02x
   256     524288   2048    none |     356.27us       366.59us       1.03x     409.68us      1.12x
   256     524288   4096    none |     828.61us       786.46us       0.95x
     1        256    256   small |          n/a         6.18us         n/a
     1        512    256   small |          n/a        12.27us         n/a
     1        512    512   small |          n/a         6.14us         n/a
     1       1024    256   small |          n/a        17.41us         n/a
     1       1024    512   small |          n/a        14.34us         n/a
     1       1024   1024   small |          n/a         6.18us         n/a
     1       2048    256   small |          n/a        18.43us         n/a
     1       2048    512   small |          n/a        18.43us         n/a
     1       2048   1024   small |          n/a        17.36us         n/a
     1       2048   2048   small |          n/a         6.21us         n/a       6.14us      0.99x
     1       4096    256   small |          n/a        17.41us         n/a
     1       4096    512   small |          n/a        19.46us         n/a
     1       4096   1024   small |          n/a        19.42us         n/a
     1       4096   2048   small |          n/a        19.46us         n/a      10.24us      0.53x
     1       4096   4096   small |          n/a         8.13us         n/a
     1      16384    256   small |          n/a        21.55us         n/a
     1      16384    512   small |          n/a        23.46us         n/a
     1      16384   1024   small |          n/a        23.62us         n/a
     1      16384   2048   small |          n/a        27.60us         n/a      18.43us      0.67x
     1      16384   4096   small | UNSUPPORTED
     1      65536    256   small |          n/a        34.82us         n/a
     1      65536    512   small |          n/a        39.87us         n/a
     1      65536   1024   small |          n/a        38.94us         n/a
     1      65536   2048   small |          n/a        43.04us         n/a      43.01us      1.00x
     1      65536   4096   small | UNSUPPORTED
     1     131072    256   small |          n/a        48.16us         n/a
     1     131072    512   small |          n/a        50.11us         n/a
     1     131072   1024   small |          n/a        59.39us         n/a
     1     131072   2048   small |          n/a        60.42us         n/a      71.70us      1.19x
     1     131072   4096   small | UNSUPPORTED
     1     262144    256   small |          n/a        76.80us         n/a
     1     262144    512   small |          n/a        82.94us         n/a
     1     262144   1024   small |          n/a        82.91us         n/a
     1     262144   2048   small |          n/a       100.46us         n/a     131.07us      1.30x
     1     262144   4096   small | UNSUPPORTED
     1     524288    256   small |          n/a       133.97us         n/a
     1     524288    512   small |          n/a       134.14us         n/a
     1     524288   1024   small |          n/a       144.54us         n/a
     1     524288   2048   small |          n/a       146.43us         n/a     223.20us      1.52x
     1     524288   4096   small | UNSUPPORTED
    16        256    256   small |          n/a         6.14us         n/a
    16        512    256   small |          n/a        15.30us         n/a
    16        512    512   small |          n/a         6.18us         n/a
    16       1024    256   small |          n/a        16.38us         n/a
    16       1024    512   small |          n/a        16.38us         n/a
    16       1024   1024   small |          n/a         6.14us         n/a
    16       2048    256   small |          n/a        16.85us         n/a
    16       2048    512   small |          n/a        16.42us         n/a
    16       2048   1024   small |          n/a        17.47us         n/a
    16       2048   2048   small |          n/a         6.14us         n/a       6.14us      1.00x
    16       4096    256   small |          n/a        19.33us         n/a
    16       4096    512   small |          n/a        19.49us         n/a
    16       4096   1024   small |          n/a        19.49us         n/a
    16       4096   2048   small |          n/a        20.48us         n/a      12.29us      0.60x
    16       4096   4096   small |          n/a         6.16us         n/a
    16      16384    256   small |          n/a        22.53us         n/a
    16      16384    512   small |          n/a        23.39us         n/a
    16      16384   1024   small |          n/a        23.58us         n/a
    16      16384   2048   small |          n/a        27.58us         n/a      18.43us      0.67x
    16      16384   4096   small | UNSUPPORTED
    16      65536    256   small |          n/a        33.89us         n/a
    16      65536    512   small |          n/a        39.89us         n/a
    16      65536   1024   small |          n/a        39.10us         n/a
    16      65536   2048   small |          n/a        44.03us         n/a      45.06us      1.02x
    16      65536   4096   small | UNSUPPORTED
    16     131072    256   small |          n/a        50.21us         n/a
    16     131072    512   small |          n/a        50.26us         n/a
    16     131072   1024   small |          n/a        60.32us         n/a
    16     131072   2048   small |          n/a        62.38us         n/a      73.54us      1.18x
    16     131072   4096   small | UNSUPPORTED
    16     262144    256   small |          n/a        76.77us         n/a
    16     262144    512   small |          n/a        82.88us         n/a
    16     262144   1024   small |          n/a        82.94us         n/a
    16     262144   2048   small |          n/a       101.42us         n/a     131.07us      1.29x
    16     262144   4096   small | UNSUPPORTED
    16     524288    256   small |          n/a       134.02us         n/a
    16     524288    512   small |          n/a       134.14us         n/a
    16     524288   1024   small |          n/a       144.40us         n/a
    16     524288   2048   small |          n/a       147.42us         n/a     223.20us      1.51x
    16     524288   4096   small | UNSUPPORTED
    64        256    256   small |          n/a         6.14us         n/a
    64        512    256   small |          n/a        15.98us         n/a
    64        512    512   small |          n/a         6.14us         n/a
    64       1024    256   small |          n/a        16.38us         n/a
    64       1024    512   small |          n/a        16.35us         n/a
    64       1024   1024   small |          n/a         6.14us         n/a
    64       2048    256   small |          n/a        17.41us         n/a
    64       2048    512   small |          n/a        17.41us         n/a
    64       2048   1024   small |          n/a        18.00us         n/a
    64       2048   2048   small |          n/a         6.14us         n/a       6.14us      1.00x
    64       4096    256   small |          n/a        19.52us         n/a
    64       4096    512   small |          n/a        19.39us         n/a
    64       4096   1024   small |          n/a        19.55us         n/a
    64       4096   2048   small |          n/a        21.41us         n/a      12.29us      0.57x
    64       4096   4096   small |          n/a         6.18us         n/a
    64      16384    256   small |          n/a        23.36us         n/a
    64      16384    512   small |          n/a        23.57us         n/a
    64      16384   1024   small |          n/a        25.44us         n/a
    64      16384   2048   small |          n/a        27.68us         n/a      20.48us      0.74x
    64      16384   4096   small | UNSUPPORTED
    64      65536    256   small |          n/a        35.71us         n/a
    64      65536    512   small |          n/a        39.94us         n/a
    64      65536   1024   small |          n/a        40.06us         n/a
    64      65536   2048   small |          n/a        45.06us         n/a      45.06us      1.00x
    64      65536   4096   small | UNSUPPORTED
    64     131072    256   small |          n/a        52.06us         n/a
    64     131072    512   small |          n/a        52.19us         n/a
    64     131072   1024   small |          n/a        60.42us         n/a
    64     131072   2048   small |          n/a        63.49us         n/a      75.62us      1.19x
    64     131072   4096   small | UNSUPPORTED
    64     262144    256   small |          n/a        81.09us         n/a
    64     262144    512   small |          n/a        88.06us         n/a
    64     262144   1024   small |          n/a        89.04us         n/a
    64     262144   2048   small |          n/a       106.67us         n/a     147.46us      1.38x
    64     262144   4096   small | UNSUPPORTED
    64     524288    256   small |          n/a       169.98us         n/a
    64     524288    512   small |          n/a       169.98us         n/a
    64     524288   1024   small |          n/a       182.29us         n/a
    64     524288   2048   small |          n/a       185.18us         n/a     370.69us      2.00x
    64     524288   4096   small | UNSUPPORTED
   256        256    256   small |          n/a         8.11us         n/a
   256        512    256   small |          n/a        22.53us         n/a
   256        512    512   small |          n/a         8.16us         n/a
   256       1024    256   small |          n/a        24.58us         n/a
   256       1024    512   small |          n/a        23.42us         n/a
   256       1024   1024   small |          n/a         8.19us         n/a
   256       2048    256   small |          n/a        24.61us         n/a
   256       2048    512   small |          n/a        25.79us         n/a
   256       2048   1024   small |          n/a        23.55us         n/a
   256       2048   2048   small |          n/a         8.19us         n/a       6.14us      0.75x
   256       4096    256   small |          n/a        27.50us         n/a
   256       4096    512   small |          n/a        26.66us         n/a
   256       4096   1024   small |          n/a        28.67us         n/a
   256       4096   2048   small |          n/a        28.82us         n/a      14.34us      0.50x
   256       4096   4096   small |          n/a         8.19us         n/a
   256      16384    256   small |          n/a        35.68us         n/a
   256      16384    512   small |          n/a        35.95us         n/a
   256      16384   1024   small |          n/a        39.94us         n/a
   256      16384   2048   small |          n/a        44.13us         n/a      26.66us      0.60x
   256      16384   4096   small | UNSUPPORTED
   256      65536    256   small |          n/a        61.44us         n/a
   256      65536    512   small |          n/a        70.62us         n/a
   256      65536   1024   small |          n/a        72.70us         n/a
   256      65536   2048   small |          n/a        80.00us         n/a      63.49us      0.79x
   256      65536   4096   small | UNSUPPORTED
   256     131072    256   small |          n/a        99.30us         n/a
   256     131072    512   small |          n/a        99.15us         n/a
   256     131072   1024   small |          n/a       118.77us         n/a
   256     131072   2048   small |          n/a       122.96us         n/a     127.07us      1.03x
   256     131072   4096   small | UNSUPPORTED
   256     262144    256   small |          n/a       185.18us         n/a
   256     262144    512   small |          n/a       197.60us         n/a
   256     262144   1024   small |          n/a       199.68us         n/a
   256     262144   2048   small |          n/a       237.57us         n/a     235.52us      0.99x
   256     262144   4096   small | UNSUPPORTED
   256     524288    256   small |          n/a       334.82us         n/a
   256     524288    512   small |          n/a       334.85us         n/a
   256     524288   1024   small |          n/a       362.46us         n/a
   256     524288   2048   small |          n/a       366.75us         n/a     409.63us      1.12x
   256     524288   4096   small | UNSUPPORTED
     1        256    256   large |          n/a         6.18us         n/a
     1        512    256   large |          n/a        12.29us         n/a
     1        512    512   large |          n/a         6.14us         n/a
     1       1024    256   large |          n/a        13.39us         n/a
     1       1024    512   large |          n/a        14.34us         n/a
     1       1024   1024   large |          n/a         6.18us         n/a
     1       2048    256   large |          n/a        17.44us         n/a
     1       2048    512   large |          n/a        16.38us         n/a
     1       2048   1024   large |          n/a        15.22us         n/a
     1       2048   2048   large |          n/a         6.21us         n/a       6.14us      0.99x
     1       4096    256   large |          n/a        19.39us         n/a
     1       4096    512   large |          n/a        19.41us         n/a
     1       4096   1024   large |          n/a        19.30us         n/a
     1       4096   2048   large |          n/a        19.42us         n/a      10.24us      0.53x
     1       4096   4096   large |          n/a         8.06us         n/a
     1      16384    256   large |          n/a        22.53us         n/a
     1      16384    512   large |          n/a        21.50us         n/a
     1      16384   1024   large |          n/a        23.55us         n/a
     1      16384   2048   large |          n/a        27.57us         n/a      18.43us      0.67x
     1      16384   4096   large | UNSUPPORTED
     1      65536    256   large |          n/a        33.79us         n/a
     1      65536    512   large |          n/a        38.91us         n/a
     1      65536   1024   large |          n/a        38.91us         n/a
     1      65536   2048   large |          n/a        43.94us         n/a      43.04us      0.98x
     1      65536   4096   large | UNSUPPORTED
     1     131072    256   large |          n/a        50.05us         n/a
     1     131072    512   large |          n/a        50.21us         n/a
     1     131072   1024   large |          n/a        59.39us         n/a
     1     131072   2048   large |          n/a        62.43us         n/a      73.73us      1.18x
     1     131072   4096   large | UNSUPPORTED
     1     262144    256   large |          n/a        76.80us         n/a
     1     262144    512   large |          n/a        82.82us         n/a
     1     262144   1024   large |          n/a        82.94us         n/a
     1     262144   2048   large |          n/a       101.30us         n/a     131.07us      1.29x
     1     262144   4096   large | UNSUPPORTED
     1     524288    256   large |          n/a       134.02us         n/a
     1     524288    512   large |          n/a       133.97us         n/a
     1     524288   1024   large |          n/a       145.41us         n/a
     1     524288   2048   large |          n/a       146.43us         n/a     223.26us      1.52x
     1     524288   4096   large | UNSUPPORTED
    16        256    256   large |          n/a         6.14us         n/a
    16        512    256   large |          n/a        15.39us         n/a
    16        512    512   large |          n/a         6.16us         n/a
    16       1024    256   large |          n/a        15.49us         n/a
    16       1024    512   large |          n/a        15.57us         n/a
    16       1024   1024   large |          n/a         6.14us         n/a
    16       2048    256   large |          n/a        16.42us         n/a
    16       2048    512   large |          n/a        17.41us         n/a
    16       2048   1024   large |          n/a        17.44us         n/a
    16       2048   2048   large |          n/a         6.14us         n/a       6.14us      1.00x
    16       4096    256   large |          n/a        19.39us         n/a
    16       4096    512   large |          n/a        19.36us         n/a
    16       4096   1024   large |          n/a        19.39us         n/a
    16       4096   2048   large |          n/a        21.47us         n/a      12.29us      0.57x
    16       4096   4096   large |          n/a         6.18us         n/a
    16      16384    256   large |          n/a        23.46us         n/a
    16      16384    512   large |          n/a        23.44us         n/a
    16      16384   1024   large |          n/a        23.63us         n/a
    16      16384   2048   large |          n/a        26.69us         n/a      18.43us      0.69x
    16      16384   4096   large | UNSUPPORTED
    16      65536    256   large |          n/a        33.87us         n/a
    16      65536    512   large |          n/a        39.87us         n/a
    16      65536   1024   large |          n/a        39.84us         n/a
    16      65536   2048   large |          n/a        44.16us         n/a      45.06us      1.02x
    16      65536   4096   large | UNSUPPORTED
    16     131072    256   large |          n/a        51.20us         n/a
    16     131072    512   large |          n/a        50.30us         n/a
    16     131072   1024   large |          n/a        60.35us         n/a
    16     131072   2048   large |          n/a        62.34us         n/a      73.62us      1.18x
    16     131072   4096   large | UNSUPPORTED
    16     262144    256   large |          n/a        76.80us         n/a
    16     262144    512   large |          n/a        82.88us         n/a
    16     262144   1024   large |          n/a        82.94us         n/a
    16     262144   2048   large |          n/a       101.34us         n/a     131.07us      1.29x
    16     262144   4096   large | UNSUPPORTED
    16     524288    256   large |          n/a       134.14us         n/a
    16     524288    512   large |          n/a       134.14us         n/a
    16     524288   1024   large |          n/a       144.40us         n/a
    16     524288   2048   large |          n/a       146.43us         n/a     223.23us      1.52x
    16     524288   4096   large | UNSUPPORTED
    64        256    256   large |          n/a         6.14us         n/a
    64        512    256   large |          n/a        16.37us         n/a
    64        512    512   large |          n/a         6.18us         n/a
    64       1024    256   large |          n/a        15.47us         n/a
    64       1024    512   large |          n/a        15.42us         n/a
    64       1024   1024   large |          n/a         6.14us         n/a
    64       2048    256   large |          n/a        17.41us         n/a
    64       2048    512   large |          n/a        17.41us         n/a
    64       2048   1024   large |          n/a        17.34us         n/a
    64       2048   2048   large |          n/a         6.14us         n/a       6.14us      1.00x
    64       4096    256   large |          n/a        19.41us         n/a
    64       4096    512   large |          n/a        19.49us         n/a
    64       4096   1024   large |          n/a        19.42us         n/a
    64       4096   2048   large |          n/a        19.49us         n/a      12.29us      0.63x
    64       4096   4096   large |          n/a         6.18us         n/a
    64      16384    256   large |          n/a        23.52us         n/a
    64      16384    512   large |          n/a        23.42us         n/a
    64      16384   1024   large |          n/a        25.23us         n/a
    64      16384   2048   large |          n/a        27.65us         n/a      20.48us      0.74x
    64      16384   4096   large | UNSUPPORTED
    64      65536    256   large |          n/a        35.65us         n/a
    64      65536    512   large |          n/a        39.90us         n/a
    64      65536   1024   large |          n/a        39.94us         n/a
    64      65536   2048   large |          n/a        45.06us         n/a      45.06us      1.00x
    64      65536   4096   large | UNSUPPORTED
    64     131072    256   large |          n/a        52.16us         n/a
    64     131072    512   large |          n/a        80.83us         n/a
    64     131072   1024   large |          n/a        60.45us         n/a
    64     131072   2048   large |          n/a        64.34us         n/a      75.78us      1.18x
    64     131072   4096   large | UNSUPPORTED
    64     262144    256   large |          n/a        81.95us         n/a
    64     262144    512   large |          n/a        88.06us         n/a
    64     262144   1024   large |          n/a        88.91us         n/a
    64     262144   2048   large |          n/a       106.48us         n/a     147.46us      1.38x
    64     262144   4096   large | UNSUPPORTED
    64     524288    256   large |          n/a       170.05us         n/a
    64     524288    512   large |          n/a       170.83us         n/a
    64     524288   1024   large |          n/a       182.78us         n/a
    64     524288   2048   large |          n/a       185.31us         n/a     369.50us      1.99x
    64     524288   4096   large | UNSUPPORTED
   256        256    256   large |          n/a         8.08us         n/a
   256        512    256   large |          n/a        21.52us         n/a
   256        512    512   large |          n/a         8.18us         n/a
   256       1024    256   large |          n/a        23.55us         n/a
   256       1024    512   large |          n/a        23.42us         n/a
   256       1024   1024   large |          n/a         8.19us         n/a
   256       2048    256   large |          n/a        24.59us         n/a
   256       2048    512   large |          n/a        25.76us         n/a
   256       2048   1024   large |          n/a        23.49us         n/a
   256       2048   2048   large |          n/a         8.19us         n/a       6.14us      0.75x
   256       4096    256   large |          n/a        25.63us         n/a
   256       4096    512   large |          n/a        27.62us         n/a
   256       4096   1024   large |          n/a        28.70us         n/a
   256       4096   2048   large |          n/a        29.47us         n/a      14.34us      0.49x
   256       4096   4096   large |          n/a         8.19us         n/a
   256      16384    256   large |          n/a        35.81us         n/a
   256      16384    512   large |          n/a        35.84us         n/a
   256      16384   1024   large |          n/a        39.90us         n/a
   256      16384   2048   large |          n/a        45.86us         n/a      28.61us      0.62x
   256      16384   4096   large | UNSUPPORTED
   256      65536    256   large |          n/a        60.42us         n/a
   256      65536    512   large |          n/a        70.62us         n/a
   256      65536   1024   large |          n/a        72.85us         n/a
   256      65536   2048   large |          n/a        80.74us         n/a      63.52us      0.79x
   256      65536   4096   large | UNSUPPORTED
   256     131072    256   large |          n/a        99.30us         n/a
   256     131072    512   large |          n/a        99.33us         n/a
   256     131072   1024   large |          n/a       118.78us         n/a
   256     131072   2048   large |          n/a       123.42us         n/a     127.07us      1.03x
   256     131072   4096   large | UNSUPPORTED
   256     262144    256   large |          n/a       185.28us         n/a
   256     262144    512   large |          n/a       197.63us         n/a
   256     262144   1024   large |          n/a       199.68us         n/a
   256     262144   2048   large |          n/a       238.43us         n/a     235.52us      0.99x
   256     262144   4096   large | UNSUPPORTED
   256     524288    256   large |          n/a       334.85us         n/a
   256     524288    512   large |          n/a       334.94us         n/a
   256     524288   1024   large |          n/a       363.41us         n/a
   256     524288   2048   large |          n/a       366.77us         n/a     409.63us      1.12x
   256     524288   4096   large | UNSUPPORTED

Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/utils/test_topk.py (1)

1960-1961: Use the TopKTieBreak enum instead of raw integers for consistency.

The existing parametrized tests (Lines 106-113, 152-159, 197-204, 229-236) all pass flashinfer.TopKTieBreak.{NONE,SMALL,LARGE}. The two new tests (here and at Lines 2022-2033) pass bare integers 1 and 2, which is less readable and inconsistent with the style used throughout the rest of the file.

♻️ Proposed change
-    values_small, indices_small = flashinfer.top_k(logits, k, tie_break=1)
-    values_large, indices_large = flashinfer.top_k(logits, k, tie_break=2)
+    values_small, indices_small = flashinfer.top_k(
+        logits, k, tie_break=flashinfer.TopKTieBreak.SMALL
+    )
+    values_large, indices_large = flashinfer.top_k(
+        logits, k, tie_break=flashinfer.TopKTieBreak.LARGE
+    )

Apply the equivalent change at Lines 2022-2033.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/test_topk.py` around lines 1960 - 1961, Replace raw integer
tie_break arguments with the TopKTieBreak enum for consistency: change the calls
to flashinfer.top_k(logits, k, tie_break=1) and tie_break=2 to use
flashinfer.TopKTieBreak.SMALL and flashinfer.TopKTieBreak.LARGE respectively
(the variables values_small/indices_small and values_large/indices_large remain
the same); apply the same replacement in the other new test mentioned (the block
around the second occurrence) so all tests use flashinfer.TopKTieBreak instead
of integers.
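The semantics behind those enum values can be sketched in pure Python. The `TopKTieBreak` names and values mirror the enum added in this PR; the reference selection itself is illustrative, not FlashInfer's kernel:

```python
from enum import IntEnum

class TopKTieBreak(IntEnum):
    # Names/values mirror the enum added in this PR.
    NONE = 0
    SMALL = 1
    LARGE = 2

def reference_topk_indices(scores, k, tie_break):
    """Pick the k largest scores. On equal scores, SMALL prefers the
    smaller index and LARGE the larger; NONE leaves ties unspecified
    (this sketch falls back to SMALL to stay deterministic)."""
    if tie_break == TopKTieBreak.LARGE:
        key = lambda i: (scores[i], i)    # larger index wins a tie
    else:
        key = lambda i: (scores[i], -i)   # smaller index wins a tie
    return sorted(range(len(scores)), key=key, reverse=True)[:k]

scores = [0.5, 0.9, 0.5, 0.9]
print(reference_topk_indices(scores, 3, TopKTieBreak.SMALL))  # [1, 3, 0]
print(reference_topk_indices(scores, 3, TopKTieBreak.LARGE))  # [3, 1, 2]
```

With `k=3`, the cut falls inside the tied 0.5 group, so SMALL selects index 0 while LARGE selects index 2, which is exactly the behavior the tests above distinguish.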
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/utils/test_topk.py`:
- Around line 1952-1978: The test currently assumes a stable output ordering
from flashinfer.top_k(logits, k, tie_break=1/2) but ordering is only guaranteed
when sorted=True; either pass sorted=True into both top_k calls or (preferred)
keep top_k as-is and replace the strict torch.equal index checks with the
unordered comparator _assert_unordered_indices_match so the selection is
validated without requiring a specific permutation; keep the existing values
assertions (torch.testing.assert_close on values_small/values_large) and change
the two index assertions to call _assert_unordered_indices_match(indices_small,
expected_small) and _assert_unordered_indices_match(indices_large,
expected_large).

---

Nitpick comments:
In `@tests/utils/test_topk.py`:
- Around line 1960-1961: Replace raw integer tie_break arguments with the
TopKTieBreak enum for consistency: change the calls to flashinfer.top_k(logits,
k, tie_break=1) and tie_break=2 to use flashinfer.TopKTieBreak.SMALL and
flashinfer.TopKTieBreak.LARGE respectively (the variables
values_small/indices_small and values_large/indices_large remain the same);
apply the same replacement in the other new test mentioned (the block around the
second occurrence) so all tests use flashinfer.TopKTieBreak instead of integers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b055778-aa2a-42f4-9b49-890884a643b7

📥 Commits

Reviewing files that changed from the base of the PR and between 1714c33 and 673df9a.

📒 Files selected for processing (1)
  • tests/utils/test_topk.py

@zianglih zianglih changed the title feat: implement configurable tie_break for topk feat: implement configurable tie_break for filtered topk Apr 17, 2026
Collaborator

@aleozlx aleozlx left a comment

lgtm

@zianglih zianglih requested a review from aleozlx April 17, 2026 21:26
@zianglih
Contributor Author

@coderabbitai resume

@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

✅ Actions performed

Reviews resumed.

@aleozlx
Collaborator

aleozlx commented Apr 18, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !567 has been created, and the CI pipeline #48832531 is currently running. I'll report back once the pipeline job completes.

Member

@kahyunnam kahyunnam left a comment

This LGTM! But you will need to rebase to resolve conflicts and re-test; a recent topk change went in (#3009). @zianglih

@zianglih
Contributor Author

@kahyunnam thanks for letting me know! I will rebase. The implementation code looks fine, but bench_topk.py now has a lot of conflicts.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_topk.py (1)

690-721: ⚠️ Potential issue | 🟡 Minor

Check --tie-break before the deterministic compare-algorithms rejection.

Line 694 makes args.deterministic=True, so the Line 719 args.tie_break branch is unreachable for --compare-algorithms --tie-break; users get the deterministic error instead of the tie-break-specific error.

Proposed fix
     # Algorithm comparison mode
     if args.compare_algorithms:
+        if args.tie_break:
+            print("ERROR: --compare-algorithms does not support --tie-break")
+            return
         if args.deterministic:
             print(
                 "ERROR: --compare-algorithms is only meaningful with non-deterministic mode"
             )
             return
-        if args.tie_break:
-            print("ERROR: --compare-algorithms does not support --tie-break")
-            return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/bench_topk.py` around lines 690 - 721, The tie-break auto-enable
logic (the if args.tie_break block that sets args.deterministic = True) must not
run before the compare-algorithms validation because it masks the intended
tie-break incompatibility error; change the order or the condition so that the
compare-algorithms check (the if args.compare_algorithms block that errors when
args.deterministic is True and when args.tie_break is True) is evaluated first,
or make the tie-break auto-enable skip if args.compare_algorithms is set,
ensuring the args.tie_break path triggers its specific error rather than being
hidden by the deterministic check.
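The flag-ordering issue can be reproduced with a small argparse sketch. The flag names match the benchmark CLI, but the `validate` helper is hypothetical and `--tie-break` is simplified to a boolean here (the real flag takes a mode): checking `--tie-break` under `--compare-algorithms` first surfaces the specific error instead of the generic deterministic one.

```python
import argparse

def validate(args):
    # Run compatibility checks BEFORE --tie-break auto-enables
    # deterministic mode, so the tie-break-specific error surfaces.
    if args.compare_algorithms:
        if args.tie_break:
            return "ERROR: --compare-algorithms does not support --tie-break"
        if args.deterministic:
            return ("ERROR: --compare-algorithms is only meaningful "
                    "with non-deterministic mode")
    if args.tie_break:
        args.deterministic = True  # tie-break implies deterministic
    return None

parser = argparse.ArgumentParser()
for flag in ("--compare-algorithms", "--deterministic", "--tie-break"):
    parser.add_argument(flag, action="store_true")

args = parser.parse_args(["--compare-algorithms", "--tie-break"])
print(validate(args))  # the tie-break-specific error, not the deterministic one
```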
🧹 Nitpick comments (1)
benchmarks/bench_topk.py (1)

640-647: Consider registering the new tie-break benchmark mode in the unified benchmark framework.

This adds another custom CLI benchmark path in bench_topk.py; the benchmark cases would be easier to run consistently if wired through benchmarks/flashinfer_benchmark.py.

As per coding guidelines, benchmarks/**/*.py: Use the unified benchmarking framework in benchmarks/flashinfer_benchmark.py for kernel benchmarking with CUPTI timing support.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/bench_topk.py` around lines 640 - 647, You added a new CLI flag
"--tie-break" via parser.add_argument in bench_topk.py but did not wire this
mode into the unified benchmark framework; update the benchmark registration so
the tie-break variants are exposed through the central runner in
flashinfer_benchmark.py instead of a custom path. Specifically, detect the
tie-break option where existing modes are mapped (the code that
registers/dispatches benchmark cases) and add entries that create the
deterministic tie-break benchmark cases (tie-small/tie-large) with the same
CUPTI timing and slowdown alignment logic used by existing FlashInfer variants;
ensure the dispatch uses the same helper/factory functions used for other modes
so CUPTI setup and reporting are consistent with the rest of the benchmarks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/bench_topk.py`:
- Around line 485-502: The tie-break variant runs use a different top-k
algorithm than the baseline because set_topk_algo("auto") is active; fix by
running bench_tie_break_variants under the same algorithm used when measuring
fi_ms/fi_nondeterministic_ms (call set_topk_algo("default") or the saved
baseline algorithm before invoking bench_tie_break_variants and restore the
prior algorithm afterwards). Update both places where bench_tie_break_variants
is called (the block using compare_tie_break and the similar block at the other
location) so the top-k algorithm matches the baseline measurement when
constructing the lambda that calls flashinfer.top_k_page_table_transform.
- Around line 89-103: The lambda passed to bench_median_ms inside
bench_tie_break_variants closes over the loop variable tie_break, which can lead
to late-binding bugs; change the lambda to bind the current tie_break value as a
default parameter (e.g. use a parameter like tb=tie_break) and call
run_flashinfer_with_tie_break(tb) so each iteration captures its own tie_break;
update the call site where bench_median_ms is invoked to use this bound lambda
and keep all other metric logic the same (references: bench_tie_break_variants,
run_flashinfer_with_tie_break, TIE_BREAK_VARIANTS, bench_median_ms).
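The late-binding pitfall that comment describes is easy to demonstrate in isolation. The name `TIE_BREAK_VARIANTS` follows the benchmark code, and the bare lambdas stand in for the callables handed to `bench_median_ms`:

```python
TIE_BREAK_VARIANTS = ["small", "large"]

# BUG: each lambda closes over the loop variable itself, so after the
# loop finishes they all observe its final value.
buggy = [lambda: tb for tb in TIE_BREAK_VARIANTS]
print([f() for f in buggy])   # ['large', 'large']

# FIX: bind the current value as a default argument at definition time.
fixed = [lambda tb=tb: tb for tb in TIE_BREAK_VARIANTS]
print([f() for f in fixed])   # ['small', 'large']
```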

---

Outside diff comments:
In `@benchmarks/bench_topk.py`:
- Around line 690-721: The tie-break auto-enable logic (the if args.tie_break
block that sets args.deterministic = True) must not run before the
compare-algorithms validation because it masks the intended tie-break
incompatibility error; change the order or the condition so that the
compare-algorithms check (the if args.compare_algorithms block that errors when
args.deterministic is True and when args.tie_break is True) is evaluated first,
or make the tie-break auto-enable skip if args.compare_algorithms is set,
ensuring the args.tie_break path triggers its specific error rather than being
hidden by the deterministic check.

---

Nitpick comments:
In `@benchmarks/bench_topk.py`:
- Around line 640-647: You added a new CLI flag "--tie-break" via
parser.add_argument in bench_topk.py but did not wire this mode into the unified
benchmark framework; update the benchmark registration so the tie-break variants
are exposed through the central runner in flashinfer_benchmark.py instead of a
custom path. Specifically, detect the tie-break option where existing modes are
mapped (the code that registers/dispatches benchmark cases) and add entries that
create the deterministic tie-break benchmark cases (tie-small/tie-large) with
the same CUPTI timing and slowdown alignment logic used by existing FlashInfer
variants; ensure the dispatch uses the same helper/factory functions used for
other modes so CUPTI setup and reporting are consistent with the rest of the
benchmarks.

ℹ️ Review info

Run ID: 30686dc6-f940-40cc-bfe3-b065f285cf91

📥 Commits

Reviewing files that changed from the base of the PR and between d544b3d and 181e0e3.

📒 Files selected for processing (1)
  • benchmarks/bench_topk.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/flashinfer/topk.cuh (1)

3063-3112: ⚠️ Potential issue | 🟡 Minor

tie_break is silently dropped when deterministic == false in the lower-level launch APIs.

In DISPATCH_VEC_SIZE, the !deterministic branch hard-codes TopKTieBreak::None regardless of the incoming tie_break argument:

if (!deterministic) {
  LAUNCH_FILTERED_KERNEL(VS, false, TopKTieBreak::None);   // tie_break ignored
} else {
  ...
}

The three public wrappers (FilteredTopK, FilteredTopKPageTableTransform, FilteredTopKRaggedTransform) also accept tie_break and forward it verbatim without forcing deterministic = true. Only the top-level TopKDispatch / TopKPageTableTransformDispatch / TopKRaggedTransformDispatch flip deterministic when tie_break != None.

That means any future/internal caller that reaches the lower layer with (deterministic=false, tie_break=Small|Large) will silently get non-deterministic, non-tie-broken results — violating the documented contract that "selecting a tie-break forces deterministic execution."

Either force deterministic = true here (and in the three wrappers) when tie_break != None, or at minimum return cudaErrorInvalidValue to make the misuse loud.

🛠️ Possible fix at the launch site
   const int vec_size = ComputeFilteredTopKVecSize<DType>(max_len);
+  if (tie_break != TopKTieBreak::None && !deterministic) {
+    // Tie-break modes require deterministic emission; the non-deterministic
+    // kernel path ignores TIE_BREAK at compile time.
+    deterministic = true;
+  }

Or, stricter:

+  if (tie_break != TopKTieBreak::None && !deterministic) {
+    return cudaErrorInvalidValue;
+  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/topk.cuh` around lines 3063 - 3112, The DISPATCH_VEC_SIZE
path currently ignores the incoming tie_break when deterministic==false; update
LaunchFilteredTopKUnified so that if tie_break != TopKTieBreak::None you force
deterministic=true (or return cudaErrorInvalidValue) before dispatching the
kernel; specifically, check the tie_break parameter at the start of
LaunchFilteredTopKUnified (or inside DISPATCH_VEC_SIZE) and either set
deterministic=true when tie_break is Small or Large, or return an error, and
mirror this same enforcement in the public wrappers FilteredTopK,
FilteredTopKPageTableTransform, and FilteredTopKRaggedTransform so callers
cannot reach the lower layer with (deterministic=false, tie_break!=None).
🧹 Nitpick comments (1)
include/flashinfer/topk.cuh (1)

222-301: Add a brief justification for the per-thread-consecutive item layout.

DeterministicContiguousCollect deliberately gives thread t the ITEMS_PER_THREAD consecutive indices [t*IPT, t*IPT+IPT) within each chunk, which makes intra-warp loads strided by ITEMS_PER_THREAD rather than unit-stride. This is the right choice here (a CUB BlockScan over the per-thread counts naturally yields contiguous/REVERSE-contiguous emission order, which is exactly what row-global tie-breaking needs), but it is a non-obvious perf trade-off vs. the tx, tx+BLOCK_THREADS, … layout used by DeterministicThreadStridedCollect.

Per the coding guideline on performance-critical hot paths, it would help future readers to note this rationale (and mention the strided-layout alternative) next to the existing // TODO: maybe tune ITEMS_PER_THREAD and vectorize comment. No code change required.

As per coding guidelines: "For performance-critical hot paths, leave comments with justification for special algorithmic choices and mention alternative approaches considered".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/topk.cuh` around lines 222 - 301, Add a short comment in
DeterministicContiguousCollect (near the ITEMS_PER_THREAD/TODO comment)
explaining that the per-thread-consecutive layout (thread t owning indices
[t*ITEMS_PER_THREAD, ...)) intentionally trades unit-stride loads for intra-warp
strided loads because cub::BlockScan over per-thread selected counts yields
contiguous (or REVERSE-contiguous) global emission order required for row-global
tie-breaking; also mention the alternative strided layout (tx, tx+BLOCK_THREADS,
...) used by DeterministicThreadStridedCollect and why it was rejected here (to
preserve deterministic contiguous output ordering), so future readers understand
the performance/correctness rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@include/flashinfer/topk.cuh`:
- Around line 3063-3112: The DISPATCH_VEC_SIZE path currently ignores the
incoming tie_break when deterministic==false; update LaunchFilteredTopKUnified
so that if tie_break != TopKTieBreak::None you force deterministic=true (or
return cudaErrorInvalidValue) before dispatching the kernel; specifically, check
the tie_break parameter at the start of LaunchFilteredTopKUnified (or inside
DISPATCH_VEC_SIZE) and either set deterministic=true when tie_break is Small or
Large, or return an error, and mirror this same enforcement in the public
wrappers FilteredTopK, FilteredTopKPageTableTransform, and
FilteredTopKRaggedTransform so callers cannot reach the lower layer with
(deterministic=false, tie_break!=None).

---

Nitpick comments:
In `@include/flashinfer/topk.cuh`:
- Around line 222-301: Add a short comment in DeterministicContiguousCollect
(near the ITEMS_PER_THREAD/TODO comment) explaining that the
per-thread-consecutive layout (thread t owning indices [t*ITEMS_PER_THREAD,
...)) intentionally trades unit-stride loads for intra-warp strided loads
because cub::BlockScan over per-thread selected counts yields contiguous (or
REVERSE-contiguous) global emission order required for row-global tie-breaking;
also mention the alternative strided layout (tx, tx+BLOCK_THREADS, ...) used by
DeterministicThreadStridedCollect and why it was rejected here (to preserve
deterministic contiguous output ordering), so future readers understand the
performance/correctness rationale.

ℹ️ Review info

Run ID: 3702cdcc-e9b9-4b5c-9e74-9b0fa7b5009b

📥 Commits

Reviewing files that changed from the base of the PR and between 181e0e3 and 85d7b7e.

📒 Files selected for processing (2)
  • benchmarks/bench_topk.py
  • include/flashinfer/topk.cuh

@zianglih
Contributor Author

@coderabbitai stop

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/flashinfer/topk.cuh (1)

3080-3111: ⚠️ Potential issue | 🟡 Minor

tie_break is silently dropped on the !deterministic branch.

DISPATCH_VEC_SIZE always launches with TopKTieBreak::None when deterministic is false, regardless of the tie_break argument. The top-level TopKDispatch / TopKPageTableTransformDispatch / TopKRaggedTransformDispatch force deterministic=true whenever tie_break != None, so end users are safe — but LaunchFilteredTopKUnified and the mid-level wrappers FilteredTopK, FilteredTopKPageTableTransform, FilteredTopKRaggedTransform all expose tie_break alongside a defaulted deterministic=false. A direct caller passing tie_break=Small/Large without flipping deterministic will get ordinary non-deterministic output with no diagnostic. Prefer failing loudly (or auto-promoting to deterministic) at this layer so the invariant tie_break != None ⇒ deterministic is enforced at every entry point.

🛡️ Proposed guard
 cudaError_t LaunchFilteredTopKUnified(DType* input, IdType* output, DType* aux_output,
                                       const IdType* aux_input, int64_t aux_stride,
                                       const IdType* row_to_batch, const IdType* lengths,
                                       uint32_t num_rows, uint32_t top_k_val, uint32_t max_len,
                                       bool deterministic = false,
                                       TopKTieBreak tie_break = TopKTieBreak::None,
                                       cudaStream_t stream = 0) {
+  if (tie_break != TopKTieBreak::None && !deterministic) {
+    // Tie-break selection requires the deterministic collect path.
+    return cudaErrorInvalidValue;
+  }
   constexpr size_t smem_size = FILTERED_TOPK_SMEM_DYNAMIC;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/flashinfer/topk.cuh` around lines 3080 - 3111, The DISPATCH_VEC_SIZE
macro silently drops the provided tie_break when deterministic==false; update
LaunchFilteredTopKUnified (and callers FilteredTopK /
FilteredTopKPageTableTransform / FilteredTopKRaggedTransform) to enforce the
invariant "tie_break != TopKTieBreak::None ⇒ deterministic==true": add an
explicit guard before dispatching (e.g., if (!deterministic && tie_break !=
TopKTieBreak::None) return cudaErrorInvalidValue;) so callers who pass
tie_break=Small/Large get a clear error instead of losing the tie-break
behavior; adjust the branch logic in DISPATCH_VEC_SIZE / LAUNCH_FILTERED_KERNEL
accordingly so the deterministic path is used only when deterministic==true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@include/flashinfer/topk.cuh`:
- Around line 3080-3111: The DISPATCH_VEC_SIZE macro silently drops the provided
tie_break when deterministic==false; update LaunchFilteredTopKUnified (and
callers FilteredTopK / FilteredTopKPageTableTransform /
FilteredTopKRaggedTransform) to enforce the invariant "tie_break !=
TopKTieBreak::None ⇒ deterministic==true": add an explicit guard before
dispatching (e.g., if (!deterministic && tie_break != TopKTieBreak::None) return
cudaErrorInvalidValue;) so callers who pass tie_break=Small/Large get a clear
error instead of losing the tie-break behavior; adjust the branch logic in
DISPATCH_VEC_SIZE / LAUNCH_FILTERED_KERNEL accordingly so the deterministic path
is used only when deterministic==true.

ℹ️ Review info

Run ID: 1a727b01-c095-4a81-9994-2ba40b30ac61

📥 Commits

Reviewing files that changed from the base of the PR and between 85d7b7e and a568720.

📒 Files selected for processing (7)
  • benchmarks/bench_topk.py
  • csrc/flashinfer_topk_binding.cu
  • csrc/topk.cu
  • flashinfer/__init__.py
  • flashinfer/topk.py
  • include/flashinfer/topk.cuh
  • tests/utils/test_topk.py
✅ Files skipped from review due to trivial changes (1)
  • flashinfer/__init__.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • csrc/topk.cu
  • flashinfer/topk.py
  • csrc/flashinfer_topk_binding.cu
  • tests/utils/test_topk.py

@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

@zianglih Sure! I'll pause automated reviews for this PR.


🧠 Learnings used
Learnt from: kahyunnam
Repo: flashinfer-ai/flashinfer PR: 2965
File: tests/norm/test_fused_rmsnorm_silu.py:138-141
Timestamp: 2026-04-03T21:06:16.453Z
Learning: In `tests/norm/test_fused_rmsnorm_silu.py` (flashinfer-ai/flashinfer PR `#2965`), the full `ALL_LUT_SHAPES` test matrix (8 hidden sizes × 5 token counts, up to 399,360 tokens) across bf16, FP8, and NVFP4 is intentionally kept as the default CI parametrization. The maintainer confirmed the tests are fast and do not need to be split into a smoke subset vs. a slow marker. Do not flag this test matrix as too large for CI.

Learnt from: ishovkun
Repo: flashinfer-ai/flashinfer PR: 2865
File: include/flashinfer/mamba/kernel_selective_state_update_mtp_vertical.cuh:343-366
Timestamp: 2026-03-23T21:04:23.630Z
Learning: FlashInfer Mamba SSU MTP kernels: When applying the z gate in epilogues, index z using its own strides (z_stride_batch/z_stride_mtp), not the output strides. Verified in include/flashinfer/mamba/kernel_selective_state_update_mtp_vertical.cuh (role_epilogue) and include/flashinfer/mamba/kernel_selective_state_update_mtp_horizontal.cuh (inline epilogue).
✅ Actions performed

Reviews paused.

@kahyunnam
Member

/bot run

@flashinfer-bot
Collaborator

GitLab MR !567 has been updated with latest changes, and the CI pipeline #49096155 is currently running. I'll report back once the pipeline job completes.

Member

@kahyunnam kahyunnam left a comment

@zianglih some CI/CD unit tests are failing on RTX PRO 6000 (sm120). Please take a look; if "operation not supported" is expected here, the test script may need a change to handle this case and catch the error instead of falling through.

Test log:

==========================================
Running:  pytest --continue-on-collection-errors --junitxml=/tmp/junit/tests_utils_test_topk.py.xml "tests/utils/test_topk.py"
==========================================
============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0
rootdir: /workspace/flashinfer
configfile: pytest.ini
collected 1334 items
tests/utils/test_topk.py ............................................... [  3%]
..................................FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [  8%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 14%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF................FFFF [ 19%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFF..FFFF........FFFFFFFFFFFFFFFF.............. [ 25%]
........................................................................ [ 30%]
........................................................................ [ 35%]
........................................................................ [ 41%]
........................................................................ [ 46%]
......................................................s..s..s..s..s..s.. [ 52%]
s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s.. [ 57%]
s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..s..........sss. [ 62%]
.ss..ss..ssssssss....s...s...s...sssssss..s....ssss...s..s..s..s.s.s.... [ 68%]
...sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 73%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 79%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 84%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 89%]
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 95%]
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss          [100%]
=================================== FAILURES ===================================
E   RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
/workspace/flashinfer/csrc/topk.cu:79: RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
E   RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
/workspace/flashinfer/csrc/topk.cu:79: RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
E   RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
/workspace/flashinfer/csrc/topk.cu:79: RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
E   RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported
/workspace/flashinfer/csrc/topk.cu:79: RuntimeError: Check failed: (status == cudaSuccess) is false: TopK failed with error code operation not supported

... etc...

=========================== short test summary info ============================
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-32000-1] - Runti...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-32000-16] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-32000-64] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-65536-1] - Runti...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-65536-16] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-65536-64] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-128512-1] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-128512-16] - Run...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-256-128512-64] - Run...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-512-32000-1] - Runti...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-512-32000-16] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-512-32000-64] - Runt...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-512-65536-1] - Runti...
FAILED tests/utils/test_topk.py::test_top_k[small-dtype0-512-65536-16] - Runt...

... etc ... 

=========== 214 failed, 614 passed, 506 skipped in 104.92s (0:01:44) ===========
❌ FAILED: tests/utils/test_topk.py
==========================================
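One way for the test script to catch this error without falling through is to translate the specific CUDA failure into a skip. This is a sketch only (names like `run_or_skip` are hypothetical; in pytest the skip branch would call `pytest.skip` rather than return a marker, and the PR's actual fix may differ):

```python
def run_or_skip(fn):
    """Run a top-k call; if the kernel reports 'operation not supported'
    (as seen on sm120 in CI), report a skip instead of failing. Any
    other error still propagates."""
    try:
        return ("ok", fn())
    except RuntimeError as err:
        if "operation not supported" in str(err):
            return ("skipped", str(err))
        raise

def unsupported_kernel():
    # Mimics the CI failure message from csrc/topk.cu.
    raise RuntimeError(
        "Check failed: (status == cudaSuccess) is false: "
        "TopK failed with error code operation not supported"
    )

print(run_or_skip(lambda: 42))             # ('ok', 42)
print(run_or_skip(unsupported_kernel)[0])  # 'skipped'
```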

@zianglih
Contributor Author

Hi @kahyunnam, this is fixed by 53eb16f. Waiting on CI. Thanks!

@kahyunnam
Member

/bot run

@flashinfer-bot
Collaborator

GitLab MR !567 has been updated with latest changes, and the CI pipeline #49117211 is currently running. I'll report back once the pipeline job completes.

@zianglih
Contributor Author

The previously failing A10G and T4 CI jobs have passed.

@kahyunnam kahyunnam merged commit 5432f6d into flashinfer-ai:main Apr 22, 2026
30 of 31 checks passed