Conversation
Summary of Changes

Hello @zack041, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the robustness of the autotuner by introducing comprehensive Out-Of-Memory (OOM) error handling. It prevents the autotuner from crashing when GPU memory is exhausted during profiling, ensuring a more stable and resilient tuning process. Instead of failing, the system now clears the CUDA cache and reverts to a safe default configuration, allowing the application to continue operating without interruption.
📝 Walkthrough

Added robust exception handling in `AutoTuner.choose_one`: out-of-memory errors during profiling now clear the CUDA cache and fall back to a default tactic, while other per-tactic failures are recorded and tuning continues.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Autotuner as Autotuner.choose_one
    participant Runner as Runner.profile
    participant CUDA
    participant Cache

    Caller->>Autotuner: request best (runner,tactic)
    Autotuner->>Cache: lookup cached (runner,tactic)
    alt cache miss / needs profiling
        Autotuner->>Runner: profile(tactic_i)
        Runner->>CUDA: allocate / run kernel
        alt torch.cuda.OutOfMemoryError
            CUDA-->>Runner: OOM error
            Runner-->>Autotuner: raise OOM
            Autotuner->>CUDA: torch.cuda.empty_cache()
            Autotuner-->>Caller: return fallback (runner, tactic=-1)
        else other Exception
            Runner-->>Autotuner: exception
            Autotuner->>Cache: record failed profiling (time=∞)
            Autotuner->>Runner: continue with next tactic
        else success
            Runner-->>Autotuner: time_measured
            Autotuner->>Cache: update best (runner,tactic)
            Autotuner-->>Caller: return chosen (runner,tactic)
        end
    else cached
        Cache-->>Autotuner: cached (runner,tactic)
        Autotuner-->>Caller: return cached (runner,tactic)
    end
```
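The flow in the diagram above can be mirrored in a compact Python sketch. Everything below is illustrative: the function and parameter names are hypothetical stand-ins rather than flashinfer's actual `AutoTuner` API, and a plain exception class substitutes for `torch.cuda.OutOfMemoryError` so the sketch runs without a GPU.

```python
# Illustrative sketch only: names and signatures are hypothetical stand-ins
# for the real AutoTuner.choose_one, not flashinfer's actual API.

class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError so the sketch runs CPU-only."""

def choose_one(cache, runners, tactics, profile, empty_cache):
    key = (runners[0], tuple(tactics))
    if key in cache:
        return cache[key]                     # cached: return immediately
    best, best_time = (runners[0], -1), float("inf")
    try:
        for tactic in tactics:
            try:
                measured = profile(tactic)    # may allocate GPU memory
            except OutOfMemoryError:
                raise                         # escalate to the outer handler
            except Exception:
                continue                      # record failure, try next tactic
            if measured < best_time:
                best, best_time = (runners[0], tactic), measured
        if best_time < float("inf"):
            cache[key] = best                 # only cache a real measurement
    except OutOfMemoryError:
        empty_cache()                         # models torch.cuda.empty_cache()
        return (runners[0], -1)               # fallback default tactic
    return best
```

Note how an OOM raised inside the inner per-tactic handler is deliberately re-raised so the outer handler performs the cleanup and fallback, matching the diagram's two `alt` branches.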
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (inconclusive)
Code Review
This pull request introduces graceful Out-Of-Memory (OOM) handling during the autotuning process, which improves the robustness of the system. The implementation correctly wraps the profiling loop with a try-except block for `torch.cuda.OutOfMemoryError`, clears the CUDA cache, and falls back to a default tactic. However, the cache update and statistics increment logic is currently positioned such that it executes even for cache hits, leading to inaccurate statistics and redundant operations. It should be adjusted to run only when a new optimal configuration is successfully profiled.
```python
if runner_id is not None:
    # At least one valid (runner, tactic) pair is found
    cache_key = AutoTuner._get_cache_key(
        custom_op, runners[runner_id], p.get_opt_shapes(), tuning_config
    )
    # inspect call stack
    self.profiling_cache[cache_key] = (runner_id, tactic, p)
    self.stats.tuned_op_successful_configs[custom_op] = (
        self.stats.tuned_op_successful_configs.get(custom_op, 0) + 1
    )
    logger.debug(
        f"[Autotuner]: profiling chosen runner: {runners[runner_id]} {tactic} for {cache_key}"
    )
```
This block of code, which updates the profiling cache and statistics, is currently placed outside the `if not is_cache_hit:` condition. This means it will execute even when a configuration is retrieved from the cache, leading to incorrect `tuned_op_successful_configs` counts and redundant cache updates. It should only execute when a new best runner/tactic is found after profiling. Please move this block inside the `if not is_cache_hit:` block, at the same indentation level as `min_time = float("inf")` (line 475).
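The suggested fix can be sketched in simplified form. The names below are stand-ins for the real `AutoTuner` attributes; the point is only that the cache write and the statistics increment move inside the cache-miss branch, so a cache hit never re-counts a success.

```python
# Hypothetical sketch of the suggested fix: update the cache and statistics
# only when profiling actually ran (cache miss). Names are simplified
# stand-ins for the real AutoTuner attributes.

def lookup_or_tune(cache, stats, op, key, profile_best):
    is_cache_hit = key in cache
    if not is_cache_hit:
        runner_id, tactic = profile_best()        # profiling happens here
        if runner_id is not None:
            cache[key] = (runner_id, tactic)      # moved inside the miss branch
            stats[op] = stats.get(op, 0) + 1      # counted once per new config
    return cache[key]
```

With this placement, repeated lookups of the same key neither re-profile nor inflate the success counter.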
Force-pushed from 300008d to 2c43050.
/bot run

@flashinfer-bot run

[FAILED] Pipeline #42921952: 8/20 passed

@flashinfer-bot run
## 📌 Description

Add graceful OOM handling during autotuning. When `torch.cuda.OutOfMemoryError` occurs, the autotuner now clears the CUDA cache and falls back to the default tactic `(runners[0], -1)` instead of crashing. The try-except block wraps the entire profiling loop, covering methods like `_prepare_input_tensors()` that could also cause OOM. OOM from the inner profiling loop is re-raised so the outer exception handler can catch it.

## 🔍 Related Issues

Fixes flashinfer-ai#2357

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

No tests added because OOM during autotuning is difficult to reliably reproduce in a test environment.

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved profiling error handling so individual tactic failures are caught, logged, recorded, and do not abort tuning.
  * Added robust out-of-memory handling that clears GPU resources and falls back to safe/previous configurations instead of crashing.
  * Ensured tuning continues after non-OOM errors, preserves cache/metrics consistency, and still selects the best measured configuration when available.
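On the reviewer note about OOM being hard to reproduce: one common way to exercise an OOM fallback path in a unit test without real GPU pressure is to patch the profiling call so it raises the error. The sketch below is purely illustrative: `FakeTuner` and `FakeOOM` are hypothetical stand-ins, not flashinfer code, and a plain exception class substitutes for `torch.cuda.OutOfMemoryError` so the test runs without torch.

```python
# Hedged sketch: exercising an OOM fallback path by patching the profiling
# call to raise. FakeTuner/FakeOOM are hypothetical stand-ins, not flashinfer.
from unittest import mock

class FakeOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in environments without torch."""

class FakeTuner:
    def __init__(self):
        self.cache_cleared = False

    def _profile(self, tactic):
        raise NotImplementedError  # replaced by mock.patch.object below

    def choose_one(self, runners, tactics):
        try:
            for tactic in tactics:
                self._profile(tactic)
        except FakeOOM:
            self.cache_cleared = True      # models torch.cuda.empty_cache()
            return (runners[0], -1)        # fallback default tactic
        return (runners[0], tactics[0])

tuner = FakeTuner()
with mock.patch.object(tuner, "_profile", side_effect=FakeOOM):
    result = tuner.choose_one(["runner0"], [0, 1, 2])
```

The same `mock.patch.object` pattern could target the real profiling method and assert on the returned `(runner, -1)` fallback, sidestepping the need to actually exhaust GPU memory.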