Skip to content

Fix for AveragePool performance regression by guarding ceil_mode clamp#27331

Open
tengli-alaska wants to merge 4 commits intomicrosoft:mainfrom
tengli-alaska:fix/averagepool-regression
Open

Fix for AveragePool performance regression by guarding ceil_mode clamp#27331
tengli-alaska wants to merge 4 commits intomicrosoft:mainfrom
tengli-alaska:fix/averagepool-regression

Conversation

@tengli-alaska
Copy link

Description

Fixes #27190

Guards the std::min bounds clamping in AveragePool1DTask, AveragePool2DTask, and AveragePool3DTask behind a ceil_mode check to restore performance for the default ceil_mode=0 path.

Root Cause

PR #16752 added std::min clamping on hend/wend/dend inside the hot pooling loops to fix correctness when ceil_mode=1. However, these clamps run unconditionally, including when ceil_mode=0 (the default for the vast majority of models). This introduces a data-dependent loop bound that can prevent compiler auto-vectorization of the inner accumulation loop, causing a +17% to +28% kernel slowdown as reported in #27190.

Fix

When ceil_mode=0, the output dimensions are computed with floor division, which guarantees that hstart + kernel_size <= height + pad — the clamp is a no-op. This PR guards each clamp behind if (ceil_mode), restoring the original fast path while preserving the correctness fix for ceil_mode=1.

Changes

  • pool_functors.h: Added int64_t ceil_mode field to AveragePool1DTask, AveragePool2DTask, AveragePool3DTask and wrapped each std::min clamp in if (ceil_mode)
  • pool.cc: Pass pool_attrs_.ceil_mode to the three AveragePool task initializers

Benchmark Results

Regression confirmed on AMD EPYC 7543 (x86_64, Linux) using ORT's built-in profiling on averagepool_pool_11_18_average_pool_3d_valid_input:

Version Kernel Time (ms) Change
1.20.0 0.1655 baseline
1.21.0 0.1710 +3.3%

The original issue observed +17% to +28% regression. Unable to build ORT from source locally to benchmark the patched version, would appreciate if @junghyunpark2001 could validate the fix on their setup.

Notes

@tengli-alaska
Copy link
Author

@microsoft-github-policy-service agree

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a performance regression in the AveragePool operator introduced in PR #16752. The regression (17-28% kernel slowdown) was caused by unconditional std::min clamping operations in the hot pooling loops, which prevented compiler auto-vectorization. The fix guards these clamps behind ceil_mode checks, restoring performance for the default ceil_mode=0 path while preserving correctness for ceil_mode=1.

Changes:

  • Added ceil_mode field to AveragePool1DTask, AveragePool2DTask, and AveragePool3DTask structs
  • Conditionally guards all std::min clamps with if (ceil_mode) checks
  • Updated task initializers in pool.cc to pass pool_attrs_.ceil_mode

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
onnxruntime/core/providers/cpu/nn/pool_functors.h Added ceil_mode field and conditional clamp guards for 1D, 2D, and 3D average pooling tasks
onnxruntime/core/providers/cpu/nn/pool.cc Updated task initializations to pass ceil_mode parameter
.vscode/settings.json Unrelated formatting changes and addition of C_Cpp_Runner.msvcBatchPath configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@titaiwangms
Copy link
Contributor

/azp run Linux QNN CI Pipeline
/azp run Win_TRT_Minimal_CUDA_Test_CI
/azp run Windows ARM64 QNN CI Pipeline
/azp run Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@titaiwangms
Copy link
Contributor

/azp run Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@titaiwangms
Copy link
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Copy link
Contributor

@titaiwangms titaiwangms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also only discovered non-significant changes.

{
  "model": "/home/titaiwang/onnxruntime/profiling_avgpool/averagepool_pool_11_18_average_pool_3d_valid_input/averagepool_pool_11_18_average_pool_3d_valid_input.onnx",
  "operator": "AveragePool",
  "old_version": "1.20.0",
  "new_version": "1.21.0",
  "summary": {
    "kernel_old_ms": 0.191,
    "kernel_new_ms": 0.198,
    "kernel_change_ms": 0.007000000000000006,
    "kernel_change_pct": 3.6649214659685896,
    "total_old_ms": 0.201,
    "total_new_ms": 0.204,
    "is_regression": false
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance] Performance regression in AveragePool operator

2 participants