[Performance] Performance regression in GatherND operator between v1.20.0 and v1.21.0 #27053

### Describe the issue


## Description

We observed a **33% kernel-time regression** (0.003 ms → 0.004 ms) in the **GatherND** operator with **batch_dims=1** and **int32 data** between ONNX Runtime v1.20.0 and v1.21.0.

## Affected Operator

### GatherND
- **Opset Version**: 13
- **Data Type**: int32 (data), int64 (indices)
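- **Configuration**: batch_dims=1, 4D data tensor, index depth 3 (each index tuple addresses all three non-batch dimensions)
- **Data Shape**: [2, 64, 56, 56] (4D tensor)
- **Indices Shape**: [2, 16, 16, 3]
- **Output Shape**: [2, 16, 16] (equal to indices.shape[:-1], because each 3-element index tuple selects a single element)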
- **Regression**: +33% slowdown

## Test Case Details

### Test Case: `gathernd_13_v2_gathernd_int32_batch_dims_1_4d_tensor_deep_index`

**Input 0 (data):**
- Name: `input_0`
- Shape: `[2, 64, 56, 56]` (4D tensor)
- Data type: int32
- Total elements: 401,408

**Input 1 (indices):**
- Name: `input_1`
- Shape: `[2, 16, 16, 3]`
- Data type: int64
- Total elements: 1,536

**Output:**
- Name: `output`
- Shape: `[2, 16, 16]` (one gathered element per index tuple)
- Data type: int32
- Total elements: 512

**Attributes:**
```json
{
  "batch_dims": 1
}
```
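The output shape follows from the inputs and the `batch_dims` attribute via the shape rule in the ONNX GatherND specification: `indices.shape[:-1] + data.shape[batch_dims + indices.shape[-1]:]`. The sketch below applies that rule to this test case; the helper name `gathernd_output_shape` is illustrative and not part of any ONNX or ONNX Runtime API.

```python
from math import prod


def gathernd_output_shape(data_shape, indices_shape, batch_dims=0):
    """Output shape per the ONNX GatherND rule (illustrative helper)."""
    index_depth = indices_shape[-1]  # number of components in each index tuple
    if list(data_shape[:batch_dims]) != list(indices_shape[:batch_dims]):
        raise ValueError("batch dimensions of data and indices must match")
    if index_depth > len(data_shape) - batch_dims:
        raise ValueError("index tuples are longer than the non-batch rank of data")
    # Leading dims come from indices; trailing dims are whatever the
    # index tuple did not address in data.
    return list(indices_shape[:-1]) + list(data_shape[batch_dims + index_depth:])


data_shape = [2, 64, 56, 56]
indices_shape = [2, 16, 16, 3]
out_shape = gathernd_output_shape(data_shape, indices_shape, batch_dims=1)

print(out_shape, prod(out_shape))             # [2, 16, 16] 512
print(prod(data_shape), prod(indices_shape))  # 401408 1536
```

With an index depth of 3 and three non-batch data dimensions, no trailing data dimensions remain, so each index tuple yields exactly one element.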

**Performance:**
- v1.20.0: 0.003 ms (kernel time)
- v1.21.0: 0.004 ms (kernel time)
- **Regression: +33% slowdown**

## Regression Magnitude

- **Kernel time**: +33% slower (0.003 ms → 0.004 ms)
- **Total time**: +20% slower (0.005 ms → 0.006 ms)
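The percentages above are relative slowdowns computed from the reported timings. A minimal sketch of that arithmetic (the function name `slowdown_percent` is illustrative):

```python
def slowdown_percent(before_ms, after_ms):
    """Relative slowdown of after_ms versus before_ms, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0


print(round(slowdown_percent(0.003, 0.004)))  # 33  (kernel time)
print(round(slowdown_percent(0.005, 0.006)))  # 20  (total time)
```

Both deltas are 0.001 ms, which is the smallest step at the three-decimal precision of the reported values.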

## Observed Characteristics

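- **batch_dims=1**: The leading dimension (2) is treated as a batch, and each batch element is gathered independently
- **Deep indexing**: Indices with shape [2, 16, 16, 3]; each 3-component index tuple addresses all three non-batch dimensions of the 4D data, so every lookup returns a single element
- **Data tensor**: 401,408 int32 elements (about 1.6 MB)
- **Small output**: 512 elements, so per-lookup overhead rather than memory bandwidth is the likely dominant cost
- **int32 data type**: May use a different code path than float32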

## Operation Details

GatherND with batch_dims=1 means:
- The first dimension (batch=2) is shared by data and indices
- For each batch element, indices of shape [16, 16, 3] gather from data of shape [64, 56, 56]
- Each index tuple (3 elements) addresses all three remaining dimensions and selects a single int32 element
- The result is a [2, 16, 16] output tensor (indices.shape[:-1])

The kernel therefore performs 512 independent single-element reads from the data tensor; how local those reads are depends on the index values.
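The gather semantics described above can be illustrated with a small pure-Python reference on nested lists. This is a sketch of the operator's behavior as defined by the ONNX specification, not ONNX Runtime's implementation; the function name `gathernd_reference` and the tiny 2x2x2x2 example are assumptions chosen for readability.

```python
def gathernd_reference(data, indices, batch_dims=0):
    """Reference GatherND on nested lists (illustrative, not optimized)."""
    if batch_dims > 0:
        # Pair up data and indices along each leading batch dimension.
        return [
            gathernd_reference(d, i, batch_dims - 1)
            for d, i in zip(data, indices)
        ]
    if isinstance(indices[0], int):
        # `indices` is one index tuple: walk into data one dimension per component.
        item = data
        for i in indices:
            item = item[i]
        return item
    # Otherwise recurse over the outer dimensions of indices.
    return [gathernd_reference(data, sub, 0) for sub in indices]


# data[b][c][h][w] encodes its own coordinates so results are easy to read.
data = [
    [[[1000 * b + 100 * c + 10 * h + w for w in range(2)] for h in range(2)]
     for c in range(2)]
    for b in range(2)
]
# indices shape [2, 2, 3]: two full-depth index tuples per batch element.
indices = [
    [[0, 1, 1], [1, 0, 0]],
    [[1, 1, 0], [0, 0, 1]],
]

result = gathernd_reference(data, indices, batch_dims=1)
print(result)  # [[11, 100], [1110, 1001]]
```

Because each index tuple has as many components as the data has non-batch dimensions, every lookup returns a scalar and the result has the shape of the indices without their last dimension.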

### To reproduce


1. Download [Archive.zip](https://github.com/user-attachments/files/24695672/Archive.zip) and extract it.

2. Run the benchmark using the provided script:
   `python profile_operator.py gathernd_13_v2_gathernd_int32_batch_dims_1_4d_tensor_deep_index 1.20.0 1.21.0`

3. Compare the reported latencies between the two versions.
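For an independent cross-check of the comparison in step 3, one option is to time the call repeatedly and compare medians rather than single measurements. The sketch below is a generic harness using only the standard library; it is not the contents of `profile_operator.py`, and `median_latency_ms` is a hypothetical helper. In practice `fn` would wrap something like `session.run(None, feeds)` on an `onnxruntime.InferenceSession`, run once per installed version.

```python
import statistics
import time


def median_latency_ms(fn, warmup=10, repeats=100):
    """Median wall-clock latency of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with a call into the model under test.
latency = median_latency_ms(lambda: sum(range(1000)), warmup=5, repeats=50)
print(f"median latency: {latency:.4f} ms")
```

A median over many repeats is less sensitive to scheduler noise than a single run, which matters when the measured times are only a few microseconds.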

### Urgency

_No response_

### Platform

Linux

### OS Version

Ubuntu 24.04.3 LTS

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.21.0

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_

### Model File

_No response_

### Is this a quantized model?

Yes
