[Performance] Performance regression in Mod operator for float32 with fmod=1 between v1.20.0 and v1.21.0 #27152

### Describe the issue

## Description

We observed a performance regression in the **Mod** operator when using the **float32 data type with the fmod=1 attribute** between ONNX Runtime v1.20.0 and v1.21.0. The regression is **specific to this configuration**: the integer configurations we tested (int32 and int64 with fmod=0) are not affected.

## Affected Operator

### Mod
- **Opset Version**: 13
- **Data Type**: float32 (regressed)
- **Attribute**: fmod=1 (regressed)
- **Regression**: +21% to +138% kernel slowdown across the test cases below
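
For context on the attribute: in the ONNX Mod specification, fmod=1 selects C-style `fmod` semantics, where the result takes the sign of the dividend, while the default fmod=0 (integer types only) yields a result with the sign of the divisor. A minimal sketch of the difference, using only the Python standard library as a stand-in for the operator (it does not call ONNX Runtime, and the helper names `mod_fmod1` and `mod_fmod0` are made up for illustration):

```python
import math


def mod_fmod1(a: float, b: float) -> float:
    """C-style remainder (fmod=1): the result has the sign of the dividend."""
    return math.fmod(a, b)


def mod_fmod0(a: int, b: int) -> int:
    """Integer modulo (fmod=0): the result has the sign of the divisor."""
    return a % b


print(mod_fmod1(-7.5, 2.0))  # -1.5
print(mod_fmod1(7.5, -2.0))  # 1.5
print(mod_fmod0(-7, 2))      # 1
print(mod_fmod0(7, -2))      # -1
```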

## Test Case Details

### Test Case 1: `mod_13_v2_mod_float32_fmod_one_large_4d`

**Inputs:**
- **input_0** tensor:
  - Data type: **float32** (type=1)
  - Shape: [2, 64, 56, 56]

- **input_1** tensor:
  - Data type: **float32** (type=1)
  - Shape: [2, 64, 56, 56]

**Attributes:**
- **fmod**: 1 (C-style fmod semantics)

**Output:**
- Data type: float32
- Shape: [2, 64, 56, 56]
- Element-wise modulo operation

**Performance:**
- v1.20.0: 3448.4 ms (kernel time)
- v1.21.0: 4184.8 ms (kernel time)
- **Kernel regression: +21.4% slowdown**
- **Total time regression: +21.3% slowdown**
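
The percentages in this report are relative slowdowns, i.e. (new - old) / old. A small sketch of that arithmetic applied to the rounded kernel times above (the helper name `slowdown_pct` is hypothetical and not part of the benchmark script). Percentages for other test cases may differ in the last digit when recomputed from the rounded times shown, presumably because they were derived from unrounded measurements.

```python
def slowdown_pct(old_ms: float, new_ms: float) -> float:
    """Relative slowdown of new_ms versus old_ms, in percent."""
    return (new_ms - old_ms) / old_ms * 100.0


# Test Case 1 kernel times: v1.20.0 vs v1.21.0
print(f"{slowdown_pct(3448.4, 4184.8):+.1f}%")  # +21.4%
```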

### Test Case 2: `mod_13_v3_test_mod_basic_float32_fmod`

**Inputs:**
- **A** tensor:
  - Data type: **float32** (type=1)
  - Shape: [2, 64, 56, 56]

- **B** tensor:
  - Data type: **float32** (type=1)
  - Shape: [2, 64, 56, 56]

**Attributes:**
- **fmod**: 1

**Performance:**
- v1.20.0: 3453.2 ms (kernel time)
- v1.21.0: 4184.7 ms (kernel time)
- **Kernel regression: +21.2% slowdown**

### Test Case 3: `mod_13_v3_test_mod_mixed_shape_broadcast_float32`

**Inputs:**
- **A** tensor:
  - Data type: **float32** (type=1)
  - Shape: [1, 3, 32, 32]

- **B** tensor:
  - Data type: **float32** (type=1)
  - Shape: [2, 3, 32, 32]

**Attributes:**
- **fmod**: 1

**Output:**
- Shape: [2, 3, 32, 32] (broadcast result)

**Performance:**
- v1.20.0: 0.126 ms (kernel time)
- v1.21.0: 0.266 ms (kernel time)
- **Kernel regression: +110.2% slowdown**
- **Total time regression: +102.2% slowdown**
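
Test Case 3 relies on ONNX's NumPy-style multidirectional broadcasting: shapes are aligned from the trailing dimension, and a dimension of size 1 is stretched to match the other input. A minimal sketch of that shape rule in plain Python (the function `broadcast_shape` is a hypothetical illustration, not ONNX Runtime code):

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, NumPy-style."""
    out = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible dimensions: {da} vs {db}")
        # A size-1 dimension is stretched to the other input's size.
        out.append(db if da == 1 else da)
    return out[::-1]


print(broadcast_shape([1, 3, 32, 32], [2, 3, 32, 32]))  # [2, 3, 32, 32]
```

Here A's leading dimension of 1 is stretched to 2, whereas the inputs of Test Cases 1 and 2 already have identical shapes.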

### Test Case 4: `mod_mod_13_mod_fmod1_float32_negative_divisor`

**Inputs:**
- **X** tensor:
  - Data type: **float32** (type=1)
  - Shape: [8, 128]

- **Y** tensor:
  - Data type: **float32** (type=1)
  - Shape: [8, 128]

**Attributes:**
- **fmod**: 1

**Performance:**
- v1.20.0: 5.18 ms (kernel time)
- v1.21.0: 12.33 ms (kernel time)
- **Kernel regression: +138.1% slowdown**

## Regression Characteristics

### Configuration-Specific Regression

**REGRESSED** (float32 + fmod=1):
- `mod_13_v2_mod_float32_fmod_one_large_4d`: +21.4% slowdown
- `mod_13_v3_test_mod_basic_float32_fmod`: +21.2% slowdown
- `mod_13_v3_test_mod_mixed_shape_broadcast_float32`: +110.2% slowdown (broadcast)
- `mod_mod_13_mod_fmod1_float32_negative_divisor`: +138.1% slowdown

**NOT REGRESSED** (int32 + fmod=0):
- `mod_13_v2_mod_int32_default_attribute_large_4d`: -2.9% (improved)
- `mod_13_v2_mod_int32_mixed_signs_fmod_zero_2d`: No regression

**NOT REGRESSED** (int64 + fmod=0):
- `mod_13_v2_mod_int64_explicit_fmod_zero_3d`: No regression

### Key Characteristics
- **Configuration-specific**: Only float32 with fmod=1 affected
- **Opset version**: Version 13
- **Shape-dependent**: The broadcast case (+110%) and the small [8, 128] case (+138%) regress far more than the large non-broadcast 4D cases (+21%)
- **Persistence**: The regression persists in the latest version tested (1.23.0)
- **Partial recovery**: v1.23.0 shows a 3-5% improvement over v1.21.0 but is still slower than v1.20.0

### To reproduce

1. Download and extract the zip file:

[Archive.zip](https://github.com/user-attachments/files/24859572/Archive.zip)

2. Run benchmark using the provided script:
   ```bash
   python script_profiling.py mod_13_v3_test_mod_mixed_shape_broadcast_float32 1.20.0 1.21.0
   ```

### Urgency

_No response_

### Platform

Linux

### OS Version

Ubuntu 24.04.3 LTS

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.21.0

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_

### Model File

_No response_

### Is this a quantized model?

Yes
