[Performance] Performance regression in Hardmax operator with axis=1 between v1.18.0 and v1.19.0 #27173

### Describe the issue

## Description

We observed a performance regression in the **Hardmax** operator when using an **explicit axis=1** configuration between ONNX Runtime v1.18.0 and v1.19.0. The regression is **axis-configuration specific**: a negative-axis test case also regressed, while default-axis configurations show either no regression or an improvement.

## Affected Operator

### Hardmax
- **Opset Version**: 13
- **Data Type**: float32
- **Configuration**: explicit axis=1
- **Regression**: +8.85% kernel slowdown (+14.42% total time)

## Test Case Details

### Test Case: `hardmax_13_v3_test_hardmax_float32_axis1`

**Input:**
- **input** tensor:
  - Data type: float32
  - Shape: [2, 64, 56, 56]
  - Total elements: 401,408

**Attributes:**
- **axis**: 1 (explicit, non-last axis)

**Output:**
- Data type: float32
- Shape: [2, 64, 56, 56]

**Operation:**
Computes hardmax (one-hot encoding of argmax) along axis 1.
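For reference, the expected semantics can be sketched in NumPy. This is a minimal illustration of the operator definition (1 at the first maximum along the axis, 0 elsewhere), not ONNX Runtime's kernel; the function name `hardmax` and the random input are assumptions made for the example.

```python
import numpy as np

def hardmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Reference Hardmax: 1 at the first maximum along `axis`, 0 elsewhere."""
    # np.argmax returns the first occurrence, which matches the tie rule.
    idx = np.argmax(x, axis=axis)
    out = np.zeros_like(x)
    # Scatter a 1 at the argmax position of every slice along `axis`.
    np.put_along_axis(out, np.expand_dims(idx, axis), 1, axis=axis)
    return out

# Same shape, dtype, and axis as the regressed test case (random data is an assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 56, 56)).astype(np.float32)
y = hardmax(x, axis=1)
```

Each of the 2 x 56 x 56 slices along axis 1 then holds exactly one 1 among its 64 entries.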

**Performance:**
- v1.18.0: 17.05 ms (kernel time)
- v1.19.0: 18.56 ms (kernel time)
- **Kernel regression: +8.85% slowdown**
- **Total time regression: +14.42% slowdown**
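As a sanity check, the kernel slowdown can be recomputed from the two timings above. The helper name `slowdown_percent` is hypothetical. With the rounded timings as printed, the result is about +8.86%, which is close to the reported +8.85%; the small gap is presumably because the reported figure was computed from unrounded measurements (an assumption).

```python
def slowdown_percent(baseline_ms: float, current_ms: float) -> float:
    """Relative change of current vs. baseline, in percent (positive = slower)."""
    return (current_ms - baseline_ms) / baseline_ms * 100.0

# Kernel times from the report: v1.18.0 vs. v1.19.0.
kernel_slowdown = slowdown_percent(17.05, 18.56)
print(f"kernel slowdown: {kernel_slowdown:+.2f}%")
```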

## Regression Characteristics

### Axis-Specific Regression

**REGRESSED** (explicit axis=1):
- `hardmax_13_v3_test_hardmax_float32_axis1`: +8.85% slowdown
  - Shape: [2, 64, 56, 56], axis=1

**REGRESSED** (negative axis):
- `hardmax_13_v3_test_hardmax_float32_negative_axis`: +6.24% slowdown

**NOT REGRESSED** (default axis):
- `hardmax_13_v3_test_hardmax_basic_float32_default_axis`: +0.12% (stable)
  - Shape: [2, 3, 32, 32], default axis

**IMPROVED** (default axis, larger tensor):
- `hardmax_hardmax_13_hardmax_default_axis_float32_4d`: -65.33% (improvement)
  - Shape: [2, 64, 28, 28], default axis


### To reproduce

1. Download the zip file:

   [Archive.zip](https://github.com/user-attachments/files/24885498/Archive.zip)

2. Run the benchmark using the provided script:

   ```bash
   python script_profiling.py hardmax_13_v3_test_hardmax_float32_axis1 1.18.0 1.19.0
   ```

### Urgency

_No response_

### Platform

Linux

### OS Version

Ubuntu 24.04.3 LTS

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.19

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_

### Model File

_No response_

### Is this a quantized model?

Yes
