Mismatch between MatMul op in FLOAT16 and PyTorch Linear op #23272

### Describe the issue

I’ve noticed that the outputs of a PyTorch model and of the corresponding ONNX model differ when the ONNX model is run with ONNX Runtime. The model is a single bias-free `nn.Linear` layer in float16, which is exported as an ONNX `MatMul`. I know that such mismatches can occur in float32, but should I also expect them in float16 (for example, because one framework computes intermediate results in float32)? If so, what are the likely causes, and how can I resolve or minimize these differences?

Any insights or guidance on this matter would be greatly appreciated!
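To illustrate the float32-intermediate hypothesis, here is a minimal NumPy sketch that computes the same float16 product in two ways: accumulating every partial sum in float16, and accumulating in float32 and rounding to float16 once at the end. It does not describe how PyTorch or ONNX Runtime actually implement float16 matrix multiplication; `matmul_fp16_accumulate` and `matmul_fp32_accumulate` are hypothetical helpers. Both are legitimate ways to produce a float16 result, yet they are not bit-identical, and with large-magnitude inputs like those in the repro, float16 accumulation can overflow to `inf` at an intermediate step even when the final sum is representable.

```python
import numpy as np


def matmul_fp16_accumulate(x, w):
    # Hypothetical helper: every product and every partial sum is rounded to float16.
    out = np.zeros((x.shape[0], w.shape[1]), dtype=np.float16)
    with np.errstate(over="ignore"):  # intermediate overflow to inf is possible
        for k in range(x.shape[1]):
            out = out + x[:, k:k + 1] * w[k:k + 1, :]
    return out


def matmul_fp32_accumulate(x, w):
    # Hypothetical helper: upcast, accumulate in float32, round to float16 once.
    return (x.astype(np.float32) @ w.astype(np.float32)).astype(np.float16)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(64, 256)).astype(np.float16)
w = rng.uniform(-1 / 16, 1 / 16, size=(256, 128)).astype(np.float16)

ref = x.astype(np.float64) @ w.astype(np.float64)  # high-precision reference
y16 = matmul_fp16_accumulate(x, w)
y32 = matmul_fp32_accumulate(x, w)

print("bit-identical:", np.array_equal(y16, y32))
print("mean |error|, fp16 accumulation:", np.abs(y16 - ref).mean())
print("mean |error|, fp32 accumulation:", np.abs(y32 - ref).mean())

# Intermediate overflow: 60000 + 60000 exceeds the float16 maximum (65504).
x_big = np.array([[60000, 60000, -60000]], dtype=np.float16)
w_one = np.ones((3, 1), dtype=np.float16)
print(matmul_fp16_accumulate(x_big, w_one))  # [[inf]]
print(matmul_fp32_accumulate(x_big, w_one))  # [[60000.]]
```

The weights here are drawn from the same range as `nn.Linear`'s default initialization for 256 input features (±1/16), and the random inputs are kept small so that only rounding, not overflow, separates the two results.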

### To reproduce

```python
import numpy as np
import onnxruntime
import torch
import torch.nn as nn

class Dense(nn.Linear):
    def __init__(self, in_features, out_features):
        super().__init__(in_features=in_features, out_features=out_features,
                         bias=False, device="cpu", dtype=torch.float16)
        self.weight.requires_grad = False

    def forward(self, input):
        return super().forward(input)


def compare_outputs(pytorch_model, onnx_model_path, inputs):
    def _to_numpy(tensor):
        return tensor.cpu().numpy()

    # ONNX Runtime inference (default CPU execution provider)
    ort_session = onnxruntime.InferenceSession(
        onnx_model_path, providers=["CPUExecutionProvider"])
    ort_outputs = ort_session.run(None, {'x': _to_numpy(inputs)})

    # PyTorch inference
    pytorch_model.eval()
    with torch.no_grad():
        torch_outputs = [_to_numpy(pytorch_model(inputs))]

    # Fails: the two outputs are not bit-identical
    np.testing.assert_array_equal(ort_outputs, torch_outputs)


def main():
    torch.manual_seed(0)

    # Create random float16 inputs uniformly distributed in [f16_min, f16_max]
    size = (64, 256)
    x_rand_tensor = torch.rand(size, requires_grad=False, dtype=torch.float32)
    f16_min = torch.finfo(torch.float16).min + 1
    f16_max = torch.finfo(torch.float16).max - 1

    scale_factor = (f16_max - f16_min)
    offset = f16_min

    x = (x_rand_tensor * scale_factor + offset).to(torch.float16)

    # Create the model
    dense_model = Dense(256, 1024)

    onnx_model_path = "dense_model.onnx"

    torch.onnx.export(
        dense_model,
        x,
        onnx_model_path,
        opset_version=15,
        input_names=['x'],
        output_names=['output'],
    )

    print(f"[INFO] Model exported to {onnx_model_path}")
    compare_outputs(dense_model, onnx_model_path, x)


if __name__ == "__main__":
    main()

```
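Because more than one accumulation strategy is a legitimate way to produce a float16 result, exact `assert_array_equal` is a very strict check here. A common alternative is to bound the difference in float16 units in the last place (ULPs). The sketch below uses only NumPy; `ulp_distance_fp16`, `assert_close_fp16`, and the default `max_ulps=2` are hypothetical names and a hypothetical threshold, not ONNX Runtime or PyTorch APIs, and a suitable threshold depends on the reduction length and on how each backend accumulates.

```python
import numpy as np


def ulp_distance_fp16(a, b):
    """Element-wise distance between two float16 arrays, in float16 ULPs.

    NaNs are not handled; +0.0 and -0.0 are treated as equal.
    """
    a = np.asarray(a, dtype=np.float16)
    b = np.asarray(b, dtype=np.float16)

    def ordered(v):
        # Reinterpret the 16 raw bits as a signed integer, then remap negative
        # floats so that integer order matches floating-point order.
        i = v.view(np.int16).astype(np.int32)
        return np.where(i < 0, -32768 - i, i)

    return np.abs(ordered(a) - ordered(b))


def assert_close_fp16(actual, desired, max_ulps=2):
    """Raise AssertionError if any element differs by more than max_ulps ULPs."""
    dist = ulp_distance_fp16(actual, desired)
    worst = int(dist.max()) if dist.size else 0
    if worst > max_ulps:
        raise AssertionError(f"max ULP distance {worst} exceeds {max_ulps}")
```

In `compare_outputs`, the final assertion could then become `assert_close_fp16(ort_outputs[0], torch_outputs[0])`. Under this metric identical infinities are 0 ULPs apart, but 65504 and `inf` are only 1 ULP apart, so with inputs that approach the float16 range it is worth checking `np.isinf` on both outputs separately.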

### Urgency

No

### Platform

Linux

### OS Version

Ubuntu 22.04.3 LTS

### ONNX Runtime Installation

Built from Source

### ONNX Runtime Version or Commit ID

1.21.0

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_
