[Performance] Multithreading for DequantizeLinear

### Describe the issue

The current DequantizeLinear CPU operator does not use threads.

I have implemented a quick prototype that shows a 4x speed up on that operator when used with a Qwen 2.5 0.5B model 

I do see a comment about this:

https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/quantization/quantize_linear.cc#L302

@fajin-corp is this something you were planning to implement? I'd be happy to help under your guidance

### To reproduce

n/a

### Urgency

_No response_

### Platform

Windows

### OS Version

any

### ONNX Runtime Installation

Built from Source

### ONNX Runtime Version or Commit ID

main

### ONNX Runtime API

Python

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

_No response_

### Model File

_No response_

### Is this a quantized model?

Yes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] Multithreading for DequantizeLinear #23395

Describe the issue

To reproduce

Urgency

Platform

OS Version

ONNX Runtime Installation

ONNX Runtime Version or Commit ID

ONNX Runtime API

Architecture

Execution Provider

Execution Provider Library Version

Model File

Is this a quantized model?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development