Describe the issue
We export ONNX transformer encoder models in OML4Py with the tokenizer attached to the bottom of the model, so the ONNX model accepts a string tensor input and returns the embedding vector. We use the tokenizer operations from onnxruntime-extensions, which are CPU-only, and have wrapped them in an ONNX graph that batches, pads and truncates the tokenized representation using the SequenceMap operation.
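For context, the structure in question can be seen by inspecting the exported graph. This is only an illustrative sketch assuming the model is saved as bert-tok.onnx; the exact node layout in the attached model may differ:

```python
import onnx

# Load the exported model and locate the SequenceMap node whose body
# wraps the CPU-only tokenizer op from onnxruntime-extensions.
model = onnx.load("bert-tok.onnx")
for node in model.graph.node:
    if node.op_type == "SequenceMap":
        body = next(attr.g for attr in node.attribute if attr.name == "body")
        print("SequenceMap body ops:", [n.op_type for n in body.node])
        # Expected to include BertTokenizer (domain ai.onnx.contrib)
        # alongside the per-element pad/truncate ops.
```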
When increasing the batch size we've noticed that much of the runtime is spent in the Loop op which SequenceMap uses, which is very odd considering it doesn't actually do very much work. After some investigation we determined that this was due to most of the ops in the tokenizer graph being placed on the GPU rather than the CPU, even though the subgraph we're looping over must be placed on the CPU due to the presence of the BertTokenizer op. We would like the whole tokenization graph, including the Loop op, to be placed on the CPU EP, but there doesn't appear to be a way to control op placement in the C API. Alternatively there might be a bug in the way that ops are assigned to the CPU, as this code looks like it should fall back the whole subgraph and its enclosing Loop to the CPU, but I don't understand it well enough to see if that's part of the issue.
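For reference, the placement we observed can be checked by enabling verbose session logging, which prints the per-node EP assignment during graph partitioning. A minimal sketch, in Python for brevity (we hit this through the C API):

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
so.log_severity_level = 0  # VERBOSE: session init logs which EP each node is assigned to

sess = ort.InferenceSession(
    "bert-tok.onnx",
    so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# In the verbose output, the nodes around the Loop/SequenceMap show up under
# CUDAExecutionProvider while the tokenizer subgraph itself stays on the CPU.
```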
To reproduce
Run the supplied bert-tok.onnx graph with a batch size of 100 and the CUDA EP enabled; most of the runtime is spent in the tokenization Loop op transferring data between CPU and GPU.
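A rough reproduction sketch, again in Python for convenience (the same behaviour shows up through the C API); the example sentence is a placeholder and the input name is read from the model rather than assumed:

```python
import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
so.enable_profiling = True  # writes a JSON trace with time per node, including memcpy nodes

sess = ort.InferenceSession(
    "bert-tok.onnx",
    so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Batch of 100 strings fed to the string-tensor input of the exported model.
texts = np.array(["example sentence to embed"] * 100)
input_name = sess.get_inputs()[0].name
embeddings = sess.run(None, {input_name: texts})[0]

trace_file = sess.end_profiling()
print("profile written to", trace_file)
```

In the resulting trace most of the wall-clock time is attributed to the Loop node and the CPU/GPU memcpy nodes inserted around it, rather than to the encoder itself.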
Urgency
This performance issue prevents the use of GPUs to accelerate our models.
Platform
Linux
OS Version
OL8
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.20.0
ONNX Runtime API
C
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
Is this a quantized model?
No