Description
Details
optimum==1.24.0
optimum-neuron==0.4.0
libneuronxla==2.2.12677.0+470fa032
neuronx-cc==2.21.18209.0+043b1bf7
neuronx-distributed==0.15.22404+1f27bddf
neuronx-distributed-inference==0.6.10598+a59fdc00
torch-neuronx==2.8.0.2.10.13553+1e4dd6ca
sentence-transformers==5.1.1
transformers==4.55.4
machine: trn2
Qwen3 Embedding model defaults to SentenceTransformers: Enable TP by using NXDI modeling code
By default, when exporting for Neuron, any model repo that contains a config_sentence_transformers.json is routed to the sentence_transformers library and traced as-is, with no support for NxDI or tensor parallelism. I've implemented the Qwen3 Embedding model with the details below, but it is not picked up as the modeling implementation unless I delete that json file, because the exporter goes down the sentence_transformers path. Is there a cleaner way to target the optimum-neuron implementation in such cases?
exporters/tasks.py:1994
elif (
    any(file_path.startswith("sentence_") for file_path in all_files)
    or "config_sentence_transformers.json" in all_files
):
    inferred_library_name = "sentence_transformers"
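For reference, the workaround I'm using today is to make a local copy of the model and delete the sentence-transformers marker files before exporting. A minimal sketch, assuming huggingface_hub is installed (the file names are the ones from the Qwen3-Embedding repo, not an official recommendation):

# Workaround sketch: copy the repo locally and remove the files that make
# the exporter infer the sentence_transformers library, then point
# optimum-cli at the local directory instead of the hub id.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path(snapshot_download("Qwen/Qwen3-Embedding-0.6B", local_dir="qwen3-embedding-local"))
for marker in ("config_sentence_transformers.json", "modules.json"):
    path = local_dir / marker
    if path.exists():
        path.unlink()  # library inference now falls back to transformers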
Original Behavior
By default, running a Qwen3 Embedding model with the commands below succeeds at compilation, but it is the original sentence_transformers model that gets traced, and tensor parallelism is not possible with it.
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 4 qwen3-embedding-0.6b-neuron/; python qwen3_embedding.py
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSentenceTransformers

tokenizer = AutoTokenizer.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
model = NeuronModelForSentenceTransformers.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
inputs = tokenizer("Sample input.", max_length=1024, padding="max_length", truncation=True, return_tensors="pt")
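For completeness, a minimal sketch of how the traced model is then queried (assuming the output exposes the usual sentence_embedding field, which is what I see locally):

# Usage sketch for the traced sentence_transformers model.
outputs = model(**inputs)
embedding = outputs.sentence_embedding  # pooled embedding, shape (batch_size, hidden_size)
print(embedding.shape)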
Modified Implementation
I've implemented the Qwen3 embedding model on top of the decoder modeling code, overriding a few forward functions. For the same model and sequence length this gives up to a 4x throughput improvement, with tensor parallelism enabled:
https://github.com/tonywngzon/optimum-neuron
This can be tested by cloning the model locally, removing modules.json from the model repo, and compiling with:
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 1 --task feature-extraction --output_hidden_states qwen3-embedding-0.6b-neuron/
from transformers import AutoTokenizer
from optimum.neuron.models.inference.qwen3.modeling_qwen3_embedding import Qwen3NxDModelForCausalLMEmbedding

model = Qwen3NxDModelForCausalLMEmbedding.from_pretrained(
    "/home/ubuntu/qwen3-embedding-0.6b-neuron",
)
tokenizer = AutoTokenizer.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
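A rough usage sketch for the modified model. The pooling below is an assumption on my part (last-token pooling over the final hidden states, as in the standard Qwen3-Embedding recipe); the exact output layout of Qwen3NxDModelForCausalLMEmbedding in my branch may differ:

# Hypothetical usage sketch; assumes the model returns hidden_states because
# the export above used --output_hidden_states.
import torch
import torch.nn.functional as F

inputs = tokenizer(
    "Sample input.",
    max_length=1024,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
outputs = model(**inputs)

# Assumed last-token pooling: take the hidden state of the last non-padding token.
last_hidden = outputs.hidden_states[-1]               # (batch, seq_len, hidden)
last_token = inputs["attention_mask"].sum(dim=1) - 1  # index of last real token
embedding = last_hidden[torch.arange(last_hidden.size(0)), last_token]
embedding = F.normalize(embedding, p=2, dim=-1)       # unit-norm embedding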