Description
Details
optimum==1.24.0
optimum-neuron==0.4.0
libneuronxla==2.2.12677.0+470fa032
neuronx-cc==2.21.18209.0+043b1bf7
neuronx-distributed==0.15.22404+1f27bddf
neuronx-distributed-inference==0.6.10598+a59fdc00
torch-neuronx==2.8.0.2.10.13553+1e4dd6ca
sentence-transformers==5.1.1
transformers==4.55.4
machine: trn2
Qwen3 Embedding model defaults to SentenceTransformers: Enable TP by using NXDI modeling code
By default, when exporting for Neuron, any model repo that contains a config_sentence_transformers.json is routed to the sentence_transformers library and traced as-is, with no support for NxDI or tensor parallelism. I've implemented the Qwen3 Embedding model with the details below, but it is not picked up as the modeling implementation unless I delete that json file, because the exporter goes down the sentence_transformers path. Is there a cleaner way to target the optimum-neuron implementation in such cases?
exporters/tasks.py:1994
elif (
    any(file_path.startswith("sentence_") for file_path in all_files)
    or "config_sentence_transformers.json" in all_files
):
    inferred_library_name = "sentence_transformers"
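For reference, the workaround I'm using today is to make a local copy of the model and delete the sentence-transformers marker files before exporting. A minimal sketch, assuming huggingface_hub is installed (the file names are the ones from the Qwen3-Embedding repo, not an official recommendation):

# Workaround sketch: copy the repo locally and remove the files that make
# the exporter infer the sentence_transformers library, then point
# optimum-cli at the local directory instead of the hub id.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = Path(snapshot_download("Qwen/Qwen3-Embedding-0.6B", local_dir="qwen3-embedding-local"))
for marker in ("config_sentence_transformers.json", "modules.json"):
    path = local_dir / marker
    if path.exists():
        path.unlink()  # library inference now falls back to transformers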
Original Behavior
By default, running a Qwen3 Embedding model with the commands below succeeds at compilation, but it is the original sentence_transformers model that gets traced, and tensor parallelism is not possible with it.
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 4 qwen3-embedding-0.6b-neuron/; python qwen3_embedding.py
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSentenceTransformers

tokenizer = AutoTokenizer.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
model = NeuronModelForSentenceTransformers.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
inputs = tokenizer("Sample input.", max_length=1024, padding="max_length", truncation=True, return_tensors="pt")
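For completeness, a minimal sketch of how the traced model is then queried (assuming the output exposes the usual sentence_embedding field, which is what I see locally):

# Usage sketch for the traced sentence_transformers model.
outputs = model(**inputs)
embedding = outputs.sentence_embedding  # pooled embedding, shape (batch_size, hidden_size)
print(embedding.shape)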
Modified Implementation
I've implemented the Qwen3 embedding model on top of the decoder modeling code, overriding a few forward functions. For the same model and sequence length this gives up to a 4x throughput improvement, with tensor parallelism enabled:
https://github.com/tonywngzon/optimum-neuron
This can be tested by cloning the model locally, removing modules.json from the model repo, and compiling with:
optimum-cli export neuron --model Qwen/Qwen3-Embedding-0.6B --batch_size 1 --sequence_length 1024 --auto_cast matmul --instance_type trn2 --tensor_parallel_size 1 --task feature-extraction --output_hidden_states qwen3-embedding-0.6b-neuron/
from transformers import AutoTokenizer
from optimum.neuron.models.inference.qwen3.modeling_qwen3_embedding import Qwen3NxDModelForCausalLMEmbedding

model = Qwen3NxDModelForCausalLMEmbedding.from_pretrained(
    "/home/ubuntu/qwen3-embedding-0.6b-neuron",
)
tokenizer = AutoTokenizer.from_pretrained("/home/ubuntu/qwen3-embedding-0.6b-neuron")
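A rough usage sketch for the modified model. The pooling below is an assumption on my part (last-token pooling over the final hidden states, as in the standard Qwen3-Embedding recipe); the exact output layout of Qwen3NxDModelForCausalLMEmbedding in my branch may differ:

# Hypothetical usage sketch; assumes the model returns hidden_states because
# the export above used --output_hidden_states.
import torch
import torch.nn.functional as F

inputs = tokenizer(
    "Sample input.",
    max_length=1024,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
outputs = model(**inputs)

# Assumed last-token pooling: take the hidden state of the last non-padding token.
last_hidden = outputs.hidden_states[-1]               # (batch, seq_len, hidden)
last_token = inputs["attention_mask"].sum(dim=1) - 1  # index of last real token
embedding = last_hidden[torch.arange(last_hidden.size(0)), last_token]
embedding = F.normalize(embedding, p=2, dim=-1)       # unit-norm embedding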