Description
First, I am still confused about how the negative instances are chosen when I use MultipleNegativesRankingLoss. At https://github.com/huggingface/sentence-transformers/blob/main/sentence_transformers/losses/MultipleNegativesRankingLoss.py#L113:
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
I assumed embeddings would contain three parts: anchor, positive, and negatives drawn from the in-batch data. However, no matter how I change the batch size, I always find len(embeddings) == 2. Does this mean embeddings only contains two parts?
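For context, here is a minimal sketch of how I currently understand the in-batch negatives to be formed; the toy tensors and the plain dot-product scoring are only placeholders for illustration, not the library's actual code:

import torch
import torch.nn.functional as F

batch_size, dim = 4, 8
anchor_emb = torch.randn(batch_size, dim)    # embeddings[0]: the anchors
positive_emb = torch.randn(batch_size, dim)  # embeddings[1]: the positives

# Row i scores anchor i against every positive in the batch.
scores = anchor_emb @ positive_emb.T         # shape (batch_size, batch_size)

# The diagonal entries are the true pairs; the off-diagonal entries act as
# in-batch negatives, so the loss reduces to cross-entropy over each row.
labels = torch.arange(batch_size)
loss = F.cross_entropy(scores, labels)

If that reading is right, then with only (question, positive) columns there are exactly two embedding groups, and the negatives for each anchor are just the positives of the other rows in the batch. Is that correct?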
Here is my simple training script; I didn't add a negatives column to the dataset:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
import json
import torch
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    InputExample,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from datasets import load_dataset, Dataset


def train_embedding_model():
    train_epo = 3
    save_path = "/app/raw_model/tmp"
    data_path = "/app/emb_train_1205.json"

    model = SentenceTransformer(
        "/app/download_models/Qwen3-Embedding-0.6B",
        model_kwargs={
            "attn_implementation": "flash_attention_2",
            "torch_dtype": "auto",
        },
    )
    model.tokenizer.padding_side = "left"
    model.tokenizer.pad_token = model.tokenizer.eos_token
    model.tokenizer.model_max_length = 2048

    dataset = load_dataset("json", data_files=data_path)
    '''
    DatasetDict({
        train: Dataset({
            features: ['question', 'positive'],
            num_rows: 4000
        })
    })
    '''

    loss = MultipleNegativesRankingLoss(model)

    args = SentenceTransformerTrainingArguments(
        output_dir=save_path,
        num_train_epochs=train_epo,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=1,
        learning_rate=5e-5,
        warmup_ratio=0.1,
        fp16=True,   # Set to False if you get an error that your GPU can't run on FP16
        bf16=False,  # Set to True if you have a GPU that supports BF16
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
        optim="adamw_torch_fused",
        logging_steps=5,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["train"],
        loss=loss,
    )
    trainer.train()
    model.save_pretrained(save_path)


if __name__ == "__main__":
    train_embedding_model()
Besides, can I manually add a list of negatives directly to the dataset while still using MultipleNegativesRankingLoss?
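Something like the following is what I have in mind (the column names and example values here are made up just to illustrate the idea; I'm not sure this is the supported format):

from datasets import Dataset

train_dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "positive": ["Paris is the capital of France."],
    "negative": ["Berlin is the capital of Germany."],
})

# My understanding is that the first column is treated as the anchor, the second
# as the positive, and any further columns as extra (hard) negatives on top of
# the in-batch negatives, but I would like to confirm whether that is correct.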