
How are negative instances chosen when using MultipleNegativesRankingLoss to train an embedding model? #3585

@4daJKong

Description


Firstly, I am still confused about how the negative instances are chosen when I use MultipleNegativesRankingLoss. In https://github.com/huggingface/sentence-transformers/blob/main/sentence_transformers/losses/MultipleNegativesRankingLoss.py#L113
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
I expected embeddings to include three parts: anchor, positive, and negatives drawn from the in-batch data. However, no matter how I change the batch size, I still find len(embeddings) == 2. Does that mean embeddings only contains two parts?
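For reference, my understanding is that the loss builds one embedding tensor per dataset column, so a (question, positive) dataset yields len(embeddings) == 2, and the negatives for each anchor are simply the positives of the other rows in the same batch. Here is a minimal sketch of that in-batch mechanism; the batch size, embedding dimension, and random tensors are just placeholders, and the 20.0 factor mirrors the loss's default cosine-similarity scale:

import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 8 (anchor, positive) pairs with 768-dim embeddings.
anchors = torch.randn(8, 768)
positives = torch.randn(8, 768)

# Similarity of every anchor against every positive in the batch -> an (8, 8) matrix.
scores = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) * 20.0

# For anchor i the true match is positives[i]; the other 7 positives in the batch
# act as the negatives. So only two embedding tensors are ever needed.
labels = torch.arange(8)
loss = F.cross_entropy(scores, labels)

If this reading is right, increasing the batch size increases the number of in-batch negatives per anchor, but not the length of the embeddings list.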

Here is my simple training script; I didn't add a negative column to the dataset:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # must be set before any CUDA initialization
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from datasets import load_dataset

def train_embedding_model():
    train_epo = 3
    save_path = "/app/raw_model/tmp"
    data_path = "/app/emb_train_1205.json"
    model = SentenceTransformer(
        "/app/download_models/Qwen3-Embedding-0.6B",
        model_kwargs={
            "attn_implementation": "flash_attention_2",
            "torch_dtype": "auto",
        },
    )
    model.tokenizer.padding_side = "left"
    model.tokenizer.pad_token = model.tokenizer.eos_token
    model.tokenizer.model_max_length = 2048

    dataset = load_dataset("json", data_files=data_path)
    '''
    DatasetDict({
        train: Dataset({
            features: ['question', 'positive'],
            num_rows: 4000
        })
    })
    '''
    loss = MultipleNegativesRankingLoss(model)
    args = SentenceTransformerTrainingArguments(
        output_dir=save_path,
        num_train_epochs=train_epo,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=1,
        learning_rate=5e-5,
        warmup_ratio=0.1,
        fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
        bf16=False,  # Set to True if you have a GPU that supports BF16
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
        optim="adamw_torch_fused",
        logging_steps=5,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["train"],  # evaluating on the training split for now
        loss=loss,
    )
    trainer.train()
    model.save_pretrained(save_path)

if __name__ == "__main__":
    train_embedding_model()

Besides, can I manually add a list of negatives directly to the dataset while still using MultipleNegativesRankingLoss?
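From what I can tell, the loss is documented to accept (anchor, positive, negative_1, ..., negative_n) rows, so a third column would be treated as an explicit hard negative on top of the in-batch ones. A toy sketch of what such a dataset might look like (the strings and column names here are made up; only "question" and "positive" match my actual data):

from datasets import Dataset

# Hypothetical triplet rows: the extra "negative" column supplies one hard negative
# per anchor, while the other positives in the batch still serve as in-batch negatives.
train_dataset = Dataset.from_dict({
    "question": ["What is the capital of France?", "Who wrote Hamlet?"],
    "positive": ["Paris is the capital of France.", "Hamlet was written by William Shakespeare."],
    "negative": ["Berlin is the capital of Germany.", "War and Peace was written by Leo Tolstoy."],
})

If that's correct, len(embeddings) in the loss would then be 3 instead of 2. Could someone confirm?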
