Borrowing the figure from ColBERT: broadly speaking, models that compute similarity between sentences come in roughly four architectures:
The first is the "dual-encoder" (two-tower) model: the Query and the Document are embedded separately, and feature interaction happens only at the end, usually as a simple cosine similarity between the two embeddings. Document embeddings can be precomputed offline, their storage footprint is small, serving only needs the cheap cosine computation, and ANN (Approximate Nearest Neighbor) search can accelerate it further. It scales, recalling from corpora of tens of thousands up to millions or tens of millions of documents.
In the remaining architectures, Query and Document features are fused starting from shallow layers. This certainly beats the dual-encoder on quality, but early fusion means every Query-Document pair must be computed jointly, which is expensive and does not scale.
Hence retrieval + rerank two-stage retrieval: stage one uses a dual-encoder to recall a large candidate set (say, top-100); stage two scores each recalled candidate jointly with the Query to get a more precise final ranking.
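The two-stage flow above can be sketched in a few lines of numpy; `rerank_fn` here is a stand-in for a real joint (cross-encoder) scorer, and all names are illustrative:

```python
import numpy as np

def retrieve_then_rerank(query_emb, doc_embs, rerank_fn, k_recall=100, k_final=10):
    """Two-stage retrieval sketch: stage 1 recalls top-k by cosine
    similarity against precomputed document embeddings; stage 2
    rescores each (query, doc) candidate pair with a joint model."""
    # Stage 1: cosine similarity = dot product of L2-normalized vectors.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q
    recall_ids = np.argsort(-sims)[:k_recall]
    # Stage 2: pairwise scoring over the small candidate set only.
    scored = [(i, rerank_fn(query_emb, doc_embs[i])) for i in recall_ids]
    scored.sort(key=lambda x: -x[1])
    return [i for i, _ in scored[:k_final]]
```

In practice stage 1 would be served by an ANN index (e.g. FAISS) rather than the brute-force matrix product shown here.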
- Sun, 27 Nov 2022 Dense Text Retrieval based on Pretrained Language Models: A Survey
- This 2022 survey of Dense Text Retrieval already has 351 citations
- It counts 6 earlier surveys among them. Fair enough
- Mon, 27 May 2024 Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark
- Mon, 28 Jul 2025 On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
Although dense retrieval became the mainstream of retrieval models from 2020 onward, traditional retrieval algorithms such as BM25 still recall keywords and domain-specific terms well, and remain an effective complement to dense retrieval.
- 2009 The probabilistic relevance framework: Bm25 and beyond
- Stephen Robertson, Hugo Zaragoza, et al. 2009.
- Foundations and Trends in Information Retrieval, 3(4):333–389.
- 2020 Which bm25 do you mean? a large-scale reproducibility study of scoring variants
- Chris Kamphuis, Arjen P. de Vries, Leonid Boytsov, and Jimmy Lin. 2020.
- In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II, pages 28–34. Springer.
- Thu, 4 Jul 2024 BM25S: Orders of magnitude faster lexical search via eager sparse scoring
- We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy.
- It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products.
- Single-threaded it rivals Lucene; genuinely fast
- Fri, 22 Mar 2024 Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers
- Mon, 1 Jul 2024 Searching for Best Practices in Retrieval-Augmented Generation
- Taking efficiency into consideration, Hybrid Search combines sparse retrieval (BM25) and dense retrieval (Original embedding) and achieves notable performance with relatively low latency.
Where the field's odd dataset construction and evaluation methodology came from.
- Fri, 31 Mar 2017 Reading Wikipedia to Answer Open-Domain Questions
- Machine reading at scale (MRS).
- Using Wikipedia articles as the knowledge source causes the task of question answering (QA) to combine the challenges of both large-scale open-domain QA and of machine comprehension of text.
- In this paper, we show how multiple existing QA datasets can be used to evaluate MRS by requiring an open-domain system to perform well on all of them at once.
- In the following we describe our system DrQA for MRS which consists of two components:
- (1) the Document Retriever module for finding relevant articles and
- (2) a machine comprehension model, Document Reader, for extracting answers from a single document or a small collection of documents.
- The Retriever + Reader architecture was proposed at least as early as 2017
- Document Retriever
- we use an efficient (non-machine learning) document retrieval system to first narrow our search space and focus on reading only articles that are likely to be relevant.
- Document Reader
- Our Document Reader model is inspired by the recent success of neural network models on machine comprehension tasks, in a similar spirit to the AttentiveReader described in (Hermann et al., 2015; Chen et al., 2016).
- RNN model predicting the two ends of the span.
- Experimental Setup:
- (WebQuestions 2013(WQ), CuratedTREC 2015(TREC), WikiMovies 2016, SQuAD v1.1 2016)
- Evidence Corpus
- We use the 2016-12-21 dump of English Wikipedia for all of our full-scale experiments as the knowledge source used to answer questions.
- Fri, 5 May 2017 Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
- BiLSTM with mean/max pooling, good grief
- head (u, v, |u − v|, u ∗ v) -> fully-connected layers -> 3-way softmax
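The (u, v, |u − v|, u ∗ v) head can be sketched in numpy; a single fully-connected layer stands in for the paper's classifier stack, and `W`, `b` are hypothetical trained weights:

```python
import numpy as np

def interaction_features(u, v):
    """(u, v, |u - v|, u * v): the sentence-pair feature vector fed
    to the fully-connected classifier head in InferSent-style NLI
    training."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(u, v, W, b):
    """One fully-connected layer + 3-way softmax over the NLI labels
    (entailment / neutral / contradiction)."""
    return softmax(W @ interaction_features(u, v) + b)
```

For d-dimensional sentence embeddings the feature vector is 4d-dimensional, so `W` has shape (3, 4d).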
- Sat, 1 Jun 2019 Latent Retrieval for Weakly Supervised Open Domain Question Answering
- Is it a bit unfair to relegate ORQA to the history section while listing DPR as the first retrieval-model paper?
- open domain question answering (QA)
- Due to recent advances in reading comprehension systems,
- there has been a revival of interest in open domain question answering (QA),
- where the evidence must be retrieved from an open corpus, rather than being given as input.
- This presents a more realistic scenario for practical applications.
- However, QA is fundamentally different from IR (Singh, 2012).
- Retriever component
- Query and Document use different models, i.e., two models are trained; inner product is the similarity function. Embeddings are projected down to 128 dimensions.
- The retriever's model architecture is much the same as DPR's
- Reader component
- The reader is a span-based variant of the reading comprehension model proposed in Devlin et al. (2018):
- The reader's model architecture is much the same as DPR's
- Inverse Cloze Task
- Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task.
- Although the retriever component's architecture is much the same as DPR's, the difference lies in the loss: DPR uses a metric-learning approach.
- Experimental Setup:
- Evidence Corpus
- We use the English Wikipedia snapshot from December 20, 2018 as the evidence corpus.
- The corpus is greedily split into chunks of at most 288 wordpieces based on BERT’s tokenizer, while preserving sentence boundaries.
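This greedy packing can be sketched as follows, using whitespace tokens as a stand-in for BERT wordpieces (the 288-unit budget and the sentence-boundary rule are from the paper; the tokenization is not):

```python
def greedy_chunks(sentences, max_len=288, length_fn=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_len
    length units, starting a new chunk when the next sentence would
    overflow; sentence boundaries are never split."""
    chunks, current, used = [], [], 0
    for sent in sentences:
        n = length_fn(sent)
        if current and used + n > max_len:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sent)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a real BERT tokenizer, `length_fn` would count wordpieces instead of whitespace tokens.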
- We train and evaluate on data from 5 existing question answering or reading comprehension datasets.
- (Natural Questions 2019 (NQ), WebQuestions 2013 (WQ), CuratedTREC 2015 (TREC), TriviaQA 2017, SQuAD v1.1 2016), the same as DPR
- we convert them to open formats, following DrQA (Chen et al., 2017).
- Natural Questions
- we only keep questions with short answers and discard the given evidence document.
- Answers with many tokens often resemble extractive snippets rather than canonical answers, so we discard answers with more than 5 tokens.
- WebQuestions
- The answers are annotated with respect to Freebase, but we only keep the string representation of the entities.
- CuratedTrec
- TriviaQA
- We use their unfiltered set and discard their distantly supervised evidence
- SQuAD
| Dataset | Train | Dev | Test |
|---|---|---|---|
| Natural Questions | 79168 | 8757 | 3610 |
| WebQuestions | 3417 | 361 | 2032 |
| CuratedTrec | 1353 | 133 | 694 |
| TriviaQA | 78785 | 8837 | 11313 |
| SQuAD | 78713 | 8886 | 10570 |
- Main Results
- BM25 + BERT does well on TriviaQA and SQuAD
- ORQA (ours) does well on Natural Questions, WebQuestions, and CuratedTrec
- Thu, 22 Aug 2019 Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering
- previous work defines passages as articles, paragraphs, or sentences. However, the question of proper granularity of passages is still underexplored.
- (The granularity question for RAG chunking has been debated from 2019 all the way to 2024; every time base models gain reasoning ability or context length, it gets relitigated
- we find that splitting articles into passages with the length of 100 words by sliding window improves performance by 4%.
- We set the window size as 100 words, and the stride as 50 words(half the window size).
- Remember this: 100 words per passage
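The 100-word window / 50-word stride split can be sketched as below; keeping a final passage shorter than one window is my choice for the edge case, which the paper does not spell out:

```python
def sliding_passages(words, window=100, stride=50):
    """Split a token list into overlapping passages: window=100 words,
    stride=50 (half the window), as in Multi-passage BERT. A tail
    shorter than one window is kept as its own (shorter) passage."""
    passages = []
    for start in range(0, len(words), stride):
        passages.append(words[start:start + window])
        if start + window >= len(words):
            break
    return passages
```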
- Passage ranker reranks all retrieved passages, and selects a list of high-quality passages for the multi-passage BERT model.
- First, the retriever returns top-100 passages for each question. Then, the passage ranker is employed to rerank these 100 passages. Finally, multi-passage BERT takes top30 reranked passages as input to pinpoint the final answer.
- (The reranker idea also appeared very early
- we use the 2016-12-21 English Wikipedia dump. Following DrQA 2017
- Tue, 27 Aug 2019 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- SBERT is a Siamese Dual Encoder (one shared model); DPR is an Asymmetric Dual Encoder (two models). SDE outperforms ADE, though that finding came later.
- A pity this paper didn't push toward open-domain question answering; it does supervised fine-tuning per dataset.
- Hardly any papers follow its evaluation protocol now, so it's generally not compared against DPR. Then again, sentence-transformers is famous enough on its own.
- sentence-transformers
- Document
- Objective Function
- Classification Objective Function o = softmax(Wt(u, v, |u − v|))
- Regression Objective Function cosine-sim(u, v)
- Triplet Objective Function max(||sa − sp|| − ||sa − sn|| + s, 0)
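The triplet objective above as a standalone function; the margin ε is written `margin` here, and Euclidean distance is used as in the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """SBERT triplet objective max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0):
    push the anchor at least `margin` closer (in Euclidean distance)
    to the positive than to the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```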
- Ablation Study
- Pooling strategy: MEAN 80.78 > MAX 79.80 > CLS 79.07
- Concatenation (u, v, |u − v|) 80.78; surprisingly good
- Fri, 10 Apr 2020 Dense Passage Retrieval for Open-Domain Question Answering
- The whole system the DPR paper proposes, covering the model, training, and online inference, is close to today's mainstream; its dataset processing is still followed by the latest papers, and its results on several datasets still serve as baselines for them. So it deserves a longer write-up.
- Transformer came out in 2017, BERT in 2018, and NLP tasks started falling one after another. A truly vibrant, flourishing era.
- Uses a pretrained BERT with 768-dim embeddings; Query and Document use different models, i.e., two models are trained; inner product is the similarity function, and the loss function is negative log likelihood.
- DPR performs consistently better than BM25 on all datasets. (Dense retrieval steps onto the stage of history
- Experimental Setup:
- Evidence Corpus
- English Wikipedia dump from Dec. 20, 2018 as the source documents for answering questions, following (Lee et al., 2019) and DrQA (2017)
- We then split each article into multiple, disjoint text blocks of 100 words as passages, serving as our basic retrieval units. Following (Wang et al., 2019)
- Remember this: 100 words per passage
- five QA datasets(Natural Questions 2019(NQ), TriviaQA 2017, WebQuestions 2013(WQ), CuratedTREC 2015(TREC), SQuAD v1.1 2016)
- Selection of positive passages
- TREC, WebQuestions and TriviaQA: we use the highest-ranked passage from BM25 that contains the answer as the positive passage. If none of the top 100 retrieved passages has the answer, the question will be discarded.
- SQuAD and Natural Questions: since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate pool.
- Ablation Study on Model Training
- Sample efficiency
- a dense passage retriever trained using only 1,000 examples already outperforms BM25.
- Adding more training examples (from 1k to 59k) further improves the retrieval accuracy consistently.
- In-batch negative training
- re-using gold passages from the same batch as negatives can make the computation efficient while achieving great performance.
- a batch size of 128 and one additional BM25 negative passage per question
- in-batch negative training improves the results substantially. As a result, accuracy consistently improves as the batch size grows.
- “hard” negative passages that have high BM25 scores given the question, but do not contain the answer string (the bottom block)
- We find that adding a single BM25 negative passage improves the result substantially while adding two does not help further.
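The in-batch negative training described above can be sketched in numpy: each question's gold passage sits on the diagonal of the batch score matrix, every other passage in the batch acts as a negative, and the loss is the NLL of the gold passage (BM25 hard negatives would just be extra columns, omitted here):

```python
import numpy as np

def in_batch_nll(q_embs, p_embs):
    """DPR-style loss sketch for a batch of B (question, positive
    passage) pairs. Similarity is the inner product; softmax is over
    all B passages in the batch, so the B-1 off-diagonal passages
    serve as in-batch negatives."""
    scores = q_embs @ p_embs.T                      # (B, B) logits
    scores = scores - scores.max(axis=1, keepdims=True)   # stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # gold is the diagonal
```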
- Impact of gold passages
- Similarity and loss
- L2 performs comparable to dot product, and both of them are superior to cosine.
- Our experiments show that using triplet loss does not affect the results much.
- Cross-dataset generalization
- Qualitative Analysis
- Term-matching methods like BM25 are sensitive to highly selective keywords and phrases
- while DPR captures lexical variations or semantic relationships better.
- Run-time Efficiency
- BM25+Lucene vs DPR+FAISS
- End-to-end QA System
- The probabilities of a token being the starting/ending positions of an answer span and a passage being selected. (This span-prediction reading-comprehension (RC) style of task has since exited the stage of history)
- measured by exact match with the reference answer after minor normalization as in (Chen et al., 2017; Lee et al., 2019) (the EM criterion has carried through to today)
- higher retriever accuracy typically leads to better final QA results
- Recent work (Izacard and Grave, 2020; Lewis et al., 2020b) have also shown that DPR can be combined with generation models such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2019), achieving good performance on open-domain QA and other knowledge-intensive tasks.
- Retrieval + generation: this is already very much RAG
- Main Results
- Fri, 16 Oct 2020 RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering
- Will Baidu's ERNIE-based RocketQA be underrated for using PaddlePaddle instead of PyTorch?
- Sun, 18 Apr 2021 SimCSE: Simple Contrastive Learning of Sentence Embeddings
- We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective,
- with only standard dropout used as noise.
- This simple method works surprisingly well, performing on par with previous supervised counterparts.
- We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse.
- Interesting. But the unsupervised version even loses to BM25 (see the E5 paper)
- Thu, 19 Aug 2021 Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
- Google's sentence embeddings from text-to-text transformers (ST5)
- ST5-Enc Base 110M, Large 335M, 3B 1.24B, 11B 4.8B
- ST5-EncDec Base 110M, Large 335M, 3B 3B, 11B 11B
- ST5-Enc mean works better than ST5-EncDec first and ST5-Enc first.
- Encoder-only is already sufficient for retrieval tasks
- Wed, 15 Dec 2021 Large Dual Encoders Are Generalizable Retrievers
- Google's Generalizable T5-based dense Retrievers (GTR)
- Base 110M, Large 335M, XL 1.24B, XXL 4.8B
- Thu, 16 Dec 2021 Unsupervised Dense Information Retrieval with Contrastive Learning
- Finally, we also consider additional data augmentations such as random word deletion, replacement or masking.
- We use these perturbations in addition to random cropping.
- MoCo
- Contriever beats SimCSE and comes close to BM25
- Thu, 14 Oct 2021 RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking
- Will Baidu's ERNIE-based RocketQA be underrated for using PaddlePaddle instead of PyTorch?
- Thu, 14 Apr 2022 Exploring Dual Encoder Architectures for Question Answering
- Dual encoders have been used for questionanswering (QA) and information retrieval (IR) tasks with good results.
- Previous research focuses on two major types of dual encoders,
- Siamese Dual Encoder (SDE), with parameters shared across two encoders, (SBERT (Reimers and Gurevych, 2019), ST5 (Ni et al., 2021b)
- and Asymmetric Dual Encoder (ADE), with two distinctly parameterized encoders. (DPR (Karpukhin et al., 2020), DensePhrases (Lee et al., 2021a)
- we show that SDE performs significantly better than ADE.
- We further propose three different improved versions of ADEs by sharing or freezing parts of the architectures between two encoder towers.
- We find that sharing parameters in projection layers would enable ADEs to perform competitively with or outperform SDEs.
- We further explore and explain why parameter sharing in projection layer significantly improves the efficacy of the dual encoders, by directly probing the embedding spaces of the two encoder towers with t-SNE algorithm.
- Main Results
- By directly probing the embedding space, we demonstrate that the shared projection layers in SDE and ADE-SPL maps the embeddings of the two encoder towers into coinciding parameter spaces,
- which is crucial for improving the retrieval quality. Therefore, we recommend to share the projection layers between two encoders of ADEs in practice.
- This conclusion likely generalizes to metric-learning problems as a whole
- Thu, 26 May 2022 Matryoshka Representation Learning
- Supports multiple embedding dimensions from one vector
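Typical downstream usage of an MRL-trained embedding, sketched in numpy: with Matryoshka training, the first `dim` coordinates already form a usable embedding, so you truncate and re-normalize before computing cosine similarity (the specific dimensions here are illustrative, not from the paper):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` coordinates of an MRL-trained embedding
    and re-normalize, so downstream cosine similarity still works."""
    sub = emb[:dim]
    return sub / np.linalg.norm(sub)
```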
- Wed, 7 Dec 2022 Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Microsoft's E5
- We pre-train on our proposed text pair dataset for three model sizes: E5small, E5base and E5large initialized from MiniLM, bert-base-uncased, and bert-large-uncased-whole-wordmasking respectively
- Training Recipe (two-stage training)
- Weakly-Supervised Contrastive Pre-training
- Constructing the CCPairs dataset
- we use mined hard negatives and knowledge distillation from a cross-encoder (CE) teacher model
- Supervised Fine-tuning
- MS-MARCO, NQ, NLI
- We reuse the mined hard negatives and re-ranker scores from SimLM [58] for the first two datasets.
- Evaluation
- BEIR, MTEB
- Weakly-Supervised Contrastive Pre-training
- E5-PT base outperforms the classic BM25 algorithm by 1.2 points.
- To the best of our knowledge, this is the first reported result that an unsupervised model can beat BM25 on the BEIR benchmark.
- BM25 41.7, SimCSE 20.3, E5-PT large 44.2
- So E5-PT large is only slightly better than BM25
- Supervised Fine-tuning
- Most datasets benefit from supervised finetuning
- The supervised E5 large beats the earlier GTR xxl and Sentence-T5 xxl
- Since the difference between BERT-FT base and E5 base is that BERT-FT base only has fine-tuning stage,
- their performance gap demonstrates the usefulness of contrastive pre-training on our proposed CCPairs dataset.
- Is Contrastive Pre-training Necessary? (Improving Text Embeddings with Large Language Models)
- Weakly-supervised contrastive pre-training + supervised fine-tuning becomes the standard recipe for SOTA models
- BM25 vs Dense Retrieval
- The answer is likely “not yet”. BM25 still holds obvious advantages in terms of simplicity, efficiency, and interpretability. For long-tail domains such as Trec-Covid [55] and retrieval tasks that involve long documents (Touche-2020) [4] or rely heavily on exact lexical match (Fever) [54], further research efforts are still necessary to improve current dense retrievers.
- Thu, 20 Jul 2023 Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
- Architecture
- T5 architecture, presumably to match Google's GTR
- Pairwise Data Preparation
- De-Duplication, Language Filtering, Consistency Filtering
- Triplet Data Preparation
- we leverage the ms-marco-MiniLM-L-6-v2 model to verify whether the difference in retrieval scores determined by the model exceeds a threshold, r(q, p) − r(q, n) > κ, with threshold κ = 0.2, and eliminate all other pairs.
- This methodology draws inspiration from the de-noising strategy proposed in [Qu et al., 2021].
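That denoising step can be sketched as a one-pass filter, with `score_fn` standing in for the ms-marco-MiniLM-L-6-v2 reranker:

```python
def filter_triplets(triplets, score_fn, kappa=0.2):
    """Jina-style triplet denoising sketch: keep (query, pos, neg)
    only when the cross-encoder's score margin r(q, p) - r(q, n)
    exceeds the threshold kappa (0.2 in the paper)."""
    return [(q, p, n) for q, p, n in triplets
            if score_fn(q, p) - score_fn(q, n) > kappa]
```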
- Negation Data Preparation
- This dataset, based on positive pairs from the SNLI dataset and negatives created with GPT-3.5
- Training
- Training on Pairwise Data
- Training on Triplet Data
- Mon, 7 Aug 2023 Towards General Text Embeddings with Multi-stage Contrastive Learning
- Alibaba's GTE
- Architecture
- BERT, 512-token context length, mean pooling, InfoNCE loss
- Training Recipe (two-stage training)
- Weakly-Supervised Contrastive Pre-training (from bert pre-train)
- Weakly supervised text relevance data is readily available in publicly accessible web sources, such as the inherent connection between queries and answers on QA forums.
- Supervised Fine-tuning
- we use relatively lower-sized datasets with human annotation of the relevance between two pieces of text and optional hard negatives mined by an extra retriever to form text triples.
- Improved Contrastive Loss
- in which the first two terms are used for query-to-document contrast, whereas the last two terms are used for the inverse.
- s(q_i, q_j) and s(d_j, d_i) get put to use as well
- The temperature τ is fixed to 0.01 in this work.
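A sketch of how I read the improved loss; the denominator adds query-to-query, document-to-document, and document-to-query terms to the standard InfoNCE partition function. The exact masking of self-similarity terms is my assumption, not an official implementation:

```python
import numpy as np

def improved_infonce(q, d, tau=0.01):
    """GTE-style improved contrastive loss sketch. q, d are (n, dim)
    batches of L2-normalized query/document embeddings with matching
    positives on the diagonal. Besides s(q_i, d_j), the partition
    function includes s(q_i, q_j), s(d_i, d_j) and s(d_i, q_j)
    terms; self-similarity (j == i) is masked out here."""
    n = q.shape[0]
    sqd = np.exp(q @ d.T / tau)          # s(q_i, d_j)
    sqq = np.exp(q @ q.T / tau)          # s(q_i, q_j)
    sdd = np.exp(d @ d.T / tau)          # s(d_i, d_j)
    sdq = sqd.T                          # s(d_i, q_j)
    off = ~np.eye(n, dtype=bool)         # mask self terms
    Z = (sqd.sum(axis=1) + (sqq * off).sum(axis=1)
         + (sdq * off).sum(axis=1) + (sdd * off).sum(axis=1))
    pos = np.exp((q * d).sum(axis=1) / tau)
    return float(-np.mean(np.log(pos / Z)))
```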
- Evaluation
- BEIR:we find that our base size model significantly outperforms the models with comparable size, like SimCSE, Contriever and E5. Our base model is comparable to E5large without using human supervision.
- Analysis
- Number of Training Datasets
- The results presented in Figure 3a demonstrate that the inclusion of more diverse data sources consistently enhances model performance during both the pre-training and finetuning stages.
- Pre-training Batch Size
- model performance saturates at around a batch size of ten thousand.
- Number of Model Parameters
- It can be observed that as the model size grows exponentially, the model performance also improves linearly.
- Influence of Different Training Stages
- Both pre-training and fine-tuning help
- Ablation of the Contrastive Objective
- A slight improvement in fine-tuning, but the later mGTE didn't adopt it
- Thu, 14 Sep 2023 C-Pack: Packed Resources For General Chinese Embeddings
- architecture: BERT small, base, large; 512-token length
- Training Recipe (three-stage training)
- MAE pre-training: trained from scratch on the Wudao corpora; "We leverage the MAE-style approach presented in RetroMAE"
- General purpose fine-tuning
- The pre-trained model is finetuned on C-MTP (unlabeled) via contrastive learning
- we purely rely on in-batch negative samples [25] and resort to a big batch size (as large as 19,200) to improve the discriminativeness of the embedding.
- Task-specific fine-tuning
- The hard negative sample is mined from the task’s original corpus, following the ANN-style sampling strategy in [61].
- Detailed Analysis
- pretrain & finetune
- a mixture of high-quality and diversified labeled data is able to bring forth substantial and comprehensive improvements for a pre-trained embedding model.
- batch size
- By making a parallel comparison between bz: 256, 2,048, 19,200, we observe consistent improvement in embedding quality with the expansion of batch size (noted as bz).
- Instruct
- using instructions may substantially contribute to the quality of task-specific fine-tuning
- Fri, 22 Sep 2023 AnglE-optimized Text Embeddings
- AnglE-BERT & AnglE-LLaMA2-7B
- The cosine objective can saturate, so an angle objective is proposed
- Is there really only one paper on metric learning in angular space (angular contrastive learning)?
- applies LLMs as data annotators to label the pseudo-supervised data for AnglE training.
- For the STS task, we use the prompt “You are a highly smart same-meaning/opposite-meaning sentence-generating system. Your job is to generate {size} synonymous/antonym sentences of a given input sentence. Input sentence: {text}. Output:” to generate positive/negative pairs. {size} and {text} are placeholders for the generated size and the input text, respectively.
- My goodness, generating positive/negative pairs
- Thu, 12 Oct 2023 Fine-Tuning LLaMA for Multi-Stage Text Retrieval
- LLM as Retrieval +1
- RepLLaMA & RankLLaMA use LLaMA-2-7B and LLaMA-2-13B
- Mon, 30 Oct 2023 Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
- Architecture
- BERT with ALiBi, GEGLU, BF16, mean pooling
- bert-small 33M, bert-base 137M
- Training
- Pre-training the Backbone, (C4 + 30% MLM)
- First Fine-tuning with Text Pairs, InfoNCE
- Second Fine-tuning with Hard Negatives
- includes one positive and 15 negative instances
- To ensure that hard negative passages are indeed less relevant than the annotated relevant ones, we employ a cross-encoder model to validate that their relevance score is indeed lower.
- Fri, 29 Dec 2023 MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
- This one is mostly about pretraining BERT
- This architecture combines FlashAttention [11], ALiBi [44], Gated Linear Units[12, 50], a dynamic unpadding module [66], and low precision LayerNorm.
- Sun, 31 Dec 2023 Improving Text Embeddings with Large Language Models
- LLM as Retrieval +1
- we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
- we design a two-step prompt template that first prompts LLMs to brainstorm a list of tasks, and then generates a concrete example conditioned on the task definition.
- query, positive_document, hard_negative_document all fully synthetic?? That's pretty wild
- w/ synthetic only 63.1, w/ synthetic + msmarco 64.5, w/o synthetic data 64.6, full data 66.6
- Are you sure your synthetic data helps?
- 5.1 Is Contrastive Pre-training Necessary?
- contrastive pre-training benefits XLM-Rlarge, enhancing its retrieval performance
- However, for Mistral-7B based models, contrastive pre-training has negligible impact on the model quality
- Comparing E5 with E5-mistral-7b: does 7B crush 330M?
- Fri, 2 Feb 2024 Nomic Embed: Training a Reproducible Long Context Text Embedder
- The NomicBertModel architecture is fairly modern: BERT with RoPE
- https://huggingface.co/nomic-ai/nomic-bert-2048
- Architecture
- RoPE + SwiGLU + Flash Attention; 12 layers, 768-dim, 137M params, trained at length 2048
- Dynamic NTK interpolation at inference to scale to 8192 sequence length
- Training Recipe (three-stage training)
- MLM pre-train (from scratch)
- Masked Language Modeling Pretraining (BooksCorpus and Wikipedia), sequence length 2048, 30% masking rate
- Additionally, we opt for SwiGLU versus GeGLU like proposed in Portes et al. (2023) as runtime is roughly 25% faster for SwiGLU using the Flash Attention repository.
- Weakly-Supervised Contrastive Pretraining
- Consistency Filtering
- Since many of these datasets may contain noisy examples, we employ consistency filtering to remove the potential false positives in the dataset
- Supervised Contrastive Fine-tuning
- data (MSMarco, NQ, NLI....)
- For other non-retrieval datasets, we randomly sample negatives among the corpus in place of mining hard negatives as we found that mining did not improve performance.
- We also found that training for multiple epochs hurts performance.
- Instead of choosing the first N negatives, we randomly sampled the mined negatives.
- We found this to improve performance as some of the mined negatives introduced false negatives.
- Evaluate
- Tested on MTEB, Jina's Long Context Benchmark, and LoCo
- Sun, 4 Feb 2024 为RAG而生-BCE embedding技术报告
- Releasing an embedding model and a reranker together really is "born for RAG", but the 512 sequence length clearly didn't anticipate the roaring long-context era that followed
- Two-stage retriever: an "offline" embedding model paired with an "online" reranker
- Hard negative mining?
- While training the embedding model, we found that overly hard negatives harm training; they "confuse" the model during training and hurt its final performance.
- In a large corpus, automated hard-negative mining without human verification will inevitably "mine up positives".
- Actually, what counts as a "positive" versus a "hard negative" follows from your business's definition.
- So return to the business objective and the "evaluation criteria" of a good retriever: the embedding model should recall as many relevant passages as possible; don't force the reranker's fine-ranking job onto the embedding model, because overstepping that role will ultimately hurt it.
- Mon, 5 Feb 2024 BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
- BAAI's BGE M3
- Uses the XLM-RoBERTa architecture, which is fairly traditional; 24 layers, 1024-dim
- we introduce a new embedding model called M3-Embedding
- Multi-Linguality: It provides a uniform support for the semantic retrieval of more than 100 working languages. Enables both multilingual retrieval within each language and crosslingual retrieval between different languages.
- Multi-Functionality: It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Granularity: Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens
- Related Work
- powerful text encoders bert 2019, DPR 2020, ST5 2022
- negative sampling (Xiong et al., 2020; Qu et al.,2021)
- knowledge distillation (Hofstatter et al. ¨ , 2021; Ren et al., 2021; Zhang et al., 2021a).
- Earlier dense retrieval: Contriever (Izacard et al., 2022), LLM-Embedder (Zhang et al., 2023a), E5 (Wang et al., 2022), BGE (Xiao et al., 2023), SGPT (Muennighoff, 2022), and Open Text Embedding (Neelakantan et al., 2022)
- In our work, the following technical contributions are made to optimize the embedding quality.
- Firstly, we propose a novel self knowledge distillation framework
- the [CLS] embedding is used for dense retrieval, while embeddings from other tokens are used for sparse retrieval and multi-vector retrieval.
- we integrate the relevance scores from different retrieval functions as the teacher signal, which is used to enhance the learning process via knowledge distillation.
- Secondly, we optimize the batching strategy to achieve a large batch size and high training throughput, which substantially contributes to the discriminativeness of embeddings.
- Last but not least, we perform extensive and high-quality data curation.
- Our dataset includes three sources:
- the extraction of unsupervised data from massive multi-lingual corpora, In total, it brings in 1.2 billion text pairs of 194 languages and 2655 cross-lingual correspondences.
- we collect relatively small but diverse and high-quality fine-tuning data from labeled corpora. we incorporate 8 datasets, For Chinese, we integrate 7 datasets, For other languages, we leverage the training data from Mr. Tydi (Zhang et al., 2021b) and MIRACL (Zhang et al., 2023c).
- the synthesization of scarce training data
- Specifically, we sample lengthy articles from Wikipedia, Wudao (Yuan et al., 2021) and mC4 datasets and randomly choose paragraphs from them.
- Then we use GPT3.5 to generate questions based on these paragraphs.
- Training models on GPT-synthesized data starts becoming mainstream
- The three data sources are complementary to each other and applied to different training stages, which lays a solid foundation for the versatile text embeddings.
- Train
- loss
- minimize the InfoNCE loss(NCE stands for Noise-Contrastive Estimation)
- naive multi-objective training can be unfavorable to the embedding's quality.
- we integrate the relevance scores from different retrieval functions as the teacher signal, which is used to enhance the learning process via knowledge distillation.
- The training process constitutes a multi-stage workflow
- the text encoder (an XLM-RoBERTa (Conneau et al., 2020) model adapted by RetroMAE (Xiao et al., 2022) method) is pre-trained with the massive unsupervised data, where only the dense retrieval is trained in the basic form of contrastive learning.
- The self-knowledge distillation is applied to the second stage, where the embedding model is fine-tuned to establish the three retrieval functionalities.
- Both labeled and synthetic data are used in this stage, where hard negative samples are introduced for each query following the ANCE method (Xiong et al., 2020).
- Efficient Batch
- It also needs to keep the batch size as large as possible(introducing a huge amount of in-batch negatives) to ensure the discriminativeness of text embeddings
- Particularly, the training data is pre-processed by being grouped by sequence length. When producing a mini-batch, the training instances are sampled from the same group.
- We iteratively encode each sub-batch using gradient checkpointing (Chen et al., 2016)and gather all generated embeddings.
- Finally, the embeddings from different GPUs are broadcasted, allowing each device to obtain all embeddings in the distributed environment,
- which notably expands the scale of in-batch negative samples.
- Experiment
- Multi-Lingual Retrieval
- Cross-Lingual Retrieval
- Multilingual Long-Doc Retrieval
- Ablation study
- Self-knowledge distillation
- Impact of multi-stage training
| Model (Dense) | MIRACL |
|---|---|
| Fine-tune | 60.5 |
| RetroMAE + Fine-tune | 66.1 |
| RetroMAE + Unsup + Fine-tune | 69.2 |
- Thu, 8 Feb 2024 Multilingual E5 Text Embeddings: A Technical Report
- Initialization: microsoft/Multilingual-MiniLM-L12-H384, xlm-roberta-base, xlm-roberta-large
- contrastive pre-training + fine-tuning
- Sat, 24 Feb 2024 OpenAI vs Open-Source Multilingual Embedding Models Choosing the model that works best for your data
- Generate a custom Q/A dataset
- Presents a method for synthesizing a Q/A test set with ChatGPT to evaluate retrieval models
- So ChatGPT-synthesized Q/A data trains the models and ChatGPT-synthesized Q/A data tests them; that world has arrived
- Mon, 26 Feb 2024 GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
- Uses a guide model to remove false negatives from the in-batch negatives, essentially consistency filtering
- Mon, 26 Feb 2024 Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings
- The multilingual version of Jina Embeddings 2, BERT with ALiBi
- Wed, 27 Mar 2024 Scaling Laws For Dense Retrieval
- 24 BERT checkpoints from the original Google release, with model sizes ranging from 0.5 million (BERT-Tiny) to 82 million parameters (BERT-Base)
- For experiments on Chinese retrieval benchmarks, we selected the ERNIE series
- The compared models are on the small side; by 2025 we can probably pin down the sweet spot for model size. Looking back at the figure above, multilingual models really do need more parameters
- Monolingual: roughly BERT-Base (L=12, H=768, A=12, 110M)
- Multilingual: roughly BERT-Large (L=24, H=1024, A=16, 340M)
- Fri, 29 Mar 2024 Gecko: Versatile Text Embeddings Distilled from Large Language Models
- Training Recipe (two-stage training)
- Weakly-Supervised Contrastive Pre-training
- Builds a CCPairs-like dataset
- Note that we do not utilize hard negatives during pre-finetuning and utilize the maximum batch size that fits into the device.
- Supervised Fine-tuning
- FRet: Two-Step LLM Distillation
- LLM-based Diverse Query Generation
- we employ few-shot prompts to control the diversity of queries
- LLM-based Positive and Negative Mining
- we use an existing embedding model to retrieve top 𝑁 neighbors 𝑃 from the corpus given a generated query 𝑞.
- We then employ the same LLM used for the query generation to rank these retrieved passages based on their relevance to the query
- query likelihood, relevance classification
- we create the FRet dataset, comprised of 6.6M examples, each containing a task, a query, a positive passage, and a negative passage.
- Analysis
- LLM as a Labeler
- we find that using the most relevant passage chosen by an LLM is always better than using the original passage as positive.
- Diversity of FRet
- Learning Semantic Similarity and Classification
- Qualitative Analysis
- First, we observe that the LLM does generate diverse tasks and queries by conditioning on seed passages p_seed
- Second, the table highlights the LLM's ability to find a passage (p_1) that provides a more direct and relevant answer to the generated query than the seed passage (p_seed)
- Furthermore, LLM-ranked hard negatives make a challenging task of understanding nuanced differences.
- These examples demonstrate how the 2-step LLM distillation process effectively brings the LLM’s diverse domain knowledge and global ranking preferences into the text embedding model.
- Tue, 9 Apr 2024 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- additional training phase with a specially designed masked token prediction to warm-up the bidirectional attention.
- LLM as Retrieval +2
- Wed, 8 May 2024 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- architecture
- v1: BertModel at 22M, 33M, 110M, 137M, and 335M params; 512-token context
- m-long: NomicBertModel, max_trained_positions: 2048
- m-v1.5: BertModel, 512-token context
- Synthetic Data For Semantic Dense Mining
- Tunable Hard Negative Mining
- How hard should these negatives be for maximally effective learning in the fine-tuning phase?
- Our answer to this question was ultimately a tunable hard negative mining strategy in which we leveraged a preexisting text embedding model to identify and score the hardest negatives for each training example.
- Then, we applied a score threshold to discard the hard negatives from the above set.
- We found that using an upper threshold rather than a specific rank helped account for the fact that some queries admit much harder top-k negatives than others.
- we perform a parameter sweep of the negative hardness threshold to demonstrate the value of a tunable approach (the optimal threshold value scores significantly better than other choices).
- Training Recipe
- Large Scale Contrastive Pretraining With In-Batch Negatives (infoNCE)
- Longer Truncation Length
- We used a document sequence length of 256 in large-scale contrastive training, in contrast to the 128 truncation length used in GTE and BGE.
- We truncated query sequence length to 32, consistent with BGE's source code
- Quality-Focused Contrastive Training With Curated Negatives
- We truncate sequence lengths to 512 for queries and documents for all models, including the long-context variant m-long.
- For each query in a batch, we include one positive document and ten hard negative documents
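The in-batch InfoNCE objective referenced throughout these training recipes can be sketched in a few lines. This is a toy version with 2-d list embeddings and an assumed temperature of 0.05; real implementations operate on batched tensors:

```python
import math

# Minimal InfoNCE sketch with in-batch negatives: for query i, document i is
# the positive and every other document in the batch acts as a negative.
def info_nce(queries, docs, temperature=0.05):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        log_z = math.log(sum(math.exp(x) for x in logits))
        total += -(logits[i] - log_z)  # cross-entropy with label i
    return total / len(queries)

# perfectly aligned pairs give a near-zero loss
loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Misaligned pairs push the loss up sharply, which is what drives the paired query and document embeddings together during contrastive pretraining.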
- Sat, 11 May 2024 Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
- SenseTime's Piccolo2
- Architecture: BERT-style, 24 layers, 1024 dims, 512 sequence length; MRL training
- No weakly-supervised contrastive pretraining is described, only supervised contrastive fine-tuning? And yet it beats even gte-Qwen1.5-7B-instruct???
- Multi-task Hybrid Loss
- Retrieval and Reranking Loss,use the standard InfoNCE loss with in-batch negative
- STS and PairClassification Loss,cosent loss function
- Classification and Clustering Loss,SFR embedding method
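The CoSENT loss used for the STS and pair-classification tasks can be sketched as follows, with precomputed cosine similarities standing in for model outputs; lam=20 follows the original formulation:

```python
import math

# CoSENT sketch: for every pair (i, j) where example i is labeled more
# similar than example j, penalize cos_sims[j] exceeding cos_sims[i].
def cosent_loss(cos_sims, labels, lam=20.0):
    terms = [math.exp(lam * (cos_sims[j] - cos_sims[i]))
             for i in range(len(labels))
             for j in range(len(labels))
             if labels[i] > labels[j]]
    return math.log(1.0 + sum(terms))

well_ordered = cosent_loss([0.9, 0.1], [1, 0])  # correct ordering -> tiny loss
inverted = cosent_loss([0.1, 0.9], [1, 0])      # wrong ordering -> large loss
```

Unlike a plain regression on similarity scores, CoSENT only constrains the relative ordering of pairs, which matches how STS labels are actually used.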
- Datasets
- Datasets Synthetic Pipeline
- Data collection
- Hard Negative Mining
- For each retrieval task, we use piccolo-base-zh [12] to conduct negative sample mining.
- We randomly select 15 samples from the mining negatives of rank 50 - 100 as the final hard negative samples.
- We avoid using higher-rank negative samples as their inclusion typically leads to a decline in performance.
- This is caused by a variety of reasons, such as inaccurate dataset annotation.
- Mon, 27 May 2024 NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
- LLM as Retrieval +3
- Architecture
- Mistral-7B + LLM2Vec + latent attention layer
- Training (two-stage contrastive instruction-tuning method)
- starting with the pretrained Mistral-7B
- In the first stage, we apply contrastive training with instructions on retrieval datasets, utilizing in-batch negative and curated hard-negative examples
- In the second stage, we blend carefully curated non-retrieval datasets into the stage-one training data.
- hard-negative technique (NV-Retriever)
- we apply the recently proposed positive-aware hard-negative technique (Moreira et al., 2024) that considers the positive relevance scores for better false negatives removal
- Following the ablation studies in Moreira et al. (2024), we use E5-mistral-7b-instruct (Wang et al., 2023b) as a teacher retrieval model to identify the optimal hard-negative passages relevant to the query.
- We set the maximum threshold for negative scores based on a percentage of the positive score (TopKPercPos) with a 95% margin, described as follows:
- max_negative_score_threshold = pos_score * percentage_margin
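A sketch of this positive-aware filter (function and variable names are mine, not from the paper): candidates whose teacher score exceeds 95% of the positive's score are discarded as likely false negatives, and the hardest survivors are kept.

```python
# Positive-aware hard-negative mining sketch (TopKPercPos-style threshold).
def mine_hard_negatives(pos_score, candidates, percentage_margin=0.95, k=4):
    """candidates: list of (doc_id, teacher_score) pairs."""
    max_negative_score_threshold = pos_score * percentage_margin
    kept = [(d, s) for d, s in candidates if s < max_negative_score_threshold]
    kept.sort(key=lambda x: -x[1])  # hardest surviving negatives first
    return [d for d, _ in kept[:k]]

# d1 (0.79) and d4 (0.77) exceed 0.80 * 0.95 = 0.76 and are dropped
negs = mine_hard_negatives(0.80, [("d1", 0.79), ("d2", 0.70),
                                  ("d3", 0.60), ("d4", 0.77)])
```

Tying the threshold to the positive's own score is what distinguishes this from a fixed rank cutoff: queries with very hard top candidates get filtered more aggressively.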
- ABLATION STUDY
- TWO-STAGE TRAINING
- CAUSAL ATTENTION VS. BIDIRECTIONAL ATTENTION
- This indicates that embeddings generated with causal attention masks are significantly less effective than those produced with bidirectional attention masks.
- POOLING METHODS
- -last, mean, latent-attention, and self-attention pooling types
- In contrast, the latent-attention layer proved beneficial for majority of embedding tasks
- MULTI-CLASS CLASSIFICATION AND CLUSTERING LABELS
- HARD-NEGATIVE MINING AND SYNTHETICALLY GENERATED DATASET
- baseline (S0) 70.73
- hard negative mining technique (S1) 71.83
- additional public retrieval data (S2) 72.07
- synthetically generated data (S3) 72.31
- Mon, 22 Jul 2024 NV-Retriever: Improving text embedding models with effective hard-negative mining
- hard-negative mining
- Fri, 26 Jul 2024 bge-multilingual-gemma2,bge-en-icl
- 2024-07-31 MTEB English leaderboard: 1. bge-en-icl 2. stella_en_1.5B_v5 3. SFR-Embedding-2_R 4. gte-Qwen2-7B-instruct 5. stella_en_400M_v5 6. bge-multilingual-gemma2 7. NV-Embed-v1 8. voyage-large-2-instruct 9. Linq-Embed-Mistral 10. SFR-Embedding-Mistral
- LLM as Retrieval +4 +5
- Mon, 29 Jul 2024 mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
- Alibaba's mGTE
- Trains a base model from scratch and fine-tunes it into a retrieval (embeddings) model and a reranking model. Must be nice to have that kind of budget.
- https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base
- Architecture
- BERT + RoPE + GLU + xformers; 12 layers, 768 dims, 306M params (smaller than bge-m3); [CLS] pooling
- pre-trained by masked language modeling (MLM) via a two-stage curriculum for the native 8,192 tokens context.
- Training Recipe (three-stage training)
- MLM pre-train (from scratch)
- We pre-train the model via masked language modeling (MLM); the MLM probability is set to 30%
- To train the native 8192-context model more efficiently, we adopt a phased training curriculum (Xiong et al., 2024)
- MLM-2048: we chunk the input into 2048 tokens and set RoPE base to 10,000.
- MLM-8192: we chunk the input into 8192 tokens and set RoPE base to 160,000.
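The effect of raising the base can be seen from RoPE's per-dimension-pair wavelengths, 2π·base^(2i/d): a larger base stretches the slow-rotating dimensions so positions stay distinguishable over a longer context. A quick sketch:

```python
import math

# Wavelength of each RoPE dimension pair: 2*pi * base**(2i/d).
def rope_wavelengths(base, dim):
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

short_ctx = rope_wavelengths(10_000, 8)
long_ctx = rope_wavelengths(160_000, 8)
# every wavelength grows with the larger base (the i = 0 pair stays at 2*pi)
```

This is why the base is bumped from 10,000 to 160,000 when moving from the 2048-token to the 8192-token curriculum stage.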
- Retrieval(Embeddings) model
- we construct the TRM for first-stage text retrieval in two steps:
- contrastive pre-training and fine-tuning (Wang et al., 2022; Li et al., 2023).
- Both steps share the same InfoNCE (Oord et al., 2018) learning objective
- Contrastive Pre-Training
- Matryoshka Embedding
- Sparse Representation
- Contrastive Fine-Tuning
- Text Reranking Model
- It takes the query and document as input: [CLS] q [SEP] d, and directly predicts their relevance score by the [CLS] output state:
- s_rerank = W · h_[CLS]
- The model is fine-tuned by InfoNCE in one step based on our text encoder
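The scoring head is just a linear map on the [CLS] state; a toy sketch, where the hidden vectors and weights are stand-ins for real encoder outputs and learned parameters:

```python
# mGTE-style cross-encoder scoring sketch: encode "[CLS] q [SEP] d", then map
# the [CLS] hidden state to a scalar relevance score with one linear layer.
def rerank_score(h_cls, W, b=0.0):
    return sum(w * h for w, h in zip(W, h_cls)) + b

W = [1.0, 2.0, -1.0]                           # toy learned projection
score_pos = rerank_score([0.5, 0.3, 0.1], W)   # pretend (q, d+) [CLS] state
score_neg = rerank_score([0.1, 0.0, 0.4], W)   # pretend (q, d-) [CLS] state
# fine-tuning pushes score_pos above score_neg via InfoNCE over these scores
```

Because query and document are encoded jointly, every such score requires a full forward pass, which is exactly why this architecture is reserved for reranking a small candidate set.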
- reversed NTK
- We utilize the reversed NTK scaling in contrastive pre-training to reduce required text length
- With revNTK, models exhibit slightly lower performance on 1k context but achieve more stable 8k performance across different training steps.
- The paper does not mention hard example mining; perhaps the point is "no bells and whistles". The Stella_v5 series, fine-tuned on top of this model, does somewhat better.
- gte-Qwen2-7B-instruct gte-Qwen1.5-7B-instruct
- gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family that ranks No.1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark MTEB benchmark (as of June 16, 2024).
- LLM as Retrieval +6 +7, surprisingly not written up in the paper
- Wed, 28 Aug 2024 Conan-embedding: General Text Embedding with More and Better Negative Samples
- dynamic hard negative mining
- prompt-response pairs from LLMs can be used for embedding training
- Matryoshka Embedding
- Mon, 16 Sep 2024 jina-embeddings-v3: Multilingual Embeddings With Task LoRA
- Architecture
- bert_with_rope, MRL, mean pooling, FlashAttention 2, LoRA
- Based on the Jina-XLM-RoBERTa architecture, this model supports Rotary Position Embeddings to handle long input sequences up to 8192 tokens.
- Multilingual Embeddings With Task LoRA
- retrieval.query: Used for query embeddings in asymmetric retrieval tasks
- retrieval.passage: Used for passage embeddings in asymmetric retrieval tasks
- separation: Used for embeddings in clustering and re-ranking applications
- classification: Used for embeddings in classification tasks
- text-matching: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
- Training
- We initialize the model using the weights of the original XLM-RoBERTa model.
- Pre-Training, MLM whole word masking
- Fine-Tuning for Embedding Tasks, InfoNCE
- Training Task-Specific Adapters
- Failure Analysis for Asymmetric Retrieval
- Misleading Syntactic Similarities
- Misinterpretation of Named Entities
- No Understanding of Polar Questions
- Preference for Low-Quality Documents
- Performance on LongEmbed MTEB
- Table 5 demonstrates that jina-embeddings-v3 with the text-matching adapter achieves the highest average performance.
- These findings underscore the effectiveness of the RoPE-based positional embeddings, outperforming both the fixed positional embeddings used by bge-m3 and the ALiBi-based approach employed in jina-embeddings-v2.
- Fri, 28 Oct 2024 SFR-Embedding-Mistral: Enhance Text Retrieval with Transfer Learning
- The SFR-Embedding-Mistral marks a significant advancement in text-embedding models, building upon the solid foundations of E5-mistral-7b-instruct and Mistral-7B-v0.1
- LoRA adapters with rank r=8 are added to all linear layers, resulting in 21M trainable parameters.
- Multi-task Training Benefits Generalization
- incorporating additional clustering training yields significant improvements across all tasks
- Task-Homogenous Batching
- Consequently, the in-batch negative becomes more challenging as other examples within the batch closely resemble the test case scenario.
- Impact of Hard Negatives
- Strategy to Eliminate False Negatives
- The results indicate that the range from 30 to 100 yields improved performance.
- This implies that the top-ranked documents (0-100) may include some false negatives,
- while those ranked lower (50-100) lack sufficient challenge.
- Number of Hard Negatives
- Nevertheless, our findings suggest that the training process remains relatively stable regardless of the number of hard negatives utilized.
- Impact of Batch Size
- However, enlarging the batch size from 2048 to 8192 does not result in any significant change in performance.
- Teacher models for hard negative mining
- in general, more powerful models can yield more effective hard negatives (SFR-Embedding-Mistral > E5-Mistral > BGE-base).
- In the future, it will be intriguing to explore the impact of multi-round training on two fronts
- Impact of Context Length
- we observe that after a certain length threshold, i.e., 25 for queries and 700 for documents,
- BGE model is significantly less likely to rank the gold document higher than SFR-Embedding-Mistral owing to the inherent power of LLMs to represent long-context.
- Tue, 3 Dec 2024 Arctic-Embed 2.0: Multilingual Retrieval Without Compromise
- Architecture
- m_v2: gte-multilingual-mlm-base
- l_v2: bge-m3-retromae
- three-stage training framework
- pretraining via masked language modeling (are the existing pretrained models still not fully trained?)
- contrastive pretraining
- Finetuning
- Hard Negative Mining
- we adopt the strategy from NV Retriever
- we confirm Moreira et al. (2024)’s finding that stronger teacher models yield higher-quality fine-tuning datasets
- Matryoshka Representation Learning
- We also extend the maximum sequence length for queries and documents to 512 tokens,
- adjusting the batch size to 256 sets of 1 query, 1 positive doc, and 10 negative docs,
- changing the learning rate to 1e-5 and 5e-6 for medium and large models, respectively,
- and adjusting our WSD learning rate schedule to have no warmup and
- perform linear decay for 6,000 out of a total of 9,342 steps.
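Matryoshka training applies the same contrastive objective to nested prefixes of the embedding so that truncated vectors remain usable. A sketch of the prefix-summing idea, with a simple margin loss standing in for the InfoNCE term used in practice:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Sum a contrastive term over nested prefix dimensions (here 2 and 4 of 4).
def mrl_loss(q, pos, neg, dims=(2, 4), margin=0.2):
    return sum(max(0.0, margin - cosine(q[:d], pos[:d]) + cosine(q[:d], neg[:d]))
               for d in dims)

q = [1.0, 0.0, 0.0, 0.0]
loss = mrl_loss(q, pos=[1.0, 0.0, 0.0, 0.0], neg=[0.0, 1.0, 0.0, 0.0])
```

Because every prefix contributes its own loss term, the leading dimensions are forced to carry most of the semantic signal, which is what makes 256- or 128-dim truncations viable at serving time.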
- Cross-lingual Transfer
- English Performance Gap
- Data quality matters more than quantity.
- Model “knowledge” and task-calibration are both important yet possibly orthogonal.
- Wed, 18 Dec 2024 ModernBERT
- reduce Bias Terms, GeGLU, Rotary
- Alternating attention: global attention every third layer, local attention with a 128-token sliding window in the rest
- Model design: a deep-and-narrow architecture with progressive parameter-space expansion
- Training settings: MLM uses a masking rate of 30 percent; Warmup-Stable-Decay (WSD)
- Weight Initialization and Tiling
- Vocabulary design and the addition of code data
- Tue, 11 Feb 2025 Training Sparse Mixture Of Experts Text Embedding Models
- Embedding models enter the Mixture-of-Experts era
- The paper's coverage is thorough and detailed; recommended
- 3.1. Masked Language Modeling
- 3.2. Mixture of Experts (MoE)
- 3.3. Contrastive Learning
- 3.3.1. TRAINING TEXT EMBEDDING MODELS
- Text embedding models are generally trained in two stages: weakly-supervised contrastive pretraining and contrastive finetuning
- The contrastive pretraining stage uses the InfoNCE objective
- Contrastive finetuning incorporates high-quality human labeled datasets and hard negatives to improve retrieval performance
- Matryoshka Representation Learning
- 3.3.2. CONSISTENCY FILTERING
- Consistency filtering improves dataset quality by removing potential false positives from weakly supervised data
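The filtering rule can be sketched directly: a weakly supervised pair survives only if an existing retriever ranks its passage near the top for its query. Similarities below are precomputed toy scores rather than a real model:

```python
# Consistency-filtering sketch: keep a weak (query, passage) pair only if an
# existing retriever ranks that passage within the top-k for its query.
def consistency_filter(pairs, score, corpus, k=2):
    kept = []
    for query, passage in pairs:
        ranked = sorted(corpus, key=lambda p: -score(query, p))
        if passage in ranked[:k]:
            kept.append((query, passage))
    return kept

sims = {("q1", "p1"): 0.9, ("q1", "p2"): 0.5, ("q1", "p3"): 0.8}
kept = consistency_filter([("q1", "p1"), ("q1", "p2")],
                          lambda q, p: sims[(q, p)], ["p1", "p2", "p3"])
# ("q1", "p1") survives (rank 1); ("q1", "p2") is dropped (rank 3)
```

In practice the ranking model is a previously trained embedding model and the "corpus" is a large sample, but the keep/drop logic is exactly this.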
- 3.3.3. HARD NEGATIVE MINING
- Text embedding models are typically finetuned with hard negatives mined by an existing retriever
- positive-aware hard negative mining
- Tue, 7 Jan 2025 voyage-3-large: the new state-of-the-art general-purpose embedding model
- Enabled by Matryoshka learning and quantization-aware training,
- voyage-3-large supports smaller dimensions and int8 and binary quantization that dramatically reduce vectorDB costs with minimal impact on retrieval quality.
- The following figure illustrates the tradeoff between retrieval quality and storage cost (which is proportional to the number of bits per vector).
- We see that voyage-3-large with int8 precision and 1024 dimensions is only 0.31% below voyage-3-large with float precision and 2048 dimensions, despite using 8x less storage.
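The storage arithmetic behind that comparison is simply dimensions × bits per value. A sketch, with a naive symmetric int8 quantizer added purely for illustration (vendors do not publish their exact schemes):

```python
# Bytes needed per stored vector.
def vector_bytes(dims, bits):
    return dims * bits // 8

full = vector_bytes(2048, 32)   # float32 at 2048 dims -> 8192 bytes
small = vector_bytes(1024, 8)   # int8 at 1024 dims    -> 1024 bytes, 8x less

# Naive symmetric int8 quantization (illustrative only; assumes a nonzero vector).
def quantize_int8(vec):
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

codes, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
```

This is the 8x reduction cited above: halving the dimensions via Matryoshka truncation and dropping from 32-bit floats to 8-bit integers each contribute a factor of 2x and 4x respectively.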
- Wed, 22 Jan 2025 Alibaba-NLP/gte-modernbert-base
- Fri, 27 Jan 2025 What lessons does ModernBERT offer?
- Compares ModernBERT (Dec 2024), jina-XLM-RoBERTa (Sep 2024), and RoBERTa-large (Jul 2019)
- A bitter lesson? With better data efficiency, embedding models at current scales still have enormous room to improve; there is no need to blindly grow the corpus or pile on parameters.
- Tue, 22 Apr 2025 Tencent releases Conan-Embedding-V2
- We designed Conan-1.4B: 8 attention layers, hidden size 3584, and a 32k max context. At 1.4B parameters, it delivers a larger embedding dimension from relatively few parameters.
- (A short, wide model, the opposite of ModernBERT's deep-and-narrow shape.)
- Conan-embedding-v2 is trained in four stages
- Starting from basic letters and symbols, we trained Conan's BBPE tokenizer on roughly 400k multilingual samples, targeting a vocabulary of 150k
- In the LLM training stages (stages 1 and 2), we mixed in embedding data to better align the LLM with embedding tasks
- We first pretrained the model on about 3T tokens of general data, with particular emphasis on data well suited to forming pairs
- We then collected about 600M supervised fine-tuning (SFT) examples, organized as (query, positive sample) pairs in an instruction/input/output format.
- Weakly supervised training
- At this stage we use the same data as the LLM SFT, but with a different data format and loss function. Specifically, the instruction and input serve as the query, and the output serves as the positive passage.
- To ensure higher data quality, we score the data with gte-Qwen2-7B-instruct and discard anything scoring below 0.4.
- To use the paired data efficiently and effectively, we train with the InfoNCE loss combined with in-batch negative sampling
- SoftMask (LLM2Vec)
- Embedding training requires holistic understanding of a sentence, modeling at the vector level with a bidirectional mask. Several key gaps exist between the two kinds of mask.
- Switching straight from the causal mask to a bidirectional mask in the weakly supervised fine-tuning stage can make training converge quickly due to low rank, but it tends to get stuck in a local optimum that is hard to optimize further.
- We plot loss curves with and without the soft-mask mechanism. Initially, the loss with soft masking falls more slowly than without it.
- The final loss with soft masking is lower, however, suggesting that soft masking lets the model learn more comprehensive feature representations early in training.
- Supervised training
- We divide tasks into four categories: retrieval, cross-lingual retrieval, classification, and semantic textual similarity (STS)
- Dynamic hard negative mining
- Fri, 9 May 2025 Drink this bowl of model soup and master the recipe for training embedding models
- Merging checkpoints from different stages yields better generalization
- Similar to mergekit
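The soup itself is just an element-wise mean of parameters from several fine-tuning runs; a sketch with state dicts represented as name-to-float-list maps:

```python
# Checkpoint-averaging ("model soup") sketch: average each parameter across
# several fine-tuned checkpoints of the same architecture.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    return {name: [sum(sd[name][i] for sd in state_dicts) / n
                   for i in range(len(state_dicts[0][name]))]
            for name in state_dicts[0]}

soup = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
# soup["w"] == [2.0, 3.0]
```

This only makes sense for checkpoints that share an architecture and a common fine-tuning lineage, which is why the papers above average runs from the same training recipe rather than arbitrary models.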
- Mon, 10 Mar 2025 Gemini Embedding: Generalizable Embeddings from Gemini
- Architecture: Gemini + mean pooling, 3072-dimensional embeddings; presumably gemma-3-4b-it
- noise-contrastive estimation (NCE) loss + MRL
- two-stage training
- Pre-finetuning on a large number of potentially noisy (query, target) pairs
- Finetuning on a large mixture of task-specific datasets which contain(query, target, hard negative target) triples
- Model Soup
- To obtain additional generalization performance, we averaged the parameters obtained from individual fine-tuning runs
- Improving Data Quality with Gemini
- Synthetic Data Generation
- Data Filtering
- Hard Negative Mining
- Tue, 20 May 2025 voyage-3.5 and voyage-3.5-lite: improved quality for a new retrieval frontier
- Matryoshka learning and quantization-aware training
- Thu, 5 Jun 2025 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
- The Qwen3 embedding and reranking models are built on the dense version of Qwen3 foundation models and are available in three sizes: 0.6B, 4B, and 8B parameters
- For text embeddings, we utilize LLMs with causal attention, appending an [EOS] token at the end of the input sequence. No bidirectional mask for vector-level modeling (contra LLM2Vec)
- Reranking models: reuse the generative model's template, treating reranking as the problem of generating a "yes" or "no" token
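Under that framing, the relevance score reduces to the softmax probability of the "yes" token against the "no" token. A sketch on raw logits (obtaining the two logits from a real model is omitted here):

```python
import math

# Relevance score for a generative reranker: P(yes) over the {yes, no} logits.
def yes_probability(yes_logit, no_logit):
    m = max(yes_logit, no_logit)          # subtract max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

score = yes_probability(2.0, -1.0)   # sigmoid of the logit gap 3.0
```

The score is a calibrated-looking value in (0, 1) per (query, document) pair, so candidates can be sorted by it directly without any extra scoring head.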
- Models Training
- Embedding Models
- Training Objective InfoNCE
- stage 1 Large-Scale Synthetic Data-Driven Weak Supervision Training
- Due to the exceptional performance of the Qwen3 Foundation model, the synthesized data is of notably high quality.
- stage 2 Supervised Fine-Tuning with High-Quality Synthetic and labeled Data
- stage 3 Model Merging using sampled Checkpoints from stage 2
- Reranking Models
- Supervised Fine-Tuning (SFT) loss
- stage 2 Supervised Fine-Tuning with High-Quality Synthetic and labeled Data
- stage 3 Model Merging using sampled Checkpoints from stage 2
- Analysis
- Effectiveness of Large-Scale Weakly Supervised Pre-Training
- Effectiveness of Model Merging
- No LLM2Vec, no hard negative mining, no knowledge distillation, and no distinct instructions for different tasks
- A real "without bells and whistles" feel
- Fri, 15 Aug 2025 CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity
- CMTEB
- Thu, 4 Sep 2025 EmbeddingGemma3
- https://huggingface.co/collections/google/embeddinggemma-68b9ae3a72a82f0562a80dc4
- Architecture
- Gemma3 308M
- use bi-directional attention instead of causal (one-way) attention
- 2k-token context window
- mean pooling
- two dense layers transform the text embeddings into their final form, a 768-dimensional vector.
- MRL 512, 256, or 128
- approximately 320 billion tokens
- Thu, 15 Jan 2026 The Voyage 4 model family: shared embedding space with MoE architecture
- based on Qwen3
- A single shared embedding space & Asymmetric retrieval
- Matryoshka learning and quantization
- Tue, 17 Feb 2026 jina-embeddings-v5-text: Task-Targeted Embedding Distillation
- jina-embeddings-v5-text-small, based on Qwen3-0.6B
- jina-embeddings-v5-text-nano, based on EuroBERT-210M
- Matryoshka learning and quantization
- Training
- Embedding Distillation
- We use distillation to transfer knowledge from Qwen3-Embedding-4B model
- Task-Specific Adapter Training
- Asymmetric Retrieval Adapter
- Text Matching (STS) Adapter
- Clustering Adapter
- Classification Adapter
- Sat, 17 Apr 2021 BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
- 18 datasets
- average query length between 3 and 192 words
- average document length between 11 and 635 words
- Mon, 12 Feb 2024 Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
- 12 tasks
- Thu, 13 Oct 2022 MTEB: Massive Text Embedding Benchmark
- The Massive Text Embedding Benchmark, led by Hugging Face
- https://huggingface.co/spaces/mteb/leaderboard
- Fri, 7 Apr 2023 T2Ranking: A large-scale Chinese Benchmark for Passage Ranking
- Builds the T2Ranking dataset
- Trains retrieval and re-ranking models on Chinese BERT-base
- Thu, 18 Apr 2024 LongEmbed: Extending Embedding Models for Long Context Retrieval
- Retrieval models enter the long-context era; RoPE's stock keeps rising
- LONGEMBED benchmark, which includes two synthetic and four real-world tasks
- we pretrain E5-RoPE following the training procedure and data of E5.
- E5Base and E5-RoPEBase are selected as the comparison subjects thanks to their shared training process, training data, and comparable performance on BEIR and LONGEMBED benchmarks.
- Comparison of Extension Methods
- APE-based Models.
- We observe that plug-and-play methods including GP, RP, PI and PCW strategies yield comparable results with no significant disparities.
- On the other hand, further tuning consistently yields additional performance gains for both models, across all target context lengths.
- This suggests that freezing the original model weights and fine-tuning exclusively the added position embeddings can effectively extend the model’s context window while strictly maintaining model’s original ability
- RoPE-based Models.
- It is observed that RoPE-specific methods including NTK and SE yield significant improvements for both models across all datasets, surpassing PCW, PI and GP by a large margin.
- SE / NTK is short for SelfExtend / NTK-Aware Interpolation
- Further, our analysis reveals the superiority of RoPE-based embedding models over APE-based ones in context window extension.
- Hence, we advocate for the use of RoPE for future embedding models.
- Tue, 16 Jul 2024 BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
- we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
- I get the feeling you're deliberately making life hard for retrieval models
- MAIN RESULTS
- Existing retrieval systems perform poorly on BRIGHT.
- Querying with LLM reasoning steps improves retrieval performance.
- Retrieval augmentation boosts performance in question-answering.
- ANALYSIS
- RERANKING WITH LLMS ENHANCES RETRIEVAL PERFORMANCE
- ROBUSTNESS AGAINST DATA LEAKAGE FROM PRETRAINING
- LONG-CONTEXT RETRIEVAL WITH A REDUCED SEARCH SPACE IS CHALLENGING
- Tue, 17 Dec 2024 AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
- Automated, Heterogeneous, Dynamic
- Candidate Generation
- Sample one document from the raw corpus as the positive document
- Prompt LLM to generate the characters who might find the document useful
- Prompt LLM to generate the scenarios in which the character might find the document useful
- Prompt LLM to generate the query ori_qi based on the specific character and scenario
- To diversify the generated queries, we consider the following attributes when designing the prompt: query length, query type, information-based type, and expression style.
- Prompt LLM to rewrite the generated query multiple times to try to avoid duplicated tokens from the raw corpus
- Prompt LLM to generate some hard negative documents based on the generated query qi and the positive document
- Repeat Steps 1-6
- Quality Control
- Filter low-quality queries
- To improve the quality of generated queries, we utilize LLM to assess the relevance between the query qi and the positive document
- Correct the false relevance labels
- we design a three-step pipeline to correct the false relevance labels
- Recall with embedding model
- Pre-label with re-ranking models
- Use multiple re-ranking models to re-rank Lrecall
- Label with LLM (Use LLM as labeler.)
- Wed, 19 Feb 2025 MMTEB: Massive Multilingual Text Embedding Benchmark
- we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits.
- These optimizations allow us to introduce benchmarks that drastically reduce computational demands
- Models
- Dense retrieval broadly exhibits clear scaling laws, yet small models still appear near the top of the MTEB leaderboard from time to time
- Monolingual: roughly BERT-Base scale (L=12, H=768, A=12, 110M params)
- Multilingual: roughly BERT-Large scale (L=24, H=1024, A=16, 340M params)
- Whether even larger models (1B, 7B) are really worthwhile for embeddings remains doubtful
- Choose a suitable base model; multilingual and long-context ability matter most, but the BERTs were all trained around 2020 and are generally less thoroughly trained than today's LLMs
- More and more large decoder-only language models (LLMs) used as retrievers are landing on the MTEB leaderboard, which greatly widens the choice of base models
- Better-funded companies train a base model from scratch and fine-tune it into a retrieval model, e.g. ST5 and mGTE
- Data
- Synthesize data with GPT
- Weakly-Supervised Contrastive Pre-training + Supervised Fine-tuning
- Knowledge distillation
- Algorithms
- Consistency filtering + hard example mining at an appropriate difficulty
- Draw inspiration from other metric learning and contrastive learning tasks
- Multi-task learning, model distillation
Many papers call the rerank model a "cross-encoder," as opposed to the "dual-encoder" of dense retrieval. Compared with rerank models, dense retrieval can draw on metric learning and contrastive learning on the algorithmic side, pair with approximate nearest neighbor search on the systems side, and plug into large-scale open-domain QA downstream. Rerank models are honestly much more boring: a rerank model is essentially a binary classification task, roughly equivalent to BERT's Next Sentence Prediction pretraining objective.
- Sun, 13 Jan 2019 Passage Re-ranking with BERT
- Re-ranking predates even dense retrieval
- Thu, 12 Oct 2023 Fine-Tuning LLaMA for Multi-Stage Text Retrieval
- LLM as reranker
- Sun, 4 Feb 2024 Born for RAG: the BCE embedding technical report
- We designed BCEmbedding as a two-stage retriever with a division of labor: the "offline" embedding model maximizes recall, while the "online" reranker handles fine ranking and low-quality filtering.
- To solve the information-interaction problem, the reranking stage uses a cross-encoder architecture (shown in Figure 2-2 (b)). The reranker lets the user query and the corpus passages interact, so the model can adaptively identify more accurate semantic relationships, giving the approach a high performance ceiling. The drawback is that semantic-relationship extraction between query and passages must happen online, which is inefficient and cannot cover the full corpus in real time.
- Combining the strengths of both stages: recall quickly finds passages related to the user's query, and reranking pushes the truly relevant passages to the top while filtering out low-quality ones. Two-stage retrieval strikes a good balance between retrieval quality and efficiency, and has great practical value.
- Meaningful rerank scores
- Among the "evaluation criteria," a good retriever has one more trait: it can filter out low-quality information. The (query, passage) relevance score our reranker outputs is useful not only for ranking passages; its absolute value reflects the true degree of semantic relevance, so it can be used to identify and filter out low-quality passages. This helps the LLM answer questions in RAG considerably: a leaner context with less distracting information effectively improves answer quality [17].
- Based on our production experience and feedback from the open-source community, we recommend a threshold of 0.35-0.4 on bce-reranker-base_v1's scores for filtering low-quality passages. User feedback in practice has been quite positive.
- Mon, 18 Mar 2024 bge-reranker-v2-m3、BAAI/bge-reranker-v2-gemma、 BAAI/bge-reranker-v2-minicpm-layerwise.
- Rerankers don't even rate a technical report
- LLM-based reranker
- Tue, 25 Jun 2024 Jina Reranker v2 for Agentic RAG: Ultra-Fast, Multilingual, Function-Calling & Code Search
- Jina Reranker v2 takes the reranker to a new level
- Multilingual: More relevant search results in 100+ languages, outperforming bge-reranker-v2-m3;
- Agentic: State-of-the-art function-calling and text-to-SQL aware document reranking for agentic RAG;
- Code retrieval: Top performance on code retrieval tasks, and
- Ultra-fast: 15x more documents throughput than bge-reranker-v2-m3, and 6x more than jina-reranker-v1-base-en.
- Fri, 26 Jul 2024 bge-reranker-v2.5-gemma2-lightweight
- trained based on gemma2-9b
- Mon, 29 Jul 2024 mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
- Trains a base model from scratch and fine-tunes it into a retrieval (embeddings) model and a reranking model. Must be nice to have that kind of budget.
- Thu, 7 Nov 2024 Best Practices for Distilling Large Language Models into BERT for Web Search Ranking
- Wed, 22 Jan 2025 Alibaba-NLP/gte-reranker-modernbert-base
- Wed, 13 Mar 2025 Baked-in Brilliance: Reranking Meets RL with mxbai-rerank-v2
- Qwen2ForCausalLM
- we used a three-step reinforcement-learning process:
- GRPO (Group Relative Policy Optimization)
- Contrastive Learning
- Preference Learning
- Wed, 4 Jun 2025 ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
- Qwen2.5 LLM as reranker
- In summary, our paper demonstrates three-fold contributions:
- Our quantitative analysis reveals that SLMs struggle with understanding task prompts and generating correctly formatted outputs for reranking tasks without task-specific fine-tuning.
- We propose a novel two-stage approach, ProRank, to activate the power of SLMs to effectively rerank documents with interpretable relevance scores, combining GRPO (Shao et al., 2024) for coarse-grained scoring, followed by fine-grained scoring.
- Extensive evaluations demonstrate that our approach achieves superior reranking performance, with ProRank 0.5B SLM model outperforming the larger 32B fine-tuned LLM reranking models.
- GRPO clearly helps keep the output format to 0/1. Whether GRPO + SFT improves over direct SFT isn't made clear, but it still serves as a warmup method.
- Thu, 5 Jun 2025 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
- Reranking Models, 使用 生成模型的模版,Reranking 认为是个生成 yes 和 no token问题
- Models Training
- Reranking Models
- Supervised Fine-Tuning (SFT) loss
- stage 2 Supervised Fine-Tuning with High-Quality Synthetic and labeled Data
- stage 3 Model Merging using sampled Checkpoints from stage 2
- Reranking Models
- Fri, 22 Aug 2025 How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
- Is there mileage in LLM + pairwise & listwise approaches?
- ColBERT-style rankers are indeed somewhat stronger than embeddings, but weaker than proper rerankers
- Sun, 27 Aug 2025 contextual rerank-v2
- MistralForCausalLM 6b
- Qwen3ForCausalLM 2b
- Qwen3ForCausalLM 1b
- Sat, 18 Oct 2025 llama-nemotron-rerank-1b-v2
- LlamaBidirectionalForSequenceClassification
- Mean pooling
- Mon, 29 Sep 2025 jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking
- Congratulations to Elastic (NYSE: ESTC) on completing its acquisition of Jina AI
- Built upon Qwen3-0.6B
- listwise reranker that introduces a novel last but not late interaction
- Prompt Template: Query Document 1 <|doc_emb|> Document 2 <|doc_emb|> Document 3 <|doc_emb|> Query <|query_emb|>
- <|doc_emb|> & <|query_emb|> -> projector -> cosine score
- Training
- Loss Functions: InfoNCE + dispersive loss + similarity loss
- Multi-Stage Training
- Stage 1: Foundation Specialization
- Stage 2: Context and Hard Negative Mining
- Training systematically mines hard negatives across multiple retrieval systems including BGE, Jina, GTE, and E5-Large with up to 25 negatives per query
- Stage 3: Model Ensemble and Optimization
- The technical evolution of rerankers
- Before digging into the design of Jina Reranker v3, it is worth a quick review of the mainstream paradigms in reranking. Traditional learning-to-rank methods fall roughly into three categories:
- Pointwise: the earliest approach. The model considers one document at a time, judging each document's relevance to the query in isolation and outputting an absolute score. Its flaw is that it completely ignores relationships between documents and lacks any global view.
- Pairwise: a step further. Rather than scoring single documents, the model learns to rank by comparing document pairs (e.g., "is document A more relevant than document B?"). It gains relative judgment, but its view is still confined to two-way comparisons and cannot capture global information about the whole candidate list.
- Both approaches, however, lack a holistic view of the candidates. An ideal reranker should review all candidate documents at once, weigh how they complement, duplicate, or even contradict one another, and produce the best overall ordering. That is the core idea of the listwise approach, and the technical route Jina Reranker v3 takes.
- To achieve this, Jina Reranker v3 concatenates the user query and all candidate documents into one long sequence and feeds it to the model as a whole. Within a single context window, the model processes all the text via causal attention.
- This design lets each document attend to the content of every other document while being encoded, enabling cross-document interaction. After the whole sequence is processed, the model extracts a context-aware representation vector from the special position reserved at the end of each document.
Is there mileage in ColBERT + Late Chunking?
- Mon, 27 Apr 2020 ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Thu, 2 Dec 2021 ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
- Tue, 23 Apr 2024 A Reproducibility Study of PLAID
- MaxSim acceleration
- Wed, 29 May 2024 MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
- MaxSim acceleration
- Mon, 23 Sep 2024 Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- Pooling tokens down to half, performance even improves a bit
- Thu, 10 Jul 2025 maxsim-cpu
- MaxSim acceleration
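The MaxSim operator these projects accelerate is simple to state: every query token takes its best match over all document tokens, and the per-token maxima are summed. A toy sketch with 2-d token vectors:

```python
# ColBERT late-interaction score: sum over query tokens of the max dot product
# against all document token embeddings.
def maxsim(query_tokens, doc_tokens):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

score = maxsim([[1.0, 0.0], [0.0, 1.0]],
               [[0.9, 0.1], [0.2, 0.8]])   # 0.9 + 0.8 = 1.7
```

Because the score is a sum of per-query-token maxima over many document token vectors, the naive cost is |q| × |d| dot products per candidate, which is exactly what PLAID, MUVERA, token pooling, and maxsim-cpu each try to cut down.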
- Mon, 12 Jul 2021 SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
- Tue, 21 Sep 2021 SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
- Mon, 11 Mar 2024 SPLADE-v3: New baselines for SPLADE
- Tue, 16 Jul 2024 BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
- we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
- Tue, 29 Apr 2025 ReasonIR: Training Retrievers for Reasoning Tasks
- Pilot Study
- Existing public training datasets are helpful for factual retrieval but not for reasoning-intensive retrieval
- Queries from these two datasets are mostly simple factual questions, whose relevant documents can often be retrieved using direct lexical or semantic matching.
- However, queries in reasoning benchmarks are much longer and more complex.
- Longer effective context length is desirable to better leverage test-time scaling through query rewriting.
- the length of a rewritten query can be a new dimension of test-time scaling and a longer effective context length is desirable for long rewritten queries.
- Query decomposition has been shown to be effective in multi-hop retrieval tasks
- an information-rich long query is better than several decomposed short queries on BRIGHT
- ReasonIR: Synthesizing Hard and Varied-length Retriever Training Data
- public data to specifically train a general autoregressive LLM for retrieval
- varied-length (VL ) data to extend the effective context length of the retriever for input queries
- we ask the LLM to also generate a positive document for the query, following the distillation idea in Wang et al. (2023b)
- hard query (HQ) data to improve the retriever’s ability to handle reasoning-intensive queries
- we synthesize reasoning-intensive training data by generating hard queries (HQ) from high-quality documents using a “human-like brainstorm guideline” for hard query generation.
- Reasoning-worthy seed document selection
- Reasoning-intensive document-to-query generation.
- An ideal set of reasoning-intensive queries has three properties: challenging, self-contained, diverse
- As previous work has shown unsuccessful attempts on directly prompting an LLM to generate difficult questions
- we ask the LLM to reason about the background knowledge, common problem-solving patterns, and realistic scenarios before formulating a difficult question.
- ...
- Multi-turn Hard Negative Generation
- Existing research typically identifies hard negatives by selecting top-ranked but irrelevant documents from a retriever such as BM25
- However, we find that this does not work for reasoning-intensive queries for 3 reasons:
- First, existing retrievers perform poorly on reasoning-intensive queries
- Second, the goal of retrieval has shifted from finding documents that contain direct answers to finding a wide range of documents that are helpful for reasoning
- Third, the seed document may not be the most relevant to the generated query
- generating the hard negative in a separate turn
- Reasoning-intensive Information Retrieval (IR) Performance
- REASONIR-8B benefits from test-time scaling with query rewriting on BRIGHT
- REASONIR-8B can form an ensemble with a sparse retriever or be combined with an LLM-based reranker for better retrieval.
- Thu, 22 May 2025 lightonai/Reason-ModernColBERT
- Reason-ModernColBERT is a late interaction model trained on the reasonir-hq dataset.
- It achieves extremely competitive performance on the BRIGHT benchmark aimed at evaluating reasoning-intensive retrieval performance
- ModernBERT and ColBERT keep proving their worth
- Mon, 11 Aug 2025 DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
- DIVER-DChunk
- To handle lengthy documents, we employed the Chonkie library to perform semantic-aware chunking.
- Using the Qwen3-Embedding-0.6B (Zhang et al., 2025b) model with a similarity threshold of 0.5,
- the text was divided into smaller chunks of up to 4096 tokens, with a minimum size of one sentence per chunk.
- DIVER-QExpand
- we retain its iterative design in DIVER-QExpand but make two practical modifications
- First, we replace the BM25 retriever with a dense retriever trained for reasoning-intensive tasks
- Second, instead of concatenating all intermediate query expansions, which can often exceed 2000 tokens, we simplify the process by retaining only the original query and the final-round expansion.
- DIVER-Retriever
- General-purpose retrievers lose to reasoning-specific ones: Qwen3-4B averages only 5.6 (lol)
- +BM25 (hybrid) brings a sizable improvement
- DIVER-Rerank
- DIVER(v2) includes advanced query expansion and combined pointwise and listwise DIVER-Rerank, achieving the latest state-of-the-art.
- DIVER(v2) reaches an nDCG@10 of 45.8, surpassing BGE-Reasoner by +0.8 points and establishing a new SOTA
- To complement this local evaluation, the listwise module (DIVER-Rerank-Listwise) employs an LLM (e.g. Deepseek-R1-0528) to directly rank the top-100 candidate documents of the query, offering a global view of document relevance. The final reranking result integrates both modules, leveraging the fine-grained local scoring of pointwise rerankers and the holistic ranking ability of listwise rerankers.
- Does using Deepseek-R1-0528 to rerank the top-100 kill the competition?
- Mon, 27 Apr 2020 ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Mon, 5 Feb 2024 BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
- Mon, 23 Sep 2024 Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- Tue, 27 Aug 2019 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Fri, 19 Feb 2021 Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations
- Fri, 23 Feb 2024 Self-Retrieval: Building an Information Retrieval System with One Large Language Model
- LLM can memorize (passage -> title)
- At least on Hit@1, Hit@5, and MRR@5 it beats dense retrieval models (GTR, BGE, OpenAI)?? So dense retrieval has to be paired with a reranker??
- Tue, 20 Apr 2021 RoFormer: Enhanced Transformer with Rotary Position Embedding
- Rotary Position Embedding
- Wed, 7 Dec 2022 Text Embeddings by Weakly-Supervised Contrastive Pre-training
- Classic BERT with a 512-token context
- Mon, 30 Oct 2023 Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
- BERT with ALiBi, GEGLU, BF16, mean pooling
- Fri, 29 Dec 2023 MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
- This architecture combines FlashAttention [11], ALiBi [44], Gated Linear Units[12, 50], a dynamic unpadding module [66], and low precision LayerNorm.
- Fri, 2 Feb 2024 Nomic Embed: Training a Reproducible Long Context Text Embedder
- NomicBertModel has a fairly modern architecture (bert_with_rope)
- rotary + SwiGLU + Flash Attention; 12 layers, 768 dims, 137M parameters, trained at a length of 2048
- Mon, 5 Feb 2024 BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
- Uses XLM-RoBERTa: absolute position embeddings, 8192-token length, 24 layers, 1024 dims
- Wed, 8 May 2024 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- v1: BertModel at 22M, 33M, 110M, 137M, and 335M parameters; 512-token length
- m-long: NomicBertModel, max_trained_positions: 2048
- m-v1.5: BertModel, 512-token length
- Mon, 27 May 2024 NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
- Mistral-7B + LLM2Vec + latent attention layer
- Mon, 29 Jul 2024 mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
- BERT + RoPE + GLU + xformers; 12 layers, 768 dims, 306M parameters, smaller than BGE-M3
- pre-trained by masked language modeling (MLM) via a two-stage curriculum for the native 8,192 tokens context.
- Tue, 3 Dec 2024 Arctic-Embed 2.0: Multilingual Retrieval Without Compromise
- m_v2: gte-multilingual-mlm-base
- l_v2: bge-m3-retromae
- Mon, 16 Sep 2024 jina-embeddings-v3: Multilingual Embeddings With Task LoRA
- Based on the Jina-XLM-RoBERTa architecture, this model supports Rotary Position Embeddings to handle long input sequences up to 8192 tokens.
- Wed, 18 Dec 2024 ModernBERT
- Tue, 11 Feb 2025 Training Sparse Mixture Of Experts Text Embedding Models
- Embedding models enter the Mixture-of-Experts era
- Thu, 5 Jun 2025 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
- Qwen3
- Tue, 17 Dec 2024 AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
- Automated, Heterogeneous, Dynamic
- Thu, 5 Jun 2025 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
- stage 2 Supervised Fine-Tuning with High-Quality Synthetic and labeled Data
- Wed, 17 Apr 2019 Document Expansion by Query Prediction
- the task is to predict a set of queries for which that document will be relevant.
- We optionally re-rank these retrieved documents using BERT (Devlin et al., 2018) as described by Nogueira and Cho (2019).
- Mon, 2 Dec 2019 From doc2query to docTTTTTquery
- Very short, only three pages
- Using T5 for query generation feels like another era; humans have only been able to generate text freely for a few short years
- Wed, 29 Apr 2020 Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
- QGen (3 Synthetic Question Generation)
- Our question generator is an encoder-decoder with Transformer
- Parameter weights are also shared and are initialized from a pretrained RoBERTa (Liu et al., 2019) checkpoints.
- Mon, 1 Mar 2021 BeIR/query-gen-msmarco-t5-large-v1
- Still using docTTTTTquery
- Tue, 14 Dec 2021 GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
- Query Generation via T5 (DocT5Query)
- Wed, 30 Nov 2022 OpenAI releases GPT-3.5
- Wed, 15 Feb 2023 How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval
- Still using docTTTTTquery? Why not GPT-3.5?
- Sat, 24 Feb 2024 OpenAI vs Open-Source Multilingual Embedding Models Choosing the model that works best for your data
- We have arrived at a world where models are trained on QA datasets synthesized with ChatGPT and then evaluated on QA datasets synthesized with ChatGPT
- Mon, 5 Feb 2024 BGE M3-Embedding
- continue pretraining (RetroMAE)
- We can observe that RetroMAE can significantly improve the retrieval performance, and pre-training on unsupervised data can further enhance the retrieval quality of the embedding model.
- Supervised Fine-tuning https://huggingface.co/datasets/Shitao/bge-m3-data
- Mixed data: 8 English, 7 Chinese, and 2 other-language datasets
- Synthetic data: GPT-3.5 is used for data synthesis https://huggingface.co/datasets/Shitao/MLDR
- MLDR is a Multilingual Long-Document Retrieval dataset built on Wikipedia, Wudao and mC4, covering 13 typologically diverse languages. Specifically, we sample lengthy articles from Wikipedia, Wudao and mC4 datasets and randomly choose paragraphs from them. Then we use GPT-3.5 to generate questions based on these paragraphs.
- Wed, 8 May 2024 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- we leverage Large Language Models to generate novel queries
- Fri, 29 Mar 2024 Gecko: Versatile Text Embeddings Distilled from Large Language Models
- LLM-based Diverse Query Generation
- we employ few-shot prompts to control the diversity of queries
- Thu, 20 Jul 2023 Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
- This dataset, based on positive pairs from the SNLI dataset and negatives created with GPT-3.5
- Fri, 22 Sep 2023 AnglE-optimized Text Embeddings
- Generates positive/negative pairs for STS tasks
- Sun, 31 Dec 2023 Improving Text Embeddings with Large Language Models
- query, positive_document, and hard_negative_document are all synthetic?? That is wild
- Mon, 30 Jan 2023 REPLUG: Retrieval-Augmented Black-Box Language Models
- REPLUG LSR (REPLUG with LM-Supervised Retrieval) treating the LM as a frozen, black-box scoring function.
- 4.1. Computing Retrieval Likelihood
- 4.2. Computing LM likelihood
- We use the LM as a scoring function to measure how much each document could improve the LM perplexity.
- Specifically, we first compute PLM(y | d, x), the LM probability of the ground truth output y given the input context x and a document d.
- The higher the probability, the better the document di is at improving the LM’s perplexity.
- 4.3. Loss Function ( KL divergence )
- 4.4. Asynchronous Update of the Datastore Index
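The LSR objective in 4.1-4.3 can be sketched as follows. The scores, temperatures, and document count are toy values; in the paper both distributions are computed over the retrieved top-k, with only the retriever trainable.

```python
import numpy as np

def softmax(x, temp=1.0):
    z = np.asarray(x, dtype=float) / temp
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def replug_lsr_loss(retriever_scores, lm_log_likelihoods, gamma=1.0, beta=1.0):
    """KL(P_R || Q_LM) over the retrieved documents.

    retriever_scores:    similarity s(q, d_i) from the trainable retriever
    lm_log_likelihoods:  log P_LM(y | d_i, x) from the frozen LM (supervision)
    """
    p_r = softmax(retriever_scores, gamma)    # retrieval likelihood (4.1)
    q_lm = softmax(lm_log_likelihoods, beta)  # LM preference over docs (4.2)
    return float(np.sum(p_r * (np.log(p_r) - np.log(q_lm))))  # 4.3

# If the retriever already ranks documents the way the LM prefers, loss ~ 0.
aligned = replug_lsr_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
misaligned = replug_lsr_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
```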
- Wed, 3 May 2023 Improving Contrastive Learning of Sentence Embeddings from AI Feedback
- we first mask some words of the original sentence with different mask rates using the mask token, in order to delete some information in the original sentence.
- Then we write a task description prompt to steer GPT-3 to generate new sentences based on masked sentences.
- We write a task description prompt to steer GPT-3 to generate a similarity score between 0 and 1 for each sample pair generated in step 1
- Surprisingly it beats SimCSE; or is SimCSE just that weak?
- Sat, 21 Oct 2023 Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels
- However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query.
- We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers,
- enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking
- Fri, 29 Mar 2024 Gecko: Versatile Text Embeddings Distilled from Large Language Models
- LLM-based Positive and Negative Mining
- we use an existing embedding model1 to retrieve top 𝑁 neighbors 𝑃 from the corpus given a generated query 𝑞.
- We then employ the same LLM used for the query generation to rank these retrieved passages based on their relevance to the query
- query likelihood, relevance classification
- we create the FRet dataset, comprised of 6.6M examples, each containing a task, a query, a positive passage, and a negative passage.
- we find that using the most relevant passage chosen by an LLM is always better than using the original passage as positive.
- Tue, 29 Apr 2025 ReasonIR: Training Retrievers for Reasoning Tasks
- Varied-length Synthetic Query and Positive Document Generation
- Reasoning-intensive Document-to-query Generation
As more and more models are open-sourced, knowledge distillation is increasingly becoming an efficient way to train.
- Tue, 14 Dec 2021 GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
- In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder.
- As we show in Appendix E, just using these unsupervised techniques is not sufficient and the resulting models perform poorly.
- Query Generation via T5 (DocT5Query)
- Negative Mining via Dense Retrieval (msmarco-distilbert-base-v3 and msmarco-MiniLM-L-6-v3)
- Pseudo Labeling via Cross-Encoder (ms-marco-MiniLM-L6-v2 cross-encoder)
- Previous work has shown that cross-encoders achieve much higher performances
- and are less prone to domain shifts
- MultipleNegativesRanking(MNRL) loss
- Mon, 19 Aug 2024 Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores
- utilizes soft labels derived from expert-augmented scores
- Thu, 26 Dec 2024 Jasper and Stella: distillation of SOTA embedding models
- Stage 1&2: Distillation from Multiple Teachers
- L_cosine
- L_sim
- L_resim
- teacher models are used to automatically generate soft labels for all text pairs
- The biggest advantage of distillation vectors is that we do not need any supervised data
- In stage 1, stage 2 and stage 3, we use fineweb-edu as our main text training dataset
- Stage 3: Dimension Reduction
- Matryoshka Embedding
- Stage 4: Unlock Multimodal Potential
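Stage 3's Matryoshka-style dimension reduction rests on the property that a prefix of the embedding is itself a usable embedding. A sketch of the inference-side truncation only (training additionally applies the loss at each target dimension, which this toy snippet does not show):

```python
import numpy as np

def truncate(emb, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates and
    re-normalize, so cosine similarity still works in the smaller space."""
    v = np.asarray(emb, dtype=float)[:dim]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.normal(size=1024)   # full-size embedding (toy values)
small = truncate(full, 256)    # 4x smaller index, same similarity machinery
```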
- Thu, 7 Nov 2024 Best Practices for Distilling Large Language Models into BERT for Web Search Ranking
- Knowledge Distillation with Rank Loss
- Wed, 26 Mar 2025 Dewey Long Context Embedding Model: A Technical Report
- Architecture
- modernbert-large
- Training Recipe
- Chunk-Alignment Training
- Our model can generate three types of embeddings (Late Chunking keeps proving its worth):
- CLS embedding
- Chunk embeddings
- Mean embedding
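A sketch of how the chunk embeddings fall out of a single full-document encoding (the late-chunking idea): the token vectors below are random stand-ins for a real encoder's contextualized outputs, and the chunk spans are assumed token offsets.

```python
import numpy as np

def late_chunk_embeddings(token_embs, chunk_spans):
    """Late chunking: encode the whole document once, then mean-pool the
    contextualized token vectors inside each chunk span (start, end)."""
    out = []
    for start, end in chunk_spans:
        v = token_embs[start:end].mean(axis=0)
        out.append(v / np.linalg.norm(v))
    return np.stack(out)

rng = np.random.default_rng(0)
token_embs = rng.normal(size=(100, 32))    # contextualized token vectors
chunks = late_chunk_embeddings(token_embs, [(0, 40), (40, 80), (80, 100)])
mean_emb = token_embs.mean(axis=0)         # whole-document mean embedding
```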
- Knowledge distillation
- We use Linq-Embed-Mistral(Kim et al., 2024) as our teacher model.
- We get unsupervised texts from Infinity-Instruct and fineweb-edu.
- So knowledge distillation is done directly on unsupervised texts
- We take two strategies to split text to chunks:
- Split Text by Word 30% probability
- RecursiveCharacterTextSplitter in langchain 70% probability
In contrastive pre-training, use a large batch size with in-batch negatives; in supervised fine-tuning, use a separate strong model to mine good hard negatives offline.
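The in-batch-negatives objective referenced throughout this section, as a small numpy sketch (batch size, dimensions, and temperature are illustrative):

```python
import numpy as np

def info_nce_in_batch(Q, D, temp=0.05):
    """InfoNCE with in-batch negatives: row i of D is the positive for row i
    of Q; every other row in the batch serves as a negative."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    logits = (Q @ D.T) / temp                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # -log p(positive), averaged

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))
loss_easy = info_nce_in_batch(Q, Q)                         # positives identical
loss_rand = info_nce_in_batch(Q, rng.normal(size=(8, 16)))  # random "positives"
```

Larger batches mean more (and harder) in-batch negatives per example, which is why the pre-training recipes above push the batch size so high.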
- Wed, 1 Jul 2020 Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
- Inefficacy of Local In-Batch Negatives
- Approximate nearest neighbor Negative Contrastive Learning (ANCE)
- Asynchronous Index Refresh
- Searches the whole corpus online for hard negatives
- Nowadays the supervised fine-tuning stage uses a separate strong model for offline hard negative mining
- Fri, 16 Oct 2020 RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering
- Cross-batch Negatives
- Denoised Hard Negatives
- Data Augmentation
- Thu, 14 Sep 2023 C-Pack: Packed Resources For General Chinese Embeddings
- Contrastive Pre-training
- we purely rely on in-batch negative samples [25] and resort to a big batch size (as large as 19,200) to improve the discriminativeness of the embedding.
- Sun, 4 Feb 2024 Born for RAG: the BCE embedding technical report (为RAG而生-BCE embedding技术报告)
- Hard negative mining?
- While training our embedding model, we found that overly hard negatives are harmful: they "confuse" the model during training and hurt its final performance.
- In a large corpus, automated hard-negative mining without human verification will inevitably "mine" actual positives.
- Wed, 8 May 2024 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models
- Supervised Fine-tuning
- Tunable Hard Negative Mining
- Sat, 11 May 2024 Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
- For each retrieval task, we use piccolo-base-zh [12] to conduct negative sample mining.
- We randomly select 15 samples from the mining negatives of rank 50 - 100 as the final hard negative samples.
- We avoid using higher-rank negative samples as their inclusion typically leads to a decline in performance.
- This is caused by a variety of reasons, such as inaccurate dataset annotation.
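Piccolo2's recipe above amounts to sampling from a mid-rank window. A sketch with hypothetical document ids; `lo`, `hi`, and `k` mirror the rank 50-100 window and 15-sample choice, and the known positive is filtered out:

```python
import random

def mine_hard_negatives(ranked_doc_ids, positive_ids, lo=50, hi=100, k=15, seed=0):
    """Sample hard negatives from a mid-rank window: skip the very top ranks,
    which often hide unlabeled positives (false negatives), and skip the deep
    tail, which is too easy to be useful."""
    window = [d for d in ranked_doc_ids[lo:hi] if d not in positive_ids]
    return random.Random(seed).sample(window, min(k, len(window)))

ranking = [f"doc{i}" for i in range(1000)]   # retriever output, best first
negs = mine_hard_negatives(ranking, positive_ids={"doc60"})
```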
- Mon, 22 Jul 2024 NV-Retriever: Improving text embedding models with effective hard-negative mining
- hard-negative mining
- Fri, 28 Oct 2024 SFR-Embedding-Mistral: Enhance Text Retrieval with Transfer Learning
- Impact of Hard Negatives
- Strategy to Eliminate False Negatives
- The results indicate that the range from 30 to 100 yields improved performance.
- This implies that the top-ranked documents (0-30) may include some false negatives,
- while those ranked beyond 100 lack sufficient challenge.
- Number of Hard Negatives
- Nevertheless, our findings suggest that the training process remains relatively stable regardless of the number of hard negatives utilized.
- Impact of Batch Size
- However, enlarging the batch size from 2048 to 8192 does not result in any significant change in performance.
- Teacher models for hard negative mining
- in general, more powerful models can yield more effective hard negatives (SFR-Embedding-Mistral > E5-Mistral > BGE-base).
- In the future, it will be intriguing to explore the impact of multi-round training on two fronts
- Tue, 27 Aug 2019 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Objective Function
- Classification Objective Function o = softmax(Wt(u, v, |u − v|))
- Regression Objective Function cosine-sim(u, v)
- Triplet Objective Function max(||sa − sp|| − ||sa − sn|| + ε, 0)
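The triplet objective above is easy to make concrete. A sketch with margin ε = 1 and toy 2-D vectors: the loss is zero once the anchor is at least ε closer (in Euclidean distance) to the positive than to the negative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Sentence-BERT triplet objective:
    max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])           # close to the anchor
n = np.array([3.0, 0.0])           # far away: constraint satisfied, loss is 0
loss_ok = triplet_loss(a, p, n)
loss_bad = triplet_loss(a, n, p)   # roles swapped: 3.0 - 0.1 + 1.0 = 3.9
```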
- Tue, 25 Feb 2020 Circle Loss: A Unified Perspective of Pair Similarity Optimization
- Fri, 22 Sep 2023 AnglE-optimized Text Embeddings
- ANGLE OBJECTIVE
- Mon, 27 Mar 2023 Sigmoid Loss for Language Image Pre-Training
- The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.
- Sat, 11 May 2024 Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training
- Multi-task Hybrid Loss
- Retrieval and Reranking Loss: the standard InfoNCE loss with in-batch negatives
- STS and PairClassification Loss: the cosent loss function
- Classification and Clustering Loss: the SFR embedding method
- Wed, 11 Jun 2025 On the Similarities of Embeddings in Contrastive Learning
- we propose an auxiliary loss that reduces the variance of negative-pair similarities in mini-batch settings.
- Empirical results show that incorporating the proposed loss improves performance in small-batch settings.
- Wed, 24 Aug 2022 DPTDR: Deep Prompt Tuning for Dense Passage Retrieval
- Can Deep Prompt Tuning produce a competitive model?
- Mon, 16 Sep 2024 jina-embeddings-v3: Multilingual Embeddings With Task LoRA
- And what about LoRA?
- Fri, 28 Oct 2024 SFR-Embedding-Mistral: Enhance Text Retrieval with Transfer Learning
- The SFR-Embedding-Mistral marks a significant advancement in text-embedding models, building upon the solid foundations of E5-mistral-7b-instruct and Mistral-7B-v0.1
- LoRA adapters with rank r=8 are added to all linear layers, resulting in 21M trainable parameters.


