[feat] Support Minimind retrieval-augmented generation (RAG) #534
+1,040
−0
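Thanks to the author for open-sourcing such a great project; I have learned a lot from it! Building on Minimind, I implemented retrieval-augmented generation (RAG), which lets Minimind consult external documents before answering. It is essentially a simplified reproduction of DPR (Karpukhin et al., 2020) on top of Minimind.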
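For readers unfamiliar with DPR, here is a minimal sketch of its training objective: question and passage embeddings are scored by dot product, and each question is trained to pick its own passage out of the batch, with the other passages acting as negatives. This is not code from this PR; the function names `dot` and `in_batch_nll` are hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_nll(question_vecs, passage_vecs):
    """Mean negative log-likelihood of each question's own passage.

    passage_vecs[i] is the positive passage for question_vecs[i];
    every other passage in the batch is treated as a negative.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # -log softmax(scores)[i]
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

# Toy check: aligned pairs should give a lower loss than mismatched pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_nll(questions, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_nll(questions, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real training loop the vectors would come from the retriever encoder and the loss would be computed with a tensor library so that gradients can flow back into it.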
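RAG has roughly three steps. First, a retriever finds the top-k passages relevant to the user's question in a large document collection. Next, a reranker orders those k passages more precisely. Finally, the selected material is fed to the language model in some form (for example, concatenating the rank-1 document with the user's question to form a new prompt). This PR covers the first and third steps; for the second, the reranker is jina-reranker-v2-base-multilingual.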
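The three steps can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: a bag-of-words cosine score plays the retriever, a token-overlap score plays the reranker, and the prompt template and all function names (`retrieve`, `rerank`, `build_prompt`, `rag_prompt`) are hypothetical rather than taken from this PR.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real retriever would use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=2):
    """Step 1: coarse top-k retrieval by embedding similarity."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rerank(question, candidates):
    """Step 2: finer ordering of the k candidates (toy token-overlap scorer)."""
    q_tokens = set(question.lower().split())
    def score(doc):
        return len(q_tokens & set(doc.lower().split())) / max(len(q_tokens), 1)
    return sorted(candidates, key=score, reverse=True)

def build_prompt(question, doc):
    """Step 3: concatenate the rank-1 document with the question (hypothetical template)."""
    return f"Reference: {doc}\nQuestion: {question}"

def rag_prompt(question, docs, k=2):
    """Run retrieve -> rerank -> prompt construction."""
    best = rerank(question, retrieve(question, docs, k))[0]
    return build_prompt(question, best)

docs = [
    "the capital of france is paris",
    "minimind is a small language model",
    "bananas are yellow",
]
prompt = rag_prompt("what is the capital of france", docs)
print(prompt)
```

The resulting string would then be passed to the language model in place of the bare question.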
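The final quality certainly cannot match models such as DeepSeek or Qwen. Because of limits on parameter count and data scale, Minimind2 often fails to extract the answer even when the correct material is handed to it directly, which is entirely understandable. The aim of this PR is to implement a classic algorithm by hand and use it to give a self-trained model (fledgling) wings, which fits the Minimind philosophy well.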
Key features
Results
[A] paraphrase-multilingual-MiniLM-L12-v2 (retriever) + Minimind2-full-sft-768
[B] paraphrase-multilingual-MiniLM-L12-v2 (retriever) + Minimind2-rag-768 (fine-tuned from Minimind2-full-sft-768; more consistent output format)
[C] Minimind2-Small-Pretrain (retriever) + Minimind2-full-sft-768
[D] Minimind2-Small-DPR (retriever) + Minimind2-full-sft-768
[E] without RAG + Minimind2-full-sft-768
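The Q&A examples above share one trait: both the questions and the documents are short and highly related. When the question is more indirect and nothing in the documents is strongly tied to the answer, Minimind performs poorly. It seems to restate sentences from the material and guess the answer rather than truly understand the material. This makes another capability hard to achieve (possibly also because of my own limitations): recognizing that the material is irrelevant to the question. Restating is easy, whereas recognizing irrelevance requires understanding the relationship between the two.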
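Usage
To minimize changes to the original repository, I put training and evaluation in a single file, which results in a fairly large number of arguments. I recommend running it from a notebook, from IPython, or by importing this file from another .py file. Some examples follow: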