Hi, thanks for the great work and the released code!
While reproducing the ChainRAG experiments, I came across a design choice that I do not fully understand, and I hope you can clarify it.
When examining the retrieval module in ChainRAG, I noticed that:
Each question is associated with its own small set of documents (usually ~10 passages or paragraphs).
Retrieval is performed only within this per-question mini-corpus, instead of performing retrieval from a global shared corpus covering all the documents for all questions.
This means that for question i, the retriever only searches within the documents that belong to question i, rather than searching over the full dataset.
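To make sure I am describing the two settings correctly, here is a minimal sketch of the difference as I understand it. This is not ChainRAG's actual code: the toy token-overlap scorer, the `retrieve` helper, and the example data are all hypothetical and only illustrate the search space in each setting.

```python
from collections import Counter

def score(query: str, passage: str) -> int:
    """Toy lexical score: count of tokens shared by query and passage."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k highest-scoring passages from the given corpus."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

# Hypothetical data: every question ships with its own candidate passages.
examples = [
    {"question": "where was marie curie born",
     "passages": ["marie curie was born in warsaw",
                  "warsaw is the capital of poland"]},
    {"question": "where was alan turing born",
     "passages": ["alan turing was born in london",
                  "london is the capital of england"]},
]

# Setting A (what I observed): question i searches only its own passages.
per_question = [retrieve(ex["question"], ex["passages"]) for ex in examples]

# Setting B (what I expected): all questions search one shared corpus.
global_corpus = [p for ex in examples for p in ex["passages"]]
global_results = [retrieve(ex["question"], global_corpus) for ex in examples]
```

In setting A each query ranks only its own 2 passages, while in setting B each query ranks all 4, so passages belonging to other questions can compete as distractors.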
🤔 My Question / Confusion
I would like to understand the motivation for this design choice.
Specifically:
Why does ChainRAG restrict retrieval to “the documents related to that single question” instead of using a unified global corpus for all questions?
Some possible reasons I considered (but I may be mistaken):
Was this done to match the setting of the original dataset?
To avoid cross-question document contamination?
To reduce retrieval noise?
For computational efficiency?
Or is this simply a simplified experimental setting for fair comparison with baselines?
Right now, this setup feels closer to oracle retrieval, since the candidate documents for each question are already known to the system in advance.