
About document retrieval scope in ChainRAG: Why does each question use a separate small corpus instead of a global corpus? #1

@How-Young-X


Hi, thanks for the great work and the released code!
While reproducing the ChainRAG experiments, I encountered a design choice that I am not fully sure I understand, and I hope you can clarify it.

When examining the retrieval module in ChainRAG, I noticed that:

Each question is associated with its own small set of documents (usually ~10 passages or paragraphs).

Retrieval is performed only within this per-question mini-corpus, instead of performing retrieval from a global shared corpus covering all the documents for all questions.

This means that for question i, the retriever only searches within the documents that belong to question i, rather than searching over the full dataset.
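To make the distinction concrete, here is a minimal sketch of the two settings as I understand them. It is not ChainRAG's actual code: the `dataset` layout, the token-overlap `score` function, and all function names are hypothetical stand-ins chosen only for illustration.

```python
from collections import Counter

def score(query: str, passage: str) -> int:
    """Toy relevance score: count of shared lowercase tokens.
    This is an assumed stand-in for whatever retriever is really used."""
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the top-k passages from `corpus` ranked by the toy score."""
    return sorted(corpus, key=lambda passage: score(query, passage), reverse=True)[:k]

# Hypothetical data: each question ships with its own candidate passages.
dataset = [
    {"question": "where was marie curie born",
     "passages": ["marie curie was born in warsaw",
                  "pierre curie was a french physicist"]},
    {"question": "where was alan turing born",
     "passages": ["alan turing was born in london",
                  "turing worked at bletchley park"]},
]

def per_question_retrieve(example: dict, k: int = 1) -> list[str]:
    """Setting described above: search only the passages attached to this question."""
    return retrieve(example["question"], example["passages"], k)

def build_global_corpus(examples: list[dict]) -> list[str]:
    """Alternative setting: pool every question's passages, dropping duplicates."""
    pooled = {}
    for example in examples:
        for passage in example["passages"]:
            pooled.setdefault(passage, None)  # dict keeps first-seen order
    return list(pooled)

def global_retrieve(example: dict, global_corpus: list[str], k: int = 1) -> list[str]:
    """Alternative setting: search the pooled corpus shared by all questions."""
    return retrieve(example["question"], global_corpus, k)
```

In the per-question setting the candidate pool for question i is fixed in advance at a handful of passages, while in the global setting every question competes against passages that were attached to other questions.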

🤔 My Question / Confusion

I would like to understand the motivation for this design choice.

Specifically:

Why does ChainRAG restrict retrieval to “the documents related to that single question” instead of using a unified global corpus for all questions?

Some possible reasons I considered (but I may be mistaken):

Was this done to match the setting of the original dataset?

To avoid cross-question document contamination?

To reduce retrieval noise?

For computational efficiency?

Or is this simply a simplified experimental setting for fair comparison with baselines?

Right now, it feels more like oracle retrieval, since the system already knows the candidate documents per question in advance.
