dataset: Add MMDocIR #4230

Open

whybe-choi wants to merge 1 commit into embeddings-benchmark:main from whybe-choi:dataset/mmdocir

Conversation

@whybe-choi (Contributor) commented Mar 12, 2026

Close #3209

@whybe-choi (Contributor, Author)

I found that the actual statistics of the dataset differ slightly from those mentioned in the paper. Although the total count matches, the distribution across certain domains is inconsistent:

| Domain | #Doc (Ours) | #Doc (Paper) | #QA (Ours) | #QA (Paper) | Diff (#QA) |
|---|---|---|---|---|---|
| Research report / Introduction | 34 | 34 | 194 | 200 | -6 |
| Administration / Industry file | 10 | 10 | 56 | 59 | -3 |
| Tutorial / Workshop | 17 | 17 | 104 | 102 | +2 |
| Academic paper | 75 | 75 | 389 | 386 | +3 |
| Brochure | 15 | 15 | 76 | 76 | 0 |
| Financial report | 51 | 51 | 344 | 343 | +1 |
| Guidebook | 22 | 22 | 115 | 112 | 0 |
| Government | 44 | 44 | 111 | 111 | 0 |
| Laws | 44 | 44 | 132 | 132 | 0 |
| News | 1 | 1 | 137 | 137 | 0 |
| **Total** | **313** | **313** | **1658** | **1658** | **0** |
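For anyone re-checking these numbers, here is a minimal sketch of how per-domain counts could be recomputed from a JSONL annotation file. The `domain` field name is an assumption (only `questions`, `page_indices`, and `layout_indices` appear in the excerpts in this thread); adjust to the dataset's actual schema.

```python
import json
from collections import Counter

def domain_stats(path):
    """Count documents and QA pairs per domain in a JSONL file where each
    line is one document. 'domain' is a hypothetical field name."""
    n_docs, n_qas = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            n_docs[item["domain"]] += 1
            n_qas[item["domain"]] += len(item["questions"])
    return n_docs, n_qas
```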

@Samoed added the `new dataset` label on Mar 12, 2026
@whybe-choi (Contributor, Author)

Here are the evaluation results for vidore/colpali-v1.1, measured based on the information from MMDocIR/checkpoint:

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 78.7 | 84.6 | -5.9 |
| Admin & Industry | 66.4 | 79.3 | -12.9 |
| Tutorial/Workshop | 74.9 | 82.3 | -7.4 |
| Academic paper | 66.1 | 89.0 | -22.9 |
| Brochure | 66.4 | 79.8 | -13.4 |
| Financial report | 53.2 | 72.1 | -18.9 |
| Guidebook | 71.8 | 86.7 | -14.9 |
| Government | 64.2 | 84.9 | -20.7 |
| Laws | 73.5 | 92.4 | -18.9 |
| News | 60.6 | 56.9 | +3.7 |
| Average (Macro) | 67.6 | 80.8 | -13.2 |
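For context, R@5 here is standard recall-at-k and the bottom row is an unweighted mean over the ten domains; a minimal reference implementation for sanity-checking (a sketch, not MTEB's code):

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    top_k = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top_k) / len(relevant_ids)

def macro_average(per_domain_scores):
    """Unweighted mean over domains, as in the 'Average (Macro)' row."""
    return sum(per_domain_scores) / len(per_domain_scores)
```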

@Samoed (Member) commented Mar 12, 2026

From their repo it seems that they fine-tuned models, but maybe I'm wrong: https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1

@Samoed (Member) commented Mar 12, 2026

I think you can try running a text-only model to see whether the scores match.

@whybe-choi (Contributor, Author)

> From their repo it seems that they fine-tuned models, but maybe I'm wrong https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1

It seems they only fine-tuned a specific model and used the rest as-is.


@whybe-choi (Contributor, Author)

Here are the evaluation results for BAAI/bge-large-en-v1.5 on vlm_text:

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 69.5 | 79.5 | -10.0 |
| Admin & Industry | 48.9 | 65.8 | -16.9 |
| Tutorial/Workshop | 64.2 | 71.3 | -7.1 |
| Academic paper | 48.3 | 76.8 | -28.5 |
| Brochure | 57.6 | 62.4 | -4.8 |
| Financial report | 34.3 | 56.0 | -21.7 |
| Guidebook | 63.7 | 77.2 | -13.5 |
| Government | 55.2 | 77.4 | -22.2 |
| Laws | 64.4 | 79.5 | -15.1 |
| News | 38.7 | 38.0 | +0.7 |
| Average (Macro) | 54.5 | 68.4 | -13.9 |

@Samoed (Member) commented Mar 12, 2026

Very strange. From the paper it seems that they tuned only DPR-Phi3 and Col-Phi3, so it's odd that we can't reproduce any of the scores. We probably need to try running the evaluation from their repo, but that seems like it would require some work. Maybe @daviddongkc can help (though I don't think he would answer).

@Samoed added the `image` label on Mar 12, 2026
@whybe-choi (Contributor, Author)

Is the data loading okay, or am I missing something?

@whybe-choi marked this pull request as ready for review on March 12, 2026, 13:12
@Samoed (Member) commented Mar 12, 2026

I don't see any problems in the data loading.

@Samoed (Member) commented Mar 12, 2026

They can use any mode for evaluation ('vlm_text', 'ocr_text', 'image_binary', 'image_hybrid'), so it's hard to say what the source of the problem is.

@whybe-choi (Contributor, Author)

Since the dataset I worked on is for page retrieval, I think it's correct to use image_binary as the retrieval target. For text embeddings, I've already processed vlm_text (choosing it over ocr_text). It seems I've followed the paper's configuration for the modes, so I'm also not sure what's going wrong here 🫠


@whybe-choi (Contributor, Author)

I found that the evaluation performs per-document retrieval: each query only searches within the pages of its own document.

encode.py stores per-document page ranges in query_indices (simplified excerpt; query_indices and q_count are module-level state in their code):

```python
import json

query_indices = []  # (query_id, start_pid, end_pid, start_lid, end_lid)
q_count = 0

def get_queries(file_in):
    global q_count
    for line in open(file_in, 'r', encoding="utf-8"):
        item = json.loads(line.strip())
        doc_page = item["page_indices"]      # (start_pid, end_pid) for this document
        doc_layout = item["layout_indices"]  # (start_lid, end_lid) for this document
        for qa in item["questions"]:
            query_indices.append((q_count, *doc_page, *doc_layout))
            q_count += 1
```

search.py slices only the target document's pages at search time:

```python
for (query_id, start_pid, end_pid, start_lid, end_lid) in query_indices:
    query_vec = encoded_query[query_id]
    page_vecs = encoded_page[start_pid:end_pid + 1]  # only this document's pages
    scores_page = batch_dot_product(query_vec, page_vecs)
```
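A toy illustration (synthetic scores, purely hypothetical) of why this matters: a gold page that wins within its own document's page range can still fall outside the global top-5, so corpus-wide retrieval under-reports relative to the paper's per-document protocol.

```python
def top_k(scores, k=5):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Synthetic similarity scores for one query over a 10-page corpus;
# pages 3-5 belong to the query's own document and page 4 is the gold page.
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.85, 0.75, 0.65, 0.5]
gold_pid = 4

global_hit = gold_pid in top_k(scores, k=5)                      # search all pages: miss
doc_hit = max(range(3, 6), key=lambda i: scores[i]) == gold_pid  # search own document: hit
```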

@whybe-choi (Contributor, Author)

How should I handle this?

@Samoed (Member) commented Mar 14, 2026

I think you can change this task to reranking (adding top_ranked), and that should solve the problem.
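What adding top_ranked amounts to is roughly this: build a per-query candidate list from the page ranges that get_queries() already collects. The tuple and output shapes below are assumptions for illustration, not the exact MTEB format.

```python
def build_top_ranked(query_indices, page_ids):
    """Map each query id to the corpus ids of its own document's pages.

    query_indices: (query_id, start_pid, end_pid, ...) tuples, as in the
    encode.py excerpt above; page_ids: corpus ids indexed by page position.
    """
    return {
        str(query_id): [page_ids[pid] for pid in range(start_pid, end_pid + 1)]
        for query_id, start_pid, end_pid, *_ in query_indices
    }
```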

@whybe-choi (Contributor, Author)

Here are the evaluation results for BAAI/bge-large-en-v1.5 on vlm_text and vidore/colpali-v1.1 after adding top_ranked for per-document retrieval:

**BAAI/bge-large-en-v1.5**

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 79.5 | 79.5 | 0.0 |
| Admin & Industry | 67.1 | 65.8 | +1.3 |
| Tutorial/Workshop | 73.1 | 71.3 | +1.8 |
| Academic paper | 76.2 | 76.8 | -0.6 |
| Brochure | 61.1 | 62.4 | -1.3 |
| Financial report | 57.8 | 56.0 | +1.8 |
| Guidebook | 77.6 | 77.2 | +0.4 |
| Government | 76.2 | 77.4 | -1.2 |
| Laws | 81.1 | 79.5 | +1.6 |
| News | 38.7 | 38.0 | +0.7 |
| Average (Macro) | 68.8 | 68.4 | +0.4 |

**vidore/colpali-v1.1**

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 85.4 | 84.6 | +0.8 |
| Admin & Industry | 79.6 | 79.3 | +0.3 |
| Tutorial/Workshop | 81.2 | 82.3 | -1.1 |
| Academic paper | 87.2 | 89.0 | -1.8 |
| Brochure | 78.4 | 79.8 | -1.4 |
| Financial report | 71.6 | 72.1 | -0.5 |
| Guidebook | 81.4 | 86.7 | -5.3 |
| Government | 84.0 | 84.9 | -0.9 |
| Laws | 87.9 | 92.4 | -4.5 |
| News | 60.6 | 56.9 | +3.7 |
| Average (Macro) | 79.7 | 80.8 | -1.1 |

@Samoed (Member) commented Mar 14, 2026

Great! I think this is good enough. For the future, it's better to commit directly rather than force-pushing.

@whybe-choi (Contributor, Author)

Understood, I will make sure to commit directly!

@Samoed (Member) commented Mar 14, 2026

I reuploaded this dataset to https://huggingface.co/datasets/mteb/MMDocIRT2ITRetrieval
