dataset: Add MMDocIR #4230

Open

whybe-choi wants to merge 1 commit into embeddings-benchmark:main from whybe-choi:dataset/mmdocir

Conversation

@whybe-choi (Contributor) commented Mar 12, 2026

Close #3209

@whybe-choi (Contributor, Author)

I found that the actual statistics of the dataset differ slightly from those mentioned in the paper. Although the total count matches, the distribution across certain domains is inconsistent:

| Domain | #Doc (Ours) | #Doc (Paper) | #QA (Ours) | #QA (Paper) | Diff (#QA) |
|---|---|---|---|---|---|
| Research report / Introduction | 34 | 34 | 194 | 200 | -6 |
| Administration / Industry file | 10 | 10 | 56 | 59 | -3 |
| Tutorial / Workshop | 17 | 17 | 104 | 102 | +2 |
| Academic paper | 75 | 75 | 389 | 386 | +3 |
| Brochure | 15 | 15 | 76 | 76 | 0 |
| Financial report | 51 | 51 | 344 | 343 | +1 |
| Guidebook | 22 | 22 | 115 | 112 | 0 |
| Government | 44 | 44 | 111 | 111 | 0 |
| Laws | 44 | 44 | 132 | 132 | 0 |
| News | 1 | 1 | 137 | 137 | 0 |
| **Total** | **313** | **313** | **1658** | **1658** | **0** |
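For anyone re-checking these numbers, here is a minimal sketch of how per-domain counts could be recomputed from a JSONL annotation file. The `domain` field name is an assumption (only `questions`, `page_indices`, and `layout_indices` appear in the excerpts in this thread); adjust to the dataset's actual schema.

```python
import json
from collections import Counter

def domain_stats(path):
    """Count documents and QA pairs per domain in a JSONL file where each
    line is one document. 'domain' is a hypothetical field name."""
    n_docs, n_qas = Counter(), Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            n_docs[item["domain"]] += 1
            n_qas[item["domain"]] += len(item["questions"])
    return n_docs, n_qas
```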

@Samoed added the `new dataset` label on Mar 12, 2026
@whybe-choi (Contributor, Author)

Here are the evaluation results for vidore/colpali-v1.1, measured based on the information from MMDocIR/checkpoint:

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 78.7 | 84.6 | -5.9 |
| Admin & Industry | 66.4 | 79.3 | -12.9 |
| Tutorial/Workshop | 74.9 | 82.3 | -7.4 |
| Academic paper | 66.1 | 89.0 | -22.9 |
| Brochure | 66.4 | 79.8 | -13.4 |
| Financial report | 53.2 | 72.1 | -18.9 |
| Guidebook | 71.8 | 86.7 | -14.9 |
| Government | 64.2 | 84.9 | -20.7 |
| Laws | 73.5 | 92.4 | -18.9 |
| News | 60.6 | 56.9 | +3.7 |
| Average (Macro) | 67.6 | 80.8 | -13.2 |
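For context, R@5 here is standard recall-at-k and the bottom row is an unweighted mean over the ten domains; a minimal reference implementation for sanity-checking (a sketch, not MTEB's code):

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    top_k = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top_k) / len(relevant_ids)

def macro_average(per_domain_scores):
    """Unweighted mean over domains, as in the 'Average (Macro)' row."""
    return sum(per_domain_scores) / len(per_domain_scores)
```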

@Samoed (Member) commented Mar 12, 2026

From their repo it seems that they fine-tuned models, but maybe I'm wrong: https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1

@Samoed (Member) commented Mar 12, 2026

I think you can try running a text-only model to see whether the scores match.

@whybe-choi (Contributor, Author)

> From their repo it seems that they fine-tuned models, but maybe I'm wrong https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1

It seems they only fine-tuned a specific model and used the rest as-is.


@whybe-choi (Contributor, Author)

Here are the evaluation results for BAAI/bge-large-en-v1.5 on vlm_text:

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 69.5 | 79.5 | -10.0 |
| Admin & Industry | 48.9 | 65.8 | -16.9 |
| Tutorial/Workshop | 64.2 | 71.3 | -7.1 |
| Academic paper | 48.3 | 76.8 | -28.5 |
| Brochure | 57.6 | 62.4 | -4.8 |
| Financial report | 34.3 | 56.0 | -21.7 |
| Guidebook | 63.7 | 77.2 | -13.5 |
| Government | 55.2 | 77.4 | -22.2 |
| Laws | 64.4 | 79.5 | -15.1 |
| News | 38.7 | 38.0 | +0.7 |
| Average (Macro) | 54.5 | 68.4 | -13.9 |

@Samoed (Member) commented Mar 12, 2026

Very strange. From the paper it seems that they tuned only DPR-Phi3 and Col-Phi3, so it's odd that we can't reproduce any of the scores. We probably need to try running the evaluation from their repo, but that seems like it would require some work. Maybe @daviddongkc can help (though I don't think he would answer).

@Samoed added the `image` label on Mar 12, 2026
@whybe-choi (Contributor, Author)

Is the data loading okay, or am I missing something?

@whybe-choi marked this pull request as ready for review on March 12, 2026, 13:12
@Samoed (Member) commented Mar 12, 2026

I don't see any problems in the data loading.

@Samoed (Member) commented Mar 12, 2026

They can use any mode for evaluation ('vlm_text', 'ocr_text', 'image_binary', 'image_hybrid'), so it's hard to say what the source of the problem is.

@whybe-choi (Contributor, Author)

Since the dataset I worked on is for page retrieval, I think it's correct to use image_binary as the retrieval target. For text embeddings, I've already processed vlm_text (choosing it over ocr_text). It seems I've followed the paper's configuration for the modes, so I'm also not sure what's going wrong here 🫠


@whybe-choi (Contributor, Author)

I found that the evaluation performs per-document retrieval: each query only searches within the pages of its own document.

encode.py stores per-document page ranges in query_indices (simplified excerpt; query_indices and q_count are module-level state in their code):

```python
import json

query_indices = []  # (query_id, start_pid, end_pid, start_lid, end_lid)
q_count = 0

def get_queries(file_in):
    global q_count
    for line in open(file_in, 'r', encoding="utf-8"):
        item = json.loads(line.strip())
        doc_page = item["page_indices"]      # (start_pid, end_pid) for this document
        doc_layout = item["layout_indices"]  # (start_lid, end_lid) for this document
        for qa in item["questions"]:
            query_indices.append((q_count, *doc_page, *doc_layout))
            q_count += 1
```

search.py slices only the target document's pages at search time:

```python
for (query_id, start_pid, end_pid, start_lid, end_lid) in query_indices:
    query_vec = encoded_query[query_id]
    page_vecs = encoded_page[start_pid:end_pid + 1]  # only this document's pages
    scores_page = batch_dot_product(query_vec, page_vecs)
```
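A toy illustration (synthetic scores, purely hypothetical) of why this matters: a gold page that wins within its own document's page range can still fall outside the global top-5, so corpus-wide retrieval under-reports relative to the paper's per-document protocol.

```python
def top_k(scores, k=5):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Synthetic similarity scores for one query over a 10-page corpus;
# pages 3-5 belong to the query's own document and page 4 is the gold page.
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.85, 0.75, 0.65, 0.5]
gold_pid = 4

global_hit = gold_pid in top_k(scores, k=5)                      # search all pages: miss
doc_hit = max(range(3, 6), key=lambda i: scores[i]) == gold_pid  # search own document: hit
```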

@whybe-choi (Contributor, Author)

How should I handle this?

@Samoed (Member) commented Mar 14, 2026

I think you can change this task to reranking (adding top_ranked), and that should solve the problem.
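What adding top_ranked amounts to is roughly this: build a per-query candidate list from the page ranges that get_queries() already collects. The tuple and output shapes below are assumptions for illustration, not the exact MTEB format.

```python
def build_top_ranked(query_indices, page_ids):
    """Map each query id to the corpus ids of its own document's pages.

    query_indices: (query_id, start_pid, end_pid, ...) tuples, as in the
    encode.py excerpt above; page_ids: corpus ids indexed by page position.
    """
    return {
        str(query_id): [page_ids[pid] for pid in range(start_pid, end_pid + 1)]
        for query_id, start_pid, end_pid, *_ in query_indices
    }
```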

@whybe-choi (Contributor, Author)

Here are the evaluation results for BAAI/bge-large-en-v1.5 on vlm_text and vidore/colpali-v1.1 after adding top_ranked for per-document retrieval:

**BAAI/bge-large-en-v1.5**

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 79.5 | 79.5 | 0.0 |
| Admin & Industry | 67.1 | 65.8 | +1.3 |
| Tutorial/Workshop | 73.1 | 71.3 | +1.8 |
| Academic paper | 76.2 | 76.8 | -0.6 |
| Brochure | 61.1 | 62.4 | -1.3 |
| Financial report | 57.8 | 56.0 | +1.8 |
| Guidebook | 77.6 | 77.2 | +0.4 |
| Government | 76.2 | 77.4 | -1.2 |
| Laws | 81.1 | 79.5 | +1.6 |
| News | 38.7 | 38.0 | +0.7 |
| Average (Macro) | 68.8 | 68.4 | +0.4 |

**vidore/colpali-v1.1**

| Domain | R@5 (Ours) | R@5 (Paper) | Diff |
|---|---|---|---|
| Research report | 85.4 | 84.6 | +0.8 |
| Admin & Industry | 79.6 | 79.3 | +0.3 |
| Tutorial/Workshop | 81.2 | 82.3 | -1.1 |
| Academic paper | 87.2 | 89.0 | -1.8 |
| Brochure | 78.4 | 79.8 | -1.4 |
| Financial report | 71.6 | 72.1 | -0.5 |
| Guidebook | 81.4 | 86.7 | -5.3 |
| Government | 84.0 | 84.9 | -0.9 |
| Laws | 87.9 | 92.4 | -4.5 |
| News | 60.6 | 56.9 | +3.7 |
| Average (Macro) | 79.7 | 80.8 | -1.1 |

@Samoed (Member) commented Mar 14, 2026

Great! I think this is good enough. For the future, it's better to commit directly rather than force-pushing.

@whybe-choi (Contributor, Author)

Understood, I will make sure to commit directly!

@Samoed (Member) commented Mar 14, 2026

I reuploaded this dataset to https://huggingface.co/datasets/mteb/MMDocIRT2ITRetrieval
