I found that the actual statistics of the dataset differ slightly from those mentioned in the paper. Although the total count matches, the distribution across certain domains is inconsistent:
Here are the evaluation results for
From their repo it seems that they fine-tune models, but maybe I'm wrong: https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1
I think you can try to run a text-only model to see whether the scores match or not.

It seems like they only fine-tuned a specific model and used the rest as they are.
Here are the evaluation results for
Very strange. From the paper it seems that they tune only DPR-Phi3 and Col-Phi3, but it's strange that we can't reproduce any scores. We probably need to try running the code from their repo, but that would require some work. Maybe @daviddongkc can help (though I don't think he would answer).
Is the data loading okay, or am I missing something?

I don't see any problems in the data loading.

They could have used any mode for evaluation, so it's hard to say what the source of the problem is.
I found that the evaluation performs per-document retrieval, where each query only searches within the pages of its own document. `encode.py` stores per-document page ranges:

```python
def get_queries(file_in):
    for line in open(file_in, 'r', encoding="utf-8"):
        item = json.loads(line.strip())
        doc_page = item["page_indices"]      # (start_pid, end_pid) for this document
        doc_layout = item["layout_indices"]
        for qa in item["questions"]:
            query_indices.append((q_count, *doc_page, *doc_layout))
```

`search.py` slices only the target document's pages at search time:

```python
for (query_id, start_pid, end_pid, start_lid, end_lid) in query_indices:
    query_vec = encoded_query[query_id]
    page_vecs = encoded_page[start_pid:end_pid + 1]  # only this document's pages
    scores_page = batch_dot_product(query_vec, page_vecs)
```
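As a minimal runnable sketch of that per-document scoring (the array shapes, page ranges, and random embeddings here are toy assumptions, not the repo's actual data):

```python
import numpy as np

def batch_dot_product(query_vec, page_vecs):
    # Dot product of one query embedding against a stack of page embeddings.
    return page_vecs @ query_vec

rng = np.random.default_rng(0)
encoded_page = rng.normal(size=(5, 4))  # toy corpus: 5 page embeddings, dim 4
query_vec = rng.normal(size=4)

# Per-document retrieval: only pages 3..4 (the query's own document)
# are candidates, not the full 5-page corpus.
start_pid, end_pid = 3, 4
scores = batch_dot_product(query_vec, encoded_page[start_pid:end_pid + 1])
best_page = start_pid + int(np.argmax(scores))  # map back to a global page id
```

The point is that `argmax` can only ever land inside the owning document's page range, so scores from this setup are not comparable to full-corpus retrieval over all pages.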
How should I handle this?
I think you can change this task to reranking (adding
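A reranking reformulation would give each query its own candidate pool (the pages of its document) instead of one shared corpus, matching how the original evaluation actually behaves. A rough sketch of that data shape (field names and the toy scorer are illustrative, not MTEB's actual reranking schema):

```python
# Hypothetical reranking sample: the candidate pool is restricted to the
# pages of the query's own document, mirroring the per-document search above.
sample = {
    "query": "What does Figure 3 show?",
    "positive": ["page_12"],                                      # gold page(s)
    "candidates": ["page_10", "page_11", "page_12", "page_13"],   # this document's pages only
}

def rerank(sample, score_fn):
    # Rank only the query's own candidates, never the full corpus.
    return sorted(sample["candidates"],
                  key=lambda cand: score_fn(sample["query"], cand),
                  reverse=True)

# Toy scorer that prefers the known positive, just to exercise the ranking.
ranking = rerank(sample, lambda q, c: 1.0 if c in sample["positive"] else 0.0)
```

With this framing, metrics are computed per query over its own candidate list, so the reproduced numbers become comparable to the paper's per-document setup.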
Here are the evaluation results for
Great! I think this is good enough. In the future, it's better to commit directly without force-pushing.
Understood, I will make sure to commit directly!
I re-uploaded this dataset to https://huggingface.co/datasets/mteb/MMDocIRT2ITRetrieval


Close #3209