🌐 Homepage | 🤗 Dataset | 📖 Paper
- 🔥[2025-03-25]: Dataset available on HuggingFace. Paper available on arXiv.
Download the `tsv` files from HuggingFace and store them in `data/tsv/`. The files should follow the naming pattern `data/tsv/{DATASET}_{SETTING}_{LANGUAGE}.tsv`.
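For example, the MDUR dataset under the `traditional` setting in English would be stored as `data/tsv/MDUR_traditional_EN.tsv`.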
```bash
conda create -n pm4bench python=3.10.5
conda activate pm4bench
pip install -r requirements.txt
```
API inference requires an `API_KEY`. Please configure the `API_KEY` in the `.env` file in the following format:
```
gpt-4o-2024-11-20='xxx'
step-1o-vision-32k='xxx'
qwen2.5-vl-72b-instruct='xxx'
gemini-2.0-flash-thinking-exp='xxx'
DeepSeek-R1='xxx'
gpt-4o-mini='xxx'
```
The `API_KEY` will be loaded through the `infer_api.py` file using:
```python
import os
from dotenv import load_dotenv

load_dotenv()               # load the .env file to get API_KEY
API_KEY = os.getenv(MODEL)  # MODEL is the model name passed on the command line
```
🔴 Attention: All code and script files are executed from the root directory!

e.g.

```bash
python code/infer_api.py [MODEL] [MODE] [SETTING] [LANGUAGE] [TASK] [DATASET] [MAX_TOKENS]
```
- `MODEL`: Official model name, such as `gpt-4o-2024-11-20`, `qwen2.5-vl-72b-instruct`, etc.
- `MODE`: For normal VLMs, use `direct`; for reasoning VLMs, use `cot`.
- `SETTING`: `traditional` or `vision`; for detailed explanations, please refer to our paper.
- `LANGUAGE`: 10 language choices: `[ZH, EN, AR, SR, TH, RU, KO, CS, HU, VI]`.
- `TASK`: `OCR` for OCR tasks and `VQA` for VQA tasks under the `traditional` or `vision` settings.
- `DATASET`: `[MDUR, MIQA, MMJB, MSOCR]`.
- `MAX_TOKENS`: Should be set per model to avoid responses being cut off.
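For instance, a single API inference run could look like the following; the `MAX_TOKENS` value of `4096` is an illustrative assumption, not a recommended setting:

```bash
# Illustrative example: gpt-4o-2024-11-20, direct mode, traditional setting,
# English, VQA task, MDUR dataset; 4096 max tokens is an assumed value.
python code/infer_api.py gpt-4o-2024-11-20 direct traditional EN VQA MDUR 4096
```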
Besides, we provide a standard script template `scripts/infer_api.sh`. You can modify parameters directly and run it using:

```bash
nohup bash scripts/infer_api.sh > logs/infer_api.log 2>&1 &
```
Step 0. Use LMDeploy to serve models
A special thanks to LMDeploy, whose work has greatly assisted the local inference in this project. Please refer to the LMDeploy docs for detailed information on deploying and serving VLMs. Before inference, you should make sure the VLM is being served and that you have a local port (like `23333`) to call it:
```bash
CUDA_VISIBLE_DEVICES=$CUDA_DEVICES nohup lmdeploy serve api_server $MODEL_PATH \
    --backend turbomind --dtype $DTYPE --server-port $SERVER_PORT --tp $TP > $LOG_PATH 2>&1 &
```
We only provide a simplified command line here; if you want to know more parameters and their meanings, please run:

```bash
lmdeploy serve api_server --help
```
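As an illustration, a filled-in launch command might look like the following; the model path, GPU list, dtype, tensor-parallel degree, and log path below are assumptions for the example, not values prescribed by this repo:

```bash
# Illustrative values only: adjust GPUs, model path, dtype, port, and TP to your setup.
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-MPO \
    --backend turbomind --dtype bfloat16 --server-port 23333 --tp 4 > logs/lmdeploy.log 2>&1 &
```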
🔴 Attention: All code and script files are executed from the root directory!

e.g.

```bash
python code/infer_lmdeploy.py [MODEL] [MODE] [SETTING] [LANGUAGE] [TASK] [DATASET] [MAX_TOKENS] [PORT]
```
- `MODEL`: Model name, such as `InternVL2_5-78B-MPO`, `qwen2.5-vl-72b-instruct`, etc.
- `MODE`: For normal VLMs, use `direct`; for reasoning VLMs, use `cot`.
- `SETTING`: `traditional` or `vision`; for detailed explanations, please refer to our paper.
- `LANGUAGE`: 10 language choices: `[ZH, EN, AR, SR, TH, RU, KO, CS, HU, VI]`.
- `TASK`: `OCR` for OCR tasks and `VQA` for VQA tasks under the `traditional` or `vision` settings.
- `DATASET`: `[MDUR, MIQA, MMJB, MSOCR]`.
- `MAX_TOKENS`: Should be set per model to avoid responses being cut off.
- `PORT`: Local port (like `23333`) on which the LMDeploy server is listening.
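For instance, a local inference run could look like the following; the `MAX_TOKENS` value of `4096` is an illustrative assumption, and the port must match the one used when serving the model:

```bash
# Illustrative example: InternVL2_5-78B-MPO, direct mode, vision setting,
# Chinese, OCR task, MSOCR dataset, assumed 4096 max tokens, LMDeploy port 23333.
python code/infer_lmdeploy.py InternVL2_5-78B-MPO direct vision ZH OCR MSOCR 4096 23333
```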
Besides, we provide a standard script template `scripts/infer_lmdeploy.sh`. You can modify parameters directly and run it using:

```bash
nohup bash scripts/infer_lmdeploy.sh > logs/infer_lmdeploy.log 2>&1 &
```
We use `gpt-4o-2024-11-20` to judge VQA performance, so you should configure the `API_KEY` before evaluation. Besides, you can change the judge model and API base in `code/eval/{DATASET}/eval_{DATASET}_vqa.py`:
```python
import os
from openai import OpenAI

OPENAI_API_BASE = "https://api.openai.com/v1"
client = OpenAI(
    api_key=os.getenv('gpt-4o-2024-11-20'),
    base_url=OPENAI_API_BASE
)
```
The evaluation code is executed by:

```bash
python code/eval/{DATASET}/eval_{DATASET}_{TASK}.py
```

where `DATASET` is chosen from `[MDUR, MIQA, MMJB, MSOCR]` and `TASK` is chosen from `[VQA, OCR]`.
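For example, to evaluate VQA results on the MDUR dataset (assuming the lowercase task name in the file name, as in the judge-model path shown above):

```bash
# Illustrative example: evaluate VQA results for the MDUR dataset.
python code/eval/MDUR/eval_MDUR_vqa.py
```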
The statistics code is executed by:

```bash
python code/score.py
```

and the results are stored in `data/results/{DATASET}_{TASK}_{SETTING}.csv`.
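For example, assuming the dataset, task, and setting names are substituted as written above, the VQA scores for MDUR under the `traditional` setting would be written to `data/results/MDUR_VQA_traditional.csv`.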
If you find this work helpful, please consider starring 🌟 this repo. Thanks for your support!
```bibtex
@misc{gao2025pm4benchparallelmultilingualmultimodal,
      title={PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model},
      author={Junyuan Gao and Jiahe Song and Jiang Wu and Runchuan Zhu and Guanlin Shen and Shasha Wang and Xingjian Wei and Haote Yang and Songyang Zhang and Weijia Li and Bin Wang and Dahua Lin and Lijun Wu and Conghui He},
      year={2025},
      eprint={2503.18484},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18484},
}
```