
PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model


🌐 Homepage | 🤗 Dataset | 📖 Paper

📢 News


Warning: MMJB data contains potentially offensive and harmful text.

🧑‍💻 How to Run?

🏠 Set Up

Dataset Download

Download the tsv files from HuggingFace and store them in data/tsv/. The files should follow the naming pattern data/tsv/{DATASET}_{SETTING}_{LANGUAGE}.tsv.
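If it helps, here is a minimal sketch of this step in Python. The repo_id below assumes the HuggingFace dataset id matches the GitHub repo name; check the 🤗 Dataset link above for the actual id, and adjust paths if the dataset repo nests the tsv files in a subfolder:

import os
from huggingface_hub import snapshot_download

# Fetch only the tsv files into data/tsv/ (repo_id is an assumption, see above)
snapshot_download(repo_id="opendatalab/PM4Bench", repo_type="dataset",
                  local_dir="data/tsv", allow_patterns=["*.tsv"])

# Check that files follow the {DATASET}_{SETTING}_{LANGUAGE}.tsv pattern
for dataset in ["MDUR", "MIQA", "MMJB", "MSOCR"]:
    for setting in ["traditional", "vision"]:
        for lang in ["ZH", "EN", "AR", "SR", "TH", "RU", "KO", "CS", "HU", "VI"]:
            path = f"data/tsv/{dataset}_{setting}_{lang}.tsv"
            if not os.path.exists(path):
                print(f"missing: {path}")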

Environment Configuration

conda create -n pm4bench python=3.10.5
conda activate pm4bench
pip install -r requirements.txt

⚙️ Inference

API Inference

Step 0. Configure .env file

API inference requires an API_KEY. Please configure the API_KEY in the .env file in the following format:

gpt-4o-2024-11-20='xxx'
step-1o-vision-32k='xxx'
qwen2.5-vl-72b-instruct='xxx'
gemini-2.0-flash-thinking-exp='xxx'
DeepSeek-R1='xxx'
gpt-4o-mini='xxx'

The API_KEY is loaded in infer_api.py using:

import os
from dotenv import load_dotenv
load_dotenv()               # load .env so each model's key becomes an environment variable
API_KEY = os.getenv(MODEL)  # look up the key under the official model name
Step 1. Start Inference!

🔴 Attention: All code and script files are executed from the repository root directory! e.g. python code/infer_api.py [MODEL] [MODE] [SETTING] [LANGUAGE] [TASK] [DATASET] [MAX_TOKENS] (a filled-in example follows the argument list below)

  • MODEL: Official model name, such as gpt-4o-2024-11-20, qwen2.5-vl-72b-instruct, etc.
  • MODE: For normal VLMs, use direct; for reasoning VLMs, use cot.
  • SETTING: traditional or vision; see our paper for detailed explanations.
  • LANGUAGE: one of 10 languages, [ZH, EN, AR, SR, TH, RU, KO, CS, HU, VI]
  • TASK: OCR for OCR tasks, and VQA for VQA tasks under traditional or vision settings.
  • DATASET: [MDUR, MIQA, MMJB, MSOCR]
  • MAX_TOKENS: set a model-appropriate value to avoid responses being cut off; the right value differs across models.
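For instance, a filled-in call using values from the lists above (the MAX_TOKENS value of 4096 is only an illustrative choice, not a recommendation):

python code/infer_api.py gpt-4o-2024-11-20 direct traditional EN VQA MDUR 4096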

We also provide a standard script template, scripts/infer_api.sh. You can modify its parameters directly and run it with

nohup bash scripts/infer_api.sh > logs/infer_api.log 2>&1 &

Local VLMs Inference

Step 0. Use LMDeploy to serve models

Special thanks to LMDeploy, whose work has greatly assisted local inference in this project. Please refer to the LMDeploy docs for detailed information on deploying and serving VLMs. Before inference, make sure the VLM is being served and that you have a local port (e.g. 23333) to call it:

CUDA_VISIBLE_DEVICES=$CUDA_DEVICES nohup lmdeploy serve api_server $MODEL_PATH \
    --backend turbomind --dtype $DTYPE --server-port $SERVER_PORT --tp $TP > $LOG_PATH 2>&1 &

Only a simplified command line is provided here; to see all parameters and their meanings, please run

lmdeploy serve api_server --help
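As a concrete illustration, here is the serve command with the placeholders filled in. The device ids, dtype, port, and tensor-parallel degree are illustrative and must be adapted to your hardware; the model path assumes the HuggingFace id of the InternVL2_5-78B-MPO model named in the argument list below:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-MPO \
    --backend turbomind --dtype bfloat16 --server-port 23333 --tp 4 > logs/lmdeploy_serve.log 2>&1 &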
Step 1. Start Inference!

🔴 Attention: All code and script files are executed from the repository root directory! e.g. python code/infer_lmdeploy.py [MODEL] [MODE] [SETTING] [LANGUAGE] [TASK] [DATASET] [MAX_TOKENS] [PORT] (a filled-in example follows the argument list below)

  • MODEL: Model name, such as InternVL2_5-78B-MPO, qwen2.5-vl-72b-instruct, etc.
  • MODE: For normal VLMs, use direct; for reasoning VLMs, use cot.
  • SETTING: traditional or vision; see our paper for detailed explanations.
  • LANGUAGE: one of 10 languages, [ZH, EN, AR, SR, TH, RU, KO, CS, HU, VI]
  • TASK: OCR for OCR tasks, and VQA for VQA tasks under traditional or vision settings.
  • DATASET: [MDUR, MIQA, MMJB, MSOCR]
  • MAX_TOKENS: set a model-appropriate value to avoid responses being cut off; the right value differs across models.
  • PORT: the local port (e.g. 23333) on which the LMDeploy server is listening.
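For example, a filled-in call against the server started above (the MAX_TOKENS value of 4096 is again only illustrative):

python code/infer_lmdeploy.py InternVL2_5-78B-MPO direct vision EN VQA MDUR 4096 23333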

We also provide a standard script template, scripts/infer_lmdeploy.sh. You can modify its parameters directly and run it with

nohup bash scripts/infer_lmdeploy.sh > logs/infer_lmdeploy.log 2>&1 &

📉 Evaluation & Statistics

Step 0. Evaluation

We use gpt-4o-2024-11-20 as the judge for VQA performance, so you should configure its API_KEY before evaluation. You can also change the judge model and API base in code/eval/{DATASET}/eval_{DATASET}_vqa.py:

import os
from openai import OpenAI

OPENAI_API_BASE = "https://api.openai.com/v1"
client = OpenAI(
    api_key=os.getenv('gpt-4o-2024-11-20'),  # judge key loaded from the .env file
    base_url=OPENAI_API_BASE
)
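For example, to point the judge at a different OpenAI-compatible endpoint, only these two values need to change (the URL and key name below are hypothetical placeholders, not part of this repo):

OPENAI_API_BASE = "https://your-openai-compatible-endpoint/v1"  # hypothetical endpoint
client = OpenAI(
    api_key=os.getenv('your-judge-model'),  # hypothetical key name from your .env
    base_url=OPENAI_API_BASE
)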

The evaluation code is executed by:

python code/eval/{DATASET}/eval_{DATASET}_{TASK}.py

where DATASET is chosen from [MDUR, MIQA, MMJB, MSOCR] and TASK is chosen from [VQA, OCR].
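For example, to evaluate VQA results on MDUR (note that the script filename uses the lowercase task suffix shown in the snippet above):

python code/eval/MDUR/eval_MDUR_vqa.py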

Step 1. Statistics

The statistics code is executed by:

python code/score.py

and the results are stored in data/results/{DATASET}_{TASK}_{SETTING}.csv.
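To inspect a result file programmatically, here is a minimal pandas sketch (the filename instantiates the pattern above with example values; the column layout depends on what score.py writes and is not assumed here):

import pandas as pd

# Example instantiation of data/results/{DATASET}_{TASK}_{SETTING}.csv
df = pd.read_csv("data/results/MDUR_VQA_traditional.csv")
print(df.head())  # columns depend on score.py's output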

Citation

If you find this work helpful, please consider starring 🌟 this repo. Thanks for your support!

@misc{gao2025pm4benchparallelmultilingualmultimodal,
      title={PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model}, 
      author={Junyuan Gao and Jiahe Song and Jiang Wu and Runchuan Zhu and Guanlin Shen and Shasha Wang and Xingjian Wei and Haote Yang and Songyang Zhang and Weijia Li and Bin Wang and Dahua Lin and Lijun Wu and Conghui He},
      year={2025},
      eprint={2503.18484},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18484}, 
}
