WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Code and model for the paper: "WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts" [ACL 2025 - Findings]

Data

The MCQs can be found in Annotations/wikimixQA_MCQs.json.
Question metadata (e.g., images and tables) is available on Hugging Face: WikiMixQA.
The file Annotations/qid_to_path.tsv maps each question ID (QID) to its corresponding metadata folder in the Hugging Face dataset.
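As an illustration, the QID mapping could be read with a few lines of Python. This is only a sketch: it assumes that qid_to_path.tsv has two tab-separated columns (QID, then folder path) and no header row, which is not confirmed here, and `load_qid_to_path` is a hypothetical helper rather than part of the repository.

```python
import csv


def load_qid_to_path(tsv_lines):
    """Map each QID to its metadata folder.

    Assumption: two tab-separated columns (QID, path) and no header row.
    `tsv_lines` is any iterable of lines, e.g. an open text file.
    """
    mapping = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0]] = row[1]
    return mapping
```

To use it, open Annotations/qid_to_path.tsv in text mode (with `newline=""`, as the `csv` module recommends) and pass the file object to `load_qid_to_path`. The MCQ file itself can be read with `json.load`.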

Evaluation

Blind evaluation (i.e., without any context provided)

python scripts/evaluate_gpt.py \
    --model-name gpt-4o \
    --qa-type blind \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results

Wikidoc evaluation (i.e., with screenshots of the full Wikipedia page)

python scripts/evaluate_gemini.py \
    --model-name gpt-4o \
    --qa-type wikidoc \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_gemini.py \
    --model-name gpt-4o \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results
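The three `--qa-type` settings differ only in the context given to the model alongside the question. The sketch below illustrates that difference. The function and argument names are hypothetical and do not come from the evaluation scripts, which may organize this logic differently.

```python
def select_context(qa_type, oracle_items, page_screenshots):
    """Return the context items a model would receive for one question.

    Hypothetical helper: `oracle_items` stands for the tables/charts the
    question is about, `page_screenshots` for the full-page screenshots.
    """
    if qa_type == "blind":
        return []                      # question and answer options only
    if qa_type == "wikidoc":
        return list(page_screenshots)  # screenshots of the full Wikipedia pages
    if qa_type == "oracle":
        return list(oracle_items)      # only the relevant tables/charts
    raise ValueError(f"unknown qa-type: {qa_type!r}")
```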

Evaluation with InternVL2

Setup

Install dependencies

pip install vllm

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results_1001 \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

Wikidoc evaluation (i.e., with screenshots of the full Wikipedia page)

Note that we limit the input to the first 15 screenshots because of VRAM constraints. We could only run this setting for the 2B model.

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type wikidoc \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv
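The 15-screenshot cap mentioned in the note above could be implemented along these lines. This is a sketch under assumptions: the screenshot file names and their ordering are hypothetical, and `first_screenshots` is not a function from the repository.

```python
import re


def first_screenshots(paths, limit=15):
    """Keep the first `limit` screenshots in natural (numeric-aware) order."""
    def natural_key(path):
        # Split into alternating text / digit runs so "page_10" sorts after "page_2".
        return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]

    return sorted(paths, key=natural_key)[:limit]
```

Sorting with a numeric-aware key matters when file names carry an unpadded page index, since a plain string sort would place `page_10.png` before `page_2.png`.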

Blind evaluation (i.e., without any context provided)

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type blind \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv

Evaluation with Qwen2-VL

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_vllm.py \
    --model-name "Qwen/Qwen2-VL-7B-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 4

python scripts/evaluate_vllm.py \
    --model-name "Qwen/Qwen2-VL-72B-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

Evaluation with Llama-3.2-11B-Vision-Instruct

python scripts/evaluate_vllm.py \
    --model-name "meta-llama/Llama-3.2-11B-Vision-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

WTabHTML: HTML Wikitables extractor

Input:

Wikipedia HTML dump
Language

Output:

File format: JSON Lines. Each line is a JSON object with the following fields:

{
    title: Wikipedia page title
    wikidata: Wikidata ID
    url: URL of the Wikipedia page
    index: index of the table within the Wikipedia page
    html: HTML content of the table
    caption: table caption
    aspects: section hierarchy of the Wikipedia page
}
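A dump in this format can be read with the Python standard library alone. The sketch below assumes the dump is a bz2-compressed file with one JSON object per line, as described above. `iter_tables` is a hypothetical helper, not part of WTabHTML.

```python
import bz2
import json


def iter_tables(dump_path):
    """Yield one table record (a dict) per non-empty line of a .jsonl.bz2 dump."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, iterating over `iter_tables("./data/models/cr.jsonl.bz2")` would give access to fields such as `table["title"]`, `table["index"]`, and `table["caption"]` for each record.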

Usage:

Download, extract, and dump wikitables in the CR language

python wtabhtml.py dump -l cr

Download, extract, and dump wikitables, then generate table images, in the CR language

python wtabhtml.py gen-images -l cr -n 3

Note: You can download our preprocessed dumps and then copy all {LANGUAGE}.jsonl.bz2 files (the wikitables dumps in PubTabNet format) to wtabhtml/data/models/wikitables_html_pubtabnet to generate table images faster.

If you want to re-run the whole pipeline, the tool will download the Wikipedia HTML dump, extract the wikitables, and write them to wtabhtml/data/models/wikitables_html_pubtabnet/{LANGUAGE}.jsonl.bz2, following the steps below.

Wikitable processing pipeline for the CR language

# Download dump
python wtabhtml.py download -l cr
# Parse dump and save json file
python wtabhtml.py parse -l cr
# Read dump
python wtabhtml.py read -l 1 -i ./data/models/cr.jsonl.bz2
# Generate images
python wtabhtml.py gen-images -l cr -n 3

Download images

python scripts/download_images.py

Convert SVG to PNG

python scripts/convert_svg.py

Getting Wikipedia categories from P31

python scripts/get_category.py

Merge files for each topic/subtopic

python scripts/merge_files.py

Getting HTML pages and generating images from HTML pages

Get raw HTML pages for each article:

topic="Wikimedia"
subtopic="Person"
mkdir -p html/$topic/$subtopic
python scripts/get_wiki_html.py --input-file "data/${topic}_${subtopic}.json" --output-dir "html/$topic/$subtopic/"

Generate the image from HTML pages:

python scripts/html2image.py --input-file "data/${topic}_${subtopic}.json" --output-dir "html/$topic/$subtopic/"

Predict whether each image is likely to be a chart based on its filename:

python scripts/predict_chart_or_not.py --input-file "data/${topic}_${subtopic}.json" --output-file "data/${topic}_${subtopic}_chart.json"
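To make the idea of filename-based prediction concrete, here is a simple keyword heuristic. It is an illustrative stand-in, not the method used by scripts/predict_chart_or_not.py: the keyword list and the `looks_like_chart` function are assumptions made for this example.

```python
import re

# Hypothetical keyword list; the repository's script may use different rules.
CHART_KEYWORDS = ("chart", "graph", "plot", "histogram", "timeline", "diagram")


def looks_like_chart(filename):
    """Return True if a word in the filename starts with a chart-related keyword."""
    # Split on non-letters so "GDP_growth-chart.svg" yields
    # ["gdp", "growth", "chart", "svg"].
    tokens = re.split(r"[^a-z]+", filename.lower())
    return any(tok.startswith(kw) for tok in tokens for kw in CHART_KEYWORDS)
```

A heuristic like this is cheap but noisy (for instance, it would also accept a file named `graphic_novel.jpg`), so its output is best treated as a pre-filter before a stronger check.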

Compute statistics about the images

python scripts/category_image_stats.py

Fetching Wiki text

python scripts/extract_wiki_text.py

Create folder with final images

python scripts/move_images_to_folder.py

Create folder with tables

python scripts/move_tables_to_folder.py

Create final dataset

python scripts/create_final_dataset.py

Split large images generated from HTML pages

python scripts/split_large_images.py --input-dir final
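One way to split a very tall page screenshot is to cut it into fixed-height horizontal bands. The sketch below only computes the crop boxes. It assumes vertical splitting with a caller-chosen `max_height`, and the splitting rule actually used by scripts/split_large_images.py may differ.

```python
def split_boxes(width, height, max_height):
    """Return (left, upper, right, lower) crop boxes covering a tall image."""
    if max_height <= 0:
        raise ValueError("max_height must be positive")
    boxes = []
    for top in range(0, height, max_height):
        # The last band is shorter when height is not a multiple of max_height.
        boxes.append((0, top, width, min(top + max_height, height)))
    return boxes
```

The boxes use the same 4-tuple convention as Pillow's `Image.crop`, so each one can be passed to it directly if Pillow is installed.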

Extract HTML tables from raw article HTML

python scripts/extract_tables_from_html.py --input-dir final
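For readers who want to see what table extraction involves, the following sketch collects every top-level `<table>` element from an HTML string using only the standard library's `html.parser`. It is not the repository's implementation, which may rely on a different parser. `TableExtractor` and `extract_tables` are hypothetical names, and HTML comments inside tables are dropped.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the raw HTML of every top-level <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished tables, as HTML strings
        self._depth = 0    # current nesting depth of <table> tags
        self._parts = []   # pieces of the table being collected

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without a closing tag.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._depth:
            self._parts.append(f"</{tag}>")
        if tag == "table" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # closed a top-level table
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(html):
    """Return the HTML of each top-level table found in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Nested tables stay inside their parent table's HTML instead of being returned separately, which keeps each extracted table self-contained.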

Generate images from HTML tables

First, install geckodriver:

wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
tar -xvzf geckodriver-v0.34.0-linux64.tar.gz

Then, install Firefox from APT:

sudo snap remove firefox
sudo add-apt-repository ppa:mozillateam/ppa
echo '
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
' | sudo tee /etc/apt/preferences.d/mozilla-firefox
echo 'Unattended-Upgrade::Allowed-Origins:: "LP-PPA-mozillateam:${distro_codename}";' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-firefox
sudo apt install firefox

Finally, generate the images:

# Add the current directory (where geckodriver was extracted) to PATH
export PATH=$PATH:.
python scripts/gen_images_from_html_tables.py --input-dir final

Add extra information about the images with gpt-4-vision-preview

Getting details for each article:

topic="Economy"
subtopic="Stock market"
python scripts/predict_chart_information.py \
    --data-dir "data" \
    --output-dir "final" \
    --topic "$topic" \
    --subtopic "$subtopic"

Getting details only for articles with charts:

python scripts/predict_chart_information.py \
    --data-dir "data" \
    --output-dir "final" \
    --topic "$topic" \
    --subtopic "$subtopic" \
    --chart

Generate table descriptions with Llama-3-8B-Instruct

python scripts/table-to-text.py \
    --input-dir "final" \
    --checkpoint-dir "/mnt/datastore/models/meta-llama/Meta-Llama-3-8B-Instruct" \
    --dtype fp16 \
    --device "cuda:2"

Compute embeddings for chart and table descriptions

python scripts/compute_embeddings.py \
    --input-dir "final" \
    --model-name "BAAI/bge-reranker-v2-m3" \
    --use-fp16 \
    --device "cuda:2"

Citation

If you use this code for your research, please cite our paper:

@inproceedings{foroutan-etal-2025-wikimixqa,
      title={WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts}, 
      author={Negar Foroutan and Angelika Romanou and Matin Ansaripour and Julian Martin Eisenschlos and Karl Aberer and Rémi Lebret},
      year={2025},
      url={https://arxiv.org/abs/2506.15594},
}
