negar-foroutan/WikiMixQA


WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Model Overview

Code and model for the paper: "WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts" [ACL 2025 - Findings]

Data

  • The MCQs can be found in Annotations/wikimixQA_MCQs.json.
  • Questions metadata (e.g., images, tables, etc.) is available on Hugging Face: WikiMixQA.
  • The file Annotations/qid_to_path.tsv maps each question ID (QID) to its corresponding metadata folder in the Hugging Face dataset.
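
The QID-to-folder mapping can be loaded with the Python standard library. The following is a minimal sketch, assuming the TSV has two tab-separated columns (QID, then folder path) and no header row. That layout is an assumption, and `load_qid_to_path` is a hypothetical helper, not part of the repository.

```python
import csv
from pathlib import Path


def load_qid_to_path(tsv_path):
    """Return a dict mapping question ID -> metadata folder path.

    Assumes two tab-separated columns (QID, path) and no header row;
    adjust the parsing if the actual file differs.
    """
    mapping = {}
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            qid, folder = row[0], row[1]
            mapping[qid] = Path(folder)
    return mapping
```

For example, `load_qid_to_path("Annotations/qid_to_path.tsv")` would return a dictionary that can be indexed by QID to locate the images and tables of a question.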

Evaluation

Blind evaluation (i.e., without any context provided)

python scripts/evaluate_gpt.py \
    --model-name gpt-4o \
    --qa-type blind \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results

Wikidoc evaluation (i.e., with screenshots of the full Wikipedia page)

python scripts/evaluate_gemini.py \
    --model-name gpt-4o \
    --qa-type wikidoc \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_gemini.py \
    --model-name gpt-4o \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results

Evaluation with InternVL2

Setup

Install dependencies

pip install vllm

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results_1001 \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

Wikidoc evaluation (i.e., with screenshots of the full Wikipedia page)

Note: because of VRAM constraints, we limit the input to the first 15 screenshots, and we could only run this setting for the 2B model.

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type wikidoc \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv
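
The 15-screenshot cap described in the note can be pictured as a simple truncation. The sketch below is illustrative only: it assumes the screenshots of a page are PNG files in a single folder and that file-name order matches page order. `select_screenshots` and `MAX_SCREENSHOTS` are hypothetical names, not taken from `evaluate_vllm.py`.

```python
from pathlib import Path

MAX_SCREENSHOTS = 15  # cap stated in the note on wikidoc evaluation


def select_screenshots(folder, max_images=MAX_SCREENSHOTS):
    """Return at most `max_images` PNG paths from `folder`.

    Files are sorted lexicographically by name, so names should be
    zero-padded (e.g. 01.png, 02.png) for this to match page order.
    """
    images = sorted(Path(folder).glob("*.png"))
    return images[:max_images]
```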

Blind evaluation (i.e., without any context provided)

python scripts/evaluate_vllm.py \
    --model-name "OpenGVLab/InternVL2-2B" \
    --qa-type blind \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv

Evaluation with Qwen2-VL

Oracle evaluation (i.e., with the two modalities)

python scripts/evaluate_vllm.py \
    --model-name "Qwen/Qwen2-VL-7B-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 4

python scripts/evaluate_vllm.py \
    --model-name "Qwen/Qwen2-VL-72B-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

Evaluation with Llama-3.2-11B-Vision-Instruct

python scripts/evaluate_vllm.py \
    --model-name "meta-llama/Llama-3.2-11B-Vision-Instruct" \
    --qa-type oracle \
    --input-file Annotations/wikimixQA_MCQs.json \
    --output-dir results \
    --qid-to-path Annotations/qid_to_path.tsv \
    --num-gpus 8

WTabHTML: HTML Wikitables extractor

Input:

  • Wikipedia HTML dump
  • Language

Output:

File format: JSON Lines. Each line is a JSON object with the following fields:

{
    title: Wikipedia page title
    wikidata: Wikidata ID
    url: URL of the Wikipedia page
    index: index of the table within the Wikipedia page
    html: HTML content of the table
    caption: table caption
    aspects: section hierarchy of the Wikipedia page
}
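
A dump in this format can be read with the standard library alone. This is a minimal sketch, assuming a bz2-compressed file with one JSON object per line (as the `.jsonl.bz2` extension used later in this README suggests); `iter_tables` is a hypothetical helper, not part of WTabHTML.

```python
import bz2
import json


def iter_tables(dump_path):
    """Yield one dict per table from a bz2-compressed JSON Lines dump."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore empty lines
                yield json.loads(line)
```

Each yielded dictionary should expose the fields listed above, so `table["title"]` and `table["html"]` give the page title and the table markup. Streaming line by line avoids loading a whole language dump into memory.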

Usage:

Download, extract, and dump wikitables for the cr language

python wtabhtml.py dump -l cr

Download, extract, and dump wikitables, then generate table images for the cr language

python wtabhtml.py gen-images -l cr -n 3

Note: To generate table images faster, download our preprocessed dumps and copy all {LANGUAGE}.jsonl.bz2 files (the wikitables dumps in PubTabNet format) to wtabhtml/data/models/wikitables_html_pubtabnet.

To re-run the whole pipeline instead, follow the steps below: the tool downloads the Wikipedia HTML dump, extracts the wikitables, and writes them to wtabhtml/data/models/wikitables_html_pubtabnet/{LANGUAGE}.jsonl.bz2.

Pipeline of Wikitable processing in cr language

# Download dump
python wtabhtml.py download -l cr
# Parse dump and save json file
python wtabhtml.py parse -l cr
# Read dump
python wtabhtml.py read -l 1 -i ./data/models/cr.jsonl.bz2
# Generate images
python wtabhtml.py gen-images -l cr -n 3

Download images

python scripts/download_images.py

Convert SVG to PNG

python scripts/convert_svg.py

Getting Wikipedia categories from P31

python scripts/get_category.py

Merge files for each topic/subtopics

python scripts/merge_files.py

Getting HTML pages and generating images from HTML pages

Get raw HTML pages for each article:

topic="Wikimedia"
subtopic="Person"
mkdir -p html/$topic/$subtopic
python scripts/get_wiki_html.py --input-file "data/${topic}_${subtopic}.json" --output-dir "html/$topic/$subtopic/" 

Generate the image from HTML pages:

python scripts/html2image.py --input-file "data/${topic}_${subtopic}.json" --output-dir "html/$topic/$subtopic/" 

Predicting whether the images are likely to be a chart based on the filenames:

python scripts/predict_chart_or_not.py --input-file "data/${topic}_${subtopic}.json" --output-file "data/${topic}_${subtopic}_chart.json"
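
One simple way to guess from a file name whether an image is a chart is keyword matching. The sketch below is only an illustration of that idea: the keyword list and the `looks_like_chart` function are assumptions, and `predict_chart_or_not.py` may use a different rule or a model.

```python
import re

# Hypothetical keyword list; the repository's script may rely on other cues.
CHART_KEYWORDS = ("chart", "graph", "plot", "histogram", "diagram", "timeline")


def looks_like_chart(filename):
    """Return True if a word in the file name starts with a chart keyword."""
    # Split on anything that is not a letter, so "GDP_growth-chart.svg"
    # becomes ["gdp", "growth", "chart", "svg"].
    tokens = re.split(r"[^a-z]+", filename.lower())
    return any(tok.startswith(kw) for tok in tokens for kw in CHART_KEYWORDS)
```

Matching on word prefixes rather than substrings keeps names such as "Photograph.jpg" from being flagged because they contain "graph".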

Compute statistics about the images

python scripts/category_image_stats.py

Fetching Wiki text

python scripts/extract_wiki_text.py

Create folder with final images

python scripts/move_images_to_folder.py

Create folder with tables

python scripts/move_tables_to_folder.py

Create final dataset

python scripts/create_final_dataset.py

Split large images generated from HTML pages

python scripts/split_large_images.py --input-dir final
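
Splitting a tall page screenshot comes down to computing vertical crop boxes. The following sketch shows that arithmetic only, under the assumption that images are cut into horizontal strips of a fixed maximum height; `split_boxes` and its parameters are hypothetical and not taken from `split_large_images.py`.

```python
def split_boxes(width, height, max_height):
    """Return (left, top, right, bottom) boxes covering the image top to bottom.

    Every strip is `max_height` pixels tall except possibly the last one.
    """
    if max_height <= 0:
        raise ValueError("max_height must be positive")
    boxes = []
    top = 0
    while top < height:
        bottom = min(top + max_height, height)
        boxes.append((0, top, width, bottom))
        top = bottom
    return boxes
```

The boxes use the (left, upper, right, lower) convention expected by Pillow's `Image.crop`, so each one could be passed to it directly.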

Extract HTML tables from raw article HTML

python scripts/extract_tables_from_html.py --input-dir final
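
To illustrate what table extraction involves, the sketch below collects every top-level table element from a page using only the standard library `html.parser`. It is not the repository's implementation (which may use a different parser); `TableExtractor` and `extract_tables` are hypothetical names. Nested tables stay inside their parent table's HTML, and HTML comments are dropped.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the HTML of every top-level <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished table HTML strings
        self._depth = 0    # nesting depth of open <table> tags
        self._parts = []   # fragments of the table being collected

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(html):
    """Return a list with the HTML of each top-level table in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```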

Generate images from HTML tables

First, install geckodriver:

wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
tar -xvzf geckodriver-v0.34.0-linux64.tar.gz

Then, install Firefox from APT:

sudo snap remove firefox
sudo add-apt-repository ppa:mozillateam/ppa
echo '
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
' | sudo tee /etc/apt/preferences.d/mozilla-firefox
echo 'Unattended-Upgrade::Allowed-Origins:: "LP-PPA-mozillateam:${distro_codename}";' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-firefox
sudo apt install firefox

Finally, generate the images

# add path to geckodriver
export PATH=$PATH:. 
python scripts/gen_images_from_html_tables.py --input-dir final

Add extra information about the images with gpt-4-vision-preview

Getting details for each article:

topic="Economy"
subtopic="Stock market"
python scripts/predict_chart_information.py \
    --data-dir "data" \
    --output-dir "final" \
    --topic $topic \
    --subtopic $subtopic 

Getting details only for articles with charts:

python scripts/predict_chart_information.py \
    --data-dir "data" \
    --output-dir "final" \
    --topic $topic \
    --subtopic $subtopic \
    --chart 

Generate table descriptions with Llama-3-8B-Instruct

python scripts/table-to-text.py \
    --input-dir "final" \
    --checkpoint-dir "/mnt/datastore/models/meta-llama/Meta-Llama-3-8B-Instruct" \
    --dtype fp16 \
    --device "cuda:2"

Compute embeddings for chart and table descriptions

python scripts/compute_embeddings.py \
    --input-dir "final" \
    --model-name "BAAI/bge-reranker-v2-m3" \
    --use-fp16 \
    --device "cuda:2"

Citation

If you use this code for your research, please cite our paper:

@inproceedings{foroutan-etal-2025-wikimixqa,
      title={WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts}, 
      author={Negar Foroutan and Angelika Romanou and Matin Ansaripour and Julian Martin Eisenschlos and Karl Aberer and Rémi Lebret},
      year={2025},
      url={https://arxiv.org/abs/2506.15594},
}
