AIBench is a benchmark and toolkit for evaluating academic illustration generation.
It focuses on whether modern image generation models can produce paper-ready method/framework figures that are both:
- logically consistent with the source paper
- aesthetically acceptable as academic illustrations
This repository currently provides two core modules:
- infer/: image generation
- eval/: standalone evaluation
These modules are designed to work together: first generate images, then evaluate either generated outputs or baseline references.
Image generation has improved rapidly, but its ability to produce ready-to-use academic figures is still underexplored. Prior evaluations often rely on direct VLM-to-VLM holistic comparisons, which implicitly assume oracle-level multimodal understanding. That assumption becomes fragile when method text and diagrams are long, dense, and logically complex.
AIBench addresses this gap with a VQA-style logic evaluation plus model-based aesthetic evaluation, reducing reliance on a single black-box holistic judgment.
- First VQA-based benchmark for this task: evaluates academic illustration quality through structured question answering
- Four-level logic assessment: from low-level components to high-level semantics
- Aesthetic assessment via VLMs: complements logic consistency with style/visual quality
- Decoupled design: inference and evaluation are separated for easier debugging and reuse
- JSONL data flow: structured JSONL input/output for scalable batch processing
- Multi-role evaluation: supports generated images (image_gen), ground-truth images (gt), and blank-image baselines (blank)
- Config-driven workflow: model backend, parallelism, retries, and paths are all controlled via YAML
For logic evaluation, AIBench formulates assessment as a sequence of VQA tasks built from a logic graph summarized from paper method text. Questions are organized into four levels:
- Component-Existence Level
- Local-Topology Level
- Phase-Architecture Level
- Global-Semantics Level
This design provides fine-grained, interpretable signals on where a generated illustration succeeds or fails. Text rendering quality is also implicitly tested, because many QA items require correctly generated labels/tokens in the figure.
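One way to turn per-question VQA judgments into level-wise signals is a plain accuracy per level. The sketch below assumes each judged QA item carries a `level` label and a boolean `correct` flag. The field names, the level identifiers, and the choice of unweighted accuracy are illustrative assumptions, not the repository's actual schema or metric.

```python
from collections import defaultdict

# Hypothetical identifiers for the four levels listed above.
LEVELS = (
    "component_existence",
    "local_topology",
    "phase_architecture",
    "global_semantics",
)


def per_level_accuracy(qa_results):
    """Return {level: accuracy} for every level with at least one QA item.

    qa_results: iterable of dicts with the assumed keys
    "level" (one of LEVELS) and "correct" (bool).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in qa_results:
        level = item["level"]
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        total[level] += 1
        correct[level] += int(bool(item["correct"]))
    # Levels without any questions are omitted rather than reported as 0.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}
```

Reporting the four numbers separately, rather than one pooled score, keeps the interpretability described above: a figure can score well on component existence while failing at global semantics.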
- 300 open-access papers from top-tier venues
- 5704 QA pairs for logic and quality checking
- QA pairs are generated by the construction pipeline and then manually checked by multiple human experts
- The performance gap between models on academic illustration generation is substantially larger than on common generation tasks.
- Logic correctness and aesthetics are often in tension; improving one may hurt the other.
- Strong reasoning and high-density visual generation are both necessary for this task.
- Test-time scaling strategies on both reasoning and generation sides can significantly improve performance.
- AIBench: the first VQA-based benchmark for evaluating academic illustration generation with logic and aesthetic dimensions.
- A scalable QA construction framework: generates high-quality, multi-level questions from summarized logic diagrams.
- A comprehensive evaluation of SOTA open-/closed-source unified models and T2I models, including test-time scaling analyses.
.
├── infer/
│ ├── main.py
│ ├── pipeline.py
│ ├── generate_stage.py
│ ├── image_gen_client.py
│ ├── config_schema.py
│ └── configs/
│ └── example.yaml
├── eval/
│ ├── main.py
│ ├── pipeline.py
│ ├── eval_stage.py
│ ├── eval_client.py
│ ├── config_schema.py
│ └── configs/
│ └── example.yaml
├── utils/
├── scripts/
├── requirements.txt
└── README.md
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Key dependencies in requirements.txt include pyyaml, pydantic, Pillow, tqdm, requests, openai, dotenv, and the aesthetic scoring package unipercept-reward.
- Prepare prompt data in JSONL format
- Run generation with infer/main.py
- Set the generated records path to data.infer_records_file in the eval config
- Run evaluation with eval/main.py
- Check evaluation records and metrics
Run:
python infer/main.py -c infer/configs/example.yaml
Key config fields (infer/configs/example.yaml):
- data.prompts_file: prompt JSONL input (benchmark source data, e.g., data/AIBench.jsonl)
- data.out_dir: output directory (e.g., work_dirs/infer/example)
- data.records_file: generated records filename (e.g., gen_records.jsonl)
- image_gen.provider: generation backend (e.g., nano_banana, wan2.6)
Each sample in the prompt JSONL should include an id field and a method_text field holding the paper's method section.
If data.text_prompt_template is set, the text will be formatted into that template before being sent to the generation model, for example:
Generate a framework figure based on the given paper method section.
{method_text}
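The templating step can be sketched as follows. This is a minimal illustration, assuming the template contains a literal `{method_text}` placeholder as in the example above. The function name `build_prompt` and the validation behaviour are hypothetical, and the repository may substitute the text differently.

```python
# Template text copied from the example above.
TEMPLATE = (
    "Generate a framework figure based on the given paper method section.\n"
    "{method_text}"
)


def build_prompt(sample, template=None):
    """Return the text sent to the generation model for one sample.

    sample: dict parsed from one line of the prompt JSONL.
    template: optional string containing the "{method_text}" placeholder.
    """
    missing = [key for key in ("id", "method_text") if key not in sample]
    if missing:
        raise KeyError(f"sample is missing required fields: {missing}")
    text = sample["method_text"]
    if not template:
        # No template configured: send the method text unchanged.
        return text
    # str.replace is used instead of str.format so that other braces in the
    # template (for example LaTeX fragments) are not treated as fields.
    return template.replace("{method_text}", text)
```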
Outputs:
- Images: <out_dir>/gen_images/*.png
- Records: <out_dir>/<records_file>
Record example (one JSON object per line):
{"prompt_id":"123","status":"success","image_path":".../gen_images/123.png","provider":"nano_banana","latency_sec":4.21,"ts":1760000000.0}
Run:
python eval/main.py -c eval/configs/example.yaml
Key config fields (eval/configs/example.yaml):
- data.prompts_file: prompt JSONL (benchmark source data, e.g., data/AIBench.jsonl)
- data.infer_records_file: infer output records JSONL (with prompt_id and image_path)
- data.gt_image_root: GT image root directory (used when role=gt)
- eval.role:
  - image_gen: evaluate generated images from infer records
  - gt: evaluate dataset GT images (image_url or gt_image_root + image)
  - blank: evaluate blank white-image baseline
- eval.tasks: evaluation tasks (e.g., visual_qa, aesthetic)
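How the three roles might select an image for each sample can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: it assumes that prompt_id in the infer records matches the sample id, that image_url and image are optional fields on each benchmark sample, and that only records with status "success" are usable. The helper names and the BLANK sentinel are hypothetical.

```python
import json
import os

# Hypothetical marker meaning "use a blank white image" (role=blank).
BLANK = "<blank>"


def load_infer_records(lines):
    """Index successful infer records by prompt_id (one JSON object per line)."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status") == "success":
            index[str(record["prompt_id"])] = record["image_path"]
    return index


def resolve_image(role, sample, infer_index=None, gt_image_root=""):
    """Return the image reference to evaluate, or None if none is available."""
    if role == "image_gen":
        return (infer_index or {}).get(str(sample["id"]))
    if role == "gt":
        if sample.get("image_url"):
            return sample["image_url"]
        if sample.get("image"):
            return os.path.join(gt_image_root, sample["image"])
        return None
    if role == "blank":
        return BLANK
    raise ValueError(f"unknown role: {role!r}")
```

Returning None for a missing image, instead of raising, lets a caller decide whether to skip the sample or count it as a failure.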
Outputs:
- Evaluation records: <out_dir>/<records_file>
- Aggregated metrics: <out_dir>/<metric_file>
- This project reads environment variables via !env (for example, API_KEY_1)
- Do not commit secrets; inject keys through .env or shell environment variables
- Tune max_workers and retry_times based on your compute resources and rate limits
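To make the trade-off behind max_workers and retry_times concrete, here is a minimal sketch of a thread pool with per-item retries. It assumes that retry_times counts extra attempts after the first failure and that max_workers bounds concurrent requests. The repository's actual semantics may differ, and the function names and backoff_sec parameter are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(fn, arg, retry_times=3, backoff_sec=0.0):
    """Call fn(arg), retrying up to retry_times more times on any exception."""
    for attempt in range(retry_times + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == retry_times:
                raise  # out of retries: surface the last error
            # Exponential backoff between attempts (no delay when 0.0).
            time.sleep(backoff_sec * (2 ** attempt))


def run_batch(fn, items, max_workers=4, retry_times=3, backoff_sec=0.0):
    """Apply fn to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(call_with_retries, fn, item, retry_times, backoff_sec)
            for item in items
        ]
        return [future.result() for future in futures]
```

With a rate-limited API, a lower max_workers combined with a non-zero backoff usually wastes fewer requests than a high worker count with many immediate retries.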
If you find this code useful for your research, please cite our paper:
@article{liao2025aibench,
title={AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation},
author={Liao, Zhaohe and Jiang, Kaixun and Liu, Zhihang and Wei, Yujie and Yu, Junqiu and Li, Quanhao and Yu, Hongtao and Li, Pandeng and Wang, Yuzheng and Xing, Zhen and Zhang, Shiwei and Xie, Chen-Wei and Zheng, Yun and Liu, Xihui},
journal={arXiv preprint arXiv:2603.28068},
year={2026}
}