AIBench

💜 Project Page | 🤗 Dataset | 📚 Code | 📑 Paper

AIBench is a benchmark and toolkit for evaluating academic illustration generation.
It focuses on whether modern image generation models can produce paper-ready method/framework figures that are both:

logically consistent with the source paper
aesthetically acceptable as academic illustrations

This repository currently provides two core modules:

infer/: image generation
eval/: standalone evaluation

These modules are designed to work together: first generate images, then evaluate either generated outputs or baseline references.

Why AIBench

Image generation has improved rapidly, but its ability to produce ready-to-use academic figures is still underexplored. Prior evaluations often rely on direct VLM-to-VLM holistic comparisons, which implicitly assume oracle-level multimodal understanding. That assumption becomes fragile when method text and diagrams are long, dense, and logically complex.

AIBench addresses this gap with a VQA-style logic evaluation plus model-based aesthetic evaluation, reducing reliance on a single black-box holistic judgment.

Features

First VQA-based benchmark for this task: evaluates academic illustration quality through structured question answering
Four-level logic assessment: from low-level components to high-level semantics
Aesthetic assessment via VLMs: complements logic consistency with style/visual quality
Decoupled design: inference and evaluation are separated for easier debugging and reuse
JSONL data flow: structured JSONL input/output for scalable batch processing
Multi-role evaluation: supports generated images (image_gen), ground-truth images (gt), and blank-image baselines (blank)
Config-driven workflow: model backend, parallelism, retries, and paths are all controlled via YAML

Method Overview

For logic evaluation, AIBench formulates assessment as a sequence of VQA tasks built from a logic graph summarized from paper method text. Questions are organized into four levels:

Component-Existence Level
Local-Topology Level
Phase-Architecture Level
Global-Semantics Level

This design provides fine-grained, interpretable signals on where a generated illustration succeeds or fails. Text rendering quality is also implicitly tested, because many QA items require correctly generated labels/tokens in the figure.
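As a rough illustration of how four-level results can be read, the sketch below aggregates per-question correctness into one accuracy value per level. The level identifiers and the `level`/`correct` keys are hypothetical names chosen for this example; they are not taken from the repository's record schema.

```python
from collections import defaultdict

# Hypothetical identifiers for the four levels described above.
LEVELS = [
    "component_existence",
    "local_topology",
    "phase_architecture",
    "global_semantics",
]

def accuracy_by_level(qa_results):
    """Return {level: accuracy} for every level that has at least one question.

    qa_results: iterable of dicts with assumed keys 'level' and 'correct'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in qa_results:
        totals[item["level"]] += 1
        hits[item["level"]] += int(item["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}

# Toy results for a single generated figure.
results = [
    {"level": "component_existence", "correct": True},
    {"level": "component_existence", "correct": False},
    {"level": "local_topology", "correct": True},
    {"level": "global_semantics", "correct": False},
]
scores = accuracy_by_level(results)
```

Reporting the levels separately, rather than as one pooled score, is what makes it possible to see whether a model fails on individual components or on the overall semantics.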

Dataset Snapshot

300 open-access papers from top-tier venues
5704 QA pairs for logic and quality checking
QA pairs are generated by the QA construction pipeline and then manually checked by multiple human experts

Main Findings (from the paper)

The performance gap between models on academic illustration generation is substantially larger than on common generation tasks.
Logic correctness and aesthetics are often in tension; improving one may hurt the other.
Strong reasoning and high-density visual generation are both necessary for this task.
Test-time scaling strategies on both reasoning and generation sides can significantly improve performance.

Contributions

AIBench: the first VQA-based benchmark for evaluating academic illustration generation with logic and aesthetic dimensions.
A scalable QA construction framework: generates high-quality, multi-level questions from summarized logic diagrams.
A comprehensive evaluation of SOTA open-/closed-source unified models and T2I models, including test-time scaling analyses.

Repository Structure

.
├── infer/
│   ├── main.py
│   ├── pipeline.py
│   ├── generate_stage.py
│   ├── image_gen_client.py
│   ├── config_schema.py
│   └── configs/
│       └── example.yaml
├── eval/
│   ├── main.py
│   ├── pipeline.py
│   ├── eval_stage.py
│   ├── eval_client.py
│   ├── config_schema.py
│   └── configs/
│       └── example.yaml
├── utils/
├── scripts/
├── requirements.txt
└── README.md

Installation

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Key dependencies in requirements.txt include pyyaml, pydantic, Pillow, tqdm, requests, openai, dotenv, and the aesthetic scoring package unipercept-reward.

Workflow

Prepare prompt data in JSONL format
Run generation with infer/main.py
Set data.infer_records_file in the eval config to the path of the generated records file
Run evaluation with eval/main.py
Check evaluation records and metrics

Inference: Generate Images

Run:

python infer/main.py -c infer/configs/example.yaml

Key config fields (infer/configs/example.yaml):

data.prompts_file: prompt JSONL input (benchmark source data, e.g., data/AIBench.jsonl)
data.out_dir: output directory (e.g., work_dirs/infer/example)
data.records_file: generated records filename (e.g., gen_records.jsonl)
image_gen.provider: generation backend (e.g., nano_banana, wan2.6)

Prompt Input (JSONL)

Each sample should include an id field and a method_text field containing the paper's method section.

If data.text_prompt_template is set, the text will be formatted into that template before being sent to the generation model, for example:

Generate a framework figure based on the given paper method section.
{method_text}
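A minimal sketch of how such a template could be applied to prompt samples, assuming plain `str.format` substitution of `{method_text}`. The actual templating mechanism in infer/ may differ, and `build_prompts` is a hypothetical helper.

```python
import json

# Template text taken from the example above.
TEMPLATE = (
    "Generate a framework figure based on the given paper method section.\n"
    "{method_text}"
)

def build_prompts(jsonl_lines, template=TEMPLATE):
    """Map each sample id to its formatted prompt text."""
    prompts = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        sample = json.loads(line)
        prompts[str(sample["id"])] = template.format(
            method_text=sample["method_text"]
        )
    return prompts

# One illustrative sample; real samples come from data.prompts_file.
lines = ['{"id": "123", "method_text": "We encode the input, then decode it."}']
prompts = build_prompts(lines)
```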

Outputs:

Images: <out_dir>/gen_images/*.png
Records: <out_dir>/<records_file>

Record example (one JSON object per line):

{"prompt_id":"123","status":"success","image_path":".../gen_images/123.png","provider":"nano_banana","latency_sec":4.21,"ts":1760000000.0}
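Records in this shape can be filtered before evaluation. The sketch below keeps only successful generations and maps each prompt_id to its image path. The `"failed"` status value and the shortened image path in the sample are assumptions for illustration; only `"success"` appears in the record example above.

```python
import json

def successful_images(record_lines):
    """Return {prompt_id: image_path} for records whose status is 'success'."""
    images = {}
    for line in record_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("status") == "success":
            images[record["prompt_id"]] = record["image_path"]
    return images

sample_records = [
    '{"prompt_id":"123","status":"success","image_path":"out/gen_images/123.png",'
    '"provider":"nano_banana","latency_sec":4.21,"ts":1760000000.0}',
    # Hypothetical failure record; the real failure format is not documented here.
    '{"prompt_id":"124","status":"failed","image_path":null,"provider":"nano_banana"}',
]
images = successful_images(sample_records)
```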

Evaluation: Score Results

Run:

python eval/main.py -c eval/configs/example.yaml

Key config fields (eval/configs/example.yaml):

data.prompts_file: prompt JSONL (benchmark source data, e.g., data/AIBench.jsonl)
data.infer_records_file: infer output records JSONL (with prompt_id and image_path)
data.gt_image_root: GT image root directory (used when role=gt)
eval.role:
- image_gen: evaluate generated images from infer records
- gt: evaluate dataset GT images (image_url or gt_image_root + image)
- blank: evaluate blank white-image baseline
eval.tasks: evaluation tasks (e.g., visual_qa, aesthetic)
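The three roles differ mainly in where the evaluated image comes from. The sketch below shows one way that choice could be made. `resolve_image` is a hypothetical helper, and the precedence of image_url over gt_image_root + image is an assumption, not confirmed behavior of eval/.

```python
import os

def resolve_image(role, sample, record=None, gt_image_root=None):
    """Pick the image source for one sample according to eval.role."""
    if role == "image_gen":
        if record is None:
            raise ValueError("image_gen requires an infer record")
        return record["image_path"]
    if role == "gt":
        # Assumed precedence: a remote URL wins over a local file.
        if sample.get("image_url"):
            return sample["image_url"]
        return os.path.join(gt_image_root, sample["image"])
    if role == "blank":
        return None  # the caller would substitute a blank white image
    raise ValueError(f"unknown role: {role}")

generated = resolve_image("image_gen", {"id": "123"},
                          record={"image_path": "out/gen_images/123.png"})
ground_truth = resolve_image("gt", {"id": "123", "image": "123.png"},
                             gt_image_root="gt_images")
```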

Outputs:

Evaluation records: <out_dir>/<records_file>
Aggregated metrics: <out_dir>/<metric_file>

Configuration Notes

Config files read environment variables through the !env tag (for example, API_KEY_1)
Do not commit secrets; inject keys through .env or shell environment variables
Tune max_workers and retry_times based on your compute resources and rate limits
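To make the interaction between max_workers and retry_times concrete, here is a generic standard-library sketch of bounded parallelism with per-item retries. It is not the repository's implementation. The function names are hypothetical, and treating retry_times as the number of extra attempts after the first one is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_with_retry(fn, arg, retry_times=3, backoff_sec=0.0):
    """Call fn(arg), retrying up to retry_times more times on any exception."""
    for attempt in range(retry_times + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == retry_times:
                raise  # out of retries: surface the last error
            time.sleep(backoff_sec * (attempt + 1))  # simple linear backoff

def run_all(fn, args, max_workers=4, retry_times=3):
    """Run fn over args with at most max_workers concurrent calls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: call_with_retry(fn, a, retry_times), args))

# Toy backend that fails once per item, like a transient rate-limit error.
attempts = {}

def flaky(x):
    attempts[x] = attempts.get(x, 0) + 1
    if attempts[x] < 2:
        raise RuntimeError("transient failure")
    return x * 2

results = run_all(flaky, [1, 2, 3], max_workers=2, retry_times=2)
```

A higher max_workers value increases throughput but also the chance of hitting API rate limits, which is when a non-zero backoff and a larger retry_times become useful.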

Citation

If you find this code useful for your research, please cite our paper:

@article{liao2025aibench,
  title={AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation},
  author={Liao, Zhaohe and Jiang, Kaixun and Liu, Zhihang and Wei, Yujie and Yu, Junqiu and Li, Quanhao and Yu, Hongtao and Li, Pandeng and Wang, Yuzheng and Xing, Zhen and Zhang, Shiwei and Xie, Chen-Wei and Zheng, Yun and Liu, Xihui},
  journal={arXiv preprint arXiv:2603.28068},
  year={2026}
}
