📊 VisJudgeBench

VisJudgeBench: Aesthetics and Quality Assessment of Visualizations


📄 Paper: https://arxiv.org/abs/2510.22373
🤖 Model: https://huggingface.co/xypkent/visjudge-7b

🎯 About VisJudgeBench

VisJudgeBench is a comprehensive benchmark for evaluating how well MLLMs assess visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Each sample includes the visualization image, six-dimensional quality scores based on the Fidelity-Expressiveness-Aesthetics evaluation framework, and an evaluation prompt.

Research Motivation

Why we need specialized visualization assessment: MLLMs excel at general aesthetics but struggle with visualization-specific evaluation

🔍 Evaluation Framework

Our benchmark evaluates visualizations across three fundamental dimensions, operationalized into six measurable metrics:

Fidelity-Expressiveness-Aesthetics Framework

The Fidelity-Expressiveness-Aesthetics evaluation framework with positive and negative examples

1. Fidelity - Data Accuracy and Truthfulness

  • data_fidelity: Evaluates whether visual encodings accurately reflect the original data, avoiding misleading interpretations caused by improper axis settings, scale distortions, or other design flaws.

2. Expressiveness - Information Clarity and Understandability

  • semantic_readability: Assesses the clarity of basic information encoding and whether users can unambiguously decode visual elements.
  • insight_discovery: Evaluates how effectively the visualization reveals deep data patterns, trends, or outliers, helping users move from "reading information" to "gaining insights".

3. Aesthetics - Visual Aesthetics and Refinement

  • design_style: Measures the innovation and uniqueness of design, including novel visual elements and distinctive style.
  • visual_composition: Focuses on the soundness of the spatial layout, evaluating the balance and order of element positioning, size proportions, and spacing arrangements.
  • color_harmony: Assesses the coordination and functionality of color combinations, ensuring color palette choices balance aesthetics with effective information communication.

🏗️ Benchmark Construction

Our benchmark follows a rigorous three-stage construction pipeline to ensure high-quality annotations:

Benchmark Construction Pipeline

Three-stage benchmark construction: Data Collection → Evaluation Framework → Expert Annotation

📁 Repository Structure

VisJudgeBench/
├── README.md                    # This file
├── VisJudgeBench.json           # Complete dataset with 3,090 annotated samples
├── figures/                     # Figures for documentation
└── images/                      # Visualization images organized by category
    ├── single_vis/              # Single visualization charts
    ├── multi_vis/               # Multi-panel visualizations
    └── dashboard/               # Dashboard-style visualizations

📈 Dataset Statistics

Our benchmark contains 3,090 expert-annotated samples collected from real-world visualizations via web search engines, covering three main categories and 32 distinct subtypes:

| Category | Samples | Subtypes | All Subtypes (Count) |
|---|---|---|---|
| Single Visualization | 1,041 | 22 | Bar Chart (176) • Pie Chart (129) • Line Chart (100) • Area Chart (75) • Treemap (62) • Sankey Diagram (61) • Heatmap (55) • Scatter Plot (49) • Histogram (48) • Donut Chart (47) • Funnel Chart (45) • Bubble Chart (29) • Choropleth Map (25) • Radar Chart (24) • Network Graph (23) • Candlestick Chart (20) • Gauge Chart (20) • Box Plot (17) • Point Map (12) • Word Cloud (1) • Violin Plot (1) • Other Single View (22) |
| Multiple Visualizations | 1,024 | 5 | Comparison Views (670) • Small Multiples (195) • Coordinated Views (97) • Other Multi View (59) • Overview Detail (3) |
| Dashboard | 1,025 | 5 | Analytical Dashboard (743) • Operational Dashboard (122) • Interactive Dashboard (91) • Strategic Dashboard (62) • Other Dashboard (7) |
| 🎯 Total | 3,090 | 32 | Complete Coverage Across All Visualization Types |

🏆 Benchmark Results

We systematically evaluate multiple state-of-the-art multimodal large language models (MLLMs) on VisJudgeBench to measure their visualization quality assessment capabilities.

🤖 Can MLLMs Assess Visualization Quality and Aesthetics Like Humans?

| Model | MAE ↓ | MSE ↓ | Correlation ↑ |
|---|---|---|---|
| VisJudge | 0.442 | 0.306 | 0.681 |
| GPT-5 | 0.551 | 0.484 | 0.429 |
| GPT-4o | 0.609 | 0.575 | 0.482 |
| Claude-4-Sonnet | 0.618 | 0.596 | 0.470 |
| Gemini-2.0-Flash | 0.680 | 0.716 | 0.395 |
| Gemini-2.5-Pro | 0.661 | 0.674 | 0.266 |
| Claude-3.5-Sonnet | 0.823 | 1.006 | 0.395 |
| Qwen2.5-VL-7B | 1.048 | 1.502 | 0.322 |

Key Findings:

  • 🎯 VisJudge achieves 19.8% MAE improvement over GPT-5 (from 0.551 to 0.442)
  • 📈 VisJudge shows 58.7% higher correlation with human experts compared to GPT-5 (from 0.429 to 0.681)
  • 🏅 Outperforms all commercial MLLMs across all metrics on visualization assessment tasks
  • 📊 Even the most advanced models (GPT-5) show significant gaps compared to human expert judgment

Performance by Evaluation Dimensions (MAE ↓)

| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.442 | 0.662 | 0.649 | 0.679 | 0.581 | 0.546 | 0.604 |
| GPT-5 | 0.551 | 0.861 | 0.780 | 0.776 | 0.648 | 0.698 | 0.682 |
| GPT-4o | 0.609 | 0.986 | 0.804 | 0.742 | 0.608 | 0.694 | 0.657 |
| Claude-4-Sonnet | 0.618 | 0.839 | 0.757 | 0.830 | 0.678 | 0.733 | 0.785 |
| Gemini-2.0-Flash | 0.680 | 0.828 | 0.910 | 0.818 | 0.637 | 0.728 | 0.798 |
| Gemini-2.5-Pro | 0.661 | 1.241 | 0.944 | 0.898 | 0.839 | 0.918 | 0.980 |
| Claude-3.5-Sonnet | 0.823 | 0.977 | 0.902 | 1.152 | 0.782 | 0.939 | 0.862 |
| Qwen2.5-VL-7B | 1.048 | 1.169 | 1.294 | 0.857 | 0.755 | 0.812 | 0.772 |

Performance by Evaluation Dimensions (Correlation ↑)

| Model | Overall | Data Fidelity | Semantic Readability | Insight Discovery | Design Style | Visual Composition | Color Harmony |
|---|---|---|---|---|---|---|---|
| VisJudge | 0.681 | 0.571 | 0.625 | 0.572 | 0.567 | 0.512 | 0.385 |
| GPT-5 | 0.429 | 0.256 | 0.438 | 0.383 | 0.463 | 0.277 | 0.295 |
| GPT-4o | 0.482 | 0.382 | 0.539 | 0.442 | 0.472 | 0.277 | 0.363 |
| Claude-4-Sonnet | 0.470 | 0.392 | 0.548 | 0.453 | 0.422 | 0.164 | 0.228 |
| Gemini-2.0-Flash | 0.395 | 0.371 | 0.458 | 0.418 | 0.460 | 0.157 | 0.209 |
| Gemini-2.5-Pro | 0.266 | 0.180 | 0.379 | 0.357 | 0.447 | 0.194 | 0.208 |
| Claude-3.5-Sonnet | 0.395 | 0.325 | 0.491 | 0.366 | 0.456 | 0.137 | 0.259 |
| Qwen2.5-VL-7B | 0.322 | 0.340 | 0.349 | 0.278 | 0.356 | 0.148 | 0.155 |

Key Observations:

  • All models struggle most with Aesthetics dimensions (Design Style, Visual Composition, Color Harmony)
  • Data Fidelity is relatively easier but still challenging for most models
  • VisJudge consistently outperforms baseline models across all six dimensions
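
For reference, here is a minimal sketch of how the MAE, MSE, and correlation numbers above can be computed from paired model and human overall scores. The example score arrays (loosely drawn from the case studies later in this README) and the use of numpy/scipy are illustrative assumptions; the official evaluation pipeline may aggregate per-dimension scores differently.

import numpy as np
from scipy.stats import pearsonr

# Illustrative paired overall scores, one entry per benchmark sample.
human_scores = np.array([1.67, 4.17, 3.56])  # expert ratings (overall_score)
model_scores = np.array([2.00, 3.83, 3.00])  # a model's predicted overall scores

mae = np.mean(np.abs(model_scores - human_scores))  # mean absolute error (lower is better)
mse = np.mean((model_scores - human_scores) ** 2)   # mean squared error (lower is better)
corr, _ = pearsonr(model_scores, human_scores)      # Pearson correlation with humans (higher is better)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  Correlation={corr:.3f}")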

📊 Do MLLMs Exhibit Human-like Scoring Behaviors?

Score Distribution

Rating patterns of different models compared to human experts (μ_human = 3.13)

Systematic Biases Revealed:

  • Score Inflation: Most models (Qwen2.5-VL-7B μ=3.89, Claude-3.5-Sonnet μ=3.87) tend to over-rate visualizations
  • Overly Conservative: Gemini-2.5-Pro (μ=3.02) tends to under-rate visualizations
  • Near-Perfect Alignment: VisJudge (μ=3.11) aligns closely with the human rating distribution (μ=3.13)

📈 How Does Visualization Complexity Affect Model Performance?

Model Performance Radar Chart

Model-human rating correlation across different visualization types and evaluation dimensions

Key Insights:

  • All models show performance degradation as complexity increases: Single Vis > Multi Vis > Dashboard
  • VisJudge maintains the best performance across all types: 0.577 (Single), 0.565 (Multi), 0.375 (Dashboard)
  • Aesthetic dimensions (especially Visual Composition) are most challenging in complex dashboards

🔍 How Do Model Evaluation Behaviors Differ in Practice?

Our case studies reveal two common biases in model evaluation behaviors: score inflation and overly conservative assessments.

Score Inflation Examples

Model evaluation examples on low-quality visualizations showing score inflation bias

Score Inflation: For a chaotic treemap (human rating: 1.67), baseline models give inflated scores. For instance, Qwen2.5-VL-7B (3.67) praises its "clear legend" while ignoring the confusing layout, and Claude-4-Sonnet (3.08) incorrectly highlights "excellent spatial organization". In contrast, VisJudge's score of 2.00 aligns with human judgment, correctly identifying the "chaotic layout" that impairs interpretation.

Conservative Bias Examples

Case study highlighting the conservative bias of Gemini-2.5-Pro

Overly Conservative: Conversely, Gemini-2.5-Pro exhibits overly conservative bias. For a high-quality dashboard rated 4.17 by humans, Gemini-2.5-Pro gives a disproportionately low score of 2.94, focusing on a single data inconsistency while overlooking the chart's overall effectiveness. Similarly, for another chart (human rating: 3.56), it scores only 2.33 due to the use of dual Y-axes. VisJudge demonstrates more balanced evaluations (3.83 and 3.00, respectively).

📋 Data Format

The dataset is stored in JSON format (VisJudgeBench.json), where each entry contains the following fields:

  • _id: Unique identifier for each sample
  • type: Visualization category (single_vis, multi_vis, or dashboard)
  • subtype: Specific subcategory within the main type
  • image_path: Path to the corresponding visualization image
  • overall_score: Overall quality score (average of six dimension scores, ranging from 1.0 to 5.0)
  • dimension_scores: Six-dimensional quality assessment scores (see Evaluation Framework above for detailed descriptions of each dimension)
  • prompt: Complete evaluation prompt with detailed scoring criteria for each dimension

💡 Example Entry

{
  "_id": 1,
  "type": "dashboard",
  "subtype": "analytical_dashboard",
  "image_path": "images/dashboard/example.png",
  "overall_score": 3.75,
  "dimension_scores": {
    "data_fidelity": 4.0,
    "semantic_readability": 4.5,
    "insight_discovery": 3.5,
    "design_style": 3.0,
    "visual_composition": 4.0,
    "color_harmony": 3.5
  },
  "prompt": "..."
}
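
As a quick sanity check on the format above, the following sketch loads VisJudgeBench.json and verifies that each overall_score matches the mean of its six dimension_scores and that the referenced image exists. It assumes the JSON top level is a list of entries and that scores are stored with roughly two-decimal precision.

import json
from pathlib import Path

# Load the full benchmark: assumed to be a list of annotated samples with the fields listed above.
samples = json.loads(Path("VisJudgeBench.json").read_text(encoding="utf-8"))
print(f"Loaded {len(samples)} samples")

for sample in samples:
    dims = sample["dimension_scores"]
    mean_score = sum(dims.values()) / len(dims)
    # overall_score is documented as the average of the six dimension scores.
    assert abs(sample["overall_score"] - mean_score) < 0.01, sample["_id"]
    # image_path should point at a file under images/<type>/.
    assert Path(sample["image_path"]).exists(), sample["image_path"]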

🤖 VisJudge Model

To address the significant gap between general MLLMs and human expert judgment in visualization quality assessment, we developed VisJudge, a specialized model fine-tuned on our benchmark data using the GRPO (Group Relative Policy Optimization) method. VisJudge is based on Qwen2.5-VL-7B-Instruct and trained specifically for visualization quality assessment across the Fidelity-Expressiveness-Aesthetics dimensions.

🤗 Model Repository: https://huggingface.co/xypkent/visjudge-7b

Performance Highlights

  • 🎯 19.8% MAE improvement over GPT-5 (0.442 vs 0.551)
  • 📈 58.7% higher correlation with human experts compared to GPT-5 (0.681 vs 0.429)
  • 🏅 Outperforms all commercial MLLMs across all metrics on visualization assessment tasks

Quick Start

from transformers import Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
import torch

# Load the base model and attach the VisJudge LoRA weights.
# Note: Qwen2.5-VL checkpoints load with the Qwen2_5_VL* classes in recent transformers releases.
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

model = PeftModel.from_pretrained(base_model, "xypkent/visjudge-7b")
model.eval()

# Use the model for visualization quality assessment
# See the model repository for detailed usage examples
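
To actually score a visualization, pass the image together with the evaluation prompt stored in the corresponding dataset entry. The snippet below is a minimal inference sketch: the image path, prompt handling, and generation settings are illustrative, and the model repository documents the recommended usage.

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Illustrative inputs: an image from the dataset and its evaluation prompt.
image = Image.open("images/dashboard/example.png")
prompt = "..."  # use the `prompt` field of the matching VisJudgeBench.json entry

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # the model's six-dimension assessment and overall score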

📝 Citation

If you find VisJudgeBench useful for your research, please cite our paper:

@misc{xie2025visjudge,
      title={VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations}, 
      author={Yupeng Xie and Zhiyang Zhang and Yifan Wu and Sirong Lu and Jiayi Zhang and Zhaoyang Yu and Jinlin Wang and Sirui Hong and Bang Liu and Chenglin Wu and Yuyu Luo},
      year={2025},
      eprint={2510.22373},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22373}, 
}

Paper: https://arxiv.org/abs/2510.22373


For questions or feedback, please contact: [email protected]
