What happens when the input is messy—blurred labels, typos, occlusions, or color shifts? 🤔 CHAOS (CHart Analysis with Outlier Samples) is the first benchmark purposely designed to stress‑test MLLMs under realistic noise. We:
- evaluate 10 visual and 5 textual perturbations, each at three increasing severity levels (easy → mid → hard);
- span 112,500 perturbed charts (2,500 per perturbation × 3 severity levels × 15 perturbation types);
- introduce a Robustness Score that unifies vision‑ and text‑side degradations for apples‑to‑apples model comparison.
Our goal is simple: to measure how gracefully MLLMs fail (and, ideally, still succeed) when reality gets noisy, and to understand why.
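For intuition, a common way to build such a score is to average, over all perturbations and severity levels, the fraction of clean-chart accuracy a model retains. The sketch below is an illustration only, not the formula from the CHAOS paper; the function and its inputs (`clean_acc`, `perturbed_acc`) are hypothetical:

```python
def robustness_score(clean_acc: float, perturbed_acc: dict[str, list[float]]) -> float:
    """Average fraction of clean accuracy retained under perturbation.

    perturbed_acc maps each perturbation name to its accuracies at the
    three severity levels (easy, mid, hard). Higher is more robust;
    the score stays in [0, 1] as long as perturbations only hurt.
    """
    retained = [
        acc / clean_acc
        for levels in perturbed_acc.values()
        for acc in levels
    ]
    return sum(retained) / len(retained)

# Toy numbers: two perturbations, three severity levels each.
score = robustness_score(
    clean_acc=0.80,
    perturbed_acc={"blur": [0.75, 0.66, 0.52], "typos": [0.78, 0.71, 0.60]},
)
print(f"Robustness score: {score:.3f}")  # about 0.84 for these toy numbers
```

Normalizing by the clean accuracy keeps the score comparable across models with different clean baselines.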
Clone the repo with submodules:

```bash
git clone --recurse-submodules https://github.com/moured/CHAOS
cd CHAOS
```

Create the environment (Python 3.10 recommended):
```bash
conda create -n chaos python=3.10
conda activate chaos
```

Install dependencies (you can use a different torch version; in our case we experimented with torch==2.6.0):
```bash
cd VLMEvalKit
pip install -e .
pip install accelerate qwen-vl-utils
pip install flash-attn --no-build-isolation
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```

Copy the custom CHAOS dataset files:
```bash
cp ../custom_files/* ./vlmeval/dataset/
```
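Optionally, a quick sanity check that the files landed where VLMEvalKit expects them (a minimal sketch using only the standard library; it assumes you are still inside VLMEvalKit/ and that the copied file names contain "chaos"):

```python
# List the CHAOS files that were copied into VLMEvalKit's dataset package.
from pathlib import Path

dataset_dir = Path("vlmeval/dataset")  # relative to the VLMEvalKit directory
chaos_files = sorted(p.name for p in dataset_dir.iterdir() if "chaos" in p.name.lower())
print(chaos_files or "No CHAOS files found - re-check the copy step.")
```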
Run with a single GPU:

```bash
python run.py --data CHAOS_text --model Qwen2.5-VL-7B-Instruct --verbose
```
Run with multiple GPUs:
```bash
torchrun --nproc-per-node=4 run.py --data CHAOS_text --model Qwen2.5-VL-7B-Instruct --verbose
```

You can experiment with different models; please check the VLMEvalKit repository for the list of supported models.
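To see which model names you can pass to `--model`, recent versions of VLMEvalKit expose a registry dict in `vlmeval.config`; if your installed version differs, consult the VLMEvalKit README instead:

```python
# Print the model identifiers registered in the installed VLMEvalKit.
from vlmeval.config import supported_VLM

print(f"{len(supported_VLM)} models available")
for name in sorted(supported_VLM):
    print(name)
```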
TBD
TBD
