Skip to content

JierunChen/Ref-L4

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ref-L4: a New REC Benchmark in the LMM Era

This repository contains the data loader and evaluation code for our Ref-L4, a new REC benchmark in the LMM era. Check out our paper Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models for more details.

🧭 Introduction

Referring expression comprehension (REC) is the task of locating a specific region in an image based on a textual description. Current popular REC benchmarks, such as RefCOCO, RefCOCO+, and RefCOCOg, may not fully capture the comprehensive capabilities of large multimodal models (LMMs) due to limitations in:

  • Linguistic complexity: Expressions tend to be overly simple, often consisting of single words like "yellow," "lady," or "far left person."
  • Visual complexity: The benchmarks are restricted to a limited set of object categories (fewer than 80).
  • Labeling accuracy: RefCOCO, RefCOCO+, and RefCOCOg have high labeling error rates (14%, 24%, and 5%, respectively).

In response, we rigorously exclude the problematic instances and reevaluating LMMs, showing significant accuracy improvements. We also introduce Ref-L4, a new and more coprehensive REC benchmark to evaluate LMMs. Ref-L4 offers:

  • Lengthy referring expressions averaging 24.2 words;
  • Extensive vocabulary comprising 22,813 unique words;
  • Diverse objects covering 365 categories with varying instance scales (from 30 to 3,767);
  • Substantial sample size with 45,341 annotations.

😃 Ref-L4 examples

😑 Labeling errors in RefCOCO, +, g

In the REC task, a referring expression should uniquely describe an instance, which is represented by an accurate bounding box. We have identified and visualized three common types of labeling errors in the RefCOCO, RefCOCO+, and RefCOCOg benchmarks: (a) non-unique referring expressions, which refer to multiple instances within the same image; (b) inaccurate bounding boxes; and (c) misalignment between target instances and their referring expressions, where the referring expressions are either ambiguous or do not refer to any instance in the image.

🛠️ Installation

git clone https://github.com/JierunChen/Ref-L4.git
cd Ref-L4
pip install ./

🚀 Loading Ref-L4 Dataset

from ref_l4 import RefL4Dataset
ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
print(len(ref_l4_dataset))
# iterate over the dataset
for img_pil, data in ref_l4_dataset:
    print(data)
    break

# swith to 'val' or 'test' split
ref_l4_dataset.change_split('val')
print(len(ref_l4_dataset))

The Ref-L4 dataset can also be downloaded from Hugging Face.

🗂️ Evaluation

After getting model predictions, we can save them as a json file, which is a list of dictionaries, each containing three keys: [id, pred_bbox, format]. The id is the annotation ID, and the format specifies the format of pred_bbox, which should be either xyxy or xywh. A sample prediction file the can be found here. It is obtained by running the CogVLM-Grounding model.

Now we can run the following command to evaluate the predictions with ground truth.

python evaluate_pred.py --dataset_path JierunChen/Ref-L4 --split all --pred_json_path ./demo_models/cogvlm_grounding_pred.json

Here is an example output for the predictions from CogVLM-Grounding on the "all" splits:

Item for split all                    | Value
--------------------------------------------------------------------------------
Ann-level acc iou 0.5                 | 81.69868331091065
Ann-level acc iou 0.75                | 70.76597340155709
Ann-level acc iou 0.9                 | 48.351381751615534
Ann-level macc iou 0.5:0.95           | 66.08808804393375
Ann-level accs for copy               | 81.7, 70.77, 48.35, 66.09
--------------------------------------------------------------------------------
Small acc iou 0.5                     | 75.0561797752809
Small macc iou 0.5:0.95               | 52.853932584269664
Medium acc iou 0.5                    | 86.43470790378007
Medium macc iou 0.5:0.95              | 71.31099656357388
Large acc iou 0.5                     | 77.90972003774772
Large macc iou 0.5:0.95               | 66.25196602705253
Size level accs for copy              | 75.06, 52.85, 86.43, 71.31, 77.91, 66.25
--------------------------------------------------------------------------------
Average class-level acc iou 0.5       | 72.42029130234654
Average class-level macc iou 0.5:0.95 | 52.56151912966669
Avg class-level accs for copy         | 72.42, 52.56

The output includes three parts:

  • The accuracy under various IoU thresholds and the average accuracy of IoU from 0.5 to 0.95 with a stride of 0.05.
  • The accuracy of small, medium, and large objects.
  • The average accuracy across different classes.

📖 Dataset License

The Ref-L4 dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Please note that the images in the Ref-L4 dataset are derived from the following datasets, each with their respective licenses:

By using the Ref-L4 dataset, you agree to comply with the licensing terms of these source datasets.

About

Evaluation code for Ref-L4, a new REC benchmark in the LMM era

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages