
Vision-Language-Guided Concept Bottleneck Model (VLG-CBM)

This is the official repository for our paper VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance, NeurIPS 2024. [Project Website] [Paper]

  • VLG-CBM provides a novel method to train Concept Bottleneck Models (CBMs) with guidance from both the vision and language domains.
  • VLG-CBM provides concise and accurate concept attributions for the model's decisions. The figure below compares the decision explanations of VLG-CBM with those of existing methods by listing the top five concept contributions for each decision.

[Update July 2025] We have released a new tool for ANEC evaluation! To measure ANEC on your own model, please check out this repository: simply save your model's outputs and run a single command to obtain the ANEC results.

Decision Explanation

Table of Contents

  • Setup
  • Quick Start
  • Training
  • Results
  • Explainable Decisions
  • Sources
  • Cite this work

Setup

  1. Set up the conda environment and install the dependencies:
  conda create -n vlg-cbm python=3.12
  conda activate vlg-cbm
  pip install -r requirements.txt
  2. (Optional) Install Grounding DINO for generating annotations on custom datasets:
  git clone https://github.com/IDEA-Research/GroundingDINO
  cd GroundingDINO
  pip install -e .
  wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
  cd ..

Quick Start

We provide scripts to download and evaluate pretrained models for CIFAR10, CIFAR100, CUB200, Places365, and ImageNet. To quickly evaluate the pretrained models, follow the steps below:

  1. Download pretrained models from here, unzip them, and place them in the saved_models folder.
  2. Run the evaluation script to evaluate the pretrained models under different numbers of effective concepts (NEC) and obtain the Accuracy at NEC (ANEC) for each dataset:
python sparse_evaluation.py --load_path <path-to-model-dir>

For example, to evaluate the pretrained model for CUB200, run

python sparse_evaluation.py --load_path saved_models/cub

Training

Overview

VLG-CBM Overview

Annotation Generation (Optional)

To train VLG-CBM, images must first be annotated with concepts by a vision-language model; this work uses Grounding DINO for annotation generation. Use the following command to generate annotations for a dataset:

python -m scripts.generate_annotations --dataset <dataset-name> --device cuda --batch_size 32 --text_threshold 0.15 --output_dir annotations

Note: Supported datasets include cifar10, cifar100, cub, places365, and imagenet. The generated annotations will be saved under the annotations folder.
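For example, to generate annotations for the CUB dataset, run the command below. The batch size and text threshold simply repeat the values shown above and may need adjusting for your hardware and dataset:

python -m scripts.generate_annotations --dataset cub --device cuda --batch_size 32 --text_threshold 0.15 --output_dir annotations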

Training Pipeline

  1. Download the annotated data from here, unzip it, and place it in the annotations folder, or generate the annotations with Grounding DINO as described in the previous section.

  2. All datasets must be placed in a single folder specified by the environment variable $DATASET_FOLDER. By default, $DATASET_FOLDER is set to datasets.

Note: To download and process the CUB dataset, run bash download_cub.sh and move the resulting folder under $DATASET_FOLDER. To use ImageNet, you need to download the dataset yourself and place it under $DATASET_FOLDER. The other datasets can be downloaded automatically by Torchvision. An end-to-end example for CUB is sketched after this list.

  3. Train a concept bottleneck model using the config files in ./configs. For instance, to train a CUB model, run the following command:
  python train_cbm.py --config configs/cub.json --annotation_dir annotations
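
Putting these steps together, a minimal end-to-end run for CUB might look like the sketch below. It assumes the annotations have already been downloaded or generated into the annotations folder and uses the default datasets folder:

export DATASET_FOLDER=datasets   # folder that holds all datasets (datasets is the default)
bash download_cub.sh             # downloads and processes CUB; move the resulting folder under $DATASET_FOLDER
python train_cbm.py --config configs/cub.json --annotation_dir annotations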

Evaluate trained models

The Number of Effective Concepts (NEC) needs to be controlled to enable a fair comparison of model performance. To evaluate a trained model under different NEC values, run the following command:

python sparse_evaluation.py --load_path <path-to-model-dir> --lam <lambda-value>
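
For example, to evaluate the CUB model from the Quick Start under a specific sparsity penalty, you could run the command below; the lambda value is only an illustrative placeholder, chosen for demonstration rather than taken from the paper:

python sparse_evaluation.py --load_path saved_models/cub --lam 0.0007   # lambda value is illustrative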

Results

Accuracy at NEC=5 (ANEC-5) for non-CLIP backbone models

Method           CIFAR10   CIFAR100   CUB200   Places365   ImageNet
Random           67.55%    29.52%     68.91%   17.57%      41.49%
LF-CBM           84.05%    56.52%     53.51%   37.65%      60.30%
LM4CV            53.72%    14.64%     N/A      N/A         N/A
LaBo             78.69%    44.82%     N/A      N/A         N/A
VLG-CBM (Ours)   88.55%    65.73%     75.79%   41.92%      73.15%

Accuracy at NEC=5 (ANEC-5) for CLIP backbone models

Method           CIFAR10   CIFAR100   ImageNet   CUB
Random           67.55%    29.52%     18.04%     25.37%
LF-CBM           84.05%    56.52%     52.88%     31.35%
LM4CV            53.72%    14.64%     3.77%      3.63%
LaBo             78.69%    44.82%     24.27%     41.97%
VLG-CBM (Ours)   88.55%    65.73%     59.74%     60.38%

Explainable Decisions

Visualization of activated images

Sources

Cite this work

If you find this work useful, please consider citing:

@inproceedings{srivastava2024vlg,
        title={VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance},
        author={Srivastava, Divyansh and Yan, Ge and Weng, Tsui-Wei},
        booktitle={NeurIPS},
        year={2024}
}
