
Mimic In-Context Learning for Multimodal Tasks


overview figure

Mimic In-Context Learning (MimIC) is a novel framework for adapting vision-language models by approximating the shift effect of in-context demonstrations. By integrating lightweight learnable modules into the models, it achieves superior performance compared to previous shift-vector methods and LoRA.

Setup

1. Create environment

The following commands build the environment for testing idefics1 and idefics2.

conda create -y -n mimic python=3.10
conda activate mimic
pip install -r requirements.txt

2. Specify the root path of your models and datasets in src/paths.py

For models, we currently support idefics1, idefics2 and llava-next-interleave. For datasets, VQAv2, OK-VQA, COCO, flickr30k, MME and SEED-bench are available.
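
A minimal sketch of what src/paths.py might contain is shown below; the variable names and paths are placeholders (assumptions), so follow the existing entries in that file rather than this sketch.

# src/paths.py -- illustrative sketch only; variable names are assumptions.
idefics1_path = "/path/to/idefics1"
idefics2_path = "/path/to/idefics2"
llava_next_interleave_path = "/path/to/llava-next-interleave"

vqav2_path = "/path/to/VQAv2"
okvqa_path = "/path/to/OK-VQA"
coco_path = "/path/to/COCO"
# ...and likewise for flickr30k, MME, and SEED-Bench.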

How to Run

cd ./script
# select a bash file to run
bash run_*.sh 

Code Reading Guides

We would like to introduce some key files to help you understand how MimIC works.

In shift_encoder.py, we implement the MimIC attention heads (AttnApproximator) and another shift-vector method, LIVE (AttnFFNShift).

As mentioned in the paper, self-attention layers are substituted with MimIC attention heads. This integration is achieved by replacing the forward methods of those self-attention layers (see *_attn_forward and register_shift_hooks). For example, in idefics_attn_forward, we shift the regular attention output based on the keys and queries of idefics.
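
To make the mechanism concrete, below is a minimal, hypothetical sketch of the forward-replacement idea; it is not the repository's *_attn_forward code, and the helper name register_shift_forward is made up for illustration.

# Hypothetical sketch: wrap a self-attention module's forward so its output is
# post-processed by a shift module (e.g., something exposing do_shift).
import types
import torch.nn as nn

def register_shift_forward(attn_module: nn.Module, shift_module: nn.Module):
    original_forward = attn_module.forward  # keep the bound original forward

    def shifted_forward(self, hidden_states, *args, **kwargs):
        outputs = original_forward(hidden_states, *args, **kwargs)
        attn_output = outputs[0] if isinstance(outputs, tuple) else outputs
        # Approximate the effect of in-context demonstrations on the output.
        shifted = shift_module.do_shift(hidden_states, attn_output)
        if isinstance(outputs, tuple):
            return (shifted,) + outputs[1:]
        return shifted

    attn_module.forward = types.MethodType(shifted_forward, attn_module)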

In do_shift of AttnApproximator, we implement $f(\cdot)$ and $\boldsymbol{v}$ to approximate the terms affected by in-context demonstrations (Section 3.2).
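
For intuition only, the shift can be viewed as blending the query's regular attention output with a learnable vector $\boldsymbol{v}$, weighted by a query-dependent gate playing the role of $f(\cdot)$. The toy module below is an assumption-laden sketch, not the actual parameterization of AttnApproximator.

# Toy sketch of a do_shift-style module; the real AttnApproximator may
# parameterize f(.) and v differently (see Section 3.2 of the paper).
import torch
import torch.nn as nn

class ToyAttnApproximator(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_size))  # shift vector v
        self.gate = nn.Linear(hidden_size, 1)            # query-dependent f(.)

    def do_shift(self, query_states: torch.Tensor, attn_output: torch.Tensor):
        lam = torch.sigmoid(self.gate(query_states))     # weight in (0, 1)
        return (1.0 - lam) * attn_output + lam * self.v  # blended output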

In shift_model.py, we implement the training framework of MimIC, as illustrated in Figure 3. ShiftModel feeds contexts prepared by data_module.py to the model and calculates losses depending on model_strategy, which describes which types of losses should be computed. For example, MimIC uses Strategy.LAYER_WISE_MSE and Strategy.LM_LOSS, which stand for $L_{\text{align}}$ and $L_{\text{gt}}$ (Eq. 6), respectively. To train LIVE, Strategy.LOGITS_KL_DIV and Strategy.LM_LOSS are used. For LoRA, only Strategy.LM_LOSS is applied.
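
The sketch below shows how such strategy-driven loss composition can work. The Strategy names match the ones above, but compute_losses, beta, and the argument names are hypothetical, not the repository's actual API.

# Illustrative strategy-driven loss composition (hypothetical helper).
from enum import Flag, auto
import torch
import torch.nn.functional as F

class Strategy(Flag):
    LM_LOSS = auto()         # L_gt: language-modeling loss on the answer
    LAYER_WISE_MSE = auto()  # L_align: per-layer MSE between H and H'
    LOGITS_KL_DIV = auto()   # KL between shifted and in-context logits (LIVE)

def compute_losses(strategy, lm_loss, shifted_hiddens=None, icl_hiddens=None,
                   shifted_logits=None, icl_logits=None, beta=1.0):
    total = torch.zeros(())
    if Strategy.LM_LOSS in strategy:
        total = total + lm_loss
    if Strategy.LAYER_WISE_MSE in strategy:
        align = sum(F.mse_loss(h, hp) for h, hp in zip(shifted_hiddens, icl_hiddens))
        total = total + beta * align / len(shifted_hiddens)
    if Strategy.LOGITS_KL_DIV in strategy:
        total = total + F.kl_div(F.log_softmax(shifted_logits, dim=-1),
                                 F.softmax(icl_logits, dim=-1), reduction="batchmean")
    return total

# MimIC: Strategy.LAYER_WISE_MSE | Strategy.LM_LOSS
# LIVE:  Strategy.LOGITS_KL_DIV | Strategy.LM_LOSS
# LoRA:  Strategy.LM_LOSS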

We first feed the in-context demonstrations and the query into the model to capture the hidden states $H^\prime$ from all layers. This is done with forward hooks; see register_record_hook in shift_encoder.py for details. Then we enable shift_hook (introduced in the previous section) and feed only the query to the model to obtain the shifted hidden states $H$. Finally, the layer-wise alignment loss is computed between these hidden states.
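
The recording pass uses standard PyTorch forward hooks; register_record_hook in shift_encoder.py implements this idea for the supported models. The helper below is a simplified sketch, not the repository's function.

# Simplified sketch of recording per-layer hidden states with forward hooks.
from torch import nn

def record_layer_outputs(layers: list[nn.Module]):
    """Attach forward hooks that collect each layer's output, in layer order."""
    records, handles = [], []

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        records.append(hidden)

    for layer in layers:
        handles.append(layer.register_forward_hook(hook))
    return records, handles

# Pass 1: feed demonstrations + query to collect H' (one entry per layer).
# Pass 2: clear the records, enable the shift hooks, feed only the query to get H.
# The layer-wise alignment loss is then computed between H and H' on the query positions.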

Customization

Customize new datasets

  1. Add your dataset path to src/paths.py first.
  2. Create a new Python script in src/dataset_utils.
  3. Create a new class named Dataset that inherits from src.dataset_utils.iterface.DatasetBase.
  4. Implement all abstract methods and the required attributes (see the docstring of DatasetBase).

You can then use the -d option to specify the new dataset in the run_* bash scripts.
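
A rough skeleton of such a script is shown below; the module name my_dataset.py is hypothetical, and the abstract methods are deliberately not guessed here, so follow the DatasetBase docstring for the real interface.

# src/dataset_utils/my_dataset.py -- illustrative skeleton only.
from src.dataset_utils.iterface import DatasetBase

class Dataset(DatasetBase):
    # Implement the abstract methods and required attributes described in the
    # DatasetBase docstring.
    ...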

Customize new model

This can be a bit more complicated.

  0. Add your model path to src/paths.py first.
  1. Create your new model in testbed/models, following the ICLTestbed guides here.
  2. Specify how to load the model in build_models in src/utils.py.
  3. Globally search for idefics in shift_model.py and implement the corresponding methods.
  4. Determine how many epochs to run and when to save checkpoints in src/train.py.

Recommended Citation

@InProceedings{Jiang_2025_CVPR,
    author    = {Jiang, Yuchu and Fu, Jiale and Hao, Chenduo and Hu, Xinting and Peng, Yingzhe and Geng, Xin and Yang, Xu},
    title     = {Mimic In-Context Learning for Multimodal Tasks},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29825-29835}
}
