
Mimic In-Context Learning for Multimodal Tasks


overview figure

Mimic In-Context Learning (MimIC) is a novel framework for adapting vision-language models by approximating the shift effect of in-context demonstrations. By integrating lightweight learnable modules into the models, it achieves superior performance compared to previous shift-vector methods and LoRA.

Setup

1. Create environment

The following commands build the environment for testing idefics1 and idefics2.

conda create -y -n mimic python=3.10
conda activate mimic
pip install -r requirements.txt

2. Specify the root path of your models and datasets in src/paths.py

For models, we currently support idefics1, idefics2 and llava-next-interleave. For datasets, VQAv2, OK-VQA, COCO, flickr30k, MME and SEED-bench are available.
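
A minimal sketch of what src/paths.py might contain is shown below; the variable names and paths are placeholders (assumptions), so follow the existing entries in that file rather than this sketch.

# src/paths.py -- illustrative sketch only; variable names are assumptions.
idefics1_path = "/path/to/idefics1"
idefics2_path = "/path/to/idefics2"
llava_next_interleave_path = "/path/to/llava-next-interleave"

vqav2_path = "/path/to/VQAv2"
okvqa_path = "/path/to/OK-VQA"
coco_path = "/path/to/COCO"
# ...and likewise for flickr30k, MME, and SEED-Bench.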

How to Run

cd ./script
# select a bash file to run
bash run_*.sh 

Code Reading Guides

We would like to introduce some key files to help you understand how MimIC works.

In shift_encoder.py, we implement the MimIC attention heads (AttnApproximator) and another shift-vector method, LIVE (AttnFFNShift).

As mentioned in the paper, self-attention layers are substituted with MimIC attention heads. This integration is achieved by replacing the forward methods of those self-attention layers (see *_attn_forward and register_shift_hooks). For example, in idefics_attn_forward, we shift the regular attention output based on the keys and queries of idefics.
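
To make the mechanism concrete, below is a minimal, hypothetical sketch of the forward-replacement idea; it is not the repository's *_attn_forward code, and the helper name register_shift_forward is made up for illustration.

# Hypothetical sketch: wrap a self-attention module's forward so its output is
# post-processed by a shift module (e.g., something exposing do_shift).
import types
import torch.nn as nn

def register_shift_forward(attn_module: nn.Module, shift_module: nn.Module):
    original_forward = attn_module.forward  # keep the bound original forward

    def shifted_forward(self, hidden_states, *args, **kwargs):
        outputs = original_forward(hidden_states, *args, **kwargs)
        attn_output = outputs[0] if isinstance(outputs, tuple) else outputs
        # Approximate the effect of in-context demonstrations on the output.
        shifted = shift_module.do_shift(hidden_states, attn_output)
        if isinstance(outputs, tuple):
            return (shifted,) + outputs[1:]
        return shifted

    attn_module.forward = types.MethodType(shifted_forward, attn_module)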

In do_shift of AttnApproximator, we implement $f(\cdot)$ and $\boldsymbol{v}$ to approximate the terms affected by in-context demonstrations (Section 3.2).
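
For intuition only, the shift can be viewed as blending the query's regular attention output with a learnable vector $\boldsymbol{v}$, weighted by a query-dependent gate playing the role of $f(\cdot)$. The toy module below is an assumption-laden sketch, not the actual parameterization of AttnApproximator.

# Toy sketch of a do_shift-style module; the real AttnApproximator may
# parameterize f(.) and v differently (see Section 3.2 of the paper).
import torch
import torch.nn as nn

class ToyAttnApproximator(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_size))  # shift vector v
        self.gate = nn.Linear(hidden_size, 1)            # query-dependent f(.)

    def do_shift(self, query_states: torch.Tensor, attn_output: torch.Tensor):
        lam = torch.sigmoid(self.gate(query_states))     # weight in (0, 1)
        return (1.0 - lam) * attn_output + lam * self.v  # blended output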

In shift_model.py, we implement the training framework of MimIC, as illustrated in Figure 3. ShiftModel feeds contexts prepared by data_module.py to the model and calculates losses depending on model_strategy, which describes which types of losses should be computed. For example, MimIC uses Strategy.LAYER_WISE_MSE and Strategy.LM_LOSS, which stand for $L_{\text{align}}$ and $L_{\text{gt}}$ (Eq. 6), respectively. To train LIVE, Strategy.LOGITS_KL_DIV and Strategy.LM_LOSS are used. For LoRA, only Strategy.LM_LOSS is applied.
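
The sketch below shows how such strategy-driven loss composition can work. The Strategy names match the ones above, but compute_losses, beta, and the argument names are hypothetical, not the repository's actual API.

# Illustrative strategy-driven loss composition (hypothetical helper).
from enum import Flag, auto
import torch
import torch.nn.functional as F

class Strategy(Flag):
    LM_LOSS = auto()         # L_gt: language-modeling loss on the answer
    LAYER_WISE_MSE = auto()  # L_align: per-layer MSE between H and H'
    LOGITS_KL_DIV = auto()   # KL between shifted and in-context logits (LIVE)

def compute_losses(strategy, lm_loss, shifted_hiddens=None, icl_hiddens=None,
                   shifted_logits=None, icl_logits=None, beta=1.0):
    total = torch.zeros(())
    if Strategy.LM_LOSS in strategy:
        total = total + lm_loss
    if Strategy.LAYER_WISE_MSE in strategy:
        align = sum(F.mse_loss(h, hp) for h, hp in zip(shifted_hiddens, icl_hiddens))
        total = total + beta * align / len(shifted_hiddens)
    if Strategy.LOGITS_KL_DIV in strategy:
        total = total + F.kl_div(F.log_softmax(shifted_logits, dim=-1),
                                 F.softmax(icl_logits, dim=-1), reduction="batchmean")
    return total

# MimIC: Strategy.LAYER_WISE_MSE | Strategy.LM_LOSS
# LIVE:  Strategy.LOGITS_KL_DIV | Strategy.LM_LOSS
# LoRA:  Strategy.LM_LOSS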

We first feed the in-context demonstrations and the query into the model to capture the hidden states $H^\prime$ from all layers. This is done with forward hooks; see register_record_hook in shift_encoder.py for details. Then we enable shift_hook (introduced in the previous section) and feed only the query to the model to obtain the shifted hidden states $H$. Finally, the layer-wise alignment loss is computed between these hidden states.
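
The recording pass uses standard PyTorch forward hooks; register_record_hook in shift_encoder.py implements this idea for the supported models. The helper below is a simplified sketch, not the repository's function.

# Simplified sketch of recording per-layer hidden states with forward hooks.
from torch import nn

def record_layer_outputs(layers: list[nn.Module]):
    """Attach forward hooks that collect each layer's output, in layer order."""
    records, handles = [], []

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        records.append(hidden)

    for layer in layers:
        handles.append(layer.register_forward_hook(hook))
    return records, handles

# Pass 1: feed demonstrations + query to collect H' (one entry per layer).
# Pass 2: clear the records, enable the shift hooks, feed only the query to get H.
# The layer-wise alignment loss is then computed between H and H' on the query positions.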

Customization

Customize new datasets

  1. Add your dataset path to src/paths.py first.
  2. Create a new Python script in src/dataset_utils.
  3. Create a new class named Dataset that inherits from src.dataset_utils.iterface.DatasetBase.
  4. Implement all abstract methods and the required attributes (see the docstring of DatasetBase).

You can then use the -d option to specify the new dataset in the run_* bash scripts.
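
A rough skeleton of such a script is shown below; the module name my_dataset.py is hypothetical, and the abstract methods are deliberately not guessed here, so follow the DatasetBase docstring for the real interface.

# src/dataset_utils/my_dataset.py -- illustrative skeleton only.
from src.dataset_utils.iterface import DatasetBase

class Dataset(DatasetBase):
    # Implement the abstract methods and required attributes described in the
    # DatasetBase docstring.
    ...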

Customize new model

This can be a bit more complicated.

  0. Add your model path to src/paths.py first.
  1. Create your new model in testbed/models, following the ICLTestbed guides here.
  2. Specify how to load the model in build_models in src/utils.py.
  3. Globally search for idefics in shift_model.py and implement the corresponding methods.
  4. Determine how many epochs to run and when to save checkpoints in src/train.py.

Recommended Citation

@InProceedings{Jiang_2025_CVPR,
    author    = {Jiang, Yuchu and Fu, Jiale and Hao, Chenduo and Hu, Xinting and Peng, Yingzhe and Geng, Xin and Yang, Xu},
    title     = {Mimic In-Context Learning for Multimodal Tasks},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29825-29835}
}
