Cheng Lei, Jie Fan, Xinran Li, Tianzhu Xiang, Ao Li, Ce Zhu, Le Zhang
University of Electronic Science and Technology of China; Space42, UAE
Abstract: Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, we incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance.
Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta}^w$ scores of 72.9\% on CAMO and 71.7\% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, which indicates its potential for a range of segmentation tasks. The source code will be made available at https://github.com/AVC2-UESTC/ZSCOS-CaMF.
Refer to the Quick Start guide for a fast setup, or follow the detailed instructions below for a step-by-step configuration.
The code requires python>=3.9, as well as pytorch>=2.0.0. Please follow the instructions here to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
Please install MMCV following the instructions here.
Please install xFormers following the instructions here.
Please install the following dependencies:
pip install -r requirements.txt
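Once the dependencies are installed, a quick sanity check can confirm that the versions meet the minimums stated above. The following is a minimal sketch and not part of the repository: the `check_environment` and `parse_version` helpers are hypothetical names, and it assumes the packages are installed under the distribution names `torch`, `torchvision`, `mmcv`, and `xformers`.

```python
import re
import sys
from importlib import metadata

# Minimums stated above; None means no minimum is stated for that package.
MIN_PYTHON = (3, 9)
REQUIRED = {"torch": (2, 0, 0), "torchvision": None, "mmcv": None, "xformers": None}


def parse_version(text):
    """Turn a string such as '2.4.1+cu118' into a tuple such as (2, 4, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment(python_version=None, lookup=metadata.version):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    py = tuple(python_version or sys.version_info[:2])
    if py < MIN_PYTHON:
        problems.append(
            f"Python {py[0]}.{py[1]} found, need >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    for name, minimum in REQUIRED.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if minimum is not None and parse_version(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} found, need >= {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

The check only reads installed package metadata, so it does not confirm that CUDA is usable; it is meant as a first pass before running the training or testing scripts.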
You can download the pretrained weights eva02_L_pt_m38m_p14to16.pt from EVA02 or here.
Run the following command to convert the PyTorch weights to the format used in this repository.
python convert_pt_weights.py
For training, put the converted weights in the model_weights folder.
| Method | Dataset | Weights | Configs |
|---|---|---|---|
| CAMF-ZS | DUTS | camf_duts.pth | config |
| CAMF-S | CAMO+COD10K | camf_cod.pth | |
For testing, put the pretrained weights and fine-tuned weights in the model_weights folder.
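Before testing, it can help to confirm that the expected checkpoints are actually in place. The snippet below is a small sketch under stated assumptions: `missing_weights` is a hypothetical helper that is not part of the repository, the `model_weights` folder is assumed to sit in the current working directory, and the file names are the fine-tuned checkpoints listed in the table above.

```python
from pathlib import Path

# File names taken from the table above; adjust to the checkpoints you use.
FINE_TUNED = ("camf_duts.pth", "camf_cod.pth")


def missing_weights(folder="model_weights", expected=FINE_TUNED):
    """Return the expected file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all weights found")
```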
The following datasets are used in this paper:
Make sure CUDA 11.8 is installed in your virtual environment. Linux is recommended.
Install PyTorch
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
Install xFormers
pip install xformers==0.0.22 --index-url https://download.pytorch.org/whl/cu118
# test installation (optional)
python -m xformers.info
Install MMCV
pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.4/index.html
Other dependencies
pip install -r requirements.txt
We follow the ADE20K dataset format. Organize your dataset files as follows:
./datasets/dataset_name/
├── images/
│   ├── training/      # Put training images here
│   └── validation/    # Put validation images here
└── annotations/
    ├── training/      # Put training segmentation maps here
    └── validation/    # Put validation segmentation maps here
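The layout above can be checked with a short script before training or testing. This is a sketch rather than a repository utility: `check_layout` is a hypothetical helper, and it assumes that each image and its segmentation map share the same file stem (for example `a.jpg` and `a.png`), as in ADE20K.

```python
from pathlib import Path

# Sub-folders required by the layout described above.
KINDS = ("images", "annotations")
SPLITS = ("training", "validation")


def _stems(folder):
    """Collect file stems in `folder`, or an empty set if it does not exist."""
    if not folder.is_dir():
        return set()
    return {p.stem for p in folder.iterdir() if p.is_file()}


def check_layout(dataset_root):
    """Return a list of problems found under `dataset_root`."""
    root = Path(dataset_root)
    problems = []
    for kind in KINDS:
        for split in SPLITS:
            if not (root / kind / split).is_dir():
                problems.append(f"missing folder: {kind}/{split}")
    for split in SPLITS:
        images = _stems(root / "images" / split)
        maps = _stems(root / "annotations" / split)
        # Assumption: an image and its map share the same stem.
        for stem in sorted(images - maps):
            problems.append(f"{split}: no segmentation map for image '{stem}'")
    return problems


if __name__ == "__main__":
    for line in check_layout("./datasets/dataset_name") or ["layout looks OK"]:
        print(line)
```

Replace `dataset_name` with the folder of the dataset being prepared; an empty result means every required folder exists and every image has a matching map.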
Put the model weights into the model_weights folder, and run the following command to test the model.
python test.py
Preparing
If you want to debug the code, check train_debug.py and test_debug.py.
If you find the code helpful in your research or work, please cite the following paper:
@article{lei2024towards,
title={Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations},
author={Lei, Cheng and Fan, Jie and Li, Xinran and Xiang, Tianzhu and Li, Ao and Zhu, Ce and Zhang, Le},
journal={arXiv preprint arXiv:2410.16953},
year={2024}
}
This project is based on MMCV, timm, EVA02, MAM, and EVP. We thank the authors for their valuable contributions.
