For our first improvement, we target the issue that the classification accuracy of a generative architecture like Flamingo lags behind contrastive models. To address this, we will implement an instance-matching technique outlined in "Enhancing the medical foundation model with multi-scale and cross-modality feature learning". To incorporate instance matching into the Med-Flamingo model, we begin by extracting features using the existing components of the model. The CLIP vision encoder generates visual features, which are converted into tokens via the perceiver resampler. Simultaneously, the LLaMA-7B model processes textual inputs through its cross-attention layers. These visual and textual features are then fused, typically through concatenation, to form a comprehensive representation of the input pair. To enable instance matching, a binary classification head is introduced, consisting of two linear layers that predict whether the input image-text pair is matched or not. The training process is extended with a binary cross-entropy loss tailored to train this classifier. The classifier and its loss are then integrated into the existing training loop of the Med-Flamingo model. By ensuring that gradients flow correctly and optimizing the instance-matching and generative objectives jointly, the model leverages the strengths of contrastive learning. This enhancement improves Med-Flamingo's performance on complex medical tasks, such as breast cancer diagnosis and analysis, by creating a more robust and reliable feature distribution that combines the power of contrastive learning with generative modeling. [1]
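
A minimal sketch of the instance-matching head described above, assuming illustrative feature dimensions and placeholder names (pooled_visual_tokens, pooled_text_embeddings, match_labels) for how Med-Flamingo exposes its pooled features; the real integration would hook into the model's actual attributes.

import torch
import torch.nn as nn

class InstanceMatchingHead(nn.Module):
    """Two linear layers that predict whether an (image, text) pair is matched."""
    def __init__(self, vis_dim=1024, txt_dim=4096, hidden_dim=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: matched vs. not matched
        )

    def forward(self, vis_feats, txt_feats):
        # Fuse the two modalities by concatenation, as described above.
        fused = torch.cat([vis_feats, txt_feats], dim=-1)
        return self.classifier(fused).squeeze(-1)

# Inside the existing training loop (names below are placeholders):
# match_logits = matching_head(pooled_visual_tokens, pooled_text_embeddings)
# itm_loss = nn.functional.binary_cross_entropy_with_logits(match_logits, match_labels)
# total_loss = generative_loss + itm_loss  # both objectives optimized jointly

The joint loss keeps the generative objective intact while the binary cross-entropy term pushes matched and mismatched pairs apart, which is where the contrastive-style benefit comes from.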
Secondly, we would like to address the issue of hallucinations and limited specialized medical information. We propose a Retrieval-Augmented Generation (RAG) approach that allows the model to incorporate relevant information that may lie outside its training data. To integrate the RAG framework into the Med-Flamingo model for breast cancer diagnosis, we enhance the model's text-generation process with relevant external medical information. First, we index specialized medical texts related to breast cancer in a vector database using FAISS. When the model receives an input image, the CLIP vision encoder extracts visual features, which are then projected through a task-specific projection layer. These projected visual features are used to query the FAISS index, retrieving the most relevant medical texts based on similarity. The retrieved texts are combined into a contextual input, which is tokenized and fed into the LLaMA-7B model. The model uses a gated cross-attention mechanism to incorporate the visual tokens as context while processing the retrieved text. This setup ensures that the generated diagnostic report is informed by both the visual features of the input image and the retrieved medical texts, reducing hallucinations and improving the accuracy of the specialized medical information in the output. This approach is inspired by "Contrastive X-ray-Report Pair Retrieval based Generation (CXR-RePaiR-Gen)". [5]
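
A minimal sketch of the FAISS retrieval step, under stated assumptions: the embedding dimension, the corpus embeddings (shown here as random placeholders), and the function and variable names are illustrative, and the projection layer is assumed to map the CLIP visual features into the same embedding space as the indexed texts.

import faiss
import numpy as np

embed_dim = 512
index = faiss.IndexFlatIP(embed_dim)  # inner-product index over normalized vectors

# Offline: embed and index the breast-cancer text corpus.
corpus_texts = ["...specialized breast-cancer passages..."]  # placeholder corpus
corpus_embeddings = np.random.rand(len(corpus_texts), embed_dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(corpus_embeddings)
index.add(corpus_embeddings)

def retrieve_context(projected_visual_feature, k=3):
    """Query the index with a projected visual feature and return the top-k texts joined as context."""
    query = projected_visual_feature.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)
    return " ".join(corpus_texts[i] for i in ids[0])

The returned context string would then be tokenized and passed to LLaMA-7B alongside the visual tokens through the gated cross-attention layers, as described above.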
Lastly, if time permits, we will address the issue of catastrophic forgetting. Vision-language models like Med-Flamingo often struggle with catastrophic forgetting when fine-tuned on new tasks, resulting in the loss of previously acquired knowledge. As described in "Learning without Forgetting for Vision-Language Models", while LoRA (Low-Rank Adaptation) introduces low-rank updates to the model weights and freezes most parameters to address this issue, it does not fully exploit cross-modal interactions or effectively retain task-specific knowledge. PROOF (Projection Fusion for VLMs) addresses these limitations by introducing projection layers that map pre-trained visual and textual features into a new space, creating expandable projections for new tasks while freezing old ones to preserve earlier learning. Additionally, PROOF employs self-attention mechanisms to fuse visual and textual information, enhancing the model's context-aware predictions. [2] To implement PROOF in Med-Flamingo for breast cancer diagnosis, we plan to freeze 70% of the model's layers, retaining the stability of the pre-trained knowledge, while allowing the remaining 30% to stay trainable for adaptation to the new task. Specifically, we will apply projection layers to the visual tokens from the perceiver resampler and the textual embeddings from the LLaMA-7B model. For this single task, we introduce new projection layers for both the visual and text features, keeping the previous layers frozen to retain learned knowledge. Finally, we use the projected visual tokens as context within the gated cross-attention layers of the LLaMA-7B model, enabling the model to integrate visual and textual information effectively for accurate breast cancer diagnosis. This approach allows Med-Flamingo to adapt to the specific task of breast cancer diagnosis without forgetting previously learned information, leveraging both stability and adaptability for enhanced performance.
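
A minimal sketch of the layer-freezing and projection setup planned above, assuming placeholder feature dimensions and that the model's transformer blocks are accessible as an ordered list; the exact 70/30 boundary and attribute names would be adjusted to Med-Flamingo's actual module layout.

import torch.nn as nn

def freeze_bottom_layers(layers, frozen_fraction=0.7):
    """Freeze the first frozen_fraction of layers; leave the remaining layers trainable."""
    cutoff = int(len(layers) * frozen_fraction)
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i >= cutoff  # False for frozen layers, True for the trainable tail

# New task-specific projections for the single breast-cancer task; the
# pre-trained backbone stays frozen so earlier knowledge is preserved.
visual_projection = nn.Linear(1024, 1024)   # applied to the perceiver-resampler visual tokens
text_projection = nn.Linear(4096, 4096)     # applied to the LLaMA-7B textual embeddings

# During fine-tuning, the projected visual tokens are supplied as context to the
# gated cross-attention layers of LLaMA-7B, as described in the plan above.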
In conclusion, we will not only apply state-of-the-art vision-language models to breast cancer detection and diagnosis, but also improve the general ability of vision-language models to specialize in medical problems.