This repository provides an end-to-end pipeline for fine-tuning and evaluating the LLaVA (Large Language and Vision Assistant) model on VQA-RAD, a medical visual question-answering dataset. The framework includes dataset handling, model fine-tuning using LoRA, and evaluation with custom metrics.
```
LLAVA/
├── logslurms/                       # Logs for training and evaluation runs
├── notebooks/
│   └── training_analysis.ipynb     # Jupyter notebook for analyzing training logs
├── src/
│   ├── data_utils.py               # Handles dataset loading and preprocessing
│   ├── model_utils.py              # Functions for loading, modifying, and configuring models
│   ├── fine_tuning.py              # LoRA fine-tuning pipeline
│   ├── evaluation_base.py          # Base classes for evaluation
│   ├── evaluation.py               # Concrete evaluation implementation
│   ├── callbacks.py                # Custom training callbacks for monitoring training loss
│   └── run_evaluation.py           # Script to run evaluation and generate reports
├── terminal_results/               # Evaluation results
│   ├── evaluation_finetuned.json   # Evaluation results for the fine-tuned model
│   └── evaluation_pretrained.json  # Evaluation results for the pretrained model
└── README.md                       # Documentation
```
Ensure you have Python 3.8+ and install the required packages:

```bash
pip install -r requirements.txt
```

The default model is `llava-hf/llava-1.5-7b-hf`. If necessary, modify `MODEL_NAME` in `fine_tuning.py` accordingly.
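For reference, the setting in `fine_tuning.py` looks roughly like this (a minimal sketch; the surrounding code may differ):

```python
# fine_tuning.py: swap in another LLaVA checkpoint here if needed.
MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
```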
To fine-tune LLaVA on VQA-RAD using LoRA, run:

```bash
python src/fine_tuning.py
```

This script:

- Loads the VQA-RAD dataset (`data_utils.py`)
- Loads and modifies the LLaVA model (`model_utils.py`)
- Applies LoRA for efficient fine-tuning
- Trains the model using `Trainer` from Hugging Face
- Logs losses using `LossLoggerCallback` (`callbacks.py`)
- Saves the fine-tuned model
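These steps might be wired together roughly as follows. This is a sketch under assumptions, not the repository's exact code: `load_model`, `apply_lora`, and `load_vqa_rad` stand in for the helpers described below, and their signatures are illustrative.

```python
# Sketch of the fine_tuning.py flow; helper signatures are assumptions.
from transformers import Trainer, TrainingArguments
from data_utils import load_vqa_rad, CustomDataCollator
from model_utils import load_model, apply_lora
from callbacks import LossLoggerCallback

model, processor = load_model("llava-hf/llava-1.5-7b-hf")
model = apply_lora(model)                    # attach LoRA adapters
train_ds, eval_ds = load_vqa_rad(processor)  # preprocessed VQA-RAD splits

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", fp16=True),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=CustomDataCollator(processor),
    callbacks=[LossLoggerCallback()],
)
trainer.train()
trainer.save_model("checkpoints/final")
```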
Run evaluation using:

```bash
python src/run_evaluation.py
```

This script:

- Loads the fine-tuned model
- Generates predictions on test data
- Computes evaluation metrics using `evaluation.py`
- Saves results to `evaluation_finetuned.json`
To compare with the pretrained model:

```bash
python src/run_evaluation.py --pretrained
```

Results are stored in `evaluation_pretrained.json`.
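One plausible way the `--pretrained` switch is handled (an assumption; the actual argument parsing in `run_evaluation.py` may differ):

```python
# Hypothetical flag handling in run_evaluation.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--pretrained", action="store_true",
                    help="Evaluate the base model instead of the fine-tuned one")
args = parser.parse_args()

output_path = ("terminal_results/evaluation_pretrained.json" if args.pretrained
               else "terminal_results/evaluation_finetuned.json")
```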
**`data_utils.py`**

- Loads the VQA-RAD dataset and processes images & text for LLaVA.
- Implements `CustomDataCollator` for efficient batch processing (sketched below).
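A minimal sketch of what such a collator can look like (field names like `question` and `image` are assumptions, not necessarily the repository's exact schema):

```python
# Illustrative collator in the spirit of CustomDataCollator.
class CustomDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = [ex["question"] for ex in examples]
        images = [ex["image"] for ex in examples]
        batch = self.processor(text=texts, images=images,
                               padding=True, return_tensors="pt")
        # Standard language-modeling setup: labels mirror the input ids.
        batch["labels"] = batch["input_ids"].clone()
        return batch
```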
**`model_utils.py`**

- Loads the LLaVA model and tokenizer.
- Applies LoRA (`apply_lora`) to fine-tune efficiently (sketched below).
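Assuming the standard PEFT recipe, `apply_lora` might look like this (rank, alpha, and target modules here are illustrative defaults, not the repository's actual values):

```python
# Illustrative apply_lora; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model

def apply_lora(model):
    config = LoraConfig(
        r=16,                                  # adapter rank
        lora_alpha=32,                         # scaling factor
        target_modules=["q_proj", "v_proj"],   # attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()         # sanity-check trainable fraction
    return model
```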
**`fine_tuning.py`**

- Defines `fine_tune_model()`, which handles the full fine-tuning process.
- Uses mixed-precision training (`fp16`) and LoRA adapters.
- Logs training loss using `LossLoggerCallback` (`callbacks.py`).
**`evaluation.py`**

- Implements `MedicalResponseEvaluator`, which computes clinical relevance metrics.
**`evaluation_base.py`**

- Defines `BaseEvaluator`, an abstract class for evaluation; `MedicalResponseEvaluator` extends it to implement custom evaluation logic (see the sketch below).
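The relationship between the two classes, in outline (method names and the placeholder metric are assumptions):

```python
# Sketch of the evaluator hierarchy.
from abc import ABC, abstractmethod

class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, predictions, references):
        """Return a dict mapping metric names to values."""

class MedicalResponseEvaluator(BaseEvaluator):
    def evaluate(self, predictions, references):
        # Placeholder metric; the real implementation scores clinical relevance.
        correct = sum(p.strip().lower() == r.strip().lower()
                      for p, r in zip(predictions, references))
        return {"correct_rate": correct / max(len(references), 1)}
```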
**`callbacks.py`**

- Implements `LossLoggerCallback` to track training loss and evaluation loss at each step (see the sketch below).
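A callback of this kind typically hooks into `Trainer` logging; a plausible shape (an assumption, not the exact implementation):

```python
# Sketch of LossLoggerCallback built on transformers' TrainerCallback.
from transformers import TrainerCallback

class LossLoggerCallback(TrainerCallback):
    def __init__(self):
        self.train_losses, self.eval_losses = [], []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        if "loss" in logs:        # training loss at this logging step
            self.train_losses.append((state.global_step, logs["loss"]))
        if "eval_loss" in logs:   # evaluation loss, when available
            self.eval_losses.append((state.global_step, logs["eval_loss"]))
```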
**`run_evaluation.py`**

- Loads the model and dataset.
- Generates predictions using `generate_predictions()` (sketched below).
- Computes evaluation metrics using `MedicalResponseEvaluator`.
- Saves results as a JSON file.
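`generate_predictions()` plausibly follows the usual generate-and-decode loop (a sketch under assumed field names, not the repository's exact code):

```python
# Illustrative prediction loop for run_evaluation.py.
import torch

def generate_predictions(model, processor, dataset, max_new_tokens=64):
    predictions = []
    model.eval()
    for example in dataset:
        inputs = processor(text=example["question"],
                           images=example["image"],
                           return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        predictions.append(processor.decode(output_ids[0],
                                            skip_special_tokens=True))
    return predictions
```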
The model was evaluated using an LLM-as-a-judge framework, an automated scoring system where a large language model (LLM) assesses the correctness and quality of responses. Initially, DeepSeek R1 was used as the evaluation model, but to align better with the specific domain of the fine-tuned model, MedAlpaca, a more clinically relevant LLM, was later adopted for scoring. This transition was crucial to ensure that evaluations reflected domain-specific understanding rather than generic LLM biases.
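The exact judge prompt is not reproduced here; the following illustrative template is consistent with the verdict categories and 1-5 confidence scale reported below, but it is an assumption rather than the prompt actually used:

```python
# Illustrative LLM-as-a-judge prompt template (assumed, not the actual one).
JUDGE_PROMPT = """You are a clinical expert judging a visual question-answering model.
Given the question, the reference answer, and the model's answer, output:
- verdict: CORRECT, INCORRECT, NEUTRAL, or CONTRADICTION
- confidence: an integer from 1 (low) to 5 (high)

Question: {question}
Reference answer: {reference}
Model answer: {prediction}
"""
```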
Below are the comparative results between the fine-tuned model and the pre-trained model:
| Metric | Pre-Trained Model | Fine-Tuned Model |
|---|---|---|
| total_responses | 451 | 451 |
| valid_responses | 426 | 407 |
| correct_rate | 0.51 | 0.47 |
| incorrect_rate | 0.42 | 0.48 |
| neutral_rate | 0.07 | 0.05 |
| contradiction_rate | 0.00 | 0.00 |
| invalid_format_rate | 0.06 | 0.10 |
| average_confidence | 4.89 | 4.76 |
| confidence_completeness | 1.00 | 1.00 |
- The correct rate dropped from 51% to 47%, suggesting that fine-tuning did not improve factual accuracy and may have caused degradation in some areas.
- The incorrect rate increased from 42% to 48%, which could indicate overfitting to the fine-tuning dataset, leading to poorer generalization.
- The neutral rate dropped from 7% to 5%, meaning the model provided fewer uncertain or ambiguous answers.
- The invalid format rate increased from 6% to 10%, suggesting that the fine-tuned model generated more responses that did not conform to the expected output structure.
- Average confidence decreased slightly from 4.89 to 4.76, implying a minor reduction in the model’s certainty when providing answers.
- Contradiction rate remained at 0, confirming that the fine-tuning process did not introduce logical inconsistencies into the model’s outputs.
While LLM-as-a-judge frameworks offer scalability and flexibility in model evaluation, they also have inherent limitations:
- Misalignment with fine-tuning objectives – The evaluation relied on an LLM-based scorer which, while a reasonable approach for assessing nuanced responses, may not fully capture improvements in reasoning or domain-specific adaptation.
- Potential biases in judgments – Even though MedAlpaca was selected for its clinical relevance, LLM judges can still exhibit biases, misinterpret context, or fail to correctly assess cases where multiple valid responses exist.
- Imperfect soft-matching – Traditional NLP metrics (e.g., ROUGE, BLEU) struggle to credit responses that are semantically correct but phrased differently. LLM-based judges attempt to address this but remain prone to errors, especially in complex or ambiguous cases.
While the evaluation suggests a decline in accuracy metrics, this does not necessarily indicate a failure of the fine-tuning process. Instead, it highlights the complexity of LLM adaptation, trade-offs in generalization vs. specialization, and the limitations of LLM-based evaluation.
Further analysis indicates that prompt engineering significantly influences results, with optimized prompting increasing the correct rate by 10%. Additionally, the observed degradation provides valuable insights for refining future fine-tuning strategies and ensuring alignment with real-world use cases.
Edit `training_args` in `fine_tuning.py`:

```python
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    save_strategy="epoch",
    logging_steps=100,
    evaluation_strategy="epoch",
)
```

To add custom metrics, modify `evaluate()` in `evaluation.py`.
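For example, a subclass could extend the returned metrics dict (the method signature and base metrics are assumptions; `avg_answer_words` is a hypothetical metric added for illustration):

```python
# Hypothetical custom metric layered on top of MedicalResponseEvaluator.
from evaluation import MedicalResponseEvaluator

class ExtendedEvaluator(MedicalResponseEvaluator):
    def evaluate(self, predictions, references):
        results = super().evaluate(predictions, references)
        # Custom metric: average answer length in words.
        results["avg_answer_words"] = (
            sum(len(p.split()) for p in predictions) / max(len(predictions), 1)
        )
        return results
```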
If you’d like to contribute:

- Fork the repo
- Create a new branch (e.g., `feature-xyz`)
- Commit changes and open a PR
This fine-tuned model is based on LLaVA-1.5, which in turn utilizes LLaMA 2. As such, it is subject to the LLaMA 2 Community License Agreement, which governs the use, distribution, and modification of the model.
Additionally, LLaVA-1.5 has been trained using GPT-4-generated multimodal instruction-following data, which may impose further restrictions, particularly on commercial usage. Users must review and comply with any applicable terms related to both LLaMA 2 and GPT-4 data usage.
- The fine-tuned model inherits the LLaMA 2 Community License and must be used in accordance with its terms.
- Any deployment or distribution of this model should acknowledge the original authors of LLaVA-1.5 and adhere to its licensing conditions.
- If you plan to use this model commercially, ensure compliance with both Meta’s LLaMA 2 license and OpenAI's policies regarding GPT-4-generated data.
For full details, please refer to the LLaMA 2 Community License Agreement and the LLaVA project's licensing documentation.
By using this fine-tuned model, you agree to abide by the terms and conditions set forth in the above licenses.