This repository provides an end-to-end pipeline for fine-tuning and evaluating the LLaVA (Large Language and Vision Assistant) model on VQA-RAD, a medical visual question-answering dataset. The framework includes dataset handling, model fine-tuning using LoRA, and evaluation with custom metrics.
```
LLAVA/
├── logslurms/                       # Logs for training and evaluation runs
├── notebooks/
│   └── training_analysis.ipynb     # Jupyter notebook for analyzing training logs
├── src/
│   ├── data_utils.py               # Handles dataset loading and preprocessing
│   ├── model_utils.py              # Functions for loading, modifying, and configuring models
│   ├── fine_tuning.py              # LoRA fine-tuning pipeline
│   ├── evaluation_base.py          # Base classes for evaluation
│   ├── evaluation.py               # Concrete evaluation implementation
│   ├── callbacks.py                # Custom training callbacks for monitoring training loss
│   └── run_evaluation.py           # Script to run evaluation and generate reports
├── terminal_results/               # Evaluation results
│   ├── evaluation_finetuned.json   # Evaluation results for the fine-tuned model
│   └── evaluation_pretrained.json  # Evaluation results for the pretrained model
└── README.md                       # Documentation
```
Ensure you have Python 3.8+ and install the required packages:

```bash
pip install -r requirements.txt
```

The default model is `llava-hf/llava-1.5-7b-hf`. If necessary, modify `MODEL_NAME` in `fine_tuning.py` accordingly.
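For reference, the setting in `fine_tuning.py` looks roughly like this (a minimal sketch; the surrounding code may differ):

```python
# fine_tuning.py: swap in another LLaVA checkpoint here if needed.
MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
```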
To fine-tune LLaVA on VQA-RAD using LoRA, run:

```bash
python src/fine_tuning.py
```

This script:

- Loads the VQA-RAD dataset (`data_utils.py`)
- Loads and modifies the LLaVA model (`model_utils.py`)
- Applies LoRA for efficient fine-tuning
- Trains the model using `Trainer` from Hugging Face
- Logs losses using `LossLoggerCallback` (`callbacks.py`)
- Saves the fine-tuned model
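These steps might be wired together roughly as follows. This is a sketch under assumptions, not the repository's exact code: `load_model`, `apply_lora`, and `load_vqa_rad` stand in for the helpers described below, and their signatures are illustrative.

```python
# Sketch of the fine_tuning.py flow; helper signatures are assumptions.
from transformers import Trainer, TrainingArguments
from data_utils import load_vqa_rad, CustomDataCollator
from model_utils import load_model, apply_lora
from callbacks import LossLoggerCallback

model, processor = load_model("llava-hf/llava-1.5-7b-hf")
model = apply_lora(model)                    # attach LoRA adapters
train_ds, eval_ds = load_vqa_rad(processor)  # preprocessed VQA-RAD splits

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", fp16=True),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=CustomDataCollator(processor),
    callbacks=[LossLoggerCallback()],
)
trainer.train()
trainer.save_model("checkpoints/final")
```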
Run evaluation using:

```bash
python src/run_evaluation.py
```

This script:

- Loads the fine-tuned model
- Generates predictions on test data
- Computes evaluation metrics using `evaluation.py`
- Saves results to `evaluation_finetuned.json`
To compare with the pretrained model:

```bash
python src/run_evaluation.py --pretrained
```

Results are stored in `evaluation_pretrained.json`.
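One plausible way the `--pretrained` switch is handled (an assumption; the actual argument parsing in `run_evaluation.py` may differ):

```python
# Hypothetical flag handling in run_evaluation.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--pretrained", action="store_true",
                    help="Evaluate the base model instead of the fine-tuned one")
args = parser.parse_args()

output_path = ("terminal_results/evaluation_pretrained.json" if args.pretrained
               else "terminal_results/evaluation_finetuned.json")
```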
**`data_utils.py`**

- Loads the VQA-RAD dataset and processes images & text for LLaVA.
- Implements `CustomDataCollator` for efficient batch processing (sketched below).
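A minimal sketch of what such a collator can look like (field names like `question` and `image` are assumptions, not necessarily the repository's exact schema):

```python
# Illustrative collator in the spirit of CustomDataCollator.
class CustomDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = [ex["question"] for ex in examples]
        images = [ex["image"] for ex in examples]
        batch = self.processor(text=texts, images=images,
                               padding=True, return_tensors="pt")
        # Standard language-modeling setup: labels mirror the input ids.
        batch["labels"] = batch["input_ids"].clone()
        return batch
```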
**`model_utils.py`**

- Loads the LLaVA model and tokenizer.
- Applies LoRA (`apply_lora`) to fine-tune efficiently (sketched below).
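Assuming the standard PEFT recipe, `apply_lora` might look like this (rank, alpha, and target modules here are illustrative defaults, not the repository's actual values):

```python
# Illustrative apply_lora; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model

def apply_lora(model):
    config = LoraConfig(
        r=16,                                  # adapter rank
        lora_alpha=32,                         # scaling factor
        target_modules=["q_proj", "v_proj"],   # attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()         # sanity-check trainable fraction
    return model
```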
**`fine_tuning.py`**

- Defines `fine_tune_model()`, which handles the full fine-tuning process.
- Uses mixed-precision training (`fp16`) and LoRA adapters.
- Logs training loss using `LossLoggerCallback` (`callbacks.py`).
**`evaluation.py`**

- Implements `MedicalResponseEvaluator`, which computes clinical relevance metrics.
**`evaluation_base.py`**

- Defines `BaseEvaluator`, an abstract class for evaluation; `MedicalResponseEvaluator` extends it to implement custom evaluation logic (see the sketch below).
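The relationship between the two classes, in outline (method names and the placeholder metric are assumptions):

```python
# Sketch of the evaluator hierarchy.
from abc import ABC, abstractmethod

class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, predictions, references):
        """Return a dict mapping metric names to values."""

class MedicalResponseEvaluator(BaseEvaluator):
    def evaluate(self, predictions, references):
        # Placeholder metric; the real implementation scores clinical relevance.
        correct = sum(p.strip().lower() == r.strip().lower()
                      for p, r in zip(predictions, references))
        return {"correct_rate": correct / max(len(references), 1)}
```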
**`callbacks.py`**

- Implements `LossLoggerCallback` to track training loss and evaluation loss at each step (see the sketch below).
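A callback of this kind typically hooks into `Trainer` logging; a plausible shape (an assumption, not the exact implementation):

```python
# Sketch of LossLoggerCallback built on transformers' TrainerCallback.
from transformers import TrainerCallback

class LossLoggerCallback(TrainerCallback):
    def __init__(self):
        self.train_losses, self.eval_losses = [], []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        if "loss" in logs:        # training loss at this logging step
            self.train_losses.append((state.global_step, logs["loss"]))
        if "eval_loss" in logs:   # evaluation loss, when available
            self.eval_losses.append((state.global_step, logs["eval_loss"]))
```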
**`run_evaluation.py`**

- Loads the model and dataset.
- Generates predictions using `generate_predictions()` (sketched below).
- Computes evaluation metrics using `MedicalResponseEvaluator`.
- Saves results as a JSON file.
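`generate_predictions()` plausibly follows the usual generate-and-decode loop (a sketch under assumed field names, not the repository's exact code):

```python
# Illustrative prediction loop for run_evaluation.py.
import torch

def generate_predictions(model, processor, dataset, max_new_tokens=64):
    predictions = []
    model.eval()
    for example in dataset:
        inputs = processor(text=example["question"],
                           images=example["image"],
                           return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        predictions.append(processor.decode(output_ids[0],
                                            skip_special_tokens=True))
    return predictions
```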
The model was evaluated using an LLM-as-a-judge framework, an automated scoring system where a large language model (LLM) assesses the correctness and quality of responses. Initially, DeepSeek R1 was used as the evaluation model, but to align better with the specific domain of the fine-tuned model, MedAlpaca, a more clinically relevant LLM, was later adopted for scoring. This transition was crucial to ensure that evaluations reflected domain-specific understanding rather than generic LLM biases.
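The exact judge prompt is not reproduced here; the following illustrative template is consistent with the verdict categories and 1-5 confidence scale reported below, but it is an assumption rather than the prompt actually used:

```python
# Illustrative LLM-as-a-judge prompt template (assumed, not the actual one).
JUDGE_PROMPT = """You are a clinical expert judging a visual question-answering model.
Given the question, the reference answer, and the model's answer, output:
- verdict: CORRECT, INCORRECT, NEUTRAL, or CONTRADICTION
- confidence: an integer from 1 (low) to 5 (high)

Question: {question}
Reference answer: {reference}
Model answer: {prediction}
"""
```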
Below are the comparative results between the fine-tuned model and the pre-trained model:
| Metric | Pre-Trained Model | Fine-Tuned Model |
|---|---|---|
| total_responses | 451 | 451 |
| valid_responses | 426 | 407 |
| correct_rate | 0.51 | 0.47 |
| incorrect_rate | 0.42 | 0.48 |
| neutral_rate | 0.07 | 0.05 |
| contradiction_rate | 0.00 | 0.00 |
| invalid_format_rate | 0.06 | 0.10 |
| average_confidence | 4.89 | 4.76 |
| confidence_completeness | 1.00 | 1.00 |
- The correct rate dropped from 51% to 47%, suggesting that fine-tuning did not improve factual accuracy and may have caused degradation in some areas.
- The incorrect rate increased from 42% to 48%, which could indicate overfitting to the fine-tuning dataset, leading to poorer generalization.
- The neutral rate dropped from 7% to 5%, meaning the model provided fewer uncertain or ambiguous answers.
- The invalid format rate increased from 6% to 10%, suggesting that the fine-tuned model generated more responses that did not conform to the expected output structure.
- Average confidence decreased slightly from 4.89 to 4.76, implying a minor reduction in the model’s certainty when providing answers.
- Contradiction rate remained at 0, confirming that the fine-tuning process did not introduce logical inconsistencies into the model’s outputs.
While LLM-as-a-judge frameworks offer scalability and flexibility in model evaluation, they also have inherent limitations:
- Misalignment with fine-tuning objectives – The evaluation relied on an LLM-based scorer which, while a reasonable approach for assessing nuanced responses, may not fully capture improvements in reasoning or domain-specific adaptation.
- Potential biases in judgments – Even though MedAlpaca was selected for its clinical relevance, LLM judges can still exhibit biases, misinterpret context, or fail to correctly assess cases where multiple valid responses exist.
- Imperfect soft-matching – Traditional NLP metrics (e.g., ROUGE, BLEU) struggle to credit responses that are semantically correct but phrased differently. LLM-based judges attempt to address this but remain prone to errors, especially in complex or ambiguous cases.
While the evaluation suggests a decline in accuracy metrics, this does not necessarily indicate a failure of the fine-tuning process. Instead, it highlights the complexity of LLM adaptation, trade-offs in generalization vs. specialization, and the limitations of LLM-based evaluation.
Further analysis indicates that prompt engineering significantly influences results, with optimized prompting increasing the correct rate by 10%. Additionally, the observed degradation provides valuable insights for refining future fine-tuning strategies and ensuring alignment with real-world use cases.
Edit `training_args` in `fine_tuning.py`:

```python
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    save_strategy="epoch",
    logging_steps=100,
    evaluation_strategy="epoch",
)
```

To add custom metrics, modify `evaluate()` in `evaluation.py`.
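For example, a subclass could extend the returned metrics dict (the method signature and base metrics are assumptions; `avg_answer_words` is a hypothetical metric added for illustration):

```python
# Hypothetical custom metric layered on top of MedicalResponseEvaluator.
from evaluation import MedicalResponseEvaluator

class ExtendedEvaluator(MedicalResponseEvaluator):
    def evaluate(self, predictions, references):
        results = super().evaluate(predictions, references)
        # Custom metric: average answer length in words.
        results["avg_answer_words"] = (
            sum(len(p.split()) for p in predictions) / max(len(predictions), 1)
        )
        return results
```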
If you’d like to contribute:

- Fork the repo
- Create a new branch (e.g., `feature-xyz`)
- Commit changes and open a PR
This fine-tuned model is based on LLaVA-1.5, which in turn utilizes LLaMA 2. As such, it is subject to the LLaMA 2 Community License Agreement, which governs the use, distribution, and modification of the model.
Additionally, LLaVA-1.5 has been trained using GPT-4-generated multimodal instruction-following data, which may impose further restrictions, particularly on commercial usage. Users must review and comply with any applicable terms related to both LLaMA 2 and GPT-4 data usage.
- The fine-tuned model inherits the LLaMA 2 Community License and must be used in accordance with its terms.
- Any deployment or distribution of this model should acknowledge the original authors of LLaVA-1.5 and adhere to its licensing conditions.
- If you plan to use this model commercially, ensure compliance with both Meta’s LLaMA 2 license and OpenAI's policies regarding GPT-4-generated data.
For full details, please refer to the LLaMA 2 Community License Agreement and the LLaVA project's licensing documentation.
By using this fine-tuned model, you agree to abide by the terms and conditions set forth in the above licenses.