This repository contains the replication package for the paper "A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing". We conduct the first large-scale empirical study on fine-tuning 37 popular LLMs in three unit testing scenarios across five benchmarks and eight metrics, consuming over 3,000 NVIDIA A100 GPU hours.
Unit testing plays a pivotal role in software development, improving software quality and reliability. However, generating effective test cases manually is time-consuming, prompting interest in unit testing research. Recently, Large Language Models (LLMs) have shown potential in various unit testing tasks, including test generation, assertion generation, and test evolution, but existing studies are limited in scope and lack a systematic evaluation of the effectiveness of LLMs.
To bridge this gap, we present a large-scale empirical study on fine-tuning LLMs for unit testing. Our study involves three unit testing tasks, five benchmarks, eight evaluation metrics, and 37 popular LLMs across various architectures and sizes, consuming over 3,000 NVIDIA A100 GPU hours. We focus on three key research questions: (1) the performance of LLMs compared to state-of-the-art methods, (2) the impact of different factors on LLM performance, and (3) the effectiveness of fine-tuning versus prompt engineering. Our findings reveal that fine-tuned LLMs outperform existing state-of-the-art approaches on all three unit testing tasks across nearly all metrics, highlighting the potential of fine-tuning LLMs for unit testing. Furthermore, large-scale, decoder-only models achieve the best results across tasks, while encoder-decoder models perform better at the same parameter scale. Additionally, the comparison between fine-tuning and prompt engineering reveals the considerable potential of prompt engineering for unit testing tasks. We then discuss issues of concern in the test generation task, including data leakage, bug detection capability, and metric comparisons. Finally, we pinpoint various practical guidelines for LLM-based approaches to unit testing in the near future. Overall, our work demonstrates the promising future of fine-tuning LLMs for unit testing tasks and reduces the manual effort of unit testing experts in practical scenarios.
The repository contains:
- `Datasets`: the fine-tuning datasets for the three unit testing tasks
- `Finetune_Script`: the scripts for LLM fine-tuning (a minimal fine-tuning sketch follows this list)
- `Inference_Script`: the scripts for the prompt engineering approach
- `TextEvaluate`: the evaluation scripts for text-based metrics
- `TGonDefects4J`: the evaluation scripts for runtime-based metrics on the Defects4J dataset
- `Prompt_template.md`: the prompt templates used for the prompt engineering approach
- `Total_result.pdf`: the complete results of LLMs on the three unit testing tasks
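For orientation, the sketch below shows what supervised fine-tuning for one task (assertion generation) might look like with Hugging Face `transformers`. This is a minimal sketch, not the repository's training code: the actual scripts, data formats, and hyperparameters live in `Finetune_Script`, and the model choice, the file path `Datasets/assertion/train.jsonl`, and the `input`/`output` field names are assumptions.

```python
# Hypothetical fine-tuning sketch (assertion generation) with Hugging Face transformers.
# The real scripts live in Finetune_Script; file names and field names below are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Salesforce/codet5-base"  # an encoder-decoder code model, used here for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumed JSONL format: {"input": focal method + test prefix, "output": expected assertion}
raw = load_dataset("json", data_files={"train": "Datasets/assertion/train.jsonl"})

def preprocess(example):
    features = tokenizer(example["input"], max_length=512, truncation=True)
    labels = tokenizer(example["output"], max_length=128, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

train_set = raw["train"].map(preprocess, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="codet5-assertion-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=100,
    save_strategy="epoch",
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads inputs and labels per batch
)
trainer.train()
```

The same pattern extends to decoder-only models by switching to `AutoModelForCausalLM` and a causal language modeling collator.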
Our study investigates the following research questions:
- RQ1: How does the performance of fine-tuned LLMs compare to that of existing approaches on the three unit testing tasks?
- RQ2: What is the impact of various factors (including model series, model architecture, and model size) on the performance of LLMs?
- RQ3: How do fine-tuning approaches perform compared to prompt engineering approaches?
- Clone the Repository:
```bash
git clone https://anonymous.4open.science/r/LLM4UT-ISSTA
cd LLM-Unit-Test-Generation
```
- Setup Dependencies: Ensure all necessary dependencies are available, especially the fine-tuning datasets and the Defects4J dataset.
- Run Experiments: Navigate to the `Finetune_Script` and `Inference_Script` directories and follow the instructions in the respective subdirectories to reproduce the experiments.
- Evaluate Results: Navigate to the `TextEvaluate` and `TGonDefects4J` directories and follow the instructions in the respective subdirectories to evaluate the results (a minimal sketch of the text-based metrics is shown below).
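For reference, the following is a minimal sketch of two common text-based metrics (exact match and edit similarity). It is not the `TextEvaluate` script itself, and the file names `predictions.txt` and `references.txt` are assumptions.

```python
# Minimal sketch of text-based metrics (exact match and edit similarity).
# Not the repository's TextEvaluate script; file names below are assumptions.
from difflib import SequenceMatcher

def exact_match(pred: str, ref: str) -> bool:
    """Whitespace-normalized string equality between prediction and reference."""
    return " ".join(pred.split()) == " ".join(ref.split())

def edit_similarity(pred: str, ref: str) -> float:
    """Character-level similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, pred, ref).ratio()

with open("predictions.txt") as pf, open("references.txt") as rf:
    preds = [line.rstrip("\n") for line in pf]
    refs = [line.rstrip("\n") for line in rf]

assert len(preds) == len(refs), "predictions and references must align line by line"
em = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
sim = sum(edit_similarity(p, r) for p, r in zip(preds, refs)) / len(refs)
print(f"Exact match: {em:.2%} | Mean edit similarity: {sim:.2%}")
```

Runtime-based metrics (e.g., whether the generated tests compile and pass on Defects4J projects) require executing the tests and are handled in `TGonDefects4J`.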
If you use this package in your research, please cite our paper:
```bibtex
@article{shang2024large,
  title={A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing},
  author={Shang, Ye and Zhang, Quanjun and Fang, Chunrong and Gu, Siqi and Zhou, Jianyi and Chen, Zhenyu},
  journal={arXiv preprint arXiv:2412.16620},
  year={2024}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
We thank the developers of the tools and datasets used in this study. Special thanks to the reviewers for their valuable feedback.