Replication package for the MSc thesis titled: "An Approach for Extracting Training Data from fine-tuned Large Language Models for Code"
Ensure you have the following software installed on your machine:
- Python 3.8
The requirements can be installed by running:
pip install -r requirements.txt
The code is intended to run on an NVIDIA A100 with 80 GB of VRAM, 32 GB of RAM, and 16 CPU cores. The extraction experiments can run on a single GPU; fine-tuning requires multiple GPUs. Specifically, for StarCoder2-3B, 7B, and 15B, we employ two, four, and six GPUs, respectively.
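The GPU counts above map directly onto the launcher's process count. As a minimal sketch, assuming a `torchrun`-style launcher and a hypothetical `finetune.py` entry point in each model's training directory (the repository may use a different launcher, e.g. `accelerate`):

```python
# GPU counts per StarCoder2 variant, as stated in the README.
GPUS_PER_MODEL = {"scoder3b": 2, "scoder7b": 4, "scoder15b": 6}

def launch_command(model):
    """Build a torchrun invocation for the given StarCoder2 variant.

    The entry-point filename "finetune.py" is an assumption for
    illustration; check the actual script name in each training folder.
    """
    n_gpus = GPUS_PER_MODEL[model]
    return [
        "torchrun",
        f"--nproc_per_node={n_gpus}",
        f"tune-memorization/training/{model}/finetune.py",
    ]

print(launch_command("scoder15b"))
```

`--nproc_per_node` spawns one training process per GPU on the node, which matches the single-node multi-GPU setup described above.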
In the tune-memorization/data directory you will find two subdirectories:
- tune-memorization/data/ftune-dataset: Contains the script that automatically downloads and processes the data used for fine-tuning and evaluation, which can be retrieved from this LINK.
- tune-memorization/data/samples: The destination of the attack samples, which can be retrieved from this LINK.
- /pre-train: To run the code and replicate the dataset construction, you need the Java subset of the-stack-v2-dedup loaded in your Hugging Face cache folder. You can find it HERE.
- /fine-tune: To run the code and replicate the dataset construction, you must have the fine-tuning set in your Hugging Face cache folder.
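Before running the dataset construction, it can help to verify that the required dataset is already in the local Hugging Face cache. A minimal, stdlib-only sketch, assuming the standard `datasets--<org>--<name>` layout of the hub cache (the layout is a convention of `huggingface_hub` and may change between versions):

```python
import os
from pathlib import Path

def hf_cache_dir():
    """Resolve the Hugging Face cache root, honoring HF_HOME when set."""
    hf_home = os.environ.get("HF_HOME")
    return Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"

def dataset_is_cached(repo_id):
    """Heuristic check: does a snapshot of this dataset repo exist in the hub cache?"""
    slug = "datasets--" + repo_id.replace("/", "--")
    return (hf_cache_dir() / "hub" / slug).exists()

# The cache can be populated with, for example:
#   from datasets import load_dataset
#   load_dataset("bigcode/the-stack-v2-dedup", "Java")  # config name assumed
print(dataset_is_cached("bigcode/the-stack-v2-dedup"))
```

If the check returns False, download the Java subset first; the full dataset is large, so restricting the download to the Java configuration saves considerable disk space.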
In the tune-memorization/training directory you will find the fine-tuning scripts for each StarCoder2 model size: /scoder3b, /scoder7b, /scoder15b. Additionally, in the /train-stats folder we share figures of the fine-tuning statistics.
In the tune-memorization/evaluation directory you will find three subdirectories:
tune-memorization/evaluation/forgetting: The directory includes the following:
- Evaluation Scripts: Scripts used to run the pre-training code attacks, organized by experiment.
- Plots and Tables: Includes notebooks with all the plots and tables used in the associated paper, as well as additional ones not included in the publication.
tune-memorization/evaluation/memorization: The directory includes the following:
- Evaluation Scripts: Scripts used to run the fine-tuning code attack, organized by experiment.
- Plots and Tables: Includes notebooks with all the plots and tables used in the associated paper, as well as additional ones not included in the publication.
tune-memorization/evaluation/data-inspection: The directory includes the following:
- Plots and Tables: Includes notebooks with all the plots and tables used in the associated paper, as well as additional ones not included in the publication.
Please use the code and concepts shared here responsibly and ethically. The authors have provided this code to enhance the security and safety of large language models (LLMs). Avoid using this code for any malicious purposes. When disclosing data leakage, take care not to compromise individuals' privacy unnecessarily.