Jameson, Amitoj, Karanbir, Sukhamrit
Fine-tuning an LLM for synthetic data generation, then using the synthetically generated data to fine-tune subsequent LLMs for specific use cases.
• LLMs have difficulty with domain-specific questions.
• Fine-tuning requires large datasets to reach good accuracy.
• Bigger models are costly to run and environmentally detrimental.
• Smaller models alone cannot always replace larger ones.
• Fine-tuning on certain data raises privacy issues.
• Fine-tune a large LLM for data synthesis
• Generate large synthetic datasets from small samples
• Fine-tune many small LLMs with the synthetic data
- Synthetic Model Creation: fine-tune an LLM for data synthesis
- Synthetic Data Generation: synthesize data for a specific use case
- Variable Model Creation: fine-tune an LLM using the synthetic data
• Fine-tune a larger LLM so that it can take a small sample dataset and generate synthetic data, which can then be used to fine-tune a smaller LLM
• The larger LLM is known as the synthetic model (SM) and is trained only once
• The SM is fine-tuned on many generic datasets so that it learns how to generate good synthetic data; a sketch of this step follows below
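The exact training setup is not specified here, so the following is a minimal sketch of the one-time SM fine-tune, assuming Hugging Face transformers with LoRA adapters via peft; the base model name, dataset path, and hyperparameters are placeholders, not the project's actual choices.

```python
# Minimal sketch of the one-time synthetic-model (SM) fine-tune.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # stand-in for the larger LLM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Train only low-rank adapter weights to keep the one-time fine-tune cheap.
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM,
                                         r=16, lora_alpha=32))

# Many generic datasets, each flattened to "sample rows -> synthetic rows"
# text examples (hypothetical file name).
data = load_dataset("json", data_files="generic_synthesis_examples.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="synthetic-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model = model.merge_and_unload()  # fold adapters back into the base weights
model.save_pretrained("synthetic-model")  # the SM is trained only once
tokenizer.save_pretrained("synthetic-model")
```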
• The synthetic model is given a use case and a sample dataset, from which it generates a large amount of synthetic data
• The use case is reused later and becomes the purpose of the variable model
• The sample dataset must be large enough for the synthetic model to learn the associations necessary for generating synthetic data; a sketch of the generation step follows below
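A minimal sketch of the generation step, assuming the SM saved above; the prompt format, example use case, and file names are illustrative assumptions, not the project's exact interface.

```python
# Minimal sketch: prompt the SM with a use case and seed samples.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("synthetic-model")
model = AutoModelForCausalLM.from_pretrained("synthetic-model")

use_case = "financial fraud detection"               # the VM's future purpose
sample_rows = open("sample_companies.jsonl").read()  # small seed dataset

prompt = (f"Use case: {use_case}\n"
          f"Sample data:\n{sample_rows}\n"
          "Generate new synthetic rows with the same schema and trends:\n")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048,
                         do_sample=True, temperature=0.8)

# Keep only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
with open("synthetic_data.jsonl", "a") as f:
    f.write(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Repeating this call builds a synthetic dataset far larger than the seed.
```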
• The synthetic data from the SM is then used to fine-tune an untuned smaller LLM for the use case specified during synthetic data generation
• This smaller model is known as the variable model (VM) and is fine-tuned anew for every use case, as sketched below
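A minimal sketch of variable-model creation, assuming the same Trainer recipe as the SM; "synthetic_data.jsonl" is the SM's output from the previous step, and the small base model name is a placeholder.

```python
# Minimal sketch: fine-tune a small untuned LLM on the SM's synthetic data.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

SMALL_BASE = "gpt2"  # stand-in for whichever small untuned LLM is used

tokenizer = AutoTokenizer.from_pretrained(SMALL_BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(SMALL_BASE)

data = load_dataset("json", data_files="synthetic_data.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="variable-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("variable-model")  # the VM, ready for its use case
tokenizer.save_pretrained("variable-model")
```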
We applied synthetic fine-tuning to three use cases:
- Anonymous Medical Data Abstraction
- Financial Fraud Detection
- Product Attribute Extraction
• Input: pre-existing or collected clinical data, including questionnaire responses, information on health conditions, and doctor diagnoses, used to create a new dataset of synthetic patients
• Impact: provides a way to disguise sensitive data about participants and protect their privacy. The identity of each original patient is concealed, and the integrity of the original dataset is preserved because the newly created data mimics trends observed in the original dataset; the hypothetical pair below illustrates the idea
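To make the privacy goal concrete, here is a purely hypothetical before/after pair; the field names and values are invented for illustration, not drawn from any real dataset.

```python
# Hypothetical illustration: the synthetic record follows the same trends
# (condition, questionnaire pattern, age bracket) without copying identity.
original = {
    "patient_id": "P-10293",        # real identifier: never released
    "age": 47,
    "condition": "type 2 diabetes",
    "questionnaire_fatigue": 4,     # 1-5 scale
    "diagnosis": "early-stage neuropathy",
}

synthetic = {
    "patient_id": "SYN-000184",     # freshly generated identifier
    "age": 51,                      # plausible draw from the same distribution
    "condition": "type 2 diabetes",
    "questionnaire_fatigue": 4,
    "diagnosis": "early-stage neuropathy",
}
```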
• Input: large chunks of company data, including fields such as revenue_growth and customer_reviews, used to create a list of potentially fraudulent companies
• Impact: provides a quick and easy way to build a suspicion list, saving the government time and resources when detecting fraudulent companies; a sketch of querying the VM follows below
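A minimal sketch of how a fraud-detection VM might be queried to build the suspicion list; the model path, prompt format, and input file are assumptions, not the project's exact interface.

```python
# Minimal sketch: ask the fraud-detection VM for a yes/no call per company.
import json
from transformers import pipeline

classify = pipeline("text-generation", model="variable-model-fraud")

companies = [json.loads(line) for line in open("companies.jsonl")]

suspicious = []
for company in companies:
    prompt = (f"Company: {company['name']}\n"
              f"revenue_growth: {company['revenue_growth']}\n"
              f"customer_reviews: {company['customer_reviews']}\n"
              "Fraudulent (yes/no): ")
    answer = classify(prompt, max_new_tokens=3,
                      return_full_text=False)[0]["generated_text"]
    if answer.strip().lower().startswith("yes"):
        suspicious.append(company["name"])

print(suspicious)  # the suspicion list handed to investigators
```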
• Input: large amounts of data on product titles containing a product's attributes, such as brand and category, used to create a list of products and their features
• Impact: creates a quick and easy way to access a specific brand's product inventory and the attributes of each product, as sketched below
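A minimal sketch of the attribute-extraction VM in use; the model path, prompt format, and example title are hypothetical.

```python
# Minimal sketch: extract structured attributes from a raw product title.
import json
from transformers import pipeline

extract = pipeline("text-generation", model="variable-model-products")

title = "Acme ProGrip Carbon Tennis Racket, 27in, Blue"  # invented example
prompt = f"Product title: {title}\nAttributes as JSON: "
raw = extract(prompt, max_new_tokens=64,
              return_full_text=False)[0]["generated_text"]

attributes = json.loads(raw)  # e.g. {"brand": "Acme", "category": "rackets"}
print(attributes)
```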
We built some cool stuff! Please check it out :). We hope you have a good day.