RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

Abstract: LLMs are powerful generators of synthetic data, which can be used to train smaller, task-specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs vary in how useful their outputs are for training, and selecting the best generator from a pool of candidate LLMs remains challenging: extrinsic evaluation requires costly human annotations (which are missing for many low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round-robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on a candidate LLM's outputs and then evaluates it on synthetic examples generated by all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any intrinsic heuristic. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of an optimal generator baseline, measured in terms of downstream performance by training a small model on the chosen generator's outputs (optimal vs. chosen by proxy metric) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
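The round-robin scoring described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: the `cross_f1` layout, function names, and generator names are assumptions, with `cross_f1[trainer][evaluator]` holding the F1 of a classifier fine-tuned on `trainer`'s synthetic data and evaluated on synthetic data from `evaluator`.

```python
def rose_score(cross_f1, candidate):
    """Mean F1 of the model trained on `candidate`'s synthetic data,
    evaluated on synthetic data from every *other* candidate generator."""
    scores = [f1 for other, f1 in cross_f1[candidate].items() if other != candidate]
    return sum(scores) / len(scores)

def select_generator(cross_f1):
    """Pick the candidate LLM with the highest RoSE score as the generator."""
    return max(cross_f1, key=lambda c: rose_score(cross_f1, c))

# Illustrative cross-evaluation F1 scores for three hypothetical generators.
cross_f1 = {
    "llm_a": {"llm_b": 0.80, "llm_c": 0.70},
    "llm_b": {"llm_a": 0.60, "llm_c": 0.65},
    "llm_c": {"llm_a": 0.75, "llm_b": 0.85},
}
```

With these numbers, `rose_score` gives 0.75, 0.625, and 0.80 for the three candidates, so `select_generator` returns `"llm_c"`. No human-labelled data enters the computation; only synthetic data from the candidate pool is needed.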

Files

requirements.txt - lists the dependencies needed to run vLLM and related libraries (tested on Python 3.11.6)

generated_data_{task}_{language}_icl_revised/ - contains the generated data from each LLM (generation setup of RoSE)

generated_data_{task}_{language}_simple/ - contains the generated data from each LLM (generation setup of RoSE-Z)

results_{task}_{lang-abrv}_icl_rev_sampled/ - contains the results for each LLM used in the study across 10 seeds; rows labelled orig report the F1 score on human data, while the remaining rows report the performance of that classifier (fine-tuned on the given LLM's generated data) evaluated on data generated by the other LLMs (this is used for the RoSE computation).

results_{task}_rose-z/ - same as above, but for the RoSE-Z setup.

human_data/ - contains the datasets used during the study

intrinsic_metrics/ - contains the resulting metrics per task-language (also RoSE)

intrinsic_metrics_rose-z/ - contains the resulting metrics per task-language (also RoSE-Z)

run_finetuning_for_rose{-z}.sh - runs finetuning for a combination of task and language specified via the script's arguments

generate_{topic}_rose-{z}.sh - generates data for a given task using either the RoSE or the RoSE-Z generation setup

proxy_metrics_computations.ipynb - computes the proxy metrics and analyses the results

The other Python scripts are used for fine-tuning XLM-R, generating data via vLLM, or revising the generated data with an LLM judge.

About

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
