RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

Abstract: LLMs are powerful generators of synthetic data, which can be used to train smaller, task-specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs vary in how useful their outputs are for training, and selecting the best generator from a pool of candidate LLMs remains challenging: extrinsic evaluation requires costly human annotations (which are missing for many low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round-robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on a candidate LLM's outputs and then evaluates it on synthetic examples generated by all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any intrinsic heuristic. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of an optimal generator baseline, measured in terms of downstream performance by training a small model on the chosen generator's outputs (optimal vs. chosen by proxy metric) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
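The round-robin scoring described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual code: the `cross_f1` layout, function names, and generator names are assumptions, with `cross_f1[trainer][evaluator]` holding the F1 of a classifier fine-tuned on `trainer`'s synthetic data and evaluated on synthetic data from `evaluator`.

```python
def rose_score(cross_f1, candidate):
    """Mean F1 of the model trained on `candidate`'s synthetic data,
    evaluated on synthetic data from every *other* candidate generator."""
    scores = [f1 for other, f1 in cross_f1[candidate].items() if other != candidate]
    return sum(scores) / len(scores)

def select_generator(cross_f1):
    """Pick the candidate LLM with the highest RoSE score as the generator."""
    return max(cross_f1, key=lambda c: rose_score(cross_f1, c))

# Illustrative cross-evaluation F1 scores for three hypothetical generators.
cross_f1 = {
    "llm_a": {"llm_b": 0.80, "llm_c": 0.70},
    "llm_b": {"llm_a": 0.60, "llm_c": 0.65},
    "llm_c": {"llm_a": 0.75, "llm_b": 0.85},
}
```

With these numbers, `rose_score` gives 0.75, 0.625, and 0.80 for the three candidates, so `select_generator` returns `"llm_c"`. No human-labelled data enters the computation; only synthetic data from the candidate pool is needed.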

Files

requirements.txt - lists the dependencies needed to run vLLM and related libraries (tested on Python 3.11.6)

generated_data_{task}_{language}_icl_revised/ - contains the generated data from each LLM (generation setup of RoSE)

generated_data_{task}_{language}_simple/ - contains the generated data from each LLM (generation setup of RoSE-Z)

results_{task}_{lang-abrv}_icl_rev_sampled/ - contains the results for each LLM used in the study across 10 seeds; rows labelled orig report the F1 score on human data, while the remaining rows report the performance of that classifier (fine-tuned on the given LLM's generated data) evaluated on data generated by the other LLMs (this is used for the RoSE computation).

results_{task}_rose-z/ - same as above, but for the RoSE-Z setup.

human_data/ - contains the datasets used during the study

intrinsic_metrics/ - contains the resulting metrics per task-language (also RoSE)

intrinsic_metrics_rose-z/ - contains the resulting metrics per task-language (also RoSE-Z)

run_finetuning_for_rose{-z}.sh - runs finetuning for a combination of task and language specified via the script's arguments

generate_{topic}_rose-{z}.sh - generates data for a given task using either the RoSE or the RoSE-Z generation setup

proxy_metrics_computations.ipynb - computes the proxy metrics and analyses the results

The other Python scripts are used for fine-tuning XLM-R, generating data via vLLM, or revising the generated data with an LLM judge.

About

RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
