
ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks


Submitted to IEEE Transactions on Big Data

Overview

This repository provides the code and data for "ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks". This research extends our previous work ("Culinary Class Wars") by introducing a rigorous meta-evaluation of the ASH (Authenticity, Sensitivity, Harmony) framework.

In this extended study, we not only evaluate Large Language Models (LLMs) on cuisine transfer tasks but also rigorously validate the "LLM-as-a-judge" paradigm itself. We implement and compare eight distinct prompt engineering strategies (ranging from simple scoring to Chain-of-Thought) to identify the most human-aligned evaluation methodology.

Table of Contents

  • Project Structure
  • Setup
  • Data Description
  • How to Run: Generation & Evaluation
  • How to Run: Prompt Engineering Experiments
  • Results
  • Contributors
  • Acknowledgements

Project Structure

The repository is organized into two main parts, data and code; the code directory includes the new prompt_engineering module.

```
.
├── data
│   ├── generation
│   │   └── v0_recipes.csv                 # Recipes generated by LLMs (4,800 recipes)
│   ├── evaluation
│   │   ├── 5-round                        # Five evaluations per recipe (baseline)
│   │   │   ├── v0_recipes_eval_5_ollama.csv
│   │   │   └── ...
│   │   └── human                          # Human-annotated ground truth
│   │       └── human_ground_truth_200.csv # 200 recipes evaluated by diverse annotators
│   └── prompt_experiments                 # Meta-evaluation outputs (see Data Description)
└── code
    ├── generation                         # Recipe generation scripts
    │   └── generate_recipes.py
    ├── evaluation                         # Standard ASH evaluation scripts
    │   ├── evaluate_recipes_5_ollama.py
    │   └── ...
    └── prompt_engineering                 # [NEW] Prompt optimization experiments
        └── evaluate_recipes_prompt_check_ollama.py  # Evaluates recipes with 8 prompt strategies
```

Setup

  1. Clone the repository:

```bash
git clone https://github.com/dmis-lab/ASH.git
cd ASH
```

  2. Install dependencies:

```bash
pip install -r requirements.txt
```

  3. API keys: Ensure the necessary API keys are configured (a minimal loading sketch follows this list):
  • OpenAI API key: ../API_KEY/API_KEY_openai.txt
  • Google Gemini API key: ../API_KEY/API_KEY_gemini.txt
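
For illustration only, the snippet below shows one way the key files above could be read; the repository's actual loading logic may differ.

```python
# Minimal sketch: read API keys from the plain-text files listed above.
# (Illustrative only; the repo's scripts may load keys differently.)
from pathlib import Path

def load_api_key(path: str) -> str:
    """Return the key stored in a plain-text file, stripped of whitespace."""
    return Path(path).read_text().strip()

openai_key = load_api_key("../API_KEY/API_KEY_openai.txt")
gemini_key = load_api_key("../API_KEY/API_KEY_gemini.txt")
```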

Data Description

  • data/generation: Contains 4,800 recipes generated by 6 models across 40 cuisines (see the inspection sketch after this list).
  • data/evaluation: Contains the baseline ASH evaluation results and the human-annotated ground truth.
  • data/prompt_experiments: Contains the outputs of the meta-evaluation, in which different prompt strategies (e.g., CoT, Role-Playing) were tested against the human ground truth to identify the optimal evaluator.
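
As a quick sanity check, the recipe CSV can be inspected with pandas. The column names below ("model", "cuisine") are assumptions for illustration; check df.columns against the actual schema first.

```python
# Inspect the generated-recipe CSV (column names are assumed, not verified).
import pandas as pd

df = pd.read_csv("data/generation/v0_recipes.csv")
print(len(df))                  # expected: 4,800 recipes
print(df["model"].nunique())    # expected: 6 generator models
print(df["cuisine"].nunique())  # expected: 40 cuisines
```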

How to Run: Generation & Evaluation

1. Recipe Generation

Generate recipes using the standardized prompt template.

```bash
python code/generation/generate_recipes.py --model mistral:7b --output data/generation/recipes_mistral.csv
```
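
To generate recipes for several models in one go, a small driver can loop over the documented CLI. The model tags below are illustrative; the paper's six generator models may differ.

```python
# Hypothetical batch driver around the documented generation script.
import subprocess

MODELS = ["mistral:7b", "gemma2:9b", "llama3.1:8b"]  # illustrative tags

for model in MODELS:
    output = f"data/generation/recipes_{model.split(':')[0]}.csv"
    subprocess.run(
        ["python", "code/generation/generate_recipes.py",
         "--model", model, "--output", output],
        check=True,  # raise if a run fails so errors are not silently skipped
    )
```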

2. Standard ASH Evaluation (Baseline)

Evaluate the generated recipes using the default scoring prompt.

```bash
python code/evaluation/evaluate_recipes_5_ollama.py data/generation/v0_recipes.csv
```
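
Each recipe is scored five times; downstream analysis typically averages these rounds. A minimal sketch, assuming per-round rows keyed by recipe_id with one column per ASH dimension (all column names are assumptions):

```python
# Average the five evaluation rounds per recipe (schema assumed).
import pandas as pd

ev = pd.read_csv("data/evaluation/5-round/v0_recipes_eval_5_ollama.csv")
ash_cols = ["authenticity", "sensitivity", "harmony"]  # assumed column names
per_recipe = ev.groupby("recipe_id")[ash_cols].mean()  # mean over 5 rounds
print(per_recipe.describe())
```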

How to Run: Prompt Engineering Experiments

This section reproduces the meta-evaluation experiments (Table III in the paper) to identify the optimal prompt strategy.

Evaluate with 8 Prompt Strategies

Run the comprehensive evaluation script. This script utilizes multiprocessing to distribute tasks across available GPUs and evaluates recipes using 8 distinct prompt strategies (Default, Role-Playing, Scoring Scale, CoT, etc.) and multiple evaluator models.

Usage: Ensure your Ollama server is running and the required models (e.g., gemma2:9b, mistral:7b, llama3.1:8b) are pulled.

```bash
# Run the prompt check script
python code/prompt_engineering/evaluate_recipes_prompt_check_ollama.py
```
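
Under the hood, each strategy amounts to sending a differently phrased judging prompt to an Ollama-served model. The sketch below, using the ollama Python client, illustrates the idea with a made-up Scoring Scale prompt; it is not the repository's exact prompt text or call structure.

```python
# Minimal LLM-as-a-judge call via the `ollama` client (pip install ollama).
# The prompt text is illustrative, not the paper's actual strategy wording.
import ollama

SCORING_SCALE_PROMPT = (
    "Rate this recipe's Authenticity, Sensitivity, and Harmony, each on an "
    "integer scale from 1 (poor) to 5 (excellent).\n\nRecipe:\n{recipe}"
)

def judge(recipe_text: str, model: str = "gemma2:9b") -> str:
    """Ask an Ollama-served model to score one recipe; returns raw text."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user",
                   "content": SCORING_SCALE_PROMPT.format(recipe=recipe_text)}],
    )
    return response["message"]["content"]
```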

Expected Output: a ranking of the prompt strategies by MSE against the human ground truth (e.g., Strategy 3, Scoring Scale Specification, typically yields the lowest MSE).
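
The ranking metric is mean squared error between judge scores and the 200-recipe human ground truth. A sketch of that computation follows; the judge-output file name and all column/join names are hypothetical placeholders.

```python
# MSE between one strategy's judge scores and the human ground truth.
# `judge_scores.csv` and all column names are hypothetical placeholders.
import pandas as pd

human = pd.read_csv("data/evaluation/human/human_ground_truth_200.csv")
judge = pd.read_csv("judge_scores.csv")
merged = human.merge(judge, on="recipe_id", suffixes=("_human", "_judge"))

mse = ((merged["score_human"] - merged["score_judge"]) ** 2).mean()
print(f"MSE vs. human ground truth: {mse:.3f}")  # lower = better aligned
```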

Results

  • Generative Capability: Comparison of 6 LLMs showing the trade-off between Sensitivity (Style) and Authenticity (Substance).
  • Evaluator Reliability: The "Scoring Scale Specification" strategy was found to be more robust (MSE 1.087) than complex Chain-of-Thought prompts, highlighting a "Complexity Paradox" in automated evaluation.

Contributors

| Name | Affiliation | Email |
|------|-------------|-------|
| Hoonick Lee (First Author) | Dept. of Computer Science & Engineering, Korea University | hoonick@korea.ac.kr |
| Mogan Gim | Dept. of Biomedical Engineering, Hankuk University of Foreign Studies | gimmogan@hufs.ac.kr |
| Donghyeon Park | Dept. of AI and Data Science, Sejong University | parkdh@sejong.ac.kr |
| Donghee Choi† | School of Computer Science & Engineering, Pusan National University | dchoi@pusan.ac.kr |
| Jaewoo Kang† | Dept. of Computer Science & Engineering, Korea University | kangj@korea.ac.kr |

† Corresponding Authors

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant No. NRF-2023R1A2C3004176, by the Hankuk University of Foreign Studies Research Fund of 2025, and by a New Faculty Research Grant of Pusan National University, 2025.