
ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks


Submitted to IEEE Transactions on Big Data

Overview

This repository provides the code and data for "ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks". This research extends our previous work ("Culinary Class Wars") by introducing a rigorous meta-evaluation of the ASH (Authenticity, Sensitivity, Harmony) framework.

In this extended study, we not only evaluate Large Language Models (LLMs) on cuisine transfer tasks but also rigorously validate the "LLM-as-a-judge" paradigm itself. We implement and compare eight distinct prompt engineering strategies (ranging from simple scoring to Chain-of-Thought) to identify the most human-aligned evaluation methodology.

Table of Contents

  • Project Structure
  • Setup
  • Data Description
  • How to Run: Generation & Evaluation
  • How to Run: Prompt Engineering Experiments
  • Results
  • Contributors
  • Acknowledgements

Project Structure

The repository is organized into two main parts, data and code; the code directory includes the new prompt_engineering module.

```
.
├── data
│   ├── generation
│   │   └── v0_recipes.csv                 # Recipes generated by LLMs (4,800 recipes)
│   ├── evaluation
│   │   ├── 5-round                        # Five evaluations per recipe (baseline)
│   │   │   ├── v0_recipes_eval_5_ollama.csv
│   │   │   └── ...
│   │   └── human                          # Human-annotated ground truth
│   │       └── human_ground_truth_200.csv # 200 recipes evaluated by diverse annotators
│   └── prompt_experiments                 # Meta-evaluation outputs (see Data Description)
└── code
    ├── generation                         # Recipe generation scripts
    │   └── generate_recipes.py
    ├── evaluation                         # Standard ASH evaluation scripts
    │   ├── evaluate_recipes_5_ollama.py
    │   └── ...
    └── prompt_engineering                 # [NEW] Prompt optimization experiments
        └── evaluate_recipes_prompt_check_ollama.py  # Evaluates recipes with 8 prompt strategies
```

Setup

  1. Clone the repository:

```bash
git clone https://github.com/dmis-lab/ASH.git
cd ASH
```

  2. Install dependencies:

```bash
pip install -r requirements.txt
```

  3. API keys: Ensure the necessary API keys are configured (a minimal loading sketch follows this list):
  • OpenAI API key: ../API_KEY/API_KEY_openai.txt
  • Google Gemini API key: ../API_KEY/API_KEY_gemini.txt
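
For illustration only, the snippet below shows one way the key files above could be read; the repository's actual loading logic may differ.

```python
# Minimal sketch: read API keys from the plain-text files listed above.
# (Illustrative only; the repo's scripts may load keys differently.)
from pathlib import Path

def load_api_key(path: str) -> str:
    """Return the key stored in a plain-text file, stripped of whitespace."""
    return Path(path).read_text().strip()

openai_key = load_api_key("../API_KEY/API_KEY_openai.txt")
gemini_key = load_api_key("../API_KEY/API_KEY_gemini.txt")
```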

Data Description

  • data/generation: Contains 4,800 recipes generated by 6 models across 40 cuisines (see the inspection sketch after this list).
  • data/evaluation: Contains the baseline ASH evaluation results and the human-annotated ground truth.
  • data/prompt_experiments: Contains the outputs of the meta-evaluation, in which different prompt strategies (e.g., CoT, Role-Playing) were tested against the human ground truth to identify the optimal evaluator.
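
As a quick sanity check, the recipe CSV can be inspected with pandas. The column names below ("model", "cuisine") are assumptions for illustration; check df.columns against the actual schema first.

```python
# Inspect the generated-recipe CSV (column names are assumed, not verified).
import pandas as pd

df = pd.read_csv("data/generation/v0_recipes.csv")
print(len(df))                  # expected: 4,800 recipes
print(df["model"].nunique())    # expected: 6 generator models
print(df["cuisine"].nunique())  # expected: 40 cuisines
```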

How to Run: Generation & Evaluation

1. Recipe Generation

Generate recipes using the standardized prompt template.

```bash
python code/generation/generate_recipes.py --model mistral:7b --output data/generation/recipes_mistral.csv
```
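
To generate recipes for several models in one go, a small driver can loop over the documented CLI. The model tags below are illustrative; the paper's six generator models may differ.

```python
# Hypothetical batch driver around the documented generation script.
import subprocess

MODELS = ["mistral:7b", "gemma2:9b", "llama3.1:8b"]  # illustrative tags

for model in MODELS:
    output = f"data/generation/recipes_{model.split(':')[0]}.csv"
    subprocess.run(
        ["python", "code/generation/generate_recipes.py",
         "--model", model, "--output", output],
        check=True,  # raise if a run fails so errors are not silently skipped
    )
```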

2. Standard ASH Evaluation (Baseline)

Evaluate the generated recipes using the default scoring prompt.

```bash
python code/evaluation/evaluate_recipes_5_ollama.py data/generation/v0_recipes.csv
```
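
Each recipe is scored five times; downstream analysis typically averages these rounds. A minimal sketch, assuming per-round rows keyed by recipe_id with one column per ASH dimension (all column names are assumptions):

```python
# Average the five evaluation rounds per recipe (schema assumed).
import pandas as pd

ev = pd.read_csv("data/evaluation/5-round/v0_recipes_eval_5_ollama.csv")
ash_cols = ["authenticity", "sensitivity", "harmony"]  # assumed column names
per_recipe = ev.groupby("recipe_id")[ash_cols].mean()  # mean over 5 rounds
print(per_recipe.describe())
```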

How to Run: Prompt Engineering Experiments

This section reproduces the meta-evaluation experiments (Table III in the paper) to identify the optimal prompt strategy.

Evaluate with 8 Prompt Strategies

Run the comprehensive evaluation script. This script utilizes multiprocessing to distribute tasks across available GPUs and evaluates recipes using 8 distinct prompt strategies (Default, Role-Playing, Scoring Scale, CoT, etc.) and multiple evaluator models.

Usage: Ensure your Ollama server is running and the required models (e.g., gemma2:9b, mistral:7b, llama3.1:8b) are pulled.

```bash
# Run the prompt check script
python code/prompt_engineering/evaluate_recipes_prompt_check_ollama.py
```
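
Under the hood, each strategy amounts to sending a differently phrased judging prompt to an Ollama-served model. The sketch below, using the ollama Python client, illustrates the idea with a made-up Scoring Scale prompt; it is not the repository's exact prompt text or call structure.

```python
# Minimal LLM-as-a-judge call via the `ollama` client (pip install ollama).
# The prompt text is illustrative, not the paper's actual strategy wording.
import ollama

SCORING_SCALE_PROMPT = (
    "Rate this recipe's Authenticity, Sensitivity, and Harmony, each on an "
    "integer scale from 1 (poor) to 5 (excellent).\n\nRecipe:\n{recipe}"
)

def judge(recipe_text: str, model: str = "gemma2:9b") -> str:
    """Ask an Ollama-served model to score one recipe; returns raw text."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user",
                   "content": SCORING_SCALE_PROMPT.format(recipe=recipe_text)}],
    )
    return response["message"]["content"]
```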

Expected Output: a ranking of the prompt strategies by MSE against the human ground truth (e.g., Strategy 3, Scoring Scale Specification, typically yields the lowest MSE).
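
The ranking metric is mean squared error between judge scores and the 200-recipe human ground truth. A sketch of that computation follows; the judge-output file name and all column/join names are hypothetical placeholders.

```python
# MSE between one strategy's judge scores and the human ground truth.
# `judge_scores.csv` and all column names are hypothetical placeholders.
import pandas as pd

human = pd.read_csv("data/evaluation/human/human_ground_truth_200.csv")
judge = pd.read_csv("judge_scores.csv")
merged = human.merge(judge, on="recipe_id", suffixes=("_human", "_judge"))

mse = ((merged["score_human"] - merged["score_judge"]) ** 2).mean()
print(f"MSE vs. human ground truth: {mse:.3f}")  # lower = better aligned
```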

Results

  • Generative Capability: Comparison of 6 LLMs showing the trade-off between Sensitivity (Style) and Authenticity (Substance).
  • Evaluator Reliability: The "Scoring Scale Specification" strategy was found to be more robust (MSE 1.087) than complex Chain-of-Thought prompts, highlighting a "Complexity Paradox" in automated evaluation.

Contributors

| Name | Affiliation | Email |
|------|-------------|-------|
| Hoonick Lee (First Author) | Dept. of Computer Science & Engineering, Korea University | hoonick@korea.ac.kr |
| Mogan Gim | Dept. of Biomedical Engineering, Hankuk University of Foreign Studies | gimmogan@hufs.ac.kr |
| Donghyeon Park | Dept. of AI and Data Science, Sejong University | parkdh@sejong.ac.kr |
| Donghee Choi† | School of Computer Science & Engineering, Pusan National University | dchoi@pusan.ac.kr |
| Jaewoo Kang† | Dept. of Computer Science & Engineering, Korea University | kangj@korea.ac.kr |

† Corresponding Authors

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant No. NRF-2023R1A2C3004176, by the Hankuk University of Foreign Studies Research Fund of 2025, and by a New Faculty Research Grant of Pusan National University, 2025.