ZERA: Zero-init Instruction Evolving Refinement Agent


🎯 The First Joint System-User Prompt Optimization Agent
🚀 From Zero Instructions to Structured Prompts via Self-Refining Optimization


🎯 Overview

ZERA is a first-of-its-kind prompt auto-tuning agent that revolutionizes how we approach prompt engineering. Unlike traditional methods that require extensive manual crafting, ZERA starts from zero instructions and automatically evolves them into high-performance, structured prompts through intelligent self-refinement.

✨ What Makes ZERA Special?

  • 🚀 Zero to Hero: Start with minimal instructions, end with expert-level prompts
  • 🔄 Self-Evolving: Continuously improves prompts through automated critique and refinement
  • 🎯 Joint Optimization: Simultaneously optimizes both system and user prompts
  • ⚡ Lightning Fast: Achieves high-quality results with only 5-20 samples
  • 🧠 Principle-Based: Uses 8 evaluation principles for consistent quality
  • 📊 Weighted Scoring: Adaptive importance weighting for each principle
  • 🌟 Model Agnostic: Works with any LLM (GPT-4, Claude, Solar, LLaMA, etc.)

🎬 See ZERA in Action

Task Type          | Before (Zero Prompt) | After (ZERA Optimized)
Math Reasoning     | "Solve this"         | "You are an expert mathematician. Analyze the problem step-by-step, show your work clearly, and provide a comprehensive solution with explanations."
Code Generation    | "Write code"         | "You are a senior software engineer. Write clean, efficient, and well-documented code. Include error handling, edge cases, and follow best practices."
Text Summarization | "Summarize this"     | "You are a professional editor. Create concise, accurate summaries that capture key points while maintaining readability and coherence."

📚 Research Paper

🎉 ZERA has been accepted to the EMNLP 2025 Main Conference! 🎉

📖 Paper Details

Title: ZERA: Zero-init Instruction Evolving Refinement Agent – From Zero Instructions to Structured Prompts via Principle-based Optimization
Conference: EMNLP 2025 Main Track (Oral Presentation)
Status: ✅ Accepted
Authors: Seungyoun Yi, Minsoo Khang, Sungrae Park


Research Contribution

  • Joint Optimization: Unlike prior APO (Automatic Prompt Optimization) methods that only refine user prompts, ZERA jointly optimizes both system and user prompts.
  • Principle-based Evaluation: Introduces eight general evaluation principles (Correctness, Reasoning Quality, Conciseness, etc.) with adaptive weighting to guide prompt refinement.
  • Self-Refining Framework: Iterative loop of PCG (Principle-based Critique Generation) and MPR (Meta-cognitive Prompt Refinement) enables evolution from minimal "zero" prompts to structured, task-optimized prompts.
  • Efficiency: Achieves high-quality prompts with only 5–20 samples and short iteration cycles.

📊 Performance Results

ZERA has been extensively benchmarked and shows competitive performance compared to state-of-the-art methods:

🔬 Model Coverage

  • 5 LLMs: GPT-3.5, GPT-4o, LLaMA-3.1, Qwen-2.5, Mistral-7B
  • 9 Datasets: MMLU, GSM8K, BBH, CNN/DailyMail, SAMSum, MBPP, HumanEval, TruthfulQA, HellaSwag

✨ Key Strengths

  • Competitive Performance: Delivers results comparable to or better than recent APO methods
  • Efficient Convergence: Achieves good results with minimal samples (5-20)
  • Broad Applicability: Works across diverse domains without task-specific tuning
  • Zero-Shot Capability: Starts from minimal instructions, no handcrafted prompts needed

📎 Read the Full Paper (arXiv)


🔄 Core Concept: Self-Refining Optimization

ZERA implements a revolutionary Self-Refining Optimization process that transforms minimal instructions into expert-level prompts through intelligent iteration.

🎯 The ZERA Loop

ZERA Concept

ZERA's iterative prompt refinement process: PCG (Principle-based Critique Generation) → MPR (Meta-cognitive Prompt Refinement) → Enhanced Prompt

🔧 How It Works

  1. 🔄 PCG (Principle-based Critique Generation)

    • Evaluates prompt performance against 8 evaluation principles
    • Generates detailed critiques with scores, analysis, and suggestions
    • Provides weighted feedback based on principle importance
  2. ⚡ MPR (Meta-cognitive Prompt Refinement)

    • Uses critiques to intelligently refine both system and user prompts
    • Leverages historical best prompts and prompt replay data
    • Maintains consistency and improves prompt quality iteratively
  3. โ™พ๏ธ Continuous Refinement Loop

    • Task samples โ†’ Inference โ†’ Evaluation โ†’ Critique โ†’ Refinement
    • Each iteration produces better prompts based on principle-based feedback
    • Rapid convergence to optimal prompts with minimal samples
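
The sketch below outlines the loop in Python. It is an illustration only, not the repository's actual API: the callables run_inference, generate_critiques, and refine_prompts stand in for the inference, PCG, and MPR steps described above, and their names and signatures are hypothetical.

# Illustrative sketch of the ZERA refinement loop; names are placeholders, not the repo's API.
def zera_loop(task_samples, run_inference, generate_critiques, refine_prompts, iterations=5):
    system_prompt, user_prompt = "", ""  # start from zero instructions
    best = {"system": system_prompt, "user": user_prompt, "score": float("-inf")}

    for _ in range(iterations):
        # Inference: run the current prompt pair on a small batch of task samples
        outputs = run_inference(system_prompt, user_prompt, task_samples)

        # PCG: critique the outputs against the 8 principles and compute a weighted score
        critiques, score = generate_critiques(outputs, task_samples)

        # Keep the best prompt pair seen so far (used as replay/history during refinement)
        if score > best["score"]:
            best = {"system": system_prompt, "user": user_prompt, "score": score}

        # MPR: refine both prompts using the critiques and the historical best
        system_prompt, user_prompt = refine_prompts(system_prompt, user_prompt, critiques, best)

    return best["system"], best["user"]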

📋 8 Evaluation Principles

ZERA evaluates prompts using eight comprehensive principles with adaptive weighting that adjusts based on task requirements and performance:

Principle    | Description                                  | Focus Area
Meaning      | Captures key details and core information    | Content accuracy
Completeness | Covers all essential aspects comprehensively | Information coverage
Expression   | Uses appropriate tone and style              | Communication quality
Faithfulness | Stays true to source without fabrication     | Source adherence
Conciseness  | Maintains brevity while being complete       | Efficiency
Correctness  | Provides accurate and factual information    | Factual accuracy
Structural   | Organizes content in logical structure       | Organization
Reasoning    | Demonstrates clear logical thinking          | Logical flow

Each principle contributes to the overall prompt quality score, with weights that dynamically adjust to guide the refinement process toward optimal performance.
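
As a minimal illustration of weighted principle scoring (the 1-10 scale, the uniform starting weights, and the weight adjustments below are assumptions for the example, not ZERA's actual values), the overall quality score can be computed as a weight-normalized sum of per-principle scores:

# Illustrative weighted principle scoring (scale and weights are assumed, not ZERA's exact scheme).
PRINCIPLES = [
    "meaning", "completeness", "expression", "faithfulness",
    "conciseness", "correctness", "structural", "reasoning",
]

def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-principle scores (e.g. 1-10) into one weighted quality score."""
    total_weight = sum(weights[p] for p in PRINCIPLES)
    return sum(scores[p] * weights[p] for p in PRINCIPLES) / total_weight

# Example: start from uniform weights, then emphasize correctness and reasoning
weights = {p: 1.0 for p in PRINCIPLES}
weights["correctness"] = 2.0
weights["reasoning"] = 1.5

scores = {p: 7.0 for p in PRINCIPLES}
scores["correctness"] = 9.0
print(round(weighted_score(scores, weights), 2))  # prints 7.42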


Directory Structure and Roles

agent/
  app/           # Streamlit-based web UI and state management
  common/        # Common utilities including API clients
  core/          # Core logic for prompt tuning and iteration result management
  dataset/       # Various benchmark datasets and data loaders
  prompts/       # System/user/meta prompt templates
  test/          # Unit test code
  __init__.py    # Package initialization

evaluation/
  base/                # Common base for evaluation system and execution scripts
  dataset_evaluator/   # Dataset-specific evaluators (LLM-based)
    bert/              # BERTScore-based prompt comparison
    llm_judge/         # LLM Judge-based comparison results
  llm_judge/           # LLM Judge evaluation result CSVs
  examples/            # Evaluation and tuning example code
  results/             # Evaluation result storage
  samples/             # Sample data

scripts/               # Command-line interface tools and utilities
  run_prompt_tuning.py      # CLI for prompt tuning experiments
  run_batch_experiments.py  # Batch experiment execution
  update_results.py         # Result update utilities
  run_background.sh         # Background process management

agent directory

  • app/: Streamlit-based web interface and state management
  • common/: Common client for communicating with various LLM APIs
  • core/: Core logic for prompt auto-tuning and iteration result management
  • dataset/: Various benchmark dataset loaders and data folders
  • prompts/: System/user/meta/evaluation prompt templates
  • test/: Prompt tuner test code

evaluation directory

  • base/: Common base classes for evaluation system (BaseEvaluator) and execution scripts (main.py)
  • dataset_evaluator/: LLM evaluators for each dataset (e.g., gsm8k_evaluator.py, mmlu_evaluator.py, etc.)
    • bert/: Prompt comparison using BERTScore and results (bert_compare_prompts.py, zera_score.json, base_score.json, etc.)
    • llm_judge/: LLM Judge-based comparison result storage
  • llm_judge/: Comparison result CSVs generated by LLM Judge
  • examples/: Dataset-specific evaluation/tuning example code and execution methods
  • results/: Evaluation result storage folder
  • samples/: Sample data

Key Features

  • Prompt Auto-tuning:

    • Iteratively improve system/user prompts to maximize LLM performance
    • Utilize meta-prompts to guide LLMs to directly improve prompts themselves
  • Support for Various Models and Datasets:

    • Support for various models including OpenAI GPT, Anthropic Claude, Upstage Solar, local LLMs
    • Built-in benchmark datasets including MMLU, GSM8K, CNN, MBPP, TruthfulQA
  • Automated Output Evaluation:

    • Automatically evaluate LLM outputs using the 8 evaluation principles (meaning, completeness, expression, faithfulness, conciseness, correctness, structural consistency, reasoning quality)
    • Improve prompts based on evaluation results
  • Various Evaluation Methods:

    • LLM-based Evaluation: LLMs directly perform correctness assessment, scoring, and detailed evaluation for each dataset
    • BERTScore-based Evaluation: Compare output similarity (F1, Precision, Recall, etc.) between prompts using BERT embeddings
    • LLM Judge-based Evaluation: LLMs directly compare outputs from two prompts to determine winner/loser and reasons
  • Web UI:

    • Intuitive experiment management and result visualization based on Streamlit

Evaluation System Usage

1. LLM Evaluation Execution

You can run LLM-based evaluation on various datasets and prompts via evaluation/base/main.py.

python evaluation/base/main.py --dataset <dataset_name> --model <model_name> --model_version <version> \
  --base_system_prompt <existing_system_prompt> --base_user_prompt <existing_user_prompt> \
  --zera_system_prompt <zera_system_prompt> --zera_user_prompt <zera_user_prompt> \
  --num_samples <sample_count>
  • Evaluation results are stored in evaluation/results/.
  • You can compare prompt performance using various metrics like accuracy, ROUGE, etc.

2. BERTScore-based Prompt Comparison

Running evaluation/dataset_evaluator/bert/bert_compare_prompts.py compares the outputs of the ZERA-optimized prompt and the existing prompt using BERTScore.

python evaluation/dataset_evaluator/bert/bert_compare_prompts.py
  • Results are saved as comparison_results.csv.
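
If you want to reproduce this kind of comparison outside the provided script, the bert-score package can be used directly. The snippet below is a generic sketch with made-up example strings; it does not mirror bert_compare_prompts.py's inputs or output format.

# Generic BERTScore comparison between outputs of two prompts (not the repo's script).
from bert_score import score  # pip install bert-score

references   = ["The quick brown fox jumps over the lazy dog."]  # gold answers
base_outputs = ["A brown fox jumped over a dog."]                # baseline-prompt outputs
zera_outputs = ["The quick brown fox jumps over the lazy dog."]  # ZERA-prompt outputs

# score() returns precision, recall, and F1 tensors; F1 is the usual headline number
_, _, f1_base = score(base_outputs, references, lang="en")
_, _, f1_zera = score(zera_outputs, references, lang="en")

print(f"baseline F1: {f1_base.mean().item():.4f}")
print(f"ZERA F1:     {f1_zera.mean().item():.4f}")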

3. LLM Judge-based Comparison

Results where an LLM directly compares the outputs of two prompts (winner, reasoning, etc.) are stored in evaluation/llm_judge/comparison_results.csv.
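
To summarize that CSV quickly, a pandas one-liner like the following works; note that the "winner" column name is an assumption about the file's schema.

# Quick win-rate summary of the LLM Judge results (the "winner" column name is assumed).
import pandas as pd

df = pd.read_csv("evaluation/llm_judge/comparison_results.csv")
print(df["winner"].value_counts(normalize=True))  # share of wins per prompt variant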

4. Example Execution

The evaluation/examples/ directory contains example code for each dataset.

python evaluation/examples/<dataset>_example.py
  • Install the dependencies in requirements.txt and set up the .env environment variables before running the examples.

🚀 Quick Start

Get up and running with ZERA in under 5 minutes! ⚡

1. Clone and Setup

git clone https://github.com/younatics/zera-agent.git
cd zera-agent
pip install -r requirements.txt

2. Configure API Keys

Create a .env file in the project root with your API keys:

# Required API Keys
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
SOLAR_API_KEY=your_solar_api_key_here
SOLAR_STRAWBERRY_API_KEY=your_solar_strawberry_api_key_here

# Optional: Local model configuration
LOCAL_MODEL_ENDPOINT=http://localhost:8000/v1
LOCAL_MODEL_API_KEY=your_local_api_key_here

# Optional: Slack notifications
SLACK_WEBHOOK_URL=your_slack_webhook_url_here
SLACK_CHANNEL=#experiments

Note: You only need to set the API keys for the models you plan to use.
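
If you want to verify that the keys are visible to Python before launching an experiment, a quick check with python-dotenv looks like the sketch below (assuming python-dotenv is installed, e.g. via requirements.txt):

# Sanity-check that the .env keys are loadable (assumes the python-dotenv package).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "SOLAR_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")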

3. Run Your First Experiment

# Quick test with BBH dataset
python scripts/run_prompt_tuning.py \
  --dataset bbh \
  --total_samples 10 \
  --iterations 3 \
  --model solar

4. Explore with Web UI

streamlit run agent/app/streamlit_app.py

Installation and Execution

  1. Install dependencies

    pip install -r requirements.txt
    
  2. Set environment variables
    Add your OpenAI, Anthropic, and other API keys to the .env file

  3. Run web UI

    streamlit run agent/app/streamlit_app.py
    
  4. Run CLI tools (optional)

    # Run prompt tuning experiment
    python scripts/run_prompt_tuning.py --dataset bbh --total_samples 20 --iterations 5 --model solar
    
    # Run batch experiments
    python scripts/run_batch_experiments.py --config experiments_config.json
    
    # Update results
    python scripts/update_results.py

Usage Examples

  • Automatically generate optimal prompts for new tasks
  • Automate LLM benchmark experiments and result comparison
  • Prompt engineering research and experiments
  • Quantitative/qualitative prompt performance comparison using various evaluation methods (LLM, BERT, LLM Judge)

Troubleshooting

Common Issues

🔑 API Key Errors

Error: No API key found for model 'solar'

Solution: Ensure your .env file contains the correct API key for the model you're using.

📦 Import Errors

ModuleNotFoundError: No module named 'agent'

Solution: Make sure you're running commands from the project root directory, not from subdirectories.

💾 Memory Issues

MemoryError: Unable to allocate array

Solution: Reduce the --total_samples or --iteration_samples parameters.

โฑ๏ธ Timeout Errors

RequestTimeout: Request timed out

Solution: Check your internet connection and API rate limits.

📊 Evaluation Errors

EvaluationError: Failed to evaluate response

Solution: Verify your evaluation prompts are properly formatted and the model can access them.

Getting Help

๐Ÿค Community & Contributing

Join the ZERA community and help shape the future of prompt engineering! 🌟

🚀 Get Involved

📧 Stay Connected

๐Ÿ† Contributors

We welcome contributions from the community! See our Contributing Guide for details on how to get involved.

Citation

If you use ZERA in your research, please cite our paper:

@inproceedings{yi2025zera,
  title={ZERA: Zero-init Instruction Evolving Refinement Agent -- From Zero Instructions to Structured Prompts via Principle-based Optimization},
  author={Yi, Seungyoun and Khang, Minsoo and Park, Sungrae},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  publisher={Association for Computational Linguistics}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


🎉 Ready to Transform Your Prompt Engineering?

ZERA is not just another tool; it's a revolution in how we approach AI prompting. 🚀

🚀 What's Next?

  1. 🎯 Try ZERA: Run your first experiment in minutes
  2. 📚 Read the Paper: Dive deep into the research
  3. 🌟 Star the Repo: Show your support
  4. 🤝 Contribute: Help shape the future of prompt engineering
  5. 📢 Share: Let others know about ZERA

🔮 The Future of Prompt Engineering

With ZERA, the era of manual prompt crafting is over. Welcome to the future where:

  • Zero instructions become expert-level prompts
  • Manual tuning becomes automated optimization
  • Trial and error becomes intelligent refinement
  • Domain expertise becomes universal capability

ZERA: Zero-init Instruction Evolving Refinement Agent
From Zero Instructions to Structured Prompts via Self-Refining Optimization


Ready to experience the future of prompt engineering? 🚀
