The First Joint System-User Prompt Optimization Agent
From Zero Instructions to Structured Prompts via Self-Refining Optimization
ZERA is the first-of-its-kind prompt auto-tuning agent that revolutionizes how we approach prompt engineering. Unlike traditional methods that require extensive manual crafting, ZERA starts from zero instructions and automatically evolves into high-performance, structured prompts through intelligent self-refinement.
- Zero to Hero: Start with minimal instructions, end with expert-level prompts
- Self-Evolving: Continuously improves prompts through automated critique and refinement
- Joint Optimization: Simultaneously optimizes both system and user prompts
- Lightning Fast: Achieves high-quality results with only 5-20 samples
- Principle-Based: Uses 8 evaluation principles for consistent quality
- Weighted Scoring: Adaptive importance weighting for each principle
- Model Agnostic: Works with any LLM (GPT-4, Claude, Solar, LLaMA, etc.)
| Task Type | Before (Zero Prompt) | After (ZERA Optimized) |
|---|---|---|
| Math Reasoning | "Solve this" | "You are an expert mathematician. Analyze the problem step-by-step, show your work clearly, and provide a comprehensive solution with explanations." |
| Code Generation | "Write code" | "You are a senior software engineer. Write clean, efficient, and well-documented code. Include error handling, edge cases, and follow best practices." |
| Text Summarization | "Summarize this" | "You are a professional editor. Create concise, accurate summaries that capture key points while maintaining readability and coherence." |
Congratulations! ZERA has been accepted to EMNLP 2025 Main Conference!
Title: ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization
Conference: EMNLP 2025 Main Track (Oral Presentation)
Status: Accepted
Authors: Seungyoun Yi, Minsoo Khang, Sungrae Park
- Joint Optimization: Unlike prior APO (Automatic Prompt Optimization) methods that only refine user prompts, ZERA jointly optimizes both system and user prompts.
- Principle-based Evaluation: Introduces eight general evaluation principles (Correctness, Reasoning Quality, Conciseness, etc.) with adaptive weighting to guide prompt refinement.
- Self-Refining Framework: Iterative loop of PCG (Principle-based Critique Generation) and MPR (Meta-cognitive Prompt Refinement) enables evolution from minimal "zero" prompts to structured, task-optimized prompts.
- Efficiency: Achieves high-quality prompts with only 5-20 samples and short iteration cycles.
ZERA has been extensively benchmarked and shows competitive performance compared to state-of-the-art methods:
- 5 LLMs: GPT-3.5, GPT-4o, LLaMA-3.1, Qwen-2.5, Mistral-7B
- 9 Datasets: MMLU, GSM8K, BBH, CNN/DailyMail, SAMSum, MBPP, HumanEval, TruthfulQA, HellaSwag
- Competitive Performance: Shows comparable or better results than recent APO methods
- Efficient Convergence: Achieves good results with minimal samples (5-20)
- Broad Applicability: Works across diverse domains without task-specific tuning
- Zero-Shot Capability: Starts from minimal instructions, no handcrafted prompts needed
Read the Full Paper (arXiv)
ZERA implements a revolutionary Self-Refining Optimization process that transforms minimal instructions into expert-level prompts through intelligent iteration.
ZERA's iterative prompt refinement process: PCG (Principle-based Critique Generation) → MPR (Meta-cognitive Prompt Refinement) → Enhanced Prompt
- PCG (Principle-based Critique Generation)
  - Evaluates prompt performance against 8 evaluation principles
  - Generates detailed critiques with scores, analysis, and suggestions
  - Provides weighted feedback based on principle importance
- MPR (Meta-cognitive Prompt Refinement)
  - Uses critiques to intelligently refine both system and user prompts
  - Leverages historical best prompts and prompt replay data
  - Maintains consistency and improves prompt quality iteratively
- Continuous Refinement Loop
  - Task samples → Inference → Evaluation → Critique → Refinement
  - Each iteration produces better prompts based on principle-based feedback
  - Rapid convergence to optimal prompts with minimal samples
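As a mental model, the loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the repository's implementation: `run_inference`, `pcg_critique`, and `mpr_refine` are hypothetical placeholders for the model call, the PCG stage, and the MPR stage.

```python
# Minimal sketch of ZERA's self-refining loop (illustrative only).
# run_inference, pcg_critique, and mpr_refine are hypothetical placeholders
# for the model call, Principle-based Critique Generation (PCG), and
# Meta-cognitive Prompt Refinement (MPR).

def optimize_prompts(samples, run_inference, pcg_critique, mpr_refine, iterations=5):
    system_prompt, user_prompt = "", ""  # zero-init: start from empty prompts
    best = {"score": float("-inf"), "system": system_prompt, "user": user_prompt}
    history = []  # prompt replay data consulted by MPR

    for _ in range(iterations):
        # 1) Inference: run the current prompt pair on a handful of task samples
        outputs = [run_inference(system_prompt, user_prompt, s) for s in samples]

        # 2) PCG: critique the outputs against the eight weighted principles
        critique = pcg_critique(samples, outputs)  # e.g., {"score": ..., "suggestions": ...}

        if critique["score"] > best["score"]:
            best = {"score": critique["score"], "system": system_prompt, "user": user_prompt}
        history.append((system_prompt, user_prompt, critique))

        # 3) MPR: jointly rewrite the system and user prompts from the critique,
        #    the best prompts so far, and the replay history
        system_prompt, user_prompt = mpr_refine(critique, best, history)

    return best["system"], best["user"]
```

In the actual agent, the critique carries per-principle scores and suggestions, and MPR rewrites the system and user prompts jointly rather than independently.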
ZERA evaluates prompts using eight comprehensive principles with adaptive weighting that adjusts based on task requirements and performance:
| Principle | Description | Focus Area |
|---|---|---|
| Meaning | Captures key details and core information | Content accuracy |
| Completeness | Covers all essential aspects comprehensively | Information coverage |
| Expression | Uses appropriate tone and style | Communication quality |
| Faithfulness | Stays true to source without fabrication | Source adherence |
| Conciseness | Maintains brevity while being complete | Efficiency |
| Correctness | Provides accurate and factual information | Factual accuracy |
| Structural | Organizes content in logical structure | Organization |
| Reasoning | Demonstrates clear logical thinking | Logical flow |
Each principle contributes to the overall prompt quality score, with weights that dynamically adjust to guide the refinement process toward optimal performance.
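For intuition, the weighted aggregation can be expressed as a small helper like the one below. The principle names follow the table, but the 0-10 scale and the normalization are assumptions made for illustration, not the paper's exact scoring rule.

```python
# Illustrative weighted aggregation of the eight principle scores.
# The 0-10 scale and normalization below are assumptions for demonstration,
# not the paper's exact scoring rule.

PRINCIPLES = ["meaning", "completeness", "expression", "faithfulness",
              "conciseness", "correctness", "structural", "reasoning"]

def weighted_prompt_score(scores: dict, weights: dict) -> float:
    """Combine per-principle scores into a single weighted quality score."""
    total_weight = sum(weights[p] for p in PRINCIPLES)
    return sum(weights[p] * scores[p] for p in PRINCIPLES) / total_weight

# Example: up-weight correctness and reasoning for a math-heavy task.
weights = {p: 1.0 for p in PRINCIPLES}
weights.update({"correctness": 2.0, "reasoning": 1.5})
scores = {p: 7.0 for p in PRINCIPLES}
print(round(weighted_prompt_score(scores, weights), 2))  # -> 7.0
```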
```
agent/
  app/                        # Streamlit-based web UI and state management
  common/                     # Common utilities including API clients
  core/                       # Core logic for prompt tuning and iteration result management
  dataset/                    # Various benchmark datasets and data loaders
  prompts/                    # System/user/meta prompt templates
  test/                       # Unit test code
  __init__.py                 # Package initialization
evaluation/
  base/                       # Common base for evaluation system and execution scripts
  dataset_evaluator/          # Dataset-specific evaluators (LLM-based)
    bert/                     # BERTScore-based prompt comparison
    llm_judge/                # LLM Judge-based comparison results
  llm_judge/                  # LLM Judge evaluation result CSVs
  examples/                   # Evaluation and tuning example code
  results/                    # Evaluation result storage
  samples/                    # Sample data
scripts/                      # Command-line interface tools and utilities
  run_prompt_tuning.py        # CLI for prompt tuning experiments
  run_batch_experiments.py    # Batch experiment execution
  update_results.py           # Result update utilities
  run_background.sh           # Background process management
```
- app/: Streamlit-based web interface and state management
- common/: Common client for communicating with various LLM APIs
- core/: Core logic for prompt auto-tuning and iteration result management
- dataset/: Various benchmark dataset loaders and data folders
- prompts/: System/user/meta/evaluation prompt templates
- test/: Prompt tuner test code
- base/: Common base classes for the evaluation system (BaseEvaluator) and execution scripts (main.py)
- dataset_evaluator/: LLM evaluators for each dataset (e.g., gsm8k_evaluator.py, mmlu_evaluator.py, etc.)
  - bert/: Prompt comparison using BERTScore and results (bert_compare_prompts.py, zera_score.json, base_score.json, etc.)
  - llm_judge/: LLM Judge-based comparison result storage
- llm_judge/: Comparison result CSVs generated by LLM Judge
- examples/: Dataset-specific evaluation/tuning example code and execution methods
- results/: Evaluation result storage folder
- samples/: Sample data
- Prompt Auto-tuning:
  - Iteratively improve system/user prompts to maximize LLM performance
  - Utilize meta-prompts to guide LLMs to directly improve prompts themselves
- Support for Various Models and Datasets:
  - Support for various models including OpenAI GPT, Anthropic Claude, Upstage Solar, and local LLMs
  - Built-in benchmark datasets including MMLU, GSM8K, CNN, MBPP, and TruthfulQA
- Automated Output Evaluation:
  - Automatically evaluate LLM outputs using 8 evaluation criteria (meaning, completeness, expression, faithfulness, conciseness, correctness, structural consistency, reasoning quality)
  - Improve prompts based on evaluation results
- Various Evaluation Methods:
  - LLM-based Evaluation: LLMs directly perform correctness assessment, scoring, and detailed evaluation for each dataset
  - BERTScore-based Evaluation: Compare output similarity (F1, Precision, Recall, etc.) between prompts using BERT embeddings
  - LLM Judge-based Evaluation: LLMs directly compare outputs from two prompts to determine the winner/loser and the reasons
- Web UI:
  - Intuitive experiment management and result visualization based on Streamlit
You can execute LLM evaluation with various datasets and prompts through evaluation/base/main.py.
```bash
python evaluation/base/main.py --dataset <dataset_name> --model <model_name> --model_version <version> \
  --base_system_prompt <existing_system_prompt> --base_user_prompt <existing_user_prompt> \
  --zera_system_prompt <zera_system_prompt> --zera_user_prompt <zera_user_prompt> \
  --num_samples <sample_count>
```

- Evaluation results are stored in evaluation/results/.
- You can compare prompt performance using various metrics such as accuracy, ROUGE, etc.
Running evaluation/dataset_evaluator/bert/bert_compare_prompts.py allows you to compare ZERA prompt and existing prompt outputs using BERTScore.

```bash
python evaluation/dataset_evaluator/bert/bert_compare_prompts.py
```

- Results are saved as comparison_results.csv.
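If you want to reproduce the core idea outside the provided script, the `bert-score` package can be used roughly as follows. This is a simplified sketch, and the JSON file names are placeholders rather than files shipped with the repository.

```python
# Simplified sketch of a BERTScore comparison between prompt outputs.
# Requires `pip install bert-score`; the JSON file names are placeholders,
# not files shipped with this repository.
import json
from bert_score import score

with open("zera_outputs.json") as f:   # outputs produced with ZERA prompts
    zera_outputs = json.load(f)        # list of strings
with open("references.json") as f:     # reference answers for the same samples
    references = json.load(f)          # list of strings

# score() returns per-sample precision/recall/F1 tensors.
precision, recall, f1 = score(zera_outputs, references, lang="en")
print(f"BERTScore F1 (mean): {f1.mean().item():.4f}")
```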
LLM Judge comparison results, where an LLM directly compares the outputs from two prompts (winner, reasons, etc.), are available in evaluation/llm_judge/comparison_results.csv.
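A quick way to tally the outcomes from that CSV is shown below; the `winner` column name is an assumption about the file's schema.

```python
# Quick win/loss tally from the LLM Judge CSV.
# The `winner` column name is an assumption about the file's schema.
import pandas as pd

df = pd.read_csv("evaluation/llm_judge/comparison_results.csv")
print(df["winner"].value_counts())
```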
The evaluation/examples/ directory contains example code for each dataset.

```bash
python evaluation/examples/<dataset>_example.py
```

- Requires requirements.txt installation and .env environment variable setup before running the examples.
Get up and running with ZERA in under 5 minutes!
```bash
git clone https://github.com/younatics/zera-agent.git
cd zera-agent
pip install -r requirements.txt
```

Create a .env file in the project root with your API keys:
```
# Required API Keys
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
SOLAR_API_KEY=your_solar_api_key_here
SOLAR_STRAWBERRY_API_KEY=your_solar_strawberry_api_key_here

# Optional: Local model configuration
LOCAL_MODEL_ENDPOINT=http://localhost:8000/v1
LOCAL_MODEL_API_KEY=your_local_api_key_here

# Optional: Slack notifications
SLACK_WEBHOOK_URL=your_slack_webhook_url_here
SLACK_CHANNEL=#experiments
```

Note: You only need to set the API keys for the models you plan to use.
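If you want to double-check that the keys are visible to your own Python scripts, a common pattern is to load them with `python-dotenv`; this is a general illustration of that pattern, not a ZERA requirement.

```python
# Sanity-check that API keys from .env are visible to Python.
# General pattern assuming `pip install python-dotenv`; not a ZERA requirement.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "SOLAR_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```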
```bash
# Quick test with BBH dataset
python scripts/run_prompt_tuning.py \
  --dataset bbh \
  --total_samples 10 \
  --iterations 3 \
  --model solar
```

Launch the web UI:

```bash
streamlit run agent/app/streamlit_app.py
```
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set environment variables: enter OpenAI, Anthropic, etc. API keys in the .env file

- Run the web UI

  ```bash
  streamlit run agent/app/streamlit_app.py
  ```

- Run CLI tools (optional)

  ```bash
  # Run prompt tuning experiment
  python scripts/run_prompt_tuning.py --dataset bbh --total_samples 20 --iterations 5 --model solar

  # Run batch experiments
  python scripts/run_batch_experiments.py --config experiments_config.json

  # Update results
  python scripts/update_results.py
  ```
- Automatically generate optimal prompts for new tasks
- Automate LLM benchmark experiments and result comparison
- Prompt engineering research and experiments
- Quantitative/qualitative prompt performance comparison using various evaluation methods (LLM, BERT, LLM Judge)
- Error: No API key found for model 'solar'
  Solution: Ensure your .env file contains the correct API key for the model you're using.
- ModuleNotFoundError: No module named 'agent'
  Solution: Make sure you're running commands from the project root directory, not from subdirectories.
- MemoryError: Unable to allocate array
  Solution: Reduce the --total_samples or --iteration_samples parameters.
- RequestTimeout: Request timed out
  Solution: Check your internet connection and API rate limits.
- EvaluationError: Failed to evaluate response
  Solution: Verify your evaluation prompts are properly formatted and the model can access them.
- GitHub Issues: Report bugs and request features
- Discussions: Join community discussions
- Documentation: Check the scripts/README.md for CLI usage details
Join the ZERA community and help shape the future of prompt engineering!
- Report Bugs: GitHub Issues
- Request Features: Feature Requests
- Ask Questions: Q&A Discussions
- Contribute Code: Pull Requests
- Improve Docs: Documentation PRs
- LinkedIn: Seungyoun Yi
We welcome contributions from the community! See our Contributing Guide for details on how to get involved.
If you use ZERA in your research, please cite our paper:
```bibtex
@inproceedings{yi2025zera,
  title={ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization},
  author={Yi, Seungyoun and Khang, Minsoo and Park, Sungrae},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  publisher={Association for Computational Linguistics}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
ZERA is not just another tool; it's a revolution in how we approach AI prompting.
- Try ZERA: Run your first experiment in minutes
- Read the Paper: Dive deep into the research
- Star the Repo: Show your support
- Contribute: Help shape the future of prompt engineering
- Share: Let others know about ZERA
With ZERA, the era of manual prompt crafting is over. Welcome to the future where:
- Zero instructions become expert-level prompts
- Manual tuning becomes automated optimization
- Trial and error becomes intelligent refinement
- Domain expertise becomes universal capability
ZERA: Zero-init Instruction Evolving Refinement Agent
From Zero Instructions to Structured Prompts via Self-Refining Optimization
Ready to experience the future of prompt engineering?