Skip to content

microsoft/SkillOpt

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills like you train neural networks — with epochs, (mini-)batchsize, learning rates, and validation gates — but without touching model weights.

Project Page Paper Project Video Python 3.10+ License: MIT

🎬 SkillOpt Demo Video

64c8f76086bed7bd7a5ce664a7a14f40_raw.mp4

▶ Watch the full demo on YouTube


Install

Requirements: Python 3.10+

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# For ALFWorld benchmark (optional):
pip install -e ".[alfworld]"
alfworld-download

Configure API Credentials

cp .env.example .env
# Edit .env with your API credentials, then:
source .env

Azure OpenAI (recommended):

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth (no API key needed)
export AZURE_OPENAI_AUTH_MODE="azure_cli"

Note: AZURE_OPENAI_ENDPOINT is required for all three modes (api_key, azure_cli, openai_compatible). Without it, all LLM calls will fail.

OpenAI-compatible endpoints:

export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
export AZURE_OPENAI_API_KEY="sk-..."
export AZURE_OPENAI_AUTH_MODE="openai_compatible"

This routes all calls through the plain OpenAI Python client (no Azure auth, no api-version header).

Note: SkillOpt reuses the AZURE_OPENAI_* env var names even in this mode — there is no separate OPENAI_API_KEY knob.

Anthropic Claude:

export ANTHROPIC_API_KEY="sk-ant-..."

Qwen (local vLLM):

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"

Data Preparation

SkillOpt expects data in a split directory with train/, val/, test/ subdirectories, each containing a JSON file (e.g., items.json).

data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json

Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:

[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]

See skillopt/envs/<benchmark>/dataloader.py for the exact format each benchmark expects.

Note: Benchmark datasets are not included in this repository. Prepare your own data following the format above.

Supported Benchmarks

Benchmark Type Config
SearchQA QA configs/searchqa/default.yaml
ALFWorld Embodied agent configs/alfworld/default.yaml
DocVQA Document QA configs/docvqa/default.yaml
LiveMathematicianBench Math configs/livemathematicianbench/default.yaml
SpreadsheetBench Code generation configs/spreadsheetbench/default.yaml
OfficeQA Tool-augmented QA configs/officeqa/default.yaml

Quick Start

Training

# Minimal example — train on SearchQA:
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on LiveMathematicianBench:
python scripts/train.py \
    --config configs/livemathematicianbench/default.yaml \
    --split_dir /path/to/your/livemath_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on ALFWorld:
python scripts/train.py \
    --config configs/alfworld/default.yaml \
    --split_dir /path/to/your/alfworld_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Key CLI arguments:

Argument Description Example
--config Benchmark config YAML configs/searchqa/default.yaml
--split_dir Path to data split directory /path/to/split
--azure_openai_endpoint Azure OpenAI endpoint URL https://your-resource.openai.azure.com/
--optimizer_model Optimizer model deployment name gpt-5.5
--target_model Target model deployment name gpt-5.5
--num_epochs Number of training epochs 4
--batch_size Batch size per step 40
--workers Parallel rollout workers 8
--out_root Output directory outputs/my_run

Eval Only

Evaluate a trained skill on specific data splits without training:

# Evaluate on test set only:
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

# Evaluate on all splits (train + val + test):
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split all \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/
Split Description
valid_unseen Test set
valid_seen Validation set
train Training set
all All splits combined (default)

Output Structure

Each run writes to a structured output directory:

outputs/<run_name>/
├── config.json              # Flattened runtime config
├── history.json             # Per-step training history
├── runtime_state.json       # Resume checkpoint
├── best_skill.md            # Best validated skill document
├── skills/skill_vXXXX.md   # Skill snapshot per step
├── steps/step_XXXX/        # Per-step artifacts (patches, evals)
├── slow_update/epoch_XX/   # Slow update logs
└── meta_skill/epoch_XX/    # Meta skill logs

Re-running the same command auto-resumes from the last completed step.


Community-contributed configs

These are not default SkillOpt settings — they are reference configs contributed by users for specific scenarios. The paper-reported numbers were obtained with the default settings, not these.

  • configs/examples/soft_gate.yaml (PR #25, contributed by @lvbaocheng) — switches the validation gate from exact-match (hard) to soft / partial-credit (soft or mixed). Useful when the held-out selection split is small (e.g. ≤ ~10 items) and the reward is continuous, where the discrete hard gate often rejects every candidate and training stalls. See the comment at the top of the file for details and when not to use it.

WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app
Flag Default Description
--port 7860 Server port
--host 0.0.0.0 Bind address
--share off Create a public Gradio share link
# With public share link (useful for remote servers)
python -m skillopt_webui.app --share

Citation

@misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills}, 
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}
}

About

SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors