FlowSteer studies Agent Designing Agentic Workflows: a lightweight policy agent designs a workflow graph, and a downstream executor LLM runs that workflow to solve the task. The current repository is aligned with the arXiv v4 formulation, which centers on three ideas:
- Workflow Canvas: an executable graph-state environment that maintains the workflow, checks each atomic edit, executes operators, and returns feedback.
- Designer--Executor decoupling: the Flow-Director designs the workflow, while a pluggable executor backend runs the designed graph.
- Reinforced Progressive Canvas Editing: the Flow-Director commits one atomic edit per turn and is trained end-to-end with a canvas-masked GRPO objective and a diversity-constrained reward.
At each turn, the Flow-Director observes the task, operator library, workflow state, and canvas feedback. It emits a brief reflection plus exactly one action. The canvas applies that action, validates the graph, executes available nodes when needed, and appends feedback for the next turn.
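The turn structure described above can be illustrated with a toy loop. This is a minimal sketch, not the repository's implementation: `ToyCanvas`, `parse_single_action`, `run_episode`, `scripted_policy`, the `<action>` tag, and the operator names `Generate` and `Verify` are all hypothetical stand-ins. The real environment and parser live in src/interactive/workflow_env.py and src/interactive/action_parser.py.

```python
import re
from dataclasses import dataclass, field

MAX_ROUNDS = 20  # mirrors the "max 20 rounds" interaction limit

# Hypothetical tag name; the real action format is defined by the repository's parser.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_single_action(text: str) -> str:
    """Return the single action in `text`, or raise if there is not exactly one."""
    actions = ACTION_RE.findall(text)
    if len(actions) != 1:
        raise ValueError(f"expected exactly one action, found {len(actions)}")
    return actions[0].strip()


@dataclass
class ToyCanvas:
    """Stand-in for the Workflow Canvas: holds nodes and a feedback log."""
    nodes: list = field(default_factory=list)
    feedback: list = field(default_factory=list)
    done: bool = False

    def step(self, action: str) -> str:
        # Apply one atomic edit, validate it, and record feedback for the next turn.
        if action == "finish":
            self.done = bool(self.nodes)
            msg = "ok: workflow finalized" if self.done else "error: empty workflow"
        elif action.startswith("add "):
            self.nodes.append(action[4:])
            msg = f"ok: added {action[4:]}"
        else:
            msg = f"error: unknown action {action!r}"
        self.feedback.append(msg)
        return msg


def run_episode(policy, task: str, canvas: ToyCanvas) -> ToyCanvas:
    """Run up to MAX_ROUNDS turns of observe -> one action -> canvas feedback."""
    for _ in range(MAX_ROUNDS):
        observation = {
            "task": task,
            "nodes": list(canvas.nodes),
            "feedback": list(canvas.feedback),
        }
        try:
            action = parse_single_action(policy(observation))
        except ValueError as exc:
            # A malformed turn becomes feedback instead of an edit.
            canvas.feedback.append(f"error: {exc}")
            continue
        canvas.step(action)
        if canvas.done:
            break
    return canvas


def scripted_policy(observation: dict) -> str:
    # Toy policy: add two operators, then finish.
    plan = ["add Generate", "add Verify", "finish"]
    turn = len(observation["feedback"])
    return f"<reflection>turn {turn}</reflection><action>{plan[min(turn, 2)]}</action>"
```

Calling `run_episode(scripted_policy, "toy task", ToyCanvas())` walks through three turns: two edits and one finish action, each followed by a feedback entry that the next observation includes.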

| Path | Purpose |
|---|---|
| train_interactive.py | training entry point for multi-turn canvas editing |
| eval_only.py | inference/evaluation entry point |
| merge_and_upload.py | LoRA merge and upload utility |
| config/training_interactive.yaml | paper-aligned training configuration |
| config/aflow_llm.yaml.example | executor backend configuration template |
| config/operator.json | operator descriptions |
| scripts/operators.py | operator implementations |
| src/interactive/workflow_env.py | Workflow Canvas environment |
| src/interactive/workflow_graph.py | graph state and structure checks |
| src/interactive/action_parser.py | XML/action parsing |
| src/interactive/grpo_trainer.py | GRPO utilities |
| src/interactive/trajectory_reward.py | diversity-gated reward |
| figs/ | figures synchronized with the arXiv v4 manuscript |

- Python 3.10+
- CUDA-capable GPU
- vLLM with LoRA serving enabled
- A local or API executor backend configured through config/aflow_llm.yaml
The paper experiments use Qwen3-8B as the Flow-Director policy model, LoRA fine-tuning, bfloat16 precision, and a GPT-OSS-120B executor backend.
git clone https://github.com/beita6969/FlowSteer.git
cd FlowSteer
conda create -n flowsteer python=3.10 -y
conda activate flowsteer
pip install -r requirements.txt
pip install "vllm>=0.6.0"

The hosted dataset can be downloaded from Hugging Face:
python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="beita6969/FlowSteer-Dataset",
    repo_type="dataset",
    local_dir="data",
    allow_patterns=["train/train_12k.jsonl", "eval/*.jsonl"],
    endpoint="https://huggingface.co",
)
PY

The paper evaluates 12 datasets: six IID datasets for training/testing and six OOD datasets for generalization.
IID: GSM8K, MATH, HotPotQA, SQuAD v2, MBPP, HumanEval
OOD: TriviaQA, NaturalQuestions, MathQA, AIME 2025, APPS, DS-1000
The arXiv v4 appendix specifies the paper training recipe as 10,778 IID training instances: 2,560 each from GSM8K, MATH, HotPotQA, and SQuAD v2, plus 374 MBPP and 164 HumanEval examples. The public dataset repository also provides evaluation JSONL files under data/eval/ for the 12 benchmark families.
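As a quick arithmetic check, the per-dataset counts stated above do sum to the reported total. The dictionary below only restates the numbers from the text; the variable names are illustrative.

```python
# Per-dataset training counts as stated in the text above.
train_counts = {
    "GSM8K": 2560,
    "MATH": 2560,
    "HotPotQA": 2560,
    "SQuAD v2": 2560,
    "MBPP": 374,
    "HumanEval": 164,
}

total = sum(train_counts.values())
print(total)  # 10778
```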
Create the executor configuration from the template:
cp config/aflow_llm.yaml.example config/aflow_llm.yaml

For an OpenAI-compatible local executor service, set:
models:
  gpt-oss-120b:
    api_type: openai
    base_url: http://127.0.0.1:8004/v1
    api_key: EMPTY
    model_name: gpt-oss-120b
    temperature: 0
    top_p: 1
    max_tokens: 4096

Then ensure config/training_interactive.yaml points to the same executor model name:
aflow_executor_model: "gpt-oss-120b"

Serve the Flow-Director policy model with vLLM:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model /path/to/Qwen3-8B \
  --served-model-name Qwen3-8B \
  --port 8003 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384 \
  --enable-lora \
  --max-loras 2 \
  --max-lora-rank 64 \
  --trust-remote-code \
  --dtype bfloat16

Then start training:
CUDA_VISIBLE_DEVICES=0 python train_interactive.py \
  --config config/training_interactive.yaml

Important paper-aligned defaults are already set in config/training_interactive.yaml:
| Category | Setting |
|---|---|
| Policy model | Qwen3-8B |
| LoRA | rank 64, alpha 64, dropout 0.05, q/k/v/o projections |
| RL objective | GRPO with canvas token mask |
| Samples per group | 36 |
| Clip / KL | 0.20 / 0.005 |
| Generation | temperature 0.6, top-p 0.95, top-k 20, max new tokens 2048 |
| Interaction | max 20 rounds |
| Reward | base -1.0, diversity cap 1.0, correctness released only after full structural reward |
| Executor timeout | 600 seconds |
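
To make the "GRPO with canvas token mask" and "Clip" rows more concrete, here is a minimal pure-Python sketch of group-standardized advantages and a masked, clipped surrogate loss. It follows the generic GRPO recipe and is not taken from src/interactive/grpo_trainer.py. The function names are hypothetical, the KL penalty is omitted, and the reading of the mask as "exclude tokens inserted by the canvas, keep tokens generated by the policy" is an assumption.

```python
import math
from statistics import mean, pstdev

CLIP_EPS = 0.20  # clip range from the table above


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_clipped_loss(logp_new, logp_old, mask, advantage, clip_eps=CLIP_EPS):
    """Clipped surrogate loss for one trajectory, averaged over unmasked tokens.

    mask[i] is 1 for tokens assumed to be generated by the policy and 0 for
    tokens assumed to come from the canvas, which contribute nothing here.
    """
    total, count = 0.0, 0
    for new, old, m in zip(logp_new, logp_old, mask):
        if not m:
            continue
        ratio = math.exp(new - old)  # importance ratio between new and old policy
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        total += -min(ratio * advantage, clipped * advantage)
        count += 1
    return total / max(count, 1)
```

With the table's setting of 36 samples per group, `group_advantages` would receive 36 trajectory rewards for the same task, and each trajectory's loss would use its own advantage.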
Evaluate a single benchmark file:
python eval_only.py \
  --config config/training_interactive.yaml \
  --data data/eval/gsm8k.jsonl \
  --num-samples 128 \
  --workers 16

Evaluate with a served LoRA adapter by starting vLLM with the adapter first, then passing the served adapter name:
python eval_only.py \
  --config config/training_interactive.yaml \
  --data data/eval/humaneval.jsonl \
  --vllm-model flowsteer-adapter \
  --workers 16

--checkpoint is recorded for diagnosis only; the adapter must already be loaded by the vLLM server.
The released Flow-Director model is hosted at:
https://huggingface.co/beita6969/FlowSteer-8b
This repository is released for research use. Please also follow the licenses and terms of the upstream models, datasets, and benchmark suites used with FlowSteer.

