Cross-framing heavy reasoning for Claude Code. It fixes the two failure modes documented in the HeavySkill paper (Appendix A "diversity collapse", Appendix B "iteration drift") and ships as five composable skills with a single user-facing entry: /divergent-think.
The HeavySkill paper showed that parallel-reasoning + sequential-deliberation outperforms majority-voting / Best-of-N — but it also documented two limitations:
| Paper finding | Limitation | Fix this repo ships |
|---|---|---|
| Appendix A | K parallel trajectories share the same prompt → Max-Diversity selection ≈ Random selection | auto-reframe generates K=6 axis-disjoint framings so each trajectory enters the problem through a structurally different conceptual lens |
| Appendix B | Iterative deliberation: HM@K rises but HP@K falls (in-frame noise accumulates) | frame-critic injects a fresh axis between iterations; sees only framings + summaries (never trajectory bodies), enforced by sub-agent context isolation |
The divergent-think orchestrator combines both into one end-to-end pipeline; the original single-frame heavyskill is kept as a baseline for already-canonical problems (competition math, well-posed STEM).
Two complementary ways to use this repo:
| Mode | For who | Entry point |
|---|---|---|
| A — Claude Code skill (recommended for interactive use) | Anyone using Claude Code as their primary harness | /divergent-think <query> |
| B — Python workflow (for paper repro & batch benchmarking) | Researchers running ablations against open-weight models | python scripts/run_divergent.py ... |
Both modes implement the same pipeline shape and the same hard limits (K=6 framings, K¹=4 deliberation samples, N_max=3 critic iterations) defined in the paper.
┌──────────────────────────────────────────────┐
User query ──▶│ divergent-think (orchestrator) │
└──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
Stage 1 │ Reframe in-context │ reads prompts/01-reframe.md
│ → 6 framings, strict JSON │
│ (domain / abstraction / actor /│
│ goal / scale / analogy) │
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
Stage 2 │ Spawn 6 Agents in parallel │ reads prompts/02-worker.md
│ each with ONE framing │ ─▶ 6 trajectories
│ Deliberate in-context (×4) │ reads prompts/03-deliberation.md
│ → summary[iter] │
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
Stage 3 │ Spawn 1 Agent (frame-critic) │ reads prompts/04-critic.md
│ sees ONLY framings + summaries │ ─▶ STOP | CONTINUE
│ (NEVER trajectory bodies) │
└──────────────────────────────────┘
│
┌───────────┴──────────────┐
│ │
STOP ▼ ▼ CONTINUE (≤ N_max=3)
Final answer Stage 4 — Partial re-run:
spawn 1 Agent for the new
framing only, re-deliberate,
loop to Stage 3
| Skill | Role | User-invocable? |
|---|---|---|
| divergent-think | Orchestrator: only this skill executes the pipeline | ✅ Primary entry point |
| heavyskill | Single-frame baseline (K=3 parallel on the same prompt) | ✅ For canonical math/STEM |
| auto-reframe | Narrative reference for Stage 1 | ❌ Documentation only |
| heavy-think-divergent | Narrative reference for Stage 2 | ❌ Documentation only |
| frame-critic | Narrative reference for Stage 3 | ❌ Documentation only |
The orchestrator reads these at runtime — they are the single source of truth for prompt content:
.claude/skills/divergent-think/prompts/
├── 01-reframe.md # Stage 1 protocol (6 axes + anti-anchoring + JSON schema)
├── 02-worker.md # Stage 2 per-framing worker prompt
├── 03-deliberation.md # Stage 2 cross-framing synthesis prompt
└── 04-critic.md # Stage 3 STOP/CONTINUE prompt
Edit these files to tune behavior. Do not put prompt content in the SKILL.md files: as of v0.3.0 they contain no inlined prompts, and the orchestrator reads the files above at each stage.
Download or copy the bundled tarball (dist/heavyskill-skill-v0.3.0.tar.gz) and extract it into either your project's .claude/skills/ or your user-level ~/.claude/skills/:
# Project-level install (recommended while iterating)
mkdir -p .claude/skills
tar -xzvf heavyskill-skill-v0.3.0.tar.gz -C .claude/skills/
# OR user-level install (available across all your projects)
mkdir -p ~/.claude/skills
tar -xzvf heavyskill-skill-v0.3.0.tar.gz -C ~/.claude/skills/
After extraction you should see five skill folders (divergent-think/, heavyskill/, auto-reframe/, heavy-think-divergent/, frame-critic/) and the divergent-think/prompts/ subdirectory with four .md files.
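If you prefer to check the extraction with a script rather than by eye, the following is a minimal sketch. It assumes the five folders sit directly under the skills root, as described above; the helper name missing_paths is made up for this example and is not part of the repo.

```python
from pathlib import Path

# Folder and file names are taken from the install notes above.
SKILLS = ["divergent-think", "heavyskill", "auto-reframe",
          "heavy-think-divergent", "frame-critic"]
PROMPTS = ["01-reframe.md", "02-worker.md", "03-deliberation.md", "04-critic.md"]


def missing_paths(skills_root: Path) -> list[str]:
    """Return the expected folders/files that do not exist under skills_root."""
    expected = [skills_root / name for name in SKILLS]
    prompts_dir = skills_root / "divergent-think" / "prompts"
    expected += [prompts_dir / name for name in PROMPTS]
    return [str(p.relative_to(skills_root)) for p in expected if not p.exists()]


if __name__ == "__main__":
    # Assumed project-level install location; use ~/.claude/skills for user-level.
    missing = missing_paths(Path(".claude/skills"))
    print("install looks complete" if not missing else f"missing: {missing}")
```

An empty result means all five skill folders and the four prompt files landed where the orchestrator expects them.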
In a fresh Claude Code session inside the project:
/divergent-think 設計一個讓 SaaS 用戶留存率上升的功能(不能讓 DAU 下降)
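The example query above is in Chinese and asks for a feature that raises SaaS user retention without lowering DAU. You should observe: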
- Stage 1's 6-framing JSON emitted inline by the orchestrator
- Six parallel Agent tool calls in the terminal (one per axis); this is the key visible signal that the pipeline is working
- A 7th Agent call for the frame-critic
- A clean Chinese-language final answer (matching query language), no meta-narration prefix
If you only see Stage 1 JSON and no parallel Agents → the orchestrator isn't reading prompts/02-worker.md. Re-extract the tarball and confirm the prompts/ directory landed correctly.
/divergent-think <multi-vector reasoning query> # full divergent pipeline (1–3 min)
/heavyskill <canonical math/STEM query> # single-frame baseline (faster, cheaper)
The orchestrator auto-detects when a query is canonical and may hand off to heavyskill itself.
git clone https://github.com/wjn1996/HeavySkill.git
cd HeavySkill
pip install -e .
Run the divergent pipeline (matches divergent-think skill semantics with K=6, K¹=4, N_max=3 defaults):
python scripts/run_divergent.py \
--query "Find the number of paths of length 16 on an 8x8 grid that change direction exactly four times." \
--model "deepseek-r1" \
--api_base "http://localhost:8080" \
--output "outputs/divergent_result.json" \
--verbose
Run the single-frame baseline (matches heavyskill skill):
python scripts/run_heavyskill.py \
--query "Your problem here" \
--model "deepseek-r1" \
--api_base "http://localhost:8080" \
--reason_k 8 --summary_k 4 \
--output "outputs/baseline_result.json"
Using a separate deliberation model:
python scripts/run_heavyskill.py \
--query "..." \
--model "r1-distill-qwen-7b" --api_base "http://localhost:8080" \
--summary_model "qwen3-32b" --summary_api_base "http://localhost:8081" \
--reason_k 16 --summary_k 4
Batch evaluation:
python scripts/run_heavyskill.py \
--input_file "examples/example_math.json" \
--model "deepseek-r1" --api_base "http://localhost:8080" \
--output "outputs/batch_result.json"
The Python pipeline supports any OpenAI-compatible endpoint: vLLM, DeepSeek API, Together AI, OpenRouter, local Ollama, etc.
Orchestrator reads prompts/01-reframe.md and executes the protocol in its own context. Produces 6 framings, one per required axis (domain, abstraction, actor, goal, scale, analogy). Anti-anchoring guards ensure the framing names reuse ≤ 30% of the query's content tokens — this prevents the "security audit" framing of a security-audit query.
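To make the 30% rule concrete, here is a rough sketch of one way to measure reuse. It assumes the ratio is the share of the query's content tokens that reappear in the framing name, which is only one reading of the rule; the tokenizer, the stopword list, and the function names are illustrative and not taken from prompts/01-reframe.md or workflow/divergent/.

```python
import re

# Tiny illustrative stopword list (an assumption, not the repo's list).
STOPWORDS = {"a", "an", "the", "of", "our", "to", "for", "in", "on",
             "and", "or", "is", "are", "that", "with"}


def content_tokens(text: str) -> set[str]:
    """Lowercase ASCII word tokens minus stopwords (non-Latin text needs a real tokenizer)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def reuse_ratio(query: str, framing_name: str) -> float:
    """Share of the query's content tokens that also appear in the framing name."""
    query_tokens = content_tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & content_tokens(framing_name)) / len(query_tokens)


def passes_anti_anchoring(query: str, framing_name: str, max_reuse: float = 0.30) -> bool:
    """True when the framing name stays at or under the reuse threshold."""
    return reuse_ratio(query, framing_name) <= max_reuse


query = "Run a security audit of our login flow"
print(passes_anti_anchoring(query, "security audit"))          # reuses 2 of 5 content tokens
print(passes_anti_anchoring(query, "immune system response"))  # reuses none
```

Under these assumptions the "security audit" framing is rejected because it echoes the query, while a framing from an unrelated vocabulary passes.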
Orchestrator reads prompts/02-worker.md, substitutes the 5 placeholders per framing, and dispatches 6 Agent tool calls in parallel in a single response. Each sub-agent reasons in its framing's vocabulary, then translates back to the original query. Trajectories return into the orchestrator's context.
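In Mode B terms, the fan-out could look roughly like the sketch below. The template text and its three placeholders are hypothetical stand-ins (the real prompts/02-worker.md defines its own five placeholders, which are not reproduced here), and run_worker is a stub in place of a model call.

```python
import asyncio
from string import Template

# Hypothetical stand-in for prompts/02-worker.md.
WORKER_TEMPLATE = Template("Axis: $axis\nFraming: $framing\nQuery: $query")


async def run_worker(prompt: str) -> str:
    """Stub sub-agent: a real worker would call a model endpoint here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"trajectory for <{prompt.splitlines()[0]}>"


async def dispatch(query: str, framings: dict[str, str]) -> dict[str, str]:
    """Fill the template once per framing and run all workers concurrently."""
    prompts = {
        axis: WORKER_TEMPLATE.substitute(axis=axis, framing=text, query=query)
        for axis, text in framings.items()
    }
    # asyncio.gather returns results in the order the awaitables were passed.
    results = await asyncio.gather(*(run_worker(p) for p in prompts.values()))
    return dict(zip(prompts.keys(), results))


if __name__ == "__main__":
    axes = ["domain", "abstraction", "actor", "goal", "scale", "analogy"]
    framings = {axis: f"placeholder framing for {axis}" for axis in axes}
    print(asyncio.run(dispatch("example query", framings)))
```

Because the workers are launched together, wall-clock time is bounded by the slowest trajectory rather than the sum of all six.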
Orchestrator reads prompts/03-deliberation.md and runs the synthesis itself (must hold all 6 trajectories at once — cannot delegate). Produces 4 samples; picks the most internally consistent. Each summary explicitly names a cross-framing combination that no single framing produced alone, or honestly reports "no genuine combination found".
Orchestrator reads prompts/04-critic.md and spawns ONE Agent call with {query, framings, summaries_history, iterations_done, n_max, axes_unused} — never the trajectory bodies. Sub-agent context isolation enforces the "critic outside the deliberation frame" contract. Critic returns strict JSON: STOP, or CONTINUE + new framing + axis to evict.
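A caller has to validate the critic's strict JSON before acting on it. The sketch below shows one possible check; the field names decision, evict_axis, and new_framing, and the example axis "time", are assumptions made for illustration, since the actual schema is defined in prompts/04-critic.md.

```python
import json


def parse_critic_reply(raw: str, current_axes: list[str],
                       iterations_done: int, n_max: int = 3) -> dict:
    """Validate a critic reply and normalise it to a small dict.

    Field names are hypothetical; see prompts/04-critic.md for the real schema.
    """
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("critic reply must be a JSON object")
    decision = reply.get("decision")
    if decision not in ("STOP", "CONTINUE"):
        raise ValueError(f"unexpected decision: {decision!r}")
    # Respect the hard iteration cap even if the critic asks to continue.
    if decision == "STOP" or iterations_done >= n_max:
        return {"decision": "STOP"}
    evict_axis = reply.get("evict_axis")
    new_framing = reply.get("new_framing")
    if evict_axis not in current_axes:
        raise ValueError(f"cannot evict unknown axis: {evict_axis!r}")
    if not isinstance(new_framing, dict) or "axis" not in new_framing:
        raise ValueError("CONTINUE requires a new_framing object with an axis")
    return {"decision": "CONTINUE", "evict_axis": evict_axis,
            "new_framing": new_framing}


axes = ["domain", "abstraction", "actor", "goal", "scale", "analogy"]
reply = ('{"decision": "CONTINUE", "evict_axis": "scale", '
         '"new_framing": {"axis": "time", "text": "placeholder"}}')
print(parse_critic_reply(reply, axes, iterations_done=1))
```

Forcing STOP once iterations_done reaches n_max keeps the N_max=3 limit in code rather than relying on the critic to honor it.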
On CONTINUE, swap one (framing, trajectory) pair: spawn ONE Agent for the new framing's trajectory only, re-run Stage 2 deliberation over the updated 6-tuple, then loop to Stage 3. Cost stays linear in N, not N × K.
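The swap itself is small. As a sketch, assuming framings and trajectories are held in dicts keyed by axis (an assumed data shape, not necessarily what workflow/divergent/ uses) and with a hypothetical partial_rerun helper:

```python
def partial_rerun(framings, trajectories, evict_axis, new_axis, new_framing, run_worker):
    """Return updated copies with one (framing, trajectory) pair swapped.

    framings / trajectories: dicts keyed by axis (assumed shape).
    run_worker: callable taking a framing string and returning a trajectory.
    """
    if evict_axis not in framings:
        raise KeyError(evict_axis)
    # Keep the five surviving pairs untouched; their trajectories are reused.
    new_framings = {a: f for a, f in framings.items() if a != evict_axis}
    new_trajectories = {a: t for a, t in trajectories.items() if a != evict_axis}
    new_framings[new_axis] = new_framing
    new_trajectories[new_axis] = run_worker(new_framing)  # the only new worker call
    return new_framings, new_trajectories


axes = ["domain", "abstraction", "actor", "goal", "scale", "analogy"]
framings = {a: f"framing {a}" for a in axes}
trajectories = {a: f"trajectory {a}" for a in axes}
updated_f, updated_t = partial_rerun(
    framings, trajectories, "scale", "time", "framing time",
    run_worker=lambda framing: f"trajectory for {framing}",
)
print(sorted(updated_t))
```

Only the new framing triggers a worker call, which is why each CONTINUE adds one worker plus one critic instead of a full K-way re-run.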
The final answer matches the original query's language and format conventions, with no "after deep thinking..." preamble. The user sees an answer that reads as if written directly in response to the query.
| Resource | Typical full run |
|---|---|
| Input + output tokens | ~200k – 400k (varies with query complexity) |
| Wall clock | 1–3 minutes (depends on sub-agent throughput) |
| Sub-agent calls | 6 (Stage 2 parallel) + 1 (Stage 3 critic), + up to N_max × (1 worker + 1 critic) on CONTINUE branches |
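The sub-agent row is simple arithmetic. The sketch below works it out for a given number of CONTINUE rounds and compares it with a hypothetical design that re-runs all K workers on every round; both function names are illustrative.

```python
def subagent_calls(continues: int, k: int = 6, n_max: int = 3) -> int:
    """Calls for a run with partial re-runs: K workers + 1 critic, then 2 per CONTINUE."""
    if not 0 <= continues <= n_max:
        raise ValueError(f"continues must be between 0 and {n_max}")
    return k + 1 + continues * (1 + 1)


def full_rerun_calls(continues: int, k: int = 6) -> int:
    """Hypothetical comparison: re-running all K workers plus the critic each round."""
    return (k + 1) * (1 + continues)


for n in range(4):
    print(n, subagent_calls(n), full_rerun_calls(n))
# 0 7 7
# 1 9 14
# 2 11 21
# 3 13 28
```

With the defaults, a run that stops at the first critic pass makes 7 sub-agent calls, and a run that uses all three CONTINUE rounds makes 13.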
The orchestrator tells the user "Running divergent-think (~1–3 min)" before starting so they can interrupt if metered-API cost is a concern. The hard max_total_tokens budget guard (set in the Python config at 1.5M) fires an early-STOP and returns the best summary so far if exceeded.
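The budget guard can be pictured as a running total checked after each iteration. This is a sketch under stated assumptions: each iteration is modeled as a (summary, score, tokens) tuple, and "best" means the highest score seen so far. The repo's actual accounting and selection rule may differ, and run_until_budget is a made-up name.

```python
def run_until_budget(iterations, max_total_tokens=1_500_000):
    """Consume (summary, score, tokens) tuples until the token budget is exceeded.

    Returns (best_summary, stopped_early). The scoring is an assumption;
    the 1.5M default mirrors the value quoted above.
    """
    used = 0
    best = None  # (score, summary) of the best iteration so far
    for summary, score, tokens in iterations:
        used += tokens
        if best is None or score > best[0]:
            best = (score, summary)
        if used > max_total_tokens:
            return best[1], True  # early STOP with the best summary so far
    return (best[1] if best else None), False


runs = [("summary A", 0.5, 900_000), ("summary B", 0.7, 700_000), ("summary C", 0.9, 100)]
print(run_until_budget(runs))  # ('summary B', True)
```

In the example the second iteration pushes usage to 1.6M tokens, so the loop stops before the third iteration and returns the better of the two summaries already produced.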
For canonical math/STEM where framing is already given, prefer /heavyskill — it runs K=3 parallel on a single prompt (no reframe, no critic), typically 1/3 the cost.
HeavySkill/
├── .claude/skills/ # Mode A: Claude Code skill bundle
│ ├── divergent-think/
│ │ ├── SKILL.md # Orchestrator logic (v0.3.0)
│ │ └── prompts/ # Executable prompt templates
│ │ ├── 01-reframe.md
│ │ ├── 02-worker.md
│ │ ├── 03-deliberation.md
│ │ └── 04-critic.md
│ ├── heavyskill/SKILL.md # Single-frame baseline
│ ├── auto-reframe/SKILL.md # Narrative reference for Stage 1
│ ├── heavy-think-divergent/SKILL.md # Narrative reference for Stage 2
│ └── frame-critic/SKILL.md # Narrative reference for Stage 3
│
├── workflow/ # Mode B: Python pipeline
│ ├── config.py # HeavySkillConfig dataclass
│ ├── pipeline.py # Single-frame orchestration
│ ├── prompts.py # Prompt templates (general / STEM, CN / EN)
│ ├── divergent/ # Divergent variant
│ │ ├── pipeline.py # Critic-driven outer loop
│ │ ├── reframer.py # auto-reframe equivalent
│ │ ├── cross_framing_deliberation.py
│ │ ├── frame_critic.py
│ │ ├── axes.py / distance.py / metrics.py / types.py
│ │ └── config.py # K=6, K¹=4, N_max=3 defaults
│ └── agent/openai_compatible.py # Async OpenAI-compatible client
│
├── scripts/
│ ├── run_heavyskill.py # Mode B single-frame CLI
│ ├── run_divergent.py # Mode B divergent CLI
│ ├── run_benchmark.py # Batch benchmark harness
│ ├── judge_arena.py # Arena-Hard auto-judge
│ └── evaluate.py # Accuracy evaluation utility
│
├── examples/
│ ├── example_math.json
│ ├── aime_hmmt_subset.json
│ ├── arena_hard_subset.json
│ └── ctf_seed_v0.json
│
├── paper/heavyskill.pdf # Paper (arXiv:2605.02396)
├── tests/ # pytest suite
├── dist/ # Skill tarball releases (gitignored)
├── pyproject.toml
└── README.md
Issues and PRs welcome. A few conventions to make review fast:
- Prompt edits belong in .claude/skills/divergent-think/prompts/*.md, not in any SKILL.md. The SKILL files describe orchestration and concepts; prompt content is in the runtime files.
- New axis or framing rule changes belong in prompts/01-reframe.md and also in the corresponding section of workflow/divergent/axes.py to keep the two modes consistent.
- Smoke test before sending a PR: run /divergent-think <a short multi-vector query> and confirm the terminal shows 6 parallel Agent calls + 1 critic Agent call. If it doesn't, the pipeline is broken.
- Hard limits (K=6, K¹=4, N_max=3) come from the paper; change them in the config file, not by hard-coding new numbers in prompts.
If you use this work, please cite the original paper:
@article{wang2026heavyskill,
title={HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness},
author={Wang, Jianing and Guo, Linsen and Chen, Zhengyu and Guo, Qi and Zang, Hongyu and Shi, Wenjie and Ma, Haoxiang and Xi, Xiangyu and Li, Xiaoyu and Wang, Wei and Cai, Xunliang},
journal={arXiv preprint arXiv:2605.02396},
year={2026},
url={https://arxiv.org/abs/2605.02396}
}
The Claude Code skill bundle in .claude/skills/ is an independent implementation of the paper's divergent variant (and its single-frame baseline) packaged for the Claude Code agentic harness. Bug reports against the skill bundle are tracked separately from paper errata.
Apache-2.0