This is one of the most important conceptual documents in MassGen. The system's power comes not from any single primitive, but from how they compose. Finding the right compositions — the right personas to build a rigorous plan, the right evaluation criteria to ensure every aspect of that plan is executed to a shockingly high standard, the right decomposition to let specialists own what they're best at — is what unlocks the full potential of multi-agent coordination.
MassGen provides composable primitives that each shape a different dimension of agent behavior. Some run as subagent spawns (separate MassGen execution outside the main coordination loop), some run as inline analysis (reusing existing agents), and one operates as per-round injection into the main loop.
| Primitive | Mechanism | What it shapes | Output | Injection point |
|---|---|---|---|---|
| Persona generation | Subagent spawn | WHO the agents are — perspective, values, approach | Per-agent persona text (strong + softened) | Prepended to system message each round |
| Evaluation criteria generation | Subagent spawn | WHAT quality means — task-specific checklist gates | E1..EN criteria (core/stretch) | Replaces default E1-E4 in checklist tool + system message |
| Task decomposition | Subagent spawn | WHAT each agent works on — subtask ownership | Agent-to-subtask mapping | Wraps user message with [YOUR ASSIGNED SUBTASK: ...] + fed into coordination system message |
| Planning mode analysis | Inline (reuses agent) | HOW agents can act — tool access during coordination | Planning/execution mode flags | Sets backend.set_planning_mode(), blocks tool access |
Understanding where each primitive's output lands is critical for reasoning about compositions:
Persona generation output flows into the system message at the start of every round. Round 0 gets the strong version ("your perspective is X, prioritize Y, approach Z"). Round 1+ gets the softened version ("treat your perspective as a preference, not a position to defend"). This means personas shape how agents interpret the task, what they prioritize, and what they notice in peer answers.
Evaluation criteria output flows into two places: (1) the checklist tool state, replacing the default E1-E4 items that gate the submit_checklist decision, and (2) the system message via custom_checklist_items, so agents know what they're being evaluated on. This means criteria control whether agents vote to converge or keep iterating.
Task decomposition output flows into (1) the user message per agent, wrapping the original prompt with the agent's assigned subtask, and (2) the coordination system message so agents and the final presenter understand the decomposition. This means each agent sees a scoped version of the task rather than the full prompt.
Planning mode output flows into backend state, toggling tool availability. During coordination rounds, agents describe what they would do rather than doing it. Only the winning agent gets tools restored for final execution. This means agents compete on plans, not partially-executed actions.
Main coordination turns involve N agents iterating over rounds, seeing each other's work, voting via checklist gates, and converging. The subagent primitives are different:
- They run before the main loop, not as part of it.
- They spawn a separate MassGen execution with stripped-down config (no filesystem tools, no MCP).
- Their output is structured data (personas, criteria, subtask maps) consumed by the orchestrator, not free-form answers.
- They do not participate in the voting/convergence loop — they produce their output and exit.
This distinction matters for composition: a subagent primitive produces context that shapes all subsequent rounds, but it cannot be refined by those rounds. If you want iterative refinement of personas themselves, you need to compose multiple phases (see Composition Patterns).
No single primitive is sufficient for high-quality output. Consider the quality matrix:
Without refinement With refinement
───────────────── ───────────────
Without personas Generic first drafts Polished mediocrity
With personas Ambitious but rough Distinctive, mature work
Now extend this to the full primitive set:
- Personas alone → diverse perspectives, but agents may not know what "good" looks like for this task.
- Eval criteria alone → agents know quality gates, but all approach the task identically.
- Personas + eval criteria → diverse approaches held to task-specific quality standards.
- Personas + eval criteria + decomposition → specialized agents with quality gates on their owned subtasks.
- Planning + personas + eval criteria + execution → agents with strong perspectives debate the best plan, the plan is held to rigorous criteria, then the winning plan is executed with fresh personas optimized for implementation.
The power grows combinatorially. And because subagent primitives are themselves MassGen executions, you can apply the quality matrix to them too:
- Generate personas using multiple agents with iterative refinement — N agents debating what the best personas would be.
- Generate evaluation criteria with personas already injected — so each evaluator brings a different quality philosophy.
- Decompose tasks with multiple agents voting on the best decomposition strategy.
The most immediately powerful composition. Multiple agents with strong personas debate a plan. The plan must pass task-specific evaluation criteria before execution begins. Then the winning plan is executed in chunks — each chunk potentially with its own personas, decomposition strategy, or evaluation criteria.
Phase 1: Plan generation
├── Persona generation (diverse planning perspectives)
├── Eval criteria generation (what makes a good plan?)
└── N agents × M rounds → winning plan
↓ plan output
Phase 2: Chunk execution (per plan section)
├── Persona generation (implementation-focused, per chunk)
├── Eval criteria generation (what makes good execution of THIS chunk?)
└── N agents × M rounds → executed chunk
↓ chunk output
Some chunks may use decomposition instead of parallel:
├── Chunk A: parallel mode, creative personas, creative eval criteria
├── Chunk B: decomposition mode, specialist subtask owners
└── Chunk C: parallel mode, analytical personas, correctness eval criteria
Different subtasks need different quality standards. A creative writing subtask needs different evaluation criteria than a data analysis subtask. A decomposition primitive assigns ownership, then each subtask runs as its own coordination with appropriate primitives.
Phase 1: Task decomposition
├── Persona generation (architectural perspectives)
└── N agents vote on decomposition
↓ subtask map
Phase 2: Per-subtask execution (each a separate coordination)
├── Subtask A: parallel mode, creative personas, creative eval criteria
├── Subtask B: parallel mode, analytical personas, correctness eval criteria
└── Subtask C: single agent, deep specialist persona, domain eval criteria
↓ per-subtask outputs
Phase 3: Synthesis
├── Integration personas (cross-domain connectors)
├── Synthesis eval criteria (coherence, consistency, completeness)
└── Combine subtask outputs into unified result
Use MassGen to improve MassGen's own preparation. Generate rough personas, use them to generate better evaluation criteria, then use those criteria to evaluate and regenerate the personas.
Phase 1: Bootstrap → rough personas
Phase 2: Rough personas → eval criteria for personas
Phase 3: Eval criteria → refined personas (iterative refinement)
Phase 4: Refined personas + task-specific eval criteria → main execution
For complex analytical tasks, decompose into parallel analysis tracks with methodology-specific personas, then synthesize across dimensions.
Phase 1: Decompose into analysis dimensions
Phase 2: Per-dimension parallel analysis (methodology personas per dimension)
Phase 3: Cross-dimension synthesis (integration personas, synthesis eval criteria)
The default checklist items (E1-E4) are designed for general task output. But special primitives — persona generation, task decomposition, evaluation criteria generation, and analytical tasks like prompt crafting or log analysis — have well-defined quality characteristics that don't require another level of prompt generation to specify.
These are the recommended default criteria for each primitive type. When a primitive runs as a standalone coordination, these criteria should replace the generic E1-E4.
What makes personas good is well-specified: they must be distinct, actionable, and task-relevant.
| ID | Criterion | Category |
|---|---|---|
| E1 | Each persona articulates a clear, specific perspective that would lead to meaningfully different outputs — not just surface variation in tone or vocabulary. Two personas that would produce essentially the same answer are a failure. | core |
| E2 | Personas are grounded in the actual task. Each perspective is relevant to the problem domain and brings a genuinely useful lens, not an arbitrary or forced viewpoint. | core |
| E3 | Personas are actionable instructions, not character descriptions. An agent receiving this persona knows exactly how it changes their approach, priorities, and decision-making — not just who they are pretending to be. | core |
| E4 | The persona set collectively provides coverage — the major reasonable approaches, value trade-offs, or methodological choices for this task are represented. No critical perspective is missing. | core |
| E5 | Personas are vivid enough to resist homogenization under peer pressure. The perspective is strongly stated so that even after seeing other agents' answers, the core viewpoint remains distinguishable. | stretch |
Good decomposition must produce subtasks that are independently executable, collectively exhaustive, and appropriately scoped.
| ID | Criterion | Category |
|---|---|---|
| E1 | Subtasks are collectively exhaustive — completing all subtasks fully produces the complete output. No significant aspect of the original task falls through the cracks between subtasks. | core |
| E2 | Subtasks have minimal coupling — each can be executed independently without requiring intermediate results from other subtasks. Where dependencies exist, they are explicit and the dependency order is specified. | core |
| E3 | Subtask scoping is balanced — no single subtask is trivial while another carries the bulk of the complexity. Work is distributed so each agent has a meaningful, roughly comparable contribution. | core |
| E4 | Each subtask description is self-contained and specific enough that an agent can execute it without needing to infer intent from other subtasks or the original prompt. | core |
| E5 | The decomposition strategy is appropriate for the task type — creative tasks split along conceptual boundaries, technical tasks along component boundaries, analytical tasks along dimension boundaries. | stretch |
Meta-quality: the criteria that judge quality must themselves be high quality.
| ID | Criterion | Category |
|---|---|---|
| E1 | Each criterion is specific to the actual task — not generic advice that applies to any output. A criterion that could be copy-pasted to an unrelated task is too vague. | core |
| E2 | Criteria are evaluable — an agent can determine pass/fail by examining the output, not by making subjective judgments about intent. "Addresses edge cases" is vague; "handles empty input, null values, and boundary conditions" is evaluable. | core |
| E3 | The criteria set distinguishes excellent work from adequate work. If every competent first draft would pass all criteria, the bar is too low. At least one criterion should require genuine effort to satisfy. | core |
| E4 | Core vs. stretch categorization is correct. Core criteria represent non-negotiable requirements; stretch criteria represent quality differentiators. A misclassified core criterion blocks good work; a misclassified stretch criterion lets mediocre work pass. | core |
| E5 | Criteria do not conflict with each other or create impossible trade-offs. Meeting one criterion should not require violating another. Where genuine tensions exist, the criteria acknowledge the trade-off explicitly. | stretch |
When using MassGen to generate prompts, system messages, briefs, or instructions for downstream use.
| ID | Criterion | Category |
|---|---|---|
| E1 | The prompt achieves its functional goal — an agent receiving this prompt would produce the intended type of output without additional clarification. Test: could you hand this to a capable model cold and get back what you need? | core |
| E2 | The prompt is appropriately scoped — it constrains enough to prevent unhelpful outputs but does not over-constrain in ways that eliminate valid approaches. | core |
| E3 | Important requirements are explicit, not implied. The prompt does not depend on shared context, cultural assumptions, or "obvious" intentions that a model might miss. | core |
| E4 | The prompt is structured for parseability — key instructions are prominent, not buried in paragraphs. An agent skimming the prompt would still catch the critical constraints. | stretch |
| E5 | The prompt anticipates likely failure modes for its task type and includes guardrails against them (e.g., "do not summarize when asked to analyze" or "include concrete examples, not abstract principles"). | stretch |
When using MassGen to analyze logs, execution traces, performance data, or prior MassGen outputs.
| ID | Criterion | Category |
|---|---|---|
| E1 | The analysis identifies concrete, specific findings — not vague observations. Each finding points to a specific location, pattern, or data point in the source material. | core |
| E2 | Findings are supported by evidence from the actual data, not inferred from assumptions about what "usually" happens. Claims include references to specific log entries, metrics, or examples. | core |
| E3 | The analysis distinguishes symptoms from root causes. Surface-level observations (e.g., "agent 2 was slow") are traced to underlying explanations (e.g., "agent 2 hit rate limits due to tool call volume"). | core |
| E4 | Actionable recommendations follow from findings. Each significant finding includes a concrete suggestion for what to change, not just a description of what went wrong. | core |
| E5 | The analysis identifies patterns across the dataset, not just individual anomalies. Recurring behaviors, systematic biases, or structural issues are surfaced alongside one-off events. | stretch |
Today, primitives execute in this fixed sequence:
In chat() — before coordination:
1. Planning mode analysis (if enabled)
→ Reuses an existing agent inline
→ Sets tool constraints on all agent backends
→ Output: backend.set_planning_mode() flags
In _coordinate_agents() — at coordination start:
2. Persona generation ⎤ Subagent spawns
⎥ (concurrent if both enabled)
3. Eval criteria gen ⎦ Output stored in orchestrator state
4. Task decomposition (decomposition mode only)
→ Subagent spawn, runs after personas/criteria
→ Output: self._agent_subtasks dict
5. Main round loop begins
Per round, per agent:
→ Persona text prepended to system message
→ Eval criteria passed as checklist items + system message section
→ Subtask (if decomposition) wraps user message
→ Planning mode constrains available tools
The current implementation hard-codes the ordering and each primitive runs as a single subagent spawn without iterative refinement. The vision is explicit phase composition where users define ordered phases, each phase being a full MassGen coordination (with its own agents, rounds, primitives, and checklist gates):
# Conceptual — not yet implemented
phases:
- name: persona_generation
coordination:
agents: 3
max_rounds: 2
checklist_criteria: persona # uses persona-specific gates from above
output_type: personas
feeds_into: [plan, execute] # which phases consume this output
- name: plan
coordination:
agents: 5
max_rounds: 4
persona_generator:
enabled: true # personas for the planners themselves
checklist_criteria: prompt # plan judged as a brief/prompt
output_type: plan
feeds_into: [execute]
- name: execute
coordination:
agents: 3
max_rounds: 6
personas: $phases.persona_generation.output
evaluation_criteria_generator:
enabled: true
checklist_criteria: auto
input: $phases.plan.outputEach phase is a full coordination with its own quality gates. The output of one phase feeds specific injection points in the next. This is where the combinatorial power lives — and finding the right compositions for different task types is one of MassGen's most important ongoing research directions.
The space is vast: different personas for planning vs. execution, different evaluation criteria per plan chunk, decomposition within one phase but parallel in another, recursive refinement of the primitives themselves. The primitives are simple; the compositions are where the magic happens.