Composable Primitives and Phase Architecture

This is one of the most important conceptual documents in MassGen. The system's power comes not from any single primitive, but from how they compose. Finding the right compositions — the right personas to build a rigorous plan, the right evaluation criteria to ensure every aspect of that plan is executed to a shockingly high standard, the right decomposition to let specialists own what they're best at — is what unlocks the full potential of multi-agent coordination.

The Primitives

MassGen provides composable primitives that each shape a different dimension of agent behavior. Some run as subagent spawns (a separate MassGen execution outside the main coordination loop), some run as inline analysis (reusing existing agents), and one operates as per-round injection into the main loop.

Primitive Inventory

| Primitive | Mechanism | What it shapes | Output | Injection point |
|---|---|---|---|---|
| Persona generation | Subagent spawn | WHO the agents are: perspective, values, approach | Per-agent persona text (strong + softened) | Prepended to system message each round |
| Evaluation criteria generation | Subagent spawn | WHAT quality means: task-specific checklist gates | E1..EN criteria (core/stretch) | Replaces default E1-E4 in checklist tool + system message |
| Task decomposition | Subagent spawn | WHAT each agent works on: subtask ownership | Agent-to-subtask mapping | Wraps user message with [YOUR ASSIGNED SUBTASK: ...] + fed into coordination system message |
| Planning mode analysis | Inline (reuses agent) | HOW agents can act: tool access during coordination | Planning/execution mode flags | Sets backend.set_planning_mode(), blocks tool access |

How Output Flows Into the System

Understanding where each primitive's output lands is critical for reasoning about compositions:

Persona generation output flows into the system message at the start of every round. Round 0 gets the strong version ("your perspective is X, prioritize Y, approach Z"). Round 1+ gets the softened version ("treat your perspective as a preference, not a position to defend"). This means personas shape how agents interpret the task, what they prioritize, and what they notice in peer answers.
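The round-dependent selection can be sketched as follows. This is a minimal illustration, not MassGen's implementation: the `Persona` record and the `build_system_message` helper are hypothetical names introduced here, and the persona wording is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical container for the two persona variants."""
    strong: str    # injected in round 0
    softened: str  # injected in round 1 and later


def build_system_message(base: str, persona: Persona, round_index: int) -> str:
    """Prepend the persona variant that matches the current round."""
    text = persona.strong if round_index == 0 else persona.softened
    return f"{text}\n\n{base}"


persona = Persona(
    strong="Your perspective is reliability first. Prioritize failure modes.",
    softened="Treat reliability as a preference, not a position to defend.",
)
base = "You are a coordination agent."
round0 = build_system_message(base, persona, 0)
round2 = build_system_message(base, persona, 2)
```

Because the persona is rebuilt into the system message every round, switching from the strong to the softened text needs no extra state beyond the round index.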

Evaluation criteria output flows into two places: (1) the checklist tool state, replacing the default E1-E4 items that gate the submit_checklist decision, and (2) the system message via custom_checklist_items, so agents know what they're being evaluated on. This means criteria control whether agents vote to converge or keep iterating.
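A rough sketch of how such a gate might behave is shown below. It assumes that only failed core criteria block convergence and that stretch criteria are reported without blocking; the text above does not spell out this rule, so treat it as an assumption. `Criterion` and `gate` are hypothetical names, and the second criterion is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    id: str
    text: str
    category: str  # "core" or "stretch"


def gate(criteria: List[Criterion], passed: Dict[str, bool]) -> Tuple[bool, List[str], List[str]]:
    """Return (ready_to_converge, failed_core_ids, failed_stretch_ids).

    Assumption: a criterion with no recorded result counts as failed,
    and only failed core criteria keep the agent iterating.
    """
    failed_core = [c.id for c in criteria
                   if c.category == "core" and not passed.get(c.id, False)]
    failed_stretch = [c.id for c in criteria
                      if c.category == "stretch" and not passed.get(c.id, False)]
    return (not failed_core, failed_core, failed_stretch)


criteria = [
    Criterion("E1", "Handles empty input, null values, and boundary conditions", "core"),
    Criterion("E2", "Every public function has a usage example", "stretch"),  # invented
]
ready, failed_core, failed_stretch = gate(criteria, {"E1": True, "E2": False})
```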

Task decomposition output flows into (1) the user message per agent, wrapping the original prompt with the agent's assigned subtask, and (2) the coordination system message so agents and the final presenter understand the decomposition. This means each agent sees a scoped version of the task rather than the full prompt.
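The per-agent wrapping can be sketched like this. The `[YOUR ASSIGNED SUBTASK: ...]` marker comes from the inventory table, but the exact layout MassGen uses around it is not shown in this document, so the placement and spacing below are assumptions. `wrap_user_message` and the agent names are hypothetical.

```python
from typing import Dict, Optional


def wrap_user_message(prompt: str, subtask: Optional[str]) -> str:
    """Scope the original prompt to one agent's subtask (layout is assumed)."""
    if not subtask:
        return prompt  # no decomposition: the agent sees the full prompt
    return f"[YOUR ASSIGNED SUBTASK: {subtask}]\n\n{prompt}"


# Hypothetical agent-to-subtask mapping produced by the decomposition primitive.
agent_subtasks: Dict[str, str] = {
    "agent_a": "Design the database schema",
    "agent_b": "Write the API handlers",
}
prompt = "Build a small inventory service."
messages = {agent: wrap_user_message(prompt, sub)
            for agent, sub in agent_subtasks.items()}
```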

Planning mode output flows into backend state, toggling tool availability. During coordination rounds, agents describe what they would do rather than doing it. Only the winning agent gets tools restored for final execution. This means agents compete on plans, not partially-executed actions.
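The toggle-then-restore behavior can be illustrated with a stub. `backend.set_planning_mode()` is named in this document, but its real signature is not shown, so the boolean argument here is an assumption; `StubBackend` and `tools_available` are hypothetical stand-ins and not part of MassGen.

```python
class StubBackend:
    """Stand-in for a real backend; models only the planning-mode flag."""

    def __init__(self) -> None:
        self.planning_mode = False

    def set_planning_mode(self, enabled: bool) -> None:  # assumed signature
        self.planning_mode = enabled

    def tools_available(self) -> bool:
        return not self.planning_mode


backends = {name: StubBackend() for name in ("agent_a", "agent_b", "agent_c")}

# Coordination rounds: every agent plans, nobody executes.
for backend in backends.values():
    backend.set_planning_mode(True)
during_coordination = {name: b.tools_available() for name, b in backends.items()}

# Final execution: only the winning agent gets its tools back.
winner = "agent_b"
backends[winner].set_planning_mode(False)
after_selection = {name: b.tools_available() for name, b in backends.items()}
```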

How Primitives Differ from Main Coordination Turns

Main coordination turns involve N agents iterating over rounds, seeing each other's work, voting via checklist gates, and converging. The subagent primitives are different:

  • They run before the main loop, not as part of it.
  • They spawn a separate MassGen execution with stripped-down config (no filesystem tools, no MCP).
  • Their output is structured data (personas, criteria, subtask maps) consumed by the orchestrator, not free-form answers.
  • They do not participate in the voting/convergence loop — they produce their output and exit.

This distinction matters for composition: a subagent primitive produces context that shapes all subsequent rounds, but it cannot be refined by those rounds. If you want iterative refinement of personas themselves, you need to compose multiple phases (see Composition Patterns).

Why Composition Matters

No single primitive is sufficient for high-quality output. Consider the quality matrix:

                    Without refinement          With refinement
                    ─────────────────           ───────────────
Without personas    Generic first drafts        Polished mediocrity
With personas       Ambitious but rough         Distinctive, mature work

Now extend this to the full primitive set:

  • Personas alone → diverse perspectives, but agents may not know what "good" looks like for this task.
  • Eval criteria alone → agents know quality gates, but all approach the task identically.
  • Personas + eval criteria → diverse approaches held to task-specific quality standards.
  • Personas + eval criteria + decomposition → specialized agents with quality gates on their owned subtasks.
  • Planning + personas + eval criteria + execution → agents with strong perspectives debate the best plan, the plan is held to rigorous criteria, then the winning plan is executed with fresh personas optimized for implementation.

The power grows combinatorially. And because subagent primitives are themselves MassGen executions, you can apply the quality matrix to them too:

  • Generate personas using multiple agents with iterative refinement — N agents debating what the best personas would be.
  • Generate evaluation criteria with personas already injected — so each evaluator brings a different quality philosophy.
  • Decompose tasks with multiple agents voting on the best decomposition strategy.

Composition Patterns

Pattern 1: Quality-Gated Planning → Chunked Execution

The most immediately powerful composition. Multiple agents with strong personas debate a plan. The plan must pass task-specific evaluation criteria before execution begins. Then the winning plan is executed in chunks — each chunk potentially with its own personas, decomposition strategy, or evaluation criteria.

Phase 1: Plan generation
  ├── Persona generation (diverse planning perspectives)
  ├── Eval criteria generation (what makes a good plan?)
  └── N agents × M rounds → winning plan
       ↓ plan output

Phase 2: Chunk execution (per plan section)
  ├── Persona generation (implementation-focused, per chunk)
  ├── Eval criteria generation (what makes good execution of THIS chunk?)
  └── N agents × M rounds → executed chunk
       ↓ chunk output

  Some chunks may use decomposition mode instead of parallel mode:
  ├── Chunk A: parallel mode, creative personas, creative eval criteria
  ├── Chunk B: decomposition mode, specialist subtask owners
  └── Chunk C: parallel mode, analytical personas, correctness eval criteria

Pattern 2: Decomposition with Per-Subtask Quality Gates

Different subtasks need different quality standards. A creative writing subtask needs different evaluation criteria than a data analysis subtask. A decomposition primitive assigns ownership, then each subtask runs as its own coordination with appropriate primitives.

Phase 1: Task decomposition
  ├── Persona generation (architectural perspectives)
  └── N agents vote on decomposition
       ↓ subtask map

Phase 2: Per-subtask execution (each a separate coordination)
  ├── Subtask A: parallel mode, creative personas, creative eval criteria
  ├── Subtask B: parallel mode, analytical personas, correctness eval criteria
  └── Subtask C: single agent, deep specialist persona, domain eval criteria
       ↓ per-subtask outputs

Phase 3: Synthesis
  ├── Integration personas (cross-domain connectors)
  ├── Synthesis eval criteria (coherence, consistency, completeness)
  └── Combine subtask outputs into unified result

Pattern 3: Recursive Refinement of Primitives

Use MassGen to improve MassGen's own preparation. Generate rough personas, use them to generate better evaluation criteria, then use those criteria to evaluate and regenerate the personas.

Phase 1: Bootstrap → rough personas
Phase 2: Rough personas → eval criteria for personas
Phase 3: Eval criteria → refined personas (iterative refinement)
Phase 4: Refined personas + task-specific eval criteria → main execution
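Phases 1 to 3 amount to a generate, evaluate, regenerate loop. The sketch below shows that control flow with toy stand-ins. `refine`, `toy_generate`, and `toy_evaluate` are hypothetical; in a real run each call would be a full coordination rather than a plain function.

```python
from typing import Callable, List, Optional


def refine(
    generate: Callable[[Optional[List[str]]], List[str]],
    evaluate: Callable[[List[str]], List[str]],
    max_iterations: int = 3,
) -> List[str]:
    """Regenerate until no criteria fail or the iteration budget runs out."""
    candidate = generate(None)        # Phase 1: bootstrap rough personas
    for _ in range(max_iterations):
        failed = evaluate(candidate)  # Phase 2 criteria applied to the personas
        if not failed:
            break
        candidate = generate(failed)  # Phase 3: regenerate using the failures
    return candidate


def toy_generate(failed: Optional[List[str]]) -> List[str]:
    # Returns duplicate personas until it is told that E1 failed.
    if failed and "E1" in failed:
        return ["security reviewer", "performance engineer"]
    return ["generalist", "generalist"]


def toy_evaluate(personas: List[str]) -> List[str]:
    # E1 (distinctness) fails when two personas are identical.
    return ["E1"] if len(set(personas)) < len(personas) else []


refined = refine(toy_generate, toy_evaluate)
```

The iteration cap matters: without it, a generator that never satisfies the criteria would loop forever.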

Pattern 4: Analysis → Synthesis

For complex analytical tasks, decompose into parallel analysis tracks with methodology-specific personas, then synthesize across dimensions.

Phase 1: Decompose into analysis dimensions
Phase 2: Per-dimension parallel analysis (methodology personas per dimension)
Phase 3: Cross-dimension synthesis (integration personas, synthesis eval criteria)

Checklist Gate Criteria for Special Primitives

The default checklist items (E1-E4) are designed for general task output. But special primitives — persona generation, task decomposition, evaluation criteria generation, and analytical tasks like prompt crafting or log analysis — have well-defined quality characteristics that don't require another level of prompt generation to specify.

These are the recommended default criteria for each primitive type. When a primitive runs as a standalone coordination, these criteria should replace the generic E1-E4.

Persona Generation

What makes personas good is well-specified: they must be distinct, actionable, and task-relevant.

| ID | Criterion | Category |
|---|---|---|
| E1 | Each persona articulates a clear, specific perspective that would lead to meaningfully different outputs, not just surface variation in tone or vocabulary. Two personas that would produce essentially the same answer are a failure. | core |
| E2 | Personas are grounded in the actual task. Each perspective is relevant to the problem domain and brings a genuinely useful lens, not an arbitrary or forced viewpoint. | core |
| E3 | Personas are actionable instructions, not character descriptions. An agent receiving this persona knows exactly how it changes their approach, priorities, and decision-making, not just who they are pretending to be. | core |
| E4 | The persona set collectively provides coverage: the major reasonable approaches, value trade-offs, or methodological choices for this task are represented. No critical perspective is missing. | core |
| E5 | Personas are vivid enough to resist homogenization under peer pressure. The perspective is strongly stated so that even after seeing other agents' answers, the core viewpoint remains distinguishable. | stretch |

Task Decomposition

Good decomposition must produce subtasks that are independently executable, collectively exhaustive, and appropriately scoped.

| ID | Criterion | Category |
|---|---|---|
| E1 | Subtasks are collectively exhaustive: completing all subtasks fully produces the complete output. No significant aspect of the original task falls through the cracks between subtasks. | core |
| E2 | Subtasks have minimal coupling: each can be executed independently without requiring intermediate results from other subtasks. Where dependencies exist, they are explicit and the dependency order is specified. | core |
| E3 | Subtask scoping is balanced: no single subtask is trivial while another carries the bulk of the complexity. Work is distributed so each agent has a meaningful, roughly comparable contribution. | core |
| E4 | Each subtask description is self-contained and specific enough that an agent can execute it without needing to infer intent from other subtasks or the original prompt. | core |
| E5 | The decomposition strategy is appropriate for the task type: creative tasks split along conceptual boundaries, technical tasks along component boundaries, analytical tasks along dimension boundaries. | stretch |

Evaluation Criteria Generation

Meta-quality: the criteria that judge quality must themselves be high quality.

| ID | Criterion | Category |
|---|---|---|
| E1 | Each criterion is specific to the actual task, not generic advice that applies to any output. A criterion that could be copy-pasted to an unrelated task is too vague. | core |
| E2 | Criteria are evaluable: an agent can determine pass/fail by examining the output, not by making subjective judgments about intent. "Addresses edge cases" is vague; "handles empty input, null values, and boundary conditions" is evaluable. | core |
| E3 | The criteria set distinguishes excellent work from adequate work. If every competent first draft would pass all criteria, the bar is too low. At least one criterion should require genuine effort to satisfy. | core |
| E4 | Core vs. stretch categorization is correct. Core criteria represent non-negotiable requirements; stretch criteria represent quality differentiators. A misclassified core criterion blocks good work; a misclassified stretch criterion lets mediocre work pass. | core |
| E5 | Criteria do not conflict with each other or create impossible trade-offs. Meeting one criterion should not require violating another. Where genuine tensions exist, the criteria acknowledge the trade-off explicitly. | stretch |

Prompt / Brief Crafting

Use these criteria when MassGen generates prompts, system messages, briefs, or instructions for downstream use.

| ID | Criterion | Category |
|---|---|---|
| E1 | The prompt achieves its functional goal: an agent receiving this prompt would produce the intended type of output without additional clarification. Test: could you hand this to a capable model cold and get back what you need? | core |
| E2 | The prompt is appropriately scoped: it constrains enough to prevent unhelpful outputs but does not over-constrain in ways that eliminate valid approaches. | core |
| E3 | Important requirements are explicit, not implied. The prompt does not depend on shared context, cultural assumptions, or "obvious" intentions that a model might miss. | core |
| E4 | The prompt is structured for parseability: key instructions are prominent, not buried in paragraphs. An agent skimming the prompt would still catch the critical constraints. | stretch |
| E5 | The prompt anticipates likely failure modes for its task type and includes guardrails against them (e.g., "do not summarize when asked to analyze" or "include concrete examples, not abstract principles"). | stretch |

Log / Output Analysis

Use these criteria when MassGen analyzes logs, execution traces, performance data, or prior MassGen outputs.

| ID | Criterion | Category |
|---|---|---|
| E1 | The analysis identifies concrete, specific findings, not vague observations. Each finding points to a specific location, pattern, or data point in the source material. | core |
| E2 | Findings are supported by evidence from the actual data, not inferred from assumptions about what "usually" happens. Claims include references to specific log entries, metrics, or examples. | core |
| E3 | The analysis distinguishes symptoms from root causes. Surface-level observations (e.g., "agent 2 was slow") are traced to underlying explanations (e.g., "agent 2 hit rate limits due to tool call volume"). | core |
| E4 | Actionable recommendations follow from findings. Each significant finding includes a concrete suggestion for what to change, not just a description of what went wrong. | core |
| E5 | The analysis identifies patterns across the dataset, not just individual anomalies. Recurring behaviors, systematic biases, or structural issues are surfaced alongside one-off events. | stretch |

Current Execution Order

Today, primitives execute in this fixed sequence:

In chat() — before coordination:
  1. Planning mode analysis (if enabled)
     → Reuses an existing agent inline
     → Sets tool constraints on all agent backends
     → Output: backend.set_planning_mode() flags

In _coordinate_agents() — at coordination start:
  2. Persona generation  ⎤  Subagent spawns
                         ⎥  (concurrent if both enabled)
  3. Eval criteria gen   ⎦  Output stored in orchestrator state

  4. Task decomposition (decomposition mode only)
     → Subagent spawn, runs after personas/criteria
     → Output: self._agent_subtasks dict

  5. Main round loop begins
     Per round, per agent:
       → Persona text prepended to system message
       → Eval criteria passed as checklist items + system message section
       → Subtask (if decomposition) wraps user message
       → Planning mode constrains available tools
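Steps 2 to 4 of the sequence above can be sketched with `asyncio`: personas and criteria run concurrently, and decomposition starts only after both finish. The coroutine names are hypothetical stand-ins for the subagent spawns, and the sleeps only simulate work; nothing here is MassGen's actual code.

```python
import asyncio
from typing import Dict, List, Tuple


async def generate_personas(log: List[str]) -> Dict[str, str]:
    log.append("personas:start")
    await asyncio.sleep(0.01)  # simulated subagent work
    log.append("personas:done")
    return {"agent_a": "persona text"}


async def generate_criteria(log: List[str]) -> List[str]:
    log.append("criteria:start")
    await asyncio.sleep(0.01)  # simulated subagent work
    log.append("criteria:done")
    return ["E1", "E2"]


async def decompose(log: List[str]) -> Dict[str, str]:
    log.append("decomposition:start")
    await asyncio.sleep(0)
    return {"agent_a": "subtask"}


async def prepare(decomposition_mode: bool) -> Tuple[dict, list, dict, List[str]]:
    log: List[str] = []
    # Steps 2 and 3: both spawns are in flight at the same time.
    personas, criteria = await asyncio.gather(
        generate_personas(log), generate_criteria(log)
    )
    # Step 4: runs after personas/criteria, in decomposition mode only.
    subtasks = await decompose(log) if decomposition_mode else {}
    return personas, criteria, subtasks, log


personas, criteria, subtasks, log = asyncio.run(prepare(decomposition_mode=True))
```

The recorded `log` shows both generators starting before either finishes, with decomposition strictly last.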

Future: Explicit Phase Composition

The current implementation hard-codes the ordering, and each subagent primitive runs as a single spawn without iterative refinement. The vision is explicit phase composition where users define ordered phases, each phase being a full MassGen coordination (with its own agents, rounds, primitives, and checklist gates):

# Conceptual — not yet implemented
phases:
  - name: persona_generation
    coordination:
      agents: 3
      max_rounds: 2
      checklist_criteria: persona  # uses persona-specific gates from above
    output_type: personas
    feeds_into: [plan, execute]    # which phases consume this output

  - name: plan
    coordination:
      agents: 5
      max_rounds: 4
      persona_generator:
        enabled: true             # personas for the planners themselves
      checklist_criteria: prompt  # plan judged as a brief/prompt
    output_type: plan
    feeds_into: [execute]

  - name: execute
    coordination:
      agents: 3
      max_rounds: 6
      personas: $phases.persona_generation.output
      evaluation_criteria_generator:
        enabled: true
      checklist_criteria: auto
    input: $phases.plan.output

Each phase is a full coordination with its own quality gates. The output of one phase feeds specific injection points in the next. This is where the combinatorial power lives — and finding the right compositions for different task types is one of MassGen's most important ongoing research directions.
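Since the phase configuration is conceptual, any loader for it is equally speculative. As one possible reading, the `feeds_into` edges define a dependency graph, and a valid run order is a topological sort of it. The sketch below uses the standard library's `graphlib` (Python 3.9+) on a plain dict that mirrors the three phases; `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Mirrors the conceptual config: phase name -> phases that consume its output.
phases: Dict[str, dict] = {
    "persona_generation": {"feeds_into": ["plan", "execute"]},
    "plan": {"feeds_into": ["execute"]},
    "execute": {},
}


def execution_order(phases: Dict[str, dict]) -> List[str]:
    """Order phases so every producer runs before its consumers."""
    # TopologicalSorter expects node -> predecessors, so invert feeds_into.
    predecessors = {name: set() for name in phases}
    for name, spec in phases.items():
        for consumer in spec.get("feeds_into", []):
            if consumer not in phases:
                raise KeyError(f"phase {name!r} feeds unknown phase {consumer!r}")
            predecessors[consumer].add(name)
    # static_order() raises graphlib.CycleError if phases feed each other in a loop.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(phases)
```

Validating the graph up front catches misspelled phase names and circular feeds before any coordination is started.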

The space is vast: different personas for planning vs. execution, different evaluation criteria per plan chunk, decomposition within one phase but parallel in another, recursive refinement of the primitives themselves. The primitives are simple; the compositions are where the magic happens.