Skip to content

[INFRA/FEATURE] Implement SSoT Reasoning Exploration and Hardened Server Management #473

@RUFFY-369

Description

@RUFFY-369

Is your feature request related to a problem? Please describe.
The Atropos RL engine currently faces significant stability and exploration bottlenecks when training small models (3B) on complex reasoning tasks like GSM8K. Specifically:

  1. Policy Collapse: Models quickly find a single deterministic "answer-seeking" path and stop exploring diverse reasoning strategies, leading to stagnant reward curves.
  2. Trajectory Data Loss: A critical bug in gsm8k_server.py purges textual reasoning messages during the scoring phase, leaving the trainer "blind" to the model's deliberation process.
  3. Infrastructure Fragility: The ServerManager hard-codes base_url port settings, causing collisions during local or multi-node rollouts.

Describe the solution you'd like
Implement a model-agnostic SSoT (String Seed of Thought) wrapper to enable Probabilistic Instruction Following (PIF). This solution includes:

  • Entropy Injection: Requiring the model to generate a 16-character random seed to force "mental path" diversity.
  • Surgical Interception: An "Action-First" interceptor that allows models to generate 1000+ tokens of reasoning traces while providing clean, environment-compliant tool calls.
  • Infrastructure Hardening: Patching the ServerManager to respect custom API configurations and fixing the ScoredDataGroup serialization to preserve full trajectory fidelity.

Describe alternatives you've considered
We considered using standard Temperature-based sampling (T > 0.7) to drive exploration. However, empirical testing showed that for 3B parameter models, high temperature often leads to "Reasoning Drift" and hallucination, rather than logical diversity. SSoT provides a "Zero-Temperature" diversity mechanism by shifting entropy from the logits to the prompt instructions.

Additional context
This implementation is grounded in the SSoT paper (arXiv:2510.21150). Initial verification via the Mimi/Frankie case study confirmed that the "Action-First" interceptor can successfully extract correct answers (e.g., \boxed{20}) from complex reasoning noise while the trainer learns from the captured trace. This allows for a successful Transition of Knowledge from standard policies to deep-reasoning policies.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions