Is your feature request related to a problem? Please describe.
The Atropos RL engine currently faces significant stability and exploration bottlenecks when training small models (3B) on complex reasoning tasks like GSM8K. Specifically:
- Policy Collapse: Models quickly find a single deterministic "answer-seeking" path and stop exploring diverse reasoning strategies, leading to stagnant reward curves.
- Trajectory Data Loss: A critical bug in
gsm8k_server.py purges textual reasoning messages during the scoring phase, leaving the trainer "blind" to the model's deliberation process.
- Infrastructure Fragility: The
ServerManager hard-codes base_url port settings, causing collisions during local or multi-node rollouts.
Describe the solution you'd like
Implement a model-agnostic SSoT (String Seed of Thought) wrapper to enable Probabilistic Instruction Following (PIF). This solution includes:
- Entropy Injection: Requiring the model to generate a 16-character random seed to force "mental path" diversity.
- Surgical Interception: An "Action-First" interceptor that allows models to generate 1000+ tokens of reasoning traces while providing clean, environment-compliant tool calls.
- Infrastructure Hardening: Patching the
ServerManager to respect custom API configurations and fixing the ScoredDataGroup serialization to preserve full trajectory fidelity.
Describe alternatives you've considered
We considered using standard Temperature-based sampling (T > 0.7) to drive exploration. However, empirical testing showed that for 3B parameter models, high temperature often leads to "Reasoning Drift" and hallucination, rather than logical diversity. SSoT provides a "Zero-Temperature" diversity mechanism by shifting entropy from the logits to the prompt instructions.
Additional context
This implementation is grounded in the SSoT paper (arXiv:2510.21150). Initial verification via the Mimi/Frankie case study confirmed that the "Action-First" interceptor can successfully extract correct answers (e.g., \boxed{20}) from complex reasoning noise while the trainer learns from the captured trace. This allows for a successful Transition of Knowledge from standard policies to deep-reasoning policies.
Is your feature request related to a problem? Please describe.
The Atropos RL engine currently faces significant stability and exploration bottlenecks when training small models (3B) on complex reasoning tasks like GSM8K. Specifically:
gsm8k_server.pypurges textual reasoning messages during the scoring phase, leaving the trainer "blind" to the model's deliberation process.ServerManagerhard-codesbase_urlport settings, causing collisions during local or multi-node rollouts.Describe the solution you'd like
Implement a model-agnostic SSoT (String Seed of Thought) wrapper to enable Probabilistic Instruction Following (PIF). This solution includes:
ServerManagerto respect custom API configurations and fixing theScoredDataGroupserialization to preserve full trajectory fidelity.Describe alternatives you've considered
We considered using standard Temperature-based sampling (T > 0.7) to drive exploration. However, empirical testing showed that for 3B parameter models, high temperature often leads to "Reasoning Drift" and hallucination, rather than logical diversity. SSoT provides a "Zero-Temperature" diversity mechanism by shifting entropy from the logits to the prompt instructions.
Additional context
This implementation is grounded in the SSoT paper (arXiv:2510.21150). Initial verification via the Mimi/Frankie case study confirmed that the "Action-First" interceptor can successfully extract correct answers (e.g.,
\boxed{20}) from complex reasoning noise while the trainer learns from the captured trace. This allows for a successful Transition of Knowledge from standard policies to deep-reasoning policies.