[INFRA/FEATURE] Implement SSoT Reasoning Exploration and Hardened Server Management

**Is your feature request related to a problem? Please describe.**
The Atropos RL engine currently faces significant stability and exploration bottlenecks when training small (3B-parameter) models on complex reasoning tasks like GSM8K. Specifically:
1. **Policy Collapse**: Models quickly find a single deterministic "answer-seeking" path and stop exploring diverse reasoning strategies, leading to stagnant reward curves.
2. **Trajectory Data Loss**: A critical bug in `gsm8k_server.py` purges textual reasoning messages during the scoring phase, leaving the trainer "blind" to the model's deliberation process.
3. **Infrastructure Fragility**: The `ServerManager` hard-codes `base_url` port settings, causing collisions during local or multi-node rollouts.
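
To illustrate item 3, here is a minimal sketch of resolving the endpoint from configuration instead of a hard-coded port. `ExampleServerConfig`, `resolve_base_url`, the `EXAMPLE_BASE_URL` variable, and the default port are hypothetical names chosen for illustration; they are not the actual `ServerManager` API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ExampleServerConfig:
    """Hypothetical config object; not the real Atropos class."""
    base_url: Optional[str] = None  # explicit override, if the user set one
    host: str = "localhost"
    port: int = 9001  # illustrative default only


def resolve_base_url(cfg: ExampleServerConfig,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer an explicit base_url, then an env override, then host/port."""
    env = os.environ if env is None else env
    if cfg.base_url:
        return cfg.base_url.rstrip("/")
    override = env.get("EXAMPLE_BASE_URL")  # hypothetical variable name
    if override:
        return override.rstrip("/")
    return f"http://{cfg.host}:{cfg.port}/v1"
```

With this ordering, two local rollouts can point at different ports by setting `base_url` per config, without editing the manager itself.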

**Describe the solution you'd like**
Implement a model-agnostic **SSoT (String Seed of Thought)** wrapper to enable **Probabilistic Instruction Following (PIF)**. This solution includes:
* **Entropy Injection**: Requiring the model to generate a 16-character random seed to force "mental path" diversity.
* **Surgical Interception**: An "Action-First" interceptor that lets models generate 1000+ tokens of reasoning trace while still emitting clean, environment-compliant tool calls.
* **Infrastructure Hardening**: Patching the `ServerManager` to respect custom API configurations and fixing the `ScoredDataGroup` serialization to preserve full trajectory fidelity.
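
As a rough sketch of the entropy-injection step, the wrapper below prepends a seed instruction to any question, which keeps it model-agnostic. The instruction wording, the `wrap_with_ssot` and `is_valid_seed` helpers, and the alphanumeric seed alphabet are assumptions made for illustration; only the 16-character length comes from this proposal.

```python
import string

SEED_LENGTH = 16  # seed length stated in this proposal

# Assumed instruction wording; the real prompt may differ.
SSOT_INSTRUCTION = (
    "Before solving, write a random string of exactly {n} characters "
    "on a line starting with 'SEED:'. Use that string to decide which "
    "reasoning strategy to follow, then solve the problem."
)


def wrap_with_ssot(question: str, seed_length: int = SEED_LENGTH) -> str:
    """Prepend the seed instruction to an arbitrary task prompt."""
    return SSOT_INSTRUCTION.format(n=seed_length) + "\n\n" + question


def is_valid_seed(seed: str, seed_length: int = SEED_LENGTH) -> bool:
    """Check length and alphabet (alphanumeric is an assumption)."""
    allowed = set(string.ascii_letters + string.digits)
    return len(seed) == seed_length and all(c in allowed for c in seed)
```

A validity check like `is_valid_seed` could be used to reject completions that skip the seed, though whether to penalize them is a design choice left open here.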

**Describe alternatives you've considered**
We considered standard **temperature-based sampling** (T > 0.7) to drive exploration. However, empirical testing showed that for 3B-parameter models, high temperature often leads to "Reasoning Drift" and hallucination rather than logical diversity. SSoT provides a "Zero-Temperature" diversity mechanism by shifting entropy from the logits to the prompt instructions.
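
The idea of moving entropy from the logits into the prompt can be pictured as follows: once a seed string exists in the context, the choice of strategy is a deterministic function of that seed, so no sampling temperature is needed at that step. The strategy list and the character-sum rule below are illustrative assumptions, not the mapping the model actually learns or the one used in the paper.

```python
# Hypothetical strategy menu for a GSM8K-style problem.
STRATEGIES = [
    "work forward from the givens",
    "work backward from the goal",
    "set up an equation",
]


def pick_strategy(seed: str, strategies=STRATEGIES) -> str:
    """Map a seed string to one strategy deterministically.

    Same seed, same strategy; different seeds spread across the menu.
    """
    total = sum(ord(c) for c in seed)
    return strategies[total % len(strategies)]
```

Because the mapping is deterministic, diversity across rollouts comes only from the seeds differing, not from noisy token sampling during the reasoning itself.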

**Additional context**
This implementation is grounded in the **SSoT paper ([arXiv:2510.21150](https://arxiv.org/abs/2510.21150))**. Initial verification via the **Mimi/Frankie** case study confirmed that the "Action-First" interceptor can successfully extract correct answers (e.g., `\boxed{20}`) from complex reasoning noise while the trainer learns from the captured trace. This allows for a successful **Transition of Knowledge** from standard policies to deep-reasoning policies.
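
A minimal sketch of the extraction step described above, assuming the final answer appears as a `\boxed{...}` expression without nested braces. `extract_action` is a hypothetical helper, not the interceptor shipped with this change; it returns the answer for the environment and the untouched completion so the trace can still be handed to the trainer.

```python
import re
from typing import Optional, Tuple

# Matches \boxed{...} with no nested braces (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_action(completion: str) -> Tuple[Optional[str], str]:
    """Return (answer, full_trace).

    The answer is the content of the last \\boxed{...} in the completion,
    or None if there is none. The trace is returned unchanged so no
    reasoning text is lost before scoring.
    """
    matches = BOXED.findall(completion)
    answer = matches[-1].strip() if matches else None
    return answer, completion
```

Taking the last match rather than the first is a judgment call: intermediate boxed values sometimes appear mid-reasoning, and the final one is more likely to be the committed answer.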


[INFRA/FEATURE] Implement SSoT Reasoning Exploration and Hardened Server Management #473
