LR scheduler exhausts early in agentic training with AgentNativeStepEnvManager

## Bug Description

When using `AgentNativeStepEnvManager` (step-level env manager) for agentic training, the LR scheduler exhausts its step budget far before all pipeline steps complete, causing the learning rate to drop to zero mid-training.

In a 200-step training run with `lr_scheduler_type: "linear"`, the LR reached zero at **pipeline step 123**, so the remaining 77 pipeline steps (38.5% of the run) trained with **zero learning rate**.

## Root Cause

`PPOConfig.set_max_steps()` computes the total optimizer steps for the LR scheduler using `rollout_batch_size` (number of **trajectories**):

https://github.com/alibaba/ROLL/blob/main/roll/configs/base_config.py#L701-L718

```python
self.actor_train.training_args.max_steps = max(1, (
    max_steps
    * self.rollout_batch_size              # trajectories per rollout
    * self.actor_infer.generating_args.num_return_sequences
    * self.ppo_epochs
    // actor_backward_batch_size
))
```

With the config in `agent_val_rock_swe_qwen35_2b.yaml`:
```
max_steps=200, rollout_batch_size=4, num_return_sequences=1, ppo_epochs=1, backward_batch_size=4
→ scheduler total = 200 * 4 * 1 * 1 // 4 = 200 optimizer steps
```

**But the training batch contains chunks (one per agent turn), not trajectories.** `AgentNativeStepEnvManager.formulate_rollouts()` creates one training sample per turn:

https://github.com/alibaba/ROLL/blob/main/roll/pipeline/agentic/env_manager/agent_native_env_manager.py#L248

```python
for step, history in enumerate(rollout_cache.history):
    # ... one DataProto per turn ...
    samples.append(lm_input)
batch = DataProto.concat(samples)  # all turns as separate samples
```

With 4 trajectories × ~10 turns each, a single pipeline step yields ~40 training samples. With `backward_batch_size=4`, that is **~10 optimizer steps per pipeline step**, not the 1 that the scheduler was budgeted for.
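To make the mismatch concrete, here is a minimal sketch of the arithmetic above. The function names are illustrative and not part of ROLL. The figure of 10 turns per trajectory is the rough estimate used in this issue, and the exhaustion point assumes every chunk is kept on every pipeline step.

```python
def scheduler_budget(max_steps, rollout_batch_size, num_return_sequences,
                     ppo_epochs, backward_batch_size):
    """Optimizer steps budgeted for the scheduler, mirroring the formula quoted above."""
    return max(1, max_steps * rollout_batch_size * num_return_sequences
               * ppo_epochs // backward_batch_size)


def actual_steps_per_pipeline_step(rollout_batch_size, avg_turns,
                                   ppo_epochs, backward_batch_size):
    """Optimizer steps taken per pipeline step when each turn is one training sample."""
    samples = rollout_batch_size * avg_turns
    return samples * ppo_epochs // backward_batch_size


budget = scheduler_budget(200, 4, 1, 1, 4)
per_step = actual_steps_per_pipeline_step(4, 10, 1, 4)
exhausted_at = budget // per_step  # pipeline step at which the budget runs out

print(f"budget={budget}, per_step={per_step}, exhausted_at={exhausted_at}")
```

Under these assumptions the 200-step budget would be consumed after roughly 20 pipeline steps. Real runs last longer because many pipeline steps end up taking only one optimizer step.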

Additionally, `batch_adjust_mode: "random_sample"` has a bifurcation that makes this worse:
- When `total_chunks % backward_batch_size == 0`: keeps ALL chunks → many optimizer steps
- When not divisible: subsamples to exactly `backward_batch_size` → 1 optimizer step

https://github.com/alibaba/ROLL/blob/main/roll/pipeline/agentic/agentic_pipeline.py (search for `adjust_batch`)

This creates wildly inconsistent optimizer step counts per pipeline step, making the scheduler exhaustion unpredictable.
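The two cases can be modelled in a few lines. This is only a sketch of the behaviour as described in this issue, not a copy of ROLL's `adjust_batch` code, and `optimizer_steps` is a hypothetical helper name.

```python
def optimizer_steps(total_chunks: int, backward_batch_size: int) -> int:
    """Model of the described random_sample behaviour (not ROLL's actual code)."""
    if total_chunks % backward_batch_size == 0:
        # Divisible: all chunks are kept, one optimizer step per backward batch.
        return total_chunks // backward_batch_size
    # Not divisible: the batch is subsampled down to a single backward batch.
    return 1


for chunks in (39, 40, 41):
    print(chunks, optimizer_steps(chunks, 4))
```

A one-chunk difference in rollout length (39 vs. 40 chunks) changes the optimizer step count from 1 to 10 in this model, which is the kind of jump visible in the table below.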

## Evidence from Training Run

wandb run: https://wandb.ai/shamanework-pl/roll-agentic/runs/gvoe0mq8

Config: `openreward_endless_terminals_IPA_qwen35_2b.yaml` (same architecture, `rollout_batch_size=16`, `backward_batch_size=16`)

| Pipeline Step | Backward Steps | Cumulative Optimizer Steps | LR |
|---|---|---|---|
| 0 | 1 | 1 | 9.95e-7 |
| 2 | **15** | 17 | 9.15e-7 |
| 21 | **20** | 55 | 7.25e-7 |
| 64 | **11** | 118 | 4.10e-7 |
| 108 | **12** | 178 | 1.10e-7 |
| **123** | 6 | 205 | **0.00** |
| 199 | 1 | 313 | 0.00 |

**Total: 313 optimizer steps across 200 pipeline steps**, but the scheduler was budgeted for only 200. The LR hit zero at pipeline step 123, when the cumulative optimizer step count (205) passed that budget.
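As a sanity check, the LR column appears consistent with a plain linear decay over a 200-step budget with no warmup. The sketch below assumes a base LR of 1e-6, which is inferred from the table and not stated in this issue. `linear_lr` is a hand-written stand-in, not the scheduler implementation ROLL uses.

```python
def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Linear decay to zero without warmup, clamped at zero once the budget is spent."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


BASE_LR = 1e-6  # assumption: inferred from the table, not taken from the config
BUDGET = 200    # scheduler total computed by set_max_steps

# Cumulative optimizer step counts taken from the table above.
for cumulative in (1, 17, 55, 118, 178, 205, 313):
    print(cumulative, f"{linear_lr(cumulative, BUDGET, BASE_LR):.2e}")
```

Every optimizer step past 200 evaluates to zero, so all updates after pipeline step 123 are scaled by an LR of zero.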

## Affected Configs

Any agentic config using `AgentNativeStepEnvManager` with a decaying LR scheduler (linear, cosine, etc.):
- `examples/agentic_demo/agent_val_rock_swe_qwen35_2b.yaml`
- Any similar agentic training config

## Suggested Fix

**Option A — Use constant LR** (simplest, no code change):
```yaml
actor_train:
  training_args:
    lr_scheduler_type: "constant_with_warmup"
    warmup_steps: 10
```

**Option B — Fix `set_max_steps` for agentic training** (proper fix):

Override `set_max_steps` in `AgenticConfig` to account for the actual number of chunks per trajectory:

```python
# In AgenticConfig:
def set_max_steps(self, max_steps: int):
    actor_backward_batch_size = (
        self.actor_train.training_args.per_device_train_batch_size
        * self.actor_train.training_args.gradient_accumulation_steps
    )
    # Estimate chunks per trajectory (each turn = 1 training sample).
    # Heuristic: half of max_actions_per_traj, floored at 1 so the estimate is never zero.
    estimated_avg_turns = max(1, self.max_actions_per_traj // 2)
    self.actor_train.training_args.max_steps = max(1, (
        max_steps
        * self.rollout_batch_size
        * estimated_avg_turns
        * self.ppo_epochs
        // actor_backward_batch_size
    ))
```

**Option C — Fix `batch_adjust_mode`** (complementary):

Change `random_sample` to always produce a consistent batch size (e.g., always round down with `"delete"` mode), so each pipeline step = exactly 1 optimizer step, matching the current `set_max_steps` formula.
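One possible reading of this option is to always subsample each pipeline step's chunks down to a single backward batch. The sketch below uses plain Python lists instead of `DataProto`, and `subsample_to_one_batch` is a hypothetical helper that does not exist in ROLL.

```python
import random


def subsample_to_one_batch(samples: list, backward_batch_size: int, seed=None) -> list:
    """Always return exactly backward_batch_size samples (hypothetical helper)."""
    if len(samples) < backward_batch_size:
        raise ValueError("fewer samples than backward_batch_size")
    rng = random.Random(seed)
    # Sample without replacement, regardless of divisibility.
    return rng.sample(samples, backward_batch_size)


chunks = list(range(40))  # stand-in for ~40 per-turn training samples
batch = subsample_to_one_batch(chunks, 4, seed=0)
print(len(batch) // 4)    # optimizer steps for this pipeline step
```

The trade-off is that most collected chunks are discarded on every step, so this keeps the scheduler consistent at the cost of sample efficiency.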

## Environment

- ROLL version: main branch
- Model: Qwen3.5-2B
- Environment: SWE-bench / OpenReward EndlessTerminals
- GPUs: 8× (TP=2, CP=2 for training, 8× vLLM inference)

/cc @shamanez
