[Agentic SFT] Post-train Marin-32B on the TerminalCorpus #4760

Description

@AlienKevin

Post-train marin-community/marin-32b-base on 15% of Nemotron-Terminal-Corpus with the same SFT recipe used in #4307, then evaluate on Terminal-Bench 2.0 against the Qwen3-32B SFT checkpoint at the same training fraction.

Hypothesis or Goal

Marin-32B base should absorb the same terminal-skills mix with comparable or better sample efficiency than the Qwen3-32B baseline when both are trained on 15% of the full corpus.

Results

Summary

This experiment fine-tuned marin-community/marin-32b-base for 859 steps on 15% of Nemotron-Terminal-Corpus and evaluated the resulting checkpoint on Terminal-Bench 2.0 with the Harbor/Terminus-2 setup. Early failures were traced to an EOS metadata mismatch in the saved checkpoint, not to the SFT weights themselves. After the checkpoint was patched and safeguards were added to the pipeline, the apples-to-apples eval still rejected the hypothesis: at a comparable training budget, Marin-32B solved 2/87 tasks (~2.3%) versus 18/86 (~20.9%) for Qwen3-32B SFT, and both of Marin's solved tasks were also solved by Qwen3 at half its training budget. The current recommendation is to treat the 15% Marin-32B SFT as a negative result and to continue only if testing full-corpus Marin SFT is worth the cost.
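
For readers who want to check the headline percentages, the arithmetic can be reproduced with a minimal sketch like the one below. The solved and attempted counts come from the summary; the helper name `pass_rate` and the variable names are illustrative only and are not part of the evaluation pipeline. Each rate is computed over that run's own denominator (87 for Marin, 86 for Qwen3), as reported.

```python
def pass_rate(solved: int, attempted: int) -> float:
    """Return the fraction of attempted tasks that were solved."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= solved <= attempted:
        raise ValueError("solved must be between 0 and attempted")
    return solved / attempted


# Counts as reported in the summary above.
marin = pass_rate(2, 87)    # Marin-32B SFT on 15% of the corpus
qwen = pass_rate(18, 86)    # Qwen3-32B SFT at a comparable training budget

print(f"Marin-32B SFT: {marin:.1%}")   # 2.3%
print(f"Qwen3-32B SFT: {qwen:.1%}")    # 20.9%
print(f"Gap: {(qwen - marin) * 100:.1f} percentage points")  # 18.6
```

Because the two runs report different task counts, the comparison is between rates, not raw solve counts.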

Metadata

Labels

agent-generated (created by automation/agent), experiment, tldr (issue has a community-friendly TL;DR summary)
