[datasets] Add Hermes trace support to the SFT pipeline #4858

Draft
taivu1998 wants to merge 2 commits into marin-community:main from taivu1998:tdv/pr1-hermes-sft-plumbing
@taivu1998
Add conversation transform hooks for dataset-specific message normalization and stable row IDs, then register the Hermes glm-5.1 and kimi trace splits for SFT. Adds regression fixtures/tests and a trace-focused pilot experiment.

This adds minimal postprocessing and row-ID hook seams for SFT conversation transforms so Hermes-style trace normalization can land without reshaping the existing adapter stack. It also preserves transformed dataset hash stability when the new hooks are unset and adds regression tests for hook ordering and signature behavior.
Normalize Hermes tool responses into Marin's chat format while preserving raw think/tool-call assistant turns. Register the glm-5.1 and kimi splits, add focused fixtures and regression tests, and add a trace-focused pilot SFT experiment.

🤖 Spec

Problem
Marin's conversation transform path only accepted raw adapter output, so datasets that need dataset-specific message cleanup or source-derived stable IDs could not be integrated without bespoke pipelines. Hermes traces need tool responses unwrapped into Marin's structured tool-message shape while preserving assistant <think> and <tool_call> content, and the dataset rows should keep source IDs.

Approach
Add optional message_postprocess_fn and row_id_fn hooks to TransformAdapter, with signature generation that omits unset hook fields so existing dataset output paths stay stable. Implement Hermes normalization in trace_normalization.py, register the glm-5.1 and kimi splits in instruction_datasets.py, and add a pilot SFT experiment plus focused fixtures and regression tests.
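The signature behavior described above can be illustrated with a minimal sketch. This is not the actual `TransformAdapter` code: `AdapterSketch`, its `source` and `role_key` fields, and `adapter_signature` are hypothetical stand-ins, and only the two hook names come from this PR. The idea is that a hook left unset contributes nothing to the hashed payload, so adapters that never use the hooks should hash exactly as they did before the fields existed.

```python
import hashlib
import json
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass(frozen=True)
class AdapterSketch:
    """Hypothetical stand-in for TransformAdapter (not the real class)."""

    source: str
    role_key: str = "role"  # invented field, for illustration only
    message_postprocess_fn: Optional[Callable] = None
    row_id_fn: Optional[Callable] = None


# Hook fields that must not affect the signature while unset.
OPTIONAL_HOOKS = ("message_postprocess_fn", "row_id_fn")


def adapter_signature(adapter: AdapterSketch) -> str:
    """Hash the adapter config, skipping hook fields that are None."""
    payload = {}
    for f in fields(adapter):
        value = getattr(adapter, f.name)
        if f.name in OPTIONAL_HOOKS:
            if value is None:
                continue  # unset hook: keep the pre-hook signature
            # Identify a set hook by its import path, not its repr.
            value = f"{value.__module__}.{value.__qualname__}"
        payload[f.name] = value
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Under these assumptions, an adapter with both hooks unset hashes the same payload it would have before the hook fields were added, while setting either hook changes the signature and therefore the derived output path.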

Key code

  • transform_row(...) now applies adapter postprocessing before replacements/tool normalization and prefers row_id_fn when present.
  • normalize_hermes_trace_messages(...) only rewrites well-formed <tool_response> tool turns; malformed payloads fall back to raw source content unchanged.
  • hermes_trace_row_id(...) preserves dataset IDs and falls back to a content hash only when the source row has no usable id.
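The three behaviors listed above can be sketched together. This is a simplified illustration under stated assumptions, not the code in `trace_normalization.py`: every function name ending in `_sketch`, the `messages` row key, the hook call signatures, and the `{"name": ..., "content": ...}` JSON payload inside `<tool_response>` are all assumptions made for the example.

```python
import hashlib
import json
import re

# Matches a tool turn whose entire content is one <tool_response> block.
TOOL_RESPONSE_RE = re.compile(
    r"^\s*<tool_response>\s*(.*?)\s*</tool_response>\s*$", re.DOTALL
)


def normalize_message_sketch(message: dict) -> dict:
    """Unwrap a well-formed tool response; return anything else unchanged."""
    if message.get("role") != "tool":
        return message  # assistant <think>/<tool_call> text stays raw
    content = message.get("content")
    match = TOOL_RESPONSE_RE.match(content) if isinstance(content, str) else None
    if match is None:
        return message
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return message  # malformed payload: fall back to the raw content
    if not isinstance(payload, dict) or "content" not in payload:
        return message
    # Assumed structured shape; the real Marin tool-message shape may differ.
    return {"role": "tool", "name": payload.get("name"), "content": payload["content"]}


def normalize_messages_sketch(messages: list) -> list:
    return [normalize_message_sketch(m) for m in messages]


def row_id_sketch(row: dict, messages: list) -> str:
    """Prefer the source id; hash the messages only when no id is usable."""
    source_id = row.get("id")
    if source_id is not None and str(source_id).strip():
        return str(source_id)
    blob = json.dumps(messages, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def transform_row_sketch(row, postprocess_fn=None, row_id_fn=None, replacements=None):
    """Apply the postprocess hook first, then replacements, then pick the id."""
    messages = [dict(m) for m in row["messages"]]
    if postprocess_fn is not None:
        messages = postprocess_fn(messages)
    for old, new in (replacements or {}).items():
        for m in messages:
            if isinstance(m.get("content"), str):
                m["content"] = m["content"].replace(old, new)
    # Without a hook, this sketch simply falls back to a content hash.
    id_fn = row_id_fn if row_id_fn is not None else (lambda r, msgs: row_id_sketch({}, msgs))
    return {"id": id_fn(row, messages), "messages": messages}
```

Calling `transform_row_sketch(row, postprocess_fn=normalize_messages_sketch, row_id_fn=row_id_sketch)` on a row with a source `id` keeps that id, leaves assistant turns untouched, and rewrites only the tool turns whose payload parses. Returning the original message on any parse failure keeps the fallback lossless, which matches the "raw source content unchanged" behavior described in the bullet.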

Tests
./infra/pre-commit.py --all-files --fix
UV_CACHE_DIR=/tmp/uv-cache-pr1 uv run --all-packages pytest tests/transform/test_agent_trace_conversation.py tests/transform/test_conversation.py tests/test_marin_chat_template.py -q
UV_CACHE_DIR=/tmp/uv-cache-pr1 uv run --all-packages python -c "import experiments.exp_hermes_trace_sft_pilot"
