[datakit] Add OpenResearcher rollout dataset pipeline by taivu1998 · Pull Request #6272 · marin-community/marin

taivu1998 · 2026-06-07T22:59:59Z

Add OpenResearcher as a rollout-data dataset pipeline for Marin. The new Datakit transform downloads the pinned train parquet shards for seed_42 through seed_57, validates seed-specific paths, and renders each browser-backed trajectory into a single document.
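A minimal sketch of seed-scoped shard selection and path validation, to illustrate the idea rather than the PR's implementation. The `<seed>/train-*.parquet` layout and the helper names `seed_glob`, `seed_from_path`, and `select_shards` are assumptions for illustration; only the seed_42 through seed_57 range comes from the description.

```python
import fnmatch
import re

# Seed range taken from the PR description: seed_42 through seed_57.
SEED_CONFIGS = tuple(f"seed_{n}" for n in range(42, 58))

# Matches a "seed_<n>/" path component anywhere in the path.
_SEED_RE = re.compile(r"(?:^|/)(seed_\d+)/")


def seed_glob(seed: str) -> str:
    """Glob for one seed's train shards (hypothetical layout, may differ from the real repo)."""
    return f"{seed}/train-*.parquet"


def seed_from_path(path: str) -> str:
    """Return the seed component of a shard path, rejecting unknown seeds."""
    match = _SEED_RE.search(path)
    if match is None or match.group(1) not in SEED_CONFIGS:
        raise ValueError(f"Path is not under a known seed config: {path}")
    return match.group(1)


def select_shards(paths, seeds=SEED_CONFIGS):
    """Keep only paths that match a train-shard glob for one of the requested seeds."""
    patterns = [seed_glob(seed) for seed in seeds]
    return [p for p in paths if any(fnmatch.fnmatch(p, pat) for pat in patterns)]
```

Validating the seed from the path itself, instead of trusting the glob alone, gives a second check that a shard really belongs to the seed it is attributed to.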

The transform preserves original nested messages, compact browser tool schema metadata, source seed, answer diagnostics, browser call counts, and stable trajectory ids so repeated qids across seed configs do not collide. The rollout runner tokenizes the processed documents with Marin's tokenizer without adding a direct SFT registry shortcut or a new normalized source surface.
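One way to picture the collision problem: if the same qid appears under several seed configs, an id built from the qid alone maps different trajectories to the same key. The sketch below folds the source seed into the id. The function name `stable_trajectory_id` and the id format are hypothetical, not taken from the PR.

```python
import hashlib


def stable_trajectory_id(source_seed: str, qid: str) -> str:
    """Derive a deterministic id from (seed, qid) so repeated qids stay distinct per seed."""
    # The unit separator keeps ("a", "bc") and ("ab", "c") from hashing to the same key.
    key = f"{source_seed}\x1f{qid}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"openresearcher-{source_seed}-{digest}"
```

Because the id is a pure function of its inputs, rerunning the transform over the same shards reproduces the same ids.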

Add a seed-aware Datakit downloader and transcript-preserving transform for OpenResearcher trajectories. This keeps browser tool calls, observations, and answer diagnostics available for rollout-data tokenization.

taivu1998 · 2026-06-07T23:01:14Z

🤖 Specification for the OpenResearcher rollout-data integration.

Problem: Marin does not have a source-backed way to ingest OpenResearcher/OpenResearcher-Dataset. The dataset is not a simple prompt/response SFT table: each row is a browser-backed deep research trajectory with nested messages, tool calls, tool observations, repeated qids across seed configs, and final-answer diagnostics. A direct instruction-dataset registration would flatten away provenance and collide repeated qids.

Approach: Add a Datakit download/transform module at lib/marin/src/marin/datakit/download/openresearcher.py that downloads the pinned HF revision, selects only seed_42 through seed_57 train parquet shards, validates seed-specific paths, renders each trajectory into a single text document, and keeps structured metadata for source seed, original messages, tool schema, browser call counts, answer match status, and stable ids. Add experiments/rollout_data/openresearcher.py as the tokenization runner so the processed documents can be launched through StepRunner without introducing a direct SFT registry shortcut or normalized source surface. Add focused tests in tests/datakit/download/test_openresearcher.py for seed validation, transcript rendering, repeated-qid identity, mismatch retention, invalid row drops, HF glob selection, and local parquet transform behavior.
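As a rough illustration of "renders each trajectory into a single text document", the sketch below walks a message list, emits one labelled section per message or tool call, and counts browser calls along the way. The message shape (`role`, string `content`, optional `tool_calls` with `name` and `arguments`) and the `browser.` name prefix are assumptions; the real OpenResearcher rows are more deeply nested.

```python
def render_transcript(messages: list[dict]) -> tuple[str, int]:
    """Render messages into one text document and count browser tool calls."""
    parts: list[str] = []
    browser_calls = 0
    for message in messages:
        role = str(message.get("role") or "unknown")
        content = message.get("content")
        # Only plain-string content is rendered in this sketch.
        if isinstance(content, str) and content.strip():
            parts.append(f"<{role}>\n{content.strip()}")
        # Hypothetical tool-call field; skipped when absent or empty.
        for call in message.get("tool_calls") or []:
            if not isinstance(call, dict):
                continue
            name = str(call.get("name") or "")
            if name.startswith("browser."):
                browser_calls += 1
            parts.append(f"<{role} tool_call={name}>\n{call.get('arguments', '')}")
    return "\n\n".join(parts), browser_calls
```

Keeping the call count alongside the rendered text means the metadata can be checked against the transcript later without re-parsing it.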

Key code:

def row_to_doc(row: dict) -> list[dict]:
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        counters.increment("openresearcher/dropped/no_messages")
        return []
    if not all(isinstance(message, dict) for message in messages):
        counters.increment("openresearcher/dropped/malformed_message")
        return []

    question = string_or_empty(row.get("question")).strip()
    if not question:
        counters.increment("openresearcher/dropped/no_question")
        return []

    status = string_or_empty(row.get("status"))
    if status and status != "success":
        counters.increment("openresearcher/dropped/non_success_status")
        return []

def _validated_seed_configs(seed_configs: Sequence[str]) -> tuple[str, ...]:
    seed_configs = tuple(seed_configs)
    if not seed_configs:
        raise ValueError("seed_configs must contain at least one OpenResearcher seed config")

    unknown_seeds = sorted(set(seed_configs) - SEED_CONFIG_SET)
    if unknown_seeds:
        valid_range = f"{SEED_CONFIGS[0]} through {SEED_CONFIGS[-1]}"
        raise ValueError(f"Unknown OpenResearcher seed config(s): {unknown_seeds}. Expected {valid_range}.")

    return seed_configs

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 151bd04b3b


chatgpt-codex-connector · 2026-06-08T00:52:21Z

+    for block in content:
+        if isinstance(block, dict) and block.get("type") == "system_content":
+            return block


Preserve untyped system content blocks

When processing the pinned OpenResearcher shards, the system message content block is not guaranteed to carry type == "system_content" (the HF viewer shows rows whose first content object starts with channel_config, conversation_start_date, and knowledge_cutoff directly). In that case this helper returns {}, so both compact_system_text() and _tool_schema() drop the system metadata and browser tool schema for every affected trajectory instead of preserving the compact schema metadata this pipeline advertises. Accepting the system block based on its expected keys, not only the synthetic type marker used in the test fixture, would keep those fields in production rows.
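The suggestion could be sketched roughly as follows: keep the `type == "system_content"` check, and fall back to recognizing the block by the keys the review names. The helper name `find_system_block`, the constant `SYSTEM_BLOCK_KEYS`, and the "any one key is enough" rule are assumptions, not the PR's code.

```python
# Keys the review reports seeing on untyped system blocks in production rows.
SYSTEM_BLOCK_KEYS = frozenset({"channel_config", "conversation_start_date", "knowledge_cutoff"})


def find_system_block(content) -> dict:
    """Return the system content block, matched by type marker or by expected keys."""
    if not isinstance(content, list):
        return {}
    for block in content:
        if not isinstance(block, dict):
            continue
        # Typed block, as used in the test fixture.
        if block.get("type") == "system_content":
            return block
        # Untyped block: accept it if it carries any expected system key (assumed rule).
        if SYSTEM_BLOCK_KEYS.intersection(block):
            return block
    return {}
```

A stricter variant could require all three keys; which rule is right depends on how consistent the pinned shards are.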


Helw150 · 2026-06-10T16:32:38Z

cc: @penfever

[datakit] Add OpenResearcher rollout dataset

151bd04


taivu1998 marked this pull request as ready for review June 8, 2026 00:49

chatgpt-codex-connector Bot reviewed Jun 8, 2026

taivu1998 wants to merge 1 commit into marin-community:main from taivu1998:tdv/openresearcher-deep-research-sft-20260607
