[datakit] Add OpenResearcher rollout dataset pipeline#6272

Open
taivu1998 wants to merge 1 commit into
marin-community:mainfrom
taivu1998:tdv/openresearcher-deep-research-sft-20260607

@taivu1998

Contributor

Add OpenResearcher as a rollout-data dataset pipeline for Marin. The new Datakit transform downloads the pinned train parquet shards for seed_42 through seed_57, validates seed-specific paths, and renders each browser-backed trajectory into a single document.

The transform preserves original nested messages, compact browser tool schema metadata, source seed, answer diagnostics, browser call counts, and stable trajectory ids so repeated qids across seed configs do not collide. The rollout runner tokenizes the processed documents with Marin's tokenizer without adding a direct SFT registry shortcut or a new normalized source surface.

Add a seed-aware Datakit downloader and transcript-preserving transform for OpenResearcher trajectories. This keeps browser tool calls, observations, and answer diagnostics available for rollout-data tokenization.

@taivu1998

Contributor Author

🤖 Specification for the OpenResearcher rollout-data integration.

Problem: Marin does not have a source-backed way to ingest OpenResearcher/OpenResearcher-Dataset. The dataset is not a simple prompt/response SFT table: each row is a browser-backed deep research trajectory with nested messages, tool calls, tool observations, repeated qids across seed configs, and final-answer diagnostics. A direct instruction-dataset registration would flatten away provenance and collide repeated qids.

Approach: Add a Datakit download/transform module at lib/marin/src/marin/datakit/download/openresearcher.py that downloads the pinned HF revision, selects only seed_42 through seed_57 train parquet shards, validates seed-specific paths, renders each trajectory into a single text document, and keeps structured metadata for source seed, original messages, tool schema, browser call counts, answer match status, and stable ids. Add experiments/rollout_data/openresearcher.py as the tokenization runner so the processed documents can be launched through StepRunner without introducing a direct SFT registry shortcut or normalized source surface. Add focused tests in tests/datakit/download/test_openresearcher.py for seed validation, transcript rendering, repeated-qid identity, mismatch retention, invalid row drops, HF glob selection, and local parquet transform behavior.
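
The repeated-qid concern can be illustrated with a small sketch. The PR does not show how the stable trajectory ids are built, so `trajectory_id` and its `openresearcher/<seed>/<qid>` layout are hypothetical, as are the `seed` and `qid` field names in the sample rows; the only point is that folding the seed config into the id keeps identical qids from different seeds apart.

```python
def trajectory_id(seed_config: str, qid: object) -> str:
    """Hypothetical id scheme: namespace the qid by the seed config it came from."""
    return f"openresearcher/{seed_config}/{qid}"


# Two rows that share a qid but come from different seed configs (made-up data).
rows = [
    {"seed": "seed_42", "qid": 7},
    {"seed": "seed_43", "qid": 7},
]

# Keying on qid alone would merge the two rows; the seed-aware id does not.
ids = [trajectory_id(row["seed"], row["qid"]) for row in rows]
qid_only = {row["qid"] for row in rows}
```

Here `qid_only` collapses to one element while `ids` keeps two distinct entries, which is the property the spec asks for.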

Key code:

def row_to_doc(row: dict) -> list[dict]:
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        counters.increment("openresearcher/dropped/no_messages")
        return []
    if not all(isinstance(message, dict) for message in messages):
        counters.increment("openresearcher/dropped/malformed_message")
        return []

    question = string_or_empty(row.get("question")).strip()
    if not question:
        counters.increment("openresearcher/dropped/no_question")
        return []

    status = string_or_empty(row.get("status"))
    if status and status != "success":
        counters.increment("openresearcher/dropped/non_success_status")
        return []


def _validated_seed_configs(seed_configs: Sequence[str]) -> tuple[str, ...]:
    seed_configs = tuple(seed_configs)
    if not seed_configs:
        raise ValueError("seed_configs must contain at least one OpenResearcher seed config")

    unknown_seeds = sorted(set(seed_configs) - SEED_CONFIG_SET)
    if unknown_seeds:
        valid_range = f"{SEED_CONFIGS[0]} through {SEED_CONFIGS[-1]}"
        raise ValueError(f"Unknown OpenResearcher seed config(s): {unknown_seeds}. Expected {valid_range}.")

    return seed_configs

@taivu1998 marked this pull request as ready for review June 8, 2026 00:49

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 151bd04b3b

Comment on lines +88 to +90
for block in content:
    if isinstance(block, dict) and block.get("type") == "system_content":
        return block

P2 Badge Preserve untyped system content blocks

When processing the pinned OpenResearcher shards, the system message content block is not guaranteed to carry type == "system_content" (the HF viewer shows rows whose first content object starts with channel_config, conversation_start_date, and knowledge_cutoff directly). In that case this helper returns {}, so both compact_system_text() and _tool_schema() drop the system metadata and browser tool schema for every affected trajectory instead of preserving the compact schema metadata this pipeline advertises. Accepting the system block based on its expected keys, not only the synthetic type marker used in the test fixture, would keep those fields in production rows.
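
One way the suggestion could look, as a minimal sketch rather than the PR's actual helper: `find_system_block` and `SYSTEM_BLOCK_KEYS` are hypothetical names, and the key list is taken only from the three fields mentioned in this comment. Matching on any one of those keys is an assumption; a stricter variant could require all of them.

```python
from typing import Any

# Keys the review comment reports seeing directly on untyped system blocks.
SYSTEM_BLOCK_KEYS = frozenset({"channel_config", "conversation_start_date", "knowledge_cutoff"})


def find_system_block(content: Any) -> dict:
    """Return the first block that looks like system content, else an empty dict."""
    if not isinstance(content, list):
        return {}
    for block in content:
        if not isinstance(block, dict):
            continue
        # Accept the explicit type marker used in the test fixture...
        if block.get("type") == "system_content":
            return block
        # ...or an untyped block that carries at least one expected system key.
        if not SYSTEM_BLOCK_KEYS.isdisjoint(block):
            return block
    return {}
```

With this shape, a fixture-style block tagged `system_content` and a production-style block that starts with `channel_config` are both returned, while unrelated blocks still yield `{}`.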

@Helw150

Helw150 commented Jun 10, 2026

Member

cc: @penfever
