[datakit] Add OpenResearcher rollout dataset pipeline #6272
Add a seed-aware Datakit downloader and transcript-preserving transform for OpenResearcher trajectories. This keeps browser tool calls, observations, and answer diagnostics available for rollout-data tokenization.
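As a rough illustration of what "transcript-preserving" rendering could look like, here is a minimal sketch. It assumes each message is a dict with a `role` and a `content` that is either a string or a list of blocks carrying a `text` field; that schema, the `[role]` section labels, and the helper names `render_content` and `render_transcript` are all hypothetical and are not taken from the PR's code.

```python
from typing import Any


def render_content(content: Any) -> str:
    """Flatten one message's content into plain text (assumed schema)."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and isinstance(block.get("text"), str):
                parts.append(block["text"])
        return "\n".join(parts)
    # Anything else (None, numbers, dicts) contributes no text in this sketch.
    return ""


def render_transcript(messages: list[dict]) -> str:
    """Join messages into one document, one role-labelled section per message."""
    sections = []
    for message in messages:
        role = str(message.get("role", "unknown"))
        text = render_content(message.get("content")).strip()
        if text:  # skip messages that render to nothing
            sections.append(f"[{role}]\n{text}")
    return "\n\n".join(sections)
```

Keeping the role label on every section means tool observations stay distinguishable from assistant turns after the trajectory is collapsed into a single string.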
🤖 Specification for the OpenResearcher rollout-data integration.

Problem: Marin does not have a source-backed way to ingest OpenResearcher/OpenResearcher-Dataset. The dataset is not a simple prompt/response SFT table: each row is a browser-backed deep research trajectory with nested messages, tool calls, tool observations, repeated qids across seed configs, and final-answer diagnostics. A direct instruction-dataset registration would flatten away provenance and collide repeated qids.

Approach: Add a Datakit download/transform module at lib/marin/src/marin/datakit/download/openresearcher.py that downloads the pinned HF revision, selects only seed_42 through seed_57 train parquet shards, validates seed-specific paths, renders each trajectory into a single text document, and keeps structured metadata for source seed, original messages, tool schema, browser call counts, answer match status, and stable ids. Add experiments/rollout_data/openresearcher.py as the tokenization runner so the processed documents can be launched through StepRunner without introducing a direct SFT registry shortcut or normalized source surface. Add focused tests in tests/datakit/download/test_openresearcher.py for seed validation, transcript rendering, repeated-qid identity, mismatch retention, invalid row drops, HF glob selection, and local parquet transform behavior.

Key code:

    def row_to_doc(row: dict) -> list[dict]:
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            counters.increment("openresearcher/dropped/no_messages")
            return []
        if not all(isinstance(message, dict) for message in messages):
            counters.increment("openresearcher/dropped/malformed_message")
            return []
        question = string_or_empty(row.get("question")).strip()
        if not question:
            counters.increment("openresearcher/dropped/no_question")
            return []
        status = string_or_empty(row.get("status"))
        if status and status != "success":
            counters.increment("openresearcher/dropped/non_success_status")
            return []

    def _validated_seed_configs(seed_configs: Sequence[str]) -> tuple[str, ...]:
        seed_configs = tuple(seed_configs)
        if not seed_configs:
            raise ValueError("seed_configs must contain at least one OpenResearcher seed config")
        unknown_seeds = sorted(set(seed_configs) - SEED_CONFIG_SET)
        if unknown_seeds:
            valid_range = f"{SEED_CONFIGS[0]} through {SEED_CONFIGS[-1]}"
            raise ValueError(f"Unknown OpenResearcher seed config(s): {unknown_seeds}. Expected {valid_range}.")
        return seed_configs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 151bd04b3b
    for block in content:
        if isinstance(block, dict) and block.get("type") == "system_content":
            return block
Preserve untyped system content blocks
When processing the pinned OpenResearcher shards, the system message content block is not guaranteed to carry type == "system_content" (the HF viewer shows rows whose first content object starts with channel_config, conversation_start_date, and knowledge_cutoff directly). In that case this helper returns {}, so both compact_system_text() and _tool_schema() drop the system metadata and browser tool schema for every affected trajectory instead of preserving the compact schema metadata this pipeline advertises. Accepting the system block based on its expected keys, not only the synthetic type marker used in the test fixture, would keep those fields in production rows.
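One possible shape for that fix is sketched below. It is not the PR's code: the helper name `find_system_block` and the constant `EXPECTED_SYSTEM_KEYS` are hypothetical, and the key list is taken only from the three fields named in this comment. Whether a block should match on any of those keys or on all of them is a judgment call that depends on the real shards.

```python
from typing import Any

# Keys this review comment reports at the top of untyped system blocks.
EXPECTED_SYSTEM_KEYS = frozenset(
    {"channel_config", "conversation_start_date", "knowledge_cutoff"}
)


def find_system_block(content: Any) -> dict:
    """Return the system content block, or {} if none is found (illustrative)."""
    if not isinstance(content, list):
        return {}
    for block in content:
        if not isinstance(block, dict):
            continue
        # Preferred signal: the explicit type marker, when present.
        if block.get("type") == "system_content":
            return block
        # Fallback: accept an untyped block that carries an expected key.
        if any(key in block for key in EXPECTED_SYSTEM_KEYS):
            return block
    return {}
```

Checking the type marker first keeps the existing fixture behavior intact, and the key-based fallback only widens what is accepted.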
cc: @penfever |
Add OpenResearcher as a rollout-data dataset pipeline for Marin. The new Datakit transform downloads the pinned train parquet shards for seed_42 through seed_57, validates seed-specific paths, and renders each browser-backed trajectory into a single document.
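For illustration, seed-aware shard selection and path validation might look roughly like the sketch below. The `<seed_config>/train-*.parquet` layout is an assumption (the actual repository layout is not shown here), and `seed_globs` and `seed_for_path` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Assumption: shards live at "<seed_config>/train-*.parquet" in the dataset repo.
SEED_CONFIGS = tuple(f"seed_{i}" for i in range(42, 58))


def seed_globs(seed_configs: tuple[str, ...] = SEED_CONFIGS) -> list[str]:
    """Build one glob pattern per requested seed config."""
    return [f"{seed}/train-*.parquet" for seed in seed_configs]


def seed_for_path(path: str, seed_configs: tuple[str, ...] = SEED_CONFIGS) -> str:
    """Return the seed config a shard path belongs to, or raise if none matches."""
    for seed in seed_configs:
        # The literal "/" after the seed name keeps "seed_42" from matching "seed_420/...".
        if fnmatchcase(path, f"{seed}/train-*.parquet"):
            return seed
    raise ValueError(f"Path does not belong to a selected seed config: {path}")
```

Recovering the seed from the path at transform time lets each document carry its source seed, even though that information lives in the file location, not in the row itself.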
The transform preserves original nested messages, compact browser tool schema metadata, source seed, answer diagnostics, browser call counts, and stable trajectory ids so repeated qids across seed configs do not collide. The rollout runner tokenizes the processed documents with Marin's tokenizer without adding a direct SFT registry shortcut or a new normalized source surface.
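A minimal sketch of how a collision-free id could be derived from the seed config and qid is shown below. The id format, the separator, and the `trajectory_id` helper are hypothetical; the PR's actual scheme is not shown in this description.

```python
import hashlib


def trajectory_id(source_seed: str, qid: str) -> str:
    """Derive a deterministic id that differs per (seed config, qid) pair."""
    # A separator that should not occur in either field keeps ("a", "bc")
    # and ("ab", "c") from hashing to the same key.
    key = f"{source_seed}\x1f{qid}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:16]
    return f"openresearcher-{source_seed}-{digest}"
```

Because the id depends only on the inputs, re-running the transform yields the same ids, and the same qid under two seed configs maps to two distinct documents.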