[datakit] Add locuslab Safety Pretraining datasets by claude[bot] · Pull Request #5743 · marin-community/marin

claude · 2026-05-14T17:55:47Z

Register canonical download recipes for the four ODC-BY datasets in locuslab/safety-pretraining-datasets: moral_education, safeweb, refuseweb, and fineweb_annotated. Each family pins a revision and exposes its top-level score-bucket directories as subset paths. Adds parametrized unit tests covering download identity and per-subset normalize-output distinctness.

Fixes #5742
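For readers unfamiliar with the recipe layout described above, here is a minimal, self-contained sketch of the idea: one recipe per family, a pinned revision, and one normalize output per subset. It is not the datakit API. `DownloadRecipe`, `download_id`, `normalize_output`, the `<pinned-revision>` placeholder, and the `bucket_a`/`bucket_b` subset names are all hypothetical stand-ins; the real revisions and score-bucket directory names live in the PR diff.

```python
from dataclasses import dataclass

# Hypothetical sketch only: none of these names are taken from datakit.
HF_REPO = "locuslab/safety-pretraining-datasets"
PINNED_REVISION = "<pinned-revision>"      # placeholder, not a real commit hash
PLACEHOLDER_SUBSETS = ("bucket_a", "bucket_b")  # stand-ins for score-bucket dirs


@dataclass(frozen=True)
class DownloadRecipe:
    family: str
    revision: str
    subsets: tuple

    def download_id(self) -> str:
        # One download per family; every subset of the family shares it.
        return f"{HF_REPO}@{self.revision}:{self.family}"

    def normalize_output(self, subset: str) -> str:
        # Each subset normalizes into its own output path.
        if subset not in self.subsets:
            raise KeyError(f"unknown subset {subset!r} for {self.family}")
        return f"normalized/{self.family}/{subset}"


RECIPES = {
    family: DownloadRecipe(family, PINNED_REVISION, PLACEHOLDER_SUBSETS)
    for family in ("moral_education", "safeweb", "refuseweb", "fineweb_annotated")
}
```

The two properties the PR's parametrized tests are described as covering map onto this sketch roughly as follows: the download identity is a function of the family and its pinned revision only, while `normalize_output` must return a different path for every subset.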


Wires the locuslab Safety Pretraining canonical chains into all_sources() (moral_education + safeweb + refuseweb subsets; fineweb_annotated excluded as it overlaps with the FineWeb corpus already in the catalog). Token counts come from tokenizing each subset with marin-community/marin-tokenizer via the new scripts/datakit/tokenize_safety_pt.py driver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
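The catalog wiring in the second commit can be pictured with the sketch below, which enumerates per-subset sources while skipping the excluded family. This is an illustration under stated assumptions, not the repository's code: the real `all_sources()` signature and return type are not shown on this page, and `FAMILY_SUBSETS`, `EXCLUDED_FAMILIES`, and the `bucket_a`/`bucket_b` names are hypothetical. Token counts are left out because the page does not report them.

```python
# Hypothetical sketch only: the real all_sources() in marin may look different.
FAMILY_SUBSETS = {
    "moral_education": ("bucket_a", "bucket_b"),    # placeholder subset names
    "safeweb": ("bucket_a", "bucket_b"),
    "refuseweb": ("bucket_a", "bucket_b"),
    "fineweb_annotated": ("bucket_a", "bucket_b"),
}

# Excluded because it overlaps with the FineWeb corpus already in the catalog.
EXCLUDED_FAMILIES = frozenset({"fineweb_annotated"})


def all_sources() -> list:
    """Return one 'family/subset' key per subset of every included family."""
    return [
        f"{family}/{subset}"
        for family, subsets in FAMILY_SUBSETS.items()
        if family not in EXCLUDED_FAMILIES
        for subset in subsets
    ]
```

Keeping the exclusion as an explicit set, rather than deleting the family from the registry, leaves the download recipe available while keeping the overlapping data out of the catalog.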

claude Bot added the agent-generated label (Created by automation/agent) May 14, 2026

claude Bot mentioned this pull request May 14, 2026

Add Safety Pretraining Datasets to Datakit #5742

Closed

Helw150 approved these changes May 15, 2026

Helw150 merged commit 3a8b6b9 into main May 15, 2026
35 checks passed

Helw150 deleted the agent/20260514-fix-5742 branch May 15, 2026 19:46

Helw150 merged 2 commits into main from agent/20260514-fix-5742