[datakit] Add locuslab Safety Pretraining datasets#5743

Merged
Helw150 merged 2 commits into main from agent/20260514-fix-5742 on May 15, 2026

@claude claude Bot commented May 14, 2026

Register canonical download recipes for the four ODC-BY datasets in locuslab/safety-pretraining-datasets: moral_education, safeweb, refuseweb, and fineweb_annotated. Each family pins a revision and exposes its top-level score-bucket directories as subset paths. Adds parametrized unit tests covering download identity and per-subset normalize-output distinctness.
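As a rough illustration of the shape described above, the sketch below models one recipe per family, each with a pinned revision and a set of subset paths, plus the kind of distinctness check the unit tests are said to cover. It is not the datakit API: `DownloadRecipe`, `normalize_output_path`, the `<pinned-revision>` placeholder, and the bucket names are all hypothetical, since the PR text does not state the real identifiers, revisions, or directory names.

```python
from __future__ import annotations

from dataclasses import dataclass

# Repository and family names are taken from the PR description.
HF_REPO = "locuslab/safety-pretraining-datasets"
FAMILIES = ("moral_education", "safeweb", "refuseweb", "fineweb_annotated")


@dataclass(frozen=True)
class DownloadRecipe:
    """Hypothetical stand-in for a datakit download recipe."""

    family: str
    revision: str  # pinned revision; placeholder value in this sketch
    subsets: tuple[str, ...]  # top-level score-bucket directories (placeholders)

    def subset_paths(self) -> dict[str, str]:
        # Map each subset name to its directory inside the dataset repo.
        return {subset: f"{self.family}/{subset}" for subset in self.subsets}


def normalize_output_path(recipe: DownloadRecipe, subset: str) -> str:
    """Hypothetical output location for one normalized subset."""
    return f"normalized/{recipe.family}/{recipe.revision}/{subset}"


# One recipe per family. Revision and bucket names are illustrative only.
RECIPES = {
    family: DownloadRecipe(family, "<pinned-revision>", ("bucket_a", "bucket_b"))
    for family in FAMILIES
}

# Distinctness check: no two (family, subset) pairs share a normalize output.
outputs = [
    normalize_output_path(recipe, subset)
    for recipe in RECIPES.values()
    for subset in recipe.subsets
]
assert len(outputs) == len(set(outputs))
```

In a real test suite, a check like the final assertion would typically be expressed per family with `pytest.mark.parametrize`; the plain assertion is used here only to keep the sketch free of dependencies.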

Fixes #5742

Register canonical download recipes for the four ODC-BY datasets in
locuslab/safety-pretraining-datasets: moral_education, safeweb,
refuseweb, and fineweb_annotated. Each family pins a revision and
exposes top-level score-bucket directories as subset paths.

Fixes #5742
@claude claude Bot added the agent-generated (Created by automation/agent) label May 14, 2026
Wires the locuslab Safety Pretraining canonical chains into
all_sources() (moral_education + safeweb + refuseweb subsets;
fineweb_annotated excluded as it overlaps with the FineWeb corpus
already in the catalog). Token counts come from tokenizing each
subset with marin-community/marin-tokenizer via the new
scripts/datakit/tokenize_safety_pt.py driver.
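
A minimal sketch of the two ideas in this commit message, under stated assumptions: which families feed the catalog, and how a per-subset token count can be computed. `all_sources_sketch`, `count_tokens`, and the bucket names are hypothetical and do not reproduce the real `all_sources()` or `tokenize_safety_pt.py`; the tokenizer is passed in as a plain callable so the example runs without downloading anything.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# Families included in and excluded from the catalog, per the commit message.
INCLUDED_FAMILIES = ("moral_education", "safeweb", "refuseweb")
EXCLUDED_FAMILIES = ("fineweb_annotated",)  # overlaps with FineWeb already in the catalog


def all_sources_sketch(subsets_by_family: Dict[str, Sequence[str]]) -> List[str]:
    """Return 'family/subset' names for included families only (hypothetical helper)."""
    return [
        f"{family}/{subset}"
        for family in INCLUDED_FAMILIES
        for subset in subsets_by_family.get(family, ())
    ]


def count_tokens(docs: Iterable[str], tokenize: Callable[[str], Sequence]) -> int:
    """Sum the token counts of all documents under the given tokenizer callable."""
    return sum(len(tokenize(doc)) for doc in docs)


# Illustrative input: bucket names are placeholders, not the real directories.
example_subsets = {
    "moral_education": ["bucket_a"],
    "safeweb": ["bucket_a", "bucket_b"],
    "refuseweb": ["bucket_a"],
    "fineweb_annotated": ["bucket_a"],
}
sources = all_sources_sketch(example_subsets)

# Whitespace splitting stands in for the real tokenizer in this sketch.
toy_count = count_tokens(["a b c", "d e"], str.split)
```

For real counts, the `tokenize` argument would wrap the marin-community/marin-tokenizer; loading it through the `transformers` `AutoTokenizer.from_pretrained` call is an assumption here, not something the commit message states.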

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Helw150 Helw150 merged commit 3a8b6b9 into main May 15, 2026
35 checks passed
@Helw150 Helw150 deleted the agent/20260514-fix-5742 branch May 15, 2026 19:46

Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

Add Safety Pretraining Datasets to Datakit
